Statistics with JCHEM DBs and descriptors for QSPR/QSAR

User 677b9c22ff

02-12-2006 03:58:40

Hi,


I attached a PPT presentation which deals with some statistics, univariate methods, multivariate methods, QSPR and QSAR, and how to connect JChem DBs and calculated descriptors to other statistics packages like YALE, WEKA, R, MATLAB, Statistica, SAS etc. The reason is that ChemAxon added principal component analysis (PCA) and partial least squares (PLS) to JChem.


I included some examples and show some graphics.





Maybe some people would also like to join the discussion about what kind of statistics tools they use together with JChem, or what kind of statistics (regression, classification, graphics) they would like to see in JChem in the future?





Tobias

ChemAxon efa1591b5a

04-12-2006 11:50:09

Hi Tobias,








a,





PCA/PLS is part of JChem by mistake :-) PCA was developed ages ago and it was never tested or used. It will be removed from JChem for good in the next release. Consequently, we are not going to develop visualization tools for it either.





We suggest using standard statistics packages. We second your great suggestion and encourage other ChemAxon users to share their thoughts and experience with the community.





ChemAxon, as a cheminformatics software company, is committed to developing cheminformatics and molecular modelling tools. Instead of trying to develop custom statistics tools, we aim to enable easy interfacing between our tools and the most common statistics packages.


If you, or other users, experience difficulties in pipelining ChemAxon to Statistica or other tools, we are ready to implement the required output formats.








b,





The other remark that I would like to make concerns the descriptors you used in your interesting analysis. Our chemical fingerprint is a hashed binary fingerprint, thus each individual bit of the fingerprint can be regarded as a descriptor (though due to the hashed nature of the fingerprint, one bit may encode or describe several independent properties).





The particular representation used in the JChemBase table, namely that 16 integer columns of 32 bits each represent the 512 bits of the chemical fingerprint, is accidental (that is, not essential; it just happened to be like that for technical, not theoretical, reasons). Essentially, we have 512 descriptors (or variables) here.





To make my point clear: these 16 integer values can be regarded as molecular descriptors only with caution and within limits. Their use as variables in dimension reduction or other multivariate statistical analysis should be avoided, as that can lead to distorted results.


Instead, in my opinion, the 512 individual bits should be used as descriptors, or variables.
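
As an illustration of this point, here is a minimal sketch of how the 16 integer columns could be expanded into 512 individual bit descriptors before the data is handed to a statistics package. The function name and the sample row are hypothetical, and the bit order inside each integer (least significant bit first) is an assumption; the actual JChemBase layout should be checked before relying on bit positions.

```python
from typing import List, Sequence

BITS_PER_INT = 32  # each fingerprint column holds one 32-bit integer


def unpack_fingerprint(columns: Sequence[int]) -> List[int]:
    """Expand integer fingerprint columns into a flat list of 0/1 bits."""
    bits: List[int] = []
    for value in columns:
        # Database integers are usually signed; mask to get the raw 32-bit pattern.
        raw = value & 0xFFFFFFFF
        # Assumption: bit 0 is the least significant bit of each column.
        bits.extend((raw >> i) & 1 for i in range(BITS_PER_INT))
    return bits


# Hypothetical row: 16 integer columns as they might be read from the table.
row = [5, -1] + [0] * 14
descriptors = unpack_fingerprint(row)
print(len(descriptors), sum(descriptors))
```

The resulting 0/1 values can be written out as 512 separate variables. This avoids the distortion of the integer form, where flipping the highest bit of a column changes its value by 2^31 while flipping the lowest bit changes it by 1, although both are a single-bit difference in the fingerprint.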





Kind regards,


Miklos

User 677b9c22ff

05-12-2006 03:48:02

mvargyas wrote:



ChemAxon, as a cheminformatics software company, is committed to developing cheminformatics and molecular modelling tools. Instead of trying to develop custom statistics tools, we aim to enable easy interfacing between our tools and the most common statistics packages.


If you, or other users, experience difficulties in pipelining ChemAxon to Statistica or other tools, we are ready to implement the required output formats.


Hi Miklos,


OK, I was assuming that PLS and other statistical tests would be included in the new JChem. That would be fine, because for Instant-JChem it would absolutely make sense to have some simple regression functions, some matrix plots, or even a PLS or PCA (if it's faster). But it must have some graphical output. The same goes for clustering: a very simple heatmap or a simple cluster graphic like in libmcs is always convenient.

mvargyas wrote:



The other remark that I would like to make concerns the descriptors you used in your interesting analysis. Our chemical fingerprint is a hashed binary fingerprint, thus each individual bit of the fingerprint can be regarded as a descriptor (though due to the hashed nature of the fingerprint, one bit may encode or describe several independent properties).





The particular representation used in the JChemBase table, namely that 16 integer columns of 32 bits each represent the 512 bits of the chemical fingerprint, is accidental (that is, not essential; it just happened to be like that for technical, not theoretical, reasons). Essentially, we have 512 descriptors (or variables) here.


Yes, I think for PCA or PLS it's just fine as long as we use it as a visualization technique and don't deploy learning models from it. And the PCA itself doesn't care: the scree plot looks fine, and so do the scores and loadings. For building a regression or property model I would not use this 16-column fingerprint descriptor.





In fact I used a 1024-bit fingerprint matrix (at the bit level) with a size of 1024 x 10,000 for the other examples (the tree model). But the logP and MTP models are nothing we actually work on (there are better models out there); it was just some data and models I had available to show the data exchange technique between JChem and Statistica.


But I am still curious how other people do it, or what they use.





The speed is also OK: generatemd can handle 5-6 million molecules per hour (it generates 1800-2500 fingerprints/second, depending on the molecule size), so this would be about 100 million compounds per day, or 10 billion per day on a 100-node cluster. The file size is 100 MB per 100k structures, which is 100 GB per 100 million structures, which can also be screened on the fly.
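
As a rough check of these figures, the arithmetic can be sketched as below. The rates are the ones quoted above; the helper names are made up for illustration, and MB/GB are assumed to be decimal units. Note that 1800-2500 fingerprints/second works out to about 6.5-9 million per hour, so the 5-6 million per hour figure is on the conservative side.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def per_hour(per_second: int) -> int:
    """Convert a per-second rate into a per-hour rate."""
    return per_second * SECONDS_PER_HOUR


def per_day_from_hourly(per_hour_rate: int) -> int:
    """Convert a per-hour rate into a per-day rate."""
    return per_hour_rate * HOURS_PER_DAY


def storage_gb(n_structures: int, bytes_per_structure: int) -> float:
    """Storage in decimal gigabytes for a given number of structures."""
    return n_structures * bytes_per_structure / 1_000_000_000


# 100 MB per 100k structures gives the size of one fingerprint record.
bytes_per_structure = 100 * 1_000_000 // 100_000

print(per_hour(1800), per_hour(2500))              # fingerprints per hour
print(per_day_from_hourly(5_000_000),
      per_day_from_hourly(6_000_000))              # molecules per day, one node
print(storage_gb(100_000_000, bytes_per_structure))  # GB for 100 million structures
```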





But we don't do pharmacophore screening, so we just use "normal" chemical fingerprints. And we would like to have chemical functions or classes (lists of patterns) already assigned, like in the PubChem fingerprint, which uses elements, SMARTS and nearest neighbors.





Kind regards


Tobias

ChemAxon efa1591b5a

05-12-2006 08:34:36

Hi Tobias,





Thank you for the clarification.





I believe heat maps should not be difficult to implement...


Tim, any wise thoughts about what kind of statistics and related visualisation tools/solutions will be available in IJC?





Miklos

ChemAxon efa1591b5a

06-12-2006 11:37:03

Quote:
And we like to have chemical functions or classes (list of patterns) already assigned, like in the PubChem fingerprint which uses elements, SMARTS and nearest neighbors.

You may find the developers' guide of the Screen package relevant: http://www.chemaxon.com/jchem/doc/guide/screen/index.html#custom


This describes how to implement a custom fingerprint (namely a small fraction of the MACCS keys). Using this guide, a CXN user developed his own fingerprint and made it available to the community at http://www.chemaxon.com/forum/ftopic352.html.





I believe it is straightforward to develop your own keys with the help of this guide and the contributed Java code.





Miklos

ChemAxon fa971619eb

06-12-2006 14:23:31

We haven't made any firm decisions on what we are going to do in IJC and how we are going to do it, as these developments won't be done for a little while.





But as an indicator of our plans, here is a little information.





We do plan to provide some basic charting and visualisation support, e.g. scatter plots, histograms of distributions etc. This will initially allow viewing and analysis of IJC data, but these tools can also be put to other uses.





Following this we would also like to add some basic support for statistics and data reduction techniques. Things like PCA and PLS (as well as more basic statistics functions) will be useful, as well as clustering and decision trees.





In all cases we will almost certainly be implementing this using 3rd party libraries. We will not be trying to provide the ultimate in statistics or data mining functionality, but rather to allow chemists/biologists to do some common useful things (following the 80/20 rule), whilst still allowing more advanced analysis to be done in external tools.





Hope this gives you an idea of our plans.





Tim