I have attached a PPT presentation that covers some statistics, univariate methods, multivariate methods, QSPR and QSAR, and how to connect
JChem DBs and calculated descriptors to other statistics packages such as YALE, WEKA, R, MATLAB, Statistica, SAS, etc. The reason is that ChemAxon has added principal component analysis (PCA) and partial least squares (PLS) to JChem.
I have included some examples and some graphics.
Maybe some people would also like to join the discussion about which statistics tools they use together with JChem, or what kind of statistics (regression, classification, graphics) they would like to see in JChem in the future?
PCA/PLS is part of JChem by mistake :-) PCA was developed ages ago and was never tested or used. It will be removed from JChem for good in the next release. Consequently, we are not going to develop visualization tools for it either.
We suggest using standard statistics packages. We second your great suggestion and encourage other ChemAxon users to share their thoughts and experience with the community.
ChemAxon, as a cheminformatics software company, is committed to developing cheminformatics and molecular modelling tools. Instead of trying to develop custom statistics tools, we aim to make it easy to interface our tools with the most common statistics packages.
If you, or other users, experience difficulties in pipelining ChemAxon output into Statistica or other tools, we are ready to implement the required output formats.
The other remark I would like to make concerns the descriptors you used in your interesting analysis. Our chemical fingerprint is a hashed binary fingerprint, so each individual bit of the fingerprint can be regarded as a descriptor (though, due to the hashed nature of the fingerprint, one bit may encode or describe several independent properties).
The particular representation used in the JChemBase table, namely that 16 integer columns, each containing 32 bits, are used to represent the 512 bits of the chemical fingerprint, is accidental (that is, not essential; it just happened to be like this for technical rather than theoretical reasons). Essentially, what we have here is 512 descriptors (or variables).
To make my point clear: these 16 integer values can be regarded as molecular descriptors only with caution and within limits. Their use as variables in dimension reduction or other multivariate statistical analysis should be avoided, as that can lead to distorted results.
Instead, in my opinion, the 512 individual bits should be used as descriptors, or variables.
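To illustrate the point above, here is a minimal Python sketch of how the 16 integer values could be unpacked into 512 individual bit descriptors before they are passed to a statistics package. The function names are hypothetical, and the bit order within each integer (least significant bit first) as well as the column order are assumptions; please check them against your own table before relying on the result.

```python
def unpack_fingerprint(columns, bits_per_column=32):
    """Unpack integer fingerprint columns into a flat list of 0/1 bits.

    Assumption: bits are read least significant bit first within each
    integer, and columns are concatenated in the order given.
    """
    mask = (1 << bits_per_column) - 1
    bits = []
    for value in columns:
        # Integers read from the database may be signed (negative when the
        # top bit is set), so reinterpret them as unsigned values first.
        unsigned = value & mask
        for position in range(bits_per_column):
            bits.append((unsigned >> position) & 1)
    return bits


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

For example, two fingerprints whose first column holds 0 and -2147483648 respectively (all other columns 0) differ in exactly one bit, yet as integer variables they are about two billion apart. That is the kind of distortion that using the 16 integers directly would introduce into PCA or any other multivariate analysis.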
Thank you for the clarification.
I believe heat maps should not be difficult to implement...
Tim, any wise thoughts about what kind of statistics and related visualisation tools/solutions will be available in IJC?
We haven't made any firm decisions on what we are going to do in IJC and how we are going to do it, as these developments won't be done for a little while.
But as an indicator of our plans, here is a little information.
We do plan to provide some basic charting and visualisation support, e.g. scatter plots, histograms of distributions, etc. This will initially allow viewing and analysis of IJC data, but these tools can also be put to other uses.
Following this, we would also like to add some basic support for statistics and data reduction techniques. Things like PCA and PLS (as well as more basic statistics functions) will be useful, as will clustering and decision trees.
In all cases we will almost certainly be implementing this using 3rd party libraries and we will not be trying to provide the ultimate in statistics or data mining functionality, rather trying to allow chemists/biologists to do some common useful things (following the 80/20 rule), whilst still allowing more advanced analysis to be done in external tools.
Hope this gives you an idea of our plans.