Help with ECFP fingerprints

User 7b0ee04e66

19-06-2012 11:31:39

Good afternoon


We are trying to use the ECFP fingerprints in the JChem cartridge.


We are running version 5.9.3


Based on the documentation found here


http://www.chemaxon.com/jchem/doc/dev/cartridge/index.html


I need to alter the index on my table before I can use it:


alter index jcxnci parameters('addDfltMdConf=ECFP');


select count(*) from nci where jc_dissimilarity(struct, 'C[C@H](CS)C(=O)N1CCC[C@H]1C(O)=O', 'descriptorName:ECFP') < 0.3;


But which index should I create to start with: a normal index with tableType:molecules, or a different one?


We would like to be able to run a mixture of exact, substructure and similarity searches using the standard Tanimoto coefficient, as well as similarity searches using ECFP fingerprints, on the same column.
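
For reference, the threshold in the query above (dissimilarity < 0.3) can be read as follows, assuming the Tanimoto metric is the one applied to the ECFP descriptor. This is only a minimal sketch in Python on hand-made bit sets; the function name and the example bit positions are hypothetical and it does not call the cartridge.

```python
def tanimoto_dissimilarity(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of 'on' bit positions."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical fingerprints, given as the positions of the bits that are set.
query = {1, 5, 9, 12, 20}
target = {1, 5, 9, 12, 20, 33}

d = tanimoto_dissimilarity(query, target)
print(round(d, 3))   # 5 shared bits out of 6 in the union
print(d < 0.3)       # would this target pass the threshold used in the query?
```

A lower threshold keeps only closer analogues; a dissimilarity of 0 means the two fingerprints have exactly the same bits set.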


Should I use a larger number of fingerprints (16 or 32)? Will the results be more precise and faster if we use 32 bit?


Thanks


Catherine

ChemAxon aa7c50abf8

19-06-2012 12:04:47

Good Afternoon Catherine,


The structureType index parameter can be either molecules (specific structures, like single molecules, mixtures, salts, polymers) or anyStructure (all types of structures are allowed, but no structure type-specific searching).


We will answer your third question soon.


Peter

ChemAxon aa7c50abf8

19-06-2012 12:53:06

Should I use a larger number of fingerprints (16 or 32)? Will the results be more precise and faster if we use 32 bit?

The fp_size, fp_bit and pat_length index parameters do not affect ECFP similarity searches. (These index parameters affect similarity searches that use the chemical fingerprints created with indexes by default.) The addDfltMdConf=ECFP index parameter adds ECFP with a default configuration. If you intend to use a custom ECFP configuration, you have to use the addMd index parameter. Please see http://www.chemaxon.com/jchem/doc/user/ECFP.html for the available configuration options (http://www.chemaxon.com/jchem/doc/user/ECFP.html#config in particular).


Peter

User 7b0ee04e66

19-06-2012 17:54:26

Hi Peter


Thanks a lot for your help


We have now set up the new index and proven we can use it!


Catherine

ChemAxon efa1591b5a

20-06-2012 07:32:32

Hi Catherine,


The default configuration should do a decent job in terms of both speed and accuracy.


ECFP introduces two parameters that influence the speed of generation and the accuracy of search (and, to some extent, the speed of search as well, but not significantly). The first is the diameter, which is 4 by default and is often increased to 6 for the sake of higher accuracy. This affects the fingerprint generation time, but not the search time.


When ECFP is folded to a binary representation, the second parameter, length, determines the size of the folded fingerprint. The size of the fingerprint affects both performance and accuracy. The longer the fingerprint, the slower the similarity calculation, but the difference between 512 and 1024 bits should be negligible. In terms of accuracy, the difference between 512 and 1024 bits can be more pronounced. However, this depends heavily on the particular data set you are dealing with. A simple experiment using a random subset should provide valuable insight; I'd suggest trying and comparing diameters 2, 4 and 6. Higher values are not needed for similarity searching.


Also note that a larger diameter generates more information, which longer fingerprints represent with less data loss. So my gut feeling is that diameter 6 works better with 1024 bits than with 512.
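
To illustrate the data-loss point, here is a minimal sketch in Python of what folding does to a growing list of features. It assumes folding by simple modulo on integer feature identifiers, which is a common textbook scheme and not necessarily how JChem folds ECFP; the identifiers are random stand-ins and the function names are hypothetical.

```python
import random


def fold(feature_ids, length):
    """Fold integer feature identifiers into bit positions by modulo."""
    return {fid % length for fid in feature_ids}


def collisions(feature_ids, length):
    """Count distinct features lost because they share a bit with another."""
    distinct = set(feature_ids)
    return len(distinct) - len(fold(distinct, length))


rng = random.Random(0)
# Hypothetical identifiers: a larger diameter yields more features per molecule.
fewer = [rng.getrandbits(32) for _ in range(40)]
more = [rng.getrandbits(32) for _ in range(120)]

for name, feats in (("fewer features", fewer), ("more features", more)):
    for length in (512, 1024):
        print(name, length, collisions(feats, length))
```

With modulo folding, two features that share a bit at 1024 also share a bit at 512, so the 1024-bit fingerprint never loses more features than the 512-bit one. Running the same comparison on a random subset of your own structures is the experiment suggested above.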


Hope this helps.



Regards