fingerprint tanimoto comparison

User d028dca803

03-07-2009 11:26:26

I have attached an SDF file of 4 compounds. Their fingerprint Tanimoto scores are:


WY23626 WY23626 1.000000
WY23626 7277998 0.830189
WY23626 5357691 0.660900
WY23626 9099944 0.840764
7277998 7277998 1.000000
7277998 5357691 0.578275
7277998 9099944 0.915858
5357691 5357691 1.000000
5357691 9099944 0.585761
9099944 9099944 1.000000


I don't understand why WY23626 scores so badly against 5357691.


Any suggestions?

ChemAxon efa1591b5a

07-07-2009 14:27:06

Hi, 


this must be a bug! I reckon the similarity of these two compounds should not be less than 0.9. We will fix this ASAP!


Apologies for the inconvenience this problem might cause.


Best regards


Miklos

User d028dca803

17-08-2009 09:24:36

Any progress on the fingerprint bug?

ChemAxon efa1591b5a

18-08-2009 17:30:45

Hi,


Well, we are still investigating this problem, and it is a hard one: (1) from the user's point of view this is a 'bug', as the calculated similarity score does not meet expectations; you would expect a higher similarity score because the two structures differ by only 1 bond and 1 atom. Meanwhile, (2) from our point of view this is just a feature, a property of the particular topological fingerprint used.


What I mean is that, at present, we cannot identify one single issue that we could 'fix'. We can carefully examine our fingerprinting technology and analyse why the particular extra ester group (in WY23626) introduced a large number of different bits (when compared against 5357691). In WY23626 there are 107 bits set (the fingerprint length is 512), in 5357691 there are only 83 bits set, and 81 bits are common to both fingerprints. This results in a Tanimoto score of 0.74 (in version 5.2.3 of JChem), somewhat better than what you experienced in an older version.
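For reference, the 0.74 figure follows directly from the bit counts quoted above. A minimal sketch in Python (not JChem code; the function name is made up for illustration), using the usual Tanimoto definition of common bits divided by the bits set in either fingerprint:

```python
def tanimoto_from_counts(a_bits: int, b_bits: int, common_bits: int) -> float:
    """Tanimoto coefficient from on-bit counts: c / (a + b - c)."""
    union = a_bits + b_bits - common_bits
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return common_bits / union


# Bit counts quoted in the post: 107 (WY23626), 83 (5357691), 81 in common.
score = tanimoto_from_counts(107, 83, 81)
print(round(score, 4))  # 81 / 109
```

The denominator is 107 + 83 - 81 = 109, so the 26 bits that are set only in WY23626 account for almost all of the drop from 1.0.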


If there is a bug, or more than one, then those are probably not coding problems but conceptual ones. One can argue that an extra bond and an extra atom should not affect the fingerprint, and thus the similarity score, so badly, though another argument is that a new functional group may change the chemical structure significantly. I will discuss this issue with my colleagues, and we are very open to hearing your thoughts.


If you feel that the difference between these two particular structures is exaggerated by our fingerprint, you may wish to try the BCUT descriptor. However, its native metric is not Tanimoto but Euclidean, which is not a similarity measure but a dissimilarity one.


Another interesting idea could be to introduce an MCES/MOS based Tanimoto-like score, for instance: number of MCES atoms / (number of atoms in structure1 + number of atoms in structure2 - number of MCES atoms). Such a score could better suppress the 1 atom, 1 bond difference between the two molecules. We do have MCES calculation, so introducing such a score should not be a problem. However, at present our algorithm is bound to connected common substructures only, which limits its use.
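The proposed score can be sketched as follows. This is only an illustration of the formula given above, not an existing JChem function; the function name and the atom counts in the example are hypothetical, and the MCES size is assumed to have been computed elsewhere.

```python
def mces_tanimoto(mces_atoms: int, atoms_1: int, atoms_2: int) -> float:
    """Tanimoto-like score based on the size of the maximum common
    edge subgraph (MCES): m / (n1 + n2 - m)."""
    if mces_atoms > min(atoms_1, atoms_2):
        raise ValueError("MCES cannot be larger than either structure")
    union = atoms_1 + atoms_2 - mces_atoms
    if union == 0:
        # Assumption: two empty structures are treated as identical.
        return 1.0
    return mces_atoms / union


# Hypothetical counts: two structures of 30 and 29 heavy atoms whose
# common substructure covers all 29 atoms of the smaller one.
print(mces_tanimoto(29, 30, 29))  # 29 / 30
```

With counts like these, a single extra atom lowers the score only slightly, which is the behaviour the user expected from the fingerprint.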


I will let you know any progress regarding this matter in future. Please feel free to share your thoughts with us, all comments and suggestions are highly appreciated. 


Kind regards,


Miklos

ChemAxon efa1591b5a

16-02-2011 12:51:07



Hi,


A thorough analysis of the fingerprint generation as well as the similarity calculation did not reveal any problem. The unexpectedly low similarity that you experienced is a consequence of the fingerprint generation method, which is based on paths/patterns of the chemical graphs. Even if there are only a few bond/atom differences between two structures, many paths or patterns are affected. Each of these patterns corresponds to different bits in the two fingerprints...
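The effect described above can be reproduced with a toy example. The sketch below is not ChemAxon's fingerprint algorithm: it simply enumerates linear atom-label paths in a small graph and compares the resulting sets directly, without bond types or hashing into a fixed-length bit string. The two "molecules" are made-up chains that differ by one atom and one bond.

```python
def linear_paths(atoms, bonds, max_bonds=4):
    """Return the set of linear atom-label paths with up to max_bonds bonds."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    found = set()

    def walk(path):
        labels = "".join(atoms[i] for i in path)
        # A path and its reverse are the same pattern; keep one canonical form.
        found.add(min(labels, labels[::-1]))
        if len(path) - 1 < max_bonds:
            for nxt in adjacency[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adjacency:
        walk([start])
    return found


# Toy structure A: the chain C-C-N-C-C.
atoms_a = ["C", "C", "N", "C", "C"]
bonds_a = [(0, 1), (1, 2), (2, 3), (3, 4)]

# Toy structure B: the same chain with one extra O attached to atom 1.
atoms_b = atoms_a + ["O"]
bonds_b = bonds_a + [(1, 5)]

paths_a = linear_paths(atoms_a, bonds_a)
paths_b = linear_paths(atoms_b, bonds_b)
common = paths_a & paths_b
similarity = len(common) / len(paths_a | paths_b)
print(len(paths_a), len(paths_b), len(common), round(similarity, 3))
```

In this toy case the single extra atom adds six new patterns (O, CO, CCO, NCO, CNCO, CCNCO) to the eight already present, so the Tanimoto score over the pattern sets is 8/14, about 0.57, even though the structures differ by one atom and one bond.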


Your observation is still valid and I agree with you, the similarity ratio does not really meet my expectation either. 


You may wish to try the circular fingerprints (ECFP), which may lead to better similarity scores (closer to your expectation).


Btw, did you try the MCES based similarity? That should give you similarity scores much closer to your expectations than any fingerprint-based approach. If you think that such similarity scoring (i.e. structure based rather than fingerprint based) would be a useful addition to JChem, then we can implement it. Please let me know if you need any assistance in calculating the MCES based similarity of your compounds.


Regards


Miklos