User d028dca803
03-07-2009 11:26:26
I have attached an SDF file of 4 compounds. Their fingerprint Tanimoto scores are:
WY23626 WY23626 1.000000
WY23626 7277998 0.830189
WY23626 5357691 0.660900
WY23626 9099944 0.840764
7277998 7277998 1.000000
7277998 5357691 0.578275
7277998 9099944 0.915858
5357691 5357691 1.000000
5357691 9099944 0.585761
9099944 9099944 1.000000
I don't understand why WY23626 scores so badly against 5357691.
Any suggestions?
ChemAxon efa1591b5a
07-07-2009 14:27:06
Hi,
This must be a bug! I reckon the similarity of these two compounds should not be less than 0.9. We will fix this ASAP!
Apologies for the inconvenience this problem might cause.
Best regards
Miklos
User d028dca803
17-08-2009 09:24:36
Any progress on the fingerprint bug?
ChemAxon efa1591b5a
18-08-2009 17:30:45
Hi,
Well, we are still investigating this problem, and it is a hard one. (1) From the user's point of view this is a 'bug', as the calculated similarity score does not meet expectations: you would expect a higher score because the two structures differ by only 1 bond and 1 atom. (2) From our point of view it is a feature, a property of the particular topological fingerprint used.
I mean that, at present, we cannot identify one single issue that we could 'fix'. We can carefully examine our fingerprinting technology and analyse why the extra ester group in WY23626 introduced such a large number of different bits compared with 5357691. In WY23626 there are 107 bits set (the fingerprint length is 512), in 5357691 there are only 83 bits set, and 81 bits are common to both fingerprints. This results in a Tanimoto score of 0.74, i.e. 81 / (107 + 83 - 81), in version 5.2.3 of JChem, somewhat better than what you experienced in an older version.
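The arithmetic behind that score can be checked with a minimal sketch. It assumes only the standard Tanimoto definition for binary fingerprints (common bits divided by the bits set in either fingerprint) and uses the bit counts quoted above. The function name is hypothetical and is not part of the JChem API.

```python
def tanimoto_from_counts(bits_a: int, bits_b: int, common: int) -> float:
    """Tanimoto coefficient from the on-bit counts of two binary fingerprints.

    bits_a, bits_b: number of bits set in each fingerprint
    common:         number of bits set in both fingerprints
    """
    if common > min(bits_a, bits_b):
        raise ValueError("common bits cannot exceed either fingerprint's bit count")
    union = bits_a + bits_b - common  # bits set in at least one fingerprint
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return common / union


# Bit counts quoted in this thread for WY23626 (107), 5357691 (83), 81 shared.
score = tanimoto_from_counts(107, 83, 81)
print(f"{score:.2f}")  # prints 0.74
```

Note that the fingerprint length (512) does not enter the calculation; only the set bits matter. Here 26 of the 107 bits in WY23626 are absent from 5357691, and those alone pull the score down to 0.74.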
If there is a bug, or more than one, then those are probably not coding problems but conceptual ones. One can argue that an extra bond and an extra atom should not affect the fingerprint, and thus the similarity score, so badly, though another argument is that a new functional group may change the chemical structure significantly. I will discuss this issue with my colleagues, and we are very open to hearing your thoughts.
If you feel that the difference between these two particular structures is exaggerated by our fingerprint, you may wish to try the BCUT descriptor. However, its native metric is not Tanimoto but Euclidean distance, which is not a similarity measure but a dissimilarity one.
Another interesting idea could be to introduce an MCES/MOS based Tanimoto-like score, for instance: number of MCES atoms / (number of atoms in structure 1 + number of atoms in structure 2 - number of MCES atoms). Such a score could better suppress the 1 atom, 1 bond difference between the two molecules. We do have MCES calculation, so introducing such a score should not be a problem. However, at present our algorithm is restricted to connected common substructures, which limits its use.
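To illustrate how such a score would behave, here is a minimal sketch of the formula above. The function name is hypothetical, and the atom counts in the example are made up for illustration; they are not the real atom counts of WY23626 or 5357691. The MCES atom count is taken as an input, so no MCES search is performed here.

```python
def mces_tanimoto(atoms_1: int, atoms_2: int, mces_atoms: int) -> float:
    """Tanimoto-like score from atom counts and the size of the common substructure.

    atoms_1, atoms_2: number of atoms in each structure
    mces_atoms:       number of atoms in their maximum common substructure
    """
    if mces_atoms > min(atoms_1, atoms_2):
        raise ValueError("the common substructure cannot be larger than either structure")
    denominator = atoms_1 + atoms_2 - mces_atoms
    if denominator == 0:
        raise ValueError("both structures are empty")
    return mces_atoms / denominator


# Illustrative numbers only (assumed, not taken from the attached SDF file):
# two structures of 25 and 24 atoms that share a 24-atom common substructure.
print(mces_tanimoto(25, 24, 24))  # prints 0.96
```

With a single-atom difference, the score is N / (N + 1) for a smaller structure of N atoms, so it stays close to 1 for molecules of typical size.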
I will let you know of any progress on this matter. Please feel free to share your thoughts with us; all comments and suggestions are highly appreciated.
Kind regards,
Miklos