Question over similarity matches

User 952e1d9361

29-11-2010 15:54:42

Hello,


I am running a similarity search using jc_compare with the t:t and simThreshold parameters, and structures are being returned as a 100% match even though they do not have the same structure. This is not my area of expertise, so could you please explain to me (so I can explain to the users) why this is happening?


The attached script should illustrate the problem.


The search structure is CCCCCCCCCCC1CO1 yet results for CCCCCCCC1CO1, CCCCCCCCCCCCCCC1CO1, CCCCCCCCCCCCCCCCC1CO1, CCCCCCCCCCCCC1CO1, C(CCC1CO1)CC1CO1 and CCCCCCC1CO1 are being returned as 100% match.


Do I need to apply some parameter to the jc_compare() or jcf_tanimoto() functions to change this behaviour or is this 'correct' in some way I don't understand?


Regards, and thanks,


Steve H



ChemAxon 9c0afc9aaf

29-11-2010 23:25:10

Hi,


 


Similarity values are calculated from the fingerprints.


A 100% similarity means these structures have exactly the same fingerprint.



Fingerprints can match exactly even with certain structural
differences.


Please see the theory on the Chemical Hashed Fingerprints:

https://www.chemaxon.com/jchem/doc/user/fingerprint.html


Basically, we search for all linear patterns in the
molecule up to 6 bonds in length (if using the default FP parameters).

Each of these patterns sets 2 bits (by default) in the fingerprint.

The resulting bit strings are then compared using the Tanimoto metric.
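
The comparison step can be sketched as follows. This is only an illustration of the standard Tanimoto coefficient on bit sets, not the actual jc_compare / jcf_tanimoto implementation; the function name and the example bit positions are made up for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints.

    Each fingerprint is given as a set holding the positions of its 1 bits
    (a hypothetical representation chosen for readability).
    """
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are treated as identical here
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Made-up bit positions, for illustration only.
query = {1, 5, 9, 12}
same_bits = {1, 5, 9, 12}   # a different molecule could still produce these bits
other = {1, 5, 20}

print(tanimoto(query, same_bits))  # identical bit sets give the maximum score
print(tanimoto(query, other))      # 2 shared bits out of 5 distinct bits
```

The score depends only on which bits are set, so any two structures that end up with the same bits score 100%, whatever their actual structures are.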



The reasons for structural differences might be:



1. If a pattern occurs more than once, it makes no
difference, as only its existence is taken into account.

So if you add something to a structure that does not introduce a
different pattern, the FP remains the same.


This is the case with your structures, I have also attached a screenshot of them.


A carbon chain longer than 6 bonds produces the same FP as one 6 bonds long, and repeating the same substructure ("triangle with oxygen" - sorry I'm not a chemist :)) multiple times does not change the FP as long as no new pattern is introduced.
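
To make this concrete, here is a toy sketch that lists the linear patterns of some of the structures from the question. It is a simplified model and not the real Chemical Hashed Fingerprint code: it only records atom symbols along each path (all bonds here are single), and the function names and graph representation are invented for the example.

```python
def linear_patterns(atoms, bonds, max_bonds=6):
    """Set of linear atom-symbol patterns with at most max_bonds bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    patterns = set()

    def walk(path):
        symbols = "".join(atoms[i] for i in path)
        # A path and its reverse are the same pattern, so keep one spelling.
        patterns.add(min(symbols, symbols[::-1]))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # linear paths only, no atom visited twice
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return patterns


def alkyl_epoxide(chain_length):
    """Carbon chain of chain_length atoms attached to a C1CO1 ring."""
    n = chain_length
    atoms = ["C"] * n + ["C", "C", "O"]
    bonds = [(i, i + 1) for i in range(n)]             # chain up to ring carbon n
    bonds += [(n, n + 1), (n + 1, n + 2), (n + 2, n)]  # three-membered ring
    return atoms, bonds


def diepoxide(linker_length):
    """Two C1CO1 rings joined by a carbon linker (4 gives C(CCC1CO1)CC1CO1)."""
    last = 3 + linker_length
    atoms = ["C", "C", "O"] + ["C"] * linker_length + ["C", "C", "O"]
    bonds = [(0, 1), (1, 2), (2, 0)]                           # first ring
    bonds += [(0, 3)] + [(i, i + 1) for i in range(3, last)]   # linker
    bonds += [(last, last + 1), (last + 1, last + 2), (last + 2, last)]
    return atoms, bonds


query = linear_patterns(*alkyl_epoxide(10))          # CCCCCCCCCCC1CO1
print(query == linear_patterns(*alkyl_epoxide(6)))   # CCCCCCC1CO1
print(query == linear_patterns(*alkyl_epoxide(16)))  # CCCCCCCCCCCCCCCCC1CO1
print(query == linear_patterns(*diepoxide(4)))       # C(CCC1CO1)CC1CO1
print(query == linear_patterns(*alkyl_epoxide(2)))   # short chain, for contrast
```

In this toy model the first three comparisons come out equal, because a longer chain or a second ring only repeats patterns that are already present, while the short chain lacks the longer all-carbon paths. Identical pattern sets necessarily hash to identical fingerprints.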


Apart from these cases, other, rarer structural differences might produce identical fingerprints too:




2. We only take into account atom and bond types. This means:

- Isotopes and similar properties are not considered

- Stereo information is not considered

This is essential so that these can remain structural search options (so for
example stereo checking can be turned on or off).



3. In some cases there might be bit collisions (see theory), where two
(or more) patterns set the same bits.

Adding a feature to a structure which only sets bits that are already
set to 1 will not change the fingerprint.
A lower FP darkness may help (e.g. by increasing the FP size), but collisions are always possible.
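
A minimal sketch of how collisions and darkness relate, assuming a generic hash (SHA-256 here) in place of the real hashing scheme, which is not described in this thread. The function names, pattern strings and sizes are illustrative only.

```python
import hashlib


def fingerprint(patterns, size=512, bits_per_pattern=2):
    """Hash each pattern to bits_per_pattern positions in a size-bit string.

    Returns the set of positions that are set to 1.
    """
    bits = set()
    for pattern in patterns:
        for k in range(bits_per_pattern):
            digest = hashlib.sha256(f"{pattern}:{k}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def darkness(bits, size):
    """Fraction of the fingerprint bits that are set to 1."""
    return len(bits) / size


patterns = ["C", "CC", "CCC", "CCO", "COC", "CCCC", "CCCO", "COCC"]

large = fingerprint(patterns, size=512)
small = fingerprint(patterns, size=8)

# 8 patterns x 2 bits = 16 bit assignments. In an 8-bit fingerprint at least
# half of them must land on a position that is already taken.
print(len(large), darkness(large, 512))
print(len(small), darkness(small, 8))
```

A larger fingerprint leaves more room, so fewer patterns share bits and the darkness is lower, but no finite size rules collisions out completely.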


In summary, what you see is normal behavior.

Similarity based on Chemical Hashed Fingerprints is not an "exact science"
(nor is similarity search in general); it definitely does not
substitute for more sophisticated similarity methods and is certainly not suitable for
duplicate comparison.

Considering its simplicity and speed, it has good real-life performance though, and
is still one of the most popular methods.


We also offer more sophisticated similarity calculations, including the possibility of implementing custom fingerprints and metrics:

https://www.chemaxon.com/products/screen/


Best regards,


Szilard

ChemAxon a3d59b832c

06-12-2010 14:03:13

It may help in the future that we are planning an extended similarity scoring scheme, in which structures scoring 100% similarity are examined further and their scores adjusted.


The exact details are not decided, but basically it will mean that identical molecules will get e.g. 110% similarity, molecules only differing in stereo or isotope information 101-109%, etc.


Molecules differing in their topology will stay at 100%.


 


We will let you know when this new feature is available.


 


Best regards,


Szabolcs

User 952e1d9361

06-12-2010 14:14:54

Thanks for the update.


Steve