Unique representation of molecules

User 74c85f9df0

22-02-2012 06:48:22

Hello.


 


I am trying to get a unique String representation of a molecule so that I can efficiently identify duplicates in a database of structures. I should add that I also want to include polymers and coordination compounds. Is it possible to get a unique String representation such that molString1.equals(molString2) == true if the two structures are identical? My criteria are:


- Stereoisomers are different structures


- Tautomers could also be considered different structures, although a canonical tautomer form would be nice.


One idea I had in mind is a canonical unique SMILES, but this won't work for polymers, right? Is there any other way to achieve this? Another option, of course, is to store the structures in a non-unique form, generate fingerprints and store them along with the structures, and then do a pre-filtering step followed by a substructure search. But this seems a rather complicated way of detecting duplicates, doesn't it?


 


Thanks a lot for your help.


 


Best,


 


Tobias

ChemAxon 25dcd765a3

27-02-2012 16:01:20

Hi,


"I try to get a unique String representation of a molecule to identify duplicates in a database of structures efficiently. I have to add that I also want to include polymers and coordination compounds. I wonder now, if it is possible to get a unique String representation, that would result in molString1.equals(molString2) == true if the two structures are identical"

If you have SRU polymers, for example, such structures cannot be represented in SMILES format. In our tools you will get a "Cannot convert molecule to 'smiles' format" error. A similar problem appears with coordination compounds; in that case, however, the coordination bonds are simply omitted during export. A coordination bond cannot be represented in SMILES because the format specification has no such bond type. So SMILES does not seem to be a good fit for this area. However, it is worth doing some experiments, as it may work for your case.


To handle tautomers, you should rather use the tautomerization plugin, which can generate a canonical tautomer. I think this is what you want.


The fingerprint approach is a much better idea; JChem and Instant JChem work in just this way: tautomerization, fingerprint generation, etc. I think these products are your friends.
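As a rough illustration of the fingerprint pre-filtering idea (not how JChem actually implements it), here is a minimal sketch using plain java.util.BitSet. The class and method names are hypothetical, and it assumes each structure already has a fingerprint that is computed deterministically, so that identical structures give identical bit sets.

```java
import java.util.BitSet;

// Hypothetical helper, for illustration only; not part of any ChemAxon API.
final class FingerprintScreen {
    private FingerprintScreen() {}

    // Duplicate screen: identical structures must yield identical fingerprints,
    // so differing fingerprints rule a pair out immediately.
    static boolean mayBeDuplicate(BitSet a, BitSet b) {
        return a.equals(b);
    }

    // Substructure screen: every bit set in the query must also be set in the target.
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);   // keep only the query bits that the target lacks
        return missing.isEmpty();
    }
}
```

A pair that passes the screen is only a candidate. Different structures can share a fingerprint, so an exact structure comparison is still needed afterwards.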

User 74c85f9df0

28-02-2012 10:13:39

 


Hello.


 


Thanks for the answer. I think I have found a way to check it very quickly. Along with the enhanced mol file, I store the unique SMILES for all structures, knowing that it might not encode certain features (like coordinate bonds, SRU, ...). The same molecule will always result in the same SMILES, although the SMILES is not unique (i.e. a different molecule might result in the same SMILES). Afterwards, I just run a quick check with MolSearch (using the enhanced mol file, not the SMILES) with the DUPLICATE option to see which molecules are indeed the same.


However, I have a small problem. Is it possible that MolSearch does not take coordinate bonds into account? I attached three different examples. All of them end up being considered 'the same' by MolSearch. Is that behaviour expected?


 


 


Thanks a lot....


 


Tobias

ChemAxon 42004978e8

07-03-2012 10:57:29

Hi,


Could you please attach a code example of your searches? Please specify the JChem version as well. We couldn't reproduce the match between the structures you mentioned.


You can find examples of duplicate searches here:


http://www.chemaxon.com/jchem/doc/dev/search/index.html#searchmem


We suggest following the third method, where not all structure pairs are searched, only those that have the same hash code, which gives a faster way of duplicate checking.
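The general idea of comparing only structures with the same hash code can be sketched in plain Java as follows. This is a generic illustration under assumptions, not the code from the linked documentation: DuplicateBuckets and findDuplicates are hypothetical names, the key function stands in for whatever hash code or stored unique SMILES you use, and exactMatch stands in for the real duplicate search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical helper, for illustration only; not part of any ChemAxon API.
final class DuplicateBuckets {
    private DuplicateBuckets() {}

    // Returns index pairs {i, j} with i < j for items that share a key
    // and also pass the exact comparison.
    static <T, K> List<int[]> findDuplicates(List<T> items,
                                             Function<T, K> key,
                                             BiPredicate<T, T> exactMatch) {
        // Step 1: group item indices by key (cheap, one pass).
        Map<K, List<Integer>> buckets = new HashMap<>();
        for (int i = 0; i < items.size(); i++) {
            buckets.computeIfAbsent(key.apply(items.get(i)), k -> new ArrayList<>()).add(i);
        }
        // Step 2: run the expensive exact comparison only inside each bucket.
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> bucket : buckets.values()) {
            for (int a = 0; a < bucket.size(); a++) {
                for (int b = a + 1; b < bucket.size(); b++) {
                    int i = bucket.get(a);
                    int j = bucket.get(b);
                    if (exactMatch.test(items.get(i), items.get(j))) {
                        pairs.add(new int[] {i, j});
                    }
                }
            }
        }
        return pairs;
    }
}
```

The key does not have to be unique. It only has to be identical for identical structures, because the exact comparison in step 2 removes the false positives.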


Regards,


Robert