Unique SMILES which does not tamper with bond orders?

User 21b7e0228c

10-11-2010 14:23:38

Hi,


I wonder whether there is (or there will be, in a glorious future) some :option to generate a unique SMILES string which, unlike the current :u option, does NOT attempt to tamper with bond orders/aromaticity. In other words, I'd like an option which, if provided with some arbitrary Kekulé "cyclohexatriene" structure, would return... a canonical way to write cyclohexatriene as SMILES, and not smartass aromatized benzene. After all, the user has other tools at hand (aromatization, the mesomerize option, tautomer management, etc.) to make sure bond orders are set coherently in the molecules BEFORE using smiles:u. In other words, I don't think it's a good idea to let the smiles:u option play the role of Standardizer Almighty. For example:


echo "O=C1C=CNC=C1" | molconvert smiles:u   --> produces "O=c1cc[nH]cc1"


with a five-legged carbonyl carbon! I positively hate this overaggressive aromatization, which may cause a lot of trouble if you try to read the output in some other god-fearing software, etc. That's why we decided to stick with the softer aromatize:b/l option in our standardization approach. Unfortunately,


echo "O=C1C=CNC=C1" | molconvert smiles:u | standardize -c "aromatize:l"


does NOT reverse the aggressive aromatization; you actually need "dearomatize..aromatize:l". Of course, one could live with this in virtually all real-life situations, so my post is basically just for the sake of academic haggling: in my view, SMILES "canonicalization" is about generating a unique string for a specified molecular graph, with the bond orders as input, and not an attempt to beautify the graph itself. For that matter, if you push split-charge nitro groups through smiles:u, it will not convert them to pentavalent N, so smiles:u cannot be directly employed to check for "duplicate" molecules in a collection by checking for identical smiles:u strings, unless you explicitly take care of standardization issues. smiles:u thus somewhat uncomfortably overlaps with the standardizer, and I'd prefer to keep them apart, so we know who's doing what.


At least... if you keep smiles:u working the way it is (I'm happy with that, too), it would be necessary to explain in the documentation that smiles:u tampers with the aromatization status... my slow neuron needed some time to track down the inconsistencies in my standardization protocols.


Cheers!


Dragos

ChemAxon 25dcd765a3

11-11-2010 08:47:15

Dear Dragos,


I wonder whether there is (or there will be, in
a glorious future) some :option to generate a unique SMILES string
which, unlike the current :u option, does NOT attempt to tamper with
bond orders/aromaticity. In other words, I'd like an option which, if
provided with some arbitrary Kekulé "cyclohexatriene" structure, would
return... a canonical way to write cyclohexatriene as smiles, and not
smartass aromatized benzene.




I understand your problem and let me explain some motivation behind the aromatic conversion.


Take the attached molecule. As you can see, it is exactly one molecule drawn in two different Kekulé forms, which results in the following two different SMILES strings:


ClC1=CC=CC=C1Br and ClC1=C(Br)C=CC=C1


Most of our users want to have exactly the same SMILES string for these. How can we resolve this problem?


We have introduced the 'u' option for unique SMILES, which generates the same string for both forms. But we can only achieve the unique string if we convert the Kekulé form to the aromatic form before SMILES export.
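To illustrate why the aromatic conversion helps, here is a minimal toy sketch in Python. It is not ChemAxon's algorithm: the ring model, the `canonical` function (smallest string over all rotations and both directions) and the `toy_aromatize` rule (any strictly alternating even ring becomes aromatic, with no real aromaticity check) are assumptions made purely for illustration. The ring is the example above, with Cl and Br on neighbouring carbons.

```python
# Toy model of a single ring: atoms[i] is bonded to atoms[(i + 1) % n]
# with bond order bonds[i]. Not a SMILES writer, only an illustration.
BOND_SYMBOL = {1: "-", 2: "=", "a": ":"}


def ring_strings(atoms, bonds):
    """Yield one string per starting atom and walking direction."""
    n = len(atoms)
    for start in range(n):
        for step in (1, -1):
            parts = []
            pos = start
            for _ in range(n):
                nxt = (pos + step) % n
                # bonds[i] joins atom i and atom i + 1, in either direction
                bond = bonds[pos] if step == 1 else bonds[nxt]
                parts.append(atoms[pos] + BOND_SYMBOL[bond])
                pos = nxt
            yield "".join(parts)


def canonical(atoms, bonds):
    """Toy canonical form: the lexicographically smallest ring string."""
    return min(ring_strings(atoms, bonds))


def toy_aromatize(bonds):
    """Replace a strictly alternating single/double ring by aromatic bonds.

    Hypothetical stand-in for a real aromatization method.
    """
    n = len(bonds)
    alternating = n % 2 == 0 and all(
        {bonds[i], bonds[(i + 1) % n]} == {1, 2} for i in range(n)
    )
    return ["a"] * n if alternating else list(bonds)


atoms = ["C(Cl)", "C(Br)", "C", "C", "C", "C"]
kekule_a = [1, 2, 1, 2, 1, 2]  # ClC1=CC=CC=C1Br
kekule_b = [2, 1, 2, 1, 2, 1]  # ClC1=C(Br)C=CC=C1

# The two Kekulé forms give different strings when bond orders are kept...
print(canonical(atoms, kekule_a) == canonical(atoms, kekule_b))  # False
# ...and the same string once the ring bonds are made aromatic.
print(
    canonical(atoms, toy_aromatize(kekule_a))
    == canonical(atoms, toy_aromatize(kekule_b))
)  # True
```

With bond orders kept as input, the canonical string faithfully distinguishes the two drawings; only after the alternating bonds are replaced by one aromatic bond type do the two drawings collapse into one string.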


As I said at the beginning, I understand your problem, and I suggest using the plain SMILES format (without the 'u' option). As far as I know, our plain SMILES export is canonical in many cases; however, we know that the algorithm fails for some chiral symmetric molecules. Until now we have not had enough time or requests to make the plain SMILES export fully canonical.


It is still not clear to me whether you would like


- to get an improvement for canonical (plain) SMILES export or


- just a correction in the unique SMILES export documentation which explicitly states that the molecule is converted to aromatic form (with general aromatization method) before export.


All the best


Andras

User 21b7e0228c

12-11-2010 09:26:28

Thanks a lot for the clarifications - I did not know that the "plain" smiles is canonical as well! (the documentation is not canonical ;-)


I may well live with that - my remark was just in defence of "power separation" in the ChemAxon democracy: have "canonicalization" take care of the smiles syntax, and let standardizer model the molecular graph that would, at best, describe that molecule. But, since there is already a smiles:u option which standardizes aromatic systems, it then would be nice to let it also take care of other "routine" standardization - nitro group representation, for example (the present smiles:u would produce different output upon charge-split or pentavalent N representations at input).


Cheers!


Dragos

ChemAxon 25dcd765a3

16-11-2010 09:06:43

Thanks for the comments.


Some additional information for unique smiles representation:


We designed it for database usage, so it is not very useful for presenting molecules to users in that form. (As you noticed, pentavalent carbon atoms sometimes appear.) But you can present these molecules after dearomatization.


It is very hard (if it is possible at all) to assign the double bond positions in a substituted cyclohexatriene ring in a unique way, so the conversion of single and double bonds to aromatic ones cannot be avoided if the molecule is to be represented uniquely.


Other "routine" standardization processes can be skipped if a database follows some predefined rules (for nitro group representation, for example), but the aromatic conversion cannot be avoided, because no such predefined rule exists for single and double bond representation.
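As a rough sketch of what such a predefined rule could look like in a duplicate check, the Python snippet below rewrites charge-separated nitro groups to the pentavalent form before comparing strings. Everything here is an assumption for illustration: the function names are hypothetical, the choice of the pentavalent form as the house rule is arbitrary, and plain text replacement on SMILES is only a stand-in for a real graph-based standardization step.

```python
# Hypothetical house rule: store nitro groups in the pentavalent form.
# Plain text replacement is a crude stand-in for a real standardizer.
NITRO_CHARGE_SEPARATED = "[N+](=O)[O-]"
NITRO_PENTAVALENT = "N(=O)=O"


def apply_house_rule(smiles):
    """Rewrite charge-separated nitro groups to the pentavalent form."""
    return smiles.replace(NITRO_CHARGE_SEPARATED, NITRO_PENTAVALENT)


def find_duplicates(smiles_list):
    """Return (first_seen, duplicate) pairs after applying the house rule."""
    seen = {}
    duplicates = []
    for smiles in smiles_list:
        key = apply_house_rule(smiles)
        if key in seen:
            duplicates.append((seen[key], smiles))
        else:
            seen[key] = smiles
    return duplicates


# Nitromethane written in both conventions, plus ethanol.
collection = ["C[N+](=O)[O-]", "CN(=O)=O", "CCO"]
print(find_duplicates(collection))
```

Without the rule, the two nitromethane strings would be treated as different molecules, which is the duplicate-detection issue raised earlier in the thread.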


I hope it is clearer now how the ChemAxon democracy works for molecules.


All the best


Andras