Disconnected structures not reconnecting with Standardizer

User 7f33ec9a5c

16-03-2015 19:52:12

Hi Guys,

We are having trouble with the Standardizer not correcting disconnected structures containing CP(O)(O)(O)O groups. Incidentally, Marvin seems to generate disconnected structures for molecules with these groups, so I am not really sure what is going on.

--this behaves as expected, the ring closures across the smiles are recognised and the disconnected SMILES are corrected
SELECT jcf.standardize('c1cc2.c2cc1', 'sep=~ config:removeexplicitH..aromatize~outFormat:smiles:u') FROM dual; ---- returns c1ccccc1

-- this does not behave as expected, why are the disconnected SMILES preserved?
select jcf.standardize('CP123OCCO1.C(CO2)O3', 'sep=~ config:removeexplicitH..aromatize~outFormat:smiles:u') from dual

-- returns CP123OCCO1.C(CO2)O3, which is the same smiles Marvin generates for the drawn structure.

I am really not sure what is going on here.

Any guidance would be appreciated.

~mike richards

User 7f33ec9a5c

16-03-2015 20:01:56

-- and this is just really weird

select jcf.standardize('CP12(OCCO2)OCCO1', 'sep=~ config:removeexplicitH..aromatize~outFormat:smiles:u') from dual

-- returns CP123OCCO1.C(CO2)O3

ChemAxon 25dcd765a3

18-03-2015 08:06:20

It is not totally clear to me what the problem is.

CP123OCCO1.C(CO2)O3

According to the SMILES specification, this is a valid SMILES string, and it is not disconnected.

A similar simplified example is

C1.C1

which is also not disconnected: the ring-closure label 1 bonds the two carbons across the dot.
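To make the point concrete, here is a minimal Python sketch that counts connected components in a SMILES string by treating ring-closure labels as bonds, including those that span a dot. It is only an illustration, not ChemAxon's or Daylight's code: the `count_components` name and the tokenizer are hypothetical, and it assumes a simplified SMILES subset (organic-subset and bracket atoms, branches, ring closures, dots).

```python
import re

# Tokenizer for a simplified SMILES subset (an assumption of this sketch):
# bracket atoms, organic-subset atoms, ring-closure labels, bond symbols,
# branch parentheses, and the dot.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<bond>[-=#:/\\])"
    r"|(?P<open>\()"
    r"|(?P<close>\))"
    r"|(?P<dot>\.)"
)


def count_components(smiles):
    """Return the number of connected components described by `smiles`."""
    parent = []  # union-find forest, one entry per atom

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        parent[find(a)] = find(b)

    prev = None      # atom that the next atom will bond to
    stack = []       # atoms to return to when a branch closes
    open_rings = {}  # ring label -> atom that opened it
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError("unsupported token at position %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == "atom":
            parent.append(len(parent))
            idx = len(parent) - 1
            if prev is not None:
                union(prev, idx)
            prev = idx
        elif kind == "ring":
            if prev is None:
                raise ValueError("ring label without a preceding atom")
            label = int(m.group().lstrip("%"))
            if label in open_rings:
                # Closing a ring label creates a bond, even across a dot.
                union(open_rings.pop(label), prev)
            else:
                open_rings[label] = prev
        elif kind == "open":
            stack.append(prev)
        elif kind == "close":
            prev = stack.pop()
        elif kind == "dot":
            # A dot only means "no bond to the previous atom".
            prev = None
        # Bond symbols do not change connectivity, so they are skipped.
    if open_rings:
        raise ValueError("unclosed ring label(s)")
    return len({find(i) for i in range(len(parent))})


for s in ["C1.C1", "CP123OCCO1.C(CO2)O3", "c1cc2.c2cc1", "CC.CC"]:
    print(s, count_components(s))
```

Under these assumptions, the three dot-separated strings from this thread each come out as a single component, while a plain `CC.CC` comes out as two. The dot only suppresses the bond to the preceding atom; it does not by itself split the molecule.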

User 7f33ec9a5c

03-04-2015 17:57:15

First, please excuse my poor wording. Where I wrote "Disconnected" I meant "Dot Separated".

To your comment "I really don't see the issue....", well that really makes me miss Daylight, and makes me wonder why I even would need to explain something like this.

SMILES is and should be a human-readable notation for chemical structures. I would hope that your standardization algorithm makes every attempt to make SMILES easier to read, not harder to read, when canonicalizing them. I've shown several people your dot-separated "standardized" version of the SMILES, and it has confused everyone into thinking they were looking at 2 disconnected structures, until they carefully interpreted the SMILES.

I completely understand that C1.C1 is a "valid" SMILES, but I do hope that your standardizer would choose to canonicalize it to CC, as it chooses to canonicalize c1cc2.c2cc1 to c1ccccc1. To generalize that rule, I would hope that dot-separation would be reserved for disconnected structures, period.

Perhaps this is just my personal preference, but I feel that having a standardization routine that inserts a dot in a single connected structure is a bug. I can't see any possible reason for representing a single connected structure with a dot.

~mike

In Dave's words: "Perverse, but correct."

ChemAxon 25dcd765a3

08-04-2015 14:47:33

Hi,

I get your point. So, the problem is not the correctness but the readability.

It is not the standardization, but the SMILES canonicalization routine which generates this SMILES string.

It is interesting to see that there are people reading SMILES without any help from a visualization program (cool guys!). I understand your point about readability, but I actually believe that a tool which visualizes the molecule is a better solution.

At first sight, this does not look like a simple and obvious problem to fix, so I cannot promise anything about a fix. It has minor priority on our side.

User 7f33ec9a5c

09-04-2015 18:55:22

Volfi,

On the human-readability of SMILES:

I suspect you share my experience that in dealing with large databases of chemical structures the 1% of "junk" structures in the database routinely crash/confound the tools, and by far the fastest way to decipher what is going on is just to take a look at the SMILES directly.

Our end-users work with our very nice, graphical visualization tools that use your depiction engine, but for us, the informatics team, we routinely interpret smiles directly, especially when troubleshooting.

For these cases, it would be preferable if ChemAxon did its best to adhere to the canonicalization standard used by Daylight, which did seem to move structures towards a simplified and human-readable form upon canonicalization.

In the meantime, we'll just pay a bit more attention to exactly what we are reading, and not imply anything from first glance.

Thank you,
~mike

ChemAxon 25dcd765a3

13-04-2015 11:39:12

Hi,

Thank you for your comments. This fits our observation that SMILES are read directly only occasionally, and only by small teams. I understand that in these cases readability has a high impact.

Regarding the canonicalization standard of Daylight, could you please point me to the documentation? I know that Daylight (the inventor of SMILES) uses canonicalization, but as far as I know the canonicalization algorithm they use is not public. I see that even the previously working depictmatch and canonical SMILES/SMARTS generator (cansmi?) have been removed from their site.

I feel that you just happened to bump into a structure for which the generated SMILES is hard to read, and I think this is not a usual case.