Canonical smiles converter problems? - ChemAxon Forum Archive

User 677b9c22ff

21-11-2008 00:29:59

Hello,

I have 16 molecules: eight generated from one SMILES string and

eight generated from another, using only Marvin.

They are actually only 8 stereoisomers, but the canonical SMILES

generator (molconvert smiles:u) doesn't work in this case.

This is Marvin 5.1.2 with WIN32 and JAVA1.6. I remember there

will be a fix in version 5.2.

I am not obsessed with this, but in our work, which involves

several million if not billions of structures, such issues are hard problems.

One reason is that GC-MS can very easily distinguish

between stereoisomers, so coming from the in-silico side and

matching properties with the experimental side is one of the successful

approaches, and it relies heavily on correct stereoisomer generation and

unique canonical structures.

See attached two SDF files.

For the sake of clarity the starter structures are different but the

stereoisomers for each of the molecules must be the same 8 isomers.

Z:\>cxcalc stereoisomers "[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C" > VOC1.sdf

Z:\>cxcalc stereoisomers "C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1" > VOC2.sdf

Code:

Z:\>molconvert smiles voc1.sdf

[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C

[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C

[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C

[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C

[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C

[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C

[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C

[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C

Z:\>molconvert smiles voc2.sdf

C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1

C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1

C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1

C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1

C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1

C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1

C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1

C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1

Z:\>

After sorting and deleting duplicates I get 16 unique compounds (wrong):

Code:

[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C

[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C

[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C

[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C

[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C

[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C

[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C

[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C

C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1

C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1

C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1

C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1

C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1

C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1

C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1

C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1
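
The sort-and-deduplicate step above can be reproduced with plain string handling. The following is a minimal sketch, not Marvin code: the two templates are assumptions that simply regenerate the two families of strings printed above by varying the two chirality marks and the double-bond direction mark. It only shows why text-level deduplication cannot merge the two families: every string from the first file starts with an explicit [H], and no string from the second file does.

```python
from itertools import product

# Templates that regenerate the two families of strings printed above.
# Only the two chirality marks (a, b) and the bond direction mark (d) vary.
EXPLICIT_H = r"[H][C{a}]12CC(C)(C)[C{b}]1([H])CC\C(C)=C{d}CCC2=C"
IMPLICIT_H = r"C\C1=C{d}CCC(=C)[C{a}H]2CC(C)(C)[C{b}H]2CC1"


def family(template):
    """Return the 8 strings obtained by varying both chirality marks and the bond mark."""
    return [
        template.format(a=a, b=b, d=d)
        for a, b, d in product(("@", "@@"), ("@", "@@"), ("\\", "/"))
    ]


voc1 = family(EXPLICIT_H)   # strings as written for voc1.sdf
voc2 = family(IMPLICIT_H)   # strings as written for voc2.sdf

# Plain text deduplication: identical strings collapse, nothing else does.
unique = sorted(set(voc1 + voc2))
shared = set(voc1) & set(voc2)

print(len(unique))   # number of distinct strings
print(len(shared))   # number of strings common to both files
```

Because the comparison is purely textual, the count stays at 16 regardless of whether the underlying molecules are the same.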

If I put them into OpenBabel, then after canonical SMILES generation and deleting duplicates I get 8 (correct):

Code:

C/C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC\1

C/C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC\1

C/C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC\1

C/C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC\1

C/C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC\1

C/C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC\1

C/C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC\1

C/C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC\1

The SMILES spec says:

Quote:

Unique SMILES (Definition from ChemAxon Help).

The "unique" name can be sometimes misleading when dealing with compounds with stereo centres. The SMILES specification (3.1. SMILES Specification Rules) defines generic, unique, isomeric and absolute SMILES as:

1. generic SMILES: representing a molecule (there can be many different representations)

2. unique SMILES: generated from generic SMILES by a certain algorithm [1]

3. isomeric SMILES: string with information about isotopism, configuration around double bonds and chirality

4. absolute SMILES: unique SMILES with isomeric information - in Marvin during graph canonicalization the isomeric information is also considered as an atom invariant

The name canonical SMILES is used for absolute or unique SMILES depending on whether the string contains isomeric information or not (both strings are "canonicalized", where the atom/bond order is unambiguous). Marvin always generates canonical SMILES with isomerism info if it can be determined from the input file. The molecule graph is always canonicalized using the algorithm in article [1], but it is not guaranteed to give absolute SMILES for all isomeric structures. With option u we currently use an approximation to make the SMILES string as absolute (unique for isomeric structures) as possible. For correct exact (perfect) structure searching, the MolSearch and JChemSearch classes of JChem Base or the jc_equals SQL operator of the JChem Cartridge are suggested.

The initial ranks of atoms for the canonicalization are calculated using the following atom invariants:

1. number of connections

2. sum of non-H bond orders (single=1, double=2, triple=3, aromatic=1.5, any=0)

3. atomic number (list=110, any atom=112)

4. sign of charge: 0 for nonnegative, 1 for negative charge

5. formal charge

6. number of attached hydrogens

7. isotope mass number

See ref. [1] for details. With option u it is possible to include chirality into graph invariants. This option must be used with care since for molecules with numerous chirality centres the canonicalization can be very CPU demanding [2].

Not supported SMILES features:

* Branch specified if there is no atom to the left.

* General chiral specification: Allene like, Square-planar, Trigonal-bipyramidal, Octahedral.

[1] SMILES 2. Algorithm for Generation of Unique SMILES Notation; D. Weininger, A. Weininger, J. L. Weininger; J. Chem. Inf. Comput. Sci. 1989, 29, 97-101

[2] A New Effective Algorithm for the Unambiguous Identification of the Stereochemical Characteristics of Compounds During Their Registration in Databases; T. Cieplak and J.L. Wisniewski; Molecules 2001, 6, 915-926

™: SMILES, SMARTS, and SMIRKS are trademarks of Daylight Chemical Information Systems.
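
The list of atom invariants in the quoted help text can be illustrated with a small sketch. This is not Marvin's implementation: the Atom class, its field names, and the choice to compare the invariants as a tuple in the listed order are all assumptions made for illustration, and "number of connections" is taken here to mean heavy-atom neighbours. The atoms are entered by hand rather than parsed from SMILES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hand-entered atom description; all field names are illustrative."""
    connections: int        # 1. number of connections (heavy-atom neighbours assumed)
    bond_order_sum: float   # 2. sum of non-H bond orders
    atomic_number: int      # 3. atomic number
    charge: int             # 4./5. formal charge (the sign is derived below)
    hydrogens: int          # 6. number of attached hydrogens
    isotope: int = 0        # 7. isotope mass number (0 = unspecified)


def invariant(atom):
    """Build the invariant tuple in the order listed in the quoted help text."""
    charge_sign = 1 if atom.charge < 0 else 0
    return (atom.connections, atom.bond_order_sum, atom.atomic_number,
            charge_sign, atom.charge, atom.hydrogens, atom.isotope)


def initial_ranks(atoms):
    """Rank atoms by invariant tuple; atoms with equal tuples share a rank."""
    distinct = sorted({invariant(a) for a in atoms})
    rank_of = {inv: i + 1 for i, inv in enumerate(distinct)}
    return [rank_of[invariant(a)] for a in atoms]


# Heavy atoms of ethanol, CH3-CH2-OH, in that order.
ethanol = [
    Atom(connections=1, bond_order_sum=1, atomic_number=6, charge=0, hydrogens=3),
    Atom(connections=2, bond_order_sum=2, atomic_number=6, charge=0, hydrogens=2),
    Atom(connections=1, bond_order_sum=1, atomic_number=8, charge=0, hydrogens=1),
]
print(initial_ranks(ethanol))
```

None of the seven invariants encodes chirality, so two stereoisomers start from identical initial ranks in this sketch; the quoted text says that option u is what brings chirality into the graph invariants.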

So here we have 8 unique names, but 16 different unique SMILES.

Now whether this is a bug (which I believe) or not (because it is not

supported, which wouldn't make sense in this case), my

question is:

Should I trust the unique Names or the unique SMILES?

What can I do (on the API level or with Standardizer) to always generate

unique unique structures or stereoisomers? Should I avoid SMILES?

Is that the reason why cxcalc switched to SDF output and no longer

produces SMILES output?

Thank you.

Tobias


ChemAxon 25dcd765a3

21-11-2008 10:57:43

Dear Tobias,

The root of the problem is the following:

You have two molecules which are identical in the chemical sense but not identical in representation.

The difference between the two representations is the handling of hydrogen atoms. In one structure the hydrogen atoms are represented in implicit form, whereas in the other some hydrogens are written explicitly as hydrogen atoms.

As you started from different representations, your results will differ in representation.

As far as I see this is your main problem.

The unique smiles generation will keep your representation as it was.

But you can convert your explicit Hydrogen representation to implicit one using the -H option.

Quote:

So here we have 8 unique names, but 16 different unique SMILES.

We have 8 unique molecules with 8 unique names, as the name of a molecule does not depend on the representation of hydrogen atoms, but we have 8 different molecules in two different hydrogen representations, which gives 16 different SMILES strings.

So I would suggest starting from the same hydrogen representation of the molecules by converting them to implicit form.

I have doubts about whether unique SMILES generation should convert all hydrogen atoms to implicit form by default, but I can be convinced.

Andras

User 677b9c22ff

22-11-2008 00:06:18

Hi Andras,

I want to be on the safe side for this issue, that's why I am asking.

OB correctly creates 8 unique SMILES, so it is in general possible.

But it's an annoyance to switch applications, especially during API

code use. See also Forum - how can I compare two molecules?

I looked further into that issue and I thought the smiles:u switch

creates unique unique SMILES. I am and was always suspicious

of canonical SMILES. It's a mess. Especially if they come from

different sources.

So here comes the discussion and hopefully convincing part :-)

So I used the code from Robert Wagner and Tamas Csizmazia,

regarding duplicate removal and I am aware of Instant-JChem.

I will show with 2 examples that this is a delicate issue.

I used the code from the API search examples. Changed code is also attached.

First, all 30k maltotriose isomers (see attachment). The problem here is that the calculation of

duplicate structures takes several hours. I let it run overnight and it did

not finish. The same happens if you import the 30k molecules into Instant-JChem

with exact search and stereo on: it will take several hours. A comparison

on non-stereo containing molecules is finished in seconds. In order to get

a result at all I had to limit it to 1000 molecules.

Time for the unique SMILES cleaning is 2 seconds. For the hash-code

with stereoisomer matching it is 800 seconds. So if I have the larger file

it needs to do 30k*30k/2 (4.5E+8) pairwise comparisons. Never finishes.

Even 800 seconds for 1000 molecules is not appropriate, therefore

everybody would agree to use the fastest unique SMILES algorithm for large molecule sets.

Please, use the file maltotriose-stereoisomers.zip (the large one) and test by yourself,

another argument for concurrency.
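
The pair count quoted above can be checked with a few lines of arithmetic. The extrapolation to the full file is only a rough sketch: it assumes that "30k" means exactly 30000 structures, that the run time is dominated by pairwise comparisons, and that the roughly 800 seconds measured for 1000 molecules scale linearly with the number of pairs.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


measured_n = 1000        # subset size used in the run
measured_seconds = 800   # approximate hash + stereo time reported for that subset
full_n = 30_000          # "30k" taken as exactly 30000 (assumption)

# Assumption: every pair costs the same, so time scales with the pair count.
seconds_per_pair = measured_seconds / pair_count(measured_n)
estimated_hours = seconds_per_pair * pair_count(full_n) / 3600

print(pair_count(full_n))       # 449985000, i.e. about 4.5E+8
print(round(estimated_hours))   # rough estimate in hours
```

Under these assumptions the full file would need on the order of 200 hours, which is consistent with an overnight run not finishing.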

Code:

Reading molecules from maltotriose-stereoisomers-1000.smi.

Imported 1000 structures.

Searching for duplicates with perfect search.

Matching IDs

Found 0 duplicates in 736063 milliseconds

Searching for duplicates based on unique smiles string comparison.

Matching IDs

Found 0 duplicates in 2125 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent stereo searching

Matching IDs

Found 0 duplicates in 807219 milliseconds

Now with the 16 unique SMILES generated by Marvin, which are actually

only 8 unique stereoisomers. Used molconvert smiles:u VOC1andVOC2.sdf > VOC1andVOC2.smi for that.

(I know about the -H issue, but it's the unique switch, isn't it?)

NOW first with the SDF file, the unique SMILES option fails.

So a unique SMILES generator which requires trickery (such as cleaning

and applying additional switches) is not a unique SMILES generator

(a strong argument, but debatable). VOC1andVOC2.sdf

Code:

Reading molecules VOC1andVOC2.sdf.

Imported 16 structures.

Searching for duplicates with perfect search.

Matching IDs

1 13

2 14

3 15

4 16

5 9

6 10

7 11

8 12

Found 8 duplicates in 657 milliseconds

Searching for duplicates based on unique smiles string comparison.

Matching IDs

Found 0 duplicates in 140 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent stereo searching

Matching IDs

1 13

2 14

3 15

4 16

5 9

6 10

7 11

8 12

Found 8 duplicates in 297 milliseconds

NOW with SMILES VOC1andVOC2.smi

Code:

Reading molecules VOC1andVOC2.smi.

Imported 16 structures.

Searching for duplicates with perfect search.

Matching IDs

1 13

2 14

3 15

4 16

5 9

6 10

7 11

8 12

Found 8 duplicates in 609 milliseconds

Searching for duplicates based on unique smiles string comparison.

Matching IDs

Found 0 duplicates in 79 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent stereo searching

Matching IDs

1 13

2 14

3 15

4 16

5 9

6 10

7 11

8 12

Found 8 duplicates in 546 milliseconds

Now suddenly everybody would agree, yes, let's take the safe side,

using the hash and if they are different lets use exact stereo match,

because in both cases with SDF and SMILES import the unique filter

fails.

The only problem is that the last option searchForDuplicatesHash

takes hours with maltotriose-stereoisomers.smi

There is the dilemma. So yes the problem can be solved

with Z:\>molconvert smiles:u-H VOC1andVOC2.sdf but why

is the unique switch not unique and why is the API example failing

(missing -H option).

Cheers

Tobias

ChemAxon 25dcd765a3

24-11-2008 18:17:55

Hi Tobias,

Thank you for your comment, I will discuss it with my colleagues.

Andras

User 677b9c22ff

24-11-2008 23:23:01

Hi,

besides the multiple options for smiles:u+H-a or smiles:u-H+a

there is also no guarantee that the same number of unique compounds

is obtained compared to other matching methods. The above

VOC example would result in all the same results if smiles:u-H+a

would be used.

However switching to another dataset now suddenly gives totally different

results again (even if stereochemistry matching is OFF). As an example

I selected some 9787 substances from the NCI2000 dataset.

Results between unique SMILES and hash+stereo are different too.

268 duplicates for unique SMILES (tested all options +H-H+a-a)

293 duplicates for hash with stereomatch

I am aware that there are many matching methods but these

are the implications one has with unique SMILES. See attached

ZIP file. JAVA code is the same from above.

Code:

Reading molecules from Z:NCI-10000.smi.

Imported 9787 structures.

Searching for duplicates with perfect search.

Matching IDs

Found 293 duplicates in 253766 milliseconds

Searching for duplicates based on unique smiles string comparison.

Matching IDs

Found 268 duplicates in 4141 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent stereo searching

Matching IDs

Found 293 duplicates in 500 milliseconds

Tobias

ChemAxon 42004978e8

25-11-2008 16:55:31

Hallo Tobias,

The slow performance of the hash code duplicate filtering is caused by the nature of hash code calculation.

Hash codes are used for exact/perfect searching, where stereo, isotope and radical matching can be regulated by switches and can be switched off completely. So in order not to miss hits in these cases, this information is not considered in the hash code calculation.

In your earlier case there was a large DB with structures only differing in their stereo values. In this case hash code doesn't distinguish these values, and the distinction must be done by the search engine (see the code) which takes a longer time than duplicate filtering based on unique smiles.
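
The behaviour described above can be sketched with a generic bucket-then-verify scheme. This is not the JChem code: the function name, the toy records and the way comparisons are counted are illustrative assumptions. Items are first grouped by a coarse key that ignores stereo information, and the expensive exact comparison only runs inside each group.

```python
from collections import defaultdict


def find_duplicates(items, coarse_key, exact_equal):
    """Bucket items by a coarse key, then compare exactly inside each bucket.

    Returns (indices of duplicate entries, number of exact comparisons made).
    """
    buckets = defaultdict(list)
    for index, item in enumerate(items):
        buckets[coarse_key(item)].append(index)

    duplicates, comparisons = [], 0
    for members in buckets.values():
        kept = []  # first occurrence of each distinct item in this bucket
        for index in members:
            for earlier in kept:
                comparisons += 1
                if exact_equal(items[earlier], items[index]):
                    duplicates.append(index)
                    break
            else:
                kept.append(index)
    return duplicates, comparisons


def coarse(record):
    """Coarse key: constitution only, stereo label ignored."""
    return record[0]


def exact(a, b):
    """Stand-in for the expensive exact (stereo-aware) comparison."""
    return a == b


# Toy records: (constitution, stereo label).
all_isomers = [("sugar", i) for i in range(100)]   # one constitution, 100 isomers
mixed = [(f"mol{i}", 0) for i in range(100)]        # 100 different constitutions

print(find_duplicates(all_isomers, coarse, exact))  # one bucket, quadratic work
print(find_duplicates(mixed, coarse, exact))        # 100 buckets, no comparisons
```

When every structure shares the same coarse key, as in a file of pure stereoisomers, the bucket step saves nothing and the work grows with the number of pairs; when the keys differ, almost no exact comparisons are needed.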

Regarding your recent question:

Thanks for this example because it revealed that the duplicate filtering example code calculates different things in case of unique smiles filtering and hash code based filtering.

In case of hash code filtering every code is compared to every other (not twice, but every combination is checked.)

In case of unique smiles we put the Strings one-by-one in the TreeSet if they aren't present there yet, if they are there we report that there is a duplicate.

In your example there are several (>2) structures which are the same. The hash code lists all the pairs of them, while the unique smiles code lists only once every duplicated structure.

e.g. if 1,2,3,4 structures are the same this is:

1-2, 1-3, 1-4, 2-3, 2-4, 3-4 vs 2,3,4

which is 6 pairs vs. 3 duplicates.

That's why hash code filtering recognizes 293 duplicates - actually pairs, while unique smiles based comparison counts 268 duplicates - which have already an equivalent.
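
The two ways of counting can be sketched as follows. The function names are illustrative and the code is not the duplicate filtering example itself; it only shows how an all-against-all comparison and a set-insertion loop report different numbers for the same data.

```python
from itertools import combinations


def count_matching_pairs(keys):
    """Count every pair of equal entries (all-against-all comparison)."""
    return sum(1 for a, b in combinations(keys, 2) if a == b)


def count_repeat_entries(keys):
    """Count entries whose key was already seen (set-insertion style)."""
    seen, repeats = set(), 0
    for key in keys:
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats


same_four = ["A", "A", "A", "A"]          # structures 1, 2, 3, 4 are identical
print(count_matching_pairs(same_four))    # 6
print(count_repeat_entries(same_four))    # 3
```

The two counts agree only when no structure occurs more than twice; any group of three or more identical structures makes the pair count larger.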

I attached a modification of the cycle using hash code calculation.

Of course the loop with unique smiles can also be changed to calculate the pairs, depending on what you are interested in.

In this latter example you could see that comparison based on hash codes can generally be much faster than that based on unique smiles.

Robert

User 677b9c22ff

25-11-2008 21:14:13

Hi Robert,

thanks for looking into that and checking those issues.

However they do not solve the problem in the first place.

First the code with the NCI-10000.smi works now

and as you said, also faster. That was actually a minor problem :-)

Code:

Reading molecules.

Imported 9787 structures.

Searching for duplicates based on smiles string comparison.

Matching IDs

Found 268 duplicates in 3719 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent searching

Matching IDs

Found 268 duplicates in 1594 milliseconds

Now the mentioned VOC problem: it cannot be solved with the unique

SMILES, but it can with HASH+Stereo.

Code:

Reading molecules VOC1andVOC2.smi

Imported 16 structures.

Searching for duplicates based on smiles string comparison.

Matching IDs

Found 0 duplicates in 156 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent searching

Matching IDs

Not found with unique SMILES:9 5

Not found with unique SMILES:10 6

Not found with unique SMILES:11 7

Not found with unique SMILES:12 8

Not found with unique SMILES:13 1

Not found with unique SMILES:14 2

Not found with unique SMILES:15 3

Not found with unique SMILES:16 4

Found 8 duplicates in 485 milliseconds

By adding smiles:u-H (removing explicit H from unique(!) SMILES)

that can be solved, as seen below. And that is what the discussion was all about:

why are the unique SMILES not unique? You also mentioned the speed

issue in the case of maltotriose-stereoisomers.smi, so that will always

take longer, but in principle it can be solved very fast with unique SMILES,

if they actually are unique (I am still not convinced, and I am only

covering stereoisomers here, not radicals or tautomers, which is one of the

reasons why the InChI code has different layers).

Code:

Reading molecules VOC1andVOC2.smi.

Imported 16 structures.

Searching for duplicates based on smiles string comparison.

Matching IDs

Found 8 duplicates in 156 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent searching

Matching IDs

Found 8 duplicates in 484 milliseconds

------------------------------------------------------------------------

But now comes the problematic part:

Using another random example cyclobutane CC(O)C1C(C(C)O)C(C(C)O)C1C(C)O from

Stereoisomer generation in computer-enhanced structure elucidation by

Marko Razinger, Krishnan Balasubramanian, Marko Perdih, Morton E. Munk

J. Chem. Inf. Comput. Sci., 1993, 33 (6), pp 812-825; DOI: 10.1021/ci00016a003

See attached file complex-cyclobuta-unique-smiles-molconvert.smi

Code:

Reading molecules complex-cyclobuta-unique-smiles-molconvert.smi

Imported 44 structures.

Searching for duplicates.

Matching IDs

3 14

3 18

14 18

23 32

40 42

40 44

42 44

Found 7 duplicates in 1735 milliseconds

Searching for duplicates based on smiles string comparison.

Matching IDs

Found 2 duplicates in 359 milliseconds

Searching for duplicates based on hash-code comparison

and subsequent searching

Matching IDs

Not found with unique SMILES:14 3

Not found with unique SMILES:32 23

Not found with unique SMILES:42 40

Found 5 duplicates in 828 milliseconds

Now the problematic part is that all three codes give different results.

7 dups searchForDuplicates(mols);

2 dups searchForDuplicatesUniqueSmiles(mols);

5 dups searchForDuplicatesHash(mols);

Problem A)

Different results for 3 unique searches, I actually don't know how

to solve that.

Problem B)

Why are the unique SMILES codes from molconvert and the API

different? See attached EXCEL file stereo-cyclobutanes.xls

It took me a long time to figure that out and the invoked switches

are "smiles:u-H" in case of the API and molconvert smiles:u-H

in case of molconvert. One would expect they are the same.

That means if I trust molconvert I get 44 unique SMILES.

If I trust the API I get 42 unique SMILES.

If I trust the publication I get 39 compounds

The demo version of MOLGEN also calculates 39 stereoisomers.

I am not going to check all the other weird compounds,

but I am currently in the house of horror with all those sugars,

so should I use the API or molconvert is the question...

Attached stereo-cyclobutanes.xls

Attached complex-cyclobuta-unique-smiles-API.smi

Attached complex-cyclobuta-unique-smiles-molconvert.smi

Cheers

Tobias

ChemAxon 42004978e8

26-11-2008 16:23:03

Hallo Tobias,

I answer you regarding the problems with the DuplicateSearch example.

There are 3 different results with the three methods for the complex-cyclobuta-unique-smiles-API.smi file.

Two of these results, those of the exhaustive duplicate search and the hash code based search, are actually the same. The difference in the numbers arises because, again, the first lists all pairs of duplicates, while the hash code based search lists only each newly found duplicate.

Now I modified both loops so that only the duplicate entries are listed and not all matching pairs. (see attached)

If you wish to see the matching pairs, you can remove the break commands from the loops.
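
The relation between the two counts can be checked against the matching IDs printed by the exhaustive search earlier in the thread. The sketch below is only an illustration with a hypothetical helper name: it merges the listed pairs into groups of identical structures and counts one duplicate for every group member beyond the first.

```python
def group_matches(pairs):
    """Merge matching ID pairs into groups of mutually identical structures."""
    groups = []
    for a, b in pairs:
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return sorted(sorted(g) for g in groups)


# Matching IDs as printed by the exhaustive duplicate search above.
listed_pairs = [(3, 14), (3, 18), (14, 18), (23, 32),
                (40, 42), (40, 44), (42, 44)]

groups = group_matches(listed_pairs)
duplicate_entries = sum(len(g) - 1 for g in groups)

print(groups)              # [[3, 14, 18], [23, 32], [40, 42, 44]]
print(len(listed_pairs))   # 7 matching pairs
print(duplicate_entries)   # 5 duplicate entries
```

This accounts for the 7 versus 5 reported by the exhaustive and hash code based searches. It does not account for the 2 duplicates reported by the unique SMILES comparison, which is a separate issue.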

Regarding the problems with unique smiles András will answer you.

Thanks,

Robert

ChemAxon 25dcd765a3

28-11-2008 09:20:43

Dear Tobias,

We have discussed your suggestion about unique SMILES generation and found it very useful. Starting with the forthcoming release, unique SMILES generation will remove plain hydrogen atoms from the molecule.

Andras

ChemAxon 25dcd765a3

01-12-2008 18:43:22

Dear Tobias,

We are also trying to fix the unique SMILES problems you found as soon as possible, but it seems to be a harder problem than I thought before.

Thank you for your suggestions, comments.

Andras

ChemAxon 25dcd765a3

03-12-2008 10:54:07

Dear Tobias,

The next Marvin release (Marvin 5.1.4) will implicitize plain hydrogens in case of unique SMILES export.

Andras