User 677b9c22ff
21-11-2008 00:29:59
Hello,
I have 16 molecules: 8 generated from one SMILES string and 8 from
another, using only Marvin. They are actually only 8 distinct
stereoisomers, but the canonical SMILES generator (molconvert smiles:u)
does not work in this case.
This is Marvin 5.1.2 on Win32 with Java 1.6. As I recall, a fix is
planned for version 5.2.
I am not obsessed with this, but our work covers several million, if not
billions, of structures, and at that scale such issues are hard problems.
One reason is that GC-MS can very easily distinguish between
stereoisomers. Starting from the in-silico side and matching properties
against the experimental side is therefore one of the successful
approaches, and it relies heavily on correct stereoisomer generation and
unique canonical structures.
See the two attached SDF files.
To be clear, the starting structures are different, but the
stereoisomers generated for each molecule must be the same 8 isomers.
Z:\>cxcalc stereoisomers "[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C" > VOC1.sdf
Z:\>cxcalc stereoisomers "C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1" > VOC2.sdf
Code:
Z:\>molconvert smiles voc1.sdf
[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C
[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C
[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C
[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C
[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C
[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C
[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C
[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C
Z:\>molconvert smiles voc2.sdf
C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1
C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1
C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1
C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1
C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1
C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1
C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1
C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1
Z:\>
After sorting and removing duplicates I get 16 unique compounds (wrong):
Code:
[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C
[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C
[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C
[H][C@@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C
[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C
[H][C@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C
[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C/CCC2=C
[H][C@]12CC(C)(C)[C@]1([H])CC\C(C)=C\CCC2=C
C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1
C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1
C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1
C\C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1
C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1
C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC1
C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC1
C\C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC1
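The sort-and-deduplicate step can be sketched in Python as follows. This is only an illustration of the counting, not anything Marvin or OpenBabel does internally; `unique_by` and its `key` parameter are hypothetical names. With the default identity key the comparison is purely textual, so only exact repeats are merged and every distinct spelling counts as its own compound. Collapsing spellings would need a key that maps each molecule to a single canonical string, which is exactly the job the canonical SMILES generator is supposed to do.

```python
from typing import Callable, Iterable, List

# A few of the strings listed above, plus one exact repeat of the first.
lines = [
    r"[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C",
    r"[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C\CCC2=C",
    r"C\C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1",
    r"C\C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC1",
    r"[H][C@@]12CC(C)(C)[C@@]1([H])CC\C(C)=C/CCC2=C",  # exact repeat
]


def unique_by(smiles: Iterable[str],
              key: Callable[[str], str] = lambda s: s) -> List[str]:
    """Return one representative per distinct key, sorted by key.

    `key` is a hypothetical hook: the identity (default) compares raw
    text; a real canonicalizer from a toolkit could be passed instead.
    """
    seen = {}
    for s in smiles:
        # Keep the first string seen for each key value.
        seen.setdefault(key(s), s)
    return [seen[k] for k in sorted(seen)]


unique = unique_by(lines)
print(len(unique))  # 4: only the exact repeat was merged
```

The count therefore says how many distinct strings there are, not how many distinct molecules, unless the key is a true canonical form.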
If I run them through OpenBabel, canonical SMILES generation followed by duplicate removal gives 8 (correct):
Code:
C/C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC\1
C/C1=C/CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC\1
C/C1=C/CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC\1
C/C1=C/CCC(=C)[C@H]2CC(C)(C)[C@H]2CC\1
C/C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@@H]2CC\1
C/C1=C\CCC(=C)[C@@H]2CC(C)(C)[C@H]2CC\1
C/C1=C\CCC(=C)[C@H]2CC(C)(C)[C@@H]2CC\1
C/C1=C\CCC(=C)[C@H]2CC(C)(C)[C@H]2CC\1
The SMILES spec says:
Quote:
Unique SMILES (Definition from ChemAxon Help).

The "unique" name can be sometimes misleading when dealing with compounds with stereo centres. The SMILES specification (3.1. SMILES Specification Rules) defines generic, unique, isomeric and absolute SMILES as:
1. generic SMILES: representing a molecule (there can be many different representations)
2. unique SMILES: generated from generic SMILES by a certain algorithm [1]
3. isomeric SMILES: string with information about isotopism, configuration around double bonds and chirality
4. absolute SMILES: unique SMILES with isomeric information - in Marvin during graph canonicalization the isomeric information is also considered as an atom invariant

The name canonical SMILES is used for absolute or unique SMILES depending on whether the string contains isomeric information or not (both strings are "canonicalized" where the atom/bond order is unambiguous).

Marvin generates always canonical SMILES with isomerism info if it is possible to find out from the input file. The molecule graph is always canonicalized using the algorithm in article [1] but it is not guaranteed to give absolute SMILES for all isomeric structures. With option u currently we are using an approximation to make the SMILES string as absolute (unique for isomeric structures) as possible. For correct exact (perfect) structure searching MolSearch and JChemSearch classes of JChem Base or the jc_equals SQL operator of the JChem Cartridge are suggested.

The initial ranks of atoms for the canonicalization are calculated using the following atom invariants:
1. number of connections
2. sum of non-H bond orders (single=1, double=2, triple=3, aromatic=1.5, any=0)
3. atomic number (list=110, any atom=112)
4. sign of charge: 0 for nonnegative, 1 for negative charge
5. formal charge
6. number of attached hydrogens
7. isotope mass number
See ref. [1] for details.

With option u it is possible to include chirality into graph invariants. This option must be used with care since for molecules with numerous chirality centres the canonicalization can be very CPU demanding [2].

Not supported SMILES features:
* Branch specified if there is no atom to the left.
* General chiral specification: Allene like, Square-planar, Trigonal-bipyramidal, Octahedral.

[1] SMILES 2. Algorithm for Generation of Unique SMILES Notation; D. Weininger, A. Weininger, J. L. Weininger; J. Chem. Inf. Comput. Sci. 1989, 29, 97-101
[2] A New Effective Algorithm for the Unambiguous Identification of the Stereochemical Characteristics of Compounds During Their Registration in Databases; T. Cieplak and J.L. Wisniewski; Molecules 2001, 6, 915-926

™: SMILES, SMARTS, and SMIRKS are trademarks of Daylight Chemical Information Systems.
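To illustrate the seven atom invariants listed in the quote, here is a minimal Python sketch of an initial ranking. It is not Marvin's implementation: the `Atom` record, the `initial_ranks` function, the use of a lexicographically compared tuple, and counting only heavy-atom neighbours as "connections" are all assumptions made for illustration. The point it shows is that none of the seven invariants carries stereo information, so two centres of opposite chirality start with the same rank unless chirality is added as an extra invariant (as option u is described to do).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record; not a Marvin class."""
    connections: int        # 1. number of connections (heavy atoms, assumed)
    bond_order_sum: float   # 2. sum of non-H bond orders
    atomic_number: int      # 3. atomic number
    formal_charge: int = 0  # 5. formal charge (4. sign is derived below)
    attached_h: int = 0     # 6. number of attached hydrogens
    isotope: int = 0        # 7. isotope mass number, 0 if unspecified

    def invariant(self) -> Tuple:
        # Same order as the list in the quoted help text.
        charge_sign = 1 if self.formal_charge < 0 else 0
        return (self.connections, self.bond_order_sum, self.atomic_number,
                charge_sign, self.formal_charge, self.attached_h,
                self.isotope)


def initial_ranks(atoms: List[Atom]) -> List[int]:
    """Dense 1-based ranks of the invariant tuples; ties share a rank."""
    ordered = sorted({a.invariant() for a in atoms})
    rank_of = {inv: i + 1 for i, inv in enumerate(ordered)}
    return [rank_of[a.invariant()] for a in atoms]


# Heavy atoms of ethanol: CH3, CH2, OH.
ethanol = [Atom(1, 1, 6, attached_h=3),
           Atom(2, 2, 6, attached_h=2),
           Atom(1, 1, 8, attached_h=1)]
ranks = initial_ranks(ethanol)  # [1, 3, 2]

# Two ring-fusion CH carbons that differ only in chirality (@ vs @@):
# their invariant tuples are identical, so they start tied.
stereo_pair = [Atom(3, 3, 6, attached_h=1), Atom(3, 3, 6, attached_h=1)]
tied = initial_ranks(stereo_pair)  # [1, 1]
```

In the algorithm of ref. [1] such initial ranks are then refined iteratively; that refinement is not shown here.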
So here we have 8 unique names, but 16 different unique SMILES.
Now, whether this is a bug (which I believe) or not (because it is not
supported, which wouldn't make sense in this case), my
question is:
Should I trust the unique Names or the unique SMILES?
What can I do (at the API level or with Standardizer) to always generate
unique structures or stereoisomers? Should I avoid SMILES?
Is that the reason why cxcalc switched to SDF output and no longer
outputs SMILES?
Thank you.
Tobias