Smiles export question

User 6ef33138f9

29-06-2005 16:54:44

Hello, I have a simple question about SMILES export using the -H option to remove explicit hydrogens. It does not seem to work for any atoms that have atom map numbers. For example, if I import [C:1]C and export it using -H, I get [CH3:1]C, not [C:1]C as expected. A code sample is below.





Thanks,


Chris





Code:



      String smiles1 = "CC";
      Molecule marvinMolecule = MolImporter.importMol(smiles1);
      ByteArrayOutputStream os = new ByteArrayOutputStream();
      MolExporter exp = new MolExporter(os, "smiles:-H");
      exp.write(marvinMolecule);
      os.close();
      String smiles2 = os.toString().trim();
      assertEquals(smiles1, smiles2);  // OK

      smiles1 = "[C:1][C]";
      marvinMolecule = MolImporter.importMol(smiles1);
      os = new ByteArrayOutputStream();
      exp = new MolExporter(os, "smiles:-H");
      exp.write(marvinMolecule);
      os.close();
      smiles2 = os.toString().trim();
      assertEquals(smiles1, smiles2); // fails


ChemAxon 25dcd765a3

30-06-2005 16:39:10

Hi Chris!


In the SMILES string [CH3:1]C the 'H' means implicit H.


The molecule with explicit H would be:


[H][C:1]([H])([H])C([H])([H])[H]





But if you export to a SMARTS string, the molecule would look like [C:1]C, just as you wanted.





All the best


Andras

ChemAxon 25dcd765a3

30-06-2005 19:02:47

One more thing I have just seen.


This is not a valid SMILES: "[C:1][C]"

You may have been thinking of one of these valid SMILES: "[CH3:1][CH3]" or "[CH3:1]C"

or of your original string imported as SMARTS: "[C:1][C]"





In the latter case you have to specify in the MolImporter constructor that you want to read the string as SMARTS, or set the "smarts" option on the MolImporter.


Code:
MolImporter.setOptions("smarts");









(The first string is a valid SMILES: "CC")











All the best


Andras

User 6ef33138f9

30-06-2005 21:05:18

Thanks, Andras. I understand now that the atom map syntax is SMARTS only, not SMILES. For some reason I had thought it was supported by SMILES.





I'm still confused about the expected behavior with the different import and export options, though. I tried various tests importing "C[C][C:1]" as SMILES and SMARTS, and then exporting as SMILES and SMARTS.





1) If I read "C[C][C:1]" as SMILES and export it as SMILES, it exports "CC[CH3:1]". Why is the hydrogen count added only when the atom map number is present?





2) In the above example, since [C:1] is not valid SMILES, I assume that it's automatically importing and exporting as SMARTS even though the options say "smiles". If that's correct, then why is the result in #1 above different from the result in #3 below (when explicitly exporting as SMARTS)?





3) If I read "C[C][C:1]" as SMILES and export it as SMARTS, it exports "[#6]C[#6:1]". What determines when the atomic number is used instead of the symbol? Is there a way to get it to export "CC[C:1]" instead?





4) If I create an importer, call setOptions("smarts"), and read the molecule, I get exactly the same results as above: "CC[CH3:1]" when exported as SMILES, "[#6]C[#6:1]" when exported as SMARTS. Does setting the import option make any difference (between SMILES and SMARTS)?





5) If I create the importer using 'new MolImporter(inputStream, "smarts")', I get an exception that "marts" is not a valid format when reading the molecule. Is it expecting a different syntax for the options in this case?





Thanks,


Chris

ChemAxon 25dcd765a3

01-07-2005 07:30:48

Hi Chris,
Quote:
I understand now that the atom map syntax is SMARTS only, not SMILES. For some reason I had thought it was supported by SMILES.
Sorry, atom map syntax is supported by SMILES; that is why [CH3:1]C is a valid SMILES. One thing to take care of in SMILES: if you put an atom inside '[' ']', you must write its implicit H count if it has any.
Quote:
1) If I read "C[C][C:1]" as SMILES and export it as SMILES, it exports "CC[CH3:1]". Why is the hydrogen count added only when the atom map number is present?
Please note that H means different things in SMILES and SMARTS. In SMILES it means the implicit hydrogens, whereas in SMARTS it is the total hydrogen count.

In SMILES the implicit H must be written only if the atom is inside brackets.


See more information about SMILES atom specification:


http://www.daylight.com/smiles/smiles-atoms.html





(From the next release, Marvin 4.0, we will support improper valences in SMILES, so no implicit hydrogens will be added to bracketed atoms. In practice this means that you will get back the same SMILES atoms that you entered.)
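
The hydrogen rule described above can be illustrated with a small, self-contained sketch. It is not Marvin code: the class `SmilesHydrogenSketch` and its method are hypothetical names, the valence table covers only a few elements, and the bracket pattern handles only a symbol, an optional H count and an optional atom map. It shows why "C[C][C:1]", read literally, describes CH3-C-C.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hydrogen counts for single SMILES atom tokens. */
final class SmilesHydrogenSketch {

    // Simplified bracket atom: symbol, optional H count, optional atom map,
    // e.g. [C], [CH3], [CH3:1]. Mass, chirality and charge are not handled here.
    private static final Pattern BRACKET =
            Pattern.compile("\\[([A-Z][a-z]?)(?:H(\\d*))?(?::(\\d+))?\\]");

    /** Normal valences of a few organic-subset elements, in ascending order. */
    private static int[] normalValences(String symbol) {
        switch (symbol) {
            case "C": return new int[] {4};
            case "N": return new int[] {3, 5};
            case "O": return new int[] {2};
            case "S": return new int[] {2, 4, 6};
            default: throw new IllegalArgumentException("unsupported symbol: " + symbol);
        }
    }

    /**
     * Hydrogens on one atom token, given the sum of its bond orders to
     * non-hydrogen neighbours.
     */
    static int hydrogenCount(String token, int bondOrderSum) {
        Matcher m = BRACKET.matcher(token);
        if (m.matches()) {
            // Bracketed atom: exactly the written H count, nothing is inferred.
            String h = m.group(2);
            if (h == null) {
                return 0;                                   // no H written
            }
            return h.isEmpty() ? 1 : Integer.parseInt(h);   // "H" alone means one
        }
        // Unbracketed atom: fill up to the lowest normal valence that fits.
        for (int valence : normalValences(token)) {
            if (valence >= bondOrderSum) {
                return valence - bondOrderSum;
            }
        }
        return 0; // bonds already exceed the highest normal valence
    }

    public static void main(String[] args) {
        // The three atoms of C[C][C:1], with 1, 2 and 1 single bonds respectively.
        System.out.println(hydrogenCount("C", 1));      // 3
        System.out.println(hydrogenCount("[C]", 2));    // 0
        System.out.println(hydrogenCount("[C:1]", 1));  // 0
    }
}
```

Under the same assumptions, `hydrogenCount("[CH2]", 2)` and `hydrogenCount("C", 2)` both give 2, so "C[CH2][CH3:1]" and "CC[CH3:1]" describe the same molecule.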
Quote:
2) In the above example, since [C:1] is not valid SMILES, I assume that it's automatically importing and exporting as SMARTS even though the options say "smiles".
OK, let's see what the SMILES [C:1] means: it is a carbon atom with 0 implicit H and with atom map 1.

Now let's see what C[C][C:1] means: a carbon atom (which has implicit H, whose count can be calculated from the lowest normal valence) connected to a second carbon that has no implicit H, connected to a third carbon atom that has no implicit H but has atom map 1.

(I said this was not valid because such a molecule does not exist; the string does have a meaning, however. I mentioned SMARTS because I thought you wanted to express a substructure query.)





MolImporter reads a string as SMARTS if it finds features in the string that are not supported by the SMILES specification. In this case, however, your strings were imported as SMILES because they did not contain any non-SMILES elements.
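
To make the idea of "features not supported by SMILES" concrete, here is a rough heuristic sketch. It is an assumption for illustration only and does not reproduce MolImporter's actual detection logic; the class `SmartsHeuristicSketch` and the chosen set of characters are hypothetical and incomplete (SMARTS primitives such as D, X or R are not checked).

```java
/** Illustrative only: a crude check for a few SMARTS-only features. */
final class SmartsHeuristicSketch {

    static boolean looksLikeSmarts(String s) {
        boolean inBracket = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '[') {
                inBracket = true;
                continue;
            }
            if (c == ']') {
                inBracket = false;
                continue;
            }
            // Logical operators and the "any bond" symbol are SMARTS features.
            if (c == '!' || c == ',' || c == ';' || c == '&' || c == '~') {
                return true;
            }
            // Recursive SMARTS starts with "$(".
            if (c == '$' && i + 1 < s.length() && s.charAt(i + 1) == '(') {
                return true;
            }
            // '#' is a triple bond outside brackets, but an atomic-number
            // primitive (e.g. [#6]) inside brackets.
            if (c == '#' && inBracket) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSmarts("C[C][C:1]"));    // false
        System.out.println(looksLikeSmarts("[#6]C[#6:1]"));  // true
    }
}
```

With this check, "C[C][C:1]" contains nothing SMARTS-specific, which matches the observation that it was read as SMILES.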
Quote:
4) If I create an importer, call setOptions("smarts"), and read the molecule, I get exactly the same results as above
Which string are you trying to import?
Quote:
5) If I create the importer using 'new MolImporter(inputStream, "smarts")', I get an exception that "marts" is not a valid format when reading the molecule. Is it expecting a different syntax for the options in this case?
No, I don't think so; it is just described badly in the documentation, so I missed the ':'.

So you may need 'new MolImporter(inputStream, "smarts:")', but I would rather suggest using setQueryMode(true).





All the best


Andras

User 6ef33138f9

01-07-2005 14:57:21

Quote:
Sorry, atom map syntax is supported by SMILES, that is why [CH3:1]C is a valid SMILES.
OK, I've been confused by this. (-: The problem is that there are two different SMILES tutorials on the Daylight website. One of them (http://www.daylight.com/smiles/f_smiles.html) makes no mention of reactions or atom map numbers. In particular, its page on atoms (which you mentioned in your post) defines the atom format as:
Code:
atom : '[' <mass> symbol <chiral> <hcount> <sign<charge>> ']'
Nothing about map numbers there. But there's another page (http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html) that does show some example reactions with map numbers. Hmm.
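
As a rough illustration of how the two descriptions fit together, the sketch below parses a bracket atom following the grammar quoted above and additionally accepts a trailing ":map" suffix. The suffix is an assumption based on strings such as [CH3:1] used in this thread, not on the quoted grammar; the class `BracketAtomSketch` is hypothetical, and chiral classes beyond @/@@ and repeated signs such as "++" are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: one parsed SMILES bracket atom. */
final class BracketAtomSketch {
    final Integer mass;    // null if no isotope mass is written
    final String symbol;
    final String chiral;   // "@", "@@" or null
    final int hcount;      // 0 if no H is written
    final int charge;      // 0 if no sign is written
    final Integer map;     // null if no atom map is written

    private static final Pattern ATOM = Pattern.compile(
            "\\[(\\d+)?"                   // <mass>
            + "([A-Z][a-z]?|[a-z]{1,2})"   // symbol (aliphatic or aromatic)
            + "(@{1,2})?"                  // <chiral>
            + "(?:H(\\d*))?"               // <hcount>
            + "(?:([+-])(\\d*))?"          // <sign<charge>>
            + "(?::(\\d+))?\\]");          // atom map (assumed extension)

    private BracketAtomSketch(Matcher m) {
        mass = m.group(1) == null ? null : Integer.valueOf(m.group(1));
        symbol = m.group(2);
        chiral = m.group(3);
        String h = m.group(4);
        hcount = h == null ? 0 : (h.isEmpty() ? 1 : Integer.parseInt(h));
        String sign = m.group(5);
        if (sign == null) {
            charge = 0;
        } else {
            int magnitude = m.group(6).isEmpty() ? 1 : Integer.parseInt(m.group(6));
            charge = sign.equals("-") ? -magnitude : magnitude;
        }
        map = m.group(7) == null ? null : Integer.valueOf(m.group(7));
    }

    static BracketAtomSketch parse(String token) {
        Matcher m = ATOM.matcher(token);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a supported bracket atom: " + token);
        }
        return new BracketAtomSketch(m);
    }

    public static void main(String[] args) {
        BracketAtomSketch a = parse("[CH3:1]");
        System.out.println(a.symbol + " H" + a.hcount + " map " + a.map);  // C H3 map 1
    }
}
```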
Quote:
In smiles the implicit H must be written only if the atom is inside brackets.
Yes, after re-reading the page on atoms (http://www.daylight.com/smiles/smiles-atoms.html), the behavior makes sense. So the SMILES "C[C][C:1]" really means CH3-C-C, not CH3-CH2-CH3 as I had assumed. But the interesting thing is that Marvin seems to figure out what I meant and export it as "CC[CH3:1]". I guess Marvin in this version will automatically correct the improper valence for atoms in brackets when importing as SMILES? (But in 4.0 will allow the improper valences to remain as-is?)
Quote:
Quote:
4) If I create an importer, call setOptions("smarts"), and read the molecule, I get exactly the same results as above
Which string are you trying to import?
The same string as above: "C[C][C:1]".
Quote:
but I rather suggest the use of setQueryMode(true).
Thanks. Setting the option to "smarts" instead of "smiles" made no difference, but setting the query mode did. Here's a summary of the results for "C[C][C:1]":





- Import as SMILES or SMARTS with queryMode=false, export as SMILES: CC[CH3:1]


- Same as above but export as SMARTS: [#6]C[#6:1]


- Import as SMILES or SMARTS with queryMode=true, export as SMILES: CC[#6:1]


- Same as above but export as SMARTS: CC[C:1]





So I think the bottom line is:





- If I want to import and export SMILES, I should use H correctly for bracketed atoms. Then I can import "C[CH2][CH3:1]" and export it as "CC[CH3:1]" (which is syntactically different but semantically the same).





- If I want to import and export SMARTS, I should use "smarts:" format and call setQueryMode(true) when importing to get the expected result.





Correct?





Thanks,


Chris

ChemAxon 25dcd765a3

01-07-2005 15:40:00

Hi Chris,





I think you know quite a lot by now, so I'll be brief.
Quote:
I guess Marvin in this version will automatically correct the improper valence for atoms in brackets when importing as SMILES? (But in 4.0 will allow the improper valences to remain as-is?)
Exactly.
Quote:
- If I want to import and export SMILES, I should use H correctly for bracketed atoms. Then I can import "C[CH2][CH3:1]" and export it as "CC[CH3:1]" (which is syntactically different but semantically the same).





- If I want to import and export SMARTS, I should use "smarts:" format and call setQueryMode(true) when importing to get the expected result.


Correct.





All the best


Andras