trouble generating SMILES for a couple molecules

User cec0f3a823

29-10-2009 03:24:22

Trying to use MolConvertWS to generate SMILES for a couple structures but getting an error:

Some features of [Cl-].[#6]C1=C(OCCOCC[N+]([#6])([#6])CC2=CC=CC=C2)C=CC(=C1)C([#6])([#6])CC([#6])([#6])[#6] cannot be converted to the given format. Try mrv format.

I am passing a mol/sdf to the web service (sample below).

I tried to open the same structure with MarvinSketch and was also unable to generate SMILES from there!

Funny that the mol file has a smiles_code metatag that suggests that the molecule's SMILES would be: [N+](CCOCCOc1c(cc(cc1)C(CC(C)(C)C)(C)C)C)(Cc2ccccc2)(C)C

And Babel gives me the following SMILES: [Cl-].Cc1cc(ccc1OCCOCC[N+](C)(C)Cc1ccccc1)C(C)(C)CC(C)(C)C

Can you guys see what seems to be the problem with this one?

Follows the mol file contents:

-ISIS- 03220716582D

32 32 0 0 0 0 0 0 0 0999 V2000

2.9375 -6.7875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

2.9375 -7.6125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

4.3616 -6.7875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

3.6496 -6.3708 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

5.0772 -6.3771 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

5.7905 -6.7917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

6.5062 -6.3813 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

7.2194 -6.7959 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

7.9351 -6.3855 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

8.6484 -6.8001 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

9.3641 -6.3897 0.0000 N 0 3 3 0 0 0 0 0 0 0 0 0

10.0773 -6.8042 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

10.7930 -6.3938 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

11.5033 -6.8106 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

12.2168 -6.4037 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

12.2235 -5.5784 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

11.5104 -5.1616 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

10.7907 -5.5702 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

9.3665 -5.5647 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

9.5752 -7.1872 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

2.2237 -8.0260 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0

1.5086 -7.6146 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

2.2249 -8.8510 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1.8465 -6.9128 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

4.3616 -7.6125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

3.6496 -8.0208 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.7947 -8.0281 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0

0.0797 -7.6166 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.7959 -8.8531 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.5800 -7.2315 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

5.0754 -8.0260 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

13.6000 -7.3750 0.0000 Cl 0 5 0 0 0 0 0 0 0 0 0 0

6 7 1 0 0 0 0

13 18 1 0 0 0 0

14 15 1 0 0 0 0

15 16 2 0 0 0 0

16 17 1 0 0 0 0

17 18 2 0 0 0 0

1 2 1 0 0 0 0

11 19 1 0 0 0 0

7 8 1 0 0 0 0

11 20 1 0 0 0 0

1 4 2 0 0 0 0

2 21 1 0 0 0 0

8 9 1 0 0 0 0

21 22 1 0 0 0 0

2 26 2 0 0 0 0

21 23 1 0 0 0 0

9 10 1 0 0 0 0

21 24 1 0 0 0 0

25 3 2 0 0 0 0

10 11 1 0 0 0 0

25 26 1 0 0 0 0

3 4 1 0 0 0 0

22 27 1 0 0 0 0

11 12 1 0 0 0 0

27 28 1 0 0 0 0

27 29 1 0 0 0 0

12 13 1 0 0 0 0

27 30 1 0 0 0 0

13 14 2 0 0 0 0

25 31 1 0 0 0 0

3 5 1 0 0 0 0

5 6 1 0 0 0 0

M CHG 2 11 1 32 -1

M STY 1 1 GEN

M SLB 1 1 1

M SAL 1 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

M SAL 1 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

M SAL 1 1 31

M SDI 1 4 -0.2000 -10.1000 -0.2000 -5.3100

M SDI 1 4 12.5600 -4.7000 12.5600 -9.5000

M END

> <Isis_internal_number> (9)

9

> <compound_identifier> (9)

09G06

> <chemical_name> (9)

Methyl benzethonium chloride

> <solubility> (9)

H2O

> <appearance> (9)

white powder

> <literature_ref> (9)

MI, 12, 6103

> <precautions> (9)

Store at room temperature

> <lipophilicity_value> (9)

1.390000000000000e+000

> <CAS_number> (9)

25155-18-4

> <family> (9)

MED

> <smiles_code> (9)

[N+](CCOCCOc1c(cc(cc1)C(CC(C)(C)C)(C)C)C)(Cc2ccccc2)(C)C

> <Compound_identifying_number> (9)

705

> <Structurale_Source> (9)

SF

> <rotatable_bond> (9)

12

> <Prestw_number> (9)

Prestw-705

$$$$

User c1ce6b3d19

29-10-2009 09:43:13

Julio,

It might be a strange-character issue with the web services, but if converting with msketch doesn't work either, it is probably not related to them.

We are investigating the issue and will reply soon.

Jon

User c1ce6b3d19

29-10-2009 13:01:19

Julio,

I located the portions of the file that were not convertible. It turns out that the problem section of the SDfile describes the Sgroup type. Generally, this is used to describe repeating substructures (shown as bracketed molecules or molecule groups). These molecular features are not available in smiles. It seems that babel may be dropping (important?) features that were present in the sdf format.

Suggestions:

You may want to reassess whether smiles is the correct format for your molecules. If so, then the S-Groups may be able to be ungrouped or expanded to fit into the smiles format.

You may want to keep the molecules in a different format (perhaps keeping them in sdf format or moving to Marvin format).

In the 5.3 release, the molConvertWS will handle input formats. The input format "sdf:Usg" or "sdf:Xsg" may be helpful. See the sdf format.

As a note:

If you remove the General SGroup Info lines (from the first line, "M STY ...", through the last line, "M SDI ..."), you can run the molecular conversion again and get the same result as babel. To match babel's output, use the output format "smiles:a", which aromatizes the rings.
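The line removal described above can be sketched in Python as follows. This is only an illustration: the helper name strip_sgroup_lines is hypothetical, and the tag list is an assumption covering just the four Sgroup tags that appear in the posted file (STY, SLB, SAL, SDI). Other molfiles may carry additional Sgroup tags that would need adding.

```python
import re

# Sgroup tags present in the posted molfile. Assumption: these four are
# enough for this file; the CTfile format defines further Sgroup tags.
SGROUP_TAGS = ("STY", "SLB", "SAL", "SDI")


def strip_sgroup_lines(molblock, tags=SGROUP_TAGS):
    """Return molblock without property lines such as 'M  STY ...'."""
    pattern = re.compile(r"^M\s+(?:%s)\b" % "|".join(map(re.escape, tags)))
    kept = [line for line in molblock.splitlines() if not pattern.match(line)]
    return "\n".join(kept) + "\n"


# Usage: only the charge line and the terminator survive.
sample = "\n".join([
    "M  CHG  2  11   1  32  -1",
    "M  STY  1   1 GEN",
    "M  SLB  1   1   1",
    "M  SAL   1  1  31",
    "M  SDI   1  4   -0.2000  -10.1000   -0.2000   -5.3100",
    "M  END",
])
cleaned = strip_sgroup_lines(sample)
```

Dropping these lines discards the grouping information, so the cleaned file is only suitable for uses where that loss is acceptable.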

Here is a link to a document that describes more about SDF and the Connection Table format.

User cec0f3a823

29-10-2009 13:29:05

You may want to reassess whether smiles is the correct format for your molecules. If so, then the S-Groups may be able to be ungrouped or expanded to fit into the smiles format.

Jon, I'm using smiles (actually cxsmiles:u) to generate a pseudo key to check for duplicate compounds in a database. It is not used to represent the molecule itself, for that I use sdf.

In the 5.3 release, the molConvertWS will handle input formats. The input format "sdf:Usg" or "sdf:Xsg" may be helpful. See the sdf format.

Any idea when 5.3 will be available?

Also, is it possible to remove/expand the S-groups with 5.2.5? Any call I could use to convert the sdf to some other format with the S-groups removed/expanded?

TIA

User c1ce6b3d19

29-10-2009 18:39:44

Julio,

I would probably use JChemSearchWS in duplicate mode to check for duplicates. The search methods are here. Use the "t:d" option string (that is, <search type option>:<duplicate search type>).

5.3 will probably be released around December 2009.

I can't think of a way to expand/remove the s-groups in 5.2.5 right now. But for a database pseudo key, it would be better to use the duplicate search (see the first point), since it is fast and completely accurate. After removing s-groups, you might run into two different molecules having the same "unique" smiles string.

Jon

User c1ce6b3d19

01-11-2009 15:36:51

Julio,

There might be a way to expand the s-groups with the Standardizer WS (which can also do conversion), using a standardizer configuration with S-group actions (contract, expand, or ungroup).

http://www.chemaxon.com/webservices/developersGuide.html#Standardizer

http://www.chemaxon.com/jchem/doc/user/StandardizerConfiguration.html#sgroupsec

Perhaps this will allow you to "remove" the s-group in a way suitable to your usage.

Jon

User cec0f3a823

02-11-2009 20:38:42

Let me try to summarize what I am trying to do and the wall I'm hitting:

I have my own database where I keep a record of all my molecules. Besides my own metadata I keep a copy of the sdf/mol file, a smarts string (cxsmarts:u) and a hash of the smarts string which I use for quick duplicate queries.
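A minimal sketch of such a hash key, for illustration only. The function name structure_key and the choice of SHA-256 are assumptions, not details of the actual setup. The key is only as reliable as the unique string it is computed from: two records get the same key only when their strings are identical.

```python
import hashlib


def structure_key(unique_string):
    """Return a hex digest usable as a lookup key for a unique SMILES/SMARTS string."""
    # Strip surrounding whitespace so a trailing newline does not change the key.
    normalized = unique_string.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Usage: the same string, with or without a trailing newline, gives the same key.
key_a = structure_key("[Cl-].Cc1cc(ccc1OCCOCC[N+](C)(C)Cc1ccccc1)C(C)(C)CC(C)(C)C")
key_b = structure_key("[Cl-].Cc1cc(ccc1OCCOCC[N+](C)(C)Cc1ccccc1)C(C)(C)CC(C)(C)C\n")
```

Storing the digest in an indexed column allows a duplicate check with a single equality query.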

I also use the SMILES code in my database to populate JChem base, which is then used for substructure/similarity search. And I have a custom field that links back to the record in my database. That allows me to map a JChem search result to my own database.

To get the smiles/smarts code for a molecule I use MolConvertWS to convert sdf into SMARTS.

In some very specific cases I am hitting the wall as the sdf cannot be converted to SMILES/SMARTS (as noted before).

Based on your suggestions I tried changing my approach: instead of populating JChem Base with smiles from my database, I would use the sdf/mol data from my database. The problem is that I found no way to send my custom field value along with the sdf data. When sending smiles to jchem (command line) I can add custom field values right after the smiles string. I could not find anywhere in the documentation how to populate my custom field values when sending sdf data to JChem.

Is that possible?

Also, is there anything wrong with my approach?

Regards,

julio

ChemAxon 9c0afc9aaf

03-11-2009 23:09:02

Hi,

Let me answer in general terms; my colleagues may extend my answer with web service specifics.

The basic problem with your old approach is:

- conversion to SMILES from SDF can result in loss of data (not just coordinates, which are naturally lost)

- in such cases we do not allow the conversion

- We would suggest duplicate search with JChem, it should be very fast and accurate

Regarding the second approach:

- We do support extra columns and populating them from input data.

- In case of jcman GUI or command-line import these data fields should be present in the file (e.g. as SDF data fields). One can specify which SDF fields go to which DB fields.

- our Java and Web Services APIs also support specifying values at runtime for each structure

- storing the structures multiple times (if this is still the plan) seems superfluous
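As an illustration of carrying a custom value inside the file, the sketch below appends one data item to a single SDF record before its "$$$$" delimiter. The helper name add_sdf_field and the field name my_record_id are hypothetical. Mapping the SDF field to a table column is left to the import tool and is not shown here.

```python
def add_sdf_field(record, name, value):
    """Append a '> <name>' data item to one SDF record, before its '$$$$' line."""
    lines = record.rstrip("\n").split("\n")
    if lines[-1].strip() != "$$$$":
        raise ValueError("record does not end with the $$$$ delimiter")
    # An SDF data item is a header line, the value, and a terminating blank line.
    item = ["> <%s>" % name, str(value), ""]
    return "\n".join(lines[:-1] + item + ["$$$$"]) + "\n"


# Usage with a stub record (a real record has the full mol block before 'M  END').
record = "M  END\n$$$$\n"
tagged = add_sdf_field(record, "my_record_id", 705)
```

The sketch assumes the input is exactly one record and that any existing data items are already terminated by a blank line.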

My general recommendation is to make the JChem table the primary and only table to store the structures in.

This is how our customers usually use JChem.

You may store your data in the JChem table, and/or you can refer to these structures by their primary key from other tables. You should be able to get the key back after insert (not sure if web services already support this).

Another approach is to use JChem Cartridge for Oracle: you may index any column in any table and use it for structure search with SQL statements:

http://www.chemaxon.com/product/jc_cart.html

This is the most practical approach if an existing data structure must be kept.

Best regards,

Szilard

User c1ce6b3d19

04-11-2009 13:55:29

Julio,

Specifically for your issues in importing the custom field values present in an sdf file, this page will help you use the JChem Manager (jcman) GUI to import the custom fields by using the Connecting Fields window.

http://www.chemaxon.com/jchem/doc/admin/index.html#import

User cec0f3a823

04-11-2009 14:53:38

Nice.

--connect works and I was able to populate my own custom foreign key field.

I had read that documentation, but the format of the --connect argument is not that clear. Anyway, I got it working, thanks. So now I am uploading sdf's instead of smiles.

As for using JChem for duplicate verification, it may be fast, but it takes at least 2 web service calls: a connect and a search. Using a smiles hash in my database gets me the result in one web service call. I do know that using smiles is not a perfect solution, but if we start getting too many false positives we can fall back to using JChem.

BTW, the addStructure() WebService call does not return the inserted record ID. That would be nice as I could use that information.

Thanks for all the help.

julio

User c1ce6b3d19

11-11-2009 12:21:11

Julio,

We have added the cd_id return value feature to the addStructure method and it will be part of the 5.3 release.

The documentation about the --connect option has also been changed, and the update will be included in 5.3.

Jon

User c1ce6b3d19

11-11-2009 13:57:36

Julio,

You can use the standardize method of the Standardizer Web Service to accomplish the ungrouping of your s-groups to create a smiles hash.

You can use the simple string "sgroups:ungroup" as the standardizer configuration or the following XML Configuration string:

<?xml version="1.0" encoding="UTF-8"?>
<StandardizerConfiguration>
<Actions>
<Sgroups Act="ungroup" />
</Actions>
</StandardizerConfiguration>

Be sure to escape the problem characters (e.g. "<") before sending the soap messages.
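A minimal sketch of that escaping step, using only the Python standard library. It applies when the SOAP envelope is built by hand. Many SOAP client libraries escape string parameters themselves, in which case escaping again would double-escape the configuration.

```python
from xml.sax.saxutils import escape

# The configuration string from above, to be embedded as text in a SOAP message.
config = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<StandardizerConfiguration>\n'
    '<Actions>\n'
    '<Sgroups Act="ungroup" />\n'
    '</Actions>\n'
    '</StandardizerConfiguration>'
)

# escape() replaces '&', '<' and '>' with their XML entities.
escaped = escape(config)
```

The escaped string can then be placed inside the element that carries the standardizer configuration.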

Jon