Asterisk wildcard as Atom Symbol in Molfile misinterpreted

User 6f58eb8616

10-11-2009 17:27:00

JChem Version: 5.2.3_1


Hi, 


I have encountered some unexpected behaviour when importing Molfiles and outputting them as SMILES.  It seems the asterisk "*" wildcard character is not interpreted correctly: 


final String molfile = "\n\n\n" +
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n" +
    "    0.5100    1.5300    0.0000 *   0  0  0  0  0  0  0  0  0  0  0  0\n" +
    "   -0.5100    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n" +
    "  1  2  1  0  0  0  0\n" +
    "M  END\n";

Molecule molecule = MolImporter.importMol(molfile, "mol");
assertEquals("C*", molecule.toFormat("smiles:u"));

 


This throws:

java.lang.IllegalArgumentException: Some features of [#6]-[#114] cannot be converted to the given format. Try mrv format.

 


 


However, if I change the atom symbol in the Molfile from "*" to an unknown symbol like "ZZZ", I get the result I was looking for (although in this case I would have expected an error):


 


final String molfile = "\n\n\n" +
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n" +
    "    0.5100    1.5300    0.0000 ZZZ   0  0  0  0  0  0  0  0  0  0  0  0\n" +
    "   -0.5100    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n" +
    "  1  2  1  0  0  0  0\n" +
    "M  END\n";

Molecule molecule = MolImporter.importMol(molfile, "mol");
assertEquals("C*", molecule.toFormat("smiles:u"));


Obviously not a huge problem, but it seems a bit odd to have to code round it. Or have I missed something? Should I be using the "enc" parameter of importMol?


 


Thanks in advance


 


Derek

ChemAxon 25dcd765a3

12-11-2009 09:07:19

Dear Derek,


The STAR atom in the MDL formats represents unknown or unspecified end groups, whereas '*' in SMILES represents any atom except hydrogen.


The two look the same, but the STAR atom may represent a whole group, while '*' represents only one atom.


SMILES cannot represent an unspecified group, which is why you get the message:



Some features of [#6]-[#114] cannot be converted to the given format. Try mrv format.
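
If you want to know in advance whether a molfile will hit this message, one option is to look at the atom block yourself before exporting. The following is only a minimal sketch using plain Java string handling, not a JChem API: the class and method names (MolfileAtomSymbols, atomSymbols, hasStarAtom) are made up for illustration, and it assumes a well-formed V2000 molfile with the fixed-width atom line layout (three 10-character coordinates, a space, then a 3-character symbol field).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JChem): reads atom symbols from a V2000 molfile. */
final class MolfileAtomSymbols {

    /** Returns the atom symbols in atom-block order. Assumes a well-formed V2000 molfile. */
    static List<String> atomSymbols(String molfile) {
        String[] lines = molfile.split("\r?\n", -1);
        // Lines 1-3 are the header; line 4 (index 3) is the counts line,
        // whose first three columns hold the number of atoms.
        int atomCount = Integer.parseInt(lines[3].substring(0, 3).trim());
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            String atomLine = lines[4 + i];
            // The symbol field sits at 0-based columns 31..33,
            // after the x, y, z coordinates (10 characters each) and one space.
            int end = Math.min(34, atomLine.length());
            symbols.add(atomLine.substring(31, end).trim());
        }
        return symbols;
    }

    /** True if any atom in the atom block uses the MDL STAR symbol "*". */
    static boolean hasStarAtom(String molfile) {
        return atomSymbols(molfile).contains("*");
    }

    public static void main(String[] args) {
        String molfile = "\n\n\n"
            + "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
            + "    0.5100    1.5300    0.0000 *   0  0  0  0  0  0  0  0  0  0  0  0\n"
            + "   -0.5100    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
            + "  1  2  1  0  0  0  0\n"
            + "M  END\n";
        System.out.println(atomSymbols(molfile)); // prints [*, C]
        System.out.println(hasStarAtom(molfile)); // prints true
    }
}
```

With a check like this you could choose another output format (for example mrv, as the message suggests) for structures that contain a STAR atom, instead of catching the exception.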


 


If you write 'zzz' to the molfile, then it represents only one atom, which is a pseudo atom. A pseudo atom is just one atom, so we represent it with '*' in SMILES.
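
Based on that behaviour, one possible way to code round the problem in the current version is to rewrite the STAR symbol to a pseudo atom label before importing. The sketch below is only an illustration in plain Java, not a JChem feature: StarAtomRewriter and replaceStarAtoms are hypothetical names, the label "ZZZ" is simply the one used earlier in this thread, and the code assumes a well-formed V2000 molfile. Note that this changes the meaning of the structure, since a possible group becomes a single pseudo atom, so it only makes sense if '*' in your data really stands for one unspecified atom.

```java
/** Hypothetical helper (not part of JChem): swaps STAR atoms for a pseudo atom label. */
final class StarAtomRewriter {

    /**
     * Returns a copy of the molfile in which every atom symbol "*" in the atom block
     * is replaced by the given label (1 to 3 characters, so the columns stay aligned).
     */
    static String replaceStarAtoms(String molfile, String label) {
        if (label.isEmpty() || label.length() > 3) {
            throw new IllegalArgumentException("label must be 1 to 3 characters long");
        }
        // Splitting with limit -1 keeps trailing empty strings, so join() restores the text exactly.
        String[] lines = molfile.split("\n", -1);
        // Line 4 (index 3) is the counts line; its first three columns hold the atom count.
        int atomCount = Integer.parseInt(lines[3].substring(0, 3).trim());
        String padded = String.format("%-3s", label); // left-justify in the 3-character symbol field
        for (int i = 4; i < 4 + atomCount; i++) {
            String line = lines[i];
            // The symbol field occupies 0-based columns 31..33 of each atom line.
            if (line.length() >= 34 && line.substring(31, 34).trim().equals("*")) {
                lines[i] = line.substring(0, 31) + padded + line.substring(34);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String molfile = "\n\n\n"
            + "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
            + "    0.5100    1.5300    0.0000 *   0  0  0  0  0  0  0  0  0  0  0  0\n"
            + "   -0.5100    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
            + "  1  2  1  0  0  0  0\n"
            + "M  END\n";
        // The rewritten text could then be passed to MolImporter.importMol(..., "mol")
        // in place of the original molfile.
        System.out.print(replaceStarAtoms(molfile, "ZZZ"));
    }
}
```

Only the atom block is touched; header lines, bond lines and properties are passed through unchanged.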


The next release will have a SMILES export option that forces SMILES output even if the molecule cannot be fully represented in SMILES. It will try to carry over as much information from the molecule as possible.


In this case (the molecule with the STAR atom saved to 1.mol):


molconvert smiles:r1 1.mol
C

User c2ffbfa8f8

17-11-2009 09:16:03

Thanks Andras, I will try out the new option when it's available; I assume it'll be in v5.3? The MDL Connection table spec I have says that one possible value for the atom symbol in the atom block is "* for unspecified atom", but I can see that this is a grey area generally, as people also use it to represent a group.


 


All the best


 


d


 


 

ChemAxon 25dcd765a3

17-11-2009 09:50:11

Hi,


I will try out the new option when it's available; I assume it'll be in v5.3? The MDL Connection table spec I have says that one possible value for the atom symbol in the atom block is "* for unspecified atom", but I can see that this is a grey area generally, as people also use it to represent a group.


You are right, it'll be available in v5.3.


Yes, I also find it quite strange that this is not stated explicitly in the MDL Connection table specification.

User 6f58eb8616

18-11-2009 11:37:50

Will v5.3 be out soon? I think I'd previously heard it would be available from 16 November. Sorry to be a pain.


 


d


 


 

ChemAxon 25dcd765a3

18-11-2009 15:27:56

Headquarters says Marvin 5.3 is expected to be out by the end of the year.


Andras