Unique cxsmiles output appears to be non-unique

User 26d9368720

05-10-2016 14:09:53

Hello,


it came to my attention that when a specific molecule (see either SMILES below) is converted to the supposedly unique CxSMILES string, the output actually depends on the notation used for import. This issue can be easily reproduced with molconvert, for example, by executing
molconvert cxsmiles:u "O=C1C[C@@H]2[C@H](C1)[C@@H]1CC(=O)C[C@H]21"
or
molconvert cxsmiles:u "O=C1C[C@@H]2[C@@H]3CC(=O)C[C@@H]3[C@@H]2C1"
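
For checking more than two notations at once, the comparison could be scripted. The following is only a minimal sketch: it assumes the molconvert executable is on PATH and accepts the structure string as an argument exactly as in the two commands above. The helper names (molconvert_unique, group_by_unique_form, is_unique_for) are made up for illustration and are not part of any ChemAxon API.

```python
import subprocess
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def molconvert_unique(smiles: str) -> str:
    """Return the output of `molconvert cxsmiles:u "<smiles>"`.

    Assumption: molconvert is installed, on PATH, and takes the structure
    string as a positional argument, as in the commands quoted in this post.
    """
    result = subprocess.run(
        ["molconvert", "cxsmiles:u", smiles],
        capture_output=True,
        text=True,
        check=True,  # raise if molconvert exits with an error
    )
    return result.stdout.strip()


def group_by_unique_form(
    notations: Iterable[str],
    canonicalize: Callable[[str], str] = molconvert_unique,
) -> Dict[str, List[str]]:
    """Group input notations by the string the converter produces for them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for notation in notations:
        groups[canonicalize(notation)].append(notation)
    return dict(groups)


def is_unique_for(
    notations: Iterable[str],
    canonicalize: Callable[[str], str] = molconvert_unique,
) -> bool:
    """True if all notations (assumed to describe one molecule) give one output."""
    return len(group_by_unique_form(notations, canonicalize)) <= 1
```

Calling is_unique_for with the two SMILES strings above would be expected to return False as long as the behavior described here persists. The canonicalize parameter is there only so that the grouping logic can be tried without molconvert installed.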


To me this behavior appears rather unexpected and frustrating, as it means that at least some molecules may be represented multiple times in a database which relies on unique cxsmiles to keep structural data duplicate-free.

ChemAxon abe887c64e

05-10-2016 15:18:36

Hi Dmytro,


Thank you for notifying us. The difference between the molconvert cxsmiles:u outputs seems to be a bug. We'll start investigating it and will inform you about the status of the fix.


As for duplicates in the database table, JChem tables have a table option, Filter out duplicate structures, which prevents duplicates from being imported into the table, independently of the file format of the molecules.


Best regards,


Krisztina

User 26d9368720

05-10-2016 16:33:45

Hi Krisztina,



thank you for addressing this issue and for your helpful comment on the duplicate filtering option.


Regarding the registration of structures in the database, I also have another question (if it is off-topic, I can open a separate thread): is there a way to force storage of only particular cxsmiles structural features (as described in the documentation) in the cd_smiles field of a JChem structure table, for example to remove ring bond indexes, local parities and local bicyclo-alkane stereo information while keeping lone electron pairs and enhanced stereo features? It seems that a configuration like this should be accessible through standardization rules, but the built-in standardizer actions apparently cannot cover the existing variety of cxsmiles structural features.


With best regards,


Dmytro

ChemAxon abe887c64e

10-10-2016 14:01:39

Hi Dmytro,


Sorry for the late response.


Please take into consideration that all the fields in a JChem table starting with "cd_" are implemented for internal use. They serve as tools for precise and quick chemical structure search in a JChem table. We do not recommend building any external application on these fields. The only exception is the cd_structure field, which stores the originally inserted chemical structure.


We recommend adding a user-defined field to that table and defining a suitable chemical term, e.g. molConvert() or molFormat(), to fill this column with the required cxsmiles format of the molecules. From the command line, for example:


jcman m <table_name> --add-ctcolcfg '<field_name>=molFormat("cxsmiles:u")'

Best regards,


Krisztina