Error in flagging duplicates

User 2f24778469

18-09-2008 15:51:58

I am using a previously written Java app to flag duplicates in our Structure database. We recently updated to JChem 5.11, and this program is a little old, so I was tasked with removing deprecated calls and updating the program.





We had hoped that updating to the new version of JChem would fix the false positives we were getting, that is, pairs of chemicals that get marked as duplicates but clearly are not.





Code:
               searcher.setQueryStructure(string_cd_structure);
               searcher.setConnectionHandler(conHandler);
               searcher.setStructureTable("JChemData");
               options.setSearchType(JChemSearch.PERFECT);
               options.setMaxResultCount(75000);
               options.setMaxTime(60000);
               searcher.setSearchOptions(options);
               searcher.run_NE();






This is the code we are using to perform the search for each chemical. An example of the false positives we get is attached.





I can provide more code and other examples if needed. Please correct me if some of my terminology is off, as I've only been using this software for a day or so.

ChemAxon 9c0afc9aaf

18-09-2008 20:02:50

Hi,





The problem is with the structures.


Both structures contain a carbon and a sulphur atom, with a rather misleading alias name on the carbon.





This can be observed by looking at the sources of the molfiles:








Code:

  Marvin  09170815462D          

  2  1  0  0  0  0            999 V2000
    5.2395   -4.0898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.0645   -4.0898    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
A    1
Pb
M  END











Code:

  Marvin  09170815462D          

  2  1  0  0  0  0            999 V2000
    3.3516   -2.8866    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1762   -2.8619    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
A    1
Zn
M  END






Since our structure search considers only the chemical structure itself (not the aliases or labels), this is a legitimate perfect match of two identical C=S structures.
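
To make the mismatch visible without opening each file by hand, here is a minimal sketch in plain Java (no JChem needed). It assumes a V2000 molfile with the usual three header lines followed by the counts line, and it relies on the CTfile convention that an atom alias is stored as an "A  nnn" line followed by a line holding the alias text. The class and method names (AliasScan, findAliases, atomSymbols) are made up for this illustration and are not part of any ChemAxon API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AliasScan {

    /** Element symbols from the atom block, in atom order. */
    public static List<String> atomSymbols(String molfile) {
        String[] lines = molfile.split("\r?\n", -1);
        // Line 4 (index 3) is the counts line; the first 3 columns hold the atom count.
        int nAtoms = Integer.parseInt(lines[3].substring(0, 3).trim());
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < nAtoms; i++) {
            // Atom line: x, y, z, then the element symbol.
            symbols.add(lines[4 + i].trim().split("\\s+")[3]);
        }
        return symbols;
    }

    /** Maps 1-based atom number to alias text for every alias entry. */
    public static Map<Integer, String> findAliases(String molfile) {
        String[] lines = molfile.split("\r?\n", -1);
        int nAtoms = Integer.parseInt(lines[3].substring(0, 3).trim());
        int nBonds = Integer.parseInt(lines[3].substring(3, 6).trim());
        Map<Integer, String> aliases = new LinkedHashMap<>();
        // The properties block follows the atom and bond blocks.
        for (int i = 4 + nAtoms + nBonds; i < lines.length; i++) {
            if (lines[i].startsWith("M  END")) {
                break;
            }
            if (lines[i].startsWith("A  ") && i + 1 < lines.length) {
                int atomNumber = Integer.parseInt(lines[i].substring(3).trim());
                aliases.put(atomNumber, lines[i + 1].trim());
                i++; // skip the alias text line just consumed
            }
        }
        return aliases;
    }

    public static void main(String[] args) {
        // First structure from this thread: C=S with alias "Pb" on atom 1.
        String mol = String.join("\n",
            "",
            "  Marvin  09170815462D",
            "",
            "  2  1  0  0  0  0            999 V2000",
            "    5.2395   -4.0898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
            "    6.0645   -4.0898    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
            "  1  2  2  0  0  0  0",
            "A    1",
            "Pb",
            "M  END");
        List<String> symbols = atomSymbols(mol);
        for (Map.Entry<Integer, String> e : findAliases(mol).entrySet()) {
            System.out.println("atom " + e.getKey() + ": element "
                + symbols.get(e.getKey() - 1) + ", alias " + e.getValue());
        }
    }
}
```

Run on the two files above, the scan reports the same element (C) on atom 1 in both, with only the alias text differing, which is why the perfect search treats them as identical.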





Best regards,





Szilard

User be1039f1ca

26-09-2008 20:29:26

What is the correct way to represent those structures in mol files?


Does this look correct?


Code:

  Marvin  09260815562D          

  2  1  0  0  0  0            999 V2000
    2.6813    1.2080    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    1.6795    1.2375    0.0000 Zn  0  0  0  0  0  0  0  0  0  0  0  0
  2  1  2  0  0  0  0
M  END


ChemAxon 9c0afc9aaf

28-09-2008 01:02:46

At first glance it does.





You can find more information on the file format in the MDL documentation:


http://www.mdl.com/downloads/public/ctfile/ctfile.pdf





I just pasted the sources to demonstrate the problem.


If you draw the molecule in Marvin (as opposed to manually editing the file), it will be correct for sure. You can also check the formula with Elemental Analysis from Marvin.





Best regards,





Szilard

User 2f24778469

30-09-2008 17:56:26

Thanks for the help. Since these aliases can cause these sorts of problems, I'd like to check the mol files in my database to flag the ones with aliases. Can you recommend a method to do that?

ChemAxon 9c0afc9aaf

30-09-2008 18:12:53

Hi,





I think one could quite easily write a small Java program for that.





- get the cd_id and the cd_structure for each row
- create a Molecule from the cd_structure string with either MolImporter or MolHandler
- iterate through all atoms in the Molecule (see the methods getAtomCount() and getAtom())
- see if there is an alias string set for each atom:
http://www.chemaxon.com/jchem/doc/api/chemaxon/struc/MolAtom.html#getAliasstr()

I think this should be null if it's not set, so if it is not null you log the cd_id somewhere, and you get a list this way.





Let me know if you have any questions.





Best regards,





Szilard

ChemAxon a3d59b832c

01-10-2008 07:44:45

We also plan to add a new Standardizer action that converts these aliases to atoms.





The new action will work in a similar way as the current AliasToGroup action:


http://www.chemaxon.com/jchem/doc/user/StandardizerConfiguration.html#aliastogroupsec





But instead of abbreviations like COOH, NO2, etc., it will be able to convert aliases to simple atoms. It is planned to appear in version 5.2, some time in early 2009.

User 2f24778469

01-10-2008 19:41:46

You guys are exceedingly helpful! Thanks a bunch.


I was able to get a good list of the structures with aliases. Here's the relevant code to get the list, for posterity.





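Code:
      // Requires chemaxon.formats.MolImporter, chemaxon.struc.Molecule and chemaxon.struc.MolAtom.
      // rs is the ResultSet over the structure table; count and productsCount are counters declared earlier.
      while (rs.next()) {
         boolean isAliased = false;
         Molecule m = MolImporter.importMol(rs.getBytes("cd_structure"));

         for (int i = 0; i < m.getAtomCount(); i++) {
            MolAtom atom = m.getAtom(i);
            // Flag pseudo atoms and atoms that carry an alias string
            if (atom.isPseudo() || atom.getAliasstr() != null) {
               System.out.println(rs.getString("cd_id") + " " + rs.getString("pnum")
                + " " + atom.getAliasstr());
               count++;          // number of aliased atoms
               isAliased = true;
            }
         }
         if (isAliased) {
            productsCount++;     // number of structures with at least one aliased atom
         }
      }

      System.out.println("total: " + count + " in " + productsCount + " products");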






Thanks again for the excellent support!