User 2f24778469
18-09-2008 15:51:58
I am using a previously written Java app to flag duplicates in our Structure database. We recently updated to JChem 5.11, and this program is a little old, so I was tasked with removing deprecated calls and updating the program.
We had hoped that updating to the new version of JChem would fix the false positives we were getting, that is, pairs of chemicals that get marked as duplicates but clearly are not.
Code: |
searcher.setQueryStructure(string_cd_structure);
searcher.setConnectionHandler(conHandler);
searcher.setStructureTable("JChemData");
options.setSearchType(JChemSearch.PERFECT);
options.setMaxResultCount(75000);
options.setMaxTime(60000);
searcher.setSearchOptions(options);
searcher.run_NE(); |
This is the code we are using to perform the search for each chemical. An example of the false positives we get is attached.
I can provide more code and other examples if needed. Please correct me if some of my terminology is off, as I've only been using this software for a day or so.
ChemAxon 9c0afc9aaf
18-09-2008 20:02:50
Hi,
The problem is with the structures.
Both structures contain a carbon and a sulphur atom, with a rather misleading alias name on the carbon.
This can be observed by looking at the sources of the molfiles:
Code: |
Marvin 09170815462D
2 1 0 0 0 0 999 V2000
5.2395 -4.0898 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.0645 -4.0898 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
A 1
Pb
M END
|
Code: |
Marvin 09170815462D
2 1 0 0 0 0 999 V2000
3.3516 -2.8866 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1762 -2.8619 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
A 1
Zn
M END |
Since our structure search considers only the chemical structure itself (not the aliases or labels), this is a legitimate perfect match between two identical C=S structures.
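To see the mismatch without opening each file by hand, here is a minimal plain-Java sketch that pairs every alias entry in a V2000 molfile with the element actually stored in the atom block. It does not use the JChem API. The class and method names (`MolfileAliases`, `describeAliases`) are made up for illustration, and it assumes well-formed V2000 text in which the atom fields are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

final class MolfileAliases {

    /**
     * Returns one description per aliased atom, pairing the element stored in
     * the atom block with the alias text stored in the properties block.
     * Hypothetical helper for illustration; not part of JChem.
     */
    static List<String> describeAliases(String molfile) {
        String[] lines = molfile.split("\\r?\\n");

        // Locate the counts line; in a V2000 file it ends with "V2000".
        int counts = -1;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].trim().endsWith("V2000")) {
                counts = i;
                break;
            }
        }
        List<String> found = new ArrayList<>();
        if (counts < 0) {
            return found; // not recognised as a V2000 molfile
        }

        for (int i = counts + 1; i < lines.length - 1; i++) {
            String[] tokens = lines[i].trim().split("\\s+");
            // An alias entry is "A  <atom number>" followed by the alias text line.
            if (tokens.length == 2 && tokens[0].equals("A") && tokens[1].matches("\\d+")) {
                int atomNumber = Integer.parseInt(tokens[1]);
                int atomLine = counts + atomNumber; // atom lines follow the counts line
                if (atomNumber < 1 || atomLine >= lines.length) {
                    continue;
                }
                // Simplification: atom lines read "x y z symbol ...", so the
                // element symbol is taken as the 4th whitespace-separated field.
                String[] atomFields = lines[atomLine].trim().split("\\s+");
                String element = atomFields.length > 3 ? atomFields[3] : "?";
                String alias = lines[i + 1].trim();
                found.add("atom " + atomNumber + ": element " + element + ", alias " + alias);
                i++; // skip the alias text line
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String mol = String.join("\n",
            "",
            "  Marvin  09170815462D",
            "",
            "  2  1  0  0  0  0            999 V2000",
            "    5.2395   -4.0898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
            "    6.0645   -4.0898    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
            "  1  2  2  0  0  0  0",
            "A    1",
            "Pb",
            "M  END");
        for (String line : describeAliases(mol)) {
            System.out.println(line);
        }
    }
}
```

Run on the first molfile above, this reports atom 1 as element C with alias Pb, which is exactly why the search treats both records as the same C=S structure.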
Best regards,
Szilard
User be1039f1ca
26-09-2008 20:29:26
What is the correct way to represent those structures in mol files?
Does this look correct?
Code: |
Marvin 09260815562D
2 1 0 0 0 0 999 V2000
2.6813 1.2080 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
1.6795 1.2375 0.0000 Zn 0 0 0 0 0 0 0 0 0 0 0 0
2 1 2 0 0 0 0
M END
|
ChemAxon 9c0afc9aaf
28-09-2008 01:02:46
At first glance it does.
You can find more information on the file format in the MDL documentation:
http://www.mdl.com/downloads/public/ctfile/ctfile.pdf
I just pasted the sources to demonstrate the problem.
If you draw the molecule in Marvin it will be correct for sure (as opposed to manually editing the file). You can also check the formula with Elemental Analysis from Marvin.
Best regards,
Szilard
User 2f24778469
30-09-2008 17:56:26
Thanks for the help. Since these aliases can cause these sorts of problems, I'd like to check the mol files in my database to flag the ones with aliases. Can you recommend a method to do that?
ChemAxon 9c0afc9aaf
30-09-2008 18:12:53
Hi,
I think one could quite easily write a small Java program for that.
- get the cd_id and the cd_structure for each row
- create a Molecule from cd_structure string with either MolImporter or MolHandler
- iterate through all atoms in the Molecule (see methods getAtomCount() and getAtom())
- check whether an alias string is set on each atom
http://www.chemaxon.com/jchem/doc/api/chemaxon/struc/MolAtom.html#getAliasstr()
I think this should be null if it's not set, so if not null you log the cd_id somewhere and you get a list this way.
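As a rough illustration of the "collect the cd_id of every aliased row" loop, here is a plain-Java sketch that works on the structure text directly instead of going through the JChem classes. It assumes cd_structure holds V2000 molfile text. `AliasFlagger`, `flagAliased` and the map of rows are hypothetical stand-ins for the real database query. Because it only pattern-matches lines of the form "A  <atom number>", it is a quick screen rather than a replacement for the getAliasstr() check described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class AliasFlagger {

    // A V2000 alias entry starts with a line of the form "A  <atom number>".
    private static final Pattern ALIAS_LINE =
            Pattern.compile("^A[ \\t]+\\d+[ \\t]*$", Pattern.MULTILINE);

    /**
     * Returns the ids of all rows whose molfile text contains an alias entry.
     * The map stands in for (cd_id, cd_structure) pairs read from the table.
     */
    static List<Integer> flagAliased(Map<Integer, String> rows) {
        List<Integer> flagged = new ArrayList<>();
        for (Map.Entry<Integer, String> row : rows.entrySet()) {
            if (ALIAS_LINE.matcher(row.getValue()).find()) {
                flagged.add(row.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Two made-up rows: id 1 carries an alias entry, id 2 does not.
        Map<Integer, String> rows = new LinkedHashMap<>();
        rows.put(1, String.join("\n",
            "", "  Marvin  09170815462D", "",
            "  2  1  0  0  0  0            999 V2000",
            "    5.2395   -4.0898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
            "    6.0645   -4.0898    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
            "  1  2  2  0  0  0  0",
            "A    1",
            "Pb",
            "M  END"));
        rows.put(2, String.join("\n",
            "", "  Marvin  09260815562D", "",
            "  2  1  0  0  0  0            999 V2000",
            "    2.6813    1.2080    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
            "    1.6795    1.2375    0.0000 Zn  0  0  0  0  0  0  0  0  0  0  0  0",
            "  2  1  2  0  0  0  0",
            "M  END"));
        System.out.println("rows with aliases: " + flagAliased(rows));
    }
}
```

With the JChem classes on the classpath, the body of the loop would instead import the Molecule and test each atom's alias string, as outlined in the steps above.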
Let me know if you have any questions.
Best regards,
Szilard
ChemAxon a3d59b832c
01-10-2008 07:44:45
User 2f24778469
01-10-2008 19:41:46
You guys are exceedingly helpful! Thanks a bunch.
I was able to get a good list of the structures with aliases. Here's the relevant code to get the list, for posterity.
Code: |
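// rs, count and productsCount are declared earlier in the program (not shown).
while (rs.next()) {
    boolean isAliased = false;
    // Build a Molecule from the stored structure.
    Molecule m = MolImporter.importMol(rs.getBytes("cd_structure"));
    for (int i = 0; i < m.getAtomCount(); i++) {
        MolAtom atom = m.getAtom(i);
        // Flag pseudo atoms and any atom that carries an alias string.
        if (atom.isPseudo() || atom.getAliasstr() != null) {
            System.out.println(rs.getString("cd_id") + " " + rs.getString("pnum")
                    + " " + atom.getAliasstr());
            count++; // one per flagged atom
            isAliased = true;
        }
    }
    if (isAliased) {
        productsCount++; // one per structure with at least one flagged atom
    }
}
System.out.println("total: " + count + " in " + productsCount + " products"); |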
Thanks again for the excellent support!