Perfect search fails with certain structures

User d68ef9d5a9

13-05-2005 17:39:15

Hi,





I store compound structures in Oracle in a Structure table, with the aromatized mol file in the cd_structure column.





When JChemSearch is used for a perfect search against the Structure table, Version 2.33 and Version 3.02 behave differently.





In Version 2.33, I insert the structure CC1=NC=CN1 into the Structure table as an aromatized mol file and then run a perfect structure search with the same SMILES. Whether or not I aromatize the seed structure, the search always finds the aromatized structure in the Structure table. That is good, and we like it, because aromatization only changes how a structure is represented, not what the structure is.





In Version 3.02, however, the same steps give a different result: the non-aromatized seed structure gets no hit. To find the same structure in the Structure table, I have to aromatize the seed structure before running the search.





The perfect search failure seems to be related to the particular structure I used; I don't know whether other structures have the same problem. The difference in behavior between the two versions is causing trouble in our compound searching. I wonder whether this is a bug or whether there is another reason.





I appreciate your time and help.





Ben Li

ChemAxon a3d59b832c

13-05-2005 18:04:56

Ben,





To further investigate the problem, could you send us the contents of the cd_smiles and cd_structure columns of the row containing this structure? Could you also send the programming context in which JChemSearch is used?





Thanks,


Szabolcs

User d68ef9d5a9

13-05-2005 19:19:57

This is the code I am using. I tested it a few minutes ago, and it behaved the way I described.





----------------------------------





// Connect to the Oracle database.
ConnectionHandler coreCH = new ConnectionHandler();
coreCH.setDriver("oracle.jdbc.driver.OracleDriver");
coreCH.setUrl("URLForOracle.........");
coreCH.setLoginName("UserName.........");
coreCH.setPassword("Password........");
coreCH.connect();

String structureTable = "STRUCTURE";
String smiles = "CC1=NC=CN1";

// Build the molecule from SMILES, aromatize it, and clean it in 2D.
MolHandler mh = new MolHandler(smiles);
Molecule mole = mh.getMolecule();
mole.aromatize(MoleculeGraph.AROM_DAYLIGHT);
mole.clean(2, null);

// Insert the aromatized structure as a mol file; "NAME" is an additional column.
UpdateHandler coreUH = new UpdateHandler(coreCH,
        UpdateHandler.INSERT,
        structureTable,
        "NAME");
coreUH.setValuesForFixColumns(mole.toFormat("mol"));
coreUH.setValueForAdditionalColumn(1, "MY COMPOUND", Types.VARCHAR);
coreUH.setDuplicateFiltering(true);
int key = coreUH.execute(true);
System.out.println("compound inserted, Id=" + key);

// Configure a perfect search against the same table.
JChemSearch coreSearcher = new JChemSearch();
coreSearcher.setConnectionHandler(coreCH);
coreSearcher.setStructureTable(structureTable);
coreSearcher.setSearchType(JChemSearch.PERFECT);
coreSearcher.setMaxResultCount(1);
coreSearcher.setMaxTime(6000000);
coreSearcher.setWaitingForResult(true);

// Rebuild the query molecule from the same SMILES.
mh = new MolHandler(smiles);
mole = mh.getMolecule();

/* This call makes the difference between versions 2.33 and 3.02:
 * without aromatization, the perfect search fails in version 3.02.
 */
mole.aromatize(MoleculeGraph.AROM_DAYLIGHT);

//mole.clean(2, null); // this does not matter

coreSearcher.setQueryStructure(mole.toFormat("mol"));
coreSearcher.run();

if (coreSearcher.getResultCount() > 0) {
    System.out.println("result count=" + coreSearcher.getResultCount());
    int foundId = coreSearcher.getResult(0);
    System.out.println("found it and Id=" + foundId);
} else {
    System.out.println("Not found");
}


------------------------------------------------------------

ChemAxon a3d59b832c

16-05-2005 11:03:18

Ben,





The problem is that the implicit H information on the aromatic N is lost in mol format. This is a shortcoming of the format, as molfiles have no field for the implicit H. In the short term we recommend using the mrv format, which is an extension of CML (Chemical Markup Language) and supports all the features we can handle:





http://www.chemaxon.com/marvin/doc/user/mrv-doc.html





We will discuss how we can solve this molfile problem in the longer term.





All the best,


Szabolcs

ChemAxon a3d59b832c

16-05-2005 12:56:22

Another thing:





We do not insist that the molecules in the cd_structure column are aromatized. If you omit the aromatization step before insert, the insertion will work correctly.





It is generally not safe to store aromatized structures in mol format, for the reason above. As far as I know, even ISIS does not allow insertion of such compounds. You can also use the format string "mol:-a" instead of "mol" to get the dearomatized mol format, but you have to make sure that the implicit Hs have not already been lost, because in that case dearomatization may not be possible. (For the same reason, the SMILES 'c1ncnc1' would be incorrect.)





All the best,


Szabolcs

User d68ef9d5a9

20-06-2005 19:15:25

Hi,





I have identified another scenario of perfect search failure. This time it is not related to aromatization, because I tried it both with and without aromatization.





String smiles1="F[C@H](Cl)Br";


String smiles2="[H][C@@](F)(Cl)Br";





Using the same algorithm I posted in my previous message with JChem Version 3.02 (the version released Dec 7, 2004), I tried these two SMILES strings: I inserted each into the database and then ran a perfect search for the same compound. The search succeeds for smiles1 but fails for smiles2. The only workaround I found is to implicitize H before the search:





mole.implicitizeHydrogens(MolAtom.LONELY_H);





I wonder whether this is a bug, or whether implicitizing H is required in version 3.02. If it is required, is LONELY_H enough to avoid the perfect search failure?





I also tried the same thing on the newest version, 3.0.12. The result is the same. In all cases, aromatization does not seem to matter at all.
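
For reference, smiles1 and smiles2 are expected to describe the same enantiomer. Under the SMILES chirality convention, swapping two neighbors in the written order flips @ to @@, and the implicit H in [C@H] counts as the first neighbor after the preceding atom. The sketch below checks this with a permutation-parity argument in plain Java. It is only an illustration: the class and method names are made up here, they are not part of the JChem API, and the neighbor labels are assumed to be distinct.

```java
import java.util.List;

final class SmilesStereoParity {

    /** Returns true when orderB is an even permutation of orderA (labels must be distinct). */
    static boolean isEvenPermutation(List<String> orderA, List<String> orderB) {
        int n = orderA.size();
        if (orderB.size() != n) {
            throw new IllegalArgumentException("neighbor lists differ in length");
        }
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            perm[i] = orderB.indexOf(orderA.get(i));
            if (perm[i] < 0) {
                throw new IllegalArgumentException("unknown neighbor: " + orderA.get(i));
            }
        }
        // A cycle of length k decomposes into k - 1 transpositions.
        boolean[] seen = new boolean[n];
        int transpositions = 0;
        for (int i = 0; i < n; i++) {
            int length = 0;
            for (int j = i; !seen[j]; j = perm[j]) {
                seen[j] = true;
                length++;
            }
            if (length > 0) {
                transpositions += length - 1;
            }
        }
        return transpositions % 2 == 0;
    }

    /** Same stereocenter if an even reordering keeps the mark, or an odd one flips it. */
    static boolean sameStereo(List<String> orderA, boolean clockwiseA,
                              List<String> orderB, boolean clockwiseB) {
        boolean even = isEvenPermutation(orderA, orderB);
        return even == (clockwiseA == clockwiseB);
    }

    public static void main(String[] args) {
        // F[C@H](Cl)Br: neighbors F, implicit H, Cl, Br with '@' (counterclockwise).
        List<String> first = List.of("F", "H", "Cl", "Br");
        // [H][C@@](F)(Cl)Br: neighbors H, F, Cl, Br with '@@' (clockwise).
        List<String> second = List.of("H", "F", "Cl", "Br");
        System.out.println("same stereocenter: " + sameStereo(first, false, second, true));
    }
}
```

The two neighbor orders differ by a single swap (F and H), which is an odd permutation, so the @ in smiles1 corresponds to the @@ in smiles2. A perfect search would therefore be expected to treat them as the same compound.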








Ben Li

ChemAxon 9c0afc9aaf

21-06-2005 17:37:00

Dear Ben,





Szabolcs will examine this issue soon and reply to your question.
(It looks like a search bug to me.)





Best regards,





Szilard

ChemAxon 9c0afc9aaf

22-06-2005 21:32:56

Hi Ben,





We have examined this issue, and found a bug affecting PERFECT search.





The next JChem version (3.0.13) will contain the fix; its release is expected in a few days.





Best regards,





Szilard

ChemAxon 9c0afc9aaf

24-06-2005 15:32:44

Ben,





Some additional comments about your code:





- You will still need the second clean in 2D, because some stereo information cannot be recovered from a 0D mol file.
This will be improved in the next major release (3.1); from then on you will not need the cleaning step.

- You can also pass the Molecule object directly to JChemSearch, so there is no need for the 2D cleaning or the Molecule -> molfile -> Molecule conversion
(unless you do this on purpose for testing).





Best regards,





Szilard

User d68ef9d5a9

24-06-2005 17:58:08

Thank you, Szilard and Szabolcs, for your effort and time.





I want to continue the discussion of Szabolcs's response from Mon May 16, 2005 1:56 pm.





Let’s take the compound “c1ncnc1” as an example. If a molecule is created from “c1ncnc1”, the molecular weight is 67.07. If a molecule is created from “N1C=CN=C1”, the molecular weight is 68.08.





I partly agree with Szabolcs's argument. The molecule “c1ncnc1” cannot be dearomatized arbitrarily, because there is no definitive answer as to which N carries the implicit H. However, whichever N it is, exactly one N must carry this H. Therefore, for this particular molecule, the molecular weight should always be 68.08, whether or not the structure is aromatized.
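
The arithmetic behind these two numbers can be sketched as follows, assuming rounded standard atomic weights (C 12.011, H 1.008, N 14.007). The class and helper names are hypothetical and are not JChem calls; the point is only that the two values differ by exactly one hydrogen atom.

```java
import java.util.Locale;
import java.util.Map;

final class ImidazoleWeight {

    // Rounded standard atomic weights in g/mol (assumed values for this sketch).
    static final Map<String, Double> ATOMIC_WEIGHT =
            Map.of("C", 12.011, "H", 1.008, "N", 14.007);

    /** Sums atomic weights over a formula given as element -> atom count. */
    static double weight(Map<String, Integer> formula) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> entry : formula.entrySet()) {
            sum += ATOMIC_WEIGHT.get(entry.getKey()) * entry.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // N1C=CN=C1 carries the N-H hydrogen: C3H4N2.
        double withNH = weight(Map.of("C", 3, "H", 4, "N", 2));
        // "c1ncnc1" read without any H on nitrogen: C3H3N2.
        double withoutNH = weight(Map.of("C", 3, "H", 3, "N", 2));
        System.out.printf(Locale.ROOT, "C3H4N2: %.2f%n", withNH);
        System.out.printf(Locale.ROOT, "C3H3N2: %.2f%n", withoutNH);
        System.out.printf(Locale.ROOT, "difference: %.3f%n", withNH - withoutNH);
    }
}
```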





This problem seems to exist in Version 3.0.2, and it has caused the molecular weight of some of our compounds to be off by one unit. It may also contribute to certain types of perfect search failure.





Ben Li

ChemAxon a3d59b832c

27-06-2005 14:31:06

Ben,





Please note that your SMILES “c1ncnc1” is invalid.





The following quotation is from the Daylight SMILES theory manual:





"A short note is in order about aromatic nitrogens, a common source of confusion in chemical information systems. All three common types of aromatic nitrogen may be specified with the aromatic nitrogen symbol n. Archetypical examples are pyridine, pyridine-N-oxide, and pyrrole.


...


Note that the pyrrolyl nitrogen in 1H-pyrrole is written [nH] to distinguish this kind of nitrogen from a pyridyl-N. Alternative valid SMILES for 1H-pyrrole include [H]n1cccc1 (with explicit hydrogen) and N1C=CC=C1 (aliphatic form) all three input forms are equivalent."





http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html#RTFToC29
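
One way to see why the H matters is a simplified Hückel electron count: an aromatic carbon or a pyridine-type n contributes one pi electron to the ring, while a pyrrole-type [nH] contributes two. The sketch below applies that count to the five-membered ring in question. It is a rough illustration under these assumptions, not how Daylight or JChem actually perceive aromaticity, and the class and enum names are hypothetical.

```java
final class HueckelCount {

    /** Simplified pi-electron contribution of each kind of ring atom. */
    enum RingAtom {
        AROMATIC_C(1),  // 'c'
        PYRIDINE_N(1),  // 'n' without hydrogen
        PYRROLE_NH(2);  // '[nH]'

        final int piElectrons;

        RingAtom(int piElectrons) {
            this.piElectrons = piElectrons;
        }
    }

    /** Adds up the pi electrons contributed by the ring atoms. */
    static int piElectrons(RingAtom... ring) {
        int total = 0;
        for (RingAtom atom : ring) {
            total += atom.piElectrons;
        }
        return total;
    }

    /** True when the count has the form 4k + 2 with k >= 0. */
    static boolean satisfiesHueckel(int piElectrons) {
        return piElectrons >= 2 && (piElectrons - 2) % 4 == 0;
    }

    public static void main(String[] args) {
        // c1ncnc1: both nitrogens written as plain 'n'.
        int withoutH = piElectrons(RingAtom.AROMATIC_C, RingAtom.PYRIDINE_N,
                RingAtom.AROMATIC_C, RingAtom.PYRIDINE_N, RingAtom.AROMATIC_C);
        // c1nc[nH]c1: one nitrogen written as '[nH]'.
        int withH = piElectrons(RingAtom.AROMATIC_C, RingAtom.PYRIDINE_N,
                RingAtom.AROMATIC_C, RingAtom.PYRROLE_NH, RingAtom.AROMATIC_C);
        System.out.println("c1ncnc1    -> " + withoutH + " pi electrons, 4k+2: "
                + satisfiesHueckel(withoutH));
        System.out.println("c1nc[nH]c1 -> " + withH + " pi electrons, 4k+2: "
                + satisfiesHueckel(withH));
    }
}
```

With both nitrogens written as plain n the ring has five pi electrons, which cannot be aromatic under this count; writing one of them as [nH] gives six.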





Best regards:


Szabolcs

ChemAxon 9c0afc9aaf

29-06-2005 10:19:02

Hi Ben,





JChem 3.0.13 has been released and is available for download.





It contains the fix concerning [H][C@@](F)(Cl)Br and PERFECT search.





Best regards,





Szilard