Molecule Matching - standardization rules or bug?

User 7910dcb734

13-04-2016 13:51:51

Hi,


One of the molecule datasets we use is ChEMBL. We've imported all of ChEMBL into our own database.


One of the compounds is 189019. It is behaving strangely - when I download the molecule from the website, and use MolSearch to compare it to the one in our database, they do not register as a match! Obviously there is some difference between how I am testing for duplicate molecules in my code versus how the database does it. But what is the difference?


I have attached the molecule file from our database, the ChEMBL mol file, and the standardization rules. The code for how I am testing is also below. 


Any ideas why these do not match? Am I configuring the MolSearch object incorrectly? Is there a standardization rule I am missing, or is this a bug?


I have about 30 compounds (out of ChEMBL's million or so) that behave this way, but I thought I'd start with one example.


 


//some config

this.molSearch = new StandardizedMolSearch();
Resource standardizerConfig = new ClassPathResource("uk/co/etx/bci/chemaxon/standardizerConfig.xml");
StandardizerConfigurationReader standardizerConfigurationReader = new StandardizerXMLReader(standardizerConfig.getInputStream());
molSearch.setStandardizer(new Standardizer(standardizerConfigurationReader.getConfiguration()));
MolSearchOptions molSearchOptions = new MolSearchOptions(SearchConstants.DUPLICATE);
molSearchOptions.setTautomerSearch(SearchConstants.TAUTOMER_SEARCH_ON);
molSearchOptions.setStereoSearchType(SearchConstants.STEREO_EXACT);
molSearch.setSearchOptions(molSearchOptions);


//import compounds

ClassPathResource bciChemblResource = new ClassPathResource("uk/co/etx/bci/experiments/chemblCompounds/BCI_ChEMBL189019.sdf");
ClassPathResource chemblChemblResource = new ClassPathResource("uk/co/etx/bci/experiments/chemblCompounds/CHEMBL_ChEMBL189019.mol");

Molecule bciMol = new MolImporter(bciChemblResource.getInputStream()).createMol();
Molecule chemblMol = new MolImporter(chemblChemblResource.getInputStream()).createMol();

//compare
molSearch.setTarget(bciMol);
molSearch.setQuery(chemblMol);

//why does this assertion not hold?
assertTrue(molSearch.isMatching());

ChemAxon abe887c64e

14-04-2016 16:47:45

Hi Brendan,


Can you give some details about the corresponding database table? What is the table type? Which standardizer actions are set? Is the default 'assume absolute stereo' setting intact? Ideally, please send the rows from the JChemProperties table that relate to the table into which the import was run:


select * from JCHEMPROPERTIES where PROP_NAME like 'table.<SCHEMA_NAME>.<TABLE_NAME>.%';

Please also tell us which version of JChem you are using.


Thanks,


Krisztina

User 7910dcb734

15-04-2016 11:05:33

Hi Krisztina,


I couldn't get the SQL command to work, but the options I use when I create the structure table are below. The Standardizer settings for the table are provided by the same file I attached above (via the tableOptions.setStandardizerConfig method), and the table type is TableTypeConstants.TABLE_TYPE_MOLECULES.


The JChem Base version is 16.1.11.


StructureTableOptions tableOptions = new StructureTableOptions(tableName, tableType);


tableOptions.setDuplicateFiltering(true);
tableOptions.setAbsoluteStereo(true);
tableOptions.setTautomerDuplicateChecking(true);


Many thanks,


 


Brendan

ChemAxon abe887c64e

19-04-2016 15:04:36

Hi Brendan,


We have finally found the cause of the differing search results.


In a database search, the stored target molecules are, where possible, handled in cxsmiles format. From a chemical point of view, the BCI molecule has two superfluous stereo bonds. When the target is handled in cxsmiles format, these superfluous stereo bonds are ignored; when MolSearch is given the target as a molfile, they are not ignored. That is why MolSearch does not treat the two molecules as duplicates.


Best regards,


Krisztina

User 7910dcb734

19-04-2016 15:07:50

Hi Krisztina,


Many thanks for looking into this for me.


Is there a MolSearch option that can duplicate the behaviour of the database? (Can I set MolSearch to ignore these superfluous stereo bonds?) 


Best wishes,


Brendan

ChemAxon abe887c64e

19-04-2016 16:18:28

Hi Brendan,


These code lines convert the BCI molecule into cxsmiles:


import chemaxon.formats.MolExporter;
import chemaxon.formats.MolImporter;

 bciMol = MolImporter.importMol(MolExporter.exportToFormat(bciMol, "cxsmiles"));


Regards,


Krisztina

ChemAxon 822473489f

19-04-2016 16:48:31

Hi Brendan,




Why do you call the method "createMol()" on MolImporter? As far as I can see, it returns an empty molecule. (You can check this by calling mol.getAtomCount().)


I would suggest calling "read()" instead, i.e.


MolImporter importer = new MolImporter(bciChemblResource.getInputStream());
Molecule mol = importer.read();
importer.close();


 


Best regards,


Monika

User 7910dcb734

20-04-2016 08:15:45

Hi Monika,


You're right; that's a mistake. I put together the test to isolate this case from some larger code (where I did not import the molecules in this fashion), and got that part wrong. Thanks for the fix.


Krisztina: many thanks, that appears to work. I'll run it through the rest of my test cases.


Best wishes,


Brendan

User 7910dcb734

21-04-2016 13:16:40

Hi Krisztina,


So far this appears to be working; many thanks once again!


I have found one molecule that throws an exception when I convert it to cxsmiles, with the following message: "Some features of the molecule ... cannot be converted to the given format."


I have attached the molecule. This molecule doesn't seem to need the transformation (it matches successfully when I do not attempt the export/import transform), but I was curious whether there is a format that both matches the database's duplicate-detection behaviour and does not throw this exception.


Cheers,


Brendan

ChemAxon abe887c64e

21-04-2016 14:00:43

Hi Brendan,


This molecule is a little tricky: it seems to contain many superatom Sgroups, which cause the error. However, if a molecule cannot be converted to cxsmiles (or cxsmarts), the database search works with the molecule in its original format.


In general, mrv and sdf (mol) formats can handle all features.


In the case of your previous molecule, there must be a bug in how that molecule is handled in mol format; that is why I recommended using cxsmiles to mimic the database search.


Best regards,


Krisztina
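

The fallback described in the last reply (use the cxsmiles form when the conversion succeeds, otherwise keep the molecule in its original format) can be sketched as a small generic helper. This is only an illustration under stated assumptions, not the database's actual implementation: the class name CxsmilesFallback, the Converter interface, and the method convertOrKeep are hypothetical names introduced here, and the demo uses plain strings as a stand-in so that it runs without JChem on the classpath. With JChem, T would be Molecule and the converter would be the export/import round trip quoted earlier in the thread.

```java
import java.util.Objects;

public class CxsmilesFallback {

    /** Hypothetical helper type: a conversion step that may fail with a checked exception. */
    @FunctionalInterface
    public interface Converter<T> {
        T convert(T input) throws Exception;
    }

    /**
     * Returns the converted value, or the original value when the conversion
     * throws or yields null. This mirrors the behaviour described in the thread:
     * molecules that cannot be converted are used in their original format.
     */
    public static <T> T convertOrKeep(T original, Converter<T> converter) {
        Objects.requireNonNull(original, "original");
        Objects.requireNonNull(converter, "converter");
        try {
            T converted = converter.convert(original);
            return converted != null ? converted : original;
        } catch (Exception e) {
            // Conversion not possible (e.g. unsupported features): keep the original.
            return original;
        }
    }

    public static void main(String[] args) {
        // Stand-in demo with strings. With JChem the call would look like (assumption,
        // based on the round trip shown in the thread):
        //   bciMol = convertOrKeep(bciMol,
        //       m -> MolImporter.importMol(MolExporter.exportToFormat(m, "cxsmiles")));
        String converted = convertOrKeep("abc", s -> s.toUpperCase());
        String kept = convertOrKeep("abc", s -> {
            throw new java.io.IOException("cannot be converted to the given format");
        });
        System.out.println(converted); // prints "ABC"
        System.out.println(kept);      // prints "abc"
    }
}
```

Catching Exception broadly is a deliberate simplification for the sketch; in real code it may be preferable to catch only the export exception and log which molecules fell back to their original format, so that cases like the Sgroup molecule above remain visible.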