JChem Search API does not retrieve inserted molecule

User 7910dcb734

16-01-2013 16:44:02

 


Hi,


I insert molecules into a JChem structure table via the following method (trimmed for readability):


private IntArray insertStructures(List<Molecule> molecules) {
    final Molecule[] molArray = molecules.toArray(new Molecule[molecules.size()]);

    ConnectionCallback<IntArray> connectionCallback = new ConnectionCallback<IntArray>() {
        @Override
        public IntArray doInConnection(Connection connection) {
            try {
                ConnectionHandler connectionHandler = new ConnectionHandler(connection, propertyTableName);

                // Serialize all molecules into one "mol" string to feed the Importer.
                String molString = (String) MolExporter.exportToObject(molArray, "mol", new MolExport());
                if (molString == null) {
                    return new IntArray(0);
                }
                byte[] molBytes = molString.getBytes();

                // Configure the import and record the IDs of the imported rows.
                Importer importer = new Importer();
                importer.setConnectionHandler(connectionHandler);
                importer.setInput(new ByteArrayInputStream(molBytes));
                importer.setTableName(structureTableName);
                importer.setDuplicateImportAllowed(UpdateHandler.DUPLICATE_FILTERING_TABLE_OPTION);
                importer.setStoreImportedIDs(true);
                importer.setStoreDuplicates(false);
                importer.setEmptyStructuresAllowed(false);
                importer.run();
                return importer.getImportedIDs();
            }
...

 


Afterwards, I need the cd_id of each imported molecule. I run a search on each molecule via its structure:


 


private void assignStructureId(final Molecule molecule) {
    ConnectionCallback<Integer> connectionCallback = new ConnectionCallback<Integer>() {
        @Override
        public Integer doInConnection(Connection connection) {
            try {
                ConnectionHandler connectionHandler = new ConnectionHandler(connection, propertyTableName);

                // Run a synchronous duplicate search with the in-memory molecule as the query.
                JChemSearch searcher = new JChemSearch();
                searcher.setQueryStructure(molecule);
                searcher.setConnectionHandler(connectionHandler);
                searcher.setStructureTable(structureTableName);
                searcher.setRunMode(JChemSearch.RUN_MODE_SYNCH_COMPLETE);
                JChemSearchOptions searchOptions = new JChemSearchOptions(SearchConstants.DUPLICATE);
                searcher.setSearchOptions(searchOptions);
                searcher.run();

                // Exactly one cd_id is expected; anything else is treated as an error.
                int[] cd_ids = searcher.getResults();
                if (cd_ids.length > 1) {
                    String message = "Multiple entries returned for same structure: " + cd_ids[0] + ", " + cd_ids[1];
                    throw new CompoundRepositoryException(message);
                } else if (cd_ids.length == 0) {
                    String message = "No match returned for entered structure.";
                    throw new CompoundRepositoryException(message);
                } else {
                    return cd_ids[0];
                }

...
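The "exactly one hit" rule in the search snippet above can be isolated from the JChem calls and unit-tested on its own. The following is only a minimal sketch in plain Java: the class and method names are hypothetical, IllegalStateException stands in for the CompoundRepositoryException used above, and treating a null array like an empty one is an assumption, not documented JChemSearch behaviour.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the JChem API.
final class SingleHitResolver {

    /** Returns the only ID in hits, or throws if there are zero or several. */
    static int requireSingleHit(int[] hits) {
        // Assumption: a null result is handled the same way as an empty one.
        if (hits == null || hits.length == 0) {
            throw new IllegalStateException("No match returned for entered structure.");
        }
        if (hits.length > 1) {
            throw new IllegalStateException(
                    "Multiple entries returned for same structure: " + Arrays.toString(hits));
        }
        return hits[0];
    }

    public static void main(String[] args) {
        // Usage: pass the array returned by the search.
        System.out.println(requireSingleHit(new int[] {1017}));
    }
}
```

Keeping this check separate means the "not found" case can be logged together with the query structure before the exception propagates, which helps when only a few molecules out of many fail.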



 


However, for one particular molecule (attached as .sdf) I do not find any matching structures - despite having inserted the very same molecule into the structure table (and I have checked that this is indeed inserted - there is just no match found with the searcher).


Any thoughts? It seems like this may be a bug, as it happens only with this molecule out of tens of thousands. However I have not managed to track down a cause, so perhaps I am doing something wrong.


Brendan

ChemAxon 9c0afc9aaf

16-01-2013 18:01:58

Hi,


Could you please let us know:


1) The exact version of the JChemBase API used


(chemaxon.jchem.version.VersionInfo.JCHEM_VERSION)


2) The table settings printed by:


jcman t <table_name>


Best Regards,


Szilard


 

User 7910dcb734

17-01-2013 10:09:36

Hi Szilard,


1) The exact version is 5.11.5


2)


Table type: Molecules

Table version: 5110000

Uses tautomers for duplicate search: No

Filters out the duplicate structures: Yes

Fingerprint settings:

        Length (bits): 512
        Pattern length: 6
        Bits per pattern: 2

Table uses default standardization.

    Column name     Type name
  1 CD_ID           INT
  2 CD_STRUCTURE    MEDIUMBLOB
  3 CD_SMILES       VARCHAR
  4 CD_FORMULA      VARCHAR
  5 CD_SORTABLE_FOR VARCHAR
  6 CD_MOLWEIGHT    DOUBLE
  7 CD_HASH         INT
  8 CD_FLAGS        VARCHAR
  9 CD_TIMESTAMP    DATETIME
 10 CD_PRE_CALCULAT TINYINT
 11 CD_FP1          INT
 12 CD_FP2          INT
 13 CD_FP3          INT
 14 CD_FP4          INT
 15 CD_FP5          INT
 16 CD_FP6          INT
 17 CD_FP7          INT
 18 CD_FP8          INT
 19 CD_FP9          INT
 20 CD_FP10         INT
 21 CD_FP11         INT
 22 CD_FP12         INT
 23 CD_FP13         INT
 24 CD_FP14         INT
 25 CD_FP15         INT
 26 CD_FP16         INT



ChemAxon 9c0afc9aaf

17-01-2013 14:49:55

Hi,


Strangely, with the command-line tools "jcman" and "jcsearch" the structure is found by a duplicate search (these tools use essentially the same API).


I have tested with the same version, same settings.


I assume you are using MySQL, right? (I tested with that.)


There is one potential source of discrepancy in your approach: you convert the molecule to "mol" before inserting. If some features of the Molecule cannot be represented in "mol" format, then the stored structure differs from the query object and no match should be expected.


- Is the attached SDF the original source/format of the Molecule object?


- How was the Molecule created? Apart from the import, were there any manipulations on it?


- Could you please attach or paste the Molecule converted to MRV format right before the insert?


MolExporter.exportToFormat(mol, "mrv")


Best regards,


Szilard

User 7910dcb734

17-01-2013 15:37:15

Hi Szilard,


Yes, I found the same thing with the JChem Manager software, which I found strange.


I am using MySQL, yes.


Is there a better alternative to converting to mol before inserting? I could find no way to insert directly from a Molecule object; have I missed this?


The attached sdf is the original source of the Molecule object.


The Molecule was created using the MolImporter class to read the sdFile. It did have some manipulations: the structure checker (with default fixers for each error found) and the standardizer. I have attached the configuration xmls for these as well as the exported molecule in .mrv format immediately prior to insertion.


Many thanks for the help,


Brendan

ChemAxon 9c0afc9aaf

17-01-2013 23:51:42

Hi Brendan,


We could reproduce the issue with the MRV, thank you.


We will investigate this further and get back to you here.


Regarding the other questions:


- In general, conversion to "mrv" format is the best option (it supports all possible features). It also seems to be a workaround for this problem.


molString = (String) MolExporter.exportToObject(molArray, "mrv", new MrvExport());


- You need to specify some String, as this String will be stored in the cd_structure column, and possibly accessed for display directly.


BTW, it seems that UpdateHandler could be quite handy in this case instead of Importer. Have you taken a look at that class yet?


Best regards,


Szilard

User 7910dcb734

21-01-2013 10:15:31

Thanks Szilard. I will start using "mrv" (particularly as it seems to be a workaround for this problem).


I will also look at UpdateHandler.


 


Cheers,


 


Brendan

User 7910dcb734

11-03-2013 13:25:09

Hi Szilard,


Has there been any progress with this issue? I have a number of molecules that have the same problem, and using "mrv" does not work as a workaround.


Regards,


Brendan


 


Edited to remove spurious attachment.

ChemAxon 9c0afc9aaf

11-03-2013 13:33:26

I think the original problem was fixed in 5.12.


We will also check the recently posted structures and get back to you.


Szilard

User 7910dcb734

11-03-2013 13:35:57

Hi Szilard,


I've not updated to 5.12 yet (I missed that release). I'll do so now and get back to you if the problem persists.


 


Regards,


 


Brendan

User 7910dcb734

11-03-2013 16:32:51

Hi Szilard,


I think the original problem has been fixed (certainly for the molecule I originally posted).


Unfortunately, I still get the same issue with a new molecule. I have attached it in mrv form, as exported immediately prior to insertion.


To be clear, as before, after inserting into the database I then immediately query for it. It is not found.

ChemAxon a3d59b832c

11-03-2013 17:36:21

Hi Brendan,


It is not clear what role the attached standardizer and structure checker configurations play.


From the jcman output, it seems that the standardization is not applied to the table, so it must have been run externally to JChem Base.


 


Do you apply the standardization and structure checking to both the inserted and the query structures?


 


Thanks,


Szabolcs

User 7910dcb734

11-03-2013 18:37:37

Hi Szabolcs,


 


The standardiser and checker are applied to the molecule object (loaded into memory from an sdfile format) before attempting the insert. You can see my code above for usage. The molecule object is turned into an mrv string for insertion, while the same object is used in the following search. I will use jcman again tomorrow for the latest table properties; it should use the same standardiser config (though this shouldn't matter).


Brendan

User 7910dcb734

12-03-2013 09:23:09

The jcman output for the table:



Table type: Molecules

Table version: 5120000

Uses tautomers for duplicate search: No

Filters out the duplicate structures: Yes

Fingerprint settings:

        Length (bits): 512
        Pattern length: 6
        Bits per pattern: 2

Custom standardization configuration:
----------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<!-- Standardizer configuration file -->
<!-- Sample example from ChemAxon documentation -->

<StandardizerConfiguration Version ="0.1">
    <Actions>
        <Action ID="aromatize" Act="aromatize"/>
        <Transformation ID="PlusMinus" Structure="[*+:1][*-:2]>>[*:1]=[*:2]"/>
        <!-- File missing for test <Transformation ID="PlusMinusDouble" Structure="molfiles/PlusMinusDouble.mol"/> -->
        <Transformation ID="Enamine" Structure="[H]N[C:1]=[C:2]>>[H][C:2][C:1]=N"/>
        <Transformation ID="Enol" Structure="[H:4][O:3][C:1]=[C:2]>>[H:4][C:2][C:1]=[O:3]"/>
        <Transformation ID="ClMinus" Structure="[Cl-]>>" Exact="true" Groups="target,g1"/>
        <RemoveExplicitH ID="removeH" Charged="true" Radical="true" Mapped="true"/>
        <Removal ID="keepOne" Method="keepLargest" Measure="molMass"/>
        <RemoveRGroupDefinitions ID="removeRGroupDefinitions"/>
        <RemoveAttachedData ID="removeAttachedData"/>
        <RemoveAtomValues ID="removeAtomValues"/>
        <Aromatize ID="chemaxonaromatize" Type="basic"/>
        <AddExplicitH ID="addH"/>
        <AliasToGroup ID="aliastogroup"/>
        <AliasToAtom ID="aliastoatom"/>
        <Sgroups ID="expand" Act="Expand" Exclude="Ph,Ac"/>
        <ClearStereo ID="clearstereo" Type="Chirality"/>
        <AbsoluteStereo ID="setstereo" Act="Set"/>
        <Expand ID="stoichiometry" Data="COEFF"/>
        <Dearomatize ID="dearomatize"/>
        <Neutralize ID="neutralize"/>
        <ClearIsotopes ID="clearisotopes"/>
        <!-- File missing for test <Clean Type="TemplateBased" TemplateFile="templates.mrv" ID="clean"/> -->
        <Tautomerize ID="tautomer"/>
        <Mesomerize ID="mesomer"/>
        <Removal ID="RemoveFragment" Method="keepLargest" Measure="atomCount"/>
    </Actions>
</StandardizerConfiguration>
----------------------------------------

    Column name     Type name
  1 CD_ID           INT
  2 CD_STRUCTURE    MEDIUMBLOB
  3 CD_SMILES       VARCHAR
  4 CD_FORMULA      VARCHAR
  5 CD_SORTABLE_FOR VARCHAR
  6 CD_MOLWEIGHT    DOUBLE
  7 CD_HASH         INT
  8 CD_FLAGS        VARCHAR
  9 CD_TIMESTAMP    DATETIME
 10 CD_PRE_CALCULAT TINYINT
 11 CD_FP1          INT
 12 CD_FP2          INT
 13 CD_FP3          INT
 14 CD_FP4          INT
 15 CD_FP5          INT
 16 CD_FP6          INT
 17 CD_FP7          INT
 18 CD_FP8          INT
 19 CD_FP9          INT
 20 CD_FP10         INT
 21 CD_FP11         INT
 22 CD_FP12         INT
 23 CD_FP13         INT
 24 CD_FP14         INT
 25 CD_FP15         INT
 26 CD_FP16         INT

ChemAxon abe887c64e

12-03-2013 16:58:58

Hi Brendan,


Having reviewed the standardizer configuration file you sent, we should mention that the 'tautomerize' and 'mesomerize' actions transform the structure into Kekule form. If you want an aromatic molecule, you should run 'aromatize' (and only one 'aromatize' action), and only after these two actions.


In addition, the 'tautomerize' and 'mesomerize' actions are canonical transformations and do not retain substructure parts exactly; therefore, a substructure search may not return hits if these standardizer actions are applied.


Could you run a test with the standardizer configuration modified accordingly?
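

The ordering advice above can also be checked mechanically before a configuration is applied to a table. What follows is only a rough sketch in plain Java, not a ChemAxon tool: the class and method names are hypothetical, and it assumes that actions are direct children of <Actions> and are identified either by their element name (Aromatize, Tautomerize, Mesomerize) or by the Act attribute of a generic <Action> element, as in the configuration posted earlier in this thread. It inspects the order only and does not run the Standardizer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the JChem or Standardizer API.
final class StandardizerOrderCheck {

    /** Lists the action kinds under the first <Actions> element, in document order, lower-cased. */
    static List<String> actionKinds(String configXml) {
        List<String> kinds = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
            NodeList actions = doc.getElementsByTagName("Actions");
            if (actions.getLength() == 0) {
                return kinds;
            }
            NodeList children = actions.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and XML comments
                }
                Element element = (Element) node;
                // Assumption: a generic <Action> names its kind in the Act attribute.
                String kind = element.getTagName().equals("Action")
                        ? element.getAttribute("Act")
                        : element.getTagName();
                kinds.add(kind.toLowerCase(Locale.ROOT));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse standardizer configuration", e);
        }
        return kinds;
    }

    /** True only if exactly one aromatize action exists and it follows every tautomerize/mesomerize. */
    static boolean aromatizeOrderOk(String configXml) {
        List<String> kinds = actionKinds(configXml);
        int lastCanonical = Math.max(kinds.lastIndexOf("tautomerize"), kinds.lastIndexOf("mesomerize"));
        int first = kinds.indexOf("aromatize");
        int last = kinds.lastIndexOf("aromatize");
        return first != -1 && first == last && first > lastCanonical;
    }

    public static void main(String[] args) {
        String config = "<StandardizerConfiguration Version=\"0.1\"><Actions>"
                + "<Tautomerize ID=\"tautomer\"/><Mesomerize ID=\"mesomer\"/>"
                + "<Aromatize ID=\"aromatize\" Type=\"basic\"/>"
                + "</Actions></StandardizerConfiguration>";
        System.out.println(aromatizeOrderOk(config));
    }
}
```

Applied to the configuration posted above, this check would flag two things: there are two aromatize actions, and both come before 'tautomerize' and 'mesomerize'.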


Best regards,


Krisztina

User 7910dcb734

13-03-2013 09:25:29

Hi kvajda,


 


Thanks, I think that has fixed it.


 


Cheers,


 


Brendan