Command-line substructure count?

User 74d30c678b

15-07-2009 08:45:12

Hi,


We have a command-line tool built on MDL functionality that I would like to re-implement using ChemAxon. This is what it does:


Input: a query.mol and an SD file with the compounds to match


Output: Compound_ID, count of non-overlapping substructure matches, count of overlapping substructure matches


The compound ID is stored in the SD file tag "<COMPOUND_ID>". Non-matches have to be output as well. Here is a mock-up example of the output:


Cmpd_A 0 0
Cmpd_B 1 2
Cmpd_C 0 0
Cmpd_D 2 2
Cmpd_E 0 0
Cmpd_F 0 0
...


How could this be implemented? Using evaluate or jcsearch?


The tool should at least output the compound ID and the count of non-overlapping substructure matches. The substructure definition must be a mol file, not a SMARTS string.


Any suggestions are welcome.


Thanks and best regards,


Daniel

ChemAxon e08c317633

15-07-2009 17:29:20

Hi Daniel,


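You can use the Chemical Terms evaluator (evaluate) for that. The matchCount() and disjointMatchCount() functions do the matching: matchCount() counts all matches, including overlapping ones, and disjointMatchCount() counts only non-overlapping matches.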


See documentation:


http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#category10list


 


The field() function can be used to return the value of an SD file tag:


http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#ct35lnk


 


The fourth match example on this page:


http://www.chemaxon.com/marvin/help/chemicalterms/ChemicalTerms.html
shows how external query molecules can be used:

4. The same with referencing the query by molecule file path: match(2, "mols/query.mol", 1, 2)


 


Putting it all together:

$ evaluate -e "field('COMPOUND_ID'); disjointMatchCount('query.mol'); matchCount('query.mol')" mols.sdf
Cmpd_1;0;0
Cmpd_2;1;1
Cmpd_3;0;0
Cmpd_4;2;7
Cmpd_5;0;0


Here query.mol is a molfile that contains the query, and mols.sdf contains the input molecules.


A simple shell script can set the query file for matchCount and disjointMatchCount and replace the semicolons with spaces in the output:

$ ./substructurecount.sh query.mol mols.sdf
Cmpd_1 0 0
Cmpd_2 1 1
Cmpd_3 0 0
Cmpd_4 2 7
Cmpd_5 0 0
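
The script itself is not reproduced in this post. A minimal sketch of what such a wrapper might look like, assuming evaluate is on the PATH and the SD file uses the COMPOUND_ID tag; the helper names build_expression and format_output are made up for illustration, and a query containing a single quote would break the quoting:

```shell
#!/bin/sh
# substructurecount.sh (sketch): usage: substructurecount.sh QUERY INPUT.sdf
# QUERY is a molfile path or a SMARTS string; both are passed to the
# Chemical Terms functions as a quoted string.

# Build the Chemical Terms expression for the given query.
build_expression() {
    printf "field('COMPOUND_ID'); disjointMatchCount('%s'); matchCount('%s')" "$1" "$1"
}

# Replace the semicolon separators printed by evaluate with spaces.
format_output() {
    tr ';' ' '
}

# Call evaluate only when both arguments are present.
if [ "$#" -eq 2 ]; then
    evaluate -e "$(build_expression "$1")" "$2" | format_output
fi
```

Keeping the expression builder separate from the evaluate call makes it easy to print and inspect the exact expression before running it on a large SD file.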


The solution also works with a SMARTS query:


$ ./substructurecount.sh "CCN" mols.sdf
Cmpd_1 0 0
Cmpd_2 1 1
Cmpd_3 0 0
Cmpd_4 2 6
Cmpd_5 0 0


The example molecules are attached.

Best regards,
Zsolt

User 74d30c678b

16-07-2009 07:03:33

Dear Zsolt,


many, many thanks for your quick reply. It works perfectly.


I played with disjointMatchCount, but I wasn't aware that one may pass a mol file as the query; I only found examples with SMARTS strings.


BTW, is there a list of "molecule constants" one may pass to the match functions ("nitro", "carboxylate", etc.)? I didn't find it in the online documentation.


Thanks again and best regards,


Daniel


 

ChemAxon e08c317633

16-07-2009 11:03:55

stoffler wrote:

BTW, is there a list of "molecule constants" one may pass to the match functions ("nitro", "carboxylate", etc.)? I didn't find it in the online documentation.



Hi Daniel,


See the text below the table:


http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#category10list


Link to the mols.smarts file:


http://www.chemaxon.com/marvin/help/chemicalterms/Evaluator_files/mols.smarts


Regards,


Zsolt

ChemAxon e08c317633

17-07-2009 14:00:56

Zsolt wrote:

$ evaluate -e "field('COMPOUND_ID'); disjointMatchCount('query.mol'); matchCount('query.mol')" mols.sdf
Cmpd_1;0;0
Cmpd_2;1;1
Cmpd_3;0;0
Cmpd_4;2;7
Cmpd_5;0;0



Daniel, we discovered a bug in the disjointMatchCount() function. If the input file (mols.sdf in the example above) contains multiple structures, only the result for the first hit (Cmpd_2 in the example) is guaranteed to be correct; the results for the second and later hits can be wrong. This bug affects only disjointMatchCount(); the matchCount() and field() functions are fine.


We will fix this bug; the fix will be available for download next week (in JChem 5.2.3.2).
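
Until the fix is installed, a cheap plausibility check on the output can catch some, though not all, wrong values: the non-overlapping count can never exceed the overlapping count, and the two counts are either both zero or both non-zero. A sketch of such a check, assuming the space-separated "ID count count" output shown earlier; check_counts is a made-up helper name, not part of JChem:

```shell
#!/bin/sh
# Flag output rows whose two counts contradict each other.
# Expected input lines: "<compound id> <non-overlapping count> <overlapping count>"
# The last two fields are used, so IDs containing spaces are tolerated.
check_counts() {
    awk '{
        d = $(NF-1) + 0   # non-overlapping (disjointMatchCount) value
        m = $NF + 0       # overlapping (matchCount) value
        # Suspect if d exceeds m, or if exactly one of the two is zero.
        if (d > m || (d == 0) != (m == 0))
            print "suspect: " $0
    }'
}
```

For example, piping the output of ./substructurecount.sh query.mol mols.sdf into check_counts prints only the rows that need a second look. A row that passes the check is not thereby proven correct.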


Best regards,


Zsolt