User 74d30c678b
15-07-2009 08:45:12
Hi,
We have a command-line tool built on MDL functionality that I would like to re-implement using ChemAxon. This is what it does:
Input: a query.mol and an SD file with the compounds to match
Output: Compound_ID, count of non-overlapping substructure matches, count of overlapping substructure matches
Compound_id is a tag in the SD file "<COMPOUND_ID>". Also, non-matches have to be output as well. Here is a mock-up example of the output:
Cmpd_A 0 0
Cmpd_B 1 2
Cmpd_C 0 0
Cmpd_D 2 2
Cmpd_E 0 0
Cmpd_F 0 0
...
How could this be implemented? Using evaluate or jcsearch?
The tool should at least output the compound_id and the count of non-overlapping substructure matches. The substructure definition must be a mol file, not a SMARTS string.
Any suggestions are welcome.
Thanks and best regards,
Daniel
ChemAxon e08c317633
15-07-2009 17:29:20
Hi Daniel,
You can use the Chemical Terms evaluator (evaluate) for that. The matchCount() and disjointMatchCount() functions need to be used for the matching.
See documentation:
http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#category10list
The field() function can be used to return the sdf tag:
http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#ct35lnk
The fourth match example on this page:
http://www.chemaxon.com/marvin/help/chemicalterms/ChemicalTerms.html
shows how external query molecules can be used:
4. The same with referencing the query by molecule file path: match(2, "mols/query.mol", 1, 2)
Putting it all together:
$ evaluate -e "field('COMPOUND_ID'); disjointMatchCount('query.mol'); matchCount('query.mol')" mols.sdf
Cmpd_1;0;0
Cmpd_2;1;1
Cmpd_3;0;0
Cmpd_4;2;7
Cmpd_5;0;0
query.mol is a molfile that contains the query, mols.sdf contains the input molecules.
A simple shell script can set the query file for matchCount and disjointMatchCount, and replace the semicolons with spaces in the output:
$ ./substructurecount.sh query.mol mols.sdf
Cmpd_1 0 0
Cmpd_2 1 1
Cmpd_3 0 0
Cmpd_4 2 7
Cmpd_5 0 0
The solution also works with a SMARTS query:
$ ./substructurecount.sh "CCN" mols.sdf
Cmpd_1 0 0
Cmpd_2 1 1
Cmpd_3 0 0
Cmpd_4 2 6
Cmpd_5 0 0
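The script itself is not reproduced in the thread. A minimal sketch of what substructurecount.sh might look like, assuming evaluate is on the PATH and the query contains no single quotes (the helper names build_expression and semicolons_to_spaces are hypothetical):

```shell
#!/bin/sh
# substructurecount.sh (sketch, not the original attachment)
# Usage: ./substructurecount.sh QUERY INPUT.sdf
# QUERY is a molfile path or a SMARTS string; INPUT.sdf holds the targets.

# Build the Chemical Terms expression for the given query.
# Assumption: the query contains no single quote characters.
build_expression() {
    printf "field('COMPOUND_ID'); disjointMatchCount('%s'); matchCount('%s')" "$1" "$1"
}

# Replace the semicolon separators printed by evaluate with spaces.
semicolons_to_spaces() {
    tr ';' ' '
}

# Run the search only when both arguments are given; with no arguments
# the file just defines the helper functions above.
if [ "$#" -eq 2 ]; then
    evaluate -e "$(build_expression "$1")" "$2" | semicolons_to_spaces
fi
```

The expression is the same one used in the evaluate call above; only the query argument is substituted in.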
The example molecules are attached.
Best regards,
Zsolt
User 74d30c678b
16-07-2009 07:03:33
Dear Zsolt,
many, many thanks for your quick reply. It works perfectly.
I played with disjointMatchCount, but I wasn't aware that one may input a mol file; I only found examples with SMARTS strings.
BTW, is there a list of "molecule constants" one may pass to the match functions ("nitro", "carboxylate", etc.)? I didn't find it in the online documentation.
Thanks again and best regards,
Daniel
ChemAxon e08c317633
16-07-2009 11:03:55
ChemAxon e08c317633
17-07-2009 14:00:56
Zsolt wrote:
$ evaluate -e "field('COMPOUND_ID'); disjointMatchCount('query.mol'); matchCount('query.mol')" mols.sdf
Cmpd_1;0;0
Cmpd_2;1;1
Cmpd_3;0;0
Cmpd_4;2;7
Cmpd_5;0;0
Daniel, we discovered a bug in the disjointMatchCount() function. If the input file (mols.sdf in the example above) contains multiple structures, only the result for the first hit (Cmpd_2 in the example) is guaranteed to be correct; the results for the second and later hits can be wrong. This bug affects only disjointMatchCount(); the matchCount() and field() functions are fine.
We will fix this bug, and the fix will be available for download next week (in JChem 5.2.3.2).
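Until the fixed version is installed, one possible way to cross-check the disjointMatchCount() values is to run evaluate on one structure at a time, since the first hit in a file is reported correctly. The following is only a sketch: the split_sdf helper and the rec_NNNNNN.sdf file naming are hypothetical, it assumes every record ends with a "$$$$" line, and it starts evaluate once per record, so it will be slow on large files.

```shell
#!/bin/sh
# Cross-check sketch: usage: ./disjointcheck.sh QUERY INPUT.sdf

# Split an SD file ($1) into one file per record inside directory $2.
# Records are assumed to end with a "$$$$" line; anything after the
# last "$$$$" is ignored.
split_sdf() {
    awk -v dir="$2" '
        { rec = rec $0 "\n" }
        /^\$\$\$\$/ {
            out = sprintf("%s/rec_%06d.sdf", dir, ++n)
            printf "%s", rec > out
            close(out)
            rec = ""
        }
    ' "$1"
}

# Run only when both arguments are given.
if [ "$#" -eq 2 ]; then
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    split_sdf "$2" "$tmpdir"
    # Each file holds a single structure, so it can contain at most one hit.
    for rec in "$tmpdir"/rec_*.sdf; do
        evaluate -e "field('COMPOUND_ID'); disjointMatchCount('$1')" "$rec"
    done
fi
```

Comparing this output with the multi-structure run should show which counts, if any, were affected.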
Best regards,
Zsolt