User 6f58eb8616
01-05-2009 14:40:35
JChem Version: 5.1.3_2
Environment: WinXP
Problem: We have many SMARTS which we would like to use with the JChem toolkit. Most work as expected, however, some use the recursive and component level functionality of the SMARTS specification , and this seems to be causing memory problems in JChem.
Here is an example:
try {
Molecule target = MolImporter.importMol("CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O","smiles:c");
Molecule smartsFilter;
smartsFilter = MolImporter.importMol("[$(C.C.C.[#7,#8,#16])]", "smarts");
MolSearch ms = new MolSearch();
ms.setTarget(target);
mol.aromatize();
ms.setQuery(smartsFilter);
int[] [] hits = null;
hits = ms.findAll();
if(hits!= null){
System.out.println("NUMBER OF HITS: " + hits.length);
}
} catch (MolFormatException e) {
e.printStackTrace();
} catch (SearchException e) {
e.printStackTrace();
}
When I run this kind of operation I am getting:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2760)
at java.util.Arrays.copyOf(Arrays.java:2734)
at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
at java.util.ArrayList.add(ArrayList.java:351)
at chemaxon.sss.search.StructureSearch.findNext0(StructureSearch.java:5875)
at chemaxon.sss.search.StructureSearch.findNext(StructureSearch.java:5749)
at chemaxon.sss.search.MolSearch.findNextEnumerated(MolSearch.java:987)
at chemaxon.sss.search.MolSearch.findNextFiltered(MolSearch.java:841)
at chemaxon.sss.search.MolSearch.findNext(MolSearch.java:732)
at chemaxon.sss.search.Search.getMatchingAtoms(Search.java:1019)
at chemaxon.sss.search.SmartsAtomMatcher.isRecursiveMatching(SmartsAtomMatcher.java:81)
at chemaxon.sss.search.SmartsAtomMatcher.evaluateSmartsAtomTree(SmartsAtomMatcher.java:363)
at chemaxon.sss.search.StructureSearch.compareAtomQueryProperties(StructureSearch.java:3949)
at chemaxon.sss.search.StructureSearch.compareAtoms(StructureSearch.java:4070)
at chemaxon.sss.search.StructureSearch.initMaps(StructureSearch.java:3148)
at chemaxon.sss.search.StructureSearch.findFirst0(StructureSearch.java:5644)
at chemaxon.sss.search.StructureSearch.findFirst(StructureSearch.java:5608)
at chemaxon.sss.search.MolSearch.findNextEnumerated(MolSearch.java:984)
at chemaxon.sss.search.MolSearch.findNextFiltered(MolSearch.java:841)
at chemaxon.sss.search.MolSearch.findFirst(MolSearch.java:672)
at chemaxon.sss.search.MolSearch.findAllFiltered(MolSearch.java:1040)
at chemaxon.sss.search.MolSearch.findAll(MolSearch.java:784)
The code snippet will work if I remove one of the components from the target molecule or remove a component from the SMARTS itself. If there are limitations on the type of molecules we can use for filtering in JChem then any info would be useful. I have checked the SMARTS we are using and they are all legal expressions according to the specification.
Any ideas?
Thanks muchly
ChemAxon aa7c50abf8
01-05-2009 14:48:09
How do you run your code? In particular: how much maximum heap size (the -Xmx switch to java.exe) is available for the JVM? (The ChemAxon command line tools use 200 megabyte by default.)
Thanks
Peter
ChemAxon a3d59b832c
01-05-2009 15:13:40
ChemAxon a3d59b832c
01-05-2009 15:16:11
(I am sorry, this forum engine makes funny things with the links, but at least it works now.)
User 6f58eb8616
01-05-2009 15:46:47
Hi Guys,
Thanks for the prompt response. I was running the code as a Unit Test in Eclipse, I'm not sure what the default -Xmx value is set to (for Eclipse itself it uses 512, but could be using quite a lot of that) , but I added a new JVM arg of "-Xmx512m" for the test runner. This got the test to pass, but it did take a while. I then tried the online version with a TARGET with another couple of components:
"CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O"
and the same SMARTS query as before. This produced the same Heap Space problem in the Online matcher:
org.apache.jasper.JasperException: Java heap space
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:449)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:371)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
root cause
javax.servlet.ServletException: Java heap space
org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846)
org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:779)
org.apache.jsp.examples.sss.index_jsp._jspService(index_jsp.java:329)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:328)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
root cause
java.lang.OutOfMemoryError: Java heap space
Is there an upper limit to the size/number of components of the TARGET that we can use. I suppose this is also dependent on the amount of memory we have at or disposal.
ChemAxon a3d59b832c
01-05-2009 16:08:31
There is no hardwired limit, but indeed the search process consumes more memory and is slower for large or many component targets. It would be much faster and require less memory if you searched the components one-by-one.
Please also note that the disconnected C-s in the recursive smarts give rise to a kind of combinatorial explosion in the intermediate results (matching of the recursive part).
(Recursive smarts is used to constrain for the environment of the atom in question. However, the disconnected C-s will match anywhere in the structure. So I wonder if this recursive smarts expression is the correct one you originally wanted to formulate.)
Best regards,
Szabolcs
User 6f58eb8616
01-05-2009 17:46:34
I agree with you about the SMARTS, I will look to rewrite them in a better way. My only worry is that in our case, users can create their own SMARTS for use in a search and potentially cause memory problems unless we somehow vet the SMARTS first for any potential recursive anomalies. Could be an interesting problem!
Thanks for the help.
ChemAxon 9c0afc9aaf
01-05-2009 18:07:27
ChemAxon 9c0afc9aaf
01-05-2009 18:11:45
PS: sorry, links do not seem to be working due to a problem with the forum engine, but you can copy/paste the first (correct) part.
ChemAxon a3d59b832c
02-05-2009 13:57:48
From the stack trace the problem is that the result of the recursive part (all matching index array) alone fills the memory.
To reduce the combinatoric explosion, we can change the recursive smarts semantics so that it requires all internal smarts atoms(in the recursive part) be in the same component. (As we already discussed, it is a sensible constraint, and it just means that we will include the brackets from $(...) into the smarts expression to search. :) ) It will also make searches with multi-component target run faster.
Let us know if you agree, and we can include it quickly in the next minor version.
Best regards,
Szabolcs
User 6f58eb8616
07-05-2009 08:53:08
Hi, sorry for late reply, this sounds like a good idea to me but I will need to get confirmation from a few people here (as I will need to amend various SMARTS we have , and possibly some I don't know about). Do you know if this is the same kind of semantics that Daylight uses (as there seems to be some kind of short circuit mechanism that deals with components in recursive SMARTS in the Daylight Tk)?
Anyhow, will be in touch as soon as I know.
Thanks again
User 6f58eb8616
07-05-2009 09:13:14
Actually, I think I misinterpreted your last post. If the components from the recursive search are moved into the overall SMARTS expression then we shouldnt have to change these recursive expressions at all. So we can leave our SMARTS as they are, they will just bring back results more efficiently. Have I got it right?
ChemAxon a3d59b832c
07-05-2009 11:46:28
No, I didn't mean that you should change the SMARTS expression. I only meant that the recursive SMARTS will have no "power" outside the component it is being matched.
For example:
Query: CCC[$(C.O)]
Target: CCCC.COC
Currently it matches, because the outer part of the SMARTS matches to the first component, and the recursive part to the first and the second.
My proposal would disallow the match in this case, because there exist no component
that could embrace both the outer and the recursive SMARTS.
From Daylight's depictmatch, it seems that they work the same way as our current system: the above query-target pair matches.
User 6f58eb8616
11-05-2009 10:46:41
Ah right got it. Yes, sounds like a good plan to me, but still waiting for some feedback from other folk here as the new behaviour will be a little different to Daylight even though its probably working in a beter way.
ChemAxon a3d59b832c
15-05-2009 10:04:38
Actually, we have an idea to improve the algorithm to avoid the combinatorial explosion in such cases without changing the semantics. We will try to incorporate it into JChem 5.2.3, which is planned for June.(JChem 5.2.2 is now too close to include it there.)
We will let you know how we progress.
Best regards,
Szabolcs
User c2ffbfa8f8
15-05-2009 10:13:44
That would definitely be the best case scenario from our point of view. Thanks very much for taking the time to find a solution.
Best wishes
Derek
ChemAxon a3d59b832c
12-06-2009 10:17:17
We fixed the above. Because of the improved algorithm, recursive smarts matching became faster also. It will be relased in version 5.2.3 (expected at the end of June or the first days of July).
Thanks for sharing this issue with us.
Best regards,
Szabolcs
User 6f58eb8616
25-06-2009 11:44:14
Thanks Szabolcs, I will test this out as soon as it is available.
All the best
Derek