OutOfMemoryError on single search using JChem Toolkit

User 6f58eb8616

01-05-2009 14:40:35

JChem Version: 5.1.3_2

Environment: WinXP

Problem:  We have many SMARTS which we would like to use with the JChem toolkit.  Most work as expected, however, some use the recursive and component level functionality of the SMARTS specification , and this seems to be causing  memory problems in JChem.

Here is an example:


 






        try {
            Molecule target = MolImporter.importMol("CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O","smiles:c");
          
            Molecule smartsFilter;
            smartsFilter = MolImporter.importMol("[$(C.C.C.[#7,#8,#16])]", "smarts");
            MolSearch ms = new MolSearch();
            ms.setTarget(target);
            mol.aromatize();
            ms.setQuery(smartsFilter);
          
            int[] [] hits = null;

            hits = ms.findAll();
          
            if(hits!= null){
                System.out.println("NUMBER OF HITS: " + hits.length);              
            }
          
        } catch (MolFormatException e) {
            e.printStackTrace();
        } catch (SearchException e) {
            e.printStackTrace();
        }


 


 




 


When I run this kind of operation I am getting:


 




 



java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2760)
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
    at java.util.ArrayList.add(ArrayList.java:351)
    at chemaxon.sss.search.StructureSearch.findNext0(StructureSearch.java:5875)
    at chemaxon.sss.search.StructureSearch.findNext(StructureSearch.java:5749)
    at chemaxon.sss.search.MolSearch.findNextEnumerated(MolSearch.java:987)
    at chemaxon.sss.search.MolSearch.findNextFiltered(MolSearch.java:841)
    at chemaxon.sss.search.MolSearch.findNext(MolSearch.java:732)
    at chemaxon.sss.search.Search.getMatchingAtoms(Search.java:1019)
    at chemaxon.sss.search.SmartsAtomMatcher.isRecursiveMatching(SmartsAtomMatcher.java:81)
    at chemaxon.sss.search.SmartsAtomMatcher.evaluateSmartsAtomTree(SmartsAtomMatcher.java:363)
    at chemaxon.sss.search.StructureSearch.compareAtomQueryProperties(StructureSearch.java:3949)
    at chemaxon.sss.search.StructureSearch.compareAtoms(StructureSearch.java:4070)
    at chemaxon.sss.search.StructureSearch.initMaps(StructureSearch.java:3148)
    at chemaxon.sss.search.StructureSearch.findFirst0(StructureSearch.java:5644)
    at chemaxon.sss.search.StructureSearch.findFirst(StructureSearch.java:5608)
    at chemaxon.sss.search.MolSearch.findNextEnumerated(MolSearch.java:984)
    at chemaxon.sss.search.MolSearch.findNextFiltered(MolSearch.java:841)
    at chemaxon.sss.search.MolSearch.findFirst(MolSearch.java:672)
    at chemaxon.sss.search.MolSearch.findAllFiltered(MolSearch.java:1040)
    at chemaxon.sss.search.MolSearch.findAll(MolSearch.java:784)


 




 


The code snippet will work if I remove one of the components from the target molecule or remove a component from the SMARTS itself.  If there are limitations on the type of molecules we can use for filtering in JChem then any info would be useful.  I have checked the SMARTS we are using and they are all legal expressions according to the specification.

Any ideas?

Thanks muchly

ChemAxon aa7c50abf8

01-05-2009 14:48:09

How do you run your code? In particular: how much maximum heap size (the -Xmx switch to java.exe) is available for the JVM? (The ChemAxon command line tools use 200 megabyte by default.)


Thanks


Peter

ChemAxon a3d59b832c

01-05-2009 15:13:40

That's correct, more heap memory should help.


 


Your smiles and smarts structures work on our online substructure search page:


 


http://www.chemaxon.com/jchem/examples/sss/index.jsp />


 


Best regards,


Szabolcs

ChemAxon a3d59b832c

01-05-2009 15:16:11

(I am sorry, this forum engine makes funny things with the links, but at least it works now.)


 

User 6f58eb8616

01-05-2009 15:46:47

Hi Guys,


 


Thanks for the prompt response.  I was running the code as a Unit Test in Eclipse, I'm not sure what the default -Xmx value is set to (for Eclipse itself it uses 512, but could be using quite a lot of that) , but I added a new JVM arg of "-Xmx512m" for the test runner.  This got the test to pass, but it did take a while.  I then tried the online version with a TARGET with another couple of components:


"CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O.CC(C)[C@H](NC(=O)[C@H](CCCNC(=N)N)NC(=O)N[C@@H](Cc1ccccc1)C(=O)O)C(=O)N[C@@H](Cc2ccccc2)C=O"


and the same SMARTS query as before.  This produced the same Heap Space problem in the Online matcher:


 




 


org.apache.jasper.JasperException: Java heap space
    org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:449)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:371)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

javax.servlet.ServletException: Java heap space
    org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846)
    org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:779)
    org.apache.jsp.examples.sss.index_jsp._jspService(index_jsp.java:329)
    org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:328)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.OutOfMemoryError: Java heap space


 




 


 


Is there an upper limit  to the size/number of components of the TARGET that we can use.  I suppose this is also dependent on the amount of memory we have at or disposal.


 


 



ChemAxon a3d59b832c

01-05-2009 16:08:31

There is no hardwired limit, but indeed the search process consumes more memory and is slower for large or many component targets. It would be much faster and require less memory if you searched the components one-by-one.


Please also note that the disconnected C-s in the recursive smarts give rise to a kind of combinatorial explosion in the intermediate results (matching of the recursive part).


 


(Recursive smarts is used to constrain for the environment of the atom in question. However, the disconnected C-s will match anywhere in the structure. So I wonder if this recursive smarts expression is the correct one you originally wanted to formulate.)


 


Best regards,


Szabolcs

User 6f58eb8616

01-05-2009 17:46:34

I agree with you about the SMARTS, I will look to rewrite them in a better way.  My only worry is that in our case, users can create their own SMARTS for use in a search and potentially cause memory problems unless we somehow vet the SMARTS first for any potential recursive anomalies.  Could be an interesting problem!


Thanks for the help.


 



ChemAxon 9c0afc9aaf

01-05-2009 18:07:27

Hi,


 


There is a certain step limit for the graph search algorithm, so memory and CPU usage cannot grow to infinity:


http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/search/MolSearchOptions.html#setTimeoutLimit(int)


 


 


Szilard


 

ChemAxon 9c0afc9aaf

01-05-2009 18:11:45

PS: sorry, links do not seem to be working due to a problem with the forum engine, but you can copy/paste the first (correct) part.

ChemAxon a3d59b832c

02-05-2009 13:57:48

From the stack trace the problem is that the result of the recursive part (all matching index array) alone fills the memory.


To reduce the combinatoric explosion, we can change the recursive smarts semantics so that it requires all internal smarts atoms(in the recursive part) be in the same component. (As we already discussed, it is a sensible constraint, and it just means that we will include the brackets from $(...) into the smarts expression to search. :) ) It will also make searches with multi-component target run faster.


Let us know if you agree, and we can include it quickly in the next minor version.


 


Best regards,


Szabolcs

User 6f58eb8616

07-05-2009 08:53:08

Hi, sorry for late reply, this sounds like a good idea to me but I will need to get confirmation from a few people here (as I will need to amend various SMARTS we have , and possibly some I don't know about).  Do you know if this is the same kind of semantics that Daylight uses (as there seems to be some kind of short circuit mechanism that deals with components in recursive SMARTS in the Daylight Tk)?  


 


Anyhow, will be in touch as soon as I know.


 


Thanks again

User 6f58eb8616

07-05-2009 09:13:14

Actually, I think I misinterpreted your last post.  If the components from the recursive search are moved into the overall SMARTS expression then we shouldnt have to change these recursive expressions at all.  So we can leave our SMARTS as they are, they will just bring back results more efficiently. Have I got it right?

ChemAxon a3d59b832c

07-05-2009 11:46:28

No, I didn't mean that you should change the SMARTS expression. I only meant that the recursive SMARTS will have no "power" outside the component it is being matched.


 


For example:


Query: CCC[$(C.O)]


Target: CCCC.COC


 


Currently it matches, because the outer part of the SMARTS matches to the first component, and the recursive part to the first and the second.


My proposal would disallow the match in this case, because there exist no component
that could embrace both the outer and the recursive SMARTS.


 


From Daylight's depictmatch, it seems that they work the same way as our current system: the above query-target pair matches.


 

User 6f58eb8616

11-05-2009 10:46:41

Ah right got it.  Yes, sounds like a good plan to me, but still waiting for some feedback from other folk here as the new behaviour will be a little different to Daylight even though its probably working in a beter way.

ChemAxon a3d59b832c

15-05-2009 10:04:38

Actually, we have an idea to improve the algorithm to avoid the combinatorial explosion in such cases without changing the semantics. We will try to incorporate it into JChem 5.2.3, which is planned for June.(JChem 5.2.2 is now too close to include it there.)


We will let you know how we progress.


 


Best regards,


Szabolcs

User c2ffbfa8f8

15-05-2009 10:13:44

That would definitely be the best case scenario from our point of view.  Thanks very much for taking the time to find a solution.


Best wishes


Derek


 


 

ChemAxon a3d59b832c

12-06-2009 10:17:17

We fixed the above. Because of the improved algorithm, recursive smarts matching became faster also. It will be relased in version 5.2.3 (expected at the end of June or the first days of July).


 


Thanks for sharing this issue with us.


 


Best regards,


Szabolcs

User 6f58eb8616

25-06-2009 11:44:14

Thanks Szabolcs, I will test this out as soon as it is available.


 


All the best


 


Derek