MolInputStream.getFormat()

User 870ab5b546

16-02-2009 04:09:44

Hi,

We have a line of code,

String formatString = (new MolInputStream(new ByteArrayInputStream(bytes))).getFormat();

We find that it hangs or goes into an infinite loop when bytes (a byte[]) represents a String that is longer than about 256 characters and is not a molecule.





-- Bob

ChemAxon aa7c50abf8

16-02-2009 08:00:38

Hi Bob,

Which JChem/Marvin version is it? Version 5.1.1 had a bug (FS#6571) which resulted in this (or very similar) behaviour.

Thanks

Peter

User 870ab5b546

16-02-2009 13:59:14

We're currently using JChem 5.1.0.

ChemAxon aa7c50abf8

16-02-2009 14:40:28

The bug presumably at work here was introduced in 5.1.0 and fixed in 5.1.3. I suggest trying JChem 5.1.3 or later to see whether the same problem persists.

User 870ab5b546

19-02-2009 01:49:39

Would this same bug also have manifested in MolImporter.importMol()?  That is, would it have become confused by a longish string (>= 250 characters) not in any recognized format, and gone into an infinite loop or other memory-hogging operation?

ChemAxon aa7c50abf8

19-02-2009 06:22:41

Yes, bug FS#6571 is in MolImporter.importMol(). It is not exactly an infinite loop; it is just very slow.

While trying to recognize the molecule format, MolImporter.importMol() lets each of the available format-handling modules have a go to see whether any of them can make sense of the input stream. This approach is fine as long as none of the format-handling modules ties up the process for too long.

This particular bug is about the IUPAC Name Import module spending an excessive amount of time figuring out what to make of the input stream. One of its characteristics is high CPU utilization.
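One way an application could protect itself from a slow format-recognition step is to bound how long it waits for the result. The following is a minimal sketch under stated assumptions: TimeLimitedCall and callOrFallback are hypothetical names invented for this illustration and are not part of JChem or Marvin, and only standard java.util.concurrent classes are used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a caller waits for a possibly slow task. */
final class TimeLimitedCall {

    // Daemon threads, so an abandoned task cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "time-limited-call");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task and returns its result, or returns fallback if the task
     * throws, is interrupted, or does not finish within timeoutMillis.
     */
    static <T> T callOrFallback(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed, e.g. the input was not a molecule.
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and give up on the task.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```

A caller might wrap the call from the first post as `TimeLimitedCall.callOrFallback(() -> new MolInputStream(new ByteArrayInputStream(bytes)).getFormat(), 2000, null)`, where the 2000 ms limit is an arbitrary choice. Note that this only caps the waiting time: if the abandoned task does not respond to interruption, it may continue to consume CPU in the background, so it is not a substitute for upgrading to a fixed version.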


User 870ab5b546

19-02-2009 14:43:32

Ah.  Well, you guys owe me a beer now.  I tried to give an exam with ACE last night, and this bug caused the server to slow down so much that I had to cancel the online part and just give it by paper.  The bug only manifested when 160 students were all triggering it at once.  Sigh.  At least the students don't seem too angry about it.

ChemAxon aa7c50abf8

19-02-2009 15:04:59

We're really sorry about this inconvenience.

It would be useful if we could verify that your problem is really a result of the above bug. It appears that an arbitrary, large input string doesn't necessarily trigger this behaviour.

I am going to move this topic to the appropriate Forum section (Drawing & visualization: Marvin/Sketch /View /Space Support for MarvinSketch, MarvinView, MarvinSpace, file formats and image generation.)

User 870ab5b546

19-02-2009 15:46:34

I would love some help verifying it.  It doesn't appear to be a problem when it's just little ol' me sitting here submitting strings.  It only manifests when a large number of people (160 or so) are submitting strings all at once.  Do you have a way of conducting a load test on ACE?

ChemAxon aa7c50abf8

19-02-2009 16:02:07

This really does make some difference: 160 concurrent users...

How did you identify the problematic line of code? Did you do a thread dump?
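For readers unfamiliar with thread dumps: on a HotSpot JVM, one can usually be obtained from outside the process with the JDK's `jstack <pid>` tool, or on Unix-like systems by sending the process a QUIT signal (`kill -QUIT <pid>`), which prints the dump to the JVM's standard output. A dump can also be produced from inside the application with the standard Thread.getAllStackTraces() call. The sketch below assumes a hypothetical helper class named ThreadDumper; where the resulting text is written (a log file, an admin page) is left to the application.

```java
import java.util.Map;

/** Hypothetical helper: renders the stack of every live thread as text. */
final class ThreadDumper {

    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and current state (RUNNABLE, BLOCKED, ...).
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            // One line per stack frame, innermost call first.
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Taking two or three dumps a few seconds apart while the server is slow, and looking for many RUNNABLE threads sitting in the same method, is a common way to confirm which call is consuming the CPU.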

User 870ab5b546

19-02-2009 16:40:38

No, I don't know how to do a thread dump.

First I noticed that the problem occurred only when students were submitting long strings, not short ones. But I knew that long molecule strings didn't cause a problem, so it had to be related to the fact that they were text strings. So I looked around to find where in the code a text string might cause a problem where a molecule string wouldn't, and I found the getFormat() call. We had it in a try block, and we simply caught the exception when it wasn't a molecule. When I commented out that code, it made a big difference in performance. Unfortunately, there was another part of the code where we also relied on an exception to tell us that a string was not a molecule, and that was what caused the big slowdown last night.

User 870ab5b546

19-02-2009 21:44:35

I just did a test that I think confirms the problem, though not unequivocally. I repeatedly submitted a longish test string, and via top I watched the CPU usage climb to 54%. Then I eliminated the call to MolImporter.importMol() and did the same; this time, the CPU usage never reached 8%.

ChemAxon aa7c50abf8

20-02-2009 13:22:59

I created a test case (attached) to see what happens when the code you identified

Code:
String formatString = (new MolInputStream(new ByteArrayInputStream(bytes))).getFormat();

is executed truly concurrently in 160 instances. Using JChem 5.1.0 and feeding it a large non-molecule text file, the test case completed in about 1 minute on my dual-core laptop. I am not sure whether this duration can be considered excessive. With 5.1.4, it takes about 45 seconds.

If you experience delays much longer than 1 minute, then we might need to know more about either the input or the environment in which you are experiencing this problem. It is entirely possible that the source of the problem lies somewhere else completely.

For the moment, I think the best approach would be to try reproducing the problem with your application. As I understand it, this is a web application, so you could try sending it a largish number of concurrent requests of the kind you suspect led to the hang, to simulate the behaviour you experienced in production. Once you are able to reproduce the problem in a test environment, we can proceed to locate the source.
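The attached test case is not reproduced in this thread. As a rough illustration of the idea only, a harness that releases many workers at the same moment and times how long they all take could look like the sketch below. ConcurrentRunTimer and timeConcurrentRuns are hypothetical names, and the task passed in stands for whatever call is under suspicion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical harness: runs the same task on many threads at once and times it. */
final class ConcurrentRunTimer {

    /** Returns the wall-clock milliseconds needed for all workers to finish. */
    static long timeConcurrentRuns(int workers, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch startGate = new CountDownLatch(1);
        List<Future<?>> results = new ArrayList<>();
        try {
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(() -> {
                    startGate.await();   // hold each worker until the gate opens
                    task.run();
                    return null;
                }));
            }
            long start = System.nanoTime();
            startGate.countDown();       // release all workers together
            for (Future<?> result : results) {
                result.get();            // surfaces any worker failure
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Calling `ConcurrentRunTimer.timeConcurrentRuns(160, task)` with a task that runs the getFormat() line above on a long non-molecule input (catching the expected exception inside the task) would give a number comparable across JChem versions. It exercises only the library call, though, so it does not replace a load test against the real web application.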

ChemAxon aa7c50abf8

20-02-2009 13:47:32

Abstracting further away from MolInputStream and bug FS#6571: another problem has come to mind which manifested itself in symptoms similar to those you reported. An early version of one of the third-party libraries (dom4j.jar, which is used for XML processing) used to go into a kind of infinite loop when executed in a concurrent environment. As JChem uses XML processing in a great number of places behind the scenes, it is within the realm of possibility that you ran into this problem. We believe we solved this problem in JChem version 5.1.5 by shipping JChem with a recent version of dom4j.

User 870ab5b546

20-02-2009 13:58:39

OK, we will upgrade to JChem 5.1.5, and next week I will have my class do another test with text questions.  Meanwhile, I have taken care to ensure that we only attempt to parse as Molecules those strings that we already know are in an appropriate format.
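The workaround described here, parsing only responses already known to be molecules rather than relying on a parse exception to detect free text, might be sketched as follows. Everything in the sketch is an assumption made for illustration: ResponseKind, ResponseRouter and parseIfMolecule are hypothetical names, and the parser argument stands in for whatever call the application uses, such as MolImporter.importMol().

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical tag recording what kind of answer a question expects. */
enum ResponseKind { MOLECULE, FREE_TEXT }

/** Hypothetical router: only molecule responses ever reach the parser. */
final class ResponseRouter {

    /**
     * Invokes the parser only when the response is declared to be a molecule.
     * Free-text responses return empty without touching the parser at all.
     */
    static <M> Optional<M> parseIfMolecule(ResponseKind kind, String response,
                                           Function<String, M> parser) {
        if (kind != ResponseKind.MOLECULE) {
            return Optional.empty();
        }
        return Optional.ofNullable(parser.apply(response));
    }
}
```

The point of the design is that the decision is made from information the application already has (the question type), so format recognition never runs on long free-text answers.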