Scripting GenerateMD for large database sets

User a18e201107

28-10-2012 21:43:16

Hello

I am trying to use the generatemd tool to create pharmacophore fingerprints on a set of 250K compounds and I am running into an issue with memory. Clearly from the error (below) this is not a Jchem out of memory issue, rather a MySQL one probably from trying to return too many results for the generatemd program to use. I was curious if there are any best practices for dealing with large sets of data whereby a database connection is made followed by a descriptor writer back to the DB? Is this something that should be scripted in a loop (10K structures or so at a time?) or is there a way to limit the reads from the command line?

As a followup I am hoping to then use the Screener tool which I suspect may have a similar issue. Since I cannot place all structures from my database in memory will this effect the outcome of the screen experiment? In otherwords does the screener tool need the full test set or can it be broken down into smaller more manageable test sets?

Thank you for any insight into the tools,

Dennis

C:\Program Files\ChemAxon\JChem\examples\config>generatemd c -a structures -k PF

pharmafinger -c pharma-frag.xml -d "com.mysql.jdbc.Driver" -u "jdbc:mysql://loc

alhost:3306/leadlike"

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

at com.mysql.jdbc.Buffer.<init>(Buffer.java:58)

at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1452)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2883)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:476)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2576)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1757)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2167)

at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)

at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2571)

at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1464)

at chemaxon.descriptors.GenerateMD.initSource(GenerateMD.java:953)

at chemaxon.descriptors.GenerateMD.init(GenerateMD.java:654)

at chemaxon.descriptors.GenerateMD.main(GenerateMD.java:1453)

ChemAxon efa1591b5a

29-10-2012 10:11:16

Hi,

Try to increase the heap space for generatemd by running it with the -Xmx option in the commandline:

screenmd -Xmx1024M .....your options and files here as before.....

Does this help?

Miklos

User a18e201107

30-10-2012 19:01:59

Miklos

Thank you kindly for the response. So I think that increasing the heap space could work for Screenmd and Generatemd in the case of 100K (even 500k) compounds. How about if we are talking 5 million compounds? What is the best way to generate molecular descriptors in this case / followed by screener? My thought was to script the molecular descriptors in batches of 250K or so and do the same thing for the screening tool, however depending on how the screener algorithm works the results could be incomplete if I am doing 250K batches (this is my assumption). I believe that screener simply calculates a 'distance' from the training set compounds, in this case breaking the analysis up into small batches does not matter, but if this is not the way screener works I would get mixed results...correct?

Thank you again for the help,

Dennis

ChemAxon efa1591b5a

31-10-2012 11:14:28

Hi Dennis,

Good point! In general, JDBC drivers should handle this via the fetchSize. This, however, did not work properly in MySql in the past. It is quite likely that the related issue has been fixed in the mysql driver since then but we have to check that.

I'll get back to you soon.

Regards,

Miklos

User a18e201107

09-11-2012 17:22:59

Miklos

Not sure if you have made any progress with exploring the JDBC Mysql driver, but I seem to have run into another issue which is potentially related.

After running the following statement

generatemd c -a structures -k PF pharmafinger -c pharma-frag.xml -d "com.mysql.jdbc.Driver" -u "jdbc:mysql://localhost:3306/leadlike" -l user -p password -q"cd_id < 200000"

which worked fine...

I tried

generatemd u structures pharmafinger -c pharma-frag.xml -d "com.mysql.jdbc.Driver" -u "jdbc:mysql://localhost:3306/leadlike" -l user -p password -q "cd_id between 200000 AND 200002"

just to see if I could simply update the database with some additional PF descriptors for compounds I did not try in the initial descriptor statement.

This also lead to a out of memory error- slightly different than my orginal post:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

at java.util.Arrays.copyOf(Unknown Source)

at java.util.Arrays.copyOf(Unknown Source)

at java.util.ArrayList.ensureCapacity(Unknown Source)

at java.util.ArrayList.add(Unknown Source)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2888)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:476)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2576)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1757)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2167)

at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)

at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2571)

at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1464)

at chemaxon.jchem.db.MDTableHandler.getMissingMDRows(MDTableHandler.java:597)

at chemaxon.jchem.db.MDTableHandler.incRegenerateDescriptorTable(MDTableHandler.java:432)

at chemaxon.descriptors.GenerateMD.updateMDTable(GenerateMD.java:1111)

at chemaxon.descriptors.GenerateMD.main(GenerateMD.java:1434)

This is interesting as I am only trying to update (or add) one calculated row. Not sure if I have the wrong syntax for updating here, but I tried restarting MySql Service to be sure I had flushed the memory, but still I got the above error. Is the generatemd program reading in all the currently calculated fingerprints? Seems odd! Is there a way to log what generatemd is trying to do so I can spool off the issue in more detail?

Thank you for your help, I know we will find a way to do this and I am happy to post my final scripts.

dennis

User a18e201107

09-11-2012 20:26:11

Miklos

In some investigation I did looks like there is a way to limit MySQL from reading in all rows on a query by adding

&useServerPrepStmts=true&useCursorFetch=true

to the connection string. Is there a way to add arguements to connections made via GenerateMD?

Thanks

Dennis

ChemAxon 1b9e90b2e7

09-11-2012 21:15:44

Hi Dennis,

I could reproduce the OutOfMemoryError issue with JChem 5.11.3 and MySQL server 5.5. The input file was one sdf file from Zinc druglike set: http://zinc.docking.org/db/bysubset/1/1_p0.0.sdf.gz

gunzip 1_p0.0.sdf.gz
jcman c zinc1
jcman a zinc1 1_p0.0.sdf
generatemd c -a zinc1 -k PF PF_zinc

The jdbc parameters were taken from the [user home]/chemaxon/.jchem file.

You should modifiy the JDBC url either in the .jchem file or in command line by adding:

jdbc:mysql://localhost:3306/leadlike?&useServerPrepStmts=true&useCursorFetch=true

The FP generation according to the process above was finished properly for me with this extra setting without the OOME.

The answer for your former question, namely "Will splitting the generation process modify the fingerprint" the answer is no, it will not, you can split. Although the real solution is setting the cursorFetch parameter as above.

Hope this helps and sorry for the late response.

Adrian

ChemAxon 1b9e90b2e7

09-11-2012 22:26:37

One important parameter was left out, so the jdbc url correctly:

jdbc:mysql://localhost:3306/leadlike? defaultFetchSize=1000&useServerPrepStmts=true&useCursorFetch=true

User a18e201107

10-11-2012 12:20:45

Adrian

Thank you so much for your response. I was missing the defaultFetchSize! Spoiled by Oracle I guess.

I did try running the ScreenMD app after generating the pharmacophorefingerprints and I am getting a null pointer excpetion:

screenmd My_queries.sdf -a structures -d "com.mysql.jdbc.Driver" -u "jdbc:mysql://localhost:3306/leadlike?defaultFetchSize=10000&useServerPrepStmts=true&useCursorFetch=true" -l user -p password -k pharmafinger -M Tanimoto Euclidean -o hitsMCL1.sdf

null

java.lang.NullPointerException

at chemaxon.descriptors.ScreenMD.initSimilarity(ScreenMD.java:249)

at chemaxon.descriptors.ScreenMD.main(ScreenMD.java:218)

I checked the pharmafinger descriptor table and it appears to have data in it. Any ideas what would lead to this?

Thanks again for your help on this

Dennis

ChemAxon 1b9e90b2e7

12-11-2012 22:45:05

Hi Dennis,

thank you for the bug report, we could reproduce the issue. We are working on to solve it asap.

Adrian

User a18e201107

12-11-2012 23:36:32

Adrian

Thank you for the response, let me know if I can help out in any way (testing or what not)

Dennis

ChemAxon 822473489f

13-11-2012 12:51:51

Dear Dennis,

we have fixed this bug, the fix will be released in 5.11.5 (expected in some days).

Let us know if you need a workaround until 5.11.5 is released.

Best regards,

Monika

User a18e201107

13-11-2012 19:37:28

Hi Monika

Thanks for the update, if you have a workaround I would be happy to try it!

Dennis

ChemAxon 822473489f

22-11-2012 16:30:32

Hi Dennis.

5.11.4 has been released, which contains this fix.

Best regards,

Monika