Substructure search performance tuning for a large database

User 0d64032b61

11-04-2014 02:55:14

Hi ChemAxon,


I am using JChem Base with PostgreSQL 9.2 to manage a large database of more than 20 million structures. My system runs CentOS 6.5 on an Intel i7 CPU with 8 cores and 12 GB of memory. I followed the JChem documentation on tuning substructure search performance with PostgreSQL, for example in postgresql.conf:


shared_buffers = 8192MB
temp_buffers = 1024MB
work_mem = 512MB
maintenance_work_mem = 1024MB
max_stack_depth = 8MB


I also set HEAP_LIMIT to 8192 and SERVER_MODE to true in the jcsearch script file, then ran this command to do a substructure search and export the results:


jcsearch -q "Fc1ccc(Cl)cc1CNCC(=O)N" -t:s -f smiles DB:my_structure_table -o myhits.smiles


There are around 80 hits. However, the whole process took more than 5 minutes, which is unacceptably long for my application. Have I missed something in the performance tuning? Could you please give me some suggestions? Thanks.


I wonder whether the second problem below has a similar cause:


I wanted to run the command jcman s my_structure_table to view the table statistics, but it threw an exception within 1 minute:


Collecting statistics for table: my_structure_table ...
java.lang.OutOfMemoryError: Java heap space
    at java.lang.Class.privateGetDeclaredFields(Class.java:2305)
    at java.lang.Class.getDeclaredField(Class.java:1882)
    at java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.<init>(AtomicReferenceFieldUpdater.java:181)
    at java.util.concurrent.atomic.AtomicReferenceFieldUpdater.newUpdater(AtomicReferenceFieldUpdater.java:65)
    at java.sql.SQLException.<clinit>(SQLException.java:353)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1817)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:374)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:254)
    at chemaxon.jchem.db.TableStatistics.collect(TableStatistics.java:205)
    at chemaxon.jchem.db.TableStatistics.getPrintableStats(TableStatistics.java:126)
    at chemaxon.jchem.db.TableStatistics.collectStatisitcs(TableStatistics.java:72)
    at chemaxon.jchem.Command.printTableStatistics(Command.java:1491)
    at chemaxon.jchem.Command.run(Command.java:816)
    at chemaxon.jchem.Command.main(Command.java:272)


Thanks for your help in advance.


CT

ChemAxon 25dcd765a3

11-04-2014 12:14:23

Hi,


Which JChem version are you using?


It seems that the Java heap is too small.


Could you please try running jcsearch with increased heap:


jcsearch -Xmx 14g ...


We have also seen cases where the heap was just large enough to avoid an error, but the search was much slower.


You can monitor your heap usage with jvisualvm to see if you have provided enough heap.
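
If the searches end up running inside your own Java application, a quick in-process check can complement jvisualvm. The following is only a minimal sketch using the standard java.lang.management API. The class name HeapReport is made up for illustration, and it reports on the JVM it runs in, not on a separate jcsearch process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: prints the heap usage of the current JVM. */
public class HeapReport {

    /** Converts a byte count to whole mebibytes (rounded down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Formats used, committed and maximum heap sizes in MiB. */
    static String describe(MemoryUsage heap) {
        long max = heap.getMax(); // -1 means the maximum is undefined
        String maxText = max < 0 ? "undefined" : toMiB(max) + " MiB";
        return "heap used=" + toMiB(heap.getUsed()) + " MiB"
                + ", committed=" + toMiB(heap.getCommitted()) + " MiB"
                + ", max=" + maxText;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe(memory.getHeapMemoryUsage()));
    }
}
```

If "used" stays close to "max" while the cache is loading, the heap is probably too small for the table.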

User 0d64032b61

11-04-2014 13:33:09

Hi volfi,


Thanks for your help. I am using JChem 6.2.1, which runs on its own bundled Java 1.6 (Java home: /opt/i4j_jres/1.6.0_45). My system JDK is version 1.7, but JChem uses the 1.6 it ships with. I ran the command with -Xmx 14g as you suggested, but it did not seem to improve performance. I ran jvisualvm and got the following message:



When I created the structure table, I set the fingerprint length to 16 x 4 bytes (512 bits), the bits to be set for patterns to 2, and the maximum pattern length to 6 bonds, with no structural keys, since the JChem documentation says these parameters should be fine for a large database. Do they need tuning?


CT

ChemAxon 25dcd765a3

11-04-2014 14:20:47

OK, the jcsearch problem is quite clear to me now:


jcsearch -q "Fc1ccc(Cl)cc1CNCC(=O)N" -t:s -f smiles DB:my_structure_table -o myhits.smiles

There are around 80 hits. However, the whole process took more than 5 minutes, which is unacceptably long for my application. Have I missed something in the performance tuning? Could you please give me some suggestions? Thanks.

So you have 20M structures, which means that JChem needs to load the structure cache for 20M molecules once before searching. On one of our test systems, the cache load time for 10M molecules is 874 s, which is more than 10 minutes.


The jcsearch command line tool connects to the database server, loads the cache, and runs the search on every run.


This explains the search time you observed: a long cache load, followed by a fast search.
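
To put rough numbers on this, here is a small back-of-the-envelope sketch. The 874 s load time is the 10M-molecule test figure mentioned above. The 2 s per search and the 100 queries are purely assumed values for illustration, and the class name CacheAmortization is hypothetical.

```java
/** Hypothetical helper: compares total time with and without cache reuse. */
public class CacheAmortization {

    /** Total seconds when every run reloads the cache before searching. */
    static long reloadEveryRunSeconds(long loadSeconds, long searchSeconds, int queries) {
        return (long) queries * (loadSeconds + searchSeconds);
    }

    /** Total seconds when the cache is loaded once and shared by all searches. */
    static long loadOnceSeconds(long loadSeconds, long searchSeconds, int queries) {
        return loadSeconds + (long) queries * searchSeconds;
    }

    public static void main(String[] args) {
        long loadSeconds = 874;  // cache load time quoted for 10M molecules
        long searchSeconds = 2;  // ASSUMED time of one search on a warm cache
        int queries = 100;       // ASSUMED number of searches

        System.out.println("reload on every run: "
                + reloadEveryRunSeconds(loadSeconds, searchSeconds, queries) + " s");
        System.out.println("load once:           "
                + loadOnceSeconds(loadSeconds, searchSeconds, queries) + " s");
    }
}
```

With these assumed inputs the totals are 87600 s versus 1074 s, so nearly all the time in the per-run case goes into reloading the cache.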


With the API, you can start the JChem server once and run multiple searches with a single cache load (the cache must be loaded before the first search).


So you don't need to fine-tune the fingerprints; instead, build a client-server model.
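
The load-once idea itself can be sketched without any JChem classes. The code below is only an illustration of the pattern, not the JChem API: LoadOnceSearchService is a hypothetical name, the "cache" is a three-entry list, and the "search" is plain substring matching rather than a chemical substructure search. In a real application the loader and searcher would wrap the corresponding JChem calls. It uses lambdas, so it needs Java 8 or later, which is newer than the bundled Java 1.6.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Supplier;

/** Holds an expensive-to-load cache and reuses it for every search. */
public class LoadOnceSearchService<C> {
    private final Supplier<C> cacheLoader;                       // slow, should run once
    private final BiFunction<C, String, List<String>> searcher;  // fast once the cache exists
    private volatile C cache;                                    // null until the first search

    public LoadOnceSearchService(Supplier<C> cacheLoader,
                                 BiFunction<C, String, List<String>> searcher) {
        this.cacheLoader = cacheLoader;
        this.searcher = searcher;
    }

    /** Runs one search, loading the cache first if it is not loaded yet. */
    public List<String> search(String query) {
        return searcher.apply(loadedCache(), query);
    }

    /** Double-checked locking so concurrent callers trigger only one load. */
    private C loadedCache() {
        C local = cache;
        if (local == null) {
            synchronized (this) {
                local = cache;
                if (local == null) {
                    local = cacheLoader.get();
                    cache = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Toy stand-ins only: NOT a chemical substructure search.
        LoadOnceSearchService<List<String>> service = new LoadOnceSearchService<List<String>>(
            () -> {
                System.out.println("loading cache (slow step, happens once)");
                return Arrays.asList("CCO", "CCN", "c1ccccc1");
            },
            (structures, query) -> {
                List<String> hits = new ArrayList<String>();
                for (String s : structures) {
                    if (s.contains(query)) hits.add(s);
                }
                return hits;
            });
        System.out.println(service.search("CC"));
        System.out.println(service.search("c1"));
    }
}
```

A long-running server process would keep one such object alive and route every client request through search(), so only the first request pays the loading cost.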

User 0d64032b61

12-04-2014 01:01:10

Many thanks volfi,


I will try using the API to build the application in a client/server (C/S) model. Is there a ready-made C/S program based on JChem Base that I can use to test my database right now?


Best regards,


CT

ChemAxon 25dcd765a3

15-04-2014 08:30:05

 


Is there a ready-made C/S program based on JChem Base that I can use to test my database right now?

No, you will need to write your own if you want it in Java.


We have a web service, which is a general solution for such problems.