How do I search a large Markush database efficiently?

User 24ceacde10

02-08-2011 19:44:52

My company maintains a Markush database consisting of 500,000 Markush structures. This database is currently installed on a server running Oracle 11g with JChem Cartridge v5.5. The database was created as a Markush database using the default settings. I have written a demo application to perform a substructure search against the Markush database. The basic code is as shown below. When a search is run, the search progress messages reported are: Collecting fingerprints, and Loading cache. The status stays at the Loading cache state for hours and never seems to get past this step. Is there something I need to do to the Markush database to make searching more efficient?

search = new JChemSearch();

search.setStructureTable("MARKUSH_STRUCTS");

search.setQueryStructure(query);

search.setConnectionHandler(connectionHandler);

search.setRunMode(JChemSearch.RUN_MODE_ASYNCH_COMPLETE);

search.setSearchOptions(searchOptions);

try

{

search.run();

String currMessage, prevMessage = "";

while (search.isRunning()) {

Thread.sleep(1000);

currMessage = search.getProgressMessage();

if (!currMessage.equalsIgnoreCase(prevMessage)){

System.out.println("Search status: "+search.getProgressMessage());

}

prevMessage = currMessage;

}

System.out.println("Search resulted in "+search.getResultCount()+" hits.");

}

catch (Exception e) {e.printStackTrace();}

ChemAxon a3d59b832c

03-08-2011 09:20:14

Hi Chris,

The code seems to be correct, although it uses JChem Base to search the Markush table, not the Cartridge server.

A consequence of this is that structure cache is loaded in the client JVM, and not in JChem Server. Searching will also take place in the client JVM.

See architecture diagrams here:

http://www.chemaxon.com/jchem/doc/dev/dbconcepts/index.html#architecture

(Rich client architecture vs. JChem Cartridge architecture)

In this situation, I can imagine two bottlenecks:

1. Memory size of the client machine or its JVM. The JVM has to have enough memory to hold the cache, and related work memory. We measured that ~1.2M patent Markush structures (full MMS database) require about 15GB in cache size, plus several GB of working memory is needed for the search process, depending on the number of processor cores.

For example, a 64-core search was comfortable with a total JVM memory of 40 GB.

We have another test scenario of 700K patent Markush structures using a total JVM memory of 30 GB on 4 processor cores. (Unfortunately, I do not have information about the size of the cache in this case.)

These tests were using JChem 5.5, and 64-bit JVM. (32-bit JVM has a limit of ~2GB maximum memory.)

2. Network connection between the client and database server. During cache loading, basically all the cd_markush and cd_fp... fields are transferred from the database to the searching JVM. (In case of the Cartridge, it is the JChem server that may be installed on the same or different server machine as the Oracle server, but it is advisable to have a fast connection between the two.)

We should first check if these bottlenecks may be present, and then might need to check the logs or do some kind of instrumentation to see what is going on.

Best regards,

Szabolcs

User 24ceacde10

03-08-2011 12:14:49

You are correct that the search was performed directly against the Markush database. Eventually the following error message was reported by the client JVM running the search.

chemaxon.jchem.db.DatabaseSearchException: ERROR: Table "MARKUSH.MARKUSH_STRUCTS" could not fit into structure cache (out of memory).

Please visit http://www.chemaxon.com/jchem/FAQ.html#outofmemory and

http://www.chemaxon.com/jchem/FAQ.html#cacheSize

at chemaxon.jchem.db.JChemSearch.loadCacheIfNeeded(JChemSearch.java:2369)

at chemaxon.jchem.db.JChemSearch.search1(JChemSearch.java:2045)

at chemaxon.jchem.db.JChemSearch.search(JChemSearch.java:1879)

at chemaxon.jchem.db.JChemSearch.access$2800(JChemSearch.java:121)

at chemaxon.jchem.db.JChemSearch$SearchThread.run(JChemSearch.java:853)

Based on your response, I understand the bottleneck introduced by searching directly against the Markush database. I will review the JChem server documentation you referenced and try submitting Markush searches to the JChem server instead of the database.

User 24ceacde10

03-08-2011 16:44:36

I've created the user JCHEM with the role JCC_USER_ROLE as described in the post-installation instructions.

I can connect successfully using the url jdbc:oracle:thin:@<server>:port:sid with the owner account for the Markush database. I also can connect with the JCHEM user account, but when I attempt a search I get a "table not found error".

How do I perform a search uising the JCHEM user account?

ChemAxon aa7c50abf8

04-08-2011 12:59:35

Please, could you provide the SQL statement which fails along with the full error message?

User 24ceacde10

05-08-2011 12:50:40

I've solved the problem. I forgot to add the schema name to the tables being searched.