Filter Query Performance

User acbf1b875b

05-05-2006 18:20:49

I am using the following sql for substructure search:





Code:
SELECT cpd_sid FROM STRUCTURE WHERE


jc_compare(jc_smiles,'c1ccccc1',' sep=! t:s!filterQuery:SELECT rowid FROM STRUCTURE s WHERE  (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=112) AND p.cpd_sid=s.cpd_sid))!stereoSearch:n')=1






and it takes 26 seconds. The filter query itself takes only 32 milliseconds and returns 3000 rows. STRUCTURE table contains about 2.5M rows. JChem version is 3.1.5.





Is there a way to speed it up?

ChemAxon a3d59b832c

08-05-2006 05:35:10

Hi Andrey,





Was the cache loaded before this timing?





I recommend to have a look through these topics below concerned with search performance:





http://www.chemaxon.com/forum/ftopic106.html


http://www.chemaxon.com/forum/ftopic1357.html





We will also try to simulate your scenario and get back what we found.





Best regards,


Szabolcs

ChemAxon aa7c50abf8

08-05-2006 10:24:20

In order to help you put in context the topics Szabolcs suggested to review, I am listing the main steps which are performed internally by JChem Cartridge to execute your query:





1) The query parameters are sent to JChem Streams (in Tomcat)


2) The filter query is wrapped and executed as part of a slightly larger query:


Code:
"SELECT cd_id FROM " + indexTable + " WHERE rid IN (" + filterQuery + ")"



where "indexTable" is <index-name>_jcx


3) A structure search is executed using the JChem Java API on the cd_ids returned by the previous query


3a) "indexTable" is loaded into the structure cache, if not yet loaded


3b) A pre-screening is performed based on fingerprints


3c) A graph search is performed on the structures left over from the fingerprint pre-screening


4) The cd_ids of the hits are returned from JChem Streams to Oracle


5) The cd_ids are converted to rowids of the base table (in your case: of the table structure)


6) The rowids are returned to the Oracle Execution engine which internally processes them further.





In an attempt to reproduce your case, I set up a structure table "structure" with 2.6 million smiles and having a primary key column called cpd_sid. cpd_sid of type numeric(10,0) with numbers constantly incremented by 1 starting with 1 up to 2.6 million. I slightly reformulated your query as follows:


Code:
SELECT cpd_sid FROM STRUCTURE WHERE


jc_compare(jc_smiles,'c1ccccc1',' sep=! t:s!filterQuery:SELECT rowid FROM STRUCTURE WHERE cpd_sid < 3200))!stereoSearch:n')=1





Since you said your entire filter query is executing within a fraction of a second, the fact that I use a simpler filter query is not relevant -- compared to the 26 seconds for the entire query. The relevant thing about the filter query is, I think, that it returns at least 3000 items -- and that is met by this simpler query as well.





I tested this query on a 3GHz Intel P4 with Hyper-Threading enabled, with Linux kernel 2.6.15 smp, Oracle 10.1.0.3.0 with 396 MB __db_cache_size, 16MB __java_pool_size, 4MB __large_pool_size, 148M __shared_pool_size. I ran my test on machine with very light workload.





The first time I ran this query with nothing in the structure cache on the Tomcat side, the query took 2 minutes 7 seconds to complete. Before I ran the query I turned on "Extra information to the standard error" in the JChem Cartridge administration GUI. The extra information can be retrieved in <tomcat-home>/logs/catalina.out. This extra information covers step 3.) above. Without anything in the structure I see this output:
Quote:
Mon May 08 11:46:25 CEST 2006


Search mode: SUBSTRUCTURE


Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX


Query: c1ccccc1


Screened: 2959


Hits: 2951


Cache loading: 123255 ms


Cache size (this table / total): 264.01 / 264.01 MBytes


Total time: 3917 ms Screening: 45 ms


Current / peak / maximum searches per minute: 1 / 1 / Unlimited


A significant amount of the time is accounted for by the cache loading (sub-step 3a) above): 123 seconds. Fingerprint-based prescreening (step 3b) above) takes 45 milliseconds; graph search (step 3c)) Total time - Screening: 3917 ms - 45 ms.





Running the query the second time (the structure table already already loaded into structure cache), it completes in 1.45 seconds and gives the following output in catalina.out:
Quote:
Mon May 08 11:52:02 CEST 2006


Search mode: SUBSTRUCTURE


Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX


Query: c1ccccc1


Screened: 2959


Hits: 2951


Total time: 1153 ms Screening: 19 ms


Current / peak / maximum searches per minute: 1 / 1 / Unlimited


The third run takes 0.89 seconds with the following extra info:
Quote:
Mon May 08 11:52:06 CEST 2006


Search mode: SUBSTRUCTURE


Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX


Query: c1ccccc1


Screened: 2959


Hits: 2951


Total time: 612 ms Screening: 3 ms


Current / peak / maximum searches per minute: 2 / 2 / Unlimited


Please, could you provide the results for a similar test sequence in your environment?





It would be also useful to know





- the main parameters of your hardware (and eventually your operating system)


- the load of your system while running the query (heavy, light, medium?)





Thanks





Peter

User acbf1b875b

08-05-2006 17:56:39

Hi guys,





Thanks for prompt replies.





In our case the index is of course loaded. This is a production database, which otherwise works fine.





I've tried your example. You are right, when the query is on the same table, it takes very short time to perform, and brings a result in 2-3 seconds:
Quote:
Mon May 08 10:47:03 PDT 2006


Search mode: SUBSTRUCTURE


Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX


Query: c1ccccc1


Screened: 3111


Hits: 2580


Total time: 2341 ms Screening: 3 ms


Processing threads: 2


Current / peak / maximum searches per minute: 3 / 5 / Unlimited
The time becomes under a second if I run exactly same query several times, but only the first run corresponds to real-life scenario.








With my own query however, it still takes longer first time:
Quote:
Mon May 08 10:36:26 PDT 2006


Search mode: SUBSTRUCTURE


Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX


Query: c1ccccc1


Screened: 3057


Hits: 3023


Total time: 11138 ms Screening: 4 ms


Processing threads: 2


Current / peak / maximum searches per minute: 2 / 2 / Unlimited
On the second and third run of exactly same query I get much better results:
Quote:
Mon May 08 10:43:09 PDT 2006


Search mode: SUBSTRUCTURE


Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX


Query: c1ccccc1


Screened: 3057


Hits: 3023


Total time: 556 ms Screening: 3 ms


Processing threads: 2


Current / peak / maximum searches per minute: 4 / 4 / Unlimited
I checked again, my filter query by itself takes a negligible time. My initial impression was that you somehow don't utilize the index on cpd_sid in post_c table... The size of that table is ~110K records.





Our Hardware:


DB Server: 8 Dual-core CPUs (16 cores total) 1.35 Ghz Sun UltraSparc 4, 32Gb memory.


Tomcat server: Dual 2.2 Ghz 64 bit Opterons, 2Gb memory

ChemAxon aa7c50abf8

08-05-2006 20:43:32

Hi,





The output of JChem Java API for your query executed first time:
Quote:
Mon May 08 10:36:26 PDT 2006


Search mode: SUBSTRUCTURE


Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX


Query: c1ccccc1


Screened: 3057


Hits: 3023


Total time: 11138 ms Screening: 4 ms


Processing threads: 2


Current / peak / maximum searches per minute: 2 / 2 / Unlimited
It is this first time execution of your query where your entire SQL statement (with jc_compare) takes ~26 seconds, isn't it?





How much does it take for the entire SQL statement (with jc_compare) to execute for the second time -- when the JChem Java API search executes in 556 ms.





Thanks


Peter

User acbf1b875b

08-05-2006 22:45:59

Quote:
It is this first time execution of your query where your entire SQL statement (with jc_compare) takes ~26 seconds, isn't it?
Yes, that's correct. It is not 26 seconds every time, it ranges from 15 to 40 seconds depending on the value of p.screen_id in my example (the number of returned rows varies between 2000 and 6000). I cannot accurately reproduce exactly the same situation every time because the query is already cached in Oracle.
Quote:
How much does it take for the entire SQL statement (with jc_compare) to execute for the second time -- when the JChem Java API search executes in 556 ms.
Less than one second.

ChemAxon aa7c50abf8

09-05-2006 06:12:33

Would it be possible to test queries similar to the following:


Code:
SELECT cd_id FROM LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX WHERE rid IN (SELECT rowid FROM STRUCTURE s WHERE  (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=:a) AND p.cpd_sid=s.cpd_sid)))
where :a is a value which you think is meaninful for the test.





How long does this query take when run for the first time. Is it much slower than the enclosed filter query?





Thanks


Peter

User acbf1b875b

11-05-2006 18:03:56

It's becoming increasingly difficult to reproduce the problem, because the results are getting cached :)





Your example with variable binding run with an uncached variable in 736 ms. It is approximately 20 times slower than the filter query itself.





I run the substructure search again on another pid, and made sure I matched tomcat log to the query execution time:
Quote:
Thu May 11 10:49:59 PDT 2006


Search mode: SUBSTRUCTURE


Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX


Query: c1ccccc1


Screened: 4314


Hits: 4291


Total time: 13818 ms Screening: 4 ms


Processing threads: 2


Current / peak / maximum searches per minute: 2 / 204 / Unlimited
Total time of executing the same query is 15501 ms.

ChemAxon aa7c50abf8

11-05-2006 18:43:20

Thank you very much, Andrey. This is valuable information, indeed.





But I am still itching to know: just how much concurrent workload is serviced by your hardware as you are doing these measurements. Or is it an otherwise idle system?





Thank you!





Peter

User acbf1b875b

11-05-2006 18:46:55

The system is not idle, but usage is very light. Average CPU load is 0%, according to our IT :)


While I was performing these tests nobody used the cartridge.

ChemAxon aa7c50abf8

11-05-2006 20:21:05

Quote:
Your example with variable binding run with an uncached variable in 736 ms.
The 736 ms does not seem to be very significant compared to the total time of 13818 ms, does it? (The time to execute the filter query is actually included in the 13818 ms of the JChem Java API search.)





Would it be possible to run a few tests with my example query using a literal screen_id (no variable binding)? Does it make a huge difference? (JChem Cartridge is, of course, unable to substitute bind variables for literal values in the text of the filter queries.)





Thanks


Peter

User acbf1b875b

11-05-2006 21:17:57

pkovacs wrote:
The 736 ms does not seem to be very significant compared to the total time of 13818 ms, does it?


No, 736 is acceptable, although imho strange for such a simple query.


13 seconds is a little too much :)





Using variable binding in my filter query does not noticeably affect the performance.

ChemAxon aa7c50abf8

12-05-2006 19:12:43

Hi Andrey,





Would it be possible to do an explain plan for


Code:
SELECT cd_id FROM LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX WHERE rid IN (SELECT rowid FROM STRUCTURE s WHERE  (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=<literal-screen-id>) AND p.cpd_sid=s.cpd_sid)))



where <literal-screen-id> is a value with which you experienced slow searches. Please, try to run the explain plan so that the resulting plan matches the closest possible the plan used during the substructure search in Tomcat. In other words: try to eliminate the factors susceptible to cause a change in the execution plan. Please, review the "How Execution Plans Can Change" section of the chapter "Using EXPLAIN PLAN" in the "Oracle Database Performance Tuning Guide and Reference" for a list of the possible factors. Please, post the resulting plan possibly formatted (with select * from table(dbms_xplan.display()) for example).





It would also be interesting to know whether you use table statistics or not.





Thank you,


Peter