User acbf1b875b
05-05-2006 18:20:49
I am using the following sql for substructure search:
Code: |
SELECT cpd_sid FROM STRUCTURE WHERE
jc_compare(jc_smiles,'c1ccccc1',' sep=! t:s!filterQuery:SELECT rowid FROM STRUCTURE s WHERE (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=112) AND p.cpd_sid=s.cpd_sid))!stereoSearch:n')=1 |
and it takes 26 seconds. The filter query itself takes only 32 milliseconds and returns 3000 rows. STRUCTURE table contains about 2.5M rows. JChem version is 3.1.5.
Is there a way to speed it up?
ChemAxon a3d59b832c
08-05-2006 05:35:10
ChemAxon aa7c50abf8
08-05-2006 10:24:20
In order to help you put in context the topics Szabolcs suggested to review, I am listing the main steps which are performed internally by JChem Cartridge to execute your query:
1) The query parameters are sent to JChem Streams (in Tomcat)
2) The filter query is wrapped and executed as part of a slightly larger query:
Code: |
"SELECT cd_id FROM " + indexTable + " WHERE rid IN (" + filterQuery + ")" |
where "indexTable" is <index-name>_jcx
3) A structure search is executed using the JChem Java API on the cd_ids returned by the previous query
3a) "indexTable" is loaded into the structure cache, if not yet loaded
3b) A pre-screening is performed based on fingerprints
3c) A graph search is performed on the structures left over from the fingerprint pre-screening
4) The cd_ids of the hits are returned from JChem Streams to Oracle
5) The cd_ids are converted to rowids of the base table (in your case: of the table structure)
6) The rowids are returned to the Oracle Execution engine which internally processes them further.
In an attempt to reproduce your case, I set up a structure table "structure" with 2.6 million smiles and having a primary key column called cpd_sid. cpd_sid of type numeric(10,0) with numbers constantly incremented by 1 starting with 1 up to 2.6 million. I slightly reformulated your query as follows:
Code: |
SELECT cpd_sid FROM STRUCTURE WHERE
jc_compare(jc_smiles,'c1ccccc1',' sep=! t:s!filterQuery:SELECT rowid FROM STRUCTURE WHERE cpd_sid < 3200))!stereoSearch:n')=1
|
Since you said your entire filter query is executing within a fraction of a second, the fact that I use a simpler filter query is not relevant -- compared to the 26 seconds for the entire query. The relevant thing about the filter query is, I think, that it returns at least 3000 items -- and that is met by this simpler query as well.
I tested this query on a 3GHz Intel P4 with Hyper-Threading enabled, with Linux kernel 2.6.15 smp, Oracle 10.1.0.3.0 with 396 MB __db_cache_size, 16MB __java_pool_size, 4MB __large_pool_size, 148M __shared_pool_size. I ran my test on machine with very light workload.
The first time I ran this query with nothing in the structure cache on the Tomcat side, the query took 2 minutes 7 seconds to complete. Before I ran the query I turned on "Extra information to the standard error" in the JChem Cartridge administration GUI. The extra information can be retrieved in <tomcat-home>/logs/catalina.out. This extra information covers step 3.) above. Without anything in the structure I see this output: Quote: |
Mon May 08 11:46:25 CEST 2006
Search mode: SUBSTRUCTURE
Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX
Query: c1ccccc1
Screened: 2959
Hits: 2951
Cache loading: 123255 ms
Cache size (this table / total): 264.01 / 264.01 MBytes
Total time: 3917 ms Screening: 45 ms
Current / peak / maximum searches per minute: 1 / 1 / Unlimited
|
A significant amount of the time is accounted for by the cache loading (sub-step 3a) above): 123 seconds. Fingerprint-based prescreening (step 3b) above) takes 45 milliseconds; graph search (step 3c)) Total time - Screening: 3917 ms - 45 ms.
Running the query the second time (the structure table already already loaded into structure cache), it completes in 1.45 seconds and gives the following output in catalina.out: Quote: |
Mon May 08 11:52:02 CEST 2006
Search mode: SUBSTRUCTURE
Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX
Query: c1ccccc1
Screened: 2959
Hits: 2951
Total time: 1153 ms Screening: 19 ms
Current / peak / maximum searches per minute: 1 / 1 / Unlimited
|
The third run takes 0.89 seconds with the following extra info: Quote: |
Mon May 08 11:52:06 CEST 2006
Search mode: SUBSTRUCTURE
Structure table: JCHEMUSER315.JCXBRIM_SMILES_JCX
Query: c1ccccc1
Screened: 2959
Hits: 2951
Total time: 612 ms Screening: 3 ms
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
|
Please, could you provide the results for a similar test sequence in your environment?
It would be also useful to know
- the main parameters of your hardware (and eventually your operating system)
- the load of your system while running the query (heavy, light, medium?)
Thanks
Peter
User acbf1b875b
08-05-2006 17:56:39
Hi guys,
Thanks for prompt replies.
In our case the index is of course loaded. This is a production database, which otherwise works fine.
I've tried your example. You are right, when the query is on the same table, it takes very short time to perform, and brings a result in 2-3 seconds:
Quote: |
Mon May 08 10:47:03 PDT 2006
Search mode: SUBSTRUCTURE
Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX
Query: c1ccccc1
Screened: 3111
Hits: 2580
Total time: 2341 ms Screening: 3 ms
Processing threads: 2
Current / peak / maximum searches per minute: 3 / 5 / Unlimited |
The time becomes under a second if I run exactly same query several times, but only the first run corresponds to real-life scenario.
With my own query however, it still takes longer first time: Quote: |
Mon May 08 10:36:26 PDT 2006
Search mode: SUBSTRUCTURE
Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX
Query: c1ccccc1
Screened: 3057
Hits: 3023
Total time: 11138 ms Screening: 4 ms
Processing threads: 2
Current / peak / maximum searches per minute: 2 / 2 / Unlimited |
On the second and third run of exactly same query I get much better results: Quote: |
Mon May 08 10:43:09 PDT 2006
Search mode: SUBSTRUCTURE
Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX
Query: c1ccccc1
Screened: 3057
Hits: 3023
Total time: 556 ms Screening: 3 ms
Processing threads: 2
Current / peak / maximum searches per minute: 4 / 4 / Unlimited |
I checked again, my filter query by itself takes a negligible time. My initial impression was that you somehow don't utilize the index on cpd_sid in post_c table... The size of that table is ~110K records.
Our Hardware:
DB Server: 8 Dual-core CPUs (16 cores total) 1.35 Ghz Sun UltraSparc 4, 32Gb memory.
Tomcat server: Dual 2.2 Ghz 64 bit Opterons, 2Gb memory
ChemAxon aa7c50abf8
08-05-2006 20:43:32
Hi,
The output of JChem Java API for your query executed first time:
Quote: |
Mon May 08 10:36:26 PDT 2006
Search mode: SUBSTRUCTURE
Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX
Query: c1ccccc1
Screened: 3057
Hits: 3023
Total time: 11138 ms Screening: 4 ms
Processing threads: 2
Current / peak / maximum searches per minute: 2 / 2 / Unlimited |
It is this first time execution of your query where your entire SQL statement (with jc_compare) takes ~26 seconds, isn't it?
How much does it take for the entire SQL statement (with jc_compare) to execute for the second time -- when the JChem Java API search executes in 556 ms.
Thanks
Peter
ChemAxon aa7c50abf8
09-05-2006 06:12:33
Would it be possible to test queries similar to the following:
Code: |
SELECT cd_id FROM LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX WHERE rid IN (SELECT rowid FROM STRUCTURE s WHERE (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=:a) AND p.cpd_sid=s.cpd_sid))) |
where :a is a value which you think is meaninful for the test.
How long does this query take when run for the first time. Is it much slower than the enclosed filter query?
Thanks
Peter
User acbf1b875b
11-05-2006 18:03:56
It's becoming increasingly difficult to reproduce the problem, because the results are getting cached :)
Your example with variable binding run with an uncached variable in 736 ms. It is approximately 20 times slower than the filter query itself.
I run the substructure search again on another pid, and made sure I matched tomcat log to the query execution time:
Quote: |
Thu May 11 10:49:59 PDT 2006
Search mode: SUBSTRUCTURE
Structure table: LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX
Query: c1ccccc1
Screened: 4314
Hits: 4291
Total time: 13818 ms Screening: 4 ms
Processing threads: 2
Current / peak / maximum searches per minute: 2 / 204 / Unlimited |
Total time of executing the same query is 15501 ms.
ChemAxon aa7c50abf8
11-05-2006 18:43:20
Thank you very much, Andrey. This is valuable information, indeed.
But I am still itching to know: just how much concurrent workload is serviced by your hardware as you are doing these measurements. Or is it an otherwise idle system?
Thank you!
Peter
User acbf1b875b
11-05-2006 18:46:55
The system is not idle, but usage is very light. Average CPU load is 0%, according to our IT :)
While I was performing these tests nobody used the cartridge.
ChemAxon aa7c50abf8
12-05-2006 19:12:43
Hi Andrey,
Would it be possible to do an explain plan for
Code: |
SELECT cd_id FROM LDDB_WAREHOUSE.STRUCTURE$JC_IDX_JCX WHERE rid IN (SELECT rowid FROM STRUCTURE s WHERE (EXISTS (SELECT cpd_sid FROM post_c p WHERE (screen_id=<literal-screen-id>) AND p.cpd_sid=s.cpd_sid))) |
where <literal-screen-id> is a value with which you experienced slow searches. Please, try to run the explain plan so that the resulting plan matches the closest possible the plan used during the substructure search in Tomcat. In other words: try to eliminate the factors susceptible to cause a change in the execution plan. Please, review the "How Execution Plans Can Change" section of the chapter "Using EXPLAIN PLAN" in the "Oracle Database Performance Tuning Guide and Reference" for a list of the possible factors. Please, post the resulting plan possibly formatted (with select * from table(dbms_xplan.display()) for example).
It would also be interesting to know whether you use table statistics or not.
Thank you,
Peter