User ee6724032a
20-04-2009 09:34:27
Hi,
I have quite a general task of performing a comparison of the two table for duplicates, followed by stroring the results in some other table. The only adviced solution I could find is to use Overlap Analysis of Instant JChem. Thought it work just fine, the resulting data is somewhat cumbersome to work with.
The other obvious way to do it is 'insert from' statement with jc_compare('t:d') and that's where I was hoping to get an advice on performace tunning.
The idea of the following statement is to map the IDs of structures in non-unique collection to those in unique collection.
INSERT INTO t_ids SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;
and it takes about 5.5 hours for 10k records in non-unique table and 9k records in unique table.
I must say that I'm quite rusty when it comes to sql and I might be missing something very basic here. So please, let me know if there is something I can do to speed up the procedure adove.
Thank you in advance
ChemAxon aa7c50abf8
20-04-2009 11:23:55
Hi,
How long does it take for the SELECT statement alone to execute?
SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;
Thanks
Peter
User ee6724032a
20-04-2009 12:02:04
Hi,
I ran it a couple of times and had to cancel it after about 3 hours.
Just to clarify, the tables were created by JChemManager and the data in both tables is the same, except for the 'allow duplicates' flag for the non-unique one.
Some info on the platform: Cartridge and Oracle 11g (x64) run on the same host under W2008 x64, 2x Xeon 5500 3.0, 16Gb, Raid 5 SAS HDD.
Thank you.
ChemAxon aa7c50abf8
20-04-2009 12:05:18
Hi,
What query plan do you get for the SELECT statement?
Thanks
Peter
User ee6724032a
20-04-2009 12:34:54
Hi,
The statement is following
Operation | Optimizer | Cost | Cardinality | Bytes | Acess Predicates |
---|
SELECT STATEMENT |
ALL_ROWS |
232687 |
110583159 |
234325713921 |
|
- NESTED LOOPS |
|
232687 |
110583159 |
234325713921 |
|
-- TABLE ACCESS(FULL) JCHEM.T_U |
|
11315 |
110583 |
222824745 |
|
-- TABLE ACCESS(BY INDEX ROWID) JCHEM.T_NU |
ANALYZED |
232687 |
1000 |
104000 |
|
--- DOMAIN INDEX JCHEM.IDX_T_NU |
|
|
|
|
"JCHEM"."JC_COMPAREB"("T_NU"."CD_STRUCTURE","T_U"."CD_STRUCTURE",'t:d')=1 |
User ee6724032a
20-04-2009 22:12:40
pkovacs wrote: |
Hi,
How long does it take for the SELECT statement alone to execute?
SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;
Thanks
Peter
|
My previous answer on that wasn't entirely accurate. The first resulting rows of the statement above are fetch just after 1.5 seconds. But if I do SELECT COUNT(t_nu.cd_id)... then it take more than 3 hours to complete.
ChemAxon aa7c50abf8
21-04-2009 09:34:34
Hi,
Using two tables both containing 10k structures, the select statement runs about 400 seconds on my laptop (T7250 Core Duo @ 2.00GHz). The query plan I got for the statement is very similar to yours.
Please, could you post the tkprof output of the following statement:
SELECT sum(t_nu.cd_id + t_u.cd_id) FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;
Thanks
Peter
User ee6724032a
21-04-2009 11:34:29
Hi,
I've started the autotrace and, by the look of it, it will take hours to complete. Meanwhile, is there anything else you want me to check, something like Statistics, Indices, JChem Properties etc?
Thank you.
ChemAxon aa7c50abf8
21-04-2009 13:58:46
Hi,
Collecting statics on the schema is always a good idea.
Have you tried using only a subset of the t_u, t_nu tables? I mean: it would be practical to have a manageable test case -- if possible.
Thanks
Peter