jchem cartridge 'perfect' seach performance tunning

User ee6724032a

20-04-2009 09:34:27

Hi,


I have quite a general task of performing a comparison of the two table for duplicates, followed by stroring the results in some other table. The only adviced solution I could find is to use Overlap Analysis of Instant JChem. Thought it work just fine, the resulting data is somewhat cumbersome to work with.


The other obvious way to do it is 'insert from' statement with jc_compare('t:d') and that's where I was hoping to get an advice on performace tunning.


The idea of the following statement is to map the IDs of structures in non-unique collection to those in unique collection.


INSERT INTO t_ids SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;

and it takes about 5.5 hours for 10k records in non-unique table and 9k records in unique table.


I must say that I'm quite rusty when it comes to sql and I might be missing something very basic here. So please, let me know if there is something I can do to speed up the procedure adove.


Thank you in advance

ChemAxon aa7c50abf8

20-04-2009 11:23:55

Hi,


How long does it take for the SELECT statement alone to execute?


SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;


Thanks


Peter

User ee6724032a

20-04-2009 12:02:04

Hi,


I ran it a couple of times and had to cancel it after about 3 hours.


Just to clarify, the tables were created by JChemManager and the data in both tables is the same, except for the 'allow duplicates' flag for the non-unique one.


Some info on the platform: Cartridge and Oracle 11g (x64) run on the same host under W2008 x64, 2x Xeon 5500 3.0, 16Gb, Raid 5 SAS HDD.


Thank you.

ChemAxon aa7c50abf8

20-04-2009 12:05:18

Hi,


What query plan do you get for the SELECT statement?


Thanks


Peter

User ee6724032a

20-04-2009 12:34:54

Hi,


The statement is following

















































OperationOptimizerCostCardinalityBytesAcess Predicates
SELECT STATEMENT ALL_ROWS 232687 110583159 234325713921
- NESTED LOOPS 232687 110583159 234325713921
-- TABLE ACCESS(FULL) JCHEM.T_U 11315 110583 222824745
-- TABLE ACCESS(BY INDEX ROWID) JCHEM.T_NU ANALYZED 232687 1000 104000
--- DOMAIN INDEX JCHEM.IDX_T_NU "JCHEM"."JC_COMPAREB"("T_NU"."CD_STRUCTURE","T_U"."CD_STRUCTURE",'t:d')=1

User ee6724032a

20-04-2009 22:12:40










pkovacs wrote:

Hi,


How long does it take for the SELECT statement alone to execute?


SELECT t_nu.cd_id, t_u.cd_id FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;


Thanks


Peter



My previous answer on that wasn't entirely accurate. The first resulting rows of the statement above are fetch just after 1.5 seconds. But if I do SELECT COUNT(t_nu.cd_id)... then it take more than 3 hours to complete.

ChemAxon aa7c50abf8

21-04-2009 09:34:34

Hi,


Using two tables both containing 10k structures, the select statement runs about 400 seconds on my laptop (T7250 Core Duo @ 2.00GHz). The query plan I got for the statement is very similar to yours.


Please, could you post the tkprof output of the following statement:


SELECT sum(t_nu.cd_id + t_u.cd_id) FROM t_nu, t_u WHERE jc_compareb(t_nu.cd_structure, t_u.cd_structure, 't:d') = 1;


Thanks


Peter

User ee6724032a

21-04-2009 11:34:29

Hi,


I've started the autotrace and, by the look of it, it will take hours to complete. Meanwhile, is there anything else you want me to check, something like Statistics, Indices, JChem Properties etc?


Thank you.

ChemAxon aa7c50abf8

21-04-2009 13:58:46

Hi,


Collecting statics on the schema is always a good idea.


Have you tried using only a subset of the t_u, t_nu tables? I mean: it would be practical to have a manageable test case -- if possible.


Thanks


Peter