Substructure search is taking long time

User 325f2762fd

18-04-2008 16:43:51

We are running the following query and it is taking 15 minutes. How we can improve the performance

COMPOUND table contains smiles and it has 2.2 million records

PROJECT_COMPOUNDS contains ID and has 6 million records

The other tables are small in the range of 1000. When I run without jchem search it is very quick.

SELECT DISTINCT COMPOUND.compound_id, COMPOUND.smiles,

COMPOUND.h_bond_acceptors, COMPOUND.h_bond_donors,

COMPOUND.rotatable_bonds, COMPOUND.Clogp,

COMPOUND.molecular_weight, COMPOUND.synthesisable

FROM COMPOUND,

PROJECT_COMPOUNDS,

eseek_projects,

evoseek.VM_USER_PROJECT_ACCESS

WHERE COMPOUND.compound_id = PROJECT_COMPOUNDS.compound_id

AND PROJECT_COMPOUNDS.project_id = eseek_PROJECTS.project_id

AND jc_contains (COMPOUND.smiles, 'c1ccccc1') = 1

AND evoseek.VM_USER_PROJECT_ACCESS.project_id = eseek_projects.project_id

AND evoseek.VM_USER_PROJECT_ACCESS.user_id = 434

ChemAxon aa7c50abf8

18-04-2008 17:07:25

Quote:

When I run without jchem search it is very quick.

What about running first without jchem search and then applying jchem search on the output of the first part?

User 325f2762fd

18-04-2008 17:10:18

Oracle is already doing that see the explain plan below

Code:

Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop

SELECT STATEMENT Optimizer Mode=ALL_ROWS    2 K 8004

HASH UNIQUE    2 K 330 K 8004

HASH JOIN    2 K 330 K 7925

TABLE ACCESS BY INDEX ROWID EVOSEEK.COMPOUND 22 K 2 M 7080

DOMAIN INDEX EVOSEEK.IXJC_COMPOUND_SMILES

NESTED LOOPS    103 K 2 M 558

NESTED LOOPS    4 44 1

INDEX FULL SCAN EVOSEEK.PK_PROJECTS 308 1 K 1

INDEX UNIQUE SCAN EVOSEEK.PK_VM_USER_PROJECT_ACCESS 1 7 0

INDEX RANGE SCAN EVOSEEK.UNQ_PROJECT_COMPS_PROJ_COMPID 24 K 238 K 139

ChemAxon aa7c50abf8

18-04-2008 17:41:28

Based on this explain plan, Oracle doesn't do what I meant. Oracle does a full domain index scan -- meaning that the substructure search is applied to every structure in the table. Looking at your query structure, JC_CONTAINS probably finds quite a lot of hits, which take quite a long time to return with most of them probably being subsequently discarded as the two legs of the execution plan are hash joined. (Returning hits from a domain index scan is a fairly time consuming operation and its duration increases linearly with the number of the hits returned.)

What I meant was to effectively reduce the input (and output) of the substructure search part of your query. I suggest to use the filterQuery option of the JC_COMPARE operator: http://www.chemaxon.com/jchem/doc/guide/cartridge/cartapi.html#jc_compare_filterQuery

This basically means that you pack the non-structure-search conditions (the project and user specific conditions) in a SELECT statement and then make this SELECT statement the value of the filterQuery option. This way, even though JC_COMPARE will still be executed in domain index scan mode, internally it will apply substructure search only to the structures returned by the filterQuery. This will result both in reduced time spent on substructure search and in (typically) substantially less hits returned by the domain index scan.

User 325f2762fd

18-04-2008 18:06:37

Fantastic. The query takes 3 minutes now after changing the query as per your suggestion.

Thanks for your help. See below the change made to the query

Code:

SELECT COMP.compound_id, COMP.smiles,

COMP.h_bond_acceptors, COMP.h_bond_donors,

COMP.rotatable_bonds, COMP.clogp,

COMP.molecular_weight, COMP.synthesisable

FROM evoseek.COMPOUND comp

WHERE jc_compare (COMP.smiles, 'c1ccccc1','sep=! t:s!filterQuery:select compound.rowid

         FROM evoseek.COMPOUND,

                     evoseek.PROJECT_COMPOUNDS,

evoseek.PROJECTS,

evoseek.VM_USER_PROJECT_ACCESS

   WHERE COMPOUND.compound_id = PROJECT_COMPOUNDS.compound_id

                  AND PROJECT_COMPOUNDS.project_id = PROJECTS.project_id

AND evoseek.VM_USER_PROJECT_ACCESS.project_id = PROJECTS.project_id

AND evoseek.VM_USER_PROJECT_ACCESS.user_id = 434') = 1

What is sep=! mean and also ! sign after t:s mean?

ChemAxon aa7c50abf8

18-04-2008 18:15:13

sep=! means that the options in the option list (the third argument of the jc_compare operator) will be separated by the '!' string (which consists in this case of just one character). By default, the options in the list are separated by space(s) which is not appropriate for option values which themselves may contain spaces (such as a SQL statemet). For further details please, see http://www.chemaxon.com/jchem/doc/guide/cartridge/index.html#option_list_parameter.