Substructure search is taking long time

User 325f2762fd

18-04-2008 16:43:51

We are running the following query and it is taking 15 minutes. How we can improve the performance





COMPOUND table contains smiles and it has 2.2 million records


PROJECT_COMPOUNDS contains ID and has 6 million records


The other tables are small in the range of 1000. When I run without jchem search it is very quick.





SELECT DISTINCT COMPOUND.compound_id, COMPOUND.smiles,


COMPOUND.h_bond_acceptors, COMPOUND.h_bond_donors,


COMPOUND.rotatable_bonds, COMPOUND.Clogp,


COMPOUND.molecular_weight, COMPOUND.synthesisable


FROM COMPOUND,


PROJECT_COMPOUNDS,


eseek_projects,


evoseek.VM_USER_PROJECT_ACCESS


WHERE COMPOUND.compound_id = PROJECT_COMPOUNDS.compound_id


AND PROJECT_COMPOUNDS.project_id = eseek_PROJECTS.project_id


AND jc_contains (COMPOUND.smiles, 'c1ccccc1') = 1


AND evoseek.VM_USER_PROJECT_ACCESS.project_id = eseek_projects.project_id


AND evoseek.VM_USER_PROJECT_ACCESS.user_id = 434

ChemAxon aa7c50abf8

18-04-2008 17:07:25

Quote:
When I run without jchem search it is very quick.
What about running first without jchem search and then applying jchem search on the output of the first part?

User 325f2762fd

18-04-2008 17:10:18

Oracle is already doing that see the explain plan below





Code:
Operation   Object Name   Rows   Bytes   Cost   Object Node   In/Out   PStart   PStop





SELECT STATEMENT Optimizer Mode=ALL_ROWS      2 K       8004                                  


  HASH UNIQUE      2 K   330 K   8004                                  


    HASH JOIN      2 K   330 K   7925                                  


      TABLE ACCESS BY INDEX ROWID   EVOSEEK.COMPOUND   22 K   2 M   7080                                  


        DOMAIN INDEX   EVOSEEK.IXJC_COMPOUND_SMILES                                             


      NESTED LOOPS      103 K   2 M   558                                  


        NESTED LOOPS      4     44     1                                  


          INDEX FULL SCAN   EVOSEEK.PK_PROJECTS   308     1 K   1                                  


          INDEX UNIQUE SCAN   EVOSEEK.PK_VM_USER_PROJECT_ACCESS   1     7     0                                  


        INDEX RANGE SCAN   EVOSEEK.UNQ_PROJECT_COMPS_PROJ_COMPID   24 K   238 K   139                                  


ChemAxon aa7c50abf8

18-04-2008 17:41:28

Based on this explain plan, Oracle doesn't do what I meant. Oracle does a full domain index scan -- meaning that the substructure search is applied to every structure in the table. Looking at your query structure, JC_CONTAINS probably finds quite a lot of hits, which take quite a long time to return with most of them probably being subsequently discarded as the two legs of the execution plan are hash joined. (Returning hits from a domain index scan is a fairly time consuming operation and its duration increases linearly with the number of the hits returned.)





What I meant was to effectively reduce the input (and output) of the substructure search part of your query. I suggest to use the filterQuery option of the JC_COMPARE operator: http://www.chemaxon.com/jchem/doc/guide/cartridge/cartapi.html#jc_compare_filterQuery





This basically means that you pack the non-structure-search conditions (the project and user specific conditions) in a SELECT statement and then make this SELECT statement the value of the filterQuery option. This way, even though JC_COMPARE will still be executed in domain index scan mode, internally it will apply substructure search only to the structures returned by the filterQuery. This will result both in reduced time spent on substructure search and in (typically) substantially less hits returned by the domain index scan.

User 325f2762fd

18-04-2008 18:06:37

Fantastic. The query takes 3 minutes now after changing the query as per your suggestion.


Thanks for your help. See below the change made to the query





Code:
SELECT          COMP.compound_id, COMP.smiles,


                COMP.h_bond_acceptors, COMP.h_bond_donors,


                COMP.rotatable_bonds, COMP.clogp,


                COMP.molecular_weight, COMP.synthesisable


           FROM evoseek.COMPOUND comp


            WHERE jc_compare (COMP.smiles, 'c1ccccc1','sep=! t:s!filterQuery:select compound.rowid


                         FROM evoseek.COMPOUND,


                          evoseek.PROJECT_COMPOUNDS,


                                 evoseek.PROJECTS,


                                 evoseek.VM_USER_PROJECT_ACCESS


                           WHERE COMPOUND.compound_id = PROJECT_COMPOUNDS.compound_id


                     AND PROJECT_COMPOUNDS.project_id = PROJECTS.project_id


                            AND evoseek.VM_USER_PROJECT_ACCESS.project_id = PROJECTS.project_id


                            AND evoseek.VM_USER_PROJECT_ACCESS.user_id = 434') = 1






What is sep=! mean and also ! sign after t:s mean?

ChemAxon aa7c50abf8

18-04-2008 18:15:13

sep=! means that the options in the option list (the third argument of the jc_compare operator) will be separated by the '!' string (which consists in this case of just one character). By default, the options in the list are separated by space(s) which is not appropriate for option values which themselves may contain spaces (such as a SQL statemet). For further details please, see http://www.chemaxon.com/jchem/doc/guide/cartridge/index.html#option_list_parameter.