Cartridge performance of full structure search

User 8139ea8dbd

07-07-2010 01:46:53

Have a structure table of 7million entries.


The SQL below takes 3.2 seconds, but I suspect it does not make use of cd_hash key.
select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;

Because,
select * from gnf_imc.v_cpd$jc_idx_20091020_jcx t where cd_hash=-400332141;
only takes 0.03 seconds and only found 3 matches.
select * from vendor_cpd v where v.rowid in ('AAAjcKABGAAAoRoAAY', 'AAAjcKAAOAAIRY9AAb','AAAjcKAAPAAHUZaAAR');
takes another 0.03 seconds to fetch all smiles.


The graph match for 3 pairs takes no time.
If it makes use of cd_hash, it should finish the initial full structure search within 1 second.


I believe I asked the same question maybe two years ago, and the reply was this was not reproducible. We have been using pretty recent cartridge and this issue has been always there in our case.


If this is not reproducible, is there a way to get some time measurement information to figure out where the bottle neck is? (When we search 300 compounds in vendor catelog, it takes 15 minutes!)


(For the same initial SQL t:d also takes 3 seconds, but t:s only takes 2 seconds! Something is not right. That's why I don't think cd_hash has been used.)


Thanks.

ChemAxon aa7c50abf8

07-07-2010 13:56:40

Running the query on the first 8 million structure of pubchem completes in about half a second returning no hits in my test environment. Duplicate search for the same query structure takes about quarter of a second to complete (and returns nothing).


Before I ran these searches I removed the # (comment) sign from the start of the following line in jchem/cartridge/conf/logging.properties in order to enable diagnostic information from core JChemSearch:


# chemaxon.jchem.db.JChemSearch.stats.level = FINE

(Either the server.sh reload command should be executed or JChem Server restarted for this change to take effect.) Having done so gave me the following output in the jchem/cartridge/logs/jcart?.log file. (? stands here for either 0 or 1, eventually for 2 depending on the circumstances.):




[Thread-25/14]: Jul 7, 2010 2:18:57 PM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Wed Jul 07 14:18:57 CEST 2010
Search mode: FULL
Structure table: PKOVACSUSER_TRUNK.JCXPBCH_8M_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 4
Hits: 0
Total time: 34 ms  Screening: 1 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited





and




[Thread-35/16]: Jul 7, 2010 2:21:26 PM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Wed Jul 07 14:21:26 CEST 2010
Search mode: DUPLICATE
Structure table: PKOVACSUSER_TRUNK.JCXPBCH_8M_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 4
Hits: 0
Total time: 129 ms  Screening: 0 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited





The "Screened" count (the number of structures to undergo graph search) being the same in both cases suggest that cd_hash is used for screening in the full structure search as well.


Please, could you provide the corresponding output for these searches from your environment?



Thanks


Peter

User 4cd5052280

30-07-2010 16:36:01

Here is the logging info  for the following search on two different hosts.  


select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;


What is baffling is that host #1 is slower in every aspect (I/O, cpus, memory..) than Host #2.  I am trying to narrow down what the cause of the performance difference could be. 


Host #1:


FINE:


Thu Jul 29 09:53:03 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 123 ms  Screening: 3 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 3 / Unlimited


Host #2:


FINE:
Fri Jul 30 01:17:56 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3139 ms  Screening: 2945 ms
Processing threads: 16
Current / peak / maximum searches per minute: 1 / 1 / Unlimited


 


 

User 8139ea8dbd

30-07-2010 20:49:50

What's also puzzling is the same query, substructure search only takes 2sec, while duplicate search takes 3sec. Why duplicate search spends 3sec in "screening" after it gets the 2 entries match the cd_hash value? Thanks.

ChemAxon 9c0afc9aaf

30-07-2010 20:57:31

Hi,




The two indexes may have been created with different parameters, and this can explain the situation.


 


Hash code is not used for Full structure search in the following cases:


- query contains certain features (your query is OK)


- Markush table type / index type


- Tautomer duplicate filtering is set for the table


 


Do you think any of the latter two can be true in the slower environment ?


 


Best regards,


 


Szilard


 

User 8139ea8dbd

30-07-2010 21:01:00

Could offer some details which settings should we check regarding the 2nd and 3rd situation?

ChemAxon aa7c50abf8

30-07-2010 21:08:53

For the second, you would create the index with the structureType=markush index parameter.


create index ... parameters('structureType=markush')

For the third, you would create the index with the TDF=y index parameter:


create index ... parameters('tdf=y')

https://www.chemaxon.com/jchem/doc/dev/cartridge/index.html#create_index


Peter

User 4cd5052280

30-07-2010 21:11:06

Both indexes were built with this parameter:


 'tableType=anyStructures'

ChemAxon 9c0afc9aaf

30-07-2010 21:29:21



Hi,













yzhou wrote:

Could offer some details which settings should we check regarding the 2nd and 3rd situation?



I could not quicly find how to obtain information on index settings, so here is a not-so-elegant way until then:


1. Locate the property table to which the indexes belong. Default name is "JChemProperties"


2. Look for the properties referring to these index tables.


E.g.  


select * from <property_table> where prop_name like '%GNF_IMC.V_CPD$JC_IDX_20091020_JCX%'


Then you can compare the differences in settings, especially the state of "tautomerDuplicateFiltering" can be interesting.


 


Best regards,


 


Szilard




ChemAxon 9c0afc9aaf

30-07-2010 23:29:51

PS: 


You may also send us the the whole content of the property table (exported in some text format). That would provide the most information.


Also, could you let us know your exact JChem version please ?


 


Thanks,


Szilard

ChemAxon aa7c50abf8

01-08-2010 18:00:18

Could offer some details which settings should we check regarding the 2nd and 3rd situation?

I am attaching a script with a PL/SQL package which will help you in this. The following call will return the information needed on the index:


select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('vendor_cpd', 'jc_smiles')))

The script will be part of the JChem Cartridge package starting with the next minor (or major — whichever comes first) JChem version.


Peter

User 4cd5052280

02-08-2010 16:02:33

Do I install this package in the cartridge owner schema or the schema where the jchem index is located?


 

ChemAxon aa7c50abf8

02-08-2010 17:22:41

I think the JChem Cartridge owner's schema is the best place for the package to avoid (unlikely, yet possible) name clashes. (The JChem Cartridge installer of future JChem versions will load it in the JChem Cartridge owner's schema.)


The particular SQL call mentioned in my previous e-mail has been tested when executed as the owner of the jc_idxtype index being queried about. (GNF_IMC in your case, I guess.)


Peter

User 4cd5052280

02-08-2010 20:24:04

I had to add the AUTHID CURRENT_USER to the package spec so it would use the correct JC_IDX_PROPERTY and JCHEMPROPERTIES tables. 


Output looked the same for both databases.  The tables have different names but have the same structure.


 


Host #1:


SQL> select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('VENDOR_CPD_20091020', 'JC_SMILES')));



NAME                                                                   VALUE
---------------------------------------------------------------------- ----------------------------------------
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.validityTimestamp                    2010-04-14 11:07:03.906
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.creationTime                         2010-04-14 11:07:03.709
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.updateCounter                        1
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.version                              5030002
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.absoluteStereo                       true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfBits             512
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfOnes             2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfEdges            6
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfStrucFPCols      0
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.structuralKeys           smartsMode=true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.haltOnBadFormat                      true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.tableType                            2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.idxSubType                           0
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.JChemVersion                         5.3.0.2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.duplicateFiltering                   false
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.tautomerDuplicateFiltering           false
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.exclusiveDuplicateFiltering          false


17 rows selected.


=====> Index name: GNF_IMC.V_CPD$JC_IDX
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX.JChemProperties: GNF_IMC.JChemProperties
=====> JChem Properties table name: GNF_IMC.JChemProperties
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX.idxTable: GNF_IMC.V_CPD$JC_IDX_JCX
=====> Index table name: GNF_IMC.V_CPD$JC_IDX_JCX
=====> Opening cursor for : select prop_name, prop_value, prop_value_ext  from GNF_IMC.JChemProperties where lower(prop_name) like '%table.g
nf_imc.v_cpd$jc_idx_jcx.%'


 


Host #2:


SQL> select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('VENDOR_CPD_20091020', 'JC_SMILES')));


NAME                                                                   VALUE
---------------------------------------------------------------------- ----------------------------------------
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.validityTimestamp           2010-05-07 21:23:17.263
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.creationTime                2010-05-07 21:23:17.226
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.updateCounter               1
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.version                     5030002
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.absoluteStereo              true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfBits    512
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfOnes    2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfEdges   6
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfStrucFP 0
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.structuralKeys  smartsMode=true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.haltOnBadFormat             true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.tableType                   2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.idxSubType                  0
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.JChemVersion                5.3.0.2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.duplicateFiltering          false
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.tautomerDuplicateFiltering  false
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.exclusiveDuplicateFiltering false


17 rows selected.


=====> Index name: GNF_IMC.V_CPD$JC_IDX_20091020
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX_20091020.JChemProperties: GNF_IMC.JChemProperties
=====> JChem Properties table name: GNF_IMC.JChemProperties
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX_20091020.idxTable: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
=====> Index table name: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
=====> Opening cursor for : select prop_name, prop_value, prop_value_ext  from GNF_IMC.JChemProperties where lower(prop_name) like '%table.gnf_imc.v_cpd$jc_idx_20091020_jcx.%'

ChemAxon aa7c50abf8

02-08-2010 22:40:28

Thank you, Jim!


Please, could you execute the following statement and post the global memory utilization indicators (Total JVM memory, Free JVM memory, Max JVM memory) as well as the index specific indicator for the indices in question on both hosts:


select * from table(jcf.t_get_gmem_util())

There is namely a slim chance that the slow index is not cached...


Thanks,


Peter

User 4cd5052280

02-08-2010 23:03:59

Strange. Host #1 is not showing any cached objects. 


 


Host #1:


SQL>  select * from table(jcf.t_get_gmem_util());


DESCRIPTION                                           UTIL_MB
-------------------------------------------------- ----------
Total JVM memory                                           21
Free JVM memory                                            16
Max JVM memory                                           1781


3 rows selected.


 


 


Host #2:


SQL>  select * from table(jcf.t_get_gmem_util());


DESCRIPTION                                           UTIL_MB
-------------------------------------------------- ----------
Total JVM memory                                         2778
Free JVM memory                                           300
Max JVM memory                                           2778
LDDB_WAREHOUSE.JC_IDX_STRUC_SMILES                        332
GNF_IMC.CPD$JC_IDX                                        497
GNF_IMC.V_CPD$JC_IDX_20091020                             682


6 rows selected.

User 4cd5052280

03-08-2010 16:02:53

Sorry. I forgot I had restarted the cartridge:


Host #1:


 


DESCRIPTION                                           UTIL_MB
-------------------------------------------------- ----------
Total JVM memory                                         1015
Free JVM memory                                           325
Max JVM memory                                           1781
GNF_IMC.V_CPD$JC_IDX                                      682


 

ChemAxon aa7c50abf8

03-08-2010 17:08:18

Thank you, Jim!


Would it be possible to



  1. restart JChem Server on host#2 as well

  2. run twice the problematic full structure search and

  3. post the JChem Search statistics for the searches


?


Also, would it be possible to



  1. run a duplicate search (jc_compare(..., 't:d')) with the same query structure (CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O) against the same structure columns on both hosts and

  2. post the JChem Search statistics of the duplicate searches for both hosts


 ?


Thanks


Peter

User 4cd5052280

03-08-2010 18:08:04

Will do. This will need to wait till off hours tonight.  I'll post the results tomorrow.

User 4cd5052280

06-08-2010 15:10:16

Host #1:


select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;



Aug 6, 2010 8:14:48 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:14:48 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 198 ms  Screening: 3 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited



select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;


Aug 6, 2010 8:16:37 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:16:37 PDT 2010
Search mode: DUPLICATE
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 161 ms  Screening: 4 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited



Host #2:


select rowid,t.* from vendor_cpd_20091020 t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;


Aug 6, 2010 8:13:59 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:13:56 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3081 ms  Screening: 2962 ms
Processing threads: 16
Current / peak / maximum searches per minute: 2 / 2 / Unlimited



select rowid,t.* from vendor_cpd_20091020 t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;


Aug 6, 2010 8:18:49 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:18:46 PDT 2010
Search mode: DUPLICATE
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3122 ms  Screening: 3003 ms
Processing threads: 16
Current / peak / maximum searches per minute: 2 / 2 / Unlimited


 


 

ChemAxon aa7c50abf8

08-08-2010 19:04:39

Thank you, Jim!


The screening time during DUPLICATE search on host#2 being about the same as the screening time of the FULL search on the same host appears to indicate that this is not a FULL structure search problem, after all. It appears that screening by the cd_hash column in general is slow on this machine for some reason.


After consulting my colleagues, it appears that the single longest operation during the cd_hash screening is by far the SQL query similar to (or the same as?) what was referred to in the first post and took 0.03 seconds to execute: select * from gnf_imc.v_cpd$jc_idx_20091020_jcx t where cd_hash=-400332141 .


Would it be possible to check what Oracle actually executes during DUPLICATE search? Unfortunately, JChem Server (which opens the connection used to execute the search) is currently not well instrumented to start sql-tracing. The best way to start sql tracing I can come up with right now is to use a logon trigger: http://www.freelists.org/post/oracle-l/trace-via-logon-trigger,8 .


Thanks


Peter

User 4cd5052280

10-08-2010 22:41:36

Peter,


I am familiar with setting the 10046 event via a logon trigger.  Before we go down that road I am going to upgrade our cartridge to 5.3.6 over the weekend.  If we experience the same issue I will sent you the trace file for the duplicate search.


Thanks.