User 8139ea8dbd
07-07-2010 01:46:53
Have a structure table of 7million entries.
The SQL below takes 3.2 seconds, but I suspect it does not make use of cd_hash key.
select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;
Because,
select * from gnf_imc.v_cpd$jc_idx_20091020_jcx t where cd_hash=-400332141;
only takes 0.03 seconds and only found 3 matches.
select * from vendor_cpd v where v.rowid in ('AAAjcKABGAAAoRoAAY', 'AAAjcKAAOAAIRY9AAb','AAAjcKAAPAAHUZaAAR');
takes another 0.03 seconds to fetch all smiles.
The graph match for 3 pairs takes no time.
If it makes use of cd_hash, it should finish the initial full structure search within 1 second.
I believe I asked the same question maybe two years ago, and the reply was this was not reproducible. We have been using pretty recent cartridge and this issue has been always there in our case.
If this is not reproducible, is there a way to get some time measurement information to figure out where the bottle neck is? (When we search 300 compounds in vendor catelog, it takes 15 minutes!)
(For the same initial SQL t:d also takes 3 seconds, but t:s only takes 2 seconds! Something is not right. That's why I don't think cd_hash has been used.)
Thanks.
ChemAxon aa7c50abf8
07-07-2010 13:56:40
Running the query on the first 8 million structure of pubchem completes in about half a second returning no hits in my test environment. Duplicate search for the same query structure takes about quarter of a second to complete (and returns nothing).
Before I ran these searches I removed the # (comment) sign from the start of the following line in jchem/cartridge/conf/logging.properties in order to enable diagnostic information from core JChemSearch:
# chemaxon.jchem.db.JChemSearch.stats.level = FINE
(Either the server.sh reload command should be executed or JChem Server restarted for this change to take effect.) Having done so gave me the following output in the jchem/cartridge/logs/jcart?.log file. (? stands here for either 0 or 1, eventually for 2 depending on the circumstances.):
[Thread-25/14]: Jul 7, 2010 2:18:57 PM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Wed Jul 07 14:18:57 CEST 2010
Search mode: FULL
Structure table: PKOVACSUSER_TRUNK.JCXPBCH_8M_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 4
Hits: 0
Total time: 34 ms Screening: 1 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited
and
[Thread-35/16]: Jul 7, 2010 2:21:26 PM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Wed Jul 07 14:21:26 CEST 2010
Search mode: DUPLICATE
Structure table: PKOVACSUSER_TRUNK.JCXPBCH_8M_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 4
Hits: 0
Total time: 129 ms Screening: 0 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
The "Screened" count (the number of structures to undergo graph search) being the same in both cases suggest that cd_hash is used for screening in the full structure search as well.
Please, could you provide the corresponding output for these searches from your environment?
Thanks
Peter
User 4cd5052280
30-07-2010 16:36:01
Here is the logging info for the following search on two different hosts.
select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;
What is baffling is that host #1 is slower in every aspect (I/O, cpus, memory..) than Host #2. I am trying to narrow down what the cause of the performance difference could be.
Host #1:
FINE:
Thu Jul 29 09:53:03 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 123 ms Screening: 3 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 3 / Unlimited
Host #2:
FINE:
Fri Jul 30 01:17:56 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3139 ms Screening: 2945 ms
Processing threads: 16
Current / peak / maximum searches per minute: 1 / 1 / Unlimited
User 8139ea8dbd
30-07-2010 20:49:50
What's also puzzling is the same query, substructure search only takes 2sec, while duplicate search takes 3sec. Why duplicate search spends 3sec in "screening" after it gets the 2 entries match the cd_hash value? Thanks.
ChemAxon 9c0afc9aaf
30-07-2010 20:57:31
Hi,
The two indexes may have been created with different parameters, and this can explain the situation.
Hash code is not used for Full structure search in the following cases:
- query contains certain features (your query is OK)
- Markush table type / index type
- Tautomer duplicate filtering is set for the table
Do you think any of the latter two can be true in the slower environment ?
Best regards,
Szilard
User 8139ea8dbd
30-07-2010 21:01:00
Could offer some details which settings should we check regarding the 2nd and 3rd situation?
ChemAxon aa7c50abf8
30-07-2010 21:08:53
User 4cd5052280
30-07-2010 21:11:06
Both indexes were built with this parameter:
'tableType=anyStructures'
ChemAxon 9c0afc9aaf
30-07-2010 23:29:51
PS:
You may also send us the the whole content of the property table (exported in some text format). That would provide the most information.
Also, could you let us know your exact JChem version please ?
Thanks,
Szilard
ChemAxon aa7c50abf8
01-08-2010 18:00:18
Could offer some details which settings should we check regarding the 2nd and 3rd situation?
I am attaching a script with a PL/SQL package which will help you in this. The following call will return the information needed on the index:
select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('vendor_cpd', 'jc_smiles')))
The script will be part of the JChem Cartridge package starting with the next minor (or major — whichever comes first) JChem version.
Peter
User 4cd5052280
02-08-2010 16:02:33
Do I install this package in the cartridge owner schema or the schema where the jchem index is located?
ChemAxon aa7c50abf8
02-08-2010 17:22:41
I think the JChem Cartridge owner's schema is the best place for the package to avoid (unlikely, yet possible) name clashes. (The JChem Cartridge installer of future JChem versions will load it in the JChem Cartridge owner's schema.)
The particular SQL call mentioned in my previous e-mail has been tested when executed as the owner of the jc_idxtype index being queried about. (GNF_IMC in your case, I guess.)
Peter
User 4cd5052280
02-08-2010 20:24:04
I had to add the AUTHID CURRENT_USER to the package spec so it would use the correct JC_IDX_PROPERTY and JCHEMPROPERTIES tables.
Output looked the same for both databases. The tables have different names but have the same structure.
Host #1:
SQL> select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('VENDOR_CPD_20091020', 'JC_SMILES')));
NAME VALUE
---------------------------------------------------------------------- ----------------------------------------
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.validityTimestamp 2010-04-14 11:07:03.906
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.creationTime 2010-04-14 11:07:03.709
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.updateCounter 1
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.version 5030002
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.absoluteStereo true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfBits 512
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfOnes 2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfEdges 6
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.numberOfStrucFPCols 0
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.fingerprint.structuralKeys smartsMode=true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.haltOnBadFormat true
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.tableType 2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.idxSubType 0
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.JChemVersion 5.3.0.2
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.duplicateFiltering false
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.tautomerDuplicateFiltering false
idxtable.GNF_IMC.V_CPD$JC_IDX_JCX.exclusiveDuplicateFiltering false
17 rows selected.
=====> Index name: GNF_IMC.V_CPD$JC_IDX
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX.JChemProperties: GNF_IMC.JChemProperties
=====> JChem Properties table name: GNF_IMC.JChemProperties
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX.idxTable: GNF_IMC.V_CPD$JC_IDX_JCX
=====> Index table name: GNF_IMC.V_CPD$JC_IDX_JCX
=====> Opening cursor for : select prop_name, prop_value, prop_value_ext from GNF_IMC.JChemProperties where lower(prop_name) like '%table.gnf_imc.v_cpd$jc_idx_jcx.%'
Host #2:
SQL> select * from table(jcc_idx_stats.get_all_jcprop(jcc_idx_stats.get_idxname('VENDOR_CPD_20091020', 'JC_SMILES')));
NAME VALUE
---------------------------------------------------------------------- ----------------------------------------
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.validityTimestamp 2010-05-07 21:23:17.263
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.creationTime 2010-05-07 21:23:17.226
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.updateCounter 1
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.version 5030002
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.absoluteStereo true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfBits 512
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfOnes 2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfEdges 6
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.numberOfStrucFP 0
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.fingerprint.structuralKeys smartsMode=true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.haltOnBadFormat true
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.tableType 2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.idxSubType 0
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.JChemVersion 5.3.0.2
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.duplicateFiltering false
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.tautomerDuplicateFiltering false
idxtable.GNF_IMC.V_CPD$JC_IDX_20091020_JCX.exclusiveDuplicateFiltering false
17 rows selected.
=====> Index name: GNF_IMC.V_CPD$JC_IDX_20091020
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX_20091020.JChemProperties: GNF_IMC.JChemProperties
=====> JChem Properties table name: GNF_IMC.JChemProperties
=====> Master index attribute for GNF_IMC.V_CPD$JC_IDX_20091020.idxTable: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
=====> Index table name: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
=====> Opening cursor for : select prop_name, prop_value, prop_value_ext from GNF_IMC.JChemProperties where lower(prop_name) like '%table.gnf_imc.v_cpd$jc_idx_20091020_jcx.%'
ChemAxon aa7c50abf8
02-08-2010 22:40:28
Thank you, Jim!
Please, could you execute the following statement and post the global memory utilization indicators (Total JVM memory, Free JVM memory, Max JVM memory) as well as the index specific indicator for the indices in question on both hosts:
select * from table(jcf.t_get_gmem_util())
There is namely a slim chance that the slow index is not cached...
Thanks,
Peter
User 4cd5052280
02-08-2010 23:03:59
Strange. Host #1 is not showing any cached objects.
Host #1:
SQL> select * from table(jcf.t_get_gmem_util());
DESCRIPTION UTIL_MB
-------------------------------------------------- ----------
Total JVM memory 21
Free JVM memory 16
Max JVM memory 1781
3 rows selected.
Host #2:
SQL> select * from table(jcf.t_get_gmem_util());
DESCRIPTION UTIL_MB
-------------------------------------------------- ----------
Total JVM memory 2778
Free JVM memory 300
Max JVM memory 2778
LDDB_WAREHOUSE.JC_IDX_STRUC_SMILES 332
GNF_IMC.CPD$JC_IDX 497
GNF_IMC.V_CPD$JC_IDX_20091020 682
6 rows selected.
User 4cd5052280
03-08-2010 16:02:53
Sorry. I forgot I had restarted the cartridge:
Host #1:
DESCRIPTION UTIL_MB
-------------------------------------------------- ----------
Total JVM memory 1015
Free JVM memory 325
Max JVM memory 1781
GNF_IMC.V_CPD$JC_IDX 682
ChemAxon aa7c50abf8
03-08-2010 17:08:18
Thank you, Jim!
Would it be possible to
- restart JChem Server on host#2 as well
- run twice the problematic full structure search and
- post the JChem Search statistics for the searches
?
Also, would it be possible to
- run a duplicate search (jc_compare(..., 't:d')) with the same query structure (CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O) against the same structure columns on both hosts and
- post the JChem Search statistics of the duplicate searches for both hosts
?
Thanks
Peter
User 4cd5052280
03-08-2010 18:08:04
Will do. This will need to wait till off hours tonight. I'll post the results tomorrow.
User 4cd5052280
06-08-2010 15:10:16
Host #1:
select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:f')=1;
Aug 6, 2010 8:14:48 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:14:48 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 198 ms Screening: 3 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
select rowid,t.* from vendor_cpd t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;
Aug 6, 2010 8:16:37 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:16:37 PDT 2010
Search mode: DUPLICATE
Structure table: GNF_IMC.V_CPD$JC_IDX_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 161 ms Screening: 4 ms
Processing threads: 4
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
Host #2:
select rowid,t.* from vendor_cpd_20091020 t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;
Aug 6, 2010 8:13:59 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:13:56 PDT 2010
Search mode: FULL
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3081 ms Screening: 2962 ms
Processing threads: 16
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
select rowid,t.* from vendor_cpd_20091020 t where jc_compare(jc_smiles, 'CCN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O', 't:d')=1;
Aug 6, 2010 8:18:49 AM chemaxon.jchem.db.JChemSearch printSearchInfo
FINE:
Fri Aug 06 08:18:46 PDT 2010
Search mode: DUPLICATE
Structure table: GNF_IMC.V_CPD$JC_IDX_20091020_JCX
Query: [#6]CN1C(=S)S\C(=C\c2cccc(OCC(O)=O)c2)C1=O
Screened: 3
Hits: 2
Total time: 3122 ms Screening: 3003 ms
Processing threads: 16
Current / peak / maximum searches per minute: 2 / 2 / Unlimited
ChemAxon aa7c50abf8
08-08-2010 19:04:39
Thank you, Jim!
The screening time during DUPLICATE search on host#2 being about the same as the screening time of the FULL search on the same host appears to indicate that this is not a FULL structure search problem, after all. It appears that screening by the cd_hash column in general is slow on this machine for some reason.
After consulting my colleagues, it appears that the single longest operation during the cd_hash screening is by far the SQL query similar to (or the same as?) what was referred to in the first post and took 0.03 seconds to execute: select * from gnf_imc.v_cpd$jc_idx_20091020_jcx t where cd_hash=-400332141 .
Would it be possible to check what Oracle actually executes during DUPLICATE search? Unfortunately, JChem Server (which opens the connection used to execute the search) is currently not well instrumented to start sql-tracing. The best way to start sql tracing I can come up with right now is to use a logon trigger: http://www.freelists.org/post/oracle-l/trace-via-logon-trigger,8 .
Thanks
Peter
User 4cd5052280
10-08-2010 22:41:36
Peter,
I am familiar with setting the 10046 event via a logon trigger. Before we go down that road I am going to upgrade our cartridge to 5.3.6 over the weekend. If we experience the same issue I will sent you the trace file for the duplicate search.
Thanks.