Tanimoto Similarity

User c0e481a82c

05-01-2006 17:21:58

Hi,





I'm trying to calculate the average similarity between a molecule and a set of molecules for an application I'm writing, using the 3.1.1 version of the cartridge. There are 2 points about this that I'd like to raise.





The first issue is one of performance. Because what I'm doing is calculating the similarity between my molecule and all molecules in an indexed table (SELECT AVG(JC_TANIMOTO(T.SMILES, 'myMolecule')) FROM TABLE T), the calculation is quite slow. In the previous data cartridge we were using the same approach and it was around 10 times quicker for the same calculation. I notice from the explain plan for this query, that the index isn't used (or at least if it is, Oracle doesn't know about it, which I can understand). Is there some other way of getting back this answer quicker?





The second issue is that if I use JCF_Tanimoto rather than the JC_Tanimoto in the SQL above, I get a different answer. I assume this is because one function uses the index whereas the other does not. However, why then is it that this doesn't dramatically speed things up (JCF_ and JC_ take approximately the same time, but JCF_ is marginally quicker)? And how can I make sure that both versions return the same answer?





Thanks for your help.





Regards,





Phil.

ChemAxon aa7c50abf8

06-01-2006 11:57:59

Hello Phil,
Quote:
(SELECT AVG(JC_TANIMOTO(T.SMILES, 'myMolecule')) FROM TABLE T), the calculation is quite slow
With SQL statements where a cartridge operator does not appear in the WHERE clause, the operator will be called by Oracle as many times as there are rows meeting the conditions in the WHERE clause. In the statement above the jc_tanimoto operator is called once for each row of the table T. This behaviour is common to all Oracle Cartridge implementations. The JChem Cartridge specific aspect is the following. Currently, most JChem Cartridge operators (including jc_tanimoto) do a call to JChem components in Tomcat to perform the required business functions. Calling Tomcat from Oracle has a large communication overhead compared to the time required to calculate the tanimoto between two given molecules.





The communication overhead is not a problem for statements like





Code:
select count(*) t where jc_tanimoto(smiles, 'myMolecule') > 0.9






In this case there will be just one single call to Tomcat for the entire statement and the required similarity calculation will be performed in one batch. The reason why we use the Tomcat-call-out approach even for SQL statements like the one you are referring to is that executing the algorithm in Oracle (either in Java or using PL/SQL) would be even slower. (I can only guess that some of the competing products use an architecture with a relatively lower per-call communication overhead. The trade-off they appear to make with their architecture is that, according to reports from some of our customers, they are slower than JChem Cartridge with SQL statements where operators are in the WHERE clause.)





The immediate solution for this performance problem would be to restructure the performance sensitive parts of your application so as to avoid using SQL statements where JChem Cartridge operators appear before the WHERE clause or use sufficiently restrictive conditions in the WHERE clause.
Quote:



I notice from the explain plan for this query, that the index isn't used
During the execution of the statement you are referring to, the jc_idxtype index is used in so far as the fingerprint of the target structure is taken from the index -- instead of being calculated on the fly. Oracle does not know of this, because this is "hidden" in the cartridge implementation. To my knowledge, the explain plan only shows whether the optimizer selected to execute the SQL statement with domain index scan or not. (The statement is executed in domain index scan mode typically when the domain index operator appears in the WHERE clause.) Even with domain index scan, Oracle can not know if the index data is actually used by the cartridge implementation during the scan. (Though normally it will be used because this is the main reason for implementing a data cartridge.)
Quote:
if I use JCF_Tanimoto rather than the JC_Tanimoto in the SQL above, I get a different answer.


This is due to a bug in jcf_tanimoto. It will be fixed in the next JChem release. Thank you for reporting it and sorry for the inconvenience.





Peter

User d9e3fbf2fa

02-10-2008 13:30:33

I'm still seeing a difference in the current version. Is this because our index is using non-default parameters? If so, is there any way to get the jcf_tanimoto function to use these same parameters, or should we just use jc_tanimoto in both the select and the where clauses?

ChemAxon aa7c50abf8

02-10-2008 15:59:13

Yes, index parameters are certainly candidates for explaining such differences.





In general, it is strongly recommended to use the operator forms of functions (such jc_tanimoto) wherever they are accepted by Oracle -- especially with structure columns having jc_idxtype indexes on them. Operators are supposed to take account of the information "embedded" in indexes -- such as the Standardizer configuration used for the index.





The function forms of operators (such as jcf_tanimoto) cannot access indexes. Their only purpose is to be used for assignment operations in PL/SQL code with literal (string constant) parameters where Oracle doesn't accept operators (and index information is not available).





Thanks


Peter