Retrieving the top n compounds based on tanimoto score

User 50ed9e02cc

16-07-2009 18:55:56

Hello,


We are migrating an application from Daycart to the JChem cartridge.  One of the functions our system supports is finding the top n compounds based on Tanimoto similarity score, where n is a variable supplied by the user.   An efficient way to do that with Daycart is using a query like the following:


SELECT * FROM structure WHERE TANIMOTO(SMILES, '<query_structure>', 10) >= 0


which returns the top n matches in less than a second from a structures table of ~80,000 entries. 


My question: is there a similarly efficient way to do this using the JChem cartridge (short of using a subquery that computes the Tanimoto score on all compounds and then picks the top n from the ordered result)?


The closest I could find is something like:


SELECT * FROM structure WHERE JC_COMPARE(SMILES, '<query_structure>', 't:t simThreshold:0 maxHitCount:10') = 1


with the obvious flaw that it doesn't return the compounds with the highest scores, but rather simply returns some 10 compounds.
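To make the difference concrete, here is a minimal Python sketch, not JChem or Daycart code. It assumes fingerprints are represented as sets of on-bit positions, and the names `tanimoto`, `first_n`, `top_n` and the toy `library` are purely illustrative. It contrasts stopping after the first n rows that pass the threshold with scoring the rows and keeping the n best.

```python
import heapq


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def first_n(query, library, n, threshold=0.0):
    """Stop after the first n rows that pass the threshold (arbitrary hits)."""
    hits = []
    for cid, fp in library.items():
        if tanimoto(query, fp) >= threshold:
            hits.append(cid)
            if len(hits) == n:
                break
    return hits


def top_n(query, library, n, threshold=0.0):
    """Score every row, drop those below the threshold, keep the n best."""
    scored = ((tanimoto(query, fp), cid) for cid, fp in library.items())
    passing = (pair for pair in scored if pair[0] >= threshold)
    return [cid for _, cid in heapq.nlargest(n, passing)]


# Toy data: hypothetical fingerprints, listed in "table order".
library = {
    "A": frozenset({1, 2}),
    "B": frozenset({1, 2, 3, 4}),
    "C": frozenset({1, 2, 3}),
    "D": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

print(first_n(query, library, 2))  # depends on row order only
print(top_n(query, library, 2))    # depends on the scores
```

With a threshold of 0, the first function returns whatever rows happen to come first, which is the behavior described above for maxHitCount, while the second returns the exact match and its nearest neighbor.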


Thank you.

ChemAxon aa7c50abf8

16-07-2009 20:25:17

Hello,


This practically means that the returned result set is limited to the top 'n' structures most similar to the query, correct? We currently don't provide such a parameter/functionality, but it doesn't look to be difficult to implement at all. (FS#8701)


Performance-wise, the following is the most efficient solution I can come up with right now (and the response time is very sensitive to the threshold used):



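-- inner query: score only the rows that pass the similarity threshold
select id, sim
  from
    (select id, jc_tanimoto(structure, query) sim
       from table_name
      where id in (select id from table_name
                    where jc_tanimoto(structure, query) > 0.8)
      order by sim desc)
 -- outer query: keep the 10 best rows of the ordered result
 where rownum <= 10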

where id is the primary key of the structures.



If I had any influence on the application that wants to have the most similar structures, I'd execute the above statement repeatedly using decreasing threshold values, starting from an "experimental" high threshold value (say, 0.95?) assumed to return just a very few hits, until either the returned hit count reaches the count specified by the user (10 in this case) or the threshold specified by the user (0.8 in this case) has been reached.
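The loop described above could be sketched roughly as follows. This is only an illustration in Python: `search` is a hypothetical callable standing in for running the SQL statement at a given threshold, the step size of 0.05 is an arbitrary assumption, and `fake_search` with its scores is made-up test data.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (structure id, similarity score)


def top_hits_by_decreasing_threshold(
    search: Callable[[float], List[Hit]],  # hypothetical: runs the query at a threshold
    wanted: int,                           # hit count requested by the user
    floor: float,                          # lowest threshold the user accepts
    start: float = 0.95,                   # "experimental" high starting threshold
    step: float = 0.05,                    # assumed decrement per round
) -> List[Hit]:
    if step <= 0:
        raise ValueError("step must be positive")
    threshold = max(start, floor)
    while True:
        # Sort explicitly: the search is not assumed to return ranked rows.
        hits = sorted(search(threshold), key=lambda h: h[1], reverse=True)
        # Stop once enough hits were found or the user's threshold is reached.
        if len(hits) >= wanted or threshold <= floor:
            return hits[:wanted]
        threshold = max(floor, round(threshold - step, 10))


# Made-up data standing in for the structure table.
scores = {"a": 0.97, "b": 0.91, "c": 0.86, "d": 0.82, "e": 0.40}
calls = []


def fake_search(threshold: float) -> List[Hit]:
    calls.append(threshold)
    return [(cid, s) for cid, s in scores.items() if s >= threshold]


result = top_hits_by_decreasing_threshold(fake_search, wanted=3, floor=0.8)
print(result, len(calls))
```

Because every round returns all structures above the current threshold, the first round that yields at least the requested number of hits already contains the overall best ones, so lower thresholds never need to be tried.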


Peter

ChemAxon 9c0afc9aaf

21-07-2009 19:10:52

Hi,


We have decided to change the default behavior in both JChem Base and the Cartridge:


similarity search will always return the most similar compounds if the result count is limited.


We are planning to implement this change for the next minor release (5.2.4), which is currently scheduled for August 7. (The actual release date may vary.)


Best regards,


Szilard

User 50ed9e02cc

21-07-2009 20:06:00

Thank you for your quick response and resolution of this issue!


Could you elaborate on what you mean by "if the result count is limited"?  Does this mean if the number of results found is small enough or if the user is asking for a limited number of results?  If it's the latter, how will you detect that in the cartridge case?


Mohammad


 

ChemAxon 9c0afc9aaf

21-07-2009 20:57:38

Could you elaborate on what you mean by "if the result count is limited"?  Does this mean if the number of results found is small enough or if the user is asking for a limited number of results?  If it's the latter, how will you detect that in the cartridge case?


 


The latter, if the result count is limited by the user.


I'm not sure I understand what you mean by detecting, so here are two answers:


- the user will be able to specify the maximum result count with the "maxHitCount" search parameter; this already exists for "jc_compare", only its behavior needs to be altered in the case of similarity search


- in the rare case that the number of returned hits exactly equals the specified maximum, there is no way to tell whether the limit was reached or there were exactly that many structures satisfying the threshold. (The JChem Base Java API can tell you this.)
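One generic way a client could work around the ambiguity in the second point is to ask for one hit more than it needs. The Python sketch below illustrates that idea only; it is not something the cartridge provides. `search` is a hypothetical callable that returns up to `limit` of the best (id, score) hits, and `fake_search` is made-up test data.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (structure id, similarity score)


def search_with_truncation_flag(
    search: Callable[[int], List[Hit]],  # hypothetical: best `limit` hits
    max_hits: int,
) -> Tuple[List[Hit], bool]:
    """Request one extra hit to learn whether more than max_hits rows qualify."""
    hits = sorted(search(max_hits + 1), key=lambda h: h[1], reverse=True)
    truncated = len(hits) > max_hits
    return hits[:max_hits], truncated


# Made-up data: exactly three structures satisfy the threshold.
ranked = [("a", 0.99), ("b", 0.95), ("c", 0.91)]


def fake_search(limit: int) -> List[Hit]:
    return ranked[:limit]


print(search_with_truncation_flag(fake_search, 3))  # all hits, nothing cut off
print(search_with_truncation_flag(fake_search, 2))  # limit reached
```

This only works if the limited search returns the best hits rather than arbitrary ones, which is the changed behavior discussed in this thread.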


 


Szilard

User 50ed9e02cc

19-08-2009 16:04:36

Szilard,


Does the 5.2.4 release on August 14 include the change you mentioned in this thread?  I couldn't find an indication of that when I looked at the history of changes.


Thank you.


Mohammad

ChemAxon a3d59b832c

19-08-2009 17:05:47

Hi Mohammad,


Unfortunately, this turned out to be a somewhat larger development than it initially looked, and it could not fit into 5.2.4.


We plan to implement it for 5.2.5, which is due next month.


 


Best regards,


Szabolcs

ChemAxon aa7c50abf8

17-09-2009 12:08:21

JChem 5.2.5 has been released with the new behavior of the maxHitCount option.


Peter

User 8139ea8dbd

13-05-2010 22:27:10

select structure_id from structure where jc_compare(jc_smiles, 'myQuerySmiles', 't:t simThreshold:0.9 maxHitCount:5') =1;


It does return the best 5 structures, but the order of the structure_id values is random, not from the most similar to the least similar. Certainly we can use another wrapping SQL statement to sort the structure_id values, but I wonder whether you can make the returned IDs ranked, or will Oracle shuffle the ranking anyway?


Thanks

ChemAxon aa7c50abf8

14-05-2010 12:13:59


Oracle document "Ordering of Result Data" [ID 344135.1] says: "The only way that Oracle guarantees the row order is if you supply an order by clause in your statements." I assume this to be true for both top-level (non-recursive) and recursive query statements.


The ROWIDs returned by JChem Cartridge are processed in at least one more layer by Oracle, before the results of the user's top level query are returned to the user. I found no indication in Oracle documents that, without the user specifying an ORDER BY clause, results would be returned to the user in the same order as they are returned by extensible index implementations. In fact, as users are rarely interested in the raw ROWIDs themselves, some translation from ROWIDs to actual user data will be required even in the case of the simplest SQL queries. I speculate that Oracle might well opt for a specific retrieval path which (in the absence of an ORDER BY clause) would eventually reorder the results for efficiency/performance purposes.
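In practice this means a client should treat rows fetched without an ORDER BY clause as an unordered set. As a small illustration, the Python sketch below ranks rows on the client side; it assumes the query also selects the similarity score next to each structure_id, and the `rows` shown are made-up values, not real output.

```python
from typing import List, Tuple

Row = Tuple[int, float]  # (structure_id, similarity score)


def rank_hits(rows: List[Row]) -> List[Row]:
    """Sort rows from most to least similar; ties are broken by id for stable output."""
    return sorted(rows, key=lambda r: (-r[1], r[0]))


# Made-up rows in the arbitrary order a query without ORDER BY might return.
rows = [(17, 0.93), (4, 0.99), (8, 0.93), (23, 0.91)]

for structure_id, sim in rank_hits(rows):
    print(structure_id, sim)
```

Sorting in the database with an explicit ORDER BY on the score achieves the same result and is usually preferable when the score is available in the query.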


Peter