Performance of chemical terms filter on search

User 0261d34ad7

05-10-2012 10:28:20

Hi,


I have a question regarding the performance impact of using a Chemical Terms expression to filter search results in JChemSearch.


I've just been reading through the following forum posting on the topic:


https://www.chemaxon.com/forum/ftopic2082.html&start=0&postdays=0&postorder=asc


Which talks about the setFilter method. This appears to be a good match to our filtering needs, because it allows us to apply arbitrary filters.


We haven't tried it yet, but the pressing question is performance ... so is it possible to get a description of how the filters are applied? In particular: 



*For example properties calculated normally as part of searching? Perhaps mol weight, atom count, etc?


Any advice on this topic would be awesome.


Kind regards,


Jim from SureChem

ChemAxon 42004978e8

05-10-2012 12:56:57

Hi Jim,


 


The setFilterQuery (https://www.chemaxon.com/jchem/doc/dev/java/api/chemaxon/sss/search/JChemSearchOptions.html#setFilterQuery(java.lang.String) ) method requires a SQL query to be specified. This is applied prior to searching and it's performed fast as it uses SQL operations on the DB level, but it can use only information that's stored in the DB. You can specify additional columns (e.g. atom count, pKa, name …) on the table in the following way:


http://www.chemaxon.com/jchem/doc/dev/dbconcepts/index.html#calculatedcolumns


http://www.chemaxon.com/jchem/doc/admin/index.html#modify


If you would like to use measures not contained in the table you need to specify a chemTermsFilter, but these are calculated on the fly for every screened target:


https://www.chemaxon.com/jchem/doc/dev/java/api/chemaxon/sss/search/SearchOptions.html#setChemTermsFilter(java.lang.String)


 


Bye,


Robert

User 0261d34ad7

05-10-2012 13:08:04

Hi Robert,


Yes, we've come across the setFilterQuery mechanism, but to be honest the performance is pretty terrible (even with an index on the column), which is why we're looking at Chemical Terms.


To be honest, the key questions I was hoping to have answered were the bullet points I raised:


 




*For example properties calculated normally as part of searching? Perhaps mol weight, atom count, etc?



 


Thanks,


J

ChemAxon 42004978e8

05-10-2012 13:40:16

Hi,


 


I wonder why the performance is poor for you. Could you specify the filterQuery you use? You can send it to the support email address if it's confidential. Do you use columns from the same database?


The answers to your first and second points are in the previous post, as follows:


1.) filter query: "it can use only information that's stored in the DB" - so it's a stored value that's used.


chem terms: "these are calculated on the fly" - interpreted prior to search.


 


2.) filter query: "This is applied prior to searching and it's performed fast as it uses SQL operations on the DB level…"


By search I meant the whole DB search process, so this operation is performed before screening.


chem terms filter: "but these are calculated on the fly for every screened target" so it's performed after screening. 


3.) filter query: for DB-stored values nothing is calculated during searching.


chem terms: the cost depends on the given calculation. pKa and logP are more expensive than calculations like molecular weight or atom count.


 


Bye,


Robert

User 0261d34ad7

05-10-2012 13:59:13

Thanks - we picked up on the post screening, that helps a lot.


We're using a very simple DB filter query, similar to an example on the ChemAxon website:



select cd_id from cafp where not radical;

Where radical is an indexed column, and there are ~13 million records.


The query returns about 12.3 million cd_ids, and takes around 2 minutes just to run the filter query:




Thu Oct 04 13:44:51 BST 2012
Search mode: SIMILARITY
Structure table: cafp
Query: C1=CC=CC=C1 |c:0,2,4|
Total screened: -1
Unique screened: -1
Hits: 224
Filter query SQL executed in: 124066 ms
Total time: 154449 ms  Screening: 154449 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited
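
Read as plain arithmetic, the figures in the log above suggest that most of the wall-clock time went into the filter query SQL rather than into the search itself. The small Java sketch below only redoes that arithmetic. The class and method names are made up for illustration, and the two constants are copied from the log.

```java
// Sketch: split the reported total time into filter-query time and everything else.
// SearchTimings and its methods are hypothetical helpers, not part of JChem.
final class SearchTimings {

    /** Milliseconds of the total that were not spent in the filter query SQL. */
    static long remainderMs(long totalMs, long filterSqlMs) {
        return totalMs - filterSqlMs;
    }

    /** Filter query share of the total, in percent, rounded to one decimal place. */
    static double filterSharePercent(long totalMs, long filterSqlMs) {
        return Math.round(1000.0 * filterSqlMs / totalMs) / 10.0;
    }

    public static void main(String[] args) {
        long total = 154_449;     // "Total time" from the log
        long filterSql = 124_066; // "Filter query SQL executed in" from the log
        System.out.println("Filter SQL share: " + filterSharePercent(total, filterSql) + " %");
        System.out.println("Remaining time:   " + remainderMs(total, filterSql) + " ms");
    }
}
```

With these numbers the filter query accounts for roughly 80% of the total, leaving about 30 seconds for everything else.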


The particular scenario here is that we're trying to filter a few hundred thousand compounds out.


Also, WRT question 1, I think there was a small miscommunication - I'm interested to know whether the chemical terms expression itself is compiled, rather than the actual terms themselves :) I'm guessing yes, given that the docs talk about using the Java Expression Parser (IIRC).


It's good to hear that some specific terms will be faster, that's more or less what we were thinking. Depending on how easy it is to create, do you think there's any chance of finding out which properties would naturally be fast to calculate? We thought atom count would be quick, but if there are others it will be very helpful to know.


Thanks,


Jim

ChemAxon a3d59b832c

05-10-2012 14:19:20

A minor correction:


Chemical Terms filtering is done as the final part of the searching, after the screening and the atom-by-atom search. This way it usually does not call back to the database; everything is executed in memory.
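
The ordering described in this thread (filter query first, then screening, then atom-by-atom search, then the Chemical Terms filter on what is left) can be illustrated with a toy model. This is not JChem code: every class, method and parameter name below is hypothetical, and each stage is reduced to a plain predicate on an integer id. The point of the sketch is only that a filter placed last is evaluated once per surviving hit rather than once per table row.

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy model of the stage ordering; all names are hypothetical stand-ins.
final class FilterOrderSketch {

    /** Counts how often the (potentially expensive) last-stage filter is evaluated. */
    static int postFilterCalls = 0;

    static List<Integer> search(int tableSize,
                                Set<Integer> allowedIds,      // stands in for the filter query result
                                IntPredicate screen,          // stands in for screening
                                IntPredicate atomByAtom,      // stands in for the atom-by-atom search
                                IntPredicate chemTermsLike) { // stands in for the Chemical Terms filter
        postFilterCalls = 0;
        return IntStream.range(0, tableSize)
                .filter(allowedIds::contains) // 1. ids pre-selected before searching
                .filter(screen)               // 2. screening
                .filter(atomByAtom)           // 3. exact matching on screened targets
                .filter(id -> {               // 4. last stage, only for surviving hits
                    postFilterCalls++;
                    return chemTermsLike.test(id);
                })
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<Integer> allowed = IntStream.range(0, 900).boxed().collect(Collectors.toSet());
        List<Integer> hits = search(1000, allowed,
                id -> id % 10 == 0, id -> id % 20 == 0, id -> id % 40 == 0);
        System.out.println(hits.size() + " hits, last-stage filter evaluated "
                + postFilterCalls + " times for a table of 1000 rows");
    }
}
```

In this model the cost of the last stage scales with the number of hits, which is consistent with a similarity search returning a few hundred hits being cheap to post-filter even when the table holds millions of rows.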


I guess that a slow filterQuery indicates that your database may not be set up optimally. How do the filterQuery SQL statement execution time and the full JChem search time compare?


 



*For example properties calculated normally as part of searching? Perhaps mol weight, atom count, etc?


Yes, these are calculated relatively quickly, as are some other simple counts in the topology and elemental analyzer plugins. Physico-chemical calculations, on the other hand, are relatively heavy-weight (for example pKa, tautomers, H bond donors, acceptors, etc.).

User 0261d34ad7

06-10-2012 06:44:22

Great, thanks for the information.


With the testing we ran previously, the similarity search took about 10 seconds. The "radical=yes" search took about 15 seconds, the "radical=no" search took 2 minutes. The DB setup is certainly a potential issue, though we've found that the performance of this type of filtering always seems to come with an overhead with our dataset. 


BTW the chemical terms filtering seems to be working well, and the performance is great so far. We've hit an odd issue related to similarity searching, but my colleague Richard has created a new topic for that.


Thanks again,


Jim

ChemAxon 42004978e8

08-10-2012 10:51:33

Hi Jim,


 


Chemical Terms expressions are interpreted, and the filter is executed as Szabolcs described.


I wonder why chemterms filtering is faster than filtering based on SQL expressions.


Generally we assume filterQuery is faster, as it performs the computation upon import, not during searching. However, with a slow DB connection the chem terms filter can outperform filterQuery, because chem terms works in memory and only on structures that pass searching.


Could you please execute your filterQuery outside the JChem tools and measure its performance?
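
One way to take such a measurement is through plain JDBC, timing statement execution and row fetching separately, since transferring millions of ids over a slow connection may cost more than the query itself. The sketch below is an assumption about how one might do this and says nothing about how JChem runs the query internally. FilterQueryTimer and Timing are hypothetical names; only standard java.sql calls are used.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: time a filter query via JDBC, separating execution from row fetching.
final class FilterQueryTimer {

    /** Measurements for one run of the query. */
    static final class Timing {
        final long executeMs;
        final long fetchMs;
        final long rows;

        Timing(long executeMs, long fetchMs, long rows) {
            this.executeMs = executeMs;
            this.fetchMs = fetchMs;
            this.rows = rows;
        }
    }

    static long toMs(long nanos) {
        return nanos / 1_000_000;
    }

    /** Runs the query, reads the first column of every row, and reports both phases. */
    static Timing time(Connection conn, String sql) throws SQLException {
        long start = System.nanoTime();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            long executed = System.nanoTime();
            long rows = 0;
            while (rs.next()) {
                rs.getInt(1); // read the id so the driver actually transfers it
                rows++;
            }
            long fetched = System.nanoTime();
            return new Timing(toMs(executed - start), toMs(fetched - executed), rows);
        }
    }
}
```

To use it, open a Connection with DriverManager.getConnection and your own JDBC URL and credentials, then call FilterQueryTimer.time(conn, "select cd_id from cafp where not radical"). Comparing executeMs with fetchMs should indicate whether the time goes into the query or into moving the result set to the client.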


Bye


Robert

User 0261d34ad7

08-10-2012 13:32:59

Cool. We'll give this a go and report back. 

ChemAxon 42004978e8

08-10-2012 13:44:32

OK, we are looking forward to seeing your results.


Robert