User 0261d34ad7
16-03-2012 19:00:13
Hi,
We're using JChemSearch to search a very large chemical database, in excess of 12 million chemical compounds.
However, we need to filter the compounds that can be returned, because new compounds may be added to our database but still be missing from the special maps that we use.
So we were wondering: what would be the impact on search performance of having a large list of "include" cs_ids, e.g. 12 million? How exactly does JChemSearch use the filter?
Any information you can provide would be great, because we can then assess whether to use this approach.
Thanks,
Jim
ChemAxon 9c0afc9aaf
16-03-2012 22:34:34
Hi,
We basically sort this list with QuickSort, which should not take too long.
(We sort only once; the sorted list is reused until the list is updated again.)
Then we do a binary search on the list for each screened hit candidate, which should be quick.
Of course, you have to obtain the ID values from the DB, and this might be one of the more time-consuming steps if you do it at every query. In that case the performance might be similar to using a filterQuery.
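To make the cost concrete, here is a minimal sketch of the sort-once, binary-search-per-candidate idea. It is an illustration only and not JChem's actual code: the class name SortedIdFilter and the method allows are hypothetical, and it assumes the cs_id values fit in an int array.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the JChem API.
final class SortedIdFilter {
    private final int[] sortedIds;

    SortedIdFilter(int[] includeIds) {
        // Copy so the caller's array is left untouched, then sort once.
        // The sorted copy is reused until the include list changes.
        this.sortedIds = includeIds.clone();
        Arrays.sort(this.sortedIds);
    }

    // Called once per screened hit candidate: O(log n) lookup.
    boolean allows(int csId) {
        return Arrays.binarySearch(sortedIds, csId) >= 0;
    }
}
```

For a list of 12 million IDs, each lookup needs at most about 24 comparisons (2^24 is roughly 16.8 million), so the per-candidate cost stays small; the one-time sort and fetching the IDs are the larger costs.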
Best,
Szilard
User 0261d34ad7
17-03-2012 06:25:43
Great, thanks for the information, very helpful.
We actually have an in-memory store of cs_id values which we are planning to consult, so the only performance penalties we'll hit will be the QuickSort and the per-cs_id binary search.
Thanks again,
Jim