Batch structure search using JChem Base

User 8d34d3a066

13-09-2012 14:06:26

Hi,

We have a scenario where we wish to submit a batch of structures to a structure search (similarity, substructure etc.) and retrieve all structures that match one or more of the input set. Ideally this would work on large sets of up to 1000+ structures.

Looking at the documentation and the API I cannot see an obvious way of doing this and was wondering if you could provide any direction.

thanks

Richard

ChemAxon a3d59b832c

14-09-2012 09:01:14

Hi Richard,

That is correct: the JChem Base API does not directly support searching with multiple query structures at the same time.

You have to start a separate search for each query in the input list.

(This way you can also easily identify which result belongs to which query, and you can control how many queries are kept in memory at the same time.)
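As a rough illustration of the per-query loop, here is a minimal sketch. The `singleSearch` function is a hypothetical stand-in for one JChemSearch run (set the query, run, read the hit IDs); the class and method names are made up for this example and are not part of the JChem API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Runs one search per query and records which query produced which hits. */
class BatchSearchLoop {

    /**
     * @param queries      query structures, e.g. SMILES strings
     * @param singleSearch hypothetical stand-in for one JChemSearch run:
     *                     takes a query and returns the matching IDs
     * @return hit IDs per query, in input order
     */
    static Map<String, int[]> searchEach(List<String> queries,
                                         Function<String, int[]> singleSearch) {
        Map<String, int[]> hitsByQuery = new LinkedHashMap<>();
        for (String query : queries) {
            // Only one query is searched at a time.
            hitsByQuery.put(query, singleSearch.apply(query));
        }
        return hitsByQuery;
    }

    /** Union of all hit IDs: the structures matching one or more queries. */
    static Set<Integer> union(Map<String, int[]> hitsByQuery) {
        Set<Integer> all = new TreeSet<>();
        for (int[] hits : hitsByQuery.values()) {
            for (int id : hits) {
                all.add(id);
            }
        }
        return all;
    }
}
```

Keeping the per-query map is what lets you tell which result belongs to which query; if only the combined set matters, the map can be dropped and the IDs added to the set directly.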

But there are several other interfaces in the JChem family that already support this:

1. In JChem Cartridge, you can provide multiple concatenated query structures in one jc_compare call. See:

http://www.chemaxon.com/jchem/doc/dev/cartridge/cartapi.html#jc_compare

http://www.chemaxon.com/jchem/doc/dev/cartridge/index.html#query_structure

- However, I don't think we ever tested the resource implications of 1000+ concatenated structures.

2. Also using the cartridge, you can create a complex SQL select statement that uses multiple queries.

(Using two embedded select statements, e.g. taking queries from a table.)

3. Instant JChem has overlap analysis functionality that basically supports this. See:

http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

I am sure that methods 2 & 3 work well and efficiently even with very large query lists.

(As does the brute force JChem Base API method using a for loop.)

Best regards,

Szabolcs

ChemAxon 42004978e8

14-09-2012 09:39:48

Hi,

I would like to comment on the JChemSearch API solution.

If you are interested only in whether a structure is hit by any of the queries, and not in the exact list of queries that hit it, you can save time by limiting each search to those structures that have not yet been hit by earlier queries.

You can exclude targets that were already hit by earlier queries using the JChemSearch.setFilterIDNotList method. (These excluded targets will be members of the final result either way, regardless of the current result.)

This may be a significant speed-up for queries hitting a large portion of the database.
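The idea can be sketched roughly as follows. The `FilteredSearch` interface is a hypothetical stand-in for one JChemSearch run in which the exclusion array would be passed to setFilterIDNotList before running; the names below are illustrative only.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Accumulates hits across queries, skipping targets that were already hit. */
class ExcludingBatchSearch {

    /** Hypothetical stand-in for one JChemSearch run with an exclusion list. */
    interface FilteredSearch {
        // excludedIds plays the role of the list given to setFilterIDNotList
        int[] search(String query, int[] excludedIds);
    }

    static Set<Integer> searchAny(List<String> queries, FilteredSearch search) {
        Set<Integer> alreadyHit = new TreeSet<>();
        for (String query : queries) {
            // Everything found so far is excluded from the next search.
            int[] excluded = alreadyHit.stream()
                                       .mapToInt(Integer::intValue)
                                       .toArray();
            for (int id : search.search(query, excluded)) {
                alreadyHit.add(id);
            }
        }
        // Excluded targets stay in the final result regardless of later searches.
        return alreadyHit;
    }
}
```

Note that this only yields the combined hit set; the information about which query hit which structure is lost.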

Bye,

Robert

User 8d34d3a066

14-09-2012 15:29:43

Thank you for the quick responses.

Firstly, regarding Robert's point: we have a set of over 10 million structures, and our query sets are likely to be quite targeted, so the setFilterIDNotList option has already been considered but deemed unsuitable.

Using the cartridge and jc_compare seems like a good fit. We could easily add a JDBC wrapper around the calls to make use of the operations in our Java application.
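A JDBC wrapper of this kind might look roughly like the sketch below. It issues one jc_compare call per query rather than a single concatenated one. The `cd_id` column, the use of bind variables for both the query and the options string, and all class and method names are assumptions for illustration; the options value (for example `t:s` for substructure search) is passed through unchanged and should be checked against the cartridge documentation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Thin JDBC wrapper around jc_compare, one call per query structure. */
class JcCompareJdbc {

    private static final String SQL_NAME = "[A-Za-z_][A-Za-z0-9_$#]*";

    /** Builds the parameterised statement for the given table and column. */
    static String buildSql(String table, String structureColumn) {
        // Identifiers cannot be bound, so accept only plain SQL names.
        if (!table.matches(SQL_NAME) || !structureColumn.matches(SQL_NAME)) {
            throw new IllegalArgumentException("unsafe identifier");
        }
        // Assumption: the table exposes its ID in a column named cd_id.
        return "SELECT cd_id FROM " + table
             + " WHERE jc_compare(" + structureColumn + ", ?, ?) = 1";
    }

    /** Returns the IDs of all rows matching at least one of the queries. */
    static Set<Integer> searchAny(Connection conn, String table,
                                  String structureColumn, List<String> queries,
                                  String options) throws SQLException {
        Set<Integer> hits = new TreeSet<>();
        try (PreparedStatement ps =
                 conn.prepareStatement(buildSql(table, structureColumn))) {
            ps.setString(2, options);
            for (String query : queries) {
                ps.setString(1, query);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        hits.add(rs.getInt(1));
                    }
                }
            }
        }
        return hits;
    }
}
```

Reusing one prepared statement keeps the per-query overhead on the Java side small; the search cost itself is still paid once per query.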

I have a couple of follow-up questions:

Is there a technical reason why this isn't available in the JChem Base API?

Do you have any plans to include it?

Given our situation of a large-ish database, how long (given reasonable hardware) would you expect jc_compare, with a query of 1000 structures and a similarity threshold of 0.95, to take to complete?

cheers

Richard

ChemAxon a3d59b832c

17-09-2012 11:26:08

Hi Richard,

Is there a technical reason why this isn't available in the JChem Base API?

Not really technical, rather organizational: the JChemSearch API is already quite heavy.

The other issue is that if there were a need to identify which record matched which query, the required API change would be even heavier.

Do you have any plans to include it?

No, we do not have plans to include it in the near future.

However, it is not difficult to write a loop yourself, and this way you can take care of the extra things that your workflow requires.

Given our situation of a large-ish database, how long (given reasonable hardware) would you expect jc_compare, with a query of 1000 structures and a similarity threshold of 0.95, to take to complete?

I am using our benchmark data in the JChem FAQ:

http://www.chemaxon.com/jchem/doc/admin/Performance.html#benchmark3

Extrapolating from that, a similarity search on 10M records would take about 0.25 s. (I am not expecting too many results with a similarity threshold of 0.95, so this sounds right. A much lower similarity threshold could give many results, and that would affect the cartridge search time as well, because the results need to be pushed back through Oracle in that case.)

So for 1000 queries, it would be ~ 1000 x 0.25 s = 250 s (4-5 minutes).
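For other batch sizes the same back-of-the-envelope estimate can be written as a small helper. This sketch assumes the queries run one after another, that the cost is the same for every query, and that per-call overhead is negligible; the class and method names are made up for this example.

```java
/** Linear extrapolation of batch search time from a per-query figure. */
class BatchTimeEstimate {

    /** Total time in seconds for queryCount sequential searches. */
    static double totalSeconds(int queryCount, double secondsPerQuery) {
        return queryCount * secondsPerQuery;
    }

    /** Converts seconds to minutes. */
    static double toMinutes(double seconds) {
        return seconds / 60.0;
    }
}
```

With 1000 queries at 0.25 s each this gives 250 s, a little over 4 minutes.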

Best regards,

Szabolcs