Instant-JChem+Derby faster than JChem-Oracle on 3 Mio DBs

User 677b9c22ff

13-11-2006 03:48:05

Hi,

these are the PPT slides from a seminar I gave at the Genome Center about Instant-JChem.

Title: "Benchmarking JChem Oracle and Instant-JChem (and more)".

Some topics

* Importing Structures into Instant JChem + speed

* Influence of the Java HotSpot compiler (client or server)

* Influence of number of CPUs with Instant-JChem (n=1..8)

* Speed of complex calculations with Instant-JChem with SMP CPUs

* Derby database file sizes (100k structures to 20 million structures)

* Instant-JChem on disk based (RAID) and RAM-Disk based systems

* Substructure search in a 3 million compound DB with Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz vs. JChem+Oracle DB on Dual Xeon 3 GHz vs. 8 core Opteron with Instant-JChem, which is almost 8-10x faster than the Xeon (not only 4x) :-)

* A 20 million compound DB with Instant-JChem and what happened

* Results and Conclusions from JChem Oracle vs. Instant-JChem

Kind regards

Tobias Kind

http://fiehnlab.ucdavis.edu/staff/kind/

ChemAxon fa971619eb

13-11-2006 15:47:57

Tobias,

Thanks for that information. Very interesting!

You've gone well beyond the scale that most users will need, and it's good to see that the performance you saw was generally very good.

There are definitely some areas where we can improve the performance of Instant JChem for very large data sets and we'll be working on these.

A few additional comments that might be of interest:

- you can easily store predicted properties in the db using Chemical Terms columns. Not sure whether you are aware of this, but if you are frequently running the same predictions then this will be much more efficient, as the prediction for each structure is only run once and the query is executed as a SQL query, which will be much faster than calculating the values repeatedly.

- with your Oracle benchmarks, was the Oracle db on the local machine? It would obviously impact performance if it was being accessed over the network.

- you should always make sure that you are not using the first search in any benchmarks. The first search is always slower than subsequent ones because the JChem structure cache is being loaded for the first time. For large databases this adds a significant delay to the first search.
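The first comment above (compute a prediction once, store it, then filter with plain SQL) can be illustrated with a minimal, generic sketch. It uses Python's built-in sqlite3 module rather than Instant JChem or Derby, and it does not use the Chemical Terms API. The function `fake_property`, the table `compounds`, and the column `prop` are hypothetical names chosen only for illustration.

```python
import sqlite3


def fake_property(smiles: str) -> float:
    """Hypothetical stand-in for an expensive prediction.

    This is NOT a real chemistry calculation; it just returns the
    length of the SMILES string so the example stays self-contained.
    """
    return float(len(smiles))


def build_table(structures):
    """Create an in-memory table and store the property once per structure."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT, prop REAL)"
    )
    # The prediction runs exactly once per structure, at insert time.
    conn.executemany(
        "INSERT INTO compounds (smiles, prop) VALUES (?, ?)",
        [(s, fake_property(s)) for s in structures],
    )
    # An index lets later range queries avoid scanning every row.
    conn.execute("CREATE INDEX idx_prop ON compounds (prop)")
    conn.commit()
    return conn


def query_by_property(conn, max_value):
    """Filter on the stored value with plain SQL; nothing is recalculated."""
    rows = conn.execute(
        "SELECT smiles FROM compounds WHERE prop <= ? ORDER BY id", (max_value,)
    )
    return [row[0] for row in rows]


conn = build_table(["CCO", "c1ccccc1", "CC(=O)O"])
hits = query_by_property(conn, 5.0)
```

Repeated queries with different thresholds only touch the stored column, which is why the query time stays small compared with recalculating the value for every structure on each search.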

Thanks again for that great data!

Tim

User 677b9c22ff

15-11-2006 19:12:40

Hi Tim,

thanks for your reply.

tdudgeon wrote:

Tobias,

- you can easily store predicted properties in the db using Chemical Terms columns. Not sure whether you are aware of this, but if you are frequently running the same predictions then this will be much more efficient, as the prediction for each structure is only run once and the query is executed as a SQL query, which will be much faster than calculating the values repeatedly.

Yes, I totally agree. What I wanted to show was that Instant-JChem can be fast even when calculating multiple terms. I updated my PPT and included one additional column with the direct query times from predicted values. But this was boring, nothing to show, because it was always less than 1 second :-)

tdudgeon wrote:

- with your Oracle benchmarks, was the Oracle db on the local machine? It would obviously impact performance if it was being accessed over the network.

The times are from ChemAxon as quoted, but Oracle has a certain "overhead" which slows it down in the first place for smaller DBs.

So whenever I can hold the whole DB and the program in memory, I guess I will always be faster. Another thing is proper hardware. I agree, I can have a small 1 million compound DB on my old laptop without hassle. But if provided with enough memory and multiple cores, Instant-JChem can be fast and easy and fun.

tdudgeon wrote:

- you should always make sure that you are not using the first search in any benchmarks. The first search is always slower than subsequent ones because the JChem structure cache is being loaded for the first time. For large databases this adds a significant delay to the first search.

Yes, this is certainly true. On very large DBs (millions of compounds) I usually make sure to search for one compound first, before making complex queries. But this is not an issue on small DBs. The times were mostly measured 3 times; you can see the query times in the Linux console, but not in Windows.
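The procedure described here (run one throwaway search so the cache is loaded, then time the real query three times) can be sketched generically. This is only an illustration under stated assumptions: `benchmark` and `dummy_search` are hypothetical names, and `dummy_search` is a placeholder for whatever call actually runs the query. It is not the harness used for the slides.

```python
import statistics
import time


def benchmark(search, warmups=1, repeats=3):
    """Time `search` after discarding warm-up runs.

    The warm-up calls are executed but not timed, so one-off costs
    such as loading a structure cache do not distort the result.
    """
    for _ in range(warmups):
        search()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        search()
        timings.append(time.perf_counter() - start)

    return {"runs": timings, "median": statistics.median(timings)}


calls = []


def dummy_search():
    """Hypothetical placeholder for a real substructure search."""
    calls.append(1)
    return sum(range(1000))


result = benchmark(dummy_search)
print(f"median of {len(result['runs'])} timed runs: {result['median']:.6f} s")
```

Reporting the median of the timed runs rather than a single value makes one unusually slow run less influential; that choice is an assumption of this sketch, not something stated in the slides.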

Kind regards

Tobias