JKlustor - Some performance questions - ChemAxon Forum Archive

User 010ddfdc89

16-10-2009 03:56:43

Hi Folks,

We have just started exploring some of the capabilities of JKlustor, and one of our chemists had some performance questions (shown below). Does anyone have any experience or metrics on this?

We are working on a database with 3,000,000 druglike compounds. 
Assuming we are doing all calculations using a single core on a new Xeon (or similar)
processor, approximately how long does it take to run the following JKlustor jobs
using standard settings:
  
Calculate 32 bit hashed binary chemical fingerprints? 
Calculate 64 bit hashed binary chemical fingerprints? 
Calculate 2-dimensional pharmacophore fingerprints? 
Calculate Fuzzy pharmacophore fingerprints? 
Library self-dissimilarity test using each of the above fingerprints? 
Variable-length Jarvis-Patrick clustering using 32 bit binary fingerprints with the central option? 
Ward's hierarchic clustering using 32 bit binary fingerprints with the central option?

I was actually planning on running this on a quad-core Intel machine. Any thoughts on how well JKlustor utilises multiple cores? How much physical memory would be recommended for the above scenario?

Any info much appreciated.

Cheers.

Peter

ChemAxon efa1591b5a

20-10-2009 14:12:32

Hi Peter,

1. Calculate 32 bit hashed binary chemical fingerprints?

- 32 bits is very short; for clustering and screening, the typical (minimum) length is 1024 bits. Even 64 bits is not sufficient to represent chemical structures: the density of such a fingerprint would be very high (i.e. its information content would be minimal).

- calculation time does not depend on fingerprint length but rather on the number of input structures and on typical structure size and complexity (cubane takes much longer than octane due to its larger number of bonds and interconnections)

- as a rule of thumb, 1000 structures can be processed in 1 s (2.5 GHz Intel single core); this includes loading the input, importing structures, generating the fingerprints and saving them in the output file;

thus, calculating 3 million fingerprints takes about 3,000 s, i.e. 50 min, less than 1 hour
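
The density point in the first bullet above can be illustrated with a rough back-of-the-envelope sketch. It assumes each structural pattern sets its bits uniformly at random across the fingerprint, which is a simplification and not necessarily how the JChem hashing works; the function name `expected_density` and the figure of 100 patterns per molecule are hypothetical and chosen only for illustration.

```python
def expected_density(n_bits: int, n_patterns: int, bits_per_pattern: int = 1) -> float:
    """Expected fraction of bits set in a hashed fingerprint.

    Assumption (not from the thread): every pattern sets `bits_per_pattern`
    bits chosen uniformly at random and independently of the others.
    """
    draws = n_patterns * bits_per_pattern
    # Probability that a given bit is still 0 after all draws is (1 - 1/n_bits)**draws.
    return 1.0 - (1.0 - 1.0 / n_bits) ** draws


if __name__ == "__main__":
    # 100 patterns per molecule is a made-up illustrative value.
    for n_bits in (32, 64, 1024):
        print(n_bits, round(expected_density(n_bits, n_patterns=100), 3))
```

Under these assumptions a 32-bit fingerprint is almost completely filled, so most molecules would look alike, while a 1024-bit fingerprint stays sparse.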

2. Calculate 2-dimensional pharmacophore fingerprints?

As a rule of thumb, it's about 4 times slower than generating topological fingerprints, thus for 3 million structures it would be about 3 hours in total.

3. Calculate Fuzzy pharmacophore fingerprints?

Same as above, fuzzy smoothing is fast.
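
The rule-of-thumb estimates in answers 1 to 3 can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation: the throughput of 1000 structures per second and the factor of 4 come from the reply above, while the function name `estimated_seconds` and its parameters are hypothetical.

```python
def estimated_seconds(n_structures: int,
                      structures_per_second: float = 1000.0,
                      slowdown: float = 1.0) -> float:
    """Estimate single-core fingerprint generation time.

    structures_per_second: rule-of-thumb throughput quoted in the thread.
    slowdown: multiplier relative to topological fingerprints
              (about 4 for pharmacophore fingerprints, per the thread).
    """
    return n_structures / structures_per_second * slowdown


if __name__ == "__main__":
    n = 3_000_000
    topological = estimated_seconds(n)
    pharmacophore = estimated_seconds(n, slowdown=4.0)
    print(f"topological:   {topological / 60:.0f} min")
    print(f"pharmacophore: {pharmacophore / 3600:.1f} h")
```

Actual timings will depend on hardware, structure size and I/O, so these numbers are order-of-magnitude guides only.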

4. Library self-dissimilarity test using each of the above fingerprints?

I'm afraid that this is not feasible with the current JKlustor technology. We aim to implement such tools in the future.

5. Variable-length Jarvis-Patrick clustering using 32 bit binary fingerprints with the central option?

Same as above: Jarvis-Patrick can cope with a couple of hundred thousand structures, but not 3 million. For that purpose we recommend either the sphere exclusion or the k-means algorithm, to be released soon in version 5.3 of JChem. Those are intended to process large libraries.

To cluster 10K input structures, the running time is 10 seconds or so. But for 100K it takes 24 minutes on a 2.16 GHz Intel Core 2 running Mac OS X, Java 1.5.
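
The two timings above give a rough idea of how the running time grows with library size. The sketch below assumes the time follows a simple power law, t = c * n^k, and estimates k from the two quoted data points; since "10 seconds or so" is itself approximate, the result is indicative only. The function name `scaling_exponent` is hypothetical.

```python
import math


def scaling_exponent(n1: float, t1: float, n2: float, t2: float) -> float:
    """Exponent k in t = c * n**k fitted through two (size, time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


if __name__ == "__main__":
    # 10K structures in about 10 s, 100K structures in about 24 min (from the thread).
    k = scaling_exponent(10_000, 10.0, 100_000, 24 * 60.0)
    print(f"estimated exponent: {k:.2f}")
```

An exponent slightly above 2 would be consistent with roughly quadratic growth, which helps explain why the same approach is not recommended for 3 million structures.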

The memory requirement is not so bad; even 100K should fit in a 200 MB heap.

I never tried 1M, but you made me curious, so I launched an evening run on one of our servers. I'll get back to this topic later.

6. Ward's hierarchic clustering using 32 bit binary fingerprints with the central option?

Ward, being a hierarchical approach, takes a long time and a large amount of memory to operate. In practice it can process fewer than 100K structures.
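
To give a feel for why hierarchical methods can be memory-hungry, here is a sketch that assumes a naive implementation storing the full pairwise dissimilarity matrix in condensed form with 4-byte values. The thread does not say how JKlustor implements Ward, so this is an illustrative assumption rather than a description of the product; the function name `condensed_matrix_bytes` is hypothetical.

```python
def condensed_matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed for the n*(n-1)/2 unique pairwise dissimilarities.

    Assumption (not from the thread): one value is stored per unordered pair.
    """
    return n * (n - 1) // 2 * bytes_per_entry


if __name__ == "__main__":
    for n in (10_000, 100_000):
        print(f"{n:>7} structures: {condensed_matrix_bytes(n) / 1e9:.1f} GB")
```

Under this assumption the storage grows quadratically with the number of structures, so a tenfold larger library needs about a hundred times more memory.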

7. Any thoughts on how well JKlustor utilises multiple cores?

The current implementation does not make use of multiple cores. This should come in 5.4, next year.

Best regards,

Miklos