Hi Peter,
1. Calculate 32 bit hashed binary chemical fingerprints?
- 32 bits is very short; for clustering and screening the typical (minimum) length is 1024 bits. Even 64 bits are not sufficient to represent chemical structures: the density of such a fingerprint would be very high (i.e. its information content would be minimal).
- calculation time does not depend on fingerprint length but rather on the number of input structures and on the typical structure size and complexity (cubane takes much longer than octane due to its larger number of bonds and interconnections)
- as a rule of thumb, 1000 structures can be processed in 1 s (2.5 GHz Intel, single core); this includes loading the input, importing structures, generating the fingerprints and saving them to the output file;
thus, calculating 3 million fingerprints takes about 3,000 s, i.e. 50 min, less than 1 hour
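To illustrate the density point above, here is a minimal sketch. It assumes a simplified model in which each structural feature is hashed uniformly to exactly one bit position; the figure of 50 features per molecule is hypothetical, and this is not how JChem computes its fingerprints.

```python
def expected_bit_density(n_bits: int, n_features: int) -> float:
    """Expected fraction of bits set when n_features independent features
    are each hashed uniformly to one of n_bits positions (simplified model)."""
    return 1.0 - (1.0 - 1.0 / n_bits) ** n_features


# n_features = 50 is a hypothetical value for a mid-sized molecule.
for n_bits in (32, 64, 1024):
    print(n_bits, round(expected_bit_density(n_bits, 50), 3))
```

Under these assumptions roughly 80% of the bits are set at 32 bits and under 5% at 1024 bits, so short fingerprints of different molecules end up looking nearly alike.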
2. Calculate 2-dimensional pharmacophore fingerprints?
As a rule of thumb, it's about 4 times slower than generating topological fingerprints; thus for 3 million structures it would be about 3 hours in total.
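The two estimates so far are simple arithmetic on the rules of thumb quoted here. A small sketch, where the function and constant names are only illustrative:

```python
STRUCTURES_PER_SECOND = 1000   # rule of thumb for topological fingerprints, single core
PHARMACOPHORE_SLOWDOWN = 4     # rule of thumb: about 4 times slower


def estimate_seconds(n_structures: int, slowdown: float = 1.0) -> float:
    """Rough wall-clock estimate in seconds for fingerprint generation."""
    return n_structures / STRUCTURES_PER_SECOND * slowdown


n = 3_000_000
topological = estimate_seconds(n)
pharmacophore = estimate_seconds(n, PHARMACOPHORE_SLOWDOWN)
print(f"topological:   {topological / 60:.0f} min")
print(f"pharmacophore: {pharmacophore / 3600:.1f} h")
```

For 3 million structures this gives 50 minutes and a little over 3 hours, respectively.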
3. Calculate Fuzzy pharmacophore fingerprints?
Same as above, fuzzy smoothing is fast.
4. Library self-dissimilarity test using each of the above fingerprints?
I'm afraid that this is not feasible with the current JKlustor technology. We aim to implement such tools in the future.
5. Variable-length Jarvis-Patrick clustering using 32 bit binary fingerprints with the central option?
Same as above: Jarvis-Patrick can cope with a couple of hundred thousand structures, but not 3 million. For that purpose we recommend either the sphere exclusion or the k-means algorithm, to be released soon in version 5.3 of JChem. Those are intended to process large libraries.
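For readers unfamiliar with sphere exclusion, the general idea can be sketched as follows. This is a generic textbook-style version, not the JChem implementation: fingerprints are represented as plain Python integers, similarity is Tanimoto, and the function names and the threshold parameter are assumptions made for illustration.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union


def sphere_exclusion(fingerprints, threshold):
    """Greedy sphere exclusion: take the first unassigned fingerprint as a
    cluster center, assign every unassigned fingerprint whose similarity to
    the center is at least `threshold`, and repeat until none are left."""
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        center = unassigned[0]
        members = [i for i in unassigned
                   if tanimoto(fingerprints[center], fingerprints[i]) >= threshold]
        clusters.append(members)  # the center is always its own first member
        taken = set(members)
        unassigned = [i for i in unassigned if i not in taken]
    return clusters


fps = [0b1111, 0b1110, 0b0001, 0b110000]
print(sphere_exclusion(fps, 0.7))
```

In this sketch each remaining fingerprint is compared only with the current center, so the cost grows with the number of structures times the number of clusters rather than with all pairs.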
Clustering 10K input structures takes 10 seconds or so, but 100K takes 24 minutes on a 2.16 GHz Intel Core 2 running Mac OS X, Java 1.5.
The memory requirement is not so bad; even 100K should fit in a 200 MB heap.
I have never tried 1M, but you made me curious, so I am launching an evening run on one of our servers. I'll get back to this topic later.
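Until that run finishes, a very rough extrapolation from the two Jarvis-Patrick timings above can be sketched as follows. It assumes the running time follows a power law in the number of structures, which is only an assumption; with just two data points, one of them approximate, the result is an order-of-magnitude guess at best.

```python
import math

# Timings quoted above (2.16 GHz Core 2, Java 1.5).
n1, t1 = 10_000, 10.0          # about 10 s
n2, t2 = 100_000, 24 * 60.0    # 24 min

# Exponent of the assumed power law t ~ n ** exponent.
exponent = math.log(t2 / t1) / math.log(n2 / n1)


def extrapolate_seconds(n: int) -> float:
    """Extrapolated running time in seconds under the power-law assumption."""
    return t2 * (n / n2) ** exponent


print(f"exponent ~ {exponent:.2f}")
print(f"1M structures ~ {extrapolate_seconds(1_000_000) / 3600:.0f} h")
```

The fitted exponent is slightly above 2, which would put 1M structures at somewhere over two days on the same machine if the trend held.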
6. Ward's hierarchic clustering using 32 bit binary fingerprints with the central option?
Ward, being a hierarchical approach, takes a long time and a lot of memory to operate. In practice it can process fewer than 100K structures.
7. Any thoughts on how well jKluster utilises multiple cores?
The current implementation does not make use of multiple cores. This should come in 5.4, next year.
Best regards,
Miklos