LibMCS benchmark

28-06-2006 10:27:29

Clustering 100,000 compounds from the NCI database takes 70 minutes in fast mode (P4 2.6 GHz, 2 GB RAM, Java 1.4.2, RedHat Linux 9.1).

Further details can be found in a recent presentation given at the 2nd User Group Meeting:

http://www.chemaxon.com/forum/viewpost6499.html#6499

24-10-2006 03:31:02

1) Running Java in server mode gives a 40-50% speed boost (in general).

2) Aggressive Java parameter tuning gives a 1-5% speed boost (in general).

3) Pre-sorting the input data in descending order gives a 28-fold speed-up (!). (If I were a PR department, I would call it a 2800% speed improvement.)

The calling modes are as follows (the classpath is the one from my own installation):

client mode:
java -client -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi

server mode:
java -server -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi

aggressive server mode:
java -server -Xms1624m -Xmx1624m -XX:+AggressiveHeap -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor nci-100000.smi
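To compare the three modes on the same input, the command lines above can be assembled and timed from a small script. This is only a sketch: the JVM flags, main class and classpath are copied from the post, while `build_command` and `time_command` are hypothetical helper names. Whether JKlustor exits by itself once clustering has finished is not stated here, so the timing helper assumes that it does.

```python
import subprocess
import time

# JVM flags for each calling mode, exactly as listed in the post.
JVM_FLAGS = {
    "client": ["-client"],
    "server": ["-server"],
    "aggressive": ["-server", "-Xms1624m", "-Xmx1624m", "-XX:+AggressiveHeap"],
}

# Main class as used in the post's command lines.
MAIN_CLASS = "chemaxon.clustering.gui.JKlustor"


def build_command(mode, classpath, input_file):
    """Return the java command line for one calling mode as an argument list."""
    return ["java", *JVM_FLAGS[mode], "-classpath", classpath, MAIN_CLASS, input_file]


def time_command(command):
    """Run a command and return its wall-clock time in seconds.

    Assumption: the process terminates on its own when clustering is done.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start
```

A call such as `time_command(build_command("server", r"C:\chemistry\jchem\lib\jchem.jar", "NCI-1000-asc.smi"))` would then give one measurement per mode. Repeating each run a few times helps to separate JVM warm-up from the real difference.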

However, remember to sort your data in descending order before you feed it to LibMCS or JKlustor. All the Java parameters can be set in the batch files (Windows) or sh scripts (Linux) in the jchem/bin directory. All of these changes could also be built into later JChem versions.
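The pre-sorting step can be scripted instead of being done by hand. The sketch below assumes that "descending order" means a plain reverse lexicographic sort of the SMILES lines, one compound per line; the post does not name the sort key, so this is an assumption. The function names and the output file name in the usage note are hypothetical.

```python
from pathlib import Path


def sort_smiles_descending(lines):
    """Return the non-empty lines in descending lexicographic order.

    Assumption: a plain string comparison of the whole line is the sort key.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    return sorted(cleaned, reverse=True)


def sort_smiles_file(src, dst):
    """Read a .smi file, sort its lines in descending order, write the result.

    Returns the number of lines written.
    """
    text = Path(src).read_text(encoding="utf-8")
    ordered = sort_smiles_descending(text.splitlines())
    Path(dst).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return len(ordered)
```

For example, `sort_smiles_file("NCI-1000-asc.smi", "NCI-1000-desc.smi")` would write a reversed copy that can then be passed to JKlustor.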

Source and example files (10MB):

http://fiehnlab.ucdavis.edu/staff/kind/Collector/libmcs-benchmark.zip

Kind regards

Tobias Kind

http://fiehnlab.ucdavis.edu/staff/kind/

PS: The documentation of LibMCS is not up to date.

The current batch file libmcs.bat (Windows) calls JKlustor, and all the old arguments result in funny behaviour when passed to libmcs.bat (Windows) or the sh script (Linux).

http://www.chemaxon.com/jchem/index.html?content=doc/user/JKlustor.html

16-11-2006 00:23:16

Have you already had a look at the rar file I attached? The SMILES are simply sorted alphabetically (with TextPad, a great tool).

I also tried to run LibMCS on a multiprocessor machine. Even though it fires up some threads, the main routine seems to be single-threaded, which is bad if you have 7 cores sitting around doing nothing.

However, the pairwise comparison should scale very well with the number of CPUs once it has been detected that a specific fragment is not totally unique. Would you agree? PubChem currently holds 10 million compounds, so that would be a good number of compounds to start with.
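To illustrate why independent pairwise comparisons spread well over several cores, here is a minimal sketch. It is not the LibMCS algorithm: `toy_similarity` is a stand-in that only counts shared leading characters of two SMILES strings, and all function names are hypothetical. The point is that each pair can be scored without looking at any other pair, so the work can be handed to a process pool unchanged.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n compounds: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def toy_similarity(pair):
    """Placeholder score: length of the common prefix of two SMILES strings.

    This is NOT a maximum common substructure search, only a cheap stand-in.
    """
    a, b = pair
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


def score_pairs(smiles, map_fn=map):
    """Score every unordered pair; map_fn decides serial or parallel execution."""
    pairs = list(combinations(smiles, 2))
    scores = list(map_fn(toy_similarity, pairs))
    return dict(zip(pairs, scores))


def score_pairs_parallel(smiles, workers=None):
    """Same as score_pairs, but distributes the pairs over worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return score_pairs(smiles, map_fn=partial(pool.map, chunksize=1024))
```

Note that `pair_count(10_000_000)` is about 5 x 10^13, so an all-against-all comparison of a database of that size would need pruning or hierarchical grouping in addition to more cores.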

Kind regards

Tobias