LibMCS benchmark

28-06-2006 09:26:46

How does the performance of the LibMCS tool compare with other similar products? Are benchmark test results available?





Can LibMCS handle large sets of structures, like over 10,000 molecules, or perhaps even more, let's say 100,000?

ChemAxon efa1591b5a

28-06-2006 10:27:29

The performance of LibMCS depends heavily on the types of structures, the library, and other circumstances.


For example, clustering 10,000 structures from the NCI database takes less time than clustering 10,000 structures from a combichem library (275 seconds and 312 seconds, respectively).





Three modes are available at the moment: normal (which ensures no solution is lost), fast (which does not guarantee that), and super fast (which certainly misses solutions). All three are useful and important: the conservative normal mode provides high-quality clusters, while the fastest mode gives quick insight into the dataset.





Benchmark test results using a combichem test set (times in seconds):





Code:



No. of structures     normal        fast          superfast
      100                  5           0.613           0.129
     1000                 31           6.5             2.9
    10000                900         272             164
    30000               3600        1380             958








Clustering 100,000 compounds from the NCI database takes 70 mins in fast mode. (P4 2.6GHz, 2GB RAM, Java 1.4.2, RedHat Linux 9.1)
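To make the trade-off between the three modes easier to read off the combichem table, here is a minimal Python sketch that divides the normal-mode time by the time of each faster mode. The timings are copied from the table; the function and variable names are only illustrative and are not part of LibMCS.

```python
# Timings in seconds, copied from the combichem benchmark table.
TIMINGS = {
    100: {"normal": 5.0, "fast": 0.613, "superfast": 0.129},
    1000: {"normal": 31.0, "fast": 6.5, "superfast": 2.9},
    10000: {"normal": 900.0, "fast": 272.0, "superfast": 164.0},
    30000: {"normal": 3600.0, "fast": 1380.0, "superfast": 958.0},
}


def speedup_over_normal(timings):
    """Return {size: {mode: normal_time / mode_time}} for the non-normal modes."""
    result = {}
    for size, row in timings.items():
        result[size] = {
            mode: row["normal"] / seconds
            for mode, seconds in row.items()
            if mode != "normal"
        }
    return result


if __name__ == "__main__":
    for size, ratios in speedup_over_normal(TIMINGS).items():
        print(f"{size:>6} structures: fast {ratios['fast']:.1f}x, "
              f"superfast {ratios['superfast']:.1f}x")
```

On these figures the advantage of the faster modes shrinks as the set grows: fast mode is about 8 times quicker than normal at 100 structures but only about 2.6 times quicker at 30,000.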





Further details can be found in a recent presentation given at the 2nd User Group Meeting:


http://www.chemaxon.com/forum/viewpost6499.html#6499

User 677b9c22ff

24-10-2006 03:31:02

Hi,


I was just talking with Alex Allardyce about this and that, and he suggested sharing more ideas and suggestions and making results publicly available, so I ran a LibMCS example to see how to improve its speed.





Benchmark for LibMCS (Library MCS 0.3) with JChem 3.2, Windows XP 32-bit

Monarch Computer dual Opteron 254, Areca 1120 RAID-5 array (WD Raptor), Java 1.5 server (1.6 GB memory maximum on 32-bit)

Keywords: MCS, maximum common substructure, JKlustor, scaffolds, maximum speed, memory requirements





The test sets are 1,000 and 10,000 molecules from the NCI99 database (attached with all results as a download); the task is to find the maximum common substructures in the shortest time.





Code:



Time (total)  Data set                       Java mode
-----------------------------------------------------------------------
16 sec        NCI-1000-sorted-ascending      Java client mode
16 sec        NCI-1000-sorted-ascending      Java server mode
15 sec        NCI-1000-sorted-ascending      Java server mode aggressive
9 sec         NCI-1000-sorted-descending     Java client mode
9 sec         NCI-1000-sorted-descending     Java server mode
9 sec         NCI-1000-sorted-descending     Java server mode aggressive

1h 32min      NCI-10000-sorted-ascending     Java server mode aggressive
2min 53sec    NCI-10000-sorted-descending    Java server mode aggressive








1) Using the Java server mode gives a 40-50% speed boost (in general)

2) Using aggressive Java parameter tuning gives a 1-5% speed boost (in general)

3) Using input pre-sorted in descending order gives a 28-fold speed-up (!) (if I were a PR department, I would call it a 2800% speed improvement)





The calling modes are (using my personal classpath as an example):

client mode: java -client -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi

server mode: java -server -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi

aggressive server mode: java -server -Xms1624m -Xmx1624m -XX:+AggressiveHeap -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor nci-100000.smi
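If you want to repeat the comparison on your own machine, a small Python wrapper can assemble the three command lines and time them. This is only a sketch: the JVM flags and the main class are the ones quoted above, while the helper names `build_command` and `time_run` are my own, and the timing assumes the program exits once clustering has finished.

```python
import subprocess
import time

# JVM flags exactly as quoted in the calling modes above.
JVM_MODES = {
    "client": ["-client"],
    "server": ["-server"],
    "aggressive": ["-server", "-Xms1624m", "-Xmx1624m", "-XX:+AggressiveHeap"],
}

MAIN_CLASS = "chemaxon.clustering.gui.JKlustor"


def build_command(mode, classpath, input_file):
    """Assemble the java command line for one of the three calling modes."""
    return ["java", *JVM_MODES[mode], "-classpath", classpath, MAIN_CLASS, input_file]


def time_run(command):
    """Run the command and return the wall-clock time in seconds.

    Assumption: the process terminates by itself when the work is done.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start
```

For example, `time_run(build_command("server", r"C:\chemistry\jchem\lib\jchem.jar", "NCI-1000-asc.smi"))` would time the server-mode run on the ascending 1,000-molecule file.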





However, remember to sort your data in descending order before you feed it into libmcs or jklustor. All the Java parameters can be set in the batch files (Windows) or sh scripts (Linux) in the jchem/bin directory. All these changes could also be implemented in later JChem versions.
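The pre-sorting step can also be scripted instead of done in a text editor. The following Python sketch assumes one SMILES string per line and a plain character-by-character alphabetical sort (the thread later confirms the sort was alphabetical); a text editor may order upper and lower case differently. The function names and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def sort_lines_descending(lines):
    """Return the non-empty lines in descending (reverse alphabetical) order."""
    cleaned = [line.rstrip("\r\n") for line in lines]
    return sorted((line for line in cleaned if line.strip()), reverse=True)


def sort_smiles_file(src, dst):
    """Read src (one SMILES per line) and write it to dst sorted descending."""
    lines = Path(src).read_text().splitlines()
    Path(dst).write_text("\n".join(sort_lines_descending(lines)) + "\n")
```

Calling `sort_smiles_file("NCI-1000-asc.smi", "NCI-1000-desc.smi")` would write a descending copy that can then be passed to JKlustor.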





Source and example files (10MB):


http://fiehnlab.ucdavis.edu/staff/kind/Collector/libmcs-benchmark.zip





Kind regards


Tobias Kind


http://fiehnlab.ucdavis.edu/staff/kind/





PS: The documentation of LibMCS is not up to date.


The current batch file libmcs.bat (WIN) calls JKlustor, and all the old arguments result in odd behaviour when used with libmcs.bat (WIN) or the sh script (LINUX).


http://www.chemaxon.com/jchem/index.html?content=doc/user/JKlustor.html

ChemAxon efa1591b5a

25-10-2006 08:19:10

Hi Tobias,





thank you very much for the valuable benchmarking of LibMCS and in particular for sharing your findings with us and with our users.





The server mode speed-up is clear, though I did not expect such a significant boost. It's very good to know!





However, I don't understand the role of sorting. How did you sort your input? By the number of atoms? or..?





What makes me wonder is that the input is regarded as a set of molecules rather than a sequence of molecules (at least in theory!), so any kind of sorting (or permutation) of the input structures should affect neither the result (the clusters formed) nor the running time of the program.


We must investigate this very thoroughly. We are grateful to you for revealing this artefact.





Regarding the docs, you are absolutely right that they are outdated. The application is changing rapidly (both in terms of internal algorithms and the GUI); we will document the first stable version.





Thank you again for your work.





Kind regards,


Miklos

User 677b9c22ff

16-11-2006 00:23:16

Hi Miklos,


sorry I totally forgot to answer.
mvargyas wrote:
Hi Tobias,


However, I don't understand the role of sorting. How did you sort your input? By the number of atoms? or..?


Miklos
Did you already have a look at the rar file I attached? The SMILES are simply sorted in alphabetical order (with TextPad, a great tool).


I also tried to run libmcs on a multiprocessor machine; however, even though it fires up some threads, the main routine seems to be single-threaded, which is bad if you have 7 cores sitting around doing nothing.





However, the pairwise comparison should scale very well with the number of CPUs once it has been detected that a specific fragment is not totally unique. Would you agree? The PubChem DB currently has 10 million compounds, so this would be a good number of compounds to start with.
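To illustrate why I think the pairwise step parallelises well: every pair can be scored independently, so the pairs can simply be handed to a pool of workers. The Python sketch below is not how LibMCS works internally. The score function is a deliberately trivial stand-in (longest common substring of two SMILES strings, not a chemical MCS), and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def shared_substring_length(pair):
    """Stand-in score: length of the longest common substring of two strings.

    This is NOT a chemical MCS search; it only gives each pair independent work.
    """
    a, b = pair
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pairwise_scores(items, executor, score=shared_substring_length):
    """Score every unordered pair of items using the given executor."""
    pairs = list(combinations(items, 2))
    # executor.map preserves input order, so results line up with pairs.
    return dict(zip(pairs, executor.map(score, pairs)))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        for pair, value in pairwise_scores(["CCO", "CCN", "N"], pool).items():
            print(pair, value)
```

Threads are used here only to keep the sketch portable. For CPU-bound scoring in CPython, passing a `concurrent.futures.ProcessPoolExecutor` instead would be the way to actually spread the pairs over several cores.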





Kind regards


Tobias

ChemAxon efa1591b5a

21-11-2006 09:41:42

Hi Tobias,





just like me.... I also forgot to respond, apologies.





We found the bug that caused the sort-dependent behaviour and are working on it right now. Actually, it was not a programming bug but a conceptual one. Thank you very much for drawing our attention to this problem.





We plan to release a multithreaded/parallel version next year. You are right that the algorithm should scale well with the number of cores/CPUs.





Also, thanks for the PubChem DB suggestion; we will include it, or at least parts of it, in our standard test set. 10M compounds is more than enough. I wish LibMCS could cope with a million...





Best regards,


Miklos