How does the performance of the LibMCS tool compare with other similar products? Are benchmark test results available?
Can LibMCS handle large sets of structures, like over 10,000 molecules, or perhaps even more, let's say 100,000?
ChemAxon efa1591b5a
28-06-2006 10:27:29
The performance of LibMCS depends heavily on the types of structures, the library, and other circumstances.
For example, clustering 10,000 structures from the NCI database takes roughly 88% of the time needed to cluster 10,000 structures from a combichem library (275 seconds and 312 seconds, respectively).
Three modes are available at the moment: normal (which ensures that no solution is lost), fast (which does not guarantee this), and superfast (which certainly misses solutions). All three are useful and important: the conservative normal mode provides high-quality clusters, while the fastest mode gives quick insight into the dataset.
Benchmark test results using a combichem test set (times in seconds):
Code: |
No. of structures   normal   fast    superfast
100                 5        0.613   0.129
1000                31       6.5     2.9
10000               900      272     164
30000               3600     1380    958
|
Clustering 100,000 compounds from the NCI database takes 70 mins in fast mode. (P4 2.6GHz, 2GB RAM, Java 1.4.2, RedHat Linux 9.1)
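To compare the modes more directly, the combichem timings above can be turned into speed-ups relative to normal mode. The snippet below is only a small arithmetic sketch over the numbers quoted in the table; the names `TIMES` and `speedups` are made up for illustration and are not part of LibMCS.

```python
# Combichem benchmark times in seconds, copied from the table above:
# number of structures -> (normal, fast, superfast)
TIMES = {
    100: (5, 0.613, 0.129),
    1000: (31, 6.5, 2.9),
    10000: (900, 272, 164),
    30000: (3600, 1380, 958),
}


def speedups(times):
    """Return {n: (fast_speedup, superfast_speedup)} relative to normal mode."""
    return {
        n: (normal / fast, normal / superfast)
        for n, (normal, fast, superfast) in times.items()
    }


if __name__ == "__main__":
    for n, (fast_x, superfast_x) in speedups(TIMES).items():
        print(f"{n:>6} structures: fast {fast_x:.1f}x, superfast {superfast_x:.1f}x")
```

On these numbers the advantage of the faster modes shrinks as the set grows: fast mode is roughly 8 times quicker than normal mode at 100 structures but under 3 times quicker at 30,000.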
Further details can be found in a recent presentation given at the 2nd User Group Meeting:
http://www.chemaxon.com/forum/viewpost6499.html#6499
User 677b9c22ff
24-10-2006 03:31:02
Hi,
I was just talking with Alex Allardyce about this and that, and he suggested sharing more ideas and suggestions and making results publicly available, so I ran a LibMCS example and looked at how to improve its speed.
Benchmark for libmcs (Library MCS 0.3) with JCHEM 3.2, WIN-XP 32-bit
MonarchComputer Dual Opteron 254, Areca 1120 Raid-5 Array (WD Raptor), JAVA 1.5 server (for 32-bit 1.6 Gbyte memory maximum)
Keywords: MCS, Maximum common substructure, JKlustor, scaffolds, Maximum speed, Memory requirements
The test sets are 1,000 and 10,000 molecules from the NCI99 database (attached, with all results, as a download); the task is to find the maximum common substructures in the shortest time.
Code: |
Time (total)   Molecule set                  Mode
-------------------------------------------------------------
16 sec         NCI-1000-sorted-ascending     Java client mode
16 sec         NCI-1000-sorted-ascending     Java server mode
15 sec         NCI-1000-sorted-ascending     Java server mode aggressive
9 sec          NCI-1000-sorted-descending    Java client mode
9 sec          NCI-1000-sorted-descending    Java server mode
9 sec          NCI-1000-sorted-descending    Java server mode aggressive
1h 32min       NCI-10000-sorted-ascending    Java server mode aggressive
2min 53sec     NCI-10000-sorted-descending   Java server mode aggressive
|
1) Using the Java server mode gives a 40-50% speed boost (in general)
2) Using aggressive Java parameter tuning gives a 1-5% speed boost (in general)
3) Using input data pre-sorted in descending order gives a 28-times speed-up (!) (If I were a PR department, it would be a 2800% speed improvement)
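As a quick check of the sorting effect, the ratio can be recomputed from the table rows themselves. This is only an arithmetic sketch with illustrative helper names; computed directly from the two NCI-10000 rows, the ratio comes out at about 32, in the same ballpark as the 28-fold figure quoted above (which may stem from a different run or from rounding; the thread does not say).

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given as hours/minutes/seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def speedup(slow, fast):
    """How many times faster the 'fast' run is compared with the 'slow' run."""
    return slow / fast


# Times taken from the benchmark table above.
ascending_10k = to_seconds(hours=1, minutes=32)      # 1h 32min
descending_10k = to_seconds(minutes=2, seconds=53)   # 2min 53sec
ascending_1k = to_seconds(seconds=16)                # client mode, ascending
descending_1k = to_seconds(seconds=9)                # client mode, descending

if __name__ == "__main__":
    print(f"NCI-1000:  {speedup(ascending_1k, descending_1k):.1f}x")
    print(f"NCI-10000: {speedup(ascending_10k, descending_10k):.1f}x")
```

The effect is far stronger on the larger set: under 2x for 1,000 molecules versus roughly 32x for 10,000.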
The calling modes are (as example with my personal classpath):
client mode: java -client -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi
server mode: java -server -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor NCI-1000-asc.smi
aggressive server mode: java -server -Xms1624m -Xmx1624m -XX:+AggressiveHeap -classpath C:\chemistry\jchem\lib\jchem.jar chemaxon.clustering.gui.JKlustor nci-100000.smi
However, remember to sort your data in descending order before you feed it into libmcs or JKlustor. All the Java parameters can be set in the batch files (WIN) or sh scripts (LINUX) in the jchem/bin directory. All these changes could also be implemented in later JChem versions.
Source and example files (10MB):
http://fiehnlab.ucdavis.edu/staff/kind/Collector/libmcs-benchmark.zip
Kind regards
Tobias Kind
http://fiehnlab.ucdavis.edu/staff/kind/
PS: The documentation of LibMCS is not up to date.
The current batch file libmcs.bat (WIN) calls JKlustor, and all the old arguments result in funny behaviour when used with libmcs.bat (WIN) or the sh script (LINUX).
http://www.chemaxon.com/jchem/index.html?content=doc/user/JKlustor.html
ChemAxon efa1591b5a
25-10-2006 08:19:10
Hi Tobias,
thank you very much for the valuable benchmarking of LibMCS, and in particular for sharing your findings with us and with our users.
The server-mode speed-up is clear, though I did not expect such a significant boost; it's very good to know!
However, I don't understand the role of sorting. How did you sort your input? By the number of atoms? or..?
What puzzles me is that the input is regarded as a set of molecules rather than a sequence of molecules (at least in theory!), so any kind of sorting (or permutation) of the input structures should affect neither the result (the clusters formed) nor the running time of the program.
We must investigate this very thoroughly. We are grateful to you for revealing this artefact.
Regarding the docs, you are absolutely right that they are outdated. The application is changing rapidly (both in terms of internal algorithms and the GUI); we will document the first stable version.
Thank you again for your work.
Kind regards,
Miklos
User 677b9c22ff
16-11-2006 00:23:16
Hi Miklos,
sorry I totally forgot to answer.
mvargyas wrote: |
Hi Tobias,
However, I don't understand the role of sorting. How did you sort your input? By the number of atoms? or..?
Miklos |
Have you already had a look at the rar file I attached? The SMILES are simply sorted in alphabetical order (with Textpad, a great tool).
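For anyone without Textpad, the same preparation step can be scripted. This is a minimal sketch, assuming one SMILES record per line; the function names and file names are hypothetical, and Python's plain string ordering (uppercase before lowercase) may not match Textpad's sort exactly.

```python
from pathlib import Path


def sort_smiles_descending(lines):
    """Return the non-empty lines sorted in descending (reverse alphabetical) order."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return sorted(cleaned, reverse=True)


def sort_smiles_file(src, dst):
    """Read one SMILES record per line from src and write them, sorted, to dst."""
    lines = Path(src).read_text().splitlines()
    Path(dst).write_text("\n".join(sort_smiles_descending(lines)) + "\n")


if __name__ == "__main__":
    # Tiny in-memory demonstration; real input would come from a .smi file.
    for smiles in sort_smiles_descending(["CCO", "c1ccccc1", "CC"]):
        print(smiles)
```

A call such as `sort_smiles_file("NCI-1000.smi", "NCI-1000-desc.smi")` (file names are placeholders) would produce the descending input used in the benchmark.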
I also tried to run libmcs on a multiprocessor machine. Even though it fires up some threads, the main routine seems to be single-threaded, which is bad if you have 7 cores hanging around doing nothing.
However, the pairwise comparison would scale very well with n CPUs once it has been detected that a specific fragment is not totally unique. Would you agree with that? The PubChem DB currently has 10 million compounds, so this would be a good number of compounds to start with.
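The reason pairwise comparison lends itself to multiple CPUs is that each pair can be scored independently of every other pair. Below is a minimal sketch of just the work split, assuming an exhaustive all-pairs comparison (the thread does not say whether LibMCS compares every pair); the function names are hypothetical and the MCS scoring itself is left out.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n molecules: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def split_pairs(n_items, n_workers):
    """Deal all index pairs (i, j) with i < j round-robin into n_workers batches.

    Each batch could then be handed to a separate core, since no pair
    depends on the result of another.
    """
    batches = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_items), 2)):
        batches[k % n_workers].append(pair)
    return batches


if __name__ == "__main__":
    for worker, batch in enumerate(split_pairs(5, 3)):
        print(f"worker {worker}: {batch}")
    print("pairs for 10 million compounds:", pair_count(10_000_000))
```

For 10 million compounds an exhaustive comparison would mean 49,999,995,000,000 pairs, which suggests that a set of that size would need heavy pruning in addition to more cores.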
Kind regards
Tobias
ChemAxon efa1591b5a
21-11-2006 09:41:42
Hi Tobias,
just like me.... I also forgot to respond, apologies.
We found the bug that caused the sort-dependent behaviour, and we are working on it right now. Actually, it was not a programming bug but a conceptual one - thank you very much for drawing our attention to this problem.
We plan to release a multithreaded/parallel version next year - you are right that the algorithm should scale well with the number of cores/CPUs.
Also, thanks for the PubChem DB suggestion; we will include it, or at least parts of it, in our standard test set. 10M compounds is more than enough - I wish LibMCS could cope with a million...
Best regards,
Miklos