LibMCS 0.5 - Sorting SMILES gives different cluster numbers?
I am just wondering if the old sorting bug (sorting by length, descending/ascending) still exists? The speed of libmcs has greatly improved, but I still get different results if I sort my SMILES datasets by string length. I tried setting the cluster count to the maximum to prevent any early finish.
Please see the old forum post for the old discussion
and please see the two attached SMILES files.
The SMILES are not uniquified or canonical. Still, the SMILES themselves are 100% identical in both lists, hence they should give the same cluster results. Are there any settings that can be used for that, or is this still the old bug/feature?
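Before comparing cluster counts, it can help to confirm that the two files really hold the same strings and differ only in order. A minimal sketch in plain Python (not part of libmcs; the helper names and the sample SMILES are made up for illustration):

```python
from collections import Counter


def read_smiles(lines):
    """Return the first whitespace-separated token of each non-empty line."""
    return [line.split()[0] for line in lines if line.strip()]


def same_multiset(a, b):
    """True if both lists contain the same strings with the same multiplicities."""
    return Counter(a) == Counter(b)


# Illustrative data only: the same strings, sorted by length in both directions.
list_asc = sorted(["CCO", "C", "c1ccccc1", "CC"], key=len)
list_desc = sorted(list_asc, key=len, reverse=True)

print(same_multiset(list_asc, list_desc))
```

This compares raw strings only. Since the SMILES are not canonical, two different strings for the same molecule would count as different entries.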
Thanks for all the details, Tobias. The sorting bug should be gone; if not, that's very bad news (and I'd say it came back!).
We'll check it and resolve it. Btw: for a reason yet unknown to me, nci1000 behaves very badly with libmcs! Clustering is very slow. I reckon there is a nasty structure that causes the problem, so your post also reminds me to fix that issue.
O.k. I managed to reproduce the order dependency problem. I have an idea about the reason behind this behavior, and it's rather a feature than a bug. This is related to a new way of reducing the memory consumption of libmcs, which can cause such non-deterministic behavior as experienced with your test files.
At present the related parameters are not interfaced, thus the user cannot set the memory usage policy. With 1000 input structures one should be able to set these parameters to relaxed values, thus achieving deterministic behavior. We'll do this: all related parameters will be available in the GUI (in the options dialog), and besides, we will change the default values to allow consistent behavior for small files. However, no such guarantee can be given for large files (e.g. over 30K structures).
The real problem here is the unbelievably slow progress of libmcs: it should not take an hour to cluster 1000 diverse structures, so first we need to work on this.
Thanks for the valuable feedback.
The easiest way of getting rid of this behavior would be to pre-sort the list in memory, but this would not solve the problem in the first place (if it is actually a problem). I would rather have it as a feature.
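The pre-sorting idea can be sketched outside of libmcs as a preprocessing step. The snippet below is only an illustration in plain Python, with a hypothetical helper name. It puts any input order into one fixed order (by length, then alphabetically), so every run sees the same sequence. As said above, this only hides the order dependency, it does not remove it.

```python
def stable_order(smiles):
    """Sort by string length, then alphabetically, so ties have a fixed order."""
    return sorted(smiles, key=lambda s: (len(s), s))


# Illustrative data: the same four strings in two different input orders.
run_a = ["CCO", "C", "CC", "CCN"]
run_b = ["CCN", "CC", "CCO", "C"]

print(stable_order(run_a) == stable_order(run_b))
```

The alphabetical tie-break matters: sorting by length alone would leave strings of equal length in their original, input-dependent order.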
However, if you look at these two files from ZINC, they give two large benzene clusters (both files are the same, just sorted differently). One gives 755 benzene groups and the other 649. Both files finish in one second with standard settings (nothing changed).
The speed is not that bad; the NCI-10000 test set now takes 53 seconds instead of 3 minutes. But if you take extremely large molecules it gets bad (lots of ring structures). I would argue that most people don't use libmcs with molecules larger than 1000 Da anyway. There you can just assemble common substructures, make a 2000-bit fingerprint (take the NCI fp set), and cluster the structures according to that. These structures can later be accessed with a much more fine-grained libmcs method.
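To illustrate the fingerprint pre-clustering idea, here is a toy sketch in plain Python. It is not chemically meaningful: instead of real substructure keys (which would need a chemistry toolkit), it hashes character n-grams of the SMILES string into 2000 bits, purely as a stand-in. All names here are hypothetical. The Tanimoto similarity on the resulting bit sets is the part that carries over to a real fingerprint.

```python
import zlib

N_BITS = 2000  # fingerprint length, as suggested in the post


def toy_fingerprint(smiles, n=3):
    """Set one bit per character n-gram (a stand-in for substructure keys)."""
    fp = 0
    for i in range(max(len(smiles) - n + 1, 1)):
        fragment = smiles[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        fp |= 1 << (zlib.crc32(fragment.encode("ascii")) % N_BITS)
    return fp


def tanimoto(a, b):
    """Shared bits divided by total bits set in either fingerprint."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


fp1 = toy_fingerprint("c1ccccc1O")
fp2 = toy_fingerprint("c1ccccc1N")
print(tanimoto(fp1, fp2))
```

Structures could then be grouped coarsely by similarity threshold, and only the members of each coarse group passed on to the more expensive MCS search.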
If I cancel a calculation and change options, the result window sometimes pops up (although it should be cancelled) with an error:
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
Miklos is on holiday until 25 July, he will answer to you as soon as he is back.
I'll try to fix the cancelling problem, thank you for reporting it.
the cancelling problem is fixed in LibMCS 0.5.3 available at http://www.chemaxon.com/shared/libMCS/
Unfortunately cancelling doesn't work properly yet when LibMCS fails to import structures (progress seems to be stopped in the GUI, but the thread keeps running with high CPU usage).
We will analyse it with Miklos when he is back.
Thanks for checking it out. I will come back to this later. Maybe it would be good to have a standard test set published from your side so one can check the changes; otherwise I can upload some of my files as a test set.
I think it would also be helpful to generate output from a random file (like an NCI subset) with all parameters varied, or a design of experiments (DOE) over all the different libmcs parameters, to see the outcome and results of the different settings, including the speed modes (normal, fast, veryfast), atom type, bond type, charge, minimal MCS size, keep rings, required cluster size, maximum level count, minimal similarity and so on.
Is it possible to provide command-line options to allow such a DOE?
For GUI programs like libmcs it would be nice to have a small integrated help file; such a help file is just a "must have".
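To make the DOE request concrete: a full-factorial design is just every combination of the chosen parameter levels. The sketch below enumerates such a grid in plain Python. The parameter names and levels are placeholders taken from the list above, not actual libmcs options or command-line flags.

```python
from itertools import product

# Hypothetical factor names and levels, for illustration only.
factors = {
    "mode": ["normal", "fast", "veryfast"],
    "atom_type": [True, False],
    "bond_type": [True, False],
    "keep_rings": [True, False],
}


def full_factorial(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    levels = (factors[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


runs = full_factorial(factors)
print(len(runs))  # 3 * 2 * 2 * 2 = 24 combinations
```

Each dict in `runs` would correspond to one clustering run on the same input file, so that cluster counts and timings can be compared across settings.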