Diverse Compound Selector

User 2347372188

08-08-2011 23:04:56

Hello. During a past user group meeting, I believe I heard someone at ChemAxon say you were going to implement a method for selecting the N most diverse molecules from a large set of molecules. However, I haven't noticed anything in your documentation that suggests this functionality has been implemented. If it has not been implemented, when do you plan to implement it?



ChemAxon 8b644e6bf4

24-08-2011 01:58:29

Dear Steven,

Sorry for the late answer. A diverse selection algorithm is accessible from the "jklustor" command line as a special clustering method using option "-c mmds" or "-c mmds:<SETCOUNT>". This algorithm is also available in the jklustor web demo at http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp

Alternatively, the cluster centroids identified by sphere exclusion clustering (use option "-c sphex" or "-c sphex:<DISSIMILARITYRADIUS>") can be treated as a diverse subset.

Detailed command line help is available with "jklustor -h".

More detailed examples and descriptions of these algorithms:

Examples - Online demo

Examples - Command line

Maximum of minimal dissimilarity selection (MMDS)

 This selection algorithm yields a diverse subset whose size (k) is specified. The selection algorithm:

Note that this algorithm tends to select outliers from the input set (apart from the first center).
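The individual steps of the selection are not listed in this post. As a rough illustration only, a common greedy max-min formulation can be sketched as below. This is an assumption about the general technique, not a description of what jklustor does internally; the function name maxmin_select, the choice of the first pick, and the dissim callback are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def maxmin_select(items: Sequence[T], k: int,
                  dissim: Callable[[T, T], float],
                  first: int = 0) -> List[int]:
    """Return indices of k items picked by greedy max-min dissimilarity.

    Assumption: the first pick is supplied by the caller (index `first`).
    """
    if not 0 < k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    selected = [first]
    # min_d[i] is the dissimilarity from item i to its nearest selected item.
    min_d = [dissim(items[i], items[first]) for i in range(len(items))]
    while len(selected) < k:
        # Pick the unselected item that is farthest from everything selected.
        nxt = max((i for i in range(len(items)) if i not in selected),
                  key=lambda i: min_d[i])
        selected.append(nxt)
        # Update the nearest-selected distances with the new pick.
        for i in range(len(items)):
            min_d[i] = min(min_d[i], dissim(items[i], items[nxt]))
    return selected
```

With one-dimensional toy data and absolute difference as the dissimilarity, the picks after the first one jump to the extremes of the data, which is consistent with the remark about outliers.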

Clustering using MMDS

A clustering algorithm (accessible with “-c mmds:<k>” in the jklustor command line) is defined which uses the MMDS algorithm described above:

Using sphere exclusion clustering

Centroids identified by the sphere exclusion clustering algorithm can be
considered as a diverse subset. The clustering algorithm currently

that any two centroids have a higher dissimilarity than the given
radius. The proper dissimilarity radius depends on the input set and the
fingerprint method (CFP/ECFP) used; determining it requires an
iterative refinement.
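To illustrate the property that any two centroids are more dissimilar than the radius, here is a minimal sketch of a sequential sphere-exclusion pass. It is an assumption about the general idea, not jklustor's implementation, which may choose centroids differently; the name sphere_exclusion_centroids and the dissim callback are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sphere_exclusion_centroids(items: Sequence[T], radius: float,
                               dissim: Callable[[T, T], float]) -> List[int]:
    """Return indices of centroids that are pairwise more than `radius` apart."""
    centroids: List[int] = []
    for i, item in enumerate(items):
        # An item becomes a centroid only if it lies outside the sphere
        # of every centroid chosen so far.
        if all(dissim(item, items[c]) > radius for c in centroids):
            centroids.append(i)
    return centroids
```

Running this with several radius values and inspecting how many centroids come back is one simple way to do the refinement mentioned above.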



User 30619d62ec

26-10-2011 16:14:03


I have a simple question. If you are clustering compounds from a SMILES or SDF file using the MMDS algorithm,  and you don't specify the descriptors or the metrics, which are used?

It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:


but I haven't found the default scoring metric. Could someone please inform me?


Thank you!

ChemAxon 8b644e6bf4

03-11-2011 01:07:13


Dear Daniel,


It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:

Yes, this fingerprint is used by default; this is mentioned in the command line help (use "jklustor -h"):

-d <desc>[:<metrics>]
            Specify molecular descriptor and optionally the metric to use.
            Please note that certain clustering  algorithms can be incompatible
            with  certain descriptors or  metrics. It is always  safe to not to
            specify descriptors or metrics.
   cfp        ChemAxon's chemical fingerprint (default)
              Supported dissimilarity metrics:
              tanimoto, manhattan, euclid, euclidsqr, commonbits
   ecfp       ChemAxon's ECFP fingerprint implementation
              tanimoto, manhattan, euclid, euclidsqr, commonbits

The help text does not state the default metric; it is currently tanimoto.
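For reference, the usual Tanimoto dissimilarity between two binary fingerprints is 1 minus the ratio of shared on-bits to the total number of distinct on-bits. The sketch below shows that textbook definition, with each fingerprint represented as a Python set of on-bit positions; the representation and the function name are assumptions for illustration, and whether jklustor computes the metric in exactly this form is not confirmed here.

```python
from typing import Set


def tanimoto_dissimilarity(a: Set[int], b: Set[int]) -> float:
    """1 - |a & b| / |a | b| for two sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: treated as identical by convention here.
        return 0.0
    return 1.0 - len(a & b) / union
```

For example, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share 2 of 4 distinct bits, giving a dissimilarity of 0.5.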

Further information on the descriptor used is available in the jklustor command line standalone server mode. Use the additional parameter -s <PORT> and connect with a browser to http://localhost:<PORT> (for example, use 89 as <PORT>):

$ ./jklustor C -c mmds -s 89
Launch listening server on port 89

Connecting to http://localhost:89/show/overview reveals:

"Molecular descriptor used
Chemical fingerprint; metric: default tanimoto | CFP length=1024; CFP bitCount=2; CFP bondCount=7"



User bc75629bf0

03-06-2014 11:55:14



I have a related problem. I have a table of compounds sorted by a score value (let's call it desirability score). I want to select a diverse subset of them in the following manner:

1. Select the one with the best desirability score. (1st entry in the sorted table)

2. Go on to the next entry, but select it only if it is less similar to ALL of the previously selected compounds than a defined cutoff (e.g. Tanimoto similarity must be less than 0.7).

3. Repeat 2nd step until I get an output of a desired number of compounds (e.g. 100).

(I would use ECFP fingerprints for similarity calculations.)

Is there a workaround for this?




ChemAxon 8b644e6bf4

18-07-2014 16:35:14

Dear David,

Currently no such workaround is available; however, an implementation could be constructed using the already available APIs.
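As a language-neutral illustration of the three steps described in the question, the selection can be sketched as follows. This is not ChemAxon API code: fingerprints are assumed to be available as Python sets of on-bit positions (how ECFP fingerprints are generated is left out), and the names select_by_score, max_similarity and limit are hypothetical.

```python
from typing import List, Sequence, Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by distinct on-bits; 1.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_by_score(fps: Sequence[Set[int]], max_similarity: float = 0.7,
                    limit: int = 100) -> List[int]:
    """Greedy pick over fingerprints that are already sorted best score first.

    Returns the indices of the selected entries, in ranking order.
    """
    selected: List[int] = []
    for i, fp in enumerate(fps):
        if len(selected) >= limit:
            break
        # Keep the entry only if it is below the cutoff against ALL
        # previously selected entries (step 2 of the question).
        if all(tanimoto_similarity(fp, fps[j]) < max_similarity
               for j in selected):
            selected.append(i)
    return selected
```

For a few hundred molecules, the pairwise comparisons in this loop are cheap, so no indexing or pre-clustering should be needed.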

What front end are you using (API, command line, etc)?

What is the typical input set size?



User bc75629bf0

12-08-2014 13:14:55

Dear Gabor,


Thanks for your reply. I'm mostly using the Instant JChem client, or the command line if necessary. The typical input size for this problem is a few hundred molecules. Thanks in advance for your help.


All the best,