Diverse Compound Selector - ChemAxon Forum Archive

User 2347372188

08-08-2011 23:04:56

Hello. During a past user group meeting, I believe I heard someone at ChemAxon say you were going to implement a method for selecting the N most diverse molecules from a large set of molecules. However, I haven't noticed anything in your documentation that suggests that this functionality has been implemented. If it has not been implemented, when do you plan to implement it?

Thanks.

-&

ChemAxon 8b644e6bf4

24-08-2011 01:58:29

Dear Steven,

Sorry for the late answer. A diverse selection algorithm is accessible from the "jklustor" command line as a special clustering method, using the option "-c mmds" or "-c mmds:<SETCOUNT>". This algorithm is also available in the jklustor web demo at http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp

Alternatively, the cluster centroids identified by sphere exclusion clustering (use option "-c sphex" or "-c sphex:<DISSIMILARITYRADIUS>") can be treated as a diverse subset.

Detailed command-line help is available with "jklustor -h".

More detailed examples and descriptions of these algorithms follow.

Examples - Online demo

go to http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp

enter http://www.chemaxon.com/shared/libMCS/default.sdf into the left input field (this can be done by clicking on the last “example” link)

click “Add!”

in the “launch clustering” box select “Diverse subset” and click “Launch”

click “View Clustering results”

click on the floppy icon in the line “Total cluster count (including singletons)” to save the diverse subset in SMILES format

Examples - Command line

jklustor -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and write them to the output

jklustor -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf
Select diverse structures as the centroids of a sphere exclusion clustering (with dissimilarity radius = 0.8)

jklustor -l 88 -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and start listening on port 88 with a web user interface similar to the online example above. When the message “Launch listening server on port 88” appears, connect with a browser to http://localhost:88

Maximum of minimal dissimilarity selection (MMDS)

This selection algorithm yields a diverse subset whose size (k) is specified. The selection algorithm:

A centrum node is identified as the first element of the selection
- Select the node which has the smallest rmsd dissimilarity from the other nodes (the sum of the squares of its dissimilarity scores from the other nodes is the smallest)

Select k-1 further diverse nodes in k-1 selection steps.
- For each unselected node, find the most similar previously selected node (the nearest selected node), i.e. the one with the smallest dissimilarity score
- Select the node whose nearest selected node has the highest dissimilarity score

Note that this algorithm typically tends to select the outliers (apart from the first centrum) from the input set.
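The two steps above can be sketched in plain Python. This is only an illustration of the description, not the jklustor implementation: the function name `mmds_select`, the index-based return value and the pluggable `dissim` callable are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def mmds_select(items: Sequence[T], k: int,
                dissim: Callable[[T, T], float]) -> List[int]:
    """Return the indices of k items picked by max-min dissimilarity selection.

    Hypothetical helper written for illustration; not part of any ChemAxon API.
    """
    n = len(items)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(items)")

    # Step 1: the centrum has the smallest sum of squared dissimilarities
    # to all other items.
    centrum = min(
        range(n),
        key=lambda i: sum(dissim(items[i], items[j]) ** 2
                          for j in range(n) if j != i),
    )
    selected = [centrum]

    # nearest[i] is the dissimilarity between item i and its nearest selected item.
    nearest = [dissim(items[i], items[centrum]) for i in range(n)]

    # Step 2: repeatedly take the unselected item whose nearest selected
    # item is the farthest away.
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        for i in range(n):
            nearest[i] = min(nearest[i], dissim(items[i], items[best]))
    return selected


# Toy usage with numbers instead of fingerprints.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
print(mmds_select(points, 3, lambda a, b: abs(a - b)))
```

In the toy input the value 10.0 is picked right after the centrum, which matches the remark that the algorithm tends to select outliers.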

Clustering using MMDS

A clustering algorithm (accessible with “-c mmds:<k>” in the jklustor command line) is defined that uses the MMDS algorithm described above:

Select k diverse nodes using the MMDS algorithm

Consider these nodes as cluster representatives

Assign every input node (including those that were selected) to the cluster whose representative has the smallest dissimilarity value (assign to the nearest selected node)
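The assignment step can be sketched as follows. The helper name `assign_to_nearest` and the convention of passing the representatives as indices into the input list are assumptions for this example; the representatives would normally come from an MMDS selection.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_nearest(items: Sequence[T], representatives: Sequence[int],
                      dissim: Callable[[T, T], float]) -> Dict[int, List[int]]:
    """Map each representative index to the indices of the items assigned to it.

    Hypothetical helper for illustration; not part of any ChemAxon API.
    """
    clusters: Dict[int, List[int]] = {r: [] for r in representatives}
    for i, item in enumerate(items):
        # A representative has zero dissimilarity to itself, so with a
        # well-behaved metric it ends up in its own cluster.
        nearest = min(representatives, key=lambda r: dissim(item, items[r]))
        clusters[nearest].append(i)
    return clusters


# Toy usage: numbers instead of fingerprints, representatives given by index.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
print(assign_to_nearest(points, [3, 4, 0], lambda a, b: abs(a - b)))
```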

Using sphere exclusion clustering

The cluster centroids identified by the sphere exclusion clustering algorithm can be considered a diverse subset. The clustering algorithm currently implemented:

The first structure read is selected as a cluster centroid

For every input structure the least dissimilar (“nearest”) previously selected centroid is identified
- If the dissimilarity from the nearest centroid is above a given dissimilarity radius, then the structure is selected as a new centroid

When all structures have been read and the individual input structures are used (either by a “wrmols” output action or by giving the option “-l”), every input structure (including the selected centroids) is assigned to the least dissimilar (nearest) centroid

Note that any two centroids have a dissimilarity higher than the given radius. The proper dissimilarity radius depends on the input set and the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
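A minimal sketch of the centroid selection pass described above, in plain Python. The name `sphex_centroids` and the pluggable `dissim` callable are assumptions for the example, and the final assignment of every structure to its nearest centroid is left out.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sphex_centroids(items: Sequence[T], radius: float,
                    dissim: Callable[[T, T], float]) -> List[int]:
    """Return the indices of the centroids found in a single pass over items.

    Hypothetical helper for illustration; not the jklustor implementation.
    """
    centroids: List[int] = []
    for i, item in enumerate(items):
        # all() is True for an empty list, so the first item always becomes
        # a centroid; later items qualify only if even their nearest
        # centroid lies outside the radius.
        if all(dissim(item, items[c]) > radius for c in centroids):
            centroids.append(i)
    return centroids


# Toy usage with numbers instead of fingerprints.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
print(sphex_centroids(points, 1.5, lambda a, b: abs(a - b)))
```

Because the pass depends on the order in which structures are read, reordering the input can change which structures become centroids. A larger radius yields fewer centroids, which is one way to approach the iterative refinement of the radius.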

Regards,

Gabor

User 30619d62ec

26-10-2011 16:14:03

Hello,

I have a simple question. If you are clustering compounds from a SMILES or SDF file using the MMDS algorithm, and you don't specify the descriptors or the metrics, which are used?

It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:

http://www.chemaxon.com/jchem/doc/user/fingerprint.html

but I haven't found the default scoring metric. Could someone please inform me?

Thank you!

ChemAxon 8b644e6bf4

03-11-2011 01:07:13

Dear Daniel,

"It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:"

Yes, this fingerprint is used by default; this is mentioned in the command line help (use "jklustor -h"):

-d <desc>[:<metrics>]
            Specify molecular descriptor and optionally the metric to use.
            Please note that certain clustering  algorithms can be incompatible
            with  certain descriptors or  metrics. It is always  safe to not to
            specify descriptors or metrics.
 Descriptors:
   cfp        ChemAxon's chemical fingerprint (default)
              Supported dissimilarity metrics:
              tanimoto, manhattan, euclid, euclidsqr, commonbits
   ecfp       ChemAxon's ECFP fingerprint implementation
              tanimoto, manhattan, euclid, euclidsqr, commonbits

Information on the default metric is missing from the help; it is currently tanimoto.
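For orientation, the usual Tanimoto dissimilarity between two binary fingerprints is one minus the ratio of shared on-bits to the on-bits set in either fingerprint. The sketch below uses that textbook definition on Python sets of bit positions; how jklustor computes it internally is not stated here, and the function name and the handling of empty fingerprints are assumptions.

```python
from typing import AbstractSet


def tanimoto_dissimilarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Textbook Tanimoto dissimilarity between two sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


# Two fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto_dissimilarity({1, 2, 3}, {2, 3, 4}))  # 0.5
```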

Further information on the descriptor used is available in jklustor's command-line standalone server mode. Use the additional parameter -s <PORT> and connect with a browser to http://localhost:<PORT> (use 89 for example as <PORT>):

$ ./jklustor C -c mmds -s 89
Launch listening server on port 89

Connecting to http://localhost:89/show/overview reveals:

"Molecular descriptor used

Chemical fingerprint; metric: default tanimoto | CFP length=1024; CFP bitCount=2; CFP bondCount=7"

Regards,

Gabor

User bc75629bf0

03-06-2014 11:55:14

Hello,

I have a related problem. I have a table of compounds sorted by a score value (let's call it desirability score). I want to select a diverse subset of them in the following manner:

1. Select the one with the best desirability score. (1st entry in the sorted table)

2. Go on to the next entry, but select it only if it is less similar to ALL of the previously selected compounds than a defined cutoff (e.g. Tanimoto similarity must be less than 0.7).

3. Repeat 2nd step until I get an output of a desired number of compounds (e.g. 100).

(I would use ECFP fingerprints for similarity calculations.)

Is there a workaround for this?

Thanks.

David

ChemAxon 8b644e6bf4

18-07-2014 16:35:14

Dear David,

Currently no such workaround is available; however, an implementation could be constructed using the already available APIs.
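For a few hundred molecules, the three steps in the question amount to a simple greedy filter. The sketch below is plain Python and does not use any ChemAxon API: fingerprints are represented as sets of on-bit positions, and the names `tanimoto_similarity` and `select_by_score` are hypothetical. In practice the ECFP fingerprints would have to be generated separately and passed in, already sorted by desirability score.

```python
from typing import AbstractSet, List, Sequence


def tanimoto_similarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Textbook Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    # Assumption: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def select_by_score(fingerprints: Sequence[AbstractSet[int]],
                    max_similarity: float, limit: int) -> List[int]:
    """Greedy selection from fingerprints sorted best score first.

    Returns the indices of the selected entries.
    """
    selected: List[int] = []
    for i, fp in enumerate(fingerprints):
        if len(selected) >= limit:
            break
        # Keep the entry only if it is below the cutoff against ALL
        # previously selected entries.
        if all(tanimoto_similarity(fp, fingerprints[j]) < max_similarity
               for j in selected):
            selected.append(i)
    return selected


# Toy fingerprints, assumed to be sorted by desirability already.
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {7, 8, 9}, {7, 8, 9, 10}, {20, 21}]
print(select_by_score(fps, max_similarity=0.7, limit=100))
```

The first entry is always selected, since there is nothing to compare it against. The loop stops early once the desired number of compounds is reached, and it may return fewer if the cutoff is too strict for the input.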

What front end are you using (API, command line, etc)?

What is the typical input set size?

Regards

Gabor

User bc75629bf0

12-08-2014 13:14:55

Dear Gabor,

Thanks for your reply. I'm mostly using the Instant JChem client, or the command line if necessary. The typical input size for this problem is a few hundred molecules. Thanks in advance for your help.

All the best,

David