
Diverse Compound Selector
This topic is locked: you cannot edit posts or make replies.
Steven

Joined: 02 Nov 2001
Posts: 52

Posted: Tue Aug 09, 2011 12:04 am    Post subject: Diverse Compound Selector

Hello. During a past user group meeting, I believe I heard someone at ChemAxon say you were going to implement a method for selecting the N most diverse molecules from a large set of molecules. However, I haven't noticed anything in your documentation that suggests that this functionality has been implemented. If it has not been implemented, when do you plan to implement it?

Thanks.


Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Wed Aug 24, 2011 2:58 am

Dear Steven,


Sorry for the late answer. A diverse selection algorithm is accessible from the "jklustor" command line as a special clustering method using option "-c mmds" or "-c mmds:<SETCOUNT>". This algorithm is also available in the jklustor web demo at http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp

Alternatively, the cluster centroids identified by sphere exclusion clustering (use option "-c sphex" or "-c sphex:<DISSIMILARITYRADIUS>") can be treated as a diverse subset.

Detailed command line help is available with "jklustor -h".

More detailed examples and descriptions of these algorithms:

Examples - Online demo

  • enter http://www.chemaxon.com/shared/libMCS/default.sdf into the left input field (this can be done by clicking on the last “example” link)
  • click “Add!”
  • in the “launch clustering” box select “Diverse subset” and click “Launch”
  • click “View Clustering results”
  • click on the floppy icon in the line “Total cluster count (including singletons)” to save the diverse subset in SMILES format

Examples - Command line

Maximum of minimal dissimilarity selection (MMDS)

 This selection algorithm yields a diverse subset of a specified size (k). The selection algorithm:

  • A central node is identified as the first element of the selection
    • Select the node which has the smallest RMSD dissimilarity from the other nodes (the sum of the squares of its dissimilarity scores from the other nodes is the smallest)
  • Select k-1 further diverse nodes in k-1 selection steps. In each step:
    • For each unselected node, find the most similar previously selected node (its nearest selected node), i.e. the one with the smallest dissimilarity score
    • Select the node whose nearest selected node has the highest dissimilarity score


Note that, apart from the first central node, this algorithm typically tends to select the outliers of the input set.
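
The selection steps above can be sketched in a few lines of plain Python. This is only an illustration of the description, not the jklustor implementation: the function name mmds_select, the use of a full pairwise dissimilarity matrix and the tie-breaking (first candidate wins) are assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def mmds_select(items: Sequence[T], k: int,
                dissim: Callable[[T, T], float]) -> List[int]:
    """Return the indices of k items chosen by max-min dissimilarity selection.

    Hypothetical helper written for illustration; not part of any ChemAxon API.
    """
    n = len(items)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of items")

    # Pairwise dissimilarities (fine for small sets; O(n^2) memory).
    d = [[dissim(items[i], items[j]) for j in range(n)] for i in range(n)]

    # Step 1: the central node has the smallest sum of squared dissimilarities
    # to all other nodes (equivalent to the smallest RMSD dissimilarity).
    first = min(range(n), key=lambda i: sum(x * x for x in d[i]))
    selected = [first]

    # nearest[i] = dissimilarity between node i and its nearest selected node.
    nearest = list(d[first])

    # Step 2: repeatedly pick the node that is farthest from its nearest
    # selected node, then update the nearest-selected distances.
    while len(selected) < k:
        chosen = set(selected)
        nxt = max((i for i in range(n) if i not in chosen),
                  key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(nearest[i], d[nxt][i]) for i in range(n)]

    return selected
```

For example, with the numbers 0, 1, 2, 11 and 5 as "structures" and the absolute difference as dissimilarity, the call mmds_select([0.0, 1.0, 2.0, 11.0, 5.0], 3, lambda a, b: abs(a - b)) first picks the central value 5 and then the two extremes 11 and 0, which illustrates the tendency to select outliers.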

Clustering using MMDS


A clustering algorithm (accessible with “-c mmds:<k>” on the jklustor command line) is defined which uses the MMDS algorithm described above:

  • Select k diverse nodes using the MMDS algorithm
  • Consider these nodes cluster representatives
  • Assign every input node (including the selected ones) to the cluster whose representative has the smallest dissimilarity value (assign to the nearest selected node)
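
The assignment step can be illustrated as follows, assuming the k representatives have already been chosen and are given as indices into the input list. The function name assign_to_representatives and the dictionary output are assumptions made for this sketch, not jklustor behaviour.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_representatives(items: Sequence[T],
                              representatives: Sequence[int],
                              dissim: Callable[[T, T], float]) -> Dict[int, List[int]]:
    """Map each representative index to the indices of the items nearest to it.

    Hypothetical helper for illustration; ties go to the representative
    that appears first in `representatives`.
    """
    clusters: Dict[int, List[int]] = {r: [] for r in representatives}
    for i, item in enumerate(items):
        # The representative with the smallest dissimilarity to this item.
        nearest = min(representatives, key=lambda r: dissim(item, items[r]))
        clusters[nearest].append(i)
    return clusters
```

Because the dissimilarity of a structure to itself is zero, each representative normally ends up in its own cluster.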

Using sphere exclusion clustering

Cluster centroids identified by the sphere exclusion clustering algorithm can be considered a diverse subset. The clustering algorithm currently implemented:

  • The first structure read is selected as a cluster centroid
  • For every input structure the least dissimilar (“nearest”) previously selected centroid is identified
    • If the dissimilarity to the nearest centroid is above a given dissimilarity radius then the structure is selected as a new centroid
  • When all structures have been read and the individual input structures are used (either by a “wrmols” output action or by giving option “-l”), every input structure (including the selected centroids) is assigned to the least dissimilar (nearest) centroid


Note that any two centroids have a higher dissimilarity than the given radius. The proper dissimilarity radius depends on the input set and the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
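
The two passes described above can be sketched in plain Python as follows. This is an illustration only and not the jklustor code; the function name sphere_exclusion and the returned data layout are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sphere_exclusion(items: Sequence[T], radius: float,
                     dissim: Callable[[T, T], float]) -> Tuple[List[int], List[int]]:
    """Return (centroid indices, centroid index assigned to each item).

    Hypothetical helper for illustration; the result depends on input order.
    """
    centroids: List[int] = []

    # First pass: a structure becomes a new centroid when it is farther
    # than `radius` from every centroid selected so far. The first
    # structure always becomes a centroid.
    for i, item in enumerate(items):
        if not centroids or min(dissim(item, items[c]) for c in centroids) > radius:
            centroids.append(i)

    # Second pass: assign every structure (centroids included) to its
    # least dissimilar centroid.
    assignment = [min(centroids, key=lambda c: dissim(item, items[c]))
                  for item in items]
    return centroids, assignment
```

Running the sketch with several radius values and inspecting the number of centroids returned is one simple way to carry out the refinement of the radius mentioned above.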


Regards,

Gabor

Daniel

Joined: 26 Oct 2011
Posts: 1

Posted: Wed Oct 26, 2011 5:14 pm

Hello,

I have a simple question. If you are clustering compounds from a SMILES or SDF file using the MMDS algorithm,  and you don't specify the descriptors or the metrics, which are used?

It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:

http://www.chemaxon.com/jchem/doc/user/fingerprint.html

but I haven't found the default scoring metric. Could someone please inform me?

 

Thank you!

Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Thu Nov 03, 2011 2:07 am

 

Dear Daniel,

 

"It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor"

Yes, this fingerprint is used by default; this is mentioned in the command line help (use "jklustor -h"):

-d <desc>[:<metrics>]
            Specify molecular descriptor and optionally the metric to use.
            Please note that certain clustering  algorithms can be incompatible
            with  certain descriptors or  metrics. It is always  safe to not to
            specify descriptors or metrics.
 Descriptors:
   cfp        ChemAxon's chemical fingerprint (default)
              Supported dissimilarity metrics:
              tanimoto, manhattan, euclid, euclidsqr, commonbits
   ecfp       ChemAxon's ECFP fingerprint implementation
              tanimoto, manhattan, euclid, euclidsqr, commonbits


Information on the default metric is missing from the help; it is currently tanimoto.
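
For readers unfamiliar with the metric, here is a minimal sketch of the usual Tanimoto dissimilarity between two binary fingerprints. Representing a fingerprint as a Python set of on-bit positions, the function name, and the convention for two empty fingerprints are assumptions made for the illustration; jklustor's internal representation may differ.

```python
from typing import Set


def tanimoto_dissimilarity(a: Set[int], b: Set[int]) -> float:
    """1 - |A and B| / |A or B| for fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(a & b) / union
```

Two fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a Tanimoto similarity of 0.5 and therefore a dissimilarity of 0.5.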

Further information on the descriptor used is available in the jklustor command line standalone server mode: use the additional parameter -s <PORT> and connect with a browser to http://localhost:<PORT> (for example, use 89 as <PORT>):

$ ./jklustor C -c mmds -s 89
Launch listening server on port 89


Connecting to http://localhost:89/show/overview reveals:

"Molecular descriptor used
Chemical fingerprint; metric: default tanimoto | CFP length=1024; CFP bitCount=2; CFP bondCount=7"

Regards,

Gabor

David

Joined: 06 Sep 2013
Posts: 2

Posted: Tue Jun 03, 2014 12:55 pm    Post subject: Diversity selection from sorted table

Hello,

 

I have a related problem. I have a table of compounds sorted by a score value (let's call it desirability score). I want to select a diverse subset of them in the following manner:

1. Select the one with the best desirability score. (1st entry in the sorted table)

2. Go on to the next entry, but select it only if it is less similar to ALL of the previously selected compounds than a defined cutoff (e.g. Tanimoto similarity must be less than 0.7).

3. Repeat 2nd step until I get an output of a desired number of compounds (e.g. 100).

(I would use ECFP fingerprints for similarity calculations.)

Is there a workaround for this?

 

Thanks.

David

Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Fri Jul 18, 2014 5:35 pm

Dear David,

Currently no such workaround is available; however, an implementation could be constructed using the already available APIs.

What front end are you using (API, command line, etc.)?
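
As a rough illustration of the procedure described in the question, the following plain Python sketch walks a list that is already sorted by desirability and keeps an entry only if its Tanimoto similarity to every previously kept entry is below the cutoff. It does not use any ChemAxon API: the fingerprints are assumed to be supplied as sets of on-bit positions (for example exported ECFP bits), and the function names are hypothetical.

```python
from typing import List, Sequence, Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """|A and B| / |A or B| for fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    # Convention for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def select_ranked_diverse(ranked_fps: Sequence[Set[int]],
                          max_similarity: float,
                          count: int) -> List[int]:
    """Return indices of up to `count` entries, in rank order, such that each
    is less similar than `max_similarity` to all previously selected entries.
    """
    selected: List[int] = []
    for i, fp in enumerate(ranked_fps):
        if len(selected) >= count:
            break
        # The best-ranked entry is always kept because `selected` is empty.
        if all(tanimoto_similarity(fp, ranked_fps[j]) < max_similarity
               for j in selected):
            selected.append(i)
    return selected
```

The sketch may return fewer than the requested number of entries if the cutoff excludes too many compounds; relaxing the cutoff and rerunning is the simplest remedy.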

What is the typical input set size?

Regards

Gabor

David

Joined: 06 Sep 2013
Posts: 2

Posted: Tue Aug 12, 2014 2:14 pm

Dear Gabor,

 

Thanks for your reply. I'm mostly using the Instant JChem client, or the command line if necessary. The typical input size for this problem is a few hundred molecules. Thanks in advance for your help.

 

All the best,

 

David
