User 2347372188
08-08-2011 23:04:56
Hello. During a past user group meeting, I believe I heard someone at ChemAxon say you were going to implement a method for selecting the N most diverse molecules from a large set of molecules. However, I haven't noticed anything in your documentation that suggests that this functionality has been implemented. If it has not been implemented, when do you plan to implement it?
Thanks.
-&
ChemAxon 8b644e6bf4
24-08-2011 01:58:29
Dear Steven,
Sorry for the late answer. A diverse selection algorithm is accessible from the "jklustor" command line as a special clustering method using option "-c mmds" or "-c mmds:<SETCOUNT>". This algorithm is also available in the jklustor web demo at http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp
Alternatively, the cluster centroids identified by sphere exclusion clustering (use option "-c sphex" or "-c sphex:<DISSIMILARITYRADIUS>") can be treated as a diverse subset.
Detailed command line help is available with "jklustor -h".
More detailed examples and description of these algorithms:
Examples - Online demo
- enter http://www.chemaxon.com/shared/libMCS/default.sdf into the left input field (this can be done by clicking on the last “example” link)
- click “Add!”
- in the “launch clustering” box select “Diverse subset” and click “Launch”
- click “View Clustering results”
- click on the floppy icon in the line “Total cluster count (including singletons)” to save the diverse subset in SMILES format
Examples - Command line
- jklustor -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and write them to the output
- jklustor -c sphex:0.8 http://www.chemaxon.com/shared/libMCS/default.sdf
Select diverse structures as sphere exclusion clustering (with dissimilarity radius = 0.8) centroids
- jklustor -l 88 -c mmds:10 http://www.chemaxon.com/shared/libMCS/default.sdf
Select 10 diverse structures using the MMDS algorithm (described below) and start listening on port 88 with a web user interface similar to the online example above. When the message “Launch listening server on port 88” appears, connect with a browser to http://localhost:88
Maximum of minimal dissimilarity selection (MMDS)
This selection algorithm yields a diverse subset of a specified size (k). The selection algorithm:
- A central node is identified as the first element of the selection
  - Select the node which has the smallest RMSD dissimilarity from the other nodes (the sum of the squares of dissimilarity scores from the other nodes is the smallest)
- Select k-1 further diverse nodes in k-1 selection steps
  - For each node not yet selected, find the most similar previously selected node (its nearest selected node), i.e. the one with the smallest dissimilarity score
  - Select the node whose nearest selected node has the highest dissimilarity score
Note that, apart from the first, central node, this algorithm typically tends to select the outliers of the input set.
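The selection steps above can be sketched in plain Python. This is only an illustration of the description given here, not the jklustor implementation: the function name `mmds_select`, the `dissim` callback, and the tie handling (first candidate wins) are assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def mmds_select(items: Sequence[T], k: int,
                dissim: Callable[[T, T], float]) -> List[int]:
    """Return the indices of k items chosen by max-min dissimilarity selection."""
    n = len(items)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of items")

    # First element: the node with the smallest sum of squared
    # dissimilarities to all other nodes.
    def sum_sq(i: int) -> float:
        return sum(dissim(items[i], items[j]) ** 2 for j in range(n) if j != i)

    centre = min(range(n), key=sum_sq)
    selected = [centre]

    # nearest[i] is the dissimilarity of node i to its nearest selected node.
    nearest = [dissim(items[i], items[centre]) for i in range(n)]

    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        # Pick the node whose nearest selected node is the farthest away.
        chosen = max(candidates, key=lambda i: nearest[i])
        selected.append(chosen)
        for i in range(n):
            nearest[i] = min(nearest[i], dissim(items[i], items[chosen]))
    return selected
```

For example, with numbers standing in for structures and absolute difference as the dissimilarity, `mmds_select([0.0, 1.0, 2.0, 10.0, 11.0], 3, lambda a, b: abs(a - b))` first picks the central value 2.0 and then the two outermost values.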
Clustering using MMDS
A clustering algorithm (accessible with “-c mmds:<k>” in the jklustor command line) is defined on top of the MMDS algorithm described above:
- Select k diverse nodes using the MMDS algorithm
- Consider these nodes cluster representatives
- Assign every input node (including the selected ones) to the cluster whose representative has the smallest dissimilarity value (assign to the nearest selected)
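The assignment step can be sketched as follows. This is a minimal illustration, assuming the representatives are already known as indices into the input; `assign_to_nearest` and `dissim` are hypothetical names, and ties going to the representative listed first is an assumption, not documented jklustor behaviour.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_nearest(items: Sequence[T], representatives: Sequence[int],
                      dissim: Callable[[T, T], float]) -> Dict[int, List[int]]:
    """Map each representative index to the indices of the items assigned to it."""
    clusters: Dict[int, List[int]] = {r: [] for r in representatives}
    for i, item in enumerate(items):
        # The nearest representative is the one with the smallest dissimilarity.
        nearest = min(representatives, key=lambda r: dissim(item, items[r]))
        clusters[nearest].append(i)
    return clusters
```

Each representative ends up in its own cluster because its dissimilarity to itself is zero.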
Using sphere exclusion clustering
Cluster centroids identified by the sphere exclusion clustering algorithm can be considered a diverse subset. The clustering algorithm currently implemented:
- The first structure read is selected as a cluster centroid
- For every input structure the least dissimilar (“nearest”) previously selected centroid is identified
- If the dissimilarity to the nearest centroid is above a given dissimilarity radius, the structure is selected as a new centroid
- When all structures have been read and the individual input structures are used (either by a “wrmols” output action or by giving option “-l”), every input structure (including the selected centroids) is assigned to the least dissimilar (nearest) centroid
Note that any two centroids have a higher dissimilarity than the given radius. The proper dissimilarity radius depends on the input set and the fingerprint method (CFP/ECFP) used; determining it requires iterative refinement.
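The procedure above can be sketched in plain Python. This is an illustration under stated assumptions rather than the jklustor code: `sphere_exclusion` and `dissim` are hypothetical names, and "above the radius" is taken as a strict comparison.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sphere_exclusion(items: Sequence[T], radius: float,
                     dissim: Callable[[T, T], float]) -> Tuple[List[int], List[int]]:
    """Return (centroid indices, nearest centroid index for every item)."""
    centroids: List[int] = []
    for i, item in enumerate(items):
        if not centroids:
            # The first structure read always becomes a centroid.
            centroids.append(i)
            continue
        nearest = min(dissim(item, items[c]) for c in centroids)
        if nearest > radius:
            # Farther than the radius from every centroid so far.
            centroids.append(i)

    # Second pass: assign every item to its least dissimilar centroid.
    assignment = [min(centroids, key=lambda c: dissim(item, items[c]))
                  for item in items]
    return centroids, assignment
```

Because the result depends on the input order, running the sketch on a reordered input can give a different set of centroids.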
Regards,
Gabor
User 30619d62ec
26-10-2011 16:14:03
Hello,
I have a simple question. If you are clustering compounds from a SMILES or SDF file using the MMDS algorithm, and you don't specify the descriptors or the metrics, which are used?
It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:
http://www.chemaxon.com/jchem/doc/user/fingerprint.html
but I haven't found the default scoring metric. Could someone please inform me?
Thank you!
ChemAxon 8b644e6bf4
03-11-2011 01:07:13
Dear Daniel,
"It is my understanding that the ChemAxon chemical fingerprint is used as the descriptor:"
Yes, this fingerprint is used by default; this is mentioned in the command line help (use "jklustor -h"):
-d <desc>[:<metrics>]
Specify molecular descriptor and optionally the metric to use.
Please note that certain clustering algorithms can be incompatible
with certain descriptors or metrics. It is always safe to not to
specify descriptors or metrics.
Descriptors:
cfp ChemAxon's chemical fingerprint (default)
Supported dissimilarity metrics:
tanimoto, manhattan, euclid, euclidsqr, commonbits
ecfp ChemAxon's ECFP fingerprint implementation
tanimoto, manhattan, euclid, euclidsqr, commonbits
The help text does not state the default metric; it is currently tanimoto.
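For reference, the textbook Tanimoto dissimilarity on binary fingerprints is one minus the ratio of shared on-bits to all on-bits. A minimal sketch, assuming fingerprints are represented as sets of on-bit indices; the function name is hypothetical and jklustor's internal computation may differ in details:

```python
from typing import AbstractSet


def tanimoto_dissimilarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """1 - |a & b| / |a | b| for two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / union
```

Two fingerprints sharing 2 of 4 distinct on-bits, such as {1, 2, 3} and {2, 3, 4}, get a dissimilarity of 0.5.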
Further information on the descriptor used is available in jklustor's command line standalone server mode. Use the additional parameter -s <PORT> and connect with a browser to http://localhost:<PORT> (for example, use 89 as <PORT>):
$ ./jklustor C -c mmds -s 89
Launch listening server on port 89
Connecting to http://localhost:89/show/overview reveals:
"Molecular descriptor used
Chemical fingerprint; metric: default tanimoto | CFP length=1024; CFP bitCount=2; CFP bondCount=7"
Regards,
Gabor
User bc75629bf0
03-06-2014 11:55:14
Hello,
I have a related problem. I have a table of compounds sorted by a score value (let's call it desirability score). I want to select a diverse subset of them in the following manner:
1. Select the one with the best desirability score. (1st entry in the sorted table)
2. Go on to the next entry, but select it only if it is less similar to ALL of the previously selected compounds than a defined cutoff (e.g. Tanimoto similarity must be less than 0.7).
3. Repeat 2nd step until I get an output of a desired number of compounds (e.g. 100).
(I would use ECFP fingerprints for similarity calculations.)
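The procedure in steps 1-3 can be sketched in plain Python. This is a minimal sketch and not a ChemAxon API: fingerprints are assumed to be available as sets of on-bit indices (ECFP generation is not shown), and `tanimoto_similarity` and `pick_diverse_by_rank` are hypothetical helper names.

```python
from typing import AbstractSet, List, Sequence


def tanimoto_similarity(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """|a & b| / |a | b| for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_diverse_by_rank(ranked: Sequence[AbstractSet[int]],
                         max_similarity: float, limit: int) -> List[int]:
    """Walk the list best score first; keep an entry only if its similarity
    to every entry kept so far is below max_similarity."""
    picked: List[int] = []
    for i, candidate in enumerate(ranked):
        if len(picked) >= limit:
            break
        if all(tanimoto_similarity(candidate, ranked[p]) < max_similarity
               for p in picked):
            picked.append(i)
    return picked
```

With a few hundred molecules the quadratic number of comparisons is negligible. If the list is exhausted before `limit` is reached, fewer entries are returned.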
Is there a workaround for this?
Thanks.
David
ChemAxon 8b644e6bf4
18-07-2014 16:35:14
Dear David,
No such workaround is currently available; however, an implementation could be constructed using the already available APIs.
What front end are you using (API, command line, etc.)?
What is the typical input set size?
Regards
Gabor
User bc75629bf0
12-08-2014 13:14:55
Dear Gabor,
Thanks for your reply. I'm mostly using the Instant JChem client, or the command line if necessary. The typical input size for this problem is a few hundred molecules. Thanks in advance for your help.
All the best,
David