Jarp clustering & false singletons
When clustering a large database with Jarp, some compounds don't cluster together and end up as singletons (see examples). The dissimilarity threshold is set to 0.15, and 2048-7-5 fingerprints were used for the whole dataset. Is there any option in Jarp to solve this problem?
Jarvis-Patrick clustering has a tendency to create a large number of singletons. This is one reason why a modification of it (variable-length lists of nearest neighbors) has been implemented in JKlustor. However, even this method does not guarantee that a structure that lies within the cut-off of a cluster centroid (or any other data point) is a member of that cluster.
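To make the point about the cut-off concrete, here is a minimal Python sketch of a Jarvis-Patrick variant with variable-length neighbor lists. It is only an illustration of the general idea, not JKlustor's implementation: the function names, the `min_common_ratio` parameter, the decision to count each item in its own neighbor list, and the toy fingerprints (sets of on-bit positions) are all assumptions made for this example.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def variable_length_jp(fps, threshold, min_common_ratio):
    """Cluster fingerprints; returns a sorted list of sorted index lists.

    Assumed rule (hypothetical, for illustration): two items are merged if
    they are within `threshold` of each other AND the share of common
    neighbors, relative to the shorter neighbor list, is at least
    `min_common_ratio`. Each item counts as a member of its own list.
    """
    n = len(fps)
    # Variable-length neighbor lists: every item within the threshold.
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto_dissimilarity(fps[i], fps[j]) <= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Union-find to merge pairs that satisfy both conditions.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if j not in neighbors[i]:
            continue  # not within the threshold, so never merged directly
        ni = neighbors[i] | {i}
        nj = neighbors[j] | {j}
        ratio = len(ni & nj) / min(len(ni), len(nj))
        if ratio >= min_common_ratio:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Toy data: two fingerprints sharing 9 of 11 bits (dissimilarity about 0.18).
a = set(range(10))
b = set(range(1, 11))
print(variable_length_jp([a, b], threshold=0.15, min_common_ratio=0.5))
print(variable_length_jp([a, b], threshold=0.20, min_common_ratio=0.5))
```

With these toy fingerprints the pair stays as two singletons at a threshold of 0.15 and merges at 0.20, whatever the ratio is set to, because the ratio test is only reached for pairs that already pass the threshold.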
You can read more about the clustering criteria in http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro
Regarding your particular case: did you try adjusting the -c parameter? It can influence whether singletons are merged into larger clusters.
With the attached example, at a dissimilarity threshold of 0.15, these structures don't cluster even when the -c parameter is changed from 0.2 to 0.8. What do you think about this?
Also, do you have any implementation of algorithms such as Exclusion Spheres in combination with Jarp? Butina has reported this approach to deal efficiently with false singletons (see: Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets, J. Chem. Inf. Comput. Sci. 1999, 39, 747-750).
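For readers unfamiliar with the approach, this is roughly what I mean by exclusion spheres, written as a small Python sketch. It is my own simplified reading of the idea and not code from the paper or from JChem; the function names and the toy fingerprints (sets of on-bit positions) are made up for illustration.

```python
def tanimoto_dissimilarity(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def sphere_exclusion(fps, threshold):
    """Pick centroids by decreasing neighbor count and exclude their spheres.

    Returns (clusters, neighbors); each cluster lists its centroid first.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto_dissimilarity(fps[i], fps[j]) <= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Candidates with the most neighbors come first; ties broken by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # The cluster is the centroid plus its still-unassigned neighbors.
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters, neighbors


def false_singletons(clusters, neighbors):
    """Singletons that do have neighbors, all of them taken by other clusters."""
    return [c[0] for c in clusters if len(c) == 1 and neighbors[c[0]]]


# Toy chain: each fingerprint is within the threshold of its direct
# neighbors in the list only (sliding windows of 20 bits, shifted by 1).
fps = [set(range(k, k + 20)) for k in range(4)]
clusters, neighbors = sphere_exclusion(fps, threshold=0.15)
print(clusters)
print(false_singletons(clusters, neighbors))
```

In the toy chain the last compound has a neighbor within the threshold, but that neighbor is absorbed by the first cluster, so the compound ends up alone. Flagging such cases separately makes it possible to reassign them afterwards.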
Thanks in advance.
All the best,
I checked the similarity of the structures you attached to your message. Their Tanimoto distance is 0.18, which is above your threshold of 0.15. I did this:
screenmd jarp_molecules.smiles jarp_molecules.smiles -k CF -c cfp.xml -M Tanimoto
and I got this:
I also tried compr:
compr -i d1 d2 -f 2048 -g -t 0.2
that resulted in:
id minD nneib simcnt
1 0.1752 1 1
Does this help you at all?
Thank you for your help. I will try different similarity thresholds and see the effect on clustering the rest of the database which actually contains more than 8300 compounds.
Running compr does not take long, and you can assess the distribution of similarity scores prior to clustering.
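If you prefer to look at the distribution outside the command-line tools, the same kind of check can be sketched in a few lines of Python. This is only an illustration under my own assumptions: fingerprints are represented as sets of on-bit positions, the function names are made up, and the quadratic loop is fine for a quick look but is not tuned for speed.

```python
def tanimoto_dissimilarity(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def nearest_neighbor_dissimilarities(fps):
    """For each fingerprint, the smallest dissimilarity to any other one.

    Needs at least two fingerprints.
    """
    return [
        min(tanimoto_dissimilarity(a, b) for j, b in enumerate(fps) if j != i)
        for i, a in enumerate(fps)
    ]


def fraction_within(min_d, threshold):
    """Share of compounds that have at least one neighbor within the threshold."""
    return sum(d <= threshold for d in min_d) / len(min_d)


# Toy data: two fingerprints sharing 9 of 11 bits.
fps = [set(range(10)), set(range(1, 11))]
min_d = nearest_neighbor_dissimilarities(fps)
for t in (0.10, 0.15, 0.20, 0.25):
    print(t, fraction_within(min_d, t))
```

A compound whose nearest neighbor lies above the threshold can never be clustered at that threshold, so the fraction printed for each candidate value gives an upper bound on how many compounds can leave the singleton pool.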
Best wishes and good luck,