User 204415f4a4
09-03-2006 14:47:08
Hello,
When clustering a large database with Jarp, some compounds that I would expect to cluster together end up as singletons (see examples). The dissimilarity threshold is set to 0.15, and 2048-7-5 fingerprints were used for the whole dataset. Is there any option in Jarp to solve this problem?
Thank you
Best regards,
IsI
ChemAxon efa1591b5a
10-03-2006 14:22:04
Hi IsI,
Jarvis-Patrick clustering has a tendency to create a large number of singletons. This is one reason why a modification of it (variable-length lists of nearest neighbors) has been implemented in JKlustor. However, even this method does not guarantee that a structure whose dissimilarity to a cluster centroid (or to any other data point) is below the cut-off becomes a member of that cluster.
You can read more about the clustering criteria in
http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro.
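To illustrate why singletons appear, here is a minimal sketch of the classical Jarvis-Patrick rule with fixed-length neighbor lists: two items are merged only if each is in the other's list of k nearest neighbors and the two lists share at least k_min members. This is not the JKlustor implementation (which uses variable-length lists), and the names jarvis_patrick, k and k_min are chosen here for illustration only.

```python
from itertools import combinations


def jarvis_patrick(dist, k, k_min):
    """Classical Jarvis-Patrick clustering on a square distance matrix.

    dist  : list of lists, dist[i][j] is the dissimilarity of items i and j
    k     : length of each nearest-neighbor list (illustrative name)
    k_min : minimum number of shared neighbors needed to merge (illustrative name)
    """
    n = len(dist)

    # Build the k-nearest-neighbor list of every item.
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(set(others[:k]))

    # Union-find structure that records which items have been merged.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = i in neighbors[j] and j in neighbors[i]
        shared = len(neighbors[i] & neighbors[j])
        if mutual and shared >= k_min:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

An item that appears in nobody else's neighbor list can never satisfy the mutual-neighbor condition, so it stays a singleton no matter how close it is in absolute terms.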
Regarding your particular case: did you try fiddling with the -c parameter? It can influence whether singletons are merged into larger clusters.
Regards,
Miklos
User 204415f4a4
16-03-2006 10:58:26
Hi Miklos,
With the attached example and a dissimilarity threshold of 0.15, these structures don't cluster together even when the -c parameter is varied from 0.2 to 0.8. What do you think about this?
Also, do you have an implementation of algorithms such as Exclusion Spheres that can be used in combination with Jarp? Butina has reported that this approach deals efficiently with false singletons (see: Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets, J. Chem. Inf. Comput. Sci. 1999, 39, 747-750).
Thanks in advance.
All the best,
IsI
ChemAxon efa1591b5a
17-03-2006 13:47:15
Hi Isi,
I checked the similarity of the structures you attached to your message. Their Tanimoto distance is 0.18, that is, above your threshold of 0.15. I ran this:
screenmd jarp_molecules.smiles jarp_molecules.smiles -k CF -c cfp.xml -M Tanimoto
and I got this:
q1_CF_Tan q2_CF_Tan
0.00 0.18
0.18 0.00
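For reference, the comparison against the threshold can be sketched as follows, assuming each fingerprint is represented as the set of its on-bit positions. The bit sets below are made up so that the result comes out near 0.18; they are not the fingerprints of the attached structures, and the function names are illustrative.

```python
def tanimoto_dissimilarity(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two sets of on-bit positions."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints are treated as identical (a convention chosen here).
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union


def within_threshold(bits_a, bits_b, threshold):
    """True if the pair is close enough to count as neighbors at this threshold."""
    return tanimoto_dissimilarity(bits_a, bits_b) <= threshold


# Hypothetical fingerprints: 82 common bits out of 100 bits set in either one.
fp_a = set(range(0, 91))
fp_b = set(range(9, 100))

print(tanimoto_dissimilarity(fp_a, fp_b))   # about 0.18
print(within_threshold(fp_a, fp_b, 0.15))   # False
print(within_threshold(fp_a, fp_b, 0.20))   # True
```

A pair at 0.18 is therefore not a neighbor pair at a threshold of 0.15, but would be at 0.2.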
I also tried compr:
compr -i d1 d2 -f 2048 -g -t 0.2
which resulted in:
id minD nneib simcnt
1 0.1752 1 1
Does this help you at all?
Regards,
Miklos
User 204415f4a4
17-03-2006 14:48:15
Dear Miklos,
Thank you for your help. I will try different similarity thresholds and see the effect on clustering the rest of the database, which actually contains more than 8300 compounds.
Regards,
IsI
ChemAxon efa1591b5a
17-03-2006 14:51:00
IsI,
Running compr does not take long, and it lets you assess the distribution of similarity scores prior to clustering.
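One rough way to do that assessment is sketched below. It assumes the compr output is a whitespace-separated table with a header line containing a minD column, as in the output shown earlier in this thread; the function names and the use of "<=" for the threshold test are assumptions made for this sketch.

```python
def parse_min_distances(text):
    """Extract the minD column from a whitespace-separated table with a header line."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    column = header.index("minD")
    return [float(row[column]) for row in body]


def fraction_within(min_distances, threshold):
    """Fraction of structures whose nearest neighbor lies within the threshold."""
    if not min_distances:
        return 0.0
    return sum(d <= threshold for d in min_distances) / len(min_distances)
```

Structures whose minD is above the chosen threshold have no neighbor at all and are bound to be singletons, so comparing fraction_within at several candidate thresholds gives a quick idea of how many singletons to expect before running the clustering.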
Best wishes and good luck,
Miklos