Jarp clustering & false singletons

User 204415f4a4

09-03-2006 14:47:08

Hello,





When clustering a large database with Jarp, some compounds don't cluster together and end up as singletons (see examples). The dissimilarity threshold is set to 0.15, and 2048-7-5 fingerprints were used for the whole dataset. Is there any option in Jarp to solve this problem?


Thank you





Best regards,





IsI

ChemAxon efa1591b5a

10-03-2006 14:22:04

Hi IsI,





Jarvis-Patrick clustering has a tendency to create a large number of singletons. This is one reason why a modification of it (variable-length lists of nearest neighbors) has been implemented in JKlustor. However, even this method does not guarantee that a structure whose dissimilarity to a cluster centroid (or to any other data point) is below the cut-off becomes a member of that cluster.


You can read more about the clustering criteria in http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro.
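To illustrate why singletons appear so easily, here is a minimal sketch of textbook Jarvis-Patrick clustering in Python. It is not the JKlustor implementation and does not include the variable-length neighbor list modification. The function names, the toy fingerprints (sets of on-bits) and the parameters `k` and `min_shared` are assumptions made for the example.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Tanimoto dissimilarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def jarvis_patrick(fps, k, min_shared):
    """Textbook Jarvis-Patrick: i and j are merged when each appears in the
    other's k-nearest-neighbour list and the two lists share at least
    min_shared members. Returns a sorted list of clusters (lists of indices)."""
    n = len(fps)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: tanimoto_distance(fps[i], fps[j]))
        neighbours.append(set(ranked[:k]))

    # Union-find structure to merge qualifying pairs into clusters.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = j in neighbours[i] and i in neighbours[j]
        if mutual and len(neighbours[i] & neighbours[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up fingerprints: three close analogues and one more distant compound.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 7, 8}]
print(jarvis_patrick(fps, k=2, min_shared=1))
print(jarvis_patrick(fps, k=2, min_shared=2))
```

In this toy set, compound 3 has nearest neighbours of its own but appears in nobody else's list, so it stays a singleton. Raising `min_shared` to 2 breaks up the remaining cluster as well.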





Regarding your particular case: did you try adjusting the -c parameter? That could help merge singletons into larger clusters.





Regards,


Miklos

User 204415f4a4

16-03-2006 10:58:26

Hi Miklos,





With the attached example, at a dissimilarity threshold of 0.15, these structures don't cluster even when the C parameter is varied from 0.2 to 0.8. What do you think about this?


Also, do you have any implementation of algorithms such as Exclusion Spheres in combination with Jarp? Butina has reported this approach to deal efficiently with false singletons (see: Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets, J. Chem. Inf. Comput. Sci. 1999, 39, 747-750).
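For readers unfamiliar with the approach, here is a rough Python sketch of sphere-exclusion clustering in the spirit of Butina's paper. It is a simplified reading of the published idea, not a Jarp feature. The function names and the toy fingerprints (sets of on-bits) are assumptions made for the example.

```python
def tanimoto_distance(a, b):
    """Tanimoto dissimilarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def sphere_exclusion(fps, threshold):
    """Pick the compound with the most neighbours within the threshold as a
    cluster centre, assign its still-unassigned neighbours to it, exclude
    them all, and repeat with the next best unassigned compound."""
    n = len(fps)
    neighbours = [
        [j for j in range(n)
         if j != i and tanimoto_distance(fps[i], fps[j]) <= threshold]
        for i in range(n)
    ]
    # Candidate centres, most populated neighbourhood first.
    order = sorted(range(n), key=lambda i: -len(neighbours[i]))

    assigned = set()
    clusters = []
    for centre in order:
        if centre in assigned:
            continue
        members = [centre] + [j for j in neighbours[centre] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Made-up fingerprints: three close analogues and one unrelated compound.
fps = [
    {1, 2, 3, 4, 5, 6},
    {1, 2, 3, 4, 5, 7},
    {1, 2, 3, 4, 5, 8},
    {10, 11, 12},
]
print(sphere_exclusion(fps, threshold=0.3))
```

A compound that ends up alone here either has no neighbour within the threshold (a true singleton) or lost all its neighbours to earlier centres (a false singleton), which is the distinction the paper addresses.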


Thanks in advance.





All the best,


IsI

ChemAxon efa1591b5a

17-03-2006 13:47:15

Hi IsI,





I checked the similarity of the structures you attached to your message. I found that their Tanimoto distance is 0.18, which is above your threshold of 0.15. I did this:





screenmd jarp_molecules.smiles jarp_molecules.smiles -k CF -c cfp.xml -M Tanimoto





and I got this:


q1_CF_Tan  q2_CF_Tan
0.00       0.18
0.18       0.00
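The same kind of check can be sketched in plain Python. The bit sets below are made up so that the distance comes out near 0.18; they are not the fingerprints of the attached structures, and `within_threshold` is a hypothetical helper name.

```python
def tanimoto_distance(a, b):
    """Tanimoto dissimilarity: 1 - |A and B| / |A or B| for sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def within_threshold(a, b, threshold):
    """True when the two fingerprints are close enough to be neighbours."""
    return tanimoto_distance(a, b) <= threshold


# Two toy fingerprints with 9 bits in common and 11 bits set overall.
fp1 = set(range(0, 10))   # bits 0..9
fp2 = set(range(1, 11))   # bits 1..10

d = tanimoto_distance(fp1, fp2)   # 1 - 9/11
print(f"distance = {d:.2f}")
print("neighbours at 0.15:", within_threshold(fp1, fp2, 0.15))
print("neighbours at 0.20:", within_threshold(fp1, fp2, 0.20))
```

With a pair like this, the two compounds are not neighbours at 0.15 but become neighbours once the threshold is relaxed to 0.20.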





I also tried compr:





compr -i d1 d2 -f 2048 -g -t 0.2





that resulted in:


id  minD    nneib  simcnt
1   0.1752  1      1





Does this help you at all?





Regards,


Miklos

User 204415f4a4

17-03-2006 14:48:15

Dear Miklos,





Thank you for your help. I will try different similarity thresholds and see the effect on clustering the rest of the database which actually contains more than 8300 compounds.





Regards,


IsI

ChemAxon efa1591b5a

17-03-2006 14:51:00

IsI,


Running compr does not take long, and it lets you assess the distribution of similarity scores prior to clustering.
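As a rough illustration of that kind of pre-check, the Python sketch below computes each compound's nearest-neighbour distance and the fraction of compounds that have no neighbour within a given threshold. It does not reproduce compr or its output format. The function names and toy fingerprints are assumptions, and the all-pairs loop is only practical for small sets.

```python
def tanimoto_distance(a, b):
    """Tanimoto dissimilarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def nearest_neighbour_distances(fps):
    """Smallest distance from each compound to any other compound."""
    n = len(fps)
    return [
        min(tanimoto_distance(fps[i], fps[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def fraction_isolated(fps, threshold):
    """Share of compounds with no neighbour within the threshold; these
    cannot join any cluster at that threshold."""
    dists = nearest_neighbour_distances(fps)
    return sum(d > threshold for d in dists) / len(dists)


# Made-up fingerprints for three compounds.
fps = [
    set(range(0, 10)),          # bits 0..9
    set(range(1, 11)),          # bits 1..10
    set(range(0, 10)) | {11},   # bits 0..9 plus bit 11
]
for t in (0.15, 0.20):
    print(t, fraction_isolated(fps, t))
```

Scanning a few thresholds this way gives a quick feel for how many singletons are unavoidable before any clustering is run.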


Best wishes and good luck,


Miklos