Jarp clustering & false singletons - ChemAxon Forum Archive

User 204415f4a4

09-03-2006 14:47:08

Hello,

When clustering a large database with Jarp, some compounds don't cluster together and end up as singletons (see examples). The dissimilarity threshold is set to 0.15. 2048-7-5 fingerprints were used for the whole dataset. Is there any option in Jarp to solve this problem?

Thank you

Best regards,

IsI

ChemAxon efa1591b5a

10-03-2006 14:22:04

Hi IsI,

Jarvis-Patrick clustering has a tendency to create a large number of singletons. This is one reason why a modification of it (variable-length lists of nearest neighbors) has been implemented in JKlustor. However, even this method does not guarantee that a structure whose dissimilarity to a cluster centroid (or to any other data point) is below the cut-off becomes a member of that cluster.
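
To illustrate the criterion described above, here is a minimal Python sketch of Jarvis-Patrick clustering with variable-length neighbor lists. It is not JKlustor's implementation. The function names, the representation of fingerprints as sets of on-bit positions, the inclusion of each structure in its own neighbor list, and the definition of the common-neighbor ratio (shared neighbors divided by the shorter list) are all assumptions made for the example.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """Tanimoto dissimilarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def jarvis_patrick(fps, threshold, min_common_ratio):
    """Cluster fingerprints; returns a sorted list of sorted index lists.

    Assumption: two structures are merged when they are within `threshold`
    of each other and share at least `min_common_ratio` of the shorter
    neighbor list.
    """
    n = len(fps)
    # Variable-length neighbor lists: everything within the threshold,
    # including the structure itself (an assumption of this sketch).
    neighbors = [{i} for i in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto_dissimilarity(fps[i], fps[j]) <= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Union-find to merge pairs that satisfy the common-neighbor criterion.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            if j <= i:
                continue
            common = len(neighbors[i] & neighbors[j])
            shorter = min(len(neighbors[i]), len(neighbors[j]))
            if common / shorter >= min_common_ratio:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up toy fingerprints: structures 0 and 1 are 0.2 apart, 2 is unrelated.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11, 12}]
print(jarvis_patrick(toy_fps, threshold=0.15, min_common_ratio=0.5))
print(jarvis_patrick(toy_fps, threshold=0.25, min_common_ratio=0.5))
```

In this toy case the pair at dissimilarity 0.2 stays as two singletons at a threshold of 0.15, whatever the ratio, and merges only once the threshold is raised to 0.25.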

You can read more about the clustering criteria in http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro.

Regarding your particular case: did you try adjusting the -c parameter? It could help influence the merging of singletons into larger clusters.

Regards,

Miklos

User 204415f4a4

16-03-2006 10:58:26

Hi Miklos,

With the attached example, at a dissimilarity threshold of 0.15, these structures don't cluster even when the C parameter is varied from 0.2 to 0.8. What do you think about this?

Also, do you have any implementation of algorithms such as Exclusion Spheres in combination with Jarp? Butina has reported this approach to deal efficiently with false singletons (see: Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets, J. Chem. Inf. Comput. Sci. 1999, 39, 747-750).
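
For readers unfamiliar with the approach, the following is a minimal Python sketch of sphere-exclusion clustering in the style of the cited paper. It is a simplified illustration, not a JKlustor or Daylight implementation. The function names and the set-of-on-bits fingerprint representation are assumptions, and "false singleton" is taken here to mean a singleton that did have neighbors within the threshold, all of which were claimed by earlier clusters.

```python
def tanimoto_dissimilarity(a, b):
    """Tanimoto dissimilarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def sphere_exclusion(fps, threshold):
    """Return (clusters, neighbors); each cluster lists its centroid first."""
    n = len(fps)
    neighbors = [
        [j for j in range(n)
         if j != i and tanimoto_dissimilarity(fps[i], fps[j]) <= threshold]
        for i in range(n)
    ]
    # Candidate centroids, most neighbors first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # The cluster is the centroid plus all still-unassigned neighbors;
        # these are then excluded from every later cluster.
        members = [centroid] + [j for j in neighbors[centroid] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(members)
    return clusters, neighbors


def false_singletons(clusters, neighbors):
    """Singletons whose neighbors were all taken by earlier clusters."""
    return [c[0] for c in clusters if len(c) == 1 and neighbors[c[0]]]


# Made-up toy fingerprints: a chain 0-1-2-3 plus one unrelated structure.
toy_fps = [
    set(range(1, 9)),    # 0
    set(range(1, 10)),   # 1
    set(range(1, 11)),   # 2
    set(range(1, 13)),   # 3: within the threshold of 2 only
    {20, 21},            # 4: no neighbors at all
]
clusters, neighbors = sphere_exclusion(toy_fps, threshold=0.18)
print(clusters)
print(false_singletons(clusters, neighbors))
```

Here structure 3 has a neighbor (structure 2), but that neighbor is absorbed by the cluster centered on structure 1, so 3 is reported as a false singleton, while structure 4 is a true one.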

Thanks in advance.

All the best,

IsI

ChemAxon efa1591b5a

17-03-2006 13:47:15

Hi IsI,

I checked the similarity of the structures you attached to your message. I found that their Tanimoto distance is 0.18, which is above your threshold of 0.15. I did this:

screenmd jarp_molecules.smiles jarp_molecules.smiles -k CF -c cfp.xml -M Tanimoto

and I got this:

q1_CF_Tan  q2_CF_Tan
0.00       0.18
0.18       0.00
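
As a rough illustration of what such a dissimilarity matrix contains, the Python sketch below computes pairwise Tanimoto dissimilarities, 1 - c / (a + b - c), where a and b are the numbers of on bits in each fingerprint and c is the number they share. The fingerprints are made-up sets of bit positions chosen to give a value of 0.18; they are not the fingerprints of the attached structures, and the function names are assumptions for the example.

```python
def tanimoto_dissimilarity(a, b):
    """1 - |a & b| / |a | b| for fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def dissimilarity_matrix(fps):
    """Full symmetric matrix of pairwise Tanimoto dissimilarities."""
    return [[tanimoto_dissimilarity(x, y) for y in fps] for x in fps]


# Made-up fingerprints: 41 shared bits out of 50 set in either one.
fp1 = set(range(0, 45))
fp2 = set(range(4, 50))
matrix = dissimilarity_matrix([fp1, fp2])

threshold = 0.15
for row in matrix:
    print("  ".join(f"{d:.2f}" for d in row))
print("within threshold:", matrix[0][1] <= threshold)
```

A pair at 0.18 fails a 0.15 cut-off, so the two structures cannot appear in each other's neighbor lists regardless of any other clustering parameter.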

I also tried compr:

compr -i d1 d2 -f 2048 -g -t 0.2

that resulted in:

id  minD    nneib  simcnt
1   0.1752  1      1

Does this help you at all?

Regards,

Miklos

User 204415f4a4

17-03-2006 14:48:15

Dear Miklos,

Thank you for your help. I will try different similarity thresholds and see the effect on clustering the rest of the database which actually contains more than 8300 compounds.

Regards,

IsI

ChemAxon efa1591b5a

17-03-2006 14:51:00

IsI,

running compr does not take long, and it lets you assess the distribution of similarity scores prior to clustering.
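
The idea of checking the distribution before choosing a threshold can be sketched in Python as follows. This is not how compr works internally. It is a small stand-alone illustration with made-up fingerprints (sets of on-bit positions) and hypothetical function names: it computes each structure's nearest-neighbor dissimilarity and reports the fraction of structures that have at least one neighbor at several candidate thresholds.

```python
def tanimoto_dissimilarity(a, b):
    """Tanimoto dissimilarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def nearest_neighbor_dissimilarities(fps):
    """Smallest dissimilarity from each fingerprint to any other (needs >= 2 entries)."""
    return [
        min(tanimoto_dissimilarity(fp, other)
            for j, other in enumerate(fps) if j != i)
        for i, fp in enumerate(fps)
    ]


def fraction_with_neighbor(min_d, thresholds):
    """For each threshold, the fraction of structures with a neighbor within it."""
    return {t: sum(d <= t for d in min_d) / len(min_d) for t in thresholds}


# Made-up toy fingerprints.
toy_fps = [
    set(range(1, 9)),
    set(range(1, 10)),
    set(range(1, 11)),
    set(range(1, 13)),
    {20, 21},
]
min_d = nearest_neighbor_dissimilarities(toy_fps)
coverage = fraction_with_neighbor(min_d, [0.10, 0.15, 0.18, 0.25])
for t, frac in coverage.items():
    print(f"threshold {t:.2f}: {frac:.0%} of structures have a neighbor")
```

Structures with no neighbor at the chosen threshold are necessarily singletons, so a table like this gives a lower bound on the singleton count before any clustering is run.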

Best wishes and good luck,

Miklos