Test Case #2 - optimizing generatemd and jarp parameters

User f52820d97e

10-07-2006 17:23:52


Here is my latest analysis with a slightly bigger database (8146 structures) than before (see http://www.chemaxon.com/forum/ftopic1417.html).

This time I chose the 1024-7-3 parameters for the chemical hashed fingerprints and ran tests with various Jarvis-Patrick clustering parameters. I computed some statistics on the cluster/singleton ratio and did a population analysis of the different clusters. Apart from one big cluster resulting from combinatorial chemistry, I get clusters of various sizes, although there is still a fair number of "size 2" clusters (basically singletons?), which concerns me a bit... I guess now is the time to dig into them...

I am inclined to choose t=0.3 and c=0.6 to reproduce the default parameters from Daylight (sorry to mention the competition... which I want to give up!). Although the metrics are not the same, from reading the documentation I concluded that at least 0.6 is right. For t, I just played around; 0.3 looks reasonably good.

What do you think of my conclusions?



User f52820d97e

11-07-2006 08:03:22

I updated the file from yesterday's post - now there are legends... I hope it is less obscure!

ChemAxon efa1591b5a

11-07-2006 09:55:45

Hi Nicolas,

1. No problem with mentioning the sunny company! ;-) We highly appreciate their pioneering work in the field of cheminformatics.

2. Jarvis-Patrick has a tendency to produce a large number of singletons; this is its major disadvantage, mentioned by many authors. See for instance Brown and Martin, JCICS 36, 1996, and Willett, Similarity and Clustering in Chemical Information Systems, Research Studies Press: Letchworth, 1987.

If clustering is tight, the clusters separate actives to a higher degree, though at the expense of a large number of singletons. However, if clustering is less tight (that is, the parameters are less strict), the number of singletons decreases, but the clusters tend to grow larger, so the separation of actives becomes poor.

The variable length nearest neighbour list used in JKlustor's Jarp is one way to alleviate the singleton problem in Jarvis-Patrick clustering.

3. Regarding your choice of parameters, I reckon these are good candidates; however, I am not sure about your reasoning. My concerns are due to the variable-length nature of the nearest neighbour lists. Because of this, one cannot guarantee that 16 neighbours are found. Thus 'c' cannot be determined this way; instead, it is the percentage of the length of the shorter list of nearest neighbours. (Refer to the Jarp documentation for details: http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro.)

0.6 is probably a good first choice, but its interpretation is somewhat different: 60% of the length of the shorter NN list. One should not go below 50%, I reckon (except if the set of structures is really diverse, so that the chemical space is sparse around most compounds), and for tighter clusters even 80-90% can be tried. But this is just theory! Treat it with caution; clustering is still an experimental procedure (this is one reason why we are working on a more intuitive user interface for playing with the various parameters and seeing their effect instantly).
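
To make the "60% of the shorter NN list" reading concrete, here is a minimal sketch in Python. It only illustrates the interpretation described above and is not Jarp's actual implementation: the function name, the representation of the neighbour lists as plain lists of compound ids, the handling of empty lists, and the use of a plain >= comparison without rounding are all assumptions. Any other conditions of the full clustering criterion are left out.

```python
def enough_common_neighbours(nn_a, nn_b, c):
    """Hypothetical helper (not part of JKlustor).

    Returns True if the two nearest-neighbour lists share at least
    a fraction c of the length of the SHORTER list.
    """
    shorter = min(len(nn_a), len(nn_b))
    if shorter == 0:
        # Assumption: a compound with no neighbours never passes the check.
        return False
    common = len(set(nn_a) & set(nn_b))
    return common >= c * shorter


# Variable-length lists: 5 neighbours for A, 8 neighbours for B.
nn_a = [2, 3, 4, 5, 6]
nn_b = [3, 4, 5, 7, 8, 9, 10, 11]

# With c = 0.6 the shorter list has 5 entries, so 3 common neighbours are needed.
print(enough_common_neighbours(nn_a, nn_b, 0.6))
```

In this example the two lists share the neighbours 3, 4 and 5, which is exactly 60% of the shorter list, so the check passes. With a fixed-length list of 16 neighbours the same c would always demand the same count, which is why the fixed-list reasoning does not carry over directly.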

Thank you very much for discussing these interesting questions on our forum. Please share your thoughts and experimental results with the community (and us, too ;-) ).



User f52820d97e

11-07-2006 10:22:32

Hi Miklos, and thank you so much for the usual outstanding answer...

I think I made a terrible mistake in interpreting your doc about Rmin. I thought you meant Rmin is a ratio of the length of common neighbors to the shorter list! Which gave me the conclusion of case 1 in the attached picture. Now, reading your sentence twice and more carefully
Rmin is a ratio of the length of the shorter NN list
I understand you mean case 2! Is that correct? It indeed changes the interpretation a bit...

I will keep posting my results, of course. To be continued...


ChemAxon efa1591b5a

11-07-2006 18:54:52

Very nice illustration indeed!!! We should borrow it in our user documentation! ;-)

Indeed, I believe that the correct interpretation is the 2nd. I hope you will not find any discrepancies; if, however, you get suspicious results, I will check the source code.

I do not believe that you made a mistake when trying to interpret our rather obscure definition of Rmin; it was we who did not give a clear explanation. I think the definition would read better if, instead of saying "Rmin is a ratio of the length...", it were rephrased as "Rmin is a percentage of the length...."

What do you think?

Thank you for the excellent contribution to our discussion forum.



User f52820d97e

12-07-2006 07:47:13

Hi Miklos,

Please feel free to use it! The source is an OpenOffice document (attached)...

You are right, I was confused by the terminology: for me, ratio meant that something must be divided by some other thing... so percentage would be a lot less confusing!

And thank you for being so available...