Here is my latest analysis with a slightly bigger database (8146 structures) than before (see http://www.chemaxon.com/forum/ftopic1417.html).
This time I chose the 1024-7-3 parameters for the chemically hashed fingerprints, and ran tests with various parameters in the Jarvis-Patrick clustering. I computed some statistics on the ratio of clusters to singletons, and also did some population analysis on the different clusters. Apart from one big cluster resulting from combinatorial chemistry, I get clusters of various sizes, although there is still a fair number of "size 2" clusters (basically singletons?), which concerns me a bit... I guess now is the time to dig into them...
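For anyone who wants to repeat this kind of population analysis, here is a minimal sketch. It assumes the clustering output has already been reduced to one cluster label per structure; the function name and the toy labels are made up for illustration and are not part of any ChemAxon tool.

```python
from collections import Counter

def cluster_population(labels):
    """Summarise a clustering given one cluster label per structure (assumed input format)."""
    sizes = Counter(labels)              # cluster label -> number of members
    histogram = Counter(sizes.values())  # cluster size -> number of clusters of that size
    singletons = histogram.get(1, 0)     # clusters with exactly one member
    clusters = len(sizes) - singletons   # clusters with at least two members
    return {
        "clusters": clusters,
        "singletons": singletons,
        "size_histogram": dict(sorted(histogram.items())),
    }

# Toy labels, not real data: one cluster of 4, two of size 2, two singletons.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
summary = cluster_population(labels)
print(summary)
```

The "size_histogram" entry makes it easy to see how many "size 2" clusters there are compared with the singletons.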
I am inclined to choose the parameters t=0.3 c=0.6 to reproduce the default parameters from Daylight (sorry to mention the competition... which I want to give up!). Although the metrics are not the same, reading the documentation I concluded that at least c=0.6 is right:
- fixed families of 16 nearest neighbours are constructed
- 2 compounds A and B cluster together if 10 out of their nearest neighbours are in common
- 9 (10 minus A) divided by 15 (16 minus B) gives 0.6, hence the parameter c...
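
The arithmetic above can be written out as a small sketch. It only encodes the reasoning in this post (drop the compound itself from both the shared count and the list size); the function name is hypothetical, and whether either program really counts neighbours this way is an assumption, not something taken from their documentation.

```python
def common_ratio(shared_required, list_size):
    """Ratio of shared neighbours to list size, with the compound itself removed from both.

    Assumption from the post: 10 shared out of 16 becomes 9 out of 15.
    """
    return (shared_required - 1) / (list_size - 1)

c = common_ratio(10, 16)
print(round(c, 2))
```

If the compound is not removed from either count, the same defaults would give 10/16 = 0.625 instead, so the choice of convention matters a little.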
What do you think of my conclusions?