Usage of Jarp for scalar descriptors:

User be4c1293dc

25-02-2010 07:26:01


I am trying to cluster a set of 900 compounds using a set of 184 descriptors. When I use Jarp, it produces 900 singletons, and the compounds are not clustered even at a high threshold "t" of up to "0.8". There are "0" values in many of the descriptor columns. I suspect this is causing the problem?

However, when I removed all the columns with "0" values (which left me with about 50 columns), it produced 26 clusters.

But is it possible to produce clusters using all 184 descriptor columns (including those with "0" values)?

Also, how does using "weights" in the parameters help? I saw the help section here (link below), but can you please elaborate on the usage of "weight"? Should I give weights for all 184 descriptors? Will this help with clustering?


Many thanks


ChemAxon efa1591b5a

11-03-2010 11:43:52

Hi Sangeetha,

Clustering in a 184-dimensional space can be quite tricky. Consider your 900 data points in this vast space: they are so scattered, and the space is as sparse as the cosmos outside the galaxies.

The distances between these data points are huge.
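
To make the sparsity argument concrete, here is a small sketch (not Jarp itself) that compares the average nearest-neighbour distance of random points in 2 and in 184 dimensions. The uniform random data, the point count of 100 and the function name are assumptions made only for illustration; real descriptor data will behave differently in detail.

```python
import math
import random

def mean_nearest_neighbour_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance from each point to its closest other point.

    Points are drawn uniformly from the unit hypercube (an assumption made
    purely for this illustration).
    """
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n_points

low = mean_nearest_neighbour_distance(100, 2)
high = mean_nearest_neighbour_distance(100, 184)
print(f"2 dimensions:   {low:.3f}")
print(f"184 dimensions: {high:.3f}")
```

With the same number of points, the nearest neighbour is much farther away in the 184-dimensional case, which is one way to see why a fixed threshold can leave every compound as a singleton.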

Reducing the number of dimensions is very important in such cases, but it can be very tricky. However, if you have dimensions in which all or most of the data have a 0 coordinate, those should be removed; that part is straightforward.
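
As a rough sketch of that filtering step, assuming the descriptors are already loaded as a list of numeric rows: the function name and the `max_zero_fraction` cut-off are hypothetical and chosen for illustration, not a ChemAxon option.

```python
def drop_mostly_zero_columns(rows, max_zero_fraction=0.9):
    """Return (kept_column_indices, filtered_rows).

    A column is dropped when the fraction of rows holding 0 in it
    exceeds max_zero_fraction.
    """
    if not rows:
        return [], []
    n_rows = len(rows)
    n_cols = len(rows[0])
    kept = []
    for j in range(n_cols):
        zeros = sum(1 for row in rows if row[j] == 0)
        if zeros / n_rows <= max_zero_fraction:
            kept.append(j)
    filtered = [[row[j] for j in kept] for row in rows]
    return kept, filtered

# Toy data: 4 compounds, 4 descriptors; columns 1 and 3 are (almost) all zero.
descriptors = [
    [1.2, 0.0, 3.4, 0.0],
    [0.7, 0.0, 2.9, 0.0],
    [1.9, 0.0, 0.5, 0.1],
    [0.3, 0.0, 1.1, 0.0],
]
kept, reduced = drop_mostly_zero_columns(descriptors, max_zero_fraction=0.7)
print(kept)
print(reduced)
```

The returned indices let you map the surviving columns back to descriptor names before feeding the reduced table to the clustering tool.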

With those 50 columns it is still a challenging but tractable exercise. How well did those 26 clusters meet your expectations?
Btw, you may also wish to try Ward clustering. With such a small number of data points it can cope easily, though the running time will be longer than with Jarp. For Ward you can preset the number of clusters you'd like to see at the end.
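
For readers unfamiliar with Ward's method, here is a minimal, naive sketch of the idea: start with every point as its own cluster and repeatedly merge the pair whose merge increases the within-cluster sum of squares the least, stopping at a preset number of clusters. This is a textbook illustration written for this thread, not ChemAxon's implementation, and it is far too slow for large data sets.

```python
def ward_clusters(points, n_clusters):
    """Naive Ward agglomeration down to n_clusters clusters.

    Returns a list of clusters, each a sorted list of point indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                members_a, cen_a = clusters[a]
                members_b, cen_b = clusters[b]
                na, nb = len(members_a), len(members_b)
                sq = sum((x - y) ** 2 for x, y in zip(cen_a, cen_b))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        members_a, cen_a = clusters[a]
        members_b, cen_b = clusters[b]
        na, nb = len(members_a), len(members_b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(cen_a, cen_b)]
        merged = (members_a + members_b, centroid)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]

# Toy data: two tight pairs and one isolated point.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
result = ward_clusters(points, n_clusters=3)
print(sorted(result))
```

Because the number of clusters is fixed in advance, every point ends up in some cluster, unlike a threshold-based run that may return only singletons.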

With the weights you can increase or decrease the effect of some directions (dimensions, columns), or bend the space, if you like analogies from physics. Using them you may be able to suppress the many 0 columns, though I doubt that this is an efficient or proper way of doing so.
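
A small sketch of what per-column weights do to a distance, assuming a weighted Euclidean metric of the form sqrt(sum(w_i * (x_i - y_i)^2)). Whether Jarp applies its weights in exactly this form is an assumption here; check the help page for the actual definition.

```python
import math

def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance: sqrt(sum(w_i * (a_i - b_i)^2))."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

a = [1.0, 0.0, 3.0]
b = [2.0, 5.0, 3.0]

# Equal weights: the large difference in column 1 dominates the distance.
d_uniform = weighted_euclidean(a, b, [1.0, 1.0, 1.0])

# Weight 0 on column 1: that column no longer contributes at all.
d_suppressed = weighted_euclidean(a, b, [1.0, 0.0, 1.0])

print(d_uniform, d_suppressed)
```

Under this assumption one weight is needed per descriptor column, and a weight of 0 has the same effect as deleting the column.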

FYI, we are working on a nearest neighbour algorithm that is efficient even in such high-dimensional spaces; its expected release date is Q3 this year.

Hope this helps. I'd be glad to hear about your further experiences; all feedback helps our product development.