Thanks for your answers.
The data you want to process is somewhat different in nature from what JKlustor can directly process, yet I believe you can use JKlustor to do the job for you.
I'll try to explain the typical workflow in JKlustor and suggest how to fit your problem into this framework.
In JKlustor's approach we have data points, let's say a1, ..., an, and - unlike in your case - we do not know their internal relationships; instead, some properties are known as values: p1, ..., pm.
Then a similarity metric is applied to estimate possible 'relationships' based on the values pj.
By applying the metric a similarity distance matrix is generated: each row and each column is labelled by the corresponding data point, and cells of the matrix are filled in with dissimilarity ratios or distances. Clustering uses this matrix to find nearest datapoints.
      a1    a2    ...   aj    ...
a1    d11   d12   ...
a2    d21   d22   ...
...
ai                      dij
...
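To make this step concrete, here is a minimal sketch of how such a matrix can be built from property values. It is not JKlustor code: the function names are made up for illustration, the properties are assumed to be 0/1 values, and the metric is assumed to be the Tanimoto dissimilarity (1 minus the Tanimoto similarity), since that is the metric Jarp uses.

```python
def tanimoto_dissimilarity(a, b):
    """Return 1 - |A and B| / |A or B| for two equal-length 0/1 lists."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Convention chosen for this sketch: two all-zero rows count as identical.
        return 0.0
    return 1.0 - both / either


def dissimilarity_matrix(rows):
    """Pairwise dissimilarities; entry [i][j] compares rows[i] with rows[j]."""
    return [[tanimoto_dissimilarity(a, b) for b in rows] for a in rows]


# Three hypothetical data points with four binary properties each.
points = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
matrix = dissimilarity_matrix(points)
```

The first two points share one of two set properties, so their distance is 0.5; the third shares nothing with the others, so its distance to them is 1.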
In your case the data you have resembles the above-mentioned distance matrix, i.e. you know (because you measured) the values dij (for all i and j). It is a binary matrix (dij is either 0 or 1), but that's not a problem.
As I said above, such a matrix is calculated by the clustering algorithm anyway, so the obvious question is: why can't your matrix be used directly, skipping the initial step of pair-wise distance calculation?
The answer is simple and technical: JKlustor has no input interface to take the distance matrix directly.
Still, you can use JKlustor simply by using the dij values as the properties pj of data point ai (that is, 1 if gene i interacts with gene j, and 0 otherwise).
For example, if gene1 interacts with gene3 and gene5, and
gene2 interacts with gene3, gene4 and gene6,
then, instead of trying something like
3 5
3 4 6
as input, do this:
0 0 1 0 1 0
0 0 1 1 0 1
and so on. This should do the trick. Sorry if the explanation was too long or obscured by symbols, but I thought a deeper insight would help you better understand the hows and whys.
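If your interaction lists are stored in a file or a script, the conversion to 0/1 rows can be automated. The following is only a sketch: the function name and the dictionary layout are my assumptions, and you may still need to adapt the printed rows to whatever input format your JKlustor setup expects.

```python
def interactions_to_rows(interactions, n_genes):
    """Turn {gene: [partner genes]} into one 0/1 row per listed gene.

    Genes are numbered from 1; column j is 1 if the gene interacts with gene j.
    """
    rows = []
    for gene in sorted(interactions):
        partners = set(interactions[gene])
        rows.append([1 if j in partners else 0 for j in range(1, n_genes + 1)])
    return rows


# The two genes from the example above, with six genes in total.
interactions = {1: [3, 5], 2: [3, 4, 6]}
rows = interactions_to_rows(interactions, n_genes=6)
for row in rows:
    print(" ".join(str(value) for value in row))
# prints:
# 0 0 1 0 1 0
# 0 0 1 1 0 1
```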
So much for representing your data; now about the parameters t and c.
Data point properties are compared by similarity metrics; in the case of Jarp this is the Tanimoto metric. It takes values from 0 to 1, thus both t and c have to be in this range. With t you can control the number of nearest neighbours considered for each data point, while c introduces a clustering condition: if two input data points have at least c of their nearest neighbours in common, where c is a ratio of the length of the shorter nearest-neighbour list, then the two points are clustered together.
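To illustrate how the two parameters interact, here is a rough sketch of the clustering condition described above. It is not the Jarp implementation and makes several assumptions of its own: t is treated as a dissimilarity threshold (a point's neighbours are those closer than t), each point is counted in its own neighbour list, and linked points are merged into clusters by following the links transitively. The function name and the sample matrix are made up.

```python
def jarvis_patrick_sketch(dist, t, c):
    """Group point indices using a shared-nearest-neighbour condition.

    dist: square matrix of pairwise dissimilarities in [0, 1]
    t:    assumed dissimilarity threshold defining the neighbour lists
    c:    required ratio of common neighbours, relative to the shorter list
    """
    n = len(dist)
    # Assumption: every point belongs to its own neighbour list.
    neighbours = [
        {j for j in range(n) if j == i or dist[i][j] < t} for i in range(n)
    ]

    parent = list(range(n))  # union-find forest used to merge linked points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            shorter = min(len(neighbours[i]), len(neighbours[j]))
            common = len(neighbours[i] & neighbours[j])
            if common >= c * shorter:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical matrix: points 0, 1, 2 are close to each other, point 3 is far away.
dist = [
    [0.0, 0.2, 0.2, 0.9],
    [0.2, 0.0, 0.2, 0.9],
    [0.2, 0.2, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
clusters = jarvis_patrick_sketch(dist, t=0.5, c=0.5)
```

With these values the first three points end up in one cluster and the fourth stays alone. A smaller t shortens the neighbour lists, and a larger c demands more overlap before two points are merged.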
One has to experiment with these parameters; see the User's documentation for examples and for the complete description of the Jarvis-Patrick method (and note again that the documentation uses a different notation: c is denoted by Rmin).
Hope this helps you tackle your clustering problem successfully.