Input file for Jarp

User 730da025f0

29-07-2005 16:02:32

I am interested in Jarvis-Patrick clustering, and I have a problem with the input file for Jarp. I gave it a simple input file, and the output puts everything in cluster 1, although id 4 is different and should be in a different cluster.

Could you please let me know if I am doing something wrong?





I tried it with a very simple input:





1 2 3 4


2 1 3 4


3 1 2 4


4 3





and ran it with a simple command:


jarp -i inputfile -o outputfile -z -y -c 1


and the output is:





id clid
1 1
2 1
3 1
4 1
STATISTICS
Number of objects = 4
Number of clusters (without singletons) = 1
Number of singletons = 0
List of clusters (without singletons):
clid size
1 4

ChemAxon 43e6884a7a

29-07-2005 16:34:24

You also have to specify the following parameters:


-t (maximum dissimilarity/distance of two compounds in a cluster)


-m (number of data columns: 3 in this case)


-f (0 in this case because you don't have fingerprints)





Documentation: http://www.chemaxon.com/jchem/doc/user/Jarp.html





Example:





Code:
$ jarp -i input.txt -c 1 -t 1.6 -f 0 -m 3 -z -y


id      clid


1       1


2       1


3       1


4       -1





STATISTICS





Number of objects = 4


Number of clusters (without singletons) = 1


Number of singletons = 1





List of clusters (without singletons):


clid    size


1       3





Average dissimilarity = 1.6386371


Minimum dissimilarity = 1.0


Maximum dissimilarity = 2.4494898

User 730da025f0

29-07-2005 17:28:06

Thanks. The result looks better now. I am confused about the difference between c and f.

If I change my input to


1 2 3 4


2 1 3 4


3 1 2 4


4 0 2 3





If I vary c (from 0.1 to 1) and keep f at 1.6, the result is always the same. So when we use f, do we need to use c as well?





Thanks.

ChemAxon efa1591b5a

29-07-2005 17:43:16

I reckon you refer to 't' when saying 'f'. Is that correct?





Indeed, 't' and 'c' are closely related to each other. Namely, two objects cluster together if they have at least 'c' common neighbours among those objects that are more similar than 't'. (That is, 't' acts as a threshold here: only objects whose dissimilarity is below 't' count as neighbours.)

Thus, if 't' is too small, then there could be fewer than 'c' common neighbours.
(Note that there is a further condition: the two objects have to be each other's neighbours.)

This link provides a good introduction to Jarvis-Patrick clustering: http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro

There, parameter T is the same as '-t' in the jarp command line, while '-c' is denoted Rmin.

Perhaps your input is too short to experiment with these values. What if you try a larger set?





Hope this helps.





Regards,


Miklos

User 730da025f0

29-07-2005 18:07:55

Thanks. I meant the difference between t and c.

Can I have my input as strings rather than integers?





Thanks


Anis

ChemAxon efa1591b5a

29-07-2005 18:35:11

O.K.





What do you mean? Do you want to cluster string entities instead of integers?





Miklos

User 730da025f0

29-07-2005 18:46:51

Yes. Can I have an input file such as:





a1 a2 a3 a4


a2 a1 a3 a4


a3 a1 a2 a4


a4 a0 a0 a3





thanks

ChemAxon efa1591b5a

29-07-2005 18:58:28

Well, theoretically yes. But that's hard work; I mean you will need to do some Java coding.
Basically, you need to implement a custom descriptor class (one that can read, represent and process your string data), and also an appropriate metric to calculate the dissimilarity ratio between two such strings.
Not so hard, but a bit time consuming. Perhaps you will find this link useful: http://www.chemaxon.com/forum/ftopic352.html

Instead, I suggest some sort of preprocessing of your original data. Find a good way to translate your strings to integer or floating point values. This mapping from strings to numbers should preserve the original ordering, that is, if string a1 > a2, then the associated values v1 and v2 should also satisfy v1 > v2. So ordinary hashing, for instance, is not suitable.


Do you think your data can be represented by numerical values this way?
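As a rough illustration of such an order-preserving translation, here is a minimal Python sketch. It is not part of JChem; the function name `order_preserving_codes` is made up for this example, and it assumes that plain lexicographic string order is the ordering you want to keep (so "a10" would sort before "a2").

```python
def order_preserving_codes(labels):
    """Map each distinct string to its 1-based rank in sorted order.

    Assumption: lexicographic order of the strings is the order to preserve.
    """
    ranked = sorted(set(labels))
    return {label: rank for rank, label in enumerate(ranked, start=1)}


codes = order_preserving_codes(["a3", "a1", "a4", "a2", "a1"])
print(codes)  # {'a1': 1, 'a2': 2, 'a3': 3, 'a4': 4}
```

Unlike a hash, the rank of a string grows with its position in the sorted list, so v1 > v2 whenever a1 > a2.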





BTW: what's your original data? Are those molecular structures, with the strings representing some sort of molecular properties, or are you dealing with completely different entities (e.g. genetic data)?

User 730da025f0

29-07-2005 19:26:49

That is a good question. My original data is a set of genes. So when in my input I say:





1 2 3 4





The first column is my identifier. The second column is 2, which means gene 1 interacts with gene 2; the third column is 3, which means gene 1 interacts with gene 3; and so on.
I would like to use Jarp to cluster my genes. That is why I am confused about t and c, and about how to represent my data for Jarp.





Thanks

ChemAxon efa1591b5a

29-07-2005 20:36:32

I need to understand your problem better. I will get back to you next week with a few questions. Hope you won't mind.





Regards,


Miklos

User 730da025f0

01-08-2005 16:16:57

Hi, could you please let me know what your questions were?


Thanks

ChemAxon efa1591b5a

01-08-2005 16:49:55

Sure, but right now I am very busy with something else. I will talk to you as soon as possible.


Thanks.


Miklos

ChemAxon efa1591b5a

02-08-2005 08:37:08

O.k. I'm back, thanks for your patience.





Questions:


1. How many datapoints (i.e. genes) do you have? 10, 100, 100000?

2. Is what you call 'interaction' between genes a true/false value, or is it measured on a continuous scale, say from 0 to 1?

3. Is there anything else apart from this interaction that you intend to consider in clustering your genes?

4. The 'interaction' data is a measured parameter, not a value that you want to predict based on other properties, is that correct?

That's all for now. Thanks for helping me understand the problem you're dealing with.





Regards,


Miklos

User 730da025f0

02-08-2005 13:54:59

Thanks for the help.





1. How many datapoints (i.e. genes) do you have? 10, 100, 100000?
It varies from 10 to 100.

2. Is what you call 'interaction' between genes a true/false value, or is it measured on a continuous scale, say from 0 to 1?
A value of zero or one.

3. Is there anything else apart from this interaction that you intend to consider in clustering your genes?
No.

4. The 'interaction' data is a measured parameter, not a value that you want to predict based on other properties, is that correct?
Yes.





Thanks


Anis

ChemAxon efa1591b5a

05-08-2005 10:39:00

Thanks for your answers.





The data you want to process is somewhat different in nature from what JKlustor can directly process, yet I believe you can use JKlustor to do the job for you.
I'll try to explain the typical workflow in JKlustor and perhaps suggest how to fit your problem into this framework.

In JKlustor's approach we have data points, say a1, a2, ..., an, and (unlike in your case) we do not know their internal relationships; instead, some properties of each point are known as values p1, p2, ..., pm.
Then a similarity metric is applied to estimate possible 'relationships' based on the values pj.
Applying the metric generates a dissimilarity (distance) matrix: each row and each column is labelled by the corresponding data point, and the cells are filled in with dissimilarity ratios or distances. Clustering uses this matrix to find the nearest data points.





      a1    a2    ...   aj    ...
a1    d11   d12   ...   d1j   ...
a2    d21   d22   ...   d2j   ...
...   ...   ...   ...   ...   ...
ai    di1   di2   ...   dij   ...
...   ...   ...   ...   ...   ...








In your case the data you have resembles the distance matrix mentioned above, i.e. you know (because you measured) the values dij for all i and j. It is a binary matrix (dij is either 0 or 1), but that's not a problem.

As I said above, such a matrix is calculated by the clustering algorithm anyway, so the obvious question is why your matrix cannot be used directly, skipping the initial step of pair-wise distance calculation.
The answer is simple and technical: JKlustor has no input interface to take the distance matrix directly.





Still you can use JKlustor simply by using dij values as pj for data point ai (that is, 1 if gene i interacts with gene j, and 0 otherwise).





Practically speaking:


if gene1 interacts with gene3 and gene5, and
if gene2 interacts with gene3, gene4 and gene6,

then, instead of trying something like





3 5


3 4 6





as input, do this:





0 0 1 0 1 0


0 0 1 1 0 1





and so on. This should do the trick. Sorry if the explanation was too long or obscured by symbols, but I thought a deeper insight could help better understand hows and whys.
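For a larger gene set, the translation from interaction lists to 0/1 rows can be scripted. The following is a minimal Python sketch, not part of JKlustor; the function name `interaction_row` and the assumption that genes are numbered 1..n are mine.

```python
def interaction_row(partners, n_genes):
    """Return a 0/1 list of length n_genes.

    Position j (1-based) is 1 if the gene interacts with gene j, else 0.
    Assumption: genes are numbered 1..n_genes.
    """
    partner_set = set(partners)
    return [1 if j in partner_set else 0 for j in range(1, n_genes + 1)]


# The two genes from the example above, with six genes in total.
interactions = {1: [3, 5], 2: [3, 4, 6]}
for gene, partners in interactions.items():
    print(" ".join(str(v) for v in interaction_row(partners, 6)))
# prints:
# 0 0 1 0 1 0
# 0 0 1 1 0 1
```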





This much about representing your data, and now about parameters t and c.





Data point properties are compared by similarity metrics, in case of Jarp this is the Tanimoto metric. This has values from 0 to 1, thus both t and c have to be in this range. With t you can control the number of nearest neighbours considered for each data point, while c introduces a clustering condition: if two input data have at least c of their nearest neighbors in common, where c is a ratio of the length of the shorter nearest neighbour list, then the two points are clustered together.
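To make the roles of t and c more concrete, here is a small Python sketch of the clustering condition as it is described in this thread. It only illustrates the rule and is not Jarp's actual implementation: the function names, the strict `<` comparison against t, leaving a point out of its own neighbour list, and merging linked pairs into clusters with a union-find are all assumptions.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """1 - |A and B| / |A or B| for two 0/1 lists (0.0 if both are all zero)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def jarvis_patrick(rows, t, c):
    """Group row indices; t limits neighbour lists, c is the common-neighbour ratio."""
    n = len(rows)
    # Neighbours of i: every other point whose dissimilarity to i is below t.
    neighbours = [
        {j for j in range(n)
         if j != i and tanimoto_dissimilarity(rows[i], rows[j]) < t}
        for i in range(n)
    ]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # The threshold relation is symmetric, so this also checks that
        # the two points are each other's neighbours.
        if j not in neighbours[i]:
            continue
        shorter = min(len(neighbours[i]), len(neighbours[j]))
        common = len(neighbours[i] & neighbours[j])
        if common >= c * shorter:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


rows = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1],
]
loose = jarvis_patrick(rows, t=0.65, c=0.5)
strict = jarvis_patrick(rows, t=0.65, c=1.0)
print(loose)   # [[0, 1, 3, 4], [2]]
print(strict)  # [[0], [1], [2], [3], [4]]
```

Under these assumptions, keeping t fixed and raising c from 0.5 to 1.0 breaks the four-member cluster into singletons, because each pair shares only two of its three neighbours.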





One has to experiment with these parameters. See the user documentation for examples and for the complete description of the Jarvis-Patrick method (and note again that the document uses a different notation: c is denoted by Rmin). http://www.chemaxon.com/jchem/doc/user/Jarp.html#intro





Hope this helps you tackle your clustering problem successfully.





Regards,


Miklos

User 730da025f0

05-08-2005 16:43:26

Thanks for all the info. OK, I tried your suggested format (with and without the id column):





1 0 1 1 0 1 0


2 0 1 1 0 0 0


3 1 0 0 0 1 0


4 1 1 1 1 0 0


5 0 1 1 1 1 1





I get an error when I run Jarp. The message says "unknown error".
Can you give me the command to run the above example?





thanks


Anis

ChemAxon 43e6884a7a

10-08-2005 16:32:06

Check the value after the -m option, it should be equal to the number of values you have in each line of your input.
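A quick way to check this before running jarp is to count the values on each line of the input. The Python sketch below is only a helper written for this thread (the name `count_columns` is made up). Whether a leading id column counts towards -m is an assumption to verify: in the earlier example in this thread, lines holding an id plus three values were run with -m 3.

```python
def count_columns(lines):
    """Return {number_of_values: [1-based line numbers]} for non-empty lines."""
    counts = {}
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if tokens:
            counts.setdefault(len(tokens), []).append(number)
    return counts


# The five-line input from the previous post (id followed by six values).
sample = [
    "1 0 1 1 0 1 0",
    "2 0 1 1 0 0 0",
    "3 1 0 0 0 1 0",
    "4 1 1 1 1 0 0",
    "5 0 1 1 1 1 1",
]
print(count_columns(sample))  # {7: [1, 2, 3, 4, 5]}

# The very first input in this thread has a short last line.
ragged = ["1 2 3 4", "2 1 3 4", "3 1 2 4", "4 3"]
print(count_columns(ragged))  # {4: [1, 2, 3], 2: [4]}
```

If the result has more than one key, the lines do not all have the same number of values, which is worth fixing before looking at the -m setting.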





I am still thinking about the best representation of your gene-related problem... I will suggest some changes to what I said before, but I need to ponder a little bit more.

In the meantime, I suggest you try alternative tools to cluster your genes. I am saying that mainly because you measured the 'distances' between each pair of genes, so they do not have to be estimated by a metric. As I discussed in my previous post, this is quite different from the typical situation in which Jarp is applicable.





Regards,


Miklos





P.S. Apologies for the delay [again], I'm abroad.