Java code/API to do the clustering - ChemAxon Forum Archive

User 55ffa2f197

29-01-2013 23:53:21

Hi, I need to do clustering from Java code by calling the clustering API (JKlustor). I glanced through the API quickly and did not find an example for it. I have a list of molecules as SMILES, which can be up to 20K. I read them in using MolImporter, want to pass the SMILES/molecules to the cluster module, and get back the cluster ID, and also the centroid of each cluster if possible. Since the number of molecules can go up to 20K, I would like something that runs fast. Jarvis-Patrick or Ward might be a good choice. Since this is going to be part of a larger program, I will not consider doing it on the command line. Can you let me know a code snippet for doing so?

Thanks

Dong

ChemAxon 60613ab728

30-01-2013 16:07:09

Hi Dong,

Thank you for your request. We are going to provide you with a more detailed description of the currently available solutions for your specific request by the beginning of next week.

I also have to mention that the JKlustor API is going to be redesigned in the next couple of months, due to user demands similar to yours. As a major driving force for this process, we would like to incorporate specific user requests into our development. We encourage you to share your specific requests for these developments. It would help us a lot if we could have your feedback on our JKlustor API design plans.

Thank you,

Miklós

User 55ffa2f197

30-01-2013 16:41:07

Good to hear I am not alone on this. I think there is well-established theory and practice on clustering. The one that goes fast is Jarvis-Patrick with binary fingerprints, such as the one implemented in Daylight Merlin clustering; this is a really old algorithm. The other one is RNN, which I used in the past through the Tripos toolkit; it also does reasonably well. I am aware of the power of LibMCS, but I think it is more for cherry-picking purposes with homogeneous molecules. What I am after is really a crude way to put molecules into clusters defined loosely by fingerprint (ECFP or ChemAxon FP, or even Daylight). The cluster ID and centroid attached to each compound ID are the end result I am expecting. Actually, those are the results Pipeline Pilot produces. Oh, and of course if you can also provide the Tanimoto distance of the members to the center, that is even better. I guess in the process of clustering you do have these as intermediate results; you just need to define a cluster object and get them out. I think it is a matter of implementation, not a scholarly discussion. Just get something out quick.
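
To make the request concrete, here is a minimal sketch of the kind of output described above: a cluster ID per compound, one representative per cluster, and a Tanimoto similarity from which a distance to that representative follows. It is plain Java on `java.util.BitSet` fingerprints and does not use the JKlustor API; the class name `JarvisPatrickSketch` and all method names are hypothetical. It assumes one common Jarvis-Patrick variant (two items are joined if each is in the other's k-nearest-neighbor list and they share at least `minShared` neighbors), and it assumes "centroid" can be taken as the medoid, that is, the member most similar to the rest of its cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: plain-Java Jarvis-Patrick, not the JKlustor API. */
final class JarvisPatrickSketch {

    /** Convenience helper: builds a fingerprint with the given bits switched on. */
    static BitSet fp(int... onBits) {
        BitSet b = new BitSet();
        for (int bit : onBits) b.set(bit);
        return b;
    }

    /** Tanimoto similarity of two binary fingerprints: |A and B| / |A or B|. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        int union = or.cardinality();
        return union == 0 ? 1.0 : (double) and.cardinality() / union;
    }

    /** Returns one cluster ID per fingerprint; an ID is the index of some cluster member. */
    static int[] cluster(List<BitSet> fps, int k, int minShared) {
        int n = fps.size();
        // Step 1: k nearest neighbors of every item, excluding the item itself.
        List<Set<Integer>> nn = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            double[] sim = new double[n];
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) {
                sim[j] = tanimoto(fps.get(i), fps.get(j));
                order[j] = j;
            }
            sim[i] = -1.0; // pushes the item itself to the end of its own ranking
            Arrays.sort(order, (x, y) -> Double.compare(sim[y], sim[x]));
            Set<Integer> neighbors = new HashSet<>();
            for (int r = 0; r < Math.min(k, n - 1); r++) neighbors.add(order[r]);
            nn.add(neighbors);
        }
        // Step 2: join mutual neighbors that share enough neighbors (union-find).
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++) {
            for (int j : nn.get(i)) {
                if (j > i && nn.get(j).contains(i)) {
                    Set<Integer> shared = new HashSet<>(nn.get(i));
                    shared.retainAll(nn.get(j));
                    if (shared.size() >= minShared) {
                        parent[find(parent, j)] = find(parent, i);
                    }
                }
            }
        }
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = find(parent, i);
        return ids;
    }

    /** Union-find root lookup with path halving. */
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    /** Maps each cluster ID to its medoid: the member with the highest summed similarity. */
    static Map<Integer, Integer> medoids(List<BitSet> fps, int[] ids) {
        Map<Integer, Integer> best = new HashMap<>();
        Map<Integer, Double> bestScore = new HashMap<>();
        for (int i = 0; i < ids.length; i++) {
            double score = 0.0;
            for (int j = 0; j < ids.length; j++) {
                if (ids[j] == ids[i]) score += tanimoto(fps.get(i), fps.get(j));
            }
            if (score > bestScore.getOrDefault(ids[i], -1.0)) {
                bestScore.put(ids[i], score);
                best.put(ids[i], i);
            }
        }
        return best;
    }
}
```

Given the `ids` array and the `medoids` map, the distance of member `i` to its cluster center would be `1.0 - tanimoto(fps.get(i), fps.get(medoids.get(ids[i])))`. The neighbor search in this sketch is quadratic in the number of molecules, so for 20K compounds it is a starting point rather than a tuned implementation.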

Thanks

Dong

ChemAxon 60613ab728

31-01-2013 15:09:05

Thank you for the more detailed user stories. We are going to include most of your user stories in our plans and will let you know our development plans soon.

Regarding your short-term request, we are going to come back to you with the current capabilities early next week.

Thank you,

Miklos

ChemAxon 60613ab728

07-02-2013 13:23:14

Hi Dong,

The current implementations of the Jarvis-Patrick and Ward clustering algorithms are available as command-line tools. Their current API provides only a thin wrapper around the command-line tools' functionality.

It is possible to exchange data with them using file I/O or a JDBC database connection.

You can find the documentation and the API here:

http://www.chemaxon.com/jchem/doc/user/Ward.html

http://www.chemaxon.com/jchem/doc/user/Jarp.html

http://www.chemaxon.com/jchem/doc/dev/java/api/chemaxon/clustering/Ward.html

http://www.chemaxon.com/jchem/doc/dev/java/api/chemaxon/clustering/JarvisPatrick.html

I have also contacted you by email. We can move our discussion to email in order to provide you with more specific and better solutions.

Thank you,

Miklos