Command-line JKlustor: how to cluster a test set?

User ea0ddb6d13

27-05-2011 09:32:17

Hello,

When I cluster a training set with the JKlustor kmeans method, I get clusters, but there are no cluster representants in the output file, and wrclus:sdf:filename throws a NullPointerException. I'm using JChem 5.5.0.0.

Another thing I want to know: how can I use cluster representants to cluster a test set?

Best regards,

Geven

ChemAxon 8b644e6bf4

07-06-2011 00:43:46

Dear Geven,

Sorry for the late answer and thanks for the bug report. I was able to reproduce the problem in kmeans cluster representant handling. We will notify you in this forum when this bug is fixed.

Another thing I want to know: how can I use cluster representants to cluster a test set?

Cluster representants are identified during the clustering process from the input set. Do you want to specify a "locked" cluster representant set?

Regards,

Gabor

User ea0ddb6d13

09-06-2011 07:24:23

Hello Gabor,

Cluster representants are identified during the clustering process from the input set. Do you want to specify a "locked" cluster representant set?

Basically, that is exactly what I want to do. Once I have identified cluster representants from the input set, I want to use these representants to check whether the molecules of another input set fit these clusters or not.

Geven

ChemAxon 8b644e6bf4

16-06-2011 09:45:39

Dear Geven,

When I identify cluster representants from the input set I want to use these representants to check whether another input set molecules fit to these clusters or not.

This interesting use case is currently not supported.

In theory, an input set could be divided into groups according to arbitrary, externally supplied cluster representants (for example, by assigning each input element to the most similar supplied representant). The question is how to evaluate this assignment.
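A minimal sketch of that assignment step, written in Python rather than with JKlustor (which does not support this use case). It assumes each structure's fingerprint is available as a set of on-bit indices and uses Tanimoto similarity as the measure. The function names and the toy fingerprints are hypothetical, not part of any ChemAxon API.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_to_representants(fingerprints, representants):
    """Map each input id to the id of its most similar supplied representant."""
    assignment = {}
    for mol_id, fp in fingerprints.items():
        assignment[mol_id] = max(
            representants,
            key=lambda rep_id: tanimoto(fp, representants[rep_id]),
        )
    return assignment


# Hypothetical toy data: two externally supplied representants and a test set.
representants = {"rep1": {1, 2, 3, 4}, "rep4": {10, 11, 12}}
test_set = {"molA": {1, 2, 3}, "molB": {10, 12, 13}, "molC": {2, 4, 11}}

assignment = assign_to_representants(test_set, representants)
for mol_id, rep_id in assignment.items():
    print(mol_id, "->", rep_id)
```

Every input is assigned to some representant here, however dissimilar it is, which is why the evaluation question remains open.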

Do you have a desired comparison algorithm in mind?

Regards,

Gabor

User ea0ddb6d13

27-06-2011 12:19:10

gimre wrote:

This interesting use case is currently not supported.

In theory, an input set could be divided into groups according to arbitrary, externally supplied cluster representants (for example, by assigning each input element to the most similar supplied representant). The question is how to evaluate this assignment.

Do you have a desired comparison algorithm in mind?

I don't understand the question about the comparison algorithm. The JKlustor output includes cluster assignments produced by the k-means algorithm. Therefore, the same algorithm should be used to decide whether another compound (from another dataset) belongs to any cluster assigned by JKlustor. Ideally, this should work for any clustering algorithm supported by JKlustor. Basically, what we do is take the JKlustor output and identify a few interesting clusters (say 1 and 4). Later, when we get a new data set, we want to check which structures in that set belong to clusters 1 and 4.

Regards,

Geven

ChemAxon 8b644e6bf4

12-07-2011 08:47:56

Dear Geven,

Therefore, the same algorithm should be used for deciding whether another compound (from another dataset) belongs to any cluster that was assigned by j-klustor.

About deciding whether a new element belongs to any cluster or not: it is possible (to implement) picking the nearest cluster representant or centroid for any further input structure. For some clustering methods it is also trivial to decide whether an input structure is a member of the picked cluster. However, in the case of the k-means algorithm this seems to be a non-trivial decision:

There will always be one or more "nearest" centroid(s).

Continuing k-means clustering (calculating the new mean, reassigning clusters) will probably modify one or more previously found clusters.

Deciding whether the newly assigned input structure belongs to the initially selected "nearest" cluster still seems possible, by checking whether its introduction induced further re-assignment.
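The three points above can be illustrated with a small Python sketch. It is not JKlustor code: it assumes one-dimensional numeric descriptors and an already converged k-means result, and the function names are hypothetical. A new point is added to its nearest cluster, that centroid is recomputed, and the existing members are checked for re-assignment.

```python
def mean(values):
    return sum(values) / len(values)


def nearest(point, centroids):
    """Index of the centroid closest to point (1-D descriptors assumed)."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


def induced_reassignment(clusters, new_point):
    """Add new_point to its nearest cluster, recompute the centroids once,
    and return (target cluster index, existing members that would move)."""
    centroids = [mean(c) for c in clusters]
    target = nearest(new_point, centroids)

    updated = [list(c) for c in clusters]
    updated[target].append(new_point)
    new_centroids = [mean(c) for c in updated]

    moved = [
        p
        for i, members in enumerate(clusters)
        for p in members
        if nearest(p, new_centroids) != i
    ]
    return target, moved


# Hypothetical converged clusters with centroids 2.0 and 8.0.
clusters = [[0.0, 4.0], [6.0, 10.0]]

print(induced_reassignment(clusters, 7.0))   # fits cluster 1, nothing moves
print(induced_reassignment(clusters, 30.0))  # outlier drags the centroid away
```

In this toy case the outlier at 30.0 is nominally "nearest" to the second cluster, but accepting it moves the centroid far enough that the existing member 6.0 would switch clusters, which is one possible signal that the new point does not really belong there.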

Ideally, this should work for any clustering algorithm supported by j-klustor. Basically, what we do is to take j-klustor output, identify few interesting clusters (say 1 and 4). Then later when we get a new data set and we want to check what structures in this set belong to clusters 1 and 4.

Deciding whether a new item is more similar to a predefined subset of cluster representants/centroids than to the remaining ones is a trivial task, given the similarity/dissimilarity calculation method in use.
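As a rough sketch of this scenario in Python (not an existing JKlustor feature): assuming fingerprints stored as sets of on-bit indices, Tanimoto similarity, and hypothetical representants keyed by cluster number, the check reduces to finding the best-matching representant and testing whether it is one of the selected clusters.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def closer_to_selected(fp, representants, selected):
    """Return (hit, best_cluster): hit is True when the most similar
    representant belongs to one of the selected clusters."""
    best = max(representants, key=lambda cid: tanimoto(fp, representants[cid]))
    return best in selected, best


# Hypothetical representants for clusters 1..4; clusters 1 and 4 are of interest.
representants = {
    1: {1, 2, 3, 4},
    2: {5, 6, 7},
    3: {8, 9},
    4: {10, 11, 12},
}
selected = {1, 4}

print(closer_to_selected({10, 11, 13}, representants, selected))
print(closer_to_selected({5, 6, 9}, representants, selected))
```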

Are you interested in this latter scenario?

Regards,

Gabor

User ea0ddb6d13

14-07-2011 12:53:37

gimre wrote:

About deciding whether a new element belongs to any cluster or not: it is possible (to implement) picking the nearest cluster representant or centroid for any further input structure. For some clustering methods it is also trivial to decide whether an input structure is a member of the picked cluster. However, in the case of the k-means algorithm this seems to be a non-trivial decision:

There will always be one or more "nearest" centroid(s).

Continuing k-means clustering (calculating the new mean, reassigning clusters) will probably modify one or more previously found clusters.

Deciding whether the newly assigned input structure belongs to the initially selected "nearest" cluster still seems possible, by checking whether its introduction induced further re-assignment.

Ideally, this should work for any clustering algorithm supported by j-klustor. Basically, what we do is to take j-klustor output, identify few interesting clusters (say 1 and 4). Then later when we get a new data set and we want to check what structures in this set belong to clusters 1 and 4.

Deciding whether a new item is more similar to a predefined subset of cluster representants/centroids than to the remaining ones is a trivial task, given the similarity/dissimilarity calculation method in use.

Are you interested in this latter scenario?

If I understand the problem correctly, then in the case of k-means it is a question of finding the shortest distance between a new compound and the cluster centroids. This should be rather trivial, and there is no strict need to update the centroids. Basically, this seems to be the latter scenario.
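A minimal Python sketch of what I mean, assuming the centroids are available as numeric descriptor vectors and Euclidean distance is the metric (both are assumptions; the names and toy values are made up). The centroids stay frozen and each new compound simply gets the label of the closest one.

```python
import math


def nearest_centroid(vector, centroids):
    """Return (cluster_id, distance) for the closest frozen centroid."""
    cluster_id = min(centroids, key=lambda cid: math.dist(vector, centroids[cid]))
    return cluster_id, math.dist(vector, centroids[cluster_id])


# Hypothetical centroids taken from an earlier k-means run; never updated here.
centroids = {1: (0.0, 0.0), 2: (3.0, 4.0), 4: (10.0, 0.0)}
new_compounds = {"molA": (0.0, 1.0), "molB": (9.0, 0.0), "molC": (3.0, 3.0)}

labels = {
    name: nearest_centroid(vec, centroids)[0]
    for name, vec in new_compounds.items()
}
print(labels)
```

The returned distance could also be compared against a cutoff to reject compounds that are far from every centroid, but choosing that cutoff is a separate decision.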

Geven

ChemAxon 8b644e6bf4

03-08-2011 16:59:17

Dear Geven,

... there is no strict need to update centroids ...

Agreed, put this way it is a trivial scenario. We will notify you when such functionality becomes available. Thanks for the suggestions.

Regards,

Gabor

User 68d678d290

16-08-2011 15:53:27

Dear ChemAxon team,

I am trying to cluster a set via the Bemis-Murcko option. Normally I have never had issues with a command line like the one below:

jklustor -c bm -o wrclus:smiles:ULIBS.csv -o wrmols:smiles:S_*.smiles -o wrstat:full:ULIBS.stat ULIBS.smiles

However, for this particular set it produces no data and keeps printing the message shown at the end of this post.

What is wrong with my set? It was exported from an Instant JChem database.

Thank you very much in advance,

Lex

Message:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at chemaxon.clustering.backend.BemisMurcko$SimpleBMFrameworkReduction.removeSmallFragments(BemisMurcko.java:412)
        at chemaxon.clustering.backend.BemisMurcko$SimpleBMFrameworkReduction.reduce(BemisMurcko.java:438)
        at chemaxon.clustering.backend.BemisMurcko$1.processNewLeaf(BemisMurcko.java:116)
        at chemaxon.clustering.backend.BemisMurcko$1.processNewLeaf(BemisMurcko.java:105)
        at chemaxon.clustering.backend.HC$DefaultPreprocessorDispatcher.processNewNode(HC.java:268)
        at chemaxon.clustering.backend.HC$DefaultPreprocessorDispatcher.processNewNode(HC.java:243)
        at chemaxon.clustering.backend.MRTree.importFinished(MRTree.java:387)
        at chemaxon.clustering.backend.StoreIO$DefaultAddMoleculeNode.addNode(StoreIO.java:364)
        at chemaxon.clustering.backend.StoreIO.addMolecules(StoreIO.java:281)
        at chemaxon.clustering.backend.MRTree.importStructures(MRTree.java:254)
        at chemaxon.clustering.backend.HC.importMolecules(HC.java:223)
        at chemaxon.clustering.backend.BM.run(BM.java:464)
        at chemaxon.clustering.backend.BM.main(BM.java:88)

ChemAxon 8b644e6bf4

24-08-2011 02:13:09

Dear Lex,

Sorry for the late answer - thanks for the bug report. Is it possible that your input contains empty molecule(s)? Could you attach it if it is not confidential?
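One quick way to check for this before running the clustering is a small script like the Python sketch below. It assumes one record per line with the SMILES string as the first whitespace-separated token, so a blank line or a line starting with whitespace is treated as a record without a structure. That reading of the file format is an assumption, and the function name is hypothetical.

```python
def find_empty_records(lines):
    """Return 1-based line numbers whose structure field appears to be missing."""
    empty = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Assumption: the structure is the first token and starts at column 0,
        # so a blank line or a line starting with whitespace has no structure.
        if not text.strip() or text[0].isspace():
            empty.append(number)
    return empty


# Hypothetical sample: record 2 is blank, record 3 has a name but no structure.
sample = ["CCO ethanol\n", "\n", "\tunnamed\n", "c1ccccc1 benzene\n"]
print(find_empty_records(sample))
```

For a real file, the same function can be called on an open file handle, for example the ULIBS.smiles input from the command line above.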

Regards,

Gabor

User 68d678d290

24-08-2011 13:24:55

Dear Gabor,

Actually I already found the source of this issue.

I am working with carboranes and other polyborane species, which are aromatic and, besides that, do not obey conventional valence rules.

SMILES notation does not support such classes of compounds. Strangely, using an SDF file as input caused a similar problem.

So, I just generated a twin set where I replaced those borane groups with dummies.

Thank you very much for your attention to my problem.

Sincerely yours,

Lex