User ea0ddb6d13
27052011 09:32:17
Hello,
When I cluster a training set with the JKlustor kmeans method, I get clusters, but there are no cluster representants in the output file, and wrclus:sdf:filename throws a NullPointerException. I'm using JChem 5.5.0.0.
Another thing I want to know is how to use the cluster representants to cluster a test set.
Best regards,
Geven
ChemAxon 8b644e6bf4
07062011 00:43:46
Dear Geven,
Sorry for the late answer and thanks for the bug report. I was able to reproduce the problem in kmeans cluster representant handling. We will notify you in this forum when this bug is fixed.
Another thing I want to know is how to use the cluster representants to cluster a test set.
Cluster representants are identified during the clustering process from the input set. Do you want to specify a "locked" cluster representant set?
Regards,
Gabor
User ea0ddb6d13
09062011 07:24:23
Hello Gabor,
Cluster representants are identified during the clustering process from the input set. Do you want to specify a "locked" cluster representant set? 
Basically, that is exactly what I want to do. Once I have identified the cluster representants from the input set, I want to use them to check whether the molecules of another input set fit these clusters or not.
Geven
ChemAxon 8b644e6bf4
16062011 09:45:39
Dear Geven,
When I identify cluster representants from the
input set I want to use these representants to check whether another
input set molecules fit to these clusters or not.
This interesting use case is currently not supported.
Theoretically, an input set could be divided into groups according to arbitrary, externally supplied cluster representants (for example, by assigning each input element to the most similar supplied representant). The question is how to evaluate this assignment.
Do you have a desired comparison algorithm in mind?
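To make the idea concrete, here is a minimal Python sketch of assigning each input element to the most similar supplied representant. It is not JKlustor code: the fingerprints are toy sets of on-bit indices, Tanimoto similarity is assumed as the measure, and the names `tanimoto` and `assign_to_representants` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def assign_to_representants(fingerprints, representants):
    """Return (representant_index, similarity) for every input fingerprint."""
    assignments = []
    for fp in fingerprints:
        sims = [tanimoto(fp, rep) for rep in representants]
        # Ties are resolved in favour of the first representant.
        best = max(range(len(sims)), key=sims.__getitem__)
        assignments.append((best, sims[best]))
    return assignments


# Toy data (assumption): not real chemical fingerprints.
representants = [{1, 2, 3, 4}, {10, 11, 12}]
new_set = [{1, 2, 3}, {10, 11, 12, 13}, {20, 21}]
assignments = assign_to_representants(new_set, representants)
for fp, (index, similarity) in zip(new_set, assignments):
    print(sorted(fp), "-> representant", index, "similarity", similarity)
```

The third toy element shares no bits with either representant, yet it still receives a "most similar" one. That is the evaluation problem: the assignment alone does not say whether the element really fits the cluster, so the returned similarity would have to be judged against some criterion.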
Regards,
Gabor
User ea0ddb6d13
27062011 12:19:10
gimre wrote: 
This interesting use case is currently not supported.
Theoretically, an input set could be divided into groups according to arbitrary, externally supplied cluster representants (for example, by assigning each input element to the most similar supplied representant). The question is how to evaluate this assignment.
Do you have a desired comparison algorithm in mind?

I don't get the question about the comparison algorithm. The JKlustor output includes cluster assignments produced by the kmeans algorithm. Therefore, the same algorithm should be used for deciding whether another compound (from another dataset) belongs to any cluster that was assigned by JKlustor. Ideally, this should work for any clustering algorithm supported by JKlustor. Basically, what we do is take the JKlustor output and identify a few interesting clusters (say 1 and 4). Later, when we get a new data set, we want to check which structures in this set belong to clusters 1 and 4.
Regards,
Geven
ChemAxon 8b644e6bf4
12072011 08:47:56
Dear Geven,
Therefore, the same algorithm should be used for
deciding whether another compound (from another dataset) belongs to any
cluster that was assigned by jklustor.
About deciding whether a new element belongs to any cluster or not: it would be possible to implement picking the nearest cluster representant or centroid for any further input structure. For some clustering methods it is also trivial to decide whether an input structure is a member of the picked cluster. In the case of the kmeans algorithm, however, it seems to be a nontrivial decision:
 There will always be one or more "nearest" centroid(s)
 Continuing the kmeans clustering (calculating new means, reassigning clusters) will probably modify one or more previously found clusters
 Deciding whether the newly assigned input structure belongs to the initially selected "nearest" cluster still seems possible by considering whether its introduction induced further reassignment
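The effect described in the points above can be illustrated with a small, self-contained Python sketch of plain k-means (Lloyd iteration). It is only an illustration under assumed toy data, not the JKlustor implementation, and all function names are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans_step(points, centroids):
    """One iteration: reassign every point, then recompute the means."""
    labels = [nearest(p, centroids) for p in points]
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        new_centroids.append(mean(members) if members else old)
    return labels, new_centroids


def run_to_convergence(points, centroids, max_iter=100):
    labels = None
    for _ in range(max_iter):
        new_labels, centroids = kmeans_step(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy 2-D descriptors (assumption); the start centroids are already converged.
points = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
before_labels, before_centroids = run_to_convergence(points, [(1.0, 0.0), (6.0, 0.0)])

# Add one new structure and continue clustering from the old centroids.
after_labels, after_centroids = run_to_convergence(points + [(12.0, 0.0)], before_centroids)
print(before_labels, after_labels[:4])
```

In this toy case the new point joins the second cluster, pulls its mean away, and the existing point (4.0, 0.0) moves to the first cluster, so a previously found cluster is modified.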
Ideally, this should work for
any clustering algorithm supported by jklustor. Basically, what we do
is to take jklustor output, identify few interesting clusters (say 1
and 4). Then later when we get a new data set and we want to check what
structures in this set belong to clusters 1 and 4.
Deciding whether a new item is more similar to a predefined subset of cluster representants/centroids or to the remaining ones is a trivial task, given the similarity/dissimilarity calculation method in use.
Are you interested in this latter scenario?
Regards,
Gabor
User ea0ddb6d13
14072011 12:53:37
gimre wrote: 
About deciding whether a new element belongs to any cluster or not: it would be possible to implement picking the nearest cluster representant or centroid for any further input structure. For some clustering methods it is also trivial to decide whether an input structure is a member of the picked cluster. In the case of the kmeans algorithm, however, it seems to be a nontrivial decision:
 There will always be one or more "nearest" centroid(s)
 Continuing the kmeans clustering (calculating new means, reassigning clusters) will probably modify one or more previously found clusters
 Deciding whether the newly assigned input structure belongs to the initially selected "nearest" cluster still seems possible by considering whether its introduction induced further reassignment
Ideally, this should work for
any clustering algorithm supported by jklustor. Basically, what we do
is to take jklustor output, identify few interesting clusters (say 1
and 4). Then later when we get a new data set and we want to check what
structures in this set belong to clusters 1 and 4.
Deciding whether a new item is more similar to a predefined subset of cluster representants/centroids or to the remaining ones is a trivial task, given the similarity/dissimilarity calculation method in use.
Are you interested in this latter scenario?

If I understand the problem correctly, then in the case of kmeans it is a question of finding the shortest distance between the new compound and the cluster centroids. This should be rather trivial, and there is no strict need to update the centroids. Basically, this seems to be the latter scenario.
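A minimal Python sketch of what I mean, assuming the centroids are available as numeric descriptor vectors and that Euclidean distance is the measure (both are assumptions, as are all names and values here; this is not JKlustor functionality):

```python
import math


def nearest_centroid(descriptor, centroids):
    """Return (cluster_id, distance) of the closest centroid; centroids stay fixed."""
    best_id = min(centroids, key=lambda cid: math.dist(descriptor, centroids[cid]))
    return best_id, math.dist(descriptor, centroids[best_id])


# Hypothetical centroids keyed by cluster id, taken from an earlier clustering run.
centroids = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (10.0, 0.0), 4: (0.0, 10.0)}
interesting = {1, 4}

# Hypothetical descriptors of compounds from a new data set.
new_compounds = {"A": (1.0, 1.0), "B": (6.0, 4.0), "C": (1.0, 9.0)}

hits = {}
for name, descriptor in new_compounds.items():
    cluster_id, _distance = nearest_centroid(descriptor, centroids)
    if cluster_id in interesting:
        hits[name] = cluster_id
print(hits)
```

Compounds whose nearest centroid is not one of the interesting clusters are simply left out, and no centroid is recalculated. `math.dist` needs Python 3.8 or newer.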
Geven
ChemAxon 8b644e6bf4
03082011 16:59:17
Dear Geven,
... there is no strict need to update centroids ...
Agreed, this way it is a trivial scenario. We will notify you when such functionality becomes available. Thanks for the suggestions.
Regards,
Gabor
User 68d678d290
16082011 15:53:27
Dear ChemAxon team,
I am trying to cluster a set with the Bemis-Murcko option. Normally I have never had issues with a command line like the one below:
jklustor c bm o wrclus:smiles:ULIBS.csv o wrmols:smiles:S_*.smiles o wrstat:full:ULIBS.stat ULIBS.smiles
However, for this particular set it produces no data and keeps printing the message shown at the end of my post.
What is wrong with my set? It was exported from an Instant JChem database.
Thank you very much in advance,
Lex
Message:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at chemaxon.clustering.backend.BemisMurcko$SimpleBMFrameworkReduction.removeSmallFragments(BemisMurcko.java:412)
at chemaxon.clustering.backend.BemisMurcko$SimpleBMFrameworkReduction.reduce(BemisMurcko.java:438)
at chemaxon.clustering.backend.BemisMurcko$1.processNewLeaf(BemisMurcko.java:116)
at chemaxon.clustering.backend.BemisMurcko$1.processNewLeaf(BemisMurcko.java:105)
at chemaxon.clustering.backend.HC$DefaultPreprocessorDispatcher.processNewNode(HC.java:268)
at chemaxon.clustering.backend.HC$DefaultPreprocessorDispatcher.processNewNode(HC.java:243)
at chemaxon.clustering.backend.MRTree.importFinished(MRTree.java:387)
at chemaxon.clustering.backend.StoreIO$DefaultAddMoleculeNode.addNode(StoreIO.java:364)
at chemaxon.clustering.backend.StoreIO.addMolecules(StoreIO.java:281)
at chemaxon.clustering.backend.MRTree.importStructures(MRTree.java:254)
at chemaxon.clustering.backend.HC.importMolecules(HC.java:223)
at chemaxon.clustering.backend.BM.run(BM.java:464)
at chemaxon.clustering.backend.BM.main(BM.java:88)
ChemAxon 8b644e6bf4
24082011 02:13:09
Dear Lex,
Sorry for the late answer, and thanks for the bug report. Is it possible that your input contains empty molecule(s)? Could you attach it if it is not confidential?
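As a quick check for empty records, something like the following Python sketch could be used. It assumes a plain SMILES file with one structure per line and the SMILES string as the first field; the function name is hypothetical, and it does not detect structures that are present but unparsable.

```python
def find_empty_records(lines):
    """Return 1-based numbers of lines whose SMILES field is missing."""
    empty = []
    for number, line in enumerate(lines, start=1):
        # A blank line, or a line starting with whitespace, has no SMILES field.
        if not line.strip() or line[:1].isspace():
            empty.append(number)
    return empty


# Toy input (assumption); for a real file pass the open file object instead,
# e.g. find_empty_records(open("ULIBS.smiles")).
sample = ["CCO ethanol\n", "\n", "c1ccccc1 benzene\n", "\tunnamed\n", "   \n"]
empty_lines = find_empty_records(sample)
print(empty_lines)
```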
Regards,
Gabor
User 68d678d290
24082011 13:24:55
Dear Gabor,
Actually I already found the source of this issue.
I am working with carboranes and other polyborane species, which are aromatic and also do not obey conventional valency rules.
However, SMILES notation does not support such classes of compounds. Strangely, using an SDF file as input caused a similar problem.
So I generated a twin set in which I replaced those borane groups with dummies.
Thank you very much for your attention to my problem.
Sincerely yours,
Lex