Missing molecule ID in jklustor

User 8139ea8dbd

13-05-2013 03:15:45

jklustor does not export the molecule IDs from the input, and fixing this has not been a high priority for quite some time now.  There are some workarounds, such as https://www.chemaxon.com/forum/ftopic8475.html; however, someone who prefers the command-line jklustor over the API to carry out the task probably prefers not to program.


The problem becomes more noticeable as I am trying to do "-c bm" using:
jklustor t.sdf -o wrmols:sdf:t_m.sdf -o wrclus:sdf:t_c.sdf -lfin -c bm
The compound ID in the input t.sdf file is lost in the output t_m.sdf file in this case, so even the workaround no longer works.  Now one has to do pairwise exact structure matching in order to get meaningful results, which seems like a lot of work just to use jklustor.


If we could keep either the input molecule ID or simply the order number in the input (ID1, ID2, etc.), jklustor would be much more user-friendly for people who choose the command-line version over the API.


 

ChemAxon 60613ab728

13-05-2013 09:34:35

Thank you very much for your suggestions. These are very important features that we want to provide very soon.


We have been working on improving id/structure and source handling in the jklustor command line. The improvements are expected to be out in the 6.1 release. We are going to handle identification of input structures and carry the original input SD file fields over to the output.


 

User 2347372188

20-10-2014 22:52:22

Hello.  Is it possible to recover the original compound ID from the input set?  The diverse compound selector is pretty much useless if one can't connect back to the original compound ID.  Thanks.


-&

User 2347372188

30-10-2014 16:45:24

I'm wondering if there has been any progress on fixing this.  It's really irritating that I have to merge back to the initial compound IDs after clustering.  It makes no sense to drop compound IDs and all SD properties post-clustering.  Please fix!


-&

User 8139ea8dbd

03-11-2014 03:37:31

It has been long enough that I gave up using jklustor a few years ago.  Since other users echo the original request, I will try to explain again why this is important to users who care to use jklustor.


Users' data files do not contain just one SMILES column; they typically comprise many other columns: compound ID, assay activities, etc.  So when they run jklustor, the ultimate goal is to merge the clustering results into their original data sheet, so that they can carry out additional post-clustering analyses.  If the output of jklustor does not contain the original compound ID (or row ID), users have to do an all-by-all full structure match to figure out the mapping. As stated in the first post, sometimes the output of jklustor does not even allow a reliable full structure match.  Then the only alternative, as we do now, is to use the API.
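
To illustrate the merge step described above, here is a minimal sketch in plain Java. It assumes the clustering output already carries a compound ID per structure, which is exactly what is being requested. The class name ClusterMerge, the tab-separated layout, and the "NA" placeholder for unclustered rows are hypothetical choices for illustration, not part of any ChemAxon tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterMerge {

    /**
     * Joins cluster labels onto the original rows by compound ID.
     * rows:     compound ID -> remaining columns of the original data sheet
     * clusters: compound ID -> cluster label from the clustering run
     * Rows without a cluster label get the placeholder "NA".
     */
    public static List<String> merge(Map<String, String> rows,
                                     Map<String, String> clusters) {
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            String label = clusters.getOrDefault(row.getKey(), "NA");
            merged.add(row.getKey() + "\t" + row.getValue() + "\t" + label);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical data sheet: ID -> SMILES and one assay value
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("CPD-1", "C1CCCCC1\t5.2");
        rows.put("CPD-2", "N#CCCC\t7.9");

        // Hypothetical clustering result keyed by the same IDs
        Map<String, String> clusters = new LinkedHashMap<>();
        clusters.put("CPD-1", "cluster_1");

        for (String line : merge(rows, clusters)) {
            System.out.println(line);
        }
    }
}
```

With an ID in the output this is a single pass over a map lookup; without one, the join key has to be reconstructed by structure matching.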

User 2347372188

06-11-2014 18:50:59

Since it appears that ChemAxon is not going to fix their command line utilities for clustering, I was wondering if someone could point me to the API for the MMDS algorithm (Maximum-Minimum Dissimilarity Selection)?  I can find all the old clustering algorithms in chemaxon.clustering.*, but I have no idea where the newer algorithms live.  Thanks.


-&

ChemAxon 5fc3e8d7d0

13-11-2014 15:46:01

Dear Steven,


Sorry for the very late answer.
Actually the MMDS algorithm is not part of the public API, but you can use the following code:


import chemaxon.clustering.calculations.SimpleDiverseSubsetSelection;
import chemaxon.clustering.calculations.impl.MMDS;
import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;

public class MmdsExample {
    public static void main( String[] args ) {
        try {
            SimpleDiverseSubsetSelection sel = new MMDS();
            sel.addMolecule( MolImporter.importMol( "C1CCCCC1" ) );
            sel.addMolecule( MolImporter.importMol( "CCCC1CCCCC1CCC" ) );
            sel.addMolecule( MolImporter.importMol( "C1CCC(CCCCCCC)CC1CCC" ) );
            sel.addMolecule( MolImporter.importMol( "N#CCCC" ) );
            sel.addMolecule( MolImporter.importMol( "N#CCCC(CCCCCCC)" ) );

            // Request a diverse subset of 2 molecules; the result is an array of indices
            int[] ids = sel.getDiverseSubsetIndices( 2 );
            for( int i = 0; i < ids.length; i++ ) {
                // Print each selected structure as SMILES together with its index
                System.out.println(sel.get(ids[i]).toFormat("smiles") + "\t" + ids[i] );
            }

        } catch ( MolFormatException e ) {
            System.err.println("Exception: " + e.getMessage());
        }
    }
}

There is also a way to use it from the command line:


cat molecules.smiles | java -cp jchem.jar chemaxon.clustering.calculations.impl.MMDS 20

We apologize for not having dealt with the compound ID issue so far, but our priority list has not let us work on it. We will notify you when the fix is released.


Best regards,
Laszlo 

User 2347372188

14-11-2014 21:27:52

Thanks for posting the code.  It worked for me.  I can finally select compounds without losing SD tags.  Huzzah!!
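
For anyone following along, here is a rough sketch of how the selected indices can be tied back to compound IDs. It is plain Java with no ChemAxon classes, and it rests on an assumption: that the indices returned by getDiverseSubsetIndices refer to the order in which molecules were added, so a list of IDs filled in the same order can be used as a lookup. The class and method names (IndexToId, idsForIndices) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexToId {

    /**
     * Maps selected indices back to compound IDs.
     * compoundIds must be in the same order the molecules were added
     * to the selector (assumption: returned indices follow insertion order).
     */
    public static List<String> idsForIndices(List<String> compoundIds, int[] selected) {
        List<String> result = new ArrayList<>();
        for (int index : selected) {
            if (index < 0 || index >= compoundIds.size()) {
                throw new IndexOutOfBoundsException("No compound ID for index " + index);
            }
            result.add(compoundIds.get(index));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical IDs, recorded while adding molecules one by one
        List<String> compoundIds = Arrays.asList("CPD-1", "CPD-2", "CPD-3", "CPD-4", "CPD-5");

        // Stand-in for the int[] returned by the selector
        int[] selected = {0, 3};

        for (String id : idsForIndices(compoundIds, selected)) {
            System.out.println(id);
        }
    }
}
```

The same lookup works for any per-molecule data (SD tags, assay values) kept in a list parallel to the input order.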


-&