How to propagate an ID tag in SDF via JKlustor

User cd46b9a398

07-12-2011 00:12:16

I am running JKlustor as follows:

jklustor P10635a.sdf -o wrclus:smi:fw.smi -o "wrmols:sdf:cluster_*.sdf" -c sphex:0.85

The file P10635a.sdf has 71 molecules, each tagged with an <ID> tag as is normal in an SDF file.

File is attached, but the problem seems to exist with any SD file I try.

The output clusters cluster_*.sdf do not contain this <ID> tag. I would like to propagate this <ID> tag into the clustered output.

Question: how do I do this please?

JKlustor identifies itself as v0.07. (Maybe that should have been my first clue?)

Thanks

John

User cd46b9a398

14-12-2011 22:05:11

Hi Guys -

Can I re-phrase my question: I want to have the identifiers of the molecules in the clustered results. Is there a way to do this please? At the moment, there does not appear to be any way to trace back from the clustered results to the input molecules.

Thanks

John

ChemAxon 8b644e6bf4

15-12-2011 00:53:51

Dear John,

Sorry for the late answer. Currently it is not possible to propagate input IDs or other properties in JKlustor. Implementing this functionality is in our plans, but it is not scheduled yet.

Using molconvert's canonical SMILES functionality and simple bash tools, a workaround can be constructed to assign input IDs to the generated cluster members. Overview:

Create a file containing the input structures in canonical SMILES format and their IDs, separated by white space (ChemAxon toolkits will interpret the IDs as molecule names this way).

For a selected cluster, convert the cluster members into canonical SMILES format and call grep for each member on the input file. This way the cluster members will be represented in SMILES format with their IDs attached.

Optionally, convert these files into SDF using molconvert. This way the input IDs will be propagated as structure names.

Details

molconvert smiles:q P10635a.sdf > tmp.smiles
will convert input structures to canonical SMILES format

cat P10635a.sdf | grep CHEMBL > ids.txt
creates a text file with IDs.

paste tmp.smiles ids.txt > input.smiles
creates the file described above
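
Because paste pairs the two files purely by line position, a record that contributes no ID line (or more than one) would silently shift every later ID. A minimal sanity check is sketched below. It assumes each record in the SD file ends with a $$$$ line and should yield exactly one line in ids.txt; the function name check_id_count is made up for this illustration.

```shell
#!/bin/bash
# Sketch: compare the number of SD records with the number of extracted IDs
# before running paste. check_id_count is a hypothetical helper name.
# $1 = SD file, $2 = file with one ID per line.
check_id_count() {
    local sdf="$1" ids="$2" nrec nids
    nrec=$(grep -c '^\$\$\$\$' "$sdf")   # one "$$$$" terminator per record
    nids=$(grep -c '' "$ids")            # number of lines in the ID file
    if [ "$nrec" -ne "$nids" ]; then
        echo "Mismatch: $nrec records in $sdf but $nids lines in $ids" >&2
        return 1
    fi
    echo "OK: $nrec records, $nids IDs"
}
```

Usage: check_id_count P10635a.sdf ids.txt && paste tmp.smiles ids.txt > input.smiles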

jklustor P10635a.sdf -o "wrmols:sdf:cluster_*.sdf" -c sphex:0.5 -v
creates an individual file for each cluster (note that -v is used to turn on verbose output, and a stricter dissimilarity threshold is used to create multiple clusters)

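for i in ` molconvert smiles:q cluster_5.sdf ` ; do grep -F "$i" input.smiles ; done > cluster_5.smiles
locates IDs for a specific cluster (input.smiles is the file created by the paste step; grep -F treats each SMILES as a fixed string rather than a regular expression)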
Output from this command looks like:
Fc1ccc(cc1F)-c1ccc(cc1)S(=O)(=O)Cc1ccc2CCNCCc2c1        CHEMBL1077991
Clc1ccc(cc1)-c1ccc(cn1)S(=O)(=O)Cc1ccc2CCNCCc2c1        CHEMBL1078267
Fc1ccc(cc1)-c1ccc(cn1)S(=O)(=O)Cc1ccc2CCNCCc2c1 CHEMBL1078268
Clc1ccc(cc1)-c1ccc(cc1)S(=O)(=O)Cc1ccc2CCNCCc2c1        CHEMBL1078344
Fc1ccc(cc1)-c1ccc(cc1)S(=O)(=O)Cc1ccc2CCNCCc2c1 CHEMBL1078345
CC(C)Oc1ccc(cn1)S(=O)(=O)Cc1ccc2CCNCCc2c1       CHEMBL1078461
O=S(=O)(Cc1ccc2CCNCCc2c1)c1ccc(nc1)N1CCCCC1     CHEMBL1078463
Fc1ccc(Nc2ccc(cn2)S(=O)(=O)Cc2ccc3CCNCCc3c2)cc1 CHEMBL1078464
Fc1ccc(Oc2ccc(cn2)S(=O)(=O)Cc2ccc3CCNCCc3c2)cc1 CHEMBL1078555

for j in cluster_*.sdf ; do echo "Processing $j" ; for i in ` molconvert smiles:q $j ` ; do grep -F "$i" input.smiles ; done > `basename $j .sdf`.smiles ; done
an outer loop can be used to process all clusters

If you have further questions please do not hesitate to ask them

regards,

Gabor

User cd46b9a398

19-01-2012 17:20:06

Hi Gabor

Thanks for your advice. I got your workaround to work!

I appreciate your support.

John

User 247c00dc1d

01-08-2012 13:28:02

Dear Gabor,

Has the problem with IDs in the output SDF file been fixed yet?

Or do I need to attach the IDs to the clustered files in the way you describe above?

Maybe there is a quicker way... I have to cluster more than 30 SDF files...

ChemAxon 8b644e6bf4

10-08-2012 15:57:28

Dear Igor,

Sorry, this is not solved yet.

The workaround above might be extended with an outer for loop iterating through all the SDF files that need clustering.
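
A rough sketch of such an outer loop follows. It only covers the clustering step and assumes jklustor is on the PATH; the jklustor options are copied from the command earlier in this thread. The function name cluster_each and the NAME.clusters directory layout are choices made for this illustration: each input gets its own directory so that the cluster_*.sdf files of one input cannot overwrite those of another.

```shell
#!/bin/bash
# Sketch: cluster several SD files, one working directory per input file.
# cluster_each and the "NAME.clusters" layout are hypothetical choices.
cluster_each() {
    local sdf name abs
    for sdf in "$@"; do
        name=$(basename "$sdf" .sdf)
        # Absolute path, so the input is still found after changing directory.
        abs="$(cd "$(dirname "$sdf")" && pwd)/$(basename "$sdf")"
        mkdir -p "$name.clusters" || return 1
        echo "Clustering $sdf into $name.clusters/"
        # Run in a subshell so the caller stays in its current directory.
        ( cd "$name.clusters" &&
          jklustor "$abs" -o "wrmols:sdf:cluster_*.sdf" -c sphex:0.5 ) || return 1
    done
}
```

Usage: cluster_each *.sdf
The ID lookup described above can then be run inside each NAME.clusters directory.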

The step "cat P10635a.sdf | grep CHEMBL > ids.txt" relies on all IDs in the SDF file sharing a common prefix (here CHEMBL).

Alternatively, a simple awk script can be used to extract an SDF property (here STRUCTURENAME; change the tag name to match your file):

grepsdfid.sh:

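#!/bin/bash

# Print the line that follows each "> <STRUCTURENAME>" header.
awk '
    BEGIN {
        nextNAME=0
    } {
        if ( $0 == "> <STRUCTURENAME>" ) {
            # property header found: the value is on the next line
            nextNAME=1
        } else if ( nextNAME != 0 ) {
            nextNAME=0
            print $0
        } else if ( $0 == "$$$$" ) {
            # end of record
            nextNAME=0
        }
    }
    '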

Usage:

cat input.sdf | grepsdfid.sh > ids.txt
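
If the tag name differs between files (the original question used an ID tag), the same idea can take the tag as an argument. The following is a sketch only; the function name sdf_tag is made up, and it assumes the property header is a line starting with ">" that contains the tag in angle brackets, with the value on the next line.

```shell
#!/bin/bash
# Sketch: print the first data line of a named SD property for each record.
# sdf_tag is a hypothetical helper. $1 = tag name, SD file on standard input.
sdf_tag() {
    awk -v tag="$1" '
        # End of record: forget any pending header.
        $0 == "$$$$" { want = 0; next }
        # Header such as "> <ID>", possibly with extra fields on the line.
        /^>/ && index($0, "<" tag ">") { want = 1; next }
        # The line right after the header holds the value.
        want { print; want = 0 }
    '
}
```

Usage: cat input.sdf | sdf_tag ID > ids.txt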

Regards,

Gabor