Mapping results from cluster back to smiles

User 749645d446

03-01-2012 11:00:59

I am trying to run a clustering operation over a SMILES file and then map the results back to the original file (using JChem version 5.5.1).

The input .smi file is in the format "CMPD_NAME SMILES"

Here is the command I am running:

'jklustor -c sphex:0.5 /home/input.smi -o wrmols:smiles:/home/output.smi -o wrstat:normal:/home/stat.txt'

The output.smi file is what I am really interested in. I would like to take the results it gives (GID and GSIZE) for each SMILES and map them back to my original SMILES file, so that it is in the format "CMPD_NAME SMILES GID GSIZE", for example. This could be done with a simple Python or Bash script.

However, the SMILES before and after clustering do not match (in most cases), making it impossible to see which SMILES has which cluster information. After attempting to patch the problem by running molconvert on the input and output SMILES files, I still cannot get a good match.

Is there a known workaround, or a way of changing the format or order of the output SMILES file, so that the information can be matched back?

Many Thanks,

Chris

ChemAxon 8b644e6bf4

04-01-2012 20:15:52

Dear Chris,

Sorry for the late answer.

The input .smi file is in the format "CMPD_NAME SMILES"

This can be problematic since we expect the lines to start with the SMILES content.

You can swap them with awk if CMPD_NAME does not contain whitespace:

cat input.smi | awk '{ print $2 " " $1 }' > input_cxn.smi
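If the rest of the processing ends up in Python rather than Bash, the same swap can be sketched there. The function name swap_name_and_smiles is only illustrative, and, like the awk one-liner, it assumes CMPD_NAME contains no whitespace:

```python
def swap_name_and_smiles(lines):
    """Turn "CMPD_NAME SMILES" lines into "SMILES CMPD_NAME" lines.

    Assumes the compound name contains no whitespace, so the first two
    whitespace-separated fields are the name and the SMILES.
    """
    swapped = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, smiles = fields[0], fields[1]
        swapped.append(f"{smiles} {name}")
    return swapped
```

The returned lines can then be written to input_cxn.smi, one per line.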

However the smiles post and pre clustering do not match (in most cases), making it impossible to see which smiles has which cluster information. After attempting to patch the problem using molconvert on input and output Smiles files, I cannot seem to get a good match.

Using molconvert to generate canonical SMILES can help. A workaround for a similar problem is described here: https://www.chemaxon.com/forum/ftopic8475.html

A possible outline for your workaround:

Ensure that input file lines begin with SMILES

Canonicalize the input SMILES (q ensures canonical SMILES, n appends the molecule name):
molconvert smiles:qn input_cxn.smi > input_cxn_canonical.smi

Create output.smi by running jklustor on the "swapped" input_cxn.smi (or input_cxn_canonical.smi) file. The output will not contain canonical SMILES.

Iterate through the lines of output.smi in Bash and, for each line:
- Get the SMILES, GID, GSIZE portions of the current line
- Canonicalize SMILES portion using molconvert
- Use grep to get the appropriate line from input_cxn_canonical.smi
- Get the original CMPD_NAME from the grepped line
- Compose desired format
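Since a Python script was mentioned as an option, the outline above could also be sketched in Python. This is only a rough sketch under several assumptions: the function names are made up for illustration; each line of output.smi is assumed to hold "SMILES GID GSIZE" as its first three whitespace-separated fields (please check this against your actual file); and molconvert is assumed to be on the PATH and to print one SMILES per input line, in the same order. The canonicalization is done in one batch instead of one molconvert call per line.

```python
import os
import subprocess
import tempfile


def canonicalize_with_molconvert(smiles_list):
    """Canonicalize SMILES in one batch with `molconvert smiles:q <file>`.

    Assumption: molconvert is on PATH and writes one SMILES per input
    line to stdout, preserving the input order.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False) as tmp:
        tmp.write("\n".join(smiles_list) + "\n")
    try:
        result = subprocess.run(
            ["molconvert", "smiles:q", tmp.name],
            capture_output=True, text=True, check=True,
        )
    finally:
        os.unlink(tmp.name)  # remove the temporary input file
    return [ln.split()[0] for ln in result.stdout.splitlines() if ln.strip()]


def map_clusters(canonical_input_lines, cluster_lines, canonicalize):
    """Return "CMPD_NAME SMILES GID GSIZE" rows.

    canonical_input_lines: lines of input_cxn_canonical.smi ("SMILES NAME").
    cluster_lines: lines of output.smi, assumed to be "SMILES GID GSIZE".
    canonicalize: callable mapping a list of SMILES to canonical SMILES.
    """
    # Look-up table: canonical SMILES -> original compound name.
    name_by_smiles = {}
    for line in canonical_input_lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            name_by_smiles[parts[0]] = parts[1].strip()

    # Keep only lines that have at least SMILES, GID and GSIZE fields.
    records = [ln.split() for ln in cluster_lines if ln.strip()]
    records = [rec for rec in records if len(rec) >= 3]
    canonical = canonicalize([rec[0] for rec in records])

    rows = []
    for rec, canon in zip(records, canonical):
        gid, gsize = rec[1], rec[2]
        name = name_by_smiles.get(canon, "UNMATCHED")
        rows.append(f"{name} {canon} {gid} {gsize}")
    return rows
```

To use it, read the lines of input_cxn_canonical.smi and output.smi into two lists and call map_clusters(input_lines, output_lines, canonicalize_with_molconvert). The rows contain the canonical SMILES rather than the original string. If two compounds share the same canonical SMILES, only the last name is kept in the look-up table, so duplicates would need separate handling.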

Regards,

Gabor