Help with bm utility please

User f50dadc210

17-09-2010 00:19:36

My goal is to cluster 300K compounds. I have the SMILES file. The command line I am using is:


bm -v -c sphex:0.1 input.csv -o wrmols:smiles:output.csv:idall


I have two questions.


First, if two SMILES have a Tanimoto score > 0.9, they should end up in the same cluster, so I use sphex:0.1 on the command line. Is that right?


Second, how can I get the cluster ID for the original SMILES in my input file? I found that the SMILES in the output file were standardized. Even weirder, two identical SMILES in the output file fall into two different clusters.


Thanks


Bin

ChemAxon 8b644e6bf4

28-09-2010 09:09:23

Dear Bin,


Sorry for the late answer.


First, if two SMILES have a Tanimoto score > 0.9, they should end up in the same cluster, so I use sphex:0.1 on the command line. Is that right?


Using 0.1 as the minimal dissimilarity ensures that any two structures selected as cluster centroids have a Tanimoto score below 0.9.
During clustering, structures are processed sequentially: each structure is assigned to the nearest cluster centroid, provided the minimal separation parameter allows it (in this case, a similarity score of 0.9 or better).
If no such centroid exists, the structure becomes a new centroid. Any two structures in a cluster are separated by no more than twice the given limit.

Note that two input structures can be assigned to different clusters even if their similarity score is arbitrarily high.


This simple process does not need to store the individual inputs, only the centroids.
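The sequential process described above can be sketched roughly as follows. This is only an illustration, not the actual bm implementation: fingerprints are modeled as sets of on-bit indices, and the names `tanimoto` and `sphere_exclusion` as well as the toy fingerprints are made up for the example.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sphere_exclusion(fingerprints, min_dissimilarity):
    """Sequential clustering sketch (assumed behavior, not the bm source).

    Each structure joins the most similar existing centroid if the
    similarity is at least 1 - min_dissimilarity; otherwise it becomes
    a new centroid. Only the centroids are kept in memory.
    """
    min_similarity = 1.0 - min_dissimilarity
    centroids = []
    assignment = []
    for fp in fingerprints:
        best_id, best_sim = None, None
        for cid, centroid in enumerate(centroids):
            sim = tanimoto(fp, centroid)
            if sim >= min_similarity and (best_sim is None or sim > best_sim):
                best_id, best_sim = cid, sim
        if best_id is None:
            centroids.append(fp)
            best_id = len(centroids) - 1
        assignment.append(best_id)
    return centroids, assignment


# Toy fingerprints (hypothetical data) with a looser limit of 0.4,
# i.e. a required similarity of at least 0.6.
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
C = frozenset({1, 2, 3, 4, 5})
D = frozenset({2, 3, 4, 5, 6})

centroids, assignment = sphere_exclusion([A, B, C, D], 0.4)
print(assignment)
print(tanimoto(C, D))
```

In this toy run C joins the cluster of A and D joins the cluster of B, although C and D are more similar to each other than the required 0.6. That is the effect mentioned in the note above.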


Even weirder, two identical SMILES in the output file fall into two different clusters.


It seems possible that a "better" centroid was established later for the structure in question. Optionally re-evaluating the cluster memberships after the centroids are established could help with this problem. This could be implemented in a future release; we will discuss it and get back to you soon.
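To illustrate how identical structures can end up in different clusters, and what a re-evaluation pass might look like, here is a small sketch. It is not a bm feature. The function `reassign` and the toy fingerprints (sets of on-bit indices) are assumptions made for the example, and the first-pass result is written out by hand for a required similarity of at least 0.6.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def reassign(fingerprints, centroids):
    """Hypothetical second pass: move every structure to its most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda cid: tanimoto(fp, centroids[cid]))
        for fp in fingerprints
    ]


A = frozenset({1, 2, 3, 4})
X = frozenset({1, 2, 3, 4, 5, 6})
B = frozenset({1, 2, 3, 4, 5, 6, 7})

# Input order: A, X, B, X (the same structure X occurs twice).
fingerprints = [A, X, B, X]

# Sequential pass, worked out by hand:
#   A       -> becomes centroid 0
#   first X -> similarity to A is 4/6, allowed, so cluster 0
#   B       -> similarity to A is 4/7, too low, so B becomes centroid 1
#   last X  -> similarity to B is 6/7, higher than to A, so cluster 1
centroids = [A, B]
first_pass = [0, 0, 1, 1]

second_pass = reassign(fingerprints, centroids)
print(first_pass)
print(second_pass)
```

After the second pass both copies of X share the cluster of B, because B is the closer centroid for either copy once all centroids are known.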


Second, how can I get the cluster ID for the original SMILES in my input file?


It is currently not possible to get the original SMILES back from the clustering process. As a workaround, on Linux or Windows with Cygwin you can use gawk to extract the group IDs from the output and merge them with the input. This relies on the output lines being in the same order as the input lines:

cat output.csv | gawk '{ getline smi < "input.csv" ; print smi " " $3 }'


(see http://www.cs.utah.edu/dept/old/texinfo/gawk/gawk_toc.html#SEC28 )
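If gawk is not at hand, the same line-by-line merge could be sketched in Python. The function name `merge_cluster_ids` and the sample lines are made up for illustration; as in the gawk command, the sketch assumes that the cluster ID is the third whitespace-separated field of each output line and that both files list the structures in the same order.

```python
def merge_cluster_ids(input_lines, output_lines, id_column=2):
    """Pair each original input line with the cluster ID from the output.

    id_column is zero-based, so 2 corresponds to $3 in the gawk command.
    Assumes one output line per input line, in the same order.
    """
    merged = []
    for original, clustered in zip(input_lines, output_lines):
        fields = clustered.split()
        merged.append(f"{original.rstrip()} {fields[id_column]}")
    return merged


# Made-up sample data; the real bm output layout may differ.
input_lines = ["C1=CC=CC=C1\n", "OCC\n"]
output_lines = ["c1ccccc1 1 7\n", "CCO 2 3\n"]

merged = merge_cluster_ids(input_lines, output_lines)
for line in merged:
    print(line)
```

With real data, the two lists can be replaced by open file objects for input.csv and output.csv, since the function only iterates over lines.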

If you have any further questions please do not hesitate to ask them.

Regards,
Gabor