retrieving substructures

User 550505e3de

03-08-2009 17:42:27

I need a command line utility that given a set of molecules will return the maximum common substructure. Any input/output file format such as SMILES, .mol, .sdf would be fine.  After searching the Chemaxon docs and forums, I have only found examples of substructure searching and highlighting, not returning the substructures themselves.  Can this be done with Marvin, JChem, or one of the other packages?


Thank you,


Corey

ChemAxon efa1591b5a

05-08-2009 08:59:03

Hi,


Have you tried libmcs? It's part of JChem and it identifies the MC(E)Ss of an input set given in a structure file (all common formats are supported).


The simplest command line:


libmcs molecules.sdf 


In this case the output is an SDFile written to the terminal (standard output). To write it to a file instead, use the -o outputfilename.sdf option.


Alternatively, SMILES output in a CSV file is also available with -o CSV output.csv (see libmcs -h for all available options).


In both cases, the output contains not only the MCSs, but an entire hierarchy around the MCSs (since libmcs performs a hierarchical clustering, but that's not relevant here).


At present there is no option to output only the MCSs, so you need to extract them from the output file. This is simple in both cases, though CSV is perhaps easier to process with other command line tools such as grep. In the CSV file the SMILES of each structure is followed by two "coordinates", the level and the position on that level, separated by commas. The level coordinate of the MCSs is 1.


Thus, simply filter for the pattern ",1," in the output file to find the MCSs.
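
As an alternative to a plain text match, the same filtering can be sketched in Python by comparing the level field directly. This is only a minimal sketch: it assumes each row has the layout SMILES, level, position as described above, the function name level_one_smiles is hypothetical, and the sample rows are made up for illustration rather than taken from real libmcs output.

```python
import csv
import io


def level_one_smiles(csv_text, level="1"):
    """Return the SMILES strings whose level field equals `level`.

    Assumed row layout (from the description above): SMILES, level, position.
    """
    hits = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or short rows; compare the second field as text.
        if len(row) >= 3 and row[1].strip() == level:
            hits.append(row[0].strip())
    return hits


# Made-up sample rows, NOT real libmcs output.
sample = "c1ccccc1,1,1\nc1ccccc1C,2,1\nCCO,1,2\nCCOC,2,11\n"
mcs = level_one_smiles(sample)
print(mcs)
```

For a real file, the text could be read with open("output.csv").read() and passed to the function. Comparing the field rather than searching for ",1," also avoids accidental matches if a row happens to carry additional columns.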


Please note that the MCS of a set of structures is not necessarily one single structure (unless you consider extremes like a single carbon atom).


Does this help at all?


Kind regards,


Miklos



User 550505e3de

06-08-2009 01:04:43

Sounds even better than we had hoped.  We'll definitely look at libmcs.


Thank you,


Corey