Uniquify or canonize Markush results?

User 677b9c22ff

04-11-2008 11:14:08


Would it make sense to uniquify (create canonical results from) a given Markush library enumeration if no stereochemistry is involved?

For the attached file (naphthalene with all eight substituent positions set to R1, and R1 = H, F, Cl, Br, I), cxcalc would generate 390625 results, but after aromatization and creating unique SMILES the number of true results would be 98125 molecules.
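The 98125 figure can be cross-checked without enumerating anything, using Burnside's lemma on the symmetry of the naphthalene skeleton. The sketch below is only an illustration and not part of cxcalc: the position numbering (0 to 7 for ring positions 1 to 8), the permutation table and the function names are assumptions made for this example, and stereochemistry is ignored.

```python
# Symmetry operations of naphthalene written as permutations of the eight
# substituent positions (index 0..7 stands for ring positions 1..8).
# This table is an assumption of the sketch, derived from the usual numbering.
NAPHTHALENE_SYMMETRIES = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # identity
    [3, 2, 1, 0, 7, 6, 5, 4],  # flip about the long axis: 1<->4, 2<->3, 5<->8, 6<->7
    [7, 6, 5, 4, 3, 2, 1, 0],  # flip about the short axis: 1<->8, 2<->7, 3<->6, 4<->5
    [4, 5, 6, 7, 0, 1, 2, 3],  # rotation by 180 degrees: 1<->5, 2<->6, 3<->7, 4<->8
]


def cycle_count(perm):
    """Return the number of cycles of a permutation given as a list."""
    seen = set()
    cycles = 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        i = start
        while i not in seen:
            seen.add(i)
            i = perm[i]
    return cycles


def burnside_count(symmetries, n_substituents):
    """Count substitution patterns that are distinct under the symmetries.

    A pattern is fixed by a permutation only if every cycle carries a single
    substituent, so each permutation fixes n_substituents ** cycles patterns.
    """
    total = sum(n_substituents ** cycle_count(p) for p in symmetries)
    return total // len(symmetries)


if __name__ == "__main__":
    print(5 ** 8)                                     # raw enumeration: 390625
    print(burnside_count(NAPHTHALENE_SYMMETRIES, 5))  # distinct molecules: 98125
```

The identity fixes all 5^8 patterns and each of the other three operations fixes 5^4, so (390625 + 3 * 625) / 4 = 98125, which agrees with the deduplicated result.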

cxcalc markushenumerationcount -m true napthlaene-markush-2.mol

cxcalc enumerations -f smiles napthlaene-markush-2.mol > naphthalene-h-f-cl-br-i-markush.smi

molconvert smiles:+a naphthalene-h-f-cl-br-i-markush.smi -o naphthalene-h-f-cl-br-i-markush-aroma.smi

Remove duplicates (2 options)

A) TEXTPAD sort and delete duplicate SMILES (6 seconds!)

B) Import into Instant-JChem with option: Remove duplicates (2,143s) on Core 2Duo 2GHz with slow HD.

Result with A) and B) = 98125 molecules.
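Option A can also be scripted. The following is a minimal sketch, assuming the aromatized file holds one SMILES per line (optionally followed by a name) and that identical molecules were written as identical strings; a plain text comparison will not detect duplicates otherwise. The function name is hypothetical.

```python
def unique_smiles(lines):
    """Return the sorted, de-duplicated SMILES strings found in `lines`.

    Only the first whitespace-separated field of each line is kept, so a
    trailing molecule name does not prevent two lines from matching.
    """
    unique = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            unique.add(fields[0])
    return sorted(unique)


if __name__ == "__main__":
    import sys

    # Usage: python dedup.py naphthalene-h-f-cl-br-i-markush-aroma.smi
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            result = unique_smiles(handle)
        print(len(result))
```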



Inspiration: Depth-First

ChemAxon a3d59b832c

06-11-2008 08:35:46

Hi Tobias,

Thanks for this exercise. Indeed, we currently recommend removing duplicates from the enumerated structures exactly the way you did it.

Let me start with a discussion of whether it is practical to fully enumerate huge Markush libraries. Most of the structures in such a library are very similar to each other, so a smaller number of representatives would be equally appropriate and more manageable for many practical applications. (We can also generate random library members.)

Furthermore, there are alternative methods for handling Markush structures, such as substructure search in Markush structures, which is available in JChem Base. (See more details here: http://www.chemaxon.com/product/markush_search.html )

Duplication control within enumerates:

Indeed, it would be possible to handle this symmetry problem during enumeration to some extent. It would be relatively easy to handle in this trivial case. However, the situation becomes non-trivial in more complex cases where different generic features are involved, or when only a subset of the variable domains produces identical structures (e.g. when one of the R1s is replaced by an atom list [O, Cl, Br]).
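To illustrate why mixed variable domains are harder, here is a brute-force sketch that canonicalizes each enumerated substitution pattern under the skeleton's symmetry and counts the distinct ones. It is not how the enumerator works internally; the permutation table, the treatment of substituents as opaque labels and all names are assumptions of this example.

```python
from itertools import product

# Naphthalene symmetries as permutations of the eight substituent positions
# (index 0..7 stands for ring positions 1..8); assumed for this sketch.
SYMMETRIES = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # identity
    [3, 2, 1, 0, 7, 6, 5, 4],  # flip about the long axis
    [7, 6, 5, 4, 3, 2, 1, 0],  # flip about the short axis
    [4, 5, 6, 7, 0, 1, 2, 3],  # rotation by 180 degrees
]


def canonical(assignment, symmetries):
    """Return the smallest symmetry image of a tuple of substituent labels."""
    return min(tuple(assignment[i] for i in perm) for perm in symmetries)


def count_unique(domains, symmetries):
    """Enumerate every combination and count the symmetry-distinct ones.

    `domains` holds one tuple of allowed labels per position, so positions
    may have different variable domains. Returns (enumerated, distinct).
    """
    seen = set()
    total = 0
    for assignment in product(*domains):
        total += 1
        seen.add(canonical(assignment, symmetries))
    return total, len(seen)


if __name__ == "__main__":
    # Uniform domains: every position may carry H or F.
    print(count_unique([("H", "F")] * 8, SYMMETRIES))  # (256, 76)

    # Mixed domains: one position uses a different atom list that only
    # partly overlaps with the others, so only some patterns are duplicated.
    mixed = [("O", "F")] + [("H", "F")] * 7
    print(count_unique(mixed, SYMMETRIES))
```

With uniform domains the duplicate count follows from a closed formula, but with mixed domains the set of enumerated patterns is no longer closed under the symmetry operations, so each pattern has to be canonicalized individually as above.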

In fact, patent Markush structures that describe huge libraries rarely contain this high number of duplicates. There may be some symmetrical groups, but those usually only affect a small fraction of enumerated structures.

For the above reasons, duplicate handling during enumeration is not a high priority for us; we would rather concentrate on further developments, such as introducing new generic features.

Best regards,


User 677b9c22ff

06-11-2008 20:08:37

Hi Szabolcs,

I agree this was a very symmetric example, hence the high number of redundant structures. I guess this would be something for an API example I could create. I also agree that "people" use the Markush generator in more practical approaches (which would include random creation).