ECFP Feature Lookup Degeneracy

User 03a78c950c

14-05-2014 05:02:11

Hi all,

When doing feature lookup (using the API), different identifiers that were generated for the same molecule can map to the exact same SMARTS string.

For example, using generatemd to produce an ECFP_6 (default configuration otherwise) for the SMILES string "[H][C@@](N1CCC2=C(C1)C=CS2)(C(=O)OC)C1=CC=CC=C1Cl"  produces, among others, the identifiers -1216914296 and -1216914295.  When you look those up using the feature lookup API, they both return the SMARTS string *~[#6](~*)~*

This is confusing to me because the short blurb explaining ECFP generation explicitly mentions a duplicate removal step: "the removal of multiple identifier representations of equivalent atom neighborhoods". If these aren't actually duplicates, then why do they both correspond to the same SMARTS string?

I checked whether the problem was in how the MolExporter class translates fragments to SMARTS strings, but that doesn't seem to be the case. Whatever degeneracy there is already exists in the Molecule objects returned by the ECFPFeatureLookup API.
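To make the degeneracy easy to spot, the lookup results can be grouped by SMARTS string. This is only a minimal sketch in plain Python: it assumes the identifier-to-SMARTS pairs have already been collected from the feature lookup into a dictionary, and the function name and the third entry are hypothetical.

```python
from collections import defaultdict

def find_degenerate_smarts(lookup):
    """Group identifiers by SMARTS and keep only SMARTS shared by several identifiers."""
    by_smarts = defaultdict(list)
    for identifier, smarts in lookup.items():
        by_smarts[smarts].append(identifier)
    return {s: sorted(ids) for s, ids in by_smarts.items() if len(ids) > 1}

# The first two pairs are the ones reported above; the third is made up for contrast.
lookup = {
    -1216914296: "*~[#6](~*)~*",
    -1216914295: "*~[#6](~*)~*",
    12345: "*~[#8]",  # hypothetical identifier and SMARTS
}

print(find_degenerate_smarts(lookup))
```

Any SMARTS that appears in the result is shared by more than one identifier and is worth inspecting atom by atom.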

Another thing I've noticed (which I think is a legitimate bug) is that the terminal atoms of the fragments returned by the feature lookup API are not true wildcards but "any atom except H". As a result, some of the fragments produced by generatemd and a subsequent feature lookup cannot be mapped back onto the molecule that produced them in a substructure search. In the end, I had to manually change the anything-but-H "wildcards" to true wildcards to get that to work.
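The workaround can be sketched at the string level. This assumes the "any atom except H" atoms show up in the exported SMARTS as the token `[!#1]` (standard SMARTS for "not hydrogen"); the token may be written differently in a given export, so it is left as a parameter. The function name is hypothetical.

```python
def relax_wildcards(smarts, not_h_token="[!#1]"):
    """Replace every 'any atom except H' token with the true SMARTS wildcard '*'."""
    return smarts.replace(not_h_token, "*")

print(relax_wildcards("[!#1]~[#6](~[!#1])~[!#1]"))  # prints *~[#6](~*)~*
```

Note that this relaxes every such atom in the pattern, not just the terminal ones, which is fine for these fragments but may be too permissive for other queries.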

Any help would be much appreciated. Thank you!


Marvin Version: 6.2.1

JChem Version: 6.1.5 (?)

ChemAxon 8b644e6bf4

18-07-2014 16:30:26

Dear Stefano,


Please note that the ECFP feature lookup generates SMARTS strings only as an approximate visualization. Two neighborhoods can differ in the atom features considered by the current ECFP configuration and still yield the same SMARTS. Please check the central atom of each feature and examine its neighborhood and the ECFP configuration.
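A toy illustration of how this can happen is given below. It is not the actual ECFP algorithm or the JChem API: the atom properties, the hashing, and the SMARTS rendering are all simplified assumptions. The point is only that an identifier built from every configured property can differ while a rendering that drops one property stays the same.

```python
import hashlib
from typing import NamedTuple

class CenterAtom(NamedTuple):
    atomic_number: int
    heavy_neighbors: int
    in_ring: bool  # assumed property that the SMARTS rendering ignores

def toy_identifier(atom):
    """Hash every property, so any difference changes the identifier."""
    return hashlib.sha1(repr(tuple(atom)).encode()).hexdigest()[:8]

def toy_smarts(atom):
    """Render only element and neighbor count (assumes at least two neighbors)."""
    core = "[#%d]" % atom.atomic_number
    return "*~" + core + "(~*)" * (atom.heavy_neighbors - 2) + "~*"

ring_carbon = CenterAtom(atomic_number=6, heavy_neighbors=3, in_ring=True)
chain_carbon = CenterAtom(atomic_number=6, heavy_neighbors=3, in_ring=False)

# Different identifiers, identical approximate SMARTS.
print(toy_identifier(ring_carbon) != toy_identifier(chain_carbon))
print(toy_smarts(ring_carbon), toy_smarts(chain_carbon))
```

In this sketch the two carbons are distinguished only by ring membership, which the simplified SMARTS does not express, so both render as the same pattern.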