When doing feature lookup through the API, different identifiers generated for the same molecule can map to exactly the same SMARTS string.
For example, using generatemd to produce an ECFP_6 (default configuration otherwise) for the SMILES string "[H][C@@](N1CCC2=C(C1)C=CS2)(C(=O)OC)C1=CC=CC=C1Cl" produces, among others, the identifiers -1216914296 and -1216914295. When I look those up using the feature lookup API, they both return the same SMARTS string: *~[#6](~*)~*
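To make the collision easy to spot across a whole fingerprint, here is a minimal Python sketch that groups identifiers by the SMARTS string they returned. It does not call the JChem API: the dictionary is filled in by hand with the two results quoted above, and the function names group_by_smarts and degenerate_groups are made up for illustration.

```python
from collections import defaultdict


def group_by_smarts(lookup):
    """Map each SMARTS string to the sorted list of identifiers that returned it."""
    groups = defaultdict(list)
    for identifier, smarts in lookup.items():
        groups[smarts].append(identifier)
    return {smarts: sorted(ids) for smarts, ids in groups.items()}


def degenerate_groups(lookup):
    """Keep only the SMARTS strings shared by more than one identifier."""
    return {
        smarts: ids
        for smarts, ids in group_by_smarts(lookup).items()
        if len(ids) > 1
    }


# Hand-copied from the lookup results described above (not fetched from the API).
lookup_results = {
    -1216914296: "*~[#6](~*)~*",
    -1216914295: "*~[#6](~*)~*",
}

for smarts, ids in degenerate_groups(lookup_results).items():
    print(smarts, ids)
```

Feeding in every identifier/SMARTS pair from a fingerprint would show whether this is a one-off or a general pattern.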
This is confusing to me because the little blurb explaining ECFP generation pretty explicitly mentions a duplicate removal step - "the removal of multiple identifier representations of equivalent atom neighborhoods". If these aren't actually duplicates, then why do they both correspond to the same SMARTS string?
I checked whether the problem was in how the MolExporter class translates fragments to SMARTS strings, but that doesn't seem to be the case. Whatever problem/degeneracy there is already exists in the Molecule objects returned by the ECFPFeatureLookup API.
Another thing I've noticed (which I think is a legitimate bug) is that the ends of the fragments returned by the feature lookup API are not true wildcards, but instead match any atom except H. As a result, some of the fragments produced by generatemd and a subsequent feature lookup cannot be mapped back onto the molecule that produced them in a substructure search. In the end, I had to go in and manually change the anything-but-H "wildcards" to true wildcards to get that to work.
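For reference, the manual fix amounts to something like the Python sketch below. It works on the SMARTS text only and rests on an assumption: that the anything-but-H atoms are written as [!#1] in the exported SMARTS. If the export writes them differently, the NON_H_ATOM constant needs changing. The function name relax_non_h_wildcards is hypothetical and not part of any ChemAxon API.

```python
# Assumption: "any atom except H" is spelled [!#1] in the exported SMARTS.
NON_H_ATOM = "[!#1]"
TRUE_WILDCARD = "*"


def relax_non_h_wildcards(smarts):
    """Replace every [!#1] atom in the SMARTS text with the true wildcard *.

    This is a plain text substitution, so it rewrites all occurrences,
    not only the atoms at the ends of the fragment.
    """
    return smarts.replace(NON_H_ATOM, TRUE_WILDCARD)


print(relax_non_h_wildcards("[!#1]~[#6](~[!#1])~[!#1]"))
```

A string without any [!#1] atoms passes through unchanged, so it is safe to run over every fragment before the substructure search.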
Any help would be much appreciated. Thank you!
Marvin Version: 6.2.1
JChem Version: 6.1.5 (?)