CF/ECFP Fingerprint representations

User 61e6d0ff7a

12-07-2014 16:58:10

I have a question regarding the different representations of the hashed fingerprints.

It was my understanding that using option -D in generatemd always outputs all hash codes present in the molecule, while using option -2 outputs a folded version of the fingerprint (specified with parameter -f).

For ECFP, generatemd behaves as I would expect: with option -D, the same list of hash codes is always produced (independent of the specified fingerprint length), while the bitstring representations differ.

But the situation is different for the CF descriptor: the total number of integer hash codes produced with option -D varies with the specified fingerprint length, although the output is the same when called multiple times with the same length parameter. Why does the fingerprint length influence the generated hash codes?

ChemAxon 8b644e6bf4

18-07-2014 16:25:16

Dear Florian,


For the chemical fingerprint, unlike for ECFP, the individual hashes of the features are not available. The decimal representation of the chemical fingerprint is the folded binary representation, packed into 32-bit integers (the feature hashes are folded during CF generation). This is why the decimal output changes with the fingerprint length.
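
To illustrate the point, here is a minimal Python sketch of folding followed by packing. It is not the actual CF implementation: the modulo folding, the bit order within each word, the function names and the feature hash values are all assumptions made up for the example. It only shows why integers obtained by packing a folded bit string depend on the chosen length, in both their count and their values.

```python
def fold_hashes(hashes, length):
    """Fold feature hashes into a bit list of the given length (assumed: modulo folding)."""
    bits = [0] * length
    for h in hashes:
        bits[h % length] = 1  # distinct hashes may collide on the same bit
    return bits


def pack_words(bits, word_size=32):
    """Pack a bit list into integers, word_size bits per integer (assumed: lowest bit first)."""
    words = []
    for start in range(0, len(bits), word_size):
        word = 0
        for offset, bit in enumerate(bits[start:start + word_size]):
            word |= bit << offset
        words.append(word)
    return words


# Hypothetical feature hashes, chosen only to show collisions at a short length.
feature_hashes = [7, 40, 71, 1031]

short = pack_words(fold_hashes(feature_hashes, 64))
wide = pack_words(fold_hashes(feature_hashes, 1024))

print(len(short), short)     # 2 integers for a 64-bit fingerprint
print(len(wide), wide[:3])   # 32 integers for a 1024-bit fingerprint
```

With a length of 64, three of the four made-up hashes fall on the same bit, so the packed integers differ from those obtained with a length of 1024, even though the input features are identical.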

I would like to note that in the new descriptors API we expose the folded binary representation by default.

Which representation (feature hashes or folded binary string) is relevant to your usage?