I'm not sure if this is the right board for this, so please redirect me if it isn't.
I am having some trouble interpreting the output from generatemd when I ask for fingerprints with counts and decimal output. My two concerns are (1) I don't get the same number of entries in every row and (2) I get negative counts.
I didn't think that either of these was a possible output, and I can't find documentation on what the expected output should be. Has anyone had similar issues?
I suppose you used ECFP fingerprints with decimal output, like this:
echo "CCCCCCCC" | ./generatemd c -k ECFP -D
-900470404 -887929887 -557513035 -544887768 -194534908 712699060 1068280288 1236888632
In this case the output is a list of the hash code of identified features (http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations):
(1) I don't get the same number of entries in every row and
These identifiers represent the circular neighborhoods found in the structure, so the number of entries varies from one structure to the next.
(2) I get negative counts.
These are hash codes of the neighborhoods, not counts, so negative values are normal.
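As a rough illustration of both points, one line of the decimal output can be read as a variable-length list of signed integers. The sketch below assumes one whitespace-separated line per structure, as in the octane example above; `parse_hash_line` is a hypothetical helper name, not part of generatemd.

```python
def parse_hash_line(line):
    """Split one line of decimal output into a list of integer hash codes."""
    return [int(token) for token in line.split()]

# Output line for octane, copied from the example above.
octane = ("-900470404 -887929887 -557513035 -544887768 "
          "-194534908 712699060 1068280288 1236888632")
codes = parse_hash_line(octane)

n_features = len(codes)                  # identifiers found for this structure
n_negative = sum(c < 0 for c in codes)   # identifiers that happen to be negative
print(n_features, n_negative)
```

A different structure would generally yield a different number of identifiers, and the sign of an identifier carries no chemical meaning.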
By using -2 instead of -D you can get a fixed-length, folded bit string representation:
echo "CCCCCCCC" | ./generatemd c -k ECFP -2
If you have further questions please feel free to ask them.
Thanks for the reply. I am aware that I could get the bit string representation and that is the first thing I tried. Now I am interested in getting a richer representation of structures by getting counts instead of bits.
After posting my first message I wondered whether what I'm getting are hashes and played around with the generatemd output a bit, but I'm still not 100% sure that I'm interpreting it correctly.
The first line after the header in the ecfp configuration xml file is: <Parameters Length="1024" Diameter="4" Counts="Yes"/>, but when I look at the output I get ~80,000 unique hash codes in my library of 75000 structures. I am surprised because I was expecting 1024 unique hashes, which I could easily tally for each structure in the output.
Am I misunderstanding what the -D output is?
Sorry for the late answer!
The first line after the header in the ecfp
configuration xml file is: <Parameters Length="1024" Diameter="4"
Counts="Yes"/>, but when I look at the output I get ~80,000 unique
hash codes in my library of 75000 structures. I am surprised because I
was expecting 1024 unique hashes, which I could easily tally for each
structure in the output.
The length parameter (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#parameters) is only considered for the folded bit string representation (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations). The number of unique hash codes for a library equals the number of unique circular, atom-centered neighborhoods found in the library with the given diameter.
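If a fixed-length count vector is wanted anyway, one option is to fold the hash codes yourself. The sketch below is only an assumption about how that could be done: it bins each code with a simple modulo, which is not confirmed to be the folding scheme generatemd uses for its bit strings, and `folded_counts` is a hypothetical helper name.

```python
def folded_counts(hash_codes, length=1024):
    """Tally hash codes into `length` bins using modulo folding (an assumption)."""
    counts = [0] * length
    for code in hash_codes:
        # Python's % returns a non-negative result for a positive modulus,
        # so negative hash codes also map to a valid bin.
        counts[code % length] += 1
    return counts

# Illustrative input: one code appears twice, another once.
sample = [-900470404, 712699060, 712699060]
vector = folded_counts(sample)
print(len(vector), sum(vector))
```

Different neighborhoods can land in the same bin, so folded counts lose information compared with the raw hash list.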
I am still slightly confused. If I wanted to put together a matrix from the fingerprint list that I get from ecfc, how do I decide which hashes to merge or to throw away? In other words, I'm wondering what ecfp does to pool/discard the hashes.
If I wanted to put together a matrix from the
fingerprint list that I get from ecfc, how do I decide which hashes to
merge or to throw away?
Could you please clarify this matrix?
In other words, I'm wondering what ecfp does to pool/discard the hashes.
There is no pooling of hashes in the implementation: in general, the same atom typing settings (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#atomprops) and the same JChem version should ensure that the same hash code is assigned to the same atomic neighborhood for every input on every run. (Note that the atomic neighborhood might be defined based on chemical properties instead of structural attributes.)
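Since the same neighborhood always gets the same code, one way to build a matrix without merging or discarding anything is to give every distinct hash code its own column. The sketch below assumes each structure is already available as a list of integer hash codes and that a code repeated within a list stands for a repeated feature; `count_matrix` is a hypothetical helper, and the input values are made up for illustration.

```python
from collections import Counter

def count_matrix(rows):
    """Build a structure-by-feature count matrix from lists of hash codes.

    Returns (columns, matrix): `columns` is the sorted list of distinct hash
    codes, and matrix[i][j] is how often columns[j] occurs in rows[i].
    """
    columns = sorted({code for row in rows for code in row})
    index = {code: j for j, code in enumerate(columns)}
    matrix = []
    for row in rows:
        counts = [0] * len(columns)
        for code, n in Counter(row).items():
            counts[index[code]] = n
        matrix.append(counts)
    return columns, matrix

# Two made-up structures sharing one feature.
rows = [
    [-900470404, 712699060, 712699060],
    [712699060, 1068280288],
]
columns, matrix = count_matrix(rows)
print(columns)
print(matrix)
```

For a library of the size mentioned above (about 75,000 structures and 80,000 distinct codes), a dense matrix would have roughly six billion cells, so a sparse representation is likely the more practical choice.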
I would like to note that it is possible to visualize the association between atomic neighborhoods and hash codes using the API (https://www.chemaxon.com/jchem/doc/dev/java/api/index.html?chemaxon/descriptors/ECFPFeatureLookup.html). The attached example code uses this class.