Fingerprint question

User 9e11ee2bf3

01-07-2013 15:01:43

Hello, 


I'm not sure if this is the right board for this, so please redirect me if it isn't. 


I am having some trouble interpreting the output from generatemd when I ask for fingerprints with counts and decimal output. My two concerns are (1) I don't get the same number of entries in every row and (2) I get negative counts. 


I didn't think that either of these was a possible output, and I can't find documentation on what the expected output should be. Has anyone had similar issues?


 


Cheers, 


Igor 

ChemAxon 8b644e6bf4

03-07-2013 14:42:52

Dear Igor,


 


I suppose you used ECFP fingerprints with decimal output like


echo "CCCCCCCC" | ./generatemd c -k ECFP -D
-900470404      -887929887      -557513035      -544887768      -194534908     712699060       1068280288      1236888632



In this case the output is a list of the hash codes of the identified features (http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations):


(1) I don't get the same number of entries in every row and 


These identifiers represent the circular neighborhoods found in the structure, so the number of entries depends on the molecule.


(2) I get negative counts. 


These values are not counts but hash codes of the neighborhoods, so negative numbers can occur.
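
As a rough illustration of how a -D row like the one above can be read, the following Python sketch parses one output line into integers. It assumes the values are signed 32-bit hash codes, which the range of the numbers suggests but which is not stated in this thread; the helper names parse_hash_row and as_unsigned32 are made up for the example.

```python
def parse_hash_row(line):
    """Split one whitespace-separated -D output row into integer hash codes."""
    return [int(token) for token in line.split()]


def as_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned (assumes 32-bit hashes)."""
    return value & 0xFFFFFFFF


# The row printed for octane ("CCCCCCCC") earlier in this post.
row = ("-900470404      -887929887      -557513035      -544887768      "
       "-194534908     712699060       1068280288      1236888632")
hashes = parse_hash_row(row)

print(len(hashes))                    # number of entries in this row
print([h for h in hashes if h < 0])   # negative entries are hash codes, not counts
print(as_unsigned32(hashes[0]))       # same bit pattern, viewed as unsigned
```

Because each row lists only the features found in that molecule, the length of the parsed list differs from row to row.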




By using -2 instead of -D you can get a fixed-length folded bit string representation:


echo "CCCCCCCC" | ./generatemd c -k ECFP -2
00000000|00000000|00000000|00000000|00000001|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00001000|10001000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000010|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|01000000|00000010|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|01000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|




 


If you have further questions please feel free to ask them.


 


Regards,


Gabor

User 9e11ee2bf3

03-07-2013 15:13:09

Thanks for the reply.  I am aware that I could get the bit string representation and that is the first thing I tried.  Now I am interested in getting a richer representation of structures by getting counts instead of bits. 


After posting my first message I wondered whether what I'm getting are hashes and played around with the generatemd output a bit, but I'm still not 100% sure that I'm interpreting it correctly. 


The first line after the header in the ecfp configuration xml file is: <Parameters Length="1024" Diameter="4" Counts="Yes"/>, but when I look at the output I get ~80,000 unique hash codes in my library of 75000 structures.  I am surprised because I was expecting 1024 unique hashes, which I could easily tally for each structure in the output. 


Am I misunderstanding what the -D output is?

ChemAxon 8b644e6bf4

15-07-2013 17:38:10

Dear Igor,


 


Sorry for the late answer!


The first line after the header in the ecfp 
configuration xml file is: <Parameters Length="1024" Diameter="4"
Counts="Yes"/>, but when I look at the output I get ~80,000 unique
hash codes in my library of 75000 structures.  I am surprised because I
was expecting 1024 unique hashes, which I could easily tally for each
structure in the output.



The length parameter (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#parameters) is only considered for the folded bit string representation (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations). The number of unique hash codes for a library equals the number of unique circular, atom-centered neighborhoods found in the library with the given diameter. 
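
To make the role of the length parameter concrete, here is a minimal Python sketch of folding an arbitrary number of hash codes into a fixed-length bit vector. Mapping each hash to a position with a modulo is a generic folding scheme chosen purely for illustration; this thread does not say how JChem folds the identifiers, and fold_hashes is a hypothetical helper.

```python
def fold_hashes(hashes, length=1024):
    """Fold integer hash codes into a fixed-length list of 0/1 bits.

    Generic modulo folding for illustration only; not necessarily the
    scheme JChem uses internally.
    """
    bits = [0] * length
    for h in hashes:
        # Python's % returns a non-negative result for a positive modulus,
        # so negative hash codes also map to a valid position.
        bits[h % length] = 1
    return bits


hashes = [-900470404, -887929887, -557513035, -544887768,
          -194534908, 712699060, 1068280288, 1236888632]
folded = fold_hashes(hashes, length=1024)

print(len(folded))   # always equals the chosen length
print(sum(folded))   # at most len(hashes); collisions can lower it
```

The list of hash codes itself is untouched by the folding step, which is why the number of distinct codes in a library can far exceed the length.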


Regards,


Gabor

User 9e11ee2bf3

21-07-2013 14:44:41

Hi Gabor, 


I am still slightly confused.  If I wanted to put together a matrix from the fingerprint list that I get from ecfc, how do I decide which hashes to merge or to throw away?  In other words, I'm wondering what ecfp does to pool/discard the hashes.


Igor

ChemAxon 8b644e6bf4

22-07-2013 09:19:10

Dear Igor,


 


If I wanted to put together a matrix from the 
fingerprint list that I get from ecfc, how do I decide which hashes to
merge or to throw away?



Could you please clarify what you mean by this matrix?


In other words, I'm wondering what ecfp does to pool/discard the hashes.


There is no pooling of hashes in the implementation. In general, using the same atom typing settings (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#atomprops) and the same JChem version should ensure that the same hash code is assigned to the same atomic neighborhood for every input on every run. (Note that the atomic neighborhood might be defined based on chemical properties instead of structural attributes.)
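
Since no hashes are pooled, one possible way to tabulate the -D output is to give every distinct hash code in the library its own column. The Python sketch below does this for rows that have already been parsed into lists of integers. It assumes that a feature found more than once shows up as a repeated hash code in its row, which this thread does not confirm; build_count_matrix and the sample rows are invented for the example.

```python
from collections import Counter


def build_count_matrix(rows):
    """Return (columns, matrix) for a list of per-molecule hash code lists.

    columns: sorted list of every distinct hash code seen in any row.
    matrix:  one list per molecule, holding how often each column's hash
             occurs in that molecule's row (0 if absent).
    """
    columns = sorted({h for row in rows for h in row})
    matrix = []
    for row in rows:
        counts = Counter(row)
        matrix.append([counts[h] for h in columns])
    return columns, matrix


# Hypothetical rows: made-up hash codes for three molecules.
rows = [
    [-5, 12, 12],
    [12, 40],
    [-5, -5, 40, 99],
]
columns, matrix = build_count_matrix(rows)

print(columns)
for line in matrix:
    print(line)
```

For a library with tens of thousands of distinct hash codes, such a matrix is wide and mostly zero, so a sparse representation may be a better fit than nested lists.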


I would like to note that it is possible to visualize the association between atomic neighborhoods and hash codes using the API (https://www.chemaxon.com/jchem/doc/dev/java/api/index.html?chemaxon/descriptors/ECFPFeatureLookup.html). The attached example code uses this class.


Regards,


Gabor