Technical Support Forum Index
Technical Support Forum
Access ChemAxon scientists and developers here. For registration and login issues contact website support.

Support Ticket System is replacing forum

This forum was converted into a searchable archive. You cannot add posts here any more. For support please use our new Ticket System.
Fingerprint question
Igor

Joined: 02 Apr 2012
Posts: 3

Posted: Mon Jul 01, 2013 4:01 pm    Post subject: Fingerprint question

Hello, 

I'm not sure if this is the right board for this, so please redirect me if it is not.

I am having some trouble interpreting the output from generatemd when I ask for fingerprints with counts and decimal output. My two concerns are (1) I don't get the same number of entries in every row and (2) I get negative counts.

I didn't think that either of these was a possible output, and I can't find documentation on what the expected output should be.  Has anyone had similar issues?

 

Cheers, 

Igor 

Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Wed Jul 03, 2013 3:42 pm

Dear Igor,

 

I suppose you used ECFP fingerprints with decimal output like

echo "CCCCCCCC" | ./generatemd c -k ECFP -D
-900470404      -887929887      -557513035      -544887768      -194534908     712699060       1068280288      1236888632


In this case the output is a list of the hash codes of the identified features (http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations):

"(1) I don't get the same number of entries in every row"

These identifiers represent the circular neighborhoods found in each structure, so the number of entries depends on the structure.

"(2) I get negative counts."

These are hash codes of the neighborhoods, not counts, so negative values are expected.
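A minimal Python sketch of how such a decimal row could be read, to make the point concrete. It assumes the row is whitespace-separated and that the values are signed 32-bit integers (the sample values above fit that range). The function name parse_decimal_row is hypothetical and is not part of generatemd or JChem.

```python
def parse_decimal_row(line):
    """Split one whitespace-separated output row into a list of ints."""
    return [int(token) for token in line.split()]


# Row copied from the generatemd example above.
sample = ("-900470404      -887929887      -557513035      -544887768      "
          "-194534908     712699060       1068280288      1236888632")

row = parse_decimal_row(sample)

# The row length is the number of identifiers reported for this structure,
# so it can differ from one structure to the next.
print(len(row))

# Assumption: the values are signed 32-bit integers. Masking with
# 0xFFFFFFFF gives the equivalent unsigned form if negative keys
# are inconvenient downstream.
unsigned = [h & 0xFFFFFFFF for h in row]
print(unsigned[0])
```

The masking step changes only how the same 32 bits are displayed; it does not merge or discard any identifier.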


By using -2 instead of -D you can get a fixed-length folded bit string representation:

echo "CCCCCCCC" | ./generatemd c -k ECFP -2
00000000|00000000|00000000|00000000|00000001|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00001000|10001000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000010|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|01000000|00000010|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|01000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|00000000|
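A short Python sketch of reading this kind of output, assuming the row consists of 8-character groups of 0 and 1 separated by "|". The positions it returns are simply character positions in the string; how they map onto the fingerprint's internal bit order is not stated in this thread. The function name parse_bit_string is hypothetical.

```python
def parse_bit_string(line):
    """Return the positions of the '1' characters in a '|'-separated row."""
    bits = line.strip().replace("|", "")
    if set(bits) - {"0", "1"}:
        raise ValueError("unexpected character in bit string")
    return [i for i, b in enumerate(bits) if b == "1"]


# Shortened, made-up sample in the same layout as the output above.
sample = "00000000|00000001|10001000|"
on = parse_bit_string(sample)
print(on)  # [15, 16, 20]
```

Every row parsed this way has the same total length, which is what makes the folded form convenient for fixed-width comparisons.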


 

If you have further questions please feel free to ask them.

 

Regards,

Gabor

Igor

Joined: 02 Apr 2012
Posts: 3

Posted: Wed Jul 03, 2013 4:13 pm    Post subject: Thanks Gabor

Thanks for the reply.  I am aware that I could get the bit string representation and that is the first thing I tried.  Now I am interested in getting a richer representation of structures by getting counts instead of bits. 

After posting my first message I wondered whether what I'm getting are hashes and played around with the generatemd output a bit, but I'm still not 100% sure that I'm interpreting it correctly.

The first line after the header in the ecfp configuration xml file is: <Parameters Length="1024" Diameter="4" Counts="Yes"/>, but when I look at the output I get ~80,000 unique hash codes in my library of 75,000 structures.  I am surprised because I was expecting 1024 unique hashes, which I could easily tally for each structure in the output.

Am I misunderstanding what the -D output is?

Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Mon Jul 15, 2013 6:38 pm

Dear Igor,

 

Sorry for the late answer!

"The first line after the header in the ecfp configuration xml file is: <Parameters Length="1024" Diameter="4" Counts="Yes"/>, but when I look at the output I get ~80,000 unique hash codes in my library of 75000 structures.  I am surprised because I was expecting 1024 unique hashes, which I could easily tally for each structure in the output."

The Length parameter (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#parameters) is only considered for the folded bit string representation (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#representations). The number of unique hash codes in a library equals the number of unique circular, atom-centered neighborhoods found in the library with the given diameter.
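To illustrate the difference, here is a small Python sketch. It assumes a generic modulo folding, which is a common way to fold hashes into a fixed length but is not confirmed here to be the scheme JChem uses. The function name fold and the tiny library are hypothetical.

```python
def fold(hashes, length):
    """Generic illustration: map each hash to a position in [0, length)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative hash codes also land inside the range.
    return {h % length for h in hashes}


# Hypothetical two-structure library; the values reuse hashes from the
# generatemd example earlier in the thread.
library = [
    [-900470404, 712699060],
    [712699060, 1068280288, 1236888632],
]

# Unfolded: the number of unique identifiers grows with the library.
unique_hashes = set().union(*library)

# Folded: positions can never exceed the chosen length.
folded_positions = set()
for row in library:
    folded_positions |= fold(row, 1024)

print(len(unique_hashes))
print(len(folded_positions) <= 1024)
```

Under this assumption, only the folded form is bounded by the length (1024 here); the raw identifier list has no such limit, which is consistent with seeing far more than 1024 unique hash codes across a large library.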

Regards,

Gabor

Igor

Joined: 02 Apr 2012
Posts: 3

Posted: Sun Jul 21, 2013 3:44 pm    Post subject: still confused

Hi Gabor, 

I am still slightly confused.  If I wanted to put together a matrix from the fingerprint list that I get from ecfc, how do I decide which hashes to merge or to throw away?  In other words, I'm wondering what ecfp does to pool/discard the hashes.

Igor

Gabor
ChemAxon personnel
Joined: 29 May 2005
Posts: 317

Posted: Mon Jul 22, 2013 10:19 am

Dear Igor,

 

"If I wanted to put together a matrix from the fingerprint list that I get from ecfc, how do I decide which hashes to merge or to throw away?"

Could you please clarify what you mean by this matrix?

"In other words, I'm wondering what does ecfp do to pool/discard the hashes?"

There is no pooling of hashes in the implementation. In general, using the same atom typing settings (see http://www.chemaxon.com/jchem/doc/user/ECFP.html#atomprops) and the same JChem version should ensure that the same hash code is assigned to the same atomic neighborhood for every input on every run. (Note that the atomic neighborhood might be defined based on chemical properties instead of structural attributes.)
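Since the same neighborhood should always receive the same hash code, one possible reading of the matrix question is a structure-by-hash count table with one column per unique hash code. The Python sketch below is written under that assumption. It also assumes that a hash repeated within a row means a repeated occurrence; whether generatemd emits repeats in the -D output is not stated in this thread. The function name count_matrix and the sample rows are hypothetical.

```python
from collections import Counter


def count_matrix(rows):
    """Build a dense count matrix with one column per unique hash code."""
    columns = sorted({h for row in rows for h in row})
    index = {h: j for j, h in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in rows]
    for i, row in enumerate(rows):
        # Counter tallies repeats; if a row has none, every count is 1.
        for h, n in Counter(row).items():
            matrix[i][index[h]] = n
    return columns, matrix


# Made-up hash lists for three structures.
rows = [[-5, 3, 3], [3, 7], [-5]]
columns, matrix = count_matrix(rows)
print(columns)  # [-5, 3, 7]
print(matrix)   # [[1, 2, 0], [0, 1, 1], [1, 0, 0]]
```

Nothing is merged or thrown away in this layout, so the number of columns equals the number of unique hash codes in the library. For a library with tens of thousands of unique codes, a sparse representation would be a more practical choice than the dense list of lists used here.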

I would like to note that it is possible to visualize the association between atomic neighborhoods and hash codes using the API (https://www.chemaxon.com/jchem/doc/dev/java/api/index.html?chemaxon/descriptors/ECFPFeatureLookup.html). The attached example code uses this class.

Regards,

Gabor
