Binary fingerprint & Ward clustering with Kelley index

User a8e6cc7b1c

13-01-2007 16:06:05

Hello,





I'd like to determine the optimal number of clusters using the Kelley index after clustering molecules with Ward's method. For this I tried to use the ChemAxon fingerprint of the molecules in binary form. The ward program gave me an error message saying that the input string (binary fingerprints) is not appropriate. However, when I ran the ward program with decimal fingerprints, everything went just fine.





For the command lines I used the ones in the examples section of ChemAxon's homepage. I took care of the fingerprint size too (I gave ward the same number as generatemd).





Is there any workaround for this problem? Or is there any way to convert binary fingerprints to decimal format? I really have to use the binary format for several reasons (e.g. to compare them with other fingerprints, and they are much easier to read).





Thank you for your help!





Gergely Zahoranszky

ChemAxon efa1591b5a

15-01-2007 09:55:22

Hi,





One way is to mimic a decimal fingerprint in which the only values are 0 and 1. To get this format, all you need to do is separate the 0s and 1s in your original binary string with spaces.


Does this help at all?
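The space-separation idea above can be sketched in a few lines of C. This is only an illustration, not a ChemAxon tool: the function name space_separate_bits is hypothetical, and it assumes the binary fingerprint is a plain string of '0' and '1' characters.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the bits of `bits` into `out`, separated by single spaces.
 * Characters other than '0' and '1' (e.g. a trailing newline) are skipped.
 * `out` must have room for at least 2 * strlen(bits) + 1 bytes.
 * Returns the number of bits written.
 * The function name is hypothetical, chosen for this sketch. */
size_t space_separate_bits(const char *bits, char *out)
{
    size_t n = 0;   /* bits written so far */
    size_t pos = 0; /* write position in out */

    assert(bits != NULL && out != NULL);
    for (; *bits != '\0'; ++bits) {
        if (*bits != '0' && *bits != '1')
            continue;           /* ignore anything that is not a bit */
        if (n > 0)
            out[pos++] = ' ';   /* separator before every bit but the first */
        out[pos++] = *bits;
        ++n;
    }
    out[pos] = '\0';
    return n;
}
```

For the input "0110" the output buffer holds "0 1 1 0", so every bit becomes its own value.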





Another possible solution is to write (or find and download) a small, simple program that can transform your bits to decimal numbers.





Cheers


Miklos

User a8e6cc7b1c

15-01-2007 14:01:22

Hi Miklos,





thank you for your help!





I tried to write a converter program, but I guess something is not entirely clear to me. I generated the fingerprints for the example (nci1000.sdf file) as follows:





./generatemd c nci.sdf -k CF -f 1024 -n 6 -2 -o nci.cf





for binary and





./generatemd c nci.sdf -k CF -f 1024 -n 6 -D -o nci.cf





for decimal.





I figured out that the conversion rule is to convert each 32-bit block into an integer. The strange thing is that the decimal file contains negative integers. So I modified my code to treat the first bit as a sign bit, so that only 31 bits are actually converted into a number. That worked perfectly for positive numbers but NOT for negative ones. I picked a negative value from the decimal fingerprint file of the example SDF file and checked the corresponding value in the binary fingerprint file. I put the binary number into a calculator program, but when I converted it to decimal the value was not identical to the number in the decimal file. I tried this back and forth with the same result: they don't match. I tried other binary-decimal pairs, but the result was the same. For positive values, however, the 1-bit sign plus 31-bit value conversion works perfectly; I checked this too.





So do you have a hint on how to convert a 32-bit binary block into a negative integer?





Thank you for your help!





Gergely Zahoranszky

ChemAxon efa1591b5a

16-01-2007 10:28:29

Hi, the leftmost bit (the most significant bit) is almost simply a sign bit, but not exactly ;-).


Binary numbers are written in the so-called two's complement form; you may find this link helpful: http://en.wikipedia.org/wiki/Two%27s_complement
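To illustrate why the "1 sign bit plus 31 value bits" reading matches for positive values but not for negative ones, here is a minimal C sketch that decodes the same 32-character block both ways. The function names are hypothetical; it assumes a C99 compiler (for stdint.h) and that the leftmost character is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Reads exactly 32 characters ('0'/'1', leftmost = most significant bit)
 * into an unsigned 32-bit pattern. */
static uint32_t bits_to_u32(const char *bits)
{
    uint32_t v = 0;
    int i;

    for (i = 0; i < 32; ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        v = (v << 1) | (uint32_t)(bits[i] - '0');
    }
    return v;
}

/* Two's complement: the top bit carries the weight -2^31, so a pattern
 * with the top bit set equals its unsigned value minus 2^32. */
int64_t as_twos_complement(const char *bits)
{
    uint32_t v = bits_to_u32(bits);
    return (v & 0x80000000u) ? (int64_t)v - 4294967296LL : (int64_t)v;
}

/* Sign and magnitude: the top bit only flips the sign of the lower
 * 31 bits. Shown for comparison with the two's complement reading. */
int64_t as_sign_magnitude(const char *bits)
{
    uint32_t v = bits_to_u32(bits);
    int64_t magnitude = (int64_t)(v & 0x7FFFFFFFu);
    return (v & 0x80000000u) ? -magnitude : magnitude;
}
```

For a block of 32 ones, as_twos_complement gives -1 while as_sign_magnitude gives -2147483647. For any block whose leftmost bit is 0, both functions return the same value, which is why the mismatch only shows up for negative numbers.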





Negative descriptor values should not cause any problem, so the key issue here is to do the conversion properly. Do you code in Java?





Regards,


Miklos

User a8e6cc7b1c

18-01-2007 17:28:23

Dear Miklos,





thank you for the link (http://en.wikipedia.org/wiki/Two%27s_complement) about the binary to decimal conversion. It gave me the exact solution. I would like to share the code that I wrote (I'm coding in ANSI C), which converts a ChemAxon binary fingerprint to a ChemAxon decimal fingerprint. I hope that some might find it useful.





The syntax is pretty easy:





./bin2decCAFP inputbinaryfpfile 1024 >outputfile





All of the arguments are mandatory:





inputbinaryfpfile: name of the input ChemAxon Fingerprint file (which is in binary form)


1024: length of the ChemAxon fingerprint in bits (can be any value that is a multiple of 32)





Without redirection, the output is written to standard output, so one can pipe it to e.g. the ward script. With the redirection shown above (works only on Linux / Unix), the result is written to a file (outputfile in the command line above).








The compilation is also easy:





g++ bin2decCAFP.c -o bin2decCAFP -Wall








the version of my compiler:





gcc version 4.1.2 20061007 (prerelease) (Debian 4.1.1-16)








(the compilation should be the same with gcc but I haven't checked it yet:


gcc bin2decCAFP.c -o bin2decCAFP -Wall)





You will find the code (one file) attached to this message.





Thank you for helping me!











Gergely Zahoranszky

ChemAxon efa1591b5a

19-01-2007 08:16:45

This is excellent, thank you.





I re-read your first message and that made me think: why do you use the fingerprint in binary form in ward? It is more straightforward to use the decimal form and feed that directly into ward.


I'm puzzled...


I believe I did not read your first message carefully enough, and I thought you were using another fingerprint, not the ChemAxon topological one.





Miklos

User a8e6cc7b1c

22-01-2007 08:32:03

Well, what drove me to this problem was that I was actually using a different fingerprint, not the ChemAxon one, but I had converted it to the ChemAxon binary form. Then I faced the problem that the ward program doesn't accept the ChemAxon binary fingerprint form. Now I'm able to compare the ChemAxon fingerprint and the one I am using, based on the clustering. That is why I needed the ward program.





Thank you for your excellent help!





Regards,





Gergely Zahoranszky

ChemAxon efa1591b5a

22-01-2007 11:26:42

So what about converting your original fingerprint to decimal numbers directly, without the use of the binary format?





Miklos

User a8e6cc7b1c

22-01-2007 11:37:36

Yes, you're absolutely right, but I didn't want to be selfish. This way I can probably help some other people, because once someone has ChemAxon fingerprints in binary format and changes her/his mind and wants to use the decimal form (e.g. for the ward clustering method), there's no need to regenerate the fingerprints in decimal form, since this converter will do it pretty quickly. Sorry, but I don't think I have time to write the decimal to binary converter, but maybe it's not even that important.





Best regards,





Gergely

ChemAxon efa1591b5a

22-01-2007 11:44:45

:-))





Many thanks for your contribution!!





Miklos