chemical fingerprints

User d028dca803

06-07-2006 14:56:33

I'm trying to generate fingerprints for a database of SDF files. I use





generatemd c in.sdf -k CF -o out.sdf -SI





but I can't understand the output: I get 32 positive and negative integers under the <CF> heading in my SDF file, e.g.





> <CF>
-681085131 1355961114 -1185883784 1552907635 1275972413 1572261944 -21933111 -1012494592 1608908285 -618988177 -1416929418 1136442875 -1916325891 1378672275 1594586247 -1531982300 128219050 -2109408995 909463155 -79417425 -29422979 -820622371 947641993 1833426562 -343951789 -2053341300 2000399123 -419893257 -1499057379 1607657127 -366202981 1402837040











I was expecting a binary string that I can use in a Tanimoto similarity search.





Any help appreciated,


Gareth

ChemAxon a3d59b832c

07-07-2006 07:19:51

Hi Gareth,





If you import the SDF into a JChem database, the database similarity search will use chemical hashed fingerprints with the Tanimoto formula. In this case, you don't have to bother handling the fingerprints yourself.





A colleague of mine will soon answer your question regarding the usage of command-line tools.





Best regards,


Szabolcs

ChemAxon efa1591b5a

07-07-2006 07:38:50

Hi Gareth,





You can use these decimal values in a similarity calculation. The decimal values represent the same fingerprint as the binary values, but in a more compact form that is easier to process. What you see in the SDfile under the <CF> tag is still a binary fingerprint, though it is not written in binary form.
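
To make the correspondence concrete, here is a minimal Python sketch of expanding such integers into a zero/one string. It assumes each value is a signed 32-bit word in two's complement, so that 32 values give 1024 bits. The bit order within a word and the order of the words are assumptions made for illustration, not something confirmed for generatemd, and `ints_to_bits` is a hypothetical helper name.

```python
def ints_to_bits(values, word_size=32):
    """Expand signed integers into one concatenated 0/1 string.

    Assumption: each value is a two's-complement word of `word_size` bits,
    written most significant bit first. The real layout may differ.
    """
    mask = (1 << word_size) - 1
    # Masking maps a negative value onto its unsigned two's-complement pattern.
    return "".join(format(v & mask, "0{}b".format(word_size)) for v in values)


# First two values taken from the <CF> field quoted earlier in the thread.
words = [-681085131, 1355961114]
bits = ints_to_bits(words)
print(len(bits), bits[:16])
```

Under these assumptions every integer contributes exactly 32 characters, so the full field of 32 integers would expand to a 1024-character string.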





If, for any reason, you really need binary values printed, specify the -2 flag on the command line. However, generatemd does not support binary strings in SDF, so you need to store your zero/one string data in a separate file.





Does this help at all?





Cheers,


Miklos

User d028dca803

07-07-2006 12:55:09

Thank you for the information.





So the binary fingerprint string is encoded in the integers listed under <CF>. I would like to score fingerprint similarity. If I had binary output, this would be a simple matter of scoring coincidences. How does one score similarity with the 32 integers?

ChemAxon efa1591b5a

07-07-2006 13:05:50

Hi, use screenmd. It can process the SDfile output of generatemd. screenmd can calculate Tanimoto among many other metrics.





If your input SDF is small (i.e. fewer than 10,000 structures), there is no need to generate and store the fingerprints in a preprocessing step, as screenmd can generate them on the fly when the Tanimoto coefficients are calculated.


However, if your file contains more than 10,000 structures, it is better to store the fingerprints in a file and process that file with screenmd.





If screenmd is not suitable for your needs and you really want binary strings, feel free to use the -2 option of generatemd: your output will contain the bit strings, one chemical fingerprint per line.
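
For anyone who prefers to score the integer form directly, the Tanimoto coefficient (bits set in both fingerprints divided by bits set in either) can be computed word by word with bitwise AND and OR, without expanding to strings. The following is a minimal Python sketch, assuming both fingerprints were generated with the same settings and that each integer is a 32-bit word. `tanimoto` is a hypothetical helper, not part of JChem, and returning 0.0 when neither fingerprint has any bit set is a convention chosen here.

```python
def tanimoto(fp_a, fp_b, word_size=32):
    """Tanimoto coefficient of two fingerprints given as lists of integers."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same number of words")
    mask = (1 << word_size) - 1
    common = 0  # bits set in both fingerprints
    either = 0  # bits set in at least one fingerprint
    for a, b in zip(fp_a, fp_b):
        # Mask so negative values are treated as unsigned bit patterns.
        a &= mask
        b &= mask
        common += bin(a & b).count("1")
        either += bin(a | b).count("1")
    # Assumed convention: two empty fingerprints score 0.0.
    return common / either if either else 0.0


# Toy example: 0b11 and 0b01 share one bit out of two set overall.
print(tanimoto([3], [1]))  # 0.5
```

Because only bit counts are used, the result does not depend on the order of bits inside a word, as long as both fingerprints use the same layout.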





HTH


Regards


Miklos