generatemd parameters needed

User 4bdb1b72f8

25-01-2005 17:51:34

I have a library of 120,000 compounds and I would like to determine the degree of similarity of each of these compounds to the other compounds in the same library. I have generated fingerprints using generatemd and then used compr to compare the resulting fingerprint file to itself. So far I have only tried generating chemical fingerprints (-k CF), and I have tried a few different options for fingerprint length (e.g., -f 1024).





After generating fingerprints in this manner I find that when I run compr I get dissimilarity coefficients of 0 for slightly different (not identical) compounds. Does this suggest that I need to generate fingerprints using different parameters so that these compounds can be distinguished? Can you suggest values for these parameters?





Thanks, dougb

ChemAxon efa1591b5a

26-01-2005 14:32:43

Yes, indeed, this indicates that the fingerprint is not rich enough in information. You may need to experiment with all three parameters: length, longest path (also called pattern length), and number of bits.


With 1024 bits, the longest path can be 7 or even 8, and the number of bits set either 3 or 4. There is no generic formula to determine which values are best; it really depends on your particular data set.


So try 1024, 7, 3 first.





Then try the -T option, which instructs generatemd to report statistics on the distribution of 0 and 1 bits in the generated fingerprints. Look at the average bit density; ideally it should be between 60% and 70%.


Increasing fingerprint length decreases this value, while increasing pattern length and bit count increases density.
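
To make the direction of these effects concrete, here is a minimal Python sketch. It is a toy model, not generatemd's actual algorithm: the function names, the SHA-256 hashing and the "path-N" pattern labels are all assumptions made for illustration. A longer pattern length is modeled simply as more patterns being enumerated per molecule.

```python
import hashlib

def toy_fingerprint(patterns, length, bits_per_pattern):
    """Return the set of bit positions switched on by the given patterns.

    Toy model only: each pattern sets `bits_per_pattern` hashed positions
    in a fingerprint of `length` bits. This is not ChemAxon's hashing.
    """
    bits = set()
    for pattern in patterns:
        for k in range(bits_per_pattern):
            digest = hashlib.sha256(f"{pattern}|{k}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % length)
    return bits

def bit_density(bits, length):
    """Fraction of the fingerprint's positions that are set to 1."""
    return len(bits) / length

# Hypothetical settings: (number of patterns, fingerprint length, bits per pattern).
settings = [(200, 512, 3), (200, 1024, 3), (200, 2048, 3),
            (200, 1024, 4), (400, 1024, 3)]
for n_patterns, length, bits_per_pattern in settings:
    patterns = [f"path-{i}" for i in range(n_patterns)]
    fp = toy_fingerprint(patterns, length, bits_per_pattern)
    print(f"patterns={n_patterns} length={length} bits={bits_per_pattern} "
          f"density={bit_density(fp, length):.1%}")
```

In this toy model the density falls as the length grows and rises with more patterns or more bits per pattern. For real data, rely on the statistics reported by -T rather than on a model like this.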





Also note that there is no guarantee that you will find a set of parameters which ensures that different compounds have different fingerprints. A fingerprint compresses information in a way that is not reversible; some information is unavoidably lost along the way.
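
A small Python sketch of how such a collision produces a dissimilarity of exactly 0. Everything here is assumed for illustration: the pattern strings are hypothetical, the hash is deliberately crude (it ignores atom order so that a collision is guaranteed), and the Tanimoto formula is used as one common dissimilarity measure, without claiming it is the metric compr applied in the runs above.

```python
def toy_bits(patterns, length=1024):
    """Map each pattern to one bit via a deliberately crude hash.

    The hash sums character codes, so it ignores atom order.
    Real fingerprint hashing differs; this only forces a collision.
    """
    return {sum(ord(ch) for ch in pattern) % length for pattern in patterns}

def tanimoto_dissimilarity(bits_a, bits_b):
    """1 - |A and B| / |A or B| for two sets of 'on' bit positions."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(bits_a & bits_b) / len(union)

# Two different (hypothetical) sets of path patterns.
patterns_a = {"C-C-O", "C-C-N"}
patterns_b = {"C-O-C", "C-N-C"}

fp_a = toy_bits(patterns_a)
fp_b = toy_bits(patterns_b)

print(patterns_a != patterns_b)              # True: the inputs differ
print(fp_a == fp_b)                          # True: the fingerprints do not
print(tanimoto_dissimilarity(fp_a, fp_b))    # 0.0
```

Once two compounds map to the same bit pattern, no choice of metric can separate them; only different fingerprint parameters or a different descriptor can.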





BTW: Did you consider using the BCUT descriptor family to assess the total similarity of your compound set? BCUTs are particularly suitable for this purpose - though chemical fingerprints should also do the trick.








Hope this helps.





Regards,


Miklos

User 204415f4a4

31-08-2005 14:02:21

Hello All,





To assess the chemical diversity of a molecular set, I am using compr with 1024-bit chemical fingerprints, a longest path of 7, and the number of bits set to 3. I also used JKlustor with the J-P algorithm. Surprisingly, the average dissimilarity values obtained by the two methods are clearly different (22% and 65%). The maximum dissimilarity was the same. My set contains 12653 molecules. When I tried smaller sets, the results were similar.





Could this have something to do with the parameters I used?





Thanks for your time.





Best regards,





IsI

ChemAxon efa1591b5a

01-09-2005 14:24:32

I managed to reproduce the same problem, which is probably related to an unknown bug in JKlustor. We'll investigate it and let you know when we have made progress.





Sorry for the inconvenience this problem has caused in your work.





Regards


Miklos

User f822a95708

22-03-2006 09:10:03

Hi Miklos,





I'm a little bit confused by the guidelines you provide here. In your post above you state that the ideal average bit density should be 60-70%. However, in another forum topic, "questions about structure caching" (21 May 2004), you state that it should be around 50% and should not exceed 60%.





Does this difference come from new experience, or does it have something to do with the application the fingerprints are going to be used for (clustering, structure searches, etc.)?





Sorry for being so critical, I'm just wondering.





Thanks,





Peter



ChemAxon efa1591b5a

22-03-2006 10:34:10

Hi Peter,





I reckon the confusion is caused by the fact that fingerprints are used for various purposes. One is the acceleration of structure search (this is related to structure caching), another is similarity calculation, or similarity search.


For substructure searching the bit density should not be too high to make sure that different structures are associated with different fingerprints, while for similarity search higher bit density is needed to better represent similar features.
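
One way to see why a lower density helps substructure search is to treat the fingerprint as a pre-screen: a target can only contain the query if every bit set in the query is also set in the target. The Python sketch below uses random bit sets rather than real fingerprints, and every name and number in it is an assumption chosen for illustration.

```python
import random

def random_fingerprint(length, density, rng):
    """Random set of 'on' bit positions with the given expected density."""
    return {i for i in range(length) if rng.random() < density}

def passes_screen(query_bits, target_bits):
    """A target survives the screen only if it has every query bit set."""
    return query_bits <= target_bits

def screen_pass_rate(density, length=1024, n_targets=500, query_size=10, seed=0):
    """Fraction of random targets that a random 10-bit query fails to reject."""
    rng = random.Random(seed)
    query = set(rng.sample(range(length), query_size))
    targets = [random_fingerprint(length, density, rng) for _ in range(n_targets)]
    return sum(passes_screen(query, t) for t in targets) / n_targets

for density in (0.3, 0.5, 0.8):
    rate = screen_pass_rate(density)
    print(f"density {density:.0%}: {rate:.1%} of targets pass the screen")
```

At high density most targets pass the screen by chance, so the fingerprint rejects few candidates before the expensive atom-by-atom match. Similarity calculations do not depend on this rejection step, which is consistent with the different rules of thumb for the two uses.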





These are just rules of thumb, and one needs to experiment with the actual data set in order to fine-tune the fingerprint parameters.





Does this help?





Regards


Miklos