I already saw some comments on this subject, but I would like to submit my case... just to make sure I don't do things completely wrong...
I have a database of 8081 compounds, and tested a number of parameter combinations in generatemd:
- Fingerprint length = 512, 1024, and 2048 bits (option -f)
- Maximum pattern length = 5, 6 and 7 (-n)
- Bits to be set for patterns = 2, 3 and 4 (-b)
The results are in the attached file:
- the first page is the number-of-bits-set analysis: I highlighted the average % in green when it is between 35% and 45%, and the maximum % when it is over 80%
- the second page is the density analysis of the acceptable cases from the previous analysis.
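For anyone who wants to repeat this kind of check, the number-of-bits-set statistics are easy to compute from the raw fingerprints. Here is a minimal sketch in pure Python, with randomly generated bit vectors standing in for real generatemd output (the 40% density and the set size are just illustrative assumptions):

```python
import random

def bits_set_stats(fingerprints, n_bits):
    """Average and maximum percentage of bits set across a fingerprint set."""
    counts = [sum(fp) for fp in fingerprints]
    avg_pct = 100.0 * sum(counts) / (len(counts) * n_bits)
    max_pct = 100.0 * max(counts) / n_bits
    return avg_pct, max_pct

# Stand-in data: real fingerprints would come from the generatemd output file.
random.seed(0)
N_BITS = 1024
fps = [[1 if random.random() < 0.4 else 0 for _ in range(N_BITS)]
       for _ in range(100)]

avg_pct, max_pct = bits_set_stats(fps, N_BITS)
print(f"average bits set: {avg_pct:.1f}%  maximum: {max_pct:.1f}%")
```

The same two numbers per parameter combination are what the green highlighting in the attached file is based on.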
I went with the 2048-7-4 parameters, since memory and time are not really an issue and I wanted as much information as possible. What do you think?
Thank you in advance for any insights...
Nicolas Saettel, Ph.D., Assistant Professor
very nice work indeed. Thanks for making it public on our forum!
My choice would be 1024/7/3, for the following reasons:
1. The average bit count is around 1/2 and the maximum about 2/3, which are believed (and experienced) to be the ideal values.
2. According to the statistics, 2048-bit fingerprints with the same number of bits set or more do not exhibit a clear advantage over the 1024-bit ones.
3. In the ChemAxon fingerprint only two bits per feature are strictly independent; the third and higher bits are derived from the second one by semi-random perturbation. Such noise proves to be rather useful, though too much noise is undesirable.
4. The density distribution in the 1024/7/3 case fits the normal distribution the closest (I did not carry out any statistical tests, just "visual inspection" :-))
5. Our own similar study also concluded that 1024/7/3 is the best parameter setting for virtual screening. (I know this is not too scientific though... :-))
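To make point 1 more concrete: in virtual screening these fingerprints are usually compared with the Tanimoto coefficient, and extreme bit densities (nearly empty or nearly full fingerprints) compress its useful range. A small sketch of the coefficient on plain 0/1 lists (this is the textbook formula, not ChemAxon's actual implementation):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints: |A & B| / |A | B|."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 1.0

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto(a, b))  # 3 common bits over a union of 5 -> 0.6
```

With average densities near 1/2, two unrelated fingerprints land well below 1.0, which leaves room to discriminate true similarity from background.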
Regarding your choice, I think it's as good as the one I picked. Whether it has any advantage over the shorter one in terms of information content or "information storage capacity" can be tested with further statistical methods, e.g. PCA or other dimension-reduction techniques such as NLM.
So, at this stage of the study, in my opinion, both choices are equally good. Depending on your particular application area you may need to plan further tests, e.g. clustering, to reveal any major difference.
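One cheap proxy for the "information storage capacity" question, short of a full PCA or NLM analysis, is the Shannon entropy of each bit column across the library: a position carries the most information when it is set in about half of the compounds. A stdlib-only sketch on hypothetical data:

```python
import math
import random

def column_entropy(fingerprints):
    """Shannon entropy (in bits) of each bit position across a fingerprint set."""
    n = len(fingerprints)
    entropies = []
    for col in zip(*fingerprints):
        p = sum(col) / n  # fraction of compounds with this bit set
        if p in (0.0, 1.0):
            h = 0.0       # a constant column carries no information
        else:
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        entropies.append(h)
    return entropies

# Stand-in data at roughly 40% density, in place of real fingerprints.
random.seed(1)
fps = [[1 if random.random() < 0.4 else 0 for _ in range(64)]
       for _ in range(200)]
h = column_entropy(fps)
print(f"mean column entropy: {sum(h) / len(h):.3f} bits (max possible: 1.0)")
```

Comparing this mean across the 512/1024/2048 settings would give a rough, model-free measure of how much of each fingerprint is actually being used.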
We would be glad to learn more about your research topic if possible.
P.S. Sorry that I did not meet our standard 24-hour response time...
Sorry for the delay, I was away on vacation + now teaching...
Thank you so much for such a detailed answer, it does help me a great deal!
I am following up with clustering analyses; as soon as I have all the information I will post it here so we can discuss it too, along with the goal of my research.
I plan on doing comparison tests with the two different parameter sets (mine and the one you suggested) when clustering, but in the meantime I did some further analysis on a bigger chemical library (yes, the chemists work hard...) with 1024-7-3:
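For the planned clustering comparison, one simple baseline before reaching for dedicated tools is a greedy leader-style (sphere-exclusion) clustering on Tanimoto similarity. Everything below is illustrative: the threshold, the toy random fingerprints, and the function names are all assumptions, not part of any ChemAxon tool.

```python
import random

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 1.0

def leader_cluster(fps, threshold):
    """Greedy leader clustering: each fingerprint joins the first cluster
    whose leader it matches with Tanimoto >= threshold, else founds a new one."""
    leaders, clusters = [], []
    for i, fp in enumerate(fps):
        for k, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                clusters[k].append(i)
                break
        else:
            leaders.append(fp)
            clusters.append([i])
    return clusters

# Stand-in data in place of the real 1024-7-3 fingerprints.
random.seed(2)
fps = [[1 if random.random() < 0.4 else 0 for _ in range(128)]
       for _ in range(50)]
clusters = leader_cluster(fps, threshold=0.55)
print(f"{len(clusters)} clusters from {len(fps)} fingerprints")
```

Running the same procedure with both parameter sets on the same library, and comparing cluster counts and sizes, would be one concrete way to see whether 2048-7-4 and 1024-7-3 actually behave differently.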