Test case - Optimizing generatemd parameters

User f52820d97e

26-04-2006 16:39:58

I already saw some comments on this subject, but I would like to submit my case... just to make sure I don't do things completely wrong...

I have a database of 8081 compounds, and tested a number of situations in the generatemd parameters:

The results are in the attached file:

I went with the 2048-7-4 parameters, since memory and time are not really an issue, and I wanted as much information as possible. What do you think?

Thank you in advance for any insights...

Nicolas

---

Nicolas Saettel, Ph.D., Assistant Professor

http://www.cermn.unicaen.fr

ChemAxon efa1591b5a

27-04-2006 19:48:13

Hi Nicolas,

very nice work indeed. Thanks for making it public on our forum!

My choice would be 1024/7/3, for the following reasons:

1. The average bit count is around 1/2 and the maximum about 2/3, which are believed (and, in our experience, found) to be the ideal values.

2. According to the statistics, 2048-bit fingerprints with the same or a larger number of bits set do not exhibit a clear advantage over the 1024-bit ones.

3. In the ChemAxon fingerprint only two bits per feature are strictly independent; the third and higher bits are derived from the second one by semi-random perturbation. Such noise proves to be rather useful, though too much noise is undesirable.

4. The density distribution in the 1024/7/3 case fits the normal distribution most closely (I did not carry out any statistical tests, just "visual inspection" :-) ).

5. Our own similar study also concluded that 1024/7/3 is the best parameter setting for virtual screening. (I know this is not too scientific though .... :-)
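As a rough illustration of the density figures in points 1 and 4, here is a minimal Python sketch that summarizes the fraction of bits set across a set of fingerprints. It is not ChemAxon code and does not call generatemd: the fingerprints are assumed to be available already as sets of on-bit positions, the function names `density` and `density_summary` are hypothetical, and reading "bit count" as "fraction of bits set" is my assumption.

```python
from statistics import mean

def density(on_bits, length):
    """Fraction of bits set in one fingerprint.

    on_bits: set of indices of the bits that are 1 (assumed input format)
    length:  total fingerprint length in bits, e.g. 1024 or 2048
    """
    return len(on_bits) / length

def density_summary(fingerprints, length):
    """Mean, minimum and maximum bit density over a collection of fingerprints."""
    densities = [density(fp, length) for fp in fingerprints]
    return {
        "mean": mean(densities),
        "min": min(densities),
        "max": max(densities),
    }

# Toy data: three made-up 8-bit fingerprints, not real compounds.
toy_fingerprints = [{0, 1, 2, 3}, {0, 2, 4, 6, 7}, {1, 5}]
summary = density_summary(toy_fingerprints, 8)
print(summary)
```

Running the same summary for each parameter set (for example 1024/7/3 and 2048/7/4) would give the mean and maximum densities to compare against the 1/2 and 2/3 targets mentioned above.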

Regarding your choice, I think it's as good as the one I picked. Whether it has any advantage over the shorter one in terms of information content or "information storage capacity" can be tested by further statistical methods, e.g. by PCA or other dimension reduction, NLM, etc.

So, at this stage of the study, in my opinion, both choices are equally good. Depending on your particular application area you may need to plan further tests, e.g. clustering, to reveal any major difference.

We would be glad to learn more about your research topic if possible.

Best regards,

Miklos

P.S. Sorry that I did not meet our 24-hour standard response time....

User f52820d97e

11-05-2006 16:41:01

Sorry for the delay, I was away on vacation + now teaching...

Thank you so much for such a detailed answer, it does help me a great deal!

I am following up with clustering analyses; as soon as I have all the information I will post the results here so we can discuss them too, along with the goal of my research.

Regards,

Nicolas

User f52820d97e

10-07-2006 17:26:43

I plan on doing comparison tests with the 2 different sets of parameters (mine and the one you suggested) when clustering, but in the meantime I did some further analysis on a bigger chemolibrary (yes, the chemists work hard...) with the 1024-7-3 setting:

http://www.chemaxon.com/forum/ftopic1669.html

Cheers,

Nicolas

ChemAxon efa1591b5a

11-07-2006 09:58:23

We are very interested in learning about your experimental results! Thanks for sharing your thoughts and results with the community.

See also http://www.chemaxon.com/forum/viewpost6981.html#6981.

Miklos