LogP Accuracy - JChem vs. EPI-Suite

User 677b9c22ff

26-10-2006 02:55:44

Hi,





As Alex requested more input, I will compare the logP accuracy of JChem 3.2 versus KOWWIN. The values are taken from the free EPA EPI Suite (http://www.epa.gov/opptintr/exposure/pubs/episuite.htm), which contains > 10k experimental logP values. The KOWWIN algorithm is based on Meylan, W.M. & Howard, P.H. (1995). Atom/fragment contribution method for estimating octanol–water partition coefficients. Journal of Pharmaceutical Sciences 84, 83–92.





KOWWIN is not the very best method, but it is reliable and, most importantly, free. ClogP (Biobyte) and ACDLogP cost big $$$-$$$$$ for academia, and unless you want enhanced results for zwitterionic molecules and error bars, you are fine with KOWWIN and JCHEM logP.





The following graphic shows the improvements for JCHEM from 3.14 to 3.2. It's also attached as a download.











I don't include the error values for x and y, because the outcome is pretty clear. JCHEM improved a lot but is still not better than KOWWIN (n=16,000).


ChemAxon Marvin 3.14 logP - R^2 = 0.7333


ChemAxon Marvin 3.2 logP - R^2 = 0.8032


KOWWIN logP - R^2 = 0.9532
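
For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of how an R^2 between experimental and predicted logP could be computed. It assumes R^2 means the squared Pearson correlation (as a spreadsheet trendline would report it); the post does not say exactly how the values above were obtained. The function name and the sample numbers are hypothetical, for illustration only.

```python
def r_squared(experimental, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    # Sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, predicted))
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical logP values, NOT taken from the EPI Suite dataset
experimental = [0.5, 1.2, 2.3, 3.1, 4.0]
predicted = [0.7, 1.0, 2.6, 2.9, 4.2]
print(f"R^2 = {r_squared(experimental, predicted):.4f}")
```

Note that this measure only captures how well the two columns correlate linearly; a predictor with a constant offset would still score a high R^2.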





Another issue for extremely large datasets is speed.


KOWWIN takes 9 seconds on this dataset.


JCHEM takes 2 minutes on this dataset, about 13 times slower.


This is not caused by Java itself (its disadvantage is only 0-20%).


So for extremely large datasets (>10^9 molecules) this is certainly an issue.
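
To put the timings in perspective, here is a rough back-of-the-envelope sketch. It assumes the timings refer to the n=16,000 set mentioned above and that runtime scales linearly with the number of molecules; both are assumptions, not measurements. The rounded timings (120 s versus 9 s) give a ratio of roughly 13.

```python
def seconds_for(n_target, n_measured, seconds_measured):
    """Linear extrapolation: assumes runtime is proportional to molecule count."""
    return seconds_measured * n_target / n_measured


N_MEASURED = 16_000   # dataset size quoted in the post (assumed to be the timed set)
TARGET = 10 ** 9      # "extremely large" dataset size from the post
TIMINGS = {"KOWWIN": 9, "JCHEM": 120}  # seconds, as reported (2 minutes = 120 s)

for name, secs in TIMINGS.items():
    days = seconds_for(TARGET, N_MEASURED, secs) / 86_400  # 86,400 seconds per day
    print(f"{name}: about {days:.1f} days for {TARGET:.0e} molecules")

print(f"speed ratio: {TIMINGS['JCHEM'] / TIMINGS['KOWWIN']:.1f}x")
```

On a single machine the difference is days versus months, which is why the per-molecule cost matters at this scale.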





Kind regards


Tobias Kind


http://fiehnlab.ucdavis.edu/staff/kind/

User 851ac690a0

26-10-2006 10:33:55

Hi,





Thank you for this test.





Do you have any information about the training set of KOWWIN?











Jozsi

User 677b9c22ff

26-10-2006 18:38:32

Jozsi wrote:
Hi,


Thank you for this test.


Do you have any information about the training set of KOWWIN?


Jozsi
Hi,


The values are from the Physprop database or the Logpstar list.


The KOWWIN Help file (C) EPA lists the following information:


http://www.epa.gov/oppt/exposure/pubs/episuitedl.htm





Code:
Appendix D illustrates the estimation accuracy of KOWWIN.  The statistical accuracy of the current 2464 compound training set is excellent; the correlation coefficient (r2) is 0.981, the standard deviation is 0.225 and the absolute mean error is 0.163.  However, to be effective an estimation method must be capable of making accurate predictions for chemicals not included in the training set.  Currently, KOWWIN has been tested on a validation dataset of 10,200 compounds.  The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log P are as follows:  n = 10200;   r2 = 0.94;  sd = 0.47;  me = 0.35.






Code:



Statistics Using SRC's Experimental Log P Database:


(n = number of compounds;  r = correlation coef.;  sd = standard deviation;  me = absolute mean error)








KOWWIN v1.63


Total:   n=12805;   r2=0.95; sd=0.435;  me=0.316


Training:  n=2474   r2=0.981   sd=0.22  me=0.16


Validation:n=10331  r2=0.94   sd=0.47   me=0.35





CLOGP for Windows  (v1.0)


Total:  n=11735(a)   r2=0.91    sd=0.59    me=0.384





CLOGP (UNIX version as reported by Leo, 1992)


Total:  n=7250   r2=0.96   sd=0.3


(using equation:  Log P = 0.914 CLOGP + 0.184) (b)


   





(a) Taken from the current database; the difference between the entire database (12686) and the number used (11616) is primarily due to "missing fragments" in the CLOGP program. BioByte's Internet website reports the following statistics for its starlist:  n=8942, r2=0.917, sd=0.482 using the equation: Log P = 0.876CLOGP + 0.307.
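
As a side note, here is a minimal sketch of how the sd and me figures quoted above could be computed, including an optional linear correction such as Log P = 0.914 CLOGP + 0.184. The help file does not define "sd" precisely; this sketch assumes it is the sample standard deviation of the residuals (experimental minus corrected prediction), and "me" is taken as the mean absolute residual. The function name and the sample values are hypothetical.

```python
from math import sqrt


def error_stats(experimental, predicted, slope=1.0, intercept=0.0):
    """Return (n, sd, me) after applying logP = slope * predicted + intercept.

    sd: sample standard deviation of the residuals (an assumed definition).
    me: mean absolute error of the residuals.
    """
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    corrected = [slope * p + intercept for p in predicted]
    residuals = [e - c for e, c in zip(experimental, corrected)]
    me = sum(abs(r) for r in residuals) / n
    mean_res = sum(residuals) / n
    sd = sqrt(sum((r - mean_res) ** 2 for r in residuals) / (n - 1))
    return n, sd, me


# Hypothetical values, for illustration only
experimental = [0.9, 2.1, 2.8, 4.2]
clogp = [1.0, 2.0, 3.0, 4.0]
print(error_stats(experimental, clogp))                # raw CLOGP values
print(error_stats(experimental, clogp, 0.914, 0.184))  # with the linear correction
```

A linear correction like this changes sd and me but leaves the squared correlation untouched, so r2 values with and without the equation are directly comparable while the error figures are not.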






However, all this stuff will shift anyway: people use, and will keep using, the Continuum Solvation Model (COSMO and COSMO-RS) or UNIFAC for the calculation of partition coefficients. This works not only for octanol/water (who wants that?) but for all possible solvents and temperatures.





Kind regards


Tobias Kind


http://fiehnlab.ucdavis.edu/staff/kind/

ChemAxon 43e6884a7a

26-10-2006 20:41:50

Tobias,





Thanks a lot for all this information, but to tell the truth it is not fair to compare prediction methods based on published data. Most of the developers of prediction tools select molecules into the training set from these publicly available databases to make sure that these charts look good.


When you test the prediction methods with in-house data you usually get much larger errors than those on these charts.


We have received a lot of very positive feedback from users who compared our method with others on in-house data.

User 677b9c22ff

31-10-2006 04:47:17

Ferenc wrote:
Tobias,





Thanks a lot for all this information, but to tell the truth it is not fair to compare prediction methods based on published data. Most of the developers of prediction tools select molecules into the training set from these publicly available databases to make sure that these charts look good.


Hi Ferenc and Jozsi,


The data for the validation set only gets worse for JCHEM because the dataset is now much smaller than before. As KOWWIN is freely available and is one of the de facto standards, I would not assume that somebody fakes the data. As I already wrote, logP itself is a rough estimate, and people would like to have partition coefficients in other solvents or in real phospholipids or membranes. The only reason it still exists is that there are a lot of experimental values and the other methods (COSMO) are only beginning to develop. The same is true for the Lipinski rule of five (of which logP is a part): it's "just a rule". Sometimes it works and sometimes not :-)





Now the regressions for the KOWWIN validation set are here:











The datasets can be downloaded from here (PDF); see the link ambit/data in the PDF.





Kind regards


Tobias

ChemAxon 43e6884a7a

31-10-2006 09:48:01

Tobias.


Let's not argue about the importance of octanol-water logP. Well, it is important for us because it generates significant revenue. :-)
Quote:
I would not assume that somebody fakes the data
I wouldn't either, you misunderstand something. Still, in-house databases usually contain structures that are not similar enough to the ones with published data. The training sets of the logP prediction methods are not independent enough from the publicly available data. As a result, the performance of the methods on in-house data is much worse than on public data. Companies who buy logP prediction software are aware of that so they compare the performance of the methods based on in-house data. We have many very significant users who had compared our method with others before choosing our software.

ChemAxon 43e6884a7a

03-11-2006 10:44:57

One more comment: our software calculates pKa during logP prediction to determine whether the molecule is zwitterionic. For zwitterionic molecules the calculation is more difficult because of the equilibria between the different ionic species, and as a result it is much slower. We are considering making this step optional, because it is only useful for a few percent of the structures.