Tuesday, June 12, 2007

Log loss or hinge loss?

Suppose you want to predict binary y given x. You fit a conditional probability model to data and form a classifier by thresholding at 0.5. How should you fit that distribution?

Traditionally, people do it by minimizing log loss on the data, which is equivalent to maximum likelihood estimation. With enough data and modelling freedom, this recovers the conditional distribution exactly. But we don't care about exact probabilities, only about which side of 0.5 they fall on, so in some sense it's doing too much work.

Additionally, log-loss minimization may sacrifice classification accuracy when doing so lets it model the probabilities better.
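The claim that log loss pushes the model toward the exact probabilities can be checked at a single x. This is a minimal sketch, not taken from the original notebook: the helper names `expected_log_loss` and `argmin_on_grid` are made up for illustration, and the minimizer is found by brute-force search over a grid of candidate predictions q.

```python
import math

def expected_log_loss(p, q):
    """Expected log loss at one x when the true P(y=1|x) is p and the model predicts q."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def argmin_on_grid(loss, p, steps=999):
    """Return the grid value q in (0, 1) that minimizes loss(p, q)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.001, 0.002, ..., 0.999
    return min(grid, key=lambda q: loss(p, q))

# The best prediction under log loss is the true probability itself,
# not merely something on the correct side of 0.5.
for p in (0.2, 0.55, 0.9):
    print(p, argmin_on_grid(expected_log_loss, p))
```

For each p the search lands on q = p, which is the sense in which log loss asks for more than a classifier needs.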

Here's an example. Consider predicting binary y from real-valued x. The 4 points give the 4 possible x values and their true probabilities. Model p(y=1|x) as 1/(1+Exp(f_n(x))), where f_n is any n'th degree polynomial.

Take n=2. Then minimizing log loss and thresholding at 1/2 produces the Bayes-optimal classifier.

However, for n=3, the model that minimizes log loss will have suboptimal decision rules for half the data.
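For readers who want to try this kind of experiment, here is a rough sketch in Python rather than Mathematica. It assumes the model form 1/(1+Exp(f(x))) from above and fits the polynomial coefficients by plain gradient descent on the expected log loss. The four points below are hypothetical values picked for illustration, not the ones from the original plot, so this does not reproduce the n=3 failure; it only shows the mechanics of fitting and comparing against the Bayes rule. The function names are also made up.

```python
import math

def predict(w, x):
    """p(y=1|x) = 1 / (1 + exp(f(x))), with f a polynomial whose coefficients are w (constant term first)."""
    f = sum(wk * x**k for k, wk in enumerate(w))
    return 1.0 / (1.0 + math.exp(f))

def fit_log_loss(xs, ps, degree, lr=0.1, steps=5000):
    """Minimize the expected log loss over the given points by gradient descent."""
    w = [0.0] * (degree + 1)
    for _ in range(steps):
        qs = [predict(w, x) for x in xs]
        # With q = 1/(1+exp(f)), the derivative of the log loss with respect to f is (p - q).
        grad = [sum((p - q) * x**k for x, p, q in zip(xs, ps, qs))
                for k in range(degree + 1)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w

def disagreements(w, xs, ps):
    """Count points where thresholding the model at 0.5 differs from the Bayes rule."""
    return sum((predict(w, x) > 0.5) != (p > 0.5) for x, p in zip(xs, ps))

# Hypothetical data: four x values and their true P(y=1|x).
xs = [-1.5, -0.5, 0.5, 1.5]
ps = [0.1, 0.4, 0.6, 0.9]

w = fit_log_loss(xs, ps, degree=1)
print(w, disagreements(w, xs, ps))
```

Swapping in other points and degrees lets you look for cases where the disagreement count is nonzero. Higher degrees or larger x values may need a smaller `lr` to keep gradient descent stable.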

Hinge loss is less sensitive to exact probabilities. In particular, the minimizer of hinge loss over probability densities is a function that returns 1 over the region where the true p(y=1|x) is greater than 0.5, and 0 otherwise. If we are fitting functions of the form above, then once the hinge-loss minimizer attains the minimum, adding extra degrees of freedom will never increase approximation error.
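The shape of the hinge-loss minimizer can be checked at a single x. This is a minimal sketch under one assumption that the post does not spell out: a predicted probability q is turned into a score 2q-1 in [-1, 1] before the hinge is applied. The helper names are made up for illustration.

```python
def expected_hinge_loss(p, q):
    """Expected hinge loss at one x when the true P(y=1|x) is p and the model predicts q."""
    score = 2 * q - 1  # assumption: map the probability q to a score in [-1, 1]
    loss_if_y1 = max(0.0, 1 - score)  # hinge for label +1
    loss_if_y0 = max(0.0, 1 + score)  # hinge for label -1
    return p * loss_if_y1 + (1 - p) * loss_if_y0

def argmin_on_grid(loss, p, steps=1000):
    """Return the grid value q in [0, 1] that minimizes loss(p, q)."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: loss(p, q))

# The best prediction jumps to 0 or 1 depending only on which side of 0.5 p lies.
for p in (0.2, 0.55, 0.9):
    print(p, argmin_on_grid(expected_hinge_loss, p))
```

Under that mapping the expected loss is linear in q, namely 2p + 2q(1-2p), so the minimum sits at q=1 when p>0.5 and at q=0 when p<0.5, regardless of how far p is from 0.5.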

Here's an example suggested by Olivier Bousquet. Suppose your decision boundaries are simple, but the actual probabilities are complicated. How well will hinge loss do compared to log loss? Consider the following conditional density

Now take conditional density functions of the same form as before and find the minimizers of both log loss and hinge loss. Hinge-loss minimization produces a Bayes-optimal model for all n>1.

When minimizing log loss, the approximation error starts to increase as the fitter tries to match the exact oscillations in the true probability density function and ends up overshooting.

Here's the plot of the area on which the log-loss minimizer produces a suboptimal decision rule
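One way to compute such an area numerically is to compare the thresholded model with the thresholded true density on a fine grid. The sketch below is only an illustration: `true_p` and `model_p` are hypothetical stand-ins (an oscillating density with a simple boundary at 0, and a logistic curve whose boundary is slightly off at 0.1), not the functions from the notebook.

```python
import math

def suboptimal_fraction(true_p, model_p, lo, hi, n=10000):
    """Fraction of [lo, hi] where thresholding model_p at 0.5 disagrees with the Bayes rule."""
    bad = 0
    for i in range(n):
        x = lo + (hi - lo) * (i + 0.5) / n  # midpoint of the i-th grid cell
        if (model_p(x) > 0.5) != (true_p(x) > 0.5):
            bad += 1
    return bad / n

def true_p(x):
    """Hypothetical density: the boundary is at x=0, but the probabilities oscillate."""
    return 0.5 + math.copysign(1.0, x) * (0.25 + 0.2 * math.sin(20 * x))

def model_p(x):
    """Hypothetical fitted model whose decision boundary sits at x=0.1 instead of 0."""
    return 1.0 / (1.0 + math.exp(-4 * (x - 0.1)))

print(suboptimal_fraction(true_p, model_p, -1.0, 1.0))
```

Here the model is wrong only on the interval between 0 and 0.1, so the printed fraction of [-1, 1] is 0.05.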

Mathematica notebook (web version)


Tia said...

First time reading this blog, thanks for sharing.