
Using cross-entropy for regression problems

Cross Validated Asked on November 2, 2021

I usually see a discussion of the following loss functions in the context of the following types of problems:

  • Cross entropy loss (KL divergence) for classification problems
  • MSE for regression problems

However, my understanding (see here) is that doing MLE estimation is equivalent to optimizing the negative log likelihood (NLL) which is equivalent to optimizing KL and thus the cross entropy.

So:

  • Why isn’t KL or CE used also for regression problems?
  • What’s the relationship between CE and MSE for regression? Are they one and the same loss under some circumstances?
  • If different, what’s the benefit of using MSE for regression instead?


2 Answers

In a regression problem you have pairs $(x_i, y_i)$ and some true model $q$ that characterizes the conditional distribution $q(y|x)$. Let's say you assume that your density is

$$f_\theta(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y-\mu_\theta(x))^2\right\}$$

and you fix $\sigma^2$ to some value.

The mean $\mu_\theta(x_i)$ is then modelled, e.g., via a neural network (or any other model).

Writing the empirical approximation to the cross entropy you get:

$$\sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$

$$= \sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) + \frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2$$

If we set, e.g., $\sigma^2 = 1$ (i.e. assume we know the variance; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:
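The two-output version mentioned in parentheses can be sketched as a loss function. A minimal numpy illustration, assuming (as a common parameterisation, not stated in the answer) that the second output is the log-variance so that $\sigma^2$ stays positive:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-sample negative log likelihood of y under N(mu, exp(log_var)).

    `mu` and `log_var` would be the two outputs of the network;
    predicting log-variance avoids a positivity constraint.
    """
    return 0.5 * (np.log(2 * np.pi) + log_var + (y - mu) ** 2 / np.exp(log_var))

# Sanity check: with log_var = 0 (i.e. sigma^2 = 1) the NLL is the squared
# error plus a constant, matching the derivation above.
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.5, 2.0, 2.5])
nll = gaussian_nll(y, mu, np.zeros_like(y))
assert np.allclose(nll - 0.5 * np.log(2 * np.pi), 0.5 * (y - mu) ** 2)
```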

$$= \sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi}}\right) + \frac{1}{2}(y_i-\mu_\theta(x_i))^2$$

Minimizing this is equivalent to minimizing the $L_2$ loss.

So we have seen that minimizing the cross entropy under an assumption of normality is equivalent to minimizing the $L_2$ loss.
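This equivalence is easy to verify numerically: since the NLL with $\sigma^2 = 1$ is just $\tfrac{1}{2}$ of the sum of squared errors plus a constant, both objectives are minimized by the same parameter. A small illustrative sketch (the constant-mean setup is my simplification, not from the answer), fitting $\mu$ by grid search:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)

# Candidate constant means mu (a stand-in for mu_theta(x_i)).
mus = np.linspace(0.0, 4.0, 401)

# Empirical cross entropy (Gaussian NLL with sigma^2 = 1) vs. the L2 loss.
nll = np.array([np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - m) ** 2) for m in mus])
sse = np.array([np.sum((y - m) ** 2) for m in mus])

# Both objectives pick the same minimizer (the grid point nearest the sample mean).
assert mus[np.argmin(nll)] == mus[np.argmin(sse)]
```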

Answered by Sebastian on November 2, 2021

The mean squared error is the cross-entropy between the data distribution $p^*(x)$ and your Gaussian model distribution $p_\theta$. Note that the standard MLE procedure is:

$$ \begin{align} \max_{\theta} \mathbb{E}_{x \sim p^*}[\log p_{\theta}(x)] &= \min_{\theta} \left(- \mathbb{E}_{x \sim p^*}[\log p_{\theta}(x)]\right)\\ &= \min_{\theta} H(p^* \Vert p_{\theta}) \\ &\approx \min_{\theta} \sum_i \frac{1}{2} \left(\Vert x_i - \theta_1\Vert^2/\theta_2^2 + \log 2 \pi \theta_2^2\right) \end{align} $$

where $H(p^* \Vert p_{\theta})$ denotes the CE and we use a Monte Carlo approximation to the expectation. And as you stated, this is equivalent to minimizing the KL divergence between the data distribution and your model distribution. Commonly the variance $\theta_2$ is fixed and drops out of the objective.
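The "drops out" point can be checked with a quick Monte Carlo sketch (illustrative names, not from the answer): the $\theta_1$ that minimizes the estimated cross entropy is the sample mean no matter which fixed $\theta_2^2$ we plug in.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=500)  # samples from the data distribution p*

def mc_cross_entropy(theta1, theta2_sq):
    """Monte Carlo estimate of H(p* || p_theta) for a univariate Gaussian model."""
    return np.mean(0.5 * ((x - theta1) ** 2 / theta2_sq + np.log(2 * np.pi * theta2_sq)))

# For any fixed variance theta2_sq, the minimizing theta1 is the sample mean.
grid = np.linspace(0.0, 6.0, 601)
for theta2_sq in (0.5, 1.0, 4.0):
    ce = np.array([mc_cross_entropy(t, theta2_sq) for t in grid])
    assert np.isclose(grid[np.argmin(ce)], x.mean(), atol=0.01)
```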

Some people get confused because certain textbooks introduce the cross-entropy in terms of the Bernoulli/Categorical distribution (almost all machine learning libraries are guilty of this!), but it applies more generally than the discrete setting.

Answered by Eweler on November 2, 2021
