
Using cross-entropy for regression problems

Cross Validated Asked on November 2, 2021

I usually see a discussion of the following loss functions in the context of the following types of problems:

  • Cross entropy loss (KL divergence) for classification problems
  • MSE for regression problems

However, my understanding (see here) is that doing MLE estimation is equivalent to optimizing the negative log likelihood (NLL) which is equivalent to optimizing KL and thus the cross entropy.

So:

  • Why isn’t KL or CE used also for regression problems?
  • What’s the relationship between CE and MSE for regression? Are they one and the same loss under some circumstances?
  • If different, what’s the benefit of using MSE for regression instead?


2 Answers

In a regression problem you have pairs $(x_i, y_i)$ and some true model $q$ that characterizes the conditional distribution $q(y|x)$. Let's say you assume that your density is

$$f_\theta(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y-\mu_\theta(x))^2\right\}$$

and you fix $\sigma^2$ to some value.

The mean $\mu_\theta(x_i)$ is then modelled, e.g., via a neural network (or any other model).

Writing the empirical approximation to the cross entropy you get:

$$\sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$

$$= \sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) + \frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2$$

If we set, e.g., $\sigma^2 = 1$ (i.e. assume we know the variance; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:
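The two-output version mentioned in parentheses can be sketched as a loss function. A minimal numpy illustration, assuming (as a common parameterisation, not stated in the answer) that the second output is the log-variance so that $\sigma^2$ stays positive:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-sample negative log likelihood of y under N(mu, exp(log_var)).

    `mu` and `log_var` would be the two outputs of the network;
    predicting log-variance avoids a positivity constraint.
    """
    return 0.5 * (np.log(2 * np.pi) + log_var + (y - mu) ** 2 / np.exp(log_var))

# Sanity check: with log_var = 0 (i.e. sigma^2 = 1) the NLL is the squared
# error plus a constant, matching the derivation above.
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.5, 2.0, 2.5])
nll = gaussian_nll(y, mu, np.zeros_like(y))
assert np.allclose(nll - 0.5 * np.log(2 * np.pi), 0.5 * (y - mu) ** 2)
```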

$$= \sum_{i=1}^n -\log\left( \frac{1}{\sqrt{2\pi}}\right) + \frac{1}{2}(y_i-\mu_\theta(x_i))^2$$

Minimizing this is equivalent to minimizing the $L_2$ loss.

So we have seen that minimizing the cross entropy under an assumption of normality is equivalent to minimizing the $L_2$ loss.
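This equivalence is easy to verify numerically: since the NLL with $\sigma^2 = 1$ is just $\tfrac{1}{2}$ of the sum of squared errors plus a constant, both objectives are minimized by the same parameter. A small illustrative sketch (the constant-mean setup is my simplification, not from the answer), fitting $\mu$ by grid search:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)

# Candidate constant means mu (a stand-in for mu_theta(x_i)).
mus = np.linspace(0.0, 4.0, 401)

# Empirical cross entropy (Gaussian NLL with sigma^2 = 1) vs. the L2 loss.
nll = np.array([np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (y - m) ** 2) for m in mus])
sse = np.array([np.sum((y - m) ** 2) for m in mus])

# Both objectives pick the same minimizer (the grid point nearest the sample mean).
assert mus[np.argmin(nll)] == mus[np.argmin(sse)]
```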

Answered by Sebastian on November 2, 2021

The mean squared error is the cross-entropy between the data distribution $p^*(x)$ and your Gaussian model distribution $p_\theta$. Note that the standard MLE procedure is:

$$ \begin{align} \max_{\theta} \mathbb{E}_{x \sim p^*}[\log p_{\theta}(x)] &= \min_{\theta} \left(- \mathbb{E}_{x \sim p^*}[\log p_{\theta}(x)]\right)\\ &= \min_{\theta} H(p^* \Vert p_{\theta}) \\ &\approx \min_{\theta} \sum_i \frac{1}{2} \left(\Vert x_i - \theta_1\Vert^2/\theta_2^2 + \log 2 \pi \theta_2^2\right) \end{align} $$

where $H(p^* \Vert p_{\theta})$ denotes the CE and we use a Monte Carlo approximation to the expectation. And as you stated, this is equivalent to minimizing the KL divergence between the data distribution and your model distribution. Commonly the variance $\theta_2$ is fixed and drops out of the objective.
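The "drops out" point can be checked with a quick Monte Carlo sketch (illustrative names, not from the answer): the $\theta_1$ that minimizes the estimated cross entropy is the sample mean no matter which fixed $\theta_2^2$ we plug in.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=500)  # samples from the data distribution p*

def mc_cross_entropy(theta1, theta2_sq):
    """Monte Carlo estimate of H(p* || p_theta) for a univariate Gaussian model."""
    return np.mean(0.5 * ((x - theta1) ** 2 / theta2_sq + np.log(2 * np.pi * theta2_sq)))

# For any fixed variance theta2_sq, the minimizing theta1 is the sample mean.
grid = np.linspace(0.0, 6.0, 601)
for theta2_sq in (0.5, 1.0, 4.0):
    ce = np.array([mc_cross_entropy(t, theta2_sq) for t in grid])
    assert np.isclose(grid[np.argmin(ce)], x.mean(), atol=0.01)
```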

Some people get confused because certain textbooks introduce the cross-entropy in terms of the Bernoulli/Categorical distribution (almost all machine learning libraries are guilty of this!), but it applies more generally than the discrete setting.

Answered by Eweler on November 2, 2021
