
In which cases is the categorical cross-entropy better than the mean squared error?

Artificial Intelligence · Asked on December 11, 2021

In my code, I usually use the mean squared error (MSE), but the TensorFlow tutorials always use the categorical cross-entropy (CCE). Is the CCE loss function better than MSE? Or is it better only in certain cases?

3 Answers

We sometimes see the binary cross-entropy (BCE) loss used for regression problems. This post gives my opinion on that practice.

The figure below plots BCE, $-t\log(x) - (1-t)\log(1-x)$, for several target values $t = 0.0, 0.1, \ldots, 0.5$. (The plots for $t > 0.5$ are mirror images of those for $t < 0.5$, so I omit them.)
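For reference, here is a minimal NumPy/Matplotlib sketch that reproduces these curves; keeping $x$ slightly away from 0 and 1 to avoid $\log(0)$ is my addition.

```python
import numpy as np
import matplotlib.pyplot as plt

def bce(x, t):
    """Binary cross-entropy -t*log(x) - (1-t)*log(1-x), element-wise."""
    return -t * np.log(x) - (1 - t) * np.log(1 - x)

x = np.linspace(0.001, 0.999, 500)  # predictions, kept away from 0 and 1
for t in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    plt.plot(x, bce(x, t), label=f"t = {t:.1f}")
plt.xlabel("prediction x")
plt.ylabel("BCE loss")
plt.legend()
plt.show()
```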

As you can see, the closer the target value $t$ is to the middle ($t = 0.5$), the flatter BCE is around its minimum ($x \sim t$). That is, BCE is less 'focal' when the target value is intermediate.

So BCE suits your purpose when the edge values ($t = 0$ and $t = 1$) are of special importance to you, but the difference between intermediate values ($t = 0.4$ versus $t = 0.5$, for example) matters less.

On the other hand, when every target value is equally important to you, BCE is not a good choice; another loss function, MSE for example, is better.

[Figure: the binary cross-entropy for several target values]

Note added: If you use BCE for regression problems, it is better to subtract $-t\log(t) - (1-t)\log(1-t)$ (the entropy of the target) from the original BCE expression, so that the loss becomes zero when the prediction coincides with the target, $x = t$. Since the subtracted term depends only on $t$, this does not affect backpropagation, but it makes the loss value convenient to monitor.
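A small sketch of this shifted loss (the function names are mine; the subtracted term is the entropy of the target, which makes the shifted loss the KL divergence between target and prediction):

```python
import numpy as np

def bce(x, t):
    return -t * np.log(x) - (1 - t) * np.log(1 - x)

def target_entropy(t):
    # -t*log(t) - (1-t)*log(1-t), clipped so that t in {0, 1} gives ~0
    t = np.clip(t, 1e-12, 1 - 1e-12)
    return -t * np.log(t) - (1 - t) * np.log(1 - t)

def shifted_bce(x, t):
    # Zero when x == t; the subtracted term depends only on t,
    # so the gradient with respect to x is unchanged.
    return bce(x, t) - target_entropy(t)

print(shifted_bce(0.3, 0.3))  # ~0.0
print(shifted_bce(0.9, 0.3))  # positive
```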

Note added: After I submitted this post, I came to think that we can tune how 'focal' the loss function is around its minimum by simply multiplying it by a factor that depends on the target value. For example, we can tune the BCE loss $L_{\rm BCE}(x,t)$ by replacing it with $f(t) \cdot L_{\rm BCE}(x,t)$, where $f(t)$ is whatever factor you want; it tunes how focal the loss is around its minimum for each target value $t$.
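For instance, a sketch of such a weighting (the particular factor below is an arbitrary choice of mine, used only to illustrate the idea):

```python
import numpy as np

def bce(x, t):
    return -t * np.log(x) - (1 - t) * np.log(1 - x)

def weighted_bce(x, t, f):
    # BCE scaled by a target-dependent factor f(t), which controls
    # how sharply the loss focuses around its minimum for each t.
    return f(t) * bce(x, t)

# Example factor: up-weight intermediate targets, where plain BCE is flattest.
f = lambda t: 1.0 + 4.0 * t * (1.0 - t)
print(weighted_bce(0.4, 0.5, f))
```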

Answered by Toru Kikuchi on December 11, 2021

In a classification problem, it is better to get a larger error, and a larger error slope, when we predict the wrong label.

As you can see in the graph, with cross-entropy you get a high error when the algorithm predicts a label incorrectly and a small error when the predicted label is close enough, so it helps us separate the predicted classes better.

[Figure: cross-entropy loss as a function of the prediction]
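A quick numerical illustration of this effect (the probability values are my own example):

```python
import numpy as np

p = 0.01   # predicted probability for the true class: a confidently wrong prediction
t = 1.0    # true label

ce = -t * np.log(p)    # cross-entropy: about 4.6, and its gradient -t/p is huge
mse = (t - p) ** 2     # squared error: about 0.98, bounded above by 1

print(f"CE = {ce:.2f}, MSE = {mse:.2f}")
```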

Answered by amin msh on December 11, 2021

As a rule of thumb, mean squared error (MSE) is more appropriate for regression problems, that is, problems where the output is a numerical value (i.e. a floating-point number or, in general, a real number). However, in principle, you can use the MSE for classification problems too (even though that may not be a good idea). MSE can be preceded by the sigmoid function, which outputs a number $p \in [0, 1]$, which can be interpreted as the probability of the input belonging to one of the classes, so the probability of the input belonging to the other class is $1 - p$.
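In Keras terms, this (admittedly unusual) sigmoid-plus-MSE setup would look roughly like the sketch below; the input dimension of 10 is assumed purely for illustration:

```python
import tensorflow as tf

# Sketch: a binary classifier ending in a sigmoid, trained with MSE.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # 10 input features, assumed
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs p in [0, 1]
])
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
```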

Similarly, cross-entropy (CE) is mainly used for classification problems, that is, problems where the output can belong to one of a discrete set of classes. The CE loss function is usually implemented separately for binary and multi-class classification problems: in the first case, it is called binary cross-entropy (BCE), and, in the second case, categorical cross-entropy (CCE). The CE requires its inputs to be probability distributions, so the CCE is usually preceded by a softmax function (so that the resulting vector represents a probability distribution), while the BCE is usually preceded by a sigmoid.
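A short sketch of these pairings in Keras (the 5-class output and the example values are assumed for illustration):

```python
import tensorflow as tf

# Binary classification: sigmoid output + binary cross-entropy.
bce = tf.keras.losses.BinaryCrossentropy()

# Multi-class classification: softmax output + categorical cross-entropy.
# (CategoricalCrossentropy expects one-hot labels; use
# SparseCategoricalCrossentropy for integer labels.)
cce = tf.keras.losses.CategoricalCrossentropy()

y_true = tf.constant([[0., 0., 1., 0., 0.]])       # one-hot label, 5 classes
y_pred = tf.constant([[0.1, 0.1, 0.6, 0.1, 0.1]])  # softmax-like output
print(cce(y_true, y_pred).numpy())                 # -log(0.6), about 0.51
```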

See also Why is mean squared error the cross-entropy between the empirical distribution and a Gaussian model? for more details about the relationship between the MSE and the cross-entropy. In case you use TensorFlow (TF) or Keras, see also How to choose cross-entropy loss in TensorFlow?, which gives you some guidelines for how to choose the appropriate TF implementation of the cross-entropy function for your (classification) problem. See also Should I use a categorical cross-entropy or binary cross-entropy loss for binary predictions? and Does the cross-entropy cost make sense in the context of regression?.

Answered by nbro on December 11, 2021
