
What are arguments against using the (log-)likelihood as a loss function?

Cross Validated Asked by Joel on February 21, 2021

Context: My goal is to fit a GEV distribution function to data $z$, where the location parameter is parametrised as a linear combination of predictor variables, $\mu(\vec{x}) = \mu_0 + \mu_1 x_1 + \dots$ (like the mean/location parameter in linear regression). However, the number of (potential) predictors $X$ is quite large, so I plan to apply $\ell_1$ (Lasso) regularisation to the respective parametrisation (cf. an earlier question).

Question: Since I (a) know/assume the functional form (a GEV) and (b) am trying to optimise not just the expectation $E(Z \mid X = x)$ but the full distribution, I assume it is fair to regularise the log-likelihood when fitting the distribution (several articles seem to support this approach: 2 3). However, I have never come across it in the literature. I assume this is because (a) it requires knowing/assuming a functional form (which one usually doesn't in statistical learning problems) and (b) the log-likelihood is more costly to compute than, for example, a squared-error loss. Is this correct, and/or are there further reasons for not using the likelihood for regularisation?
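For concreteness, here is a minimal sketch of the kind of fit I have in mind, using scipy's genextreme (whose shape parameter c equals $-\xi$ in the usual GEV convention); the toy data, the penalty weight lam, and the choice of optimiser are placeholders, not part of my actual setup:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import genextreme

    # Toy data (placeholders): z are block maxima, X holds the candidate predictors.
    rng = np.random.default_rng(0)
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    z = genextreme.rvs(c=-0.1, loc=1.0 + 0.5 * X[:, 0], scale=1.0, random_state=rng)

    def penalised_nll(theta, z, X, lam):
        # theta = [mu_0, mu_1..mu_p, log(sigma), xi]; only the predictor
        # coefficients mu_1..mu_p enter the L1 penalty.
        mu0, beta = theta[0], theta[1:1 + X.shape[1]]
        log_sigma, xi = theta[-2], theta[-1]
        loc = mu0 + X @ beta
        # Note: scipy's shape parameter c equals -xi in the usual GEV convention.
        nll = -np.sum(genextreme.logpdf(z, c=-xi, loc=loc, scale=np.exp(log_sigma)))
        return nll + lam * np.sum(np.abs(beta))

    # Nelder-Mead because the L1 term makes the objective non-smooth at zero.
    theta0 = np.zeros(p + 3)
    fit = minimize(penalised_nll, theta0, args=(z, X, 1.0),
                   method="Nelder-Mead", options={"maxiter": 20000})
    print("estimated [mu_0, mu_1..mu_p, log(sigma), xi]:", fit.x)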

One Answer

However, I have never come across it in the literature.

There is a lot of literature on fitting distributions. Think, for instance, of Pearson's method of moments and chi-squared test, which are already more than a hundred years old.

Fitting by optimising the likelihood, i.e. maximum likelihood estimation, is also a method that is (just) more than a hundred years old. In addition, Pearson's chi-squared test is being replaced by the G-test, which is based on the likelihood.

Regularised maximum likelihood methods are not uncommon either. Model selection methods that use criteria like BIC or AIC can be considered regularised likelihood regression (where the regularisation penalty is $\Vert \beta \Vert_0$).
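To make the connection explicit, with $k = \Vert \hat\beta \Vert_0$ counting the free coefficients, both criteria take the form of an $\ell_0$-penalised negative log-likelihood:

$$\mathrm{AIC} = -2 \log L(\hat\beta) + 2 \Vert \hat\beta \Vert_0, \qquad \mathrm{BIC} = -2 \log L(\hat\beta) + \Vert \hat\beta \Vert_0 \, \log n.$$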

So it might be a matter of terminology that you do not read much about regularised maximum likelihood methods. Another related concept is Bayesian regression: the maximum a posteriori (MAP) estimate can be considered a regularised maximum likelihood estimate.
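To make that correspondence concrete: the MAP estimate maximises the log-posterior, which splits into the log-likelihood plus a log-prior acting as the penalty, and an i.i.d. Laplace prior on the coefficients recovers exactly the $\ell_1$ (Lasso) penalty:

$$\hat\beta_{\text{MAP}} = \arg\max_\beta \left[ \log p(z \mid \beta) + \log p(\beta) \right], \qquad p(\beta_j) \propto e^{-\lambda |\beta_j|} \;\Rightarrow\; \hat\beta_{\text{MAP}} = \arg\max_\beta \left[ \log p(z \mid \beta) - \lambda \Vert \beta \Vert_1 \right].$$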

  • Your second reason, computational complexity, might also be why maximum likelihood is not always used. The same holds for unregularised regression, where a simple linear estimator, obtained by minimising the squared residuals, is often preferred.
  • The first reason might also be true. An advantage of least squares is indeed that it works independently of the underlying distribution.

Both of these are reasons why regularised likelihood may not be used often, but they are not reasons why you cannot find anything about it in the literature.

Answered by Sextus Empiricus on February 21, 2021
