
Why is logistic regression particularly prone to overfitting in high dimensions?

Cross Validated - Asked on November 26, 2021

Why does "the asymptotic nature of logistic regression" make it particularly prone to overfitting in high dimensions? (source):

[screenshot of the quoted passage from the source; the text is quoted in full in an answer below]

I understand that the LogLoss (cross entropy) grows quickly as the predicted probability $y'$ approaches $1-y$, the opposite of the true label.

But why does that imply that "the asymptotic nature of logistic regression would keep driving the loss towards 0 in high dimensions without regularization"?

In my mind, just because the loss can grow quickly (if we get very close to the completely wrong answer), it doesn't mean that it would thus try to fully interpolate the data. If anything, the optimizer would avoid entering the asymptotic (fast-growing) part of the loss as aggressively as it can.

8 Answers

It seems to me that the answer is much simpler than what has been described so elegantly with others' answers. Overfitting increases when the sample size decreases. Overfitting is a function of the effective sample size. Overfitting is minimal for a given apparent sample size when Y is continuous, i.e., has highest information content. A binary Y with cell probabilities of 0.5 0.5 has lower information than a continuous variable and results in more overfitting because of the lower effective sample size. Y with probabilities 0.01 0.99 results in worse overfitting because of an even lower effective sample size. The effective sample size is proportional to min(a, b) where a and b are the two cell frequencies in the sample data. For continuous Y the effective and actual sample sizes are the same. This is covered in https://hbiostat.org/rms
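A rough simulation sketch of this point (my own toy setup, not from the answer above; it uses pure-noise predictors so any apparent fit is overfitting): with the same number of observations and predictors, the gap between apparent (training) and out-of-sample log-likelihood per observation grows as the outcome gets more imbalanced, i.e. as min(a, b) shrinks.

set.seed(3)
optimism <- function(prev, n = 300, p = 20, reps = 30) {
  mean(replicate(reps, {
    X  <- matrix(rnorm(2 * n * p), 2 * n, p)   # pure-noise predictors
    y  <- rbinom(2 * n, 1, prev)               # binary Y with prevalence `prev`
    tr <- 1:n; te <- (n + 1):(2 * n)
    # suppressWarnings: the rare-outcome fits often flag quasi-separation
    fit <- suppressWarnings(glm(y[tr] ~ X[tr, ], family = binomial))
    loglik <- function(idx) {
      pr <- plogis(cbind(1, X[idx, ]) %*% coef(fit))
      pr <- pmin(pmax(pr, 1e-12), 1 - 1e-12)
      mean(y[idx] * log(pr) + (1 - y[idx]) * log(1 - pr))
    }
    loglik(tr) - loglik(te)                    # apparent minus out-of-sample fit
  }))
}
c(balanced = optimism(0.5), rare = optimism(0.05))
# the optimism (overfitting) is typically much larger for the rare outcome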

Answered by Frank Harrell on November 26, 2021

The overfitting nature of logistic regression is related to the curse of dimensionality in a way that I would characterize as an inverse curse, and not what your source refers to as asymptotic nature. It's a consequence of the Manhattan distance being resistant to the curse of dimensionality. I could also say that it drives the loss to zero because it can.

You can look up the highly cited paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" by Aggarwal et al. here: https://bib.dbvis.de/uploadedFiles/155.pdf They study different distance metrics and find that the Manhattan distance is the most robust in high-dimensional problems for the purpose of classification. Other metrics, such as the Euclidean distance, can't tell the points apart.

Now, all sigmoid functions have a linear term in their Taylor approximation around the origin. Hence, the predictor $y(X\beta)\sim X\beta$, which is very similar to a Manhattan distance $L_1$. The log loss function is also linear around any point of interest: $\ln(x+\epsilon)=\ln x + \ln(1+\epsilon/x)\approx \ln x + \epsilon/x$. Therefore, the predictors in logistic regression, even after applying the loss function, are going to separate points in high dimensions very robustly, and will have no trouble driving the loss function to zero.
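For reference, the Taylor expansion of the logistic sigmoid around $0$ that this refers to (a textbook identity, not taken from the answer itself) is

$$\sigma(z)=\frac{1}{1+e^{-z}}=\frac{1}{2}+\frac{z}{4}-\frac{z^{3}}{48}+O(z^{5}),$$

so near the origin the prediction is, to first order, linear in $z=X\beta$.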

This is in contrast to OLS regression, where the setup is such that the Euclidean distance is used to separate points. This distance is never linear by construction; it's exactly quadratic. As I already wrote, the Euclidean distance doesn't work well in high-dimensional problems.

You can see now that the asymptotic nature has nothing to do with logit's tendency to overfit. Also, what your source means by that concept is the following: when $|X\beta|\to\infty$, the prediction $y(X\beta)$ tends to either 0 or 1. Hence the "asymptotic" characterization. The loss at the edges is infinitely large.

Answered by Aksakal on November 26, 2021

This has nothing to do with that specific log loss function.

That loss function is related to binomial/binary regression and not specifically to the logistic regression. With other loss functions you would get the same 'problem'.

So what is the case instead?

  • Logistic regression is a special case of this binomial/binary regression and it is the logistic link function that has the asymptotic nature.
  • In addition the 'overfitting' is mostly problematic for cases with perfect separation.

Perfect separation and fitting with sigmoid curves

If the samples are perfectly separated then the sigmoid shape of the logistic link function can make the fit 'perfect' (zero residuals and overfitted) by increasing the coefficients (to infinity).

For instance, in the image below the true model is:

$$p(x) = \frac{1}{1 + e^{-2x}}$$

But the data points, which are not equal or close to $p(x)$ but have the values 0 or 1, happen to be perfectly separated classes (on one side they are all 0 and on the other side they are all 1), and as a result the fitted values $\hat{p}(x)$ are also driven to 0 and 1 (which the sigmoid function allows by letting $b \to \infty$): $$\hat{p}(x) = \frac{1}{1 + e^{-bx}}$$

[figure: example of perfect separation and its effect on the fitted sigmoid curve]
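A minimal R sketch of this blow-up (a toy version with six perfectly separated points, using base R's glm; it is not the data behind the figure):

x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)            # perfectly separated at x = 3.5
fit <- glm(y ~ x, family = binomial)
# glm warns that fitted probabilities numerically 0 or 1 occurred: the MLE does
# not exist, and the slope only stops growing because of the iteration limit.
coef(fit)     # a huge slope (and a correspondingly huge negative intercept)
fitted(fit)   # essentially 0, 0, 0, 1, 1, 1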


An analogous example, with a similar tendency to overfit, would be

$y_i = \sin(b \cdot x_i) + \epsilon_i$

[figure: example with a sine wave]

So this is not so much dependent on the type of loss function (or the error distribution); it is more about the model prediction being able to approach a perfect fit.

In the example with the sine wave you get overfitting when you do not limit the frequency; in the case of logistic regression you get overfitting when you have perfect separation.
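As an R sketch of the same idea (my own variant: instead of a single fitted frequency $b$, a basis of sine terms with frequencies up to $K$ is used, which is easier to fit reliably): the more frequencies you allow, the lower the training error and, typically, the worse the test error.

set.seed(4)
x_tr <- runif(30, 0, 2 * pi); x_te <- runif(30, 0, 2 * pi)
y_tr <- sin(x_tr) + rnorm(30, sd = 0.3)   # true signal: one slow sine plus noise
y_te <- sin(x_te) + rnorm(30, sd = 0.3)
basis <- function(x, K) sapply(1:K, function(k) sin(k * x))  # frequencies 1..K
errs <- function(K) {
  fit <- lm(y_tr ~ basis(x_tr, K))
  c(train = mean(resid(fit)^2),
    test  = mean((y_te - cbind(1, basis(x_te, K)) %*% coef(fit))^2))
}
round(sapply(c(1, 5, 15, 25), errs), 3)
# with 25 frequencies the 30 training points are nearly interpolated,
# while the test error is typically far worse than with K = 1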

Why does regularization work

You can solve it with regularization, but you should have some good way of knowing/estimating to what extent you wish to regularize.

In the high-dimensional case it 'works' because the overfitting (with features that link to only one or a few points/individuals) requires many parameters to be high in value. This increases the regularization part of the cost function quickly.

The regularization will make your fit tend towards 'using fewer features'. And that corresponds with your prior knowledge/belief, which would be that your model should rely on only a few features, instead of a large collection of many itsy-bitsy tiny bits (which could easily be noise).

Example: say you wish to predict the probability of becoming president of the USA. Then you might do well with some generalizing parameters like education, parents, money/wealth, gender and age. However, your fitted classification model, if it is not regularized, might give weight to the many unique features of each single observation/president (and potentially reach a perfect score/separation in the training set, while not generalizing), and instead of putting weight on a single parameter like 'age' it might use things like 'smokes cigars and likes skinny dipping' (and many of them, to account for each single president in the observed set).

This fitting with overly many different parameters is reduced by regularization, because you might get a better (regularized) loss when there are fewer parameters with high values (which means that you make the model prefer the more general parameters).

This regularization is actually a 'good thing' to do, even without the case of perfect separation.
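A hedged R sketch of that "use fewer features" effect (my own simulated data; it assumes the glmnet package, which this answer does not use): with an L1 penalty, the noise features are mostly dropped and only the few general ones keep non-zero weights.

library(glmnet)
set.seed(5)
n <- 200
X <- matrix(rnorm(n * 53), n)
colnames(X) <- c(paste0("general", 1:3), paste0("noise", 1:50))
y <- rbinom(n, 1, plogis(X[, 1:3] %*% c(1.5, 1.5, 1.5)))   # only 3 features matter
cvfit <- cv.glmnet(X, y, family = "binomial")   # L1 penalty chosen by cross-validation
coef(cvfit, s = "lambda.1se")
# typically only the "general" features survive; most "noise" weights are exactly 0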

Answered by Sextus Empiricus on November 26, 2021

I would split logistic regression into three cases:

  1. modelling "binomial proportions" with no cell proportions being 0% or 100%
  2. modelling "Bernoulli data"
  3. something in between

What's the difference?

case 1

In case 1, your data cannot be separated using your predictors, because each feature $x_i$ has multiple records, with at least 1 "success" and at least 1 "failure". The loss function then becomes

$$\text{LogLoss} = -\sum_i n_i \left[f_i\log(p_i)+(1-f_i)\log(1-p_i)\right]$$

Where $f_i$ is the proportion of times $y=1$ in "cell" $i$, and $p_i=\left(1+e^{-x_i^Tw}\right)^{-1}$ is the modelled probability that $y=1$ in "cell" $i$. The number $n_i$ is the number of training samples you have for "cell" $i$. What defines a "cell"? The samples with the same set of features $x_i$ are all in the same cell.

In case 1, regularisation may not be needed and can actually be harmful. It depends on how big the cell sizes ($n_i$) are.

But the loss function looks totally different to the plot you show for this case - it is more like a squared error loss function, and can be approximated by $\sum_i n_i\frac{(f_i-p_i)^2}{p_i(1-p_i)}$. This is also known as the normal approximation to the binomial proportion (and it also underlies many gradient-based algorithms for estimating the coefficients).

Perfect prediction for each sample is impossible in this scenario, and you can think of the cells themselves as a form of regularisation. The predictions are constrained to be equal for samples in the same cell. Provided no cells are homogeneous (at least 1 of both outcomes) you cannot have a coefficient wander off to infinity.

You can also think of this as being very similar to linear regression at the cell level on the observed "logits" $\log\left(\frac{f_i}{1-f_i}\right)=x_i^Tw+\text{error}$, with each record weighted towards the "high information" cells by $n_ip_i(1-p_i)$ (i.e. a big cell size and a prediction close to the decision boundary), rather than unweighted.
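A small R sketch of that weighted "regression on observed logits" view (simulated cell data of my own; it only makes sense when no cell proportion is 0 or 1):

set.seed(1)
x   <- seq(-2, 2, length.out = 20)              # one covariate value per cell
n_i <- rep(500, 20)                             # cell sizes
s   <- rbinom(20, n_i, plogis(0.2 + 0.7 * x))   # successes per cell
f   <- s / n_i                                  # observed proportions f_i
fit_glm <- glm(cbind(s, n_i - s) ~ x, family = binomial)
fit_wls <- lm(qlogis(f) ~ x, weights = n_i * f * (1 - f))   # weighted fit on logits
rbind(mle = coef(fit_glm), weighted_logits = coef(fit_wls)) # very close, not identical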

As a side note, you can save a lot of computing time by fitting your models as "case 1" - particularly if the $n_i$ are large - compared to modelling the binary data as in case 2. This is because you aggregate sums over "cells" rather than "samples". Also, your degrees of freedom are defined by the number of "cells" rather than the number of "samples" (e.g. if you have 1 million samples but only 100 cells, then you can only fit 100 independent parameters).
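A sketch of that computational point in R (simulated data of my own): the two fits below are the same model, so the coefficient estimates agree, but the "cell" version works with 100 rows instead of 100,000.

set.seed(2)
cells <- data.frame(x = seq(-2, 2, length.out = 100), n = 1000)
cells$s <- rbinom(100, cells$n, plogis(-0.5 + 1.2 * cells$x))
# case-1 style fit: one row per cell, outcome = (successes, failures)
fit_cells <- glm(cbind(s, n - s) ~ x, family = binomial, data = cells)
# case-2 style fit: one row per Bernoulli observation
bern <- data.frame(
  x = rep(cells$x, cells$n),
  y = unlist(mapply(function(s, n) c(rep(1, s), rep(0, n - s)),
                    cells$s, cells$n, SIMPLIFY = FALSE))
)
fit_bern <- glm(y ~ x, family = binomial, data = bern)
rbind(cells = coef(fit_cells), bernoulli = coef(fit_bern))   # identical estimates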

case 2

In this case, the predictors uniquely characterise each sample. This means we can achieve zero log loss by setting the fitted values to $0$ or $1$. In the notation above, either $n_i=1$, or $n_i>1$ with $f_i\in\{0,1\}$. In this case we need some kind of regularisation, particularly if all the $n_i$ are small. One option is to penalise the "size of coefficients" (e.g. L1, L2), where large values of $w$ are penalised. You could also penalise "differences in coefficients" - such as requiring units that are "close" in feature space to have similar predictions - similar to forming cells as in case 1 (this is like pruning a regression tree).

Interestingly, some regularisation approaches can be characterised as adding "pseudo data" to each cell such that you have a situation more like case 1. That is, for the records with $f_i=0$ we add pseudo data for a $y=1$ case in that cell, and if $f_i=1$ we add pseudo data for a $y=0$ case in that cell. The different levels of regularisation will determine how much "weight" to give the pseudo data vs the observed data.

case 3

In this case you may have small segments of the sample that can be perfectly predicted. This is also likely to be where most real data lives. You can see that some kind of adaptive regularisation will likely help - where you focus more on regularising based on $n_i$. The difficult part is that many choices of what's best really depend on the data you're working with, and not the algorithm. This is one reason we have lots of different algorithms. In fact, the logistic regression MLE, if not penalised, will basically split the training sample into "case 1" and "case 2" datasets. Analytically this approach will minimise the log loss. The problem is that computational issues tend to stop the algorithm before this happens. Typically you see large coefficients with even larger standard errors when this happens. These are easy enough to find by simply looking at or filtering the coefficients (you probably need to be a bit clever about visualising them if you have a huge number of coefficients).
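A tiny R sketch of that last diagnostic (hypothetical separated data; the thresholds are arbitrary):

x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)                # x separates y perfectly
fit <- glm(y ~ x, family = binomial)    # glm warns about fitted probabilities 0 or 1
co <- summary(fit)$coefficients
co[abs(co[, "Estimate"]) > 10 | co[, "Std. Error"] > 100, , drop = FALSE]
# the separated direction shows up as a huge estimate with an even larger standard error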

Answered by probabilityislogic on November 26, 2021

The existing answers aren't wrong, but I think the explanation could be a little more intuitive. There are three key ideas here.

1. Asymptotic Predictions

In logistic regression we use a linear model to predict $\mu$, the log-odds that $y=1$:

$$ \mu = \beta X $$

We then use the logistic/inverse logit function to convert this into a probability

$$ P(y=1) = \frac{1}{1 + e^{-\mu}} $$

[figure: the logistic function mapping $\mu$ to $P(y=1)$]

Importantly, this function never actually reaches values of $0$ or $1$. Instead, $y$ gets closer and closer to $0$ as $\mu$ becomes more negative, and closer to $1$ as it becomes more positive.

[figure: zoomed-in view showing the curve approaching, but never reaching, $0$]

2. Perfect Separation

Sometimes, you end up with situations where the model wants to predict $y=1$ or $y=0$. This happens when it's possible to draw a straight line through your data so that every $y=1$ falls on one side of the line and every $y=0$ on the other. This is called perfect separation.

[figure: perfect separation in 1D]

[figure: perfect separation in 2D]

When this happens, the model tries to predict as close to $0$ and $1$ as possible, by predicting values of $\mu$ that are as low and high as possible. To do this, it must set the regression weights, $\beta$, as large as possible.

Regularisation is a way of counteracting this: the model isn't allowed to set $\beta$ infinitely large, so $\mu$ can't be infinitely high or low, and the predicted $y$ can't get so close to $0$ or $1$.

3. Perfect Separation is more likely with more dimensions

As a result, regularisation becomes more important when you have many predictors.

To illustrate, here's the previously plotted data again, but without the second predictor. We see that it's no longer possible to draw a straight line that perfectly separates $y=0$ from $y=1$.

[figure: the same data plotted against $x_1$ alone]


Code

# https://stats.stackexchange.com/questions/469799/why-is-logistic-regression-particularly-prone-to-overfitting

library(tidyverse)
theme_set(theme_classic(base_size = 20))

# Asymptotes
mu = seq(-10, 10, .1)
p = 1 / (1 + exp(-mu))
g = ggplot(data.frame(mu, p), aes(mu, p)) + 
  geom_path() +
  geom_hline(yintercept=c(0, 1), linetype='dotted') +
  labs(x=expression(mu), y='P(y=1)')
g

g + coord_cartesian(xlim=c(-10, -9), ylim=c(0, .001))

# Perfect separation
x = c(1, 2, 3, 4, 5, 6)
y = c(0, 0, 0, 1, 1, 1)
df = data.frame(x, y)
ggplot(df, aes(x, y)) +
  geom_hline(yintercept=c(0, 1), linetype='dotted') +
  geom_smooth(method='glm', 
              method.args=list(family=binomial), se=F) +
  geom_point(size=5) +
  geom_vline(xintercept=3.5, color='red', size=2, linetype='dashed')

## In 2D
x1 = c(rnorm(100, -2, 1), rnorm(100, 2, 1))
x2 = c(rnorm(100, -2, 1), rnorm(100, 2, 1))
y = ifelse( x1 + x2 > 0, 1, 0)
df = data.frame(x1, x2, y)
ggplot(df, aes(x1, x2, color=factor(y))) +
  geom_point() +
  geom_abline(intercept=1, slope=-1,
              color='red', linetype='dashed') +
  scale_color_manual(values=c('blue', 'black')) +
  coord_equal(xlim=c(-5, 5), ylim=c(-5, 5)) +
  labs(color='y')

## Same data, but ignoring x2
ggplot(df, aes(x1, y)) +
  geom_hline(yintercept=c(0, 1), linetype='dotted') +
  geom_smooth(method='glm', 
              method.args=list(family=binomial), se=T) +
  geom_point()
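
## Regularised fit on the same separable 2D data (an addition of mine, not part of
## the original answer; it assumes the glmnet package and an arbitrary lambda).
## The penalty keeps the weights finite even though the classes are separable.
library(glmnet)
X <- as.matrix(df[, c("x1", "x2")])
coef(glm(y ~ x1 + x2, family = binomial, data = df))       # unpenalised: huge weights, warnings
coef(glmnet(X, df$y, family = "binomial", lambda = 0.05))  # penalised: modest, finite weights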

Answered by Eoin on November 26, 2021

You give the source’s explanation yourself, where it says in your link:

Imagine that you assign a unique id to each example, and map each id to its own feature. If you don't specify a regularization function, the model will become completely overfit. That's because the model would try to drive loss to zero on all examples and never get there, driving the weights for each indicator feature to +infinity or -infinity. This can happen in high dimensional data with feature crosses, when there’s a huge mass of rare crosses that happen only on one example each.

And from Logistic Regression for Machine Learning:

It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

This "never exactly at those limits" is the point, the logistic regression can almost perfectly predict the class, but is never 100 % sure. Thus the weights can grow almost infinitely as soon as the classes are hit in the vast majority of cases, which can happen if you allow for higher dimensions with a huge mass of rare feature crosses.


Part 1: paper on rare feature crosses

Alert: I am not sure about *Part 1*, this is already edited a lot and it is still vague and might be wrong! Main point of change: an unconstrained MLE is for rare feature crosses, and the constrained MLE is the usual case in low dimensionality, meaning much more observations than features. I rather recommend *part 2* as the main answer. *Part 1* was merged with *Part 2*, it has been a separate answer before.

I assume that the paper "The Impact of Regularization on High-dimensional Logistic Regression" which also uses this outstanding term "rare feature crosses" answers the question here.

This would also be in line with the highly voted comment of @DemetriPananos:

The question is probably about ...

... complete or quasi-complete separation. High dimensional space is weird, and there may exist some affine plane which perfectly or almost perfectly separates the 1s from the 0s. In such a case, the coefficients of the model are very large and the model will predict probability near 0 or 1 for each case respectively.

Back to the paper, at best, read the abstract of the paper yourself. I just try to show the core of it here, and I am not a professional, perhaps someone can correct me in the following conclusions (with some luck, I got it right, though):

The problem arises from models

where the "number of observations and parameters are comparable", so that the "maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.)."

Which is nothing but the mass of rare feature crosses meant in your source’s explanation.

Further:

An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist.

I can only assume now that this (unconstrained) MLE does arise from a rare feature cross, with the problem that those observations which are not rare feature crosses then need to be "recovered" as parameters, because they are dominated by the high weights that the rare feature crosses receive.

In other words: in the usual case and in small dimensionality, a constrained MLE exists for each observation, it is calculated over a given number of observations that face a smaller number of features - thus it needs to be calculated by using constraints. With higher dimensionality, rare feature crosses arise where an unconstrained MLE exists, because parameters and observations become 1:1 cases then: one unique feature (~ parameter) = one isolated class assignment (~ observation). In these cases, those observations that are not mapped to just one feature lose their impact and need to be recovered by regularisation.

####

An example from the universe: think of a case where you can explain, say, whether something is a planet or not from the planetary circles around the sun, and you use three features for this (in this example, everything is classified as a planet except for the sun). Then you add a dimension by making the earth the center instead. This means you do not need more "new" features; instead you just need a higher dimensionality of the same features that are used in the sun model to explain the circles - and that is the same as adding new features, just by using more dimensions of the original features.

Sun or earth

More details: You might take just three features to prove that all are planets around the sun as a binary problem - planet yes or no - as long as a function can explain the planetary circle using just the three features. As a next step, you can take whatever dimensionality you want of those three features to improve your model around the earth instead. And adding those higher dimensionalities is the same as adding entirely new features. Then, those functions which perfectly explain a planet by an isolated multidimensional feature (a feature that never explains any other planet) can get very high weights in the model, while those planets whose functions are not that isolated from each other, because their parabolic functions are rather similar, cannot have infinite weights, because there would be a part of the planet circles that gets explained worse when improving the explanation of the other part to 100 %. Now if you go to a very sparse area by adding more and more dimensionality, you will get to a model where finally all planets can be marked as planets according to some isolated features that are not used for the other planets' functions. These are the rare feature crosses; there is no intersection anymore with any other features. Such features only explain one single planet with its planet function. And thus those high-dimensional features can get infinite weights.

####

What is the final idea here to answer the question at all?

I assume that the logistic regression which never reaches probability 1 or 0 leads to the infinite weights of the rare feature crosses which causes the model to overfit on the rare feature crosses.

This is about a multi-layer design. In low dimensions, that is, if the polynomial degrees of the input are still low, the model with sigmoid activation is simple, and saturation leading to overfitting is not the problem; rather, vanishing gradients (the derivative is at most 0.25; take a couple of layers of backpropagation and the gradients become small) lead to underfitting. The particular problem of overfitting due to the sigmoid activation arises for polynomial inputs in high dimensions, which can cause these rare feature crosses that describe the training set perfectly but might not generalise to the test set.

The sigmoid curve of the logistic function causes underfitting in low dimensions and overfitting in high dimensions.

We cannot repair the MLEs because they depend on the number of features and observations, and we cannot just change the number of features or observations. Instead, we can reduce the weights of the rare feature crosses to recover the parameters that are not rare feature crosses.

Which gives us the next conclusion: when the "number of observations and parameters are comparable", so that you have a mass of rare feature crosses in great sparsity, you lose the ordering function of the MLEs for the rest that is not in this "mass".

End of the abstract:

"... and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function." Meaning: a well-calibrated regularizer can solve the whole issue of the much-needed constraints by using a convex cost function of the weights (L1 and L2 are both tested) as part of the loss.


Part 2: Intuition of rare feature crosses in maths and graphs

Repeating the quote of your link at the beginning of this answer:

This can happen in high dimensional data with feature crosses, when there’s a huge mass of rare crosses that happen only on one example each.

The rare feature crosses can already be understood in a 2-dimensional graph with 2 classes (mathematically, a logistic regression is always for 2 classes, though it can be used to predict multiple classes with the one-vs-all method) that are scattered in slightly overlapping clouds of observations; see the middle row "Classification illustration" (and then, after this example, think of the mass of rare feature crosses in a 3-dimensional "Classification illustration" in a sparse area):

[figure: Stanford cheat sheet, from underfitting to overfitting]

Source: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks

The borderline between the two classes in the x1/x2 "Classification illustration" example marks the points with constant predicted probability y = 0.5 of being class 1 or 0.

In this "Classification illustration", with every added dimension (not of new variables, but of the same explanatory variable to the power of 2, 3, 4 aso.) the borderline of the observations‘ classes gets more flexible.

It is like adding new "explanation power", until you find all the dimensions you need to explain all labels.

  1. "Classification illustration", middle graph, (dim 2):

When adding 1 dimension means introducing x1^2 and/or x2^2, the graph has the 2 original features plus 1 or 2 calculated "artificial" features, even though there are still just 2 original features.

  1. "Classification illustration", right graph, (e.g. dim 9):

In very high dimensionality, the 2 classes can be assigned so well that perfect separation can be reached. Even if the two classes are spread in quite some chaos, you might perfectly separate them when you go up to the power of 9, meaning that you have 9 different powers of x1 and/or x2 available to assign the classes correctly.

#### Deep Learning side-note START

  3. In the Deep Learning example (bottom row), logistic regression is used as the activation function. Please note that this has to be kept apart from the classification example, which is the better example for answering the question.

Logistic regression is a sigmoid function. A wide variety of sigmoid functions, including the logistic and hyperbolic tangent functions, have been used as the activation function of artificial neurons (https://en.wikipedia.org/wiki/Sigmoid_function). They are used to enable a nonlinear mapping of the output, so that large numbers do not change the activation value very much anymore, which is due to the asymptotic nature of the curve. The problem is still the same, since every single neuron can be seen as an isolated fitting problem that can also overfit, for the same reasons as in the 2-D classification example. Once the neuron knows that "it is right", it will increase the probability (= activation value) to almost g(z) = 1 by letting the weights grow infinitely.

[figure: sigmoid activation function in a neural network]

From: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning

Mind that this Deep Learning paragraph should better be ignored in the direct answer to the question. It is not intuitive and just confusing to the reader, since the actual problem of overfitting in neural networks is rather a problem of capacity, not of the activation function. A higher capacity leads to overfitting, as does the asymptotic nature of the logistic regression in the higher dimensionality of the "Classification illustration". Better to keep "Regression illustration" & "Classification illustration" separate from "Deep Learning illustration". Yet, here is a regression example of an overfitting Deep Learning model; please judge for yourself whether it adds to the answer of the question:

[figure: regression example of an overfitting Deep Learning model]

Regression and classification examples in Deep Learning look like those without Deep Learning; see the classification example that follows. On the right, see the examples of underfitting (the straight line) vs. overfitting (the very complex curve that hits every point):

[figure: underfitting vs. overfitting in a classification example]

Capacity differences lead to the difference. It is unclear in what way the logistic regression in the activation function changes the capacity of a network. Definition of capacity: the more trainable parameters, the more functions can be learned and the more complex these functions can be. The capacity (number of layers, number of neurons, complexity of the propagation and activation function, and other parameters) seems to be comparable to the question's "higher dimensions"; at least the results are the same, and that is why I suspect the capacity to be the equivalent problem to the high dimensions of x1 and x2 in a non-Deep-Learning classification problem.

My guess: the activation function (here: logistic regression) is part of the capacity of a neural network. This would justify this whole paragraph here. If instead the capacity were not partly changed by the choice of the activation function, then this Deep Learning paragraph would have no relevance in answering the question and should be ignored / deleted, as other parameters (number of layers, number of neurons) are not a problem specific to logistic regression.

Here is another hint that the increased dimensionality is meant as the input also in the deep learning setting, see the green marker for the 5 higher dimensionalities.

sigmoid in high dimensions

Source: sigmoid with 7 features (5 in high dimensions) which can be run at https://playground.tensorflow.org/#activation=sigmoid&batchSize=25&dataset=circle&regDataset=reg-gauss&learningRate=0.01&regularizationRate=0.3&noise=20&networkShape=5,4&seed=0.70944&showTestData=false&discretize=false&percTrainData=30&x=true&y=true&xTimesY=true&xSquared=true&ySquared=true&cosX=false&sinX=true&cosY=false&sinY=true&collectStats=false&problem=classification&initZero=false&hideText=false

Strangely, all of the other activation functions show more overfitting than the sigmoid when using 5 higher dimensions among the 7 features. In the TensorFlow playground, you can simply change the activation function to check this. The test results at 2760 epochs and 7 features (5 in high dimensions) are as follows.

ReLU: [test result figure]

Tanh: [test result figure]

Linear: [test result figure]

Perhaps logistic regression is not "especially prone to overfitting in high dimensions" in neural networks? Or these are just too few added dimensions. If we added dimensions up to x^9, it might be the case that the logistic regression in the activation functions would overfit the model more than ReLU and linear. I doubt that tanh would be much different, since it is also asymptotic to -1 and 1.

#### Deep Learning side-note END

Core part of this answer, at best looking at the simple classification problem in 2D:

The increase in dimensionality has an effect as if you added new features, until finally every observation is assigned to the right class. After a certain increase in dimensionality you can hit every class.

The resulting unstructured, skippy borderline is overfitting in itself, because any visual generalisability is lost - not just to the human eye in this 2-dimensional example, but also for the determination of a correct loss that keeps the training relevant for the test set: the loss simply vanishes to 0. If the regularisation does not punish high weights in order to increase the loss again, the weights of rare feature crosses (metaphorically the skippy borderline, but now in a sparse area in high dimensionality) grow without restriction, overfitting the model. Looking at the other side, this means that the weights of more densely scattered observations (which share some features among each other, so that they are not rare feature crosses) lose weight, relatively and also absolutely, possibly down to 0, even though they are probably relevant in the test set.

See here how this looks mathematically. You see that the original two features x1 and x2 stay the only ones, there is no feature added! They are just used in different combinations and dimensionalities.

[figure: the model written out with polynomial combinations of x1 and x2]

From: https://medium.com/machine-learning-intuition/overfitting-what-they-are-regularization-e950c2d66d50

And here is another visualisation of the increased dimensionality meant in the question:

[figure: the sigmoid g(f(x)) applied to a polynomial f(x)]

The sigmoid activation function g(f(x)) can evaluate f(x) both as a multi-dimensional (= polynomial) regression and as a one-dimensional regression.

This supports the idea that adding dimensionality is meant to add different combinations and dimensions of the already existing features (x1,x2) - and it is not to add "new original features" (x3,x4...) as "new dimensions".
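A hedged R sketch of exactly this point (my own simulation; degree 5 is used as a modest stand-in for the "power of 9" mentioned above): only the two original features exist, yet feeding their polynomial combinations to the logistic regression drives the training loss down while the test loss typically gets worse.

set.seed(42)
n <- 100
d <- data.frame(x1 = rnorm(2 * n), x2 = rnorm(2 * n))
d$y <- rbinom(2 * n, 1, plogis(1.5 * d$x1 + 1.5 * d$x2))   # the true model is linear
train <- d[1:n, ]; test <- d[(n + 1):(2 * n), ]
logloss <- function(fit, data) {
  p <- pmin(pmax(predict(fit, data, type = "response"), 1e-12), 1 - 1e-12)
  -mean(data$y * log(p) + (1 - data$y) * log(1 - p))
}
fit_lin  <- glm(y ~ x1 + x2, family = binomial, data = train)
fit_poly <- glm(y ~ poly(x1, x2, degree = 5, raw = TRUE),   # 20 derived "dimensions"
                family = binomial, data = train)
round(c(train_lin  = logloss(fit_lin, train),  test_lin  = logloss(fit_lin, test),
        train_poly = logloss(fit_poly, train), test_poly = logloss(fit_poly, test)), 3)
# the polynomial fit wins on the training set and typically loses on the test set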

And it thus stands in contrast to the accepted answer above which explains the problem of the question by adding predictors (= original features): "As a result, regularisation becomes more important when you have many predictors." This statement seems just wrong to me.

To the point - why the accepted answer seems to be wrong: the overfitting issue is not because of added predictors (taking the name from the accepted answer here, = features). It is about using different combinations and dimensions of the existing predictors (features) as artificially new predictors (features). Staying in the examples: x1 and x2 are all you need to get the overfitting problem explained; no x3 is needed for this. The accepted answer would only be right if it defined "many predictors" as "existing features together with their different combinations and dimensionalities", like x1^2 + x2^2 + x1x2, which I doubt it does, since there is no word about that. Thus in this case, the accepted answer with its 200 points seems not to offer the true and complete explanation, though its basic direction is right, since: more predictors will tend to overfit the model due to the asymptotic nature of the logistic regression - IF these "more predictors" are the derived higher dimensions from already existing predictors.

Answered by questionto42 on November 26, 2021

Logistic regression is a convex optimization problem (the likelihood function is concave), and it is known to have no finite solution when it can fully separate the data, so the loss function can only reach its lowest value asymptotically as the weights tend to ± infinity. This has the effect of tightening decision boundaries around each data point when the data are separable, asymptotically overfitting on the training set.

On a more practical note, logistic regression is often trained with gradient descent. This is a shallow model with a smooth non-zero loss everywhere, so the gradient doesn't vanish easily numerically. Since the optimizer cannot reach an optimal solution via gradient steps with finite step sizes, it can iterate "forever", pushing the weights to increasingly extreme values, in an attempt to reach asymptotically zero loss.

In high dimensions this problem is exacerbated because the model will have even more ways to separate the data, so gradient descent is more likely to overfit asymptotically, i.e. if you let it run for long. Note that early stopping is a form of regularization in itself, and that it can take a relatively long time for these models with vanilla gradient descent to overfit.
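A bare-bones R sketch of that behaviour (my own toy setup: one weight, four separable points, plain gradient descent on the mean log loss): the loss keeps shrinking towards 0 but never reaches it, and the weight just keeps growing.

x <- c(-2, -1, 1, 2)
y <- c( 0,  0, 1, 1)                      # separable at x = 0
w <- 0; lr <- 0.5
for (step in 1:100000) {
  p <- plogis(w * x)
  w <- w - lr * mean((p - y) * x)         # gradient of the mean log loss w.r.t. w
  if (step %in% c(10, 100, 1000, 10000, 100000)) {
    p <- plogis(w * x)
    cat(sprintf("step %6d   w = %6.2f   loss = %.6f\n",
                step, w, -mean(y * log(p) + (1 - y) * log(1 - p))))
  }
}
# w grows without bound (roughly logarithmically in the number of steps);
# stopping early caps it, which is why early stopping acts as regularisation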

Answered by Amelio Vazquez-Reina on November 26, 2021

The asymptotic nature refers to the logistic curve itself. The optimizer, if not regularized, will enlarge the weights of the logistic regression to put $wx$ as far as possible to the left or right per sample to reduce the loss maximally.

Let's assume one feature that provides perfect separation; one can imagine $wx$ getting larger and larger on each iteration. Optimization will fail in this case, unless the solution is regularized.

$$\frac{1}{1 + e^{wx}}$$

A high dimensional model creates a large hypothesis space for the possible set of parameters. The optimizer will capitalize on that by choosing the solution with the highest weights. Higher weights will reduce the loss, which is the task of the optimizer, steepen the logistic curve, and give a higher conditional likelihood of the data. The model is overconfident, a paraphrase for overfitting in this setting.

If there are several parameter configurations that have the same binary performance measure, the optimizer will always choose the configuration with the lowest loss. Due to the asymptotic nature of the logistic curve, the loss function can be reduced beyond the information provided by the binary labels.
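A tiny R sketch of that last point (hypothetical separable data): scaling the weight up leaves every binary decision unchanged, yet the log loss keeps falling, so an unregularised optimiser always prefers the larger weights.

x <- c(-2, -1, 1, 2); y <- c(0, 0, 1, 1)
logloss <- function(w) { p <- plogis(w * x); -mean(y * log(p) + (1 - y) * log(1 - p)) }
for (w in c(1, 5, 25, 125))
  cat("w =", format(w, width = 3), "  accuracy =", mean((plogis(w * x) > 0.5) == y),
      "  log loss =", signif(logloss(w), 3), "\n")
# accuracy stays at 1 for every w, while the log loss keeps decreasing toward 0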

More pragmatically, regularization, which makes the coefficients smaller, can help to reduce overfitting. A more formal explanation of the relationship between unconstrained weights, regularization, and overfitting can be found using Bayesian theory.

Answered by spdrnl on November 26, 2021
