# Is it necessary to scale the target value in addition to scaling features for regression analysis?

Cross Validated Asked by user2806363 on January 3, 2022

I’m building regression models. As a preprocessing step, I scale my feature values to have mean 0 and standard deviation 1. Is it necessary to normalize the target values also?

I think the best way to know whether we should scale the output is to try both ways, using `scaler.inverse_transform` in sklearn. Neural networks are, in general, not robust to transformations of the data. If you scale the output variable and then train, the MSE you get is for the scaled version. However, if you use that model to predict, apply `scaler.inverse_transform` to the predictions, and recompute the MSE, the picture may be different.
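A minimal sketch of that comparison, written in plain Python so that it runs without sklearn. The toy data and the helper names `fit_line` and `mse` are made up for illustration. The manual standardization stands in for `StandardScaler`, and the line that undoes it stands in for `scaler.inverse_transform`.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Closed-form least squares for y = b1 * x + b0."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return b1, my - b1 * mx

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

# Toy data (hypothetical values, chosen only for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1000.0, 3100.0, 4900.0, 7200.0, 8800.0]

# Standardize the target to mean 0 and standard deviation 1.
mu, sd = mean(ys), pstdev(ys)
ys_scaled = [(y - mu) / sd for y in ys]

# Train on the scaled target.
b1, b0 = fit_line(xs, ys_scaled)
pred_scaled = [b1 * x + b0 for x in xs]

# Undo the scaling on the predictions (the inverse transform).
pred = [p * sd + mu for p in pred_scaled]

mse_scaled = mse(pred_scaled, ys_scaled)   # in standardized units
mse_original = mse(pred, ys)               # in the target's original units
print(mse_scaled, mse_original)
```

For this linear model the two numbers differ exactly by the factor `sd ** 2`, so they describe the same fit in different units. For a model that is sensitive to scale, refitting on the unscaled target could give a genuinely different result.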

Answered by user25047 on January 3, 2022

It may be useful in some cases.

Although L1 error is not a common loss function, learning can be rather slow when it is used to calculate the loss.

Assume that we have a linear regression model, and also have a constant learning rate $$n$$. Say,

$$y = b_1x + b_0$$

$$n = 0.1$$

$$b_1$$ and $$b_0$$ are updated as follows:

$$b_1^{new} = b_1^{old} - n \cdot \frac{\hat{y}-y}{|\hat{y}-y|} \cdot x$$

$$b_0^{new} = b_0^{old} - n \cdot \frac{\hat{y}-y}{|\hat{y}-y|}$$

$$\frac{\hat{y}-y}{|\hat{y}-y|}$$ evaluates to -1 or 1. Hence, $$b_0$$ will be incremented/decremented by $$n$$, and $$b_1$$ will be incremented/decremented by $$n \cdot x$$.

Now, if the output value is in the millions or billions, $$b_0$$ will obviously need a very large number of iterations to bring the cost close to zero.

If the input is normalized (or standardized), $$b_1$$ will also change by amounts similar to those of $$b_0$$ (e.g. 0.1), so it will need just as many iterations.

This is in fact why it is desirable for the derivative of the cost to contain a factor of the actual error (such as $$\hat{y}-y$$).
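The iteration count can be checked with a small sketch. It assumes the simplest possible case: a constant model $$\hat{y} = b_0$$, a single target value, and the learning rate $$n = 0.1$$ from above. The function name `l1_steps` and the target value 10,000 are hypothetical, picked so that the loop finishes quickly.

```python
def l1_steps(target, lr=0.1, max_steps=10_000_000):
    """Count sign-gradient (L1) updates until b0 is within lr of the target.

    The L1 gradient is sign(y_hat - y), so each update moves b0 by exactly lr,
    no matter how far away the target still is.
    """
    b0 = 0.0
    for step in range(max_steps):
        err = b0 - target
        if abs(err) <= lr:
            return step
        b0 -= lr * (1.0 if err > 0 else -1.0)
    return max_steps

raw_target = 10_000.0                    # hypothetical unscaled target
scaled_target = raw_target / 10_000.0    # the same target after rescaling

steps_raw = l1_steps(raw_target)
steps_scaled = l1_steps(scaled_target)
print(steps_raw, steps_scaled)
```

The number of updates grows roughly as `target / lr`, so a target in the millions would need tens of millions of steps at this learning rate.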

Answered by Ricardo Cristian Ramirez on January 3, 2022

Yes, you do need to scale the target variable. I will quote this reference:

A target variable with a large spread of values, in turn, may result in large error gradient values causing weight values to change dramatically, making the learning process unstable.

In the reference, there's also a demonstration in code where the model weights exploded during training: the very large errors produced very large error gradients for the weight updates. In short, if you don't scale the data and you have very large values, make sure to use a very small learning rate. This was mentioned by @drSpacy as well.
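The link between target scale and gradient size can be sketched for a single weight under squared error. This is not the demonstration from the reference, and a linear model like this one does not reproduce the divergence described there. It only shows that the first update grows in proportion to the target, and that shrinking the learning rate by the same factor compensates. All names and values are hypothetical.

```python
def mse_grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

x, w0, lr = 1.0, 0.0, 0.1
y_raw = 2_500_000.0        # hypothetical unscaled target
scale = 1_000_000.0        # hypothetical scaling constant
y_scaled = y_raw / scale

# Size of the first weight update with the same learning rate.
step_raw = lr * mse_grad(w0, x, y_raw)
step_scaled = lr * mse_grad(w0, x, y_scaled)

# Shrinking the learning rate by the same factor gives the same step size.
step_raw_small_lr = (lr / scale) * mse_grad(w0, x, y_raw)

print(step_raw, step_scaled, step_raw_small_lr)
```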

Answered by Fernando Wittmann on January 3, 2022

$$x_{n+1} = x_{n} - \gamma \Delta F(x_n)$$

Let's say that $$x_2$$ is a feature that is 1000 times greater than $$x_1$$.

For $$F(\vec{x})=\vec{x}^2$$ we have $$\Delta F(\vec{x})=2\vec{x}$$. The optimal way to reach (0,0), which is the global optimum, is to move along the diagonal, but if one of the features dominates the other in terms of scale, that won't happen.

To illustrate: if you apply the transformation $$\vec{z}= (x_1,1000 \cdot x_1)$$, assume a uniform learning rate $$\gamma$$ for both coordinates, and calculate the gradient, then $$\vec{z}_{n+1} = \vec{z}_{n} - \gamma \Delta F(z_1,z_2).$$ The functional form is the same, but the learning rate for the second coordinate would have to be adjusted to 1/1000 of that for the first coordinate to match it. If not, coordinate two will dominate and the $$\Delta$$ vector will point more towards that direction.

As a result, the update is biased to point along that direction only, which makes convergence slower.
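A minimal sketch of this slowdown, assuming a separable quadratic $$F(w) = \sum_i c_i w_i^2$$. Rescaling one coordinate by a factor $$s$$ multiplies its curvature $$c_i$$ by $$s^2$$. The example uses a hypothetical factor of 100 rather than 1000 so that the loop finishes quickly. The function name `gd_steps` is also made up.

```python
def gd_steps(curvatures, tol=1e-6, max_steps=1_000_000):
    """Gradient descent on F(w) = sum(c_i * w_i ** 2), starting at w = (1, ..., 1).

    One learning rate is shared by all coordinates and is set so that the
    steepest coordinate stays stable. Returns the number of steps taken until
    every |w_i| is below tol.
    """
    lr = 0.45 / max(curvatures)
    w = [1.0] * len(curvatures)
    for step in range(max_steps):
        if all(abs(wi) < tol for wi in w):
            return step
        # The gradient of c_i * w_i ** 2 is 2 * c_i * w_i.
        w = [wi - lr * 2.0 * c * wi for wi, c in zip(w, curvatures)]
    return max_steps

steps_well = gd_steps([1.0, 1.0])          # both coordinates on the same scale
steps_ill = gd_steps([1.0, 100.0 ** 2])    # second coordinate 100 times larger
print(steps_well, steps_ill)
```

The shared learning rate has to be small enough for the steep coordinate, which leaves the shallow coordinate crawling towards the optimum.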

Answered by drSPacy_ on January 3, 2022

No, linear transformations of the response are never necessary. They may, however, be helpful to aid in interpretation of your model. For example, if your response is given in meters but is typically very small, it may be helpful to rescale to e.g. millimeters. Note also that centering and/or scaling the inputs can be useful for the same reason. For instance, you can roughly interpret a coefficient as the effect on the response per unit change in the predictor when all other predictors are set to 0. But 0 often won't be a valid or interesting value for those variables. Centering the inputs lets you interpret the coefficient as the effect per unit change when the other predictors assume their average values.

Other transformations (e.g. log or square root) may be helpful if the response is not linear in the predictors on the original scale. If this is the case, you can read about generalized linear models to see if they're suitable for you.
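The claim that a linear rescaling of the response changes nothing essential can be checked for ordinary least squares with one predictor. This is a sketch on made-up data. The helper `fit_line` is hypothetical and uses the closed-form solution, so no iterative optimizer is involved.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = b1 * x + b0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return b1, my - b1 * mx

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_m = [0.0021, 0.0039, 0.0062, 0.0078, 0.0101]   # response in meters (made up)
ys_mm = [1000.0 * y for y in ys_m]                # same response in millimeters

b1_m, b0_m = fit_line(xs, ys_m)
b1_mm, b0_mm = fit_line(xs, ys_mm)

# The coefficients are rescaled by the same factor as the response.
print(b1_m, b0_m)
print(b1_mm, b0_mm)
```

Both fits describe the same line. Only the units of the coefficients change, which is why the rescaling helps readability rather than the fit itself.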

Answered by AlexK on January 3, 2022

Let's first analyse why feature scaling is performed. Feature scaling improves the convergence of steepest descent algorithms, which do not possess the property of scale invariance.

In stochastic gradient descent, training examples inform the weight updates iteratively, like so: $$w_{t+1} = w_t - \gamma \nabla_w \ell(f_w(x),y)$$

where $w$ are the weights, $\gamma$ is a step size, $\nabla_w$ is the gradient with respect to the weights, $\ell$ is a loss function, $f_w$ is the function parameterized by $w$, $x$ is a training example, and $y$ is the response/label.

Compare the following convex functions, representing proper scaling and improper scaling.

A step through one weight update of size $\gamma$ will yield much better reduction in the error in the properly scaled case than the improperly scaled case. Shown below is the direction of $\nabla_w \ell(f_w(x),y)$ of length $\gamma$.

Normalizing the output will not affect the shape of $f$, so it's generally not necessary.

The only situation I can imagine where scaling the outputs has an impact is if your response variable is very large and/or you're using f32 variables (which is common with GPU linear algebra). In this case it is possible to get a floating-point overflow in an element of the weights. The symptom is an Inf value, which usually turns into NaN in later arithmetic.
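A short sketch of what floating-point overflow looks like under IEEE 754 arithmetic. Python's built-in floats are 64-bit, so the limit here is far higher than for f32 (whose largest finite value is about 3.4e38), but the same rule applies to both formats.

```python
import math
import sys

big = sys.float_info.max         # largest finite 64-bit float
overflowed = big * 10.0          # result is outside the representable range

print(overflowed)                # infinity rather than a finite number
print(overflowed - overflowed)   # inf - inf is undefined, so this is NaN
print(math.isinf(overflowed), math.isnan(overflowed - overflowed))
```

Once a weight becomes infinite, later updates that subtract or multiply by it tend to spread NaN through the rest of the model.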

Answered by Jessica Collins on January 3, 2022

Generally, it is not necessary. Scaling the inputs helps to avoid the situation where one or several features dominate the others in magnitude, so that the model hardly picks up the contribution of the smaller-scale variables, even if they are strong predictors. But if you scale the target, your mean squared error (MSE) is automatically scaled as well. Additionally, you should look at the mean absolute scaled error (MASE). MASE > 1 automatically means that you are doing worse than a constant (naive) prediction.
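A minimal sketch of both points on made-up numbers. It assumes the common time-series form of MASE, where the baseline is the one-step naive forecast on the training series. A different baseline would only change the denominator. The helper names and the scaling factor are hypothetical.

```python
def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def mase(y_true, y_pred, y_train):
    """MAE of the predictions divided by the MAE of a one-step naive forecast
    (previous value predicts the next one) on the training series."""
    naive_mae = mae(y_train[:-1], y_train[1:])
    return mae(y_pred, y_true) / naive_mae

# Made-up series and predictions.
y_train = [10.0, 12.0, 11.0, 13.0, 14.0]
y_true = [15.0, 16.0]
y_pred = [14.5, 16.5]

s = 0.001   # hypothetical scaling factor applied to the target

def scaled(values):
    return [s * v for v in values]

mse_raw = mse(y_pred, y_true)
mse_scaled = mse(scaled(y_pred), scaled(y_true))     # shrinks by s ** 2

mase_raw = mase(y_true, y_pred, y_train)
mase_scaled = mase(scaled(y_true), scaled(y_pred), scaled(y_train))  # unchanged

print(mse_raw, mse_scaled)
print(mase_raw, mase_scaled)
```

Because MASE is a ratio of two errors in the same units, it stays comparable whether or not the target was scaled, while the MSE does not.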

Answered by inzl on January 3, 2022
