Adding high-p-value, low-R-squared features to a linear regression model to improve results

Data Science Asked by Shahnawaz Khan on January 31, 2021

I am working on a linear regression problem. The features for my analysis were selected using p-values and domain knowledge. After adding these features, $R^2$ improved from 0.25 to 0.85, and the $RMSE$ improved as well. But here is the issue: the features selected using domain knowledge have very high p-values (0.7, 0.9) and very low individual $R^2$ values (0.002, 0.0004). Does it make sense to add such features even if the model's performance improves? As far as I know, in linear regression it is preferable to keep only features with low p-values.

Can anyone share their experience? If so, how can I justify proposing new features that have high p-values?

One Answer

In general, adding more features increases the quality of the model fit: in-sample $R^2$ can never decrease when a feature is added.
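A minimal sketch of this point, using scikit-learn on synthetic data made up for illustration: even a pure-noise column cannot lower the in-sample $R^2$.

```python
# Sketch: in-sample R^2 never decreases when a feature is added,
# even if that feature is pure noise (synthetic data for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                       # three informative features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

r2_base = LinearRegression().fit(X, y).score(X, y)

noise = rng.normal(size=(n, 1))                   # a feature unrelated to y
X_aug = np.hstack([X, noise])
r2_aug = LinearRegression().fit(X_aug, y).score(X_aug, y)

print(r2_base, r2_aug)   # r2_aug >= r2_base, always (in-sample)
```

Out-of-sample performance is a different matter: a noise feature can easily hurt test-set $R^2$, which is one reason people prune features at all.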

If your goal is the best-fitting model, add as many features as possible (regardless of p-value).

Sometimes people care about parsimonious models: they are willing to accept a lower overall model fit because they also value a simpler model. They then apply a p-value threshold to the features.

Answered by Brian Spiering on January 31, 2021
