TransWikia.com

Dealing with missing data

Data Science Asked on April 28, 2021

I have a question about data cleaning. I am a novice and have just started learning in this field, so please pardon my ignorance. Suppose there are two columns, and based on samples where both values are present you find the correlation coefficient between them to be high. For the rows where a value is missing, can you use linear regression to predict it, using the rows with known values as training data?

One Answer

Hi Soumyadeep, and welcome to Data Science Stack Exchange.

What you are describing is called regression imputation, and it is a valid method for handling missing data. However, if the data is sparse (a large share of the values is missing), the regression has few complete rows to learn from, so the imputed values become less reliable.
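As a minimal sketch of regression imputation for the two-column case in the question, assuming NumPy is available: the function name `regression_impute` and the toy arrays are made up for illustration, and a straight-line fit via `np.polyfit` stands in for whatever regression model you would actually use.

```python
import numpy as np

def regression_impute(x, y):
    """Fill NaNs in y with predictions from a line fitted on complete rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Rows where both columns are known serve as the training data.
    known = ~np.isnan(x) & ~np.isnan(y)
    slope, intercept = np.polyfit(x[known], y[known], deg=1)

    # Only rows with a known x and a missing y can be predicted.
    target = np.isnan(y) & ~np.isnan(x)
    filled = y.copy()
    filled[target] = slope * x[target] + intercept
    return filled

# Hypothetical toy data: y is roughly 2 * x, with two missing entries.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 11.8])

y_filled = regression_impute(x, y)
print(y_filled)
```

Note that every imputed value lies exactly on the fitted line, so the filled column looks less noisy than the real data would be. Keep that in mind if you later compute variances or correlations on it.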

In general, missing data can be handled in several ways (row deletion, imputation, substitution, etc.). Regression imputation can be used if you have little or no knowledge about the data, but another method is often a better choice. If you have domain knowledge about the missing values, for example an idea of what a value should be, you can usually use that knowledge to fill them in. Try a few different methods and see which one works best on your data.
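One way to "see which one works best" is to hide some values you actually know, impute them with each method, and compare the error. The sketch below does this for mean imputation versus regression imputation on synthetic data; the data, the 20% masking rate, and the helper `rmse` are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated columns (an assumption for this demo).
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Hide about 20% of the known y values so the imputations can be scored.
hidden = rng.random(200) < 0.2
y_obs = np.where(hidden, np.nan, y)
known = ~np.isnan(y_obs)

def rmse(pred, truth):
    """Root mean squared error between imputed and true values."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Method 1: fill every gap with the mean of the observed values.
mean_fill = np.full(hidden.sum(), y_obs[known].mean())

# Method 2: fill each gap with a prediction from a line fitted on known rows.
slope, intercept = np.polyfit(x[known], y_obs[known], deg=1)
reg_fill = slope * x[hidden] + intercept

scores = {
    "mean": rmse(mean_fill, y[hidden]),
    "regression": rmse(reg_fill, y[hidden]),
}
print(scores)
```

On data this strongly correlated, regression imputation should score clearly better than the mean. With weakly correlated columns the gap shrinks, which is why it is worth running the comparison on your own data.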

A person pointed out that I should check for multicollinearity if both features are independent variables. Does that basically mean that one feature falls in the span of the other feature?

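Definition of (perfect) multicollinearity: there exist one or more exact linear relationships among some of the variables. So yes, in the exact case one feature lies in the span of the others (together with a constant term, if the model has an intercept). In practice, the term is also used when the relationship is strong but not exact.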
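To make the "span" idea concrete, here is a small sketch, assuming NumPy and made-up features `a`, `b`, and `c`. It detects an exact linear relationship via the rank of the design matrix, and measures a near relationship with the variance inflation factor, which for two predictors reduces to 1 / (1 - r²). The helper name `has_exact_collinearity` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.normal(size=100)
b = 3.0 * a + 2.0                               # exact linear function of a
c = 3.0 * a + rng.normal(scale=0.1, size=100)   # close to linear, not exact

def has_exact_collinearity(*features):
    """True if the design matrix (with intercept column) is rank deficient."""
    X = np.column_stack([np.ones(len(features[0])), *features])
    return bool(np.linalg.matrix_rank(X) < X.shape[1])

exact = has_exact_collinearity(a, b)  # b lies in the span of {1, a}
near = has_exact_collinearity(a, c)   # the noise keeps c outside that span

# With two predictors, the variance inflation factor is 1 / (1 - r^2).
r_ac = np.corrcoef(a, c)[0, 1]
vif_ac = 1.0 / (1.0 - r_ac ** 2)

print(exact, near, vif_ac)
```

The rank check only catches exact relationships. A large VIF (a commonly cited rule of thumb is above 5 or 10) flags the near-exact case, which is what usually causes trouble when both features are used as predictors in the same regression.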


References:
https://en.wikipedia.org/wiki/Multicollinearity
https://stats.stackexchange.com/questions/234870/is-multicollinearity-the-issue-here

Correct answer by Donald S on April 28, 2021
