TransWikia.com

Is it wrong to transform the target variable and test the model without dropping the column that was transformed? What's the disadvantage of doing that?

Data Science Asked by SP_ on April 16, 2021

I have a linear regression model. I log-transformed the target variable Item_Outlet_Sales into Item_Outlet_Sales_log on both the training and testing datasets, but I did not delete the original Item_Outlet_Sales column.

Here is the snippet:

import numpy as np
import seaborn as sns

#treat extreme values in Item_Outlet_Sales
train['Item_Outlet_Sales_log'] = np.log(train.Item_Outlet_Sales)
test['Item_Outlet_Sales_log'] = np.log(test.Item_Outlet_Sales)

#distribution of the log-transformed target
sns.distplot(train.Item_Outlet_Sales_log);
sns.distplot(test.Item_Outlet_Sales_log);

[Image: distribution plots of the log-transformed target for train and test]

  1. I dropped the target variable Item_Outlet_Sales_log from the features and assigned it to y:

     #creating dummies for the training dataset
     X = train.drop('Item_Outlet_Sales_log', axis=1) #drop the log target column
     y = train.Item_Outlet_Sales_log
    
     X = pd.get_dummies(X)
     train = pd.get_dummies(train)
     test = pd.get_dummies(test)
    

Is it wrong to do this in terms of giving the model proper training? What is the disadvantage of it? Is it recommended to delete the old target variable or not?

One Answer

The problem is that, by definition, your target variable is not available at inference time, and that is why you want to predict it. If your target variable was available at inference time, then there is no point in predicting it.

Therefore, if you use the target variable (or any transformation of it) as an input to your model, what value would you feed into that feature at inference time? In your case, keeping Item_Outlet_Sales in X means the model trains on a column that is just an invertible transformation of the target, so it learns nothing useful and cannot be applied to real unseen data.
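As a minimal sketch of the fix (using a small made-up DataFrame with the column names from your question), drop both the raw target and its log transform before building the feature matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the BigMart training set
train = pd.DataFrame({
    "Item_MRP": [120.5, 89.9, 240.1],
    "Outlet_Type": ["Grocery", "Supermarket", "Supermarket"],
    "Item_Outlet_Sales": [1500.0, 730.0, 5200.0],
})

train["Item_Outlet_Sales_log"] = np.log(train["Item_Outlet_Sales"])

# Drop BOTH the raw target and its transform, so neither leaks into X
X = train.drop(columns=["Item_Outlet_Sales", "Item_Outlet_Sales_log"])
y = train["Item_Outlet_Sales_log"]

X = pd.get_dummies(X)

assert "Item_Outlet_Sales" not in X.columns
assert "Item_Outlet_Sales_log" not in X.columns
```

After fitting on this X and y, remember that predictions come out on the log scale, so apply np.exp to get them back into sales units.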

Correct answer by noe on April 16, 2021

