TransWikia.com

Is it wrong to transform the target variable and test the model without dropping the column that was transformed? What's the disadvantage of doing that?

Data Science Asked by SP_ on April 16, 2021

I have a linear regression model. I log-transformed the target variable Item_Outlet_Sales into Item_Outlet_Sales_log on both the training and testing datasets, but I did not delete the original Item_Outlet_Sales column.

Here is the snippet:

import numpy as np
import seaborn as sns

#treat extreme values in Item_Outlet_Sales
train['Item_Outlet_Sales_log'] = np.log(train.Item_Outlet_Sales)
test['Item_Outlet_Sales_log'] = np.log(test.Item_Outlet_Sales)

#distribution of the log-transformed target
sns.distplot(train.Item_Outlet_Sales_log);
sns.distplot(test.Item_Outlet_Sales_log);

[Image: distribution plots of the log-transformed target for train and test]

  1. I dropped the target variable Item_Outlet_Sales_log from the features and assigned it to y:

     #creating dummies for the training dataset
     X = train.drop('Item_Outlet_Sales_log', axis=1) #drop the log target column
     y = train.Item_Outlet_Sales_log
    
     X = pd.get_dummies(X)
     train = pd.get_dummies(train)
     test = pd.get_dummies(test)
    

Is it wrong to do this in terms of giving the model proper training? What is the disadvantage of it? Is it recommended to delete the old target variable or not?

One Answer

The problem is that, by definition, your target variable is not available at inference time, and that is why you want to predict it. If your target variable was available at inference time, then there is no point in predicting it.

Therefore, if you use the target variable (or any transformation of it) as an input to your model, what value would you feed into that feature at inference time? In your case, keeping Item_Outlet_Sales in X means the model trains on a column that is just an invertible transformation of the target, so it learns nothing useful and cannot be applied to real unseen data.
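As a minimal sketch of the fix (using a small made-up DataFrame with the column names from your question), drop both the raw target and its log transform before building the feature matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the BigMart training set
train = pd.DataFrame({
    "Item_MRP": [120.5, 89.9, 240.1],
    "Outlet_Type": ["Grocery", "Supermarket", "Supermarket"],
    "Item_Outlet_Sales": [1500.0, 730.0, 5200.0],
})

train["Item_Outlet_Sales_log"] = np.log(train["Item_Outlet_Sales"])

# Drop BOTH the raw target and its transform, so neither leaks into X
X = train.drop(columns=["Item_Outlet_Sales", "Item_Outlet_Sales_log"])
y = train["Item_Outlet_Sales_log"]

X = pd.get_dummies(X)

assert "Item_Outlet_Sales" not in X.columns
assert "Item_Outlet_Sales_log" not in X.columns
```

After fitting on this X and y, remember that predictions come out on the log scale, so apply np.exp to get them back into sales units.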

Correct answer by noe on April 16, 2021

