Encoding features in sklearn

Data Science, asked on March 16, 2021

Suppose I have a dataset of size (10000, 45). One of the features in the dataset is activity_type, whose values range from 1 to 15, as shown below:

import pandas as pd

df = pd.read_csv('actTrain.csv')
df['activity_type'].head()

The output of the above code is:

0    1
1    1
2    2
3    1
4    3
Name: activity_type, dtype: int64

Will encoding the activity_type feature above with sklearn's OneHotEncoder improve the model in any way? Is it necessary to encode that feature? And if so, which one should I choose: LabelEncoder or OneHotEncoder?

One Answer

LabelEncoder converts strings to integers, but you already have integers, so LabelEncoder will not help you here.
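
For illustration, a minimal sketch of what LabelEncoder actually does (the string labels here are made up for the example):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder simply maps each label to an integer 0..n_classes-1,
# with the classes sorted alphabetically.
le = LabelEncoder()
print(le.fit_transform(np.array(['work', 'rest', 'exercise', 'work'])))
# [2 1 0 2]   (exercise=0, rest=1, work=2)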

If you use the integer column as it is, sklearn treats it as numeric. This means, for example, that the distance between activities 1 and 2 is 1, while the distance between 1 and 4 is 3. Can you say the same about your activities (if you know the meaning of the integers)? What are the pairwise distances between, say, "exercise", "work", "rest", and "leisure"?
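
As a quick numeric illustration (the activity codes below are just example values):

import numpy as np

# Treating the codes as plain numbers, a distance-based model (e.g. k-NN or
# k-means) sees activity 4 as three times farther from activity 1 than
# activity 2 is, even though all three are just different categories.
codes = np.array([1, 2, 4])
print(np.abs(codes[0] - codes[1]))  # 1
print(np.abs(codes[0] - codes[2]))  # 3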

If you think that any pair of activities should be equally distant from each other, because those are just different activities, then OneHotEncoder is your choice.
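
A minimal sketch of how that could look, assuming df is the DataFrame from the question (file and column names as above; get_feature_names_out requires a reasonably recent scikit-learn):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('actTrain.csv')

# Fit the encoder on the single categorical column; .toarray() turns the
# sparse output into a dense array.
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['activity_type']]).toarray()

# Put the one-hot columns back into the DataFrame under readable names.
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['activity_type']),
    index=df.index,
)
df = pd.concat([df.drop(columns=['activity_type']), encoded_df], axis=1)

pd.get_dummies(df, columns=['activity_type']) gives the same result for a single DataFrame; the sklearn encoder is preferable inside a Pipeline because it remembers the categories it saw during fit and applies them consistently at prediction time.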

Correct answer by lanenok on March 16, 2021
