Encoding features in sklearn

Data Science, asked on March 16, 2021

Suppose I have a dataset of size (10000, 45). One of the features in the dataset is activity_type, whose values range from 1 to 15, as shown below:

import pandas as pd

df = pd.read_csv('actTrain.csv')
df['activity_type'].head()

The output of the above code is:

0    1
1    1
2    2
3    1
4    3
Name: activity_type, dtype: int64

Will encoding the activity_type feature above with sklearn's OneHotEncoder improve the model in any way? Is it necessary to encode that feature? And if so, which one should I choose: LabelEncoder or OneHotEncoder?

One Answer

LabelEncoder converts strings to integers, but you already have integers, so LabelEncoder will not help you here.
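
For illustration, a minimal sketch of what LabelEncoder actually does (the string labels here are made up for the example):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder simply maps each label to an integer 0..n_classes-1,
# with the classes sorted alphabetically.
le = LabelEncoder()
print(le.fit_transform(np.array(['work', 'rest', 'exercise', 'work'])))
# [2 1 0 2]   (exercise=0, rest=1, work=2)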

If you use the integer column as it is, sklearn treats it as numeric. This means, for example, that the distance between activities 1 and 2 is 1, while the distance between 1 and 4 is 3. Can you say the same about your activities (if you know the meaning of the integers)? What are the pairwise distances between, say, "exercise", "work", "rest", and "leisure"?
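
As a quick numeric illustration (the activity codes below are just example values):

import numpy as np

# Treating the codes as plain numbers, a distance-based model (e.g. k-NN or
# k-means) sees activity 4 as three times farther from activity 1 than
# activity 2 is, even though all three are just different categories.
codes = np.array([1, 2, 4])
print(np.abs(codes[0] - codes[1]))  # 1
print(np.abs(codes[0] - codes[2]))  # 3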

If you think that any pair of activities should be equally distant from each other, because those are just different activities, then OneHotEncoder is your choice.
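
A minimal sketch of how that could look, assuming df is the DataFrame from the question (file and column names as above; get_feature_names_out requires a reasonably recent scikit-learn):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('actTrain.csv')

# Fit the encoder on the single categorical column; .toarray() turns the
# sparse output into a dense array.
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['activity_type']]).toarray()

# Put the one-hot columns back into the DataFrame under readable names.
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['activity_type']),
    index=df.index,
)
df = pd.concat([df.drop(columns=['activity_type']), encoded_df], axis=1)

pd.get_dummies(df, columns=['activity_type']) gives the same result for a single DataFrame; the sklearn encoder is preferable inside a Pipeline because it remembers the categories it saw during fit and applies them consistently at prediction time.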

Correct answer by lanenok on March 16, 2021
