TransWikia.com

How to handle categorical features in K-means?

Data Science Asked by Sathya on January 11, 2021

I am working on clustering algorithms with the Titanic dataset, which contains 6 categorical features. I ran the k-means algorithm on this dataset, using label encoding for the categorical features. But I found that categorical features should not use Euclidean distance; they should use Hamming distance. So, how can I make k-means work well on mixed features? I don’t need another algorithm. I just want to use k-means on a dataset with mixed features.

3 Answers

You can quantify the correlation, or more precisely the association, between categorical variables using an entropy-based measure, something like conditional entropy (for example Theil’s U). The dython library computes such association values. Also, I am curious: why do you want to do clustering? What is your expected output?
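As a rough illustration of an entropy-based association measure, here is a minimal hand-written sketch of Theil's U (the uncertainty coefficient) using only the standard library. It is not dython's implementation; the function names and the toy `sex` and `title` lists are made up for this example.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) of a list of categories, in nats."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) for two equally long lists of categories."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    return -sum(
        (c / n) * log(c / y_counts[y_val]) for (_, y_val), c in xy_counts.items()
    )


def theils_u(x, y):
    """Fraction of the uncertainty in x that is explained by knowing y (0 to 1)."""
    h_x = entropy(x)
    if h_x == 0:
        # Convention assumed here: a constant x is trivially "fully explained".
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x


# Toy data invented for illustration.
sex = ["m", "m", "f", "f"]
title = ["Mr", "Mr", "Mrs", "Miss"]

u_sex_given_title = theils_u(sex, title)  # title fully determines sex
u_title_given_sex = theils_u(title, sex)  # sex only partly determines title
print(u_sex_given_title, u_title_given_sex)
```

Note that the measure is asymmetric: knowing the title tells you the sex, but not the other way round, so the two values differ.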

Answered by Victor Luu on January 11, 2021

Label encoding is not a good idea if the categories are not ordinal (it is not my favorite anyway). Use one-hot encoding and see how it works. You may apply feature extraction on top of it, e.g. PCA, to reduce the noise coming from sparsity. Another idea is to encode each category by its relative frequency in the feature, for example:

[a,b,b,c,a,a] --> [3/6, 2/6, 2/6, 1/6, 3/6, 3/6]
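The mapping above could be reproduced with a few lines of Python. This is only a sketch using the standard library; `frequency_encode` is a name chosen for this example.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


encoded = frequency_encode(["a", "b", "b", "c", "a", "a"])
print(encoded)  # a -> 3/6, b -> 2/6, c -> 1/6
```

One caveat with this encoding: two different categories that happen to occur equally often receive the same value, so k-means can no longer tell them apart.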

Answered by Kasra Manshaei on January 11, 2021

The best way to encode the data is through an encoding mechanism such as a label encoder. But before handling a categorical variable, check its association with the target variable using a feature selection method such as the chi-square test with SelectKBest.
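A minimal sketch of that check with scikit-learn might look like the following. It assumes a target column is available (plain clustering has none), and the data below is invented for illustration rather than taken from the Titanic dataset. The categories are one-hot encoded first because `chi2` expects non-negative numeric features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration: columns are (sex, embarked).
X = np.array([
    ["female", "S"],
    ["female", "C"],
    ["female", "S"],
    ["female", "C"],
    ["male", "S"],
    ["male", "C"],
    ["male", "S"],
    ["male", "C"],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # hypothetical target, e.g. survived

# One-hot encode the categories; the result is a sparse 0/1 matrix.
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X)

# Keep the 2 one-hot columns with the highest chi-square score against y.
selector = SelectKBest(score_func=chi2, k=2).fit(X_onehot, y)

# Build readable names for the one-hot columns, in the encoder's column order.
column_names = [
    f"{feature}={category}"
    for feature, categories in zip(["sex", "embarked"], encoder.categories_)
    for category in categories
]
selected = [name for name, keep in zip(column_names, selector.get_support()) if keep]
print(selected)
```

In this toy data the target depends only on `sex`, so the two `sex` indicator columns are the ones that get selected.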

Answered by Ubaid Usmani on January 11, 2021
