Guidance needed with dimension reduction for clustering - some numerical, lots of categorical data

Question

I have my data in a pandas DataFrame with 25,000 rows and 1,500 columns, without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are binary columns that originated from categorical columns (I used pd.get_dummies() for this).

Now I'd like to reduce the dimensions. I have been running

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(df)

for three hours, and I asked myself whether my approach was correct. I also saw two variants of PCA, one of them for sparse data. Does that mean it doesn't make sense to run PCA in such a mixed scenario?

Since I have so far been busy cleaning and transforming my data, I'd like to understand what a good strategy would be for eliminating irrelevant columns.

I'd appreciate some hints to move forward.

Michael_S · Answer

There are many ways to get rid of redundant dimensions. The choice of whether to do it or not depends on what kind of problem you want to solve and what kind of algorithm you plan to use.
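As one possible illustration (a rough sketch, not necessarily the right choice for your problem), here is a two-step reduction on mostly binary data: first drop dummy columns that almost never fire, then project the sparse matrix onto two components. The data below is synthetic and smaller than yours so that it runs quickly, and the sizes, the 0.02 firing rate, and the 0.001 variance threshold are all assumptions chosen for the example. TruncatedSVD is swapped in for PCA here because it accepts sparse input directly; note that, unlike PCA, it does not center the data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in for the data in the question (assumed sizes, smaller
# than the real data): standardized numeric columns plus 0/1 dummy columns.
rng = np.random.default_rng(0)
n_rows, n_numeric, n_dummies = 2000, 30, 300
numeric = rng.standard_normal((n_rows, n_numeric))
dummies = (rng.random((n_rows, n_dummies)) < 0.02).astype(np.float64)
dummies[:, :50] = 0.0  # pretend 50 dummy columns never fire

# Store everything as a sparse matrix so the many zeros are not kept in memory.
X = sparse.csr_matrix(np.hstack([numeric, dummies]))

# Step 1: drop columns whose variance is at or below the (assumed) threshold.
selector = VarianceThreshold(threshold=0.001)
X_reduced = selector.fit_transform(X)

# Step 2: project onto two components; TruncatedSVD works on sparse input.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_reduced)

print(X.shape, X_reduced.shape, X_2d.shape)
print(svd.explained_variance_ratio_)
```

Whether two components keep enough information for clustering is a separate question; explained_variance_ratio_ gives a first impression of how much variance the projection retains.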
