
Normalization before PCA in NLP domains?

Data Science Asked on August 13, 2021

I’m working on a basic bag-of-words toy NLP pipeline for sentiment analysis using scikit-learn. From reading other questions here, it seems that the usual scaler to apply before PCA is `StandardScaler`. However, given that this is an NLP domain with many comparable features (term counts), could `Normalizer` be considered instead?
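To make the distinction concrete, here is a small sketch (with a made-up term-count matrix) of how the two transformers differ: `StandardScaler` operates per feature (column), while `Normalizer` operates per sample (row).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Toy term-count matrix: 3 documents x 4 vocabulary terms (hypothetical data).
X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 4, 0]], dtype=float)

# StandardScaler works per FEATURE (column): zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # each column is centered at ~0

# Normalizer works per SAMPLE (row): each document vector is
# rescaled to unit L2 norm, so only relative term proportions remain.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # each row has norm 1
```

So the two are not interchangeable in general: one standardizes vocabulary terms across the corpus, the other discards document length.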

For this dataset, I tried `StandardScaler` before PCA, but all categories ended up tightly clustered around (0, 0). If I replace it with `Normalizer`, the data is much more spread out and some clusters start to form. Could this also be due to dataset size? I have about 250 labeled documents, and at this stage I’m only looking for clusters.

My pipeline is `CountVectorizer(ngram_range=(1, 1))` -> (either `StandardScaler` or `Normalizer`) -> `PCA(n_components=2)`.
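The pipeline above can be sketched as follows (the four-document mini-corpus is hypothetical, standing in for the ~250 labeled documents; a densifying step is added because `CountVectorizer` emits a sparse matrix and `PCA` expects a dense array in most scikit-learn versions):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

# Hypothetical mini-corpus standing in for the real labeled documents.
docs = [
    "great movie, loved it",
    "terrible plot, hated it",
    "loved the acting",
    "hated the ending",
]

# Convert the sparse count matrix to dense so PCA accepts it.
densify = FunctionTransformer(lambda X: X.toarray())

pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),  # unigram bag of words
    Normalizer(),  # or StandardScaler(with_mean=False) to keep sparsity
    densify,
    PCA(n_components=2),
)

coords = pipe.fit_transform(docs)
print(coords.shape)  # one 2-D point per document
```

Note that plain `StandardScaler()` raises an error on sparse input unless `with_mean=False` is set (centering would destroy sparsity), which is one practical reason the two scalers behave so differently in this pipeline.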

[figure: PCA projection after StandardScaler]

Above is the standard-scaled version.

[figure: PCA projection after Normalizer]

And here is the normalized version.
