
Normalization before PCA in NLP domains?

Data Science Asked on August 13, 2021

I’m working on a basic bag-of-words toy NLP pipeline for sentiment analysis using scikit-learn. From reading other questions here, it seems that the usual scaler to apply before PCA is `StandardScaler`. However, given that this is an NLP domain with many comparable features (term counts), could `Normalizer` be considered instead?
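To make the distinction concrete, here is a small sketch (with a made-up term-count matrix) of how the two transformers differ: `StandardScaler` operates per feature (column), while `Normalizer` operates per sample (row).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Toy term-count matrix: 3 documents x 4 vocabulary terms (hypothetical data).
X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 4, 0]], dtype=float)

# StandardScaler works per FEATURE (column): zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # each column is centered at ~0

# Normalizer works per SAMPLE (row): each document vector is
# rescaled to unit L2 norm, so only relative term proportions remain.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # each row has norm 1
```

So the two are not interchangeable in general: one standardizes vocabulary terms across the corpus, the other discards document length.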

For this dataset, I tried `StandardScaler` before PCA, but all categories ended up tightly clustered around (0, 0). If I replace it with `Normalizer`, the data is much more spread out and some clusters start to form. Could this also be due to dataset size? I have about 250 labeled documents, and at this stage I’m only looking for clusters.

My pipeline is `CountVectorizer(ngram_range=(1, 1))` -> (either `StandardScaler` or `Normalizer`) -> `PCA(n_components=2)`.
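The pipeline above can be sketched as follows (the four-document mini-corpus is hypothetical, standing in for the ~250 labeled documents; a densifying step is added because `CountVectorizer` emits a sparse matrix and `PCA` expects a dense array in most scikit-learn versions):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

# Hypothetical mini-corpus standing in for the real labeled documents.
docs = [
    "great movie, loved it",
    "terrible plot, hated it",
    "loved the acting",
    "hated the ending",
]

# Convert the sparse count matrix to dense so PCA accepts it.
densify = FunctionTransformer(lambda X: X.toarray())

pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),  # unigram bag of words
    Normalizer(),  # or StandardScaler(with_mean=False) to keep sparsity
    densify,
    PCA(n_components=2),
)

coords = pipe.fit_transform(docs)
print(coords.shape)  # one 2-D point per document
```

Note that plain `StandardScaler()` raises an error on sparse input unless `with_mean=False` is set (centering would destroy sparsity), which is one practical reason the two scalers behave so differently in this pipeline.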

[figure: PCA projection after StandardScaler]

Above is the standard-scaled version.

[figure: PCA projection after Normalizer]

And here is the normalized version.
