Is smoothing in NLP ngrams done on test data or train data?

Data Science Asked by Hing Wong on November 30, 2020

The purpose of smoothing is to keep the language model from assigning 0 probability to n-grams from an unseen (test) corpus. So I wonder: is smoothing done on the test data only, on the train data only, or on both? I haven't been able to find an answer to this yet.

In short: both.

Smoothing consists of slightly modifying the estimated probability of each n-gram, so the calculation (for instance add-one smoothing) must be done at the training stage, since that is when the probabilities of the model are estimated.

But smoothing usually also changes what happens at the testing stage: in particular, an n-gram that never appeared in the training data is assigned a small nonzero probability instead of 0.
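
To make the two stages concrete, here is a minimal sketch of add-one (Laplace) smoothing for a bigram model. The function names and the toy sentence are made up for illustration, and the vocabulary is assumed to be fixed from the training data; words that never occur in training would need something like an unknown-word token, which is not shown here.

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Training stage: collect the counts that the smoothed estimates use."""
    contexts = Counter(tokens[:-1])             # how often each word starts a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each bigram occurs
    vocab = set(tokens)                         # vocabulary fixed at training time
    return contexts, bigrams, vocab

def add_one_prob(prev, word, contexts, bigrams, vocab_size):
    """Add-one estimate of P(word | prev): (count + 1) / (context count + V)."""
    # Counter returns 0 for a bigram that was never seen, so the result is
    # still nonzero for unseen bigrams.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

# Hypothetical toy training corpus.
train_tokens = "the cat sat on the mat".split()
contexts, bigrams, vocab = train_bigram_counts(train_tokens)

# Testing stage: query a bigram seen in training and one that was not.
p_seen = add_one_prob("the", "cat", contexts, bigrams, len(vocab))
p_unseen = add_one_prob("the", "sat", contexts, bigrams, len(vocab))
print(p_seen, p_unseen)
```

The counts come only from the training data; the test data is never used to estimate anything. The effect shows up at test time, when an unseen bigram such as ("the", "sat") gets a small nonzero probability instead of 0.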

Answered by Erwan on November 30, 2020
