How to compute unseen bi-grams in a corpus (for Good-Turing Smoothing)

Data Science Asked by rahs on August 1, 2020

Consider a (somewhat nonsensical) sentence – “I see saw a see saw”

The observed bi-grams would be:
“I see”

“see saw” (which occurs twice)

“saw a”

and,

“a see”.

My aim is to smooth the probability mass of the bi-gram probabilities using Good-Turing smoothing. For this, I need to find the count of unseen bi-grams, i.e., bi-grams with a frequency count of 0.

How do I do this?

1) Would this be the list of all bi-grams formed from 2 words that never appear consecutively in the sentence? For example, “I saw”, “saw saw”, “a I”, etc.?

2) Would repetitions of the same word be included as bi-grams? Eg. “I I”, “see see”, etc.?

One Answer

I just remembered that we create a table with all possible words labelling the rows and the columns. The list of all possible bi-grams is therefore every ordered pair of vocabulary words, including pairs that repeat a word (e.g. “see see”). With a vocabulary of size |V|, that gives |V|² possible bi-grams, and the unseen bi-grams are those |V|² pairs minus the distinct observed ones.
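As a sketch, this row-by-column construction can be computed directly for the example sentence (variable names here are my own, not from the question):

```python
from collections import Counter
from itertools import product

sentence = "I see saw a see saw".split()

# Observed bi-grams: consecutive word pairs, with their frequency counts
observed = Counter(zip(sentence, sentence[1:]))

# Vocabulary of the corpus: {'I', 'see', 'saw', 'a'}, so |V| = 4
vocab = set(sentence)

# All possible bi-grams: every ordered pair of vocabulary words,
# including repeated words such as ('see', 'see') -> |V|^2 = 16
all_bigrams = set(product(vocab, repeat=2))

# Unseen bi-grams: possible pairs that never occurred (count 0)
unseen = all_bigrams - set(observed)

print(len(all_bigrams))  # 16 possible bi-grams
print(len(observed))     # 4 distinct observed bi-grams
print(len(unseen))       # 12 unseen bi-grams (N_0 for Good-Turing)
```

Note that `observed[('see', 'saw')]` is 2, since that bi-gram occurs twice; Good-Turing needs these raw frequencies as well as the count of zero-frequency bi-grams.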

Answered by rahs on August 1, 2020

