Python to clean miswritten words with repetitive letters such as "wwwwooorrrrddss" to "words"

Data Science Asked by Pythoner on April 29, 2021

When cleaning text data in Python 3 for NLP, are there any ‘common practices’ for handling repeated letters in words, such as turning "wwwwooorrds" into "words" or "fffunnnyyyyyy" into "funny"?

The miswritten words come from an OCR step that I am not able to fix upstream, so I thought I would check whether there is anything I can do downstream to correct them.

Thanks!

One Answer

A simple two-part solution from this site:

First, reduce any run of the same letter that is longer than two down to two letters (probably not good for Welsh):

import re

def reduce_lengthening(text):
    # Collapse any character repeated three or more times down to two
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

print(reduce_lengthening("finallllllly"))  # finally
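
Because `.` matches any character, a backreference pattern of this kind also collapses runs of digits or punctuation (for example, "10000" would become "100"). Below is a minimal sketch of a letters-only variant. The name `reduce_letter_runs` and the restriction to ASCII letters are assumptions made for illustration, not part of the original answer. It also shows that the examples from the question still contain doubled letters after this step, so a dictionary or spell-check pass is still needed.

```python
import re

# Assumption: only ASCII letters are collapsed; digits and punctuation are left alone.
_LETTER_RUN = re.compile(r"([A-Za-z])\1{2,}")

def reduce_letter_runs(text):
    """Collapse runs of three or more identical letters down to two."""
    return _LETTER_RUN.sub(r"\1\1", text)

print(reduce_letter_runs("wwwwooorrrrddss"))  # wwoorrddss
print(reduce_letter_runs("fffunnnyyyyyy"))    # ffunnyy
print(reduce_letter_runs("invoice 10000"))    # invoice 10000 (unchanged)
```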

Then use pattern.en to check the spelling:

from pattern.en import suggest

word = "amazzziiing"
word_wlf = reduce_lengthening(word)  # calling the function defined above
print(word_wlf)  # amazziing: reducing the lengthening alone does not fix it completely

# suggest() returns a list of (word, confidence) tuples
correct_word = suggest(word_wlf)
print(correct_word)

NLTK is another common toolkit that can help with this.
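
If a word list is available, another option is to skip the "reduce to two" heuristic and try every way of shrinking each repeated run to one or two letters, keeping the candidates that appear in the list. The following is only a sketch under stated assumptions: `VOCABULARY` is a hypothetical toy set, and `candidates` and `restore` are illustrative names. In practice the word list might come from a domain glossary or from NLTK's `words` corpus.

```python
from itertools import groupby, product

# Hypothetical toy vocabulary; in practice this could be a domain word list
# or a dictionary such as the one returned by nltk.corpus.words.words().
VOCABULARY = {"words", "funny", "amazing", "finally", "god", "good"}

def candidates(word):
    """Yield every spelling made by shrinking each run of a repeated
    character to one or two copies."""
    runs = [(char, len(list(group))) for char, group in groupby(word)]
    options = [[char] if count == 1 else [char, char * 2] for char, count in runs]
    for combination in product(*options):
        yield "".join(combination)

def restore(word, vocabulary=VOCABULARY):
    """Return the candidates found in the vocabulary (possibly none or several)."""
    return [c for c in candidates(word) if c in vocabulary]

print(restore("wwwwooorrrrddss"))  # ['words']
print(restore("fffunnnyyyyyy"))    # ['funny']
print(restore("goooood"))          # ['god', 'good']
```

The number of candidates doubles with every repeated run, which is fine for single words but worth keeping in mind for very noisy input. When several candidates match, as with "goooood", the choice would need word frequencies or the surrounding context.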

Answered by lys on April 29, 2021
