TransWikia.com

Text classification using Python

Software Recommendations Asked by Ubaid Butt on September 25, 2021

I am working on a text classification task in Python using NLP and scikit-learn.
I need to remove random strings from my text. I know I can remove stop words and punctuation with standard NLP tools, but I am asking about completely random strings like (‘ncdjbcjdkckdvcj’, ‘khsjgcgjcbjbcj’, ‘kdhjgcjgjc’, ‘jsbjsgucgugcus’), the kind you get by typing randomly on the keyboard. Note that my text also contains misspelled words and short forms. I don’t want to remove those, only strings like the ones above.
Is there a Python module or an external solution that can help with this problem?

2 Answers

LibreOffice offers a collection of word lists (spelling dictionaries) in a variety of languages. You could use the PyEnchant library to check each word against the LibreOffice dictionaries to see whether it is a valid word or just garbage. Look here for some clues on using the LibreOffice libraries with PyEnchant.
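A minimal sketch of the dictionary check described above, assuming PyEnchant and an en_US dictionary are installed. The helper names load_enchant_checker and drop_unknown_tokens are hypothetical and not part of any library; enchant.Dict(...).check is the PyEnchant call that tests a single word. How to point Enchant at LibreOffice's dictionary files depends on the installation and is not shown here.

```python
def load_enchant_checker(lang="en_US"):
    """Return a word -> bool function backed by PyEnchant, or None if unavailable.

    Assumption: PyEnchant is installed (pip install pyenchant) together with
    a dictionary for `lang`.
    """
    try:
        import enchant  # third-party package
        return enchant.Dict(lang).check
    except Exception:  # library or dictionary not installed
        return None


def drop_unknown_tokens(tokens, is_word):
    """Keep only the non-empty tokens that the `is_word` callable accepts."""
    return [tok for tok in tokens if tok and is_word(tok)]
```

With PyEnchant available, drop_unknown_tokens(tokens, load_enchant_checker()) would filter a token list. A strict dictionary check like this also drops misspellings and short forms, so it probably works best as a first pass rather than the whole solution.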

Answered by GBG on September 25, 2021

You could use dictionaries for your target language (like the words corpus in nltk.corpus), plus a list of special terms related to your topic, and then use fuzzy string matching (with a library like fuzzywuzzy) to keep every word that is similar to a real word.
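A rough sketch of the fuzzy-matching idea, using difflib from the standard library as a stand-in for fuzzywuzzy so that it runs without extra installs. The function name keep_near_words, the toy vocabulary, and the 0.8 cutoff are assumptions for illustration; in practice the vocabulary could come from the NLTK words corpus plus your domain terms, and the cutoff would need tuning on real data.

```python
import difflib


def keep_near_words(tokens, vocabulary, cutoff=0.8):
    """Keep tokens that are close to at least one vocabulary word.

    `cutoff` is a similarity ratio between 0 and 1; higher is stricter.
    """
    vocab = sorted({w.lower() for w in vocabulary})
    kept = []
    for tok in tokens:
        # get_close_matches returns an empty list when nothing is similar enough
        if difflib.get_close_matches(tok.lower(), vocab, n=1, cutoff=cutoff):
            kept.append(tok)
    return kept


# Hypothetical toy vocabulary; replace with a real word list.
vocabulary = ["classification", "python", "text", "remove"]
tokens = ["clasification", "text", "kdhjgcjgjc"]
cleaned = keep_near_words(tokens, vocabulary)
```

Comparing every token against a large dictionary this way is slow, so it may help to run it only on tokens that fail an exact dictionary lookup first.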

Alternatively, depending on the amount and quality of your data, you could simply remove every word that is not in any dictionary and occurs only once in the whole data set. You will lose some rare misspellings, but also most of the random gibberish.
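The frequency-based alternative could look roughly like this. It assumes the documents are already tokenized into lists of strings; drop_rare_unknown and the sample data are hypothetical names and values chosen for illustration.

```python
from collections import Counter


def drop_rare_unknown(docs, vocabulary):
    """Remove tokens that are outside the vocabulary AND occur only once overall."""
    vocab = {w.lower() for w in vocabulary}
    # Count each token across the whole corpus, ignoring case
    counts = Counter(tok.lower() for doc in docs for tok in doc)
    return [
        [tok for tok in doc if tok.lower() in vocab or counts[tok.lower()] > 1]
        for doc in docs
    ]


# Hypothetical sample: "prodct" is a repeated misspelling, the long string is gibberish.
docs = [["great", "prodct", "jsbjsgucgugcus"], ["prodct", "works"]]
filtered = drop_rare_unknown(docs, vocabulary=["great", "works"])
```

A misspelling that appears more than once survives, while a one-off random string is dropped, which matches the trade-off described above.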

Answered by quassy on September 25, 2021
