TransWikia.com

How to remove irrelevant text data from a large dataset

Data Science Asked by zxcisnoias on March 16, 2021

I am working on an ML project with data collected from social media. The topic of the data is supposed to be depression during Covid-19. However, when I read some of the retrieved texts, I noticed that a small share (around 1-5%) mention Covid-related keywords but are not actually about the pandemic: they tell a life story (from age 5 to age 27) rather than describing how Covid affects the author's life.
The data I am looking for are texts in which people describe how Covid makes their depression worse, and the like.
Is there a general way to remove the irrelevant texts whose context is not Covid-related (that is, to treat them as outliers)?
Or is it OK to keep them in the dataset, since they account for only 1-5% of it?

One Answer

You can use BERT to create vectors that capture the context of the whole tweet. Once you have done that, try clustering the vectors (K-Means or GMM). You can then inspect the clusters that are found and separate out the unwanted data.
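
A rough sketch of that embed-then-cluster idea follows; it is an illustration under stated assumptions, not code from the original answer. The helper names (`embed_with_bert`, `kmeans`, `group_by_cluster`) are hypothetical. The embedding step assumes the `sentence-transformers` package is installed and uses `all-MiniLM-L6-v2`, a small BERT-family model picked here only as an example. The clustering step is a plain k-means written out by hand with a simple farthest-point initialisation, so that it runs without extra dependencies.

```python
from collections import defaultdict


def embed_with_bert(texts, model_name="all-MiniLM-L6-v2"):
    """Return one vector per text.

    Assumption: sentence-transformers is installed; the model name is
    only an example of a BERT-style sentence encoder.
    """
    from sentence_transformers import SentenceTransformer  # lazy import
    model = SentenceTransformer(model_name)
    return model.encode(list(texts)).tolist()


def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(vectors, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per vector.

    Initialisation is a simplification: start from the first vector,
    then repeatedly add the vector farthest from the chosen centroids.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(_sq_dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [-1] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: _sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = _mean(members)
    return labels


def group_by_cluster(texts, labels):
    """Collect texts per cluster label so each cluster can be read by hand."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)
```

With real data you would call `vectors = embed_with_bert(texts)`, then `labels = kmeans(vectors, k)` for some small `k` you choose, and read a few texts from each group returned by `group_by_cluster(texts, labels)` to see whether one cluster collects the life-story posts. If scikit-learn is available, its `KMeans` or `GaussianMixture` classes are the more usual choice for the clustering step.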

Answered by Abhishek Verma on March 16, 2021
