Label data set for sentiment analysis

Data Science Asked by Anuradha on June 26, 2021

I am a beginner in this field. I have a scraped review data set. It contains a review score (1–10) and the review content. I am going to label the reviews according to the review score, as follows:

0-2 -> negative, 3-6 -> neutral, 7-10 -> positive

Is it possible to label the contents directly like this? Is there a specific process for doing this? Do I need to validate my labeling?

One Answer

Is it possible to label the contents directly like this? Is there a specific process for doing this? Do I need to validate my labeling?

Yes, it's definitely possible to define sentiment classes in this way. One can reasonably assume that the review score is a good approximation of the review sentiment.

This is just a way of defining the gold standard; there is no particular process for it. It's important to realize that defining the gold standard is part of designing the task itself, as opposed to designing a system which tries to solve the task.

In some cases it makes sense to verify that whatever is used as the gold standard corresponds to the goal of the task, but in this case it's straightforward: it's safe to assume that a user who writes a review gives a score which corresponds to their overall sentiment.
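As a minimal sketch of the mapping described in the question, the score can be turned into a class with a small function. The name `score_to_label` and the sample reviews are made up for illustration; only the cut-offs (0-2, 3-6, 7-10) come from the question.

```python
def score_to_label(score: int) -> str:
    """Map a review score (0-10) to a sentiment class.

    Cut-offs follow the question: 0-2 negative, 3-6 neutral, 7-10 positive.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score <= 2:
        return "negative"
    if score <= 6:
        return "neutral"
    return "positive"


# Hypothetical scraped reviews as (score, text) pairs.
reviews = [(1, "Terrible."), (5, "It was fine."), (9, "Loved it.")]

# Keep the original score next to the derived label so no information is thrown away.
labeled = [(text, score, score_to_label(score)) for score, text in reviews]

for text, score, label in labeled:
    print(f"{score:>2}  {label:<8}  {text}")
```

Keeping the raw score alongside the label makes it easy to change the cut-offs later without going back to the scraped data.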

Even if this is a reasonable design, it's also important to notice the limitations:

  • By discretizing the score into 3 classes, the score information is simplified. For example, the difference between a 7 and a 10 is lost.
  • The arbitrary cut-off points cause a threshold effect. Normally there is less difference between a 2 and a 3 than between a 3 and a 6, but the classes reverse this relation.
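Both limitations can be seen by comparing a few score pairs under the question's cut-offs. This is only an illustrative sketch: `LABELS`, `UPPER_BOUNDS` and `score_to_label` are hypothetical names, and the pairs are the ones mentioned in the list above.

```python
from bisect import bisect_left

LABELS = ["negative", "neutral", "positive"]
UPPER_BOUNDS = [2, 6]  # highest score still counted as negative / as neutral


def score_to_label(score: int) -> str:
    # bisect_left counts how many upper bounds are strictly below the score,
    # which is exactly the index of the class the score falls into.
    return LABELS[bisect_left(UPPER_BOUNDS, score)]


# (score a, score b, gap between them, whether they end up in the same class)
comparisons = [
    (a, b, b - a, score_to_label(a) == score_to_label(b))
    for a, b in [(2, 3), (3, 6), (7, 10)]
]

for a, b, gap, same_class in comparisons:
    print(f"{a} vs {b}: score gap = {gap}, same class = {same_class}")
```

The pair (2, 3) differs by one point but is split across two classes, while (3, 6) and (7, 10) differ by three points and are each merged into a single class.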

Note that sentiment analysis does not have to be a classification task (predicting a categorical variable); it can also be defined as a regression task (predicting a numerical variable). In that case the target variable could be the score itself, which would avoid some of the problems mentioned above. This is also a design choice, and it depends mostly on what the application is for.
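To make the difference between the two formulations concrete, here is a small sketch that evaluates the same predictions both ways. The predicted scores are invented for illustration (no model is trained here), and mean absolute error and accuracy are just one common choice of metric for each view.

```python
def score_to_label(score: float) -> str:
    # Cut-offs from the question: 0-2 negative, 3-6 neutral, 7-10 positive.
    if score <= 2:
        return "negative"
    if score <= 6:
        return "neutral"
    return "positive"


true_scores = [7, 10, 3]
predicted_scores = [10, 7, 2]  # hypothetical model output, not real results

# Regression view: the target is the score, errors are measured in score points.
mae = sum(abs(t - p) for t, p in zip(true_scores, predicted_scores)) / len(true_scores)

# Classification view: the target is the class, a prediction is either right or wrong.
true_labels = [score_to_label(s) for s in true_scores]
predicted_labels = [score_to_label(s) for s in predicted_scores]
accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)

print(f"mean absolute error: {mae:.2f}")
print(f"accuracy: {accuracy:.2f}")
```

In the classification view, confusing a 7 with a 10 costs nothing while confusing a 3 with a 2 counts as a full error; in the regression view the first mistake is penalized three times as much as the second.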

Correct answer by Erwan on June 26, 2021
