
How to assign labels/score to data using machine learning

Asked on Stack Overflow by user12907213 on January 1, 2021

I have a dataframe with many rows, each containing a tweet. I would like to classify them using a machine learning technique (supervised or unsupervised).
Since the dataset is unlabelled, I thought I would select some rows (50%) to label manually (+1 positive, -1 negative, 0 neutral), then use machine learning to assign labels to the remaining rows.
In order to do this, I did as follows:

Original Dataset

Date                   ID        Tweet                         
01/20/2020           4141    The cat is on the table               
01/20/2020           4142    The sky is blue                       
01/20/2020           53      What a wonderful day                  
...
05/12/2020           532     In this extraordinary circumstance we are together   
05/13/2020           12      It was a very bad decision            
05/22/2020           565     I know you are the best              
  1. Split the dataset into 50% train and 50% test. I manually labelled the 50% used for training as follows:

    Date                   ID        Tweet                          PosNegNeu
     01/20/2020           4141    The cat is on the table               0
     01/20/2020           4142    The weather is bad today              -1
     01/20/2020           53      What a wonderful day                  1
     ...
     05/12/2020           532     In this extraordinary circumstance we are together   1
     05/13/2020           12      It was a very bad decision            -1
     05/22/2020           565     I know you are the best               1
    

Then I extracted word frequencies (after removing stopwords); a short sketch of this step follows the table:

 Word          Frequency
 bad               2
 circumstance      1
 best              1
 day               1
 today             1
 wonderful         1

….
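
A minimal sketch of this frequency step, assuming collections.Counter and an illustrative stopword list (in practice one would use e.g. NLTK's list), could be:

from collections import Counter

tweets = ["The weather is bad today",
          "What a wonderful day",
          "It was a very bad decision",
          "I know you are the best"]

# Illustrative stopword list (not a real linguistic resource)
stopwords = {"the", "is", "a", "what", "it", "was", "i", "you", "are", "very"}

words = [w for tweet in tweets for w in tweet.lower().split()
         if w not in stopwords]
print(Counter(words).most_common())
# e.g. [('bad', 2), ('weather', 1), ('today', 1), ('wonderful', 1), ...]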

I would like to try to assign labels to the other data based on:

  • words within the frequency table, for example: "if a tweet contains bad, then assign -1; if a tweet contains wonderful, assign 1" (i.e. I should create a list of strings and a rule, as sketched below);
  • sentence similarity (e.g. using the Levenshtein distance).
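
For the rule-based idea, a minimal sketch could look like the following; the keyword lists and the use of difflib.SequenceMatcher as a stand-in for Levenshtein distance are purely illustrative assumptions, not a fixed recipe:

import pandas as pd
from difflib import SequenceMatcher

# Illustrative keyword lists, e.g. built from the frequency table above
positive_words = {"wonderful", "best"}
negative_words = {"bad"}

def rule_label(tweet: str) -> int:
    """Return 1, -1 or 0 depending on which keyword list the tweet hits."""
    words = set(tweet.lower().split())
    if words & negative_words:
        return -1
    if words & positive_words:
        return 1
    return 0

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], a stand-in for Levenshtein distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

df = pd.DataFrame({"Tweet": ["It is such a wonderful day",
                             "Laura's pen is black",
                             "I think it is a very bad decision"]})
df["PosNegNeu"] = df["Tweet"].apply(rule_label)
print(df)
print(similarity("What a wonderful day", "It is such a wonderful day"))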

I know that there are several, even better, ways to do this, but I am having trouble classifying/assigning labels to my data, and I cannot do it all manually.

My expected output, e.g. with the following test dataset

Date                   ID        Tweet                                   
06/12/2020           43       My cat 'Sylvester' is on the table            
07/02/2020           75       Laura's pen is black                                                
07/02/2020           763      It is such a wonderful day                                    
...
11/06/2020           1415    No matter what you need to do                  
05/15/2020           64      I disagree with you: I think it is a very bad decision           
12/27/2020           565     I know you can improve                         

should be something like

Date                   ID        Tweet                                   PosNegNeu
06/12/2020           43       My cat 'Sylvester' is on the table            0
07/02/2020           75       Laura's pen is black                          0                       
07/02/2020           763      It is such a wonderful day                    1                
...
11/06/2020           1415    No matter what you need to do                  0  
05/15/2020           64      I disagree with you: I think it is a very bad decision  -1          
12/27/2020           565     I know you can improve                         0   

Probably a better approach would be to consider n-grams rather than single words, or to build a corpus/vocabulary to assign a score and then a sentiment. Any advice would be greatly appreciated, as this is my first exercise in machine learning. I also think k-means clustering could be applied to group similar sentences.
If you could provide a complete example (with my data would be great, but other data would be fine as well), I would really appreciate it.
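
As a hedged illustration of the n-gram and clustering ideas above (TfidfVectorizer with ngram_range and KMeans are illustrative choices, not a prescribed solution):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["What a wonderful day",
          "It is such a wonderful day",
          "It was a very bad decision",
          "I think it is a very bad decision",
          "The cat is on the table"]

# Unigrams and bigrams, English stopwords removed
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(tweets)

# Group similar tweets; the clusters are unsupervised, so they only roughly
# correspond to positive/negative/neutral and still need to be interpreted
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
print(kmeans.fit_predict(X))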

2 Answers

I propose analysing each sentence (a tweet, in this context) for polarity. This can be done using the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text is found, it can be stored as a separate column in the dataframe and used for further analysis.

Initial Code

from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

Intermediate Result

    Date     ...                                  sentiment
0  1/1/2020  ...                                 (0.0, 0.0)
1  2/1/2020  ...                                 (0.0, 0.0)
2  3/2/2020  ...                                 (0.0, 0.1)
3  4/2/2020  ...  (-0.6999999999999998, 0.6666666666666666)
4  5/2/2020  ...                                 (0.5, 0.6)

[5 rows x 4 columns]

From the sentiment column in the above output, we can see that each sentiment consists of two values: polarity and subjectivity.

Polarity is a float in the range [-1.0, 1.0], where 0 indicates a neutral sentiment, +1 a very positive sentiment and -1 a very negative sentiment.

Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions or speculations, whereas objective sentences are factual.

Notice that the sentiment column holds a (named) tuple, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can create a new dataframe to which I'll append the split columns, as shown:

df_new = df
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)

Finally, based on the sentence polarity found earlier, we can add a label to the dataframe indicating whether the tweet is positive, negative or neutral.

import numpy as np
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)

The result will look like this:

Final Result

       Date  ID                 Tweet  ... polarity  subjectivity     label
0  1/1/2020   1  the weather is sunny  ...      0.0      0.000000   neutral
1  2/1/2020   2       tom likes harry  ...      0.0      0.000000   neutral
2  3/2/2020   3       the sky is blue  ...      0.0      0.100000   neutral
3  4/2/2020   4    the weather is bad  ...     -0.7      0.666667  negative
4  5/2/2020   5         i love apples  ...      0.5      0.600000  positive

[5 rows x 7 columns]

Data

import pandas as pd

# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
    "ID":[1,2,3,4,5],
    "Tweet":["the weather is sunny",
             "tom likes harry", "the sky is blue",
             "the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)

Full Code

# create some dummy data
import pandas as pd
import numpy as np

# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
        "ID":[1,2,3,4,5],
        "Tweet":["the weather is sunny",
                 "tom likes harry", "the sky is blue",
                 "the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)

from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

# split the sentiment column into two
df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)

# append cols to original dataframe
df_new = df
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
print(df_new)

# add label to dataframe based on condition
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)

Correct answer by maverick on January 1, 2021

IIUC, you have a percentage of the data labelled and need to label the remaining data. I would recommend reading about semi-supervised machine learning.

Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data)

Sklearn provides quite an extensive variety of algorithms that can assist with this. Do check this out.

If you need more insight into this topic I would highly recommend checking this article out as well.

Here is an example with the iris data set:

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation

#Init
label_prop_model = LabelPropagation()
iris = datasets.load_iris()

#Randomly create unlabelled samples
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
labels = np.copy(iris.target)
labels[random_unlabeled_points] = -1

# propagate labels over the remaining unlabelled data
label_prop_model.fit(iris.data, labels)
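
To bridge back to the tweet data in the question, here is a minimal sketch of the same idea applied to text; the TF-IDF features, the tiny hand-made dataframe and the class remapping are illustrative assumptions. Note that sklearn marks unlabelled samples with -1, which clashes with the question's -1 sentiment class, so the classes are remapped first.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

# Hypothetical dataframe: labelled rows carry -1/0/1, unlabelled rows carry NaN
df = pd.DataFrame({
    "Tweet": ["What a wonderful day",
              "It was a very bad decision",
              "The cat is on the table",
              "It is such a wonderful day",
              "I think it is a very bad decision"],
    "PosNegNeu": [1, -1, 0, np.nan, np.nan],
})

# Vectorise the tweets (converted to a dense array for simplicity)
X = TfidfVectorizer(stop_words="english").fit_transform(df["Tweet"]).toarray()

# Remap sentiment classes {-1, 0, 1} -> {0, 1, 2} and mark NaN (unlabelled) as -1
y = df["PosNegNeu"].map({-1: 0, 0: 1, 1: 2}).fillna(-1).astype(int)

model = LabelPropagation()
model.fit(X, y)

# Map the propagated classes back to the original -1/0/1 scheme
df["Predicted"] = pd.Series(model.transduction_).map({0: -1, 1: 0, 2: 1})
print(df)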

Answered by Akshay Sehgal on January 1, 2021
