Get frequency of tokens by group in pandas

Asked by Ravanelli on Stack Overflow, November 22, 2021

I have a pandas DataFrame with a title column, containing titles of online shopping products, classified by category:

df

category    title 
electronics ALLDOCUBE iPlay 7T 4G LTE Kids Tablet 6.98" HD iPS Android 9.0 Tablets 16GB ROM Support 256G Expansion Dual Ai 4 Core Type C GPS
electronics Alldocube iPlay8 pro 8 inch Tablet Android 9.0 MTK MT8321 Quad core 3G Calling Tablet PC RAM 2GB ROM 32GB 800*1280 IPS OTG
accessories Alldocube iPlay10 Pro 10.1 inch Wifi Tablet Android 9.0 MT8163 quad core 1200*1920 IPS Tablets PC RAM 3GB ROM 32GB HDMI OTG
clothing    ALLDOCUBE iPlay10 Pro Tablet 10.1 3GB RAM 32GB ROM Android 9.0 MT8163 Quad Core Tablet PC 1920 x 1200 IPS 6600mAh Wifi Tablet

I tried to tokenize the title column, but it returns single letters instead of words. This is what I did:

df.loc[:,['category','title']].groupby('category').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

This is what it returns:

category
accessories {'N': 1510, 'A': 1635, 'V': 498, 'I': 873, 'F': 2453, 'O': 837, 'R': 1577, 'C': 3087, 'E': 1831, ' ': 37476, 'M': 2497, 'e': 24599, 'n': 13621, 'W': 3112, 'a': 17129, 't': 11106, 'c': 6471, 'h': 4666, 's': 10988, 'r': 15707, 'p': 2774, 'o': 12459, 'f': 2069, 'S': 4262, 'i': 12812, 'l': 12888, 'Q': 333, 'u': 4711, 'z': 460, 'g': 4720, 'y': 3522, 'k': 1944, 'w': 2697, 'U': 385, '8': 338, '2': 1530, '9': 645, '1': 913, 'L': 1578, 'x': 645, 'B': 3366, 'd': 4593, 'D': 1209, "'": 221, 'm': 5425, 'P': 1709, 'G': 1906, '.': 116, 'b': 1290, 'j': 290, 'v': 1151, 'Y': 273, 'H': 1179, '5': 687, 'Z': 270, 'K': 431, '/': 346, 'J': 1346, 'X': 53, 'T': 963, '0': 1451, 'q': 219, '6': 215, '-': 237, '7': 209, ',': 96, '3': 377, '4': 555, '&': 102, '[': 21, ']': 21, '+': 42, 'ч': 3, 'а': 8, 'с': 7, 'ы': ...
electronics {'M': 1795, 'i': 6781, 'n': 4423, ' ': 22908, 'T': 1343, 'W': 1392, 'S': 3088, 'B': 1970, 'l': 4234, 'u': 2692, 'e': 10504, 't': 6545, 'o': 8519, 'h': 2655, '5': 836, '.': 783, '0': 2088, 'E': 1009, 'a': 7290, 'r': 7997, 'p': 2513, 's': 3768, 'H': 1266, 'd': 3039, '9': 422, 'D': 1474, 'f': 1088, 'c': 2560, 'I': 1000, 'k': 801, 'X': 471, 'm': 2653, '1': 1349, '"': 36, 'A': 1639, 'O': 688, 'L': 755, 'C': 2454, 'R': 1025, 'F': 1078, 'b': 1261, 'G': 1329, 'P': 2282, '6': 742, '7': 287, 'K': 442, 'w': 760, 'g': 1607, 'z': 161, 'н': 6, 'а': 5, 'у': 5, 'ш': 5, 'и': 7, 'к': 5, 'v': 547, 'V': 800, 'N': 626, '8': 623, 'J': 106, 'Q': 118, '-': 344, '4': 899, 'x': 498, 'U': 662, 'y': 1007, '3': 883, '2': 1264, 'Y': 147, '/': 337, '(': 12, ')': 10, '*': 25, '%': 11, 'j': 75, ',': 93, '+': 72, 'q': ...
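
The counts are per letter because each group element here is a raw title string, and iterating over a Python string yields single characters; for example:

title = 'ALLDOCUBE iPlay 7T'
print([w for w in title][:6])
# ['A', 'L', 'L', 'D', 'O', 'C']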

So I tried to do the same with a new column of tokenized text, but it doesn't work:

df['tokenized_text'] = df['title'].apply(word_tokenize) 

df.loc[:,['category','tokenized_text']].groupby('category').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
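
(word_tokenize here is nltk.tokenize.word_tokenize; the punkt tokenizer models need to be downloaded once:)

from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models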

EDIT
When I run

print(df['tokenized_text'].iloc[:2].tolist())

it returns a list of lists of words, as below:

[['NAVIFORCE',
  'Men',
  'Watches',
  'Waterproof',
  'Stainless',
  'Steel',
  'Quartz',
  'Watch',
  'Male',
  'Chronograph',
  'Military',
  'Clock',
  'Wrist',
  'watch',
  'Relogio',
  'Masculino'],
 ['CURREN',
  '8291',
  'Luxury',
  'Brand',
  'Men',
  'Analog',
  'Digital',
  'Leather',
  'Sports',
  'Watches',
  'Men',
  "'s",
  'Army',
  'Military',
  'Watch',
  'Man',
  'Quartz',
  'Clock',
  'Relogio',
  'Masculino']]

EDIT 2
I tried this code:

f = lambda x: pd.Series(nltk.FreqDist(x))
df.groupby('category')['title'].apply(f).reset_index()

and

f = lambda x: nltk.FreqDist(x)
df.groupby('category')['title'].apply(f).reset_index()

but both return this:

    category    level_1                                          title
0   accessories NAVIFORCE Men Watches Waterproof Stainless...    1
1   accessories CURREN 8291 Luxury Brand Men Analog Digital...   1
2   accessories PAGANI Design Brand Luxury Men Watches...        2
3   accessories NO.ONEPAUL women belt Genuine Leather New        1
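
It looks like FreqDist is counting each whole title string as a single token here, since the raw title column is a Series of strings:

import nltk

titles = ['PAGANI Design Brand', 'PAGANI Design Brand', 'CURREN 8291 Luxury']
print(nltk.FreqDist(titles).most_common())
# [('PAGANI Design Brand', 2), ('CURREN 8291 Luxury', 1)]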

One Answer

I believe you need:

f = lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist])
df.groupby('category')['tokenized_text'].apply(f)
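
For completeness, a minimal runnable sketch of the whole pipeline (the sample data here is illustrative, not the asker's real frame):

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

df = pd.DataFrame({
    'category': ['accessories', 'accessories', 'electronics'],
    'title': [
        'NAVIFORCE Men Watches Waterproof Stainless Steel',
        'CURREN 8291 Luxury Brand Men Watches',
        'ALLDOCUBE iPlay 7T 4G LTE Kids Tablet',
    ],
})

df['tokenized_text'] = df['title'].apply(word_tokenize)

# flatten each group's token lists into one stream, then count
f = lambda x: nltk.FreqDist(w for wordlist in x for w in wordlist)
print(df.groupby('category')['tokenized_text'].apply(f))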

Answered by jezrael on November 22, 2021
