How to binary encode multi-valued categorical variable from Pandas dataframe?

Data Science Asked by Denis L on August 19, 2020

Suppose we have the following dataframe with multiple values for a certain column:

    categories
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]

How can we get a table like this?

   "A"  "B"  "C"  "D"
0 - 1    1    0    0
1 - 0    1    1    1
2 - 0    1    0    1

Note: I don’t necessarily need a new DataFrame. I’m wondering how to transform such DataFrames into a format more suitable for machine learning.

One Answer

If [0, 1, 2] are numerical labels rather than the index, then pandas.DataFrame.pivot_table works:

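In []:
import pandas as pd

# Long format: one (label, category) pair per row.
data = pd.DataFrame.from_records(
    [[0, 'A'], [0, 'B'], [1, 'B'], [1, 'C'], [1, 'D'], [2, 'B'], [2, 'D']],
    columns=['number_label', 'category'])
# Count the rows for each label/category pair; absent pairs become 0.
data.pivot_table(index=['number_label'], columns=['category'], aggfunc=[len], fill_value=0)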
Out[]:
              len
category      A      B      C      D
number_label                       
0             1      1      0      0
1             0      1      1      1
2             0      1      0      1

This blog post was helpful.
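
The question stores a list per row, while the pivot approach above expects one (label, category) pair per row. A minimal sketch of bridging the two follows, assuming pandas 0.25 or later (for Series.explode); the names long_form and indicators are illustrative, and pd.crosstab is swapped in here for the counting step instead of pivot_table:

```python
import pandas as pd

# Input shaped like the question: one list of categories per row.
data2 = pd.DataFrame({'categories': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# explode() emits one row per list element and repeats the original index;
# reset_index() then turns that index into an ordinary column.
long_form = (
    data2['categories']
    .explode()
    .rename('category')
    .rename_axis('number_label')
    .reset_index()
)

# Count how often each category occurs for each row label.
indicators = pd.crosstab(long_form['number_label'], long_form['category'])
print(indicators)
```

The intermediate long_form frame has the same two columns as the hand-built data frame above, so it could also be passed to pivot_table instead.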


If [0, 1, 2] is the index, then collections.Counter is useful:

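In []:
import collections
import pandas as pd

data2 = pd.DataFrame.from_dict(
    {'categories': {0: ['A', 'B'], 1: ['B', 'C', 'D'], 2: ['B', 'D']}})
# Count the categories in each row's list.
data3 = data2['categories'].apply(collections.Counter)
# Expand the counts into columns; fill missing categories with 0 and cast back to int.
pd.DataFrame.from_records(data3).fillna(value=0).astype(int)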
Out[]:
       A      B      C      D
0      1      1      0      0
1      0      1      1      1
2      0      1      0      1
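
If only presence or absence matters, a more compact route may be worth considering. The sketch below is an alternative and not part of the answer above: it joins each list into a delimited string and lets Series.str.get_dummies split it back into indicator columns. It assumes every category is a string that never contains the separator '|'; the name dummies is illustrative.

```python
import pandas as pd

data2 = pd.DataFrame({'categories': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# Join each list into one '|'-delimited string, e.g. ['A', 'B'] -> 'A|B',
# then split on the same separator into one 0/1 column per category.
# Assumption: no category name contains '|'.
dummies = data2['categories'].str.join('|').str.get_dummies(sep='|')
print(dummies)
```

Unlike the Counter approach, this yields 1 even when a category appears more than once in a row, which matches a strictly binary encoding.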

Correct answer by Samuel Harrold on August 19, 2020
