TransWikia.com

Sorting multi-index pd.Series using pd.Categorical?

Stack Overflow Asked by np8 on December 8, 2020

Question

Why the sorting using pd.Series.sort_index does not seem to work when using categorical index? How would one sort the index of the multi-index pd.Series using other sort order than alphabetical/numerical?

MWE

Setup code

import pandas as pd 
import numpy as np

d = {
    'Card': [
        'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
        'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
        'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
    ],
    'Year': [
        'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
        'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
        'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
        'One', 'Two', 'Four', 'Six', 'Six'
    ],
    'Value': [
        45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
        736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
    ]
}

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()

Using naively just sort_index, the data looks like this:

In [2]: ser_val.sort_index()
Out[2]:
Card         Year
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

you can see that the columns are sorted alphabetically. Now, I want to force the ordering. For that, I try:

categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)

again, after sorting, the data looks like this (again, alphabetical order)

In [3]: ser_val.sort_index()
Out[3]:
Card         Year
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

I know that if I convert the data into pandas.DataFrame and sort there, it works, like this:

df_val = ser_val.reset_index().sort_values(grp_cols)
df_val['Year'] = pd.Categorical(df_val['Year'].values,
                                categories_order,
                                ordered=True)
df_val = df_val.sort_values(grp_cols).set_index(grp_cols)


In [5]: df_val
Out[5]:
                         Value
Card        Year
Master Card One    2401.000000
            Two     321.000000
            Three   146.833333
            Four    755.000000
            Five    294.000000
            Seven   505.000000
Visa        Three    84.500000
            Six      15.000000
            Seven   438.166667

Why won’t the pd.Series sort with categorical data?

I am using pandas 1.0.5 in Python 3.7.3 64-bit

2 Answers

Actually, I think you found a bug or two!

Bug #1 - Changing dtype with set_levels using pd.Categorical doesn't work.

import pandas as pd 
import numpy as np

d = {
    'Card': [
        'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
        'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
        'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
    ],
    'Year': [
        'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
        'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
        'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
        'One', 'Two', 'Four', 'Six', 'Six'
    ],
    'Value': [
        45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
        736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
    ]
}

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()

#---------------------------------------------
#Attempt 1st
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index.set_levels(categories, level=1, inplace=True)
print(ser_val.index.levels[1].dtype)

#--------------------------------------------
#Attempt 2nd
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index = ser_val.index.set_levels(categories, level='Year')
print(ser_val.index.levels[1].dtype)

#----------------------------------
#Attempt 3rd and success
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categoriesDtype = pd.CategoricalDtype(categories_order, ordered=True)
ser_val.index = ser_val.index.set_levels(ser_val.index.levels[1].astype(categoriesDtype), level='Year')
print(ser_val.index.levels[1].dtype)

Output:

object  *FAILED change type using inplace*
object  *FAILED change type using reassignment*
category  *SUCCESS change type using pd.CategoricalDtype*

Bug #2 - sort_index using categorical for MultiIndex level 1 doesn't work

This may be found already here as an open issue #24271

After a successful change of dype for index level 1:

 ser_val.index.levels[1]

Output:

CategoricalIndex(['Five', 'Four', 'One', 'Seven', 'Six', 'Three', 'Two'], categories=['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven'], ordered=True, name='Year', dtype='category')

Now, let's sort the dataframe using sort_index:

ser_val.sort_index()

Output (FAIL):

Card         Year 
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

Now, just for kicks and testing, let's swap index levels and try sort_index again.

ser_val.swaplevel(0,1).sort_index()

Output(SUCCESS):

Year   Card       
One    Master Card    2401.000000
Two    Master Card     321.000000
Three  Master Card     146.833333
       Visa             84.500000
Four   Master Card     755.000000
Five   Master Card     294.000000
Six    Visa             15.000000
Seven  Master Card     505.000000
       Visa            438.166667
Name: Value, dtype: float64

However, if we explicitly set the sorting levels... FAIL again.

ser_val.swaplevel(0,1).sort_index(level=[0,1])

Output:

Year   Card       
Five   Master Card     294.000000
Four   Master Card     755.000000
One    Master Card    2401.000000
Seven  Master Card     505.000000
       Visa            438.166667
Six    Visa             15.000000
Three  Master Card     146.833333
       Visa             84.500000
Two    Master Card     321.000000
Name: Value, dtype: float64

Answered by Scott Boston on December 8, 2020

TLDR: You need to set sort=False in your groupby, and you need to change your Categorical to a CategoricalIndex. Here's the full working example:

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols, sort=False)['Value'].mean()

categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.CategoricalIndex(ser_val.index.levels[1].values,
                                 categories=categories_order,
                                 ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)
ser_val.sort_index(inplace=True)

And ser_val is now:

Card         Year 
Master Card  One      2401.000000
             Two       321.000000
             Three     146.833333
             Four      755.000000
             Five      294.000000
             Seven     505.000000
Visa         Three      84.500000
             Six        15.000000
             Seven     438.166667
Name: Value, dtype: float64

Longer Ramblings: You ask why what you did doesn't work, which I certainly cannot explain (I'm sure that involves some rooting around in the source), but here is how I arrived at my solution.

See that the following example works, constructing a toy MultiIndex Series from scratch:

lets = ['a','b','c']*3
ids = ['MALE']*4 + ['FEMALE']*5
s = pd.Series(range(9), index=[ids,lets])

categories_order = ['b','a','c']
categories = pd.CategoricalIndex(s.index.levels[1].values,
                                 categories=categories_order,
                                 ordered=True)
s.index.set_levels(categories, level=1,inplace=True)
s.sort_index(inplace=True)

s is sorted as we want:

FEMALE  b    4
        b    7
        a    6
        c    5
        c    8
MALE    b    1
        a    0
        a    3
        c    2
dtype: int64

The only significant difference between your example and mine (that I could tell) is that yours starts from groupby. There is a sort parameter of groupby:

sort : bool, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

So it seems like the groupby sort is enforcing some order that is not being overridden by your new categorical order.

But still with sort=False, your code as is didn't work. Just through googling I found that there were separate classes for Categorical and CategoricalIndex, and apparently the latter is what you need here. And sure enough my example will also fail if Categorical is used instead of CategoricalIndex.

So the groupby thing seems like the more curious issue; again the underlying rules here I can't tell you, but maybe someone can elaborate.

Answered by Tom on December 8, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP