AnswerBun.com

Sorting multi-index pd.Series using pd.Categorical?

Stack Overflow Asked by np8 on December 8, 2020

Question

Why the sorting using pd.Series.sort_index does not seem to work when using categorical index? How would one sort the index of the multi-index pd.Series using other sort order than alphabetical/numerical?

MWE

Setup code

import pandas as pd 
import numpy as np

d = {
    'Card': [
        'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
        'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
        'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
    ],
    'Year': [
        'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
        'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
        'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
        'One', 'Two', 'Four', 'Six', 'Six'
    ],
    'Value': [
        45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
        736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
    ]
}

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()

Using naively just sort_index, the data looks like this:

In [2]: ser_val.sort_index()
Out[2]:
Card         Year
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

you can see that the columns are sorted alphabetically. Now, I want to force the ordering. For that, I try:

categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)

again, after sorting, the data looks like this (again, alphabetical order)

In [3]: ser_val.sort_index()
Out[3]:
Card         Year
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

I know that if I convert the data into pandas.DataFrame and sort there, it works, like this:

df_val = ser_val.reset_index().sort_values(grp_cols)
df_val['Year'] = pd.Categorical(df_val['Year'].values,
                                categories_order,
                                ordered=True)
df_val = df_val.sort_values(grp_cols).set_index(grp_cols)


In [5]: df_val
Out[5]:
                         Value
Card        Year
Master Card One    2401.000000
            Two     321.000000
            Three   146.833333
            Four    755.000000
            Five    294.000000
            Seven   505.000000
Visa        Three    84.500000
            Six      15.000000
            Seven   438.166667

Why won’t the pd.Series sort with categorical data?

I am using pandas 1.0.5 in Python 3.7.3 64-bit

2 Answers

Actually, I think you found a bug or two!

Bug #1 - Changing dtype with set_levels using pd.Categorical doesn't work.

import pandas as pd 
import numpy as np

d = {
    'Card': [
        'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
        'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
        'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
        'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
    ],
    'Year': [
        'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
        'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
        'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
        'One', 'Two', 'Four', 'Six', 'Six'
    ],
    'Value': [
        45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
        736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
    ]
}

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()

#---------------------------------------------
#Attempt 1st
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index.set_levels(categories, level=1, inplace=True)
print(ser_val.index.levels[1].dtype)

#--------------------------------------------
#Attempt 2nd
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
                            categories=categories_order,
                            ordered=True)
ser_val.index = ser_val.index.set_levels(categories, level='Year')
print(ser_val.index.levels[1].dtype)

#----------------------------------
#Attempt 3rd and success
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categoriesDtype = pd.CategoricalDtype(categories_order, ordered=True)
ser_val.index = ser_val.index.set_levels(ser_val.index.levels[1].astype(categoriesDtype), level='Year')
print(ser_val.index.levels[1].dtype)

Output:

object  *FAILED change type using inplace*
object  *FAILED change type using reassignment*
category  *SUCCESS change type using pd.CategoricalDtype*

Bug #2 - sort_index using categorical for MultiIndex level 1 doesn't work

This may be found already here as an open issue #24271

After a successful change of dype for index level 1:

 ser_val.index.levels[1]

Output:

CategoricalIndex(['Five', 'Four', 'One', 'Seven', 'Six', 'Three', 'Two'], categories=['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven'], ordered=True, name='Year', dtype='category')

Now, let's sort the dataframe using sort_index:

ser_val.sort_index()

Output (FAIL):

Card         Year 
Master Card  Five      294.000000
             Four      755.000000
             One      2401.000000
             Seven     505.000000
             Three     146.833333
             Two       321.000000
Visa         Seven     438.166667
             Six        15.000000
             Three      84.500000
Name: Value, dtype: float64

Now, just for kicks and testing, let's swap index levels and try sort_index again.

ser_val.swaplevel(0,1).sort_index()

Output(SUCCESS):

Year   Card       
One    Master Card    2401.000000
Two    Master Card     321.000000
Three  Master Card     146.833333
       Visa             84.500000
Four   Master Card     755.000000
Five   Master Card     294.000000
Six    Visa             15.000000
Seven  Master Card     505.000000
       Visa            438.166667
Name: Value, dtype: float64

However, if we explicitly set the sorting levels... FAIL again.

ser_val.swaplevel(0,1).sort_index(level=[0,1])

Output:

Year   Card       
Five   Master Card     294.000000
Four   Master Card     755.000000
One    Master Card    2401.000000
Seven  Master Card     505.000000
       Visa            438.166667
Six    Visa             15.000000
Three  Master Card     146.833333
       Visa             84.500000
Two    Master Card     321.000000
Name: Value, dtype: float64

Answered by Scott Boston on December 8, 2020

TLDR: You need to set sort=False in your groupby, and you need to change your Categorical to a CategoricalIndex. Here's the full working example:

df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols, sort=False)['Value'].mean()

categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.CategoricalIndex(ser_val.index.levels[1].values,
                                 categories=categories_order,
                                 ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)
ser_val.sort_index(inplace=True)

And ser_val is now:

Card         Year 
Master Card  One      2401.000000
             Two       321.000000
             Three     146.833333
             Four      755.000000
             Five      294.000000
             Seven     505.000000
Visa         Three      84.500000
             Six        15.000000
             Seven     438.166667
Name: Value, dtype: float64

Longer Ramblings: You ask why what you did doesn't work, which I certainly cannot explain (I'm sure that involves some rooting around in the source), but here is how I arrived at my solution.

See that the following example works, constructing a toy MultiIndex Series from scratch:

lets = ['a','b','c']*3
ids = ['MALE']*4 + ['FEMALE']*5
s = pd.Series(range(9), index=[ids,lets])

categories_order = ['b','a','c']
categories = pd.CategoricalIndex(s.index.levels[1].values,
                                 categories=categories_order,
                                 ordered=True)
s.index.set_levels(categories, level=1,inplace=True)
s.sort_index(inplace=True)

s is sorted as we want:

FEMALE  b    4
        b    7
        a    6
        c    5
        c    8
MALE    b    1
        a    0
        a    3
        c    2
dtype: int64

The only significant difference between your example and mine (that I could tell) is that yours starts from groupby. There is a sort parameter of groupby:

sort : bool, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

So it seems like the groupby sort is enforcing some order that is not being overridden by your new categorical order.

But still with sort=False, your code as is didn't work. Just through googling I found that there were separate classes for Categorical and CategoricalIndex, and apparently the latter is what you need here. And sure enough my example will also fail if Categorical is used instead of CategoricalIndex.

So the groupby thing seems like the more curious issue; again the underlying rules here I can't tell you, but maybe someone can elaborate.

Answered by Tom on December 8, 2020

Add your own answers!

Related Questions

Are these Threads synchronized?

3  Asked on November 12, 2020 by haoshoku

     

How to pause and resume a while loop in Python?

5  Asked on November 11, 2020 by mentalcombination

   

How does thrust determine arguments to pass to functor

1  Asked on November 11, 2020 by a_man

     

How can I style specific symbols in an element?

6  Asked on November 10, 2020 by ankit-aggarwal

 

Flutter crash after open apps

4  Asked on November 10, 2020 by zukijuki

       

Java alternative of product function of python form itertools

1  Asked on November 8, 2020 by vipul-tyagi

     

Kubernetes – How to run local image of jenkins

1  Asked on November 8, 2020 by jerome12

   

How to avoid ambiguous template instantiation?

2  Asked on November 8, 2020 by wintergreen_plaza

     

leetcode algorithm edgecase issue

3  Asked on November 7, 2020 by stephen1993

     

Arrows in API strings?

1  Asked on November 6, 2020 by vichofs

   

How to use SQL PARTITION BY GROUPS?

2  Asked on November 5, 2020 by radagast

     

react start cannot find files in public folder

2  Asked on November 5, 2020 by minh-triet

     

tf.keras.utils.to_categorical raises TypeError in graph mode

1  Asked on November 5, 2020 by borun-chowdhury

   

Azure IoT Hub MQTT failure(Without SDK)

1  Asked on November 5, 2020 by govtham

   

Ask a Question

Get help from others!

© 2022 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir