
Trying to pass numpy array mode value to df column

Stack Overflow Asked by CoderMan on February 25, 2021

I have created a small program to find the mean, median and mode values for two particular columns of a df. I used np.mean and np.median to find the mean and median, but for the mode I created a numpy array from the df and calculated the mode from that. When I print them to the console, the values seem fine. However, I would like the mode value from the numpy array to appear in my df, which has four columns: ‘STUDENT’, ‘score’, ‘mean’ and ‘median’. Is there a way to take the mode value and attach it to the end of the df as a fifth column titled ‘mode’? My code is below. I would also prefer not to use libraries like scipy for this, so as to not use sparse, if there is another way around it.

import numpy as np
import pandas as pd

def mean_median():
    df = pd.read_csv('Surveys.csv')

    dfm= df.groupby("STUDENT")[["SCORE"]].agg([np.mean, np.median]).reset_index()

    print(dfm)


    arr = dfm.to_numpy()

    print('\nNumpy Array\n----------\n', arr)
    vals,counts = np.unique(arr, return_counts=True)
    index = np.argmax(counts)
    return vals[index]

Here is an example of my output, in case it helps make things clearer:

    STUDENT      SCORE       
                mean      median
0      2443.0  93.210145   94.0
1      2445.0  94.652113   95.0
2      2447.0  93.919775   95.0
3      2451.0  95.203571   95.0
4      2832.0  94.544304   95.0
..        ...        ...    ...
276   27323.0  95.585106   96.0
277   27324.0  94.562105   95.0
278   27325.0  96.986348   98.0
279   27326.0  96.809524   97.0
280   27334.0  96.102564   97.0

[281 rows x 3 columns]

Numpy Array
----------
 [[ 2443.            93.21014493    94.        ]
 [ 2445.            94.65211268    95.        ]
 [ 2447.            93.91977481    95.        ]
 [ 2451.            95.20357143    95.        ]
 [ 2832.            94.5443038     95.        ]
 [ 2838.            94.97988265    95.        ]
 [ 2839.            93.88054608    94.        ]
 [ 2841.            93.90789474    94.        ]
 [ 2980.            94.14044944    95.        ]
 [ 3220.            94.44219067    95.        ]
 [ 3221.            93.80825959    94.        ]
 [ 3222.            93.88416076    94.        ]
 [ 3229.            98.42857143   100.        ]
 [ 3231.            92.11363636    93.        ]
 [ 3236.            94.3677686     95.        ]
 [ 3238.            93.84027778    94.        ]
 [ 3332.            93.12958963    94.        ]
 [ 3333.            92.83663366    93.5       ]

Sample input data (a few rows) to help recreate the problem:

 STUDENT        SCORE
 
  25718         97            
  25719         97             
  26990         95           
  23809         92          
  24032         90            
  22723         87            
  24688         92           
  25714         89            
  25718         78            
  23078         90            
  25713         90
  24032         87
  26990         77
  26990         89

One Answer

You can use pd.Series.mode for calculating mode. Also, for mean and median you can simply use strings to reference the functions.

import pandas as pd

# Dummy dataframe
d = {'STUDENT': [25718, 25718, 25718, 25718, 25718, 22723, 22723, 22723, 22723, 22723, 25713, 25713, 25713], 
     'SCORE': [97, 97, 95, 92, 90, 87, 92, 89, 78, 92, 90, 87, 87]}

df = pd.DataFrame(d)
out = df.groupby("STUDENT")["SCORE"].agg(['mean','median',pd.Series.mode]).reset_index()
print(out)
   STUDENT  mean  median  mode
0    22723  87.6      89    92
1    25713  88.0      87    87
2    25718  94.2      95    97

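This will give results if each student has exactly one mode (a single most frequent score). If a student has several tied values, including the case where no score repeats at all, pd.Series.mode returns all of them, so the aggregation no longer produces one value per student and may throw an error.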

More details here.
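To see why ties are a problem, it may help to look at what pd.Series.mode returns on its own. This is only a minimal sketch with made-up scores (not the asker's data):

```python
import pandas as pd

# One clear mode: the result is a Series of length 1.
single = pd.Series([90, 87, 87]).mode()
print(single.tolist())

# No repeated value: every score ties, so all of them come back (sorted).
tied = pd.Series([90, 87, 88]).mode()
print(tied.tolist())
```

Because the second call returns three values rather than one, it cannot be used directly as a per-student aggregate.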


If you are not sure whether each student has a single well-defined mode, you can simply take the average of the values returned by pd.Series.mode. If it returns one mode, the average is that same value. If it returns multiple modes, you get the average of those.

d = {'STUDENT': [25718, 25718, 25718, 25718, 25718, 22723, 22723, 22723, 22723, 22723, 25713, 25713, 25713], 
     'SCORE': [97, 97, 95, 92, 90, 87, 92, 89, 78, 92, 90, 87, 88]}

# Average the modes so the aggregation always gets a single value per student
mode = lambda x: pd.Series.mean(pd.Series.mode(x))

df = pd.DataFrame(d)
out = df.groupby("STUDENT")["SCORE"].agg(['mean','median', mode]).reset_index()
out.columns = ['STUDENT','mean','median','mode']
print(out)
   STUDENT       mean  median       mode
0    22723  87.600000      89  92.000000
1    25713  88.333333      88  88.333333
2    25718  94.200000      95  97.000000
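Since the question already uses np.unique and wants to avoid scipy, the same counting idea can also be applied per student instead of to the whole array. The sketch below is one possible variant, not the only way to do it: the helper name first_mode and the sample data are made up for illustration, and breaking ties by taking the smallest score is an assumption you may want to change.

```python
import numpy as np
import pandas as pd

def first_mode(scores):
    """Most frequent value in one group (hypothetical helper)."""
    vals, counts = np.unique(scores, return_counts=True)
    # np.unique returns sorted values and argmax picks the first maximum,
    # so a tie resolves to the smallest score.
    return vals[np.argmax(counts)]

# Made-up sample data in the same shape as the question's input
d = {'STUDENT': [25718, 25718, 25718, 22723, 22723, 22723, 25713, 25713],
     'SCORE': [97, 97, 95, 87, 92, 92, 90, 87]}
df = pd.DataFrame(d)

# Named aggregation gives flat column names, so no renaming is needed.
out = (df.groupby("STUDENT")["SCORE"]
         .agg(mean="mean", median="median", mode=first_mode)
         .reset_index())
print(out)
```

This keeps 'mode' as an ordinary column next to 'mean' and 'median', and it always returns exactly one value per student, at the cost of silently choosing one of the tied scores.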

Correct answer by Akshay Sehgal on February 25, 2021
