Python: adding a column with the count of missing values by row

Code Review Asked by lcrmorin on January 10, 2021

I have a large pandas DataFrame and I am trying to add a column to it containing the fraction of missing values in each row. I have inherited some code that works, but I'd like to reduce memory usage by removing intermediate values.

Here is a toy example:

import numpy as np
import pandas as pd

# Toy data: np.nan marks a missing value
students = [('jack', np.nan, 'Sydney', 'Australia'),
            ('Riti', np.nan, 'Delhi', 'India'),
            ('Vikas', 31, np.nan, 'India'),
            ('Neelu', 32, 'Bangalore', 'India'),
            ('John', 16, 'New York', 'US'),
            ('John', 11, np.nan, np.nan),
            (np.nan, np.nan, np.nan, np.nan)]
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'])

And the code I inherited:

print('NanCounter -> transform')
nan_count = pd.DataFrame(data = np.mean(dfObj.isna().values, axis=1).astype('float32'), columns=['nan_count']).set_index(dfObj.index)
X_ = pd.concat([dfObj, nan_count], axis=1)
X_.set_index(dfObj.index, inplace=True)

This seems a rather convoluted way of writing:

print('NanCounter -> transform')
dfObj['nan_count'] = np.mean(dfObj.isna().values, axis=1).astype('float32')

It also seems to consume more memory. I am concerned that I am missing something about the calculations. Are these expressions equivalent? In particular, what would be the benefit of working with an additional variable?

One Answer

The difference between the two code snippets is that the first creates a new DataFrame while the second modifies the original DataFrame in-place.

The first snippet can be simplified to

X_ = dfObj.assign(nan_count=dfObj.isna().mean(axis=1).astype('float32'))
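As a rough check of the points above, the following sketch compares the three variants on a shortened version of the toy data. The names `df`, `inherited`, `simplified` and the boolean flags are illustrative only and do not come from the original code. It assumes a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Shortened toy frame (illustrative data, modelled on the question)
df = pd.DataFrame(
    [("jack", np.nan, "Sydney", "Australia"),
     ("Vikas", 31, np.nan, "India"),
     ("Neelu", 32, "Bangalore", "India"),
     (np.nan, np.nan, np.nan, np.nan)],
    columns=["Name", "Age", "City", "Country"],
)
original_columns = list(df.columns)

# Inherited approach: build a separate frame, then concatenate
nan_count = pd.DataFrame(
    data=np.mean(df.isna().values, axis=1).astype("float32"),
    columns=["nan_count"],
).set_index(df.index)
inherited = pd.concat([df, nan_count], axis=1)

# Simplified approach: assign returns a new DataFrame
simplified = df.assign(nan_count=df.isna().mean(axis=1).astype("float32"))

# Both build the same result, and df itself is still unchanged here
same_result = inherited.equals(simplified)
df_untouched = list(df.columns) == original_columns

# In-place approach from the question: adds the column to df itself
df["nan_count"] = np.mean(df.isna().values, axis=1).astype("float32")
in_place_matches = df.equals(simplified)
```

If all three flags come out `True`, the variants differ only in whether the original DataFrame is modified, not in the values they compute.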

Answered by GZ0 on January 10, 2021
