TransWikia.com

Merging DataFrames with "uneven" data

Stack Overflow Asked by Ribit8950 on December 4, 2021

Excuse the title, I’m not even sure how to label what I’m trying to do. I have data in a DataFrame that looks like this:

Name     Month     Status
----     -----     ------
Bob      Jan       Good
Bob      Feb       Good
Bob      Mar       Bad
Martha   Feb       Bad
John     Jan       Good
John     Mar       Bad

Not every ‘Name’ has every ‘Month’ and ‘Status’. What I want to get is:

Name     Month     Status
----     -----     ------
Bob      Jan       Good
Bob      Feb       Good
Bob      Mar       Bad
Martha   Jan       N/A
Martha   Feb       Bad
Martha   Mar       N/A
John     Jan       Good
John     Feb       N/A
John     Mar       Bad

Where the missing months are filled in with a value in the ‘Status’ column.

What I’ve tried to do so far is export all of the unique ‘Month’ values to a list, convert that list to a DataFrame, then join/merge the two DataFrames. But I can’t get anything to work.

What is the best way to do this?
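For reference, the merge-based idea described above can be written out roughly as follows. This is only a sketch, assuming pandas 1.2 or later (for `how="cross"`) and a frame named `df` built from the sample data; the names `names`, `months`, `grid` and `result` are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# One-column frames holding the distinct names and months
names = df[["Name"]].drop_duplicates()
months = df[["Month"]].drop_duplicates()

# Cross join: every name paired with every month
grid = names.merge(months, how="cross")

# Left join the original data onto the full grid;
# combinations that never occurred get NaN in Status
result = grid.merge(df, on=["Name", "Month"], how="left")
print(result)
```

The missing statuses come back as NaN and can be replaced afterwards, for example with `result["Status"].fillna("N/A")`.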

3 Answers

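Use pivot, then stack with dropna=False so that the missing combinations are kept. The arguments to pivot are passed by keyword here, because recent pandas versions no longer accept them positionally (which the shorthand df.pivot(*df) relied on):

df = df.pivot(index='Name', columns='Month', values='Status').stack(dropna=False).to_frame('Status').reset_index()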
     Name Month Status
0     Bob   Feb  Good
1     Bob   Jan  Good
2     Bob   Mar   Bad
3    John   Feb   NaN
4    John   Jan  Good
5    John   Mar   Bad
6  Martha   Feb   Bad
7  Martha   Jan   NaN
8  Martha   Mar   NaN
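The same pivot idea can also be sketched without `stack`, whose `dropna` handling has changed across pandas versions. The variant below swaps in `melt` to go back to long form; it is an illustration under the assumption that `df` holds the sample data, not the answer's exact method, and `wide` and `result` are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# Wide form: one row per name, one column per month, NaN where no row existed
wide = df.pivot(index="Name", columns="Month", values="Status")

# Back to long form; melt keeps the NaN cells instead of dropping them
result = (
    wide.reset_index()
        .melt(id_vars=["Name"], var_name="Month", value_name="Status")
        .sort_values(["Name", "Month"], ignore_index=True)
)
print(result)
```

As with the stacked version, the rows end up sorted alphabetically by name and month rather than in their original order.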

Answered by BENY on December 4, 2021

You can treat the month as a categorical column, then allow GroupBy to do the heavy lifting:

df['Month'] = pd.Categorical(df['Month'])
# observed=False keeps the (Name, Month) combinations that never occur in the data
df.groupby(['Name', 'Month'], as_index=False, observed=False).first()

     Name Month Status
0     Bob   Feb   Good
1     Bob   Jan   Good
2     Bob   Mar    Bad
3    John   Feb    NaN
4    John   Jan   Good
5    John   Mar    Bad
6  Martha   Feb    Bad
7  Martha   Jan    NaN
8  Martha   Mar    NaN

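The secret sauce here is that GroupBy produces a row for every category of a categorical key, including categories that never occur for a given name (as long as unobserved categories are kept, i.e. observed=False), and fills the missing values with NaN.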

Caveat: This always sorts your data.
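Since the result is sorted anyway, the sort order can at least be made to follow the calendar. A minimal sketch, assuming the three months in the sample data and an explicitly ordered categorical; the category list and the "N/A" placeholder are choices made for this illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# Ordered categories, so the months sort as Jan < Feb < Mar
df["Month"] = pd.Categorical(
    df["Month"], categories=["Jan", "Feb", "Mar"], ordered=True
)

# observed=False asks for every category, not only the ones seen per name
result = (
    df.groupby(["Name", "Month"], observed=False)["Status"]
      .first()
      .reset_index()
)

# Replace the NaN placeholders with the text the question asked for
result["Status"] = result["Status"].fillna("N/A")
print(result)
```

Names are still sorted alphabetically (Bob, John, Martha), so this does not restore the original row order.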

Answered by cs95 on December 4, 2021

You can take advantage of pandas' indexing to reshape the data:

Step 1: create a new index from the unique values of the Name and Month columns:

new_index = pd.MultiIndex.from_product(
    (df.Name.unique(), df.Month.unique()), names=["Name", "Month"]
)

Step 2: set Name and Month as the index, reindex with new_index, then call reset_index to get your final output:

df.set_index(["Name", "Month"]).reindex(new_index).reset_index()
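Put together on the sample data, the two steps look roughly like this. The sketch assumes each (Name, Month) pair occurs at most once, since `reindex` needs a unique index; passing `fill_value="N/A"` is an addition for this illustration so that the gaps show the text from the question instead of NaN.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# Step 1: every combination of the names and months, in order of first appearance
new_index = pd.MultiIndex.from_product(
    (df["Name"].unique(), df["Month"].unique()), names=["Name", "Month"]
)

# Step 2: align the data to the full index, filling the gaps with "N/A"
result = (
    df.set_index(["Name", "Month"])
      .reindex(new_index, fill_value="N/A")
      .reset_index()
)
print(result)
```

Because `unique()` keeps values in order of first appearance, the rows come out in the order shown in the question (Bob, Martha, John, each with Jan, Feb, Mar) rather than sorted.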

UPDATE 2021/01/08:

You can use the complete function from pyjanitor; at the moment you have to install the latest development version from github:

# install the latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor  # the pyjanitor package is imported as `janitor`; importing it registers df.complete
df.complete("Name", "Month")

Answered by sammywemmy on December 4, 2021
