
Pandas/python join/merge two dataframes on a column of list

Stack Overflow Asked by wonder kid on December 20, 2021

Let’s consider two dataframes, Person and Movie:

dataframe Person

+---+-----------+-------------------+-----------------------------+-----------------------------------------+
|   |    nconst |       primaryName |           primaryProfession |                          knownForTitles |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 0 | nm0000103 |      Fairuza Balk |          actress,soundtrack | tt0181875,tt0089908,tt0120586,tt0115963 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 1 | nm0000106 |    Drew Barrymore | producer,actress,soundtrack | tt0120888,tt0343660,tt0151738,tt0120631 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 2 | nm0000117 |     Neve Campbell | actress,producer,soundtrack | tt0134084,tt1262416,tt0120082,tt0117571 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 3 | nm0000132 |      Claire Danes | actress,producer,soundtrack | tt0274558,tt0108872,tt1796960,tt0117509 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 4 | nm0000138 | Leonardo DiCaprio |       actor,producer,writer | tt0120338,tt0993846,tt1375666,tt0407887 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+

dataframe Movie

+---+-----------+-----------+---------------------+-----------------------+
|   |    tconst | titleType |       originalTitle |                genres |
+---+-----------+-----------+---------------------+-----------------------+
| 0 | tt0192789 |     movie | While Supplies Last |        Comedy,Musical |
+---+-----------+-----------+---------------------+-----------------------+
| 1 | tt4914592 |     movie |      Electric Heart | Adventure,Drama,Music |
+---+-----------+-----------+---------------------+-----------------------+
| 2 | tt4999994 |     movie |           Rain Doll |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 3 | tt2690572 |     movie |             Polaris |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 4 | tt1562859 |     movie |           Golmaal 3 |         Action,Comedy |
+---+-----------+-----------+---------------------+-----------------------+

As you can see, knownForTitles in Person is a comma-separated list of tconst values from the Movie dataframe.

Questions:

  1. How many actors have ever acted in an action movie?
  2. How many actors are starring in more than one genre of movies?

2 Answers

First, we create person as a DataFrame:

import pandas as pd

columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']

data = [
('nm0000103',      'Fairuza Balk',          'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
('nm0000106',    'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
('nm0000117',     'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
('nm0000132',      'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
('nm0000138', 'Leonardo DiCaprio',       'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]

person = pd.DataFrame(data=data, columns=columns)

Second, we split strings into lists for two of the columns:

for field in ['primaryProfession', 'knownForTitles']:
    person[field] = person[field].str.split(',')

Third, we use the explode function to convert one row into many:

person = person.explode('knownForTitles').explode('primaryProfession')
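To make the effect of chaining two explode calls concrete, here is a minimal sketch on a one-row toy frame. The column names and values (name, jobs, titles) are made up for illustration and are not part of the question's data. Each explode multiplies the rows, so one person with 2 professions and 3 titles ends up as 6 rows, all sharing the original index label.

```python
import pandas as pd

# Hypothetical one-row frame; both list columns are already split.
df = pd.DataFrame({
    "name": ["A"],
    "jobs": [["actress", "producer"]],
    "titles": [["tt1", "tt2", "tt3"]],
})

# Chaining explode yields every (job, title) combination for the row.
exploded = df.explode("titles").explode("jobs")

print(exploded)
print(len(exploded))  # 2 jobs x 3 titles
```

Because the index label is repeated, calling reset_index(drop=True) afterwards (or passing ignore_index=True to explode) is a reasonable choice when a unique index is needed later.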

Fourth, we select only actress/actor as the primary profession:

actor_actress = person[ person['primaryProfession'].isin(['actress', 'actor'])]

Now, we have a data frame in so-called tidy format (each cell has a single value, not a list):

    nconst     primaryName   primaryProfession knownForTitles
0   nm0000103  Fairuza Balk   actress          tt0181875
0   nm0000103  Fairuza Balk   actress          tt0089908
0   nm0000103  Fairuza Balk   actress          tt0120586
0   nm0000103  Fairuza Balk   actress          tt0115963
1   nm0000106  Drew Barrymore actress          tt0120888

At this point, we can repeat these steps for the Movie data frame, and then join actors (using knownForTitles) and Movies (using tconst).
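The remaining steps could be sketched as follows. This is only an illustration under stated assumptions: the frames below use made-up identifiers (nm1, tt1, and so on) chosen so that the join produces matches, and the helper name tidy is hypothetical. It splits and explodes both frames, merges on knownForTitles / tconst, and then counts distinct actors for each of the two questions.

```python
import pandas as pd

# Toy data with hypothetical identifiers (not the rows from the question).
person = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["A", "B", "C"],
    "primaryProfession": ["actress,soundtrack", "actor,producer", "writer"],
    "knownForTitles": ["tt1,tt2", "tt2,tt3", "tt1"],
})
movie = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Action,Comedy", "Drama", "Drama"],
})

def tidy(df, fields):
    """Split each comma-separated field and explode it into one value per row."""
    out = df.copy()
    for field in fields:
        out[field] = out[field].str.split(",")
        out = out.explode(field, ignore_index=True)
    return out

actors = tidy(person, ["primaryProfession", "knownForTitles"])
actors = actors[actors["primaryProfession"].isin(["actor", "actress"])]
movies = tidy(movie, ["genres"])

# One row per (actor, title, genre) combination.
joined = actors.merge(movies, left_on="knownForTitles", right_on="tconst", how="inner")

# Question 1: distinct actors with at least one Action title.
n_action_actors = joined.loc[joined["genres"] == "Action", "nconst"].nunique()

# Question 2: distinct actors whose titles span more than one genre.
genres_per_actor = joined.groupby("nconst")["genres"].nunique()
n_multi_genre = int((genres_per_actor > 1).sum())

print(n_action_actors, n_multi_genre)
```

An inner merge drops people whose known titles are missing from the movie frame, which is what happens with the sample rows in the question, since none of their tconst values overlap.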

Sorry for the length of this response. The key points of this approach are to use str.split(',') and then explode() to transform the data frame into a format suitable for join, merge, etc.

Answered by jsmart on December 20, 2021

I'm learning pandas, so there's a good chance I'm going the wrong way with this. That said, let's give this a go:

First, let's see if we can find all rows in the Movies df that are action films. This assumes the genres column has already been split into lists. Looking at "Pandas dataframe select rows where a list-column contains any of a list of strings", I came up with this:

Movies['isAction'] = ['Action' in x for x in Movies['genres']]

Here's the result:

      tconst titleType      originalTitle                     genres  isAction
0  tt0407887     movie  WhileSuppliesLast          [Comedy, Musical]     False
1  tt1375666     movie      ElectricHeart  [Adventure, Drama, Music]     False
2  tt4999994     movie           RainDoll                    [Drama]     False
3  tt2690572     movie            Polaris                    [Drama]     False
4  tt0134084     movie           Golmaal3           [Action, Comedy]      True

I added the isAction column to the Movies df. I also changed some of the tconst values so that we can get some positive results (rows 0, 1, and 4 changed).

I changed row 4 so that Neve Campbell would appear in the results.

We can now produce a list of tconst of Action Movies:

listOfActionMovies = Movies.loc[Movies["isAction"], "tconst"].tolist()

Now, using the solution from "Pandas dataframe select rows where a list-column contains any of a list of strings" again (note that axis must be passed by keyword in recent pandas versions):

Person["inAction"] = pd.DataFrame(Person.knownForTitles.tolist()).isin(listOfActionMovies).any(axis=1)

This yields:

      nconst       primaryName                primaryProfession                                knownForTitles  inAction
0  nm0000103       FairuzaBalk            [actress, soundtrack]  [tt0181875, tt0089908, tt0120586, tt0115963]     False
1  nm0000106     DrewBarrymore  [producer, actress, soundtrack]  [tt0120888, tt0343660, tt0151738, tt0120631]     False
2  nm0000117      NeveCampbell  [actress, producer, soundtrack]  [tt0134084, tt1262416, tt0120082, tt0117571]      True
3  nm0000132       ClaireDanes  [actress, producer, soundtrack]  [tt0274558, tt0108872, tt1796960, tt0117509]     False
4  nm0000138  LeonardoDiCaprio        [actor, producer, writer]  [tt0120338, tt0993846, tt1375666, tt0407887]     False

Now, finally, we can count all the people in action movies:

len(Person[Person["inAction"]])

The len() approach comes from "Get dataframe row count based on conditions".
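As a possible alternative to building an intermediate DataFrame from the list column, the membership test can also be written per row against a set. The sketch below uses made-up names and values (person, action_titles, tt1, ...) purely for illustration. It also works when the lists have different lengths, and summing the boolean column gives the count directly.

```python
import pandas as pd

# Hypothetical data: knownForTitles is already a column of lists.
person = pd.DataFrame({
    "primaryName": ["A", "B", "C"],
    "knownForTitles": [["tt1", "tt2"], ["tt3"], ["tt4", "tt5", "tt1"]],
})
action_titles = {"tt1", "tt9"}  # assumed set of action-movie tconst values

# True when the person's titles share at least one element with the set.
person["inAction"] = person["knownForTitles"].apply(
    lambda titles: not action_titles.isdisjoint(titles)
)

# Booleans sum as 0/1, so this is the number of matching people.
n_in_action = int(person["inAction"].sum())
print(n_in_action)
```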

Answered by Mark on December 20, 2021
