TransWikia.com

Converting location names into country names in pandas dataframe

Code Review Asked on December 23, 2020

I have a dataframe df such that:

print(df['user_location'].value_counts())
India                           3741
United States                   2455
New Delhi, India                1721
Mumbai, India                   1401
Washington, DC                  1354
                                ... 
SpaceCoast,Florida                 1
stuck in a book.                   1
Beirut , Lebanon                   1
Royston Vasey - Tralfamadore       1
Langham, Colchester                1
Name: user_location, Length: 26920, dtype: int64

I wanted to know the frequency of specific countries like USA, India from the user_location column. Then I wanted to plot the frequencies as USA, India, and Others.
So, I thought about applying some operation on that column such that the value_counts() would give the output as:

India     (sum of all frequencies of all the locations in India including cities, states, etc.)
USA       (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others    (sum of all frequencies of the other locations)                    

It seemed to me that I should merge the frequencies of rows containing the same countries and merge the rest of them together! But the complexity appeared while handling the rows that contain the names of cities, states, etc. instead of their country names.


The solution what I have come up with so far is given below (and also in stackoverflow):

Firstly, I have tried to get all the locations including cities, unions, states, districts, territories. Then I have made a function checkl() such that it can check if the location is India or USA and then convert it into its country name. Finally the function has been applied on the dataframe column df['user_location'] :

# Trying to get all the locations of USA and India

import pandas as pd

us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
us_cities = pd.read_html(us_url)[0].iloc[:, 1].tolist() + pd.read_html(us_url)[0].iloc[:, 2].tolist() + pd.read_html(us_url)[0].iloc[:, 3].tolist()
us_Federal_district = pd.read_html(us_url)[1].iloc[:, 0].tolist()
us_Inhabited_territories = pd.read_html(us_url)[2].iloc[:, 0].tolist()
us_Uninhabited_territories = pd.read_html(us_url)[3].iloc[:, 0].tolist()
us_Disputed_territories = pd.read_html(us_url)[4].iloc[:, 0].tolist()

us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories

in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist() + pd.read_html(in_url)[3].iloc[:, 4].tolist() + pd.read_html(in_url)[3].iloc[:, 5].tolist()
in_unions = pd.read_html(in_url)[4].iloc[:, 0].tolist()
ind = in_states + in_unions

usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind]) 


# Country name checker function

def checkl(T): 
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if(ele in T)]
    res_us = [ele for ele in us if(ele in T)]
  
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True :
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(','))>1 :
        if T.split(',')[0] in indToStr or  T.split(',')[1] in indToStr :
             T = 'India'
        elif T.split(',')[0] in usToStr or  T.split(',')[1] in usToStr :
             T = 'USA'
        else:
             T = "Others"
    else:
        T = "Others"
    return T

# Appling the function on the dataframe column

print(df['user_location'].dropna().apply(checkl).value_counts())
Others    74206
USA       47840
India     20291
Name: user_location, dtype: int64

I am quite new in python coding. I think this code can be written in a better and more compact form. Also, I think there are still a lot of edge cases left to deal with.

Any criticisms and suggestions to improve the efficiency & readability of my code would be greatly appreciated. Also, I want to know if there exist any dedicated python modules so that it can convert all the locations automatically into their country names!

One Answer

Test cases

I would create test cases. Something like the code below, but using the unittest or pytest module would be better. That way you can change the code with more confidence.

known_values = [
    ('astroworld', 'Others'),
    ('New York, NY', 'USA'),
    ('Indianapolis, IN', 'USA'),
    ('Pewee Valley, KY', 'USA'),
    ('Stuck in the Middle ', 'Others'),
    ('Jammu and Kashmir', 'India'),
    ('Новоро́ссия', 'Others'),
    ('Gainesville, FL', 'USA'),
    ('Dhaka,Bangladesh', 'Others'),
    ('Hotel living - various cities!  Who needs a home when hotel living is so fabulous!', 'Others'),
    ('Africa', 'Others'),
    ('New Delhi', 'India'),
    ('Nagaland, India', 'India'),
    ('Brussels', 'Others'),
    ('Florida, USA', 'USA'),
    ('Northwest Indiana', 'USA'),
    ('Graz', 'Others'),
    ('Mumbai, India', 'India'),
 ]

for user_loc, loc in known_values:
  print(user_loc)   # The last printed before the assertion error that fails, feel free to comment out
  assert checkl(user_loc) == loc

read_html

I would use pd.read_html just twice, for the two files, and I would use those dataframes later many times. For example:

india_table = pd.read_html(in_url)
in_states = india_table[3].iloc[:, 0].tolist() + india_table[3].iloc[:, 4].tolist() + 
        india_table[3].iloc[:, 5].tolist()
in_unions = india_table[4].iloc[:, 0].tolist()

I guess it is faster, because there is no need to fetch and read the web page several times.

Other

You might define TSplit like that:

TSplit = T.strip().lower().replace(",", " ").split()

This replaces the commas with spaces first, so you don't need to parse twice. Its result will be different however. I'm not sure which is the best.

For example here are the results for Talangana Hyderabad, India and Minneapolis,MN:

 new TSplit: ['talangana', 'hyderabad', 'india']
old TSplit: ['india', 'hyderabad,', 'talangana hyderabad', 'talangana']
 new TSplit: ['minneapolis', 'mn']
old TSplit: ['minneapolis,mn', 'mn', 'minneapolis']

I would change bool(res_ind) == True to len(res_ind) > 0. It's clearer.

Answered by Arpad Horvath on December 23, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP