
Building GeoDataFrame row by row

Geographic Information Systems Asked by Arkeen on January 5, 2021

I am trying to build a GeoDataFrame row by row, starting from an empty one. The equivalent using only pandas would be something like this:

df = pandas.DataFrame(columns=['a','b','c','d'])    
df.loc['y'] = pandas.Series({'a':1, 'b':5, 'c':2, 'd':3})

(from [this answer])


My current method

So far, I build a Python list of dicts with a specific structure and then use it to create a GeoDataFrame. Here is a complete example:

import geopandas as gpd
from shapely.geometry import Point

my_dict = {
  '007': {
    'name': 'A',
    'lat': 48.843664, 
    'lon': 2.302672,
    'type': 'small'
  },
  '008': {
    'name': 'B',
    'lat': 50.575813,
    'lon': 7.258148,
    'type': 'medium'
  },
  '010': {
    'name': 'C',
    'lat': 47.058420, 
    'lon': 15.437464,
    'type': 'big'
  }
}

tmp_list = []
for item_key, item_value in my_dict.items():
  tmp_list.append({
    'geometry': Point(item_value['lon'], item_value['lat']),
    'id': item_key,
    'name': item_value['name'],
    'type': item_value['type']
  })
my_gdf = gpd.GeoDataFrame(tmp_list)
print(my_gdf.head())

Here is the result:

                    geometry   id name    type
0   POINT (2.30267 48.84366)  007    A   small
1   POINT (7.25815 50.57581)  008    B  medium
2  POINT (15.43746 47.05842)  010    C     big

What I am looking for

I would like to create an empty GeoDataFrame (my_gdf = gpd.GeoDataFrame()) and then fill it directly in the for loop, without building the temporary list and converting it after the loop (my_gdf = gpd.GeoDataFrame(tmp_list)).

I think building it row by row would perform better in this case. It would also let me use the id keys from my_dict as the GeoDataFrame index, so that the result would be:

                     geometry  name    type
007   POINT (2.30267 48.84366)    A   small
008   POINT (7.25815 50.57581)    B  medium
010  POINT (15.43746 47.05842)    C     big

2 Answers

I don't think building the GeoDataFrame row by row would perform better. I tested it; the durations are noted in the comments below:

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

d = {'007': {'name': 'A', 'lat': 48.843664, 'lon': 2.302672, 'type': 'small' },
     '008': {'name': 'B', 'lat': 50.575813, 'lon': 7.258148, 'type': 'medium'},
     '010': {'name': 'C', 'lat': 47.058420, 'lon': 15.437464,'type': 'big'}}

## ORIGINAL APPROACH (list of dicts, as in the question). Duration: ~1 ms (millisecond)
tmp_list = []
for item_key, item_value in d.items():
    tmp_list.append({
        'geometry': Point(item_value['lon'], item_value['lat']),
        'id': item_key,
        'name': item_value['name'],
        'type': item_value['type']
    })
gdf = gpd.GeoDataFrame(tmp_list)
##


## SOLUTION 1. Duration: ~2.3 ms, @gene's answer.
df = pd.DataFrame.from_dict(d, orient='index')
df["geometry"] = df.apply (lambda row: Point(row.lon,row.lat), axis=1)
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
##


## SOLUTION 2. Duration: ~2.5 ms
gdf = gpd.GeoDataFrame()    
gdf["id"]   = [k for k in d.keys()]
gdf["name"] = [d[k]["name"] for k in d.keys()]
gdf["type"] = [d[k]["type"] for k in d.keys()]
gdf["geometry"]  = [Point(d[k]["lon"], d[k]["lat"]) for k in d.keys()]    
gdf.set_index('id', inplace=True)
##


## SOLUTION 3. Duration: ~9.5 ms
gdf = gpd.GeoDataFrame(columns=["name", "type", "geometry"])
for k, v in d.items():
    gdf.loc[k] = (v["name"], v["type"], Point(v["lon"], v["lat"]))
##

print(gdf)

# OUTPUT for the last solution
#     name    type                   geometry
# 007    A   small   POINT (2.30267 48.84366)
# 008    B  medium   POINT (7.25815 50.57581)
# 010    C     big  POINT (15.43746 47.05842)

Correct answer by Kadir Şahbaz on January 5, 2021
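
The question also asked for the id keys to become the index. A minimal sketch, assuming the same my_dict as in the question, keeps the fast build-once approach and simply promotes the id column to the index afterwards. This combination is not one of the benchmarked solutions above, so no timing is claimed for it.

```python
import geopandas as gpd
from shapely.geometry import Point

my_dict = {
    '007': {'name': 'A', 'lat': 48.843664, 'lon': 2.302672, 'type': 'small'},
    '008': {'name': 'B', 'lat': 50.575813, 'lon': 7.258148, 'type': 'medium'},
    '010': {'name': 'C', 'lat': 47.058420, 'lon': 15.437464, 'type': 'big'},
}

# Collect plain dicts first; no GeoDataFrame is touched inside the loop.
rows = [
    {
        'id': key,
        'name': value['name'],
        'type': value['type'],
        'geometry': Point(value['lon'], value['lat']),  # shapely expects (x, y) = (lon, lat)
    }
    for key, value in my_dict.items()
]

# Build the GeoDataFrame once, then move the 'id' column into the index.
my_gdf = gpd.GeoDataFrame(rows, geometry='geometry').set_index('id')
print(my_gdf)
```

The loop only appends to a Python list, which is cheap, and the single constructor call does the expensive work once.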

You don't need to build the GeoDataFrame row by row here. Look at pandas.DataFrame.from_dict:

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

df = pd.DataFrame.from_dict(my_dict, orient='index')
print(df)
#     name        lat        lon    type
# 007    A  48.843664   2.302672   small
# 008    B  50.575813   7.258148  medium
# 010    C  47.058420  15.437464     big

# Build one Point per row; shapely expects (x, y), i.e. (lon, lat)
df["geometry"] = df.apply(lambda row: Point(row.lon, row.lat), axis=1)

Convert to a GeoDataFrame

gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
#     name        lat        lon    type                    geometry
# 007    A  48.843664   2.302672   small  POINT (2.302672 48.843664)
# 008    B  50.575813   7.258148  medium  POINT (7.258148 50.575813)
# 010    C  47.058420  15.437464     big  POINT (15.437464 47.05842)

Or directly:

gdf = gpd.GeoDataFrame(df, geometry=df.apply(lambda row: Point(row.lon, row.lat), axis=1))
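
A possible variant, not part of the original answer: GeoPandas provides gpd.points_from_xy, which builds all the point geometries from two coordinate columns in a single call instead of a row-wise apply. The sketch below assumes the same my_dict as in the question and a GeoPandas release recent enough to include that function.

```python
import geopandas as gpd
import pandas as pd

my_dict = {
    '007': {'name': 'A', 'lat': 48.843664, 'lon': 2.302672, 'type': 'small'},
    '008': {'name': 'B', 'lat': 50.575813, 'lon': 7.258148, 'type': 'medium'},
    '010': {'name': 'C', 'lat': 47.058420, 'lon': 15.437464, 'type': 'big'},
}

# The outer keys become the index, the inner keys become the columns.
df = pd.DataFrame.from_dict(my_dict, orient='index')

# points_from_xy takes x (longitude) first, then y (latitude).
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']))
print(gdf)
```

This avoids calling a Python lambda once per row, which may matter more as the number of rows grows.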

The reason is explained in "How to append rows in a pandas dataframe in a for loop?":

In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.

Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient. The time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better: the time cost of copying grows linearly with the number of rows.

Answered by gene on January 5, 2021
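
To see the effect on a larger input, here is a rough timing sketch using only pandas. The synthetic records dict and the two helper functions are hypothetical names made up for this illustration. DataFrame.append was removed in pandas 2.0, so the row-by-row variant uses .loc enlargement, as in Solution 3 above. Absolute timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical synthetic data shaped like the question's dict of dicts.
n = 300
records = {
    f'{i:04d}': {'name': f'N{i}', 'lat': 40.0 + i * 0.01, 'lon': 2.0 + i * 0.01}
    for i in range(n)
}


def build_once():
    # A single constructor call over the whole collection.
    return pd.DataFrame.from_dict(records, orient='index')


def build_row_by_row():
    # Enlarges the frame by one row per iteration via .loc.
    df = pd.DataFrame(columns=['name', 'lat', 'lon'])
    for key, value in records.items():
        df.loc[key] = [value['name'], value['lat'], value['lon']]
    return df


t_once = timeit.timeit(build_once, number=3)
t_rows = timeit.timeit(build_row_by_row, number=3)
print(f'build once: {t_once:.4f} s, row by row: {t_rows:.4f} s')
```

Increasing n should widen the gap between the two numbers, which is the pattern the quoted explanation predicts.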
