
Finding the average of a raster across a given region, in R, for many points in time

Geographic Information Systems Asked by K Kochanski on December 17, 2020

I have a time series of rasters (daily maps of global temperature for the past year) and a shapefile outlining state boundaries.

I’d like to find the average temperature in each state on each day of the previous year, for a result like:

Alabama 01-01-2019 : 5.2 degrees
Alabama 01-02-2019 : 3.6 degrees
... 
Wyoming 12-30-2019 : -3.1 degrees
Wyoming 12-31-2019 : -2.0 degrees

How would I do this in R? (Other project requirements prevent me from using Python or QGIS.)

The approach I’ve tried is:

library(sf)
library(raster)
library(exactextractr)

shape = st_read("./states_shp")
temp_raster = brick("temperature_data.nc")
mean_temps = exact_extract(temp_raster, shape, 'mean')

This successfully masks my raster data using the shape file, but the 'mean' operation performs a temporal mean. Is there a way I can modify this to get the spatial mean across each state while keeping the time dimension?

Edit: following a suggestion below, here’s the data. The shapefile is from the US Census (https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html):

> library(sf)
> shape = st_read("cb_2018_us_state_500k/cb_2018_us_state_500k.shp")
> head(shape)
Simple feature collection with 6 features and 9 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -103.0026 ymin: 28.92861 xmax: -75.24227 ymax: 40.6388
geographic CRS: NAD83
  STATEFP  STATENS    AFFGEOID GEOID STUSPS           NAME LSAD        ALAND
1      28 01779790 0400000US28    28     MS    Mississippi   00 121533519481
2      37 01027616 0400000US37    37     NC North Carolina   00 125923656064
3      40 01102857 0400000US40    40     OK       Oklahoma   00 177662925723
   AWATER                       geometry
1  3926919758 MULTIPOLYGON (((-88.50297 3...
2 13466071395 MULTIPOLYGON (((-75.72681 3...
3  3374587997 MULTIPOLYGON (((-103.0026 3...

2 Answers

(I accidentally wrote this about precipitation, but the same ideas will carry over.)

What you're looking for is called zonal statistics.

Zonal statistics

Many packages compute zonal statistics slowly, running a polygon-overlap check for every cell in your gridded data. With gridded data, the calculation can be done much more quickly by tracking which cells the polygon's boundary traverses (see here for more details). Accuracy is a separate issue: many packages fudge their results by only checking whether a grid cell overlaps the polygon at all and, if it does, assigning the entire cell to the polygon. A more accurate method weights each cell by its fractional overlap.

The method I detail below is both performant and accurate.
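
For intuition, here's a toy example (the numbers are made up) of why the fractional-overlap weighting matters: three cells intersect a polygon, with 100%, 50%, and 10% of their areas inside it.

vals     = c(10, 20, 30)    #Hypothetical cell values
coverage = c(1.0, 0.5, 0.1) #Hypothetical fraction of each cell inside the polygon

mean(vals)                    #Naive: every intersecting cell counts in full
weighted.mean(vals, coverage) #Accurate: each cell weighted by its fractional overlap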

To demonstrate how to do it, let's acquire some climate data from MACA, specifically MACAv2-LIVNEH monthly data from the bcc-csm1-1 model over the historical 1990-2005 period for pr (precipitation), tasmin (minimum temperature), and tasmax (maximum temperature).

This dataset gives monthly values of each climate variable on a grid covering the US.

We can acquire the boundaries of the US counties from the US Census. I'll use the 5m data for speed.
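
For example, the 5m county file can be downloaded and unpacked directly from R. (The URL below is my assumption of where the Census Bureau hosts the 2018 cartographic boundary files; adjust it if the file has moved.)

#Assumed download location for the 2018 cartographic boundary files
download.file("https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip", "cb_2018_us_county_5m.zip")
unzip("cb_2018_us_county_5m.zip")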

Now, we can achieve your zonal statistics as follows:

#Installation command, if needed
#install.packages(c("raster", "ncdf4", "sf", "exactextractr", "matrixStats", "dplyr", "tidyr", "stringr", "reshape2"))
#Alternative installation in case a package is unavailable on CRAN (unlikely)
#devtools::install_github("isciences/exactextractr")

library(exactextractr)   #For `exact_extract`
library(matrixStats)     #For `colWeightedMeans`
library(raster)          #For `brick`
library(sf)              #For `st_read`
library(stringr)         #For `str_sub`
library(dplyr)           #For `%>%`, `mutate`, `group_by`, and `summarize`
library(tidyr)

#Read in climate data
precip_fname = "macav2livneh_pr_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"
tasmin_fname = "macav2livneh_tasmin_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"
tasmax_fname = "macav2livneh_tasmax_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc"

#A brick is a stack of rasters all of which have the same spatial extent and CRS
precip = raster::brick(precip_fname)
tasmin = raster::brick(tasmin_fname)
tasmax = raster::brick(tasmax_fname)

#Move from [0,360) system to [-180,180) system.
#Ignore warning: "this does not look like an appropriate object for this function"
precip = raster::rotate(precip)
tasmin = raster::rotate(tasmin)
tasmax = raster::rotate(tasmax)

#Read in county outlines
shape_orig = st_read("cb_2018_us_county_5m.shp")

#Function for examining the output
sanity_check = function(values, coverage){
  print(length(values))
  print(head(values))
  print(head(coverage))
  print(sum(0.0625*0.0625*111*111*coverage))
  stop()
}

shape = head(shape_orig)  #For speed, look at only a few polygons

exact_extract(precip, shape, fun=sanity_check, stack_apply=FALSE)

As you can see, I've written the above so that the function doing the zonal statistics prints some diagnostics and then stops on its first call. This gives us an opportunity to sanity-check the inputs, and it's also helpful for debugging and experimenting with functions as we go. Alternatively, we could capture the values passed to the function into globals:

bob = NA
sanity_check = function(values, coverage){
  bob <<- values  # Capture `values` to global `bob` using the special `<<-` operator
  stop()
}
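
Re-running the extraction then leaves the first polygon's values sitting in bob for inspection at the console. (The call is wrapped in try() only because the function deliberately stops.)

try(exact_extract(precip, shape, fun=sanity_check, stack_apply=FALSE))
head(bob)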

The output of values looks like this

  X1990.01.15 X1990.02.15 X1990.03.15 X1990.04.15 X1990.05.15 X1990.06.15
1    40.08998    104.5751    64.49851    125.1040    77.43653    112.2190
2    37.99795    109.5864    65.67377    129.4187    73.84611    108.2572
3    36.30927    106.8414    61.52228    126.5196    70.75780    108.2304
4    35.39827    109.0072    61.63351    128.3016    72.66797    110.2836
5    32.59171    101.9070    60.04919    130.0065    69.71605    110.9107
6    43.30358    107.1016    63.17825    128.2818    78.63950    112.5901

and the output of coverage looks like this

0.029508008 0.054003537 0.047953043 0.041902550 0.001358363 0.015525309

values has 192 rows and coverage has 192 entries. These correspond to the 192 cells in the raster dataset that intersect the first polygon.

That first polygon, obtained via head(shape), is from Highland County, OH. Google gives it an area of 1445.21 square kilometers.

The coverage values indicate what fraction of each cell overlaps the polygon of interest. Checking print(precip) we see

        geospatial_lat_resolution: 0.0625
        geospatial_lon_resolution: 0.0625

So we can estimate the total area of the cells covering the first polygon (assuming that a degree of latitude and a degree of longitude are both 111 km) as:

print(sum(0.0625*0.0625*111*111*coverage))

which gives 1857.395 km^2. If we were using a better approximation of the length of a degree of longitude, say 86.6km, we'd get 1449.103 square kilometers, which is almost exactly right.
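
That 86.6 km figure is roughly what you get by scaling the 111 km length of a degree of latitude by the cosine of the latitude; Highland County sits near 39°N:

111 * cos(39 * pi / 180)   #~86.3 km per degree of longitude at 39 N
1857.395 * 86.3 / 111      #~1444 km^2 after rescaling the estimate above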

So, if we want the average monthly precipitation in the polygon, we assume precipitation falls homogeneously within each cell of the precip raster and take the spatial average across the whole polygon. That means taking a weighted average of each column of values, using the coverage vector as the weights:

get_precip = function(values, coverage){
  values %>% summarize(across(everything(), ~ weighted.mean(.x, coverage, na.rm=TRUE)))
}

shape = head(shape_orig) #For speed, look at only some polygons

out = exact_extract(precip, shape, fun=get_precip, stack_apply=FALSE)

out is now a dataframe with a column for each month in the raster dataset

  X1990.01.15 X1990.02.15 X1990.03.15 X1990.04.15 X1990.05.15 X1990.06.15
1    41.00624   119.80429    66.97151   134.63554   82.209904   97.742532
2   100.64820    30.94722   311.52611    68.78153    9.951976    2.895623
3   124.48859   269.42761    91.15898   111.15357   28.810363   42.215396
4    53.80008   110.29271    61.91393   102.35224  136.917751  113.676423
5    83.20906   323.28954   117.29841   134.12935   15.661511   59.689629
6    51.07177   241.49114    96.58443   195.22793   74.009000   21.333790

but each row corresponds to one of the polygons in shape.

We can join this to our spatial data with cbind:

shape_out = cbind(shape_orig, out)

A lot happened in get_precip, so let's break it down. values is piped (%>%) into dplyr's summarize, where across(everything()) asks for one summary value per column. That summary value is the weighted.mean of the column's values (.x), using the coverage vector as weights (dplyr treats each column as a vector), and na.rm=TRUE drops missing values. The result is a data frame with one column per month, since we didn't drop or combine any columns of the input data frame values. exact_extract then rbinds these per-polygon data frames into the combined output.
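
As an aside, the same column-wise weighted means can be computed with matrixStats (which is why colWeightedMeans was loaded above). Here is a sketch of an equivalent summary function; the name get_precip_ms is mine, and values is coerced to a matrix first:

get_precip_ms = function(values, coverage){
  #Weighted mean of each month's column, weights given by the coverage fractions
  colWeightedMeans(as.matrix(values), w=coverage, na.rm=TRUE)
}

out_ms = exact_extract(precip, shape, fun=get_precip_ms, stack_apply=FALSE)

Depending on the exactextractr version, the per-polygon results of a function that returns a vector may come back as a matrix or list rather than a data frame.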

Now, what if we wanted the spatial average of the total yearly precipitation? For that we'd want to sum together all the months of the year and then get a weighted spatial average:

get_precip = function(values, coverage){
  values %>% mutate(id=1:n())                      %>% # Give each region a unique numeric id
             reshape2::melt(id.var='id')           %>% # Convert from wide format to long format. Table now has columns id, variable, value where variable is, e.g., X1990.01.15
             mutate(year=str_sub(variable,2,5))    %>% # Extract year from the variable column
             select(-variable)                     %>% # Drop variable column
             group_by(year, id)                    %>% # Group by year and id
             summarize(avg=sum(value, na.rm=TRUE)) %>% # For each year in each region, get the total precipitation
             reshape2::dcast(id~year)              %>% # Convert back to wide format where each column is a year
             select(-id)                           %>% # Drop id column
             summarize(across(everything(), ~ weighted.mean(.x, coverage, na.rm=TRUE))) # Get spatially-weighted average
}

shape = head(shape_orig) #For speed, look at only some polygons

out = exact_extract(precip, shape, fun=get_precip, stack_apply=FALSE)

#Combine with shape
shape_out = cbind(head(shape_orig), out)

You might wonder whether the output of exact_extract comes back in a different order than shape; fortunately, the package author says the output order is the same as the input order.

Correct answer by Richard on December 17, 2020

When asking an R question, please include some example data like this (mostly taken from ?raster::extract):

library(raster)
cds1 <- rbind(c(-180,-20), c(-160,5), c(-60, 0), c(-160,-60), c(-180,-20))
cds2 <- rbind(c(80,0), c(100,60), c(120,0), c(120,-55), c(80,0))
polys <- spPolygons(cds1, cds2)

b <- brick(nl=3)
values(b) <- 1:prod(ncell(b), nlayers(b))

Solution

v <- extract(b, polys, fun=mean, na.rm=TRUE)
v 
#      layer.1   layer.2  layer.3
#[1,] 39277.99 104077.99 168878.0
#[2,] 31904.29  96704.29 161504.3

Or more exact(r):

vv <- extract(b, polys, weights=TRUE, fun=mean, na.rm=TRUE)
vv
#      layer.1   layer.2  layer.3
#[1,] 39266.06 104066.06 168866.1
#[2,] 31903.33  96703.33 161503.3

With each row a polygon (state) and each column a time step.
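
If you want the long one-row-per-state-per-day layout from the question, the matrix can be melted (a sketch; these example polygons carry no state names, so they are identified only by row number):

library(reshape2)   #For `melt`
vl <- melt(v, varnames=c("polygon", "layer"), value.name="mean_value")
head(vl)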

Or with exactextractr:

library(exactextractr)
library(sf)   #For coercing the polygons to an sf object below

p <- as(polys, "sf")

ev <- exact_extract(b, p, "mean")
#   mean.layer.1 mean.layer.2 mean.layer.3
#1     39267.17    104067.16     168867.2
#2     31903.69     96703.69     161503.7

So that works fine, contrary to what you say.

You can also do this "by hand"

e <- exact_extract(b, p)
f <- function(x) {
    #x has one column per layer plus a coverage_fraction column
    v <- colSums(x * x$coverage_fraction)  #Coverage-weighted sum of each column
    v <- v / sum(x$coverage_fraction)      #Divide by total coverage to get weighted means
    v[-length(v)]                          #Drop the entry for coverage_fraction itself
}

ve <- sapply(e, f)
t(ve)
#      layer.1   layer.2  layer.3
#[1,] 39267.17 104067.16 168867.2
#[2,] 31903.69  96703.69 161503.7

Answered by Robert Hijmans on December 17, 2020
