Replicating Missirian & Schenkler (2017): Data Processing

Joshua BrinksISciences, LLC 

Highlights:

  • We attempted to replicate a recent high-impact environment-security study.
  • We successfully demonstrated their data pre-processing and visualization steps.
  • We were unable to recreate their statistical modeling efforts.

Keywords:

replication, spei, raster, gadm

Contents:

Introduction

This is the second installment in our series replicating Missirian and Schenkler’s 2017: Asylum applications respond to temperature fluctuations. In our first entry we introduced the study and several methodological questions after reviewing the core manuscript and supplementary materials. In our 2nd vignette we’ll carry out the core data processing steps and introduce several functions developed in the course of this replication study. In addition to presenting exercises in scientific replication, duplicator features several functions designed to make high impact interdiscplinary research more accessible to users. For this replication study, we developed tools to acquire vector boundary objects, generate crop growing season summary statistics, generate crop calendars, and perform fast zonal statistics with research-oriented outputs. We’ll demonstrate all of these tools in this vignette. Let’s begin by loading duplicator.

library(duplicator)

Cropping Calendar Data

Missirian and Schenkler utilize 4 datasets in their core temperature and asylum application model:

  • University of Delaware Air Temperature & Precipitation
  • Farming the planet: 2. Geographic distribution of crop areas, yields, physiological types, and net primary production in the year 2000
  • Crop planting dates: an analysis of global patterns
  • The United Nations High Commissioner for Refugees Populations of Concern: Asylum Application Status

We’ll begin by creating the crop calendar for each source country. The authors do not explicitly state the means by which they assign the planting and harvest dates to each country. The following passage is from the supplementary materials:

Since the weather data is monthly, we define the growing season to start on the first of the month of the average planting date and to end on the last of the month of the average harvest date.

We will use exactextractr and Global Administrative Boundaries (GADM) to calculate summary statistics for cropping dates for all the source countries of interest using a zonal extraction. First, we need to ingest a list of source countries used in the study and create an sf object of vector country boundaries. duplicator can assist with both tasks; the list of countries is available as an embedded dataset in duplicator (study.countries), and combineGADM() is a convenience wrapper developed for raster::getData() that makes a singular sf object given a vector of character strings that represent ISO3 country codes of interest. duplicator::combineGADM() will create a directory to place all the downloaded country boundaries. For convenience we have pre-processed the shapefile and embedded it with our package. The sf object is available as a system file with duplicator named study_countries.shp. For more information regarding the use of combineGADM() and the study_countries.shp please refer to the help documentation and the raw data generation script.

The simple call to duplicator::combineGADM() would be as follows, but in the essence of time and space we’ll utilize the stored data

gadm.study.countries<-duplicator::combineGADM(duplicator::study.countries$iso3)
gadm.study.countries<-sf::st_read(system.file("shape/study_countries.shp", package="duplicator"))
#> Reading layer `study_countries' from data source `C:\Users\jbrin\Documents\R\win-library\3.6\duplicator\shape\study_countries.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 103 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -180 ymin: -34.83292 xmax: 180 ymax: 81.85468
#> geographic CRS: WGS 84
plot(gadm.study.countries["NAME_0"])

The source countries in this study are a diverse mix by all accounts; culturally, geographically, and size. Next we need to bring in the maize cropping calendar data. The cropping calendar data is available for several crops, planting, and harvest metrics. The maize mean planting and harvest data are made available by duplicator. If you would like to review the raw code used to produce the data included for this analysis, please review the scripts in duplicator’s raw-data directory. Let’s calculate summary statistics for the planting and harvest data using cropcalStats().

country.crop.calendar <-
  duplicator::cropcalStats(
    duplicator::maize.cal.plant,
    duplicator::maize.cal.harv,
    gadm.study.countries,
    "GID_0"
  )
#> Extracting Planting Data
#> Extracting Harvest Data

By default, cropcalStats() accepts a planting raster (maize.cal.plant), a harvest raster (maize.cal.harv), an object of vector / shapefile boundaries (gadm.study.countries), the shapefile polygon identifier ("GID_0"), and it will use exactextractr::exact_extract() to calculate the min, mean, mode, and max of each layer. You can adjust the type and number of summary statistics as you see fit. For more information on available zonal summary statistics please review the exactextractr documentation. Let’s review our crop calendar summary statistics.

country.crop.calendar[1:15]

The function produces columns for each summary statistic prefixed by either ‘plant’ or ‘harvest’. At first glance we can see several issues not only the planting and harvest dataset, but rasterizing inherently tabular data. Country polygon boundaries will meander along cells that are derived from tabular data produced by neighboring nations; therefore min and max values may be wildly out of sync with mode and min planting or harvest dates. Although those issues are more simply addressed by utilizing the mode planting and harvest values, our data.table of summary statistics reveals larger issues; 4 countries have no available planting or harvest data.

country.crop.calendar[is.na(plant.mode)|is.na(harvest.mode),GID_0]
#> [1] COM MUS BTN JAM
#> 103 Levels: AFG AGO ALB ARM AZE BDI BEN BFA BGD BIH BLR BOL BRA BTN CAF CHN CIV CMR COD COG ... ZWE

The manuscript did not reference this scenario, however, the “Crop planting dates: an analysis of global patterns” dataset does provide an interpolated dataset for areas with no tabular cropping information. While this data is made available to the public, the metadata for the interpolated cropping calendar data specifically states:

The unfilled maps (files without “.fill” in their names) contain data only for grid cells in regions where we actually have crop calendar observations. The filled maps (files with “.fill” in their names) contain spatially extrapolated crop calendar data. These filled files could be used, for example, as inputs to a global model that requires data in every grid cell. See the “Extrapolation routine” section, below, for details on how we performed this spatial extrapolation.

However, please, please bear in mind the following caveats about the extrapolated values before deciding to use them:

** WARNING: THE EXTRAPOLATED VALUES SHOULD BE USED WITH CARE **

** WE HAVE NOT CHECKED THESE EXTRAPOLATED VALUES TO MAKE SURE THEY ARE REASONABLE IN ALL PLACES **. In fact, they are probably unreasonable in many places. In some cases you may be using data that have been extrapolated from a different continent!

** YOU SHOULD ONLY USE THESE EXTRAPOLATED VALUES IF YOU HAVE CHECKED THEM YOURSELF AND HAVE DETERMINED THEM TO BE REASONABLE ENOUGH FOR YOUR PURPOSES **

Those warnings are quite strong. The authors make no mention of how they handled missing planting or harvest data, and if they employed any verification procedures for interpolated data, but we will carry out the analysis under the assumption that the interpolated datasets were used. The interpolated datasets are made available with duplicator. Let’s recalculate our summary statistics using the interpolated datasets.

country.crop.calendar <-
  duplicator::cropcalStats(
    duplicator::maize.cal.plant.fill,
    duplicator::maize.cal.harv.fill,
    gadm.study.countries,
    "GID_0"
  )
#> Extracting Planting Data
#> Extracting Harvest Data
country.crop.calendar[is.na(plant.mode)]

There are no missing data entries using the interpolated dataset. We’ll generate our country cropping calendars with the filled dataset. The cropCal() function simplifies generating country cropping calendars for countries based on specified summary statistics. In this case we will generate maize country cropping calendars based on planting and harvest mode dates from our country.crop.calendar object.

country.crop.months <- duplicator::cropCal(country.crop.calendar,
                                           "GID_0",
                                           "plant.mode",
                                           "harvest.mode")
#> Warning. The following boundaries have planting dates greater than harvest dates: 
#>  2 6 20 32 58 66 65 63 68 79 101 103 102 61 55 67 12 13 74 87 
#>  
#>  Their cropping calendars will be flipped and traverse January 1. 
#>  
#> Generating Crop Calendar For: DZA  AGO  BDI  CMR  CPV  CAF  TCD  COM  COG  COD  BEN  ETH  ERI  DJI  GAB  GMB  GHA  GIN  CIV  KEN  LBR  LBY  MDG  MWI  MLI  MRT  MUS  MAR  MOZ  NAM  NER  NGA  GNB  RWA  SEN  SLE  SOM  ZAF  ZWE  SDN  TGO  UGA  EGY  TZA  BFA  ZMB  AFG  BGD  BTN  MMR  KHM  LKA  CHN  PSE  IND  IDN  IRN  IRQ  KAZ  JOR  PRK  KWT  KGZ  LBN  MYS  MNG  NPL  PAK  PHL  SAU  VNM  SYR  TJK  THA  TKM  UZB  YEM  ALB  AZE  ARM  BIH  BLR  GEO  MDA  RUS  SRB  UKR  MKD  BOL  BRA  COL  CUB  DOM  ECU  SLV  GTM  HTI  HND  JAM  NIC  PER  SUR  VEN

This function generates a 2 column data.table / data.frame with months where the specified crop is grown and the location where the crop is grown. We can use this cropping calendar object to subset the monthly surface temperature data we prepare in the next section. For many tropical countries, the planting day of the year is greater than the harvest day of the year. cropCal will automatically check for this scenario and permit the growing season to traverse January 1st as it generates cropping calendars. cropCal will advise the user of this behavior and list the ISO3 codes of all the countries or boundaries whom had planting day of the year greater than harvest day of the year. Now that we’ve generated the country cropping calendars we will calculate the weighted mean surface temperature.

Climate and Cropping Fraction Data

The University of Delaware Air Temperature & Precipitation data is available as a NetCDF files with monthly values from 1900-2017. For this replication study we prepared a subset of the data that matches Missirian and Schenkler’s study period of 2000–2014. Please refer to raw-data for more information how this datset was prepared.

The authors state that they performed a weighted mean of mean surface temperature and cumulative precipitation adjusted for the fraction of maize cultivated in a given country. Although I feel there’s an established best practice for this process, the authors leave their methods up to interpretation. Their narrative describing the climate calculations appear to conflate the term “grid” and “cell”; making it difficult to infer their methods.

We use the gridded data set of (42) to construct the fraction of each climate grid in a country that grows maize. Moreover, the data set gives the start and end dates of the maize growing season. Since the weather data is monthly, we define the growing season to start on the first of the month of the average planting date and to end on the last of the month of the average harvest date… The average temperature is then averaged over all grids in a country, where the weights equal the maize growing area in a cell. These are time-invariant weights. This gives us the weather averaged over the time and area where maize is grown in a country. We first calculate nonlinear transformations (the square or a restricted cubic spline of the average temperature over season) for each climate grid, and then the average, since the average of a nonlinear function is different than the nonlinear function of the average.

This passage suggests that country wide surface temperature is a function of the country specific fraction of maize cropland and not weighted by the spatially explicit maize cropping fraction directly corresponding to the temperature cell. Because I feel the latter is a more appropriate analysis we will carry on by calculating the spatially explicit weighted mean surface temperature by maize cropping fraction. Furthermore, the authors performed quadratic transformations of climate data prior to summarizing to the country level so we will perform the zonal extractions twice per climate variable.

In order to calculate the weighted mean zonal extraction the target raster and the weighted raster must be in the same extent and resolution. Because the maize cropping fraction data is a finer resolution than the monthly temperature data we have to raster::aggregate() the cropping fraction data to match the monthly temperature data. Additionally, the climate data has an odd spatial extent (0–360 as opposed to -180 to 180). We must rotate the climate data to a traditional expanse before aggregating the maize cropping fraction to a resolution that matches the monthly climate data. The climate data was rotated with raster::rotate() during pre-processing before the dataset was embedded into the package.

Air Temperature

First we’ll process surface temperature data.

# Check coordinate system, extent, and resolution of rasters
raster::crs(duplicator::maize.frac)
#> CRS arguments:
#>  +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
raster::crs(duplicator::monthly.temp)
#> CRS arguments:
#>  +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

raster::extent(duplicator::maize.frac)
#> class      : Extent 
#> xmin       : -180 
#> xmax       : 180 
#> ymin       : -90 
#> ymax       : 90
raster::extent(duplicator::monthly.temp)
#> class      : Extent 
#> xmin       : -180 
#> xmax       : 180 
#> ymin       : -90 
#> ymax       : 90

raster::res(duplicator::maize.frac)
#> [1] 0.08333333 0.08333333
raster::res(duplicator::monthly.temp)
#> [1] 0.5 0.5

res.factor<-(raster::res(duplicator::monthly.temp)/raster::res(duplicator::maize.frac))[1]
maize.frac.agg<-raster::aggregate(duplicator::maize.frac, fact = res.factor, fun = mean)

The authors did not indicate if they adjusted the maize cropping fraction of each raster cell by the area of each cell. While it’s an often overlooked aspect of raster data preparation, cells projected with latitude and longitude are not uniform in area; they become smaller as you move poleward. Therefore, any weighted measurement should be adjusted for the area of the cell.

maize.frac.adj<- maize.frac.agg*raster::area(maize.frac.agg)
rm(maize.frac.agg)

Although we’ve prepared the monthly temperature and maize cropping fraction data to have congruent extents and resolutions, there are still disparities between the datsets along coastal regions due to aggregation methods. Therefore, before we calculate weighted mean temperature over maize cropping lands, we’ll manually set all maize cropping data that is NA to 0. This will ensure that surface temperature data where there is no known maize cropping information will be ignored.

Now we can calculate the spatially explicit weighted mean monthly temperature where maize is grown using the weightedStack() function.

maize.frac.adj[is.na(maize.frac.adj)]<-0
country.temp.month<-weightedStack(duplicator::monthly.temp,
                                  maize.frac.adj,
                                  gadm.study.countries,
                                  "GID_0",
                                  var.name = "month.year",
                                  value.name = "mean.temp")

We need to perform a second summary on the squared temperature and bind them together.

monthly.temp.sq <- duplicator::monthly.temp ** 2

country.temp.month.sq <- duplicator::weightedStack(
  monthly.temp.sq,
  maize.frac.adj,
  gadm.study.countries,
  "GID_0",
  var.name = "month.year",
  value.name = "mean.temp.sq"
)
country.temp.month <- merge(country.temp.month,
                            country.temp.month.sq,
                            by = c("GID_0", "month.year"))

Surface Precipitation

We’ll repeat the same procedure for the surface precipitation. The air temperature and precipitation data have the same file formats, resolutions, and extents because they are part of the same greater dataset, so the processing steps will be the same. We’ll carry it out in a single chunk.

raster::extent(maize.frac.adj)
raster::extent(duplicator::monthly.precip)

country.precip.month <- weightedStack(
  duplicator::monthly.precip,
  maize.frac.adj,
  gadm.study.countries,
  "GID_0",
  var.name = "month.year",
  value.name = "mean.precip"
)

monthly.precip.sq <- duplicator::monthly.precip ** 2

country.precip.month.sq <-
  duplicator::weightedStack(
    monthly.precip.sq,
    maize.frac.adj,
    gadm.study.countries,
    "GID_0",
    var.name = "month.year",
    value.name = "mean.precip.sq"
  )

country.precip.month <-
  merge(country.precip.month,
        country.precip.month.sq,
        by = c("GID_0", "month.year"))

Now we merge the temperature and precip data into a single data.table and clear out the rasters from our workspace now that we’re finished with them.

country.climate.month<-merge(country.temp.month, country.precip.month, 
                             by=c("GID_0", "month.year"))

Subsetting for the Crop Calendar

We’ve generated weighted mean climate data for all countries and all months, but we want to subset this data to reflect our maize cropping country calendars. We can accomplish this by creating an index of all records with a desired country - month for the maize growing season in each country. Then we’ll use that index (keeps) to subset the mean monthly weighted climate data. Lastly, we’ll perform a bit of housekeeping by re-ordering the columns and renaming the country identifier to reflect the actual coding scheme (ISO3).

country.climate.month[,month:=substr(month.year,1,3)][,
                    year:=as.numeric(substr(month.year,5,8))][,month.year:=NULL]

# Create an index of rows in the monthly temperature data we need to keep
keeps = sort(country.climate.month[country.crop.months, on=.(GID_0, month), which=TRUE, nomatch=0])

# Subset the monthly temperature data based on the index
country.climate.month<-country.climate.month[keeps]
data.table::setcolorder(country.climate.month, c("GID_0","year","month"))
names(country.climate.month)[1]<-"iso3"

rm(country.crop.calendar, country.crop.months)

Lastly, we need to calculate the baseline annual climate data for each country across the study period, and annual deviations from that mean. You can chain this together in a single data.table string but it’s not very readable. We’ll break it up into a few lines.

# First the annual averages
country.climate.year <-
  country.climate.month[, .(temp.mean = mean(mean.temp),
                            temp.sq.mean = mean(mean.temp.sq),
                            precip.mean = mean(mean.precip),
                            precip.sq.mean = mean(mean.precip.sq)),
                        by = .(iso3, year)]

# Now the baselines and anomalies
country.climate.year[
  ,temp.base := mean(temp.mean), by = iso3][
   ,temp.anom := (temp.mean - temp.base)][
    ,temp.sq.base := mean(temp.sq.mean), by = iso3][
     ,temp.sq.anom := (temp.sq.mean - temp.sq.base)][
      ,precip.base := mean(precip.mean), by = iso3][
       ,precip.anom := (precip.mean - precip.base)][
        ,precip.sq.base := mean(precip.sq.mean), by = iso3][
         ,precip.sq.anom := (precip.sq.mean - precip.sq.base)]

rm(country.climate.month)

Now we can prepare the asylum application data.

Monthly Asylum Application Data

The primary response variable for the author’s core model is the log of annual asylum applications from countries who are not a member of the European Union (EU) or Organisation for Economic Co‑operationand Development (OECD). We will need a list of the respective members in order to quickly prepare the asylum applicants. We can easily create a vector of member names with the commonCodes package and the commonCodes::commonCodes dataset.

oecd<-unique(commonCodes::commonCodes[oecd=="1",.(gadm,iso3)])
eu<-unique(commonCodes::commonCodes[eu=="1",.(gadm,iso3)])

The United Nations High Commissioner for Refugees Persons of Concern: Asylum-Seekers (Refugee Status Determination) dataset presents detailed annual records of dyadic asylum seeker data. We can acquire the latest asylum data with untools::getUNapps(). This function will retrieve the dataset directly from the UN API and sum the totals from duplicate records. Let’s download the data.

apps<-untools::getUNapps()
#> Making UNHCR API request for page: 1
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
#> Making UNHCR API request for page: 2
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
#> Making UNHCR API request for page: 3
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
names(apps)
#>  [1] "year"           "coo_name"       "coa_name"       "procedure_type" "app_type"      
#>  [6] "dec_level"      "app_pc"         "coo_iso"        "coa_iso"        "coo"           
#> [11] "coa"            "coo_id"         "coa_id"         "applied"

Our variable of interest is the log of applied. For more information regarding the asylum application status dataset refer to help(untools::getUNapps()). To properly prepare the data we must:

  • Subset for only records where the country of asylum application (coa, coa_iso) is an EU member.
  • Subset for only records where the country of origin (coo, coo_iso) is not an EU or OECD member. (We’ll be safe and subset using the prepared list of the author’s study countries I transcribed and embedded into the package).
  • Subset for only records from 2000–2014 and those with non-zero applicants.
  • Summarize annual (year) applications (applied) across all all levels of procedure_type, app_type, and dec_level by source country (coo, coo_iso).
  • Calculate the log of cumulative in-year applications.
apps<-apps[coa_iso %in% eu$iso3]
apps<-apps[!(coo_iso %in% c(eu$iso3, oecd$iso3))]
apps<-apps[coo_iso %in% study.countries$iso3][year %in% seq(2000,2014,1)]
apps<-apps[,.(apps=sum(applied)), by=.(coo, coo_iso, year)][apps>0][,apps:=log(apps)]

Lastly we need to determine the baseline of applications (annual mean) for each source country over the study period, and annual deviations from that baseline.

apps[,apps.base:=mean(apps, na.rm=TRUE), by=coo_iso]
apps[,apps.anom:=(apps-apps.base)]

We can finally merge the asylum application data with the annual weighted temperature data, verify we did not “lose” any records and do a little housekeeping with the column ordering.

model.dat <- merge(
  apps,
  country.climate.year,
  by.x = c("coo_iso", "year"),
  by.y = c("iso3", "year"),
  all.x = TRUE
)
names(model.dat)[c(1, 3)] <- c("iso3", "name")
table(is.na(model.dat))[1]
#> FALSE 
#> 27378
dim(model.dat)[1] * dim(model.dat)[2]
#> [1] 27378

data.table::setcolorder(
  model.dat,
  c(
    "name",
    "iso3",
    "year",
    "apps",
    "apps.anom",
    "apps.base",
    "temp.mean",
    "temp.anom",
    "temp.base",
    "temp.sq.mean",
    "temp.sq.anom",
    "temp.sq.base",
    "precip.mean",
    "precip.anom",
    "precip.base",
    "precip.sq.mean",
    "precip.sq.anom",
    "precip.sq.base"
  )
)

In the next section we’ll carry out some visual exploration of the prepared data:

Part III: Data Visualization

Additional Meta-Data

Data Categories:

raster, tabular, shapefile, dyadic

Add new comment

Plain text

  • Allowed HTML tags: <a href hreflang> <em> <strong> <cite> <blockquote cite> <code> <ul type> <ol start type> <li> <dl> <dt> <dd>
  • No HTML tags allowed.
  • Web page addresses and email addresses turn into links automatically.