Joshua Brinks, ISciences, LLC

Summary

The importance of replication and reproducibility are at the core of the scientific process, but these components are readily overlooked.
Not only are they overlooked, but the difficulty in attempting to reproduce a peer reviewed study is rarely demonstrated.
In this series of vignettes we attempt to demonstrate reverse-engineering a recent high-impact peer review publication in the eco-security sector.

Introduction

This is the second installment in our series replicating Missirian and Schenkler’s 2017: Asylum applications respond to temperature fluctuations. In our first entry we introduced the study and several methodological questions after reviewing the core manuscript and supplementary materials. In our 2^nd vignette we’ll carry out the core data processing steps and introduce several functions developed in the course of this replication study. In addition to presenting exercises in scientific replication, duplicator features several functions designed to make high impact interdiscplinary research more accessible to users. For this replication study, we developed tools to acquire vector boundary objects, generate crop growing season summary statistics, generate crop calendars, and perform fast zonal statistics with research-oriented outputs. We’ll demonstrate all of these tools in this vignette. Let’s begin by loading duplicator.

library(duplicator)

Cropping Calendar Data

Missirian and Schenkler utilize 4 datasets in their core temperature and asylum application model:

University of Delaware Air Temperature & Precipitation
Farming the planet: 2. Geographic distribution of crop areas, yields, physiological types, and net primary production in the year 2000
Crop planting dates: an analysis of global patterns
The United Nations High Commissioner for Refugees Populations of Concern: Asylum Application Status

We’ll begin by creating the crop calendar for each source country. The authors do not explicitly state the means by which they assign the planting and harvest dates to each country. The following passage is from the supplementary materials:

Since the weather data is monthly, we define the growing season to start on the first of the month of the average planting date and to end on the last of the month of the average harvest date.

We will use exactextractr and Global Administrative Boundaries (GADM) to calculate summary statistics for cropping dates for all the source countries of interest using a zonal extraction. First, we need to ingest a list of source countries used in the study and create an sf object of vector country boundaries. duplicator can assist with both tasks; the list of countries is available as an embedded dataset in duplicator (study.countries), and combineGADM() is a convenience wrapper developed for raster::getData() that makes a singular sf object given a vector of character strings that represent ISO3 country codes of interest. duplicator::combineGADM() will create a directory to place all the downloaded country boundaries. For convenience we have pre-processed the shapefile and embedded it with our package. The sf object is available as a system file with duplicator named study_countries.shp. For more information regarding the use of combineGADM() and the study_countries.shp please refer to the help documentation and the raw data generation script.

The simple call to duplicator::combineGADM() would be as follows, but in the essence of time and space we’ll utilize the stored data

gadm.study.countries<-duplicator::combineGADM(duplicator::study.countries$iso3)

gadm.study.countries<-sf::st_read(system.file("shape/study_countries.shp", package="duplicator"))
#> Reading layer `study_countries' from data source 
#> `C:\Users\jbrin\Documents\R\win-library\4.0\duplicator\shape\study_countries.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 103 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -180 ymin: -34.83292 xmax: 180 ymax: 81.85468
#> Geodetic CRS: WGS 84
plot(gadm.study.countries["NAME_0"])

The source countries in this study are a diverse mix by all accounts; culturally, geographically, and size. Next we need to bring in the maize cropping calendar data. The cropping calendar data is available for several crops, planting, and harvest metrics. The maize mean planting and harvest data are made available by duplicator. If you would like to review the raw code used to produce the data included for this analysis, please review the scripts in duplicator’s raw-data directory. Let’s calculate summary statistics for the planting and harvest data using cropcalStats().

country.crop.calendar <-
 duplicator::cropcalStats(
 duplicator::maize.cal.plant,
 duplicator::maize.cal.harv,
 gadm.study.countries,
 "GID_0"
 )
#> Extracting Planting Data
#> Extracting Harvest Data

By default, cropcalStats() accepts a planting raster (maize.cal.plant), a harvest raster (maize.cal.harv), an object of vector / shapefile boundaries (gadm.study.countries), the shapefile polygon identifier ("GID_0"), and it will use exactextractr::exact_extract() to calculate the min, mean, mode, and max of each layer. You can adjust the type and number of summary statistics as you see fit. For more information on available zonal summary statistics please review the exactextractr documentation. Let’s review our crop calendar summary statistics.

country.crop.calendar[1:15]

The function produces columns for each summary statistic prefixed by either ‘plant’ or ‘harvest’. At first glance we can see several issues not only the planting and harvest dataset, but rasterizing inherently tabular data. Country polygon boundaries will meander along cells that are derived from tabular data produced by neighboring nations; therefore min and max values may be wildly out of sync with mode and min planting or harvest dates. Although those issues are more simply addressed by utilizing the mode planting and harvest values, our data.table of summary statistics reveals larger issues; 4 countries have no available planting or harvest data.

country.crop.calendar[is.na(plant.mode)|is.na(harvest.mode),GID_0]
#> [1] "COM" "MUS" "BTN" "JAM"

The manuscript did not reference this scenario, however, the “Crop planting dates: an analysis of global patterns” dataset does provide an interpolated dataset for areas with no tabular cropping information. While this data is made available to the public, the metadata for the interpolated cropping calendar data specifically states:

The unfilled maps (files without “.fill” in their names) contain data only for grid cells in regions where we actually have crop calendar observations. The filled maps (files with “.fill” in their names) contain spatially extrapolated crop calendar data. These filled files could be used, for example, as inputs to a global model that requires data in every grid cell. See the “Extrapolation routine” section, below, for details on how we performed this spatial extrapolation.

However, please, please bear in mind the following caveats about the extrapolated values before deciding to use them:

** WARNING: THE EXTRAPOLATED VALUES SHOULD BE USED WITH CARE **

** WE HAVE NOT CHECKED THESE EXTRAPOLATED VALUES TO MAKE SURE THEY ARE REASONABLE IN ALL PLACES **. In fact, they are probably unreasonable in many places. In some cases you may be using data that have been extrapolated from a different continent!

** YOU SHOULD ONLY USE THESE EXTRAPOLATED VALUES IF YOU HAVE CHECKED THEM YOURSELF AND HAVE DETERMINED THEM TO BE REASONABLE ENOUGH FOR YOUR PURPOSES **

Those warnings are quite strong. The authors make no mention of how they handled missing planting or harvest data, and if they employed any verification procedures for interpolated data, but we will carry out the analysis under the assumption that the interpolated datasets were used. The interpolated datasets are made available with duplicator. Let’s recalculate our summary statistics using the interpolated datasets.

country.crop.calendar <-
 duplicator::cropcalStats(
 duplicator::maize.cal.plant.fill,
 duplicator::maize.cal.harv.fill,
 gadm.study.countries,
 "GID_0"
 )
#> Extracting Planting Data
#> Extracting Harvest Data
country.crop.calendar[is.na(plant.mode)]

There are no missing data entries using the interpolated dataset. We’ll generate our country cropping calendars with the filled dataset. The cropCal() function simplifies generating country cropping calendars for countries based on specified summary statistics. In this case we will generate maize country cropping calendars based on planting and harvest mode dates from our country.crop.calendar object.

country.crop.months <- duplicator::cropCal(country.crop.calendar,
 "GID_0",
 "plant.mode",
 "harvest.mode")
#> Warning. The following boundaries have planting dates greater than harvest dates: 
#> AGO BDI COG GAB MDG MWI MUS MOZ NAM RWA ZAF ZWE ZMB MMR LKA MYS BOL BRA PER SUR 
#> 
#> Their cropping calendars will be flipped and traverse January 1. 
#> 
#> Generating Crop Calendar For: DZA AGO BDI CMR CPV CAF TCD COM COG COD BEN ETH ERI DJI GAB GMB GHA GIN CIV KEN LBR LBY MDG MWI MLI MRT MUS MAR MOZ NAM NER NGA GNB RWA SEN SLE SOM ZAF ZWE SDN TGO UGA EGY TZA BFA ZMB AFG BGD BTN MMR KHM LKA CHN PSE IND IDN IRN IRQ KAZ JOR PRK KWT KGZ LBN MYS MNG NPL PAK PHL SAU VNM SYR TJK THA TKM UZB YEM ALB AZE ARM BIH BLR GEO MDA RUS SRB UKR MKD BOL BRA COL CUB DOM ECU SLV GTM HTI HND JAM NIC PER SUR VEN

This function generates a 2 column data.table / data.frame with months where the specified crop is grown and the location where the crop is grown. We can use this cropping calendar object to subset the monthly surface temperature data we prepare in the next section. For many tropical countries, the planting day of the year is greater than the harvest day of the year. cropCal will automatically check for this scenario and permit the growing season to traverse January 1st as it generates cropping calendars. cropCal will advise the user of this behavior and list the ISO3 codes of all the countries or boundaries whom had planting day of the year greater than harvest day of the year. Now that we’ve generated the country cropping calendars we will calculate the weighted mean surface temperature.

Climate and Cropping Fraction Data

The University of Delaware Air Temperature & Precipitation data is available as a NetCDF files with monthly values from 1900-2017. For this replication study we prepared a subset of the data that matches Missirian and Schenkler’s study period of 2000–2014. Please refer to raw-data for more information how this datset was prepared.

The authors state that they performed a weighted mean of mean surface temperature and cumulative precipitation adjusted for the fraction of maize cultivated in a given country. Although I feel there’s an established best practice for this process, the authors leave their methods up to interpretation. Their narrative describing the climate calculations appear to conflate the term “grid” and “cell”; making it difficult to infer their methods.

We use the gridded data set of (42) to construct the fraction of each climate grid in a country that grows maize. Moreover, the data set gives the start and end dates of the maize growing season. Since the weather data is monthly, we define the growing season to start on the first of the month of the average planting date and to end on the last of the month of the average harvest date… The average temperature is then averaged over all grids in a country, where the weights equal the maize growing area in a cell. These are time-invariant weights. This gives us the weather averaged over the time and area where maize is grown in a country. We first calculate nonlinear transformations (the square or a restricted cubic spline of the average temperature over season) for each climate grid, and then the average, since the average of a nonlinear function is different than the nonlinear function of the average.

This passage suggests that country wide surface temperature is a function of the country specific fraction of maize cropland and not weighted by the spatially explicit maize cropping fraction directly corresponding to the temperature cell. Because I feel the latter is a more appropriate analysis we will carry on by calculating the spatially explicit weighted mean surface temperature by maize cropping fraction. Furthermore, the authors performed quadratic transformations of climate data prior to summarizing to the country level so we will perform the zonal extractions twice per climate variable.

In order to calculate the weighted mean zonal extraction the target raster and the weighted raster must be in the same extent and resolution. Because the maize cropping fraction data is a finer resolution than the monthly temperature data we have to raster::aggregate() the cropping fraction data to match the monthly temperature data. Additionally, the climate data has an odd spatial extent (0–360 as opposed to -180 to 180). We must rotate the climate data to a traditional expanse before aggregating the maize cropping fraction to a resolution that matches the monthly climate data. The climate data was rotated with raster::rotate() during pre-processing before the dataset was embedded into the package.

Air Temperature

First we’ll process surface temperature data.

# Check coordinate system, extent, and resolution of rasters
raster::crs(duplicator::maize.frac)
#> CRS arguments:
#> +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
raster::crs(duplicator::monthly.temp)
#> CRS arguments:
#> +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

raster::extent(duplicator::maize.frac)
#> class : Extent 
#> xmin : -180 
#> xmax : 180 
#> ymin : -90 
#> ymax : 90
raster::extent(duplicator::monthly.temp)
#> class : Extent 
#> xmin : -180 
#> xmax : 180 
#> ymin : -90 
#> ymax : 90

raster::res(duplicator::maize.frac)
#> [1] 0.08333333 0.08333333
raster::res(duplicator::monthly.temp)
#> [1] 0.5 0.5

res.factor<-(raster::res(duplicator::monthly.temp)/raster::res(duplicator::maize.frac))[1]
maize.frac.agg<-raster::aggregate(duplicator::maize.frac, fact = res.factor, fun = mean)

The authors did not indicate if they adjusted the maize cropping fraction of each raster cell by the area of each cell. While it’s an often overlooked aspect of raster data preparation, cells projected with latitude and longitude are not uniform in area; they become smaller as you move poleward. Therefore, any weighted measurement should be adjusted for the area of the cell.

maize.frac.adj<- maize.frac.agg*raster::area(maize.frac.agg)
rm(maize.frac.agg)

Although we’ve prepared the monthly temperature and maize cropping fraction data to have congruent extents and resolutions, there are still disparities between the datsets along coastal regions due to aggregation methods. Therefore, before we calculate weighted mean temperature over maize cropping lands, we’ll manually set all maize cropping data that is NA to 0. This will ensure that surface temperature data where there is no known maize cropping information will be ignored.

Now we can calculate the spatially explicit weighted mean monthly temperature where maize is grown using the weightedStack() function.

maize.frac.adj[is.na(maize.frac.adj)]<-0
country.temp.month<-weightedStack(duplicator::monthly.temp,
 maize.frac.adj,
 gadm.study.countries,
 "GID_0",
 var.name = "month.year",
 value.name = "mean.temp")

We need to perform a second summary on the squared temperature and bind them together.

monthly.temp.sq <- duplicator::monthly.temp ** 2

country.temp.month.sq <- duplicator::weightedStack(
 monthly.temp.sq,
 maize.frac.adj,
 gadm.study.countries,
 "GID_0",
 var.name = "month.year",
 value.name = "mean.temp.sq"
)
country.temp.month <- merge(country.temp.month,
 country.temp.month.sq,
 by = c("GID_0", "month.year"))

Surface Precipitation

We’ll repeat the same procedure for the surface precipitation. The air temperature and precipitation data have the same file formats, resolutions, and extents because they are part of the same greater dataset, so the processing steps will be the same. We’ll carry it out in a single chunk.

raster::extent(maize.frac.adj)
raster::extent(duplicator::monthly.precip)

country.precip.month <- weightedStack(
 duplicator::monthly.precip,
 maize.frac.adj,
 gadm.study.countries,
 "GID_0",
 var.name = "month.year",
 value.name = "mean.precip"
)

monthly.precip.sq <- duplicator::monthly.precip ** 2

country.precip.month.sq <-
 duplicator::weightedStack(
 monthly.precip.sq,
 maize.frac.adj,
 gadm.study.countries,
 "GID_0",
 var.name = "month.year",
 value.name = "mean.precip.sq"
 )

country.precip.month <-
 merge(country.precip.month,
 country.precip.month.sq,
 by = c("GID_0", "month.year"))

Now we merge the temperature and precip data into a single data.table and clear out the rasters from our workspace now that we’re finished with them.

country.climate.month<-merge(country.temp.month, country.precip.month, 
 by=c("GID_0", "month.year"))

Subsetting for the Crop Calendar

We’ve generated weighted mean climate data for all countries and all months, but we want to subset this data to reflect our maize cropping country calendars. We can accomplish this by creating an index of all records with a desired country - month for the maize growing season in each country. Then we’ll use that index (keeps) to subset the mean monthly weighted climate data. Lastly, we’ll perform a bit of housekeeping by re-ordering the columns and renaming the country identifier to reflect the actual coding scheme (ISO3).

country.climate.month[,month:=substr(month.year,1,3)][,
 year:=as.numeric(substr(month.year,5,8))][,month.year:=NULL]

# Create an index of rows in the monthly temperature data we need to keep
keeps = sort(country.climate.month[country.crop.months, on=.(GID_0, month), which=TRUE, nomatch=0])

# Subset the monthly temperature data based on the index
country.climate.month<-country.climate.month[keeps]
data.table::setcolorder(country.climate.month, c("GID_0","year","month"))
names(country.climate.month)[1]<-"iso3"

rm(country.crop.calendar, country.crop.months)

Lastly, we need to calculate the baseline annual climate data for each country across the study period, and annual deviations from that mean. You can chain this together in a single data.table string but it’s not very readable. We’ll break it up into a few lines.

# First the annual averages
country.climate.year <-
 country.climate.month[, .(temp.mean = mean(mean.temp),
 temp.sq.mean = mean(mean.temp.sq),
 precip.mean = mean(mean.precip),
 precip.sq.mean = mean(mean.precip.sq)),
 by = .(iso3, year)]

# Now the baselines and anomalies
country.climate.year[
 ,temp.base := mean(temp.mean), by = iso3][
 ,temp.anom := (temp.mean - temp.base)][
 ,temp.sq.base := mean(temp.sq.mean), by = iso3][
 ,temp.sq.anom := (temp.sq.mean - temp.sq.base)][
 ,precip.base := mean(precip.mean), by = iso3][
 ,precip.anom := (precip.mean - precip.base)][
 ,precip.sq.base := mean(precip.sq.mean), by = iso3][
 ,precip.sq.anom := (precip.sq.mean - precip.sq.base)]

rm(country.climate.month)

Now we can prepare the asylum application data.

Monthly Asylum Application Data

The primary response variable for the author’s core model is the log of annual asylum applications from countries who are not a member of the European Union (EU) or Organisation for Economic Co‑operationand Development (OECD). We will need a list of the respective members in order to quickly prepare the asylum applicants. We can easily create a vector of member names with the commonCodes package and the commonCodes::commonCodes dataset.

oecd<-unique(commonCodes::commonCodes[oecd=="1",.(gadm,iso3)])
eu<-unique(commonCodes::commonCodes[eu=="1",.(gadm,iso3)])

The United Nations High Commissioner for Refugees Persons of Concern: Asylum-Seekers (Refugee Status Determination) dataset presents detailed annual records of dyadic asylum seeker data. We can acquire the latest asylum data with untools::getUNapps(). This function will retrieve the dataset directly from the UN API and sum the totals from duplicate records. Let’s download the data.

apps<-untools::getUNapps()
#> Making UNHCR API request for page: 1
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
#> Making UNHCR API request for page: 2
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
#> Making UNHCR API request for page: 3
#> Parsing response...
#> Transposing parsed json list...
#> Complete.
names(apps)
#> [1] "year" "coo_name" "coa_name" "procedure_type" "app_type" "dec_level" 
#> [7] "app_pc" "coo_iso" "coa_iso" "coo" "coa" "coo_id" 
#> [13] "coa_id" "applied"

Our variable of interest is the log of applied. For more information regarding the asylum application status dataset refer to help(untools::getUNapps()). To properly prepare the data we must:

Subset for only records where the country of asylum application (coa, coa_iso) is an EU member.
Subset for only records where the country of origin (coo, coo_iso) is not an EU or OECD member. (We’ll be safe and subset using the prepared list of the author’s study countries I transcribed and embedded into the package).
Subset for only records from 2000–2014 and those with non-zero applicants.
Summarize annual (year) applications (applied) across all all levels of procedure_type, app_type, and dec_level by source country (coo, coo_iso).
Calculate the log of cumulative in-year applications.

apps<-apps[coa_iso %in% eu$iso3]
apps<-apps[!(coo_iso %in% c(eu$iso3, oecd$iso3))]
apps<-apps[coo_iso %in% study.countries$iso3][year %in% seq(2000,2014,1)]
apps<-apps[,.(apps=sum(applied)), by=.(coo, coo_iso, year)][apps>0][,apps:=log(apps)]

Lastly we need to determine the baseline of applications (annual mean) for each source country over the study period, and annual deviations from that baseline.

apps[,apps.base:=mean(apps, na.rm=TRUE), by=coo_iso]
apps[,apps.anom:=(apps-apps.base)]

We can finally merge the asylum application data with the annual weighted temperature data, verify we did not “lose” any records and do a little housekeeping with the column ordering.

model.dat <- merge(
 apps,
 country.climate.year,
 by.x = c("coo_iso", "year"),
 by.y = c("iso3", "year"),
 all.x = TRUE
)
names(model.dat)[c(1, 3)] <- c("iso3", "name")
table(is.na(model.dat))[1]
#> FALSE 
#> 27378
dim(model.dat)[1] * dim(model.dat)[2]
#> [1] 27378

data.table::setcolorder(
 model.dat,
 c(
 "name",
 "iso3",
 "year",
 "apps",
 "apps.anom",
 "apps.base",
 "temp.mean",
 "temp.anom",
 "temp.base",
 "temp.sq.mean",
 "temp.sq.anom",
 "temp.sq.base",
 "precip.mean",
 "precip.anom",
 "precip.base",
 "precip.sq.mean",
 "precip.sq.anom",
 "precip.sq.base"
 )
)

In the next section we’ll carry out some visual exploration of the prepared data:

Part III: Data Visualization

Replicating Missirian & Schenkler (2017): Data Processing