Embedding and Documenting Research Data

Joshua Brinks, ISciences, LLC

Keywords:

open science, R package development

Contents:

This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.

Embedding Data

Data may be embedded inside R packages in two ways: 1) inside the myresearch/data/ folder as a compressed .rda file, or 2) inside the myresearch/inst/extdata/ (installed external data) folder in any desired file format. Current best practice is, when possible, to embed data as an .rda file inside the myresearch/data/ directory. However, certain file and object types become corrupted when stored as an .rda: they can be embedded without errors or warnings, but they are not usable when brought back into your programming environment. It’s important to check that the data still works after storing it as an .rda file. Although any file may be placed in the myresearch/inst/extdata/ folder, this should be done only after careful consideration. Any file placed in the /inst/ directory will be installed directly onto the system of any user who installs the package. Consequently, one should be cognizant of the size of the files, their contents (sensitive information), and any potential malicious behavior or downstream system effects of placing a file in the /inst/ directory.
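The functionality check mentioned above can be sketched as a save and reload round trip in base R. This is a minimal sketch, not a complete test: country.codes is a hypothetical object used only for illustration, and the temporary file stands in for the .rda that would live in myresearch/data/.

```r
# Hypothetical example object; substitute the data you plan to embed
country.codes <- c("AFG", "IRQ", "SYR")

# Save to a temporary .rda file with bzip2 compression
rda.path <- tempfile(fileext = ".rda")
save(country.codes, file = rda.path, compress = "bzip2")

# Load into a separate environment so the original object cannot mask a bad restore
check.env <- new.env()
load(rda.path, envir = check.env)

# TRUE when the restored object matches the original exactly
round.trip.ok <- identical(country.codes, check.env$country.codes)
round.trip.ok
```

For objects that hold external pointers or references to files on disk, identical() alone may not reveal a problem, so it is also worth running a typical operation (a plot, a summary, a model prediction) on the restored object.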

Storing .rda Data

In adherence to FAIR Data practices, embedded package data should include any script used for acquisition and pre-processing, proper documentation of sources and citations, and any special notes not covered by the processing scripts. The current best practice for all package development is to store scripts used to acquire and pre-process data in the myresearch/raw-data/ folder. I prefer to use automated internet retrieval for embedded data when possible because this reduces the total size of your package. If this is not possible, raw data should be stored in the myresearch/inst/extdata/ folder and accessed directly by the pre-processing script in the myresearch/raw-data/ folder. It’s recommended that you use the usethis package to embed data. When you do so, it will automatically create the myresearch/data/ folder where embedded .rda files are stored.

usethis::use_data(dataset, overwrite = TRUE, compress = "bzip2")

dataset is the name of the R data object you are saving (make sure it’s properly named before embedding), overwrite indicates whether to overwrite existing data with the same name, and compress determines the type of compression used to store the .rda. The compress argument can normally be left at its default, but you should test the other methods with larger datasets to ensure the file is as small as possible.
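One way to test the compression methods is to write the same object once per method with base R's save() and compare the resulting file sizes. The sketch below assumes a hypothetical stand-in data frame named dataset and writes to temporary files rather than to myresearch/data/; the three method names are the compression types save() accepts.

```r
# Hypothetical stand-in dataset; replace with the object you intend to embed
set.seed(1)
dataset <- data.frame(id = 1:5000, value = rnorm(5000))

# Compression types accepted by save()
methods <- c("gzip", "bzip2", "xz")

# Write the object once per method and record the file size in bytes
sizes <- vapply(methods, function(m) {
  path <- tempfile(fileext = ".rda")
  save(dataset, file = path, compress = m)
  file.size(path)
}, numeric(1))

# Smallest file first
sort(sizes)
```

Which method wins depends on the data, so the comparison is worth repeating for each large dataset rather than assuming one method is always best.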

raw-data/ Scripts

Below is an example of a raw-data/ script from the duplicator package used to embed gridded surface precipitation from the University of Delaware (Willmott and Matsuura 1995).

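# Download the NetCDF file only if it is not already present
if(!file.exists("raw-data/precip.mon.total.v501.nc")){
  download.file("ftp://ftp.cdc.noaa.gov/Datasets/udel.airt.precip/precip.mon.total.v501.nc",
                destfile = "raw-data/precip.mon.total.v501.nc", mode = "wb")
}
# Read the file whether it was just downloaded or already existed
monthly.precip<-raster::stack("raw-data/precip.mon.total.v501.nc")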

start.date<-as.Date("1900/1/1")
end.date<-as.Date("2017/12/1")
temp.stack.labels<-format(seq(start.date, end.date, by="month"), format="%b.%Y")
names(monthly.precip)<-temp.stack.labels

# Create a sequence encompassing the study dates of interest
study.years<-format(seq(as.Date("2000/01/01"), as.Date("2014/12/1"), by = "month"), format="%b.%Y")
monthly.precip<-raster::subset(monthly.precip, study.years)

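# Multiply by 1 so the values are held in the object itself rather than
# read from the downloaded file, which is deleted at the end of the script
monthly.precip<-monthly.precip*1
# Shift longitudes from 0-360 to -180-180
monthly.precip<-raster::rotate(monthly.precip)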

usethis::use_data(monthly.precip, overwrite = TRUE, compress = 'bzip2')

file.remove("raw-data/precip.mon.total.v501.nc")

This script checks whether the NetCDF file already exists in the raw-data/ folder. If not, it downloads the file from the NOAA FTP server. It then renames each layer for the month and year it represents and subsets the layers to the study period. Lastly, it embeds the package data with the usethis::use_data() function and deletes the downloaded file. This is a simple acquisition and re-naming script, but it could be more complicated or contain functions. Stored data can also be a simple vector of country codes, a saved R model object, a table of subject names, or any other set of data that would improve your research by being centralized and documented. For intensive pre-processing scripts, it may be more appropriate to create a vignette detailing the pre-processing steps with greater context and commentary.
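The layer renaming and subsetting in the script rely on the date labels lining up, and that part can be checked in base R without downloading anything. The sketch below reuses the dates from the script; stack.labels is an illustrative name, and the check assumes, as the script does, that the file holds one layer per month from January 1900 to December 2017.

```r
# Labels for every monthly layer in the full record, as in the script
start.date <- as.Date("1900/1/1")
end.date <- as.Date("2017/12/1")
stack.labels <- format(seq(start.date, end.date, by = "month"), format = "%b.%Y")

# Labels for the study period
study.years <- format(seq(as.Date("2000/01/01"), as.Date("2014/12/1"), by = "month"),
                      format = "%b.%Y")

length(stack.labels)                 # 118 years x 12 months = 1416
length(study.years)                  # 15 years x 12 months = 180
all(study.years %in% stack.labels)   # every study label must match a layer name
```

Because "%b" produces locale-dependent month abbreviations, the labels should be generated in the same session and locale that later subsets by them.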

The myresearch/R/data.R File

Reference manual documentation for datasets embedded as .rda files is stored in the data.R file inside the myresearch/R/ directory. Documentation for all embedded data is stored in this single file using roxygen syntax similar to function documentation. Here is the entry for the monthly surface precipitation data.

#'Monthly Total Precipitation
#'
#'Monthly total precipitation from 2000-2014.
#'
#'@format An object of class \code{rasterStack}.
#'\describe{
#'  \item{resolution}{half degree}
#'  \item{extent}{-180, 180, -90, 90}
#'  \item{coordinates}{latitude / longitude}
#'  \item{ellipsoid}{WGS84}
#'  \item{units}{cm / month} }
#'@source
#'\url{https://www.esrl.noaa.gov/psd/data/gridded/data.UDel_AirT_Precip.html}
#'
#'Willmott, C. J., & Matsuura, K. (2001). Terrestrial air temperature and
#'precipitation: Monthly and annual time series (1950–1999) Version 1.02. Center
#'for Climatic Research, University of Delaware, Newark.
"monthly.precip"

Each entry starts with the title, followed by a slightly more detailed description, and then the @format tag where the structure is documented. The object class is documented (in this instance it is a rasterStack), followed by a list of spatial metadata inside the \describe{} environment. If this data were tabular it would be appropriate to list column names and descriptions here. The @source tag is used to list the url for the web host and the official peer reviewed citation for the dataset. Lastly, the name of the dataset object used in the usethis::use_data() command is listed inside quotation marks and without the roxygen #' comment tag. When rendered to an .html file, the entry becomes a help page for the dataset showing the title, description, format, and source.

To add another dataset, leave a blank line with no #' and start a new entry. Here is an example.

#'Monthly Total Precipitation
#'
#'Monthly total precipitation from 2000-2014.
#'
#'@format An object of class \code{rasterStack}.
#'\describe{
#'  \item{resolution}{half degree}
#'  \item{extent}{-180, 180, -90, 90}
#'  \item{coordinates}{latitude / longitude}
#'  \item{ellipsoid}{WGS84}
#'  \item{units}{cm / month} }
#'@source
#'\url{https://www.esrl.noaa.gov/psd/data/gridded/data.UDel_AirT_Precip.html}
#'
#'Willmott, C. J., & Matsuura, K. (2001). Terrestrial air temperature and
#'precipitation: Monthly and annual time series (1950–1999) Version 1.02. Center
#'for Climatic Research, University of Delaware, Newark.
"monthly.precip"

#'Missirian & Schlenker Prepared Model Data
#'
#'Asylum application and monthly temperature data processed for analysis.
#'
#'@format An object of class \code{data.table, data.frame} with 1545 rows and 18
#'  variables. \describe{
#'  \item{name}{Country of origin name.}
#'  \item{iso3}{Country of origin ISO3 character code.}
#'  \item{year}{Year of observation.}
#'  \item{apps}{The log of cumulative number of asylum seekers from country of
#'  observation to EU member nations.}
#'  \item{apps.anom}{In-year deviation from baseline number of applications.}
#'  \item{apps.base}{Mean log of asylum seekers from country during 2000-2014.}
#'  \item{temp.mean}{In-year mean temperature (celsius).}
#'  \item{temp.anom}{In-year deviation from study period baseline temperature.}
#'  \item{temp.base}{Mean annual temperature for source country during
#'  2000-2014.}
#'  \item{temp.sq.mean}{In-year mean squared temperature (celsius).}
#'  \item{temp.sq.anom}{In-year deviation from study period baseline squared
#'  temperature.}
#'  \item{temp.sq.base}{Mean annual squared temperature for source country during
#'  2000-2014.}
#'  \item{precip.mean}{In-year mean total surface precipitation (cm).}
#'  \item{precip.anom}{In-year deviation from study period baseline total
#'  surface precipitation.}
#'  \item{precip.base}{Mean annual total surface precipitation for source
#'  country during 2000-2014.}
#'  \item{precip.sq.mean}{In-year mean squared total surface precipitation
#'  (cm).}
#'  \item{precip.sq.anom}{In-year deviation from study period baseline squared
#'  total surface precipitation.}
#'  \item{precip.sq.base}{Mean annual squared total surface precipitation for
#'  source country during 2000-2014.}
#'  }
#'@details This dataset is embedded for use in vignettes replicating Missirian
#'  and Schlenker's 2017 study. It permits use of the prepared data in
#'  visualization and modeling vignettes that do not include data preparation
#'  code.
#'
"model.dat"

This dataset ("model.dat") is a pre-processed data.table used for a statistical model. The \describe environment is used to list the columns, and there is a @details section at the bottom providing more context for the user. Each dataset, like each function, is rendered to its own help file. Embedded .rda datasets are accessed like functions and can be used directly inside a piece of code or stored as an object first.

raster::extent(myresearch::monthly.precip)

# Or

ud.precip <- myresearch::monthly.precip
raster::extent(ud.precip)
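To see which datasets a package exposes before accessing one, utils::data() can list them by package. In this sketch the base "datasets" package is a stand-in so the code runs anywhere; replacing it with "myresearch" assumes that package is installed.

```r
# "datasets" is a stand-in here; replace it with "myresearch" once installed
available <- utils::data(package = "datasets")$results

# One row per dataset: its name (Item) and its documentation title (Title)
head(available[, c("Item", "Title")])
```

The Title column comes from the title line of each dataset's documentation entry, which is one reason to keep those titles short and descriptive.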

Using /inst/extdata/

The myresearch/inst/extdata/ (installed external data) folder is the second most common place to store data in a package. The good thing about using /inst/extdata/ is that it’s simple: you just drop the data in. The downsides are that there is currently no systematic way to document /inst/extdata/ datasets and that the files are installed directly on the local system of any user who installs the package. Careful consideration should therefore be given to both the size of the files placed anywhere in the /inst/ folder and any potentially harmful side effects of the files in question. One benefit of files placed in /inst/ is that they can be accessed programmatically without specifying local paths. This is done with the system.file() command.

shp<-system.file("extdata", "myshapefile.shp", package = "myresearch")

To use system.file() the user must specify the folder location ("extdata"), the file name ("myshapefile.shp"), and the package containing the data ("myresearch"). It’s important to remember that everything inside the myresearch/inst/ folder is moved up one level into the package root directory when the package is installed. You therefore do not specify "inst/extdata" in the system.file() command, even though that is where the file lives while you are developing your package.
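One behavior worth knowing is that system.file() returns an empty string, rather than an error, when the file is not found. The sketch below demonstrates this with the base stats package as a stand-in, since the hypothetical myresearch package is not assumed to be installed; the file name no-such-file.shp is deliberately made up.

```r
# Stand-in lookup: every installed package has a DESCRIPTION file at its root
found <- system.file("DESCRIPTION", package = "stats")

# A file that does not exist returns an empty string, not an error
not.found <- system.file("extdata", "no-such-file.shp", package = "stats")

nzchar(found)       # TRUE
nzchar(not.found)   # FALSE

# mustWork = TRUE turns a failed lookup into an error instead
strict <- tryCatch(
  system.file("extdata", "no-such-file.shp", package = "stats", mustWork = TRUE),
  error = function(e) "lookup failed"
)
strict
```

Checking nzchar() on the result, or passing mustWork = TRUE, makes a misspelled file name fail at the lookup rather than later when the empty path is passed to a reader function.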

References

Willmott, Cort J., and Kenji Matsuura. 1995. “Smart Interpolation of Annually Averaged Air Temperature in the United States.” Journal of Applied Meteorology 34 (12): 2577–86. https://doi.org/10.1175/1520-0450(1995)034<2577:SIOAAA>2.0.CO;2.
