This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.
Reports & Manuscripts
Several of the most popular packages readily make use of vignettes to provide long form documentation and demonstrations of package functionality. Vignettes provide greater context for intended use of package functions beyond what’s available in the help-files. Some of my favorites include the
data.table packages. While these are the most common uses for package vignettes, they may also be used for research workflows and creating professional manuscripts. It may be helpful to develop vignettes that narrate individual components of your research workflow. These vignettes weave written narratives with code that documents any difficulties and idiosyncrasies of the data, functions, and packages used for the workflow. I typically create vignettes documenting:
- Data acquisition and initial processing. This includes difficulties dealing with automated downloads or interfacing with APIs, establishing naming conventions, identifying coding schemes, ensuring numeric categorical variables are properly processed, and producing exploratory figures to visualize distributions, rare values, and correlation structures.
- More intensive data processing and preparation methods such as imputation procedures, spatial data manipulations and summaries, and verifying the integrity of complicated merges.
- Trial runs for statistical modeling and machine learning that will form the bulk of the final analysis. This may include developing models for a truncated set of data, documenting the impacts of function settings on results, and light variable selection exercises.
- Exploring post-hoc analysis and visualizations expected to be part of a final manuscript.
These vignettes are far more beneficial than raw scripts filled with comments, minimal context, and no ready results or visualizations. It’s also helpful to begin constructing cited introductory or methods passages in these vignettes that may be carried over to the final manuscript. Starting off a workflow vignette with one or two cited paragraphs introducing the employed packages and underlying comparative methodologies lessens the workload of developing the final manuscript or technical report. Additionally, workflow vignettes are easily shared with colleagues, stakeholders, students, and clients. More importantly, they serve as detailed notes when you return to the research some months or years later. After developing the workflow and desired results with individual vignettes, they can be concatenated into a single professional report or manuscript that also exists as a vignette within your package.
Utilizing vignettes to establish and report your research also ensures that your code works. It’s not uncommon, even for experienced researchers, to produce erroneous results or workflows when working with a loose collection of scripts being fed through the IDE console. Your local environment can quickly become cluttered with hundreds of objects, renamed datasets, and testing iterations that don’t represent your intended workflow. Vignettes are executed in their own “clean” environment that only contains the packages, data, and code inside the document. Moreover, vignettes are processed when the package is rebuilt, and if they fail to compile, the rebuild is halted with an identifying error. This acts as a safeguard against your workflow failing. If your vignette uses an embedded dataset you have since altered, a function that’s no longer operating as intended, or any other unforeseen downstream consequence of a code change or typo, the vignette output will change or the rebuild will fail.
Creating a Vignette
R Markdown can produce outputs in several file formats, but we will focus on the two most common: HTML and PDF. The easiest way to create a new vignette is with the usethis package’s use_vignette() function.
If it doesn’t already exist, usethis will create the myresearch/vignettes/ directory, create a new R Markdown vignette file (data-acquisition.Rmd) using the quoted name provided in the function, and make a few additions to the DESCRIPTION file (VignetteBuilder). Here is a quick review from an earlier section describing how vignettes are created with R Markdown, knitr, and Pandoc.
- Vignettes are written in mostly plain text with code inside of “chunks” in an R Markdown (.Rmd) file.
- knitr executes any embedded code in the R Markdown file (.Rmd), “knits” the results together with the text, and produces a markdown file (.md).
- Pandoc converts the markdown (.md) file into the specified output format.
- For PDF outputs, the .Rmd file is converted into a LaTeX file (.tex) and compiled with your local LaTeX distribution. It is highly recommended that you use the tinytex R package to install and manage TinyTeX as your LaTeX distribution.
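As a rough sketch of the steps above, the snippet below creates the vignette and then renders it. It assumes the usethis and rmarkdown packages are installed and that the working directory is the root of a package project. The vignette name is the one used in this section, and the guards simply skip the calls when those assumptions do not hold.

```r
# Vignette name used in this section; usethis derives the file name from it.
vignette_name <- "data-acquisition"
vignette_path <- file.path("vignettes", paste0(vignette_name, ".Rmd"))

# Assumption: a DESCRIPTION file in the working directory means we are at the
# root of a package project, so the side-effecting calls are safe to run.
in_package <- file.exists("DESCRIPTION")

if (in_package && requireNamespace("usethis", quietly = TRUE)) {
  # Creates vignettes/ if needed, writes the .Rmd file, and updates DESCRIPTION.
  usethis::use_vignette(vignette_name)
}

if (file.exists(vignette_path) && requireNamespace("rmarkdown", quietly = TRUE)) {
  # render() runs knitr on the chunks, then hands the resulting markdown to
  # Pandoc, which writes the output format named in the YAML header.
  rmarkdown::render(vignette_path)
}
```

Rendering a single vignette this way is handy for previewing while you write. A full package rebuild processes every vignette in the directory.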
R Markdown (.Rmd) files are composed of two sections: the YAML header and the body. Generic vignettes created by usethis have condensed versions of both sections.
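For orientation, the condensed header that usethis writes looks roughly like the fragment below. This is a sketch from memory of recent usethis versions, so the exact template may differ on your system. The title and index entry are filled in from the quoted name passed to the function.

```yaml
---
title: "data-acquisition"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{data-acquisition}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```

The vignette field is what lets R index the document as a package vignette, so it is usually left in place even when the rest of the header is customized.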
Everything at the beginning of the document between the two sets of --- is the YAML header, everything after is the body, and everything between pairs of ``` is a code chunk parsed by R. The first code chunk sets document-wide defaults for all code chunks. The default chunk options are:
collapse = TRUE,
comment = "#>"
You may establish a variety of document-wide settings for images, figures, and code parsing. Setting include = FALSE ensures the chunk is run but that neither its code nor its output is displayed in the document. For more detailed information on available chunk options, refer to the knitr Chunk Options and Package Options reference guide.
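The document-wide defaults described above could be extended along the following lines. The collapse and comment values are the defaults named in this section. The figure and message settings are illustrative additions rather than anything the generated template sets for you.

```r
# Document-wide chunk defaults. collapse and comment are the usethis defaults;
# the remaining entries are illustrative assumptions, not required settings.
chunk_defaults <- list(
  collapse = TRUE,       # merge source and output into one block
  comment = "#>",        # prefix printed output with #>
  fig.width = 7,         # default figure width in inches
  fig.height = 5,        # default figure height in inches
  fig.align = "center",  # center figures on the page
  message = FALSE,       # hide messages from package loading
  warning = FALSE        # hide warnings in the rendered document
)

# Apply the defaults when knitr is installed (it always is while knitting).
if (requireNamespace("knitr", quietly = TRUE)) {
  do.call(knitr::opts_chunk$set, chunk_defaults)
}
```

In a vignette, this code would sit in the first chunk, declared with include = FALSE so that it runs without appearing in the output.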
The YAML header is used to provide document metadata and specify numerous options for document structure. YAML syntax implements a nesting structure for related options; an example of a typical HTML and PDF YAML is provided following the review of common fields:
General / HTML Document
title: The report or manuscript title.
author: The author(s). You may separate authors with a comma (,). Alternatively, for more complex authors and affiliations, multiple authors may be listed with - using the following syntax.
- John Doe
- Jane Doe
date: The date may be listed manually or updated automatically with inline R code that returns the current date and time (rendered here as 2021-06-15 14:19:34).
output:Establishes the output format. The most commonly used is
theme: Sets the desired theme and styling for the document. Several Bootswatch themes are available by default, but R Markdown offers many additional customization options, some of which will be discussed later.
toc: Establishes the table of contents when set to true.
toc_float: Places the table of contents to the left of the main body.
css: Name of a CSS file with optional custom styling for the HTML document.
dev: Sets the image output format. Vector formats (for example svg or pdf) scale cleanly, while raster formats such as png are smaller.
number_sections: Set to true for numbered sections.
abstract: The document abstract written inside of quotation marks.
bibliography: The BibTeX file used for the document citations.
This is the YAML header for this document.
title: "Open, Reproducible, and Distributable Research With R Packages"
author: "Joshua Brinks"
date: "June 1, 2020"
toc: true
toc_float: true
theme: flatly
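To illustrate the nesting structure mentioned earlier, a typical HTML header built from the fields reviewed above might look like the sketch below. The title, custom.css, and references.bib are hypothetical placeholders. The format-specific options are indented under the output format they belong to, while document metadata stays at the top level.

```yaml
---
title: "Data Acquisition and Initial Processing"   # hypothetical title
author:
  - John Doe
  - Jane Doe
date: "June 1, 2020"
output:
  html_document:
    toc: true
    toc_float: true
    theme: flatly
    number_sections: true
    dev: svg
    css: custom.css            # hypothetical file next to the .Rmd
bibliography: references.bib   # hypothetical file next to the .Rmd
---
```

Options placed at the wrong nesting level are generally ignored without an error, so indentation is the first thing to check when a setting appears to have no effect.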
When creating PDF documents there are additional YAML fields of note. Some of these are familiar LaTeX settings that can be specified by YAML fields.
output:must be set to
fig_caption: This and the other figure fields all determine how document figures are rendered, but they may also be set individually at each code chunk that generates an image or figure.
template: When developing PDF documents, the template is where the user specifies a custom TEX file that contains additional formatting options. You may specify a system path or even a file you place inside of /myresearch/inst/, but to spare you from troubleshooting differing system paths across Windows, Linux, and macOS, place this file in the vignettes directory until you are comfortable with basic package development. This is not a required field; however, if you are an experienced LaTeX user, place most of your standard preamble in this file. In most cases, your personal macros and customizations will work seamlessly with R Markdown and tinytex. One of the few exceptions is tables, which are best implemented using the kableExtra package.1 That being said, I suggest adding components one at a time. Start with the Default Pandoc LaTeX Template and edit it to your liking. This can be taken a step further by creating your own custom variables in the Pandoc template that link back to the YAML header. This allows you to set custom options and styling directly from the YAML header. For more information, review the Pandoc User’s Guide section on Template Syntax.
citation_package:Sets the desired citation back-end to either
csl: The file containing the document citation style.
citecolor: These are all common LaTeX options that can be specified within the YAML header. Alternatively, you may set them in a custom TEX template.
pkgdown: To properly render your PDF document in a pkgdown website you must list additional fields.
as_is: Must be set to true for pkgdown to not override stylings.
extension: Must be set to pdf so pkgdown does not convert the document to HTML when compiled on the package website.
resource_files: Specify files needed to properly render the PDF. This typically includes your bibliography and any custom TEX file(s). Multiple files are nested under the resource_files: field and separated with -.
This is a sample YAML header for a PDF document.
# Universal Fields
title: Modeling Conflict, Climate, and Human Migration
date: February 28, 2019
abstract: "Amazing work and fantastic findings."
toc: true
toc_depth: 3
fig_crop: no
template: josh-latex-pan-temp.latex
citation_package: biblatex
number_sections: true
as_is: true
extension: pdf
geometry: margin=2in
# Custom YAML Pandoc Variables
header-text: "Modeling Conflict, Climate, and Human Migration: Phase I"
# Package indexing
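Because the nesting is what tells R Markdown and pkgdown which options belong to which component, here is one plausible way the PDF fields discussed above fit together. Treat it as a sketch: references.bib and the margin value are hypothetical placeholders, and the index entry simply repeats the title.

```yaml
---
# Universal fields
title: "Modeling Conflict, Climate, and Human Migration"
date: "February 28, 2019"
abstract: "Amazing work and fantastic findings."
bibliography: references.bib     # hypothetical file in vignettes/
geometry: margin=1in             # illustrative value
# Options specific to the PDF output format
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true
    fig_crop: false
    citation_package: biblatex
    template: josh-latex-pan-temp.latex
# Fields read by pkgdown when building the package website
pkgdown:
  as_is: true
  extension: pdf
resource_files:
  - references.bib
  - josh-latex-pan-temp.latex
# Package indexing
vignette: >
  %\VignetteIndexEntry{Modeling Conflict, Climate, and Human Migration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```

Custom Pandoc variables, such as the header-text field in the sample above, stay at the top level so the template can read them.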
It’s beyond the scope of this tutorial to thoroughly review R Markdown syntax, R Markdown code chunk options, all potential YAML fields, and Pandoc Customization. These are some of the best resources for more advanced development:
- RStudio’s R Markdown Basics is a quick guide for getting started with R Markdown documents and formatting.
- Yihui Xie’s R Markdown: The Definitive Guide is, simply put, the definitive guide for all things R Markdown; an excellent resource for beginners and advanced R Markdown customization.2
- The ymlthis package3 provides helper functions for YAML development in R.
- The ymlthis YAML Fieldguide4 is a great quick reference for available fields across all R Markdown output formats.
- The Pandoc User’s Guide is a comprehensive resource for developing custom Pandoc documents to integrate with R Markdown. The Pandoc manual does not provide any R Markdown specific documentation. It’s best used in conjunction with Yihui Xie’s R Markdown: The Definitive Guide, which indicates at which points the user should refer to the Pandoc manual for additional options.
Writing a Vignette
Research vignettes typically fall under two categories: 1) analysis demonstrations and tutorials, and 2) manuscripts and technical reports. When detailing methods and workflows, it’s more common to walk through processing steps individually in code chunks interwoven with written commentary. Every chunk, including loading libraries, should be visible in the final document.
Conversely, when creating technical reports and manuscripts, data processing and modeling are often front-loaded in the document within chunks that are not visible in the final document (echo = FALSE). This makes processing and modeling code easily accessible while under development. For manuscripts and technical reports, chunks within the greater body of text are raw code used to construct tables and figures. As opposed to developing manuscripts with static saved images in Word or a standard LaTeX distribution, these code chunks are dynamically linked to the processing code prefacing the written body. This ensures that your figures and tables are always representative of your data processing and modeling workflow. Moreover, when data processing and visualization code are properly functionalized, the workflow becomes centralized to a handful of documented scripts inside the myresearch/R/ directory. This greatly minimizes mistakes and results in more reliable and distributable research.
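The front-loaded pattern described above can be sketched in base R as follows. The helper functions and the toy data frame are hypothetical stand-ins for code that would normally live in myresearch/R/ and for data shipped with the package. They are not taken from any real analysis.

```r
# Hypothetical helpers that would normally live in myresearch/R/.
prepare_conflict_data <- function(raw) {
  # Drop incomplete rows and derive an event rate per 1,000 people.
  clean <- raw[stats::complete.cases(raw), ]
  clean$rate <- 1000 * clean$events / clean$population
  clean
}

summarise_by_region <- function(clean) {
  # Mean rate per region, returned as a small table-ready data frame.
  stats::aggregate(rate ~ region, data = clean, FUN = mean)
}

# Front-loaded chunk (hidden in a manuscript): build every object once.
# Toy data for illustration only.
raw <- data.frame(
  region = c("A", "A", "B", "B"),
  events = c(2, 4, NA, 3),
  population = c(1000, 2000, 1500, 3000)
)
clean <- prepare_conflict_data(raw)
region_summary <- summarise_by_region(clean)

# Later chunk in the written body: the table is rebuilt from the same
# objects, so it always reflects the current processing code.
print(region_summary)
```

If prepare_conflict_data() changes, every table and figure downstream changes with it on the next rebuild, which is the dynamic link the paragraph above describes.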