This vignette is an excerpt from the DANTE Project’s beta release of Open, Reproducible, and Distributable Research with R Packages. To view the entire current release, please visit the bookdown site. If you would like to contribute to this bookdown project, please visit the project GitLab repository.
The past decade has given rise to a growing call for open, replicable, reproducible, and distributable science. This includes access to raw data, data processing and analytic code, appropriate applications of quantitative tools, and the promotion of replication research. Despite increasing calls for transparency in science, little progress has been made toward open science; this is especially true in the geosciences, political sciences, environmental sciences, and human geography.
The call for open science faces little active resistance; however, it requires significant structural change to slow-moving and entrenched norms. Open science demands a greater time investment, and there is currently little to no incentive structure for researchers who practice it. Scientific advancement is a delicate balance between the desire for novelty, motivation for replication, and funding sources (Allen and Mehler 2019). Even changing the current incentive structure of scientists may not fully address the competing interests of funding sources or other stakeholders (Smaldino and McElreath 2016). In fact, powerful disincentives discourage researchers from identifying mistakes and misconduct (Laitin and Reich 2017). Proper documentation, archiving, and accreditation throughout the scientific process are all intensive time-sinks (Allen and Mehler 2019). When combined with a lack of focus on quantitative requirements and ethical practices (Laitin and Reich 2017), this can create a self-reinforcing cycle of results-driven training, hiring, and promotion that leads to the Natural Selection of Bad Science (Smaldino and McElreath 2016).
Numerous peer-reviewed papers relay the core principles and debate the merits of open science, but these articles are compartmentalized and rarely summarize the collective components of the open science movement. Moreover, there is no shortage of articles proselytizing the open science movement, yet guides demonstrating detailed and practical implementations of open science are nearly non-existent. The purpose of this book is to 1) briefly summarize the current collective elements of the greater open science movement, 2) describe the benefits of implementing open science workflows with the R programming language and R packages, and 3) demonstrate a detailed walk-through of current best practices for using R packages for open, replicable, reproducible, and distributable research.
Defining Open Science
Open science represents full transparency in design, methods, code, analysis, funding, and data sources. Although there’s a general understanding of the concepts of open science, the terms are not clearly defined across sub-disciplines (Patil, Peng, and Leek 2016). In practice, open science is an amalgamation of a few core, yet overlapping issues: 1) replication and reproducibility, 2) FAIR Data practices, and 3) the appropriate use of p-values and the avoidance of p-hacking. Open science encompasses all these ideals. Providing all relevant code and raw data used for the reported findings ensures that replication and reproducibility are possible. Providing all relevant code for statistical computing is of great import, because even open and reproducible science may suffer from poor design, data, pre-processing, and statistical analysis (J. T. Leek and Peng 2015; Brown, Kaiser, and Allison 2018).
Replication & Reproducibility
Like the term open science, replication and reproducibility are regularly conflated and not well defined (Bollen et al. 2015). I am not here to quibble over semantics; however, it is widely accepted that reproducibility refers to the ability of another researcher to reproduce the originally reported results using the same data and code, while replication refers to the ability to detect effects equivalent to those of the original research using new data and the same or similar analytical software (Konkol, Kray, and Pfeiffer 2019). Despite being a core component of the scientific process, replication is almost entirely overlooked as a valuable contribution to the peer-reviewed scientific literature. Both positive and negative replication efforts provide immense value. A positive replication can rapidly solidify understanding of a new theory. McElreath and Smaldino (2015) argue that even false replication findings are more valuable than positive replications. While all replications are valuable, they must be carried out with intent and focus. This includes explicitly setting out to replicate, closely following the original methods, employing high statistical power, being transparent and open about the replication attempt, and critically evaluating the replication results (Brandt et al. 2014).
The difficulty of reproduction and replication varies across sub-disciplines. Psychology and field biology rely on data observed directly from humans or other sentient beings, which makes both reproduction and replication difficult. In response to these hurdles, some fields have adopted close replication in order to continue to support the open science movement while respecting individual rights and the limitations of directly observed field data. Conversely, geographical, climatological, and political science regularly draw on a large selection of versioned, cited, and centrally hosted data sets describing global climate, national economics, human censuses, human migration, municipal infrastructure, and political and ethnic conflict, all of which are particularly conducive to replication and reproduction studies.
One of the most important considerations when undertaking a replication effort, and one of the primary sources of pushback against the entire open science movement, is that neither the choice to replicate nor a failure to replicate should be perceived as an affront to the original work that implicitly suggests incompetence or fraud; both are a healthy part of scientific discourse (Ebersole, Axt, and Nosek 2016). Many of the current concerns about reproducibility overlook this dynamic: the iterative nature of scientific debate and theoretical formulation. In many instances, failures to replicate do not indicate poor methodology on the part of the original researchers, but rather a failure to identify assumptions, software settings, or other experimental conditions that the researchers felt were inconsequential (Redish et al. 2018).
FAIR Data
Reproducing research results is impossible without proper management and stewardship of the original data for the experiment. Sadly, those who develop data sets rarely receive credit. While several data sets are widely known and employed across the social, biological, and atmospheric sciences, many data sets developed for specific research purposes are lost to the void or are not documented well enough to ensure reusability (Stall et al. 2019). In response to the growing need for appropriate data accreditation and reusability in the scientific community, Wilkinson et al. (2016) presented the FAIR Data principles: findability (F), accessibility (A), interoperability (I), and reusability (R). The overarching ideal behind the FAIR Data movement is the concept of data and analytical stewardship, not ownership.
Findable data are assigned globally unique identifiers, are described with rich metadata that are clearly and explicitly linked to the data set identifier, and are indexed or registered in a searchable resource. Accessible data are retrievable by their identifier using a standardized (potentially authenticated) protocol that is open and universally available, and are accompanied by metadata that remain available even when the data set has been deprecated. Interoperable data use a formal, shared, and broadly applicable language for metadata, and vocabularies that follow FAIR principles. Lastly, reusable data are adequately described with accurate attributes, are released with a clear usage license, and meet relevant community standards.
When properly adhered to, these guiding principles greatly strengthen the open science movement and, with it, knowledge, learning, and discovery. Current examples of centralized repositories adhering to the FAIR principles are Harvard Dataverse (“Harvard Dataverse” n.d.), FAIRDOM (“FAIRDOM” n.d.), ISA (“ISA Dashboard” n.d.), and Open PHACTS (“Open Pharmacological Space” n.d.). In addition to these larger repositories, several domain-specific centralized data catalogs that follow FAIR principles are in operation. These include the Socioeconomic Data and Applications Center (SEDAC) (“Socioeconomic Data and Applications Center | SEDAC” n.d.), Hydroshare (“Find, Analyze and Share Water Data | CUAHSI HydroShare” n.d.), The Arctic Data Center (“NSF Arctic Data Center The Primary Data and Software Repository for NSF Arctic Research” n.d.), and the Marine Geoscience Data System (MGDS) (“IEDA: Marine Geoscience Data System” n.d.).
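As a rough illustration of how these properties translate into something checkable, the Python sketch below tests a metadata record for a handful of fields that loosely correspond to the four principles. The field names, the example record, and the placeholder identifier are all invented for illustration; they are not drawn from any FAIR specification or metadata standard, and a real repository would enforce far richer requirements.

```python
# Hypothetical field names, each mapped to the FAIR property it loosely represents.
REQUIRED_FIELDS = {
    "identifier": "findable: globally unique identifier",
    "description": "findable: rich descriptive metadata",
    "access_protocol": "accessible: standardized retrieval protocol",
    "metadata_vocabulary": "interoperable: shared metadata vocabulary",
    "license": "reusable: clear usage license",
}


def missing_fair_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Invented example record; the identifier is a placeholder, not a real DOI.
example_record = {
    "identifier": "doi:10.0000/example",
    "description": "Hypothetical provincial drought index",
    "access_protocol": "https",
    "metadata_vocabulary": "",
    "license": "CC-BY-4.0",
}

for field in missing_fair_fields(example_record):
    print(f"missing {field} ({REQUIRED_FIELDS[field]})")
```

Under these assumptions the example record is flagged only for its blank metadata vocabulary. A check this shallow says nothing about the quality of the metadata, only whether it is present.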
Traditional Applied Statistics and P-Values
The reevaluation of traditional applied statistics and the (mis)use of p-values is a closely related movement playing out alongside efforts to promote open science. This debate consists of three primary components: 1) p-hacking, 2) the true meaning and value of a p-value, and 3) the reliance on traditional null hypothesis testing. P-hacking is the practice by which researchers run numerous statistical analyses, most of which do not reject the null hypothesis, and selectively report the sole test that did exhibit a statistically significant relationship. This is of great concern, because it biases reported results, which ultimately leads to flawed conclusions and flawed evidence-based decision making (Head et al. 2015). Hypotheses should be developed prior to experimental design and analysis, and results should be reported for the initial hypothesis of interest, significant or not. In reality, this practice is rarely observed. To counteract this effect, some researchers (primarily in psychology) are pushing for experimental pre-registration (Allen and Mehler 2019; Brandt et al. 2014). Pre-registration mandates that the experimental design and hypotheses of interest are “registered” prior to the experiment and reported upon when it is complete. This substantially mitigates p-hacking and selection bias in experimental design and analysis, but it is a significant and costly time-sink that currently has few incentives in science.
Selection bias, when combined with a general misunderstanding of the implications of a significant p-value, has led several theoretical statisticians to suspect that most research findings are false (Rosenthal 1979; Ioannidis 2005). The commonly accepted conceptual value of a reported p-value lower than a typical alpha of 0.05 depends on several conditions: 1) the prior probability that the tested relationship is true (before the experiment), 2) the statistical power of the experimental design, 3) the magnitude of the effect being tested, 4) the selection of an appropriate analytical approach, and 5) flawless execution of the experiment and analysis. Given these conditions, a common alpha of 0.05 does not even come close to accurately reporting the potential upper bound of a Type I Error (Pashler and Harris 2012). The positive predictive value of a significant statistical test at an alpha of 0.05, with 1) a prior probability of an existing relationship of 0.50, 2) an experimental power of 0.80, 3) a modest estimated effect, and 4) appropriate and perfectly executed analytical methodologies, is 0.85. Given these conditions, some statisticians estimate that positive predictive values above 0.50 are rare; most are 0.0 (Ioannidis 2005).
These facts shed light on some of the limitations of traditional null hypothesis testing and frequentist statistics in modern science. These analytical methods were born out of a different era in science. Consider the differences between a typical experiment of the mid-to-late 20th century and one of 2020. In the former, the researcher directly measures the response in maize growth to a known, measured, and applied fertilizer regime, across 50 plots, in 10 randomized complete blocks. The a priori probability of a relationship is very high, the magnitude of the effect on growth is high, the experimental power is very high, and the experimental treatments and response are known and measured directly. Compare this to the latter, an analysis of national internal migration in response to conflict and drought. In this scenario, internal population movement is inferred from time-lapsed national census surveys. Drought is estimated at the provincial level from a network of weather stations spread throughout the country. The mean values are aggregated across the network, and drought is estimated relative to some reference period using one of several evaporative stress models. Conflict data are supplied by a global conflict news crawler that automatically harvests and aggregates data from news articles. The opportunities for bias, autocorrelation, and compounded error are limitless in the modern experiment. This is not to say modern findings are not valid or noteworthy, but it’s important to understand the era and the experiments from which traditional tools of statistical inference were developed. These concerns are why many contemporary researchers favor completely abandoning null hypothesis tests, p-values, and traditional frequentist techniques in favor of flexible methodologies more suited to modern science (Gelman 2013; J. Leek et al. 2017).
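The cost of selective reporting can be made concrete with a little arithmetic. The Python sketch below assumes that every null hypothesis is true and that the tests are independent; both are simplifying assumptions of mine rather than conditions stated above, and the function name is illustrative.

```python
def familywise_false_positive_rate(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests is
    'significant' at level alpha when every null hypothesis is true."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    # Chance that no test is significant is (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_false_positive_rate(k):.3f}")
# Prints 0.050, 0.226, 0.401, and 0.642 respectively.
```

Under these assumptions, a researcher who tries 20 analyses and reports only the significant one has roughly a 64% chance of having something to report even when no real relationship exists.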
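The way a significant result’s credibility depends on prior probability, power, and alpha can be checked numerically. The Python sketch below uses the positive predictive value formula given in Ioannidis (2005), expressed in terms of the pre-study odds R, the Type II error rate beta, alpha, and a bias term u. The function and argument names are my own, chosen for illustration. With the inputs quoted above and no bias, the formula returns roughly 0.94; the 0.85 figure appears to correspond to the same scenario with a bias term of u = 0.10.

```python
def positive_predictive_value(prior, power, alpha=0.05, bias=0.0):
    """Probability that a 'significant' finding reflects a true relationship.

    prior: probability that the tested relationship is true before the study
    power: 1 - beta, the chance of detecting a true relationship
    alpha: Type I error rate of the test
    bias:  the term u in Ioannidis (2005); 0.0 means no bias
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)  # pre-study odds R
    beta = 1.0 - power            # Type II error rate
    numerator = (1.0 - beta) * odds + bias * beta * odds
    denominator = (odds + alpha - beta * odds
                   + bias - bias * alpha + bias * beta * odds)
    return numerator / denominator


print(f"{positive_predictive_value(0.5, 0.8):.2f}")            # 0.94
print(f"{positive_predictive_value(0.5, 0.8, bias=0.1):.2f}")  # 0.85
print(f"{positive_predictive_value(0.1, 0.2):.2f}")            # 0.31
```

The last line is a hypothetical low-prior, low-power study (prior of 0.10, power of 0.20), included only to show how quickly the positive predictive value falls below 0.50 once the favorable assumptions are relaxed.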
The individual components of open science are advancing along parallel tracks, mostly within specific sub-disciplines. Often, these efforts are tied to editorial mandates for manuscript submissions in popular peer-reviewed journals. The field of psychology is driving the majority of the debate surrounding replication, reproduction, pre-registration, and p-hacking. PLOS ONE was one of the first major journals to implement open-access guidelines in 2014. It was followed by guidelines formulated by Springer Nature, Science, and Elsevier. Notwithstanding these positive steps, guidelines are just that: guidelines. The aforementioned guidelines are equal in neither their content nor their enforcement.
Publisher-implemented mandates can quickly promote the ideals of open science and FAIR data. A recent review of the impacts of PLOS ONE’s open-science guidelines found a positive impact on most open science principles. These include significant increases in the inclusion and use of confidence intervals for point estimates and in transparency about data use and exclusion (Federer et al. 2018). These are positive signs for the movement at large; however, this study found no change in the reliance on null hypothesis testing, and 25% of respondents sought out less stringent publications when faced with open data and open code guidelines. Giofrè et al. (2017) similarly found that submissions to major psychological journals employed better FAIR data practices and reporting of confidence intervals, but still exhibited an overreliance on traditional null hypothesis testing.
While major psychology journals are promoting efforts for replication, pre-registration, and appropriate use of p-values, these movements are less prominent in the political, geographical, biological, and earth sciences (Iqbal et al. 2016). Key (2016) reviewed submissions to major political science journals from 2014 to 2016 and found that only 58% of published articles provided code for their reported findings. Replication in the geosciences and political sciences may be inhibited by big data, extensive pre-processing, and high-level computing requirements (LeVeque, Mitchell, and Stodden 2012). Simply put, many reviewers and editors lack the training and time for a proper evaluation of technically demanding research findings (J. T. Leek and Peng 2015). Although there is less of a push for replication and against p-hacking in the geosciences, they are at the forefront of the FAIR Data movement (Wilkinson et al. 2016). This is not surprising, because the political sciences and geosciences rely on data sets such as global atmospheric and climatological data, human migration, and national economic and social statistics. These data sets are typically large, versioned, and require extensive pre-processing and curation before release. Geoscience data are highly conducive to accessible, interoperable, and reusable data practices. This is in stark contrast to the nature of the data used in psychology, which involve identifiable human subjects and survey respondents. It’s understandable that the focus in psychology is centered on replication and appropriate use of p-values, while the geosciences are promoting FAIR Data.
In spite of the greater movement for open science, there are still detractors. Those pushing back against the call for more replication argue that in some disciplines, such as molecular biology, experiments are extremely time consuming and require extensive, specialized in-lab techniques that are not easily relayed to other labs. Consequently, replications done in error may deter future funding and delay scientific discourse (Bissell 2013). I would argue that if methodologies vital to experimental success cannot be replicated by experienced researchers from the manuscript alone, they should be adequately described, at minimum, in supplementary materials. Other opponents suggest that science is self-correcting over the long term (Pashler and Harris 2012), that conceptual replications are more valuable than direct replications because they test broader concepts (Carpenter 2012), or that while p-hacking may exist, it is in no way homogeneous across research fields, nor is it substantially altering the theoretical landscape (Fanelli 2018).
However, given the absence of any meaningful replication framework, or even a general understanding of the benefits of replication in most sub-disciplines, it’s exceedingly difficult to even formulate a baseline for the potential problem (McElreath and Smaldino 2015). The limited meta-studies and theoretical investigations surrounding the issues are not encouraging. As discussed earlier, several theoretical statisticians hypothesize that most reported results are false (Ioannidis 2005; Pashler and Harris 2012), analysis of statistical power in published research suggests it has not increased over time (Smaldino and McElreath 2016), and reviews of biomedical (Iqbal et al. 2016), political (Key 2016), geoscience (Laitin and Reich 2017; Konkol, Kray, and Pfeiffer 2019), and psychological (Federer et al. 2018) published results reveal non-existent efforts at replication, poor data and code availability, and a troubling reliance on traditional null hypothesis testing. Meaningful assessments of the true state and impacts of p-hacking, FAIR Data, and the absence of replication can only be made after frameworks for these movements are firmly in place.
A System for Learning
One of the most overlooked benefits of open science is its potential as a teaching device. Becoming proficient at data processing and applied statistical modeling is a major hurdle for those with quantitative aspirations. Completing an introductory course in statistics followed by, at best, a multivariate course, a time series course, and possibly a seminar of interest leaves students woefully underprepared for the type of real-world problems scientists are attempting to solve. Moreover, it produces scientists with enough skill to use statistical and technical software, but not enough knowledge to identify areas of concern and inappropriate use (J. T. Leek and Peng 2015).
Textbooks and popular online tutorials are littered with tailor-made examples of housing markets, students nested in classrooms nested in schools nested in districts, patients with physicians in hospitals, and so on. They are great devices for relaying statistical concepts, but they rarely exemplify the types of data sets routinely thrust upon researchers in the real world. Learning advanced statistical modeling and machine learning techniques through piecemeal internet searches and forum posts is a highly time-inefficient endeavor if you lack an available mentor, and mentors are rare even in large graduate departments. The ability to find a research paper employing a quantitative technique of interest, in your field of interest, is an invaluable resource for students at any stage of their career.