Introduction

This guide is an entry in a series of proposed vignettes in which we walk through a deep cleaning or exploratory data analysis (EDA) of a widely employed environment-security dataset. For this entry, we will explore the Varieties of Democracy dataset (V-Dem).¹ V-Dem is a massive dataset that aims to provide quantitative, historical assessments of nation-state democracy. V-Dem provides both multidimensional and disaggregated measures of democracy across five primary principles: electoral, liberal, participatory, deliberative, and egalitarian.² The V-Dem team comprises dozens of scientists spread across the globe working with thousands of local experts to quantify local and regional aspects of democracy.

V-Dem is not alone in its efforts to quantify qualitative aspects of nation-state democracy, civil liberties, and elections. Similar datasets include Polity5,³ Freedom House’s Freedom in the World, Countries at the Crossroads, and Freedom of the Press,⁴ and the Institutions and Elections Project.⁵ Although these datasets are similar in many ways, V-Dem stands out with the sheer number of metrics included. V-Dem features over 470 indicators, 82 indices, and 5 high-level metrics. That is an overwhelming amount of data on par with the World Development Indicators.⁶ Let’s get started.

Acquiring the Data

The V-Dem V10 dataset is available from the V-Dem data homepage in preconfigured csv, SPSS, and STATA formats; however, R users can instead use the recommended vdemdata package, which is available on GitHub. Installing the remote package from GitHub requires devtools. For this guide we’ll be using data.table, but all of these steps could be performed with dplyr and the greater tidyverse, or even base R if you’re a sadist. Lastly, to assist with country coding, we’ll be using the states package, which should also be pulled from GitHub to ensure you have the most recent version.

install.packages('devtools')
devtools::install_github("vdeminstitute/vdemdata")
devtools::install_github("andybega/states")

After the packages are installed, load vdemdata, data.table, and states.

library(vdemdata)
library(data.table)
library(states)

The vdemdata package is immature and not available on CRAN, but it provides direct access to the most recently available V-Dem data. The raw data and codebook are stored as embedded package datasets in the vdemdata package.

vdem.raw <- data.table::setDT(vdemdata::vdem)
vdem.codebook <- data.table::setDT(vdemdata::codebook)

Determining Variables of Interest

Using the RStudio data viewer and filter interface with the codebook allows you to quickly search for keywords and variables of interest. Although there are 4108 variables included in vdem.raw, for the purposes of this guide we’ll focus on 2 widely used high-level metrics from vdem: v2x_libdem and v2x_polyarchy. The codebook can be filtered to provide greater context.

metrics<-c('v2x_libdem','v2x_polyarchy')
vdem.codebook[tag %in% metrics, .(name, vartype, tag, question, scale)]

name	vartype	tag	question	scale
Electoral democracy index	D	v2x_polyarchy	To what extent is the ideal of electoral democracy in its fullest sense achieved?	Interval, from low to high (0-1).
Liberal democracy index	D	v2x_libdem	To what extent is the ideal of liberal democracy achieved?	Interval, from low to high (0-1).

The codebook reveals that these are 2 high-level (vartype==D) democracy indices quantifying the extent of electoral (v2x_polyarchy) and liberal (v2x_libdem) democracy. Both metrics are continuous variables bounded between 0 and 1. In addition to our desired indices, we should also subset the raw data for identification metrics such as country names, observation year, coding schemes that assist with harmonizing V-Dem data with other datasets, and indicators for country start and stop dates to manage secessions, civil wars, etc.

id.vars<-c('country_name', 'COWcode','histname' ,'codingstart_contemp', 'codingend_contemp','year')
vars<-c(id.vars, metrics)

Now we can subset the raw data and toss what we don’t need.

vdem<-vdem.raw[, ..vars]
rm(vdem.raw)

Determining Years of Interest

We’ll perform a last bit of pruning for temporal considerations. V-Dem has a large historical record dating back to 1789. This is valuable data, but far more than most practitioners or analysts require. More commonly, analyses will start just before or after key events, e.g., WWII, the Cold War, and the War on Terror. Practically speaking, when preparing historical country-year data, we are most concerned with the headaches brought on by coding nation-state secessions, independence, unifications, etc.

With this in mind, important periods to consider/avoid are: Sudan 2011, Yugoslavia/Kosovo/Serbia/Montenegro 2003-2008, Eritrea 1993, Czech/Slovakia 1993, an even more complicated Yugoslavian dissolution, and Cold War fallout 1989-1991. Sudan is usually an easy check, but Yugoslavia/Kosovo/Serbia/Montenegro are almost always a real pain to manage across multiple datasets and they usually must be included in the analysis. For the purpose of this guide we will subset our data to 1995 onward and investigate any issues associated with Yugoslavia/Kosovo/Serbia/Montenegro.

vdem <- vdem[year>1994]

Country Code Checks

The most important issue to address with country-year datasets is accurate annual country codes. This includes nation-state secessions and independence (Sudan, Yugoslavia), independently listed territories (Hong Kong, Puerto Rico, Guam, French Guiana), and states with limited international recognition (Kosovo, West Bank/Palestine, Taiwan). These issues afflict international datasets in a wide variety of ways. Before you attempt to “fix” these issues, it’s important to consider how they will be addressed in all the datasets required for your analysis. Do not spend copious amounts of time coding changes to Kosovo and the West Bank if they’re completely ignored in your other datasets of concern.

V-Dem contains Correlates of War (CoW; COWcode) country codes. This is a popular coding scheme that makes country-coding an easier task. We’ll start by renaming the variable, because we will have to manipulate it a lot.

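# rename by column name rather than position, so a change in column order cannot rename the wrong variable
setnames(vdem, "COWcode", "cow")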

The states package can serve as a reference to check Correlates of War and Gleditsch and Ward country codes. Both are embedded in the package and available with calls to states::cowstates or states::gwstates. Let’s start by checking if any CoW codes are missing.

unique(vdem[is.na(cow),country_name])

## [1] "Palestine/West Bank" "Palestine/Gaza" "Somaliland" "Hong Kong"

It may seem like the easy way out, but these states are commonly ignored in popular environment-security datasets, and can usually be dropped from analysis. One dataset where they would be included is United Nations refugee and asylum seeker data, in which case, you would have to introduce ISO codes to harmonize them with other United Nations data. This could be done with minimal trouble using the countrycode package, but will likely lead to other issues.

library(countrycode)

vdem[, iso3:=countrycode::countrycode(cow, 
 origin = "cown",
 destination = "iso3c")]

## Warning in countrycode::countrycode(cow, origin = "cown", destination = "iso3c"): Some values were not matched unambiguously: 345, 347, 511

Now we go down the rabbit hole: which codes were not matched unambiguously?

vdem[cow %in% c(345, 347, 511), unique(country_name)]

## [1] "Kosovo" "Serbia" "Zanzibar"

These require hard-coded fixes to their ISO3 values. This is beyond the scope of this vignette, so we will drop the missing cow observations in V-Dem and move on, but I wanted to illustrate the beginning of a country code black hole.

vdem <- vdem[!is.na(cow)][, iso3:=NULL]
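If the ISO3 codes were needed, the hard-coded fixes could be applied with a small lookup table and an update join. The sketch below runs on toy data and is not part of the original workflow: toy and iso_fixes are hypothetical names, XKX is a user-assigned code commonly used for Kosovo rather than an official ISO 3166 assignment, and Zanzibar is left as NA because it has no ISO3 code of its own.

```r
library(data.table)

# Toy stand-in for vdem after the countrycode() call (hypothetical rows)
toy <- data.table(
  country_name = c("Kosovo", "Serbia", "Zanzibar", "Sweden"),
  cow          = c(347, 345, 511, 380),
  iso3         = c(NA, NA, NA, "SWE")
)

# Hand-maintained corrections; XKX is user-assigned, not an official ISO code
iso_fixes <- data.table(cow = c(345, 347), iso3 = c("SRB", "XKX"))

# Update join: overwrite iso3 only for rows whose cow code appears in iso_fixes
toy[iso_fixes, on = "cow", iso3 := i.iso3]
toy
```

Keeping the corrections in their own table makes them easy to review and to reuse across datasets that share the same coding gaps.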

Yugoslavia, Serbia, Montenegro, and Kosovo

Official CoW codes for Yugoslavia, Serbia, Montenegro, and Kosovo are 345, 345, 341, and 347, respectively. CoW maintains the 345 numeric AND YUG character designations for Serbia after the Yugoslavia break. CoW assigns Montenegro 341 starting in 2006 and Kosovo 347 in 2008 (review these changes in states::cowstates).

Check how V-Dem assigns these changes.

dcast(vdem[cow %in% c(345, 341, 347), .(country_name, cow, year)],
 year~cow, value.var = "country_name")

year	341	345	347
1995	NA	Serbia	NA
1996	NA	Serbia	NA
1997	NA	Serbia	NA
1998	Montenegro	Serbia	NA
1999	Montenegro	Serbia	Kosovo
2000	Montenegro	Serbia	Kosovo
2001	Montenegro	Serbia	Kosovo
2002	Montenegro	Serbia	Kosovo
2003	Montenegro	Serbia	Kosovo
2004	Montenegro	Serbia	Kosovo
2005	Montenegro	Serbia	Kosovo
2006	Montenegro	Serbia	Kosovo
2007	Montenegro	Serbia	Kosovo
2008	Montenegro	Serbia	Kosovo
2009	Montenegro	Serbia	Kosovo
2010	Montenegro	Serbia	Kosovo
2011	Montenegro	Serbia	Kosovo
2012	Montenegro	Serbia	Kosovo
2013	Montenegro	Serbia	Kosovo
2014	Montenegro	Serbia	Kosovo
2015	Montenegro	Serbia	Kosovo
2016	Montenegro	Serbia	Kosovo
2017	Montenegro	Serbia	Kosovo
2018	Montenegro	Serbia	Kosovo
2019	Montenegro	Serbia	Kosovo
2020	Montenegro	Serbia	Kosovo

Thankfully, the codes themselves are correct. However, V-Dem maintains independent listings for all three states even for the years when they were unified under various arrangements between 1992 and 2005. The course of action here depends on your intended use and additional datasets. Taking the mean of Serbia and Montenegro (maybe even Kosovo) over this time period is one potential correction. For this guide we will average Serbia and Montenegro. You may want to consider doing the same for Kosovo and Serbia or all 3 states.

# average Serbia (345) and Montenegro (341) within each year, 1995-2005
vdem[cow %in% c(341, 345) & year %in% 1995:2005, (metrics) := lapply(.SD, mean, na.rm = TRUE), by = year, .SDcols = metrics]

The coverage and coding for Kosovo are correct; they can be left as is if your other data of interest recognize the state.
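If the same averaging is wanted for other combinations (Kosovo and Serbia, or all 3 states), the step can be wrapped in a small function. This is only a sketch on made-up values; average_states is a hypothetical helper, not something provided by vdemdata or data.table.

```r
library(data.table)

# Hypothetical helper: within each year, replace the listed states' values
# with the mean across those states (modifies dt by reference)
average_states <- function(dt, codes, years, cols) {
  dt[cow %in% codes & year %in% years,
     (cols) := lapply(.SD, mean, na.rm = TRUE),
     by = year, .SDcols = cols]
  invisible(dt)
}

# Toy data (made-up values)
toy <- data.table(
  cow  = c(341, 345, 347, 341, 345, 347),
  year = c(2004, 2004, 2004, 2006, 2006, 2006),
  v2x_libdem = c(0.25, 0.75, 0.30, 0.50, 0.60, 0.10)
)

# Average Montenegro and Serbia only; Kosovo and the 2006 rows are untouched
average_states(toy, codes = c(341, 345), years = 1995:2005, cols = "v2x_libdem")
toy
```

Passing codes = c(341, 345, 347) would fold Kosovo into the same yearly mean.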

Other Considerations

Sudan (625) and South Sudan (626) split in 2011. Check them in V-Dem.

dcast(vdem[cow %in% c(625,626), .(country_name, cow, year)],year~cow, value.var = "country_name")

year	625	626
1995	Sudan	NA
1996	Sudan	NA
1997	Sudan	NA
1998	Sudan	NA
1999	Sudan	NA
2000	Sudan	NA
2001	Sudan	NA
2002	Sudan	NA
2003	Sudan	NA
2004	Sudan	NA
2005	Sudan	NA
2006	Sudan	NA
2007	Sudan	NA
2008	Sudan	NA
2009	Sudan	NA
2010	Sudan	NA
2011	Sudan	South Sudan
2012	Sudan	South Sudan
2013	Sudan	South Sudan
2014	Sudan	South Sudan
2015	Sudan	South Sudan
2016	Sudan	South Sudan
2017	Sudan	South Sudan
2018	Sudan	South Sudan
2019	Sudan	South Sudan
2020	Sudan	South Sudan

This is correct. Lastly, we should check V-Dem against our CoW reference (states::cowstates) to see if V-Dem is missing any countries.

cowstates<-data.table::setDT(states::cowstates)
missing_in_vdem<-cowstates[end >= as.Date("1995-01-01")][!cowcode %in% vdem$cow]
knitr::kable(missing_in_vdem)

cowcode	cowc	country_name	start	end	microstate
31	BHM	Bahamas	1973-07-10	9999-12-31	FALSE
54	DMA	Dominica	1978-11-03	9999-12-31	TRUE
55	GRN	Grenada	1974-02-07	9999-12-31	TRUE
56	SLU	St. Lucia	1979-02-22	9999-12-31	TRUE
57	SVG	St. Vincent and the Grenadines	1979-10-27	9999-12-31	TRUE
58	AAB	Antigua & Barbuda	1981-11-01	9999-12-31	TRUE
60	SKN	St. Kitts and Nevis	1983-09-19	9999-12-31	TRUE
80	BLZ	Belize	1981-09-21	9999-12-31	FALSE
221	MNC	Monaco	1993-05-28	9999-12-31	TRUE
223	LIE	Liechtenstein	1990-09-18	9999-12-31	TRUE
232	AND	Andorra	1993-07-28	9999-12-31	TRUE
331	SNM	San Marino	1992-03-02	9999-12-31	TRUE
835	BRU	Brunei	1984-01-01	9999-12-31	FALSE
946	KIR	Kiribati	1999-09-14	9999-12-31	TRUE
947	TUV	Tuvalu	2000-09-05	9999-12-31	TRUE
955	TON	Tonga	1999-09-14	9999-12-31	TRUE
970	NAU	Nauru	1999-09-14	9999-12-31	TRUE
983	MSI	Marshall Islands	1991-09-17	9999-12-31	TRUE
986	PAL	Palau	1994-12-15	9999-12-31	TRUE
987	FSM	Federated States of Micronesia	1991-09-17	9999-12-31	TRUE
990	WSM	Samoa	1976-12-15	9999-12-31	TRUE

There is nothing of consequence here; these are mostly microstates that are commonly omitted from environment-security analysis. For simplicity, the remaining microstates included in V-Dem may be dropped unless you are carrying out a specialized analysis.

microstates <- cowstates[microstate==TRUE,unique(cowcode)]
vdem<-vdem[!(cow %in% microstates)]

Finally, we’ll examine V-Dem for duplicate country names to ensure we don’t miss any peculiarities.

dupes<-unique(vdem[,.(country_name, cow)])
# check for duplicate names across codes
table(duplicated(dupes$country_name))

## 
## FALSE 
## 172

Excellent!
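Two related checks may be worth running before any joins: whether a cow-year pair appears more than once, and whether a single code is attached to more than one country name. The sketch below uses toy data with one deliberately duplicated row; toy is a hypothetical stand-in for vdem.

```r
library(data.table)

# Toy country-year panel with one deliberately duplicated row
toy <- data.table(
  country_name = c("Sudan", "Sudan", "South Sudan", "Sudan"),
  cow          = c(625, 625, 626, 625),
  year         = c(2010, 2011, 2011, 2011)
)

# Row index of the first repeated cow-year pair; 0 means every pair is unique
first_dupe <- anyDuplicated(toy, by = c("cow", "year"))
first_dupe

# Codes attached to more than one country name
multi_name <- toy[, .(n_names = uniqueN(country_name)), by = cow][n_names > 1]
multi_name
```

A non-zero first_dupe means a later join on cow and year would multiply rows, so it is cheaper to catch here.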

Missing Values

As previously covered, v2x_polyarchy and v2x_libdem are 2 high-level (vartype==D) democracy indices quantifying the extent of electoral and liberal democracy in a given state. Both metrics are continuous variables bounded between 0 and 1. We can quickly check their distributions to get a better sense of the data.

hist.dat<-melt(vdem, 
 id.vars = c("cow", "year"),
 measure.vars = c("v2x_libdem",
 "v2x_polyarchy"),
 variable.name = "metric",
 value.name = "value")

ggplot2::ggplot(hist.dat, ggplot2::aes(x=value))+
 ggplot2::geom_histogram()+
 ggplot2::facet_wrap(~metric)+
 ggplot2::labs(title = "V-Dem Metric Distributions",
 x = "Value",
 y= "Count")+
 ggplot2::theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

These look pretty good with (mostly) uniform coverage. The warnings have tipped us off to a few missing values; let’s investigate further.

vdem[is.na(v2x_libdem) | is.na(v2x_polyarchy),.(unique(country_name), n=.N, last_year=max(year)), by=cow]

cow	V1	n	last_year
860	Timor-Leste	4	1998
692	Bahrain	7	2001
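To see how the missing values split across the two metrics, the NAs can also be counted per column. A minimal sketch on four rows copied from the Bahrain and Timor-Leste tables in this guide (toy and na_counts are illustrative names):

```r
library(data.table)

# Four rows copied from the Bahrain (692) and Timor-Leste (860) tables
toy <- data.table(
  cow  = c(692, 692, 860, 860),
  year = c(2001, 2002, 1998, 1999),
  v2x_libdem    = c(NA, 0.068, NA, 0.088),
  v2x_polyarchy = c(0.112, 0.147, 0.090, 0.091)
)
metrics <- c("v2x_libdem", "v2x_polyarchy")

# Number of missing values in each metric column
na_counts <- toy[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = metrics]
na_counts
```

Run against the full table, the same call shows at a glance which metric is responsible for the gaps.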

There are only 11 missing values, but they should be investigated. First, Timor-Leste.

vdem[cow==860, .(country_name, year, v2x_libdem, v2x_polyarchy)]

country_name	year	v2x_libdem	v2x_polyarchy
Timor-Leste	1995	NA	0.077
Timor-Leste	1996	NA	0.077
Timor-Leste	1997	NA	0.077
Timor-Leste	1998	NA	0.090
Timor-Leste	1999	0.088	0.091
Timor-Leste	2000	0.186	0.225
Timor-Leste	2001	0.238	0.292
Timor-Leste	2002	0.375	0.502
Timor-Leste	2003	0.402	0.567
Timor-Leste	2004	0.402	0.567
Timor-Leste	2005	0.403	0.575
Timor-Leste	2006	0.403	0.573
Timor-Leste	2007	0.456	0.610
Timor-Leste	2008	0.466	0.628
Timor-Leste	2009	0.471	0.629
Timor-Leste	2010	0.480	0.633
Timor-Leste	2011	0.480	0.633
Timor-Leste	2012	0.489	0.644
Timor-Leste	2013	0.490	0.642
Timor-Leste	2014	0.467	0.633
Timor-Leste	2015	0.455	0.632
Timor-Leste	2016	0.441	0.611
Timor-Leste	2017	0.455	0.646
Timor-Leste	2018	0.477	0.666
Timor-Leste	2019	0.495	0.677
Timor-Leste	2020	0.464	0.660

Timor-Leste is missing v2x_libdem for 1995-1998. These years fall during the Indonesian occupation, before its internationally recognized independence. They can be ignored or dropped unless you have a special use case.

Now Bahrain.

vdem[cow==692, .(country_name, year, v2x_libdem, v2x_polyarchy)]

country_name	year	v2x_libdem	v2x_polyarchy
Bahrain	1995	NA	0.049
Bahrain	1996	NA	0.049
Bahrain	1997	NA	0.049
Bahrain	1998	NA	0.049
Bahrain	1999	NA	0.049
Bahrain	2000	NA	0.067
Bahrain	2001	NA	0.112
Bahrain	2002	0.068	0.147
Bahrain	2003	0.082	0.214
Bahrain	2004	0.077	0.214
Bahrain	2005	0.080	0.224
Bahrain	2006	0.088	0.231
Bahrain	2007	0.088	0.232
Bahrain	2008	0.088	0.232
Bahrain	2009	0.087	0.230
Bahrain	2010	0.084	0.228
Bahrain	2011	0.050	0.185
Bahrain	2012	0.043	0.165
Bahrain	2013	0.044	0.165
Bahrain	2014	0.045	0.163
Bahrain	2015	0.042	0.137
Bahrain	2016	0.041	0.130
Bahrain	2017	0.040	0.125
Bahrain	2018	0.041	0.120
Bahrain	2019	0.047	0.118
Bahrain	2020	0.048	0.118

Bahrain declared independence in 1971 and converted to a constitutional monarchy in 2001. v2x_libdem is missing for 1995-2001, and the missing value in 2001 may pose an issue when trying to join on additional datasets. A simple fix would be to replace the 2001 value with the 2002 value. A more complicated fix would be some type of lead-in imputation. Let’s examine the time series.

ggplot2::ggplot(vdem[cow==692], ggplot2::aes(x=year, y=v2x_libdem))+
 ggplot2::geom_point(size = 2)+
 ggplot2::labs(title="Bahrain Libdem Time Series",
 x = "Year",
 y = "Libdem")+
 ggplot2::theme_minimal()

## Warning: Removed 7 rows containing missing values (geom_point).

There is a bit of a linear trend, but imputation would be more trouble than it’s worth. An adequate correction is to put in the 2002 value.

vdem[cow==692 & year==2001, v2x_libdem := vdem[cow==692 & year==2002, v2x_libdem]]
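If more than one leading year needed a value, a slightly more general alternative is to back-fill within each country using nafill() from recent versions of data.table with type = "nocb" (next observation carried backward). This is a sketch of a different technique from the single-year replacement above: it fills every leading gap, not just 2001, which may or may not be appropriate for your analysis.

```r
library(data.table)

# Bahrain's v2x_libdem for 2000-2003, copied from the table above
toy <- data.table(
  cow  = 692L,
  year = 2000:2003,
  v2x_libdem = c(NA, NA, 0.068, 0.082)
)

# Sort so that "next observation" means the next year within each country
setorder(toy, cow, year)

# Back-fill gaps with the next observed value, country by country
toy[, libdem_filled := nafill(v2x_libdem, type = "nocb"), by = cow]
toy
```

Writing the result to a new column (libdem_filled, an illustrative name) keeps the original values available for comparison.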

Final Cleanup

Before finishing, we will perform a few final processing steps. First, extract only the minimum number of variables.

vdem <- vdem[, .(cow, year, v2x_libdem, v2x_polyarchy)]

Next, set the year and CoW columns to integers.

cols<-c("cow", "year")
vdem[, (cols):=lapply(.SD, as.integer), .SDcols = cols]
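With cow and year stored as integers, the table is ready to be joined to other country-year data. The sketch below shows one way such a join and a follow-up coverage check might look; other_toy and some_value are hypothetical stand-ins for a second dataset, and the V-Dem values are copied from the Bahrain table above.

```r
library(data.table)

# Toy cleaned V-Dem panel (values copied from the Bahrain table)
vdem_toy <- data.table(cow = c(692L, 692L), year = c(2002L, 2003L),
                       v2x_libdem = c(0.068, 0.082))

# Hypothetical second dataset keyed on the same CoW code and year
other_toy <- data.table(cow = c(692L, 692L), year = c(2003L, 2004L),
                        some_value = c(10, 12))

# Left join onto the V-Dem rows; unmatched rows get NA in some_value
joined <- merge(vdem_toy, other_toy, by = c("cow", "year"), all.x = TRUE)

# Country-years present in V-Dem but absent from the other dataset
unmatched <- joined[is.na(some_value), .(cow, year)]
unmatched
```

Inspecting the unmatched rows right after the join is a quick way to surface the country-code mismatches discussed earlier.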

Lastly, if you are working with other colleagues, strip the data.table class from the object.

data.table::setDF(vdem)

And we’re finished. I hope you found this exercise informative, and please contact me with any questions, concerns, or tips.

References

1. Coppedge, M. et al. V-Dem Codebook V10. https://papers.ssrn.com/abstract=3557877 (2020) doi: 10.2139/ssrn.3557877.

2. Pemstein, D. et al. The V-Dem Measurement Model: Latent Variable Analysis for Cross-National and Cross-Temporal Expert-Coded Data. https://papers.ssrn.com/abstract=3167764 (2018) doi: 10.2139/ssrn.3167764.

3. Marshall, M. & Jaggers, K. Polity IV Project: Political Regime Characteristics and Transitions, 1800-2002. (2002).

4. Freedom House. Freedom in the World 2014: The Annual Survey of Political Rights and Civil Liberties. (Rowman & Littlefield, 2014).

5. Wig, T., Hegre, H. & Regan, P. M. Updated data on institutions and elections 1960: Presenting the IAEP dataset version 2.0. Research & Politics 2, 2053168015579120 (2015).

6. The World Bank. World Development Indicators. https://data.worldbank.org/data-catalog/world-development-indicators (2017).

Deep Clean: Varieties of Democracy (V-Dem)

Summary