Managing country codes is one of the most difficult tasks in environment-security analysis, and each dataset treats the concept differently. Some country code designations are very strict and only respect internationally recognized sovereign nations, starting with their first official day of independence. Other datasets include separate observations and codes for territories, disputed states, colonial nations, and other quasi-independent states. Coding systems like the Correlates of War [1] or Gleditsch and Ward [2] fall into the former group, with strict, to-the-day international recognition, while the refugee and asylum datasets provided by the United Nations [3] fall into the latter. UN migration and refugee datasets include nearly every conceivable territorial designation or disaggregation (e.g. Puerto Rico, Taiwan, and the Faroe Islands).
When using an individual dataset, users need a general awareness of its conceptual approach to country code specification and of any potential inconsistencies. Harmonizing multiple datasets requires more intricate knowledge. Merging multiple datasets without first addressing disparate approaches to country coding will lead to inconsistencies, numerous NA values, and compounding downstream error in the analysis. It takes time and experience to anticipate country coding issues and to know what to look out for during project planning. The goal of this vignette is to outline common issues and considerations for harmonizing multiple country-year datasets. This walkthrough uses the Varieties of Democracy (V-Dem) [4] and Correlates of War (CoW) coding schemes to illustrate recurrent challenges. They are good datasets to work with while learning for several reasons:
- They represent opposite ends of the country-coding ideological spectrum.
  - Because V-Dem is focused on polities, it is highly disaggregated and observes long-standing claims of independence.
  - Conversely, CoW is very strict and adheres to official, internationally recognized claims of independence.
- Both datasets are consistent and have few or no errors. This is helpful for learning, but not entirely representative of real-world experience. On top of differences in coding ideologies, many datasets have a slew of errors and inconsistent applications that further complicate pre-processing.
- They both have direct programmatic access via R, which makes management, exploration, and visualization easier.
  - The vdemdata package [5].
  - The states package [6], featuring CoW and G&W code access and manipulation.
  - The cshapes package [7, 8], with vectorized historic state boundaries using either the CoW or G&W schemes.
- V-Dem provides reference materials that are extremely helpful for learners.
  - The Country Coding Manual and the Historic Name column in the raw data provide a lot of context for year-to-year regime changes and the reasoning behind important decisions.
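Before harmonizing anything, it helps to see exactly which country-years fail to match between two sources. The following is a minimal sketch in plain Python, not code from either project: the function name `unmatched` is illustrative, and the key sets are placeholders built from the Canada start year discussed later in this walkthrough (CoW numeric code 20).

```python
def unmatched(left_keys, right_keys):
    """Return (code, year) keys found in left_keys but not in right_keys, sorted."""
    return sorted(set(left_keys) - set(right_keys))

# Placeholder key sets of (CoW numeric code, year); Canada is CoW code 20.
vdem_keys = [(20, year) for year in range(1918, 1923)]  # disaggregated source
cow_keys = [(20, year) for year in range(1920, 1923)]   # strict source

only_in_vdem = unmatched(vdem_keys, cow_keys)
only_in_cow = unmatched(cow_keys, vdem_keys)

print(only_in_vdem)  # [(20, 1918), (20, 1919)]
print(only_in_cow)   # []
```

Running a check like this in both directions before a merge shows whether the NA values that appear afterwards come from coding differences or from genuine gaps in the data.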
V-Dem Country Codes
The Varieties of Democracy (V-Dem) project presents a collection of highly disaggregated indicators (400+) covering wide-ranging measures of democracy and institutional characteristics dating back to 1789. In contrast to several other political, social, and environmental datasets, V-Dem is exceptionally transparent about its methods and provides copious documentation. This includes a dedicated supplementary manual describing its country coding approach: V-Dem Country Coding Units v11.1 [9]. This tutorial will highlight the most relevant (in my opinion) portions of that approach and its country-year coding decisions.
V-Dem defines a country as "…a political unit enjoying at least some degree of functional and/or formal sovereignty."
Generally speaking, V-Dem will have a country-year observation for a country if:
- The state made a formal declaration of independence; even if not yet fully recognized by the international community.
  - In modern times this concerns states like Kosovo, Taiwan, and Western Sahara; historically it concerns colonial states in Asia and Africa.
- The state operates with some degree of autonomy that distinguishes it, at least in its polity and institutions, from the parent nation.
In my limited experience with V-Dem, this conceptual approach is carried out very consistently, but additional pre-processing is required to merge successfully with stricter datasets. Moreover, the Country Coding Units Manual contains information that helps users construct historical time series for nations that have limited independent observations in the dataset. For example, Bangladesh is coded independently starting in 1971, but to construct a longer historic record, the Country Coding Units Manual states that the following combination of observations may be used:
- India (Princely state of British India, 1910-1947)
- Pakistan (1947-1971)
- Bangladesh (1971-)
The Country Coding Units Manual is a great reference if you need to combine historical democracy indicators for nations with intermittent internationally recognized sovereignty with other historically disaggregated environmental, migration, or conflict data.
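Following the Bangladesh example above, a longer series can be stitched together from the three source units. This is a minimal sketch in plain Python under stated assumptions: the indicator values are placeholders, not V-Dem scores; the function name `stitch` is illustrative; and the overlapping boundary years (1947 and 1971) are assigned to the successor unit, which is a choice made here rather than something the manual prescribes.

```python
# (source country, first year, last year) for a long-run Bangladesh series.
# Assumption of this sketch: boundary years go to the successor unit.
SEGMENTS = [
    ("India", 1910, 1946),
    ("Pakistan", 1947, 1970),
    ("Bangladesh", 1971, 2020),
]

def stitch(records, segments):
    """Build {year: value}, taking each year from the segment that covers it.

    records maps (country_name, year) -> indicator value. Years with no
    record in the designated source are left out rather than filled in.
    """
    series = {}
    for country, first, last in segments:
        for year in range(first, last + 1):
            if (country, year) in records:
                series[year] = records[(country, year)]
    return series

# Placeholder values for illustration only.
records = {
    ("India", 1946): 0.10, ("India", 1947): 0.30,
    ("Pakistan", 1947): 0.20, ("Pakistan", 1970): 0.25,
    ("Bangladesh", 1971): 0.15,
}
long_series = stitch(records, SEGMENTS)
print(sorted(long_series.items()))
# [(1946, 0.1), (1947, 0.2), (1970, 0.25), (1971, 0.15)]
```

Keeping the segment table separate from the stitching logic makes the boundary-year decision explicit and easy to change.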
The histname variable in the raw V-Dem dataset is also an excellent source of context for the true nature of year-to-year changes in a polity. While the country_name field is static and typically reflects the modern, common country name in the V-Dem dataset, histname reflects official or historic names and potential states of occupation. Take Somalia for example:
| Country name | Historical name (histname) | CoW code | Start year | End year |
|---|---|---|---|---|
| Somalia | Somalia under formal Italian control over most of the territory [except British Somaliland] | 520 | 1900 | 1909 |
| Somalia | Somalia under effective (more or less) Italian control | 520 | 1910 | 1940 |
| Somalia | Somalia under British occupation | 520 | 1941 | 1949 |
| Somalia | UN Trust Territory of Somalia under Italian administration | 520 | 1950 | 1959 |
| Somalia | Somali Republic [unites Somaliland with the Trust Territory of Somalia] | 520 | 1960 | 1968 |
| Somalia | Somali Democratic Republic | 520 | 1969 | 1990 |
| Somalia | Somali Democratic Republic [civil war] | 520 | 1991 | 2003 |
| Somalia | Transitional Federal Government [civil war] | 520 | 2004 | 2011 |
| Somalia | Federal Republic of Somalia [civil war] | 520 | 2012 | 2020 |
This provides a clear understanding of Somalia’s historical sovereignty and polity over the past 120 years. When combined with the Country Coding Units Manual, the user is able to make better-informed choices regarding their research interests. It also serves as excellent reference material for historical nation-states that is much faster than being sucked into a Wikipedia black hole.
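A summary like the Somalia table above can be derived from country-year rows by collapsing consecutive years that share a historical name into spells. The following is a minimal sketch in plain Python with illustrative rows; the function name `to_spells` and the tuple layout are assumptions of this example, and only the field names country_name and histname come from the dataset description above.

```python
from itertools import groupby

def to_spells(rows):
    """Collapse (country_name, histname, year) rows into spells.

    Returns (country_name, histname, start_year, end_year) tuples, one per
    run of consecutive years that share the same histname.
    """
    spells = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for country, country_rows in groupby(ordered, key=lambda r: r[0]):
        run = None  # [histname, start_year, end_year] of the run being built
        for _, histname, year in country_rows:
            if run and run[0] == histname and year == run[2] + 1:
                run[2] = year  # extend the current run by one year
            else:
                if run:
                    spells.append((country, *run))
                run = [histname, year, year]
        if run:
            spells.append((country, *run))
    return spells

rows = (
    [("Somalia", "Somalia under British occupation", y) for y in range(1941, 1950)]
    + [("Somalia", "UN Trust Territory of Somalia under Italian administration", y)
       for y in range(1950, 1960)]
)
for spell in to_spells(rows):
    print(spell)
```

A gap in the years starts a new spell even when the name is unchanged, so interrupted records stay visible in the summary.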
Now we will walk through some of the more important country coding considerations for applied cases using Varieties of Democracy and Correlates of War for 1900-2020.
- CoW codes Canada starting in 1920, with its recognition by the League of Nations.
- V-Dem codes Canada back to 1841.
Most of the Caribbean nations are coded by CoW starting with their official independence from the colonial parent nations, usually some time between 1962 and 1975. V-Dem starts most Caribbean countries between 1789 and 1900.
Central and South America
The remaining nations of Central and South America are similarly coded by CoW and V-Dem because they mostly achieved their independence prior to 1900.
The Balkans
The Balkans are consistently a source of frustration when attempting to merge disparate datasets. The affected states include Serbia/Yugoslavia, Bosnia and Herzegovina, Kosovo, Croatia, North Macedonia, Slovenia, and Montenegro. V-Dem disaggregates each component for as long as possible, while CoW uses the official, internationally recognized, aggregated states. Because of this, V-Dem will require some additional processing when merging with most datasets. At a minimum this may include averaging across multiple states to calculate values for Yugoslavia or the State Union of Serbia and Montenegro. Given time and resources, users may also consider calculating weighted averages using population data, so as not to give Kosovo or Slovenia equal weighting with Serbia/Yugoslavia or Montenegro. The V-Dem Country Coding Units Manual is very helpful in reconstructing the complicated historic record of these nation-states.
- Montenegro
  - V-Dem (1789-1918; 1998-2019)
  - CoW (2006-)
- North Macedonia
  - V-Dem (1991-)
  - CoW (1993-)
- Croatia
  - V-Dem (1941-1945 Nazi puppet state; 1991-)
  - CoW (1992-)
- Yugoslavia / Serbia
  - V-Dem (1804-)
  - CoW (1878-), with no observations for 1942-1943. CoW maintains the YUG character code during the State Union of Serbia and Montenegro (2003-2006) and the Republic of Serbia (2006-).
- Bosnia and Herzegovina
  - V-Dem (1992-)
  - CoW (1992-)
- Kosovo
  - V-Dem (1999-)
  - CoW (2008-)
- Slovenia
  - V-Dem (1989-)
  - CoW (1992-)
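The population-weighted averaging suggested above for aggregated units such as the State Union of Serbia and Montenegro can be sketched as follows. This is a minimal illustration in plain Python: the scores and population figures are hypothetical placeholders, the function name `weighted_mean` is illustrative, and skipping components with a missing value is one possible choice rather than a recommendation from either dataset.

```python
def weighted_mean(values, weights):
    """Weighted mean of values; pairs with a missing value or weight are skipped."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in pairs) / total_weight

# Hypothetical component scores and populations (millions) for one year.
components = {
    "Serbia": (0.40, 7.0),
    "Montenegro": (0.60, 0.6),
    "Kosovo": (None, 1.8),  # missing indicator: excluded from both averages
}
scores = [score for score, _ in components.values()]
pops = [pop for _, pop in components.values()]

observed = [s for s in scores if s is not None]
simple = sum(observed) / len(observed)
weighted = weighted_mean(scores, pops)
print(round(simple, 3), round(weighted, 3))  # 0.5 0.416
```

With these placeholder numbers the simple mean sits halfway between the two observed scores, while the weighted mean stays close to the larger unit, which is the effect the weighting is meant to achieve.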
Austria and Hungary
V-Dem codes Austria and Hungary separately throughout their entire record. V-Dem therefore has no record corresponding to CoW numeric code 300 (Austria-Hungary, 1816-1918); the user must construct this designation.
Germany
Germany is split into three separate designations in both datasets.
- Germany (CoW 255): This is modern-day and pre-WWII “unified” Germany.
  - CoW tracks Germany 1816-1945; 1990-
  - V-Dem tracks Germany 1789-1944; 1991-
- West Germany (CoW 260): Post-WWII West Germany or the German Federal Republic.
  - CoW tracks West Germany 1955-1990
  - V-Dem tracks West Germany 1949-1990
- East Germany (CoW 265): Post-WWII East Germany or the German Democratic Republic.
  - CoW tracks East Germany 1954-1990
  - V-Dem tracks East Germany 1945-1990 (including 1945-1948 Third Reich Occupied by Russia)
Lastly, as a result of these various configurations, CoW does not track any form of Germany during 1946-1953, and V-Dem does not contain “Unified” or West German observations from 1945-1948.
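The coverage gaps described above can be derived mechanically from the listed year ranges. The following is a minimal sketch in plain Python; the spell lists simply restate the ranges given in this section, and the function names `covered_years` and `gaps` are illustrative.

```python
def covered_years(spells, first, last):
    """Years in [first, last] covered by at least one (start, end) spell.

    An end of None means the spell is still open.
    """
    years = set()
    for start, end in spells:
        stop = last if end is None else min(end, last)
        years.update(range(max(start, first), stop + 1))
    return years

def gaps(spells, first, last):
    """Return uncovered years in [first, last] as a list of (start, end) ranges."""
    missing = sorted(set(range(first, last + 1)) - covered_years(spells, first, last))
    ranges = []
    for year in missing:
        if ranges and year == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], year)  # extend the current gap
        else:
            ranges.append((year, year))
    return ranges

# Spells as listed above: unified, West, and East Germany in CoW;
# unified and West Germany only in V-Dem.
cow_germany = [(1816, 1945), (1990, None), (1955, 1990), (1954, 1990)]
vdem_unified_or_west = [(1789, 1944), (1991, None), (1949, 1990)]

print(gaps(cow_germany, 1900, 2020))           # [(1946, 1953)]
print(gaps(vdem_unified_or_west, 1900, 2020))  # [(1945, 1948)]
```

The same two functions can be pointed at any country's spell list to flag years that will drop out of a merged panel.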
Czechoslovakia, Czech Republic, and Slovakia
CoW includes designations for Czechoslovakia (315; 1918-1992), Czech Republic (316; 1993-), and Slovakia (317; 1993-), whereas V-Dem observes Czechoslovakia and the Czech Republic on a single code (157; 1918-) with Slovakia (201) diverging from 1939-1945 and again permanently in 1993.
- CoW has no Czechoslovakia observations under German occupation (1940-1944).
- V-Dem has Czechoslovakia observations from 1939-1944 with a historical designation for German occupation.
- V-Dem includes Slovakia observations for 1939-1944, when it seceded and essentially behaved as a Nazi puppet state.
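Because one V-Dem code can correspond to different CoW codes in different years, a crosswalk between the two schemes has to be keyed on both code and year. This is a minimal sketch in plain Python using only the codes and years given above; the table layout and the function name `vdem_to_cow` are assumptions of this example, not an official concordance.

```python
# (V-Dem code, first year, last year, CoW code); None marks an open end.
# Codes and years restate the text above; this is not an official concordance.
CROSSWALK = [
    (157, 1918, 1992, 315),  # Czechoslovakia
    (157, 1993, None, 316),  # Czech Republic
    (201, 1993, None, 317),  # Slovakia
]

def vdem_to_cow(vdem_code, year, crosswalk=CROSSWALK):
    """Return the CoW code for a V-Dem code in a given year, or None if unmapped."""
    for code, first, last, cow_code in crosswalk:
        if code == vdem_code and first <= year and (last is None or year <= last):
            return cow_code
    return None

print(vdem_to_cow(157, 1950))  # 315
print(vdem_to_cow(157, 2000))  # 316
print(vdem_to_cow(201, 1940))  # None
```

The crosswalk only translates codes. Whether CoW actually holds an observation for a translated country-year, for example during the wartime occupation, still has to be checked at merge time.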
Former Soviet Union
Former members of the Soviet Union are handled similarly in both datasets. Due to the centralized control the USSR exercised over polity and institutions during this time, V-Dem does not code members independently (in contrast to colonial Africa or the Caribbean). The exceptions are nations with a historical presence prior to annexation or occupation (Lithuania, Latvia, Uzbekistan, etc.). In these cases states are recorded independently for their pre-WWI or pre-WWII independence, and then again following the dissolution of the USSR: 1990 (V-Dem) or 1991 (CoW).
The remaining notable European countries are Poland, Ireland, and Luxembourg.
- Poland
  - CoW records Poland from 1918-1939 and 1945-
  - V-Dem tracks Poland from 1789-1938 and 1944-
  - Both include gaps for WWII occupation
- Ireland
  - V-Dem starts with the declaration of independence, 1919-
  - CoW starts with official independence, 1922-
- Luxembourg
  - V-Dem 1815-
  - CoW 1920-
Middle East and North Africa
Along with the various configurations of Yugoslavia, Palestine is another consistent source of frustration when carrying out pre-processing. Many datasets and coding schemes do not recognize Palestine. In fact, CoW does not include a code for Palestine. This complicates interdisciplinary research, because including Palestine in your analysis can severely limit the number of additional datasets you can include without introducing more error through imputation or other fixes. V-Dem includes 3 separate designations for Palestine:
- Palestinian designations in V-Dem
  - Palestine/British Mandate (209): 1918-1948
  - Palestine/Gaza (138): 1948-1967 and 2007-. This is present-day Gaza controlled by Hamas, and not influenced by Israel.
  - Palestine/West Bank (128): 1948-1950 and 1967-. Starting with 2007, this refers to the West Bank only (Gaza is coded separately once Hamas gains control in 2007).
CoW includes separate codes for the Yemen Arab Republic/North Yemen (678; 1926-1990), unified modern Yemen (679; 1990-), and the Yemen People’s Democratic Republic/South Yemen (680; 1967-1990).
As with the Czech Republic, V-Dem tracks historic, North, and modern Yemen on a single code (14; 1789-1850 and 1918-). South Yemen has a separate code (23; 1900-1990). V-Dem’s longer historic record for South Yemen includes only the city of Aden and its immediate surroundings from 1900-1963. There is no CoW equivalent for South Yemen/Aden during this time.
Africa
There are not many complicated distinctions throughout Africa. V-Dem codes most African nations independently starting in 1900 while indicating their colonial protectorate status. Depending on the colonial parent nation, most African countries acquired internationally recognized independence between 1955 and 1975; this is when the CoW record begins. Some notable African countries:
- Somaliland
  - Recognized by V-Dem as an independent state from 1900-1960 and again from 1991.
  - CoW (and several other datasets) does not include Somaliland.
- Ethiopia
  - Coded continuously by V-Dem starting in 1789, including Italian occupation from 1936-1941.
  - Coded intermittently by CoW (1898-1936; 1941-); excludes Italian occupation.
- Sudan and South Sudan
  - They are coded identically by V-Dem and CoW, with a separate code for South Sudan starting in 2011.
Asia
V-Dem includes a longer historical record for most Asian countries because, as in Africa and the Caribbean, less centralized control was exercised over their polities and institutions during colonial rule. As a result, most Asian states are recorded in V-Dem beginning sometime between 1789 and 1900, and in CoW during one of the waves of colonial (1945-1950) or regional (1971-1975) independence. However, there are a few nation-states with more complicated records in V-Dem and CoW:
CoW tracks Korea on three separate numeric codes: Korea (Unified; 730), North Korea (731), and South Korea (732). Conversely, V-Dem assigns a single code to historic/unified Korea and South Korea (42) while placing North Korea on its own code (41). CoW does not list Korea while under Soviet, US, or Japanese occupation.
- Korea (Unified)
  - CoW (730): 1887-1905
  - V-Dem (42): Listed as South Korea 1789-
- North Korea
  - CoW (731): 1948-
  - V-Dem (41): 1945-
- South Korea
  - CoW (732): 1949-
  - V-Dem (42): Shared code with Unified Korea 1789-1944; South Korea only 1945-
As with the Koreas, V-Dem places historic, unified Vietnam and post-WWII South Vietnam on a single code. North Vietnam is recognized in 1945 and then absorbs the former South Vietnam in 1976, carrying on to the present. CoW does not recognize colonial or occupied Vietnam.
- Historic Vietnam
  - V-Dem (35): 1802-1975; includes only South Vietnam from 1945-1975.
  - CoW: Not recognized as a unified state.
- North Vietnam
  - V-Dem (34): 1945-; includes former South Vietnam from 1976-.
  - CoW (816): 1954-
- South Vietnam
  - V-Dem (35): 1802-1975; includes only South Vietnam from 1945-1975.
  - CoW (817): 1954-1975