
Measuring Democracy: From Texts to Data

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Thiago Veiga Marzagão, M.A.

Graduate Program in Political Science

The Ohio State University

2014

Dissertation committee:

Sarah Brooks, Advisor
Irfan Nooruddin
Marcus Kurtz
Janet Box-Steffensmeier

Copyright by Thiago Veiga Marzagão 2014

Abstract

In this dissertation I use the text-as-data approach to create a new democracy index, which I call Automated Democracy Scores (ADS). Unlike other indices, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases.

Dedication

Dedicated to the taxpayers who paid for this research.

Acknowledgements

Sarah Brooks, Irfan Nooruddin, Marcus Kurtz, and Janet Box-Steffensmeier provided invaluable advice and mentorship. They pushed me to be methodologically rigorous and to fully explore the substantive implications of my research. They also allowed me to take a big risk: creating an automated democracy index had never been attempted before, and a different committee might have found the idea too ambitious for a dissertation. I thank my committee members for their trust and open-mindedness.

I am also indebted to those who took the time to read and comment on earlier drafts and/or discuss my research idea: Philipp Rehm, Paul DeBell, Margaret Hanson, Carolyn Morgan, Peter Tunkis, Vittorio Merola, Raphael Cunha, and Marina Duque.

I am also grateful to the institutions and people that provided material assistance. The Fulbright (grantee ID 15101786) and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES (BEX 2821/09-5) paid my tuition, university fees, airline tickets, and part of my health insurance. The Ministério do Planejamento, Orçamento e Gestão - MPOG (process no. 03080.000769/2010-53) granted me four years of paid leave. The Ohio Supercomputer Center allocated computing time. Courtney Sanders patiently answered my endless questions about graduation, forms, and procedures.

Finally, I am indebted to my loved ones, all of whom supported me even from a great distance.

All errors are mine.

Vita

July 2003 .............. B.S. International Relations, University of Brasília
June 2007 .............. M.A. International Relations, University of Brasília
December 2012 .......... M.A. Political Science, Ohio State University

Publications

"A dimensão geográfica das eleições brasileiras" ("The spatial dimension of Brazilian elections"). Opinião Pública (Public Opinion), 19(2), 270-290, 2013.

"Lobby e protecionismo no Brasil contemporâneo" ("Lobby and protectionism in Brazil"). Revista Brasileira de Economia (Brazilian Review of Economics), 62(3), 263-178, 2008.

Fields of Study

Major Field: Political Science

Table of Contents

Abstract ........................................................... ii
Dedication ......................................................... iii
Acknowledgements ................................................... iv
Vita ............................................................... v
List of Tables ..................................................... vii
List of Figures .................................................... viii
Introduction ....................................................... 1
Paper 1: Ideological Bias in Democracy Measures .................... 3
Paper 2: Automated Democracy Scores ................................ 27
Paper 3: Measuring Democracy From Texts: Can We Do Better Than Wordscores? ... 63
Conclusion ......................................................... 92
References ......................................................... 93
Appendix A: SEM Estimation ......................................... 99
Appendix B: Replication ............................................ 101
Appendix C: HMT .................................................... 103
Appendix D: HBB .................................................... 105
Appendix E: Multiple Decision Trees ................................ 107

List of Tables

Table 1. Bollen and Paxton's regressions for 1980 .................. 10
Table 2. Replication of Bollen and Paxton's regressions for 1980 ... 12
Table 3. Simulation results for Marxism-Leninism and Catholicism ... 20
Table 4. Simulation results for Protestantism and monarchy ......... 21
Table 5. ADS summary statistics, by year ........................... 47
Table 6. Correlation between ADS and other indices, by year ........ 49
Table 7. Largest discrepancies between ADS and UDS ................. 50
Table 8. Overlaps for the year 2008 ................................ 54
Table 9. Correlations with UDS (using 50 topics) ................... 81
Table 10. Correlations with UDS (using 100 topics) ................. 82
Table 11. Correlations with UDS (using 150 topics) ................. 83
Table 12. Correlations with UDS (using 200 topics) ................. 84
Table 13. Correlations with UDS (using 300 topics) ................. 85
Table 14. Top 5 topics extracted with LSA .......................... 87
Table 15. First 5 topics extracted with LDA ........................ 89

List of Figures

Figure 1. Bollen and Paxton's model ................................ 7
Figure 2. Bollen and Paxton's fit statistics ....................... 8
Figure 3. Fit statistics from my replication of Bollen and Paxton .. 9
Figure 4. Automated Democracy Scores, 2012 ......................... 45
Figure 5. Automated Democracy Scores, 1993-2012 .................... 46
Figure 6. ADS range and press coverage ............................. 48
Figure 7. Example of words×topics table generated with LSA ......... 68
Figure 8. Example of topics×documents table generated with LSA ..... 70
Figure 9. Example of decision tree ................................. 78

Introduction

In this dissertation I investigate the flaws of current democracy indices and propose a new, improved one. This dissertation consists of three papers.

In the first paper I show that, contrary to what previous research has led us to believe, we cannot make any claims about the nature of the ideological biases that contaminate existing democracy measures. For instance, I show that the Freedom House data, often believed to have a conservative bias, may actually have a liberal bias instead. I do that by replicating previous research on the subject (Bollen and Paxton 2000) but replacing real-world data with simulated data in which I manipulate democracy levels and the ideological biases of hypothetical raters. The results of these Monte Carlo simulations show that even though we can confidently assert the existence of bias in some democracy measures, we cannot say anything about which measures are biased or in what ways. That means we currently have no way to circumvent the circularity problem: if we find that democracy is associated with some variable X, is that a genuine association or an artifact of our democracy measure being biased toward X?

In the second paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). I produce the ADS using the well-known Wordscores algorithm and 42 million news articles from 6,043 different sources. The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data. I create a website where anyone can replicate and tweak the data-generating process by changing the parameters of the underlying model (no coding required): www.democracy-scores.org.

In the third paper I explore other ways to create an automated democracy index from news articles. More specifically, I use the same news articles used in the second paper but I replace Wordscores with other algorithms - namely, a combination of topic extraction methods (Latent Semantic Analysis and Latent Dirichlet Allocation) and decision trees. The goal is to address the issue of construct validity more directly.

Ideological Bias in Democracy Measures

Abstract

In this paper I show that, contrary to what previous research has led us to believe, we cannot make any claims about the nature of the ideological biases that contaminate existing democracy measures. For instance, I show that the Freedom House data, often believed to have a conservative bias, may actually have a liberal bias instead. I do that by replicating previous research on the subject (Bollen and Paxton 2000) but replacing real-world data with simulated data in which I manipulate democracy levels and the ideological biases of hypothetical raters. The results of these Monte Carlo simulations show that even though we can confidently assert the existence of bias in some democracy measures, we cannot say anything about which measures are biased or in what ways. That means we currently have no way to circumvent the circularity problem: if we find that democracy is associated with some variable X, is that a genuine association or an artifact of our democracy measure being biased toward X?

1. Introduction

What do we know about ideological bias in democracy measures? Bollen and Paxton (2000), using structural equation modeling, find that several indicators from the Freedom House dataset (Sussman 1982; Gastil 1988) and from Arthur Banks' Cross-National Time Series Archive - CNTS (Banks [1971], updated through 1988) are compromised by ideological bias: the coding is sensitive to a number of variables that (conceptually) have nothing to do with democracy, such as economic policy (whether the polity is Marxist-Leninist), religion (whether the polity is predominantly Roman Catholic or predominantly Protestant), and form of government (whether the polity is a monarchy or a republic).

Bollen and Paxton's article has been highly influential. Fourteen years after its publication it is still the only comprehensive, systematic attempt to uncover ideological bias in democracy measures. As such, it appears in nearly every discussion of democracy measurement (e.g., Munck and Verkuilen [2002], Treier and Jackman [2008], Pemstein, Meserve, and Melton [2010]). And it has influenced researchers' choices - most notably, it has become commonplace to avoid the Freedom House democracy data on the grounds that Bollen and Paxton have found them to have a conservative bias.

Although influential, Bollen and Paxton's findings have never been subjected to scrutiny; they are usually taken at face value. In this paper I perform the first reassessment of Bollen and Paxton's findings. I do that with simulated data in which I manipulate the countries' levels of democracy and the measures' ideological biases. I find that Bollen and Paxton's method: a) yields incorrect results about which democracy measures are biased; b) yields incorrect results about the nature of those biases; and c) fails to find bias when different measures are biased in similar ways.

In sum, I show that for the past fourteen years political scientists have allowed flawed results to influence their choices of democracy data. Those choices, in turn, may have affected what we think we know today about democracy. For instance, are democracy and economic freedom associated, as some economists argue,1 or is that apparent association an artifact of our democracy data being responsive to economic policy in the first place? Bollen and Paxton's results suggest that the Freedom House data have a conservative bias and that the CNTS data do not. Hence we might feel inclined to address the bias problem by avoiding the Freedom House data and using the CNTS data instead.

1 See for instance Lawson and Clark (2010).

But in this paper I show that Bollen and Paxton's results do not really tell us anything about which indices are biased, or in what ways. In other words, whichever democracy measure we choose, we cannot discard the possibility that our empirical tests are circular.

Section 2 details Bollen and Paxton's work. Section 3 explains the Monte Carlo simulations and presents the results. Section 4 concludes.

2. Bollen and Paxton's analysis

In this section I explain in detail Bollen and Paxton's methodology and results. The gist of it is that Bollen and Paxton treat ideological bias as a latent variable and use structural equation modeling (SEM) to extract it from Freedom House indicators and from CNTS indicators. Bollen and Paxton then regress the extracted biases on a number of polity characteristics (economic, social, and political variables) and use the estimated coefficients (signs and statistical significance) to draw conclusions about how exactly the Freedom House and the CNTS data are biased.

Bollen and Paxton's analysis is based on eight indicators, four from the Freedom House dataset and four from the CNTS dataset. The Freedom House indicators are: freedom of broadcast media, freedom of print media, civil liberties, and political rights. The CNTS indicators are: freedom of group opposition, competitiveness of the nomination process, chief executive elected, and effectiveness of the legislative body.

Bollen and Paxton start by using SEM to extract five latent variables from those eight indicators. Two latent variables are assumed to be democracy features ("political liberties" and "democratic rule") and three are assumed to be coder-specific ideological biases (Raymond Gastil's and Leonard Sussman's, who were Freedom House coders, and Arthur Banks', who was the CNTS coder).2

2 Raymond Gastil was responsible for the civil liberties and political rights indicators. Leonard Sussman was responsible for the freedom of broadcast media and freedom of print media indicators. Arthur Banks was responsible for the freedom of group opposition, competitiveness of the nomination process, chief executive elected, and effectiveness of the legislative body indicators.

Each of the eight indicators is modeled as being determined by a traits factor, a methods factor, and random measurement error. For instance, freedom of broadcast media is modeled as being determined by the traits factor "political liberties," by the methods factor "Leonard Sussman's bias" (since Leonard Sussman was the researcher responsible for the freedom of broadcast media indicator), and by random measurement error. More generally, each indicator is modeled as

indicator_kp = λ_t · trait_tp + λ_m · method_mp + δ_kp

where indicator k for polity p is a linear combination of traits factor t for polity p, methods factor m for polity p, and indicator k's random measurement error for polity p. The complete picture of which indicators load on which factors is provided in Figure 1 below, extracted from Bollen and Paxton (65).3

3 I thank Prof. Bollen for helping me understand some aspects of the model specification.

Source: Bollen and Paxton (2000, p. 65).

Figure 1. Bollen and Paxton's model

Figure 1 follows the standard SEM notation, with square boxes representing indicators (i.e., observed variables) and circles representing factors (i.e., latent variables). The factor-to-indicator arrows show which indicators load on which factors.4 The factor-to-factor arrows show which factors correlate.5 The "E" arrows show which indicators have random measurement error.6

Bollen and Paxton estimate, for each indicator, the factor loadings (i.e., the lambdas) and the random measurement error for each year in the 1972-1988 interval (see Appendix A).

4 Based on previous work (Bollen 1993), Bollen and Paxton model the freedom of group opposition indicator as being free from Banks' ideological bias.

5 The "political liberties" and "democratic rule" factors correlate because they are close concepts. Sussman and Gastil correlate because they both worked at the Freedom House.

6 Based on previous work (Bollen 1993), Bollen and Paxton model the political rights and competitiveness of the nomination process indicators as having no random measurement error, i.e., δ = 0.

They find the fit statistics shown in Figure 2 below, extracted from their article (67).

Source: Bollen and Paxton (2000, p. 67). IFI stands for Incremental Fit Index and 1-RMSEA stands for 1 - Root Mean Square Error of Approximation.

Figure 2. Bollen and Paxton's fit statistics

Figure 2 compares two fit statistics - the Incremental Fit Index (IFI) and 1 minus the Root Mean Square Error of Approximation (1-RMSEA) - for each year from 1972 to 1988. In both cases (IFI and 1-RMSEA) the larger the statistic, the better the model fit. As we see, the model fit improves considerably when we include both traits and methods factors, as compared to when we include only traits factors.7

7 Not all factors are included in every year: Gastil's and Banks' indicators are available for the entire 1972-1988 interval, but Sussman's are only available for the 1979-1981 and 1983-1987 intervals. Hence for 1972-1978, 1982, and 1988 the estimated model is actually a restricted version of the model depicted in Figure 1 above: everything is the same except that the Sussman factor and the corresponding indicators are not included.

The improved fit shows that each of the eight indicators is the product not only of the underlying trait ("political liberties" or "democratic rule," according to the case) and random measurement error, but also of a systematic component - the rater's ideological bias.

I replicated Bollen and Paxton's analysis, just to make sure I was following the same procedures, and obtained almost exactly the same fit statistics:

Source: my own estimations. These estimates are essentially identical to those in Bollen and Paxton (2000, p. 67). 1-RMSEA stands for 1 - Root Mean Square Error of Approximation and IFI stands for Incremental Fit Index.

Figure 3. Fit statistics from my replication of Bollen and Paxton
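For readers who want to attempt a replication along these lines, the snippet below sketches how a trait/method factor model like the one in Figure 1 can be specified in R. It is a minimal illustration only, not the code behind the estimates reported here (which used the sem package): it uses the lavaan package instead, and the data frame fh_cnts, its column names, and the toy data-generating step are all hypothetical stand-ins for the real indicators.

    library(lavaan)
    set.seed(1)

    # Toy stand-in data for the eight indicators (the actual replication uses
    # the 1972-1988 Freedom House and CNTS country scores).
    n <- 112
    lib <- rnorm(n); rule <- 0.6 * lib + rnorm(n)                  # correlated traits
    sus <- rnorm(n); gas <- 0.4 * sus + rnorm(n); ban <- rnorm(n)  # coder biases
    fh_cnts <- data.frame(
      broadcast_free   = lib + sus + rnorm(n),
      print_free       = lib + sus + rnorm(n),
      civil_liberties  = lib + gas + rnorm(n),
      political_rights = lib + gas,        # modeled with no random error
      group_opposition = rule + rnorm(n),  # modeled as free from Banks bias
      nomination_comp  = rule + ban,       # modeled with no random error
      exec_elected     = rule + ban + rnorm(n),
      legis_effective  = rule + ban + rnorm(n)
    )

    model <- '
      # traits factors
      pol_liberties =~ broadcast_free + print_free + civil_liberties + political_rights
      dem_rule      =~ group_opposition + nomination_comp + exec_elected + legis_effective
      # methods factors (coder-specific biases)
      sussman =~ broadcast_free + print_free
      gastil  =~ civil_liberties + political_rights
      banks   =~ nomination_comp + exec_elected + legis_effective
      # methods factors orthogonal to traits factors; Sussman and Gastil may correlate
      pol_liberties ~~ 0*sussman + 0*gastil + 0*banks
      dem_rule      ~~ 0*sussman + 0*gastil + 0*banks
      banks         ~~ 0*sussman + 0*gastil
      # indicators modeled as having no random measurement error (delta = 0)
      political_rights ~~ 0*political_rights
      nomination_comp  ~~ 0*nomination_comp
    '
    fit <- sem(model, data = fh_cnts)
    fitMeasures(fit, c("ifi", "rmsea"))  # the fit statistics plotted in Figures 2-3
    head(lavPredict(fit))                # factor scores, including the bias factors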

Bollen and Paxton then use those estimates to produce three sets of factor scores, one for each rater (see Appendix A). Bollen and Paxton regress these factor scores on a number of country-level variables, for two sets of years: 1972, 1975, 1980, 1984, and 1988 (Gastil's factor scores and Banks' factor scores); and 1980 and 1984 (Sussman's factor scores). Table 1 below reproduces the estimates they obtained using 1980 data.

Table 1. Bollen and Paxton's regressions for 1980 (a)

                                        Gastil       Sussman      Banks
Marxist-Leninist                       -0.976***    -0.504*       1.366**
                                       (0.294)      (0.300)      (0.530)
Protestant                              0.357        0.459        0.670**
                                       (0.324)      (0.295)      (0.264)
Roman Catholic                          0.835***     1.239****    0.676**
                                       (0.312)      (0.331)      (0.286)
monarchy                                0.343        0.131       -1.441****
                                       (0.263)      (0.282)      (0.268)
ln(energy per capita)                  -0.122       -0.063        0.124
                                       (0.126)      (0.123)      (0.177)
ln(years since independence)            0.152        0.084       -0.515
                                       (0.153)      (0.153)      (0.160)
coups                                   0.349        0.380       -0.615***
                                       (0.268)      (0.287)      (0.202)
internal or interstate war in 1980?    -0.360        0.022       -0.234
                                       (0.302)      (0.328)      (0.438)
ln(protests)                            0.098        0.068       -0.174
                                       (0.121)      (0.139)      (0.109)
ln(political strikes)                  -0.074       -0.080        0.211
                                       (0.137)      (0.158)      (0.135)
ln(riots)                              -0.200       -0.023        0.075
                                       (0.142)      (0.148)      (0.148)
media coverage                          0.019        0.002       -0.034
                                       (0.036)      (0.037)      (0.032)
ln(population)                          0.155        0.029        0.279**
                                       (0.117)      (0.139)      (0.122)
ln(area in km^2)                       -0.050       -0.025       -0.024
                                       (0.058)      (0.061)      (0.076)
ln(radio sets + TV sets per capita)     0.067        0.022       -0.105
                                       (0.075)      (0.067)      (0.102)
intercept                              -0.817       -0.494        1.922***
                                       (0.646)      (0.614)      (0.632)
adjusted R^2                            0.24         0.26         0.29
N                                       81           81           81

(a) OLS estimates. Heteroskedasticity-consistent standard errors in parentheses. * p < 0.10; ** p < 0.05; *** p < 0.01; **** p < 0.001. Data sources: The New York Times, CBS News Index, Facts on File, The World Almanac and Encyclopedia, United Nations Statistical Yearbook, others.

As we observe, Leonard Sussman and Raymond Gastil seem to be biased against Marxist-Leninist countries and in favor of Roman Catholic countries, whereas Arthur Banks seems to be biased in favor of Marxist-Leninist countries, Protestant countries, and Roman Catholic countries, and against monarchic countries. In regressions using data from other years, Bollen and Paxton also find positive, statistically significant coefficients for the Protestant variable in the Gastil and Sussman regressions (Bollen and Paxton 2000, 76). As the next section will show, none of these conclusions is warranted.

As before, here too I replicated Bollen and Paxton, just to make sure I was following the same procedures. Here my replication was less successful, with several discrepancies (see Appendix B for details):

Table 2. Replication of Bollen and Paxton's regressions for 1980 (a)

                                        Gastil       Sussman      Banks
Marxist-Leninist                       -0.874***    -0.846**      0.587**
                                       (0.257)      (0.358)      (0.293)
Protestant                             -0.103       -0.081        0.147
                                       (0.280)      (0.390)      (0.320)
Roman Catholic                          0.494**      0.872**      0.372
                                       (0.238)      (0.332)      (0.272)
monarchy                                0.024       -0.473       -0.166
                                       (0.276)      (0.385)      (0.315)
ln(energy per capita)                  -0.170       -0.088        0.368**
                                       (0.136)      (0.190)      (0.156)
ln(years since independence)            0.294*       0.279       -0.314**
                                       (0.129)      (0.179)      (0.147)
coups 1976-1980                        -0.018        0.054       -0.620**
                                       (0.159)      (0.222)      (0.182)
internal or interstate war in 1980?    -0.677**     -0.617*       0.352
                                       (0.263)      (0.366)      (0.300)
ln(protests in 1975-1980)               0.066        0.118*       0.004
                                       (0.043)      (0.060)      (0.049)
ln(strikes in 1975-1980)               -0.027        0.011        0.031
                                       (0.038)      (0.053)      (0.043)
ln(riots in 1975-1980)                 -0.029       -0.020        0.017
                                       (0.041)      (0.058)      (0.047)
ln(media coverage)                      0.120        0.069       -0.355***
                                       (0.117)      (0.163)      (0.133)
ln(population)                         -0.096       -0.240*       0.296**
                                       (0.101)      (0.141)      (0.116)
ln(area in km^2)                        0.031        0.082       -0.076
                                       (0.064)      (0.089)      (0.073)
ln(radio sets + TV sets per capita)     0.094       -0.014       -0.188
                                       (0.137)      (0.192)      (0.157)
intercept                              -0.986        0.429        1.004
                                       (1.132)      (1.578)      (1.292)
N                                       112          112          112
F                                       3.98***      3.27***      3.29***
adjusted R^2                            0.2871       0.2344       0.2364

(a) OLS estimates. Heteroskedasticity-consistent standard errors in parentheses. * p < 0.10; ** p < 0.05; *** p < 0.01.

As in Bollen and Paxton, here too Marxism-Leninism has a negative, statistically significant coefficient in the Gastil and Sussman regressions and a positive, statistically significant coefficient in the Banks regression. Also as in Bollen and Paxton, we find here that Roman Catholic has a positive, statistically significant coefficient in the Gastil and Sussman regressions. The similarities stop there. In Bollen and Paxton, Protestant, Roman Catholic, and monarchy all turn out statistically significant in the Banks regression, but in the replication they do not. (The other variables have little to do with ideological bias so they are of no interest here.) Because the point of this paper is the simulations, not the replication, I leave the details to Appendix B.

3. Monte Carlo simulations

3.1 Basic idea

How solid are the results obtained in the previous section? In this section I show that they are indeterminate; they tell us nothing about which democracy measures are biased or in what ways.

The estimates from my replication of Bollen and Paxton suggest, for instance, that Gastil and Sussman are biased against Marxism-Leninism and that Banks is biased in favor of Marxist-Leninist countries. But what if all three raters are biased in the same direction, only to different degrees? If all three raters are biased in favor of Marxism-Leninism but Banks more so than Gastil and Sussman, couldn't that produce the opposite coefficient signs we observe? Or, alternatively, if all three are biased against Marxist-Leninist countries but Gastil and Sussman much more so than Banks, couldn't that produce opposite coefficient signs as well?

The same applies to the other three variables of interest - Protestant, Roman Catholic, and monarchy. Table 2 suggests that none of the raters is biased against or in favor of Protestant or monarchic countries. But maybe they all are, and to similar degrees - so the bias becomes invisible and the SEM estimates simply cannot capture it. Table 2 also suggests that Gastil and Sussman are biased in favor of Roman Catholic countries whereas Banks is not. But what if Banks is biased in favor of Roman Catholic countries as well, just less so than Gastil and Sussman?

How can we verify all that? We cannot observe a country's true level of democracy or a rater's ideological bias - these are latent variables. But we can simulate them. In other words, we can make up some democracy levels and some raters' ideological biases. We can then re-do Bollen and Paxton's analysis, but using the simulated democracy data rather than the actual democracy data. Because we will know the true (i.e., simulated) democracy levels and ideological biases, we will be able to tell how reliable Bollen and Paxton's results are.

I start by producing simulated data in which I fix the level of democracy and the direction and magnitude of each rater's ideological bias. I then make these simulated factors load on a number of simulated indicators, estimate the structural model, and extract the factor scores.

I then regress the extracted factor scores on the same country-level variables Bollen and Paxton used (Marxism-Leninism, Protestantism, etc.) and check whether the coefficients are telling the truth - for instance, whether Marxism-Leninism has a negative, statistically significant coefficient when the simulated rater is biased against Marxism-Leninism.

I repeat the process thousands of times, each time drawing a new batch of simulated factors, and count how often we obtain misleading coefficients (for instance, how often Marxism-Leninism comes out negative and significant even though the simulated rater is biased in favor of Marxism-Leninism).

That should give us an idea of how reliable the findings in Table 2 - and, by extension, those in Bollen and Paxton - are.

3.2 Model specification

I begin by simulating three factors: each country's level of democracy; the idiosyncrasies (i.e., the systematic measurement error) of a hypothetical rater we are going to call Rater #1; and the idiosyncrasies of a hypothetical rater we are going to call Rater #2. The level of democracy is generated as a uniform random variable ranging from 0 to 20.8 Rater #1's factor is generated as a normal random variable with mean 5 and standard deviation 5. And Rater #2's factor is generated as a normal random variable with mean 5 and standard deviation 15.9 For each factor I generate 112 observations (the number of countries in the dataset).

The second step is to introduce ideological bias into the factors. I do that by making the raters' factors alternately respond to Marxism-Leninism, Protestantism, Roman Catholicism, or monarchy. The nature of the simulated bias is different across these four variables.

In the case of Marxism-Leninism, Table 2 suggests opposite biases. So I test whether the same result might be obtained even if Rater #1 and Rater #2 were biased in the same direction, but to different degrees. Hence for Marxist-Leninist countries I boost Rater #1's factor by p1 points and Rater #2's factor by p2 points, with p2 always fixed at 0.025 and p1 taking the following values: 0.5, 1, 3, 5, 7, 10, 15, and 20.

8 That seems to be the distribution of actual measures of democracy (e.g., the political rights index of the Freedom House).

9 The normal distribution is chosen because structural equation models rely on the assumption that the factors follow a multivariate normal distribution (thus we could not have all three factors follow a uniform distribution).

In the case of Protestantism, Table 2 would have us believe that none of the raters is biased. But what if all raters are biased in the same direction and to similar degrees? Could that not make the bias invisible in the estimation? To check that, for Protestant countries I boost Rater #1's factor by c1 percent and Rater #2's factor by c2 percent, with the c1-c2 pairs being: 30%-35%, 50%-55%, 70%-75%, 130%-135%, 150%-155%, 170%-175%, 190%-195%, and 230%-235%.10

In the case of Roman Catholicism, Table 2 suggests that Gastil and Sussman are positively biased and that Banks is not biased in any direction. We want to know whether that result might be obtained even if all three raters were positively biased, only with Gastil and Sussman more so than Banks. So here I do the same as in the Marxism-Leninism case: for Roman Catholic countries I boost Rater #2's factor by p2 = 0.025 points and Rater #1's factor by p1 points, with p1 taking the following values: 0.5, 1, 3, 5, 7, 10, 15, and 20.

Finally, in the case of monarchy, Table 2 suggests no one is biased, but - as in the case of Protestantism - perhaps Gastil, Sussman, and Banks are all biased in the same direction and to similar degrees, which could make the bias disappear in the SEM estimations. So for monarchies, as for Protestant countries, I boost Rater #1's factor by c1 percent and Rater #2's factor by c2 percent, with the c1-c2 pairs being, again, 30%-35%, 50%-55%, 70%-75%, 130%-135%, 150%-155%, 170%-175%, 190%-195%, and 230%-235%.

The third step is to use those simulated factors to generate simulated indicators (i.e., the variables we do observe in SEM estimation). I model them as follows:

indicator1 = 14.12 rater1 + 69.87 democracy + δ_1/m
indicator2 = 6.71 rater1 + 68.57 democracy + δ_2/m
indicator3 = 18.31 rater1 + 53.45 democracy + δ_3/m
indicator4 = 31.81 rater1 + 28.21 democracy + δ_4/m

10 Thus here the bias is multiplicative - unlike in the Marxism-Leninism case, where the bias is additive.

indicator5 = 95.69 rater1 + 40.63 democracy + δ_5/m
indicator6 = 38.13 rater1 + 97.69 democracy + δ_6/m
indicator7 = 70.70 rater2 + 21.17 democracy + δ_7/m
indicator8 = 31.51 rater2 + 51.63 democracy + δ_8/m
indicator9 = 90.83 rater2 + 26.09 democracy + δ_9/m
indicator10 = 12.99 rater2 + 55.01 democracy + δ_10/m
indicator11 = 53.06 rater2 + 63.13 democracy + δ_11/m
indicator12 = 52.19 rater2 + 15.67 democracy + δ_12/m

As we see, democracy loads on all twelve indicators; Rater #1 loads on the first six indicators; and Rater #2 loads on the last six. There is also a random measurement error, δ, specific to each indicator. In one third of the simulations the m parameter (which divides δ) is simply 1, so the error term does not suffer any transformation. In another third the m parameter is 0.001, so we can see what happens to the estimates when the random errors are magnified. And in the remaining third the m parameter is 1,000, so we can see what happens to the estimates when the random errors shrink. Each δ is a combination of a normal random variable and a beta random variable, as follows:

δ_1 = N(µ = 0, σ = 827) + Beta(α = 0.68, β = 0.78)
δ_2 = N(µ = 0, σ = 4188) + Beta(α = 0.10, β = 0.72)
δ_3 = N(µ = 0, σ = 228) + Beta(α = 0.58, β = 0.67)
δ_4 = N(µ = 0, σ = 3237) + Beta(α = 0.50, β = 0.60)
δ_5 = N(µ = 0, σ = 1965) + Beta(α = 0.06, β = 0.83)
δ_6 = N(µ = 0, σ = 734) + Beta(α = 0.15, β = 0.86)
δ_7 = N(µ = 0, σ = 1439) + Beta(α = 0.51, β = 0.46)
δ_8 = N(µ = 0, σ = 2983) + Beta(α = 0.23, β = 0.40)

δ_9 = N(µ = 0, σ = 1190) + Beta(α = 0.73, β = 0.21)
δ_10 = N(µ = 0, σ = 112) + Beta(α = 0.26, β = 0.63)
δ_11 = N(µ = 0, σ = 806) + Beta(α = 0.54, β = 0.37)
δ_12 = N(µ = 0, σ = 4299) + Beta(α = 0.13, β = 0.46)

These modeling choices need justification. The number of factors - three - is the minimum we need to be able to fix each country's level of democracy and to evaluate Bollen and Paxton's assertions about bias direction. The number of indicators (twelve) is somewhat arbitrary; it could have been eight or fourteen, for instance. What matters is that for each factor there are at least three or four indicators, so that there are enough data to estimate the model. The loading coefficients (14.12, 69.87, etc.) are entirely arbitrary, except that they are always positive; otherwise the direction of the bias would change between the factors and the indicators.11 12

11 E.g., if Rater #1 is biased in favor of Marxism-Leninism, a negative factor loading would make the corresponding indicator biased against Marxism-Leninism.

12 Initially I generated the random errors (the δs) as purely normal variables. But that resulted in excessive correlations between the errors, even when varying the standard deviations. That, in turn, resulted in highly correlated indicators, which leads to non-invertible matrices and makes SEM estimation impossible. That is why I add the beta component.
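To make the data-generating process above concrete, the following is a minimal sketch of one simulation draw in base R. It is an illustration, not the dissertation's simulation code: the Marxist-Leninist dummy is drawn at random here (the actual simulations use the real-world country-level variable), only the first two of the twelve indicators are written out, and the SEM and regression steps are only indicated in comments.

    set.seed(1)
    n <- 112                        # number of countries in the dataset

    # Step 1: simulate the three factors
    democracy <- runif(n, min = 0, max = 20)
    rater1 <- rnorm(n, mean = 5, sd = 5)
    rater2 <- rnorm(n, mean = 5, sd = 15)

    # Step 2: introduce additive bias (Marxism-Leninism case, p1 = 3, p2 = 0.025)
    marxist <- rbinom(n, 1, 0.15)   # placeholder dummy; real data used in the simulations
    rater1 <- rater1 + 3 * marxist
    rater2 <- rater2 + 0.025 * marxist

    # Step 3: generate the indicators (loadings and error parameters from the text)
    m <- 1                          # error scaling: 1, 0.001, or 1,000
    delta1 <- rnorm(n, 0, 827) + rbeta(n, 0.68, 0.78)
    delta2 <- rnorm(n, 0, 4188) + rbeta(n, 0.10, 0.72)
    indicator1 <- 14.12 * rater1 + 69.87 * democracy + delta1 / m
    indicator2 <- 6.71 * rater1 + 68.57 * democracy + delta2 / m
    # ... indicators 3 through 12 are built the same way ...

    # Step 4: estimate the SEM on the twelve indicators, extract the two sets of
    # bias factor scores, and regress each set on the country-level variables, e.g.:
    # summary(lm(rater1_scores ~ marxist + protestant + catholic + monarchy + ...))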

generate the twelve indicators, estimate the structural equations model, save the two sets of factor scores assumed to represent ideological bias (Rater #1 s and Rater #2 s), regress each set on country-level variables (same ones used to produce Table 2) 13, and check whether the outcome of interest (i.e., the outcome analogous to that of Table 2) obtains. I repeat this process 1,000 times for each of the four variables of interest and for each of the p1 -p2 pairs and c1 -c2 pairs discussed above. I also repeat the process 1,000 times using the original, unbiased simulated factors, just to have a baseline. Finally, I repeat the whole process for each of the three values of m discussed before (1, 0.001, and 1,000). Thus in total there are 108 different specifications with 1,000 repetitions each. 14 In SEM estimation identification is usually achieved by fixing some of the parameters. I do that by fixing the variances of the errors and the variances and covariances of the factors. Thus what changes from one simulation to the next are the estimated factor loadings, and consequently the factor scores and the coefficients obtained from regressing these factor scores on country-level variables. All 108,000 estimations converge, so I do not discard any of them (Paxton et al. 2001, 301-302). 15 3.3 Results The results are summarized on Tables 3 and 4 below. 13 These country-level variables are real-world data, not simulated data. 14 (4 variables of interest) (1 baseline + 8 values of p or c) (3 values of m) (1,000 repetitions) = 108,000 simulations 15 The estimations take about five hours to run on a CPU with 2.4GHz and 4GB of memory and using the sem package in R. In multi-core machines that time can be drastically reduced by parallelizing the simulations across the multiple cores (that requires the code to be rewritten though). 19

Table 3. Simulation results for Marxism-Leninism and Catholicism - frequency of misleading results (a)

                          Marxism-Leninism               Roman Catholicism
                       m=1    m=0.001  m=1,000        m=1    m=0.001  m=1,000
p1=20;  p2=0.025       189    212      216            816    821      833
p1=15;  p2=0.025       156    167      167            796    786      786
p1=10;  p2=0.025        94     86       98            605    569      573
p1=7;   p2=0.025        50     47       67            355    341      357
p1=5;   p2=0.025        15     31       31            207    211      250
p1=3;   p2=0.025        15     18       11            122    115      127
p1=1;   p2=0.025         9      3        7             68     76       73
p1=0.5; p2=0.025         6      4        2             57     47       68
no bias                  5      3        2             41     48       46

(a) For Marxism-Leninism the misleading result is a positive, statistically significant coefficient for Rater #1 combined with a negative, statistically significant coefficient for Rater #2. For Catholicism the misleading result is a positive, statistically significant coefficient for Rater #1 combined with a non-significant coefficient for Rater #2. Statistical significance is defined based on a p-value lower than 0.10.

Table 4. Simulation results for Protestantism and monarchy - frequency of misleading results (except for the "no bias" row, which shows correct results) (a)

                             Protestantism                  monarchy
                       m=1    m=0.001  m=1,000        m=1    m=0.001  m=1,000
no bias                791    816      816            794    795      791
c1=30%;  c2=35%        737    697      706            661    655      635
c1=50%;  c2=55%        617    587      589            523    503      532
c1=70%;  c2=75%        542    499      502            400    354      390
c1=130%; c2=135%       265    279      288            135    110      135
c1=150%; c2=155%       214    207      205            103     94       79
c1=170%; c2=175%       164    177      138             51     55       39
c1=190%; c2=195%       119    122      114             40     35       38
c1=230%; c2=235%        71     82       74             22     19        9

(a) For both variables the misleading result is a combination of non-significant coefficients for both Rater #1 and Rater #2. Statistical significance is defined based on a p-value lower than 0.10. Unlike the other rows, the "no bias" row does not show misleading results: it simply shows how often we obtain no evidence of bias when there is indeed no bias.

The results corroborate the suspicions raised before. For Roman Catholicism and Marxism-Leninism, if the two raters are biased in the same direction but to different degrees we often obtain the same results we saw in Table 2. If Rater #1's factor gets a 3-point boost when it comes to Roman Catholic countries and Rater #2's factor gets only a 0.025-point boost, we obtain misleading results 12.2% of the time (122 simulations out of 1,000). As the difference becomes larger, so does the frequency of misleading results: 20.7% if Rater #1's bonus is 5 points, and 81.6% if Rater #1's bonus is 20 points.

Granted, a bonus of 20 points is unrealistic. Rater #1's factor is generated as a normal distribution with mean 5 and standard deviation 5, which means that about 95% of it lies in the [-4.8, 14.8] interval. A bonus of 20 would thus imply a rather passionate rater - one that is willing to rate a North Korea as a Sweden merely on the grounds of that (hypothetical) North Korea being Roman Catholic.

But a bonus of 3 or 5 points is perfectly imaginable - and it is enough to yield misleading results too often (more than 10% and more than 20% of the time, respectively).

For Marxism-Leninism the difference in the magnitude of the bias must be somewhat extreme: a bonus of at least 15 points for Rater #1 and of only 0.025 points for Rater #2. Below 15 points we obtain misleading results less than 10% of the time. A bonus of 15 points sounds unrealistic given that Rater #1's factor follows a normal distribution with mean 5 and standard deviation 5.

For Protestantism we find that when the two hypothetical raters are biased in the same direction and to similar degrees, we are bound to find no bias whatsoever in our estimations. When the bias is in the vicinity of 30% we find non-significant coefficients 73.7% of the time - which would mislead the researcher into thinking that neither rater is biased. Even when the bias is so extreme as to be around 170% we still obtain non-significant coefficients 16.4% of the time.

For monarchy the bias only becomes visible when it reaches 170% or more. In other words, the bias must be 170% or higher for us to obtain misleading results less than 10% of the time.

It is clear, on the other hand, that we do obtain the correct result when none of the hypothetical raters is biased. Here the misleading result would be evidence of bias when there is in fact no bias. For Marxism-Leninism that happens less than 1% of the time. For Roman Catholicism that happens less than 5% of the time. For Protestantism that happens less than 10% of the time. And for monarchy that happens less than 3% of the time. That provides little solace, though - with real-world data we cannot know whether the lack of statistical significance means unbiasedness or whether it means that all raters are biased in similar ways.

3.4 Summary

What do all these results tell us about Bollen and Paxton's results? They tell us two things.

First, when Bollen and Paxton find bias, all we can assert is that at least one of the raters is biased, but we cannot know which one(s) or in which direction(s). Consider Protestantism, for instance. Bollen and Paxton claim that Banks is biased in favor of Protestant countries while Gastil and Sussman are unbiased (with 1980 data). But it may be the case that Gastil and Sussman are biased against Protestant countries while Banks is unbiased. Or perhaps all three are biased in favor of Protestant countries, only Banks more so than Gastil and Sussman. Or, still, perhaps all three are biased against Protestant countries, only Banks less so than Gastil and Sussman. Our simulations show that in any of these scenarios we might obtain the same results that Bollen and Paxton did.

Second, when Bollen and Paxton do not find bias, there is a good chance that there is bias. We know that because when our simulated raters are biased in the same direction and to similar degrees, our results often suggest no bias; depending on the magnitude of the biases, we get wrong results over 80% of the time. In other words, when all raters are biased in a similar way the bias becomes invisible.

These same warnings apply to other studies that claim to have uncovered bias in existing measures of democracy. Consider Steiner (2012), for instance. He regresses Freedom House data on other democracy measures and then checks for correlations between the residuals and a number of foreign policy indicators (voting behavior in the UN, alliances, rivalries, foreign assistance, and trade). He finds the expected correlations and concludes that the Freedom House "rates countries that have closer political ties and affinities with the U.S. [...] as more democratic" (4).

But what if the Freedom House is unbiased and all other measures are biased against US-friendly countries? Or what if all measures of democracy are biased against US-friendly countries, only the Freedom House less so than the others? In all these scenarios Steiner might observe exactly the same result, so his conclusions are completely unwarranted.16 Unless we have an unbiased measure of democracy, any statistical attempt to uncover the direction of ideological biases is futile.

4. Conclusion

We cannot uncover ideological bias a posteriori. All we can assert today is that at least some of our democracy measures are contaminated by ideological bias. We cannot say which ones, and we cannot say anything about the direction or magnitude of the biases (these are "known unknowns," to use Donald Rumsfeld's famous expression); and it is possible that other biases exist that we have not uncovered yet ("unknown unknowns"). In sum, we are in the dark.

Because we are in the dark, many democracy-related arguments rest on shaky ground. Are economic freedom and democracy associated, or is that apparent association an artifact of our democracy measures being biased in favor of economic freedom? Are parliamentary democracies more stable than presidential ones, or do our democracy measures favor parliamentary systems? If we do not know which democracy measures are biased, or in what ways, how can we address these questions?

The implication is that we need better democracy measures. In particular, we need a democracy measure whose data-generating process is transparent and replicable, so that at the very least we can know something about how exactly the measure is biased.

16 It is perfectly plausible that the Freedom House has an anti-market bias (and as a consequence perhaps an anti-US bias), since it includes "socioeconomic rights" and "freedom from gross socioeconomic inequalities" among its subcomponents (Munck and Verkuilen 2002, 9). Alternatively, it is perfectly plausible that Steiner's results simply reveal that countries with closer ties to the US are more democratic, for whatever reasons.

The crux of the matter is that all the indices we have today - be it the Freedom House, the CNTS, or the Polity (Marshall, Jaggers, and Gurr 2013) - rely on country experts checking boxes on questionnaires. We do not observe what boxes those experts check, or why. The process is opaque, which makes it easy for country experts to boost the scores of countries that adopt the "correct" policies. Coding rules help, but still leave too much open to interpretation. Consider this excerpt from the Polity IV handbook: "If the regime bans all major rival parties but allows minor political parties to operate, it is coded here. However, these parties must have some degree of autonomy from the ruling party/faction and must represent a moderate ideological/philosophical, although not political, challenge to the incumbent regime." (p. 73). How do we measure autonomy? Can we always observe it? What is "moderate"? Clearly it is not that hard to smuggle ideological contraband into democracy scores.

This is not to say that there have not been innovations in the field of democracy measurement. The Varieties of Democracy project, begun in 2010, is a group effort that seeks to build a new, fine-grained measure consisting of 33 subcomponents, each subdivided into dozens of more specific indicators.17 Pemstein, Meserve, and Melton (2010), in turn, treat democracy as a latent variable and use a multirater ordinal probit model to extract that latent variable from twelve different measures (including the Polity and the Freedom House); they call the resulting measure the Unified Democracy Scores (UDS).

These are interesting developments, but both fall short of addressing the issue of bias. The Varieties of Democracy project may give us fine-grained democracy indicators, but just like existing measures these indicators will be produced by country experts opaquely checking boxes in a questionnaire.

17 See Coppedge et al. (2001) and https://v-dem.net/

The UDS, in turn, do a great job at mitigating random error, but, as the authors themselves acknowledge, the UDS cannot fix systematic error. Hence neither initiative addresses the problem of bias.

One possible solution would be to embrace the text-as-data approach (see Grimmer and Stewart [2013] for an overview) - for instance, by using some automated algorithm to extract regime-related information from news articles. That would make the data-generating process transparent and replicable (and also much cheaper, as we would be dispensing with country experts).

Automated Democracy Scores

Abstract

In this paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). I produce the ADS using the well-known Wordscores algorithm (created by Laver, Benoit, and Garry [2003]) and 42 million news articles from 6,043 different sources. The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data. I create a website where anyone can replicate and tweak the data-generating process by changing the parameters of the underlying model (no coding required): www.democracy-scores.org.

1. Introduction

In this paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS).

The basic idea behind the ADS is simple. News articles on, say, North Korea or Cuba contain words like "censorship" and "repression" more often than news articles on Belgium or Australia. Hence news articles contain regime-related information (even if we disregard word order and treat each article as a "bag of words"). We can quantify that information to build a democracy index.

I produce the ADS using the Wordscores algorithm, developed in Laver, Benoit, and Garry (2003), and 42 million news articles from 6,043 different sources.
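To give a flavor of how Wordscores turns relative word frequencies into scores, here is a minimal sketch in base R. The word counts and reference scores below are toy numbers invented for illustration; the actual reference texts and scoring setup used for the ADS are described later in the paper.

    # Wordscores (Laver, Benoit, and Garry 2003), minimal sketch with toy data.
    # Rows = words, columns = reference texts with known a priori scores.
    ref_counts <- cbind(ref_dem = c(20, 2, 5),   # text from a democratic reference case
                        ref_aut = c(1, 30, 15))  # text from an autocratic reference case
    rownames(ref_counts) <- c("election", "censorship", "repression")
    ref_scores <- c(ref_dem = 1, ref_aut = -1)   # a priori scores of the reference texts

    # P(reference | word): normalize word frequencies within each text, then across texts
    rel_freq <- sweep(ref_counts, 2, colSums(ref_counts), "/")
    p_rw <- rel_freq / rowSums(rel_freq)

    # each word's score is the expected reference score given the word
    word_scores <- as.vector(p_rw %*% ref_scores)

    # a new ("virgin") text is scored by the frequency-weighted mean of its word scores
    virgin <- c(election = 3, censorship = 10, repression = 8)
    sum((virgin / sum(virgin)) * word_scores)    # about -0.35: closer to the autocratic pole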

The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data.

The next section explains why we need a new democracy index in the first place. The remaining sections explain the method in detail; show the results and how they compare to existing democracy data; and discuss some future extensions.

2. Why do we need yet another democracy index?

There are at least twelve democracy indices today (Pemstein, Meserve, and Melton 2010). They all draw to some extent from Dahl's (1972) conceptualization: democracy as a mixture of competition and participation. But they differ markedly in how they operationalize the concept - i.e., they differ in which empirical phenomena they pick as manifestations of democracy; in how they aggregate these different empirical phenomena to produce a democracy scale; and in whether they model democracy as a categorical or continuous variable (Munck and Verkuilen 2002). In light of such diversity, do we really need yet another democracy index? I argue that we do, for three reasons.

First, because the democracy indices we have today do not provide adequate measures of uncertainty. Without a good uncertainty measure we cannot know whether two countries are equally democratic or not, or whether a given country has become more (or less) democratic over time. That is, we cannot do descriptive inference.

Moreover, without a good uncertainty measure we cannot do causal inference when democracy is one of the regressors.

democracy appears as an explanatory variable in empirical work, there is "an (almost always ignored) errors-in-variables problem, potentially invalidating the substantive conclusions of these studies" (203). Yet the two most popular indices - the Polity (Marshall, Gurr, and Jaggers 2013) and the Freedom House (Freedom House 2013) - only give us point estimates, without any measure of uncertainty. That prevents us from knowing, say, whether Uruguay (Polity score = 10) is really more democratic than Argentina (Polity score = 8) or whether the uncertainty of the measurement process is sufficient to make them statistically indistinguishable.

Only two indices come with uncertainty measures: Treier and Jackman's (2008) and Pemstein, Meserve, and Melton's (2010). Treier and Jackman (2008) treat democracy as a latent variable and use an item-response model to extract it from Polity indicators. They provide both point estimates (the means of the marginal posterior distributions of the latent democracy variable) and confidence intervals (quantiles of the marginal posterior distributions).18 Pemstein, Meserve, and Melton (2010) also treat democracy as a latent variable. They use a multirater ordinal probit model to extract that latent variable from twelve different measures (including the Polity and the Freedom House). And, like Treier and Jackman (2008), they provide both point estimates (posterior means) and confidence intervals (posterior quantiles). They call their index Unified Democracy Scores (UDS). Both indices are big improvements over the Polity and Freedom House.

18 Armstrong (2011) does something similar, but using Freedom House indicators instead. His goal is different though - he extracts latent variables not to create a new democracy index but to investigate some properties of the Freedom House indicators. Also, he only reports the results for the 50 most populous countries, and the full set of results is not available online.

It is hard to understand, in particular, why the UDS have not become the default democracy index in political science: the UDS summarize almost all pre-existing indices, have broad time and country coverage, and are freely available online.19 Even if one is not interested in standard errors, why arbitrarily pick this or that individual democracy index when we can rely on the collective wisdom of all indices, condensed in the UDS? That political scientists continue to use the Polity and Freedom House is probably due to inertia.20

19 http://www.unified-democracy-scores.org/

20 A less noble possibility is cherry-picking: perhaps researchers try different indices and pick the one that yields the "correct" results.

If we do want standard errors though (and we should, for the sake of good descriptive and causal inference), we have a problem: both Treier and Jackman's (2008) index and the UDS offer standard errors that are too large to be useful. In Treier and Jackman's (2008) data 70 of the 153 countries are statistically indistinguishable from the United States (in the year 2000 - the only year they report). In the UDS data 70% of the countries are statistically indistinguishable from each other (in the year 2008 - the last year in the UDS dataset); pairs as diverse (regime-wise) as Denmark and Suriname, Poland and Mali, or New Zealand and Mexico have overlapping confidence intervals. In short, existing democracy indices either have no standard errors or have standard errors too large to be useful. That prevents us from doing descriptive inference and from knowing the effect of democracy on other variables.
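To see what is at stake, here is a minimal sketch (in Python, with made-up scores and standard errors - not actual UDS values) of the overlapping-confidence-intervals check that underlies the indistinguishability claims above:

    def confidence_interval(score, se, z=1.96):
        # 95% confidence interval under a normal approximation.
        return (score - z * se, score + z * se)

    def distinguishable(score_a, se_a, score_b, se_b):
        # Two cases are distinguishable if their 95% CIs do not overlap.
        lo_a, hi_a = confidence_interval(score_a, se_a)
        lo_b, hi_b = confidence_interval(score_b, se_b)
        return hi_a < lo_b or hi_b < lo_a

    # Hypothetical values: with large standard errors, very different
    # regimes become statistically indistinguishable.
    print(distinguishable(1.2, 0.40, 0.3, 0.50))   # False: intervals overlap
    print(distinguishable(1.2, 0.05, 0.3, 0.05))   # True: intervals are disjoint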

The second reason why we need a new democracy index is bias. Using structural equation modeling, Bollen and Paxton (2000) evaluate two popular sources of democracy data - the Freedom House and Arthur Banks' Cross-National Time Series Data Archive (Banks [1971], updated through 1988) - and show that they are contaminated by ideological bias. Raters have policy preferences and boost the democracy scores of countries that adopt the "correct" policies. Hence we cannot reliably use existing democracy measures to estimate the impact of democracy on policy (or vice-versa). For instance, how can we assess the impact of democracy on welfare spending when our measure of democracy is partly based on welfare spending? The democracy measures we have today make our empirical tests circular. When we regress welfare spending on democracy we usually assume that we are regressing y on x, but in reality we may be regressing y on z = f(x, y).

Bollen and Paxton (2000) only evaluate two democracy datasets, but all democracy data rely on country experts, and for any given country there are only so many experts. As Munck and Verkuilen (2002) warn, "for all the differences that go into the construction of these indices, they have relied, in some cases quite heavily, on the same sources and even the same precoded data" (29). Hence for all we know the Polity data are no less biased than the Freedom House data or the CNTS data. True, these democracy measures correlate highly, but that may merely indicate that they are all biased in similar ways (Munck and Verkuilen 2002). Unfortunately, whatever biases exist in those indices carry over to Treier and Jackman's (2008) index and to the UDS. The latent-variable approach mitigates coders' random errors, but not coders' systematic errors. As Pemstein, Meserve, and Melton (2010) put it, the UDS rely on the assumption that raters perceive democracy levels "in a noisy but unbiased fashion" (10).

The third reason why we need a new democracy index is replicability. Human-coded indices like the Polity and the Freedom House (and indices based on them, like Treier and Jackman's [2008] and the UDS) rely on country experts checking boxes on questionnaires. We cannot see what boxes they are checking, or why; all we observe are the final scores. The process is opaque and at odds with the increasingly demanding standards of openness and replicability of the field.

Clearly, any of these three reasons alone justifies the creation of a new democracy

index.

3. News articles

Our first task is to select the news articles to be used. Picking this or that news source - say, The New York Times or The Wall Street Journal - would not do. The reason is that there would not be enough text. Countries like the United States and Russia are in the news all the time, but countries like Uruguay and Cambodia only get occasional coverage, and countries like Tuvalu and Kiribati are almost never mentioned. A single newspaper or magazine, or even a handful thereof, would not provide the amount of text we need to produce reliable democracy scores for all 196 independent countries in the world. Hence I use a total of 6,043 news sources. These are all the news sources in English available on LexisNexis Academic, which is an online repository of journalistic content. The list includes American newspapers like The New York Times, USA Today, and The Washington Post; foreign newspapers like The Guardian and The Daily Telegraph; news agencies like Reuters, Agence France Presse (English edition), and Associated Press; and online sources like blogs and TV stations' websites.

I use LexisNexis's internal taxonomy to identify and select articles that contain regime-related news. In particular, I choose all articles with one or more of the following tags: "human rights violations" (a subtag of "crime, law enforcement and corrections"); "elections and politics" (a subtag of "government and public administration"); "human rights" (a subtag of "international relations and national security"); "human rights and civil liberties law" (a subtag of "law and legal system"); and "censorship" (a subtag of "society, social assistance and lifestyle").21

21 It would be interesting to know how the results change if we select different topic tags, but unfortunately that is no longer possible: on 12/23/2013 LexisNexis changed its user interface, and the dozens of political tags and subtags that existed before are now collapsed into a single "Government & Politics" tag, which is too broad for our purposes here.

LexisNexis's news database covers the period 1980-present,22 so in theory the ADS could cover that period as well. LexisNexis provides search codes for all countries that exist today - e.g., #GC508# for Afghanistan and #GC342# for Mexico. That way we can search for news articles on a specific country and be sure that all results will turn up - even the ones that do not mention the name of the country (for instance, many articles use the name of the country's capital when they mean the country's government - as in "Moscow retaliated by canceling the summit"). Unfortunately, however, LexisNexis provides no search codes for countries that no longer exist. To search for articles on the Soviet Union, for instance, we would need to search for the name(s) of the country (Soviet Union, USSR), its derivatives (Soviet), the name of the capital (Moscow), etc. - anything that might tell us that the article refers to the Soviet Union. Clearly that would not work. What if the article does not mention any of those terms? And what if the country has a name that is also a common noun (like Turkey)? We would have unreliable results. Thus we can only reliably search the 1992-2012 period. Other than Yugoslavia, no country has ceased to exist since 1992, so we have search codes for basically everything. (Naturally, many countries were created in that period, but that is not a problem - the dataset will simply start in 2008 for Kosovo, in 2002 for East Timor, and so on.)

22 Actual coverage varies by news source.

That selection - i.e., regime-related news, all countries that exist today, 1992-2012 - results in a total of about 42 million articles (around 4 billion words total), which I then organize by country-year.23

23 A small proportion of the articles (about 0.05%) is left out. When a search produces more than 3,000 results LexisNexis only returns the first 1,000. Whenever possible I overcome that problem by searching for smaller periods of time. But in a few cases even searches for a single day produce more than 3,000 results. I could not figure out what criteria LexisNexis uses to select the 1,000 results it returns (I asked them by email but they never replied). Thus to avoid any selection biases I just leave all results out in those cases. (The cases are: Pakistan 5/2/2011; Pakistan 5/3/2011; Afghanistan 10/8/2001; Afghanistan 10/9/2001; United Kingdom 7/8/2005; and United States, several dates between 2003 and 2012.) I realize that doing this may introduce selection biases of its own, but at least I am creating the selection biases myself, whereas I have no idea how LexisNexis selects those 1,000 results. In any case, it is doubtful that excluding 0.05% of the news articles will have any noticeable impact on the results.
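The workaround described in footnote 23 can be sketched as a recursive date-range split (the search function below is a hypothetical stand-in - LexisNexis exposes no such public API, and in practice the queries went through its web interface):

    from datetime import date, timedelta

    def fetch_articles(country_code, start, end, search):
        # 'search' is hypothetical: it returns (hit_count, articles) for a
        # country search code (e.g., #GC342#) and a date range.
        hits, articles = search(country_code, start, end)
        if hits <= 3000:                 # under the cap: all results returned
            return articles
        if start == end:
            # Even a single day exceeds the cap; drop these results entirely,
            # as footnote 23 explains, rather than accept an unknown selection.
            return []
        mid = start + timedelta(days=(end - start).days // 2)
        return (fetch_articles(country_code, start, mid, search) +
                fetch_articles(country_code, mid + timedelta(days=1), end, search))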

To help reduce spurious associations I remove proper nouns24 (in a probabilistic way).25 For each country-year I merge all the corresponding news articles into a single document and transform the document into a term-frequency vector - i.e., a vector that contains the absolute frequency of each word. I then merge all term-frequency vectors into one big term-frequency matrix. Rows represent words and columns represent country-years, so each cell gives us the absolute frequency of a given word in the news articles corresponding to a given country-year.

24 As we will see later, the scoring algorithm works by associating certain words with certain qualities. But we want those words to refer to general phenomena like torture, repression, and censorship, not to specific people or places. We do not want, for instance, "Washington" being associated with high levels of democracy just because the word appears frequently in news stories featuring a highly democratic country. Removing proper nouns helps avoid that.

25 I cannot possibly read 42 million articles. And we cannot simply remove all capitalized words, as that would eliminate the first word of every sentence, even if it is not a proper noun. Hence I apply the following rule: if all occurrences of the word are capitalized, then it is probably a proper noun and therefore it is removed. (For each country-year I merge all the corresponding news articles into a single document and process each document in chunks of 10MB - to reduce memory usage -, so the check is restricted to the same 10MB chunk.)
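To make this step concrete, here is a minimal sketch in Python (toy data; file handling, the 10MB chunking, and the capitalization-based proper-noun filter from footnote 25 are omitted for brevity):

    from collections import Counter

    def term_frequency_vector(articles):
        # Merge all articles for one country-year and count absolute word
        # frequencies - a bag-of-words representation (word order is discarded).
        counts = Counter()
        for article in articles:
            counts.update(article.lower().split())
        return counts

    # Hypothetical toy corpus keyed by country-year.
    corpus = {
        ("North Korea", 2012): ["censorship and repression continued ..."],
        ("Belgium", 2012): ["parliament debated the coalition agreement ..."],
    }

    # One term-frequency vector per country-year; together they form the
    # term-frequency matrix (rows = words, columns = country-years).
    tf_matrix = {cy: term_frequency_vector(arts) for cy, arts in corpus.items()}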

4. Algorithm

There are several automated ways to extract data from text (see Grimmer and Stewart [2013] for an overview). The particular method I use is the Wordscores algorithm, created by Laver, Benoit, and Garry (2003) - henceforth LBG -, from which this section draws heavily. The next paragraphs explain the algorithm in detail, but here is the gist of it: we manually score some documents - called reference documents or training documents (Manning, Raghavan, and Schütze 2008); the algorithm then learns from the reference documents and uses that knowledge to score all other documents - called virgin documents. So far Wordscores has only been used to measure party ideology (from party manifestos and legislative speeches). To the best of my knowledge, this is the first time Wordscores - or any other method of automated text analysis - is used to measure democracy.

The first step is to select the reference cases. In other words, we need to pick some of the 4,067 country-years we have here to serve as the baseline from which the machine will learn. Ideally the reference set should span the entire regime scale. If we only feed the algorithm, say, highly democratic cases, then the machine will not learn what words are associated with middle-of-the-road cases or authoritarian cases. To ensure that the reference set is broad enough I pick all country-years from 1992, the first year for which we have news articles (see previous section). Thus the ADS only cover the 1993-2012 period even though we have news articles from 1992 as well. The vast majority of multivariate analyses that use some measure of democracy (Polity, Freedom House, etc.) use fairly recent data, rarely going farther back in time than the 1970s, so the ADS should serve most applied research well.

The second step is to give each reference case a score. For us that means assigning a democracy score to each country-year from 1992. I follow LBG and extract these reference scores from an existing index.26 In particular, I choose Pemstein, Meserve, and Melton's (2010) UDS, which I mentioned before. The UDS have data on 184 countries for the year 1992. Hence we have 184 reference documents and 3,883 (4,067 - 184) virgin documents.

26 Just to be clear, LBG were measuring party ideology, not democracy, so obviously the indices they use have nothing to do with the one I use here.

The third step is to compute the word scores.

Let $F_{wr}$ be the relative frequency of word $w$ in reference document $r$. The probability that we are reading document $r$ given that we see word $w$ is then $P(r \mid w) = F_{wr} / \sum_{r} F_{wr}$. We let $A_r$ be the a priori position of reference document $r$ and compute each word score as $S_w = \sum_{r} P(r \mid w) A_r$.

The fourth step is to use the word scores to compute the scores of the remaining documents - the virgin documents. Let $F_{wv}$ be the relative frequency of word $w$ in virgin document $v$. The score of virgin document $v$ is then $S_v = \sum_{w} F_{wv} S_w$.

Intuitively, the algorithm uses the training documents to learn how word usage differs across the reference cases - for instance, it learns that the word "censorship" is more frequent the lower the democracy score of the document. The algorithm then uses that information to produce word scores (hence the name of the method), and later uses the word scores to score the virgin documents.

A concrete example may help. Suppose that we choose North Korea 2012 and Belgium 2012 as our reference cases and assign them democracy scores of 0 and 10 respectively. We merge all news articles on North Korea in 2012 into a single document and merge all news articles on Belgium in 2012 into another document. Suppose now that the word "censorship" accounts for 15% of all the words in the North Korea document and for 1% of the words in the Belgium document. If we see the word "censorship", the probability that we are reading the North Korea document is $0.15/(0.15 + 0.01) = 0.9375$ and the probability that we are reading the Belgium document is $0.01/(0.15 + 0.01) = 0.0625$. The score of the word "censorship" is thus $(0.9375 \times 0) + (0.0625 \times 10) = 0.625$. To score a virgin document we simply multiply each word score by its relative frequency and sum across.

The fifth step is the computation of uncertainty measures for the point estimates. LBG propose the following measure of uncertainty: $\sqrt{V_v}/\sqrt{N_v}$, where $V_v = \sum_{w} F_{wv}(S_w - S_v)^2$ and $N_v$ is the total number of virgin words. The $V_v$ term captures the dispersion of the word scores around the score of the document. Its square root divided by the square root of $N_v$ gives us a standard error, which we can use to assess whether two cases are statistically different from each other.
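A minimal sketch of steps three to five, in Python, continuing the toy data from above (the real pipeline processes the 200GB matrix in chunks; names and data here are illustrative only):

    from math import sqrt

    def word_scores(ref_freqs, ref_positions):
        # Step 3. ref_freqs: {document: {word: relative frequency F_wr}};
        # ref_positions: {document: a priori score A_r}.
        vocabulary = {w for freqs in ref_freqs.values() for w in freqs}
        scores = {}
        for w in vocabulary:
            total = sum(freqs.get(w, 0.0) for freqs in ref_freqs.values())
            # S_w = sum over r of P(r|w) * A_r, with P(r|w) = F_wr / sum_r F_wr
            scores[w] = sum(freqs.get(w, 0.0) / total * ref_positions[doc]
                            for doc, freqs in ref_freqs.items())
        return scores

    def score_virgin(virgin_freqs, scores, n_words):
        # Steps 4-5. virgin_freqs: {word: relative frequency F_wv}, assumed
        # here to be computed over words shared with the reference vocabulary;
        # n_words is the total number of virgin words (N_v).
        shared = [w for w in virgin_freqs if w in scores]
        s_v = sum(virgin_freqs[w] * scores[w] for w in shared)            # S_v
        v_v = sum(virgin_freqs[w] * (scores[w] - s_v) ** 2 for w in shared)
        return s_v, sqrt(v_v) / sqrt(n_words)                  # (S_v, std. error)

    # Toy data mirroring the North Korea / Belgium example above.
    refs = {"NK2012": {"censorship": 0.15, "election": 0.85},
            "BE2012": {"censorship": 0.01, "election": 0.99}}
    sw = word_scores(refs, {"NK2012": 0.0, "BE2012": 10.0})
    print(round(sw["censorship"], 4))                # 0.625, as in the text
    print(score_virgin({"censorship": 0.05, "election": 0.95}, sw, 10000))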

The sixth and final step is the re-scaling of the virgin scores. In any given text the most frequent words are "the", "of", "and", etc., which are usually of no interest. Because these words have similar relative frequencies across all reference texts they will have centrist scores. For instance, if "the" accounts for 10% of our (hypothetical) North Korea document (whose manually assigned score is 0) and for 10% of the (also hypothetical) Belgium document (whose manually assigned score is 10), the score of "the" will be 5, exactly in the middle of the scale. That makes the scores of the virgin documents bunch together around the middle of the scale; their dispersion is just not in the same metric as that of the reference texts. In LBG's estimations of party ideology in Britain, the scores of the reference documents range from 8.21 to 17.21, but the scores of the virgin documents range from 10.21 to 10.73. That is not a problem per se, as the scores of the virgin documents are perfectly comparable to each other. But they are not comparable to the scores of the reference documents, whose dispersion is higher, and that may be a problem depending on the intended goals.27 To correct for the bunching of virgin scores, LBG propose re-scaling them as follows: $S_v^* = (S_v - \bar{S}_v)(\sigma_r/\sigma_v) + \bar{S}_v$, where $S_v$ is the raw score of virgin document $v$, $\bar{S}_v$ is the average raw score of all virgin texts, $\sigma_r$ is the standard deviation of the reference scores, and $\sigma_v$ is the standard deviation of the virgin scores. This transformation expands the raw virgin scores by making them have the same standard deviation as the reference scores.

27 We could of course remove irrelevant words, but identifying relevant and irrelevant words is not always so clear-cut. For instance, as I mention later, Monroe, Colaresi, and Quinn (2008) find several non-obvious partisan words - like "baby" (Republican) and "bankruptcy" (Democrat) - in their analysis of legislative speeches in the US Senate. Thus if we exclude words a priori we risk throwing away important information. Moreover, removing any words would require knowledge of the language in which the text is written. That would defeat one of the biggest advantages of the method: the fact that it is language-blind (all we need to know are the positions of the reference documents).
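A minimal sketch of the re-scaling formula (toy scores, chosen only to show the bunching being undone):

    from statistics import mean, pstdev

    def rescale(virgin_raw, reference_scores):
        # LBG re-scaling: S*_v = (S_v - mean(S_v)) * (sigma_r / sigma_v) + mean(S_v),
        # i.e., give the virgin scores the same standard deviation as the
        # reference scores while preserving their mean.
        m = mean(virgin_raw)
        ratio = pstdev(reference_scores) / pstdev(virgin_raw)
        return [(s - m) * ratio + m for s in virgin_raw]

    # Toy illustration: bunched raw virgin scores get expanded.
    print(rescale([10.21, 10.40, 10.73], [8.21, 12.50, 17.21]))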

Martin and Vanberg (2008) propose an alternative re-scaling formula, but Benoit and Laver (2008) show that the original formula is more appropriate when there are many virgin cases and few reference cases, which is the case here.

The final output is a dataset comprising all independent countries from 1993 to 2012, which makes for a total of 3,883 country-years. For each country-year three statistics are provided: the ADS point estimate, the ADS 95% lower bound, and the ADS 95% upper bound. Initially I considered having not only a democracy scale but also subcomponents, à la Polity. But of the 805 JSTOR-indexed articles that cite the Polity data over the last ten years, only a handful mention (and even fewer use) the Polity subcomponents (Pemstein, Meserve, and Melton [2010] also note this point). Hence there is simply not enough demand to justify breaking down the ADS into more specific items.28 And, rich in regime-related information as our news articles may be, they nonetheless become progressively less informative as we move from democracy down to, say, turnover percentage in the legislature. The more specific we get, the higher the noise-to-signal ratio.

28 Thus the ADS are bound to displease, for instance, Coppedge et al. (2011), who call for "thicker" measures of democracy.

Wordscores is the best-known text-scaling method in political science. It has been subject to extensive scrutiny over the years and generally found to perform well, as long as the texts are not too short29 and share enough vocabulary.30 Klemmensen, Hobolt, and Hansen (2007), for instance, use Wordscores to measure party ideology, with Danish manifestos and speeches, and find that the method yields scores that correlate highly with those produced independently by human coders. Beauchamp (2010) applies Wordscores to US Senate speeches and, like Klemmensen, Hobolt, and Hansen (2007), also finds that the estimates correlate highly with human-coded ones.

29 If the texts are too short then there is simply not enough data to produce meaningful results. How short is too short is unclear though: all else equal more is better, but a 5,000-word text may contain more informative words than a 10,000-word text.

30 In the extreme case where the vocabulary of the reference texts and the vocabulary of the virgin texts are disjoint, we cannot produce any estimates at all.

Lowe (2008) notes that Wordscores lacks an explicit model for the data-generating process of the word frequencies. But he argues that, as long as the word frequencies follow an ideal point structure,31 Wordscores should produce good estimates - and he notes that "the empirical success of the method suggests that these assumptions may be reasonable" (370).

31 Lowe (2008) proposes that we interpret Wordscores as an approximation to correspondence analysis - which relies on the assumption of ideal point structure.

5. Advantages over existing measures

Small standard errors

As shown above, with Wordscores the total number of virgin words goes in the denominator of the formula of the standard errors. Hence the more text we have, the smaller the standard errors will be. Here we have 42 million news articles, so we should have standard errors small enough to distinguish even between very similar cases - say, between Sweden and Norway. As we will see later, that is indeed what happens. The ADS are the first democracy index whose uncertainty measure captures such fine-grained distinctions.
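For intuition, a quick numerical illustration (made-up numbers): since the standard error is $\sqrt{V_v}/\sqrt{N_v}$, quadrupling the number of virgin words halves the standard error.

    from math import sqrt

    v_v = 2.5                              # hypothetical dispersion term V_v
    for n_v in (1_000_000, 4_000_000):     # hypothetical virgin-word counts
        print(n_v, sqrt(v_v) / sqrt(n_v))  # the second SE is half the first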

Less ideological bias

The ADS are not immune to contamination by ideological bias. First, the journalists and editors behind news articles have their own policy preferences. And second, at least in the case of supervised learning algorithms (like Wordscores), someone must choose and score the reference cases. But the scope for manipulation is more restricted in the ADS. Journalists and editors have their policy preferences, but there is a lot more ideological diversity among journalists (contrast The New York Times and The Wall Street Journal, for instance) than among political scientists, the vast majority of whom are somewhere on the left of the ideological spectrum (Klein and Stern 2005; Maranto, Hess, and Redding 2009; Maranto and Woessner 2012). Combining 6,043 different news sources, as we do here, surely goes a long way toward mitigating ideological bias. The reference scores do offer a backdoor for manipulation but, unlike the anonymous country experts who fill out the Polity and Freedom House questionnaires, the researcher who assigns reference scores does so in the open and thus bears reputational costs in case of mischief. The transparency of the process creates an incentive structure that rewards honesty. As Schedler (2012) puts it, "the key to accountable expert measurement [...] is publicity. Rather than treating experts the same way as we treat survey subjects, whom we grant full anonymity, experts need to assume public responsibility for their measurement decisions."

True, we are using the UDS for the reference scores, and the UDS themselves must be contaminated by ideological bias, as discussed before. But the ADS do not inherit that bias. We are using regime-related news, so the vast majority of policy

and economic discussions are left out. Hence the biases of the UDS become, by and large, random noise in the ADS. Intuitively, imagine that the UDS are biased in favor of countries with generous welfare spending, like Sweden. The UDS of these countries will be boosted somewhat. But to the extent that the news articles we selected are focused on the political regime and not on welfare policy, the algorithm will not associate those boosted scores with welfare-related words and hence the word scores will not be biased. They will be less efficient, as (ideally) no particular words will be associated with those boosted scores, but that is it.

Replicability

The process behind the ADS is fully transparent. All the choices (reference cases and scores) are visible to the public and every part of the process can be replicated exactly. Anyone with access to LexisNexis can download the same articles, apply the same algorithm, and verify the results. There are practical obstacles though: downloading 42 million articles is time-consuming and the computations require powerful machines and non-trivial programming.32 Therefore I created a website that facilitates the process: www.democracy-scores.org. No coding is required: there is a table with empty cells corresponding to each country-year between 1992 and 2012 and you simply enter the scores for the reference cases you choose. The results are sent by email. That way anyone can change the reference set and produce their own ADS, regardless of computational resources or programming skills.33

32 Wordscores has long been implemented in Stata and R, but these implementations load all the data into memory at once. That would not work here, as there are 200GB of data, so I had to write my own implementation of Wordscores (in Python). That implementation splits the data into chunks and processes each chunk individually, which reduces memory requirements (though not to the point where the script could be run on personal computers - there is a trade-off between memory requirements and speed).

33 The operation uses Amazon Web Services and, to keep costs down, for now I need to pre-authorize the user's email address. I intend to obtain funding and lift that restriction in the future.

6. Justifying some choices

Why not use unsupervised learning instead?

Wordscores is a type of supervised learning algorithm, by which I mean that the machine learns from an initial human input (the reference cases and their scores). But there are also unsupervised learning algorithms, which do not require an initial input. In these, the machine learns by itself not only how to measure but also what to measure. (See Manning, Raghavan, and Schütze [2008] for an introduction to both supervised and unsupervised learning in the context of text analysis.) In political science, a concrete example of an unsupervised learning algorithm is the one developed by Slapin and Proksch (2008), popularly known as Wordfish. Like Wordscores, Wordfish is most commonly used to measure party ideology, using party manifestos or legislative speeches. The Wordfish method does not require the user to specify or score any reference texts. It will create a scale based on whatever underlying dimension has the most impact on word frequencies.34 If we are talking about party manifestos, that dimension may be, say, the left-right dimension. But it may not. And if the scores turn out to be capturing something else there is no way to fix that; it may be hard to even know what is being captured.

34 Slapin and Proksch explicitly model the data-generating process (DGP) behind the word frequencies. That DGP is assumed to follow a Poisson distribution (hence the name of the method): $y_{ijt} \sim \mathrm{Poisson}(\lambda_{ijt})$, where $y_{ijt}$ is the frequency of word $j$ in document $i$ at time $t$. The parameter $\lambda_{ijt}$ is modeled as $\lambda_{ijt} = \exp(\alpha_{it} + \psi_j + \beta_j \omega_{it})$, where $\alpha_{it}$ is the fixed effect of document $i$ at time $t$, $\psi_j$ is the fixed effect of word $j$, $\beta_j$ captures the relevance of word $j$ in capturing the underlying concept (say, party ideology), and $\omega_{it}$ is the estimated position of party $i$ at time $t$. The model is estimated using an expectation-maximization (EM) algorithm (see McLachlan and Krishnan [2007] for details on EM).
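To make footnote 34 concrete, here is a minimal simulation of the Wordfish data-generating process for a single time period (all parameter values below are made up, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = np.array([0.1, 0.0, -0.2])       # document fixed effects (alpha_i)
    psi = np.array([2.0, 1.0, 0.5, 1.5])     # word fixed effects (psi_j)
    beta = np.array([0.8, -0.6, 0.1, 1.2])   # word weights (beta_j)
    omega = np.array([-1.0, 0.0, 1.0])       # latent document positions (omega_i)

    # lambda_ij = exp(alpha_i + psi_j + beta_j * omega_i)
    lam = np.exp(alpha[:, None] + psi[None, :] + beta[None, :] * omega[:, None])
    counts = rng.poisson(lam)                # y_ij ~ Poisson(lambda_ij)
    print(counts)                            # simulated word-frequency matrix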

Supervised learning, on the other hand, allows us to calibrate the scale by explicitly showing the machine what a democratic country looks like or what a left-wing party looks like (depending on what we are trying to measure). That way we can have greater confidence in the construct validity of the resulting measure.35

35 That said, Wordscores and Wordfish are not antithetical methods. Lowe (2008) and Benoit and Nulty (2013) argue that Wordscores is also model-based in a sense, only the model is implicit.

Why not use event data instead?

An alternative approach would be to machine-code democracy based not on words but on events. Applications like Knowledge Manager36 and TABARI (Schrodt 2001) can use dictionaries of actors and verbs to extract meaning from sentences. For instance, TABARI can correctly classify the sentence "North Korean state media have called on the United States to forge ties of confidence with Pyongyang" into the category "Appeal for diplomatic cooperation" (category #022 of the Conflict and Mediation Event Observations codebook). There are voluminous event data available for free37 and King and Lowe (2003) show that in some cases automated event coding can be as accurate as human coding. So why not use event data to produce the ADS?

36 http://vranet.com/

37 Most notably the Global Data on Events, Location, and Tone (GDELT), which contains over 200 million geolocated events from 1979 to 2012. See Leetaru and Schrodt (2013).

The reason is that although the coding itself is automated, it relies on dictionaries of actors and verbs that are produced manually, entry by entry. In other words, we must know the relevant actors and verbs a priori. With Wordscores, however, we let the data speak. As Hopkins and King (2007) note, automated text analysis allows us

to discover relevant features a posteriori. For instance, Monroe, Colaresi, and Quinn (2008) find several non-obvious partisan words, like "baby" and "bankruptcy", which a hand-coded dictionary might have missed.

As a consequence, event data can be of limited usefulness. Consider, for instance, the latest version of the World Handbook of Politics (WHP), machine-coded by the Knowledge Manager application.38 It reports three recent coups in Canada (one in 1996, one in 1998, and one in 1999), 15 recent coups in the US (three of which took place in 1994 alone), and none in 2002 Venezuela (even though there was one).39 Similarly nonsensical statistics are reported for other political indicators, such as censorship measures, curfews, and political arrests. That is not a very promising output, especially given the time and effort put into the creation of event data dictionaries (around 4,000 hours each).40 Hence I chose not to work with event data, at least for now.

38 The WHP can be downloaded from https://sociology.osu.edu/worldhandbook

39 I checked the WHP definition of coup, to make sure it is not peculiar, but that does not seem to explain the nonsensical results (the WHP defines a coup as an "irregular seizure of executive power, and rebellion by armed forces").

40 http://eventdata.psu.edu/faq.html

7. Overview of results

The full 1993-2012 dataset is available for download.41 Figure 4 below gives an idea of the ADS distribution in 2012.

41 https://s3.amazonaws.com/thiagomarzagao/ads.csv

[Figure 4. Automated Democracy Scores, 2012. Note: range limits are Jenks natural breaks.]

As expected, democracy is highest in Western Europe and in the developed portion of the English-speaking world, and lowest in Africa and in the Middle East.

Figure 5 below shows that the ADS approximately follow a normal distribution.

[Figure 5. Automated Democracy Scores, 1993-2012 (with normal distribution).]