CROWD-SOURCED CODING OF POLITICAL TEXTS *


Kenneth Benoit, London School of Economics and Trinity College, Dublin
Benjamin E. Lauderdale, London School of Economics
Drew Conway, New York University
Michael Laver, New York University
Slava Mikhaylov, University College London

May 21, 2014

Abstract

Empirical social science often relies on data that are not observed in the field, but are coded into quantitative variables by expert researchers who base their codes on qualitative raw sources. Using crowd-sourcing to distribute text coding to massive numbers of non-experts, we generate results comparable to those from expert coding, but far more quickly and flexibly. Crucially, the data we collect can be reproduced or extended cheaply and transparently, making crowd-sourced datasets intrinsically replicable. This focuses researchers' attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. The findings reported here concern text coding, but have general implications for expert-coded data in the social sciences.

11,687 word count after the front page.

* An earlier draft of this paper, with much less complete data, was presented at the third annual Analyzing Text as Data conference at Harvard University, 5-6 October. A very preliminary version was presented at the 70th annual Conference of the Midwest Political Science Association, Chicago, April. We thank Joseph Childress and other members of the technical support team at CrowdFlower for assisting with the setup of the crowd-sourcing platform. We are grateful to Neal Beck and Joshua Tucker for comments on an earlier draft of this paper. This research was funded by the European Research Council grant ERC-2011-StG QUANTESS.

INTRODUCTION

Many widely used quantitative datasets in political science do not derive from direct measurement in the field, but from researchers sitting at their desks, extracting information from sources that consist largely of written text. Researchers use their expert judgments to turn qualitative raw information into measures of quantities that cannot be observed directly. We think of this process as data coding and of these data as coded data. Widely used examples include [1]: the Polity dataset, rating countries on a scale ranging from -10 (hereditary monarchy) to +10 (consolidated democracy) [2]; the Comparative Parliamentary Democracy data, with indicators of the number of inconclusive bargaining rounds in government formation and conflictual government terminations [3]; the Comparative Manifesto Project (CMP), with coded summaries of party manifestos, notably a widely used left-right score [4]; and the Policy Agendas Project, which codes text from laws, court decisions, and political speeches into topics and subtopics (Jones 2013).

[1] Other examples of coded data include: expert judgments on party policy positions (Benoit and Laver 2006; Hooghe et al. 2010; Laver and Hunt 1992); democracy scores from Freedom House and corruption rankings from Transparency International.

All data creation confronts fundamental issues of reliability and validity, which for coded data particularly depend on the expertise and professionalism of the coders. This is bolstered by a tacit agreement in the profession that, if prominent scholars use some coded data and top journals publish their results, the quality of these data has been professionally vetted. Given the huge cost and complexity of data generation projects that may span decades, many canonical datasets are unlikely in practice to be re-collected and in this sense replicated. Despite low levels of inter-coder reliability found using the CMP scheme (Mikhaylov et al. 2012), for example, a proposal to recode each manifesto multiple times, using independent coders, would be a financial non-starter. While the CMP data could in theory be recoded many times, in practice this is not going to happen.

Crowd-sourced data coding offers the potential to transform this intellectual landscape. Rather than using a few experts to craft a few canonical datasets at great expense, we can mass-produce custom datasets using carefully monitored but less-skilled workers in the crowd. The coding process must be much more precisely specified for crowd coders than for experts, making crowd coding more replicable by independent researchers, as well as more scalable.

In what follows, we first review the theory and practice of crowd-sourcing. We then evaluate crowd-sourced data coding using a coding experiment. This analyses and compares results of expert and crowd-sourced codings of the same documents (party manifestos), using the same coding scheme, then compares both sets of results with independent expert survey estimates of precisely the same quantities. In order to analyze our database of crowd-coded manifesto sentences, we specify a method for aggregating codings of sentences (of varying complexity) by coders (of varying quality) into scores for latent variables of interest. To demonstrate the potential of crowd-sourcing to investigate new questions for which no coded text data exist, we estimate positions of British parties on immigration during the 2010 election, then show that this process can be easily and cheaply replicated by repeating the entire coding exercise two months later, with essentially the same results. We conclude with a discussion of the more general applicability of crowd-sourced data coding in the social sciences.

HARVESTING THE WISDOM OF CROWDS

Coined by Jeff Howe in a Wired magazine article (Howe 2006), the term crowd-sourcing now implies using the Internet to distribute a large package of small specific tasks to a large number of anonymous exchangeable workers, located around the world and offered small financial rewards per task. The idea can be traced to Aristotle (Lyon and Pacuit 2013) and later to Galton (1907), who noticed that the average of a large number of individual judgments by fair-goers of the weight of an ox is close to the true answer and, importantly, closer to this than is any individual judgment (for a general introduction see Surowiecki 2004). Now widely used for data-processing tasks such as image classification, video annotation, data entry, optical character recognition, translation, recommendation, and proofreading, crowd-sourcing has emerged as a paradigm for applying human intelligence to problem-solving on a massive scale.

Increasingly, crowd-sourcing has also become a tool for social scientific research (Bohannon 2011), though most applications draw on crowds as a cheap alternative to traditional experimental studies (e.g. Lawson et al. 2010; Horton et al. 2011; Paolacci et al. 2010; Mason and Suri 2012). Using crowd-sourced respondents to replace traditional experimental or survey panels, of course, raises questions about the external validity of the resulting data. Recent studies in political science (Berinsky et al. 2012), economics (Horton et al. 2011) and general decision theory and behavior (Paolacci et al. 2010; Goodman et al. 2013; Chandler et al. 2014) report differences in the demographics of crowd-sourced and traditional subjects, but find high levels of external validity for crowd-sourcing methods.

Our approach to using crowd workers to code external data differs fundamentally from such applications because we do not care at all about whether crowd coders represent any target population. Our coders can be completely unrepresentative as long as quite different coders, on average, tend to make the same coding decisions when faced with the same information. In this sense data coding, as opposed to online experiments, represents a canonical use of crowd-sourcing as described by Galton. [5]

[5] We are interested in the weight of the ox, not in how different people judge the weight of the ox.

All human generation of data requires some expertise, and several empirical studies have found that data generated by domain experts can be matched, and sometimes improved upon, at much lower cost by aggregating judgments of non-experts (Alonso and Mizzaro 2009; Hsueh et al. 2009; Snow et al. 2008; Alonso and Baeza-Yates 2011; Carpenter 2008; Ipeirotis et al. 2013). Provided crowd coders are not systematically biased in relation to the true value of the latent quantity of interest, the central tendency of even erratic coders will converge on this true value as the number of coders increases. Because experts are axiomatically in short supply while members of the crowd are not, crowd-sourced solutions also offer a straightforward and scalable method for addressing reliability in a manner that expert coding cannot: to improve confidence, we simply order more crowd-sourced codes. Access to a large pool of crowd coders lacking high expertise also tends to mitigate bias. Because coding is broken down into many small specific tasks, each performed by many different exchangeable coders, it tends to wash out biases that might affect a single coder. The replication potential of crowd coding also makes it possible to test for the presence of bias.

Because crowd-sourced data coding involves breaking down one large project into many small tasks for distribution to crowd workers, we need a method to aggregate the many small results back up into valid measures of our quantities of interest. [6] While some researchers have specified complex calibration models to correct for coder mistakes on particular difficult tasks, the single most important lesson from this work is that increasing the number of coders reduces error (Snow et al. 2008). Addressing statistical issues of redundant coding, Sheng et al. (2008) and Ipeirotis et al. (2010) show that repeated coding can improve the quality of data as a function of the individual qualities and number of coders, particularly when coders are imperfect and coding categories (labels) are noisy. Ideally, we would benchmark crowd coders against a gold standard, but such benchmarks are not always available, so scholars have turned to Bayesian scaling models borrowed from item-response theory (IRT) to classify codes while simultaneously assessing coder quality (e.g. Carpenter 2008; Raykar et al. 2010). Welinder and Perona (2010) develop a classifier that integrates data difficulty and coder characteristics, while Welinder et al. (2010) develop a unifying model of the characteristics of both data and coders, such as competence, expertise and bias. A similar approach is applied to rater evaluation in Cao et al. (2010) where, using a Bayesian hierarchical model, raters' codes are modeled as a function of a latent item trait and of rater characteristics such as bias, discrimination, and measurement error. We build on this work below, applying both a simple averaging method and a Bayesian scaling model that estimates latent policy positions while generating diagnostics on coder quality and sentence difficulty parameters. We find that estimates generated by this more complex approach match simple averaging very closely.

[6] Of course aggregation issues are no less important when combining any multiple judgments, including those of experts. Procedures for aggregating non-expert judgments may influence both the quality of data and convergence on some underlying truth, or trusted expert judgment. For an overview, see Quoc Viet Hung et al. (2013).

A FRAMEWORK FOR CODING POLITICAL TEXTS

We seek a well-specified method for data coding in the crowd which can generate a coded dataset that we can reliably substitute for an analogous dataset produced by expert coders using the same information. We therefore served up an identical set of documents, and the identical text coding scheme described below, to both a small set of expert coders and a large and heterogeneous set of crowd coders located around the world. Our task revisits the well-trodden ground of estimating party policy positions using election manifestos, the core objective of the largest expert-coded content analysis project in political science, the Comparative Manifesto Project (Budge et al. 2001; Klingemann et al. 2006). This task has spawned a growth industry of estimating party positions using automated or semi-automated methods of text analysis (e.g. Laver et al. 2003; Laver and Garry 2000; Slapin and Proksch 2008). Because party policy positions have been measured in a variety of other ways, such as expert surveys, there is also a rich set of external measures with which to compare crowd-sourced measures in order to assess validity.

A coding scheme for economic and social policy

Our coding scheme first asks coders to classify each manifesto sentence as referring to economic policy (left or right), to social policy (liberal or conservative), or to neither. Substantively, these two policy dimensions have been shown to offer an efficient representation of party positions in many countries. [7] They also correspond to dimensions covered by a series of expert surveys (Benoit and Laver 2006; Hooghe et al. 2010; Laver and Hunt 1992), allowing validation of the estimates we derive against widely used independent estimates of the same quantities. If a sentence was classified as economic policy, we then asked coders to rate it on a five-point scale from very left to very right; sentences classified as social policy were rated on a five-point scale from liberal to conservative. Figure 1 shows this coding scheme. [8]

[7] See Chapter 5 of Benoit and Laver (2006) for an extensive empirical review of this for a wide range of contemporary democracies.

[8] Our coding instructions, fully detailed in the supplementary materials, were identical for both expert and non-expert coders, defining the economic left-right and social liberal-conservative policy dimensions we estimate and providing examples of coded sentences.
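To make the structure of this coding scheme concrete, the following is a minimal sketch, not the authors' implementation, of the codes a single coder can assign to one sentence. The names `DOMAINS`, `SCALE`, `valid_codes` and `is_valid` are hypothetical, and the integer scores -2 to 2 for the five scale points are an illustrative assumption.

```python
from itertools import product

# Two policy domains plus a "neither" option (labels are illustrative).
DOMAINS = ("economic", "social")

# Five ordinal scale points; the integer scores are an assumption made
# for illustration (very left/liberal = -2 ... very right/conservative = 2).
SCALE = (-2, -1, 0, 1, 2)


def valid_codes():
    """Enumerate every (domain, position) code available for one sentence."""
    codes = [("neither", None)]            # no position is asked for "neither"
    codes += list(product(DOMAINS, SCALE))  # 2 domains x 5 scale points
    return codes


def is_valid(domain, position):
    """Check that a submitted code respects the hierarchical scheme."""
    return (domain, position) in valid_codes()
```

Enumerating the options this way shows that the hierarchical scheme allows eleven distinct codes per sentence: one "neither" code plus five scale positions in each of the two domains.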

Figure 1: Hierarchical coding scheme for two policy domains with ordinal positioning.

While the CMP has been coding party manifestos for decades, we did not use its 56-category policy coding scheme, for two main reasons. The first is methodological: the complexity of the CMP scheme and uncertain boundaries between many of its categories were major sources of unreliability in coding experiments where multiple expert coders used this scheme to code the same documents (Mikhaylov et al. 2012). The second is practical in a crowd-sourcing context: it is impossible to write clear and precise instructions, which could be understood reliably by a globally distributed and diverse set of non-expert coders, for using a detailed and complex 56-category scheme quintessentially designed for highly trained expert coders. This highlights an important methodological issue. There may be data coding tasks that cannot feasibly be explained in clear and simple terms, requiring sophisticated coding instructions that can only be understood and implemented by highly trained experts. More sophisticated coding instructions imply a more limited pool of coders who can understand and implement them and, for this reason, a less scalable and replicable data generation project. The striking alternative made available by crowd-sourced data coding is to break down complicated data generation tasks into simple small jobs, as happens when complex consumer products are manufactured on factory production lines. Given this, the simple coding scheme in Figure 1 is motivated by the observation that most scholars in their published work deploy estimates of politicians' positions on a few synthetic policy scales, and do not seek sophisticated analyses of party manifesto positions in terms of 56 fine-grained coding categories.

Our focus is on specifying a scalable and replicable method of data coding. If this method works, others may deploy it using other coding schemes. Indeed, the ideal is for researchers to mount coding projects designed for their own precise needs, rather than relying on canonical datasets designed decades ago, for other purposes, by other scholars. A shift to reliable and valid crowd-sourcing will mean the scientific object of interest is the method for collecting the data, not the dataset per se. And this method, furthermore, is fully described by specifying the corpus of documents to be coded and publishing the computer code that deploys the crowd-sourced data coding project on the Internet. This shift from canonical datasets to a replicable data generation process represents a paradigm shift in social science data production, made possible by crowd-sourcing.

Text corpus: British party manifestos

Our text corpus comprises 18,263 natural sentences from British Conservative, Labour and Liberal Democrat manifestos for the six general elections held between 1987 and 2010. These texts were chosen for two main reasons. First, for systematic external validation, there are diverse independent estimates of British party positions for this period, from contemporary expert surveys (Laver and Hunt 1992; Laver 1998; Benoit 2005, 2010) as well as CMP expert codings of the same texts. Second, there are well-documented substantive shifts in party positions during this period, notably the sharp shift of Labour towards the center between 1987 and 1997. The ability of crowd coders to pick up this move is a good test of external validity.

Text units: natural sentences

The CMP specifies a "quasi-sentence" as the fundamental text unit, defined as "an argument which is the verbal expression of one political idea or issue" (Volkens 96). Recoding experiments by Däubler et al. (2012), however, show that using natural sentences makes no statistically significant difference to point estimates, but does eliminate significant sources of both unreliability and unnecessary work. Our dataset therefore consists of all natural sentences in the 18 UK party manifestos under investigation. [9]

[9] Segmenting natural sentences, even in English, is never an exact science, but our rules matched those from Däubler et al. (2012), treating (for example) separate clauses of bullet-pointed lists as separate sentences.

Text unit sequence: random

In classical expert coding, experts code sentences in their natural sequence, starting at the beginning and ending at the end of a document. Most coders in the crowd, however, will never reach the end of a long policy document. Coding sentences in natural sequence, moreover, creates a situation in which one sentence coding may well affect priors for subsequent sentence codings, so that summary scores for particular documents are not aggregations of independent coder assessments. [10] An alternative is to randomly sample sentences from the text corpus for coding, with a fixed number of replacements per sentence across all coders, so that each coding is an independent estimate of the latent variable of interest. This has the big advantage, in a crowd-sourcing context, of scalability. Jobs for individual coders can range from very small to very large; coders can pick up and put down coding tasks at will; every little piece of coding in the crowd contributes to the overall database of text codings. Accordingly, our method for crowd-sourced text coding serves coders sentences randomly selected from the text corpus rather than in naturally occurring sequence. Our decision to do this was informed by coding experiments reported in the supplementary materials, and confirmed by results reported below. Despite higher variance in individual sentence codings under random sequence coding, there is no systematic difference between point estimates of party policy positions depending on whether sentences were coded in natural or random sequence.

[10] Coded sentences do indeed tend to occur in runs of similar topics, and hence codes; however, to ensure appropriate statistical aggregation it is preferable if the codings of those sentences are independent.

Text authorship: anonymous

In classical expert coding, coders typically know the authorship of the document they are coding. Especially in the production of political data, coders likely bring non-zero priors to coding text units. Precisely the same sentence ("we must do all we can to make the public sector more efficient") may be coded in different ways if the coder knows this comes from a right- rather than a left-wing party. Codings are typically aggregated into document scores as if coders had zero priors, even though we do not know how much of the score given to some sentence is the coder's judgment about the content of the sentence, and how much a judgment about its author. In coding experiments reported in the supplementary materials, semi-expert coders coded the same manifesto sentences both knowing and not knowing the name of the author. We found slight systematic coding biases arising from knowing the identity of the document's author. For example, we found that coders tended to code precisely the same sentences from Conservative manifestos as more right wing if they knew these sentences came from a Conservative manifesto. This informed our decision to withhold the name of the author of sentences deployed in crowd-sourced text coding.
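The random serving of sentences described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the platform logic used in this project: the function names `new_state` and `next_sentence` are hypothetical, `target` stands for an assumed fixed number of codings wanted per sentence, and the rule that a coder never sees the same sentence twice is an added assumption.

```python
import random
from collections import defaultdict


def new_state(sentence_ids):
    """Create empty bookkeeping: codings per sentence, sentences seen per coder."""
    counts = {s: 0 for s in sentence_ids}
    seen = defaultdict(set)
    return counts, seen


def next_sentence(coder, counts, seen, target, rng=random):
    """Serve a random sentence that still needs codings and is new to this coder.

    Returns None when nothing remains for this coder. Only a sentence
    identifier is returned, so no authorship information is exposed.
    """
    eligible = [s for s, n in counts.items()
                if n < target and s not in seen[coder]]
    if not eligible:
        return None
    choice = rng.choice(eligible)  # uniform draw, independent of document order
    counts[choice] += 1
    seen[coder].add(choice)
    return choice
```

Because each call is independent of the others, a coder can stop after one sentence or continue for thousands, which is the scalability property the random sequence is meant to deliver.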

Context units: +/- two sentences

Classical content analysis has always involved coding an individual text unit in light of the text surrounding it. Often, it is this context that gives a sentence substantive meaning, for example because many sentences contain pronoun references to surrounding text. For these reasons, careful instructions for drawing on context have long formed part of coder instructions for content analysis (see Krippendorff 2013, 101-3). For our coding scheme, on the basis of pre-release coding experiments, we situated each target sentence within a context of the two sentences on either side of it in the text. Coders were instructed to code the target sentence, not the context, but to use context to resolve any ambiguity they might feel about the target sentence.

SCALING POLICY POSITIONS FROM CODED SENTENCES

Our aim is to estimate policy positions of party manifestos, so our quantity of interest is not the code value of any single sentence, but some aggregation of these values into an estimate of each document's position on some meaningful policy scale. Classical content analytic approaches, with access to only one code value per sentence, generally rely on simple methods drawn from (normalized) counts of coded sentences in specific categories, or indices combining these (Krippendorff 2013, 189). To measure policy on the European Union, for example, the CMP subtracts the proportion of negative from the proportion of positive EU mentions. The CMP's left-right policy scale combines 26 normalized counts of categories in a similar manner. We have a more complex and informative dataset, however, with multiple codes and coders per sentence. We therefore need a way to aggregate sentence codes into a scale at the document level, while allowing for coder, sentence, and domain coding effects.

One option is simple averaging: identify all economic codes assigned to sentences in a document by all coders, average these, and use this as an estimate of the economic policy position of a document. Assigning numeric scores to our five-point scale, say -2, -1, 0, 1, and 2, we could obtain means within this range. Results in mathematical and behavioral studies on aggregations of individual judgments imply that simpler methods often perform as well as more complicated ones, and often more robustly (e.g. Ariely et al. 2000; Clemen and Winkler 1999). Simple averaging of individual judgments is the benchmark when there is no additional information on the quality of individual coders (Lyon and Pacuit 2013; Armstrong 2001; Turner et al. 2013). However, this does not permit direct estimation of domain misclassification tendencies by coders who fail to identify economic or social policy correctly, or of coder-specific effects in the use of the positional scales. Furthermore, when estimating uncertainty for our aggregate policy measures, simple averaging fails to account directly for coder uncertainty and misclassification, while assuming interval properties for ordinal policy scales.

An alternative is to model each sentence as containing a piece of information about the document concerning both policy domain and policy position, then scale this using a measurement model. Here we propose a model based on IRT, which allows for both coder effects and difficulty parameters, accounting for the strong possibility that some sentences are harder to classify and code reliably than others. Such an approach has antecedents in psychometric methods (e.g. Baker and Kim 2004; Fox 2010; Hambleton et al. 1991; Lord 1980), and has been applied to aggregate crowd ratings (e.g. Ipeirotis et al. 2010; Welinder et al. 2010; Welinder and Perona 2010; Whitehill et al. 2009).

In our model, each sentence $j$ is described by a vector of parameters $\theta_{jd}$, which corresponds to sentence attributes on each of four latent dimensions $d$. These dimensions are the latent domain propensity of that sentence to be coded economic (1) and social (2) versus none, and the latent left-right position of the sentence on economic (3) and social (4) dimensions.
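Before turning to the coder-level parts of the model, the simple averaging option described above can be made concrete. The sketch below is illustrative only: the function name `average_positions` and the tuple layout of a coding record are assumptions, and positions are scored -2 to 2 as suggested in the text.

```python
from collections import defaultdict


def average_positions(codings, domain):
    """Average all position codes in one domain, document by document.

    codings: iterable of (document, coder, sentence_id, domain, position)
             tuples, where position is an integer from -2 to 2, or None
             when the sentence was coded as belonging to neither domain.
    domain:  "economic" or "social".
    Returns a dict mapping each document to its mean position score.
    """
    by_document = defaultdict(list)
    for document, _coder, _sentence, coded_domain, position in codings:
        if coded_domain == domain:
            by_document[document].append(position)
    return {doc: sum(vals) / len(vals) for doc, vals in by_document.items()}


# Toy data, invented purely for illustration.
toy_codings = [
    ("Lab 1987", "coder1", 1, "economic", -2),
    ("Lab 1987", "coder2", 1, "economic", -1),
    ("Lab 1987", "coder1", 2, "social", 1),
    ("Con 1987", "coder1", 3, "economic", 2),
    ("Con 1987", "coder3", 3, "economic", 1),
    ("Con 1987", "coder2", 4, "neither", None),
]
economic_means = average_positions(toy_codings, "economic")
```

Note that this treats every code as equally informative and the ordinal scale as interval-level, which are exactly the limitations of averaging noted above.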

Individual coders $i$ have potential biases in each of these dimensions, corresponding to their relative propensity to code sentences as belonging to the economic and social domains, and to locate sentences as further to the right on economic and social position scales. Finally, individual coders $i$ have sensitivities to each of the four dimensions, corresponding to their relative responsiveness to changes in the latent sentence attributes in each of the four dimensions. Thus, the latent coding of sentence $j$ by coder $i$ on dimension $d$ is:

$$\mu_{ijd} = \chi_{id}\,\theta_{jd} + \psi_{id} \qquad (1)$$

where the $\chi_{id}$ indicate relative responsiveness of coders to changes in latent sentence attributes $\theta_{jd}$, and the $\psi_{id}$ indicate relative biases towards coding sentences as economic or social ($d = 1,2$), and coding economic and social sentences as right rather than left ($d = 3,4$).

Since coders do not directly provide assessments in each of these dimensions, we model their response to the choice of assignment between economic, social and neither domains using a multinomial logit given $\mu_{ij1}$ and $\mu_{ij2}$, and their choice of scale position as an ordinal logit depending on $\mu_{ij3}$ if they code the sentence as economic and on $\mu_{ij4}$ if they code the sentence as social. [11] We arrive at the following model for the eleven possible combinations of codes and scales that a coder can give a sentence:

$$p(\text{none}) = \frac{1}{1 + \exp(\mu_{ij1}) + \exp(\mu_{ij2})}$$

$$p(\text{econ}; \text{scale}) = \frac{\exp(\mu_{ij1})}{1 + \exp(\mu_{ij1}) + \exp(\mu_{ij2})} \left[ \text{logit}^{-1}(\gamma_{\text{scale}} - \mu_{ij3}) - \text{logit}^{-1}(\gamma_{\text{scale}-1} - \mu_{ij3}) \right]$$

$$p(\text{soc}; \text{scale}) = \frac{\exp(\mu_{ij2})}{1 + \exp(\mu_{ij1}) + \exp(\mu_{ij2})} \left[ \text{logit}^{-1}(\gamma_{\text{scale}} - \mu_{ij4}) - \text{logit}^{-1}(\gamma_{\text{scale}-1} - \mu_{ij4}) \right]$$

[11] By treating these as independent, and using the logit, we are assuming independence between the choices and between the social and economic dimensions (IIA). It is not possible to identify a more general model that relaxes these assumptions without asking additional questions of the coders.

In our coding scheme (as per Figure 1), each policy domain has five scale points, and the model assumes proportional odds of being in each higher scale category in response to the sentence's latent policy positions $\theta_{j3}$ and $\theta_{j4}$ and the coder's sensitivities to this association. The cutpoints $\gamma$ for ordinal scale responses are constrained to be symmetric around zero and to have the same cutoffs in both social and economic dimensions, so that the latent scales are directly comparable to one another and to the raw scales. Thus, $\gamma_0 = -\infty$, $\gamma_1 = -\gamma_4$, $\gamma_2 = -\gamma_3$, and $\gamma_5 = \infty$.

The primary quantities of interest are not sentence-level attributes, $\theta_{jd}$, but rather aggregates of these for entire manifestos, represented by the $\bar{\theta}_{k,d}$ for each document $k$ on each dimension $d$. Where the $\varepsilon_{jd}$ are distributed normally with mean zero and standard deviation $\sigma_d$, we model these latent sentence-level attributes $\theta_{jd}$ hierarchically in terms of corresponding latent document-level attributes, so that for sentence $j$ in document $k$:

$$\theta_{jd} = \bar{\theta}_{k,d} + \varepsilon_{jd}$$

As at the sentence level, two of these ($d = 1,2$) correspond to the overall frequency (importance) of economic and social dimensions relative to other topics; the remaining two ($d = 3,4$) correspond to aggregate left-right positions of manifestos on economic and social dimensions.

This model enables us to generate estimates not only of our quantities of interest for the document-level policy positions, but also of a variety of coder- and sentence-level diagnostics concerning coder agreement and the difficulty of domain and positional coding for individual sentences (details are provided in supplemental materials). Simulating from the posterior also makes it straightforward to estimate Bayesian credible intervals indicating our uncertainty over document-level policy estimates. We estimate the model by MCMC using JAGS, and provide the code, convergence diagnostics, and other details of our estimations in supplementary materials.

Posterior means of the document-level $\bar{\theta}_{k,d}$ correlate very highly with those produced by the simple averaging methods discussed earlier: 0.95 and above, as we report below. It is therefore possible to use averaging methods to summarize results in a simple and intuitive way that is also invariant to shifts in mean document scores that might be generated by adding new documents to the coded corpus. The value of our scaling model is to allow us to estimate coder and sentence fixed effects, and correct for these if necessary. While this model is based on our particular classification scheme, it is general in the sense that nearly all attempts to measure policy in specific documents will combine domain classification with positional coding. It is easily adapted to problems with a different set of policy domains or scales.

BENCHMARKING EXPERT TEXT CODING

Our core objective is to compare estimates generated by coders in the crowd with analogous estimates generated by expert coders. Most classical expert content analyses, however, report results generated by a single expert coder rather than by multiple experts. Although multiple expert coders will doubtless disagree over the coding of particular sentences (it would be odd, bordering on suspicious, if experts were reported to be in complete agreement on every code), we have no sense from previously published results about typical levels of disagreement between multiple expert coders. An important secondary objective in this paper, therefore, is to benchmark levels of disagreement between experts, as well as to assess the extent to which estimates generated by expert coders can be externally validated against independent sources.
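As an aside on the scaling model specified in the previous section, its response probabilities are easy to check numerically. The sketch below is not the JAGS code referred to above; it only evaluates, for one coder and one sentence, the probabilities of the eleven possible codes, combining a multinomial logit over domains with a proportional-odds ordinal logit over scale points. The function names and the parameter names `gamma3` and `gamma4` (the two free positive cutpoints under the symmetry constraint) are assumptions made for illustration.

```python
import math


def inv_logit(x):
    """Numerically stable logistic function; also handles +/- infinity."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def code_probabilities(mu, gamma3, gamma4):
    """Probabilities of the eleven codes for one coder-sentence pair.

    mu:     (mu1, mu2, mu3, mu4), the latent codings on the economic-domain,
            social-domain, economic-position and social-position dimensions.
    gamma3, gamma4: positive cutpoints with gamma3 < gamma4; symmetry around
            zero fixes the remaining cutpoints.
    """
    mu1, mu2, mu3, mu4 = mu
    # Cutpoints gamma_0 ... gamma_5, symmetric around zero.
    cuts = [-math.inf, -gamma4, -gamma3, gamma3, gamma4, math.inf]
    denom = 1.0 + math.exp(mu1) + math.exp(mu2)
    probs = {("none", None): 1.0 / denom}
    for domain, mu_dom, mu_pos in (("econ", mu1, mu3), ("soc", mu2, mu4)):
        p_domain = math.exp(mu_dom) / denom        # multinomial logit part
        for scale in range(1, 6):                  # ordinal logit part
            p_scale = (inv_logit(cuts[scale] - mu_pos)
                       - inv_logit(cuts[scale - 1] - mu_pos))
            probs[(domain, scale)] = p_domain * p_scale
    return probs
```

A useful sanity check is that the eleven probabilities sum to one for any parameter values, and that a sentence with a latent position of zero is equally likely to be placed at either end of the scale.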

The first stage of our empirical work was to deploy multiple (four to six) [12] expert coders independently to code each of the 18,263 sentences in our 18-manifesto text corpus, using the coding scheme described above. The entire corpus was coded twice by the set of expert coders. First, sentences were served in their natural sequence in each manifesto, to mimic classical expert coding. Second, about a year later, sentences were coded in random order, to mimic the system we use for serving sentences to crowd coders. Sentences were uploaded to a custom-built, web-based coding platform that displayed sentences in context and made it easy to classify and code sentences with a few mouse clicks. In all, we obtained over 123,000 expert codes of the set of manifesto sentences, about seven per sentence. Table 1 provides details on the manifesto texts, with statistics on the overall and mean numbers of codes, for both stages of expert coding as well as the crowd results we report below.

[12] Three of the authors of this paper, plus three senior PhD students in Politics from New York University, coded the six manifestos from 1987 and 1997. One author of this paper and four NYU PhD students coded the other 12 manifestos.

Table 1. Texts and sentences coded: 18 British party manifestos. Columns: Manifesto; Total sentences in manifesto; Mean expert codings: natural sequence; Mean expert codings: random sequence; Total expert codings; Mean crowd codings; Total crowd codings. Rows: one each for the Conservative (Con), Liberal Democrat (LD), and Labour (Lab) manifestos at each election, plus a Total row. Surviving totals: 18,263 sentences, 91,400 expert codings in natural sequence, and 215,107 crowd codings.

Validity of expert coding

Figure 2 plots two sets of estimates of positions of the 18 manifestos on economic and social policy: one generated by experts coding sentences in natural sequence (x-axis); the other generated by independent expert surveys (y-axis). [13] Linear regression lines summarizing these plots show that positional estimates from expert text coding predict independent expert survey measures of the same quantity very well for economic policy (R = 0.91), and somewhat less well for the noisier dimension of social policy (R = 0.81).

[13] These were: Laver and Hunt (1992); Laver (1998) for 1997; Benoit and Laver (2006) for 2001; Benoit (2005, 2010) for 2005 and

To test whether coding sentences in their natural sequence affected the results, our expert panel also re-coded the manifesto corpus taking sentences in random order. Comparing estimates of manifesto positions from these two sets of codings, we found almost identical results, with correlations of 0.98 for both scales (details provided in supplementary materials). Moving from classical expert coding to having experts code sentences served at random from anonymized texts makes no substantive difference to point estimates of manifesto positions. This confirms results of preliminary coding experiments, and informs our decision to use the much more scalable random sentence sequencing in the crowd-sourcing method we specify.

[Figure 2: two panels, "Manifesto Placement Economic" and "Manifesto Placement Social"; axes: Expert Coding Estimate and Expert Survey Placement.]

Figure 2. British party positions on economic and social policy; estimates from sequential expert text coding (vertical axis) and independent expert surveys (horizontal). (Labour red, Conservatives blue, Liberal Democrats yellow, labeled by last two digits of year.)

Reliability of expert coding

The search for reliability and replicability forms a large part of our rationale for moving from classical, single-expert coding to crowd-sourced coding using a large number of coders. Before testing whether a crowd of non-expert coders can reliably apply our text-coding scheme, we need to establish that our mini-crowd of expert coders can do so.

Agreement between expert coders

As we suspected, agreement between our expert coders was far from perfect. Table 2 classifies each of the 5,444 sentences in the 1987 and 1997 manifestos, all of which were coded by the same six experts. It shows how many experts agreed the sentence referred to economic, or social, policy. If experts are in perfect agreement on the policy content of each sentence, either all six code each sentence as dealing with economic (or social) policy, or none do. Thus the first data column of the table shows a total of 4,125 sentences which all experts agree have no social policy content. Of these, there are 1,193 sentences all experts also agree have no economic policy content, and 527 that all experts agree do have economic policy content. The experts disagree about the remaining 2,405 sentences: some but not all experts code these as having economic policy content.

Table 2: Domain classification matrix for 1987 and 1997 manifestos: frequency with which sentences were assigned by six experts to economic and social policy domains. (Shaded boxes: perfect agreement between experts.) Rows give the number of experts assigning the economic domain; columns give the number of experts assigning the social policy domain. Surviving cells: 1,193 sentences assigned to neither domain by any expert; 527 assigned to the economic domain by all six experts and to the social domain by none; column total for no social policy assignment 4,125; grand total 5,444.

The shaded boxes show sentences for which the six expert coders were in unanimous agreement on economic policy, social policy, or neither. There was unanimous expert agreement on about 35 percent of manifesto sentences. For about 65 percent of sentences, therefore, there was

disagreement, even about the policy area, among trained expert coders of the type usually used to generate coded data in the social sciences.

Scale reliability

Despite disagreement among experts on individual sentence coding, we saw above that we can derive externally valid estimates of party policy positions if we aggregate the judgments of all expert coders on all sentences in a given manifesto. This happens because, while each expert judgment on each manifesto sentence is a noisy realization of some underlying signal about policy content, the expert judgments taken as a whole scale nicely, in the sense that in aggregate they are all capturing information about the same underlying quantity. Table 3 shows this, reporting a scale reliability analysis for economic policy positions of the 1987 and 1997 manifestos, derived by treating economic policy scores for each sentence allocated by each of the six expert coders as six sets of independent estimates of economic policy positions.

Table 3. Inter-coder reliability analysis for the economic policy scale generated by aggregating all expert scores for sentences judged to have economic policy content. Columns: Item; N; Sign; Item-scale correlation; Item-rest correlation; Cronbach's alpha. Rows: Expert 1 to Expert 6, plus Overall. Surviving value: overall Cronbach's alpha 0.95.

Two of the coders (Experts 3 and 6) were much more cautious than others in classifying sentences as having economic policy content. Despite this, and the high variance in individual sentence codings we saw in Table 2, overall scale reliability, measured by a Cronbach's alpha of

0.95, is excellent by any conventional standard. [14] We can therefore apply our model to aggregate the noisy information contained in the combined set of expert sentence codings to produce reliable and valid estimates of policy positions at the document level. This is the essence of crowd-sourcing, and shows that our experts should be seen as a small crowd. The method we apply to generating estimates of manifesto positions from crowd-sourced sentence coding is therefore identical to the method we use for our expert panel.

DEPLOYING CROWD-SOURCED TEXT CODING

CrowdFlower: a crowd-sourcing platform with multiple channels

Many online platforms now distribute crowd-sourced micro-tasks ("Human Intelligence Tasks" or "HITs") via the Internet. The best known is Amazon's Mechanical Turk (MT), an online marketplace for serving up HITs to crowd-based workers. Workers often must pass a pre-task qualification test, and maintain a certain quality score from validated tasks that determines their status and qualification for future jobs. There are now many other platforms for crowd-sourcing so, rather than navigating this increasingly complicated environment, we used CrowdFlower, a service that consolidates access to dozens of crowd-sourcing channels. [15] CrowdFlower not only offers an interface for designing templates and uploading tasks, but also maintains a training and qualification mode for potential workers before they can qualify for tasks, as well as quality control while tasks are being completed.

[14] Conventionally, an alpha of 0.70 is considered acceptable. Nearly identical results for social policy are available in online materials.
[15] See
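The scale reliability statistic reported for the expert panel above can be illustrated with a short sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scores), where each coder is one "item". This is only a sketch under simplifying assumptions: every coder is assumed to have scored the same sentences (in the paper the experts coded different numbers of sentences), and the function name `cronbach_alpha` and the toy scores are hypothetical, not taken from the authors' materials.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items (coders) scored on the same n cases.

    items: list of k equal-length lists; items[j][i] is coder j's score
    for sentence i. Assumes no missing scores (an illustrative simplification).
    """
    k = len(items)
    n = len(items[0])
    if k < 2 or any(len(col) != n for col in items):
        raise ValueError("need at least two coders with equal-length score lists")
    # Sample variance of each coder's scores across sentences.
    item_variances = [variance(col) for col in items]
    # Sample variance of the per-sentence sum of all coders' scores.
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(item_variances) / variance(totals))


# Toy example: three hypothetical coders scoring five sentences on a -2..2 scale.
toy_scores = [
    [-2, -1, 0, 1, 2],
    [-1, -1, 0, 2, 2],
    [-2, 0, 0, 1, 1],
]
print(round(cronbach_alpha(toy_scores), 3))
```

Values close to 1 indicate that the coders' scores move together across sentences, which is the sense in which noisy individual codings can still form a reliable scale.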

Quality control for crowd-sourced text coding

While crowd-sourcing potentially offers a valuable way to generate vast amounts of data quickly and cheaply, very close attention must be paid to quality assurance. Given the natural economic motivation to finish as many jobs in as short a time as possible, it is both tempting and easy for workers to submit badly or falsely coded data. Workers who do this are called "spammers"; the data they generate are tainted and, given the open nature of the platform, it is vital to prevent them from participating in a job (e.g. Kapelner and Chandler 2010; Nowak and Rüger 2010; Eickhoff and de Vries 2012; Berinsky et al. forthcoming). Spamming would be a serious problem for our text coding project. It is crucial that workers thoroughly read both coding instructions and the sentences coded, and pay close attention to the coding judgments they make.

Conway addressed this problem using coding experiments designed to assess methods for monitoring and controlling coder quality, using MT as a platform for text coding (Conway 2013). MT provides the capacity to filter workers from HITs by using a qualification test, whereby workers must successfully complete a test HIT before they are allowed to participate in compensated HITs. Conway created three worker funnels, each with an increasingly strict test. [16] Two findings directly impact our design. First, inclusion of a qualification test does very significantly improve the quality of coding results. A well-designed test can filter out spammers and bad coders who otherwise tend to exploit the job. Second, once a suitable test is in place, increasing its difficulty does not improve results. While it is vital to have a filter on the front end to keep out spammers and bad coders, a tougher filter does not necessarily lead to better coders.
[16] There was a baseline test with no filter, a low-threshold filter where workers had to code 4/6 sentences correctly, and a high-threshold filter that required 5/6 correct codings. "Correct" coding means the sentence is coded into the same policy domain as that provided by a majority of expert coders.
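The qualification filters described above (a worker passes if enough test sentences are coded into the domain chosen by a majority of experts) can be sketched as follows. The data layout, the domain labels, and the function names `majority_domain` and `passes_qualification` are assumptions made for illustration; reading "majority" as a strict majority is also an assumption. The sketch does not reproduce the MT or CrowdFlower implementation.

```python
from collections import Counter


def majority_domain(expert_domains):
    """Return the domain chosen by a strict majority of experts, else None."""
    domain, count = Counter(expert_domains).most_common(1)[0]
    return domain if count > len(expert_domains) / 2 else None


def passes_qualification(worker_answers, expert_answers, required_correct):
    """Check a worker's test answers against the expert-majority domain.

    worker_answers: dict mapping sentence id -> domain chosen by the worker.
    expert_answers: dict mapping sentence id -> list of expert domain codes.
    required_correct: e.g. 4 (low threshold) or 5 (high threshold) out of 6.
    """
    correct = sum(
        1
        for sentence_id, answer in worker_answers.items()
        if answer == majority_domain(expert_answers[sentence_id])
    )
    return correct >= required_correct


# Hypothetical six-sentence test: experts lean "economic" on every sentence.
experts = {i: ["economic"] * 4 + ["social"] * 2 for i in range(6)}
worker = {0: "economic", 1: "economic", 2: "economic", 3: "economic",
          4: "social", 5: "none"}
print(passes_qualification(worker, experts, required_correct=4))
print(passes_qualification(worker, experts, required_correct=5))
```

Keeping the threshold as a parameter mirrors the finding reported above: the filter itself matters, while raising `required_correct` from 4 to 5 did not improve results.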

The primary method for quality control implemented by CrowdFlower uses answers to "gold" questions: tasks with unambiguous correct answers specified in advance. [17] Correct performance on gold tasks, which are both used in qualification tests and randomly sprinkled through the job, is used to monitor coder quality and block spammers and bad coders. We identified our set of gold manifesto sentences as those for which there was unanimous expert coder agreement on both the policy area (economic, social or neither), and the direction of policy position (left or right, liberal or conservative), and seeded each job with the recommended proportion of about 10% gold sentences. An example of a gold sentence (from the 1997 Conservative manifesto) is: "Our aim is to ensure Britain keeps the lowest tax burden of any major European economy." All six expert coders coded this as either somewhat or very to the right on economic policy, so "Economic policy: somewhat or very right wing" is specified as its golden coding. We used natural gold sentences found in the texts, but could just as easily have used artificial gold, crafted to represent archetypes of economic or social policy statements of a given orientation.

A special case of gold that we did use is what we call "screeners": gold sentences that contain an exact instruction on how to code the sentence, and are designed to ensure coders are paying attention throughout the coding process (Berinsky et al. forthcoming). We set screeners in a natural two-sentence context, but replaced the target sentence with one that gives a simple instruction, for example "please code this sentence as having economic policy content with a score of very right".

Specifying gold sentences in this way, we used CrowdFlower to implement a two-stage process of quality control.
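The rule for choosing natural gold sentences (unanimous expert agreement on both policy area and direction) can be sketched in a few lines. The representation is an assumption: each expert code is taken to be a `(domain, score)` pair with a signed score whose sign gives the direction (negative for left or liberal, positive for right or conservative), and sentences that every expert codes as having no policy content are also treated as gold. The function name `gold_code` is hypothetical.

```python
def gold_code(expert_codes):
    """Return a (domain, direction) gold label if experts are unanimous, else None.

    expert_codes: list of (domain, score) pairs, one per expert, where domain
    is "economic", "social" or "none", and score is a signed position
    (assumed here: negative = left/liberal, positive = right/conservative).
    """
    domains = {domain for domain, _ in expert_codes}
    if len(domains) != 1:
        return None  # experts disagree on (or did not code) the policy area
    domain = next(iter(domains))
    if domain == "none":
        return ("none", 0)  # unanimous "neither": no direction to check
    # Reduce each score to its sign: -1, 0 or +1.
    signs = {(score > 0) - (score < 0) for _, score in expert_codes}
    if len(signs) != 1 or 0 in signs:
        return None  # experts disagree on direction, or coded it as neutral
    return (domain, next(iter(signs)))


# The tax-burden example: six experts, all "somewhat" (1) or "very" (2) right.
tax_sentence = [("economic", 1), ("economic", 2), ("economic", 2),
                ("economic", 1), ("economic", 2), ("economic", 1)]
print(gold_code(tax_sentence))  # ('economic', 1)
```

Collapsing scores to their sign matches the example in the text, where "somewhat" and "very" right are both accepted as the golden coding.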
[17] For CrowdFlower's formal definition of gold, see

First, all workers completed a training mode whereby they were only allowed to start production coding after correctly answering 8 out of 10 gold questions in a

qualification test. [18] Once coders are on the job and have seen at least four more gold sentences, they are given a CrowdFlower "trust" score, which is simply the proportion of correctly coded gold. If coders get too many gold questions wrong, their trust level goes down; they are ejected from further production coding if their trust score falls below 0.8. The current trust score of a coder is also recorded with each coding, and can if desired be used to weight the contribution of each coding to some aggregate estimate, although our tests showed this made no substantial difference to overall results, mainly because trust scores all tend to fall in a tight interval around their mean. [19] Many more potential codings than we use here were rejected as "untrusted" because the coders did not pass the training mode, or because their trust score fell below the critical threshold during coding. Workers are not paid for rejected codings, giving them a strong incentive to perform coding tasks carefully, as they do not know which tasks have been designated as gold questions for quality assurance. We have no hesitation in stating, on the basis of our own experience, that this system of thorough continuous monitoring of coder quality is a sine qua non for reliable and valid crowd-sourced data coding.

[18] Workers giving wrong codes to gold questions are given a short explanation of why they are wrong.
[19] Our supplementary materials report the distribution of trust scores from the complete set of crowd codings by country of the coder and channel, in addition to results that scale the manifesto aggregate policy scores by the trust scores of the coders.

Deployment

We set up a coding interface on CrowdFlower that was nearly identical to that in our custom-designed expert coding web system and deployed our text coding job in two stages. First, we over-coded all sentences in the 1987 and 1997 manifestos, because we wanted to determine the number of codes needed to derive stable estimates of our quantities of interest. We served up sentences from the 1987 and 1997 manifestos until we obtained a minimum of 20 codes for each sentence. After analyzing the results to determine that our estimates of manifesto scale positions

converged on stable values once we had five codes per manifesto sentence (in results we report below), we served the remaining manifestos until we reached five codes per sentence. In all, we gathered 215,107 crowd codings of the 18,263 sentences in our 18 manifestos, employing a total of 1,488 different coders from 49 different countries. About 28 percent of these came from the US, 15 percent from the UK, 11 percent from India, and 5 percent each from Spain, Estonia, and Germany. The average coder processed about 145 sentences, with most coding between 10 and 70 sentences; 44 coders coded over 1,000 sentences, and four individuals over 5,000. While at the outset we drew workers only from MT, we expanded this to more CrowdFlower channels, following tests that indicated, given our tight quality control system, that there was nothing special about MT workers. [20]

[20] Our final crowd-coded dataset was generated by deploying through a total of 26 CrowdFlower channels. The most common was Neodev (Neobux) (40%), followed by Mechanical Turk (18%), Bitcoinget (15%), Clixsense (13%), and Prodege (Swagbucks) (6%). Opening up multiple worker channels also avoided the restriction imposed by Mechanical Turk in 2013 to limit the labor pool to workers based in the US and India. Full details, along with the range of trust scores for coders from these platforms, are presented in the supplemental materials.
[21] Full point estimates are provided in the supplemental materials.

CROWD-SOURCED POLICY POSITIONS FOR PARTY MANIFESTOS

Figure 3 plots crowd-sourced estimates of the economic and social policy positions of British party manifestos against estimates generated from multiple expert codings of the same documents. [21] The very high correlations of aggregate policy measures generated by crowd and expert coders suggest both are measuring the same latent quantities. Substantively, for example, crowd coders based all over the world and coding randomly selected manifesto sentences identified the sharp rightwards shift of Labour between 1987 and 1997 on both economic and social policy, also identified by both expert text coders and independent expert surveys. The standard errors of crowd-sourced estimates are higher for social than for economic policy, reflecting both the smaller number of manifesto sentences devoted to social policy and higher

coder disagreement over the application of this policy domain. Nonetheless, Figure 3 summarizes our evidence that the crowd-sourced estimates of party policy positions can be used as substitutes for the expert estimates, which is our main concern in this paper.

[Figure 3: two panels, "Manifesto Placement Economic" and "Manifesto Placement Social"; axes: Crowd Coding Estimate and Expert Coding Estimate.]

Figure 3. Expert and crowd-sourced estimates of economic and social policy positions.

[22] For instance, each coder has a domain sensitivity as well as a proclivity to code policy as left or right. We report more fully on diagnostic results for our coders on the basis of the auxiliary model quantity estimates in the supplemental materials.

Our scaling model provides a theoretically well-grounded way to aggregate all the information in our expert or crowd codes, relating the underlying position of the political text both to the "difficulty" of a particular sentence and to a coder's propensity to identify the correct policy domain and code the policy position within domain. [22] Because the positions from the scaling model depend on parameters estimated using the full set of coders and codings, changes to the manifesto set can affect the relative scaling. The simple "mean of means" method, however, is invariant to rescaling and always produces the same results, even for a single manifesto. Comparing crowd-sourced estimates from the model to those produced by a simple averaging of the mean of mean sentence scores, for instance, we find correlations of 0.96 for the economic

and 0.97 for the social policy positions of the 18 manifestos. We present both methods as confirmation that our scaling method has not "manufactured" unexpected policy estimates, and to underscore that our method for taking full account of sentence and coder effects is only one of many potentially useful and valid approaches to aggregating coded sentences into a policy measure.

We have already seen that noisy expert sentence codings aggregate up to reliable and valid estimates for document scores. Similarly, crowd-sourced document estimates reported in Figure 3 are derived from crowd-sourced sentence codings that are full of noise. As we already argued, this is the essence of crowd-sourcing. Figure 4 plots mean expert codes against mean crowd-sourced codes for each manifesto sentence. The codes are highly correlated, though crowd coders are substantially less likely to use extremes of the scales than experts. The first principal component and associated confidence intervals show a strong and significant statistical relationship between crowd-sourced and expert codings of individual manifesto sentences, with no evidence of systematic bias in the crowd-coded sentence scores. [23] Overall, despite the expected noise, our results show that crowd coders systematically tend to allocate the same sentence codes as expert coders. Given sufficient numbers, our crowd coders produced codes that in the aggregate yield results as good as those of our experts.

[23] Lack of bias is indicated by the fact that the fitted line crosses the origin.
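The simple "mean of means" aggregation referred to above can be sketched as follows: average the codes each sentence received in a given policy domain, then average those sentence means to get the document position. The input layout (a dict from sentence id to a list of numeric codes) and the function name `mean_of_means` are assumptions for illustration; sentences with no codes in the domain are simply skipped here, which is one possible reading rather than the authors' documented procedure.

```python
from statistics import mean


def mean_of_means(codes_by_sentence):
    """Document position as the mean of per-sentence mean codes.

    codes_by_sentence: dict mapping sentence id -> list of numeric position
    codes given to that sentence in one policy domain. Sentences with an
    empty list (no codes in this domain) are ignored.
    """
    sentence_means = [mean(codes) for codes in codes_by_sentence.values() if codes]
    if not sentence_means:
        raise ValueError("no coded sentences in this domain")
    return mean(sentence_means)


# Toy manifesto: s1 has two codes, s2 has one, s3 has none in this domain.
toy_manifesto = {"s1": [1, 2], "s2": [-1], "s3": []}
print(mean_of_means(toy_manifesto))  # 0.25
```

Averaging within sentences first gives every sentence equal weight regardless of how many codes it received; pooling all three codes in the toy example would instead give 2/3.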

[Figure 4: two panels, "Economic Domain" and "Social Domain"; axes: Crowd Mean Code and Expert Mean Code.]

Figure 4. Expert and crowd-sourced estimates of economic and social policy codes of individual sentences, all manifestos. Fitted line is the principal components or Deming regression line.

Calibrating the required number of crowd codings

A key question for our method concerns how many noisier crowd coders we need to generate reliable and valid estimates of documents such as party manifestos. To answer this, we turn to evidence from our over-coding of the 1987 and 1997 manifestos. Recall that we obtained a minimum of 20 crowd codings for each sentence in each of these manifestos, allowing us to explore what our estimates of the position of each manifesto would have been, had we collected fewer codings. Drawing random subsamples from our over-coded data, therefore, we can simulate the convergence of estimated document positions as a function of the number of crowd coders per sentence. We did this by bootstrapping 100 sets of subsamples for each of the subsets of n=1 to n=20 coders, computing manifesto positions on each policy domain from aggregated sentence position means, and computing standard deviations of these manifesto positions across the 100 estimates. Figure 5 plots these for each manifesto as a function of the increasing number of

coders, where each point represents the empirical standard error of the estimates for a specific manifesto. For comparison, we plot the same quantities from the expert coding in red.

[Figure 5: two panels, "Economic" and "Social"; vertical axis: Std. error of bootstrapped manifesto estimates; horizontal axis: Crowd codes per sentence; series: Expert and Crowd.]

Figure 5. Standard errors of manifesto-level policy estimates as a function of the number of coders, for the oversampled 1987 and 1997 manifestos. Each point is the bootstrapped standard deviation of the "mean of means" aggregate manifesto scores, computed from sentence-level random n sub-samples from the codes.

The findings show a clear trend: uncertainty over the crowd-based estimates collapses as we increase the number of coders. Indeed, the only difference between the experts and the crowd is that the expert variance is smaller, as we would expect. Our findings vary somewhat with policy area, given the noisier character of social policy estimates, but adding crowd-sourced sentence codes led to convergence with our expert panel of 5-6 coders at around 15 crowd coders. However, the steep decline in the uncertainty of our document estimates leveled out at around five crowd coders, at which point the absolute level of error is already low for both policy domains. While a larger number of unbiased crowd codings will always give us better estimates, we decided on cost-benefit grounds for the second stage of our deployment to continue coding in the crowd until we had obtained five crowd codings per sentence.
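The subsampling exercise behind Figure 5 can be sketched roughly as below: for a given number of codes per sentence, draw repeated random subsamples, recompute the mean-of-means manifesto score each time, and take the standard deviation across draws. Several details are assumptions rather than the authors' documented procedure: codes are sampled without replacement within each sentence, sentences with fewer codes than requested contribute all of their codes, and the names `manifesto_score` and `bootstrap_se` and the simulated data are hypothetical.

```python
import random
from statistics import mean, stdev


def manifesto_score(codes_by_sentence):
    """Mean of per-sentence mean codes (sentences without codes are skipped)."""
    return mean(mean(codes) for codes in codes_by_sentence.values() if codes)


def bootstrap_se(codes_by_sentence, n_codes, n_draws=100, seed=0):
    """Std. deviation of manifesto scores over random n_codes-per-sentence subsamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_draws):
        subsample = {
            sentence_id: rng.sample(codes, min(n_codes, len(codes)))
            for sentence_id, codes in codes_by_sentence.items()
            if codes
        }
        estimates.append(manifesto_score(subsample))
    return stdev(estimates)


# Simulated over-coded manifesto: 200 sentences with 20 noisy codes each.
sim = random.Random(42)
over_coded = {
    s: [sim.choice([-2, -1, 0, 1, 2]) for _ in range(20)] for s in range(200)
}
for n in (1, 5, 10, 15):
    print(n, round(bootstrap_se(over_coded, n), 4))
```

With sampling without replacement, the spread necessarily shrinks to zero as `n_codes` approaches the 20 codes available, so the informative part of such a curve is where it flattens, which is the pattern the text describes at around five codes per sentence.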

Five crowd codings per sentence may seem surprisingly few, but there are a number of important factors to bear in mind in this context. First, the manifestos comprise about 1,000 sentences on average; our estimates of document positions aggregate codes for these. Second, sentences were coded in random order and were randomly assigned to coders, so each sentence code can be seen as an independent estimate of the position of the manifesto on each dimension. [24] With five codes per sentence and about 1,000 sentences per manifesto, we have about 5,000 "little" estimates of the manifesto position, each a representative sample from the larger set of codings that would result from additional codings of each sentence in each document. This sample is big enough to achieve a reasonable level of precision, given the large number of sentences per manifesto. Thus, while the method we report here could certainly be used for much shorter documents, the results we infer here for the appropriate number of coders might well not apply, and the required number would likely be higher. But, for large documents with many sentences, we find that the number of crowd codes per sentence that we need is not high.

[24] Coding a sentence as referring to another dimension is a null estimate.

CROWD-SOURCED DATA FOR SPECIFIC PROJECTS: IMMIGRATION POLICY

We have shown that crowd-sourced data coding can reproduce estimates of a well-known set of party policy positions. We now use it to generate new information. Immigration policy is a policy area of increasing concern to scholars, but is not included in canonical coding schemes such as the CMP, designed in the 1980s (Ruedin and Morales 2012; Ruedin 2013). Crowd-sourcing frees research from legacy problems arising with existing datasets and allows researchers to collect information on their precise quantities of interest. To demonstrate this, we deployed a single-issue coding project to code eight British party manifestos from the 2010 election on immigration policy, including smaller parties with more extreme positions on

immigration, such as the British National Party (BNP) and the UK Independence Party (UKIP), as well as the governing coalition of Conservatives and Liberal Democrats. Workers were asked to code each sentence as referring to immigration policy or not; if a sentence did cover immigration, they were asked to code it as pro- or anti-immigration, or neutral. We deployed a job containing 7,070 manifesto sentences plus 136 gold questions and screeners devised specifically for this purpose. For this job, we used an adaptive coding strategy which set a minimum of five codes per sentence, unless the first three codings were unanimous in judging a sentence not to concern immigration policy. This is efficient when coding texts with only sparse references to the matter of interest; in this case most manifesto sentences (c. 96%) were clearly not about immigration policy. Within just five hours, the job was completed, with 22,228 codings. [25]

While there are no text codings of immigration policy for comparison, we benchmark results against expert surveys by Benoit (2010) and the Chapel Hill Expert Survey (Marks 2010). Results are shown in Table 4 and Figure 3, which compare our crowd-sourced estimates of immigration policy positions with estimates from the Benoit (2010) expert survey; Table 4 reports a correlation of 0.96 with that survey and 0.94 with the Chapel Hill Expert Survey. [26] With just hours from deployment to dataset, and for a fraction of the cost, crowd-sourcing proved a flexible, valid, and inexpensive method of replicating measures from two costly and cumbersome expert survey projects.

[25] The job set 10 sentences per task and paid $0.15 per task.
[26] CHES included two highly correlated measures, one aimed at "closed or open" immigration policy, another aimed at policy toward asylum seekers and whether immigrants should be integrated into British society. Our measure averages the two. Full numerical results are given in supplementary materials.
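The adaptive coding rule described above (a minimum of five codes per sentence, unless the first three codings unanimously judge the sentence not to concern immigration) can be expressed as a small stopping rule. This is a sketch of the logic only: the label strings, the function name `codes_still_needed`, and the idea of tracking codes in arrival order are assumptions, and it does not reproduce any CrowdFlower setting.

```python
def codes_still_needed(domain_codes, minimum=5, early_stop=3,
                       off_topic="not_immigration"):
    """How many more codings a sentence should receive under the adaptive rule.

    domain_codes: list of domain judgments for one sentence, in the order
    they were received (e.g. "immigration" or "not_immigration").
    """
    first = domain_codes[:early_stop]
    # Stop early once the first `early_stop` codings all say "off topic".
    if len(first) == early_stop and all(code == off_topic for code in first):
        return 0
    # Otherwise keep collecting codes until the minimum is reached.
    return max(0, minimum - len(domain_codes))


print(codes_still_needed([]))                                     # 5
print(codes_still_needed(["not_immigration"] * 3))                # 0
print(codes_still_needed(["not_immigration", "immigration",
                          "not_immigration"]))                    # 2
```

Because most sentences in this job were not about immigration, a rule of this kind lets the large majority of sentences stop at three codings while sentences with any immigration judgment among the first three still receive the full five.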

Table 4. Crowd-sourced estimates of immigration policy during the 2010 British elections. 95% Bayesian credible intervals in brackets. Expert survey data: Benoit (2010), Marks (2010). Columns: Party; Total sentences coded as immigration; Total sentences in manifesto; Immigration position estimate; 95% CI. Surviving values by party:

BNP: 147 sentences coded as immigration; 95% CI [2.63, 3.51]
UKIP: 95% CI [1.54, 2.73]
Labour: 22 sentences coded as immigration; 95% CI [-0.16, 0.98]
Conservatives: 10 sentences coded as immigration; 95% CI [-0.77, 0.55]
Conservative - LD Coalition: 95% CI [-1.22, 0.82]
Plaid Cymru: 95% CI [-2.26, 0.32]
Liberal Democrats: 95% CI [-1.55, -0.25]
Scottish National Party: 95% CI [-2.15, -0.16]
Greens: 95% CI [-2.18, -0.88]
Total: 280 sentences coded as immigration; 7,322 sentences in manifestos

Correlation with Benoit (2010) Expert Survey: 0.96
Correlation with Chapel Hill Expert Survey: 0.94
Correlation with Mean of Means method: 0.99

[Figure: "Estimated Immigration Positions"; crowd estimates plotted against expert survey placements, r = 0.96; parties shown: PC, LD, SNP, Greens, Lab, Con, UKIP, BNP.]

Figure 3. Correlation of combined immigration crowd codings with Benoit (2010) expert survey position on immigration.

THE REPLICATION STANDARD AND CROWD-SOURCED DATA

The replication standard is a cornerstone of modern political science (King 1995, 2006). Most journals now require that replication materials be deposited with an archive. These allow scholars to repeat the analysis of a given dataset, contributing to transparency and error-checking, while allowing others to modify and extend published work. They rarely if ever make it possible to replicate the entire data production process, in the sense that researchers in the physical sciences reproduce experimental results by following the same procedure from start to finish in an independent laboratory, rather than simply rerunning the computer analysis of the

original experiment's lab results. True scientific replication is replication of the data production process, not just the data analysis, and crowd-sourcing does offer a tractable way to replicate the full data production process. To explore this, we repeated our immigration coding exercise with a second deployment two months after the first. Run with identical settings, this job generated another 24,551 codes of immigration sentences and completed in just over three hours.

Table 5. Comparison results for Replication of Immigration Policy Crowd-Coding. Columns: Wave (Initial, Replication, Combined). Rows: Total crowd codings; Number of coders; Total sentences coded as immigration; Correlation with Benoit expert survey (2010); Correlation with CHES. Surviving values: total crowd codings of 24,674 (Initial), 24,551 (Replication), and 49,225 (Combined).

[27] These are manifesto sentences for which more than half of coders judged them to concern immigration policy.
[28] Most of the coders from the second immigration policy coding job were new to the task: from the coder IDs we were able to verify that 71% of the 38 crowd coders on the second job had not participated in the first.

Table 5 compares results of our replication of immigration policy coding with our original estimates. The replication generated nearly identical estimates, correlating at the same high levels with external expert surveys, and correlating at 0.93 with party position estimates from the original crowd-coding. This was a hard test since only around 280 manifesto sentences (less than four percent of the total) refer to immigration policy. [27] In short, full replication of our crowd-coding exercise two months later shows that this method can easily be redeployed to produce essentially the same results. [28] This is evidence of the reliability of crowd-sourcing as a method for coding political text. It also shows (see the third column of Table 5) that we can add new crowd estimates to improve existing estimates, simply by ordering more codings from the crowd. Most importantly, it shows that the crowd-sourcing method takes us

closer to the replication standard in the natural sciences, according to which a replicable method of data collection is more important than any particular dataset. In our supplemental materials, we provide all settings used on CrowdFlower for our economic, social and immigration policy coding jobs. This meets the replication objectives of transparency and verifiability, as well as allowing others to extend our results by adapting our templates for other policy areas or data production tasks. More importantly, it meets a deeper replication standard that enables any scholar to reproduce the entirety of our data and results by redeploying our CrowdFlower jobs to a new crowd to generate completely new data. This is a very substantial advantage of crowd-sourced data coding.

CONCLUSIONS

Our core task was to replicate well-known empirical results using crowd-sourced data coding. We find crowd-sourced estimates can replace those generated by experts using the same sources, and in this sense are internally valid. Crowd-sourced estimates also correspond closely with those generated by completely independent contemporary expert surveys, by now a standard professional source. In this sense they are externally valid. Finally, both by analyzing repeated random subsamples drawn from our crowd-sourced codes, and by deploying identical crowd-sourced coding jobs two months apart, we have shown the estimates we generate using crowd-sourcing can be reliable.

Most importantly, our method is anonymous. The coders are not "our" coders, trained and socialized by us, but anonymous workers in the crowd. The data generation process involves taking published crowd-sourcing code, adding generic cash to it, and deploying the resulting job to a generic population of workers, distributed worldwide. Anyone with the inclination and modest funding can do this. Many different people can do it as many different times as they desire.

Unlike other applications of crowd-sourcing in the social sciences, which use workers in the crowd as substitutes for survey respondents or laboratory subjects, we do not care at all about the representativeness of our workers. What we care about is that different workers tend to perform the same task in the same way. We have shown that, with clear and simple instructions, combined with rigorous quality control, this can be achieved for the task of coding political texts, with replicable results that can be substituted for those generated by traditional expert coders.

We have here dealt with large texts: party manifestos that are on average 1,000 sentences long. For these, as we have seen, we obtain excellent results using five crowd coders per sentence, generating about 5,000 pieces of information for each document under investigation. The number of crowd coders we need will almost certainly depend upon the length of the document; for example, very short documents like single tweets might need many more crowd coders before we can converge on a reliable estimate of a position on some policy or sentiment scale. This is an important topic for future work.

Automated methods for text analysis are increasingly widely used in the social sciences. Crowd-sourcing can both make these methods more effective and solve problems that quantitative text analysis cannot (yet) address. Supervised methods require labeled training data and unsupervised methods require a posteriori interpretation, both of which can be provided either by experts or by the crowd. But for many tasks that require interpretation, particularly of information embedded in the syntax of language rather than in which words are used, untrained human coders in the crowd provide a combination of human intelligence and affordability that neither computers nor experts can beat.

Moving beyond coding text to other data coding exercises in the social sciences (for example, measuring positions on latent dimensions of corruption or democracy) will pose substantial research design questions that go far beyond crowd-sourcing. All coded data rely upon the text that supplies the qualitative raw information on which codes are based. Experts are expected to have read a lot, and to have processed the text they read, before they give us their judgments. Typically, however, the relevant text corpus is deeply implicit. Experts are also expected to know what to read. We cannot expect the same of workers in the crowd. What this means is that, in order to port our method from the analysis of a well-specified text corpus (for example, all party manifestos or legislative speeches or newspaper articles or even tweets, in country X between date Y and date Z) to a more general application in coding political systems for levels of corruption or democracy, we will need to think carefully about the precise specification of the corpus of raw information that will form the basis of judgments made by coders. This will be essential for coders in the crowd, but it would be wise for expert coding projects as well, if what we are seeking is a well-specified method of data production that yields replicable results.

Coded data, generated by experts who transform qualitative raw sources into quantitative measures, are ubiquitous in political science. The replication standard is typically met by releasing the dataset and analysis code, rarely by demonstrating that third-party scholars can independently reproduce the dataset itself. Crowd-sourced data coding for the social sciences offers the potential to transform this intellectual landscape. Precisely because the data-production process must be very clearly specified, so that it can be served to and understood by anonymous workers in the crowd, scattered around the world, we expect the same data production process, served to a different anonymous crowd, to generate the same results. This is why crowd-sourced data coding comes closer than most data production methods in the social sciences to satisfying a true scientific replication standard.
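The reliability check mentioned above, in which estimates are recomputed from repeated random subsamples of the crowd codes, can be illustrated with a short Python sketch. This is not the procedure used in the paper: as a simplifying assumption it replaces the scaling model with a plain mean-of-means document score, and the function name `subsample_document_scores` and the toy codes are hypothetical.

```python
import random
import statistics


def subsample_document_scores(codes_by_sentence, n_coders, n_draws, seed=0):
    """Recompute a document score from repeated random subsamples of codes.

    codes_by_sentence: list of lists, one list of numeric codes per sentence.
    For each draw, n_coders codes are sampled without replacement from every
    sentence, averaged within the sentence, and the sentence means are then
    averaged into one document score (a mean of means).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        sentence_means = []
        for codes in codes_by_sentence:
            k = min(n_coders, len(codes))
            sentence_means.append(statistics.mean(rng.sample(codes, k)))
        scores.append(statistics.mean(sentence_means))
    return scores


# Hypothetical toy data: four sentences, six crowd codes each, on a -2..2 scale.
toy_codes = [
    [-1, -1, 0, -2, -1, -1],
    [1, 2, 1, 1, 0, 1],
    [0, 0, 1, 0, -1, 0],
    [2, 1, 2, 2, 1, 2],
]

draws = subsample_document_scores(toy_codes, n_coders=5, n_draws=200)
spread = statistics.pstdev(draws)  # variability of the score across subsamples
full_score = statistics.mean([statistics.mean(c) for c in toy_codes])
```

A small `spread` relative to the differences between documents would suggest that five codes per sentence are enough for a stable estimate; repeating the exercise with smaller `n_coders` shows how quickly stability degrades.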

REFERENCES

Alonso, O., and R. Baeza-Yates. "Design and Implementation of Relevance Assessments Using Crowdsourcing." In Advances in Information Retrieval, ed. P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee and V. Mudoch: Springer Berlin / Heidelberg.
Alonso, O., and S. Mizzaro. Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. Paper read at Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation.
Ariely, D., W. T. Au, R. H. Bender, D. V. Budescu, C. B. Dietz, H. Gu, and G. Zauberman. "The effects of averaging subjective probability estimates between and within judges." Journal of Experimental Psychology: Applied 6 (2).
Armstrong, J.S., ed. Principles of Forecasting: A Handbook for Researchers and Practitioners: Springer.
Baker, Frank B, and Seock-Ho Kim. Item response theory: Parameter estimation techniques: CRC Press.
Benoit, Kenneth. "Policy positions in Britain 2005: results from an expert survey." London School of Economics.
Benoit, Kenneth. "Expert Survey of British Political Parties." Trinity College Dublin.
Benoit, Kenneth, and Michael Laver. Party Policy in Modern Democracies. London: Routledge.
Berinsky, A., G. Huber, and G. Lenz. "Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk." Political Analysis.
Berinsky, A., M. Margolis, and M. Sances. forthcoming. "Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-Administered Surveys." American Journal of Political Science.
Bohannon, J. "Social Science for Pennies." Science 334:307.
Budge, Ian, Hans-Dieter Klingemann, Andrea Volkens, Judith Bara, Eric Tannenbaum, Richard Fording, Derek Hearl, Hee Min Kim, Michael McDonald, and Silvia Mendes. Mapping Policy Preferences: Parties, Electors and Governments: Estimates for Parties, Electors and Governments. Oxford: Oxford University Press.
Cao, J, S. Stokes, and S. Zhang. "A Bayesian Approach to Ranking and Rater Evaluation: An Application to Grant Reviews." Journal of Educational and Behavioral Statistics 35 (2).
Carpenter, B. "Multilevel Bayesian models of categorical data annotation."
Chandler, Jesse, Pam Mueller, and Gabriel Paolacci. "Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers." Behavior Research Methods 46 (1).
Clemen, R., and R. Winkler. "Combining Probability Distributions From Experts in Risk Analysis." Risk Analysis 19 (2).
Conway, Drew. Applications of Computational Methods in Political Science, Department of Politics, New York University.
Däubler, Thomas, Kenneth Benoit, Slava Mikhaylov, and Michael Laver. "Natural sentences as valid units for coded political text." British Journal of Political Science 42 (4).
Eickhoff, C., and A. de Vries. "Increasing cheat robustness of crowdsourcing tasks." Information Retrieval 15:1-17.
Fox, Jean-Paul. Bayesian item response modeling: Theory and applications: Springer.
Galton, F. "Vox Populi." Nature 75.
Goodman, Joseph, Cynthia Cryder, and Amar Cheema. "Data Collection in a Flat World: Strengths and Weaknesses of Mechanical Turk Samples." Journal of Behavioral Decision Making 26 (3).
Hambleton, Ronald K, Hariharan Swaminathan, and H Jane Rogers. Fundamentals of item response theory: Sage.
Hooghe, Liesbet, Ryan Bakker, Anna Brigevich, Catherine de Vries, Erica Edwards, Gary Marks, Jan Rovny, Marco Steenbergen, and Milada Vachudova. "Reliability and Validity of Measuring Party Positions: The Chapel Hill Expert Surveys of 2002 and 2006." European Journal of Political Research 49 (5).
Horton, J., D. Rand, and R. Zeckhauser. "The online laboratory: conducting experiments in a real labor market." Experimental Economics 14.
Howe, Jeff. "The Rise of Crowdsourcing." Wired June.
Hsueh, P., P. Melville, and V. Sindhwani. Data quality from crowdsourcing: a study of annotation selection criteria. Paper read at Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing.
Ipeirotis, Panagiotis G., Foster Provost, Victor S. Sheng, and Jing Wang. "Repeated labeling using multiple noisy labelers." Data Mining and Knowledge Discovery:1-40.
Ipeirotis, Panagiotis, F. Provost, V. Sheng, and J. Wang. "Repeated Labeling Using Multiple Noisy Labelers." NYU Working Paper.
Jones, Frank R. Baumgartner and Bryan D. "Policy Agendas Project."
Kapelner, A., and D. Chandler. Preventing satisficing in online surveys: A `kapcha' to ensure higher quality data. Paper read at The World's First Conference on the Future of Distributed Work (CrowdConf 2010).
King, G. "Replication, replication." PS: Political Science and Politics 28 (3).
King, G. "Publication, publication." PS: Political Science and Politics 39 (1).
Klingemann, Hans-Dieter, Andrea Volkens, Judith Bara, Ian Budge, and Michael McDonald. Mapping Policy Preferences II: Estimates for Parties, Electors, and Governments in Eastern Europe, European Union and OECD. Oxford: Oxford University Press.
Krippendorff, Klaus. Content Analysis: An Introduction to Its Methodology. 3rd ed: Sage.
Laver, M. "Party policy in Britain 1997: Results from an expert survey." Political Studies 46 (2).
Laver, M., K. Benoit, and J. Garry. "Extracting policy positions from political texts using words as data." American Political Science Review 97 (2).
Laver, Michael, and John Garry. "Estimating policy positions from political texts." American Journal of Political Science 44.
Laver, Michael, and W. Ben Hunt. Policy and party competition. New York: Routledge.
Lawson, C., G. Lenz, A. Baker, and M. Myers. "Looking Like a Winner: Candidate appearance and electoral success in new democracies." World Politics 62 (4).
Lord, Frederic. Applications of item response theory to practical testing problems: Routledge.
Lyon, Aidan, and Eric Pacuit. "The Wisdom of Crowds: Methods of Human Judgement Aggregation." In Handbook of Human Computation, ed. P. Michelucci: Springer.
Marks, Gary et al. "Chapel Hill Expert Survey 2010."
Mason, W, and S Suri. "Conducting Behavioral Research on Amazon's Mechanical Turk." Behavior Research Methods 44 (1):1-23.
Mikhaylov, Slava, Michael Laver, and Kenneth Benoit. "Coder reliability and misclassification in comparative manifesto project codings." Political Analysis 20 (1).
Nowak, S., and S. Rüger. How reliable are annotations via crowdsourcing? a study about inter-annotator agreement for multi-label image annotation. Paper read at The 11th ACM International Conference on Multimedia Information Retrieval, Mar 2010, at Philadelphia, USA.
Paolacci, Gabriel, Jesse Chandler, and Panagiotis Ipeirotis. "Running experiments on Amazon Mechanical Turk." Judgement and Decision Making 5.
Quoc Viet Hung, Nguyen, Nguyen Thanh Tam, Lam Ngoc Tran, and Karl Aberer. "An Evaluation of Aggregation Techniques in Crowdsourcing." In Web Information Systems Engineering - WISE 2013, ed. X. Lin, Y. Manolopoulos, D. Srivastava and G. Huang: Springer Berlin Heidelberg.
Raykar, V. C., S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. "Learning from crowds." Journal of Machine Learning Research 11.
Ruedin, Didier. "Obtaining Party Positions on Immigration in Switzerland: Comparing Different Methods." Swiss Political Science Review 19 (1).
Ruedin, Didier, and Laura Morales. "Obtaining Party Positions on Immigration from Party Manifestos."
Sheng, V., F. Provost, and Panagiotis Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. Paper read at Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Slapin, Jonathan B., and Sven-Oliver Proksch. "A scaling model for estimating time series policy positions from texts." American Journal of Political Science 52 (3).
Snow, R., B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast but is it good?: evaluating non-expert annotations for natural language tasks. Paper read at Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Steyvers, M., M. Lee, B. Miller, and P. Hemmer. "The wisdom of crowds in the recollection of order information." In Advances in neural information processing systems, 23, ed. J. Lafferty and C. Williams. Cambridge, MA: MIT Press.
Surowiecki, J. The Wisdom of Crowds. New York: W.W. Norton & Company, Inc.
Turner, Brandon M., Mark Steyvers, Edgar C. Merkle, David V. Budescu, and Thomas S. Wallsten. "Forecast aggregation via recalibration." Machine Learning:1-29.
Volkens, Andrea. "Manifesto Coding Instructions, 2nd revised ed." In Discussion Paper (2001), p. 96., ed. W. Berlin.
Welinder, P., S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. Paper read at Advances in Neural Information Processing Systems 23 (NIPS 2010).
Welinder, P., and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. Paper read at IEEE Conference on Computer Vision and Pattern Recognition Workshops (ACVHL).
Whitehill, J., P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Paper read at Advances in Neural Information Processing Systems 22 (NIPS 2009).
Yi, S.K.M., M. Steyvers, M. Lee, and Matthew Dry. "The Wisdom of the Crowd in Combinatorial Problems." Cognitive Science 36 (3).

CROWD-SOURCED CODING OF POLITICAL TEXTS: SUPPLEMENTARY MATERIALS

Kenneth Benoit, London School of Economics and Trinity College, Dublin
Benjamin E. Lauderdale, London School of Economics
Drew Conway, New York University
Michael Laver, New York University
Slava Mikhaylov, University College London

May 21, 2014

CONTENTS

1. Economic and Social Scaling Estimates
2. JAGS code for model estimation
3. Expert survey estimates
4. Details on the Crowd Coders
5. Details on pre-testing the deployment method using semi-expert coders
6. Implementing manifesto coding on CrowdFlower

1. Economic and Social Scaling Estimates

a) Expert Estimates

            Sequential                          Random
Manifesto   Economic         Social             Economic         Social
Con         [0.91, 1.33]     [1.33, 1.98]       [0.82, 1.10]     [0.56, 1.31]
LD          [-1.08, -0.71]   [-1.25, -0.66]     [-0.71, -0.33]   [-1.45, -0.45]
Lab         [-2.40, -1.93]   [-1.33, -0.70]     [-1.61, -1.01]   [-1.35, -0.01]

Con         [1.08, 1.44]     [0.81, 1.32]       [0.72, 1.02]     [0.33, 1.07]
LD          [-0.99, -0.61]   [-2.20, -1.48]     [-0.61, -0.21]   [-2.16, -1.27]
Lab         [-1.51, -1.10]   [-1.98, -1.30]     [-0.99, -0.59]   [-1.97, -0.79]

Con         [1.00, 1.40]     [1.14, 1.73]       [0.78, 1.10]     [0.80, 1.49]
LD          [-1.27, -0.89]   [-1.79, -1.37]     [-0.44, -0.08]   [-1.44, -0.54]
Lab         [-0.75, -0.40]   [0.10, 0.58]       [-0.28, 0.03]    [0.13, 1.15]

Con         [1.44, 1.90]     [2.05, 2.68]       [0.91, 1.29]     [1.29, 2.10]
LD          [-1.16, -0.80]   [-1.67, -1.23]     [-0.76, -0.36]   [-1.43, -0.55]
Lab         [-1.00, -0.71]   [-0.02, 0.36]      [-0.80, -0.47]   [0.21, 0.73]

Con         [0.93, 1.59]     [1.48, 2.23]       [0.44, 1.00]     [0.98, 1.95]
LD          [-0.58, -0.10]   [-1.01, -0.40]     [-0.54, -0.18]   [-1.30, -0.22]
Lab         [-1.27, -0.86]   [0.64, 1.26]       [-0.84, -0.44]   [0.49, 1.21]

Con         [0.70, 1.15]     [0.45, 0.97]       [0.46, 0.85]     [0.22, 0.98]
LD          [-0.99, -0.59]   [-1.33, -0.66]     [-0.63, -0.19]   [-1.20, -0.29]
Lab         [-0.97, -0.60]   [0.14, 0.63]       [-0.55, -0.22]   [0.23, 1.04]

Correlation with Expert Survey Estimates
Correlation with Expert Mean of Means

Table 1: Model Estimates for Expert-coded Positions on Economic and Social Policy.

b) Crowd Estimates

Manifesto   Economic         Social
Con         [0.9, 1.23]      [-0.27, 0.17]
LD          [-1.03, -0.68]   [-2.03, -1.52]
Lab         [-1.99, -1.51]   [-2.6, -1.98]

Con         [1.09, 1.46]     [-0.83, -0.34]
LD          [-0.83, -0.39]   [-2.71, -2.06]
Lab         [-1.24, -0.66]   [-2.46, -1.75]

Con         [0.94, 1.28]     [-0.04, 0.43]
LD          [-0.99, -0.63]   [-2.07, -1.64]
Lab         [-0.82, -0.46]   [-1.68, -1.24]

Con         [1.49, 1.95]     [0.78, 1.55]
LD          [-0.76, -0.33]   [-2.28, -1.68]
Lab         [-0.98, -0.67]   [-1.73, -1.19]

Con         [1.29, 2.03]     [1.08, 1.9]
LD          [-0.2, 0.22]     [-1.98, -1.31]
Lab         [-0.73, -0.35]   [-1.2, -0.64]

Con         [1.18, 1.59]     [-0.14, 0.52]
LD          [-0.05, 0.46]    [-1.8, -1.15]
Lab         [0.02, 0.32]     [-1.04, -0.5]

Correlation with Expert Coding Estimates
Correlation with Crowd Mean of Means

Table 2: Model Estimates for Crowd-coded Positions on Economic and Social Policy.

c) Comparing expert sequential versus random order sentence coding

[Figure 1. Scale estimates from expert coding, comparing expert sequential and unordered sentence codings. Two panels, "Manifesto Placement Economic" and "Manifesto Placement Social", each with axes Sequential and Random.]

d) Cronbach's alpha for social policy scale

Item      N    Sign    Item-scale correlation    Item-rest correlation    Cronbach's alpha
Expert
Expert
Expert
Expert
Expert
Expert
Overall                                                                   0.95

Table 3. Inter-coder reliability analysis for the social policy scale generated by aggregating all expert scores for sentences judged to have social policy content.
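For readers unfamiliar with the statistic reported in Table 3, the following Python sketch computes Cronbach's alpha from a sentences-by-coders matrix using the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale), with each coder treated as an item. The function name `cronbach_alpha` and the toy scores are hypothetical, and the sketch is not the software used to produce Table 3.

```python
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (sentences) by columns (coders).

    Each coder is treated as one item of the scale. Sample variances
    (n - 1 in the denominator) are used throughout.
    """
    k = len(scores[0])  # number of coders (items)
    item_vars = [
        statistics.variance([row[j] for row in scores]) for j in range(k)
    ]
    total_var = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: five sentences scored by three coders on a -2..2 scale.
toy_scores = [
    [-2, -1, -2],
    [0, 0, 1],
    [1, 2, 1],
    [2, 2, 2],
    [-1, 0, -1],
]
alpha = cronbach_alpha(toy_scores)
```

Alpha approaches 1 when coders rank the sentences almost identically and falls toward 0 as their scores become unrelated.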

e) Coder-level diagnostics from economic and social policy coding

[Figure 2. Coder-level parameters for expert coders (names) and crowd coders (points). Top plots show offsets (psi) and sensitivities (chi) in assignment to social and economic categories versus none; bottom plots show offsets and sensitivities in assignment to left-right scale positions. Panels: "Coder Offsets on Topic Assignment" (Relative Tendency to Label as Economic against Relative Tendency to Label as Social); "Coder Sensitivity on Topic Assignment" (Relative Sensitivity to Presence of Economic Content against Relative Sensitivity to Presence of Social Content); "Coder Offsets on Left-Right Assignment" (Relative Tendency to Label as Right on Economic against Relative Tendency to Label as Right on Social); "Coder Sensitivity on Left-Right Assignment" (Relative Sensitivity to Economic Position against Relative Sensitivity to Social Position). Expert coders labeled: Alex, Iulia, Ken, Livio, Mik, Pablo, Sebestian, Slava.]

f) Convergence diagnostics

[Figure 3. MCMC traceplots for manifesto-level parameters for expert coders. Panels: Economics Code, Economics Left-Right, Social Code, Social Left-Right; horizontal axis: MCMC iteration.]

[Figure 4. MCMC traceplots for manifesto-level parameters for crowd coders. Panels: Economics Code, Economics Left-Right, Social Code, Social Left-Right; horizontal axis: MCMC iteration.]

2. JAGS code for model estimation

a) Economic and social policy scaling model

    model {
      # Index convention: in theta, psi and chi the last index is
      # 1 = code (topic assignment) and 2 = scale (left-right position);
      # the preceding index is the policy domain (econ/social).
      for (q in 1:Ncodings) {
        # Define latent response for code/scale in econ/social
        mucode[q,1] <- (theta[sentenceid[q],1,1] + psi[coderid[q],1,1])*chi[coderid[q],1,1];
        mucode[q,2] <- (theta[sentenceid[q],2,1] + psi[coderid[q],2,1])*chi[coderid[q],2,1];
        muscale[q,1] <- (theta[sentenceid[q],1,2] + psi[coderid[q],1,2])*chi[coderid[q],1,2];
        muscale[q,2] <- (theta[sentenceid[q],2,2] + psi[coderid[q],2,2])*chi[coderid[q],2,2];

        # Translate latent responses into 11 category probabilities (up to normalization)
        mu[q,1] <- 1;
        mu[q,2] <- exp(mucode[q,1])*(ilogit(-1*cut[2] - muscale[q,1]));
        mu[q,3] <- exp(mucode[q,1])*(ilogit(-1*cut[1] - muscale[q,1])-ilogit(-1*cut[2] - muscale[q,1]));
        mu[q,4] <- exp(mucode[q,1])*(ilogit(1*cut[1] - muscale[q,1])-ilogit(-1*cut[1] - muscale[q,1]));
        mu[q,5] <- exp(mucode[q,1])*(ilogit(1*cut[2] - muscale[q,1])-ilogit(1*cut[1] - muscale[q,1]));
        mu[q,6] <- exp(mucode[q,1])*(1-ilogit(1*cut[2] - muscale[q,1]));
        mu[q,7] <- exp(mucode[q,2])*(ilogit(-1*cut[2] - muscale[q,2]));
        mu[q,8] <- exp(mucode[q,2])*(ilogit(-1*cut[1] - muscale[q,2])-ilogit(-1*cut[2] - muscale[q,2]));
        mu[q,9] <- exp(mucode[q,2])*(ilogit(1*cut[1] - muscale[q,2])-ilogit(-1*cut[1] - muscale[q,2]));
        mu[q,10] <- exp(mucode[q,2])*(ilogit(1*cut[2] - muscale[q,2])-ilogit(1*cut[1] - muscale[q,2]));
        mu[q,11] <- exp(mucode[q,2])*(1-ilogit(1*cut[2] - muscale[q,2]));

        # 11 category multinomial
        Y[q] ~ dcat(mu[q,1:11]);
      }

      # Specify uniform priors for ordinal thresholds (assumes left-right symmetry)
      cut[1] ~ dunif(0,5);
      cut[2] ~ dunif(cut[1],10);

      # Priors for coder bias parameters
      for (i in 1:Ncoders) {
        psi[i,1,1] ~ dnorm(0,taupsi[1,1]);
        psi[i,2,1] ~ dnorm(0,taupsi[2,1]);
        psi[i,1,2] ~ dnorm(0,taupsi[1,2]);
        psi[i,2,2] ~ dnorm(0,taupsi[2,2]);
      }

      # Priors for coder sensitivity parameters (normal truncated below at zero)
      for (i in 1:Ncoders) {
        chi[i,1,1] ~ dnorm(0,1)T(0,);
        chi[i,2,1] ~ dnorm(0,1)T(0,);
        chi[i,1,2] ~ dnorm(0,1)T(0,);
        chi[i,2,2] ~ dnorm(0,1)T(0,);
      }

      # Priors for sentence latent parameters
      for (j in 1:Nsentences) {
        theta[j,1,1] ~ dnorm(thetabar[manifestoidforsentence[j],1,1],tautheta[1,1]);
        theta[j,2,1] ~ dnorm(thetabar[manifestoidforsentence[j],2,1],tautheta[2,1]);
        theta[j,1,2] ~ dnorm(thetabar[manifestoidforsentence[j],1,2],tautheta[1,2]);
        theta[j,2,2] ~ dnorm(thetabar[manifestoidforsentence[j],2,2],tautheta[2,2]);
      }

      # Priors for manifesto latent parameters
      for (k in 1:Nmanifestos) {
        thetabar[k,1,1] ~ dnorm(0,1);
        thetabar[k,2,1] ~ dnorm(0,1);
        thetabar[k,1,2] ~ dnorm(0,1);
        thetabar[k,2,2] ~ dnorm(0,1);
      }

      # Variance parameters (JAGS dnorm takes a precision, so these are precisions)
      taupsi[1,1] ~ dgamma(1,1);
      taupsi[2,1] ~ dgamma(1,1);
      taupsi[1,2] ~ dgamma(1,1);
      taupsi[2,2] ~ dgamma(1,1);
      tautheta[1,1] ~ dgamma(1,1);
      tautheta[2,1] ~ dgamma(1,1);
      tautheta[1,2] ~ dgamma(1,1);
      tautheta[2,2] ~ dgamma(1,1);
    }

b) Immigration policy scaling model

    model {
      for (q in 1:Ncodings) {
        # Define latent response for code/scale on immigration
        mucode[q] <- (theta[sentenceid[q],1] + psi[coderid[q],1])*chi[coderid[q],1];
        muscale[q] <- (theta[sentenceid[q],2] + psi[coderid[q],2])*chi[coderid[q],2];

        # Translate latent responses into 4 category probabilities (up to normalization)
        mu[q,1] <- 1;
        mu[q,2] <- exp(mucode[q])*(ilogit(-1*cut[1] - muscale[q]));
        mu[q,3] <- exp(mucode[q])*(ilogit(1*cut[1] - muscale[q])-ilogit(-1*cut[1] - muscale[q]));
        mu[q,4] <- exp(mucode[q])*(1-ilogit(1*cut[1] - muscale[q]));

        # 4 category multinomial
        Y[q] ~ dcat(mu[q,1:4]);
      }

      # Specify uniform priors for ordinal thresholds (assumes left-right symmetry)
      cut[1] ~ dunif(0,10);

      # Priors for coder bias parameters
      for (i in 1:Ncoders) {
        psi[i,1] ~ dnorm(0,taupsi[1]);
        psi[i,2] ~ dnorm(0,taupsi[2]);
      }

      # Priors for coder sensitivity parameters (normal truncated below at zero)
      for (i in 1:Ncoders) {
        chi[i,1] ~ dnorm(0,1)T(0,);
        chi[i,2] ~ dnorm(0,1)T(0,);
      }

      # Priors for sentence latent parameters
      for (j in 1:Nsentences) {
        theta[j,1] ~ dnorm(thetabar[manifestoidforsentence[j],1],tautheta[1]);
        theta[j,2] ~ dnorm(thetabar[manifestoidforsentence[j],2],tautheta[2]);
      }

      # Priors for manifesto latent parameters
      for (k in 1:Nmanifestos) {
        thetabar[k,1] ~ dnorm(0,1);
        thetabar[k,2] ~ dnorm(0,1);
      }

      # Variance parameters (JAGS dnorm takes a precision, so these are precisions)
      taupsi[1] ~ dgamma(1,1);
      taupsi[2] ~ dgamma(1,1);
      tautheta[1] ~ dgamma(1,1);
      tautheta[2] ~ dgamma(1,1);
    }
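To make the "up to normalization" step in the economic and social model concrete, the Python sketch below reproduces the construction of the eleven category weights for a single coding and normalizes them, as `dcat` does implicitly. It is an illustration only: the function names and the numeric inputs are hypothetical, and the sketch takes the latent responses `mucode` and `muscale` as given rather than estimating them.

```python
import math


def ilogit(x):
    """Inverse logit, matching the JAGS ilogit function."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_probs(muscale, cut):
    """Probabilities of the five ordered scale categories for one domain.

    cut = (cut1, cut2) with 0 < cut1 < cut2; the four thresholds are
    -cut2, -cut1, +cut1, +cut2, which is the left-right symmetry assumed
    in the model code.
    """
    c1, c2 = cut
    edges = [
        ilogit(-c2 - muscale),
        ilogit(-c1 - muscale),
        ilogit(c1 - muscale),
        ilogit(c2 - muscale),
    ]
    return [
        edges[0],
        edges[1] - edges[0],
        edges[2] - edges[1],
        edges[3] - edges[2],
        1.0 - edges[3],
    ]


def category_probs(mucode, muscale, cut):
    """Normalized probabilities of the 11 response categories.

    Category 0 is "neither domain" (weight 1); categories 1-5 belong to the
    first domain and 6-10 to the second, each weighted by exp(mucode[d]).
    """
    weights = [1.0]
    for d in range(2):
        weights += [math.exp(mucode[d]) * p for p in scale_probs(muscale[d], cut)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical inputs: neutral latent responses and cutpoints 1.0 and 2.5.
probs = category_probs(mucode=[0.0, 0.0], muscale=[0.0, 0.0], cut=(1.0, 2.5))
```

With all latent responses at zero, the "neither" category and each of the two domains receive equal total weight, so `probs[0]` is one third. Raising `mucode[d]` moves mass into domain `d`, while raising `muscale[d]` moves mass within that domain toward the higher-numbered scale categories.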

3. Expert survey estimates

These are taken from Laver and Hunt (1992); Laver (1998) for 1997; Benoit and Laver (2006) for 2001; Benoit (2005, 2010) for 2005 and 2010. For reference, and because the results from Benoit (2005, 2010) were never published, we report them here.

Party   Party Name                          Year   Dimension   Mean   N   SE
Con     Conservative Party                  1987   Economic
Lab     Labour Party                        1987   Economic
LD      Liberal Democrats                   1987   Economic
PCy     Plaid Cymru                         1987   Economic
SNP     Scottish National Party             1987   Economic
Con     Conservative Party                  1997   Economic
Lab     Labour Party                        1997   Economic
LD      Liberal Democrats                   1997   Economic
PCy     Plaid Cymru                         1997   Economic
SNP     Scottish National Party             1997   Economic
Con     Conservative Party                  2001   Economic
Lab     Labour Party                        2001   Economic
LD      Liberal Democrats                   2001   Economic
PCy     Plaid Cymru                         2001   Economic
SNP     Scottish National Party             2001   Economic
BNP     British National Party              2005   Economic
Con     Conservative Party                  2005   Economic
Lab     Labour Party                        2005   Economic
LD      Liberal Democrats                   2005   Economic
PCy     Plaid Cymru                         2005   Economic
SNP     Scottish National Party             2005   Economic
UKIP    UK Independence Party               2005   Economic
BNP     British National Party              2010   Economic
Con     Conservative Party                  2010   Economic
GPEW    Green Party of England and Wales    2010   Economic
Lab     Labour Party                        2010   Economic
LD      Liberal Democrats                   2010   Economic
PCy     Plaid Cymru                         2010   Economic
SNP     Scottish National Party             2010   Economic
SSP     Scottish Socialist Party            2010   Economic
UKIP    UK Independence Party               2010   Economic

Table 4. Expert Survey Estimates of UK Political Parties, Economic Dimension.

Party   Party Name                          Year   Dimension     Mean   N   SE
Con     Conservative Party                  1987   Social
Lab     Labour Party                        1987   Social
LD      Liberal Democrats                   1987   Social
PCy     Plaid Cymru                         1987   Social
SNP     Scottish National Party             1987   Social
Con     Conservative Party                  1997   Social
Lab     Labour Party                        1997   Social
LD      Liberal Democrats                   1997   Social
PCy     Plaid Cymru                         1997   Social
SNP     Scottish National Party             1997   Social
Con     Conservative Party                  2001   Social
Lab     Labour Party                        2001   Social
LD      Liberal Democrats                   2001   Social
PCy     Plaid Cymru                         2001   Social
SNP     Scottish National Party             2001   Social
BNP     British National Party              2005   Social
Con     Conservative Party                  2005   Social
Lab     Labour Party                        2005   Social
LD      Liberal Democrats                   2005   Social
PCy     Plaid Cymru                         2005   Social
SNP     Scottish National Party             2005   Social
UKIP    UK Independence Party               2005   Social
BNP     British National Party              2010   Social
Con     Conservative Party                  2010   Social
GPEW    Green Party of England and Wales    2010   Social
Lab     Labour Party                        2010   Social
LD      Liberal Democrats                   2010   Social
PCy     Plaid Cymru                         2010   Social
SNP     Scottish National Party             2010   Social
SSP     Scottish Socialist Party            2010   Social
UKIP    UK Independence Party               2010   Social
BNP     British National Party              2010   Immigration
Con     Conservative Party                  2010   Immigration
GPEW    Green Party of England and Wales    2010   Immigration
Lab     Labour Party                        2010   Immigration
LD      Liberal Democrats                   2010   Immigration
PCy     Plaid Cymru                         2010   Immigration
SNP     Scottish National Party             2010   Immigration
SSP     Scottish Socialist Party            2010   Immigration
UKIP    UK Independence Party               2010   Immigration
UKIP    UK Independence Party               2010   Economic

Table 5. Expert Survey Estimates of UK Political Parties, Social and Immigration Dimensions.

4. Details on the Crowd Coders

Country      Total Codings   % Codings   Unique Coders   Mean Trust Score
USA          60,
GBR          33,
IND          22,
ESP          12,
EST          10,
DEU          9,
HUN          9,
HKG          7,
CAN          7,
POL          6,
HRV          4,
A1           3,
AUS          2,
MEX          2,
ROU          2,
NLD          2,
PAK          1,
IDN          1,
CZE          1,
GRC          1,
SRB          1,
LTU          1,
DOM
ZAF
ITA
IRL
MKD
ARG
BGR
DNK
VNM
TUR
PHL
FIN
PRT
MAR
MYS
Other (12)
Overall      215,

Table 6. Country Origins and Trust Scores of the Crowd Coders.

                                            Trust Score
Channel         Total Codings   % Codings   Mean   95% CI
Neodev          85,                                [0.83, 0.83]
Amt             39,                                [0.84, 0.85]
Bitcoinget      32,                                [0.88, 0.88]
Clixsense       28,                                [0.81, 0.81]
Prodege         12,                                [0.83, 0.83]
Probux          5,                                 [0.83, 0.83]
Instagc         4,                                 [0.81, 0.81]
Rewardingways   2,                                 [0.89, 0.90]
Coinworker      1,                                 [0.90, 0.90]
Taskhunter      1,                                 [0.80, 0.81]
Taskspay                                           [0.78, 0.78]
Surveymad                                          [0.82, 0.82]
Fusioncash                                         [0.80, 0.82]
Getpaid                                            [0.81, 0.82]
Other (12)                                         [0.88, 0.91]
Total           215,                               [0.84, 0.84]

Table 7. CrowdFlower Crowd Channels and Associated Mean Trust Scores.

Columns: Job ID; Date Launched; Sentences; Gold Sentences; Codings (Trusted, Untrusted); Minimum Payment Per Task; Sentences Per Task; Mean Codings Per Sentence; Cost Per Trusted Code; Cost; Dimension; Countries; Channels.

Feb-14 7, $ $0.02 $ Immigration Many Many
Dec-13 7, $ $0.02 $ Immigration Many Many
Dec-13 12, $ $0.03 $1, Econ/Social Many Many
Nov $ $0.03 $7.38 Econ/Social Many Many
Oct $ $0.03 $27.72 Econ/Social Many Many
Oct-13 2, $ $0.03 $ Econ/Social Many Many
Oct $ $0.03 $79.59 Econ/Social Many Many
Oct-13 1, $ $0.04 $ Econ/Social Many Many
Oct-13 2, $ $0.04 $ Econ/Social Many Many
Oct $ $0.04 $ Econ/Social Many Many
Oct $ $0.05 $27.64 Econ/Social US MT
Oct $ $0.04 $ Econ/Social US MT
Sep $ $0.04 $81.28 Econ/Social US MT
Oct $ $0.04 $ Econ/Social US MT
Sep $ $0.34 $27.14 Econ/Social All MT
Sep $ $0.14 $66.57 Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.11 $ Econ/Social All MT
Sep $ $0.06 $93.27 Econ/Social All MT
Sep $ $0.06 $83.73 Econ/Social All MT
Sep $ $0.12 $83.73 Econ/Social All MT

Table 8. Details on Phased CrowdFlower Job Deployments for Economic and Social Text Policy Coding.

5. Details on pre-testing the deployment method using semi-expert coders

Design

Our design of the coding platform followed several key requirements of crowd-sourcing, namely that the coding be split into sentence-level tasks with clear instructions, aimed only at the specific policy dimensions we have already identified. This involved several key decisions, which we settled on following extensive tests on expert coders (including the authors and several PhD-level coders with expertise in party politics) and on semi-experts consisting of trained postgraduate students. These coders were given a set of experimental texts in which the features being tested were varied experimentally, so that we could identify the design with the highest reliability and validity. These decisions were: whether to serve the sentences in sequence or randomly; whether to identify the document being coded; and how many contextual sentences to display for the sentence.

Sequential versus unordered sentences

In what we call classical expert coding, experts typically start at the beginning of a document and work through, sentence by sentence, to the end.[1] From a practical point of view, however, most workers in the crowd will code only small sections of an entire long policy document. From a theoretical point of view, moreover, coding sentences in their natural sequence creates a situation in which coding one sentence may well affect priors for subsequent sentence codings, with the result that some sentence codings may be affected by how immediately preceding sentences have been coded. In particular, sentences in sequence tend to display runs of similar topics, and hence codes, given the natural tendency of authors to organize a text into clusters of similar topics. To mitigate the tendency of coders to pass judgment on text units in runs, without considering each sentence on the grounds of its own content, we tested whether text coding produced more stable results when sentences were served up unordered rather than in the sequence of the text.

Anonymous texts versus named texts

In serving up sentence coding tasks, another option is whether to identify the texts by name, or instead for them to remain anonymous.[2] Especially in relation to a party manifesto, it is not necessary to read very far into the document, even if the cover and title page have been ripped off, to figure out which party wrote it; indeed, we might reasonably deem a coder who cannot figure this out to be unqualified. Coders will likely bring non-zero priors to coding manifesto sentences: precisely the same sentence ("we must do all we can to make the public sector more efficient") may be coded in different ways if the coder knows this comes from a right- rather than a left-wing party. Yet codings are typically aggregated into estimated document scores as if coders had zero priors. We don't really know how much of the score given to any given sentence in classical expert coding is the coder's judgment about the actual content of the sentence, and how much is a judgment about its author. Accordingly, in our preliminary coding experiment, expert coders coded the same manifesto sentences both knowing and not knowing the name of the author.

[1] We may leave open the sequence in which documents are coded, or make explicit decisions about this, such as coding according to date of authorship.
[2] Of course, many of the party manifestos we used made references to their own party names, making it fairly obvious which party wrote the manifesto. In these cases we did not make any effort to anonymize the text, as we did not wish to risk altering the meaning.

Providing context for the target sentence

Given the results we note in the previous two sections, our crowd-sourcing method will specify the atomic crowd-sourced text coding task as coding a target sentence selected at random from a text, with the name of the author not revealed. This leaves open the issue of how much context either side of the target sentence we provide to assist the coder. The final objective of our preliminary coding experiment was to assess the effects of providing no context at all, or a one- or two-sentence context either side of the target. To test the effects on reliability, our pre-test experiments provided the same sentences, in random order, to the semi-expert coders with zero, one, and two sentences of context before and after the sentence to be coded.

Results of the pre-testing

We pre-tested the coding scheme decisions on a sample consisting of all co-authors of this paper, three additional expert coders trained personally by the authors, and 30 semi-expert coders who were Masters students in courses on applied text analysis at either LSE or UCL. (The detailed design for the administration of treatments to coders is available from the authors.) To assess coder reliability, we also created a set of 120 gold standard sentences whose codes were unanimously agreed by the expert coders. Using an experimental design in which each coder in the test panel coded each sentence multiple times, in random order, with variation across the three treatments, we gathered sufficient information to predict misclassification tendencies from the coding set using a multinomial logistic model. (Details may be found in the Appendix.) The results pointed to a minimization of misclassification by: a) serving up coding tasks with unordered sentences, b) not identifying the author of the text, and c) providing two sentences of context before and after each sentence to be coded.
The most significant finding was that coders had a mild but significant tendency to code the same sentences differently when they associated the known author of the text with a particular position. Specifically, they tended to code precisely the same sentences from Conservative manifestos as more right wing if they knew that these sentences came from a Conservative manifesto. We also found a slight but significantly better correspondence between coder judgments and gold codings when we provided a context of two sentences before and after the sentence to be coded. This informed our decision to settle on a two-sentence context for our crowd-sourcing method.

The aim of this methodological experiment was to assess the effects of: coding manifestos in their natural sequence or in random order (Treatment 1); providing a +/- two-sentence context for the target sentence (Treatment 2); and revealing the title of the manifesto and hence the name of its author (Treatment 3). The text corpus to be coded was a limited but carefully curated set of 120 sentences. These sentences were chosen on the basis of the classical expert coding (ES) phase of our work to include a balance between expert-coded economic and social policy content, with only a few sentences having no economic or social policy content. We removed some surrounding sentences that had proper party names in them, to maintain a degree of manifesto anonymity. The coder pool comprised three expert coders, all co-authors of this paper, and 30 semi-expert coders who were Masters students in Methods courses at either LSE or UCL. The detailed design for the administration of treatments to coders is available from the authors. The analysis depends in part on the extent to which the semi-expert coders agreed with a master or gold coding for each sentence, which we specified as the majority scale and code from the three expert coders.
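The majority rule for the master coding can be illustrated with a minimal sketch. It assumes that each expert's judgment is stored as a (scale, code) pair and that a strict majority is required; the function name `majority_code`, the data layout, and the handling of ties are hypothetical choices, not taken from the replication materials.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_code(codes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the code chosen by a strict majority of coders, or None if no such code exists."""
    if not codes:
        return None
    code, count = Counter(codes).most_common(1)[0]
    # Require more than half of the coders to agree (assumption: ties yield no master code).
    return code if count > len(codes) / 2 else None


# Hypothetical (scale, code) judgments from three expert coders for one sentence.
expert_codes = [("economic", 1), ("economic", 1), ("social", 0)]
gold = majority_code(expert_codes)
```

With three coders, a sentence on which all three disagree has no majority and would need separate treatment.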
For each sentence that was master-coded as referring to none, economic, or social policy, Table 9 reports exponentiated coefficients from a multinomial logit predicting how a coder would classify a sentence, using the sentence variables as covariates. This allows direct computation of misclassification, given a set of controls. Since all variables are binary, we report odds ratios. Thus the highlighted coefficient in Model 1 means that, when the master coding says the sentence concerns neither economic nor social policy, the odds of a coder misclassifying the sentence as economic policy were about 3.3 times higher if the sentence displayed a title, all other things held constant. More generally, we see from Table 9 that providing a +/- two-sentence context does tend to reduce misclassifications (with odds ratios less than 1.0), while showing the coder the manifesto title does tend to increase misclassification (with odds ratios greater than 1.0).

Confining the data to sentence codings for which the coder agreed with the master coding on the policy area covered by the sentence, Table A2 reports an ordinal logit of the positional codes assigned by non-expert coders, controlling for fixed effects of the manifesto. The base category is the relatively centrist Liberal Democrat manifesto. The main quantities of interest estimate the interactions of the assigned positional codes with title and context treatments. If there is no effect of title or context, then these interactions should add nothing. If revealing the title of the manifesto makes a difference, this should, for example, move economic policy codings to the left for a party like Labour, and to the right for the Conservatives. The highlighted coefficients show that this is a significant effect, though only for Conservative manifestos.
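The odds-ratio reading of the multinomial logit can be written out explicitly. The notation below is an illustrative assumption rather than the authors' own: Y is the domain a coder assigns, r is the reference outcome (taken here to be the master-coded domain), and x collects the binary Context, Sequential and Title indicators.

```latex
% Assumed notation: Y = domain assigned by the coder, r = reference (master-coded) domain,
% x = (Context, Sequential, Title) binary treatment indicators.
\[
\Pr(Y = k \mid x) = \frac{\exp(\alpha_k + x^{\top}\beta_k)}{\sum_{j}\exp(\alpha_j + x^{\top}\beta_j)},
\qquad \alpha_r = 0,\ \beta_r = 0 .
\]
% Odds of classifying into domain k rather than the reference domain r:
\[
\frac{\Pr(Y = k \mid x)}{\Pr(Y = r \mid x)} = \exp(\alpha_k + x^{\top}\beta_k).
\]
% Switching Title from 0 to 1, with the other treatments held fixed, multiplies these odds by
\[
\mathrm{OR}_{k,\mathrm{Title}} = \exp(\beta_{k,\mathrm{Title}}).
\]
```

Under this reading, an odds ratio of about 3.3 multiplies the odds of that misclassification by 3.3 when the title is shown, while an odds ratio below 1.0 (as for context) shrinks them.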

Equation / Independent Variable / Master Domain: (1) Neither, (2) Economic, (3) Social

Economic
  Context 0.492* ( ) ( )
  Sequential ( ) ( )
  Title 3.272*** ( ) ( )
Social
  Context ( ) ( )
  Sequential ( ) ( )
  Title 1.540** ( ) ( )
None
  Context 0.478*** ( ) ( )
  Sequential ** ( ) ( )
  Title ( ) ( )

N 750 3,060 1,590
Odds ratios (95% confidence intervals), *** p<0.01, ** p<0.05, * p<0.1

Table 9: Domain Misclassification in Semi-Expert Coding Experiments.

Independent Variable / (4) Economic, (5) Social: Coded [-1, 0, 1] / (6) Economic, (7) Social: Coded [-2, -1, 0, 1, 2]

Con *** 158.7*** 9.939*** 286.8*** ( ) ( ) ( ) ( )
Lab ( ) ( ) ( ) ( )
Con *** 4.248*** 4.385*** 10.80*** ( ) ( ) ( ) ( )
LD ( ) ( )
Lab *** 328.0*** 4.554*** 1,004*** ( ) ( ) ( ) ( ,099)
Context 0.386*** *** ( ) ( ) ( ) ( )
Context * Con ** ** ( ) ( ) ( ) ( )
Context * Lab ** 0.373** ( ) ( ) ( ) ( )
Context * Con *** *** ( ) ( ) ( ) ( )
Context * LD *** 2.645*** ( ) ( )
Context * Lab ( ) ( ) ( ) ( )
Title 0.506*** ** 0.87 ( ) ( ) ( ) ( )
Title * Con ** ** ( ) ( ) ( ) ( )
Title * Lab ( ) ( ) ( ) ( )
Title * Con ** 2.080* ( ) ( ) ( ) ( )
Title * LD ( ) ( )
Title * Lab ( ) ( ) ( ) ( )
Sequential ( ) ( ) ( ) ( )
Observations 2,370 1,481 2,370 1,481

Implementing manifesto coding on CrowdFlower

Once gold data have been identified, CF has a flexible system for working with many different types of crowd-sourcing task. In our case, preparing the manifesto texts for CF coders requires converting the text into a matrix-organized dataset with one natural sentence per row. CF uses its own proprietary markup language, CrowdFlower Markup Language (CML), to build jobs on the platform. The language is based entirely on HTML, and contains only a small set of special features that are needed to link the data being used for the job to the interface itself.

To create the coding tasks themselves, some additional markup is needed. Here we use two primary components: a text chunk to be coded, and the coding interface. To provide context for the text chunk, we include two sentences of preceding and following manifesto text, in-line with the sentence being coded. The line to be coded is colored red to highlight it. The data are then linked to the job using CML, and the CF platform will then serve up the coding tasks as they appear in the dataset.

To design the interface itself we use CML to design the form menus and buttons, but must also link the form itself to the appropriate data. Unlike the sentence chunk, however, for the interface we need to tell the form which columns in our data will be used to store the workers' codings, rather than where to pull data from. In addition, we need to alert the CF platform as to which components in the interface are used in gold questions.
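The one-sentence-per-row dataset with two sentences of context on either side can be sketched as follows. This is a minimal illustration, not the authors' preparation script: the column names `pre_sentence`, `sentence_text` and `post_sentence` follow the placeholders in the CML figure, while the helper names and the choice of CSV as the output format are assumptions.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["pre_sentence", "sentence_text", "post_sentence"]


def build_task_rows(sentences: List[str], context: int = 2) -> List[Dict[str, str]]:
    """Build one row per sentence, with up to `context` sentences before and after it."""
    rows = []
    for i, sentence in enumerate(sentences):
        # Slices are clipped at the document boundaries, so edge sentences get less context.
        pre = " ".join(sentences[max(0, i - context):i])
        post = " ".join(sentences[i + 1:i + 1 + context])
        rows.append({"pre_sentence": pre, "sentence_text": sentence, "post_sentence": post})
    return rows


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the task rows as CSV text (assumed upload format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical four-sentence document.
sentences = ["S1.", "S2.", "S3.", "S4."]
rows = build_task_rows(sentences)
csv_text = rows_to_csv(rows)
```

Keeping the context in separate columns lets the interface highlight only the target sentence while still displaying its neighbours.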

Figure 5a shows a screenshot of the coding interface as deployed, and Figure A2 shows the CML used to design this interface. With all aspects of the interface designed, the CF platform uses each row in our data set to populate tasks, and links back the necessary data. Each coding task is served up randomly by CF to its pool of workers, and the job runs on the platform until the desired number of trusted judgments has been collected.

Our job settings for each CrowdFlower job are reported in Table 8. Full materials including all of the data files, CML, and instructions required to replicate the data production process on CrowdFlower are provided in the replication materials.

Figure 5a. Screenshot of text coding platform, implemented in CrowdFlower.

Figure 2b. Screenshot of text coding platform, implemented in CrowdFlower (continued).

{{pre_sentence}} <font color="red">{{sentence_text}}</font> {{post_sentence}}

<cml:select label="policy area" class="" instructions="" id="" validates="required" gold="true" name="policy_area">
  <cml:option label="not economic or social" id="" value="1"></cml:option>
  <cml:option label="economic" value="2" id=""></cml:option>
  <cml:option label="social" value="3" id=""></cml:option>
</cml:select>

<cml:ratings class="" from="" to="" label="economic policy scale" points="5" name="econ_scale" only-if="policy_area:[2]" gold="true" matcher="range">
  <cml:rating label="very left" value="-2"></cml:rating>
  <cml:rating label="somewhat left" value="-1"></cml:rating>
  <cml:rating label="neither left nor right" value="0"></cml:rating>
  <cml:rating label="somewhat right" value="1"></cml:rating>
  <cml:rating label="very right" value="2"></cml:rating>
</cml:ratings>

<cml:ratings class="" from="" to="" label="social policy scale" name="soc_scale" points="5" only-if="policy_area:[3]" gold="true" matcher="range">
  <cml:rating label="very liberal" value="-2"></cml:rating>
  <cml:rating label="somewhat liberal" value="-1"></cml:rating>
  <cml:rating label="neither liberal nor conservative" value="0"></cml:rating>
  <cml:rating label="somewhat conservative" value="1"></cml:rating>
  <cml:rating label="very conservative" value="2"></cml:rating>
</cml:ratings>

Figure 6. CrowdFlower Markup Language used for Economic and Social Coding.
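The conditional structure of the markup in Figure 6, where a scale is only shown once the matching policy area has been selected, can be mirrored in a local consistency check on downloaded judgments. The sketch below is an assumption-laden illustration: the function `is_valid_judgment` is hypothetical and not part of the CrowdFlower platform, and it assumes an unanswered scale is recorded as `None`.

```python
from typing import Optional

# Values taken from the option and rating tags in the CML figure.
POLICY_AREAS = {1: "not economic or social", 2: "economic", 3: "social"}
SCALE_VALUES = {-2, -1, 0, 1, 2}


def is_valid_judgment(policy_area: int,
                      econ_scale: Optional[int] = None,
                      soc_scale: Optional[int] = None) -> bool:
    """Check that a judgment respects the conditional logic of the coding form."""
    if policy_area not in POLICY_AREAS:
        return False
    if policy_area == 2:
        # Economic sentences need an economic score and no social score.
        return econ_scale in SCALE_VALUES and soc_scale is None
    if policy_area == 3:
        # Social sentences need a social score and no economic score.
        return soc_scale in SCALE_VALUES and econ_scale is None
    # Sentences in neither domain carry no positional score.
    return econ_scale is None and soc_scale is None
```

For example, `is_valid_judgment(2, econ_scale=-1)` passes, whereas a social sentence carrying an economic score does not.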

Figure 7. Immigration Policy Coding Instructions.
