Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora


Ludovic Rheault and Christopher Cochrane

Abstract

Word embeddings, the coefficients from neural network models predicting the use of words in context, have now become inescapable in applications involving natural language processing and artificial intelligence. Despite a few studies in political science, the potential of this methodology for the analysis of political texts has yet to be fully uncovered. This paper introduces models of word embeddings augmented with political metadata and trained on large-scale parliamentary corpora from Britain, Canada, and the United States. We fit these models with indicator variables of the party affiliation of members of parliament, which we call party embeddings. We illustrate how these embeddings can be used to produce scaling estimates of ideological placement and other quantities of interest for political research. To validate the methodology, we assess our results against indicators from the Comparative Manifesto Project and measures based on roll-call votes. Our findings suggest that party embeddings are successful at capturing latent concepts such as ideology, and the approach provides researchers with an integrated framework for studying political language.

Author affiliations: Assistant Professor, Department of Political Science and Munk School of Global Affairs and Public Policy, University of Toronto. Associate Professor, Department of Political Science, University of Toronto.

1 Introduction

The representation of meaning is a fundamental objective in natural language processing. As a basic illustration, consider queries performed with a search engine. We ideally want computers to return documents that are relevant to the substantive meaning of a query, just as a human being would interpret it, rather than simply the records with an exact word match. To achieve such results, early methods such as latent semantic analysis relied on low-rank approximations of word frequencies to score the semantic similarity between texts and rank them by relevance (Deerwester et al. 1990; Manning, Raghavan, and Schütze 2009, Ch. 18). The new state of the art in meaning representation is word embeddings, or word vectors, the parameter estimates of artificial neural networks designed to predict the occurrence of a word from the surrounding words in a text sequence. Consistent with the use theory of meaning (Wittgenstein 2009), these embeddings have been shown to capture semantic properties of language, revealed by an ability to solve analogies and identify synonyms (Mikolov, Sutskever, Chen, Corrado, and Dean 2013; Pennington, Socher, and Manning 2014).

Despite a broad appeal across disciplines, the use of word embeddings to analyze political texts remains a new field of research.[1] The aim of this paper is to examine the reliability of the methodology for the detection of a latent concept such as political ideology. In particular, we show that neural networks for word embeddings can be augmented with metadata available in parliamentary corpora. We illustrate the properties of this method using publicly available corpora from Britain, Canada and the United States, and assess its validity using external indicators. The proposed methodology addresses at least three shortcomings associated with the textual indicators of ideological placement currently available.
First, as opposed to measures based on word frequencies, the estimates from our neural networks are trained to predict the use of language in context. Put another way, the method accounts for a party's usage of words given the surrounding text. Among other things, this allows us to examine differences in how parties talk about the same issues, rather than only the differences in the issues that parties talk about. Second, our approach can easily accommodate control variables, that is, factors that could otherwise confound the placement of parties or politicians. For example, we account for the government-opposition dynamics that have foiled ideology indicators applied to Westminster systems in the past, and filter out their influence to achieve more accurate estimates of party placement. Third, the methodology allows us to map political actors and language in a common vector space. This means that we can situate actors of interest based on their proximity to political concepts. Using a single model of embeddings, researchers can rank political actors relative to these concepts using a variety of metrics for vector arithmetic. We demonstrate such implementations in our empirical section.

[1] Examples of recent applications in political research include Rheault et al. (2016), Preoţiuc-Pietro et al. (2017), and Glavaš, Nanni, and Ponzetto (2017).

Our results suggest that word embeddings are a promising tool for expanding the possibilities of political research based on textual data. In particular, we find that scaling estimates of party placement derived from the embeddings for the metadata (which we call party embeddings) are strongly correlated with human-annotated and roll-call vote measures of left-right ideology. We distinguish between two approaches to estimate the placement of political actors on ideological dimensions. The first consists of using dimension reduction techniques on the raw embeddings, and does not involve the judgment of researchers. The second uses linear projections onto predefined scales, by choosing a relevant set of political expressions. Our findings indicate that both approaches lead to high levels of accuracy when assessed against external benchmarks, such that scholars can safely prefer the option that requires no arbitrary decisions when analyzing the ideological space of a legislature. We also show that the methodology is especially well suited to analyses where the main interest is to assess the proximity between actors and their association with groups of concepts, for instance to analyze language specificity, ideological polarization and issue ownership.

2 Relations to Previous Work

Two of the most popular approaches in political science for the extraction of ideology from texts are WordScores (Laver, Benoit, and Garry 2003) and WordFish (Slapin and Proksch 2008).[2] The first relies on a sample of labeled documents, for instance party manifestos annotated by experts. The relative probabilities of word occurrences in the labeled documents serve to produce scores for each word, which can be viewed as indicators of their ideological load. Next, the scores can be applied to the words found in new documents, to estimate their ideological placement.
In fact, this approach can be compared to methods of supervised machine learning (Bishop 2006), where a computer is trained to predict the class of a labeled set of documents based on their observed features (e.g. their words). WordFish, on the other hand, relies on party annotations only. The methodology consists of fitting a regression model where word counts are projected onto party-year parameters, using an expectation maximization algorithm (Slapin and Proksch 2008). This approach avoids the reliance on expert annotations, and amounts to estimating the specificity of word usage by party, at different points in time.

Neither of these approaches, however, takes into account the role of words in context. Put another way, they ignore semantics. Although both WordScores and WordFish could in theory be expanded to include n-grams (sequences of more than one word), this comes at an increased computational cost. There are so many different combinations of words in the English language that it rapidly becomes inefficient to count them. This problem has been addressed recently in Gentzkow, Shapiro, and Taddy (2016), and presented as a curse of dimensionality.

[2] See also Lauderdale and Herzog (2016) for an extension of the method to legislative speeches, and Kim, Londregan, and Ratkovic (2018) for an expanded model combining both text and votes.

Using a large number of words may be inefficient when tracking down ideological slants from textual data, since a high feature-document ratio overstates the variance across the target corpora (Taddy 2013; Gentzkow, Shapiro, and Taddy 2016). This problem is usually called overfitting in the machine learning literature. Problems associated with high dimensionality often preclude the reliance on n-grams. With a few exceptions (Sim et al. 2013; Iyyer et al. 2014), models of political language face a trade-off between ignoring the role of words in context and dealing with high-dimensional variables.[3] Instead of relying on word frequencies, word embedding models aim to capture and represent relations between words using co-occurrences, which sidesteps overfitting problems while allowing researchers to move beyond counts of words taken in isolation.

In the context of studies based on parliamentary debates, an additional concern is the ability to account for other institutional elements such as the difference in tone between the government and the opposition. A dangerous shortcut would consist of attributing any observed differences between party speeches to ideology. In any given parliament, the cabinet will use a different vocabulary than opposition parties due to the nature of these legislative functions. For instance, opposition parties in Westminster systems will invoke ministerial positions frequently when addressing their counterparts. Hirst et al. (2014) show that machine learning models used to classify texts by ideology have a tendency to be confounded with government-opposition language. As a result, temporal trends can also be obscured by procedural changes in the way government and opposition parties interact in parliament. A similar issue has been found to affect methods based on roll-call votes to infer ideology, where government-opposition dynamics can dominate the first dimension of the estimates (Spirling and McLean 2007; Hix and Noury 2016).
Our proposed methodology can accommodate the inclusion of additional control variables to filter out the effect of these institutional factors.

Finally, we argue that neural network models of language represent the natural way for scholars to move forward when attempting to measure a concept such as political ideology. The patterns of ideas that constitute ideologies are not as straightforward as is often assumed (Freeden 1998; Cochrane 2015). Ideologies are emergent phenomena that are tangible but irreducible. Ideologies are tangible in that they genuinely structure political thinking, disagreement, and behavior; they are irreducible, however, in that no one actor or idea, or group of actors or ideas, constitutes the core from which an ideology emanates. Second-degree connections between words and phrases give rise to meaningful clusters that fade from view when we analyze ideas, actors, or subsets of these things in isolation from their broader context. People know ideology when they see it, but struggle to say precisely what it is, because they cannot resolve the irresolvable (Cochrane 2015). This property of ideology has bedeviled past analysis. Since neural network models are designed to capture complex interactions between the inputs (in our case, context words and indicator variables for political actors), they are well adapted for the study of concepts that should theoretically emerge from such interactions.

[3] Other examples of text-based methods for the detection of ideology based on word and phrase occurrences include Gentzkow and Shapiro (2010), Diermeier et al. (2012) and Jensen et al. (2012).

3 Methodology

Models for word embeddings have been explored thoroughly in the literature, but we need to introduce them summarily to facilitate the exposition of our approach. This section also adopts a notation familiar to social science scholars. Our implementation uses shallow neural networks, that is, statistical models containing one layer of latent variables, or hidden nodes, between the input and output data.[4] The outcome variable $w_t$ is the word occurring at position $t$ in the corpus. The variable $w_t$ is multinomial with $V$ categories corresponding to the size of the vocabulary. The input variables in the model are the surrounding words appearing in a window $\Delta$ before and after the outcome word, which we denote as $\mathbf{w}_\Delta = (w_{t-\Delta}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+\Delta})$. The window is symmetrical to the left and to the right, which is the specification we use for this study, although non-symmetrical windows are possible, for instance if one wishes to give more consideration to the previous words in a sequence than to the following ones. Simply put, word embedding models consist of predicting $w_t$ from $\mathbf{w}_\Delta$.

The neural network can be subdivided into two components. Let $z_m$ represent a hidden node, with $m \in \{1, \ldots, M\}$ and where $M$ is the dimension of the hidden layer. Each node can be expressed as a function of the inputs:

$$z_m = f(\mathbf{w}_\Delta' \boldsymbol{\beta}_m) \tag{1}$$

In machine learning, $f$ is called an activation function. In the case of word embedding models such as the one we rely upon, that function is simply the average value of $\mathbf{w}_\Delta' \boldsymbol{\beta}_m$ across all input words (see Mikolov, Chen, Corrado, and Dean 2013). Since each word in the vocabulary can be treated as an indicator variable, Eq. (1) can be expressed equivalently as

$$z_m = \frac{1}{2\Delta} \sum_{w_v \in \mathbf{w}_\Delta} \beta_{v,m} \tag{2}$$

that is, a hidden node is the average of coefficients $\beta_{v,m}$ specific to a word $w_v$ if that word is present in the context of the target word $w_t$.
In turn, the vector of hidden nodes $\mathbf{z} = (z_1, \ldots, z_M)$ is the average of the $M$-dimensional vectors of coefficients $\boldsymbol{\beta}_v$, for all words $v$ occurring in $\mathbf{w}_\Delta$:

$$\mathbf{z} = \frac{1}{2\Delta} \sum_{w_v \in \mathbf{w}_\Delta} \boldsymbol{\beta}_v \tag{3}$$

[4] For the purpose of our presentation, we follow the steps of the model that Mikolov, Chen, Corrado, and Dean (2013) call continuous bag-of-words (CBOW).
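As an illustration of Eqs. (1) to (3), the following minimal sketch computes the hidden layer for one target word with NumPy. The vocabulary size, the hidden dimension, the coefficient values and the word indices are made-up toy values, not quantities from the paper; the sketch only shows that averaging embedding rows and averaging the indicator-weighted coefficients give the same vector.

```python
import numpy as np

def hidden_layer(context_ids, beta):
    """Average the embedding rows of the context words, as in Eq. (3)."""
    return beta[context_ids].mean(axis=0)

rng = np.random.default_rng(0)
V, M = 10, 5                         # toy vocabulary size and hidden dimension
beta = rng.normal(size=(V, M))       # one M-dimensional embedding per word

# Indices of the 2 * Delta context words around the target (Delta = 3 here).
context_ids = np.array([1, 4, 2, 7, 3, 9])
z = hidden_layer(context_ids, beta)

# Equivalent indicator-variable form of Eq. (2): counts of context words
# multiplied by the coefficient matrix, divided by 2 * Delta.
w = np.bincount(context_ids, minlength=V)
z_alt = w @ beta / len(context_ids)
```

Both `z` and `z_alt` are vectors of length `M`, and they coincide up to floating-point error.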

Upon estimation, these vectors $\boldsymbol{\beta}_v$ are the word embeddings of interest. The remaining component of the model expresses the probability of the target word $w_t$ as a function of the hidden nodes. Similar to the multinomial logit regression framework, commonly used to model vote choice, a latent variable representing a specific word $i$ can be expressed as a linear function of the hidden nodes: $u_{it} = \alpha_i + \mathbf{z}' \boldsymbol{\mu}_i$. The probability $P(w_t = i)$ given the surrounding words corresponds to:

$$P(w_t = i \mid \mathbf{w}_\Delta) = \frac{e^{\alpha_i + \mathbf{z}' \boldsymbol{\mu}_i}}{\sum_{v=1}^{V} e^{\alpha_v + \mathbf{z}' \boldsymbol{\mu}_v}} \tag{4}$$

The full model can be written compactly using nested functions and dropping some indices for simplicity:

$$P(w_t \mid \mathbf{w}_\Delta) = g\left(\boldsymbol{\alpha}, \boldsymbol{\mu}, f(\mathbf{w}_\Delta' \boldsymbol{\beta})\right) \tag{5}$$

As can be seen with the visual depiction in Figure 1, the embeddings $\boldsymbol{\beta}$ link each input word to the hidden nodes.[5] The parameters of the model can be fitted by minimizing the cross-entropy using stochastic gradient descent. We rely on negative sampling to fit the predicted probabilities in Eq. (4) (see Mikolov, Sutskever, Chen, Corrado, and Dean 2013). In an influential study, Pennington, Socher, and Manning (2014) have shown that a corresponding model can be represented as a log-bilinear Poisson regression using the word-word co-occurrence matrix of a corpus as data. However, the implementation we use here facilitates the inclusion of metadata by preserving individual words as units of analysis.

The basic model introduced above can be expanded to include additional input variables, which is our main interest in this paper. A common implementation uses indicator variables for documents or segments of text of interest, in addition to the context words (Le and Mikolov 2014).[6] The approach was originally called paragraph vectors or document vectors. More generally, other types of metadata can be entered in Eq. (1) to account for properties of interest at the document level, which is the approach we adopt here (for an illustration using political texts, see Nay 2016).
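To make Eq. (4) concrete, here is a small sketch of the output component, assuming toy values for the hidden nodes and for the parameters α and μ. It evaluates the full softmax over a tiny vocabulary. The paper fits these probabilities with negative sampling rather than the exact normalization, so this is an illustration of the formula and not of the estimation procedure.

```python
import numpy as np

def word_probabilities(z, alpha, mu):
    """Softmax over the vocabulary: P(w_t = i | context), as in Eq. (4)."""
    u = alpha + mu @ z       # latent scores u_i = alpha_i + z' mu_i
    u = u - u.max()          # subtract the maximum for numerical stability
    e = np.exp(u)
    return e / e.sum()

rng = np.random.default_rng(1)
V, M = 10, 5                       # toy vocabulary size and hidden dimension
z = rng.normal(size=M)             # hidden nodes for one target position
alpha = rng.normal(size=V)         # word-specific intercepts
mu = rng.normal(size=(V, M))       # output coefficients, one row per word

p = word_probabilities(z, alpha, mu)
```

The vector `p` holds one probability per vocabulary entry and sums to one; the most probable word is the one with the largest latent score.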
In our implementation, we focus primarily on indicator variables measuring the party affiliation of a member of parliament (MP) or congressperson uttering a speech. The inner component of the expanded model can be represented as:

$$z_m = f(\mathbf{w}_\Delta' \boldsymbol{\beta}_m + \mathbf{x}' \boldsymbol{\zeta}_m) \tag{6}$$

where $\mathbf{x}$ is a vector of metadata, and the rest of the specification is the same as before. In addition to party affiliation, it is straightforward to account for attributes with the potential to affect the use of language and confound party-specific estimates. We mentioned the government status earlier (cabinet versus non-cabinet positions, or party in power versus opposition). For Canada, a country where federal politics is characterized by persistent regional divides, a relevant variable would be the province of the MP. Just like words have their embeddings, each variable entered in $\mathbf{x}$ has a corresponding vector $\boldsymbol{\zeta}$ of dimension $M$.

Observe that the resulting vectors $\boldsymbol{\zeta}$ have commonalities with the WordFish estimator of party placement. In their WordFish model, Slapin and Proksch (2008) predict word counts with party-year indicator variables. The resulting parameters are interpreted as the ideological placement of parties. The model introduced in Eq. (6) achieves a similar goal. The key difference is that our model is estimated at the word level, while taking into account the context ($\mathbf{w}_\Delta$) in which a word occurs. The hidden layer serves an important purpose by capturing interactions between the metadata and these context words. Moreover, the dimension of the hidden layer will determine the size of what we refer to as party embeddings in what follows, that is, the estimated parameters for each party. Rather than a single point estimate, we fit a vector of dimension $M$. An obvious benefit is that these party embeddings can be compared against the rest of the corpus vocabulary in a common vector space, as we illustrate below.

Figure 1: Example of Model with Word and Party Embeddings
[Diagram: the context words $w_{t-3}$: overcoming, $w_{t-2}$: barriers, $w_{t-1}$: to, $w_{t+1}$: and, $w_{t+2}$: tackling, $w_{t+3}$: inequalities (coefficients $\boldsymbol{\beta}$) and the metadata $\mathbf{x}$: Labour 2005 (coefficients $\boldsymbol{\zeta}$) feed the hidden layer $\mathbf{z}$, which is linked through $\boldsymbol{\mu}$ to the output word $w_t$: work.]
Schematic example of input and output data in a model with $M = 5$ and a window $\Delta = 3$. The model includes a variable indicating the party affiliation and parliament of the politician making the speech.

[5] In fact, Mikolov, Chen, Corrado, and Dean (2013) proposed two approaches: one in which the word embeddings are the link coefficients between input words and the hidden nodes (CBOW), and another where the outcome and the inputs are switched (called skip-gram), in effect predicting the surrounding context from the word, rather than the reverse.

[6] The type of model described here is called distributed memory in the original article (Le and Mikolov 2014).
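The expanded hidden layer of Eq. (6) can be sketched in the same way. The example below assumes that the average is taken over all active inputs, context words and metadata indicators alike, as in averaged distributed-memory models; the text does not spell out this normalization, so treat it as an assumption. The metadata labels and all numerical values are hypothetical.

```python
import numpy as np

def hidden_layer_with_metadata(context_ids, meta_ids, beta, zeta):
    """Average the word embeddings (beta) and metadata embeddings (zeta)
    of all active inputs. Assumption: words and metadata share one average."""
    inputs = np.vstack([beta[context_ids], zeta[meta_ids]])
    return inputs.mean(axis=0)

rng = np.random.default_rng(2)
V, M = 10, 5
beta = rng.normal(size=(V, M))            # word embeddings

# Hypothetical metadata indicators; each row of zeta is one embedding.
meta_labels = ["Labour_2005", "Conservative_2005", "cabinet"]
zeta = rng.normal(size=(len(meta_labels), M))

context_ids = np.array([1, 4, 2, 7, 3, 9])            # 2 * Delta context words
meta_ids = np.array([meta_labels.index("Labour_2005"),
                     meta_labels.index("cabinet")])   # speaker attributes

z = hidden_layer_with_metadata(context_ids, meta_ids, beta, zeta)
```

After fitting, the rows of `zeta` that correspond to parties would play the role of the party embeddings, while rows for control variables such as cabinet membership absorb variation that would otherwise be attributed to the party.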
Specifically, our implementation uses party-parliament pairs as indicator variables, for a number of reasons. First, fitting combinations of parties and time periods allows us to reproduce the nature of the WordFish model as closely as possible: each party within a given parliament or congress has a specific embedding. This approach accounts for the possibility that the language and issues debated by each party may evolve from one parliament to the next. Parties are allowed to move over time in the vector space. We rely on parliaments/congresses, rather than years, to facilitate external validity tests against roll-call vote measures and annotations based on party manifestos, which are published at the election preceding the beginning of each parliament. Of course, the possible specifications are virtually endless and may differ in future applications. But we believe that the models we present are consistent with existing practice and provide a useful ground for a detailed assessment.

4 Parliamentary Corpora

Models of word embeddings have been shown to perform best when fitted on large corpora that are adapted to the domain of interest (Lai, Liu, and Xu 2016). For the purpose of this study, we rely on three publicly available resources containing digitized parliamentary debates overlapping a century of history in Canada, the United States and Britain. Replicating the results in three polities helps to demonstrate that the proposed methodology is general in application. The Canadian Hansard corpus is described in Beelen et al. (2017) and released as linked and open data on [...]. Our version of the British Hansard corpus is hosted on the Political Mashup website.[7] Finally, the United States corpus is the version released by Gentzkow, Shapiro, and Taddy (2016). Each resource is enriched with a similar set of metadata about speakers, such as party affiliations and functions. The first section of the online appendix describes each resource in more detail. We considered speeches made by the major parties in each corpus.
For Canada, we use the entirety of the available corpus, which covers a period ranging between 1901 and 2017, from the 9th to the 42nd Parliament. The corpus represents over 3 million speeches after restricting our attention to five major parties (Conservatives, Liberals, New Democratic Party, Bloc Québécois, and Reform Party/Canadian Alliance). For the United Kingdom, the corpus covers the period from 1936 to [...]. We restrict our focus to the three major party labels: Labour, Liberal-Democrats, and Conservatives. We removed speeches from the Speaker of the House of Commons in Britain and Canada, whose role is non-partisan. The United States corpus ranges from 1873 to 2016 (43rd to 114th Congress). We present results for the House of Representatives and the Senate separately, and restrict our attention to voting members affiliated with the Democratic and Republican parties.

For each corpus, we tested models with various specifications and compared their accuracy. The main text reports models with hidden layers of 200 nodes, which we have found to be reliable for applied research. The appendix provides additional information on parameterization and its influence on the output. Each model uses a window of ±20 words and includes tokens with a minimum count of 50 occurrences. Our models include not only words, but also common phrases. We proceed with two passes to detect collocations (words used frequently together) and merge them as single entities, which means that we capture phrases of up to 4 words.[8] This is especially useful for political research, where multi-word entities are frequent and common expressions may have specific meanings (e.g. "civil rights"). We fit the models using custom scripts based on Řehůřek and Sojka's (2010) implementation for Python and a learning rate [...]. We preprocessed the text by removing digits and words with two letters or fewer, as well as a list of English stop words enriched to remove overly common procedural words such as "speaker" (or "chairwoman"/"chairman" in the United States), used in most speeches due to decorum. Our scripts are released publicly.[9]

[7] See [...]

5 Two Approaches to Ideological Scaling

We start by assessing the ability of the model to represent political ideology. We propose two different approaches for this purpose. The first consists of extracting the principal components from the party vectors and interpreting them as estimates of ideological placement. As long as ideology is the main dimension along which political actors differ in terms of semantics, this interpretation is plausible. Moreover, we illustrate how the methodology has features that facilitate the interpretation of the principal components. In the second approach, we identify words or expressions that define the dimensions a priori, and use them to create a customized vector space. We call this second approach guided, since the choice of words as anchor points will have some influence on the findings.
5.1 Unguided Projections

With the simplest approach, the objective is to project the M-dimensional party embeddings into a substantively meaningful vector space. These party embeddings can be visualized in two dimensions using standard dimensionality reduction techniques such as principal component analysis (PCA), which we rely upon in this section. In plain terms, PCA finds the one-dimensional component that maximizes the variance across the vectors of party embeddings (see e.g. Hastie, Tibshirani, and Friedman 2009, Ch. 14.5). The next component is calculated the same way, by imposing a constraint of zero covariance with the first component. Additional components could be computed, but our analysis focuses on two-dimensional projections to simplify visualizations. If the speeches made by members of different parties are primarily characterized by ideological conflicts, as is normally assumed in unsupervised techniques for ideology scaling, we can reasonably expect the first component to describe the ideological placement of parties. The second component will capture the next most important source of semantic variation across parties in a legislature. To facilitate the interpretation of these components, we can use the model's word embeddings and examine the concepts most strongly associated with each dimension.

Starting with the US corpus, Figure 2 plots the party embeddings in a two-dimensional space for the House of Representatives and the Senate. We label each data point using an abbreviation of the party name and the beginning year of a Congress; for instance, the embedding $\zeta_{\text{Dem 2011}}$ means the Democratic party in the Congress starting in 2011 (the 112th Congress). The only adjustment that may be relevant to perform is orienting the scale in a manner intuitive for interpretation, for instance by multiplying the values of a component by −1 such that conservative parties appear on the right. We fit separate models for the two chambers. Each model includes party-congress indicator variables as well as separate dummy variables for each congress, which account for temporal change in the discourse.

Our methodology captures ideological shifts that occurred during the 20th century. Whereas both major parties were originally close to the center of the first dimension, which we interpret as the left-right or liberal-conservative divide, they begin to drift apart around the New Deal era in the 1930s, a period usually associated with the fifth party system.

[8] Each pass combines pairs of words frequently used together as a single expression, for instance "united kingdom". By applying a second pass, expressions of one word or two words can be merged, resulting in phrases of up to 4 words. The algorithm used to detect phrases is based on the original implementation of word embeddings proposed in Mikolov, Sutskever, Chen, Corrado, and Dean (2013). We have found the inclusion of phrases to bring some improvement in accuracy, but models without phrases remain reliable for political analysis.

[9] See [...]
Consistent with common wisdom, party embeddings for the Democrats started to shift toward the left of the ideological spectrum, while Republicans moved the opposite way. The trend culminates with a period of marked polarization from the late 1990s to the most recent Congresses. However, the most spectacular shift is probably the one occurring on the second dimension, which we interpret as a South-North divide (we oriented the South to the bottom, and the North to the top). The change reflects a well-documented realignment between Northern and Southern states that occurred between the New Deal and the civil rights eras (Shafer and Johnston 2009; Sundquist 2011). A similar trajectory is manifest using both the House and the Senate corpora. Whereas Republicans initially became associated with issues of Northern states, the two parties eventually switched sides entirely. The recent era appears particularly polarized on both axes, which is consistent with a body of literature documenting party polarization (we return to this discussion in the penultimate section of this paper). On the other hand, we do not find clear evidence of polarization on the principal component in the late 20th century, contrary to indicators based on vote data (Poole and Rosenthal 2007) but consistent with Gentzkow, Shapiro, and Taddy (2016).

Figure 2: Party Placement in the US Congress
[Two scatter plots, (a) House and (b) Senate, with Component 1 on the horizontal axis and Component 2 on the vertical axis. Each point is a party-congress embedding labeled with the party abbreviation and the starting year of the Congress, e.g. Dem 2011 or Rep 1995.]
The figure shows the two principal components of party embeddings for the US House and the US Senate.

The proposed models have desirable properties for interpreting the low-dimensional projection, by taking advantage of having words and political actors in the same model. In particular, we can compute the cosine similarity between each word or phrase in the vocabulary and the party embeddings, and then identify the words that most strongly correlate with each axis. To illustrate, Table 1 reports the expressions with the highest and lowest correlation coefficients for the House model. As can be seen, the first dimension is negatively correlated with expressions such as "black caucus", "decent housing", the poor and the elderly, meaning that these expressions define the semantics of parties located on the left. These words refer to topics one would expect in the language of liberal (or left-wing) parties in the United States. Conversely, issues like decentralization and bureaucracies are associated with the right. As for the second dimension, the keywords unambiguously refer to Southern and Northern locations, supporting our interpretation of that axis as a South-North divide. As discussed earlier, ideology cannot be reduced to any single group of expressions, and the proposed methodology is precisely designed to capture the latent structure of political debates. As a result, the words in Table 1 only represent the axes imperfectly. However, in this case they provide straightforward clues that facilitate a substantive interpretation of each axis.
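The unguided projection and the axis interpretation described in this section can be sketched with NumPy as follows. This is an illustrative reading rather than the authors' code: PCA is computed through a singular value decomposition, the sign of an axis is flipped with a hypothetical pair of reference rows, and the word-axis association is taken to be the Pearson correlation, across parties, between a word's cosine similarities to the party embeddings and the party scores on the axis. All vectors are random toy data.

```python
import numpy as np

def pca_scores(party_vecs, n_components=2):
    """Project centered party embeddings onto their principal components."""
    X = party_vecs - party_vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T

def orient(scores, axis, right_idx, left_idx):
    """Flip an axis (multiply by -1) so that row right_idx scores higher
    than row left_idx. The reference rows are chosen by the analyst."""
    out = scores.copy()
    if out[right_idx, axis] < out[left_idx, axis]:
        out[:, axis] *= -1
    return out

def word_axis_correlations(word_vecs, party_vecs, axis_scores):
    """Correlate each word's cosine similarities to all parties with the
    party scores on one axis (one plausible reading of the procedure)."""
    wn = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    pn = party_vecs / np.linalg.norm(party_vecs, axis=1, keepdims=True)
    sims = wn @ pn.T                                  # shape (V, P)
    sims_c = sims - sims.mean(axis=1, keepdims=True)
    axis_c = axis_scores - axis_scores.mean()
    num = sims_c @ axis_c
    den = np.linalg.norm(sims_c, axis=1) * np.linalg.norm(axis_c)
    return num / den

rng = np.random.default_rng(3)
party_vecs = rng.normal(size=(8, 20))    # 8 toy party-congress embeddings
word_vecs = rng.normal(size=(50, 20))    # 50 toy word embeddings

scores = orient(pca_scores(party_vecs), axis=0, right_idx=0, left_idx=1)
corr = word_axis_correlations(word_vecs, party_vecs, scores[:, 0])
top_positive = np.argsort(-corr)[:10]    # indices of words most tied to the positive end
```

Sorting `corr` in both directions yields the kind of word lists used to label the two ends of an axis.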
Table 1: Interpreting PCA Axes: Word and Phrase Correlations

Component | Orientation | Words/Phrases with Highest Correlation
First | Positive (Right) | decentralization, centralized, nebraska, kansas, mentioned earlier, governmentrun, apropos, feed grain, bureaucratic, bureaucracies
First | Negative (Left) | congressional black caucus, black caucus, poor elderly, decent housing, latinos, cbc, deepest, elderly handicapped, wealthiest, elderly disabled
Second | Positive (North) | buffalo, minneapolis, detroit, milwaukee, cleveland, vermont, toledo, maine, erie, chicago
Second | Negative (South) | georgia, southeast, red river, arkansas, bankhead, gulf mexico, georgias, shreveport, waco, peanut

Another way to validate an interpretation of the projections is to retrieve concepts that are semantically similar to specific parties. For instance, we can readily identify the expressions closest to the position of the Democrats in the vector space for the 114th Congress, by retaining the words having the highest cosine similarity with that specific party embedding. This does not require any technique for dimension reduction, as the similarity scores can be computed from the original, M-sized embeddings. The top words for the Democrats contain relevant hints at a liberal stance, with concepts such as "gun violence" and "environmental protection". For Republicans in the same Congress, we find expressions such as "Obamacare", "overregulation" and "job creators". The full list of top words is printed out in the appendix. Once again, this exploration of the model suggests that party embeddings can achieve a meaningful representation of positions in an ideological space.
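Retrieving the expressions closest to a party, as described above, amounts to ranking the vocabulary by cosine similarity with one party embedding. The sketch below uses a four-word placeholder vocabulary and hand-picked three-dimensional vectors, all hypothetical, so that the ranking is easy to verify by eye.

```python
import numpy as np

def most_similar_words(party_vec, word_vecs, vocab, k=2):
    """Return the k vocabulary entries with the highest cosine similarity
    to a party embedding, computed in the original M-dimensional space."""
    wn = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    pn = party_vec / np.linalg.norm(party_vec)
    sims = wn @ pn
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

# Placeholder vocabulary and toy embeddings (not real estimates).
vocab = ["w0", "w1", "w2", "w3"]
word_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
party_vec = np.array([2.0, 0.1, 0.0])   # toy party embedding

closest = most_similar_words(party_vec, word_vecs, vocab, k=2)
```

With these toy vectors the party embedding points almost exactly along the first axis, so `w0` ranks first and `w2` second, while `w3`, which points the opposite way, has the lowest similarity.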

Table 2: Accuracy of Party Placement against Gold Standards

Gold Standard | Metric | US House | US Senate | Canada | Britain
Voteview | Correlation | [...] | [...] | - | -
Voteview | Pairwise Accuracy | 85.96% | 86.56% | - | -
rile | Correlation | - | - | [...] | [...]
rile | Pairwise Accuracy | - | - | 76.60% | 74.60%
vanilla | Correlation | - | - | [...] | [...]
vanilla | Pairwise Accuracy | - | - | 76.44% | 77.58%
legacy | Correlation | - | - | [...] | [...]
legacy | Pairwise Accuracy | - | - | 80.07% | 82.66%

The gold standard used for the United States is the average DW-NOMINATE score (first dimension) from the Voteview project (Poole and Rosenthal 2007). For Canada and the UK, the references are the rile measure of party placement based on the 2017 version of the Comparative Manifesto Project (CMP) dataset (Budge and Laver 1992; Budge et al. 2001), the Vanilla measure of left-right placement (Gabel and Huber 2000), and the legacy measure from Cochrane (2015). The pairwise accuracy metric counts the number of correct ideological orderings for all possible pairs of parties and parliaments.

To further assess the validity of estimates derived from our model, we evaluate our predicted ideological placement against well-known metrics from the literature. Table 2 reports the results. For the US corpus, we use DW-NOMINATE estimates based on roll-call votes and retrieved from the latest version of the Voteview project (Poole and Rosenthal 2007). We use the first principal component of our models and compute the Pearson correlation coefficient with the first dimension of the aggregate Voteview indicator, which measures the average placement of congress members by party over time. We also report the pairwise accuracy, that is, the percentage of pairs of party placements that are consistently ordered relative to the gold standard. Pairwise accuracy accounts for all possible comparisons, within parties and across parties. Our placement is strongly correlated with the Voteview scores (ρ ≈ 0.9) and the pairwise accuracy, a more conservative metric, is above 85% for both the House and the Senate.
This strengthens the conclusion that our model produces reliable estimates of ideological placement.

Next, we illustrate that the methodology is generalizable across polities by replicating the same steps using the British and Canadian parliamentary corpora. To begin, Figure 3a reports a visualization of party embeddings for Britain. In addition to party and parliament indicators, the model includes a variable measuring whether an MP is a member of the cabinet or not. As can be seen, political parties are once again appropriately clustered together in the vector space: speeches made by members of the same group tend to resemble each other across parliaments. Moreover, the parties are correctly ordered relative to intuitive expectations about political ideology. Focusing on the first principal component (x-axis), the Labour party appears on one end of the spectrum, the Liberal-Democrats occupy the center, whereas the embeddings for Conservatives are clustered on the other side. In fact, without any intervention needed on our end, the model correctly captures well-accepted claims about ideological shifts within the British party system over time (see e.g. Clarke et al. 2004). For instance, the party embeddings for Conservatives during the Thatcher era (Cons 1979, 1983, and 1987) are placed farther toward the right end of the axis, whereas Labour's shift toward the center at the time of the New Labour era (Labour 1997, 2001, and 2005), under the leadership of Tony Blair, is also apparent. The second component captures dynamics opposing the party in power and the opposition, with parties forming a government appearing at the top of the y-axis.

Finally, Figure 3b reports the results for Canada.[10] Once more, the first principal component can be readily interpreted in terms of left-right ideological placement. The Conservatives appear consistently on the right, whereas the left-wing New Democratic Party (NDP, which is merged with its predecessor, the Co-operative Commonwealth Federation) is correctly located on the other end of the spectrum. The Reform/Canadian Alliance split from the Conservatives, generally viewed as the most conservative political party in the Canadian system, appears at the extreme right of the first dimension, consistent with substantive expectations (see Cochrane 2010). In the case of Canada, the second principal component can be easily interpreted as a division between parties reflecting their views of the federation. The secessionist party, the Bloc Québécois, appears clustered on one end of the y-axis, whereas federalist parties are grouped on the other side. Such a division also resurfaces in models based on vote data (see Godbout and Høyland 2013; Johnston 2017).
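The two accuracy metrics reported in Table 2, the Pearson correlation and the pairwise accuracy, can be illustrated with a short Python sketch. The placements and gold-standard scores below are made-up numbers, the function name is hypothetical, and skipping pairs that are tied on the gold standard is an assumption of this sketch rather than a documented choice.

```python
from itertools import combinations

import numpy as np

def pairwise_accuracy(estimates, gold):
    """Share of all pairs of observations that `estimates` orders in the
    same way as `gold`. Pairs tied on the gold standard are skipped
    (an assumption made for this illustration)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(gold)), 2):
        gold_diff = gold[i] - gold[j]
        if gold_diff == 0:
            continue
        total += 1
        if (estimates[i] - estimates[j]) * gold_diff > 0:
            agree += 1
    return agree / total

# Hypothetical first-component placements and gold-standard scores
# for five party-period observations.
estimates = np.array([-1.2, -0.4, 0.3, 0.2, 1.5])
gold = np.array([-0.5, -0.3, 0.1, 0.2, 0.6])

pearson_r = np.corrcoef(estimates, gold)[0, 1]
accuracy = pairwise_accuracy(estimates, gold)

print(f"Pearson correlation: {pearson_r:.3f}")
print(f"Pairwise accuracy: {accuracy:.2%}")  # 9 of the 10 pairs agree
```

The example also shows why pairwise accuracy is the more conservative of the two metrics: a single inversion between two close observations leaves the correlation almost untouched but is counted as a full error among the pairs.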
For the last two countries, we validate our ideological placement using data from the Comparative Manifesto Project (CMP) (Budge and Laver 1992; Budge et al. 2001). We report the same two accuracy metrics used earlier in the lower section of Table 2. Note that we could not use that resource as a reliable benchmark for the United States since the CMP only provides data on American parties at four-year intervals. On the other hand, the CMP data provide a more useful gold standard to assess Westminster systems, for which vote-based estimates are less reliable indicators of ideology. The CMP is based on party manifestos and relies on human annotations to score the orientation of political parties on a number of policy items, following a normalized scheme. We test whether three ideology indicators derived from the project's data are consistent with our estimated placement of the same party in the parliament that immediately follows.[11]

Looking first at the British case, party placements appear positively correlated with the three external indicators, ranging from ρ ≈ 0.68 when considering the more basic right minus left measure (rile) from the 2017 CMP dataset, up to ρ ≈ 0.76 and ρ ≈ 0.88 using more robust ideology metrics based on the same data. As for pairwise accuracy, between 75% and 83% of comparisons against CMP-based measures are ordered consistently. These results once again support the conclusion that our party embeddings are related to left-right ideology as coded by humans using the party manifestos. For Canada, the fit with the CMP data is also strong across the three gold standards. For instance, the correlations range between 0.72 and [...]. For all countries, our tests suggest that setting M around 200 dimensions is a reliable default value to achieve results consistent with human judgments or vote-based indicators. In the appendix, we report additional tests of accuracy of the models for various hidden layer sizes. We also discuss validation tests of the word embeddings using standard benchmarks from the literature.

[10] For that country, our model includes variables measuring whether the MP making the speech belongs to the cabinet or not, whether they belong to the party in power or the opposition, and their province of origin.

[11] The rile measure is the original left/right measure in the CMP. It is an additive index composed of 26 policy-related items, as described in Budge et al. (2001). The Vanilla measure proposed by Gabel and Huber (2000) uses all 56 items in the CMP and weights them according to their loadings on the first unrotated dimension of a factor analysis. The Legacy measure is a weighted index based on a network analysis of party positions and a model that assigns exponentially decaying weight to party positions in prior elections (Cochrane 2015).

Figure 3: Party Placement in Britain and Canada
Panels: (a) Britain; (b) Canada. Horizontal axis: Component 1; vertical axis: Component 2.
The figure shows the two principal components of party embeddings for the British and Canadian parliaments.

5.2 Guided Projections

Instead of interpreting dimensions ex-post, researchers may also choose to define axes of interest. We briefly illustrate how the proposed methodology can be used in such fashion. We start by choosing expressions representative of opposite ideological stances on economic and social issues (see Table A4 in the appendix for the full list). When more than one term is used to anchor a position, we can take the centroids of each group of words and phrases, by averaging their embeddings. Finally, axes are created by taking the difference between the right and left centroids, for each dimension of interest. We project party embeddings onto the customized space by taking dot products:

$$\zeta \cdot \left( \frac{\sum_{i \in L_{\text{Right}}} \beta_i}{V_{\text{Right}}} - \frac{\sum_{i \in L_{\text{Left}}} \beta_i}{V_{\text{Left}}} \right)$$

where $L_{\text{Left}}$ is the chosen lexicon of words identifying the left wing, and $V_{\text{Left}}$ is the size of that lexicon (and similarly for the right).[12]
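The guided projection can be sketched as follows in Python with NumPy. The sketch assumes that ζ denotes a party embedding and β_i the embedding of expression i; the three-dimensional vectors, the anchor lexicons and the party labels are invented for the illustration and do not come from the fitted models or from Table A4.

```python
import numpy as np

def guided_axis(embeddings, right_terms, left_terms):
    """Difference between the centroid of the right-wing anchor terms
    and the centroid of the left-wing anchor terms."""
    right_centroid = np.mean([embeddings[t] for t in right_terms], axis=0)
    left_centroid = np.mean([embeddings[t] for t in left_terms], axis=0)
    return right_centroid - left_centroid

def project(party_vectors, axis):
    """Dot product of each party embedding with the customized axis."""
    return party_vectors @ axis

# Hypothetical word embeddings for a handful of anchor expressions.
word_vectors = {
    "free enterprise": np.array([1.0, 0.0, 0.2]),
    "taxpayers": np.array([0.8, 0.2, 0.0]),
    "workers": np.array([-0.9, 0.1, 0.1]),
    "redistribution": np.array([-1.1, -0.1, 0.3]),
}

economic_axis = guided_axis(
    word_vectors,
    right_terms=["free enterprise", "taxpayers"],
    left_terms=["workers", "redistribution"],
)

# Hypothetical party embeddings (rows): one right-leaning, one left-leaning.
party_labels = ["Party R", "Party L"]
party_vectors = np.array([[0.7, 0.1, 0.0], [-0.6, 0.2, 0.1]])

scores = project(party_vectors, economic_axis)
for label, score in zip(party_labels, scores):
    print(f"{label}: {score:+.2f}")
```

Repeating the same two calls with a second pair of lexicons (for instance social anchors) would yield the second coordinate of a two-dimensional placement. Since the raw dot product depends on the length of the axis vector, positions are comparable along an axis but not directly across axes unless the axes are normalized.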
Figure 4 illustrates such a linear projection of party embeddings in a two-dimensional space for the US House. The neural network model is the same as that used in the previous subsection. The social dimension (y-axis) uses expressions such as civil rights and traditional values to represent left and right, respectively. For the economic dimension (x-axis), we use concepts related to workers and redistribution for the left, and expressions such as businesses, taxpayers and free enterprise for the right. Consistent with expectations, the figure suggests that the Republican party is located to the right on both the economic and social dimensions in recent decades. The Democrats, on the other hand, are both socially and economically on the left, or liberal end of the scale. We also assessed the validity of this second approach against gold standards and report the results in the appendix. Overall, we find that the unguided method presented in the previous section achieves higher levels of accuracy than the one in which we predefine the axes. Of course, a researcher could easily test the entire vocabulary from a corpus and find combinations of expressions that maximize the accuracy with human-annotated benchmarks. Such an optimization technique, however, would defeat the purpose of developing a general approach that can be used across polities, free of arbitrary decisions.

[12] This approach expands on a standard visualization technique for the analysis of word embeddings; for instance, a similar implementation is included in Google's TensorBoard tool.

Figure 4: Party Placement in a 2D Space using Customized Ideological Axes (US House)
Horizontal axis: Economic Left-Right; vertical axis: Social Left-Right.

Finally, we note that our models can be fitted with indicator variables for each member of a parliament, rather than parties. The same analytical steps apply, and the transposition to a different type of political actor is straightforward.
A difference that researchers need to take into account is that a model fitted with embeddings for individual speakers has fewer training examples, which may affect accuracy and the choice of parameters. Due to space constraints, we discuss an example in the appendix, using the US Senate corpus. We find that the method applied to individual members is promising as a tool for political research, and our test based on the 114th Congress suggests that our predicted ideological placement for Senators is consistent with an extended set of individual-level data from external sources.

6 Example Application: Topic Similarities

After fitting a model of word and party embeddings, researchers can assess hypotheses regarding issues or topics prioritized by political actors in parliament. For instance, one might wish to assess the extent to which left-wing parties in Canada (the NDP) and the United States (Democrats) have embraced the issue of environment over time. This question touches on a broader theoretical debate on the evolution of left-wing parties, and whether the issues that they prioritize have changed over time (see Kitschelt and Hellemans 1990). To implement such an analysis, we may either start with a custom list of words related to environmental issues or automatically retrieve expressions most similar to the keyword environment in the vector space using the word embeddings. We use the latter approach for illustration. Next, we compute the cosine similarity between the centroid of the embeddings for the topic words and that of a party embedding. Implementing a bootstrap estimator can be achieved by replicating the computation using random samples with replacement from the set of topic words. Figure 5 reports the similarity scores between left-leaning parties and a centroid vector for the issue of environment over time. Consistent with the idea that the new left has embraced the issue of environment, both trends are positive. The trend originates at the time of the conservation movement in the 1960s and continues up to recent debates over climate change.

Figure 5: Semantic Similarity between Left-Wing Parties and the Issue of Environment
Panels: (a) Canada: New Democratic Party; (b) US House: Democrats. Horizontal axes: Parliament (Canada) and Congress (US House); vertical axis: Cosine Similarity (Moving Average).
Moving average of the cosine similarity between environment topic vector and party embeddings, with 95% bootstrap confidence intervals.
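The topic-similarity computation and its bootstrap interval can be sketched as follows. This is a minimal illustration, assuming a percentile bootstrap (the text specifies resampling topic words with replacement but not how the interval is formed); the randomly generated topic vectors, the party vector and the function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def topic_similarity(topic_vectors, party_vector):
    """Cosine similarity between the centroid of the topic words and a party embedding."""
    return cosine(topic_vectors.mean(axis=0), party_vector)

def bootstrap_interval(topic_vectors, party_vector, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling topic words with replacement."""
    rng = np.random.default_rng(seed)
    n_words = topic_vectors.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        sample = topic_vectors[rng.integers(0, n_words, size=n_words)]
        draws[b] = topic_similarity(sample, party_vector)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [alpha, 1.0 - alpha])
    return float(lower), float(upper)

# Hypothetical data: 20 topic-word embeddings scattered around a common
# direction, and one party embedding for a given parliament.
data_rng = np.random.default_rng(1)
topic_vectors = data_rng.normal(loc=[1.0, 0.5, 0.0], scale=0.3, size=(20, 3))
party_vector = np.array([0.8, 0.6, 0.1])

point_estimate = topic_similarity(topic_vectors, party_vector)
lower, upper = bootstrap_interval(topic_vectors, party_vector)

print(f"Similarity: {point_estimate:.3f} (95% interval: {lower:.3f} to {upper:.3f})")
```

Applying the same functions to the party embedding of each parliament or Congress in turn, and then smoothing the resulting series with a moving average, would produce a trend of the kind summarized in Figure 5.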
7 Example Application: Party Polarization

Finally, another benefit of having estimates of party placement in a vector space is the possibility of computing quantities of interest based on metrics for vector distances. An obvious application


More information

And Yet it Moves: The Effect of Election Platforms on Party. Policy Images

And Yet it Moves: The Effect of Election Platforms on Party. Policy Images And Yet it Moves: The Effect of Election Platforms on Party Policy Images Pablo Fernandez-Vazquez * Supplementary Online Materials [ Forthcoming in Comparative Political Studies ] These supplementary materials

More information

Predicting Congressional Votes Based on Campaign Finance Data

Predicting Congressional Votes Based on Campaign Finance Data 1 Predicting Congressional Votes Based on Campaign Finance Data Samuel Smith, Jae Yeon (Claire) Baek, Zhaoyi Kang, Dawn Song, Laurent El Ghaoui, Mario Frank Department of Electrical Engineering and Computer

More information

Statistical Analysis of Corruption Perception Index across countries

Statistical Analysis of Corruption Perception Index across countries Statistical Analysis of Corruption Perception Index across countries AMDA Project Summary Report (Under the guidance of Prof Malay Bhattacharya) Group 3 Anit Suri 1511007 Avishek Biswas 1511013 Diwakar

More information

DU PhD in Home Science

DU PhD in Home Science DU PhD in Home Science Topic:- DU_J18_PHD_HS 1) Electronic journal usually have the following features: i. HTML/ PDF formats ii. Part of bibliographic databases iii. Can be accessed by payment only iv.

More information

Lab 3: Logistic regression models

Lab 3: Logistic regression models Lab 3: Logistic regression models In this lab, we will apply logistic regression models to United States (US) presidential election data sets. The main purpose is to predict the outcomes of presidential

More information



More information


AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 3 NO. 4 (2005) , Partisanship and the Post Bounce: A MemoryBased Model of Post Presidential Candidate Evaluations Part II Empirical Results Justin Grimmer Department of Mathematics and Computer Science Wabash College

More information

Following the Leader: The Impact of Presidential Campaign Visits on Legislative Support for the President's Policy Preferences

Following the Leader: The Impact of Presidential Campaign Visits on Legislative Support for the President's Policy Preferences University of Colorado, Boulder CU Scholar Undergraduate Honors Theses Honors Program Spring 2011 Following the Leader: The Impact of Presidential Campaign Visits on Legislative Support for the President's

More information

A Global Perspective on Socioeconomic Differences in Learning Outcomes

A Global Perspective on Socioeconomic Differences in Learning Outcomes 2009/ED/EFA/MRT/PI/19 Background paper prepared for the Education for All Global Monitoring Report 2009 Overcoming Inequality: why governance matters A Global Perspective on Socioeconomic Differences in

More information

Many theories of comparative politics rely on the

Many theories of comparative politics rely on the A Scaling Model for Estimating Time-Series Party Positions from Texts Jonathan B. Slapin Sven-Oliver Proksch Trinity College, Dublin University of California, Los Angeles Recent advances in computational

More information

Introduction to Path Analysis: Multivariate Regression

Introduction to Path Analysis: Multivariate Regression Introduction to Path Analysis: Multivariate Regression EPSY 905: Multivariate Analysis Spring 2016 Lecture #7 March 9, 2016 EPSY 905: Multivariate Regression via Path Analysis Today s Lecture Multivariate

More information

Using Poole s Optimal Classification in R

Using Poole s Optimal Classification in R Using Poole s Optimal Classification in R January 22, 2018 1 Introduction This package estimates Poole s Optimal Classification scores from roll call votes supplied though a rollcall object from package

More information

Michael Laver and Ernest Sergenti: Party Competition. An Agent-Based Model

Michael Laver and Ernest Sergenti: Party Competition. An Agent-Based Model RMM Vol. 3, 2012, 66 70 http://www.rmm-journal.de/ Book Review Michael Laver and Ernest Sergenti: Party Competition. An Agent-Based Model Princeton NJ 2012: Princeton University Press. ISBN: 9780691139043

More information

Preliminary Effects of Oversampling on the National Crime Victimization Survey

Preliminary Effects of Oversampling on the National Crime Victimization Survey Preliminary Effects of Oversampling on the National Crime Victimization Survey Katrina Washington, Barbara Blass and Karen King U.S. Census Bureau, Washington D.C. 20233 Note: This report is released to

More information

How many political parties are there, really? A new measure of the ideologically cognizable number of parties/party groupings

How many political parties are there, really? A new measure of the ideologically cognizable number of parties/party groupings Article How many political parties are there, really? A new measure of the ideologically cognizable number of parties/party groupings Party Politics 18(4) 523 544 ª The Author(s) 2011 Reprints and permission:

More information

Deep Classification and Generation of Reddit Post Titles

Deep Classification and Generation of Reddit Post Titles Deep Classification and Generation of Reddit Post Titles Tyler Chase tchase56@stanford.edu Rolland He rhe@stanford.edu William Qiu willqiu@stanford.edu Abstract The online news aggregation website Reddit

More information

Model of Voting. February 15, Abstract. This paper uses United States congressional district level data to identify how incumbency,

Model of Voting. February 15, Abstract. This paper uses United States congressional district level data to identify how incumbency, U.S. Congressional Vote Empirics: A Discrete Choice Model of Voting Kyle Kretschman The University of Texas Austin kyle.kretschman@mail.utexas.edu Nick Mastronardi United States Air Force Academy nickmastronardi@gmail.com

More information

Whose Statehouse Democracy?: Policy Responsiveness to Poor vs. Rich Constituents in Poor vs. Rich States

Whose Statehouse Democracy?: Policy Responsiveness to Poor vs. Rich Constituents in Poor vs. Rich States Policy Studies Organization From the SelectedWorks of Elizabeth Rigby 2010 Whose Statehouse Democracy?: Policy Responsiveness to Poor vs. Rich Constituents in Poor vs. Rich States Elizabeth Rigby, University

More information

The cost of ruling, cabinet duration, and the median-gap model

The cost of ruling, cabinet duration, and the median-gap model Public Choice 113: 157 178, 2002. 2002 Kluwer Academic Publishers. Printed in the Netherlands. 157 The cost of ruling, cabinet duration, and the median-gap model RANDOLPH T. STEVENSON Department of Political

More information

The 2017 TRACE Matrix Bribery Risk Matrix

The 2017 TRACE Matrix Bribery Risk Matrix The 2017 TRACE Matrix Bribery Risk Matrix Methodology Report Corruption is notoriously difficult to measure. Even defining it can be a challenge, beyond the standard formula of using public position for

More information


VOTING DYNAMICS IN INNOVATION SYSTEMS VOTING DYNAMICS IN INNOVATION SYSTEMS Voting in social and collaborative systems is a key way to elicit crowd reaction and preference. It enables the diverse perspectives of the crowd to be expressed and

More information

Parties, Candidates, Issues: electoral competition revisited

Parties, Candidates, Issues: electoral competition revisited Parties, Candidates, Issues: electoral competition revisited Introduction The partisan competition is part of the operation of political parties, ranging from ideology to issues of public policy choices.

More information

An empirical model of issue evolution and partisan realignment in a multiparty system

An empirical model of issue evolution and partisan realignment in a multiparty system An empirical model of issue evolution and partisan realignment in a multiparty system Article Accepted Version Online Appendix Arndt, C. (218) An empirical model of issue evolution and partisan realignment

More information

FOURIER ANALYSIS OF THE NUMBER OF PUBLIC LAWS David L. Farnsworth, Eisenhower College Michael G. Stratton, GTE Sylvania

FOURIER ANALYSIS OF THE NUMBER OF PUBLIC LAWS David L. Farnsworth, Eisenhower College Michael G. Stratton, GTE Sylvania FOURIER ANALYSIS OF THE NUMBER OF PUBLIC LAWS 1789-1976 David L. Farnsworth, Eisenhower College Michael G. Stratton, GTE Sylvania 1. Introduction. In an earlier study (reference hereafter referred to as

More information

Political text is a fundamental source of information

Political text is a fundamental source of information Treating Words as Data with Error: Uncertainty in Text Statements of Policy Positions Kenneth Benoit Michael Laver Slava Mikhaylov Trinity College New York University Trinity College Political text offers

More information

Median voter theorem - continuous choice

Median voter theorem - continuous choice Median voter theorem - continuous choice In most economic applications voters are asked to make a non-discrete choice - e.g. choosing taxes. In these applications the condition of single-peakedness is

More information

Chapter 6 Online Appendix. general these issues do not cause significant problems for our analysis in this chapter. One

Chapter 6 Online Appendix. general these issues do not cause significant problems for our analysis in this chapter. One Chapter 6 Online Appendix Potential shortcomings of SF-ratio analysis Using SF-ratios to understand strategic behavior is not without potential problems, but in general these issues do not cause significant

More information

Research Statement. Jeffrey J. Harden. 2 Dissertation Research: The Dimensions of Representation

Research Statement. Jeffrey J. Harden. 2 Dissertation Research: The Dimensions of Representation Research Statement Jeffrey J. Harden 1 Introduction My research agenda includes work in both quantitative methodology and American politics. In methodology I am broadly interested in developing and evaluating

More information



More information

The California Primary and Redistricting

The California Primary and Redistricting The California Primary and Redistricting This study analyzes what is the important impact of changes in the primary voting rules after a Congressional and Legislative Redistricting. Under a citizen s committee,

More information

The League of Women Voters of Pennsylvania et al v. The Commonwealth of Pennsylvania et al. Nolan McCarty

The League of Women Voters of Pennsylvania et al v. The Commonwealth of Pennsylvania et al. Nolan McCarty The League of Women Voters of Pennsylvania et al v. The Commonwealth of Pennsylvania et al. I. Introduction Nolan McCarty Susan Dod Brown Professor of Politics and Public Affairs Chair, Department of Politics

More information



More information

Segal and Howard also constructed a social liberalism score (see Segal & Howard 1999).

Segal and Howard also constructed a social liberalism score (see Segal & Howard 1999). APPENDIX A: Ideology Scores for Judicial Appointees For a very long time, a judge s own partisan affiliation 1 has been employed as a useful surrogate of ideology (Segal & Spaeth 1990). The approach treats

More information



More information

The Seventeenth Amendment, Senate Ideology, and the Growth of Government

The Seventeenth Amendment, Senate Ideology, and the Growth of Government The Seventeenth Amendment, Senate Ideology, and the Growth of Government Danko Tarabar College of Business and Economics 1601 University Ave, PO BOX 6025 West Virginia University Phone: 681-212-9983 datarabar@mix.wvu.edu

More information

Immigration and Multiculturalism: Views from a Multicultural Prairie City

Immigration and Multiculturalism: Views from a Multicultural Prairie City Immigration and Multiculturalism: Views from a Multicultural Prairie City Paul Gingrich Department of Sociology and Social Studies University of Regina Paper presented at the annual meeting of the Canadian

More information

Hierarchical Item Response Models for Analyzing Public Opinion

Hierarchical Item Response Models for Analyzing Public Opinion Hierarchical Item Response Models for Analyzing Public Opinion Xiang Zhou Harvard University July 16, 2017 Xiang Zhou (Harvard University) Hierarchical IRT for Public Opinion July 16, 2017 Page 1 Features

More information

Parties, Voters and the Environment

Parties, Voters and the Environment CANADA-EUROPE TRANSATLANTIC DIALOGUE: SEEKING TRANSNATIONAL SOLUTIONS TO 21ST CENTURY PROBLEMS Introduction canada-europe-dialogue.ca April 2013 Policy Brief Parties, Voters and the Environment Russell

More information

A procedure to compute a probabilistic bound for the maximum tardiness using stochastic simulation

A procedure to compute a probabilistic bound for the maximum tardiness using stochastic simulation Proceedings of the 17th World Congress The International Federation of Automatic Control A procedure to compute a probabilistic bound for the maximum tardiness using stochastic simulation Nasser Mebarki*.

More information

NOMINATE: A Short Intellectual History. Keith T. Poole. When John Londregan asked me to write something for TPM about NOMINATE

NOMINATE: A Short Intellectual History. Keith T. Poole. When John Londregan asked me to write something for TPM about NOMINATE NOMINATE: A Short Intellectual History by Keith T. Poole When John Londregan asked me to write something for TPM about NOMINATE and why we (Howard Rosenthal and I) went high tech rather than using simpler

More information

The Effectiveness of Receipt-Based Attacks on ThreeBallot

The Effectiveness of Receipt-Based Attacks on ThreeBallot The Effectiveness of Receipt-Based Attacks on ThreeBallot Kevin Henry, Douglas R. Stinson, Jiayuan Sui David R. Cheriton School of Computer Science University of Waterloo Waterloo, N, N2L 3G1, Canada {k2henry,

More information

THE WORKMEN S CIRCLE SURVEY OF AMERICAN JEWS. Jews, Economic Justice & the Vote in Steven M. Cohen and Samuel Abrams

THE WORKMEN S CIRCLE SURVEY OF AMERICAN JEWS. Jews, Economic Justice & the Vote in Steven M. Cohen and Samuel Abrams THE WORKMEN S CIRCLE SURVEY OF AMERICAN JEWS Jews, Economic Justice & the Vote in 2012 Steven M. Cohen and Samuel Abrams 1/4/2013 2 Overview Economic justice concerns were the critical consideration dividing

More information

Attitudes towards Refugees and Asylum Seekers

Attitudes towards Refugees and Asylum Seekers Attitudes towards Refugees and Asylum Seekers A Survey of Public Opinion Research Study conducted for Refugee Week May 2002 Contents Introduction 1 Summary of Findings 3 Reasons for Seeking Asylum 3 If

More information

Indian Political Data Analysis Using Rapid Miner

Indian Political Data Analysis Using Rapid Miner Indian Political Data Analysis Using Rapid Miner Dr. Siddhartha Ghosh Jagadeeswari Chittiboina Shireen Fatima HOD, CSE, Keshav Memorial MTech, CSE, Keshav Memorial MTech, CSE, Keshav Memorial siddhartha@kmit.in

More information

Colorado 2014: Comparisons of Predicted and Actual Turnout

Colorado 2014: Comparisons of Predicted and Actual Turnout Colorado 2014: Comparisons of Predicted and Actual Turnout Date 2017-08-28 Project name Colorado 2014 Voter File Analysis Prepared for Washington Monthly and Project Partners Prepared by Pantheon Analytics

More information

! # % & ( ) ) ) ) ) +,. / 0 1 # ) 2 3 % ( &4& 58 9 : ) & ;; &4& ;;8;

! # % & ( ) ) ) ) ) +,. / 0 1 # ) 2 3 % ( &4& 58 9 : ) & ;; &4& ;;8; ! # % & ( ) ) ) ) ) +,. / 0 # ) % ( && : ) & ;; && ;;; < The Changing Geography of Voting Conservative in Great Britain: is it all to do with Inequality? Journal: Manuscript ID Draft Manuscript Type: Commentary

More information

Ideology Classifiers for Political Speech. Bei Yu Stefan Kaufmann Daniel Diermeier

Ideology Classifiers for Political Speech. Bei Yu Stefan Kaufmann Daniel Diermeier Ideology Classifiers for Political Speech Bei Yu Stefan Kaufmann Daniel Diermeier Abstract: In this paper we discuss the design of ideology classifiers for Congressional speech data. We then examine the

More information

Out of Step, but in the News? The Milquetoast Coverage of Incumbent Representatives

Out of Step, but in the News? The Milquetoast Coverage of Incumbent Representatives Out of Step, but in the News? The Milquetoast Coverage of Incumbent Representatives Michael C. Dougal 1 1 Travers Department of Political Science, UC Berkeley 2016/07/11 Abstract Why do citizens routinely

More information