Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora

Ludovic Rheault and Christopher Cochrane

Abstract

Word embeddings, the coefficients from neural network models predicting the use of words in context, have now become inescapable in applications involving natural language processing and artificial intelligence. Despite a few studies in political science, the potential of this methodology for the analysis of political texts has yet to be fully uncovered. This paper introduces models of word embeddings augmented with political metadata and trained on large-scale parliamentary corpora from Britain, Canada, and the United States. We fit these models with indicator variables of the party affiliation of members of parliament, which we call party embeddings. We illustrate how these embeddings can be used to produce scaling estimates of ideological placement and other quantities of interest for political research. To validate the methodology, we assess our results against indicators from the Comparative Manifesto Project and measures based on roll-call votes. Our findings suggest that party embeddings are successful at capturing latent concepts such as ideology, and the approach provides researchers with an integrated framework for studying political language.

Author affiliations, in order: Assistant Professor, Department of Political Science and Munk School of Global Affairs and Public Policy, University of Toronto. Associate Professor, Department of Political Science, University of Toronto.

1 Introduction

The representation of meaning is a fundamental objective in natural language processing. As a basic illustration, consider queries performed with a search engine. We ideally want computers to return documents that are relevant to the substantive meaning of a query, just as a human being would interpret it, rather than simply the records with an exact word match. To achieve such results, early methods such as latent semantic analysis relied on low-rank approximations of word frequencies to score the semantic similarity between texts and rank them by relevance (Deerwester et al. 1990; Manning, Raghavan, and Schütze 2009, Ch. 18). The new state of the art in meaning representation is word embeddings, or word vectors, the parameter estimates of artificial neural networks designed to predict the occurrence of a word from the surrounding words in a text sequence. Consistent with the use theory of meaning (Wittgenstein 2009), these embeddings have been shown to capture semantic properties of language, revealed by an ability to solve analogies and identify synonyms (Mikolov, Sutskever, Chen, Corrado, and Dean 2013; Pennington, Socher, and Manning 2014).

Despite a broad appeal across disciplines, the use of word embeddings to analyze political texts remains a new field of research.[1] The aim of this paper is to examine the reliability of the methodology for the detection of a latent concept such as political ideology. In particular, we show that neural networks for word embeddings can be augmented with metadata available in parliamentary corpora. We illustrate the properties of this method using publicly available corpora from Britain, Canada and the United States, and assess its validity using external indicators.

The proposed methodology addresses at least three shortcomings associated with textual indicators of ideological placement currently available. First, as opposed to measures based on word frequencies, the estimates from our neural networks are trained to predict the use of language in context. Put another way, the method accounts for a party's usage of words given the surrounding text. Among other things, this also allows us to examine differences in how parties talk about the same issues, rather than only the differences in the issues that parties talk about. Second, our approach can easily accommodate control variables, factors that could otherwise confound the placement of parties or politicians. For example, we account for the government-opposition dynamics that have foiled ideology indicators applied to Westminster systems in the past, and filter out their influence to achieve more accurate estimates of party placement. Third, the methodology allows us to map political actors and language in a common vector space. This means that we can situate actors of interest based on their proximity to political concepts. Using a single model of embeddings, researchers can rank political actors relative to these concepts using a variety of metrics for vector arithmetic. We demonstrate such implementations in our empirical section.

[1] Examples of recent applications in political research include Rheault et al. (2016), Preoţiuc-Pietro et al. (2017), and Glavaš, Nanni, and Ponzetto (2017).

Our results suggest that word embeddings are a promising tool for expanding the possibilities of political research based on textual data. In particular, we find that scaling estimates of party placement derived from the embeddings for the metadata, which we call party embeddings, are strongly correlated with human-annotated and roll-call vote measures of left-right ideology. We distinguish between two approaches to estimate the placement of political actors on ideological dimensions. The first consists of using dimension reduction techniques on the raw embeddings, and does not involve the judgment of researchers. The second uses linear projections onto predefined scales, by choosing a relevant set of political expressions. Our findings indicate that both approaches lead to high levels of accuracy when assessed against external benchmarks, such that scholars can safely prefer the option that requires no arbitrary decisions when analyzing the ideological space of a legislature. We also show that the methodology is especially well suited to analyses where the main interest is to assess the proximity between actors and their association with groups of concepts, for instance to analyze language specificity, ideological polarization and issue ownership.

2 Relations to Previous Work

Two of the most popular approaches in political science for the extraction of ideology from texts are WordScores (Laver, Benoit, and Garry 2003) and WordFish (Slapin and Proksch 2008).[2] The first relies on a sample of labeled documents, for instance party manifestos annotated by experts. The relative probabilities of word occurrences in the labeled documents serve to produce scores for each word, which can be viewed as indicators of their ideological load. Next, the scores can be applied to the words found in new documents, to estimate their ideological placement. In fact, this approach can be compared to methods of supervised machine learning (Bishop 2006), where a computer is trained to predict the class of a labeled set of documents based on their observed features (e.g. their words). WordFish, on the other hand, relies on party annotations only. The methodology consists of fitting a regression model where word counts are projected onto party-year parameters, using an expectation-maximization algorithm (Slapin and Proksch 2008). This approach avoids the reliance on expert annotations, and amounts to estimating the specificity of word usage by party, at different points in time.

Neither of these approaches, however, takes into account the role of words in context. Put another way, they ignore semantics. Although theoretically both WordScores and WordFish could be expanded to include n-grams (sequences of more than one word), this comes at an increased computational cost. There are so many different combinations of words in the English language that it rapidly becomes inefficient to count them. This problem has been addressed recently in Gentzkow, Shapiro, and Taddy (2016), and presented as a curse of dimensionality.

[2] See also Lauderdale and Herzog (2016) for an extension of the method to legislative speeches, and Kim, Londregan, and Ratkovic (2018) for an expanded model combining both text and votes.
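To make the WordScores procedure described at the start of this section concrete, the following minimal sketch scores words from labeled reference texts and then places a new document. It follows the general logic of Laver, Benoit, and Garry (2003) as summarized above; the function names, the toy documents, and the reference positions of -1 and +1 are hypothetical and chosen only for illustration.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Return each word's share of the tokens in one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def word_scores(reference_docs, reference_positions):
    """Score each word as the mean reference position, weighted by the
    probability of being in each reference text given the word."""
    freqs = [relative_frequencies(doc) for doc in reference_docs]
    vocabulary = set().union(*freqs)
    scores = {}
    for word in vocabulary:
        shares = [f.get(word, 0.0) for f in freqs]
        total = sum(shares)
        scores[word] = sum(s / total * a for s, a in zip(shares, reference_positions))
    return scores

def score_document(tokens, scores):
    """Average the word scores over the scored words of a new document."""
    freqs = relative_frequencies([t for t in tokens if t in scores])
    return sum(share * scores[word] for word, share in freqs.items())

# Toy data (hypothetical): two labeled reference texts at -1 (left) and +1 (right).
left = "welfare housing welfare workers".split()
right = "taxes markets taxes business".split()
scores = word_scores([left, right], [-1.0, 1.0])

new_speech = "welfare taxes taxes".split()
placement = score_document(new_speech, scores)
```

Because each word is scored in isolation, the sketch also shows the limitation discussed above: the placement depends only on which words appear, not on the context in which they are used.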

Using a large number of words may be inefficient when tracking down ideological slants from textual data, since a high feature-document ratio overstates the variance across the target corpora (Taddy 2013; Gentzkow, Shapiro, and Taddy 2016). This problem is usually called overfitting in the machine learning literature. Problems associated with high dimensionality often preclude the reliance on n-grams. With a few exceptions (Sim et al. 2013; Iyyer et al. 2014), models of political language face a trade-off between ignoring the role of words in context and dealing with high-dimensional variables.[3] Instead of relying on word frequencies, word embedding models aim to capture and represent relations between words using co-occurrences, which sidesteps overfitting problems while allowing researchers to move beyond counts of words taken in isolation.

In the context of studies based on parliamentary debates, an additional concern is the ability to account for other institutional elements such as the difference in tone between the government and the opposition. A dangerous shortcut would consist of attributing any observed differences between party speeches to ideology. In any given parliament, the cabinet will use a different vocabulary than opposition parties due to the nature of these legislative functions. For instance, opposition parties in Westminster systems will invoke ministerial positions frequently when addressing their counterparts. Hirst et al. (2014) show that machine learning models used to classify texts by ideology have a tendency to be confounded with government-opposition language. As a result, temporal trends can also be obscured by procedural changes in the way government and opposition parties interact in parliament. A similar issue has been found to affect methods based on roll-call votes to infer ideology, where government-opposition dynamics can dominate the first dimension of the estimates (Spirling and McLean 2007; Hix and Noury 2016).
Our proposed methodology can accommodate the inclusion of additional control variables to filter out the effect of these institutional factors.

Finally, we argue that neural network models of language represent the natural way for scholars to move forward when attempting to measure a concept such as political ideology. The patterns of ideas that constitute ideologies are not as straightforward as is often assumed (Freeden 1998; Cochrane 2015). Ideologies are emergent phenomena that are tangible but irreducible. Ideologies are tangible in that they genuinely structure political thinking, disagreement, and behavior; they are irreducible, however, in that no one actor or idea, or group of actors or ideas, constitutes the core from which an ideology emanates. Second-degree connections between words and phrases give rise to meaningful clusters that fade from view when we analyze ideas, actors, or subsets of these things in isolation from their broader context. People know ideology when they see it, but struggle to say precisely what it is, because they cannot resolve the irresolvable (Cochrane 2015). This property of ideology has bedeviled past analysis. Since neural network models are designed to capture complex interactions between the inputs (in our case, context words and indicator variables for political actors), they are well adapted for the study of concepts that should theoretically emerge from such interactions.

[3] Other examples of text-based methods for the detection of ideology based on word and phrase occurrences include Gentzkow and Shapiro (2010), Diermeier et al. (2012) and Jensen et al. (2012).

3 Methodology

Models for word embeddings have been explored thoroughly in the literature, but we need to introduce them summarily to facilitate the exposition of our approach. This section also adopts a notation familiar to social science scholars. Our implementation uses shallow neural networks, that is, statistical models containing one layer of latent variables or hidden nodes between the input and output data.[4] The outcome variable $w_t$ is the word occurring at position $t$ in the corpus. The variable $w_t$ is multinomial with $V$ categories corresponding to the size of the vocabulary. The input variables in the model are the surrounding words appearing in a window of size $\Delta$ before and after the outcome word, which we denote as $\mathbf{w}_\Delta = (w_{t-\Delta}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+\Delta})$. The window is symmetrical to the left and to the right, which is the specification we use for this study, although non-symmetrical windows are possible, for instance if one wishes to give more consideration to the previous words in a sequence than to the following ones. Simply put, word embedding models consist of predicting $w_t$ from $\mathbf{w}_\Delta$.

The neural network can be subdivided into two components. Let $z_m$ represent a hidden node, with $m \in \{1, \ldots, M\}$ and where $M$ is the dimension of the hidden layer. Each node can be expressed as a function of the inputs:

$$z_m = f(\mathbf{w}_\Delta' \boldsymbol{\beta}_m) \tag{1}$$

In machine learning, $f$ is called an activation function. In the case of word embedding models such as the one we rely upon, that function is simply the average value of $\mathbf{w}_\Delta' \boldsymbol{\beta}_m$ across all input words (see Mikolov, Chen, Corrado, and Dean 2013). Since each word in the vocabulary can be treated as an indicator variable, Eq. (1) can be expressed equivalently as

$$z_m = \frac{1}{2\Delta} \sum_{w_v \in \mathbf{w}_\Delta} \beta_{v,m} \tag{2}$$

that is, a hidden node is the average of coefficients $\beta_{v,m}$ specific to a word $w_v$ if that word is present in the context of the target word $w_t$.
In turn, the vector of hidden nodes $\mathbf{z} = (z_1, \ldots, z_M)$ is the average of the $M$-dimensional vectors of coefficients $\boldsymbol{\beta}_v$, for all words $v$ occurring in $\mathbf{w}_\Delta$:

$$\mathbf{z} = \frac{1}{2\Delta} \sum_{w_v \in \mathbf{w}_\Delta} \boldsymbol{\beta}_v \tag{3}$$

[4] For the purpose of our presentation, we follow the steps of the model that Mikolov, Chen, Corrado, and Dean (2013) call continuous bag-of-words (CBOW).
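As a small illustration of Eq. (3), the sketch below computes the hidden layer as the element-wise average of the context-word embeddings. The embedding values, the dimension M = 2, and the function name are hypothetical; a fitted model would supply the vectors.

```python
def hidden_layer(context_words, embeddings):
    """Average the M-dimensional embeddings of the context words (Eq. 3).

    With a full symmetric window, len(context_words) equals 2 * Delta.
    """
    vectors = [embeddings[w] for w in context_words]
    m = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(m)]

# Hypothetical embeddings with M = 2 and a window of Delta = 1 (two context words).
beta = {"overcoming": [1.0, 0.0], "barriers": [0.0, 2.0], "to": [3.0, 4.0]}
z = hidden_layer(["barriers", "to"], beta)
```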

Upon estimation, these vectors $\boldsymbol{\beta}_v$ are the word embeddings of interest. The remaining component of the model expresses the probability of the target word $w_t$ as a function of the hidden nodes. Similar to the multinomial logit regression framework, commonly used to model vote choice, a latent variable representing a specific word $i$ can be expressed as a linear function of the hidden nodes: $u_{it} = \alpha_i + \mathbf{z}'\boldsymbol{\mu}_i$. The probability $P(w_t = i)$ given the surrounding words corresponds to:

$$P(w_t = i \mid \mathbf{w}_\Delta) = \frac{e^{\alpha_i + \mathbf{z}'\boldsymbol{\mu}_i}}{\sum_{v=1}^{V} e^{\alpha_v + \mathbf{z}'\boldsymbol{\mu}_v}} \tag{4}$$

The full model can be written compactly using nested functions and dropping some indices for simplicity:

$$P(w_t \mid \mathbf{w}_\Delta) = g(\boldsymbol{\alpha}, \boldsymbol{\mu}, f(\mathbf{w}_\Delta' \boldsymbol{\beta})) \tag{5}$$

As can be seen with the visual depiction in Figure 1, the embeddings $\boldsymbol{\beta}$ link each input word to the hidden nodes.[5] The parameters of the model can be fitted by minimizing the cross-entropy using stochastic gradient descent. We rely on negative sampling to fit the predicted probabilities in Eq. (4) (see Mikolov, Sutskever, Chen, Corrado, and Dean 2013). In an influential study, Pennington, Socher, and Manning (2014) have shown that a corresponding model can be represented as a log-bilinear Poisson regression using the word-word co-occurrence matrix of a corpus as data. However, the implementation we use here facilitates the inclusion of metadata by preserving individual words as units of analysis.

The basic model introduced above can be expanded to include additional input variables, which is our main interest in this paper. A common implementation uses indicator variables for documents or segments of text of interest, in addition to the context words (Le and Mikolov 2014).[6] The approach was originally called paragraph vectors or document vectors. More generally, other types of metadata can be entered in Eq. (1) to account for properties of interest at the document level, which is the approach we adopt here (for an illustration using political texts, see Nay 2016).
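The output component in Eq. (4) can be illustrated with a short sketch that evaluates the exact multinomial logit probabilities for a tiny vocabulary. This is for exposition only: as noted above, estimation relies on negative sampling precisely to avoid the full sum over the vocabulary. All parameter values and the function name are hypothetical.

```python
import math

def word_probabilities(z, alpha, mu):
    """Return P(w_t = i | context) for every word i, following Eq. (4)."""
    utilities = [a + sum(zj * mj for zj, mj in zip(z, m)) for a, m in zip(alpha, mu)]
    top = max(utilities)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hidden layer (M = 2) and output parameters for V = 3 words.
z = [1.5, 3.0]
alpha = [0.0, 0.0, 0.0]
mu = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = word_probabilities(z, alpha, mu)
```

In this toy case the second word has the largest latent value (3.0), so it receives the highest predicted probability.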
In our implementation, we focus primarily on indicator variables measuring the party affiliation of a member of parliament (MP) or congressperson uttering a speech. The inner component of the expanded model can be represented as:

$$z_m = f(\mathbf{w}_\Delta' \boldsymbol{\beta}_m + \mathbf{x}' \boldsymbol{\zeta}_m) \tag{6}$$

where $\mathbf{x}$ is a vector of metadata, and the rest of the specification is the same as before. In addition to party affiliation, it is straightforward to account for attributes with the potential to affect the use of language and confound party-specific estimates. We mentioned the government status earlier (cabinet versus non-cabinet positions, or party in power versus opposition). For Canada, a country where federal politics is characterized by persistent regional divides, a relevant variable would be the province of the MP. Just like words have their embeddings, each variable entered in $\mathbf{x}$ has a corresponding vector $\boldsymbol{\zeta}$ of dimension $M$.

[5] In fact, Mikolov, Chen, Corrado, and Dean (2013) proposed two approaches: one in which the word embeddings are the link coefficients between input words and the hidden nodes (CBOW), and another where the outcome and the inputs are switched (called skip-gram), in effect predicting the surrounding context from the word rather than the reverse.

[6] The type of model described here is called distributed memory in the original article (Le and Mikolov 2014).

Figure 1: Example of Model with Word and Party Embeddings
[Diagram: the context words $w_{t-3}$ = overcoming, $w_{t-2}$ = barriers, $w_{t-1}$ = to, $w_{t+1}$ = and, $w_{t+2}$ = tackling, $w_{t+3}$ = inequalities are linked to the hidden layer $\mathbf{z}$ through the embeddings $\boldsymbol{\beta}$; the metadata $\mathbf{x}$ = Labour 2005 is linked to $\mathbf{z}$ through $\boldsymbol{\zeta}$; the hidden layer is linked through $\boldsymbol{\mu}$ to the target word $w_t$ = work.]
Schematic example of input and output data in a model with $M = 5$ and a window $\Delta = 3$. The model includes a variable indicating the party affiliation and parliament of the politician making the speech.

Observe that the resulting vectors $\boldsymbol{\zeta}$ have commonalities with the WordFish estimator of party placement. In their WordFish model, Slapin and Proksch (2008) predict word counts with party-year indicator variables. The resulting parameters are interpreted as the ideological placement of parties. The model introduced in (6) achieves a similar goal. The key difference is that our model is estimated at the word level, while taking into account the context ($\mathbf{w}_\Delta$) in which a word occurs. The hidden layer serves an important purpose by capturing interactions between the metadata and these context words. Moreover, the dimension of the hidden layer will determine the size of what we refer to as party embeddings in what follows, that is, the estimated parameters for each party. Rather than a single point estimate, we fit a vector of dimension $M$. An obvious benefit is that these party embeddings can be compared against the rest of the corpus vocabulary in a common vector space, as we illustrate below.
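The expanded hidden layer in Eq. (6) can be sketched in the same way as the basic model. The sketch assumes that the metadata vectors enter the same average as the context-word embeddings, which is one common way to implement the distributed-memory model; the text above does not spell out this detail, so treat it as an assumption. The tag names and vector values are hypothetical.

```python
def hidden_layer_with_metadata(context_words, metadata, beta, zeta):
    """Average word embeddings (beta) and metadata embeddings (zeta) together.

    Assumption: metadata vectors are pooled with the context words in a
    single mean, rather than added after averaging.
    """
    vectors = [beta[w] for w in context_words] + [zeta[x] for x in metadata]
    m = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(m)]

# Hypothetical embeddings with M = 2.
beta = {"barriers": [0.0, 2.0], "to": [3.0, 4.0]}
zeta = {"Labour_2005": [3.0, 0.0], "government": [2.0, 2.0]}

# A party-parliament indicator plus a government-status control.
z = hidden_layer_with_metadata(["barriers", "to"], ["Labour_2005", "government"], beta, zeta)
```

Because the party vector and the word vectors are combined in the same hidden layer, they share the same M dimensions, which is what allows parties and words to be compared directly.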
Specifically, our implementation uses party-parliament pairs as indicator variables, for a number of reasons. First, fitting combinations of parties and time periods allows us to reproduce the nature of the WordFish model as closely as possible: each party within a given parliament or congress has a specific embedding. This approach accounts for the possibility that the language and issues debated by each party may evolve from one parliament to the next. Parties are allowed to move over time in the vector space. We rely on parliaments/congresses, rather than years, to facilitate external validity tests against roll-call vote measures and annotations based on party manifestos, which are published at the election preceding the beginning of each parliament. Of course, the possible specifications are virtually endless and may differ in future applications. But we believe that the models we present are consistent with existing practice and provide a useful ground for a detailed assessment.

4 Parliamentary Corpora

Models of word embeddings have been shown to perform best when fitted on large corpora that are adapted to the domain of interest (Lai, Liu, and Xu 2016). For the purpose of this study, we rely on three publicly available resources containing digitized parliamentary debates overlapping a century of history in Canada, the United States and Britain. Replicating the results in three polities helps to demonstrate that the proposed methodology is general in application. The Canadian Hansard corpus is described in Beelen et al. (2017) and released as linked and open data on www.lipad.ca. Our version of the British Hansard corpus is hosted on the Political Mashup website.[7] Finally, the United States corpus is the version released by Gentzkow, Shapiro, and Taddy (2016). Each resource is enriched with a similar set of metadata about speakers, such as party affiliations and functions. The first section of the online appendix describes each resource in more detail.

We considered speeches made by the major parties in each corpus.
For Canada, we use the entirety of the available corpus, which covers a period ranging between 1901 and 2017, from the 9th to the 42nd Parliament. The corpus represents over 3 million speeches after restricting our attention to five major parties (Conservatives, Liberals, New Democratic Party, Bloc Québécois, and Reform Party/Canadian Alliance). For the United Kingdom, the corpus covers the period from 1936 to 2014. We restrict our focus to the three major party labels: Labour, Liberal-Democrats, and Conservatives. We removed speeches from the Speaker of the House of Commons in Britain and Canada, whose role is non-partisan. The United States corpus ranges from 1873 to 2016 (43rd to 114th Congress). We present results for the House of Representatives and the Senate separately, and restrict our attention to voting members affiliated with the Democratic and Republican parties.

[7] See http://search.politicalmashup.nl/.

For each corpus, we tested models with various specifications and compared their accuracy. The main text reports models with hidden layers of 200 nodes, which we have found to be reliable for applied research. The appendix provides additional information on parameterization and its influence on the output. Each model uses a window of ±20 words and includes tokens with a minimum count of 50 occurrences. Our models include not only words, but also common phrases. We proceed with two passes to detect collocations (words used frequently together) and merge them as single entities, which means that we capture phrases of up to 4 words.[8] This is especially useful for political research, where multi-word entities are frequent and common expressions may have specific meanings (e.g. "civil rights"). We fit the models using custom scripts based on Řehůřek and Sojka's (2010) implementation for Python and a 0.025 learning rate. We preprocessed the text by removing digits and words with two letters or fewer, as well as a list of English stop words enriched to remove overly common procedural words such as "speaker" (or "chairwoman"/"chairman" in the United States), used in most speeches due to decorum. Our scripts are released publicly.[9]

5 Two Approaches to Ideological Scaling

We start by assessing the ability of the model to represent political ideology. We propose two different approaches for this purpose. The first consists of extracting the principal components from the party vectors and interpreting them as estimates of ideological placement. As long as ideology is the main dimension along which political actors differ in terms of semantics, this interpretation is plausible. Moreover, we illustrate how the methodology has features that facilitate the interpretation of the principal components. In the second approach, we identify words or expressions that define the dimensions a priori, and use them to create a customized vector space. We call this second approach guided since the choice of words as anchor points will have some influence on the findings.
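Before turning to the two scaling approaches, the two-pass phrase detection described in Section 4 can be sketched as follows. The sketch uses the pair-scoring rule from Mikolov, Sutskever, Chen, Corrado, and Dean (2013), namely (count(ab) - delta) / (count(a) * count(b)); the function name, the toy corpus, and the delta and threshold values are hypothetical and much smaller than what a real corpus would require. It is not the authors' released script.

```python
from collections import Counter

def merge_phrases(sentences, delta=1, threshold=0.1):
    """One pass: join adjacent pairs whose co-occurrence score exceeds a threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s):
                a, b = s[i], s[i + 1]
                score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + "_" + b)  # treat the pair as a single token
                    i += 2
                    continue
            out.append(s[i])
            i += 1
        merged.append(out)
    return merged

# Toy corpus (hypothetical), already stripped of stop words.
corpus = [["european", "court", "human", "rights"]] * 3 + [["court", "rules"], ["rights", "bill"]]

first_pass = merge_phrases(corpus)       # merges pairs of single words
second_pass = merge_phrases(first_pass)  # can merge two pairs into a 4-word phrase
```

The first pass yields the tokens `european_court` and `human_rights`; the second pass joins them, which is how two passes can produce phrases of up to four words.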
5.1 Unguided Projections

With the simplest approach, the objective is to project the $M$-dimensional party embeddings into a substantively meaningful vector space. These party embeddings can be visualized in two dimensions using standard dimensionality reduction techniques such as principal component analysis (PCA), which we rely upon in this section. In plain terms, PCA finds the one-dimensional component that maximizes the variance across the vectors of party embeddings (see e.g. Hastie, Tibshirani, and Friedman 2009, Ch. 14.5). The next component is calculated the same way, by imposing a constraint of zero covariance with the first component. Additional components could be computed but our analysis focuses on two-dimensional projections, to simplify visualizations. If the speeches made by members of different parties are primarily characterized by ideological conflicts, as is normally assumed in unsupervised techniques for ideology scaling, we can reasonably expect the first component to describe the ideological placement of parties. The second component will capture the next most important source of semantic variation across parties in a legislature. To facilitate the interpretation of these components, we can use the model's word embeddings and examine the concepts most strongly associated with each dimension.

[8] Each pass combines pairs of words frequently used together as a single expression, for instance "united kingdom". By applying a second pass, expressions of one word or two words can be merged, resulting in phrases of up to 4 words. The algorithm used to detect phrases is based on the original implementation of word embeddings proposed in Mikolov, Sutskever, Chen, Corrado, and Dean (2013). We have found the inclusion of phrases to bring some improvement in accuracy, but models without phrases remain reliable for political analysis.

[9] See https://github.com/lrheault/partyembed.

Starting with the US corpus, Figure 2 plots the party embeddings in a two-dimensional space for the House of Representatives and the Senate. We label each data point using an abbreviation of the party name and the beginning year of a Congress; for instance, the embedding $\boldsymbol{\zeta}_{\text{Dem 2011}}$ means the Democratic party in the Congress starting in 2011 (the 112th Congress). The only adjustment that may be relevant to perform is orienting the scale in a manner intuitive for interpretation, for instance, by multiplying the values of a component by $-1$ such that conservative parties appear on the right. We fit separate models for the two chambers. Each model includes party-congress indicator variables as well as separate dummy variables for congress, which account for temporal change in the discourse.

Our methodology captures ideological shifts that occurred during the 20th Century. Whereas both major parties were originally close to the center of the first dimension, which we interpret as the left-right or liberal-conservative divide, they begin to drift apart around the New Deal era in the 1930s, a period usually associated with the fifth party system.
Consistent with common wisdom, party embeddings for the Democrats started to shift toward the left of the ideological spectrum, while Republicans moved the opposite way. The trend culminates with a period of marked polarization from the late 1990s to the most recent Congresses. However, the most spectacular shift is probably the one occurring on the second dimension, which we interpret as a South-North divide (we oriented the South to the bottom, and North to the top). The change reflects a well-documented realignment between Northern and Southern states that occurred between the New Deal and the civil rights eras (Shafer and Johnston 2009; Sundquist 2011). A similar trajectory is manifest using both the House and the Senate corpora. Whereas Republicans initially became associated with issues of Northern states, the two parties eventually switched sides entirely. The recent era appears particularly polarized on both axes, which is consistent with a body of literature documenting party polarization (we return to this discussion in the penultimate section of this paper). On the other hand, we do not find clear evidence of polarization on the principal component in the late 20th Century, contrary to indicators based on vote data (Poole and Rosenthal 2007) but consistent with Gentzkow, Shapiro, and Taddy (2016).

Figure 2: Party Placement in the US Congress (1873-2016)
[Two scatter plots of party embeddings, one point per party and Congress, labeled with the party abbreviation (Dem or Rep) and the first year of the Congress. Panels: (a) House, (b) Senate. Horizontal axis: Component 1. Vertical axis: Component 2.]
The figure shows the two principal components of party embeddings for the US House and the US Senate.

The proposed models have desirable properties for interpreting the low-dimensional projection, by taking advantage of having words and political actors in the same model. In particular, we can compute the cosine similarity between each word or phrase in the vocabulary and the party embeddings, and then identify the words that most strongly correlate with each axis. To illustrate, Table 1 reports the expressions with the highest and lowest correlation coefficients for the House model. As can be seen, the first dimension is negatively correlated with expressions such as black caucus, decent housing, the poor and the elderly, meaning that these expressions define the semantics of parties located on the left. These words refer to topics one would expect in the language of liberal (or left-wing) parties in the United States. Conversely, issues like decentralization and bureaucracies are associated with the right. As for the second dimension, the keywords unambiguously refer to Southern and Northern locations, supporting our interpretation of that axis as a South-North divide. As discussed earlier, ideology cannot be reduced to any single group of expressions, and the proposed methodology is precisely designed to capture the latent structure of political debates. As a result, the words in Table 1 only represent the axes imperfectly. However, in this case they provide straightforward clues that facilitate a substantive interpretation of each axis.
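The unguided projection described in this section can be sketched in a few lines. To keep the example dependency-free, it extracts only the first principal component using power iteration, which is a substitute for a full PCA routine and not necessarily what the released scripts do; it then flips the sign so that a chosen reference party appears on the right. The labels, the three-dimensional vectors, and the function names are hypothetical.

```python
import math
import random

def first_component(vectors, iterations=200, seed=0):
    """Return (direction, scores) of the first principal component via power iteration."""
    n, m = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(m)]
    centered = [[v[j] - means[j] for j in range(m)] for v in vectors]
    rng = random.Random(seed)
    direction = [rng.uniform(-1, 1) for _ in range(m)]
    for _ in range(iterations):
        # Apply X'X to the current direction without forming the covariance matrix.
        scores = [sum(r[j] * direction[j] for j in range(m)) for r in centered]
        new = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break
        direction = [x / norm for x in new]
    scores = [sum(r[j] * direction[j] for j in range(m)) for r in centered]
    return direction, scores

def orient(scores, labels, right_label):
    """Multiply by -1 if needed so that the reference actor has a positive score."""
    sign = 1.0 if scores[labels.index(right_label)] >= 0 else -1.0
    return [sign * s for s in scores]

# Hypothetical party embeddings with M = 3.
labels = ["Dem 2011", "Dem 2013", "Rep 2011", "Rep 2013"]
party_embeddings = [[-2.0, -1.0, 0.1], [-2.2, -1.1, 0.0], [2.0, 1.0, 0.0], [2.2, 1.1, -0.1]]

direction, raw_scores = first_component(party_embeddings)
pc1 = orient(raw_scores, labels, right_label="Rep 2013")
```

The sign of a principal component is arbitrary, which is why the orientation step is the only adjustment needed before reading the scores as a left-right placement.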
Table 1: Interpreting PCA Axes: Word and Phrase Correlations

Component | Orientation | Words/Phrases with Highest Correlation
First | Positive (Right) | decentralization, centralized, nebraska, kansas, mentioned earlier, governmentrun, apropos, feed grain, bureaucratic, bureaucracies
First | Negative (Left) | congressional black caucus, black caucus, poor elderly, decent housing, latinos, cbc, deepest, elderly handicapped, wealthiest, elderly disabled
Second | Positive (North) | buffalo, minneapolis, detroit, milwaukee, cleveland, vermont, toledo, maine, erie, chicago
Second | Negative (South) | georgia, southeast, red river, arkansas, bankhead, gulf mexico, georgias, shreveport, waco, peanut

Another way to validate an interpretation of the projections is to retrieve concepts that are semantically similar to specific parties. For instance, we can readily identify the expressions closest to the position of the Democrats in the vector space for the 114th Congress, by retaining words having the highest cosine similarity with that specific party embedding. This does not require any technique for dimension reduction, as the similarity scores can be computed from the original, M-sized embeddings. The top words for the Democrats contain relevant hints at a liberal stance, with concepts such as gun violence and environmental protection. For Republicans in the same Congress, we find expressions such as Obamacare, overregulation and job creators. The full list of top words is printed out in the appendix. Once again, this exploration of the model suggests that party embeddings can achieve a meaningful representation of positions in an ideological space.
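The retrieval step described above can be illustrated with a minimal sketch: rank every vocabulary item by its cosine similarity with a single party embedding and keep the top k. The vectors below are made-up three-dimensional toy values, not estimates from the fitted model, and the function name `nearest_words` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(party_vector, word_vectors, k=10):
    """Return the k words with the highest cosine similarity to a party embedding."""
    scored = [(cosine(party_vector, vec), word) for word, vec in word_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]

# Toy vectors (hypothetical values chosen only to make the example run).
word_vectors = {
    "gun violence": [0.9, 0.1, 0.0],
    "environmental protection": [0.8, 0.3, 0.1],
    "overregulation": [-0.7, 0.2, 0.1],
    "job creators": [-0.9, 0.1, 0.2],
}
party_vector = [1.0, 0.2, 0.0]  # stand-in for one party embedding

top = nearest_words(party_vector, word_vectors, k=2)
```

Because the ranking uses the full-length vectors directly, no dimension reduction is involved at any point.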

Table 2: Accuracy of Party Placement against Gold Standards

Gold Standard | Metric | US House | US Senate | Canada | Britain
Voteview | Correlation | 0.918 | 0.934 | |
Voteview | Pairwise Accuracy | 85.96% | 86.56% | |
rile | Correlation | | | 0.751 | 0.678
rile | Pairwise Accuracy | | | 76.60% | 74.60%
vanilla | Correlation | | | 0.719 | 0.755
vanilla | Pairwise Accuracy | | | 76.44% | 77.58%
legacy | Correlation | | | 0.843 | 0.876
legacy | Pairwise Accuracy | | | 80.07% | 82.66%

The gold standard used for the United States is the average DW-NOMINATE score (first dimension) from the Voteview project (Poole and Rosenthal 2007). For Canada and the UK, the references are the rile measure of party placement based on the 2017 version of the Comparative Manifesto Project (CMP) dataset (Budge and Laver 1992; Budge et al. 2001), the Vanilla measure of left-right placement (Gabel and Huber 2000), and the legacy measure from Cochrane (2015). The pairwise accuracy metric counts the number of correct ideological orderings for all possible pairs of parties and parliaments.

To further assess the validity of estimates derived from our model, we evaluate our predicted ideological placement against well-known metrics from the literature. Table 2 reports the results. For the US corpus, we use DW-NOMINATE estimates based on roll-call votes and retrieved from the latest version of the Voteview project (Poole and Rosenthal 2007). We use the first principal component of our models and compute the Pearson correlation coefficient with the first dimension of the aggregate Voteview indicator, which measures the average placement of members of Congress by party over time. We also report the pairwise accuracy, that is, the percentage of pairs of party placements that are consistently ordered relative to the gold standard. Pairwise accuracy accounts for all possible comparisons, within parties and across parties. Our placement is strongly correlated with the Voteview scores (ρ > 0.9) and the pairwise accuracy, a more conservative metric, is above 85% for both the House and the Senate.
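The two validation metrics can be written out in a few lines of Python. This is a sketch under stated assumptions: the toy numbers are invented, and counting tied pairs as disagreements is a choice made here because the text does not say how ties are handled.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_accuracy(estimates, gold):
    """Share of all pairs (i, j) ordered the same way in both lists.

    Tied pairs count as disagreements (an assumption of this sketch).
    """
    pairs = list(combinations(range(len(gold)), 2))
    agree = 0
    for i, j in pairs:
        est_diff = estimates[i] - estimates[j]
        gold_diff = gold[i] - gold[j]
        if est_diff * gold_diff > 0:  # same sign means same ordering
            agree += 1
    return agree / len(pairs)

# Toy example (hypothetical values): four party-parliament observations.
gold = [-0.4, -0.3, 0.2, 0.5]        # stand-in for a gold-standard score
estimates = [-5.0, -6.0, 3.0, 8.0]   # stand-in for first-component scores

r = pearson(estimates, gold)
acc = pairwise_accuracy(estimates, gold)
```

In the toy data, one of the six pairs is ordered inconsistently, which lowers pairwise accuracy even though the linear correlation stays high; this is the sense in which the ordering-based metric is the more conservative of the two.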
The close agreement with the Voteview scores strengthens the conclusion that our model produces reliable estimates of ideological placement. Next, we illustrate that the methodology is generalizable across polities by replicating the same steps using the British and Canadian parliamentary corpora. To begin, Figure 3a reports a visualization of party embeddings for Britain. In addition to party and parliament indicators, the model includes a variable measuring whether an MP is a member of the cabinet or not. As can be seen, political parties are once again appropriately clustered together in the vector space: speeches made by members of the same group tend to resemble each other across parliaments. Moreover, the parties are correctly ordered relative to intuitive expectations about political ideology. Focusing on the first principal component (x-axis), the Labour party appears on one end

of the spectrum, the Liberal-Democrats occupy the center, whereas the embeddings for Conservatives are clustered on the other side. In fact, without any intervention needed on our end, the model correctly captures well-accepted claims about ideological shifts within the British party system over time (see e.g. Clarke et al. 2004). For instance, the party embeddings for Conservatives during the Thatcher era (Cons 1979, 1983, and 1987) are located farther out on the right end of the axis, whereas Labour's shift toward the center during the New Labour era (Labour 1997, 2001, and 2005), under the leadership of Tony Blair, is also apparent. The second component captures dynamics opposing the party in power and the opposition, with parties forming a government appearing at the top of the y-axis.

Finally, Figure 3b reports the results for Canada. 10 Once more, the first principal component can be readily interpreted in terms of left-right ideological placement. The Conservatives appear consistently on the right, whereas the left-wing New Democratic Party (NDP, which is merged with its predecessor, the Co-operative Commonwealth Federation) is correctly located on the other end of the spectrum. The Reform/Canadian Alliance split from the Conservatives, generally viewed as the most conservative political party in the Canadian system, appears at the extreme right of the first dimension, consistent with substantive expectations (see Cochrane 2010). In the case of Canada, the second principal component can be easily interpreted as a division between parties reflecting their views of the federation. The secessionist party, the Bloc Québécois, appears clustered on one end of the y-axis, whereas federalist parties are grouped on the other side. Such a division also resurfaces in models based on vote data (see Godbout and Høyland 2013; Johnston 2017).
For the last two countries, we validate our ideological placement using data from the Comparative Manifesto Project (CMP) (Budge and Laver 1992; Budge et al. 2001). We report the same two accuracy metrics used earlier in the lower section of Table 2. Note that we could not use that resource as a reliable benchmark for the United States since the CMP only provides data on American parties at four-year intervals. On the other hand, the CMP data provide a more useful gold standard to assess Westminster systems, for which vote-based estimates are less reliable indicators of ideology. The CMP is based on party manifestos and relies on human annotations to score the orientation of political parties on a number of policy items, following a normalized scheme. We test whether three ideology indicators derived from the project's data are consistent with our estimated placement of the same party in the parliament that immediately follows. 11 Looking first at the British case, party placements appear positively correlated with the three

10 For that country, our model includes variables measuring whether the MP making the speech belongs to the cabinet or not, whether they belong to the party in power or the opposition, and their province of origin.
11 The rile measure is the original left/right measure in the CMP. It is an additive index composed of 26 policy-related items, as described in Budge et al. (2001). The Vanilla measure proposed by Gabel and Huber (2000) uses all 56 items in the CMP and weights them according to their loadings on the first unrotated dimension of a factor analysis. The Legacy measure is a weighted index based on a network analysis of party positions and a model that assigns exponentially decaying weight to party positions in prior elections (Cochrane 2015).

Figure 3: Party Placement in Britain (1935-2014) and Canada (1901-2017). Panels: (a) Britain, (b) Canada. Each point is a party embedding labeled by party and election year; x-axis: Component 1, y-axis: Component 2. The figure shows the two principal components of party embeddings for the British and Canadian parliaments.

external indicators, ranging from ρ = 0.68 when considering the more basic right minus left measure (rile) from the 2017 CMP dataset, up to ρ = 0.76 and ρ = 0.88 using more robust ideology metrics based on the same data. As for pairwise accuracy, between 75% and 83% of comparisons against CMP-based measures are ordered consistently. These results once again support the conclusion that our party embeddings are related to left-right ideology as coded by humans using the party manifestos. For Canada, the fit with the CMP data is also strong across the three gold standards. For instance, the correlations range between 0.72 and 0.84.

For all countries, our tests suggest that setting M around 200 dimensions is a reliable default value to achieve results consistent with human judgments or vote-based indicators. In the appendix, we report additional tests of accuracy of the models for various hidden layer sizes. We also discuss validation tests of the word embeddings using standard benchmarks from the literature.

5.2 Guided Projections

Instead of interpreting dimensions ex-post, researchers may also choose to define axes of interest. We briefly illustrate how the proposed methodology can be used in such fashion. We start by choosing expressions representative of opposite ideological stances on economic and social issues (see Table A4 in the appendix for the full list). When more than one term is used to anchor a position, we can take the centroids of each group of words and phrases, by averaging their embeddings. Finally, axes are created by taking the difference between the right and left centroids, for each dimension of interest. We project party embeddings onto the customized space by taking dot products:

\[ \zeta \cdot \left( \frac{\sum_{i \in L_{Right}} \beta_i}{V_{Right}} - \frac{\sum_{i \in L_{Left}} \beta_i}{V_{Left}} \right) \]

where L_{Left} is the chosen lexicon for words identifying the left-wing, and V_{Left} the size of that lexicon (and similarly for the Right).
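The projection described above can be sketched as follows: average the embeddings of each lexicon to obtain two centroids, subtract the left centroid from the right one to form an axis, and take the dot product of a party embedding with that axis. The two-dimensional vectors and the function names are hypothetical and serve only to make the arithmetic visible; the lexicon terms echo examples mentioned in the text.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def guided_axis(embeddings, right_lexicon, left_lexicon):
    """Axis defined as the right centroid minus the left centroid."""
    right = centroid([embeddings[w] for w in right_lexicon])
    left = centroid([embeddings[w] for w in left_lexicon])
    return [r - l for r, l in zip(right, left)]

def project(party_vector, axis):
    """Dot product of a party embedding with a guided axis."""
    return sum(p * a for p, a in zip(party_vector, axis))

# Toy word embeddings (hypothetical values, not from the fitted model).
embeddings = {
    "free enterprise": [1.0, 0.0],
    "taxpayers": [0.8, 0.2],
    "workers": [-1.0, 0.0],
    "redistribution": [-0.8, 0.2],
}
economic_axis = guided_axis(
    embeddings,
    right_lexicon=["free enterprise", "taxpayers"],
    left_lexicon=["workers", "redistribution"],
)

party_a = [0.5, 0.3]    # stand-in for a party embedding leaning right
party_b = [-0.5, 0.3]   # stand-in for a party embedding leaning left
score_a = project(party_a, economic_axis)
score_b = project(party_b, economic_axis)
```

Repeating the same steps with a second pair of lexicons gives the second coordinate needed for a two-dimensional plot.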
12 Figure 4 illustrates such a linear projection of party embeddings in a two-dimensional space for the US House. The neural network model is the same as that used in the previous subsection. The social dimension (y-axis) uses expressions such as civil rights and traditional values to represent left and right, respectively. For the economic dimension (x-axis), we use concepts related to workers and redistribution for the left, and expressions such as businesses, taxpayers and free enterprise for the right. Consistent with expectations, the figure suggests that the Republican party is located to the right on both the economic and social dimensions in recent decades. The Democrats, on the other hand, are both socially and economically on the left, or liberal end of the scale. We also assessed the validity of this second approach against gold standards and report the

12 This approach expands on a standard visualization technique for the analysis of word embeddings; for instance, a similar implementation is included in Google's TensorBoard tool.

Figure 4: Party Placement in a 2D Space using Customized Ideological Axes (US House). Each point is a party embedding labeled by party (Dem, Rep) and Congress year; x-axis: Economic Left-Right, y-axis: Social Left-Right.

results in the appendix. Overall, we find that the unguided method presented in the previous section achieves higher levels of accuracy than the one in which we predefine the axes. Of course, a researcher could easily test the entire vocabulary from a corpus and find combinations of expressions that maximize the accuracy with human-annotated benchmarks. Such an optimization technique, however, would defeat the purpose of developing a general approach that can be used across polities, free of arbitrary decisions.
Finally, we note that our models can be fitted with indicator variables for each member of a parliament, rather than parties. The same analytical steps apply, and the transposition to a different type of political actor is straightforward. A difference that researchers need to take into account is that a model fitted with embeddings for individual speakers has fewer training examples, which may affect accuracy and the choice of parameters. Due to space constraints, we discuss an example in the appendix, using the US Senate corpus. We find that the method applied to individual members is promising as a tool for political research, and our test based on the 114th Congress suggests that our predicted ideological placement for Senators is consistent with an extended set of individual-level data from external sources.

6 Example Application: Topic Similarities

After fitting a model of word and party embeddings, researchers can assess hypotheses regarding issues or topics prioritized by political actors in parliament. For instance, one might wish to assess the extent to which left-wing parties in Canada (the NDP) and the United States (Democrats) have embraced the issue of the environment over time. This question touches on a broader theoretical debate on the evolution of left-wing parties, and whether the issues that they prioritize have changed over time (see Kitschelt and Hellemans 1990). To implement such an analysis, we may either start with a custom list of words related to environmental issues or automatically retrieve expressions most similar to the keyword environment in the vector space using the word embeddings. We use the latter approach for illustration. Next, we compute the cosine similarity between the centroid of the embeddings for the topic words and a party embedding. A bootstrap estimator can be implemented by replicating the computation using random samples with replacement from the set of topic words. Figure 5 reports the similarity scores between left-leaning parties and a centroid vector for the issue of the environment over time. Consistent with the idea that the new left has embraced the issue of the environment, both trends are positive. The upward trend begins around the time of the conservation movement in the 1960s and continues through recent debates over climate change.

Figure 5: Semantic Similarity between Left-Wing Parties and the Issue of Environment. Panels: (a) Canada: New Democratic Party (x-axis: Parliament), (b) US House: Democrats (x-axis: Congress); y-axis: Cosine Similarity (Moving Average). Moving average of the cosine similarity between environment topic vector and party embeddings, with 95% bootstrap confidence intervals.
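The topic-similarity computation and its bootstrap can be sketched in a few lines. The text specifies resampling the topic words with replacement; the percentile interval, the number of replications, the fixed seed, and the toy vectors below are assumptions of this sketch rather than details taken from the paper.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def topic_similarity(party_vector, topic_vectors):
    """Cosine similarity between the topic centroid and a party embedding."""
    return cosine(centroid(topic_vectors), party_vector)

def bootstrap_interval(party_vector, topic_vectors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling topic words with replacement."""
    rng = random.Random(seed)
    n = len(topic_vectors)
    draws = []
    for _ in range(n_boot):
        sample = [topic_vectors[rng.randrange(n)] for _ in range(n)]
        draws.append(topic_similarity(party_vector, sample))
    draws.sort()
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy embeddings (hypothetical values) for four topic words and one party.
topic_vectors = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [1.0, 0.0]]
party_vector = [0.6, 0.4]

point = topic_similarity(party_vector, topic_vectors)
lower, upper = bootstrap_interval(party_vector, topic_vectors)
```

Running the same computation for each parliament or Congress, and then smoothing the resulting series, would produce a trend line with an uncertainty band of the kind shown in Figure 5.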
7 Example Application: Party Polarization

Finally, another benefit of having estimates of party placement in a vector space is the possibility of computing quantities of interest based on metrics for vector distances. An obvious application