Classification and Regression Approaches to Predicting United States Senate Elections. Rohan Sampath, Yue Teng


Abstract

The United States Senate is arguably the finest democratic institution for debate and deliberation in the world. It also provides a fascinating opportunity to use machine learning to understand the complex dynamics that determine the outcome of Senate elections.

Motivation

Are elections decided even before they begin? Can political fundamentals predict elections regardless of candidates and campaigns? Our goal is to get a bird's-eye, forward-looking view of Senate elections using data available well in advance. We believe that reliably predicting Senate elections well before they happen has significant implications for various stakeholders, since individual senators wield a tremendous amount of legislative power.

Introduction

We use:
- A modified version of the LMS algorithm, called discount-weighted least-mean squares, to predict the margin of victory in Senate elections.
- An ordinary Support Vector Machine classifier to predict the outcome of Senate elections.
- A Random Forest classifier, also to predict the outcome of Senate elections.

Data

Our data set consists of all biennial Senate elections held from 1998 to 2014; this data is publicly available.

Preprocessing: We preprocess the data to weed out elections where:
- There wasn't exactly one Republican and exactly one Democratic candidate.
- A third-party candidate either won the election or distorted it by winning more than 20% of the vote (i.e. was a significant player).

After preprocessing, we are left with 273 data points. (There were 300 regularly scheduled Senate elections in the period 1998-2014, of which 27 were eliminated in preprocessing.) The fundamental challenge we face is one of limited data: Senate elections are by their very nature scarce, with only around 33 held every two years.
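The preprocessing rules above can be sketched as a simple filter. The record format here, a list of (party, vote share) tuples per race, and the helper name are hypothetical illustrations, not the authors' actual data layout.

```python
# Sketch of the preprocessing filter described above. The (party, vote_share)
# record format and the helper name are hypothetical, not the authors' layout.

def keep_election(candidates):
    """candidates: list of (party, vote_share) tuples for one Senate race."""
    third = [c for c in candidates if c[0] not in ("R", "D")]
    # Rule 1: exactly one Republican and exactly one Democrat
    if sum(1 for c in candidates if c[0] == "R") != 1:
        return False
    if sum(1 for c in candidates if c[0] == "D") != 1:
        return False
    # Rule 2: no third-party winner, and no third-party share above 20%
    winner = max(candidates, key=lambda c: c[1])
    if winner in third or any(c[1] > 0.20 for c in third):
        return False
    return True

races = [
    [("R", 0.52), ("D", 0.48)],               # kept
    [("R", 0.40), ("D", 0.35), ("I", 0.25)],  # dropped: third party > 20%
    [("R", 0.45), ("R", 0.30), ("D", 0.25)],  # dropped: two Republicans
]
kept = [r for r in races if keep_election(r)]
print(len(kept))  # -> 1
```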
We therefore had to treat limited data as an inescapable constraint throughout this project.

Features

We use a feature vector of 17 features, comprising originally sourced features (such as margin of victory and unemployment rate) and derived features (such as the change in the unemployment rate over a period of time). The features are described below:
- Margin of victory in the Senate elections held six and twelve years previously. (Note: Senators serve six-year terms.)
- Margin of victory in the state in the last three Presidential elections.
- Presidential approval in the state.
- Annualized changes in Presidential approval in the state.
- Percent African-American population (extrapolated from the most recent Census).
- Percent Hispanic/Latino population (extrapolated from the most recent Census).
- Changes in the above demographic factors over time.
- Three-month average unemployment rate in the year before the election.

- 6-month, 12-month, 18-month and 24-month changes in the unemployment rate.*
- Partisan Voting Index (PVI) over the past three Presidential elections.**
- Change in the PVI from the second-last Presidential election to the last one.
- Median income in the state.
- Variation in the median income in the state.
- Indicator variable: whether the Republican candidate is the incumbent senator.
- Indicator variable: whether the Democratic candidate is the incumbent senator.
- Number of years of incumbency for the President.
- Indicator variable: whether the election was a midterm election or not.

Convention: In all cases, a result that favors the Republicans is recorded as positive, and vice versa. Example: a reduction in the unemployment rate during a Democratic President's term means the feature value is negative, since it is good for the Democrats.

Cross-Validation

We use cross-validation frequently throughout the project. Our reading of the literature suggested that a direct application of k-fold cross-validation is not appropriate for time-series data: it would not make sense to train on 2012 data, for example, and validate on a hold-out data point from before 2012. Hence, we use a modified version called forward chaining. For example, say we have a training set consisting of data from the years 2000, 2002, 2004 and 2006; we then design the folds as follows:
- Fold 1: train [2000], hold-out validation [2002]
- Fold 2: train [2000, 2002], hold-out validation [2004]
- Fold 3: train [2000, 2002, 2004], hold-out validation [2006]

Principal Component Analysis

Motivation: there are clear interdependencies between certain variables, for example the PVI and the previous Presidential election result in the state. To choose an appropriate subspace spanned by the first k principal components (for k ∈ {1, ..., 10}), and thus determine the k principal factors, we use the scree plot and the cumulative variance plot. (The plots below are for the 183 data points from 2002 to 2012; the first 10 principal components are shown.)

[Figure: scree plot (eigenvalues, in thousands, of components 1-10) and cumulative variance plot (rising from roughly 60% to 100% over components 1-10).]

* A reduction in the unemployment rate is recorded as positive under an incumbent Republican President and as negative under an incumbent Democratic President; vice versa for an increase.
** PVI of a state: on average, how much more Republican the state was in the last two Presidential elections, as compared to the nation as a whole.
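The forward-chaining folds described under Cross-Validation can be sketched as follows; the year labels match the example in the text.

```python
# Minimal sketch of forward-chaining cross-validation: each fold trains only
# on cycles strictly before the hold-out cycle, never on future data.

def forward_chaining_folds(years):
    """Return (train_years, holdout_year) pairs in chronological order."""
    return [(years[:i], years[i]) for i in range(1, len(years))]

folds = forward_chaining_folds([2000, 2002, 2004, 2006])
# folds == [([2000], 2002), ([2000, 2002], 2004), ([2000, 2002, 2004], 2006)]
for train, holdout in folds:
    print(f"train {train} -> validate {holdout}")
```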

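The scree / cumulative-variance selection of k can likewise be sketched. The data below is synthetic (standard-normal features with one near-duplicate column standing in for the PVI-style interdependence), not the paper's 183 feature vectors.

```python
# Minimal sketch of reading k off the scree and cumulative-variance criteria.
# Synthetic stand-in data, not the paper's feature matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(183, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=183)   # two interdependent features
Xc = X - X.mean(axis=0)                          # center before PCA

# Scree plot values: covariance eigenvalues, largest first
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
cum_var = np.cumsum(eigvals) / eigvals.sum()     # cumulative variance curve

# Smallest k whose components explain, say, 90% of the variance
k = int(np.searchsorted(cum_var, 0.90) + 1)
print(k, cum_var[k - 1])
```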
Support Vector Machine (SVM) Classification

We solve three classification problems using standard SVM classification: classifying 2014 after learning on 2002-2012, classifying 2012 after learning on 2000-2010, and classifying 2010 after learning on 1998-2008. The fundamental motivation behind the SVM is to carry out binary classification in a high-dimensional feature space efficiently, using the kernel trick (i.e. mapping the input data via a non-linear function). The SVM algorithm can perform this computation efficiently because it considers only a small number of training points, ignoring all training points that lie close (within a threshold $\varepsilon$) to the model prediction. The primal optimization problem is given by:

$$\min_{w,b,\xi,\xi^*} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)$$

$$\text{subject to} \quad \begin{cases} y^{(i)} - \langle w, x^{(i)}\rangle - b \le \varepsilon + \xi_i \\ \langle w, x^{(i)}\rangle + b - y^{(i)} \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases}$$

The norm $\|w\|^2$ measures the flatness of the proxy, and the constraints force the model to approximate all training points within an absolute margin $\varepsilon$. The slack variables $\xi_i, \xi_i^*$ allow for compliance with the $\varepsilon$-margin constraints and the flatness of the proxy, and $C$ is the penalty for violating the constraints. The corresponding dual problem is given by:

$$\max_{\alpha,\alpha^*} \quad -\frac{1}{2}\sum_{i,j=1}^{m}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x^{(i)}, x^{(j)}\rangle - \varepsilon\sum_{i=1}^{m}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y^{(i)}(\alpha_i - \alpha_i^*)$$

$$\text{subject to} \quad \sum_{i=1}^{m}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C$$

The dual optimization is convex and can easily be solved with optimization software. We use LIBSVM to implement SVM classification with a Gaussian kernel function.

Results for SVM Classification

Training data set:

Years trained upon | Training points | Correctly classified | Training error
1998-2008 | 183 | 179 | 2.19%
2000-2010 | 183 | 180 | 1.64%
2002-2012 | 182 | 178 | 2.20%

Test data set:

Year (training data from) | Test points | Correctly classified | Test error
2010 (1998-2008) | 30 | 27 | 10.00%
2012 (2000-2010) | 30 | 26 | 13.33%
2014 (2002-2012) | 30 | 27 | 10.00%

Discount-Weighted Least-Mean Squares Regression

Once again, we solve three regression problems, for the years 2010, 2012 and 2014. Given that the composition and voting intentions of a state evolve rapidly, we thought it would be beneficial to give less weight to earlier training data than to later data. The basic premise of this time-discounting algorithm, adapted from Harrison and Johnston [5], is to use a discount factor that conveys the rate of decay of the information content of an observation. The discount-weighted LMS algorithm had a lower generalization error than a standard LMS algorithm when forward-chaining cross-validation was used. We used a discount factor of the form:

$$\delta_t = \frac{2\alpha t}{2\alpha t + 1}\cdot\frac{2\alpha T + 1}{2\alpha T}$$

where $\delta_1$ is the discount factor for the earliest time period, $T$ is the number of time periods (i.e. $t = 1, \dots, T$), and $\alpha$ is a parameter that can be optimized. Clearly, $\delta_T$ is always equal to 1. Discount factors for various alphas are shown below:

[Figure: discount factors $\delta_t$ against time period $t = 1, \dots, 5$ (T = 5), for alpha = 1, 2, 3, 4; each curve rises toward 1 at $t = T$.]

Results for Discount-Weighted LMS

Training data set:

Years trained upon | Training points | Mean margin of error | Correctly classified | Training classification error
1998-2008 | 183 | 4.20% | 179 | 2.19%
2000-2010 | 183 | 4.16% | 180 | 1.64%
2002-2012 | 182 | 4.43% | 178 | 2.20%

Test data set:

Year (training data from) | Test points | Mean margin of error | Correctly classified | Test classification error
2010 (1998-2008) | 30 | 7.48% | 27 | 10.00%
2012 (2000-2010) | 30 | 6.52% | 26 | 13.33%
2014 (2002-2012) | 30 | 7.10% | 27 | 10.00%

Random Forests

We also implemented Random Forests classification on the original data set. Random forests use decision trees as the basic building block for prediction. A decision tree uses a tree-like graph of decisions to split the feature space into separate regions. Each data point falls into exactly one region, and in the case of classification the most common class in that region is the predicted class. Random forests use multiple decision trees; the reasoning is that this reduces the chance of overfitting to the data. Each tree is built on a separate dataset, where each dataset is sampled from the original distribution. However, since we do not know, or have access to, the original distribution, we build each dataset by sampling with replacement from the original dataset. This is known as bootstrap aggregation; we now have multiple decision trees, all fit to approximations of the original distribution. By using multiple trees we can lower the variance of the model at the cost of increasing the bias.
Although bootstrap aggregation helps reduce the variance of the model, it does not fix an important problem: the trees may all be highly correlated with each other. In that case it does not matter how many trees we average our predictions over; if every tree is exactly the same, the variance of the model will not decrease at all. To prevent highly similar trees, we consider only a random subset of the features at each split. Often the number of features considered, m, is much lower than p, the original number of predictors. There are two parameters to tune in random forests: B, the number of decision trees to create, and m, the number of predictors to consider at each split. Increasing B will prevent the model from overfitting, but may also prevent it from accurately capturing the relationship between the training data and the output. Increasing m will increase the chance of overfitting, but may allow a better fit to the training data. Appropriate choices for B and m can be selected using cross-validation. The choices that were optimal in our three tests hovered around B ≈ 100 and m ≈ p/1.7 ≈ 10.
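The tuning described above can be sketched with scikit-learn's RandomForestClassifier: B maps to n_estimators and m to max_features, selected here on a single most-recent holdout. The data, grid values, and split are illustrative assumptions, not the authors' pipeline.

```python
# Sketch of tuning B (n_estimators) and m (max_features) for a random forest
# on a chronological holdout. Synthetic stand-in data, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(213, 17))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic win/loss labels
X_train, y_train = X[:183], y[:183]              # earlier cycles
X_hold, y_hold = X[183:], y[183:]                # most recent cycle held out

best = None
for B in (50, 100, 200):                         # number of trees
    for m in (4, 10, 17):                        # features tried per split
        clf = RandomForestClassifier(n_estimators=B, max_features=m,
                                     random_state=0)
        acc = clf.fit(X_train, y_train).score(X_hold, y_hold)
        if best is None or acc > best[0]:
            best = (acc, B, m)
print(best)
```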

Results for Random Forests

Test data set:

Year (training data from) | Test points | Correctly classified | Test error
2010 (1998-2008) | 30 | 28 | 6.67%
2012 (2000-2010) | 30 | 27 | 10.00%
2014 (2002-2012) | 30 | 27 | 10.00%

Conclusions

Random Forests clearly works better than the SVM classifier when attempting binary classification with a small number of data points (and hence a high possibility of over-fitting). The average classification test error rate for Random Forests is 8.89%, while for the other two algorithms it is 11.11%. Most importantly, we conclude that we predicted the 2010, 2012 and 2014 Senate elections with a reasonable amount of accuracy using data that was mostly available at least two years in advance of those elections. That is, except for unemployment statistics (for which we can use forecasts), we have enough data to predict the 2016 election too, and we do just that in the Appendix!

While much attention is directed toward Presidential elections, individual Senators have tremendous power over legislation. We therefore believe that a bird's-eye estimate of what the Senate might look like two years in the future could be very useful to many stakeholders, such as:
- Stakeholders in key bills: If Senator X loses, will the AJKL bill fail in the next Congress?
- Lobbyists: Can the threat of being vulnerable help persuade Senator X to support Z?
- Speculators: Can I shape my investments with reasonable confidence in having a Republican/Democratic Senate two years from now?
- Party machinery: Senator Z is vulnerable; we must begin directing resources toward his/her campaign IMMEDIATELY.

And therein lies the practical utility of our exercise. We are excited that we were able to get reasonably good results with publicly available data and machine-learning approaches; clearly, elections can be predictable! We are eager to build on some of these approaches, especially Random Forests, and to explore new techniques as well.
Data Sources

All data is publicly available:
- Election results are sourced from the Federal Election Commission website (www.fec.gov)
- Unemployment rate statistics are sourced from the Bureau of Labor Statistics (www.bls.gov)
- Demographic statistics are sourced from the United States Census Bureau (www.census.gov)

References

[1] Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161.
[2] Basak, D., Pal, S., & Patranabis, D. C. (2007). Support vector regression. Neural Information Processing - Letters and Reviews, 11(10), 203-224.
[3] Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression.
[4] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. New York: Springer Series in Statistics.
[5] Harrison, P. J., & Johnston, F. R. (1984). Discount weighted regression. Journal of the Operational Research Society, 923-932.

Appendix: Our Prediction for the 2016 Senate Elections

The Republicans lose two seats, but hold on to the Senate, 52-48!