Classification and Regression Approaches to Predicting United States Senate Elections

Rohan Sampath, Yue Teng

Abstract

The United States Senate is arguably the finest democratic institution for debate and deliberation in the world. It also provides a fascinating opportunity to understand, using machine learning, the complex dynamics that determine the outcome of Senate elections.

Motivation

Are elections decided even before they begin? Can political fundamentals predict elections regardless of candidates and campaigns? Our goal is to get a bird's-eye, forward-looking view of Senate elections using data available well in advance. We believe that reliably predicting Senate elections well before they happen has significant implications for various stakeholders, since individual senators wield a tremendous amount of legislative power.

Introduction

We use:
- A modified version of the LMS algorithm, called the discount-weighted least-mean-squares algorithm, to predict the margin of victory in Senate elections.
- An ordinary Support Vector Machine classifier to predict the outcome of Senate elections.
- Random Forests, also to predict the outcome of Senate elections.

Data

Our data set consists of all biennial Senate elections held from 1998 to 2014; this data is publicly available.

Preprocessing: We preprocess the data to weed out elections where:
- There wasn't exactly one Republican and exactly one Democratic candidate.
- A third-party candidate either won the election or distorted it by winning more than 20% of the vote (i.e., the third-party candidate was a significant player).

After preprocessing, we are left with 273 data points. (There were 300 regularly scheduled Senate elections in the period 1998-2014, of which 27 were eliminated in preprocessing.) The fundamental challenge we face is one of limited data: Senate elections are, by their very nature, scarce; only around 33 happen every two years.
Therefore, we had to keep in mind that limited data was an inescapable part of this project.

Features

We use a feature vector of 71 features. These 71 features include originally sourced features (such as margin of victory, unemployment rate, etc.) and derived features (such as the change in the unemployment rate over a period of time). The features are described below:
- Margin of victory in the Senate elections held six and twelve years previously. (Note: Senators serve six-year terms.)
- Margin of victory in the state in the last three Presidential elections.
- Presidential approval in the state.
- Annualized changes in Presidential approval in the state.
- Percent African-American population (as extrapolated from the most recent Census).
- Percent Hispanic/Latino population (as extrapolated from the most recent Census).
- Changes in the above demographic factors over time.
- Three-month average unemployment rate in the year before the election.
- 6-month, 12-month, 18-month, and 24-month changes in the unemployment rate.¹
- Partisan Voting Index (PVI) over the past three Presidential elections.²
- Change in the PVI from the second-last Presidential election to the last one.
- Median income in the state.
- Variation in the median income in the state.
- Indicator variable: whether the Republican candidate is the incumbent senator.
- Indicator variable: whether the Democratic candidate is the incumbent senator.
- Number of years of incumbency for the President.
- Indicator variable: whether the election was a midterm election.

Convention: In all cases, a result favorable to the Republican is recorded as positive, and vice-versa. Example: a reduction of the unemployment rate during a Democratic President's term means that the feature value is negative, since it is good for the Democrats.

Cross-Validation

We use cross-validation frequently throughout the project. Our perusal of the literature suggested that a direct application of k-fold cross-validation was not appropriate for time-series data: it would not be appropriate to train on 2012 data, for example, and validate on a hold-out data point that happened before 2012! Hence, we use a modified version called forward chaining. For example, say we have a training set consisting of data from the years 2000, 2002, 2004, and 2006; we then design the folds as follows:
Fold 1: train [2000], hold-out validation [2002]
Fold 2: train [2000, 2002], hold-out validation [2004]
Fold 3: train [2000, 2002, 2004], hold-out validation [2006]

Principal Component Analysis

Motivation: there are clear interdependencies between certain variables: PVI and the previous U.S. Presidential election result, for example. In order to choose an appropriate k-dimensional subspace spanned by the first k principal components (for k ∈ {1, ..., 10}), and thus determine the k principal factors, we use the Scree Plot and the Cumulative Variance Plot. (The plots below are for the 183 data points from 2002 to 2012.
First 10 principal components are shown.)

[Figure: Scree Plot (eigenvalues, '000) and Cumulative Variance Plot for the first 10 factors/components.]

¹ A reduction is positive under an incumbent Republican President and negative under an incumbent Democratic President; vice-versa for an increase.
² PVI of a state: on average, how much more Republican the state was in the last two Presidential elections as compared to the nation as a whole.
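The forward-chaining scheme described in the Cross-Validation section is straightforward to implement; a minimal sketch (the function name is ours, not the paper's):

```python
def forward_chaining_folds(years):
    """Forward-chaining cross-validation: each fold trains on all
    earlier election years and validates on the next one, so the
    hold-out point never precedes the training data."""
    return [(years[:i], years[i]) for i in range(1, len(years))]

# Reproduces the folds from the 2000-2006 example in the text.
for k, (train, holdout) in enumerate(forward_chaining_folds([2000, 2002, 2004, 2006]), start=1):
    print(f"Fold {k}: train {train}, hold-out validation [{holdout}]")
```

Each fold's validation year strictly follows its training years, which is exactly the property that plain k-fold shuffling would violate.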
Support Vector Machine (SVM) Classification

We solve three classification problems using standard SVM classification: classifying 2014 after learning on 2002-2012, classifying 2012 after learning on 2000-2010, and classifying 2010 after learning on 1998-2008. The fundamental motivation behind the SVM is carrying out binary classification in a high-dimensional feature space efficiently, by using the kernel trick (i.e., by mapping input data via a non-linear function). The SVM algorithm can perform this computation efficiently because it considers a small number of training points and ignores all training points that are close (within a threshold ε) to the model prediction. The primal optimization problem is given by:

min_{w,b}  (1/2)‖w‖² + C Σ_{i=1}^{m} (ξ_i + ξ_i*)

subject to:  y^(i) − ⟨w, x^(i)⟩ − b ≤ ε + ξ_i
             ⟨w, x^(i)⟩ + b − y^(i) ≤ ε + ξ_i*
             ξ_i, ξ_i* ≥ 0

The norm ‖w‖² measures the flatness of the proxy, and the constraints force the model to approximate all training points within an absolute margin ε. The ξ_i, ξ_i* are slack variables that allow for compliance with the ε-margin constraints and the flatness of the proxy. C is the penalty for violating the constraints. The corresponding dual problem is given by:

max_{α,α*}  −(1/2) Σ_{i,j=1}^{m} (α_i − α_i*)(α_j − α_j*) ⟨x^(i), x^(j)⟩ − ε Σ_{i=1}^{m} (α_i + α_i*) + Σ_{i=1}^{m} y^(i)(α_i − α_i*)

subject to:  Σ_{i=1}^{m} (α_i − α_i*) = 0
             0 ≤ α_i, α_i* ≤ C

The dual optimization is convex and can easily be solved with optimization software. We use LIBSVM to implement SVM classification with a Gaussian kernel function.

Results for SVM Classification

(Training Data Set)
Years Trained Upon    Data Points    Correctly Classified    Training Error
1998-2008             183            179                     2.19%
2000-2010             183            180                     1.64%
2002-2012             182            178                     2.20%

(Test Data Set)
Year (Training Data)   Data Points    Correctly Classified    Test Error
2010 (1998-2008)       30             27                      10.00%
2012 (2000-2010)       30             26                      13.33%
2014 (2002-2012)       30             27                      10.00%

Discount-Weighted Least-Mean-Squares Regression

Once again, we solve three regression problems, for the years 2010, 2012, and 2014. Given that the composition and voting intentions of a state evolve rapidly, we thought it would be beneficial to give less weight to earlier training data as compared to later data.
The basic premise of this time-discount-rate algorithm, which has been adapted from Harrison and Johnston [5], is to use a discount factor that conveys the rate of decay of the information content of an observation. The discount-weighted LMS algorithm had a lower generalization error than a standard LMS algorithm when forward-chaining cross-validation was used. We used a discount factor of the form:
δ_t = (2^{αt} + 2^{αT}) / 2^{αT+1}

where δ_1 is the discount factor for the earliest time period and T is the number of time periods (i.e., t = 1, ..., T). α is a parameter that can be optimized. Clearly, δ_T is always equal to 1. Discount factors for various alphas are shown below:

[Figure: Discount factor (delta) against time period t, for α = 1, 2, 3, 4 (T = 5).]

Results for Discount-Weighted LMS

(Training Data Set)
Years Trained Upon    Data Points    Mean Margin of Error    Correctly Classified    Training Classification Error
1998-2008             183            4.20%                   179                     2.19%
2000-2010             183            4.16%                   180                     1.64%
2002-2012             182            4.43%                   178                     2.20%

(Test Data Set)
Year (Training Data)   Data Points    Mean Margin of Error    Correctly Classified    Test Classification Error
2010 (1998-2008)       30             7.48%                   27                      10.00%
2012 (2000-2010)       30             6.52%                   26                      13.33%
2014 (2002-2012)       30             7.10%                   27                      10.00%

Random Forests

We also implemented Random Forests classification on the original data set. Random forests use decision trees as the basic building block to enable prediction. A decision tree uses a tree-like graph or model of decisions to split up the feature space into separate regions. Each data point falls into exactly one region, and in the case of classification, the most common class in that region is the predicted class. Random forests use multiple decision trees, the reasoning being to reduce the chances of overfitting the data. Each tree is built on a separate dataset, where each dataset is sampled from the original distribution. However, since we do not know, or have access to, the original distribution, we build each dataset by sampling with replacement from the original dataset. This is known as bootstrap aggregation, since we now have multiple decision trees that are all fit to an approximation of the original distribution. By using multiple trees we can lower the variance of the model at the cost of increasing the bias.
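The resampling-and-voting idea behind bootstrap aggregation can be sketched in a few lines. This is an illustration only, not our pipeline: the "trees" here are one-split decision stumps fit by exhaustive search, and all names and data are invented.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split decision stump: exhaustively pick the (feature,
    threshold, orientation) with the fewest errors on (X, y)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for lo, hi in ((0, 1), (1, 0)):
                errs = sum(1 for x, yy in zip(X, y)
                           if (lo if x[j] <= t else hi) != yy)
                if best is None or errs < best[0]:
                    best = (errs, j, t, lo, hi)
    _, j, t, lo, hi = best
    return lambda x: lo if x[j] <= t else hi

def bagged_predict(X, y, X_test, n_trees=51, seed=0):
    """Bootstrap aggregation: fit each stump on a resample (with
    replacement) of the training set, then take a majority vote."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return [Counter(s(x) for s in stumps).most_common(1)[0][0] for x in X_test]
```

Any single resample can be unrepresentative (it may even miss a class entirely), but the majority vote across many resampled stumps is stable, which is the variance reduction described above.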
Although bootstrap aggregation helps to reduce the variance of the model, it does not fix an important problem: the trees may be highly correlated with each other. In that case, it does not matter how many trees we average our predictions over; if each tree is exactly the same, the variance of the model will not decrease at all. In order to prevent highly similar trees, we consider only a random subset of the features at each split. Often the number of features considered, m, is much lower than p, the original number of predictors. There are two parameters to tune over in random forests: B, the number of decision trees to create, and m, the number of predictors to consider at each split. Increasing B will prevent the model from overfitting, but may also prevent it from accurately capturing the relationship between the training data and the output. Increasing m will increase the chances of overfitting, but may allow a better fit to the training data. Appropriate choices for B and m can be selected by using cross-validation. The choices for B and m that were optimal in our three tests hovered around B ≈ 100 and m ≈ p/7 ≈ 10.
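For reference, B and m map directly onto the n_estimators and max_features parameters of scikit-learn's RandomForestClassifier (the paper does not say which implementation was used, and the two-cluster toy data below is invented):

```python
from sklearn.ensemble import RandomForestClassifier

# Invented stand-in for the election feature matrix: label 0 ~ Democratic
# win, 1 ~ Republican win, following the paper's sign convention.
X_train = [[0.0, 1.0], [0.5, 1.2], [0.2, 0.8], [0.4, 1.1],
           [5.0, 6.0], [5.5, 6.2], [5.2, 5.8], [5.4, 6.1]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# B -> n_estimators, m -> max_features; the tuned values in the text
# hovered around B ~ 100 and m ~ p/7.
clf = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0.3, 1.0], [5.3, 6.0]]))
```

In practice one would grid-search n_estimators and max_features under the forward-chaining splits rather than hard-coding them as done here.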
Results for Random Forests

(Test Data Set)
Year (Training Data)   Data Points    Correctly Classified    Test Error
2010 (1998-2008)       30             28                      6.67%
2012 (2000-2010)       30             27                      10.00%
2014 (2002-2012)       30             27                      10.00%

Conclusions

Random Forests clearly works better than the SVM classifier when attempting binary classification with a small number of data points (and hence a high possibility of over-fitting). The average classification test error rate for Random Forests is 8.9%, while for the other two algorithms it is 11.1%. Most importantly, we conclude that we predicted the 2010, 2012, and 2014 Senate elections with a reasonable amount of accuracy using data that was mostly available at least two years in advance of those elections. That is, except for unemployment statistics (for which we can use forecasts), we have enough data to predict the 2016 election too (we do just that in the Appendix)!

While a lot of attention is directed towards Presidential elections, individual Senators have tremendous power over legislation. Therefore, we believe that having a bird's-eye estimate of what the Senate might shape up to be two years in the future could be very useful for a lot of stakeholders, such as:
- Stakeholders in key bills: If Senator X loses, will the AJKL bill fail in the next Congress?
- Lobbyists: Can the threat of being vulnerable help persuade Senator X to support Z?
- Speculators: Can I shape my investments with a reasonable amount of confidence in having a Republican/Democratic Senate two years from now?
- Party machinery: Senator Z is vulnerable; we must begin directing resources towards his/her campaign IMMEDIATELY.

And therein lies the practical utility of our exercise. We're excited that we were able to get reasonably good results with publicly available data and machine-learning approaches; clearly, elections can be predictable! We're eager to build on some of these approaches, especially Random Forests, and to explore new techniques as well.
Data Sources

All data is publicly available:
- Election results are sourced from the Federal Election Commission website (www.fec.gov)
- Unemployment rate statistics are sourced from the Bureau of Labor Statistics (www.bls.gov)
- Demographic statistics are sourced from the United States Census Bureau (www.census.gov)

References

[1] Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161.
[2] Basak, D., Pal, S., & Patranabis, D. C. (2007). Support vector regression. Neural Information Processing - Letters and Reviews, 11(10), 203-224.
[3] Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.
[4] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. Vol. 1. New York: Springer Series in Statistics.
[5] Harrison, P. J., & Johnston, F. R. (1984). Discount weighted regression. Journal of the Operational Research Society, 35(10), 923-932.
Appendix: Our Prediction for the 2016 Senate Elections

The Republicans lose two seats, but hold on to the Senate, 52-48!