Lab 3: Logistic regression models

In this lab, we will apply logistic regression models to United States (US) presidential election data. The main goal is to predict the outcome of the presidential election in each state from election polls and historical election results. We will use the polling data collected in 2008 and 2012, together with the actual 2008 election outcomes, to predict the outcomes of the 2012 election. It might be more interesting to predict the outcome of the 2016 presidential election. However, because of limited data availability and the current stage of that election, we will not consider the 2016 presidential election in today's lab.

The US presidential election is held every four years on the Tuesday after the first Monday in November. The 2016 presidential election is scheduled for Nov 8, 2016. The 2008 and 2012 elections were held on Nov 4, 2008 and Nov 6, 2012, respectively.

The President of the US is not elected directly by popular vote. Instead, the President is elected by electors who are selected by popular vote on a state-by-state basis. These electors then cast direct votes for the President. In all states except Maine and Nebraska, electors are selected on a winner-take-all basis. That is, all of a state's electoral votes go to the presidential candidate who wins the most popular votes in that state. For simplicity, we will assume in this lab that all states use the winner-take-all principle. The number of electors in each state equals the number of its members of Congress (House representatives plus senators). Currently, there are 538 electors in total, corresponding to 435 House representatives, 100 senators, and 3 electors for the District of Columbia. A presidential candidate who receives an absolute majority of electoral votes (at least 270) is elected President.

For simplicity, our data analysis considers only the two major political parties: Democratic (Dem) and Republican (Rep). The interest is to predict which party (Dem or Rep) will win the most votes in each state. Because the chance that a third party (any party other than Dem and Rep) receives an electoral vote is very small, this simplification is reasonable.

Predicting the outcomes of presidential elections is of great interest to many people. In the past, predictions were typically made by political analysts and pundits based on their personal experience, intuition, and preferences. In recent decades, however, statistical methods have been widely used to predict election results. Notably, the statistician Nate Silver correctly predicted the outcome in every state in 2012, after correctly calling 49 of the 50 states in 2008. In today's lab, we will compare a simplified version of his method with our method built on logistic regression models.

Data sets

The following data sets are available for our data analysis:

1) Polling data from the 2008 US presidential election (2008-polls.csv);
2) Election results from the 2008 US presidential election (2008-results.csv);
3) Polling data from the 2012 US presidential election (2012-polls.csv);
4) Election results from the 2012 US presidential election (2012-results.csv).

Data sets 1) and 2) will be used for training, that is, to build the logistic regression models. Data set 3) will be used for prediction. Data set 4) is provided for validation, so that we can check whether our predictions are correct.

Both polling data sets, 1) and 3), contain five columns. The first column is the state abbreviation (SA). The second and third columns are the percentages of votes for the Democratic and Republican candidates, respectively. The fourth column is the date on which the poll was conducted. The last column is the name of the polling institution (pollster).

Election polls

Our predictions will be based on election polls. An election poll is a survey that asks a small sample of voters about their voting plans. If the survey is conducted appropriately, the sample should be representative of the voting population at large. However, it is very challenging to obtain a representative sample, because a good sampling strategy needs to consider many factors (e.g., sampling time, locations, methods). Therefore, a single poll's prediction could be biased, and the prediction accuracy could be improved by combining multiple polls.

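As a rough illustration of the five-column layout and of combining multiple polls, the following Python sketch averages the polls of each state. The lab itself is carried out in R; the header names (`State`, `Dem`, `Rep`, `Date`, `Pollster`), the pollster names, and all numbers below are made up for illustration and are not taken from the lab's CSV files.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the five-column layout described above.
# The header names are assumptions; the real CSV files may use different ones.
SAMPLE_POLLS = """State,Dem,Rep,Date,Pollster
FL,49,47,Oct 30 2008,Pollster A
FL,46,48,Oct 12 2008,Pollster B
FL,50,46,Nov 01 2008,Pollster C
MO,47,49,Oct 28 2008,Pollster A
"""


def average_polls(csv_text):
    """Return {state: (mean Dem %, mean Rep %)} over all polls of that state."""
    by_state = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_state[row["State"]].append((float(row["Dem"]), float(row["Rep"])))
    averages = {}
    for state, polls in by_state.items():
        dem = sum(p[0] for p in polls) / len(polls)
        rep = sum(p[1] for p in polls) / len(polls)
        averages[state] = (dem, rep)
    return averages


averages = average_polls(SAMPLE_POLLS)
for state, (dem, rep) in sorted(averages.items()):
    leader = "Dem" if dem > rep else "Rep"
    print(f"{state}: Dem {dem:.1f}%, Rep {rep:.1f}%, leader {leader}")
```

In this toy example, one of the three Florida polls favors the Republican candidate, but the unweighted average still favors the Democratic candidate. This is the sense in which combining polls can dampen the effect of a single unrepresentative sample.
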
Many factors may affect the prediction accuracy of election polls. Based on the available data sets, we consider the following three factors.

1. Sampling time. A poll conducted long before the election date is likely to be less accurate than a poll conducted closer to the election date. Many events can change voters' opinions about the presidential candidates, so the longer the time until the election, the more likely voters are to change their voting plans.

2. Pollsters. Systematic biases can occur if a flawed sampling method is used. For example, if a pollster collects samples only through the Internet, the sample is biased because it includes only people who have Internet access. Each pollster uses different methods for sampling voters, and some sampling schemes may be better than others. Therefore, some pollsters' predictions are very likely more reliable than others, and we should not give equal weight to every poll.

3. State edges. The state edge is the difference between the Democratic and Republican popular vote percentages (based on the polls) in that state. For instance, if the Democratic candidate receives 55% of the votes and the Republican candidate receives 45% of the votes, then the Democratic edge is 10 percentage points. If the state edge is small, the prediction accuracy of a poll is more likely to be affected by sampling error. If the state edge is large, the prediction accuracy is less likely to be affected by sampling error.

Silver's approach

Nate Silver's algorithm is described in detail on the FiveThirtyEight blog (http://fivethirtyeight.blogs.nytimes.com/methodology/?_r=0). The key idea of his algorithm is to smooth (average) the results of different polls using a weighted average. Silver's algorithm weights each pollster according to its prediction accuracy in previous elections, so that more biased pollsters receive less weight. In the following, we briefly describe the general structure of Silver's algorithm.

1. Calculate the average error of each pollster's predictions in previous elections. This is known as the pollster's rank. A smaller rank indicates a more accurate pollster.

2. Transform each rank into a weight. In this lab, we simply set the weight to one over the square of the rank. Silver's algorithm considers a number of additional factors in computing a weight, but that information is not available in our data sets.

3. For each state, compute a weighted average of the predictions made by the pollsters. This predicts the winner in that state.

In this lab, we will compare our method based on logistic regression models with Silver's approach in predicting the presidential election winner in each state. To this end, please answer the following questions.

Q1. Read the data sets 2008-polls.csv, 2012-polls.csv, and 2008-results.csv into R. To simplify our data analysis, we focus on subsets of the available data sets. We select the subsets based on pollsters, because not all pollsters conducted polls in every state. Specifically, use R to find the pollsters that conducted at least five polls in both the 2008 and the 2012 polling data sets, 1) and 3). Then create subsets of the 2008 and 2012 polling data sets that contain only the polls conducted by the selected pollsters.

Q2. To perform logistic regression, we need to define three new variables using the data sets created in Q1. First, define a binary response variable (Resp) that indicates whether the prediction given by a poll is correct. If the prediction is correct, Resp is 1; otherwise it is 0. To check whether the prediction given by each poll is correct, you could first find the winner predicted by the poll for its state, and then compare it with the actual winner in the data set 2008-results.csv. Second, define the state edge according to the definition given above. Finally, compute the number of days between the sampling time (polling date) and the 2008 presidential election date (the lag time). The 2008 presidential election date is Nov 4, 2008.
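
The three Q2 variables can be sketched for a single poll as follows. This is a minimal Python illustration of the arithmetic only, since the lab itself is to be done in R. The function name `poll_features`, the tie rule, and the example poll are assumptions made for illustration. The text also leaves open whether the signed difference or its absolute value should serve as the predictor; the sketch returns the signed value.

```python
from datetime import date

ELECTION_DAY_2008 = date(2008, 11, 4)  # election date stated in the lab


def poll_features(dem_pct, rep_pct, poll_date, actual_winner):
    """Return (state edge, lag time in days, Resp) for a single poll.

    actual_winner is "Dem" or "Rep", as it would be read from the results file.
    """
    edge = dem_pct - rep_pct  # Democratic edge in percentage points
    lag_days = (ELECTION_DAY_2008 - poll_date).days
    # Assumption: a tied poll is counted as predicting Rep.
    predicted_winner = "Dem" if edge > 0 else "Rep"
    resp = 1 if predicted_winner == actual_winner else 0
    return edge, lag_days, resp


# Hypothetical poll: Dem 55%, Rep 45%, taken on Oct 5, 2008,
# in a state that the Democratic candidate actually won.
edge, lag_days, resp = poll_features(55.0, 45.0, date(2008, 10, 5), "Dem")
print(edge, lag_days, resp)  # 10.0 30 1
```

Applying the same function to every row of the 2008 polling subset yields the columns needed for the logistic regression.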

Combine the variables defined above (Resp, state edge, and lag time) with the state names and pollsters into a new data set.

Q3. In the data set created in Q2, you might find that the responses (Resp) for some states are all equal to 1. For these states, prediction is relatively easy. Therefore, we will focus on the states that are relatively difficult to predict. Select the states whose responses (Resp) contain at least one 0. Then find the corresponding subsets of the polling data sets for the selected states.

Q4. Now fit a logistic regression model using the data set created in Q3. In the model, use Resp as the binary response variable, SA and pollster as categorical predictors, and the two variables defined in Q2, lag time and state edge, as additional predictors. Based on the fitted model, which predictors are significantly associated with Resp? Also conduct a hypothesis test to examine whether the categorical variable SA is significant.

Q5. Refit the logistic regression model in Q4 without the categorical variable SA. Compare this model with the model fitted in Q4. Which one is better?

Q6. For prediction, we need to define the same new variables, state edge and lag time, for the 2012 polling data set. The definitions are the same as those described in Q2. When computing the lag time, note that the 2012 presidential election date is Nov 6, 2012. Then create a new data set containing these two variables for the polls conducted by the pollsters selected in Q1 in the states selected in Q3. Based on the logistic regression models fitted in Q4 and Q5, predict the mean of the response variable (Resp) for the data set just created. The mean of Resp is the probability that Resp=1 (the success probability). Predict the success probability of each poll for the following states: FL, MI, MO, and CO.

Q7. In this question, we will predict the winner in each state (FL, MI, MO, and CO) using the predictions from Q6. To be concrete, define the winner indicator as 1 (WIND=1) if the Democratic candidate is the winner, and as 0 otherwise. From Q6, we know the probability that a poll predicts the winner correctly (i.e., Resp=1). Note that Resp=1 if the variable WIND based on the polling data is the same as the variable WIND based on the actual election data. We then use the average probability of WIND=1 to predict the probability that Dem wins the election, and the average probability of WIND=0 to predict the probability that Rep wins the election. The average is taken over the predicted probabilities of all polls conducted in that state by the different pollsters. Make the predictions using both models, from Q4 and Q5. Compare your predictions with the actual election results in the data file 2012-results.csv. What are your conclusions about the accuracy of your predictions?

Q8. Construct 95% prediction intervals for the average probabilities predicted in Q7.

Q9. Finally, apply Silver's approach to the data sets created in Q3 and Q6 to predict the winners in the states considered in Q6 (namely FL, MI, MO, and CO). Compare the accuracy of the predictions from Silver's approach with that of our approach.
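
The averaging in Q7 and the simplified Silver weighting in Q9 can be sketched in Python as follows (the lab itself is to be done in R). The sketch assumes one reading of Q7: a poll that favors Dem contributes its fitted probability of Resp=1 as its probability of WIND=1, and a poll that favors Rep contributes one minus that probability. The function names, the pollster labels, the ranks, and all numbers are hypothetical and not taken from the lab's data. Ranks are assumed to be strictly positive so that the weight one over rank squared is defined.

```python
def dem_win_probability(polls):
    """Q7 averaging: polls is a list of (poll_favors_dem, p_correct) pairs,
    where p_correct is the fitted probability that Resp = 1 for that poll."""
    # If a poll favors Dem, Dem wins exactly when the poll is correct: p.
    # If a poll favors Rep, Dem wins exactly when the poll is wrong: 1 - p.
    probs = [p if favors_dem else 1.0 - p for favors_dem, p in polls]
    return sum(probs) / len(probs)


def silver_weighted_edge(polls, ranks):
    """Simplified Silver step: polls is a list of (pollster, edge) pairs and
    ranks maps each pollster to its average past error (smaller is better)."""
    # Weight = 1 / rank^2, as defined in the lab (assumes rank > 0).
    weights = [1.0 / ranks[pollster] ** 2 for pollster, _ in polls]
    weighted_sum = sum(w * edge for w, (_, edge) in zip(weights, polls))
    return weighted_sum / sum(weights)


# Hypothetical numbers for one state, not taken from the lab's data files.
p_dem = dem_win_probability([(True, 0.8), (True, 0.6), (False, 0.7)])
edge = silver_weighted_edge([("A", 4.0), ("B", -2.0)], {"A": 1.0, "B": 2.0})
print(f"P(Dem wins) = {p_dem:.3f}, P(Rep wins) = {1 - p_dem:.3f}")
print(f"Weighted Dem edge = {edge:.2f} -> {'Dem' if edge > 0 else 'Rep'}")
```

In this toy example, pollster A has half the past error of pollster B and therefore receives four times the weight, so the weighted edge stays positive even though pollster B's poll favors Rep.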