
VoteCastr methodology

Introduction

Going into Election Day, we will have a fairly good idea of which candidate would win each state if everyone voted. However, not everyone votes. The levels of enthusiasm among the candidates' respective bases and the strength of their turnout operations will have a huge impact on the final outcome. Traditionally, turnout rates are unknown until after the polls close. This year, however, VoteCastr will be tracking turnout in battleground states throughout Election Day. This information, combined with microtargeting models and early vote reports, will allow us to make predictions about who is winning or losing the turnout battle. (Terms in italics are defined in the terminology appendix.)

The process can be thought of as similar to solving a Sudoku or a crossword puzzle. Each piece of the puzzle that we fill in helps us fill in others:

- Turnout observations during the day allow us to calculate the projected end-of-day turnout.
- Projected end-of-day turnout in precincts with turnout observations allows us to calculate the average turnout rate in different categories and subcategories of precincts.
- The average turnout rates in each of the categories and subcategories of precincts allow us to calculate the extrapolated end-of-day turnout in precincts where we do not have turnout observations.
- The extrapolated end-of-day turnout, combined with candidate support microtargeting models, allows us to calculate each precinct's extrapolated vote for Clinton, Trump, Johnson, and Stein, based on that turnout level.
- The extrapolated votes will be aggregated to the state level to determine which candidate has the advantage in terms of turnout.
Methodology Stages

- Microtargeting models
- Validating the microtargeting models
- Early vote tracking
- Precinct turnout reports
- Reasons why a field worker might not be able to get turnout reports from their assigned precinct
- Projected end-of-day turnout
- Expected vote
- Projected end-of-day turnout as a percent of expected vote
- Category and subcategory average turnout as a percent of expected
- Extrapolated end-of-day turnout
- Expected Clinton, Trump, Stein, and Johnson vote based on extrapolated turnout
- Summary statistics

Microtargeting models

Microtargeting works by taking information known about a sample of voters, combining it with demographics and commercial marketing data, and using that information to build statistical or machine learning models that then predict that information about every other voter. In this case, the information we have about a sample of voters will come from telephone surveys and past turnout information.

In the telephone surveys, we will ask a random sample of 10,000 voters how likely it is that they will vote and whom they are supporting for president and for U.S. Senate. The bulk of the surveys will be done through automated calls to landlines, but a smaller sample of live calls will be made to cellphones in order to reach the necessary number of younger, cellphone-only voters. We will use these survey responses to build models that predict how voters who were not called would have answered the survey had we been able to reach them.

There are a large number of algorithms that we typically use for modeling projects. In this case, we expect to use a blend of penalized logistic regression and random forests. These models will then be scored on the full voter file. That means that every voter will get a number between 0 and 100 giving the percent likelihood that he or she will support each of the top four candidates.

The process for modeling turnout is similar, although in addition to self-reported turnout likelihood from the phone surveys, we will also use past turnout history from the 2012 election. We use 2012 because that was the last presidential election, and turnout is traditionally higher in presidential years than in off-year elections like 2014. These vote-history-based models give us predictions of how likely it is that someone would have voted in 2012 had they been eligible. Using these models, we will have predicted turnout likelihood for all voters, even those who had not yet turned 18 in 2012 or who had not moved into the state until after that election.
Validating the microtargeting models

Both the candidate support models and the turnout models are built using two-thirds of the available survey or vote history data. The other one-third, selected at random, is held out as a test set and is not used in the construction of the models. Once the models have been built and scored, the model predictions are compared to the actual test-set survey responses and vote history to make sure that the model is accurately predicting the behavior of voters whose responses were not used to build the models.

We use a large number of model validation metrics, including the F1 score and area under the ROC curve. As a more easily understood metric, we will also calculate the expected margin of error for the average-sized precinct in each state. To do this, we will randomly select samples of test-set IDs equal to the number of voters in the average-sized precinct, then compare the test-set responses for those IDs to the candidate support model predictions for each one of these groups. We will run 1,000 simulations per state to calculate these margins of error.

Early vote tracking

Local election officials collect and report information about who has voted early. This information is compiled by the voter-file vendor L2 and provided to us the weekend before the election. We will know who has already voted, and from the microtargeting models we will know for whom they most likely voted. Because we know who has voted early, we will remove those voters from the pool of potential Election Day voters in each precinct.
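The precinct-level margin-of-error simulation described under model validation can be sketched roughly as follows. This is a minimal illustration, not VoteCastr's actual code: the function name, the input layout (pairs of model score and survey response), and the choice of the 95th percentile of absolute error as the "margin of error" are all assumptions, since the text does not define the statistic precisely.

```python
import random


def precinct_margin_of_error(test_set, precinct_size, n_sims=1000, seed=0):
    """Simulate precinct-sized samples of held-out voters.

    test_set: list of (support_score, response) pairs, where support_score is
        the model's 0-100 score for a candidate and response is 1 if the
        surveyed voter actually supported that candidate, else 0.
    precinct_size: number of voters in the average-sized precinct
        (must not exceed len(test_set)).
    Returns the 95th percentile of |predicted share - actual share|
    (an assumed definition of the margin of error).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_sims):
        # Draw one simulated "precinct" without replacement.
        sample = rng.sample(test_set, precinct_size)
        predicted_share = sum(score for score, _ in sample) / 100 / precinct_size
        actual_share = sum(resp for _, resp in sample) / precinct_size
        errors.append(abs(predicted_share - actual_share))
    errors.sort()
    return errors[int(0.95 * (n_sims - 1))]
```

A model whose scores match the held-out responses exactly yields a margin of zero, and the margin grows as the scores drift away from what test-set voters actually said.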

Precinct turnout reports

Field workers are assigned specific precincts to monitor. These precincts are selected so as to give us coverage of base Clinton and Trump areas, key demographic groups, and major geographies in each state. If for some reason a field worker is not able to get turnout reports for their assigned precinct, they will have been given backup precincts with similar demographics. (See below for possible reasons why a field worker might not be able to obtain reports from their assigned precinct.) Field workers collect total turnout frequently throughout the day and report those numbers via an automated touch-tone reporting system. That information is then fed in real time to the VoteCastr team.

Reasons why a field worker might not be able to get turnout reports from their assigned precinct

There are a number of reasons why reliable reports might not be available from a specific precinct.

- Poll workers refuse to allow field workers access. In theory, field workers have the right to observe polling locations, but rather than spend time on legal challenges, a field worker who is barred from a polling location will be directed to move on to a backup precinct.
- Poll workers are too busy to provide turnout numbers. Poll workers may be willing to have our field workers at their polling location, but if lines of voters are long, they might not be willing to assist with reporting the number of votes cast. In these cases, field workers will be instructed to first attempt to get a voter to report his or her voter number, and failing that, to move on to a backup precinct.
- Precinct has multiple lines, each with its own number range. A precinct might have one line for voters with last names A through M using voter numbers 1 through 1,000, and a second line for voters N through Z using numbers 1,001 through 2,000. In this case, getting the voter number from a voter in either line will give a misleading turnout number.
If the poll workers are helpful and competent, they may be able to determine the actual number of votes cast, but if not, the field worker will be instructed to move on to a backup precinct.

Projected end-of-day turnout

When we get turnout observations, they will include a time stamp telling us when the observation was collected. For each precinct in a state, we will know the poll opening and closing times. This will allow us to calculate the percentage of the day that has transpired at the time of the observation. We will use this percentage to calculate the projected end-of-day turnout. For example, if a turnout observation was collected 10 percent of the way into the voting day, then we would multiply that observation by 10 to calculate the projected end-of-day turnout.

NOTE: The rate of voting is not actually steady throughout the day. Generally speaking, there are surges of turnout before 9 a.m., over the lunch hour, and after 5 p.m. However, those patterns are less distinct in areas with large numbers of unemployed voters, students, retirees, or people working nontraditional hours. There is not enough historical data to definitively define the turnout pattern in each precinct, and making assumptions, such as saying that Clinton-supporting student areas or Trump-supporting areas with large numbers of unemployed white voters will turn out in a certain way, risks skewing the results. So, in the interest of transparency, we are treating all precincts alike. This will probably lead to apparently high turnout first thing in the morning that then begins to taper off as more post-9 a.m. reports come in.

Expected vote

In each precinct, we will have calculated an expected vote. This is based on the turnout scores and early vote numbers. Voters who have voted early will be suppressed from the voter file for each precinct, then the turnout scores divided by 100 will be summed. The turnout scores will have been adjusted to match the expected statewide turnout. Expected statewide turnout is calculated based on 2012 turnout adjusted for population growth.

Projected end-of-day turnout as a percent of expected vote

Once we have calculated the projected end-of-day vote, we will be able to calculate the projected end-of-day turnout as a percentage of the precinct's expected vote in each precinct with turnout observations.

Defining precinct categories and sub-categories

Each precinct in a state will be defined as belonging to a broad category, and then to a sub-category.

Categories:

- Base Clinton: Clinton expected to get 60%+ of the 2-way vote (Clinton 2-way >= 60)
- Lean Clinton: Clinton expected to get between 55% and 60% of the 2-way vote (Clinton 2-way >= 55 and < 60)
- Swing: Clinton expected to get between 45% and 55% of the 2-way vote (Clinton 2-way >= 45 and < 55)
- Lean Trump: Trump expected to get between 55% and 60% of the 2-way vote (Trump 2-way >= 55 and < 60)
- Base Trump: Trump expected to get 60%+ of the 2-way vote (Trump 2-way >= 60)

Subcategories:

Each category is sub-divided based on race. The racial categories are:

- Majority African American
- Majority Hispanic
- Majority White
- Mixed-race (no one group > 50 percent)
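The calculations in the preceding sections (projected end-of-day turnout, expected vote, turnout as a percent of expected, and precinct classification) can be illustrated with a short sketch. All function names and input layouts here are hypothetical. Two details are assumptions rather than statements from the methodology: the listed ranges overlap when Clinton's 2-way share is exactly 45 percent, and this sketch resolves that case as Swing by checking the categories in the order listed; and the turnout scores passed in are assumed to have already been adjusted to the expected statewide turnout.

```python
from datetime import datetime


def projected_end_of_day_turnout(observed, observed_at, polls_open, polls_close):
    """Scale an observed count by the share of the voting day elapsed.

    Assumes a steady rate of voting, as the methodology does.
    """
    fraction_elapsed = (observed_at - polls_open) / (polls_close - polls_open)
    if not 0 < fraction_elapsed <= 1:
        raise ValueError("observation must fall within polling hours")
    return observed / fraction_elapsed


def expected_vote(voters):
    """Sum turnout scores / 100 over voters who have not voted early.

    voters: iterable of (turnout_score_0_to_100, voted_early) pairs.
    """
    return sum(score / 100 for score, voted_early in voters if not voted_early)


def percent_of_expected(projected_turnout, expected):
    """Projected end-of-day turnout as a percent of the expected vote."""
    return 100 * projected_turnout / expected


def precinct_category(clinton_two_way_pct):
    """Assign the broad category from Clinton's expected 2-way share (0-100)."""
    trump_two_way_pct = 100 - clinton_two_way_pct
    if clinton_two_way_pct >= 60:
        return "Base Clinton"
    if clinton_two_way_pct >= 55:
        return "Lean Clinton"
    if clinton_two_way_pct >= 45:  # tie at exactly 45 resolved as Swing (assumption)
        return "Swing"
    if trump_two_way_pct >= 60:
        return "Base Trump"
    return "Lean Trump"


def precinct_subcategory(pct_african_american, pct_hispanic, pct_white):
    """Assign the racial subcategory; 'majority' means strictly over 50 percent."""
    if pct_african_american > 50:
        return "Majority African American"
    if pct_hispanic > 50:
        return "Majority Hispanic"
    if pct_white > 50:
        return "Majority White"
    return "Mixed-race"


if __name__ == "__main__":
    polls_open = datetime(2016, 11, 8, 7, 0)
    polls_close = datetime(2016, 11, 8, 20, 0)
    projected = projected_end_of_day_turnout(
        300, datetime(2016, 11, 8, 10, 15), polls_open, polls_close
    )
    print(projected, percent_of_expected(projected, 1000))
    print(precinct_category(52), precinct_subcategory(20, 15, 60))
```

Because the projection assumes a constant voting rate, an early-morning observation taken during the pre-work surge will overstate the end-of-day figure, which is the effect the NOTE above anticipates.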

There are 20 possible subcategories (5 categories times 4 racial types), for example:

- Base Clinton, majority African American
- Base Clinton, majority Hispanic
- Base Clinton, majority White
- Base Clinton, mixed-race
- Lean Clinton, majority African American

Not all subcategories will exist in every state. For example, it is extremely unlikely that there will be any base Trump, majority African American precincts.

Category and subcategory average turnout as a percent of expected

Whenever a precinct's turnout as a percent of expected is updated, we will calculate the average turnout as a percent of expected for all precincts in that category and for all precincts in the subcategory. We will also record the total number of precincts reporting turnout for each category and subcategory.

Extrapolated end-of-day turnout

For precincts where we do not have turnout observations, we will use the average from other precincts in the same category or subcategory to fill in the extrapolated end-of-day turnout. Each precinct belongs to one subcategory and one category. We will attempt to calculate its extrapolated end-of-day turnout first using the subcategory, if five or more precincts in that subcategory have reported turnout. Otherwise, we will use the average from the precinct category. Once we have calculated the extrapolated end-of-day turnout as a percent of expected for each precinct, we will then multiply that number, divided by 100, by the precinct's expected vote in order to calculate the extrapolated end-of-day turnout.

Expected Clinton, Trump, Stein, and Johnson vote based on extrapolated turnout

We will have microtargeting models giving the likelihood of each voter in each precinct supporting Clinton, Trump, Johnson, or Stein. If all voters in a precinct turned out, the vote for each of those candidates would be the sum of their candidate support scores divided by 100. However, we know that not all voters are going to vote, and furthermore that not all voters are equally likely to turn out.
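The fallback rule from the extrapolation section above (use the subcategory average if five or more precincts in that subcategory have reported, otherwise the category average) can be sketched as follows. The dictionary keys and function name are hypothetical, and returning None when no precinct in the whole category has reported is an assumption, since the text does not cover that case.

```python
from collections import defaultdict

MIN_SUBCATEGORY_REPORTS = 5  # threshold stated in the methodology


def extrapolate_turnout(precincts):
    """Fill in end-of-day turnout for precincts without observations.

    precincts: list of dicts with keys
        "id", "category", "subcategory", "expected_vote", and
        "pct_of_expected" (projected turnout as a percent of expected,
        or None if the precinct has no turnout observation).
    Returns {precinct id: extrapolated end-of-day turnout} for the
    unobserved precincts only.
    """
    by_category = defaultdict(list)
    by_subcategory = defaultdict(list)
    for p in precincts:
        if p["pct_of_expected"] is not None:
            by_category[p["category"]].append(p["pct_of_expected"])
            by_subcategory[(p["category"], p["subcategory"])].append(
                p["pct_of_expected"]
            )

    extrapolated = {}
    for p in precincts:
        if p["pct_of_expected"] is not None:
            continue  # observed precincts use their own projection
        sub_reports = by_subcategory.get((p["category"], p["subcategory"]), [])
        cat_reports = by_category.get(p["category"], [])
        if len(sub_reports) >= MIN_SUBCATEGORY_REPORTS:
            avg_pct = sum(sub_reports) / len(sub_reports)
        elif cat_reports:
            avg_pct = sum(cat_reports) / len(cat_reports)
        else:
            extrapolated[p["id"]] = None  # no reports at all (assumed handling)
            continue
        # Percent of expected, divided by 100, times the precinct's expected vote.
        extrapolated[p["id"]] = avg_pct / 100 * p["expected_vote"]
    return extrapolated
```

Since the averages depend on which precincts have reported so far, the function would be re-run each time a new observation arrives.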
In each precinct we can rank the voters from most to least likely to vote. The candidate support scores for the voter most likely to vote are likely to be different from those for the second most likely voter, and so on down through the least likely voter. Prior to the election, we will calculate the 2-way Clinton vs. Trump support scores at each turnout level, from one vote through every voter in the precinct turning out. Once that is done, we will run a linear regression model predicting Clinton and Trump support based on percent of registered voters turning out. The formula for each precinct will be saved and used on Election Day to calculate the expected Clinton, Trump, Johnson, and Stein vote in each precinct based on the number of votes expected to be cast by the end of the day. We will not be adjusting the Johnson and Stein numbers, as they will be too small to yield a reliable model based on different turnout levels. Instead, we will use the precinct aggregate microtargeting scores for them.

Summary statistics

As each precinct's extrapolated end-of-day turnout and extrapolated candidate support are calculated, we will be able to sum the candidate vote statewide to get an overall sense of which candidates are over- or under-performing in terms of turnout. We will also be able to provide statistics for turnout as a percent of expected for each of the precinct categories and sub-categories.

Appendix 1 Terminology

There are a lot of terms that are used interchangeably in the news. Projected, modeled, and estimated are all often used to mean something that is believed, but not known with certainty. In the context of VoteCastr, these terms have distinct, specific meanings. These and other key terms are defined below.

Observed
An actual report. Observed turnout is a turnout report called in from the field.

Projected
A calculation based on time of day. If a precinct has observed turnout at a specific time during the day, we can use that to calculate the projected end-of-day vote.

Extrapolated
In a precinct that does not have observed turnout, we extrapolate turnout based on how other precincts in the same category or subcategory are turning out.

Modeled
A predictive model (likely logistic regression and/or random forests) that predicts a behavior at an individual level. For this project we will model likelihood of supporting each of the top four candidates, and likelihood of voting.

Microtargeting
Modeled support or turnout.

Score
The output of a predictive model. A score gives the percent likelihood of an individual taking an action or holding an opinion. The turnout score gives the percent likelihood that the individual in question will vote. A candidate support score gives the percent likelihood that the individual will support that candidate, if he or she votes.

Precinct
The smallest unit of political geography. In some states, the words ward and precinct are used interchangeably. In other states, precincts are a subdivision of wards.

VTD
Voting Tabulation District. This is the unit of geography at which votes are counted. Often, this is the same as a precinct, but in many cases, multiple precincts will vote at the same location, and their votes are not separated by precinct. In this case, the VTD will be the combination of those precincts.
Note: Every 10 years, the census collects VTD definitions from the states. Many states will report their precinct lines, even if the individual precincts always vote in combination with others. The fact that a VTD is defined in a certain way by the census does not necessarily mean that that is the actual configuration that will be used for tabulating results. For clarity, we use the term Census VTD to refer to the official census geography.

Polling location
Where people go to vote on Election Day. A polling location may be the same as a precinct, or it may include multiple precincts. If a polling location includes multiple precincts and their votes are counted together, then the polling location and VTD are the same. Some polling locations will have separate lines and ballot boxes or voting machines for each precinct voting at that polling location, and those votes are counted and reported separately. In these cases, the VTD and the polling location are not the same.

VAP
Voting Age Population. From the census. Number of people age 18-plus.

CVAP
Citizen Voting Age Population. From the census. Number of U.S. citizens age 18-plus.
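As a closing illustration, the per-precinct regression described in the section on expected candidate vote (2-way Clinton vs. Trump support as a function of the percent of registered voters turning out) might look like the sketch below. It is only an interpretation of the text: the function names and input layout are hypothetical, ordinary least squares on a single predictor is assumed as the "linear regression", and clamping the prediction to the 0-100 range is an added safeguard not mentioned in the methodology.

```python
def two_way_support_by_turnout(voters):
    """Clinton's 2-way share at each turnout level, from 1 voter to all.

    voters: list of (turnout_score, clinton_score, trump_score), each 0-100.
    Returns a list of (percent_of_registered_turning_out, clinton_two_way_pct).
    """
    ranked = sorted(voters, key=lambda v: v[0], reverse=True)  # most likely first
    points = []
    clinton = trump = 0.0
    for k, (_, c, t) in enumerate(ranked, start=1):
        clinton += c
        trump += t
        if clinton + trump > 0:
            points.append((100 * k / len(ranked), 100 * clinton / (clinton + trump)))
    return points


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y  # a single turnout level: flat line
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x


def clinton_two_way_pct(slope, intercept, turnout_votes, registered):
    """Apply the saved precinct formula at the extrapolated turnout level."""
    pct_turnout = 100 * turnout_votes / registered
    return min(100.0, max(0.0, intercept + slope * pct_turnout))


if __name__ == "__main__":
    precinct = [(95, 70, 25), (80, 55, 40), (60, 45, 50), (30, 35, 60)]
    slope, intercept = fit_line(two_way_support_by_turnout(precinct))
    print(clinton_two_way_pct(slope, intercept, turnout_votes=3, registered=4))
```

Trump's 2-way share is 100 minus the value returned here. Johnson and Stein are left out of the regression, in line with the text, which uses their precinct aggregate scores instead.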