Congressional samples Juho Lamminmäki


Congressional samples Based on Congressional Samples for Approximate Answering of Group-By Queries (2000) by Swarup Acharya et al.

Data Sampling Trying to obtain a maximally representative subset of the original data to reduce computation time or required storage. 100% accurate data is not always needed for analytics. The sample should work well with different kinds of queries.

Outline: Data sampling. The problem with plain uniform sampling. Congressional samples. Querying the sampled data. Drawbacks of the approach. Conclusion.

Aggregation queries
SELECT sex, municipality, party, AVG(age)   -- AVG(age): aggregate attribute
FROM poll
WHERE election_year = 2017                  -- predicate
GROUP BY sex, municipality, party           -- grouping attributes

Uniform sampling Given a sample size X and the size of the original data D, pick X random rows with an equal probability. However, if some groups are very small, only a few rows are picked from those groups. Accuracy becomes an issue with very small samples.
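A minimal sketch of the small-group problem, assuming a hypothetical table with one large and one tiny group (the group names and sizes are made up for illustration). With X = 100 rows drawn uniformly, the tiny group is expected to contribute a single row.

```python
import random
from collections import Counter


def uniform_sample(rows, x, seed=0):
    """Pick x rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, x)


def expected_group_sizes(group_counts, x):
    """Expected number of sampled rows per group under uniform sampling."""
    total = sum(group_counts.values())
    return {g: x * n / total for g, n in group_counts.items()}


# Hypothetical data: each row is represented only by its group label.
group_counts = {"big": 9900, "tiny": 100}
rows = [g for g, n in group_counts.items() for _ in range(n)]

expected = expected_group_sizes(group_counts, x=100)
sample_counts = Counter(uniform_sample(rows, x=100))

print(expected)       # expected rows per group
print(sample_counts)  # rows per group in one concrete sample
```

Any aggregate computed for the tiny group from such a sample rests on roughly one row.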

The basic idea behind the solution A larger proportion of the original group has to be sampled if the group is small. Fewer rows can be sampled from the larger groups, since their accuracy does not suffer as much. Uniform sampling is still important because it works best if the sample is later queried using predicates.

Congressional samples: House, Senate, Basic Congress, Congress

House Uniform sampling over the whole data.

House

Senate Given m groups and a sample size X, take a sample of X/m rows from each group, i.e. the total sample size is divided equally between all groups. May use too few samples from the larger groups.
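A small sketch of the Senate allocation, again with hypothetical group sizes. Every group gets X/m rows regardless of its size, so the large group is sampled far more thinly than the small one.

```python
def senate_allocation(group_counts, x):
    """Divide the sample budget x equally between the m groups."""
    m = len(group_counts)
    return {g: x / m for g in group_counts}


# Hypothetical group sizes, for illustration only.
group_counts = {"big": 9900, "tiny": 100}
alloc = senate_allocation(group_counts, x=100)

# Fraction of each group that ends up in the sample.
sampling_rate = {g: alloc[g] / n for g, n in group_counts.items()}

print(alloc)
print(sampling_rate)
```

Here half of the tiny group is sampled, but only about 0.5% of the big group.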

House and Senate

Basic Congress A combination of House and Senate. For each group g, the sample size is max(H_g, S_g), where H_g and S_g are the expected sample sizes of group g in the House and Senate sampling methods respectively.

House, Senate and Basic Congress

Basic Congress The maxima add up to more than X, so the sample size of each group has to be scaled by a constant factor so that the total sample size becomes X again.
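The two steps (take the maximum of the House and Senate sizes, then scale back to X) can be sketched as follows. The function names and the two-group data set are assumptions made for this example.

```python
def house(group_counts, x):
    """Expected per-group sample size under uniform sampling."""
    total = sum(group_counts.values())
    return {g: x * n / total for g, n in group_counts.items()}


def senate(group_counts, x):
    """Equal share of the sample budget for every group."""
    m = len(group_counts)
    return {g: x / m for g in group_counts}


def basic_congress(group_counts, x):
    """max(House, Senate) per group, scaled so the sizes sum to x."""
    h = house(group_counts, x)
    s = senate(group_counts, x)
    raw = {g: max(h[g], s[g]) for g in group_counts}
    scale = x / sum(raw.values())
    return {g: size * scale for g, size in raw.items()}


# Hypothetical data: a1 holds 75% of the rows, a2 holds 25%.
group_counts = {"a1": 7500, "a2": 2500}
alloc = basic_congress(group_counts, x=100)
print(alloc)  # about 60 rows for a1 and 40 rows for a2
```

Before scaling the sizes are 75 and 50, which sum to 125; multiplying by 100/125 gives 60 and 40.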


Not perfect

Basic Congress Let A and B be grouping attributes that split the data into four groups, i.e. GROUP BY A, B:

A    B    avg(c)   Group
a1   b1   ...      (a1, b1)
a1   b2   ...      (a1, b2)
a1   b3   ...      (a1, b3)
a2   b3   ...      (a2, b3)

Basic Congress Share of the original data held by each group, and the Basic Congress sample size of each group as a percentage of the total sample size:

Grouping attributes: A
                                  (a1)   (a2)
Original data                     75%    25%
Basic Congress sample for A       60%    40%    Optimal

Grouping attributes: A, B
                                  (a1, b1)   (a1, b2)   (a1, b3)   (a2, b3)
Original data                     30%        30%        15%        25%
Basic Congress sample for A, B    27%        27%        23%        23%        Optimal

The A, B sample regrouped by A
                                  (a1)   (a2)
Share of the total sample size    77%    23%    Not optimal
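The percentages can be reproduced with a short sketch. The absolute group sizes below (10,000 rows in total) are an assumption chosen to match the 30/30/15/25 split; only the proportions matter.

```python
from collections import defaultdict


def basic_congress(group_counts, x):
    """max(House, Senate) per group, scaled so the sizes sum to x."""
    total = sum(group_counts.values())
    m = len(group_counts)
    raw = {g: max(x * n / total, x / m) for g, n in group_counts.items()}
    scale = x / sum(raw.values())
    return {g: size * scale for g, size in raw.items()}


# Assumed row counts per (A, B) group: 30%, 30%, 15%, 25% of the data.
fine = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
        ("a1", "b3"): 1500, ("a2", "b3"): 2500}

# Sample built for GROUP BY A, B, then regrouped by A alone.
alloc_ab = basic_congress(fine, x=100)
regrouped = defaultdict(float)
for (a, _b), size in alloc_ab.items():
    regrouped[a] += size

# Sample built directly for GROUP BY A.
coarse = defaultdict(int)
for (a, _b), n in fine.items():
    coarse[a] += n
alloc_a = basic_congress(dict(coarse), x=100)

print({g: round(s, 2) for g, s in alloc_ab.items()})
print({g: round(s, 2) for g, s in regrouped.items()})
print({g: round(s, 2) for g, s in alloc_a.items()})
```

The regrouped sample gives a1 about 77% of the rows, while a sample built for A alone would give it 60%, which is the mismatch labelled "Not optimal".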

Congress An extension of Basic Congress that solves this problem, i.e. it works better than Basic Congress when a query groups by a subset of the original grouping attributes.

Congress All subsets of the grouping attributes are ∅, {A}, {B} and {A, B}. First, count the groups created by each subset.

Subset    Groups                                    Total #
∅         The whole data                            1
{A}       (a1), (a2)                                2
{B}       (b1), (b2), (b3)                          3
{A, B}    (a1, b1), (a1, b2), (a1, b3), (a2, b3)    4

Congress Then, calculate the expected sample size for each group using senate sampling. If X is the total sample size, then each group has a sample size of X/(number of groups).

Subset    Groups                                    Total #   Sample size of a single group
∅         The whole data                            1         X/1
{A}       (a1), (a2)                                2         X/2
{B}       (b1), (b2), (b3)                          3         X/3
{A, B}    (a1, b1), (a1, b2), (a1, b3), (a2, b3)    4         X/4

Congress So the expected sample size of each group (a1, b1), (a1, b2), (a1, b3), (a2, b3), as a percentage of the total sample size X, becomes:

Subset    (a1, b1)   (a1, b2)   (a1, b3)   (a2, b3)
∅         30%        30%        15%        25%
{A}       20%        20%        10%        50%
{B}       33.33%     33.33%     12.5%      20.83%
{A, B}    25%        25%        25%        25%

Within each subset, a group's X/(number of groups) share is divided among its (A, B) subgroups in proportion to their sizes. For {B}, for example, b1, b2 and b3 each get X/3, and the b3 share is split 15:25 between (a1, b3) and (a2, b3).

Congress The empty set does not group at all, so taking a senate sample with no grouping attributes is the same as taking a House (uniform) sample. This is the ∅ row of the table.

Congress Taking the maximum sample size from either ∅ or {A, B} and scaling is the same as Basic Congress.

Subset    (a1, b1)   (a1, b2)   (a1, b3)   (a2, b3)
∅         30%        30%        15%        25%
{A, B}    25%        25%        25%        25%
MAX       30%        30%        25%        25%

Congress Adding the other subsets turns Basic Congress into Congress.

Subset    (a1, b1)   (a1, b2)   (a1, b3)   (a2, b3)
∅         30%        30%        15%        25%
{A}       20%        20%        10%        50%
{B}       33.33%     33.33%     12.5%      20.83%
{A, B}    25%        25%        25%        25%
MAX       33.33%     33.33%     25%        50%

Congress This ensures that the sample works reasonably well with any subset of the original grouping attributes.

          (a1, b1)   (a1, b2)   (a1, b3)   (a2, b3)
MAX       33.33%     33.33%     25%        50%
SCALED    23.53%     23.53%     17.65%     35.29%

(a1)      (a2)
64.71%    35.29%

(b1)      (b2)      (b3)
23.53%    23.53%    52.94%
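A sketch of the whole Congress computation, under two assumptions: the row counts are chosen to match the 30/30/15/25 split, and a group's X/(number of groups) share is divided among its (A, B) subgroups in proportion to their sizes. Under that rule b1, b2 and b3 each receive X/3, so the {B} row works out to 33.33%, 33.33%, 12.5% and 20.83%.

```python
from itertools import combinations


def congress(rows_per_group, n_attrs, x):
    """Congress allocation over all subsets of the grouping attributes.

    rows_per_group maps a tuple of attribute values (one per grouping
    attribute) to the number of rows in that finest-level group.
    """
    best = {g: 0.0 for g in rows_per_group}
    for r in range(n_attrs + 1):
        for subset in combinations(range(n_attrs), r):
            # Row count of every coarser group under this subset.
            coarse = {}
            for g, n in rows_per_group.items():
                key = tuple(g[i] for i in subset)
                coarse[key] = coarse.get(key, 0) + n
            m = len(coarse)
            for g, n in rows_per_group.items():
                key = tuple(g[i] for i in subset)
                # Senate share of the coarse group, split proportionally.
                share = (x / m) * n / coarse[key]
                best[g] = max(best[g], share)
    scale = x / sum(best.values())
    return {g: size * scale for g, size in best.items()}


# Assumed row counts per (A, B) group: 30%, 30%, 15%, 25% of the data.
fine = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
        ("a1", "b3"): 1500, ("a2", "b3"): 2500}
alloc = congress(fine, n_attrs=2, x=100)
print({g: round(s, 2) for g, s in alloc.items()})
```

The empty subset reproduces House and the full subset reproduces Senate, so restricting the loop to those two subsets gives Basic Congress.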

Querying sampled data Averages, medians etc. work fine without modifications. Sums, counts etc. require modification.

Querying sampled data SELECT sum(value) * original_size/sample_size works only for uniform samples, since original_size/sample_size is not the correct scale factor for all groups in non-uniform (biased) samples. Alternatives:
Storing the scale factor for each row: very high maintenance overhead.
Storing the scale factor for each group: most likely the best solution.

Querying sampled data
SELECT v.a, v.b, v.c, SUM(v.value) * s.scale_factor   -- scale the sampled sum up per group
FROM sample_values v
JOIN scale_factors s USING (a, b, c)                  -- one scale factor row per group
GROUP BY v.a, v.b, v.c, s.scale_factor
Can be optimized further, but this is the basic idea. The scale factors have to be constantly maintained, but the overhead is not very high.
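A runnable sketch of the per-group scale factor idea using an in-memory SQLite database. The table names, columns and numbers are hypothetical, and two grouping attributes are used instead of three to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sampled rows (hypothetical data).
    CREATE TABLE sample_values (a TEXT, b TEXT, value REAL);
    -- One scale factor per group: group size in the original data
    -- divided by group size in the sample.
    CREATE TABLE scale_factors (a TEXT, b TEXT, scale_factor REAL);

    INSERT INTO sample_values VALUES
        ('a1', 'b1', 10), ('a1', 'b1', 20), ('a2', 'b3', 5);
    INSERT INTO scale_factors VALUES
        ('a1', 'b1', 100.0), ('a2', 'b3', 50.0);
""")

rows = conn.execute("""
    SELECT v.a, v.b, SUM(v.value) * s.scale_factor AS estimated_sum
    FROM sample_values v
    JOIN scale_factors s ON s.a = v.a AND s.b = v.b
    GROUP BY v.a, v.b, s.scale_factor
    ORDER BY v.a, v.b
""").fetchall()

print(rows)
```

Each group's sampled sum is multiplied by that group's own factor, which a single global original_size/sample_size ratio cannot express for a biased sample.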

Drawbacks For some data, uniform sampling over the whole data, which is much easier to implement and maintain, may be good enough. This is the case when not many grouping attributes are needed and/or there are no small groups.

Drawbacks Senate sampling (used in Congress and Basic Congress too) might try to sample more rows from a group than the group contains in the original data. The original paper simply states that handling these scenarios is not straightforward and leaves it at that.

Drawbacks Aggregate attributes with a very high variance or outliers with extreme values do not behave well when uniformly sampled, e.g. avg(-3, 0, 3, 1, 1, 100000) = 16667, but avg(-3, 0, 3, 1) = 0.25.
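To make the effect concrete, the sketch below enumerates every possible 4-value sample of the six values above and looks at the spread of the sample averages. It is only an illustration of the sensitivity, not part of any sampling scheme.

```python
from itertools import combinations
from statistics import mean

values = [-3, 0, 3, 1, 1, 100000]

true_avg = mean(values)

# Average of every possible sample of 4 values out of the 6.
sample_avgs = [mean(sample) for sample in combinations(values, 4)]

# How many of those samples happen to include the extreme value.
with_outlier = sum(1 for s in combinations(values, 4) if 100000 in s)

print(true_avg)
print(min(sample_avgs), max(sample_avgs))
print(with_outlier, "of", len(sample_avgs), "samples contain the outlier")
```

Every sample lands either near 0 or near 25000, and none comes close to the true average of 16667.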

Drawbacks In these cases, implementing a solution that buckets the values into ranges [v1, vn] = [v1, v2] ∪ ... ∪ [v(n-1), vn] and takes a representative sample from each bucket will yield better results (Error-bounded Sampling for Analytics on Big Sparse Data, Ying Yan et al., 2014). This kind of solution is more accurate in general, but it is less flexible with e.g. query predicates, and the aggregate attributes must be known beforehand.
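A very rough sketch of the bucketing idea, written as a plain stratified estimator. This is not the algorithm from the cited paper: the bucket edges, the per-bucket sample size and the function name are all assumptions, and every value is assumed to lie between the first and last edge.

```python
import random


def bucketed_mean(values, edges, per_bucket, seed=0):
    """Estimate the mean by sampling each value range separately."""
    rng = random.Random(seed)
    weighted_total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        # Half-open ranges [lo, hi); the last range also includes hi.
        bucket = [v for v in values
                  if lo <= v < hi or (is_last and v == hi)]
        if not bucket:
            continue
        picked = rng.sample(bucket, min(per_bucket, len(bucket)))
        # Weight the bucket's sample mean by the bucket's true size.
        weighted_total += len(bucket) * sum(picked) / len(picked)
    return weighted_total / len(values)


values = [-3, 0, 3, 1, 1, 100000]
estimate = bucketed_mean(values, edges=[-3, 4, 100000], per_bucket=2)
print(estimate)
```

Because the extreme value sits alone in its own bucket, it is always represented with the correct weight, so the estimate stays within a couple of units of 16667 whichever small values are picked.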

Conclusion Data sampling is useful when saving resources or time trumps accuracy. Small groups are a problem with uniform sampling. Congress sampling fixes the problem with small groups, but does not handle situations where the aggregate attribute has some extreme values. Sampling makes querying more complex.

Phew, it's finally over! In case you missed it, my name is