Data Sampling using Congressional Sampling, by Juhani Heliö

Overview
1. Motivation
2. Data sampling as a concept
3. Uniform random sampling
4. Congressional sampling
5. Results of Congressional sampling
6. Summary

Explanations for the presentation
In this presentation I use the following terms:
- Data set: the entire body of data being sampled.
- Data point (point of data): any single tuple or entry found in the data set.

1. Motivation
- Storing big data is a difficult task in its own right, but how does one actually use the stored data? The sheer volume of the data makes it hard to work with.
- Example: a company has enormous amounts of data stored and wants to compute an average sales record to use in its decision making. How can this be done quickly and efficiently?

2. Data sampling as a concept
What is data sampling?
- The main idea is to take a statistically representative sample of the data and analyse that sample instead of the whole original data set.
- This makes analysing huge amounts of data faster and more efficient.

3. Uniform random sampling
1. What is uniform random sampling
2. Why uniform random sampling is good...
3. ...and why it's not very good after all
4. Example 1: US Census database
5. Example 2: Very sparse data

What is uniform random sampling
Uniform random sampling is a simple and long-established sampling method.
Key concept:
- Select data points at random from the whole data set into the sample.
- The selection is done so that every data point has the same chance of being chosen.
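The key concept above can be illustrated with a minimal Python sketch. It is not taken from the presentation: the function name `uniform_sample`, the toy data set and the fixed seed are assumptions made purely for illustration.

```python
import random


def uniform_sample(data, sample_size, seed=None):
    """Draw sample_size points without replacement; every point is equally likely."""
    rng = random.Random(seed)
    return rng.sample(data, sample_size)


# Hypothetical data set of 1,000 numeric data points.
data = list(range(1_000))
sample = uniform_sample(data, 50, seed=0)

# A simple query: the sample mean serves as an estimate of the full mean.
estimate = sum(sample) / len(sample)
```

The estimate is computed from 50 points instead of 1,000, which is the trade of accuracy for speed that sampling is about.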

Pros
1. Works well with simple queries, such as finding the average over the whole data set.
2. Uniform random sampling is fast: O(n).

Cons
Despite its good sides, uniform random sampling has a critical flaw that can make the resulting sample inaccurate. The key problem lies with grouped data: if the size difference between groups is too large, uniform random sampling represents the small groups poorly. In the following examples we examine two similar scenarios and see that uniform random sampling struggles with very sparse data and with large differences between group sizes.

Example: US Census database
- The US Census database contains data on all the citizens of the nation. In this example an analyst wants to query the Census DB for the average income in each state. Because of the large volume of data in the DB, a sample is used. If uniform random sampling were used here, inaccuracies would occur, possibly rendering the sample unusable.
- This happens because the populations of the states differ greatly. States with a low population will have few data points selected. If too few data points from a state end up in the sample, the average computed for that state will not be accurate enough to be used. The problem could be fixed by taking a larger sample, but this would reduce the benefit of sampling.
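The effect described in this example can be seen in a small simulation. The following is only a sketch: the three "states", their sizes and the income distribution are made-up values, not Census figures.

```python
import random
from collections import Counter

# Hypothetical data set: each data point is (state, income).
rng = random.Random(42)
group_sizes = {"large": 9_000, "medium": 900, "small": 100}
data = [(state, rng.gauss(50_000, 10_000))
        for state, n in group_sizes.items()
        for _ in range(n)]

# Uniform random sample of 200 points out of 10,000.
sample = rng.sample(data, 200)

# Number of sampled points per state. On average "small" receives only
# 200 * 100 / 10,000 = 2 points, too few for a reliable per-state average.
counts = Counter(state for state, _ in sample)
```

Raising the sample size until the small state has enough points would also inflate the sample for the large state, which is the loss of benefit mentioned above.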

Problem summary
Given a large number of groups, of which the large majority are small, uniform random sampling must either consume nearly the entire data set to satisfy the error bound or give an inaccurate answer that is probably useless to the user. Sampling then brings little benefit, or even a negative benefit once the sampling overhead is counted.

Solutions
There are many solutions to this problem. This presentation focuses on Congressional sampling, which was developed to enhance the Aqua system.

4. Congressional sampling
1. Introduction
2. Aqua
3. House
4. Senate
5. Basic congress
6. Congress

Introduction
Congressional sampling is a biased sampling method developed to enhance the Aqua system. It actually comprises four sampling methods: House, Senate, Basic congress and Congress. The method takes its inspiration from the US political system, hence the name.

Aqua
- A system designed to sit between a traditional DBMS and the users of the database.
- Aqua provides approximate query answering.
- Enhancing this system has been the main motivation for developing this sampling method.

House
- Take a uniform random sample over the entire data set.
- This favors the large subgroups of the data set, just as plain uniform random sampling does.
- This also means that House on its own performs poorly on group-by queries.

Senate
- Take an equal-sized sample from every subgroup of the data set.
- The size of each sample is the total sample size divided by the number of subgroups.
- This method heavily favors the small subgroups of the data set.
- Because every group receives the same amount of sample space, the small groups get disproportionately many data points in their samples compared to the large groups.
- The Senate will thus perform worse than the House on data containing only a few small groups.
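The difference between the two allocations can be sketched in a few lines of Python. The function names and the three group sizes are hypothetical, and the sketch ignores the case where a group is smaller than its Senate share.

```python
def house_allocation(group_sizes, sample_size):
    """House: a uniform sample gives each group space proportional to its size."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}


def senate_allocation(group_sizes, sample_size):
    """Senate: every group gets the same share, whatever its size."""
    share = sample_size / len(group_sizes)
    return {g: share for g in group_sizes}


# Hypothetical group sizes, used only for illustration.
sizes = {"g1": 8_000, "g2": 1_500, "g3": 500}

house = house_allocation(sizes, 100)    # g1: 80.0, g2: 15.0, g3: 5.0
senate = senate_allocation(sizes, 100)  # about 33.3 for every group
```

House leaves the smallest group with 5 of the 100 units, while Senate gives it as much space as a group sixteen times its size.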

Basic congress
- Basic congress is a combination of the House and Senate samples.
- This method of sampling is fair to both large and small groups.
- However, simply combining the two would make the sample twice as big.
- This is mitigated by the following strategy:
  - For every subgroup g, let h_g be its sample size under House and s_g its sample size under Senate.
  - Take the larger of h_g and s_g into the Basic congress sample.
  - Then scale all the sample sizes down uniformly so that the overall sample size is the same as House or Senate would have.

Problem of Basic congress
The Basic congress method is still somewhat flawed. Consider a data set with four groups of tuples over attributes A and B, with the following sizes:
- (a1, b1): 3000
- (a1, b2): 3000
- (a1, b3): 1500
- (a2, b1): 2500
We take samples with sample size X = 100. The table shows the samples produced by House and Senate, and also by Basic congress and Congress.

The problem with Basic congress is that it focuses on the extremes.
- If we want to group by the values of A only, Basic congress allocates 77.3 units of sample space to the a1 group and 22.7 units to the a2 group. This can lead to inaccuracies for the a2 group.
- The Congress method of sampling addresses this problem.
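The figures 77.3 and 22.7 can be reproduced from the group sizes given for this example with a short Python sketch of the Basic congress strategy. The function name `basic_congress` and the dictionary layout are assumptions for illustration, not part of the presentation.

```python
def basic_congress(group_sizes, sample_size):
    """Take the larger of the House and Senate shares per group, then rescale."""
    total = sum(group_sizes.values())
    senate_share = sample_size / len(group_sizes)
    best = {g: max(sample_size * n / total, senate_share)
            for g, n in group_sizes.items()}
    scale = sample_size / sum(best.values())  # shrink back to sample_size
    return {g: v * scale for g, v in best.items()}


# Group sizes from the example: keys are (A value, B value).
sizes = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
         ("a1", "b3"): 1500, ("a2", "b1"): 2500}
alloc = basic_congress(sizes, 100)  # about 27.3, 27.3, 22.7, 22.7

# Sample space per value of A when grouping by A only.
by_a = {}
for (a, _b), space in alloc.items():
    by_a[a] = by_a.get(a, 0.0) + space
# by_a is about {"a1": 77.3, "a2": 22.7}
```

A Senate-style split on A alone would give a1 and a2 50 units each, so a2 is left with less than half of what an A-only grouping would want.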

Congress
The basic concept of Congress is to use stratified, biased sampling to construct the sample. Unlike Basic congress, the Congress method considers all the possible groupings in the data and constructs the sample out of those. In the example above the grouping attributes are {A, B}, so the possible groupings are: no grouping, grouping by A, grouping by B, and grouping by both A and B. An allocation is computed for each of these groupings, and the allocations are combined using the same take-the-largest method used in Basic congress. The result is then scaled down to ensure that the sample size stays the same.
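One way to read this description is sketched below in Python, applied to the four groups of the example. It is an interpretation under stated assumptions rather than the Aqua implementation: the function name `congress`, the rule that each coarse group's equal share is split among its finest groups in proportion to their sizes, and the final uniform rescaling are all assumptions of this sketch.

```python
from itertools import combinations


def congress(group_sizes, num_attrs, sample_size):
    """Allocate sample space considering every subset of the grouping attributes."""
    best = {g: 0.0 for g in group_sizes}
    for r in range(num_attrs + 1):
        for cols in combinations(range(num_attrs), r):
            # Sizes of the coarser groups obtained by grouping on cols only.
            coarse = {}
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in cols)
                coarse[key] = coarse.get(key, 0) + n
            # Equal share per coarse group, split proportionally inside it.
            share = sample_size / len(coarse)
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in cols)
                best[g] = max(best[g], share * n / coarse[key])
    scale = sample_size / sum(best.values())  # shrink back to sample_size
    return {g: v * scale for g, v in best.items()}


sizes = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
         ("a1", "b3"): 1500, ("a2", "b1"): 2500}
alloc = congress(sizes, num_attrs=2, sample_size=100)
# about 20.5, 22.7, 22.7 and 34.1 units of sample space
```

In this sketch the empty grouping reproduces House and the full grouping reproduces Senate, and the a2 group ends up with about 34.1 units instead of the 22.7 it receives under Basic congress.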

5. Results of Congressional sampling
To test the validity of the method, three tests were conducted with different groupings: no group-by columns, two group-by columns and three group-by columns.
Results:

- House performs poorly with any group-bys, but it was the most accurate when no group-bys were used. This is because House allocates most of the sample space to the large subgroups.
- Senate, on the other hand, allocates the sample space evenly, which favors the small groups, and thus performs poorly with no group-bys.
- Basic congress performs slightly worse than Congress, because it puts more weight on the extreme groupings while still trying to balance them.
- Congress performs the best or nearly the best in all of the cases and is the most consistent. Where the other methods focus on one aspect of sampling, Congress does not favor any particular aspect and thus performs the best overall.

6. Summary
In this presentation we have explored different sampling methods:
- Uniform random sampling, despite being appealingly simple and fast, was found lacking for more complex queries such as group-bys.
- Congressional sampling, a biased sampling method, was found to be a good alternative that solves this problem effectively.

Q&A