Random Forests and Gradient Boosting: Bagging and Boosting


The Bootstrap Sample and Bagging: simple ideas to improve any model via ensembles

Bootstrap Samples
Ø Random samples of your data, drawn with replacement, that are the same size as the original data.
Ø Some observations will not be sampled; these are called out-of-bag observations (Efron 1983; Efron and Tibshirani 1986).
Example: Suppose you have 10 observations, labeled 1-10.

Bootstrap Sample   Training Observations      Out-of-Bag Observations
1                  {1,3,2,8,3,6,4,2,8,7}      {5,9,10}
2                  {9,1,10,9,7,6,5,9,2,6}     {3,4,8}
3                  {8,10,5,3,8,9,2,3,7,6}     {1,4}

Bootstrap Samples
Ø It can be proven that a bootstrap sample will contain approximately 63% of the distinct observations.
Ø The sample size is the same as the original data because some observations are repeated.
Ø Some observations are left out of the sample (~37% out-of-bag).
Ø Uses:
Ø Alternative to traditional validation/cross-validation
Ø Creating ensemble models using different training sets (bagging)
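As a quick illustration of the ~63% / ~37% split, here is a short sketch added for this write-up (not part of the original slides); the sample size of 10,000 is arbitrary.

```python
# Draw a bootstrap sample with NumPy and measure the out-of-bag fraction.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                       # arbitrary number of observations
indices = np.arange(n)

sample = rng.choice(indices, size=n, replace=True)   # sample with replacement, same size as the data
out_of_bag = np.setdiff1d(indices, sample)           # observations never drawn

print(f"unique in-bag fraction: {len(np.unique(sample)) / n:.3f}")   # ~0.632
print(f"out-of-bag fraction:    {len(out_of_bag) / n:.3f}")          # ~0.368
```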

Bagging (Bootstrap Aggregating)
Ø Let k be the number of bootstrap samples.
Ø For each bootstrap sample, create a classifier using that sample as training data.
Ø This results in k different models.
Ø Ensemble those classifiers.
Ø A test instance is assigned to the class that received the highest number of votes.
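A minimal from-scratch sketch of this procedure in Python (illustrative only; decision stumps stand in for the base classifier, labels are assumed to be +1/-1, and X, y, k are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    """Train k base classifiers, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))              # bootstrap indices
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each classifier votes; the majority class wins (labels assumed +1/-1)."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```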

Bagging Example
x (input variable): 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y (target):           1    1    1   -1   -1   -1   -1    1    1    1
Ø 10 observations in the original dataset
Ø Suppose we build a decision tree with only 1 split.
Ø The best accuracy we can get is 70%:
Ø Split at x=0.35
Ø Split at x=0.75
Ø A tree with one split is called a decision stump.

Bagging Example
Let's see how bagging might improve this model:
1. Take 10 bootstrap samples from this dataset.
2. Build a decision stump for each sample.
3. Aggregate these rules into a voting ensemble.
4. Test the performance of the voting ensemble on the whole dataset.

Bagging Example Classifier 1 First bootstrap sample: Some observations chosen multiple times. Some not chosen. Best decision stump splits at x=0.35

Bagging Example Classifiers 1-5

Bagging Example Classifiers 6-10

Bagging Example: predictions from each classifier. The ensemble classifier has 100% accuracy.
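The worked example can be approximated with scikit-learn's BaggingClassifier (a sketch added here; the exact ensemble accuracy depends on the random bootstrap draws, but it is typically far better than the single stump's 70%):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.arange(0.1, 1.01, 0.1).reshape(-1, 1)        # x = 0.1, 0.2, ..., 1.0
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])    # target from the slide

stump = DecisionTreeClassifier(max_depth=1)
bag = BaggingClassifier(estimator=stump,            # use base_estimator= on scikit-learn < 1.2
                        n_estimators=10, random_state=0).fit(X, y)

print("single stump accuracy:   ", stump.fit(X, y).score(X, y))   # 0.7 at best
print("bagged ensemble accuracy:", bag.score(X, y))
```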

Bagging Summary
Ø Reduces generalization error for models with high variance.
Ø Bagging helps reduce errors associated with random fluctuations in the training data (high variance).
Ø If the base classifier is stable (not suffering from high variance), bagging can actually make it worse.
Ø Bagging does not focus on any particular observations in the training data (unlike boosting).

Random Forests (Tin Kam Ho 1995, 1998; Leo Breiman 2001)

Random Forests
Ø Random Forests are ensembles of decision trees similar to the one we just saw.
Ø Ensembles of decision trees work best when their predictions are not correlated, so that each tree finds different patterns in the data.
Ø Problem: Bagging tends to create correlated trees.
Ø Two solutions: (a) Randomly subset the features considered for each split. (b) Use unpruned decision trees in the ensemble.

Random Forests
Ø A collection of unpruned decision or regression trees.
Ø Each tree is built on a bootstrap sample of the data, and a subset of features is considered at each split.
Ø The number of features considered for each split is a parameter called mtry.
Ø Breiman (2001) suggests mtry = √p, where p is the number of features.
Ø I'd suggest setting mtry to 5-10 values evenly spaced between 2 and p and choosing the parameter by validation.
Ø Overall, the model is relatively insensitive to the value of mtry.
Ø The results from the trees are ensembled into one voting classifier.
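A hedged scikit-learn sketch of tuning the mtry analogue (called max_features there); the grid values are illustrative and X, y are assumed to exist:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(n_estimators=500, random_state=0)

# Try a handful of values between 2 and p (plus the common sqrt(p) default)
# and pick the winner by cross-validation.
grid = GridSearchCV(rf, param_grid={"max_features": [2, 5, 10, 20, "sqrt"]}, cv=5)
# grid.fit(X, y)                                  # uncomment once X, y are defined
# print(grid.best_params_, grid.best_score_)
```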

Random Forests Summary
Ø Advantages:
Ø Computationally fast: can handle thousands of input variables
Ø Trees can be trained simultaneously
Ø Exceptional classifiers: among the most accurate available
Ø Provide information on variable importance for the purposes of feature selection
Ø Can effectively handle missing data
Ø Disadvantages:
Ø No interpretability in the final model aside from variable importance
Ø Prone to overfitting
Ø Lots of tuning parameters, like the number of trees, the depth of each tree, and the percentage of variables passed to each tree

Boosting

Boosting Overview
Ø Like bagging, we are going to draw a sample of the observations from our data with replacement.
Ø Unlike bagging, the observations are not sampled with equal probability.
Ø Boosting assigns a weight to each training observation and uses that weight as a sampling distribution.
Ø Observations with higher weights are more likely to be chosen.
Ø Boosting may adaptively change those weights in each round.
Ø The weight is higher for examples that are harder to classify.
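A small sketch (not from the slides) of how such weighted sampling can be done, with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.05, 0.05, 0.30, 0.05, 0.05, 0.05, 0.30, 0.05, 0.05, 0.05])
weights = weights / weights.sum()                 # weights form a sampling distribution

# Observations 2 and 6 (0-based), the "hard" high-weight ones here, are drawn far more often.
sample_idx = rng.choice(len(weights), size=len(weights), replace=True, p=weights)
print(sample_idx)
```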

Boosting Example
x (input variable): 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y (target):           1    1    1   -1   -1   -1   -1    1    1    1
Ø Same dataset used to illustrate bagging.
Ø Boosting typically requires fewer rounds of sampling and classifier training.
Ø Start with equal weights for each observation.
Ø Update the weights each round based on the classification errors.

Boosting Example

Boosting: Weighted Ensemble
Ø Unlike bagging, boosted ensembles usually weight the votes of each classifier by a function of their accuracy.
Ø If a classifier gets the higher-weight observations wrong, it has a higher error rate.
Ø More accurate classifiers get higher weight in the prediction.

Boosting: Classifier weights Errors made: First 3 observations Errors made: Middle 4 observations Errors made: Last 3 observations

Boosting: Classifier weights Errors made: First 3 observations Errors made: Middle 4 observations Errors made: Last 3 observations Lowest weighted error. Highest weighted model.

Boosting: Weighted Ensemble. Classifier decision rules and classifier weights; individual classifier predictions and weighted ensemble predictions.

Boosting: Weighted Ensemble. Example weighted vote: 5.16 = -1.738 + 2.7784 + 4.1195.
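In code, the weighted vote is just the sign of a weighted sum; a tiny sketch using the numbers from the slide's example (classifier weights 1.738, 2.7784, 4.1195, with the first classifier voting -1):

```python
import numpy as np

alphas = np.array([1.738, 2.7784, 4.1195])    # classifier voting weights
votes  = np.array([-1, 1, 1])                 # each classifier's +1/-1 prediction for one instance

weighted_sum = np.dot(votes, alphas)          # -1.738 + 2.7784 + 4.1195 = 5.16
print(weighted_sum, np.sign(weighted_sum))    # the ensemble predicts class +1
```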

(Major) Boosting Algorithms AdaBoost (This is sooo 2007) Gradient Boosting [xgboost] (Welcome to the New Age of learning)

(Self-Study) AdaBoost Details: The Classifier Weights
Ø Let $w_j$ be the weight of observation $j$ entering the present round (round $i$).
Ø Let $m_j = 1$ if observation $j$ is misclassified, and $m_j = 0$ otherwise.
Ø The error of the classifier this round is $\varepsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j m_j$.
Ø The voting weight for the classifier this round is then $\alpha_i = \frac{1}{2}\ln\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$.

(Self-Study) AdaBoost Details: Updating Observation Weights
To update the observation weights from the current round (round $i$) to the next round (round $i+1$):
$w_j^{(i+1)} = w_j^{(i)}\, e^{-\alpha_i}$ if observation $j$ was correctly classified
$w_j^{(i+1)} = w_j^{(i)}\, e^{\alpha_i}$ if observation $j$ was misclassified
The new weights are then normalized to sum to 1 so they form a probability distribution.
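A literal Python sketch of one AdaBoost round following the formulas above (the weights and misclassification indicators are made up for illustration):

```python
import numpy as np

w = np.full(10, 0.1)                                 # equal starting weights (sum to 1)
m = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])         # m_j = 1 where observation j is misclassified

eps = np.sum(w * m) / len(w)                         # classifier error, as defined on the slide
alpha = 0.5 * np.log((1 - eps) / eps)                # classifier voting weight

w_new = w * np.exp(np.where(m == 1, alpha, -alpha))  # up-weight mistakes, down-weight correct ones
w_new = w_new / w_new.sum()                          # renormalize to a probability distribution
print(eps, alpha, w_new)
```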

Gradient Boosting The latest and greatest (Jerome H. Friedman 1999)

Gradient Boosting Overview
Ø Build a simple model $f_1(x)$ trying to predict a target $y$. It has error, right?
$y = f_1(x) + \varepsilon_1$ (actual value = modeled value + error)
Ø Now, let's try to predict that error with another simple model, $f_2(x)$. Unfortunately, it still has some error:
$y = f_1(x) + f_2(x) + \varepsilon_2$ (original modeled value + prediction of the residual $\varepsilon_1$ + error)

Gradient Boosting Overview
Ø We could just continue to add model after model, trying to predict the residuals from the previous set of models:
$y = f_1(x) + f_2(x) + f_3(x) + \dots + f_M(x) + \varepsilon_M$
(original modeled value + predictions of the successive residuals $\varepsilon_1, \varepsilon_2, \dots$ + a presumably very small error)

Gradient Boosting Overview
Ø To address the obvious problem of overfitting, we'll dampen the effect of the additional models by only taking a step toward the solution in that direction.
Ø We'll also start (in continuous problems) with a constant function (an intercept).
Ø The step sizes are automatically determined at each round inside the method:
$y = \gamma_1 + \gamma_2 f_2(x) + \gamma_3 f_3(x) + \dots + \gamma_M f_M(x) + \varepsilon_M$

Gradient Boosted Trees
Ø Gradient boosting yields an additive ensemble model.
Ø The key to gradient boosting is using weak learners.
Ø Typically simple, shallow decision/regression trees.
Ø Computationally fast and efficient.
Ø Alone, they make poor predictions, but ensembled in this additive fashion they provide superior results.
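A compact sketch of the residual-fitting idea with shallow regression trees (an illustration, not the exact algorithm on the slides; for brevity the per-round step size is folded into a single fixed learning rate, and X, y are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_rounds=100, eta=0.1):
    """Start from a constant, then repeatedly fit a small tree to the current residuals."""
    prediction = np.full(len(y), y.mean())            # the constant (intercept) model
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                     # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction += eta * tree.predict(X)           # damped step toward the residual
        trees.append(tree)
    return y.mean(), trees

def boost_predict(intercept, trees, X, eta=0.1):
    return intercept + eta * np.sum([t.predict(X) for t in trees], axis=0)
```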

Gradient Boosting and Overfitting
Ø In general, the step size alone is not enough to prevent us from overfitting the training data.
Ø To further aid in this mission, we must use some form of regularization to prevent overfitting:
1. Control the number of trees/classifiers used in the prediction. A larger number of trees => more prone to overfitting. Choose the number of trees by observing out-of-sample error.
2. Use a shrinkage parameter (the "learning rate"), often called eta ($\eta$), to effectively lessen the step size taken at each step:
$y = \gamma_1 + \eta\,\gamma_2 f_2(x) + \eta\,\gamma_3 f_3(x) + \dots + \eta\,\gamma_M f_M(x) + \varepsilon_M$
Smaller values of eta => less prone to overfitting. eta = 1 => no regularization.
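With scikit-learn's gradient boosting implementation, for example, the tree count and learning rate map directly onto the two regularization controls above (a sketch; parameter values are illustrative and X, y are assumed to exist):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,     # an upper bound on the number of trees
    learning_rate=0.05,   # eta: smaller => less prone to overfitting (1.0 = no shrinkage)
    max_depth=2,          # shallow trees as weak learners
)
# gb.fit(X_tr, y_tr)
# Trace validation accuracy after 1, 2, ..., 500 trees and keep the count where
# out-of-sample performance levels off:
# val_curve = [accuracy_score(y_val, p) for p in gb.staged_predict(X_val)]
```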

Gradient Boosting Summary
Ø Advantages:
Ø Exceptional model: one of the most accurate available, generally superior to Random Forests when well trained
Ø Can provide information on variable importance for the purposes of variable selection
Ø Disadvantages:
Ø The model lacks interpretability in the classical sense, aside from variable importance
Ø The trees must be trained sequentially, so this method is computationally slower than Random Forests
Ø One extra tuning parameter compared to Random Forests: the regularization or shrinkage parameter, eta

Notes about EM (SAS Enterprise Miner)
Ø EM has a node for Random Forests (HP tab => HP Forest)
Ø It uses CHAID, unlike other implementations
Ø It does not perform bootstrap sampling
Ø It does not appear to work as well as the randomForest package in R
Ø EM has a node for gradient boosting
Ø Personally, I recommend the extreme gradient boosting implementation of this method, which is called xgboost in both R and Python
Ø This implementation appears to be stronger and faster than the one in SAS
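A minimal sketch of the xgboost interface mentioned above (Python scikit-learn-style API; parameter values are illustrative and the train/validation arrays are assumed to exist):

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,   # eta, the shrinkage parameter discussed earlier
    max_depth=3,          # shallow trees as weak learners
)
# eval_set tracks out-of-sample error as trees are added:
# model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
# print(model.score(X_valid, y_valid))
```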