Support Vector Machines


Linearly Separable Data

SVM: Simple Linear Separator (hyperplane)

Which Simple Linear Separator?

Classifier Margin

Objective #1: Maximize Margin

How's this look?

Objective #2: Minimize Misclassifications

Support Vectors

Not Linearly Separable

SVM w/ Soft Margin

The model
- A hyperplane in R^n can be represented by a vector w with n elements, plus a bias term, w_0, which lifts it away from the origin.
- w_0 + w^T x = 0 (equation of the decision boundary itself)
- Any observation, x, above the hyperplane has w_0 + w^T x > 0
- Any observation, x, below the hyperplane has w_0 + w^T x < 0

The input
- Input data and a class target.
- For best results, input data should be centered and standardized/normalized. This can be either a linear scaling or a statistical scaling.
- You will frequently need to enter and tune other parameters for regularization and kernels (more on this later).
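As an illustration of the centering/standardization step, here is a minimal sketch in Python, assuming scikit-learn (the library and the toy data are my own choices, not part of the slides):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy two-class data; in practice X and y come from your own dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Standardize the inputs (statistical scaling), then fit a linear SVM
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)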

The output
- The output will typically be a set of parameters (i.e. a vector w plus an intercept w_0). For a new example, x:
- If w_0 + w^T x < 0, then predict target = -1
- If w_0 + w^T x > 0, then predict target = +1
- The above formulation changes when kernels are used, and it is best to use the model as an output object.
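The prediction rule above can be written as a small sketch (NumPy assumed; the parameter values are made up for illustration):

import numpy as np

# Hypothetical fitted parameters: intercept w0 and weight vector w
w0 = -0.3
w = np.array([0.8, -0.5])

def predict(x):
    # Sign of w0 + w^T x gives the predicted class: -1 below the hyperplane, +1 above
    return 1 if w0 + np.dot(w, x) > 0 else -1

print(predict(np.array([1.0, 0.2])))   # +1 here, since w0 + w^T x > 0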

Nonlinear SVMs: The Kernel Trick

Not Linearly Separable

Create Additional Variables?

z = x^2 + y^2

New Data is Linearly Separable!
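A rough sketch of this trick in Python (scikit-learn and NumPy assumed; the ring-shaped toy data is my own invention): after adding z = x^2 + y^2, a plain linear SVM separates the classes.

import numpy as np
from sklearn.svm import SVC

# Toy data: the class depends only on distance from the origin
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1, 1, -1)

# Add the new variable z = x^2 + y^2; in (x, y, z) space a linear separator works
Z = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])
clf = SVC(kernel="linear", C=1.0).fit(Z, y)
print(clf.score(Z, y))   # close to 1.0 on this toy example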

Another view
The last trick seems difficult in this case! It is not immediately clear what transformation will make this data linearly separable.

Kernels
- Suppose we add two points, which we'll call landmarks, l_1 and l_2.
- Now suppose we create two new variables, f_1 and f_2, which measure the similarity of each point to those landmarks.

Kernels
- f_1 is some measure of similarity (proximity) to l_1.
- It takes large values near l_1 and small values far from l_1.

Kernels
- f_2 is some measure of similarity (proximity) to l_2.
- It takes large values near l_2 and small values far from l_2.

Kernels
- Let's ignore our previous variables (the axes shown) and instead use f_1 and f_2.
- Suppose the blue target is +1 and the red target is -1.
- Consider the SVM model f(x) = 50 - 100 f_1 - 100 f_2. When f_1 or f_2 > 0.5 (i.e., when points are close to l_1 or l_2), the prediction is negative (red). When both f_1 and f_2 are small (i.e., when points are far from l_1 and l_2), the prediction is positive (blue).
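A small numeric sketch of this example (NumPy assumed; the landmark locations and sigma are arbitrary choices, and the Gaussian similarity anticipates the kernels discussed below):

import numpy as np

# Hypothetical landmarks
l1 = np.array([0.0, 1.0])
l2 = np.array([1.0, 0.0])
sigma = 0.5

def similarity(x, landmark):
    # Near 1 close to the landmark, near 0 far away
    return np.exp(-np.sum((x - landmark)**2) / (2 * sigma**2))

def f(x):
    # The slide's toy decision function
    return 50 - 100 * similarity(x, l1) - 100 * similarity(x, l2)

print(f(l1))                    # negative (red): close to a landmark
print(f(np.array([3.0, 3.0])))  # positive (blue): far from both landmarks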

Kernels
- The next natural question: how do we choose the landmarks?
- You could choose a modest number of landmarks (using clustering or other methodology).
- In practice, a kernel uses every data point as a landmark.
- Essentially, this computes a similarity matrix to use as the data.

Summary of Kernels
- Kernels are similarity functions that measure some kind of proximity between data points.
- The number of data points becomes the number of variables, so this is not good for large datasets! SAS has trouble running a kernel method with 50K data points!
- SVMs can use kernels in a very efficient way (the similarity matrix is never explicitly computed/stored).
- Kernels can improve the performance of SVMs in many situations.

Choosing Kernels
- Kernels embed data in a higher-dimensional space (implicitly).
- You typically cannot know ahead of time which kernel function will work best.
- You can try several and take the best performer on validation data.

Popular Kernels
- Linear (i.e., NO kernel)
- Radial Basis Functions (RBFs)
- The Gaussian in particular is the most common and usually the default:
  exp( -||x_i - x_j||^2 / (2σ^2) ) = exp( -γ ||x_i - x_j||^2 )
- γ = 1/(2σ^2) is a hyperparameter controlling the shape of the function.
- Some packages want you to specify gamma (γ); some ask you to specify sigma (σ).
- Overwhelmingly THE most popular option when a kernel is needed.
- NOT good for text classification. Typically linear is best for text.
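A minimal sketch of the RBF kernel (NumPy and scikit-learn assumed; note that scikit-learn parameterizes it with gamma rather than sigma):

import numpy as np
from sklearn.svm import SVC

def rbf(xi, xj, sigma=1.0):
    # Gaussian/RBF kernel: exp(-||xi - xj||^2 / (2*sigma^2))
    return np.exp(-np.sum((xi - xj)**2) / (2 * sigma**2))

# Equivalent scikit-learn usage: gamma = 1 / (2 * sigma^2)
sigma = 1.0
clf = SVC(kernel="rbf", gamma=1 / (2 * sigma**2))

# Tiny XOR-style example that a linear SVM cannot separate
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])
clf.fit(X, y)
print(clf.predict(X))   # recovers the training labels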

RBF/Gaussian Kernel: exp( -||x_i - x_j||^2 / (2σ^2) ), plotted for σ = 1 and σ = 0.5.

Kernels
- The circles shown around l_1 and l_2 are meant to represent contours of those Gaussian functions.


Tuning σ (or equivalently, γ)
- This hyperparameter controls the influence of each training observation.
- A larger value of σ (equivalently, a smaller value of γ) means that basis functions are wider: the influence of a single point reaches far. Smoother decision boundary => reduced potential for overfitting.
- A smaller value of σ (equivalently, a larger value of γ) means that basis functions are narrower: the influence of a single point is more local. More localized/jagged decision boundary => overfitting more likely.
- Consider: if σ were small enough, every point might be identified individually!
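The effect of this choice can be seen in a short sketch (scikit-learn assumed; the gamma values are arbitrary illustrations):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

smooth = SVC(kernel="rbf", gamma=0.5).fit(X, y)    # wide basis functions, smoother boundary
wiggly = SVC(kernel="rbf", gamma=50.0).fit(X, y)   # narrow basis functions, more jagged boundary

# The large-gamma model hugs the training data more closely (overfitting risk)
print(smooth.score(X, y), wiggly.score(X, y))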

Other Kernels
- Polynomial: (a x_i^T x_j + c)^d, where a and c are constants and d is the degree of the polynomial. Much less popular.
- Sigmoid: tanh(a x_i^T x_j + c), where a and c are constants. Much less popular.
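For reference, a hedged sketch of how these kernels are typically requested in scikit-learn (which calls the constants gamma and coef0):

from sklearn.svm import SVC

# Polynomial kernel: (gamma * x_i . x_j + coef0) ** degree
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0)

# Sigmoid kernel: tanh(gamma * x_i . x_j + coef0)
sig_svm = SVC(kernel="sigmoid", gamma=1.0, coef0=0.0)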

What kernels can do


Regularization
- As with most machine learning algorithms, a regularization penalty is built into most packages.
- Rather than specifying a λ as we would in most algorithms, SVMs are generally coded to expect C = 1/λ.
- C controls the tradeoff between a smooth decision boundary (bias/underfitting) and classifying training points correctly (variance/overfitting).
- A larger C aims to classify all points correctly.
- A smaller C aims to make the decision surface smoother.

Tuning Hyperparameters
- How do we choose the specific values of the hyperparameters σ (or γ) and C?
- One option is a grid search: see how the algorithm performs for all combinations of σ and C within a certain range. (The slide shows a heat map of cross-validated accuracy, from high CV accuracy to low CV accuracy, over the grid.)
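A grid search of this kind might look like the following sketch (scikit-learn assumed; the grid values are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Try every combination of C and gamma, scored by cross-validated accuracy
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)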

Extensions of SVMs: Multiclass Classification and Regression

Multiclass Classification with SVM
- Most straightforward approach: the One-vs.-All method
  1. Start with k classes.
  2. Train one SVM for each class, separating the points in that class (coded as +1) from all other points (coded as -1).
  3. For the SVM on class i, the result is a set of parameters w_i.
  4. To classify a new data point d, compute w_i^T d and place d in the class for which w_i^T d is largest.
- This is still an ongoing research issue: how to define a larger objective function efficiently to avoid several binary classifiers.
- New methods/packages are constantly being developed.
- Most existing packages can handle multiclass targets.
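A compact sketch of the One-vs.-All idea using scikit-learn's wrapper (the iris data is just a convenient stand-in with k = 3 classes):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # three classes

# One linear SVM per class; a new point goes to the class whose SVM scores it highest
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
print(ovr.predict(X[:5]))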

Support Vector Regression
- The methodology behind SVMs has been extended to the regression problem.
- Essentially, the data is embedded in a very high-dimensional space via kernels, and then a regression hyperplane is determined via optimization.
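A brief sketch of support vector regression (scikit-learn assumed; the data and hyperparameters are arbitrary):

import numpy as np
from sklearn.svm import SVR

# Toy one-dimensional regression problem
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# SVR with an RBF kernel; epsilon sets the width of the no-penalty tube around the fit
reg = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.05).fit(X, y)
print(reg.predict([[1.5]]))   # roughly sin(1.5)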

Creating an SVM in SAS EM
In my experience, this algorithm does not work as effectively as those implemented in R or Python. You also don't have the flexibility of hyperparameter tuning via cross-validation.

SVM in SAS EM
Under the HPDM tab, find the HP SVM node.

SVM in SAS EM
The parameter C is called the Penalty and is listed under the Train options panel.

SVM in SAS EM
To use SVM with kernels, change the optimization method to Active Set and click the ellipsis for more options.

SVM in SAS EM
See the various options for the kernel used and its parameters. The parameter for the RBF kernel is gamma, not sigma.