Support Vector Machines
Linearly Separable Data
SVM: A Simple Linear Separator (Hyperplane)
Which Simple Linear Separator?
Classifier Margin
Objective #1: Maximize Margin
How's this look?
Objective #2: Minimize Misclassifications
Support Vectors
Not Linearly Separable
SVM w/ Soft Margin
The model
• A hyperplane in R^n can be represented by a vector w with n elements, plus a bias term, w_0, which lifts it away from the origin.
• w_0 + w^T x = 0 (the equation of the decision boundary itself)
• Any observation, x, above the hyperplane has w_0 + w^T x > 0
• Any observation, x, below the hyperplane has w_0 + w^T x < 0
The input
• Input data and a class target.
• For best results, input data should be centered and standardized/normalized
  • Can be either a linear scaling or a statistical scaling
• You will frequently need to enter and tune other parameters for regularization and kernels (more on this later).
The output
• The output will typically be a set of parameters (i.e., a vector w plus an intercept w_0).
For a new example, x:
• If w_0 + w^T x < 0, then predict target = −1
• If w_0 + w^T x > 0, then predict target = +1
The above formulation changes when kernels are used, and it is best to use the model as an output object.
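To make the decision rule concrete, here is a minimal sketch in Python with scikit-learn (my own illustration, not part of the original slides): fit a linear SVM, extract w and w_0, and apply the sign rule above by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable groups, targets coded -1/+1
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)

w = model.coef_[0]        # the weight vector w
w0 = model.intercept_[0]  # the bias term w_0

# Decision rule: predict +1 if w_0 + w^T x > 0, else -1
x_new = np.array([5.0, 5.0])
score = w0 + w @ x_new
print("by hand:", 1 if score > 0 else -1)
print("via model.predict:", model.predict([x_new])[0])
```

Both lines should agree, since for a linear kernel `coef_` and `intercept_` are exactly the w and w_0 of the fitted hyperplane.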
Nonlinear SVMs: The Kernel Trick
Not Linearly Separable
Create Additional Variables?
z = x^2 + y^2
New Data is Linearly Separable!
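As a quick check of this lifting trick, the sketch below (my own illustration, not from the slides) builds a ring of one class around a blob of the other, which is not linearly separable in 2-D, and then adds the third coordinate z = x^2 + y^2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner blob (one class) surrounded by a ring (the other class)
inner = rng.normal(0.0, 0.5, size=(50, 2))
angles = rng.uniform(0.0, 2 * np.pi, size=50)
ring = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
ring += rng.normal(0.0, 0.2, size=(50, 2))

X = np.vstack([inner, ring])

# Lift to 3-D with z = x^2 + y^2: the classes now sit at different heights
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_lifted = np.column_stack([X, z])

print("inner z range:", z[:50].min(), z[:50].max())  # small values
print("ring  z range:", z[50:].min(), z[50:].max())  # values near 9
```

A horizontal plane at roughly z = 4 now separates the two classes, even though no line in the original 2-D space could.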
Another view
The last trick seems difficult in this case! It is not immediately clear what transformation will make this data linearly separable.
Kernels
• Suppose we add two points, which we'll call landmarks, l_1 and l_2.
• Now suppose we create two new variables, f_1 and f_2, which measure the similarity of each point to those landmarks.
Kernels
• f_1 is some measure of similarity (proximity) to l_1.
• It takes large values near l_1 and small values far from l_1.
Kernels
• f_2 is some measure of similarity (proximity) to l_2.
• It takes large values near l_2 and small values far from l_2.
Kernels
• Let's ignore our previous variables (the axes shown) and instead use f_1 and f_2.
• Suppose the blue target is +1 and the red target is −1.
• Consider the SVM model f(x) = 50 − 100f_1 − 100f_2.
When f_1 or f_2 > 0.5 (i.e., when a point is close to l_1 or l_2), f(x) < 0 and the prediction is negative (red).
When f_1 and f_2 are both near 0 (i.e., when a point is far from both landmarks), f(x) ≈ 50 > 0 and the prediction is positive (blue).
Kernels
• Next natural question: how do we choose the landmarks?
• You could choose a modest number of landmarks (using clustering or other methodology).
• In practice, a kernel uses every data point as a landmark.
• Essentially computes a similarity matrix to use as the data.
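Here is a small numpy sketch of the landmark idea (my own illustration; the function name gaussian_similarity is hypothetical): each new feature measures Gaussian similarity to one landmark, and using every data point as a landmark produces the full similarity (kernel) matrix.

```python
import numpy as np

def gaussian_similarity(X, landmarks, sigma=1.0):
    """f_k(x) = exp(-||x - l_k||^2 / (2 sigma^2)) for each landmark l_k."""
    # Squared distances from every point to every landmark: (n_points, n_landmarks)
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])

# Two hand-picked landmarks l_1 and l_2, as in the slides
landmarks = np.array([[0.0, 0.0], [4.0, 4.0]])
F = gaussian_similarity(X, landmarks)  # new features f_1, f_2
print(F)                               # near 1 close to a landmark, near 0 far away

# Using every data point as a landmark gives the n x n similarity matrix
K = gaussian_similarity(X, X)
print(K.shape)                         # (3, 3)
```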
Summary of Kernels
• Kernels are similarity functions that measure some kind of proximity between data points.
• The number of data points becomes the number of variables
  • So this is not good for large datasets! SAS has trouble running a kernel method with 50K data points!
• SVMs can use kernels in a very efficient way (the similarity matrix is never explicitly computed/stored).
• Kernels can improve the performance of SVMs in many situations.
Choosing Kernels
• Kernels embed data in a higher dimensional space (implicitly)
• You cannot typically know ahead of time which kernel function will work best
• Try several, and take the best performer on validation data
Popular Kernels
• Linear (→ NO kernel)
• Radial Basis Functions (RBFs)
  • Gaussian in particular is the most common and usually the default:
    K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2)) = exp(−γ‖x_i − x_j‖^2)
  • γ = 1/(2σ^2) is a hyperparameter controlling the shape of the function.
  • Some packages want you to specify gamma (γ); some ask you to specify sigma (σ).
  • Overwhelmingly THE most popular option when a kernel is needed.
  • NOT good for text classification. Typically linear is best for text.
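As a sanity check on the formula and the γ = 1/(2σ^2) conversion, this sketch (my own, not from the slides) computes the Gaussian kernel by hand and compares it with scikit-learn's rbf_kernel, which is parameterized by gamma:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x_i = np.array([[1.0, 2.0]])
x_j = np.array([[2.0, 0.0]])

sigma = 0.5
gamma = 1.0 / (2 * sigma ** 2)  # the gamma <-> sigma conversion

# By hand: exp(-||x_i - x_j||^2 / (2 sigma^2))
by_hand = np.exp(-np.sum((x_i - x_j) ** 2) / (2 * sigma ** 2))

# scikit-learn's RBF kernel, specified via gamma
by_sklearn = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]

print(by_hand, by_sklearn)  # the two values match
```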
RBF/Gaussian Kernel: [plots of K(x, l) = exp(−‖x − l‖^2 / (2σ^2)) for σ = 1 and σ = 0.5]
Kernels
• The circles shown (centered at l_1 and l_2) are meant to represent contours of those Gaussian functions.
Tuning σ (or equivalently, γ)
• This hyperparameter controls the influence of each training observation.
• A larger value of σ (equivalently, a smaller value of γ) means the basis functions are wider: the influence of a single point reaches far.
  • Smoother decision boundary => reduced potential for overfitting.
• A smaller value of σ (equivalently, a larger value of γ) means the basis functions are narrower: the influence of a single point is more local.
  • More localized/jagged decision boundary => overfitting more likely.
• Consider: if σ were small enough, every point might be identified individually!
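The overfitting behavior is easy to see empirically. A minimal sketch (my own illustration) fits an RBF SVM at several values of γ on a noisy two-class problem and compares training and test accuracy:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small gamma (large sigma): wide basis functions, smooth boundary
# Large gamma (small sigma): narrow basis functions, jagged boundary
for gamma in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:6}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

At the largest γ you should see near-perfect training accuracy with a noticeably worse test score: each training point is being "identified individually."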
Other Kernels
• Polynomial
  • K(x_i, x_j) = (a x_i^T x_j + c)^d, where a and c are constants and d is the degree of the polynomial
  • Much less popular
• Sigmoid
  • K(x_i, x_j) = tanh(a x_i^T x_j + c), where a and c are constants
  • Much less popular
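For reference, here is how these map onto scikit-learn's parameterization (a sketch under my reading of the sklearn docs: poly is (gamma·x_i^T x_j + coef0)^degree and sigmoid is tanh(gamma·x_i^T x_j + coef0), so a ↔ gamma, c ↔ coef0, d ↔ degree):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Polynomial kernel: (gamma * x_i^T x_j + coef0) ** degree
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0).fit(X, y)

# Sigmoid kernel: tanh(gamma * x_i^T x_j + coef0)
sigm_svm = SVC(kernel="sigmoid", gamma=0.01, coef0=0.0).fit(X, y)

print("poly train acc   :", poly_svm.score(X, y))
print("sigmoid train acc:", sigm_svm.score(X, y))
```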
What kernels can do
Regularization
• As with most machine learning algorithms, a regularization penalty is built in to most packages.
• Rather than specifying a λ as we would in most algorithms, SVMs are generally coded to expect C = 1/λ.
• C controls the tradeoff between a smooth decision boundary (bias/underfitting) and classifying training points correctly (variance/overfitting).
  • Larger C aims to classify all points correctly.
  • Smaller C aims to make the decision surface smoother.
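Mirroring the earlier γ sketch, this one (again my own illustration) varies C with everything else fixed; the largest C chases the training points hardest:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small C: smoother boundary (more regularization)
# Large C: tries to classify every training point correctly
for C in [0.01, 1.0, 1000.0]:
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X_tr, y_tr)
    print(f"C={C:7}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```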
Tuning Hyperparameters
• How do we choose the specific values of the hyperparameters σ (or γ) and C?
• One option is a grid search: see how the algorithm performs for all combinations of σ and C within a certain range, and pick the combination with the highest cross-validated (CV) accuracy.
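A grid search like the one described is a few lines with scikit-learn's GridSearchCV (a sketch; the grid values are arbitrary choices of mine):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Score every (C, gamma) combination by 5-fold cross-validated accuracy
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best params     :", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```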
Extensions of SVMs
• Multiclass classification
• Regression
Multiclass Classification with SVM
• Most straightforward approach: the One vs. All method (sketched below)
  1. Start with k classes.
  2. Train one SVM for each class, separating the points in that class (coded as +1) from all other points (coded as −1).
  3. For the SVM on class i, the result is a set of parameters w_i.
  4. To classify a new data point d, compute w_i^T d and place d in the class for which w_i^T d is largest.
• This is still an ongoing research issue: how to define a larger objective function efficiently to avoid several binary classifiers.
• New methods/packages are constantly being developed.
• Most existing packages can handle multiclass targets.
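The One vs. All recipe above translates almost line for line into scikit-learn (my own sketch; most packages, sklearn included, already do some form of this internally):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes
classes = np.unique(y)

# Step 2: one linear SVM per class, +1 for the class, -1 for everything else
models = [SVC(kernel="linear").fit(X, np.where(y == c, 1, -1)) for c in classes]

# Step 4: classify a new point by the largest score w_i^T d + w0_i
d = X[75]
scores = [m.decision_function([d])[0] for m in models]
print("scores:", np.round(scores, 2))
print("predicted class:", classes[int(np.argmax(scores))])
```

One caveat worth knowing: scores from independently trained binary SVMs are not calibrated against each other, which is part of why defining a single multiclass objective remains a research topic.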
Support Vector Regression
• The methodology behind SVMs has been extended to the regression problem.
• Essentially, the data is embedded in a very high dimensional space via kernels, and then a regression hyperplane is determined via optimization.
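A minimal sketch of support vector regression with scikit-learn's SVR (my own illustration): the RBF kernel does the implicit embedding, and epsilon sets the width of the tube within which training errors are ignored.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=100)

# RBF kernel embeds the data implicitly; epsilon is the error-free "tube" width
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

print("prediction at x = 2.5:", reg.predict([[2.5]])[0])
print("true sin(2.5)        :", np.sin(2.5))
```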
Creating an SVM in SAS EM
In my experience, this algorithm does not work as effectively as those implemented in R or Python. You also don't have the flexibility of hyperparameter tuning via cross validation.
SVM in SAS EM
Under the HPDM tab, find the HP SVM node.
SVM in SAS EM
The parameter C is called the Penalty and is listed under the Train options panel.
SVM in SAS EM
To use SVM with kernels, change the optimization method to Active Set and click the ellipsis for more options.
SVM in SAS EM
See the various options for the kernel used and its parameters. Note that the parameter for the RBF kernel is gamma, not sigma.