Announcements: HW3 due tonight. HW4 posted. No class Thursday (Thanksgiving).

Mixtures of Gaussians. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Mixture models. Z = \{y_i\}_{i=1}^n is observed data; \Delta = \{\Delta_i\}_{i=1}^n is unobserved data. If \phi_\theta(x) is the Gaussian density with parameters \theta = (\mu, \sigma^2), then

\ell(\theta; Z, \Delta) = \sum_{i=1}^n \big[(1-\Delta_i)\log\big((1-\pi)\,\phi_{\theta_1}(y_i)\big) + \Delta_i \log\big(\pi\,\phi_{\theta_2}(y_i)\big)\big], \qquad \gamma_i(\theta) = E[\Delta_i \mid \theta, Z].

Mixture models

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Some bio assay data

GMM clustering of the assay data

Resulting density estimator

Expectation Maximization Algorithm. Observe data x_1, \dots, x_n drawn from a distribution p(\cdot \mid \theta) for some \theta \in \Theta, and let \hat\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta). Then

\sum_{i=1}^n \log p(x_i \mid \theta)
= \sum_{i=1}^n \log\Big(\sum_j p(x_i, z_i = j \mid \theta)\Big)   (introduce hidden data z_i)
= \sum_{i=1}^n \log\Big(\sum_j q_i(z_i = j \mid \theta')\,\frac{p(x_i, z_i = j \mid \theta)}{q_i(z_i = j \mid \theta')}\Big)   (introduce dummy distribution q_i and variable \theta')
\ge \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta')\,\log\frac{p(x_i, z_i = j \mid \theta)}{q_i(z_i = j \mid \theta')}   (Jensen's inequality; \log is concave)
= \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta')\,\log p(x_i, z_i = j \mid \theta) + \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta')\,\log\frac{1}{q_i(z_i = j \mid \theta')},

and the last term does not depend on \theta.

Expectation Maximization Algorithm. Observe data X = [x_1, \dots, x_n] drawn from a distribution p(\cdot \mid \theta) for some \theta \in \Theta, with \hat\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta). The lower bound

\sum_{i=1}^n \log p(x_i \mid \theta) \ge \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta')\,\log p(x_i, z_i = j \mid \theta) + (\text{terms not depending on } \theta)

is true for any choice of \theta' and distribution q_i(z_i = j \mid \theta'). Set q_i(z_i = j \mid \theta') = p(z_i = j \mid \theta', X).

Expectation Maximization Algorithm. Observe data x_1, \dots, x_n drawn from a distribution p(\cdot \mid \theta) for some \theta \in \Theta, with \hat\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta). Define

Q(\theta, \theta') := \sum_{i=1}^n \sum_j p(z_i = j \mid \theta', X)\,\log p(x_i, z_i = j \mid \theta).

Start with an initial guess \theta^{(0)}; for each step k:
E-step: compute Q(\theta, \theta^{(k)}) = \sum_{i=1}^n E_{z_i}\big[\log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, X\big]
M-step: find \theta^{(k+1)} = \arg\max_\theta Q(\theta, \theta^{(k)})

Expectation Maximization Algorithm. Start with an initial guess \theta^{(0)}; for each step k: E-step: compute Q(\theta, \theta^{(k)}) = \sum_{i=1}^n E_{z_i}[\log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, X]; M-step: find \theta^{(k+1)} = \arg\max_\theta Q(\theta, \theta^{(k)}).

Example: observe x_1, \dots, x_n \sim (1-\pi)\,\mathcal{N}(\mu_1, \sigma_1^2) + \pi\,\mathcal{N}(\mu_2, \sigma_2^2), with z_i = j if i is in mixture component j for j \in \{1, 2\}, and \theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2). Then

E_{z_i}[\log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, x_i]
= p(z_i = 1 \mid \theta^{(k)}, x_i)\,\log p(x_i, z_i = 1 \mid \theta) + p(z_i = 2 \mid \theta^{(k)}, x_i)\,\log p(x_i, z_i = 2 \mid \theta)
= p(z_i = 1 \mid \theta^{(k)}, x_i)\,\log\big((1-\pi)\,\phi(x_i; \mu_1, \sigma_1^2)\big) + p(z_i = 2 \mid \theta^{(k)}, x_i)\,\log\big(\pi\,\phi(x_i; \mu_2, \sigma_2^2)\big),

where the responsibilities are

p(z_i = 1 \mid \theta^{(k)}, x_i) = \frac{(1-\pi^{(k)})\,\phi(x_i; \mu_1^{(k)}, \sigma_1^{2(k)})}{(1-\pi^{(k)})\,\phi(x_i; \mu_1^{(k)}, \sigma_1^{2(k)}) + \pi^{(k)}\,\phi(x_i; \mu_2^{(k)}, \sigma_2^{2(k)})}

and similarly for z_i = 2.
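
The E-step and M-step for this two-component mixture have closed forms, and the whole loop is short in code. Below is a minimal numpy/scipy sketch under the notation above; the function and variable names (em_two_component_gmm, gamma, etc.) are my own, not from the slides.

```python
import numpy as np
from scipy.stats import norm

def em_two_component_gmm(x, n_iters=100, seed=0):
    """EM for (1-pi)*N(mu1, s1^2) + pi*N(mu2, s2^2) on 1-d data x."""
    rng = np.random.default_rng(seed)
    # Initial guess theta^(0): two random data points as means, pooled std, pi = 0.5
    pi = 0.5
    mu1, mu2 = rng.choice(x, size=2, replace=False)
    s1 = s2 = np.std(x)
    for _ in range(n_iters):
        # E-step: responsibilities gamma_i = p(z_i = 2 | theta^(k), x_i)
        p1 = (1 - pi) * norm.pdf(x, mu1, s1)
        p2 = pi * norm.pdf(x, mu2, s2)
        gamma = p2 / (p1 + p2)
        # M-step: maximize Q(theta, theta^(k)) in closed form
        pi = gamma.mean()
        mu1 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * x) / np.sum(gamma)
        s1 = np.sqrt(np.sum((1 - gamma) * (x - mu1) ** 2) / np.sum(1 - gamma))
        s2 = np.sqrt(np.sum(gamma * (x - mu2) ** 2) / np.sum(gamma))
    return pi, mu1, s1, mu2, s2

# Toy usage on synthetic data from two well-separated Gaussians
x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(4, 0.5, 200)])
print(em_two_component_gmm(x))
```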

Expectation Maximization Algorithm
- EM is used to solve latent factor models
- Also used to solve missing data problems
- Known as the Baum-Welch algorithm for Hidden Markov Models
- In general the objective is non-convex, so EM can get stuck in local optima

Density Estimation. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Kernel Density Estimation: a very lazy GMM

Kernel Density Estimation

Kernel Density Estimation. What is the Bayes optimal classification rule? Predict \arg\max_m \hat{r}_{i,m}
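
As a concrete illustration of the "very lazy GMM" view, here is a minimal Gaussian kernel density estimator plus the plug-in classifier that predicts the class with the largest prior-weighted density estimate; the bandwidth and helper names are illustrative assumptions, not from the slides.

```python
import numpy as np

def kde(x_query, x_train, bandwidth=0.3):
    """Gaussian KDE: p_hat(x) = (1/n) * sum_i N(x; x_i, bandwidth^2)."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

def kde_classify(x_query, x_by_class, bandwidth=0.3):
    """Plug-in Bayes rule: argmax over classes of prior * estimated class density."""
    priors = np.array([len(xc) for xc in x_by_class], dtype=float)
    priors /= priors.sum()
    scores = np.stack([p * kde(x_query, xc, bandwidth)
                       for p, xc in zip(priors, x_by_class)], axis=1)
    return scores.argmax(axis=1)
```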

Generative vs Discriminative

Basic Text Modeling. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Bag of Words. n documents/articles with lots of text. Questions:
- How to get a feature representation of each article?
- How to cluster documents into topics?
Bag of words model: ith document x_i \in \mathbb{R}^D, where x_{i,j} = proportion of times the jth word occurred in the ith document. Given these vectors, run k-means or a Gaussian mixture model to find k clusters/topics.
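
One way to realize this bag-of-words-plus-clustering pipeline is with scikit-learn, sketched below; the toy corpus and the choice of k are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

docs = ["the senate passed the bill", "the team won the game",
        "the bill failed in the senate", "the game went to overtime"]

# x_{i,j} = count of word j in document i; normalize rows to get proportions
counts = CountVectorizer().fit_transform(docs)   # sparse n x D matrix
X = normalize(counts, norm="l1")                 # each row sums to 1

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster/topic assignment for each document
```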

Nonnegative matrix factorization (NMF). A \in \mathbb{R}^{m \times n}, with A_{i,j} = frequency of the jth word in document i. Nonnegative matrix factorization:

\min_{W \in \mathbb{R}^{m \times d}_+,\; H \in \mathbb{R}^{n \times d}_+} \|A - WH^T\|_F^2

where d is the number of topics. Each column of H represents a topic (a cluster of words); each row of W gives the weights that combine topics for a document. Also see latent Dirichlet allocation (LDA).
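
A sketch of the same factorization using scikit-learn's NMF. Note that scikit-learn factors A into W times an H of shape d x n, so its H plays the role of H^T above; the toy corpus and d are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = ["the senate passed the bill", "the team won the game",
        "the bill failed in the senate", "the game went to overtime"]

A = CountVectorizer().fit_transform(docs)   # A_{i,j}: count of word j in document i
d = 2                                       # number of topics
nmf = NMF(n_components=d, init="nndsvda", random_state=0)
W = nmf.fit_transform(A)                    # m x d: topic weights for each document
H = nmf.components_                         # d x n: word profile of each topic
print(W.round(2))
```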

Word embeddings, word2vec. The previous section presented methods to embed documents into a latent space; alternatively, we can embed words into a latent space. That embedding came from directly querying for relationships; word2vec is a popular unsupervised learning approach that just uses a text corpus (e.g., nytimes.com).

Word embeddings, word2vec. (Slide credit: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)

Word embeddings, word2vec. Train a neural network to predict co-occurring words; use the first-layer weights as the embedding and throw out the output layer.

Word embeddings, word2vec. Train a neural network to predict co-occurring words, e.g. predict "car" given "ants" with probability

P(\text{car} \mid \text{ants}) = \frac{e^{\langle x_{\text{ants}},\, y_{\text{car}}\rangle}}{\sum_i e^{\langle x_{\text{ants}},\, y_i\rangle}},

then use the first-layer weights as the embedding and throw out the output layer.

word2vec outputs: king - man + woman = queen; country - capital. (Slide credit: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)
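
Training a skip-gram model and querying analogies like the king/queen example can be done with gensim (4.x API shown); the tiny corpus below is a placeholder, so the analogy result will not be meaningful until it is replaced with a real corpus.

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, use a large tokenized text corpus.
sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"],
             ["the", "man", "walks"], ["the", "woman", "walks"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# The learned input-layer weights are the word embeddings.
vec_king = model.wv["king"]
# Analogy query: king - man + woman = ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```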

TF*IDF. n documents/articles with lots of text. How to get a feature representation of each article?
1. For each document d, compute the proportion of times word t occurs out of all words in d, i.e., the term frequency TF_{d,t}
2. For each word t in your corpus, compute the proportion of documents out of n in which word t occurs, i.e., the document frequency DF_t
3. Compute the score for word t in document d as TF_{d,t} \cdot \log\big(\tfrac{1}{DF_t}\big)
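
The three steps translate directly into numpy, as sketched below; note that scikit-learn's TfidfVectorizer uses a slightly different (smoothed) IDF, so its numbers would differ from this plain TF_{d,t} * log(1/DF_t) score.

```python
import numpy as np

docs = [["the", "senate", "passed", "the", "bill"],
        ["the", "team", "won", "the", "game"]]

vocab = sorted({w for d in docs for w in d})
idx = {w: j for j, w in enumerate(vocab)}
n, D = len(docs), len(vocab)

tf = np.zeros((n, D))          # TF_{d,t}: proportion of words in doc d equal to t
for i, d in enumerate(docs):
    for w in d:
        tf[i, idx[w]] += 1
    tf[i] /= len(d)

df = (tf > 0).mean(axis=0)     # DF_t: proportion of documents containing word t
tfidf = tf * np.log(1.0 / df)  # score for word t in document d
```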

BeerMapper - Under the Hood. The algorithm requires feature representations of the beers \{x_1, \dots, x_n\} \subset \mathbb{R}^d. Two Hearted Ale - input: ~2500 natural language reviews (http://www.ratebeer.com/beer/two-hearted-ale/1502/2/1/). Pipeline: reviews for each beer → bag of words weighted by TF*IDF → get 100 nearest neighbors using cosine distance → non-metric multidimensional scaling → embedding in d dimensions.

BeerMapper - Under the Hood. Two Hearted Ale - weighted bag of words.

BeerMapper - Under the Hood. Weighted count vector for the ith beer: z_i \in \mathbb{R}^{400{,}000}. Cosine distance:

d(z_i, z_j) = 1 - \frac{z_i^T z_j}{\|z_i\|\,\|z_j\|}.

Two Hearted Ale - nearest neighbors: Bear Republic Racer 5, Avery IPA, Stone India Pale Ale (IPA), Founders Centennial IPA, Smuttynose IPA, Anderson Valley Hop Ottin IPA, AleSmith IPA, BridgePort IPA, Boulder Beer Mojo IPA, Goose Island India Pale Ale, Great Divide Titan IPA, New Holland Mad Hatter Ale, Lagunitas India Pale Ale, Heavy Seas Loose Cannon Hop3, Sweetwater IPA.
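
A sketch of the nearest-neighbor step: given the TF*IDF-weighted count vectors z_i as rows of a matrix, compute cosine distances and keep the k nearest neighbors of each item; names and the value of k are illustrative.

```python
import numpy as np

def cosine_nearest_neighbors(Z, k=100):
    """Z: n x D matrix of TF*IDF-weighted count vectors (one row per beer)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize each row
    dist = 1.0 - Zn @ Zn.T                              # d(z_i, z_j) = 1 - cosine similarity
    np.fill_diagonal(dist, np.inf)                      # exclude self-matches
    return np.argsort(dist, axis=1)[:, :k]              # indices of the k nearest neighbors
```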

BeerMapper - Under the Hood. Find an embedding \{x_1, \dots, x_n\} \subset \mathbb{R}^d such that \|x_k - x_i\| < \|x_k - x_j\| whenever d(z_k, z_i) < d(z_k, z_j) (distance in the 400,000-dimensional word space) for all 100-nearest neighbors: about 10^7 constraints and 10^5 variables. Solve with hinge loss and stochastic gradient descent (20 minutes on my laptop): d=2 gives err=6%, d=3 gives err=4%. Could have also used locally linear embedding, maximum variance unfolding, kernel PCA, etc.

BeerMapper - Under the Hood. Embedding of the beers in d dimensions.

BeerMapper - Under the Hood. Sanity check: styles should cluster together and similar styles should be close. Styles shown: Pilsner, IPA, light lager, pale ale, blond, amber, brown ale, doppelbock, Belgian light, wit, Belgian dark, lambic, wheat, porter, stout.

Feature generation for images. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Contains slides from LeCun & Ranzato, Russ Salakhutdinov, Honglak Lee, and Google Images.

Convolution of images. (Note to EEs: deep learning uses the word convolution to mean what is usually known as cross-correlation, i.e., neither signal is flipped.) Image I, filter K, output I * K. (Slide credit: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)

Convolution of images. (Note to EEs: deep learning uses the word convolution to mean what is usually known as cross-correlation, i.e., neither signal is flipped.) Image I, filter K, output I * K.

Convolution of images. Input image X; convolve with filters H_k to get convolved images H_k * X, then flatten into a vector

\begin{bmatrix} \mathrm{vec}(H_1 * X) \\ \mathrm{vec}(H_2 * X) \\ \vdots \end{bmatrix}
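
A minimal "valid" cross-correlation in numpy, i.e., what deep learning calls convolution (no flipping of the filter); the sizes are chosen so a 32x32 image and a 6x6 filter give the 27x27 output used on the following slides.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlation of a 2-d image with a 2-d filter (no flipping), 'valid' size."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

# Example: a 32x32 image convolved with a 6x6 filter gives a 27x27 output.
img = np.random.rand(32, 32)
filt = np.random.rand(6, 6)
print(conv2d_valid(img, filt).shape)   # (27, 27)
```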

Stacking convolved images: convolving the input with 64 filters of size 6×6×3 gives a 27×27×64 output.

Stacking convolved images: convolving the input with 64 filters of size 6×6×3 gives a 27×27×64 output. Apply a non-linearity to the output of each layer; here, ReLU (rectified linear unit). Other choices: sigmoid, arctan.

Pooling. Pooling reduces the dimension and can be interpreted as "this filter had a high response in this general region": 27×27×64 → 14×14×64.

Pooling. Convolution layer: convolve with 64 filters of size 6×6×3 (giving 27×27×64), then MaxPool with 2×2 filters and stride 2 (giving 14×14×64).

Full feature pipeline. Convolve with 64 filters of size 6×6×3, MaxPool with 2×2 filters and stride 2 (giving 14×14×64), then flatten into a single vector of size 14*14*64 = 12544. How do we choose all the hyperparameters? How do we choose the filters?
- Hand design them (digital signal processing, cf. wavelets)
- Learn them (deep learning)
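
A numpy sketch of this full pipeline for a single image, under the assumption of a 32x32x3 input and 64 random 6x6x3 filters: convolve, apply ReLU, max-pool with 2x2 windows and stride 2 (padding the odd edge so 27 maps to 14), and flatten to a 12544-dimensional feature vector.

```python
import numpy as np

def conv_bank(image, filters):
    """image: H x W x C, filters: K x h x w x C -> (H-h+1) x (W-w+1) x K cross-correlation."""
    H, W, C = image.shape
    K, h, w, _ = filters.shape
    out = np.zeros((H - h + 1, W - w + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(image[i:i + h, j:j + w, :] * filters[k])
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2, stride=2):
    """2x2 max pooling with stride 2; pads odd edges with -inf so 27x27 -> 14x14."""
    H, W, K = x.shape
    Hp, Wp = -(-H // stride), -(-W // stride)          # ceiling division
    padded = np.full((Hp * stride, Wp * stride, K), -np.inf)
    padded[:H, :W, :] = x
    out = np.zeros((Hp, Wp, K))
    for i in range(Hp):
        for j in range(Wp):
            out[i, j] = padded[i*stride:i*stride+size, j*stride:j*stride+size].max(axis=(0, 1))
    return out

image = np.random.rand(32, 32, 3)
filters = np.random.randn(64, 6, 6, 3)
features = max_pool(relu(conv_bank(image, filters))).reshape(-1)
print(features.shape)   # (12544,) = 14 * 14 * 64
```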

Some hand-created image features: SIFT, Spin Image, HoG, RIFT, Texton, GLOH. (Slide from Honglak Lee)

ML Street Fight. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Mini case study. Inspired by Coates and Ng (2012). Input is the CIFAR-10 dataset: 50000 examples of 32×32×3 images.
1. Construct a set of patches by random selection from the images
2. Standardize the patch set (de-mean, norm 1, whiten, etc.)
3. Run k-means on the random patches
4. Convolve each image with all patches (plus an offset)
5. Push through ReLU
6. Solve least squares for multiclass classification
7. Classify with argmax
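
A rough sketch of steps 1-3 and 6 under placeholder sizes; it is not the exact Coates and Ng recipe, and the convolution/ReLU steps (4-5) would reuse a filter bank built from the k-means centroids, as in the earlier pipeline sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_patches(images, n_patches=10000, size=6, seed=0):
    """Step 1: sample random size x size x 3 patches from a stack of 32x32x3 images."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(images), n_patches)
    rows = rng.integers(0, 32 - size + 1, n_patches)
    cols = rng.integers(0, 32 - size + 1, n_patches)
    return np.stack([images[i, r:r+size, c:c+size, :].ravel()
                     for i, r, c in zip(idx, rows, cols)])

def standardize(P):
    """Step 2: de-mean each patch and scale it to unit norm (whitening omitted here)."""
    P = P - P.mean(axis=1, keepdims=True)
    return P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-8)

def learn_filters(images, n_filters=256):
    """Step 3: k-means centroids of standardized random patches become the filters."""
    patches = standardize(random_patches(images))
    return KMeans(n_clusters=n_filters, n_init=10).fit(patches).cluster_centers_

def least_squares_classifier(F, y, n_classes=10, reg=1e-3):
    """Step 6: one-hot least squares on features F; step 7 is the argmax of F @ W."""
    Y = np.eye(n_classes)[y]
    return np.linalg.solve(F.T @ F + reg * np.eye(F.shape[1]), F.T @ Y)
```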

Mini case study. Methods of standardization:

Mini case study. Dealing with class imbalance:

Mini case study. Dealing with outliers:

Mini case study. Dealing with outliers: replace the squared loss with the Huber loss

\ell_{\text{huber}}(z) = \begin{cases} \tfrac{1}{2} z^2 & \text{if } |z| \le 1 \\ |z| - \tfrac{1}{2} & \text{otherwise} \end{cases}

i.e., replace

\arg\min_\alpha \sum_{i=1}^n \Big(\sum_j k(x_i, x_j)\,\alpha_j - y_i\Big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j)

with

\arg\min_\alpha \sum_{i=1}^n \ell_{\text{huber}}\Big(\sum_j k(x_i, x_j)\,\alpha_j - y_i\Big) + \lambda \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j).
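
The two objectives differ only in the loss applied to the residual \sum_j k(x_i, x_j)\alpha_j - y_i. A small sketch of the Huber version fit by plain gradient descent; the kernel matrix K, regularization, and step size are placeholder assumptions.

```python
import numpy as np

def huber(z):
    """Huber loss: 0.5*z^2 for |z| <= 1, |z| - 0.5 otherwise."""
    return np.where(np.abs(z) <= 1, 0.5 * z**2, np.abs(z) - 0.5)

def huber_grad(z):
    return np.where(np.abs(z) <= 1, z, np.sign(z))

def kernel_fit_huber(K, y, lam=1e-2, lr=1e-3, n_iters=2000):
    """Minimize sum_i huber((K @ alpha - y)_i) + lam * alpha.T @ K @ alpha by gradient descent."""
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        r = K @ alpha - y
        grad = K @ huber_grad(r) + 2 * lam * (K @ alpha)   # K is symmetric
        alpha -= lr * grad
    return alpha
```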

Mini case study. Dealing with hyperparameters:

Hyperparameter Optimization. Machine Learning CSE546, Kevin Jamieson, University of Washington, November 20, 2017

Network setup: N_in = 784 inputs, N_hid hidden nodes, N_out = 10 outputs; a training set and an eval set. Hyperparameters: learning rate \in [10^{-3}, 10^{-1}], \ell_2-penalty \in [10^{-6}, 10^{-1}], number of hidden nodes N_hid \in [10^1, 10^3].

Sampled hyperparameters (learning rate, \ell_2-penalty, # hidden nodes) and eval-loss:
(10^{-1.6}, 10^{-2.4}, 10^{1.7}): 0.0577
(10^{-1.0}, 10^{-1.2}, 10^{2.6}): 0.182
(10^{-1.2}, 10^{-5.7}, 10^{1.4}): 0.0436
(10^{-2.4}, 10^{-2.0}, 10^{2.9}): 0.0919
(10^{-2.6}, 10^{-2.9}, 10^{1.9}): 0.0575
(10^{-2.7}, 10^{-2.5}, 10^{2.4}): 0.0765
(10^{-1.8}, 10^{-1.4}, 10^{2.6}): 0.1196
(10^{-1.4}, 10^{-2.1}, 10^{1.5}): 0.0834
(10^{-1.9}, 10^{-5.8}, 10^{2.1}): 0.0242
(10^{-1.8}, 10^{-5.6}, 10^{1.7}): 0.029

How do we choose hyperparameters to train and evaluate?

How do we choose hyperparameters to train and evaluate?
Grid search: hyperparameters on a 2d uniform grid
Random search: hyperparameters randomly chosen (a minimal sketch follows below)
Bayesian optimization: hyperparameters adaptively chosen
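
A minimal random-search loop over the log-uniform ranges from the earlier slide; train_and_eval is a placeholder for whatever routine trains the network with the given hyperparameters and returns the eval-loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    # Log-uniform draws over the ranges on the earlier slide.
    lr = 10 ** rng.uniform(-3, -1)        # learning rate in [1e-3, 1e-1]
    l2 = 10 ** rng.uniform(-6, -1)        # l2-penalty in [1e-6, 1e-1]
    n_hid = int(10 ** rng.uniform(1, 3))  # hidden nodes in [10, 1000]
    return lr, l2, n_hid

def random_search(train_and_eval, n_trials=20):
    """Return (best eval-loss, best hyperparameters) over n_trials random draws."""
    best = (np.inf, None)
    for _ in range(n_trials):
        params = sample_hyperparameters()
        loss = train_and_eval(*params)    # e.g., the eval-loss column in the table above
        best = min(best, (loss, params))
    return best
```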

Bayesian optimization: hyperparameters adaptively chosen. How does it work?

Recent work attempts to speed up hyperparameter evaluation by stopping poor performing settings before they are fully trained.

Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw Bayesian optimization. arXiv:1406.3896, 2014.
Alekh Agarwal, Peter Bartlett, and John Duchi. Oracle inequalities for computationally adaptive model selection. COLT, 2012.
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. IJCAI, 2015.
András György and Levente Kocsis. Efficient multi-start strategies for local search algorithms. JAIR, 41, 2011.
Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. ICLR, 2016.

[Figure: eval-loss vs. epochs for the hyperparameter settings in the table above, showing how computation time was spent.]

Hyperparameter Optimization. In general, hyperparameter optimization is non-convex optimization, and little is known about the underlying function (we only observe the validation loss).
Your time is valuable, computers are cheap: do not employ grad-student descent for hyperparameter search. Write modular code that takes parameters as input and automate this embarrassingly parallel search. Use crowd resources (see pywren).
Tools for different purposes:
- Very few evaluations: use random search (and pray) or be clever
- Few evaluations and long-running computations: see the references on the previous slide
- Moderate number of evaluations (but still exp(#params)) and high accuracy needed: use Bayesian optimization
- Many evaluations possible: use random search. Why overthink it?