Announcements

- HW3 due tonight
- HW4 posted
- No class Thursday (Thanksgiving)
Mixtures of Gaussians
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Mixture models

Z = \{y_i\}_{i=1}^n is observed data; \Delta = \{\Delta_i\}_{i=1}^n is unobserved data.

If \phi_\theta(x) is the Gaussian density with parameters \theta = (\mu, \sigma^2), then the complete-data log-likelihood is

\ell(\theta; Z, \Delta) = \sum_{i=1}^n \Big[ (1 - \Delta_i) \log\big[(1 - \pi)\, \phi_{\theta_1}(y_i)\big] + \Delta_i \log\big[\pi\, \phi_{\theta_2}(y_i)\big] \Big]

and the responsibility of component 2 for point i is

\gamma_i(\theta) = E[\Delta_i \mid \theta, Z] = \frac{\pi\, \phi_{\theta_2}(y_i)}{(1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i)}

A short numeric sketch follows below.
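To make the responsibility formula concrete, here is a minimal numpy sketch; the parameter values and data points are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Two-component 1D Gaussian mixture: (1 - pi) * N(mu1, s1^2) + pi * N(mu2, s2^2).
# Hypothetical parameter values, for illustration only.
pi, mu1, s1, mu2, s2 = 0.4, 0.0, 1.0, 4.0, 1.5
y = np.array([-0.5, 0.2, 3.8, 5.1])  # observed data

# Responsibility gamma_i = E[Delta_i | theta, Z]: posterior probability
# that point y_i came from component 2.
p1 = (1 - pi) * norm.pdf(y, mu1, s1)
p2 = pi * norm.pdf(y, mu2, s2)
gamma = p2 / (p1 + p2)
print(gamma)  # near 0 for points close to mu1, near 1 for points close to mu2
```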
Mixture models

(figures: Gaussian mixture EM example — start; after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations)

(figures: some bio-assay data; GMM clustering of the assay data; the resulting density estimator)
Expectation Maximization Algorithm

Observe data x_1, \ldots, x_n drawn from a distribution p(x \mid \theta) for some \theta \in \Theta:

\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)

\sum_{i=1}^n \log p(x_i \mid \theta)
 = \sum_{i=1}^n \log\Big( \sum_j p(x_i, z_i = j \mid \theta) \Big)   (introduce hidden data z_i)
 = \sum_{i=1}^n \log\Big( \sum_j q_i(z_i = j \mid \theta')\, \frac{p(x_i, z_i = j \mid \theta)}{q_i(z_i = j \mid \theta')} \Big)   (introduce dummy distribution q_i and variable \theta')
 \ge \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta') \log \frac{p(x_i, z_i = j \mid \theta)}{q_i(z_i = j \mid \theta')}   (Jensen's inequality; log is concave)
 = \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta') \log p(x_i, z_i = j \mid \theta) + \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta') \log \frac{1}{q_i(z_i = j \mid \theta')}

The last term does not depend on \theta!
Expectation Maximization Algorithm

Observe data X = [x_1, \ldots, x_n] drawn from a distribution p(x \mid \theta) for some \theta \in \Theta:

\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)

and, from the previous slide,

\sum_{i=1}^n \log p(x_i \mid \theta) \ge \sum_{i=1}^n \sum_j q_i(z_i = j \mid \theta') \log p(x_i, z_i = j \mid \theta) + (\text{const in } \theta)

This is true for any choice of \theta' and distribution q_i(z_i = j \mid \theta'). Set q_i(z_i = j \mid \theta') = p(z_i = j \mid \theta', X).
Expectation Maximization Algorithm

With q_i(z_i = j \mid \theta') = p(z_i = j \mid \theta', X), the lower bound becomes

Q(\theta, \theta') := \sum_{i=1}^n \sum_j p(z_i = j \mid \theta', X) \log p(x_i, z_i = j \mid \theta)

Initial guess \theta^{(0)}; for each step k:
E-step: compute Q(\theta, \theta^{(k)}) = \sum_{i=1}^n E_{z_i}\big[ \log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, X \big]
M-step: find \theta^{(k+1)} = \arg\max_\theta Q(\theta, \theta^{(k)})
Expectation Maximization Algorithm

Initial guess \theta^{(0)}; for each step k:
E-step: compute Q(\theta, \theta^{(k)}) = \sum_{i=1}^n E_{z_i}\big[ \log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, X \big]
M-step: find \theta^{(k+1)} = \arg\max_\theta Q(\theta, \theta^{(k)})

Example: observe x_1, \ldots, x_n drawn from (1 - \pi) N(\mu_1, \sigma_1^2) + \pi N(\mu_2, \sigma_2^2), where z_i = j if point i is in mixture component j for j \in \{1, 2\}, and \theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2).

E_{z_i}[\log p(x_i, z_i \mid \theta) \mid \theta^{(k)}, X]
 = p(z_i = 1 \mid \theta^{(k)}, x_i) \log p(x_i, z_i = 1 \mid \theta) + p(z_i = 2 \mid \theta^{(k)}, x_i) \log p(x_i, z_i = 2 \mid \theta)
 = p(z_i = 1 \mid \theta^{(k)}, x_i) \log\big[ p(x_i \mid z_i = 1, \theta)\, p(z_i = 1 \mid \theta) \big] + p(z_i = 2 \mid \theta^{(k)}, x_i) \log\big[ p(x_i \mid z_i = 2, \theta)\, p(z_i = 2 \mid \theta) \big]

Writing \phi(x; \mu, \sigma^2) for the Gaussian density, this equals

 \frac{(1 - \pi^{(k)})\, \phi(x_i; \mu_1^{(k)}, \sigma_1^{2(k)})}{(1 - \pi^{(k)})\, \phi(x_i; \mu_1^{(k)}, \sigma_1^{2(k)}) + \pi^{(k)}\, \phi(x_i; \mu_2^{(k)}, \sigma_2^{2(k)})} \log\big[ (1 - \pi)\, \phi(x_i; \mu_1, \sigma_1^2) \big]
 + \frac{\pi^{(k)}\, \phi(x_i; \mu_2^{(k)}, \sigma_2^{2(k)})}{(1 - \pi^{(k)})\, \phi(x_i; \mu_1^{(k)}, \sigma_1^{2(k)}) + \pi^{(k)}\, \phi(x_i; \mu_2^{(k)}, \sigma_2^{2(k)})} \log\big[ \pi\, \phi(x_i; \mu_2, \sigma_2^2) \big]

A numpy sketch of this EM example follows below.
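A minimal numpy sketch of EM for this two-component example. The initialization, iteration count, and synthetic data are assumptions for illustration, not from the slides; the M-step uses the standard closed-form responsibility-weighted updates.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=20):
    """EM for (1 - pi) * N(mu1, s1^2) + pi * N(mu2, s2^2) on 1D data x."""
    # Crude initial guess theta^(0) (an illustrative assumption).
    pi, mu1, mu2 = 0.5, x.min(), x.max()
    s1 = s2 = x.std()
    for _ in range(n_iter):
        # E-step: responsibilities p(z_i = 2 | theta^(k), x_i).
        p1 = (1 - pi) * norm.pdf(x, mu1, s1)
        p2 = pi * norm.pdf(x, mu2, s2)
        g = p2 / (p1 + p2)
        # M-step: closed-form maximizers of Q(theta, theta^(k)).
        pi = g.mean()
        mu1 = ((1 - g) * x).sum() / (1 - g).sum()
        mu2 = (g * x).sum() / g.sum()
        s1 = np.sqrt(((1 - g) * (x - mu1) ** 2).sum() / (1 - g).sum())
        s2 = np.sqrt((g * (x - mu2) ** 2).sum() / g.sum())
    return pi, mu1, s1, mu2, s2

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1.5, 200)])
print(em_gmm_1d(x))  # should roughly recover pi=0.4, mu1=0, mu2=4
```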
Expectation Maximization Algorithm

- EM is used to fit latent factor models
- Also used to solve missing-data problems
- Known as the Baum-Welch algorithm when applied to Hidden Markov Models
- In general, the EM objective is non-convex, so EM can get stuck in local optima
Density Estimation
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Kernel Density Estimation

A very lazy GMM: rather than fitting a few Gaussians, place a kernel at every data point,

\hat{p}(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i),

where K_h is, e.g., a Gaussian density with bandwidth h.
Kernel Density Estimation

What is the Bayes optimal classification rule? Predict \arg\max_m \hat{r}_{i,m}, where \hat{r}_{i,m} \propto \hat{\pi}_m\, \hat{p}_m(x_i) is the estimated posterior of class m: the class proportion times a per-class kernel density estimate. A sketch follows below.
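A hedged sketch of this plug-in classifier using scipy's gaussian_kde; the two-class data and the default bandwidth rule are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Two-class toy data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
x0 = rng.normal(0, 1, (200, 2))   # class 0
x1 = rng.normal(3, 1, (100, 2))   # class 1

# One kernel density estimate per class; gaussian_kde expects shape (d, n).
kde0, kde1 = gaussian_kde(x0.T), gaussian_kde(x1.T)
pi0, pi1 = 200 / 300, 100 / 300   # class proportions

def predict(x):
    # Plug-in Bayes rule: argmax_m pi_m * p_m(x).
    scores = np.array([pi0 * kde0(x.T), pi1 * kde1(x.T)])
    return scores.argmax(axis=0)

print(predict(np.array([[0.0, 0.0], [3.0, 3.0]])))  # should print [0 1]
```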
Generative vs Discriminative
Basic Text Modeling
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Bag of Words

n documents/articles with lots of text. Questions:
- How do we get a feature representation of each article?
- How do we cluster documents into topics?

Bag of words model: the ith document is x_i \in R^D, where x_{i,j} = proportion of times the jth word occurred in the ith document.

Given these vectors, run k-means or a Gaussian mixture model to find k clusters/topics, as sketched below.
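A minimal scikit-learn sketch of this recipe; the toy corpus and k = 2 are hypothetical stand-ins for real articles and topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy corpus (hypothetical; stands in for n articles).
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell as markets slid",
        "investors sold stocks today"]

# x_i in R^D with word proportions: raw counts normalized per document.
vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().astype(float)
X /= X.sum(axis=1, keepdims=True)   # each row now sums to 1

# Cluster documents into k topics.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # pet documents vs. finance documents
```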
Nonnegative matrix factorization (NMF)

A \in R^{m \times n}, with A_{i,j} = frequency of the jth word in document i.

Nonnegative matrix factorization:

\min_{W \in R^{m \times d}_+,\, H \in R^{n \times d}_+} \|A - W H^T\|_F^2

where d is the number of topics. Each column of H represents a topic (a cluster of words), and each row of W gives the weights combining topics for one document; a sketch follows below.

Also see latent Dirichlet allocation (LDA).
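A minimal scikit-learn sketch of NMF topic modeling; the corpus and d = 2 are illustrative. Note that sklearn's `components_` already stores H transposed relative to the slide's A ≈ WH^T.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks fell today", "investors bought stocks"]

# A[i, j] = count of word j in document i.
vec = CountVectorizer()
A = vec.fit_transform(docs)

# Factor A ~ W H^T with W, H >= 0; d = 2 topics.
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)     # (m, d): topic weights per document
H_T = model.components_        # (d, n): one topic per row (this is H^T)

words = vec.get_feature_names_out()
for k, topic in enumerate(H_T):
    top = words[topic.argsort()[::-1][:3]]
    print(f"topic {k}: {list(top)}")   # top words per topic
```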
Word embeddings, word2vec

The previous section presented methods to embed documents into a latent space. Alternatively, we can embed individual words into a latent space. Earlier embeddings were built by directly querying for relationships; word2vec is a popular unsupervised learning approach that just uses a text corpus (e.g., nytimes.com).
Word embeddings, word2vec

Train a neural network to predict co-occurring words. Use the first-layer weights as the embedding and throw out the output layer.

(slide: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Word embeddings, word2vec

The output layer is a softmax over inner products, e.g., the probability that "car" co-occurs with "ants":

\frac{e^{\langle x_{\text{ants}},\, y_{\text{car}} \rangle}}{\sum_i e^{\langle x_{\text{ants}},\, y_i \rangle}}

Train the neural network to predict co-occurring words; use the first-layer weights x as the embedding and throw out the output layer y. A numeric sketch follows below.

(slide: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
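A small numeric sketch of the skip-gram softmax above; the vocabulary, dimension, and random vectors are placeholders (a real word2vec model learns X and Y from co-occurrence data).

```python
import numpy as np

# Hypothetical tiny vocabulary and embedding dimension, for illustration.
vocab = ["ants", "car", "bike", "sugar"]
d, V = 5, len(vocab)
rng = np.random.default_rng(0)
X = rng.normal(size=(V, d))   # input (embedding) vectors x_w
Y = rng.normal(size=(V, d))   # output-layer vectors y_w

def p_context_given_center(center, context):
    # Skip-gram softmax: exp(<x_center, y_context>) / sum_i exp(<x_center, y_i>)
    scores = Y @ X[vocab.index(center)]
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return probs[vocab.index(context)]

print(p_context_given_center("ants", "car"))
```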
word2vec outputs

The learned embedding has linear structure: king - man + woman ≈ queen, and country - capital vector offsets line up.

(slide: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)
TF*IDF

n documents/articles with lots of text. How do we get a feature representation of each article?

1. For each document d, compute the proportion of times word t occurs out of all words in d, i.e., the term frequency TF_{d,t}
2. For each word t in the corpus, compute the proportion of the n documents in which word t occurs, i.e., the document frequency DF_t
3. Compute the score for word t in document d as TF_{d,t} \cdot \log(1 / DF_t); a sketch follows below
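A direct numpy sketch of these three steps; the toy corpus is hypothetical.

```python
import numpy as np

# Toy corpus (hypothetical). Each document is a list of words.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
vocab = sorted({w for d in docs for w in d})
n, D = len(docs), len(vocab)

# Step 1 -- TF[d, t]: proportion of words in document d equal to word t.
TF = np.zeros((n, D))
for i, d in enumerate(docs):
    for w in d:
        TF[i, vocab.index(w)] += 1 / len(d)

# Step 2 -- DF[t]: proportion of documents containing word t.
DF = (TF > 0).mean(axis=0)

# Step 3 -- score: TF * log(1 / DF). Words in every document get score 0.
tfidf = TF * np.log(1 / DF)
print(dict(zip(vocab, np.round(tfidf[0], 3))))
```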
BeerMapper - Under the Hood

The algorithm requires feature representations of the beers \{x_1, \ldots, x_n\} \subset R^d.

Two Hearted Ale - Input: ~2500 natural-language reviews (http://www.ratebeer.com/beer/two-hearted-ale/1502/2/1/)

Pipeline: reviews for each beer → bag of words weighted by TF*IDF → 100 nearest neighbors by cosine distance → non-metric multidimensional scaling → embedding in d dimensions.
BeerMapper - Under the Hood

Two Hearted Ale - Weighted Bag of Words: (figure)
BeerMapper - Under the Hood

Weighted count vector for the ith beer: z_i \in R^{400{,}000}.

Cosine distance: d(z_i, z_j) = 1 - \frac{z_i^T z_j}{\|z_i\| \|z_j\|}

Two Hearted Ale - Nearest Neighbors: Bear Republic Racer 5, Avery IPA, Stone India Pale Ale (IPA), Founders Centennial IPA, Smuttynose IPA, Anderson Valley Hop Ottin IPA, AleSmith IPA, BridgePort IPA, Boulder Beer Mojo IPA, Goose Island India Pale Ale, Great Divide Titan IPA, New Holland Mad Hatter Ale, Lagunitas India Pale Ale, Heavy Seas Loose Cannon Hop3, Sweetwater IPA
BeerMapper - Under the Hood

Find an embedding \{x_1, \ldots, x_n\} \subset R^d such that \|x_k - x_i\| < \|x_k - x_j\| whenever d(z_k, z_i) < d(z_k, z_j) (distances measured in the 400,000-dimensional word space), for all 100-nearest neighbors: about 10^7 constraints on 10^5 variables.

Solve with a hinge loss and stochastic gradient descent (20 minutes on my laptop); d=2 gives err=6%, d=3 gives err=4%. A sketch of the hinge-loss SGD appears below.

Could also have used locally linear embedding, maximum-volume unfolding, kernel PCA, etc.
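A minimal sketch of the hinge-loss SGD idea under assumed details: the margin of 1, the use of squared distances, and the step size are illustrative choices, not the exact code behind these slides.

```python
import numpy as np

def ordinal_embed(triples, n, d=2, steps=100000, lr=0.01, seed=0):
    """Embed n items in R^d from triples (k, i, j), each meaning:
    item k should be closer to item i than to item j.
    Loss per triple: max(0, 1 + ||x_k - x_i||^2 - ||x_k - x_j||^2)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n, d))
    for _ in range(steps):
        k, i, j = triples[rng.integers(len(triples))]
        gi, gj = X[k] - X[i], X[k] - X[j]
        if 1 + gi @ gi - gj @ gj > 0:    # constraint violated (within margin)
            X[k] -= lr * 2 * (gi - gj)   # gradients of the active hinge term
            X[i] += lr * 2 * gi
            X[j] -= lr * 2 * gj
    return X

# Usage: X = ordinal_embed([(0, 1, 2), (2, 1, 0)], n=3)
```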
BeerMapper - Under the Hood

(figures: the resulting embedding)

Sanity check: styles should cluster together and similar styles should be close.

(figure labels: Pilsner, IPA, Light lager, Pale ale, Blond, Amber, Brown ale, Doppelbock, Belgian light, Wit, Belgian dark, Lambic, Wheat, Porter, Stout)
Feature generation for images
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Contains slides from: LeCun & Ranzato, Russ Salakhutdinov, Honglak Lee, Google images
Convolution of images

(Note to EEs: deep learning uses the word "convolution" to mean what is usually known as cross-correlation, i.e., neither signal is flipped.)

(figures: image I, filter K, and the output I * K)

(slide credit: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)
Convolution of images

Input image X, filters H_k, convolved images H_k * X. Flatten the convolved images into a single vector:

\begin{bmatrix} \mathrm{vec}(H_1 * X) \\ \mathrm{vec}(H_2 * X) \\ \vdots \end{bmatrix}

A scipy sketch of this "convolution" follows below.
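A short scipy sketch showing that the deep-learning "convolution" is cross-correlation; the image and filter values are toy stand-ins.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

# Deep-learning "convolution" is cross-correlation: the filter is not flipped.
I = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 single-channel image
K = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 filter

out = correlate2d(I, K, mode="valid")          # shape (5-2+1, 5-2+1) = (4, 4)
print(out)

# A true convolution flips the filter; flipping it back recovers the same result.
assert np.allclose(convolve2d(I, K[::-1, ::-1].copy(), mode="valid"), out)
```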
Stacking convolved images

(figure: a 27x27x3 image convolved with 64 filters of size 6x6x3, with the outputs stacked into a 27x27x64 volume)
Stacking convolved images

Apply a nonlinearity to the output of each layer. Here: ReLU (rectified linear unit), z → max(0, z). Other choices: sigmoid, arctan.
Pooling

Pooling reduces the dimension and can be interpreted as "this filter had a high response in this general region": 27x27x64 → 14x14x64.
Pooling

Convolution layer: convolve the 27x27x3 input with 64 6x6x3 filters, then MaxPool with 2x2 filters and stride 2 to get 14x14x64.
Full feature pipeline

Convolve with 64 6x6x3 filters → MaxPool with 2x2 filters and stride 2 → flatten into a single vector of size 14*14*64 = 12544. A sketch of the full pipeline follows below.

How do we choose all the hyperparameters? How do we choose the filters?
- Hand-design them (digital signal processing, cf. wavelets)
- Learn them (deep learning)
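A PyTorch sketch of this pipeline under stated assumptions: the random filters stand in for hand-designed or learned ones, and `padding=3` is a choice made so a 6x6 filter on a 27x27 input yields shapes matching the slide's 14x14x64 after pooling (the slide does not specify the padding).

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 27, 27)        # one 27x27 RGB image (random stand-in)
filters = torch.randn(64, 3, 6, 6)   # 64 filters of size 6x6x3 (random stand-ins)

h = F.conv2d(x, filters, padding=3)  # -> (1, 64, 28, 28); padding is an assumption
h = F.relu(h)                        # nonlinearity from the earlier slide
h = F.max_pool2d(h, 2, 2)            # 2x2 max-pool, stride 2 -> (1, 64, 14, 14)
v = h.flatten(1)                     # single feature vector of size 64*14*14 = 12544
print(v.shape)                       # torch.Size([1, 12544])
```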
Some hand-created image features

SIFT, Spin Image, HoG, RIFT, Texton, GLOH

(slide from Honglak Lee)
ML Street Fight
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Mini case study

Inspired by Coates and Ng (2012). Input is the CIFAR-10 dataset: 50000 examples of 32x32x3 images.

1. Construct a set of patches by random selection from the images
2. Standardize the patch set (de-mean, norm 1, whiten, etc.)
3. Run k-means on the random patches
4. Convolve each image with all patches (plus an offset)
5. Push through ReLU
6. Solve least squares for multiclass classification
7. Classify with argmax

A condensed sketch of steps 1-3 appears below.
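A condensed sketch of steps 1-3 (patch extraction, standardization, k-means); whitening is omitted for brevity, and the patch size, patch count, and number of centers are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_patches(images, n_patches=1000, p=6, seed=0):
    """Steps 1-3 on images of shape (n, 32, 32, 3). Whitening omitted."""
    rng = np.random.default_rng(seed)
    patches = np.empty((n_patches, p * p * 3))
    for k in range(n_patches):
        i = rng.integers(len(images))                 # step 1: random image...
        r, c = rng.integers(0, 32 - p, size=2)        # ...and random location
        patches[k] = images[i, r:r + p, c:c + p, :].ravel()
    # Step 2: de-mean each patch and scale it to unit norm.
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8
    # Step 3: k-means; the centers become the convolution filters.
    return KMeans(n_clusters=64, n_init=10, random_state=0).fit(patches).cluster_centers_

images = np.random.default_rng(1).random((100, 32, 32, 3))  # stand-in for CIFAR-10
centers = random_patches(images)
print(centers.shape)  # (64, 108): 64 filters of size 6x6x3
```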
Mini case study

Methods of standardization: (figure)

Dealing with class imbalance: (figure)
Mini case study

Dealing with outliers: replace the squared loss with the Huber loss,

\ell_{\text{huber}}(z) = \begin{cases} \frac{1}{2} z^2 & \text{if } |z| \le 1 \\ |z| - \frac{1}{2} & \text{otherwise} \end{cases}

Least squares kernel regression:

\arg\min_\alpha \sum_{i=1}^n \Big( \sum_j k(x_i, x_j)\, \alpha_j - y_i \Big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j)

Robust version:

\arg\min_\alpha \sum_{i=1}^n \ell_{\text{huber}}\Big( \sum_j k(x_i, x_j)\, \alpha_j - y_i \Big) + \lambda \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j)

A gradient-descent sketch follows below.
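A sketch of the robust objective minimized by plain gradient descent; the step size, iteration count, and the RBF kernel in the usage example are illustrative assumptions (the slides do not specify a solver).

```python
import numpy as np

def huber(z):
    # l_huber(z) = z^2 / 2 for |z| <= 1, |z| - 1/2 otherwise
    return np.where(np.abs(z) <= 1, 0.5 * z ** 2, np.abs(z) - 0.5)

def huber_grad(z):
    # derivative: z for |z| <= 1, sign(z) otherwise (a clipped residual)
    return np.clip(z, -1, 1)

def robust_kernel_regression(K, y, lam=0.1, lr=0.01, steps=2000):
    """Gradient descent on sum_i huber((K alpha - y)_i) + lam * alpha^T K alpha."""
    alpha = np.zeros(len(y))
    for _ in range(steps):
        r = K @ alpha - y
        grad = K @ huber_grad(r) + 2 * lam * (K @ alpha)  # K is symmetric
        alpha -= lr * grad
    return alpha

# Usage with an RBF kernel on toy 1-d data containing one outlier:
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x); y[5] += 5.0               # inject an outlier
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)
alpha = robust_kernel_regression(K, y)
```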
Mini case study

Dealing with hyperparameters: see the next section.
Hyperparameter Optimization
Machine Learning CSE546
Kevin Jamieson
University of Washington
November 20, 2017
Setup: a one-hidden-layer network with N_in = 784 inputs, N_hid hidden nodes, and N_out = 10 outputs, trained on a training set and scored on a held-out evaluation set.

Hyperparameters:
- learning rate ∈ [10^-3, 10^-1]
- ℓ2-penalty λ ∈ [10^-6, 10^-1]
- # hidden nodes N_hid ∈ [10^1, 10^3]
Train the network for each sampled setting, then score it on the evaluation set:

Hyperparameters (learning rate, ℓ2-penalty, N_hid)    Eval-loss
(10^-1.6, 10^-2.4, 10^1.7)                             0.0577
(10^-1.0, 10^-1.2, 10^2.6)                             0.182
(10^-1.2, 10^-5.7, 10^1.4)                             0.0436
(10^-2.4, 10^-2.0, 10^2.9)                             0.0919
(10^-2.6, 10^-2.9, 10^1.9)                             0.0575
(10^-2.7, 10^-2.5, 10^2.4)                             0.0765
(10^-1.8, 10^-1.4, 10^2.6)                             0.1196
(10^-1.4, 10^-2.1, 10^1.5)                             0.0834
(10^-1.9, 10^-5.8, 10^2.1)                             0.0242
(10^-1.8, 10^-5.6, 10^1.7)                             0.029

How do we choose hyperparameters to train and evaluate?
How do we choose hyperparameters to train and evaluate?

- Grid search: hyperparameters on a 2-d uniform grid
- Random search: hyperparameters randomly chosen
- Bayesian optimization: hyperparameters adaptively chosen (figure: points numbered 1-16 in the order they were evaluated)

A minimal random-search sketch follows below.
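A minimal random-search sketch over the three hyperparameter ranges from the earlier slide; `train_and_eval` is a hypothetical function standing in for actually training the network and returning the eval-loss.

```python
import numpy as np

def random_search(train_and_eval, n_trials=10, seed=0):
    """Sample hyperparameters log-uniformly from their ranges; keep the best."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None)
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-3, -1)       # learning rate in [1e-3, 1e-1]
        lam = 10 ** rng.uniform(-6, -1)      # l2-penalty in [1e-6, 1e-1]
        nhid = int(10 ** rng.uniform(1, 3))  # hidden nodes in [10, 1000]
        loss = train_and_eval(lr, lam, nhid)
        if loss < best[0]:
            best = (loss, (lr, lam, nhid))
    return best

# Usage with a stand-in objective (a real one would train the network):
fake = lambda lr, lam, nhid: (np.log10(lr) + 2) ** 2 + (np.log10(lam) + 4) ** 2
print(random_search(fake))
```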
Bayesian Optimization: how does it work?

(figure: evaluated points numbered in the order they were chosen, with hyperparameters adaptively chosen)
Recent work attempts to speed up hyperparameter evaluation by stopping poorly performing settings before they are fully trained.

- Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw Bayesian optimization. arXiv:1406.3896, 2014.
- Alekh Agarwal, Peter Bartlett, and John Duchi. Oracle inequalities for computationally adaptive model selection. COLT, 2012.
- T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. IJCAI, 2015.
- András György and Levente Kocsis. Efficient multi-start strategies for local search algorithms. JAIR, 41, 2011.
- Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. ICLR, 2016.

(figure: eval-loss vs. epochs for each sampled setting — how computation time was spent)
Hyperparameter Optimization

In general, hyperparameter optimization is non-convex optimization, and little is known about the underlying function (we only observe the validation loss).

Your time is valuable, computers are cheap: do not employ "grad student descent" for hyperparameter search. Write modular code that takes parameters as input, and automate this embarrassingly parallel search. Use cloud resources (see pywren).

Tools for different purposes:
- Very few evaluations: use random search (and pray), or be clever
- Few evaluations and long-running computations: see the refs on the previous slide
- Moderate number of evaluations (but still exp(#params)) and high accuracy needed: use Bayesian optimization
- Many evaluations possible: use random search. Why overthink it?