Tengyu Ma (Facebook AI Research)
Based on joint work with Rong Ge (Duke) and Jason D. Lee (USC)
[Cartoon: users hand a function $f$ to optimization researchers, who return a solution via gradient descent / local search, or via convex relaxation + rounding.]

[Slide: $f = f_1 + \dots + f_n$, where each $f_i$ is convex and smooth with bounded condition number; solution: stochastic gradient descent, SAGA, SDCA, SVRG, ...]

[Slide: the researchers reply "Too hard, can you change the function?"; the users answer "Well, let me try a new model and a new loss" and hand over a new function $f$, asking "Is this function easy for me?" NB: in learning, the model is $\hat{y} = g_\theta(x)$ and the loss is $f(\theta) = \mathbb{E}[\ell(y, g_\theta(x))]$; the solution for $f$ is stochastic gradient descent (no rounding).]

[Slide: the same exchange, with the users' toolbox spelled out: ReLU, over-parameterization, batch normalization, residual networks.]
• Identify a family $\mathcal{F}$ of tractable functions: $\mathcal{F} = \{f : \text{all or most local minima are approximate global minima}\}$
• Decide whether a function belongs to the family $\mathcal{F}$. Analysis techniques: linear algebra + probability, Kac-Rice formula, ...
• Design new models and objective functions that are provably in $\mathcal{F}$. Some recent progress in simplified settings: [Hardt-M.-Recht 16, Soudry-Carmon 16, Liang-Xie-Song 17, Hardt-M. 17, Ge-Lee-M. 17]
NB: we also need to care about generalization error (but not in this talk)
• Assume data $(x, y)$ satisfies $y = a^{*\top}\sigma(B^* x) + \xi$
• Assume data $x$ comes from a Gaussian distribution
• Goal: learn a function that predicts $y$ given $x$
[Figure: one-hidden-layer network $x \to B \to \sigma \to a \to y$, with input dimension $d$]
• ($\sigma$ = ReLU for all experiments in the talk)
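To make the setup concrete, here is a minimal sketch of this data-generating model in NumPy. The function names and sample count are my own illustration, not from the talk:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_data(n, a_star, B_star, noise_std=0.0, rng=None):
    """Draw n samples with x ~ N(0, I_d) and y = a*^T sigma(B* x) + xi."""
    rng = rng or np.random.default_rng(0)
    d = B_star.shape[1]
    X = rng.standard_normal((n, d))           # rows are x ~ N(0, I_d)
    Y = relu(X @ B_star.T) @ a_star           # a*^T sigma(B* x), sigma = ReLU
    Y += noise_std * rng.standard_normal(n)   # additive noise xi
    return X, Y

# Setting used in the talk's experiments: d = 50, a* = all-ones, B* = I.
d = 50
X, Y = sample_data(10_000, np.ones(d), np.eye(d))
```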
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; our prediction: $\hat{y} = a^\top\sigma(Bx)$
• Loss function (population): $f(a, B) = \mathbb{E}\big[(y - \hat{y})^2\big]$
[Plot: SGD on the population risk fails, converging to a spurious local minimum.]
• $d = 50$
• $a^* = \mathbf{1}$ and assumed to be known
• $B^* = I_{50 \times 50}$
• $\xi = 0$
• fresh samples every iteration
• $\mathrm{dist}(B, B^*)$ measured by a surrogate error $\epsilon$: a row or a column of $B$ is $\epsilon$-far away from the natural basis in infinity norm
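A sketch of this failing experiment: plain SGD on the squared loss with $a$ fixed to $a^*$ and a fresh batch every step. The hyperparameters here are illustrative assumptions, not the talk's:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sgd_squared_loss(a_star, B_star, steps=20_000, lr=1e-3, batch=256, seed=0):
    """SGD on E[(y - a*^T relu(B x))^2] over B, fresh samples each iteration."""
    rng = np.random.default_rng(seed)
    m, d = B_star.shape
    B = rng.standard_normal((m, d)) / np.sqrt(d)        # random initialization
    for _ in range(steps):
        X = rng.standard_normal((batch, d))             # fresh batch (xi = 0)
        y = relu(X @ B_star.T) @ a_star                 # labels
        H = relu(X @ B.T)                               # hidden activations
        r = H @ a_star - y                              # residuals
        # d(mean r^2)/dB, with ReLU'(z) = 1{z > 0}; factor 2 folded into lr.
        G = ((r[:, None] * a_star) * (H > 0)).T @ X / batch
        B -= lr * G
    return B

B = sgd_squared_loss(np.ones(50), np.eye(50))
# B typically ends up far from B* in infinity norm: a spurious local minimum.
```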
• Non-overlapping filters (rows of $B^*$ have disjoint supports) [Brutzkus-Globerson 17, Tian 17]
• Initialization sufficiently close to $B^*$ in spectral norm [Li-Yuan 17]
  NB: the bad local minimum found above is very far from $B^*$ in spectral norm but close in infinity norm
• Kernel-based methods [Zhang et al. 16, 17]
• Tensor decomposition followed by local improvement algorithms [Janzamin et al. 15, Zhong et al. 17]
• Empirical solution: over-parameterization [Livni et al. 14]
[Cartoon again: "Well, let me try a new model and a new loss" / "Is this function easy for me?" Understanding this exchange is the main goal of this talk. Next slide: understanding the loss function better.]
An Analytic Formula
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; loss: $f(a, B) = \mathbb{E}\big[(y - a^\top\sigma(Bx))^2\big]$
Theorem 1: suppose the rows of $B$ are unit vectors and $x \sim N(0, I)$; then
$$f(a, B) = \sum_{k \ge 0} \Big\| \hat\sigma_k \sum_{i \in [m]} a_i^* \, b_i^{*\otimes k} - \hat\sigma_k \sum_{i \in [m]} a_i \, b_i^{\otimes k} \Big\|_F^2$$
• $\hat\sigma_k$ = the $k$-th Hermite coefficient of $\sigma$: $\hat\sigma_k := \mathbb{E}[\sigma(x) h_k(x)]$
• $h_k$ = $k$-th normalized Hermite polynomial
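The $\hat\sigma_k$ are easy to check numerically. A sketch using NumPy's probabilists' Hermite basis, assuming the standard normalization $h_k = \mathrm{He}_k / \sqrt{k!}$ (orthonormal under $N(0,1)$):

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermeval

def hermite_coeff(sigma, k, span=12.0, n_grid=200_001):
    """sigma_hat_k = E_{x~N(0,1)}[sigma(x) h_k(x)], with h_k = He_k / sqrt(k!)."""
    x = np.linspace(-span, span, n_grid)
    dx = x[1] - x[0]
    pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    h_k = hermeval(x, [0] * k + [1]) / sqrt(factorial(k))  # normalized He_k
    return float(np.sum(sigma(x) * h_k * pdf) * dx)        # Riemann sum

relu = lambda z: np.maximum(z, 0.0)
for k in range(5):
    print(k, hermite_coeff(relu, k))
# For ReLU: sigma_hat_0 = 1/sqrt(2*pi), sigma_hat_1 = 1/2,
# sigma_hat_2 = 1/(2*sqrt(pi)), and sigma_hat_k = 0 for odd k >= 3.
```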
• $f_0 = \big(\sum_i a_i^* - \sum_i a_i\big)^2$: convex, not identifiable
• $f_1 = \big\|\sum_i a_i^* b_i^* - \sum_i a_i b_i\big\|^2$: no spurious local min, not identifiable
• $f_2 = \big\|\sum_i a_i^* b_i^* b_i^{*\top} - \sum_i a_i b_i b_i^\top\big\|_F^2$: no spurious local min? not identifiable
• $f_4 = \big\|\sum_i a_i^* b_i^{*\otimes 4} - \sum_i a_i b_i^{\otimes 4}\big\|_F^2$: bad saddle points, identifiable
Here $f_k := \big\|\sum_i a_i^* b_i^{*\otimes k} - \sum_i a_i b_i^{\otimes k}\big\|_F^2$; each $f_k$ solves a tensor decomposition problem.
More difficult landscape, but stronger identifiability. Is there a sweet spot? A: yes, to some extent
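For small instances, the $f_k$ can be evaluated directly by forming the rank-one tensors. A brute-force sketch of my own (exponential in $k$, for intuition only, $k \ge 1$):

```python
import numpy as np
from functools import reduce

def tensor_power(v, k):
    """Rank-one tensor v^{(x)k} as a k-dimensional array (k >= 1)."""
    return reduce(np.multiply.outer, [v] * k)

def f_k(a, B, a_star, B_star, k):
    """f_k = || sum_i a*_i (b*_i)^{(x)k} - sum_i a_i b_i^{(x)k} ||_F^2."""
    T = sum(ai * tensor_power(bi, k) for ai, bi in zip(a_star, B_star))
    T = T - sum(ai * tensor_power(bi, k) for ai, bi in zip(a, B))
    return float(np.sum(T ** 2))
```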
New Loss Function
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; new loss: $\tilde f(a, B) = \mathbb{E}\big[(y - a^\top\gamma(Bx))^2\big]$
$$\tilde f(a, B) = \sum_{k \ge 0} \Big\| \hat\sigma_k \sum_{i \in [m]} a_i^* \, b_i^{*\otimes k} - \hat\gamma_k \sum_{i \in [m]} a_i \, b_i^{\otimes k} \Big\|_F^2$$
• Choosing $\gamma$ such that $\hat\gamma_2 = \hat\sigma_2$, $\hat\gamma_4 = \hat\sigma_4$, and $\hat\gamma_k = 0$ for $k \ne 2, 4$:
$$\tilde f(a, B) = \hat\sigma_2^2 f_2 + \hat\sigma_4^2 f_4 + \text{const}$$
• Hope: the landscape of $\tilde f$ is better (and easier to analyze)
• Now it works empirically! Still, we don't know how to analyze it (more on provable algorithms later)
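One concrete way to realize such a $\gamma$ (my own construction, a sketch): since the normalized Hermite polynomials are orthonormal under $N(0,1)$, setting $\gamma = \hat\sigma_2 h_2 + \hat\sigma_4 h_4$ gives exactly the prescribed Hermite coefficients. The closed forms for ReLU below are my own computation (checkable with the quadrature sketch above) and should be treated as assumptions:

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermeval

def h(k, z):
    """k-th normalized probabilists' Hermite polynomial, He_k / sqrt(k!)."""
    return hermeval(z, [0] * k + [1]) / sqrt(factorial(k))

# Hermite coefficients of ReLU (closed forms, assumed; verify numerically).
sigma_hat_2 = 1 / (2 * sqrt(pi))
sigma_hat_4 = -1 / (4 * sqrt(3 * pi))

def gamma(z):
    """gamma with gamma_hat_2 = sigma_hat_2, gamma_hat_4 = sigma_hat_4, else 0,
    by orthonormality of the h_k under N(0, 1)."""
    return sigma_hat_2 * h(2, z) + sigma_hat_4 * h(4, z)
```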
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; loss: $\tilde f(a, B) = \mathbb{E}\big[(y - a^\top\gamma(Bx))^2\big]$
[Plot: SGD on $\tilde f$ converges to a global minimum.]
• $\sigma$ = ReLU
• $d = 50$
• $a^* = \mathbf{1}$ and assumed to be known
• $B^* = I_{50 \times 50}$
• $\mathrm{dist}(B, B^*)$ measured by a surrogate error $\epsilon$: a row or a column of $B$ is $\epsilon$-far away from the natural basis
• fresh samples every iteration
• Key lemma for proving Theorem 1:
$$\mathbb{E}\big[y \, h_k(b_i^\top x)\big] = \hat\sigma_k \sum_{j \in [d]} a_j^* \langle b_j^*, b_i \rangle^k$$
• Extension (informal): for any polynomial $p$, there exists a function $\varphi_p$ such that
$$\mathbb{E}\big[y \, \varphi_p(b_i, x)\big] = \sum_{j \in [d]} a_j^* \, p(\langle b_j^*, b_i \rangle)$$
• For any polynomial $q$ over two variables, there exists $\varphi_q$ such that
$$\mathbb{E}\big[y \, \varphi_q(b_i, b_k, x)\big] = \sum_{j \in [d]} a_j^* \, q(\langle b_j^*, b_i \rangle, \langle b_j^*, b_k \rangle)$$
• Next: find an objective that uses these gadgets and has no spurious local minima
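A quick Monte Carlo sanity check of the key lemma for $k = 2$ with $\sigma$ = ReLU (the sizes and sample counts below are illustrative choices of mine):

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(0)
d, k, n = 8, 2, 1_000_000
B_star = np.linalg.qr(rng.standard_normal((d, d)))[0]    # orthonormal rows b*_j
a_star = np.ones(d)
b_i = rng.standard_normal(d)
b_i /= np.linalg.norm(b_i)                               # unit vector b_i

X = rng.standard_normal((n, d))                          # x ~ N(0, I_d)
y = np.maximum(X @ B_star.T, 0.0) @ a_star               # y = a*^T relu(B* x)
h_k = hermeval(X @ b_i, [0] * k + [1]) / sqrt(factorial(k))

lhs = np.mean(y * h_k)                                   # E[y h_k(b_i^T x)]
sigma_hat_k = 1 / (2 * sqrt(pi))                         # sigma_hat_2 for ReLU
rhs = sigma_hat_k * np.sum(a_star * (B_star @ b_i) ** k)
print(lhs, rhs)                                          # agree up to sampling error
```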
$$\min_B \; G(B) = \sum_{i \in [d]} a_i^* \sum_{j \ne k} \langle b_i^*, b_j \rangle^2 \langle b_i^*, b_k \rangle^2 \;-\; \mu \sum_{i, j} a_i^* \langle b_i^*, b_j \rangle^4 \quad \text{s.t. } \|b_i\|_2 = 1, \; \forall i$$
Theorem: assume $a^* > 0$ and $B^*$ is orthogonal. Then:
1. $G(B)$ can be estimated via samples: $G(B) = \mathbb{E}\big[y \, \varphi(B, x)\big]$
2. A global minimum of $G$ is equal to $B^*$ up to permutation and scaling of the rows
3. All the local minima of $G$ are global minima
• Inspired by [GHJY 15], which proved the case where $\mu = 0$ and $a_i^* = 1$
• Can be extended to non-singular $B^*$
• Limitation: $B^* \in \mathbb{R}^{m \times d}$ with $m \le d$
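In the theorem, $G$ is estimated purely from samples via $\mathbb{E}[y\,\varphi(B, x)]$; the sketch below instead evaluates $G$ and its gradient directly from $B^*$ (a simulation shortcut of my own) and runs projected gradient descent with rows retracted to the unit sphere:

```python
import numpy as np

def G(B, a_star, B_star, mu=0.1):
    """G(B) = sum_i a*_i sum_{j!=k} <b*_i,b_j>^2 <b*_i,b_k>^2
              - mu * sum_{i,j} a*_i <b*_i,b_j>^4."""
    C = B_star @ B.T                  # C[i, j] = <b*_i, b_j>
    S2 = (C ** 2).sum(axis=1)
    S4 = (C ** 4).sum(axis=1)
    # sum_{j != k} c_j^2 c_k^2 = (sum_j c_j^2)^2 - sum_j c_j^4
    return float(a_star @ (S2 ** 2 - S4) - mu * (a_star @ S4))

def grad_G(B, a_star, B_star, mu=0.1):
    """Euclidean gradient of G with respect to the rows of B."""
    C = B_star @ B.T
    S2 = (C ** 2).sum(axis=1)
    W = 4 * a_star[:, None] * C * (S2[:, None] - (1 + mu) * C ** 2)
    return W.T @ B_star

def minimize_G(a_star, B_star, steps=5_000, lr=0.02, mu=0.1, seed=1):
    """Projected gradient descent on the product of unit spheres (a sketch)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(B_star.shape)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    for _ in range(steps):
        B = B - lr * grad_G(B, a_star, B_star, mu)
        B /= np.linalg.norm(B, axis=1, keepdims=True)  # retract rows to sphere
    return B

d = 10
B_hat = minimize_G(np.ones(d), np.eye(d))
# Per the theorem, B_hat should recover I up to row permutation and scaling.
```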
• Caveat: needs a huge batch size and a large training dataset
• Landscape design: designing new models and objectives with good landscape properties
• This paper: a first step, for simplified neural nets
Open questions:
• Sample efficiency: killing the higher-order terms seems to lose information
• Best empirical results: using the new loss for training ReLU networks?
• Beyond Gaussian inputs
• Understanding over-parameterization
• More techniques for analyzing optimization landscapes
Thank you!