Tengyu Ma (Facebook AI Research)
Based on joint work with Rong Ge (Duke) and Jason D. Lee (USC)
[Cartoon: users hand a function $f$ to optimization researchers, who return a solution via gradient descent / local search, or via convex relaxation + rounding.]

[Slide: $f = f_1 + \dots + f_n$, where each $f_i$ is convex and smooth with bounded condition number; solution: stochastic gradient descent, SAGA, SDCA, SVRG, ...]

[Slide: the researchers reply "Too hard, can you change the function?"; the users answer "Well, let me try a new model and a new loss" and hand over a new function $f$, asking "Is this function easy for me?" NB: in learning, the model is $\hat{y} = g_\theta(x)$ and the loss is $f(\theta) = \mathbb{E}[\ell(y, g_\theta(x))]$; the solution for $f$ is stochastic gradient descent (no rounding).]

[Slide: the same exchange, with the users' toolbox spelled out: ReLU, over-parameterization, batch normalization, residual networks.]
• Identify a family $\mathcal{F}$ of tractable functions: $\mathcal{F} = \{f : \text{all or most local minima are approximate global minima}\}$
• Decide whether a function belongs to the family $\mathcal{F}$. Analysis techniques: linear algebra + probability, Kac-Rice formula, ...
• Design new models and objective functions that are provably in $\mathcal{F}$. Some recent progress in simplified settings: [Hardt-M.-Recht 16, Soudry-Carmon 16, Liang-Xie-Song 17, Hardt-M. 17, Ge-Lee-M. 17]
NB: we also need to care about generalization error (but not in this talk)
• Assume data $(x, y)$ satisfies $y = a^{*\top}\sigma(B^* x) + \xi$
• Assume data $x$ comes from a Gaussian distribution
• Goal: learn a function that predicts $y$ given $x$
[Figure: one-hidden-layer network $x \to B \to \sigma \to a \to y$, with input dimension $d$]
• ($\sigma$ = ReLU for all experiments in the talk)
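To make the setup concrete, here is a minimal sketch of this data-generating model in NumPy. The function names and sample count are my own illustration, not from the talk:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_data(n, a_star, B_star, noise_std=0.0, rng=None):
    """Draw n samples with x ~ N(0, I_d) and y = a*^T sigma(B* x) + xi."""
    rng = rng or np.random.default_rng(0)
    d = B_star.shape[1]
    X = rng.standard_normal((n, d))           # rows are x ~ N(0, I_d)
    Y = relu(X @ B_star.T) @ a_star           # a*^T sigma(B* x), sigma = ReLU
    Y += noise_std * rng.standard_normal(n)   # additive noise xi
    return X, Y

# Setting used in the talk's experiments: d = 50, a* = all-ones, B* = I.
d = 50
X, Y = sample_data(10_000, np.ones(d), np.eye(d))
```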
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; our prediction: $\hat{y} = a^\top\sigma(Bx)$
• Loss function (population): $f(a, B) = \mathbb{E}\big[(y - \hat{y})^2\big]$
[Plot: SGD on the population risk fails, converging to a spurious local minimum.]
• $d = 50$
• $a^* = \mathbf{1}$ and assumed to be known
• $B^* = I_{50 \times 50}$
• $\xi = 0$
• fresh samples every iteration
• $\mathrm{dist}(B, B^*)$ measured by a surrogate error $\epsilon$: a row or a column of $B$ is $\epsilon$-far away from the natural basis in infinity norm
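A sketch of this failing experiment: plain SGD on the squared loss with $a$ fixed to $a^*$ and a fresh batch every step. The hyperparameters here are illustrative assumptions, not the talk's:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sgd_squared_loss(a_star, B_star, steps=20_000, lr=1e-3, batch=256, seed=0):
    """SGD on E[(y - a*^T relu(B x))^2] over B, fresh samples each iteration."""
    rng = np.random.default_rng(seed)
    m, d = B_star.shape
    B = rng.standard_normal((m, d)) / np.sqrt(d)        # random initialization
    for _ in range(steps):
        X = rng.standard_normal((batch, d))             # fresh batch (xi = 0)
        y = relu(X @ B_star.T) @ a_star                 # labels
        H = relu(X @ B.T)                               # hidden activations
        r = H @ a_star - y                              # residuals
        # d(mean r^2)/dB, with ReLU'(z) = 1{z > 0}; factor 2 folded into lr.
        G = ((r[:, None] * a_star) * (H > 0)).T @ X / batch
        B -= lr * G
    return B

B = sgd_squared_loss(np.ones(50), np.eye(50))
# B typically ends up far from B* in infinity norm: a spurious local minimum.
```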
• Non-overlapping filters (rows of $B^*$ have disjoint supports) [Brutzkus-Globerson 17, Tian 17]
• Initialization sufficiently close to $B^*$ in spectral norm [Li-Yuan 17]
  NB: the bad local minimum found above is very far from $B^*$ in spectral norm but close in infinity norm
• Kernel-based methods [Zhang et al. 16, 17]
• Tensor decomposition followed by local improvement algorithms [Janzamin et al. 15, Zhong et al. 17]
• Empirical solution: over-parameterization [Livni et al. 14]
[Cartoon again: "Well, let me try a new model and a new loss" / "Is this function easy for me?" Understanding this exchange is the main goal of this talk. Next slide: understanding the loss function better.]
An Analytic Formula
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; loss: $f(a, B) = \mathbb{E}\big[(y - a^\top\sigma(Bx))^2\big]$
Theorem 1: suppose the rows of $B$ are unit vectors and $x \sim N(0, I)$; then
$$f(a, B) = \sum_{k \ge 0} \Big\| \hat\sigma_k \sum_{i \in [m]} a_i^* \, b_i^{*\otimes k} - \hat\sigma_k \sum_{i \in [m]} a_i \, b_i^{\otimes k} \Big\|_F^2$$
• $\hat\sigma_k$ = the $k$-th Hermite coefficient of $\sigma$: $\hat\sigma_k := \mathbb{E}[\sigma(x) h_k(x)]$
• $h_k$ = $k$-th normalized Hermite polynomial
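The $\hat\sigma_k$ are easy to check numerically. A sketch using NumPy's probabilists' Hermite basis, assuming the standard normalization $h_k = \mathrm{He}_k / \sqrt{k!}$ (orthonormal under $N(0,1)$):

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermeval

def hermite_coeff(sigma, k, span=12.0, n_grid=200_001):
    """sigma_hat_k = E_{x~N(0,1)}[sigma(x) h_k(x)], with h_k = He_k / sqrt(k!)."""
    x = np.linspace(-span, span, n_grid)
    dx = x[1] - x[0]
    pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    h_k = hermeval(x, [0] * k + [1]) / sqrt(factorial(k))  # normalized He_k
    return float(np.sum(sigma(x) * h_k * pdf) * dx)        # Riemann sum

relu = lambda z: np.maximum(z, 0.0)
for k in range(5):
    print(k, hermite_coeff(relu, k))
# For ReLU: sigma_hat_0 = 1/sqrt(2*pi), sigma_hat_1 = 1/2,
# sigma_hat_2 = 1/(2*sqrt(pi)), and sigma_hat_k = 0 for odd k >= 3.
```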
• $f_0 = \big(\sum_i a_i^* - \sum_i a_i\big)^2$: convex, not identifiable
• $f_1 = \big\|\sum_i a_i^* b_i^* - \sum_i a_i b_i\big\|^2$: no spurious local min, not identifiable
• $f_2 = \big\|\sum_i a_i^* b_i^* b_i^{*\top} - \sum_i a_i b_i b_i^\top\big\|_F^2$: no spurious local min? not identifiable
• $f_4 = \big\|\sum_i a_i^* b_i^{*\otimes 4} - \sum_i a_i b_i^{\otimes 4}\big\|_F^2$: bad saddle points, identifiable
Here $f_k := \big\|\sum_i a_i^* b_i^{*\otimes k} - \sum_i a_i b_i^{\otimes k}\big\|_F^2$; each $f_k$ solves a tensor decomposition problem.
More difficult landscape, but stronger identifiability. Is there a sweet spot? A: yes, to some extent
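For small instances, the $f_k$ can be evaluated directly by forming the rank-one tensors. A brute-force sketch of my own (exponential in $k$, for intuition only, $k \ge 1$):

```python
import numpy as np
from functools import reduce

def tensor_power(v, k):
    """Rank-one tensor v^{(x)k} as a k-dimensional array (k >= 1)."""
    return reduce(np.multiply.outer, [v] * k)

def f_k(a, B, a_star, B_star, k):
    """f_k = || sum_i a*_i (b*_i)^{(x)k} - sum_i a_i b_i^{(x)k} ||_F^2."""
    T = sum(ai * tensor_power(bi, k) for ai, bi in zip(a_star, B_star))
    T = T - sum(ai * tensor_power(bi, k) for ai, bi in zip(a, B))
    return float(np.sum(T ** 2))
```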
New Loss Function
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; new loss: $\tilde f(a, B) = \mathbb{E}\big[(y - a^\top\gamma(Bx))^2\big]$
$$\tilde f(a, B) = \sum_{k \ge 0} \Big\| \hat\sigma_k \sum_{i \in [m]} a_i^* \, b_i^{*\otimes k} - \hat\gamma_k \sum_{i \in [m]} a_i \, b_i^{\otimes k} \Big\|_F^2$$
• Choosing $\gamma$ such that $\hat\gamma_2 = \hat\sigma_2$, $\hat\gamma_4 = \hat\sigma_4$, and $\hat\gamma_k = 0$ for $k \ne 2, 4$:
$$\tilde f(a, B) = \hat\sigma_2^2 f_2 + \hat\sigma_4^2 f_4 + \text{const}$$
• Hope: the landscape of $\tilde f$ is better (and easier to analyze)
• Now it works empirically! Still, we don't know how to analyze it (more on provable algorithms later)
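One concrete way to realize such a $\gamma$ (my own construction, a sketch): since the normalized Hermite polynomials are orthonormal under $N(0,1)$, setting $\gamma = \hat\sigma_2 h_2 + \hat\sigma_4 h_4$ gives exactly the prescribed Hermite coefficients. The closed forms for ReLU below are my own computation (checkable with the quadrature sketch above) and should be treated as assumptions:

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermeval

def h(k, z):
    """k-th normalized probabilists' Hermite polynomial, He_k / sqrt(k!)."""
    return hermeval(z, [0] * k + [1]) / sqrt(factorial(k))

# Hermite coefficients of ReLU (closed forms, assumed; verify numerically).
sigma_hat_2 = 1 / (2 * sqrt(pi))
sigma_hat_4 = -1 / (4 * sqrt(3 * pi))

def gamma(z):
    """gamma with gamma_hat_2 = sigma_hat_2, gamma_hat_4 = sigma_hat_4, else 0,
    by orthonormality of the h_k under N(0, 1)."""
    return sigma_hat_2 * h(2, z) + sigma_hat_4 * h(4, z)
```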
Label: $y = a^{*\top}\sigma(B^* x) + \xi$; loss: $\tilde f(a, B) = \mathbb{E}\big[(y - a^\top\gamma(Bx))^2\big]$
[Plot: SGD on $\tilde f$ converges to a global minimum.]
• $\sigma$ = ReLU
• $d = 50$
• $a^* = \mathbf{1}$ and assumed to be known
• $B^* = I_{50 \times 50}$
• $\mathrm{dist}(B, B^*)$ measured by a surrogate error $\epsilon$: a row or a column of $B$ is $\epsilon$-far away from the natural basis
• fresh samples every iteration
• Key lemma for proving Theorem 1:
$$\mathbb{E}\big[y \, h_k(b_i^\top x)\big] = \hat\sigma_k \sum_{j \in [d]} a_j^* \langle b_j^*, b_i \rangle^k$$
• Extension (informal): for any polynomial $p$, there exists a function $\varphi_p$ such that
$$\mathbb{E}\big[y \, \varphi_p(b_i, x)\big] = \sum_{j \in [d]} a_j^* \, p(\langle b_j^*, b_i \rangle)$$
• For any polynomial $q$ over two variables, there exists $\varphi_q$ such that
$$\mathbb{E}\big[y \, \varphi_q(b_i, b_k, x)\big] = \sum_{j \in [d]} a_j^* \, q(\langle b_j^*, b_i \rangle, \langle b_j^*, b_k \rangle)$$
• Next: find an objective that uses these gadgets and has no spurious local minima
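A quick Monte Carlo sanity check of the key lemma for $k = 2$ with $\sigma$ = ReLU (the sizes and sample counts below are illustrative choices of mine):

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(0)
d, k, n = 8, 2, 1_000_000
B_star = np.linalg.qr(rng.standard_normal((d, d)))[0]    # orthonormal rows b*_j
a_star = np.ones(d)
b_i = rng.standard_normal(d)
b_i /= np.linalg.norm(b_i)                               # unit vector b_i

X = rng.standard_normal((n, d))                          # x ~ N(0, I_d)
y = np.maximum(X @ B_star.T, 0.0) @ a_star               # y = a*^T relu(B* x)
h_k = hermeval(X @ b_i, [0] * k + [1]) / sqrt(factorial(k))

lhs = np.mean(y * h_k)                                   # E[y h_k(b_i^T x)]
sigma_hat_k = 1 / (2 * sqrt(pi))                         # sigma_hat_2 for ReLU
rhs = sigma_hat_k * np.sum(a_star * (B_star @ b_i) ** k)
print(lhs, rhs)                                          # agree up to sampling error
```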
$$\min_B \; G(B) = \sum_{i \in [d]} a_i^* \sum_{j \ne k} \langle b_i^*, b_j \rangle^2 \langle b_i^*, b_k \rangle^2 \;-\; \mu \sum_{i, j} a_i^* \langle b_i^*, b_j \rangle^4 \quad \text{s.t. } \|b_i\|_2 = 1, \; \forall i$$
Theorem: assume $a^* > 0$ and $B^*$ is orthogonal. Then:
1. $G(B)$ can be estimated via samples: $G(B) = \mathbb{E}\big[y \, \varphi(B, x)\big]$
2. A global minimum of $G$ is equal to $B^*$ up to permutation and scaling of the rows
3. All the local minima of $G$ are global minima
• Inspired by [GHJY 15], which proved the case where $\mu = 0$ and $a_i^* = 1$
• Can be extended to non-singular $B^*$
• Limitation: $B^* \in \mathbb{R}^{m \times d}$ with $m \le d$
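In the theorem, $G$ is estimated purely from samples via $\mathbb{E}[y\,\varphi(B, x)]$; the sketch below instead evaluates $G$ and its gradient directly from $B^*$ (a simulation shortcut of my own) and runs projected gradient descent with rows retracted to the unit sphere:

```python
import numpy as np

def G(B, a_star, B_star, mu=0.1):
    """G(B) = sum_i a*_i sum_{j!=k} <b*_i,b_j>^2 <b*_i,b_k>^2
              - mu * sum_{i,j} a*_i <b*_i,b_j>^4."""
    C = B_star @ B.T                  # C[i, j] = <b*_i, b_j>
    S2 = (C ** 2).sum(axis=1)
    S4 = (C ** 4).sum(axis=1)
    # sum_{j != k} c_j^2 c_k^2 = (sum_j c_j^2)^2 - sum_j c_j^4
    return float(a_star @ (S2 ** 2 - S4) - mu * (a_star @ S4))

def grad_G(B, a_star, B_star, mu=0.1):
    """Euclidean gradient of G with respect to the rows of B."""
    C = B_star @ B.T
    S2 = (C ** 2).sum(axis=1)
    W = 4 * a_star[:, None] * C * (S2[:, None] - (1 + mu) * C ** 2)
    return W.T @ B_star

def minimize_G(a_star, B_star, steps=5_000, lr=0.02, mu=0.1, seed=1):
    """Projected gradient descent on the product of unit spheres (a sketch)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(B_star.shape)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    for _ in range(steps):
        B = B - lr * grad_G(B, a_star, B_star, mu)
        B /= np.linalg.norm(B, axis=1, keepdims=True)  # retract rows to sphere
    return B

d = 10
B_hat = minimize_G(np.ones(d), np.eye(d))
# Per the theorem, B_hat should recover I up to row permutation and scaling.
```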
• Caveat: needs a huge batch size and a large training dataset
• Landscape design: designing new models and objectives with good landscape properties
• This paper: a first step, for simplified neural nets
Open questions:
• Sample efficiency: killing the higher-order terms seems to lose information
• Best empirical results: using the new loss for training ReLU networks?
• Beyond Gaussian inputs
• Understanding over-parameterization
• More techniques for analyzing optimization landscapes
Thank you!