Tengyu Ma Facebook AI Research. Based on joint work with Rong Ge (Duke) and Jason D. Lee (USC)

Tengyu Ma Facebook AI Research Based on joint work with Rong Ge (Duke) and Jason D. Lee (USC)

Users vs. Optimization Researchers: the user hands over a function $f$ and gets back a solution. Classical toolbox: gradient descent, local search, convex relaxation + rounding.

Users vs. Optimization Researchers: the user hands over a function $f = f_1 + \cdots + f_n$, where each $f_i$ is convex and smooth with bounded condition number, and gets back a solution. Toolbox: stochastic gradient descent, SAGA, SDCA, SVRG, ...

Users vs. Optimization Researchers. User: "Well, let me try a new model and a new loss." Researcher: "Too hard, can you change the function?" User: "A new function $f$." Researcher: "Is this function easy for me?" NB: in learning, the model is $\hat{y} = g_\theta(x)$ and the loss is $f(\theta) = \mathbb{E}[\ell(y, g_\theta(x))]$. Solution for $f$: stochastic gradient descent (no rounding).

Users vs. Optimization Researchers. User: "Well, let me try a new model and a new loss [ReLU, overparameterization, batch normalization, residual networks, ...]." Researcher: "Too hard, can you change the function?" User: "A new function $f$." Researcher: "Is this function easy for me?" Solution for $f$: stochastic gradient descent (no rounding).

Ø Identify a family $\mathcal{F}$ of tractable functions: $\mathcal{F} = \{f:$ all or most local minima are approximate global minima$\}$
Ø Decide whether a function belongs to the family $\mathcal{F}$. Analysis techniques: linear algebra + probability, Kac-Rice formula, ...
Ø Design new models and objective functions that are provably in $\mathcal{F}$. Some recent progress in simplified settings: [Hardt-M.-Recht 16, Soudry-Carmon 16, Liang-Xie-Song 17, Hardt-M. 17, Ge-Lee-M. 17]
NB: we also need to care about generalization error (but not in this talk)

Ø Assume data $(x, y)$ satisfies $y = a^{*\top} \sigma(B^* x) + \xi$
Ø Assume the input $x$ is drawn from a Gaussian distribution
Ø Goal: learn a function that predicts $y$ given $x$
[Figure: one-hidden-layer network with input $x \in \mathbb{R}^d$, hidden-layer weights $B^*$, output weights $a^*$]
Ø ($\sigma$ = ReLU for all experiments in the talk)

Label: $y = a^{*\top} \sigma(B^* x) + \xi$. Our prediction: $\hat{y} = a^\top \sigma(Bx)$.
Ø Loss function (population): $f(a, B) = \mathbb{E}[(y - \hat{y})^2]$
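As a minimal runnable sketch of this setup (dimensions, noise level, and function names are my own illustrative choices, not from the talk):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_data(a_star, B_star, n, noise_std=0.0, rng=None):
    """Draw n samples with x ~ N(0, I_d) and y = a*^T ReLU(B* x) + xi."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, B_star.shape[1]))
    Y = relu(X @ B_star.T) @ a_star + noise_std * rng.standard_normal(n)
    return X, Y

def empirical_loss(a, B, X, Y):
    """Empirical version of f(a, B) = E[(y - a^T ReLU(B x))^2]."""
    preds = relu(X @ B.T) @ a
    return float(np.mean((Y - preds) ** 2))
```

At the ground truth $(a, B) = (a^*, B^*)$ with no noise the empirical loss is exactly zero; the question in the talk is whether local search finds this global minimum.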

[Figure: SGD fails, even on the population risk, converging to a point far from the global minimum.]
Ø $d = 50$
Ø $a^* = \mathbf{1}$ and assumed to be known
Ø $B^* = I_{50 \times 50}$
Ø $\xi = 0$
Ø fresh samples every iteration
Ø dist$(B, B^*)$ measured by a surrogate error $\varepsilon$: a row or a column of $B$ is $\varepsilon$-far away from the natural basis in infinity norm

Ø Non-overlapping filters (rows of $B^*$ have disjoint supports) [Brutzkus-Globerson 17, Tian 17]
Ø Initialization sufficiently close to $B^*$ in spectral norm [Li-Yuan 17]. NB: the bad local min found is very far from $B^*$ in spectral norm but close in infinity norm
Ø Kernel-based methods [Zhang et al. 16, 17]
Ø Tensor decomposition followed by local improvement algorithms [Janzamin et al. 15, Zhong et al. 17]
Ø Empirical solution: over-parameterization [Livni et al. 14]

Users vs. Optimization Researchers. User: "Well, let me try a new model and a new loss." Researcher: "Is this function easy for me?" Main goal of this talk: understand this question better. Next slide: understanding it via an analytic formula.

An Analytic Formula. Label: $y = a^{*\top}\sigma(B^* x) + \xi$. Loss: $f(a, B) = \mathbb{E}[(y - a^\top \sigma(Bx))^2]$.
Theorem 1: suppose the rows of $B^*$ are unit vectors and $x \sim N(0, I)$. Then
$f(a, B) = \sum_{k} \hat{\sigma}_k^2 \Big\| \sum_{i \in [m]} a_i^* (b_i^*)^{\otimes k} - \sum_{i \in [m]} a_i b_i^{\otimes k} \Big\|_F^2 + \text{const}$
Ø $\hat{\sigma}_k$ = the $k$-th Hermite coefficient of $\sigma$: $\hat{\sigma}_k := \mathbb{E}[\sigma(x) h_k(x)]$
Ø $h_k$ = the $k$-th normalized Hermite polynomial
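For ReLU these Hermite coefficients have closed forms; a short sketch (the function name is mine) computes them exactly from half-line Gaussian moments, using `herme2poly` to expand the probabilist's Hermite polynomial in the monomial basis:

```python
import numpy as np
from numpy.polynomial.hermite_e import herme2poly
from math import factorial, sqrt, pi

def relu_hermite_coeff(k):
    """sigma_hat_k = E[ReLU(x) h_k(x)] for x ~ N(0,1), with h_k the k-th
    normalized probabilist's Hermite polynomial He_k / sqrt(k!).
    Uses the half-line moments M_m = int_0^inf x^m phi(x) dx, which obey
    M_0 = 1/2, M_1 = 1/sqrt(2*pi), M_m = (m-1) * M_{m-2}."""
    c = herme2poly([0] * k + [1])        # He_k in the monomial basis
    M = [0.0] * (len(c) + 2)
    M[0], M[1] = 0.5, 1.0 / sqrt(2 * pi)
    for m in range(2, len(M)):
        M[m] = (m - 1) * M[m - 2]
    # ReLU(x) * He_k(x) = x * He_k(x) on x > 0, so shift each moment by one.
    val = sum(c[j] * M[j + 1] for j in range(len(c)))
    return float(val / sqrt(factorial(k)))
```

For example $\hat\sigma_1 = 1/2$ and $\hat\sigma_2 = 1/(2\sqrt{\pi})$, and the odd coefficients beyond $k = 1$ vanish.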

Ø $f_0 = (\sum_i a_i^* - \sum_i a_i)^2$ — convex, not identifiable
Ø $f_1 = \|\sum_i a_i^* b_i^* - \sum_i a_i b_i\|^2$ — no spurious local min, not identifiable
Ø $f_2 = \|\sum_i a_i^* b_i^* b_i^{*\top} - \sum_i a_i b_i b_i^\top\|_F^2$ — no spurious local min? not identifiable
Ø $f_4 = \|\sum_i a_i^* (b_i^*)^{\otimes 4} - \sum_i a_i b_i^{\otimes 4}\|_F^2$ — bad saddle points, identifiable
Each $f_k$ solves a tensor decomposition problem. More difficult landscape, stronger identifiability. A sweet spot? A: yes, to some extent.
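A small sketch (my own illustration, not code from the talk) of these tensor losses makes the identifiability gap concrete: a single neuron can match the first-moment tensor of two neurons, so $f_1 = 0$ there, while $f_4$ stays bounded away from zero.

```python
import numpy as np

def moment_tensor(a, B, k):
    """sum_i a_i * b_i^{tensor k}, as a dense k-way array (rows b_i of B)."""
    d = B.shape[1]
    T = np.zeros((d,) * k)
    for a_i, b_i in zip(a, B):
        t = np.array(a_i, dtype=float)
        for _ in range(k):
            t = np.multiply.outer(t, b_i)   # build b_i^{tensor k}, scaled by a_i
        T = T + t
    return T

def f_k(a, B, a_star, B_star, k):
    """f_k(a, B) = || sum_i a*_i (b*_i)^{x k} - sum_i a_i b_i^{x k} ||_F^2."""
    diff = moment_tensor(a_star, B_star, k) - moment_tensor(a, B, k)
    return float(np.sum(diff ** 2))
```

With $a^* = (1, 1)$, $B^* = I_2$ and the single-neuron model $a = (2)$, $b_1 = (1/2, 1/2)$, one gets $f_1 = 0$ but $f_4 = 1.75$.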

New Loss Function. Label: $y = a^{*\top}\sigma(B^* x) + \xi$. New loss: $f_\gamma(a, B) = \mathbb{E}[(y - a^\top \gamma(Bx))^2]$, which decomposes as
$f_\gamma(a, B) = \sum_{k \in \mathbb{N}} \Big\| \hat{\sigma}_k \sum_{i \in [m]} a_i^* (b_i^*)^{\otimes k} - \hat{\gamma}_k \sum_{i \in [m]} a_i b_i^{\otimes k} \Big\|_F^2 + \text{const}$
Ø Choosing $\gamma$ such that $\hat{\gamma}_2 = \hat{\sigma}_2$, $\hat{\gamma}_4 = \hat{\sigma}_4$, and $\hat{\gamma}_k = 0$ for $k \neq 2, 4$:
$f_\gamma(a, B) = \hat{\sigma}_2^2 f_2 + \hat{\sigma}_4^2 f_4 + \text{const}$
Ø Hope: the landscape of $f_\gamma$ is better (and easier to analyze)
Ø Now it empirically works! Still, we don't know how to analyze it (more on provable algorithms later)
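One concrete choice (my construction; the slide only constrains the Hermite coefficients of $\gamma$) is $\gamma = \hat\sigma_2 h_2 + \hat\sigma_4 h_4$: by orthonormality of the normalized Hermite polynomials under $N(0,1)$ it has exactly the required spectrum. A sketch that also verifies the coefficients numerically, plugging in the ReLU values $\hat\sigma_2 = 1/(2\sqrt{\pi})$ and $\hat\sigma_4 = -1/(4\sqrt{3\pi})$:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

s2 = 1 / (2 * sqrt(pi))        # ReLU Hermite coefficient sigma_hat_2
s4 = -1 / (4 * sqrt(3 * pi))   # ReLU Hermite coefficient sigma_hat_4

def h(k, z):
    """k-th normalized probabilist's Hermite polynomial He_k / sqrt(k!)."""
    return hermeval(z, [0] * k + [1]) / sqrt(factorial(k))

def gamma(z):
    """gamma = s2*h_2 + s4*h_4: matches sigma on the 2nd and 4th Hermite
    coefficients and kills every other one."""
    return s2 * h(2, z) + s4 * h(4, z)

def gamma_hat(k, n_nodes=40):
    """Hermite coefficient E[gamma(x) h_k(x)] via Gauss-Hermite quadrature
    (exact here, since the integrand is a polynomial)."""
    z, w = hermegauss(n_nodes)
    return float(np.sum(w * gamma(z) * h(k, z)) / sqrt(2 * pi))
```

This is only a spectrum check; whether this particular $\gamma$ matches the one used in the talk's experiments is an assumption on my part.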

Label: $y = a^{*\top}\sigma(B^* x) + \xi$. Loss: $f_\gamma(a, B) = \mathbb{E}[(y - a^\top \gamma(Bx))^2]$
[Figure: with the new loss, $f_\gamma$ reaches its global minimum.]
Ø $\sigma$ = ReLU
Ø $d = 50$
Ø $a^* = \mathbf{1}$ and assumed to be known
Ø $B^* = I_{50 \times 50}$
Ø dist$(B, B^*)$ measured by a surrogate error $\varepsilon$: a row or a column of $B$ is $\varepsilon$-far away from the natural basis
Ø fresh samples every iteration

Ø Key lemma for proving Theorem 1:
$\mathbb{E}[y \, h_k(b_i^\top x)] = \hat{\sigma}_k \sum_{j \in [d]} a_j^* \langle b_j^*, b_i \rangle^k$
Ø Extension (informal): for any polynomial $p$, there exists a function $\varphi$ such that
$\mathbb{E}[y \, \varphi(b_i, x)] = \sum_{j \in [d]} a_j^* \, p(\langle b_j^*, b_i \rangle)$
Ø For any polynomial $q$ over two variables, there exists $\varphi$ such that
$\mathbb{E}[y \, \varphi(b_i, b_k, x)] = \sum_{j \in [d]} a_j^* \, q(\langle b_j^*, b_i \rangle, \langle b_j^*, b_k \rangle)$
Ø Next: find an objective that uses these gadgets and has no spurious local minimum
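The key lemma is easy to sanity-check by Monte Carlo for $\sigma$ = ReLU and $k = 2$ (the setup below, including sample size and test direction, is my own illustrative choice; $\hat\sigma_2 = 1/(2\sqrt{\pi})$ for ReLU):

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(0)
d = 2
B_star = np.eye(d)                      # rows b*_j, unit vectors
a_star = np.array([1.0, 1.0])
b = np.array([0.6, 0.8])                # a unit vector b_i to test

X = rng.standard_normal((500_000, d))   # x ~ N(0, I_d)
y = np.maximum(X @ B_star.T, 0.0) @ a_star   # y = a*^T ReLU(B* x)

h2 = ((X @ b) ** 2 - 1) / sqrt(2)       # normalized Hermite h_2(b^T x)
lhs = float(np.mean(y * h2))            # Monte Carlo E[y h_2(b^T x)]

sigma_hat_2 = 1 / (2 * sqrt(pi))        # ReLU Hermite coefficient
rhs = sigma_hat_2 * float(np.sum(a_star * (B_star @ b) ** 2))
```

Here $\langle b_1^*, b \rangle^2 + \langle b_2^*, b \rangle^2 = 1$, so both sides are approximately $\hat\sigma_2 \approx 0.282$.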

$\min_B \; G(B) = \sum_{i \in [d]} a_i^* \sum_{j \neq k} \langle b_i^*, b_j \rangle^2 \langle b_i^*, b_k \rangle^2 \; - \; \mu \sum_{i,j} a_i^* \langle b_i^*, b_j \rangle^4, \quad \text{s.t. } \|b_j\|_2 = 1 \; \forall j$
Theorem: assume $a^* > 0$ and $B^*$ is orthogonal. Then:
1. $G(B)$ can be estimated via samples: $G(B) = \mathbb{E}[y \, \varphi(B, x)]$
2. A global minimum of $G$ is equal to $B^*$ up to permutation and scaling of the rows
3. All the local minima of $G$ are global minima
Ø Inspired by [GHJY 15], which proved the case $\mu = 0$ and $a_i^* = 1$
Ø Can be extended to non-singular $B^*$
Ø Limitation: $B^* \in \mathbb{R}^{m \times d}$ with $m \leq d$
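In its analytic form the objective takes a few lines of numpy (assuming access to $a^*, B^*$ for illustration; in practice point 1 says $G$ is estimated from samples as $\mathbb{E}[y\,\varphi(B,x)]$, which this sketch does not reproduce, and the value of $\mu$ is my own choice):

```python
import numpy as np

def G(B, a_star, B_star, mu=0.1):
    """G(B) = sum_i a*_i sum_{j != k} <b*_i,b_j>^2 <b*_i,b_k>^2
              - mu * sum_{i,j} a*_i <b*_i,b_j>^4   (rows of B unit-norm)."""
    P2 = (B_star @ B.T) ** 2           # P2[i, j] = <b*_i, b_j>^2
    fourth = (P2 ** 2).sum(axis=1)     # sum_j <b*_i, b_j>^4, per i
    # sum_{j != k} p_j^2 p_k^2 = (sum_j p_j^2)^2 - sum_j p_j^4, row-wise
    cross = P2.sum(axis=1) ** 2 - fourth
    return float(np.sum(a_star * cross) - mu * np.sum(a_star * fourth))
```

At $B = B^*$ the first term vanishes and $G(B^*) = -\mu \sum_i a_i^*$; permuting the rows leaves the value unchanged, and random unit-norm rows typically score strictly worse, consistent with the theorem.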

Ø Caveat: requires huge batch sizes and training sets

Ø Landscape design: designing new models and objectives with good landscape properties
Ø This paper: a first step for simplified neural nets
Open questions:
Ø Sample efficiency: killing the higher-order terms seems to lose information
Ø Best empirical result: using the new loss for training ReLU networks
Ø Beyond Gaussian inputs
Ø Understanding over-parameterization
Ø More techniques for analyzing optimization landscapes
Thank you!