Tengyu Ma, Facebook AI Research. Based on joint work with Yuanzhi Li (Princeton) and Hongyang Zhang (Stanford)


- Over-parameterization: # parameters ≫ # examples
- There exist sets of parameters that can:
  - fit the training data and generalize to test data,
  - or fit real inputs with random labels, and fail to generalize,
  - or fit the training data but fail to generalize
- This talk: analysis for simpler models that share the properties above (matrix sensing, and neural nets with quadratic activations)

- Uniform convergence doesn't hold:
  - "training loss ≈ test loss for all parameters" fails
  - there are models that fit the training data but fail to generalize (test loss ≫ training loss)
- Algorithm matters: multiple local/global minima exist, and the algorithm chooses the one that generalizes
  - different algorithms converge to local minima of the training loss, but generalize differently [Keskar et al. 16, Wilson et al. 17, Dinh et al. 17]
- Post-mortem explanations: margin theory, PAC-Bayes, and compression-based bounds [Bartlett et al. 17, Neyshabur et al. 17, Arora et al. 18, Dziugaite and Roy 18]

Algorithms matter:
- Stochastic gradient descent, with proper initialization and learning rate, prefers an optimal solution with low complexity, when one exists
- # parameters is almost irrelevant; the intrinsic complexity of the data matters
- This talk: a rigorous argument for matrix sensing and quadratic neural networks

- $n$ data points $a_1, \dots, a_n \in \mathbb{R}^d$, with $n \ll d$
- Claim: minimizing $f(x)$ with gradient descent starting from $x = 0$ is equivalent to solving $\min f(x)$ subject to $x \in \mathrm{span}(a_1, \dots, a_n)$
- Gradient descent is limited to searching in a subspace
- Related: GD on logistic loss converges to the max-margin solution [Soudry et al. 17, Ji & Telgarsky 17]
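
A minimal numerical sketch of the claim (the least-squares form of $f$ and all sizes here are my own illustrative assumptions, not from the slides): GD from the origin never leaves the row span of the data, so it converges to the minimum-norm interpolant.

```python
# Sketch: GD from x = 0 on f(x) = ||Ax - y||^2 stays in span(a_1, ..., a_n)
# (each gradient is a linear combination of the rows of A), hence converges
# to the minimum-norm interpolating solution pinv(A) @ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                           # n << d
A = rng.standard_normal((n, d))          # rows are the data points a_i
y = rng.standard_normal(n)

x = np.zeros(d)                          # initialization at the origin
eta = 1e-3
for _ in range(20000):
    x -= eta * 2 * A.T @ (A @ x - y)     # gradient of ||Ax - y||^2

print(np.linalg.norm(x - np.linalg.pinv(A) @ y))   # ~0: the min-norm solution
```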

- $n$ measurement matrices $A_1, \dots, A_n \in \mathbb{R}^{d \times d}$ with entries from the standard normal distribution
- Unknown PSD matrix $M^\star \in \mathbb{R}^{d \times d}$ of rank $r$
- We observe $y_i = \langle A_i, M^\star \rangle$
- Variable $U \in \mathbb{R}^{d \times d}$:
$$\min_U f(U) = \sum_{i=1}^n \big(y_i - \langle A_i, UU^\top \rangle\big)^2$$
- Focus: gradient descent $U_{t+1} = U_t - \eta \nabla f(U_t)$
- Well-studied problem with efficient solutions [Recht et al. 10, Candès et al. 07, Tu et al. 15, Zheng and Lafferty 15]
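
A self-contained sketch of this setup (dimensions, step size, and iteration count are my own illustrative choices): Gaussian measurements of a rank-$r$ PSD matrix, then plain GD on $f$ with the full $d \times d$ variable and small initialization.

```python
# Sketch of the matrix sensing setup; all parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 2
n = 5 * d * r                                  # far fewer than d^2 parameters
Ustar = rng.standard_normal((d, r))
M = Ustar @ Ustar.T                            # unknown PSD matrix of rank r
As = rng.standard_normal((n, d, d))            # Gaussian measurement matrices
ys = np.einsum('kij,ij->k', As, M)             # y_i = <A_i, M*>

alpha, eta = 1e-3, 2e-5
U = alpha * np.eye(d)                          # small initialization, full d x d
for _ in range(1000):
    resid = np.einsum('kij,ij->k', As, U @ U.T) - ys  # <A_i, U U^T> - y_i
    G = np.einsum('k,kij->ij', resid, As)             # sum_i resid_i * A_i
    U -= eta * (G + G.T) @ U                          # grad of f(U), up to a factor of 2

# Relative recovery error; should come out small in this regime.
print(np.linalg.norm(U @ U.T - M) / np.linalg.norm(M))
```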

min U f(u) = 3 4 : = 7 4, * nx (y i ha i,uu > i) 2 i=1 Ø Regime of parameters:! #$ % # % Ø Ideal solution: ' satisfying '' ( = * has zero training error Ø other solution ' with zero training error but '' ( * Gradient descent with small initialization empirically converges to the ideal solution! [Gunasekar et al. 2017] Ø Compared to low-rank factorization (taking ' R / 1 ): the algorithm finds the correct rank automatically

Test error (population risk): $\mathbb{E} f = \|M^\star - UU^\top\|_F^2$
(Figure: test error vs. iterations; $r = 5$, $n = 5dr$.)
- Early stopping and stochasticity are not necessary
- Systematic empirical studies in [Gunasekar et al. 2017]

Theorem [Li-M.-Zhang 17]: With $\tilde{O}(dr^2)$ observations, initialization $U_0 = \alpha I$, and learning rate $\eta$, at any iteration $t$ satisfying (up to constant and condition-number factors) $\frac{1}{\eta}\log\frac{1}{\alpha} \lesssim t \lesssim \frac{1}{\eta\sqrt{\alpha}}$, the generalization error is bounded by
$$\|U_t U_t^\top - M^\star\|_F \lesssim \alpha\sqrt{d}.$$
Technicalities:
- We assume $M^\star$ is well-conditioned
- The theorem also holds when the measurements $A_1, \dots, A_n$ satisfy the rank-$r$ restricted isometry property with parameter $\delta \lesssim 1/\sqrt{r}$
- The runtime bound is non-trivial even with infinite samples

Gradient descent prefers low-complexity solutions:
$$S_r = \{\text{approximately rank-}r \text{ solutions}\} := \{U : \sigma_{r+1}(U) \le \epsilon\}$$
(Diagram: the region $S_r$ contains the origin $0$ and the generalizable global minima of the training loss; the non-generalizable global minima lie outside $S_r$.)
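
Concretely, membership in $S_r$ is just a singular-value test; a minimal helper (the function name and default tolerance are my own):

```python
# Check of membership in S_r: U is "approximately rank r" when its
# (r+1)-th singular value is at most a tolerance eps.
import numpy as np

def in_S_r(U: np.ndarray, r: int, eps: float = 1e-3) -> bool:
    """True iff sigma_{r+1}(U) <= eps."""
    s = np.linalg.svd(U, compute_uv=False)   # singular values, descending
    return len(s) <= r or s[r] <= eps        # s[r] is the (r+1)-th value
```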

More concrete analysis plan:
- GD on the population risk $\mathbb{E}f$ stays in $S_r$
- GD on $f$ behaves similarly to GD on $\mathbb{E}f$ inside $S_r$: $\nabla \mathbb{E}f(U) \approx \nabla f(U)$
- Generalization is trivial in $S_r$: $\mathbb{E}f(U) \approx f(U)$ for all $U \in S_r$
(Diagram: the GD trajectory on $\mathbb{E}f$ starts at $0$, stays inside $S_r$, and reaches the generalizable global minima of the training loss; the non-generalizable global minima lie outside $S_r$.)

- Input dimension $= 100$
- Generate labels with a network of hidden layer size $r = 1$
- Train with hidden layer size $= 100$

- WLOG, assume $M^\star = u^\star u^{\star\top}$ with $\|u^\star\| = 1$ (rank $r = 1$)
- Decompose the iterate $U_t$ into a signal part and a noise part:
$$U_t = u^\star r_t^\top + E_t, \qquad r_t = U_t^\top u^\star \ (\text{signal}), \qquad E_t = (I - u^\star u^{\star\top})U_t \ (\text{noise})$$
- Goals: show inductively that
  - $\|E_t\| \approx 0$ (the noise stays small)
  - $\|r_t\| \to 1$ (the signal grows to full strength)
- Roughly, the dynamics satisfy $\|E_{t+1}\| \le \|E_t\| + 2\eta\tau$ and $\|r_{t+1}\| \ge (1 + \eta(1 - \|r_t\|^2))\|r_t\| - 2\eta\tau$, with $\tau$ a small error term
- These imply $U_t U_t^\top \to u^\star u^{\star\top} = M^\star$
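
A population-risk sketch of this decomposition ($d$, $\eta$, $\alpha$, and the iteration budget are illustrative): run GD on $\|M^\star - UU^\top\|_F^2$ and watch the two components separately.

```python
# Track the signal/noise decomposition U_t = u* r_t^T + E_t along GD on the
# population risk ||M - U U^T||_F^2, with M = u* u*^T of rank one.
import numpy as np

rng = np.random.default_rng(0)
d = 50
u = rng.standard_normal(d); u /= np.linalg.norm(u)   # u*, unit norm
M = np.outer(u, u)

eta, alpha = 0.05, 1e-4
U = alpha * np.eye(d)                                # small initialization
for t in range(400):
    U = U - eta * (U @ U.T - M) @ U                  # population GD step (up to constants)
    if t % 100 == 0:
        r = U.T @ u                                  # signal component r_t
        E = U - np.outer(u, r)                       # noise component E_t = (I - u u^T) U
        print(t, np.linalg.norm(r), np.linalg.norm(E, 2))
# ||r_t|| -> 1 while ||E_t|| stays O(alpha), so U_t U_t^T -> u* u*^T.
```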

Lemma 1: $\|E_{t+1}\| \le \|E_t\| + 2\eta\tau$
- Preparation: the GD step on the population risk is
$$U_{t+1} = U_t - \eta(U_t U_t^\top - M^\star)U_t = \big(I - \eta(U_t U_t^\top - M^\star)\big)U_t$$
(the step is small when $U_t$ is approximately low rank)
- Proof: with $E_t = (I - u^\star u^{\star\top})U_t$,
$$E_{t+1} = E_t(I - \eta U_t^\top U_t) + \text{small term}, \quad \text{because } (I - u^\star u^{\star\top})M^\star = 0$$
- GD on the population risk reduces the error: $\|E_{t+1}\| \le \|E_t\| + \text{small term}$
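
Filling in the one-line computation behind the proof (population-risk step only; the "small term" comes from replacing the population gradient by the empirical one):
$$E_{t+1} = (I - u^\star u^{\star\top})U_{t+1} = (I - u^\star u^{\star\top})\big(I - \eta(U_t U_t^\top - M^\star)\big)U_t = E_t - \eta E_t U_t^\top U_t = E_t(I - \eta U_t^\top U_t),$$
using $(I - u^\star u^{\star\top})M^\star = 0$ and $(I - u^\star u^{\star\top})U_t U_t^\top U_t = E_t U_t^\top U_t$. Since $0 \preceq \eta U_t^\top U_t \preceq I$ for small $\eta$, this gives $\|E_{t+1}\| \le \|E_t\|$ on the population risk.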

- $q$: entry-wise quadratic activation; the network computes $y = \mathbf{1}^\top q(U^\top x)$
- Almost equivalent to matrix sensing with rank-1 measurements:
$$y = x^\top UU^\top x = \langle xx^\top, UU^\top \rangle$$
($xx^\top$ is the measurement matrix; $UU^\top$ is the matrix to recover)
- Only difference: unlike random measurements, $xx^\top$ doesn't satisfy the restricted isometry property
- Solution: adaptively throwing away a very small fraction of the data that devastates the restricted isometry property
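
A quick numeric confirmation of this identity (the shapes are arbitrary):

```python
# For a one-hidden-layer net with entry-wise quadratic activation,
# 1^T q(U^T x) equals the rank-1 measurement <x x^T, U U^T>.
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 7
x = rng.standard_normal(d)
U = rng.standard_normal((d, m))                  # hidden-layer weights

y_net = np.sum((U.T @ x) ** 2)                   # 1^T q(U^T x), q(z) = z^2
y_sensing = np.trace(np.outer(x, x) @ U @ U.T)   # <x x^T, U U^T>
print(abs(y_net - y_sensing))                    # ~0 up to floating point
```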

- Generalization error depends on initialization
(Figure: error curves for one initialization scale.)

(Figure: error curves for another initialization scale.)
- Caveat: SGD or GD with large initialization can work with quadratic neural networks. (But the current theory requires small initialization.)

- Algorithm analyzed: GD on
$$\min_U f(U) = \sum_{i=1}^n \big(y_i - \langle A_i, UU^\top \rangle\big)^2$$
- Algorithm for comparison: projected GD on
$$\min_{Z \succeq 0}\ g(Z) = \sum_{i=1}^n \big(y_i - \langle A_i, Z \rangle\big)^2$$
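
A sketch of the comparison algorithm (function names, step size, and iteration count are my own): projected GD over the PSD cone, where the projection is the standard one that zeroes out negative eigenvalues.

```python
# Projected GD on g(Z) = sum_i (y_i - <A_i, Z>)^2 over the PSD cone.
import numpy as np

def psd_project(Z: np.ndarray) -> np.ndarray:
    """Frobenius-nearest PSD matrix: clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((Z + Z.T) / 2)     # symmetrize, then eigendecompose
    return (V * np.maximum(w, 0)) @ V.T      # V diag(max(w, 0)) V^T

def projected_gd(As, ys, eta=1e-4, iters=500):
    n, d, _ = As.shape
    Z = np.zeros((d, d))
    for _ in range(iters):
        resid = np.einsum('kij,ij->k', As, Z) - ys   # <A_i, Z> - y_i
        grad = np.einsum('k,kij->ij', resid, As)     # grad of g, up to a factor of 2
        Z = psd_project(Z - eta * grad)              # gradient step, then project
    return Z
```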

- Algorithms have an implicit regularization effect
Open questions:
- other matrix-factorization-based models
- logistic loss [Gunasekar et al. 18]
- neural nets with other activation functions and losses (more in Nati's talk)
- better understanding of algorithms for deep learning, which seems to be very helpful for fully understanding generalization
Thank you!