Effective Theory of Deep Learning

Transcript of Effective Theory of Deep Learning

Page 1: Effective Theory of Deep Learning

Effective Theory of Deep Learning: Beyond the Infinite-Width Limit

Dan Roberts (a) and Sho Yaida (b)

(a) MIT, IAIFI, & Salesforce; (b) Facebook AI Research

Deep Learning Theory Summer School at Princeton, July 27, 2021 – August 8, 2021

Based on The Principles of Deep Learning Theory, also with Boris Hanin: 2106.10165.

1 / 42

Page 2: Effective Theory of Deep Learning

Initialization

The simulation is such that [one] generally perceives the sum of many billions of elementary processes simultaneously, so that the leveling law of large numbers completely obscures the real nature of the individual processes.

John von Neumann

Thanks to substantial investments into computer technology, modern artificial intelligence (AI) systems can now come equipped with many billions of elementary components.

- Behind much of this success is deep learning: deep learning uses artificial neural networks as an underlying model for AI.

- The real power of this framework comes from representation learning: deep neural networks, with many neurons organized into sequential computational layers, learn useful representations of the world.

2 / 42

Page 5: Effective Theory of Deep Learning

Initialization: Goals

The goal of these lectures is to put forth a set of principles that enable us to theoretically analyze deep neural networks of actual relevance. To initialize you to this task, in the beginning of this lecture we'll explain at a very high level both

(i) why such a goal is even attainable in theory, and (ii) how we are able to get there in practice.

3 / 42

Page 6: Effective Theory of Deep Learning

Neural Networks

A neural network is a recipe for computing a function built out of many computational units called neurons:

Neurons are then organized in parallel into layers, and deep neural networks are those composed of multiple layers in sequence.

4 / 42
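
As a concrete companion to this description, here is a minimal sketch of a deep network in NumPy: neurons are grouped into layers, and the layers are composed in sequence. The widths, the tanh activation, and the Gaussian initialization are illustrative assumptions of mine, not the lecture's specific setup.

    import numpy as np

    def init_mlp(widths, seed=0):
        # sample each layer's weights and biases from simple Gaussians
        rng = np.random.default_rng(seed)
        params = []
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
            b = np.zeros(n_out)
            params.append((W, b))
        return params

    def mlp_forward(params, x):
        # apply the layers in sequence; each neuron is an affine map followed by a nonlinearity
        z = x
        for ell, (W, b) in enumerate(params):
            z = W @ z + b
            if ell < len(params) - 1:   # keep the final layer linear
                z = np.tanh(z)
        return z

    params = init_mlp([3, 128, 128, 1])
    print(mlp_forward(params, np.ones(3)))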

Page 7: Effective Theory of Deep Learning

Neural Networks Abstracted

For a moment, let's ignore all that structure and simply think of a neural network as a parameterized function

f(x; θ) ,

where x is the input to the function and θ is a vector of a large number of parameters controlling the shape of the function.

5 / 42

Page 8: Effective Theory of Deep Learning

Neural Networks Abstracted: Function Approximation

For such a function to be useful, we need to tune the high-dimensional parameter vector θ:

- First, we choose an initialization distribution by randomly sampling the parameter vector θ from a computationally simple probability distribution,

p(θ) .

- Second, we adjust the parameter vector as θ → θ⋆, such that the resulting network function f(x; θ⋆) is as close as possible to a desired target function f(x):

f(x; θ⋆) ≈ f(x) .

This is called function approximation.

6 / 42
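
To make the two steps concrete, here is a toy sketch of my own (not from the slides): a parameter vector θ is drawn from a simple Gaussian p(θ) and then adjusted toward θ⋆ so that f(x; θ⋆) ≈ f(x) on observed pairs. The quadratic model, the sin target, and the plain gradient loop are assumptions chosen for brevity.

    import numpy as np

    rng = np.random.default_rng(1)
    f_target = np.sin                         # the desired, only partially observable, f(x)
    xs = rng.uniform(-2, 2, size=64)          # observed inputs
    ys = f_target(xs)                         # observed pairs (x, f(x))

    def f(x, theta):                          # a tiny parameterized function f(x; theta)
        return theta[0] + theta[1] * x + theta[2] * x**2

    theta = rng.normal(size=3)                # step 1: sample theta from a simple p(theta)
    for _ in range(2000):                     # step 2: adjust theta toward theta_star
        err = f(xs, theta) - ys
        theta -= 0.1 * np.array([err.mean(), (err * xs).mean(), (err * xs**2).mean()])

    print("f(0.5; theta*) =", f(0.5, theta), "   target f(0.5) =", f_target(0.5))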

Page 11: Effective Theory of Deep Learning

Neural Networks Abstracted: Training

To find these tunings θ⋆, we fit the network function f(x; θ) to training data, consisting of many pairs of the form (x, f(x)) observed from the desired – but only partially observable – target function f(x).

- Making these adjustments is called training.
- The particular procedure used to tune them is called a learning algorithm.

7 / 42

Page 12: Effective Theory of Deep Learning

The Theoretical Minimum

Our goal is to understand this trained network function:

f(x; θ⋆) .

One way to see the kinds of technical problems that we'll encounter in pursuit of this goal is to Taylor expand our trained network function f(x; θ⋆) around the initialized value of the parameters θ

f(x; θ⋆) = f(x; θ) + (θ⋆ − θ) df/dθ + (1/2) (θ⋆ − θ)² d²f/dθ² + … ,

where f(x; θ) and its derivatives on the right-hand side are all evaluated at the initialized value of the parameters.

8 / 42

Page 14: Effective Theory of Deep Learning

The Theoretical Minimum: Problem 1

In general, the Taylor series contains an infinite number of terms

f , df/dθ , d²f/dθ² , d³f/dθ³ , d⁴f/dθ⁴ , … ,

and in principle we need to compute them all.

9 / 42

Page 15: Effective Theory of Deep Learning

The Theoretical Minimum: Problem 2

Since the parameters θ are randomly sampled from p(θ), each time we initialize our network we get a different function f(x; θ), and we need to determine the mapping:

p(θ) → p(f, df/dθ, d²f/dθ², …) .

This means that each term f, df/dθ, d²f/dθ², …, in the Taylor expansion is really a random function of the input x, and this joint distribution will have intricate statistical dependencies.

10 / 42
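
One quick way to see this mapping numerically (my own illustration, with a one-hidden-layer tanh network and Gaussian p(θ) as assumptions): repeatedly draw θ from p(θ), evaluate the resulting random function at two fixed inputs, and estimate the statistics of the induced distribution by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 64                                   # hidden-layer width
    x1, x2 = 0.5, 1.5                        # two fixed inputs

    outs = []
    for _ in range(5000):
        w1 = rng.normal(0, 1, size=n)                 # p(theta): simple Gaussians
        w2 = rng.normal(0, 1 / np.sqrt(n), size=n)
        outs.append([w2 @ np.tanh(w1 * x1), w2 @ np.tanh(w1 * x2)])

    outs = np.array(outs)                    # each row is a different random function f(x; theta)
    print("covariance of (f(x1), f(x2)) over initializations:")
    print(np.cov(outs.T))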

Page 16: Effective Theory of Deep Learning

The Theoretical Minimum: Problem 3

The learned value of the parameters, θ⋆, is the result of a complicated training process. In general, θ⋆ is not unique and can depend on everything:

θ⋆ ≡ [θ⋆](θ, f, df/dθ, d²f/dθ², …; learning algorithm; training data) .

Determining an analytical expression for θ⋆ must take "everything" into account.

11 / 42

Page 17: Effective Theory of Deep Learning

Goal, restated

If we could solve all three of these problems, then we'd have a distribution over trained network functions

p(f⋆) ≡ p(f(x; θ⋆) | learning algorithm; training data) ,

now conditioned in a simple way on the learning algorithm and the data we used for training.

The development of a method for the analytical computation of p(f⋆) is the principal goal of these lectures.

12 / 42

Page 19: Effective Theory of Deep Learning

Fine, Structure

Solving our three problems for a general parameterized function f(x; θ) is not tractable. However, we only care about the functions that are deep neural networks:

To make progress we will have to make use of the particular structure of neural-network functions.

13 / 42

Page 20: Effective Theory of Deep Learning

Fine, Structure

Two essential aspects of a neural network architecture are its width, n, and its depth, L.

14 / 42

Page 21: Effective Theory of Deep Learning

A Principle of Sparsity

There are often simplifications to be found in the limit of a large number of components.

It's not enough to consider any massive macroscopic system, and taking the right limit often requires some care.

15 / 42

Page 25: Effective Theory of Deep Learning

A Principle of Sparsity

In this case, the n = ∞ limit will make everything really simple, while the L = ∞ limit will be hopelessly complicated and useless in practice.

16 / 42

Page 26: Effective Theory of Deep Learning

The Infinite-Width Limit

Let’s begin by formally taking the limit

lim_{n→∞} p(f⋆) ,

and studying an idealized neural network in this limit.

- This is known as the infinite-width limit of the network, and as a strict limit it's rather unphysical for a network: obviously you cannot directly program a function to have an infinite number of components on a finite computer.

- However, this extreme limit does massively simplify the distribution over trained networks p(f⋆), rendering each of our three problems completely benign.

17 / 42
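
As a numerical sanity check on this idealization (an illustration of mine, not the lecture's code), the output of a randomly initialized one-hidden-layer tanh network becomes increasingly Gaussian as the width n grows; its excess kurtosis, which vanishes for a Gaussian, shrinks toward zero.

    import numpy as np

    rng = np.random.default_rng(0)
    x, samples = 1.0, 40_000

    def excess_kurtosis(f):
        z = f - f.mean()
        return (z**4).mean() / (z**2).mean()**2 - 3.0   # zero for a Gaussian

    for n in [2, 8, 32, 128]:
        w1 = rng.normal(0, 1, size=(samples, n))
        w2 = rng.normal(0, 1 / np.sqrt(n), size=(samples, n))
        f = np.einsum('sj,sj->s', w2, np.tanh(w1 * x))   # output at init, one sample per row
        print(f"n = {n:4d}   excess kurtosis of f(x; theta): {excess_kurtosis(f):+.3f}")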

Page 29: Effective Theory of Deep Learning

Simplicity at Infinite Width

- Addressing Problem 1, higher derivative terms will effectively vanish, and we only need to keep track of two terms: f , df/dθ .

- Addressing Problem 2, the distributions of these random functions will be independent,

lim_{n→∞} p(f, df/dθ, d²f/dθ², …) = p(f) p(df/dθ) ,

with each marginal distribution factor taking a simple form.

- Addressing Problem 3, the training dynamics become linear and independent of the details of the learning algorithm, giving θ⋆ in a closed-form analytical solution:

lim_{n→∞} θ⋆ = [θ⋆](θ, f, df/dθ; training data) .

18 / 42

Page 32: Effective Theory of Deep Learning

Simplicity at Infinite-Width

These simplifications are the consequence of a principle of sparsity, and the fully-trained distribution,

lim_{n→∞} p(f⋆) ,

is a simple Gaussian distribution with a nonzero mean.

19 / 42

Page 33: Effective Theory of Deep Learning

Too Much Simplicity at Infinite-Width

The formal infinite-width limit, n → ∞, leads to a poor model of deep neural networks in practice:

- The distribution over real trained networks does depend on the properties of the learning algorithm used to train them.

- Infinite-width networks don't have representation learning: for any input x, its transformations in the hidden layers, z^(1), z^(2), …, z^(L−1), will remain unchanged from initialization.

The central limiting problem is that the input of an infinite number of signals is such that the leveling law of large numbers completely obscures the subtle correlations between neurons that get amplified over the course of training for representation learning.

20 / 42

Page 39: Effective Theory of Deep Learning

Interacting Neurons

We'll need to find a way to restore and then study the interactions between neurons that are present in realistic finite-width networks.

To do so, we can use perturbation theory and study deep learning using a 1/n expansion, treating the inverse layer width, ε ≡ 1/n, as our small parameter of expansion:

p(f⋆) ≡ { lim_{n→∞} p(f⋆) } + p^{1}(f⋆)/n + O(1/n²) .

21 / 42
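
To see the 1/n structure quantitatively, here is a sketch of my own (assuming unit-Gaussian first-layer preactivations and a tanh activation): the excess kurtosis of a second-layer preactivation, the same fourth-moment statistic used in the earlier width sweep, scales like 1/n, and n times that kurtosis roughly matches the coefficient one would predict by treating 1/n as the expansion parameter.

    import numpy as np

    rng = np.random.default_rng(0)

    # predicted coefficient of the leading 1/n correction to the excess kurtosis,
    # estimated by Monte Carlo over a single unit-Gaussian preactivation
    g = rng.normal(size=1_000_000)
    t2 = np.tanh(g)**2
    predicted = 3 * t2.var() / t2.mean()**2

    for n in [2, 4, 8, 16]:
        z1 = rng.normal(size=(400_000, n))                   # first-layer preactivations
        w2 = rng.normal(0, 1 / np.sqrt(n), size=(400_000, n))
        z2 = np.einsum('sj,sj->s', w2, np.tanh(z1))          # a second-layer preactivation
        c = z2 - z2.mean()
        kurt = (c**4).mean() / (c**2).mean()**2 - 3.0
        print(f"n = {n:3d}   n * (excess kurtosis) = {n * kurt:5.2f}   predicted {predicted:5.2f}")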

Page 40: Effective Theory of Deep Learning

Near-Simplicity at Finite Width

- Addressing Problem 1, most derivatives will contribute as O(1/n²) or smaller, so we only need to keep track of 4 terms: f , df/dθ , d²f/dθ² , d³f/dθ³ .

- Addressing Problem 2, the distribution of these random functions at initialization, p(f, df/dθ, d²f/dθ², d³f/dθ³), will be nearly simple at order 1/n.

- Addressing Problem 3, the nonlinear training dynamics can be tamed with dynamical perturbation theory, giving θ⋆ in a closed-form analytical solution:

θ⋆ = [θ⋆](θ, f, df/dθ, d²f/dθ², d³f/dθ³; learning algorithm; training data) .

22 / 42

Page 43: Effective Theory of Deep Learning

Near-Simplicity at Finite Width

These near-simplifications are a further consequence of the principle of sparsity, and our dual effective theory description of the fully-trained distribution at order 1/n,

p(f⋆) ≡ { lim_{n→∞} p(f⋆) } + p^{1}(f⋆)/n + O(1/n²) ,

will be a nearly-Gaussian distribution.

23 / 42

Page 44: Effective Theory of Deep Learning

Subleading Corrections?

An important byproduct of studying the leading corrections is actually a deeper understanding of the truncation error. Defining the aspect ratio

r ≡ L/n ,

we will see that we can recast our understanding of infinite-width vs. finite-width and shallow vs. deep:

- In the strict limit r → 0, the interactions between neurons turn off: the infinite-width limit is actually a decent description, but these networks are not really deep, as their relative depth is zero: L/n = 0.

- In the regime 0 < r ≪ 1, there are nontrivial interactions between neurons: the finite-width effective theory truncated at order 1/n gives an accurate accounting of p(f⋆). These networks are effectively deep.

- In the regime r ≫ 1, the neurons are strongly coupled: networks will behave chaotically, and there is no effective description due to large fluctuations from instantiation to instantiation. These networks are overly deep.

Networks of practical use have small aspect ratios: r ∼ r⋆ ≪ 1.

24 / 42

Page 49: Effective Theory of Deep Learning

End of Initialization

To really describe the properties of multilayer neural networks, i.e. to understand deep learning, we need to study large-but-finite-width networks.

25 / 42

Page 50: Effective Theory of Deep Learning

Course Plan

Lecture 1: Initialization, Linear Models
- §0 + §7.1 + §10.4

Lecture 2: Gradient Descent & Nearly-Kernel Methods
- §7.2 + §∞.2.2 + §11.4

Lecture 3: The Principle of Sparsity (Recursing)
- §4, §8, §11.2, §∞.3

Lecture 4: The Principle of Criticality
- §5, §9, §11.3, §∞.1, + §10.3

Lecture 5: The End of Training & More
- §∞.2.3 + Maybe { §A, §B, … }

26 / 42

Page 51: Effective Theory of Deep Learning

Problem 3: Training Dynamics

The learned value of the parameters, θ⋆, is the result of a complicated training process. In general, θ⋆ is not unique and can depend on everything:

θ⋆ ≡ [θ⋆](θ, f, df/dθ, d²f/dθ², …; learning algorithm; training data) .

Determining an analytical expression for θ⋆ must take "everything" into account.

The goal right now is to understand the algorithm dependence and data dependence of solutions for a very general class of machine learning models.

27 / 42

Page 57: Effective Theory of Deep Learning

Machine Learning Models

f(x; θ) ≡ z_i(x_δ; θ) ≡ z_{i;δ}(θ)

- i = 1, …, n_out is a vectorial index
- δ ∈ D is a sample index

28 / 42

Page 64: Effective Theory of Deep Learning

Machine Learning Models: Taylor Expanded

z_{i;δ}(θ) = z_{i;δ}(θ=0) + Σ_{µ=1}^{P} θ_µ dz_{i;δ}/dθ_µ |_{θ=0}
  + (1/2) Σ_{µ1,µ2=1}^{P} θ_{µ1} θ_{µ2} d²z_{i;δ}/(dθ_{µ1} dθ_{µ2}) |_{θ=0}
  + (1/3!) Σ_{µ1,µ2,µ3=1}^{P} θ_{µ1} θ_{µ2} θ_{µ3} d³z_{i;δ}/(dθ_{µ1} dθ_{µ2} dθ_{µ3}) |_{θ=0}
  + …

- µ is a parameter index, with P total model parameters
- a generic nonlinear model will be intractable
- truncation to a cubic model describes deep MLPs at O(1/n)
- truncation to a quadratic model illustrates nonlinear dynamics
- truncation to a linear model describes infinite-width networks

29 / 42
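
Here is a small sketch of my own (with an arbitrary made-up nonlinear z(θ)) of what these truncations look like in practice: estimate dz/dθ and d²z/dθ² at θ = 0 by finite differences and compare the linear and quadratic truncated models against the exact value at a small θ.

    import numpy as np

    rng = np.random.default_rng(0)
    P = 5                                     # number of model parameters
    A = rng.normal(size=(P, P)) / P
    v = rng.normal(size=P)

    def z(theta):
        # a stand-in nonlinear model output z(theta) for one component (i; delta)
        return np.sin(v @ theta) + (theta @ A @ theta) / (1 + theta @ theta)

    eps, e = 1e-4, np.eye(P)
    grad = np.array([(z(eps*e[m]) - z(-eps*e[m])) / (2*eps) for m in range(P)])
    hess = np.array([[(z(eps*(e[m]+e[n])) - z(eps*(e[m]-e[n]))
                       - z(-eps*(e[m]-e[n])) + z(-eps*(e[m]+e[n]))) / (4*eps**2)
                      for n in range(P)] for m in range(P)])

    theta = 0.1 * rng.normal(size=P)
    linear = z(np.zeros(P)) + theta @ grad
    quadratic = linear + 0.5 * theta @ hess @ theta
    print("exact:", z(theta))
    print("linear truncation:", linear, "   quadratic truncation:", quadratic)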

Page 65: Effective Theory of Deep Learning

Linear Models

The simplest linear model is a one-layer (zero-hidden-layer) network

z_{i;δ}(θ) = b_i + Σ_{j=1}^{n_0} W_ij x_{j;δ} .

- While this model is linear in both the parameters θ = {b_i, W_ij} and the input x_{j;δ}, the "linear" in linear model takes its name from the dependence on the parameters θ and not the input x.

- While the inputs x_{j;δ} can sometimes serve as a reasonable set of features for function approximation, in general they do not.

30 / 42
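
In code, this zero-hidden-layer linear model is just an affine map; here is a minimal NumPy sketch with illustrative dimensions and a random θ = {b_i, W_ij}.

    import numpy as np

    rng = np.random.default_rng(0)
    n_0, n_out = 4, 2                        # input and output dimensions (illustrative)
    W = rng.normal(size=(n_out, n_0))        # weights W_ij
    b = rng.normal(size=n_out)               # biases b_i

    def z(x_batch):
        # z_{i;delta}(theta) = b_i + sum_j W_ij x_{j;delta}, for a batch of inputs
        return b + x_batch @ W.T

    x_batch = rng.normal(size=(3, n_0))      # three samples, delta = 1, 2, 3
    print(z(x_batch))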

Page 71: Effective Theory of Deep Learning

(Generalized) Linear Models

Instead, we might design a fixed set of feature functions φ_j(x) that are meant to work well for the underlying task at hand:

z_{i;δ}(θ) = b_i + Σ_{j=1}^{n_f} W_ij φ_j(x_δ) = Σ_{j=0}^{n_f} W_ij φ_j(x_δ) .

- All the complicated modeling work goes into the construction of these feature functions φ_j(x).

- Here, we've subsumed the bias vector into the weight matrix by setting φ_0(x) ≡ 1 and W_i0 ≡ b_i.

- We can still think of this model as a one-layer neural network, but now we pre-process the input x with the function φ_j(x).

31 / 42
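
A minimal sketch of such a model, with a hand-designed feature map chosen purely for illustration; φ_0(x) ≡ 1 so that the bias is subsumed into the weight matrix as above.

    import numpy as np

    rng = np.random.default_rng(0)

    def features(x):
        # phi_j(x) for scalar inputs x: phi_0 = 1, then a few fixed nonlinear features
        return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

    n_f = features(np.zeros(1)).shape[-1] - 1   # here n_f + 1 = 4 features
    n_out = 1
    W = rng.normal(size=(n_out, n_f + 1))       # W_{i0} plays the role of the bias b_i

    def z(x):
        # z_{i;delta}(theta) = sum_{j=0}^{n_f} W_ij phi_j(x_delta)
        return features(x) @ W.T

    print(z(np.linspace(-1, 1, 5)))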

Page 72: Effective Theory of Deep Learning

Aside: Supervised Learning

For any pair (x, y) that is jointly sampled from a data distribution

p(x, y) = p(y|x) p(x) ,

the goal is to predict a label y given an input x. The better the probabilistic model learns the distribution p(y|x), the more accurately it can predict a true label y for a novel input example x.

32 / 42

Page 73: Effective Theory of Deep Learning

Aside: Supervised Learning

Before we understand how to fit the model parameters, we need to understand what we mean by making good predictions:

- For a typical (x_δ, y_δ) sampled from p(x, y), we want z_δ(θ) to be as close to y_δ as possible on average.

- To measure this proximity, we need to define an auxiliary function called the loss,

L(z_δ(θ), y_δ) ,

such that the closer z_δ(θ) is to y_δ, the smaller the loss.

- One intuitive choice for this is the MSE loss:

L_MSE(z_δ(θ), y_δ) ≡ (1/2) [z_δ(θ) − y_δ]² .

33 / 42
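
A small sketch of these definitions (scalar outputs assumed for simplicity); the second function is the training-set sum L_A(θ) used on the next slide.

    import numpy as np

    def loss_mse(z, y):
        # per-sample MSE loss: (1/2) (z - y)^2
        return 0.5 * (z - y)**2

    def training_loss(z_batch, y_batch):
        # L_A(theta): sum of the per-sample losses over the training set A
        return np.sum(loss_mse(z_batch, y_batch))

    print(training_loss(np.array([0.9, 2.1]), np.array([1.0, 2.0])))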

Page 77: Effective Theory of Deep Learning

Aside: Supervised Learning

To train our model, we try to find a configuration of the model parameters that minimizes the training loss L_A(θ):

θ⋆ = argmin_θ L_A(θ) = argmin_θ [ Σ_{α̃∈A} L(z_{α̃}(θ), y_{α̃}) ] ,

where A is the training set.

- To assess the generalization of a model, a test set, (x_β̇, y_β̇)_{β̇∈B}, is used to evaluate a model after training.

- We will denote training set inputs as α̃ ∈ A, generic inputs as δ ∈ D, and test set inputs as β̇ ∈ B.

34 / 42

Page 80: Effective Theory of Deep Learning

Linear Regression

Supervised learning with a linear model is linear regression

L_A(θ) = (1/2) Σ_{α̃∈A} Σ_{i=1}^{n_out} [ y_{i;α̃} − Σ_{j=0}^{n_f} W_ij φ_j(x_α̃) ]² .

Being entirely naive about it, let's try direct optimization:

0 = dL_A/dW_ij |_{W=W⋆} ,

which gives an implicit equation for the optimal weight matrix W⋆_ij:

Σ_{k=0}^{n_f} W⋆_ik [ Σ_{α̃∈A} φ_k(x_α̃) φ_j(x_α̃) ] = Σ_{α̃∈A} y_{i;α̃} φ_j(x_α̃) .

35 / 42

Page 82: Effective Theory of Deep Learning

Linear Regression

To solve this implicit equation,

Σ_{k=0}^{n_f} W⋆_ik [ Σ_{α̃∈A} φ_k(x_α̃) φ_j(x_α̃) ] = Σ_{α̃∈A} y_{i;α̃} φ_j(x_α̃) ,

let's define a symmetric (n_f + 1)-by-(n_f + 1) matrix of features:

M_ij ≡ Σ_{α̃∈A} φ_i(x_α̃) φ_j(x_α̃) .

Applying its inverse to both sides, we find a solution:

W⋆_ij = Σ_{k=0}^{n_f} Σ_{α̃∈A} y_{i;α̃} φ_k(x_α̃) (M⁻¹)_kj .

- Notice that W⋆_ij depends on A, linearly on y_{i;α̃}, and in a more complicated way on φ_k(x_α̃).

36 / 42
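
Here is the closed-form solution in NumPy for a toy training set; the feature map, the cos target, and the data are all illustrative assumptions of mine.

    import numpy as np

    rng = np.random.default_rng(0)

    def features(x):
        # illustrative feature map with phi_0(x) = 1
        return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

    x_train = rng.uniform(-2, 2, size=20)          # training inputs x_alpha
    y_train = np.cos(x_train)[:, None]             # labels y_{i;alpha}, with n_out = 1

    Phi = features(x_train)                        # shape (N_A, n_f + 1)
    M = Phi.T @ Phi                                # M_ij = sum_alpha phi_i(x_alpha) phi_j(x_alpha)
    W_star = (y_train.T @ Phi) @ np.linalg.inv(M)  # W*_ij = sum_{k,alpha} y_{i;alpha} phi_k(x_alpha) (M^-1)_kj
    print("W* =", W_star)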

Page 85: Effective Theory of Deep Learning

Linear Regression

Finally, we can use this to make predictions on test-set inputs x_β̇ as

z_i(x_β̇; θ⋆) = Σ_{j=0}^{n_f} W⋆_ij φ_j(x_β̇) = Σ_{α̃∈A} [ Σ_{j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk φ_k(x_α̃) ] y_{i;α̃} .

Note that this involves the inversion of M_ij. If n_f ≫ 1, then this might be computationally difficult.

37 / 42

Page 86: Effective Theory of Deep Learning

Kernel Methods

Finally, we can use this to make predictions on test-set inputs x_β̇ as

z_i(x_β̇; θ⋆) = Σ_{j=0}^{n_f} W⋆_ij φ_j(x_β̇) = Σ_{α̃∈A} [ Σ_{j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk φ_k(x_α̃) ] y_{i;α̃} .

- Note that this involves the inversion of M_ij.
- If n_f ≫ 1, then this might be computationally difficult.

38 / 42

Page 88: Effective Theory of Deep Learning

The Kernel

Let us introduce a new N_D × N_D-dimensional symmetric matrix:

k_{δ1δ2} ≡ k(x_{δ1}, x_{δ2}) ≡ Σ_{i=0}^{n_f} φ_i(x_{δ1}) φ_i(x_{δ2}) .

As an inner product of features, the kernel k_{δ1δ2} is a measure of similarity between two inputs x_{i;δ1} and x_{i;δ2} in feature space.

We'll also denote an N_A-by-N_A-dimensional submatrix of the kernel evaluated on the training set as k̃_{α̃1α̃2}, with a tilde. This lets us write its inverse as k̃^{α̃1α̃2}, which satisfies

Σ_{α̃2∈A} k̃^{α̃1α̃2} k̃_{α̃2α̃3} = δ^{α̃1}_{α̃3} .

39 / 42
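
In code, the kernel is just the Gram matrix of the feature vectors; a short sketch with the same illustrative feature map as before.

    import numpy as np

    def features(x):
        # illustrative feature map phi_i(x) with phi_0(x) = 1
        return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

    def kernel(xa, xb):
        # k(x1, x2) = sum_i phi_i(x1) phi_i(x2), for 1-d arrays of inputs
        return features(xa) @ features(xb).T

    xs = np.array([-1.0, 0.0, 2.0])
    print(kernel(xs, xs))        # an N_D x N_D symmetric matrix of similarities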

Page 90: Effective Theory of Deep Learning

Some Algebra

Σ_{α̃∈A} [ Σ_{j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk φ_k(x_α̃) ] k̃_{α̃α̃1}
  = Σ_{α̃∈A} [ Σ_{j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk φ_k(x_α̃) ] [ Σ_{i=0}^{n_f} φ_i(x_α̃) φ_i(x_α̃1) ]
  = Σ_{i,j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk M_ki φ_i(x_α̃1)
  = Σ_{i=0}^{n_f} φ_i(x_β̇) φ_i(x_α̃1) = k_{β̇α̃1} .

40 / 42

Page 91: Effective Theory of Deep Learning

Kernel Methods

Multiplying the first and last expressions by the inverse submatrix k̃^{α̃1α̃2}, we get a dual representation

Σ_{j,k=0}^{n_f} φ_j(x_β̇) (M⁻¹)_jk φ_k(x_α̃2) = Σ_{α̃1∈A} k_{β̇α̃1} k̃^{α̃1α̃2} ,

which lets us rewrite the prediction of our linear model as

z_i(x_β̇; θ⋆) = Σ_{α̃1,α̃2∈A} k_{β̇α̃1} k̃^{α̃1α̃2} y_{i;α̃2} .

When the prediction of a linear model is computed in this way, it's known as a kernel machine or kernel methods.

41 / 42
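
A numerical sanity check of this dual representation (my own sketch; the feature map and data are illustrative, and the training set is kept as small as the number of features so that the kernel submatrix k̃ is invertible): the kernel-method prediction agrees with the feature-space linear-regression prediction.

    import numpy as np

    rng = np.random.default_rng(0)

    def features(x):
        # illustrative feature map with phi_0(x) = 1; here n_f + 1 = 4
        return np.stack([np.ones_like(x), x, x**2, np.sin(x)], axis=-1)

    x_train = rng.uniform(-2, 2, size=4)      # N_A = n_f + 1 so that k_tilde is invertible
    y_train = np.cos(x_train)
    x_test = np.array([-1.5, 0.3, 1.7])

    Phi_A, Phi_B = features(x_train), features(x_test)

    # feature-space (linear regression) prediction: z = Phi_B M^-1 Phi_A^T y
    M = Phi_A.T @ Phi_A
    z_features = Phi_B @ np.linalg.solve(M, Phi_A.T @ y_train)

    # kernel-method prediction: z = k_{B,A} k_tilde^-1 y
    k_AA = Phi_A @ Phi_A.T
    k_BA = Phi_B @ Phi_A.T
    z_kernel = k_BA @ np.linalg.solve(k_AA, y_train)

    print("feature-space prediction:", z_features)
    print("kernel-method prediction:", z_kernel)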

Page 92: Effective Theory of Deep Learning

Course Plan

Lecture 1: Initialization, Linear Models
- §0 + §7.1 + §10.4

Lecture 2: Gradient Descent & Nearly-Kernel Methods
- §7.2 + §∞.2.2 + §11.4

Lecture 3: The Principle of Sparsity (Recursing)
- §4, §8, §11.2, §∞.3

Lecture 4: The Principle of Criticality
- §5, §9, §11.3, §∞.1, + §10.3

Lecture 5: The End of Training & More
- §∞.2.3 + Maybe { §A, §B, … }

42 / 42