Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5
1
Function Learning and Neural Nets
R&N: Chap. 20, Sec. 20.5
2
Function-Learning Formulation
Goal function f
Training set: (x(i), y(i)), i = 1,…,n, with y(i) = f(x(i))
Inductive inference: find a function h that fits the points well
Same Keep-It-Simple bias
[Figure: sample points (x, f(x))]
3
Least-Squares Fitting
Propose a class of functions g(x,θ) parameterized by θ
Minimize E(θ) = Σi (g(x(i),θ) − y(i))²
[Figure: sample points of f(x) and a candidate fit g(x,θ)]
4
Linear Least-Squares
g(x,θ) = x1 θ1 + … + xN θN
Best θ given by θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of x(i)'s, b is the vector of y(i)'s
[Figure: data from f(x) and the fitted g(x,θ)]
5
Constant offset
Set x0 = 1, g(x,θ) = x0 θ0 + x1 θ1 + … + xN θN
Best θ given by θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of x(i)'s, b is the vector of y(i)'s
[Figure: data from f(x) and the fitted g(x,θ)]
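The closed form θ = (AᵀA)⁻¹ Aᵀ b can be checked on a tiny example. A minimal sketch in pure Python, assuming a single input feature plus the constant offset x0 = 1, with the 2×2 normal equations inverted analytically; the data points are made up and lie exactly on y = 1 + 2x:

```python
# Fit g(x, theta) = theta0 + theta1*x by solving the 2x2 normal
# equations (A^T A) theta = A^T b in closed form.
# Data is made up for illustration: it lies exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)

# Entries of A^T A (each row of A is (1, x)) and of A^T b.
Sx = sum(xs)
Sxx = sum(x * x for x in xs)
Sy = sum(ys)
Sxy = sum(x * y for x, y in zip(xs, ys))

# Invert the 2x2 matrix [[n, Sx], [Sx, Sxx]] analytically.
det = n * Sxx - Sx * Sx
theta0 = (Sxx * Sy - Sx * Sxy) / det
theta1 = (n * Sxy - Sx * Sy) / det
print(theta0, theta1)  # recovers the generating line y = 1 + 2x
```

For larger N one would normally call a numerical routine such as numpy.linalg.lstsq rather than forming (AᵀA)⁻¹ explicitly.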
6
Nonlinear Least-Squares
E.g. quadratic: g(x,θ) = θ0 + x θ1 + x² θ2
E.g. exponential: g(x,θ) = exp(θ0 + x θ1)
Any combination: g(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3
[Figure: data from f(x) with linear, quadratic, and other fits]
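For models that are nonlinear in θ there is no closed form, so E(θ) is typically minimized iteratively. A minimal sketch, assuming a one-parameter exponential model g(x,θ) = exp(θx) and plain gradient descent; the data and step size are made up for illustration, and the slide's caveats (local minima, cost on large data) still apply in general:

```python
import math

# Data generated from g(x, theta) = exp(theta * x) with theta = 0.5.
xs = [0.0, 0.5, 1.0]
ys = [math.exp(0.5 * x) for x in xs]

def grad_E(theta):
    # dE/dtheta for E(theta) = sum_i (exp(theta*x_i) - y_i)^2
    return sum(2.0 * (math.exp(theta * x) - y) * x * math.exp(theta * x)
               for x, y in zip(xs, ys))

theta = 0.0                 # initial guess
for _ in range(2000):       # fixed step size; too large a step can diverge
    theta -= 0.05 * grad_E(theta)
print(theta)  # converges to ~0.5 on this well-behaved problem
```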
7
Performance of Nonlinear Least-squares
Overfitting: too many parameters
Efficient optimization: often can only find a local minimum of the objective E(θ)
Expensive with lots of data!
8
Neural Networks
9
Perceptron (the goal function f is a boolean one)
y = g(Σi=1,…,n wi xi)
[Figure: a unit with inputs x1,…,xn, weights wi, activation g, output y; linearly separable + and − examples in the (x1, x2) plane, split by the line w1 x1 + w2 x2 = 0]
10
Perceptron (the goal function f is a boolean one)
y = g(Σi=1,…,n wi xi)
[Figure: the same unit; + and − examples that no line w1 x1 + w2 x2 = 0 can separate (marked "?")]
11
Unit (Neuron)
y = g(Σi=1,…,n wi xi)
g(u) = 1/[1 + exp(−u)]
[Figure: a unit with inputs x1,…,xn, weights wi, activation g, output y]
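The unit's computation is just a weighted sum passed through the sigmoid g. A minimal sketch; the weights and inputs are made up for illustration:

```python
import math

def g(u):
    # Sigmoid activation: g(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + math.exp(-u))

def unit(weights, inputs):
    # y = g(sum_i w_i * x_i)
    return g(sum(w * x for w, x in zip(weights, inputs)))

print(g(0.0))                         # 0.5: the sigmoid is centered at u = 0
print(unit([1.0, -2.0], [3.0, 1.0]))  # g(1*3 + (-2)*1) = g(1)
```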
12
A Single Neuron can learn
[Figure: a single unit with inputs x1,…,xn, weights wi, output y]
A disjunction of boolean literals: x1 ∨ x2 ∨ x3
The majority function
XOR?
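The separability boundary shows up when the classical perceptron learning rule is run, which the slide does not spell out. A sketch assuming a hard-threshold unit with a bias input x0 = 1; the learning rule, epoch count, and encodings are standard but chosen here for illustration:

```python
def predict(w, x):
    # Hard-threshold unit: 1 if w . x > 0 else 0 (x includes bias x0 = 1).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(data, epochs):
    # Classical perceptron rule: w += (target - output) * x on each mistake.
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, t in data:
            err = t - predict(w, x)
            if err:
                w = [wi + err * xi for wi, xi in zip(w, x)]
    return w

# Majority of three bits: linearly separable, so the rule converges.
majority = [((1, a, b, c), int(a + b + c >= 2))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]
w_major = train(majority, 100)
ok_major = all(predict(w_major, x) == t for x, t in majority)

# XOR: not linearly separable, so no weight vector classifies all 4 points.
xor = [((1, a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
w_xor = train(xor, 100)
ok_xor = all(predict(w_xor, x) == t for x, t in xor)

print(ok_major, ok_xor)  # True False
```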
13
Neural Network
Network of interconnected neurons, each computing y = g(Σi=1,…,n wi xi)
[Figure: two connected units]
Acyclic (feed-forward) vs. recurrent networks
14
Two-Layer Feed-Forward Neural Network
[Figure: inputs feeding a hidden layer with weights w1j, feeding an output layer with weights w2k]
15
Backpropagation (Principle)
New example y(k) = f(x(k))
φ(k) = outcome of NN with weights w(k−1) for inputs x(k)
Error function: E(k)(w(k−1)) = ||φ(k) − y(k)||²
wij(k) = wij(k−1) − ε ∂E(k)/∂wij   (i.e., w(k) = w(k−1) − ε∇E)
Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
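One update of this rule can be traced on a tiny 2-2-1 sigmoid network. A sketch assuming squared error on a single made-up example; the initial weights, target, and ε are arbitrary:

```python
import math

def sig(u):
    return 1.0 / (1.0 + math.exp(-u))

x = [1.0, 0.0]                    # input
y = 0.8                           # target output
w1 = [[0.3, -0.2], [0.4, 0.1]]    # w1[i][j]: input i -> hidden j
w2 = [0.5, -0.3]                  # w2[j]: hidden j -> output
eps = 0.1                         # learning rate

def forward():
    h = [sig(sum(w1[i][j] * x[i] for i in range(2))) for j in range(2)]
    out = sig(sum(w2[j] * h[j] for j in range(2)))
    return h, out

h, out = forward()
E_before = (out - y) ** 2

# Backward pass: output delta first, then propagate to the hidden layer
# (using the OLD output weights, before they are updated).
d_out = 2.0 * (out - y) * out * (1.0 - out)
d_hid = [d_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(2)]

# Gradient-descent update, last layer first, as in backpropagation.
w2 = [w2[j] - eps * d_out * h[j] for j in range(2)]
for i in range(2):
    for j in range(2):
        w1[i][j] -= eps * d_hid[j] * x[i]

_, out = forward()
E_after = (out - y) ** 2
print(E_after < E_before)  # True: a small step along -grad reduces the error
```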
16
Understanding Backpropagation
Minimize E(θ): Gradient Descent…
[Figure: landscape of E(θ)]
17
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of E]
18
Understanding Backpropagation
[Figure: landscape of E(θ); step ∝ gradient]
19
Understanding Backpropagation
Example of Stochastic Gradient Descent
Minimize E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ei = (g(x(i),θ) − y(i))²
Take a step to reduce ei
[Figure: landscape of E(θ); gradient of e1]
20
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of e1]
21
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of e2]
22
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of e2]
23
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of e3]
24
Understanding Backpropagation
[Figure: landscape of E(θ); gradient of e3]
25
Stochastic Gradient Descent
[Figure: parameter values over time, converging to a (local) minimum of E]
26
Stochastic Gradient Descent
Objective function values over time
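The picture of cycling through per-example gradients can be reproduced with a one-parameter linear model. A sketch assuming ei(θ) = (θ xi − yi)², with made-up data that a single θ fits exactly; the fixed learning rate is chosen small enough to converge:

```python
# Data lies exactly on y = 2x, so every e_i is minimized at theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

theta = 0.0
lr = 0.05
for _ in range(200):            # 200 passes over the data
    for x, y in zip(xs, ys):    # one step per example: stochastic GD
        grad_ei = 2.0 * (theta * x - y) * x
        theta -= lr * grad_ei
print(theta)  # approaches 2.0
```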
27
Caveats
Choosing a convergent "learning rate" can be hard in practice
[Figure: landscape of E(θ)]
28
Comments and Issues
How to choose the size and structure of networks?
• If the network is too large, risk of over-fitting (data caching)
• If the network is too small, the representation may not be rich enough
Role of representation: e.g., learn the concept of an odd number
Incremental learning
29
Role of Marketing
Not a good model of a neuron: real neurons show spiking behavior and recurrence
No special properties above other learning techniques
Like other learning techniques, a convenient way to get results without thinking too hard
30
Incremental (“Online”) Function Learning
31
Incremental (“Online”) Function Learning
Data is streaming into the learner: x1,y1, …, xt,yt with yi = f(xi)
The learner observes xt+1 and must make a prediction for the next time step, yt+1
Brute-force approach: store all data; at step t, run your learner of choice on all data up to time t, then predict for time t+1
32
Example: Mean Estimation
yi = μ + error term (no x's)
Current estimate: μt = 1/t Σi=1…t yi
μt+1 = 1/(t+1) Σi=1…t+1 yi = 1/(t+1) (yt+1 + Σi=1…t yi) = 1/(t+1) (yt+1 + t μt)
[Figure: current estimate μ5]
33
Example: Mean Estimation
[Figure: current estimate μ5 and new observation y6]
34
Example: Mean Estimation
[Figure: updated estimate μ6 = 5/6 μ5 + 1/6 y6]
35
Example: Mean Estimation
μt+1 = 1/(t+1) (yt+1 + t μt)
Only need to store μt and t
Similar formulas hold for the standard deviation
[Figure: μ5 and μ6 = 5/6 μ5 + 1/6 y6]
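The running-average update can be checked directly against the batch mean. A minimal sketch; the stream values are made up:

```python
# Incremental mean: mu_{t+1} = (y_{t+1} + t * mu_t) / (t + 1).
# Only mu_t and t are stored, never the whole stream.
stream = [4.0, 7.0, 1.0, 6.0, 2.0, 10.0]

mu, t = 0.0, 0
for y in stream:
    mu = (y + t * mu) / (t + 1)
    t += 1
print(mu)  # equals the batch mean sum(stream)/len(stream) = 5.0
```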
36
Incremental Least Squares
Recall the least-squares estimate θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of x(i)'s and b is the vector of y(i)'s (laid out in rows):
A = [x(1); x(2); …; x(N)]   (N×M)
b = [y(1); y(2); …; y(N)]   (N×1)
37
Incremental Least Squares
Let A(t), b(t) be the A matrix and b vector up to time t
θ(t) = (A(t)ᵀA(t))⁻¹ A(t)ᵀ b(t)
A(t+1) = [A(t); x(t+1)]   ((t+1)×M)
b(t+1) = [b(t); y(t+1)]   ((t+1)×1)
38
Incremental Least Squares
θ(t+1) = (A(t+1)ᵀA(t+1))⁻¹ A(t+1)ᵀ b(t+1)
A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
39
Incremental Least Squares
θ(t+1) = (A(t+1)ᵀA(t+1))⁻¹ A(t+1)ᵀ b(t+1)
A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
A(t+1)ᵀA(t+1) = A(t)ᵀA(t) + x(t+1) x(t+1)ᵀ
41
Incremental Least Squares
Let A(t), b(t) be the A matrix and b vector up to time t
θ(t+1) = (A(t+1)ᵀA(t+1))⁻¹ A(t+1)ᵀ b(t+1)
A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
A(t+1)ᵀA(t+1) = A(t)ᵀA(t) + x(t+1) x(t+1)ᵀ
Sherman-Morrison update: (Y + xxᵀ)⁻¹ = Y⁻¹ − Y⁻¹ xxᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x)
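The Sherman-Morrison identity (Y + xxᵀ)⁻¹ = Y⁻¹ − Y⁻¹ xxᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x), with a plus sign in the denominator for a rank-one addition, can be verified numerically. A sketch with a hand-picked symmetric 2×2 Y and vector x:

```python
# Verify (Y + x x^T)^-1 = Y^-1 - Y^-1 x x^T Y^-1 / (1 + x^T Y^-1 x)
# on a 2x2 example (matrices as lists of rows; values are made up).

def inv2(M):
    # Analytic inverse of a 2x2 matrix.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

Y = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, 2.0]

Yinv = inv2(Y)
# Y is symmetric, so x^T Y^-1 = (Y^-1 x)^T and one vector suffices.
Yinv_x = [sum(Yinv[i][j] * x[j] for j in range(2)) for i in range(2)]
denom = 1.0 + sum(x[i] * Yinv_x[i] for i in range(2))

# Right-hand side of the identity.
rhs = [[Yinv[i][j] - Yinv_x[i] * Yinv_x[j] / denom for j in range(2)]
       for i in range(2)]

# Left-hand side: invert Y + x x^T directly.
Ypxx = [[Y[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]
lhs = inv2(Ypxx)

err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err)  # at floating-point noise level
```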
42
Incremental Least Squares
Putting it all together. Store:
p(t) = A(t)ᵀ b(t)
Q(t) = (A(t)ᵀA(t))⁻¹
Update:
p(t+1) = p(t) + y x
Q(t+1) = Q(t) − Q(t) xxᵀ Q(t) / (1 + xᵀ Q(t) x)
θ(t+1) = Q(t+1) p(t+1)
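The full recipe can be exercised end to end on 2-feature data x(i) = (1, xi). A sketch in pure Python: Q is seeded from the first two points (the first moment AᵀA is invertible) and then updated one point at a time; the data are made up and lie exactly on y = 1 + 2x:

```python
# Incremental least squares with p(t) = A^T b and Q(t) = (A^T A)^-1,
# updated one (x, y) pair at a time via the Sherman-Morrison formula.
# Feature vectors are (1, x); data lies exactly on y = 1 + 2x.
points = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0),
          ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# Seed p and Q from the first two points, where A^T A becomes invertible.
init, rest = points[:2], points[2:]
AtA = [[sum(x[i] * x[j] for x, _ in init) for j in range(2)]
       for i in range(2)]
p = [sum(y * x[i] for x, y in init) for i in range(2)]
Q = inv2(AtA)

for x, y in rest:
    # p <- p + y x
    p = [p[i] + y * x[i] for i in range(2)]
    # Q <- Q - Q x x^T Q / (1 + x^T Q x)   (Q is symmetric)
    Qx = [sum(Q[i][j] * x[j] for j in range(2)) for i in range(2)]
    denom = 1.0 + sum(x[i] * Qx[i] for i in range(2))
    Q = [[Q[i][j] - Qx[i] * Qx[j] / denom for j in range(2)]
         for i in range(2)]

theta = [sum(Q[i][j] * p[j] for j in range(2)) for i in range(2)]
print(theta)  # recovers (1, 2), the coefficients of y = 1 + 2x
```

This incremental scheme is known in the signal-processing literature as recursive least squares.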
43
Recap
• Function learning with least squares
• Neural nets, backpropagation, and gradient descent
• Incremental learning
44
Reminder
• HW6 due
• HW7 available on Oncourse
45
Machine Learning Classes
• CS659 (Hauser) Principles of Intelligent Robot Motion
• CS657 (Yu) Computer Vision
• STAT520 (Trosset) Introduction to Statistics
• STAT682 (Rocha) Statistical Model Selection