CS 4100: Artificial Intelligence
Optimization and Neural Nets

Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Linear Classifiers
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is:
  • Positive, output +1
  • Negative, output -1
[Diagram: features $f_1, f_2, f_3$ are weighted by $w_1, w_2, w_3$, summed ($\Sigma$), and compared against a threshold ($>0$?)]
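A minimal sketch of this decision rule in Python (the particular weight and feature values are illustrative, not from the slides):

```python
import numpy as np

def predict(w, f):
    """Linear classifier: output +1 if the activation w . f is positive, else -1."""
    activation = np.dot(w, f)           # sum of weighted feature values
    return 1 if activation > 0 else -1

w = np.array([0.5, -1.0, 2.0])          # weights w1, w2, w3
f = np.array([1.0, 0.5, 0.25])          # feature values f1, f2, f3
print(predict(w, f))                    # activation = 0.5 > 0, so output +1
```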
How to get probabilistic decisions?
• Activation: $z = w \cdot f(x)$
  • If very positive → want probability going to 1
  • If very negative → want probability going to 0
• Sigmoid function:

  $\phi(z) = \frac{1}{1 + e^{-z}}$
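As a quick sketch, the sigmoid squashes any activation into (0, 1), matching the behavior described above:

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))    # very positive activation -> probability near 1
print(sigmoid(-10.0))   # very negative activation -> probability near 0
print(sigmoid(0.0))     # zero activation -> probability 0.5
```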
Best w?
• Maximum likelihood estimation:

  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

  with:

  $P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$

  $P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$
This is called Logistic Regression
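A sketch of the objective being maximized, assuming labels in {+1, -1} and the identity feature map f(x) = x (that simplification is an assumption of this example, not something the slide specifies):

```python
import numpy as np

def log_likelihood(w, X, y):
    """ll(w) = sum_i log P(y_i | x_i; w) for binary logistic regression.

    Uses the identity P(y | x; w) = sigmoid(y * (w . x)) for y in {+1, -1},
    which combines the two cases above into one expression.
    """
    z = y * (X @ w)
    return np.sum(np.log(1.0 / (1.0 + np.exp(-z))))

X = np.array([[1.0, 2.0], [2.0, -1.0]])
y = np.array([1, -1])
print(log_likelihood(np.zeros(2), X, y))  # 2 * log(0.5), since sigmoid(0) = 0.5
```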
Multiclass Logistic Regression

• Recall Perceptron:
  • A weight vector for each class: $w_y$
  • Score (activation) of a class y: $w_y \cdot f(x)$
  • Prediction: the class with the highest score wins
• How to turn scores into probabilities?
  $z_1, z_2, z_3 \;\rightarrow\; \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$

  (original activations → softmax activations)
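A sketch of the softmax transformation (subtracting the max score is a standard numerical-stability trick, not something the slide requires):

```python
import numpy as np

def softmax(z):
    """Map scores z_i to probabilities e^(z_i) / sum_j e^(z_j)."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / np.sum(e)

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())       # probabilities ordered like the scores; sum = 1
```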
Best w?
• Maximum likelihood estimation:

  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

  with:

  $P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$
This is called Multi-Class Logistic Regression
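Putting the pieces together, a sketch of the multiclass predictive distribution, with one weight vector per class stored as the rows of a matrix W (an illustrative layout):

```python
import numpy as np

def class_probabilities(W, f):
    """P(y | x; w) = e^(w_y . f(x)) / sum_y' e^(w_y' . f(x))."""
    z = W @ f                       # activation z_y = w_y . f(x) for each class
    e = np.exp(z - np.max(z))       # stabilized softmax
    return e / np.sum(e)

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 classes, 2 features
f = np.array([0.5, 0.5])
print(class_probabilities(W, f))    # three probabilities summing to 1
```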
This Lecture
• Optimization
• i.e., how do we solve:
  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
Hill Climbing
• Recall from the CSPs lecture: a simple, general idea
  • Start wherever
  • Repeat: move to the best neighboring state
  • If no neighbors are better than the current state, quit
• What’s particularly tricky when hill-climbing for multiclass logistic regression?
  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?
1-D Optimization
• Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
  • Then step in the best direction
• Or, evaluate the derivative:

  $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$

  • It defines the direction to step in
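A sketch of the central-difference estimate from this slide; the test function g is an arbitrary choice for illustration:

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference approximation (g(w0+h) - g(w0-h)) / (2h)."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

g = lambda w: -(w - 3.0) ** 2           # 1-D function with its maximum at w = 3
print(numerical_derivative(g, 1.0))     # ~4.0: positive, so step right (uphill)
```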
2-D Optimization
Source: offconvex.org
Gradient Ascent
• Perform an update in the uphill direction for each coordinate
• The steeper the slope (i.e., the higher the derivative), the bigger the step for that coordinate
• E.g., consider $g(w_1, w_2)$. Updates:

  $w_1 \leftarrow w_1 + \alpha \cdot \frac{\partial g}{\partial w_1}(w_1, w_2)$

  $w_2 \leftarrow w_2 + \alpha \cdot \frac{\partial g}{\partial w_2}(w_1, w_2)$

• Updates in vector notation: $w \leftarrow w + \alpha \cdot \nabla_w g(w)$
• with: $\nabla_w g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_1}(w) \\ \frac{\partial g}{\partial w_2}(w) \end{bmatrix}$ = gradient
Gradient Ascent

• Idea:
  • Start somewhere
  • Repeat: take a step in the gradient direction
Figure source: Mathworks
What is the Steepest Direction?
• First-Order Taylor Expansion:

  $g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$

• Steepest Ascent Direction:

  $\max_{\Delta : \|\Delta\| \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta : \|\Delta\| \le \varepsilon} \; g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$

• Recall: $\max_{\Delta : \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \frac{a}{\|a\|}$

• Hence, the solution is:

  $\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|}$ with $\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\ \frac{\partial g}{\partial w_2} \end{bmatrix}$

• Gradient direction = steepest direction!
Gradient in n dimensions
  $\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\ \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$
Optimization Procedure: Gradient Ascent
• initialize $w$
• for iter = 1, 2, …

  $w \leftarrow w + \alpha \cdot \nabla g(w)$

• $\alpha$ is the learning rate: a hyperparameter that needs to be chosen carefully
• How? Try multiple choices
• Crude rule of thumb: each update should change $w$ by about 0.1–1%
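A sketch of this procedure on a toy 2-D objective; the fixed iteration count and the hand-coded analytic gradient are choices made for the example:

```python
import numpy as np

def gradient_ascent(grad_g, w0, alpha=0.1, iters=100):
    """Repeat w <- w + alpha * grad g(w) for a fixed number of iterations."""
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = w + alpha * grad_g(w)
    return w

# Maximize g(w1, w2) = -(w1 - 1)^2 - (w2 + 2)^2, whose gradient is analytic.
grad_g = lambda w: np.array([-2 * (w[0] - 1), -2 * (w[1] + 2)])
print(gradient_ascent(grad_g, [0.0, 0.0]))  # approaches the maximizer (1, -2)
```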
Gradient Ascent on the Log Likelihood Objective
  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

• initialize $w$
• for iter = 1, 2, …

  $w \leftarrow w + \alpha \cdot \sum_i \nabla_w \log P(y^{(i)} \mid x^{(i)}; w)$
(Sum over all training examples)
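A sketch of this full-batch update for binary logistic regression; the per-example gradient formula below follows from differentiating log sigmoid(y_i (w · x_i)), and the data is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.5, iters=1000):
    """Full-batch ascent on ll(w); the gradient of log P(y_i | x_i; w)
    is (1 - sigmoid(y_i * (w . x_i))) * y_i * x_i for labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = ((1 - sigmoid(y * (X @ w))) * y) @ X   # sum over all examples
        w = w + alpha * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(batch_gradient_ascent(X, y))   # weights that separate the two classes
```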
Stochastic Gradient Ascent on the Log Likelihood Objective
  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

• initialize $w$
• for iter = 1, 2, …
  • pick a random example $j$

  $w \leftarrow w + \alpha \cdot \nabla_w \log P(y^{(j)} \mid x^{(j)}; w)$
Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one.

Intuition: use sampling to approximate the gradient. This is called stochastic gradient ascent (the minimization counterpart is the familiar stochastic gradient descent).
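The same ascent with the single-example update from this slide; only the gradient computation changes relative to the full-batch sketch above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, alpha=0.1, iters=5000, seed=0):
    """Each step: pick a random example j, ascend on its gradient alone."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        j = rng.integers(len(y))                                # pick random j
        grad_j = (1 - sigmoid(y[j] * (X[j] @ w))) * y[j] * X[j]
        w = w + alpha * grad_j
    return w
```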
Mini-Batch Gradient Ascent on the Log Likelihood Objective
  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

• initialize $w$
• for iter = 1, 2, …
  • pick a random subset of training examples $J$

  $w \leftarrow w + \alpha \cdot \sum_{j \in J} \nabla_w \log P(y^{(j)} \mid x^{(j)}; w)$

Observation: the gradient over a small set of training examples (= a mini-batch) can be computed in parallel, so we might as well do that instead of using a single example.
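And the mini-batch variant, sampling a random subset J each step (batch size 2 is an arbitrary choice for the sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient_ascent(X, y, alpha=0.1, batch_size=2, iters=2000, seed=0):
    """Each step: pick a random subset J, sum the gradients over J."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        J = rng.choice(len(y), size=batch_size, replace=False)  # mini-batch
        grad = ((1 - sigmoid(y[J] * (X[J] @ w))) * y[J]) @ X[J]
        w = w + alpha * grad
    return w
```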
How about computing all the derivatives?
• We’ll talk about that once we’ve covered neural networks, which are a generalization of logistic regression
Neural Networks
Multi-class Logistic Regression
• Logistic regression is a special case of a neural network
[Diagram: features $f_1(x), f_2(x), f_3(x), \ldots, f_K(x)$ feed through weights $W_{i,j}$ into activations $z_1, z_2, z_3$, followed by a softmax layer]

  $z_i = \sum_j W_{i,j} \, f_j(x)$

  $P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$
Deep Neural Network = Also learn the features!
[Diagram: the same single-layer network: features $f_1(x), \ldots, f_K(x)$ → activations $z_1, z_2, z_3$ → softmax → $P(y_i \mid x; w)$]

• Can we automatically learn good features $f_j(x)$?
Deep Neural Network = Also learn the features!
[Diagram: inputs $x_1, \ldots, x_L$ pass through hidden layers $z^{(1)}, z^{(2)}, \ldots, z^{(n-1)}$ that compute the learned features $f_1(x), \ldots, f_K(x)$, which feed into output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$ and a softmax → $P(y_i \mid x; w)$]

  $z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big)$ where $g$ = nonlinear activation function
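A sketch of this forward computation; the use of ReLU for g, the layer sizes, and the random weights are all assumptions of the example:

```python
import numpy as np

def forward(x, weights):
    """Apply z^(k) = g(W^(k-1,k) z^(k-1)) layer by layer, then softmax."""
    z = x
    for W in weights[:-1]:
        z = np.maximum(0.0, W @ z)        # g = ReLU on each hidden layer
    scores = weights[-1] @ z              # output-layer activations
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)                  # softmax -> class probabilities

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)),       # layer 1: 3 inputs -> 4 units
           rng.normal(size=(4, 4)),       # layer 2: 4 -> 4
           rng.normal(size=(3, 4))]       # output layer: 4 -> 3 classes
print(forward(np.array([1.0, 0.5, -0.5]), weights))  # probabilities sum to 1
```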
Deep Neural Network = Also learn the features!
[Diagram: the full deep network: inputs $x_1, \ldots, x_L$ → hidden layers $z^{(1)}, \ldots, z^{(n)}$ → output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$ → softmax → $P(y_i \mid x; w)$]

  $z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big)$ where $g$ = nonlinear activation function
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
Deep Neural Network: Also Learn the Features!
• Training a deep neural network is just like logistic regression; $w$ just tends to be a much, much larger vector
• → just run gradient ascent, and stop when the log likelihood of held-out data starts to decrease

  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
Neural Networks Properties
• Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
• Practical considerations:
  • Can be seen as learning the features
  • A large number of neurons brings a danger of overfitting (hence early stopping!)
Universal Function Approximation Theorem*
• In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights w that allows it to closely approximate f(x).

Cybenko (1989), “Approximation by Superpositions of a Sigmoidal Function”
Hornik (1991), “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991), “Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”
Fun Neural Net Demo Site
• Demo site: http://playground.tensorflow.org/
How about computing all the derivatives?
• Derivatives tables:
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
How about computing all the derivatives?
• But the neural net f is never one of those?
• No problem: use the chain rule
• If $f(x) = g(h(x))$
• Then $f'(x) = g'(h(x)) \, h'(x)$
• Implication: derivatives of most functions can be computed using automated procedures
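A quick numerical check of the chain rule on an arbitrary pair of functions (g = sin and h(x) = x², chosen only for illustration):

```python
import math

h = lambda x: x ** 2
f = lambda x: math.sin(h(x))                 # f = g o h with g = sin
f_prime = lambda x: math.cos(h(x)) * 2 * x   # chain rule: g'(h(x)) * h'(x)

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(f_prime(x), numeric)                   # the two estimates agree closely
```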
Automatic Differentiation
• Automatic differentiation software
  • e.g., Theano, TensorFlow, PyTorch
  • Only need to program the function g(x, y, w)
  • Can automatically compute all derivatives w.r.t. all entries in w
  • This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = “backpropagation”
• Autodiff / backpropagation can often be done at a computational cost comparable to the forward pass
• Need to know this exists
• How it is done is outside the scope of CS 4100
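A minimal PyTorch sketch of the workflow described above: program a scalar function of w, run the forward pass, and let the backward pass fill in every derivative (the particular function is arbitrary):

```python
import torch

w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)  # parameters
x = torch.tensor([0.3, 0.7, 1.0])                        # fixed input
g = torch.log(torch.sigmoid(w @ x))                      # forward pass: scalar g(w)
g.backward()                                             # backward pass = backprop
print(w.grad)                                            # dg/dw for all entries of w
```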
Summary of Key Ideas

• Optimize the probability of the label given the input:

  $\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$

• Continuous optimization
  • Gradient ascent
    • Compute the gradient (= steepest uphill direction): just a vector of partial derivatives
    • Take a step in the gradient direction
    • Repeat (until held-out data accuracy starts to drop = “early stopping”)
• Deep neural nets
  • Last layer: still logistic regression
  • Now also many more layers before this last layer
    • = computing the features
    • → the features are learned rather than hand-designed
• Universal function approximation theorem
  • If the neural net is large enough, it can represent any continuous mapping from input to output with arbitrary accuracy
  • But remember: need to avoid overfitting / memorizing the training data → early stopping!
• Automatic differentiation gives the derivatives efficiently (how? = outside the scope of CS 4100)
Next time: More Neural Nets and Applications
Next: More Neural Net Applications!