CSE 559A: Computer Vision
Fall 2020: T-R: 11:30-12:50pm @ Zoom
Instructor: Ayan Chakrabarti ([email protected]). Course Staff: Adith Boloor, Patrick Williams
Nov 24, 2020
http://www.cse.wustl.edu/~ayan/courses/cse559a/
1
/
GENERAL
Look at your proposal feedback!
Problem Set 4 due today.
Problem Set 5 will be out by tonight.
No class Thursday. No office hours on Friday.
2
/
CLASSIFICATION
[Slides 3-14: figures only]
/
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$, and do binary classification on its output.
Again, use (stochastic) gradient descent. But this time, the cost is no longer convex.

$w = \arg\min_w \frac{1}{T} \sum_t y_t \log\left[1 + \exp(-w^T x_t)\right] + (1 - y_t) \log\left[1 + \exp(w^T x_t)\right]$

$\theta, w = \arg\min_{\theta,w} \frac{1}{T} \sum_t y_t \log\left[1 + \exp(-w^T g(x_t; \theta))\right] + (1 - y_t) \log\left[1 + \exp(w^T g(x_t; \theta))\right]$

15
/
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$.
Again, use (stochastic) gradient descent. But this time, the cost is no longer convex.
Turns out... it doesn't matter (sort of).

Recall, in the previous case (where $C_t$ is the cost of one sample):

$\nabla_w C_t = \left[\frac{\exp(w^T x_t)}{1 + \exp(w^T x_t)} - y_t\right] x_t$

What about now?
Exactly the same, with $\tilde{x}_t = g(x_t; \theta)$ in place of $x_t$, for the current value of $\theta$.

16
/
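The per-sample gradient on this slide can be checked numerically. A minimal sketch in numpy, with random made-up data; `cost` and `grad_w` are hypothetical helper names for illustration, not functions from the pset code:

```python
import numpy as np

# Per-sample logistic cost C_t and its gradient wrt w, matching the slide's
# formula: grad_w C_t = [sigma(w^T x_t) - y_t] x_t.
def cost(w, x, y):
    s = w @ x
    return y * np.log1p(np.exp(-s)) + (1 - y) * np.log1p(np.exp(s))

def grad_w(w, x, y):
    s = w @ x
    sigma = np.exp(s) / (1.0 + np.exp(s))
    return (sigma - y) * x

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 1.0

# Finite-difference check of the analytic gradient, one coordinate at a time.
eps = 1e-6
num = np.array([(cost(w + eps * e, x, y) - cost(w - eps * e, x, y)) / (2 * eps)
                for e in np.eye(5)])
assert np.allclose(num, grad_w(w, x, y), atol=1e-5)
```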
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$, with the same objective and the same $\nabla_w C_t$ as before.

What about $\nabla_\theta C_t$?
First, what is $\nabla_{x_t} C_t$?
Take 5 mins!

17
/
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$.

What about $\nabla_\theta C_t$?
First, what is $\nabla_{x_t} C_t$?

$\nabla_{x_t} C_t = \;?\left[\frac{\exp(w^T x_t)}{1 + \exp(w^T x_t)} - y_t\right]$

18
/
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$.

What about $\nabla_\theta C_t$?
First, what is $\nabla_{x_t} C_t$?

$\nabla_{x_t} C_t = \left[\frac{\exp(w^T x_t)}{1 + \exp(w^T x_t)} - y_t\right] w$

19
/
CLASSIFICATION
Learn $\tilde{x} = g(x; \theta)$.
Now, let's say $\theta$ was an $M \times N$ matrix, and $g(x; \theta) = \theta x$.
$N$ is the length of the vector $x$.
$M$ is the length of the encoded vector $\tilde{x}$.
What is $\nabla_\theta C_t$?
Take 5 mins!

20
/
CLASSIFICATION
Now, let's say $\theta$ was an $M \times N$ matrix, and $g(x; \theta) = \theta x$.
What is $\nabla_\theta C_t$, in terms of $\nabla_{x_t} C_t$?

$\nabla_\theta C_t = \;?$

21
/
CLASSIFICATION
Now, let's say $\theta$ was an $M \times N$ matrix, and $g(x; \theta) = \theta x$.
What is $\nabla_\theta C_t$?

$\nabla_\theta C_t = \left(\nabla_{x_t} C_t\right) x_t^T$

This is actually a linear classifier on $x$: $w^T \theta x = (\theta^T w)^T x = \tilde{w}^T x$.
But because of our factorization, the cost is no longer convex.
If we want to increase the expressive power of our classifier, $g$ has to be non-linear!

22
/
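Both claims on this slide, the outer-product gradient and the collapse to a linear classifier, can be verified numerically. A sketch with made-up random data, using the slide's notation ($g(x;\theta)=\theta x$, per-sample logistic cost):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
theta = rng.normal(size=(M, N))
w = rng.normal(size=M)
x = rng.normal(size=N)
y = 1.0

def cost(theta):
    # Per-sample logistic cost with score s = w^T (theta x).
    s = w @ (theta @ x)
    return y * np.log1p(np.exp(-s)) + (1 - y) * np.log1p(np.exp(s))

s = w @ (theta @ x)
sigma = np.exp(s) / (1.0 + np.exp(s))
delta = (sigma - y) * w            # gradient wrt theta @ x   (slide 19)
grad_theta = np.outer(delta, x)    # gradient wrt theta       (this slide)

# Finite-difference check, element by element.
eps = 1e-6
for i in range(M):
    for j in range(N):
        d = np.zeros((M, N)); d[i, j] = eps
        num = (cost(theta + d) - cost(theta - d)) / (2 * eps)
        assert abs(num - grad_theta[i, j]) < 1e-5

# The collapse: w^T (theta x) equals (theta^T w)^T x -- still linear in x.
assert np.isclose(w @ (theta @ x), (theta.T @ w) @ x)
```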
CLASSIFICATION
The Multi-Layer Perceptron
$x$
23
/
CLASSIFICATION
The Multi-Layer Perceptron
$x \longrightarrow h$, with $h = \theta x$.
24
/
CLASSIFICATION
The Multi-Layer Perceptron
$x \longrightarrow h \longrightarrow \bar{h}$, with $h = \theta x$ and $\bar{h}_j = \kappa(h_j)$.
$\kappa$ is an element-wise non-linearity. For example, $\kappa(x) = \sigma(x)$. More on this later.
It has no learnable parameters.
25
/
CLASSIFICATION
The Multi-Layer Perceptron
$x \longrightarrow h \longrightarrow \bar{h} \longrightarrow y$, with $h = \theta x$, $\bar{h}_j = \kappa(h_j)$, and $y = w^T \bar{h}$.
$\kappa$ is an element-wise non-linearity. For example, $\kappa(x) = \sigma(x)$. More on this later.
It has no learnable parameters.
26
/
CLASSIFICATION
The Multi-Layer Perceptron
$x \longrightarrow h \longrightarrow \bar{h} \longrightarrow y \longrightarrow p$, with $h = \theta x$, $\bar{h}_j = \kappa(h_j)$, $y = w^T \bar{h}$, and $p = \sigma(y)$.
$\kappa$ is an element-wise non-linearity. For example, $\kappa(x) = \sigma(x)$. More on this later. It has no learnable parameters.
$\sigma(y) = \frac{\exp(y)}{1 + \exp(y)}$ is our sigmoid, to convert log-odds to a probability.
Multiplication by $\theta$ and the action of $\kappa$ is a "layer".
It is called a "hidden" layer, because you're learning a "latent representation":
You don't have direct access to the true values of its outputs.
You're learning a representation that is optimal jointly with a learned classifier.
27
/
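The forward chain on these slides can be sketched in a few lines of numpy. Sizes and weights below are made up, and $\kappa$ is taken to be the sigmoid (one of the options the slide mentions):

```python
import numpy as np

def sigmoid(v):
    return np.exp(v) / (1.0 + np.exp(v))

rng = np.random.default_rng(2)
N, M = 8, 4                      # input and hidden sizes (illustrative)
theta = rng.normal(size=(M, N))  # hidden-layer weights
w = rng.normal(size=M)           # read-out weights

x = rng.normal(size=N)
h = theta @ x                    # linear layer
h_bar = sigmoid(h)               # element-wise non-linearity kappa
y = w @ h_bar                    # linear read-out (log-odds)
p = sigmoid(y)                   # probability

assert h_bar.shape == (M,)
assert 0.0 < p < 1.0
```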
CLASSIFICATION
The Multi-Layer Perceptron
$x \longrightarrow h \longrightarrow \bar{h} \longrightarrow y \longrightarrow p$, with $h = \theta x$, $\bar{h}_j = \kappa(h_j)$, $y = w^T \bar{h}$, $p = \sigma(y)$.

This is a neural network: a complex function formed by composition of simple linear and non-linear functions.
This network has learnable parameters $\theta, w$.
Train by gradient descent with respect to a classification loss. What are the gradients?
Doing this manually is going to get old really fast.

Autograd
Express the complex function as a composition of simpler functions.
Store this as nodes in a computation graph.
Use the chain rule to automatically back-propagate.
Popular autograd systems: Tensorflow, Torch, Caffe, MXNet, Theano, ...
We'll write our own!

28
/
AUTOGRAD / BACK-PROPAGATION
Say we want to minimize a loss $L$ that is a function of parameters and training data.
Let's say for a parameter $\theta$ we can write:

$L = f(x); \quad x = g(\theta, y)$

where $y$ is independent of $\theta$, and $f$ does not use $\theta$ except through $x$.
Now, let's say I gave you the value of $y$ and the gradient of $L$ with respect to $x$.
$x$ is an $N$-dimensional vector.
$\theta$ is an $M$-dimensional vector (if it's a matrix, just think of each element as a separate parameter).
Express $\frac{\partial L}{\partial \theta_j}$ in terms of $\frac{\partial L}{\partial x_i}$ and $\frac{\partial g(\theta, y)_i}{\partial \theta_j}$, which is the partial derivative of one of the dimensions of the output of $g$ with respect to one of the dimensions of its inputs.
For every $j$:

$\frac{\partial L}{\partial \theta_j} = \sum_i \frac{\partial L}{\partial x_i} \frac{\partial g(\theta, y)_i}{\partial \theta_j}$

We can similarly compute gradients for the other input to $g$, i.e. $y$.

29
/
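The summed chain rule above is exactly a Jacobian-transpose times the incoming gradient. A numeric sketch with made-up $f$ and $g$ (a linear $g$ keeps the Jacobian easy to write down):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 4                   # theta is M-dimensional, x is N-dimensional
A = rng.normal(size=(N, M))   # illustrative g(theta, y) = A @ theta + y
yv = rng.normal(size=N)

def g(theta):
    return A @ theta + yv

def f(x):
    return np.sum(x ** 2)     # L = f(x)

theta = rng.normal(size=M)
x = g(theta)
dL_dx = 2 * x                 # gradient of f at x (given to us, per the slide)
J = A                         # Jacobian dg_i / dtheta_j, here just A
dL_dtheta = J.T @ dL_dx       # the slide's sum over i, for every j at once

# Finite-difference check of the chain rule.
eps = 1e-6
num = np.array([(f(g(theta + eps * e)) - f(g(theta - eps * e))) / (2 * eps)
                for e in np.eye(M)])
assert np.allclose(num, dL_dtheta, atol=1e-5)
```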
AUTOGRAD / BACK-PROPAGATION
Let's say a specific variable had two paths to the loss:

$L = f(x, x'); \quad x = g(\theta, y), \quad x' = g'(\theta, y')$

Then the contributions from both paths add:

$\frac{\partial L}{\partial \theta_j} = \sum_i \frac{\partial L}{\partial x_i} \frac{\partial g(\theta, y)_i}{\partial \theta_j} + \sum_i \frac{\partial L}{\partial x'_i} \frac{\partial g'(\theta, y')_i}{\partial \theta_j}$

30
/
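A small sketch of the two-path case, with made-up element-wise paths so the per-path gradients are easy to write by hand; the point is that the two contributions add:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=3)
b = rng.normal(size=3)

def L(theta):
    x = a * theta           # path 1: g(theta)
    xp = b * theta ** 2     # path 2: g'(theta)
    return np.sum(x) + np.sum(xp)

theta = rng.normal(size=3)
grad = a + 2 * b * theta    # sum of the contributions from the two paths

# Finite-difference check.
eps = 1e-6
num = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, grad, atol=1e-5)
```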
AUTOGRAD / BACK-PROPAGATION
Our very own autograd system:
Build a directed computation graph with a (Python) list of nodes G = [n1, n2, n3, ...].
Each node is an object that is one of three kinds: Input, Parameter, or Operation.
We will define the graph by calling functions that define functional relationships:

import edf
x = edf.Input()
theta = edf.Parameter()
y = edf.matmul(theta,x)
y = edf.tanh(y)
w = edf.Parameter()
y = edf.matmul(w,y)
31
/
AUTOGRAD / BACK-PROPAGATIONAUTOGRAD / BACK-PROPAGATION
Ea‘h o“ these statements a’’s a no’e to the list o“ no’es.Operation no’es are a’’e’ by matmul, tanh, et‘., an’ are linke’ to previous no’es that appear be“ore it inthe list as input.Every no’e obje‘t is ”oin” to have a member element n.top whi‘h will be the value o“ its output
This ‘an be an arbitrary shape’ array.For input an’ parameter no’es, these top values will just be set (or up’ate’ by SGD).For operation no’es, the top values will be ‘ompute’ “rom the top values o“ their inputs.
Every operation no’e will be an obje‘t o“ a ‘lass that has a “un‘tion ‘alle’ “orwar’.A “orwar’ pass will be”in with values o“ all inputs an’ parameters set.
import edf
x = edf.Input() theta = edf.Parameter()
y = edf.matmul(theta,x) y = edf.tanh(y)
w = edf.Parameter() y = edf.matmul(w,y)
32
/
AUTOGRAD / BACK-PROPAGATIONAUTOGRAD / BACK-PROPAGATION
A “orwar’ pass will be”in with values o“ all inputs an’ parameters set.Then we will ”o throu”h the list o“ no’es in or’er, an’ ‘ompute the value o“ all operation no’es.Be‘ause no’es were a’’e’ in or’er, i“ we ”o throu”h them in or’er,the tops o“ our inputs will be available.
import edf
x = edf.Input() theta = edf.Parameter()
y = edf.matmul(theta,x) y = edf.tanh(y)
w = edf.Parameter() y = edf.matmul(w,y)
33
/
AUTOGRAD / BACK-PROPAGATIONAUTOGRAD / BACK-PROPAGATION
Somewhere in the trainin” loop, where the values o“ parameters have been set be“ore.
An’ this will ”ive us the value o“ the output.But now, we want to ‘ompute ”ra’ients .For ea‘h operation ‘lass, we will also ’e“ine a “un‘tion backward.All operation an’ parameter no’es will also have an element ‘alle’ grad.We will have to then ba‘k-propa”ate ”ra’ients in or’er.
import edf
x = edf.Input() theta = edf.Parameter()
y = edf.matmul(theta,x) y = edf.tanh(y)
w = edf.Parameter() y = edf.matmul(w,y)
x.set(...) edf.Forward() print(y.top)
34
/
AUTOGRAD
Building Our Own Deep Learning Framework
A computation graph encodes the symbolic relationship between input and output (and loss). Nodes in this graph are:
Values: set from training data.
Params: initialized, then updated using SGD.
Operations: computed as functions of their inputs.
Forward computation computes the loss.
Backward computation computes gradients for every node.
35
/
AUTOGRAD
Code from pset5/mnist.py:

# Inputs and parameters
inp = edf.Value()
lab = edf.Value()
W1 = edf.Param()
B1 = edf.Param()
W2 = edf.Param()
B2 = edf.Param()

# Model
y = edf.matmul(inp,W1)
y = edf.add(y,B1)
y = edf.RELU(y)
y = edf.matmul(y,W2)
y = edf.add(y,B2)  # This is our final prediction
36
/
BRIEF DETOUR
What are RELUs?
Element-wise non-linear activations: $RELU(x) = \max(0, x)$.
Previous activations would be sigmoid-like: $\sigma(x) = \exp(x)/(1 + \exp(x))$.
Great when you want a probability, bad when you want to learn by gradient descent:
For both high and low values of $x$, $\partial\sigma(x)/\partial x \approx 0$.
So if you weren't careful, you would end up with high-magnitude activations, and the network stops learning.
Gradient descent is very fragile!
37
/
BRIEF DETOUR
What are RELUs? $RELU(x) = \max(0, x)$.
What is $\partial RELU(x) / \partial x$? 0 if $x < 0$, 1 otherwise.
So your gradients are passed unchanged if the input is positive,
but completely attenuated if the input is negative.
So there's still the possibility of your optimization getting stuck, if you reach a point where all inputs to the RELU are negative.
We will talk about initialization, etc. later.
38
/
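The contrast between the two activations on these slides can be made concrete in a few lines: the sigmoid's derivative vanishes for large $|x|$, while the RELU's derivative is exactly 1 for positive inputs and 0 for negative ones. A small numpy sketch:

```python
import numpy as np

def sigmoid(x):
    return np.exp(x) / (1.0 + np.exp(x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # derivative of the sigmoid

def relu(x):
    return np.maximum(0.0, x)

def drelu(x):
    return (x > 0).astype(np.float64)  # derivative of RELU (0 at x < 0)

# Sigmoid saturates: its gradient is tiny for both very negative and
# very positive inputs.
assert np.all(dsigmoid(np.array([-10.0, 10.0])) < 1e-4)

# RELU passes gradients unchanged for positive inputs...
assert np.all(drelu(np.array([1.0, 10.0])) == 1.0)
# ...and completely attenuates them for negative inputs.
assert np.all(drelu(np.array([-10.0, -1.0])) == 0.0)
```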
AUTOGRAD
Code from pset5/mnist.py (same listing as above).
39
/
AUTOGRAD
Code from pset5/edf.py:
When you construct an object, it just does book-keeping!

ops = []; params = []; values = []
...

class Param:
    def __init__(self):
        params.append(self)
    ...

class Value:
    def __init__(self):
        values.append(self)
    ...

class matmul:
    def __init__(self,x,y):
        ops.append(self)
        self.x = x
        self.y = y
40
/
AUTOGRAD
[Slides 41-51: figures only]
/
AUTOGRAD
mean will be across a batch.
accuracy is the actual accuracy of the hard predictions (and is not differentiable).

# Inputs and parameters
inp = edf.Value()
lab = edf.Value()
W1 = edf.Param()
B1 = edf.Param()
W2 = edf.Param()
B2 = edf.Param()
y = edf.matmul(inp,W1)
y = edf.add(y,B1)
y = edf.RELU(y)
y = edf.matmul(y,W2)
y = edf.add(y,B2)

loss = edf.smaxloss(y,lab)
loss = edf.mean(loss)

acc = edf.accuracy(y,lab)
acc = edf.mean(acc)
52
/
AUTOGRAD
Now let's train this thing!
At the beginning of training: initialize the weights randomly.
In each iteration of training: load data into the Values, or "inputs".
What is this set function anyway?

nHidden = 1024; K = 10
W1.set(xavier((28*28,nHidden)))
B1.set(np.zeros((nHidden)))
W2.set(xavier((nHidden,K)))
B2.set(np.zeros((K)))

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])
    . . .
53
/
AUTOGRAD
set is the only function that the classes Param and Value have.
It sets a member called top to be an array that holds these values.

class Value:
    def __init__(self):
        values.append(self)

    def set(self,value):
        self.top = np.float32(value).copy()

class Param:
    def __init__(self):
        params.append(self)

    def set(self,value):
        self.top = np.float32(value).copy()
54
/
AUTOGRAD
Note that we are loading our input data in batches, as matrices:
inp is BSZ x N.
Then we're doing a matmul with W1, which is N x nHidden.
The output will be BSZ x nHidden.
Essentially, we're replacing a vector-matrix multiply for a single sample with a matrix-matrix multiply for a batch of samples.

W1.set(xavier((28*28,nHidden)))
...
B2.set(np.zeros((K)))

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])
55
/
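The shape bookkeeping above is easy to confirm: a (BSZ x N) batch times an (N x nHidden) weight matrix gives a (BSZ x nHidden) batch of hidden activations in one matmul. A tiny sketch with the slide's sizes:

```python
import numpy as np

# Shapes from the slide: BSZ samples of 28*28 pixels, 1024 hidden units.
BSZ, N, nHidden = 32, 28 * 28, 1024
inp = np.zeros((BSZ, N), dtype=np.float32)       # a batch of inputs
W1 = np.zeros((N, nHidden), dtype=np.float32)    # first-layer weights

# One matrix-matrix multiply handles the whole batch at once.
h = np.matmul(inp, W1)
assert h.shape == (BSZ, nHidden)
```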
AUTOGRAD
And this will work. It will print the loss and accuracy values for the set inputs, given the current values of the parameters.
What is this magical function Forward?

W1.set(xavier((28*28,nHidden)))
...
B2.set(np.zeros((K)))

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])

    edf.Forward()
    print(loss.top,acc.top)
56
/
AUTOGRAD
From edf.py:

# Global forward
def Forward():
    for c in ops:
        c.forward()

But the operation classes have their own forward function:

class matmul:
    def __init__(self,x,y):
        ops.append(self)
        self.x = x
        self.y = y

    def forward(self):
        self.top = np.matmul(self.x.top,self.y.top)

. . .
57
/
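A matching backward for the matmul node isn't shown on this slide, but the standard rule follows from the chain rule derived earlier: with top = x @ y, grad_x = grad_top @ y.T and grad_y = x.T @ grad_top. A hypothetical sketch (not the pset's edf.py code) with a finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(2, 3))
y = rng.normal(size=(3, 4))
top = np.matmul(x, y)

# Pretend dL/dtop arrived from further down the graph; for L = sum(x @ y),
# that incoming gradient is all ones.
grad_top = np.ones_like(top)
grad_x = np.matmul(grad_top, y.T)   # gradient to the left input
grad_y = np.matmul(x.T, grad_top)   # gradient to the right input

# Finite-difference check on one entry of x.
eps = 1e-6
d = np.zeros_like(x); d[0, 0] = eps
num = (np.sum((x + d) @ y) - np.sum((x - d) @ y)) / (2 * eps)
assert abs(num - grad_x[0, 0]) < 1e-5
```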
AUTOGRAD
[Slides 58-66: figures only]
/
AUTOGRAD
So the forward pass computes the loss. But we want to learn the parameters.
The SGD function is pretty simple.
It requires p.grad (the gradient with respect to the loss) to be present.
That's what Backward does!

for iters in range(...):
    . . .
    inp.set(train_im[idx[b:b+BSZ],:])
    lab.set(train_lb[idx[b:b+BSZ]])

    edf.Forward()
    print(loss.top,acc.top)
    edf.Backward(loss)
    edf.SGD(lr)

def SGD(lr):
    for p in params:
        p.top = p.top - lr*p.grad
67
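The Forward / Backward / SGD loop above can be miniaturized into a self-contained toy: a tiny least-squares model with a hand-written gradient standing in for Backward, and the same p = p - lr * grad update as edf.SGD. This is an illustrative stand-in, not the pset's softmax pipeline:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(16, 4)).astype(np.float32)   # a batch of inputs
t = rng.normal(size=(16, 1)).astype(np.float32)   # regression targets
W = rng.normal(size=(4, 1)).astype(np.float32)    # the one parameter

def loss(W):
    # Mean squared error over the batch (stand-in for smaxloss + mean).
    return float(np.mean((x @ W - t) ** 2))

lr = 0.1
before = loss(W)
for _ in range(50):
    # "Backward": hand-derived dL/dW for the MSE loss above.
    grad = 2.0 * x.T @ (x @ W - t) / x.shape[0]
    # "SGD": the same update rule as edf.SGD.
    W = W - lr * grad
after = loss(W)

assert after < before   # the loop actually reduced the loss
```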