Formal Computational Skills Week 4: Differentiation.
Overview
Will talk about differentiation: how derivatives are calculated and what they are used for (especially optimisation). Will end with how differentiation is used to train neural networks.
By the end you should:
• Know what dy/dx means
• Know how partial differentiation works
• Know how derivatives are calculated
• Be able to differentiate simple functions
• Intuitively understand the chain rule
• Understand the role of differentiation in optimisation
• Be able to use the gradient descent algorithm
Derivatives
The derivative of a function y = f(x) is dy/dx (also written df/dx or df(x)/dx). Read as: "dy by dx".
It describes how one variable changes with respect to (wrt) another, ie the rate of change of y wrt x.
It is itself a function of x: for any given value of x it gives the gradient at that point, ie how much/how fast y changes as x changes.
Examples: if x is space/distance, dy/dx tells us slope or steepness; if x is time, dy/dx tells us speeds, accelerations etc.
Essential to any calculation in a dynamic world.
• If dy/dx > 0 the function is increasing
• If dy/dx < 0 the function is decreasing
• If dy/dx = 0 the function is constant
• The bigger the absolute size of dy/dx (written |dy/dx|), the faster change is happening, and the more rapidly varying (and difficult to work with) y is
In general, dy/dx itself changes as x changes.
Partial Differentiation
For a function of 2 variables, eg z = x exp(−x^2 − y^2), the partial derivatives ∂z/∂x and ∂z/∂y tell us how z varies wrt each parameter.
[Figure: surface plot of z = x exp(−x^2 − y^2) for x, y in [−2, 2], z in [−0.4, 0.4]. The partial derivatives ∂z/∂x and ∂z/∂y go along the grid lines of the surface.]
A partial derivative is calculated by treating the other variables as if they were constant.
For a function of n variables y = f(x1, x2, ..., xn), ∂y/∂xi tells us how y changes if only xi varies.
eg z = xy: to get ∂z/∂x treat y as if it were a constant, so ∂z/∂x = y (and similarly ∂z/∂y = x).
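The "hold the other variables constant" recipe can be checked numerically. A minimal sketch (the function name `partial` and the test point are my own, not from the slides):

```python
def partial(f, point, i, h=1e-6):
    """Central-difference estimate of df/dx_i at `point`:
    vary only coordinate i, holding the rest constant."""
    up = list(point); up[i] += h
    dn = list(point); dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)

# z = x*y from the slides: dz/dx = y and dz/dy = x
z = lambda p: p[0] * p[1]
print(partial(z, [2.0, 5.0], 0))  # ≈ 5.0 (= y)
print(partial(z, [2.0, 5.0], 1))  # ≈ 2.0 (= x)
```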
Higher Dimensions
In more dimensions the gradient is a vector called grad, represented by an upside-down triangle ∇. For a function E(w) of w = (w1, w2, ..., wn):
∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn)
Grad tells us steepness and direction; its elements are the partial derivatives wrt each variable.
[Figure: error surface E(w) over the weight plane (w1, w2).]
For example, if E = 2w1 + w2 (in a 3-dimensional weight space) then ∂E/∂w1 = 2, ∂E/∂w2 = 1 and ∂E/∂w3 = 0, so ∇E = (2, 1, 0).
Vector Fields
Since grad is a vector, it is often useful to view it as a field, with the vectors drawn as arrows.
Note that grad points in the steepest direction uphill; similarly, minus grad points in the steepest direction downhill.
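The gradient vector can be estimated numerically, one partial derivative per coordinate. A sketch using the surface from the earlier slide (the helper name `grad` and the sample point are my own):

```python
import math

def grad(f, p, h=1e-6):
    """Numerical gradient: the vector of partial derivatives of f at p."""
    g = []
    for i in range(len(p)):
        up = list(p); up[i] += h
        dn = list(p); dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

# z = x exp(-x^2 - y^2); analytically dz/dx = (1 - 2x^2) e^(-x^2-y^2)
z = lambda p: p[0] * math.exp(-p[0] ** 2 - p[1] ** 2)
g = grad(z, [0.5, 0.0])
print(g)  # this vector points in the steepest uphill direction
```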
Calculating the Derivative
dy/dx = rate of change of y = (change in y)/(change in x) (exact if dy/dx remains constant).
For a straight line, y = mx + c, it is easy: since the gradient remains constant, divide the change in y by the change in x. Thus dy/dx is constant and equals m.
eg y = 2x: between x = 1 and x = 3, Δx = 3 − 1 = 2, and y goes from 2 to 6, so Δy = 6 − 2 = 4. So dy/dx = Δy/Δx = 4/2 = 2.
Aside: different sorts of d's:
Δ means "a change in"
δ means "a small change in"
d means "an infinitesimally small change in"
However, if the derivative is not constant, we want the gradient at a point, ie the gradient of a tangent to the curve at that point. A tangent is a line that touches the curve at that point.
It is approximated by the gradient of a chord, Δy/Δx (a chord is a straight line from one point on a curve to another).
The smaller the chord, the better the approximation. For an infinitesimally small chord, δy/δx → dy/dx and the approximation is exact.
[Figure: curve with a chord from (x, y) to (x + Δx, y + Δy).]
eg y = x^2
[Figure: chord on y = x^2 from (x, y(x)) to (x + h, y(x + h)).]
The gradient of the chord from x to x + h is:
dy/dx ≈ [y(x + h) − y(x)] / h
Here y(x + h) = (x + h)^2 = x^2 + 2xh + h^2, so:
[y(x + h) − y(x)] / h = (x^2 + 2xh + h^2 − x^2) / h = (2xh + h^2) / h = 2x + h
Now do this process mathematically, letting the chord become infinitesimally small:
dy/dx = lim(h→0) (2x + h) = 2x
Can get derivatives in this way, but there are rules for most functions.
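The shrinking-chord argument can be watched numerically: the chord gradient for y = x^2 is exactly 2x + h, so it approaches 2x as h shrinks. A short sketch (names and the test point x = 3 are my own):

```python
def slope_of_chord(f, x, h):
    """Gradient of the chord from (x, f(x)) to (x+h, f(x+h))."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
# at x = 3 the true derivative is 2x = 6; the chord gives 6 + h
for h in [1.0, 0.1, 0.001]:
    print(slope_of_chord(f, 3.0, h))  # 7.0, then ≈ 6.1, then ≈ 6.001
```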
Function | Derivative
y = constant (eg y = 3) | dy/dx = 0 (as no change)
y = x^n | dy/dx = n x^(n−1)
y = e^x | dy/dx = e^x
y = ln x | dy/dx = 1/x
y = sin x | dy/dx = cos x
y = cos x | dy/dx = −sin x
NB n! is n factorial and means eg 5! = 5x4x3x2x1
Can prove all, eg why is dy/dx = e^x if y = e^x? Using the series e^x = Σ_i x^i / i!, differentiating term by term gives Σ_i i x^(i−1) / i! = Σ_i x^(i−1) / (i−1)!, which is the same series again.
y = k f(x) ⇒ dy/dx = k df/dx, eg y = 3sin(x), dy/dx = 3cos(x)
y = f(x) + g(x) ⇒ dy/dx = df/dx + dg/dx, eg y = x^3 + e^x, dy/dx = 3x^2 + e^x
Other useful rules:
Product Rule: if y = u(x)v(x) then dy/dx = u dv/dx + v du/dx
eg y = x sin(x): u = x, v = sin(x), so du/dx = 1, dv/dx = cos(x)
and dy/dx = x cos(x) + 1×sin(x) = x cos(x) + sin(x)
Quotient Rule: if y = u(x)/v(x) then dy/dx = (v du/dx − u dv/dx) / v^2
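The product-rule example can be sanity-checked against a numerical derivative. A minimal sketch (the helper `deriv` and the test point are my own):

```python
import math

def deriv(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# product rule: d/dx [x sin x] = x cos x + sin x
x = 1.2
numeric = deriv(lambda t: t * math.sin(t), x)
analytic = x * math.cos(x) + math.sin(x)
print(abs(numeric - analytic) < 1e-5)  # True
```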
Chain Rule
Also known as a function of a function. The one you will see most. EASY IN PRACTICE (honestly).
If y = u(v(x)) then dy/dx = (du/dv)(dv/dx)
eg y = sin(x^2): u = sin(v), v = x^2, so du/dv = cos(v), dv/dx = 2x
and dy/dx = (du/dv)(dv/dx) = cos(v) × 2x = 2x cos(x^2)
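The chain-rule example checks out numerically too. Another small sketch (helper and test point mine):

```python
import math

def deriv(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# chain rule: d/dx sin(x^2) = 2x cos(x^2)
x = 0.7
numeric = deriv(lambda t: math.sin(t * t), x)
analytic = 2 * x * math.cos(x * x)
print(abs(numeric - analytic) < 1e-5)  # True
```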
Some examples:
E = u^2 ⇒ dE/du = 2u; E = u^2 + 2u + 20 ⇒ dE/du = 2u + 2
y1 = 3x1 ⇒ dy1/dx1 = 3 (think of the 3 as a weight: y1 = w11 x1 with w11 = 3, so dy1/dx1 = w11)
y1 = w11 x1 + w12 x2: use partial d's, as y1 is a function of 2 variables:
∂y1/∂x1 = w11 + 0 = w11 (treating x2 and the weights as constants)
∂y1/∂w11 = x1
Similarly, if y1 = w11 x1 + w12 x2 + … + w1n xn, then ∂y1/∂x1 = w11
Sigmoid function: y = 1 / (1 + e^(−a)). Can show dy/da = y(1 − y).
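The identity dy/da = y(1 − y) is easy to verify numerically. A sketch (helper name and sample points are my own):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def deriv(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# check dy/da = y(1 - y) at a few points
for a in [-2.0, 0.0, 1.5]:
    y = sigmoid(a)
    print(abs(deriv(sigmoid, a) - y * (1 - y)) < 1e-6)  # True each time
```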
EG: Networks
A simple network: inputs x1 and x2 connect to the output y1 via weights w11 and w12, so y1 = w11 x1 + w12 x2.
How does a change in x1 affect the network output? ∂y1/∂x1 = w11.
So intuitively, the bigger the weight on a connection, the more effect a change in the input along that connection will make.
Similarly, w12 affects y1 by ∂y1/∂w12 = x2. So intuitively, changing w12 has a larger effect the stronger the signal travelling along it.
Note also that by following the path of x2 through the network, you can see which variables it affects.
Now suppose we want to train the network so that it outputs a target t1 when the input is x. We need an error function E to evaluate how far the output y1 is from the target t1:
E = (y1 − t1)^2
Now, how does E change if the output changes? To calculate this we need the chain rule, as E is a function of a function of y1: E = u(v(y1)) where v = y1 − t1 and u = v^2.
dE/dy1 = (du/dv)(dv/dy1), with du/dv = 2v = 2(y1 − t1) and dv/dy1 = 1
and so dE/dy1 = 2(y1 − t1) × 1 = 2(y1 − t1)
We train the network by adjusting the weights, so we need to know how the error varies if eg w12 varies, ie ∂E/∂w12, where E = (y1 − t1)^2.
The chain of events (w12 affecting y1, affecting E) indicates the chain rule, so:
∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12)
To calculate this, note that w12 affects y1 by ∂y1/∂w12 = x2, which in turn affects E by ∂E/∂y1 = 2(y1 − t1). So:
∂E/∂w12 = x2 × 2(y1 − t1)
And in general: ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi − ti) xj (cf backprop)
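The weight gradient formula can be checked by perturbing w12 and watching how E moves. A minimal sketch (all numbers are my own, not from the slides):

```python
def deriv(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# network: y1 = w11*x1 + w12*x2, error E = (y1 - t1)^2
x1, x2, t1 = 0.5, -1.0, 0.2
w11, w12 = 0.3, 0.8

def E(w12_val):
    y1 = w11 * x1 + w12_val * x2
    return (y1 - t1) ** 2

# slide's formula: dE/dw12 = 2*(y1 - t1)*x2
y1 = w11 * x1 + w12 * x2
formula = 2 * (y1 - t1) * x2
print(abs(deriv(E, w12) - formula) < 1e-6)  # True
```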
Back to 2 (or even N) outputs, with error E = Σ_(i=1..2) (yi − ti)^2. Consider the route (or more accurately, the variables) by which a change in w12 affects E:
[Network: inputs x1, x2 connect to outputs y1, y2 via weights w11, w21, w12, w22.]
See that w12 still affects E only via x2 and y1, so we still have:
∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12) = 2(y1 − t1) × x2, and in general ∂E/∂wij = 2(yi − ti) xj
Visualise the partial derivatives working along the connections.
How about how E is changed by x2, ie ∂E/∂x2? x2 now affects E along 2 routes, so we need to somehow combine the contributions along these 2 routes …
Again look at the routes along which x2 travels:
• x2 changes y1 along w12 (ie ∂y1/∂x2 = w12), which changes E via ∂E/∂y1
• x2 also changes y2 along w22 (ie ∂y2/∂x2 = w22), which changes E via ∂E/∂y2
Chain rule for partial differentiation
If y is a function of u and v, which are both functions of x, then:
dy/dx = (∂y/∂u)(du/dx) + (∂y/∂v)(dv/dx)
Quite intuitive: sum the contributions along each path.
So, with E = Σ_(i=1..2) (yi − ti)^2 and ∂E/∂yi = 2(yi − ti):
∂E/∂x2 = (∂E/∂y1)(∂y1/∂x2) + (∂E/∂y2)(∂y2/∂x2) = 2(y1 − t1) w12 + 2(y2 − t2) w22
And in general, summing over the outputs: ∂E/∂xj = Σ_i 2(yi − ti) wij
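Summing contributions along both routes can be verified against a numerical derivative. A sketch of the two-output case (all weights, inputs and targets are my own toy numbers):

```python
def deriv(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# two outputs: yi = wi1*x1 + wi2*x2, E = (y1-t1)^2 + (y2-t2)^2
w11, w12, w21, w22 = 0.3, 0.8, -0.5, 0.1
x1, t1, t2 = 0.5, 0.2, -0.4
x2 = -1.0

def E(x2_val):
    y1 = w11 * x1 + w12 * x2_val
    y2 = w21 * x1 + w22 * x2_val
    return (y1 - t1) ** 2 + (y2 - t2) ** 2

# sum over both paths: dE/dx2 = 2(y1-t1)w12 + 2(y2-t2)w22
y1 = w11 * x1 + w12 * x2
y2 = w21 * x1 + w22 * x2
formula = 2 * (y1 - t1) * w12 + 2 * (y2 - t2) * w22
print(abs(deriv(E, x2) - formula) < 1e-6)  # True
```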
What about another layer? Inputs u1, u2 now feed the units x1, x2 through weights v11, v21, v12, v22, and the x's feed the outputs y1, y2 through the w's as before, with E = Σ_i (yi − ti)^2.
We need to know how E changes wrt the v's and the w's. The calculation for the w's is the same as before, but what about the v's? The v's affect the x's, which affect the y's, which affect E. So we need:
∂E/∂vji = (∂E/∂xj)(∂xj/∂vji)
which involves finding ∂E/∂xj, and ∂xj/∂vji = ui. Thus:
∂E/∂vji = [Σ_(k=1..2) (∂E/∂yk)(∂yk/∂xj)] × ui
Seems complicated, but it can be done via matrix multiplication.
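The "via matrix multiplication" remark can be sketched for two purely linear layers: the sum over outputs k is exactly a multiply by the transpose of W. All numbers and helper names below are my own, and the layers are linear (no sigmoid) to keep the sketch minimal:

```python
# Two linear layers: x = V u, y = W x, E = sum_i (y_i - t_i)^2.
# Backprop as matrix products:
#   dE/dy = 2(y - t)
#   dE/dx = W^T (dE/dy)        <- the sum over output paths k
#   dE/dV[j][i] = dE/dx[j] * u[i]
def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

u = [1.0, -0.5]
V = [[0.2, 0.7], [-0.3, 0.4]]   # first-layer weights (toy values)
W = [[0.5, -0.1], [0.8, 0.6]]   # second-layer weights (toy values)
t = [0.3, -0.2]

x = matvec(V, u)
y = matvec(W, x)
dE_dy = [2 * (yi - ti) for yi, ti in zip(y, t)]
dE_dx = matvec(transpose(W), dE_dy)
dE_dV = [[dE_dx[j] * ui for ui in u] for j in range(len(u))]
print(dE_dV)
```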
Finding Maxima/Minima (Optima)
At a maximum or minimum (optimum, stationary point etc) the gradient is flat, ie dy/dx = 0. To find the max/min, calculate dy/dx and solve dy/dx = 0.
A function can have several local maxima and local minima; the global minimum is the lowest point and the global maximum the highest point.
Similarly, for a function of n variables y = f(x1, x2, ..., xn), at optima ∂y/∂xi = 0 for i = 1, …, n.
eg y = x^2, dy/dx = 2x, so at optima: dy/dx = 0 ⇒ 2x = 0, ie x = 0
Why do we want to find optima? Error minimisation.
In neural network training we have the error function E = Σ(yi − ti)^2. At the global minimum of E the outputs are as close as possible to the targets, so the network is 'trained'.
Therefore, to train a network, 'simply' set dE/dw = 0 and solve.
However, in many cases (especially neural network training) we cannot solve dE/dw = 0 analytically. In such cases we can use gradient descent to change the weights in such a way that the error decreases.
Gradient Descent
Analogy: being on a foggy hillside, where we can only see the area around us. To get to the valley floor a good plan is to take a step downhill; to get there quickly, take a step in the steepest direction downhill, ie in the direction of −grad. As we can't see very far and only know the gradient locally, we only take a small step.
Algorithm:
1. Start at a random position in weight space, w
2. Calculate ∇E(w)
3. Update w via: wnew = wold − η∇E(wold)
4. Repeat from 2
η is the learning rate and governs the size of the step taken. It is set to a low value and often changed adaptively, eg if the landscape seems 'easy' (not jagged or steep), increase the step size etc.
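The algorithm above can be sketched in a few lines. This toy error surface and its hand-coded gradient are my own, chosen so the minimum is known in advance:

```python
# Gradient descent on E(w) = (w1 - 1)^2 + 2*(w2 + 0.5)^2,
# whose gradient is (2(w1 - 1), 4(w2 + 0.5)) and minimum is at (1, -0.5).
def grad_E(w):
    return [2 * (w[0] - 1.0), 4 * (w[1] + 0.5)]

eta = 0.1          # learning rate: small, fixed here
w = [3.0, 2.0]     # starting point in weight space
for _ in range(200):
    g = grad_E(w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]  # step along -grad
print(w)  # converges towards the minimum at (1, -0.5)
```

Note the step is always along −∇E, the steepest downhill direction, exactly as in the foggy-hillside analogy.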
[Figure: one gradient descent step on the error surface E(w) over (w1, w2), from the original point in weight space w to a new point in the direction −∇E(w).]
Problems of local minima: gradient descent goes to the closest local minimum and stays there, so where you end up depends on where you start. Other problems: the length of time to get a solution, especially if there are flat regions in the search space; and it can be difficult to calculate grad.
Many variations help with these problems, eg:
• Introduce more stochasticity to 'escape' local minima
• Use higher-order gradients to improve convergence
• Approximate grad or use a proxy, eg take a step, see if you are lower down or not and if so stay there (hill-climbing etc)
Higher Order Derivatives
Rates of rates of change. Eg velocity dy/dx is 1st order; acceleration d^2y/dx^2 is 2nd order. In 1 dimension the nth-order derivative is written d^ny/dx^n; in many dimensions it is a vector of partial derivatives.
The higher the order, the more information is known about y, so higher orders are used eg to improve gradient descent search.
The 2nd derivative is also used to tell whether a function (eg the mapping produced by a network) is smooth, as it measures the curvature at a point: |d^2y/dx^2| is greater for a jagged mapping than for a smooth one.
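The curvature measure can also be estimated numerically with a central second difference. A sketch comparing a smoother and a more curved function at one point (functions and point are my own):

```python
# Second derivative d2y/dx2 via the central difference
# (f(x+h) - 2 f(x) + f(x-h)) / h^2.
def second_deriv(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

smooth = lambda x: x ** 2                 # d2y/dx2 = 2 everywhere
curvier = lambda x: x ** 2 + 0.1 * x ** 4 # d2y/dx2 = 2 + 1.2 x^2
print(second_deriv(smooth, 1.0))   # ≈ 2
print(second_deriv(curvier, 1.0))  # ≈ 3.2, larger curvature at x = 1
```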