Adjoint computation and BackpropagationAdjoint computation and Backpropagation Guillaume Pallez...
Transcript of Adjoint computation and BackpropagationAdjoint computation and Backpropagation Guillaume Pallez...
Adjoint computation and Backpropagation
Guillaume Pallez (Aupy)
Inria & University of Bordeaux
Meeting of the Royal Society, April 2019
1
Who’s who
Automatic Differentiation
Paul Hovland (Argonne)
Navjot Kukreja (Imperial College)
Krishna Narayanan (Argonne)Machine Learning (I)
Alexis Joly (Inria)
Machine Learning (II)
Alena Shilova (Inria)Scheduling
Guillaume Pallez (Inria)
Olivier Beaumont (Inria)
Julien Herrmann (Inria)
2
Guillaume Pallez
3
Adjoint Computation andBackpropagation
4
Ice-sheet model (I)
“In climate modelling, Ice-sheet models use numerical methods to simulate theevolution, dynamics and thermodynamics of ice sheets.” (wikipedia)
Credit: Daniel Goldberg
Simpler Version:
proc Model Algorithm(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
end/* F (u0) = fn ◦ fn−1 ◦ . . . ◦ f0(u0) */
Compute ∇F (u0)y ;
end
4
Ice-sheet model (I)
“In climate modelling, Ice-sheet models use numerical methods to simulate theevolution, dynamics and thermodynamics of ice sheets.” (wikipedia)
Credit: Daniel Goldberg
Simpler Version:
proc Model Algorithm(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
end/* F (u0) = fn ◦ fn−1 ◦ . . . ◦ f0(u0) */
Compute ∇F (u0)y ;
end
5
Ice-sheet model (II)
A quick reminder about the gradient:
F (u0) = fn ◦ fn−1 ◦ . . . ◦ f1 ◦ f0(u0)
∇F (u0)y = Jf0(u0)T ·∇(fn ◦ f1)(u1) · y
= Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
JfT = Transpose Jacobian matrix of f ;ui+1 = fi(ui) = fi (fi−1 ◦ . . . ◦ f0(u0)) .
6
A better solution?
∇F (u0)y = Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
endCompute ∇F (u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
vi+1 = vi · Jfi+1(ui+1)T ;
end
end
→ What is the problem with Algo B?
∇F (u0)y =((
. . .(Jf0
T · Jf1T)· . . . · Jfn−1
T)· JfnT
)· y n MatMat ops
∇F (u0)y = Jf0T ·
(Jf1
T · . . . ·(Jfn−1
T ·(Jfn
T · y). . .
))n MatVec ops
6
A better solution?
∇F (u0)y = Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
endCompute ∇F (u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
vi+1 = vi · Jfi+1(ui+1)T ;
end
end
→ What is the problem with Algo B?
∇F (u0)y =((
. . .(Jf0
T · Jf1T)· . . . · Jfn−1
T)· JfnT
)· y n MatMat ops
∇F (u0)y = Jf0T ·
(Jf1
T · . . . ·(Jfn−1
T ·(Jfn
T · y). . .
))n MatVec ops
6
A better solution?
∇F (u0)y = Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
endCompute ∇F (u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
vi+1 = vi · Jfi+1(ui+1)T ;
end
end
→ What is the problem with Algo B?
∇F (u0)y =((
. . .(Jf0
T · Jf1T)· . . . · Jfn−1
T)· JfnT
)· y n MatMat ops
∇F (u0)y = Jf0T ·
(Jf1
T · . . . ·(Jfn−1
T ·(Jfn
T · y). . .
))n MatVec ops
6
A better solution?
∇F (u0)y = Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
endCompute ∇F (u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
vi+1 = vi · Jfi+1(ui+1)T ;
end
end
→ What is the problem with Algo B?
∇F (u0)y =((
. . .(Jf0
T · Jf1T)· . . . · Jfn−1
T)· JfnT
)· y n MatMat ops
∇F (u0)y = Jf0T ·
(Jf1
T · . . . ·(Jfn−1
T ·(Jfn
T · y). . .
))n MatVec ops
6
A better solution?
∇F (u0)y = Jf0(u0)T · Jf1(u1)T · . . . · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
endCompute ∇F (u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0,y)begin
Do stuff;for i = 0 to n do
ui+1 = fi(ui);Do stuff;
vi+1 = vi · Jfi+1(ui+1)T ;
end
end
→ What is the problem with Algo B?
∇F (u0)y =((
. . .(Jf0
T · Jf1T)· . . . · Jfn−1
T)· JfnT
)· y n MatMat ops
∇F (u0)y = Jf0T ·
(Jf1
T · . . . ·(Jfn−1
T ·(Jfn
T · y). . .
))n MatVec ops
7
Adjoint computation
Fi(xi) = xi+1 i < l
F̄i(xi, x̄i+1) = x̄i i ≤ l
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
x0 x1 x2 xl−2 xl−1
x̄lx̄l−1x̄3x̄2x̄1x̄0
x0 x1 x2 xl−1 xl
7
Adjoint computation
Fi(xi) = xi+1 i < l
F̄i(xi, x̄i+1) = x̄i i ≤ l
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
x0 x1 x2 xl−2 xl−1
x̄lx̄l−1x̄3x̄2x̄1x̄0
x0 x1 x2 xl−1 xl
7
Adjoint computation
Fi(xi) = xi+1 i < l
F̄i(xi, x̄i+1) = x̄i i ≤ l
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
x0 x1 x2 xl−2 xl−1
x̄lx̄l−1x̄3x̄2x̄1x̄0
x0 x1 x2 xl−1 xl
7
Adjoint computation
Fi(xi) = xi+1 i < l
F̄i(xi, x̄i+1) = x̄i i ≤ l
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
x0 x1 x2 xl−2 xl−1
x̄lx̄l−1x̄3x̄2x̄1x̄0
x0 x1 x2 xl−1 xl
7
Adjoint computation
Fi(xi) = xi+1 i < l
F̄i(xi, x̄i+1) = x̄i i ≤ l
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
x0 x1 x2 xl−2 xl−1
x̄lx̄l−1x̄3x̄2x̄1x̄0
x0 x1 x2 xl−1 xl
8
Relation to IA? (I)
GoogleNet graph:
Source : Internet :s
8
Relation to IA? (I)
GoogleNet graph:
Source : Internet :s
9
Relation to IA? (II)
4
Derivatives in machine learning“Backprop” and gradient descent are at the core of all recent advances Computer vision
NVIDIA DRIVE PX 2 segmentationTop-5 error rate for ImageNet (NVIDIA devblog) Faster R-CNN (Ren et al. 2015)
Speech recognition & synthesis
Word error rates (Huang et al., 2014) Google Neural Machine Translation System (GNMT)
Machine translation
Source: Baydin, Beyond backprop: Automatic differentiation in machine learning, 2017.
10
Model of execution
F0 F1F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2F2 F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2F2 F3F3 F4
Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2F2 F3F3 F4F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2F2 F3F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2̄F2 F̄3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1̄F1 F̄2 F̄3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0̄F0 F̄1 F̄2 F̄3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time Space
Store all $ $$$Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 6
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 3
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 3
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 3
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 3
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none”
$$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T 3
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $
No reuse $$ $$Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 TT 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4̄F4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2̄F2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2̄F2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1̄F1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0̄F0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2 F̄3̄F3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1 F̄2̄F2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0 F̄1̄F1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
10
Model of execution
F0 F1 F2 F3 F4 Peak Mem
F̄0̄F0 F̄1 F̄2 F̄3 F̄4 T 4
I Memory to store output ofcomputations (xi or x̄i). Initial state:contains x0.
I Cost to write: wm = 0,I Cost to read: rm = 0.
Example of executionStrategy Time SpaceStore all $ $$$
Store “none” $$$ $No reuse $$ $$
Memory Reuse $ $
11
Problem formulation
We want to minimize the makespan of:
Initial state:AC graph: size l
Steps: uf , ub
Memory: cm, wm = rm = 0, Mini = {x0}Storage k: ck, wk, rk, Sini = ∅
F0 F1 · · · Fl−2 Fl−1
F̄0 F̄1 F̄2 · · · F̄l−1 T
12
Existing work
Question: How to organize the reverse execution of intermediate steps?What do we store, what do we recompute?
I Store all: memory expensive
I Recompute all: compute expensive
I Intermediate status?
13
Bounded memory
Griewand and Walther, 2000: Revolve(l, cm), optimal algorithm with cm memoryslots.
Source: Andrea Walther’s PhD thesis, 1999
14
Storage hierarchy
A., Herrmann, Hovland, Robert, 2015: Optimal algorithm for two level of storage:cheap bounded memory and costly unbounded disks.A., Herrmann, 2019: Library of optimal schedules for any number of storage level.
(https://gitlab.inria.fr/adjoint-computation)
Forward phase
F0 Fl−1
Revolve(m, cm)
m1 m2 m3 m4 l −∑
mi
15
What directions for AI?
Then what? Are we done? Just let AD people and ML people talk together!
Cut the middle-(scheduling)-people!
Source: A graffiti in Paris (twitter)
15
What directions for AI?
Then what? Are we done? Just let AD people and ML people talk together!Cut the middle-(scheduling)-people!
Source: A graffiti in Paris (twitter)
15
What directions for AI?
While the core of the algorithms remain similar, the problematics are different:
I Shallower graphs (O(100− 1000) levels).
I Cost functions (time/memory) are not necessarily uniform.
I Graphs with more structure than chains.
I Multi-Learners/Hyperparameter tuning (independent graphs executedsimultaneously), shared memory?
I Etc.
Pffiew, just saved my job
15
What directions for AI?
While the core of the algorithms remain similar, the problematics are different:
I Shallower graphs (O(100− 1000) levels).
I Cost functions (time/memory) are not necessarily uniform.
I Graphs with more structure than chains.
I Multi-Learners/Hyperparameter tuning (independent graphs executedsimultaneously), shared memory?
I Etc.
Pffiew, just saved my job
15
What directions for AI?
While the core of the algorithms remain similar, the problematics are different:
I Shallower graphs (O(100− 1000) levels).
I Cost functions (time/memory) are not necessarily uniform.
I Graphs with more structure than chains.
I Multi-Learners/Hyperparameter tuning (independent graphs executedsimultaneously), shared memory?
I Etc.
Pffiew, just saved my job
16
Dir. for AI: Graph structure
Remember this google network:
16
Dir. for AI: Graph structure
Remember this google network:
Nice try, we are not going tohelp you with this for free Google ,
17
Dir. for AI: Graph structure II
Source: Rao et al., A Deep Siamese Neural
Network (...), 2016
Source: Surı́s et al., Cross-Modal Embeddings
for Video and Audio Retrieval, 2018
17
Dir. for AI: Graph structure II
Pitchfork graph1 (aka join graphs):
Theorem (A., Beaumont, Herrmann, Shilova, 2019)
Given a bounded memory and a pitchfork with a bounded number of “teeth”, we can find inpolynomial time the solution that backpropagates it in minimal time.
1This should not be seen as an endorsement of the YJ movement, I don’t think I’m allowed togive publicly my opinion.
18
A Grasp of the proof? (I)
Three phase algorithm:
1 Forward phase
2 Turn
3 Backward phase
I Traverse all branches. Write someintermediate data
I Backpropagate the handle of the pitchfork
I Iteratively, read some checkpointed datafrom one of the branches, backpropagate asubset of the graph (can write additionalintermediate data)
18
A Grasp of the proof? (I)
Three phase algorithm:
1 Forward phase
2 Turn
3 Backward phase
I Traverse all branches. Write someintermediate data
I Backpropagate the handle of the pitchfork
I Iteratively, read some checkpointed datafrom one of the branches, backpropagate asubset of the graph (can write additionalintermediate data)
18
A Grasp of the proof? (I)
Three phase algorithm:
1 Forward phase
2 Turn
3 Backward phase
I Traverse all branches. Write someintermediate data
I Backpropagate the handle of the pitchfork
I Iteratively, read some checkpointed datafrom one of the branches, backpropagate asubset of the graph (can write additionalintermediate data)
18
A Grasp of the proof? (I)
Three phase algorithm:
1 Forward phase
2 Turn
3 Backward phase
I Traverse all branches. Write someintermediate data
I Backpropagate the handle of the pitchfork
I Iteratively, read some checkpointed datafrom one of the branches, backpropagate asubset of the graph (can write additionalintermediate data)
19
A Grasp of the proof? (II)
It relies on key properties of thebackward phase:
I Stability of execution
I Checkpoint persistence
which give us a multi-phase approach.
19
A Grasp of the proof? (II)
It relies on key properties of thebackward phase:
I Stability of execution
I Checkpoint persistence
which give us a multi-phase approach.
Lemma (Stability 1)
If Fi is “backpropagated”, then there areno Fj for i ≤ j.
19
A Grasp of the proof? (II)
It relies on key properties of thebackward phase:
I Stability of execution
I Checkpoint persistence
which give us a multi-phase approach.
Lemma (Checkpoint persistence)
If xi is stored, until Fi is“backpropagated”, there are no Fj forj < i.
19
A Grasp of the proof? (II)
It relies on key properties of thebackward phase:
I Stability of execution
I Checkpoint persistence
which give us a multi-phase approach.
Lemma (Stability 2)
If xi is read, then there are no Fj onother branches until it is backpropagated.
19
A Grasp of the proof? (II)
It relies on key properties of thebackward phase:
I Stability of execution
I Checkpoint persistence
which give us a multi-phase approach.
In this case, for a given forward phase,we get a multi-phase backward phase:
Forward phase
F0 F1 F2 Fl−1
I Where do we schedule thecheckpoints in the forward phase?
I In which order do we execute thesubsegment on each branch?
Is it worth it?
I From a scheduling perspective: Yes! (new fun problems)
I From an adjoint perspective: Yes!With a memory of size O(M):I Store All can execute a graph of size O(M) in time O(M);I Revolve can execute a graph of size O(eM ) in time O(MeM )!I H-Revolve inproves performance by a factor of magnitude.
I Machine Learning perspective: deeper networks!
Thanks
Is it worth it?
I From a scheduling perspective: Yes! (new fun problems)
I From an adjoint perspective: Yes!With a memory of size O(M):I Store All can execute a graph of size O(M) in time O(M);I Revolve can execute a graph of size O(eM ) in time O(MeM )!I H-Revolve inproves performance by a factor of magnitude.
I Machine Learning perspective: deeper networks!
Thanks
Is it worth it?
I From a scheduling perspective: Yes! (new fun problems)
I From an adjoint perspective: Yes!With a memory of size O(M):I Store All can execute a graph of size O(M) in time O(M);I Revolve can execute a graph of size O(eM ) in time O(MeM )!I H-Revolve inproves performance by a factor of magnitude.
I Machine Learning perspective: deeper networks!
Thanks
Is it worth it?
I From a scheduling perspective: Yes! (new fun problems)
I From an adjoint perspective: Yes!With a memory of size O(M):I Store All can execute a graph of size O(M) in time O(M);I Revolve can execute a graph of size O(eM ) in time O(MeM )!I H-Revolve inproves performance by a factor of magnitude.
I Machine Learning perspective: deeper networks!
Thanks
Is it worth it?
I From a scheduling perspective: Yes! (new fun problems)
I From an adjoint perspective: Yes!With a memory of size O(M):I Store All can execute a graph of size O(M) in time O(M);I Revolve can execute a graph of size O(eM ) in time O(MeM )!I H-Revolve inproves performance by a factor of magnitude.
I Machine Learning perspective: deeper networks!
Thanks