Parallel Recursive Filtering of Infinite Input Extensions
Diego Nehab (IMPA) and André Maximo (GE Global Research)
GTC 2017
Convolution
• The convolution of sequences 𝐛 and 𝐡 is a sequence 𝐚:

  𝐚 = 𝐛 ∗ 𝐡,  𝑎𝑖 = ∑𝑗=−∞…∞ 𝑏𝑗 ℎ𝑖−𝑗,  with 𝐛, 𝐡, 𝐚: ℤ → ℝ

[Figure: sequence 𝐛 convolved with kernel 𝐡 produces 𝐚 = 𝐛 ∗ 𝐡.]
Linear shift-invariant filters are convolutions
[Figure: feeding the unit impulse to a filter yields the filter's impulse response; convolving the input with that impulse response reproduces the filter.]
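As a concrete reference (not from the talk), here is a minimal Python sketch of the direct convolution sum 𝑎𝑖 = ∑𝑗 𝑏𝑗 ℎ𝑖−𝑗 for finite sequences:

```python
def convolve(b, h):
    """Full discrete convolution a = b * h of two finite sequences,
    following a_i = sum_j b_j * h_{i-j}."""
    n = len(b) + len(h) - 1
    a = [0.0] * n
    for i in range(n):
        for j in range(len(b)):
            k = i - j                    # index into h
            if 0 <= k < len(h):
                a[i] += b[j] * h[k]
    return a

# Convolving with the unit impulse [1.0] reproduces the input,
# illustrating that a filter is characterized by its impulse response.
print(convolve([1.0, 2.0, 3.0], [1.0]))   # -> [1.0, 2.0, 3.0]
```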
Outline of talk
• Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
• Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
• Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
General model for convolutions
• Linear difference equations
𝐀𝐳 = 𝐁𝐰, where 𝐀 and 𝐁 are banded matrices with constant diagonals. In scalar form:

  𝑎𝑟𝑧𝑖−𝑟 + ⋯ + 𝑎1𝑧𝑖−1 + 𝑎0𝑧𝑖 + 𝑎−1𝑧𝑖+1 + ⋯ + 𝑎−𝑟𝑧𝑖+𝑟 = 𝑏𝑑𝑤𝑖−𝑑 + ⋯ + 𝑏1𝑤𝑖−1 + 𝑏0𝑤𝑖 + 𝑏−1𝑤𝑖+1 + ⋯ + 𝑏−𝑑𝑤𝑖+𝑑

(𝐰 is the input to the filter; 𝐳 is the filtered output.)
General model for convolutions
• Decompose into direct and recursive parts
• The direct part has finite impulse response support (FIR):

  𝑥𝑖 = 𝑏𝑠𝑤𝑖−𝑠 + ⋯ + 𝑏1𝑤𝑖−1 + 𝑏0𝑤𝑖 + 𝑏−1𝑤𝑖+1 + ⋯ + 𝑏−𝑠𝑤𝑖+𝑠

• The direct part is what we think of as convolution: it is like a matrix multiplication 𝐱 = 𝐁𝐰, computed in 𝑂(𝑠𝑛) operations
General model for convolutions
• The recursive part has infinite impulse response support (IIR):

  𝑎𝑟𝑧𝑖−𝑟 + ⋯ + 𝑎1𝑧𝑖−1 + 𝑎0𝑧𝑖 + 𝑎−1𝑧𝑖+1 + ⋯ + 𝑎−𝑟𝑧𝑖+𝑟 = 𝑥𝑖

• The recursive part is the inverse of a convolution: it is like solving a linear system 𝐀𝐳 = 𝐱
• Overall, 𝐀𝐳 = 𝐁𝐰 splits into the direct pass 𝐱 = 𝐁𝐰 followed by the recursive pass 𝐀𝐳 = 𝐱
General model for convolutions
• Decompose the recursive part into causal and anticausal passes: 𝐀𝐳 = 𝐱 becomes 𝐃𝐲 = 𝐱 followed by 𝐄𝐳 = 𝐲
• The causal part is forward substitution (𝐃 is √𝑎0 times a unit lower-triangular matrix with constant diagonals 𝑑1, …, 𝑑𝑟):

  𝑦𝑖 = (1/√𝑎0)𝑥𝑖 − 𝑑1𝑦𝑖−1 − ⋯ − 𝑑𝑟𝑦𝑖−𝑟, in 𝑂(𝑟𝑛) operations

• The anticausal part is back substitution (𝐄 is √𝑎0 times a unit upper-triangular matrix with constant diagonals 𝑒1, …, 𝑒𝑟):

  𝑧𝑖 = (1/√𝑎0)𝑦𝑖 − 𝑒1𝑧𝑖+1 − ⋯ − 𝑒𝑟𝑧𝑖+𝑟, in 𝑂(𝑟𝑛) operations

• Full recursive filter pipeline: 𝐰 → direct (𝐱 = 𝐁𝐰) → causal (𝐃𝐲 = 𝐱) → anticausal (𝐄𝐳 = 𝐲) → 𝐳
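Not from the talk, a minimal first-order (𝑟 = 1) Python sketch of the two recursive passes; the coefficients 𝑑1 = 𝑒1 = −0.5 and unit scaling are purely illustrative:

```python
def causal_pass(x, d1, scale=1.0, y_prev=0.0):
    """Forward substitution: y_i = scale*x_i - d1*y_{i-1}."""
    y = []
    for xi in x:
        y_prev = scale * xi - d1 * y_prev
        y.append(y_prev)
    return y

def anticausal_pass(y, e1, scale=1.0, z_next=0.0):
    """Back substitution: z_i = scale*y_i - e1*z_{i+1}."""
    z = [0.0] * len(y)
    for i in reversed(range(len(y))):
        z_next = scale * y[i] - e1 * z_next
        z[i] = z_next
    return z

x = [0.0, 0.0, 1.0, 0.0, 0.0]        # unit impulse
y = causal_pass(x, d1=-0.5)          # tail decays to the right
z = anticausal_pass(y, e1=-0.5)      # adds a tail decaying to the left
print(y)  # -> [0.0, 0.0, 1.0, 0.5, 0.25]
```

Together the two passes turn the one-sided causal response into the two-sided (infinite-support) response of the full recursive filter.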
Downsampling with cardinal cubic B-splines
[Figure: the input is prefiltered with the cardinal cubic B-spline, then sampled to produce the post-processed output; unlike the Catmull-Rom and plain cubic B-spline kernels, the cardinal cubic B-spline has infinite support. Nehab & Hoppe 2014]
Fast image blur
• Blur with FIR filter (for a given 𝜎): direct convolution with 𝑠 = 2𝜎, i.e. 2𝜎𝑛 operations
• Blur with IIR filter (works for any 𝜎): causal pass with 𝑟 = 3 plus anticausal pass with 𝑟 = 3, i.e. 6𝑛 operations [van Vliet et al. 1998]
• Blur with FFT: 𝑛 log 𝑛 operations
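The actual 3rd-order Gaussian coefficients are derived by van Vliet et al.; as a hedged stand-in, the toy first-order smoother below (plain exponential smoothing, not a true Gaussian) shows why IIR cost stays 𝑂(𝑛) no matter how wide the blur is:

```python
def iir_smooth(x, a):
    """Toy first-order smoothing (NOT van Vliet's Gaussian coefficients):
    one causal plus one anticausal pass; cost is O(n) regardless of the
    smoothing width controlled by a in [0, 1)."""
    y, acc = [], 0.0
    for xi in x:                       # causal pass
        acc = (1.0 - a) * xi + a * acc
        y.append(acc)
    z, acc = [0.0] * len(y), 0.0
    for i in reversed(range(len(y))):  # anticausal pass
        acc = (1.0 - a) * y[i] + a * acc
        z[i] = acc
    return z

# An impulse spreads into a two-sided bump; widening the blur (a -> 1)
# costs exactly the same number of operations per sample.
blurred = iir_smooth([0.0] * 8 + [1.0] + [0.0] * 8, a=0.7)
```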
Infinite input extensions
[Figure: the same finite input extended in three ways before filtering: periodic repetition, periodic reflection, and constant padding (clamp to border).]
Tileable textures
• Textures designed to be tiled in a certain way
• Filtering must respect the periodicity
[Figure: original texture vs. the texture filtered, tiled, and shifted.]
Dealing with boundaries in practice
• In the frequency domain
  • DFT/DCT imply infinite extensions
  • Computations are exact even for IIR filters
• In the time domain
  • Direct convolution is free to define out-of-bounds inputs arbitrarily
  • Recursive filters must define out-of-bounds outputs!
[Figure: approximating the extension by padding wastes memory and computation; the larger the padding, the more memory and computation are wasted.]
Approximation by input padding
[Figure: the input is padded, filtered, and cropped; the output is only an approximation.]
• The amount of padding depends on the impulse response "support"
Outline of talk
• Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
• Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
• Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
Exact recursive filtering of finite input
[Figure: finite input 𝐱1 … 𝐱6 with input extensions on both sides; the causal pass 𝐲 and the anticausal pass 𝐳 would have to run over the entire infinite extension to produce the finite output.]
Exact recursive filtering of finite input
• Instead, first obtain the initial feedbacks
  • To initialize the causal pass
  • To initialize the subsequent anticausal pass
• How to obtain these feedbacks in closed form?
  • Depends on the choice of infinite input extension
[Figure: finite input 𝐱1 … 𝐱6 with its two extensions; the initial causal feedback and the initial anticausal feedback enter at the boundaries.]
Constant padding
• Left extension is the constant 𝐱0; right extension is the constant 𝐱7 (clamp to border)
• Initial causal feedback: the infinite series collapses to a closed form

  𝐲0 = 𝑴1𝐱0

• Initial anticausal feedback likewise:

  𝐳7 = 𝑴2𝐱7 + 𝑴3𝐲6

• The 𝑴𝑖 are precomputed 𝑟 × 𝑟 matrices:

  𝑴1 = 𝑺𝐹ഥ𝑨𝐹, with 𝑺𝐹 = (𝑰 − 𝑨𝐹𝑟)⁻¹
  𝑴2 = 𝑺𝑅𝐹𝑨𝐹𝑟, where 𝑺𝑅𝐹 − 𝑨𝑅𝑟𝑺𝑅𝐹𝑨𝐹𝑟 = ഥ𝑨𝑅
  𝑴3 = 𝑺𝑅ഥ𝑨𝑅 − 𝑺𝑅𝐹𝑨𝐹𝑟𝑴1, with 𝑺𝑅 = (𝑰 − 𝑨𝑅𝑟)⁻¹

[Figure: constant extensions … 𝐱0 𝐱0 and 𝐱7 𝐱7 … on either side of the finite input; the feedbacks 𝐲0 and 𝐳7 initialize the two passes over the finite output.]
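A first-order (𝑟 = 1) scalar illustration of the closed-form idea, not the talk's matrix formulation: with the textbook recurrence y_i = b·x_i + a·y_{i-1} and a constant left extension, the infinite series behind 𝑴1 is geometric and its 1×1 analogue is b/(1−a):

```python
# With x_i = x0 for all i <= 0, the initial feedback is the geometric series
#   y_0 = b*x0 * (1 + a + a^2 + ...) = b*x0 / (1 - a),
# the 1x1 analogue of the precomputed matrix M1.

def feedback_by_padding(x0, a, b, pad):
    """Approximate y_0 by actually filtering `pad` copies of x0."""
    y = 0.0
    for _ in range(pad):
        y = b * x0 + a * y
    return y

a, b, x0 = 0.8, 0.25, 5.0
closed_form = b * x0 / (1.0 - a)              # exact, O(1)
approx = feedback_by_padding(x0, a, b, 500)   # approximate, O(pad)
print(abs(closed_form - approx))              # negligible difference
```

The closed form is exact and costs nothing per image, while padding only approximates it at extra cost.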
Periodic repetition
• Run a 1st causal pass over the finite input with zero initial feedback, producing ሶ𝐲1 … ሶ𝐲6
• The exact initial causal feedback then has a closed form:

  𝐲6 = 𝑴4ሶ𝐲6, with 𝑴4 = (𝑰 − 𝑨𝐹𝑛)⁻¹

• A 2nd causal pass produces the exact 𝐲1 … 𝐲6
• Likewise in the anticausal direction: a 1st anticausal pass (zero feedback) gives ሶ𝐳1 … ሶ𝐳6, the exact initial anticausal feedback is

  𝐳1 = 𝑴5ሶ𝐳1, with 𝑴5 = (𝑰 − 𝑨𝑅𝑛)⁻¹

  and a 2nd anticausal pass produces the exact 𝐳1 … 𝐳6
• 𝑴4 and 𝑴5 are precomputed 𝑟 × 𝑟 matrices
[Figure: the finite input repeated periodically on both sides; 1st and 2nd causal/anticausal passes.]
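A first-order (𝑟 = 1) scalar illustration of the periodic-repetition feedback, again using the textbook form y_i = b·x_i + a·y_{i-1} rather than the talk's normalization; the 1×1 analogue of 𝑴4 = (𝑰 − 𝑨𝐹𝑛)⁻¹ is 1/(1 − aⁿ):

```python
def causal(x, a, b, y0=0.0):
    """Causal pass y_i = b*x_i + a*y_{i-1} starting from feedback y0."""
    y = []
    for xi in x:
        y0 = b * xi + a * y0
        y.append(y0)
    return y

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]   # one period, n = 6
a, b, n = 0.5, 1.0, 6

ydot = causal(x, a, b)               # 1st causal pass, zero feedback
y_n = ydot[-1] / (1.0 - a ** n)      # closed-form feedback (M4 analogue)
exact = causal(x, a, b, y0=y_n)      # 2nd causal pass: exact periodic output

# Brute force: filter 50 repeated periods and keep the last one.
brute = causal(x * 50, a, b)[-n:]
print(max(abs(p - q) for p, q in zip(exact, brute)))  # ~0
```

Two passes over a single period reproduce what brute-force repetition only approaches in the limit.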
Periodic reflection
• Run a 1st causal pass and a 1st anticausal pass over the finite input with zero feedbacks, producing ሶ𝐲1 … ሶ𝐲6 and ሷ𝐳1 … ሷ𝐳6
• The exact initial feedbacks then have closed forms:

  𝐲12 = 𝑴6ሶ𝐲6 + 𝑴7ሷ𝐳1 (initial causal feedback)
  𝑴8ሶ𝐲6 + 𝑴9𝐲12 (initial anticausal feedback)

• The 𝑴𝑖 are precomputed 𝑟 × 𝑟 matrices, with 𝑲 = ഥ𝑨𝐹⁻¹𝑨𝐹𝑟ഥ𝑨𝑅:

  𝑴6 = (𝑰 − 𝑨𝐹2ℎ)⁻¹(𝑨𝐹ℎ − 𝑲𝑨𝑅ℎ)
  𝑴7 = (𝑰 − 𝑨𝐹2ℎ)⁻¹(𝑰 − 𝑨𝐹𝑟𝑨𝑅𝑟)𝑲ഥ𝑨𝑅⁻¹
  𝑴8 = (𝑲 − 𝑨𝑅𝑟)⁻¹ഥ𝑨𝑅
  𝑴9 = (𝑲 − 𝑨𝑅𝑟)⁻¹ഥ𝑨𝑅𝑨𝐹ℎ

• 2nd causal and 2nd anticausal passes then produce the exact finite output
[Figure: the finite input reflected periodically on both sides; 1st passes with zero feedbacks, closed-form feedbacks, then 2nd passes.]
Is this correct? Is this fast?
• All required 𝑟 × 𝑟 matrices 𝑴1 to 𝑴9 exist and can be precomputed
  • Proofs in the paper assume only filter stability
• Exact filtering for all infinite extensions takes 𝑂(𝑟𝑛)
  • May require twice as much computation
• Real advantage comes with parallelization
  • No additional cost
Outline of talk
• Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
• Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
• Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
[Diagram: GPU with several multiprocessors, each containing many threads and its own shared memory, all attached to global memory.]
Modern GPU
• Many multiprocessors, each supporting many hardware threads
  • 10k threads in flight is typical
• On-chip shared memory within each multiprocessor
  • 48KB shared by the threads local to the multiprocessor
• Global memory
  • High throughput but high latency
• Challenge is hiding latency and keeping all cores busy
Independent rows then columns
• One thread per row (then one per column): a causal pass followed by an anticausal pass, both starting from zero feedback
[Figure: 4×4 input 𝐱 → causal pass output 𝐲 → anticausal pass output 𝐳, processed row by row, then column by column.]
[Ruijters and Thevenaz 2010]
Summary of parallel algorithms
(𝑟 is the filter order, 𝑝 is the number of processors, ℎ is the image height, 𝑤 is the image width)

Algorithm | Step complexity | Threads | Bandwidth
independent rows then columns | 4𝑟ℎ𝑤/𝑝 | ℎ, 𝑤 | 8ℎ𝑤
Performance on Fermi (GF100)
[Plot: throughput (GiP/s) vs. input size (64² to 4096²), 1st-order filter: independent rows then columns. Nehab, Maximo, Lima & Hoppe 2011]
Split rows and columns into blocks
• 1st causal and 1st anticausal passes run inside each block with zero initial feedbacks (all blocks in parallel)
• The true per-block feedbacks are then fixed up sequentially:

  𝐲𝑏𝑖 = 𝑨𝐹𝑏 𝐲𝑏(𝑖−1) + ሶ𝐲𝑏𝑖
  𝐳𝑏𝑖 = 𝑨𝑅𝑏 𝐳𝑏(𝑖+1) + ሶ𝐳𝑏𝑖

• 2nd causal and 2nd anticausal passes complete each block from the corrected feedbacks
[Figure: 4×4 example split into blocks; incomplete 1st passes, corrected feedbacks, then 2nd passes.]
[Sung & Mitra 1986; Nehab, Maximo, Lima & Hoppe 2011]
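The block split and sequential fix-up can be sketched in first-order (𝑟 = 1) scalar form, with the feedback matrix 𝑨𝐹 reduced to the scalar a (a hypothetical Python illustration, not the talk's GPU implementation):

```python
def causal_block(x, a, y0=0.0):
    """Plain causal pass y_i = x_i + a*y_{i-1}."""
    y = []
    for xi in x:
        y0 = xi + a * y0
        y.append(y0)
    return y

def blocked_causal(x, a, b):
    """Same result via blocks of size b: per-block passes with zero
    feedback (parallelizable), then a sequential feedback fix-up."""
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    partial = [causal_block(blk, a) for blk in blocks]  # could run in parallel
    out, tail = [], 0.0
    for blk in partial:
        # a^(k+1)*tail superimposes the homogeneous response to the
        # feedback entering this block (linearity of the recurrence)
        out.extend(v + a ** (k + 1) * tail for k, v in enumerate(blk))
        tail = blk[-1] + a ** len(blk) * tail           # feedback for next block
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(blocked_causal(x, 0.5, 3) == causal_block(x, 0.5))  # -> True
```

The per-block passes are embarrassingly parallel; only the short tail recurrence is sequential, which is the key to the GPU speedups that follow.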
Summary of parallel algorithms
(𝑟 is the filter order, 𝑝 is the number of processors, ℎ is the image height, 𝑤 is the image width)

Algorithm | Step complexity | Threads | Bandwidth
independent rows then columns | 4𝑟ℎ𝑤/𝑝 | ℎ, 𝑤 | 8ℎ𝑤
+ split rows and columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 9ℎ𝑤
Performance on Fermi (GF100)
[Plot: throughput (GiP/s) vs. input size (64² to 4096²), 1st-order filter: independent rows then columns; + split rows and columns. Nehab, Maximo, Lima & Hoppe 2011]
Overlap causal-anticausal processing
• Correct all per-block feedbacks before writing any filtered output: the anticausal feedbacks follow from the causal ones without materializing the intermediate causal result

  𝐲𝑏𝑖 = 𝑨𝐹𝑏 𝐲𝑏(𝑖−1) + ሶ𝐲𝑏𝑖
  𝐳𝑏𝑖 = 𝑨𝑅𝑏 𝐳𝑏(𝑖+1) + 𝐻(𝑨𝑅𝐵 𝑨𝐹𝑃) 𝐲𝑏(𝑖−1) + ሷ𝐳𝑏𝑖

• Saves bandwidth: the 1st-pass output never goes to global memory
[Figure: per-block 1st causal and 1st anticausal passes, feedback correction, then combined 2nd passes.]
[Nehab, Maximo, Lima & Hoppe 2011]
Summary of parallel algorithms
(𝑟 is the filter order, 𝑝 is the number of processors, ℎ is the image height, 𝑤 is the image width)

Algorithm | Step complexity | Threads | Bandwidth
independent rows then columns | 4𝑟ℎ𝑤/𝑝 | ℎ, 𝑤 | 8ℎ𝑤
+ split rows and columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 9ℎ𝑤
+ overlap causal with anticausal | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 5ℎ𝑤
Performance on Fermi (GF100)
[Plot: throughput (GiP/s) vs. input size (64² to 4096²), 1st-order filter: independent rows then columns; + split rows and columns; + overlap causal and anticausal. Nehab, Maximo, Lima & Hoppe 2011]
Stage 1
[Figure: 2D grid of blocks, every block initialized with zero boundary feedbacks.]
• All blocks are processed in parallel with zero feedbacks
• Each block also accumulates the cross-block recurrences that propagate the row feedbacks (𝐲𝑚,𝑛, 𝐳𝑚+1,𝑛) and the column feedbacks (𝐮𝑚,𝑛, 𝐯𝑚,𝑛); the full expressions combine block powers of 𝑨𝐹 and 𝑨𝑅 with the neighboring blocks' 𝐲𝑚−1,𝑛 and 𝐳𝑚+1,𝑛 (see the paper)
Stage 2
• Sequentially correct the per-block row feedbacks across blocks:

  𝐲𝑚,𝑛 = 𝑨𝐹𝑏 𝐲𝑚−1,𝑛 + ሶ𝐲𝑚,𝑛
  𝐳𝑚,𝑛 = 𝑨𝑅𝑏 𝐳𝑚+1,𝑛 + 𝐻(𝑨𝑅𝐵 𝑨𝐹𝑃) 𝐲𝑚−1,𝑛 + ሷ𝐳𝑚,𝑛

[Figure: row feedbacks propagated left to right (causal) and right to left (anticausal) across the block grid.]
Summary of parallel algorithms
(𝑟 is the filter order, 𝑝 is the number of processors, ℎ is the image height, 𝑤 is the image width)

Algorithm | Step complexity | Threads | Bandwidth
independent rows then columns | 4𝑟ℎ𝑤/𝑝 | ℎ, 𝑤 | 8ℎ𝑤
+ split rows and columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 9ℎ𝑤
+ overlap causal with anticausal | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 5ℎ𝑤
+ overlap rows with columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 3ℎ𝑤
Performance on Fermi (GF100)
[Plot: throughput (GiP/s) vs. input size (64² to 4096²), 1st-order filter: independent rows then columns; + split rows and columns; + overlap causal and anticausal; + overlap rows and columns. Nehab, Maximo, Lima & Hoppe 2011]
Performance on Fermi (GF100)
[Plot: throughput (GiP/s) vs. input size (64² to 4096²), 2nd-order filter: overlap causal and anticausal; + overlap rows and columns. Nehab, Maximo, Lima & Hoppe 2011]
New trick for better performance
• Sequentially find per-block row feedbacks:

  𝐲𝑚,𝑛 = 𝑨𝐹𝑏 𝐲𝑚−1,𝑛 + ሶ𝐲𝑚,𝑛
  𝐳𝑚,𝑛 = 𝑨𝑅𝑏 𝐳𝑚+1,𝑛 + 𝐻(𝑨𝑅𝐵 𝑨𝐹𝑃) 𝐲𝑚−1,𝑛 + ሷ𝐳𝑚,𝑛

• New fully parallel intermediate stage folds the completed row feedbacks 𝐲𝑚−1,𝑛 and 𝐳𝑚+1,𝑛 into the incomplete column feedbacks ഥ𝐮𝑚,𝑛 and ത𝐯𝑚,𝑛
• Sequentially find per-block column feedbacks (exactly the same as column processing):

  𝐮𝑚,𝑛 = 𝐮𝑚,𝑛−1 (𝑨𝐹𝑏)𝑡 + ഥ𝐮𝑚,𝑛
  𝐯𝑚,𝑛 = 𝐯𝑚,𝑛+1 (𝑨𝑅𝑏)𝑡 + 𝐮𝑚,𝑛−1 𝐻(𝑨𝑅𝐵 𝑨𝐹𝑃)𝑡 + ത𝐯𝑚,𝑛
Summary of parallel algorithms
(𝑟 is the filter order, 𝑝 is the number of processors, ℎ is the image height, 𝑤 is the image width)

Algorithm | Step complexity | Threads | Bandwidth
independent rows then columns | 4𝑟ℎ𝑤/𝑝 | ℎ, 𝑤 | 8ℎ𝑤
+ split rows and columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 9ℎ𝑤
+ overlap causal with anticausal | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 5ℎ𝑤
+ overlap rows with columns | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 3ℎ𝑤
+ new trick | ≈ 8𝑟ℎ𝑤/𝑝 | ℎ𝑤/𝑏 | ≈ 3ℎ𝑤
Performance on Kepler (GK110)
[Plot: throughput (GiP/s) vs. input size (64² to 8192²), 3rd- and 4th-order filters: overlap causal and anticausal; + overlap rows and columns; + new trick.]
Performance on Pascal (GP102)
[Plot: throughput (GiP/s) vs. input size (64² to 8192²), 5th-order filter: overlap causal and anticausal; + overlap rows and columns; + new trick.]
Back to infinite extensions
• Run the 1st causal pass (ሶ𝐲) and the 1st anticausal pass (ሷ𝐳), with zero feedbacks, over the entire finite input
• The closed-form initial feedbacks (here for periodic reflection) follow directly:

  𝐲12 = 𝑴6ሶ𝐲6 + 𝑴7ሷ𝐳1 (initial causal feedback)
  𝑴8ሶ𝐲6 + 𝑴9𝐲12 (initial anticausal feedback)

  with the same precomputed 𝑟 × 𝑟 matrices 𝑴6–𝑴9 as before
• 2nd causal and 2nd anticausal passes then produce the entire finite output exactly
[Figure: entire finite input with reflected extensions; the incomplete passes feed the closed-form feedback formulas.]
Back to infinite extensions
• The zero-initialized incomplete passes are a side effect of the 2nd stage of the parallel algorithm
• They are just what we need to compute the exact initial feedbacks!
Performance on Kepler (GK110)
[Plot: throughput (GiP/s) vs. input size (64² to 8192²), 1st order (cubic B-spline interpolation): Chaurasia et al. 2015; Nehab et al. 2011; padding; exact reflect; exact repeat; exact clamp to border.]
Performance on Kepler (GK110)
[Plot: throughput (GiP/s) vs. input size (64² to 8192²), 3rd order (2D Gaussian blur with 𝜎 = 𝑛/6): conv; cuFFT; Chaurasia et al. 2015; Nehab et al. 2011; padding; exact reflect; exact repeat; exact clamp to border.]
Summary
• Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
• Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
• Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
Thank you!
• Please download and use our code
https://github.com/andmax/gpufilter
• Questions?