Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado
UCM
Index
1. Motivation
2. Experimental environment
3. Lifting Transform
4. Memory hierarchy exploitation
5. SIMD optimization
6. Conclusions
7. Future work
Motivation
Applications based on the Wavelet Transform:
o JPEG-2000
o MPEG-4
Usage of the lifting scheme
Study based on a modern general purpose microprocessor
o Pentium 4
Objectives:
o Efficient exploitation of Memory Hierarchy
o Use of the SIMD ISA extensions
Experimental Environment
Platform: Intel Pentium 4 (2.4 GHz)
Motherboard: DFI WT70-EC
Cache:
o IL1: NA
o DL1: 8 KB, 64 Byte/Line, Write-Through
o L2: 512 KB, 128 Byte/Line
Memory: 1 GB RDRAM (PC800)
Operating System: RedHat Distribution 7.2 (Enigma)
Compiler: Intel ICC compiler, GCC compiler
Lifting Transform
[Figure: lifting scheme diagram. Legend: original element, 1st step, 2nd step. Add and multiply operations with the coefficients β and δ produce the approximation (A) and detail (D) outputs.]
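The update performed at each step of the lifting scheme can be sketched in scalar C as follows. This is a minimal sketch, not the authors' code: the function name lift_step, the symmetric border handling and the coefficient values in the usage note are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One lifting step: every second element, starting at `start`, is updated
 * with the weighted sum of its two neighbours:
 *     x[i] = x[i] + coeff * (x[i-1] + x[i+1])
 * Borders use symmetric extension (an assumption, not stated on the slides). */
void lift_step(float *x, size_t n, size_t start, float coeff)
{
    assert(n >= 2);  /* a neighbour must exist on at least one side */
    for (size_t i = start; i < n; i += 2) {
        float left  = (i == 0)     ? x[i + 1] : x[i - 1];
        float right = (i + 1 >= n) ? x[i - 1] : x[i + 1];
        x[i] += coeff * (left + right);
    }
}
```

Calling lift_step(x, n, 1, c1) followed by lift_step(x, n, 0, c2) performs a 1st step on the odd positions and a 2nd step on the even positions. With the illustrative values c1 = -0.5 and c2 = 0.25, a constant signal yields zeros at the odd positions and is left unchanged at the even ones.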
Lifting Transform
1 Level:
o Horizontal Filtering (1D Lifting Transform)
o Vertical Filtering (1D Lifting Transform)
N Levels
[Figure legend: original element, approximation.]
Lifting Transform
[Figure: two panels, Horizontal Filtering and Vertical Filtering, each showing image regions numbered 1 and 2.]
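One level of the 2D transform applies the 1D lifting transform along the rows (horizontal filtering) and along the columns (vertical filtering). The sketch below illustrates this under several assumptions that are not taken from the slides: a column-major float image, only two lifting steps per direction, symmetric border extension, and the hypothetical names lift_strided and transform_level.

```c
#include <assert.h>
#include <stddef.h>

/* Lifting step over n elements that are `stride` floats apart. */
void lift_strided(float *x, size_t n, size_t stride, size_t start, float coeff)
{
    assert(n >= 2);
    for (size_t i = start; i < n; i += 2) {
        size_t l = (i == 0)     ? 1     : i - 1;   /* symmetric extension */
        size_t r = (i + 1 >= n) ? i - 1 : i + 1;
        x[i * stride] += coeff * (x[l * stride] + x[r * stride]);
    }
}

/* One transform level on a column-major rows x cols image:
 * element (r, c) lives at img[c * rows + r]. */
void transform_level(float *img, size_t rows, size_t cols, float c1, float c2)
{
    /* Horizontal filtering: each row is a 1D signal with stride `rows`. */
    for (size_t r = 0; r < rows; r++) {
        lift_strided(img + r, cols, rows, 1, c1);
        lift_strided(img + r, cols, rows, 0, c2);
    }
    /* Vertical filtering: each column is a contiguous 1D signal. */
    for (size_t c = 0; c < cols; c++) {
        lift_strided(img + c * rows, rows, 1, 1, c1);
        lift_strided(img + c * rows, rows, 1, 0, c2);
    }
}
```

Note that the horizontal pass walks memory with a stride of `rows` floats in this layout, while the vertical pass is contiguous.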
Memory Hierarchy Exploitation
Poor data locality of one component (canonical layouts)
E.g.: column-major layout processing image rows (Horizontal Filtering)
o Aggregation (loop tiling)
Poor data locality of the whole transform
o Other layouts
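Aggregation can be sketched as follows. With a column-major layout, filtering one row at a time jumps between columns on every access, whereas filtering a block of adjacent rows together makes the inner loop run over contiguous memory. This is a minimal sketch under assumptions: a single lifting step on the odd interior columns, and the hypothetical name horizontal_step_aggregated and parameter block.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major image: element (r, c) is img[c * rows + r].
 * Applies one lifting step to the odd interior columns of every row,
 * processing `block` adjacent rows per pass so that the innermost
 * loop walks down a column, i.e. over contiguous memory. */
void horizontal_step_aggregated(float *img, size_t rows, size_t cols,
                                size_t block, float coeff)
{
    assert(block > 0);
    for (size_t r0 = 0; r0 < rows; r0 += block) {
        size_t r1 = (r0 + block < rows) ? r0 + block : rows;
        for (size_t c = 1; c + 1 < cols; c += 2) {
            float *left  = img + (c - 1) * rows;
            float *mid   = img + c * rows;
            float *right = img + (c + 1) * rows;
            for (size_t r = r0; r < r1; r++)
                mid[r] += coeff * (left[r] + right[r]);
        }
    }
}
```

With block set to 1 the function degenerates to the row-by-row traversal; larger blocks compute the same result with a friendlier access pattern.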
Memory Hierarchy Exploitation
[Figure: two panels, Horizontal Filtering and Vertical Filtering, each showing image regions numbered 1 and 2.]
Memory Hierarchy Exploitation
Aggregation
[Figure: horizontal filtering of the IMAGE with aggregation; regions numbered 1 and 2.]
Memory Hierarchy Exploitation
Different studied schemes
INPLACE
o Common implementation of the transform
o Memory: only requires the original matrix
o Needs post-processing for most applications
MALLAT
o Memory: requires 2 matrices
o Stores the image in the expected order
INPLACE-MALLAT
o Memory: requires 2 matrices
o Stores the image in the expected order
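The post-processing required by the Inplace scheme can be pictured in 1D: an in-place transform leaves approximation and detail coefficients interleaved, while most consumers expect them grouped. The sketch below assumes that even positions hold approximations and odd positions hold details; the name deinterleave and the scratch buffer tmp are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reorders an interleaved signal  a0 d0 a1 d1 ...  into  a0 a1 ... d0 d1 ...
 * `tmp` is a scratch buffer of n floats, distinct from x. */
void deinterleave(float *x, size_t n, float *tmp)
{
    assert(x != tmp);
    size_t half = (n + 1) / 2;  /* number of approximation coefficients */
    for (size_t i = 0; i < n; i++) {
        size_t dst = (i % 2 == 0) ? i / 2 : half + i / 2;
        tmp[dst] = x[i];
    }
    memcpy(x, tmp, n * sizeof *x);
}
```

Schemes that write their output to a second matrix can place each coefficient at its final position directly, which is why they do not need this extra pass.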
Memory Hierarchy Exploitation
INPLACE
[Figure: MATRIX 1 holds the original elements (O). Horizontal filtering turns them into L and H coefficients, and vertical filtering then produces the LL, LH, HL and HH coefficients (numbered 1 to 4) of the transformed image. Logical and physical views of the transformed image are shown, with the orderings "LL1 LH1 LL2 LH2 HL1 HH1 HL2 HH2 LL3 ..." and "LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...".]
Memory Hierarchy Exploitation
MALLAT
[Figure: MATRIX 1 holds the original elements (O); MATRIX 2 is the second matrix. Horizontal filtering produces L and H coefficients, and vertical filtering then produces the LL, LH, HL and HH coefficients (numbered 1 to 4), grouped by subband. Logical and physical views of the transformed image: "LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...".]
Memory Hierarchy Exploitation
INPLACE-MALLAT
[Figure: MATRIX 1 holds the original elements (O); MATRIX 2 is the second matrix. Horizontal filtering produces L and H coefficients, and vertical filtering then produces the LL, LH, HL and HH coefficients (numbered 1 to 4). Logical and physical views are shown. Transformed image (Matrix 1): "LL1 LL2 LL3 LL4 ...". Transformed image (Matrix 2): "LH1 LH2 LH3 LH4 HL1 ...".]
Memory Hierarchy Exploitation
[Figure: four bar charts of execution time (s), one per image size: 256², 1024², 2048² and 8192² pixels. Bars: I-ICC, I-GCC, IM-ICC, IM-GCC, M-ICC, M-GCC. Series: Level 1, Level 2, Level 3, Level 4, Post.]
Execution time breakdown for several sizes comparing both compilers.
I, IM and M denote the inplace, inplace-mallat and mallat strategies respectively.
Each bar shows the execution time of each level and the post-processing step.
Memory Hierarchy Exploitation
CONCLUSIONS
The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.
These two approaches show a noticeable slowdown at the 1st level:
• Larger working set
• More complex access pattern
The Inplace-Mallat version achieves the best execution time.
The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.
SIMD Optimization
Objective: Extract the parallelism available in the Lifting Transform
Different strategies:
o Semi-automatic vectorization
o Hand-coded vectorization
Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout).
SIMD Optimization
Automatic Vectorization (Intel C/C++ Compiler)
Inner loops
Simple array index manipulation
Iterate over contiguous memory locations
Global variables avoided
Pointer disambiguation if pointers are employed
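A loop shaped to meet these conditions might look like the sketch below. It assumes C99 `restrict` as the pointer disambiguation mechanism and uses the hypothetical name lift_columns; whether a given compiler actually vectorizes it depends on the compiler version and flags.

```c
#include <stddef.h>

/* Inner loop over contiguous memory with a simple index, no global
 * variables, and restrict-qualified pointers so the compiler may assume
 * that the written array does not alias the arrays being read. */
void lift_columns(float *restrict dst, const float *restrict left,
                  const float *restrict right, size_t n, float coeff)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += coeff * (left[i] + right[i]);
}
```

Each iteration is independent of the others, which is what allows several of them to be executed with one vector instruction.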
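SIMD Optimization
[Figure: lifting scheme diagram. Legend: original element, 1st step, 2nd step. Add and multiply operations with the coefficients β and δ produce the approximation (A) and detail (D) outputs.]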
SIMD Optimization
Column-major layout
[Figure: two panels, Horizontal filtering and Vectorial Horizontal filtering, each drawn as an add, multiply, add step.]
SIMD Optimization
Column-major layout
[Figure: two panels, Vertical filtering and Vectorial Vertical filtering, each drawn as an add, multiply, add step.]
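SIMD Optimization
Horizontal Vectorial Filtering (semi-automatic)
/* Pseudocode: n_columns and n_rows replace the slide's #columns and #rows;
   colm1 (col-1 on the slide) and col0..col4 point to the columns used at
   iteration j. The row index [i] and the pointer setup are not shown on the slide. */
for (j = 2, k = 1; j < (n_columns - 4); j += 2, k++) {
    #pragma vector aligned
    for (i = 0; i < n_rows; i++) {
        /* 1st operation */
        col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
        /* 2nd operation */
        col2[i] = col2[i] + beta * (col3[i] + col1[i]);
        /* 3rd operation */
        col1[i] = col1[i] + gama * (col2[i] + col0[i]);
        /* 4th operation */
        col0[i] = col0[i] + delt * (col1[i] + colm1[i]);
        /* Last step */
        detail[i] = col1[i] * phi_inv;
        aprox[i]  = col0[i] * phi;
    }
}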
SIMD Optimization
Hand-coded Vectorization
SIMD parallelism has to be explicitly expressed
Intrinsics allow more flexibility
Possibility to also vectorize the vertical filtering
SIMD Optimization
Horizontal Vectorial Filtering (hand)
/* 1st operation */
t2 = _mm_load_ps(col2);
t4 = _mm_load_ps(col4);
t3 = _mm_load_ps(col3);
coeff = _mm_set_ps1(alfa);
t4 = _mm_add_ps(t2,t4);
t4 = _mm_mul_ps(t4,coeff);
t3 = _mm_add_ps(t4,t3);
_mm_store_ps(col3,t3);
/* 2nd operation */
/* 3rd operation */
/* 4th operation */
/* Last step */
_mm_store_ps(detail,t1);
_mm_store_ps(aprox,t0);
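The 1st operation above can be wrapped into a complete function as in the following sketch. It is an illustration rather than the authors' code: the name first_operation_sse, the loop over n elements and the scalar tail are assumptions, and unaligned loads and stores replace the aligned ones from the slide so that the sketch works for any pointer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes col3[i] += alfa * (col2[i] + col4[i]) for i in [0, n),
 * four floats per iteration, with a scalar loop for the remainder. */
void first_operation_sse(float *col3, const float *col2, const float *col4,
                         size_t n, float alfa)
{
    const __m128 coeff = _mm_set1_ps(alfa);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 t2 = _mm_loadu_ps(col2 + i);
        __m128 t4 = _mm_loadu_ps(col4 + i);
        __m128 t3 = _mm_loadu_ps(col3 + i);
        t4 = _mm_add_ps(t2, t4);      /* col2 + col4            */
        t4 = _mm_mul_ps(t4, coeff);   /* alfa * (col2 + col4)   */
        t3 = _mm_add_ps(t4, t3);      /* col3 + alfa * (...)    */
        _mm_storeu_ps(col3 + i, t3);
    }
    for (; i < n; i++)                /* scalar tail */
        col3[i] += alfa * (col2[i] + col4[i]);
}
```

When the data is known to be 16-byte aligned, the aligned variants _mm_load_ps and _mm_store_ps used on the slide can be substituted.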
SIMD Optimization
[Figure: two bar charts of execution time (s), one for ICC and one for GCC. ICC bars: I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. GCC bars: I-S, IM-S, M-S, I-H, IM-H, M-H. Series: Level 1, Level 2, Level 3, Level 4.]
Execution time breakdown of the horizontal filtering (1024² pixel image).
I, IM and M denote the inplace, inplace-mallat and mallat approaches.
S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.
SIMD Optimization
CONCLUSIONS
Speedup between 4 and 6 depending on the strategy. This large improvement is due not only to the vectorial computations but also to a considerable reduction in memory accesses.
The strategies with recursive layouts (i.e. inplace-mallat and mallat) achieve higher speedups than their inplace counterparts, since the computation in the latter can only be vectorized in the first level.
For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.
SIMD Optimization
[Figure: two bar charts of execution time (s), one for ICC and one for GCC. ICC bars: I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. GCC bars: I-S, IM-S, M-S, I-H, IM-H, M-H. Series: Level 1, Level 2, Level 3, Level 4, Post.]
Execution time breakdown of the whole transform (1024² pixel image).
I, IM and M denote the inplace, inplace-mallat and mallat approaches.
S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.
SIMD Optimization
CONCLUSIONS
Speedup between 1.5 and 2 depending on the strategy.
For ICC, the shortest execution time is reached by the mallat version.
When using GCC, both recursive-layout strategies obtain similar results.
SIMD Optimization
[Figure: two line charts of speedup versus image size (log2, from 7 to 14). Series: Hand-Coded ICC, Automatic ICC, Hand-Coded GCC.]
Speedup achieved by the different vectorial codes over the inplace-mallat and inplace strategies.
We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC.
SIMD Optimization
CONCLUSIONS
The speedup grows with the image size.
On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy.
Focusing on the compilers, ICC outperforms GCC by 20-25% for all the image sizes.
Conclusions
Scalar version: We have introduced a new scheme, Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme.
SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.
Speedup:
o Horizontal filtering: about 4-6 (vectorization also reduces the pressure on the memory system).
o Whole transform: around 2.
The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.
Most of our insights are compiler independent.
Future work
4D layout for a lifting-based scheme
Measurements using other platforms
• Intel Itanium
• Intel Pentium 4 with Hyper-Threading
Parallelization using OpenMP (SMT)
For additional information:
http://www.dacya.ucm.es/dchaver