Oss and libraries enabling arabic libraries and creating opportunities
Lithe: Enabling Efficient Composition of Parallel Libraries
description
Transcript of Lithe: Enabling Efficient Composition of Parallel Libraries
1
BERKELEY PAR LAB
Lithe: Enabling Efficient Composition of Parallel Libraries
Heidi Pan, Benjamin Hindman, Krste Asanović
HotPar Berkeley, CA March 31, 2009
[email protected] {benh, krste}@eecs.berkeley.edu
Massachusetts Institute of Technology UC Berkeley
BERKELEY PAR LAB
2
How to Build Parallel Apps?
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7Hardware
OS
App
Resource Management:
Need both programmer productivity and performance!
Functionality: or or or
BERKELEY PAR LAB
3
Composability is Key to Productivity
Functional Composability
sort
App 1 App 2
code reuse
sort
same library implementation, different appsmodularity
App
same app, different library implementations
bubblesort
quicksort
BERKELEY PAR LAB
4
Composability is Key to Productivity
Performance Composability
fast
fast
fast+
faster
fast
fast(er)+
BERKELEY PAR LAB
5
Talk Roadmap
Problem: Efficient parallel composability is hard! Solution:
Harts Lithe
Evaluation
BERKELEY PAR LAB
6
Motivational Example
Sparse QR Factorization(Tim Davis, Univ of Florida)
OS
MKL
OpenMP
System Stack
Hardware
TBB
SPQRFrontal MatrixFactorization
ColumnElimination
Tree
Software Architecture
BERKELEY PAR LAB
7
Out-of-the-Box PerformanceT
ime
(sec
)
Performance of SPQR on 16-core Machine
Out-of-the-Box
Input Matrix
0
5
10
15
20
25
deltaX0
0.5
1
1.5
2
2.5
3
3.5
landmark60
65
70
75
80
85
ESOC0
200
400
600
800
1000
1200
Rucci0
0.5
1
1.5
2
2.5
3
3.5
landmark
sequential
BERKELEY PAR LAB
8
Out-of-the-Box Libraries Oversubscribe the Resources
OS
TBB OpenMP
Hardware
Core
0Core
1Core
2Core
3
virtualized kernel threads
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
A[0]A[1]A[2]
A[10]A[11]A[12] Prefetch
+Y
+Z
+X(unit stride)NY
NZ
NX CY
CZ
CX
TYTX
Cache Blocking
BERKELEY PAR LAB
9
MKL Quick Fix
Using Intel MKL with Threaded Applicationshttp://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
If more than one thread calls Intel MKL and thefunction being called is threaded, it is importantthat threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
BERKELEY PAR LAB
10
Sequential MKL in SPQR
OS
TBB OpenMP
Hardware
Core
0Core
1Core
2Core
3
BERKELEY PAR LAB
11
0
5
10
15
20
25
deltaX0
0.5
1
1.5
2
2.5
3
3.5
landmark60
65
70
75
80
85
ESOC0
200
400
600
800
1000
1200
Rucci
Sequential MKL PerformanceT
ime
(sec
)
Performance of SPQR on 16-core Machine
Out-of-the-Box
0
5
10
15
20
25
deltaX0
0.5
1
1.5
2
2.5
3
3.5
landmark60
65
70
75
80
85
ESOC0
200
400
600
800
1000
1200
Rucci
Sequential MKL
Input Matrix
BERKELEY PAR LAB
12
SPQR Wants to Use Parallel MKL
No task-level parallelism!
Want to exploit matrix-level parallelism.
BERKELEY PAR LAB
13
Share Resources Cooperatively
OS
TBB OpenMP
Hardware
Tim Davis manually tunes libraries to effectively partition the resources.
Core
0Core
1
TBB_NUM_THREADS = 2
Core
2Core
3
OMP_NUM_THREADS = 2
BERKELEY PAR LAB
14
0
5
10
15
20
25
deltaX0
0.5
1
1.5
2
2.5
3
3.5
landmark60
65
70
75
80
85
ESOC0
200
400
600
800
1000
1200
Rucci
Manually Tuned PerformanceT
ime
(sec
)
Performance of SPQR on 16-core Machine
Out-of-the-Box Sequential MKL
0
5
10
15
20
25
deltaX0
0.5
1
1.5
2
2.5
3
3.5
landmark60
65
70
75
80
85
ESOC0
200
400
600
800
1000
1200
Rucci
Manually Tuned
Input Matrix
BERKELEY PAR LAB
15
Manual Tuning Cannot Share Resources Effectively
Give resources to OpenMP
Give resources to TBB
BERKELEY PAR LAB
16
Manual Tuning Destroys Functional Composability
Tim Davis
LAPACKAx=bMKL
OpenMP
OMP_NUM_THREADS = 4
BERKELEY PAR LAB
17
Manual Tuning Destroys Performance Composability
SPQR
MKLv1
MKLv2
MKLv3
App
SPQR
0 01 2 3
BERKELEY PAR LAB
18
Talk Roadmap
Problem: Efficient parallel composability is hard! Solution:
Harts: better resource abstraction Lithe: framework for sharing resources
Evaluation
BERKELEY PAR LAB
19
Virtualized Threads are Bad
OS
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7
App 1 (TBB) App 1 (OpenMP) App 2
Different codes compete unproductively for resources.
BERKELEY PAR LAB
20
Space-Time Partitions aren’t Enough
OS
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7
Space-time partitions isolate diff apps.
MKL
OpenMPTBB
SPQR App 2
Partition 1 Partition 2
What to do within an app?
BERKELEY PAR LAB
21
Harts: Hardware Thread Contexts
Represent real hw resources. Requested, not created. OS doesn’t manage harts for app.
OS
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7
MKL
OpenMPTBB
SPQR
Harts
BERKELEY PAR LAB
22
Sharing Harts
OS
TBB OpenMP
Hardware
time
Hart 0 Hart 1 Hart 2 Hart 3
Partition
BERKELEY PAR LAB
23
Cooperative Hierarchical Schedulers
OMP
TBB
Cilk
Ct
application call graph
Cilk
library (scheduler) hierarchy
TBB Ct
OpenMP
Modular: Each piece of the app scheduled independently.
Hierarchical: Caller gives resources to callee to execute on its behalf.
Cooperative: Callee gives resources back to caller when done.
BERKELEY PAR LAB
24
A Day in the Life of a Hart
Cilk
TBB Ct
OpenMP
TBB Sched: next?
time
TBB Tasks
executeTBB task
TBB Sched: next?
execute TBB task
TBB Sched: next?nothing left to do, give hart back to parent
Cilk Sched: next?don‘t start new task, finish existing one first
Ct Sched: next?
BERKELEY PAR LAB
25
Child Scheduler
Parent Scheduler
Standard Lithe ABI
CilkLithe Scheduler
interface for sharing harts
TBBLithe Scheduler
Caller
Callee
returncall
returncall
interface for exchanging values
Analogous to function call ABI for enabling interoperable codes.
TBBLithe Scheduler
OpenMPLithe Scheduler
unregisterenter yield request register
unregisterenter yield request register
Mechanism for sharing harts, not policy.
BERKELEY PAR LAB
26
OS
TBB OpenMP
Hardware
Lithe Runtime
harts
current scheduler
OS
TBBLithe OpenMPLithe
Hardware
Lithe
TBBLithe
scheduler hierarchy
enter yield request register unregister
OpenMPLithe
enter yield request register unregister
yield
current scheduler
BERKELEY PAR LAB
27
}
:
:
register(OpenMPLithe);
Register / Unregister
TBBLithe Scheduler
OpenMPLithe Scheduler
unregisterenter yield request register
matmult(){
time
Register dynamically adds the new scheduler to the hierarchy.
unregisterenter yield request registerregister unregister
unregister(OpenMPLithe);
BERKELEY PAR LAB
28
}
:
register(OpenMPLithe);
Request
TBBLithe Scheduler
OpenMPLithe Scheduler
unregisterenter yield request register
matmult(){
time
unregisterenter yield request registerrequest
unregister(OpenMPLithe);
Request asks for more harts from the parent scheduler.
request(n);
BERKELEY PAR LAB
29
:
:
Enter / Yield
TBBLithe Scheduler
OpenMPLithe Scheduler
unregisterenter yield request register
time
unregisterenter yield request register
enter yield();
yield enter(OpenMPLithe);
Enter/Yield transfers additional harts between the parent and child.
BERKELEY PAR LAB
30
SPQR with Lithe
time
reg
enterenter
enter
yieldyield
MKL
OpenMPLithe
TBBLithe
SPQR
unreg
yield
matmult
req
BERKELEY PAR LAB
31
SPQR with Lithe
time
MKL
OpenMPLithe
TBBLithe
SPQR
unreg unreg unreg unreg
reg reg reg reg matmult matmult matmult matmult
req req req req
BERKELEY PAR LAB
32
Talk Roadmap
Problem: Efficient parallel composability is hard! Solution:
Harts Lithe
Evaluation
BERKELEY PAR LAB
33
Implementation
Harts
TBBLithe OpenMPLithe
Lithe
Harts: simulated using pinned Pthreads on x86-Linux
Lithe: user-level library (register, unregister, request, enter, yield, ...)
TBBLithe
OpenMPLithe (GCC4.4)
~600 lines of C & assembly
~2000 lines of C, C++, assembly
~1500 / ~8000 relevant lines added/removed/modified
~1000 / ~6000 relevant lines added/removed/modified
BERKELEY PAR LAB
34
No Lithe Overhead w/o Composing
TBBLithe Performance (µbench included with release)
OpenMPLithe Performance (NAS parallel benchmarks)
tree sum preorder fibonacci
TBBLithe 54.80ms 228.20ms 8.42ms
TBB 54.80ms 242.51ms 8.72ms
conjugate gradient (cg) LU solver (lu) multigrid (mg)
OpenMPLithe 57.06s 122.15s 9.23s
OpenMP 57.00s 123.68s 9.54s
All results on Linux 2.6.18, 8-core Intel Clovertown.
BERKELEY PAR LAB
35
Performance Characteristics of SPQR (Input = ESOC)
1 2 3 4 5 6 7 88
64
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
NUM_OMP_THREADS NU
M_T
BB
_TH
RE
AD
S
Tim
e (s
ec)
BERKELEY PAR LAB
36
Performance Characteristics of SPQR (Input = ESOC)
1 2 3 4 5 6 7 88
64
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
NUM_OMP_THREADS NU
M_T
BB
_TH
RE
AD
S
Tim
e (s
ec)
SequentialTBB=1, OMP=1
172.1 sec
BERKELEY PAR LAB
37
Performance Characteristics of SPQR (Input = ESOC)
1 2 3 4 5 6 7 88
64
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
NUM_OMP_THREADS NU
M_T
BB
_TH
RE
AD
S
Tim
e (s
ec)
Out-of-the-BoxTBB=8, OMP=8
111.8 sec
BERKELEY PAR LAB
38
Performance Characteristics of SPQR (Input = ESOC)
1 2 3 4 5 6 7 88
64
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
NUM_OMP_THREADS NU
M_T
BB
_TH
RE
AD
S
Tim
e (s
ec)
Out-of-the-Box
Manually Tuned70.8 sec
BERKELEY PAR LAB
39
Performance of SPQR with LitheT
ime
(sec
)
Out-of-the-Box Lithe
Input Matrix
0
20
40
60
80
100
120
ESOC
Manually Tuned
0
5
10
15
20
25
30
deltaX0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
landmark0
100
200
300
400
500
600
Rucci
BERKELEY PAR LAB
40
Future Work
SPQR
TBBLitheenter yield req reg unreg
OpenMPLitheenter yield req reg unreg
CtLitheenter yield req reg unreg
CilkLitheenter yield req reg unreg
BERKELEY PAR LAB
41
Conclusion
Composability essential for parallel programming to become widely adopted.
Lithe project contributions Harts: better resource model for parallel programming Lithe: enables parallel codes to interoperate by
standardizing the sharing of harts
MKL
OpenMPTBB
SPQR
resource management
functionality
0 1 2 3
Parallel libraries need to share resources cooperatively.
BERKELEY PAR LAB
42
Acknowledgements
We would like to thank George Necula and the rest of BerkeleyPar Lab for their feedback on this work.
Research supported by Microsoft (Award #024263 ) and Intel(Award #024894) funding and by matching funding by U.C.Discovery (Award #DIG07-10227). This work has also beenin part supported by a National Science Foundation GraduateResearch Fellowship. Any opinions, findings, conclusions, orrecommendations expressed in this publication are those of theauthors and do not necessarily reflect the views of the NationalScience Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.