Lithe: Enabling Efficient Composition of Parallel Libraries

Click here to load reader

download Lithe: Enabling Efficient Composition of Parallel Libraries

of 42

  • date post

    19-Jan-2016
  • Category

    Documents

  • view

    31
  • download

    5

Embed Size (px)

description

Lithe: Enabling Efficient Composition of Parallel Libraries. Heidi Pan , Benjamin Hindman, Krste Asanovi ć. xoxo@mit.edu  {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology  UC Berkeley. HotPar  Berkeley, CA  March 31, 2009. Functionality :. or. or. or. - PowerPoint PPT Presentation

Transcript of Lithe: Enabling Efficient Composition of Parallel Libraries

Lithe: Enabling Efficient Composition of Parallel LibrariesHeidi Pan, Benjamin Hindman, Krste Asanovi
HotPar Berkeley, CA March 31, 2009
xoxo@mit.edu {benh, krste}@eecs.berkeley.edu
BERKELEY PAR LAB
Hardware
OS
App
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
Core 6
Core 7
Resource Management:
Functional Composability
modularity
App
bubble
sort
quick
sort
Performance Composability
Solution:
Harts
Lithe
Evaluation
OS
Out-of-the-Box
OS
TBB
OpenMP
Hardware
Core
0
Core
1
Core
2
Core
3
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
If more than one thread calls Intel MKL and the
function being called is threaded, it is important
that threading in Intel MKL be turned off.
Set OMP_NUM_THREADS=1 in the environment.
BERKELEY PAR LAB
Out-of-the-Box
No task-level parallelism!
Want to exploit
Tim Davis manually tunes libraries to effectively partition the resources.
Core
0
Core
1
Core
2
Core
3
OS
TBB
OpenMP
Hardware
Out-of-the-Box
Give resources to OpenMP
Give resources to TBB
SPQR
MKL
v1
MKL
v2
MKL
v3
App
SPQR
0
0
1
2
3
Solution:
Evaluation
BERKELEY PAR LAB
OS
App 2
MKL
OpenMP
TBB
SPQR
OS
OMP
TBB
application call graph
Hierarchical: Caller gives resources to callee to execute on its behalf.
Cooperative: Callee gives resources back to caller when done.
Cilk
Ct
Cilk
Cilk
TBB
Ct
OpenMP
time
Cilk Sched: next?
Ct Sched: next?
BERKELEY PAR LAB
Standard Lithe ABI
TBBLithe Scheduler
OpenMPLithe Scheduler
Caller
Callee
return
call
return
call
register
unregister
}
:
:
Solution:
Harts
Lithe
Evaluation
Lithe: user-level library (register, unregister, request, enter, yield, ...)
TBBLithe
~2000 lines of C, C++, assembly
~1500 / ~8000 relevant lines added/removed/modified
~1000 / ~6000 relevant lines added/removed/modified
BERKELEY PAR LAB
TBBLithe Performance (µbench included with release)
OpenMPLithe Performance (NAS parallel benchmarks)
All results on Linux 2.6.18, 8-core Intel Clovertown.
tree sum
NUM_OMP_THREADS
NUM_TBB_THREADS
NUM_OMP_THREADS
NUM_TBB_THREADS
NUM_OMP_THREADS
NUM_TBB_THREADS
NUM_OMP_THREADS
NUM_TBB_THREADS
Time (sec)
Lithe project contributions
Harts: better resource model for parallel programming
Lithe: enables parallel codes to interoperate by standardizing the sharing of harts
MKL
OpenMP
TBB
SPQR
BERKELEY PAR LAB
Acknowledgements
We would like to thank George Necula and the rest of Berkeley
Par Lab for their feedback on this work.
Research supported by Microsoft (Award #024263 ) and Intel
(Award #024894) funding and by matching funding by U.C.
Discovery (Award #DIG07-10227). This work has also been
in part supported by a National Science Foundation Graduate
Research Fellowship. Any opinions, findings, conclusions, or
recommendations expressed in this publication are those of the
authors and do not necessarily reflect the views of the National
Science Foundation. The authors also acknowledge the support
of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.
0
0.5
1
1.5
2
2.5
3
3.5
landmark
0
0.5
1
1.5
2
2.5
3
3.5
landmark
0
5
10
15
20
25
deltaX
0
5
10
15
20
25
30
deltaX
0
20
40
60
80
100
120
ESOC
1
2
3
4
5
6
7
8
8
6
4
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
0
200
400
600
800
1000
1200
Rucci
0
5
10
15
20
25
deltaX
60
65
70
75
80
85
ESOC
0
200
400
600
800
1000
1200
Rucci
60
65
70
75
80
85
ESOC
0
5
10
15
20
25
deltaX
60
65
70
75
80
85
ESOC
0
200
400
600
800
1000
1200
Rucci
0
0.5
1
1.5
2
2.5
3
3.5
landmark
0
100
200
300
400
500
600
Rucci
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
landmark
1
2
3
4
5
6
7
8
8
6
4
2
50
100
150
200
250
300
350
300-350
250-300
200-250
150-200
100-150
50-100
0
0.5
1
1.5
2
2.5
3
3.5
landmark