An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared...
![Page 1: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/1.jpg)
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation
on Shared Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura,
Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science,
University of Tokyo
![Page 2: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/2.jpg)
Background
“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated
Languages with fine-grain threads
• Promising approach to handle the complexity
![Page 3: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/3.jpg)
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies to answer the Q are few
![Page 4: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/4.jpg)
Goal
Case study to better understand the effectiveness of fine-grain threads
C + Solaris threads (approach w/o fine-grain threads)
VS.
our language Schematic (approach with fine-grain threads)
in terms of
• program description cost
• speed on 1 PE
• scalability on 64PE SMP
![Page 5: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/5.jpg)
Overview
Applications ( RNA & CKY )
Solutions without fine-grain threads
Solutions with fine-grain threads
Performance evaluation
![Page 6: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/6.jpg)
Case Study 1: RNA - protein secondary structure prediction -
Algorithm: simple node traversal + pruning
finding a path
• satisfying a certain condition
• with largest weight
unbalanced tree
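The traversal described above can be sketched in C as follows. This is only an illustration under stated assumptions: the slides do not show the real data structure, so the `Node` type, the `feasible` flag (standing in for the "certain condition"), and the `max_below` bound used for pruning are all hypothetical. Weights are assumed non-negative.

```c
#include <stddef.h>

/* Hypothetical node type; the real RNA program's structures are not shown. */
typedef struct Node {
    int weight;              /* weight contributed by this node (assumed >= 0) */
    int feasible;            /* nonzero if the node satisfies the condition */
    int max_below;           /* upper bound on the weight obtainable below this node */
    size_t nchildren;
    struct Node **children;
} Node;

/* Depth-first traversal; acc is the weight of the path from the root so far. */
static void search_node(const Node *n, int acc, int *best)
{
    if (!n->feasible)
        return;                          /* prune: condition violated */
    acc += n->weight;
    if (acc + n->max_below <= *best)
        return;                          /* prune: cannot beat the best path */
    if (n->nchildren == 0) {
        if (acc > *best)
            *best = acc;                 /* complete path reaching a leaf */
        return;
    }
    for (size_t i = 0; i < n->nchildren; i++)
        search_node(n->children[i], acc, best);
}

/* Returns the largest path weight, or -1 if no feasible path exists. */
int best_path_weight(const Node *root)
{
    int best = -1;
    search_node(root, 0, &best);
    return best;
}
```

Because pruning removes subtrees of very different sizes, the amount of work under each child is hard to predict, which is why the tree is unbalanced.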
![Page 7: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/7.jpg)
Case Study 2: CKY - context-free grammar parser -
calculation of matrix elements
depends on all s
She is a girl whose mother is a teacher.
calculation time varies significantly from element to element
actual size ≒ 100
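As a rough illustration of the data dependence between matrix elements, here is a sequential sketch of the textbook CKY recurrence in C. It is not the authors' parser: the grammar encoding (nonterminals as bit positions, binary rules `A -> B C`) and all names are assumptions made for this example. Element `(i, j)` covers words `i..j` and needs every pair `(i, k)`, `(k+1, j)` with `i <= k < j`.

```c
#include <stddef.h>

#define MAXN 16   /* maximum sentence length for this sketch */

enum { S, NP, VP, V };                          /* toy nonterminals (assumed) */
typedef struct { int lhs, left, right; } Rule;  /* binary rule: lhs -> left right */

/* table[i][j] is a bitmask of the nonterminals that derive words i..j.
 * lex[i] is the bitmask for word i alone. Requires 1 <= n <= MAXN.
 * Returns the bitmask for the whole sentence. */
unsigned cky(const unsigned *lex, size_t n,
             const Rule *rules, size_t nrules,
             unsigned table[MAXN][MAXN])
{
    for (size_t i = 0; i < n; i++)
        table[i][i] = lex[i];
    for (size_t len = 2; len <= n; len++) {
        for (size_t i = 0; i + len <= n; i++) {
            size_t j = i + len - 1;
            unsigned cell = 0;
            /* (i, j) reads every shorter span (i, k) and (k+1, j). */
            for (size_t k = i; k < j; k++)
                for (size_t r = 0; r < nrules; r++)
                    if (((table[i][k] >> rules[r].left) & 1u) &&
                        ((table[k + 1][j] >> rules[r].right) & 1u))
                        cell |= 1u << rules[r].lhs;
            table[i][j] = cell;
        }
    }
    return table[0][n - 1];
}
```

In a parallel version each element would have to wait for the elements it reads, and since the inner loops do very different amounts of work per element, those waits are hard to schedule statically.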
![Page 8: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/8.jpg)
![Page 9: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/9.jpg)
Solution without Fine-grain Threads (RNA)
To create a thread for each node → large overhead
Task Pool
[Figure: processors P, P, P and the task pool; communication with memory]
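A minimal sketch of the task pool idea, assuming POSIX threads in place of the Solaris threads named in the slides. The task type (an integer `n` whose children are `n-1` and `n-2`), the fixed capacity, and every identifier are hypothetical; the point is only that a fixed set of workers takes nodes from a shared pool instead of creating one thread per node.

```c
#include <pthread.h>
#include <stdlib.h>

#define POOL_CAP 1024
#define MAX_WORKERS 64

typedef struct {
    int tasks[POOL_CAP];      /* LIFO stack of pending tasks */
    int count;                /* number of pending tasks */
    int active;               /* workers currently processing a task */
    long visited;             /* result: number of nodes visited */
    pthread_mutex_t lock;
    pthread_cond_t changed;
} TaskPool;

/* Caller must hold p->lock. */
static void push_locked(TaskPool *p, int task)
{
    if (p->count == POOL_CAP)
        abort();              /* sketch only: no growth strategy */
    p->tasks[p->count++] = task;
    pthread_cond_signal(&p->changed);
}

static void *worker(void *arg)
{
    TaskPool *p = arg;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (p->count == 0 && p->active > 0)
            pthread_cond_wait(&p->changed, &p->lock);
        if (p->count == 0)
            break;            /* pool empty and nobody can add more */
        int n = p->tasks[--p->count];
        p->active++;
        pthread_mutex_unlock(&p->lock);
        /* The real per-node work would run here, outside the lock. */
        pthread_mutex_lock(&p->lock);
        p->visited++;
        if (n >= 2) {         /* children of node n in this toy tree */
            push_locked(p, n - 1);
            push_locked(p, n - 2);
        }
        p->active--;
        if (p->count == 0 && p->active == 0)
            pthread_cond_broadcast(&p->changed);   /* wake idle workers to exit */
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Counts the nodes of the toy tree rooted at `root` using `nworkers` threads. */
long count_nodes(int root, int nworkers)
{
    TaskPool p = { .count = 0, .active = 0, .visited = 0 };
    pthread_t tid[MAX_WORKERS];
    int started = 0;
    if (nworkers > MAX_WORKERS)
        nworkers = MAX_WORKERS;
    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.changed, NULL);
    p.tasks[p.count++] = root;
    for (int i = 0; i < nworkers; i++)
        if (pthread_create(&tid[started], NULL, worker, &p) == 0)
            started++;
    if (started == 0)
        worker(&p);           /* fall back to running in the calling thread */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&p.changed);
    pthread_mutex_destroy(&p.lock);
    return p.visited;
}
```

Even in this small form, the pool, its locking, and termination detection are code the programmer has to write and maintain by hand.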
![Page 10: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/10.jpg)
Solution without Fine-grain Threads (CKY)
calculating 1 element → 0 ~ 200 synchronizations
how to implement?
• small delay → simple spin
• large delay → block wait
decision strategy?
• trial & error
• prediction
[Figure: processors P, P, P]
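One possible way to combine the two waiting styles is to spin a bounded number of times and then fall back to blocking. The sketch below assumes POSIX threads and C11 atomics; the `Cell` type, its functions, and the `spin_limit` parameter are hypothetical names, and choosing `spin_limit` is exactly the "decision strategy" question raised above.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical single-assignment cell holding one matrix element. */
typedef struct {
    atomic_int ready;          /* 0 until the value has been written */
    int value;
    pthread_mutex_t lock;
    pthread_cond_t filled;
} Cell;

void cell_init(Cell *c)
{
    atomic_init(&c->ready, 0);
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->filled, NULL);
}

/* Producer: publish the value and wake any blocked readers. */
void cell_put(Cell *c, int v)
{
    pthread_mutex_lock(&c->lock);
    c->value = v;
    atomic_store(&c->ready, 1);
    pthread_cond_broadcast(&c->filled);
    pthread_mutex_unlock(&c->lock);
}

/* Consumer: spin up to spin_limit times, then block on the condition variable. */
int cell_get(Cell *c, int spin_limit)
{
    for (int i = 0; i < spin_limit; i++)
        if (atomic_load(&c->ready))
            return c->value;               /* small delay: simple spin sufficed */
    pthread_mutex_lock(&c->lock);
    while (!atomic_load(&c->ready))
        pthread_cond_wait(&c->filled, &c->lock);   /* large delay: block wait */
    pthread_mutex_unlock(&c->lock);
    return c->value;
}
```

A call such as `cell_get(&c, 1000)` favors spinning, while `cell_get(&c, 0)` always takes the blocking path.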
![Page 11: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/11.jpg)
![Page 12: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/12.jpg)
Language with Fine-grain Threads
Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]
(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))
• future: thread creation
• touch: synchronization
• r1, r2: channel
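For comparison, the same `fib` can be approximated in C with one POSIX thread per `future`. This is a sketch under assumptions, not how Schematic is implemented: the `Future` type, `future_fib`, `touch`, and the `CUTOFF` constant are hypothetical, and the manual cutoff exists only because creating an OS-level thread for every call would be far too expensive.

```c
#include <pthread.h>

#define CUTOFF 8   /* assumed threshold: below it, run the call inline */

typedef struct {
    pthread_t thread;
    int arg;
    long result;
    int spawned;   /* nonzero while a thread is computing the result */
} Future;

long fib(int x);

static void *fib_thread(void *p)
{
    Future *f = p;
    f->result = fib(f->arg);
    return NULL;
}

/* future: start computing fib(arg), in a new thread when arg is large enough. */
static void future_fib(Future *f, int arg)
{
    f->arg = arg;
    f->spawned = arg >= CUTOFF &&
                 pthread_create(&f->thread, NULL, fib_thread, f) == 0;
    if (!f->spawned)
        f->result = fib(arg);   /* small task or creation failed: run inline */
}

/* touch: wait until the value is available, then return it. */
static long touch(Future *f)
{
    if (f->spawned) {
        pthread_join(f->thread, NULL);
        f->spawned = 0;
    }
    return f->result;
}

long fib(int x)
{
    if (x < 2)
        return 1;
    Future r1, r2;
    future_fib(&r1, x - 1);
    future_fib(&r2, x - 2);
    return touch(&r1) + touch(&r2);
}
```

In the Schematic version no such cutoff appears in the source; granularity control is left to the language implementation.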
![Page 13: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/13.jpg)
Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]
[Figure: stacks of futures on PE A and PE B]
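The scheduling discipline usually associated with lazy task creation can be modeled with a double-ended queue: the owning PE pushes and pops work at the newest end, while an idle PE steals from the oldest end. The sketch below is a simplified model under assumptions. It stores plain integer task ids behind a mutex, whereas a real implementation steals continuations from the stack, and all names here are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

#define DEQUE_CAP 256

typedef struct {
    int items[DEQUE_CAP];
    int top;                  /* index of the oldest entry */
    int bottom;               /* one past the newest entry */
    pthread_mutex_t lock;
} Deque;

void deque_init(Deque *d)
{
    d->top = d->bottom = 0;
    pthread_mutex_init(&d->lock, NULL);
}

/* Owner: record a new piece of stealable work. Returns false if full. */
bool push_bottom(Deque *d, int item)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->bottom < DEQUE_CAP) {
        d->items[d->bottom++] = item;
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Owner: take back the newest entry, if nobody stole it. */
bool pop_bottom(Deque *d, int *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) {
        *out = d->items[--d->bottom];
        ok = true;
    }
    if (d->bottom == d->top)
        d->top = d->bottom = 0;   /* empty: reuse the array from the start */
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Thief: take the oldest entry, which tends to represent the most work. */
bool steal_top(Deque *d, int *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) {
        *out = d->items[d->top++];
        ok = true;
    }
    if (d->bottom == d->top)
        d->top = d->bottom = 0;
    pthread_mutex_unlock(&d->lock);
    return ok;
}
```

In this model PE A would call `push_bottom` and `pop_bottom` on its own deque, and PE B would call `steal_top` on PE A's deque when it runs out of work.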
![Page 14: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/14.jpg)
Synchronization on Register
• StackThreads [Taura 97]
[Figure: PE A and PE B, with values held in registers and in memory]
![Page 15: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/15.jpg)
Synchronization by Code Duplication
work A; (touch r); work B
work A;
if (r has value) {
    work B ver. 1;
} else {
    c = closure(cont, fv1, ...);
    put_closure(r, c);
    /* switch to another work */
    ...
}
cont(c, v) {
    work B ver. 2;
}
+ heuristics to decide which to duplicate
simple spin
block wait
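The pattern on this slide can be made concrete with a small single-threaded C sketch. The names `closure`, `cont`, `put_closure`, and `fv1` follow the slide; the `Chan` type, `put_value`, `work`, and the choice of "work B" (adding `fv1` to the received value) are assumptions for illustration. No locking is shown, so this is not safe for concurrent use as written.

```c
#include <stddef.h>

/* A suspended "work B": the continuation plus its captured free variables. */
typedef struct Closure {
    void (*cont)(struct Closure *c, int v);
    int fv1;                  /* captured free variable */
    int *out;                 /* where work B stores its result */
} Closure;

/* Hypothetical channel written once by the producer. */
typedef struct {
    int has_value;
    int value;
    Closure *waiter;          /* closure to run when the value arrives */
} Chan;

void put_closure(Chan *r, Closure *c)
{
    r->waiter = c;
}

/* Producer side: store the value and resume a waiting closure, if any. */
void put_value(Chan *r, int v)
{
    r->has_value = 1;
    r->value = v;
    if (r->waiter != NULL) {
        Closure *c = r->waiter;
        r->waiter = NULL;
        c->cont(c, v);
    }
}

/* work B ver. 2: reads its variables from the closure. */
static void cont(Closure *c, int v)
{
    *c->out = c->fv1 + v;
}

/* work A; (touch r); work B */
void work(Chan *r, Closure *c, int fv1, int *out)
{
    /* work A would run here */
    if (r->has_value) {
        *out = fv1 + r->value;        /* work B ver. 1: variables stay local */
    } else {
        c->cont = cont;
        c->fv1 = fv1;
        c->out = out;
        put_closure(r, c);
        /* switch to another work */
    }
}
```

The duplicated body lets the fast path keep `fv1` in a local variable, while only the slow path pays for building a closure.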
![Page 16: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/16.jpg)
![Page 17: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/17.jpg)
What description can be omitted in Schematic?
(Schematic ⇔ C + thread)
Management of fine-grain tasks
• future ⇔ manipulation of task pool + load balance
Synchronization details
• touch ⇔ manipulation of comm. medium + aggressive optimizations
![Page 18: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/18.jpg)
Codes for Parallel Execution (RNA)
C
int search_node(...) {
    if (condition) {
    } else {
        child = ...;
        ...
        search_node(...);
        ...
        ...
        ...
    }
}
whole: 1566 lines
parallel: 537 lines (34 %)
Schematic
(define (search_node)
  (if condition
      'done
      (let ((child ..))
        ...
        ...
        (search_node)
        ...
        ...
        ...)))
whole: 453 lines
parallel: 29 lines (6.4 %)
(parallel = lines for parallel execution)
![Page 19: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/19.jpg)
Performance Evaluation (Condition)
• Sun Ultra Enterprise 10000 (UltraSparc 250MHz × 64)
• Solaris 2.5.1
• Solaris thread (user-level thread)
• GC time not included
• Runtime type check omitted
![Page 20: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/20.jpg)
Performance Evaluation (Sequential)
[Figure: bar chart of normalized elapsed time (axis 0 to 3) for RNA and CKY; series: C, Schematic]
![Page 21: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/21.jpg)
Performance Evaluation (Parallel)
[Figure: speedup (axis 0 to 50) vs. # of PEs (axis 0 to 60); series: C (RNA), Schematic (RNA), C (CKY), Schematic (CKY)]
![Page 22: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/22.jpg)
Related Work
ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model
![Page 23: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/23.jpg)
Conclusion
We demonstrated the usefulness of fine-grain multithread languages
• Task pool-like execution with simple description
• Aggressive optimizations for synchronization
We showed the experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C
![Page 24: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/24.jpg)
Performance Evaluation (Other Applications 1/2)
[Figure: bar chart of normalized elapsed time (axis 0 to 4) for Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2; series: C, Schematic; one bar is labeled 14.7]
![Page 25: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/25.jpg)
Performance Evaluation (Other Applications 2/2)
[Figure: speedup (axis 0 to 50) vs. # of PEs (axis 0 to 60); series: Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2]
![Page 26: An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,](https://reader036.fdocuments.net/reader036/viewer/2022081401/56649ed15503460f94be0a96/html5/thumbnails/26.jpg)
Identifying Overheads
[Figure: bar chart of normalized elapsed time (axis 0 to 1000) for: normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., C]