Out-of-Order Parallel Discrete Event Simulation for Transaction Level Models
Weiwei Chen
Advisor: Prof. Rainer Dömer
Center for Embedded Computer Systems
University of California, Irvine
© 2013 Weiwei Chen, Version 1.2, March 2013.
Selection of Relevant Publications
• W. Chen, R. Dömer, "ConcurrenC: A new approach towards effective abstraction of C-based SLDLs", IESS 2009.
• W. Chen, R. Dömer, "A Fast Heuristic Scheduling Algorithm for Periodic ConcurrenC Models", ASP-DAC 2010.
• W. Chen, X. Han, R. Dömer, "ESL Design and Multi-Core Validation using the System-on-Chip Environment", HLDVT 2010.
• R. Dömer, W. Chen, X. Han, A. Gerstlauer, "Multi-Core Parallel Simulation of System-Level Description Languages", ASP-DAC 2011.
• W. Chen, X. Han, R. Dömer, "Multi-core Simulation of Transaction Level Models using the System-on-Chip Environment", IEEE Design & Test of Computers, May-June 2011.
Traditional DE simulation vs. Synchronous PDES vs. Out-of-order PDES

Simulation Time
• Traditional DE simulation, Synchronous PDES: One global time tuple (t, δ) shared by every thread and event.
• Out-of-order PDES: Local time for each thread th as tuple (t_th, δ_th). A total order of time is defined with the following relations:
  equal: (t1, δ1) = (t2, δ2) iff t1 = t2 and δ1 = δ2
  before: (t1, δ1) < (t2, δ2) iff t1 < t2, or t1 = t2 and δ1 < δ2
  after: (t1, δ1) > (t2, δ2) iff t1 > t2, or t1 = t2 and δ1 > δ2

Event Description
• Traditional DE simulation, Synchronous PDES: Events are identified by their ids, i.e. event (id).
• Out-of-order PDES: A timestamp is added to identify every event, i.e. event (id, t, δ).

Simulation Thread Sets
• Traditional DE simulation, Synchronous PDES: READY, RUN, WAIT, WAITFOR, JOINING, COMPLETE.
• Out-of-order PDES: Threads are organized as subsets with the same timestamp (t_th, δ_th). Thread sets are the union of these subsets, i.e. READY = ∪READY_{t,δ}, RUN = ∪RUN_{t,δ}, WAIT = ∪WAIT_{t,δ}, WAITFOR = ∪WAITFOR_{t,δ} (δ = 0), where the subsets are ordered in increasing order of time (t, δ).

Threading Model
• Traditional DE simulation: User-level or OS kernel-level.
• Synchronous PDES, Out-of-order PDES: OS kernel-level.

Run Time Scheduling
• Traditional DE simulation, Synchronous PDES: Event delivery in-order in the delta-cycle loop. Time advance in-order in the outer loop.
• Out-of-order PDES: Event delivery out-of-order if no conflicts exist. Time advance out-of-order if no conflicts exist.
• Traditional DE simulation: Only one thread is active at one time. No parallelism. No SMP utilization.
• Synchronous PDES: Threads at the same simulation cycle run in parallel. Limited parallelism. Inefficient SMP utilization.
• Out-of-order PDES: Threads at the same simulation cycle or with no conflicts run in parallel. More parallelism. Efficient SMP utilization.

Compile Time Analysis
• Traditional DE simulation: No synchronization protection needed.
• Synchronous PDES, Out-of-order PDES: Need synchronization protection for shared resources, e.g. any user-defined and hierarchical channels, and data structures in the scheduler.
• Traditional DE simulation, Synchronous PDES: No conflict analysis needed.
• Out-of-order PDES: Static conflict analysis derives the Segment Graph (SG) from the Control Flow Graph (CFG), analyzes variable and event accesses, and passes conflict tables to the scheduler. Compile time increases.
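The total order on (t, δ) timestamps defined above is a lexicographic order, so it can be illustrated with plain tuple comparison. The following is a minimal Python sketch, not the simulator's implementation; the names `SimTime` and `Event` and their fields are assumptions made for this example.

```python
from typing import NamedTuple


class SimTime(NamedTuple):
    """Simulation timestamp (t, delta). Named tuples compare field by field,
    which matches the equal/before/after relations of the total order."""
    t: int
    delta: int


class Event(NamedTuple):
    """Out-of-order PDES style event: an id plus a timestamp (hypothetical type)."""
    id: str
    time: SimTime


a, b, c = SimTime(10, 2), SimTime(20, 0), SimTime(10, 3)
e = Event("e1", a)

print(a == SimTime(10, 2))   # equal: same t and same delta
print(a < c)                 # before: same t, smaller delta
print(b > c)                 # after: larger t decides, delta is ignored
print(sorted([b, c, a]))     # subsets ordered in increasing order of time
print(e.time < b)            # an event timestamp compares the same way
```

Keeping the per-timestamp subsets sorted by this key is one simple way to obtain the "increasing order of time" mentioned for the out-of-order thread sets.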
[Figure: Relationship between ConcurrenC and C-based SLDLs. Labels: SpecC models, SystemC models, SpecC, SystemC, ConcurrenC, SLDLs, MoCs. Axes: Abstraction, Descriptive Capability.]
Image sources: www.apple.com, www.aa1car.com
Embedded System Design
• Embedded systems are special-purpose computer systems with a wide application domain, e.g. automobiles, medical devices, and communication systems.
• Transaction Level Modeling (TLM) models embedded systems at different abstraction levels.
• TLMs are usually described in System-Level Description Languages (SLDLs), e.g. SystemC and SpecC.
• TLMs are typically validated by Discrete Event (DE) simulation.
Related Work: Accelerate TLM simulation
Modeling Techniques
• Transaction-level modeling (TLM)
• TLM temporal decoupling
• Source-level simulation: Stattelmann et al. [DAC'11]
• Host-compiled simulation: Gerstlauer et al. [RSP'10]
Distributed Simulation
• Chandy et al. [TSE'79]
• Huang et al. [SIES'08]
• Chen et al. [CECS'11]
Hardware-based Acceleration
• Sirowy et al. [DAC'10]
• Nanjundappa et al. [ASPDAC'10]
• Sinha et al. [ASPDAC'12]
• Vinco et al. [DAC'12]
SMP Parallel Simulation
• Fujimoto [CACM'90]
• Chopard et al. [ICCS'06]
• Ezudheen et al. [PADS'09]
• Mello et al. [DATE'10]
• Schumacher et al. [CODES'11]
• Chen et al. [IEEE D&T'11]
• Yun et al. [TCAD'12]
Source: P. Chou, UCI
– Executes threads in the same simulation cycle (time, delta) in parallel
– The needed protection of communication and synchronization can be automatically instrumented
[Flowchart: scheduler with delta-cycle and timed-cycle loops; decision edges are labeled Yes/No]
• start
• READY == ∅ ?
• th = Pick(READY, RUN); Go(th);
• |RUN| <= #CPUs && READY != ∅ ?
• RUN == ∅ ?
• sleep
• delta-cycle: ∀th∈WAIT, if th's event is notified, Move(th, WAIT, READY); clear notified events; READY == ∅ ?
• timed-cycle: update the simulation time; move the earliest th∈WAITFOR to READY; READY == ∅ ?
• end
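The cycle structure of the scheduler above (run the READY threads, then a delta-cycle, then a timed-cycle, then end) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual simulator: threads are modeled as generators, the request tuples `('notify', ev)`, `('wait', ev)` and `('waitfor', delay)` are invented for this example, and threads of one cycle run one after another instead of being issued to separate cores.

```python
def run_cycles(threads):
    """Run generator-based threads cycle by cycle and return the trace.

    Each thread yields ('notify', ev), ('wait', ev) or ('waitfor', delay).
    """
    t, delta = 0, 0
    ready = list(threads)   # READY: names of threads runnable in (t, delta)
    wait = {}               # WAIT: thread name -> event it waits for
    waitfor = {}            # WAITFOR: thread name -> wake-up time
    notified = set()        # events notified in the current cycle
    trace = []

    while True:
        # Run all READY threads of this cycle. A synchronous parallel
        # scheduler would issue them to different cores instead.
        for name in ready:
            trace.append((t, delta, name))
            for kind, arg in threads[name]:
                if kind == "notify":
                    notified.add(arg)
                elif kind == "wait":
                    wait[name] = arg
                    break
                elif kind == "waitfor":
                    waitfor[name] = t + arg
                    break

        # delta-cycle: wake threads whose event was notified, clear events.
        ready = [n for n, ev in wait.items() if ev in notified]
        notified.clear()
        if ready:
            for n in ready:
                del wait[n]
            delta += 1
            continue

        # timed-cycle: advance time to the earliest WAITFOR thread(s).
        if not waitfor:
            return trace    # READY is still empty: the simulation ends
        t, delta = min(waitfor.values()), 0
        ready = [n for n, wake in waitfor.items() if wake == t]
        for n in ready:
            del waitfor[n]


def producer():
    yield ("waitfor", 10)
    yield ("notify", "e")


def consumer():
    yield ("wait", "e")
    yield ("waitfor", 5)


trace = run_cycles({"producer": producer(), "consumer": consumer()})
for entry in trace:
    print(entry)            # (time, delta, thread name)
```

In the trace, the consumer is woken one delta step after the producer's notification at time 10, and the timed-cycle then advances the clock to 15.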
Synchronous Parallel Discrete Event Simulation (SPDES)
[Figure: System-level Models, with blocks B0, B1, B2, B3]
• Structured model
• Computation and communication separated
• Explicit parallelism
• Synthesizable and analyzable
[Image courtesy of Intel]
• Multi-core CPUs readily available
• Symmetric multiprocessing (SMP)
• Hyper-threading support
• Many-core CPUs coming
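[Figure: Static States in SPDES]
States: READY, RUN, WAIT, WAITFOR
Transitions:
• READY → RUN: thread is issued
• RUN → WAIT: waits for event e
• WAIT → READY: event e is notified
• RUN → WAITFOR: waits for time t
• WAITFOR → READY: simulation time is advanced to t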
[Figure: schedule of threads th1 to th4 over simulation cycles T:Δ (0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0), with the number of cores in use (#core) per cycle]
• R. Dömer, W. Chen, X. Han, "Parallel Discrete Event Simulation of Transaction Level Models", ASP-DAC 2012.
• W. Chen, R. Dömer, "An Optimizing Compiler for Out-of-Order Parallel ESL Simulation Exploiting Instance Isolation", ASP-DAC 2012.
• W. Chen, X. Han, R. Dömer, "Out-of-order Parallel Simulation for ESL Design", DATE 2012.
• W. Chen, X. Han, C. Chang, R. Dömer, "Eliminating Race Conditions in System-Level Models by using Parallel Simulation Infrastructure", HLDVT 2012.
• W. Chen, X. Han, C. Chang, R. Dömer, "Advances in Parallel Discrete Event Simulation for Electronic System-Level Design", IEEE Design & Test of Computers, January-February 2013.
• W. Chen, R. Dömer, "Optimized Out-of-Order Parallel Discrete Event Simulation Using Predictions", DATE 2013.
[Flowchart: scheduler with timestamped thread sets and conflict checks; decision edges are labeled Yes/No]
• start
• ∀th∈READY, if (RUN.size < numCPUs && NoConflict(th)), Go(th);
• sleep
• Simulation States Update:
  ∀th∈WAIT, if th's event is notified @ (te, δe): Move(th, WAIT_{tth,δth}, READY_{te,δe+1})
  ∀th∈WAITFOR: Move(th, WAITFOR_{tth,δth}, READY_{tth,0})
• RUN == ∅ ?
• ∀th∈READY, if (RUN.size < numCPUs && NoConflict(th)), Go(th);
• end
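The two scheduler steps above, the simulation states update and the conflict-aware issue of READY threads, can be sketched as follows. This is a hedged Python illustration rather than the scheduler's real code: the `Thread` class, the list-based thread sets, the `notified` mapping and the `no_conflict` callback are all assumptions made for this example, and a thread in WAITFOR is assumed to already carry its wake-up time as local time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    t: int                       # local time t_th
    delta: int                   # local delta count of the thread
    event: Optional[str] = None  # id of the event the thread waits for, if any


def update_states(wait, waitfor, ready, notified):
    """Move woken threads to READY. `notified` maps event id -> (te, de)."""
    for th in list(wait):
        if th.event in notified:
            te, de = notified[th.event]
            th.t, th.delta, th.event = te, de + 1, None  # READY at (te, de+1)
            wait.remove(th)
            ready.append(th)
    for th in list(waitfor):
        th.delta = 0                                     # READY at (t_th, 0)
        waitfor.remove(th)
        ready.append(th)
    ready.sort(key=lambda th: (th.t, th.delta))          # earliest time first


def issue(ready, run, num_cpus, no_conflict):
    """Issue READY threads while cores are free and no conflict is reported."""
    for th in list(ready):
        if len(run) < num_cpus and no_conflict(th):
            ready.remove(th)
            run.append(th)


a = Thread("a", 10, 0, event="e")   # waits for event e
b = Thread("b", 30, 3)              # waits for time; t already holds the wake-up time
wait, waitfor, ready, run = [a], [b], [], []

update_states(wait, waitfor, ready, notified={"e": (20, 1)})
# Hypothetical conflict check: pretend thread b conflicts with a running thread.
issue(ready, run, num_cpus=2, no_conflict=lambda th: th.name != "b")

print([(th.name, th.t, th.delta) for th in run])
print([(th.name, th.t, th.delta) for th in ready])
```

Thread a is issued at its new local time even though thread b, at a different local time, stays in READY; this is the sense in which threads in different cycles can run side by side.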
#include <stdio.h>
int array[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

behavior B(in int begin, in int end, out int sum)
{
  int i;
  void main() {
    int tmp;
    tmp = 0; i = begin;
    waitfor 1;            // v3, segment 3 starts
    while (i <= end) {
      waitfor 2;          // v4, segment 4 starts
      tmp += array[i]; i++;
    }
    sum = tmp;
  }
};

behavior Main()
{
  int sum1, sum2;
  B b1(0, 4, sum1); B b2(5, 9, sum2);
  int main() {
    par {                 // v1, segment 1 starts
      b1.main();
      b2.main();
    }                     // v2, segment 2 starts
    printf("summation 1 is :%d \n", sum1);
    printf("summation 2 is :%d \n", sum2);
  }
};
Out-of-order Parallel Discrete Event Simulation (OoO PDES)
– Breaks simulation cycle barriers by localizing the time for each thread
– Aggressively issues threads to run in parallel even in different cycles
– Preserves the simulation correctness and timing accuracy
• Static Conflict Analysis
– Compiler builds the Segment Graph (SG) derived from the Control Flow Graph (CFG) of applications
– Compiler builds Segment Conflict Tables for quick look-up at runtime

[Figure: Example Segment Graph. thread_Main: seg0, par, seg1, par end, seg2. thread_b1: seg1, waitfor 1 (Main.b1), seg3, waitfor 2 (Main.b1), seg4. thread_b2: seg1, waitfor 1 (Main.b2), seg5, waitfor 2 (Main.b2), seg6.]

• OoO PDES Optimizations
– Instance isolation reduces false conflicts [ASPDAC'12]
– Scheduling prediction for more parallelism [DATE'13]

• Static code analysis applications
– Conflict detection for recoding
– Debugging when parallelizing applications [HLDVT'12]

Segment Data Conflict Table
seg  0  1  2  3  4  5  6
0    F  F  F  F  F  F  F
1    F  T  F  T  T  T  T
2    F  F  F  T  T  T  T
3    F  T  T  T  T  F  F
4    F  T  T  T  T  F  F
5    F  T  T  F  F  T  T
6    F  T  T  F  F  T  T
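A conflict table like the one above reduces the runtime check to an array look-up. The sketch below copies the table values from this poster into Python; the function name `no_conflict` and the idea of passing a list of segments of other active threads are simplifying assumptions, not the actual NoConflict() implementation.

```python
T, F = True, False

# Segment Data Conflict Table from the poster: CONFLICT[i][j] is True when
# segment i and segment j have a data conflict.
CONFLICT = [
    [F, F, F, F, F, F, F],   # seg 0
    [F, T, F, T, T, T, T],   # seg 1
    [F, F, F, T, T, T, T],   # seg 2
    [F, T, T, T, T, F, F],   # seg 3
    [F, T, T, T, T, F, F],   # seg 4
    [F, T, T, F, F, T, T],   # seg 5
    [F, T, T, F, F, T, T],   # seg 6
]


def no_conflict(seg, other_segs):
    """True if segment `seg` conflicts with none of the segments in `other_segs`."""
    return not any(CONFLICT[seg][other] for other in other_segs)


print(no_conflict(3, [5, 6]))    # seg3 against seg5 and seg6
print(no_conflict(3, [4]))       # seg3 against seg4
print(no_conflict(0, range(7)))  # seg0 against every segment
```

According to the table, segments 3 and 4 do not conflict with segments 5 and 6, so a thread in one of the first pair could be issued while a thread in one of the second pair is active.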
[Figure: schedule of threads th1 to th4 over simulation cycles T:Δ (0:0, 10:0, 10:1, 10:2, 20:0, 20:1, 20:2, 30:0), with the number of cores in use (#core). Annotation: shortened simulation run time.]
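[Figure: Dynamic States in OoO PDES]
States are kept per timestamp:
• (t, δ): READY_{t,δ}, RUN_{t,δ}, WAIT_{t,δ}
• (t, δ+1): READY_{t,δ+1}, RUN_{t,δ+1}, WAIT_{t,δ+1}
• (t+d1, 0): READY_{t+d1,0}, RUN_{t+d1,0}, WAIT_{t+d1,0}, WAITFOR_{t+d1,0}
• (t+d1, 1): READY_{t+d1,1}, RUN_{t+d1,1}
• (t+d2, 0): WAITFOR_{t+d2,0}
• …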
[Figure: Out-of-Order Parallel Simulation with Predictions (H.264 encoder). Series: sim runtime (sec), comp time (sec). x-axis: 1 to 34; y-axes: 0 to 50 and 0 to 1000.]
Experiments and Results
• Parallel Fibonacci calculation with timing information
• Parallel JPEG image encoder with 3 color components encoded in parallel and a sequential Huffman encoding
• Parallel H.264 video decoder with 4 slice decoders and a sequential slice reader and synchronizer
Host: 64-bit Fedora 12 Linux with two 6-core CPUs (Intel(R) Xeon(R) X5650) at 2.67 GHz with 2 hyper-threads per core (supports 24 threads in parallel)
[Figure: Simulation run time for the JPEG encoder model [sec]. Models: spec, arch, sched, net. Series: Traditional sequential, Synchronous PDES, OoO PDES.]
[Figure: Compilation time for the JPEG encoder model [sec]. Models: spec, arch, sched, net. Series: Traditional sequential, Synchronous PDES, OoO PDES.]
[Figure: Simulation run time for the H.264 decoder model [sec]. Models: spec, arch, sched, net. Series: Traditional sequential, Synchronous PDES, OoO PDES.]
[Figure: Compilation time for the H.264 decoder model [sec]. Models: spec, arch, sched, net. Series: Traditional sequential, Synchronous PDES, OoO PDES.]
[Figure: Parallel Fibonacci calculation with timing information (fibo_timed). Series: elapsed time (SPDES) [sec], elapsed time (OoO PDES) [sec], rel. speedup (SPDES), rel. speedup (OoO PDES). x-axis: 1 to 32.]