Out$of$OrderParallelDiscreteEventSimula7onforTransac7onLevelModels*...

Out-‐of-‐Order Parallel Discrete Event Simula7on for Transac7on Level Models Weiwei Chen

Advisor: Prof. Rainer Dömer Center for Embedded Computer Systems

University of California, Irvine

Center for Embedded Computer Systems

© 2013 Weiwei Chen, Version 1.2 March, 2013.

v  Selec7on of Relevant Publica7ons •  W. Chen, R. Dömer, "ConcurrenC: A new approach towards effec3ve abstrac3on of C-‐based SLDLs”, IESS 2009. •  W. Chen, R. Dömer, " A Fast Heuris3c Scheduling Algorithm for Periodic ConcurrenC Models", ASP-‐DAC 2010. •  W. Chen, X. Han, R. Dömer, “ESL Design and Mul3-‐Core Valida3on using the System-‐on-‐Chip Environment”, HLDVT 2010 •  R. Dömer, W. Chen, X. Han, Andreas Gerstlauer, “Mul3-‐Core Parallel Simula3on of System-‐Level Descrip3on Languages”, ASP-‐DAC 2011. •  W. Chen, X. Han, R. Dömer, “Mul3-‐core Simula3on of Transac3on Level Models using the System-‐on-‐Chip Environment”,

IEEE Design & Test of Computers, May-‐June 2011.

Tradi7onal DE simula7on Synchronous PDES Out-‐of-‐order PDES

SimulaUon Time One global Ume tuple (t, δ) shared by every thread and event

Local Ume for each thread th as tuple (tth,δth). A total order of Ume is defined with the following relaUons: equal: (t1, δ1) = (t2, δ2), iff t1 = t2, δ1 = δ2 before: (t1,δ1) < (t2,δ2), iff t1 < t2, or t1 = t2, δ1 < δ2 aEer: (t1,δ1) > (t2,δ2), iff t1 > t2, or t1 = t2, δ1 > δ2

Event DescripUon Events are idenUfied by their ids, i.e, event (id). A Umestamp is added to idenUfy every event, i.e. event (id, t, δ).

SimulaUon Thread Sets READY, RUN, WAIT, WAITFOR, JOINING, COMPLETE

Threads are organized as subsets with the same Umestamp (tth, δth). Thread sets are the union of these subsets, i.e, READY = ∪READYt,δ, RUN = ∪RUNt,δ , WAIT = ∪WAITt,δ , WAITFOR = ∪WAITFORt,δ (δ = 0), where the subsets are ordered in increasing order of Ume (t, δ)

Threading Model User-‐level or OS kernel-‐level OS kernel-‐level

Run Time Scheduling

Event delivery in-‐order in delta-‐cycle loop Time advance in-‐order in outer loop

Event delivery out-‐of-‐order if no conflicts exist Time advance out-‐of-‐order if no conflicts exist

Only one thread is acUve at one Ume No parallelism No SMP u5liza5on

Threads at same simulaUon cycle run in parallel Limited parallelism Inefficient SMP u5liza5on

Threads at same simulaUon cycle or with no conflicts run in parallel More parallelism Efficient SMP u5liza5on

Compile Time Analysis

No synchronizaUon protecUon needed

Need synchroniza5on protec5on for shared resources, e.g. any user-‐defined and hierarchical channels, data structures in the scheduler

No conflict analysis needed StaUc conflict analysis derives Segment Graph (SG) from the Control Flow Graph (CFG), analyzes variable and event accesses, passes conflict tables to the scheduler. Compile 7me increases

SpecC Models

SystemC Models

SpecC SLDLs SLDLs

MoCs MoCs

Abstrac7on Descrip7ve Capability

Rela7onship between ConcurrenC and C-‐based SLDLs

SystemC

ConcurrenC

Source: www.apple.com Source: www.aa1car.com

v Embedded System Design •  Embedded systems are special purposed computer systems

with a wide applicaUon domain, e.g. automobiles, medical devices, communicaUon systems, etc.

•  TransacUon Level Modeling (TLM) models embedded systems at different abstracUon levels.

•  TLMs are usually described in System-‐Level DescripUon Languages (SLDLs), e.g. SystemC and SpecC.

•  TLMs are typically validated by Discrete Event (DE) simulaUon.

v Related Work: Accelerate TLM simula7on Modeling Techniques

•  TransacUon-‐level modeling (TLM) •  TLM temporal decoupling •  Source-‐level SimulaUon Stagelmann et al. [DAC’11] •  Host-‐Compiled SimulaUon Gerstlauer et al. [RSP’10]

Distributed Simula7on •  Chandy et al. [TSE’79] •  Huang et al. [SIES’08] •  Chen et al. [CECS’11]

Hardware-‐based Accelera7on •  Sirowy et al. [DAC’10] •  Nanjundappa et al. [ASPDAC’10] •  Sinha et al. [ASPDAC’12] •  Vinco et al. [DAC’12]

SMP Parallel Simula7on •  Fujimoto. [CACM’90] •  Chopard et al. [ICCS’06] •  Ezudheen et al. [PADS’09] •  Mello et al. [DATE’10] •  Schumacher et al. [CODES’11] •  Chen et al. [IEEED&T’11] •  Yun et al. [TCAD’12]

Source: P. Chou, UCI

–  Executes threads in the same simulaUon cycles (3me, delta) in parallel –  Needed protecUon of communicaUon and synchronizaUon can be automaUcally instrumented

start

READY == ∅ ?

∀th∈WAIT, if th’s event is noUfied; Move(th, WAIT, READY), Clear noUfied events;

READY == ∅ ?

Update the simulaUon Ume; move the earliest th∈WAITFOR to READY;

READY == ∅ ?

end

No

Yes

No

No

Yes

Yes

th = Pick(READY, RUN); Go(th);

RUN == ∅ ? |RUN| <= #CPUs && READY != ∅ ?

Yes

No sleep

sleep

No

Yes

delta-‐cycle

5med-‐cycle

Image courtesy of Intel

v Synchronous Parallel Discrete Event Simula7on (SPDES) System-level Models

B0 B1

B2 B3

• Structured model • Computation and Communication separated • Explicit parallelism • Synthesizable and analyzable

• Multi-core CPUs readily available • Symmetric multiprocessing (SMP) • Hyper-threading support • Many-core CPUs coming

Sta5c States in SPDES

READY

WAIT

RUN

WAITFOR

thread is issued

waits for event e

event e is no7fied

waits for 7me t

Simula7on 7me is

advanced to t

#core

4

3

1

3

2 1

4

10:0 10:1

10:2

20:1 20:0

20:2

30:0

0:0 T:Δ th4 th2 th3 th1

10:0 10:1

10:2

20:1 20:0

20:2

30:0

0:0 T:Δ th4 th2 th3 th1

•  R. Dömer, W. Chen, X. Han, “Parallel Discrete Event Simula3on of Transac3on Level Models”, ASP-‐DAC 2012. •  W. Chen, R. Dömer, “An Op3mizing Compiler for Out-‐of-‐Order Parallel ESL Simula3on Exploi3ng Instance Isola3on” ASP-‐DAC 2012. •  W. Chen, X. Han, R. Dömer, “Out-‐of-‐order Parallel Simula3on for ESL Design”, DATE 2012. •  W. Chen, X. Han, C. Chang, R. Dömer, “Elimina3ng Race Condi3ons in System-‐Level Models by using Parallel Simula3on Infrastructure”, HLDVT 2012. •  W. Chen, X. Han, C. Chang, R. Dömer, “Advances in Parallel Discrete Event Simula3on for Electronic System-‐Level Design”,

IEEE Design & Test of Computers, January-‐February 2013. •  W. Chen, R. Dömer, “Op3mized Out-‐of-‐Order Parallel Discrete Event Simula3on Using Predic3ons” DATE 2013.

∀th∈READY, if( RUN.size < numCPUs && NoConflict(th)), Go(th);

start

Simula5on States Update

∀th∈WAIT, if th’s event is noUfied @ (te, δe); Move(th, WAITtth,δth, READYte, δe+1)

∀th∈WAITFOR, Move(th, WAITFORtth,δth, READYtth, 0)

RUN == ∅ ?

end

∀th∈READY, if( RUN.size < numCPUs && NoConflict(th)), Go(th);

sleep; No

Yes

1: #include <stdio.h> 2: int array[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; 3: behavior B(in int begin, in int end, out int sum) 4: { int i; 5: void main(){ 6: int tmp; tmp = 0; i = begin; 7: waikor 1; // v3, segment 3 starts 8: while(i <= end){ 9: waikor 2; // v4, segment 4 starts 10: tmp += array[i]; i ++; 11: } 12: sum = tmp; } 13: }; 14: behavior Main() 15: {int sum1, sum2; 16: B b1(0, 4, sum1); B b2(5, 9, sum2); 17: int main(){ 18: par{ // v1, segment 1 starts 19: b1.main(); 20: b2.main(); 21: } // v2, segment 2 starts 22: prink("summa7on 1 is :%d \n", sum1); 23: prink("summa7on 2 is :%d \n", sum2); }};

v Out-‐of-‐order Parallel Discrete Event Simula7on (OoO PDES) –  Breaks simulaUon cycle barriers by localizing the Ume for each thread –  Aggressively issues threads to run in parallel even in different cycles –  Preserves the simulaUon correctness and Uming accuracy

waiHor 1 (Main.b1)

seg3

seg4

thread_b1 par

par end

seg0 thread_Main

seg2 thread_Main

seg1

waiHor 1 (Main.b2)

seg1

seg5

seg6

thread_b2

waiHor 2 (Main.b1) waiHor 2 (Main.b2)

seg 0 1 2 3 4 5 6

0 F F F F F F F

1 F T F T T T T

2 F F F T T T T

3 F T T T T F F

4 F T T T T F F

5 F T T F F T T

6 F T T F F T T

•  StaUc Conflict Analysis –  Compiler builds Segment Graph (SG) derived from Control Flow Graph (CFG) of applicaUons

–  Compiler builds Segment Conflict Tables for quick look-‐up at runUme

Example Segment Graph

•  OoO PDES OpUmizaUons –  Instance isola5on reduces false conflicts [ASPDAC’12] –  Scheduling predic5on for more parallelism [DATE’13]

•  StaUc code analysis applicaUons –  Conflicts detecUon for recoding –  Debugging when parallelizing applicaUons [HLDVT’12]

Segment Data Conflict Table

10:0 10:1

10:2

20:1 20:0

20:2

30:0

0:0 T:Δ th4 th2 th3 th1 #core

4

4

4

4

4 3

1

th4 th2 th3 th1

shortend sim run 7me

Dynamic States in OoO PDES

READYt,δ

WAITt,δ

RUNt,δ

READYt,δ+1

READYt+d1,0

… … …

WAITFORt+d2,0

READYt+d1,1

WAITt,δ+1

WAITt+d1,0

RUNt,δ+1

RUNt+d1,0

RUNt+d1,1

WAITFORt+d1,0

0

10

20

30

40

50

0

200

400

600

800

1000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

Out-‐of-‐Order Parallel Simula7on with Predic7ons (H.264 encoder)

sim run7me (sec) comp 7me(sec)

v Experiments and Results •  Parallel Fibonacci calculaUon with Uming informaUon •  Parallel JPEG image encoder with 3 color components encoded in parallel, a sequenUal Huffman encoding •  Parallel H.264 video decoder with 4 slice decoders and a sequenUal slice reader and synchronizer Ø  Host: 64-‐bit Fedora 12 Linux with 2 6-‐core CPUs (Intel(R) Xeon(R) X5650) at 2.67 GHz with 2 hyper-‐threads per

core (supports 24 threads in parallel)

0

0.5

1

1.5

2

2.5

3

spec arch sched net

Simula7on Run7me for the JPEG encoder model [sec]

Tradi7onal sequen7al Synchronous PDES OoO PDES

0

0.5

1

1.5

2

2.5

3

3.5

spec arch sched net

Compila7on 7me for the JPEG encoder model [sec]


0

50

100

150

200

spec arch sched net

Simula7on run7me for the H.264 decoder model [sec]


10

10.5

11

11.5

12

12.5

13

13.5

spec arch sched net

Compila7on 7me for the H.264 decoder model [sec]


0

2

4

6

8

10

12

14

0

5

10

15

20

25

30

35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Parallel Fibonacci calculation with timing information fibo_timed elapsed time (SPDES) [sec] fibo_timed elapsed time (OoO PDES) [sec] fibo_timed rel. speedup (SPDES) fibo_timed rel. speedup (OoO PDES)

Out$of$OrderParallelDiscreteEventSimula7onforTransac7onLevelModels*...

Documents

Transcript of Out$of$OrderParallelDiscreteEventSimula7onforTransac7onLevelModels*...

Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models*...

Documents

Transcript of Out$of$Order*Parallel*Discrete*Event*Simula7on*for*Transac7on*Level*Models*...

Out$of$OrderParallelDiscreteEventSimula7onforTransac7onLevelModels*...

Transcript of Out$of$OrderParallelDiscreteEventSimula7onforTransac7onLevelModels*...