A Process Splitting Transformation for Kahn Process Networks Sjoerd Meijer.


2Research

Contents

• Background
• Problem Definition and Project Goal
• Splitting
  – Producer Selection
  – Inter-process Communication
  – Consumer Selection
• Implementation
• Conclusion and Further Work

Background

• Parallelization is not new
• Forking a sequential application

Classic example, matrix-matrix multiplication:

– Master processor executes code up to the parallel loop
– Execute parallel iterations on other processors
– Synchronize at the end of the parallel loop

[Figure: four CPUs, each with its own cache, connected through a memory bus to main memory]

:
do i, n
  do j, n
    a(i,j) = …
    :
  end do
end do
:
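The fork-join pattern on this slide can be sketched in C++ with an OpenMP worksharing loop. This is only an illustration and not code from the talk: the function name `matmul_fork_join`, the row-major `std::vector<double>` layout and the use of OpenMP are assumptions. When the compiler is run without OpenMP support the pragma is skipped and the loop runs sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns C = A * B for n x n row-major matrices.
std::vector<double> matmul_fork_join(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     int n) {
    std::vector<double> c(static_cast<std::size_t>(n) * n, 0.0);
    // Fork: the iterations of the outer loop are distributed over threads.
    // Join: the implicit barrier at the end of the loop synchronizes them.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;  // each (i, j) is written by exactly one thread
        }
    }
    return c;
}
```

No iteration reads a value written by another iteration, so the loop needs no communication between the workers. The KPN case treated later differs exactly on that point.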

Background

Applications are specified as parallel tasks:

• Example JPEG decoder:

[Figure: process network Frontend → Idctrow → Idctcol → raster → Backend]

Problem Definition

Cakesim (eCos+CCP) - profile for JPEG-KPN:

[Figure: bar chart of the profile per process (frontend, idctrow, idctcol, raster, backend); axis ticks from 0 to 4,000,000]

# CPUs   # cycles
1        18,622
2        10,483
3        10,114
4        18,325
5        20,77

?

Problem Definition

Automatic procedure for process splitting in KPNs to take advantage of multiprocessor architectures.

Original process network:

[Figure: Frontend → Idctrow → Idctcol → raster → Backend]

Split-up network:

[Figure: the same network with a second Idctrow process added]

Splitting – The Concept

Required:
• Determine the computationally expensive process: profiling or pragmas + static support
• Partitioning of the Iteration Space (IS)
• N = number of times a process has to be split
• L = loop-nest level at which the splitting takes place

To do:
• Duplication of code and FIFOs
• Adding control for token production and consumption

Techniques used:

Data dependence analysis:
– Data flow analysis
– Array data flow analysis

Tree transformations:
– Adding/removing/duplicating tree statements

Compiler framework:
– GCC

Solution for KPNs

Four-step approach:

COMPUTATION:
1. Partitioning (computation)

COMMUNICATION:
2. Interprocess communication
3. Token production
4. Token consumption

[Figure: network P1 → P2 → P3, and the split-up network in which P2 is replaced by P21 and P22]

Partitioning of the original process computation over the resulting split-up processes
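One way to picture the partitioning step is a modulo distribution of the iterations at loop-nest level L over N split-up processes, in line with the `i%2` conditions used on the following slides. The sketch below is an illustration under assumptions: `owner_of` and `count_iterations_per_process` are hypothetical names, levels are counted from 0 at the outermost loop, and the rectangular two-level loop nest is made up for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical: index of the split-up process that executes an iteration,
// chosen by the loop counter at nest level `level` modulo `n_split`.
int owner_of(const std::vector<int>& iteration, int level, int n_split) {
    assert(level >= 0 && level < static_cast<int>(iteration.size()));
    assert(n_split > 0 && iteration[level] >= 0);
    return iteration[level] % n_split;
}

// Counts how many iterations of a rows x cols loop nest each process executes.
std::vector<int> count_iterations_per_process(int rows, int cols,
                                              int level, int n_split) {
    std::vector<int> counts(n_split, 0);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            ++counts[owner_of({i, j}, level, n_split)];
        }
    }
    return counts;
}
```

With 4 x 6 iterations, splitting the inner level in two gives 12 iterations per process, while splitting the outer level in three gives 12, 6 and 6. The choice of L therefore affects the load balance as well as the communication.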

Interprocess Communication

:
for(int i=1; i<10; i++)
  a[i] = a[i-1] + i; //s1
:

• Interprocess communication is given by the loop-carried dependency: a[i-1] at iteration i is produced at iteration i-1.

• If execution of stmt s1 is distributed over different processes, token a[i-1] needs to be communicated:

First split-up process:

:
for(int i=1; i<10; i++){
  if(i%2==0)
    a[i] = a[i-1] + i;
:

Second split-up process:

:
for(int i=1; i<10; i++){
  if(i%2==1)
    a[i] = a[i-1] + i;
:
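The effect of the loop-carried dependency can be made concrete with a small single-threaded simulation of the two split-up processes. This is a sketch under assumptions, not the generated code: `SplitResult` and `run_split_loop` are hypothetical names, the two processes are interleaved in one loop instead of running concurrently, and a[0] is assumed to be 0 and available without communication.

```cpp
#include <array>
#include <cassert>
#include <queue>

struct SplitResult {
    int last_value;   // a[9] after the loop
    int tokens_sent;  // values of a[i-1] that crossed between the processes
};

SplitResult run_split_loop() {
    // to_process[p] carries tokens towards split-up process p.
    std::array<std::queue<int>, 2> to_process;
    int tokens_sent = 0;
    int value = 0;
    for (int i = 1; i < 10; ++i) {
        const int p = i % 2;  // process that executes iteration i
        int previous = 0;     // a[0], assumed initial data
        if (i > 1) {
            // a[i-1] was produced by the other process: read it from the FIFO.
            assert(!to_process[p].empty());
            previous = to_process[p].front();
            to_process[p].pop();
        }
        value = previous + i;  // statement s1
        if (i + 1 < 10) {
            // The next iteration runs in the other process: send a[i] to it.
            to_process[1 - p].push(value);
            ++tokens_sent;
        }
    }
    return {value, tokens_sent};
}
```

In this sketch the result equals that of the sequential loop, but every iteration after the first waits for a token from the other process. For this loop the split adds communication without exposing any parallelism.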

Token Production & Consumption

Problems:
– PI. At the producer side: where to send the tokens to?
– PII. At the consumer side: from where to consume tokens?

Solutions PI:
1. Producer filters the tokens (static solution)
2. Producer sends all tokens to all split-up processes (runtime solution)

Solutions PII:
1. The consumer knows by itself when to switch (static solution)
2. Each producer sends a signal to the consumer when to switch reading data from a different FIFO (runtime solution)

[Figure: P1 → P2 becomes P1 → P2’ and P2’’ (where to send?); P2 → P3 becomes P2’ and P2’’ → P3 (from where to consume?)]

Token Production – runtime vs. static

[Figure: original network, P1 sends 100 tokens to P2]

Static solution: P1 sends 50 tokens to P2’ and 50 tokens to P2’’.

Runtime solution: P1 sends 100 tokens to P2’ and 100 tokens to P2’’.
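The token counts of the two production schemes can be reproduced with a short sketch. Everything named here is an assumption for illustration: `ProductionStats`, `produce_static` and `produce_runtime` are hypothetical, tokens are plain integers, and the tokens are distributed round-robin by their sequence number.

```cpp
#include <vector>

struct ProductionStats {
    std::vector<int> sent;               // tokens put on each FIFO
    std::vector<std::vector<int>> used;  // tokens each split-up process keeps
};

// Static solution: the producer filters, so each token goes to one FIFO only.
ProductionStats produce_static(int total, int n_split) {
    ProductionStats s{std::vector<int>(n_split, 0),
                      std::vector<std::vector<int>>(n_split)};
    for (int k = 0; k < total; ++k) {
        const int target = k % n_split;
        ++s.sent[target];
        s.used[target].push_back(k);
    }
    return s;
}

// Runtime solution: the producer sends every token to every split-up process;
// each receiver counts the tokens and keeps only its own share.
ProductionStats produce_runtime(int total, int n_split) {
    ProductionStats s{std::vector<int>(n_split, 0),
                      std::vector<std::vector<int>>(n_split)};
    for (int k = 0; k < total; ++k) {
        for (int p = 0; p < n_split; ++p) {
            ++s.sent[p];
            if (k % n_split == p) {
                s.used[p].push_back(k);
            }
        }
    }
    return s;
}
```

For 100 tokens and two split-up processes the static variant sends 50 tokens per FIFO and the runtime variant 100 per FIFO, while both processes end up using the same tokens in either case. The runtime variant needs no knowledge in the producer but transfers more tokens.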

Token Consumption – runtime vs. static

[Figure: original network, P2 sends 100 tokens to P3]

Static solution: P2’ and P2’’ each send 50 tokens to P3. The switch is known internally by the consumer.

Runtime solution: P2’ and P2’’ each send 50 tagged tokens to P3. The switch is communicated over the channels to the consumer.
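The two ways in which the consumer can select its input FIFO can be sketched as follows. This is an illustration under assumptions: `Token`, `consume_static` and `consume_runtime` are hypothetical names, there are exactly two input FIFOs, the static consumer is assumed to alternate per token, and the tag is a boolean carried by the last token before a switch.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Token {
    int value;
    bool switch_tag;  // set by the producer on the last token before a switch
};

// Static solution: the consumer derives the FIFO from its own token counter.
std::vector<int> consume_static(std::deque<int> f0, std::deque<int> f1,
                                int total) {
    std::vector<int> out;
    for (int k = 0; k < total; ++k) {
        std::deque<int>& src = (k % 2 == 0) ? f0 : f1;
        assert(!src.empty());
        out.push_back(src.front());
        src.pop_front();
    }
    return out;
}

// Runtime solution: the consumer keeps reading one FIFO until a tagged token
// tells it to switch to the other one.
std::vector<int> consume_runtime(std::deque<Token> f0, std::deque<Token> f1,
                                 int total) {
    std::vector<int> out;
    int state = 0;  // index of the FIFO currently being read
    while (static_cast<int>(out.size()) < total) {
        std::deque<Token>& src = (state == 0) ? f0 : f1;
        assert(!src.empty());
        const Token t = src.front();
        src.pop_front();
        out.push_back(t.value);
        if (t.switch_tag) {
            state = 1 - state;
        }
    }
    return out;
}
```

The runtime variant also handles bursts of unknown length, because the switching moment travels with the data instead of being computed by the consumer.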

Token Production & Consumption – static solution

• Establish the data dependencies over the processes

HOW?
• Data Dependence function (DD) and DD^-1:
  DD^-1 : Producer → Consumer
  DD : Consumer → Producer

• However, DD cannot always be determined at compile time

Token Production – static solution without DD^-1

Observation: the loop counters at the producer side equal the loop counters at the consumer side

Token Production – static solution without DD^-1

DD^-1(w1,w2,w3) = (w4,w5,w6)

P2(DD^-1(w1,w2,w3)) = w5
w5 = w2  =>  P2(DD^-1(w1,w2,w3)) % 2 = w2 % 2

Token Consumption – static solution without DD

Similar to the production of tokens.

Runtime solution:

Process P1

int w0=0;
int m0;
while(c1){
  :
  F00.put( t1 );
  F01.put( t1 );
  :
  while(c2){
    :
    F10.put( t2 );
    F11.put( t2 );
    :
    while(c3){
      :
      F20.put( t2 );
      F21.put( t2 );
      :
    }
  }
}

Process P2’

int w1=0;
while(c4){
  :
  F00.get();
  :
  F30.put(… );
  :
  while(c5){
    w1++;
    t=F10.get();
    if(w1%2==0){   // keep only the tokens of this split-up process
      token=t;
    }
    :
    while(c6){
      :
      F20.get();
      :
      if(!c6){     // last token before the consumer has to switch
        augment(&token);
      }
      F40.put(token);
      :
    }
  }
}

Process P2’’

int w1=0;
while(c4){
  :
  F01.get();
  :
  :
  while(c5){
    w1++;
    t=F11.get();
    if(w1%2==1){   // keep only the tokens of this split-up process
      token=t;
    }
    :
    :
    while(c6){
      :
      F21.get();
      :
      if(!c6){     // last token before the consumer has to switch
        augment(&token);
      }
      F41.put(token);
      :
    }
  }
}

Process P3

static int state=0;
while(c7){
  :
  F30.get();
  :
  while(c8){
    :
    while(c9){
      :
      while(true){
        if(state==0){          // currently reading from P2’
          token=F40.get();
          if(token.isAugmented()){
            state=1;
            break;
          }
        }
        if(state==1){          // currently reading from P2’’
          token=F41.get();
          if(token.isAugmented()){
            state=0;
            break;
          }
        }
      }
      :
    }
  }
}

Multiple split-up processes

Split-up into 3 processes

[Figure: network P1 → P2 → P3 → P4, and the network after splitting P2 into P2’, P2’’, P2’’’ and P3 into P3’, P3’’, P3’’’]

Copy-nodes

[Figure: three stages. The original network P1 → P2 → P3 → P4; the network after copy-nodes insertion; the network after the splitting transformation, in which P2 and P3 are replaced by their split-up copies]

Copy-nodes

• Pros:
  – Simple network structure
  – Apply four-step splitting approach

• Cons:
  – More processes => more communication => overhead (can be improved)

Implementation

• Used technique:
  – Runtime solution (general)
• Used framework:
  – GCC (GNU Compiler Collection)

• Advantages of GCC:
  – Availability of data dependence information
  – Supported by a large community
  – We are in contact with Sebastian Pop, maintainer and developer of various compiler phases, e.g. the data dependence analysis, control flow and induction variable.

Implementation

• Data dependence analysis (already present):
  – scalars
  – arrays

• Data Dependence Graph (DDG) present only on RTL level, not on tree SSA

• Two new passes:
  1. Create DDG
  2. Splitting

Implementation

1. Splitting pragma
2. Data dependence graph
3. Class definition reconstruction
4. Function cloning
5. Modulo condition insertion

Implementation

To do:
1. Copying of class definition
2. Copying of class member functions
3. Reconstruction of the network structure
  – FIFO
  – Network definition


Implementation

Final result:

• Data dependence information tells whether splitting is legal (no IPC)

• Semi-automatic transformation/case-study
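How dependence information can answer the legality question may be sketched with dependence distance vectors. This is a simplified illustration and not GCC's analysis: the `DistanceVector` representation, the function names, and the assumption that iterations are distributed modulo N at the split level are all hypothetical.

```cpp
#include <vector>

// Hypothetical representation: one distance vector per data dependence,
// with one entry per loop-nest level (level 0 is the outermost loop).
using DistanceVector = std::vector<int>;

// A dependence stays inside one split-up process when source and sink
// iterations have the same loop counter modulo n_split at the split level.
bool stays_local(const DistanceVector& d, int level, int n_split) {
    return d[level] % n_split == 0;
}

// No inter-process communication is needed if every dependence is local.
bool splitting_needs_no_ipc(const std::vector<DistanceVector>& deps,
                            int level, int n_split) {
    for (const DistanceVector& d : deps) {
        if (!stays_local(d, level, n_split)) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions the loop `a[i] = a[i-1] + i` has distance 1 at level 0, so splitting it in two at that level is rejected, whereas a dependence with distance 0 at the split level never crosses a process boundary.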

Results

# CPUs   org -cn   org +cn   row+col   row+col+raster   col+raster   raster
1        18.62     22.36     24.38     24.62            23.63        20.23
2        10.48     12.67     14.01     14.33            13.46        11.37
3        10.11     11.61     12.76     11.44            10.57        8.77
4        18.32     16.69     17.31     11.07            9.41         7.95
5        20.77     28.04     28.09     12.68            9.76         8.76

• org -cn: Original KPN
• org +cn: KPN with copy nodes
• row+col, row+col+raster, col+raster, raster: Processes split-up into two

Improvement of 21%
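The reported improvement appears to correspond to comparing the best entry of the original KPN column (org -cn) with the best entry of the raster column. That pairing is an assumption made here, not something the slide states; the values are copied from the table and the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Table values for 1 to 5 CPUs, copied from the results slide.
constexpr std::array<double, 5> org_no_copy = {18.62, 10.48, 10.11, 18.32, 20.77};
constexpr std::array<double, 5> split_raster = {20.23, 11.37, 8.77, 7.95, 8.76};

// Smallest (best) value of a column.
double best(const std::array<double, 5>& column) {
    return *std::min_element(column.begin(), column.end());
}

// Fraction by which `after` is smaller than `before`.
double relative_improvement(double before, double after) {
    return (before - after) / before;
}
```

With these numbers `relative_improvement(best(org_no_copy), best(split_raster))` evaluates (10.11 - 7.95) / 10.11, which is roughly 0.21.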

Future work: YAPI and CCP

• Difference in active and passive connectors.
• Active connectors in YAPI are modeled as a thread
• Passive connectors do not run in a separate thread
• More connectors in CCP:

[Figure: split-up network of P1, P2’, P2’’, P3’, P3’’ and P4, annotated with the connectors Fork, Mesh and Merge]

Future Work

• Connect GCC with SCOTTY:

[Figure: flowchart. Source code → GCC: is the transformation legal? No → stop. Yes → Scotty (+ data dependence, + network topology) → transformed source code]

• GCC branch
  – Main branch: may not accept the patch
  – GOMP branch targets parallelization


Conclusion

• Only split-up the most computationally expensive processes

• The transformation is profitable