A Process Splitting Transformation for Kahn Process Networks Sjoerd Meijer.
2Research
Contents
• Background
• Problem Definition and Project Goal
• Splitting
– Producer Selection
– Inter-process Communication
– Consumer Selection
• Implementation
• Conclusion and Further Work
Background
• Parallelization is not new
• Forking a sequential application
Classic example, matrix-matrix multiplication:
– Master processor executes code up to the parallel loop
– Parallel iterations execute on other processors
– Synchronize at the end of the parallel loop
[Figure: four CPUs, each with its own cache, connected through a memory bus to main memory]
  :
  do i, n
    do j, n
      a(i,j) = …
      :
    end do
  end do
  :
Background
Applications are specified as parallel tasks:
• Example JPEG decoder:
[Figure: JPEG decoder process network: Frontend → Idctrow → Idctcol → Raster → Backend]
Problem Definition
Cakesim (eCos+CCP) profile for JPEG-KPN:
[Figure: bar chart of the profile per process (frontend, idctrow, idctcol, raster, backend); vertical axis from 0 to 4,000,000]
# CPUs   # cycles
1        18,622
2        10,483
3        10,114
4        18,325
5        20,77
Problem Definition
Automatic procedure for process splitting in KPNs to take advantage of multiprocessor architectures.
[Figure: original process network (Frontend → Idctrow → Idctcol → Raster → Backend) next to the split-up network, in which the Idctrow process appears twice]
Splitting – The Concept
Required:
• Determine the computationally expensive process: profiling or pragmas + static support
• Partitioning of the Iteration Space (IS)
• N = number of times a process has to be split
• L = loop-nest level at which the splitting takes place
To do:
• Duplication of code and FIFOs
• Adding control for token production and consumption
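A minimal sketch of how the parameters N and L could drive the partitioning of the iteration space. It assumes a round-robin (modulo) assignment of iterations to copies, as the modulo conditions on the later slides suggest; the helper names `owner_of_iteration` and `partition_sizes` are hypothetical and not taken from the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: which of the N split-up copies executes the iteration
// with loop counters `iter` (outermost first) when splitting at loop-nest
// level L (0 = outermost)? Assumes round-robin (modulo) partitioning.
int owner_of_iteration(const std::vector<int>& iter, std::size_t L, int N) {
    assert(N > 0 && L < iter.size());
    return ((iter[L] % N) + N) % N;  // stays non-negative for negative counters
}

// Count how many iterations of a rectangular two-deep loop nest each copy gets.
std::vector<int> partition_sizes(int n_outer, int n_inner, std::size_t L, int N) {
    std::vector<int> sizes(N, 0);
    for (int i = 0; i < n_outer; ++i)
        for (int j = 0; j < n_inner; ++j)
            ++sizes[owner_of_iteration({i, j}, L, N)];
    return sizes;
}
```

For a 4 x 6 nest, `partition_sizes(4, 6, 0, 2)` splits on the outer counter into two copies, and `partition_sizes(4, 6, 1, 3)` splits on the inner counter into three; in both cases the work is divided evenly.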
Techniques used:
Data dependence analysis:
– Data flow analysis
– Array data flow analysis
Tree transformations:
– Adding/removing/duplicating tree statements
Compiler framework:
– GCC
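Solution for KPNs
Four-step approach:
COMPUTATION:
1. Partitioning (computation)
COMMUNICATION:
2. Interprocess communication
3. Token production
4. Token consumption
[Figure: network P1 → P2 → P3 before splitting; after splitting, P2 is replaced by the two copies P21 and P22 between P1 and P3]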
Interprocess Communication
  :
  for(int i=1; i<10; i++)
    a[i] = a[i-1] + i; //s1
  :
• Interprocess communication is given by the loop-carried dependency: a[i-1] at iteration i is produced at iteration i-1.
• If execution of stmt s1 is distributed over different processes, token a[i-1] needs to be communicated:
  :                                :
  for(int i=1; i<10; i++){         for(int i=1; i<10; i++){
    if(i%2==0)                       if(i%2==1)
      a[i] = a[i-1] + i;               a[i] = a[i-1] + i;
  :                                :
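The communication implied by the loop-carried dependency can be sketched as follows. This is an illustration only: the two copies are interleaved in a single thread, `std::queue` stands in for the FIFO channels, and the function names are hypothetical.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Sequential reference: the loop from the slide.
std::vector<int> run_sequential(int n) {
    std::vector<int> a(n, 0);
    for (int i = 1; i < n; ++i) a[i] = a[i - 1] + i;  // s1
    return a;
}

// Split version: iteration i runs on copy i % 2. The value a[i-1] was
// produced by the other copy, so it travels through a FIFO
// (fifo[k] carries the tokens destined for copy k).
std::vector<int> run_split(int n) {
    std::vector<int> result(n, 0);
    std::queue<int> fifo[2];
    fifo[1].push(0);  // a[0], the initial value needed by iteration 1
    for (int i = 1; i < n; ++i) {
        int self = i % 2, other = 1 - self;
        int prev = fifo[self].front();  // receive a[i-1]
        fifo[self].pop();
        int value = prev + i;           // s1 on this copy
        result[i] = value;
        fifo[other].push(value);        // send a[i] to the copy that runs i+1
    }
    return result;
}
```

Both versions compute the same array, but every iteration of the split version waits for a token from the other copy, so this particular loop gains nothing from splitting.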
Token Production & Consumption
Problems:
– PI. At the producer side: where to send the tokens?
– PII. At the consumer side: from where to consume tokens?
Solutions PI:
1. Producer filters the tokens (static solution)
2. Producer sends all tokens to all split-up processes (runtime solution)
Solutions PII:
1. The consumer knows by itself when to switch (static solution)
2. Each producer sends a signal to the consumer when to switch to reading data from a different FIFO (runtime solution)
[Figure: PI: the channel P1 → P2 becomes P1 → {P2', P2''}; PII: the channel P2 → P3 becomes {P2', P2''} → P3]
Token Production – runtime vs. static
Original: P1 sends 100 tokens to P2.
Static solution: P1 sends 50 tokens to P2' and 50 tokens to P2''.
Runtime solution: P1 sends 100 tokens to P2' and 100 tokens to P2''.
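The difference between the two production schemes can be sketched as below. The round-robin ownership in the static case is an assumption made for illustration, `std::queue` stands in for the FIFO channels, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

using Fifo = std::queue<int>;

// Static solution: the producer filters, so copy k receives only the tokens
// of the iterations it owns (round-robin ownership is assumed here).
void produce_static(const std::vector<int>& tokens, std::vector<Fifo>& out) {
    assert(!out.empty());
    for (std::size_t i = 0; i < tokens.size(); ++i)
        out[i % out.size()].push(tokens[i]);
}

// Runtime solution: the producer sends every token to every copy; each copy
// decides for itself which tokens it actually uses.
void produce_runtime(const std::vector<int>& tokens, std::vector<Fifo>& out) {
    for (int t : tokens)
        for (Fifo& f : out) f.push(t);
}

// Helper for the usage note: FIFO fill levels after producing n tokens.
std::vector<std::size_t> fifo_sizes(int n, int copies, bool is_static) {
    std::vector<int> tokens(n);
    for (int i = 0; i < n; ++i) tokens[i] = i;
    std::vector<Fifo> out(copies);
    if (is_static) produce_static(tokens, out);
    else produce_runtime(tokens, out);
    std::vector<std::size_t> sizes;
    for (const Fifo& f : out) sizes.push_back(f.size());
    return sizes;
}
```

With 100 tokens and two copies, `fifo_sizes(100, 2, true)` gives 50 tokens per channel and `fifo_sizes(100, 2, false)` gives 100 per channel, matching the token counts on the slide.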
Token Consumption – runtime vs. static
Original: P2 sends 100 tokens to P3.
Static solution: P2' and P2'' each send 50 tokens to P3. The switch is known internally by the consumer.
Runtime solution: P2' and P2'' each send 50 tagged tokens to P3. The switch is communicated over the channels to the consumer.
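A sketch of the two consumption schemes, under stated assumptions: the `Token` struct and its `augmented` flag are hypothetical stand-ins for the tagged tokens, the static consumer is assumed to alternate after a fixed burst length, and both loops stop on an empty FIFO where a real KPN read would block.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct Token {
    int value;
    bool augmented;  // set by the producer on the last token before a switch
};

// Runtime solution: read from one FIFO until a tagged token arrives,
// then switch to the other FIFO.
std::vector<int> consume_runtime(std::queue<Token> f0, std::queue<Token> f1) {
    std::vector<int> order;
    std::queue<Token>* fifo[2] = {&f0, &f1};
    int state = 0;
    while (!fifo[state]->empty()) {
        Token t = fifo[state]->front();
        fifo[state]->pop();
        order.push_back(t.value);
        if (t.augmented) state = 1 - state;  // the switch travels with the data
    }
    return order;
}

// Static solution: the consumer knows the switching pattern by itself;
// here it alternates after `burst` tokens (an assumption).
std::vector<int> consume_static(std::queue<int> f0, std::queue<int> f1, int burst) {
    assert(burst > 0);
    std::vector<int> order;
    std::queue<int>* fifo[2] = {&f0, &f1};
    int state = 0, read_in_burst = 0;
    while (!fifo[state]->empty()) {
        order.push_back(fifo[state]->front());
        fifo[state]->pop();
        if (++read_in_burst == burst) { read_in_burst = 0; state = 1 - state; }
    }
    return order;
}
```

In both variants the consumer restores the original token order; the runtime variant pays for that with a tag on the tokens, while the static variant needs the pattern at compile time.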
Token Production & Consumption – static solution
• Establish the data dependencies over the processes
HOW?
• Data Dependence function (DD) and its inverse DD^-1:
DD^-1 : Producer → Consumer
DD : Consumer → Producer
• However, DD cannot always be determined at compile time
Token Production – static solution without DD^-1
Observation: the loop counters on the producer side equal the loop counters on the consumer side
Token Production – static solution without DD^-1
DD^-1(w1,w2,w3) = (w4,w5,w6)
P2(DD^-1(w1,w2,w3)) = w5 and w5 = w2  =>  P2(DD^-1(w1,w2,w3)) % 2 = w2 % 2
(Here P2(·) reads as the projection onto the second component.)
Runtime solution:

Process P1
  int w0=0;
  int m0;
  while(c1){
    :
    F00.put( t1 );
    F01.put( t1 );
    :
    while(c2){
      :
      F10.put( t2 );
      F11.put( t2 );
      :
      while(c3){
        :
        F20.put( t2 );
        F21.put( t2 );
        :
      }
    }
  }

Process P2'
  int w1=0;
  while(c4){
    :
    F00.get();
    :
    F30.put(… );
    :
    while(c5){
      w1++;
      t=F10.get();
      if(w1%2==0){
        token=t;
      }
      :
      while(c6){
        :
        F20.get();
        :
        if(!c6){
          augment(&token);
        }
        F40.put(token);
        :
      }
    }
  }

Process P2''
  int w1=0;
  while(c4){
    :
    F01.get();
    :
    while(c5){
      w1++;
      t=F11.get();
      if(w1%2==1){
        token=t;
      }
      :
      while(c6){
        :
        F21.get();
        :
        if(!c6){
          augment(&token);
        }
        F41.put(token);
        :
      }
    }
  }

Process P3
  static int state=0;
  while(c7){
    :
    F30.get();
    :
    while(c8){
      :
      while(c9){
        :
        while(true){
          if(state==0){
            token=F40.get();
            if(token.isAugmented()){
              state=1;
              break;
            }
          }
          if(state==1){
            token=F41.get();
            if(token.isAugmented()){
              state=0;
              break;
            }
          }
        }
        :
      }
    }
  }
Multiple split-up processes
[Figure: network P1 → P2 → P3 → P4; after a split-up into 3 processes, P2 becomes P2', P2'', P2''' and P3 becomes P3', P3'', P3''' between P1 and P4]
Copy-nodes
[Figure: three stages: the original network P1 → P2 → P3 → P4; the network after copy-node insertion; the network after the splitting transformation, with P2 split into P2', P2'' and P3 split into P3', P3'']
Copy-nodes
• Pros:
– Simple network structure
– Apply four-step splitting approach
• Cons:
– More processes => more communication (can be improved) => overhead
Implementation
• Used technique:
– Runtime solution (general)
• Used framework:
– GCC (GNU Compiler Collection)
• Advantages GCC:
– Availability of data dependence information
– Supported by a large community
– We are in contact with Sebastian Pop, maintainer and developer of various compiler phases, e.g. the data dependence analysis, control flow and induction variable.
Implementation
• Data dependence analysis (already present):
– scalars
– arrays
• Data Dependence Graph (DDG) present only at the RTL level, not at tree SSA
• Two new passes:
1. Create DDG
2. Splitting
Implementation
1. Splitting pragma
2. Data dependence graph
3. Class definition reconstruction
4. Function cloning
5. Modulo condition insertion
Implementation
To do:
1. Copying of class definition
2. Copying of class member functions
3. Reconstruction of the network structure
– FIFO
– Network definition
Implementation
Final result:
• Data dependence information tells whether splitting is legal (no IPC)
• Semi-automatic transformation/case-study
Results
# CPUs   org -cn   org +cn   row+col   row+col+raster   col+raster   raster
1        18.62     22.36     24.38     24.62            23.63        20.23
2        10.48     12.67     14.01     14.33            13.46        11.37
3        10.11     11.61     12.76     11.44            10.57        8.77
4        18.32     16.69     17.31     11.07            9.41         7.95
5        20.77     28.04     28.09     12.68            9.76         8.76
Legend: org -cn = original KPN; org +cn = KPN with copy nodes; remaining columns = the named processes split up into two.
Improvement of 21%
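The slide does not say which two numbers the 21% compares. One reading that is consistent with the table is the best original time without copy nodes (10.11 at 3 CPUs) against the best time with only raster split (7.95 at 4 CPUs). The sketch below checks that arithmetic; the pairing is an assumption and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Relative improvement of the best split-up time over the best original time.
double relative_improvement(const std::vector<double>& original,
                            const std::vector<double>& split) {
    assert(!original.empty() && !split.empty());
    double best_orig = *std::min_element(original.begin(), original.end());
    double best_split = *std::min_element(split.begin(), split.end());
    return (best_orig - best_split) / best_orig;
}

// Columns "org -cn" and "raster" from the results table, for 1 to 5 CPUs.
double table_improvement() {
    std::vector<double> org_no_cn = {18.62, 10.48, 10.11, 18.32, 20.77};
    std::vector<double> raster    = {20.23, 11.37, 8.77, 7.95, 8.76};
    return relative_improvement(org_no_cn, raster);  // (10.11 - 7.95) / 10.11
}
```

Under this reading the result is roughly 0.21, which agrees with the improvement stated on the slide.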
Future work: YAPI and CCP
• Difference in active and passive connectors.
• Active connectors in YAPI are modeled as a thread
• Passive connectors do not run in a separate thread
• More connectors in CCP:
[Figure: split-up network P1 → {P2', P2''} → {P3', P3''} → P4, annotated with the CCP connector types Fork, Mesh and Merge]
Future Work
• Connect GCC with SCOTTY:
• GCC branch
– Main branch: may not accept the patch
– GOMP branch targets parallelization
[Figure: flow chart: source code → GCC → "Is transformation legal?"; No → stop; Yes → Scotty (+ data dependence, + network topology) → transformed source code]
Conclusion
• Only split up the most computationally expensive processes
• The transformation is profitable: the results show an improvement of 21%