Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.
-
Upload
austen-andrews -
Category
Documents
-
view
221 -
download
0
Transcript of Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.
1
Pipelined and Parallel Computing
Partition for
Hongtao DuAICIP Research
Nov 3, 2005
2
Motivation
• Emergence of pipelined and parallel computing in multiprocessor systems.
• VLSI technologies with extensive parallel and pipelining computation capabilities have triggered the algorithm implementations directly on FPGA and ASIC.
• Performance impacts:– Number and type of computing resources– Interconnection and communication mechanisms– Partition among resources
3
Partitioning
• Definition: divide processes/data among computing resources such that critical processes / larger data sets are implemented on faster resources, while the less critical processes / smaller data sets implemented on slower and cheaper resources.
• Goal: optimizing the overall performance corresponding to resource constraints such as speed, power, and area.
• Spatial partitioning and temporal partitioning
4
Partition Scheme
5
Dependency
• Chain structure – Computation are divided into several stages, each of which
perform one operation of the whole procedure. – Partition task: keep the processing loads of all stages
roughly equal.– Serial computing and pipelined computing
• Tree structure– Independent branches are sent to multiple resources in
homogenous or heterogeneous computing environments. Each resource performs operations at the same time.
– Parallel computing and distributed computing
6
Combinations
• Chain-structured pipelined programs over chain-connected systems
• Multiple chain-structured parallel and pipelined programs over single-host multiple-satellite systems
• Multiple arbitrarily structured serial programs over single-host multiple-satellite system
• Single-tree structured parallel programs over single-host multiple identical satellites systems
7
Chain structure
• Partition principles– Challenge:
– Load balance– Shorten the execution time of the bottleneck stage
– Minimize overall computing time– Computing– Communication
8
Modeling – Layered Graph
• Definition– Node <i, j>: sub-process
– <i, j> connects to <j+1, *>
– Number of node: m: processn:resource
– Weight– Node: computing time– Edge: communication time
mji 1
nm2
9
Minimum Bottleneck Path
• Pre-definitions:– Label L(i) for each node i to record the minimum, over
all paths ending at i, of the maximum edge (bottleneck) on each path at any stage.
– Label W(e) as the weight for the edge e connecting node a (above) and b (below).
• For each node b, replace L(b) by
• Complexity
)]}(),(max[),(min{ aLeWbL
)()( 3nmOmNO
10
Dynamic Programming
• Pre-definition– : time required to run process i on resource j– : amount of data to be transmitted from i to (i+1)– : transmission speed between j and (j+1)
– The total time on j : (k to i are assigned to resource j)
– : time consumed by bottleneck resource given the first i are assigned to the first j resources.
– : pointer to the first process on the resource, given the first i are assigned to the first j resources.
–
• Objective: minimize
i
kl j
ilj
jki v
ctT
ijt
ic
jv
ijf
jip
thj
mnf
11
• Algorithm1. Initialization
where
2. First resourceLet and for
3. Recursion (Bellman equation)
where k is the smallest index holding the above equality
i
kl j
ilj
jki v
ctT
nmikj nj ,,2,1
111 ii Tf 11 ip 1,,2,1 nmi
)},{max(min 1,1,,
jkijk
ijkij Tff
jnmjji ,,1, nj ,,3,2
kp ji
12
Continue …
4. Assignment– Assign processes to resource n– Set and decrease n by 1– Iterate until all processes are assigned.
• Complexity
– Step1: – Step2: – Step3: – Step4:
mpp nm
nm ,,1,
1 nmpm
)( 2nmO
)(mO
)( 2nmO
)( 2nmO
)(nO
13
Modeling - Doubly Weighted Graph
• Definition:
– Each edge e:
– : sum weight– : bottleneck weight
– Path (P): the processing flow
END ,
)(),( ee )(e)(e
14
Sum-Bottleneck (SB) Path
• Pre-definition:– S(P): sum weight of path P
– B(P): the bottleneck weight
• Algorithm: search for the optimal path between two nodes such that the sum-bottleneck (SB) weight is minimum.
: sum-bottleneck weight of path POptimal SB path has weight 10.
)]}(),(min{max[ PBPS
)](),(max[ PBPS
)()( iePS
)](max[)( iePB
)(S
0)( B
15
Optimal SB Path Algorithm
• Algorithm: bottleneck weight 1. Remove all edges with 2. Delete all remaining weights. DWG
has been transformed into an ordinary weighted graph
3. If there is a path between s and t, then return the shortest path P between s and t.
• Using Dijkstra’s algorithm to find the shortest path P.
• Complexity:
TB
TBe )(
)log( 2 enO
16
Tree structure
• Current process depends only on their parent process.
• Independence between children processes permits the burden distribution from single resource to multi-resources.
17
Modeling – Directed Dual Graph
Direction: from lower ordered node to higher
18
: execution time on host : execution time on slave (all slaves are similar) : inter-processor communication (communication between parent to grandparent)
ih
is
ijc
ijtreesub
ki cs
ijki ch
19
• Objective: Finding the optimal assignment from A to H.
• Algorithm: Optimal SB path
• Complexity analysis– Multi-graph: multiple edge connecting the same pair– Given m nodes, there are m edges
)log( 2 mmO
20
Driving force
• Data-driven– How to divide data sets into different sizes for
multiple computing resources – How to coordinate data flows along different
directions such that brings appropriate data to the suitable processors at the right time.
• Function-driven– How to perform different functions of one task
on different computing resources at the same time.
21
Resource constraints
• Multi-processor– Homogenous system– Heterogeneous system
• Hardware/software (HW/SW) co-processing– Process scheduling
• VLSI– the communication time is ignorable– Capacity
– RAM size for image processing
– Computation complexity affects both design size and delay. Reducing computation complexity is necessary.
– Decrease image size– Reduce RAM size, therefore matching FPGA capacity constraint– Reduce clock cycle numbers of sub-processes, and increase overall
processing frequency
22
Analysis Tools
• DWG
• DAG (Directed Acyclic Graph)
• CDFG (Control and Data Flow Graph)
• Petri Net PN := (P, T, I, O, M)P - places; T – transitions; I - input arc connections between places and transitions; O - output arc connections between transitions and places; M – the number of tokens.
23
To be continued ……
Thank you!