Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms
Michela [email protected]
DEIS Università di Bologna
Digital Convergence – Mobile Example
Broadcasting, telematics, imaging, computing, communication, entertainment, storage: one device, multiple functions, at the center of a ubiquitous media network. The smart mobile device is the next driver for the semiconductor industry.
SoC: Enabler for Digital Convergence
From today to the future: > 100X performance, low power, growing complexity, all integrated in SoCs.
Design as optimization
Design space: the set of “all” possible design choices.
Constraints: solutions that we are not willing to accept.
Cost function: a property we are interested in (execution time, power, reliability…).
Embedded system design – MOTIVATION & CONTEXT
System design with MPSoCs: exploit application and platform parallelism to achieve real-time performance.
Given a platform description and an application abstraction:
compute an allocation & schedule;
verify the results & perform changes.
Design flow
[Figure: a five-task application graph is mapped onto a platform with processors P1, P2 and a RAM, producing a schedule of the tasks over time t.]
Crucial role of the allocation & scheduling algorithm
MPSoC platform – PROBLEM DESCRIPTION
Identical processing elements (PEs)
Local storage devices
Remote on-chip memory
Shared bus
MPSoC platform
Resources: PEs, local memory devices, shared bus, PE frequencies (DVS).
Constraints: local device capacity, bus bandwidth (an additive resource), architecture-dependent constraints.
Application Task Graph – PROBLEM DESCRIPTION
Nodes are tasks/processes; arcs are data communications.
Each task:
reads data for each ingoing arc (RD),
performs some computation (EXEC),
writes data for each outgoing arc (WR).
[Figure: a six-task graph; each task has program data and communication buffers allocated either locally or remotely (LOC/REM) on its PE, and an execution frequency (FREQ).]
Application Task Graph
Memory allocation: remote memory is slower than the local ones.
Execution frequency.
Durations depend on both; different phases (RD, EXEC, WR) have different bus requirements.
[Figure: the same six-task graph, with the RD/EXEC/WR phases of each task annotated.]
PROBLEM VARIANTS
We focused on problem variants with different objective functions (O.F.) and graph features (G.F.):
Objective functions: bus traffic, energy consumption, makespan.
Graph features: pipelined, generic, generic with conditional branches.
Each combination was considered both with DVS and without DVS.
PROBLEM VARIANTS – Objective function
Bus traffic: tasks produce traffic when they access the bus; it depends completely on the memory allocation (allocation dependent).
Makespan: depends completely on the computed schedule (schedule dependent).
Energy (DVS): the higher the frequency, the higher the energy consumption; frequency switching has a time & energy cost of its own.
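The energy objective can be made concrete with a toy model. The quadratic dependence on voltage, all constants, and the function name below are illustrative assumptions, not the formulas used in this work:

```python
# Toy DVS energy model: dynamic energy ~ C * V^2 * f * t per segment,
# plus a fixed cost for every frequency switch. All constants are made up.
def schedule_energy(segments, switch_cost=1.0, cap=1e-9):
    """segments: (freq_Hz, volt, seconds) tuples in execution order."""
    energy = sum(cap * v * v * f * t for f, v, t in segments)
    # a switch penalty is paid between consecutive segments at different f
    switches = sum(1 for a, b in zip(segments, segments[1:]) if a[0] != b[0])
    return energy + switch_cost * switches

# Running fast then slow: execution energy plus one switching penalty.
print(schedule_energy([(1e9, 1.2, 0.01), (5e8, 1.0, 0.02)]))
```

The model captures the trade-off on the slide: lowering the frequency lowers execution energy, but every switch adds a fixed cost.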
PROBLEM VARIANTS – Graph structure
Pipelined (1 → 2 → 3 → 4): typical of stream-processing applications.
Generic: an arbitrary task graph.
Generic with conditional branches: arcs are labelled with conditions (a/!a, b/!b) and branch probabilities (e.g. 0.3/0.7, 0.5/0.5). The problem becomes stochastic, and the objective function is a stochastic O.F. (an expected value).
Application Development Flow
Characterization phase: a simulator profiles the application (CTG characterization), producing application profiles.
Optimization phase: an optimizer computes the allocation and the scheduling from those profiles.
The result is an optimal SW application implementation, which is executed on the platform with application development support.
When & Why Offline Optimization?
Plenty of design-time knowledge: applications are pre-characterized at design time, with dynamic transitions between different pre-characterized scenarios.
Aggressive exploitation of system resources: reduces overdesign (lowers cost) and gives strong performance guarantees.
Applicable to many embedded applications.
Questions
Complete or incomplete solver? Can I solve the instances with a complete solver? Look at the structure of the problem and the average instance dimension.
If yes, which technique should be used? If not, what is the quality of the heuristic solution proposed?
Optimization in system design
Complete solvers find the optimal solution and prove its optimality. The system design community uses Integer Programming techniques for every optimization problem, regardless of the structure of the problem itself; yet scheduling is poorly handled by IP.
Incomplete solvers lack an estimated optimality gap: they decompose the problem and solve each subproblem sequentially, or use local search/metaheuristic algorithms, which require a lot of tuning.
Optimization techniques
We will consider two techniques: Constraint Programming and Integer Programming.
A problem has two aspects: feasibility and optimality. Merging the two techniques, one can obtain better results.
Constraint Programming
A fairly recent, declarative programming paradigm. It inherits from logic programming, operations research, software engineering, and AI constraint solving.
Constraint Programming
Problem model: variable domains and constraints.
Problem solving: constraint propagation and search.
Example 1: X::[1..10], Y::[5..15], X > Y
Arc-consistency: for each value v of X, there should be a value in the domain of Y consistent with v.
After propagation: X::[6..10], Y::[5..9]
Example 2: X::[1..10], Y::[5..15], X = Y
After propagation: X::[5..10], Y::[5..10]
Example 3: X::[1..10], Y::[5..15], X ≠ Y
No propagation: every value of each domain still has a support in the other.
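Example 1's pruning can be sketched in a few lines; `propagate_gt` and the interval representation are illustrative, not a real CP solver API:

```python
# Minimal sketch of arc-consistency propagation for X > Y over integer
# interval domains, mirroring Example 1 above.
def propagate_gt(x, y):
    """Prune interval domains (lo, hi) so that X > Y becomes arc-consistent."""
    x_lo, x_hi = x
    y_lo, y_hi = y
    # X must exceed the smallest value Y can take: X >= y_lo + 1
    x_lo = max(x_lo, y_lo + 1)
    # Y must stay below the largest value X can take: Y <= x_hi - 1
    y_hi = min(y_hi, x_hi - 1)
    if x_lo > x_hi or y_lo > y_hi:
        raise ValueError("inconsistent: a domain became empty")
    return (x_lo, x_hi), (y_lo, y_hi)

x, y = propagate_gt((1, 10), (5, 15))
print(x, y)  # (6, 10) (5, 9), as in Example 1
```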
Mathematical Constraints
Every variable is involved in many constraints: each change in the domain of a variable can affect many constraints. From an agent perspective, each constraint reacts to the domain events of its variables, as in the following example.
Constraint Interaction
X::[1..5], Y::[1..5], Z::[1..5]
X = Y + 1
Y = Z + 1
X = Z - 1

First propagation of X = Y + 1 leads to
X::[2..5] Y::[1..4] Z::[1..5]
and X = Y + 1 is suspended.

Second propagation of Y = Z + 1 leads to
X::[2..5] Y::[2..4] Z::[1..3]
and Y = Z + 1 is suspended. The domain of Y has changed, so X = Y + 1 is woken:
X::[3..5] Y::[2..4] Z::[1..3]
and X = Y + 1 is suspended again.

Third propagation of X = Z - 1 leads to
FAIL: X::[] Y::[2..4] Z::[1..3]
The order in which constraints are considered does not affect the result, BUT it can influence the efficiency.
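The fixed-point computation above can be simulated directly (the function names and the interval representation are again illustrative, not a solver API):

```python
# Sketch of propagation to a fixed point for the three interacting
# constraints X = Y + 1, Y = Z + 1, X = Z - 1 over interval domains.
def eq_offset(a, b, c):
    """Propagate A = B + c: tighten both interval domains."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    a_lo, a_hi = max(a_lo, b_lo + c), min(a_hi, b_hi + c)
    b_lo, b_hi = max(b_lo, a_lo - c), min(b_hi, a_hi - c)
    return (a_lo, a_hi), (b_lo, b_hi)

def propagate(X, Y, Z):
    """Run the three propagators until a fixed point or an empty domain."""
    while True:
        old = (X, Y, Z)
        X, Y = eq_offset(X, Y, 1)    # X = Y + 1
        Y, Z = eq_offset(Y, Z, 1)    # Y = Z + 1
        X, Z = eq_offset(X, Z, -1)   # X = Z - 1
        if any(lo > hi for lo, hi in (X, Y, Z)):
            return "FAIL"            # inconsistent store, as on the slide
        if old == (X, Y, Z):
            return X, Y, Z           # fixed point reached

print(propagate((1, 5), (1, 5), (1, 5)))  # FAIL
```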
Global Constraints
cumulative([S1,..,Sn], [D1,..,Dn], [R1,..,Rn], L)
S1,...,Sn activity starting times (domain variables); D1,...,Dn durations (domain variables); R1,...,Rn resource consumptions (domain variables); L resource capacity.
Given the interval [min, max] where min = min_i {S_i} and max = max_i {S_i + D_i} - 1, the cumulative constraint holds iff
max_{t in [min,max]} Σ_{j | S_j ≤ t ≤ S_j + D_j - 1} R_j ≤ L
Global Constraints
Example: cumulative([1,2,4], [4,2,3], [1,2,2], 3)
[Figure: the resource/time profile of the three activities over times 1..7; the summed consumption never exceeds the capacity L = 3.]
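The meaning of the constraint can be checked with a direct feasibility test over fixed start times (a sketch of the constraint's semantics, not of a propagation algorithm):

```python
# Feasibility check for cumulative: at every time point, the summed
# resource usage of the running activities must not exceed capacity L.
def cumulative_holds(S, D, R, L):
    horizon = range(min(S), max(s + d for s, d in zip(S, D)))
    for t in horizon:
        # activity j runs at time t iff S[j] <= t < S[j] + D[j]
        usage = sum(r for s, d, r in zip(S, D, R) if s <= t < s + d)
        if usage > L:
            return False
    return True

print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 3))  # True
print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 2))  # False
```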
Propagation
One of the propagation algorithms used in resource constraints is based on obligatory parts: the interval between an activity's latest start (S_max) and its earliest completion (S_min + D) is occupied in every possible schedule, so the resource usage of obligatory parts can be propagated against the capacity.
[Figure: an activity's time window, with the obligatory part lying between S_max and S_min + D.]
Propagation
Another propagation algorithm is based on edge finding [Baptiste, Le Pape, Nuijten, IJCAI95].
Consider a unary resource R and three activities S1, S2, S3 that should be scheduled on R, each with its own time window (S1 within [0, 17], S2 and S3 within tighter windows).
We can deduce that the earliest start time of S1 is 8: S1 must be executed after both S2 and S3. This is global reasoning: if S1 were executed before the two, there would be no space left for executing both S2 and S3 within their windows, so S1's window shrinks from [0, 17] to [8, 17].
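A simplified version of the edge-finding rule can be sketched as follows; it tests a single candidate set Ω rather than running the full algorithm of Baptiste et al., and the windows in the example are illustrative rather than the slide's exact numbers:

```python
# Simplified edge-finding on a unary resource: if task i started before
# the set Omega, the window [est(Omega+{i}), lct(Omega)] could not fit
# all durations, so i must run after Omega and its earliest start is
# pushed to est(Omega) + total duration of Omega (a valid, if weak, bound).
def edge_finding_push(task, omega):
    """task and omega entries are (est, lct, duration). Return task's new est."""
    est_all = min(t[0] for t in omega + [task])
    lct_omega = max(t[1] for t in omega)
    p_all = sum(t[2] for t in omega) + task[2]
    if est_all + p_all > lct_omega:          # i cannot end before Omega ends
        est_omega = min(t[0] for t in omega)
        p_omega = sum(t[2] for t in omega)
        return max(task[0], est_omega + p_omega)
    return task[0]                           # no deduction possible

# S1 must follow S2 and S3: its earliest start is pushed from 0 to 10.
print(edge_finding_push((0, 17, 6), [(1, 11, 5), (4, 13, 4)]))  # 10
```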
CP solving process
• The solution process interleaves propagation and search.
• Constraints propagate as much as possible.
• When no more propagation can be performed, either the process fails because one domain is empty, or search is performed.
• The search heuristic chooses which variable to select and which value to assign to it.
• Optimization problems are solved via a sequence of feasibility problems with constraints on the objective function variable.
Pros and Cons
Declarative programming: the user states the constraints; the solver takes care of propagation and search.
Strong on the feasibility side. Constraints are symbolic and mathematical: expressivity. Adding a constraint helps the solution process: flexibility.
Weak optimality pruning when the link between the problem's decision variables and the objective function is loose. No use of relaxations.
Integer Programming
• Standard form of a combinatorial optimization problem (IP):
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ≥ 0, j = 1..n
x_j integer (this may make the problem NP-complete)
• An inequality y ≥ 0 is recast as y - s = 0 with a nonnegative slack variable s.
• Maximization is expressed by negating the objective function.
0-1 Integer Programming
• Many combinatorial optimization problems can be expressed in terms of 0-1 variables (IP):
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ∈ {0,1}
(this may make the problem NP-complete)
Linear Relaxation
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ≥ 0, j = 1..n
(the integrality constraint "x_j integer" is removed)
The linear relaxation is solvable in POLYNOMIAL TIME. The SIMPLEX ALGORITHM is the technique of choice, even though it is exponential in the worst case.
Linear Relaxation
Geometric properties:
• The set of constraints defines a polytope.
• The optimal solution is located on one of its vertices.
The simplex algorithm starts from one vertex and moves to an adjacent one with a better value of the objective function, until the vertex with the optimal solution is reached.
IP solving process
• The optimal LP solution is in general fractional: it violates the integrality constraint, but provides a bound on the solution of the overall problem.
• The solution process is branch and bound, interleaving relaxation and search.
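Branch and bound with a relaxation bound can be illustrated on a toy 0-1 knapsack, where the LP relaxation happens to be solvable greedily (the instance numbers are made up):

```python
# Toy branch-and-bound for a 0-1 knapsack (maximization), showing how the
# LP relaxation bounds the search. For knapsack, the relaxation is solved
# greedily by value/weight ratio, splitting at most one item fractionally.
def frac_bound(items, cap):
    """Optimal value of the LP relaxation (fractional knapsack)."""
    bound = 0.0
    for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        take = min(w, cap)
        bound += v * take / w
        cap -= take
        if cap == 0:
            break
    return bound

def solve(items, cap):
    best = 0
    def bb(idx, cap, value):
        nonlocal best
        best = max(best, value)
        # prune: the relaxation bound cannot beat the incumbent
        if idx == len(items) or value + frac_bound(items[idx:], cap) <= best:
            return
        v, w = items[idx]
        if w <= cap:
            bb(idx + 1, cap - w, value + v)  # branch: x_idx = 1
        bb(idx + 1, cap, value)              # branch: x_idx = 0
    bb(0, cap, 0)
    return best

print(solve([(60, 10), (100, 20), (120, 30)], 50))  # 220
```

Note how the fractional bound (240 at the root) is what allows whole subtrees to be discarded once an incumbent of equal or better value is known.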
Pros and Cons
Declarative programming: the user states the constraints; the solver takes care of relaxation and search.
Strong on the optimality side: the structure of many real problems has been deeply studied.
Only linear constraints can be used. If sophisticated techniques are used, flexibility is lost. No pruning on feasibility (only some preprocessing).
Resource-Efficient Application Mapping for MPSoCs
Given a platform and multimedia applications:
1. achieve a specified throughput;
2. minimize the usage of shared resources.
Allocation and scheduling
• Given:
• a hardware platform with processors, (local and remote) storage devices, and a communication channel;
• a pre-characterized task graph representing a functional abstraction of the application we should run.
• Find:
• an allocation and a scheduling of tasks to resources respecting
• real-time constraints (task deadlines),
• precedences among tasks,
• the capacity of all resources,
• such that the communication cost is minimized.
Allocation and scheduling
The platform is a multi-processor system with N nodes. Each node includes a processor and a scratchpad memory. The bus is a shared communication channel. In addition we have a remote memory of unlimited capacity (a realistic assumption for our application, but easily generalizable).
The task graph has a pipeline workload: real-time video graphic processing of the pixels of a digital image. It specifies task dependencies (arcs between tasks) and the computation, communication, and storage requirements of the graph.
MPSoC platform – PROBLEM DESCRIPTION
Resources: PEs (unary resources), local memory devices (limited capacity), shared bus (limited bandwidth), remote memory (assumed infinite).
Constraints: local device capacity, bus bandwidth (an additive resource), architecture-dependent constraints.
Problem structure
As a whole it is a scheduling problem with alternative resources: a very tough problem. It smoothly decomposes into allocation and scheduling.
Allocation is better handled with IP techniques (not with CP, due to the complex objective function). Scheduling is better handled with CP techniques (not with IP, since we would have to model every possible starting time of each task with a 0/1 variable).
INTERACTION REGULATED VIA CUTTING PLANES
Logic-Based Benders Decomposition
Decomposes the problem into 2 sub-problems:
Allocation → INTEGER PROGRAMMING. Objective function: communication cost. Handles the memory constraints.
Scheduling → CONSTRAINT PROGRAMMING, run on each valid allocation. Secondary objective function: makespan. Handles the timing constraints.
If no feasible schedule exists for an allocation, a no-good (a linear constraint) is added to the master and the loop iterates.
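The overall loop can be sketched schematically; `solve_master` and `schedule` stand in for the real IP and CP solvers, and the toy instance is invented:

```python
# Schematic logic-based Benders loop: an IP master proposes a
# minimum-cost allocation, a CP subproblem checks schedulability,
# and infeasible allocations are forbidden with a no-good cut.
def benders(solve_master, schedule):
    cuts = []                       # no-goods: forbidden allocations
    while True:
        alloc = solve_master(cuts)  # best allocation not yet cut (IP)
        if alloc is None:
            return None             # master infeasible: no solution at all
        sched = schedule(alloc)     # try to build a feasible schedule (CP)
        if sched is not None:
            return alloc, sched     # first schedulable allocation is optimal
        cuts.append(alloc)          # cut this allocation and iterate

# Toy instance: allocations in increasing communication-cost order;
# only "B" admits a feasible schedule.
def toy_master(cuts):
    for a in ["A", "B", "C"]:
        if a not in cuts:
            return a
    return None

def toy_schedule(alloc):
    return "schedule" if alloc == "B" else None

print(benders(toy_master, toy_schedule))  # ('B', 'schedule')
```

The first schedulable allocation is optimal here because the objective (communication cost) depends only on the allocation, and the master enumerates allocations in cost order.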
Master Problem model
Assignment of tasks and memory slots (master problem):
Tij = 1 if task i executes on processor j, 0 otherwise;
Yij = 1 if task i allocates its program data on processor j's memory, 0 otherwise;
Zij = 1 if task i allocates its internal state on processor j's memory, 0 otherwise;
Xij = 1 if task i executes on processor j and task i+1 does not, 0 otherwise.
Each process must be allocated to exactly one processor: Σ_j Tij = 1 for all i.
Link between the X and T variables: Xij = |Tij - Ti+1,j| for all i and j (can be linearized).
A task's memories can only be allocated where the task is: Tij = 0 implies Yij = 0 and Zij = 0.
Objective function (communication cost):
min Σ_i Σ_j [ mem_i (Tij - Yij) + state_i (Tij - Zij) + data_i Xij / 2 ]
Improvement of the model
With the proposed model, the allocation problem solver tends to pack all tasks on a single processor, and all required memory on its local memory, so as to have a ZERO communication cost: a TRIVIAL SOLUTION.
To improve the model we add a relaxation of the subproblem to the master problem model: for each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that they cannot all run on the same processor:
if Σ_{i∈S} WCET_i > RT, then Σ_{i∈S} Tij ≤ |S| - 1 for every processor j.
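Generating these cuts for a pipeline can be sketched as follows (the WCETs and the RT value are illustrative):

```python
# Sketch of the master-problem relaxation: any window of consecutive
# pipeline tasks whose summed worst-case durations exceed the real-time
# budget RT cannot be packed on one processor, yielding the cut
#   sum_{i in S} T_ij <= |S| - 1   for every processor j.
def relaxation_cuts(wcet, rt):
    """Return the minimal windows [a, b] with sum of WCETs > rt."""
    cuts = []
    for a in range(len(wcet)):
        total = 0
        for b in range(a, len(wcet)):
            total += wcet[b]
            if total > rt:
                cuts.append((a, b))  # tasks a..b cannot share a processor
                break                # any longer window is implied
    return cuts

print(relaxation_cuts([4, 3, 5, 2], 8))  # [(0, 2), (1, 3)]
```

Only minimal windows are kept, since a cut over a superset of tasks is dominated by the cut over the smaller window.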
Sub-Problem model
Task scheduling with static resource assignment (subproblem): the allocation is fixed, so we only have to decide when tasks start.
Activity starting times: Start_i :: [0..Deadline_i]
Precedence constraints: Start_i + Dur_i ≤ Start_j for each arc (i, j)
Real-time constraint: every activity running on a processor must complete within the period, Start_i + Dur_i ≤ RT
Cumulative constraints on resources:
processors are unary resources: cumulative([Start], [Dur], [1], 1)
memories are additive resources: cumulative([Start], [Dur], [MR], C)
What about the bus?
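With the allocation fixed, the precedence and unary-resource constraints can be illustrated by a simple list-scheduling construction (a greedy sketch assuming a topological task order, not CP search):

```python
# List scheduling under a fixed allocation: serialize tasks on each
# processor (unary resource) and honour Start_i + Dur_i <= Start_j
# for every precedence arc (i, j).
def list_schedule(dur, arcs, proc):
    """dur[i]: duration; arcs: (i, j) precedences; proc[i]: processor id.
    Tasks are assumed to be indexed in topological order."""
    start = [0] * len(dur)
    free = {}                           # earliest free time per processor
    for i in range(len(dur)):
        est = free.get(proc[i], 0)      # the processor is a unary resource
        for a, b in arcs:               # precedence constraints into i
            if b == i:
                est = max(est, start[a] + dur[a])
        start[i] = est
        free[proc[i]] = est + dur[i]
    return start

# Task 0 feeds tasks 1 and 2; tasks 0 and 2 share processor 0.
print(list_schedule([2, 3, 1], [(0, 1), (0, 2)], [0, 1, 0]))  # [0, 2, 2]
```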
Bus model
[Figure: bus bandwidth (bit/sec) over time for task0 and task1; task0 accesses its input data and reads its state at BW = MaxBW/NoProc, executes, then writes its state.]
Additive bus model: each task is granted a fixed fraction of the maximum bus bandwidth (MaxBW/NoProc). The model does not hold under heavy bus congestion, so bus traffic has to be minimized.
Results
Algorithm search time: the combined approach dominates, and its higher complexity shows up only for simple system configurations.
Energy-Efficient Application Mapping for MPSoCs
Given a platform and multimedia applications:
1. achieve a specified throughput;
2. minimize power consumption.
Logic-Based Benders Decomposition
Decomposes the problem into 2 sub-problems:
Allocation & assignment (& frequency setting) → INTEGER PROGRAMMING. Objective function: communication cost & energy consumption, minimizing the energy consumed during execution. Handles the memory constraints.
Scheduling → CONSTRAINT PROGRAMMING, run on each valid allocation. Objective function: e.g. minimizing the energy consumed by frequency switching. Handles the timing constraints.
No-goods and cutting planes are returned to the master.
Allocation problem model
Xtfp = 1 if task t executes on processor p at frequency f;
Wijfp = 1 if tasks i and j run on different cores, and task i on core p writes data for j at frequency f;
Rijfp = 1 if tasks i and j run on different cores, and task j on core p reads data from i at frequency f.
Each task can execute on exactly one processor at one frequency:
Σ_{p=1..P} Σ_{f=1..M} Xtfp = 1 for all t
Communication between tasks executes only once per graph execution, and one write corresponds to one read:
Σ_{p=1..P} Σ_{f=1..M} Wijfp = Σ_{p=1..P} Σ_{f=1..M} Rijfp = 1 for every arc (i, j) between tasks on different cores
The objective function minimizes the energy consumption associated with task execution and communication:
min OF = En_Comp + En_Read + En_Write
Scheduling problem model
Each task has INPUT, EXEC and OUTPUT phases. The duration of task i is now fixed, since its mode (processor and frequency) is fixed.
The objective function minimizes the energy consumption associated with frequency switching.
• Processors are modelled as unary resources.
• The bus is modelled as an additive resource.
The precedence constraint between task i and task j depends on the allocation:
tasks running on the same processor at the same frequency: Start_i + duration_i ≤ Start_j
tasks running on the same processor at different frequencies: Start_i + duration_i + T_switch ≤ Start_j
tasks running on different processors: Start_i + duration_i + Write_i + add_ij ≤ Start_j (writing and reading the communication buffer must fit in between)
Computational efficiency
Search time for an increasing number of tasks:
• A similar plot holds for an increasing number of processors.
• Standalone CP and IP proved not comparable, even on a simpler problem.
• We varied the RT constraint: a tight deadline leaves few feasible solutions; a very loose deadline makes the solution trivial; in between, search times stay within one order of magnitude.
Experiments ran on a P4 at 2 GHz with 512 MB RAM, using professional solving tools (ILOG).
Conditional Task Graphs
With conditional task graphs the problem becomes stochastic, since the outcomes of the conditions labelling the arcs are known only at execution time; we only know their probability distribution.
We minimize the expected value of the objective function: minimizing communication cost is easier, minimizing makespan much more complicated. Promising results.
Other problems
What if the task durations are not known, and only the worst and best cases are available? Scheduling the worst case alone is not enough, because of scheduling anomalies.
Change of platform: we are facing the CELL BE processor architecture, modelling synchronous data flow applications and communication-intensive applications.
Optimization & Development
The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. Programmers must be aware of the simplifying assumptions taken in the optimization tools.
Challenge: the Abstraction Gap
[Figure: the flow goes from platform modelling, through optimization analysis, to an optimal solution and a starting implementation, and finally to platform execution; the abstraction gap separates the optimal solution from the final implementation.]
Validation of optimizer solutions – Throughput
The optimizer's optimal allocation & schedule was executed on a virtual platform for validation, over 250 instances.
MAX error lower than 10%; AVG error equal to 4.51%, with a standard deviation of 1.94. All deadlines are met.
[Figure: histogram of probability (%) versus throughput difference (%) between prediction and execution, ranging from -5% to 11%.]
Validation of optimizer solutions – Power
Again, the optimizer's optimal allocation & schedule was validated on the virtual platform over 250 instances.
MAX error lower than 10%; AVG error equal to 4.80%, with a standard deviation of 1.71.
[Figure: histogram of probability (%) versus energy consumption difference (%), ranging from -5% to 11%.]
GSM Encoder
Throughput required: 1 frame / 10 ms, with 2 processors and 4 possible frequency & voltage settings.
Task graph: 10 computational tasks, 15 communication tasks.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (-66.4%).
Challenge: programming environment
A software development toolkit helps programmers in software implementation:
a generic customizable application template (OFFLINE SUPPORT);
a set of high-level APIs (ONLINE SUPPORT) in an RT-OS (RTEMS).
The main goals are predictable application execution after the optimization step, and guarantees on high performance and constraint satisfaction.
Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.