Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms
Michela [email protected]
DEIS Università di Bologna
Digital Convergence – Mobile Example
Broadcasting, telematics, imaging, computing, communication, entertainment, storage: one device, multiple functions, at the center of a ubiquitous media network. The smart mobile device is the next driver for the semiconductor industry.
SoC: Enabler for Digital Convergence
From today to the future: > 100X performance, low power, growing complexity, all integrated in SoCs.
Design as optimization
Design space: the set of “all” possible design choices.
Constraints: solutions that we are not willing to accept.
Cost function: a property we are interested in (execution time, power, reliability…).
Embedded system design – MOTIVATION & CONTEXT
System design with MPSoCs: exploit application and platform parallelism to achieve real-time performance.
Given a platform description and an application abstraction:
compute an allocation & schedule;
verify the results & perform changes.
Design flow
[Figure: a five-task application graph is mapped onto a platform with processors P1, P2 and a RAM, producing a schedule of the tasks over time t.]
Crucial role of the allocation & scheduling algorithm
MPSoC platform – PROBLEM DESCRIPTION
Identical processing elements (PEs)
Local storage devices
Remote on-chip memory
Shared bus
MPSoC platform
Resources: PEs, local memory devices, shared bus, PE frequencies (DVS).
Constraints: local device capacity, bus bandwidth (an additive resource), architecture-dependent constraints.
Application Task Graph – PROBLEM DESCRIPTION
Nodes are tasks/processes; arcs are data communications.
Each task:
reads data for each ingoing arc (RD),
performs some computation (EXEC),
writes data for each outgoing arc (WR).
[Figure: a six-task graph; each task has program data and communication buffers allocated either locally or remotely (LOC/REM) on its PE, and an execution frequency (FREQ).]
Application Task Graph
Memory allocation: remote memory is slower than the local ones.
Execution frequency.
Durations depend on both; different phases (RD, EXEC, WR) have different bus requirements.
[Figure: the same six-task graph, with the RD/EXEC/WR phases of each task annotated.]
PROBLEM VARIANTS
We focused on problem variants with different objective functions (O.F.) and graph features (G.F.):
Objective functions: bus traffic, energy consumption, makespan.
Graph features: pipelined, generic, generic with conditional branches.
Each combination was considered both with DVS and without DVS.
PROBLEM VARIANTS – Objective function
Bus traffic: tasks produce traffic when they access the bus; it depends completely on the memory allocation (allocation dependent).
Makespan: depends completely on the computed schedule (schedule dependent).
Energy (DVS): the higher the frequency, the higher the energy consumption; frequency switching has a time & energy cost of its own.
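The energy objective can be made concrete with a toy model. The quadratic dependence on voltage, all constants, and the function name below are illustrative assumptions, not the formulas used in this work:

```python
# Toy DVS energy model: dynamic energy ~ C * V^2 * f * t per segment,
# plus a fixed cost for every frequency switch. All constants are made up.
def schedule_energy(segments, switch_cost=1.0, cap=1e-9):
    """segments: (freq_Hz, volt, seconds) tuples in execution order."""
    energy = sum(cap * v * v * f * t for f, v, t in segments)
    # a switch penalty is paid between consecutive segments at different f
    switches = sum(1 for a, b in zip(segments, segments[1:]) if a[0] != b[0])
    return energy + switch_cost * switches

# Running fast then slow: execution energy plus one switching penalty.
print(schedule_energy([(1e9, 1.2, 0.01), (5e8, 1.0, 0.02)]))
```

The model captures the trade-off on the slide: lowering the frequency lowers execution energy, but every switch adds a fixed cost.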
PROBLEM VARIANTS – Graph structure
Pipelined (1 → 2 → 3 → 4): typical of stream-processing applications.
Generic: an arbitrary task graph.
Generic with conditional branches: arcs are labelled with conditions (a/!a, b/!b) and branch probabilities (e.g. 0.3/0.7, 0.5/0.5). The problem becomes stochastic, and the objective function is a stochastic O.F. (an expected value).
Application Development Flow
Characterization phase: a simulator profiles the application (CTG characterization), producing application profiles.
Optimization phase: an optimizer computes the allocation and the scheduling from those profiles.
The result is an optimal SW application implementation, which is executed on the platform with application development support.
When & Why Offline Optimization?
Plenty of design-time knowledge: applications are pre-characterized at design time, with dynamic transitions between different pre-characterized scenarios.
Aggressive exploitation of system resources: reduces overdesign (lowers cost) and gives strong performance guarantees.
Applicable to many embedded applications.
Questions
Complete or incomplete solver? Can I solve the instances with a complete solver? Look at the structure of the problem and the average instance dimension.
If yes, which technique should be used? If not, what is the quality of the heuristic solution proposed?
Optimization in system design
Complete solvers find the optimal solution and prove its optimality. The system design community uses Integer Programming techniques for every optimization problem, regardless of the structure of the problem itself; yet scheduling is poorly handled by IP.
Incomplete solvers lack an estimated optimality gap: they decompose the problem and solve each subproblem sequentially, or use local search/metaheuristic algorithms, which require a lot of tuning.
Optimization techniques
We will consider two techniques: Constraint Programming and Integer Programming.
A problem has two aspects: feasibility and optimality. Merging the two techniques, one can obtain better results.
Constraint Programming
A fairly recent, declarative programming paradigm. It inherits from logic programming, operations research, software engineering, and AI constraint solving.
Constraint Programming
Problem model: variable domains and constraints.
Problem solving: constraint propagation and search.
Example 1: X::[1..10], Y::[5..15], X > Y
Arc-consistency: for each value v of X, there should be a value in the domain of Y consistent with v.
After propagation: X::[6..10], Y::[5..9]
Example 2: X::[1..10], Y::[5..15], X = Y
After propagation: X::[5..10], Y::[5..10]
Example 3: X::[1..10], Y::[5..15], X ≠ Y
No propagation: every value of each domain still has a support in the other.
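Example 1's pruning can be sketched in a few lines; `propagate_gt` and the interval representation are illustrative, not a real CP solver API:

```python
# Minimal sketch of arc-consistency propagation for X > Y over integer
# interval domains, mirroring Example 1 above.
def propagate_gt(x, y):
    """Prune interval domains (lo, hi) so that X > Y becomes arc-consistent."""
    x_lo, x_hi = x
    y_lo, y_hi = y
    # X must exceed the smallest value Y can take: X >= y_lo + 1
    x_lo = max(x_lo, y_lo + 1)
    # Y must stay below the largest value X can take: Y <= x_hi - 1
    y_hi = min(y_hi, x_hi - 1)
    if x_lo > x_hi or y_lo > y_hi:
        raise ValueError("inconsistent: a domain became empty")
    return (x_lo, x_hi), (y_lo, y_hi)

x, y = propagate_gt((1, 10), (5, 15))
print(x, y)  # (6, 10) (5, 9), as in Example 1
```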
Mathematical Constraints
Every variable is involved in many constraints: each change in the domain of a variable can affect many constraints. From an agent perspective, each constraint reacts to the domain events of its variables, as in the following example.
Constraint Interaction
X::[1..5], Y::[1..5], Z::[1..5]
X = Y + 1
Y = Z + 1
X = Z - 1

First propagation of X = Y + 1 leads to
X::[2..5] Y::[1..4] Z::[1..5]
and X = Y + 1 is suspended.

Second propagation of Y = Z + 1 leads to
X::[2..5] Y::[2..4] Z::[1..3]
and Y = Z + 1 is suspended. The domain of Y has changed, so X = Y + 1 is woken:
X::[3..5] Y::[2..4] Z::[1..3]
and X = Y + 1 is suspended again.

Third propagation of X = Z - 1 leads to
FAIL: X::[] Y::[2..4] Z::[1..3]
The order in which constraints are considered does not affect the result, BUT it can influence the efficiency.
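The fixed-point computation above can be simulated directly (the function names and the interval representation are again illustrative, not a solver API):

```python
# Sketch of propagation to a fixed point for the three interacting
# constraints X = Y + 1, Y = Z + 1, X = Z - 1 over interval domains.
def eq_offset(a, b, c):
    """Propagate A = B + c: tighten both interval domains."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    a_lo, a_hi = max(a_lo, b_lo + c), min(a_hi, b_hi + c)
    b_lo, b_hi = max(b_lo, a_lo - c), min(b_hi, a_hi - c)
    return (a_lo, a_hi), (b_lo, b_hi)

def propagate(X, Y, Z):
    """Run the three propagators until a fixed point or an empty domain."""
    while True:
        old = (X, Y, Z)
        X, Y = eq_offset(X, Y, 1)    # X = Y + 1
        Y, Z = eq_offset(Y, Z, 1)    # Y = Z + 1
        X, Z = eq_offset(X, Z, -1)   # X = Z - 1
        if any(lo > hi for lo, hi in (X, Y, Z)):
            return "FAIL"            # inconsistent store, as on the slide
        if old == (X, Y, Z):
            return X, Y, Z           # fixed point reached

print(propagate((1, 5), (1, 5), (1, 5)))  # FAIL
```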
Global Constraints
cumulative([S1,..,Sn], [D1,..,Dn], [R1,..,Rn], L)
S1,...,Sn activity starting times (domain variables); D1,...,Dn durations (domain variables); R1,...,Rn resource consumptions (domain variables); L resource capacity.
Given the interval [min, max] where min = min_i {S_i} and max = max_i {S_i + D_i} - 1, the cumulative constraint holds iff
max_{t in [min,max]} Σ_{j | S_j ≤ t ≤ S_j + D_j - 1} R_j ≤ L
Global Constraints
Example: cumulative([1,2,4], [4,2,3], [1,2,2], 3)
[Figure: the resource/time profile of the three activities over times 1..7; the summed consumption never exceeds the capacity L = 3.]
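The meaning of the constraint can be checked with a direct feasibility test over fixed start times (a sketch of the constraint's semantics, not of a propagation algorithm):

```python
# Feasibility check for cumulative: at every time point, the summed
# resource usage of the running activities must not exceed capacity L.
def cumulative_holds(S, D, R, L):
    horizon = range(min(S), max(s + d for s, d in zip(S, D)))
    for t in horizon:
        # activity j runs at time t iff S[j] <= t < S[j] + D[j]
        usage = sum(r for s, d, r in zip(S, D, R) if s <= t < s + d)
        if usage > L:
            return False
    return True

print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 3))  # True
print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 2))  # False
```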
Propagation
One of the propagation algorithms used in resource constraints is based on obligatory parts: the interval between an activity's latest start (S_max) and its earliest completion (S_min + D) is occupied in every possible schedule, so the resource usage of obligatory parts can be propagated against the capacity.
[Figure: an activity's time window, with the obligatory part lying between S_max and S_min + D.]
Propagation
Another propagation algorithm is based on edge finding [Baptiste, Le Pape, Nuijten, IJCAI95].
Consider a unary resource R and three activities S1, S2, S3 that should be scheduled on R, each with its own time window (S1 within [0, 17], S2 and S3 within tighter windows).
We can deduce that the earliest start time of S1 is 8: S1 must be executed after both S2 and S3. This is global reasoning: if S1 were executed before the two, there would be no space left for executing both S2 and S3 within their windows, so S1's window shrinks from [0, 17] to [8, 17].
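A simplified version of the edge-finding rule can be sketched as follows; it tests a single candidate set Ω rather than running the full algorithm of Baptiste et al., and the windows in the example are illustrative rather than the slide's exact numbers:

```python
# Simplified edge-finding on a unary resource: if task i started before
# the set Omega, the window [est(Omega+{i}), lct(Omega)] could not fit
# all durations, so i must run after Omega and its earliest start is
# pushed to est(Omega) + total duration of Omega (a valid, if weak, bound).
def edge_finding_push(task, omega):
    """task and omega entries are (est, lct, duration). Return task's new est."""
    est_all = min(t[0] for t in omega + [task])
    lct_omega = max(t[1] for t in omega)
    p_all = sum(t[2] for t in omega) + task[2]
    if est_all + p_all > lct_omega:          # i cannot end before Omega ends
        est_omega = min(t[0] for t in omega)
        p_omega = sum(t[2] for t in omega)
        return max(task[0], est_omega + p_omega)
    return task[0]                           # no deduction possible

# S1 must follow S2 and S3: its earliest start is pushed from 0 to 10.
print(edge_finding_push((0, 17, 6), [(1, 11, 5), (4, 13, 4)]))  # 10
```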
CP solving process
• The solution process interleaves propagation and search.
• Constraints propagate as much as possible.
• When no more propagation can be performed, either the process fails because one domain is empty, or search is performed.
• The search heuristic chooses which variable to select and which value to assign to it.
• Optimization problems are solved via a sequence of feasibility problems with constraints on the objective function variable.
Pros and Cons
Declarative programming: the user states the constraints; the solver takes care of propagation and search.
Strong on the feasibility side. Constraints are symbolic and mathematical: expressivity. Adding a constraint helps the solution process: flexibility.
Weak optimality pruning when the link between the problem's decision variables and the objective function is loose. No use of relaxations.
Integer Programming
• Standard form of a combinatorial optimization problem (IP):
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ≥ 0, j = 1..n
x_j integer (this may make the problem NP-complete)
• An inequality y ≥ 0 is recast as y - s = 0 with a nonnegative slack variable s.
• Maximization is expressed by negating the objective function.
0-1 Integer Programming
• Many combinatorial optimization problems can be expressed in terms of 0-1 variables (IP):
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ∈ {0,1}
(this may make the problem NP-complete)
Linear Relaxation
min z = Σ_{j=1..n} c_j x_j
subject to Σ_{j=1..n} a_ij x_j = b_i, i = 1..m
x_j ≥ 0, j = 1..n
(the integrality constraint "x_j integer" is removed)
The linear relaxation is solvable in POLYNOMIAL TIME. The SIMPLEX ALGORITHM is the technique of choice, even though it is exponential in the worst case.
Linear Relaxation
Geometric properties:
• The set of constraints defines a polytope.
• The optimal solution is located on one of its vertices.
The simplex algorithm starts from one vertex and moves to an adjacent one with a better value of the objective function, until the vertex with the optimal solution is reached.
IP solving process
• The optimal LP solution is in general fractional: it violates the integrality constraint, but provides a bound on the solution of the overall problem.
• The solution process is branch and bound, interleaving relaxation and search.
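Branch and bound with a relaxation bound can be illustrated on a toy 0-1 knapsack, where the LP relaxation happens to be solvable greedily (the instance numbers are made up):

```python
# Toy branch-and-bound for a 0-1 knapsack (maximization), showing how the
# LP relaxation bounds the search. For knapsack, the relaxation is solved
# greedily by value/weight ratio, splitting at most one item fractionally.
def frac_bound(items, cap):
    """Optimal value of the LP relaxation (fractional knapsack)."""
    bound = 0.0
    for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        take = min(w, cap)
        bound += v * take / w
        cap -= take
        if cap == 0:
            break
    return bound

def solve(items, cap):
    best = 0
    def bb(idx, cap, value):
        nonlocal best
        best = max(best, value)
        # prune: the relaxation bound cannot beat the incumbent
        if idx == len(items) or value + frac_bound(items[idx:], cap) <= best:
            return
        v, w = items[idx]
        if w <= cap:
            bb(idx + 1, cap - w, value + v)  # branch: x_idx = 1
        bb(idx + 1, cap, value)              # branch: x_idx = 0
    bb(0, cap, 0)
    return best

print(solve([(60, 10), (100, 20), (120, 30)], 50))  # 220
```

Note how the fractional bound (240 at the root) is what allows whole subtrees to be discarded once an incumbent of equal or better value is known.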
Pros and Cons
Declarative programming: the user states the constraints; the solver takes care of relaxation and search.
Strong on the optimality side: the structure of many real problems has been deeply studied.
Only linear constraints can be used. If sophisticated techniques are used, flexibility is lost. No pruning on feasibility (only some preprocessing).
Resource-Efficient Application Mapping for MPSoCs
Given a platform and multimedia applications:
1. achieve a specified throughput;
2. minimize the usage of shared resources.
Allocation and scheduling
• Given:
• a hardware platform with processors, (local and remote) storage devices, and a communication channel;
• a pre-characterized task graph representing a functional abstraction of the application we should run.
• Find:
• an allocation and a scheduling of tasks to resources respecting
• real-time constraints (task deadlines),
• precedences among tasks,
• the capacity of all resources,
• such that the communication cost is minimized.
Allocation and scheduling
The platform is a multi-processor system with N nodes. Each node includes a processor and a scratchpad memory. The bus is a shared communication channel. In addition we have a remote memory of unlimited capacity (a realistic assumption for our application, but easily generalizable).
The task graph has a pipeline workload: real-time video graphic processing of the pixels of a digital image. It specifies task dependencies (arcs between tasks) and the computation, communication, and storage requirements of the graph.
MPSoC platform – PROBLEM DESCRIPTION
Resources: PEs (unary resources), local memory devices (limited capacity), shared bus (limited bandwidth), remote memory (assumed infinite).
Constraints: local device capacity, bus bandwidth (an additive resource), architecture-dependent constraints.
Problem structure
As a whole it is a scheduling problem with alternative resources: a very tough problem. It smoothly decomposes into allocation and scheduling.
Allocation is better handled with IP techniques (not with CP, due to the complex objective function). Scheduling is better handled with CP techniques (not with IP, since we would have to model every possible starting time of each task with a 0/1 variable).
INTERACTION REGULATED VIA CUTTING PLANES
Logic-Based Benders Decomposition
Decomposes the problem into 2 sub-problems:
Allocation → INTEGER PROGRAMMING. Objective function: communication cost. Handles the memory constraints.
Scheduling → CONSTRAINT PROGRAMMING, run on each valid allocation. Secondary objective function: makespan. Handles the timing constraints.
If no feasible schedule exists for an allocation, a no-good (a linear constraint) is added to the master and the loop iterates.
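The overall loop can be sketched schematically; `solve_master` and `schedule` stand in for the real IP and CP solvers, and the toy instance is invented:

```python
# Schematic logic-based Benders loop: an IP master proposes a
# minimum-cost allocation, a CP subproblem checks schedulability,
# and infeasible allocations are forbidden with a no-good cut.
def benders(solve_master, schedule):
    cuts = []                       # no-goods: forbidden allocations
    while True:
        alloc = solve_master(cuts)  # best allocation not yet cut (IP)
        if alloc is None:
            return None             # master infeasible: no solution at all
        sched = schedule(alloc)     # try to build a feasible schedule (CP)
        if sched is not None:
            return alloc, sched     # first schedulable allocation is optimal
        cuts.append(alloc)          # cut this allocation and iterate

# Toy instance: allocations in increasing communication-cost order;
# only "B" admits a feasible schedule.
def toy_master(cuts):
    for a in ["A", "B", "C"]:
        if a not in cuts:
            return a
    return None

def toy_schedule(alloc):
    return "schedule" if alloc == "B" else None

print(benders(toy_master, toy_schedule))  # ('B', 'schedule')
```

The first schedulable allocation is optimal here because the objective (communication cost) depends only on the allocation, and the master enumerates allocations in cost order.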
Master Problem model
Assignment of tasks and memory slots (master problem):
Tij = 1 if task i executes on processor j, 0 otherwise;
Yij = 1 if task i allocates its program data on processor j's memory, 0 otherwise;
Zij = 1 if task i allocates its internal state on processor j's memory, 0 otherwise;
Xij = 1 if task i executes on processor j and task i+1 does not, 0 otherwise.
Each process must be allocated to exactly one processor: Σ_j Tij = 1 for all i.
Link between the X and T variables: Xij = |Tij - Ti+1,j| for all i and j (can be linearized).
A task's memories can only be allocated where the task is: Tij = 0 implies Yij = 0 and Zij = 0.
Objective function (communication cost):
min Σ_i Σ_j [ mem_i (Tij - Yij) + state_i (Tij - Zij) + data_i Xij / 2 ]
Improvement of the model
With the proposed model, the allocation problem solver tends to pack all tasks on a single processor, and all required memory on its local memory, so as to have a ZERO communication cost: a TRIVIAL SOLUTION.
To improve the model we add a relaxation of the subproblem to the master problem model: for each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that they cannot all run on the same processor:
if Σ_{i∈S} WCET_i > RT, then Σ_{i∈S} Tij ≤ |S| - 1 for every processor j.
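Generating these cuts for a pipeline can be sketched as follows (the WCETs and the RT value are illustrative):

```python
# Sketch of the master-problem relaxation: any window of consecutive
# pipeline tasks whose summed worst-case durations exceed the real-time
# budget RT cannot be packed on one processor, yielding the cut
#   sum_{i in S} T_ij <= |S| - 1   for every processor j.
def relaxation_cuts(wcet, rt):
    """Return the minimal windows [a, b] with sum of WCETs > rt."""
    cuts = []
    for a in range(len(wcet)):
        total = 0
        for b in range(a, len(wcet)):
            total += wcet[b]
            if total > rt:
                cuts.append((a, b))  # tasks a..b cannot share a processor
                break                # any longer window is implied
    return cuts

print(relaxation_cuts([4, 3, 5, 2], 8))  # [(0, 2), (1, 3)]
```

Only minimal windows are kept, since a cut over a superset of tasks is dominated by the cut over the smaller window.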
Sub-Problem model
Task scheduling with static resource assignment (subproblem): the allocation is fixed, so we only have to decide when tasks start.
Activity starting times: Start_i :: [0..Deadline_i]
Precedence constraints: Start_i + Dur_i ≤ Start_j for each arc (i, j)
Real-time constraint: every activity running on a processor must complete within the period, Start_i + Dur_i ≤ RT
Cumulative constraints on resources:
processors are unary resources: cumulative([Start], [Dur], [1], 1)
memories are additive resources: cumulative([Start], [Dur], [MR], C)
What about the bus?
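With the allocation fixed, the precedence and unary-resource constraints can be illustrated by a simple list-scheduling construction (a greedy sketch assuming a topological task order, not CP search):

```python
# List scheduling under a fixed allocation: serialize tasks on each
# processor (unary resource) and honour Start_i + Dur_i <= Start_j
# for every precedence arc (i, j).
def list_schedule(dur, arcs, proc):
    """dur[i]: duration; arcs: (i, j) precedences; proc[i]: processor id.
    Tasks are assumed to be indexed in topological order."""
    start = [0] * len(dur)
    free = {}                           # earliest free time per processor
    for i in range(len(dur)):
        est = free.get(proc[i], 0)      # the processor is a unary resource
        for a, b in arcs:               # precedence constraints into i
            if b == i:
                est = max(est, start[a] + dur[a])
        start[i] = est
        free[proc[i]] = est + dur[i]
    return start

# Task 0 feeds tasks 1 and 2; tasks 0 and 2 share processor 0.
print(list_schedule([2, 3, 1], [(0, 1), (0, 2)], [0, 1, 0]))  # [0, 2, 2]
```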
Bus model
[Figure: bus bandwidth (bit/sec) over time for task0 and task1; task0 accesses its input data and reads its state at BW = MaxBW/NoProc, executes, then writes its state.]
Additive bus model: each task is granted a fixed fraction of the maximum bus bandwidth (MaxBW/NoProc). The model does not hold under heavy bus congestion, so bus traffic has to be minimized.
Results
Algorithm search time: the combined approach dominates, and its higher complexity shows up only for simple system configurations.
Energy-Efficient Application Mapping for MPSoCs
Given a platform and multimedia applications:
1. achieve a specified throughput;
2. minimize power consumption.
Logic-Based Benders Decomposition
Decomposes the problem into 2 sub-problems:
Allocation & assignment (& frequency setting) → INTEGER PROGRAMMING. Objective function: communication cost & energy consumption, minimizing the energy consumed during execution. Handles the memory constraints.
Scheduling → CONSTRAINT PROGRAMMING, run on each valid allocation. Objective function: e.g. minimizing the energy consumed by frequency switching. Handles the timing constraints.
No-goods and cutting planes are returned to the master.
Allocation problem model
Xtfp = 1 if task t executes on processor p at frequency f;
Wijfp = 1 if tasks i and j run on different cores, and task i on core p writes data for j at frequency f;
Rijfp = 1 if tasks i and j run on different cores, and task j on core p reads data from i at frequency f.
Each task can execute on exactly one processor at one frequency:
Σ_{p=1..P} Σ_{f=1..M} Xtfp = 1 for all t
Communication between tasks executes only once per graph execution, and one write corresponds to one read:
Σ_{p=1..P} Σ_{f=1..M} Wijfp = Σ_{p=1..P} Σ_{f=1..M} Rijfp = 1 for every arc (i, j) between tasks on different cores
The objective function minimizes the energy consumption associated with task execution and communication:
min OF = En_Comp + En_Read + En_Write
Scheduling problem model
Each task has INPUT, EXEC and OUTPUT phases. The duration of task i is now fixed, since its mode (processor and frequency) is fixed.
The objective function minimizes the energy consumption associated with frequency switching.
• Processors are modelled as unary resources.
• The bus is modelled as an additive resource.
The precedence constraint between task i and task j depends on the allocation:
tasks running on the same processor at the same frequency: Start_i + duration_i ≤ Start_j
tasks running on the same processor at different frequencies: Start_i + duration_i + T_switch ≤ Start_j
tasks running on different processors: Start_i + duration_i + Write_i + add_ij ≤ Start_j (writing and reading the communication buffer must fit in between)
Computational efficiency
Search time for an increasing number of tasks:
• A similar plot holds for an increasing number of processors.
• Standalone CP and IP proved not comparable, even on a simpler problem.
• We varied the RT constraint: a tight deadline leaves few feasible solutions; a very loose deadline makes the solution trivial; in between, search times stay within one order of magnitude.
Experiments ran on a P4 at 2 GHz with 512 MB RAM, using professional solving tools (ILOG).
Conditional Task Graphs
With conditional task graphs the problem becomes stochastic, since the outcomes of the conditions labelling the arcs are known only at execution time; we only know their probability distribution.
We minimize the expected value of the objective function: minimizing communication cost is easier, minimizing makespan much more complicated. Promising results.
Other problems
What if the task durations are not known, and only the worst and best cases are available? Scheduling the worst case alone is not enough, because of scheduling anomalies.
Change of platform: we are facing the CELL BE processor architecture, modelling synchronous data flow applications and communication-intensive applications.
Optimization & Development
The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. Programmers must be aware of the simplifying assumptions taken in the optimization tools.
Challenge: the Abstraction Gap
[Figure: the flow goes from platform modelling, through optimization analysis, to an optimal solution and a starting implementation, and finally to platform execution; the abstraction gap separates the optimal solution from the final implementation.]
Validation of optimizer solutions – Throughput
The optimizer's optimal allocation & schedule was executed on a virtual platform for validation, over 250 instances.
MAX error lower than 10%; AVG error equal to 4.51%, with a standard deviation of 1.94. All deadlines are met.
[Figure: histogram of probability (%) versus throughput difference (%) between prediction and execution, ranging from -5% to 11%.]
Validation of optimizer solutions – Power
Again, the optimizer's optimal allocation & schedule was validated on the virtual platform over 250 instances.
MAX error lower than 10%; AVG error equal to 4.80%, with a standard deviation of 1.71.
[Figure: histogram of probability (%) versus energy consumption difference (%), ranging from -5% to 11%.]
GSM Encoder
Throughput required: 1 frame / 10 ms, with 2 processors and 4 possible frequency & voltage settings.
Task graph: 10 computational tasks, 15 communication tasks.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (-66.4%).
Challenge: programming environment
A software development toolkit helps programmers in software implementation:
a generic customizable application template (OFFLINE SUPPORT);
a set of high-level APIs (ONLINE SUPPORT) in an RT-OS (RTEMS).
The main goals are predictable application execution after the optimization step, and guarantees on high performance and constraint satisfaction.
Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.