Partitioned Fixed-Priority Scheduling of Parallel Tasks

Without Preemptions

Daniel Casini*, Alessandro Biondi*, Geoffrey Nelissen†, and Giorgio Buttazzo*

* ReTiS Lab, Scuola Superiore Sant’Anna, Pisa, Italy

† CISTER, ISEP, Polytechnic Institute of Porto, Portugal

Overview

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions

• Each task is represented by a Directed Acyclic Graph (DAG) and is characterized by

i. a minimum inter-arrival time 𝑇𝑖

ii. a constrained deadline 𝐷𝑖 ≤ 𝑇𝑖

iii. a fixed priority 𝜋𝑖

• Each node is characterized by a WCET 𝐶𝑖,𝑗
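
As a reference point, here is a minimal sketch of this task model in Python; the class and field names are illustrative choices, not the paper's notation.

```python
# Sporadic DAG task: minimum inter-arrival time T_i, constrained
# deadline D_i <= T_i, fixed priority pi_i, and one WCET per node.
from dataclasses import dataclass, field

@dataclass
class Node:
    wcet: float                                     # C_{i,j}: worst-case execution time
    successors: list = field(default_factory=list)  # DAG precedence edges

@dataclass
class DagTask:
    period: float    # T_i: minimum inter-arrival time
    deadline: float  # D_i: constrained deadline, D_i <= period
    priority: int    # pi_i: fixed priority
    nodes: list = field(default_factory=list)
```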

Overview

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions

• Each node is statically assigned to a core

• Nodes of the same task can be allocated to different cores

[Figure: node-to-core mapping of parallel tasks onto Core 1–Core 4]
Overview

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions

Non-preemptive blocking: as soon as a node starts executing, it cannot be preempted

[Figure: timelines of a lower-priority node 𝑣_LP and a higher-priority node 𝑣_HP, illustrating non-preemptive blocking]

Why non-preemptive scheduling?

Predictable management of local memories: e.g., nodes can pre-load data from scratchpads before starting to execute

See also: "Memory Feasibility Analysis of Parallel Tasks Running on Scratchpad-Based Architectures", Daniel Casini, Alessandro Biondi, Geoffrey Nelissen and Giorgio Buttazzo, this morning @ RTSS

Why non-preemptive scheduling?

• Predictable management of local memories

• Reduces context-switch overhead

• Simplifies WCET analysis

• Use of HW accelerators and GPUs

• Can be a good choice for executing deep neural networks

Why non-preemptive scheduling?

• We profiled a deep neural network executed by TensorFlow on an 8-core Intel i7 machine @ 3.5 GHz

More than 34000 nodes, of which only about 1.2% have execution times larger than 100 microseconds

Overview of the analysis framework

[Figure: parallel tasks are partitioned by a node-to-core mapping onto Core 1–Core 4 (Part III); each core runs a uniprocessor analysis for self-suspending (SS) tasks (Part II); the per-core results feed a response-time analysis for parallel tasks (Part I), which answers: schedulable? (yes/no)]

Part I: Response-time analysis for parallel tasks

Intuition

• Each core 'perceives' the execution of a parallel task as an interleaved sequence of execution and suspension regions

Suspension regions correspond to execution regions on a different core

[Figure: a DAG with nodes 𝑣1–𝑣7 scheduled on cores 𝑝1 and 𝑝2; on each core, execution regions alternate with suspension regions]

Intuition

• Paths can be mapped to self-suspending tasks

[Figure: the path 𝑣1, 𝑣2, 𝑣4, 𝑣5, 𝑣7 executes on core 𝑝1, suspending while 𝑣3 and 𝑣6 execute on core 𝑝2]

Intuition

• Paths can be mapped to self-suspending tasks

Path: 𝑣1 → 𝑣3 → 𝑣5 → 𝑣6 → 𝑣7, with WCETs 𝐶1, 𝐶3, 𝐶5, 𝐶6, 𝐶7

Segmented SS-task: execution regions $C^{SS}_1 = C_1$, $C^{SS}_2 = C_5$, $C^{SS}_3 = C_7$, separated by suspension regions

The length of each execution region directly maps to the WCET of a node in the graph
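
To make the mapping concrete, here is a hedged sketch that derives a segmented SS-task from a path: consecutive nodes on the analyzed core form execution regions, and maximal runs of remote nodes form suspension regions. Bounding a suspension region by the sum of the remote nodes' response-time bounds is an assumption of this sketch, not necessarily the paper's exact rule.

```python
# Sketch: convert a path of DAG nodes into a segmented self-suspending
# (SS) task as seen by one core.
def path_to_ss_task(path, mapping, wcet, rt_bound, core):
    """path: node ids in precedence order; mapping[v] -> core of v;
    wcet[v] -> WCET C_v; rt_bound[v] -> response-time bound R_v."""
    exec_regions, susp_regions = [], []
    i = 0
    while i < len(path):
        if mapping[path[i]] == core:
            c = 0
            while i < len(path) and mapping[path[i]] == core:
                c += wcet[path[i]]      # consecutive local nodes: one execution region
                i += 1
            exec_regions.append(c)
        else:
            s = 0
            while i < len(path) and mapping[path[i]] != core:
                s += rt_bound[path[i]]  # remote nodes: suspension length (assumed bound)
                i += 1
            susp_regions.append(s)
    return exec_regions, susp_regions
```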

Intuition

Segmented SS-task: suspension regions $S^{SS}_1 = R_3$ and $S^{SS}_2 = R_6$

The length of each suspension region depends on the response time of nodes allocated to different cores

Complex inter-core dependencies can arise

[Figure: schedule with 𝑣1, 𝑣2, 𝑣4, 𝑣5, 𝑣7 on core 𝑝1 and 𝑣3, 𝑣6 on core 𝑝2]

Solution (from Fonseca et al. 2017)

• Recursive algorithm to unfold response-time dependencies:

For a path 𝑣1, …, 𝑣7 spread over cores 𝑝1, 𝑝2, 𝑝3, 𝑝4: $R^{SS_4}$ is computed first; then $S_1^{SS_3} = R^{SS_4}$ gives $R^{SS_3}$, $S_1^{SS_2} = R^{SS_3}$ gives $R^{SS_2}$, and $S_1^{SS_1} = R^{SS_2}$ gives $R^{SS_1}$
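
The recursion can be sketched as follows; `analyze_ss_task` stands in for the uniprocessor SS-task analysis, and the linear chain structure mirrors the example above (an assumption of this sketch; the general algorithm handles arbitrary dependency structures).

```python
# Sketch of the recursive unfolding (in the spirit of Fonseca et al.
# 2017): the suspension length of the SS-task on one core is the
# response-time bound of the downstream SS-task, computed recursively.
def unfold_response_time(segments, analyze_ss_task):
    """segments: per-core SS-task descriptors, ordered so that
    segments[k]'s suspension depends on segments[k+1];
    analyze_ss_task(seg, suspension) -> response-time bound R."""
    if len(segments) == 1:
        # Base case: the last SS-task in the chain has no remote dependency.
        return analyze_ss_task(segments[0], suspension=0)
    # Recursive case: unfold the dependency chain from the tail.
    r_next = unfold_response_time(segments[1:], analyze_ss_task)
    return analyze_ss_task(segments[0], suspension=r_next)
```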

Parallel tasks without preemptions

• We extended this approach to work under non-preemptive scheduling

Need for a fine-grained analysis for non-preemptive self-suspending tasks

Our next contribution

Part II: Analysis for non-preemptive self-suspending tasks

Overview of the analysis framework

[Figure: the framework again, highlighting the per-core uniprocessor analyses for self-suspending (SS) tasks]

Overview of the analysis for SS-tasks

Two different approaches:

1. Holistic analysis

• Computes the RT of a whole self-suspending task (per-task response-time bound), using the aggregate parameters $C_i = \sum_{\text{all segments}} C_{i,j}$ and $S_i = \sum_{\text{all segments}} S_{i,j}$

• Analytically dominates the state-of-the-art analysis (Dong et al. 2018)

2. Segment-based analysis

• Computes the RT of individual segments (per-segment response-time bounds)

Analysis for SS-tasks

Response-time analysis accounts for:

• Interference due to higher-priority segments

• Non-preemptive blocking due to lower-priority segments

Computing Interference

• Interference from higher-priority tasks is accounted for by means of the following worst-case scenario*:

[Figure: worst-case release pattern of a higher-priority task, with deadline = period and maximum possible release jitter derived from its response-time bound]

• The response-time bound can be initially approximated by the task's deadline and iteratively refined

• Holistic and segmented analyses are combined during the iterative refinement

*Jian-Jia Chen et al., "Many suspensions, many problems: a review of self-suspending tasks in real-time systems", Real-Time Systems Journal.
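
A hedged sketch of the iterative refinement loop described above, assuming the analysis is monotone in the response-time bounds (function and field names are illustrative):

```python
# Start from R_i = D_i for every task, then repeatedly re-run the
# response-time analysis; tighter response times yield tighter jitter
# bounds, so the iteration can only refine downwards.
def refine_response_times(tasks, response_time_bound):
    """tasks: iterable with .deadline; response_time_bound(task, bounds)
    -> new RT bound for `task`, given current bounds of all tasks."""
    bounds = {t: t.deadline for t in tasks}  # initial safe approximation
    changed = True
    while changed:
        changed = False
        for t in tasks:
            new_r = response_time_bound(t, bounds)
            if new_r < bounds[t]:            # keep only tightening updates
                bounds[t] = new_r
                changed = True
    return bounds
```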

Fine-grained accounting of blocking

• With a multiset approach: the multiset contains the WCETs of all the lower-priority segments that may block the task under analysis in a window of length ∆

[Figure: two segmented SS-tasks 𝜏_LP and 𝜏_HP contending for the same CPU, with task releases and segment releases marked on the timelines]
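
A hedged sketch of how such a multiset could be built and used; the window-based release count and the `n_blocking` cap are simplifying assumptions of this sketch, not the paper's exact formulation.

```python
import math

def blocking_multiset(lp_tasks, delta):
    """lp_tasks: lower-priority SS-tasks, each with .period and
    .segment_wcets (list of per-segment WCETs)."""
    ms = []
    for t in lp_tasks:
        releases = math.ceil(delta / t.period) + 1  # jobs overlapping the window
        ms.extend(t.segment_wcets * releases)       # candidate blocking segments
    return ms

def blocking_bound(lp_tasks, delta, n_blocking):
    # Each non-preemptive region blocks at most once per blocking
    # opportunity: sum the n_blocking largest multiset elements.
    return sum(sorted(blocking_multiset(lp_tasks, delta), reverse=True)[:n_blocking])
```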

Non-preemptive self-suspending tasks

Now we have our analysis!

The RT of a parallel task can be derived from the maximum RT of all its paths:

$R = \max(R', R'', R''')$
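
In code, this final step is just a maximum over per-path bounds (path enumeration and the per-path SS-task analysis are assumed to be provided elsewhere):

```python
def parallel_task_rt_bound(paths, path_rt_bound):
    # R = max(R', R'', R''') over all source-to-sink paths of the DAG
    return max(path_rt_bound(p) for p in paths)
```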

Overview of the analysis framework

[Figure: the framework again, now highlighting the node-to-core mapping]

How to partition nodes to cores?

Partitioning (meta-)algorithm

IDEA: analyze schedulability incrementally, adding one node at a time, and perform schedulability analysis on a subgraph

Inputs:

1. Strategy for ordering tasks

2. Strategy for ordering cores

Output:

1. Node partitioning

Example (animated in the talk): the nodes of the task under analysis are assigned to Core 1–Core 4 one at a time; after each tentative assignment, the analysis for parallel tasks is run on the current subgraph. If the outcome is "Schedulable!", the assignment is kept; if "Unschedulable!", the node is tried on another core, following the core-ordering strategy. The process repeats until the partitioning is completed. A sketch of this control flow follows.
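
Here is a hedged sketch of the meta-algorithm's control flow under the two input strategies; the exact backtracking and failure handling in the paper may differ.

```python
def partition(tasks, cores, order_tasks, order_cores, schedulable):
    """order_tasks(tasks) / order_cores(cores, node): the two input
    strategies; schedulable(mapping): runs the parallel-task analysis
    on the subgraph induced by the nodes mapped so far."""
    mapping = {}
    for task in order_tasks(tasks):
        for node in task.nodes:           # nodes added one at a time
            placed = False
            for core in order_cores(cores, node):
                mapping[node] = core      # tentative assignment
                if schedulable(mapping):  # analysis on the current subgraph
                    placed = True
                    break
                del mapping[node]         # roll back and try the next core
            if not placed:
                return None               # partitioning failed
    return mapping                        # output: node partitioning
```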

Experimental Results

Experimental Study

• Experimental study based on synthetic workload

• We compared against the only previous work targeting non-preemptive scheduling of parallel tasks, which targets global scheduling (Serrano et al. 2017)

• Same DAG generator used in [Serrano et al. 2017]

• WCETs randomly generated in (0, 100] with uniform distribution

• Task utilizations obtained with UUniFast

• Task periods computed as $T_i = \left(\sum_{\text{nodes}} C_{i,j}\right) / U_i$
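
For concreteness, a sketch of this generation pipeline; the UUniFast routine follows Bini and Buttazzo's standard algorithm, and the helper names are illustrative.

```python
import random

def uunifast(n, total_util):
    # Split total_util among n tasks with a uniform distribution over
    # the valid utilization vectors (Bini & Buttazzo's UUniFast).
    utils, remaining = [], total_util
    for i in range(n - 1):
        next_remaining = remaining * random.random() ** (1.0 / (n - 1 - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils

def generate_periods(task_wcets, total_util):
    """task_wcets: per-task lists of node WCETs (drawn uniformly from
    (0, 100] in the paper's setup). T_i = (sum of node WCETs) / U_i."""
    utils = uunifast(len(task_wcets), total_util)
    return [sum(c) / u for c, u in zip(task_wcets, utils)]
```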

Experimental Results

[Plot: schedulability ratio (the higher the better) vs. increasing task-set utilization, 12 tasks, 16 processors]

Improvement up to 100 percentage points over [Serrano et al. 2017]

Experimental Results

Our experimental study revealed a similar trend when varying the number of tasks and processors, e.g.:

[Plot: schedulability ratio vs. utilization, 10 tasks, 8 processors]

Conclusions

1. Methodology for analyzing non-preemptive parallel tasks as a set of self-suspending tasks

2. Analysis for non-preemptive self-suspending tasks which analytically dominates the only previous result

3. Partitioning algorithm to allocate nodes to the available processors

4. Experimental study to assess the improvement in terms of schedulability – up to 100 p.p. w.r.t. the only existing previous work for global scheduling

Future Work

• Deeper investigation of partitioning strategies

• Improvement in the analysis precision

• Integration of communication delays in the analysis

See also: "Memory Feasibility Analysis of Parallel Tasks Running on Scratchpad-Based Architectures", Daniel Casini, Alessandro Biondi, Geoffrey Nelissen and Giorgio Buttazzo, this morning @ RTSS

Thank you!

Daniel Casini
daniel.casini@sssup.it