Page 1:

- 1 - P. Marwedel, Univ. Dortmund/Informatik 12 + ICD/ES, 2005

Universität Dortmund

Memory-aware compilation enables fast, energy-efficient, timing predictable memory accesses

Peter Marwedel (1,2), Heiko Falk (1), Christian Ferdinand (3), Paul Lokuciejewski (1), Manish Verma (1), Lars Wehmeyer (1,2)

(1) Universität Dortmund, Informatik 12
(2) Informatik Centrum Dortmund (ICD)
(3) AbsInt GmbH, Saarbrücken

Page 2:

Key properties of embedded systems

[Figure: overlapping sets labelled "embedded" and "real-time"]

Strong correlation between embedded and real-time systems

„A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment“ [Bergé, 1995]

Strong correlation between embedded and reactive systems

Page 3:

Serious mismatch

Despite considerable progress in software and hardware techniques, when embedded computing systems absolutely must meet tight timing constraints, many of the advances in computing become part of the problem rather than part of the solution.

What would it take to achieve concurrent and networked embedded software that was absolutely positively on time …? … What is needed is nearly a reinvention of computer science.

Edward A. Lee: Absolutely Positively On Time: What Would It Take?, Editorial, Draft version: May 18, 2005, Published in: Embedded Systems Column, IEEE Computer, July, 2005

Page 4:

Technology "advances" will make the situation worse

[Figure: speed over years. CPU (1.5-2 p.a.), DRAM (1.07 p.a.); annotation: 2x every 2 years]

Increasing gap between processor and memory speeds

Future semiconductor technology will be inherently unreliable, e.g. due to quantum effects, and will require fault tolerance mechanisms to be used.

Timing "redundancy" used?
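
The growth rates quoted on this slide can be checked with a little arithmetic. The sketch below assumes that the figures are annual speed-up factors (CPU 1.5 to 2, DRAM 1.07) and that they compound from year to year; the function name `gap_after` is illustrative only.

```python
def gap_after(years, cpu_rate, dram_rate=1.07):
    """Ratio of CPU speed to DRAM speed after `years`, starting from parity.

    Assumes both speeds grow by a constant factor per year (compound growth).
    """
    return (cpu_rate / dram_rate) ** years


if __name__ == "__main__":
    for cpu_rate in (1.5, 2.0):
        for years in (2, 4, 10):
            gap = gap_after(years, cpu_rate)
            print(f"CPU x{cpu_rate}/year, after {years} years: gap x{gap:.1f}")
```

With the lower CPU rate of 1.5, the gap grows by a factor of about 1.4 per year, so it roughly doubles every 2 years.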

Page 5:

Scratchpad seen to help with timing problems

Fortunately, there is quite a bit to draw on. To name a few examples, architecture techniques such as software-managed caches (scratchpad memories) promise to deliver much of the benefit of memory hierarchy without the timing unpredictability…

[E.Lee, 2005]


Page 6:

Scratch pad memories (SPM): Fast, energy-efficient, timing-predictable

[Figure: address space from 0 to FFF.., containing the scratch pad memory and main memory]

Example: ARM7TDMI cores, well-known for low power consumption

Called “tightly coupled memory” by ARM

Small; no tag memory

Page 7:

Worst case timing analysis using aiT

[Figure: tool flow. The C program and the SP size are inputs to encc, which produces the executable; ARMulator reports the actual performance and aiT the WCET]

Page 8:

Results for G.721

L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET, 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004

L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005

[Figures: results using scratchpad; results using unified cache]

Page 9:

Impact on access time and energy consumption

[Figures: energy; access times. CACTI model for SRAM]

Small memories also provide faster access time and reduced energy consumption

Page 10:

Energy savings for memory system energy

Page 11:

Static allocation of memory objects

Which objects (arrays, functions, etc.) are to be stored in the SPM?

Gain $g_k$ and size $s_k$ for each object $k$.

Static memory allocation: maximise the gain $G = \sum_{k \in SPM} g_k$, respecting the size of the SPM: $\sum_{k \in SPM} s_k \le S_{SP}$.

Solution: knapsack algorithm.
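
The knapsack formulation can be sketched as a standard 0/1 knapsack solved by dynamic programming. This is a minimal illustration, not the authors' implementation: the function name `allocate_spm`, the object names, and all gain and size values are hypothetical, and sizes are assumed to be integers.

```python
def allocate_spm(objects, capacity):
    """0/1 knapsack by dynamic programming.

    objects:  list of (name, gain, size) tuples with integer sizes (e.g. bytes)
    capacity: SPM capacity S_SP in the same unit
    Returns (total_gain, names of the objects placed in the SPM).
    """
    # best[c] = (gain, chosen names) achievable with capacity c
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        # Walk capacities downwards so each object is selected at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical objects: (name, gain g_k, size s_k in bytes)
    objects = [("array_a", 60, 256), ("func_f", 100, 512), ("array_b", 120, 768)]
    print(allocate_spm(objects, capacity=1024))
```

In this example `func_f` and `array_b` together would give the highest gain, but they do not fit into 1024 bytes, so the sketch selects `array_a` and `array_b` instead.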

[Figure: example. A board with processor, scratch pad memory (capacity $S_{SP}$), and main memory; the program's objects (loops, functions, arrays, scalars) are candidates for the SPM]

Page 12:

Dynamic replacement within scratch pad

Effectively results in a kind of compiler-controlled swapping for SPM

Address assignment within SPM required (paging or segmentation-like)

M. Verma, P. Marwedel (U. Dortmund): Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004

[Figure: CPU, SPM, and memory]

Page 13:

Dynamic replacement of data within scratch pad: based on liveness analysis

SP Size = |A| = |T3|

Solution: A → SP & T3 → SP

[Figure: control flow graph with basic blocks B1 to B10, annotated with DEF A, USE A, USE A, MOD A, USE T3, USE T3, and with the inserted spill code SPILL_STORE(A); SPILL_LOAD(T3); and SPILL_LOAD(A);]
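
The effect of the spill code can be illustrated with a toy model in which A and T3 share a single SPM slot. The sketch below is not the liveness-based method of the slide: it brings objects in on demand along one linear access trace instead of analysing a control flow graph, and the function name `spill_code` and the final `USE A` in the example trace are assumptions made for illustration.

```python
def spill_code(trace):
    """Emit spill operations for a single SPM slot shared by several objects.

    trace: list of (op, obj) pairs in execution order,
           with op in {"DEF", "USE", "MOD"}.
    Toy model: objects are swapped in on demand along one linear path.
    """
    resident, dirty, ops = None, False, []
    for op, obj in trace:
        if obj != resident:
            if resident is not None and dirty:
                # Write back the modified contents of the object being evicted.
                ops.append(f"SPILL_STORE({resident})")
            if op != "DEF":
                # The old values are needed, so copy the object into the SPM.
                ops.append(f"SPILL_LOAD({obj})")
            resident, dirty = obj, False
        if op in ("DEF", "MOD"):
            dirty = True
    return ops


if __name__ == "__main__":
    # Accesses taken from the slide, plus an assumed later use of A.
    trace = [("DEF", "A"), ("USE", "A"), ("USE", "A"), ("MOD", "A"),
             ("USE", "T3"), ("USE", "T3"), ("USE", "A")]
    print(spill_code(trace))
```

T3 is only read, so it never has to be written back; only A needs a SPILL_STORE before its slot is reused.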

Page 14:

Dynamic replacement within scratch pad

- Results for edge detection relative to static allocation -

Page 15:

Impact of partitioning scratch pads

[Figure: addresses starting at 0, covering scratch pad 0 (256 entries), scratch pad 1 (2 k entries), scratch pad 2 (16 k entries), and "main" memory]

Page 16:

Results for parts of GSM coder/decoder

A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.

[Figure: results chart, with the „working set“ marked]

Page 17:

Multiple Processes: Non-Saving Context Switch

Non-Saving Context Switch (Non-Saving)
- Partitions SPM into disjoint regions
- Each process is assigned an SPM region
- Copies contents during initialization
- Good for large scratchpads

[Figure: scratchpad partitioned into disjoint regions for processes P1, P2, and P3]

Page 18:

Saving/Restoring Context Switch

Saving Context Switch (Saving)
- Utilizes SPM as a common region shared by all processes
- Contents of processes are copied on/off the SPM at context switch
- Good for small scratchpads

[Figure: the scratchpad is used in turn by processes P1, P2, and P3, with saving/restoring at each context switch]

Page 19:

Hybrid Context Switch

Hybrid Context Switch (Hybrid)
- Disjoint + shared SPM regions
- Good for all scratchpads
- Analysis is similar to Non-Saving Approach
- Runtime: $O(nM^3)$

[Figure: scratchpad with one region shared by processes P1, P2, P3 and disjoint regions for P1, P2, and P3]
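
The difference between the three context switch strategies can be illustrated by counting the bytes copied to and from the SPM. The sketch below is a toy accounting model, not the analysis used by the authors. It assumes that every region is filled once at initialization and that each context switch saves and restores the entire shared region; the function name `copied_bytes` and all sizes are hypothetical.

```python
def copied_bytes(strategy, spm_size, n_procs, n_switches, shared_size=0):
    """Bytes copied to/from the SPM under a toy accounting model.

    Assumptions (not from the source): regions are filled once at
    initialization, and every context switch saves and restores the
    full shared region.
    """
    if strategy == "non-saving":
        shared_size = 0             # disjoint regions only
    elif strategy == "saving":
        shared_size = spm_size      # the whole SPM is one shared region
    elif strategy != "hybrid":
        raise ValueError(f"unknown strategy: {strategy}")
    private_size = (spm_size - shared_size) // n_procs
    init = n_procs * private_size + shared_size
    per_switch = 2 * shared_size    # save outgoing + restore incoming contents
    return init + n_switches * per_switch


if __name__ == "__main__":
    for strategy, shared in (("non-saving", 0), ("saving", 0), ("hybrid", 1536)):
        total = copied_bytes(strategy, spm_size=3072, n_procs=3,
                             n_switches=10, shared_size=shared)
        print(f"{strategy:>10}: {total} bytes copied")
```

In this model Non-Saving copies the least but leaves each process only a fraction of the SPM, while Saving gives every process the full SPM at the price of copying at each switch, a cost that shrinks with the SPM size. Hybrid sits between the two.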

Page 20:

Multi-process Scratchpad Allocation: Results

Hybrid is the best for all SPM sizes.

Energy reduction @ 4kB SPM is 27% for Hybrid approach.

Avoids poor timing predictability of cache-based system after context switch.

[Figure: energy consumption (mJ) vs. scratchpad size (bytes; 64, 128, 256, 512, 1024, 2048, 4096) for edge detection, adpcm, g721, mpeg. Series: Energy (SPA), Energy (Non-Saving), Energy (Saving), CopyEnergy (Saving), Energy (Hybrid), CopyEnergy (Hybrid). The 27% reduction is marked. SPA: Single Process Approach]

Page 21:

Multi-processor ARM (MPARM) Framework

– Homogeneous SMP ~ CELL processor
– Processing Unit: ARM7T processor
– Shared Coherent Main Memory
– Private Memory: Scratchpad Memory

[Figure: four ARM cores, each with its own SPM, an interrupt device, a semaphore device, and the shared main memory, all attached to the interconnect (AMBA or STBus)]

Page 22:

Using optimization in a gcc-based tool flow

Source is split into 2 different files by a specially developed memory optimizer tool *.

[Figure: tool flow. Memory Optimizer (ICD-C Compiler): application source (.c) and profile info (.txt) in; main mem. src (.c), spm src (.c), and linker script (.ld) out. ARM-GCC Compiler: produces the executable (.exe)]

*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es)

Page 23:

Results (MOMPARM)

[Figure: energy consumption [µJ] vs. scratchpad size [bytes] (0, 128, 256, 512, 1k, 2k, 4k, 8k, 16k)]

DES-Encryption: 4 processors: 2 Controllers + 2 Compute Engines

Energy values from ST Microelectronics

Result of ongoing cooperation between U. Bologna and U. Dortmund supported by ARTIST2 network of excellence.

Page 24:

State of the art of SPM algorithms

Feature                | Static allocation                      | Dynamic allocation
Partitioned SPMs       | Wehmeyer et al. [WMPI 2004]            | -
WCET analysis          | Wehmeyer et al. [WS WCET 04, DATE 05]  | Wehmeyer et al. [Thesis]
Multiple processes     | Verma et al. [ISSS 2004]               | Future work
Multiprocessor Systems | Verma et al. [Estimedia 2005]          | Verma et al. [ongoing work]
Sections from arrays   | Not always applicable                  | IMEC (MHLA), Kandemir

Page 25:

Extension: WCET-aware compiler

[Figure: WCET-aware compiler flow (ARTIST2). ANSI-C program → ANSI-C frontend → parse tree → IR code generator → medium level IR → LLIR code generator → low level IR → code generator → WCET optimized assembly code. Optimization techniques and analyses (including loop bounds analysis) accompany the IRs. LLIR2crl converts the low level IR to CRL2, the standard input to aiT; aiT's value, cache, pipeline, and path analyses yield CRL2 with WCET info, which crl2llir passes back to the compiler]

Page 26:

Opportunities

Precise WCET information for run-time optimizations
- Single implementation of hardware timing models
- Accurate information on pipeline influence
- Accurate information on timing of memory
- Trade-off Cache vs. Scratchpad Optimization

Pass additional information (flow facts) to aiT

Potential for tighter bounds? (e.g. due to pointer disambiguation)

Aggressive optimizations for code on WCET path

Respecting WCET constraints during compilation

Reduction of jitter in multimedia applications

Alternative input to aiT (compare compiler output)

Page 27:

Conclusion

Timeliness and timing predictability seriously missing in key concepts of current information technology

Scratchpads are seen as a potential contribution towards new architectural concepts
- Comprehensive set of allocation methods has been developed
  • Static allocation
  • Dynamic allocation

Full integration of WCET tools into compiler tool chain enables further explicit considerations of time.