- 1 - P. Marwedel, Univ. Dortmund/Informatik 12 + ICD/ES, 2005
Universität Dortmund
Memory-aware compilation enables fast, energy-efficient, timing
predictable memory accesses
Peter Marwedel (1, 2), Heiko Falk (1), Christian Ferdinand (3), Paul Lokuciejewski (1), Manish Verma (1), Lars Wehmeyer (1, 2)
(1) Universität Dortmund, Informatik 12
(2) Informatik Centrum Dortmund (ICD)
(3) AbsInt GmbH, Saarbrücken
Key properties of embedded systems
[Figure: sets labelled "embedded" and "real-time"]
Strong correlation between embedded and real-time systems
"A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]
Strong correlation between embedded and reactive systems
Serious mismatch
Despite considerable progress in software and hardware techniques, when embedded computing systems absolutely must meet tight timing constraints, many of the advances in computing become part of the problem rather than part of the solution.
What would it take to achieve concurrent and networked embedded software that was absolutely positively on time … ? ..What is needed is nearly a reinvention of computer science.
Edward A. Lee: Absolutely Positively On Time: What Would It Take?, Editorial, Draft version: May 18, 2005, Published in: Embedded Systems Column, IEEE Computer, July, 2005
Technology "advances" will make the situation worse
[Figure: speed over years. CPU speed grows by a factor of 1.5-2 p.a., DRAM speed by a factor of 1.07 p.a.; annotation: 2x every 2 years]
Increasing gap between processor and memory speeds
Future semiconductor technology will be inherently unreliable, e.g. due to quantum effects, and will require fault tolerance mechanisms to be used.
Timing "redundancy" used?
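The growth rates on this slide can be turned into a small worked calculation. The sketch below assumes simple compound growth at the quoted annual factors (CPU 1.5 p.a. at the lower end, DRAM 1.07 p.a.); the function name `speed_gap` is illustrative and not part of the original material.

```python
def speed_gap(years, cpu_rate=1.5, dram_rate=1.07):
    """Ratio of CPU speed-up to DRAM speed-up after `years` of compound growth.

    The default rates are the annual factors quoted on the slide; compound
    growth at a constant rate is an assumption made for this sketch.
    """
    return (cpu_rate / dram_rate) ** years


if __name__ == "__main__":
    for y in (1, 2, 4):
        print(f"after {y} year(s): gap grows by a factor of {speed_gap(y):.2f}")
```

With the lower CPU rate of 1.5, the gap after two years is roughly 1.97, which is consistent with the "2x every 2 years" annotation.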
Scratchpad seen to help with timing problems
Fortunately, there is quite a bit to draw on. To name a few examples, architecture techniques such as software-managed caches (scratchpad memories) promise to deliver much of the benefit of memory hierarchy without the timing unpredictability…
[E.Lee, 2005]
Scratch pad memories (SPM): Fast, energy-efficient, timing-predictable
[Figure: address space from 0 to FFF.., containing the scratch pad memory and the main memory]
Example: ARM7TDMI cores, well-known for low power consumption
Called "tightly coupled memory" by ARM
Small; no tag memory
Worst case timing analysis using aiT
[Figure: tool flow. The C program and the SP size are input to encc, which produces an executable; ARMulator yields the actual performance, aiT yields the WCET]
Results for G.721
L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time analysis, (WCET), 2004
L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005
[Figures: results using scratchpad; results using unified cache]
Impact on access time and energy consumption
[Figures: energy; access times (CACTI model for SRAM)]
Small memories also provide faster access time and reduced energy consumption
Energy savings for memory system energy
Static allocation of memory objects
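Which objects (arrays, functions, etc.) should be stored in the SPM?
Gain $g_k$ and size $s_k$ for each object $k$.
Static memory allocation: maximise the gain $G = \sum_k g_k$ while respecting the size of the SPM, $\sum_k s_k \le S_{SP}$ (both sums run over the objects $k$ placed in the SPM).
Solution: knapsack algorithm.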
[Figure: example. Objects of a program (loops, calls, arrays, scalars) are to be placed either in the scratch pad memory (capacity S_SP) next to the processor or in the main memory on the board]
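The static allocation problem above is a 0/1 knapsack problem, which can be sketched with a textbook dynamic program. This is only an illustration of the knapsack formulation, not the authors' implementation; the object names, gains, and sizes in the usage example are made up.

```python
def allocate_static(objects, capacity):
    """Choose memory objects for the SPM by 0/1 knapsack.

    objects:  list of (name, gain, size) tuples with integer sizes
    capacity: SPM capacity, in the same unit as the sizes
    Returns (total gain, tuple of chosen object names).
    """
    # best[c] = (gain, chosen names) achievable with at most c units of SPM
    best = [(0, ())] * (capacity + 1)
    for name, gain, size in objects:
        # Walk capacities downwards so that each object is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical objects: (name, gain, size in bytes)
    objs = [("fir_loop", 60, 400), ("coeffs", 45, 300),
            ("lut", 50, 300), ("buf", 20, 200)]
    print(allocate_static(objs, 600))  # (95, ('coeffs', 'lut'))
```

In the example, the object with the largest individual gain is not selected, because two smaller objects together yield a higher total gain within the same capacity.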
Dynamic replacement within scratch pad
Effectively results in a kind of compiler-controlled swapping for SPM
Address assignment within SPM required (paging or segmentation-like)
M.Verma, P.Marwedel (U. Dortmund): Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004
[Figure: CPU, SPM, and memories]
Dynamic replacement of data within scratch pad: based on liveness analysis
SP Size = |A| = |T3|
Solution: A → SP & T3 → SP
SPILL_STORE(A); SPILL_LOAD(T3);
SPILL_LOAD(A);
[Figure: control flow graph with basic blocks B1 to B10, annotated with DEF A, USE A, USE A, MOD A, USE T3, USE T3]
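The spill operations on this slide can be illustrated with a much simplified sketch. It assumes a straight-line trace instead of a control flow graph and a scratch pad that holds exactly one array at a time (as in SP Size = |A| = |T3|); the helper `insert_spills` is hypothetical and stands in for the liveness-based analysis, which it does not reproduce.

```python
def insert_spills(trace):
    """Insert SPILL_STORE / SPILL_LOAD into a linear trace of accesses.

    trace: list of (op, var) with op in {"DEF", "USE", "MOD"}.
    Assumption: the scratch pad holds a single variable at a time.
    """
    resident, dirty, out = None, False, []
    for i, (op, var) in enumerate(trace):
        if var != resident:
            # The evicted variable is written back only if it was modified
            # while resident and is still accessed later in the trace.
            live_later = any(v == resident for _, v in trace[i:])
            if resident is not None and dirty and live_later:
                out.append(f"SPILL_STORE({resident})")
            # A fresh definition overwrites the data, so no load is needed.
            if op != "DEF":
                out.append(f"SPILL_LOAD({var})")
            resident, dirty = var, False
        if op in ("DEF", "MOD"):
            dirty = True
        out.append(f"{op} {var}")
    return out


if __name__ == "__main__":
    trace = [("DEF", "A"), ("USE", "A"), ("MOD", "A"),
             ("USE", "T3"), ("USE", "T3"), ("USE", "A")]
    print("\n".join(insert_spills(trace)))
```

For this trace the sketch emits SPILL_STORE(A) and SPILL_LOAD(T3) before the first use of T3, and SPILL_LOAD(A) before the final use of A, matching the operations listed on the slide.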
Dynamic replacement within scratch pad: results for edge detection relative to static allocation
Impact of partitioning scratch pads
"main" memory
Scratch pad 2, 16 k entries
Scratch pad 1, 2 k entries
Scratch pad 0, 256 entries
[Figure: the address space, starting at address 0, holds the three scratch pads and the "main" memory]
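A partitioned scratch pad selects the memory purely by address range. The sketch below assumes that the three partitions are mapped contiguously from address 0 in ascending order of size and that addresses count entries; the slide gives only the partition sizes, so this layout is an assumption.

```python
# Partition names and sizes (in entries) as listed on the slide. The
# contiguous, ascending placement starting at address 0 is an assumption.
PARTITIONS = [
    ("scratch pad 0", 256),
    ("scratch pad 1", 2 * 1024),
    ("scratch pad 2", 16 * 1024),
]


def decode(addr):
    """Return the name of the memory that serves entry address `addr`."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    base = 0
    for name, size in PARTITIONS:
        if base <= addr < base + size:
            return name
        base += size
    # Everything above the scratch pads is served by the "main" memory.
    return "main memory"


if __name__ == "__main__":
    for a in (0, 255, 256, 2304, 18688):
        print(a, "->", decode(a))
```

Because the decision depends only on the address, no tag comparison is needed, and small working sets can be served entirely by the smallest partition.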
Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
[Figure label: "Working set"]
Multiple Processes: Non-Saving Context Switch
Non-Saving Context Switch (Non-Saving)
- Partitions SPM into disjoint regions
- Each process is assigned an SPM region
- Copies contents during initialization
- Good for large scratchpads
[Figure: scratchpad divided into separate regions for processes P1, P2, and P3]
Saving/Restoring Context Switch
Saving Context Switch (Saving)
- Utilizes SPM as a common region shared by all processes
- Contents of processes are copied on/off the SPM at context switch
- Good for small scratchpads
[Figure: a single scratchpad region used in turn by processes P1, P2, and P3, with saving/restoring at each context switch]
Hybrid Context Switch
Hybrid Context Switch (Hybrid)
- Disjoint + shared SPM regions
- Good for all scratchpads
- Analysis is similar to the Non-Saving approach
- Runtime: O(nM^3)
[Figure: scratchpad with one region shared by processes P1, P2, and P3, plus one private region per process]
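The three context switch strategies differ only in how the SPM is split into private and shared regions. The sketch below is a hypothetical illustration: it divides the private part equally among the processes (the actual approaches determine region sizes by analysis) and assumes that a switch saves and restores the full shared region.

```python
def layout(strategy, processes, spm_size, shared_size=0):
    """Return (private regions per process, shared region) as address ranges.

    Equal-sized private regions are an assumption made for this sketch.
    """
    if strategy == "non-saving":
        shared = 0                # only disjoint regions
    elif strategy == "saving":
        shared = spm_size         # the whole SPM is one common region
    elif strategy == "hybrid":
        shared = shared_size      # disjoint regions plus one shared region
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    each = (spm_size - shared) // len(processes)
    private = ({p: (i * each, (i + 1) * each) for i, p in enumerate(processes)}
               if each else {})
    shared_region = (spm_size - shared, spm_size) if shared else None
    return private, shared_region


def copy_bytes_per_switch(shared_region):
    """Bytes moved at a context switch: save the outgoing process's shared
    contents and restore the incoming one's (assumes full-region copies)."""
    if shared_region is None:
        return 0
    return 2 * (shared_region[1] - shared_region[0])


if __name__ == "__main__":
    procs = ["P1", "P2", "P3"]
    for strategy, size, shared in (("non-saving", 3072, 0),
                                   ("saving", 1024, 0),
                                   ("hybrid", 4096, 1024)):
        private, shared_region = layout(strategy, procs, size, shared)
        print(strategy, private, shared_region,
              copy_bytes_per_switch(shared_region))
```

Under these assumptions, Non-Saving incurs no copying at a switch but gives each process only a fraction of the SPM, Saving gives each process the whole SPM at the price of copying, and Hybrid sits between the two.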
Multi-process Scratchpad Allocation: Results
Hybrid is the best for all SPM sizes.
Energy reduction @ 4kB SPM is 27% for Hybrid approach.
Avoids poor timing predictability of cache-based system after context switch.
[Figure: energy consumption (mJ) vs. scratchpad size (bytes; 64, 128, 256, 512, 1024, 2048, 4096) for edge detection, adpcm, g721, mpeg. Series: Energy (SPA), Energy (Non-Saving), Energy (Saving), CopyEnergy (Saving), Energy (Hybrid), CopyEnergy (Hybrid). Annotation: 27%]
SPA: Single Process Approach
Multi-processor ARM (MPARM) Framework
– Homogeneous SMP ~ CELL processor
– Processing unit: ARM7T processor
– Shared coherent main memory
– Private memory: scratchpad memory
[Figure: four ARM cores, each with its own SPM, connected via an interconnect (AMBA or STBus) to the shared main memory, an interrupt device, and a semaphore device]
Using optimization in a gcc-based tool flow
The source is split into 2 different files by a specially developed memory optimizer tool*.
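[Figure: tool flow. The application source (.c) and the profile info (.txt) are input to the memory optimizer (ICD-C compiler), which emits the main memory source (.c), the SPM source (.c), and a linker script (.ld); the ARM-GCC compiler produces the executable (.exe)]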
*Built with new tool design suite ICD-C available from ICD (see www.icd.de/es)
Results (MOMPARM)
[Figure: energy consumption [µJ] vs. scratchpad size [bytes], for sizes 0, 128, 256, 512, 1k, 2k, 4k, 8k, 16k]
DES encryption: 4 processors (2 controllers + 2 compute engines)
Energy values from ST Microelectronics
Result of ongoing cooperation between U. Bologna and U. Dortmund supported by ARTIST2 network of excellence.
State of the art of SPM algorithms
Feature                 Static allocation                       Dynamic allocation
Partitioned SPMs        Wehmeyer et al. [WMPI 2004]             -
WCET analysis           Wehmeyer et al. [WS WCET 04, DATE 05]   Wehmeyer et al. [Thesis]
Multiple processes      Verma et al. [ISSS 2004]                Future work
Multiprocessor systems  Verma et al. [Estimedia 2005]           Verma et al. [ongoing work]
Sections from arrays    Not always applicable                   IMEC (MHLA), Kandemir
Extension: WCET-aware compiler
[Figure: compiler flow. ANSI-C program → ANSI-C frontend → parse tree → IR code generator → medium-level IR → LLIR code generator → low-level IR → code generator → WCET-optimized assembly code. Analyses and optimization techniques operate on the IRs. LLIR2crl converts the low-level IR into CRL2, the standard input to aiT (loop bounds analysis, value analysis, cache analysis, pipeline analysis, path analysis); crl2llir feeds CRL2 with WCET info back into the compiler. Label: ARTIST2]
Opportunities
Precise WCET information for run-time optimizations
- Single implementation of hardware timing models
- Accurate information on pipeline influence
- Accurate information on timing of memory
- Trade-off cache vs. scratchpad optimization
Pass additional information (flow facts) to aiT
Potential for tighter bounds? (e.g. due to pointer disambiguation)
Aggressive optimizations for code on WCET path
Respecting WCET constraints during compilation
Reduction of jitter in multimedia applications
Alternative input to aiT (compare compiler output)
Conclusion
Timeliness and timing predictability seriously missing in key concepts of current information technology
Scratchpads are seen as a potential contribution towards new architectural concepts
- Comprehensive set of allocation methods has been developed
  • Static allocation
  • Dynamic allocation
Full integration of WCET tools into compiler tool chain enables further explicit considerations of time.