Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI Design, Diagnostics and Architecture
Increasing the Efficiency of an Embedded Multi-Core Bytecode
Processor Using an Object Cache
Martin Zabel, Thomas B. Preußer, Rainer G. Spallek
JTRES'12
http://vlsi-eda.inf.tu-dresden.de
25.10.2012
Outline
1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion
Martin Zabel: Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor using an Object Cache. Slide 2 of 28
Motivation
Why Java?
• Object orientation, portability
• Automatic memory management, security
• Support for thread parallelization

Why a Java (bytecode) processor?
• Native execution of Java bytecode: no OS, no interpretation, no re-compilation
• Real-time capability
• Suited for embedded systems with limited resources

Why multi-core processors?
• Power consumption increases more than proportionally with clock frequency.
• Use thread-level parallelism instead.
Java Multi-Core Processor
Examples: JopCMP, jamuth, REALJava and SHAP
Common property: central shared heap for all cores
SHAP Multi-Core:
• Local stack memory per core
• Method cache per core
• Pipelining of heap accesses
• Concurrent GC for real-time applications
• Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
• CLDC, constant-time interface method dispatch, …

[Figure: SHAP Multi-Core Architecture. Labels: Core 0 to Core n-1 (configurable), each with method cache and stack; memory manager with garbage collector; memory controller for SRAM or DDR-RAM (DDR: 16, SDR: 32); Wishbone bus with DMA, UART, graphics unit and Ethernet MAC; data and code paths.]
Goal
Further reduce the demands on the heap memory interface
to achieve higher speed-ups through thread-level parallelism.
Related Work
Common solution for object-oriented processors: a cache for objects, analogous to data caches [2] [3].

[Figure: object-cache organization. Labels: tag (object reference, offset) and data (object content).]

Especially for real-time systems: separate caches for different data areas [4]:
• Classic data cache for data at static addresses (e.g. class data)
• Object cache for data at dynamic addresses (e.g. objects)
Indirect Object-Addressing
Problem of JopCMP and SHAP:
• The object table is stored in external memory.
• Additional latency for each heap access.
• Additional demand on memory bandwidth.

[Figure: indirect addressing. Labels: stack, object table, heap.]

Solution:
• Translation look-aside buffer (TLB) [1]
• Virtually indexed object cache [2]
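The cost of indirect object addressing can be illustrated with a small model. This is only a sketch under stated assumptions: the class `IndirectHeap`, the function `count_accesses`, the handles and the addresses are all hypothetical, and the TLB is modelled as a tiny LRU map from handle to base address. It is not a description of the SHAP or JopCMP hardware.

```python
from collections import OrderedDict


class IndirectHeap:
    """Toy model: handles are resolved through an object table in external memory."""

    def __init__(self, tlb_entries=0):
        self.object_table = {}      # handle -> physical base address (external memory)
        self.memory = {}            # physical address -> word (external memory)
        self.tlb = OrderedDict()    # handle -> base address, kept in LRU order
        self.tlb_entries = tlb_entries
        self.external_accesses = 0  # counts every access to external memory

    def allocate(self, handle, base, words):
        self.object_table[handle] = base
        for offset, word in enumerate(words):
            self.memory[base + offset] = word

    def _resolve(self, handle):
        if handle in self.tlb:
            self.tlb.move_to_end(handle)      # TLB hit: no external access
            return self.tlb[handle]
        self.external_accesses += 1           # object-table lookup in external memory
        base = self.object_table[handle]
        if self.tlb_entries > 0:
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)  # evict the least recently used entry
            self.tlb[handle] = base
        return base

    def read(self, handle, offset):
        base = self._resolve(handle)
        self.external_accesses += 1           # the data access itself
        return self.memory[base + offset]


def count_accesses(tlb_entries):
    """Dot product of two 4-element arrays; returns (result, external accesses)."""
    heap = IndirectHeap(tlb_entries)
    heap.allocate("a", 100, [1, 2, 3, 4])
    heap.allocate("b", 200, [5, 6, 7, 8])
    total = 0
    for i in range(4):
        total += heap.read("a", i) * heap.read("b", i)
    return total, heap.external_accesses
```

In this toy workload, every read costs two external accesses without a TLB, while a two-entry TLB leaves only the two initial object-table lookups on top of the data accesses.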
Cache Coherence
Problem: coherence of distributed caches

Advantage of the Java Virtual Machine: synchronization is required only when [5]
• entering a critical section, or
• accessing a “volatile” variable.
Heap-Access Analysis
Evaluation:
• Benchmark suite JemBench (Version 2.0) [6], all except microkernel benchmarks
• SHAP Multi-Core with 1 core and trace unit [1]
• Recording of executed bytecodes and memory accesses
[Figure: memory bandwidth utilization [%] (axis from 0 to 25) per benchmark (AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp), split into data accesses (core itself), bytecode fetches (method cache) and accesses for memory management. Annotated value: 11%.]
Results:
1. The most frequent object accesses are reads on arrays (already accounting for implicit reads of the array length) and member variables.
2. 80% of all object accesses concentrate on 6 objects.
3. Frequent accesses to the first user-specific object offsets (-2 and 1).

→ Further evaluation: small fully associative cache for each core
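How such concentration figures can be derived from a recorded trace is sketched below. The trace format (pairs of object handle and offset), the function names and the sample data are assumptions made for illustration; they are not the format of the SHAP trace unit, and the numbers are made up.

```python
from collections import Counter


def coverage_of_top_objects(trace, k):
    """Fraction of all accesses in `trace` that go to the k most-accessed objects."""
    counts = Counter(handle for handle, _offset in trace)
    top = sum(n for _handle, n in counts.most_common(k))
    return top / len(trace)


def offset_histogram(trace):
    """Number of accesses per object offset."""
    return Counter(offset for _handle, offset in trace)


# Made-up trace of (object handle, offset) pairs, for illustration only.
trace = [
    ("arr", -2), ("arr", 1), ("arr", -2), ("arr", 2),
    ("obj", 1), ("tmp", 0), ("arr", -2), ("obj", 1),
]

top1 = coverage_of_top_objects(trace, 1)
top2 = coverage_of_top_objects(trace, 2)
offsets = offset_histogram(trace)
```

A high coverage for a small k suggests that a cache with few lines suffices, and a histogram dominated by a few offsets suggests which offsets are worth caching.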
Small Fully Associative Local Cache
Storing only invariant data:
• Would require no extra logic for cache coherency.
• In general, only the class information pointer and the array size are invariant.
• Significant reduction only for array-intensive programs.

  BubbleSort  33% of all memory accesses
  Sieve       20%
  MatrixMul   17%

Write-through instead of write-back:
• No special GC interaction required.
• Simple cache coherence: invalidate the cache when
  - entering a critical section, or
  - accessing a “volatile” variable.
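The write-through policy with invalidation at synchronization points can be sketched as follows. This is a minimal model, not the hardware: the class `WriteThroughCache`, the dictionary used as shared heap and the key format are hypothetical, and the cache is unbounded for brevity.

```python
class WriteThroughCache:
    """Per-core cache: writes go straight to shared memory, reads may hit locally."""

    def __init__(self, shared_memory):
        self.shared = shared_memory
        self.lines = {}

    def read(self, key):
        if key not in self.lines:
            self.lines[key] = self.shared[key]  # miss: fetch from the shared heap
        return self.lines[key]

    def write(self, key, value):
        self.shared[key] = value                # write-through: memory is always current
        self.lines[key] = value

    def invalidate(self):
        """Called on entering a critical section or accessing a volatile variable."""
        self.lines.clear()


shared = {("obj", 0): 1}
core0 = WriteThroughCache(shared)
core1 = WriteThroughCache(shared)

first = core1.read(("obj", 0))  # core1 caches the value 1
core0.write(("obj", 0), 2)      # core0 updates; shared memory now holds 2
stale = core1.read(("obj", 0))  # without synchronization core1 still sees 1
core1.invalidate()              # e.g. core1 enters a critical section
fresh = core1.read(("obj", 0))  # re-fetched from shared memory
```

Because shared memory is always up to date, no write-back of dirty lines is needed; dropping the local copies at a synchronization point is enough for the next read to observe the other core's write.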
Cache Design (1)
Cache integration: into memory manager port

Core modifications:
• Bytecode to access volatile variables
• Microcode to invalidate cache

[Figure: SHAP Multi-Core with AOC (excerpt). Labels: Core 0 (configurable) with method cache and stack; AOC at the data port; memory manager with garbage collector; memory controller for SRAM or DDR-RAM (DDR: 16, SDR: 32); Wishbone bus; data and code paths.]

Features:
• Address & Offset Cache (AOC) with write-through, LRU strategy
• 1 valid bit per cached word
• Configurable: number of cache lines, cached offsets
Cache Design (2)
[Figure: Address and Offset Cache (AOC) between core and external memory. An object handle from the stack selects an entry of the object table, which holds the physical address of the object in the heap (objects 1 to 3; data words W, X, Y, Z at offsets -2 to 1). Each cache line (0 to 3) uses the object handle as tag and stores the physical address in the address memory, plus a valid bit (V) and data word per cached offset in the valid & data memory (here offset -2 with data W and offset 1 with data Z).]

Disadvantage: additional latency of 1 clock cycle on a cache miss
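The line organization shown in the figure can be approximated in software. The sketch below is a behavioural toy model under several assumptions: the class name `AddressOffsetCache`, the dictionaries standing in for object table and heap, and the access counter are all hypothetical, and the presence of a key in a line's `data` dictionary stands in for the valid bit. Timing, including the extra miss cycle, is not modelled.

```python
from collections import OrderedDict


class AddressOffsetCache:
    """Toy model of a fully associative address and offset cache with LRU replacement."""

    def __init__(self, object_table, memory, lines=8, cached_offsets=(-2, 1)):
        self.object_table = object_table  # handle -> physical address
        self.memory = memory              # physical address -> word
        self.capacity = lines
        self.cached_offsets = tuple(cached_offsets)
        self.lines = OrderedDict()        # handle (tag) -> {"addr": int, "data": {offset: word}}
        self.external_accesses = 0

    def _line(self, handle):
        if handle in self.lines:
            self.lines.move_to_end(handle)    # mark as most recently used
            return self.lines[handle]
        self.external_accesses += 1           # object-table lookup
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        line = {"addr": self.object_table[handle], "data": {}}
        self.lines[handle] = line
        return line

    def read(self, handle, offset):
        line = self._line(handle)
        if offset in line["data"]:            # valid bit set: served from the cache
            return line["data"][offset]
        self.external_accesses += 1
        word = self.memory[line["addr"] + offset]
        if offset in self.cached_offsets:
            line["data"][offset] = word       # fills the word and sets its valid bit
        return word

    def write(self, handle, offset, word):
        line = self._line(handle)
        self.external_accesses += 1           # write-through to external memory
        self.memory[line["addr"] + offset] = word
        if offset in self.cached_offsets:
            line["data"][offset] = word

    def invalidate(self):
        self.lines.clear()


object_table = {"h": 100}
memory = {98: 7, 99: 0, 100: 0, 101: 42, 102: 5}
aoc = AddressOffsetCache(object_table, memory, lines=8, cached_offsets=(-2, 1))
values = [aoc.read("h", off) for off in (-2, 1, 2, -2, 1, 2)]
accesses_after_reads = aoc.external_accesses
```

In this example the six reads need five external accesses instead of the twelve that fully indirect addressing would take: offsets -2 and 1 are served from the line on their second use, while offset 2 still benefits from the cached physical address.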
Cache Configuration
Huge configuration space:
• Cache lines: $l \in \mathbb{N}$, $l > 0$
• Offsets: none (address only), or a range $[-x, y]$ with $x, y \in \mathbb{N}$, $x > 0$

But synthesis for $n = 1, 2, \dots, 18$ cores is too expensive.
→ Search for a good initial configuration.

Configuration space exploration:
• Baseline design for comparison: TLB with 2 entries
• Benchmarks:
  - JemBench
  - JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
• Platform:
  - SHAP on Virtex-5 FPGA XC5VLX110T
  - Same clock frequency of 80 MHz as baseline design
Setup: 1 Core, Cache Address Only
[Figure: relative performance (axis from 0.96 to 1.14) versus cache configuration (baseline design, 2, 4, 8, 16, 32 and 64 lines) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

→ Use 8 lines
Setup: 1 Core, 8 Lines
[Figure: relative performance (axis from 1 to 1.45) versus cache configuration (only address, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

→ Cache offsets -2 & 1
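The two setup slides fix one parameter at a time: first the number of lines with an address-only cache, then the cached offsets with 8 lines. The sketch below counts how many design points such a staged search needs compared with a full cross product. It assumes that the configurations on the x-axes of the two plots are exactly the ones that were evaluated; the variable names are illustrative.

```python
# Configurations taken from the x-axes of the two plots.
line_counts = [2, 4, 8, 16, 32, 64]
offset_configs = [
    "address only", (-1, 0), (-2, 1), (-4, 3), (-8, 7),
    (-16, 15), (-8, 23), (-32, 31), (-16, 47), (-8, 55),
]

# Exhaustive search: every line count combined with every offset configuration.
exhaustive = [(lines, offs) for lines in line_counts for offs in offset_configs]

# Staged search: vary the line count first, then the offsets for the chosen line count.
stage1 = [(lines, "address only") for lines in line_counts]
chosen_lines = 8
stage2 = [(chosen_lines, offs) for offs in offset_configs]

# dict.fromkeys keeps order and drops the design point shared by both stages.
staged = list(dict.fromkeys(stage1 + stage2))
```

Under these assumptions the staged search covers 15 design points instead of 60, at the risk of missing interactions between line count and cached offsets.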
Speed-Up of JemBench MatrixMul
Matrix multiplication: default with 20x20 matrix
Reference value: single-core performance
Ideal speed-up: 10 for n ≥ 10 cores
Speed-up maximum: 8.3 → 9.4 at 10 cores

[Figure: speed-up S (axis from 2 to 18) versus number of processor cores n (2 to 18) for: ideal, AOC with 8 lines and offsets -2 and 1, AOC with 8 lines and address only, baseline design.]
Speed-Up of JemBench MatrixMul (N=90)
Matrix multiplication: extended to 90x90 matrix
Reference value: single-core performance
Ideal speed-up: 15 for 15 to 17 cores
Speed-up maximum: 9.5 → 14.0 at 15 cores

[Figure: speed-up S (axis from 2 to 18) versus number of processor cores n (2 to 18) for: ideal, AOC with 8 lines and offsets -2 and 1, AOC with 8 lines and address only, baseline design.]
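The plateaus of the ideal speed-up can be reproduced with a simple model. The sketch assumes that the work is split into N equal, indivisible units (for example matrix rows) that are distributed over the cores with no overhead; this partitioning is an assumption that happens to match the ideal values on the slides, not something the slides state. The function name is illustrative.

```python
import math


def ideal_speedup(units, cores):
    """Speed-up if `units` equal work items are spread over `cores` without overhead.

    The slowest core processes ceil(units / cores) items, which bounds the run time.
    """
    return units / math.ceil(units / cores)


small = [ideal_speedup(20, n) for n in (10, 14, 18)]      # 20x20 matrix
large = [ideal_speedup(90, n) for n in (15, 16, 17, 18)]  # 90x90 matrix
```

With 20 units, any core count from 10 to 18 leaves at least one core with two units, so the ideal speed-up stays at 10. With 90 units, 15 to 17 cores all leave one core with six units, giving 15, and only 18 cores reach five units per core.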
Speed-Up of JGF SparseMatMult with Integer
Matrix multiplication: reduced to a 4,000 x 4,000 matrix with 16,000 non-zero integer elements
Reference value: single-core performance
Speed-up maximum: 7.6 → 14.7 / 15.5 at 17 cores

[Figure: speed-up S (axis from 2 to 18) versus number of processor cores n (2 to 18) for: ideal, AOC with 8 lines and offsets -8 to 7, AOC with 8 lines and offsets -2 and 1, AOC with 8 lines and address only, baseline design.]
FPGA Resource Usage
Synthesis results for 8 cache lines, offsets -2 & 1:
• Up to 17 cores at 80 MHz (same as baseline design) fit on XC5VLX110T
• Additional hardware resources per core:
  - For up to 15 cores:

    Configuration     Registers   LUTs    RAMB36
    Address only      +8%         +3%*    -
    Offsets -2 & 1    +9%         +8%*    -
    Offsets [-8, 7]   +17%        +9%     +1

    * = includes distributed RAM

  - 16 and 17 cores: more than proportional due to high FPGA utilization.
Conclusion
Baseline design: SHAP Multi-Core with 2-entry TLB

Realized object cache:
• Fully associative combined address and offset cache
• Configurable in number of lines and cached offsets

Results:
• With 4 cache lines: amortization of the additional latency
• With 8 cache lines:
  - Maximum speed-up increased from 8-9 to 14.
  - Single-core performance increased by 12 to 21%.

→ Up to twice the absolute compute performance.
→ Requires only 9% more hardware resources.
Thank you for your attention!
Questions?
Selected Literature
1. Zabel, Martin: Effiziente Mehrkernarchitektur für eingebettete Java-Bytecode-Prozessoren. PhD thesis, Technische Universität Dresden, 2012
2. Vijaykrishnan, N.; Ranganathan, N.: Supporting object accesses in a Java processor. In: IEE Proc. Computers and Digital Techniques 147 (2000), No. 6, pp. 435–443. ISSN 1350–2387
3. Wright, G.; Seidl, M. L.; Wolczko, M.: An object-aware memory architecture. In: Sci. Comput. Program. 62 (2006), No. 2, pp. 145–163. ISSN 0167–6423
4. Huber, B.; Puffitsch, W.; Schoeberl, M.: WCET driven design space exploration of an object cache. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10). ACM, 2010, pp. 26–35
5. Lindholm, T.; Yellin, F.: The Java(TM) Virtual Machine Specification. 2nd edition. Amsterdam: Addison-Wesley Longman, 1999. ISBN 978–0201432947
6. Schoeberl, Martin; Preusser, Thomas B.; Uhrig, Sascha: The embedded Java benchmark suite JemBench. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10), pp. 120–127
7. Java Grande Forum Benchmark Suite. http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html