
Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI Design, Diagnostics and Architecture

Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache

Martin Zabel, Thomas B. Preußer, Rainer G. Spallek

JTRES'12

[email protected] http://vlsi-eda.inf.tu-dresden.de

25.10.2012


Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion


Motivation

Why Java?
- Object orientation, portability
- Automatic memory management, security
- Support for thread parallelization

Why a Java (bytecode) processor?
- Native execution of Java bytecode → no OS, no interpretation, no re-compilation
- Real-time
- Suited for embedded systems with limited resources

Why multi-core processors?
- Power consumption grows more than proportionally with clock frequency.
- Use thread-level parallelism instead.

Java Multi-Core Processor

Examples: JopCMP, jamuth, REALJava and SHAP

Common property: central shared heap for all cores

SHAP Multi-Core:
- Local stack memory per core
- Method cache per core
- Pipelining of heap accesses
- Concurrent GC for real-time applications
- Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
- CLDC, constant-time interface method dispatch, …

[Figure: SHAP Multi-Core Architecture. Cores 0 to n-1 (number configurable), each with its own stack and method cache; memory manager with garbage collector; memory controller for SRAM or DDR-RAM (DDR: 16, SDR: 32); Wishbone bus with UART, graphics unit, Ethernet MAC, and DMA.]

Goal



Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.


Related Work

Common solution for object-oriented processors: a cache for objects, in analogy to data caches [2] [3]

[Figure: object cache lookup. The object reference serves as the tag, and the offset selects the data word within the cached object content.]

Especially for real-time systems: separate caches for different data areas [4]:
- Classic data cache for data at static addresses (e.g. class data)
- Object cache for data at dynamic addresses (e.g. objects)

Indirect Object-Addressing

Problem of JopCMP and SHAP:
- Object table stored in external memory.
- Additional latency for each heap access.
- Additional demand on memory bandwidth.

Solution:
- Translation look-aside buffer (TLB) [1]
- Virtually indexed object cache [2]

[Figure: indirect object addressing. A reference on the stack selects an entry in the object table, which in turn points to the object in the heap.]
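The cost of the object table, and what a small translation buffer saves, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not SHAP's implementation: the class name `HandleMemory`, the dictionary-based object table and heap, and the access counter are all hypothetical. It counts external memory accesses for handle-based reads with and without a small LRU translation buffer.

```python
from collections import OrderedDict


class HandleMemory:
    """Toy model of handle-based (indirect) object addressing.

    Hypothetical illustration: names and structure are not taken from SHAP.
    """

    def __init__(self, tlb_entries=0):
        self.object_table = {}      # handle -> physical base address
        self.heap = {}              # physical address -> word
        self.tlb = OrderedDict()    # small LRU buffer: handle -> base address
        self.tlb_entries = tlb_entries
        self.external_accesses = 0  # accesses that go to external memory

    def allocate(self, handle, base, words):
        """Register an object and place its words in the heap."""
        self.object_table[handle] = base
        for offset, word in enumerate(words):
            self.heap[base + offset] = word

    def _translate(self, handle):
        """Map a handle to its base address, using the TLB if possible."""
        if handle in self.tlb:
            self.tlb.move_to_end(handle)       # refresh LRU position
            return self.tlb[handle]
        self.external_accesses += 1            # object-table lookup in memory
        base = self.object_table[handle]
        if self.tlb_entries > 0:
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)   # evict least recently used
            self.tlb[handle] = base
        return base

    def read(self, handle, offset):
        base = self._translate(handle)
        self.external_accesses += 1            # the actual heap access
        return self.heap[base + offset]
```

Without a translation buffer, every read in this model costs two external accesses; with the buffer, repeated reads of the same object cost one each after the first translation.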

Cache Coherence

Problem: Coherence of distributed caches

Advantage of the Java Virtual Machine: synchronization is required only when [5]
- entering a critical section, or
- accessing a “volatile” variable.


Heap-Access Analysis

Evaluation:
- Benchmark suite JemBench (Version 2.0) [6], all except microkernel benchmarks
- SHAP Multi-Core with 1 core and trace unit [1]
- Recording of executed bytecodes and memory accesses




[Figure: memory bandwidth utilization [%] per benchmark (AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp), broken down into data accesses (core itself), bytecode fetches (method cache), and accesses for memory management. Annotated value: 11%.]


Results:
1. The most frequent object accesses are reads on arrays (1) and member variables.
2. 80% of all object accesses concentrate on 6 objects.
3. Frequent accesses to the first user-specific object offsets (-2 and 1).

→ Further evaluation: small fully associative cache for each core

(1) Already accounts for implicit reads of the array length.
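Statistics of this kind can be derived from a recorded access trace with a few lines of code. The following is a minimal sketch, assuming a trace of `(object_handle, offset)` pairs; this record format and the function names are hypothetical and do not describe the output of the SHAP trace unit.

```python
from collections import Counter


def top_object_share(trace, k):
    """Return the fraction of accesses that go to the k most accessed objects.

    `trace` is an iterable of (object_handle, offset) pairs; this record
    format is an assumption made for illustration.
    """
    per_object = Counter(handle for handle, _ in trace)
    total = sum(per_object.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in per_object.most_common(k))
    return top / total


def offset_histogram(trace):
    """Count accesses per object offset."""
    return Counter(offset for _, offset in trace)
```

A result such as "80% of all object accesses concentrate on 6 objects" corresponds to `top_object_share(trace, 6)` being about 0.8, and the most frequent offsets are the largest entries of `offset_histogram(trace)`.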

Small Fully Associative Local Cache

Storing only invariant data:
- Would require no extra logic for cache coherency.
- In general, only the class information pointer and the array size are invariant.
- Significant reduction only for array-intensive programs:
  • BubbleSort: 33% of all memory accesses
  • Sieve: 20%
  • MatrixMul: 17%

Write-through instead of write-back:
- No special GC interaction required.
- Simple cache coherence: invalidate the cache when
  • entering a critical section, or
  • accessing a “volatile” variable.
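The write-through policy with invalidation at synchronization points can be sketched as follows. This is an illustrative model only: the class `WriteThroughCache`, the dictionary standing in for the shared heap, and the omission of capacity and replacement are assumptions, not details of the SHAP design.

```python
class WriteThroughCache:
    """Per-core cache model: write-through, invalidated on synchronization.

    Hypothetical sketch; capacity and replacement policy are omitted.
    """

    def __init__(self, shared_memory):
        self.memory = shared_memory   # dict shared by all cores
        self.lines = {}               # (handle, offset) -> cached word

    def read(self, handle, offset):
        key = (handle, offset)
        if key not in self.lines:     # miss: fetch from the shared heap
            self.lines[key] = self.memory[key]
        return self.lines[key]

    def write(self, handle, offset, value):
        key = (handle, offset)
        self.memory[key] = value      # write-through: memory is always current
        self.lines[key] = value

    def invalidate(self):
        """Call on entering a critical section or on a volatile access."""
        self.lines.clear()
```

Because every write reaches the shared memory immediately, a core never holds the only up-to-date copy of a word. A stale local copy is therefore only a visibility problem, and discarding the local contents at the two synchronization points is enough to resolve it.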


Cache Design (1)

Cache integration: into the memory manager port

Core modifications:
- Bytecode to access volatile variables
- Microcode to invalidate cache

[Figure: SHAP Multi-Core with AOC (excerpt). Core 0 with stack and method cache; the AOC sits on the data port between the core and the memory manager (with garbage collector and memory controller for SRAM or DDR-RAM; DDR: 16, SDR: 32); Wishbone bus.]

Features:
- Address & Offset Cache (AOC) with write-through, LRU strategy
- 1 valid bit per cached word
- Configurable: number of cache lines, cached offsets

Cache Design (2)

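[Figure: Address and Offset Cache (AOC). The core supplies an object handle (from the stack) and an offset; the object table and heap reside in external memory. Each cache line (0 to 3 shown) is tagged with an object handle. The address memory holds the object's physical address; the valid & data memory holds one valid bit (V) and one data word per cached offset. In the example, an object has Data W, X, Y, Z at offsets -2, -1, 0, 1, and the line caches offset -2 (Data W) and offset 1 (Data Z).]

Disadvantage: additional latency of 1 clock cycle on a cache miss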
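The line organization of the AOC can be made concrete with a small behavioural model. This is a sketch under stated assumptions rather than the SHAP hardware: the class and method names are hypothetical, timing is not modelled, writes are assumed to allocate a line, and invalidation is assumed to clear only the data valid bits while keeping the cached addresses.

```python
from collections import OrderedDict


class AddressOffsetCache:
    """Toy model of a fully associative address and offset cache.

    Each line is tagged with an object handle and stores the object's
    physical address plus one (valid, data) slot per cached offset.
    """

    def __init__(self, num_lines=8, cached_offsets=(-2, 1)):
        self.num_lines = num_lines
        self.cached_offsets = tuple(cached_offsets)
        # handle -> {"addr": physical address, "data": {offset: word}}
        self.lines = OrderedDict()

    def _line(self, handle, object_table):
        """Return (line, address_hit); allocate with LRU eviction on a miss."""
        if handle in self.lines:
            self.lines.move_to_end(handle)     # refresh LRU position
            return self.lines[handle], True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)     # evict least recently used
        line = {"addr": object_table[handle], "data": {}}
        self.lines[handle] = line
        return line, False

    def read(self, handle, offset, object_table, heap):
        """Return (word, kind); kind is 'data hit', 'address hit' or 'miss'."""
        line, address_hit = self._line(handle, object_table)
        if offset in line["data"]:             # valid bit set for this word
            return line["data"][offset], "data hit"
        word = heap[line["addr"] + offset]
        if offset in self.cached_offsets:
            line["data"][offset] = word        # fill slot, set valid bit
        return word, ("address hit" if address_hit else "miss")

    def write(self, handle, offset, word, object_table, heap):
        line, _ = self._line(handle, object_table)
        heap[line["addr"] + offset] = word     # write-through
        if offset in self.cached_offsets:
            line["data"][offset] = word

    def invalidate(self):
        """Assumption: clear valid bits only, keep cached addresses."""
        for line in self.lines.values():
            line["data"].clear()
```

In this model an "address hit" still saves the object-table lookup even when the requested word itself is not cached, which corresponds to the "address only" configuration.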

Cache Configuration

Huge configuration space:
- Cache lines: $l \in \mathbb{N},\ l > 0$
- Offsets: none (address only), or range $[-x, y]$ with $x, y \in \mathbb{N},\ x > 0$

But synthesis for $n = 1, 2, \ldots, 18$ cores is too expensive.
→ Search for a good initial configuration.

Configuration space exploration:
- Baseline design for comparison: TLB with 2 entries
- Benchmarks:
  • JemBench
  • JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
- Platform:
  • SHAP on Virtex-5 FPGA XC5VLX110T
  • Same clock frequency of 80 MHz as baseline design

Setup: 1 Core, Cache Address Only

[Figure: relative performance (y-axis) over cache configuration (x-axis: baseline design, 2, 4, 8, 16, 32, 64 lines) for the benchmarks SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve, and BS.]

→ Use 8 lines

Setup: 1 Core, 8 Lines

[Figure: relative performance (y-axis) over cache configuration (x-axis: only address, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for the benchmarks SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve, and BS.]

→ Cache offsets -2 & 1

Speed-Up of JemBench MatrixMul

Matrix multiplication: default with 20x20 matrix

Reference value: single-core performance

Ideal speed-up: 10 for n ≥ 10 cores

Maximum speed-up: 8.3 → 9.4 at 10 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]

Speed-Up of JemBench MatrixMul (N=90)

Matrix multiplication: extended to 90x90 matrix

Reference value: single-core performance

Ideal speed-up: 15 for 15 to 17 cores

Maximum speed-up: 9.5 → 14.0 at 15 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]
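The ideal speed-up values quoted for MatrixMul (10 for n ≥ 10 cores with the 20x20 matrix, 15 for 15 to 17 cores with the 90x90 matrix) are consistent with a row-wise work split in which the slowest core determines the run time. The slides do not state how the benchmark partitions its work, so the following is only a sketch under that assumption; the function name is hypothetical.

```python
import math


def ideal_speedup(rows, cores):
    """Ideal speed-up when `rows` equal work items are split over `cores`.

    Assumes the most loaded core processes ceil(rows / cores) items and
    that there is no other overhead.
    """
    return rows / math.ceil(rows / cores)
```

Under this assumption, 20 rows on 10 to 18 cores always leave some core with 2 rows, which caps the speed-up at 10. With 90 rows, 15 to 17 cores leave 6 rows on the most loaded core, giving 15.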

Speed-Up of JGF SparseMatMult with Integer

Matrix multiplication: reduced to a 4,000x4,000 matrix with 16,000 non-zero integer elements

Reference value: single-core performance

Maximum speed-up: 7.6 → 14.7 / 15.5 at 17 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -8 to 7; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]

FPGA Resource Usage

Synthesis results for 8 cache lines, offsets -2 & 1:
- Up to 17 cores at 80 MHz (same as baseline design) fit on the XC5VLX110T
- Additional hardware resources per core:
  • For up to 15 cores:

    Configuration     Registers   LUTs    RAMB36
    Address only      +8%         +3%*    -
    Offsets -2 & 1    +9%         +8%*    -
    Offsets [-8,7]    +17%        +9%     +1

    * = includes distributed RAM

  • 16 and 17 cores: disproportionately higher due to high FPGA utilization.


Conclusion

Baseline design: SHAP Multi-Core with 2-line TLB

Realized object cache:
- Fully associative combined address and offset cache
- Configurable in number of lines and cached offsets

Results:
- With 4 cache lines: amortization of the additional latency
- With 8 cache lines:
  • Maximum speed-up increased from formerly 8 to 9 to now 14.
  • Single-core performance increased by 12 to 21%.

→ Up to twice the absolute compute performance.
→ Requires only 9% more hardware resources.

Thank you for your attention!




Questions?

Selected Literature
1. Zabel, M.: Effiziente Mehrkernarchitektur für eingebettete Java-Bytecode-Prozessoren. Dissertation, Technische Universität Dresden, 2012
2. Vijaykrishnan, N.; Ranganathan, N.: Supporting object accesses in a Java processor. In: IEE Proc. Computers and Digital Techniques 147 (2000), No. 6, pp. 435–443. ISSN 1350-2387
3. Wright, G.; Seidl, M. L.; Wolczko, M.: An object-aware memory architecture. In: Sci. Comput. Program. 62 (2006), No. 2, pp. 145–163. ISSN 0167-6423
4. Huber, B.; Puffitsch, W.; Schoeberl, M.: WCET driven design space exploration of an object cache. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10). ACM, 2010, pp. 26–35
5. Lindholm, T.; Yellin, F.: The Java(TM) Virtual Machine Specification. 2nd edition. Amsterdam: Addison-Wesley Longman, 1999. ISBN 978-0201432947
6. Schoeberl, M.; Preusser, T. B.; Uhrig, S.: The embedded Java benchmark suite JemBench. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10), pp. 120–127
7. Java Grande Forum Benchmark Suite. http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html
