Increasing the Efficiency of an Embedded Multi-Core...

Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI-Design, Diagnostic und Architecture Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache Martin Zabel, Thomas B. Preußer, Rainer G. Spallek JTRES‘12 [email protected] http://vlsi-eda.inf.tu-dresden.de 25.10.2012

Page 1: Increasing the Efficiency of an Embedded Multi-Core ...jtres2012.imm.dtu.dk/slides/zabelps_jtres12_slides.pdf · 1.35 1.4 1.45 Relative Performance SparseMatmultInt MatrixMul (N=20)

Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI-Design, Diagnostics and Architecture

Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache

Martin Zabel, Thomas B. Preußer, Rainer G. Spallek

JTRES'12

[email protected] http://vlsi-eda.inf.tu-dresden.de

25.10.2012

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Martin Zabel: Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor using an Object Cache (Slide 2 of 28)

Motivation

Why Java?
- Object orientation, portability
- Automatic memory management, security
- Support for thread parallelization

Why a Java (bytecode) processor?
- Native execution of Java bytecode
  • no OS, no interpretation, no re-compilation
- Real-time
- Suited for embedded systems with limited resources

Why multi-core processors?
- Power consumption increases disproportionately with clock frequency.
- Use thread-level parallelism instead.

Java Multi-Core Processor

Examples: JopCMP, jamuth, REALJava and SHAP

Common property: central shared heap for all cores

SHAP Multi-Core:
- Local stack-memory per core
- Method-cache per core
- Pipelining of heap-accesses
- Concurrent GC for real-time apps
- Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
- CLDC, constant-time interface method dispatch, …

[Figure: SHAP Multi-Core Architecture. Block labels: Core 0, Core 1, …, Core n-1 (configurable), each with Stack and Method Cache; Memory Manager; Garbage Collector; Controller; Memory (SRAM / DDR-RAM; DDR: 16, SDR: 32); Wishbone Bus; UART; Graphics Unit; Ethernet MAC; DMA; Data; Code.]

Goal

Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Related Work

Common solution for object-oriented processors:
Cache for objects, analogous to data caches [2] [3]

[Figure: object-cache entry with tag (object reference, offset) and data (object content).]

Especially for real-time systems:
Separate caches for different data areas [4]:
- Classic data cache for data at static addresses (e.g. class data)
- Object-cache for data at dynamic addresses (e.g. objects)

Indirect Object-Addressing

Problem of JopCMP and SHAP:
- Object-table stored in external memory.
- Additional latency for each heap-access.
- Additional demand on memory bandwidth.

Solution:
- Translation look-aside buffer (TLB) [1]
- Virtually indexed object-cache [2]

[Figure: indirect addressing. Block labels: Stack, Object-Table, Heap.]
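
The cost of the indirection and the effect of a small TLB can be illustrated with a toy model. The sketch below is not SHAP's hardware; the class `HandleTranslation`, the array layout, and the LRU replacement are assumptions made only for illustration. It counts external-memory accesses: without a TLB every heap access pays one object-table read plus one data read, while a TLB hit saves the table read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model: counts external-memory accesses with and without a small TLB. */
final class HandleTranslation {
    private final int[] objectTable;          // handle -> base address (external memory)
    private final int[] heap;                 // object contents (external memory)
    private final Map<Integer, Integer> tlb;  // handle -> base address, kept in LRU order
    int externalAccesses = 0;

    HandleTranslation(int[] objectTable, int[] heap, final int tlbEntries) {
        this.objectTable = objectTable;
        this.heap = heap;
        // accessOrder = true makes the eldest entry the least recently used one.
        // With tlbEntries = 0 every inserted entry is evicted at once (TLB disabled).
        this.tlb = new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
                return size() > tlbEntries;
            }
        };
    }

    /** Reads the word at the given offset of the object identified by the handle. */
    int read(int handle, int offset) {
        Integer base = tlb.get(handle);
        if (base == null) {
            base = objectTable[handle];
            externalAccesses++;               // object-table read
            tlb.put(handle, base);
        }
        externalAccesses++;                   // data read
        return heap[base + offset];
    }
}
```

In this model, four reads that touch two objects cost 8 external accesses without a TLB and 6 with a 2-entry TLB, because two of the reads reuse a cached translation.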

Cache Coherence

Problem: Coherence of distributed caches

Advantage of the Java Virtual Machine: Synchronization only when [5]
- entering a critical section, or
- accessing a “volatile” variable.
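
At the Java source level, the two synchronization points look as follows. This is a plain desktop-Java sketch with hypothetical names (`SyncPoints`, `runDemo`), not SHAP-specific code; it only marks where a core with a local cache would have to discard possibly stale data.

```java
final class SyncPoints {
    private int counter = 0;                // plain field: a core may keep it in a local cache
    private volatile boolean done = false;  // volatile: every access is a synchronization point

    // Entering the monitor is a synchronization point; a caching core would
    // discard possibly stale local copies here before reading `counter`.
    synchronized void increment() { counter++; }

    synchronized int get() { return counter; }

    void finish() { done = true; }          // volatile write
    boolean isDone() { return done; }       // volatile read

    /** Runs `threads` workers that each increment the shared counter `perThread` times. */
    static int runDemo(int threads, int perThread) {
        SyncPoints shared = new SyncPoints();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) shared.increment();
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        shared.finish();
        return shared.isDone() ? shared.get() : -1;
    }
}
```

Between such points, accesses to plain fields carry no visibility guarantee across threads, which is what allows a per-core cache to serve them locally.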

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Heap-Access Analysis

Evaluation:
- Benchmark suite JemBench (Version 2.0) [6], all except microkernel benchmarks
- SHAP Multi-Core with 1 core and trace unit [1]
- Recording of executed bytecodes and memory accesses

Heap-Access Analysis

[Figure: bar chart of memory bandwidth utilization [%] per benchmark (AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp), broken down into data accesses (core itself), bytecode fetches (method cache), and accesses for memory management. Annotated value: 11%.]

Heap-Access Analysis

Evaluation:
- Benchmark suite JemBench (Version 2.0) [6]
- SHAP Multi-Core with 1 core and trace unit [1]
- Recording of executed bytecodes and memory accesses

Results:
1. The most frequent object accesses are reads on arrays¹ and member variables.
2. 80% of all object accesses concentrate on 6 objects.
3. Frequent accesses to the first user-specific object offsets (-2 and 1)

→ Further evaluation: small fully associative cache for each core

¹ (already accounts for implicit reads of array length)
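
A concentration figure like the one in result 2 can be derived from a recorded trace with a simple count. The sketch below assumes a hypothetical trace format (one object handle per access) and a hypothetical helper name `objectsCovering`; the actual output format of the trace unit is not described in the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AccessConcentration {
    /**
     * Returns the smallest number of distinct objects that together receive
     * at least `percent` percent of all accesses in the trace.
     */
    static int objectsCovering(int[] handleTrace, int percent) {
        // Count accesses per object handle.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int handle : handleTrace) {
            counts.merge(handle, 1, Integer::sum);
        }
        // Visit the most frequently accessed objects first.
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort(Collections.reverseOrder());

        // Integer ceiling of trace length * percent / 100.
        long needed = ((long) handleTrace.length * percent + 99) / 100;
        long covered = 0;
        int objects = 0;
        for (int count : sorted) {
            if (covered >= needed) break;
            covered += count;
            objects++;
        }
        return objects;
    }
}
```

A small result for `objectsCovering(trace, 80)` indicates that a cache with only a few lines can already capture most object accesses.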

Small Fully Associative Local Cache

Storing only invariant data:
- Would require no extra logic for cache coherency.
- In general, only the class information pointer and the array size are invariant.
- Significant reduction only for array-intensive programs:
  • BubbleSort: 33% of all memory accesses
  • Sieve: 20%
  • MatrixMul: 17%

Write-through instead of write-back:
- No special GC interaction required.
- Simple cache coherence: Invalidate cache when
  • entering a critical section, or
  • accessing a “volatile” variable.
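
The write-through policy with invalidation can be sketched as a minimal software model. The names are hypothetical and the per-core cache is unbounded here, whereas the real cache is small and implemented in hardware. Writes always reach the shared memory, so memory is never stale, and a core only has to drop its local copies at a synchronization point.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-core cache model: write-through, invalidated at synchronization points. */
final class WriteThroughCache {
    private final int[] sharedHeap;                               // memory shared by all cores
    private final Map<Integer, Integer> lines = new HashMap<>();  // address -> cached word

    WriteThroughCache(int[] sharedHeap) {
        this.sharedHeap = sharedHeap;
    }

    int read(int address) {
        Integer cached = lines.get(address);
        if (cached != null) {
            return cached;                 // hit: may be stale until the next invalidate()
        }
        int value = sharedHeap[address];   // miss: fetch from shared memory
        lines.put(address, value);
        return value;
    }

    void write(int address, int value) {
        sharedHeap[address] = value;       // write-through: shared memory is always current
        lines.put(address, value);         // assumption: the local copy is updated as well
    }

    /** Called when entering a critical section or accessing a volatile variable. */
    void invalidate() {
        lines.clear();
    }
}
```

Because the shared memory always holds the current data, a garbage collector working on that memory needs no write-back from the caches, which matches the "no special GC interaction" point above.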

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Cache Design (1)

Cache integration: into memory manager port

Core modifications:
- Bytecode to access volatile variables
- Microcode to invalidate cache

[Figure: SHAP Multi-Core with AOC (Excerpt). Block labels: Core 0 with Stack and Method Cache; AOC at the Data Port (configurable); Memory Manager; Garbage Collector; Controller; Memory (SRAM / DDR-RAM; DDR: 16, SDR: 32); Wishbone Bus; Data; Code.]

Cache Design (2)

Features:
- Address & Offset-Cache (AOC) with write-through, LRU-strategy
- 1 valid-bit per cached word
- Configurable: # of cache lines, cached offsets

Disadvantage:
Additional latency of 1 clock cycle during cache miss

[Figure: Address- and Offset-Cache (AOC). Core side: the stack holds an object handle, accompanied by an offset. External memory: the object table maps Object 1 to Object 3 to physical addresses; the heap holds Data W, X, Y, Z at offsets -2, -1, 0, 1. Each AOC line (lines 0 to 3 shown) stores the object handle as tag, the physical address in the address memory, and in the valid & data memory one valid bit V plus data word per cached offset (offset -2: Data W, offset 1: Data Z, both valid).]
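
To make the lookup path concrete, here is a toy Java model of an address and offset cache. It is a sketch under assumptions, not the SHAP implementation: the class `AocModel`, the access counting, and the choice to allocate a line on writes are illustrative, and the extra miss latency is not modelled. A line hit saves the object-table access; a hit on a valid cached offset also saves the heap access.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical software model of an address and offset cache; not the SHAP hardware design. */
final class AocModel {
    /** One cache line: translated address plus one valid bit and data word per cached offset. */
    private static final class Line {
        final int physicalAddress;
        final boolean[] valid;
        final int[] data;

        Line(int physicalAddress, int slots) {
            this.physicalAddress = physicalAddress;
            this.valid = new boolean[slots];
            this.data = new int[slots];
        }
    }

    private final int[] objectTable;         // handle -> physical address (external memory)
    private final int[] heap;                // object contents (external memory)
    private final int[] cachedOffsets;       // e.g. {-2, 1}; empty means "address only"
    private final Map<Integer, Line> lines;  // tag = object handle, kept in LRU order
    int externalAccesses = 0;

    AocModel(int[] objectTable, int[] heap, final int lineCount, int... cachedOffsets) {
        this.objectTable = objectTable;
        this.heap = heap;
        this.cachedOffsets = cachedOffsets;
        // accessOrder = true makes the eldest entry the least recently used one.
        this.lines = new LinkedHashMap<Integer, Line>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Line> eldest) {
                return size() > lineCount;
            }
        };
    }

    /** Index of the data slot for an offset, or -1 if the offset is not cached. */
    private int slotOf(int offset) {
        for (int i = 0; i < cachedOffsets.length; i++) {
            if (cachedOffsets[i] == offset) return i;
        }
        return -1;
    }

    /** Returns the line for a handle, reading the object table on a line miss. */
    private Line lineFor(int handle) {
        Line line = lines.get(handle);
        if (line == null) {
            externalAccesses++;              // object-table read
            line = new Line(objectTable[handle], cachedOffsets.length);
            lines.put(handle, line);
        }
        return line;
    }

    int read(int handle, int offset) {
        Line line = lineFor(handle);
        int slot = slotOf(offset);
        if (slot >= 0 && line.valid[slot]) {
            return line.data[slot];          // full hit: no external access at all
        }
        externalAccesses++;                  // heap read
        int value = heap[line.physicalAddress + offset];
        if (slot >= 0) {
            line.valid[slot] = true;
            line.data[slot] = value;
        }
        return value;
    }

    void write(int handle, int offset, int value) {
        Line line = lineFor(handle);
        externalAccesses++;                  // write-through: the heap is always updated
        heap[line.physicalAddress + offset] = value;
        int slot = slotOf(offset);
        if (slot >= 0) {
            line.valid[slot] = true;
            line.data[slot] = value;
        }
    }

    /** Drops all lines, e.g. at a synchronization point. */
    void invalidate() {
        lines.clear();
    }
}
```

Constructing the model with `new AocModel(table, heap, 8, -2, 1)` corresponds to 8 lines with cached offsets -2 and 1; passing no offsets gives an address-only configuration.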

Cache Configuration

Huge configuration space:
- Cache lines: $l \in \mathbb{N},\ l > 0$
- Offsets: no (address only), range $[-x, y]$ with $x, y \in \mathbb{N},\ x > 0$

But synthesis for $n = 1, 2, \ldots, 18$ cores is too expensive.
→ Search for a good initial configuration.

Configuration space exploration:
- Baseline design for comparison: TLB with 2 entries
- Benchmarks:
  • JemBench
  • JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
- Platform:
  • SHAP on Virtex-5 FPGA XC5VLX110T
  • Same clock frequency of 80 MHz as baseline design

Setup: 1 Core, Cache Address Only

[Figure: relative performance (y-axis, about 0.96 to 1.14) over cache configuration (x-axis: Baseline Design, 2, 4, 8, 16, 32 and 64 lines) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

→ Use 8 lines

Setup: 1 Core, 8 Lines

[Figure: relative performance (y-axis, about 1 to 1.45) over cache configuration (x-axis: Only Addr, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

→ Cache offsets -2 & 1

Speed-Up of JemBench MatrixMul

Matrix multiplication:
Default with 20x20 matrix

Reference value:
Single-core performance

Ideal speed-up:
10 for n ≥ 10 cores

Speed-up maximum:
8.3 → 9.4 @ 10 cores

[Figure: speed-up S (y-axis, 2 to 18) over processor cores n (x-axis, 2 to 18) for Ideal, AOC with 8 lines (offset -2 and 1), AOC with 8 lines (address only), and Baseline Design.]

Speed-Up of JemBench MatrixMul (N=90)

Matrix multiplication:
Extended to 90x90 matrix

Reference value:
Single-core performance

Ideal speed-up:
15 for 15 to 17 cores

Speed-up maximum:
9.5 → 14.0 @ 15 cores

[Figure: speed-up S (y-axis, 2 to 18) over processor cores n (x-axis, 2 to 18) for Ideal, AOC with 8 lines (offset -2 and 1), AOC with 8 lines (address only), and Baseline Design.]
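
The ideal speed-up values quoted for both MatrixMul sizes are consistent with one simple model: if the N rows of the result matrix are the units of work and are spread as evenly as possible over n cores, the busiest core processes ⌈N/n⌉ rows, so the ideal speed-up is N / ⌈N/n⌉. This partitioning is an assumption and is not stated on the slides; the helper below is hypothetical.

```java
final class IdealSpeedUp {
    /** Ideal speed-up when `rows` equal work items are split as evenly as possible over `cores`. */
    static double ideal(int rows, int cores) {
        int rowsOnBusiestCore = (rows + cores - 1) / cores;  // integer ceiling of rows / cores
        return (double) rows / rowsOnBusiestCore;
    }

    public static void main(String[] args) {
        // Print the ideal curve for the two matrix sizes used on the slides.
        for (int n = 1; n <= 18; n++) {
            System.out.println(n + " cores: N=20 -> " + ideal(20, n) + ", N=90 -> " + ideal(90, n));
        }
    }
}
```

Under this assumption, N = 20 gives 10 for any n from 10 to 18, and N = 90 gives 15 for n = 15, 16 and 17, matching the plateaus named above.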

Speed-Up of JGF SparseMatMult with Integer

Matrix multiplication:
Reduced to 4,000x4,000 matrix with 16,000 non-zero integer elements

Reference value:
Single-core performance

Speed-up maximum:
7.6 → 14.7 / 15.5 @ 17 cores

[Figure: speed-up S (y-axis, 2 to 18) over processor cores n (x-axis, 2 to 18) for Ideal, AOC with 8 lines (offset -8 to 7), AOC with 8 lines (offset -2 and 1), AOC with 8 lines (address only), and Baseline Design.]

FPGA Resource Usage

Synthesis results for 8 cache lines, offsets -2 & 1:
- Up to 17 cores at 80 MHz (same as baseline design) fit on XC5VLX110T
- Additional hardware resources per core:
  • For up to 15 cores:

    Configuration    Register   LUTs   RAMB36
    Address Only     +8%        +3%*   -
    Offsets -2 & 1   +9%        +8%*   -
    Offsets [-8,7]   +17%       +9%    +1

    * = includes distributed RAM

  • 16 and 17 cores: disproportionately higher due to high FPGA utilization.

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Conclusion

Baseline design: SHAP Multi-Core with 2-line TLB

Realized object-cache:
- Fully associative combined address and offset cache
- Configurable in # of lines and cached offsets

Results:
- With 4 cache lines: Amortization of additional latency
- With 8 cache lines:
  • Maximum speed-up increased from formerly 8 to 9 to now 14.
  • Single-core performance increased by 12 to 21%.

→ Up to twice the absolute compute performance.
→ Requires only 9% more hardware resources.

Thank you for your attention!

Questions?

Selected Literature

1. Zabel, Martin: Effiziente Mehrkernarchitektur für eingebettete Java-Bytecode-Prozessoren. Dresden, Technische Universität, Diss., 2012
2. Vijaykrishnan, N.; Ranganathan, N.: Supporting object accesses in a Java processor. In: IEE Proc. Computers and Digital Techniques 147 (2000), No. 6, pp. 435–443. ISSN 1350–2387
3. Wright, G.; Seidl, M. L.; Wolczko, M.: An object-aware memory architecture. In: Sci. Comput. Program. 62 (2006), No. 2, pp. 145–163. ISSN 0167–6423
4. Huber, B.; Puffitsch, W.; Schoeberl, M.: WCET driven design space exploration of an object cache. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10). ACM, 2010, pp. 26–35
5. Lindholm, T.; Yellin, F.: The Java(TM) Virtual Machine Specification. 2nd edition. Amsterdam: Addison-Wesley Longman, 1999. ISBN 978–0201432947
6. Schoeberl, Martin; Preusser, Thomas B.; Uhrig, Sascha: The embedded Java benchmark suite JemBench. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10), pp. 120–127
7. Java Grande Forum Benchmark-Suite. http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html