Simultaneous Multithreading Support in Embedded Distributed … · 2018-01-19 · Multithreading...

Simultaneous Multithreading Support in Embedded Distributed Memory MPSoCs

Rafael Garibotti, Luciano Ost, Rémi Busseuil, Mamady Kourouma, Chris Adeniyi-Jones, Gilles Sassatelli, Michel Robert

ADAC ADAptive Computing group

[Figure: host CPU (ARM) running an OS, with I/O and external memory]

Motivation

§  Objectives
   §  improving performance scalability
   §  improving system programmability

-  Array of tiny scalar PEs with private RAM
-  Interconnected through a NoC
-  Adaptive
   -  DFS/DVS
   -  Load balancing
   -  NoC reconfigurations

[Figure: typical MPSoC template. Labels: Homogeneous, Adaptive, Distributed, next generation, Static, Centralized, Heterogeneous]

OpenScale Architecture

§  Main principles:
   §  each PE has a local private RAM and runs an instance of a microkernel (uK)
   §  native communication paradigm: message passing (MPI-like)
   §  advanced features: load balancing through task migration (code migrated across physical RAM)

Open source platform available at: www.lirmm.fr/ADAC

[Figure: OpenScale platform. Host CPU (ARM) with OS, attached through an I/F to off-chip DDR. Each node contains a PE (SBLAZE with I$ and D$), RAM, IT control, timers, DFS, a network interface / remote memory access block, and a router]

RTOS:
-  pthreads, semaphores
-  preemptive multitasking microkernel
-  dynamic loader & memory allocation
-  multi-mode communication stack
-  message-passing API
-  libraries (math, libc, etc.)

Contributions and Novelty

[Figure: node with SBLAZE, MEM, D$, I$, bus, router, and NI]

§  POSIX-like thread API support in a purely distributed memory MPSoC
   §  inspiration: accepted programming model
   §  avoid using OpenCL or vendor-specific languages such as CUDA

Rafael Garibotti, Luciano Ost, Rémi Busseuil, Mamady Kourouma, Chris Adeniyi-Jones, Gilles Sassatelli, Michel Robert

{garibotti, ost, busseuil, kourouma, sassatelli, robert}@lirmm.fr

[email protected]

Simultaneous Multithreading Support in Embedded Distributed Memory MPSoCs

3. Experimental setup and Results

Open Source Open-Scale Platform available at: www.lirmm.fr/ADAC

Applications:
• MJPEG, Smith-Waterman and FFT

Reference Platforms (reproduced with GEM5 Simulator):
• Up to 8 ARM Cortex-A9 running at 500MHz
• Linux Kernel 2.6.38
• 16kB private L1 data and instruction caches
• 32-bit channel width
• 256MB unified L2 cache (only for mesh network)
• DDR physical memory running at 400MHz

Adopted Platform:
• NoC 3x3, handshake, 500MHz clock frequency
• four input buffer positions
• cache size 4kB / 8kB / 16kB (8 words per line)

PE area evaluated at 40nm CMOS Technology

MJPEG full evaluation, which includes: architecture scalability, average bandwidth and average cache miss latency

[Figure: vSMP with 1 thread, vSMP with 2 threads, vSMP with 8 threads]

Multithreading software support


Multithreading hardware support

2. Multithreading Platform

1. Contributions

• Improved programmability through a POSIX-like threads API in a distributed memory MPSoC
• Remote Memory Access (RMA) and memory consistency support, allowing vSMP clusters to be defined at run-time
• Improved performance scalability
• Mechanisms validated in a synthesizable RTL NoC-based MPSoC

Area overhead of the RMA module (low-power core with FP vs. low-power core with FP + RMA)

Memory Size    Area Overhead
64 kB          15.59 %
128 kB          9.94 %
256 kB          5.72 %

Implemented Pthread primitives

Pthread primitive          Description
pthread_create()           Creates a new thread in the calling process
pthread_exit()             Terminates the calling thread
pthread_join()             Waits for the specified thread to terminate
pthread_mutex_init()       Initializes the specified mutex
pthread_mutex_destroy()    Destroys the specified mutex
pthread_mutex_lock()       Locks the mutex object
pthread_mutex_unlock()     Unlocks the mutex object
pthread_barrier_init()     Initializes a barrier object
pthread_barrier_wait()     Synchronizes participating threads at the barrier pointed to by the barrier argument


Contributions and Novelty

§  Improved performance scalability
   §  purely scalable hardware template:
      §  distributed memories, NoC-based
      §  compact CPUs for high compute density
      §  multithreading support

[Figure: host CPU (ARM) with OS, I/F, and off-chip DDR attached to the PE array. Legend: MT = main thread, T = worker thread]

Contributions and Novelty

§  POSIX-like threads API support in a purely distributed memory MPSoC
§  improved programmability and performance scalability
§  vSMP cluster definition at run-time
§  mechanisms validated in a synthesizable RTL NoC-based MPSoC


Supported mechanisms

[Figure: PE array hosting main threads (MT) and worker threads (T). Labels: message passing, vSMP cluster, shared memory]

Evaluation Scenario

[Figure: worker threads (T) fetching instructions/data through the RMA module]

§  Platform configuration:
   §  3x3 processor array, NoC with 4-position input buffers and 32 bits
   §  processor cache size set to 4kB, 8kB and 16kB, 8 words per line
   §  500MHz frequency for both processor nodes and NoC routers
   §  CPU: hardware multiplier, divider and barrel shifter

Reference Platforms (GEM5)

(a) Bus-based:
§  up to 8 ARM Cortex-A9 cores
§  CPU running at 500MHz, Linux Kernel 2.6.38
§  16kB private L1 data and instruction caches
§  32-bit channel width and DDR physical memory running at 400MHz

(b) Mesh-based:
§  up to 8 ARM Cortex-A9 cores
§  CPU running at 500MHz, Linux Kernel 2.6.38
§  16kB private L1 data and instruction caches
§  32-bit channel width, 256MB unified L2 cache and an interconnection network

[Figure: reference platforms. (a) ARM Cortex-A9 cores with I$/D$ connected through a bus to DDR and RealView I/Os. (b) ARM Cortex-A9 cores with I$/D$ and L1 cache controllers connected through an interconnection network to L2 cache controllers and a memory controller]

Results

§  Analysis on performance scalability:
   §  ARM SMP: 1 to 8 CPUs interconnected by a bus / NoC (GEM5)
   §  OpenScale platform: more scalable
   §  low-latency cache traffic thanks to on-chip memory with dedicated hardware for remote memory access (RMA)

[Figure: speedup (0 to 8) vs. number of CPUs and threads (1 to 8). Series: OpenScale 16kB, OpenScale 8kB, OpenScale 4kB, mesh-based ARM (16kB), bus-based ARM (16kB)]

Near-linear speedup is observed with cache sizes of 8kB and 16kB. With 4kB, the application only achieves a speedup of 2.4.

MJPEG video pipeline

Results

§  Analysis on average bandwidth for MJPEG:
   §  RMA bandwidth is the bottleneck of the proposed system due to the latency of the remote memory access protocol

[Figure: MJPEG video pipeline, average bandwidth in MB/s vs. number of threads (1 to 8). Top panel (0 to 250 MB/s): east 4kB, south 4kB, RMA 4kB. Bottom panel (0 to 40 MB/s): east 16kB, south 16kB, RMA 16kB]

Plot annotations:
§  almost the maximum theoretical bandwidth of the RMA module (250MB/s) is achieved
§  much increased instruction cache miss rate

[Figure: average cache miss latency in clock cycles (100 to 700) vs. number of threads (1 to 8). Left panel: instruction cache 4kB, 8kB, 16kB. Right panel: data cache 4kB, 8kB, 16kB]

Results

[Figure: RMA protocol between a REMOTE node and the HOST node across the NoC routers (R). Each node contains a CPU with D$ and I$, RAM, and an NI with a message module (FIFO) and an RMA module (RMA-Send and RMA-Reply FIFOs). Arrows are numbered 1 to 3: cache line request, host reads instructions/data, cache line returned]

1- whenever a cache miss occurs, a cache line request is routed through the NoC
2- the host RMA reads the desired cache line and sends it back to the remote CPU
3- the CPU resumes the thread execution


Ongoing work

[Figure: clusters of worker threads (T) with main threads (MT). Labels: 64 kB virtual shared memory, 8 kB distributed shared memory (DSM)]

§  The best configuration will be achieved when the application size is divided by the number of threads in the cluster.
§  Advantages:
   §  Performance: reduces the bottleneck
   §  Saving area: can be used to reduce the local memory size

Possible solution

§  AES chart (DSM)

Thanks for your time, see you @ the poster session

[Figure: speedup (0 to 6) over x-axis values 1 to 8. Series: Pthread 4kB, Pthread 2kB, Pthread 1kB, DSM 2kB, DSM 1kB]
