Simultaneous Multithreading Support in Embedded Distributed … · 2018-01-19 · Multithreading...
Simultaneous Multithreading Support in Embedded Distributed Memory MPSoCs
Rafael Garibotti, Luciano Ost, Rémi Busseuil, Mamady Kourouma, Chris Adeniyi-Jones, Gilles Sassatelli, Michel Robert
ADAC ADAptive Computing group
Host CPU (ARM)
OS
I/O
external memory
Motivation
§ Objectives
§ improving performance scalability
§ improving system programmability
- Array of tiny scalar PEs with private RAM
- Interconnected through a NoC
- Adaptive
- DFS/DVS
- Load balancing
- NoC reconfigurations
[Figure: typical MPSoC template (static, centralized, heterogeneous) compared with the next generation (homogeneous, adaptive, distributed)]
OpenScale Architecture
§ Main principles:
§ each PE has a local private RAM and runs an instance of a microkernel (uK)
§ native communication paradigm: message passing (MPI-like)
§ advanced features: load balancing through task migration (code migrated across physical RAM)
Open source platform available at: www.lirmm.fr/ADAC
[Figure: OpenScale platform. Host CPU (ARM) with OS, I/F and off-chip DDR; each node contains a PE (SBLAZE with I$ and D$), RAM, RTOS, IT control, timers, DFS, a network interface with remote memory access, and a router]
- pthreads, semaphores
- preemptive multitasking microkernel
- dynamic loader & memory allocation
- multi-mode communication stack
- message-passing API
- libraries (math, libc, etc.)
Contributions and Novelty
[Figure: node with SBLAZE, MEM, D$, I$, bus, router and NI]
§ POSIX-like thread API support in a purely distributed memory MPSoC
§ inspiration: accepted programming model
§ avoid using OpenCL or vendor-specific languages such as CUDA
{garibotti, ost, busseuil, kourouma, sassatelli, robert}@lirmm.fr
3. Experimental setup and Results
Open Source Open-Scale Platform available at: www.lirmm.fr/ADAC
Applications:
• MJPEG, Smith-Waterman and FFT
Reference Platforms (reproduced with GEM5 Simulator):
• Up to 8 ARM Cortex-A9 running at 500MHz
• Linux Kernel 2.6.38
• 16kB private L1 data and instruction caches
• 32-bit channel width
• 256MB unified L2 cache (only for mesh network)
• DDR physical memory running at 400MHz
Adopted Platform:
• NoC 3x3, handshake, 500MHz clock frequency
• four input buffer positions
• cache size 4kB / 8kB / 16kB (8 words per line)
PE area evaluated at 40nm CMOS Technology
MJPEG full evaluation, which includes: architecture scalability, average bandwidth and average cache miss latency
[Figure panels: vSMP with 1 thread, vSMP with 2 threads, vSMP with 8 threads]
Multithreading software support
Multithreading hardware support
2. Multithreading Platform
1. Contributions
• Improved programmability through POSIX-like threads API in a distributed memory MPSoC
• Remote Memory Access (RMA) and memory consistency support, allowing vSMP clusters to be defined at run-time
• Improved performance scalability
• Mechanisms validated in a synthesizable RTL NoC-based MPSoC
Area overhead (Low-Power core with FP + RMA vs. Low-Power core with FP)
Memory Size    Area Overhead
64 kB          15.59 %
128 kB         9.94 %
256 kB         5.72 %

Implemented Pthread primitives
Pthread primitives         Description
pthread_create()           Creates a new thread in the calling process
pthread_exit()             Terminates the calling thread
pthread_join()             Waits for the specified thread to terminate
pthread_mutex_init()       Initializes the specified mutex
pthread_mutex_destroy()    Destroys the specified mutex
pthread_mutex_lock()       Locks the mutex object
pthread_mutex_unlock()     Unlocks the mutex object
pthread_barrier_init()     Initializes a barrier object
pthread_barrier_wait()     Synchronizes participating threads at the barrier pointed to by the barrier argument
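The listed primitives can be exercised together in a few lines. The sketch below uses the standard POSIX calls on a host system; it is not OpenScale code, and whether the platform accepts it unchanged is an assumption. The names run_workers, worker, MAX_WORKERS and INCREMENTS are hypothetical, and pthread_barrier_destroy() is standard POSIX but does not appear in the list above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 8     /* the slides evaluate up to 8 threads */
#define INCREMENTS  1000  /* hypothetical per-thread workload */

static pthread_mutex_t   lock;
static pthread_barrier_t barrier;
static long              counter;

/* Worker thread: updates the shared counter under the mutex, then
 * meets the other workers at the barrier before terminating. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    pthread_barrier_wait(&barrier);
    return NULL;  /* same effect as calling pthread_exit(NULL) */
}

/* Main thread: creates n workers, joins them and returns the counter.
 * Sketch only: failures abort through assert instead of being handled. */
long run_workers(int n)
{
    pthread_t tid[MAX_WORKERS];
    int rc;

    assert(n > 0 && n <= MAX_WORKERS);
    counter = 0;

    rc = pthread_mutex_init(&lock, NULL);
    assert(rc == 0);
    rc = pthread_barrier_init(&barrier, NULL, (unsigned)n);
    assert(rc == 0);

    for (int i = 0; i < n; i++) {
        rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < n; i++) {
        rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
    }

    pthread_barrier_destroy(&barrier);  /* not in the slide's list */
    pthread_mutex_destroy(&lock);
    (void)rc;
    return counter;
}
```

Calling run_workers(4) returns 4 * INCREMENTS when every increment is protected by the mutex; the barrier only guarantees that no worker terminates before all of them have finished their loop.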
Contributions and Novelty
§ Improved performance scalability
§ purely scalable hardware template:
§ distributed memories, NoC-based
§ compact CPUs for high compute density
§ multithreading support
[Figure legend: MT = main thread, T = worker thread]
Contributions and Novelty
§ POSIX-like threads API support in a purely distributed memory MPSoC
§ improved programmability and performance scalability
§ vSMP clusters definition at run-time
§ mechanisms validated in a synthesizable RTL NoC-based MPSoC
Supported mechanisms
[Figure: vSMP cluster with shared memory, and message passing. Legend: MT = main thread, T = worker thread]
Evaluation Scenario
[Figure: worker threads (T) fetching instructions/data through RMA]
§ Platform configuration:
§ 3x3 processor array, NoC with 4-position input buffers and 32 bits
§ processor cache size set to 4kB, 8kB or 16kB, with 8 words per line
§ 500MHz frequency for both processor nodes and NoC routers
§ CPU: hardware multiplier, divider and barrel shifter
Reference Platforms (GEM5)
§ up to 8 ARM Cortex-A9 cores
§ CPU running at 500MHz, Linux Kernel 2.6.38
§ 16kB private L1 data and instruction caches
§ 32-bit channel width and DDR physical memory running at 400MHz
§ up to 8 ARM Cortex-A9
§ CPU running at 500MHz, Linux Kernel 2.6.38
§ 16kB private L1 data and instruction caches
§ 32-bit channel width, 256MB unified L2 cache and an interconnection network
[Figure: reference platforms. (a) ARM Cortex-A9 cores with I$ and D$, RealView IOs and DDR on a bus; (b) ARM Cortex-A9 cores with I$ and D$, L1 cache controllers, an interconnection network, L2 cache controllers and a memory controller]
Results
§ Analysis on performance scalability:
§ ARM SMP: 1 to 8 CPUs interconnected by a bus / NoC (GEM5)
§ OpenScale platform: more scalable
§ low-latency cache traffic thanks to on-chip memory with dedicated hardware for remote memory access (RMA)
[Plot: speedup (0 to 8) vs. # of CPUs and threads (1 to 8) for OpenScale 16kB, OpenScale 8kB, OpenScale 4kB, Mesh-based ARM (16kB) and Bus-based ARM (16kB)]
Near-linear speedup is observed when cache sizes of 8kB and 16kB are employed; with 4kB the application only achieves a speedup of 2.4.
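The speedup figures can be turned into parallel efficiency. A minimal sketch, assuming the usual definitions (speedup = single-thread time / parallel time, efficiency = speedup / number of threads); the function names are hypothetical, and the slide does not state at which thread count the 2.4 figure was measured.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the single-thread run. */
double speedup(double t_single, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_single / t_parallel;
}

/* Fraction of ideal (linear) speedup that was actually obtained. */
double efficiency(double s, int n_threads)
{
    assert(n_threads > 0);
    return s / n_threads;
}
```

If the 2.4 speedup is read at 8 threads (an assumption), efficiency(2.4, 8) gives 0.3, whereas a perfectly linear speedup of 8 would give 1.0.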
MJPEG video pipeline
Results
§ Analysis on average bandwidth for MJPEG:
§ RMA bandwidth is the bottleneck of the proposed system due to the latency of the remote memory access protocol
[Plots: average bandwidth (MB/s) vs. # of threads (1 to 8) for the MJPEG video pipeline. One panel shows east, south and RMA with 4kB caches (axis up to 250 MB/s); the other shows east, south and RMA with 16kB caches (axis up to 40 MB/s)]
almost maximum theoretical bandwidth of the RMA module is achieved (250MB/s)
much increased instruction cache miss rate
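The 250MB/s ceiling can be related to the platform parameters given on the slides. A minimal sketch, assuming 32-bit words (so an 8-word cache line is 32 bytes), 1 MB = 10^6 bytes, and a 500MHz clock for the RMA path; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one cache line in bytes. */
uint32_t line_bytes(uint32_t words_per_line, uint32_t word_bits)
{
    return words_per_line * (word_bits / 8u);
}

/* Clock cycles needed per cache line at a given sustained bandwidth,
 * rounded up to a whole cycle. */
uint64_t cycles_per_line(uint64_t clock_hz, uint64_t bytes_per_second,
                         uint32_t bytes)
{
    assert(bytes_per_second > 0);
    return ((uint64_t)bytes * clock_hz + bytes_per_second - 1u)
           / bytes_per_second;
}
```

Under these assumptions, cycles_per_line(500000000, 250000000, line_bytes(8, 32)) evaluates to 64: at the stated maximum, the RMA module delivers one 32-byte line every 64 clock cycles.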
[Plots: average cache miss latency in clock cycles (100 to 700) vs. # of CPUs and threads (1 to 8). One panel shows instruction caches of 4kB, 8kB and 16kB; the other shows data caches of 4kB, 8kB and 16kB]
Results
[Figure: RMA mechanism between a REMOTE node and the HOST node across the NoC routers (R). Each node has a CPU with I$ and D$, RAM, a Message Module (FIFO), an RMA Module (FIFOs, RMA-Send, RMA-Reply) and an NI. The remote node issues a cache line request; the host reads instructions/data and returns the cache line. Steps 1 to 3 are marked on the figure]
1- whenever a cache miss occurs, a cache line request is routed through the NoC
2- the host RMA reads the desired cache line and sends it back to the remote CPU
3- the CPU resumes the thread execution
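The three steps above can be sketched as a small software model. This is an illustration, not the RTL: it assumes a direct-mapped 4kB cache (the slides do not state the cache organisation), models the NoC transfer as a plain function call, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS_PER_LINE 8     /* from the slides: 8 words per line */
#define NUM_LINES      128   /* 4kB cache / 32-byte lines */
#define HOST_WORDS     4096  /* hypothetical host RAM size, in words */

static uint32_t host_ram[HOST_WORDS];

typedef struct {
    int      valid;
    uint32_t tag;
    uint32_t data[WORDS_PER_LINE];
} cache_line_t;

typedef struct {
    cache_line_t lines[NUM_LINES];
    unsigned     misses;
} cache_t;

/* Invalidate every line and reset the miss counter. */
void cache_init(cache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Write one word into the host RAM (used to set up the model). */
void host_store(uint32_t word_addr, uint32_t value)
{
    assert(word_addr < HOST_WORDS);
    host_ram[word_addr] = value;
}

/* Step 2: the host RMA reads the requested line from host RAM. */
static void rma_host_read_line(uint32_t line_addr, uint32_t *out)
{
    memcpy(out, &host_ram[line_addr * WORDS_PER_LINE],
           WORDS_PER_LINE * sizeof(uint32_t));
}

/* Load one word on the remote CPU through its local cache. */
uint32_t remote_load(cache_t *c, uint32_t word_addr)
{
    assert(word_addr < HOST_WORDS);
    uint32_t line_addr = word_addr / WORDS_PER_LINE;
    cache_line_t *line = &c->lines[line_addr % NUM_LINES];
    uint32_t tag = line_addr / NUM_LINES;

    if (!line->valid || line->tag != tag) {
        /* Step 1: miss, so a cache line request goes to the host
         * (the NoC hop is modelled as a direct call here). */
        c->misses++;
        rma_host_read_line(line_addr, line->data);
        line->tag = tag;
        line->valid = 1;
    }
    /* Step 3: the CPU resumes with the word now held locally. */
    return line->data[word_addr % WORDS_PER_LINE];
}
```

In this model, two loads from the same 8-word line cost one miss, while two addresses that are NUM_LINES * WORDS_PER_LINE words apart evict each other, which is the kind of behaviour that makes a small cache lean heavily on the RMA path.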
Ongoing work
[Figure: threads (MT, T) using a 64 kB virtual shared memory versus 8 kB of distributed shared memory (DSM)]
§ The best configuration will be achieved when the application size is divided by the number of threads in the cluster.
§ Advantages:
§ Performance: reduces the bottleneck
§ Area savings: can be used to reduce the local memory size
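The sizing rule above can be illustrated with the figures from the slide (64 kB split across 8 threads gives 8 kB per node). The sketch below assumes a simple block partitioning of the virtual shared address space; the actual DSM mapping is ongoing work and is not described in the slides, so dsm_slice_bytes, dsm_locate and dsm_location_t are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t node;    /* index of the node holding the address */
    uint32_t offset;  /* byte offset inside that node's local slice */
} dsm_location_t;

/* Local memory needed per node: application size divided by the
 * number of threads in the cluster, rounded up. */
uint32_t dsm_slice_bytes(uint32_t app_bytes, uint32_t n_threads)
{
    assert(n_threads > 0);
    return (app_bytes + n_threads - 1u) / n_threads;
}

/* Map a virtual shared address to (node, offset), assuming each node
 * owns one contiguous slice of the address space. */
dsm_location_t dsm_locate(uint32_t vaddr, uint32_t slice_bytes)
{
    assert(slice_bytes > 0);
    dsm_location_t loc = { vaddr / slice_bytes, vaddr % slice_bytes };
    return loc;
}
```

With a 64 kB application and 8 threads, dsm_slice_bytes(65536, 8) is 8192, and address 8192 maps to offset 0 on node 1 under this assumed layout.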
Possible solution
§ AES chart (DSM)
Thanks for your time, see you @ the poster session
[Plot: speedup (0 to 6) vs. # of CPUs and threads (1 to 8) for Pthread 4kB, Pthread 2kB, Pthread 1kB, DSM 2kB and DSM 1kB]