A Dynamic Stream Link for Efficient Data Flow Control in ...Dynamic Stream-Based Link Management...
Transcript of A Dynamic Stream Link for Efficient Data Flow Control in ...Dynamic Stream-Based Link Management...
© CEA. All rights reserved
A Dynamic Stream Link for Efficient Data Flow Control
in NoC Based Heterogeneous MPSoC
Claude Helmstetter1, Sylvain Basset1, Romain Lemaire1, Fabien Clermidy1,
Pascal Vivet1, Michel Langevin2, Chuck Pilkington2, Pierre Paulin2, Didier Fuin3
1CEA-Leti, Minatec Campus, Grenoble, France
2STMicroelectronics, Ottawa, Canada 3STMicroelectronics, Grenoble, France
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
Jan. 22-25, 2013 – Yokohama, Japan
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Heterogeneous Multicore SoCs
Embedded computing evolution
New application: 3D imaging, multi-antenna wireless
baseband processing
Common characteristics: Complex operations, high
data rate, real-time requirements
Efficiency requirement: Performance versus Power
consumption
New architectures for Systems-on-Chip
Complexity: Increasing number of computing cores
Heterogeneity: CPU, DSP, Reconfigurable HW cores
Parallelism: Dataflow programming
| 2
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Stream-Based Communications
From bus-based architecture to Network-on-Chip
Bandwidth requirement due to increasing number of
computing cores
Well-defined communication interfaces required
Ease of integration, reuse and programming
Bus protocols not well-adapted for NoC communication
with dataflow programming
Address management issue
Introduction of stream-based NoC protocols
Local control by Network Interface (NI)
Packet-switched data transfers + Flow control (credits)
| 3
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Dynamic Stream-Based Link Management
Configuration of stream-based communication links
How and when communications can be configured
How and when communication links are started/stopped
Synchronization requirements
How to synchronize communication links
How to synchronize communications and computation
Issue: dynamic communication parameters
Non-predictable, data-dependent control
Control latencies become critical (reaction time)
Novel NI architecture supporting dynamicity
SIF: Streaming InterFace
Objectives: fast reconfiguration and limited control workload
| 4
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Outlines
Introduction
Streaming interface and stream link protocols
Hardware Implementation
Evaluation and simulation results
Conclusion
| 5
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Outlines
Introduction
Streaming interface and stream link protocols
Stream link definition
Static stream link
Iteration-based dynamic stream link
Mode-based dynamic stream link
Hardware Implementation
Evaluation and simulation results
Conclusion
| 6
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Stream Link Definition
| 7
Processing Units (PU)
Receive/send data using FIFO interfaces only (dataflow)
Optional additional ports for control (registers, IT wires, …)
Streaming Interfaces (SIF)
Output Communication Controller
(OCC)
Packetise data, add routing
information (packet header)
Input Communication Controller
(ICC)
Depacketise data, send credit
backward (flow control)
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 8
Static Stream Link Protocol
SIF are configured with a predefined total number
of data to transfer: iteration size
Stream link closed locally using a simple counters
Iteration size must be predictable at configuration
time
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 9
Iteration-Based Dynamic Link
The emitter PU decides to close the link (flush)
The OCC sends a packet flagged as last
Acknowledgement sent by the ICC
Required for the data/credit mechanism reset
No need to predict iteration size at start time
Require additional communications
SIF
Receiver PU
stream in
External
Controller
Emitter PU
stream out
ICC
SIF OCC
co
nfig
. (with
ou
t siz
e)
sta
rt
da
ta
do
ne
da
tad
ata
cre
dits
las
t
last
sta
rt
da
ta
do
ne
da
tad
ata
cre
dits
las
t
last
iterationiteration
running idle running idle
flus
h
flus
h
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 10
Mode-Based Dynamic Link
Mode: Time between 2 datapath reconfigurations
Mode-Based Dynamic Link:
PU controlled at iteration level
Stream links stopped only on mode boundaries
Suppress close/reopen delay between iterations
Longer reconfiguration time between modes
sta
rt
co
nfig
. (with
ou
t siz
e)
da
ta
da
ta
do
ne
da
tad
ata
da
ta
da
ta
do
ne
cre
dits
cre
dits
sta
rt
sto
p
1 mode (incl. 2 iterations)
SIF
Receiver PU
stream in
External
Controller
Emitter PU
stream out
ICC
SIF OCC
running idle running idle
flush
flush
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Outlines
Introduction
Streaming interface and stream link protocols
Hardware Implementation
Architecture overview
Cost overhead for dynamic link protocols
Evaluation and simulation results
Conclusion
| 11
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 12
Architecture Overview
Architecture fully customizable at design time
Number of ICC/OCC, FIFO depth, datawidth, …
Selection of stream link protocols (possibly combined)
Control port
SIF
credits out
credits in Processing
Unit
IN_FIFO
IN_FIFO
Local controller
Register accessIT
flush
flush
ICC0
ICC1
stream in 0
stream in 1IP
OCC0
OCC1OUT_FIFO
OUT_FIFO stream out 0
stream out 1
OP
NoC
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Silicon cost of dynamic link protocols
Technology: CMOS 32nm @ 600MHz
1 ICC, 1 OCC, 32-data FIFO depth, 64-bit data width
Relative costs:
Parts of the design shared when combining protocols
| 13
9,01 kG
6,78 kG
7,02 kG
7,36 kG
All 3 links together
Mode-based dynamic link only
Iteration-based dynamic link only
Static link only
5 5,5 6 6,5 7 7,5 8 8,5 9 9,5
Hardware cost
(kGates)
Mode-based Smallest design
Iteration-based Extra-cost due to last packets protocol
Static Overhead from additional counters
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Outlines
Introduction
Streaming interface and stream link protocols
Hardware Implementation
Evaluation and simulation results
Case study and benchmark
Mode-based dynamic link
Iteration-based dynamic link
Conclusion
| 14
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 15
Case Study: the P2012 SoC
Generic many-core SoC for high performance computing Extensible with hardware Processing Elements (PE)
Developed by STMicroelectronics and CEA/LETI
P2012 cluster architecture:
LIC PE #1
shared
memory
corecorecorenetwork
interface
memory-mapped interconnect
DMA
NoC
router
NoC
router
NoC
router
PE #2
SIF
PE #0
SIF
PU
SIF
SIF
OCC ICCOCC
PUPU
P2012
cluster
synchro.
periph.
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 16
Benchmark Overview
Linear data-flow going through the DMA and n
Processing Elements (PE)
Many parameters: number of PEs (n), iteration
size, PU rate, latencies, FIFO depth, …
Alternative Stream to test path reconfigurations
LIC
shared
memory
memory-mapped interconnectcontroller
coreSIF
PE #3
SIF
PU
PE #4
SIF
PU
PE #n
SIF
PU
PE #2
SIF
PU …PU
PU
DMA
PE #0
SIF
PU
PE #5S
IFPU
alternative streammain stream
…
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 17
Mode-based Dynamic Link
Execution trace of the benchmark with mode-based dynamic link
1st and 2nd iterations run perfectly fine
… but time lost before 3rd and 5th iterations
Problem: PE #0 is restarted too late (cf. )
1st iteration time lost between iterations
Processing Unit
(PU) states
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 18
Optimizing the Controller
Basic controller: while (not_done()) {
wait until all PEs are ready to be configured;
configure and start all PEs;
}
Idea: do not wait for last PE before restarting the 1st
Optimized "asynchronous" controller: while (not_done()) {
For all PE {
if "this PE" is ready to be configured
configure and start "this PE";
}
wait an event from any PE;
}
NB: new embedded software but hardware unchanged As long as the processor running the controller is fast enough
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 19
Mode-based Dynamic Link
Execution of the benchmark with optimized controller
No more pipeline holes between iterations
PE
nu
mb
er
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 20
Cost of Path Reconfigurations
Mode-based dynamic link
Even partial reconfiguration
generates a pipeline hole
This hole can be larger if the
controller is not optimized
Iteration-based dynamic
link
No time cost for path
reconfiguration PE
nu
mb
er
PE
nu
mb
er
path reconfiguration
PU 0 PU 0
PU 0 PU 0 PU 0
PU 1 PU 1
PU 1 PU 1 PU 1
PU 1 PU 0
PU 0 PU 0
PU 0 PU 0 PU 0
PU 1 PU 1
PU 1 PU 1 PU 1
PU 1 PU 0
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 21
Iteration-based vs Mode-based
Mode-based 10% faster than iteration-based dynamic link
For short iterations (figure generated with 60-token long iterations)
Explanation: FIFOs are not flushed between iterations
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Outlines
Introduction
Streaming interface and stream link protocols
Hardware Implementation
Evaluation and simulation results
Conclusion
| 22
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan | 23
Conclusion (1/2)
Validation of an innovative SIF architecture
Support non predictable communication parameters
1 static & 2 dynamic protocols available
« Best protocol » is application dependent
Application characteristics Recommended protocol
Predictable communication Static Stream Link
No or few path reconfiguration Mode-based Dynamic Link
Other cases Iteration-based Dynamic Link
© CEA. All rights reserved © CEA. All rights reserved
18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013 – Jan. 22-25, 2013 – Yokohama, Japan
Conclusion (2/2)
External controller (embedded software) impact
on performance
Too centralized, coarse-grain: simpler code but
performance degradation
Asynchronous, optimized: easer parallelization, better
reactivity if fast enough
Known choices for real applications:
3GPP-LTE (signal processing), Magali SoC: static link
H264 (video processing), P2012 SoC: mainly mode-
based dynamic link
| 24
© CEA. All rights reserved
Thank you
| 25