© IMEC 2015 F.CATTHOOR
FAST AND ENERGY-EFFICIENT ENVM BASED MEMORY ORGANISATION AT L3-L1 LAYERS FOR IOT COMPUTING/ROUTING PLATFORMS: MAIN CHALLENGES
Francky Catthoor, Nov. 2015
With input of esp. Praveen Raghavan, Jan Van Houdt, Stefan Cosemans, Matthias Hartmann and other IMEC colleagues
With use of MSc and PhD thesis results in cooperation with memory organisation teams at IMEC, NTUA and UC Madrid
Also based on ULP-DSIP PhD team work
Secure, trustworthy computing and communication embedded in every-thing and every-body.
A pervasive, context aware ambient IoT environment, sensitive and responsive to the presence of people
SYSTEM CLASSES FOR IOT ENVIRONMENT
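[Figure: system classes for the IoT environment. Classes shown: cloud/fog (stationary), nomadic, and sensor network (ambient); locations shown: server, office, home, gateway, car, body, with a nomadic “UPA” device. Radio standards listed: UMTS, WLAN, WPAN, WBAN, DA/VB. Nomadic functions listed: PDA, DSC, DVC, MP3, GPS, HC… Sensor functions: hear, see, feel, show… Technology labels: “More Moore” and “More-than-Moore”; MIMO; Internet IPv6. Compute loads range from 1 Top/s through 100 Gop/s and 10 Gop/s down to 10 Mop/s; link rates range over Gb/s, Mb/s and kb/s. Power classes: ‘Watt’ (mains, 10 Watt), ‘Milliwatt’ (battery, 1 Watt and 100 mW), ‘Microwatt’ (ambient energy).]
Courtesy: Hugo De Man, ISSCC 2005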
CURRENT PLATFORM ARCHITECTURES: ENERGY-FLEXIBILITY CONFLICT
Courtesy: Engel Roza (Philips)
Goal: a programmable DSIP and a configurable CGA that are as good as an ASIC.
Note: more than 1000 MOPS/mW is reachable in the 45 nm node, thanks to subword lengths smaller than 32 bit and non-standard-cell-based layout schemes for critical components.
FOCUS HERE ON IOT GATEWAY-MICROSERVER
Growing part of applications and market
RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY
Some research published by Google is available online:
http://research.google.com/pubs/HardwareandArchitecture.html
The same holds for Facebook:
https://research.facebook.com/publications/systems/
There are also many academic papers in this direction, but most of them focus on performance rather than energy efficiency. The ones that do focus on energy are too disruptive for industry in the short/mid term, because they change the entire application software stack.
Google: Towards Energy Proportionality for Large-Scale Latency-Critical Workloads, Proc. ISCA, 2014
Abstract: Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity renders existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings
=> Interesting, but the reported gain is only up to 20%, so some bottlenecks are clearly not addressed yet.
Google: The Impact of Memory Subsystem Resource Sharing on Datacenter Applications, ISCA 2011
Abstract: In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal.
=> Interesting!
Facebook: Characterizing Load Imbalance in Real-World Networked Caches, Qi Huang e.a., HotNets 2014: Thirteenth ACM Workshop on Hot Topics in Networks · October 27, 2014
Abstract: Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary by associated hashing function and partitioning ratio. Most real-world workloads are also skewed, with some items significantly more popular than others. Inefficiency in addressing both issues can create an imbalance in cache-server loads. We analyze the degree of observed load imbalance, focusing on read-only traffic against Facebook's graph cache tier in Tao. We investigate the principal causes of load imbalance, including data co-location, non-ideal hashing scenarios, and hot-spot temporal effects. We also employ trace-drive analytics to study the benefits and limitations of current load-balancing methods, suggesting areas for future research.
=> this analysis looks very interesting
Facebook: Fastpass: A Centralized "Zero-Queue" Datacenter Network, Jonathan Perry e.a., ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM) · August 18, 2014
Abstract: Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control, to a centralized arbiter, of when each packet should be transmitted and what path it should follow.
=> Interesting analysis!
Luiz André Barroso, Jimmy Clidaras, Urs Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, Morgan & Claypool Publishers (2013)
Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board. Notes for the Second Edition After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. 
Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area.
=> Very interesting overview of SotA
ENERGY BREAKDOWN FOR TYPICAL SERVER UPROC
Source: Barroso e.a., The Datacenter as a Computer, Google Books’13
TYPICAL APPLICATION: ENERGY BREAKDOWN FOR EMBEDDED PLATFORMS
[Figure: energy breakdown charts. Panel labels: Non-optim.; Non-optim. (no adv. DTSE); Optim. ITSE mapping (no ITSE trafo yet).]
▸ MPEG2 decoding on TI C6x VLIW-DSP [Lambrechts-ASAP05]
▸ WLAN activity on IMEC BOADRES coarse-grain array
▸ Audio processing on ARM A19 [Carroll’10]: ~65% CPU, ~25% DM, ~10% IM
▸ FFT processing on IMEC-BLOX [CSI’13]: ~50% processor, ~50% DM
DECODER + WORDLINE CONTRIBUTES SIGNIFICANTLY TO MEMORY DELAY AND ENERGY: WIDE WORD ACCESS
Breakdown of delay and energy for a typical small SRAM (<64 kb) sized in the conventional way:
▸ Decoder + wordline contribute nearly 60% of SRAM delay
▸ Decoder + wordline contribute about 40%-50% of SRAM energy
Source:
▸ Energy: Evans et al., J. Solid-State Circ., 1995
▸ Delay: Horowitz et al., Trans. Solid-State Circ., 2002
[Figure: two pie charts, the SRAM delay breakdown and the SRAM energy breakdown, each split into decoder + WL versus the rest.]
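To make the link between the decoder + wordline share and wide word access concrete, the sketch below uses a deliberately simple model. The model is an assumption and is not taken from the slides: the decoder + wordline energy is paid once per access, however many data-path words are read, and the remaining energy scales linearly with the number of words. The function name and the example values are hypothetical.

```python
def relative_energy_per_word(fixed_fraction: float, words_per_access: int) -> float:
    """Energy per word of a wide access, relative to a one-word access.

    Assumed model (not from the slides): a fraction `fixed_fraction` of the
    one-word access energy (decoder + wordline) is paid once per access,
    while the rest scales linearly with the number of words read.
    """
    if not 0.0 <= fixed_fraction <= 1.0:
        raise ValueError("fixed_fraction must be in [0, 1]")
    if words_per_access < 1:
        raise ValueError("words_per_access must be >= 1")
    # The fixed part is shared by all words; the per-word part is not.
    return fixed_fraction / words_per_access + (1.0 - fixed_fraction)
```

With a 45% decoder + wordline share (the middle of the 40%-50% range above) and 10 words per access, this model gives about 0.6 of the one-word energy per word. In a real array the wordline load can grow with the word width, so the actual gain can be smaller than this model suggests.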
PROPOSED DSIP ARCHITECTURE TEMPLATE EXPLOITING WIDE L1D/L1I MEMORY ACCESS
[Figure: one tile/node in a platform. Blocks shown: external memory (SDRAM), programmable DMA, wide scratch pad (Level-1 DM), Level-1 I-cache, Level-0 instruction memory (loop buffers, LB, for DP, MMU and DMA), VWRs (Level-0 DM), AGU LD/ST, SWP shifter, complex FU1 and complex FU2. Access widths: Width-1; Width-2: very wide word; Width-3: data-path word.]
F. Catthoor e.a., ULD DSIPs, Springer book 2010
REGISTER FILE VS VERY WIDE REGISTER (VWR)
[Figure: a register file with Nwords words, Bit-out output bits and Nports ports, next to a set of VWRs with the same Bit-out and Nports.]
▸ VWR width = Nwords * Bit-out / Nports = very wide word = 960 bits
▸ Bit-out = data-path word = 96 bits
▸ The number of bits stored in the register file and in all the VWRs combined is equal
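The width relation on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, and the example values Nwords = 30 and Nports = 3 are assumptions chosen only so that the result matches the 960-bit very wide word and 96-bit data-path word quoted above.

```python
def vwr_width_bits(n_words: int, bit_out: int, n_ports: int) -> int:
    """Width = Nwords * Bit-out / Nports, the relation given on the slide."""
    total_bits = n_words * bit_out          # storage of the register file
    if total_bits % n_ports:
        raise ValueError("storage does not divide evenly over the ports")
    return total_bits // n_ports


def words_per_vwr(width_bits: int, bit_out: int) -> int:
    """Number of data-path words held in one very wide register."""
    return width_bits // bit_out
```

For example, `vwr_width_bits(30, 96, 3)` returns 960 and `words_per_vwr(960, 96)` returns 10. Read this way, the equal-storage statement implies one VWR per register file port, which is an interpretation of the slide rather than something it states explicitly.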
MOTIVATION FOR VWR VS RF
Assumptions:
▸ 8 bit out, 64 bit wide VWR
▸ Same storage for (multiple) VWRs and (single) RF
▸ Same total number of ports
Conclusions:
▸ The VWR reduces the complexity of the decoder and the net capacitive load at the RF drivers, and hence, under these assumptions, always does better than the RF
▸ But it adds complexity in the compiler
[Figures: Figure 1: same memory footprint; Figure 2: 8 bit activity.]
NONVOLATILE OFF-CHIP MEMORY ROADMAP
[Figure: cell size [F²] (axis ticks 2, 4, 6, 8) versus (equivalent) technology node F [nm] (180, 130, 90, 65, 5x, 4x, 3x, 2x, 1x) for NOR-Flash (FG/NROM), NAND-Flash (FG), TANOS, 3D-NAND Flash, PCM and RRAM. Further labels: code, data, evolutionary, disruptive, production.]
Copyright: Jan Van Houdt, IMEC, 2010
Conclusion: main focus now on stand-alone applications
VARIOUS POSSIBLE CACHE MEMORIES TO REPLACE IN HIGH SPEED, LOW EDYN EMBEDDED SOC PLATFORMS
EACH MEMORY NEEDS UNIQUE POLICY FOR SRAM REPLACEMENT
[Figure: SoC platform with two ARM A-15 cores (L1 D and L1 I memories), an MPEG4 accelerator with local memories, an LTE receiver (L1 D, L1 I), a turbo decoder with HARQ memory (L1 D, L1 I), and an L2 memory.]
▸ Instruction memory can exploit non-volatility
▸ High speed data needs efficient latency masking
▸ Slower memories need just enough masking
▸ Slower memories could handle replacement with some mitigation
DISTRIBUTED INSTRUCTION MEMORY HIERARCHY
Original code:
for (i=0;i<n;i++)
  for (j=0;j<Ntaps;j++)
    C[i] += A[j]*B[i+j]

LB code for DP (WS = words per very wide word, Rem = Ntaps%WS):
for (i=0;i<n;i++)
  for (j=0;j<Ntaps/WS;j++)
    for (k=0;k<WS;k++)
      C[i] += A[j*WS+k]*B[i+j*WS+k]
  C[i] += A[p]*B[i+p] for the Rem remaining taps p

LB code for MMU:
for (i=0;i<n;i++)
  for (j=0;j<Ntaps/WS;j++)
    for (k=0;k<WS;k++)
      Load A[j*WS to j*WS+WS-1]; Load B[i+j*WS to i+j*WS+WS-1]
  Store C[i]

LB code for VWR MUX:
for (i=0;i<n;i++)
  for (j=0;j<Ntaps/WS;j++)
    for (k=0;k<WS;k++)
      Select Mux of A, B = k
  Select Mux of A, B = 1 to Rem
  Select Mux for C = i

[Figure: the DSIP template with a separate loop buffer (LB) per unit: DP with complex FU and SWP shifter, AGU LD/ST, and the VWRs in front of the wide scratch pad. Width-2: very wide word; Width-3: data-path word.]
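The blocking of the inner loop into chunks of WS taps, with a remainder of Ntaps%WS taps, can be checked against the original loop with a small executable sketch. The function names are hypothetical, and the index expressions follow the reading that each block j covers taps j*WS to j*WS+WS-1, which is an interpretation of the slide's pseudocode.

```python
def fir_original(A, B, n, ntaps):
    """Original loop nest: C[i] += A[j] * B[i + j]."""
    C = [0] * n
    for i in range(n):
        for j in range(ntaps):
            C[i] += A[j] * B[i + j]
    return C


def fir_blocked(A, B, n, ntaps, ws):
    """Same computation, blocked in chunks of `ws` taps plus a remainder."""
    C = [0] * n
    full_blocks, rem = divmod(ntaps, ws)
    for i in range(n):
        for j in range(full_blocks):        # one very wide word per block
            for k in range(ws):             # mux position inside the word
                C[i] += A[j * ws + k] * B[i + j * ws + k]
        base = full_blocks * ws
        for p in range(rem):                # leftover taps (Rem)
            C[i] += A[base + p] * B[i + base + p]
    return C
```

Both functions return the same list for any WS of at least 1, provided B holds at least n + Ntaps - 1 elements. Only the loop structure changes, which is what lets each unit run its own small loop buffer.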
INST MEMORY HIERARCHY: EXECUTING LOOPS IN PARALLEL
State-of-the-art loop controllers are centralized (we use a distributed approach):
▸ Loop cache, loop counters, etc. [Bajwa et al]
▸ Some commercial (embedded) processors
Simultaneous multi-threaded architectures and multi-processor approaches have high hardware overhead. Software control and control flow extraction in the compiler are not handled by any known state-of-the-art approach.
[Figure: bar chart of energy (axis from 0 to 2.5E-08, unit not given) for register file, instruction memory, data memory and data path, comparing VLIW and FEENECS.]
23x gain = far fewer instructions + a distributed LB for each unit
F. Catthoor e.a., ULD DSIPs, Springer book 2010
MODIFIED I-CACHE ORGANIZATION
[Figure: processor, NVM IL1, EMSHR, selector and L2 cache.]
• Novel I-cache configuration
• Addresses the write latency and write energy issues of eNVM options like STT-MRAM
• EMSHR as a fully-associative buffer
• Few entries
• Block promotion to IL1 based on threshold
M. Komalan e.a., DATE’2014
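The threshold-based promotion can be illustrated with a small behavioural sketch. The slide only states that the EMSHR is a fully-associative buffer with few entries and that blocks are promoted to the IL1 based on a threshold. The per-block access counter, the LRU replacement, the unbounded IL1 and all names below are assumptions made for illustration, not the design from the cited paper.

```python
from collections import OrderedDict


class EMSHRSketch:
    """Toy model: blocks reach the NVM IL1 only after `threshold` accesses."""

    def __init__(self, entries: int, threshold: int):
        if entries < 1 or threshold < 1:
            raise ValueError("entries and threshold must be >= 1")
        self.entries = entries
        self.threshold = threshold
        self.buffer = OrderedDict()  # block -> access count, in LRU order
        self.il1 = set()             # promoted blocks (assumed unbounded)
        self.nvm_writes = 0          # costly eNVM writes caused by promotion

    def access(self, block: int) -> str:
        """Return which level served the fetch: 'IL1', 'EMSHR' or 'L2'."""
        if block in self.il1:
            return "IL1"
        if block in self.buffer:
            self.buffer[block] += 1
            self.buffer.move_to_end(block)       # mark as most recently used
            if self.buffer[block] >= self.threshold:
                del self.buffer[block]           # promote the hot block
                self.il1.add(block)
                self.nvm_writes += 1
            return "EMSHR"
        if len(self.buffer) >= self.entries:
            self.buffer.popitem(last=False)      # evict least recently used
        self.buffer[block] = 1                   # first touch is an L2 fetch
        return "L2"
```

In this model a block that is fetched only once never causes an NVM write, while a block inside a hot loop is written to the IL1 exactly once. That is the intuition behind using a threshold to hide the write latency and write energy of the eNVM.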
PERFORMANCE RESULTS
[Figure: performance penalty (%), axis from 84 to 104, for thresholds 12, 8 and 4; modified NVM I-cache with 64 KB capacity.]
Performance penalty eliminated
Better performance than SRAM-based instruction memory
ENERGY IMPROVEMENTS
[Figure: relative energy (%), axis from 50 to 85, for thresholds 12, 8 and 4; modified NVM I-cache with 64 KB capacity.]
Factor 1.5 in energy for STT-MRAM based instruction memories
NVM model still under investigation!
Reloading and scenarios not taken into account (currently only static mode)
M. Komalan e.a., DATE’2014
READ DELAY PERFORMANCE PENALTY ON XRAM INCLUSION IN L1-D {WRITE DELAY ~10NS}
[Figure: performance penalty (%).]
PERFORMANCE PENALTY ON STT-MRAM INCLUSION IN L1-D {WRITE DELAY ~1.5NS}
[Figure: performance penalty (%).]
ARCH AND ACCESS PATTERN OPTIMIZATIONS
Modified VWR (2 Kbit):
▸ 2 VWRs: ping-pong of data, which helps mitigate performance issues due to the read delay of STT-MRAM.
Code transformations (vectorization, prefetching, instruction rescheduling) and some built-in compiler optimizations.
M. Komalan e.a., DATE’2015
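Why two VWRs in ping-pong mode hide the eNVM read delay can be sketched with a simple cycle count. The timing model is an assumption, not taken from the slides: filling one VWR from the L1-D takes `read_latency` cycles, consuming its contents in the data path takes `compute_cycles` cycles, and with two VWRs the next fill overlaps with the current computation. All names are hypothetical.

```python
def cycles_single_vwr(rows: int, read_latency: int, compute_cycles: int) -> int:
    """One VWR: every wide read must finish before computation can start."""
    return rows * (read_latency + compute_cycles)


def cycles_ping_pong(rows: int, read_latency: int, compute_cycles: int) -> int:
    """Two VWRs: one is filled while the other is being consumed."""
    if rows == 0:
        return 0
    # The first fill and the last computation cannot be overlapped. In
    # between, each step costs the slower of the two activities.
    steady_state = max(read_latency, compute_cycles)
    return read_latency + (rows - 1) * steady_state + compute_cycles
```

For 100 wide rows, a 4-cycle read and 10 compute cycles per row, this model gives 1400 cycles with one VWR and 1004 cycles with two. The read delay is then exposed only once, as long as the computation per row takes at least as long as the read.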
PERFORMANCE PENALTY REDUCTION FOR L1-D WITH 2-4 CYCLE ENVM READ
[Figure: performance penalty (%). Arch changes and reorder: victim buffer + VWR.]
MEMORY RESOURCE MANAGEMENT: OVERALL VIEW
[Figure: layered view. Embedded software applications (scalable 3D graphics, wireless network, video codec) run on top of middleware (embedded in memory IP module): the Mem Resource Manager/TDTSE. Below it is the embedded hardware: PE1 to PE8 (RISC, VLIW, SIMD, ASIC), an interconnect, and SDRAM (L2 + main). All PEs have local L1 SRAM memory. The figure distinguishes design-time and run-time.]
Given:
▸ Set of active tasks and their data requirements
▸ Tasks’ metadata
▸ Constraints like deadline and throughput
▸ Objective (e.g. minimize energy consumption)
Decides:
▸ Where to access (data to resource assignment)
▸ When to access (scheduling)
▸ How to access (with what memory “configuration”)
Also referred to as “task-level data access scheduling / mapping”.
Proposal: use a scenario-based run-time scheme embedded inside the memory organization (no application code or processor mapping changes!)
D. Atienza e.a., Springer book’14
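The split between design-time and run-time can be sketched as a lookup over configurations that were characterized in advance. This is a minimal sketch under stated assumptions: the slide names the inputs (constraints such as a deadline, an objective such as minimum energy) and the decision (which memory “configuration” to use), but the data structure, the names and all numbers below are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class MemConfig(NamedTuple):
    """One design-time characterized operating point (values hypothetical)."""
    name: str
    latency_cycles: int
    energy_nj: float


def pick_config(points: Sequence[MemConfig],
                deadline_cycles: int) -> Optional[MemConfig]:
    """Run-time step: lowest-energy point that still meets the deadline."""
    feasible = [p for p in points if p.latency_cycles <= deadline_cycles]
    if not feasible:
        return None                      # no configuration meets the deadline
    return min(feasible, key=lambda p: p.energy_nj)


# Design-time step: one list of Pareto points per scenario (made-up data).
SCENARIOS = {
    "video": [
        MemConfig("fast", 100, 9.0),
        MemConfig("balanced", 150, 6.0),
        MemConfig("low-energy", 250, 4.0),
    ],
}
```

At run-time the manager would detect the active scenario and call, for example, `pick_config(SCENARIOS["video"], 200)`, which selects the "balanced" point. Because the choice is made inside the memory organization, the application code does not change.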
DATA MANAGEMENT FLOW
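[Figure: data management flow with three refinement steps: DDT (dynamic data type) transformation and refinement, dynamic memory management refinement, and physical memory management refinement. Labels in the figure: dynamic data type exploration, concrete data types, virtual memory segments, physical memory management, physical memories.]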
WHY IS IT IMPORTANT? DYNAMIC DATA SET IN AN ATM SWITCH
[Figure: ATM cells (x25, y21, y25, x47) arriving on port 1 and port 2 of the ATM MUXes. Each connection is identified by three keys: Key1 (VPI[8]), Key2 (VCI[16]) and Key3 (port[8]), forming the record { VPI, VCI, port } [32]. Example key values are shown in binary as 8-bit fields (1 = 0000 0001, 2 = 0000 0010, 4 = 0000 0100) and 16-bit fields (1 = 0000 0000 0000 0001, 5 = 0000 0000 0000 0101, 7 = 0000 0000 0000 0111).]
+ 155 Mb/sec
+ Table of active connections
+ Table size without optimization: 16,284 Mbytes
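The size of an unoptimized connection table follows from the key widths. The sketch below packs the three keys into one 32-bit index and computes the size of a directly indexed table. The field order inside the key and the 4-byte entry size are assumptions, and the function names are hypothetical.

```python
VPI_BITS, VCI_BITS, PORT_BITS = 8, 16, 8    # key widths from the slide


def pack_key(vpi: int, vci: int, port: int) -> int:
    """Concatenate { VPI, VCI, port } into one 32-bit index (assumed order)."""
    if not (0 <= vpi < 1 << VPI_BITS and
            0 <= vci < 1 << VCI_BITS and
            0 <= port < 1 << PORT_BITS):
        raise ValueError("key field out of range")
    return (vpi << (VCI_BITS + PORT_BITS)) | (vci << PORT_BITS) | port


def direct_table_mib(entry_bytes: int) -> float:
    """Size in MiB of a table with one entry per possible 32-bit key."""
    entries = 1 << (VPI_BITS + VCI_BITS + PORT_BITS)
    return entries * entry_bytes / 2**20
```

With an assumed 4-byte entry, `direct_table_mib(4)` gives 16384.0 MiB, which is of the same order as the unoptimized table size quoted on the slide. Only a small fraction of the keys belongs to active connections, which is why a dynamic data type pays off here.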
DATA MANAGEMENT RESULTS FOR ATM PROTOCOL MODULE IN ADAPTATION LAYER
[Figure: the data management flow (dynamic data type exploration, concrete data types, virtual memory segments and pools, physical memory management, physical memories) with its three refinement steps: DDT (dynamic data type) refinement, dynamic memory management refinement, and physical memory management refinement.]
▸ DDT refinement: factor 5 fewer accesses (and energy)
▸ Factor 3 less energy
▸ Factor 2 fewer memory ports for the same cycle budget (throughput)
D. Atienza e.a., Springer book’14
Global data management design flow for dynamic concurrent tasks with data-dominated behaviour
[Figure: design flow starting from a concurrent OO specification. Steps shown: data type exploration, task concurrency management, physical memory management, address optimization, and SW/HW co-design leading to a SW design flow and a HW design flow. Further labels: dynamic data types, virtual memory management, memory allocation and assignment. Target architecture: processor, memories, memory controller, management unit and ASUs. Inset: key/data records, a binary tree (BT), a sub-pool per size, and free blocks.]
D. Atienza e.a., Springer book’14
CONCLUSIONS
IoT infrastructure platforms currently still have energy bottlenecks. Disruptive approaches that change the application software stack are very challenging to introduce in industry.
So instead go for changes inside the memory organization, combined with the introduction of new NVM technologies, which are not visible to the application code stack.
Tuning of selected parameters can reduce the performance penalty due to the NVM to very tolerable levels (≈1%). Significant energy reductions are potentially available, although they depend on the read energy per access.
Pareto-optimal values for the different parameters are application and platform dependent.
This has to be combined with a matched and optimized dynamic memory management approach in the middleware layer (hypervisor).
Applying this to IoT infrastructure platforms is a promising domain.