Fast and energy-efficient eNVM based memory organisation at L3-L1 layers for IoT Computing/Routing...

© IMEC 2015 F.CATTHOOR

FAST AND ENERGY-EFFICIENT ENVM BASED MEMORY ORGANISATION AT L3-L1 LAYERS FOR IOT COMPUTING/ROUTING PLATFORMS: MAIN CHALLENGES

Francky Catthoor, Nov. 2015

With input of esp. Praveen Raghavan, Jan Van Houdt, Stefan Cosemans, Matthias Hartmann and other IMEC colleagues

With use of MSc and PhD thesis results in cooperation with mem. organ. teams at IMEC, NTUA and UCMadrid

Also based on ULP-DSIP PhD team work


Secure, trustworthy computing and communication embedded in every-thing and every-body.

A pervasive, context-aware ambient IoT environment, sensitive and responsive to the presence of people.


SYSTEM CLASSES FOR IOT ENVIRONMENT

[Figure: system classes for the IoT environment: cloud/fog (stationary), nomadic, and ambient sensor network, with example settings server, office, home, gateway, car and body. Power classes: 'Watt' (mains), 'Milliwatt' (battery), 'Microwatt' (ambient energy). Axis labels: compute from 1 Top/s down to 10 Mop/s, power from 10 Watt downwards, link rates Gb/s, Mb/s, kb/s. Other labels: Internet IPv6, MIMO, UMTS, WLAN, WPAN, WBAN, PDA, DSC, DVC, MP3, GPS; "Hear, See, Feel, Show…"; "More Moore" and "More-than-Moore".]

Courtesy: Hugo De Man, ISSCC 2005


CURRENT PLATFORM ARCHITECTURES: ENERGY-FLEXIBILITY CONFLICT

Courtesy: Engel Roza (Philips)

Goal: programmable DSIP and configurable CGA as good as ASIC

Note: higher than 1000 MOPS/mW is reachable in the 45 nm node due to a smaller subword length than 32 bit and non-standard-cell based layout schemes for critical components


FOCUS HERE ON IOT GATEWAY-MICROSERVER

Growing part of applications and market


RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY

Some research published by Google is available online at http://research.google.com/pubs/HardwareandArchitecture.html, and the same holds for Facebook at https://research.facebook.com/publications/systems/

There are also many academic papers in this direction, but most of them focus on performance rather than energy efficiency. The ones which do focus on energy are too disruptive for industry in the short/mid term because they change the entire application software stack.




Google: Proc. ISCA, 2014, Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

Abstract: Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity renders existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings

=> Interesting but they only appear to gain 20% this way. So some bottlenecks are clearly not addressed yet.



Google: ISCA 2011, The Impact of Memory Subsystem Resource Sharing on Datacenter Applications

Abstract: In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal.

=> Interesting!



Facebook: Characterizing Load Imbalance in Real-World Networked Caches, Qi Huang e.a., HotNets 2014: Thirteenth ACM Workshop on Hot Topics in Networks · October 27, 2014

Abstract: Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary by associated hashing function and partitioning ratio. Most real-world workloads are also skewed, with some items significantly more popular than others. Inefficiency in addressing both issues can create an imbalance in cache-server loads. We analyze the degree of observed load imbalance, focusing on read-only traffic against Facebook's graph cache tier in Tao. We investigate the principal causes of load imbalance, including data co-location, non-ideal hashing scenarios, and hot-spot temporal effects. We also employ trace-driven analytics to study the benefits and limitations of current load-balancing methods, suggesting areas for future research.

=> this analysis looks very interesting



Facebook: Fastpass: A Centralized "Zero-Queue" Datacenter Network, Jonathan Perry e.a., ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM) · August 18, 2014

Abstract: Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control, to a centralized arbiter, of when each packet should be transmitted and what path it should follow.

=> Interesting analysis!



Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, Morgan & Claypool Publishers (2013)

Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board. Notes for the Second Edition After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. 
Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area.

=> Very interesting overview of SotA


ENERGY BREAKDOWN FOR TYPICAL SERVER UPROC

Source: Barroso e.a., The Datacenter as a Computer, Google Books '13


TYPICAL APPLICATION: ENERGY BREAKDOWN FOR EMBEDDED PLATFORMS

[Figure: energy breakdown charts for four embedded platforms. Chart labels: Non-optim.; Non-optim. (no adv. DTSE); Optim. ITSE mapping (no ITSE trafo yet)]

▸ MPEG2 decoding on TI C6x VLIW-DSP [Lambrechts-ASAP05]
▸ WLAN activity on IMEC BOADRES coarse-grain array
▸ Audio proc on ARM A19 [Carroll'10]: ~65% CPU, ~25% DM, ~10% IM
▸ FFT proc on IMEC-BLOX [CSI'13]: ~50% proc, ~50% DM


DECODER + WORDLINE CONTRIBUTES SIGNIFICANTLY TO MEMORY DELAY AND ENERGY: WIDE WORD ACCESS

A typical small SRAM (<64 kb) sized in the conventional way; breakdown of delay and energy of such an SRAM:

▸ Decoder + wordline contributes nearly 60% of SRAM delay
▸ Decoder + wordline contributes about 40%-50% of SRAM energy

Source:
▸ Energy: Evans et al., J. Solid-State Circ., 1995
▸ Delay: Horowitz et al., Trans. Solid-State Circ., 2002

[Figures: SRAM delay breakdown (decoder + WL vs. the rest); SRAM energy breakdown (decoder + WL vs. the rest)]


PROPOSED DSIP ARCHITECTURE TEMPLATE EXPLOITING WIDE L1D/L1I MEMORY ACCESS

[Figure: 1 tile/node in a platform. Blocks: external memory (SDRAM), prog. DMA, wide scratch pad (level-1 DM), level-1 I-cache, level-0 inst mem with loop buffers (LB) per unit, VWRs (level-0 DM), AGU and LD/ST, MMU, SWP shifter, and data path (DP) with complex FU1 and FU2. Three widths are marked: Width-1, Width-2 (very wide word), Width-3 (data-path word).]

F. Catthoor e.a., ULD DSIPs, Springer book 2010


REGISTER FILE VS VERY WIDE REGISTER (VWR)

[Figure: a register file with Nwords words, Bit-out bits per word and Nports ports, next to a set of VWRs]

▸ VWR width = Nwords * Bit-out / Nports = very wide word = 960 bits
▸ Bit-out = data-path word = 96 bits
▸ The number of bits stored in the register file and in all the VWRs combined is equal
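As a small worked example of the width relation on this slide: the 960-bit VWR width and the 96-bit data-path word are taken from the slide, while the register-file shape (40 words, 4 ports) is only an assumed illustration that happens to reproduce those numbers.

```python
def vwr_width_bits(n_words: int, bit_out: int, n_ports: int) -> int:
    """Width of one VWR: Nwords * Bit-out / Nports (must divide evenly)."""
    total_bits = n_words * bit_out
    if total_bits % n_ports != 0:
        raise ValueError("Nwords * Bit-out must be a multiple of Nports")
    return total_bits // n_ports


# Hypothetical register file: 40 words of 96 bits behind 4 ports (assumed values).
N_WORDS, BIT_OUT, N_PORTS = 40, 96, 4

width = vwr_width_bits(N_WORDS, BIT_OUT, N_PORTS)
words_per_vwr = width // BIT_OUT      # data-path words held by one VWR

print(width, words_per_vwr)           # prints: 960 10
```

With equal total storage, this shape corresponds to Nports VWRs of 960 bits each, every one holding ten 96-bit data-path words.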


MOTIVATION FOR VWR VS RF

Assumptions:
▸ 8 bit out, 64 bit wide VWR
▸ Same storage for (multiple) VWRs and (single) RF
▸ Same total number of ports

Conclusions:
▸ VWR reduces the complexity of the decoder and the net capacitive load at the RF drivers, and is hence always better than RF
▸ But additional complexity in the compiler

[Figures: Figure 1: same memory footprint; Figure 2: 8 bit activity]


NONVOLATILE OFF-CHIP MEMORY ROADMAP

[Figure: cell size [F²] (axis 2 to 8) versus (equivalent) technology node F [nm] (180, 130, 90, 65, 5x, 4x, 3x, 2x, 1x) for NOR-Flash (FG/NROM), NAND-Flash (FG), TANOS, 3D-NAND Flash, PCM and RRAM. Annotations: code, data, evolutionary, disruptive, production.]

Copyright: Jan Van Houdt, IMEC, 2010

Conclusion: main focus now on stand-alone applications


VARIOUS POSSIBLE CACHE MEMORIES TO REPLACE IN HIGH SPEED, LOW EDYN EMBEDDED SOC PLATFORMS

EACH MEMORY NEEDS UNIQUE POLICY FOR SRAM REPLACEMENT

[Figure: embedded SoC platform with two ARM A-15 cores (L1 D, L1 I), an MPEG4 accelerator with local memories, an LTE receiver with turbo decoder and HARQ memory (L1 D, L1 I), and an L2 memory]

▸ Instruction memory can exploit non-volatility
▸ High speed data needs efficient latency masking
▸ Slower memories need just enough masking
▸ Slower memories could handle replacement with some mitigation


DISTRIBUTED INSTRUCTION MEMORY HIERARCHY

Original code:
for (i=0;i<n;i++)
  for (j=0;j<Ntaps;j++)
    C[i] += A[j]*B[i+j]

LB code for DP:
for (i=0;i<n;i++)
  for (j=0;j<Ntaps/WS;j++)
    for (k=0;k<WS;k++)
      C[i] += A[j*WS+k]*B[i+j*WS+k]
  * remainder: C[i] += A[p]*B[i+p] for the last Rem = Ntaps%WS taps

LB code for MMU:
for (i=0;i<n;i++)
  Load A[j to j+WS]
  Load B[i+j to i+j+WS]
  Store C[i]

LB code for VWR MUX:
for (i=0;i<n;i++)
  Select Mux of A, B = k
  * Select Mux of A, B = 1 to Rem
  Select Mux for C = i

[Figure: DSIP tile with a separate loop buffer (LB) for each unit: DP with complex FU, AGU and LD/ST, VWRs, SWP shifter, and the wide scratch pad. Width-2: very wide word; Width-3: data-path word.]
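The split of the original loop into WS-wide windows plus a remainder can be checked with a short sketch. It assumes the window index is j*WS+k, with Ntaps/WS full windows and Ntaps%WS leftover taps; the function names and test data are illustrative only and not from the slides.

```python
def fir_original(a, b, n):
    """Reference loop nest: C[i] += A[j] * B[i+j] over all taps."""
    ntaps = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(ntaps):
            c[i] += a[j] * b[i + j]
    return c


def fir_windowed(a, b, n, ws):
    """Same computation, split into full WS-wide windows plus a remainder."""
    ntaps = len(a)
    full, rem = divmod(ntaps, ws)         # full windows, leftover taps
    c = [0] * n
    for i in range(n):
        for j in range(full):             # one wide word per iteration
            for k in range(ws):           # lanes inside the wide word
                p = j * ws + k
                c[i] += a[p] * b[i + p]
        for p in range(full * ws, full * ws + rem):   # remainder taps
            c[i] += a[p] * b[i + p]
    return c


taps = [1, 2, 3, 4, 5, 6, 7]              # Ntaps = 7, not a multiple of WS
samples = list(range(20))
assert fir_windowed(taps, samples, 10, 3) == fir_original(taps, samples, 10)
```

Both versions perform the same multiply-accumulates in the same order; only the loop structure changes, which is what allows the DP, MMU and VWR multiplexers to each run their own small loop body.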


INST MEMORY HIERARCHY: EXECUTING LOOPS IN PARALLEL

State-of-the-art loop controllers are centralized; we have a distributed approach
▸ Loop cache, loop counters, etc. [Bajwa et al]
▸ Some commercial (embedded) processors

Simultaneous multi-threaded architectures and multi-processor approaches have high hardware overhead.

Software control and control flow extraction in the compiler are not handled by any known state-of-the-art approach.

[Figure: energy per component (register file, instruction memory, data memory, data path) for VLIW and FEENECS; vertical axis from 0.00E+00 to 2.50E-08]

23x gain = far fewer instructions + distributed LB for each unit

F. Catthoor e.a., ULD DSIPs, Springer book 2010


MODIFIED I-CACHE ORGANIZATION

[Figure: processor, NVM IL1, EMSHR, selector, L2 cache]

• Novel I-cache configuration
• Addresses write latency & write energy issues of eNVM options like STT-MRAM
• EMSHR as a fully-associative buffer
  • few entries
  • block promotion to IL1 based on threshold

M. Komalan e.a., DATE'2014
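A minimal sketch of threshold-based block promotion, to illustrate why a small buffer in front of the NVM IL1 limits expensive NVM writes. The slide does not detail the EMSHR internals, so the LRU eviction, the per-block access counter, the unbounded IL1 set and all names below are assumptions.

```python
from collections import OrderedDict


class PromotionBuffer:
    """Hypothetical small fully-associative buffer in front of an NVM IL1."""

    def __init__(self, entries: int, threshold: int):
        self.entries = entries
        self.threshold = threshold
        self.buffer = OrderedDict()   # block address -> access count (LRU order)
        self.il1 = set()              # blocks already in NVM IL1 (capacity ignored)
        self.nvm_writes = 0           # each promotion costs one NVM block write

    def fetch(self, block: int) -> str:
        """Return which level served the instruction block."""
        if block in self.il1:
            return "il1"
        if block in self.buffer:
            self.buffer[block] += 1
            self.buffer.move_to_end(block)
            if self.buffer[block] >= self.threshold:
                # Block is used often enough to justify the NVM write.
                del self.buffer[block]
                self.il1.add(block)
                self.nvm_writes += 1
            return "buffer"
        # Miss everywhere: bring the block from L2 into the buffer only.
        if len(self.buffer) >= self.entries:
            self.buffer.popitem(last=False)   # evict least recently used
        self.buffer[block] = 1
        return "l2"


buf = PromotionBuffer(entries=2, threshold=4)
trace = [0xA, 0xA, 0xA, 0xA, 0xA, 0xB]
served = [buf.fetch(block) for block in trace]
```

In this toy trace, block 0xA is written to the NVM once, after its fourth access, while the single-use block 0xB never causes an NVM write.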


PERFORMANCE RESULTS

[Figure: performance penalty (%) for thresholds 12, 8 and 4; vertical axis from 84 to 104]

▸ Modified NVM I-cache with 64KB capacity
▸ Performance penalty eliminated
▸ Better performance than SRAM-based instruction memory


ENERGY IMPROVEMENTS

[Figure: relative energy (%) for thresholds 12, 8 and 4; vertical axis from 50 to 85]

▸ Modified NVM I-cache with 64KB capacity
▸ Factor 1.5 in energy for STT-MRAM based instruction memories
▸ NVM model still under investigation!
▸ Reloading and scenarios not taken into account (currently only static mode)

M. Komalan e.a., DATE'2014




READ DELAY PERFORMANCE PENALTY ON XRAM INCLUSION IN L1-D {WRITE DELAY ~10NS}

[Figure: performance penalty (%)]


PERFORMANCE PENALTY ON STT-MRAM INCLUSION IN L1-D {WRITE DELAY ~1.5NS}

[Figure: performance penalty (%)]


ARCH AND ACCESS PATTERN OPTIMIZATIONS

Modified VWR (2 Kbit)
▸ 2 VWRs: ping-pong of data, which helps mitigate performance issues due to the read delay of STT-MRAM

Code transformations (vectorization, prefetching, instruction rescheduling) and some built-in compiler optimizations.

M. Komalan e.a., DATE'2015
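A rough cycle-count sketch of why two VWRs used in ping-pong fashion hide the eNVM read delay. It is a simplified model under assumed conditions: every wide word needs a fixed number of read cycles and a fixed number of compute cycles, and a load into one VWR overlaps fully with computation on the other. The function names and numbers are illustrative only.

```python
def single_vwr_cycles(n_blocks: int, read_cycles: int, compute_cycles: int) -> int:
    """One VWR: every wide-word load stalls the data path."""
    return n_blocks * (read_cycles + compute_cycles)


def ping_pong_cycles(n_blocks: int, read_cycles: int, compute_cycles: int) -> int:
    """Two VWRs: load block i+1 while computing on block i."""
    if n_blocks == 0:
        return 0
    overlapped = (n_blocks - 1) * max(read_cycles, compute_cycles)
    # First load cannot be hidden; last compute has no load to overlap with.
    return read_cycles + overlapped + compute_cycles


# Assumed example: 8 wide words, 4-cycle eNVM read, 6 compute cycles per word.
print(single_vwr_cycles(8, 4, 6), ping_pong_cycles(8, 4, 6))   # prints: 80 52
```

As long as the compute time per wide word is at least the read delay, only the very first read remains visible in this model.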


PERFORMANCE PENALTY REDUCTION FOR L1-D WITH 2-4 CYCLE ENVM READ

[Figure: performance penalty (%) after arch changes and reorder: victim buffer + VWR]


MEMORY RESOURCE MANAGEMENT: OVERALL VIEW

[Figure: three layers. Embedded software applications: application 1 (scalable 3D graphics), application 2 (wireless network), application 3 (video codec). Middleware (embedded in memory IP module): mem resource manager / TDTSE, with a design-time and a run-time part. Embedded hardware: PE1 to PE8 (RISC, VLIW, SIMD, ASIC), an interconnect and SDRAM (L2 + main); all PEs have local L1 SRAM memory.]

Given
▸ Set of active tasks and their data requirements
▸ Tasks' metadata
▸ Constraints like deadline & throughput
▸ Objective (e.g. minimize energy consumption)

Decides
▸ Where to access (data to resource assignment)
▸ When to access (scheduling)
▸ How to access (with what mem "configuration")

Also referred to as "task-level data access scheduling / mapping"

Proposal: use a scenario-based run-time scheme embedded inside the memory organization (no application code or processor mapping changes!)

D. Atienza e.a., Springer book '14
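One possible reading of the scenario-based run-time scheme, sketched under strong assumptions: at design time each scenario gets a small table of Pareto-optimal memory configurations, and at run time the manager picks the lowest-energy entry that still meets the deadline. The scenario name, the configuration names and all numbers are invented for illustration and do not come from the slides.

```python
from typing import NamedTuple


class MemConfig(NamedTuple):
    name: str
    time_us: float      # time to serve the task's data accesses
    energy_uj: float    # energy for the same accesses


# Design time: hypothetical Pareto points per scenario.
PARETO = {
    "video_only": [
        MemConfig("low_power", 900.0, 40.0),
        MemConfig("balanced", 600.0, 55.0),
        MemConfig("fast", 400.0, 90.0),
    ],
}


def select_config(scenario: str, deadline_us: float) -> MemConfig:
    """Run time: cheapest configuration that meets the deadline."""
    points = PARETO[scenario]
    feasible = [p for p in points if p.time_us <= deadline_us]
    if not feasible:
        # No point meets the deadline: fall back to the fastest one.
        return min(points, key=lambda p: p.time_us)
    return min(feasible, key=lambda p: p.energy_uj)


print(select_config("video_only", 700.0).name)   # prints: balanced
```

The lookup itself is cheap, which is what makes it plausible to embed such a decision inside the memory organization without touching application code.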


DATA MANAGEMENT FLOW

[Figure: data management flow in three refinement steps]

▸ DDT: dynamic data type exploration, trafo & refinement, yielding concrete data types
▸ Dynamic memory mgmt refinement, yielding virtual memory segments
▸ Physical memory mgmt refinement, yielding physical memories


WHY IS IT IMPORTANT? DYNAMIC DATA SET IN AN ATM SWITCH

[Figure: ATM cells arriving on port 1 and port 2 of the ATM MUXes. Each connection is identified by the 32-bit record { VPI, VCI, port } with Key1 (VPI[8]), Key2 (VCI[16]) and Key3 (port[8]); example field values are shown in binary, e.g. 2 = 0000 0010, 5 = 0000 0000 0000 0101, 1 = 0000 0001.]

+ 155 Mb/sec
+ Table of active connections
+ Table size without optimization: 16,284 Mbytes
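To make the key structure concrete, the sketch below packs the three fields into one 32-bit key and contrasts the full key space with a table that only stores active connections. The field widths come from the slide; the bit order (VPI in the top byte), the function names and the stored value are assumptions.

```python
def pack_key(vpi: int, vci: int, port: int) -> int:
    """Pack VPI[8], VCI[16], port[8] into one 32-bit key (VPI in the top byte)."""
    if not (0 <= vpi < 2**8 and 0 <= vci < 2**16 and 0 <= port < 2**8):
        raise ValueError("field out of range")
    return (vpi << 24) | (vci << 8) | port


def unpack_key(key: int):
    """Inverse of pack_key: return (vpi, vci, port)."""
    return (key >> 24) & 0xFF, (key >> 8) & 0xFFFF, key & 0xFF


KEY_SPACE = 1 << 32            # entries in a directly indexed table

# A dynamic table only holds the connections that are currently active.
active = {}
active[pack_key(2, 5, 1)] = "connection state"

print(hex(pack_key(2, 5, 1)), KEY_SPACE, len(active))
# prints: 0x2000501 4294967296 1
```

The gap between the key space and the number of active connections is the reason the choice of dynamic data type matters so much here.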


DATA MANAGEMENT RESULTS FOR ATM PROTOCOL MODULE IN ADAPTATION LAYER

▸ Dynamic data type exploration and DDT refinement (concrete data types): factor 5 fewer accesses (and energy)
▸ Dynamic memory mgmt refinement (virtual memory segments & pools): factor 3 less energy
▸ Physical memory mgmt refinement (physical memories): factor 2 fewer memory ports for the same cycle budget (throughput)



Global data management design flow for dynamic concurrent tasks with data-dominated behaviour

[Figure: design flow starting from a concurrent OO spec, with the steps data type exploration (dynamic data types, e.g. key/data records in a binary tree (BT)), virtual memory mgmt (sub-pool per size, free blocks), task concurrency mgmt, physical memory mgmt (memory allocation and assignment) and address optimization, followed by SW/HW co-design into a SW design flow and a HW design flow. Target: processor, memories, memory mgmt unit / controller and ASUs.]
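The "sub-pool per size" and "free blocks" labels suggest a size-segregated allocator. The sketch below is one generic way to organize that, not the flow's actual output: the size classes, the bump-pointer growth and the class and method names are all assumptions.

```python
class SizeClassPool:
    """Hypothetical allocator with one free-block list per size class."""

    def __init__(self, size_classes):
        self.size_classes = sorted(size_classes)
        self.free = {s: [] for s in self.size_classes}   # sub-pool per size
        self.next_addr = 0                               # simple bump pointer

    def _class_for(self, size: int) -> int:
        for s in self.size_classes:
            if size <= s:
                return s
        raise ValueError("request larger than largest size class")

    def alloc(self, size: int):
        """Return (address, size class); reuse a free block when possible."""
        cls = self._class_for(size)
        if self.free[cls]:
            return self.free[cls].pop(), cls
        addr = self.next_addr
        self.next_addr += cls
        return addr, cls

    def release(self, addr: int, cls: int) -> None:
        """Put the block back on the free list of its own sub-pool."""
        self.free[cls].append(addr)


pool = SizeClassPool([16, 64])
first = pool.alloc(10)        # served from the 16-byte sub-pool
second = pool.alloc(40)       # served from the 64-byte sub-pool
pool.release(*first)
third = pool.alloc(12)        # reuses the freed 16-byte block
```

Because a freed block returns to its own sub-pool, allocation needs no search over blocks of other sizes.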



CONCLUSIONS

▸ IoT infrastructure platforms currently still have energy bottlenecks. Disruptive approaches which change the application software stack are very challenging to introduce in industry.

▸ So instead go for changes inside the memory organization, combined with the introduction of new NVM technologies, which are not visible to the application code stack.

▸ Tuning of selected parameters can reduce the performance penalty due to the NVM to very tolerable levels (≈1%). Significant energy reductions are potentially available (depending on read energy per access, though).

▸ Pareto-optimal values for the different parameters are application and platform dependent.

▸ This has to be combined with a matched and optimized dynamic memory management approach in the middleware layer (hypervisor).

▸ Applying this to IoT infrastructure platforms is a promising domain.
