towards approaching its Minimum Hardware Limits, on Low … · 2018. 2. 2. · translate if virtual...

ClusterCommunica/onLatency:towardsapproachingitsMinimumHardwareLimits,

onLow-PowerPla=orms

ManolisKatevenisFORTH,Heraklion,Crete,Greece(incollab.withUniv.ofCrete)

http://www.ics.forth.gr/carv/

Acknowledgements

•  IakovosMavroidis•  JohnGoodacre(ARM,KALEAO)•  VassilisPapaefstathiou•  ManolisMarazakis•  NikolaosChrysos• andmanymanymore!...

2M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017


TheEssenceofCommunica/on

write (remote)

Producer ExpectedConsumers

node ...

Interconnect

node 3 node 5

node 6

node 4

node 7node 1

node 8node N

node 2

•  Pushstyle,likeemail•  Ifyouknowtheconsumers– elselikespam

•  Ifspacealreadyexists•  Zerolatency,whenpre-sent•  One-waylatency,whensent“just-in-/me”

•  Pullstyle,likewebbrowser• Whenreaderdecidesitneedsthedataandallocatesspace

•  Two-waylatency,star/ngwhenthedataisneeded

– unlessabletoprefetch

Producer Consumer

reply (data)

read request (remote)

node 1 Interconnect

node 5node 3

node 2node 6

node N

node 7

node 4

node 8node ...

RemoteDMAversusCacheCoherence

C LM

P

LM

P

LM

P

Consumer Consumer

P

C

Producer

NoC

P

C

P

LM

P

NoC

C

Producer

Lower Cache and Directory(Multiple Banks)

P

Remote Store

sto

re

forw

ard

ed

request

inva

lida

te

inva

lida

te

data

sto

re

or Remote Write DMA (block)forwardedrequest (word)


Example–pushstyle:explicitelysendingthedata:5xfewerpackets(⇒less/me,lessenergy)

Canweapproachthisideal,inClusterCommunica/on?

• ClusterCompu/ng/Datacenters/HPC: ⇒manythousandsofnodes “node”=islandof(hardware)cachecoherence

• CanwegetClusterCommunica/ontobeanalogouslyfastlikestore&loadinstruc/onsinacoherenceisland? +smalldelayduetodistance,butnotmuchmore

• Clustersoriginatefrommass,commoditymarket,wherecustomersdidnotcareabout“afewextraμs”…

• WorkdoneinEnergy-EfficientDatacenter/HPCprojects ⇒64-bitARMprocessors



InefficienciesinClusterCommunica/on,tobeovercomed

•  Five(5)copyingsofthedata,insteadofjustone(1)!– Datacopyingconsumes/meandenergy

•  Startbyelimina/ng(large)NICbuffers:– DMAacrossthenetwork–RDMA

user

kernel or lib

user

Sender NIC Receiver NIC

Receiver MemorySender Memory

Network

kernel or lib

cpDMADMA

ideal (zero−copy) RDMA

cp


DMAIni/a/on:SystemCall,ordirectlyfromUser-Level?

kernel or lib

user

Receiver Memory

kernel or lib

Network NINI

user Mem

user process

kernel

Sendervirtual addresses

dma write

addr.cp phy.bu

fp

inn

ed

pin

ne

dcp


ackackDMAdma rd

vDMA

•  SystemCalltoini/ateaDMA(withphysicaladdr.)≈3to4μs– on64-bitARMinXilinxZynqUltraScale+underLinux

•  ForthevirtualizedDMAengine(8channels),thatacceptsvirtualaddressarguments,Ini/a/on(wri/ng3-4registers)≈


Scalability:GlobalVirtualAddresses&ProgressiveTransla/on

•  forScalability:notallnodescanberequiredtoknowallothernodes’transla/ontables–see“ProgressiveAddressTransla/on”intheM.Katevenispaperin:Stama9sVassiliadisSymposium2007⇒Des/na/onAddressesinnetworkpacketsmustbeVirtual

•  forScalability:64-bitGVA+(global)protec/ondomainiden/fier

archyHier−

Tra

nsl

ate

(M

MU

)

Tra

nsl

ate

(M

MU

)

Memory

archyHier−

nationDesti−

Routing

read

Copy (RDMA) Engine

Data

write

Network

GlobalVirtual

AddressSpace

src addr

PID

dst addr

64−bitvirtual

addresses

Memory

Source


TwoSlidesfrommyStama9sVassiliadis2007Symposiumtalk:NetworkRou/ngasGeneraliza/onofAddressDecoding

•  PhysicalAddressDecodinginauniprocessor

•  GeographicalAddressRou/nginamul/processor

hwp://www.ics.forth.gr/carv/ipc/ldstgen_katevenis07.pdf


ProgressiveAddressTransla/on:LocalizeMigra/onUpdates

•  Packetscarryglobalvirtualaddresses•  Tablesprovidephysicalroute(address)forthenextfewsteps•  Whenpage9migrateswithinD,onlytablesinthatdomainneedupda/ng•  Variable-size-pagetransla/ontableslooklikeinternetrou/ngtables(longest-prefixmatchesifwewantsmall-page-within-big-regionmigra/on)

•  Tablesthatpar//onthesystem,forprotec/onagainstuntrustedopera/ngsystems,looklikeinternetfirewalls


Scalability:GlobalVirtualAddresses&ProgressiveTransla/on

•  forScalability:notallnodescanberequiredtoknowallothernodes’transla/ontables–see“ProgressiveAddressTransla/on”intheM.Katevenispaperin:Stama9sVassiliadisSymposium2007⇒Des/na/onAddressesinnetworkpacketsmustbeVirtual

•  forScalability:64-bitGVA+(global)protec/ondomainiden/fier

archyHier−

Tra

nsl

ate

(M

MU

)

Tra

nsl

ate

(M

MU

)

Memory

archyHier−

nationDesti−

Routing

read

Copy (RDMA) Engine

Data

write

Network

GlobalVirtual

AddressSpace

src addr

PID

dst addr

64−bitvirtual

addresses

Memory

Source

CurrentAddressTransla/onArch.notreadyforGVAS

vDMA engine

read or write addresses,virtual or physical,but truncated to:

SMMU

addresses

44 bits

translateif virtual

448 GBy window

phy. addressroute based on

physical−only

Pro

gra

mm

able

Logic

− P

L

to other PS peripheral devices

Pro

gra

mm

ing S

yste

m −

PS

CoresProcessor

Caches

local

DRAM

FP

GA

40 bits


•  Addresstransla/onarchitectureofA53inXilinxZynqUltraScale+•  DMAengineisvirtualized,buttruncatesvirtualaddressesto


WhycopydatabetweenUserandKernel/Librarybuffers?

kernel or lib

user

Receiver Memory

kernel or lib

Network NINI

user Mem

user process

kernel

Sendervirtual addresses

dma write

addr.cp phy.bu

fp

inn

ed

pin

ne

dcp


ackackDMAdma rd

vDMA

•  “Pinned”buffers:tradi/onalDMAdoesnottoleratepagefaults•  “Registered”bufferaddresses,inthelackof“System”(I/O)MMU•  Non-cacheablebuffers,inthelackofcache-coherentDMA•  Sendbufferreuseimmediatelyayersendini/a/on•  Transferoccursbeforereceive-bufferready,andreceiveApplica/onunwillingtoacceptrecep/onatlibrary-specifiedaddress


MPI:Send-RcvMatchatReceiver,User-SpecifiedRcvAddr

•  cost=oneextranetworkround-trip/meforsendertolearntheaddress

sb

read from rb(user process):

Consumerdone

(done)

sendwrite into sb

(user process):Producer

match

(early) receiveready to send,

pointer to rb

at which address?ready toreceive,

buffer,please!

but in my

rb


Snd-RcvMatchatSender,User-SpecifiedReceiveAddress

•  minimallatencyachieved;earlyreceive(rbpre-allocbyuser)s/llneeded•  doesnotworkwithMPI_ANY_SOURCE

sb


Consumerdone

(done)

send

match

Producer(user process):

write into sb

(early) receiveready toreceive,

buffer,please!

but in my

pointer to rb

rb


ReceiverwillingtoacceptDataatanyAddress(andnotcopy)

sb

send

write into sb

Producer(user process):


Consumerdone

(done)

match

"allocate buffer, and...here come the data"or "use rb[i] that had been

preallocated for me

rcv

tell me whenand wheremy datawill be ready

rb

•  Eagerdeliveryatpreallocated,library-managedbuffer,thengiventouser


Mbox–RemoteEnqueue:AsynchronousEventMul/plexing

•  Atomicenqueueintosharedspace,unlikededicatedspacewithRDMA•  Spacereservedonlyfornumberofactualsenders–notpoten/alsenders•  EventMul/plexor–requests,no/fica/ons(likeSelectsystemcall)•  RDMAcomple/onshouldenqueueno/fica/ontoreceiverandsender

m2

m2P2

P1

m1

m1

storesent via single store

instruction

single−word messages

MultipleSenders

memory 2

longer messages: first composed locally...

... thensent

Shared Space

"Mbox"

RemoteQueue

memory 3

SingleReceiver

emptywait on

P3


Per-TaskMboxQ’s&SchedulerasInterruptGeneraliza/on

•  Tasksblockandwaitwhensynchronouslyreadingfromanemptyqueue•  Schedulerselectsanon-emptyqueueofhighestpriorityandrunsitstask•  TimequantumswitchesQ’sandTasksinsidetopnon-emptypriorityclass•  EnqueuesintohigherpriorityQ’sinterruptlower-priorityrunningtasks

RDMAistheDatapath–QueuesaretheControl

Scheduler

low−power sleep

T11Q11

Q21T21

Q01T01

arrivingnow

High

Low

rity

Prio−/ Tasks

ProcessesEvent Queues

switch task per time quantum

coreP

Interrupt !

when all Queues Empty

Run

Conclusions

•  InterprocessorCommun.analogoustoload/storeinstr.• Nosystemcalls–virtualizedDMAengines• NeedanewAddressTransla/onArchitecture⇒ GlobalVirtualAddressSpace(GVAS)

• No/fica/ons,Mboxes,Scheduler:generalizedinterrupts• Adaptlibraries&API’stouser-level,zero-copycommun.


BackupSlides


RDMA/wr-allocate?–Soyware-GuidedPrefetch&Coherence


•  SWfetchesdatafromthecacheofthelastproducer–  beforetaskexecu/on–  prefetchwhileexecu/ngothertask– waitforcomple/on•  FETCH(read-only)–  forread-onlytaskarguments–  overrideoldcopiesatdestina/on•  FETCH-O(withownership)–  forread-writetaskargs– migratecache-lineownership–  ensurethateachcache-lineisdirtyinatmostonecache(else,write-backorderisundefined)

CacheCachev d tag data

. . .

Memory

command

NoC

fetch

data

core core

update?

keep?

hit

miss

CachecoherencethroughSoyware:simplerHW,energyefficientVassilisPapaefstathiou:ICS’13,andPhDDisserta/on

EpochNumbers:allowSWtocontrolCacheReplacements


•  Useafew(e.g.3)tagbitspercachelinetorecordwhich“group”ofcachelineseachbelongsto;configurereplacementpolicybygroups

incoming RDMACOMPARE

epoch

epoch(s)to preserve

current

whom to evict, if at all keeping this in the cache

dst. a

ddr.

hashes h

ere

& DECIDE:packet

W1_epoch

Tags Data

Cache Way 0

W0_epoch

set_i

configurableEpochs Policy

Cache Controller

Tags Data

Cache Way 3

W3_epoch

Tags Data

Cache Way 2

W2_epoch

Tags Data

Cache Way 1

EpochNumbersIden/fyingTaskLife/mes

•  Scratchpad-likeproper/eswiththeconvenienceofcaches•  DMAprefetchedlines←“next”epoch•  run/meadvances“current”epochayertasksfinish•  evict“past”linesfirst,…•  …thenevictfromepoch(s)with#oflinesperset>theirquota


active epochs3

Next Epoch 4

2 3 4 5 6 7

futurepast

present

prevcurr

next

sliding window

Curr. Epoch

EBP

Core

Tags DataTags Data

Epoch

wraparound

10

Evalua/onSummary:Coher.(HW/SW),Prefetch(gradual/bulk),Epochs

M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017 24

Baseline:HW-coherentcaches,noprefetching

Conven/onalHWPrefetcher

Perf:+6%to+66%ExplicitBulkPref.–RDMA,SWini/ated

Perf:-14%to+20%

Perf.impr.: +3% to +44% (+20%)

L2misses: -17% to -67% (-32%)

NoCtraffic: -2% to -56% (-25%)

Energy: -8% to -53% (-28%)

Perf.impr.: +3% to +33% (+14%)

L2misses: -22% to -71% (-38%)

NoCtraffic: -24% to -72% (-41%)

Energy: -18% to -65% (-44%)

HWCoherent+Epochs

NoHWCoherence+Epochs

towards approaching its Minimum Hardware Limits, on Low … · 2018. 2. 2. · translate if virtual...

Documents

Transcript of towards approaching its Minimum Hardware Limits, on Low … · 2018. 2. 2. · translate if virtual...