-
Cluster Communication Latency: towards approaching its Minimum Hardware Limits, on Low-Power Platforms
Manolis Katevenis, FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete)
http://www.ics.forth.gr/carv/
-
Acknowledgements
• Iakovos Mavroidis
• John Goodacre (ARM, KALEAO)
• Vassilis Papaefstathiou
• Manolis Marazakis
• Nikolaos Chrysos
• and many many more!...
M. Katevenis: Cluster Communication Latency - Stamatis Vassiliadis Symposium, SAMOS 2017
-
The Essence of Communication
[Figure: a Producer node performs a remote write, through the Interconnect, to its Expected Consumers among nodes 1 to N.]
• Push style, like email
• If you know the consumers – else like spam
• If space already exists
• Zero latency, when pre-sent
• One-way latency, when sent "just-in-time"
• Pull style, like web browser
• When reader decides it needs the data and allocates space
• Two-way latency, starting when the data is needed
– unless able to prefetch
[Figure: a Consumer node sends a remote read request, through the Interconnect, to the Producer node, which replies with the data.]
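The latency cases listed on this slide can be tabulated with a small model. This is a minimal sketch and not part of the talk: the function name `visible_latency`, the `one_way` parameter (one traversal of the interconnect, in arbitrary time units), and the choice to count a successful prefetch as fully hidden latency are all assumptions made for illustration.

```python
def visible_latency(style: str, one_way: float, early: bool = False) -> float:
    """Latency seen by the consumer, counted from the moment it needs the data.

    style   -- "push" (producer writes remotely) or "pull" (consumer reads remotely)
    one_way -- time for one traversal of the interconnect (hypothetical unit)
    early   -- push: the data was pre-sent; pull: the data was prefetched in time
    """
    if style == "push":
        # Pre-sent data is already in place; just-in-time data needs one traversal.
        return 0.0 if early else one_way
    if style == "pull":
        # The read request goes out and the reply comes back, unless prefetched.
        # Assumption: a prefetch issued early enough hides the whole round trip.
        return 0.0 if early else 2 * one_way
    raise ValueError(f"unknown style: {style!r}")
```

Under these assumptions, a just-in-time push costs half of what a demand pull costs, which is the asymmetry the slide points at.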
-
Remote DMA versus Cache Coherence
[Figure: Producer and Consumer processors (P) with caches (C) and local memories (LM), connected by a NoC. With explicit communication, the Producer uses a Remote Store (word) or a Remote Write DMA (block). With cache coherence over a Lower Cache and Directory (Multiple Banks), the labeled messages are: store, invalidate, request, forwarded request, data.]
Example – push style: explicitly sending the data: 5x fewer packets (⇒ less time, less energy)
-
Can we approach this ideal, in Cluster Communication?
• Cluster Computing / Datacenters / HPC: ⇒ many thousands of nodes; "node" = island of (hardware) cache coherence
• Can we get Cluster Communication to be as fast, by analogy, as store & load instructions in a coherence island? + small delay due to distance, but not much more
• Clusters originate from the mass, commodity market, where customers did not care about "a few extra μs"…
• Work done in Energy-Efficient Datacenter / HPC projects ⇒ 64-bit ARM processors
-
Inefficiencies in Cluster Communication, to be overcome
• Five (5) copies of the data, instead of just one (1)!
– Data copying consumes time and energy
• Start by eliminating (large) NIC buffers:
– DMA across the network – RDMA
[Figure: conventional data path from the sender's user buffer to a kernel or lib buffer (cp), to the Sender NIC (DMA), across the Network to the Receiver NIC, to a kernel or lib buffer (DMA), to the receiver's user buffer (cp); contrasted with the ideal (zero-copy) RDMA path from Sender Memory directly to Receiver Memory.]
-
DMA Initiation: System Call, or directly from User-Level?
[Figure: Sender side with user process, kernel, and user Mem; NI - Network - NI; Receiver Memory with user and kernel or lib buffers. Labels: virtual addresses, dma write, addr., phy., pinned buf, cp, dma rd, DMA, ack, vDMA, ideal (zero-copy) RDMA.]
• System Call to initiate a DMA (with physical addr.) ≈ 3 to 4 μs
– on 64-bit ARM in Xilinx Zynq UltraScale+ under Linux
• For the virtualized DMA engine (8 channels), that accepts virtual address arguments, Initiation (writing 3-4 registers) ≈
-
Scalability: Global Virtual Addresses & Progressive Translation
• for Scalability: not all nodes can be required to know all other nodes' translation tables – see "Progressive Address Translation" in the M. Katevenis paper in: Stamatis Vassiliadis Symposium 2007 ⇒ Destination Addresses in network packets must be Virtual
• for Scalability: 64-bit GVA + (global) protection domain identifier
[Figure: a Copy (RDMA) Engine takes src addr, dst addr (64-bit virtual addresses in the Global Virtual Address Space) and PID. It reads Data from the Source Memory Hierarchy through Translate (MMU) and writes it across the Network, via Routing, into the Destination Memory Hierarchy through Translate (MMU).]
-
Two Slides from my Stamatis Vassiliadis 2007 Symposium talk: Network Routing as Generalization of Address Decoding
• Physical Address Decoding in a uniprocessor
• Geographical Address Routing in a multiprocessor
http://www.ics.forth.gr/carv/ipc/ldstgen_katevenis07.pdf
-
Progressive Address Translation: Localize Migration Updates
• Packets carry global virtual addresses
• Tables provide physical route (address) for the next few steps
• When page 9 migrates within D, only tables in that domain need updating
• Variable-size-page translation tables look like internet routing tables (longest-prefix matches if we want small-page-within-big-region migration)
• Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls
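The longest-prefix-match behavior mentioned above can be illustrated with a short sketch. It is not taken from the talk or the 2007 paper: the class name `ProgressiveTable`, its methods, and the next-step labels are hypothetical, and a linear scan stands in for whatever lookup hardware a real design would use.

```python
from typing import Optional

ADDR_BITS = 64  # global virtual addresses are 64 bits wide, as on the slides


class ProgressiveTable:
    """Translation table of one domain: maps address prefixes to a next step."""

    def __init__(self) -> None:
        # (prefix value, prefix length in bits) -> next-step label (hypothetical)
        self.entries = {}

    def add(self, prefix: int, length: int, next_step: str) -> None:
        """Install or overwrite the entry for the top `length` address bits."""
        self.entries[(prefix, length)] = next_step

    def lookup(self, gva: int) -> Optional[str]:
        """Return the next step of the longest prefix that matches gva."""
        best_len, best = -1, None
        for (prefix, length), next_step in self.entries.items():
            if length > best_len and (gva >> (ADDR_BITS - length)) == prefix:
                best_len, best = length, next_step
        return best


# A big region (top 8 bits = 0x12) plus one small page inside it that migrated.
table_d = ProgressiveTable()
table_d.add(0x12, 8, "towards region R")
table_d.add(0x1234, 16, "node 7")
```

When the small page migrates again inside the same domain, only `table_d` needs a new `add(0x1234, 16, ...)` entry; tables outside the domain keep routing on the shorter prefix.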
-
Current Address Translation Arch. not ready for GVAS
[Figure: in the Programmable Logic - PL (FPGA), the vDMA engine issues read or write addresses, virtual or physical, but truncated. The SMMU translates them if virtual. In the Programming System - PS (Processor Cores, Caches, local DRAM, other PS peripheral devices), routing is based on the physical address, physical-only. Width labels: 44 bits, 40 bits, 448 GBy window.]
• Address translation architecture of A53 in Xilinx Zynq UltraScale+
• DMA engine is virtualized, but truncates virtual addresses to
-
Why copy data between User and Kernel/Library buffers?
[Figure: same sender/receiver data-path diagram as on the DMA Initiation slide, with pinned buffers, cp, dma write, dma rd, ack, vDMA, and the ideal (zero-copy) RDMA path.]
• "Pinned" buffers: traditional DMA does not tolerate page faults
• "Registered" buffer addresses, in the absence of a "System" (I/O) MMU
• Non-cacheable buffers, in the absence of cache-coherent DMA
• Send buffer reuse immediately after send initiation
• Transfer occurs before receive-buffer ready, and receive Application unwilling to accept reception at library-specified address
-
MPI: Send-Rcv Match at Receiver, User-Specified Rcv Addr
• cost = one extra network round-trip time for sender to learn the address
[Figure: the Producer (user process) writes into sb and its send says "ready to send, at which address?". The Consumer's (early) receive says "ready to receive, but in my buffer, please!" and supplies the pointer to rb. The match takes place at the receiver, the data moves from sb to rb, and on done the Consumer (user process) reads from rb.]
-
Snd-Rcv Match at Sender, User-Specified Receive Address
• minimal latency achieved; early receive (rb pre-alloc by user) still needed
• does not work with MPI_ANY_SOURCE
[Figure: the Consumer's (early) receive, "ready to receive, but in my buffer, please!", carries the pointer to rb to the sender side, where the match takes place. The Producer (user process) writes into sb and sends, the data lands in rb, and on done the Consumer (user process) reads from rb.]
-
Receiver willing to accept Data at any Address (and not copy)
[Figure: the Producer (user process) writes into sb and sends: "allocate buffer, and... here come the data" or "use rb[i] that had been preallocated for me". The Consumer's rcv says "tell me when and where my data will be ready". After the match, on done, the Consumer (user process) reads from rb.]
• Eager delivery at preallocated, library-managed buffer, then given to user
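The three matching schemes on the preceding slides differ in how many network traversals separate the start of a send from the arrival of the data. The sketch below counts them. It is an illustrative model only: the scheme names and the function `traversals_before_data_arrives` are invented here, and the model assumes that, for match-at-sender, the early receive has already delivered the rb pointer before the send starts.

```python
def traversals_before_data_arrives(scheme: str) -> int:
    """One-way network traversals from the start of send until data is in place."""
    if scheme == "match_at_receiver":
        # "Ready to send" goes out and the rb address comes back (the extra
        # round trip named on the slide), then the data itself crosses.
        return 2 + 1
    if scheme in ("match_at_sender", "eager_any_address"):
        # The sender already knows where to write (a forwarded rb pointer, or a
        # preallocated library-managed buffer), so only the data crosses.
        return 1
    raise ValueError(f"unknown scheme: {scheme!r}")


def send_latency(scheme: str, one_way: float) -> float:
    """Latency in the caller's time unit, given one traversal time (assumed)."""
    return traversals_before_data_arrives(scheme) * one_way
```

With equal traversal times, match-at-receiver costs three traversals against one for the other two schemes, which is the "one extra network round-trip time" stated above.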
-
Mbox – Remote Enqueue: Asynchronous Event Multiplexing
• Atomic enqueue into shared space, unlike dedicated space with RDMA
• Space reserved only for number of actual senders – not potential senders
• Event Multiplexor – requests, notifications (like Select system call)
• RDMA completion should enqueue notification to receiver and sender
[Figure: Multiple Senders P1 and P2 enqueue messages m1 and m2 into a Shared Space "Mbox" Remote Queue in memory 3. Single-word messages are sent via a single store instruction; longer messages are first composed locally (memory 2) and then sent. The Single Receiver P3 waits on empty.]
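A software analogy of the Mbox can make the semantics concrete: several senders enqueue atomically into one shared queue, and a single receiver blocks while it is empty. This is only a sketch using Python's standard `queue.Queue` and threads; it says nothing about how the hardware remote enqueue is built, and the capacity of 8 and the message format are arbitrary choices.

```python
import queue
import threading

mbox = queue.Queue(maxsize=8)  # the shared space; capacity 8 is an arbitrary choice


def sender(name: str, count: int) -> None:
    """Enqueue `count` small messages tagged with the sender's name."""
    for i in range(count):
        # Queue.put is atomic with respect to the other senders.
        mbox.put((name, i))


def receive_all(expected: int) -> list:
    """Single receiver: get() blocks while the queue is empty ("wait on empty")."""
    return [mbox.get() for _ in range(expected)]


threads = [threading.Thread(target=sender, args=(p, 3)) for p in ("P1", "P2")]
for t in threads:
    t.start()
received = receive_all(6)
for t in threads:
    t.join()
```

The interleaving of P1 and P2 messages in `received` varies from run to run, but each sender's own messages stay in order, and the receiver needs no per-sender buffer.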
-
Per-Task Mbox Q's & Scheduler as Interrupt Generalization
• Tasks block and wait when synchronously reading from an empty queue
• Scheduler selects a non-empty queue of highest priority and runs its task
• Time quantum switches Q's and Tasks inside top non-empty priority class
• Enqueues into higher priority Q's interrupt lower-priority running tasks
RDMA is the Datapath – Queues are the Control
[Figure: Event Queues Q01, Q11, Q21 paired with Tasks / Processes T01, T11, T21, ordered by Priority from High to Low. The Scheduler Runs a task on core P, switches task per time quantum, receives an Interrupt for a message arriving now, and enters low-power sleep when all Queues are Empty.]
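The four scheduling rules above can be sketched in a few lines. This is a hypothetical illustration, not the scheduler from the talk: the class `MboxScheduler`, its method names, and the convention that priority 0 is the highest are assumptions. Rotation within a class stands in for the time quantum.

```python
from collections import deque
from typing import Optional


class MboxScheduler:
    """Priority 0 is highest; each priority class holds (task, queue) pairs."""

    def __init__(self, num_priorities: int) -> None:
        self.classes = [[] for _ in range(num_priorities)]

    def add_task(self, priority: int, task: str) -> deque:
        """Register a task and return its private event queue."""
        q = deque()
        self.classes[priority].append((task, q))
        return q

    def pick(self) -> Optional[str]:
        """Task to run next, or None when all queues are empty (sleep)."""
        for tasks in self.classes:  # scan from highest priority down
            for i, (task, q) in enumerate(tasks):
                if q:
                    # Rotate so the next time quantum favors a sibling task.
                    tasks.append(tasks.pop(i))
                    return task
        return None

    def preempts(self, enqueue_priority: int, running_priority: int) -> bool:
        """An enqueue interrupts the running task only if strictly higher."""
        return enqueue_priority < running_priority
```

Calling `pick()` once per time quantum alternates between tasks of the top non-empty class; a `None` result corresponds to the low-power sleep state in the figure.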
-
Conclusions
• Interprocessor Commun. analogous to load/store instr.
• No system calls – virtualized DMA engines
• Need a new Address Translation Architecture ⇒ Global Virtual Address Space (GVAS)
• Notifications, Mboxes, Scheduler: generalized interrupts
• Adapt libraries & API's to user-level, zero-copy commun.
-
Backup Slides
-
RDMA/wr-allocate? – Software-Guided Prefetch & Coherence
• SW fetches data from the cache of the last producer
– before task execution
– prefetch while executing other task
– wait for completion
• FETCH (read-only)
– for read-only task arguments
– override old copies at destination
• FETCH-O (with ownership)
– for read-write task args
– migrate cache-line ownership
– ensure that each cache-line is dirty in at most one cache (else, write-back order is undefined)
[Figure: two cores, each with a Cache (fields v, d, tag, data), connected through a NoC to Memory. A fetch command brings data from the other cache; at the destination a hit raises the choice "update? keep?", and a miss is also shown.]
Cache coherence through Software: simpler HW, energy efficient. Vassilis Papaefstathiou: ICS'13, and PhD Dissertation
-
Epoch Numbers: allow SW to control Cache Replacements
• Use a few (e.g. 3) tag bits per cache line to record which "group" of cache lines each belongs to; configure replacement policy by groups
[Figure: a 4-way cache (Cache Way 0 to Cache Way 3, each with Tags, Data, and an epoch field W0_epoch to W3_epoch). The dst. addr. of an incoming RDMA packet hashes to set_i. The Cache Controller, with a configurable Epochs Policy, COMPAREs the ways' epochs with the current epoch and the epoch(s) to preserve, and DECIDEs whom to evict, if at all keeping this in the cache.]
-
Epoch Numbers Identifying Task Lifetimes
• Scratchpad-like properties with the convenience of caches
• DMA prefetched lines ← "next" epoch
• runtime advances "current" epoch after tasks finish
• evict "past" lines first, …
• … then evict from epoch(s) with # of lines per set > their quota
[Figure: epoch numbers 0 to 7 with wraparound. A sliding window of active epochs (prev, curr, next) marks the present, between past and future epochs. Labels: Core, EBP, Curr. Epoch, Next Epoch, with the values 3 and 4 shown; cache lines with Tags, Data, Epoch.]
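The two eviction rules above can be written down as a victim-selection function for one cache set. This is a minimal sketch under stated assumptions, not the policy hardware from the talk: it assumes 3 epoch bits, treats prev, curr, and next as the active window, treats every other epoch value as "past", and uses invented names (`active_epochs`, `choose_victim`, `quota`).

```python
from typing import Optional

EPOCH_BITS = 3                # "a few (e.g. 3) tag bits per cache line"
NUM_EPOCHS = 1 << EPOCH_BITS  # epoch numbers wrap around modulo 8


def active_epochs(curr: int) -> list:
    """prev, curr, next: the sliding window, with wraparound."""
    return [(curr - 1) % NUM_EPOCHS, curr, (curr + 1) % NUM_EPOCHS]


def choose_victim(way_epochs: list, curr: int, quota: dict) -> Optional[int]:
    """Index of the way to evict in one set, or None if nothing qualifies.

    way_epochs -- epoch tag of each way in the set
    quota      -- max lines per set allowed for each active epoch (assumed)
    """
    active = active_epochs(curr)
    # Rule 1: lines outside the active window are "past"; evict them first.
    for way, epoch in enumerate(way_epochs):
        if epoch not in active:
            return way
    # Rule 2: otherwise evict from an epoch holding more lines than its quota.
    for epoch in active:
        if way_epochs.count(epoch) > quota.get(epoch, 0):
            return way_epochs.index(epoch)
    return None
```

A `None` result models the "if at all keeping this in the cache" case: no resident line may be displaced, so the incoming line would not be cached.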
-
Evaluation Summary: Coher. (HW/SW), Prefetch (gradual/bulk), Epochs
Baseline: HW-coherent caches, no prefetching
Conventional HW Prefetcher
Perf: +6% to +66% Explicit Bulk Pref. – RDMA, SW initiated
Perf: -14% to +20%
Perf. impr.: +3% to +44% (+20%)
L2 misses: -17% to -67% (-32%)
NoC traffic: -2% to -56% (-25%)
Energy: -8% to -53% (-28%)
Perf. impr.: +3% to +33% (+14%)
L2 misses: -22% to -71% (-38%)
NoC traffic: -24% to -72% (-41%)
Energy: -18% to -65% (-44%)
HW Coherent + Epochs
No HW Coherence + Epochs