towards approaching its Minimum Hardware Limits, on Low … · 2018. 2. 2. · translate if virtual...

24
Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/

Transcript of towards approaching its Minimum Hardware Limits, on Low … · 2018. 2. 2. · translate if virtual...

  • ClusterCommunica/onLatency:towardsapproachingitsMinimumHardwareLimits,

    onLow-PowerPla=orms

    ManolisKatevenisFORTH,Heraklion,Crete,Greece(incollab.withUniv.ofCrete)

    http://www.ics.forth.gr/carv/

  • Acknowledgements

    •  IakovosMavroidis•  JohnGoodacre(ARM,KALEAO)•  VassilisPapaefstathiou•  ManolisMarazakis•  NikolaosChrysos• andmanymanymore!...

    2M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

  • 3M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    TheEssenceofCommunica/on

    write (remote)

    Producer ExpectedConsumers

    node ...

    Interconnect

    node 3 node 5

    node 6

    node 4

    node 7node 1

    node 8node N

    node 2

    •  Pushstyle,likeemail•  Ifyouknowtheconsumers– elselikespam

    •  Ifspacealreadyexists•  Zerolatency,whenpre-sent•  One-waylatency,whensent“just-in-/me”

    •  Pullstyle,likewebbrowser• Whenreaderdecidesitneedsthedataandallocatesspace

    •  Two-waylatency,star/ngwhenthedataisneeded

    – unlessabletoprefetch

    Producer Consumer

    reply (data)

    read request (remote)

    node 1 Interconnect

    node 5node 3

    node 2node 6

    node N

    node 7

    node 4

    node 8node ...

  • RemoteDMAversusCacheCoherence

    C LM

    P

    LM

    P

    LM

    P

    Consumer Consumer

    P

    C

    Producer

    NoC

    P

    C

    P

    LM

    P

    NoC

    C

    Producer

    Lower Cache and Directory(Multiple Banks)

    P

    Remote Store

    sto

    re

    forw

    ard

    ed

    request

    inva

    lida

    te

    inva

    lida

    te

    data

    sto

    re

    or Remote Write DMA (block)forwardedrequest (word)

    4M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Example–pushstyle:explicitelysendingthedata:5xfewerpackets(⇒less/me,lessenergy)

  • Canweapproachthisideal,inClusterCommunica/on?

    • ClusterCompu/ng/Datacenters/HPC: ⇒manythousandsofnodes “node”=islandof(hardware)cachecoherence

    • CanwegetClusterCommunica/ontobeanalogouslyfastlikestore&loadinstruc/onsinacoherenceisland? +smalldelayduetodistance,butnotmuchmore

    • Clustersoriginatefrommass,commoditymarket,wherecustomersdidnotcareabout“afewextraμs”…

    • WorkdoneinEnergy-EfficientDatacenter/HPCprojects ⇒64-bitARMprocessors

    5M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

  • 6M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    InefficienciesinClusterCommunica/on,tobeovercomed

    •  Five(5)copyingsofthedata,insteadofjustone(1)!– Datacopyingconsumes/meandenergy

    •  Startbyelimina/ng(large)NICbuffers:– DMAacrossthenetwork–RDMA

    user

    kernel or lib

    user

    Sender NIC Receiver NIC

    Receiver MemorySender Memory

    Network

    kernel or lib

    cpDMADMA

    ideal (zero−copy) RDMA

    cp

  • 7M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    DMAIni/a/on:SystemCall,ordirectlyfromUser-Level?

    kernel or lib

    user

    Receiver Memory

    kernel or lib

    Network NINI

    user Mem

    user process

    kernel

    Sendervirtual addresses

    dma write

    addr.cp phy.bu

    fp

    inn

    ed

    pin

    ne

    dcp

    ideal (zero−copy) RDMA

    ackackDMAdma rd

    vDMA

    •  SystemCalltoini/ateaDMA(withphysicaladdr.)≈3to4μs– on64-bitARMinXilinxZynqUltraScale+underLinux

    •  ForthevirtualizedDMAengine(8channels),thatacceptsvirtualaddressarguments,Ini/a/on(wri/ng3-4registers)≈

  • 8M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Scalability:GlobalVirtualAddresses&ProgressiveTransla/on

    •  forScalability:notallnodescanberequiredtoknowallothernodes’transla/ontables–see“ProgressiveAddressTransla/on”intheM.Katevenispaperin:Stama9sVassiliadisSymposium2007⇒Des/na/onAddressesinnetworkpacketsmustbeVirtual

    •  forScalability:64-bitGVA+(global)protec/ondomainiden/fier

    archyHier−

    Tra

    nsl

    ate

    (M

    MU

    )

    Tra

    nsl

    ate

    (M

    MU

    )

    Memory

    archyHier−

    nationDesti−

    Routing

    read

    Copy (RDMA) Engine

    Data

    write

    Network

    GlobalVirtual

    AddressSpace

    src addr

    PID

    dst addr

    64−bitvirtual

    addresses

    Memory

    Source

  • 9M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    TwoSlidesfrommyStama9sVassiliadis2007Symposiumtalk:NetworkRou/ngasGeneraliza/onofAddressDecoding

    •  PhysicalAddressDecodinginauniprocessor

    •  GeographicalAddressRou/nginamul/processor

    hwp://www.ics.forth.gr/carv/ipc/ldstgen_katevenis07.pdf

  • 10M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    ProgressiveAddressTransla/on:LocalizeMigra/onUpdates

    •  Packetscarryglobalvirtualaddresses•  Tablesprovidephysicalroute(address)forthenextfewsteps•  Whenpage9migrateswithinD,onlytablesinthatdomainneedupda/ng•  Variable-size-pagetransla/ontableslooklikeinternetrou/ngtables(longest-prefixmatchesifwewantsmall-page-within-big-regionmigra/on)

    •  Tablesthatpar//onthesystem,forprotec/onagainstuntrustedopera/ngsystems,looklikeinternetfirewalls

  • 11M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Scalability:GlobalVirtualAddresses&ProgressiveTransla/on

    •  forScalability:notallnodescanberequiredtoknowallothernodes’transla/ontables–see“ProgressiveAddressTransla/on”intheM.Katevenispaperin:Stama9sVassiliadisSymposium2007⇒Des/na/onAddressesinnetworkpacketsmustbeVirtual

    •  forScalability:64-bitGVA+(global)protec/ondomainiden/fier

    archyHier−

    Tra

    nsl

    ate

    (M

    MU

    )

    Tra

    nsl

    ate

    (M

    MU

    )

    Memory

    archyHier−

    nationDesti−

    Routing

    read

    Copy (RDMA) Engine

    Data

    write

    Network

    GlobalVirtual

    AddressSpace

    src addr

    PID

    dst addr

    64−bitvirtual

    addresses

    Memory

    Source

  • CurrentAddressTransla/onArch.notreadyforGVAS

    vDMA engine

    read or write addresses,virtual or physical,but truncated to:

    SMMU

    addresses

    44 bits

    translateif virtual

    448 GBy window

    phy. addressroute based on

    physical−only

    Pro

    gra

    mm

    able

    Logic

    − P

    L

    to other PS peripheral devices

    Pro

    gra

    mm

    ing S

    yste

    m −

    PS

    CoresProcessor

    Caches

    local

    DRAM

    FP

    GA

    40 bits

    12M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    •  Addresstransla/onarchitectureofA53inXilinxZynqUltraScale+•  DMAengineisvirtualized,buttruncatesvirtualaddressesto

  • 13M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    WhycopydatabetweenUserandKernel/Librarybuffers?

    kernel or lib

    user

    Receiver Memory

    kernel or lib

    Network NINI

    user Mem

    user process

    kernel

    Sendervirtual addresses

    dma write

    addr.cp phy.bu

    fp

    inn

    ed

    pin

    ne

    dcp

    ideal (zero−copy) RDMA

    ackackDMAdma rd

    vDMA

    •  “Pinned”buffers:tradi/onalDMAdoesnottoleratepagefaults•  “Registered”bufferaddresses,inthelackof“System”(I/O)MMU•  Non-cacheablebuffers,inthelackofcache-coherentDMA•  Sendbufferreuseimmediatelyayersendini/a/on•  Transferoccursbeforereceive-bufferready,andreceiveApplica/onunwillingtoacceptrecep/onatlibrary-specifiedaddress

  • 14M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    MPI:Send-RcvMatchatReceiver,User-SpecifiedRcvAddr

    •  cost=oneextranetworkround-trip/meforsendertolearntheaddress

    sb

    read from rb(user process):

    Consumerdone

    (done)

    sendwrite into sb

    (user process):Producer

    match

    (early) receiveready to send,

    pointer to rb

    at which address?ready toreceive,

    buffer,please!

    but in my

    rb

  • 15M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Snd-RcvMatchatSender,User-SpecifiedReceiveAddress

    •  minimallatencyachieved;earlyreceive(rbpre-allocbyuser)s/llneeded•  doesnotworkwithMPI_ANY_SOURCE

    sb

    read from rb(user process):

    Consumerdone

    (done)

    send

    match

    Producer(user process):

    write into sb

    (early) receiveready toreceive,

    buffer,please!

    but in my

    pointer to rb

    rb

  • 16M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    ReceiverwillingtoacceptDataatanyAddress(andnotcopy)

    sb

    send

    write into sb

    Producer(user process):

    read from rb(user process):

    Consumerdone

    (done)

    match

    "allocate buffer, and...here come the data"or "use rb[i] that had been

    preallocated for me

    rcv

    tell me whenand wheremy datawill be ready

    rb

    •  Eagerdeliveryatpreallocated,library-managedbuffer,thengiventouser

  • 17M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Mbox–RemoteEnqueue:AsynchronousEventMul/plexing

    •  Atomicenqueueintosharedspace,unlikededicatedspacewithRDMA•  Spacereservedonlyfornumberofactualsenders–notpoten/alsenders•  EventMul/plexor–requests,no/fica/ons(likeSelectsystemcall)•  RDMAcomple/onshouldenqueueno/fica/ontoreceiverandsender

    m2

    m2P2

    P1

    m1

    m1

    storesent via single store

    instruction

    single−word messages

    MultipleSenders

    memory 2

    longer messages: first composed locally...

    ... thensent

    Shared Space

    "Mbox"

    RemoteQueue

    memory 3

    SingleReceiver

    emptywait on

    P3

  • 18M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    Per-TaskMboxQ’s&SchedulerasInterruptGeneraliza/on

    •  Tasksblockandwaitwhensynchronouslyreadingfromanemptyqueue•  Schedulerselectsanon-emptyqueueofhighestpriorityandrunsitstask•  TimequantumswitchesQ’sandTasksinsidetopnon-emptypriorityclass•  EnqueuesintohigherpriorityQ’sinterruptlower-priorityrunningtasks

    RDMAistheDatapath–QueuesaretheControl

    Scheduler

    low−power sleep

    T11Q11

    Q21T21

    Q01T01

    arrivingnow

    High

    Low

    rity

    Prio−/ Tasks

    ProcessesEvent Queues

    switch task per time quantum

    coreP

    Interrupt !

    when all Queues Empty

    Run

  • Conclusions

    •  InterprocessorCommun.analogoustoload/storeinstr.• Nosystemcalls–virtualizedDMAengines• NeedanewAddressTransla/onArchitecture⇒ GlobalVirtualAddressSpace(GVAS)

    • No/fica/ons,Mboxes,Scheduler:generalizedinterrupts• Adaptlibraries&API’stouser-level,zero-copycommun.

    19M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

  • BackupSlides

    20M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

  • RDMA/wr-allocate?–Soyware-GuidedPrefetch&Coherence

    21M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    •  SWfetchesdatafromthecacheofthelastproducer–  beforetaskexecu/on–  prefetchwhileexecu/ngothertask– waitforcomple/on•  FETCH(read-only)–  forread-onlytaskarguments–  overrideoldcopiesatdestina/on•  FETCH-O(withownership)–  forread-writetaskargs– migratecache-lineownership–  ensurethateachcache-lineisdirtyinatmostonecache(else,write-backorderisundefined)

    CacheCachev d tag data

    . . .

    Memory

    command

    NoC

    fetch

    data

    core core

    update?

    keep?

    hit

    miss

    CachecoherencethroughSoyware:simplerHW,energyefficientVassilisPapaefstathiou:ICS’13,andPhDDisserta/on

  • EpochNumbers:allowSWtocontrolCacheReplacements

    22M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    •  Useafew(e.g.3)tagbitspercachelinetorecordwhich“group”ofcachelineseachbelongsto;configurereplacementpolicybygroups

    incoming RDMACOMPARE

    epoch

    epoch(s)to preserve

    current

    whom to evict, if at all keeping this in the cache

    dst. a

    ddr.

    hashes h

    ere

    & DECIDE:packet

    W1_epoch

    Tags Data

    Cache Way 0

    W0_epoch

    set_i

    configurableEpochs Policy

    Cache Controller

    Tags Data

    Cache Way 3

    W3_epoch

    Tags Data

    Cache Way 2

    W2_epoch

    Tags Data

    Cache Way 1

  • EpochNumbersIden/fyingTaskLife/mes

    •  Scratchpad-likeproper/eswiththeconvenienceofcaches•  DMAprefetchedlines←“next”epoch•  run/meadvances“current”epochayertasksfinish•  evict“past”linesfirst,…•  …thenevictfromepoch(s)with#oflinesperset>theirquota

    23M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017

    active epochs3

    Next Epoch 4

    2 3 4 5 6 7

    futurepast

    present

    prevcurr

    next

    sliding window

    Curr. Epoch

    EBP

    Core

    Tags DataTags Data

    Epoch

    wraparound

    10

  • Evalua/onSummary:Coher.(HW/SW),Prefetch(gradual/bulk),Epochs

    M.Katevenis:ClusterCommunica/onLatency-Stama/sVassiliadisSymposium,SAMOS2017 24

    Baseline:HW-coherentcaches,noprefetching

    Conven/onalHWPrefetcher

    Perf:+6%to+66%ExplicitBulkPref.–RDMA,SWini/ated

    Perf:-14%to+20%

    Perf.impr.: +3% to +44% (+20%)

    L2misses: -17% to -67% (-32%)

    NoCtraffic: -2% to -56% (-25%)

    Energy: -8% to -53% (-28%)

    Perf.impr.: +3% to +33% (+14%)

    L2misses: -22% to -71% (-38%)

    NoCtraffic: -24% to -72% (-41%)

    Energy: -18% to -65% (-44%)

    HWCoherent+Epochs

    NoHWCoherence+Epochs