IEEE CloudCom 2014 Participation Report

IEEE CloudCom 2014 Participation Report / 野@AIST (産総研). Covered sessions: 2C: Virtualization I; 3C, 4B: HPC on Cloud. 2015-02-06, 45th Grid Consortium (グリッド協議会) Workshop.

Transcript of IEEE CloudCom 2014 Participation Report

  • IEEE CloudCom 2014

    Session: 2C: Virtualization I Session: 3C, 4B: HPC on Cloud

    2015-02-06, 45th Grid Consortium (グリッド協議会) Workshop

  • Rank: CORE computer science conference rankings; Publication and Citation: Microsoft Academic Search

    Conference       Rank   Publication   Citation   % accepted
    IEEE/ACM CCGrid  A      1454          10577      19
    IEEE CLOUD       B      234           445        18
    IEEE CloudCom    C      70            187        18
    IEEE CloudNet    -      -             -          28
    IEEE/ACM UCC     -      -             -          19
    ACM SoCC         -      -             -          24
    CLOSER           -      -             -          17

    Gartner Hype Cycle 2014

  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory

    Transcendent memory (tmem): cleancache and frontswap frontends; zcache, RAMster, and the Xen shim as backends.

    NEXTmem (aka Ex-Tmem): extends tmem with NVM. Clean pages evicted from a guest VM are put into a FIFO put buffer in DRAM; DRAM also holds an LRU hot region, the NVM holds an LFU clean region plus a swap region, and dirty pages are flushed to disk. Memory allocation across Level 1 (DRAM) and Level 2 (NVM) is managed by the hypervisor.

    Figure 5. Architecture of NEXTmem. Arrows indicate writes. A get results in a read from NEXTmem or from disk.

    […] joining a FIFO (first-in-first-out) queue at the DRAM level. This queue serves as a put buffer. When the DRAM level is full and another put arrives, the oldest page is transferred to a clean region at the NVM level.

    If there are two gets for x while it is in the put buffer, x is transferred to a hot region at the DRAM level and pushed onto the top of an LRU stack. A get on a page y in this stack will move y to the top. The pages at the stack bottom are cold; when the DRAM level is full, the bottom page is transferred to the clean region at the NVM level.

    The clean NVM region uses LFU replacement, so the least frequently used page is discarded if the NVM level is full. If there are two gets for a page z in this clean region, z will be copied to the hot region at the DRAM level.

    The justification for this design, and experimental results to demonstrate its efficacy, are provided in a separate paper [16]. For this paper, our focus is on using the 3-level cache model in Sec. II to analyze the behavior of the cleancache hierarchy: a Level 1 that consists of the put buffer and hot region in DRAM, a Level 2 that is the clean region in NVM, and a Level 3 that is the disk.

    It would seem intractable to model the unusual and tightly coupled replacement policies (a mix of FIFO, LRU and LFU; a transfer or copy within DRAM or between DRAM and NVM). The experimental results that follow show that our 3-level model can in fact overcome this difficulty.
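
    To make the interplay of the three replacement policies concrete, here is a minimal Python sketch of the cleancache path described above (FIFO put buffer and LRU hot region in DRAM, LFU clean region in NVM). It is a toy model under stated assumptions, not the authors' implementation; the 50/50 DRAM split between buffer and hot region and all capacities are illustrative.

```python
from collections import OrderedDict, deque, Counter

class NextmemSketch:
    """Toy cleancache hierarchy: DRAM = FIFO put buffer + LRU hot region,
    NVM = LFU clean region; anything else is served from disk (Level 3)."""

    def __init__(self, dram_pages, nvm_pages):
        self.buf_cap = max(1, dram_pages // 2)          # illustrative 50/50 DRAM split
        self.hot_cap = max(1, dram_pages - self.buf_cap)
        self.nvm_cap = nvm_pages
        self.put_buf = deque()                          # FIFO put buffer (page ids)
        self.hot = OrderedDict()                        # LRU hot region (MRU at the end)
        self.clean = {}                                 # NVM clean region: page -> get count (LFU)
        self.buf_gets = Counter()                       # gets seen while a page is in the put buffer

    def put(self, page):
        """A clean page evicted by the guest joins the FIFO put buffer."""
        if page in self.hot or page in self.put_buf:
            return
        if len(self.put_buf) >= self.buf_cap:           # buffer full: oldest page drops to NVM
            self._to_clean(self.put_buf.popleft())
        self.put_buf.append(page)
        self.buf_gets[page] = 0

    def get(self, page):
        """Return the level that serves the page: 'dram', 'nvm' or 'disk'."""
        if page in self.hot:
            self.hot.move_to_end(page)                  # LRU touch
            return "dram"
        if page in self.put_buf:
            self.buf_gets[page] += 1
            if self.buf_gets[page] >= 2:                # second get: transfer to the hot region
                self.put_buf.remove(page)
                self._to_hot(page)
            return "dram"
        if page in self.clean:
            self.clean[page] += 1
            if self.clean[page] >= 2:                   # second get: copy up to the hot region
                self._to_hot(page)
            return "nvm"
        return "disk"                                   # Level 3 miss

    def _to_hot(self, page):
        if len(self.hot) >= self.hot_cap:               # cold page at the LRU bottom drops to NVM
            victim, _ = self.hot.popitem(last=False)
            self._to_clean(victim)
        self.hot[page] = True

    def _to_clean(self, page):
        if page in self.clean:
            return
        if len(self.clean) >= self.nvm_cap:             # LFU: discard the least frequently used page
            del self.clean[min(self.clean, key=self.clean.get)]
        self.clean[page] = 0
```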

    C. Experimental Set-up

    The NEXTmem experiments are run on a machine that has an Intel Xeon 2.5 GHz processor with 12 cores, 2 × 4 GB of DRAM, and a SATA 2.0, 1 TB, 5400 rpm Seagate disk. The hypervisor is Xen 4.1.3, with its tmem interface (for put, get, etc.) unchanged, but the frontswap and cleancache inside are replaced by NEXTmem.

    VM1: TS; TS; XL; XL; EC; XL; XL; TS; H2; XL; H2; TS; TB; TS; XL;
    VM2: XL; TB; TS; H2; TS; JY; XL; TS; JY; XL; TS; XL; TB; TB; XL;
    VM3: H2; TB; XL; XL; XL; TS; XL; TS; TB; TB; TS; EC; TS; TB; JY;
    VM4: H2; TS; TS; H2; EC; JY; H2; EC; XL; EC; XL; XL; TB; XL; TB;
    VM5: JY; EC; JY; XL; XL; H2; H2; EC; H2; XL; TB; TS; EC; XL; XL;

    Table I. Sequence of applications run by each VM for the experiment in Sec. III-D. (TS=Tradesoap, TB=Tradebeans, H2=H2, XL=Xalan, JY=Jython, EC=Eclipse.)

    To apply pressure on NEXTmem, we set M1 = 70000 4-KByte pages (put buffer plus hot region). The NVM has 300000 pages, so M2 is 300000 minus the swap region. This M1 and M2 are shared by the VMs.

    There is no commodity byte-addressable NVM that we could use for this experiment, so the NVM is emulated by DRAM. For the SLO considered here, the DRAM and NVM latencies are negligible when compared to disk latency.

    Each VM has 2 vCPUs, 500 MB of DRAM and a 100 GB virtual disk, and runs Ubuntu 12.04 LTS.

    D. Application of 3-level model to NEXTmem provisioning

    The validation of the Cache Miss Equation in Fig. 2 is for a trace simulation over two LRU caches. We now confirm that the equation does work with a real set-up; specifically, we have a VM running the DaCapo Tradebeans benchmark⁴ on the machine with the NEXTmem implementation.

    Fig. 6(a) shows that the equation is an excellent fit for the DRAM Level 1 in NEXTmem, and Fig. 6(b) shows a similarly good fit for the NVM Level 2 at different M1 values. Like in Fig. 2(c), the four sets of (M1+M2, Pmiss) values again lie on the same curve defined by the equation in Fig. 6(c). Space constraints prevent us from presenting more results from experiments validating the equation for NEXTmem.

    To illustrate the application of the 3-level model, we run 5 concurrent VMs. Each VM runs a sequence of randomly chosen DaCapo benchmarks, so the VM goes through phases when it is more memory-intensive than usual. The VMs run different sequences, as shown in Table I, so the VMs are not identical in their memory demand.

    The SLO is pmax = 0.05 and Lmax = 300 ms for VM1, and pmax = 0.02 and Lmax = 500 ms for VM2. There is no SLO for the other 3 VMs.

    Time is divided into 90-second epochs. The experiment starts with M1 and M2 both equally partitioned among the 5 VMs. After each epoch, there is a 4-second calibration period, during which we vary each VM's M1 and M2 allocation and measure their miss probability; the results are then used for regression to obtain the parameter values for the Cache Miss Equation for each level and each VM. This periodic recalibration is necessary because the VMs' need for memory is changing dynamically.

    ⁴ http://www.dacapobench.org
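
    The per-epoch calibration described above (perturb each VM's allocation, measure miss probabilities, fit the Cache Miss Equation by regression) can be sketched roughly as below. The paper's actual equation is not reproduced here, so both the power-law model and the sample numbers are illustrative placeholders, not the authors' formula or measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder miss-probability model: NOT the paper's Cache Miss Equation,
# just an assumed shape to illustrate the per-epoch regression step.
def miss_model(M, a, b):
    return a * np.power(M, -b)

# Illustrative (allocation in pages, measured miss probability) samples
# gathered during one 4-second calibration period, for one VM and one level.
alloc = np.array([20000.0, 40000.0, 60000.0, 80000.0])
pmiss = np.array([0.21, 0.12, 0.08, 0.06])

(a, b), _ = curve_fit(miss_model, alloc, pmiss, p0=(1.0, 0.5))
print(f"fitted a={a:.3g}, b={b:.3g}")
print(f"predicted miss probability at 70000 pages: {miss_model(70000.0, a, b):.3f}")
```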



  • Persistent memory

    NVMe driver

    Linux support for NVM: PMFS, DAX

    OpenNVM (SanDisk) API: atomic write, atomic trim; NVMKV, NVMFS

    SNIA NVM Programming Technical WG http://www.snia.org/forums/sssi/nvmp


    PM (persistent memory) support in Linux
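
    To illustrate the DAX-style byte-addressable access mentioned above, the sketch below memory-maps a file assumed to live on a DAX-mounted filesystem; the path /mnt/pmem/example.dat is hypothetical. With DAX, such loads and stores bypass the page cache and reach the persistent medium directly.

```python
import mmap
import os

PM_FILE = "/mnt/pmem/example.dat"   # hypothetical file on a DAX-mounted fs (e.g. ext4 mounted with -o dax)
SIZE = 4096

fd = os.open(PM_FILE, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)              # size the backing file before mapping

buf = mmap.mmap(fd, SIZE)           # load/store access into persistent memory
buf[0:5] = b"hello"                 # ordinary byte writes, no read()/write() syscalls
buf.flush()                         # msync: ask the kernel to make the update durable
buf.close()
os.close(fd)
```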

  • HPC on Cloud (8 papers)

    1. Reliability Guided Resource Allocation for Large-Scale Systems, S. Umamaheshwaran and T. J. Hacker (Purdue U.)
    2. Energy-Efficient Scheduling of Urgent Bag-of-Tasks Applications in Clouds through DVFS, R. N. Calheiros and R. Buyya (U. Melbourne)
    3. A Framework for Measuring the Impact and Effectiveness of the NEES Cyberinfrastructure for Earthquake Engineering, T. Hacker and A. J. Magana (Purdue U.)
    4. Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-Offs between Performance and Cost, L. Thai, B. Varghese, and A. Barker (U. St Andrews)
    5. CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments, T. Mastelic, I. Brandic, and J. Jasarevic (Vienna U. of Technology)
    6. Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors, V. Plugaru, S. Varrette, and P. Bouvry (U. Luxembourg)
    7. Exploring the Performance Impact of Virtualization on an HPC Cloud, N. Chakthranont, P. Khunphet, R. Takano, and T. Ikegami (KMUTNB, AIST)
    8. GateCloud: An Integration of Gate Monte Carlo Simulation with a Cloud Computing Environment, B. A. Rowedder, H. Wang, and Y. Kuang (UNLV)


  • [1] [2, 6] [4, 5] [6, 7]

    [1, 4, 5] IaaS: OpenStack [6], CloudStack [7] [8]

    MPI [6, 7]; Bag of Tasks [2]; Bag of Distributed Tasks [4]; Web (FFmpeg, MongoDB, Ruby on Rails) [5]; [8]; Earthquake Engineering [3]


  • CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments

    Defines a per-VM metric (CPU-PC) from the VM's stolen time and relates it to application response time.
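
    The exact CPU-PC formula is defined in the paper and is not reproduced here; the sketch below only shows the raw ingredient it builds on, reading the VM's steal time from /proc/stat so it can be correlated with application response time.

```python
import time

def steal_fraction(interval=1.0):
    """Fraction of CPU time stolen by the hypervisor over `interval` seconds,
    read from the aggregate 'cpu' line of /proc/stat (8th value = steal ticks)."""
    def snapshot():
        with open("/proc/stat") as f:
            values = [int(v) for v in f.readline().split()[1:]]
        return values[7], sum(values)

    steal0, total0 = snapshot()
    time.sleep(interval)
    steal1, total1 = snapshot()
    return (steal1 - steal0) / max(1, total1 - total0)

# Sample the steal fraction once per second, e.g. alongside response-time measurements.
if __name__ == "__main__":
    print(f"steal fraction over 1 s: {steal_fraction():.3f}")
```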


  • ASGC (AIST Super Green Cloud) Hardware Spec.


    Compute node:
      CPU         Intel Xeon E5-2680 v2 / 2.8 GHz (10 cores) × 2 CPUs
      Memory      128 GB DDR3-1866
      InfiniBand  Mellanox ConnectX-3 (FDR)
      Ethernet    Intel X520-DA2 (10 GbE)
      Disk        Intel SSD DC S3500 600 GB

    The 155-node cluster consists of Cray H2312 blade servers. The theoretical peak performance is 69.44 TFLOPS. Operation started in July 2014.
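
    As a quick sanity check on the quoted 69.44 TFLOPS figure, the arithmetic below assumes 8 double-precision flops per cycle per core for the AVX-capable Xeon E5-2680 v2 (the flops-per-cycle value is our assumption, not stated on the slide):

```python
# Theoretical peak = nodes x sockets x cores x clock x flops/cycle
nodes, sockets, cores, ghz, flops_per_cycle = 155, 2, 10, 2.8, 8
per_node_gflops = sockets * cores * ghz * flops_per_cycle      # 448 GFLOPS per node
print(f"{per_node_gflops * nodes / 1000:.2f} TFLOPS")          # -> 69.44 TFLOPS
```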

    Exploring the Performance Impact of Virtualization on an HPC Cloud

  • ASGC Software Stack

    Management Stack:
      CentOS 6.5 (QEMU/KVM 0.12.1.2)
      Apache CloudStack 4.3 + our extensions
        PCI passthrough/SR-IOV support (KVM only)
        sgc-tools: virtual cluster construction utility
      RADOS cluster storage

    HPC Stack (Virtual Cluster):
      Intel Compiler/Math Kernel Library SP1 1.1.106
      Open MPI 1.6.5
      Mellanox OFED 2.1
      Torque job scheduler



  • Benchmark Programs

    Micro benchmark:
      Intel MPI Benchmarks (IMB) version 3.2.4

    Application-level benchmarks:
      HPC Challenge (HPCC) version 1.4.3
        G-HPL, EP-STREAM, G-RandomAccess, G-FFT
      OpenMX version 3.7.4
      Graph 500 version 2.1.4



  • MPI Point-to-point communication


    [Figure (IMB): point-to-point throughput (GB/s) vs. message size (KB), physical vs. virtual cluster; peak throughput 5.85 GB/s (physical) vs. 5.69 GB/s (virtual).]

    The overhead is less than 3% with large messages, though it is up to 25% with small messages.

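
    For reference, a minimal mpi4py ping-pong in the spirit of the IMB point-to-point measurement (a simplified sketch assuming mpi4py is installed on the cluster, not IMB itself); run it on two ranks with, for example, `mpirun -np 2 python pingpong.py`.

```python
# pingpong.py
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

SIZE = 4 * 1024 * 1024                    # 4 MB message
REPS = 100
buf = bytearray(SIZE)

comm.Barrier()
t0 = time.time()
for _ in range(REPS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1)
        comm.Recv([buf, MPI.BYTE], source=1)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0)
        comm.Send([buf, MPI.BYTE], dest=0)
elapsed = time.time() - t0

if rank == 0:
    # The two transfers per iteration are serialized, so
    # total bytes / total time approximates the per-direction bandwidth.
    print(f"throughput: {2 * REPS * SIZE / elapsed / 1e9:.2f} GB/s")
```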

  • MPI Collectives (64 bytes)


    [Figure (IMB): execution time (usec) vs. number of nodes (0-128) for Allgather, Allreduce, and Alltoall, physical vs. virtual cluster.]

    The overhead becomes significant as the number of nodes increases.

    Overheads annotated on the plots: +77%, +88%, +43%.
    Load imbalance?
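
    A hedged sketch of how a 64-byte Allreduce can be timed with mpi4py (assumed available), roughly mirroring what the IMB collectives measurement above does; this is not the IMB code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
sendbuf = np.zeros(8, dtype=np.int64)     # 8 x 8 bytes = 64-byte message
recvbuf = np.empty_like(sendbuf)
REPS = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
t1 = MPI.Wtime()

if comm.Get_rank() == 0:
    print(f"average Allreduce time: {(t1 - t0) / REPS * 1e6:.1f} usec")
```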


  • G-HPL (LINPACK)


    [Figure (HPCC): G-HPL performance (TFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]

    Performance degradation: 5.4 - 6.6%

    Efficiency* on 128 nodes: physical 90%, virtual 84%

    *) Rmax / Rpeak

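
    The efficiency figures above imply roughly the following Rmax values at 128 nodes, using the 448 GFLOPS/node peak derived from the hardware-spec slide (back-of-the-envelope numbers, not taken from the paper):

```python
rpeak_128 = 128 * 0.448                                      # TFLOPS: 128 nodes x 448 GFLOPS
print(f"Rpeak(128 nodes) = {rpeak_128:.1f} TFLOPS")          # ~57.3
print(f"physical Rmax ~ {0.90 * rpeak_128:.1f} TFLOPS")      # ~51.6
print(f"virtual  Rmax ~ {0.84 * rpeak_128:.1f} TFLOPS")      # ~48.2
```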

  • EP-STREAM and G-FFT


    [Figure (HPCC): EP-STREAM performance (GB/s) and G-FFT performance (GFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]

    The overheads are negligible.

    EP-STREAM: memory-intensive, with no communication.
    G-FFT: all-to-all communication with large messages.

    Exploring the Performance Impact of Virtualiza+on on an HPC Cloud

  • Graph500 (replicated-csc, scale 26)


    [Figure: Graph500 performance (TEPS, log scale) vs. number of nodes (0-64), physical vs. virtual cluster.]

    Performance degradation: 2% (64 nodes)

    Graph500 is a hybrid parallel program (MPI + OpenMP). We used a combination of 2 MPI processes and 10 OpenMP threads (2 × 10 = 20, matching the 20 cores per node).


  • Findings

    PCI passthrough is effective in improving I/O performance; however, it still cannot achieve the low communication latency of a physical cluster because of virtual interrupt injection.

    VCPU pinning improves performance for HPC applications (see the sketch after this list).

    Almost all MPI collectives suffer from a scalability issue.

    The overhead of virtualization has less impact on actual applications than on microbenchmarks.
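
    On the vCPU-pinning point, here is a minimal sketch using the libvirt Python bindings (assumed available; the domain name "vm0" and the 20-CPU host are hypothetical) that pins vCPU 0 of a guest to physical CPU 0:

```python
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm0")                      # hypothetical guest name
HOST_CPUS = 20                                      # 2 sockets x 10 cores on an ASGC node
cpumap = tuple(i == 0 for i in range(HOST_CPUS))    # allow only physical CPU 0
dom.pinVcpu(0, cpumap)                              # pin vCPU 0 of the guest
conn.close()
```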

