IEEE CloudCom 2014 Participation Report

IEEE CloudCom 2014 Participation Report / 野@AIST (産総研). Covered sessions: 2C: Virtualization I; 3C, 4B: HPC on Cloud. 2015-02-06, 45th Grid Consortium (グリッド協議会) Workshop.

Transcript of IEEE CloudCom 2014 Participation Report

  • IEEE CloudCom 2014

    Session: 2C: Virtualization I Session: 3C, 4B: HPC on Cloud

    2015-02-06, 45th Grid Consortium (グリッド協議会) Workshop

  • Rank: CORE computer science conference rankings; Publication and Citation: Microsoft Academic Search

    Conference       Rank   Publication   Citation   % accepted
    IEEE/ACM CCGrid  A      1454          10577      19
    IEEE CLOUD       B      234           445        18
    IEEE CloudCom    C      70            187        18
    IEEE CloudNet    -      -             -          28
    IEEE/ACM UCC     -      -             -          19
    ACM SoCC         -      -             -          24
    CLOSER           -      -             -          17

    Gartner Hype Cycle 2014

  • A 3-level Cache Miss Model for a Nonvolatile Extension to Transcendent Memory

    Transcendent memory (tmem): cleancache and frontswap frontends; zcache, RAMster, and the Xen shim as backends.

    NEXTmem (aka Ex-Tmem): extends tmem with NVM. Clean pages evicted from a guest VM are put into a FIFO put buffer in DRAM; DRAM also holds an LRU hot region, the NVM holds an LFU clean region plus a swap region, and dirty pages are flushed to disk. Memory allocation across Level 1 (DRAM) and Level 2 (NVM) is managed by the hypervisor.

    Figure 5. Architecture of NEXTmem. Arrows indicate writes. A get results in a read from NEXTmem or from disk.

    […] joining a FIFO (first-in-first-out) queue at the DRAM level. This queue serves as a put buffer. When the DRAM level is full and another put arrives, the oldest page is transferred to a clean region at the NVM level.

    If there are two gets for x while it is in the put buffer, x is transferred to a hot region at the DRAM level and pushed onto the top of an LRU stack. A get on a page y in this stack will move y to the top. The pages at the stack bottom are cold; when the DRAM level is full, the bottom page is transferred to the clean region at the NVM level.

    The clean NVM region uses LFU replacement, so the least frequently used page is discarded if the NVM level is full. If there are two gets for a page z in this clean region, z will be copied to the hot region at the DRAM level.

    The justification for this design, and experimental results to demonstrate its efficacy, are provided in a separate paper [16]. For this paper, our focus is on using the 3-level cache model in Sec. II to analyze the behavior of the cleancache hierarchy: a Level 1 that consists of the put buffer and hot region in DRAM, a Level 2 that is the clean region in NVM, and a Level 3 that is the disk.

    It would seem intractable to model the unusual and tightly coupled replacement policies (a mix of FIFO, LRU and LFU; a transfer or copy within DRAM or between DRAM and NVM). The experimental results that follow show that our 3-level model can in fact overcome this difficulty.
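
    To make the interplay of the three replacement policies concrete, here is a minimal Python sketch of the cleancache path described above (FIFO put buffer and LRU hot region in DRAM, LFU clean region in NVM). It is a toy model under stated assumptions, not the authors' implementation; the 50/50 DRAM split between buffer and hot region and all capacities are illustrative.

```python
from collections import OrderedDict, deque, Counter

class NextmemSketch:
    """Toy cleancache hierarchy: DRAM = FIFO put buffer + LRU hot region,
    NVM = LFU clean region; anything else is served from disk (Level 3)."""

    def __init__(self, dram_pages, nvm_pages):
        self.buf_cap = max(1, dram_pages // 2)          # illustrative 50/50 DRAM split
        self.hot_cap = max(1, dram_pages - self.buf_cap)
        self.nvm_cap = nvm_pages
        self.put_buf = deque()                          # FIFO put buffer (page ids)
        self.hot = OrderedDict()                        # LRU hot region (MRU at the end)
        self.clean = {}                                 # NVM clean region: page -> get count (LFU)
        self.buf_gets = Counter()                       # gets seen while a page is in the put buffer

    def put(self, page):
        """A clean page evicted by the guest joins the FIFO put buffer."""
        if page in self.hot or page in self.put_buf:
            return
        if len(self.put_buf) >= self.buf_cap:           # buffer full: oldest page drops to NVM
            self._to_clean(self.put_buf.popleft())
        self.put_buf.append(page)
        self.buf_gets[page] = 0

    def get(self, page):
        """Return the level that serves the page: 'dram', 'nvm' or 'disk'."""
        if page in self.hot:
            self.hot.move_to_end(page)                  # LRU touch
            return "dram"
        if page in self.put_buf:
            self.buf_gets[page] += 1
            if self.buf_gets[page] >= 2:                # second get: transfer to the hot region
                self.put_buf.remove(page)
                self._to_hot(page)
            return "dram"
        if page in self.clean:
            self.clean[page] += 1
            if self.clean[page] >= 2:                   # second get: copy up to the hot region
                self._to_hot(page)
            return "nvm"
        return "disk"                                   # Level 3 miss

    def _to_hot(self, page):
        if len(self.hot) >= self.hot_cap:               # cold page at the LRU bottom drops to NVM
            victim, _ = self.hot.popitem(last=False)
            self._to_clean(victim)
        self.hot[page] = True

    def _to_clean(self, page):
        if page in self.clean:
            return
        if len(self.clean) >= self.nvm_cap:             # LFU: discard the least frequently used page
            del self.clean[min(self.clean, key=self.clean.get)]
        self.clean[page] = 0
```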

    C. Experimental Set-up

    The NEXTmem experiments are run on a machine that has an Intel Xeon 2.5 GHz processor with 12 cores, 2 × 4 GB of DRAM, and a SATA 2.0, 1 TB, 5400 rpm Seagate disk. The hypervisor is Xen 4.1.3, with its tmem interface (for put, get, etc.) unchanged, but the frontswap and cleancache inside are replaced by NEXTmem.

    VM1: TS; TS; XL; XL; EC; XL; XL; TS; H2; XL; H2; TS; TB; TS; XL;
    VM2: XL; TB; TS; H2; TS; JY; XL; TS; JY; XL; TS; XL; TB; TB; XL;
    VM3: H2; TB; XL; XL; XL; TS; XL; TS; TB; TB; TS; EC; TS; TB; JY;
    VM4: H2; TS; TS; H2; EC; JY; H2; EC; XL; EC; XL; XL; TB; XL; TB;
    VM5: JY; EC; JY; XL; XL; H2; H2; EC; H2; XL; TB; TS; EC; XL; XL;

    Table I. Sequence of applications run by each VM for the experiment in Sec. III-D. (TS=Tradesoap, TB=Tradebeans, H2=H2, XL=Xalan, JY=Jython, EC=Eclipse.)

    To apply pressure on NEXTmem, we set M1 = 70000 4-KByte pages (put buffer plus hot region). The NVM has 300000 pages, so M2 is 300000 minus the swap region. This M1 and M2 are shared by the VMs.

    There is no commodity byte-addressable NVM that we could use for this experiment, so the NVM is emulated by DRAM. For the SLO considered here, the DRAM and NVM latencies are negligible when compared to disk latency.

    Each VM has 2 vCPUs, 500 MB of DRAM and a 100 GB virtual disk, and runs Ubuntu 12.04 LTS.

    D. Application of 3-level model to NEXTmem provisioning

    The validation of the Cache Miss Equation in Fig. 2 is for a trace simulation over two LRU caches. We now confirm that the equation does work with a real set-up; specifically, we have a VM running the DaCapo Tradebeans benchmark⁴ on the machine with the NEXTmem implementation.

    Fig. 6(a) shows that the equation is an excellent fit for the DRAM Level 1 in NEXTmem, and Fig. 6(b) shows a similarly good fit for the NVM Level 2 at different M1 values. Like in Fig. 2(c), the four sets of (M1+M2, Pmiss) values again lie on the same curve defined by the equation in Fig. 6(c). Space constraints prevent us from presenting more results from experiments validating the equation for NEXTmem.

    To illustrate the application of the 3-level model, we run 5 concurrent VMs. Each VM runs a sequence of randomly chosen DaCapo benchmarks, so the VM goes through phases when it is more memory-intensive than usual. The VMs run different sequences, as shown in Table I, so the VMs are not identical in their memory demand.

    The SLO is pmax = 0.05 and Lmax = 300 ms for VM1, and pmax = 0.02 and Lmax = 500 ms for VM2. There is no SLO for the other 3 VMs.

    Time is divided into 90-second epochs. The experiment starts with M1 and M2 both equally partitioned among the 5 VMs. After each epoch, there is a 4-second calibration period, during which we vary each VM's M1 and M2 allocation and measure their miss probability; the results are then used for regression to obtain the parameter values for the Cache Miss Equation for each level and each VM. This periodic recalibration is necessary because the VMs' need for memory is changing dynamically.

    ⁴ http://www.dacapobench.org
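
    The per-epoch calibration described above (perturb each VM's allocation, measure miss probabilities, fit the Cache Miss Equation by regression) can be sketched roughly as below. The paper's actual equation is not reproduced here, so both the power-law model and the sample numbers are illustrative placeholders, not the authors' formula or measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder miss-probability model: NOT the paper's Cache Miss Equation,
# just an assumed shape to illustrate the per-epoch regression step.
def miss_model(M, a, b):
    return a * np.power(M, -b)

# Illustrative (allocation in pages, measured miss probability) samples
# gathered during one 4-second calibration period, for one VM and one level.
alloc = np.array([20000.0, 40000.0, 60000.0, 80000.0])
pmiss = np.array([0.21, 0.12, 0.08, 0.06])

(a, b), _ = curve_fit(miss_model, alloc, pmiss, p0=(1.0, 0.5))
print(f"fitted a={a:.3g}, b={b:.3g}")
print(f"predicted miss probability at 70000 pages: {miss_model(70000.0, a, b):.3f}")
```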



  • Persistent memory

    NVMe driver

    Linux support for NVM: PMFS, DAX

    OpenNVM (SanDisk) API: atomic write, atomic trim; NVMKV, NVMFS

    SNIA NVM Programming Technical WG http://www.snia.org/forums/sssi/nvmp


    PM (persistent memory) support in Linux
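
    To illustrate the DAX-style byte-addressable access mentioned above, the sketch below memory-maps a file assumed to live on a DAX-mounted filesystem; the path /mnt/pmem/example.dat is hypothetical. With DAX, such loads and stores bypass the page cache and reach the persistent medium directly.

```python
import mmap
import os

PM_FILE = "/mnt/pmem/example.dat"   # hypothetical file on a DAX-mounted fs (e.g. ext4 mounted with -o dax)
SIZE = 4096

fd = os.open(PM_FILE, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)              # size the backing file before mapping

buf = mmap.mmap(fd, SIZE)           # load/store access into persistent memory
buf[0:5] = b"hello"                 # ordinary byte writes, no read()/write() syscalls
buf.flush()                         # msync: ask the kernel to make the update durable
buf.close()
os.close(fd)
```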

  • HPC on Cloud (8 papers)

    1. Reliability Guided Resource Allocation for Large-Scale Systems, S. Umamaheshwaran and T. J. Hacker (Purdue U.)
    2. Energy-Efficient Scheduling of Urgent Bag-of-Tasks Applications in Clouds through DVFS, R. N. Calheiros and R. Buyya (U. Melbourne)
    3. A Framework for Measuring the Impact and Effectiveness of the NEES Cyberinfrastructure for Earthquake Engineering, T. Hacker and A. J. Magana (Purdue U.)
    4. Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-Offs between Performance and Cost, L. Thai, B. Varghese, and A. Barker (U. St Andrews)
    5. CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments, T. Mastelic, I. Brandic, and J. Jasarevic (Vienna U. of Technology)
    6. Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors, V. Plugaru, S. Varrette, and P. Bouvry (U. Luxembourg)
    7. Exploring the Performance Impact of Virtualization on an HPC Cloud, N. Chakthranont, P. Khunphet, R. Takano, and T. Ikegami (KMUTNB, AIST)
    8. GateCloud: An Integration of Gate Monte Carlo Simulation with a Cloud Computing Environment, B. A. Rowedder, H. Wang, and Y. Kuang (UNLV)


  • [1] [2, 6] [4, 5] [6, 7]

    [1, 4, 5] IaaS: OpenStack [6], CloudStack [7] [8]

    MPI [6, 7]; Bag of Tasks [2]; Bag of Distributed Tasks [4]; Web (FFmpeg, MongoDB, Ruby on Rails) [5]; [8]; Earthquake Engineering [3]


  • CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments

    Defines a per-VM metric (CPU-PC) from the VM's stolen time and relates it to application response time.
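
    The exact CPU-PC formula is defined in the paper and is not reproduced here; the sketch below only shows the raw ingredient it builds on, reading the VM's steal time from /proc/stat so it can be correlated with application response time.

```python
import time

def steal_fraction(interval=1.0):
    """Fraction of CPU time stolen by the hypervisor over `interval` seconds,
    read from the aggregate 'cpu' line of /proc/stat (8th value = steal ticks)."""
    def snapshot():
        with open("/proc/stat") as f:
            values = [int(v) for v in f.readline().split()[1:]]
        return values[7], sum(values)

    steal0, total0 = snapshot()
    time.sleep(interval)
    steal1, total1 = snapshot()
    return (steal1 - steal0) / max(1, total1 - total0)

# Sample the steal fraction once per second, e.g. alongside response-time measurements.
if __name__ == "__main__":
    print(f"steal fraction over 1 s: {steal_fraction():.3f}")
```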


  • ASGC (AIST Super Green Cloud) Hardware Spec.


    Compute node:
      CPU         Intel Xeon E5-2680 v2 / 2.8 GHz (10 cores) × 2 CPUs
      Memory      128 GB DDR3-1866
      InfiniBand  Mellanox ConnectX-3 (FDR)
      Ethernet    Intel X520-DA2 (10 GbE)
      Disk        Intel SSD DC S3500 600 GB

    The 155-node cluster consists of Cray H2312 blade servers. The theoretical peak performance is 69.44 TFLOPS. Operation started in July 2014.
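
    As a quick sanity check on the quoted 69.44 TFLOPS figure, the arithmetic below assumes 8 double-precision flops per cycle per core for the AVX-capable Xeon E5-2680 v2 (the flops-per-cycle value is our assumption, not stated on the slide):

```python
# Theoretical peak = nodes x sockets x cores x clock x flops/cycle
nodes, sockets, cores, ghz, flops_per_cycle = 155, 2, 10, 2.8, 8
per_node_gflops = sockets * cores * ghz * flops_per_cycle      # 448 GFLOPS per node
print(f"{per_node_gflops * nodes / 1000:.2f} TFLOPS")          # -> 69.44 TFLOPS
```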

    Exploring the Performance Impact of Virtualization on an HPC Cloud

  • ASGC Software Stack

    Management Stack:
      CentOS 6.5 (QEMU/KVM 0.12.1.2)
      Apache CloudStack 4.3 + our extensions
        PCI passthrough/SR-IOV support (KVM only)
        sgc-tools: virtual cluster construction utility
      RADOS cluster storage

    HPC Stack (Virtual Cluster):
      Intel Compiler/Math Kernel Library SP1 1.1.106
      Open MPI 1.6.5
      Mellanox OFED 2.1
      Torque job scheduler



  • Benchmark Programs

    Micro benchmark:
      Intel MPI Benchmarks (IMB) version 3.2.4

    Application-level benchmarks:
      HPC Challenge (HPCC) version 1.4.3
        G-HPL, EP-STREAM, G-RandomAccess, G-FFT
      OpenMX version 3.7.4
      Graph 500 version 2.1.4



  • MPI Point-to-point communication


    [Figure (IMB): point-to-point throughput (GB/s) vs. message size (KB), physical vs. virtual cluster; peak throughput 5.85 GB/s (physical) vs. 5.69 GB/s (virtual).]

    The overhead is less than 3% with large messages, though it is up to 25% with small messages.

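
    For reference, a minimal mpi4py ping-pong in the spirit of the IMB point-to-point measurement (a simplified sketch assuming mpi4py is installed on the cluster, not IMB itself); run it on two ranks with, for example, `mpirun -np 2 python pingpong.py`.

```python
# pingpong.py
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

SIZE = 4 * 1024 * 1024                    # 4 MB message
REPS = 100
buf = bytearray(SIZE)

comm.Barrier()
t0 = time.time()
for _ in range(REPS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1)
        comm.Recv([buf, MPI.BYTE], source=1)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0)
        comm.Send([buf, MPI.BYTE], dest=0)
elapsed = time.time() - t0

if rank == 0:
    # The two transfers per iteration are serialized, so
    # total bytes / total time approximates the per-direction bandwidth.
    print(f"throughput: {2 * REPS * SIZE / elapsed / 1e9:.2f} GB/s")
```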

  • MPI Collectives (64 bytes)


    [Figure (IMB): execution time (usec) vs. number of nodes (0-128) for Allgather, Allreduce, and Alltoall, physical vs. virtual cluster.]

    The overhead becomes significant as the number of nodes increases.

    Overheads annotated on the plots: +77%, +88%, +43%.
    Load imbalance?
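
    A hedged sketch of how a 64-byte Allreduce can be timed with mpi4py (assumed available), roughly mirroring what the IMB collectives measurement above does; this is not the IMB code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
sendbuf = np.zeros(8, dtype=np.int64)     # 8 x 8 bytes = 64-byte message
recvbuf = np.empty_like(sendbuf)
REPS = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
t1 = MPI.Wtime()

if comm.Get_rank() == 0:
    print(f"average Allreduce time: {(t1 - t0) / REPS * 1e6:.1f} usec")
```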


  • G-HPL (LINPACK)


    [Figure (HPCC): G-HPL performance (TFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]

    Performance degradation: 5.4 - 6.6%

    Efficiency* on 128 nodes: physical 90%, virtual 84%

    *) Rmax / Rpeak

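
    The efficiency figures above imply roughly the following Rmax values at 128 nodes, using the 448 GFLOPS/node peak derived from the hardware-spec slide (back-of-the-envelope numbers, not taken from the paper):

```python
rpeak_128 = 128 * 0.448                                      # TFLOPS: 128 nodes x 448 GFLOPS
print(f"Rpeak(128 nodes) = {rpeak_128:.1f} TFLOPS")          # ~57.3
print(f"physical Rmax ~ {0.90 * rpeak_128:.1f} TFLOPS")      # ~51.6
print(f"virtual  Rmax ~ {0.84 * rpeak_128:.1f} TFLOPS")      # ~48.2
```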

  • EP-STREAM and G-FFT


    [Figure (HPCC): EP-STREAM performance (GB/s) and G-FFT performance (GFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]

    The overheads are negligible.

    EP-STREAM: memory-intensive, with no communication.
    G-FFT: all-to-all communication with large messages.

    Exploring the Performance Impact of Virtualiza+on on an HPC Cloud

  • Graph500 (replicated-csc, scale 26)


    [Figure: Graph500 performance (TEPS, log scale) vs. number of nodes (0-64), physical vs. virtual cluster.]

    Performance degradation: 2% (64 nodes)

    Graph500 is a hybrid parallel program (MPI + OpenMP). We used a combination of 2 MPI processes and 10 OpenMP threads (2 × 10 = 20, matching the 20 cores per node).


  • Findings

    PCI passthrough is effective in improving I/O performance; however, it still cannot achieve the low communication latency of a physical cluster because of virtual interrupt injection.

    VCPU pinning improves performance for HPC applications (see the sketch after this list).

    Almost all MPI collectives suffer from a scalability issue.

    The overhead of virtualization has less impact on actual applications than on microbenchmarks.
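
    On the vCPU-pinning point, here is a minimal sketch using the libvirt Python bindings (assumed available; the domain name "vm0" and the 20-CPU host are hypothetical) that pins vCPU 0 of a guest to physical CPU 0:

```python
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm0")                      # hypothetical guest name
HOST_CPUS = 20                                      # 2 sockets x 10 cores on an ASGC node
cpumap = tuple(i == 0 for i in range(HOST_CPUS))    # allow only physical CPU 0
dom.pinVcpu(0, cpumap)                              # pin vCPU 0 of the guest
conn.close()
```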

