COMPUTING PRACTICES

Optimizing the Memory Management of a Virtual Machine Monitor on a NUMA System

Qiuming Luo, Feng Xiao, and Zhong Ming, Shenzhen University
Hao Li, Huawei Technologies
Jianyong Chen, Shenzhen University
Jianhua Zhang, Huawei Technologies

Virtualization significantly degrades computational performance on nonuniform memory access (NUMA) systems. The authors propose an optimization scheme based on a memory-access model that covers the entire access path and accommodates the chain-reacting behavior of memory subsystems. A quantitative evaluation shows significant performance improvement with negligible overhead.

Virtualization technology provides compelling benefits, such as resource consolidation, performance and fault isolation, flexible migration, and the rapid creation of specialized environments. Because of their many advantages, virtual machines (VMs) are ubiquitous in enterprise datacenters and clouds and are integral to e-commerce, data mining, and big data management. In a virtualized environment, a VM monitor (VMM; also known as a hypervisor) runs on a PC server and provides multiple VMs, each of which has an OS. The VMs can thus execute their OSs simultaneously, and the VMM handles privileged operations. However, virtualized environments currently suffer from a semantic gap between VMM hardware-resource management and the guest OS's functionality.

This gap becomes more pronounced in server hardware, which means that a virtualized environment can dramatically degrade system performance in large datacenters, where such environments are integral to operations. Even high-performance computing with VMMs on four-core nonuniform memory access (NUMA) architectures, which used to see little or no impact from virtualization,1−3 is no longer exempt from performance degradation.4,5


This degradation affects open source VMMs such as Xen and KVM as well as proprietary VMMs such as VMware's ESX and Hyper-V.4,5 New research on VMs spanning sockets in 16-core NUMA architectures reported degradation of as much as 400 percent on Xen and 82 percent on KVM.6

Hardware platforms are adopting NUMA architectures to meet the memory bandwidth required by current multicore and future many-core systems. However, NUMA poses special challenges that stem from the significant difference between accessing memory on local nodes and on remote nodes. Measuring these differences, which are likely to become more pronounced in larger NUMA systems, is not feasible with the NUMA factor alone (the ratio of remote to local access latency), because memory access is more complex than this single ratio can capture.

Traditionally, NUMA affinity (associating memory with the requesting thread) was achievable in a virtualized environment by pinning guest VMs to CPUs and having the hypervisor's memory-management mechanism allocate memory that belongs to the page-faulting CPUs. Virtual NUMA Enlightenment, which is similar in function to the tables built through the Advanced Configuration and Power Interface (ACPI), namely the system locality information table (SLIT) and the static resource affinity table (SRAT), has recently been proposed as a way to enable NUMA awareness by exposing NUMA hardware details to the guest OS.4

However, NUMA awareness is only the first step toward optimizing a VM's memory access. Research has shown that maximizing NUMA affinity or locality does not result in optimal system performance.7

Rather, optimization must account for the microarchitecture along the memory-access path, with the goals of avoiding contention along the path and efficiently distributing workloads among nodes. To achieve the highest system performance, any strategy to optimize thread scheduling and data placement among NUMA nodes must resolve contention for the cache, the integrated memory controller (IMC), the uncore memory-subsystem path, and the interprocessor interconnect.

We have developed an optimization strategy that incorporates these ideas and have created a memory-access model to describe the microarchitecture characteristics of the Intel Westmere memory subsystem. The model accounts for all elements on the access path, including the last-level cache (LLC), global queue (GQ), QuickPath home logic (QHL), and QuickPath interconnection (QPI). An algorithm for multiobjective memory-management optimization, guided by heuristic conditions, iteratively finds the Pareto solutions for optimal thread and data placement by quantitatively evaluating three performance factors: memory-access cost (MAC), bottleneck factor (BNF), and route-contention factor (RCF). These factors serve as input to the memory-management model.

We evaluated our strategy using the HW-Xen VMM, which is a Huawei-maintained branch of the open source Xen VMM. The results of running several benchmarks showed that our optimization achieved nearly a 10 percent improvement in transactions per second (TPS) on a four-socket NUMA platform running HW-Xen, with under 2 percent degradation in memory-operation performance.

NUMA PLATFORM

Figure 1 shows the four-socket NUMA system configuration used in our implementation. It is a high-performance four-socket rack server that supports Intel's Xeon E7-8800/4800 (Westmere-EX) series processors and, when expanded to eight sockets, can accommodate up to 80 cores with 4 Tbytes of memory. It also has 35 reliability, availability, and serviceability (RAS) features. These hardware characteristics are similar to those of popular cloud computing platforms, such as the Huawei RH5885 V2 and Dell PowerEdge R730.

If a larger scale is needed, two four-way systems can be connected by QPI cables to form a larger system with eight sockets. However, the topology of this hierarchical NUMA system becomes more complicated, and more effort is required to optimize thread scheduling and data placement.
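To give a concrete feel for the topology information involved, the short C sketch below prints the node-distance matrix (the SLIT values) of the host it runs on using libnuma. It is an illustrative utility, not part of the authors' toolchain, and assumes a Linux host with libnuma installed (build with gcc -lnuma).

/* Sketch: print the NUMA node-distance matrix (SLIT values) via libnuma. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;          /* highest node number + 1 */
    printf("node distances (10 = local):\n");
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));   /* SLIT entry i -> j */
        printf("\n");
    }
    return 0;
}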


FIGURE 1. The nonuniform memory access (NUMA) system used in our implementation. This fairly typical four-socket configuration has hardware characteristics similar to those in cloud platforms. MB: memory bank; QPI: QuickPath interconnection.


Our solution also incorporates NUMA awareness by exposing the NUMA layout to the VM through hardware emulation. On x86 platforms, exposure is achieved through the ACPI SLIT and SRAT. The tables specify the memory range of virtual nodes (vnodes) and the distance between them, as well as the mapping of virtual CPUs to vnodes. Exposing this information to the VM allows it to use all of its NUMA-related optimizations.8,9 Not only does the guest OS benefit from low latency, but it also achieves high aggregate memory bandwidth, which is desirable for memory-intensive applications.
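As a rough illustration of what these emulated tables convey, the following C sketch models a two-vnode guest: the vnode memory ranges and vCPU-to-vnode map that an SRAT would carry and the distance matrix that a SLIT would carry. The layout and values are invented for illustration and do not describe any particular HW-Xen guest.

/* Sketch of the per-VM information an emulated SRAT/SLIT conveys to the guest.
 * The two-vnode layout below is purely illustrative. */
#include <stdio.h>
#include <stdint.h>

#define VNODES 2
#define VCPUS  8

struct vnode_extent { uint64_t base, size; };     /* memory range of a vnode (SRAT) */

static const struct vnode_extent vnode_mem[VNODES] = {
    { 0x00000000ULL, 2ULL << 30 },                /* vnode 0: first 2 Gbytes  */
    { 0x80000000ULL, 2ULL << 30 },                /* vnode 1: second 2 Gbytes */
};
static const int vcpu_to_vnode[VCPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };  /* vCPU affinity (SRAT) */
static const uint8_t vdistance[VNODES][VNODES] = {                    /* vnode distances (SLIT) */
    { 10, 21 },
    { 21, 10 },
};

int main(void)
{
    for (int v = 0; v < VNODES; v++)
        printf("vnode %d: base=%#llx size=%llu Mbytes\n", v,
               (unsigned long long)vnode_mem[v].base,
               (unsigned long long)(vnode_mem[v].size >> 20));
    for (int c = 0; c < VCPUS; c++)
        printf("vCPU %d -> vnode %d\n", c, vcpu_to_vnode[c]);
    printf("distance(0,1) = %u\n", vdistance[0][1]);
    return 0;
}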

OPTIMIZATION ELEMENTS

Our optimization solution is based on the use of a memory-access model, a multiobjective optimization algorithm, and a kernel module for memory-access management in the NUMA architecture.

Memory-access model

Most memory-performance models use only peak bandwidth and typical access latency to gauge performance, but characterizing complex memory behavior in multisocket, multicore systems requires more than that. One approach is to use explicit concurrency to characterize memory behavior.10 Our solution goes deeper because it covers the entire memory-access path to characterize memory-subsystem behavior. Another approach to capturing memory behavior uses the M/M/1 queuing model to describe the multicore system's memory requesting and serving process.11 Our solution also uses that model, but for each IMC and GQ individually.
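For the queuing component, a minimal sketch of the M/M/1 estimate follows, under the simplifying assumption that each IMC or GQ is treated as an independent server with arrival rate lambda and service rate mu, so the mean time a request spends at the server is W = 1/(mu − lambda). The rates below are placeholders rather than measured values.

/* Sketch: M/M/1 estimate of per-request latency at an IMC or GQ.
 * lambda = request arrival rate (requests/ns), mu = service rate (requests/ns).
 * W = 1 / (mu - lambda) is the mean time in system; rho = lambda / mu is utilization.
 * The rates are placeholders, not values measured in the paper. */
#include <stdio.h>

static double mm1_time_in_system(double lambda, double mu)
{
    if (lambda >= mu)            /* queue is unstable: demand exceeds service rate */
        return -1.0;
    return 1.0 / (mu - lambda);
}

int main(void)
{
    double mu = 0.10;            /* e.g., one request served every 10 ns */
    for (double lambda = 0.02; lambda < 0.10; lambda += 0.02) {
        double w = mm1_time_in_system(lambda, mu);
        printf("rho=%.2f  mean latency=%.1f ns\n", lambda / mu, w);
    }
    return 0;
}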

Each model is based on the memory-access pattern derived from sampling the CPU hardware's performance counters when a new VM starts up and periodically thereafter. The VMM evaluates the initial hardware-resource mapping according to a particular pattern and uses queuing-theory principles to build the memory-access model based on that pattern. If the evaluation of the initial mapping reveals any inefficient resource use, the VMM can remap the resources dynamically.

Obtaining memory-access patterns. In our study, we used quantitative measurements to evaluate the performance bottleneck for one of these memory-access patterns and described the original pattern as a communication matrix and the optimized pattern as a perfect matrix.12


FIGURE 2. How the memory-access model accommodates changes in the data-access pattern: (a) initial pattern and (b) new pattern. Some data moves from node 4 to node 1, which changes the access pattern (thicker arrows in the figure). As a result, the integrated memory controller (IMC) on node 1 gets more workload, which could increase latency, but the IMC on node 4 has less memory access, which could improve performance. Similar changes occur in the QuickPath interconnect (QPI) and global queue (GQ). Our model addresses this chain-reaction effect by quantitatively evaluating penalty factors such as bottleneck, memory-access cost, and route contention. DDR: double data rate.


We also defined penalty factors, based on MAC, BNF, and RCF, that quantify the distance between the communication matrix and the perfect matrix. MAC captures the topology distance of memory accesses, BNF accounts for IMC contention, and RCF accounts for QPI contention. These quantitative factors vary unpredictably, so our system model covers the cache, IMC, GQ, and QPI and considers their interactions.
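The precise definitions appear elsewhere;12 the following sketch merely illustrates the flavor of the three factors on a small, invented traffic matrix: MAC as traffic weighted by topology distance, BNF as the load on the busiest IMC, and RCF as the heaviest remote flow crossing a QPI link. It is a simplified stand-in, not the paper's exact formulation.

/* Sketch of simplified penalty factors for a 4-node system.
 * comm[i][j] = memory traffic issued by threads on node i to memory on node j
 * dist[i][j] = topology distance between nodes (SLIT-style)
 * Both matrices are illustrative placeholders. */
#include <stdio.h>

#define N 4

static const double comm[N][N] = {
    { 8, 1, 2, 1 },
    { 1, 7, 1, 2 },
    { 2, 1, 6, 3 },
    { 1, 2, 3, 5 },
};
static const int dist[N][N] = {
    { 10, 21, 21, 21 },
    { 21, 10, 21, 21 },
    { 21, 21, 10, 21 },
    { 21, 21, 21, 10 },
};

int main(void)
{
    double mac = 0, bnf = 0, rcf = 0;

    /* MAC: traffic weighted by the topology distance of each access */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            mac += comm[i][j] * dist[i][j];

    /* BNF: the busiest IMC, i.e., the node whose memory receives the most traffic */
    for (int j = 0; j < N; j++) {
        double imc_load = 0;
        for (int i = 0; i < N; i++)
            imc_load += comm[i][j];
        if (imc_load > bnf)
            bnf = imc_load;
    }

    /* RCF: the busiest QPI link, approximated by the largest remote flow i -> j */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (i != j && comm[i][j] > rcf)
                rcf = comm[i][j];

    printf("MAC=%.0f  BNF=%.0f  RCF=%.0f\n", mac, bnf, rcf);
    return 0;
}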

Figure 2 shows how our model handles changes in the data-access pattern that affect the IMC. The change also lessens the load on the QPI between node 2 and node 4 while putting more on the QPI between node 1 and node 2. Relieving the pressure on the GQ on node 4 adds pressure to the GQ on node 1. The decrease for node 4's GQ might cause it to issue memory-access requests more frequently to the DDRs for node 2, node 3, and node 4. This example illustrates the chain reaction that can ensue from altered data-access mapping, which in turn causes interdependent variations in the MAC, BNF, and RCF.

The model in Figure 2 is the basis for optimizing virtual CPUs and their physical memory mapping. We used different access patterns13 to quantitatively measure MAC, BNF, and RCF12 by observing performance monitor unit (PMU) events.

Capturing subsystem behavior. Our memory-access model aims to capture the behavior of the uncore memory subsystem in the Sandy Bridge microarchitecture. Figure 3 is a block diagram of one socket in the uncore memory subsystem in Intel's Westmere microarchitecture. The layout is similar to that of the subsystem socket in the Nehalem microarchitecture, another Sandy Bridge predecessor.

To measure the uncore's characteristics, we applied various applications and benchmarks, including MAP-numa,13 which is NUMA-specific and provides many access patterns with different task-data affinities. As in previous work,14 we determined GQ and IMC behavior by measuring the bandwidth of local, remote, and cross-node memory-access patterns. In that behavior, we observed imbalances and congestion in the GQs and QHL.15

To control memory access precisely and eliminate the cache effect, we designed a microbenchmark to fit our performance profiling. Figure 4 shows its primary code. The struct work_t code, which sets up the memory requests, includes a pointer linking all work_t's and an array that pads a work_t to exactly one cache line. In the main function, we first construct a working set typed as struct work_t and then access the whole working set once in each outer loop. The global variable LOOP specifies the number of iterations over working_set (the default LOOP value is 1,000). Because all work_t's are randomly linked, the hardware prefetcher cannot exploit adjacent cache lines.

A critical part of characterizing the memory subsystem's behavior is to build affinity settings for multiple processes or threads. Figure 5 shows six task-data affinity settings, which bind threads and their data.
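For example, the local-access setting of Figure 5a can be approximated on Linux by pinning a thread to a core with sched_setaffinity and placing its working set on the same node with libnuma, as in the sketch below. The core and node numbers are illustrative, and this is a simplified stand-in for the actual measurement harness (build with gcc -lnuma).

/* Sketch: bind the current thread to core 0 and allocate its working set on
 * NUMA node 0 (the local-access case of Figure 5a). Core/node numbers are
 * illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          /* run only on core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    size_t bytes = 64UL << 20;                 /* 64-Mbyte working set */
    char *ws = numa_alloc_onnode(bytes, 0);    /* place the data on node 0 */
    if (!ws) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    for (size_t i = 0; i < bytes; i += 4096)   /* touch pages so they are faulted in */
        ws[i] = 0;

    printf("thread pinned to core 0, %zu Mbytes allocated on node 0\n", bytes >> 20);
    numa_free(ws, bytes);
    return 0;
}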

When the working set is larger than the LLC capacity, the VMM sets the PMU to record the LLC miss rate, and the running times of p0 in Figure 5a and p7 in Figure 5b are recorded to describe local and remote accessing, respectively. In Figure 5c, all four processes run on socket 1 (top left box), while the data of p0 is stored in local memory and the data of p1, p2, and p3 is stored in remote memory. In this scenario, the impact of local memory-access competition caused by p1, p2, and p3 is not a consideration; rather, the focus is on measuring the performance impact of LLC and GQ contention. When the working set is smaller than the LLC capacity, the VMM sets the PMU to record LLC contention behavior; otherwise, it records the GQ contention caused by local requests.



FIGURE 3. Block diagram of one socket in the uncore memory subsystem in Intel's Westmere microarchitecture. The subsystem's behavior was characterized through the memory-access model. LLC: last-level cache; QHL: QuickPath home logic.


In Figure 5d, the VMM sets the PMU to record the behavior of IMC contention and GQ contention caused by a remote request. The performance of p0 decreases as more remote processes compete for memory bandwidth with p0. However, the memory-bandwidth contention caused by remote processes does not seriously impact the performance of p0, because the Intel platform's arbitration mechanism reserves 50 to 60 percent of the IMC bandwidth for local memory accessing and 40 to 50 percent for remote memory accessing. This mechanism is an example of how characterizing the behavior of the microarchitecture along the memory-access path can reveal interesting relationships that affect the memory-access pattern.

In Figures 5e and 5f, the VMM sets the PMU to record the snooping behavior of the cache-coherence protocol by observing the UNC_LLC_HITS.PROBE event caused by adding t0 to t1 through t3 or adding t4 to t0, t1, and t5.

Multiobjective optimization

Optimization aims to minimize the MAC, BNF, and RCF, which is represented mathematically as

F(x) = (fMAC(x), fBNF(x), fRCF(x)), (1)

where x is the decision vector, which can be denoted as (VC-0, VC-1, ..., VC-(n−1), VM-0, VM-1, ..., VM-(n−1)). VC-0 is the NUMA node number to which virtual CPU 0 is mapped, and VM-0 is the NUMA node number to which its memory is mapped. Because of conflicts among the MAC, BNF, and RCF, simultaneous optimization is not possible. The problem becomes one of decision making with multiple criteria, which is akin to mathematical optimization problems in which more than one objective function must be optimized. Thus, to deduce the best mapping for virtual CPUs and vnodes, we use multiobjective optimization.16−18

#include <sys/time.h>

#define WORKING_SET_SIZE 256 /* working-set size in Mbytes (example value; varied per experiment) */
#define LOOP 1000            /* default number of passes over the working set */

struct work_t
{
    struct work_t *next;     /* links all work_t's in random order */
    long data[7];
};                           /* the size equals one 64-byte cache line */

/* Builds a randomly linked list of n work_t nodes (implementation not shown in the figure). */
struct work_t *create_random_list(long n);

int main(int argc, char *argv[])
{
    struct timeval begin, end;
    struct work_t *working_set, *ptr;
    volatile long temp;      /* volatile so the loads are not optimized away */
    long i, j, WORK_T_NUM;

    WORK_T_NUM = WORKING_SET_SIZE * 1024L * 1024L / 64;   /* one node per cache line */
    working_set = create_random_list(WORK_T_NUM);

    gettimeofday(&begin, NULL);
    for (i = 0; i < LOOP; i++)
    {
        ptr = working_set;
        for (j = 0; j < WORK_T_NUM; j++)
        {
            temp = ptr->data[1];
            /* used in sharing overhead: */
            /* ptr->data[1] = ptr->data[1] + 1; */
            ptr = ptr->next;
        }
    }
    gettimeofday(&end, NULL); /* ... (elapsed-time reporting elided in the original figure) */
    return 0;
}

FIGURE 4. Main code of the microbenchmark. We created the benchmark to control memory access and to eliminate the cache effect while generating memory-access patterns.

[Figure 5 panels: (a) local access; (b) remote access; (c) GQ trackers/LLC contention; (d) IMC contention; (e) sharing within socket; (f) sharing across socket. Legend: cores, LLC, memory, program data.]

FIGURE 5. Task-data affinity settings. The number associated with a process (p) or thread (t) refers to the core number on which it runs. The solid black arrow in each diagram denotes the behavior that the performance monitor unit (PMU) records, which serves as input to the memory-access model. The diagrams illustrate a range of recorded behaviors: (a) LLC missing rate and running time of p0 and (b) of p7; (c) the impact of memory-access competition from p1, p2, and p3; (d) behavior from IMC and GQ contention caused by a remote request; and snooping behavior of threads when sharing occurs (e) within the socket and (f) across the socket.



Because multiobjective optimization problems typically have multiple Pareto-optimal solutions, the method for solving the problem is not as straightforward as it is for single-objective optimization, and researchers have proposed a range of solutions. We have adopted an interactive method to search the solution space heuristically and automatically.

Applying an interactive method is iterative: the decision-maker continuously interacts with the optimization module to search for the most preferable solution. In each iteration, the optimization module is expected to express preferences among the Pareto-optimal solutions of interest and to learn which solutions are attainable. Each solution is evaluated according to MAC, BNF, and RCF in the context of our memory-access model, and the module uses the results of that evaluation along with preference information to generate new Pareto-optimal solutions.
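The heart of each iteration is a dominance test over the three objectives. The following sketch, a simplified illustration rather than the HW-Xen implementation, marks a candidate placement as Pareto-optimal only if no other evaluated placement is at least as good in all of MAC, BNF, and RCF and strictly better in one; the scores are placeholders for values the memory-access model would produce.

/* Sketch: Pareto filtering of candidate placements scored by (MAC, BNF, RCF).
 * All objectives are minimized; the scores below are illustrative. */
#include <stdio.h>

struct score { double mac, bnf, rcf; };

/* Returns 1 if a dominates b: a is no worse in every objective and
 * strictly better in at least one. */
static int dominates(struct score a, struct score b)
{
    int no_worse = a.mac <= b.mac && a.bnf <= b.bnf && a.rcf <= b.rcf;
    int better   = a.mac <  b.mac || a.bnf <  b.bnf || a.rcf <  b.rcf;
    return no_worse && better;
}

int main(void)
{
    struct score cand[] = {     /* illustrative candidate placements */
        { 320, 40, 12 },
        { 290, 55, 10 },
        { 350, 38, 15 },
        { 300, 60, 20 },        /* dominated by candidate 1 */
    };
    int n = sizeof(cand) / sizeof(cand[0]);

    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            if (j != i && dominates(cand[j], cand[i]))
                dominated = 1;
        printf("candidate %d: %s\n", i, dominated ? "dominated" : "Pareto-optimal");
    }
    return 0;
}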

Kernel module

To observe PMU events, we built the VMMprof profiling toolset and implemented it in a kernel module that configures the PMU to count the desired events. The kernel module then provides data from the uncore PMU counters of interest. It runs in domain 0 and interacts with the NUMA memory-management functions in HW-Xen to manipulate configurations and policies. A patched version of the open source profiling tools Oprofile and Xenoprof, which works in a virtual environment, performs the profiling. However, the patched Oprofile and Xenoprof do not effectively support uncore events, so we transplanted select functions from another open source tool, likwid, into Xenoprof to gather performance data related to the NUMA memory-access pattern. The optimization module uses this profiling to make rescheduling and replacing decisions.

Because the hypervisor sits between the upper guest OS and the underlying hardware, the guest OS cannot execute the RDMSR and WRMSR instructions to program the PMU directly. Fortunately, HW-Xen includes a hypercall mechanism, which allows the guest OS to execute a privilege-level operation through a software interrupt. We added two hypercalls to the HW-Xen source code, which package the RDMSR and WRMSR instructions. The profiling component in the guest OS uses these two hypercalls through Xen's privcmd driver. The client–server mode in VMMprof allows multiple users to profile performance simultaneously and provides tools for remote access.
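The guest-side usage pattern is roughly as follows. In this sketch, hwxen_wrmsr and hwxen_rdmsr are hypothetical wrappers standing in for the two added hypercalls (their real names and numbers are internal to HW-Xen) and are stubbed with an in-memory array so the example runs anywhere; the MSRs shown are the architectural core-PMU registers, used purely for illustration, whereas the uncore counters live at different MSR addresses.

/* Sketch of guest-side PMU programming through hypercall wrappers.
 * hwxen_wrmsr()/hwxen_rdmsr() are hypothetical stand-ins for the HW-Xen
 * hypercalls reached via the privcmd driver; here they are stubbed so the
 * sketch compiles and runs without a hypervisor. */
#include <stdio.h>
#include <stdint.h>

#define IA32_PERFEVTSEL0 0x186   /* architectural event-select register */
#define IA32_PMC0        0x0C1   /* architectural counter register      */

static uint64_t fake_msr[0x200];                     /* stub MSR file for the sketch */

static void hwxen_wrmsr(uint32_t msr, uint64_t val) { fake_msr[msr] = val; }
static uint64_t hwxen_rdmsr(uint32_t msr)           { return fake_msr[msr]; }

int main(void)
{
    /* Event 0x2E, umask 0x41 = architectural LLC-misses event;
     * bits 16/17 count user/kernel, bit 22 enables the counter. */
    uint64_t evtsel = 0x2EULL | (0x41ULL << 8) | (1ULL << 16) | (1ULL << 17) | (1ULL << 22);

    hwxen_wrmsr(IA32_PMC0, 0);                   /* clear the counter            */
    hwxen_wrmsr(IA32_PERFEVTSEL0, evtsel);       /* select and enable the event  */

    /* ... run the workload being profiled ... */

    uint64_t llc_misses = hwxen_rdmsr(IA32_PMC0);
    printf("LLC misses counted: %llu\n", (unsigned long long)llc_misses);
    return 0;
}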

Integrating this memory-management module into an existing OS, such as Linux, requires some effort. The first step is to modify the profiling function to track every task (process or thread); consequently, every task switch will require storing and recovering the PMU values for the tasks being switched. The next step is to determine how to track a task's physical memory allocation on NUMA nodes with reasonable overhead, which is much more challenging than tracking a VM because tracking physical memory allocation involves the scheduling and load-balancing code.

PERFORMANCE EVALUATION

To verify the effectiveness of our optimization module, we ran Swingbench, an open source load generator with benchmarks designed to stress-test an Oracle database. We chose Swingbench because the Oracle database represents a typical application workload for the NUMA virtual platform. We installed the guest OS (Red Hat with a 2.6-series kernel) and configured the VM, which we equipped with 32 Gbytes of memory (RH5885 V2), to map to 4 CPUs (16 cores). We ran the application with 16 threads.

To measure the database's ability to handle transactions, we used both TPS, which measures the database transaction's delay, and transactions per minute (TPM), which measures database throughput. As Table 1 shows, the performance improvement (speedup) is 9.88 percent for TPS and 8.43 percent for TPM, evidence that our optimization works effectively for an application that can be characterized by Swingbench.

We also ran the memory-performance microbenchmark we designed to test the effectiveness of our module in optimizing memory operations along with file I/O. As Table 2 shows, memory performance under multiobjective optimization with HW-Xen is nearly identical to that of Xen. We launched our microbenchmark several times to process file I/O in different patterns (synchronous and asynchronous, read and write, sequential and random) and with various block sizes. For all asynchronous read and write operations, HW-Xen performed better than Xen (−6.61 percent degradation) because many I/O operations could be completed within page buffers.

For synchronous writes, memory performance has little effect on bandwidth because the bottleneck is in the storage device. For sequential synchronous reads, however, memory optimization obtains better bandwidth as data size increases. For random synchronous reads, performance degradation is higher than for multithreaded sequential synchronization, but it is still negligible except for small synchronous reads (4 Kbytes).

TABLE 1. Results of measuring workload characterized by Swingbench.

Benchmark | With guest NUMA optimization only (three runs) | Average | With NUMA multiobjective optimization (three runs) | Average | Speedup (%)
Transactions per second (TPS) | 26, 28, 27 | 27 | 29, 29, 31 | 30 | 9.88
Maximum transactions per minute (TPM) | 2,138, 2,120, 2,162 | 2,140 | 2,470, 2,250, 2,241 | 2,320 | 8.43



Our optimization module minimizes the sampling operation to reduce interference with application execution. When we ran various applications that are not heavily dependent on NUMA awareness, HW-Xen's performance was nearly identical to Xen's. Table 3 gives execution times for nine well-known applications running on both HW-Xen and Xen. The results in the table show that HW-Xen introduces small performance improvements for most applications in CPU operations and incurs a performance degradation of at most 1.93 percent for memory operations.

We ran another test using Benchmark Factory for Oracle with 16 threads on various underlying hardware nodes and CPUs (with one, two, and eight nodes). Figure 6 shows the results of this test, which evaluated HW-Xen's use of resources on multiple nodes. Through the interface provided by the kernel module in domain 0, we configured the system to allow the VMs to map to one, two, and eight physical nodes. HW-Xen's performance improved when it mapped to eight nodes because of the increase in total IMC bandwidth. An HW-Xen interface allows system administrators to manually control target nodes as needed.

The results in Figure 6 are evidence that our optimization outperforms methods, such as VM pinning, that attempt to maximize locality as a way to improve performance.

With virtualization technology, managing the memory of the OS in the VM has become more difficult than for an OS running on bare hardware. We have shown that even with NUMA awareness, memory-access optimization for VMs is not trivial on a NUMA platform. Hardware profiling is distributed to ensure scalability, but the decisions about thread placement remain centralized.

TABLE 3. Execution times of applications.

Application | HW-Xen time (s) | Xen time (s) | Degradation (%)
Applications to test CPU performance
Kernel compile | 71.55 | 73.53 | −1.13
GZIP | 43.93 | 43.91 | −0.12
C-ray | 32.55 | 32.49 | 0.33
Pi | 5.42 | 5.48 | −1.11
Applications to test memory performance
Add | 17,019 | 16,942 | 1.81
Copy | 17,135 | 16,891 | 1.93
Scale | 17,934 | 16,378 | 1.43
Read cache | 2,217 | 2,218 | −0.02
Write cache | 10,931 | 10,919 | 0.01


FIGURE 6. Results of running the Benchmark Factory for Oracle to evaluate HW-Xen’s ability to use resources on multiple nodes. Max and Average denote the best and average performance data, respectively.

TABLE 2. Microbenchmark of memory performance.

Operation | HW-Xen (MBps) | Xen (MBps) | Degradation (%)
Asynchronous read/write | 2,070 | 1,933 | −6.61
Random synchronization
1-Mbyte read | 4,668 | 4,837 | 3.63
4-Kbyte read | 3,984 | 4,290 | 7.69
1-Mbyte write | 100 | 101 | 0.71
4-Kbyte write | 101 | 100 | −0.46
Multithreaded, sequential synchronization
4 * 2-Gbyte read | 7,425 | 7,108 | −4.27
4 * 2-Gbyte write | 99.6 | 100 | 0.41
8 * 1-Gbyte read | 8,773 | 8,982 | −0.02
8 * 1-Gbyte write | 98.3 | 98.1 | −0.19


As the number of NUMA nodes increases, the decision delay will be longer. Moreover, with more nodes comes a higher cost of migrating virtual CPUs and their memory. For large-scale NUMA systems, the scalability problem should be studied further.

In concert with other technologies, such as a mechanism to tune the guest OSs' kernel configuration, our multiobjective optimization enhances RH5885 V2 system performance and contributes to its suitability as a platform for high-performance database applications. The system ranked first in the TPC-E (Transaction Processing Performance Council, Enterprise Benchmark) performance test in 2012.

The results from benchmark-testing our optimization solution are sufficiently encouraging to warrant extending our approach to other platforms, such as Loongson. Improving memory management in the VMM for various NUMA architectures will help reduce the cost of server hardware for datacenters, thus making the processing of big data applications more cost-effective.

ACKNOWLEDGMENTS

The research was jointly supported by the National Natural Science Foundation of China through grant NSFGDU1301252, by the State Key Laboratory of Computer Architecture of the Chinese Academy of Sciences through grant CARCH201405, by the National Natural Science Foundation of China through grant 61170283, and by the Technology Foundation through grants JCYJ20140509172609174 and JCYJ20150930105133185.


ABOUT THE AUTHORS

QIUMING LUO is an associate professor in the College of Computer Science and Software Engineering at Shenzhen University. His research interests include high-performance computing and OS design. Luo received a PhD in computer architecture from Huazhong University of Science and Technology. He is a member of the China Computer Federation (CCF). Contact him at [email protected].

FENG XIAO is an MS student in the College of Computer Science and Software Engineering at Shenzhen University. His research interests include high-performance computing and OS design. Xiao received a BS in educational technology from Nanchang Hangkong University. Contact him at [email protected].

ZHONG MING is a professor and vice dean in the College of Computer Science and Software Engineering at Shenzhen University. His research interests include software engineering, the Internet of Things, and distributed workflow management. Ming received a PhD in computer science from Zhong Shan University. He is a member of CCF. Contact him at [email protected].

HAO LI is a senior engineer in the Cloud Computing Department of Huawei Technologies. His research interests include hypervisor performance, virtual resource scheduling efficiency, and public cloud services. Li received a BS in computer science from the Wuhan University of Technology. Contact him at [email protected].

JIANYONG CHEN is an associate professor in the College of Computer Science and Software Engineering at Shenzhen University and a technical advisor at ZTE Corporation. His research interests include information security and Internet applications. Chen received a PhD in electrical engineering from City University of Hong Kong. He is a member of CCF, the vice chairman of ITU-T Study Group 17 (security area), and chairman of ITU-T WP3/SG17 (identity management area). Contact him at [email protected].

JIANHUA ZHANG is a senior engineer in the Cloud Computing Department of Huawei Technologies. His research interests include virtual desktop infrastructure, cloud OSs, and cloud orchestration. Zhang received an MS in electronics and communication engineering from East China Normal University. Contact him at [email protected].


REFERENCES

1. W. Huang et al., "Virtual Machine Aware Communication Libraries for High-Performance Computing," Proc. ACM/IEEE Conf. Supercomputing (SC 07), 2007; http://dx.doi.org/10.1145/1362622.1362635.
2. C. Xu, Y. Bai, and C. Luo, "Performance Evaluation of Parallel Programming in Virtual Machine Environment," Proc. 6th IFIP Int'l Conf. Network and Parallel Computing (NPC 09), 2009, pp. 140–147.
3. L. Youseff et al., "Paravirtualization Effect on Single- and Multi-Threaded Memory-Intensive Linear Algebra Software," Cluster Computing, vol. 12, no. 2, 2009, pp. 101–122.
4. D. Rao and K. Schwan, "vNUMA-mgr: Managing VM Memory on NUMA Platforms," Proc. 17th IEEE Int'l High-Performance Computing Conf. (HiPC 10), 2010; http://dx.doi.org/10.1109/HIPC.2010.5713191.
5. D. Rao and J. Nakajima, "Guest NUMA Support (PV) and (HVM)," Proc. Xen Summit North America, 2010; www.xen.org/xensummit/xensummit_spring_2010.html.
6. K.Z. Ibrahim, S. Hofmeyr, and C. Iancu, "Characterizing the Performance of Parallel Applications on Multi-socket Virtual Machines," Proc. 11th IEEE/ACM Int'l Symp. Cluster Computing and the Grid (CCGrid 11), 2011; http://dx.doi.org/10.1109/CCGrid.2011.50.
7. Z. Majo and T.R. Gross, "Memory System Performance in a NUMA Multicore Multiprocessor," Proc. 4th ACM Ann. Int'l Conf. Systems and Storage (SYSTOR 11), 2011, article 12.
8. M. Bligh et al., "Linux on NUMA Systems," Proc. Ottawa Linux Symp. (OLS 04), 2004, pp. 89−101.
9. R. Bryant and J. Hawkes, "Linux Scalability for Large NUMA Systems," Proc. Ottawa Linux Symp. (OLS 03), 2003, pp. 76−88.
10. A. Mandal, R. Fowler, and A. Porterfield, "Modeling Memory Concurrency for Multisocket Multicore Systems," Proc. IEEE Int'l Symp. Performance Analysis of Systems & Software (ISPASS 10), 2010, pp. 66–75.
11. B.M. Tudor, Y.M. Teo, and S. See, "Understanding Off-Chip Memory Contention of Parallel Programs in Multicore Systems," Proc. IEEE Int'l Conf. Parallel Processing (ICPP 11), 2011, pp. 602–611.
12. Q. Luo et al., "Quantitatively Measuring the Memory Locality Leakage on NUMA Systems Based on Instruction-Based Sampling," Proc. 13th IEEE Int'l Conf. Parallel and Distributed Computing, Applications and Technologies (PDCAT 12), 2012, pp. 251–256.
13. Q. Luo et al., "MAP-numa: Access Patterns Used to Characterize the NUMA Memory Access Optimization Techniques and Algorithms," Network and Parallel Computing, LNCS 7513, 2012, pp. 208–216.
14. Q. Luo et al., "Analyzing the Characteristics of Memory Subsystem on Two Different 8-Way NUMA Architectures," Proc. 10th IFIP Int'l Conf. Network and Parallel Computing (NPC 13), 2013, pp. 155–166.
15. Q. Luo et al., "Understanding the Data Traffic of Uncore in Westmere NUMA Architecture," Proc. 22nd Euromicro Int'l Conf. Parallel, Distributed, and Network-Based Processing (PDP 14), 2014, pp. 392–399.
16. Q. Lin et al., "A Novel Multiobjective Particle Swarm Optimization with Multiple Search Strategies," European J. Operational Research, vol. 247, 2015, pp. 732−744.
17. Q. Lin et al., "A Novel Hybrid Multiobjective Immune Algorithm with Adaptive Differential Evolution," Computers & Operations Research, vol. 62, 2015, pp. 95−111.
18. Z. Liang et al., "A Double-Module Immune Algorithm for Multiobjective Optimization Problems," Applied Soft Computing, vol. 35, 2015, pp. 161−174.
