
Page 1: BOPS, N FLOPS! , M OOFLINE ERFORMANCE ODEL OR …prof.ict.ac.cn/BigDataBench/wp-content/uploads/2018/BOPS-2018.pdf · it only applies for digital processing and SoC systems. The definitions

BOPS, NOT FLOPS! A NEW METRIC, MEASURING TOOL, AND ROOFLINE

PERFORMANCE MODEL FOR DATACENTER COMPUTING

EDITED BY

LEI WANG

JIANFENG ZHAN

WANLING GAO

RUI REN

XIWEN HE

CHUNJIE LUO

GANG LU

JINGWEI LI

Software Systems Laboratory (SSL), ACS ICT, Chinese Academy of Sciences

Beijing, China

http://prof.ict.ac.cn/ssl

TECHNICAL REPORT NO. ACS/SSL-2018-2

JANUARY 28, 2018


BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing

Lei Wang, Jianfeng Zhan, Wanling Gao, Rui Ren, Xiwen He, Chunjie Luo, Gang Lu, Jingwei Li

January 28, 2018

Abstract

The past decades witness FLOPS (Floating-point Operations per Second), as an important computation-centric performance metric, guiding computer architecture evolution, bridging hardware and software co-design, and providing quantitative performance numbers for system optimization. However, for emerging datacenter computing (in short, DC) workloads, such as internet services or big data analytics, previous work reports that on modern CPU architectures the average proportion of floating-point instructions is only 1% and the average FLOPS efficiency is only 0.1%, while the average CPU utilization is as high as 63%. These contradicting performance numbers imply that FLOPS is inappropriate for evaluating DC computer systems. To address this issue, we propose a new computation-centric metric, BOPS (Basic OPerations per Second). In our definition, basic operations include all arithmetic, logical, comparing and array addressing operations for integer and floating point. BOPS is the average number of BOPs (Basic OPerations) completed each second. To that end, we present a dwarf-based measuring tool to evaluate DC computer systems in terms of our new metric. On the basis of BOPS, we also propose a new Roofline performance model for DC computing. Through experiments, we demonstrate that our new metric (BOPS), measuring tool, and performance model indeed facilitate DC computer system design and optimization.


1 Introduction

In the past decades, FLOPS (FLoating-point Operations Per Second) [10] has been used to evaluate the performance of modern computer systems. FLOPS has driven the progress of computing technology, not limited to high performance computing (HPC), for many years [10]. As shown in Fig. 1 (Rmax is the number measured with HPL and Rpeak is the theoretical number), history witnesses that FLOPS defines concrete R&D objectives and roadmaps (Gflops in the 1990s, Tflops in the 2000s, Pflops in the 2010s, and Eflops in the 2020s).

Figure 1: High Performance Computing in terms of FLOPS since 1990s.

For the system and architecture community, a single metric like FLOPS is simple but powerful for exploring innovative systems and architectures. First, FLOPS can be calculated at the application's source code level, independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. Second, it can be calculated at different levels independently. For example, it can be calculated at the software's binary code level and the hardware's instruction level, respectively, so it facilitates co-design of systems and architectures. Last but not least, it helps people understand the performance ceiling of computer systems, and hence guides system optimization.

To date, to perform big data and AI analytics or provide services, more and more organizations in the world build internal datacenters or rent hosted datacenters. As a result, datacenter computing (DC in short) has become a new paradigm of computing. It seems that DC has outweighed HPC in terms of market share. A natural question is: what is the metric for DC? Is it still FLOPS?

Different from HPC, DC workloads have unique features. For example, it is reported that DC workloads have very low floating point operation intensity, which is defined as the total floating point operations divided by the total memory access bytes [23]. Our previous work [12] has performed comprehensive and hierarchical Top-Down analysis on big data and AI workloads from an architectural perspective. We found that for typical DC workloads, the average floating point instruction ratio is only 1% and the average FLOPS efficiency is only 0.1%, while the average CPU utilization 1 is 63% and the IPC (Instructions Per Cycle) is 1.3. These performance data imply that FLOPS is inappropriate for measuring DC computer systems.

In practice, several user-perceivable metrics are used to measure application-specific systems, i.e., the number of simultaneous user sessions for SPECWEB [9], transactions per minute for TPC [1], and input data processed per second for big data analytics [17]. However, user-perceivable metrics have two limitations. First, user-perceivable metrics cannot be used to evaluate the efficiency of computer systems, especially the performance efficiency of hardware platforms. Second, different user-perceivable metrics cannot be compared with each other. For example, transactions per minute (TPM) and data processed per second (GB/s) cannot be used for an apples-to-apples comparison.

1 CPU utilization is defined as the percentage of time that the CPU executes instructions at the system or user level.


In this paper, inspired by FLOPS and its measuring tool HPL [10], we define basic operations per second (in short, BOPS), which covers the basic integer and floating point operations, to evaluate DC computing systems. The contributions are as follows:

First, we propose a computation-centric metric for DC: basic operations per second (in short, BOPS). BOPS is the average number of BOPs (Basic OPerations) of a specific workload completed each second, and the BOPs of a workload include all arithmetic, logical, comparing and array addressing operations for integer and floating point. BOPS covers both relevant integer and floating point operations (i.e., calculating load/store/jump addresses, comparing operations, and arithmetic or logical operations).

Second, on the basis of big data and AI dwarfs [11] [12], which are frequently-appearing units of computation identified from typical datacenter application domains, we select a set of fundamental workloads as the measuring tool to report BOPS numbers.

Third, we propose a BOPS-based Roofline model [25], which we call DC Roofline, as the quantitative DC performance model for guiding system design and optimization. Through experiments, we demonstrate that the BOPS-based DC Roofline model indeed helps system design and optimization.

The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background and motivations. Section 4 defines BOPS and reports how to calculate it. Section 5 introduces the BOPS-based DC Roofline model. Section 6 draws a conclusion.

2 Related Work

Related work is summarized from three perspectives: metrics, measuring tools, and performancemodels.

First, the metrics for evaluating computing systems can be classified into two categories: one is the user-perceivable metric, such as the metric of SPECCPU [2]; the other is the computation-centric metric, such as FLOPS.

User-perceivable metrics can be classified into two categories: one is the metric for an end-to-end system, and the other is the metric for components of the system. Examples of the former include data processed per second for the Sort benchmark [4], simultaneous sessions for SPECWEB [9], and transactions per second for TPC-C [1]. Examples of the latter include the SPECCPU score [2] for the CPU component and the IOPS metric for the storage component.

FLOPS (FLoating-point Operations Per Second) is a metric of computer performance, especially in fields of scientific computation that make heavy use of floating-point calculations [10]. The wide recognition of FLOPS indicates the maturity of high performance computing.

Other computation-centric metrics include MIPS (Million Instructions Per Second) [14] and its variants. MIPS is defined as the number of millions of instructions the processor can execute per second. The main defect of MIPS is that it is architecture-dependent. There are also many evolutions of MIPS, such as MWIPS and DMIPS [20], which use synthetic workloads to evaluate floating point operations and integer operations respectively; they are also architecture-dependent metrics.

OPS (operations per second) [19] is defined as the number of 16-bit addition operations per second, and it only applies to digital processing and SoC systems. The definition of OPS has been extended later, e.g., the OPSes of TPU [15] and Cambricon [16]. They are defined over specific operations: the OPS of TPU counts 8-bit matrix multiplication operations and the OPS of Cambricon counts 16-bit integer operations.

Second, for different metrics, there are related measuring tools or benchmarking tools. The SPECCPU benchmark suite, SPECWEB benchmark suite, TPC-C benchmark suite, and Sort benchmark are the measuring tools for the user-perceivable metrics introduced above, respectively. PARSEC [7] is a benchmark suite composed of multi-threaded programs, which uses wall clock time as the measuring metric. BigDataBench is a comprehensive big data and AI


benchmark suite [12]. For the computation-centric metrics, Whetstone [13] and Dhrystone [24] are the measuring tools for the MWIPS and DMIPS metrics. HPL [10] is the micro-benchmark measuring the FLOPS number. Because of its sophisticated design (the proportion of floating-point additions to multiplications in HPL is 1:1, so as to fully utilize the FPU) and its representativeness (dense linear algebraic equations, as in HPL, are a typical method in HPC), HPL is the benchmark used for TOP500 [3].

Third, there are two categories of performance models: one is the analytical model, which uses stochastic/statistical analytical methods to describe and predict the performance of systems; the other is the bound model, which is relatively simpler and only describes the performance bound or bottleneck of systems. Previous work [26] [21] [8] presents famous stochastic/statistical analytical models, which are used to predict system performance. As distributed and parallel systems always have many uncertain behaviors, it is hard to build accurate prediction models. Instead, bound and bottleneck analysis is more suitable. Amdahl's Law [6] is one of the famous performance bound models for parallel processing computer systems. The Roofline model [25] is another famous performance bound model. The Roofline model adopts FLOPS as the performance metric and is based on the definition of operational intensity (OI), which is the total number of floating point instructions divided by the total number of bytes of memory access. The Roofline model gives an upper bound on system performance.

3 Background and Motivations

3.1 Background

Moore's Law reveals that the number of transistors per chip doubles approximately every two years. However, the diversity and complexity of modern datacenter workloads raise great challenges, and make it hard to identify the performance bottlenecks spanning multiple domains – algorithm, programming, compiling, system development, and architecture design. Quantitative performance analytics can guide the co-design of software and hardware, including the performance metric, the corresponding measuring or benchmarking tool, and the performance model. Among them, the metric is used to quantitatively evaluate performance efficiency; the measuring tool is used to measure the number given by the performance metric; the model is used to identify the bottlenecks and provide optimization guidelines.

3.1.1 Computation-centric Metric

For the system and architecture community, computation-centric metrics, such as FLOPS, are fundamental yardsticks to reflect the running performance and the gaps across different systems or architectures. A computation-centric metric can be calculated at the application's source code level, independent of the underlying system implementations. Also, it can be calculated at the software's binary code level and the hardware instruction level, respectively. So it is effective for co-design across different layers. Generally, a computation-centric metric has a performance upper bound on a specific architecture, determined by the micro-architecture design. For example, the theoretical FLOPS is computed as NumCPU × NumCore × FrequencyCPU × NumDoubleOperationsPerCycle.

3.1.2 Measuring tool

A measuring tool is used to measure systems and architectures in terms of the metric, and to provide the gap between the actual value and the theoretical one. The gap is useful for understanding the performance bottlenecks and providing optimization guidelines. For example, HPL [10] is a measuring tool based on FLOPS, which has equal proportions of floating-point additions and multiplications so as to fully utilize the FPU. Focusing on a specific system


or architecture, the actual FLOPS is obtained by running the HPL benchmark. The FLOPS efficiency is calculated as 100% × (FLOPS_real / FLOPS_theory).

3.1.3 Roofline Model

The computation-centric metric is the foundation of a system performance model. The Roofline model [25] is a famous system performance model based on FLOPS, which relates the performance and the operational intensity of a given workload. Based on the definition of operational intensity (OI), which is the total number of floating-point instructions divided by the total number of bytes of memory access, the Roofline model gives an upper bound on system performance.

P = Min(Peak FLOPS, Peak Memory Bandwidth × OI)

The formula states that the peak performance of a workload on a specific platform is limited by the processor's computing capacity and the memory bandwidth. Peak FLOPS and Peak Memory Bandwidth are the peak performance numbers of the platform. To identify the bottleneck and guide the optimization, ceilings – which imply the performance upper bounds for specific tuning settings – are added to the Roofline model, as shown in Fig. 2 as an example.

Figure 2: Original Roofline Model.

3.2 Motivations

We comprehensively characterize the effectiveness of FLOPS and the corresponding Roofline model on modern datacenter workloads. Considering diversity and representativeness, we choose 19 workloads from BigDataBench 2.0 [23], which cover a wide variety of application types and software stacks, including online services, offline/graph analytics, data warehouses, and NoSQL. The details are shown in Table 1.


Table 1: Experimental workloads

ID  Application      Software Stack    Input size
1   Sort             Hadoop,Spark,MPI  100GB data
2   Grep             Hadoop,Spark,MPI  100GB data
3   WordCount        Hadoop,Spark,MPI  100GB data
4   BFS              MPI               2E18 vertices
5   Read             HBase             100GB data
6   Write            HBase             100GB data
7   Scan             HBase             100GB data
8   Select Query     Hive,Spark        100GB data
9   Aggregate Query  Hive,Spark        100GB data
10  Join Query       Hive,Spark        100GB data
11  Nutch Server     Hadoop            1000 req/s
12  PageRank         Hadoop,Spark,MPI  10E10 pages
13  Index            Hadoop            10E10 pages
14  Olio Server      MySQL             1000 req/s
15  Kmeans           Hadoop,Spark,MPI  100GB data
16  CC               Hadoop,Spark,MPI  2E18 vertices
17  Rubis Server     MySQL             1000 req/s
18  CF               Hadoop,Spark,MPI  2E18 vertices
19  Bayes            Hadoop,Spark,MPI  100GB data

Also, the experimental platforms are the same as in Section 4.5.1, using the Intel Xeon E5645 and Intel Atom D510. Note that we run all 19 workloads and report their average values in Table 2 and Table 3. Overall, we summarize the limitations of FLOPS and of the corresponding Roofline model in the following subsections.

3.2.1 The Limitation of FLOPS for DC

An effective metric for DC should not only truly reflect the performance gaps across different platforms (e.g., user-perceivable performance gaps), but also reflect the actual system utilization. For the sake of fair evaluation, we measure the effectiveness of FLOPS on DC from two perspectives: 1) using the FLOPS metric and the HPL benchmark; 2) using the FLOPS metric and the BigDataBench benchmark.

Using the FLOPS Metric and the HPL Benchmark for DC. HPL [10] is a famous benchmark for HPC, and also the benchmark for the Top500 ranking [3]. We run HPL on the two experimental platforms respectively. As shown in Table 2, the theoretical FLOPS is obtained according to their micro-architecture designs. The E5645 is of the Westmere architecture, which is equipped with a four-issue and out-of-order pipeline; its theoretical FLOPS is 57.6 GFLOPS. The D510 is of the Pine Trail architecture, which has a two-issue and in-order pipeline; its theoretical FLOPS is 4.8 GFLOPS. The real FLOPS is obtained by running the HPL benchmark. FLOPS efficiency is computed as 100% × (FLOPS_real / FLOPS_theory). A user-perceivable metric is used to measure the performance gap across the two platforms, computed as the input data size divided by the total processing time, averaging the results of the 19 DC workloads in BigDataBench. From Table 2, the FLOPS metric cannot reflect the user-perceivable performance gap. The FLOPS gap between the E5645 and the D510 is large (12.2), but the user-perceivable performance gap of DC workloads is not that large (only 7.4), because DC workloads have a small percentage of floating-point operations. In conclusion, the HPL benchmark has behaviors distinct from DC workloads, so using the FLOPS metric and the HPL benchmark cannot effectively evaluate DC computing.


Table 2: FLOPS+HPL of Xeon E5645 vs. that of D510.

                         E5645     D510      Ratio
Theoretical FLOPS        57.6G     4.8G      12
Real FLOPS               38.9G     3.2G      12.2
FLOPS efficiency         68%       67%       -
User-perceivable metric  110MB/s   14.8MB/s  7.4

Using the FLOPS Metric and BigDataBench for DC. To further evaluate the effectiveness of the FLOPS metric, we use the 19 DC workloads from BigDataBench. Table 3 shows our experimental results. We find that, in this setting, the FLOPS metric can reflect the user-perceivable performance gap of DC workloads, with gaps of 6.3 and 7.4, respectively. However, the FLOPS metric cannot reflect the actual system utilization. As seen in Table 3, the average FLOPS efficiency of the 19 DC workloads is only 0.16% on the E5645 platform, while the average CPU utilization is 63% and the IPC (Instructions Per Cycle) is 1.3. Hence, even with DC workloads, the FLOPS metric still cannot reflect the actual system utilization, and thus makes it hard to find the performance bottlenecks and give optimization guidelines.

Table 3: FLOPS+BigDataBench of Xeon E5645 vs. that of D510.

                         E5645     D510      Ratio
Theoretical FLOPS        57.6G     4.8G      12
Real FLOPS               0.1G      0.016G    6.3
FLOPS efficiency         0.16%     0.17%     -
User-perceivable metric  110MB/s   14.8MB/s  7.4

3.2.2 The Limitation of Roofline Model for DC

The Roofline model is a quantitative performance model based on the FLOPS metric, which is only applicable to floating-point calculations. We evaluate the feasibility of using the Roofline model proposed in [25] for DC optimization, using five workloads from BigDataBench 2.0 on the Intel Xeon E5645 platform. They are Sort, Grep, WordCount, Bayes and Kmeans, implemented with MPI. The peak FLOPS of the E5645 is 57.6 GFLOPS, and the peak memory bandwidth is 13.2 GB/s. We calculate the operational intensity of the five workloads, with an average value of only 0.05, as shown in Table 4. Thus, the bottleneck of all five workloads is memory access according to the model [25]. We then increase the memory bandwidth through hardware pre-fetching. However, only Grep and WordCount gain significant performance improvements, which are 16% and 10%, respectively. For the other three, the average performance improvement is only 6%, which means memory access is not the bottleneck for all five workloads. Above all, the traditional Roofline model is not suitable for modern DC workloads, which have extremely low floating point operational intensity, and it further gives misleading optimization guidelines.

Table 4: DC Workloads under the Roofline Model.

Workload   FLOPS  Intensity  Bottleneck
Sort       0.01   0.01       Memory Access
Grep       0.01   0.002      Memory Access
WordCount  0.01   0.009      Memory Access
Bayes      0.02   0.2        Memory Access
Kmeans     0.1    0.5        Memory Access


4 BOPS

BOPS (Basic OPerations per Second) is the average number of BOPs (Basic OPerations) of a specific workload completed every second. We count all arithmetic, logical, comparing and array addressing operations for integer and floating point in BOPs. In this section, we define the requirements of an effective metric, and introduce our new metric – BOPS – in terms of its definition, measurements, corresponding measuring tools and evaluation.

4.1 Requirements

We define the requirements from the following four perspectives. First, the metric should reflect the performance gaps between different systems.

User-perceivable metrics can reflect the running performance. For example, data processed per second (GB/s) – which divides the input data size by the total running time – is a user-perceivable metric and effectively reflects the data processing capability. The computation-centric metric should preserve this characteristic and reflect the performance gap.

Second, the metric should facilitate hardware and software co-design. For the co-design of different layers of the system stack, i.e., the application, OS and hardware levels, a computation-centric metric should support measurements at different levels, spanning the source code, binary code, and hardware instruction levels.

Third, the metric should reflect the quantitative performance upper bound on a specific system. Focusing on different system designs, the metric should be sensitive to design decisions and reflect the theoretical performance upper bound. Then the gap between the actual and theoretical numbers is useful for understanding the performance bottlenecks and guiding the optimizations.

Fourth, it should be convenient to build measuring tools and a ceiling performance model on the basis of the metric. Based on the metric, we can construct corresponding measuring tools and performance models to guide system tuning. For example, the Roofline model [25], which relates the performance and floating-point operational intensity of a given workload, is one of the famous performance models based on FLOPS. For DC, the measuring tools and performance model are essential for evaluation and comparison.

4.2 BOPs Definition

The definition of BOPs is shown in Table 5. All of the operations in Table 5 are counted as 1 except for N-dimensional array addressing. Note that all operations are normalized to 64-bit operations. For arithmetic or logical operations, the number of BOPs is counted according to the corresponding arithmetic or logical operation. For array addressing operations, we take the one-dimensional array P[i] as an example: loading the value of P[i] implies the addition of an offset i to the address of P, so the number of BOPs increments by one. This also applies to the calculation for multidimensional arrays. For comparing operations, we transform the operation into a subtraction operation; taking X < Y as an example, it can be transformed into X - Y < 0, so the number of BOPs increments by one. Through the definition of BOPs, we can see that, in comparison with FLOPS, BOPS concerns not only the floating-point operations, but also the integer operations. On the other hand, like FLOPs, BOPs normalizes all operations into 64-bit operations, and each operation is counted as 1 (regardless of the different latencies of different operations in a real system).


Table 5: Normalization of operations in BOPS.

Operation                         Normalized value
Add                               1
Subtract                          1
Multiply                          1
Divide                            1
Bitwise operation                 1
Logic operation                   1
Compare operation                 1
One-dimensional array addressing  1
N-dimensional array addressing    N

Several other operations are not included in BOPs, including variable declarations, variable assignments, type conversions, branch commands, loop commands, skip commands, function call commands and return commands, as illustrated in Table 6.

Table 6: Operations not included in BOPs.

Category              Description
Variable declaration  int i
Variable assignment   i = 10
Type conversion       (int*) X
Branch call           goto, if-else, switch-case
Loop call             for, while
Function call         Fun() call
Return                return command

4.3 BOPS Measurements

BOPS is measured at the source code level and at the instruction level. From the micro-architecture level, the theoretical BOPS number is calculated as NumCPU × NumCore × FrequencyCPU × NumBOPsPerCycle; a workload's BOPS is obtained as BOPs_workload / ExecutionTime_workload. In this subsection, we introduce how to measure BOPs at the source code level and at the instruction level.

4.3.1 Source code level measurement

We can get the BOPs from the source code of a workload, and this method needs some manual work (analyzing the source code). In the example below, no BOPs are counted for the first and second lines because they are variable declarations. Line 3 consists of a loop command and two integer operations, corresponding to BOPs = (1+1) * 100 = 200 for the integer operations, while the loop command itself is not counted. Line 5 consists of an array addressing operation and a variable assignment: the address computation is counted as 100 * 1, while the variable assignment is not counted. So the total BOPs of the sample program is: 100 + 200 = 300.

1 long newClusterSize[100];
2 int j;
3 for (j = 0; j < 100; j++)
4 {
5     newClusterSize[j] = 0;
6 }


In order to verify the reasonability of the above calculation, the binary code is presented below, from which we can see that there are six operations: movq, addq, addq, movl, cmpq, jne. We count BOPs for addq, addq, and movl. The binary code level measurement is in accordance with the source code level one. In the rest of this paper, we consistently adopt the source-code level measurement.

movq $0, (%rax)
addq $8, %rax
cmpq %rdx, %rax
jne .L2
movl 664(%rsp), %eax
addq $688, %rsp

Another thing we need to take into consideration is system built-in library functions. For the calculation of a system-level function, such as the strcmp function, we implement the user-level function ourselves and then count the number of BOPs, which may result in a small deviation in the BOPS number. The implementation of the strcmp function is shown below.

int strcmp(const char *cs, const char *ct)
{
    unsigned char c1, c2;
    while (1) {
        c1 = *cs++;
        c2 = *ct++;
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}

4.3.2 Instruction level measurement

Source code or binary code level measurements need to analyze the code logic based on the source code, which is costly, especially for complex system stacks (e.g., Hadoop system stacks). Instruction level measurement can avoid this high analysis cost and the source code restriction. The instruction level measurement uses hardware performance counters to obtain BOPs. As different types of processors have different performance counter events, for convenience, we introduce an approximate but simple instruction level measurement method here. That is, we obtain the total number of instructions (ins), branch instructions (branch_ins), load instructions (load_ins) and store instructions (store_ins) through hardware performance counters, and the BOPs can be obtained approximately by the formula: ins - branch_ins - load_ins - store_ins. This approximate method counts all integer instructions, whereas the BOPs definition does not include all of them. From our observations, the deviation between the source code level measurement and our approximate instruction level measurement, which is calculated as |1 - BOPS_approximate_instruction / BOPS_source_code|, is very small (not more than 0.08). Furthermore, as shown in Fig. 3, from our observations, most of the integer instructions (almost 99%) are included in the BOPs definition, so the small deviation is reasonable. On the other hand, the FLOPs of HPL, using 2/3 × n^3 + 2 × n^2 + O(n) floating-point operations, is also counted approximately.


Figure 3: Integer Instructions Breakdown.

4.4 The BOPS Measuring Tools

HPL [10] is a measuring tool of FLOPS. For BOPS, we develop measuring tools consisting of aseries of representative workloads, considering the diversity of DC workloads.

Big data dwarfs [11] are identified as the frequently-appearing units of computation among a majority of datacenter workloads, including Matrix, Sampling, Transform, Graph, Logic, Set, Sort and Basic Statistic computations. Moreover, a wide variety of datacenter workloads are composed of one or more dwarfs. Based on the eight dwarfs, we select a set of fundamental workloads to construct our BOPS measuring tools: Sort (sort computation), Naive Bayes (basic statistic computation) and WordCount (basic statistic computation). These tools cover three workload types: IO-intensive (Sort), CPU-intensive (Naive Bayes) and hybrid (WordCount). Table 7 shows the BOPs values and workload descriptions of the BOPS measuring tools.

Table 7: Workloads of the BOPS Measuring Tools.

Workload   Implementation  BOPs    Description
Sort       MPI             529E9   IO-intensive
WordCount  MPI             179E9   Hybrid
Bayes      MPI             186E9   CPU-intensive

4.4.1 Sort

The Sort workload sorts an integer array of a specific scale; the sorting uses the quick sort algorithm together with a merge algorithm. The program is implemented with C++ and MPI.

4.4.2 WordCount

The WordCount workload computes statistics over the words in a specified input text, for example counting the frequency of each word appearing in a 1 GB .txt file. The program is implemented in C++ with MPI.
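A minimal single-node sketch of the counting step (the MPI version would additionally merge per-rank maps):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Sketch of the WordCount workload: tokenize a text stream on
// whitespace and count the frequency of each word.
std::map<std::string, long> word_count(const std::string& text) {
    std::map<std::string, long> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];
    return counts;
}
```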

4.4.3 Bayes

The Bayes workload classifies a specified input text according to a learned model, for example classifying the articles in a 1 GB .txt file by subject (politics, sports, entertainment, etc.) using existing learned models. The program is implemented in C++ with MPI.
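A minimal sketch of the scoring step, with a tiny hand-written model standing in for the learned one; all class names and probabilities below are purely illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Naive Bayes scoring: pick the class whose log prior plus summed
// log word likelihoods is largest.
struct NaiveBayesModel {
    std::map<std::string, double> prior;                        // P(class)
    std::map<std::string, std::map<std::string, double>> like;  // P(word|class)
};

std::string classify(const NaiveBayesModel& m,
                     const std::vector<std::string>& words) {
    std::string best;
    double best_score = -1e300;
    for (const auto& [cls, p] : m.prior) {
        double score = std::log(p);
        for (const auto& w : words) {
            auto it = m.like.at(cls).find(w);
            // Unseen words get a small probability floor.
            double pw = (it != m.like.at(cls).end()) ? it->second : 1e-6;
            score += std::log(pw);
        }
        if (score > best_score) { best_score = score; best = cls; }
    }
    return best;
}
```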


4.5 Evaluation

4.5.1 Experimental Platforms

We choose two typical DC physical platforms as the experimental platforms: the Intel Xeon E5645 and the Intel Atom D510.

The Intel Xeon E5645 targets the high-end workstation and server markets and is designed for CPU-intensive workloads. The E5645 has a TDP (Thermal Design Power) of 80W. It supports out-of-order execution and four-wide instruction issue, both of which lead to higher power consumption. The Xeon E5645 has six 2.4GHz cores per socket [7], each physical core supports two hardware threads, and the processor has three cache levels. Details are shown in Table 8.

Table 8: Configuration details of E5645.

CPU Type: Intel Xeon E5645, 6 cores @ 2.4GHz
L1 DCache: 6 × 32 KB    L1 ICache: 6 × 32 KB    L2 Cache: 6 × 256 KB    L3 Cache: 12 MB

The Intel Atom D510 is based on the Intel Pine Trail architecture; each processor includes two cores with a base frequency of 1.66GHz, each with two hardware threads, sharing a 1MB L2 cache. Its in-order pipeline can issue two instructions per cycle, which lowers the transistor count and reduces power consumption, yielding a TDP of 13W. The Intel Atom targets embedded application markets, but it is now also adopted in datacenters for its low power and ultra-high density. Details are shown in Table 9.

Table 9: Configuration details of D510.

CPU Type: Intel Atom D510, 2 cores @ 1.66GHz
L1 DCache: 2 × 24 KB    L1 ICache: 2 × 32 KB    L2 Cache: 1 MB (shared)    L3 Cache: none

Each experimental platform is equipped with four nodes, and the data scale for the experiments is 100GB.

4.5.2 BOPS vs. User-Perceivable Metrics

We evaluate the two Intel processors, Xeon E5645 and Atom D510, with BOPS and a user-perceivable metric. The theoretical BOPS is obtained from the micro-architecture. The actual BOPS is the average over Sort, WordCount and Bayes; for example, as shown in Table 11, the actual average BOPS of the E5645 is 9.2 GBOPS. The user-perceivable metric is defined as the input data size divided by the total processing time. BOPS efficiency is 100% ∗ (BOPS_RealAverage / BOPS_Theory). In contrast to the result from Table 2, where the FLOPS metric cannot reflect the user-perceivable performance gap (12.2 vs. 7.4), BOPS can reflect the performance gap between different platforms (6.5 vs. 7.4), as shown in Table 10.
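The ratios and efficiencies in Table 10 are simple quotients of the reported numbers; a minimal check:

```cpp
#include <cassert>
#include <cmath>

// Cross-platform ratio and BOPS efficiency, as defined in the text.
double ratio(double a, double b) { return a / b; }

double bops_efficiency_pct(double real, double theory) {
    return 100.0 * real / theory;
}
```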


Table 10: BOPS of Xeon E5645 vs. that of D510.

Metric              E5645     D510      Ratio
Peak GBOPS          86.4      12.8      6.7
Real average GBOPS  9.2       1.4       6.5
BOPS efficiency     11%       10.9%     -
User-perceivable    110MB/s   14.8MB/s  7.4

4.5.3 Scalability of BOPS for the DC Cluster

We use BOPS to measure the DC cluster while changing the node scale from one to four, comparing against the user-perceivable metric. As shown in Fig. 4, both the BOPS and the user-perceivable metrics increase with a larger cluster. This implies that BOPS is also suitable for different cluster scales.

Figure 4: Evaluating DC Cluster with BOPS.

4.5.4 Efficiency Evaluations for The Specific Processor Platform

We choose the Intel Xeon platform as the evaluation platform. For example, the Sort tool performs 529E9 BOPs. We run the Sort measuring tool on the Xeon E5645 node and find that the execution time is 61 seconds, so BOPS = 529E9 (total BOPs) / 61 (wall time) = 8.7 GBOPS. On the other hand, the peak theoretical BOPS of the Xeon E5645 is 86.4 GBOPS (86.4G = 1 CPU ∗ 6 cores ∗ 2.4G ∗ 6 BOPs per cycle). After ILP optimization (Section 5), Sort reaches 13.2 GBOPS, a BOPS efficiency of 15% (13.2/86.4). As shown in Table 11, the workload efficiencies of the BOPS measuring tools range from 6% to 15%, with an average of 11%.
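The two calculations above can be sketched directly from the formulas in the text:

```cpp
#include <cassert>
#include <cmath>

// Peak BOPS of a node, following the formula in the text:
// sockets * cores per socket * frequency (GHz) * BOPs per cycle.
double peak_gbops(int sockets, int cores, double ghz, int bops_per_cycle) {
    return sockets * cores * ghz * bops_per_cycle;
}

// Measured BOPS: total basic operations divided by wall time.
double measured_gbops(double total_bops, double seconds) {
    return total_bops / seconds / 1e9;
}
```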

Table 11: BOPS efficiency on the Xeon E5645.

Tool        Real BOPS    Efficiency
Sort        13.2 GBOPS   15%
WordCount   5 GBOPS      6%
Bayes       9.3 GBOPS    11%

4.5.5 BOPS for Traditional Workloads

We also evaluate traditional benchmarks with BOPS. As shown in Table 12, we chose HPL, Graph500 [22] and Stream [18] as the workloads, and compare the BOPS performance and efficiency of each workload with those under the FLOPS metric.


Table 12: BOPS for traditional workloads.

Metric            HPL    Graph500   Stream
GFLOPS            38.9   0.05       0.8
FLOPS efficiency  68%    0.04%      0.7%
GBOPS             41     12         13
BOPS efficiency   47%    18%        20%

4.6 Discussion

4.6.1 Normalization

In the normalized calculation of basic operations, we do not weight different operations by their latencies, because latencies differ widely across micro-architectures. For example, the latency of a division on the Intel Xeon E5645 is about 7-12 cycles, while on the Intel Atom D510 it can reach up to 38 cycles [5]. Hence, folding latencies into the normalization would make the metric architecture-dependent.

4.6.2 The Efficiency of BOPS Measurement Tools

FLOPS with HPL can reach more than 80% of the system peak, while BOPS efficiency is around 11% of the system peak. The main reasons are as follows: 1) We calculate the theoretical BOPS assuming SIMD execution, but the measuring tools use no SIMD programming; with 128-bit SIMD operations, the attainable efficiency would double theoretically. 2) Imbalance of operation instructions. For example, the E5645 processor is equipped with three computing units (ALU or FPU), and each cycle can complete six BOPs only if the application keeps all of these units busy. DC workloads fail to achieve such high efficiency mainly because their implementations do not fully exploit the system design. In terms of system efficiency, the Intel Xeon E5645 and Intel Atom D510 reach 11% and 10.9% respectively, the same order of magnitude as the theoretical efficiency. The next section shows that this efficiency is sufficient to support quantitative analysis and optimization of big data systems based on the BOPS performance model: when we change the Sort measuring tool from SISD to SIMD by revising it with SSE, the attainable performance is 28.6 GBOPS, 33% of the theoretical peak performance.

5 BOPS-based DC Roofline Model

5.1 DC Roofline Model

Based on the Roofline model, we propose the DC Roofline model, whose main idea is to use BOPS as the performance metric. The DC Roofline model is defined as follows.

Definition 1: The Performance

We choose BOPS as the performance metric for the DC Roofline.

Definition 2: Operation Intensity

Operation intensity in the DC Roofline is the ratio between the BOPs and the memory traffic, and represents the number of basic operations executed per byte of memory access. The total number of bytes transferred (MQ) is the


total number of bytes exchanged between the CPU and the memory, and the total operation count (BOPs) is the total number of basic operations. The operation intensity (OI_BOPS) is then given by the formula: OI_BOPS = BOPs / MQ.

The total number of bytes transferred (MQ) is obtained as (total number of memory accesses ∗ 64), where the total number of memory accesses is read from hardware performance counters, and BOPs is obtained through the measurement methods introduced in Section 4.3.

Definition 3: The Attainable Performance

The peak performance of the specific platform is PeakBOPS, and the peak memory bandwidth is PeakMemoryBandwidth. Given the actual operation intensity OI_BOPS of a specific workload, the attainable performance (P) can be described as:

P = min{PeakBOPS, PeakMemoryBandwidth ∗ OI_BOPS}

PeakBOPS can be obtained from the theoretical BOPS peak of the specific platform. The

PeakMemoryBandwidth can be obtained from the result of the Stream [18] benchmark. Meanwhile, the actual bandwidth of each workload is obtained through the formula: bandwidth = (total number of memory accesses ∗ 64) / (total execution time), where the total number of memory accesses is read from the hardware performance counters. The DC Roofline model is visualized in Fig. 5.
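The three definitions can be sketched as two small functions; the sample numbers in the checks are the E5645 figures reported later (peak 86.4 GBOPS, peak bandwidth 13.2 GB/s):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Operation intensity: basic operations per byte moved to/from memory,
// where MQ = memory accesses * 64 (one 64-byte cache line per access).
double operation_intensity(double bops, double mem_accesses) {
    return bops / (mem_accesses * 64.0);
}

// Attainable performance under the DC Roofline:
// P = min(peak BOPS, peak memory bandwidth * OI).
double attainable_gbops(double peak_gbops, double peak_bw_gbs, double oi) {
    return std::min(peak_gbops, peak_bw_gbs * oi);
}
```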

Figure 5: DC Roofline Model.

5.2 DC Roofline Usage

5.2.1 DC Roofline for the Intel Platform

To illustrate the usage, we take the E5645 platform as an example; the workloads are the five MPI workloads of BigDataBench 2.0. Note that the BOPS values here are measured without any optimization (so they differ from the results in Table 11). The peak BOPS of the E5645 is 86.4 GBOPS, and the peak memory bandwidth is 13.2GB/s. As shown in Table 13, the performance bottleneck of Sort, Bayes and Kmeans is calculation, while that of Grep and WordCount is memory access. By contrast, as shown in Table 4, the average FLOPS-based operation intensity of the workloads is only 0.05, ranging from 0.002 to 0.2, which would classify all five workloads as memory-access bound. Under the DC Roofline model the operation intensity ranges from 1.2 to 51, which yields more reasonable bottleneck classifications.


Table 13: DC workloads under the DC Roofline model.

Workload    BOPS   Intensity   Bottleneck
Sort        8.8    11.7        Calculation
Grep        4.6    1.2         Memory access
WordCount   3.8    3.2         Memory access
Bayes       5.1    51          Calculation
Kmeans      5.0    25          Calculation

Optimization under the DC Roofline model From Fig. 6, we find that ILP, SIMD and pre-fetching are three common methods for performance improvement. We introduce these optimization methods one by one.

Figure 6: Workloads with DC Roofline Model.

Optimization with ILP As ILP (instruction-level parallelism) optimization is a common method for performance improvement, we improve ILP by adding the compiler optimization option -O2 (i.e., gcc -O2). As shown in Table 14, the BOPS of Sort, Bayes and Kmeans improve significantly, by 50%, 68% and 120% respectively, while the improvements of Grep and WordCount are modest (from 4.6 to 4.9 and from 3.8 to 4.5, respectively). The result corresponds with the workload analysis in Table 13: as the bottlenecks of Sort, Bayes and Kmeans are calculation, improving ILP improves their performance.

Table 14: ILP optimization for DC workloads.

Workload            BOPS   IPC
Original_Sort       8.8    1.6
ILP_Sort            13.2   1.7
Original_Grep       4.6    0.6
ILP_Grep            4.9    0.6
Original_WordCount  3.8    0.9
ILP_WordCount       4.5    0.9
Original_Bayes      5.1    1.2
ILP_Bayes           8.6    1.5
Original_Kmeans     5.0    1.7
ILP_Kmeans          11.3   1.8

Optimization with Pre-fetching We improve the memory bandwidth by enabling the pre-fetching switch, which raises the peak memory bandwidth from 13.2GB/s to 13.8GB/s. As shown in Table 15, Grep and WordCount improve noticeably (from 4.9 to 5.5 and from 4.5 to 5, respectively), while the improvements of Sort, Bayes and Kmeans are not obvious. The result corresponds with the workload analysis in Table 13: as the bottlenecks of Grep and WordCount are memory access, pre-fetching improves the effective memory bandwidth.


Table 15: Pre-fetching optimization for DC workloads.

Workload             BOPS   Bandwidth (GB/s)
ILP_Sort             13.2   0.75
Prefetch_Sort        13.2   0.9
ILP_Grep             4.9    3.8
Prefetch_Grep        5.5    5.6
ILP_WordCount        4.5    1.1
Prefetch_WordCount   5      1.2
ILP_Bayes            8.6    0.1
Prefetch_Bayes       9.3    0.2
ILP_Kmeans           11.3   0.2
Prefetch_Kmeans      11.3   0.2

Optimization with SIMD Single-Instruction-Multiple-Data (SIMD) is a common method for HPC performance improvement: it performs the same operation on multiple data elements simultaneously. Modern processors have at least 128-bit wide SIMD instructions (e.g., SSE, AVX). Since a SIMD instruction operates on adjacent operands in parallel, it has proved efficient for HPC workloads. We also apply SIMD to the DC workloads, changing the Sort workload of the BOPS measuring tools from SISD to SIMD by revising it with SSE. From Fig. 7, we can see that the Sort-SSE version achieves 2.2X the performance of the SISD version. Using Sort-SSE, the attainable performance is 28.6 GBOPS, which is 33% of the theoretical peak performance.
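The flavor of the SSE revision can be sketched with SSE2 intrinsics (assuming an x86 target, where SSE2 is baseline on x86-64); this toy kernel adds four 32-bit integers per instruction and is not the actual Sort code:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// One 128-bit SSE2 instruction processes four 32-bit integers at once,
// instead of one element per scalar instruction.
void add_sse(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar remainder
}
```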

Figure 7: Sort Optimization with DC Roofline Model.

Optimization Summary As shown in Fig. 8, after the ILP, SIMD and pre-fetching optimizations, all five workloads show performance improvements: the BOPS of Sort rises from 8.8 to 28.6, Grep from 4.6 to 5.5, WordCount from 3.8 to 5, Kmeans from 5 to 11.3, and Bayes from 5.1 to 9.3.

Figure 8: Workloads Optimizations Using the DC Roofline Model.


5.2.2 DC Roofline across different platforms

We choose two typical platforms as the experimental platforms: the Intel Xeon E5645 (detailed in Table 8) and the Intel Atom D510 (detailed in Table 9). The workloads are again the five MPI workloads of BigDataBench 2.0. The peak memory bandwidth of the E5645 is 13.8GB/s, and that of the D510 is 2.3GB/s. Note that the values in Table 16 are the optimized values obtained using the DC Roofline model. As shown in Table 17 and Table 16, the operation intensity on the E5645 ranges from 1.1 to 56.5, and on the D510 from 0.8 to 10.2. The same workload can have different bottlenecks on different platforms: on the E5645, Bayes and Kmeans are computation-bound while the other workloads are memory-access bound (the Sort in Table 16 is SSE_Sort); on the D510, all of the workloads are memory-access bound.

Table 16: DC Roofline model under Intel E5645.

Workload    BOPS   Intensity   Bottleneck
Sort        28.6   4.7         Memory access
Grep        5.5    1.1         Memory access
WordCount   5      4.2         Memory access
Bayes       9.3    46.5        Calculation
Kmeans      11.3   56.5        Calculation

Table 17: DC Roofline model under Intel D510.

Workload    BOPS   Intensity   Bottleneck
Sort        2      4.9         Memory access
Grep        0.9    0.8         Memory access
WordCount   0.7    1.2         Memory access
Bayes       1.2    10.2        Memory access
Kmeans      1.5    5.7         Memory access

5.2.3 DC Roofline for Software Stacks

We take the E5645 platform as the experimental platform; the workloads are the five Hadoop workloads of BigDataBench 2.0 (Sort-H denotes the Hadoop version of Sort). As shown in Table 18, the bottleneck of every Hadoop workload is memory access.

Table 18: DC Roofline model for Hadoop workloads.

Workload      BOPS   Intensity   Bottleneck
Sort-H        6.3    1.3         Memory access
Grep-H        7.9    4.6         Memory access
WordCount-H   4.5    1.8         Memory access
Bayes-H       8.7    3.9         Memory access
Kmeans-H      7.7    4.2         Memory access

5.3 Summary

The Roofline model is a quantitative performance model based on the FLOPS metric, which is only applicable to floating-point computations, as verified by our experiments. We propose a new computation-centric metric for DC computing, BOPS, together with a BOPS-based DC Roofline model. Through the experiments, we demonstrate that the BOPS-based DC Roofline model can help


with system design and tuning; in particular, for the Sort workload, we improve performance by 2.2X under the guidance of the DC Roofline model.

6 Conclusion

In this paper, we propose a new computation-centric metric, BOPS, for evaluating DC computer systems. BOPS is the average number of BOPs (basic operations) of a specific workload completed per second, where BOPs include all of the arithmetic, logical, comparison and array addressing operations on integer and floating-point data. On the basis of big data and AI dwarfs, the frequently-appearing units of computation identified from typical datacenter application domains, we select a set of fundamental workloads as the measuring tool to report BOPS numbers. Finally, we propose a BOPS-based Roofline model [25], which we call the DC Roofline model, as a quantitative DC performance model for guiding system design and optimization. Through the experiments, we demonstrate that the BOPS-based DC Roofline model indeed helps system design and optimization.

References

[1] http://www.tpc.org/tpcc.

[2] http://www.spec.org/cpu.

[3] http://top500.org/.

[4] “Sort benchmark home page,” http://sortbenchmark.org/.

[5] “Technical report,” http://www.agner.org/optimize/microarchitecture.pdf.

[6] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," pp. 483–485, 1967.

[7] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," pp. 72–81, 2008.

[8] E. L. Boyd, W. Azeem, H. S. Lee, T. Shih, S. Hung, and E. S. Davidson, "A hierarchical approach to modeling and improving the performance of scientific applications on the KSR1," vol. 3, pp. 188–192, 1994.

[9] S. P. E. Corporation, "SPECweb2005: SPEC benchmark for evaluating the performance of world wide web servers," http://www.spec.org/web2005/, 2005.

[10] J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.

[11] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, "A dwarf-based scalable big data benchmarking methodology," arXiv preprint arXiv:1711.03229, 2017.

[12] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, R. Ren, C. Zheng, G. Lu, J. Li, Z. Cao, Z. Shujie, and H. Tang, "BigDataBench: A dwarf-based big data and artificial intelligence benchmark suite," Technical Report, Institute of Computing Technology, Chinese Academy of Sciences, 2017.

[13] S. Harbaugh and J. A. Forakis, "Timing studies using a synthetic Whetstone benchmark," ACM SIGAda Ada Letters, no. 2, pp. 23–34, 1984.

[14] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley & Sons, Chichester, 1991, vol. 182.

[15] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, and A. Borchers, "In-datacenter performance analysis of a tensor processing unit," pp. 1–12, 2017.

[16] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in International Symposium on Computer Architecture, 2016, pp. 393–405.

[17] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, C. Xu, and N. Sun, "CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications," Frontiers of Computer Science, vol. 6, no. 4, pp. 347–362, 2012.

[18] J. D. McCalpin, "STREAM: Sustainable memory bandwidth in high performance computers," 1995.

[19] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto et al., "A 40GOPS 250mW massively parallel processor based on matrix architecture," in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International. IEEE, 2006, pp. 1616–1625.

[20] U. Pesovic, Z. Jovanovic, S. Randjic, and D. Markovic, "Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications," in MIPRO, 2012 Proceedings of the 35th International Convention. IEEE, 2012, pp. 743–747.

[21] M. M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, "A genetic algorithms approach to modeling the performance of memory-bound computations," pp. 1–12, 2007.

[22] K. Ueno and T. Suzumura, "Highly scalable graph search for the Graph500 benchmark," in International Symposium on High-Performance Parallel and Distributed Computing, 2012, pp. 149–160.

[23] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: a big data benchmark suite from internet services," in HPCA 2014. IEEE, 2014.

[24] R. Weicker, "Dhrystone: a synthetic systems programming benchmark," Communications of the ACM, vol. 27, no. 10, pp. 1013–1030, 1984.

[25] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.

[26] A. Thomasian, "Analytic queueing network models for parallel processing of task systems," IEEE Transactions on Computers, vol. 35, no. 12, pp. 1045–1054, 1986.
