Evaluating Hadoop Clusters with TPCx-HS
Technical Report No. 2015-1
October 6, 2015
Todor Ivanov and Sead Izberovic
Frankfurt Big Data Lab
Chair for Databases and Information Systems
Institute for Informatics and Mathematics
Goethe University Frankfurt
Robert-Mayer-Str. 10,
60325, Bockenheim
Frankfurt am Main, Germany
www.bigdata.uni-frankfurt.de
Copyright © 2015, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Table of Contents

1. Introduction
2. Background
3. Experimental Setup
   3.1. Hardware
   3.2. Software
   3.3. Network Setups
4. Benchmarking Methodology
5. Experimental Results
   5.1. System Ratios
   5.2. Performance
   5.3. Price-Performance
6. Resource Utilization
   6.1. CPU
   6.2. Memory
   6.3. Disk
   6.4. Network
   6.5. Mappers and Reducers
7. Lessons Learned
References
Appendix
Acknowledgements
1. Introduction
The growing complexity and variety of Big Data platforms makes it both difficult and time-consuming for users to properly set up and operate these systems. Another challenge is to compare the platforms in order to choose the most appropriate one for a particular application. These factors motivate the need for a standardized Big Data benchmark that can support users in the process of platform evaluation. Recently, TPCx-HS [1][2] was released as the first standardized Big Data benchmark designed to stress test a Hadoop cluster.
The goal of this study is to evaluate and compare how the network setup influences the
performance of a Hadoop cluster. In particular, experiments were performed using shared and
dedicated 1Gbit networks utilized by the same Cloudera Hadoop Distribution (CDH) cluster
setup. The TPCx-HS benchmark, which is very network intensive, was used to stress test and
compare both cluster setups. All the presented results are obtained by using the officially
available version [1] of the benchmark, but they are not comparable with the officially reported
results and are meant as an experimental evaluation, not audited by any external organization. As
expected, the dedicated 1Gbit network setup performed much faster than the shared 1Gbit setup. What was surprising, however, is the negligible price difference between the two cluster setups, which pays off with a multifold performance return.
The rest of the report is structured as follows: Section 2 provides a brief description of the technologies involved in our study. An overview of the hardware and software setup used for the experiments is given in Section 3. A brief summary of the TPCx-HS benchmark is presented in Section 4. The performed experiments, together with the evaluation of the results, are presented in Section 5, and Section 6 analyzes the cluster resource utilization. Finally, Section 7 concludes with lessons learned.
2. Background
Big Data has emerged as a new term not only in IT, but also in numerous other industries such as healthcare, manufacturing, transportation, retail and public sector administration [3][4], where it quickly became relevant. There is still no single definition which adequately describes all Big Data aspects [5], but the "V" characteristics (Volume, Variety, Velocity, Veracity and more) are among the most widely used ones. Exactly these new Big Data characteristics challenge the capabilities of traditional data management and analytical systems [5][6]. These challenges have motivated researchers and industry to develop new types of systems, such as Hadoop and NoSQL databases [7].
Apache Hadoop [8] is a software framework for the distributed storage and processing of large data sets across clusters of computers using the map and reduce programming model. The architecture allows scaling out from a single server to thousands of machines. At the same time, Hadoop
delivers high-availability by detecting and handling failures at the application layer. The use of
data replication guarantees the data reliability and fast access. The core Hadoop components are
the Hadoop Distributed File System (HDFS) [9][10] and the MapReduce framework [11].
HDFS has a master/slave architecture with a NameNode as a master and multiple DataNodes as
slaves. The NameNode is responsible for the storing and managing of all file structures, metadata,
transactional operations and logs of the file system. The DataNodes store the actual data in the
form of files. Each file is split into blocks of a preconfigured size. Every block is copied and
stored on multiple DataNodes. The number of block copies depends on the Replication Factor.
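Assuming the common HDFS defaults of a 128 MB block size and 3-way replication (both are configurable and used here purely for illustration, as is the function name), the block and replica arithmetic can be sketched as:

```python
import math

def hdfs_block_footprint(file_size_mb, block_size_mb=128, replication_factor=3):
    """Number of HDFS blocks for a file and the raw storage its replicas occupy."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # each file is split into blocks
    raw_storage_mb = file_size_mb * replication_factor    # every block is stored RF times
    return num_blocks, raw_storage_mb

# Example: a 1000 MB file with a 128 MB block size and replication factor 3
blocks, raw = hdfs_block_footprint(1000)
print(blocks, raw)  # 8 blocks, 3000 MB of raw storage
```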
MapReduce is a software framework that provides general programming interfaces for writing applications that process vast amounts of data in parallel on the cluster nodes, using a distributed file system. The MapReduce unit of work is called a job and consists of input data and a MapReduce program. Each job is divided into map and reduce tasks. A map task takes a split, which is a part of the input data, and processes it according to the user-defined map function from the MapReduce program. A reduce task gathers the output data of the map tasks and merges it according to the user-defined reduce function. The number of reducers is specified by the user and does not depend on the number of input splits or map tasks. Parallel execution is achieved by running map tasks on each node to process the local data and then sending their results to reduce tasks, which produce the final output.
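The map/shuffle/reduce flow described above can be sketched as a tiny single-process Python example; the word-count job, `run_mapreduce`, `map_fn` and `reduce_fn` are illustrative stand-ins for a user-defined MapReduce program, not Hadoop API calls:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Tiny single-process sketch of the MapReduce flow: map each split,
    group the intermediate pairs by key (the shuffle), reduce each group."""
    groups = defaultdict(list)
    for split in inputs:                  # each split is a part of the input data
        for key, value in map_fn(split):  # user-defined map function
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Word count, the canonical MapReduce example
splits = ["big data big", "data lab"]
result = run_mapreduce(
    splits,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(result)  # {'big': 2, 'data': 2, 'lab': 1}
```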
Hadoop implements the MapReduce model by using two types of processes – JobTracker and
TaskTracker. The JobTracker coordinates all jobs in Hadoop and schedules tasks to the
TaskTrackers on every cluster node. The TaskTracker runs tasks assigned by the JobTracker.
Multiple other applications were developed on top of the Hadoop core components, together known as the Hadoop ecosystem, to make Hadoop easier to use and applicable to a variety of industries. Examples of such applications are Hive [12], Pig [13], Mahout [14], HBase [15], Sqoop [16] and many more.
YARN (Yet Another Resource Negotiator) [17][18] is the next generation Apache Hadoop
platform, which introduces new architecture by decoupling the programming model from the
resource management infrastructure and delegating many scheduling-related functions to per-
application components. This new design [18] offers some improvements over the older platform:
- Scalability
- Multi-tenancy
- Serviceability
- Locality awareness
- High cluster utilization
- Reliability/Availability
- Secure and auditable operation
- Support for programming model diversity
- Flexible resource model
- Backward compatibility
The major difference is that the functionality of the JobTracker is split into two new daemons: the ResourceManager (RM) and the ApplicationMaster (AM). The RM is a global service managing all the resources and jobs in the platform. It consists of a scheduler and an ApplicationManager. The scheduler is responsible for allocating resources to the various running applications based on their resource requirements. The ApplicationManager is responsible for accepting job submissions and negotiating resources from the scheduler. The NodeManager (NM) agent runs on each worker node. It is responsible for allocating and monitoring node resource usage (CPU, memory, disk and network) and reports back to the ResourceManager (scheduler). An ApplicationMaster instance runs per application and negotiates the appropriate resource containers from the scheduler. It is important to mention that the new MapReduce 2.0 maintains API compatibility with the older stable versions of Hadoop, and therefore MapReduce jobs can run unchanged.
Cloudera Hadoop Distribution (CDH) [19][20] is a 100% Apache-licensed open source Hadoop distribution offered by Cloudera. It includes the core Apache Hadoop elements, the Hadoop Distributed File System (HDFS) and MapReduce (YARN), as well as several additional projects from the Apache Hadoop ecosystem. All components are tightly integrated to enable ease of use and are managed by a central application, Cloudera Manager [21].
3. Experimental Setup
3.1. Hardware
The experiments were performed on a cluster consisting of 4 nodes connected directly through a 1GBit Netgear switch, as shown in Figure 1. All 4 nodes are Dell PowerEdge T420 servers. The master node is equipped with 2x Intel Xeon E5-2420 (1.9GHz) CPUs, each with 6 cores, 32GB of RAM and a 1TB (SATA, 3.5 in, 7.2K RPM, 64MB cache) hard drive. The worker nodes are each equipped with 1x Intel Xeon E5-2420 (2.20GHz) CPU with 6 cores, 32GB of RAM and 4x 1TB (SATA, 3.5 in, 7.2K RPM, 64MB cache) hard drives. A more detailed specification of the node servers is provided in the Appendix (Table 12 and Table 13).
Figure 1: Cluster Setup (Master Node and Worker Nodes 1-3 connected through a dedicated/shared network with a 1 Gbit switch)
Setup Description                | Summary
Total Nodes:                     | 4 x Dell PowerEdge T420
Total Processors/Cores/Threads:  | 5 CPUs / 30 Cores / 60 Threads
Total Memory:                    | 128 GB
Total Number of Disks:           | 13 x 1TB, SATA, 3.5 in, 7.2K RPM, 64MB Cache
Total Storage Capacity:          | 13 TB
Network:                         | 1 GBit Ethernet
Table 1: Summary of Total System Resources
Table 1 summarizes the total cluster resources that are used in the calculation of the benchmark
ratios in the next sections.
3.2. Software
This section describes the software setup of the cluster. The exact software versions that were
used are listed in Table 2. Ubuntu Server LTS was installed on all 4 nodes, allocating the entire
first disk. The number of open files per user was changed from the default value of 1024 to 65000
as suggested by the TPCx-HS benchmark and Cloudera guidelines [22]. Additionally, the OS
swappiness option was permanently turned off (vm.swappiness = 0). The remaining three disks on all worker nodes were formatted as ext4 partitions and permanently mounted with the options noatime and nodiratime. Then the partitions were configured to be used by HDFS through the Cloudera Manager. Each 1TB disk provides in total 916.8GB of effective HDFS space, which means that the three workers with three HDFS disks each (9 x 916.8GB = 8251.2GB = 8.0578TB) have in total around 8TB of effective HDFS space.
Software Version
Ubuntu Server 64 Bit 14.04.1 LTS, Trusty Tahr,
Linux 3.13.0-32-generic
Java (TM) SE Runtime Environment 1.6.0_31-b04,
1.7.0_72-b14
Java HotSpot (TM) 64-Bit Server VM 20.6-b01, mixed mode
24.72-b04, mixed mode
OpenJDK Runtime Environment 7u71-2.5.3-0ubuntu0.14.04.1
OpenJDK 64-Bit Server VM 24.65-b04, mixed mode
Cloudera Hadoop Distribution 5.2.0-1.cdh5.2.0.p0.36
TPCx-HS Kit 1.1.2
Table 2: Software Stack of the System under Test
Cloudera CDH 5.2 with default configurations was used for all experiments. Table 3 summarizes the software services running on each node. Due to the resource limitation of our experimental setup (only 3 worker nodes), the cluster was configured to work with a replication factor of 2. This means that our cluster can store at most 4TB of data.
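The usable capacity follows directly from the disk figures above; this small sketch (the function name is ours) recomputes the effective HDFS space and the net data capacity under 2-way replication:

```python
def usable_hdfs_capacity(num_disks, effective_gb_per_disk, replication_factor):
    """Effective HDFS space across all data disks, and the net data capacity
    once every block is stored replication_factor times."""
    effective_gb = num_disks * effective_gb_per_disk
    return effective_gb, effective_gb / replication_factor

# 3 workers x 3 HDFS disks, 916.8 GB effective per 1 TB disk, replication factor 2
effective, net = usable_hdfs_capacity(9, 916.8, 2)
print(effective, net)  # ~8251.2 GB effective, ~4125.6 GB (about 4 TB) net capacity
```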
Server Disk Drive Software Services
Master Node Disk 1/ sda1
Operating System, Root, Swap, Cloudera Manager Services,
Name Node, SecondaryName Node, Hive Metastore, Hive
Server2, Oozie Server, Spark History Server, Sqoop 2 Server,
YARN Job History Server, Resource Manager, Zookeeper
Server
Worker
Nodes 1-3
Disk 1/ sda1 Operating System, Root, Swap,
Data Node, YARN Node Manager
Disk 2/ sdb1 Data Node
Disk 3/ sdc1 Data Node
Disk 4/ sdd1 Data Node
Table 3: Software Services per Node
3.3. Network Setups
The initial cluster setup used the shared 1GBit network available in our lab. However, as expected, it turned out that it does not provide sufficient network speed for network intensive cluster applications. It was also hard to estimate the actual available bandwidth of the shared network, as it is used by multiple workstation machines in an unpredictable manner. Therefore, it was clear that a dedicated network should be set up for our cluster. This was achieved by using a simple 1GBit commodity switch (Netgear GS108 GE, 8-Port), which connected all four nodes directly in a dedicated 1GBit network.
To validate that our cluster setup was properly installed, the network speed was measured using a standard network tool called iperf [23]. Following the provided instructions [24][25], we obtained multiple measurements for our dedicated 1GBit network between the NameNode and two of our DataNodes, reported in Table 4. The iperf server was started by executing the "iperf -s" command on the NameNode, and the iperf clients were then executed twice using the "iperf -client serverhostname -time 30 -interval 5 -parallel 1 -dualtest" command on the DataNodes. The iperf numbers show a very stable data transfer of around 930 Mbits per second.
Run | Server   | Client     | Time (sec) | Interval | Parallel | Type     | Transfer 1 (GBytes) | Speed 1 (Mbits/sec) | Transfer 2 (GBytes) | Speed 2 (Mbits/sec)
1   | NameNode | DataNode 1 | 30         | 5        | 1        | dualtest | 3.24                | 929                 | 3.25                | 930
2   | NameNode | DataNode 1 | 30         | 5        | 1        | dualtest | 3.24                | 928                 | 3.25                | 931
1   | NameNode | DataNode 2 | 30         | 5        | 1        | dualtest | 3.25                | 930                 | 3.25                | 931
2   | NameNode | DataNode 2 | 30         | 5        | 1        | dualtest | 3.24                | 928                 | 3.25                | 931
Table 4: Network Speed
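As a rough sanity check on Table 4, the reported transfer volumes and speeds are consistent with each other, assuming iperf's usual convention of binary GBytes (2^30 bytes) and decimal Mbits (10^6 bits); the conversion helper here is ours:

```python
def iperf_rate_mbits(transfer_gbytes, seconds):
    """Convert an iperf transfer volume (GBytes, i.e. 2**30 bytes) over an
    interval into an approximate Mbits/sec figure (1 Mbit = 10**6 bits)."""
    bits = transfer_gbytes * (2 ** 30) * 8
    return bits / seconds / 10 ** 6

# First row of Table 4: 3.24 GBytes in 30 s, close to the reported 929 Mbits/sec
# (iperf rounds the displayed GBytes, so the back-computed rate differs slightly)
print(round(iperf_rate_mbits(3.24, 30), 1))
```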
The next step is to test the different cluster setups using a network intensive Big Data benchmark like TPCx-HS in order to get a better idea of the implications of using a shared network versus a dedicated one.
4. Benchmarking Methodology
This section presents the TPCx-HS benchmark, its methodology and some of its major features as
described in the current specification (version 1.3.0 from February 19, 2015) [1][2].
TPCx-HS was released in July 2014 as the industry's first standard benchmark for Big Data systems. It stresses both the hardware and software components, including the Hadoop run-time stack, the Hadoop File System and the MapReduce layers. The benchmark is based on the TeraSort workload [26], which is part of the Apache Hadoop distribution. Similarly, it consists of four modules: HSGen, HSDataCheck, HSSort and HSValidate. HSGen is a program that generates the data for a particular Scale Factor (see Clause 4.1 of the TPCx-HS specification) and is based on TeraGen, which uses a random data generator. HSDataCheck is a program that checks the compliance of the dataset and its replication. HSSort is a program, based on TeraSort, which sorts the data into a total order. Finally, HSValidate is a program, based on TeraValidate, which validates that the output is correctly sorted.
A valid benchmark execution consists of five separate phases, which have to be run sequentially to avoid any phase overlapping, as depicted in Figure 2. Additionally, Table 5 provides an exact description of each of the execution phases. The benchmark is started by the <TPCx-HS-master> script and consists of two consecutive runs, Run1 and Run2, as shown in Figure 2. No activities except file system cleanup are allowed between Run1 and Run2. The completion times of each phase/module (HSGen, HSSort and HSValidate) except HSDataCheck are currently reported.
An important requirement of the benchmark is to maintain 3-way data replication throughout the entire experiment. In our case this criterion was not fulfilled because of the limited resources of our cluster (only 3 worker nodes). All of our experiments were performed with 2-way data replication.
The benchmark reports the total elapsed time (T) in seconds for both runs. This time is used for the calculation of the TPCx-HS Performance Metric, abbreviated as HSph@SF. The run that takes more time and results in the lower TPCx-HS Performance Metric is defined as the performance run. Conversely, the run that takes less time and results in the higher TPCx-HS Performance Metric is defined as the repeatability run. The reported performance metric is the TPCx-HS Performance Metric of the performance run.
Figure 2: TPCx-HS Execution Phases (version 1.3.0 from February 19, 2015) [1]
Phase Description as provided in TPCx-HS specification
(version 1.3.0 from February 19, 2015) [1]
1 “Generation of input data via HSGen. The data generated must be replicated 3-
ways and written on a Durable Medium.”
2 “Dataset (See Clause 4) verification via HSDataCheck. The program is to verify
the cardinality, size and replication factor of the generated data. If the
HSDataCheck program reports failure then the run is considered invalid.”
3 “Running the sort using HSSort on the input data. This phase samples the input
data and sorts the data. The sorted data must be replicated 3-ways and written on
a Durable Medium.”
4 “Dataset (See Clause 4) verification via HSDataCheck. The program is to verify
the cardinality, size and replication factor of the sorted data. If the HSDataCheck
program reports failure then the run is considered invalid.”
5
“Validating the sorted output data via HSValidate. HSValidate validates the sorted
data. This phase is not part of the primary metric but reported in the Full
Disclosure Report. If the HSValidate program reports that the HSSort did not
generate the correct sort order, then the run is considered invalid. “
Table 5: TPCx-HS Phases
The Scale Factor defines the size of the dataset, which is generated by HSGen and used for the
benchmark experiments. In TPCx-HS, it follows a stepped size model. Table 6 summarizes the
supported Scale Factors, together with the corresponding data sizes and number of records. The
last column indicates the argument with which to start the TPCx-HS-master script.
Data Size | Scale Factor (SF) | Number of Records | Option to Start Run
100 GB    | 0.1               | 1 Billion         | ./TPCx-HS-master.sh -g 1
300 GB    | 0.3               | 3 Billion         | ./TPCx-HS-master.sh -g 2
1 TB      | 1                 | 10 Billion        | ./TPCx-HS-master.sh -g 3
3 TB      | 3                 | 30 Billion        | ./TPCx-HS-master.sh -g 4
10 TB     | 10                | 100 Billion       | ./TPCx-HS-master.sh -g 5
30 TB     | 30                | 300 Billion       | ./TPCx-HS-master.sh -g 6
100 TB    | 100               | 1000 Billion      | ./TPCx-HS-master.sh -g 7
300 TB    | 300               | 3000 Billion      | ./TPCx-HS-master.sh -g 8
1 PB      | 1000              | 10 000 Billion    | ./TPCx-HS-master.sh -g 9
Table 6: TPCx-HS Scale Factors
The TPCx-HS specification defines three major metrics:
- Performance metric (HSph@SF)
- Price-performance metric ($/HSph@SF)
- Power per performance metric (Watts/HSph@SF)
The performance metric (HSph@SF) represents the effective sort throughput of the benchmark
and is defined as:
HSph@SF = SF / (T / 3600)
where SF is the Scale Factor (see Clause 4.1 from the TPCx-HS specification) and T is the total
elapsed time in seconds for the performance run. 3600 seconds are equal to 1 hour.
The price-performance metric ($/HSph@SF) is defined and calculated as follows:
$/HSph@SF = P / HSph@SF
where P is the total cost of ownership of the tested system. If the price is in currency other than
US dollars, the units can be adjusted to the corresponding currency.
The last metric, power per performance, which is not covered in our study, is expressed as Watts/HSph@SF and has to be measured following the TPC-Energy requirements [27].
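The two formulas above can be checked against the results reported later in Tables 8 and 11; only the formulas come from the specification, the function names and the sample values are ours:

```python
def hsph_at_sf(scale_factor, total_seconds):
    """TPCx-HS performance metric: HSph@SF = SF / (T / 3600)."""
    return scale_factor / (total_seconds / 3600)

def price_performance(total_cost, hsph):
    """Price-performance metric: $/HSph@SF = P / HSph@SF."""
    return total_cost / hsph

# Dedicated setup at SF 0.1: T = 2234.75 s (Table 8), P = 6760 EUR (Table 10)
perf = hsph_at_sf(0.1, 2234.75)
print(round(perf, 2))  # 0.16, matching Table 8
# Table 11 divides the price by the rounded metric
print(round(price_performance(6760, round(perf, 2)), 2))  # 42250.0, matching Table 11
```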
5. Experimental Results
This section presents the results of the performed experiments. The TPCx-HS benchmark was run with three scale factors, 0.1, 0.3 and 1, which generate 100GB, 300GB and 1TB datasets respectively. These three runs were performed for a cluster setup using both shared and dedicated 1Gbit networks. The first part presents and evaluates the metrics reported by the TPCx-HS benchmark for the different experiments. The second part analyzes the utilization of the cluster resources with respect to the two network setups.
5.1. System Ratios
The system ratios are additional metrics defined in the TPCx-HS specification to better describe
the system under test. These are the Data Storage Ratio and the Scale Factor to Memory Ratio.
Using the Total Physical Storage (13TB) and the Total Memory (128GB or 0.125TB) reported in
Table 1, we calculate them as follows:
Data Storage Ratio = Total Physical Storage / Scale Factor
Scale Factor to Memory Ratio = Scale Factor / Total Physical Memory
Table 7 reports the two ratios for the three different scale factors used in the experiments.
Scale Factor Data Size Data Storage Ratio Scale Factor to Memory Ratio
0.1 100 GB 130 0.8
0.3 300 GB 43.3 2.4
1 1 TB 13 8
Table 7: TPCx-HS Related Ratios
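The ratios in Table 7 can be reproduced directly from the totals in Table 1; a short sketch (the constants and the function name are ours):

```python
TOTAL_STORAGE_TB = 13    # Total Physical Storage from Table 1
TOTAL_MEMORY_TB = 0.125  # Total Memory from Table 1 (128 GB)

def system_ratios(scale_factor):
    """Data Storage Ratio and Scale Factor to Memory Ratio, as defined above."""
    storage_ratio = TOTAL_STORAGE_TB / scale_factor
    memory_ratio = scale_factor / TOTAL_MEMORY_TB
    return round(storage_ratio, 1), round(memory_ratio, 1)

for sf in (0.1, 0.3, 1):
    print(sf, system_ratios(sf))  # matches Table 7: (130, 0.8), (43.3, 2.4), (13, 8)
```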
5.2. Performance
This section evaluates the results of the multiple experiments. The presented results were obtained by executing the TPCx-HS kit provided on the official TPC website [1]. However, the reported times and metrics are experimental, not audited by any authorized organization, and therefore not directly comparable with other officially published full disclosure reports.
Figure 3 illustrates the times of the two cluster setups (shared and dedicated 1GBit networks) for the three datasets 100GB, 300GB and 1TB. It can be clearly observed that in all cases the dedicated 1GBit setup performs around 5 times better than the shared setup. Similarly, Figure 4 shows that the dedicated setup achieves a 5 to 6 times higher HSph@SF metric than the shared setup.
Figure 3: TPCx-HS Times
Figure 4: TPCx-HS Metric
Table 8 summarizes the experimental results, introducing additional statistical comparisons. The Data Δ column represents the difference in percent of the Data Size to the data baseline, in our case 100GB. In the first case, scale factor 0.3 increases the processed data by 200%, whereas in the second case the data is increased by 900%. The Time (Sec) column shows the average time in seconds of two complete TPCx-HS runs for each of the six test configurations. The Time Stdv (%) column shows the standard deviation of Time (Sec) in percent between the two runs. Finally, the Time Δ (%) column represents the difference in percent of Time (Sec) to the time baseline, in our case scale factor 0.1. Here we observe that for the shared setup the execution takes around five times longer than for the dedicated setup.
Scale Factor | Data Size | Data Δ (%) | Network   | Metric (HSph@SF) | Time (Sec) | Time Stdv (%) | Time Δ (%)
0.1          | 100 GB    | baseline   | shared    | 0.03             | 10721.75   | 0.59          | baseline
0.3          | 300 GB    | +200       | shared    | 0.03             | 32142.75   | 0.50          | +199.79
1            | 1 TB      | +900       | shared    | 0.03             | 105483.00  | 0.41          | +883.82
0.1          | 100 GB    | baseline   | dedicated | 0.16             | 2234.75    | 0.72          | baseline
0.3          | 300 GB    | +200       | dedicated | 0.18             | 6001.00    | 0.89          | +168.53
1            | 1 TB      | +900       | dedicated | 0.17             | 21047.75   | 1.53          | +841.84
Table 8: TPCx-HS Results
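The Data Δ and Time Δ columns are plain percentage differences against the SF 0.1 baseline; for example, recomputing the shared-network time deltas (the helper name is ours):

```python
def pct_delta(value, baseline):
    """Difference in percent relative to the SF 0.1 baseline, as in Table 8."""
    return round((value - baseline) / baseline * 100, 2)

# Shared-network average times from Table 8 (seconds), baseline 10721.75 s
print(pct_delta(32142.75, 10721.75))   # 199.79 for SF 0.3
print(pct_delta(105483.00, 10721.75))  # 883.82 for SF 1
```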
Figure 5 illustrates the scaling behavior between the two network setups based on different data
sizes. The results show that the dedicated setup has a better scaling behavior than the shared
setup. Additionally, the increase of data size improves the scaling behavior of both setups.
[Figure 3 data, Time in hours (lower is better): Shared 2.98 / 8.93 / 29.30 and Dedicated 0.62 / 1.67 / 5.85 for 100 GB / 300 GB / 1 TB]
[Figure 4 data, HSph@SF metric (higher is better): Shared 0.03 / 0.03 / 0.03 and Dedicated 0.16 / 0.18 / 0.17 for 100 GB / 300 GB / 1 TB]
Figure 5: TPCx-HS Scaling Behavior (0 on the X and Y-axis is equal to the baseline of SF 0.1/100GB)
The TPCx-HS benchmark consists of five phases, which were explained in Section 4. Table 9
depicts the average times of the three major phases, together with their standard deviations in
percent. Clearly the data sorting phase (HSSort) takes the most processing time, followed by the
data generation phase (HSGen) and finally the data validation phase (HSValidate).
Scale Factor / Data Size | Network   | HSGen (Sec) | HSGen Stdv (%) | HSSort (Sec) | HSSort Stdv (%) | HSValidate (Sec) | HSValidate Stdv (%)
0.1 / 100 GB             | shared    | 3031.62     | 2.05           | 7391.83      | 1.18            | 289.02           | 0.90
0.3 / 300 GB             | shared    | 9149.36     | 0.99           | 22501.87     | 0.66            | 482.33           | 2.80
1 / 1 TB                 | shared    | 29821.68    | 1.35           | 74432.82     | 1.21            | 1219.52          | 1.39
0.1 / 100 GB             | dedicated | 543.06      | 0.47           | 1394.60      | 0.97            | 288.19           | 0.01
0.3 / 300 GB             | dedicated | 1348.98     | 0.50           | 4112.33      | 0.57            | 530.97           | 7.27
1 / 1 TB                 | dedicated | 4273.30     | 0.16           | 15271.96     | 3.72            | 1493.41          | 6.35
Table 9: TPCx-HS Phase Times
[Figure 5 data, Time Δ (%) over Data Δ (%) with a linear reference: Shared 300GB 199.79, Shared 1TB 883.82, Dedicated 300GB 168.53, Dedicated 1TB 841.84]
5.3. Price-Performance
The price-performance metric of the TPCx-HS benchmark, reviewed in Section 4, divides the total cost of ownership (P) of the system under test by the TPCx-HS metric (HSph@SF) for a particular scale factor. The total cost of our cluster for the dedicated 1Gbit network setup is summarized in Table 10. It includes only the hardware prices, as the software used in the setup is available for free. No support, administrative or electricity costs are included. The total price of the shared network setup is 6730 €, which includes all the components listed in Table 10 except the network switch.
Hardware Components Price incl. Taxes (€)
1 x Master Node (Dell PowerEdge T420) 1803
1 x Switch (Netgear GS108 GE, 8-Port, 1 GBit) 30
3 x Data Nodes (Dell PowerEdge T420) 4228
9 x Disks (Western Digital Blue Desktop, 1TB) 450
Additional Costs (Cables, Mouse & Monitor) 249
Total Price (€) 6760
Table 10: Total System Cost with 1GBit Switch
Table 11 summarizes the price-performance metric for the tested scale factors in the two network setups. A lower price-performance metric indicates a more cost-efficient system. In our experiments this is the dedicated setup, which has an around 6 times lower price-performance than the shared setup.
Scale Factor | Data Size | Network   | Metric (HSph@SF) | Price-Performance (€/HSph@SF)
0.1          | 100 GB    | shared    | 0.03             | 224333.33
0.3          | 300 GB    | shared    | 0.03             | 224333.33
1            | 1 TB      | shared    | 0.03             | 224333.33
0.1          | 100 GB    | dedicated | 0.16             | 42250.00
0.3          | 300 GB    | dedicated | 0.18             | 37555.56
1            | 1 TB      | dedicated | 0.17             | 39764.71
Table 11: Price-Performance Metrics
Using the price-performance formula, we can also find the maximum total system price (maxP), and thereby the highest switch price, for which the dedicated setup still achieves a better price-performance than the shared setup.
€/HSph@SF(shared) = P(shared) / HSph@SF = 6730 / 0.03 = 224333.33
224333.33 > maxP(dedicated) / 0.18
224333.33 x 0.18 > maxP(dedicated)
40380.00 > maxP(dedicated)
The total system cost of the dedicated 1Gbit setup can therefore be as high as 40 380 € and the system will still achieve a better price-performance (€/HSph@SF) than the shared setup. This represents a margin of about 33 620 €, which shows how large the gain in terms of cost is when adding the 1Gbit switch.
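The same derivation in executable form, using the shared-setup price (6730 €, Table 10 minus the switch) and the best dedicated metric (0.18 HSph@SF, Table 8):

```python
# Shared setup baseline: P = 6730 EUR, metric = 0.03 HSph@SF
shared_price_perf = 6730 / 0.03  # 224333.33 EUR/HSph@SF

# For the dedicated setup (best metric 0.18 HSph@SF) to stay ahead on
# price-performance, its total price maxP must satisfy maxP / 0.18 < shared_price_perf
max_p = shared_price_perf * 0.18
print(round(max_p, 2))  # 40380.0
```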
In summary, the dedicated 1Gbit network setup costs around 30 € more than the shared setup
(due to the already existing network infrastructure), but it achieves between 5 and 6 times better
performance.
6. Resource Utilization
The following section graphically presents the cluster resource utilization in terms of CPU, memory, network and the number of map and reduce tasks. The reported statistics were obtained using the Performance Analysis Tool (PAT) [28] while running TPCx-HS with scale factor 0.1 (100GB) for both the shared and the dedicated network setup. The average values obtained in the measurements are reported in Table 14 and Table 15 in the Appendix. The graphs represent a complete benchmark run, consisting of Run1 and Run2 as described in Section 4, for both the Master and the Worker nodes. The goal is to compare and analyze the cluster resource utilization between the two different network setups.
6.1. CPU
Figure 6 shows the CPU utilization in percent of the Master node for the dedicated and shared network setups with respect to the elapsed time (in seconds). System % (in green) represents the CPU utilization that occurred when executing at the kernel (system) level. Respectively, User % (in red) represents the CPU utilization when executing at the application (user) level. Finally, IOwait % (in blue) represents the time in which the CPU was idle while waiting for an outstanding disk I/O request.
Comparing both graphs, one can observe a slightly higher CPU utilization in the case of the dedicated 1Gbit network. However, in both cases the overall CPU utilization (System and User %) is around 2%, leaving the CPU heavily underutilized.
Figure 6: Master Node CPU Utilization
Similar to Figure 6, Figure 7 depicts the CPU utilization for one of the three Worker nodes. Clearly the dedicated setup utilizes the CPU better on both the system and the user level. Likewise, the IOwait times for the dedicated setup are much higher than for the shared one. On average, the overall CPU utilization for the shared network is between 12% and 20%, whereas for the dedicated network it is between 56% and 70%. This difference is especially visible in the data generation and sorting phases, which are highly network intensive.
[Chart panels: "Dedicated 1 Gbit - CPU Utilization %" and "Shared 1Gbit - CPU Utilization %", each plotting System %, User % and IOwait % over time (sec)]
Figure 7: Worker Node CPU Utilization
Figure 8 illustrates the number of context switches per second, which measures the rate at which
threads/processes are switched on the CPU. A higher number of context switches indicates that the
CPU spends more time storing and restoring process states instead of doing real work [29]. In
both graphs, we observe that the number of context switches per second is stable, on average
between 10000 and 11000. This means that in both cases the Master node is equally utilized.
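As a reference for how such a rate is obtained, the sketch below derives context switches per second from two readings of the cumulative `ctxt` counter exposed in `/proc/stat` on Linux; the counter values used here are hypothetical.

```python
# Context switches per second from two cumulative counter readings, the way
# sar/vmstat report it. The readings below are hypothetical, not measured.

def ctxt_rate(ctxt_t0, ctxt_t1, interval_sec):
    """Rate of context switches between two cumulative /proc/stat readings."""
    return (ctxt_t1 - ctxt_t0) / interval_sec

# two hypothetical readings taken 5 seconds apart
rate = ctxt_rate(1_000_000, 1_052_500, 5)  # 10500.0 switches per second
```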
Figure 8: Master Node Context Switches
Similarly, Figure 9 depicts the context switches per second for one of the Worker nodes. In the
dedicated 1Gbit case, we observe that the number of context switches varies greatly across the
different benchmark phases. The average number of context switches per second is around 20788. In
the case of the shared network, the average is around 14233 and the variation between the phases
is much smaller.
Figure 9: Worker Node Context Switches
6.2. Memory
Figure 10 shows the main memory utilization in percent of the Master node for the two network
setups. In the dedicated 1Gbit setup, the average memory used is around 48%, whereas in the
shared setup it is around 91.4%.
Figure 10: Master Node Memory Utilization
The same trend is observed in Figure 11, which depicts the free memory in Kbytes for the Master
node. In the case of the dedicated network, the average free memory is 16.3GB (17095562 Kbytes),
which corresponds to the remaining 52% of unutilized memory. Respectively, for the shared network
the average free memory is around 2.7GB (2825635 Kbytes), corresponding to the remaining 8.6% of
unutilized memory.
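These percentages follow directly from the free-memory samples and the Master node's 32 GB of RAM (Appendix, Table 12). A minimal sketch of the conversion, assuming the OS exposes the full 32 GB (in practice slightly less is available, which is why the result differs marginally from the reported 48%/52% split):

```python
# Convert a free-memory sample (Kbytes) into a utilization percentage,
# assuming the full 32 GB of the Master node is available to the OS.

TOTAL_KBYTES = 32 * 1024 * 1024  # 32 GB expressed in Kbytes

def memory_utilization_pct(free_kbytes):
    return 100.0 * (TOTAL_KBYTES - free_kbytes) / TOTAL_KBYTES

# the dedicated-setup average of 17095562 Kbytes free yields ~49% used,
# close to the ~48% reported above
util = memory_utilization_pct(17095562)
```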
Figure 11: Master Node Free Memory
In the same way, Figure 12 illustrates the main memory utilization in percent for one of the
Worker nodes. For the dedicated 1Gbit case, the average memory used is around 92.3%, whereas for
the shared 1Gbit case it is around 92.9%. This confirms the close resemblance of the two graphs,
indicating that the Worker nodes are heavily utilized in both setups. It would be advantageous to
add more memory to the nodes, as this could further improve performance by enabling more jobs to
be executed in parallel [29].
Figure 12: Worker Node Memory Utilization
Figure 13 shows the free memory and the amount of cached memory in Kbytes for one of the Worker
nodes. For the dedicated 1Gbit setup, the average free memory is around 2.4GB (2528790 Kbytes),
which corresponds to 7.7% of unutilized memory. Similarly, for the shared
1Gbit setup, the average free memory is around 2.2GB (2326411 Kbytes), or around 7% unutilized
memory.
Figure 13: Worker Node Free Memory
6.3. Disk
The following graphs show the number of read and write requests issued to the storage devices per
second. Figure 14 shows that the Master node issues very few read requests in both network
setups. The average number of write requests per second is around 3.1 for the dedicated 1Gbit
setup and around 1.7 for the shared setup.
Figure 14: Master Node I/O Requests
Similarly, Figure 15 depicts the read and write requests per second for one of the Worker nodes.
The average number of read requests per second is around 112 for the dedicated 1Gbit setup and
around 25 for the shared setup. The average number of write requests per second is around 40 for
the dedicated network and around 10 for the shared network. This clearly indicates that the
dedicated setup reads and writes data much more efficiently.
Figure 15: Worker Node I/O Requests
The following graphs illustrate the I/O latencies (in milliseconds), i.e., the average time from
issuing an I/O request until it is serviced by the device, including the time spent in the device
queue. Figure 16 shows the Master node latencies, which are around 0.15 milliseconds on average
for the dedicated setup and around 0.17 milliseconds for the shared setup.
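This is the same definition iostat uses for its await column: the accumulated queue-plus-service time of the completed requests, divided by their count over the interval. A minimal sketch with illustrative numbers:

```python
# Average I/O latency (iostat-style "await"): total time the completed
# requests spent queued and being serviced, divided by their count.
# The numbers are illustrative, chosen to match the ~0.15 ms scale above.

def avg_io_latency_ms(total_wait_ms, completed_requests):
    if completed_requests == 0:
        return 0.0
    return total_wait_ms / completed_requests

latency = avg_io_latency_ms(total_wait_ms=30.0, completed_requests=200)  # 0.15 ms
```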
Figure 16: Master Node I/O Latencies
Similarly, Figure 17 depicts the Worker node latencies. For the dedicated setup, the average I/O
latency is around 137 milliseconds, compared to around 70 milliseconds for the shared setup. We
also observe that in the shared network setup there are multiple I/O latencies of around 200 to
300 milliseconds, whereas such long latencies are far less frequent in the dedicated setup.
Figure 17: Worker Node I/O Latencies
The following figures depict the number of Kbytes read and written to the storage devices per
second. Figure 18 illustrates this for the Master node. In both graphs there are almost no reads,
only writes. On average, around 31 Kbytes per second are written in the dedicated setup and
around 25 Kbytes per second in the shared setup.
Figure 18: Master Node Disk Bandwidth
Similarly, Figure 19 shows the disk bandwidth for one of the Worker nodes. The average read
throughput is around 6.4MB (6532 Kbytes) per second for the dedicated setup and around 1.4MB
(1438 Kbytes) per second for the shared setup. Respectively, the average write throughput is
around 18.6MB (19010 Kbytes) per second for the dedicated setup and 4MB (4087 Kbytes) per second
for the shared setup. In summary, the dedicated network achieves much better throughput levels,
indicating more efficient data processing and management.
Figure 19: Worker Node Disk Bandwidth
6.4. Network
The following figures depict the number of received and transmitted Kbytes per second. For the
Master node, the average number of received Kbytes per second is around 53 for the dedicated
setup and around 22 for the shared setup. Respectively, the average number of transmitted Kbytes
per second is around 42 for the dedicated configuration and around 19 for the shared
configuration.
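These per-second figures come from differencing cumulative interface byte counters (e.g. from `/proc/net/dev`) over the sampling interval. A minimal sketch, with hypothetical counter values chosen to reproduce the ~53 Kbytes/s received on the Master node:

```python
# Received/transmitted Kbytes per second from two readings of a cumulative
# interface byte counter. Counter values are hypothetical.

def kbytes_per_sec(bytes_t0, bytes_t1, interval_sec):
    return (bytes_t1 - bytes_t0) / 1024.0 / interval_sec

rx_rate = kbytes_per_sec(bytes_t0=0, bytes_t1=271_360, interval_sec=5)  # 53.0
```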
Figure 20: Master Node Network I/O
Analogously, Figure 21 shows the network transfer for one of the Worker nodes. In the dedicated
setup, on average 32.8MB (33637 Kbytes) per second are received and 30.6MB (31364 Kbytes) per
second are transmitted. In the shared setup, on average 7.1MB (7297 Kbytes) per second are
received and 6.4MB (6548 Kbytes) per second are transmitted. Clearly, the dedicated 1Gbit network
achieves almost 5 times better utilization of the network, resulting in faster overall
performance.
Figure 21: Worker Node Network I/O
6.5. Mappers and Reducers
Figure 22 shows the number of active map and reduce jobs in the different benchmark phases for
both network setups. The two graphs show very similar behavior, except that the shared setup is
around 5 times slower than the dedicated setup.
Figure 22: Worker Node JVM Count
7. Lessons Learned
The report presents a performance evaluation of the Cloudera Hadoop Distribution using the
TPCx-HS benchmark, the first officially standardized Big Data benchmark. In particular, our
experiments compare two cluster setups: the first using a shared 1Gbit network and the second
using a dedicated 1Gbit network. Our results show that the cluster with the dedicated network
setup is around 5 times faster than the cluster with the shared network setup in terms of:
- Execution time
- HSph@SF metric
- Average read and write throughput per second
- Network utilization
On average, the overall CPU utilization for the shared case is between 12% and 20%, whereas for
the dedicated network it is between 56% and 70%. The average main memory usage for the dedicated
setup is around 92.3%, whereas for the shared setup it is around 92.9%. Furthermore, based on the
price-performance formula, it can be concluded that the 5 times performance gain of the dedicated
1Gbit setup corresponds to around 33 620€ when compared to the shared 1Gbit setup. In the future,
we plan to stress test the dedicated setup with other widely used Big Data benchmarks and
workloads.
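The monetary figure follows from a simple scaling argument: if the shared setup is 5 times slower, matching the dedicated setup's throughput would require roughly 5 times the hardware, i.e. four extra clusters' worth of cost. A sketch of that arithmetic, where the 8 405€ system price is a hypothetical value implied by the numbers above, not a quoted cost:

```python
# Cost gap implied by a linear performance-scaling assumption: the slower
# setup needs (speedup - 1) additional systems to match the faster one.
# The system price is a hypothetical placeholder, not a quoted figure.

def equivalent_cost_gap(system_price_eur, speedup):
    return (speedup - 1) * system_price_eur

gap = equivalent_cost_gap(system_price_eur=8405.0, speedup=5)  # 33620.0 EUR
```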
References
[1] TPC, “TPCx-HS.” [Online]. Available: http://www.tpc.org/tpcx-hs/.
[2] R. Nambiar, M. Poess, A. Dey, P. Cao, T. Magdon-Ismail, D. Q. Ren, and A. Bond,
“Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems,” in
Performance Characterization and Benchmarking. Traditional to Big Data, R. Nambiar and
M. Poess, Eds. Springer International Publishing, 2014, pp. 1–12.
[3] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big
data: The next frontier for innovation, competition, and productivity,” McKinsey Glob. Inst.,
pp. 1–137, 2011.
[4] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan,
and C. Shahabi, “Big data and its technical challenges,” Commun. ACM, vol. 57, no. 7, pp.
86–94, 2014.
[5] H. Hu, Y. Wen, T. Chua, and X. Li, “Towards Scalable Systems for Big Data Analytics: A
Technology Tutorial,” 2014.
[6] T. Ivanov, N. Korfiatis, and R. V. Zicari, “On the inequality of the 3V’s of Big Data
Architectural Paradigms: A case for heterogeneity,” ArXiv Prepr. ArXiv13110805, 2013.
[7] R. Cattell, “Scalable SQL and NoSQL data stores,” ACM SIGMOD Rec., vol. 39, no. 4, pp.
12–27, 2011.
[8] “Apache Hadoop Project.” [Online]. Available: http://hadoop.apache.org/. [Accessed: 22-
May-2014].
[9] D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Proj.
Website, vol. 11, p. 21, 2007.
[10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,”
in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010,
pp. 1–10.
[11] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,”
Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[12] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R.
Murthy, “Hive - a petabyte scale data warehouse using Hadoop,” in 2010 IEEE 26th
International Conference on Data Engineering (ICDE), 2010, pp. 996–1005.
[13] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B.
Reed, S. Srinivasan, and U. Srivastava, “Building a high-level dataflow system on top of
Map-Reduce: the Pig experience,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1414–1425, 2009.
[14] “Apache Mahout: Scalable machine learning and data mining.” [Online]. Available:
http://mahout.apache.org/. [Accessed: 30-Oct-2013].
[15] L. George, HBase: the definitive guide. O’Reilly Media, Inc., 2011.
[16] K. Ting and J. J. Cecho, Apache Sqoop Cookbook. O’Reilly Media, 2013.
[17] Apache YARN, “YARN,” Apache Hadoop NextGen MapReduce, 2014. [Online].
Available: http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-
site/YARN.html. [Accessed: 06-Feb-2014].
[18] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves,
J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E.
Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in
Proceedings of the 4th Annual Symposium on Cloud Computing, New York, NY, USA, 2013,
pp. 5:1–5:16.
[19] Cloudera, “CDH,” 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html. [Accessed:
28-Nov-2014].
[20] Cloudera, “CDH Datasheet,” 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cdh-datasheet.html.
[Accessed: 28-Nov-2014].
[21] Cloudera, “Cloudera Manager Datasheet,” 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/resources/library/datasheet/cloudera-manager-
4-datasheet.html.
[22] Cloudera, “Configuration Parameters: What can you just ignore?” [Online]. Available:
http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/.
[Accessed: 11-Nov-2014].
[23] “iPerf - The TCP, UDP and SCTP network bandwidth measurement tool.” [Online].
Available: https://iperf.fr/. [Accessed: 04-Oct-2015].
[24] “Measuring Network Performance: Test Network Throughput, Delay-Latency, Jitter,
Transfer Speeds, Packet loss & Reliability. Packet Generation Using Iperf / Jperf,”
Firewall.cx. [Online]. Available: http://www.firewall.cx/networking-topics/general-
networking/970-network-performance-testing.html. [Accessed: 04-Oct-2015].
[25] J. Poelman, “IBM InfoSphere BigInsights Best Practices : Validating performance of a
new cluster,” 10-Jul-2014. [Online]. Available:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4
f21_43ee_b236_c42a1c875961/page/Validating%20performance%20of%20a%20new%20cl
uster. [Accessed: 04-Oct-2015].
[26] Apache Hadoop, “TeraSort.” [Online]. Available:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-
summary.html. [Accessed: 14-Oct-2013].
[27] Transaction Processing Performance Council, “TPC-Energy - Homepage.” [Online].
Available: http://www.tpc.org/tpc_energy/default.asp. [Accessed: 15-Jul-2015].
[28] Intel, “PAT at GitHub,” 2014. [Online]. Available: https://github.com/intel-
hadoop/PAT/tree/master/PAT. [Accessed: 09-Feb-2015].
[29] K. Tannir, Optimizing Hadoop for MapReduce. Packt Publishing Ltd, 2014.
Appendix
System Information Description
Manufacturer: Dell Inc.
Product Name: PowerEdge T420
BIOS: 1.5.1 Release Date: 03/08/2013
Memory
Total Memory: 32 GB
DIMMs: 10
Configured Clock Speed: 1333 MHz
Part Number: M393B5273CH0-YH9
Size: 4096 MB
CPU
Model Name: Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
Architecture: x86_64
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
CPU MHz: 1200.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
NIC
Settings for em1: Speed: 1000Mb/s
Ethernet controller: Broadcom Corporation NetXtreme
BCM5720 Gigabit Ethernet PCIe
Storage
Storage Controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03), 08:00.0 RAID bus controller
Drive / Name Usable Space Model
Disk 1/ sda1 931.5 GB Western Digital WD1003FBYX RE4 - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache
Table 12: Master Node Specifications
System Information Description
Manufacturer: Dell Inc.
Product Name: PowerEdge T420
BIOS: 2.1.2 Release Date: 01/20/2014
Memory
Total Memory: 32 GB
DIMMs: 4
Configured Clock Speed: 1600 MHz
Part Number: M393B2G70DB0-YK0
Size: 16384 MB
CPU
Model Name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
Architecture: x86_64
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
CPU MHz: 2200.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
NIC
Settings for em1: Speed: 1000Mb/s
Ethernet controller: Broadcom Corporation NetXtreme
BCM5720 Gigabit Ethernet PCIe
Storage
Storage Controller: Intel Corporation C600/X79 series chipset SATA RAID Controller (rev 05), 00:1f.2 RAID bus controller
Drive / Name Usable Space Model
Disk 1/ sda1 931.5 GB Dell - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache
Disk 2/ sdb1 931.5 GB WD Blue Desktop WD10EZEX - 1TB, SATA3, 3.5 in, 7200RPM, 64MB Cache
Disk 3/ sdc1 931.5 GB
Disk 4/ sdd1 931.5 GB
Table 13: Data Node Specifications
Master Node
Network Type: Dedicated 1Gbit Shared 1Gbit
Scale Factor: 100GB 100GB
Avg. CPU Utilization % - User % 1.09 0.75
Avg. CPU Utilization % - System % 0.83 0.78
Avg. CPU Utilization % - IOwait % 0.02 0.01
Memory Utilization % 48.04 91.41
Avg. Kbytes Transmitted per Second 42.28 18.96
Avg. Kbytes Received per Second 52.66 21.78
Avg. Context Switches per Second 10756.60 10360.07
Avg. Kbytes Read per Second 0.14 0.00
Avg. Kbytes Written per Second 31.39 25.23
Avg. Read Requests per Second 0.01 0.00
Avg. Write Requests per Second 3.12 1.74
Avg. I/O Latencies in Milliseconds 0.15 0.17
Table 14: Master Node - Resource Utilization
Worker Node
Network Type: Dedicated 1Gbit Shared 1Gbit
Scale Factor: 100GB 100GB
Avg. CPU Utilization % - User % 56.04 12.44
Avg. CPU Utilization % - System % 9.52 3.71
Avg. CPU Utilization % - IOwait % 3.61 1.28
Memory Utilization % 92.31 92.93
Avg. Kbytes Transmitted per Second 31363.57 6548.16
Avg. Kbytes Received per Second 33636.93 7297.27
Avg. Context Switches per Second 20788.16 14233.08
Avg. Kbytes Read per Second 6532.24 1438.23
Avg. Kbytes Written per Second 19010.01 4087.33
Avg. Read Requests per Second 111.75 24.70
Avg. Write Requests per Second 39.50 9.71
Avg. I/O Latencies in Milliseconds 136.87 69.83
Table 15: Worker Node - Resource Utilization
Acknowledgements
This research is supported by the Frankfurt Big Data Lab at the Chair for Databases and
Information Systems (DBIS) at the Goethe University Frankfurt. The authors would like to thank
Naveed Mushtaq, Karsten Tolle, Marten Rosselli and Roberto V. Zicari for their helpful
comments and support. We would like to thank Jeffrey Buell of VMware for his valuable
feedback and corrections.