High Performance Hardware for Data Analysis


NYC 2014

Mike Pittaro [email protected]


@pmikeyp

www.slideshare.net/lhrc_mikeyp/

github.com/lhrc-mikeyp



About This Talk

• We can’t cover everything about hardware in a 40 minute session.

• We can go deep enough to help you:

– Understand tradeoffs and balanced architectures

– Ask the right questions about choices

– Learn from what others are doing

• My approach today:

1. Why look at high performance hardware?

2. Look at a production cluster design

3. Look at the choices and tradeoffs behind the scenes

About me:

• Principal Architect for Big Data Solutions at Dell

– Part of the Enterprise Solutions Group in Dell engineering

– I develop clusters and reference architectures for our customers

– Background in Supercomputing, Data Warehousing and Data Integration

– Python is my preferred programming language


Why consider High Performance Hardware?

• Choice of hardware can have large impacts

– On performance

– On budget

• Understanding the hardware helps with the software

– Scalable and parallel systems deal with both

• Cloud hosting may not be an option

– You can’t or won’t delegate critical infrastructure to third parties.

– Operating costs are too high

– You need every bit of performance you can get.

• Data is heavy

– Local clusters are persistent

– Large data transfers may not be a viable option.

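To put "data is heavy" into numbers, here is a back-of-the-envelope sketch of how long a bulk transfer takes. It assumes decimal terabytes and a dedicated link; the `transfer_days` helper and its `efficiency` derating factor (for protocol overhead and competing traffic) are hypothetical, not measured values.

```python
def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_tb (decimal TB) over a link_gbps link.

    efficiency (0-1] is a hypothetical derating for protocol overhead
    and competing traffic; 1.0 means the full line rate is usable.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8                      # TB -> bytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400                        # seconds per day


for gbps in (1, 10):
    print(f"100 TB over {gbps} Gbps: {transfer_days(100, gbps):.1f} days at line rate, "
          f"{transfer_days(100, gbps, efficiency=0.7):.1f} days at 70% efficiency")
```

Even at full line rate, 100 TB needs more than nine days on a 1 Gbps link, which is why a persistent local cluster can beat shipping the data somewhere else.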
The Problem with Big Data Hardware

• Customers want results

– Performance

– Predictability

– Reliability

– Cost Management

– Proven Solution

– Tested Configuration

• There are many options

– Servers

– Processors, GPUs

– Drives

– Networking

– Jargon and Acronyms

– Lack of Solid Information

A Reference Architecture Fills The Gap

• Tested Server Configurations

• Tested Network Configurations

• Base Software Configuration

– Big Data Software

– OS Infrastructure

– Operational Infrastructure

• Predefined configuration

– Recommended starting point

– Customization is possible

The secret to a good architecture is balance

[Diagram: balancing Price, Performance, and Fault Zones against the Application Workload and Software]


Cluster Architecture

• The Dell In-Memory Appliance for Cloudera


Dell In-Memory Appliance – Summary Specs

Cluster | Starter | Mid-Size | Small Enterprise | Maximum
Data Nodes | 4 | 12 | 20 | 44
Total Memory | 1536 GB | 4608 GB | 7680 GB | 16896 GB
Total Storage | 176 TB | 528 TB | 880 TB | 2112 TB
Processing Cores | 80 | 280 | 400 | 880
Racks (42U) | 1 | 2 | 2 | 4

Data Node Characteristic Configuration

Server | Dell R720xd (2 rack units)
Processor | Two Intel Xeon E5-2670 v2, 2.5 GHz, 25 MB cache, 10 cores
Memory | 384 GB
Memory Speed | 1866 MT/s DRAM
Disks | 12 x 4 TB SATA, 3.0 Gb/s (48 TB)
Networking | Dual 10GbE interfaces, Active/Active bonding
Management Network | Two x 1GbE interfaces

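Cluster totals scale linearly with the number of data nodes. A minimal sketch, assuming every data node matches the characteristic configuration (384 GB, 2 x 10 cores, 12 x 4 TB) and an HDFS replication factor of 3 (the Hadoop default; your cluster may be configured differently). The `cluster_totals` helper is hypothetical; it reports raw capacity computed from the per-node disks, so it will not necessarily match every published total, and its usable figure ignores temporary and non-HDFS space.

```python
# Per-node figures from the characteristic data node configuration.
NODE_MEMORY_GB = 384
NODE_CORES = 2 * 10             # two 10-core Xeon E5-2670 v2 processors
NODE_RAW_STORAGE_TB = 12 * 4    # 12 x 4 TB drives

# Assumption: HDFS replication factor of 3 (the Hadoop default).
REPLICATION = 3


def cluster_totals(data_nodes: int, replication: int = REPLICATION) -> dict:
    """Scale the per-node configuration out to a cluster of data nodes."""
    raw_tb = data_nodes * NODE_RAW_STORAGE_TB
    return {
        "memory_gb": data_nodes * NODE_MEMORY_GB,
        "cores": data_nodes * NODE_CORES,
        "raw_storage_tb": raw_tb,
        # Every HDFS block is stored `replication` times, so usable space
        # is at most raw / replication (before temp and overhead space).
        "usable_storage_tb": raw_tb / replication,
    }


sizes = {"Starter": 4, "Mid-Size": 12, "Small Enterprise": 20, "Maximum": 44}
for name, nodes in sizes.items():
    t = cluster_totals(nodes)
    print(f"{name:<16} {t['memory_gb']:>6} GB {t['cores']:>4} cores "
          f"{t['raw_storage_tb']:>5} TB raw ~{t['usable_storage_tb']:.0f} TB usable")
```

The replication divisor is the number that surprises people most: with three copies of every block, only about a third of the raw disk holds unique data.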

Server Examples

• M1000e Blade Chassis (10U)

• 4 Socket R920 (4U)

• 2 Socket R730xd (2U)


Server Choices

• 4 Socket Servers (e.g. Dell R920)

– Optimized for enterprise applications

› Large RDBMS servers, SAP, SAP HANA, Microsoft Exchange

– Very large memory available (6 TB)

– Often use direct or network attached storage

• ‘Blade’ Servers (e.g. Dell M620, M1000e Chassis)

– Pluggable Processor and Storage modules

– Backplane and chassis have a lot of shared interconnect logic

– Flexibility for enterprise applications

› Virtualization is popular

• 2 Socket Servers (e.g. Dell R620, R630, R720, R730)

– Many options available

– 1U and 2U chassis footprints

– Developed for Web Hosting and Large Scale-Out Clusters

– Internal storage (in chassis): 12 x 3.5” drives or 24 x 2.5” drives


Intel Dual Socket Architecture (Grantley)

CPU: Haswell, up to 18 cores; TDP up to 145 W (server), 160 W (workstation)

Socket: Socket-R3

Scalability: 2S capability

Memory: 4 x DDR4 channels; 1333, 1600, 1866 MT/s (2 DIMMs per channel), 2133 MT/s (1 DIMM per channel); RDIMM, LRDIMM

QPI: 2 x QPI 1.1 channels; 6.4, 8.0, 9.6 GT/s

PCIe: 40 x PCIe 3.0 lanes (2.5, 5, 8 GT/s); PCIe extensions: Dual Cast, Atomics

[Diagram: two Intel® Xeon® processor E5-2600 v3 sockets connected by 2 QPI channels, each with DDR4 memory channels and PCIe 3.0 (40 lanes); Intel® C610 series chipset (WBG); LAN up to 4x10GbE]

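One way to sanity check the I/O side of this platform is to compare the PCIe budget with what a data node actually pushes through it. The sketch below assumes PCIe 3.0 at 8 GT/s with 128b/130b encoding (ignoring packet and protocol overhead), the 40 lanes per processor shown above, and a hypothetical 180 MB/s sequential rate per spinning drive; the drive figure and the helper names are placeholders, not measurements or Dell specifications.

```python
# PCIe 3.0 signals at 8 GT/s per lane and uses 128b/130b encoding.
PCIE3_GT_S = 8e9
LANES_PER_SOCKET = 40

# Hypothetical sustained sequential rate for one spinning drive.
HDD_MB_S = 180


def pcie3_lane_gb_s() -> float:
    """Per-lane, per-direction payload bandwidth in GB/s (before protocol overhead)."""
    return PCIE3_GT_S * 128 / 130 / 8 / 1e9


def node_io_demand_gb_s(drives: int, nic_ports: int, nic_gbps: float) -> float:
    """Rough peak I/O demand: all drives streaming plus all NIC ports at line rate."""
    disk = drives * HDD_MB_S / 1000          # MB/s -> GB/s
    network = nic_ports * nic_gbps / 8       # Gb/s -> GB/s
    return disk + network


budget = LANES_PER_SOCKET * pcie3_lane_gb_s()
demand = node_io_demand_gb_s(drives=12, nic_ports=2, nic_gbps=10)
print(f"PCIe budget per socket: {budget:.1f} GB/s; node I/O demand: {demand:.2f} GB/s")
```

Under these assumptions, twelve spinning drives plus dual 10GbE use only a small fraction of one socket's lanes, so for this kind of node the bus is not the constraint; the drives are.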

Intel Processor Generations

Product | Xeon E5-2600 | Xeon E5-2600 v2 | Xeon E5-2600 v3
Microarchitecture | Sandy Bridge | Ivy Bridge | Haswell
Cores / Threads | 8 / 16 | 12 / 24 | 18 / 36
Last Level Cache | Up to 20 MB | Up to 30 MB | Up to 45 MB
Max Memory Speed | 1600 MT/s DDR3 | 1866 MT/s DDR3 | 2133 MT/s DDR4
QPI | 2 channels; 6.4, 7.2, 8.0 GT/s | 2 channels; 6.4, 7.2, 8.0 GT/s | 2 channels; 6.4, 8.0, 9.6 GT/s
Max DIMMs | 12 | 12 | 12
Max Clock Speed | 3.1 GHz / 3.8 GHz | 3.7 GHz / 3.8 GHz | 3.7 GHz / 3.8 GHz
Process Tech | 32 nm | 22 nm | 22 nm
Year | 2012 | 2013 | 2014

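The max memory speed row translates directly into theoretical peak bandwidth per socket: channels x transfers per second x 8 bytes per transfer (each DDR channel is 64 bits wide). A minimal sketch, assuming four memory channels per socket as on the E5-2600 v3 platform (the earlier E5-2600 generations also have four); `peak_bandwidth_gb_s` is a hypothetical helper.

```python
BYTES_PER_TRANSFER = 8        # each DDR channel is 64 bits wide
CHANNELS_PER_SOCKET = 4       # assumption: four channels per socket


def peak_bandwidth_gb_s(mt_per_s: float, channels: int = CHANNELS_PER_SOCKET) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s (10**9 bytes/s)."""
    return channels * mt_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


max_memory_speed = {
    "E5-2600 (DDR3-1600)": 1600,
    "E5-2600 v2 (DDR3-1866)": 1866,
    "E5-2600 v3 (DDR4-2133)": 2133,
}
for product, speed in max_memory_speed.items():
    print(f"{product}: {peak_bandwidth_gb_s(speed):.1f} GB/s per socket")
```

These are ceilings; sustained bandwidth is lower, and populating more DIMMs per channel can force a lower memory speed (the v3 platform supports 2133 MT/s only at one DIMM per channel).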
Selecting Processors

• Assume 1-1.5 Hadoop tasks per core

– Allows headroom for other processes

– Enable Hyperthreading for Hadoop, Spark

– Hyperthreading for others: it depends

• Hadoop: aim for 1 core / disk spindle

• Impala: can handle more spindles and cores easily

• Spark: I/O depends on back end storage

• A faster processor is better

– Most Hadoop jobs are I/O bound, not processor bound

– Hadoop compression uses processor cycles

– Fewer cores with a faster clock is often a good tradeoff

– The Map / Reduce balance depends on actual workload

– It’s hard to optimize more without knowing the actual workload

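The two rules of thumb above (1-1.5 tasks per core, 1 core per spindle) reduce to simple arithmetic for a single node. A minimal sketch: `node_task_sizing` is a hypothetical helper, and it counts physical cores, which is an assumption since the guideline does not say whether hyperthreads count.

```python
import math


def node_task_sizing(physical_cores: int, spindles: int,
                     tasks_per_core: float = 1.0) -> dict:
    """Apply the 1-1.5 tasks per core and 1 core per spindle guidelines to one node."""
    if not 1.0 <= tasks_per_core <= 1.5:
        raise ValueError("the guideline range is 1-1.5 tasks per core")
    return {
        # Round down so the node keeps headroom for other processes.
        "task_slots": math.floor(physical_cores * tasks_per_core),
        # 1.0 matches the 1 core / spindle target; above 1 means spare cores.
        "cores_per_spindle": physical_cores / spindles,
    }


# Characteristic data node: 2 x 10-core processors and 12 spindles.
for tpc in (1.0, 1.5):
    print(tpc, node_task_sizing(physical_cores=20, spindles=12, tasks_per_core=tpc))
```

For the characteristic node this gives 20 to 30 task slots and about 1.67 cores per spindle, so by the Hadoop rule of thumb it has more cores than spindles.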
Selecting Memory

• DDR3 versus DDR4, RDIMM versus LRDIMM

– DDR3 is cheaper now, DDR4 is faster (15%)

• DIMM Sizes

– 8GB, 16GB, 32GB, 64GB, 128GB

• Sweet Spot

– Varies, around 32GB right now

• Balance the memory banks

– 4 memory channels per processor

– 4 x 16GB better than 2 x 32GB

• Server Class Memory

– It’s all ECC checked

– Dell Server BIOS options to optimize checking method
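"Balance the memory banks" can be checked mechanically: a population is balanced when the DIMM count per socket is a multiple of the four channels. A minimal sketch; the helpers are hypothetical, and the limit of 3 DIMMs per channel (12 slots per socket, matching the Max DIMMs figure in the processor generations table) is an assumption about the server.

```python
CHANNELS_PER_SOCKET = 4
MAX_DIMMS_PER_CHANNEL = 3     # assumption: 12 DIMM slots per socket
DIMM_SIZES_GB = (8, 16, 32, 64, 128)


def is_balanced(dimms_per_socket: int) -> bool:
    """True if the DIMMs spread evenly across every memory channel of a socket."""
    return dimms_per_socket > 0 and dimms_per_socket % CHANNELS_PER_SOCKET == 0


def balanced_populations(gb_per_socket: int) -> list:
    """(dimm_size_gb, count) options that reach gb_per_socket with balanced channels."""
    max_dimms = CHANNELS_PER_SOCKET * MAX_DIMMS_PER_CHANNEL
    options = []
    for size in DIMM_SIZES_GB:
        count, remainder = divmod(gb_per_socket, size)
        if remainder == 0 and count <= max_dimms and is_balanced(count):
            options.append((size, count))
    return options


print(is_balanced(4), is_balanced(2))   # 4 x 16GB vs 2 x 32GB
print(balanced_populations(64))
print(balanced_populations(192))        # half of a 384 GB, two-socket node
```

The 192 GB case shows the tradeoff: the only balanced option uses all twelve slots, which leaves no room to grow without replacing DIMMs.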


Selecting Disks

• 3.5” Drives

– 3TB, 4TB, 6TB per drive

– Pricing sweet spot is 3TB

– Use enterprise grade drives, not consumer drives!

– SATA or SAS. SAS slightly faster.

– 3.0 Gb/sec is fine; 6.0 Gb/sec is a waste with spinning drives

• 2.5” Drives

– 800GB and 1.2 TB

– More expensive than 3.5” drives

– More spindles and more performance

• SATA Solid State Drives

– 6.0 Gb/sec

– 2.5” and 1.8” options

– Expensive for now

– Not as deterministic as spindles
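The point that 6.0 Gb/sec is wasted on spinning drives comes down to arithmetic: SATA uses 8b/10b encoding, so a 3.0 Gb/s link carries about 300 MB/s of data, which is already more than a spinning drive streams. A minimal sketch; the 180 MB/s drive rate and the `sata_payload_mb_s` helper are hypothetical placeholders, not data-sheet values.

```python
def sata_payload_mb_s(line_rate_gbps: float) -> float:
    """Usable SATA link bandwidth in MB/s after 8b/10b encoding."""
    # 8b/10b encoding puts 10 bits on the wire for every 8-bit byte.
    return line_rate_gbps * 1e9 / 10 / 1e6


# Hypothetical sustained sequential rate for an enterprise 3.5" drive;
# check the data sheet of the drive you actually buy.
HDD_SEQUENTIAL_MB_S = 180

for rate in (3.0, 6.0):
    link = sata_payload_mb_s(rate)
    print(f"SATA {rate} Gb/s: {link:.0f} MB/s link, drive fills "
          f"{HDD_SEQUENTIAL_MB_S / link:.0%} of it")
```

An SSD that can stream faster than about 300 MB/s is the case where the 6.0 Gb/s link pays off.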


Spindle / Core / Storage Depth Optimization

• Hadoop scales processing and storage together

– The cluster grows by adding more data nodes

– The ratio of processor to storage is the main adjustment

• Generally, aim for a 1 spindle / 1 core ratio

– I/O is large blocks (64 MB to 256 MB)

– Primarily sequential read/write, very little random I/O

– 8 tasks will be reading or writing 8 individual spindles

• Drive Sizes and Types

– NL SAS or Enterprise SATA 6 Gb/sec

– Drive size is mainly a price decision

• Depth per node

– Up to 48 TB/node is common

– 112 TB/node is possible

– Consider how much data is ‘active’

– Very deep storage impacts recovery performance

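"Very deep storage impacts recovery performance" can be made concrete: when a data node fails, HDFS has to re-create the replicas it held from the copies on other nodes. A rough sketch, assuming the rebuild is spread evenly over the surviving nodes and each sustains a fixed, hypothetical copy rate; real recovery also competes with running jobs and depends on HDFS and network settings. `rereplication_hours` is a hypothetical helper.

```python
def rereplication_hours(node_data_tb: float, surviving_nodes: int,
                        per_node_mb_s: float) -> float:
    """Rough hours to restore replication after losing one data node.

    Assumes the failed node's blocks are re-copied in parallel by the
    surviving nodes, each sustaining per_node_mb_s of rebuild traffic.
    """
    total_mb = node_data_tb * 1e6                 # decimal TB -> MB
    aggregate_mb_s = surviving_nodes * per_node_mb_s
    return total_mb / aggregate_mb_s / 3600


# Hypothetical 20-node cluster, 100 MB/s of rebuild traffic per surviving node.
for depth_tb in (48, 112):
    hours = rereplication_hours(depth_tb, surviving_nodes=19, per_node_mb_s=100)
    print(f"{depth_tb} TB on the failed node: ~{hours:.1f} h to restore replication")
```

Because the time scales linearly with the data on the failed node, deeper nodes mean a proportionally longer window with reduced redundancy.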
PowerEdge C8000 Hadoop Scaling - 16 core Xeon

[Chart: (1) 12 spindle 3 TB versus (3) 6 spindle 3 TB; series Cores (1), Storage (1), IOPS (1), Storage (3), IOPS (3); axis: TB Storage]

Network Architecture – Layer 2 Switching


Network and Switches

• Simple Tree Structure

– Top of Rack (TOR) for each rack / group of nodes

– Racks feed up to a Cluster or Aggregation Switch

– All switching is at Layer 2 (Ethernet)

› No fancy routing or layer 3 (IP) packet inspection

– Most switches are 48 ports in this class

• Switch Characteristics

– Line rate switching at 10Gbps

– Deep buffers to handle bursts

– Virtual Link Trunking (VLT): two switches act as one, with failover

– Uplinks are 40GbE

• High Availability and Performance

– Use two 10GbE links to alternate switches

– Bond at the Linux level into a single device
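With 10GbE host links and 40GbE uplinks, the number worth checking is the oversubscription ratio: how much host-facing bandwidth shares the uplink bandwidth out of the rack. A minimal sketch; the `oversubscription` helper and the rack size and uplink count below are hypothetical, not taken from the reference architecture.

```python
def oversubscription(nodes: int, links_per_node: int, link_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Host-facing bandwidth divided by uplink bandwidth for a rack's switches."""
    downstream = nodes * links_per_node * link_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


# Hypothetical rack: 20 data nodes, each bonded 2 x 10GbE across a VLT pair,
# with 4 x 40GbE uplinks in total to the aggregation switch.
ratio = oversubscription(nodes=20, links_per_node=2, link_gbps=10,
                         uplinks=4, uplink_gbps=40)
print(f"{ratio:.1f}:1 oversubscription")
```

A 1:1 ratio lets every node talk off-rack at full speed; higher ratios are cheaper but slow down cross-rack traffic such as HDFS block replication.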


In the Wild – Dell Customer Hadoop Configurations

Model | Data Node Configuration | Comments
R720xd | Dual socket, 12 cores, 24 x 2.5” spindles | Most popular platform for Hadoop
C8000 | Dual socket, 16 cores, 16 x 3.5” spindles | Popular for deep/dense Hadoop applications
C6100 / C6105 | Dual socket, 8/12 cores, 12 x 3.5” spindles | Two node version. C6100 is hardware EOL
C2100 | Dual socket, 12 cores, 12 x 3.5” spindles | Popular, hardware EOL but often repurposed for Hadoop
R620 | Dual socket, 8 cores, 10 x 2.5” spindles | 1U form factor
C6220 | Dual socket, 8 cores, 6 x 2.5” spindles | Core/spindle ratio is not ideal for Hadoop

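The table's configurations can be checked against the 1 core / 1 spindle target directly. A minimal sketch using the cores and spindles listed above; the C6100 / C6105 row lists 8/12 cores, so both values are shown, and the 0.75 to 1.25 tolerance band in the hypothetical `classify` helper is an assumption, not a Dell guideline.

```python
# (cores, spindles) per data node, from the table above.
CONFIGS = {
    "R720xd": (12, 24),
    "C8000": (16, 16),
    "C6100 / C6105 (8 cores)": (8, 12),
    "C6100 / C6105 (12 cores)": (12, 12),
    "C2100": (12, 12),
    "R620": (8, 10),
    "C6220": (8, 6),
}


def classify(cores: int, spindles: int, low: float = 0.75, high: float = 1.25) -> str:
    """Label a node by its cores-per-spindle ratio against a 1:1 target.

    The low/high band around 1.0 is a hypothetical tolerance.
    """
    ratio = cores / spindles
    if ratio < low:
        label = "spindle-heavy"
    elif ratio > high:
        label = "core-heavy (too few spindles)"
    else:
        label = "close to 1:1"
    return f"{ratio:.2f} cores/spindle, {label}"


for model, (cores, spindles) in CONFIGS.items():
    print(f"{model:<26} {classify(cores, spindles)}")
```

With this tolerance the C6220 is the core-heavy outlier, matching its comment in the table, while the R720xd's 24 spindles make it spindle-heavy, the kind of node that engines able to use more spindles, such as Impala, can take advantage of.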
What I haven’t talked about!

• GPUs

– Possible, not seen too often with Hadoop

• Ingest / Streaming

– Usually a custom configuration for high speed capture/loading (e.g. Kafka, Storm)

• Dell PowerEdge VRTX

– Designed as a ‘mini-blade’ for branch offices

– Could make a killer data science workstation

Thank you!


High Performance Hardware for Data Analysis

• Choosing hardware for big data analysis is difficult because of the many options and variables involved. The problem is more complicated when you need a full cluster for big data analytics.

• This session will cover the basic guidelines and architectural choices involved in choosing analytics hardware for Spark and Hadoop. I will cover processor core and memory ratios, disk subsystems, and network architecture. This is a practical, advice-oriented session that will focus on performance and cost tradeoffs across many different options.
