© 2009 VMware Inc. All rights reserved
Beyond Mission Critical: Virtualizing Big-Data and Hadoop
Michael West
Global Architect
Big Data and Storage Field Engineering, VMware
2
Applications And Storage Are Becoming Increasingly Diverse
Virtual Storage Arrays
vSphere
SAN/NAS · Object/BLOB
Traditional Applications
• Traditional enterprise storage
• HW-based resiliency, QoS
Next-Gen Cloud Apps
• Scale-out, flash, DAS
• Application-specific storage
All-SSD Array
Server-side Flash
3
The complexity enterprise IT and developers face today
An idea for a cool app
Spec a server config
Justify server costs
Procurement process
Wait for HW to arrive
Wait for IT ops to image the server
Install a database
LOB architecture approval
Central IT architecture approval
Justify more servers for scale testing
Wait for more HW
Configure ACLs and LBs
4-6 Months from Idea to Production!
New infrastructures
New Languages and Frameworks
New Devices and Domains
New Data types and requirements
4
Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise
5
Real-time analysis allows instant understanding of market dynamics.
Retailers can have an intimate understanding of their customers' needs and use direct, targeted marketing.
Market Segment Analysis
Personalized Customer Targeting
6
The Emerging Pattern of Big Data Systems: Retail Example
Real-Time Streams
Exa-scale Data Store
Parallel Data Processing
Real-Time Processing
Machine Learning
Data Science
Cloud Infrastructure
Analytics
7
Storage: Plan for Peta-scale Data Storage and Processing
[Chart: PB of data vs. year (2000–2015), log scale, comparing Online Apps with Analytics]
Analytics Rapidly Outgrows Traditional Data Size by 100x
8
Unprecedented Scale
“Data transparency, amplified by Social Networks
generates data at a scale never seen before”
- The Human Face of Big Data
We are creating an exabyte of data every minute in 2013; a yottabyte by 2030.
9
A single GE jet engine produces 10 terabytes of data in one hour, roughly 90 petabytes per year,
enabling early detection of faults, common-mode failures, and product-engineering feedback.
Post Mortem · Proactively Maintained · Connected Product
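As a quick sanity check on the rate above (my arithmetic, not from the deck):

```python
# Back-of-the-envelope check: 10 TB/hour of engine data over a year.
TB_PER_HOUR = 10
HOURS_PER_YEAR = 24 * 365            # 8,760 hours

tb_per_year = TB_PER_HOUR * HOURS_PER_YEAR   # 87,600 TB
pb_per_year = tb_per_year / 1000             # 87.6 PB

print(round(pb_per_year, 1))  # 87.6, consistent with the "90 PB per year" figure
```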
10
Cloud Infrastructure Supports Mixed Big Data Workloads
Cloud Infrastructure
Machine Learning
Hadoop
Real-Time Analytics
Management
Network/Security
Storage/Availability
Compute
11
Cloud Infrastructure Supports Multiple Tenants
Cloud Infrastructure
Management
Network/Security
Storage/Availability
Compute
Web User Analytics
Financial Analysis
Historical Customer Behavior
12
Software-defined Datacenter: Compute
The Core Values of Virtualization Apply to Big Data
1. Agility / rapid deployment
2. Lower capex
3. Isolation for resource control and security
4. Operational efficiency
Management
Network/Security
Storage/Availability
Compute
13
Virtualizing Hadoop
Elasticity & Multi-tenancy
• Shrink and expand the cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for the entire Hadoop stack
• One-click setup
• Battle-tested
Operational Simplicity
• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure
14
Big Data Extensions: Core Components
Serengeti: the core, an open-source tool to simplify virtualized Hadoop deployment and operations
Hadoop Virtualization Extensions (HVE): virtualization changes for core Hadoop, contributed back to Apache Hadoop
Virtual Hadoop Manager (VHM): advanced resource management on vSphere
15
Strong Isolation between Workloads is Key
Hungry Workload 1
Reckless Workload 2
Nosy Workload 3
Cloud Infrastructure
16
Big Data Family of Frameworks
Compute layer:
• Hadoop: batch analysis
• HBase: real-time queries
• NoSQL: Cassandra, Mongo, etc.
• Big SQL: Impala, Pivotal HawQ
• Other: Spark, Shark, Solr, Platfora, etc.
File System/Data Store
Virtualization
Hosts
17
Traditional Hadoop vs. Elastic Hadoop
Traditional Hadoop: converged compute/storage
Elastic Hadoop: elastic compute plus scale-out network storage
18
Software-defined Datacenter: Storage
Requirements of Next Generation Storage
1. 10x lower cost of storage
2. Handle explosive data growth
3. Support a variety of application types
4. Solve the privacy and security issues
Management
Network/Security
Storage/Availability
Compute
19
Software-defined Storage Enables Fundamental Economics
[Chart: cost per GB ($0–$5.50) vs. petabytes deployed (0.5–128) for three storage classes]
• Traditional SAN/NAS
• Distributed object storage: HDFS, MapR, Ceph
• Scale-out NAS: Isilon, NetApp
20
HDFS Model
[Diagram: three ESX hosts, each running a Hadoop HDFS VM (HDFS or MapR) on local disks plus Hadoop compute VMs running TaskTrackers; JobTracker and NameNode VMs; non-Hadoop VMs on SAN/NAS; a VirtualCenter management server with DRS on each host and the Virtual Hadoop Manager (VHM)]
JT: JobTracker · TT: TaskTracker · NN: NameNode · VHM: Virtual Hadoop Manager
21
Virtual Hadoop Manager
[Diagram: the Virtual Hadoop Manager (VHM) sits between Serengeti, vCenter Server, and the Hadoop JobTracker]
• From the JobTracker: state and stats (slots used, pending work); to the JobTracker: commands (decommission, recommission)
• From vCenter: VM stats and configuration; to vCenter: manual/auto power on/off of TaskTracker VMs
• Inputs: Serengeti configuration, cluster configuration, VC state and stats, Hadoop state and stats
• Outputs: VC actions and Hadoop actions, driven by pluggable algorithms
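The deck does not show the VHM's implementation; as a minimal sketch, the elastic-scaling decision it describes (watch slots used and pending work, then decommission or recommission TaskTracker VMs) might look like the following. All names and thresholds here are invented for illustration.

```python
# Hypothetical sketch of the VHM control loop described on this slide.
# Class and function names are invented; the real VHM is part of
# VMware Big Data Extensions (Project Serengeti).
from dataclasses import dataclass

@dataclass
class ClusterStats:
    slots_used: int      # from the JobTracker: occupied task slots
    pending_work: int    # from the JobTracker: queued tasks
    powered_on_vms: int  # from vCenter: active TaskTracker VMs

def decide_action(stats: ClusterStats, min_vms: int, max_vms: int) -> str:
    """Pick an elasticity action: expand, shrink, or hold steady."""
    if stats.pending_work > 0 and stats.powered_on_vms < max_vms:
        return "recommission"   # power on another TaskTracker VM via vCenter
    if stats.pending_work == 0 and stats.slots_used == 0 and stats.powered_on_vms > min_vms:
        return "decommission"   # drain and power off a TaskTracker VM
    return "hold"

# Example: a busy cluster with headroom should expand.
print(decide_action(ClusterStats(slots_used=40, pending_work=12, powered_on_vms=5),
                    min_vms=2, max_vms=10))   # recommission
```

In the real system the decision sits behind pluggable algorithms and can also be driven manually; this sketch only shows the shape of the stats-in, commands-out loop.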
22
Big-Data using Local Disks
Host
Host
Host
Host
Host
Host
Host
Top of Rack Switch
Servers with local disks:
• 16–24-core server
• 12–24 SATA disks, 2–4 TB each
• 10 GbE adapter
• iSCSI/NFS shared storage for vMotion, etc.
High-performance 10 GbE switch per rack
23
Standard Deployment Configuration for Data/Compute Separation
[Diagram: a virtualization host running two Hadoop virtual nodes. Node 1 runs the TaskTracker; node 2 runs the DataNode. OS images are VMDKs on shared SAN/NAS storage; data disks are VMDKs on local disks, each formatted ext4.]
24
Big Data Storage
Scale-out Network Storage
Elastic Compute
Scale-out Network Storage:
• Hadoop protocol
• Snapshots
• POSIX apps
• Full NFS access
• Replication
• Erasure coding
25
Big Data with Scale-out NAS
[Diagram: two racks, each with six hosts and a top-of-rack switch, backed by an Isilon scale-out NAS]
Temp data: local disk or SSD in each host, for transient data
Shared data: Isilon scale-out NAS
26
Storage Configuration for Data/Compute Separation with Isilon
[Diagram: a virtualization host running three Hadoop virtual nodes. Node 1 runs the JobTracker; nodes 2 and 3 run TaskTrackers. OS images are VMDKs on shared SAN/NAS storage; temp space is ext4 on local VMDKs. The NameNodes and DataNodes live on the Isilon array.]
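The deck does not show the Hadoop settings behind this split; as a hedged sketch, the shared data would point at the Isilon HDFS endpoint while transient map-reduce spill space stays on local ext4 disks. The hostname and paths below are invented placeholders, and the helper function is mine, not part of any VMware tooling.

```python
# Hypothetical Hadoop 1.x settings for the data/compute split on this slide.
from xml.etree import ElementTree as ET

def to_hadoop_xml(props: dict) -> str:
    """Render a property dict as a Hadoop-style configuration XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        p = ET.SubElement(root, "property")
        ET.SubElement(p, "name").text = name
        ET.SubElement(p, "value").text = value
    return ET.tostring(root, encoding="unicode")

core_site = {
    # Shared data lives on the Isilon array, which speaks the HDFS protocol.
    "fs.default.name": "hdfs://isilon-smartconnect.example.com:8020",
}
mapred_site = {
    # Transient shuffle/spill data stays on local ext4 disks in each host.
    "mapred.local.dir": "/data1/mapred/local,/data2/mapred/local",
}

print(to_hadoop_xml(core_site))
print(to_hadoop_xml(mapred_site))
```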
27
Software-defined Datacenter: Network and Security
Network & Security Requirements for Big Data
1. New high-bandwidth network designs
2. Leverage software-defined network security
3. Automate secure network provisioning
Management
Network/Security
Storage/Availability
Compute
28
Customer Success: Hadoop as a Service at FedEx
Scale-out Isilon cluster: shared data, NAS + Hadoop
Elastic vSphere cluster: mixed workloads, vSphere, existing rack-mount servers
29
Agile Big Data at FedEx
Security
• Trusted isolation
• Well-known, auditable platform
Agility
• Deploy in minutes
• Optimize for shifts in workload characteristics
Elasticity
• Create true multi-tenancy
• Mixed workloads
30
Breakthrough Use Cases
Web Log Analysis: initial exploration was around detection of mobile devices accessing the website. Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.
ZIP Code Analysis: analysis of data to determine which ZIP codes are the highest source or destination for shipments.
Shipment Analysis: analysis of shipment information to determine patterns that may delay a package.
31
Agility: Automation of Hadoop Cluster Management
Deploy
Resize: elastic scaling
Customize: incorporate best practices
Manage: tune configuration
Run: execute jobs, access HDFS
32
Agility: Ease of Management Due to Consolidation
[Diagram: the same management tasks (HW procurement and sizing, cluster setup and provisioning, monitoring) repeated per dedicated cluster versus handled once on a consolidated platform]
33
Elasticity: Mixed Workloads on a Shared Platform
Production
Test
Experimentation
Dept A: recommendation engine / Dept B: ad targeting
Production
Test
Experimentation
Log files
Social data
Transaction data
Historical customer behavior
34
Customer Success: Identified ROI in 2 Months
Company Profile: Recruiting through Social Media Search• Big Data Shop: Acquire and classify data, then present to customers
Hadoop Virtualization Stages:
• Stage 1: Used AWS EMR to quickly spin clusters up and shut them down to save cost
• Stage 2: Needed to run Hadoop jobs 24/7; EMR was too expensive, with spiky performance depending on the VMs and backing storage assigned
• Stage 3: Cut costs with on-premises virtual Hadoop on BDE
  • 1 hour to read the docs, and a 30-node cluster available in a few minutes
• Stage 4: Mixed physical and virtual:
  • Physical servers with local spinning disks for compute and HDFS
  • VMs with HA for support nodes (ZooKeeper, NameNode, JobTracker, Journal, clients)
• Stage 5: Compute nodes in VMs for CPU-intensive workloads
• Summary: 192 cores, 1 TB RAM, 96 spindles, 288 TB storage
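A quick consistency check of the summary figures (my arithmetic, not from the deck), assuming the storage is spread evenly across the spindles:

```python
# Sanity-check the customer summary: 96 spindles providing 288 TB total.
SPINDLES = 96
TOTAL_TB = 288

tb_per_spindle = TOTAL_TB / SPINDLES
print(tb_per_spindle)  # 3.0, consistent with commodity 3 TB SATA drives
```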
35
Summary
36
Customers Winning from Consolidated Big Data Platforms
“Dedicated hardware makes no sense”
“Software-defined Datacenter enables rapid deployment for multiple tenants and labs”
“Our mixed workloads include Hadoop, database, ETL, and app servers”
“Any performance penalties are minor”
Management
Network/Security
Storage/Availability
Compute
37
Cloud Infrastructure is Ready for Big Data – Are you?
Cloud Infrastructure
38
Q&A