© 2009 VMware Inc. All rights reserved
Beyond Mission Critical: Virtualizing Big-Data and Hadoop
Michael West
Global Architect
Big Data and Storage Field Engineering, VMware
2
Applications And Storage Are Becoming Increasingly Diverse
Virtual Storage Arrays
vSphere
SAN/NAS · Object/BLOB
Traditional Applications
• Traditional enterprise storage
• HW-based resiliency, QoS
Next-Gen Cloud Apps
• Scale-out, flash, DAS
• Application-specific storage
All-SSD Array
Server-side Flash
3
The complexity enterprise IT and developers face today
An idea for a cool app
Spec a server config
Justify server costs
Procurement process
Wait for HW to arrive
Wait for IT ops to image the server
Install a database
LOB architecture approval
Central IT architecture approval
Justify more servers for scale testing
Wait for more HW
Configure ACLs and LBs
4-6 Months from Idea to Production!
New infrastructures
New Languages and Frameworks
New Devices and Domains
New Data types and requirements
4
Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise
5
Real-time analysis allows instant understanding of market dynamics.
Retailers can have an intimate understanding of their customers' needs and use direct, targeted marketing.
Market Segment Analysis
Personalized Customer Targeting
6
The Emerging Pattern of Big Data Systems: Retail Example
Real-Time Streams
Exa-scale Data Store
Parallel Data Processing
Real-Time Processing
Machine Learning
Data Science
Cloud Infrastructure
Analytics
7
Storage: Plan for Peta-scale Data Storage and Processing
[Chart: PB of data vs. year (2000–2015), log scale, comparing Online Apps with Analytics]
Analytics Rapidly Outgrows Traditional Data Size by 100x
8
Unprecedented Scale
“Data transparency, amplified by Social Networks
generates data at a scale never seen before”
- The Human Face of Big Data
We are creating an exabyte of data every minute in 2013; a yottabyte by 2030.
9
A single GE jet engine produces 10 terabytes of data in one hour, roughly 90 petabytes per year,
enabling early detection of faults, common-mode failures, and product-engineering feedback.
Post Mortem · Proactively Maintained · Connected Product
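As a quick sanity check on the rate above (my arithmetic, not from the deck):

```python
# Back-of-the-envelope check: 10 TB/hour of engine data over a year.
TB_PER_HOUR = 10
HOURS_PER_YEAR = 24 * 365            # 8,760 hours

tb_per_year = TB_PER_HOUR * HOURS_PER_YEAR   # 87,600 TB
pb_per_year = tb_per_year / 1000             # 87.6 PB

print(round(pb_per_year, 1))  # 87.6, consistent with the "90 PB per year" figure
```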
10
Cloud Infrastructure Supports Mixed Big Data Workloads
Cloud Infrastructure
Machine Learning
Hadoop
Real-Time Analytics
Management
Network/Security
Storage/Availability
Compute
11
Cloud Infrastructure Supports Multiple Tenants
Cloud Infrastructure
Management
Network/Security
Storage/Availability
Compute
Web User Analytics
Financial Analysis
Historical Customer Behavior
12
Software-defined Datacenter: Compute
The Core Values of Virtualization Apply to Big Data
1. Agility / rapid deployment
2. Lower capex
3. Isolation for resource control and security
4. Operational efficiency
Management
Network/Security
Storage/Availability
Compute
13
Virtualizing Hadoop
Elasticity & Multi-tenancy
• Shrink and expand the cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for the entire Hadoop stack
• One-click setup
• Battle-tested
Operational Simplicity
• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure
14
Big Data Extensions: Core Components
Serengeti: the core, an open-source tool to simplify virtualized Hadoop deployment and operations
Hadoop Virtualization Extensions (HVE): virtualization changes for core Hadoop, contributed back to Apache Hadoop
Virtual Hadoop Manager (VHM): advanced resource management on vSphere
15
Strong Isolation between Workloads is Key
Hungry Workload 1
Reckless Workload 2
Nosy Workload 3
Cloud Infrastructure
16
Big Data Family of Frameworks
Compute layer:
• Hadoop: batch analysis
• HBase: real-time queries
• NoSQL: Cassandra, Mongo, etc.
• Big SQL: Impala, Pivotal HawQ
• Other: Spark, Shark, Solr, Platfora, etc.
File System/Data Store
Virtualization
Hosts
17
Traditional Hadoop vs. Elastic Hadoop
Traditional Hadoop: converged compute/storage
Elastic Hadoop: elastic compute plus scale-out network storage
18
Software-defined Datacenter: Storage
Requirements of Next Generation Storage
1. 10x lower cost of storage
2. Handle explosive data growth
3. Support a variety of application types
4. Solve the privacy and security issues
Management
Network/Security
Storage/Availability
Compute
19
Software-defined Storage Enables Fundamental Economics
[Chart: cost per GB ($0–$5.50) vs. petabytes deployed (0.5–128) for three storage classes]
• Traditional SAN/NAS
• Distributed object storage: HDFS, MapR, Ceph
• Scale-out NAS: Isilon, NetApp
20
HDFS Model
[Diagram: three ESX hosts, each running a Hadoop HDFS VM (HDFS or MapR) on local disks plus Hadoop compute VMs running TaskTrackers; JobTracker and NameNode VMs; non-Hadoop VMs on SAN/NAS; a VirtualCenter management server with DRS on each host and the Virtual Hadoop Manager (VHM)]
JT: JobTracker · TT: TaskTracker · NN: NameNode · VHM: Virtual Hadoop Manager
21
Virtual Hadoop Manager
[Diagram: the Virtual Hadoop Manager (VHM) sits between Serengeti, vCenter Server, and the Hadoop JobTracker]
• From the JobTracker: state and stats (slots used, pending work); to the JobTracker: commands (decommission, recommission)
• From vCenter: VM stats and configuration; to vCenter: manual/auto power on/off of TaskTracker VMs
• Inputs: Serengeti configuration, cluster configuration, VC state and stats, Hadoop state and stats
• Outputs: VC actions and Hadoop actions, driven by pluggable algorithms
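The deck does not show the VHM's implementation; as a minimal sketch, the elastic-scaling decision it describes (watch slots used and pending work, then decommission or recommission TaskTracker VMs) might look like the following. All names and thresholds here are invented for illustration.

```python
# Hypothetical sketch of the VHM control loop described on this slide.
# Class and function names are invented; the real VHM is part of
# VMware Big Data Extensions (Project Serengeti).
from dataclasses import dataclass

@dataclass
class ClusterStats:
    slots_used: int      # from the JobTracker: occupied task slots
    pending_work: int    # from the JobTracker: queued tasks
    powered_on_vms: int  # from vCenter: active TaskTracker VMs

def decide_action(stats: ClusterStats, min_vms: int, max_vms: int) -> str:
    """Pick an elasticity action: expand, shrink, or hold steady."""
    if stats.pending_work > 0 and stats.powered_on_vms < max_vms:
        return "recommission"   # power on another TaskTracker VM via vCenter
    if stats.pending_work == 0 and stats.slots_used == 0 and stats.powered_on_vms > min_vms:
        return "decommission"   # drain and power off a TaskTracker VM
    return "hold"

# Example: a busy cluster with headroom should expand.
print(decide_action(ClusterStats(slots_used=40, pending_work=12, powered_on_vms=5),
                    min_vms=2, max_vms=10))   # recommission
```

In the real system the decision sits behind pluggable algorithms and can also be driven manually; this sketch only shows the shape of the stats-in, commands-out loop.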
22
Big-Data using Local Disks
Host
Host
Host
Host
Host
Host
Host
Top of Rack Switch
Servers with local disks:
• 16–24-core server
• 12–24 SATA disks, 2–4 TB each
• 10 GbE adapter
• iSCSI/NFS shared storage for vMotion, etc.
High-performance 10 GbE switch per rack
23
Standard Deployment Configuration for Data/Compute Separation
[Diagram: a virtualization host running two Hadoop virtual nodes. Node 1 runs the TaskTracker; node 2 runs the DataNode. OS images are VMDKs on shared SAN/NAS storage; data disks are VMDKs on local disks, each formatted ext4.]
24
Big Data Storage
Scale-out Network Storage
Elastic Compute
Scale-out Network Storage:
• Hadoop protocol
• Snapshots
• POSIX apps
• Full NFS access
• Replication
• Erasure coding
25
Big Data with Scale-out NAS
[Diagram: two racks, each with six hosts and a top-of-rack switch, backed by an Isilon scale-out NAS]
Temp data: local disk or SSD in each host, for transient data
Shared data: Isilon scale-out NAS
26
Storage Configuration for Data/Compute Separation with Isilon
[Diagram: a virtualization host running three Hadoop virtual nodes. Node 1 runs the JobTracker; nodes 2 and 3 run TaskTrackers. OS images are VMDKs on shared SAN/NAS storage; temp space is ext4 on local VMDKs. The NameNodes and DataNodes live on the Isilon array.]
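The deck does not show the Hadoop settings behind this split; as a hedged sketch, the shared data would point at the Isilon HDFS endpoint while transient map-reduce spill space stays on local ext4 disks. The hostname and paths below are invented placeholders, and the helper function is mine, not part of any VMware tooling.

```python
# Hypothetical Hadoop 1.x settings for the data/compute split on this slide.
from xml.etree import ElementTree as ET

def to_hadoop_xml(props: dict) -> str:
    """Render a property dict as a Hadoop-style configuration XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        p = ET.SubElement(root, "property")
        ET.SubElement(p, "name").text = name
        ET.SubElement(p, "value").text = value
    return ET.tostring(root, encoding="unicode")

core_site = {
    # Shared data lives on the Isilon array, which speaks the HDFS protocol.
    "fs.default.name": "hdfs://isilon-smartconnect.example.com:8020",
}
mapred_site = {
    # Transient shuffle/spill data stays on local ext4 disks in each host.
    "mapred.local.dir": "/data1/mapred/local,/data2/mapred/local",
}

print(to_hadoop_xml(core_site))
print(to_hadoop_xml(mapred_site))
```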
27
Software-defined Datacenter: Network and Security
Network & Security Requirements for Big Data
1. New high-bandwidth network designs
2. Leverage software-defined network security
3. Automate secure network provisioning
Management
Network/Security
Storage/Availability
Compute
28
Customer Success: Hadoop as a Service at FedEx
Scale-out Isilon cluster: shared data, NAS + Hadoop
Elastic vSphere cluster: mixed workloads, vSphere, existing rack-mount servers
29
Agile Big Data at FedEx
Security
• Trusted isolation
• Well-known, auditable platform
Agility
• Deploy in minutes
• Optimize for shifts in workload characteristics
Elasticity
• Create true multi-tenancy
• Mixed workloads
30
Breakthrough Use Cases
Web Log Analysis: initial exploration was around detection of mobile devices accessing the website. Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.
ZIP Code Analysis: analysis of data to determine which ZIP codes are the highest source or destination for shipments.
Shipment Analysis: analysis of shipment information to determine patterns that may delay a package.
31
Agility: Automation of Hadoop Cluster Management
Deploy
Resize: elastic scaling
Customize: incorporate best practices
Manage: tune configuration
Run: execute jobs, access HDFS
32
Agility: Ease of Management Due to Consolidation
[Diagram: the same management tasks (HW procurement and sizing, cluster setup and provisioning, monitoring) repeated per dedicated cluster versus handled once on a consolidated platform]
33
Elasticity: Mixed Workloads on a Shared Platform
Production
Test
Experimentation
Dept A: recommendation engine / Dept B: ad targeting
Production
Test
Experimentation
Log files
Social data
Transaction data
Historical customer behavior
34
Customer Success: Identified ROI in 2 Months
Company Profile: Recruiting through Social Media Search• Big Data Shop: Acquire and classify data, then present to customers
Hadoop Virtualization Stages:
• Stage 1: Used AWS EMR to quickly spin clusters up and shut them down to save cost
• Stage 2: Needed to run Hadoop jobs 24/7; EMR was too expensive, with spiky performance depending on the VMs and backing storage assigned
• Stage 3: Cut costs with on-premises virtual Hadoop on BDE
  • 1 hour to read the docs, and a 30-node cluster available in a few minutes
• Stage 4: Mixed physical and virtual:
  • Physical servers with local spinning disks for compute and HDFS
  • VMs with HA for support nodes (ZooKeeper, NameNode, JobTracker, Journal, clients)
• Stage 5: Compute nodes in VMs for CPU-intensive workloads
• Summary: 192 cores, 1 TB RAM, 96 spindles, 288 TB storage
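A quick consistency check of the summary figures (my arithmetic, not from the deck), assuming the storage is spread evenly across the spindles:

```python
# Sanity-check the customer summary: 96 spindles providing 288 TB total.
SPINDLES = 96
TOTAL_TB = 288

tb_per_spindle = TOTAL_TB / SPINDLES
print(tb_per_spindle)  # 3.0, consistent with commodity 3 TB SATA drives
```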
35
Summary
36
Customers Winning from Consolidated Big Data Platforms
“Dedicated hardware makes no sense”
“Software-defined Datacenter enables rapid deployment for multiple tenants and labs”
“Our mixed workloads include Hadoop, database, ETL, and app servers”
“Any performance penalties are minor”
Management
Network/Security
Storage/Availability
Compute
37
Cloud Infrastructure is Ready for Big Data – Are you?
Cloud Infrastructure
38
Q&A