1 © Copyright 2011 EMC Corporation. All rights reserved.
Hadoop (Shanghai Developer Meetup – Sept 15, 2011)
余家昌 (Andrew Yu)
EMC Greenplum
The Elephant Chase
Yahoo! Hadoop use cases
• Personalized Yahoo! Homepage
• Yahoo! Mail anti-spam
• Search and Ad pipelines
• Ad inventory prediction
• Data analytics
• etc
Enterprise Use Case: “Big ETL”
Challenge: Transform massive data flows containing data needed for complex analysis
• Examples:
– Web traffic reduction
– Network traffic & performance analysis
– Location analytics for people and goods
– Smart electric power grid
– Genome analysis
– Clinical outcome research & analysis
• Data Sources:
– Web server & app server logs
– CDRs / xDRs
– Router & switching subsystem logs
– Sensor networks
Solution: Hadoop/MapReduce as an ETL fabric to load an analytic database
• Components:
– Hadoop: massively parallel ingest, storage, and analysis
– MapReduce: runs multiple cascaded custom analysis/extraction jobs on captured data
– Connectors move structured data to the analytic DB
• Hadoop’s Roles:
– Capture TBs/day of machine-generated data
– Quality: run data-quality tasks in MapReduce
– Execute MapReduce flows
– Extract/combine data and metadata
– Move processed data to the analytic DB
• Limitations & Cautions:
– Requires software development; more moving parts (e.g. Cascading workflows); maintainability
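The ETL roles above boil down to a map step (parse and quality-filter raw records) and a reduce step (aggregate before loading). A minimal sketch in Python, purely illustrative (real pipelines here would be Java MapReduce or Hadoop Streaming jobs; the log format and field position are assumptions):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(log_lines):
    # Emit (url, 1) per well-formed line; malformed records are dropped,
    # a simple stand-in for the "data quality" MapReduce tasks above.
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 6:
            yield parts[5], 1  # token 6 = request path in this assumed format

def reduce_phase(pairs):
    # Group by key and sum, as a Hadoop reducer does after the shuffle/sort.
    for url, grp in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield url, sum(c for _, c in grp)

logs = [
    '1.2.3.4 - - [15/Sep/2011] "GET /index.html HTTP/1.1" 200 512',
    '5.6.7.8 - - [15/Sep/2011] "GET /about.html HTTP/1.1" 200 128',
    '1.2.3.4 - - [15/Sep/2011] "GET /index.html HTTP/1.1" 304 0',
]
hits = dict(reduce_phase(map_phase(logs)))  # per-URL counts, ready to load
```

The aggregated `hits` would then move to the analytic DB via a connector, which is the "web traffic reduction" pattern: raw TBs in, compact structured rows out.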
Enterprise Use Case: Fraud Detection
Challenge: Identify & alert on fraudulent activity patterns
• Examples:
– ESPs: email fraud
– Finance/banking: bank fraud
– Advertising: click fraud
– Telecom: network fraud
• Data Sources:
– Web & app server logs
– IP/call records
– Email traffic
– Customer transaction data
– Banking/credit data
Solution: Hadoop/MapReduce to filter & correlate communications
• Components:
– Hadoop: massively parallel ingest, storage, and analysis
– Mahout: machine-learning library for building fraud algorithms
– MapReduce: rapid analysis & algorithm deployment
• Hadoop’s Roles:
– Massive ingest of historical/real-time data
– Build/validate fraud-detection models manually or with Mahout
– Parallel MapReduce jobs for near-real-time fraud detection
• Limitations & Cautions:
– Requires software development; partial solution (not real-time, not interactive)
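The "filter & correlate" step can be pictured as grouping records per account and flagging anomalous velocity. A hedged sketch (the record layout, 10-minute window, and threshold are invented for illustration; a production system would use a trained model, e.g. via Mahout, rather than a fixed rule):

```python
from collections import defaultdict

def flag_suspects(transactions, max_per_window=3):
    # transactions: iterable of (account, timestamp_minute, amount)
    by_account = defaultdict(list)
    for acct, minute, amount in transactions:
        by_account[acct].append(minute)
    suspects = set()
    for acct, minutes in by_account.items():
        minutes.sort()
        # flag if more than max_per_window txns fall in any 10-minute span
        for start in minutes:
            window = [m for m in minutes if start <= m < start + 10]
            if len(window) > max_per_window:
                suspects.add(acct)
                break
    return suspects

txns = [("A", 1, 10), ("A", 2, 12), ("A", 3, 9), ("A", 4, 11),
        ("B", 1, 50), ("B", 200, 40)]
flagged = flag_suspects(txns)  # account "A": 4 txns within 10 minutes
```

In the Hadoop setting, the per-account grouping is exactly what the shuffle phase provides for free, so each reducer can run the window check independently and in parallel.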
Enterprise Use Case: Cluster Analysis
Challenge: Group a collection of data according to common similarities
• Examples:
– Customer segmentation
– Financial cost/risk analysis
– Patient-centric healthcare
– Financial stock classification
– Social network analysis
• Data Sources:
– Health records
– Sales data
– Human genome sequences
– Financial trading data
– Facebook/Twitter/LinkedIn
Solution: Process and refine in Hadoop and load into an analytical DB
• Components:
– Hadoop: flexible data storage as volume increases and structures vary
– MapReduce: Cascading allows data processing with minimal adjustments
– Optional: connectors to move results to the analytic DB
• Hadoop’s Roles:
– Flexible: allows agile implementation and unit testing of algorithms
– Large-scale analysis in Hadoop creates more accurate groupings
– Rapid, parallel processing in MapReduce
• Limitations & Cautions:
– Requires software development; complex integration with data sources
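To make the clustering step concrete, here is a tiny one-dimensional k-means (k=2) showing what each MapReduce iteration does: "map" assigns points to their nearest centroid, "reduce" recomputes centroid means. This is an illustrative sketch, not Mahout's implementation; the spend values are made up:

```python
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:  # "map": assign each point to its nearest centroid
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c  # "reduce": new means
                     for c, ps in clusters.items()]
    return sorted(centroids)

# e.g. customer-spend values forming two obvious segments
spend = [10, 12, 11, 90, 95, 92]
segments = kmeans_1d(spend, centroids=[0.0, 100.0])
```

At scale, one Hadoop job per iteration performs the assignment in parallel across the data, which is why large-scale analysis yields the "more accurate groupings" the slide claims: the whole population is clustered, not a sample.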
Greenplum HD: Community Edition Stack
Hadoop Distributed File System (HDFS)
MapReduce Framework (MapRed)
Pig | Hive | HBase | ZooKeeper
100% APACHE
Currently supported; future releases may include support for Oozie and Mahout
Greenplum HD: Enterprise Edition Stack
Hadoop Distributed File System (HDFS)
MapReduce Framework (MapRed)
Pig | Hive | HBase | ZooKeeper
Enhanced Monitoring
100% Apache-compatible interface
Currently supported; future releases may include support for Oozie and Mahout
Greenplum HD: Enterprise Edition – Enterprise-Ready Hadoop Platform for Unstructured Data
• Faster: 2–5x faster than Apache Hadoop
• Reliable: high availability, mirroring
• Easier to use: NFS mountable, system management
Greenplum Enterprise HD is Faster than Other Distributions
[Chart: DFSIO throughput in MB/sec for Read and Write (higher is better)]
[Chart: Terasort elapsed time in minutes on a 3.5 TB sort (lower is better)]
Test bed: 10-node cluster, 2x quad-core, 24 GB DRAM, 12 x 1 TB SATA drives @ 7200 rpm, quad NICs
Greenplum Enterprise HD Distributed Name Node
• Fully distributed service running on all Hadoop nodes
• Automatic and transparent failover
• Persistent metadata
• Highly scalable in number of files
[Diagram: a Name Node (NN) service co-located on every Hadoop node]
Greenplum Enterprise HD Job Tracker High Availability
• Assures business continuity
• Designed for mission-critical use
– Automatic stateful restart
– Task Tracker reconnects without task loss
– Persistent completed-task state
[Diagram: Greenplum Enterprise HD Distribution for Apache Hadoop – Enterprise HD MapReduce over Lockless Storage Services, with Distributed Name Node and Job Tracker HA]
Greenplum Enterprise HD Snapshots
• Intelligent snapshots
– Automatic data deduplication
– Block sharing for space savings
• Fast and flexible
– Zero performance loss when writing to the original
• Easy to manage
– Scheduled or on-demand
– Drag-and-drop recovery
[Diagram: redirect-on-write for snapshots – blocks A B C D shared by Snapshots 1–3; a write creates C' for the live view while snapshots keep C; read/write access from Hadoop/HBase and NFS applications via the Lockless Storage Services]
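Redirect-on-write can be sketched in a few lines: a snapshot copies only the block map, and later writes allocate fresh blocks, so unchanged blocks stay shared. This is a conceptual model of the semantics described above, not the actual Greenplum HD storage code:

```python
class RowStore:
    def __init__(self, blocks):
        self.blocks = {i: b for i, b in enumerate(blocks)}  # block id -> data
        self.live = list(self.blocks)       # live view: ordered block ids
        self.snapshots = {}
        self.next_id = len(blocks)

    def snapshot(self, name):
        self.snapshots[name] = list(self.live)  # copy the map, not the data

    def write(self, index, data):
        self.blocks[self.next_id] = data    # redirect: write goes to a NEW block
        self.live[index] = self.next_id     # only the live view is repointed
        self.next_id += 1

    def read(self, view):
        ids = self.live if view == "live" else self.snapshots[view]
        return [self.blocks[i] for i in ids]

store = RowStore(["A", "B", "C", "D"])
store.snapshot("snap1")
store.write(2, "C'")   # original block "C" is untouched, matching the diagram
```

Because the original block is never overwritten, the live write path pays no copy cost, which is the "zero performance loss" claim on the slide.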
Greenplum Enterprise HD Mirroring
• Business continuity
– Efficient design
– Differential deltas are updated
– Data is compressed and check-summed
• Easy to manage
– Scheduled or on-demand
– Consistent point-in-time mirrors
[Diagram: production cluster in Datacenter 1 mirrored over the WAN to Datacenter 2 (production/research) and to the cloud]
Greenplum Enterprise HD
Direct Access Using NFS
• Simple application integration
– Leverage NFS for random read/write access
• Direct access for standard Hadoop tools
– Command line utilities
– File browsers
– Desktop utilities
Greenplum Enterprise HD
Simple Management
• Intuitive
• Insightful
• Complete
• One node or thousands
Greenplum HD: Software Distributions
Features                 | Community Edition         | Enterprise Edition
Apache compatibility     | 100% Apache open source   | 100% API compatible
Name Node HA             | Reference implementation  | Distributed, highly available
Job Tracker HA           | Reference implementation  | JT high availability
Name Node scalability    | NN metadata in memory     | Distributed Name Node
Premium support          | Yes                       | Yes
Performance              | –                         | 2–5x faster than Community Edition
Snapshots                | No                        | Yes
Mirrors                  | No                        | Yes
NFS mounts               | No                        | Yes
System management        | No                        | Yes
Available for ordering   | May 9, 2011               | Q3
Pricing                  | Per-node pricing          | Per-node pricing
Greenplum HD on Data Computing Appliance
• Introducing the world’s first:
– High-performance
– Purpose-built
– Data co-processing Hadoop appliance
• Combines Greenplum Database and Greenplum HD in one appliance
GPDB GPHD Interoperability
[Diagram: GPDB external tables reference files stored in GPHD; GPHD data flows in and out of GPDB queries]
Greenplum Database External Tables for Hadoop
• Bring GPDB relational expressive power to HDFS
– HDFS data presented as external tables
– HDFS data supports full SQL syntax
• Keep ALL, PART, or NONE of your data in HDFS
• Leverage the full parallelism of both Hadoop and GPDB
– GPDB can read from and write to HDFS. Example:

SELECT count(*)
FROM HDFS_data h, GPDB_data g
WHERE h.key = g.key;

INSERT INTO HDFS_data
SELECT * FROM GPDB_data;
Greenplum Enterprise HD HDFS Integration – Parallelized Flow
• Reading:
– Each GPDB segment reads a portion of the file (segment i of n reads the i/n-th portion)
– Access offsets from the HDFS name node
– Read data directly from the HDFS data nodes
• Writing:
– Each GPDB segment writes a file
– HDFS balancing distributes the load evenly across data nodes
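The "segment i of n reads the i/n-th portion" split above can be sketched as a byte-range computation. This is a simplifying illustration (a real reader would also align splits to record boundaries, which is omitted here):

```python
def segment_range(file_size, i, n):
    """Return (offset, length) of the portion segment i of n should read."""
    start = file_size * i // n
    end = file_size * (i + 1) // n
    return start, end - start

# 1000-byte file split across 3 segments: contiguous, non-overlapping ranges
ranges = [segment_range(1000, i, 3) for i in range(3)]
```

Each GPDB segment then asks the name node for block locations at its offset and streams its range directly from the data nodes, so no single node becomes a bottleneck.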
Big Data Analytics “Stack”
• Greenplum Chorus: enterprise collaboration platform for data
• Analytic toolsets (business analytics, BI, statistics, etc.)
• Greenplum Database: world’s most scalable MPP database platform
• Greenplum HD: enterprise analytics platform for unstructured data
• Greenplum Data Computing Appliances: purpose-built for big data analytics
THANK YOU