Hadoop: Distributed Data Processing
Amr Awadallah, Founder/CTO, Cloudera, Inc.
ACM Data Mining SIG, Thursday, January 25th, 2010
Outline
▪ Scaling for Large Data Processing
▪ What is Hadoop?
▪ HDFS and MapReduce
▪ Hadoop Ecosystem
▪ Hadoop vs RDBMSes
▪ Conclusion
Current Storage Systems Can’t Compute
[Diagram: Instrumentation → Collection → a mostly-append Storage Farm for Unstructured Data (20TB/day), which an ETL Grid distills into an RDBMS (200GB/day) serving Interactive Apps and Ad hoc Queries & Data Mining. Callouts: filer heads are a bottleneck, and most of the raw data sits in "non-consumption", never analyzed.]
The Solution: A Store-Compute Grid
[Diagram: Instrumentation → Collection → a single mostly-append Storage + Computation grid that runs ETL and Aggregations in place, serving "Batch" Apps and Ad hoc Queries & Data Mining directly, with the RDBMS retained downstream for Interactive Apps.]
What is Hadoop?
▪ A scalable, fault-tolerant grid operating system for data storage and processing
▪ Its scalability comes from the marriage of:
  ▪ HDFS: Self-Healing, High-Bandwidth Clustered Storage
  ▪ MapReduce: Fault-Tolerant Distributed Processing
▪Operates on unstructured and structured data
▪A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
▪Open source under the friendly Apache License
▪http://wiki.apache.org/hadoop/
Hadoop History
▪ 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
▪ 2003-2004: Google publishes GFS and MapReduce papers
▪ 2004: Cutting adds DFS & MapReduce support to Nutch
▪ 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
▪ 2007: NY Times converts 4TB of archives over 100 EC2 instances
▪ 2008: Web-scale deployments at Y!, Facebook, Last.fm
▪ April 2008: Yahoo! does the fastest sort of a TB: 3.5 minutes over 910 nodes
▪ May 2009:
  ▪ Yahoo! does the fastest sort of a TB: 62 seconds over 1460 nodes
  ▪ Yahoo! sorts a PB in 16.25 hours over 3658 nodes
▪ June 2009, Oct 2009: Hadoop Summit (750 attendees), Hadoop World (500 attendees)
▪ September 2009: Doug Cutting joins Cloudera
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
HDFS: Hadoop Distributed File System
▪ Block Size = 64MB
▪ Replication Factor = 3
▪ Cost/GB is a few ¢/month vs $/month
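To make those two parameters concrete, here is a minimal back-of-the-envelope Python sketch (not from the deck; the function name is illustrative) of how a file breaks into 64MB blocks and how 3-way replication multiplies the raw storage consumed:

import math

# Hedged sketch: defaults mirror the slide (64MB blocks, 3 replicas).
def hdfs_footprint(file_size_bytes, block_size=64 * 2**20, replication=3):
    num_blocks = math.ceil(file_size_bytes / block_size)  # file is split into fixed-size blocks
    raw_bytes = file_size_bytes * replication             # each block is stored on 3 data nodes
    return num_blocks, raw_bytes

# A 1GB file occupies 16 blocks and ~3GB of raw cluster storage.
print(hdfs_footprint(2**30))  # -> (16, 3221225472)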
MapReduce: Distributed Processing
MapReduce Example for Word Count
[Diagram: the input is divided into splits 1…N; map tasks 1…M read (docid, text) records and emit (word, count) pairs; the shuffle phase sorts the pairs and routes every count for a given word to the same one of reduce tasks 1…R; each reducer writes its own output file of (sorted words, sum of counts). E.g. a mapper reading "To Be Or Not To Be?" emits a count per word; partial counts such as Be, 5 / Be, 12 / Be, 7 / Be, 6 arriving from different mappers are summed by a single reducer into Be, 30.]

The same computation, expressed two other ways:
SELECT word, COUNT(1) FROM docs GROUP BY word;
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
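The shell pipeline above is exactly the Hadoop Streaming contract: the mapper reads raw lines on stdin, and the reducer receives its input already sorted by key. A minimal Python port of the two scripts (the deck names them mapper.pl and reducer.pl; this translation is illustrative, not from the original):

#!/usr/bin/env python
# mapper.py: emit a (word, 1) pair for every word read from stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: stdin arrives sorted by word (the shuffle/sort phase),
# so all counts for a word are contiguous and can be summed in one pass.
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

Locally this runs as cat *.txt | python mapper.py | sort | python reducer.py > out.txt; on a cluster, Hadoop Streaming launches the same two scripts as the map and reduce tasks.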
Hadoop High-Level Architecture
▪ Hadoop Client: contacts the Name Node for data, or the Job Tracker to submit jobs
▪ Name Node: maintains the mapping of file blocks to data node slaves
▪ Job Tracker: schedules jobs across task tracker slaves
▪ Data Node: stores and serves blocks of data
▪ Task Tracker: runs tasks (work units) within a job
▪ Data Nodes and Task Trackers share the same physical node, so tasks run where their data lives
Apache Hadoop Ecosystem
[Stack diagram: HDFS (Hadoop Distributed File System) at the base; MapReduce (Job Scheduling/Execution System, exposed through Streaming/Pipes APIs) on top of it; HBase (key-value store) alongside; Pig (Data Flow) and Hive (SQL) layered on MapReduce; BI Reporting and ETL Tools at the top; Avro (Serialization) and ZooKeeper (Coordination) spanning the whole stack; Sqoop bridging the cluster to external RDBMSes.]
Use The Right Tool For The Right Job

Hadoop, when to use:
• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability

Relational Databases, when to use:
• Interactive Reporting (<1sec)
• Multistep Transactions
• Interoperability
Economics of Hadoop
▪ Typical Hardware:
  ▪ Two Quad Core Nehalems
  ▪ 24GB RAM
  ▪ 12 × 1TB SATA disks (JBOD mode, no need for RAID)
  ▪ 1 Gigabit Ethernet card
▪ Cost: ~$5K/node
▪ Effective HDFS Space:
  ▪ ¼ reserved for temp shuffle space, which leaves 9TB/node
  ▪ 3-way replication leads to 3TB effective HDFS space/node
  ▪ But assuming 7x compression, that becomes ~20TB/node
▪ Effective cost per user TB: $250/TB
▪ Other solutions cost in the range of $5K to $100K per user TB
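The arithmetic behind that $250/TB figure, checked in a few lines of Python (every number comes from the bullets above; nothing else is assumed):

raw_tb = 12                  # 12 x 1TB SATA disks per node
usable_tb = raw_tb * 0.75    # 1/4 reserved for temp shuffle space -> 9TB
hdfs_tb = usable_tb / 3.0    # 3-way replication -> 3TB effective HDFS space
user_tb = hdfs_tb * 7        # ~7x compression -> ~21TB, i.e. ~20TB per node
print(5000 / user_tb)        # $5K/node -> ~$238, i.e. roughly $250 per user TB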
Sample Talks from Hadoop World ’09
▪ VISA: Large Scale Transaction Analysis
▪ JP Morgan Chase: Data Processing for Financial Services
▪ China Mobile: Data Mining Platform for Telecom Industry
▪ Rackspace: Cross Data Center Log Processing
▪ Booz Allen Hamilton: Protein Alignment using Hadoop
▪ eHarmony: Matchmaking in the Hadoop Cloud
▪ General Sentiment: Understanding Natural Language
▪ Yahoo!: Social Graph Analysis
▪ Visible Technologies: Real-Time Business Intelligence
▪ Facebook: Rethinking the Data Warehouse with Hadoop and Hive
Slides and Videos at http://www.cloudera.com/hadoop-world-nyc
Cloudera Desktop
Conclusion
Hadoop is a data grid operating system that provides an economically scalable solution for storing and processing large amounts of unstructured or structured data over long periods of time.
Contact Information
Amr Awadallah
CTO, Cloudera Inc.
http://twitter.com/awadallah

Online Training Videos and Info:
http://cloudera.com/hadoop-training
http://cloudera.com/blog
http://twitter.com/cloudera
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.