ObjectPartnersInc.
Introduction to Hadoop
Presented by: Joel Crabb
Demo by: Nick Adelman
Agenda
Ø Terminology
Ø Why does Hadoop Exist?
Ø HDFS and Hbase
Ø Examples
Ø Getting Started
Ø Demo
Terminology
Ø Hadoop
– Core set of technologies hosted by the Apache Foundation for storing and searching data sets in the Tera and Petabyte range
Ø HDFS
– Hadoop Distributed File System, used as the basis for all Hadoop technologies
Ø Hbase
– Distributed Map-based database which uses HDFS as its underlying data store
Ø Map Reduce
– A framework for programming distributed parallel processing algorithms
Terminology
Ø Distributed Computing
– A computing paradigm that parallelizes computation over multiple compute nodes in order to decrease overall processing time
Ø NoSQL
– Programming paradigm which does not use a relational database as the backend data store
Ø Big Data
– Generic term used when working with large data sets
Ø Name Node
– Server that knows the location of all files in the cluster
Enterprise Architecture 101
[Diagram: data sources feeding HDFS and Hbase, with Map Reduce connecting them to an RDBMS]
The New System Constraint
Ø Hard disk seek time is the new constraint when working with a Petabyte data set
– Spread the seek time among multiple servers
– Isolate the data to a single read per disk
– Faster to read too much data sequentially on disk and discard the excess
Ø Working under this paradigm requires New Tools
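The trade-off above can be made concrete with back-of-envelope arithmetic. The drive figures here (100 MB/s sequential transfer, 10 ms per seek) are assumed, illustrative numbers, not measurements from any particular cluster:

```python
# Illustration of the seek-time constraint with assumed drive figures.
TRANSFER_MB_S = 100   # hypothetical sequential transfer rate
SEEK_S = 0.010        # hypothetical time per disk seek

def scan_seconds(total_mb, num_disks):
    """Time to read total_mb sequentially, striped over num_disks."""
    return (total_mb / TRANSFER_MB_S) / num_disks

def seek_seconds(num_records):
    """Time spent if every record requires its own disk seek."""
    return num_records * SEEK_S

# Reading 1 TB sequentially on one disk: 10,000 s (~2.8 hours).
print(scan_seconds(1_000_000, 1))
# Spread across 100 disks: 100 s.
print(scan_seconds(1_000_000, 100))
# Seeking to 10 million individual records: 100,000 s -- far worse
# than scanning everything and discarding the excess.
print(seek_seconds(10_000_000))
```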
New Tools: Why does Hadoop exist?
Ø In the early 2000s Google had problems:
Ø Problem 1: Store Tera and Petabytes of data
– Inexpensive, Reliable, Accessible
Ø Answer: a distributed file system
Ø Problem 2: Distributed Computing is Hard
Ø Answer: make distributed computing easier
Ø Problem 3: Data sets too large for RDBMS
Ø Answer: make a new way to store application data
Google’s Solution: Tool 1
Ø Google File System (GFS)
– A file system specifically built to manage large files and support distributed computing
Ø Inexpensive:
– Store files distributed across a cluster of cheap servers
Ø Reliable:
– Plan for server failure: if you have 1000 servers, one will fail every day
– Always maintain three copies of each file (configurable)
Ø Accessible:
– File chunk size is 64MB = fewer file handles to manage
– Master table keeps track of the location of each file copy
Problem 1: Store Tera and Petabytes of data
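The chunk-size choice is easy to see with arithmetic. This is a sketch of the bookkeeping math implied by the slide (64 MB chunks, three copies), not actual GFS code:

```python
import math

CHUNK_MB = 64    # GFS chunk size from the slide
REPLICAS = 3     # default number of copies per file (configurable)

def chunks_for(file_mb):
    """Number of 64 MB chunks the master table must track for one file."""
    return math.ceil(file_mb / CHUNK_MB)

def stored_mb(file_mb):
    """Total disk consumed once every chunk is replicated three times."""
    return file_mb * REPLICAS

# A 1 GB file is only 16 chunks -- few handles for the master to manage.
print(chunks_for(1024))   # 16
# But reliability costs space: 1 GB of data occupies 3 GB on disk.
print(stored_mb(1024))    # 3072
```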
Google’s Solution: Tool 2
Ø Map Reduce – abstracts away the hard parts of distributed computing
Ø Programmers no longer need to manage:
– Where is the data?
– What piece of data am I working on?
– How do I move data and result sets?
– How do I combine results?
Ø Leverages the GFS
– Send processing to the data
– Multiple file copies mean a higher chance to use more nodes for each process
Problem 2: Distributed Computing is Hard
Tool 2: Map Reduce
Ø Distributed parallel processing framework
Ø Map - done N times on N servers
– Perform an operation (search) on a chunk (GBs) of data
Ø Search 100 GB
– Process Map on 25 servers with 4GB of memory
– 100 GB processed in-parallel in-memory
– Create Maps storing results (key-value pairs)
Ø Reduce
– Take Maps from N nodes
– Merge (reduce) maps to a single sorted map (result set)
Problem 2: Distributed Computing is Hard
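The map/reduce contract described above can be sketched in a few lines of plain Python. This is a single-process word-count illustration of the pattern, not the Hadoop API; the chunks, function names, and data are invented for the example:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: runs independently on each chunk, emits (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(all_pairs):
    """Reduce: merge per-node results into a single sorted map."""
    counts = defaultdict(int)
    for key, value in all_pairs:
        counts[key] += value
    return dict(sorted(counts.items()))

# Each "server" maps its own chunk independently...
chunks = ["big data big", "data big"]
mapped = [pair for c in chunks for pair in map_phase(c)]
# ...then one reduce merges the partial results into the result set.
print(reduce_phase(mapped))  # {'big': 3, 'data': 2}
```

Note that the map calls never see each other's data, which is what lets the framework scatter them across N servers.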
Google’s Solution: Tool 3
Ø Bigtable: new paradigm in storing large data sets
– “a sparse, distributed, persistent multi-dimensional sorted map”*
Ø Sparse: Few entries in the map are populated
Ø Distributed: Data spread across multiple logical machines in multiple copies
Ø Multi-dimensional: Maps within maps organize and store data
Ø Sorted: Sorted by lexicographic keys
– Lexicographic = alphabetical order, including numbers
Problem 3: Data sets too large for RDBMS
*Bigtable: A Distributed Storage System for Structured Data
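A small illustration of what lexicographic (byte-order) sorting means in practice, and the usual zero-padding trick it forces on key design (the key names here are invented):

```python
# Keys sort lexicographically (byte order), not numerically:
# "row10" comes before "row2" because "1" < "2" character by character.
keys = ["row2", "row10", "row1"]
print(sorted(keys))    # ['row1', 'row10', 'row2']

# Zero-padding the numeric part restores the intended ordering.
padded = ["row0002", "row0010", "row0001"]
print(sorted(padded))  # ['row0001', 'row0002', 'row0010']
```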
Google’s Architecture
[Diagram: Map Reduce and direct-access clients on top of Bigtable, which sits on GFS]
Hadoop – If Something Works…
Ø GFS → HDFS
Ø Bigtable → Hbase
Ø Map Reduce → Map Reduce
Ø Hadoop was started to recreate these technologies in the Open Source community
A Little More on HDFS
Ø Plan for Failure
– In a thousand node cluster, machines will fail often
– HDFS is built to detect failure and redistribute files
Ø Fast Data Access
– Generally a batch processing system
Ø Large Files
– Typically GB to TB files
Ø Simple Coherency
– Once a file is closed, it cannot be updated or appended
Ø Cloud Ready
– Can be set up on Amazon EC2 / S3
Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html
A Little More on Hbase
Ø Multi-dimensional Map
Ø Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>
Ø First Map: Row Key to Column Family
Ø Second Map: Column Family to Column Label
Ø Third Map: Column Label to Timestamp
Ø Fourth Map: Timestamp to Value
A Column Family is a grouping of columns of the same data type.
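The four nested maps can be rendered as plain nested dictionaries. This is a hypothetical illustration of the data model only (the row, family, and column names are invented), not the Hbase client API:

```python
# row key -> column family -> column label -> timestamp -> value
table = {
    b"row1": {                        # first map: row key
        b"info": {                    # second map: column family
            b"info:name": {           # third map: column label
                1262304000: b"Joel",  # fourth map: timestamp -> value
                1262390400: b"Nick",  # a newer write to the same cell
            }
        }
    }
}

def latest(table, row, family, column):
    """Read one cell, returning the value at the newest timestamp
    (mirrors the default versioned-read behavior)."""
    versions = table[row][family][column]
    return versions[max(versions)]

print(latest(table, b"row1", b"info", b"info:name"))  # b'Nick'
```

Sparseness falls out naturally: a row simply omits the families and columns it has no data for.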
Hbase Storage Model
Hbase Access
Ø REST interface
– http://wiki.apache.org/hadoop/Hbase/Stargate
Ø Groovy
– http://wiki.apache.org/hadoop/Hbase/Groovy
Ø Scala
– http://wiki.apache.org/hadoop/Hbase/Scala
Industry Examples
* Information from http://wiki.apache.org/hadoop/PoweredBy
Ø Web/File Search (Yahoo!)
Ø Yahoo! is the main sponsor of and contributor to Hadoop
Ø Has over 25,000 servers running Hadoop
Ø Log aggregation (Amazon, Facebook, Baidu)
Ø RDBMS replacement (Google Analytics)
Ø Image store (Google Earth)
Ø Email store (Gmail)
Ø Natural Language Search (Microsoft)
Ø Many more…
Use Case #1: Yahoo! Search
Ø Problem circa 2006
Ø Yahoo! search is seen as inferior to Google’s
Ø Google is better at:
– Storing Tera and Petabytes of unstructured data
– Searching the data set efficiently
– Applying custom analytics to the data set
– Presenting a more relevant result set
Use Case #1: Yahoo! Search
Ø Solution – Emulate Google with Hadoop’s HDFS, Pig and Map Reduce
– HDFS
• Stores Petabytes of web page data distributed over a cluster of compute nodes (1000s)
• Runs on commodity hardware
• Average server – 2X4 core, 4 – 32 GB RAM *
– Pig (Hadoop sub-project)
• Analytics processing platform
– Map Reduce
• Builds indexes from raw web data
* http://wiki.apache.org/hadoop/PoweredBy
Use Case #2: RDBMS Replacement
Ø Google Analytics circa 2006
Ø Problem
– Store Terabytes of analytics data about website usage
– GBs of data added per hour
– Data added in small increments
– Access and display data in < 3 seconds per request
Use Case #2: RDBMS Replacement
Ø Solution – Bigtable and Map Reduce on GFS
Ø Bigtable sits over GFS and ingests small bits of data
Ø In 2006, the GA cluster supported ~220 TB*
Ø Raw Click Table (200 TB)
– Rows keyed by WebsiteName + SessionTime
– All of a website’s data stored consecutively on disk
Ø Summary Table (20 TB)
– Map Reduce of the Raw Click Table for customer web views
*Bigtable: A Distributed Storage System for Structured Data
Pattern: Collect data in one Bigtable instance; Map Reduce to a View Bigtable instance
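The Raw Click Table's keying scheme can be sketched with a composite key. The key format, site names, and padding width here are illustrative assumptions, but they show why keying by WebsiteName + SessionTime puts one site's sessions on consecutive rows:

```python
def row_key(website, session_time):
    """Composite row key; zero-padding the time keeps
    lexicographic (byte) order equal to chronological order."""
    return f"{website}|{session_time:010d}"

clicks = {
    row_key("example.com", 1002): "click-b",
    row_key("example.com", 1001): "click-a",
    row_key("other.com",   1000): "click-c",
}

# Because rows are stored sorted by key, a scan for one website
# is a single contiguous key range -- one sequential disk read.
site_rows = [k for k in sorted(clicks) if k.startswith("example.com|")]
print(site_rows)
```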
Can You Use Hadoop?
Ø IF…
– You have a large amount of data (Terabytes+)
– You can split your data collection store from your online or analytics data store
– You can order your data lexicographically
– You can run analytics as batches
– You cannot afford a large enough RDBMS
– You need dynamic column additions
– You need near linear performance as the data set grows
Other Hadoop Technologies
Ø Hive – SQL-like query language to use Hadoop like a data warehouse
Ø Pig – parallel data analysis framework
Ø Zookeeper – distributed application coordination framework
Ø Chukwa – data collection system for distributed computing
Ø Avro – data serialization framework
New Skills for IT
Ø Learning to restructure data
Ø Learning to write Map Reduce programs
Ø Learning to maintain a Hadoop cluster
Ø Forgetting RDBMS/SQL-dominated design principles

It takes a new style of creativity to both structure data in Hadoop and write useful Map Reduce programs.
Getting Started
Ø You can install a test system on a single Unix box
Ø For a full system, a minimum of 3 servers
– 10 to 20 servers is a small cluster
Ø Expect to spend a day to a week getting a multi-node cluster configured
Ø A book like Pro Hadoop, by Jason Venner, may save you time, but it is based on the 0.19 Hadoop release (currently at 0.20)
Optional Quickstart
Ø Cloudera has a preconfigured single-node Hadoop instance available for download at: http://www.cloudera.com/hadoop-training-virtual-machine
Ø Yahoo! has a Hadoop distribution as well at: http://developer.yahoo.com/hadoop/distribution/
Alternatives to Hbase
Ø Project Voldemort
– http://project-voldemort.com/
– Used by LinkedIn
Ø Hypertable
– http://www.hypertable.org/
– Used by Baidu (search leader of China)
Ø Cassandra
– http://cassandra.apache.org/
– Apache-sponsored distributed database
– Used by Facebook
Helpful Information
Ø http://hadoop.apache.org
Ø http://hbase.apache.org
Ø http://wiki.apache.org/hadoop/HadoopPresentations
Ø http://labs.google.com/papers/bigtable.html
Ø http://labs.google.com/papers/gfs.html
Ø http://labs.google.com/papers/mapreduce.html
Ø Twitter: @hbase
Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM