Introduction to Hadoop


An introduction to Hadoop presentation geared towards educating potential clients on Hadoop's capabilities.

Transcript of Introduction to Hadoop

ObjectPartnersInc.


Introduction to Hadoop

Presented by: Joel Crabb

Demo by: Nick Adelman

ObjectPartnersInc. Agenda

Ø Terminology
Ø Why does Hadoop Exist?
Ø HDFS and Hbase
Ø Examples
Ø Getting Started
Ø Demo

ObjectPartnersInc. Terminology

Ø Hadoop
– Core set of technologies hosted by the Apache Foundation for storing and searching data sets in the Tera and Petabyte range
Ø HDFS
– Hadoop Distributed File System, used as the basis for all Hadoop technologies
Ø Hbase
– Distributed Map based database which uses HDFS as its underlying data store
Ø Map Reduce
– A framework for programming distributed parallel processing algorithms

ObjectPartnersInc. Terminology

Ø Distributed Computing
– A computing paradigm that parallelizes computations over multiple compute nodes in order to decrease overall processing time
Ø NoSQL
– Storage paradigm that does not use a relational database as the backend data store
Ø Big Data
– Generic term used when working with large data sets
Ø Name Node
– Server that knows the location of all files in the cluster

ObjectPartnersInc. Enterprise Architecture 101

[Architecture diagram showing Data sources, HDFS, Hbase, Map Reduce, and RDBMS tiers]

ObjectPartnersInc. The New System Constraint

Ø Hard disk seek time is the new constraint when working with a Petabyte data set
– Spread the seek time among multiple servers
– Isolate the data to a single read per disk
– Faster to read too much data sequentially on disk and discard the excess (e.g., at roughly 10 ms per seek and ~100 MB/s sequential throughput, one seek costs about a megabyte of reading)
Ø Working under this paradigm requires new tools

ObjectPartnersInc. New Tools: Why does Hadoop exist?

Ø In the early 2000s Google had problems:
Ø Problem 1: Store Tera and Petabytes of data
– Inexpensive, Reliable, Accessible
Ø Answer: a distributed file system
Ø Problem 2: Distributed Computing is Hard
Ø Answer: make distributed computing easier
Ø Problem 3: Data sets too large for RDBMS
Ø Answer: make a new way to store application data

ObjectPartnersInc. Google’s Solution: Tool 1

Ø Google File System (GFS)
– A file system specifically built to manage large files and support distributed computing
Ø Inexpensive:
– Store files distributed across a cluster of cheap servers
Ø Reliable:
– Plan for server failure: if you have 1000 servers, one will fail every day
– Always maintain three copies of each file (configurable)
Ø Accessible:
– File chunk size is 64MB = fewer file handles to manage
– Master table keeps track of the location of each file copy

Problem 1: Store Tera and Petabytes of data

ObjectPartnersInc. Google’s Solution: Tool 2

Ø Map Reduce – abstracts away the hard parts of distributed computing
Ø Programmers no longer need to manage:
– Where is the data?
– What piece of data am I working on?
– How do I move data and result sets?
– How do I combine results?
Ø Leverages the GFS
– Send processing to the data
– Multiple file copies mean a higher chance to use more nodes for each process

Problem 2: Distributed Computing is Hard

ObjectPartnersInc. Tool 2: Map Reduce

Ø Distributed parallel processing framework
Ø Map – done N times on N servers
– Perform an operation (e.g., a search) on a chunk (GBs) of data
Ø Example: search 100 GB
– Process Map on 25 servers with 4GB of memory each
– 100 GB processed in parallel, in memory
– Create Maps storing results (key-value pairs)
Ø Reduce
– Take Maps from N nodes
– Merge (reduce) maps to a single sorted map (result set); a word-count sketch follows below

Problem 2: Distributed Computing is Hard
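To make the Map and Reduce steps concrete, here is a minimal word-count sketch against the Hadoop 0.20 Java API (the release this deck targets). The class names and input/output paths are illustrative, not from the presentation.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: runs N times on N servers, each against the chunk of data
      // stored locally on that node; emits (word, 1) key-value pairs
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
          }
        }
      }

      // Reduce: merges the per-node results into a single sorted result set
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }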

ObjectPartnersInc. Google’s Solution: Tool 3

Ø Bigtable: new paradigm in storing large data sets
– "a sparse, distributed, persistent multi-dimensional sorted map"*
Ø Sparse: Few entries in the map are populated
Ø Distributed: Data spread across multiple logical machines in multiple copies
Ø Multi-dimensional: Maps within maps organize and store data
Ø Sorted: Sorted by lexicographic keys
– Lexicographic = byte order, like alphabetical order extended to include numbers (see the sketch below)

Problem 3: Data sets too large for RDBMS

*Bigtable: A Distributed Storage System for Structured Data
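A tiny Java illustration of lexicographic ordering (the keys are made up): "row10" sorts before "row2", which is why fixed-width or zero-padded row keys matter when designing Bigtable-style schemas.

    import java.util.TreeSet;

    public class LexOrder {
      public static void main(String[] args) {
        // TreeSet<String> sorts lexicographically, like Bigtable row keys
        TreeSet<String> keys = new TreeSet<String>();
        keys.add("row2");
        keys.add("row10");
        keys.add("apple");
        System.out.println(keys); // prints [apple, row10, row2]
      }
    }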

ObjectPartnersInc. Google’s Architecture

[Diagram: Map Reduce and Direct Access paths to Bigtable, which is layered on GFS]

ObjectPartnersInc. Hadoop – If Something Works…

Google → Hadoop
GFS → HDFS
Bigtable → Hbase
Map Reduce → Map Reduce

Ø Hadoop was started to recreate these technologies in the Open Source community

ObjectPartnersInc. A Little More on HDFS

Ø Plan for Failure
– In a thousand-node cluster, machines will fail often
– HDFS is built to detect failure and redistribute files
Ø Streaming Data Access
– Generally a batch processing system, optimized for throughput rather than low-latency access
Ø Large Files
– Typically GB to TB files
Ø Simple Coherency
– Once a file is closed, it cannot be updated or appended (see the write sketch below)
Ø Cloud Ready
– Can be set up on Amazon EC2 / S3

Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html
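A minimal sketch of writing a file through the HDFS Java API; the path and contents are made up, and a configured cluster (core-site.xml on the classpath) is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
      public static void main(String[] args) throws Exception {
        // Configuration picks up fs.default.name, which points at the Name Node
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS transparently chunks and replicates the file (3 copies by default)
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        out.writeBytes("hello hadoop\n");
        out.close(); // simple coherency: once closed, no updates or appends

        fs.close();
      }
    }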

ObjectPartnersInc. A Little More on Hbase

Ø Conceptually a multi-dimensional sorted Map:
– Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>
Ø First Map: Row Key to Column Family
Ø Second Map: Column Family to Column Label
Ø Third Map: Column Label to Timestamp
Ø Fourth Map: Timestamp to Value

A Column Family is a grouping of columns of the same data type. (A sketch of this nested-map view follows below.)
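A minimal Java sketch of that nested-map view, illustrative only: this is the conceptual model, not how Hbase physically stores data. The row, family, and label names are made up; Bytes.BYTES_COMPARATOR supplies the lexicographic byte-order sort.

    import java.util.SortedMap;
    import java.util.TreeMap;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ConceptualCell {
      public static void main(String[] args) {
        // Fourth map: Timestamp to Value
        SortedMap<Long, byte[]> versions = new TreeMap<Long, byte[]>();
        versions.put(System.currentTimeMillis(), Bytes.toBytes("joel@example.com"));

        // Third map: Column Label to Timestamp
        SortedMap<byte[], SortedMap<Long, byte[]>> labels =
            new TreeMap<byte[], SortedMap<Long, byte[]>>(Bytes.BYTES_COMPARATOR);
        labels.put(Bytes.toBytes("email"), versions);

        // Second map: Column Family to Column Label
        SortedMap<byte[], SortedMap<byte[], SortedMap<Long, byte[]>>> families =
            new TreeMap<byte[], SortedMap<byte[], SortedMap<Long, byte[]>>>(Bytes.BYTES_COMPARATOR);
        families.put(Bytes.toBytes("contact"), labels);

        // First map: Row Key to Column Family
        SortedMap<byte[], SortedMap<byte[], SortedMap<byte[], SortedMap<Long, byte[]>>>> table =
            new TreeMap<byte[], SortedMap<byte[], SortedMap<byte[], SortedMap<Long, byte[]>>>>(Bytes.BYTES_COMPARATOR);
        table.put(Bytes.toBytes("user1"), families);
      }
    }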

ObjectPartnersInc. Hbase Storage Model

ObjectPartnersInc. Hbase Access

Ø REST interface
– http://wiki.apache.org/hadoop/Hbase/Stargate
Ø Groovy
– http://wiki.apache.org/hadoop/Hbase/Groovy
Ø Scala
– http://wiki.apache.org/hadoop/Hbase/Scala
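The native Java client (not listed on the slide, but the most direct route) is another option. A sketch against the 0.20-era client API, with a made-up "pages" table and columns:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HbaseClientSketch {
      public static void main(String[] args) throws Exception {
        // 0.20-era constructor; the table name "pages" is hypothetical
        HTable table = new HTable(new HBaseConfiguration(), "pages");

        // Write one cell: row key, column family "content", column label "html"
        Put put = new Put(Bytes.toBytes("com.example/index"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html/>"));
        table.put(put);
        table.flushCommits(); // push buffered writes to the region server

        // Read it back; the newest timestamped version is returned by default
        Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));
      }
    }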

ObjectPartnersInc. Industry Examples

* Information from http://wiki.apache.org/hadoop/PoweredBy

Ø Web/File Search (Yahoo!)
– Yahoo! is the main sponsor of and contributor to Hadoop
– Has over 25,000 servers running Hadoop
Ø Log aggregation (Amazon, Facebook, Baidu)
Ø RDBMS replacement (Google Analytics)
Ø Image store (Google Earth)
Ø Email store (Gmail)
Ø Natural Language Search (Microsoft)
Ø Many more…

ObjectPartnersInc. Use Case #1: Yahoo! Search

Ø Problem circa 2006

Ø Yahoo! search is seen as inferior to Google's
Ø Google is better at:
– Storing Tera and Petabytes of unstructured data
– Searching the data set efficiently
– Applying custom analytics to the data set
– Presenting a more relevant result set

ObjectPartnersInc. Use Case #1: Yahoo! Search

Ø Solution – Emulate Google with Hadoop's HDFS, Pig, and Map Reduce
– HDFS
• Stores Petabytes of web page data distributed over a cluster of compute nodes (1000s)
• Runs on commodity hardware
• Average server – 2 x 4-core CPUs, 4 – 32 GB RAM *
– Pig (Hadoop sub-project)
• Analytics processing platform
– Map Reduce
• Builds indexes from raw web data

* http://wiki.apache.org/hadoop/PoweredBy

ObjectPartnersInc. Use Case #2: RDBMS Replacement

Ø Google Analytics circa 2006
Ø Problem
– Store Terabytes of analytics data about website usage
– GBs of data added per hour
– Data added in small increments
– Access and display data in < 3 seconds per request

ObjectPartnersInc. Use Case #2: RDBMS Replacement

Ø Solution – Bigtable and Map Reduce on GFS
Ø Bigtable sits over GFS and ingests small bits of data
Ø In 2006, the GA cluster supported ~220 TB*
Ø Raw Click Table (200 TB)
– Rows keyed by WebsiteName + Session Time (see the sketch below)
– All of a website's data stored consecutively on disk
Ø Summary Table (20 TB)
– Map Reduce of the Raw Click Table for customer web views

*Bigtable: A Distributed Storage System for Structured Data

Pattern: Collect data in one Bigtable instance; Map Reduce to a View Bigtable instance
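A small sketch of that row-key pattern (the separator and zero-padding are assumptions for illustration): because Bigtable sorts rows lexicographically, prefixing keys with the website name keeps one site's sessions contiguous on disk, and zero-padding the session time keeps them in chronological order.

    public class ClickRowKey {
      // WebsiteName + Session Time composed into a single sortable key
      static String rowKey(String websiteName, long sessionTimeMillis) {
        return websiteName + "/" + String.format("%013d", sessionTimeMillis);
      }

      public static void main(String[] args) {
        // "example.com" is a made-up site; all its rows share the prefix,
        // so a per-site report is one sequential scan
        System.out.println(rowKey("example.com", 1146096000000L));
        // prints example.com/1146096000000
      }
    }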

ObjectPartnersInc. Can You Use Hadoop?

Ø IF…
– You have a large amount of data (Terabytes+)
– You can split your data-collection data store from your online or analytics data store
– You can order your data lexicographically
– You can run analytics as batches
– You cannot afford a large enough RDBMS
– You need dynamic column additions
– You need near-linear performance as the data set grows

ObjectPartnersInc. Other Hadoop Technologies

Ø Hive – SQL-like query language to use Hadoop like a data warehouse
Ø Pig – Parallel data analysis framework
Ø Zookeeper – Distributed application coordination framework
Ø Chukwa – Data collection system for distributed computing
Ø Avro – Data serialization framework

ObjectPartnersInc. New Skills for IT

Ø Learning to restructure data
Ø Learning to write Map Reduce programs
Ø Learning to maintain a Hadoop cluster
Ø Forgetting RDBMS/SQL-dominated design principles

It takes a new style of creativity to both structure data in Hadoop and write useful Map Reduce programs.

ObjectPartnersInc. Getting Started

Ø You can install a test system on a single Unix box
Ø A full system needs a minimum of 3 servers
– 10 to 20 servers is a small cluster
Ø Expect to spend a day to a week getting a multi-node cluster configured
Ø A book like Pro Hadoop, by Jason Venner, may save you time, but it is based on the 0.19 Hadoop release (Hadoop is currently at 0.20)

ObjectPartnersInc. Optional Quickstart

Ø Cloudera has a preconfigured single-node Hadoop instance available for download at: http://www.cloudera.com/hadoop-training-virtual-machine

Ø Yahoo! has a Hadoop distribution as well at: http://developer.yahoo.com/hadoop/distribution/

ObjectPartnersInc. Alternatives to Hbase

Ø Project Voldemort
– http://project-voldemort.com/
– Used by LinkedIn
Ø Hypertable
– http://www.hypertable.org/
– Used by Baidu (the search leader of China)
Ø Cassandra
– http://cassandra.apache.org/
– Apache-sponsored distributed database
– Used by Facebook

ObjectPartnersInc. Helpful Information

Ø http://hadoop.apache.org
Ø http://hbase.apache.org
Ø http://wiki.apache.org/hadoop/HadoopPresentations
Ø http://labs.google.com/papers/bigtable.html
Ø http://labs.google.com/papers/gfs.html
Ø http://labs.google.com/papers/mapreduce.html
Ø Twitter: @hbase
Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM