Dan Bassett, Jonathan Canfield December 13, 2011
2
What is Hadoop?
• Allows for the distributed processing of large data sets across clusters of computers
• Open-source project written in Java
• Actively supported
• Inspired by Google's published work on MapReduce and the Google File System
3
What’s the big deal?
• Changes the economics and dynamics of large-scale computing
• Scalable
• Cost effective
• Flexible
• Fault tolerant
4
Commercially supported
• InfoSphere BigInsights
• Silicon Graphics CloudRack
• EMC Greenplum
• Google App Engine
• Oracle Big Data Appliance
• Cloudera CDH, Professional Services
• Microsoft Windows Server, SQL Server
5
Who Uses Hadoop?
6
Prominent Users
• Facebook - claims to have the largest Hadoop cluster in the world at 30 PB.
• Yahoo! - claims to have the world’s largest Hadoop production application.
• eBay - 5.3 PB, 532-node cluster
• New York Times - processed 4 TB of image data into 11 million PDFs at a cost of ~$240
7
HOW DOES IT WORK?
8
Architecture
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• MapReduce Engine
9
File System (HDFS)
• One big file system from many nodes
• Fault-tolerant
• Runs on low-cost commodity hardware
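One way to picture how a file system built from many nodes can survive a node failure is a toy placement sketch. The block size, replica count, round-robin placement, and the function names `place_blocks` and `readable_after_failure` are assumptions made for illustration. This is not HDFS's actual placement policy or API.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes (illustrative only)
REPLICATION = 3                # assumed number of copies kept per block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different node.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    layout = place_blocks(200 * 1024 * 1024, ["n1", "n2", "n3", "n4"])
    print(layout)
    print(readable_after_failure(layout, "n2"))
```

With several replicas per block, losing one node still leaves every block with a surviving copy. With a single replica, losing the node that holds a block makes the file unreadable.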
10
MapReduce Engine
• Splits input data
• Assigns work to nodes
• Processes the pieces in parallel
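The split, assign, and process-in-parallel flow above can be sketched as a small in-memory word count. This is a single-machine illustration, not Hadoop's Java API. Threads stand in for cluster nodes, and every function name here (`split_input`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) is a hypothetical helper.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Split the input into roughly equal chunks, one per worker."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_workers=3):
    chunks = split_input(lines, num_workers)
    # Threads stand in for cluster nodes; each one maps its own chunk.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

The map step needs no coordination between chunks, which is what lets the work be spread across nodes. Only the shuffle step has to bring values for the same key together.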
11
MapReduce Illustration
12
MapReduce Step 1
13
MapReduce Step 2
14
MapReduce Step 3
15
MapReduce Step 4
16
MapReduce Step 4
17
MapReduce Step 5
18
MapReduce Step 5
19
MapReduce Step 6
20
MapReduce Illustration
21
Resources
• Project Home: http://hadoop.apache.org/
• Wikipedia: http://en.wikipedia.org/wiki/Apache_Hadoop
• IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/