Dan Bassett, Jonathan Canfield December 13, 2011

What is Hadoop?

• Allows for the distributed processing of large data sets across clusters of computers
• Open-source project written in Java
• Actively supported
• Inspired by Google's published work on MapReduce and the Google File System

What’s the big deal?

• Changes the economics and dynamics of large-scale computing
• Scalable
• Cost-effective
• Flexible
• Fault-tolerant

Commercially supported

• InfoSphere BigInsights
• Silicon Graphics CloudRack
• EMC Greenplum
• Google App Engine
• Oracle Big Data Appliance
• Cloudera CDH, Professional Services
• Microsoft Windows Server, SQL Server

Who Uses Hadoop?

Prominent Users

• Facebook - claims to have the largest Hadoop cluster in the world at 30 PB
• Yahoo! - claims to have the world’s largest Hadoop production application
• eBay - 5.3 PB on a 532-node cluster
• New York Times - processed 4 TB of image data into 11 million PDFs at a cost of about $240

HOW DOES IT WORK?

Architecture

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• MapReduce Engine

File System (HDFS)

• One big file system from many nodes
• Fault-tolerant
• Runs on low-cost commodity hardware
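
The idea of one fault-tolerant file system built from many nodes can be sketched with a toy model. This is not the HDFS API: the function names, the 64 MB block size, the replication factor of 3, and the round-robin placement are all assumptions chosen for illustration, and real HDFS placement is more sophisticated.

```python
# Toy model of HDFS-style storage. NOT the real HDFS API or placement policy.
from itertools import cycle

BLOCK_SIZE_MB = 64   # assumed block size, for illustration only
REPLICATION = 3      # assumed number of copies kept of each block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = cycle(nodes)
    return {b: [next(ring) for _ in range(replication)]
            for b in range(num_blocks)}


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one copy after one node fails."""
    return [b for b, holders in placement.items()
            if any(node != failed_node for node in holders)]


blocks = split_into_blocks(200)  # a hypothetical 200 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(placement)
print(surviving_blocks(placement, "node2"))
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block, which is the sense in which cheap, failure-prone hardware can back a reliable file system.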

MapReduce Engine

• Splits input data
• Assigns work to nodes
• Processed in parallel
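
The split, assign, and process-in-parallel flow can be sketched as a word count in plain Python. This is a single-machine illustration, not Hadoop's Java MapReduce API: the function names are hypothetical, and worker threads stand in for cluster nodes.

```python
# Minimal single-machine sketch of the MapReduce pattern (word count).
# Threads stand in for cluster nodes; this is not the Hadoop API.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Cut the input into roughly equal chunks, one per worker."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_chunk(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_workers=3):
    chunks = split_input(lines, num_workers)
    # Each chunk is handed to a separate worker, as a node would be.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(word_count(lines))
```

The map step never needs to see another worker's chunk, which is what lets the work be spread across many nodes; only the shuffle step brings matching keys together before the reduce.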

MapReduce Illustration

[Figures: MapReduce Steps 1 through 6; slide images not included in this transcript]

Resources

• Project Home: http://hadoop.apache.org/
• Wikipedia: http://en.wikipedia.org/wiki/Apache_Hadoop
• IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/
