Dan Bassett, Jonathan Canfield December 13, 2011

What is Hadoop?

• Allows for the distributed processing of large data sets across clusters of computers
• Open-source project written in Java
• Actively supported
• Inspired by Google's MapReduce and Google File System (GFS) work

What’s the big deal?

• Changes the economics and dynamics of large-scale computing
• Scalable
• Cost-effective
• Flexible
• Fault-tolerant

Commercially supported

• InfoSphere BigInsights
• Silicon Graphics CloudRack
• EMC Greenplum
• Google App Engine
• Oracle Big Data Appliance
• Cloudera CDH, Professional Services
• Microsoft Windows Server, SQL Server

Who Uses Hadoop?

Prominent Users

• Facebook - claims to have the largest Hadoop cluster in the world, at 30 PB.
• Yahoo! - claims to have the world's largest Hadoop production application.
• eBay - 5.3 PB, 532-node cluster.
• New York Times - processed 4 TB of image data into 11 million PDFs at a cost of about $240.

HOW DOES IT WORK?

Architecture

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• MapReduce Engine

File System (HDFS)

• One big file system from many nodes
• Fault-tolerant
• Runs on low-cost commodity hardware
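
The idea of "one big file system from many nodes" that survives node failures can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the names write_file and read_file, the 4-byte block size, the round-robin placement, and the in-memory dictionaries standing in for nodes are all assumptions made for illustration.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (assumed for this sketch)
REPLICATION = 3   # copies kept of each block (assumed for this sketch)


def write_file(data, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into blocks and store each block on several nodes.

    nodes maps a node name to a dict of {block_id: bytes}.
    Returns a block map: {block_id: [names of nodes holding a copy]}.
    """
    names = sorted(nodes)
    if replication > len(names):
        raise ValueError("not enough nodes for the requested replication")
    block_map = {}
    for offset in range(0, len(data), block_size):
        block_id = offset // block_size
        # Hypothetical placement policy: simple round-robin over the nodes.
        holders = [names[(block_id + r) % len(names)] for r in range(replication)]
        for name in holders:
            nodes[name][block_id] = data[offset:offset + block_size]
        block_map[block_id] = holders
    return block_map


def read_file(block_map, nodes, failed=frozenset()):
    """Reassemble the file, skipping any node listed in failed."""
    out = bytearray()
    for block_id in sorted(block_map):
        live = [n for n in block_map[block_id] if n not in failed]
        if not live:
            raise IOError("block %d has no live replica" % block_id)
        out += nodes[live[0]][block_id]
    return bytes(out)


if __name__ == "__main__":
    cluster = {"node%d" % i: {} for i in range(4)}
    layout = write_file(b"hello distributed world", cluster)
    # Two of the four nodes are down, yet every block still has a live copy.
    print(read_file(layout, cluster, failed={"node0", "node1"}))
```

With four nodes and three copies per block, any two nodes can fail and the file is still readable; losing three nodes makes at least one block unavailable in this toy layout.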

MapReduce Engine

• Splits input data
• Assigns work to nodes
• Processed in parallel
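
The split, assign, and process-in-parallel flow can be sketched with a word count in plain Python. This does not use Hadoop itself: the function names are hypothetical, and a local thread pool stands in for the cluster nodes, which is an assumption made only to keep the example self-contained.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Divide the input lines into num_splits chunks, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped):
    """Group the emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(item):
    """Reduce: sum the counts collected for one word."""
    key, values = item
    return key, sum(values)


def word_count(lines, num_workers=3):
    chunks = split_input(lines, num_workers)
    # Worker threads play the role of cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_task, chunks))
        groups = shuffle(mapped)
        return dict(pool.map(reduce_task, groups.items()))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

Each map task sees only its own chunk, and each reduce task sees only the values for one key, which is what lets the work be spread across many machines.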

MapReduce Illustration

MapReduce Step 1

MapReduce Step 2

MapReduce Step 3

MapReduce Step 4

MapReduce Step 4

MapReduce Step 5

MapReduce Step 5

MapReduce Step 6

MapReduce Illustration

Resources

• Project Home: http://hadoop.apache.org/
• Wikipedia: http://en.wikipedia.org/wiki/Apache_Hadoop
• IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/