Tackling Big Data with Hadoop

TACKLING BIG DATA WITH HADOOP David Howell Sunday, September 11, 11

description

An introduction to Hadoop, presented at Vermont Code Camp 2011.

Transcript of Tackling Big Data with Hadoop

Page 1: Tackling Big Data with Hadoop

TACKLING BIG DATA WITH HADOOP

David Howell

Sunday, September 11, 11

Page 2: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

Page 3: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

Google web crawl

Page 4: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

stream of Twitter messages

Page 5: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

Annoying Farmville requests on Facebook

Page 6: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

terabyte-scale data sets

awkward to work with using traditional tools

Page 7: Tackling Big Data with Hadoop

WHAT IS BIG DATA?

requires distributed computing

Page 8: Tackling Big Data with Hadoop

MEDIUM DATA

dozens to hundreds of gigabytes

still awkward to work with using traditional tools

Page 9: Tackling Big Data with Hadoop

MAP-REDUCE

http://labs.google.com/papers/mapreduce.html

Page 10: Tackling Big Data with Hadoop

Page 11: Tackling Big Data with Hadoop

Page 12: Tackling Big Data with Hadoop

COUNTING AT SCALE

Page 13: Tackling Big Data with Hadoop

function map_1(t, search_phrase)
    emit(search_phrase, 1)

sort and shuffle

function reduce_1(search_phrase, counts)
    total = 0
    for count in counts
        total += count
    emit(search_phrase, total)

function map_2(search_phrase, total)
    emit(total, search_phrase)

sort and shuffle

function reduce_2(total, search_phrases)
    for search_phrase in search_phrases
        emit(search_phrase, total)
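The two jobs on this slide can be tried out on a laptop before touching a cluster. What follows is a minimal in-memory sketch in Python, not Hadoop's API: the helper run_map_reduce and the sample log are hypothetical, invented here for illustration, and the "sort and shuffle" step is simulated by grouping emitted values by key and visiting the keys in sorted order.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run one map -> sort-and-shuffle -> reduce pass in memory (hypothetical helper)."""
    # Map: every (key, value) input record may emit any number of pairs.
    emitted = []
    for key, value in records:
        emitted.extend(mapper(key, value))

    # Sort and shuffle: collect the values emitted under each key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)

    # Reduce: call the reducer once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def map_1(t, search_phrase):
    yield search_phrase, 1


def reduce_1(search_phrase, counts):
    yield search_phrase, sum(counts)


def map_2(search_phrase, total):
    # Swap key and value so the second shuffle sorts by total.
    yield total, search_phrase


def reduce_2(total, search_phrases):
    for search_phrase in search_phrases:
        yield search_phrase, total


# Made-up search log of (timestamp, search phrase) records.
log = [(1, "hadoop"), (2, "pig"), (3, "hadoop"),
       (4, "hive"), (5, "hadoop"), (6, "pig")]

counts = run_map_reduce(log, map_1, reduce_1)
ranked = run_map_reduce(counts, map_2, reduce_2)
print(ranked)
```

The first pass counts each phrase; the second pass reuses the same machinery purely for its sorting side effect, which is why map_2 does nothing but swap key and value.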

Page 14: Tackling Big Data with Hadoop

cat IN | sort | uniq -c > OUT

stages, left to right: map, shuffle, reduce

awk '{print $2,$1}' OUT | sort > FINAL

stages, left to right: map, shuffle, reduce

Page 15: Tackling Big Data with Hadoop

WHY BOTHER?

Page 16: Tackling Big Data with Hadoop

HADOOP

Page 17: Tackling Big Data with Hadoop

DISTRIBUTED COMPUTING PLATFORM

Page 18: Tackling Big Data with Hadoop

TOOLS IN THE PLATFORM

Higher Level APIs
• Hive
• Cascading
• Pig

Map-Reduce APIs
• Java
• C++
• UNIX pipes
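The "UNIX pipes" entry presumably refers to Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. A rough Python sketch of that style is below. It is an assumption about how one might structure such a job, not code from the talk: the function names and the simulate helper are hypothetical, and the sort between the two stages stands in for the framework's shuffle.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'phrase<TAB>1' for every non-empty input line."""
    for line in lines:
        phrase = line.strip()
        if phrase:
            yield f"{phrase}\t1"


def reducer(sorted_lines):
    """Sum the counts per phrase; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for phrase, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{phrase}\t{total}"


def simulate(lines):
    """Local stand-in for the pipeline: map, then sort (the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(simulate(["hadoop\n", "pig\n", "hadoop\n", "\n"]))
```

Because the reducer only relies on its input being sorted by key, the same two functions can be wired to standard input and output and tested with a plain shell sort between them.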

Page 19: Tackling Big Data with Hadoop

THE ORIGIN STORY

Page 20: Tackling Big Data with Hadoop

WHO’S USING IT?

Page 21: Tackling Big Data with Hadoop

HADOOP

How does it work?

Page 22: Tackling Big Data with Hadoop

Page 23: Tackling Big Data with Hadoop

Page 24: Tackling Big Data with Hadoop

Page 25: Tackling Big Data with Hadoop

Page 26: Tackling Big Data with Hadoop

DEMO!

Page 27: Tackling Big Data with Hadoop

YOUR DATA PLATFORM

ad hoc, unstructured, prototyping, experiment, data-driven

curiosity, play

Page 28: Tackling Big Data with Hadoop

LEARN MORE

http://hadoop.apache.org/

http://www.cloudera.com/

Hadoop: The Definitive Guide

@[email protected]

http://github.com/dehowell/hadoop-crypto-demo
