What is Hadoop - Cimec-120208170829-Phpapp01
7/27/2019 What is Hadoop - Cimec-120208170829-Phpapp01
Hadoop: A Hands-on Introduction
Claudio Martella, Elia Bruni
9 November 2011
Tuesday, November 8, 11
-
Outline
What is Hadoop
Why is Hadoop
How is Hadoop
Hadoop & Python
Some NLP code
A more complicated problem: Eva
-
A bit of Context
2003: first MapReduce library @ Google
2003: GFS paper
2004: MapReduce paper
2005: Apache Nutch uses MapReduce
2006: Hadoop was born
2007: first 1000-node cluster at Yahoo!
-
An Ecosystem
HDFS & MapReduce
Zookeeper
HBase
Pig & Hive
Mahout
Giraph
Nutch
-
Traditional way
Design a high-level Schema
You store data in an RDBMS
Which has very poor write throughput
And doesn't scale very much
When you talk about terabytes of data
Expensive Data Warehouse
-
BigData & NoSQL
Store first, think later
Schema-less storage
Analytics
Petabyte scale
Offline processing
-
Vertical Scalability
Extremely expensive
Requires expertise in distributed systems and concurrent programming
Lacks real fault tolerance
-
Horizontal Scalability
Built on top of commodity hardware
Easy to use programming paradigms
Fault-tolerance through replication
-
1st Assumptions
Data to process does not fit on one node.
Each node is commodity hardware.
Failure happens.
Spread your data among your nodes
and replicate it.
-
2nd Assumptions
Moving computation is cheap.
Moving data is expensive.
Distributed computing is hard.
Move computation to data,
with a simple paradigm.
-
3rd Assumptions
Systems run on spinning hard disks.
Disk seek >> disk scan.
Many small files are expensive.
Base the paradigm on scanning large files.
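A back-of-the-envelope sketch of why "disk seek >> disk scan" matters. All figures here (10 ms per seek, 100 MB/s sequential transfer, 100-byte records, a 1 TB dataset) are illustrative assumptions, not numbers from the slides.

```python
# Illustrative assumptions, not measurements: a spinning disk with
# roughly 10 ms per seek and 100 MB/s sequential transfer.
SEEK_TIME_S = 0.010
TRANSFER_RATE_BPS = 100 * 10**6

def random_access_seconds(num_records):
    """Time to touch records with one seek each (transfer time ignored)."""
    return num_records * SEEK_TIME_S

def full_scan_seconds(total_bytes):
    """Time to stream the whole dataset sequentially (initial seek ignored)."""
    return total_bytes / TRANSFER_RATE_BPS

total_bytes = 10**12      # hypothetical 1 TB dataset
record_bytes = 100        # hypothetical record size
num_records = total_bytes // record_bytes

# Seeking to just 1% of the records...
seek_days = random_access_seconds(num_records // 100) / 86400
# ...versus reading the entire dataset once.
scan_hours = full_scan_seconds(total_bytes) / 3600

print(f"seek to 1% of records: {seek_days:.1f} days")
print(f"scan everything:       {scan_hours:.1f} hours")
```

Under these assumptions, touching even a small fraction of the records by seeking takes far longer than streaming all of them, which is the motivation for a scan-based paradigm.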
-
Typical Problem
MAP:
Collect and iterate over many records
Filter and extract something from each
Shuffle & sort these intermediate results
REDUCE:
Group-by and aggregate them
Produce final output set
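The five steps can be sketched as a toy, single-process pipeline. This is only an illustration of the idea, not Hadoop's API: the names `run_mapreduce`, `word_count_mapper` and `word_count_reducer` are hypothetical, and word count is chosen as an example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory model of the typical problem's five steps."""
    # Steps 1-2: iterate over the records; the mapper filters and
    # extracts (key, value) pairs from each one.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Step 3: shuffle & sort, bringing all values of a key together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 4-5: aggregate each group and produce the final output,
    # visiting keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def word_count_mapper(line):
    # Emit (word, 1) for every whitespace-separated token.
    return [(word.lower(), 1) for word in line.split()]

def word_count_reducer(word, counts):
    return [(word, sum(counts))]

lines = ["the cat sat", "the cat ran", "a dog sat"]
result = run_mapreduce(lines, word_count_mapper, word_count_reducer)
print(result)
```

In a real cluster the map calls and the reduce calls run on different nodes, and only the shuffle moves data between them.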
-
Quick example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
(frank, index.html)
(index.html, 10/Oct/2000)
(index.html, http://www.example.com/start.html)
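A minimal sketch of map functions that would emit the three pairs above from the log line (with its spacing normalised). It assumes the line follows the Apache "combined" log format; the regular expression and the function names are illustrative, not taken from the slides.

```python
import re

# Assumption: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<day>[^:]+):(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) /?(?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

LOG_LINE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

def parse(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def map_user_page(line):
    """Which user requested which page."""
    fields = parse(line)
    return [(fields["user"], fields["path"])] if fields else []

def map_page_day(line):
    """On which day a page was requested."""
    fields = parse(line)
    return [(fields["path"], fields["day"])] if fields else []

def map_page_referrer(line):
    """From which referrer a page was reached."""
    fields = parse(line)
    return [(fields["path"], fields["referrer"])] if fields else []

for mapper in (map_user_page, map_page_day, map_page_referrer):
    print(mapper(LOG_LINE))
```

Each mapper picks a different key, so the same input can feed different analyses: pages per user, requests per page per day, or referrers per page. Lines that do not match are silently dropped, which is one common way to filter in the map step.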
-
MapReduce
Programmers define two functions:
map (key, value) → (key, value)*
reduce (key, [value+]) → (key, value)*
Can also define:
combine (key, value) → (key, value)*
partitioner: k → partition
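A small sketch of the two optional functions. The combiner pre-aggregates values on the map side, and the partitioner decides which reducer receives a key. Summation and the use of `zlib.crc32` are assumptions made for illustration; Hadoop's default HashPartitioner follows the same hash-modulo idea on the key's hash code.

```python
import zlib

def combine(key, values):
    """Map-side pre-aggregation: takes the values one mapper produced
    for a key and emits fewer pairs (here, their sum)."""
    return [(key, sum(values))]

def partition(key, num_partitions):
    """Choose the reducer for a key. crc32 is used because, unlike
    Python's built-in hash() on strings, it is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# One mapper emitted ("c", 3) and ("c", 6); the combiner ships one pair instead.
combined = combine("c", [3, 6])
print(combined)

# Every occurrence of a key lands on the same reducer.
targets = {key: partition(key, 3) for key in ["a", "b", "c"]}
print(targets)
```

A combiner is only safe when the aggregation is associative and commutative, as a sum is, because the framework may apply it zero or more times.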
-
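Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map | map | map | map
Map output: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 9)
Shuffle and Sort: aggregate values by keys
(a, [1, 5]) (b, [2, 7]) (c, [2, 3, 6, 9])
reduce | reduce | reduce
Output: (r1, s1) (r2, s2) (r3, s3)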
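The shuffle-and-sort step of the diagram can be reproduced with a sort followed by a group-by. The pairs below are the map outputs from the diagram; summing in the reduce step is an assumption, since the diagram only names its results r1, s1 and so on.

```python
from itertools import groupby
from operator import itemgetter

# Output of the four map tasks in the diagram.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 9)],
]

def shuffle_and_sort(per_mapper_outputs):
    """Merge all map outputs and group the values by key."""
    pairs = [pair for output in per_mapper_outputs for pair in output]
    # Sorting whole tuples also orders the values within a key. Hadoop
    # only guarantees key order; this is done to keep the output deterministic.
    pairs.sort()
    return [
        (key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

grouped = shuffle_and_sort(map_outputs)
print(grouped)

# Hypothetical reduce function: sum the values of each key.
reduced = [(key, sum(values)) for key, values in grouped]
print(reduced)
```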
-
MapReduce daemons
JobTracker: it's the Master; it schedules the jobs, assigns tasks to nodes, collects heartbeats from workers, and reschedules tasks for fault tolerance.
TaskTracker: it's the Worker; it runs on each slave node and runs (multiple) Mappers and Reducers, each in its own JVM.
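A toy model of the heartbeat idea: the master remembers when each worker last reported and moves tasks away from workers that have gone silent. The class `ToyJobTracker`, its methods and the 10-second timeout are hypothetical and much simpler than the real daemons.

```python
class ToyJobTracker:
    """Tracks the last heartbeat per worker and reassigns tasks from
    workers that have been silent for longer than `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}     # worker -> time of last heartbeat
        self.assignments = {}   # task -> worker

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.assignments[task] = worker

    def reschedule_failed(self, now):
        """Move tasks off overdue workers; return {task: new worker}."""
        alive = sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )
        moved = {}
        if not alive:
            return moved
        for task, worker in sorted(self.assignments.items()):
            if worker not in alive:
                # Spread the orphaned tasks over the live workers in turn.
                new_worker = alive[len(moved) % len(alive)]
                self.assignments[task] = new_worker
                moved[task] = new_worker
        return moved

tracker = ToyJobTracker(timeout=10)
tracker.heartbeat("w1", now=0)
tracker.heartbeat("w2", now=0)
tracker.assign("map-0", "w1")
tracker.assign("map-1", "w2")
tracker.heartbeat("w1", now=12)          # w2 stays silent
moved = tracker.reschedule_failed(now=15)
print(moved)
```

Times are passed in explicitly rather than read from a clock so that the example is deterministic.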
-
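MapReduce execution overview (figure)
(1) The user program forks the master and the workers.
(2) The master assigns map tasks and reduce tasks to workers.
(3) Map workers read the input splits (split 0 to split 4).
(4) Map workers write intermediate files to their local disk.
(5) Reduce workers read the intermediate files remotely.
(6) Reduce workers write the output files (output file 0, output file 1).
Stages: Input files, Map phase, Intermediate files (on local disk), Reduce phase, Output files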
-
HDFS daemons
NameNode: it's the Master; it keeps the filesystem metadata (in memory) and the file-block-node mapping, decides replication and block placement, and collects heartbeats from nodes.
DataNode: it's the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
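A toy sketch of the file-block-node mapping a NameNode keeps. The 64 MB block size comes from the slide. The replication factor of 3 and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware), and the function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
REPLICATION = 3                 # assumption: a commonly used HDFS default

def num_blocks(file_size_bytes):
    """Number of blocks needed for a file (ceiling division)."""
    return -(-file_size_bytes // BLOCK_SIZE)

def place_blocks(file_size_bytes, datanodes, replication=REPLICATION):
    """Toy NameNode mapping: block index -> DataNodes holding a replica.
    Placement is simple round-robin over the node list."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    mapping = {}
    for block in range(num_blocks(file_size_bytes)):
        mapping[block] = [
            datanodes[(block + r) % len(datanodes)]
            for r in range(replication)
        ]
    return mapping

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(200 * 1024 * 1024, nodes)   # hypothetical 200 MB file
for block, replicas in layout.items():
    print(block, replicas)
```

With every block stored on three different nodes, losing any single DataNode leaves at least two copies of each block available.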
-
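GFS architecture (figure)
GFS client to master: (file name, chunk index)
Master to GFS client: (chunk handle, chunk location)
Master: file namespace (/foo/bar, chunk 2ef0)
Master to chunkserver: instructions to chunkserver
Chunkserver to master: chunkserver state
GFS client to chunkserver: (chunk handle, byte range)
Chunkserver to GFS client: chunk data
GFS chunkservers store chunks on the Linux file system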
-
Take home recipe
Scan-based computation (no random I/O)
Big datasets
Divide-and-conquer class algorithms
No communication between tasks
-
Not good for
Real-time / Stream processing
Graph processing
Computation without locality
Small datasets
-
Questions?
-
Our solution
line format:[]*
Dense matrix:
0 1.3 0 0 7.1 1.1
1.2 0 0 0 0 3.4
0 5.7 0 0 1.1 2
5.1 0 0 4.6 0 10
0 0 0 1.6 0 0
Sparse rows (non-zero values only):
1.3 7.1 1.1
1.2 3.4
5.7 1.1 2
5.1 4.6 10
1.6
for example: cat12.131305.134.6510
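A sketch of turning the dense rows of the matrix into a sparse, one-line-per-row representation. The exact line format did not survive in the transcript, so the `label index:value` layout used here is a hypothetical stand-in, as are the function names and the pairing of the label "cat" with the fourth row.

```python
def to_sparse(row):
    """Keep only the non-zero entries as (column index, value) pairs."""
    return [(index, value) for index, value in enumerate(row) if value != 0]

def format_line(label, row):
    """Hypothetical text format: a label followed by index:value pairs."""
    pairs = [f"{index}:{value}" for index, value in to_sparse(row)]
    return " ".join([label] + pairs)

dense = [
    [0, 1.3, 0, 0, 7.1, 1.1],
    [1.2, 0, 0, 0, 0, 3.4],
    [0, 5.7, 0, 0, 1.1, 2],
    [5.1, 0, 0, 4.6, 0, 10],
    [0, 0, 0, 1.6, 0, 0],
]

for row in dense:
    print(to_sparse(row))

print(format_line("cat", dense[3]))
```

Storing only the non-zero entries keeps each record small and self-contained on one line, which suits line-oriented, scan-based processing.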
-
Benchmarking
serial python (single-core): 7 minutes
java+hadoop (single-core): 2 minutes
serial python (big file): 18 days
java+hadoop (parallel, big file): 8 hours
it makes sense: 7 min / 2 min = 3.5; 18 d / 3.5 = 5.14 d; 5.14 d / 14 ≈ 8.8 h, close to the measured 8 hours
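The sanity check can be written out step by step. The timings are the ones on the slide; reading the divisor 14 as the degree of parallelism is an assumption, since the slide does not say what it stands for.

```python
serial_small_min = 7.0     # serial python, single core
hadoop_small_min = 2.0     # java+hadoop, single core
serial_big_days = 18.0     # serial python, big file
parallelism = 14           # assumption: the slide's divisor, presumably parallel workers

# Speedup of java+hadoop over serial python on a single core.
speedup = serial_small_min / hadoop_small_min

# Expected single-core java+hadoop time on the big file.
single_core_days = serial_big_days / speedup

# Expected parallel time, assuming perfect scaling.
parallel_hours = single_core_days / parallelism * 24

print(f"speedup:           {speedup:.1f}x")
print(f"single-core java:  {single_core_days:.2f} days")
print(f"parallel estimate: {parallel_hours:.1f} hours")
```

The estimate lands slightly above the measured 8 hours, so the observed run time is consistent with near-linear scaling.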