Transcript of: A NEW PLATFORM FOR A NEW ERA Immersion V1.1 - Capgemini WebEx.pdf
A NEW PLATFORM FOR A NEW ERA
Capgemini / Pivotal Alliance Confidential - Do Not Distribute | August 2014 “Just Do It” Program | © Copyright 2014 Pivotal. All rights reserved.
Hadoop Immersion v1.1 Internal Use Only Do Not Distribute
Welcome!
Agenda

• Hadoop Overview
  – Why Hadoop
  – History
• All About HDFS
  – What is HDFS
  – Design Assumptions
  – HDFS Architecture
  – About the NameNode
• HDFS – NFS Bridge
  – What is the HDFS – NFS Bridge
• All About MapReduce
  – What is MapReduce
  – How does MapReduce work
  – Architecture
  – Fault Tolerance
A NEW PLATFORM FOR A NEW ERA
Hadoop Overview
Hadoop Overview
• Why Hadoop?
• History of Hadoop
Why Hadoop?
Why Hadoop?
• Data . . . It's Everywhere
• Traditional Computing
• Is Hadoop The Solution?
• Core Hadoop
Big Data! - Big Problem!
• The 3 main characteristics of big data: Volume, Velocity, Variety
• Volume
  – Need to manage lots of data
  – Social enterprises and the Industrial Internet generate lots of data
• Velocity
  – Need fast read / write access to data - measured in nano- or microseconds
• Variety
  – Today's agile businesses require rapid response to changing requirements
  – Fixed schemas are too rigid; need 'emerging' schemas
  – Data may not fit into the traditional 'relational' and 'transactional' data model of an RDBMS
Where Does The Data Come From?
• Social Media
• Log Files
• Video Networks
• Sensor Data
• Transactions ( retail / banking / stock market / etc )
• e-mail / text messaging
• Legacy Documents
There is Value in The Data!
• The value depends on the use case
  – Fraud Detection
  – Marketing Analysis
  – Threat Analysis
  – Forecasting
  – Recommendation Engines
  – Trade Surveillance
  – . . .
Why Hadoop?
• Data . . . It's Everywhere
• Traditional Computing
• Is Hadoop The Solution?
• Core Hadoop
Monolithic Computing Model
• Processor bound
  – Very fast with small amounts of data
• The solution was to build bigger and faster computers
  – More CPUs, more memory
• Had serious limitations
  – Expensive, did not scale as data volumes increased
How About Distributed Computing?

'In pioneer days they used oxen for heavy pulling, and when one ox was not sufficient to accomplish a task, we did not try to grow a larger ox.'
- Admiral Grace Hopper
About Distributed Computing
• Processing is distributed across a cluster of machines
  – Multiple nodes
  – Common frameworks include MPI, PVM and Condor
• Primarily focused on distributing the processing workload
  – Powerful compute nodes
  – Data typically on a separate storage appliance

MPI - Message Passing Interface; PVM - Parallel Virtual Machine; Condor - batch queueing
The Challenge of Distributed Processing
• Does not scale gracefully
  – Data has to be copied to be processed
  – The problem escalates as more nodes are added to the cluster
• Hardware failure
  – Distributing data means a failure could corrupt part of it
  – Replication can be a solution
  – Requires 'management' of the pieces
• Sort, combine and analyze data from multiple systems
  – Finding ways to tie related data together
Why Hadoop?
• Data . . . It's Everywhere
• Traditional Computing
• Is Hadoop The Solution?
• Core Hadoop
Ideal Cluster Requirements
• Linear horizontal scalability
  – Adding nodes should increase capacity proportionally
  – Avoid contention with a 'shared nothing' architecture
  – Expandable at a reasonable cost
• Jobs run in relative isolation
  – Results independent of other running jobs
  – Reasonable performance degradation due to concurrency
• Simple programming model
  – Simple API
  – Multiple language support
Ideal Cluster Requirements ( continued )
• Failure Happens - Handle It Efficiently

Ideal Failure-Handling Properties:
  – Automatic: Jobs complete without intervention
  – Transparent: Failed tasks are restarted
  – Graceful: Proportional loss of capacity
  – Recoverable: Restoration of a failed component restores lost capacity
  – Consistent: No corruption or invalid results
Why Hadoop?
• Data . . . It's Everywhere
• Traditional Computing
• Is Hadoop The Solution?
• Core Hadoop
The Essence of Hadoop
• Distributed file system (HDFS)
  – Reliable shared / distributed storage of large data files
  – Potential for redundant copies for reliability
• Flexible analysis (MapReduce)
  – Abstracts the process of reading and writing data
  – Breaks raw data into keys and values for tailored processing
• In addition to HDFS and MapReduce, Hadoop includes the infrastructure that makes these components work
  – Web interface
  – Monitoring and scheduling tools
  – Filesystem utilities
Hadoop Eco-System
• Eco-system tools available to integrate with Hadoop
  – Hive and Pig - Data Analysis
  – Tableau and Datameer - Data Visualization
  – Sqoop - Data Integration
  – Oozie - Workflow Management
  – Puppet and Chef - Cluster Management
• These are not 'Core Hadoop' but rather eco-system tools
  – Many ( but not all ) are top-level Apache projects
Hadoop - A Different Approach
• Distributed computing typically requires
  – Complex synchronization code
  – Expensive fault-tolerant hardware
  – Redundant high-performance storage appliances
• Hadoop's approach, based on the Google File System and MapReduce whitepapers, addresses the problems implicit in distributed systems
Hadoop Scalability
• Hadoop's goal is to achieve linear horizontal scalability
  – Minimal chatter between nodes ( shared nothing environment )
  – Scale horizontally by adding nodes to increase capacity and / or performance
• Components in a cluster will fail
  – Provision using widely available commodity hardware
  – Provision what you currently need and 'scale out' when necessary
Data Access Is A Bottleneck
• Moving data from a storage appliance to the processors in a traditional distributed system is very time consuming
• Store and process the data on the same machines
  – Eliminates the need to copy data between nodes
• Process data intelligently with data locality
  – Bring the computation to the data
  – Process data on the same node where it is stored when possible
Disk Performance Is A Bottleneck
• Disk technology has advanced significantly
  – Single large-capacity disks are readily available
  – Scan performance has not scaled as well as capacity has
• Take advantage of multiple disks in parallel
  – Assuming 1 disk, 3TB of data and a transfer rate of 300MB / s:
    ▪ About 2h 47m to read the 3TB file
  – Assuming the 3TB is distributed across 1000 disks, they can transfer roughly 300GB / s in aggregate:
    ▪ 10 seconds to read the same 3TB file
• Co-located storage and processing makes this possible
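The arithmetic above can be checked with a short calculation. The 300MB/s per-disk rate is the slide's assumption, not a measured figure, and decimal units (1 TB = 1,000,000 MB) are used:

```python
def read_time_seconds(data_tb, disks, per_disk_mb_s=300):
    """Time to scan data_tb terabytes spread evenly across `disks` disks,
    assuming the disks are read in parallel at per_disk_mb_s each."""
    data_mb = data_tb * 1_000_000            # 1 TB = 1,000,000 MB (decimal)
    aggregate_mb_s = disks * per_disk_mb_s   # combined transfer rate
    return data_mb / aggregate_mb_s

single = read_time_seconds(3, 1)       # 10,000 s, about 2h 47m
parallel = read_time_seconds(3, 1000)  # 10 s
print(single, parallel)
```

The thousand-fold speedup is exactly the disk count: throughput scales linearly when every disk contributes in parallel.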
Complex Processing Code
• Code for a distributed computing environment is complex
• Hadoop's infrastructure abstracts away complex programming requirements
  – No synchronization code
  – No networking code
  – No file I / O code
• MapReduce developers focus on programs that provide value to the business
  – Typically written in Java or with Hadoop Streaming
Fault Tolerance
• Traditional distributed systems rely on expensive, fault-tolerant components
• Failure is inevitable - plan for it
  – Minimize the effect of failure
  – Hadoop does this very well
• Machine failure is a common and regular occurrence
  – Consider a server with an MTBF of 5 years ( roughly 1825 days )
  – In a 2000-node cluster that is roughly one failure per day
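The one-failure-per-day claim follows from simple division; this sketch treats MTBF as a constant average rate with independent node failures, which is a simplification:

```python
def expected_failures_per_day(nodes, mtbf_years):
    """Expected node failures per day across a cluster, assuming each node
    fails independently at an average rate of 1 / MTBF."""
    mtbf_days = mtbf_years * 365   # ~1825 days for a 5-year MTBF
    return nodes / mtbf_days

rate = expected_failures_per_day(2000, 5)  # ~1.1 failures per day
print(rate)
```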
The History of Hadoop
Prior To Hadoop
• Nutch - open source web search engine - 2002
  – Created by Doug Cutting and Mike Cafarella
• Google whitepapers
  – Google File System - 2003
  – MapReduce - 2004
• Nutch re-architecture
  – Doug Cutting - 2005
Early Hadoop Years
• Hadoop factored out of Nutch
  – Sub-project of Nutch - 2006
  – Later became a top-level Apache project - 2008
• Much of the early development was led by Yahoo!
Hadoop Today
• Hadoop is mainstream
• Many eco-system projects have been spawned
  – Most are top-level Apache projects
  – Hive, Pig, Oozie, Flume, HBase and others
• Many organizations have migrated to Hadoop
  – This has increased the focus on 'enterprise-level' features and tools
• Hadoop is evolving and changing very frequently
A NEW PLATFORM FOR A NEW ERA
About HDFS
About HDFS
• What is HDFS?
• Basic HDFS Assumptions
• Architecture
• About the NameNode
• Lab Exercise - Getting Familiar With HDFS
What is HDFS?
What is HDFS?
• Hadoop Distributed File System, inspired by the Google File System (GFS)
• Features:
  – High-performance file system for storing data
  – Relatively simple centralized management: master / slave architecture
  – Fault-tolerant: data replication
  – Optimized for MapReduce processing - data locality
  – Scalability: scale horizontally for additional capacity / performance
• Java application - slower but more portable
• Other file systems are supported (FTP, S3)
Basic HDFS Assumptions
Basic HDFS Assumptions
• Cluster components fail
  – Use commodity hardware
• Files are write once / read many
• Leverage large streaming reads, not random access
• Favors high sustained throughput over low latency
• Modest number of HUGE files
  – Multi-gigabyte files are typical
Architecture
HDFS Architecture
• Blocks
• Components
• HDFS - Writing / Reading
• Interacting With HDFS
HDFS Blocks
• When a file is added to HDFS it is split into blocks
  – Similar to how files are stored on any filesystem
  – Default block size is 64MB
  – Block size is configurable
• Blocks are replicated across the cluster
  – Based on the replication factor ( default is 3 )
  – Replication supports the data locality goal by making data more available
Multi-Block Replication Pipeline
HDFS Architecture
• Blocks
• Components
• HDFS - Writing / Reading
• Interacting With HDFS
Classic HDFS Components
• HDFS has three main components
  – NameNode (single)
    ▪ Manages the file-system content tree
    ▪ Manages file & directory metadata
    ▪ Manages DataNodes and the blocks they hold
  – Checkpoint NameNode (single)
    ▪ Also known as the 'secondary' or 'backup' NameNode
    ▪ Not an active standby of the NameNode
    ▪ Merges in-memory and disk-based NameNode metadata
    ▪ Keeps a copy of this data, but it is almost always out of date
  – DataNodes (multiple)
    ▪ Store & retrieve data blocks
    ▪ Report block usage to the NameNode
Classic HDFS Architecture
[Diagram: two master nodes - the NameNode and the Checkpoint NameNode - above six slave DataNodes]
NameNode
• Persistently stores all metadata:
  – File-to-block mappings
  – File-system tree
  – File ownership and permissions
• Metadata is stored on disk and read into RAM when the NameNode daemon starts
• Transiently stores block-to-DataNode mappings
• Changes to the metadata are held in RAM
  – The NameNode requires a lot of memory
• All file operations start with the NameNode
• Single point of failure
NameNode Storage
• Edit log
  – In-memory write-ahead log of modification operations
• edits file
  – Disk-based file of edits
  – The log is flushed to the file after writes, before returning to the client
• fsimage file
  – Disk-based file of complete file-system metadata
  – Write-behind asynchronous merges occur periodically
• Current state is obtained by loading fsimage and replaying edits
Checkpoint NameNode
• Also known as the 'secondary' NameNode
• The Checkpoint NameNode is not an active standby of the NameNode
• Performs the merging of the edits and fsimage files
• Merging requires all metadata to be held in memory
• Has similar memory requirements to the NameNode
Updating fsimage
• The NameNode starts writing to the edits.new file
• edits and fsimage are copied to the Checkpoint NameNode
• The Checkpoint NameNode applies edits to fsimage and writes the result as fsimage.ckpt
• fsimage.ckpt is copied to the NameNode
• The NameNode replaces fsimage with fsimage.ckpt
• The NameNode replaces edits with edits.new
• The NameNode stores the merge time in fstime
DataNodes
• The actual contents of files are stored as blocks on the DataNodes
• Blocks are files on the DataNode's underlying filesystem
  – Named: blk_xxxxxxx
  – A DataNode cannot associate any block it stores with a file
• Each block is stored on multiple DataNodes
  – Default replication factor is 3
• Each DataNode runs a DataNode daemon
  – Controls access to the blocks of data
  – Communicates with the NameNode
What Is Contemporary HDFS?
• The release of the 0.23 branch of Hadoop ( remember the confusing versioning ) introduced contemporary HDFS
• Introduced a High Availability NameNode option
  – Passive NameNode
• Introduced a Federation option
  – Multiple NameNodes
• We will address these options in the Advanced Configuration module
HDFS Architecture
• Blocks
• Components
• HDFS - Writing / Reading
• Interacting With HDFS
HDFS Write
[Diagram: HDFS write path. The HDFS client runs in a client JVM on the client node, using DistributedFileSystem and FSDataOutputStream; the NameNode and each of the three DataNodes run in their own JVMs.]
  1: create ( HDFS client → DistributedFileSystem )
  2: create ( DistributedFileSystem → NameNode )
  3a: write ( HDFS client → FSDataOutputStream )
  3b: DataStreamer requests block allocation from the NameNode ( as new blocks are required )
  4a / 4b / 4c: write packet ( pipelined from DataNode to DataNode )
  5a / 5b / 5c: ack packet ( acknowledgements flow back through the pipeline )
  6: close
  7: complete ( DistributedFileSystem → NameNode )
Writing To HDFS
Replication Strategy
• The first copy of a block is written to the same node the client is on
  – If the client is not part of the cluster, the first copy is written to a random node that is not too busy
• The second copy of the block is written to a node in a different rack
• The third copy of the block is written to a different node in the same rack as the second copy
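The placement rules above can be sketched as follows. The function and data structures are illustrative only, not Hadoop's actual BlockPlacementPolicy code, and "not too busy" is simplified to a uniform random choice:

```python
import random

def place_replicas(client_node, racks):
    """Sketch of HDFS default placement for a replication factor of 3.
    `racks` maps rack name -> list of node names; assumes at least two
    racks, each with at least two nodes."""
    all_nodes = [n for nodes in racks.values() for n in nodes]
    # 1st copy: on the client's own node if it is a cluster member,
    # otherwise on a random node.
    first = client_node if client_node in all_nodes else random.choice(all_nodes)
    first_rack = next(r for r, nodes in racks.items() if first in nodes)
    # 2nd copy: a node in a different rack.
    second_rack = random.choice([r for r in racks if r != first_rack])
    second = random.choice(racks[second_rack])
    # 3rd copy: a different node in the same rack as the 2nd copy.
    third = random.choice([n for n in racks[second_rack] if n != second])
    return [first, second, third]

racks = {"rackA": ["a0", "a1", "a2"], "rackB": ["b0", "b1", "b2"]}
replicas = place_replicas("a1", racks)
```

Whatever the random choices, the invariants hold: the first copy is local to the client, and the second and third copies share a rack that differs from the first copy's rack.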
Data Locality
• Key factors in achieving performance
  – Network bandwidth
  – Disk seek time
  – Node processing power
• Hadoop emphasizes the importance of data locality
  – Processing as close as possible to the data (notably the mapper)
  – Minimize use of network bandwidth
• Terms for data locality (in descending order of preference)
  – Local
  – On-Rack
  – Off-Rack
Data Locality - Local
• DataNode on the same host as the processing (mapper)

[Diagram: Rack A ( Nodes A.0 - A.2 ) and Rack B ( Nodes B.0 - B.2 ); the mapper reads its input split from HDFS on the same node]
Data Locality - On Rack

• DataNode on the same rack as the processing node (mapper)

[Diagram: Rack A ( Nodes A.0 - A.2 ) and Rack B ( Nodes B.0 - B.2 ); the mapper reads its input split from another node in the same rack over a very high-speed connection]
Data Locality - Off Rack

• DataNode on a different rack from the processing node (mapper)

[Diagram: Rack A ( Nodes A.0 - A.2 ) and Rack B ( Nodes B.0 - B.2 ); the mapper reads its input split from a node in another rack over a medium- to high-speed connection]
HDFS Read
[Diagram: HDFS read path. The HDFS client runs in a client JVM on the client node, using DistributedFileSystem and FSDataInputStream; the NameNode and each of the three DataNodes run in their own JVMs.]
  1: open ( HDFS client → DistributedFileSystem )
  2: request file block locations ( DistributedFileSystem → NameNode )
  3: read ( HDFS client → FSDataInputStream )
  4: read from block ( first DataNode )
  5: read from block ( next DataNode )
  6: close
Reading From HDFS
Data Corruption
• Hadoop employs checksums to ensure block integrity
• As a block is read, a checksum is calculated and compared against the checksum calculated and stored when the block was written
  – Fast to calculate and space-efficient
• If the checksums differ, the client reads the block from the next DataNode in the list provided by the NameNode
  – The NameNode will re-replicate the corrupted block elsewhere
• DataNodes verify block checksums periodically to avoid 'bit rot'
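The read-side verification can be illustrated with CRC32. This is a simplified sketch that checksums a whole block at once, whereas HDFS actually keeps a CRC32 checksum per small chunk of data; the function names are illustrative:

```python
import zlib

def store_block(data: bytes):
    """On write: compute a checksum and keep it alongside the block."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_checksum: int) -> bytes:
    """On read: recompute the checksum and compare with the stored one.
    A mismatch means this replica is corrupt; the client would then try
    the next DataNode in the NameNode's list."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch - read block from another replica")
    return data

block, checksum = store_block(b"some block contents")
assert read_block(block, checksum) == b"some block contents"
```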
Data Reliability and Recovery
• DataNodes send heartbeats to the NameNode every 3 seconds
• If no heartbeats are received, the DataNode is considered lost
  – The NameNode determines which blocks were on the lost DataNode
  – The NameNode finds other DataNodes with valid copies of those blocks
  – It instructs the DataNodes with the valid copies to replicate those blocks to other DataNodes in the cluster
HDFS Architecture
• Blocks
• Components
• HDFS - Writing / Reading
• Interacting With HDFS
Interacting with HDFS
• Primary interface is a Java CLI app - the Hadoop shell
• Java API and others ( C++, Python )
• Web GUI for read-only access
• HFTP provides an HTTP/S read-only view
  – No relation to FTP
• WebHDFS provides a read / write RESTful interface
• FUSE and MapR allow mounting HDFS as a standard filesystem

CLI - Command-Line Interface; FUSE - Filesystem in Userspace
HDFS Shell Commands
• Parallels many common Linux commands

$ hadoop fs -ls /
$ hadoop fs -tail /path/on/hdfs/file

Commands | Purpose
ls, lsr, count, du, dus, stat, test | Display file-system information
cat, tail, text | Display file contents
cp, mkdir, mv, rm, rmr, touchz | Manipulate remote objects
chgrp, chmod, chown | Change security permissions
HDFS Unique Shell Commands
$ hadoop fs -put /my/local/file /path/on/hdfs
$ hadoop fs -getmerge /directory/with/files /my/local/path
$ hadoop fs -setrep -w 5 /path/on/hdfs/file
$ hadoop fs -expunge
$ hadoop fs -text /path/on/hdfs/file.zip

• Commands unique to HDFS

Commands | Purpose
get, put, copyFromLocal, copyToLocal | Move files between the local file-system and HDFS
getmerge | Concatenate files in a directory and copy to local
setrep | Set the replication factor of a file
expunge | Empty trash ( 'deleted' files are moved to /trash first )
text | Display zip or TextRecordInputStream files as text
HDFS Admin Commands
• Administration commands for the HDFS cluster

$ hadoop dfsadmin -safemode enter
$ hadoop dfsadmin -setQuota 100 /path/on/hdfs
$ hadoop dfsadmin -setSpaceQuota 50g /path/on/hdfs

Commands | Purpose
report | Display status of the cluster
safemode | Toggle safemode, which prevents writes when enabled
finalizeUpgrade | Remove the backup of the cluster made in the last upgrade
refreshNodes | Re-apply the rules of which hosts can participate in the cluster
setQuota, clrQuota | Manage object-count quotas for directories
setSpaceQuota, clrSpaceQuota | Manage space ( object-size ) quotas for directories
HDFS Web UI
• Default location is:
  – http://namenode:50070/
• Summarizes the status of the cluster
• Read-only view of the file-system
• DataNodes present status at:
  – http://datanode:50075/
About The NameNode
Is The NameNode A Bottleneck?
• No data ever traverses the NameNode
  – During reads
  – During writes
  – During replication
  – During recovery operations
NameNode Memory Usage
• All metadata is held in RAM for quick response
• Each entry in the metadata is referred to as an 'item'
• An 'item' can be:
  – A filename
  – File permissions
  – Block names
  – Additional block information
• Each 'item' consumes roughly 150 to 200 bytes of RAM
NameNode Memory Usage
• HDFS / the NameNode prefers fewer, larger files because of the metadata that must be stored
• Consider a 1GB file with the default block size of 64MB
  – Stored as a single 1GB file:
    ▪ Name: 1 item
    ▪ Blocks: 16 * default replication factor ( 3 ) = 48 items
    ▪ Total number of items to account for this file: 49
  – Stored as 1000 individual 1MB files:
    ▪ Names: 1000 items
    ▪ Blocks: 1000 * default replication factor ( 3 ) = 3000 items
    ▪ Total number of items to account for these files: 4000
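The item counts above can be reproduced with a short calculation, using the slide's simplified model of one item per file name plus one item per stored block replica:

```python
import math

def namenode_items(file_size_mb, files=1, block_size_mb=64, replication=3):
    """NameNode metadata items for `files` equally sized files:
    1 name item per file plus 1 item per stored block replica."""
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return files * (1 + blocks_per_file * replication)

print(namenode_items(1024))           # one 1GB file   -> 49 items
print(namenode_items(1, files=1000))  # 1000 x 1MB     -> 4000 items
```

At 150 to 200 bytes per item, the same gigabyte of data costs the NameNode roughly 80x more memory when stored as a thousand small files.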
Lab Exercise Getting Familiar With HDFS
Getting Familiar With HDFS
Lab Exercise
• Getting ‘Help’ in HDFS
• Navigating HDFS
• Loading files into HDFS
• Verifying data in HDFS
A NEW PLATFORM FOR A NEW ERA
HDFS – NFS Gateway
NFS Gateway
• Allows HDFS to be mounted as part of the client's local file system
• Supports a variety of client operating systems
  – Linux / Unix
  – Windows
  – Macintosh
  – Others
• Browse HDFS via the local file system
• Download files from HDFS to the local file system
• Upload files from the local file system to HDFS
• Stream data directly into HDFS through the mount point
NFS Gateway Architecture
NFS Gateway Configuration
• The NFS Gateway machine needs the same things required to run an HDFS client
  – Hadoop jar files
  – HADOOP_CONF directory
• The NFS Gateway can be on the same host as the NameNode, a DataNode or any HDFS client
Lab Exercise HDFS – NFS Gateway
Lab Exercise - NFS Gateway
• Manipulate files in HDFS using hadoop fs or hdfs dfs commands
• Manipulate files in HDFS via the NFS mount
All About MapReduce
About MapReduce
� What is MapReduce?
� Architecture
� Fault Tolerance
� How Does MapReduce Work?
� Running MapReduce Jobs
� Lab Exercise – Run a MapReduce Job
What is MapReduce?
MapReduce Is . . .
� MapReduce is a programming model for processing large volumes of data – Record-oriented data processing ( key / value pairs )
� The goal of MapReduce is automatic parallelization and distribution of tasks
� Ideally each node processes data stored locally on that node
� MapReduce incorporates two developer created phases: – Map phase – performs filtering and sorting – Reduce phase – performs summary operations
� Between the Map and Reduce phases is Sort / Shuffle / Merge which sends data from the Mappers to the Reducers
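To make the two phases and the Sort / Shuffle / Merge step concrete, here is a small in-memory simulation in plain Java. This is not Hadoop code — the class and method names are invented for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // Map phase: one record in, zero or more (word, 1) pairs out
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sort / Shuffle / Merge: group every value emitted for the same key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: summarize the grouped values (here, a sum)
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("the quick brown fox"));
        pairs.addAll(map("the lazy dog"));
        System.out.println(reduce(shuffle(pairs))); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop the three steps run on different machines and the "grouping" happens over the network, but the data flow is the same.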
MapReduce Is . . .
� MapReduce programs are generally written in Java ( as is Hadoop itself ) – Other language support is available using Hadoop Streaming
� MapReduce framework abstracts all housekeeping requirements from the developer
� Fault tolerant
Architecture
MapReduce v1.0 or MapReduce v2.0 / YARN
� Both manage compute resources, jobs & tasks
� Job API the same
� MapReduce v1.0 is proven in production
� MapReduce v2.0 - a new framework that leverages YARN – Yet Another Resource Negotiator – YARN was created to help Yahoo! scale beyond 4,000 nodes
▪ Still classified as alpha ▪ Needed only by the largest clusters
– YARN is a generic distributed processing platform ▪ MapReduce v2.0 is just one application that runs on YARN
Task Management System Components
YARN – Yet Another Resource Negotiator

MapReduce v1.0                          | MapReduce v2.0 - YARN                      | Purpose
Client                                  | Client                                     | Submits the MapReduce job
JobTracker (single)                     | Resource Manager (single)                  | Manages scheduling & resources
JobTracker (single)                     | MapReduce Application Master (one per job) | Co-ordinates the job run (starts, monitors progress, completes)
TaskTrackers (multiple)                 | Node Managers (multiple)                   | Run map & reduce tasks
Distributed File-System (normally HDFS) | Distributed File-System (normally HDFS)    | Shares data between the other components
Starting Job – MapReduce v1.0

(Diagram: a Client JVM on the Client Node, the JobTracker JVM on the JobTracker Node, a TaskTracker and its child JVM on the TaskTracker Node, and a shared file-system, e.g. HDFS)

1: initiate job (MapReduce program, in the client JVM)
2: request job ID (client → JobTracker)
3: copy job jars, config (client → shared file-system)
4: submit job (client → JobTracker)
5: initialize job (JobTracker)
6: determine input splits (JobTracker, from the shared file-system)
7: heartbeat, returns task (TaskTracker → JobTracker)
8: retrieve job jars, data (TaskTracker ← shared file-system)
9: launch child JVM (TaskTracker)
10: run Mapper or Reducer (in the child JVM)
Starting Job – MapReduce v2.0

(Diagram: a Client JVM on the Client Node, the ResourceManager JVM on its own node, and Node Manager Nodes hosting the MRAppMaster JVM and YARN child JVMs, all sharing a file-system, e.g. HDFS)

1: initiate job (MapReduce program, in the client JVM)
2: request new application (client → ResourceManager)
3: copy job jars, config (client → shared file-system)
4: submit job (client → ResourceManager)
5a: start container (ResourceManager → Node Manager)
5b: launch MRAppMaster JVM (Node Manager)
5c: initialize job (MRAppMaster)
6: determine input splits (MRAppMaster, from the shared file-system)
7a: allocate task resources (MRAppMaster → ResourceManager)
7b: start container (MRAppMaster → Node Manager)
8: launch YARN child JVM (Node Manager)
9: retrieve job jars, data (YARN child ← shared file-system)
10: run Mapper or Reducer (in the YARN child JVM)
Fault Tolerance
Fault Tolerance
• Task processes periodically send heartbeats up the chain
• A task that fails to heartbeat within the timeout period is presumed dead – The task's JVM is killed
• A task that throws an exception is considered failed
• Failed tasks are rescheduled – Ideally on a different node than the one they failed on
• If a task fails four times, the job fails
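A minimal sketch of that retry policy in plain Java — the class and method names are invented for illustration; in real Hadoop the JobTracker (or Application Master) tracks attempts per task and prefers a different node for each retry:

```java
import java.util.function.Supplier;

public class TaskRetry {

    static final int MAX_ATTEMPTS = 4; // after the fourth failure the job fails

    // Run a task, rescheduling it on failure (names invented for illustration)
    static <T> T runWithRetries(Supplier<T> task) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return task.get(); // a thrown exception marks this attempt failed
            } catch (RuntimeException e) {
                last = e; // reschedule and try again
            }
        }
        throw new RuntimeException("job failed after " + MAX_ATTEMPTS + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A flaky task that succeeds on its third attempt
        int result = runWithRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("simulated task failure");
            return 42;
        });
        System.out.println(result + " after " + calls[0] + " attempts"); // 42 after 3 attempts
    }
}
```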
How Does MapReduce Work
MapReduce Terminology
� A MapReduce job consists of: – Inputs – Tasks
� A task is: – A unit of work – A job is broken into many tasks – Each task is either a map task or a reduce task
� Map tasks read input splits – Generate ‘intermediate’ results
� Reduce tasks process ‘intermediate’ results – Generate ‘final’ results
Anatomy Of Job Submission
� A job is submitted from a client – The job is assigned a jobID – The client determines the number of input splits for the job – Client copies the jar file and configuration items to HDFS
� A map task is created for each input split – Periodic heartbeats between slave and master nodes
▪ Identify readiness of the slave node to execute tasks – Master node assigns tasks to slave nodes with available capacity
� A JVM is instantiated on the slave node and the task runs in that JVM – Tasks run in relative isolation – Task status is periodically sent to the master node
Basic Tenets of MapReduce
� Thought Keys . . . – A Mapper processes a single record at a time and emits a key / value pair – The intermediate key / value pairs produced by the Mappers are passed to the Reducers – The Reducers aggregate the intermediate keys from the Mappers and generate the final output
MapReduce Simplified
Is MapReduce Similar To A Unix Pipeline?
cat input | grep | sort | uniq -c | cat > output
Input Map Sort, Shuffle
& Merge
Reduce Output
Expanding On Those Thought Keys
� Each Mapper processes a single input split from HDFS – Likely an HDFS block
� The Framework passes one record at a time to the Mapper
� Each record consists of a key / value pair
� Mapper writes intermediate key / value pairs to local disk
� During Sort / Shuffle / Merge all values for the same intermediate key are transferred to the same Reducer
� The Reducer is passed each key and an iterator of values
� Output from the Reducer is written to HDFS
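The reason every value for one intermediate key reaches the same Reducer is the partitioner. The sketch below mirrors the logic of Hadoop's default HashPartitioner (the class name here is invented; the hash-and-modulo formula is the standard one):

```java
public class PartitionDemo {

    // Mirrors the logic of Hadoop's default HashPartitioner:
    // the same key always hashes to the same reducer index
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[]{"the", "quick", "the", "fox"}) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
        // "the" lands on the same reducer both times,
        // so all of its 1s can be summed in one place
    }
}
```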
MapReduce In Detail
WordCount()
� With the thought keys previously described in mind
� We will use WordCount() to describe the MapReduce infrastructure – WordCount is MapReduce’s equivalent of ‘Hello World’ – Not the most elegant of examples – Simple and understandable example
The Mapper For WordCount
� Assume the input is a set of text files
� The key is the byte offset into the file for a line
� The value is a line in the file at that offset
Map (key, value) {
   for each word x in value:
      output(x, 1);
}
WordCount Mapper Input
� Input to the Mapper
2120 see the quick brown fox jump over the lazy dog
2166 the five boxing wizards jump quickly
2202 a mad boxer shot a quick jab to the jaw of his opponent
WordCount Mapper Code

// Imports omitted for clarity
public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String s = value.toString();
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}
WordCount Mapper Output
� Intermediate data output by the Mapper
(see,1), (the,1), (quick,1), (brown,1), (fox,1), (jump,1), (over,1), (the,1), (lazy,1), (dog,1), (the,1), (five,1), (boxing,1), (wizards,1), (jump,1), (quickly,1), (a,1), (mad,1), (boxer,1), (shot,1), (a,1), (quick,1), (jab,1), (to,1), (the,1), (jaw,1), (of,1), (his,1), (opponent,1)
Shuffle
� Process by which Mapper output gets to Reducers
� Activities on both Mapper and Reducer side
� Transparent… until something goes slowly / wrong
Shuffle – Map Side
� Mapper output written to circular memory buffer
� When buffer is nearly full, it is prepared for write – Pairs placed in appropriate partition – Partition contents sorted by key – Combiner (if any) run against sorted key/value pairs
� New spill file created on local file-system
� Spill files merged together – If enough separate files, run combiner again
� Output file marked as available
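A hedged sketch of what the combiner does to a sorted run of map output before it is spilled — for word count it applies the same summing logic as the reducer (plain Java; the class and method names are invented for illustration):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Collapse adjacent pairs with the same key in an already-sorted run.
    // For word count this is the same summing logic the reducer uses.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> sortedPairs) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> p : sortedPairs) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(p.getKey())) {
                out.get(last).setValue(out.get(last).getValue() + p.getValue());
            } else {
                out.add(new AbstractMap.SimpleEntry<>(p.getKey(), p.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> run = new ArrayList<>();
        run.add(new AbstractMap.SimpleEntry<>("dog", 1));
        run.add(new AbstractMap.SimpleEntry<>("the", 1));
        run.add(new AbstractMap.SimpleEntry<>("the", 1));
        run.add(new AbstractMap.SimpleEntry<>("the", 1));
        System.out.println(combine(run)); // [dog=1, the=3]
    }
}
```

Fewer pairs then cross the network during shuffle, which is why it can be worth running the combiner again when enough spill files are merged.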
Shuffle – Map Side - Configuration
Property Use
mapreduce.task.io.sort.mb Size of circular buffer
mapreduce.map.sort.spill.percent When to start spilling buffer
mapreduce.cluster.local.dir Spill directory list
mapreduce.task.io.sort.factor How many spills to merge at once
mapreduce.map.combine.minspills Min spills files to run Combiner again
mapreduce.map.output.compress Whether to compress output
mapreduce.map.output.compress.codec Compression codec to use
mapreduce.tasktracker.http.threads Threads to use for transferring files
Shuffle – Reduce Side
� Reducers poll JobTracker, asking for input locations
� Reducers copy files from Mappers via HTTP
� Inputs held in RAM until thresholds reached
� Files from various Mappers recursively merged – Optimizations to minimize number of disk writes
� Reducer invoked
Shuffle – Reduce Side - Configuration
Property Use
mapreduce.reduce.shuffle.parallelcopies Threads used to copy input files
mapreduce.reduce.shuffle.input.buffer.percent Heap to use for buffering input files
mapreduce.reduce.shuffle.merge.percent Size buffer must be filled to cause merge
mapreduce.reduce.merge.inmem.threshold Input files required to cause merge
mapreduce.task.io.sort.factor Merge factor
WordCount Reducer Input
� Input to the Reducer after Sort / Shuffle / Merge phase
(a, [1,1]) (boxer, [1]) (boxing, [1]) (brown, [1]) (dog, [1]) (five, [1]) (fox, [1]) (his, [1]) (jab, [1]) (jaw, [1]) (jump, [1,1]) (lazy, [1])
(mad, [1]) (of, [1]) (opponent, [1]) (over, [1]) (quick, [1,1]) (quickly, [1]) (see, [1]) (shot, [1]) (the, [1,1,1,1]) (to, [1]) (wizards, [1])
The Reducer For WordCount
� The keyword is a word
� The iterator is a list of values associated with the keyword
Reduce (keyword, iterator) {
   for each x in iterator:
      sum += x;
   final_output(keyword, sum);
}
WordCount Reducer Output
� Output from the WordCount Reducer written to HDFS
(a, 2) (boxer, 1) (boxing, 1) (brown, 1) (dog, 1) (five, 1) (fox, 1) (his, 1) (jab, 1) (jaw, 1) (jump, 2) (lazy, 1)
(mad, 1) (of, 1) (opponent, 1) (over, 1) (quick, 2) (quickly, 1) (see, 1) (shot, 1) (the, 4) (to, 1) (wizards, 1)
WordCount Reducer Code

// Imports omitted for clarity
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int wordCount = 0;
    while (values.hasNext()) {
      IntWritable value = values.next();
      wordCount += value.get();
    }
    output.collect(key, new IntWritable(wordCount));
  }
}
WordCount Driver Code

// Imports omitted for clarity
public class WordCount {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input dir> <output dir>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("Word Count");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(WordMapper.class);
    conf.setReducerClass(SumReducer.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
How To Run A MapReduce Job
Creating And Running A MapReduce Job
• Develop the Mapper and Reducer classes
• Develop the Driver class
• Compile the Driver, Mapper and Reducer classes
javac -classpath `hadoop classpath` *.java
• Create a jar file
jar cvf myjob.jar *.class
• Run the hadoop jar command
hadoop jar myjob.jar Entry input_dir output_dir
Lab Exercise Running A MapReduce Job
Running a MapReduce Job
Lab Exercise
� Load data into HDFS
� Compile the WordCount Driver, Mapper and Reducer
� Create a jar file
� Submit the jar file