A NEW PLATFORM FOR A NEW ERA

Transcript of A NEW PLATFORM FOR A NEW ERA Immersion V1.1 - CapGemini WebEx.pdf

Page 1:

A NEW PLATFORM FOR A NEW ERA

Page 2:

Hadoop Immersion v1.1 Internal Use Only Do Not Distribute

Page 3:

Welcome!

Page 4:

Agenda

•  Hadoop Overview
  –  Why Hadoop
  –  History

•  All About HDFS
  –  What is HDFS
  –  Design Assumptions
  –  HDFS Architecture
  –  About the NameNode

•  HDFS – NFS Bridge
  –  What is the HDFS – NFS Bridge

•  All About MapReduce
  –  What is MapReduce
  –  How does MapReduce work
  –  Architecture
  –  Fault Tolerance

Page 5:

A NEW PLATFORM FOR A NEW ERA

Page 6:

Hadoop Overview

Page 7:

Hadoop Overview

•  Why Hadoop?

•  History of Hadoop

Page 8:

Why Hadoop?

Page 9:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 10:

Big Data! - Big Problem!

•  3 main characteristics of big data: Volume, Velocity, Variety

•  Volume
  –  Need to manage lots of data
  –  Social enterprises and the Industrial Internet generate lots of data

•  Velocity
  –  Need fast read / write access to data - measured in nano- or microseconds

•  Variety
  –  Today's agile businesses require rapid response to changing requirements
  –  Fixed schemas are too rigid; need 'emerging' schemas
  –  May not fit into the traditional 'relational' and 'transactional' data model of an RDBMS

Page 11:

Where Does The Data Come From?

•  Social Media

•  Log Files

•  Video Networks

•  Sensor Data

•  Transactions ( retail / banking / stock market / etc )

•  e-mail / text messaging

•  Legacy Documents

Page 12:

There is Value in The Data!

•  The value depends on the use case
  –  Fraud Detection
  –  Marketing Analysis
  –  Threat Analysis
  –  Forecasting
  –  Recommendation Engines
  –  Trade Surveillance
  –  . . .

Page 13:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 14:

Monolithic Computing Model

•  Processor bound
  –  Very fast with small amounts of data

•  The solution was to build bigger and faster computers
  –  More CPUs, more memory

•  Had serious limitations
  –  Expensive, and did not scale as data volumes increased

Page 15:

How About Distributed Computing?

'In Pioneer days they used oxen for heavy pulling, and when one ox was not sufficient to accomplish a task, we did not try to grow a larger ox.'

Admiral Grace Hopper

Page 16:

About Distributed Computing

•  Processing is distributed across a cluster of machines
  –  Multiple nodes
  –  Common frameworks include MPI, PVM and Condor

•  Primarily focused on distributing the processing workload
  –  Powerful compute nodes
  –  Data typically on a separate storage appliance

MPI - Message Passing Interface
PVM - Parallel Virtual Machine
Condor - Batch Queueing

Page 17:

The Challenge of Distributed Processing

•  Does not scale gracefully
  –  Data has to be copied to be processed
  –  Problem escalates as more nodes are added to the cluster

•  Hardware failure
  –  Distributing data means a failure could corrupt part of it
  –  Replication can be a solution
  –  Requires 'management' of the pieces

•  Sort, combine and analyze data from multiple systems
  –  Finding ways to tie related data together

Page 18:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 19:

Ideal Cluster Requirements

•  Linear horizontal scalability
  –  Adding nodes should increase capacity proportionally
  –  Avoid contention with a 'shared nothing' architecture
  –  Expandable at a reasonable cost

•  Jobs run in relative isolation
  –  Results independent of other running jobs
  –  Reasonable performance degradation due to concurrency

•  Simple programming model
  –  Simple API
  –  Multiple language support

Page 20:

Ideal Cluster Requirements ( continued )

•  Failure Happens - Handle it Efficiently

Ideal Failure-Handling Properties
  –  Automatic: Jobs complete automatically
  –  Transparent: Failed tasks are restarted
  –  Graceful: Proportional loss of capacity
  –  Recoverable: Restoration of the failed component restores lost capacity
  –  Consistent: No corruption or invalid results

Page 21:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 22:

The Essence of Hadoop

•  Distributed file system (HDFS)
  –  Reliable shared / distributed storage of large data files
  –  Potential for redundant copies for reliability

•  Flexible analysis (MapReduce)
  –  Abstracting the process of reading and writing data
  –  Breaking raw data into keys and values for tailored processing

•  In addition to HDFS and MapReduce, Hadoop includes the infrastructure to make these components work
  –  Web Interface
  –  Monitoring and Scheduling tools
  –  Filesystem utilities

Page 23:

Hadoop Eco-System

•  Eco-system tools available to integrate with Hadoop
  –  Hive and Pig – Data Analysis
  –  Tableau and Datameer – Data Visualization
  –  Sqoop – Data Integration
  –  Oozie – Workflow Management
  –  Puppet and Chef – Cluster Management

•  These are not 'Core Hadoop' but rather eco-system tools
  –  Many ( but not all ) are top-level Apache projects

Page 24:

Hadoop - A Different Approach

•  Distributed computing typically requires
  –  Complex synchronization code
  –  Expensive fault-tolerant hardware
  –  Redundant high-performance storage appliances

•  Hadoop's approach, based on the Google File System and MapReduce whitepapers, addresses the problems implicit in distributed systems

Page 25:

Hadoop Scalability

•  Hadoop's goal is to achieve linear horizontal scalability
  –  Minimal chatter between nodes ( shared nothing environment )
  –  Scale horizontally by adding nodes to increase capacity and / or performance

•  Components in a cluster will fail
  –  Provision using widely available commodity hardware
  –  Provision what you currently need and 'scale out' when necessary

Page 26:

Data Access Is A Bottleneck

•  Moving data from a storage appliance to the processors in a traditional distributed system is very time consuming

•  Store and process the data on the same machines
  –  Eliminates the need to copy data between nodes

•  Process data intelligently with data locality
  –  Bring the computation to the data
  –  Process data on the same node it is stored on when possible

Page 27:

Disk Performance Is A Bottleneck

•  Disk technology has advanced significantly
  –  Single large-capacity disks are readily available
  –  Scan performance has not scaled as well as capacity has

•  Take advantage of multiple disks in parallel
  –  Assuming 1 disk, 3TB of data and a transfer rate of 300MB / s
     ▪  Roughly 2h 47m to read the 3TB file
  –  Assuming the 3TB of data is distributed across 1000 disks, they can transfer roughly 300GB / s in aggregate
     ▪  Less than 10 seconds to read the same 3TB file

•  Co-located storage and processing makes this possible ( the arithmetic is sketched below )
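The arithmetic above is easy to check; a minimal Java sketch ( not part of the original deck ) of the same calculation:

public class ScanTime {
    public static void main(String[] args) {
        double dataMB = 3_000_000;           // 3TB expressed in MB
        double perDiskMBperSec = 300;        // 300MB / s per disk

        double oneDisk = dataMB / perDiskMBperSec;                // 10,000 s, roughly 2h 47m
        double thousandDisks = dataMB / (perDiskMBperSec * 1000); // 10 s in aggregate

        System.out.printf("1 disk:     %.0f s (~%.2f h)%n", oneDisk, oneDisk / 3600);
        System.out.printf("1000 disks: %.0f s%n", thousandDisks);
    }
}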

Page 28:

Complex Processing Code

•  Code for a distributed computing environment is complex

•  Hadoop's infrastructure abstracts complex programming requirements
  –  No synchronization code
  –  No networking code
  –  No file I / O code

•  MapReduce developers focus on programs that provide value to the business
  –  Typically written in Java or with Hadoop Streaming

Page 29:

Fault Tolerance

•  Traditional distributed systems rely on expensive fault-tolerant components

•  Failure is inevitable – plan for it
  –  Minimize the effect of failure
  –  Hadoop does this very well

•  Machine failure is a common and regular occurrence
  –  Consider a server with an MTBF of 5 years ( 1825 days or so )
  –  In a 2000-node cluster that is roughly one failure per day ( 2000 / 1825 ≈ 1.1 )

Page 30:

The History of Hadoop

Page 31:

Prior To Hadoop

•  Nutch - Open Source Web Search Engine - 2002
  –  Created by Doug Cutting and Mike Cafarella

•  Google Whitepapers
  –  Google File System - 2003
  –  MapReduce - 2004

•  Nutch Re-Architecture
  –  Doug Cutting - 2005

Page 32:

Early Hadoop Years

•  Hadoop factored out of Nutch
  –  Sub-project of Nutch - 2006
  –  Later became a top-level Apache project - 2008

•  Much of the early development was led by Yahoo!

Page 33:

Hadoop Today

•  Hadoop is mainstream

•  Many eco-system projects have been spawned
  –  Most are top-level Apache projects
  –  Hive, Pig, Oozie, Flume, HBase and others

•  Many organizations have migrated to Hadoop
  –  This has increased the focus on 'enterprise-level' features and tools

•  Hadoop is evolving and changing very frequently

Page 34:

A NEW PLATFORM FOR A NEW ERA

Page 35:

About HDFS

Page 36:

About HDFS

•  What is HDFS?

•  Basic HDFS Assumptions

•  Architecture

•  About The NameNode

•  Lab Exercise – Getting Familiar With HDFS

Page 37:

What is HDFS?

Page 38:

What is HDFS?

•  Hadoop Distributed File System, inspired by GoogleFS

•  Features:
  –  High-performance file system for storing data
  –  Relatively simple centralized management: Master / Slave architecture
  –  Fault-tolerant: data replication
  –  Optimized for MapReduce processing - Data Locality
  –  Scalability: scale horizontally for additional capacity / performance

•  Java application – slower but more portable

•  Other file-systems are supported (FTP, S3)

Page 39:

Basic HDFS Assumptions

Page 40:

Basic HDFS Assumptions

•  Cluster components fail
  –  Use commodity hardware

•  Files are write once / read many

•  Leverage large streaming reads, not random access

•  Favors high sustained throughput over low latency

•  Modest number of HUGE files
  –  Multi-gigabyte files are typical

Page 41:

Architecture

Page 42:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 43:

HDFS Blocks

•  When a file is added to HDFS it is split into blocks

•  This is very similar to files added to any filesystem
  –  Default block size is 64MB
  –  Block size is configurable ( see the sketch below )

•  Blocks are replicated across the cluster
  –  Based on the replication factor ( default is 3 )
  –  Achieves the data-locality goal by making data more available
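A minimal Java sketch of setting these parameters per file; the path and values here are illustrative, and fs.create(path, overwrite, bufferSize, replication, blockSize) is the standard org.apache.hadoop.fs.FileSystem call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.dat");   // hypothetical path
        short replication = 3;                      // the default replication factor
        long blockSize = 64L * 1024 * 1024;         // 64MB, the default block size here

        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}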

Page 44:

Multi-Block Replication Pipeline

Page 45:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 46:

Classic HDFS Components

•  HDFS has three main components
  –  Namenode (single)
     ▪  Manages the file-system content tree
     ▪  Manages file & directory meta-data
     ▪  Manages the datanodes and the blocks they hold
  –  Checkpoint namenode (single)
     ▪  Also known as the 'secondary' or 'backup' namenode
     ▪  Not an active standby of the namenode
     ▪  Merges in-memory and disk-based namenode metadata
     ▪  Keeps a copy of this data, but it is almost always out of date
  –  Datanodes (multiple)
     ▪  Store & retrieve data blocks
     ▪  Report block usage to the namenode

Page 47:

Classic HDFS Architecture

[Diagram: two master nodes ( NameNode and CheckPoint NameNode ) above six DataNode slave nodes.]

Page 48:

NameNode

•  Persistently stores all metadata:
  –  File-to-block mappings
  –  File-system tree
  –  File ownership and permissions

•  Metadata is stored on disk and read into RAM when the NameNode daemon starts

•  Transiently stores block-to-DataNode mappings

•  Changes to the metadata are held in RAM
  –  The NameNode requires a lot of memory

•  All file operations start with the NameNode

•  Single point of failure

Page 49:

NameNode Storage

•  Edit log
  –  In-memory write-ahead log of modification operations

•  edits file
  –  Disk-based file of edits
  –  The log is flushed to this file after writes, before returning to the client

•  fsimage file
  –  Disk-based file of the complete file-system metadata
  –  Asynchronous write-behind merges occur periodically

•  Get the current state by loading fsimage and replaying edits

Page 50:

Checkpoint NameNode

•  Also known as the 'secondary' NameNode

•  The CheckPoint NameNode is not an active standby of the NameNode

•  Performs merging of the edits and fsimage files

•  Merging requires all metadata to be held in memory

•  Has memory requirements similar to those of the NameNode

Page 51:

Updating fsimage

•  NameNode starts writing to an edits.new file

•  edits and fsimage are copied to the checkpoint NameNode

•  Checkpoint NameNode applies edits to fsimage and writes the result as fsimage.ckpt

•  fsimage.ckpt is copied to the NameNode

•  NameNode replaces fsimage with fsimage.ckpt

•  NameNode replaces edits with edits.new

•  NameNode stores the merge time in fstime

Page 52:

DataNodes

•  The actual contents of files are stored as blocks on the DataNodes

•  Blocks are files on the DataNode's underlying filesystem
  –  Named: blk_xxxxxxx
  –  A DataNode cannot associate any block it stores with a file

•  Each block is stored on multiple DataNodes
  –  Default replication factor is 3

•  Each DataNode runs a DataNode daemon
  –  Controls access to the blocks of data
  –  Communicates with the NameNode

Page 53:

What Is Contemporary HDFS?

•  The release of the 0.23 branch of Hadoop ( remember the confusing versioning ) introduced Contemporary HDFS

•  Introduced a High Availability NameNode option
  –  Passive NameNode

•  Introduced a Federation option
  –  Multiple NameNodes

•  We will address these options in the Advanced Configuration module

Page 54:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 55:

HDFS Write

[Diagram: HDFS write path. The HDFS client in the client JVM uses a DistributedFileSystem object and an FSDataOutputStream; a DataStreamer pipelines packets through three DataNodes, each in its own JVM, under the control of the NameNode.]

1: create ( HDFS client to DistributedFileSystem )
2: create ( DistributedFileSystem to NameNode )
3a: write ( client to FSDataOutputStream ); 3b: request block allocation ( as new blocks are required )
4a / 4b / 4c: write packet ( pipelined from DataNode to DataNode )
5a / 5b / 5c: ack packet ( back up the pipeline )
6: close
7: complete ( to NameNode )

Page 56:

Writing To HDFS

Page 57:

Replication Strategy

•  The first copy of a block is written to the same node the client is on
  –  If the client is not part of the cluster, the first copy is written to a random node that is not too busy

•  The second copy of the block is written to a node in a different rack

•  The third copy of the block is written to a different node in the same rack as the second copy ( the policy is sketched below )
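The policy above expressed as a short illustrative Java sketch; Node, Rack and Cluster are hypothetical stand-ins for this example, not Hadoop's actual BlockPlacementPolicy API:

import java.util.List;

class ReplicaPlacement {
    // Choose three targets following the default strategy described above.
    List<Node> chooseTargets(Node client, Cluster cluster) {
        Node first = cluster.contains(client)
                ? client                               // client is a cluster node: write locally
                : cluster.randomLightlyLoadedNode();   // otherwise: any node that is not too busy
        Node second = cluster.randomNodeOffRack(first.rack());              // different rack
        Node third = cluster.randomOtherNodeOnRack(second.rack(), second);  // same rack as second
        return List.of(first, second, third);
    }
}

interface Rack {}
interface Node { Rack rack(); }
interface Cluster {
    boolean contains(Node n);
    Node randomLightlyLoadedNode();
    Node randomNodeOffRack(Rack r);
    Node randomOtherNodeOnRack(Rack r, Node exclude);
}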

Page 58:

Data Locality

•  Key factors in achieving performance
  –  Network bandwidth
  –  Disk seek time
  –  Node processing power

•  Hadoop emphasizes the importance of data locality
  –  Processing as close as possible to the data ( notably the mapper )
  –  Minimize use of network bandwidth

•  Terms for data locality ( in descending order of preference )
  –  Local
  –  On-Rack
  –  Off-Rack

Page 59:

Data Locality - Local

•  Data node on the same host as the processing (mapper)

[Diagram: Rack A ( Nodes A.0 – A.2 ) and Rack B ( Nodes B.0 – B.2 ); the mapper reads its input split from HDFS on the very node it runs on.]

Page 60:

Data Locality - On Rack

•  Data node on the same rack as the processing node (mapper)

[Diagram: the mapper reads its split from another node in the same rack over a very high-speed in-rack connection.]

Page 61:

Data Locality - Off Rack

•  Data node on a different rack from the processing node (mapper)

[Diagram: the mapper reads its split from a node in another rack over a medium to high-speed inter-rack connection.]

Page 62:

HDFS Read

[Diagram: HDFS read path. The HDFS client in the client JVM uses a DistributedFileSystem object and an FSDataInputStream to read blocks directly from the DataNodes.]

1: open ( HDFS client to DistributedFileSystem )
2: request file block locations ( from the NameNode )
3: read ( client to FSDataInputStream )
4: read from block ( first DataNode )
5: read from block ( next DataNode )
6: close

Page 63:

Reading From HDFS

Page 64:

Data Corruption

•  Hadoop employs checksums to ensure block integrity ( see the sketch below )

•  As a block is read, a checksum is calculated and compared against the checksum calculated and stored when the block was written
  –  Fast to calculate and space-efficient

•  If the checksums differ, the client reads the block from the next DataNode in the list provided by the NameNode
  –  The NameNode will replicate the corrupted block elsewhere

•  DataNodes verify block checksums periodically to avoid 'bit rot'
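A minimal standalone sketch of the verify-on-read idea, using java.util.zip.CRC32; real HDFS checksums every 512-byte chunk of a block by default, and this example is not Hadoop code:

import java.util.zip.CRC32;

public class ChunkChecksum {
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some block data".getBytes();
        long storedAtWrite = checksum(chunk);  // computed and persisted when the block was written

        // On read, recompute and compare; a mismatch means this replica is corrupt
        // and the client should fall back to the next DataNode in the list.
        boolean corrupt = checksum(chunk) != storedAtWrite;
        System.out.println(corrupt ? "corrupt - try next replica" : "block OK");
    }
}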

Page 65:

Data Reliability and Recovery

•  DataNodes send heartbeats to the NameNode every 3 seconds

•  If no heartbeats are received, the DataNode is considered lost
  –  The NameNode determines which blocks were on the lost DataNode
  –  The NameNode finds other DataNodes with valid copies of those blocks
  –  It instructs the DataNodes with the valid copies to replicate those blocks to other DataNodes in the cluster

Page 66:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 67:

Interacting with HDFS

•  The primary interface is a Java CLI app – the Hadoop shell

•  Java API and others ( C++, Python ) – see the sketch below

•  Web GUI for read-only access

•  HFTP provides an HTTP/S read-only view
  –  No relation to FTP

•  WebHDFS provides a read / write RESTful interface

•  FUSE and MapR allow mounting HDFS as a standard filesystem

CLI – Command-Line Interface
FUSE – Filesystem in Userspace
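A minimal sketch of the Java API mentioned above ( the path is hypothetical ): open a file in HDFS and copy its contents to stdout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS from the config files
        FileSystem fs = FileSystem.get(conf);

        try (FSDataInputStream in = fs.open(new Path("/path/on/hdfs/file"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // 4KB buffer, leave stdout open
        }
    }
}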

Page 68:

HDFS Shell Commands

•  Parallels of many common Linux commands

$ hadoop fs -ls /
$ hadoop fs -tail /path/on/hdfs/file

Commands                               Purpose
ls, lsr, count, du, dus, stat, test    Display file-system information
cat, tail, text                        Display file contents
cp, mkdir, mv, rm, rmr, touchz         Manipulate remote objects
chgrp, chmod, chown                    Change security permissions

Page 69:

HDFS Unique Shell Commands

$ hadoop fs -put /my/local/file /path/on/hdfs
$ hadoop fs -getmerge /directory/with/files /my/local/path
$ hadoop fs -setrep -w 5 /path/on/hdfs/file
$ hadoop fs -expunge
$ hadoop fs -text /path/on/hdfs/file.zip

•  Commands Unique to HDFS ( Java equivalents are sketched below )

Commands                                Purpose
get, put, copyFromLocal, copyToLocal    Move files between the local file-system and HDFS
getmerge                                Concatenate files in a directory and copy to local
setrep                                  Set the replication factor of a file
expunge                                 Empty trash ( 'deleted' files are moved to /trash first )
text                                    Display zip or TextRecordInputStream files as text
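For comparison, a small sketch ( paths hypothetical ) of the Java-API equivalents of two of the commands above, -put and -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutAndSetrep {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -put /my/local/file /path/on/hdfs
        fs.copyFromLocalFile(new Path("/my/local/file"), new Path("/path/on/hdfs"));

        // hadoop fs -setrep 5 /path/on/hdfs/file
        fs.setReplication(new Path("/path/on/hdfs/file"), (short) 5);
    }
}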

Page 70:

HDFS Admin Commands

•  Administration commands for the HDFS cluster

$ hadoop dfsadmin -safemode enter
$ hadoop dfsadmin -setQuota 100 /path/on/hdfs
$ hadoop dfsadmin -setSpaceQuota 50g /path/on/hdfs

Commands                        Purpose
report                          Display status of the cluster
safemode                        Toggle safemode, which prevents writes when enabled
finalizeUpgrade                 Remove the backup of the cluster made in the last upgrade
refreshNodes                    Re-apply the rules of which hosts can participate in the cluster
setQuota, clrQuota              Manage the object-count quota for directories
setSpaceQuota, clrSpaceQuota    Manage the space ( byte ) quota for directories

Page 71:

HDFS Web UI

•  Default location is:
  –  http://namenode:50070/

•  Summarizes the status of the cluster

•  Read-only view of the file-system

•  DataNodes present status at:
  –  http://datanode:50075/

Page 72:

About The NameNode

Page 73:

Is The NameNode A Bottleneck?

•  No data ever traverses the NameNode
  –  During reads
  –  During writes
  –  During replication
  –  During recovery operations

Page 74:

NameNode Memory Usage

•  All metadata is held in RAM for quick response

•  Each entry in the metadata is referred to as an 'item'

•  An 'item' can be:
  –  A filename
  –  File permissions
  –  Block names
  –  Additional block information

•  Each 'item' consumes roughly 150 to 200 bytes of RAM

Page 75:

NameNode Memory Usage

•  HDFS / the NameNode prefers fewer, larger files as a result of the metadata that needs to be stored

•  Consider a 1GB file with the default blocksize of 64MB ( the arithmetic is sketched below )
  –  Stored as a single 1GB file
     ▪  Name: 1 item
     ▪  Blocks: 16 * default replication factor of 3 = 48 items
     ▪  Total number of items to account for this file: 49
  –  Stored as 1000 individual 1MB files
     ▪  Names: 1000 items
     ▪  Blocks: 1000 * default replication factor of 3 = 3000 items
     ▪  Total number of items to account for these files: 4000
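A quick sketch of the item arithmetic above ( the 150-200 bytes-per-item figure comes from the previous slide; exact sizes vary by version ):

public class NameNodeItems {
    // names + ( blocks x replication ) items
    static long items(long files, long blocksPerFile, int replication) {
        return files + files * blocksPerFile * replication;
    }

    public static void main(String[] args) {
        System.out.println("1 x 1GB file ( 16 blocks ): " + items(1, 16, 3));    // 49
        System.out.println("1000 x 1MB files:           " + items(1000, 1, 3));  // 4000
        // At ~200 bytes per item the 1000-file layout costs ~800KB of NameNode RAM
        // versus ~10KB for the single-file layout.
    }
}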

Page 76:

Lab Exercise Getting Familiar With HDFS

Page 77:

Getting Familiar With HDFS

Lab Exercise

•  Getting 'Help' in HDFS

•  Navigating HDFS

•  Loading files into HDFS

•  Verifying data in HDFS

Page 78:

A NEW PLATFORM FOR A NEW ERA

Page 79:

HDFS – NFS Gateway

Page 80:

NFS Gateway

•  Allows HDFS to be mounted as part of the client's local file system

•  Supports a variety of client platforms
  –  Linux / Unix
  –  Windows
  –  Macintosh
  –  Others

•  Browse HDFS via the local file system
•  Download files from HDFS to the local file system
•  Upload files from the local file system to HDFS
•  Stream data directly to HDFS through the mount point

Page 81:

NFS Gateway Architecture

Page 82:

NFS Gateway Configuration

•  The NFS Gateway machine needs the same things required to run an HDFS client
  –  Hadoop jar files
  –  HADOOP_CONF directory

•  The NFS Gateway can be on the same host as the NameNode, a DataNode, or any HDFS client

Page 83:

HDFS – NFS Lab Exercise

Page 84:

Lab Exercise - NFS Gateway

•  Manipulate files in HDFS using hadoop fs or hdfs dfs commands

•  Manipulate files in HDFS via the NFS mount of HDFS

Page 85:

A NEW PLATFORM FOR A NEW ERA

Page 86:

All About MapReduce

Page 87:

About MapReduce

•  What is MapReduce?

•  Architecture

•  Fault Tolerance

•  How Does MapReduce Work?

•  Running MapReduce Jobs

•  Lab Exercise – Run a MapReduce Job

Page 88:

What is MapReduce?

Page 89:

MapReduce Is . . .

•  MapReduce is a programming model for processing large volumes of data
  –  Record-oriented data processing ( key / value pairs )

•  The goal of MapReduce is automatic parallelization and distribution of tasks

•  Ideally each node processes data stored locally on that node

•  MapReduce incorporates two developer-created phases:
  –  Map phase – performs filtering and sorting
  –  Reduce phase – performs summary operations

•  Between the Map and Reduce phases is the Sort / Shuffle / Merge step, which sends data from the Mappers to the Reducers

Page 90:

MapReduce Is . . .

•  MapReduce programs are generally written in Java ( as is Hadoop itself )
  –  Other language support is available using Hadoop Streaming

•  The MapReduce framework abstracts all housekeeping requirements from the developer

•  Fault tolerant

Page 91:

Architecture

Page 92:

MapReduce v1.0 or MapReduce v2.0 / YARN

•  Both manage compute resources, jobs & tasks

•  The Job API is the same

•  MapReduce v1.0 is proven in production

•  MapReduce v2.0 - a new framework that leverages YARN
  –  Yet Another Resource Negotiator
  –  YARN was created to help Yahoo! scale to 4,000 nodes
     ▪  Still classified as alpha
     ▪  Not needed for all but the largest clusters
  –  YARN is a generic distribution platform
     ▪  MapReduce v2.0 is just one application that runs on YARN

Page 93:

Task Management System Components

YARN – Yet Another Resource Negotiator

MapReduce v1.0             MapReduce v2.0 - YARN               Purpose
Client                     Client                              Submit MapReduce job
JobTracker (single)        Resource Manager (single),          Manages scheduling & resources;
                           Node Managers (multiple)            co-ordinates the job run ( starts,
                                                               monitors progress, completes )
TaskTrackers (multiple)    MapReduce Application Master        Run map & reduce tasks
                           (multiple, one per job)
Distributed File-System    Distributed File-System             Used for shared data between the
(normally HDFS)            (normally HDFS)                     other components

Page 94:

Starting Job – MapReduce v1.0

[Diagram: job start-up in MapReduce v1.0, spanning the client JVM, the JobTracker node and a TaskTracker node; the mapper or reducer runs in a child JVM launched by the TaskTracker.]

1: initiate job
2: request job ID
3: copy job jars, config ( to the shared file-system, e.g. HDFS )
4: submit job
5: initialize job
6: determine input splits
7: heartbeat ( returns task )
8: retrieve job jars, data
9: launch
10: run

Page 95:

Starting Job – MapReduce v2.0

[Diagram: job start-up in MapReduce v2.0, spanning the client JVM, the ResourceManager node and two Node Manager nodes; the MRAppMaster runs in a container on one Node Manager, and the mapper or reducer runs in a YARN child JVM on another.]

1: initiate job
2: request new application
3: copy job jars, config ( to the shared file-system, e.g. HDFS )
4: submit job
5a: start container; 5b: launch MRAppMaster; 5c: initialize job
6: determine input splits
7a: allocate task resources; 7b: start container
8: launch
9: retrieve job jars, data
10: run

Page 96:

Fault Tolerance

Page 97:

Fault Tolerance

•  Task processes periodically send heartbeats up the chain

•  A task that fails to heartbeat within the timeout period is presumed dead
  –  The task's JVM is killed

•  A task that throws an exception is considered failed

•  Tasks that have failed are rescheduled
  –  Ideally on a different node than the one they failed on

•  If a task fails four times the job fails ( the limit is configurable, as sketched below )
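A small sketch ( not from the deck ) of adjusting that four-attempt limit when configuring a job; these are the MRv2-era property names ( MRv1 used mapred.map.max.attempts and mapred.reduce.max.attempts ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);     // default is 4
        conf.setInt("mapreduce.reduce.maxattempts", 4);  // default is 4

        Job job = Job.getInstance(conf, "retry-demo");
        // ... set mapper / reducer / paths, then submit the job
    }
}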

Page 98:

How Does MapReduce Work


MapReduce Terminology

•  A MapReduce job consists of:
–  Inputs
–  Task(s)

•  A task is:
–  A unit of work
–  A job will be broken into many tasks
–  A task will be either a map task or a reduce task

•  Map tasks read input splits
–  Generate 'intermediate' results

•  Reduce tasks process 'intermediate' results
–  Generate 'final' results


Anatomy Of Job Submission

•  A job is submitted from a client
–  The job is assigned a job ID
–  The client determines the number of input splits for the job
–  The client copies the jar file and configuration items to HDFS

•  A map task is created for each input split
–  Periodic heartbeats between slave and master nodes identify the readiness of a slave node to execute tasks
–  The master node assigns tasks to slave nodes with available capacity

•  A JVM is instanced on the slave and the task runs in that JVM
–  Tasks run in relative isolation
–  Task status is periodically sent to the master node


Basic Tenets of MapReduce

•  Thought keys . . .
–  A Mapper processes a single record at a time and emits a key / value pair
–  The intermediate key / value pairs produced by the Mappers are passed to the Reducers
–  The Reducers aggregate the intermediate keys from the Mappers and generate the final output


MapReduce Simplified


Is MapReduce Similar To A Unix Pipeline?

cat input | grep <pattern> | sort | uniq -c | cat > output

Input → Map → Sort, Shuffle & Merge → Reduce → Output
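
To make the analogy concrete, here is a hedged, single-machine equivalent of the WordCount job described later (the file names are hypothetical): tr plays the Mapper, sort plays the shuffle, and uniq -c plays the Reducer.

    # WordCount as a Unix pipeline: split into words, sort, count duplicates
    cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c > output.txt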


Expanding On Those Thought Keys

•  Each Mapper processes a single input split from HDFS
–  Likely an HDFS block

•  The framework passes one record at a time to the Mapper

•  Each record consists of a key / value pair

•  The Mapper writes intermediate key / value pairs to local disk

•  During Sort / Shuffle / Merge, all values for the same intermediate key are transferred to the same Reducer

•  The Reducer is passed each key and an iterator of values

•  Output from the Reducer is written to HDFS


MapReduce In Detail


WordCount()

•  With the thought keys previously described in mind

•  We will use WordCount() to describe the MapReduce infrastructure
–  WordCount is MapReduce's equivalent of 'Hello World'
–  Not the most elegant of examples
–  A simple and understandable example


The Mapper For WordCount

•  Assume the input is a set of text files

•  The key is the byte offset into the file for a line

•  The value is the line in the file at that offset

Map(key, value) {
    for each word x in value:
        output(x, 1);
}


WordCount Mapper Input

•  Input to the Mapper

2120 see the quick brown fox jump over the lazy dog

2166 the five boxing wizards jump quickly

2202 a mad boxer shot a quick jab to the jaw of his opponent


WordCount Mapper Code

// Imports shown for completeness
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String s = value.toString();
        // Split the line on non-word characters and emit (word, 1) for each word
        for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}


WordCount Mapper Output

•  Intermediate data output by the Mapper

(see,1), (the,1), (quick,1), (brown,1), (fox,1), (jump,1), (over,1), (the,1),
(lazy,1), (dog,1), (the,1), (five,1), (boxing,1), (wizards,1), (jump,1),
(quickly,1), (a,1), (mad,1), (boxer,1), (shot,1), (a,1), (quick,1), (jab,1),
(to,1), (the,1), (jaw,1), (of,1), (his,1), (opponent,1)


Shuffle

•  The process by which Mapper output gets to the Reducers

•  Involves activities on both the Mapper and Reducer side

•  Transparent… until something goes wrong or runs slowly


Shuffle – Map Side

•  Mapper output is written to a circular memory buffer

•  When the buffer is nearly full, it is prepared for writing
–  Pairs are placed in the appropriate partition
–  Partition contents are sorted by key
–  The Combiner (if any) is run against the sorted key/value pairs

•  A new spill file is created on the local file-system

•  Spill files are merged together
–  If there are enough separate files, the Combiner is run again

•  The output file is marked as available
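
Since the Combiner appears twice in the flow above, one concrete note: a Combiner is a mini-Reducer run on map-side output, and for an aggregation like WordCount (where the reduce function is associative and commutative) the Reducer class itself can be reused. A minimal sketch against the JobConf driver shown later in this deck:

    // Sketch: pre-aggregate (word, 1) pairs on the map side before the shuffle.
    // SumReducer is the WordCount Reducer class from this deck.
    conf.setCombinerClass(SumReducer.class);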


Shuffle – Map Side – Configuration

Property                               Use
mapreduce.task.io.sort.mb              Size of the circular buffer
mapreduce.map.sort.spill.percent       When to start spilling the buffer
mapreduce.cluster.local.dir            Spill directory list
mapreduce.task.io.sort.factor          How many spills to merge at once
mapreduce.map.combine.minspills        Minimum spill files to run the Combiner again
mapreduce.map.output.compress          Whether to compress output
mapreduce.map.output.compress.codec    Compression codec to use
mapreduce.tasktracker.http.threads     Threads used for transferring files


Shuffle – Reduce Side

•  Reducers poll the JobTracker, asking for input locations

•  Reducers copy files from the Mappers via HTTP

•  Inputs are held in RAM until thresholds are reached

•  Files from the various Mappers are recursively merged
–  Optimizations minimize the number of disk writes

•  The Reducer is invoked


Shuffle – Reduce Side – Configuration

Property                                        Use
mapreduce.reduce.shuffle.parallelcopies         Threads used to copy input files
mapreduce.reduce.shuffle.input.buffer.percent   Fraction of heap used for buffering input files
mapreduce.reduce.shuffle.merge.percent          How full the buffer must be to trigger a merge
mapreduce.reduce.merge.inmem.threshold          Number of input files required to trigger a merge
mapreduce.task.io.sort.factor                   Merge factor
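
All of the shuffle properties in the two tables above can also be set per job rather than cluster-wide. A minimal sketch with the JobConf API; the property names come from the tables, but the values are illustrative only, not tuning recommendations:

    // Sketch: per-job shuffle tuning (classic mapred API)
    JobConf conf = new JobConf(WordCount.class);
    conf.setInt("mapreduce.task.io.sort.mb", 256);               // larger map-side sort buffer
    conf.setBoolean("mapreduce.map.output.compress", true);      // compress intermediate output
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);  // more reduce-side copy threads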


WordCount Reducer Input

•  Input to the Reducer after the Sort / Shuffle / Merge phase

(a, [1,1]) (boxer, [1]) (boxing, [1]) (brown, [1]) (dog, [1]) (five, [1]) (fox, [1]) (his, [1]) (jab, [1]) (jaw, [1]) (jump, [1,1]) (lazy, [1])

(mad, [1]) (of, [1]) (opponent, [1]) (over, [1]) (quick, [1,1]) (quickly, [1]) (see, [1]) (shot, [1]) (the, [1,1,1,1]) (to, [1]) (wizards, [1])


The Reducer For WordCount

•  The keyword is a word

•  The iterator is a list of values associated with the keyword

Reduce(keyword, iterator) {
    for each x in iterator:
        sum += x;
    final_output(keyword, sum);
}


WordCount Reducer Output

•  Output from the WordCount Reducer written to HDFS

(a, 2) (boxer, 1) (boxing, 1) (brown, 1) (dog, 1) (five, 1) (fox, 1) (his, 1) (jab, 1) (jaw, 1) (jump, 2) (lazy, 1)

(mad, 1) (of, 1) (opponent, 1) (over, 1) (quick, 2) (quickly, 1) (see, 1) (shot, 1) (the, 4) (to, 1) (wizards, 1)


WordCount Reducer Code

// Imports shown for completeness
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int wordCount = 0;
        // Sum all counts emitted for this word
        while (values.hasNext()) {
            IntWritable value = values.next();
            wordCount += value.get();
        }
        output.collect(key, new IntWritable(wordCount));
    }
}


WordCount Driver Code

// Imports shown for completeness
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input dir> <output dir>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("Word Count");

        // Input and output locations in HDFS
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Wire up the Mapper and Reducer classes
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(SumReducer.class);

        // Key / value types for intermediate and final output
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}


How To Run A MapReduce Job


Creating And Running A MapReduce Job

•  Develop the Mapper and Reducer classes
•  Develop the Driver class
•  Compile the Driver, Mapper and Reducer classes

    javac -classpath `hadoop classpath` *.java

•  Create a jar file

    jar cvf myjob.jar *.class

•  Run the hadoop jar command

    hadoop jar myjob.jar <main class> input_dir output_dir


Lab Exercise Running A MapReduce Job


Running a MapReduce Job

Lab Exercise

•  Load data into HDFS

•  Compile the WordCount Driver, Mapper and Reducer

•  Create a jar file

•  Submit the jar file
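
A hedged end-to-end walkthrough of the four lab steps above; the directory and file names are hypothetical, and part-00000 is the default output file name for the classic API:

    hdfs dfs -mkdir -p input                          # create an input directory in HDFS
    hdfs dfs -put shakespeare.txt input               # load data into HDFS
    javac -classpath `hadoop classpath` *.java        # compile Driver, Mapper, Reducer
    jar cvf wordcount.jar *.class                     # create the jar file
    hadoop jar wordcount.jar WordCount input output   # submit the job
    hdfs dfs -cat output/part-00000                   # inspect the results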


A NEW PLATFORM FOR A NEW ERA