Hadoop: M/R, Pig, Hive
A short intro and demo of each program
By Zahid Mian (February 2015)
Agenda
• Intro to Map/Reduce (M/R)
• M/R Simple Example
• M/R Joins
• M/R Broadcast Join Example
• Intro to Pig
• Pig Example
• Intro to Hive
• Hive Example
• Resources
What is M/R?
• A way of programming that breaks work down into two tasks: mapping and reducing
• Mapping:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume: <key, <list of values>>, e.g. <“EMC”, {(…),(…)}>
• Produce: <key, value>, e.g. <“EMC”, 27.2229>
• Shuffling and Sorting:
• Behind-the-scenes actions done by the framework
• Groups matching keys from all mappers, sorts them, and passes each group to a specific reducer
What is HDFS?
• HDFS is a filesystem that ensures data availability by replicating file blocks across several nodes (3 replicas is the default)
• Default block size is 64 MB
• A small file (1 KB) is still allocated a full 64 MB block; a “large” file of 65 MB is allocated two blocks (128 MB)
• Namenode stores metadata info about files
• Datanode stores the actual file(s)
• Files must be added to HDFS
• Files cannot be modified once inside HDFS
Working with HDFS
• Similar to working with the Linux filesystem:
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/input
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/output
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/input/*
• [cloudera@quickstart ~]$ hadoop fs -copyFromLocal ~/datasets/stock*.txt /user/examples/stocks/input/
• [cloudera@quickstart ~]$ hadoop fs -cat /user/examples/stocks/input/stocks.txt
• [cloudera@quickstart ~]$ hadoop fs -rmr /user/examples/stocks/output/*
• Full list of commands available:
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Structure of Files (demo)
Symbol, Name, Exchange
Symbol, date, open, high, low, close, volume, adjclose
Shakespeare Count Words
• Simple text file that contains all of Shakespeare’s works
• Mapper will read each line from the text file and produce a <key, value> tuple with the word as the key and a value of 1
• Simply tokenize each line and output each word
• Reducer will get a list of values (all 1s) for each word
• Tuple: <“death”, {1,1,1,1,1,1,1,1}>
• Now simply count the 1s and output as <“death”, 8>
• It’s Hadoop’s job to Shuffle and Sort in order to give the Reducer the correct tuple
• Final output of the Reducer is stored in HDFS; intermediate Mapper output is written to local disk
• Logs are generated outside HDFS
M/R: Mapper (Simple)
• All Mappers must extend this class: org.apache.hadoop.mapreduce.Mapper
• The input key is a special Hadoop type (LongWritable); for text files, it is the byte offset of the line
• The output value is a special Hadoop type (IntWritable); it indicates the type of value the Mapper will produce
• The “signature” indicates that the Mapper will consume LongWritable and Text, and will produce Text and IntWritable
• Notice word is of type Text; one is of type IntWritable
• The setup method is run only once, before any calls to the map function
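The mapper itself appears on the slide as a screenshot that isn’t in this transcript. A minimal sketch of a mapper with that signature, assuming the class name WordCountMapper (the original class name isn’t visible here):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Consumes <LongWritable, Text>; produces <Text, IntWritable>
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1); // reused for every word
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the line and emit <word, 1> for each token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }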
M/R Submitting: Driver
• Compile and create a jar file
• Then from the command prompt:
• [cloudera@quickstart ~]$ hadoop jar words.jar Driver /user/examples/shakespeare/input/ /user/examples/shakespeare/output/wordcount
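The Driver class itself isn’t shown in the transcript; a minimal sketch of what it might contain, assuming the WordCountMapper above and the WordCountReducer sketched later (both class names are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(Driver.class);
            job.setJobName("word count");
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            // Map output types differ from the final output types, so set both
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }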
What is Mapper Doing?
• Sample file segment
• Mapper function gets:
• Byte offset, line text
• <57020, “HAMLET To be, or not to be: that is the question:”>
• Mapper will:
• 1: tokenize the string
• 2: for each word, produce a tuple like:
• <“HAMLET”, 1>
• <“To”, 1>
• <“be”, 1>
• <“or”, 1>
• …
• Repeated for all lines
That’s it?
• Hadoop performs some Magic (Shuffling and Sorting) …
• And now we have tuples like:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• <“To”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• <“be”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• Note: the list of values isn’t correct (there are a lot more references to “HAMLET” in the text file), but it’s meant to be representative of what it would look like
M/R: Reducer (Simple)
• All Reducers must extend this class: org.apache.hadoop.mapreduce.Reducer
• The input key is the “key” for the data that’s being sent to the Reducer
• The output value is a special Hadoop type; it indicates the type of value the Reducer will produce
• The “signature” indicates that the Reducer will consume Text and IntWritable, and will produce Text and DoubleWritable
• Notice key is of type Text; result is of type DoubleWritable
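The reducer screenshot isn’t in this transcript either. A minimal sketch matching the signature described above (the class name WordCountReducer is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Consumes <Text, IntWritable>; produces <Text, DoubleWritable>
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private final DoubleWritable result = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();   // every value is 1, so this counts occurrences
            }
            result.set(sum);
            context.write(key, result); // e.g. <"HAMLET", 7>
        }
    }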
What is Reducer Doing?
• Hadoop will send a tuple to the Reducer:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• Reducer function:
• Iterates over all the values for that key
• Value is always 1, so simply sum
• Reducer outputs:
• <“HAMLET”, 7>
M/R Overview
• Input files (each line is passed to a mapper):
• “HAMLET To be, or not to …” / “Whether 'tis nobler in the …” / “Or to take arms against a …” / “And by opposing end …”
• Map tasks split each line into key/value pairs:
• HAMLET, 1  to, 1  be, 1  or, 1  not, 1  to, 1
• Whether, 1  ‘tis, 1  nobler, 1  in, 1  the, 1
• Or, 1  to, 1  take, 1  arms, 1  against, 1  a, 1
• And, 1  by, 1  opposing, 1  end, 1
• Sort and shuffle brings identical keys together:
• HAMLET, 1  a, 1  against, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 1  or, 1  take, 1  to, 1  to, 1  to, 1  …
• Reduce tasks emit key/value pairs:
• HAMLET, 1  a, 1  against, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 2  take, 1  to, 3  …
• Final output:
• HAMLET, 1  a, 1  against, 1  and, 1  arms, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 2  take, 1  the, 1  'tis, 1  to, 3  whether, 1
Joins with M/R
• Not straightforward (a Mapper deals with a single record at a time)
• Two strategies:
• Re-partition join if both tables are large
• Basic idea is to use Mappers to produce keyed records so that matching records from both data sets end up in the same partition
• Assume an EmployeeID of 100; the Mapper produces:
• <100, “FirstName, LastName, Address”> (parent record)
• <100, “Skill1, Date, Level”> (child record)
• <100, “Skill2, Date, Level”> (child record)
• Reducer performs the join
• Expensive/costly due to shuffling and sorting
• Broadcast/replication join if one table is small
• Essentially send a copy of the small table to each Mapper
• Each Mapper performs the join (hedged sketches follow under the next three headings)
M/R Mapper: Broadcast Join
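The mapper code on this slide is a screenshot; a minimal sketch of the idea, assuming the small names file (Symbol, Name, Exchange) has been shipped to each mapper via the distributed cache under the link name names.txt, and the large price file is the map input:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BroadcastJoinMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Map<String, String> symbolToName = new HashMap<String, String>();
        private final Text name = new Text();
        private final DoubleWritable openPrice = new DoubleWritable();

        @Override
        protected void setup(Context context) throws IOException {
            // Runs once per mapper: load the small table from the distributed cache
            BufferedReader reader = new BufferedReader(new FileReader("names.txt"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(","); // Symbol, Name, Exchange
                    symbolToName.put(fields[0], fields[1]);
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Large table: Symbol, date, open, high, low, close, volume, adjclose
            String[] fields = value.toString().split(",");
            String companyName = symbolToName.get(fields[0]); // the join itself
            if (companyName != null) {
                name.set(companyName);
                openPrice.set(Double.parseDouble(fields[2])); // opening price
                context.write(name, openPrice);
            }
        }
    }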
M/R Reducer: Broadcast Join
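Again a sketch rather than the original screenshot: the reducer averages the opening prices collected for each company name.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgPriceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable average = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            average.set(sum / count);   // e.g. <"EMC", 27.2229>
            context.write(key, average);
        }
    }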
M/R Driver: Broadcast Join
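A sketch of the driver, assuming Hadoop 2.x; the cache-file path is illustrative, and the “#names.txt” fragment sets the local link name the mapper reads:

    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BroadcastJoinDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(BroadcastJoinDriver.class);
            job.setMapperClass(BroadcastJoinMapper.class);
            job.setReducerClass(AvgPriceReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            // Ship the small table to every mapper node
            job.addCacheFile(new URI("/user/examples/stocks/input/names.txt#names.txt"));
            FileInputFormat.addInputPath(job, new Path(args[0]));   // large price file(s)
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }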
Results
• Just the Mapper
• Mapper and Reducer (calculate Avg Price by Name)
Final Thoughts on M/R
• Java Experience Necessary
• Hadoop Streaming extends M/R to C, Python, etc.
• Can use Combiners to improve performance
• Reduces Network traffic
• “Difficult” to understand all the details, but granular control over data/process
• Useful when dealing with complex algorithms
• Several file formats available, but can also create custom formats
• Chaining Jobs to use output of one Job as input for another (a brief sketch follows):
• https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
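A minimal sketch of chaining under the Driver pattern above; it reuses the illustrative WordCountMapper/WordCountReducer from earlier, and the second job is left as a do-nothing identity pass just to show the wiring:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainDriver {
        public static void main(String[] args) throws Exception {
            Path intermediate = new Path("/user/examples/intermediate"); // scratch directory

            Job first = Job.getInstance();
            first.setJarByClass(ChainDriver.class);
            first.setMapperClass(WordCountMapper.class);
            first.setReducerClass(WordCountReducer.class);
            first.setMapOutputKeyClass(Text.class);
            first.setMapOutputValueClass(IntWritable.class);
            first.setOutputKeyClass(Text.class);
            first.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(first, new Path(args[0]));
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) System.exit(1); // block until job 1 finishes

            Job second = Job.getInstance(); // identity map/reduce by default
            second.setJarByClass(ChainDriver.class);
            FileInputFormat.addInputPath(second, intermediate); // output of job 1 = input of job 2
            FileOutputFormat.setOutputPath(second, new Path(args[1]));
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }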
Pig
• Higher level abstraction for writing M/R jobs
• Data Flow “language”
• Sequence of transformations (filtering, grouping, joining, etc.)
• Pig Latin (the language for Pig)
• It’s not SQL, not even close
• Pig scripts are run as M/R jobs in Hadoop
• Pig Shell will compile and optimize script
• Need to understand data in order to create schemas
• Pig can define simple and complex types, so parent/child data can exist in one “line” (think JSON)
• User Defined Functions (UDF) can be written in Java, Jython, etc. http://pig.apache.org/docs/r0.9.1/udf.html
Generic Example
• This script shows many of the operations within Pig
Users = load 'users' as (name, age);                       -- load the users data set
Fltrd = filter Users by age >= 18 and age <= 25;           -- keep 18-25 year olds
Pages = load 'pages' as (user, url);                       -- load the page-view data set
Jnd = join Fltrd by name, Pages by user;                   -- join users to their page views
Grpd = group Jnd by url;                                   -- group the joined records by url
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;  -- count clicks per url
Srtd = order Smmd by clicks desc;                          -- sort by click count, descending
Top5 = limit Srtd 5;                                       -- keep the top five
store Top5 into 'top5sites';                               -- write the result to HDFS
Avg Opening Price by Name
• Performs a join between the two datasets
• describe shows you the structure of a relation
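The script itself appears on the slide as a screenshot; a minimal Pig Latin sketch of the same idea, where the file paths and field names are assumptions based on the earlier slides:

    -- Small table: Symbol, Name, Exchange
    names  = LOAD '/user/examples/stocks/input/names.txt' USING PigStorage(',')
             AS (symbol:chararray, name:chararray, exchange:chararray);
    -- Large table: Symbol, date, open, high, low, close, volume, adjclose
    prices = LOAD '/user/examples/stocks/input/prices.txt' USING PigStorage(',')
             AS (symbol:chararray, date:chararray, open:double, high:double,
                 low:double, close:double, volume:long, adjclose:double);
    jnd    = JOIN names BY symbol, prices BY symbol;  -- join the two datasets
    describe jnd;                                     -- shows the structure of the join
    grpd   = GROUP jnd BY names::name;                -- group by company name
    avgs   = FOREACH grpd GENERATE group AS name, AVG(jnd.prices::open) AS avgopen;
    store avgs into '/user/examples/stocks/output/avgopen';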
Pig Scripts are Hadoop Jobs
• z.pig is the name of the script
Hive
• It’s not Pig
• SQL-based tool for Hadoop (HiveQL, not SQL)
• Friendlier for SQL users
• “Databases” are simply Namespaces
• “Tables” similar to SQL Tables
• Cannot Insert/Update/Delete
• New data is added when HDFS is updated (add a file to HDFS)
• Metadata is kept in a relational database (MySQL by default)
Hive and HDFS
• When a table points to an HDFS location, it will read all files in that location; you cannot specify a single file
• Easy to create partitions; simply create subdirectories
• That’s why each file is stored in a separate directory
Hive Script
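The Hive script on this slide is a screenshot; a minimal HiveQL sketch of the same average-opening-price query, where the table names, column names, and HDFS locations are assumptions:

    -- External tables point at directories in HDFS; Hive reads every file in them
    CREATE EXTERNAL TABLE names (symbol STRING, name STRING, `exchange` STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/examples/stocks/names';

    CREATE EXTERNAL TABLE prices (symbol STRING, dt STRING, open DOUBLE, high DOUBLE,
        low DOUBLE, close DOUBLE, volume BIGINT, adjclose DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/examples/stocks/prices';

    -- Average opening price by company name
    SELECT n.name, AVG(p.open) AS avg_open
    FROM names n JOIN prices p ON (n.symbol = p.symbol)
    GROUP BY n.name;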
Hive Results
Hive Scripts are Hadoop Jobs
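The job output itself isn’t in the transcript. As with Pig, submitting a script from the shell compiles it into M/R jobs; for example (stocks.hql is an illustrative file name):

    [cloudera@quickstart ~]$ hive -f stocks.hql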