Hadoop: M/R, Pig, Hive
A short intro and demo of each program
By Zahid Mian (February 2015)
Agenda
• Intro to Map/Reduce (M/R)
• M/R Simple Example
• M/R Joins
• M/R Broadcast Join Example
• Intro to Pig
• Pig Example
• Intro to Hive
• Hive Example
• Resources
What is M/R?
• A way of programming that breaks work down into two tasks: mapping and reducing
• Mapping:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume: <key, <list of values>>, e.g. <“EMC”, {(…),(…)}>
• Produce: <key, value>, e.g. <“EMC”, 27.2229>
• Shuffling and Sorting:
• Behind-the-scenes actions done by the framework
• Groups matching keys from all mappers, sorts them, and passes each group to a specific reducer
What is HDFS?
• HDFS is a filesystem that ensures data availability by replicating file blocks across several nodes (3 replicas is the default)
• Default block size is 64 MB
• A small file (1 KB) is still allocated a full 64 MB block; a “large” file of 65 MB is allocated two blocks (128 MB)
• Namenode stores metadata info about files
• Datanode stores the actual file(s)
• Files must be added to HDFS
• Files cannot be modified once inside HDFS
Working with HDFS
• Similar to working with the Linux filesystem:
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/input
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/output
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/input/*
• [cloudera@quickstart ~]$ hadoop fs -copyFromLocal ~/datasets/stock*.txt /user/examples/stocks/input/
• [cloudera@quickstart ~]$ hadoop fs -cat /user/examples/stocks/input/stocks.txt
• [cloudera@quickstart ~]$ hadoop fs -rmr /user/examples/stocks/output/*
• Full list of commands available:
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Structure of Files (demo)
Symbol, Name, Exchange
Symbol, date, open, high, low, close, volume, adjclose
Shakespeare Count Words
• Simple text file that contains all of Shakespeare’s works
• Mapper will read each line from the text file and produce a <key, value> tuple with the word as the key and a value of 1
• Simply tokenize each line and output each word
• Reducer will get a list of values (all 1s) for each word
• Tuple: <“death”, {1,1,1,1,1,1,1,1}>
• Now simply count the 1s and output as <“death”, 8>
• It’s Hadoop’s job to Shuffle and Sort in order to give the Reducer the correct tuple
• Final output of the Reducer is stored in HDFS; intermediate Mapper output is written to local disk
• Logs are generated outside HDFS
M/R: Mapper (Simple)
• All Mappers must extend this class: org.apache.hadoop.mapreduce.Mapper
• The input key is a special Hadoop type (LongWritable); for text files, it is the byte offset of the line
• The output value is a special Hadoop type (IntWritable); it indicates the type of value the Mapper will produce
• The “signature” indicates that the Mapper will consume LongWritable and Text, and will produce Text and IntWritable
• Notice word is of type Text; one is of type IntWritable
• The setup method is run only once, before any calls to the map function
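The mapper itself appears on the slide as a screenshot that isn’t in this transcript. A minimal sketch of a mapper with that signature, assuming the class name WordCountMapper (the original class name isn’t visible here):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Consumes <LongWritable, Text>; produces <Text, IntWritable>
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1); // reused for every word
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the line and emit <word, 1> for each token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }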
M/R Submitting: Driver
• Compile and create a jar file
• Then from the command prompt:
• [cloudera@quickstart ~]$ hadoop jar words.jar Driver /user/examples/shakespeare/input/ /user/examples/shakespeare/output/wordcount
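The Driver class itself isn’t shown in the transcript; a minimal sketch of what it might contain, assuming the WordCountMapper above and the WordCountReducer sketched later (both class names are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(Driver.class);
            job.setJobName("word count");
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            // Map output types differ from the final output types, so set both
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }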
What is Mapper Doing?
• Sample file segment
• Mapper function gets:
• Byte offset, line text
• <57020, “HAMLET To be, or not to be: that is the question:”>
• Mapper will:
• 1: tokenize the string
• 2: for each word, produce a tuple like:
• <“HAMLET”, 1>
• <“To”, 1>
• <“be”, 1>
• <“or”, 1>
• …
• Repeated for all lines
That’s it?
• Hadoop performs some Magic (Shuffling and Sorting) …
• And now we have tuples like:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• <“To”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• <“be”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• Note: the list of values isn’t correct (there are a lot more references to “HAMLET” in the text file), but it’s meant to be representative of what it would look like
M/R: Reducer (Simple)
• All Reducers must extend this class: org.apache.hadoop.mapreduce.Reducer
• The input key is the “key” for the data that’s being sent to the Reducer
• The output value is a special Hadoop type; it indicates the type of value the Reducer will produce
• The “signature” indicates that the Reducer will consume Text and IntWritable, and will produce Text and DoubleWritable
• Notice key is of type Text; result is of type DoubleWritable
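The reducer screenshot isn’t in this transcript either. A minimal sketch matching the signature described above (the class name WordCountReducer is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Consumes <Text, IntWritable>; produces <Text, DoubleWritable>
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private final DoubleWritable result = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();   // every value is 1, so this counts occurrences
            }
            result.set(sum);
            context.write(key, result); // e.g. <"HAMLET", 7>
        }
    }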
What is Reducer Doing?
• Hadoop will send a tuple to the Reducer:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• Reducer function:
• Iterates over all the values for that key
• Value is always 1, so simply sum
• Reducer outputs:
• <“HAMLET”, 7>
M/R Overview
• Input files (each line is passed to a mapper):
• “HAMLET To be, or not to …” / “Whether 'tis nobler in the …” / “Or to take arms against a …” / “And by opposing end …”
• Map tasks split each line into key/value pairs:
• HAMLET, 1  to, 1  be, 1  or, 1  not, 1  to, 1
• Whether, 1  ‘tis, 1  nobler, 1  in, 1  the, 1
• Or, 1  to, 1  take, 1  arms, 1  against, 1  a, 1
• And, 1  by, 1  opposing, 1  end, 1
• Sort and shuffle brings identical keys together:
• HAMLET, 1  a, 1  against, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 1  or, 1  take, 1  to, 1  to, 1  to, 1  …
• Reduce tasks emit key/value pairs:
• HAMLET, 1  a, 1  against, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 2  take, 1  to, 3  …
• Final output:
• HAMLET, 1  a, 1  against, 1  and, 1  arms, 1  be, 1  by, 1  end, 1  in, 1  nobler, 1  not, 1  opposing, 1  or, 2  take, 1  the, 1  'tis, 1  to, 3  whether, 1
Joins with M/R
• Not straightforward (a Mapper deals with a single record at a time)
• Two strategies:
• Re-partition join if both tables are large
• Basic idea is to use Mappers to produce keyed records so that matching records from both data sets end up in the same partition
• Assume an EmployeeID of 100; the Mapper produces:
• <100, “FirstName, LastName, Address”> (parent record)
• <100, “Skill1, Date, Level”> (child record)
• <100, “Skill2, Date, Level”> (child record)
• Reducer performs the join
• Expensive/costly due to shuffling and sorting
• Broadcast/replication join if one table is small
• Essentially send a copy of the small table to each Mapper
• Each Mapper performs the join (hedged sketches follow under the next three headings)
M/R Mapper: Broadcast Join
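The mapper code on this slide is a screenshot; a minimal sketch of the idea, assuming the small names file (Symbol, Name, Exchange) has been shipped to each mapper via the distributed cache under the link name names.txt, and the large price file is the map input:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BroadcastJoinMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Map<String, String> symbolToName = new HashMap<String, String>();
        private final Text name = new Text();
        private final DoubleWritable openPrice = new DoubleWritable();

        @Override
        protected void setup(Context context) throws IOException {
            // Runs once per mapper: load the small table from the distributed cache
            BufferedReader reader = new BufferedReader(new FileReader("names.txt"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(","); // Symbol, Name, Exchange
                    symbolToName.put(fields[0], fields[1]);
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Large table: Symbol, date, open, high, low, close, volume, adjclose
            String[] fields = value.toString().split(",");
            String companyName = symbolToName.get(fields[0]); // the join itself
            if (companyName != null) {
                name.set(companyName);
                openPrice.set(Double.parseDouble(fields[2])); // opening price
                context.write(name, openPrice);
            }
        }
    }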
M/R Reducer: Broadcast Join
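Again a sketch rather than the original screenshot: the reducer averages the opening prices collected for each company name.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgPriceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable average = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            average.set(sum / count);   // e.g. <"EMC", 27.2229>
            context.write(key, average);
        }
    }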
M/R Driver: Broadcast Join
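A sketch of the driver, assuming Hadoop 2.x; the cache-file path is illustrative, and the “#names.txt” fragment sets the local link name the mapper reads:

    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BroadcastJoinDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(BroadcastJoinDriver.class);
            job.setMapperClass(BroadcastJoinMapper.class);
            job.setReducerClass(AvgPriceReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            // Ship the small table to every mapper node
            job.addCacheFile(new URI("/user/examples/stocks/input/names.txt#names.txt"));
            FileInputFormat.addInputPath(job, new Path(args[0]));   // large price file(s)
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }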
Results
• Just the Mapper
• Mapper and Reducer (calculate Avg Price by Name)
Final Thoughts on M/R
• Java Experience Necessary
• Hadoop Streaming extends M/R to C, Python, etc.
• Can use Combiners to improve performance
• Reduces Network traffic
• “Difficult” to understand all the details, but granular control over data/process
• Useful when dealing with complex algorithms
• Several file formats available, but can also create custom formats
• Chaining Jobs to use output of one Job as input for another (a brief sketch follows):
• https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
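A minimal sketch of chaining under the Driver pattern above; it reuses the illustrative WordCountMapper/WordCountReducer from earlier, and the second job is left as a do-nothing identity pass just to show the wiring:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainDriver {
        public static void main(String[] args) throws Exception {
            Path intermediate = new Path("/user/examples/intermediate"); // scratch directory

            Job first = Job.getInstance();
            first.setJarByClass(ChainDriver.class);
            first.setMapperClass(WordCountMapper.class);
            first.setReducerClass(WordCountReducer.class);
            first.setMapOutputKeyClass(Text.class);
            first.setMapOutputValueClass(IntWritable.class);
            first.setOutputKeyClass(Text.class);
            first.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(first, new Path(args[0]));
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) System.exit(1); // block until job 1 finishes

            Job second = Job.getInstance(); // identity map/reduce by default
            second.setJarByClass(ChainDriver.class);
            FileInputFormat.addInputPath(second, intermediate); // output of job 1 = input of job 2
            FileOutputFormat.setOutputPath(second, new Path(args[1]));
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }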
Pig
• Higher level abstraction for writing M/R jobs
• Data Flow “language”
• Sequence of transformations (filtering, grouping, joining, etc.)
• Pig Latin (the language for Pig)
• It’s not SQL, not even close
• Pig scripts are run as M/R jobs in Hadoop
• Pig Shell will compile and optimize script
• Need to understand data in order to create schemas
• Pig can define simple and complex types, so parent/child data can exist in one “line” (think JSON)
• User Defined Functions (UDF) can be written in Java, Jython, etc. http://pig.apache.org/docs/r0.9.1/udf.html
Generic Example
• This script shows many of the operations within Pig
Users = load 'users' as (name, age);                       -- load the users data set
Fltrd = filter Users by age >= 18 and age <= 25;           -- keep 18-25 year olds
Pages = load 'pages' as (user, url);                       -- load the page-view data set
Jnd = join Fltrd by name, Pages by user;                   -- join users to their page views
Grpd = group Jnd by url;                                   -- group the joined records by url
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;  -- count clicks per url
Srtd = order Smmd by clicks desc;                          -- sort by click count, descending
Top5 = limit Srtd 5;                                       -- keep the top five
store Top5 into 'top5sites';                               -- write the result to HDFS
Avg Opening Price by Name
• Performs a join between the two datasets
• describe shows you the structure of a relation
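The script itself appears on the slide as a screenshot; a minimal Pig Latin sketch of the same idea, where the file paths and field names are assumptions based on the earlier slides:

    -- Small table: Symbol, Name, Exchange
    names  = LOAD '/user/examples/stocks/input/names.txt' USING PigStorage(',')
             AS (symbol:chararray, name:chararray, exchange:chararray);
    -- Large table: Symbol, date, open, high, low, close, volume, adjclose
    prices = LOAD '/user/examples/stocks/input/prices.txt' USING PigStorage(',')
             AS (symbol:chararray, date:chararray, open:double, high:double,
                 low:double, close:double, volume:long, adjclose:double);
    jnd    = JOIN names BY symbol, prices BY symbol;  -- join the two datasets
    describe jnd;                                     -- shows the structure of the join
    grpd   = GROUP jnd BY names::name;                -- group by company name
    avgs   = FOREACH grpd GENERATE group AS name, AVG(jnd.prices::open) AS avgopen;
    store avgs into '/user/examples/stocks/output/avgopen';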
Pig Scripts are Hadoop Jobs
• z.pig is the name of the script
Hive
• It’s not Pig
• SQL-based tool for Hadoop (HiveQL, not SQL)
• Friendlier for SQL users
• “Databases” are simply Namespaces
• “Tables” similar to SQL Tables
• Cannot Insert/Update/Delete
• New data is added when HDFS is updated (add a file to HDFS)
• Metadata is kept in a relational database (MySQL by default)
Hive and HDFS
• When a table points to an HDFS location, it will read all files in that location; you cannot specify a single file
• Easy to create partitions; simply create subdirectories
• That’s why each file is stored in a separate directory
Hive Script
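The Hive script on this slide is a screenshot; a minimal HiveQL sketch of the same average-opening-price query, where the table names, column names, and HDFS locations are assumptions:

    -- External tables point at directories in HDFS; Hive reads every file in them
    CREATE EXTERNAL TABLE names (symbol STRING, name STRING, `exchange` STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/examples/stocks/names';

    CREATE EXTERNAL TABLE prices (symbol STRING, dt STRING, open DOUBLE, high DOUBLE,
        low DOUBLE, close DOUBLE, volume BIGINT, adjclose DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/examples/stocks/prices';

    -- Average opening price by company name
    SELECT n.name, AVG(p.open) AS avg_open
    FROM names n JOIN prices p ON (n.symbol = p.symbol)
    GROUP BY n.name;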
Hive Results
Hive Scripts are Hadoop Jobs
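The job output itself isn’t in the transcript. As with Pig, submitting a script from the shell compiles it into M/R jobs; for example (stocks.hql is an illustrative file name):

    [cloudera@quickstart ~]$ hive -f stocks.hql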