
Computing & Information Sciences, Kansas State University

Kansas State University Olathe Workshop on Big Data – August 2014

KSU Laboratory for Knowledge Discovery in Databases

Kansas State University Olathe

Tuesday, 12 August 2014

William H. Hsu
http://www.cis.ksu.edu/~bhsu

Laboratory for Knowledge Discovery in Databases, Kansas State University

http://www.kddresearch.org

Acknowledgements

K-State Manhattan: Majed Alsadhan,

Scott Finkeldei, Kyle Hudson, Surya Teja Kallumadi

K-State Olathe: Dr. Prema Arasu, Dana Reinert,

Paige Adams, Cathy Danahy, Angela Cummins, Emily Surdez,

Quentin New, Amy Burgess

Big Data Workshop: Day 1, Part II – Beginner Tutorial on MapReduce





What is MapReduce?

A programming model (and its associated implementation):

• For processing large data sets
• Exploits a large set of commodity computers
• Executes processes in a distributed manner
• Offers a high degree of transparency

In other words: simple, and maybe suitable for your tasks!

© 2006, H. Setiawan, National University of Singapore

http://bit.ly/mapreduce-intro-setiawan






Distributed Grep

[Figure: very big data is split into four pieces; grep runs on each split in parallel and produces matches; cat concatenates them into all matches.]




Distributed Word Count

[Figure: very big data is split into four pieces; count runs on each split in parallel; merge combines the partial counts into the merged count.]




Map Reduce

Map:
• Accepts an input key/value pair
• Emits an intermediate key/value pair

Reduce:
• Accepts an intermediate key/value* pair
• Emits an output key/value pair

[Figure: very big data → MAP → partitioning function → REDUCE → result.]




Partitioning Function




Partitioning Function (2)

• Default: hash(key) mod R
• Guarantees:
  – Relatively well-balanced partitions
  – Ordering guarantee within each partition

Distributed Sort

Map: emit(key, value)

Reduce (with R = 1): emit(key, value)
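The default partitioning rule above, hash(key) mod R, can be sketched in a few lines of Python. This is an illustration only, not how any particular framework implements it: the names partition and shuffle are hypothetical, and zlib.crc32 stands in for the hash function because Python's built-in hash() for strings varies between runs.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer with hash(key) mod R.

    zlib.crc32 is an assumed stand-in for the framework's hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to partitions and sort within each one."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    # Ordering is guaranteed only inside a partition, not across partitions.
    return [sorted(p) for p in partitions]


pairs = [("b", 2), ("a", 1), ("c", 3), ("a", 5), ("b", 7)]
parts = shuffle(pairs, 3)
```

With R = 1 every key lands in the single partition, which is why the distributed sort on the slide yields one fully sorted output.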




MapReduce

Distributed Grep

Map: if match(value, pattern) emit(value, 1)

Reduce: emit(key, sum(value*))

Distributed Word Count

Map: for all w in value do emit(w, 1)

Reduce: emit(key, sum(value*))
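To make the two pseudo-code examples concrete, here is a minimal single-process sketch in Python. It only imitates the map, group-by-key, and reduce steps in memory; run_mapreduce and the other function names are hypothetical helpers, not part of Hadoop or Google's library.

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result


def grep_map(pattern):
    """Build a map function that emits (line, 1) for matching lines."""
    def map_fn(_key, value):
        if re.search(pattern, value):
            yield value, 1
    return map_fn


def wordcount_map(_key, value):
    """Emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def sum_reduce(key, values):
    """Emit the key with the sum of its values."""
    yield key, sum(values)


lines = list(enumerate([
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
]))
grep_result = run_mapreduce(lines, grep_map(r"fox"), sum_reduce)
count_result = run_mapreduce(lines, wordcount_map, sum_reduce)
```

Both jobs share the same reducer; only the map function changes.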




MapReduce Transparencies

Plus Google Distributed File System:

• Parallelization
• Fault-tolerance
• Locality optimization
• Load balancing




Suitable for your task if

• You have a cluster
• You are working with a large dataset
• You are working with independent data (or data assumed to be independent)
• The task can be cast into map and reduce




MapReduce outside Google

Hadoop (Java)

• Emulates MapReduce and GFS
• The architecture of Hadoop MapReduce and DFS is master/slave

             Master       Slave
MapReduce    jobtracker   tasktracker
DFS          namenode     datanode




Example Word Count (1)

Map

public static class MapClass extends MapReduceBase implements Mapper {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Reduce

public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Main

public static void main(String[] args) throws IOException {
  // checking goes here
  JobConf conf = new JobConf();

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));
  JobClient.runJob(conf);
}




One-time setup

• Set hadoop-site.xml and slaves
• Initiate the namenode
• Run Hadoop MapReduce and DFS
• Upload your data to DFS
• Run your process…
• Download your data from DFS




Summary

• A simple programming model for processing large datasets on large clusters of computers

• Fun to use: focus on the problem and let the library deal with the messy details




References

• Original paper (http://labs.google.com/papers/mapreduce.html)
• On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)
• Hadoop – MapReduce in Java (http://lucene.apache.org/hadoop/)
• Starfish - MapReduce in Ruby (http://rufy.com/starfish/)




What is Cloud Computing?

• “Cloud” refers to large Internet services like Google, Yahoo, etc. that run on 10,000’s of machines

• More recently, “cloud computing” refers to services by these companies that let external customers rent computing cycles on their clusters
  – Amazon EC2: virtual machines at 10¢/hour, billed hourly
  – Amazon S3: storage at 15¢/GB/month

• Attractive features:
  – Scale: up to 100’s of nodes
  – Fine-grained billing: pay only for what you use
  – Ease of use: sign up with credit card, get root access

© 2009, M. Zaharia, University of California – Berkeley

http://bit.ly/compute-clouds-zaharia






What is MapReduce?

• Simple data-parallel programming model designed for scalability and fault-tolerance

• Pioneered by Google
  – Processes 20 petabytes of data per day

• Popularized by open-source Hadoop project
  – Used at Yahoo!, Facebook, Amazon, …








What is MapReduce used for?

• At Google:
  – Index construction for Google Search
  – Article clustering for Google News
  – Statistical machine translation

• At Yahoo!:
  – “Web map” powering Yahoo! Search
  – Spam detection for Yahoo! Mail

• At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection




Example: Facebook Lexicon

http://www.facebook.com/lexicon




What is MapReduce used for?

• In research:
  – Astronomical image analysis (Washington)
  – Bioinformatics (Maryland)
  – Analyzing Wikipedia conflicts (PARC)
  – Natural language processing (CMU)
  – Particle physics (Nebraska)
  – Ocean climate simulation (Washington)
  – <Your application here>




Outline

MapReduce architecture

Example applications

Getting started with Hadoop

Higher-level languages over Hadoop: Pig and Hive

Amazon Elastic MapReduce




MapReduce Design Goals

1. Scalability to large data volumes:
  – 1000’s of machines, 10,000’s of disks

2. Cost-efficiency:
  – Commodity machines (cheap, but unreliable)
  – Commodity network
  – Automatic fault-tolerance (fewer administrators)
  – Easy to use (fewer programmers)




Typical Hadoop Cluster

[Figure: racks of nodes, each with a rack switch, connected through an aggregation switch.]

• 40 nodes/rack, 1000-4000 nodes in cluster
• 1 Gbps bandwidth within rack, 8 Gbps out of rack
• Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB?)

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf




Typical Hadoop Cluster

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf




Challenges

1. Cheap nodes fail, especially if you have many
  – Mean time between failures for 1 node = 3 years
  – Mean time between failures for 1000 nodes = 1 day
  – Solution: Build fault-tolerance into system

2. Commodity network = low bandwidth
  – Solution: Push computation to the data

3. Programming distributed systems is hard
  – Solution: Data-parallel programming model: users write “map” & “reduce” functions, system distributes work and handles faults
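The "3 years for 1 node, 1 day for 1000 nodes" figures follow from simple arithmetic. The sketch below assumes that nodes fail independently at a constant rate, which the slide does not state explicitly; the variable names are for illustration only.

```python
# Assumption: independent failures at a constant rate, so the cluster-wide
# failure rate is the per-node rate multiplied by the number of nodes.
node_mtbf_days = 3 * 365   # one node fails about once every 3 years
num_nodes = 1000

cluster_mtbf_days = node_mtbf_days / num_nodes
print(f"Expected time between failures in the cluster: {cluster_mtbf_days:.2f} days")
```

Under that assumption a 1000-node cluster sees a failure roughly once a day, so fault tolerance has to be built into the system rather than handled by hand.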




Hadoop Components

• Distributed file system (HDFS)
  – Single namespace for entire cluster
  – Replicates data 3x for fault-tolerance

• MapReduce framework
  – Executes user jobs specified as “map” and “reduce” functions
  – Manages work distribution & fault-tolerance




Hadoop Distributed File System

• Files split into 128MB blocks
• Blocks replicated across several datanodes (usually 3)
• Single namenode stores metadata (file names, block locations, etc)
• Optimized for large files, sequential reads
• Files are append-only

[Figure: a namenode holding the block list for File1 (blocks 1, 2, 3, 4) and four datanodes storing blocks {1, 2, 4}, {2, 1, 3}, {1, 4, 3}, and {3, 2, 4}.]
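The block-splitting and replication rules above can be illustrated with a small Python sketch. The round-robin placement used here is a deliberate simplification assumed for the example (real HDFS placement is rack-aware), and place_blocks is a hypothetical helper, not an HDFS API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as on the slide
REPLICATION = 3                 # usual replication factor


def place_blocks(file_size, datanodes):
    """Return {block_index: [datanodes holding a replica]}.

    Assumes at least REPLICATION datanodes and a simple round-robin
    placement, which is only a stand-in for the real policy.
    """
    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + replica) % len(datanodes)]
            for replica in range(REPLICATION)
        ]
    return placement


layout = place_blocks(500 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

A 500 MB file needs four blocks, and each block ends up on three different datanodes, so losing any single node leaves two replicas of every block.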




MapReduce Programming Model

• Data type: key-value records

• Map function:

  (K_in, V_in) → list(K_inter, V_inter)

• Reduce function:

  (K_inter, list(V_inter)) → list(K_out, V_out)




Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))




Word Count Execution

Input → Map → Shuffle & Sort → Reduce → Output

Input:
• Split 1: the quick brown fox
• Split 2: the fox ate the mouse
• Split 3: how now brown cow

Map output:
• Map 1: (the, 1), (quick, 1), (brown, 1), (fox, 1)
• Map 2: (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
• Map 3: (how, 1), (now, 1), (brown, 1), (cow, 1)

Reduce output:
• Reduce 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
• Reduce 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)




MapReduce Execution Details

• Single master controls job execution on multiple slaves

• Mappers preferentially placed on same node or same rack as their input block
  – Minimizes network usage

• Mappers save outputs to local disk before serving them to reducers
  – Allows recovery if a reducer crashes
  – Allows having more reducers than nodes




An Optimization: The Combiner

• A combiner is a local aggregation function for repeated keys produced by the same map
• Works for associative functions like sum, count, max
• Decreases size of intermediate data

• Example: map-side aggregation for Word Count:

def combiner(key, values):
    output(key, sum(values))




Word Count with Combiner

Input → Map & Combine → Shuffle & Sort → Reduce → Output

Input:
• Split 1: the quick brown fox
• Split 2: the fox ate the mouse
• Split 3: how now brown cow

Map & Combine output:
• Map 1: (the, 1), (quick, 1), (brown, 1), (fox, 1)
• Map 2: (the, 2), (fox, 1), (ate, 1), (mouse, 1)
• Map 3: (how, 1), (now, 1), (brown, 1), (cow, 1)

Reduce output:
• Reduce 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
• Reduce 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
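The effect of the combiner on the second input split can be reproduced with a short Python sketch. The function names are illustrative; a real combiner is invoked by the framework per key, whereas this version simply aggregates one map task's output in memory.

```python
from collections import Counter


def mapper(line):
    """Emit (word, 1) for each word, as in the Word Count mapper."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Locally sum the counts of repeated keys from one map task."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


raw = list(mapper("the fox ate the mouse"))
combined = combiner(raw)
```

Here the map task would ship five pairs without the combiner and four with it, because the two (the, 1) pairs collapse into (the, 2). The saving grows with the number of repeated words per split.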




Fault Tolerance in MapReduce

1. If a task crashes:
  – Retry on another node
    • OK for a map because it has no dependencies
    • OK for reduce because map outputs are on disk
  – If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Note: For these fault tolerance features to work, your map and reduce tasks must be side-effect-free

2. If a node crashes:
  – Re-launch its current tasks on other nodes
  – Re-run any maps the node previously ran
    • Necessary because their output files were lost along with the crashed node

3. If a task is going slowly (straggler):
  – Launch second copy of task on another node (“speculative execution”)
  – Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
  – Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc
  – Single straggler may noticeably slow down a job




Takeaways

• By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
  – Automatic division of job into tasks
  – Automatic placement of computation near data
  – Automatic load balancing
  – Recovery from failures & stragglers

• User focuses on application, not on complexities of distributed computing













1. Search

• Input: (lineNumber, line) records
• Output: lines matching a given pattern

• Map:
    if (line matches pattern):
        output(line)

• Reduce: identity function
  – Alternative: no reducer (map-only job)




2. Sort

• Input: (key, value) records
• Output: same records, sorted by key

• Map: identity function
• Reduce: identity function

• Trick: Pick partitioning function h such that k1 < k2 => h(k1) <= h(k2)

[Figure: three map tasks send keys to two reduce tasks by key range. Reducer [A-M] receives (ant, bee), (aardvark, elephant), and (cow) and outputs aardvark, ant, bee, cow, elephant. Reducer [N-Z] receives (zebra), (pig), and (sheep, yak) and outputs pig, sheep, yak, zebra.]
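The partitioning trick for sorting can be sketched as follows. The partitioner below assumes lowercase ASCII keys and splits the alphabet into equal ranges, which is a simplification chosen for this example; range_partition and distributed_sort are hypothetical names.

```python
def range_partition(key, num_reducers):
    """Order-preserving partitioner: k1 < k2 implies h(k1) <= h(k2).

    Assumes keys start with a lowercase ASCII letter.
    """
    index = ord(key[0]) - ord("a")
    return min(index * num_reducers // 26, num_reducers - 1)


def distributed_sort(keys, num_reducers=2):
    """Route keys by range, sort inside each reducer, then concatenate."""
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        partitions[range_partition(key, num_reducers)].append(key)
    result = []
    for part in partitions:      # reducer 0 holds the smallest keys
        result.extend(sorted(part))
    return result


animals = ["zebra", "cow", "pig", "aardvark", "elephant",
           "ant", "bee", "sheep", "yak"]
ordered = distributed_sort(animals)
```

Because every key in reducer 0 sorts before every key in reducer 1, concatenating the reducer outputs gives a globally sorted list without a final merge.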




3. Inverted Index

• Input: (filename, text) records
• Output: list of files containing each word

• Map:
    for word in text.split():
        output(word, filename)

• Combine: uniquify filenames for each word

• Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))




Inverted Index Example

Input:
• hamlet.txt: to be or not to be
• 12th.txt: be not afraid of greatness

Map output:
• (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
• (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)

Reduce output:
• afraid, (12th.txt)
• be, (12th.txt, hamlet.txt)
• greatness, (12th.txt)
• not, (12th.txt, hamlet.txt)
• of, (12th.txt)
• or, (hamlet.txt)
• to, (hamlet.txt)
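The example above can be reproduced with a small in-memory Python sketch. For brevity the "uniquify" step of the combiner is folded into the reducer here, and the grouping that the shuffle would perform is done with a dictionary; all function names are illustrative.

```python
from collections import defaultdict


def map_fn(filename, text):
    """Emit (word, filename) for every word in the file."""
    for word in text.split():
        yield word, filename


def reduce_fn(word, filenames):
    """Deduplicate and sort the filenames for one word."""
    return word, sorted(set(filenames))


def inverted_index(files):
    """Run map, group by word (the shuffle), then reduce."""
    groups = defaultdict(list)
    for filename, text in files.items():
        for word, name in map_fn(filename, text):
            groups[word].append(name)
    return dict(reduce_fn(word, names) for word, names in sorted(groups.items()))


index = inverted_index({
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
})
```

Looking up index["be"] returns both files, while index["to"] returns only hamlet.txt even though "to" occurs twice in that file.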




4. Most Popular Words

• Input: (filename, text) records
• Output: top 100 words occurring in the most files

• Two-stage solution:
  – Job 1: Create inverted index, giving (word, list(file)) records
  – Job 2: Map each (word, list(file)) to (count, word); sort these records by count as in sort job

• Optimizations:
  – Map to (word, 1) instead of (word, file) in Job 1
  – Count files in job 1’s reducer rather than job 2’s mapper
  – Estimate count distribution in advance and drop rare words
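A compact Python sketch of the two-stage solution, with the first two optimizations applied (emit (word, 1) once per file and count in Job 1). The function names are hypothetical, both jobs run in a single process here, and ties are broken alphabetically as an arbitrary choice for the example.

```python
from collections import defaultdict


def job1_document_frequency(files):
    """Job 1: count how many files contain each word."""
    counts = defaultdict(int)
    for _filename, text in files.items():
        for word in set(text.split()):   # (word, 1) once per file
            counts[word] += 1
    return counts


def job2_top_words(counts, n):
    """Job 2: map to (count, word), sort by count descending, keep n."""
    swapped = [(count, word) for word, count in counts.items()]
    swapped.sort(key=lambda cw: (-cw[0], cw[1]))
    return [(word, count) for count, word in swapped[:n]]


files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}
top2 = job2_top_words(job1_document_frequency(files), 2)
```

On the two sample files only "be" and "not" appear in both, so they form the top 2.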




5. Numerical Integration

• Input: (start, end) records for sub-ranges to integrate
  – Easy using custom InputFormat
• Output: integral of f(x) dx over entire range

• Map:
    def map(start, end):
        sum = 0
        for (x = start; x < end; x += step):
            sum += f(x) * step
        output("", sum)

• Reduce:
    def reduce(key, values):
        output(key, sum(values))
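The integration pseudo-code translates almost directly into runnable Python. In this sketch the sub-ranges are generated by a hypothetical integrate driver rather than a custom InputFormat, and the step size and number of ranges are arbitrary values chosen for illustration.

```python
def map_integrate(f, start, end, step):
    """Map: left Riemann sum of f over one sub-range, under a single key."""
    total = 0.0
    x = start
    while x < end:
        total += f(x) * step
        x += step
    return "", total


def reduce_sum(key, values):
    """Reduce: add up the partial integrals."""
    return key, sum(values)


def integrate(f, lo, hi, num_ranges=4, step=1e-4):
    """Split [lo, hi) into sub-ranges, map each one, then reduce."""
    width = (hi - lo) / num_ranges
    ranges = [(lo + i * width, lo + (i + 1) * width) for i in range(num_ranges)]
    partials = [map_integrate(f, s, e, step)[1] for s, e in ranges]
    return reduce_sum("", partials)[1]


approx = integrate(lambda x: x * x, 0.0, 3.0)
```

For f(x) = x² on [0, 3] the exact integral is 9, and the approximation lands close to it; all maps emit the same empty key so that a single reducer sees every partial sum.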













Getting Started with Hadoop

• Download from hadoop.apache.org
• To install locally, unzip and set JAVA_HOME
• Details: hadoop.apache.org/core/docs/current/quickstart.html

• Three ways to write jobs:
  – Java API
  – Hadoop Streaming (for Python, Perl, etc)
  – Pipes API (C++)




Word Count in Java

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  conf.setOutputKeyClass(Text.class);          // out keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // values are counts

  JobClient.runJob(conf);
}




Word Count in Python with Hadoop Streaming

Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))













Motivation

Many parallel algorithms can be expressed by a series of MapReduce jobs

But MapReduce is fairly low-level: must think about keys, values, partitioning, etc

Can we capture common “job building blocks”?




Pig

• Started at Yahoo! Research
• Runs about 30% of Yahoo!’s jobs
• Features:
  – Expresses sequences of MapReduce jobs
  – Data model: nested “bags” of items
  – Provides relational (SQL) operators (JOIN, GROUP BY, etc)
  – Easy to plug in Java functions
  – Pig Pen development environment for Eclipse




An Example Problem

Suppose you have user data in one file, page view data in another, and you need to find the top 5 most visited pages by users aged 18-25.

[Figure: dataflow. Load Users → Filter by age; Load Pages; both feed Join on name → Group on url → Count clicks → Order by clicks → Take top 5.]

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt




In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
  public static class LoadPages extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc,
        Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String key = line.substring(0, firstComma);
      String value = line.substring(firstComma + 1);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file
      // it came from.
      Text outVal = new Text("1" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class LoadAndFilterUsers extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc,
        Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String value = line.substring(firstComma + 1);
      int age = Integer.parseInt(value);
      if (age < 18 || age > 25) return;
      String key = line.substring(0, firstComma);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file
      // it came from.
      Text outVal = new Text("2" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class Join extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key,
        Iterator<Text> iter,
        OutputCollector<Text, Text> oc,
        Reporter reporter) throws IOException {
      // For each value, figure out which file it's from and store it
      // accordingly.
      List<String> first = new ArrayList<String>();
      List<String> second = new ArrayList<String>();

      while (iter.hasNext()) {
        Text t = iter.next();
        String value = t.toString();
        if (value.charAt(0) == '1') first.add(value.substring(1));
        else second.add(value.substring(1));
        reporter.setStatus("OK");
      }

      // Do the cross product and collect the values
      for (String s1 : first) {
        for (String s2 : second) {
          String outval = key + "," + s1 + "," + s2;
          oc.collect(null, new Text(outval));
          reporter.setStatus("OK");
        }
      }
    }
  }

  public static class LoadJoined extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {

    public void map(
        Text k,
        Text val,
        OutputCollector<Text, LongWritable> oc,
        Reporter reporter) throws IOException {
      // Find the url (the field between the first and second commas)
      String line = val.toString();
      int firstComma = line.indexOf(',');
      int secondComma = line.indexOf(',', firstComma + 1);
      String key = line.substring(firstComma + 1, secondComma);
      // drop the rest of the record, I don't need it anymore,
      // just pass a 1 for the combiner/reducer to sum instead.
      Text outKey = new Text(key);
      oc.collect(outKey, new LongWritable(1L));
    }
  }

  public static class ReduceUrls extends MapReduceBase
      implements Reducer<Text, LongWritable, WritableComparable, Writable> {

    public void reduce(
        Text key,
        Iterator<LongWritable> iter,
        OutputCollector<WritableComparable, Writable> oc,
        Reporter reporter) throws IOException {
      // Add up all the values we see
      long sum = 0;
      while (iter.hasNext()) {
        sum += iter.next().get();
        reporter.setStatus("OK");
      }
      oc.collect(key, new LongWritable(sum));
    }
  }

  public static class LoadClicks extends MapReduceBase
      implements Mapper<WritableComparable, Writable, LongWritable, Text> {

    public void map(
        WritableComparable key,
        Writable val,
        OutputCollector<LongWritable, Text> oc,
        Reporter reporter) throws IOException {
      oc.collect((LongWritable) val, (Text) key);
    }
  }

  public static class LimitClicks extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {

    int count = 0;

    public void reduce(
        LongWritable key,
        Iterator<Text> iter,
        OutputCollector<LongWritable, Text> oc,
        Reporter reporter) throws IOException {
      // Only output the first 100 records
      while (count < 100 && iter.hasNext()) {
        oc.collect(key, iter.next());
        count++;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf lp = new JobConf(MRExample.class);
    lp.setJobName("Load Pages");
    lp.setInputFormat(TextInputFormat.class);
    lp.setOutputKeyClass(Text.class);
    lp.setOutputValueClass(Text.class);
    lp.setMapperClass(LoadPages.class);
    FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
    FileOutputFormat.setOutputPath(lp,
        new Path("/user/gates/tmp/indexed_pages"));
    lp.setNumReduceTasks(0);
    Job loadPages = new Job(lp);

    JobConf lfu = new JobConf(MRExample.class);
    lfu.setJobName("Load and Filter Users");
    lfu.setInputFormat(TextInputFormat.class);
    lfu.setOutputKeyClass(Text.class);
    lfu.setOutputValueClass(Text.class);
    lfu.setMapperClass(LoadAndFilterUsers.class);
    FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
    FileOutputFormat.setOutputPath(lfu,
        new Path("/user/gates/tmp/filtered_users"));
    lfu.setNumReduceTasks(0);
    Job loadUsers = new Job(lfu);

    JobConf join = new JobConf(MRExample.class);
    join.setJobName("Join Users and Pages");
    join.setInputFormat(KeyValueTextInputFormat.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setMapperClass(IdentityMapper.class);
    join.setReducerClass(Join.class);
    FileInputFormat.addInputPath(join,
        new Path("/user/gates/tmp/indexed_pages"));
    FileInputFormat.addInputPath(join,
        new Path("/user/gates/tmp/filtered_users"));
    FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
    join.setNumReduceTasks(50);
    Job joinJob = new Job(join);
    joinJob.addDependingJob(loadPages);
    joinJob.addDependingJob(loadUsers);

    JobConf group = new JobConf(MRExample.class);
    group.setJobName("Group URLs");
    group.setInputFormat(KeyValueTextInputFormat.class);
    group.setOutputKeyClass(Text.class);
    group.setOutputValueClass(LongWritable.class);
    group.setOutputFormat(SequenceFileOutputFormat.class);
    group.setMapperClass(LoadJoined.class);
    group.setCombinerClass(ReduceUrls.class);
    group.setReducerClass(ReduceUrls.class);
    FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
    FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
    group.setNumReduceTasks(50);
    Job groupJob = new Job(group);
    groupJob.addDependingJob(joinJob);

    JobConf top100 = new JobConf(MRExample.class);
    top100.setJobName("Top 100 sites");
    top100.setInputFormat(SequenceFileInputFormat.class);
    top100.setOutputKeyClass(LongWritable.class);
    top100.setOutputValueClass(Text.class);
    top100.setOutputFormat(SequenceFileOutputFormat.class);
    top100.setMapperClass(LoadClicks.class);
    top100.setCombinerClass(LimitClicks.class);
    top100.setReducerClass(LimitClicks.class);
    FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
    FileOutputFormat.setOutputPath(top100,
        new Path("/user/gates/top100sitesforusers18to25"));
    top100.setNumReduceTasks(1);
    Job limit = new Job(top100);
    limit.addDependingJob(groupJob);

    JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
    jc.addJob(loadPages);
    jc.addJob(loadUsers);
    jc.addJob(joinJob);
    jc.addJob(groupJob);
    jc.addJob(limit);
    jc.run();
  }
}






In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';




Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users         Users = load …
Filter by age      Filtered = filter …
Load Pages         Pages = load …
Join on name       Joined = join …
Group on url       Grouped = group …
Count clicks       Summed = … count()…
Order by clicks    Sorted = order …
Take top 5         Top5 = limit …





Ease of Translation

[Figure: the same dataflow steps and Pig Latin statements, with the steps grouped into three MapReduce jobs: Job 1, Job 2, and Job 3.]





Hive

• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
  – Maintains list of table schemas
  – SQL-like query language (HQL)
  – Can call Hadoop Streaming scripts from HQL
  – Supports table partitioning, clustering, complex data types, some optimizations




Sample Hive Queries

• Find top 5 pages visited by users aged 18-25:

SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;

• Filter page views through Python script:

SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views p;













Amazon Elastic MapReduce

• Provides a web-based interface and command-line tools for running Hadoop jobs on Amazon EC2
• Data stored in Amazon S3
• Monitors job and shuts down machines after use
• Small extra charge on top of EC2 pricing

• If you want more control over how your Hadoop cluster runs, you can launch a Hadoop cluster on EC2 manually using the scripts in src/contrib/ec2




Elastic MapReduce Workflow




Conclusions

• MapReduce programming model hides the complexity of work distribution and fault tolerance

• Principal design philosophies:
  – Make it scalable, so you can throw hardware at problems
  – Make it cheap, lowering hardware, programming and admin costs

• MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

• Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale




MapReduce: Recap

• Programmers must specify:
  – map (k, v) → <k’, v’>*
  – reduce (k’, v’) → <k’, v’>*
  – All values with the same key are reduced together

• Optionally, also:
  – partition (k’, number of partitions) → partition for k’
    Often a simple hash of the key, e.g., hash(k’) mod n
    Divides up key space for parallel reduce operations
  – combine (k’, v’) → <k’, v’>*
    Mini-reducers that run in memory after the map phase
    Used as an optimization to reduce network traffic

• The execution framework handles everything else…

Adapted from slides © 2012, J. Lin & R. Jin

http://bit.ly/jin-cloud-2012






“Everything Else”

• The execution framework handles everything else…
  – Scheduling: assigns workers to map and reduce tasks
  – “Data distribution”: moves processes to data
  – Synchronization: gathers, sorts, and shuffles intermediate data
  – Errors and faults: detects worker failures and restarts

• Limited control over data and execution flow
  – All algorithms must be expressed in m, r, c, p

• You don’t know:
  – Where mappers and reducers run
  – When a mapper or reducer begins or finishes
  – Which input a particular mapper is processing
  – Which intermediate key a particular reducer is processing




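[Figure: MapReduce data flow.
• Input pairs (k1, v1) … (k6, v6) are divided among four map tasks.
• Map outputs: (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8).
• Each map output passes through combine and then partition; the second combiner merges (c, 3) and (c, 6) into (c, 9).
• Shuffle and Sort: aggregate values by keys, giving a → [1, 5], b → [2, 7], c → [2, 9, 8].
• Three reduce tasks emit (r1, s1), (r2, s2), (r3, s3).]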




Tools for Synchronization

• Cleverly-constructed data structures
  – Bring partial results together
• Sort order of intermediate keys
  – Control order in which reducers process keys
• Partitioner
  – Control which reducer processes which keys
• Preserving state in mappers and reducers
  – Capture dependencies across multiple keys and values




Preserving State

[Figure: a Mapper object and a Reducer object, each holding state, with one object per task.
• configure: API initialization hook
• map: one call per input key-value pair
• reduce: one call per intermediate key
• close: API cleanup hook]




Scalable Hadoop Algorithms: Themes

• Avoid object creation
  – Inherently costly operation
  – Garbage collection
• Avoid buffering
  – Limited heap size
  – Works for small datasets, but won’t scale!




Importance of Local Aggregation

• Ideal scaling characteristics:
  – Twice the data, twice the running time
  – Twice the resources, half the running time

• Why can’t we achieve this?
  – Synchronization requires communication
  – Communication kills performance

• Thus… avoid communication!
  – Reduce intermediate data via local aggregation
  – Combiners can help




Shuffle and Sort

[Figure: Mapper → circular buffer (in memory) → spills (on disk) → merged spills (on disk) → intermediate files (on disk) → Reducer. A Combiner appears at two points in this pipeline; intermediate data also flows to other reducers and arrives from other mappers.]




Word Count: Baseline

What’s the impact of combiners?

Word Count: Version 1

Are combiners still needed?

Word Count: Version 2

Key: preserve state across input key-value pairs!




Design Pattern for Local Aggregation

• “In-mapper combining”
  – Fold the functionality of the combiner into the mapper by preserving state across multiple map calls

• Advantages
  – Speed
  – Why is this faster than actual combiners?

• Disadvantages
  – Explicit memory management required
  – Potential for order-dependent bugs
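A minimal Python sketch of in-mapper combining for word count. The class below is hypothetical and is driven by hand; it borrows the configure / map / close hook names from the "Preserving State" slide, and in a real job the framework would call those hooks and collect the output.

```python
from collections import defaultdict


class InMapperCombiningWordCount:
    """Word-count mapper that aggregates in memory instead of emitting
    (word, 1) for every token."""

    def __init__(self):
        self.emitted = []          # stands in for the framework's output collector

    def configure(self):
        # State preserved across map() calls for the lifetime of the task.
        self.counts = defaultdict(int)

    def map(self, _key, line):
        for word in line.split():
            self.counts[word] += 1  # nothing is emitted here

    def close(self):
        # Emit one pair per distinct word when the task finishes.
        for word, count in sorted(self.counts.items()):
            self.emitted.append((word, count))


task = InMapperCombiningWordCount()
task.configure()
task.map(0, "the fox ate the mouse")
task.map(1, "the quick brown fox")
task.close()
```

The task emits six pairs instead of nine, and no separate combiner pass is needed. The cost is the disadvantage listed above: the counts dictionary must fit in the task's memory.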




Combiner Design

• Combiners and reducers share same method signature
  – Sometimes, reducers can serve as combiners
  – Often, not…

• Remember: combiners are optional optimizations
  – Should not affect algorithm correctness
  – May be run 0, 1, or multiple times

• Example: find average of all integers associated with the same key




Computing the Mean: Version 1

Why can’t we use reducer as combiner?

Why doesn’t this work?

Fixed?
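The questions above can be explored with a short Python sketch. It shows one common way to make the mean combiner-safe, namely carrying (sum, count) pairs as the intermediate value; this is an illustration under that assumption, not necessarily the exact pseudo-code from the missing slides, and all function names are hypothetical.

```python
def mapper(key, value):
    """Emit the value as a (sum, count) pair."""
    return key, (value, 1)


def combiner(key, pairs):
    """Merge partial (sum, count) pairs; output has the same type as input."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, (total, count)


def reducer(key, pairs):
    """Merge all partial pairs and divide once at the end."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count


# Using a mean-computing reducer as a combiner gives the wrong answer:
mean_of_means = (sum([1, 2]) / 2 + sum([3, 4, 5]) / 3) / 2

# With (sum, count) pairs the result is correct whether or not combiners run.
split_a = [mapper("k", v)[1] for v in [1, 2]]
split_b = [mapper("k", v)[1] for v in [3, 4, 5]]
with_combiner = reducer("k", [combiner("k", split_a)[1], combiner("k", split_b)[1]])
without_combiner = reducer("k", split_a + split_b)
```

The mean of the two partial means is 2.75, while the true mean of 1..5 is 3. Because the combiner's input and output types match, it can run zero, one, or several times without changing the result.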




Algorithm Design: Running Example

• Term co-occurrence matrix for a text collection
  – M = N x N matrix (N = vocabulary size)
  – M_ij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)

• Why?
  – Distributional profiles as a way of measuring semantic distance
  – Semantic distance useful for many language processing tasks




MapReduce: Large Counting Problems

• Term co-occurrence matrix for a text collection = specific instance of a large counting problem
– A large event space (number of terms)
– A large number of observations (the collection itself)
– Goal: keep track of interesting statistics about the events

• Basic approach
– Mappers generate partial counts
– Reducers aggregate partial counts

How do we aggregate partial counts efficiently?




First Try: “Pairs”

Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For all pairs, emit (a, b) → count

Reducers sum up counts associated with these pairs

Use combiners!




Pairs: Pseudo-Code
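The original pseudo-code is missing from this transcript. As a stand-in, here is a minimal Python sketch of the pairs approach with a sentence as the context window. The function names are assumptions, and the in-process list comprehension only mimics what a MapReduce framework would do.

```python
from collections import Counter
from itertools import permutations

def map_pairs(sentence):
    """Emit ((a, b), 1) for every ordered pair of distinct terms in a sentence."""
    terms = sentence.split()
    # permutations works on positions, so a repeated term can pair with itself;
    # the a != b filter drops those self-pairs.
    return [((a, b), 1) for a, b in permutations(terms, 2) if a != b]

def reduce_pairs(pairs):
    """Sum the counts for each (a, b) key; also usable as a combiner."""
    counts = Counter()
    for key, count in pairs:
        counts[key] += count
    return dict(counts)

sentences = ["a b c", "a b"]
intermediate = [kv for s in sentences for kv in map_pairs(s)]
cooccurrence = reduce_pairs(intermediate)

print(len(intermediate))
print(cooccurrence[("a", "b")])
```

Every co-occurrence becomes its own key-value pair, so the volume of intermediate data grows with the square of the sentence length.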




“Pairs” Analysis

Advantages
– Easy to implement, easy to understand

Disadvantages
– Lots of pairs to sort and shuffle around (upper bound?)
– Not many opportunities for combiners to work




Another Try: “Stripes”

Idea: group together pairs into an associative array

(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For each term, emit a → { b: count_b, c: count_c, d: count_d … }

Reducers perform element-wise sum of associative arrays

  a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
  a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

Key: cleverly-constructed data structure brings together partial results




Stripes: Pseudo-Code
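The original pseudo-code is missing from this transcript. The following Python sketch shows one way the stripes approach could look, using `collections.Counter` as the associative array. Function names are assumptions, and the framework is simulated in-process.

```python
from collections import Counter, defaultdict

def map_stripes(sentence):
    """Emit one (term, stripe) pair per distinct term in the sentence."""
    terms = sentence.split()
    stripes = defaultdict(Counter)
    for i, a in enumerate(terms):
        for j, b in enumerate(terms):
            if i != j and a != b:
                stripes[a][b] += 1
    return list(stripes.items())

def reduce_stripes(pairs):
    """Element-wise sum of all stripes that share a key; also usable as a combiner."""
    merged = defaultdict(Counter)
    for term, stripe in pairs:
        merged[term].update(stripe)  # Counter.update adds counts element-wise
    return {term: dict(stripe) for term, stripe in merged.items()}

sentences = ["a b c", "a b"]
intermediate = [kv for s in sentences for kv in map_stripes(s)]
cooccurrence = reduce_stripes(intermediate)

print(len(intermediate))
print(cooccurrence["a"])
```

On these two sentences the mappers emit 5 stripes, whereas emitting every ordered pair individually would produce 8 key-value pairs. The trade-off is that each value is a heavier object and a whole stripe must fit in memory.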




“Stripes” Analysis

Advantages
– Far less sorting and shuffling of key-value pairs
– Can make better use of combiners

Disadvantages
– More difficult to implement
– Underlying object more heavyweight
– Fundamental limitation in terms of size of event space




Cluster size: 38 cores

Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)




Relative Frequencies

How do we estimate relative frequencies from counts?

$$f(B \mid A) = \frac{\mathrm{count}(A, B)}{\mathrm{count}(A)} = \frac{\mathrm{count}(A, B)}{\sum_{B'} \mathrm{count}(A, B')}$$

Why do we want to do this?

How do we do this with MapReduce?




f(B|A): “Stripes”

a → {b1:3, b2:12, b3:7, b4:1, … }

Easy!
– One pass to compute (a, *)
– Another pass to directly compute f(B|A)
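The two passes over a single stripe can be sketched in a few lines of Python. The function name is an assumption, and the example stripe uses only the four entries the slide lists (the elided "…" entries are left out), so its marginal covers just those four counts.

```python
def relative_frequencies(stripe):
    """Turn one stripe of joint counts into f(B|A) for a fixed A."""
    marginal = sum(stripe.values())                      # pass 1: count(A), i.e. (a, *)
    return {b: n / marginal for b, n in stripe.items()}  # pass 2: count(A, B) / count(A)

# Only the entries shown on the slide; the elided ones are not guessed.
stripe_a = {"b1": 3, "b2": 12, "b3": 7, "b4": 1}
freqs = relative_frequencies(stripe_a)
print(freqs)
```

This works inside a single reduce call because the stripe already brings every count for `a` together in one value.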




f(B|A): “Pairs”

(a, *) → 32       Reducer holds this value in memory

(a, b1) → 3       (a, b1) → 3 / 32
(a, b2) → 12      (a, b2) → 12 / 32
(a, b3) → 7       (a, b3) → 7 / 32
(a, b4) → 1       (a, b4) → 1 / 32
…                 …

For this to work:
– Must emit extra (a, *) for every bn in mapper
– Must make sure all a’s get sent to same reducer (use partitioner)
– Must make sure (a, *) comes first (define sort order)
– Must hold state in reducer across different key-value pairs




“Order Inversion”

• Common design pattern
– Computing relative frequencies requires marginal counts
– But marginal cannot be computed until you see all counts
– Buffering is a bad idea!
– Trick: getting the marginal counts to arrive at the reducer before the joint counts

• Optimizations
– Apply in-memory combining pattern to accumulate marginal counts
– Should we apply combiners?
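A minimal Python sketch of order inversion for the pairs approach, simulated in-process. Several details are assumptions made for the example: the empty string is used as the `*` marker because it sorts before any non-empty term, `partition` is a toy deterministic hash, and `run` stands in for the framework's shuffle and sort.

```python
from collections import defaultdict
from itertools import groupby

MARGINAL = ""  # assumed sentinel for "*": sorts before every non-empty term

def map_pairs_with_marginal(sentence):
    """Emit ((a, b), 1) plus an extra ((a, *), 1) for every co-occurring pair."""
    terms = sentence.split()
    out = []
    for i, a in enumerate(terms):
        for j, b in enumerate(terms):
            if i != j and a != b:
                out.append(((a, b), 1))
                out.append(((a, MARGINAL), 1))
    return out

def partition(key, num_reducers):
    """Partition on the left term only, so (a, *) and every (a, b) meet."""
    return sum(key[0].encode()) % num_reducers  # toy deterministic hash

def reduce_relative_frequencies(pairs):
    """Process keys in sorted order; the marginal for each a arrives first."""
    freqs = {}
    marginal = None  # state held across different keys
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        total = sum(count for _, count in group)
        if key[1] == MARGINAL:
            marginal = total             # remember count(a) for the keys that follow
        else:
            freqs[key] = total / marginal
    return freqs

def run(sentences, num_reducers=2):
    """Simulate shuffle: route pairs to reducers, then reduce each bucket."""
    buckets = defaultdict(list)
    for s in sentences:
        for key, count in map_pairs_with_marginal(s):
            buckets[partition(key, num_reducers)].append((key, count))
    result = {}
    for pairs in buckets.values():
        result.update(reduce_relative_frequencies(pairs))
    return result

freqs = run(["a b c", "a b"])
print(freqs[("a", "b")], freqs[("a", "c")])
```

The reducer never buffers the joint counts: it keeps a single number per left term, which is what makes the pattern scale.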




Synchronization: Pairs vs. Stripes

Approach 1: turn synchronization into an ordering problem
– Sort keys into correct order of computation
– Partition key space so that each reducer gets the appropriate set of partial results
– Hold state in reducer across multiple key-value pairs to perform computation
– Illustrated by the “pairs” approach

Approach 2: construct data structures that bring partial results together
– Each reducer receives all the data it needs to complete the computation
– Illustrated by the “stripes” approach




Secondary Sorting

MapReduce sorts input to reducers by key
– Values may be arbitrarily ordered

What if we want to sort the values as well?
– E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…




Secondary Sorting: Solutions

Solution 1:
– Buffer values in memory, then sort
– Why is this a bad idea?

Solution 2:
– “Value-to-key conversion” design pattern: form composite intermediate key, (k, v1)
– Let execution framework do the sorting
– Preserve state across multiple key-value pairs to handle processing
– Anything else we need to do?
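A minimal Python sketch of value-to-key conversion, simulated in-process. The function names and the sample records are assumptions. The `partition` function reflects one plausible answer to the slide's closing question: once the key is composite, the partitioner should look only at the natural key k so that all (k, v) keys still reach the same reducer.

```python
from itertools import groupby

def map_value_to_key(k, v, r):
    """Move the sort field v into a composite key (k, v); r stays the value."""
    return ((k, v), r)

def partition(composite_key, num_reducers):
    """Partition on the natural key k only, ignoring v."""
    k, _ = composite_key
    return sum(str(k).encode()) % num_reducers  # toy deterministic hash

def reduce_sorted(pairs):
    """Framework stand-in: sort by composite key, then stream values per k."""
    out = {}
    for k, group in groupby(sorted(pairs), key=lambda kv: kv[0][0]):
        out[k] = [(key[1], r) for key, r in group]  # already ordered by v
    return out

# Hypothetical records (k, v, r), deliberately out of order.
records = [("k", 8, "r8"), ("k", 1, "r1"), ("j", 5, "r5"), ("k", 4, "r4"), ("k", 3, "r3")]
pairs = [map_value_to_key(*rec) for rec in records]
result = reduce_sorted(pairs)
print(result["k"])
```

The reducer sees the values for each k in sorted order without buffering them, at the cost of more distinct keys for the framework to sort.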




Recap: Tools for Synchronization

• Cleverly-constructed data structures

– Bring data together

• Sort order of intermediate keys

– Control order in which reducers process keys

• Partitioner

– Control which reducer processes which keys

• Preserving state in mappers and reducers

– Capture dependencies across multiple keys and values




Issues and Tradeoffs

• Number of key-value pairs
– Object creation overhead
– Time for sorting and shuffling pairs across the network

• Size of each key-value pair
– De/serialization overhead

• Local aggregation
– Opportunities to perform local aggregation vary
– Combiners make a big difference
– Combiners vs. in-mapper combining
– RAM vs. disk vs. network
