
Big Data – Exercises

Fall 2019 – Week 6 – ETH Zurich

MapReduce

Reading:

White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O’Reilly Media, Inc. [ETH library] (Chapters 2, 6, 7, 8:

mandatory, Chapter 9: recommended)

George, L. (2011). HBase: The Definitive Guide (1st ed.). O’Reilly. [ETH library] (Chapter 7: mandatory).

Original MapReduce paper: MapReduce: Simplified Data Processing on Large Clusters (optional)

This exercise will consist of 2 main parts:

Hands-on practice with MapReduce on Azure HDInsight

Architecture and theory of MapReduce

1. Setup a cluster

Create a Hadoop cluster

Follow the steps from the exercise of last week, but this time select a Hadoop cluster.

Create 4 Worker nodes and choose type A4 V2 Standard for these Worker nodes. Head nodes should be of type A6 General Purpose. If you still have the storage account from last week, you can reuse it; otherwise create a new one.

Make sure that:

you have enabled "Custom" settings

you have selected "North Europe" for the data center location

you have selected the two HeadNodes as type A6 General Purpose.

you have selected the four RegionServers as type A4 V2 Standard.

the total cost of your cluster is approximately 2.99 CHF/hour

Access your cluster

Make sure you can access your cluster (the HeadNode) via SSH:

$ ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Maven

You may need to update your cluster before you can install maven, which is needed to build your code.

$ sudo apt-get update

$ sudo apt-get install maven

Explore your cluster admin dashboard

Connect to your cluster dashboard, navigating to:

https://<cluster-name>.azurehdinsight.net

Username and password are the ones you chose during the cluster setup (notice that the username is not the SSH username). Here you can find a lot of information about your cluster.

Select "Hosts" in the topbar and and explore the components running on each node.

Where is the Active NameNode running?

Which services run on the head nodes?

Which physical node is responsible for coordinating a MapReduce job?

2. Write a Word Count MapReduce job

We want to find which are the most frequently used English words. To answer this question, we prepared a big text file (1.2GB) where we concatenated the 3,036 books of the Gutenberg dataset.

2.1 Load the dataset

The dataset we provide consists of a concatenation of 3,036 books (gutenberg.txt). However, we provide 3 versions:

gutenberg_x0.1.txt - a smaller dataset of about 120MB

gutenberg.txt - the original dataset, 1.2GB

gutenberg_x10.txt - a bigger dataset of 12GB. This is optional. Load and process it only after you have finished the exercise with the first two. Be aware that it might take some time.

Follow the steps below to set this dataset up in HDFS:

Log in to the active NameNode via SSH

ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Download the dataset from our storage to the local filesystem of the NameNode using wget :

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg/gutenberg_x0.1.txt

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg/gutenberg.txt

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg/gutenberg_x10.txt

Load the dataset into the HDFS filesystem:

With ls -lh you should see the files you downloaded. These files are now on the local hard drive of your HeadNode.

Upload the files into HDFS where they can be consumed by MapReduce:

$ hadoop dfs -copyFromLocal *.txt /tmp/

2.2 Understand the MapReduce Java API

We wrote a template project that you can use to experiment with MapReduce. Download and unzip the following

package.

wget https://exerciseassets.blob.core.windows.net/exercise06/mapreduce.zip

unzip mapreduce.zip

Now examine the content of the src folder. You will see one Java class:

MapReduceWordCount: a skeleton for a MapReduce job that loads data from file

Start looking at MapReduceWordCount. You can see that the main method is already provided. Our

WordCountMapper and WordCountReducer are implemented as classes that extend Hadoop's Mapper and

Reducer . For this exercise, you only need to consider (and override) the map() method for the mapper and the

reduce() method for the reducer.

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        context.write(key, value);
    }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        Iterator var4 = values.iterator();
        while (var4.hasNext()) {
            Object value = var4.next();
            context.write(key, value);
        }
    }
}

Consulting the documentation if necessary, answer the following questions:

1. What are possible types for KEYIN, VALUEIN, KEYOUT and VALUEOUT ? Should KEYOUT and VALUEOUT for the

Mapper be the same as KEYIN and VALUEIN for the Reducer?

2. What is the default behavior of a MapReduce job running the base Mapper and Reducer above?

3. What is the role of the object Context ?

Solution

1. Since the output of mapper and reducer must be serializable, they have to implement the Writable interface.

Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the

framework. Hadoop provides classes implementing these interfaces for string, boolean, integers, long and short

values.

2. Mapper and Reducer are identity functions. Since there is always a shuffle phase, the overall job sorts the input by key.

3. Context allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as the methods for mapper and reducer to emit output (write()). Applications can also use the Context to pass parameters to mappers and reducers running on different nodes.

2.3 Write and run your MapReduce wordcount

Edit the provided skeleton and implement the mapper and reducer to perform a word count. The goal is to know how many times each unique word appears in the dataset. You can consider each word as a sequence of characters separated by whitespace, or implement a more sophisticated tokenizer if you wish.

Can you use your Reducer as a Combiner? If so, enable it by uncommenting the appropriate line in the main method.

Once you are confident in your solution, follow the instructions in the README to compile the program and run it on your cluster.

The process is very similar to the one for HBase of last week. Answer the following questions:

1. Run the MapReduce job on the cluster with the default configuration and 4 DataNodes, using only the medium size Gutenberg file for now. (Note: if you want to run your job again, you first need to delete the previous result folder because Hadoop refuses to write to the same location):

hadoop fs -rm -r <path-to-hdfs-output-folder>

2. While the job proceeds, connect to https://<cluster-name>.azurehdinsight.net/yarnui/hn/cluster to verify the

status. Once done, check the final job report. How many map and reduce tasks were created with the default

configuration?

3. Does it go faster with more reduce tasks? Experiment with job.setNumReduceTasks() . What is the disadvantage

of having multiple reducers? (Hint: check the format of your output)

Solution

A possible solution is the one below. Notice that this Reducer can also be used as a Combiner.

Mapper

private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Reducer

private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

1. Some figures are provided below.

2. In my case, the number of map tasks was 3 and the number of reduce tasks 1. By default, the number of map tasks for a given job is driven by the number of input splits. The parameter mapred.map.tasks can be used to set a suggested value. For reduce tasks, 1 is the default number. This might be a bad choice in terms of performance, but on the other hand the result will be in a single file, with no further merging necessary.

3. Generally, one reducer might become a bottleneck. In this case, the bottleneck is more evident if no combiner is

used. However, using multiple reducers, multiple output files will be created and an additional merging step is

necessary to get the same result. Note: according to the official documentation, the right number of reduces is

0.95 or 1.75 multiplied by [no. of nodes] * [no. of maximum containers per node]. With 0.95 all of the reduces

can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will

finish their first round of reduces and launch a second wave of reduces doing a much better job of load

balancing. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the

framework for speculative-tasks and failed tasks.
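As a note on that extra merging step: for a word count job each key appears in exactly one reducer output, so merging the part-r-* files amounts to concatenating them (summing also covers the general case). Below is a minimal local sketch in Python; the output/ directory and the part-r-* naming are assumptions based on Hadoop's default output layout.

# Merge the output of several reducers into a single word -> count mapping.
# Local sketch only; 'output/' and the 'part-r-*' pattern are assumed paths.
import glob
from collections import Counter

merged = Counter()
for path in glob.glob('output/part-r-*'):
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, count = line.rstrip('\n').split('\t')
            merged[word] += int(count)

# Print the 10 most frequent words of the merged result.
for word, count in merged.most_common(10):
    print(word, count)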

2.4. Download and plot the results

The output of the MapReduce process has been stored in your remote cluster. You can now download the result file

and plot the frequency of the 30 most common words in the dataset. The easiest way for this notebook to be able to

access your result file is to set your container as public.

By default, output files have the form part-A-XXXX where A is r or m to denote Reducer or Mapper outputs, and

XXXX is the id of the specific mapper or reducer task.

Access the MapReduce container in portal.azure.com.

Set the Access Policy of the container to public.

Locate the result file in /tmp/results and copy the file URL. The link should now be accessible from your browser.

Add the URL below, run to download the file and plot the results.

In [1]:

RESULT_FILE_URL = '...'

import urllib.request
urllib.request.urlretrieve(RESULT_FILE_URL, "results.txt")
print('Done')

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt
import operator

print('Plotting...')
freq = {}

# Read input and sort by frequency. Keep only top 30.
with open('results.txt', 'rb') as csvfile:
    for line in csvfile.readlines():
        word, count = line.decode('UTF-8').split('\t')
        freq[word] = int(count)
srt = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)[:30]

# Generate plot
plt.figure(figsize=(16,6))
plt.bar(range(len(srt)), [x[1] for x in srt], align='center', color='#ba2121')
plt.xticks(range(len(srt)), [x[0] for x in srt])
plt.show()


If everything is correct, the 3 most frequent words should be "the", "of" and "and".

3. Performance comparison

Test your MapReduce job with the smaller gutenberg_x0.1.txt as well. If you want, you can also try gutenberg_x10.txt. For each test, write down the running time:

time mapreduce-wc /tmp/gutenberg_x0.1.txt /tmp/gut01/

time mapreduce-wc /tmp/gutenberg.txt /tmp/gut1/

time mapreduce-wc /tmp/gutenberg_x10.txt /tmp/gut10/

While the job proceeds, you can connect to https://<cluster-name>.azurehdinsight.net/yarnui/hn/cluster to check

the status. In the meanwhile:

Download the dataset to your laptop. You can either use wget or download one file at a time from your container in the Azure portal. Note: the bigger file is optional. You need at least 12.5GB of free space on your hard drive, and it will take some time to download and process! Alternatively, you can also run this experiment on the HeadNode of your cluster, where you should still have the text files.

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg_x0.1.txt

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg.txt

wget https://exerciseassets.blob.core.windows.net/exercise06/gutenberg_x10.txt

We prepared a simple wordcount program in Python. Download it to your laptop (or the cluster HeadNode) and test how long it takes to process the three datasets. Write down the times for the next exercise; a rough sketch of what such a script does is shown after the commands below.

python wordcount.py < /pathtoyourfile/gutenberg_x0.1.txt

python wordcount.py < /pathtoyourfile/gutenberg.txt

python wordcount.py < /pathtoyourfile/gutenberg_x10.txt
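The provided wordcount.py is not reproduced here; as a rough idea of what a single-threaded count does, a minimal sketch follows (whitespace tokenization is an assumption, the actual script may differ).

# Minimal single-threaded word count reading text from stdin,
# e.g. python wordcount_sketch.py < gutenberg_x0.1.txt
# Only a sketch of what a script like wordcount.py might do.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    # Simple whitespace tokenization, like the MapReduce solution above.
    counts.update(line.split())

# Report the 30 most frequent words, tab-separated like the MapReduce output.
for word, count in counts.most_common(30):
    print(f"{word}\t{count}")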

3.1 Plot

Compare the performance of MapReduce vs. the single-threaded implementation of the word count algorithm for the three different input sizes. Fill in the times in seconds in the code below to plot the results.

In [ ]:

# NOTE: remove the last number on the lists below if you did not test gutenberg_x10.txt
size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]  # the input size in MB
time_mapreduce = [0., 0., 0.]             # replace 0s with the time (in seconds) for the corresponding inputs
time_locallaptop = [0., 0., 0.]           # replace 0s with the time (in seconds) for the corresponding inputs

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce, '#f37626', label='MapReduce', linewidth=3.0)
plt.plot(size_input, time_locallaptop, '#00b300', label='Local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Input size (MB)')
plt.ylabel('Time (seconds)')
plt.title('Wall-time comparison')
plt.show()

3.2. Discussion

1. Which is faster, MapReduce on your cluster or a local wordcount implementation? Why?

2. Based on your experiment, what input size is the approximate break-even point for time performance?

3. Why does MapReduce not perform better than local computation for small inputs?

4. How can you optimize the MapReduce performance for this job?

Solution

We have run some more tests. Here we present the running times for 4 configurations on HDInsight, one workstation and one laptop. The figures below are indicative only, because the performance of every machine depends on several factors.

MapReduce v1: no combiner with default configuration (1 reducer)

MapReduce v2: no combiner with 8 reduce tasks

MapReduce v3: using combiner with default configuration (1 reducer)

MapReduce v4: using combiner with 8 reduce tasks

1. For small inputs, local computation is faster

2. Using a combiner, MapReduce gets faster than a single workstation somewhere between 1GB and 10GB of data.

This of course, really depends on the task and the configuration of the machines.

3. The shuffling phase adds a lot of overhead computation

4. A combiner drastically improves the performance for this job, where the data to be shuffled can be reduced a lot

In [3]:

size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]   # the input size in MB
time_mapreduce1 = [123, 445, 2160]         # MapReduce v1
time_mapreduce2 = [100, 295, 970]          # MapReduce v2
time_mapreduce3 = [110, 240, 413]          # MapReduce v3
time_mapreduce4 = [110, 218, 385]          # MapReduce v4
# time_localstation = [11.4, 127, 1332]    # our workstation (java)
time_localstation = [5.1, 47.7, 475.5]     # our workstation
time_locallaptop = [14.9, 148.4, 1381.7]   # our laptop

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce1, '#7d0026', label='MapReduce v1', linewidth=3.0)
plt.plot(size_input, time_mapreduce2, '#f30000', label='MapReduce v2', linewidth=3.0)
plt.plot(size_input, time_mapreduce3, '#430064', label='MapReduce v3', linewidth=3.0)
plt.plot(size_input, time_mapreduce4, '#f37626', label='MapReduce v4', linewidth=3.0)
plt.plot(size_input, time_localstation, '#0055f3', label='Local workstation', linewidth=3.0, linestyle='dashed')
plt.plot(size_input, time_locallaptop, '#00b300', label='Local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Input size (MB)')
plt.ylabel('Time (seconds)')
plt.title('Performance comparison')
plt.show()

4. Querying JSON: The Language Confusion Dataset (optional)

In this task, we will develop a MapReduce application that processes a dataset from the language game. It contains

rows of the following format:

{"target": "Turkish",

"sample": "af0e25c7637fb0dcdc56fac6d49aa55e",

"choices": ["Hindi", "Lao", "Maltese", "Turkish"],

"guess": "Maltese",

"date": "2013-08-19",

"country": "AU"}

Here, the guess field is what the user chose and the target field was the expected answer.

4.1. Set up

Log in to your cluster using SSH: ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Download the data: wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2

Extract the data: tar -jxvf confusion-2014-03-02.tbz2

Upload the data to HDFS: hdfs dfs -put confusion-2014-03-02/confusion-2014-03-02.json /tmp/

4.2. Query implementation

You can start with the code provided in Task 3.

Download the initial code wget https://exerciseassets.blob.core.windows.net/exercise06/mapreduce.zip and

modify it accordingly.

The query to be implemented is:

Find the number of games where the guessed language is correct (i.e., the guess equals the target) and that language is Russian.

To parse a line of text, first add the following to the other imports:

import com.google.gson.JsonObject;

import com.google.gson.JsonParser;

Then, you can use the following to parse and access json elements:

...


JsonObject jsonObject = new JsonParser().parse(value.toString()).getAsJsonObject();

jsonObject.get("target").getAsString();

...

You can follow the README instructions to update, compile and run your code.

Solution:

A possible solution is the one below. Notice that this Reducer can also be used as a Combiner.

Mapper

private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        JsonObject jsonObject = new JsonParser().parse(value.toString()).getAsJsonObject();
        if (jsonObject.get("target").getAsString().equals(jsonObject.get("guess").getAsString()) &&
                jsonObject.get("target").getAsString().equals("Russian")) {
            word.set("Russian");
            context.write(word, one);
        }
    }
}

Reducer: the same reducer as in the previous task can be used.
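Optionally, the result can be cross-checked locally on the extracted JSON file with a few lines of Python; this is only a sanity-check sketch, assuming the file path produced in step 4.1.

# Count games where the guess equals the target and the target is "Russian".
# Local cross-check sketch; the path is the one produced by extracting the archive.
import json

count = 0
with open('confusion-2014-03-02/confusion-2014-03-02.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        if record['target'] == 'Russian' and record['guess'] == record['target']:
            count += 1

print('Correct Russian guesses:', count)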

Important: delete your cluster

It is very important that you delete your cluster from the Azure Portal as soon as you have finished this exercise.

5. Reverse engineering

Conceptually, a map function takes as input a key-value pair and emits a list of key-value pairs, while a reduce function takes as input a key with an associated list of values and returns a list of values or key-value pairs. Often the type of the final key and value is the same as the type of the intermediate data:

map (k1,v1) --> list(k2,v2)

reduce (k2,list(v2))--> list(k2, v2)

Analyze the following Mapper and Reducer, written in pseudo-code, and answer the questions below.

function map(key, value)

emit(key, value);

function reduce(key, values[])

z = 0.0

for value in values:

z += value

emit(key, z / values.length())

Questions

1. Explain the result of running this job on a list of pairs with type ([string], [float]).

2. Write the equivalent SQL query.

3. Could you use this reduce function as a combiner as well? Why or why not?

4. If your answer to the previous question was yes, does the number of different keys influence the effectiveness of the combiner? If your answer was no, can you change the map and reduce functions in such a way that the new reducer can be used as a combiner?

Solution


1. This will output a pair (k1, v1) for each unique input key, where k1 is the input key and v1 is the average of

the values associated with k1 .

2. Equivalent to SELECT key, AVG(value) FROM tablename GROUP BY key .

3. No, because the average operation is not associative.

4. If we allow the final output to contain an additional piece of information (how many samples the average represents), the reducer can be used as a combiner, with the values themselves being pairs:

function map(key, value)

emit(key, (1, value));

function reduce(key, values[])

n = 0

z = 0.0

for value in values:

n += value[0]

z += value[0] * value[1]

emit(key, (n, z / n))
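To see why carrying the count along makes averaging combinable, here is a small Python simulation (a sketch for intuition, not part of the exercise code): merging partial (n, avg) pairs gives the same result as averaging all values at once.

# Averaging becomes combinable if each intermediate value is a pair (n, avg).
def reduce_pairs(pairs):
    # Works both as combiner (on a subset) and as reducer (on everything).
    n, total = 0, 0.0
    for count, avg in pairs:
        n += count
        total += count * avg
    return (n, total / n)

values = [2.0, 4.0, 6.0, 8.0, 10.0]
mapped = [(1, v) for v in values]  # map: every value starts as a group of one sample

direct = reduce_pairs(mapped)                                                  # single reducer
combined = reduce_pairs([reduce_pairs(mapped[:2]), reduce_pairs(mapped[2:])])  # combiners first

print(direct, combined, sum(values) / len(values))  # (5, 6.0) (5, 6.0) 6.0
assert direct == combined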

6. True or False

Say if the following statements are true or false, and explain why.

1. Each mapper must generate the same number of key/value pairs as its input had.

2. The TaskTracker is responsible for scheduling mappers and reducers and making sure all nodes are running correctly.

3. The input key/value pairs of mappers are sorted by the key.

4. MapReduce splits might not correspond to HDFS blocks.

5. One single Reducer is applied to all values associated with the same key.

6. Multiple Reducers can be assigned pairs with the same value.

7. In Hadoop MapReduce, the key-value pairs a Reducer outputs must be of the same type as its input pairs.

Solution

1. False - for each input pair, the mapper can emit zero, one, or several key/value pairs.

2. False - the JobTracker is responsible for this.

3. False - mapper input is not sorted.

4. True - since splits respect logical record boundaries, they might contain data from multiple HDFS blocks.

5. True - this is the principle behind partitioning: one Reducer is responsible for all values associated with a

particular key.

6. True - values are not relevant in partitioning.

7. False - Reducer's input and output pairs might have different types.

7. Some more MapReduce and SQL

Design, in Python or pseudo-code, MapReduce functions that take a very large file of integers and produce as output:

1. The largest integer.

2. The average of all the integers.

3. The same set of integers, but with each integer appearing only once.

4. The number of times each unique integer appears.

5. The number of distinct integers in the input.

For each of these, write the equivalent SQL query, assuming you have a column values that stores all the integers.

Solution

1) SELECT MAX(values) ... - Naive solution with a very low degree of parallelization. We can use a Combiner (identical to the Reducer) to increase the degree of parallelization, as illustrated after the pseudo-code below.

function map(key, value)

emit(1, key); //group everything on the same key

function reduce(key, values[])

global_max = 0

for value in values:

if (value > global_max):

global_max = value

emit(global_max)
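Because max is associative, the same reduce logic can safely run as a combiner on each split before the single global reduce; a tiny Python illustration (toy data, not part of the exercise code):

# Per-split maxima can be combined: max over the local maxima equals max over all data.
data = [13, 7, 42, 5, 28, 42, 9]
splits = [data[:3], data[3:]]                    # pretend these are two input splits

local_maxima = [max(split) for split in splits]  # combiner output per split
global_max = max(local_maxima)                   # single reducer over combiner outputs

print(local_maxima, global_max)                  # [42, 42] 42
assert global_max == max(data)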

2) SELECT AVG(values) ... - using the same Map above

function reduce(key, values[])

sum, count = 0, 0

for value in values:

sum += value

count += 1

emit(sum / count)

3) SELECT DISTINCT values ...

function map(key, value)

emit(key, 1)

function reduce(key, values[])

emit(key)

4) SELECT values, COUNT (*) ... GROUP BY values.

function map(key, value)

emit(key, 1)

function reduce(key, values[])

emit(key, len(values))

5) SELECT COUNT (DISTINCT values) ...

Reuse (3) and add a postprocessing step to combine the output of the reducers. A second MapReduce stage could also be used, but it would probably be overkill for this scenario.
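A compact Python illustration of the idea (toy data; the sort stands in for the shuffle): stage 1 deduplicates, and counting the emitted keys is the postprocessing step (or a trivial second stage).

# Counting distinct integers: deduplicate first, then count the unique keys.
from itertools import groupby

data = [3, 7, 3, 1, 7, 7, 9, 1]

# Stage 1: map emits (value, 1); the sort plays the role of the shuffle;
# reduce emits each key exactly once.
pairs = sorted((v, 1) for v in data)
distinct = [key for key, _ in groupby(pairs, key=lambda kv: kv[0])]

# Postprocessing (or a second stage with a single reducer): count the keys.
print(distinct, len(distinct))   # [1, 3, 7, 9] 4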

8. TF-IDF in MapReduce (optional)

Imagine we want to build a search engine over the Gutenberg dataset of ~3,000 books. Given a word or a set of words, we want to rank these books according to their relevance for these words. We need a metric to measure the importance of a word in a set of documents.

8.1 Understand TF-IDF

TF-IDF is a statistic to determine the relative importance of the words in a set of documents. It is computed as the

product of two statistics, term frequency ( tf ) and inverse document frequency (idf ).

Given a word t, a document d (in this case a book) and the collection of all documents D, we can define tf(t, d) as the number of times t appears in d. This gives us some information about the content of a document, but because some terms (e.g. "the") are so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms.

The inverse document frequency idf(t, D) is a measure of how much information the word provides, that is, whether

the term is common or rare across all documents. It can be computed as:

    idf(t, D) = log( |D| / |{d in D : t appears in d}| )

where |D| is the total number of documents and the denominator represents how many documents contain the word t at least once. However, this would cause a division-by-zero exception if the user queries a word that never appears in the dataset. A better formulation would be:

    idf(t, D) = log( |D| / (1 + |{d in D : t appears in d}|) )

Then, tfidf(t, d, D) is calculated as follows:

    tfidf(t, d, D) = tf(t, d) * idf(t, D)

A high weight in tfidf is reached by a high term frequency (in the given document) and a low document frequency

of the term in the whole collection of documents.
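To make the formulas concrete, here is a tiny self-contained Python example for a toy collection (a sketch for intuition only; it is not the MapReduce implementation asked for in 8.2, and the toy documents are made up).

# Toy tf-idf computation following the formulas above:
# tf(t, d)       = number of times t appears in d
# idf(t, D)      = log(|D| / (1 + number of documents containing t))
# tfidf(t, d, D) = tf(t, d) * idf(t, D)
from math import log

docs = {
    'doc1': 'the cat sat on the mat',
    'doc2': 'the dog sat on the log',
    'doc3': 'whales swim in the sea',
}
N = len(docs)

# Document frequency: in how many documents does each word appear?
df = {}
for content in docs.values():
    for word in set(content.split()):
        df[word] = df.get(word, 0) + 1

# tf-idf per (document, word) pair, keyed as "document:word" like in 8.2.
scores = {}
for doc_id, content in docs.items():
    words = content.split()
    for word in set(words):
        tf = words.count(word)
        idf = log(N / (1 + df[word]))
        scores[f'{doc_id}:{word}'] = tf * idf

# "the" appears in every document, so its idf is log(3/4) < 0 and its score is low;
# rarer words such as "cat" or "whales" get higher scores.
for key in ('doc1:the', 'doc1:cat', 'doc3:whales'):
    print(key, round(scores[key], 3))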

8.2 Implement TF-IDF in MapReduce (pseudo-code)

Implement Mapper and Reducer functions in pseudo-code to compute TF-IDF. Assume each Mapper receives the

document name as string key and the entire document content as string value. The output of your job should be a

list of key-value pairs, where the key is a string in the form "document:word" and the value is the tfidf score for that

document/word pair.

function map(doc_id, doc_content)

...

function reduce(key, values[])

...

Solution

1) Mapper outputs each word-document pair

function map(doc_id, doc_content)

tokens = doc_content.split()

for token in tokens:

emit(token, (doc_id, 1));

2) An optional combiner performs an aggregation over all occurrences of the same word per document

function combiner(token, values)

frequencies = {} //empty dictionary

for (doc_id, count) in values:

frequencies[doc_id] += count //list of documents should fit in memory

for doc_id in frequencies.keys():

emit(token, (doc_id, frequencies[doc_id]));

3) A Reducer is called for each unique word in the collection and outputs a list of document:word pairs with the tf-idf

score

function reduce(token, values[])

frequencies = {} //empty dictionary

for (doc_id, count) in values:

frequencies[doc_id] += count //list of documents should fit in memory

idf = log(N / (1 + frequencies.keys().length())) // idf for the token; N is the total number of documents

for doc_id in frequencies.keys():

tf = frequencies[doc_id] // tf for token-docid pair

emit(doc_id + ":" + token, tf*idf)