Simplifying MapReduce Data Processing

2011 Fourth IEEE International Conference on Utility and Cloud Computing

Speaker: Lin You-Wu

Jin-Ming Shih, Chih-Shan Liao
Dept. of Computer Science and Information Engineering
National Dong Hwa University
Hualien, Taiwan

Ruay-Shiung Chang
Dept. of Computer Science and Information Engineering
National Dong Hwa University
Hualien, Taiwan

Outline

I. Introduction
II. Related Work
III. The Web-Based GUI for MapReduce Data Processing
IV. Case Study and Implementation
V. Conclusions and Future Work

I. Introduction

• MapReduce is a programming model developed by Google for processing and generating large data sets in distributed environments.

• In MapReduce, the complexity of distributed programming is decreased substantially by hiding details of the distributed computing system.

• The MapReduce process shown in Fig. 1 splits the input data across many Map operation nodes, each of which produces its output as a set of key/value pairs.
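The flow described above can be imitated in a single process. The following is only a minimal sketch, not the Hadoop or Google implementation: `run_mapreduce`, `emit_reading`, `highest`, and the sensor readings are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Imitate Map, grouping by key, and Reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce: all values that share a key are combined into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def emit_reading(record):
    sensor, reading = record
    yield sensor, reading

def highest(key, values):
    return max(values)

# Hypothetical input: (sensor id, reading) pairs.
readings = [("s1", 3), ("s2", 7), ("s1", 9), ("s2", 4)]
peaks = run_mapreduce(readings, emit_reading, highest)
```

In a real cluster the Map calls run on different nodes and the grouping step moves data over the network; the sketch keeps only the data flow.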

• Although the MapReduce model reduces the complexity of programming for distributed computing, MapReduce is still not easy to put into practice.

• If users want to use MapReduce to process a large-scale data set, they first have to set up a MapReduce environment on their machines.

• In this paper, we present a Web-based GUI for MapReduce Data Processing.

• The GUI lets users design their MapReduce workflow intuitively and conveniently, without having to implement a MapReduce system or write the programs themselves.

II. Related Work

A. Hadoop MapReduce
B. Cascading
C. Pig

A. Hadoop MapReduce

• Hadoop is an open-source Apache project that implements Google's MapReduce architecture.

• Hadoop MapReduce is a software framework for processing large-scale data in parallel on large clusters of commodity hardware.

• The Hadoop MapReduce framework consists of a single master, called the JobTracker, and many slaves, called TaskTrackers.

• JobTracker
– responsible for scheduling, monitoring, and re-executing all tasks of a job on the slave machines.

• TaskTracker
– executes the tasks assigned by the JobTracker.
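The division of labour between the two roles can be illustrated with a toy simulation. This is only a sketch under strong assumptions: real Hadoop scheduling is driven by heartbeats and is far more involved, and `job_tracker`, `healthy`, `flaky`, and the round-robin policy are all hypothetical.

```python
from collections import deque

def job_tracker(tasks, trackers, max_attempts=3):
    """Hand each task to a tracker in round-robin order and re-execute failures."""
    pending = deque((task, 1) for task in tasks)
    results, turn = {}, 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[turn % len(trackers)]  # scheduling decision
        turn += 1
        try:
            results[task] = tracker(task)         # the tracker executes the task
        except RuntimeError:
            if attempt >= max_attempts:
                raise                             # give up after too many attempts
            pending.append((task, attempt + 1))   # queue the task for re-execution
    return results

def healthy(task):
    """A hypothetical tracker that always succeeds."""
    return task * task

def flaky(task):
    """A hypothetical tracker that always fails."""
    raise RuntimeError("simulated node failure")

squares = job_tracker([1, 2, 3], [healthy, flaky])
```

Task 2 fails twice on the flaky tracker before the round-robin order hands it to the healthy one, which is the re-execution behaviour the slide attributes to the JobTracker.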

B. Cascading

• Cascading is an open-source Java library and API that offers an abstraction layer for MapReduce.

• The Cascading processing model allows developers to rapidly develop complex data processing applications on Hadoop.

• On the surface, Cascading is more complex than MapReduce since there are many pipe types and operations.

• However, Cascading makes real-world problems easier to solve, because developers assemble building blocks instead of writing MapReduce programs.

C. Pig

• Pig is a high-level platform for creating MapReduce programs to analyze large data sets using Hadoop.

• It not only provides a higher level of abstraction over the data processing capabilities but also maintains the simplicity and reliability of Hadoop.

III. The Web-Based GUI for MapReduce Data Processing

• In MapReduce, the Map phase splits the input data according to the user's definition.

• The key/value pairs output by Map are then grouped in the Reduce phase or the Combine phase.

• Our proposed method reduces the complexity of MapReduce data processing by eliminating the actual programming.

• The Map and Reduce phases are hidden behind a “Target-Value-Action” in each “Layer”.

• A “Container” integrates the different formats of different outputs.

A. Target-Value-Action

• Target-Value-Action is a direct and natural way of thinking about the data processing done in Map and Reduce.

• A user just chooses some objects as targets, gives each object a value and specifies the desired action.

• When choosing a Target, users select the part of the data they want to process.

• The Value depends on what type of outcome the user wants. It can be an integer, text, or another data type.

• For example, in the word count problem, each word is a Target, the Value is “1”, and the Action is “Sum”.
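The word count example can be written down as a small in-memory sketch. The `TargetValueAction` class and its field names are hypothetical and are not the GUI's actual implementation; they only show how a Target, a Value, and an Action stand in for hand-written Map and Reduce code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TargetValueAction:
    target: Callable[[str], Iterable[str]]  # selects the parts of a record to process
    value: Callable[[str], object]          # the value attached to each target
    action: Callable[[list], object]        # how the values of one target are combined

    def run(self, records):
        grouped = defaultdict(list)
        for record in records:              # plays the role of Map
            for t in self.target(record):
                grouped[t].append(self.value(t))
        # plays the role of Reduce: apply the Action to each target's values
        return {t: self.action(values) for t, values in grouped.items()}

# Word count: Target = each word, Value = 1, Action = Sum.
word_count = TargetValueAction(target=str.split, value=lambda _: 1, action=sum)
result = word_count.run(["to be or not to be", "to do"])
```

Swapping the Action for `max` or `len` changes the computation without touching any Map or Reduce code, which is the point of the abstraction.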

B. Layer

• In our Web-based GUI, chained MapReduce jobs are controlled by “Layers”.

• Each Layer consists of one or more Target-Value-Actions and a Container.

• Layers can accept other Layers’ output or the original data as the input.
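Chaining can be sketched by feeding one Layer's output into the next. The helper `run_layer` and the two-layer job below are assumptions made for illustration, with each Layer reduced to a single Target-Value-Action and the Container omitted.

```python
from collections import defaultdict

def run_layer(records, target, value, action):
    """Run one simplified Layer: a single Target-Value-Action over its input."""
    grouped = defaultdict(list)
    for record in records:
        for t in target(record):
            grouped[t].append(value(record, t))
    return {t: action(values) for t, values in grouped.items()}

lines = ["a b a", "b c"]

# Layer 1 reads the original data: count how often each word occurs.
layer1 = run_layer(lines,
                   target=lambda line: line.split(),
                   value=lambda line, word: 1,
                   action=sum)

# Layer 2 reads Layer 1's output: count how many words share each frequency.
layer2 = run_layer(layer1.items(),
                   target=lambda pair: [pair[1]],
                   value=lambda pair, freq: 1,
                   action=sum)
```

Each `run_layer` call corresponds to one MapReduce job, so the two calls together model a chain of two jobs.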

C. Container

• A “Container” integrates different outputs into a single multi-column set.

• It keeps the output dataflow visible and convenient to work with when handling inputs from different Layers with different outputs.
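One way to picture a Container is as a join of several keyed outputs into rows with one column per output. The function `container` and the column names below are hypothetical, and filling missing entries with `None` is an assumption rather than documented behaviour.

```python
def container(**outputs):
    """Merge several {key: value} outputs into one multi-column table."""
    columns = list(outputs)
    # Collect every key that appears in at least one output.
    keys = sorted(set().union(*[out.keys() for out in outputs.values()]))
    header = ["key"] + columns
    # One row per key; an output that lacks the key contributes None.
    rows = [tuple([k] + [outputs[c].get(k) for c in columns]) for k in keys]
    return header, rows

counts = {"a": 2, "b": 1}
lengths = {"a": 1, "b": 1, "c": 1}
header, rows = container(count=counts, length=lengths)
```

A later Layer can then pick any column of `rows` as its Target or Value without caring which earlier output produced it.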

D. System Architecture

• The system for MapReduce data processing consists of two parts: a Web-based GUI and a Hadoop cluster.

IV. Case Study and Implementation

• In our Web-based GUI, users can drag a Target-Value-Action, an input path form, or a Container and drop it into the drop area on the Web page.

• Drag and drop offers a convenient control method.

A. Pairwise Document Similarity

• In “Pairwise Document Similarity in Large Collections with MapReduce”, the authors calculate the similarity of each pair of documents using MapReduce.

• d is a document

• sim(di, dj) is the similarity between documents di and dj

• V is the vocabulary set
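The formula itself is not preserved in this transcript. In the cited paper, the similarity is, to the best of our reading, an inner product of term weights, which in the notation above can be written as follows. The symbol w for the weight of term t in a document is taken from that paper and does not appear on the slide.

```latex
\mathrm{sim}(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
```

Only terms that occur in both documents contribute to the sum, since the weight of an absent term is zero.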

Algorithm

• The authors proposed an efficient solution to the pairwise document similarity problem.

• The presented solution can be expressed as two MapReduce jobs: “Indexing” and “Pairwise Similarity”.
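The two jobs can be sketched in memory as follows. This is a simplified illustration and not the cited authors' code: the function names are hypothetical, raw term frequency is assumed as the term weight, and the three tiny documents are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations

def indexing(docs):
    """Job 1 (Indexing): build postings, term -> [(doc_id, weight)]."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        # Map: emit (term, (doc_id, term frequency)) for every term in the document.
        for term, tf in Counter(text.split()).items():
            # Reduce: collect the postings that belong to the same term.
            postings[term].append((doc_id, tf))
    return postings

def pairwise_similarity(postings):
    """Job 2 (Pairwise Similarity): sum weight products per document pair."""
    sims = defaultdict(int)
    for plist in postings.values():
        # Map: every pair of documents sharing the term emits one partial product.
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            # Reduce: add up the partial products of the same document pair.
            sims[(di, dj)] += wi * wj
    return dict(sims)

docs = {"d1": "a b b", "d2": "b c", "d3": "a c c"}
sims = pairwise_similarity(indexing(docs))
```

Pairs of documents that share no term never appear in the result, because no posting list ever brings them together.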

B. Pairwise Document Similarity in Our Proposed Method

V. Conclusions and Future Work

• We present three abstract data processing building blocks that hide the programming details of Map and Reduce.

• The presented methods are suitable for a Web-based GUI because they reduce the complexity of MapReduce data processing.

• Users can also directly drag and drop the operation components to process large-scale data using MapReduce.

• In the future, we will enrich the operation components and management functions.

• If users could store, reuse, and share the operation component structures they have built, our method would be more powerful.

Thank you for listening
