Simplified Data Processing On Large Cluster


Transcript of Simplified Data Processing On Large Cluster

Simplified Data Processing

On Large Cluster


Presented by

Dipen Shah (110420107064)

Harsh Kevadia (110420107049)

Nancy Sukhadia (110420107025)


What Is a Cluster?

A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.

The components of a cluster are usually connected to each other through fast local area networks.

Clusters are usually deployed to improve performance and availability over that of a single computer.


Introduction

On the web, huge amounts of data, so-called Big Data, are stored, processed, and retrieved within a few milliseconds.

Big Data cannot be stored, processed, and retrieved on a single machine.


Contd..

How do huge IT companies store their data? And how is the data processed and retrieved?

Big Data requires a lot of processing power for computing and storing the data.


How To Divide a Large Input Set Into Smaller Input Sets?

The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes.

Each worker node processes its smaller problem and passes the answer back to the master node.

This sometimes creates problems for data that must be processed in sequence: the output of one item can be the input of another.

The approach is therefore only suitable for data items that are independent of each other, so that each can be processed without waiting for the output of the previous one.
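The split-and-distribute step above can be sketched as follows. This is a minimal single-machine illustration assuming independent records and a simple round-robin policy; the real MapReduce library splits input by byte size, not record count.

```python
def split_input(records, num_workers):
    """Divide a list of independent records into roughly equal
    sub-problems, one per worker node (round-robin assignment)."""
    chunks = [[] for _ in range(num_workers)]
    for i, record in enumerate(records):
        chunks[i % num_workers].append(record)
    return chunks

# Example: 7 independent records divided among 3 workers.
chunks = split_input(list(range(7)), 3)
```

Because the records are independent, each chunk can be processed without waiting for any other chunk.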


How To Divide Work Among Worker Nodes In The Same Cluster?

The master node estimates the time required for a normal computation and also considers the priority of the particular data-processing job.

It checks every worker node's schedule and processing speed.

After analysing this data, the work is assigned to a worker node.
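A minimal sketch of this kind of scheduling decision, assuming each worker advertises its queued work and relative speed (both illustrative fields, not part of the original design):

```python
def assign_task(task, workers):
    """Pick the worker expected to finish this task soonest.
    Each worker dict tracks queued work (seconds at unit speed)
    and a relative processing speed."""
    def finish_time(w):
        return (w["queued"] + task["cost"]) / w["speed"]
    best = min(workers, key=finish_time)
    best["queued"] += task["cost"]  # record the new assignment
    return best["name"]

workers = [
    {"name": "w1", "queued": 10.0, "speed": 1.0},
    {"name": "w2", "queued": 0.0,  "speed": 0.5},
]
# w1 would finish at (10+4)/1.0 = 14s, w2 at (0+4)/0.5 = 8s.
chosen = assign_task({"cost": 4.0}, workers)
```

Priority could be layered on top by simply sorting the task queue by priority before assigning.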


Dividing Input Creates Problems And Affects The Output

When the items in a large input set are interrelated, or the sequence of inputs matters, the inputs must be processed in the given sequence.

We need to develop an algorithm that takes care of all these problems.


Dividing Input So That Optimized Performance Can Be Achieved

How do we divide a problem into sub-problems so that we get optimized performance? Optimized here means minimum time required, minimum resources allocated to the process, and good coordination between worker nodes in the cluster.


What If Worker Node Fails?

The master node divides work among the workers and pings each worker node periodically.

What if a worker node doesn't respond, or fails?
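The periodic-ping idea can be sketched as a heartbeat check; the `timeout` value and the `last_ping` table are illustrative assumptions, not details from the paper.

```python
import time

def find_failed_workers(last_ping, timeout, now=None):
    """Return workers whose last heartbeat is older than `timeout`
    seconds. The master marks these as failed and reschedules
    their in-progress tasks on other workers."""
    now = time.time() if now is None else now
    return [w for w, t in last_ping.items() if now - t > timeout]

# w2 last answered 5 seconds ago, beyond the 3-second timeout.
pings = {"w1": 100.0, "w2": 95.0, "w3": 99.5}
failed = find_failed_workers(pings, timeout=3.0, now=100.0)
```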


What Happens When The Master Node Fails?

There is only a single master.

All computation is aborted if the master node fails.


Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.

The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.

Typically just zero or one output value is produced per Reduce invocation.

The intermediate values are supplied to the user's reduce function via an iterator.

This allows us to handle lists of values that are too large to fit in memory.
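The iterator point can be illustrated in Python (a sketch, not the paper's C++ interface): the reduce function consumes its values through a generator, so the full value list never has to be materialized in memory at once.

```python
def value_stream(n):
    """Yield counts one at a time, simulating intermediate values
    streamed from disk rather than held in a single in-memory list."""
    for _ in range(n):
        yield 1

def reduce_sum(key, values):
    """A reduce function that consumes its values via an iterator;
    it never needs the whole list at once."""
    total = 0
    for v in values:
        total += v
    return total

# One million intermediate values, but constant memory use.
result = reduce_sum("the", value_stream(1_000_000))
```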


Example:

Consider the problem of counting the number of occurrences of each word in a large collection of documents.

The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
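A runnable Python equivalent of the pseudo-code above, with a minimal in-memory driver that performs the grouping step the MapReduce library would normally do. The driver is an illustrative single-machine sketch, not the real distributed implementation.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    """key: a word, values: an iterator of string counts."""
    return str(sum(int(v) for v in values))

def map_reduce(documents):
    """Minimal in-memory driver: run map over every document,
    group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    return {k: reduce_fn(k, iter(vs)) for k, vs in sorted(groups.items())}

counts = map_reduce({"d1": "the quick fox", "d2": "the lazy dog"})
```

As in the pseudo-code, strings are passed to and from the user functions, and the user code converts them to the appropriate types.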


Cont.

• The map function emits each word plus an associated count of occurrences.

• The reduce function sums together all counts emitted for a particular word.

• In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters.

• The user then invokes the MapReduce function, passing it the specification object.

• The user's code is linked together with the MapReduce library (implemented in C++).


Types:

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values.

Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.


MapReduce: Examples

Distributed Grep

Count of URL Access frequency

Reverse Web Link graph

Term Vector per host

Inverted Index
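Distributed grep, the first example above, maps naturally onto the model: the map function emits a line if it matches a pattern, and the reduce function is the identity, simply copying the intermediate lines to the output. A small single-machine sketch (the function names are illustrative):

```python
import re

def grep_map(filename, lines, pattern):
    """Map: emit (filename, line) for every line matching pattern."""
    for line in lines:
        if re.search(pattern, line):
            yield filename, line

def grep_reduce(key, values):
    """Reduce: identity -- pass the matching lines through."""
    return list(values)

matches = list(grep_map("log.txt",
                        ["error: disk", "ok", "error: net"],
                        r"error"))
```

The other examples follow the same shape: URL access frequency is word count over request logs, and the reverse web-link graph emits (target, source) pairs in map and concatenates the sources in reduce.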


Implementation

Assumption

Execution Overview

Master Data Structure

Fault Tolerance

Implementation Issues


Assumption

Cluster of commodity PCs with similar configurations

Networking

Failures

Storage

Job scheduling system


Execution Overview

1. Split the input set into pieces.

2. Copies of the program start across the cluster; the master copy assigns work to the workers.

3. Each map worker reads its input split and produces intermediate output.

4. The worker saves the intermediate output on its local disk.

5. The reduce workers collect the intermediate data from the map workers' local disks.

6. Each reduce worker sorts the data by intermediate key; an external sort is used because the data is too large for memory.

7. The master creates the output files and wakes up the user program.
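Step 5 works because intermediate keys are partitioned deterministically among the reduce workers, typically by hashing. A small sketch of such a partition function (crc32 is an illustrative choice of stable hash, not the paper's exact function):

```python
import zlib

def partition(key, num_reducers):
    """Send every intermediate pair with the same key to the same
    reduce worker. A stable hash (crc32) is used because Python's
    built-in hash() for strings changes between runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# The same key always maps to the same reduce partition.
p1 = partition("apple", 4)
p2 = partition("apple", 4)
```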



Master Data Structure

State and identity of worker machine

Intermediate file

Update of location

File size


Fault Tolerance

Worker Failure

Master Failure

Master Election

1. Manually

2. Highest IP Address

3. Highest MAC Address
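The "highest IP address" rule can be sketched as a simple tie-break among the surviving nodes. This is a toy illustration; a real election protocol must also handle message loss and simultaneous candidates.

```python
def elect_master(addresses):
    """Bully-style tie-break: the surviving node with the numerically
    highest IP address becomes the new master."""
    def as_tuple(ip):
        # Compare octet by octet, not lexicographically as strings.
        return tuple(int(part) for part in ip.split("."))
    return max(addresses, key=as_tuple)

new_master = elect_master(["10.0.0.5", "10.0.0.17", "10.0.0.9"])
```

Comparing octet tuples matters: as plain strings, "10.0.0.9" would wrongly beat "10.0.0.17".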


Implementation Issues

Backup tasks

Network Bandwidth

Locality


Conclusion

We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible.

Google uses MapReduce for its web search service, for sorting, for data mining, for machine learning, and in many other systems.




Thank You

Q/A!
