WELCOME TO BIG DATA TRAINING


Transcript of WELCOME TO BIG DATA TRAINING

Page 1: WELCOME TO BIG DATA TRAINING

Abhishek Mukherjee, Utkarsh Srivastava

12th September

Not everything that can be counted counts, and not everything that counts can be counted.

WELCOME TO BIG DATA TRAINING

Page 2: WELCOME TO BIG DATA TRAINING

What are we going to cover today?

Uses of Big Data
What is Hadoop?
Short intro to the HDFS architecture
What is MapReduce?
The components of the MapReduce algorithm
Hello world of MapReduce, i.e. the Word Count algorithm
Tips and tricks for MapReduce

Page 3: WELCOME TO BIG DATA TRAINING

What is Big Data?

Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Lots of data (terabytes, petabytes or zettabytes).
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.

Page 4: WELCOME TO BIG DATA TRAINING

WHY BIG DATA?

Serial vs sequential processing

Serial vs parallel processing

Page 5: WELCOME TO BIG DATA TRAINING

WHY BIG DATA?

Page 6: WELCOME TO BIG DATA TRAINING

WHY BIG DATA?

Walmart has exhaustive customer data on close to 145 million Americans, of which 60% of the data is on U.S. adults.
Walmart tracks and targets every consumer individually.
Walmart observed a significant 10% to 15% increase in online sales, for $1 billion in incremental revenue.

Page 7: WELCOME TO BIG DATA TRAINING

Differentiating Factors:

Accessible
Robust
Scalable
Simple

Page 8: WELCOME TO BIG DATA TRAINING

MOVING TOWARDS HADOOP

Father of Hadoop?

Page 9: WELCOME TO BIG DATA TRAINING

HADOOP ECOSYSTEM

Page 10: WELCOME TO BIG DATA TRAINING

HDFS ARCHITECTURE

Page 11: WELCOME TO BIG DATA TRAINING

HDFS ARCHITECTURE CONTD.

Page 12: WELCOME TO BIG DATA TRAINING

MapReduce Algorithm

Key points

Map Phase
Combiner Phase (Optional)
Sort Phase
Shuffle Phase
Partition Phase (Optional)
Reducer Phase

Page 13: WELCOME TO BIG DATA TRAINING

Delving into the algorithm

Page 14: WELCOME TO BIG DATA TRAINING

IMPORT THE WORD COUNT EXAMPLE

Page 15: WELCOME TO BIG DATA TRAINING

Map Phase

Imagine this as the input file:

Hello my name is abhishek
Hello my name is utsav
Hello my passion is cricket

This file has 3 lines. Each line in the file has a byte offset of its own, which serves as the key to the mapper; the value passed to the mapper is the data present in the line.
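To make the key and value idea concrete, here is a minimal Python sketch (not Hadoop code) of what a line-oriented reader and a word count mapper might do with the input file above. The names read_records and word_count_map are hypothetical and chosen only for illustration.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the file."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the position of the line's first byte in the file.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def word_count_map(offset, line):
    """Mapper: ignore the offset key and emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1


# The three-line input file from the slide (assumed to end with newlines).
input_file = (
    b"Hello my name is abhishek\n"
    b"Hello my name is utsav\n"
    b"Hello my passion is cricket\n"
)

records = list(read_records(input_file))
mapped = [
    pair
    for offset, line in records
    for pair in word_count_map(offset, line)
]

for offset, line in records:
    print(offset, repr(line))
print(mapped[:5])
```

Under the assumption that each line ends with a single newline byte, the three lines get the keys 0, 26 and 49, and the mapper emits 15 (word, 1) pairs in total.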

Page 16: WELCOME TO BIG DATA TRAINING

Operation on output of map phase

Hello 1 my 1 name 1 is 1 abhishek 1
Hello 1 my 1 name 1 is 1 utsav 1
Hello 1 my 1 passion 1 is 1 cricket 1

Key(tuple of values)

Hello(1,1,1)
my(1,1,1)
name(1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)

Page 17: WELCOME TO BIG DATA TRAINING

Explanation of sort and shuffle phase

The key points are as follows:
Sort the key-value pairs according to their keys.
Shuffle the mapped output so that values with the same key are brought together into a tuple of values.
This output is fed to the reducer, which in turn reduces each tuple by returning a single value for the list of values present in it.
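The grouping described on this slide can be sketched in a few lines of Python. This is only an in-memory illustration under the assumption that all mapper output fits on one machine; the helper name shuffle_and_sort is hypothetical, and a real cluster performs this step across the network.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Python's default string order is case-sensitive, so the capitalised
    # key "Hello" sorts ahead of the lowercase keys.
    return sorted(groups.items())


# Mapper output for the three-line word count input.
words = (
    "Hello my name is abhishek "
    "Hello my name is utsav "
    "Hello my passion is cricket"
).split()
mapped = [(word, 1) for word in words]

grouped = shuffle_and_sort(mapped)
for key, values in grouped:
    print(key, values)
```

Each key now appears exactly once, paired with the list of all values the mappers emitted for it, which is the shape the reducer expects.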

Page 18: WELCOME TO BIG DATA TRAINING

Reducer phase

Key(tuple of values)

Hello(1,1,1)
my(1,1,1)
name(1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)

Key(single value)

abhishek(1)
cricket(1)
Hello(3)
is(3)
my(3)
name(2)
passion(1)
utsav(1)
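As a rough sketch of the reducer phase, the Python below collapses each tuple of values into a single count. The grouped input is derived from the three-line input file (where "name" occurs twice), and the function name word_count_reduce is hypothetical.

```python
def word_count_reduce(key, values):
    """Reducer: return one (key, total) pair for a key and its list of values."""
    return key, sum(values)


# Grouped output of the sort and shuffle phase for the word count input.
grouped = [
    ("Hello", [1, 1, 1]),
    ("my", [1, 1, 1]),
    ("name", [1, 1]),
    ("is", [1, 1, 1]),
    ("abhishek", [1]),
    ("utsav", [1]),
    ("passion", [1]),
    ("cricket", [1]),
]

reduced = dict(word_count_reduce(key, values) for key, values in grouped)
for key, total in reduced.items():
    print(key, total)
```

The totals add up to 15, the number of words in the input file, which is a quick sanity check for any word count job.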

Page 19: WELCOME TO BIG DATA TRAINING

BASIC HADOOP COMMANDS

sudo su (temporarily switches to the superuser)
hadoop fs -ls /
hadoop fs -mkdir /mycreatedfolderinhdfs
hadoop fs -put /usr/directoryinlocal /user/root/directoryinhdfs
hadoop fs -get /user/root/mycreatedfolderinhdfs /usr/folderinlocal
hadoop fs -rmr /mycreatedfolderinhdfs
hadoop jar com.bigdata.session.hadoop.tool.jar {source path} {destination path}

Page 20: WELCOME TO BIG DATA TRAINING

IMPORT THE MAXIMUM TEMPERATURE PROJECT

Page 21: WELCOME TO BIG DATA TRAINING

Types of splits (parallel processing in action):

Two types of splitting of input files are possible:
HDFS split: splitting of files into blocks of fixed size, e.g. splitting a file into blocks of 64 MB, to promote parallel processing.
N line split: splitting of files into chunks of a fixed number of lines, to promote parallel processing.

Let's see an example on the next slide.
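Before moving on, the fixed-size HDFS split can be illustrated with a little arithmetic. The sketch below uses the 64 MB block size quoted on this slide; the 200 MB file size and the helper name hdfs_block_sizes are assumptions made up for the example.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide


def hdfs_block_sizes(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


blocks = hdfs_block_sizes(200)  # hypothetical 200 MB file
print(blocks)
print(len(blocks), "blocks")
```

Each block can be processed independently, so in this hypothetical case up to four map tasks could work on the file at the same time.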

Page 22: WELCOME TO BIG DATA TRAINING

N LINE SPLITTING:

Consider this as the input file:

MapReduce is a framework based on processing data in parallel. This algorithm consists of three phases, namely map, shuffle and sort, and reduce. Here we will observe the effect of the N line splitter on the number of map tasks, i.e. the number of mappers created. This will create a better understanding of how a file splits.

Can you guess what will happen?

Page 23: WELCOME TO BIG DATA TRAINING

N LINE SPLITTING contd.

Assume the value of n is 3.

MapReduce is a framework based on processing data in parallel. This algorithm consists of three phases, namely map, shuffle and sort, and reduce. Here we will

observe the effect of the N line splitter on the number of map tasks, i.e. the number of mappers created. This will create a better understanding of how a file splits.

So these two splits of the file will be sent to two different mappers, whereas in the case of an HDFS split the amount of data sent to each mapper depends on the size of the respective split.
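The effect of n = 3 can be reproduced with a short Python sketch. The transcript does not preserve the original line breaks of the input file, so the six lines below are an assumed layout chosen so that the first split ends at "Here we will", as on the slide. The helper name n_line_splits is hypothetical; in Hadoop itself this behaviour is provided by NLineInputFormat.

```python
def n_line_splits(lines, n):
    """Cut a list of lines into consecutive splits of at most n lines each."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]


# Assumed line layout of the slide's input file (six lines).
lines = [
    "MapReduce is a framework based on processing",
    "data in parallel. This algorithm consists of three phases,",
    "namely map, shuffle and sort, and reduce. Here we will",
    "observe the effect of the N line splitter on the number",
    "of map tasks, i.e. the number of mappers created. This",
    "will create a better understanding of how a file splits.",
]

splits = n_line_splits(lines, 3)
print(len(splits), "splits, so", len(splits), "mappers")
for number, split in enumerate(splits, start=1):
    print("split", number, "->", len(split), "lines")
```

With this layout the file yields two splits of three lines each, and therefore two mappers. Changing n changes the number of mappers, regardless of how many bytes each line holds.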

Page 24: WELCOME TO BIG DATA TRAINING

Data Types available in MapReduce

Hadoop uses its own serialization format, Writables, which is compact and fast. Data needs to be serialized to be sent over a network.

Thus we see that these serializable data types are equivalents of Java data types.
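To show what "compact" means in practice, the Python sketch below serializes integers into fixed-width big-endian bytes, which is the layout IntWritable and LongWritable are understood to use for a Java int and long. It is an illustration of the idea only, not Hadoop's implementation, and the function names are hypothetical.

```python
import struct


def serialize_int(value):
    """Pack a 32-bit signed integer into 4 big-endian bytes."""
    return struct.pack(">i", value)


def deserialize_int(data):
    """Unpack 4 big-endian bytes back into a 32-bit signed integer."""
    return struct.unpack(">i", data)[0]


def serialize_long(value):
    """Pack a 64-bit signed integer into 8 big-endian bytes."""
    return struct.pack(">q", value)


encoded = serialize_int(163)
print(encoded.hex(), len(encoded), "bytes")
print(deserialize_int(encoded))
print(len(serialize_long(163)), "bytes for a long")
```

A binary encoding like this always costs 4 bytes for an int, whereas sending the same number as text would cost one byte per digit plus a delimiter.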

Page 25: WELCOME TO BIG DATA TRAINING

Tips for optimizing MapReduce code:

Combiner optimization
Partitioner optimization
Custom Writables
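The first two tips can be sketched in Python. A combiner sums counts on the mapper's side so that fewer pairs cross the network, and a partitioner decides which reducer receives each key. The function names are hypothetical, and the toy hash below is an assumption for illustration, not the hash function Hadoop uses.

```python
from collections import Counter


def combine(mapper_output):
    """Combiner: pre-aggregate one mapper's (word, count) pairs locally."""
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: map a key to a reducer index with a stable toy hash."""
    return sum(key.encode("utf-8")) % num_reducers


# Output of a single hypothetical mapper that processed two lines.
mapper_output = [
    (word, 1)
    for word in "Hello my name is abhishek Hello my name is utsav".split()
]

combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining")
for word, count in combined:
    print(word, count, "-> reducer", partition(word, 2))
```

In this small case the combiner shrinks 10 pairs to 6 before the shuffle. Because the partitioner is deterministic, every occurrence of a key reaches the same reducer, which is what makes the final counts correct.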

Page 26: WELCOME TO BIG DATA TRAINING

ANY QUERIES?

Page 27: WELCOME TO BIG DATA TRAINING

CONTACT DETAILS

Abhishek Mukherjee, [email protected], No. 9629341857
Utkarsh Srivastava, [email protected], No. 9629341221