Big Data Analytics with Hadoop
Big Data
What is Big Data?

'Big Data' is similar to 'small data', only bigger in size; but handling bigger data requires different approaches: new techniques, tools, and architectures. Big data is a term for data sets that are so large or complex that traditional data-processing applications are inadequate to deal with them.
Sources of Big Data

- Social Media Data
- Black Box Data
- Stock Exchange Data
- Transport Data
- Power Grid Data
- Search Engine Data
Social Media Data: Social media such as Facebook and Twitter hold information and views posted by millions of people across the globe.

Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.

Stock Exchange Data: Stock exchange data holds information about the 'buy' and 'sell' decisions that customers make on shares of different companies.
Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.

Search Engine Data: Search engines retrieve lots of data from different databases.

Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
Three Vs of Big Data

- Volume: data quantity
- Velocity: data speed
- Variety: data types
Velocity

- High-frequency stock-trading algorithms reflect market changes within microseconds.
- Machine-to-machine processes exchange data between billions of devices.
- Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Volume

- A typical PC might have had 10 gigabytes of storage in 2000.
- Today, Facebook ingests 600 terabytes of new data every day.
- Smartphones, along with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
Variety

Big Data isn't just numbers, dates, and strings. It also includes geospatial data, 3D data, audio and video, and unstructured text such as log files and social media posts.

Traditional database systems were designed for smaller volumes of structured data, fewer updates, and a predictable, consistent data structure. Big Data analysis, by contrast, must handle many different types of data.
Challenges

- Storage
- Searching
- Sharing
- Transfer
- Analysis
Hadoop
History of Hadoop

Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2005. It was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug named it after his son's toy elephant. In November 2016, Apache Hadoop became a registered trademark of the Apache Software Foundation.
What is Hadoop?

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. Hadoop runs applications using the MapReduce algorithm, in which the data is processed in parallel on different CPU nodes. Its distributed file system provides rapid data-transfer rates among nodes and allows the system to continue operating after a node failure. Hadoop can perform complete statistical analysis on huge amounts of data.
Hadoop Architecture

- MapReduce (distributed computation)
- HDFS (distributed storage)
- YARN framework
- Common utilities
Hadoop Common: Common refers to the collection of common utilities and libraries that support the other Hadoop modules. These libraries provide file-system and OS-level abstractions and contain the Java files and scripts required to start Hadoop.

Hadoop YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
HDFS

The Hadoop Distributed File System:

- runs on top of the existing file system
- is designed to handle very large files with streaming data-access patterns
- uses blocks to store a file or parts of a file.
HDFS Blocks

- File blocks are 64 MB (default) or 128 MB (recommended), compared to 4 KB in UNIX.
- Behind the scenes, one HDFS block is backed by multiple operating-system (OS) blocks.
- Blocks fit well with replication, providing fault tolerance and availability.

(Slide diagram: one 128 MB HDFS block spanning many small OS blocks.)
Advantages of blocks

- Fixed size: it is easy to calculate how many blocks fit on a disk.
- A file can be larger than any single disk in the network.
- If a file (or the last chunk of a file) is smaller than the block size, only the needed space is used. E.g., a 420 MB file is split as 128 MB + 128 MB + 128 MB + 36 MB.
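The split arithmetic above can be sketched in a few lines of Python (an illustrative sketch, not Hadoop code; the 128 MB block size matches the slide):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes HDFS would use for a file;
    the final block holds only the remainder, so no space is wasted."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

# The 420 MB example from the slide: three full blocks plus a 36 MB tail.
print(split_into_blocks(420))  # [128, 128, 128, 36]
```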
HDFS Replication

Blocks of data are replicated to multiple nodes, which allows a node to fail without data loss.
Writing a file to HDFS
(Slides 21-25: diagrams of the HDFS write pipeline.)

Hadoop MapReduce
Components of Hadoop

- HDFS
- MapReduce
- YARN framework
- Libraries
A Definition

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS).

MapReduce is a framework patented by Google to support distributed computing on large data sets.
Inspiration

The name MapReduce comes from functional programming:

- map is a higher-order function that applies a given function to each element of a list.
- reduce is a higher-order function that analyzes a recursive data structure and, using a given combining operation, recombines the results of recursively processing its constituent parts to build up a return value.
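These two functional-programming roots can be seen directly with Python's built-ins (a minimal illustration of the naming, unrelated to Hadoop itself):

```python
from functools import reduce

# map: apply a function to each element of a list.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# reduce: combine the elements pairwise into a single return value.
total = reduce(lambda acc, x: acc + x, squares)  # 1 + 4 + 9 + 16 = 30

print(squares, total)
```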
(Slide diagram: the master node runs the NameNode and JobTracker; each slave node runs a DataNode and a TaskTracker.)
How MapReduce Works

- Init: Hadoop divides the input file stored on HDFS into splits (typically the size of an HDFS block) and assigns each split to a different mapper, trying to assign each split to the mapper on the node where the split physically resides.
- Mapper: Hadoop reads the mapper's split line by line and calls the mapper's map() method for every line, passing it as the key/value parameters; the mapper computes its application logic and emits other key/value pairs.
- Shuffle and sort: Hadoop's partitioner divides the emitted output of the mappers into partitions, each of which is sent to a different reducer. Each reducer collects the partitions received from the mappers and sorts them by key.
- Reducer: Hadoop reads the aggregated partitions line by line and calls the reducer's reduce() method for every line of the input; the reducer computes its application logic and emits other key/value pairs, which Hadoop writes to HDFS.
Common Jobs for MapReduce

- Text mining
- Index building
- Graphs
- Patterns
- Filtering
- Prediction
- Risk analysis
Word Count Using MapReduce
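The classic word-count example can be simulated in plain Python to make the mapper, shuffle-and-sort, and reducer stages concrete (an in-memory sketch of the idea, not actual Hadoop code):

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(mapped_pairs):
    # Group all values by key and sort by key, as shuffle-and-sort does.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Combine the grouped counts into a single total per word.
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the mapped pairs would be partitioned across many reducer nodes; here a single dictionary stands in for that distribution.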
Benefits

- Simplicity
- Scalability
- Speed
- Recovery
- Minimal data motion
JAQL
Introduction

- A functional data-processing and query language.
- Commonly used for JSON query processing on big data.
- Started as an open-source project at Google.
- Taken over by IBM as the primary data-processing language for its Hadoop software package, BigInsights.
- Supports a variety of data sources, such as JSON, CSV, TSV, and XML.
- A loosely typed functional language with lazy evaluation, so expressions are only materialized when needed.
- Can process structured and non-traditional data.
- Inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.
Jaql allows users to:

- access and load data from different sources (local file system, web, Twitter, HDFS, HBase, etc.)
- query data (databases)
- transform, aggregate, and filter data
- write data to different places (local file system, HDFS, HBase, databases, etc.)
Jaql Environments

- the Jaql shell, from a command prompt
- the Eclipse environment
Key Goals

- Flexibility
- Scalability
- Physical transparency
- Modularity
Jaql I/O

(Slide diagram: the Jaql interpreter exchanges JSON with an I/O layer, which reads from a source and writes to a destination during execution.)
Starting the Jaql shell:

    cd $BIGINSIGHTS_HOME/jaql/bin
    ./jaqlshell
Advantages

- It is easy to create user-defined functions written entirely in Jaql.
- Simplicity: it facilitates development and makes it easier to distribute the program between nodes.
- MapReduce jobs can be called directly.
HIVE
- Originated as an internal project at Facebook.
- A data-warehouse infrastructure built on top of Hadoop.
- Provides an SQL-like query interface called HiveQL.
- Compiles queries into MapReduce jobs and runs them on the cluster.
- Structures data into well-defined database concepts.
Hive Architecture

(Slide diagram.)
Apache Pig
What is Pig?

Pigs eat anything: Pig can operate on data whether or not it has metadata, and whether it is relational, nested, or unstructured. It can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

Pigs live anywhere: Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework; it was implemented first on Hadoop, but it is not intended to run only on Hadoop.

Pigs are domestic animals: Pig is designed to be easily controlled and modified by its users.
Pig Latin was designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce.

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data-analysis programs, coupled with infrastructure for evaluating those programs. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs; its language layer currently consists of a textual language called Pig Latin.
Key Properties of Pig Latin

- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data-analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data-flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
Starting the Pig shell (the grunt prompt):

    cd $PIG_HOME/bin
    ./pig -x local
Twitter Data Analytics Using Hadoop
Twitter Data

- Real Twitter data purchased by IBM
- Available in JSON (JavaScript Object Notation) format
Objective

Use Jaql core operators to manipulate JSON data found in Twitter feeds:

- filter arrays to remove values
- sort arrays in either ascending or descending sequence
- write data to HDFS
Procedure

Start the Jaql shell:

    cd $BIGINSIGHTS_HOME/jaql/bin
    ./jaqlshell

Read the JSON Twitter records:

    tweets = read(file("file:///home/labfiles/SampleData/Twitter%20Search.json"));
    tweets;

Retrieve selected fields using transform:

    tweets -> transform {$.created_at, $.from_user, $.iso_language_code, $.text};
Create a new record, tweetrecs:

    tweetrecs = tweets -> transform {$.created_at, $.from_user, language: $.iso_language_code, $.text};

Use the filter operator to see all non-English-language records from tweetrecs:

    tweetrecs -> filter $.language != 'en';

Aggregate the data and count the number of tweets for each language:

    tweetrecs -> group by key = $.language into {language: key, num: count($)};

Create the target directory in HDFS:

    hdfsShell('-mkdir /user/biadmin/jaql');

Write the results of the previous aggregation to a file in HDFS:

    tweetrecs -> group by key = $.language into {language: key, num: count($)}
        -> write(seq("hdfs:/user/biadmin/jaql/twittercount.seq"));
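For readers without a BigInsights environment, the same transform / filter / group-by pipeline can be mimicked over a list of dictionaries in Python (an illustrative analogy with made-up sample tweets, not Jaql itself):

```python
from collections import Counter

# Hypothetical sample records shaped like the Twitter JSON fields above.
tweets = [
    {"from_user": "a", "iso_language_code": "en", "text": "hello"},
    {"from_user": "b", "iso_language_code": "fr", "text": "bonjour"},
    {"from_user": "c", "iso_language_code": "en", "text": "hi"},
]

# transform: project and rename fields (iso_language_code -> language).
tweetrecs = [
    {"from_user": t["from_user"],
     "language": t["iso_language_code"],
     "text": t["text"]}
    for t in tweets
]

# filter: keep only the non-English records.
non_english = [t for t in tweetrecs if t["language"] != "en"]

# group by language and count, like the Jaql aggregation.
lang_counts = Counter(t["language"] for t in tweetrecs)

print(non_english)        # the single French record
print(dict(lang_counts))  # {'en': 2, 'fr': 1}
```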
THANK YOU!
- MISHIKA BHARADWAJ