Hadoop Master Class : A concise overview
Big Data Hadoop Master Class
By Abhishek Roy
About Me
Abhishek Roy
• Around 10 years of tech experience across services, products, and startups
• Previously led the data team at Qyuki Digital Media
• Lately involved in big data, with emphasis on recommendation systems and machine learning approaches
• Currently building an employee wellness platform, www.feetapart.com
• LinkedIn: http://www.linkedin.com/in/abhishekroy8
Agenda
• What is Big Data
• History and Background
• Why Data
• Intro to Hadoop
• Intro to HDFS
• Setup & some hands-on
• HDFS architecture
• Intro to MapReduce
Agenda
• Pig
• Hive
• Apache Mahout
– Running an example MR Job
• Sqoop, Flume, Hue, Zookeeper
• Impala
• Some Real World Use cases
What Is Big Data?
Like the term "Cloud," it is somewhat ill-defined and hazy.
Big Data – A Misnomer
What’s the “BIG” hype about Big Data?
There may be hype, but the problems are real and of great value. How?
• We are in the age of advanced analytics, where valuable business insight is mined out of historical data (that is where the problem lies: we want to analyze the data).
• We live in an age of data abundance, where every individual, enterprise, and machine leaves behind data summing to many terabytes and petabytes, and it is only expected to grow.
What’s the “BIG” hype about Big Data?
• Good news, a blessing in disguise: more data means more precision.
• More data usually beats better algorithms.
• But how are we going to analyze it?
• Traditional database or warehouse systems crawl or crack at these volumes.
• They are inflexible in handling most of these formats.
• This is the very characteristic of Big Data.
Key Hadoop/Big Data Sources
• Sentiment
• Clickstream
• Sensor/machine
• Geographic
• Text
• Chatter from social networks
• Traffic-flow sensors
• Satellite imagery
• Broadcast audio streams
Sources of Big Data
• Banking transactions
• MP3s of rock music
• Scans of government documents
• GPS trails
• Telemetry from automobiles
• Financial market data
• ….
Key Drivers
Spread of cloud computing, mobile computing, and social media technologies, plus the growth of financial transactions
Introduction cont.
• Nature of Big Data
– Huge volumes of data that cannot be handled by traditional database or warehouse systems.
– It is mostly machine produced.
– Mostly unstructured and grows at high velocity.
– Big data doesn’t always mean huge data, it means “difficult” data.
The 4 Vs: Volume, Velocity, Variety, Veracity
Data Explosion
Inflection Points
• Is "divide the data and rule" a solution here?
– Have multiple disk drives, split your data file into small enough pieces across the drives, and do parallel reads and processing.
– Hardware reliability (failure of any drive) is a challenge.
– Resolving data interdependency between drives is a notorious challenge.
– The number of disk drives that can be added to a server is limited.
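The divide-and-rule idea above can be sketched in a few lines; this is an illustrative toy (plain Python, not Hadoop code) that splits input across "drives", processes the pieces in parallel, and combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for per-drive processing: count words in one chunk."""
    return sum(len(line.split()) for line in chunk)

def divide_and_rule(lines, num_drives=4):
    """Split the input across 'drives' and process the pieces in parallel."""
    chunks = [lines[i::num_drives] for i in range(num_drives)]
    with ThreadPoolExecutor(max_workers=num_drives) as pool:
        partials = pool.map(process_chunk, chunks)
    # Combining partial results is exactly the cross-drive
    # coordination step the slide calls a challenge.
    return sum(partials)

data = ["the quick brown fox", "jumps over", "the lazy dog"]
print(divide_and_rule(data))  # 9 words, however the data is split
```

Note that this only works cleanly because word counts of chunks are independent; the failure-handling and interdependency problems listed above are what the toy leaves out.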
Inflection Points
• Analysis
– Much of big data is unstructured; traditional RDBMS/EDW cannot handle it.
– Lots of big data analysis is ad hoc in nature: whole-data scans, self-references, joins, combinations, etc.
– Traditional RDBMS/EDW cannot handle these, given their limited scalability options and architectural limitations.
– You can buy better servers and processors and throw in more RAM, but there is a limit to it.
Inflection Points
• We need a drastically different approach:
– A distributed file system with high capacity and high reliability.
– A processing engine that can handle structured/unstructured data.
– A computation model that can operate on distributed data and abstracts away data dispersion.
What is Apache Hadoop
What is Hadoop?
• “framework for running [distributed] applications on large cluster built of commodity hardware” –from Hadoop Wiki
• Originally created by Doug Cutting
– Named the project after his son's toy elephant
• The name “Hadoop” has now evolved to cover a family of products, but at its core, it’s essentially just the MapReduce programming paradigm + a distributed file system
History
Core Hadoop : HDFS
Core Hadoop: MapReduce
Pig
A platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these
programs.
Mahout
A machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are
implemented on top of Apache Hadoop.
Hive
Data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in
distributed storage.
Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational
databases.
Apache Flume
A distributed service for collecting, aggregating, and moving large amounts of log data to HDFS.
Twitter Storm
Storm can be used to process a stream of new data and update
databases in real time.
Funding & IPO
• Cloudera (commercial Hadoop): more than $75 million
• MapR (Cloudera competitor): more than $25 million
• 10gen (maker of MongoDB): $32 million
• DataStax (products based on Apache Cassandra): $11 million
• Splunk: raised about $230 million through its IPO
Big Data Application Domains
• Healthcare
• The public sector
• Retail
• Manufacturing
• Personal-location data
• Finance
Use the Right Tool for the Right Job
Hadoop: when to use?
• Affordable storage/compute
• Structured or not (agility)
• Resilient auto-scalability
Relational databases: when to use?
• Interactive reporting (<1 sec)
• Multistep transactions
• Lots of inserts/updates/deletes
Ship the Function to the Data
[Diagram: Traditional architecture: data lives on SAN/NAS and is shipped to the function running in the RDBMS. Distributed computing: each node stores a slice of the data and runs the function locally.]
Economics of Hadoop Storage
• Typical hardware:
– Two quad-core Nehalems
– 24 GB RAM
– 12 × 1 TB SATA disks (JBOD mode, no need for RAID)
– 1 Gigabit Ethernet card
• Cost per node: $5K
• Effective HDFS space:
– ¼ reserved for temp shuffle space, which leaves 9 TB/node
– 3-way replication leads to 3 TB effective HDFS space/node
– But assuming 7× compression, that becomes ~20 TB/node
• Effective cost per user TB: $250/TB; other solutions cost in the range of $5K to $100K per user TB
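The arithmetic behind these figures can be checked directly; a sketch using the slide's own numbers:

```python
raw_tb_per_node = 12 * 1          # 12 x 1 TB SATA disks
cost_per_node = 5000              # $5K per node

usable_tb = raw_tb_per_node * 0.75    # 1/4 reserved for temp shuffle space
effective_tb = usable_tb / 3          # 3-way replication
with_compression = effective_tb * 7   # assumed 7x compression (~20 TB claim)

cost_per_user_tb = cost_per_node / with_compression
print(usable_tb, effective_tb, with_compression, round(cost_per_user_tb))
# 9.0 3.0 21.0 238  -- close to the ~$250/TB quoted on the slide
```

The 7× compression ratio is the slide's assumption; with a more conservative 5×, the same math gives 15 TB/node and about $333 per user TB.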
Market and Market Segments
Research Data and Predictions
http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
Market for big data tools will rise from $9 billion to $86 billion in 2020
Future of Big Data
• More powerful and expressive tools for analysis
• Streaming data processing (Storm from Twitter and S4 from Yahoo)
• Rise of data marketplaces (Infochimps, Azure Marketplace)
• Development of data science workflows and tools (Chorus, The Guardian, New York Times)
• Increased understanding of analysis and visualization
http://www.evolven.com/blog/big-data-predictions.html
Opportunities
Skills Gap
• Statistics
• Operations Research
• Math
• Programming
• So-called "Data Hacking"
Typical Hadoop Architecture
[Diagram: Hadoop provides storage and batch processing, fed by data collection; it connects to an OLAP data mart serving business intelligence for business users, and to an OLTP data store serving interactive applications for end customers; engineers work with Hadoop directly.]
HDFS Introduction
• Written in Java
• Optimized for large files
– Focus on streaming data (high throughput over low latency)
• Rack-aware
• Only *nix for production environments
• Web consoles for stats
HDFS Architecture
Let us see what MapReduce is
MapReduce basics
• Take a large problem and divide it into sub-problems
• Perform the same function on all sub-problems
• Combine the output from all sub-problems
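The three steps above can be made concrete with a toy word count that simulates the map, shuffle, and reduce phases in plain Python; this is a sketch of the model, not Hadoop's actual API:

```python
from collections import defaultdict

def mapper(line):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: combine all values for one key into the final result."""
    return (key, sum(values))

lines = ["big data big value", "big data"]
pairs = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2, 'value': 1}
```

Each mapper call touches only its own line, which is why the sub-problems stay independent; all cross-line coordination happens in the shuffle.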
MapReduce(M/R) facts
• M/R is excellent for problems where the “sub-problems” are not interdependent
• For example, the output of one “mapper” should not depend on the output or communication with another “mapper”
• The reduce phase does not begin execution until all mappers have finished
• Failed map and reduce tasks get auto-restarted
• Rack/HDFS-aware
What is the MapReduce Model
What is the MapReduce Model
• MapReduce is a computation model that supports parallel processing on distributed data using clusters of computers.
• The MapReduce model expects the input data to be split and distributed to the machines in the cluster so that each split can be processed independently and in parallel.
• There are two stages of processing in MapReduce model to achieve the final result: Map and Reduce. Every machine in the cluster can run independent map and reduce processes.
What is the MapReduce Model
• The Map phase processes the input splits. The output of the Map phase is distributed to reduce processes, which combine the map output into the final expected result.
• The model treats data at every stage as key/value pairs, transforming one set of key/value pairs into a different set of key/value pairs to arrive at the end result.
• A map process transforms input key/value pairs into a set of intermediate key/value pairs.
• The MapReduce framework passes this output to reduce processes, which transform it into the final result, again in the form of key/value pairs.
Design of MapReduce -Daemons
The MapReduce system is managed by two daemons:
• JobTracker and TaskTracker
• JobTracker and TaskTracker function in master/slave fashion
– The JobTracker coordinates the entire job execution
– TaskTrackers run the individual map and reduce tasks
– The JobTracker does the bookkeeping of all tasks run on the cluster
– One map task is created for each input split
– The number of reduce tasks is configurable (mapred.reduce.tasks)
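With a configurable number of reduce tasks, each intermediate key must be routed to exactly one of them; a sketch of the hash-partitioning idea behind the default assignment (not Hadoop's actual Partitioner API):

```python
def partition(key, num_reduce_tasks):
    """Hash partitioning: every occurrence of the same key
    is routed to the same reduce task."""
    return hash(key) % num_reduce_tasks

keys = ["apple", "banana", "apple", "cherry"]
assignments = {k: partition(k, 3) for k in keys}
# The same key always lands on the same reducer, so a reducer
# sees *all* values for its keys; different keys may share a reducer.
```

This is what makes the reduce phase correct: grouping by key is guaranteed by routing, not by any communication between mappers.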
Conceptual Overview
Who Loves it
• Yahoo! runs 20,000 servers running Hadoop
• Its largest Hadoop cluster is 4,000 servers with 16 PB raw storage
• Facebook runs 2,000 Hadoop servers
• 24 PB raw storage and 100 TB of raw logs per day
• eBay and LinkedIn use Hadoop in production
• Sears Retail uses Hadoop
Hadoop and its ecosystem
Hadoop Requirements
• Supported platforms
– GNU/Linux is supported for development and production
• Required software
– Java 1.6.x or later
– ssh installed and sshd running (so machines in the cluster can interact with the master machines)
• Development environment
– Eclipse 3.6 or above
Lab Requirements
• Windows 7 64-bit OS, min 4 GB RAM
• VMware Player 5.0.0
• Linux VM: Ubuntu 12.04 LTS
– User: hadoop, password: hadoop123
• Java 6 installed on the Linux VM
• OpenSSH installed on the Linux VM
• PuTTY, for opening sessions to the Linux VM
• WinSCP, for transferring files between Windows and the Linux VM
• Other Linux machines will do as well
• Eclipse 3.6
Extract VM
• Folder:
Hadoop_MasterClass\Ubuntu_VM
• Copy:
ubuntu-12.04.1-server-amd64_with_hadoop.zip
• Extract to local using 7-zip
Starting VM
Starting VM
• Enter userID/password
• Type ifconfig
– Note down the ip address
– Connect to the VM using Putty
Install and Configure ssh(non-VM users)
• Install ssh: sudo apt-get install ssh
– Check the ssh installation:
which ssh
which sshd
which ssh-keygen
– Generate an ssh key:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
– Copy the public key as an authorized key (equivalent to a slave node):
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
Install and Configure ssh
• Verify ssh by logging into the target (localhost here)
– Command:
ssh localhost
– (this command will log you into the machine localhost)
Accessing VM Putty and WinSCP
• Get the IP address of the VM by running ifconfig in the Linux VM
• Use Putty to telnet to VM
• Use WinSCP to FTP files to VM
Accessing VM using Putty & WinSCP
Putty WinSCP
Lab-VM Directory Structure(non-VM users)
• User home directory for user "hadoop" (created by default by the OS) – /home/hadoop
• Create working directory for the lab session – /home/hadoop/lab
• Create directory for storing all downloads(installables) – /home/hadoop/lab/downloads
• Create directory for storing data for analysis – /home/hadoop/lab/data
• Create directory for installing tools – /home/hadoop/lab/install
Install and Configure Java(non-VM users)
• Install OpenJDK
– Command: sudo apt-get install openjdk-6-jdk
– Check the installation: java -version
• Configure the Java home in the environment
– Add a line to .bash_profile to set the Java home:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
– Hadoop will use it at runtime
Install Hadoop(non-VM users)
• Download hadoop jar with wget
– http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
• Untar
– cd ~/lab/install
– tar xzf ~/lab/downloads/hadoop-1.0.3.tar.gz
– Check extracted directory “hadoop-1.0.3”
• Configure the environment in .bash_profile (or .bashrc)
– Add the two lines below, then execute the profile
• export HADOOP_INSTALL=~/lab/install/hadoop-1.0.3
• export PATH=$PATH:$HADOOP_INSTALL/bin
• . .bash_profile (execute the profile)
• Check the Hadoop installation
– hadoop version
Setting up Hadoop
• Open $HADOOP_INSTALL/conf/hadoop-env.sh
• Set the JAVA_HOME environment variable to your JDK directory:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
Component of core Hadoop
Component of core Hadoop
• At a high level, Hadoop architecture components can be classified into two categories:
– Distributed file management system (HDFS), which has central and distributed subcomponents:
• NameNode: centrally monitors and controls the whole file system
• DataNode: takes care of the local file segments and constantly communicates with the NameNode
• Secondary NameNode: do not confuse; this is not a NameNode backup. It just backs up the file system status from the NameNode periodically
– Distributed computing system (MapReduce framework), which also has central and distributed subcomponents:
• JobTracker: centrally monitors the submitted job and controls all processes running on the nodes of the cluster; it communicates with the NameNode for file system access
• TaskTracker: takes care of local job execution on the local file segments; it talks to the DataNode for file information, and constantly communicates with the JobTracker daemon to report task progress
• When the Hadoop system runs in distributed mode, each daemon runs on its respective computer.
Hadoop Operational Modes
Hadoop can be run in one of three modes:
• Standalone (local) mode
– No daemons launched; everything runs in a single JVM
– Suitable for development
• Pseudo-distributed mode
– All daemons are launched on a single machine, simulating a cluster environment
– Suitable for testing and debugging
• Fully distributed mode
– The Hadoop daemons run in a cluster environment, each on the machine assigned to it
– Suitable for integration testing/production
Hadoop Configuration Files
| Filename | Format | Description |
|---|---|---|
| hadoop-env.sh | Bash script | Environment variables used in the scripts that run Hadoop |
| core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce |
| hdfs-site.xml | Hadoop configuration XML | Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes |
| mapred-site.xml | Hadoop configuration XML | Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers |
| masters | Plain text | A list of machines (one per line) that each run a secondary namenode |
| slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker |
| hadoop-metrics.properties | Java properties | Properties controlling how metrics are published in Hadoop |
| log4j.properties | Java properties | Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process |
Key Configuration Properties
| Property name | Conf file | Standalone | Pseudo-distributed | Fully distributed |
|---|---|---|---|---|
| fs.default.name | core-site.xml | file:/// (default) | hdfs://localhost/ | hdfs://namenode/ |
| dfs.replication | hdfs-site.xml | N/A | 1 | 3 (default) |
| mapred.job.tracker | mapred-site.xml | local (default) | localhost:8021 | jobtracker:8021 |
HDFS
Design of HDFS
• HDFS is Hadoop's distributed file system
• Designed for storing very large files (petabytes)
• A single file can be stored across several disks
• Not suitable for low-latency data access
• Designed to be highly fault tolerant, hence it can run on commodity hardware
HDFS Concepts
• Like any file system, HDFS stores files by breaking them into small units called blocks.
• The default HDFS block size is 64 MB.
• The large block size helps in maintaining high throughput.
• Each block is replicated across multiple machines in the cluster for redundancy.
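Given these defaults, the block count and raw storage consumed by a file follow directly; a sketch using the classic Hadoop 1.x defaults (64 MB blocks, 3-way replication):

```python
import math

BLOCK_SIZE_MB = 64   # default HDFS block size (Hadoop 1.x)
REPLICATION = 3      # default replication factor

def hdfs_footprint(file_size_mb):
    """Blocks needed and raw cluster storage consumed by one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

blocks, raw = hdfs_footprint(200)   # a 200 MB file
print(blocks, raw)  # 4 blocks (the last only partially filled), 600 MB raw
```

Unlike a regular file system, a block smaller than 64 MB does not occupy the full block's worth of disk, so the raw cost scales with file size, not block count.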
Design of HDFS -Daemons
• The HDFS file system is managed by two daemons
• NameNode and DataNode
• NameNode and DataNode function in master/slave fashion
– The NameNode manages the file system namespace
– It maintains the file system and the metadata of all files and directories, in two structures:
• the namespace image
• the edit log
Design of HDFS –Daemons (cont.)
• DataNodes store and retrieve the blocks for files when told to by the NameNode.
• The NameNode maintains the information on which DataNodes hold the blocks of a given file.
• DataNodes report to the NameNode periodically with the list of blocks they are storing.
• With the NameNode down, HDFS is inaccessible.
• Secondary NameNode
– Not a backup for the NameNode
– Just helps in merging the namespace image with the edit log, to keep the edit log from growing too large
HDFS Architecture
Fault Tolerance
Fault Tolerance
Live Horizontal Scaling + Rebalancing
Live Horizontal Scaling + Rebalancing
Configuring conf/*-site.xml files
• Set the hadoop.tmp.dir parameter to a directory of your choice.
• We will use /app/hadoop/tmp
• sudo mkdir -p /app/hadoop/tmp
• sudo chmod 777 /app/hadoop/tmp
Configuring HDFS: core-site.xml(Pseudo Distributed Mode)
<?xml version="1.0"?>
<!--core-site.xml-->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
Note: Add the "fs.default.name" property under the configuration tag to specify the NameNode location:
"localhost" for pseudo-distributed mode. The NameNode runs at port 8020 by default if no port is specified.
![Page 88: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/88.jpg)
Configuring mapred-site.xml(Pseudo Distributed Mode)
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Note: Add the "mapred.job.tracker" property under the configuration tag to specify the JobTracker location: "localhost:8021" for pseudo-distributed mode.
Lastly, set JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
![Page 89: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/89.jpg)
Configuring HDFS: hdfs-site.xml(Pseudo Distributed Mode)
<?xml version="1.0"?>
<!--hdfs-site.xml-->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
![Page 90: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/90.jpg)
Starting HDFS
• Format NameNode
hadoop namenode -format
– creates an empty file system with storage directories and persistent data structures
– DataNodes are not involved
• Start the dfs service
– start-dfs.sh
– Verify daemons: jps. If you get the namespace exception, copy the namespaceID of the NameNode and paste it into the /app/hadoop/tmp/dfs/data/current/VERSION file.
– Stop: stop-dfs.sh
• List/check HDFS
hadoop fsck / -files -blocks
hadoop fs -ls
hadoop fs -mkdir testdir
hadoop fs -ls
hadoop fsck / -files -blocks
![Page 91: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/91.jpg)
Verify HDFS
• Stop dfs services
stop-dfs.sh
• Verify daemons: jps
– No Java processes should be running
![Page 92: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/92.jpg)
Configuring HDFS: hdfs-site.xml (Pseudo Distributed Mode)
Property Name | Description | Default Value
dfs.name.dir | Directories where the NameNode stores its persistent data (comma-separated directory names). A copy of the metadata is stored in each of the listed directories. | ${hadoop.tmp.dir}/dfs/name
dfs.data.dir | Directories where a DataNode stores blocks. Each block is stored in only one of these directories. | ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir | Directories where the Secondary NameNode stores checkpoints. A copy of the checkpoint is stored in each of the listed directories. | ${hadoop.tmp.dir}/dfs/namesecondary
![Page 93: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/93.jpg)
Basic HDFS Commands
• Creating a directory
hadoop fs -mkdir <dirname>
• Removing a directory (recursively)
hadoop fs -rmr <dirname>
• Copying files to HDFS from local file system
hadoop fs -put <local dir>/<filename> <hdfs dir Name>/
• Copying files from HDFS to local file system
hadoop fs -get <hdfs dir Name>/<hdfs file Name> <local dir>/
• List files and directories
hadoop fs -ls <dir name>
• List the blocks that make up each files in HDFS
hadoop fsck / -files -blocks
![Page 94: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/94.jpg)
HDFS Web UI
• Hadoop provides a web UI for viewing HDFS
– Available at http://namenode-host-ip:50070/
– Browse file system
– Log files
![Page 95: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/95.jpg)
MapReduce
• A distributed parallel processing engine of Hadoop
• Processes the data in sequential parallel steps called Map and Reduce
• Best run with a DFS supported by Hadoop, to exploit its parallel processing abilities
• Has the ability to run on a cluster of computers; each computer is called a node
• Input and output data at every stage is handled in terms of key/value pairs
• Keys and values can be chosen by the programmer
• Mapper output is always sorted by key
• Mapper outputs with the same key are sent to the same reducer
• The number of mappers and reducers per node can be configured
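The key/value flow above can be sketched in plain Python, with no Hadoop involved (a minimal in-memory model, not the real framework): map each record to (key, value) pairs, group pairs by key as the shuffle does, then call the reducer once per key.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map -> shuffle (group by key) -> reduce."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)   # shuffle: same key -> same group
    # One reducer call per key; output sorted by key, as Hadoop sorts mapper output
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}

# Word count: the mapper emits (word, 1); the reducer sums the ones
lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda word, ones: sum(ones))
print(counts["the"])  # 3
```

This mirrors the word count example developed in the Java slides below, minus splitting, distribution, and fault tolerance.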
![Page 96: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/96.jpg)
Start the Map reduce daemons
• start-mapred.sh
![Page 97: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/97.jpg)
The Overall MapReduce Word Count Process:
Input → Splitting → Mapping → Shuffling → Reducing → Final Result
![Page 98: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/98.jpg)
Design of MapReduce -Daemons
The MapReduce system is managed by two daemons:
• JobTracker & TaskTracker
• JobTracker and TaskTracker function in master/slave fashion
– The JobTracker coordinates the entire job execution
– The TaskTracker runs the individual map and reduce tasks
– The JobTracker does the bookkeeping of all the tasks run on the cluster
– One map task is created for each input split
– The number of reduce tasks is configurable (mapred.reduce.tasks)
![Page 99: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/99.jpg)
Conceptual Overview
![Page 100: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/100.jpg)
Job Submission – Map Phase
![Page 101: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/101.jpg)
Job Submission – Reduce Phase
![Page 102: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/102.jpg)
Anatomy of a MapReduce program
![Page 103: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/103.jpg)
Hadoop - Reading Files
[Diagram: reading files. The Hadoop client asks the NameNode for a file's metadata and receives the DataNodes, block ids, etc.; it then reads the blocks directly from the DataNodes (DN, co-located with TaskTrackers) spread across Rack1..RackN. DataNodes send periodic heartbeats/block reports to the NameNode, and the Secondary NameNode (SNameNode) checkpoints the fsimage/edit log.]
![Page 104: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/104.jpg)
Hadoop - Writing Files
[Diagram: writing files. The client requests a write from the NameNode (recorded in the fsimage/edit log) and receives the target DataNodes; it writes the blocks directly to them, and replicas propagate DataNode-to-DataNode via replication pipelining. DataNodes send block reports to the NameNode; the Secondary NameNode checkpoints.]
![Page 105: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/105.jpg)
Hadoop - Running Jobs
[Diagram: running jobs. The client submits a job to the JobTracker, which deploys it to TaskTrackers co-located with DataNodes across the racks. Map tasks run on the nodes, their output is shuffled to the reduce tasks, and the reducers write the final output (part 0, ...).]
![Page 106: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/106.jpg)
mapred-site. xml - Pseudo Distributed Mode
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Note: Add the "mapred.job.tracker" property under the configuration tag to specify the JobTracker location: "localhost:8021" for pseudo-distributed mode.
Lastly, set JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
![Page 107: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/107.jpg)
Starting Hadoop -MapReduce Daemons
• Start MapReduce Services
start-mapred.sh
– Verify Daemons:jps
![Page 108: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/108.jpg)
MapReduce Programming
• Having seen the architecture of MapReduce, to perform a job in Hadoop a programmer needs to create:
• A MAP function
• A REDUCE function
• A Driver to communicate with the framework, configure and launch the job
![Page 109: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/109.jpg)
Map Function
• The Map function is represented by the Mapper class, which declares an abstract method map()
• The Mapper class is a generic type with four type parameters for the input and output key/value pairs
• Mapper<K1, V1, K2, V2>
– K1, V1 are the types of the input key/value pair
– K2, V2 are the types of the output key/value pair
• Hadoop provides its own types that are optimized for network serialization:
– Text corresponds to Java String
– LongWritable corresponds to Java Long
– IntWritable corresponds to Java Integer
• The map() method must be implemented to achieve the input key/value transformation
• The map method is called by the MapReduce framework, which passes in the key/value pairs read from the input file
• The map method is provided with a Context object to which the transformed key/value pairs can be written
![Page 110: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/110.jpg)
Mapper — Word Count
public static class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.;:?[]");

    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken().toLowerCase());
      context.write(word, one);
    }
  }
}
![Page 111: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/111.jpg)
Reduce Function
• The Reduce function is represented by the Reducer class, which declares an abstract method reduce()
• The Reducer class is a generic type with four type parameters for the input and output key/value pairs
• Reducer<K2, V2, K3, V3>
– K2, V2 are the types of the input key/value pair; these must match the output types of the Mapper
– K3, V3 are the types of the output key/value pair
• The reduce() method must be implemented to achieve the desired transformation of the input key/value pairs
• The reduce method is called by the MapReduce framework, passing in the key/value pairs output by the map phase
• The MapReduce framework guarantees that records with the same key from all the map tasks will reach a single reduce task
• Similar to map, the reduce method is provided with a Context object to which the transformed key/value pairs can be written
![Page 112: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/112.jpg)
Reducer — Word Count
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
![Page 113: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/113.jpg)
MapReduce Job Driver – Word Count

public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input Path> <output Path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Word Count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
![Page 114: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/114.jpg)
MapReduce Job — Word Count • Copy wordcount.jar from the Hard drive folder to the
Ubuntu VM
hadoop jar <path_to_jar>/wordcount.jar com.hadoop.WordCount <hdfs_input_dir>/pg5000.txt <hdfs_output_dir>
• The <hdfs_output_dir> must not exist
• To view the output directly:
hadoop fs -cat ../../part-r-00000
• To copy the result to local:
hadoop fs -get <part-r-*****>
![Page 115: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/115.jpg)
The MapReduce Web UI
• Hadoop provides a web UI for viewing job information
• Available at http://jobtracker-host:50030/
• follow job's progress while it is running
• find job statistics
• View job logs
• Task Details
![Page 116: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/116.jpg)
Combiner
• The combiner function helps to aggregate the map output before passing it on to the reduce function
• Reduces the intermediate data to be written to disk
• Reduces the data to be transferred over the network
• The combiner for a job is specified as
job.setCombinerClass(<combinerclassname>.class);
• A combiner is represented by the same interface as a Reducer
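A plain-Python sketch (illustrative only, not Hadoop code) shows why a combiner helps: locally reducing one mapper's output shrinks the number of intermediate pairs before anything is written to disk or shipped to reducers.

```python
from collections import Counter

def mapper_output(lines):
    # Raw mapper output for word count: one ("word", 1) pair per token
    return [(w, 1) for line in lines for w in line.split()]

def combine(pairs):
    # Combiner = local reduce on a single mapper's output: collapses
    # ("the", 1), ("the", 1), ... into a single ("the", n) pair.
    # (All values are 1 here, so counting keys equals summing values.)
    return list(Counter(w for w, _ in pairs).items())

raw = mapper_output(["the cat sat", "the cat ran", "the dog"])
combined = combine(raw)
print(len(raw), len(combined))  # 8 5 - eight pairs shrink to five
```

The reducer then receives the pre-aggregated pairs and produces the same final counts it would have without the combiner, which is why only commutative and associative reduce functions can be reused as combiners.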
![Page 117: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/117.jpg)
Combiner—local reduce
![Page 118: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/118.jpg)
Combiner
![Page 119: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/119.jpg)
Word Count – With Combiner

public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input Path> <output Path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Word Count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note: When the reduce function is commutative and associative, the reducer itself can work as the combiner.
Otherwise a separate combiner needs to be created.
![Page 120: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/120.jpg)
• hadoop jar lineindexer.jar com.hadoop.LineIndexer <hdfs_input_dir> <hdfs_output_dir>
• hadoop fs -cat ../../part-r-*****
• hadoop fs -get <part-r-*****>
![Page 121: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/121.jpg)
Partitioning
• Map tasks partition their output keys by the number of reducers
• There can be many keys in a partition
• All records for a given key will be in a single partition
• A Partitioner class controls partitioning based on the key
• Hadoop uses hash partitioning by default (HashPartitioner)
• The default behavior can be changed by implementing the getPartition() method of the (abstract) Partitioner class:

public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

• A custom partitioner for a job can be set as:
job.setPartitionerClass(<customPartitionerClass>.class);
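Hash partitioning can be sketched in one line of Python (a toy analogue of HashPartitioner, not its exact arithmetic; Hadoop uses the key's hashCode() modulo the number of reduce tasks):

```python
def hash_partition(key, num_reducers):
    """Hash-style partitioning: every record with the same key lands in
    the same partition, hence is processed by the same reduce task.
    (Python's str hash is per-process salted, but is stable within one run.)"""
    return hash(key) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
parts = [hash_partition(k, 3) for k in keys]
print(parts[0] == parts[2])  # True: both "apple" records share one partition
```

Note that many distinct keys can map to the same partition; the only guarantee is that one key never straddles two partitions.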
![Page 122: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/122.jpg)
Partitioner— redirecting output from Mapper
![Page 123: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/123.jpg)
Partitioner Example
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String ch = key.toString().substring(0, 1);
    /*
    if (ch.matches("[abcdefghijklm]")) {
      return 0;
    } else if (ch.matches("[nopqrstuvwxyz]")) {
      return 1;
    }
    return 2;
    */
    // return (ch.charAt(0) % numPartitions); // round robin based on ASCII value
    return 0; // default behavior
  }
}
![Page 124: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/124.jpg)
Reducer progress during Mapper
• MapReduce job shows something like Map(50%) Reduce(10%)
• Reducers start copying intermediate key/value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so reduce progress starts showing as soon as any intermediate key/value pair from a mapper is available to be transferred to a reducer.
• The programmer defined reduce method is called only after all the mappers have finished.
![Page 125: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/125.jpg)
Hadoop Streaming
• Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
• Using the streaming system you can develop working Hadoop jobs with extremely limited knowledge of Java
• Hadoop opens a process for your executable and writes to its stdin and reads from its stdout
![Page 126: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/126.jpg)
Hadoop Streaming
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
• Streaming is naturally suited for text processing.
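The contract above is easy to show in Python: a streaming mapper and reducer are just programs that read lines and print tab-separated key/value lines. The sketch below simulates the `mapper | sort | reducer` pipeline in-process (in a real job, Hadoop does the sort and the process plumbing):

```python
import io

def mapper(stdin):
    # Streaming mapper: for each input line, emit "word<TAB>1" per token
    for line in stdin:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stdin):
    # Streaming reducer: input arrives sorted by key, so all counts for
    # one word are contiguous; sum them and emit "word<TAB>count"
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate: mapper | sort | reducer
mapped = sorted(mapper(io.StringIO("b a b\na b\n")))
result = dict(line.split("\t") for line in reducer(iter(mapped)))
print(result)  # {'a': '2', 'b': '3'}
```

The same two functions, wrapped in `sys.stdin` loops, would run unchanged under hadoop-streaming, which is the structure the multifetch.py/reducer.py hands-on below follows.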
![Page 127: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/127.jpg)
Hands on
• Folder :
– examples/Hadoop_streaming_python
– Files "url1" & "url2" are the input
– multifetch.py is the mapper (open it)
– reducer.py is the reducer (open this as well)
![Page 128: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/128.jpg)
Hands on
• Run :
hadoop fs -mkdir urls
hadoop fs -put url1 urls/
hadoop fs -put url2 urls/
hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar -mapper <path>/multifetch.py -reducer <path>/reducer.py -input urls/* -output titles
![Page 129: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/129.jpg)
Decomposing problems into M/R jobs
• Small map-reduce jobs are usually better
– Easier to implement, test & maintain
– Easier to scale & reuse
• Problem : Find the word/letter that has the maximum occurrences in a set of documents
![Page 130: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/130.jpg)
Decomposing
• Count of each word/letter M/R job (Job 1)
• Find max word/letter count M/R job (Job 2)
Choices can depend on complexity of jobs
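The two-job decomposition above can be sketched in Python (an in-memory stand-in for two chained MapReduce jobs, not real Hadoop code): Job 1 produces the word counts, and Job 2 consumes Job 1's output to find the maximum.

```python
from collections import Counter

def job1_count(lines):
    # Job 1: word count (map emits (word, 1); reduce sums per word)
    return Counter(w for line in lines for w in line.split())

def job2_max(counts):
    # Job 2: scan Job 1's output for the word with the maximum count
    return max(counts.items(), key=lambda kv: kv[1])

counts = job1_count(["a b a", "c a b"])
print(job2_max(counts))  # ('a', 3)
```

Keeping the jobs separate means Job 1's output (the full count table) stays reusable by other jobs, which is the point of preferring small map-reduce jobs.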
![Page 131: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/131.jpg)
Job Chaining
• Multiple jobs can be run in a linear or complex dependent fashion
– Simple dependency / linear chain
– Directed Acyclic Graph (DAG)
• The simple way is to call the job drivers one after the other with their respective configurations:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
• If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed.
![Page 132: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/132.jpg)
Job Chaining
For complex dependencies you can use the JobControl and ControlledJob classes:

ControlledJob cjob1 = new ControlledJob(conf1);
ControlledJob cjob2 = new ControlledJob(conf2);
cjob2.addDependingJob(cjob1);
JobControl jc = new JobControl("Chained Job");
jc.addJob(cjob1);
jc.addJob(cjob2);
jc.run();
![Page 133: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/133.jpg)
Apache Oozie
• Workflow scheduler for Hadoop
• Manages Hadoop jobs
• Integrated with many Hadoop apps, e.g. Pig
• Scalable
• Schedules jobs
• A workflow is a collection of actions, e.g.
– map/reduce, pig
• A workflow is
– Arranged as a DAG (directed acyclic graph)
– Stored as hPDL (an XML process definition language)
![Page 134: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/134.jpg)
Oozie
• Engine to build complex DAG workflows
• Runs in its own daemon
• Describe workflows in set of XML & configuration files
• Has coordinator engine that schedules workflows based on time & incoming data
• Provides ability to re-run failed portions of the workflow
![Page 135: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/135.jpg)
Need for High-Level Languages
• Hadoop is great for large-data processing!
– But writing Java programs for everything is verbose and slow
– Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing languages
– Hive: HQL is like SQL
– Pig: Pig Latin is a bit like Perl
![Page 136: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/136.jpg)
Hive and Pig
• Hive: data warehousing application in Hadoop
– Query language is HQL, a variant of SQL
– Tables stored on HDFS as flat files
– Developed by Facebook, now open source
• Pig: large-scale data processing system
– Scripts are written in Pig Latin, a dataflow language
– Developed by Yahoo!, now open source
– Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
– Provide a higher-level language to facilitate large-data processing
– The higher-level language "compiles down" to Hadoop jobs
![Page 137: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/137.jpg)
Hive: Background
• Started at Facebook
• Data was collected by nightly cron jobs into Oracle DB
• “ETL” via hand-coded python
• Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
Source: cc-licensed slide by Cloudera
![Page 138: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/138.jpg)
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Source: cc-licensed slide by Cloudera
![Page 139: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/139.jpg)
Data Model
• Tables
– Typed columns (int, float, string, boolean)
– Also, list: map (for JSON-like data)
• Partitions
– For example, range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
Source: cc-licensed slide by Cloudera
![Page 140: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/140.jpg)
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many other relational databases
Source: cc-licensed slide by Cloudera
![Page 141: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/141.jpg)
Physical Layout
• Warehouse directory in HDFS
– E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
Source: cc-licensed slide by Cloudera
![Page 142: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/142.jpg)
Hive: Example
• Hive looks similar to an SQL database
• Relational join on two tables:
– Table of word counts from the Shakespeare collection
– Table of word counts from the Bible

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

the  25848  62394
I    23031   8854
and  19671  38985
to   18038  13526
of   16700  34654
a    14170   8057
you  12702   2720
my   11297   4135
in   10797  12445
is    8882   6884
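Under the hood, Hive compiles this JOIN into a reduce-side join. A Python sketch of the idea (illustrative, using a few counts from the result table above as sample data): both tables' mappers emit the join key with a tag identifying the source table, and the reducer pairs the tagged rows per key.

```python
# Tagged reduce-side join, roughly what Hive compiles the JOIN into.
shakespeare = {"the": 25848, "I": 23031, "and": 19671}   # sample rows
bible = {"the": 62394, "and": 38985, "of": 34654}        # sample rows

def join_rows(left, right):
    grouped = {}
    for word, freq in left.items():
        grouped.setdefault(word, {})["s"] = freq   # tag 0: shakespeare
    for word, freq in right.items():
        grouped.setdefault(word, {})["k"] = freq   # tag 1: bible
    # Inner join: keep only keys that appeared under both tags,
    # then ORDER BY s.freq DESC
    return sorted(((w, t["s"], t["k"]) for w, t in grouped.items()
                   if "s" in t and "k" in t),
                  key=lambda row: -row[1])

print(join_rows(shakespeare, bible))
# [('the', 25848, 62394), ('and', 19671, 38985)]
```

The second MapReduce stage in Hive's plan handles the ORDER BY ... LIMIT, which the final `sorted` call stands in for here.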
![Page 143: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/143.jpg)
Hive: Behind the Scenes

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s) word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k) freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))
(one or more of MapReduce jobs)
(Abstract Syntax Tree)
![Page 144: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/144.jpg)
Hive: Behind the Scenes STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: s TableScan alias: s Filter Operator predicate: expr: (freq >= 1) type: boolean Reduce Output Operator key expressions: expr: word type: string sort order: + Map-reduce partition columns: expr: word type: string tag: 0 value expressions: expr: freq type: int expr: word type: string k TableScan alias: k Filter Operator predicate: expr: (freq >= 1) type: boolean Reduce Output Operator key expressions: expr: word type: string sort order: + Map-reduce partition columns: expr: word type: string tag: 1 value expressions: expr: freq type: int
Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: ((_col0 >= 1) and (_col2 >= 1)) type: boolean Select Operator expressions: expr: _col1 type: string expr: _col0 type: int expr: _col2 type: int outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: hdfs://localhost:8022/tmp/hive-training/364214370/10002 Reduce Output Operator key expressions: expr: _col1 type: int sort order: - tag: -1 value expressions: expr: _col0 type: string expr: _col1 type: int expr: _col2 type: int Reduce Operator Tree: Extract Limit File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: 10
![Page 145: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/145.jpg)
Example Data Analysis Task
Visits:
user | url | time
Amy | www.cnn.com | 8:00
Amy | www.crap.com | 8:05
Amy | www.myblog.com | 10:00
Amy | www.flickr.com | 10:05
Fred | cnn.com/index.htm | 12:00

Pages:
url | pagerank
www.cnn.com | 0.9
www.flickr.com | 0.9
www.myblog.com | 0.7
www.crap.com | 0.2

Find users who tend to visit "good" pages.
![Page 146: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/146.jpg)
Conceptual Workflow
![Page 147: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/147.jpg)
System-Level Dataflow
[Diagram: load Visits and load Pages; canonicalize the visit URLs, join the two datasets by url, group by user, compute the average pagerank per user, then filter, producing the answer.]
Pig Slides adapted from Olston et al.
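The dataflow above can be sketched in plain Python on the sample tables (an in-memory illustration of the pipeline, not Pig or MapReduce code; the canonicalize rule is a made-up toy, and the 0.5 "good" threshold is an assumption):

```python
# load -> canonicalize -> join by url -> group by user ->
# compute average pagerank -> filter
visits = [("Amy", "www.cnn.com"), ("Amy", "www.crap.com"),
          ("Amy", "www.myblog.com"), ("Amy", "www.flickr.com"),
          ("Fred", "cnn.com/index.htm")]
pagerank = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
            "www.myblog.com": 0.7, "www.crap.com": 0.2}

def canonicalize(url):
    # Toy canonicalization: strip the path, ensure a "www." prefix
    host = url.split("/")[0]
    return host if host.startswith("www.") else "www." + host

by_user = {}
for user, url in visits:              # join visits with pageranks by url
    rank = pagerank.get(canonicalize(url))
    if rank is not None:
        by_user.setdefault(user, []).append(rank)

good_users = sorted(u for u, ranks in by_user.items()
                    if sum(ranks) / len(ranks) > 0.5)  # filter by avg pagerank
print(good_users)  # ['Amy', 'Fred']
```

In Pig Latin each step (LOAD, FOREACH, JOIN, GROUP, FILTER) is one statement, and the system compiles the whole chain down to the MapReduce jobs shown on the next slide.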
![Page 148: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/148.jpg)
MapReduce Code

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
            new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
            new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
            new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100,
            new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Pig Slides adapted from Olston et al.
![Page 149: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/149.jpg)
Pig Latin Script
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > '0.5';
store GoodUsers into '/data/good_users';
![Page 150: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/150.jpg)
HIVE • A data-warehousing framework built on top of Hadoop
• Started at Facebook in 2006
• Target users are data analysts comfortable with SQL
• Lets you query the data using a SQL-like language called HiveQL
• Queries are compiled into MR jobs that are executed on Hadoop
• Meant for structured data
![Page 151: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/151.jpg)
Hive Architecture
![Page 152: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/152.jpg)
Hive Architecture cont. • Can interact with Hive using:
• CLI (Command Line Interface)
• JDBC
• Web GUI
• Metastore – Stores the system catalog and metadata about tables, columns, partitions etc.
• Driver – Manages the lifecycle of a HiveQL statement as it moves through Hive.
• Query Compiler – Compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Execution Engine – Executes the tasks produced by the compiler interacting with the underlying Hadoop instance.
• HiveServer – Provides a thrift interface and a JDBC/ODBC server.
![Page 153: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/153.jpg)
Hive Architecture cont.
• Physical Layout • Warehouse directory in HDFS
• e.g., /user/hive/warehouse
• Table row data stored in subdirectories of warehouse
• Partitions form subdirectories of table directories
• Actual data stored in flat files
• Control character-delimited text, or SequenceFiles
![Page 154: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/154.jpg)
Hive Vs RDBMS • Latency for Hive queries is generally high (minutes), even
when the data sets involved are very small.
• On an RDBMS, analyses proceed much more iteratively, with response times between iterations of less than a few minutes.
• Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.
• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.
It is best used for batch jobs over large sets of immutable data (like web logs).
![Page 155: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/155.jpg)
Supported Data Types • Integers
• BIGINT (8 bytes), INT (4 bytes), SMALLINT (2 bytes), TINYINT (1 byte)
• All integer types are signed
• Floating-point numbers
• FLOAT (single precision), DOUBLE (double precision)
• STRING: sequence of characters
• BOOLEAN: True/False
• Hive also natively supports the following complex types:
• Associative arrays – map<key-type, value-type>
• Lists – list<element-type>
• Structs – struct<field-name: field-type, ...>
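When these types are stored as delimited text, Hive's default SerDe separates fields with Ctrl-A (\u0001), collection items with \u0002, and map keys from values with \u0003. A hedged Java sketch of parsing one such row (the row contents here are made up for illustration):

```java
import java.util.*;

public class HiveRowParse {
    // Parse a map<string,string> field serialized with Hive's default
    // delimiters: \u0002 between entries, \u0003 between key and value.
    public static Map<String, String> parseMapField(String field) {
        Map<String, String> m = new LinkedHashMap<>();
        for (String entry : field.split("\u0002")) {
            String[] kv = entry.split("\u0003", 2);
            m.put(kv[0], kv[1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Row with two columns: an int id, then a map column.
        // \u0001 (Ctrl-A) separates the top-level fields.
        String row = "7" + "\u0001" + "a\u0003pomegranate\u0002b\u0003apple";
        String[] fields = row.split("\u0001");
        System.out.println(fields[0]);                 // 7
        System.out.println(parseMapField(fields[1]));  // {a=pomegranate, b=apple}
    }
}
```

In practice the SerDe does this for you; the sketch just shows why "control char-delimited text" (mentioned under Physical Layout) can encode nested types in a flat file.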
![Page 156: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/156.jpg)
Hive : Install & Configure • Download a Hive release compatible with your Hadoop
installation from:
• http://hive.apache.org/releases.html
• Untar into a directory. This is Hive's home directory
• tar xvzf hive.x.y.z.tar.gz
• Configure
• Environment variables – add in .bash_profile
• export HIVE_INSTALL=/<parent_dir_path>/hive-x.y.z
• export PATH=$PATH:$HIVE_INSTALL/bin
• Verify Installation
• Type: hive -help (displays command usage)
• Type: hive (enters the Hive shell)
hive>
![Page 157: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/157.jpg)
Hive : Install & Configure cont. • Start Hadoop daemons (Hadoop needs to be running)
• Configure Hive to use Hadoop
• Create hive-site.xml in $HIVE_INSTALL/conf directory
• Specify the filesystem and jobtracker using the properties fs.default.name & mapred.job.tracker
• If not set, these default to the local filesystem and the local(in-process) job-runner
• Create the following directories under HDFS
• /tmp (execute: hadoop fs -mkdir /tmp)
• /user/hive/warehouse (execute: hadoop fs -mkdir /user/hive/warehouse)
• chmod g+w for both (execute: hadoop fs -chmod g+w <dir_path>)
![Page 158: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/158.jpg)
Hive : Install & Configure cont. • Data Store
• Hive stores data under /user/hive/warehouse by default
• Metastore
• Hive by default ships with Derby, a lightweight embedded SQL database, to store the metastore metadata.
• But it can be configured to use other databases, such as MySQL, as well.
• Logging
• Hive uses Log4j
• You can find Hive’s error log on the local file system at /tmp/$USER/hive.log
![Page 159: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/159.jpg)
Hive Data Models • Databases
• Tables
• Partitions
• Each Table can have one or more partition keys, which determine how the data is stored.
• Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
• For example, a date_partition of type STRING and a country_partition of type STRING.
• Each unique value of the partition keys defines a partition of the Table. For example, all "India" data from "2013-05-21" is a partition of the page_views table.
• Therefore, if you run an analysis on only the "India" data for 2013-05-21, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
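Concretely, each unique combination of partition-key values maps to a subdirectory of the table's warehouse directory, which is what lets Hive skip irrelevant data. A minimal sketch of that path layout (the warehouse and table names follow the conventions above; the helper itself is hypothetical):

```java
public class PartitionPath {
    // Build the HDFS directory a partition's rows land in:
    // <warehouse>/<table>/<key1>=<val1>/<key2>=<val2>/...
    static String partitionDir(String warehouse, String table,
                               String[][] partitionKeyValues) {
        StringBuilder sb = new StringBuilder(warehouse).append('/').append(table);
        for (String[] kv : partitionKeyValues) {
            sb.append('/').append(kv[0]).append('=').append(kv[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The "India"/"2013-05-21" partition of page_views from the example above
        System.out.println(partitionDir("/user/hive/warehouse", "page_views",
            new String[][]{{"dt", "2013-05-21"}, {"country", "India"}}));
        // /user/hive/warehouse/page_views/dt=2013-05-21/country=India
    }
}
```

A query with `WHERE dt = '2013-05-21' AND country = 'India'` only needs to read files under that one subdirectory.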
![Page 160: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/160.jpg)
Partition example
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
• When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
![Page 161: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/161.jpg)
Hive Data Model cont.
• Buckets
• Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
• For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_views table. These can be used to efficiently sample the data.
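The bucket for a row is chosen by hashing the bucketing column modulo the number of buckets. Hive's actual hash function differs in detail; this is only a sketch of the idea, with a made-up userid:

```java
public class BucketSketch {
    // Assign a row to one of n buckets by hashing the bucketing column,
    // mirroring how Hive spreads a partition's rows across bucket files.
    static int bucketFor(String userid, int numBuckets) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(userid.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        int n = 32;
        int b = bucketFor("user_42", n);
        System.out.println(b >= 0 && b < n); // true
    }
}
```

Because every row with the same userid lands in the same bucket file, reading one bucket gives a consistent sample of users, which is what makes bucketed sampling efficient.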
![Page 162: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/162.jpg)
A Practical Session Starting the Hive CLI
• Start a terminal and run :
• $hive
• Will take you to the hive shell/prompt
hive>
• Set a Hive or Hadoop conf property:
• hive> set propkey=propvalue;
• List all properties and their values:
• hive> set -v;
![Page 163: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/163.jpg)
A Practical Session
Hive CLI Commands
• List tables:
– hive> show tables;
• Describe a table:
– hive> describe <tablename>;
• More information:
– hive> describe extended <tablename>;
![Page 164: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/164.jpg)
A Practical Session Hive CLI Commands • Create tables:
– hive> CREATE TABLE cite (citing INT, cited INT)
     > ROW FORMAT DELIMITED
     > FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
• The 2nd and 3rd lines tell Hive how the data is stored (as a text file) and how it should be parsed (fields are separated by commas).
• Loading data into tables • Let's load the patent data into the cite table:
hive> LOAD DATA LOCAL INPATH '<path_to_file>/cite75_99.txt'
    > OVERWRITE INTO TABLE cite;
• Browse the data: hive> SELECT * FROM cite LIMIT 10;
![Page 165: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/165.jpg)
A Practical Session Hive CLI Commands • Count:
hive> SELECT COUNT(*) FROM cite;
Some more playing around • Create a table to store the citation frequency of each patent:
hive> CREATE TABLE cite_count (cited INT, count INT);
• Execute the query on the previous table and store the results:
hive> INSERT OVERWRITE TABLE cite_count
    > SELECT cited, COUNT(citing)
    > FROM cite
    > GROUP BY cited;
• Query the count table: hive> SELECT * FROM cite_count WHERE count > 10 LIMIT 10;
• Drop the table: hive> DROP TABLE cite_count;
![Page 166: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/166.jpg)
Data Model Partitioning Data • One or more partition columns may be specified:
hive>CREATE TABLE tbl1 (id INT, msg STRING)
PARTITIONED BY (dt STRING);
• Creates a subdirectory for each value of the partition column, e.g.:
/user/hive/warehouse/tbl1/dt=2009-03-20/
• Queries with partition columns in WHERE clause will scan through only a subset of the data.
![Page 167: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/167.jpg)
Managing Hive Tables
• Managed table
• Default table created (without EXTERNAL keyword)
• Hive manages the data
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• Moves the data into the warehouse directory for the table.
DROP TABLE managed_table;
• Deletes table data & metadata.
![Page 168: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/168.jpg)
Managing Hive Tables
• External table
• You control the creation and deletion of the data.
• The location of the external data is specified at table creation time:
• CREATE EXTERNAL TABLE external_table (dummy STRING)
  LOCATION '/user/tom/external_table';
• LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• DROP TABLE external_table; Hive will leave the data untouched and only delete the metadata.
![Page 169: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/169.jpg)
Conclusion
• Supports rapid iteration of ad hoc queries
• High-level Interface (HiveQL) to low-level infrastructure
(Hadoop).
• Scales to handle much more data than many similar systems
![Page 170: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/170.jpg)
Hive Resources/References
Documentation
• cwiki.apache.org/Hive/home.html
Mailing Lists
Books
• Hadoop, The Definitive Guide, 3rd edition by Tom White, (O’Reilly)
![Page 171: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/171.jpg)
![Page 172: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/172.jpg)
PIG
![Page 173: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/173.jpg)
PIG
• PIG is an abstraction layer on top of MapReduce that frees analysts from the complexity of MapReduce programming
• Architected towards handling unstructured and semi-structured data
• It's a dataflow language, which means the data is processed in a sequence of steps transforming the data
• The transformations support relational-style operations such as filter, union, group, and join
• Designed to be extensible and reusable
• Programmers can develop their own functions and use them (UDFs)
• Programmer friendly
• Allows you to introspect data structures
• Can do a sample run on a representative subset of your input
• PIG internally converts each transformation into a MapReduce job and submits it to the hadoop cluster
• 40 percent of Yahoo's Hadoop jobs are run with PIG
![Page 174: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/174.jpg)
Pig : What for ?
• An ad-hoc way of creating and executing map-reduce jobs on very large data sets
• Rapid development
• No Java is required
• Developed by Yahoo
![Page 175: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/175.jpg)
PIG Vs MapReduce(MR)
![Page 176: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/176.jpg)
Pig Use cases
Processing of Web Logs.
Data processing for search platforms.
Support for Ad Hoc queries across large datasets.
Quick Prototyping of algorithms for processing large datasets.
![Page 177: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/177.jpg)
Use Case in Healthcare

Problem Statement: De-identify personal health information.
Challenges: A huge amount of data flows into the systems daily, and there are multiple data
sources that we need to aggregate data from.
Crunching this huge data and de-identifying it in a traditional way had problems.
![Page 178: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/178.jpg)
Use Case in Healthcare
![Page 179: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/179.jpg)
When to Not Use PIG
• Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
• Pig can be noticeably slower than hand-tuned MapReduce jobs.
• When you would like more power to optimize your code.
![Page 180: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/180.jpg)
PIG Architecture
• Pig runs as a client-side application; there is no need to install anything on the cluster.
![Page 181: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/181.jpg)
Install and Configure PIG
• Download a version of PIG compatible with your hadoop installation
• http://pig.apache.org/releases.html
• Untar into a designated folder. This will be Pig's home directory
• tar xzf pig-x.y.z.tar.gz
• Configure
– Environment variables: add in .bash_profile
– export PIG_INSTALL=/<parent directory path>/pig-x.y.z
– export PATH=$PATH:$PIG_INSTALL/bin
• Verify Installation
– Try pig -help (displays command usage)
– Try pig (takes you into the Grunt shell)
grunt>
![Page 182: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/182.jpg)
PIG Execution Modes
• Local Mode
– Runs in a single JVM
– Operates on the local file system
– Suitable for small datasets and for development
– To run PIG in local mode: pig -x local
• MapReduce Mode
– In this mode the queries are translated into MapReduce jobs and run on the hadoop cluster
– The PIG version must be compatible with the hadoop version
– Set the HADOOP_HOME environment variable to tell Pig which hadoop client to use
– export HADOOP_HOME=$HADOOP_INSTALL
– If not set, Pig will use a bundled version of hadoop
![Page 183: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/183.jpg)
Pig Latin
![Page 184: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/184.jpg)
Data Analysis Task Example
![Page 185: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/185.jpg)
Conceptual Workflow
![Page 186: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/186.jpg)
Pig Latin Relational Operators
![Page 187: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/187.jpg)
Ways of Executing PIG programs
• Grunt
– An interactive shell for running Pig commands
– Grunt is started when the pig command is run without any options
• Script
– Pig commands can be executed directly from a script file: pig pigscript.pig
– It is also possible to run Pig scripts from the Grunt shell using run and exec
• Embedded
– You can run Pig programs from Java using the PigServer class, much like you can use JDBC
– For programmatic access to Grunt, use PigRunner
![Page 188: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/188.jpg)
An Example
Create a tab-delimited file sample.txt with rows such as:

1932	23	2
1905	12	1

and so on. Then:

grunt> records = LOAD '<your_input_dir>/sample.txt' AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
grunt> filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4);
![Page 189: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/189.jpg)
Example cont.
grunt>grouped_records = GROUP filtered_records BY year;
grunt>max_temp = FOREACH grouped_records GENERATE group,
MAX (filtered_records.temperature);
grunt>DUMP max_temp;
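The same pipeline can be sanity-checked outside Pig; this is a plain-Python sketch of what the FILTER / GROUP / MAX steps compute (the sample rows are made up for illustration):

```python
# Toy re-implementation of the Pig max-temperature pipeline,
# useful for checking expected DUMP output on small samples.
records = [
    ("1932", 23, 2),    # (year, temperature, quality)
    ("1905", 12, 1),
    ("1905", 9999, 1),  # sentinel value that FILTER removes
    ("1932", 30, 0),
]

# FILTER records BY temperature != 9999 AND quality IN (0, 1, 4)
filtered = [r for r in records if r[1] != 9999 and r[2] in (0, 1, 4)]

# GROUP ... BY year; FOREACH ... GENERATE group, MAX(temperature)
max_temp = {}
for year, temp, _ in filtered:
    max_temp[year] = max(temp, max_temp.get(year, temp))

print(sorted(max_temp.items()))  # [('1905', 12), ('1932', 30)]
```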
![Page 190: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/190.jpg)
Compilation
![Page 191: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/191.jpg)
Data Types
• Simple Type
Category Type Description
Numeric
int 32 - bit signed integer
long 64 - bit signed integer
float 32-bit floating-point number
double 64-bit floating-point number
Text chararray Character array in UTF-8 format
Binary bytearray Byte array
![Page 192: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/192.jpg)
Data Types
• Complex Types
Type Description Example
Tuple: Sequence of fields of any type, e.g. (1,'pomegranate')
Bag: An unordered collection of tuples, possibly with duplicates, e.g. {(1,'pomegranate'),(2)}
Map: A set of key-value pairs; keys must be character arrays but values may be any type, e.g. ['a'#'pomegranate']
![Page 193: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/193.jpg)
LOAD Operator
<relation name> = LOAD '<input file with path>' [USING UDF()]
[AS (<field name1>:dataType, <field name2>:dataType, <field name3>:dataType)]
• Loads data from a file into a relation
• Uses the PigStorage load function as default unless specified otherwise with the USING option
• The data can be given a schema using the AS option
• The default data type is bytearray if not specified

records = LOAD 'sales.txt';
records = LOAD 'sales.txt' AS (f1:chararray, f2:int, f3:float);
records = LOAD 'sales.txt' USING PigStorage('\t');
records = LOAD 'sales.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:float);
![Page 194: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/194.jpg)
Diagnostic Operators
• DESCRIBE
  - Describes the schema of a relation
• EXPLAIN
  - Displays the execution plan used to compute a relation
• ILLUSTRATE
  - Illustrates step-by-step how data is transformed
  - Uses a sample of the input data to simulate the execution
![Page 195: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/195.jpg)
Data Write Operators
• LIMIT
• Limits the number of tuples from a relation
• DUMP
• Display the tuples from a relation
• STORE
• Store the data from a relation into a directory.
• The directory must not already exist
![Page 196: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/196.jpg)
Relational Operators
• FILTER
  - Selects tuples based on a Boolean expression
  - teenagers = FILTER cust BY age < 20;
• ORDER
  - Sorts a relation based on one or more fields
  - Further processing (FILTER, DISTINCT, etc.) may destroy the ordering
  - ordered_list = ORDER cust BY name DESC;
• DISTINCT
  - Removes duplicate tuples
  - unique_custlist = DISTINCT cust;
![Page 197: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/197.jpg)
Relational Operators
• GROUP BY
  - Within a relation, groups tuples with the same group key
  - GROUP ALL will group all tuples into one group
  - groupByProfession = GROUP cust BY profession;
  - groupEverything = GROUP cust ALL;
• FOREACH
  - Loops through each tuple and generates new tuple(s)
  - countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
• Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM
![Page 198: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/198.jpg)
Relational Operators
• GROUP BY
  - Within a relation, groups tuples with the same group key
  - GROUP ALL will group all tuples into one group
  - groupByProfession = GROUP cust BY profession;
  - groupEverything = GROUP cust ALL;
• FOREACH
  - Loops through each tuple in nested_alias and generates new tuple(s)
  - At least one of the fields of nested_alias should be a bag
  - DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed operations in the nested block to operate on the inner bag(s)
  - countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
• Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM
![Page 199: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/199.jpg)
Operating on Multiple datasets
• JOIN: computes an inner join of two or more relations based on common field values

DUMP A;          DUMP B;
(1,2,3)          (2,4)
(4,2,1)          (8,9)
(8,3,4)          (1,3)
(4,3,3)          (2,7)
(7,2,5)          (7,9)

X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(8,3,4,8,9)
(7,2,5,7,9)
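As a cross-check, JOIN's flat output can be sketched in a few lines of Python (toy data mirroring the slide):

```python
# Sketch of JOIN semantics: inner join of relations A and B on the
# first field, producing flat concatenated tuples.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (7, 9)]

joined = [a + b for a in A for b in B if a[0] == b[0]]
print(joined)  # [(1, 2, 3, 1, 3), (8, 3, 4, 8, 9), (7, 2, 5, 7, 9)]
```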
![Page 200: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/200.jpg)
Operating on Multiple datasets
• COGROUP: groups tuples from two or more relations, based on common group values

DUMP A;          DUMP B;
(1,2,3)          (2,4)
(4,2,1)          (8,9)
(8,3,4)          (1,3)
(4,3,3)          (2,7)
(7,2,5)          (7,9)

X = COGROUP A BY a1, B BY b1;
DUMP X;
(1,{(1,2,3)},{(1,3)})
(8,{(8,3,4)},{(8,9)})
(7,{(7,2,5)},{(7,9)})
(2,{},{(2,4),(2,7)})
(4,{(4,2,1),(4,3,3)},{})
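COGROUP's nested output can likewise be sketched in Python: each key maps to one bag per input relation, with an empty bag when the key is absent from that relation (toy data mirroring the slide):

```python
# Sketch of COGROUP semantics: group tuples from two relations by a key,
# producing (key -> (bag_from_A, bag_from_B)), with empty bags where a
# key appears in only one relation.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (7, 9)]

keys = {t[0] for t in A} | {t[0] for t in B}
cogrouped = {
    k: ([t for t in A if t[0] == k], [t for t in B if t[0] == k])
    for k in keys
}

print(cogrouped[2])  # ([], [(2, 4), (2, 7)])
print(cogrouped[4])  # ([(4, 2, 1), (4, 3, 3)], [])
```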
![Page 201: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/201.jpg)
Joins & Cogroups
• JOIN and COGROUP operators perform similar functions.
• JOIN creates a flat set of output records while COGROUP creates a nested set of output records
![Page 202: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/202.jpg)
Data

File – student
Name	Age	GPA
Joe	18	2.5
Sam		3.0
Angel	21	7.9
John	17	9.0
Joe	19	2.9

File – studentRoll
Name	RollNo
Joe	45
Sam	24
Angel	1
John	12
Joe	19
![Page 203: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/203.jpg)
A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)

X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
Example of GROUP Operator:
Pig Latin - GROUP Operator
![Page 204: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/204.jpg)
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);
X = cogroup A by name, B by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
Example of COGROUP Operator:
Pig Latin – COGROUP Operator
![Page 205: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/205.jpg)
Operating on Multiple datasets
• UNION: creates the union of two or more relations

DUMP A;          DUMP B;
(1,2,3)          (2,4)
(4,2,1)          (8,9)
(8,3,4)

X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(2,4)
(8,9)
![Page 206: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/206.jpg)
Operating on Multiple datasets
• SPLIT: splits a relation into two or more relations, based on Boolean expressions

SPLIT X INTO C IF a1<5, D IF a1>5;

DUMP C;          DUMP D;
(1,2,3)          (8,3,4)
(4,2,1)          (8,9)
(2,4)
![Page 207: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/207.jpg)
User Defined Functions (UDFs)
• PIG lets users define their own functions and lets them be used in the statements
• The UDFs can be developed in Java, Python or Javascript
• Filter UDF
  - Subclass of FilterFunc, which is a subclass of EvalFunc
• Eval UDF
  - Subclass of EvalFunc

public abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

• Load / Store UDF
  - Subclass of LoadFunc / StoreFunc
![Page 208: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/208.jpg)
Creating UDF : Eval function example

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase(); // Java upper-case function
        } catch (Exception e) {
            throw new IOException("Caught exception processing input...", e);
        }
    }
}
![Page 209: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/209.jpg)
Define and use a UDF

• Package the UDF class into a jar
• Define and use the UDF:

REGISTER yourUDFjar_name.jar;
cust = LOAD some_cust_data;
filtered = FOREACH cust GENERATE com.pkg.UPPER(title);
DUMP filtered;
![Page 210: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/210.jpg)
Piggy Bank
• Piggy Bank is a place for Pig users to share the Java UDFs they have written for use with Pig. The functions are contributed "as-is."
• Piggy Bank currently supports Java UDFs.
• No binary version to download; download the source, build, and use.
![Page 211: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/211.jpg)
Pig Vs. Hive
• Hive was invented at Facebook.
• Pig was invented at Yahoo.
• If you know SQL, then Hive will be very familiar to you. Since Hive uses SQL, you will feel at home with all the familiar select, where, group by, and order by clauses similar to SQL for relational databases.
![Page 212: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/212.jpg)
Pig Vs. Hive
• Pig needs some mental adjustment for SQL users to learn.
• Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!).
![Page 213: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/213.jpg)
Pig Vs. Hive
• Pig gives you more control and optimization over the flow of the data than Hive does.
• If you are a data engineer, then you’ll likely feel like you’ll have better control over the dataflow (ETL) processes when you use Pig Latin, if you come from a procedural language background.
• If you are a data analyst, however, you will likely find that you can work on Hadoop faster by using Hive, if your previous experience was more with SQL.
![Page 214: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/214.jpg)
Pig Vs. Hive
• Pig Latin allows users to store data at any point in the pipeline without disrupting the pipeline execution.
• Pig is not meant to be an ad-hoc query tool.
![Page 216: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/216.jpg)
Apache Mahout
![Page 217: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/217.jpg)
What is Apache Mahout
• Mahout is an open source machine learning library from Apache.
• Scalable, can run on Hadoop
• Written in Java
• Started as a Lucene sub-project; became an Apache top-level project in 2010.
![Page 218: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/218.jpg)
Machine Learning : Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
– Intro. To Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more
![Page 219: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/219.jpg)
Machine Learning
• Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
• Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input.
– Common examples of supervised learning include classifying e-mail messages as spam.
• Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups.
![Page 220: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/220.jpg)
Common Use Cases
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text – Documents and Corpora
• Detect anomalies/fraud
• Ranking search results
• Others?
![Page 221: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/221.jpg)
Apache Mahout
• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License – http://mahout.apache.org
• Why Mahout?
– Many open source ML libraries either:
  • Lack community
  • Lack documentation and examples
  • Lack scalability
  • Lack the Apache License
  • Or are research-oriented
![Page 222: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/222.jpg)
Sampling of Who uses Mahout?
![Page 223: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/223.jpg)
Apache Mahout components
• The Three C’s:
– Collaborative Filtering (recommenders)
– Clustering
– Classification
• Others:
– Frequent Item Mining
– Math stuff
![Page 224: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/224.jpg)
Content Discovery & Personalization
![Page 225: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/225.jpg)
Key points
• Automatic discovery of the most relevant content without searching for it
• Automatic discovery and recommendation of the most appropriate connection between people and interests
• Personalization and presentation of the most relevant content (content, inspirations, marketplace, ads) at every page/touch point
![Page 226: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/226.jpg)
Business objectives
• More revenue from incoming traffic
• Improved consumer interaction and loyalty, as each page has more interesting and relevant content
• Drives better utilization of assets within the platform by linking “similar” products
![Page 227: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/227.jpg)
Basic approaches to Recommendations on social sites
• Collaborative Filtering (CF): Collaborative filtering first computes similarity between two users based on their preference towards items, and recommends items which are highly rated (preferred) by similar users.
• Content-based Recommendation (CBR): A content-based system provides recommendations directly based on similarity of items and the user history. Similarity is computed based on item attributes using appropriate distance measures.
![Page 228: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/228.jpg)
• Collaborative Filtering (CF)
  - Provide recommendations solely based on preferences expressed between users and items
  - "People who watched this also watched that"
• Content-based Recommendations (CBR)
  - Provide recommendations based on the attributes of the items (and user profile)
  - 'Chak de India' is a sports movie, Mohan likes sports movies => Suggest Chak de India to Mohan
• Mahout is geared towards CF, and can be extended to do CBR
• Aside: search engines can also solve these problems
![Page 229: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/229.jpg)
Collaborative Filtering approaches
• User based
• Item based
![Page 230: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/230.jpg)
User Similarity
[Figure: user-item preference matrix, Users 1-4 (rows) x Items 1-4 (columns)]
What should we recommend for User 1?
![Page 231: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/231.jpg)
Item Similarity
[Figure: user-item preference matrix, Users 1-4 (rows) x Items 1-4 (columns), compared by item columns]
![Page 232: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/232.jpg)
Component diagram
[Diagram: Recommender App; DataModel Extractor / User Preference Builder; Core platform DB; QDSS (Cassandra); Personalization DB; Feeder; REST API]
![Page 233: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/233.jpg)
Key architecture blocks
• Data Extraction
  - Extracting relevant data from:
    • Core Platform DB (MySQL)
    • Qyuki custom data store (Cassandra)
• Data Mining
  - A series of activities to calculate distances/similarities between items and preferences of users, and to apply Machine Learning for further use cases
• Personalization Service
  - The application of relevant data contextually for the users and creations on Qyuki, exposed as a REST API
![Page 234: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/234.jpg)
Entities to Recommend : Creations
For every creation, there are two parts to recommendations:
1) Creations similar to the current creation (content-based)
2) Recommendations based on the studied preferences of the user
![Page 235: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/235.jpg)
Features cont..
Entities to Recommend : Users
• New users/creators to follow based on the user's social graph & other activities
• New users/creators to collaborate with
  - The engine would recommend artists to artists to collaborate and create
  - QInfluence
![Page 236: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/236.jpg)
Personalized Creations
• Track user's activities – MySQL + custom analytics (Cassandra)
• Explicit & Implicit Ratings
• Explicit : None in our case
• Implicit : Emotes, comments, views, engagement with creator
![Page 237: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/237.jpg)
Personalized Creations cont..
• Mahout expects data input in the form:
  User	Item	Preference
  U253	I306	3.5
  U279	I492	4.0
• Preference Compute Engine
– Consider all user activities, infer preferences
• Feed the preference data to Mahout, and you are done (to some extent)
• Filtering and rescoring
![Page 238: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/238.jpg)
Some Mahout recommendations logic/code
• DataModel: the preference data model
  DataModel dataModel = new FileDataModel(new File(preferenceFile));
• Notion of similarity (distance measure)
  UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
• Neighbourhood (for user-based)
  UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, dataModel);
• Recommender instance
  GenericUserBasedRecommender userBasedRecommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
• Recommendations for a user
  topUserBasedRecommendedItems = userBasedRecommender.recommend(userId, numOfItemsToBeCached, null);
![Page 239: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/239.jpg)
Similarity algorithms
• Pearson correlation–based similarity
– Pearson correlation is a number between –1 and 1
– It measures the tendency of two users’ preference values to move together—to be relatively high, or relatively low, on the same items.
– doesn’t take into account the number of items in which two users’ preferences overlap
![Page 240: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/240.jpg)
Similarity algorithms
• Cosine Similarity
– Ignores 0-0 matches
– A measure of alignment/direction
– Since Cos 0 = 1 (1 means 100% similar)
• Euclidean Distance Similarity
– Based on the Euclidean distance between two points (square root of the sum of squared coordinate differences)
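The three measures above can be written down concisely. A minimal pure-Python sketch, assuming dense rating vectors; the 1/(1 + distance) mapping turns the Euclidean distance into a similarity in (0, 1], following the convention Mahout uses:

```python
import math

# Toy implementations of the three similarity measures over two
# users' ratings of the same items (dense vectors).
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean_similarity(x, y):
    # Map distance into (0, 1]: identical vectors score 1.0
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)

u1 = [5.0, 3.0, 4.0]
u2 = [4.0, 2.0, 3.0]  # same shape as u1, shifted down by 1 (grade inflation)
print(round(pearson(u1, u2), 3))  # 1.0: Pearson ignores the constant shift
print(round(cosine(u1, u2), 3))   # 0.998
```

Note how the shifted vector illustrates the grade-inflation point: Pearson reports perfect similarity while Euclidean distance does not.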
![Page 241: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/241.jpg)
Which Similarity to use

• If your data is dense (almost all attributes have non-zero values) and the magnitude of the attribute values is important, use distance measures such as Euclidean.
• If the data is subject to grade-inflation (different users may be using different scales) use Pearson.
• If the data is sparse, consider using Cosine Similarity.
![Page 242: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/242.jpg)
Hybrid approaches
• Weighted hybrid
  - Combines scores from each component using a linear formula
• Switching hybrid
  - Selects one recommender among candidates
• Mixed hybrid
  - Based on the merging and presentation of multiple ranked lists into one
  - The core algorithm of a mixed hybrid merges them into a single ranked list
![Page 243: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/243.jpg)
Evaluating Recommenders
• Mahout
• Training Set, Test Set
• Iterative preference weight tuning + evaluator
• RMSE
• A/B testing
• Offline
![Page 244: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/244.jpg)
Distributed Recommendations with Mahout & Hadoop
![Page 245: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/245.jpg)
Movie recommendations Case
• Movie rating data (17 million ratings from around 6000 users available as input)
• Goal : Compute recommendations for users based on the rating data
• Use Mahout over Hadoop
• Mahout has support to run Map Reduce Jobs
• Runs a series of Map Reduce Jobs to compute recommendations (Movies a user would like to watch)
![Page 246: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/246.jpg)
Run Mahout job on Hadoop

hadoop jar ~/mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/hadoop/input/ratings.csv \
  --output recooutput \
  -s SIMILARITY_COOCCURRENCE \
  --usersFile /user/hadoop/input/users.txt
![Page 247: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/247.jpg)
Other similarity options
• You can pass one of these to the "-s" flag as well:
  SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE, SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
![Page 248: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/248.jpg)
Learning
• Reach out to experienced folks early
• Recommending follows the 90/10 rule
• Assign the right task to the right person
• Use Analytics metrics to evaluate recommenders periodically
• Give importance to human evaluation
• Test thoroughly
• Mahout has some bugs (e.g. the filtering part)
• Experiment with hybrid approaches
• You won’t need Hadoop generally
• Performance test your application
• Have a fallback option, hence take charge of your popularity algorithm
![Page 249: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/249.jpg)
What is recommendation?

Recommendation involves the prediction of what new items a user would like or dislike based on preferences of or associations to previous items.

(Made-up) Example: A user, John Doe, likes the following books (items):
• A Tale of Two Cities
• The Great Gatsby
• For Whom the Bell Tolls

Recommendations will predict which new books (items) John Doe will like:
• Jane Eyre
• The Adventures of Tom Sawyer
![Page 250: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/250.jpg)
How does Mahout's Recommendation Engine Work?

| 5 3 4 4 2 2 1 |       | 2   |       | 40   |
| 3 3 3 2 1 1 0 |       | 0   |       | 18.5 |
| 4 3 4 3 1 2 0 |       | 0   |       | 24.5 |
| 4 2 3 4 2 2 1 |   x   | 4   |   =   | 38   |
| 2 1 1 2 2 1 1 |       | 4.5 |       | 26   |
| 2 1 2 2 1 2 0 |       | 0   |       | 16.5 |
| 1 0 0 1 1 0 1 |       | 5   |       | 15.5 |
        S                  U               R

S is the similarity matrix between items
U is the user's preferences for items
R is the predicted recommendations
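The multiplication itself is straightforward; a pure-Python sketch of R = S x U with the values above (row 4 works out to 38):

```python
# Sketch of the item-based prediction step R = S x U, using the
# co-occurrence matrix S and preference vector U from the slide.
S = [
    [5, 3, 4, 4, 2, 2, 1],
    [3, 3, 3, 2, 1, 1, 0],
    [4, 3, 4, 3, 1, 2, 0],
    [4, 2, 3, 4, 2, 2, 1],
    [2, 1, 1, 2, 2, 1, 1],
    [2, 1, 2, 2, 1, 2, 0],
    [1, 0, 0, 1, 1, 0, 1],
]
U = [2, 0, 0, 4, 4.5, 0, 5]

# Each predicted score is the dot product of one row of S with U.
R = [sum(s * u for s, u in zip(row, U)) for row in S]
print(R)  # [40.0, 18.5, 24.5, 38.0, 26.0, 16.5, 15.5]
```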
![Page 251: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/251.jpg)
What is the similarity matrix, S?

• S is an n x n (square) matrix
• Each element, e, in S is indexed by row (j) and column (k), e_jk
• Each e_jk in S holds a value that describes how similar its corresponding j-th and k-th items are
• In this example, the similarity of the j-th and k-th items is determined by the frequency of their co-occurrence (when the j-th item is seen, the k-th item is seen as well)
• In general, any similarity measure may be used to produce these values
• We see in this example that Items 1 and 2 co-occur 3 times, Items 1 and 3 co-occur 4 times, and so on...

        Item1 Item2 Item3 Item4 Item5 Item6 Item7
Item 1    5     3     4     4     2     2     1
Item 2    3     3     3     2     1     1     0
Item 3    4     3     4     3     1     2     0
Item 4    4     2     3     4     2     2     1
Item 5    2     1     1     2     2     1     1
Item 6    2     1     2     2     1     2     0
Item 7    1     0     0     1     1     0     1
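Building such a co-occurrence matrix from raw interaction data is itself simple to sketch: for every user, count each pair of items that appear together in that user's history (the baskets below are hypothetical, not the slide's data):

```python
from itertools import combinations
from collections import defaultdict

# Sketch of building a co-occurrence similarity matrix S from user
# "baskets" (the sets of items each user has interacted with).
baskets = [
    {1, 2, 3},      # hypothetical users' item sets
    {1, 3, 4},
    {2, 3, 4},
    {1, 2, 3, 4},
]

cooccur = defaultdict(int)
for items in baskets:
    for j, k in combinations(sorted(items), 2):
        cooccur[(j, k)] += 1
        cooccur[(k, j)] += 1  # keep the matrix symmetric

print(cooccur[(1, 3)])  # 3: items 1 and 3 appear together in 3 baskets
```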
![Page 252: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/252.jpg)
What is the user's preferences, U?

• The user's preference is represented as a column vector
• Each value in the vector represents the user's preference for the j-th item
• In general, this column vector is sparse
• Values of zero (0) represent no recorded preference for the j-th item

        U
Item 1  2
Item 2  0
Item 3  0
Item 4  4
Item 5  4.5
Item 6  0
Item 7  5
![Page 253: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/253.jpg)
What is the recommendation, R?

• R is a column vector representing the prediction of recommendation of the j-th item for the user
• R is computed from the multiplication of S and U: S x U = R
• In this running example, the user has already expressed positive preferences for Items 1, 4, 5 and 7, so we look at only Items 2, 3, and 6
• We would recommend Items 3, 2, and 6, in this order, to the user

        R
Item 1  40
Item 2  18.5
Item 3  24.5
Item 4  38
Item 5  26
Item 6  16.5
Item 7  15.5
![Page 254: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/254.jpg)
What data format does Mahout's recommendation engine expect?

• For Mahout v0.7, look at RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob)
• Each line of the input file should have the following format:
  userID,itemID[,preferencevalue]
• userID is parsed as a long
• itemID is parsed as a long
• preferencevalue is parsed as a double and is optional

Format 1:
123,345
123,456
123,789
...
789,458

Format 2:
123,345,1.0
123,456,2.2
123,789,3.4
...
789,458,1.2
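A parser for this format is a one-liner per field. The sketch below is a hypothetical helper, not part of Mahout; it defaults a missing preference to 1.0, the way preference-less (boolean) data is typically treated:

```python
# Sketch of a parser for the RecommenderJob input format:
# userID,itemID[,preferencevalue]
def parse_pref_line(line):
    parts = line.strip().split(",")
    user_id, item_id = int(parts[0]), int(parts[1])  # both parsed as integers
    pref = float(parts[2]) if len(parts) > 2 else 1.0  # preference is optional
    return user_id, item_id, pref

print(parse_pref_line("123,345"))      # (123, 345, 1.0)
print(parse_pref_line("123,456,2.2"))  # (123, 456, 2.2)
```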
![Page 255: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/255.jpg)
How do you run Mahout's recommendation engine?

Requirements:
• Hadoop cluster on GNU/Linux
• Java 1.6.x
• SSH

Assuming you have a Hadoop cluster installed and configured correctly, with the data loaded into HDFS:

hadoop jar ~/mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/hadoop/input/ratings.csv \
  --output recooutput \
  -s SIMILARITY_COOCCURRENCE \
  --usersFile /user/hadoop/input/users.txt

$HADOOP_INSTALL$ is the location where you installed Hadoop; $TARGET$ is the directory where you have the Mahout jar file; $INPUT$ is the input file name; $OUTPUT$ is the output file name
![Page 256: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/256.jpg)
Running Mahout RecommenderJob options
• There are plenty of runtime options (check javadocs)
• --usersFile (path) : optional; a file containing userIDs; only preferences of these userIDs will be computed
• --itemsFile (path) : optional; a file containing itemIDs; only these items will be used in the recommendation predictions
• --numRecommendations (integer) : number of recommendations to compute per user; default 10
• --booleanData (boolean) : treat input data as having no preference values; default false
• --maxPrefsPerUser (integer) : maximum number of preferences considered per user in final recommendation phase; default 10
• --similarityClassname (classname): similarity measure (cooccurence, euclidean, log-likelihood, pearson, tanimoto coefficient, uncentered cosine, cosine)
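As a rough illustration of what SIMILARITY_COOCCURRENCE measures, the sketch below counts, for each pair of items, how many users interacted with both. This is a simplification of Mahout's actual implementation, and the (userID, itemID) pairs are made up in the Format 1 style shown earlier:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (userID, itemID) pairs, Format 1 style (no preference values)
prefs = [(123, 345), (123, 456), (123, 789),
         (456, 345), (456, 456),
         (789, 345), (789, 789)]

# Group the items each user interacted with
items_by_user = defaultdict(set)
for user, item in prefs:
    items_by_user[user].add(item)

# Co-occurrence count: number of users who interacted with both items
cooccur = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1

print(cooccur[(345, 456)])  # 2 users saw both items 345 and 456
```

Items that frequently co-occur across users end up with higher similarity, which drives the R = S x U scoring shown earlier.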
289
![Page 257: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/257.jpg)
Coordination in a distributed system
• Coordination: An act that multiple nodes must perform together.
• Examples: – Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
![Page 258: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/258.jpg)
Introducing ZooKeeper
"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers."
- ZooKeeper Wiki
ZooKeeper is much more than a distributed lock server!
![Page 259: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/259.jpg)
What is ZooKeeper?
• An open source, high-performance coordination service for distributed applications.
• Exposes common services in a simple interface:
– naming
– configuration management
– locks & synchronization
– group services
…so developers don't have to write them from scratch
• Build your own on it for specific needs.
![Page 260: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/260.jpg)
ZooKeeper Use Cases
• Configuration Management
– Cluster member nodes bootstrapping configuration from a centralized source in an unattended way
– Easier, simpler deployment/provisioning
• Distributed Cluster Management
– Node join / leave
– Node statuses in real time
• Naming service – e.g. DNS
• Distributed synchronization – locks, barriers, queues
• Leader election in a distributed system
• Centralized and highly reliable (simple) data registry
![Page 261: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/261.jpg)
The ZooKeeper Service
• ZooKeeper Service is replicated over a set of machines
• All machines store a copy of the data (in memory)
• A leader is elected on service startup
• Each client connects to a single ZooKeeper server and maintains a TCP connection.
• Clients can read from any ZooKeeper server; writes go through the leader and need majority consensus.
Image: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
![Page 262: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/262.jpg)
The ZooKeeper Data Model
• ZooKeeper has a hierarchical name space.
• Each node in the namespace is called a ZNode.
• Every ZNode has data (given as byte[]) and can optionally have children.
parent : "foo"
|-- child1 : "bar"
|-- child2 : "spam"
`-- child3 : "eggs"
    `-- grandchild1 : "42"
• ZNode paths:
– canonical, absolute, slash-separated
– no relative references
– names can have Unicode characters
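To make the data model concrete, here is a toy in-memory znode store in Python. This is our own sketch of the namespace semantics only; a real client would use a library such as kazoo against a running ZooKeeper ensemble:

```python
# Toy znode store: absolute, slash-separated paths -> data bytes.
znodes = {"/": b""}

def create(path, data):
    """Create a znode; paths are absolute and the parent must exist."""
    assert path.startswith("/"), "paths are absolute"
    parent = path.rsplit("/", 1)[0] or "/"
    assert parent in znodes, "parent must exist"
    znodes[path] = data

def get_children(path):
    """Direct children only, like ZooKeeper's getChildren()."""
    prefix = path.rstrip("/") + "/"
    return sorted(p[len(prefix):] for p in znodes
                  if p.startswith(prefix) and "/" not in p[len(prefix):])

create("/parent", b"foo")
create("/parent/child1", b"bar")
create("/parent/child3", b"eggs")
create("/parent/child3/grandchild1", b"42")
print(get_children("/parent"))  # ['child1', 'child3']
```

Note that grandchild1 does not appear among /parent's children: each level of the tree is addressed by its own canonical path.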
![Page 263: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/263.jpg)
Consistency Guarantees
• Sequential Consistency: Updates are applied in order
• Atomicity: Updates either succeed or fail
• Single System Image: A client sees the same view of the service regardless of the ZK server it connects to.
• Reliability: Updates persist once applied, until overwritten by another client.
• Timeliness: The clients’ view of the system is guaranteed to be up-to-date within a certain time bound. (Eventual Consistency)
![Page 264: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/264.jpg)
Who uses ZooKeeper? Companies:
• Yahoo!
• Zynga
• Rackspace
• Netflix
• and many more…
Projects: • Apache Map/Reduce (Yarn)
• Apache HBase
• Apache Solr
• Neo4j
• Katta
• and many more…
Reference: https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy
![Page 265: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/265.jpg)
Apache Flume
![Page 266: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/266.jpg)
My log data is not near my Hadoop cluster
(Diagram: application servers holding customer logs on one side, an Oracle Big Data Appliance on the other, with a "?" for how the logs get across.)
![Page 267: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/267.jpg)
Moving Data with Flume NG
(Diagram: Flume NG agents on the application servers collect the logs and forward events over Avro to a second tier of Flume NG agents, which perform HDFS writes into the Oracle Big Data Appliance.)
![Page 268: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/268.jpg)
Building a Basic Flume Agent
• Flume is flexible
– Durable Transactions
– In-Flight Data Modification
– Compresses Data
• Flume is simpler than it used to be
– No ZooKeeper requirement
– No Master-Slave architecture
• 3 basic pieces – Source, Channel, Sink
• One configuration file
![Page 269: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/269.jpg)
Flume Configuration
hdfs-agent.sources= netcat-collect
hdfs-agent.sinks = hdfs-write
hdfs-agent.channels= memoryChannel
hdfs-agent.sources.netcat-collect.type = netcat
hdfs-agent.sources.netcat-collect.bind = 127.0.0.1
hdfs-agent.sources.netcat-collect.port = 11111
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/oracle/sabre_example
hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat=Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType=DataStream
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity=10000
hdfs-agent.sources.netcat-collect.channels=memoryChannel
hdfs-agent.sinks.hdfs-write.channel=memoryChannel
Invoke this with: flume-ng agent -f this_file -n hdfs-agent
![Page 270: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/270.jpg)
Sending Data to the Agent
• Connect netcat to the host
• Pipe input to it
• Records are transmitted on newline
• head example.xml | nc localhost 11111
![Page 271: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/271.jpg)
Alternatives to Flume and Their Trade-Offs
• Scribe
– Thrift-based
– Lightweight, but no support
– Not designed around Hadoop
• Kafka
– Designed to resemble a publish-subscribe system
– Explicitly distributed
– Apache Incubator project
![Page 272: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/272.jpg)
Apache HCatalog
● What is it ?
● How does it work ?
● Interfaces
● Architecture
● Example
![Page 273: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/273.jpg)
HCatalog – What is it ?
● A Hive metastore interface set
● Shared schema and data types for Hadoop tools
● REST interface for external data access
● Assists interoperability between Pig, Hive and Map Reduce
● Table abstraction of data storage
● Will provide data availability notifications
![Page 274: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/274.jpg)
HCatalog – How does it work ?
● Pig
– HCatLoader + HCatStorer interface
● Map Reduce
– HCatInputFormat + HCatOutputFormat interface
● Hive
– No interface necessary
– Direct access to meta data
● Notifications when data available
![Page 275: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/275.jpg)
HCatalog – Interfaces
● Interface via
– Pig
– Map Reduce
– Hive
– Streaming
● Access data via
– Orc file
– RC file
– Text file
– Sequence file
– Custom format
![Page 276: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/276.jpg)
HCatalog – Interfaces
![Page 277: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/277.jpg)
HCatalog – Architecture
![Page 278: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/278.jpg)
HCatalog – Example
A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to get data onto the grid:
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data.

Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS:
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) = 0;
…
store Z into 'data/processedevents/20100819/data';

With HCatalog, HCatalog will send a JMS message that data is available, and the Pig job can then be started:
A = load 'rawevents' using HCatLoader();
B = filter A by date = '20100819' and bot_finder(zeta) = 0;
…
store Z into 'processedevents' using HCatStorer("date=20100819");

Note that the Pig job refers to the data by name (rawevents) rather than by location.

Now access the data via Hive QL:
select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
![Page 279: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/279.jpg)
SQOOP
312
![Page 280: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/280.jpg)
• Sqoop allows users to extract data from a structured data store into Hadoop for analysis
• Sqoop can also export the data back to the structured stores
• Installing & Configuring Sqoop
• Download a version of Sqoop compatible with your Hadoop installation
• Untar into a designated folder; this will be Sqoop’s home directory:
tar xzf sqoop-x.y.z.tar.gz
• Configure environment variables – add in .bash_profile:
export SQOOP_HOME=/<parent directory path>/sqoop-x.y.z
export PATH=$PATH:$SQOOP_HOME/bin
• Verify the installation:
sqoop
sqoop help
313
![Page 281: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/281.jpg)
Sqoop Connectors
• Sqoop ships with connectors for working with a range of popular relational databases, including • MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
• There is also a generic JDBC connector for connecting
to any database that supports Java’s JDBC protocol.
• Third-party connectors are available for : • EDWs such as Netezza, Teradata, and Oracle • NoSQL stores (such as Couchbase).
314
![Page 282: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/282.jpg)
315
(Diagram: Sqoop import flow between the Sqoop client, the RDBMS, and the Hadoop cluster –
1. Examine the schema in the RDBMS
2. Generate code (MyClass.java)
3. Launch multiple map tasks on the cluster
4. The map tasks use the generated code to import the data)
![Page 283: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/283.jpg)
316
• Install MySQL:
sudo apt-get install mysql-client mysql-server
• Create a new MySQL schema:
% mysql -u root -p
mysql> CREATE DATABASE hadoopguide;
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
![Page 284: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/284.jpg)
317
• Create table widgets:
CREATE TABLE widgets(
  id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
  widget_name VARCHAR(64) NOT NULL,
  price DECIMAL(10,2),
  design_date DATE,
  version INT,
  design_comment VARCHAR(100));
• Insert some records into the widgets table:
INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10', 1, 'Connects two gizmos');
![Page 285: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/285.jpg)
318
• Copy the MySQL JDBC driver to Sqoop’s lib directory
• Sqoop does not come with the JDBC driver
• Sqoop import:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1
• Verify the import to HDFS:
% hadoop fs -cat widgets/part-m-00000
![Page 286: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/286.jpg)
Importing Data
319
• Import
• The import tool runs a MapReduce job that connects to the database and reads the table
• By default, this uses four map tasks. They write their output to separate files in a common directory
• Generates comma-delimited text files by default
• In addition to downloading data, the import tool also generates a Java class matching the table schema (widgets.java) locally.
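The comma-delimited files can be read back with any CSV-aware tool. As a sketch, here is one imported line parsed into a typed record; the column names and types follow the widgets table above, but the parsing code is ours, not Sqoop's generated widgets.java:

```python
import csv
import io
from decimal import Decimal

# One line of widgets/part-m-00000, as Sqoop's default comma-delimited
# text import would render the row inserted above.
line = "1,sprocket,0.25,2010-02-10,1,Connects two gizmos"

row = next(csv.reader(io.StringIO(line)))
widget = {
    "id": int(row[0]),
    "widget_name": row[1],
    "price": Decimal(row[2]),
    "design_date": row[3],
    "version": int(row[4]),
    "design_comment": row[5],
}
print(widget["widget_name"], widget["price"])
```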
![Page 287: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/287.jpg)
Hue
![Page 288: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/288.jpg)
Hue
![Page 289: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/289.jpg)
Hue : Data Analysis
![Page 290: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/290.jpg)
Impala
• SQL on HDFS
• For analytical & transactional workloads
• High performance
• Skips the MapReduce flow
• Directly queries data in HDFS & HBase
• Easy for Hive users to migrate
• Open source
• Written in C++
• Uses HiveQL
![Page 291: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/291.jpg)
Impala
• Uses Hive’s metastore
• Query is distributed to nodes with relevant data
• No fault tolerance
• Up to 100x faster than queries in Hive
• Supports common HDFS file formats
![Page 292: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/292.jpg)
Impala
Next Generation Testing Conference (c)
![Page 293: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/293.jpg)
Impala Architecture
• Impalad
– Runs on every node
– Handles client requests
– Handles query planning & execution
• Statestored
– Provides name service
– Metadata distribution
– Used for finding data
![Page 294: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/294.jpg)
Current Limitations
• No UDFs
• Joins are done in memory, so the joined data must fit within the memory of the smallest node
![Page 295: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/295.jpg)
Some Real world Use Cases
![Page 296: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/296.jpg)
![Page 297: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/297.jpg)
![Page 298: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/298.jpg)
PayPal Tracking Architecture
![Page 299: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/299.jpg)
Data Sources (Credit Card Issuer):
• Customer Purchase History
• Merchant Designations
• Merchant Special Offers
Techniques
(Diagram: purchase history, merchant information, and merchant offers from the credit card issuer are imported into Hadoop (4 hrs); Mahout computes the recommendation engine results, which are exported (4 hrs) to a DB2 presentation data store serving the apps.)
![Page 301: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/301.jpg)
Techniques
(Diagram: the same pipeline, but the Mahout recommendation engine results are pushed from Hadoop into a Solr recommendation search index, with index updates every 2 min, serving the apps directly.)
![Page 302: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/302.jpg)
Fraud Detection Data Lake
![Page 303: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/303.jpg)
Data Sources:
• Anti-Money Laundering
• Consumer Transactions
![Page 304: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/304.jpg)
Techniques
(Diagram: the Anti-Money Laundering system and the Consumer Transactions system, initially as separate silos.)
![Page 305: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/305.jpg)
Techniques
(Diagram: AML and consumer transaction data land in a data lake on Hadoop; techniques such as Latent Dirichlet Allocation, Bayesian Learning Neural Networks, and Peer Group Analysis surface suspicious events for an analyst to review.)
![Page 306: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/306.jpg)
Why shopping cart analysis is useful to amazon.com
![Page 307: Hadoop Master Class : A concise overview](https://reader030.fdocuments.net/reader030/viewer/2022020122/54c6f16b4a7959334e8b4591/html5/thumbnails/307.jpg)
LinkedIn & Google Reader