Hadoop 101

Posted on 27-Jan-2015


Description: Introduction to Big Data and Hadoop: presenting and defining big data; introducing Hadoop and its history; how Hadoop works; HDFS.

Transcript of Hadoop 101

Introducing:

The Modern Data Operating System

Hadoop is ... a scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license).

Core Hadoop has two main systems:

● Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage

● MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
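To make the programming abstraction concrete, here is a minimal word count in the style of Hadoop Streaming (a sketch only; a real job runs the map and reduce phases as separate processes spread across the cluster, with the framework doing the sort between them):

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word seen.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Reduce phase: pairs arrive grouped by key; sum each group.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        for word, count in reducer(mapper(sys.stdin)):
            print(word, count)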

The Google papers and their Hadoop counterparts:

GFS > > > HDFS
Map/Reduce > > > MapReduce
BigTable > > > HBase

Hadoop Origins

Etymology

● Hadoop was created in 2004 by Douglass (Doug) Cutting

● It implemented Google's File System (GFS) and BigTable papers

● Cutting's aim was to index the web, Google-style, for the Nutch search engine project

● He named it after his son's favourite toy, a stuffed elephant called Hadoop

What is Big Data?

"In Information Technology, big data is loosely defined term used to describe set so large and complex that they became awkward to work with using on-hand database management tools."

Wikipedia

● 2008: Google processes 20 PB a day

● 2009: eBay has 6.5 PB of user data, plus 50 TB more a day

● 2011: Yahoo! has 180-200 PB of data

● 2012: Facebook ingests 500 TB of data a day

How big is big?

Limitations of the Existing Analytics Architecture

The classic stack, top to bottom:

BI Reports + Online Apps
RDBMS (aggregated data)
ETL (Extract, Transform & Load)
Storage Grid
Data Collection
Instrumentation (Raw Data Sources)

Its problems:

- Moving data from storage to compute doesn't scale!
- Can't explore the original raw data; only aggregates survive the ETL step
- Mostly append
- Archiving = premature death of the data

Why Hadoop? Challenge: read 1 TB of data

1 Machine:
- 4 I/O channels
- Each channel: 100 MB/s
Result: about 45 minutes

10 Machines:
- 4 I/O channels
- Each channel: 100 MB/s
Result: about 4.5 minutes
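The arithmetic behind those figures, as a quick sanity check (a sketch; the slide rounds 42 minutes up to 45):

    # Reading 1 TB through parallel I/O channels.
    TB_IN_MB = 1_000_000          # 1 TB expressed in MB (decimal units)
    CHANNELS, MB_PER_S = 4, 100   # per machine, per the slide

    def read_minutes(machines: int) -> float:
        # Aggregate bandwidth grows linearly with the number of machines.
        bandwidth = machines * CHANNELS * MB_PER_S  # MB/s
        return TB_IN_MB / bandwidth / 60

    print(read_minutes(1))   # ~41.7 minutes (rounded to 45 on the slide)
    print(read_minutes(10))  # ~4.2 minutes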

Hadoop and Friends

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):

- The schema must be created before any data can be loaded

- An explicit load operation has to take place which transforms the data to the database's internal structure

- New columns must be added explicitly before new data for those columns can be loaded into the database

- The payoff: reads are fast, and standards/governance are built in

Schema-on-Read (Hadoop):

- Data is simply copied to the file store; no transformations are needed

- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)

- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it; see the sketch below

- The payoff: loads are fast, and you get flexibility/agility
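A minimal sketch of the late-binding idea (hypothetical data and field names; real Hadoop SerDes are Java classes implementing a Hive interface):

    # Raw lines were copied into the store as-is; no load step ever ran.
    raw = ["2015-01-27,click,42", "2015-01-28,view,7"]

    def serde_v2(line: str) -> dict:
        # The 'schema' lives in the reader. Adding the third field here
        # makes it appear retroactively for every file already stored.
        date, event, count = line.split(",")
        return {"date": date, "event": event, "count": int(count)}

    for line in raw:
        print(serde_v2(line))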

Hadoop Components: Master/Slave Architecture

Masters: Name Node (storage), Job Tracker (processing)

Slaves: Data Nodes (storage), Task Trackers (processing)

The Name Node holds the file metadata, mapping each file to its blocks:

/kenshoo/data1.txt ---> blocks 1, 2, 3
/kenshoo/data2.txt ---> blocks 4, 5

With replication factor r=3 (the dfs.replication property in hdfs-site.xml), each block is stored on three different Data Nodes.

[Diagram: blocks 1-5, each appearing on three of the Data Nodes]
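The corresponding setting in hdfs-site.xml (dfs.replication is the standard property name; 3 is also HDFS's default):

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>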

Underlying FS options

ext3- released in 2001 - Used by Yahoo!- bootstrap + format slow- set:

- noatime- tune2fs (to turn off reserved blocks)

ext4- released in 2008 - Used by Google- Fast as XFS- set:

- delayed allocation off-noatime- tune2fs (to turn off reserved blocks)

XFS- released in 1993 - Fast- Drawbacks:

- deleting large # of files

Sample HDFS Shell Commands

bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir

Mounting HDFS as a regular filesystem using FUSE:

hadoop-fuse-dfs dfs://10.73.9.50 /hdfs

Network Topology

[Diagram: a Name Node, Job Tracker and HBase Master serving Data Nodes spread across Racks 1-3]

Yahoo! Installation:

- 8 core switches
- 100 racks, 40 servers/rack
- 1 GBit within a rack, 10 GBit among racks
- Total: 11 PB

Rack Awareness

The Name Node knows which rack each Data Node sits in and spreads a block's replicas across racks, so losing a whole rack never loses every copy of a block:

metadata: file.txt = Blk A: DN 2, 7, 8 / Blk B: DN 9, 12, 14

[Diagram: Data Nodes 2-5 in Rack 1, 7-10 in Rack 2, 12-15 in Rack 3; block A's replicas sit on nodes 2, 7 and 8, block B's on 9, 12 and 14]
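A toy sketch of rack-aware placement for three replicas (a simplification under assumed rack names; the real placement policy also weighs the writer's location, node load and free space):

    import random

    # Assumed cluster map matching the diagram: rack -> Data Node ids.
    RACKS = {"rack1": [2, 3, 4, 5], "rack2": [7, 8, 9, 10], "rack3": [12, 13, 14, 15]}

    def place_replicas(local_rack: str) -> list:
        # First replica: a node in the writer's rack.
        first = random.choice(RACKS[local_rack])
        # Second and third: two different nodes in one other rack,
        # so no single rack holds all copies of the block.
        remote = random.choice([r for r in RACKS if r != local_rack])
        return [first] + random.sample(RACKS[remote], 2)

    print(place_replicas("rack1"))  # e.g. [2, 7, 8], like Blk A above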

HDFS Writes

To write a block, the client asks the Name Node where to put it, sends the data to the first Data Node, and the Data Nodes forward the block along a pipeline (across the core switch when replicas live in another rack) until all copies are written. The Name Node then records the block's locations:

metadata: file.txt = Blk A: DN 2, 7, 9

[Diagram: the client writes block A to Data Node 2 in Rack 1; the block is pipelined to Data Nodes 7 and 9 in Rack 2]
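The pipeline idea as a toy sketch (hypothetical function names; the real protocol streams packets and acknowledgements, which this skips):

    def write_block(block: str, pipeline: list) -> None:
        # The client only talks to the first Data Node; each node stores
        # the block and forwards it to the next node in the pipeline.
        if not pipeline:
            return
        first, *rest = pipeline
        print(f"DN {first}: stored block {block}, forwarding to {rest or 'nobody'}")
        write_block(block, rest)

    # The Name Node assigned Data Nodes 2, 7 and 9 for block A.
    write_block("A", [2, 7, 9])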

Reading Files

To read a file, the client asks the Name Node ("wanna read file1.txt"), receives the list of the file's blocks and the Data Nodes holding each replica, then fetches every block directly from one of those Data Nodes:

metadata: file.txt = Blk A: DN 2, 7, 8 / Blk B: DN 9, 12, 14

Client receives: File1.txt parts: Blk A: 2, 7, 8 / Blk B: 9, 12, 14

[Diagram: the client contacts the Name Node over the core switch, then reads blocks A and B directly from the Data Nodes]
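The read path as a toy sketch (an in-memory stand-in for the Name Node; a real client talks RPC and prefers the closest replica):

    # Assumed Name Node metadata: file -> block -> replica locations.
    NAMENODE = {"file1.txt": {"A": [2, 7, 8], "B": [9, 12, 14]}}

    def read_file(name: str) -> None:
        # Step 1: ask the Name Node for the block map ("wanna read file1.txt").
        blocks = NAMENODE[name]
        # Step 2: fetch each block straight from one of its Data Nodes;
        # the Name Node never sits on the data path.
        for block, replicas in blocks.items():
            print(f"reading block {block} from DN {replicas[0]}")

    read_file("file1.txt")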