Hadoop by sunitha
Explained
Sunitha Raghurajan
Data…Data….Data….
• We live in a data world
• Total Facebook users: 835,525,280 (March 31st 2012)
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
http://www.internetworldstats.com/facebook.htm
Data…is growing ????
From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
Problem??
• How do we store and analyze the data?
• On a one-terabyte drive the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. Writing is even slower.
• If we had 100 drives, each holding one hundredth of the data, we could read them in parallel in under two minutes.
• Reliability issues (hard drive failure)
• How do we combine the data from 100 drives?
• Existing tools are inadequate for processing large data sets
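The arithmetic behind these numbers can be checked with a short sketch. It only uses the figures from the slide; the decimal unit conversion (1 TB = 1,000,000 MB) and the names below are assumptions made for illustration.

```python
# Back-of-the-envelope read times, using the figures from the slide.
# Assumption: decimal units, so 1 TB = 1,000,000 MB.
DRIVE_SIZE_MB = 1_000_000    # one terabyte of data
TRANSFER_MB_PER_S = 100      # about 100 MB/s per drive
NUM_DRIVES = 100             # each drive holds one hundredth of the data


def read_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time to stream size_mb megabytes at rate_mb_per_s."""
    return size_mb / rate_mb_per_s


single = read_seconds(DRIVE_SIZE_MB, TRANSFER_MB_PER_S)
parallel = read_seconds(DRIVE_SIZE_MB / NUM_DRIVES, TRANSFER_MB_PER_S)

print(f"one drive: {single / 3600:.2f} hours")                    # 2.78 hours
print(f"{NUM_DRIVES} drives in parallel: {parallel:.0f} seconds")  # 100 seconds
```

The parallel figure ignores the cost of combining the results and of drive failures, which are exactly the problems listed on this slide.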
Why can’t we use RDBMS?
• An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
• Longer time to read data
[Diagram: CPU, Memory, Disk]
Hadoop is the answer!
• Hadoop is an open source project licensed under the Apache v2 license: http://hadoop.apache.org/
• Used for processing large datasets in parallel on low-cost commodity machines.
• Hadoop is built on two main parts: a special file system called the Hadoop Distributed File System (HDFS) and the MapReduce framework.
Hadoop History
• Hadoop was created by Doug Cutting, who named it after his son's toy elephant.
• 2002-2004: Nutch, an open source, web-scale, crawler-based search engine
• 2004-2006: Google File System & MapReduce papers published. DFS & MapReduce implementations added to Nutch
• 2006-2008: Yahoo hired Doug Cutting
• On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
Who uses Hadoop?
Amazon        American Airlines
AOL           Apple
eBay          Federal Reserve Board of Governors
foursquare    Fox Interactive Media
Facebook      StumbleUpon
Gemvara       Hewlett-Packard
IBM           Microsoft
Twitter       NYTimes
Netflix       LinkedIn
Why Hadoop?
• Reliable: The software is fault tolerant; it expects and handles hardware and software failures
• Scalable: Designed for massive scale of processors, memory, and locally attached storage
• Distributed: Handles replication. Offers a massively parallel programming model, MapReduce
What is MapReduce???
– Programming model used by Google
– A combination of the Map and Reduce models with an associated implementation
– Used for processing and generating large data sets
MapReduce Explained
• The basic idea is that you divide the job into two parts: a Map and a Reduce.
• Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines, so all the pieces run at the same time.
• Reduce takes the results from the sub-parts and combines them back together to get a single answer.
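The split / run in parallel / combine idea can be sketched in a few lines of plain Python. This is not Hadoop code: threads stand in for the "different machines", and the function names (`split`, `map_part`, `reduce_parts`) are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the problem into roughly equal sub-parts."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_part(part):
    """Work done on one sub-part: here, a partial sum."""
    return sum(part)


def reduce_parts(partials):
    """Combine the partial results into a single answer."""
    return sum(partials)


numbers = list(range(1, 1001))

# Threads are a stand-in for separate machines (an assumption of this sketch).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_part, split(numbers, 4)))

total = reduce_parts(partials)
print(total)  # 500500
```

Each sub-part is processed without looking at the others, which is what lets the pieces run at the same time.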
Distributed Grep
[Diagram: very big data → split data (×4) → grep (×4) → matches (×4) → cat → all matches]
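The diagram can be followed step by step in a small Python sketch: split the data, grep every split independently, then concatenate the matches. The sample log lines and helper names are invented for illustration, and a thread pool stands in for the cluster.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_data(lines, parts):
    """Split the input into at most `parts` chunks of consecutive lines."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep(pattern, lines):
    """Return the lines of one chunk that match the pattern."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def cat(match_lists):
    """Concatenate the per-chunk matches into one list."""
    return [line for matches in match_lists for line in matches]


# Hypothetical input, standing in for the "very big data".
log = [
    "INFO start job",
    "ERROR disk failure on node3",
    "INFO map 50%",
    "WARN slow task",
    "ERROR lost tasktracker",
    "INFO map 100%",
    "INFO reduce 100%",
    "ERROR job failed",
]

chunks = split_data(log, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    per_chunk = list(pool.map(lambda chunk: grep(r"^ERROR", chunk), chunks))

all_matches = cat(per_chunk)
print(all_matches)
```

Because `pool.map` returns results in chunk order, the matches come out in the same order as in the original data.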
MAPREDUCE ARCHITECTURE
How Map and Reduce Work Together
• Map:
– Accepts input key/value pair
– Emits intermediate key/value pair
• Reduce:
– Accepts intermediate key/value* pair
– Emits output key/value pair
[Diagram: very big data → MAP → partitioning function → REDUCE → result]
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
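A word count is a common way to see the key/value flow. The sketch below is plain Python running in a single process, not the Hadoop API; `map_fn`, `partition`, `reduce_fn`, and `run_job` are hypothetical names, and the partitioning rule is a toy chosen only because it is deterministic.

```python
from collections import defaultdict


def map_fn(key, value):
    """Input pair (line number, line text) -> intermediate (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def partition(key, num_reducers):
    """Pick a reducer for an intermediate key (toy rule: sum of character codes)."""
    return sum(ord(c) for c in key) % num_reducers


def reduce_fn(key, values):
    """Intermediate (word, [1, 1, ...]) -> output (word, count)."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    # One bucket per reducer; each bucket groups the values by key (key/value*).
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            buckets[partition(ikey, num_reducers)][ikey].append(ivalue)

    output = {}
    for bucket in buckets:
        for ikey in sorted(bucket):
            okey, ovalue = reduce_fn(ikey, bucket[ikey])
            output[okey] = ovalue
    return output


records = [(0, "big data big cluster"), (1, "data node data")]
counts = run_job(records)
print(counts)
```

The partitioning function guarantees that all values for one key reach the same reducer, so each reducer can produce the final count for its keys on its own.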
RDBMS compared to MapReduce
             RDBMS                        MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Structure    Static schema                Dynamic schema
Hadoop Family
• Pig: a platform for manipulating large data sets (scripting)
• Mahout: machine learning algorithms
• HBase: Bigtable-like structured storage for Hadoop HDFS (non-relational database)
• Hive: data warehouse system
• HDFS: distributes and replicates data among machines
• MapReduce: distributes and monitors tasks
• Hadoop Common
• ZooKeeper: distributed coordination service
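What "distributes and replicates data among machines" means for HDFS can be pictured with a toy placement routine. This is only a sketch: the 64 MB block size and replication factor of 3 are assumed values, the round-robin rule is not HDFS's real placement policy, and `place_blocks` and `readable_after_failure` are hypothetical helpers.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=64, replication=3):
    """Cut a file into blocks and assign each block to `replication` distinct nodes.

    Toy round-robin placement; block size and replication factor are assumptions.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(200, nodes)

for block, replicas in placement.items():
    print(block, replicas)
print(readable_after_failure(placement, "node2"))  # True
```

Because every block lives on several machines, losing one machine does not lose any data, which is the basis of the fault tolerance claimed earlier in the deck.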
When to use Hadoop?
• Complex information processing is needed
• Unstructured data needs to be turned into structured data
• Queries can’t be reasonably expressed using SQL
• Heavily recursive algorithms
• Complex but parallelizable algorithms needed, such as geospatial analysis or genome sequencing
• Machine learning
• Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB)
• Data value does not justify the expense of constant real-time availability, such as archives or special-interest info, which can be moved to Hadoop and remain available at lower cost
• Results are not needed in real time
• Fault tolerance is critical
• Significant custom coding would be required to handle job scheduling
• Reference: http://timoelliott.com/blog/2011/09/hadoop-big-data-and-enterprise-business-intelligence.html
Building Blocks of Hadoop
• Running Hadoop means running a set of daemons on different servers on the network:
– NameNode
– DataNode
– Secondary NameNode
– JobTracker
– TaskTracker
Questions?
References
• Hadoop in Action, by Chuck Lam
• Hadoop: The Definitive Guide, by Tom White
• http://hadoop.apache.org/