Real-Time Hadoop + MapReduce Intro
Transcript of Real time hadoop + mapreduce intro
2013: year of real-time access to Big Data?
Geoffrey Hendrey
@geoffhendrey
@vertascale
Agenda
• Hadoop MapReduce basics
• Hadoop stack & data formats
• File access times and mechanics
• Key-based indexing systems (HBase)
• MapReduce, Hive/Pig
• MPP approaches & alternatives
A very bad* diagram
*this diagram makes it appear that data flows through the master node.
A better picture
Map and Reduce Java Code
Reduce
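The map and reduce code shown on these slides is not captured in this transcript. As a stand-in, here is a minimal plain-Java sketch of the classic word-count logic that Hadoop's Mapper and Reducer express; the grouping step stands in for Hadoop's shuffle, and there is no Hadoop dependency, so it runs anywhere.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" step: emit a (word, 1) pair for every token in a line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle": group emitted values by key, as Hadoop does between map and reduce.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // "Reduce" step: sum the values for one key, e.g. Bear:[1,1] -> Bear:2.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }
}
```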
Reducer Group Iterators
• The reducer groups values together by key; your code iterates over the values and emits the reduced result
• Hadoop's reducer value iterator returns THE SAME OBJECT on each next(). The object is "reused" to reduce garbage-collection load
• Beware of "reused" objects (this is a VERY common cause of long and confusing debugging sessions)
• Cause for concern: if you emit an object with non-primitive values, it can carry STALE "reused-object" state from the previous value
Bear:[1,1] → Bear:2
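The reused-object pitfall can be demonstrated without Hadoop: the sketch below simulates an iterator that hands back the same mutable object on every next(), as Hadoop's value iterator does, and shows why values you intend to keep must be copied. MutableInt and the iterator are hypothetical stand-ins for Hadoop's IntWritable and the reducer's value Iterable.

```java
import java.util.*;

public class ReusedObjectPitfall {
    // Stand-in for Hadoop's IntWritable: a mutable boxed int.
    static class MutableInt {
        int value;
        MutableInt(int v) { value = v; }
    }

    // Simulates Hadoop's value iterator: ONE object, mutated on each next().
    static Iterator<MutableInt> reusingIterator(int[] values) {
        MutableInt reused = new MutableInt(0);
        return new Iterator<MutableInt>() {
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public MutableInt next() { reused.value = values[i++]; return reused; }
        };
    }

    // BROKEN: stores references, which all point at the same reused object,
    // so after the loop every element shows the stale final value.
    public static List<MutableInt> collectBroken(int[] values) {
        List<MutableInt> out = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(values);
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    // CORRECT: copy each value before storing it.
    public static List<MutableInt> collectCopied(int[] values) {
        List<MutableInt> out = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(values);
        while (it.hasNext()) out.add(new MutableInt(it.next().value));
        return out;
    }
}
```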
Hadoop Writables
• Values in Hadoop are transmitted (shuffled, emitted) in a binary format
• Hadoop includes primitive types: IntWritable, Text, LongWritable, etc.
• You must implement the Writable interface for custom objects

```java
public void write(DataOutput d) throws IOException {
    d.writeUTF(this.string);
    d.writeByte(this.column);
}

public void readFields(DataInput di) throws IOException {
    this.string = di.readUTF();
    this.column = di.readByte();
}
```
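The write()/readFields() pair above can be exercised end to end with plain java.io streams (Hadoop's DataOutput and DataInput are exactly these interfaces), so the round trip compiles and runs without Hadoop on the classpath. The ColumnRef class is a hypothetical example, not from the slides' project.

```java
import java.io.*;

// Hypothetical custom value mirroring the slide's Writable contract.
public class ColumnRef {
    String string;
    byte column;

    public void write(DataOutput d) throws IOException {
        d.writeUTF(this.string);
        d.writeByte(this.column);
    }

    public void readFields(DataInput di) throws IOException {
        this.string = di.readUTF();
        this.column = di.readByte();
    }

    // Round-trip helper: serialize to bytes, deserialize into a fresh object.
    public static ColumnRef roundTrip(ColumnRef in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bytes));
        ColumnRef out = new ColumnRef();
        out.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return out;
    }
}
```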
Hadoop Keys (WritableComparable)
• Be very careful to implement equals() and hashCode() consistently with compareTo()
• compareTo() controls the sort order of keys arriving at the reducer
• Hadoop also lets you write a custom partitioner

```java
public int getPartition(Document doc, Text v, int numReducers) {
    return doc.getDocId() % numReducers;
}
```
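One caveat about the modulo partitioner above: Java's % can return a negative result, so it misbehaves if getDocId() is ever negative. A common defensive variant masks off the sign bit. This is a hypothetical plain-Java sketch of the bucket logic, outside Hadoop's Partitioner class:

```java
public class DocPartitioner {
    // Maps a document id to one of numReducers buckets.
    // Masking with Integer.MAX_VALUE clears the sign bit, keeping the
    // result in [0, numReducers) even for negative ids.
    public static int getPartition(int docId, int numReducers) {
        return (docId & Integer.MAX_VALUE) % numReducers;
    }
}
```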
Typical Hadoop File Formats
Hadoop Stack Review
Distributed File System
HDFS performance characteristics
• HDFS was designed for high throughput, not low seek latency
• Best-case configurations have shown HDFS performing ~92K random reads per second [http://hadoopblog.blogspot.com/]
• Personal experience: HDFS is very robust. Fault tolerance is "real": I've unplugged machines and never lost data.
Motivation for Real-time Hadoop
• Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can't scale
– Small samples often fail to replicate issues
• Engineers, data scientists, and analysts need:
– Faster "time to answer" on Big Data
– Rapid "find, quantify, extract"
• Solve "I don't know what I don't know"
• MapReduce jobs are hard to debug
Survey of real-time capabilities
• Real-time, in-situ, self-service is the "Holy Grail" for the business analyst
• A spectrum of real-time capabilities exists on Hadoop:

Easy → Hard
Available in Hadoop → Proprietary
HDFS → HBase → Drill
Real-time spectrum on Hadoop

| Use Case | Support | Real-time |
|---|---|---|
| Seek to a particular byte in a distributed file | HDFS | YES |
| Seek to a particular value in a distributed file, by key (1-dimensional indexing) | HBase | YES |
| Answer complex questions expressible in code (e.g. matching users to music albums). Data science. | MapReduce (Hive, Pig) | NO |
| Ad-hoc query for scattered records given simple constraints (`field[4]=="music" && field[9]=="dvd"`) | MPP architectures | YES |
Hadoop Underpinned By HDFS
• Hadoop Distributed File System (HDFS)
• Inspired by the Google File System (GFS)
• Underpins every piece of data in "Hadoop"
• The Hadoop FileSystem API is pluggable
• HDFS can be replaced with another suitable distributed filesystem
– S3
– Kosmos
– etc.
Amazon S3
MapFile for real-time access?
– Index file must be loaded by the client (slow)
– Index file must fit in client RAM by default
– Scans an average of 50% of the sampling interval
– Large records make scanning intolerable
– Not a viable "real world" solution for random access
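The "50% of the sampling interval" point is simple arithmetic: MapFile indexes every Nth key (128 by default in Hadoop), so a lookup seeks to the nearest indexed entry and then scans forward an expected N/2 records. A sketch of the expected cost, assuming uniformly distributed lookups:

```java
public class MapFileScanCost {
    // Expected number of records scanned per lookup when every
    // intervalth key is indexed: on average, half the interval.
    public static double expectedRecordsScanned(int indexInterval) {
        return indexInterval / 2.0;
    }

    // Expected bytes read per lookup for a given average record size;
    // this is why large records make scanning intolerable.
    public static double expectedBytesScanned(int indexInterval, int avgRecordBytes) {
        return expectedRecordsScanned(indexInterval) * avgRecordBytes;
    }
}
```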
Apache HBase
• Clone of Google's Bigtable
• Key-based access mechanism
• Designed to hold billions of rows
• "Tables" stored in HDFS
• Supports MapReduce over tables and into tables
• Requires you to think hard and commit to a key design
HBase Architecture
HBase random read performance
http://hstack.org/hbase-performance-testing/
• 7 servers, each with:
– 8 cores
– 32GB DDR3
– 24 x 146GB SAS 2.0 10K RPM disks
• HBase table:
– 3 billion records
– 6,600 regions
– row size between 128 and 256 bytes, spread across 1 to 5 columns
Zoomed-in “Get” time histogram
http://hstack.org/hbase-performance-testing/
MapReduce
• "MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers" (Wikipedia)
• In Hadoop, MapReduce is strongly tied to HDFS.
• Systems built on HDFS (e.g. HBase) leverage this common foundation for integration with the MR paradigm.
MapReduce and Data Science
• Many complex algorithms can be expressed in the MapReduce paradigm:
– NLP
– Graph processing
– Image codecs
• The more complex the algorithm, the more the Map and Reduce steps become complex programs in their own right.
• Multiple MR jobs are often cascaded in succession.
Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies that are hard to improve:
– Copy
– Shuffle, sort
– Iterate
• Runtime depends on both the size of the input data and the number of processors available
• In a nutshell, it's a "batch process" and isn't "real-time"
Hive and Pig
• Run on top of MapReduce
• Provide a "table" metaphor familiar to SQL users
• Provide SQL-like (or, in Hive's case, essentially SQL) syntax
• Store a "schema" in a database, mapping tables to HDFS files
• Translate "queries" into MapReduce jobs
• No more real-time than MapReduce itself
MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
Examples:
• Spark: a general-purpose data-science framework, sort of like real-time MapReduce for data science
• Dremel: a columnar approach, geared toward answering SQL-like aggregations and BI-style questions
Spark
• Originally designed for iterative machine learning problems at Berkeley
• MapReduce does not do a great job on iterative workloads
• Spark makes more explicit use of memory caches than Hadoop
• Spark can load data from any Hadoop input source
Effect of Memory Caching in Spark
Is Spark Real-time?
• If data fits in memory, execution time for most algorithms still depends on:
– the amount of data to be processed
– the number of processors
• So, it still "depends"
• ...but Spark is definitely more focused on fast time-to-answer
• Interactive Scala and Java shells
Dremel MPP architecture
• MPP architecture for ad-hoc query on nested data
• Apache Drill is an open-source clone of Dremel
• Dremel was originally developed at Google
• Features "in situ" data analysis
• "Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations." (Dremel: Interactive Analysis of Web-Scale Datasets)
In Situ Analysis
• Moving Big Data is a nightmare
• In situ: the ability to access data in place
– in HDFS
– in Bigtable
Uses For Dremel At Google
• Analysis of crawled web documents
• Tracking install data for applications on Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google's distributed build system
• Etc.
Why so many uses for Dremel?
• On any Big Data problem or application, the dev team faces these problems:
– "I don't know what I don't know" about the data
– Debugging often requires finding and correlating specific needles in the haystack
– Support and marketing often require segmentation analysis (identify and characterize wide swaths of data)
• Every developer/analyst wants:
– Faster time to answer
– Fewer trips around the mulberry bush
Column Oriented Approach
Dremel MPP query execution tree
Is Dremel real-time?
Alternative approaches?
• Both MapReduce and MPP query architectures take a "throw hardware at the problem" approach.
• Alternatives?
– Use MapReduce to build distributed indexes on the data
– Combine columnar storage and inverted indexes to create columnar inverted indexes
– Aim for the sweet spot for data scientists and engineers: ad-hoc queries with results returned in seconds on a single processing node
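The columnar-inverted-index idea can be illustrated in miniature: keep one posting list of row ids per (column, value) pair, so a constraint like field[4]=="music" becomes a set lookup, and an AND of constraints becomes a posting-list intersection. A hypothetical single-node sketch, not any particular product's implementation:

```java
import java.util.*;

public class ColumnarInvertedIndex {
    // Posting lists: column -> value -> sorted set of row ids.
    private final Map<Integer, Map<String, TreeSet<Integer>>> index = new HashMap<>();

    // Index one row: record the row id under each column's value.
    public void addRow(int rowId, String[] fields) {
        for (int col = 0; col < fields.length; col++) {
            index.computeIfAbsent(col, c -> new HashMap<>())
                 .computeIfAbsent(fields[col], v -> new TreeSet<>())
                 .add(rowId);
        }
    }

    // Rows where field[col] == value (empty set if none).
    public SortedSet<Integer> lookup(int col, String value) {
        return index.getOrDefault(col, Collections.emptyMap())
                    .getOrDefault(value, new TreeSet<>());
    }

    // AND of two constraints: intersect the two posting lists.
    public SortedSet<Integer> and(int colA, String valA, int colB, String valB) {
        TreeSet<Integer> result = new TreeSet<>(lookup(colA, valA));
        result.retainAll(lookup(colB, valB));
        return result;
    }
}
```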
References
• http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
• http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
• http://yoyoclouds.wordpress.com/tag/hdfs/
• http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• http://hadoopblog.blogspot.com/
• http://lunarium.info/arc/images/Hbase.gif
• http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon-s3_growth_2012_q1_1.png
• http://hstack.org/hbase-performance-testing/
• http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg
• http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
• Dremel: Interactive Analysis of Web-Scale Datasets