An Introduction to Apache Hadoop, Mahout and HBase
![Page 1: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/1.jpg)
An Introduction to Hadoop, Mahout & HBase
Lukáš Vlček, JBUG Brno, May 2012
![Page 2: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/2.jpg)
![Page 3: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/3.jpg)
![Page 4: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/4.jpg)
Hadoop
• Open Source (ASL2) implementation of Google's MapReduce [1] and Google DFS (Distributed File System) [2] (a Lucene subproject from about 2006)
[1] MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat, Google labs, 2004
[2] The Google File System by Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003
![Page 5: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/5.jpg)
Hadoop
• MapReduce
  Simple programming model for data processing, putting parallel large-scale data analysis into the hands of the masses.
• HDFS
  A filesystem designed for storing large files with streaming data access patterns, running on clusters of commodity hardware.
• (Common, + other related projects...)
![Page 6: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/6.jpg)
MapReduce programming model
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(v3)
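To make the two signatures concrete, here is a minimal single-process sketch of the model. The driver `run_mapreduce` and the temperature example are illustrative assumptions, not Hadoop's API; a real cluster runs the map and reduce calls in parallel on many machines and moves the intermediate pairs over the network.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: each (k1, v1) pair may emit any number of (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: each k2 is reduced together with the list of all its v2 values.
    return {k2: list(reduce_fn(k2, groups[k2])) for k2 in sorted(groups)}


def map_fn(station, reading):
    # (k1, v1) = (station name, (year, temperature)) -> list of (year, temperature)
    year, temperature = reading
    return [(year, temperature)]


def reduce_fn(year, temperatures):
    # (k2, list(v2)) -> list(v3): the maximum temperature seen for that year
    return [max(temperatures)]


readings = [
    ("s1", (1949, 111)),
    ("s2", (1949, 78)),
    ("s1", (1950, 22)),
    ("s3", (1950, -11)),
]
result = run_mapreduce(readings, map_fn, reduce_fn)
print(result)  # {1949: [111], 1950: [22]}
```

The grouping step between the two phases is what the framework provides for free; the user only writes `map_fn` and `reduce_fn`.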
![Page 7: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/7.jpg)
MapReduce “Hello World”
• Counting the number of occurrences of each word in a collection of documents.
![Page 8: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/8.jpg)
MapReduce “Hello World”
map(k1,v1) → list(k2,v2)
map(key, value){
  // key: document name
  // value: document content
  for each word w in value {
    emitIntermediate(w,1)
  }
}
![Page 9: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/9.jpg)
MapReduce “Hello World”
reduce(k2,list(v2)) → list(v3)
reduce(key, values){
  // key: a word
  // values: a list of counts
  int result = 0;
  for each count v in values {
    result += v;
  }
  emit(result);
}
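The map and reduce pseudocode for word counting can be tried out locally with a short sketch like the one below. It is not Hadoop code: the function names are made up for illustration, words are assumed to be whitespace-separated and are lowercased, and the sort plus `groupby` step stands in for the shuffle that the framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def map_words(doc_name, content):
    # key: document name, value: document content
    for word in content.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    # key: a word, values: a list of counts
    yield sum(counts)


def word_count(documents):
    # Map every document, then sort the intermediate pairs by word so that
    # equal keys become adjacent (a stand-in for the shuffle/sort step).
    intermediate = [
        pair
        for name, text in documents.items()
        for pair in map_words(name, text)
    ]
    intermediate.sort(key=itemgetter(0))
    totals = {}
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        counts = [count for _, count in pairs]
        totals[word] = next(reduce_counts(word, counts))
    return totals


docs = {"a.txt": "hello world", "b.txt": "hello hadoop hello"}
counts = word_count(docs)
print(counts)  # {'hadoop': 1, 'hello': 3, 'world': 1}
```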
![Page 10: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/10.jpg)
MapReduce – benefits
• The model is easy to use
  It is quick to learn.
• Many problems are expressible as MapReduce computations
• MapReduce scales to large clusters
  The model makes it easy to parallelize and distribute your computation to thousands of machines!
![Page 11: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/11.jpg)
MapReduce – downsides
• The model is easy to use
  People tend to try the simplest approach first.
• Many problems are expressible as MapReduce computations
  MapReduce may not be the best model for you.
• MapReduce scales to large clusters
  It is so easy to overload the cluster with simple code.
![Page 12: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/12.jpg)
More elaborate examples
• Distributed PageRank
• Distributed Dijkstra's algorithm (almost)
Lectures to Google software engineering interns, Summer 2007
http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
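As a rough illustration of how PageRank fits the map/reduce shape, the sketch below runs the iterations in a single process. It is a simplified textbook formulation, not the code from the lectures: the damping factor of 0.85 is the conventional value and an assumption here, every page is assumed to have at least one outgoing link, every link target is assumed to be a known page, and all function names are hypothetical.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed conventional damping factor


def map_page(page, state):
    rank, links = state
    # Pass the link structure along so the reducer can rebuild the page state.
    yield page, ("links", links)
    # Split the current rank evenly among the outgoing links.
    for target in links:
        yield target, ("share", rank / len(links))


def reduce_page(page, values, num_pages):
    links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            incoming += payload
    rank = (1 - DAMPING) / num_pages + DAMPING * incoming
    return rank, links


def pagerank_iteration(graph):
    # One MapReduce job: map every page, group by target page, reduce.
    groups = defaultdict(list)
    for page, state in graph.items():
        for key, value in map_page(page, state):
            groups[key].append(value)
    return {
        page: reduce_page(page, values, len(graph))
        for page, values in groups.items()
    }


def pagerank(links, iterations=20):
    n = len(links)
    graph = {page: (1.0 / n, targets) for page, targets in links.items()}
    for _ in range(iterations):
        graph = pagerank_iteration(graph)
    return {page: rank for page, (rank, _) in graph.items()}


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Each iteration is a separate job whose output feeds the next one, which is one reason iterative algorithms are comparatively expensive on a batch system.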
![Page 13: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/13.jpg)
Google PageRank today?
• Aug 2009: Google moved its back-end indexing system away from MapReduce to a new search architecture, a.k.a. Caffeine.
http://en.wikipedia.org/wiki/Google_Search#Google_Caffeine
![Page 14: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/14.jpg)
Google Maps?
• “In particular, for large road networks it would be prohibitive to precompute and store shortest paths between all pairs of nodes.”
Engineering Fast Route Planning Algorithms, by Peter Sanders and Dominik Schultes, 2007
http://algo2.iti.kit.edu/documents/routeplanning/weaOverview.pdf
![Page 15: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/15.jpg)
Demand for Real-Time data
• MapReduce's batch-oriented processing of (large) data does not fit well with the growing demand for real-time data.
• Hybrid approach?
  Ted Dunning on Twitter's Storm:
  http://info.mapr.com/ted-storm-2012-03.html
  http://www.youtube.com/channel/UCDbTR_Z_k-EZ4e3JpG9zmhg
![Page 16: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/16.jpg)
![Page 17: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/17.jpg)
![Page 18: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/18.jpg)
![Page 19: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/19.jpg)
Mahout
• The goal is to build open source scalable machine learning libraries.
Started by: Isabel Drost, Grant Ingersoll, Karl Wettin
Map-Reduce for Machine Learning on Multicore, 2006
http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
![Page 20: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/20.jpg)
Implemented Algorithms
• Classification
• Clustering
• Pattern Mining
• Regression
• Dimension Reduction
• Evolutionary Algorithms
• Recommenders / Collaborative Filtering
• Vector Similarity
• ...
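To give a feel for one item on the list, vector similarity, here is a small sketch of cosine similarity between sparse vectors stored as dictionaries. It is plain Python for illustration only and does not use or mirror Mahout's API; the user names and ratings are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dict: feature -> weight)."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical ratings that three users gave to three items.
alice = {"hadoop": 5.0, "hbase": 3.0}
bob = {"hadoop": 4.0, "mahout": 2.0}
carol = {"mahout": 1.0}

print(cosine_similarity(alice, bob))    # about 0.77: one shared item
print(cosine_similarity(alice, carol))  # 0.0: nothing in common
```

Similarity scores of this kind are a common building block for recommenders and clustering.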
![Page 21: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/21.jpg)
Back to Hadoop
HDFS
![Page 22: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/22.jpg)
HDFS
• Very large files
  From GB up to PB.
• Streaming data access
  Write once, read many times.
  Optimized for high data throughput.
  Reads involve a large portion of the data.
• Commodity HW
  Clusters made of cheap, less reliable machines.
  The chance that an individual machine fails is high.
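A quick back-of-the-envelope calculation shows what "very large files on unreliable machines" means for storage. The sketch assumes a 64 MB block size and a replication factor of 3, which were the usual defaults at the time of the talk but are not stated on the slide; `hdfs_footprint` is a hypothetical helper, not an HDFS command.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_footprint(file_size, block_size=64 * MB, replication=3):
    """Estimate blocks and raw storage for one file (assumed defaults)."""
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block is not padded, so raw usage is simply size x replication.
        "raw_bytes": file_size * replication,
    }


footprint = hdfs_footprint(10 * GB)
print(footprint["blocks"])           # 160
print(footprint["block_replicas"])   # 480
print(footprint["raw_bytes"] // GB)  # 30
```

Replicating every block on several machines is how the filesystem tolerates the failure of individual nodes.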
![Page 23: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/23.jpg)
HDFS – don't!
• Low-latency data access
  HBase is better for low-latency access.
• Lots of small files
  NameNode memory limit.
• Multiple writes and file modifications
  Single file writer.
  Write at the end of file.
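The small-files problem can be estimated with simple arithmetic. The sketch below uses the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file, directory or block; that figure is an approximation and an assumption here, not a measured value, and `namenode_bytes` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb (assumption), not a measurement


def namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Roughly the same 10 TB of data stored in two different ways:
small = namenode_bytes(10_000_000)                # ten million 1 MB files
large = namenode_bytes(10_000, blocks_per_file=16)  # ten thousand 1 GB files, 64 MB blocks

print(small)  # 3000000000 bytes, about 3 GB of NameNode memory
print(large)  # 25500000 bytes, about 25 MB
```

Under these assumptions the same amount of data costs over a hundred times more NameNode memory when stored as small files.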
![Page 24: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/24.jpg)
HDFS: High Availability
• Currently being added to the trunk
http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
![Page 25: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/25.jpg)
HDFS: Security & File Appends
• Finally available as well, but probably in different branches.
http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/
http://www.cloudera.com/blog/2009/07/file-appends-in-hdfs/
![Page 26: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/26.jpg)
Improved POSIX support
• Available from third party vendors (for example MapR M3 or M5 edition)
![Page 27: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/27.jpg)
![Page 28: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/28.jpg)
![Page 29: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/29.jpg)
HBase
• Non-relational, auto re-balancing, fault tolerant distributed database
• Modeled after Google BigTable
• Initial prototype in 2007
• Canonical use case: webtable
Bigtable: A Distributed Storage System for Structured Data, many authors, Google labs (2006)
![Page 30: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/30.jpg)
HBase Conceptual view
Copyright © 2011, Lars George. All rights reserved.
![Page 31: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/31.jpg)
HBase
• Basic operations: Get, Put, Scan, Delete
• A {row, column, version} tuple identifies a cell
• Allows running Hadoop's MapReduce jobs
• Optimized for high throughput
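The data model behind these operations can be pictured with a toy in-memory table. This is only an illustration of how {row, column, version} addresses a cell; `ToyTable` is a made-up class, not the HBase client API, and it ignores regions, column family configuration, and the tombstone markers that real deletes use.

```python
import time


class ToyTable:
    """In-memory toy model of versioned cells (not the HBase client API)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {version: value}

    def put(self, row, column, value, version=None):
        if version is None:
            version = time.time_ns()  # default version: current timestamp
        self._cells.setdefault((row, column), {})[version] = value

    def get(self, row, column, version=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if version is None:
            version = max(versions)  # newest version wins by default
        return versions.get(version)

    def delete(self, row, column):
        self._cells.pop((row, column), None)

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                versions = self._cells[(row, column)]
                yield row, column, versions[max(versions)]


table = ToyTable()
table.put("com.cnn.www", "contents:html", "<html>v1</html>", version=1)
table.put("com.cnn.www", "contents:html", "<html>v2</html>", version=2)
table.put("com.example", "contents:html", "<html>hi</html>", version=1)

latest = table.get("com.cnn.www", "contents:html")            # "<html>v2</html>"
older = table.get("com.cnn.www", "contents:html", version=1)  # "<html>v1</html>"
rows = list(table.scan("com.cnn", "com.d"))                   # only the com.cnn.www row
```

The reversed-domain row keys follow the webtable example, so that pages of one site sort next to each other and can be read with a single scan.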
![Page 32: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/32.jpg)
Use Case: Real-time HBase Analytics
• Nice use case for real-time analysis by Sematext
http://blog.sematext.com/2012/04/22/hbase-real-time-analytics-rollbacks-via-append-based-updates/
http://blog.sematext.com/2012/04/27/hbase-real-time-analytics-rollbacks-via-append-based-updates-part-2/
![Page 33: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/33.jpg)
Use Case: Messaging Platform
• Facebook implemented its messaging system using HBase
http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
![Page 34: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/34.jpg)
Hadoop, Mahout and/or HBase are used by...
• Amazon (A9), Adobe, Ebay, Facebook, Google (university program), IBM, Infochimps, Krugle, Last.fm, LinkedIn, Microsoft, Rackspace, RapLeaf, Spotify, StumbleUpon, Twitter, Yahoo!
… many more!
![Page 35: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/35.jpg)
More Resources
• Hadoop: http://hadoop.apache.org/
• Mahout: http://mahout.apache.org/
• HBase: http://hbase.apache.org/book/book.html
![Page 36: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/36.jpg)
Thank you!
![Page 37: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/37.jpg)
Photo Sources
• http://www.flickr.com/photos/renwest/4909849477/ By renwest, CC BY-NC-SA 2.0
• http://www.flickr.com/photos/asianartsandiego/4838273718/ By Asian Curator at The San Diego Museum of Art, CC BY-NC-ND 2.0
• http://www.flickr.com/photos/zeepack/2932405424/ By ZeePack, CC BY-ND 2.0
• http://www.flickr.com/photos/16516252@N00/3132303565/ By blueboy1478, CC BY-ND 2.0
![Page 38: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/38.jpg)
Backup Slides: Anatomy of MapReduce Execution
• http://code.google.com/edu/parallel/mapreduce-tutorial.html#MRExec
![Page 39: An Introduction to Apache Hadoop, Mahout and HBase](https://reader036.fdocuments.net/reader036/viewer/2022081721/55509563b4c90595208b458f/html5/thumbnails/39.jpg)
Backup Slides: HDFS Architecture
• http://hadoop.apache.org/common/docs/current/hdfs_design.html