Hadoop Summit 2014 - recap
Hadoop Summit 2014
What’s cookin’?
Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com
Hard work during the day
A quick bite to eat
More hard work during the night
The End
Overview
• YARN
• Tez
• Spark
• BlinkDB
• Summingbird
• Storm
• ML
YARN
• Supports workloads other than MapReduce
YARN
• Allows other applications to ‘go distributed’ on top of HDFS
YARN cluster architecture
Tez
• Execution engine on YARN
• Complex graphs of tasks for processing data
Tez
• Hive and Pig can use Tez since version 0.13
• 2-3x performance increase compared to older Hive and Pig versions
• Tez does performance optimisations and resource management across the cluster
• Reuses containers and JVMs: effective for short queries in e.g. Hive.
• Multiple jobs at the same time
Spark
The new kid on the block
BlinkDB
Interactive queries on Very Large Data, based on sampling
BlinkDB
• Offline sampling module
• Compute data samples, based on a ‘storage budget’
• Store samples on disk and in memory
• Sample selection module
• Select the right samples for an incoming query
• Query execution in parallel
• Answers are augmented by error and confidence bounds
BlinkDB
• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster, answering a range of queries on 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
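The sampling idea above can be illustrated with a minimal Python sketch. It is not BlinkDB's implementation or API: it assumes plain uniform random sampling and a normal-approximation confidence interval, and the function name `approximate_mean` and the synthetic data set are hypothetical.

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval when z = 1.96 (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    # Unbiased sample variance (divide by n - 1).
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)
    return mean, margin


# Hypothetical data set: synthetic "session lengths" with a true mean of 30.
data_rng = random.Random(42)
data = [data_rng.expovariate(1 / 30.0) for _ in range(200_000)]

# Answer the query from 5% of the rows and report the error bound with it.
estimate, margin = approximate_mean(data, sample_size=10_000)
exact = sum(data) / len(data)
print(f"approximate: {estimate:.2f} +/- {margin:.2f}")
print(f"exact:       {exact:.2f}")
```

The trade-off is the one the slides describe: the answer touches only a fraction of the data, and the error bound tells the user how far to trust it.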
Summingbird
• Write MapReduce programs that look like native Java or Scala collection transformations
• Platform-agnostic
• Execute on a number of distributed MapReduce platforms, like Scalding (Hadoop) or Storm
• The same code can run for batch and streaming
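To make the "same code for batch and streaming" point concrete, here is a rough sketch in Python rather than Scala. It does not use Summingbird's API; all function names are hypothetical. The only assumption is that the per-record logic (`count_words`) and an associative merge (`merge`) are written once and then driven by either a batch loop or a streaming loop.

```python
from collections import Counter


def count_words(sentence):
    """Map step: turn one sentence into partial word counts."""
    return Counter(sentence.lower().split())


def merge(left, right):
    """Reduce step: combine two partial counts (associative)."""
    return left + right


def run_batch(sentences):
    """Batch mode: process the whole data set, return one final result."""
    total = Counter()
    for sentence in sentences:
        total = merge(total, count_words(sentence))
    return total


def run_streaming(sentence_stream):
    """Streaming mode: yield an updated total after every event."""
    total = Counter()
    for sentence in sentence_stream:
        total = merge(total, count_words(sentence))
        yield total


sentences = ["to be or not to be", "to stream or to batch"]
batch_result = run_batch(sentences)
stream_result = list(run_streaming(iter(sentences)))[-1]
print(batch_result == stream_result)
```

Because the merge is associative, partial results from a batch layer and a streaming layer can be combined, which is what makes this style a natural fit for the lambda architecture.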
Summingbird
• Word-count in pure Scala
[code example not preserved in transcript]
• In Summingbird
[code example not preserved in transcript]
Summingbird
• ‘Strongly encourages’ the lambda architecture
Storm (on YARN)
Stream data processing on Hadoop.
Storm recap:
• Processes unbounded streams of tuples.
• Basic primitives are spouts and bolts.
• A spout is a source of streams.
• A bolt processes streams and may emit new streams.
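The spout and bolt roles can be mimicked with Python generators. This is only a single-process illustration of the data flow, not Storm's API: the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and nothing here is distributed or fault tolerant.

```python
import itertools
import random


def sentence_spout(seed=0):
    """Spout: an unbounded source of tuples (here, 1-tuples of text)."""
    rng = random.Random(seed)
    sentences = ["the cow jumped over the moon", "an apple a day"]
    while True:
        yield (rng.choice(sentences),)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the "topology": spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

# The stream never ends, so take only the first ten emitted tuples.
first_ten = list(itertools.islice(topology, 10))
print(first_ten)
```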
Storm (on YARN)
Storm Alternatives
• Spark Streaming
Machine Learning
Sparse Data Representation
uid1: url1, url2, url4, url6, url7, url8
uid2: url2, url3, url5, url9, url10, url11
uid1: 11010111000
uid2: 01101000111
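The mapping between the two representations above can be sketched in Python. The vocabulary of eleven URLs (`url1` to `url11`) is an assumption inferred from the length of the bit vectors, and the helper names are hypothetical.

```python
def to_bit_vector(urls, vocabulary):
    """Dense form: one character per known URL, '1' if the user visited it."""
    visited = set(urls)
    return "".join("1" if url in visited else "0" for url in vocabulary)


def to_sparse_indices(bits):
    """Sparse form: keep only the positions of the non-zero entries."""
    return [i for i, bit in enumerate(bits) if bit == "1"]


# Assumed vocabulary: url1 .. url11, in that order.
vocabulary = [f"url{i}" for i in range(1, 12)]

visits = {
    "uid1": ["url1", "url2", "url4", "url6", "url7", "url8"],
    "uid2": ["url2", "url3", "url5", "url9", "url10", "url11"],
}

vectors = {uid: to_bit_vector(urls, vocabulary) for uid, urls in visits.items()}
for uid, bits in vectors.items():
    print(uid, bits, to_sparse_indices(bits))
```

With millions of possible URLs and only a handful visited per user, storing the indices instead of the full vector is what keeps the data manageable.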
Machine Learning
Options on Hadoop
• Python with UDF
• MLlib
• Mahout
• SparkR
Mahout
• A scalable machine learning library
The Mahout community decided to move its codebase onto […] systems that offer a richer programming model and more efficient execution than Hadoop MapReduce.
Mahout will therefore reject new MapReduce algorithm implementations from now on.
We are building our future implementations on top of a DSL […]. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
https://mahout.apache.org/
Machine Learning
Trends
• Sparse data representation
• Deep learning
• Anomaly detection
and a lot more… (come talk to us :))