Hadoop Summit 2014 - recap

Hadoop Summit 2014 What’s cookin? Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com


Page 1: Hadoop Summit 2014 - recap

Hadoop Summit 2014

What’s cookin?

Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com

Page 2: Hadoop Summit 2014 - recap

Hard work during the day

Page 3: Hadoop Summit 2014 - recap

A quick bite to eat

Page 4: Hadoop Summit 2014 - recap

More hard work during the night

Page 5: Hadoop Summit 2014 - recap

The End

Page 6: Hadoop Summit 2014 - recap

Overview

• YARN

• Tez

• Spark

• BlinkDB

• Summingbird

• Storm

• ML

Page 7: Hadoop Summit 2014 - recap

YARN

• Support workloads other than MapReduce

Page 8: Hadoop Summit 2014 - recap

YARN

• Allow other apps to ‘go distributed’ on top of HDFS

Page 9: Hadoop Summit 2014 - recap

YARN cluster architecture

Page 10: Hadoop Summit 2014 - recap

Tez

• Execution engine on YARN

• Complex graphs of tasks for processing data
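To make "graphs of tasks" concrete, here is a minimal sketch in plain Python. It is not the Tez API: the task names, the `run_dag` helper and the sample data are all hypothetical, and everything runs in one process. It only shows the idea that each task starts once its upstream tasks have produced their output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "write": {"aggregate"},
}

# Hypothetical task bodies; inputs arrive in alphabetical order of dependency name.
tasks = {
    "scan_orders": lambda: [("u1", 10), ("u2", 5), ("u1", 7)],
    "scan_users": lambda: {"u1": "DK", "u2": "NL"},
    "join": lambda orders, users: [(users[u], amount) for u, amount in orders],
    "aggregate": lambda rows: {
        country: sum(a for c, a in rows if c == country) for country, _ in rows
    },
    "write": lambda totals: sorted(totals.items()),
}


def run_dag(dag, tasks):
    """Run every task after all of its upstream tasks have finished."""
    results = {}
    for name in TopologicalSorter(dag).static_order():
        inputs = [results[dep] for dep in sorted(dag[name])]
        results[name] = tasks[name](*inputs)
    return results


results = run_dag(dag, tasks)
print(results["write"])
```

A real engine would schedule independent tasks (here the two scans) in parallel across the cluster; the sketch runs them one after another.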

Page 11: Hadoop Summit 2014 - recap

Tez

• Hive and Pig can use Tez since version 0.13

• 2-3x performance increase compared to older Hive and Pig versions

• Tez does performance optimisations and resource management across the cluster

• Reuses containers and JVMs: effective for short queries, e.g. in Hive

• Runs multiple jobs at the same time

Page 12: Hadoop Summit 2014 - recap

Spark

The new kid on the block

Page 13: Hadoop Summit 2014 - recap

BlinkDB

Interactive queries on Very Large Data, based on sampling

Page 14: Hadoop Summit 2014 - recap

BlinkDB

• Offline sampling module

• Compute data samples, based on a ‘storage budget’

• Store samples on disk and in memory

• Sample selection module

• Select the right samples for an incoming query

• Query execution in parallel

• Answers are augmented by error and confidence bounds
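The idea of answering from a sample and attaching an error bound can be sketched in a few lines of Python. This is an illustration only, not BlinkDB code: it assumes a single uniform random sample and a normal-approximation confidence interval, and the function name `approximate_mean`, the synthetic data and the sample size are all hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin). With z=1.96, estimate +/- margin is an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


# Synthetic "large" table column: 100,000 values between 0 and 100.
data_rng = random.Random(42)
population = [data_rng.uniform(0, 100) for _ in range(100_000)]

estimate, margin = approximate_mean(population, sample_size=1_000)
exact = statistics.fmean(population)
print(f"approximate: {estimate:.2f} +/- {margin:.2f}, exact: {exact:.2f}")
```

The query touches 1% of the rows, and the margin shrinks with the square root of the sample size, which is the trade-off a ‘storage budget’ controls.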

Page 15: Hadoop Summit 2014 - recap

BlinkDB

• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster, answering a range of queries on 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.

Page 16: Hadoop Summit 2014 - recap

Summingbird

• Write MapReduce programs that look like native Java or Scala collection transformations

• Platform-agnostic

• Execute on a number of distributed MapReduce platforms, like Scalding (Hadoop) or Storm

• The same code can run for batch and streaming
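The "same code for batch and streaming" point can be illustrated without Summingbird itself. The sketch below is plain Python, not the Summingbird API, and all function names are hypothetical. The per-record logic (`word_counts`) is written once; because merging counts is associative, the same logic can be folded over a whole dataset or applied one event at a time.

```python
from collections import Counter


def word_counts(sentence):
    """Per-record logic, shared by both modes: sentence -> word counts."""
    return Counter(sentence.lower().split())


def run_batch(sentences):
    """Batch mode: fold the per-record counts over the full dataset."""
    total = Counter()
    for sentence in sentences:
        total.update(word_counts(sentence))
    return total


def run_streaming(stream):
    """Streaming mode: yield a snapshot of the running counts per event."""
    total = Counter()
    for sentence in stream:
        total.update(word_counts(sentence))
        yield dict(total)


sentences = ["to be or not", "to be"]
batch = run_batch(sentences)
snapshots = list(run_streaming(iter(sentences)))
print(batch)
print(snapshots[-1])
```

After the last event the streaming snapshot equals the batch result, which is the property that lets one job definition serve both modes.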

Page 17: Hadoop Summit 2014 - recap

Summingbird

• Word-count in pure Scala

• In Summingbird

Page 18: Hadoop Summit 2014 - recap

Summingbird

• ‘Strongly encourages’ the lambda architecture

Page 19: Hadoop Summit 2014 - recap

Storm (on YARN)

Stream data processing on Hadoop.

Storm recap:

• Processes unbounded streams of tuples.

• Basic primitives are spouts and bolts.

• A spout is a source of streams.

• A bolt processes streams and may emit new streams.
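The spout and bolt roles can be mimicked with Python generators. This is a single-process sketch, not the Storm API: the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and the spout here is finite so the example terminates, whereas a real stream is unbounded.

```python
def sentence_spout():
    """Spout: the source of the stream, emitting one-field tuples."""
    for sentence in ["the quick brown fox", "the lazy dog"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps running counts and emits (word, count) on every update."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
emitted = list(topology)
print(emitted)
```

In Storm the same wiring is distributed: each spout and bolt runs as many parallel tasks, and tuples are routed between them over the network.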

Page 20: Hadoop Summit 2014 - recap

Storm (on YARN)

Page 21: Hadoop Summit 2014 - recap

Storm Alternatives

• Spark Streaming

Page 22: Hadoop Summit 2014 - recap

Machine Learning

Sparse Data Representation

uid1: url1, url2, url4, url6, url7, url8

uid2: url2, url3, url5, url9, url10, url11

uid1: 11010111000

uid2: 01101000111
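The mapping between the two views above can be reproduced in Python. The sketch assumes the vocabulary is exactly url1 to url11 in that order, which is what the 11-digit vectors on the slide suggest; the helper names are hypothetical.

```python
# Sparse view: each user lists only the URLs they visited.
visits = {
    "uid1": ["url1", "url2", "url4", "url6", "url7", "url8"],
    "uid2": ["url2", "url3", "url5", "url9", "url10", "url11"],
}

# Assumed vocabulary: url1 .. url11, in order.
vocabulary = [f"url{i}" for i in range(1, 12)]


def to_bit_vector(urls, vocabulary):
    """Dense view: one bit per vocabulary entry, 1 if the URL was visited."""
    visited = set(urls)
    return "".join("1" if url in visited else "0" for url in vocabulary)


def to_sparse_indices(bits):
    """Back to a sparse view: keep only the positions of the 1 bits."""
    return [i for i, bit in enumerate(bits) if bit == "1"]


vectors = {uid: to_bit_vector(urls, vocabulary) for uid, urls in visits.items()}
for uid, bits in vectors.items():
    print(uid, bits, to_sparse_indices(bits))
```

With millions of URLs the dense vector is almost entirely zeros, so storing only the indices of the ones is what keeps such data manageable.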

Page 23: Hadoop Summit 2014 - recap

Machine Learning

Options on Hadoop

• Python with UDFs

• MLlib

• Mahout

• SparkR

Page 24: Hadoop Summit 2014 - recap

Mahout

• A scalable machine learning library

The Mahout community decided to move its codebase onto […] systems that offer a richer programming model and more efficient execution than Hadoop MapReduce.

Mahout will therefore reject new MapReduce algorithm implementations from now on.

We are building our future implementations on top of a DSL […].

Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

https://mahout.apache.org/

Page 25: Hadoop Summit 2014 - recap

Machine Learning

Trends

• Sparse data representation

• Deep learning

• Anomaly detection

Page 26: Hadoop Summit 2014 - recap

and a lot more… (come talk to us :))