Hadoop Summit 2014 - recap

Hadoop Summit 2014

What’s cookin’?

Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com


Hard work during the day

A quick bite to eat

More hard work during the night

The End

Overview

• YARN

• Tez

• Spark

• BlinkDB

• Summingbird

• Storm

• ML

YARN

• Supports workloads other than MapReduce

• Allows other apps to ‘go distributed’ on top of HDFS

YARN cluster architecture

Tez

• Execution engine on YARN

• Complex graphs of tasks for processing data
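A graph of tasks like this can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Tez's API (Tez itself is programmed in Java); the task names and the `run_order` helper are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}


def run_order(graph):
    """Return one valid execution order in which every task follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dag)
print(order)
```

The two scans have no dependency on each other, so an engine is free to run them in parallel before the join.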

Tez

• Hive can use Tez since version 0.13; Pig support followed in Pig 0.14

• 2-3x performance increase compared to older Hive and Pig versions

• Tez does performance optimisations and resource management across the cluster

• Reuses containers and JVMs: effective for short queries in e.g. Hive.

• Multiple jobs at the same time

Spark

The new kid on the block

BlinkDB

Interactive queries on very large data, based on sampling

BlinkDB

• Offline sampling module

• Compute data samples, based on a ‘storage budget’

• Store samples on disk and in memory

• Sample selection module

• Select the right samples for an incoming query

• Query execution in parallel

• Answers are augmented by error and confidence bounds
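The core idea, answering from a sample and reporting an error bound, can be sketched as follows. This is a simplification and not BlinkDB's implementation: it draws a uniform random sample at query time, whereas BlinkDB precomputes its samples offline. The function name, the synthetic data, and the 95% normal-approximation interval (`z=1.96`) are assumptions made for the example.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform sample and return (estimate, margin).

    The margin is z * standard error, i.e. roughly a 95% confidence bound
    for z=1.96 (normal approximation, no finite-population correction).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * stderr


# Synthetic "large" dataset: 100,000 values whose true mean is 49.5.
population = [float(i % 100) for i in range(100_000)]
estimate, margin = approximate_mean(population, sample_size=1_000)
print(f"mean = {estimate:.2f} +/- {margin:.2f}")
```

Only 1% of the rows are read, and the answer comes with a bound the caller can use to decide whether a larger sample is needed.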

BlinkDB

• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster, answering a range of queries on 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.

Summingbird

• Write MapReduce programs that look like native Java or Scala collection transformations

• Platform-agnostic

• Execute on a number of distributed MapReduce platforms, like Scalding (Hadoop) or Storm

• The same code can run for batch and streaming
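The "same code for batch and streaming" idea can be illustrated in plain Python. This is not Summingbird's API (Summingbird is a Scala library); all function names here are hypothetical. The job logic is written once as a per-event function whose results are merged by addition, and two small runners apply it in batch and in streaming fashion.

```python
from collections import Counter


def to_word_counts(sentence):
    """Shared job logic: turn one sentence into partial word counts."""
    return Counter(sentence.lower().split())


def run_batch(sentences):
    """Batch mode: process the whole dataset and return the final counts."""
    total = Counter()
    for sentence in sentences:
        total += to_word_counts(sentence)
    return total


def run_streaming(sentence_stream):
    """Streaming mode: yield a snapshot of the running counts after each event."""
    total = Counter()
    for sentence in sentence_stream:
        total += to_word_counts(sentence)
        yield Counter(total)


data = ["to be or not to be", "to do is to be"]
batch_result = run_batch(data)
stream_result = list(run_streaming(iter(data)))[-1]
print(batch_result == stream_result)
```

Because merging counts is associative, partial results from a batch run and from a stream can be added together, which is what makes a single job definition workable on both kinds of platform.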

Summingbird

• Word count in pure Scala

• In Summingbird

Summingbird

• ‘Strongly encourages’ the lambda architecture

Storm (on YARN)

Stream data processing on Hadoop.

Storm recap:

• Processes unbounded streams of tuples.

• Basic primitives are spouts and bolts

• A spout is a source of streams.

• A bolt processes streams and may emit new streams
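The spout and bolt roles can be mimicked with Python generators. This sketch only illustrates the dataflow; it does not use Storm's API, and the function names and sample sentences are invented for the example.

```python
import itertools


def sentence_spout():
    """Spout: an unbounded source of tuples (here, an endless cycle of sentences)."""
    for sentence in itertools.cycle(["the cow jumped", "over the moon"]):
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the "topology": spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

# The stream never ends, so take only the first six output tuples.
first_six = list(itertools.islice(topology, 6))
print(first_six)
```

In Storm the same wiring is declared as a topology and the spouts and bolts run in parallel across the cluster, rather than in a single process as here.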

Storm (on YARN)

Storm alternatives

• Spark Streaming

Machine Learning

Sparse Data Representation

uid1: url1, url2, url4, url6, url7, url8

uid2: url2, url3, url5, url9, url10, url11

uid1: 11010111000

uid2: 01101000111
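The mapping between the two encodings above can be reproduced with a short Python sketch. It assumes the vocabulary is url1 to url11 in that order, which is what the bit strings on the slide imply; the helper names are hypothetical.

```python
def to_bit_vector(visited, vocabulary):
    """Encode a list of visited URLs as a 0/1 string over a fixed vocabulary."""
    visited = set(visited)
    return "".join("1" if url in visited else "0" for url in vocabulary)


def to_url_list(bits, vocabulary):
    """Inverse mapping: recover the list of visited URLs from the bit string."""
    return [url for url, bit in zip(vocabulary, bits) if bit == "1"]


# Assumed vocabulary: url1 .. url11, in order.
vocabulary = [f"url{i}" for i in range(1, 12)]

visits = {
    "uid1": ["url1", "url2", "url4", "url6", "url7", "url8"],
    "uid2": ["url2", "url3", "url5", "url9", "url10", "url11"],
}

vectors = {uid: to_bit_vector(urls, vocabulary) for uid, urls in visits.items()}
print(vectors)
```

With a realistic vocabulary of millions of URLs almost every position in the bit vector would be zero, which is why storing only the visited URLs (or their indices) is the more compact form.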

Machine Learning

Options on Hadoop

• Python with UDFs

• MLlib

• Mahout

• SparkR

Mahout

• A scalable machine learning library

The Mahout community decided to move its codebase onto […] systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on.

We are building our future implementations on top of a DSL […]. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

https://mahout.apache.org/


Machine Learning

Trends

• Sparse data representation

• Deep learning

• Anomaly detection

and a lot more… (come talk to us :))
