What’s changed since then? · union join leftOuterJoin rightOuterJoin reduce count fold...


Page 1:

Page 2:

What’s changed since then?

Page 3:

Open source processing engine and set of libraries

Cloud service based on Spark

Page 4:

1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud

Page 5:

Pig

Page 6:

[Figure: two bar charts, "2014 Languages Used" and "2015 Languages Used". Bar values as extracted: 84%, 38%, 38%, 71%, 31%, 58%, 18%. The language label for each bar did not survive extraction.]

Page 7:

"hdfs://..."

map line => parsePoint(line)

filter p => p.x > 100

count
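
The pipeline above can be mimicked on a local list to show what each stage does. This is a plain-Python sketch, not Spark code: the `Point` type, the `parse_point` helper (standing in for the slide's `parsePoint`), the `count_large_x` wrapper, and the comma-separated input format are all assumptions, and an in-memory list replaces the elided `hdfs://...` input.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float


def parse_point(line: str) -> Point:
    # Hypothetical format: "x,y" per line. The slide does not define parsePoint.
    x, y = line.split(",")
    return Point(float(x), float(y))


def count_large_x(lines, threshold=100):
    points = map(parse_point, lines)                    # map line => parsePoint(line)
    large = filter(lambda p: p.x > threshold, points)   # filter p => p.x > 100
    return sum(1 for _ in large)                        # count


sample_lines = ["50,1", "150,2", "101,3", "100,4"]
result = count_large_x(sample_lines)
```

The functions handed to `map` and `filter` are ordinary closures: a runtime that receives them can call them, but it cannot inspect what they compute.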

Page 8:

hides

Page 9:

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

groupByKey

Page 10:

map word => (word, 1)

groupByKey

map (k, vs) => (k, vs.sum)

Materializes all groups as Seq[Int] objects

Then promptly aggregates them
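
The cost of that pattern can be illustrated without Spark. The sketch below is plain Python, and both function names are hypothetical. The first function follows the three steps on the slide and builds a list of values per key before summing. The second keeps a running total per key, which is the idea behind `reduceByKey` from the operator list, so no per-key list is ever built.

```python
from collections import defaultdict


def count_with_group_by_key(words):
    # Step 1: map word => (word, 1)
    pairs = [(w, 1) for w in words]
    # Step 2: groupByKey, which materializes one list of values per key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 3: map (k, vs) => (k, vs.sum)
    return {key: sum(values) for key, values in groups.items()}


def count_with_running_sum(words):
    # Fold each occurrence into a per-key total as it arrives,
    # so memory grows with the number of keys, not the number of values.
    totals = defaultdict(int)
    for w in words:
        totals[w] += 1
    return dict(totals)


words = "to be or not to be".split()
grouped = count_with_group_by_key(words)
summed = count_with_running_sum(words)
```

Both functions return the same counts. The difference is only in the intermediate state each one holds.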

Page 11:

structured data

SIGMOD 2015

Page 12:

[Diagram labels: SQL, DataFrames, Logical Plan, Optimizer, Physical Plan, Code Generator, RDDs]

Page 13:

DataFrames hold rows with a known schema and offer relational ops through a DSL

users = ctx.sql("select * from hive.users")

ca_users = users[users.state == "CA"]

ca_users.count()

ca_users.groupBy("name").avg("age")

ca_users.map(lambda row: row.name.upper())

Expression AST
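
One way to see why `users.state == "CA"` can yield an expression AST rather than a boolean is operator overloading. The sketch below is a toy model in plain Python, not Spark's implementation: the `Column`, `Literal`, and `Eq` classes and the `evaluate` function are hypothetical names introduced for illustration.

```python
class Literal:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return repr(self.value)


class Eq:
    """AST node for `left == right`."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __repr__(self):
        return f"Eq({self.left!r}, {self.right!r})"


class Column:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Build a tree node describing the comparison instead of returning a bool.
        return Eq(self, Literal(other))

    # Defining __eq__ would otherwise make the class unhashable.
    __hash__ = object.__hash__

    def __repr__(self):
        return f"Column({self.name!r})"


def evaluate(expr, row):
    # Interpret the tree against one row (a dict). An engine could instead
    # rewrite or compile the tree before touching any data.
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Eq):
        return evaluate(expr.left, row) == evaluate(expr.right, row)
    raise TypeError(f"unknown node: {expr!r}")


state = Column("state")
predicate = state == "CA"  # an Eq node, not True or False
rows = [{"name": "ann", "state": "CA"}, {"name": "bob", "state": "NY"}]
ca_rows = [r for r in rows if evaluate(predicate, r)]
```

Because the comparison is captured as data, whatever receives `predicate` can inspect it before running anything, which is not possible with an opaque lambda.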

Page 14:

[Figure: bar chart, "Aggregation benchmark (s)", axis from 0 to 10 s. Bars: RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R. Bar values did not survive extraction.]

Page 15:

Modular API based on scikit-learn

Relational + graph operations

All built on DataFrames: enables cross-library optimization

Page 16:

1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud

Page 17:

         2010
Storage  50+ MB/s (HDD)
Network  1 Gbps
CPU      ~3 GHz

Page 18:

         2010            2016
Storage  50+ MB/s (HDD)  500+ MB/s (SSD)
Network  1 Gbps          10 Gbps
CPU      ~3 GHz          ~3 GHz

Page 19:

         2010            2016
Storage  50+ MB/s (HDD)  500+ MB/s (SSD)  10x
Network  1 Gbps          10 Gbps          10x
CPU      ~3 GHz          ~3 GHz
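
The ratios can be checked directly from the figures above. This is a small arithmetic sketch: the dictionary layout and names are assumptions, and the storage numbers are treated as exact even though the slide gives them as lower bounds ("50+", "500+").

```python
# Figures from the slide, as (2010, 2016) pairs.
hardware = {
    "Storage (MB/s)": (50, 500),  # HDD -> SSD, both stated as lower bounds
    "Network (Gbps)": (1, 10),
    "CPU (GHz)": (3, 3),          # "~3GHz" in both years
}

# Improvement factor per resource between 2010 and 2016.
speedup = {name: new / old for name, (old, new) in hardware.items()}
```

Storage and network throughput each come out at 10x while the CPU clock rate stays flat at 1x, which is the shift the agenda item "I/O bottleneck ➡ compute" refers to.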

Page 20:

• Many current systems are 2-10x off peak performance

Page 21:

Results from Nested Vector Language (NVL) project at MIT

[Figure: bar groups for HyPer (database), TensorFlow (Word2Vec), and GraphMat (PageRank). Legend: current in-memory systems, hand-tuned code. Bar values did not survive extraction.]

Page 22:

Spark 1.6: 14M rows/s

Spark 2.0: 125M rows/s

Page 23:

1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud

Page 24:

• Multi-tenant

• Fully measured

• Elastic

• Continuously updated

Must design an organization, not a piece of software

Page 25: