From stream to recommendation using Apache Beam with Cloud Pub/Sub and Cloud Dataflow

Post on 27-Jan-2017

Speakers: Igor Maravić & Neville Li, Spotify

From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow

DATA & ANALYTICS

Current Event Delivery System

Current event delivery system

[Diagram: Clients, Gateway, Syslog, Syslog Producer, Service Discovery, Groupers, Realtime Brokers, ACK Brokers, Brokers, Syslog Consumer, Liveness Monitor, ETL job, Checkpoint Monitor, and Hadoop, spread across "Any data centre" and "Hadoop data centre"]

Complex

[Same diagram as the previous slide]

Stateless

[Same diagram as the previous slide]

Delivered data growth

[Chart: delivered data per year, 2007 to 2015]

Redesigning Event Delivery

Redesigning event delivery

[Diagram: Clients, Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, and Hadoop; labelled "Any data centre"]

Same API

[Same diagram as the previous slide]

Persistence

[Same diagram as the previous slide]

Keep it simple

[Same diagram as the previous slide]

Build it!

Choosing reliable persistent queue

Kafka 0.8

Proven technology

Strong community

Reliable persistent queue

Event delivery with Kafka 0.8

[Diagram: Clients, Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, MirrorMakers, Camus (ETL), and Hadoop, spread across "Any data centre" and "Hadoop data centre"]

Cloud Pub/Sub

Retains undelivered data

At-least-once delivery

Globally available

Simple REST API

No operational responsibility*

SHUT UP AND TAKE MY MONEY!

Caution advised!

Building up trust in Cloud Pub/Sub

Delivered data growth

[Chart: delivered data per year, 2007 to 2015]

Demo time!

2M events per second.

Cloud Pub/Sub, Spotify chooses You!

Event delivery with Cloud Pub/Sub

[Diagram: Clients, Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, Dataflow, Cloud Storage, and Hadoop; labelled "Any data centre"]

ETL using Cloud Dataflow

Streaming ETL job with Cloud Dataflow

Dataflow SDK is a framework

Cloud Dataflow is a managed service

ETL job

Single Cloud Pub/Sub subscription

[Dataflow monitoring UI: "Consume" step, status Running]

GCS and HDFS in parallel.

Event time based hourly buckets

[Figure: hourly buckets from 2016-03-21 23H to 2016-03-22 04H]

Incremental bucket fill

[Figure: the same hourly buckets, 2016-03-21 23H to 2016-03-22 04H]

Bucket completeness

[Figure: the same hourly buckets, 2016-03-21 23H to 2016-03-22 04H]

Late data handling

[Figure: the same hourly buckets; 2016-03-21 23H through 2016-03-22 02H appear a second time]

● Event time based hourly buckets

● Incremental bucket fill

● Bucket completeness

● Late data handling
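To make the event-time idea concrete, here is a minimal sketch in plain Java of how an event could be mapped to its hourly bucket and how lateness could be judged against a watermark. It is not code from the talk and does not use the Dataflow SDK: the class `HourlyBuckets`, its method names, and the choice of UTC are assumptions made for illustration. Only the bucket label format is taken from the slides.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration; not part of the talk or of the Dataflow SDK.
final class HourlyBuckets {
    // Label format copied from the slides, e.g. "2016-03-22 03H". UTC is an assumption.
    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    // Start of the hour containing the event's own timestamp (event time, not arrival time).
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    // Human-readable name of the bucket the event belongs to.
    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    // An event counts as late here if the watermark has already reached the end of its bucket.
    static boolean isLate(Instant eventTime, Instant watermark) {
        Instant bucketEnd = bucketStart(eventTime).plus(1, ChronoUnit.HOURS);
        return !watermark.isBefore(bucketEnd);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:59:59Z");
        System.out.println(bucketLabel(event));
        System.out.println(isLate(event, Instant.parse("2016-03-22T04:10:00Z")));
    }
}
```

In this sketch an event stamped 03:59:59 lands in the 03H bucket no matter when it arrives; if the watermark has already moved past 04:00, it is treated as late data for that bucket.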

Windowing

[Dataflow monitoring UI: Consume (Running), Shard (4,061 elements/s), Window (4,061 elements/s), Write to HDFS (Running), Write to GCS (Running)]

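Windowing

    @Override
    public PCollection<KV<String, Iterable<EventMessage>>> apply(
        final PCollection<KV<String, EventMessage>> shardedEvents) {
      return shardedEvents
          .apply("Assign Hourly Windows",
              // One fixed window per hour of event time.
              Window.<KV<String, EventMessage>>into(
                      FixedWindows.of(ONE_HOUR))
                  // Accept data that arrives up to a day late.
                  .withAllowedLateness(ONE_DAY)
                  .triggering(
                      AfterWatermark.pastEndOfWindow()
                          // Before the watermark: fire once maxEventsInFile events are buffered.
                          .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                          // After the watermark: fire on the same count, or ten seconds
                          // after the first late element, whichever comes first.
                          .withLateFirings(AfterFirst.of(
                              AfterPane.elementCountAtLeast(maxEventsInFile),
                              AfterProcessingTime.pastFirstElementInPane()
                                  .plusDelayOf(TEN_SECONDS))))
                  // Each pane contains only the elements that arrived since the last firing.
                  .discardingFiredPanes())
          .apply("Aggregate Events", GroupByKey.create());
    }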

Streaming

Where are we right now?

Preliminary results

[Chart: watermark lag, in minutes]

Scio

Scala API for Google Cloud Dataflow

Origin story

Scalding and Spark popular for ML, recommendations, analytics @ Spotify

50+ users, 400+ unique jobs

Early 2015 - Dataflow Scala hack project

Why not Scalding on GCE

Pros

● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud

● Stable and proven

Cons

● Hadoop cluster operations

● Multi-tenancy, resource contention and utilization

● No streaming mode

Why not Spark on GCE

Pros

● Batch, streaming, interactive and SQL

● MLlib, GraphX

● Scala, Python, and R support

Cons

● Hard to tune and scale

● Cluster lifecycle management

Why Dataflow with Scala

Dataflow

● Hosted solution, no operations

● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable

● Simple unified model for batch and streaming

Scala

● High level DSL, easy transition for developers

● Reusable and composable code via functional programming

● Numerical libraries: Breeze, Algebird

[Diagram: Scio stack. Data stores: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery. Modes: Batch, Streaming, Interactive REPL. Layers: Scio Scala API, Dataflow Java SDK, Scala Libraries, Extra features]

Scio

Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]

Verb: I can, know, understand, have knowledge.

Core API similar to spark-core, some ideas from scalding

github.com/spotify/scio

WordCount

Almost identical to Spark version

    val sc = ScioContext()
    sc.textFile("shakespeare.txt")
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue()
      .saveAsTextFile("wordcount.txt")

PageRank in 13 lines

    def pageRank(in: SCollection[(String, String)]) = {
      val links = in.groupByKey()
      var ranks = links.mapValues(_ => 1.0)
      for (i <- 1 to 10) {
        val contribs = links.join(ranks).values
          .flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map((_, rank / size))
          }
        ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
      }
      ranks
    }

SQL and Big Data Pipelines

SQL is easier to write than data pipelines, but

Hive with TSV or Avro

● Row based storage, inefficient full scan

● No integration with other frameworks

Parquet

● Inspired by Google Dremel which powers BigQuery

● Immature Hive integration, hard to scale with Spark SQL

● Poor impedance matching with Scalding, Avro, etc.

BigQuery and Scio

BigQuery

● Slicing and dicing, aggregation, etc.

● Scaling independently

● Web UI, Tableau, QlikView etc.

Scio

● Custom logic hard to express in SQL

● Seamless integration with BigQuery IO

● Scala macros for type safety

JSON vs Type Safe BigQuery

JSON approach, a.k.a. everything is Object

    sc.bigQuerySelect("...").map { r =>
      (r.get("track").asInstanceOf[TableRow]
        .get("name").asInstanceOf[String],
       r.get("audio").asInstanceOf[TableRow]
        .get("tempo").toString.toInt)
    }

Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat

Type safe approach

    @BigQueryType.fromQuery("...")
    class TrackTempo

    sc.typedBigQuery[TrackTempo]().map { t =>
      (t.track.name, t.audio.tempo.getOrElse(-1))
    }

Compile → Run → Profit

Spotify Running

60 million tracks

30 million users * 10 tempo buckets * 25 personalized tracks

Audio: tempo, energy, time signature ...

Metadata: genres, categories

Latent vectors from collaborative filtering
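Taken at face value, the user, bucket and track counts above multiply out to the number of personalized (user, tempo bucket, track) entries to produce. This is plain arithmetic on the slide's numbers, not a figure quoted in the talk:

```latex
30 \times 10^{6}\ \text{users} \times 10\ \text{tempo buckets} \times 25\ \text{tracks} = 7.5 \times 10^{9}\ \text{entries}
```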

Rapid prototyping with BigQuery

Spotify Running

SELECT user_id, vector FROM UserEntity WHERE ...

SELECT track_id, audio.tempo ... FROM TrackEntity WHERE ...

[Pipeline diagram: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; top tracks per user + bucket side input; Cloud Datastore. Stage markers: GBK, GBK, GBK, RBK]

[Dataflow monitoring UI for Running.scala: typedBigQuery, filter and map steps, some Succeeded and some Running, 4,788 elements/s]

What’s the catch?

Early stage, some rough edges

No interactive mode → Scio REPL (WIP), BigQuery + Datalab

No machine learning → TensorFlow

Licensed under Apache 2, contributions welcome!

Learnings?

Thank You

Igor Maravić <igor@spotify.com>

Neville Li <neville@spotify.com>