From stream to recommendation using Apache Beam with Cloud Pub/Sub and Cloud Dataflow

Speakers: Igor Maravić & Neville Li, Spotify

From stream to recommendation with Cloud Pub/Sub and Cloud Dataflow

DATA & ANALYTICS

Current Event Delivery System

[Diagram labels: Client, Gateway, Syslog, SyslogProducer, SyslogConsumer, Groupers, RealtimeBrokers, ACKBrokers, Brokers, LivenessMonitor, CheckpointMonitor, Service Discovery, ETL job, Hadoop, Any Data Centre, Hadoop Data Center]

[Same diagram, annotated: Complex]

[Same diagram, annotated: Stateless]

Delivered data growth

[Chart: delivered data per year, 2007 to 2015]

Redesigning Event Delivery

[Diagram labels: Client, Gateway, Syslog, File Tailer, Any data centre, Event Delivery Service, Reliable Persistent Queue, Hadoop]

Same API

Persistence

Keep it simple

Build it!

Choosing reliable persistent queue

Kafka 0.8

Proven technology

Strong community

Reliable persistent queue

Event delivery with Kafka 0.8

[Diagram labels: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, MirrorMakers, Camus (ETL), Hadoop, Any data centre, Hadoop data centre]

Cloud Pub/Sub

Retains undelivered data

At least once delivery

Globally available

Simple REST API

No operational responsibility*

SHUT UP AND

TAKE MY MONEY!

Caution advised!

Building up trust in Cloud Pub/Sub

Delivered data growth

[Chart: delivered data per year, 2007 to 2015]

Demo time!

2M events per second.

Cloud Pub/Sub, Spotify chooses You!

Event delivery with Cloud Pub/Sub

[Diagram labels: Client, Gateway, Syslog, File Tailer, Any data centre, Cloud Pub/Sub, Dataflow, Cloud Storage, Hadoop]

ETL using Cloud Dataflow

Streaming ETL job with Cloud Dataflow

Dataflow SDK is a framework

Cloud Dataflow is a managed service

ETL job

Single Cloud Pub/Sub subscription

[Dataflow monitoring UI: Consume step, Running]

GCS and HDFS in parallel.

Event time based hourly buckets

[Figure: hourly buckets labelled 2016-03-21 23H to 2016-03-22 04H]

Incremental bucket fill

Bucket completeness

Late data handling

Event time based hourly buckets

Incremental bucket fill

Bucket completeness

Late data handling
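The requirements listed above can be illustrated with a small sketch. This is not the Dataflow implementation: the class HourlyBuckets, its method names and the one-day lateness bound are assumptions made for illustration. It assigns an event to an hourly bucket by its event time, then checks the bucket against a watermark.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of event-time hourly bucketing; not the Dataflow API. */
class HourlyBuckets {
    // Assumed bound on how long after bucket end late events are still accepted.
    static final Duration ALLOWED_LATENESS = Duration.ofDays(1);

    private static final DateTimeFormatter LABEL =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC);

    /** Start of the hourly bucket that contains the event timestamp. */
    static Instant bucketStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    /** Label in the style used on the slides, e.g. "2016-03-22 03H". */
    static String bucketLabel(Instant eventTime) {
        return LABEL.format(bucketStart(eventTime));
    }

    /** A bucket is complete once the watermark has reached its end. */
    static boolean isComplete(Instant bucketStart, Instant watermark) {
        return !watermark.isBefore(bucketStart.plus(1, ChronoUnit.HOURS));
    }

    /** Late events are dropped once the watermark reaches bucket end plus allowed lateness. */
    static boolean isDropped(Instant eventTime, Instant watermark) {
        Instant cutoff = bucketStart(eventTime).plus(1, ChronoUnit.HOURS).plus(ALLOWED_LATENESS);
        return !watermark.isBefore(cutoff);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-22T03:17:42Z");
        Instant watermark = Instant.parse("2016-03-22T04:05:00Z");
        System.out.println(bucketLabel(event));                        // 2016-03-22 03H
        System.out.println(isComplete(bucketStart(event), watermark)); // true
        System.out.println(isDropped(event, watermark));               // false
    }
}
```

An event that arrives after its bucket is complete but before the cutoff is the late data case: it still belongs to the 03H bucket and is added to it incrementally.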

Windowing

[Dataflow monitoring UI: steps Consume (Running), Shard (4,061 elements/s), Window (4,061 elements/s), Write to HDFS (Running), Write to GCS (Running)]

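Windowing

@Override
public PCollection<KV<String, Iterable<EventMessage>>> apply(
    final PCollection<KV<String, EventMessage>> shardedEvents) {
  return shardedEvents
      .apply("Assign Hourly Windows",
          // One fixed window per hour of event time.
          Window.<KV<String, EventMessage>>into(FixedWindows.of(ONE_HOUR))
              // Accept events that arrive up to a day after the watermark passes the window.
              .withAllowedLateness(ONE_DAY)
              .triggering(
                  AfterWatermark.pastEndOfWindow()
                      // Before the watermark: fire whenever enough events for one file are buffered.
                      .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                      // After the watermark: fire on the same count, or ten seconds
                      // after the first late element, whichever comes first.
                      .withLateFirings(AfterFirst.of(
                          AfterPane.elementCountAtLeast(maxEventsInFile),
                          AfterProcessingTime.pastFirstElementInPane()
                              .plusDelayOf(TEN_SECONDS))))
              // Each pane carries only the elements that arrived since the previous firing.
              .discardingFiredPanes())
      .apply("Aggregate Events", GroupByKey.create());
}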
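As a rough illustration of what an element-count trigger combined with discardingFiredPanes() implies for file output, the plain-Java sketch below (no Dataflow dependency) buffers the events of one window and emits a pane whenever the buffer reaches a limit, clearing it afterwards so that each event is written once. PaneBuffer and its methods are hypothetical names; watermark and processing-time firings are reduced to an explicit flush() call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-window buffer that fires panes by element count. */
class PaneBuffer<T> {
    private final int maxEventsInFile;
    private final List<T> buffer = new ArrayList<>();
    private final List<List<T>> firedPanes = new ArrayList<>();

    PaneBuffer(int maxEventsInFile) {
        this.maxEventsInFile = maxEventsInFile;
    }

    /** Adds an element and fires a pane once the count threshold is reached. */
    void add(T element) {
        buffer.add(element);
        if (buffer.size() >= maxEventsInFile) {
            flush();
        }
    }

    /** Stands in for a watermark or timer firing: emits whatever is buffered. */
    void flush() {
        if (!buffer.isEmpty()) {
            firedPanes.add(new ArrayList<>(buffer));
            buffer.clear(); // discarding mode: fired elements are not re-emitted
        }
    }

    List<List<T>> firedPanes() {
        return firedPanes;
    }

    public static void main(String[] args) {
        PaneBuffer<Integer> window = new PaneBuffer<>(3);
        for (int i = 1; i <= 7; i++) {
            window.add(i);
        }
        window.flush(); // e.g. the watermark passes the end of the window
        System.out.println(window.firedPanes()); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```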

Streaming

Where are we right now?

Preliminary results

[Chart: watermark lag, in minutes]

Scio

Scala API for Google Cloud Dataflow

Origin story

Scalding and Spark popular for ML, recommendations, analytics @ Spotify

50+ users, 400+ unique jobs

Early 2015 - Dataflow Scala hack project

Why not Scalding on GCE

● Big community - Twitter, eBay, Etsy, Stripe, LinkedIn, SoundCloud

● Stable and proven

● Hadoop cluster operations

● Multi-tenancy, resource contention and utilization

● No streaming mode

Why not Spark on GCE

● Batch, streaming, interactive and SQL

● MLlib, GraphX

● Scala, Python, and R support

● Hard to tune and scale

● Cluster lifecycle management

Why Dataflow with Scala

Dataflow

● Hosted solution, no operations

● Ecosystem: GCS, BigQuery, Pub/Sub, Datastore, Bigtable

● Simple unified model for batch and streaming

Scala

● High level DSL, easy transition for developers

● Reusable and composable code via functional programming

● Numerical libraries: Breeze, Algebird

[Diagram: Scio stack. Layers: Extra features; Scio Scala API; Dataflow Java SDK and Scala Libraries. Modes: Batch, Streaming, Interactive REPL. Services: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery]

Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i ̯o]

Verb: I can, know, understand, have knowledge.

Core API similar to spark-core, some ideas from scalding

github.com/spotify/scio

WordCount

Almost identical to Spark version

val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")

PageRank in 13 lines

def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
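For readers less familiar with the Scio operators, the same iteration can be traced on an in-memory graph. The sketch below is plain Java with no Dataflow dependency; the class name, the map-based representation and the three-page example graph are assumptions made for illustration. It applies the same update, rank = 0.15 + 0.85 × (sum of contributions), for ten iterations. As in the snippet above, a page that receives no contributions drops out of the rank map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory PageRank mirroring the update rule of the Scio snippet. */
class PageRankSketch {
    static Map<String, Double> pageRank(List<String[]> edges, int iterations) {
        // groupByKey: source page -> outgoing links
        Map<String, List<String>> links = new HashMap<>();
        for (String[] edge : edges) {
            links.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
        }
        // every page with outgoing links starts at rank 1.0
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            // join + flatMap + sumByKey: each page splits its rank among its links
            Map<String, Double> contribs = new HashMap<>();
            for (Map.Entry<String, List<String>> entry : links.entrySet()) {
                Double rank = ranks.get(entry.getKey());
                if (rank == null) {
                    continue; // inner join: pages without a rank contribute nothing
                }
                double share = rank / entry.getValue().size();
                for (String url : entry.getValue()) {
                    contribs.merge(url, share, Double::sum);
                }
            }
            // damping factor 0.85, as on the slide
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> entry : contribs.entrySet()) {
                next.put(entry.getKey(), (1 - 0.85) + 0.85 * entry.getValue());
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"a", "b"});
        edges.add(new String[] {"b", "c"});
        edges.add(new String[] {"c", "a"});
        // On a three-page cycle every page keeps a rank of 1.0.
        System.out.println(pageRank(edges, 10));
    }
}
```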

SQL and Big Data Pipelines

SQL is easier to write than data pipelines, but

Hive with TSV or Avro

● Row based storage, inefficient full scan

● No integration with other frameworks

Parquet

● Inspired by Google Dremel which powers BigQuery

● Immature Hive integration, hard to scale with Spark SQL

● Poor impedance matching with Scalding, Avro, etc.

BigQuery and Scio BigQuery

● Slicing and dicing, aggregation, etc.

● Scaling independently

● Web UI, Tableau, QlikView etc.

● Custom logic hard to express in SQL

● Seamless integration with BigQuery IO

● Scala macros for type safety

JSON vs Type Safe BigQuery

JSON approach, a.k.a. everything is Object

sc.bigQuerySelect("...").map { r =>
  (r.get("track").asInstanceOf[TableRow]
     .get("name").asInstanceOf[String],
   r.get("audio").asInstanceOf[TableRow]
     .get("tempo").toString.toInt)
}

Compile → Run job → Wait → NullPointerException or ClassCastException → Repeat

Type safe approach

@BigQueryType.fromQuery("...")
class TrackTempo

sc.typedBigQuery[TrackTempo]().map { t =>
  (t.track.name, t.audio.tempo.getOrElse(-1))
}

Compile → Run → Profit
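The difference between the two approaches is not specific to Scala macros. The plain-Java sketch below uses hypothetical names (UntypedVsTyped, Audio, TrackTempo; none of them are Scio or BigQuery APIs) to contrast field access through casts, which fails only at run time, with access through declared types, where a nullable tempo is modelled as an Optional and defaulted to -1 as on the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical contrast between untyped rows and typed records. */
class UntypedVsTyped {
    // Untyped: every field is an Object; mistakes surface only when the job runs.
    @SuppressWarnings("unchecked")
    static int tempoUntyped(Map<String, Object> row) {
        Map<String, Object> audio = (Map<String, Object>) row.get("audio");
        return Integer.parseInt(audio.get("tempo").toString()); // NPE if a field is missing
    }

    // Typed: field names and types are checked by the compiler.
    static final class Audio {
        final Optional<Integer> tempo;

        Audio(Optional<Integer> tempo) {
            this.tempo = tempo;
        }
    }

    static final class TrackTempo {
        final String trackName;
        final Audio audio;

        TrackTempo(String trackName, Audio audio) {
            this.trackName = trackName;
            this.audio = audio;
        }
    }

    static int tempoTyped(TrackTempo t) {
        return t.audio.tempo.orElse(-1);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("audio", new HashMap<String, Object>()); // tempo is missing
        try {
            tempoUntyped(row);
        } catch (NullPointerException e) {
            System.out.println("untyped access failed at run time");
        }
        TrackTempo typed = new TrackTempo("some track", new Audio(Optional.empty()));
        System.out.println(tempoTyped(typed)); // prints -1
    }
}
```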

Spotify Running

60 million tracks

30 million users * 10 tempo buckets * 25 personalized tracks

Audio: tempo, energy, time signature ...

Metadata: genres, categories

Latent vectors from collaborative filtering

Rapid prototyping with BigQuery

Spotify Running

SELECT user_id, vector
FROM UserEntity
WHERE ...

SELECT track_id, audio.tempo ...
FROM TrackEntity
WHERE ...

[Pipeline diagram labels: most popular per recording; top N tracks per artist; bucket by tempo; vector LSH per bucket; GBK (three times); top tracks per user + bucket side input; Cloud Datastore]

[Dataflow monitoring UI: steps typedBigQuery@(Runni..., filter@Running.scala:1..., typedBigQuery@(Runni..., map@Running.scala:1; statuses Succeeded and Running; 4,788 elements/s]

What’s the catch?

Early stage, some rough edges

No interactive mode → Scio REPL (WIP), BigQuery + Datalab

No machine learning → TensorFlow

Licensed under Apache 2, contributions welcome!

Learnings?

Blog posts @ labs.spotify.com

Spotify’s Event Delivery - The Road To The Cloud: Part I, Part II, Part III

Thank You

Igor Maravić <igor@spotify.com>

Neville Li <neville@spotify.com>
