Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi, Kafka, and Spark -...

Generating Real-time, Streaming Recommendations [NiFi + Kafka + Spark ML]

Kafka Summit SF, April 26, 2016

Who am I?

Chris Fregly, Principal Data Solutions Engineer @ IBM Spark Technology Center

Previously, Data Engineer @ Netflix and Databricks

Contributor @ Apache Spark, Committer @ Netflix OSS

Founder @ Advanced Spark and TensorFlow

Author @ Advanced Spark (advancedspark.com)

Relevant Spark Contribution

SPARK-1981: Add Kinesis support for Spark Streaming

Me

Fun Meetup!

Fun Workshop!!

San Jose: May 14th (full details @ advancedspark.com)

Fun Github Repo!!!

Agenda

Live, Interactive Demo!

NiFi

Spark Streaming

Streaming Recommendations

Netflix Pipeline (Bonus!)

Live, Interactive Demo

http://demo2.advancedspark.com

NiFi

NiFi = “Niagara Files”

Maintainers @ Hortonworks since 2015

Developed @ NSA over last 8+ years

Integrates with EVERYTHING!

Provides Data Provenance

Data Flow Management

Me, Normal Guy

Joe Witt, NiFi Co-Creator

Buffalo Wild Wings

Hat

NiFi + Kafka

NiFi Routing: Http Request

NiFi Geo-Enrichment

NiFi Extract Kafka Topic

NiFi Kafka PUT (Finally!)

NiFi Post-Kafka HttpResponse

NiFi Data Provenance

NiFi Provenance Event Types

ATTRIBUTES_MODIFIED (e.g. Extract Topic Name)

CONTENT_MODIFIED (e.g. Enrich with Geo)

RECEIVE (e.g. Handle Http Request)

ROUTE (e.g. Check Http Method)

SEND (e.g. PutKafka)

DROP (e.g. Handle Http Response)
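
To make the event types above concrete, here is a minimal Python sketch that replays the demo flow and records one provenance event per step. It is not NiFi code: the ProvenanceLog class, the step descriptions, and the flowfile ID are hypothetical names used only for illustration. Only the event type names and the order of the steps come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Event type names taken from the slide; everything else here is hypothetical.
EVENT_TYPES = {"RECEIVE", "ROUTE", "CONTENT_MODIFIED",
               "ATTRIBUTES_MODIFIED", "SEND", "DROP"}


@dataclass
class ProvenanceLog:
    """Toy, in-memory stand-in for a provenance repository (not the NiFi API)."""
    events: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, flowfile_id: str, event_type: str, detail: str) -> None:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        self.events.append((flowfile_id, event_type, detail))

    def lineage(self, flowfile_id: str) -> List[str]:
        """Return the ordered event types recorded for one flowfile."""
        return [etype for fid, etype, _ in self.events if fid == flowfile_id]


def run_demo_flow(log: ProvenanceLog, flowfile_id: str) -> None:
    """Record the steps of the demo flow in the order the slides show them."""
    log.record(flowfile_id, "RECEIVE", "handle HTTP request")
    log.record(flowfile_id, "ROUTE", "check HTTP method")
    log.record(flowfile_id, "CONTENT_MODIFIED", "enrich with geo data")
    log.record(flowfile_id, "ATTRIBUTES_MODIFIED", "extract Kafka topic name")
    log.record(flowfile_id, "SEND", "put to Kafka")
    log.record(flowfile_id, "DROP", "handle HTTP response")


log = ProvenanceLog()
run_demo_flow(log, "flowfile-1")
print(log.lineage("flowfile-1"))
```

Reading the lineage back for a single flowfile ID is the same idea as the provenance search and lineage views shown on the following slides, reduced to a list lookup.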

NiFi Search Data Provenance

NiFi Kafka Provenance Event

NiFi Provenance Lineage

Spark Streaming

Submits Time-Based Micro Batches of Data as Spark Jobs

Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!

Framework for Custom Streaming Receivers

Flexible Window Operations, Optimized State Management

Basic Back Pressure and Throttling Support

At Least Once Guarantees through Write Ahead Log (WAL)
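
The first point above, time-based micro batches, can be illustrated with a small sketch. This is plain Python rather than the Spark Streaming API: the micro_batches function, the batch_interval parameter, and the sample events are assumptions made for illustration. It only shows how records are grouped by arrival time into fixed intervals, each of which would then be handed off as one job.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(records: Iterable[Tuple[float, str]],
                  batch_interval: float) -> Dict[int, List[str]]:
    """Group (timestamp_seconds, value) records into fixed-length time buckets.

    Bucket k holds records with
    k * batch_interval <= timestamp < (k + 1) * batch_interval.
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    return dict(batches)


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batched = micro_batches(events, batch_interval=1.0)

for batch_id, values in sorted(batched.items()):
    # A streaming engine would process each batch as a job; here we just print it.
    print(batch_id, values)
```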

Original Kafka Receiver

Newer Kafka “Direct” Receiver

Spark Streaming KafkaRDD

Kafka “Direct” Streaming Implementation (Spark 1.4+)

Recover/Replay from Kafka using File System-like Offsets

Removes need for Write Ahead Log (WAL)

Uses Kafka, itself, as the WAL!
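
The replay idea above can be sketched without Spark or Kafka. In this hedged Python illustration, a list stands in for one topic partition and a record's offset is its index. The names partition_log, read_range, and plan_batches are hypothetical. The point is that a batch defined only by an offset range can be re-read after a failure and yield identical records, which is why no separate write ahead log is needed.

```python
from typing import List, Tuple

# Hypothetical stand-in for one Kafka topic partition: an append-only list
# where a record's offset is simply its index.
partition_log: List[str] = ["click:1", "click:2", "click:3", "click:4", "click:5"]


def read_range(log: List[str], from_offset: int, until_offset: int) -> List[str]:
    """Read records in [from_offset, until_offset), like a byte range of a file."""
    return log[from_offset:until_offset]


def plan_batches(start: int, end: int, max_per_batch: int) -> List[Tuple[int, int]]:
    """Split [start, end) into consecutive offset ranges, one per micro batch."""
    return [(lo, min(lo + max_per_batch, end))
            for lo in range(start, end, max_per_batch)]


ranges = plan_batches(0, len(partition_log), max_per_batch=2)
first_attempt = [read_range(partition_log, lo, hi) for lo, hi in ranges]

# Simulate a failure of the second batch: because the batch is fully described
# by its offset range, re-reading that range returns exactly the same records.
replayed = read_range(partition_log, *ranges[1])

print(ranges)
print(replayed == first_attempt[1])
```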

KafkaRDD

Streaming Recommendations

Incremental Matrix Factorization!!

(Based on github.com/brkyvz/streaming-matrix-factorization)
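
As a rough illustration of what "incremental" can mean here, the sketch below applies one stochastic gradient descent step per incoming rating, so the factor vectors are updated as events arrive instead of being refit from scratch. It is not taken from the repository referenced above. The rank, learning rate, regularization value, and sample ratings are all assumptions.

```python
import random
from typing import Dict, List

RANK = 2               # number of latent factors (assumed)
LEARNING_RATE = 0.05   # SGD step size (assumed)
REGULARIZATION = 0.01  # L2 penalty weight (assumed)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
user_factors: Dict[str, List[float]] = {}
item_factors: Dict[str, List[float]] = {}


def _factors(table: Dict[str, List[float]], key: str) -> List[float]:
    """Look up a factor vector, creating a small random one for unseen keys."""
    if key not in table:
        table[key] = [rng.uniform(-0.1, 0.1) for _ in range(RANK)]
    return table[key]


def predict(user: str, item: str) -> float:
    u, v = _factors(user_factors, user), _factors(item_factors, item)
    return sum(a * b for a, b in zip(u, v))


def update(user: str, item: str, rating: float) -> float:
    """Apply one SGD step for a single incoming rating; return the pre-update error."""
    u, v = _factors(user_factors, user), _factors(item_factors, item)
    error = rating - sum(a * b for a, b in zip(u, v))
    for k in range(RANK):
        u_k, v_k = u[k], v[k]
        u[k] += LEARNING_RATE * (error * v_k - REGULARIZATION * u_k)
        v[k] += LEARNING_RATE * (error * u_k - REGULARIZATION * v_k)
    return error


# Hypothetical stream of (user, item, rating) events arriving one at a time.
stream = [("alice", "item-1", 5.0), ("bob", "item-1", 4.0), ("alice", "item-2", 1.0)]

first_error = abs(update(*stream[0]))
for _ in range(200):  # replay the tiny stream so the factors can settle
    for event in stream:
        update(*event)
last_error = abs(stream[0][2] - predict("alice", "item-1"))
print(first_error, last_error)
```

Each update touches only the one user vector and one item vector involved in the event, which is what makes this style of update a natural fit for a stream of ratings.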

Recommendation Serving Layer

Use Case: Recommendation Service Depends on Redis Cache

Problem: Redis Cache Goes Down!?

Answer: github.com/Netflix/Hystrix Circuit Breaker!

Circuit States:

Closed: Service OK

Open: Service DOWN

Fallback to Non-Personalized Recommendations from Disk
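
The two circuit states above can be sketched in a few lines. This is a simplified Python illustration of the pattern, not the Hystrix API: the CircuitBreaker class, the failure threshold, and the two lookup functions are hypothetical. A production breaker would also retry the dependency after some time, which is omitted here to keep the Closed and Open states in focus.

```python
from typing import Callable, List


class CircuitBreaker:
    """Minimal two-state breaker: CLOSED (service OK) and OPEN (service down)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = "CLOSED"

    def call(self, primary: Callable[[], List[str]],
             fallback: Callable[[], List[str]]) -> List[str]:
        if self.state == "OPEN":
            # Short-circuit: do not touch the failing dependency at all.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "OPEN"
            return fallback()
        self.consecutive_failures = 0
        return result


def personalized_from_cache() -> List[str]:
    """Hypothetical cache lookup that always fails, simulating the cache being down."""
    raise ConnectionError("cache unavailable")


def non_personalized_from_disk() -> List[str]:
    """Hypothetical fallback: a static, non-personalized list."""
    return ["top-1", "top-2", "top-3"]


breaker = CircuitBreaker(failure_threshold=3)
results = [breaker.call(personalized_from_cache, non_personalized_from_disk)
           for _ in range(5)]
print(breaker.state, results[-1])
```

In this run the first three calls hit the failing cache and fall back; the breaker then opens, and the remaining calls go straight to the fallback without contacting the cache.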

Netflix Data Pipeline

9 million events, 22 GB per second @ peak!

EC2 D2XL

Disk: 6 TB, 475 MB/s

RAM: 30 GB

Network: 700 Mbps

Auto-scaling, Fault tolerance

A/B Tests, Trending Now

SAMZA

Splits high and normal priority

Recommendations Pipeline

Batch Matrix Factorization

Keep Batch Video (V) Matrix

Calculate Newer User (U) Matrix

Compute U x V Dot Product

Save Model to Disk and EVCache

https://github.com/Netflix/EVCache

Throw away batch user factors (U)

Keep video factors (V)
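
The slides list these steps without detail, so the following Python sketch is only one plausible reading of them: the video factors (V) from the batch job stay fixed, a fresh user vector is fit from recent activity, and unseen videos are scored with the dot product of the two. The factor values, the fit_user and recommend functions, and the sample ratings are all hypothetical.

```python
from typing import Dict, List

# Hypothetical batch-trained video factors (V), kept fixed between batch runs.
video_factors: Dict[str, List[float]] = {
    "video-a": [1.0, 0.0],
    "video-b": [0.0, 1.0],
    "video-c": [1.0, 1.0],
}


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def fit_user(ratings: Dict[str, float], steps: int = 500,
             lr: float = 0.05) -> List[float]:
    """Fit one user's factor vector by gradient descent while V stays fixed."""
    rank = len(next(iter(video_factors.values())))
    u = [0.0] * rank
    for _ in range(steps):
        for video, rating in ratings.items():
            v = video_factors[video]
            error = rating - dot(u, v)
            u = [u_k + lr * error * v_k for u_k, v_k in zip(u, v)]
    return u


def recommend(u: List[float], seen: Dict[str, float], top_n: int = 1) -> List[str]:
    """Score every unseen video with the dot product u . v and return the best ones."""
    scores = {vid: dot(u, v) for vid, v in video_factors.items() if vid not in seen}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical recent activity for one user: liked video-a, disliked video-b.
recent = {"video-a": 5.0, "video-b": 1.0}
user_vector = fit_user(recent)
print(user_vector, recommend(user_vector, recent))
```

Because only the user vector is recomputed, the expensive video side of the factorization does not need to be retrained between batch runs.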

Thank You, Kafka Summit SF!

Chris Fregly, @cfregly

All Source Code, Demos, and Docker Images Available @ advancedspark.com,

github.com/fluxcapacitor/pipeline

Join the Global Meetup for Slides, Videos, Book @ advancedspark.com