Real time data viz with Spark Streaming, Kafka and D3.js

Stream processing and visualization for transaction investigation
Using Kafka, Spark, and D3.js
Ben Laird
Capital One Labs

About me
Cornell Engineering '07: BS, Operations Research
Johns Hopkins '12: MS, Applied Math

Data Engineer
  • Northrop Grumman
  • IBM
  • Space Debris Tracking
  • NLP of intel documents
  • Counter-IED GIS analysis

Cornell expectations

Cornell reality

C1 Labs Data Science

Now: Data Scientist at Capital One Labs

A technical challenge: Build a dynamic, rich visualization of large, streaming data
Normally, we have two options:
  • Small data: easy visualization
  • Big data: no visualization

Data Science: More than just Hadoop
Understanding all the requirements of your problem, and the architecture that meets those demands, is ever more important for a data scientist.
  • A data processing solution doesn't matter if you have a 1 hr load time in the browser.
  • Visualization doesn't matter if there is no way to process/store the data.

Stream Handling → Stream Processing → Intermediate Storage → Web Server/Framework → Event-Based Comm → Browser Viz

Our system must be able to process and visualize a real-time transaction stream
  • Requirement: System must handle 1B+ transactions
  • Loading 1B records on the client side isn't feasible

Our data is not only big, it is live.
Assume a stream of 50 records/second

Proposed solution: Use existing big data tools to process the stream before the web stack

Tool                    Purpose
Apache Kafka            Distributed messaging for the transaction stream
Apache Spark Streaming  Distributed processing of the transaction stream. Aggregate to levels that can be handled by the browser
MongoDB                 Intermediate storage in a capped collection for web server access
Node.js                 Server-side framework for web server and Mongo interaction
Socket.io               Event-based communication. Pass new data from the stream into the browser
Crossfilter             Client-side data index
DC.js/D3.js             D3.js graphics and integration with Crossfilter

How/Why did I pick these for our architecture?

A foray into data visualization tools
From the beautiful: Minard Map, 1869
Source:

to the not beautiful


With most solutions, you face a trade-off between ease of use and flexibility
If you need a quick solution, or don't need full control or customization, there are fantastic options


ElasticSearch Kibana

D3.js provides an extremely powerful way of joining data with completely custom graphics

Limitless possibilities. Complete control over data and viz. Not trivial to use though!

Bind data directly to elements in the DOM. Create graphics from scratch

All about finding the right level of abstraction. Introducing DC.js
  • Don't always want to construct bar charts from the ground up.
  • Build axes, ticks, set colors, scales, bar widths, height, projections... Too tedious sometimes.
  • DC.js adds a thin layer on top of D3.js to construct most chart types and to link charts together for fast filtering.

DC.js combines D3.js with Square's Crossfilter
Built by

JavaScript library for very fast (…

scala> val rdd = sc.textFile("all_text_corpus.txt")

scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))

scala> val counts = allWords.map(word => (word, 1)).reduceByKey(_ + _)

scala> counts.map { case (k, v) => (v, k) }.sortByKey(ascending = false).map { case (v, k) => (k, v) }.take(25)

Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481), (in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761), (with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619), (at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))

Word Count in Spark vs Java MapReduce

Transaction Aggregation with Spark

Batch up incoming transactions every 30 seconds, and compute average transaction size and total number of transactions for every merchant, zip code for a 5 min sliding window. Write batched results to MongoDB
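
The windowing logic above can be tried without a cluster. What follows is a minimal sketch in plain Scala collections, not the Spark Streaming job from the talk: the `Txn` and `Stats` case classes, their field names, and the `windowStats` helper are hypothetical, and keying on the (merchant, zip code) pair is an assumption about how the aggregation is grouped. In Spark Streaming itself, a windowed reduction such as `reduceByKeyAndWindow` with a 5 minute window and a 30 second slide would likely play this role.

```scala
// Hypothetical record type; field names are assumptions for illustration.
case class Txn(merchant: String, zip: String, amount: Double, tsSeconds: Long)

// Per-key result: number of transactions and their average amount.
case class Stats(count: Long, avgAmount: Double)

object WindowSketch {
  val WindowSeconds = 300L // 5 minute sliding window
  val SlideSeconds = 30L   // results are recomputed every 30 seconds

  // Aggregate every transaction whose timestamp falls in
  // (nowSeconds - WindowSeconds, nowSeconds], grouped by (merchant, zip).
  def windowStats(txns: Seq[Txn], nowSeconds: Long): Map[(String, String), Stats] =
    txns
      .filter(t => t.tsSeconds > nowSeconds - WindowSeconds && t.tsSeconds <= nowSeconds)
      .groupBy(t => (t.merchant, t.zip))
      .map { case (key, group) =>
        key -> Stats(group.size.toLong, group.map(_.amount).sum / group.size)
      }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, timestamps in seconds since stream start.
    val txns = Seq(
      Txn("coffee-shop", "22102", 4.0, 10L),
      Txn("coffee-shop", "22102", 6.0, 200L),
      Txn("bookstore", "10001", 20.0, 290L),
      Txn("coffee-shop", "22102", 8.0, 320L)
    )
    // Evaluate the window at each 30 second slide boundary.
    for (now <- 300L to 330L by SlideSeconds) {
      println(s"t=$now -> ${windowStats(txns, now)}")
    }
  }
}
```

Each map produced this way is small (one entry per key seen in the last 5 minutes), which is the point of aggregating before the web stack: the browser receives summaries, never the raw stream.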

MongoDB for intermediate storage
  • Use a capped collection to immediately find the last element. No costly O(N) or worse searches.
  • Tap into Mongo with Node.js
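
To make the capped-collection idea concrete, here is a toy model in plain Scala (2.13 or later, for `mutable.ArrayDeque`). It is not the MongoDB driver API and says nothing about how MongoDB implements the feature; `CappedBuffer` and its methods are hypothetical names. It only mimics the two properties relied on here: a fixed capacity that evicts the oldest entry, and insertion order that keeps the newest entry at the tail.

```scala
import scala.collection.mutable

// Toy model of a capped collection: fixed capacity, insertion order kept,
// oldest document evicted once the cap is reached. Not the MongoDB API.
final class CappedBuffer[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val docs = mutable.ArrayDeque.empty[A]

  def insert(doc: A): Unit = {
    if (docs.size == capacity) docs.removeHead() // evict the oldest entry
    docs.append(doc)
  }

  // Most recent insert, read from the tail without scanning the buffer.
  def last: Option[A] = docs.lastOption

  // Current contents, oldest first.
  def toList: List[A] = docs.toList
}
```

With a capacity of 3, inserting five batches leaves only the three newest, and `last` returns the fifth. The web server only ever needs that newest batch, which is why bounded, ordered storage fits this stage of the pipeline.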

Node.js and Socket.io for server-side updates
Add a listener in client-side JavaScript
