Spark Streaming Early Warning Use Case
Transcript of Spark Streaming Early Warning Use Case
![Page 1: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/1.jpg)
STREAMING EARLY WARNING
Data Day Seattle, 6-27-2015
Chance Coble
![Page 2: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/2.jpg)
2
Use Case Profile
➾ Telecommunications company
  - Had business problems/pain
  - Scalable analytics infrastructure is a problem
  - Pushing infrastructure to its limits
  - Open to a proof-of-concept engagement with emerging technology
  - Wanted to test on historical data
➾ We introduced Spark Streaming
  - Technology would scale
  - Could prove it enabled new analytic techniques (incident detection)
  - Open to Scala requirement
  - Wanted to prove it was easy to deploy – EC2 helped
![Page 3: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/3.jpg)
3
Organization Profile
➾ Telecommunications Wholesale Business
  - Process 90 million calls per day
  - Scale up to 1,000 calls per second; nearly half a million calls in a 5-minute window
  - Technology is loosely split into Operational Support Systems (OSS) and Business Support Systems (BSS)
➾ Core technology is mature
  - Analytics on LAMP stack
  - Technology team is strongly skilled in that stack
![Page 4: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/4.jpg)
4
Jargon
➾ Number
  - Comprised of Country Code (possibly), Area Code (NPA), Exchange (NXX), and 4 other digits
  - Area codes and exchanges are often geo-coded
  - Example: 1 512 867 5309 (Country Code 1, NPA 512, NXX 867, line 5309)
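As a minimal illustration (plain Scala, not from the deck), a number can be split into the components named above; the field widths are an assumption based on the North American numbering format:

```scala
// Split a NANP-style number (optional country code "1" plus 10 digits)
// into (country code, area code / NPA, exchange / NXX, line number).
def splitNumber(raw: String): Option[(String, String, String, String)] = {
  val d = raw.filter(_.isDigit)
  val national =
    if (d.length == 11 && d.startsWith("1")) Some(("1", d.drop(1)))
    else if (d.length == 10) Some(("", d))
    else None // not a shape this sketch understands
  national.map { case (cc, n) =>
    (cc, n.substring(0, 3), n.substring(3, 6), n.substring(6))
  }
}
```

For example, `splitNumber("1 512 867 5309")` yields `Some(("1", "512", "867", "5309"))`.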
![Page 5: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/5.jpg)
5
Jargon
➾ Trunk Group
  - A trunk is a line connecting transmissions for two points. The group of trunks has some common property, in this case being owned by the same entity.
  - Transmissions from ingress trunks are routed to transmissions to egress trunks.
➾ Route – In this case, selection of a trunk group to facilitate the termination at the call's destination
➾ QoS – Quality of Service, governed by metrics
  - Call Duration – short calls are an indication of quality problems
  - ASR – Average Seizure Rate; this company measures this as #connected calls / #calls attempted
➾ Real-time: within 5 minutes
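The two QoS metrics above are simple ratios. A minimal plain-Scala sketch (the `Call` shape and the 10-second "short call" threshold are assumptions for illustration, not the company's definitions):

```scala
// QoS metrics as defined on this slide:
//   ASR = #connected calls / #calls attempted
//   short-call rate = share of connected calls below a duration threshold,
//                     a proxy for quality problems.
case class Call(connected: Boolean, durationSec: Int)

def asr(calls: Seq[Call]): Double =
  if (calls.isEmpty) 0.0
  else calls.count(_.connected).toDouble / calls.size

def shortCallRate(calls: Seq[Call], thresholdSec: Int = 10): Double = {
  val connected = calls.filter(_.connected)
  if (connected.isEmpty) 0.0
  else connected.count(_.durationSec < thresholdSec).toDouble / connected.size
}
```

With `Seq(Call(true, 5), Call(true, 120), Call(false, 0), Call(true, 8))`, ASR is 0.75 and the short-call rate is 2/3.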
![Page 6: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/6.jpg)
6
The Problem
➾ A switch handles most of their routing
➾ A configuration table in the switch governs routing, with if-this-then-that style logic
➾ Proprietary technology handles adjustments to that table; manual intervention is also required
(Diagram: Call Logs, Business Rules Application, Database, Intranet Portal)
![Page 7: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/7.jpg)
7
The Problem
➾ Backend system receives a log of calls from the switch
  - File dumped every few minutes
  - 180 well-defined fields representing features of a call event
  - Supports downstream analytics once enriched with pricing, geo-coding, and account information
➾ Their job is to connect calls at the most efficient price without sacrificing quality
![Page 8: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/8.jpg)
8
Why Spark?
➾ They were interested because:
  - Workbench can simplify operationalizing analytics
  - They can skip a generation of clunky big data tools
  - Works with their data structures
  - Will "scale out" rather than up
  - Can handle fault-tolerant in-memory updates
![Page 10: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/10.jpg)
10
Spark Basics - Architecture
(Diagram: a Spark Driver hosting a SparkContext talks to a Cluster Manager, which schedules Executors; each Executor runs Tasks and holds a Cache.)
![Page 11: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/11.jpg)
11
Spark Basics – Call Status Count Example
```scala
val cdrLogPath = "/cdrs/cdr20140731042210.ssv"
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)
val cdrDetails = cdrLines.map(_.split(";")).cache()
val successful = cdrDetails.filter(x => x(6) == "S").count()
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()
println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))
```
![Page 12: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/12.jpg)
12
Spark Basics - RDDs
➾ Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG) – functional programming FTW!
➾ Resilient: data is redundantly stored, and can be recomputed through the generated DAG
➾ Distributed: the DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine
➾ Dataset: this construct is native to Spark computation
![Page 13: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/13.jpg)
13
Spark Basics - RDDs
➾ Lazy
➾ Transformations for tasks and slices
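Laziness can be shown without Spark at all. A plain-Scala analogy (using `LazyList`, Scala 2.13+; this is an illustration of the idea, not Spark's implementation): transformations describe work, and nothing runs until an action forces it.

```scala
// Like RDD transformations, the map below is only *declared* here;
// nothing is computed until an "action" (take) forces evaluation.
var evaluated = 0
val mapped = LazyList.from(1).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)           // transformation declared, nothing run yet

val first3 = mapped.take(3).toList
assert(first3 == List(2, 4, 6))
assert(evaluated == 3)           // only the needed elements were computed
```

Spark's planner exploits the same property: because the DAG is declared up front, it can pipeline and slice the work before executing any of it.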
![Page 14: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/14.jpg)
14
Streaming Applications – Why try it?
➾ Streaming applications:
  - Site activity statistics
  - Spam detection
  - System monitoring
  - Intrusion detection
  - Telecommunications network data
![Page 15: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/15.jpg)
15
Streaming Models
➾ Record-at-a-time
  - Receive one record and process it
  - Simple, low latency
➾ Micro-batch
  - Receive records and periodically run a batch process over a window
  - High throughput
  - Process *must* run fast enough to handle all records collected
  - Harder to reduce latency
  - Easy reasoning: global state, fault tolerance, unified code
![Page 16: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/16.jpg)
16
DStreams
➾ Stands for Discretized Streams
➾ A series of RDDs
➾ Spark already provided a computation model on RDDs
➾ Note records are ordered as they are received
  - They are also time-stamped for computation in that order
  - Is that always the way you want to see your data?
![Page 17: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/17.jpg)
17
Fault Tolerance – Parallel Recovery
➾ Failed nodes
➾ Stragglers!
![Page 18: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/18.jpg)
18
Fault Tolerance - Recompute
![Page 19: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/19.jpg)
19
Throughput vs. Latency
![Page 20: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/20.jpg)
20
Anatomy of a Spark Streaming Program
```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.SynchronizedQueue

val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))

val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
val mappedStream = inputStream.map(x => (x % 10, 1))
val reducedStream = mappedStream.reduceByKey(_ + _)
reducedStream.print()
ssc.start()

for (i <- 1 to 30) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()
```

Utilities also available for Twitter, Kafka, Flume, and file streams.
![Page 21: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/21.jpg)
21
Windows
(Diagram: overlapping windows of length "window", each advancing by "slide".)
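The window/slide relationship can be sketched in plain Scala (not Spark): a window of length `window` starts every `slide` time units, so consecutive windows overlap whenever `slide < window`. The integer timestamps and bounds here are assumptions for illustration.

```scala
// For events with integer timestamps, return the contents of each
// window of length `window` that starts every `slide` units in [0, end].
def windows(ts: Seq[Int], window: Int, slide: Int, end: Int): Seq[Seq[Int]] =
  (0 to end - window by slide).map { start =>
    ts.filter(t => t >= start && t < start + window)
  }
```

For example, `windows(Seq(1, 4, 7, 11), window = 10, slide = 5, end = 20)` yields `Seq(Seq(1, 4, 7), Seq(7, 11), Seq(11))`: the event at 7 appears in two overlapping windows.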
![Page 22: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/22.jpg)
22
Streaming Call Analysis with Windows
```scala
// iteration, window, and slide are durations (in seconds) defined elsewhere;
// extractCallDetailRecord and detectIncidents are application helpers.
val path = "/Users/chance/Documents/cdrdrop"
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
val ssc = new StreamingContext(conf, Seconds(iteration))
val callStream = ssc.textFileStream(path)
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136)
  .map(c => extractCallDetailRecord(c))
val result = detectIncidents(cdrArr)
result.foreach(rdd => rdd.take(10)
  .foreach { case (x, (d, high, low, res)) =>
    println(x + "," + high + "," + d + "," + low + "," + res) })
ssc.start()
ssc.awaitTermination()
```
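The deck does not show `extractCallDetailRecord`. A hypothetical sketch of what such a helper could look like (the field indices and the `CallDetailRecord` shape are invented for illustration; the real CDR has ~180 fields in the company's own layout):

```scala
// Hypothetical CDR parser: pull a few typed fields out of a
// semicolon-split call detail record. Indices are made up here,
// except index 6, which the earlier count example used for call status.
case class CallDetailRecord(callId: String, status: String, durationSec: Int)

def extractCallDetailRecord(fields: Array[String]): CallDetailRecord =
  CallDetailRecord(
    callId      = fields(0),
    status      = fields(6),                           // "S" / "U"
    durationSec = fields(10).toIntOption.getOrElse(0)  // tolerate bad data
  )
```

Parsing into a case class up front is what lets the downstream `detectIncidents` work with typed values instead of raw string arrays.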
![Page 23: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/23.jpg)
23
Can we enable new analytics?
➾ Incident detection
  - Chose a univariate technique [1] to detect behavior out of profile from recent events
  - Technique identifies out-of-profile events and dramatic shifts in the profile
  - Easy to understand
(Figure: profile computed over a recent window.)
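One common univariate out-of-profile check, shown here as an illustrative stand-in rather than the technique the engagement actually used: flag a value that falls outside `mean ± k * stddev` of the recent window.

```scala
// Flag `value` as out of profile if it lies more than k standard
// deviations from the mean of the recent window of observations.
def outOfProfile(recent: Seq[Double], value: Double, k: Double = 3.0): Boolean = {
  val mean = recent.sum / recent.size
  val variance = recent.map(x => math.pow(x - mean, 2)).sum / recent.size
  math.abs(value - mean) > k * math.sqrt(variance)
}
```

With a recent window of `Seq(10, 11, 9, 10, 10)`, a reading of 20 is flagged while 10.5 is not. In the streaming setting, the "recent window" is simply the contents of the current sliding window.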
![Page 24: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/24.jpg)
24
Is it simple to deploy?
➾ EC2 helped
➾ Client had no Hadoop, and little NoSQL, expertise
➾ Develop and deploy: built with sbt, ran on master
➾ Architecture involved:
  - Pushed new call detail logs to HDFS on EC2
  - Streaming picks up new data and updates RDDs accordingly
  - Results were explored in two ways:
    - Accessing results through data virtualization
    - Writing RDD results (small) to a SQL database, then using a business intelligence tool to create report content
(Diagram: call logs flow into streaming/current processing on HDFS on EC2, then out to analysis and reporting dashboards; multiple delivery options.)
![Page 25: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/25.jpg)
25
Results
![Page 26: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/26.jpg)
26
Results
(Chart: throughput in MB, compared against published WordCount results; axis range roughly 10-350 MB.)
![Page 27: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/27.jpg)
27
Summary of Results
➾ Technology would scale
  - Handled 5 minutes of data sub-second
➾ Proved new analytics enabled
  - Solved single-variable incident detection
  - Small, simple code
➾ Made a case for Scala adoption
  - Team is still skeptical about big data
➾ Wanted to prove it was easy to deploy – EC2 helped
  - Burned by a forward-slash bug in an AWS secret token
![Page 28: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/28.jpg)
28
Incident Visual
![Page 29: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/29.jpg)
29
References
➾ [1] Zaharia et al.: Discretized Streams
➾ [2] Zaharia et al.: Discretized Streams: Fault-Tolerant Streaming Computation at Scale
➾ [3] Das: Spark Streaming – Real-time Big Data Processing
➾ [4] Spark Streaming Programming Guide
➾ [5] Running Spark on EC2
➾ [6] Spark on EMR
➾ [7] Ahelegby: Time Series Outliers
![Page 30: Spark Streaming Early Warning Use Case](https://reader036.fdocuments.net/reader036/viewer/2022081519/55c51e98bb61ebab7d8b4673/html5/thumbnails/30.jpg)
30
Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble