Scala-like Distributed Collections: Dumping Time-Series Data With
Apache Spark
Demi Ben-Ari - CTO @ Panorays
About Me
Demi Ben-Ari, Co-Founder & CTO @ Panorays
● B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
● Co-Founder, “Big Things” Big Data Community
In the past:
● Sr. Data Engineer – Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System (“Ofek”) – IAF
Interested in almost every kind of technology – a true geek
Agenda
● Scala and Spark analogies
● Data flow and Environment
● What’s our time series data like?
● Where we started from - where we got to
○ Problems and our decisions
● Conclusions
Scala and Spark analogies
Scala is...
● Functional
● Object Oriented
● Statically typed
● Interoperates well with Java and JavaScript
○ JVM based
DSLs on top of Scala
SBT, Spiral, Scalaz, Slick, Dispatch, Chisel, Specs, Opti{X}, shapeless, ScalaTest, Squeryl
Scala & Spark (Architecture)
● Scala REPL and Scala Compiler on top
● Spark Runtime
● Scala Runtime
● JVM
● File System (e.g. HDFS, Cassandra, S3…) and Cluster Manager (e.g. YARN, Mesos) underneath
What kind of DSL is Apache Spark
● Centered around Collections
● Immutable data sets equipped with functional transformations
● These are exactly the Scala collection operations
map flatMap filter ...
reduce fold aggregate ...
union intersection ...
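These operations can be tried on plain local Scala collections, no Spark required; on an RDD the calls look the same, only the evaluation is distributed across a cluster:

```scala
// The Spark operation vocabulary above, demonstrated on a local Vector.
val xs = Vector(1, 2, 3, 4, 5)

val doubled = xs.map(_ * 2)                   // transform every element
val flat    = xs.flatMap(x => Vector(x, -x))  // each element expands to two
val evens   = xs.filter(_ % 2 == 0)           // keep matching elements
val total   = xs.reduce(_ + _)                // combine into a single value
val folded  = xs.fold(100)(_ + _)             // reduce with a start value
val common  = xs.intersect(Vector(3, 4, 9))   // set-like operation
```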
Spark vs. Scala Collections
● So, is Spark exactly Scala collections, but running in a cluster?
● Not quite. There are two main differences:
○ Spark is lazy, Scala collections are strict
○ Spark has added functionality, e.g. PairRDDs
■ These give us the power to do many operations in the distributed NoSQL world
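A local, Spark-free sketch of what PairRDDs add: the classic word count, where `groupBy` plays the role of the shuffle behind `reduceByKey` (in real Spark code this would be `rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`):

```scala
// Word count on local collections, mimicking the PairRDD pattern:
// pair each word with 1, group by key, then sum per key.
val lines = Seq("spark is lazy", "scala is strict")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```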
Collections Design Choices
● Imperative vs. Functional
○ Imperative: java.util
○ Functional: scala.collection.immutable
● Strict vs. Lazy
○ Strict: Scala, OCaml
○ Lazy: Spark, C#, Scala Streams and views
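The strict-vs-lazy distinction can be seen locally with a `.view` (a sketch; Spark's laziness goes further by deferring the whole computation until an action runs):

```scala
// Strict: every intermediate collection is built eagerly.
val strictResult = (1 to 1000000).map(_ * 2).filter(_ % 3 == 0).take(3).toList

// Lazy: a view defers work until elements are demanded, so only a
// handful of elements are ever evaluated -- akin to Spark running
// nothing until an action is called.
var evaluated = 0
val lazyResult = (1 to 1000000).view
  .map { x => evaluated += 1; x * 2 }
  .filter(_ % 3 == 0)
  .take(3)
  .toList
```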
Spark is A Multi-Language Platform
● Why use Scala instead of Python?
○ Native to Spark, Can use everything without translation
○ Types help
So, Bottom Line… What’s Spark?
● A united tools platform – a single framework for Batch, Interactive, and Streaming workloads
Spark Standalone Cluster - Architecture
● Master node: Master, History Server, etc.
● Slave nodes: each runs a Worker with its own memory and cores (Core 1–4)
[Diagram: one Master coordinating several Slaves; every Slave hosts a Worker holding memory and four cores.]
Data flow and Environment (Our Use Case)
Structure of the Data
● Geo Locations + Metadata
● Arriving over time
● Different types of messages being reported by satellites
● Encoded
● Might arrive later than actually transmitted
Data Flow Diagram
[Diagram: External Data Source → Data Pipeline (Parsed Raw → Entity Resolution Process → Building insights on top of the entities) → Analytics Layers (Anomaly Detection, Trends) → Data Output Layer]
Environment Description
● Clusters: Dev Testing, Live Staging, Production Env
● OB1K – RESTful Java services
Basic Terms
● Idempotence: the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application
● Function: same input => same output
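A tiny Scala sketch of idempotence in this sense (the names are illustrative): an upsert keyed by entity id can be replayed safely, e.g. after a retried batch run:

```scala
// Hypothetical idempotent "upsert" keyed by entity id: re-running
// the same write leaves the store unchanged beyond the first application.
case class Entity(id: String, value: Double)

def upsert(store: Map[String, Entity], e: Entity): Map[String, Entity] =
  store + (e.id -> e)

val e     = Entity("ship-42", 3.14)
val once  = upsert(Map.empty, e)
val twice = upsert(once, e)
// once == twice: safe to replay the pipeline over the same data.
```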
Basic Terms
● Missing parts in time-series data
○ Data arriving from the satellites might be delayed because of bad transmission
○ Data vendors delaying the data stream
○ Calculation in layers may cause “holes” in the data
● Calculating the data layers by time slices
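The time-slice idea can be sketched like this (the 10-minute slice size is illustrative): floor each reading's timestamp to its slice boundary, so layered calculations run slice by slice and late data still lands in its original slice:

```scala
// Bucket raw readings into fixed 10-minute time slices.
val sliceMillis = 10 * 60 * 1000L

def sliceOf(timestampMillis: Long): Long =
  timestampMillis - (timestampMillis % sliceMillis)

val readings = Seq(
  (1000L, "a"),    // slice 0
  (599999L, "b"),  // still slice 0 (just under 10 minutes)
  (600000L, "c"),  // slice 600000
  (1234567L, "d")  // slice 1200000
)
val bySlice = readings.groupBy { case (ts, _) => sliceOf(ts) }
```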
Basic Terms
● Partitions == Parallelism
○ Physical / Logical partitioning
● Resilient Distributed Datasets (RDDs) == Collections
○ A fault-tolerant collection of elements that can be operated on in parallel
○ Applying immutable transformations and actions over RDDs
So what’s the problem?
The Problem - Receiving DATA
[Diagram: three timelines, one per entity level (Level 1, Level 2, Level 3), advancing as data arrives within a computation sliding window.]
● T = 0: beginning state – no data, and the timeline begins
● T = 10: Level 1 entities’ data arrives and gets stored, within the computation sliding window
○ Level 2 entities are created on top of Level 1’s data (decreased amount of data)
○ Level 3 entities are created on top of Level 2’s data (decreased amount of data)
● T = 20: a Level 1 entity’s data arrives late, behind the computation sliding window
○ Because of the sliding window’s back size, Level 2 and 3 entities would not be created properly, and there would be “holes” in the data
Solution to the Problem
● Creating dependent micro-services forming a data pipeline
○ Mainly Apache Spark applications
○ Services are only dependent on the data – not the previous service’s run
● Forming a structure and scheduling of a “Back Sliding Window”
○ Know your data and its relevance through time
○ Don’t try to foresee the future – it might bias the results
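The “back sliding window” scheduling can be sketched as follows (slice size and window depth are illustrative): each run re-processes the last few slices, so late arrivals get folded in without trying to foresee the future:

```scala
// Every run recomputes not just the newest slice but the last
// `backSlices` slices, so late-arriving Level 1 data is picked up.
val sliceMillis = 10 * 60 * 1000L
val backSlices  = 3  // how far back we re-process

def slicesToRecompute(nowMillis: Long): Seq[Long] = {
  val current = nowMillis - (nowMillis % sliceMillis)
  (0 until backSlices).map(i => current - i * sliceMillis).reverse
}
```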
Starting point & Infrastructure
How we started?
● Spark Standalone – via EC2 scripts
○ Around 5 nodes (r3.xlarge instances)
○ Didn’t want to keep a persistent HDFS – costs a lot
○ 100 GB (per day) => ~150 TB for 4 years
○ Cost per server per year (r3.xlarge):
■ On demand: ~$2,900
■ Reserved: ~$1,750
● Know your costs: http://www.ec2instances.info/
Decision
● Working with S3 as the persistence layer
○ Pay extra for:
■ PUT ($0.005 per 1,000 requests)
■ GET ($0.004 per 10,000 requests)
○ 150 TB => ~$210 for 4 years of data
● Same format as HDFS (CSV files)
○ s3n://some-bucket/entity1/201412010000/part-00000
○ s3n://some-bucket/entity1/201412010000/part-00001
○ ……
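A small sketch of how the layout above might be derived (the helper name is hypothetical): one S3 prefix per entity per time slice, holding the CSV part files:

```scala
// Derive the S3 prefix for a time slice, mirroring the
// s3n://bucket/entity/yyyyMMddHHmm/ layout shown above.
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}

val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC)

def slicePrefix(bucket: String, entity: String, epochMillis: Long): String =
  s"s3n://$bucket/$entity/${fmt.format(Instant.ofEpochMilli(epochMillis))}"
```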
What about the serving?
MongoDB for Serving
[Diagram: the Spark cluster (Master + Workers 1…N) writes to a MongoDB replica set, which is read for serving.]
Spark Slave - Server Specs
● Instance type: r3.xlarge
● CPUs: 4
● RAM: 30.5 GB
● Storage: ephemeral
● Amount: 10+
MongoDB - Server Specs
● MongoDB version: 2.6.1
● Instance type: m3.xlarge (AWS)
● CPUs: 4
● RAM: 15 GB
● Storage: EBS
● DB size: ~500 GB
● Collection indexes: 5 (4 compound)
The Problem
● Batch jobs
○ Should run for 5–10 minutes in total
○ Actually run for ~40 minutes
● Why?
○ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged)
○ ~20 minutes to sync the journal
○ Total: ~40 minutes of the DB being unavailable
○ No batch process response and no UI serving
Alternative Solutions
● Sharded MongoDB (with replica sets)
○ Pros:
■ Increases throughput by the number of shards
■ Increases the availability of the DB
○ Cons:
■ Very hard to manage DevOps-wise (for a small team of developers)
■ High cost of servers – because each shard needs 3 replicas
Workflow with MongoDB
[Diagram: the Spark cluster (Master + Workers 1…N) writing to and reading from the sharded MongoDB deployment through its master nodes.]
Our DevOps – After that solution
We had no DevOps guy at that time at all
☹
Alternative Solutions
● Apache Cassandra
○ Pros:
■ Very large developer community
■ Linearly scalable database
■ No single-master architecture
■ Proven to work with distributed engines like Apache Spark
○ Cons:
■ We had no experience at all with the database
■ No geospatial index – we needed to implement it ourselves
The Solution
● Migration to Apache Cassandra
● Easily create a Cassandra cluster using the DataStax Community AMI on AWS
○ First easy step – using the spark-cassandra-connector (easy bootstrap move, Spark ⬄ Cassandra)
○ Creating a monitoring dashboard for Cassandra
Workflow with Cassandra
[Diagram: the Spark cluster (Workers 1…N) writes to the Cassandra cluster, which is read for serving.]
Result
● Performance improvement
○ Batch-write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB
● Took 2 weeks to go from “Zero to Hero” and to ramp up a running solution that works without glitches
So what’s the problem (Again)?
Transferring the Heaviest Process
● Micro-service that runs every 10 minutes
● Writes 30 GB to Cassandra per iteration
○ (Replication factor 3 => 90 GB)
● At first it took us 18 minutes to do all of the writes
○ Not acceptable in a 10-minute process
Cluster On OpsCenter - Before
Transferring the Heaviest Process
● Solutions
○ We chose the i2.xlarge
○ Optimization of the cluster
○ Changing the JDK to Java 8
■ Changing the GC algorithm to G1
○ Tuning the operating system
■ ulimit, removing the swap
○ Write time went down to ~5 minutes (for 30 GB, RF=3)
Sounds good, right? I don’t think so
Cloud Watch After Tuning
The Solution
● Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
○ Write time went down from ~5 minutes to ~1.5 minutes
● Added another process, not dependent on the main one, that runs every 15 minutes
○ Reads from S3, downscales the amount, and writes it to Cassandra for serving
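The downscale step might look like this pure-Scala sketch (field names and the keep-latest policy are illustrative): keep only the latest raw point per entity before writing the serving copy to Cassandra:

```scala
// Downscale: from all raw points in a slice, keep the latest point
// per entity -- far fewer rows for the serving store to hold.
case class Point(entityId: String, ts: Long, lat: Double, lon: Double)

def downscale(raw: Seq[Point]): Seq[Point] =
  raw.groupBy(_.entityId)
     .values
     .map(_.maxBy(_.ts))
     .toSeq

val raw = Seq(
  Point("a", 100L, 1.0, 1.0),
  Point("a", 200L, 1.1, 1.1),
  Point("b", 150L, 2.0, 2.0)
)
val kept = downscale(raw).sortBy(_.entityId)
```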
How it looks after all?
[Diagram: a Heavy Fusion Process produces Parsed Raw and Static/Aggregated Data; Spark Analytics Layers read them and write Downscaled Data for UI Serving.]
Conclusion
● Always give an estimate of your data
○ Frequency
○ Volume
○ Arrangement of the previous phase
● There is no “best” persistence layer
○ There is the right one for the job
○ Don’t overload an existing solution
Conclusion
● Spark is a great framework for distributed collections
○ Fully functional API
○ Can perform imperative actions
● “With great power comes lots of partitioning”
○ Control your work and data distribution via partitions
● https://www.pinterest.com/pin/155514993354583499/ (Thanks)
Questions?
Thanks! my contact:
Demi Ben-Ari
● LinkedIn
● Twitter: @demibenari
● Blog: http://progexc.blogspot.com/
● Email: [email protected]
● “Big Things” Community: Meetup, YouTube, Facebook, Twitter