Scala-like Distributed Collections - Dumping Time-Series Data with Apache Spark


  • Scala-like Distributed Collections: Dumping Time-Series Data With Apache Spark

    Demi Ben-Ari - CTO @ Panorays

  • About Me

    Demi Ben-Ari, Co-Founder & CTO @ Panorays

    BSc Computer Science - Academic College Tel-Aviv Yaffo

    Co-Founder of the Big Things Big Data Community

    In the past:

    Sr. Data Engineer - Windward

    Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - Ofek, IAF

    Interested in almost every kind of technology - a true geek

  • Agenda

    Scala and Spark analogies

    Data flow and Environment

    What's our time-series data like?

    Where we started from - where we got to

    Problems and our decisions

    Conclusions

  • Scala and Spark analogies

  • Scala is...

    Functional

    Object Oriented

    Statically typed

    Interoperates well with Java and JavaScript

    JVM based

  • DSLs on top of Scala

    SBT

    Spiral

    Scalaz

    Slick

    Dispatch

    Chisel

    Specs

    Opti{X}

    shapeless

    ScalaTest

    Squeryl

  • Scala & Spark (Architecture)

    Scala REPL

    Scala Compiler

    Spark Runtime

    Scala Runtime

    JVM

    File System (e.g. HDFS, Cassandra, S3...)

    Cluster Manager (e.g. YARN, Mesos)

  • What kind of DSL is Apache Spark

    Centered around Collections

    Immutable data sets equipped with functional transformations

    These are exactly the Scala collection operations

    map, flatMap, filter, ...

    reduce, fold, aggregate, ...

    union, intersection, ...
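
    As a hedged illustration (not from the deck; a local SparkContext named sc is assumed), the same pipeline can be written against a plain Scala collection and against an RDD using identical operators:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("collections-demo").setMaster("local[*]"))
    val nums = Seq(1, 2, 3, 4, 5)

    // Plain Scala collection: runs immediately in a single JVM.
    val local = nums.map(_ * 2).filter(_ > 4).reduce(_ + _)

    // Spark RDD: the very same operators, distributed over the cluster's partitions.
    val distributed = sc.parallelize(nums).map(_ * 2).filter(_ > 4).reduce(_ + _)

    println(s"local=$local, distributed=$distributed")   // both print 24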

  • Spark vs. Scala Collections

    So, Spark is exactly Scala Collections, but running in a Cluster?

    Not quite. There are two main differences:

    Spark is lazy, Scala collections are strict

    Spark has added functionality, e.g. PairRDDs

    This gives us the power to do lots of operations in the NoSQL distributed world
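
    A minimal sketch of both differences (illustrative only; sc is assumed to be an existing SparkContext):

    // 1) Laziness: map/filter only record lineage - nothing executes yet.
    val pending = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)
    // Only an action (count, collect, saveAsTextFile, ...) triggers the distributed job.
    val matches = pending.count()

    // 2) PairRDDs: an RDD of (key, value) tuples gains extra operations
    //    such as reduceByKey, join and groupByKey.
    val pairs = sc.parallelize(Seq(("ship", 1), ("plane", 1), ("ship", 1)))
    val countsByKey = pairs.reduceByKey(_ + _).collect()   // (ship,2) and (plane,1)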

  • Collections Design Choices

    Imperative vs. Functional

    Strict vs. Lazy

    java.util vs. scala.collection.immutable

    Scala vs. OCaml

    Spark vs. C#

    Scala Streams, views

  • Spark is A Multi-Language Platform

    Why use Scala instead of Python?

    Native to Spark - can use everything without translation

    Types help

  • So, Bottom Line: What's Spark?

  • United Tools Platform - Single Framework

    Batch

    Interactive

    Streaming

    Single Framework

  • United Tools Platform

  • Spark Standalone Cluster - Architecture

    [Architecture diagram: a Master node (plus History Server, etc.) coordinating several Slave nodes; each Slave runs a Worker with its own memory and CPU cores (Core 1-4).]

  • Data flow and Environment (Our Use Case)

  • Structure of the Data

    Geo Locations + Metadata

    Arriving over time

    Different types of messages being reported by satellites

    Encoded

    Might arrive later than actually transmitted

  • Data Flow Diagram

    [Flow diagram: External Data Source -> Data Pipeline (Raw -> Parsed -> Entity Resolution Process) -> Analytics Layers (building insights on top of the entities: Anomaly Detection, Trends) -> Data Output Layer]

  • Environment Description

    Clusters for the Dev, Testing, Live, Staging and Production environments

    OB1K - RESTful Java Services

  • Basic Terms - Idempotence

    Idempotence is the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application.

    Function: Same input => Same output
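
    A tiny illustrative sketch (hypothetical ship-position example, not from the deck) of why idempotence matters when jobs get retried:

    import scala.collection.mutable

    val positions = mutable.Map.empty[String, Double]   // shipId -> last reported latitude

    // Idempotent: keyed overwrite - applying it twice leaves the same state as applying it once.
    def upsertPosition(shipId: String, lat: Double): Unit = positions(shipId) = lat

    // Not idempotent: every retry appends another copy and changes the result.
    val log = mutable.ListBuffer.empty[Double]
    def appendPosition(lat: Double): Unit = log += lat

    upsertPosition("ship-1", 32.07)
    upsertPosition("ship-1", 32.07)   // same input => same output/state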

  • Basic Terms

    Missing parts in time-series data:

    Data arriving from the satellites might be delayed because of bad transmission

    Data vendors delaying the data stream

    Calculation in layers may cause "holes" in the data

    Calculating the data layers by time slices

  • Basic Terms - Partitions == Parallelism

    Physical / Logical partitioning

    Resilient Distributed Datasets (RDDs) == Collections

    A fault-tolerant collection of elements that can be operated on in parallel

    Applying immutable transformations and actions over RDDs
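
    For illustration (partition counts are assumed; sc is an existing SparkContext), partitions are what Spark parallelizes over, and every transformation yields a new, immutable RDD:

    // Ask for 8 partitions => up to 8 tasks can run in parallel on the workers.
    val raw = sc.parallelize(1 to 1000000, numSlices = 8)
    println(raw.partitions.length)                 // 8

    // Transformations never mutate 'raw'; they return new RDDs built on its lineage.
    val doubled = raw.map(_ * 2L)
    val repartitioned = doubled.repartition(16)    // change the physical partitioning

    // Actions trigger the actual (fault-tolerant) distributed computation.
    println(repartitioned.count())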

  • So what's the problem?

  • The Problem - Receiving DATA

    Beginning state, no data, and the timeline begins

    T = 0

    Level 3 Entity

    Level 2 Entity

    Level 1 Entity

  • The Problem - Receiving DATA

    T = 10

    Level 3 Entity

    Level 2 Entity

    Level 1 Entity

    Computation sliding window size

    Level 1 entities' data arrives and gets stored

  • The Problem - Receiving DATA

    T = 10

    Level 3 Entity

    Level 2 Entity

    Level 1 Entity

    Computation sliding window size

    Level 2 entities are created on top of Level 1s' data (decreased amount of data)

    Level 3 entities are created on top of Level 2s' data (decreased amount of data)

  • The Problem - Receiving DATA

    T = 20

    Level 3 Entity

    Level 2 Entity

    Level 1 Entity

    Computation sliding window size

    Level 1 entities' data arriving late

    Because of the sliding window's back size, Level 2 and 3 entities would not be created properly, and there would be "holes" in the data

  • Solution to the Problem

    Creating dependent micro services forming a data pipeline

    Mainly Apache Spark applications

    Services are only dependent on the data - not on the previous services' runs

    Forming a structure and scheduling of a back sliding window

    Know your data and its relevance through time

    Don't try to foresee the future - it might bias the results

  • Starting point & Infrastructure

  • How we started?

    Spark Standalone via EC2 scripts

    Around 5 nodes (r3.xlarge instances)

    Didn't want to keep a persistent HDFS - it costs a lot

    100 GB per day => ~150 TB for 4 years

    Cost per server per year (r3.xlarge): On Demand: ~$2,900, Reserved: ~$1,750

    Know your costs: http://www.ec2instances.info/

  • Decision: Working with S3 as the persistence layer

    Pay extra for PUT ($0.005 per 1,000 requests) and GET ($0.004 per 10,000 requests)

    150 TB => ~$210 for 4 years of data

    Same format as HDFS (CSV files):

    s3n://some-bucket/entity1/201412010000/part-00000

    s3n://some-bucket/entity1/201412010000/part-00001
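
    A sketch of that layout (hedged: the bucket path is the one shown on the slide, the Position fields are hypothetical, and AWS credentials are assumed to be configured on the cluster), writing one time slice as CSV part-files and reading it back:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Position(entityId: String, ts: Long, lat: Double, lon: Double)

    // Each partition becomes one part-0000N file under the time-slice "directory".
    def dumpSlice(positions: RDD[Position], timeSlice: String): Unit =
      positions
        .map(p => s"${p.entityId},${p.ts},${p.lat},${p.lon}")   // plain CSV lines
        .saveAsTextFile(s"s3n://some-bucket/entity1/$timeSlice")

    def loadSlice(sc: SparkContext, timeSlice: String): RDD[Array[String]] =
      sc.textFile(s"s3n://some-bucket/entity1/$timeSlice").map(_.split(","))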

  • What about the serving?

  • MongoDB for Serving

    [Diagram: the Spark cluster (Master + Worker 1..N) writes to a MongoDB Replica Set, which also serves the reads.]

  • Spark Slave - Server Specs

    Instance type: r3.xlarge

    CPUs: 4

    RAM: 30.5 GB

    Storage: ephemeral

    Amount: 10+

  • MongoDB - Server Specs

    MongoDB version: 2.6.1

    Instance type: m3.xlarge (AWS)

    CPUs: 4

    RAM: 15 GB

    Storage: EBS

    DB size: ~500 GB

    Collection indexes: 5 (4 compound)

  • The Problem - Batch Jobs

    Should run for 5-10 minutes in total

    Actual: runs for ~40 minutes

    Why?

    ~20 minutes to write with the Java Mongo driver - async (Unacknowledged)

    ~20 minutes to sync the journal

    Total: ~40 minutes of the DB being unavailable - no batch process response and no UI serving
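
    Roughly what such a write path looks like (a sketch only, assuming the 2.x Java Mongo driver that matches MongoDB 2.6 and hypothetical host/database/collection names); with an unacknowledged write concern the client returns immediately, so the real cost shows up later as journal syncing and an unavailable DB:

    import com.mongodb.{BasicDBObject, MongoClient, WriteConcern}

    val client = new MongoClient("mongo-host", 27017)             // hypothetical host
    val coll = client.getDB("serving").getCollection("entities")  // hypothetical names

    def writeEntity(id: String, payload: String): Unit = {
      val doc = new BasicDBObject("_id", id).append("payload", payload)
      // Fire-and-forget: the driver does not wait for the server to acknowledge the write.
      coll.insert(doc, WriteConcern.UNACKNOWLEDGED)
    }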

  • Alternative Solutions

    Sharded MongoDB (with replica sets)

    Pros:

    Increases throughput by the number of shards

    Increases the availability of the DB

    Cons:

    Very hard to manage DevOps-wise (for a small team of developers)

    High cost of servers, because each shard needs 3 replicas

  • Workflow with MongoDB

    [Diagram: the Spark cluster's workers (Worker 1..N) write to the MongoDB replica set through its single master, which also serves the reads.]

  • Our DevOps after that solution

    We had no DevOps guy at that time at all

  • Alternative Solutions

    Apache Cassandra

    Pros:

    Very large developer community

    Linearly scalable database

    No single-master architecture

    Proven to work with distributed engines like Apache Spark

    Cons:

    We had no experience at all with the database

    No geospatial index - needed to implement it ourselves

  • The Solution

    Migration to Apache Cassandra

    Easily create a Cassandra cluster using the DataStax Community AMI on AWS

    First easy step: using the spark-cassandra-connector (easy bootstrap move from Spark to Cassandra)

    Creating a monitoring dashboard for Cassandra
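
    A hedged sketch of that bootstrap (the keyspace, table, column names and contact point here are hypothetical; saveToCassandra and cassandraTable are the connector's standard calls):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("write-to-cassandra")
      .set("spark.cassandra.connection.host", "10.0.0.10")   // any Cassandra contact point
    val sc = new SparkContext(conf)

    // Write an RDD of tuples straight into a Cassandra table.
    val positions = sc.parallelize(Seq(("ship-1", 1480000000L, 32.07, 34.79)))
    positions.saveToCassandra("tracking", "positions",
      SomeColumns("entity_id", "ts", "lat", "lon"))

    // Read it back as an RDD for the next micro service in the pipeline.
    val rows = sc.cassandraTable("tracking", "positions").where("entity_id = ?", "ship-1")
    println(rows.count())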

  • Workflow with Cassandra

    [Diagram: the Spark cluster's workers (Worker 1..N) write directly to the Cassandra cluster, which also serves the reads.]

  • Result

    Performance improvement: the batch-write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB

    Took 2 weeks to go from zero to hero, and to ramp up a running solution that works without glitches

  • So what's the problem (again)?

  • Transferring the Heaviest Process

    Micro service that runs every 10 minutes

    Writes 30 GB to Cassandra per iteration (replication factor 3 => 90 GB)

    At first it took us 18 minutes to do all of the writes

    Not acceptable for a 10-minute process

  • Cluster On OpsCenter - Before