Devops Spark Streaming

Agenda

Distributed Processing
Two Use Cases
Spark
Spark Streaming Ecosystem

Distributed Processing

1. https://spark.apache.org/docs/latest/cluster-overview.html

Berkeley AMPLab, 2009
Fast, general-purpose cluster computing platform
Up to 10X to 100X faster than Hadoop MapReduce, largely because it keeps data in memory
Can run on top of Hadoop

1. Open source implementation of Resilient Distributed Datasets (RDDs)
2. Advanced DAG execution engine supporting acyclic data flow and in-memory computing
3. Java, Scala, Python and R
4. Mesos, YARN, Standalone, Cloud, Notebook
5. HDFS, Hive, Cassandra, HBase, Tachyon, Hadoop

RDDs + DAG + Lazy Execution
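To make the "RDDs + DAG + lazy execution" idea concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the class `ToyRDD` and the helper `square` are hypothetical names introduced only for illustration. Transformations merely record a lineage, and nothing runs until an action is called.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original data (any re-iterable collection)
        self._lineage = lineage    # tuple of (kind, function) steps: a linear DAG

    def map(self, fn):
        # Transformation: returns a new ToyRDD with a longer lineage, computes nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded lineage over the source, one step at a time.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers evaluation of the whole lineage.
        return list(self._compute())

    def reduce(self, fn):
        # Action: triggers evaluation and folds the results.
        return reduce(fn, self._compute())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
assert calls == []                      # nothing has executed yet
result = rdd.collect()                  # the action runs the lineage
total = rdd.reduce(lambda a, b: a + b)  # recomputed from the lineage, like an uncached RDD
```

Because each `ToyRDD` keeps only its source and lineage, a lost result can always be rebuilt by replaying the steps, which is the intuition behind the "resilient" in RDD.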

credit: Pietro Michiardi - Spark Internals

credit: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html

Spark Streaming ecosystem

credit: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

NETFLIX ARCHITECTURE

Spark Streaming Set Up

ZooKeeper is a system for distributed coordination and service discovery

It is highly available

ZooKeeper Features

Distributed coordination
Distributed queues
Distributed locks
Discovery service
Leader election
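Leader election is commonly built on ZooKeeper with ephemeral sequential nodes: each participant registers a node, and the live node with the lowest sequence number leads. The sketch below simulates that recipe in plain Python. `ToyEnsemble` and its methods are hypothetical names, not a ZooKeeper client API, and no networking or sessions are modeled.

```python
class ToyEnsemble:
    """Toy stand-in for a coordination service with ephemeral sequential nodes."""

    def __init__(self):
        self._next_seq = 0
        self.nodes = {}            # sequence number -> client name

    def join(self, client):
        # Each client registers a node with a monotonically increasing sequence number.
        seq = self._next_seq
        self._next_seq += 1
        self.nodes[seq] = client
        return seq

    def session_lost(self, seq):
        # An ephemeral node disappears when its owner's session ends.
        self.nodes.pop(seq, None)

    def leader(self):
        # Convention: the live node with the lowest sequence number leads.
        return self.nodes[min(self.nodes)] if self.nodes else None


zk = ToyEnsemble()
a = zk.join("worker-a")
b = zk.join("worker-b")
c = zk.join("worker-c")

first_leader = zk.leader()   # worker-a registered first
zk.session_lost(a)           # simulate worker-a crashing
second_leader = zk.leader()  # leadership passes to the next-lowest node
```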

Kafka Features

Distributed: runs on a set of servers called brokers
Scalable
Publisher-Subscriber System - topic-based subscription
Reliable - messages passed to Kafka are replicated and persisted to disk
Preserves message order within each partition
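The publish-subscribe behaviour described above can be sketched as an append-only log per partition, with each consumer tracking its own read offsets. This is a toy in-memory model under stated assumptions: `ToyTopic` and `ToyConsumer` are hypothetical names, the byte-sum partitioner is a deterministic stand-in for a real hash, and replication and disk persistence are not modeled.

```python
from collections import defaultdict


class ToyTopic:
    """Toy model of a topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Messages with the same key always land in the same partition,
        # so their relative order is preserved.
        p = sum(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        # Reading does not remove messages; consumers track their own offsets.
        return self.partitions[partition][offset:]


class ToyConsumer:
    """Toy subscriber that remembers the next offset to read per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for p in range(len(self.topic.partitions)):
            batch = self.topic.read(p, self.offsets[p])
            self.offsets[p] += len(batch)
            records.extend(batch)
        return records


topic = ToyTopic(num_partitions=2)
for i in range(3):
    topic.publish("user-a", f"a{i}")
    topic.publish("user-b", f"b{i}")

first, second = ToyConsumer(topic), ToyConsumer(topic)
seen_first = first.poll()
seen_second = second.poll()   # an independent subscriber sees the same messages
a_values = [v for k, v in seen_first if k == "user-a"]
```

Keeping the offset on the consumer side is what lets several subscribers read the same topic independently and replay messages after a failure.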

Credit: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

localhost:2181

credit: https://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example
credit: Jeremy Freeman

Semantics

At most once
At least once
Exactly once

credit: https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming supports "at least once" delivery and, with the Kafka direct approach, "exactly once" (provided the output is idempotent or transactional)

http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
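The three delivery guarantees differ mainly in when the read offset is committed relative to the output being written. The following sketch simulates one crash against a replayable source. It is a plain-Python illustration, not Spark or Kafka code; `records`, `crash_after` and the three function names are assumptions made for the example.

```python
records = ["a", "b", "c", "d"]   # a replayable source, addressed by offset


def run_at_most_once(crash_after=None):
    """Commit the offset before writing: a crash loses the record."""
    output, committed, crashed = [], 0, False
    while committed < len(records):
        offset = committed
        committed = offset + 1                   # commit first
        if crash_after == offset and not crashed:
            crashed = True                       # simulated crash before the write
            continue
        output.append(records[offset])
    return output


def run_at_least_once(crash_after=None):
    """Commit the offset after writing: a crash replays the record."""
    output, committed, crashed = [], 0, False
    while committed < len(records):
        offset = committed
        output.append(records[offset])           # side effect happens first
        if crash_after == offset and not crashed:
            crashed = True                       # simulated crash before the commit
            continue                             # restart: re-read from `committed`
        committed = offset + 1
    return output


def run_exactly_once(crash_after=None):
    """Replay like at-least-once, but key the write by offset so it is idempotent."""
    output, committed, crashed = {}, 0, False
    while committed < len(records):
        offset = committed
        output[offset] = records[offset]         # a replayed record overwrites itself
        if crash_after == offset and not crashed:
            crashed = True
            continue
        committed = offset + 1
    return [output[i] for i in sorted(output)]


lost = run_at_most_once(crash_after=1)
duplicated = run_at_least_once(crash_after=1)
exact = run_exactly_once(crash_after=1)
```

In this toy run the at-most-once variant drops "b", the at-least-once variant emits it twice, and the idempotent variant emits every record once, which mirrors why tracking offsets alongside an idempotent or transactional output is needed for exactly-once results.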

Lambda Architecture - combine batch and streaming data

credit: Strata+Hadoop NYC
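A minimal way to picture the Lambda Architecture is a query that merges a batch view, recomputed from the full history, with a speed view that covers only recent events. The sketch below uses page-view counts as an assumed example; the function names and event shape are hypothetical.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable history."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: counts for events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch[page] + speed[page]


history = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent = [{"page": "/home"}, {"page": "/pricing"}]

batch, speed = batch_view(history), speed_view(recent)
home_views = query("/home", batch, speed)
pricing_views = query("/pricing", batch, speed)
```

When the next batch run absorbs the recent events, the speed view can be discarded and rebuilt, so any approximation in the streaming path is short-lived.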

Combine machine learning with real-time data

credit: Strata+Hadoop NYC
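One common way to apply machine learning to a stream is to update a model incrementally with each mini-batch instead of retraining from scratch. The sketch below shows an online k-means step on 1-D points in plain Python. It is an illustrative assumption, not MLlib's streaming implementation, and `update_centers` is a hypothetical helper.

```python
def update_centers(centers, counts, batch):
    """One mini-batch step of online k-means on 1-D points (updates lists in place)."""
    for x in batch:
        # Assign the point to its nearest center.
        k = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
        counts[k] += 1
        # Move that center toward the point by a running-mean step.
        centers[k] += (x - centers[k]) / counts[k]
    return centers, counts


centers, counts = [0.0, 10.0], [0, 0]   # assumed initial guesses for two clusters
for batch in ([1.0, 9.0], [2.0, 11.0], [0.0, 10.0]):
    centers, counts = update_centers(centers, counts, batch)
```

Each center ends up as the running mean of the points assigned to it, so the model tracks the data as new batches arrive without storing earlier points.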

Combine SQL with real-time data

credit: Strata+Hadoop NYC
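The idea of running SQL over arriving data can be sketched by loading each micro-batch into a table and re-running a query after every batch. Here SQLite from the Python standard library stands in for a SQL engine. This is an assumption for illustration only, not Spark SQL, and the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (batch INTEGER, word TEXT)")

# Two assumed micro-batches of words arriving over time.
micro_batches = [
    ["spark", "kafka", "spark"],
    ["kafka", "zookeeper"],
]

top_per_batch = []
for batch_id, words in enumerate(micro_batches):
    # Append the new micro-batch to the table.
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)", [(batch_id, w) for w in words]
    )
    # Re-run the same SQL over everything seen so far; ties break alphabetically.
    row = conn.execute(
        "SELECT word, COUNT(*) AS n FROM events "
        "GROUP BY word ORDER BY n DESC, word LIMIT 1"
    ).fetchone()
    top_per_batch.append(row)

conn.close()
```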

The End