Spark Streaming, Spark SQL

Spark Streaming & Spark SQL

Yousun Jeong jerryjung@sk.com

History - Spark

Developed in 2009 at UC Berkeley AMPLab and open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations.

“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”

Gartner, Advanced Analytics and Data Science (2014)

History - Spark

Some key points about Spark:

• handles batch, interactive, and real-time processing within a single framework

• native integration with Java, Python, and Scala; programming at a higher level of abstraction

• multi-step Directed Acyclic Graphs (DAGs): a job can have many stages, compared to Hadoop's Map and Reduce only

Data Sharing in MR

http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012

Benchmark Test

databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

RDD

Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

There are currently two types:

• parallelized collections: take an existing Scala collection and run functions on it in parallel

• Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop

Fault Tolerance

• An RDD is an immutable, deterministically re-computable, distributed dataset.

• Each RDD tracks lineage information so that lost data can be rebuilt.
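The lineage idea can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark's implementation: the class `LineageRDD` and its methods are hypothetical names invented for this illustration. Each dataset stores only the deterministic recipe for its partitions, so a lost partition is rebuilt by replaying that recipe on its parent.

```python
class LineageRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API):
    immutable partitions plus the recipe needed to rebuild them."""

    def __init__(self, compute_partition, num_partitions):
        # compute_partition(i) deterministically produces partition i
        self._compute = compute_partition
        self.num_partitions = num_partitions
        self._cache = {}  # materialized partitions; entries may be lost

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = tuple(data)  # freeze the source collection

        def compute(i):
            # round-robin split of the source into partitions
            return data[i::num_partitions]

        return cls(compute, num_partitions)

    def map(self, fn):
        parent = self  # lineage: remember the parent and the function

        def compute(i):
            return tuple(fn(x) for x in parent.partition(i))

        return LineageRDD(compute, self.num_partitions)

    def partition(self, i):
        # recompute from lineage whenever the partition is missing
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # simulate a node failure that drops one materialized partition
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = LineageRDD.parallelize(range(6), num_partitions=2)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # partition 1 is gone ...
after = squares.collect()   # ... and is rebuilt from its lineage
```

Because the recipe is deterministic and the source is immutable, `after` equals `before`; no replica of the lost partition was needed.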

Benefit of Spark

Spark speeds up processing and makes it easier and faster to implement a variety of big data applications.

▪ Support for Event Stream Processing

▪ Fast Data Queries in Real Time

▪ Improved Programmer Productivity

▪ Fast Batch Processing of Large Data

Why I use Spark …

Big Data

Big Data is not just “big”

The 3Vs of Big Data

Big Data Processing

1. Batch Processing

• processing data en masse

• big & complex

• higher latencies

ex) MR

2. Stream Processing

• one-at-a-time processing

• computations are relatively simple and generally independent

• sub-second latency

ex) Storm

3. Micro-Batching

• small batch size (batch + streaming)
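Micro-batching can be sketched in a few lines of plain Python: incoming events are cut into small fixed-length time windows, and an ordinary batch computation (here a word count) runs on each window. This is an illustration only, not how Spark Streaming is implemented; the function names, the sample events, and the 10-second interval are assumptions made for the example.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, line) events into consecutive fixed-length batches.

    Assumes events arrive ordered by timestamp (in seconds).
    Intervals that contain no events are skipped in this sketch.
    """
    batch, batch_index = [], None
    for ts, line in events:
        index = int(ts // batch_seconds)
        if batch_index is not None and index != batch_index:
            yield batch_index, batch   # the previous interval is complete
            batch = []
        batch_index = index
        batch.append(line)
    if batch:
        yield batch_index, batch       # flush the last open interval


def word_count(lines):
    """Ordinary batch job, applied unchanged to every micro-batch."""
    return Counter(word for line in lines for word in line.split())


# hypothetical input: (arrival time in seconds, text line)
events = [
    (1, "spark streaming"),
    (4, "spark sql"),
    (12, "spark"),
    (27, "streaming sql"),
]

results = [
    (index, dict(word_count(batch)))
    for index, batch in micro_batches(events, batch_seconds=10)
]
```

Latency is bounded below by the batch interval, which is the trade-off against one-at-a-time stream processing.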

Spark Streaming Integration

Spark Streaming In Action

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

Spark UI

Spark SQL

Spark SQL In Action

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.logitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)

Spark SQL In Action

val allCandidates = sql("""
  SELECT userId, age, latitude, logitude
  FROM Users
  WHERE subscribed = FALSE""")

// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)

val scores = allCandidates.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  Score(row(0), model.predict(features))
}

scores.registerAsTable("Scores")

MR vs RDD - Compute an Average

RDD vs DF - Compute an Average

Using RDDs

# split each tab-separated line into fields
data = sc.textFile(...).map(lambda line: line.split("\t"))

# carry (sum, count) per key, then divide
(data.map(lambda x: (x[0], [int(x[1]), 1]))
     .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
     .map(lambda x: [x[0], x[1][0] / x[1][1]])
     .collect())

Using DataFrames

sqlCtx.table("people").groupBy("name").agg("name", avg("age")).collect()
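The RDD version works by carrying a (sum, count) pair per key and dividing only at the end, since averages themselves cannot be merged pairwise. The sketch below reproduces that pattern in plain Python so it can be run without a cluster; `reduce_by_key` is a hypothetical local helper standing in for Spark's `reduceByKey`, and the sample rows are made up.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, combine):
    """Local stand-in for reduceByKey: merge all values of a key pairwise."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


# hypothetical tab-separated input: name <TAB> age
rows = ["alice\t30", "bob\t40", "alice\t50"]
records = [line.split("\t") for line in rows]

# map each record to (name, (age, 1)), then add sums and counts per name
sum_and_count = reduce_by_key(
    ((name, (int(age), 1)) for name, age in records),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)

# divide only after all partial results have been merged
averages = {name: total / count for name, (total, count) in sum_and_count.items()}
```

The DataFrame version expresses the same intent declaratively with `avg("age")`, leaving the (sum, count) bookkeeping to the engine.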

Spark 2.0: Structured Streaming

• Structured Streaming

• High-level streaming API built on Spark SQL engine

• Runs the same queries on DataFrames

• Event time, windowing, sessions, sources & sinks

• Unifies streaming, interactive and batch queries

• Aggregate data in a stream, then serve using JDBC

• Change queries at runtime

• Build and apply ML models

Spark 2.0 Example: Page View Count

Input: records in Kafka

Query: select count(*) group by page, minute(evtime)

Trigger: “every 5 sec”

Output mode: “update-in-place”, into MySQL sink

logs = ctx.read.format("json").stream("s3://logs")

logs.groupBy(logs.user_id)
    .agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql//...")
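The page view query on this slide (a count grouped by page and minute of event time, updated in place on every trigger) can be sketched in plain Python. This is only an illustration of the semantics, not Structured Streaming itself: the record fields `page` and `evtime`, the sample data, and the choice to truncate the timestamp to the minute are assumptions, and a `Counter` stands in for the MySQL sink.

```python
from collections import Counter
from datetime import datetime


def apply_trigger(counts, batch):
    """Fold one trigger's worth of records into the running result table.

    Keys are (page, event minute); existing rows are updated in place,
    so late events still land in the minute they belong to.
    """
    for record in batch:
        evtime = datetime.fromisoformat(record["evtime"])
        minute = evtime.replace(second=0, microsecond=0).isoformat()
        counts[(record["page"], minute)] += 1
    return counts


counts = Counter()  # stand-in for the sink table

# hypothetical records arriving in two consecutive triggers
trigger_1 = [
    {"page": "/home", "evtime": "2016-05-01T10:00:05"},
    {"page": "/home", "evtime": "2016-05-01T10:00:40"},
    {"page": "/docs", "evtime": "2016-05-01T10:00:59"},
]
trigger_2 = [
    {"page": "/home", "evtime": "2016-05-01T10:01:02"},
    {"page": "/docs", "evtime": "2016-05-01T10:00:30"},  # late event
]

apply_trigger(counts, trigger_1)
after_first = dict(counts)
apply_trigger(counts, trigger_2)
after_second = dict(counts)
```

Grouping by event time rather than arrival time is what lets the late `/docs` record update the 10:00 row instead of opening a new one.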

Spark 2.0 Use Case: Fraud Detection

Spark 2.0 Performance

Thank You!
