
Transcript
Page 1: Apache Spark

Spark

Majid Hajibaba

Page 2: Apache Spark

Outline

An Overview on Spark

Spark Programming Guide

An Example on Spark

Running Applications on Spark

Spark Streaming

Spark Streaming Programming Guide

An Example on Spark Streaming

Spark and Storm: A Comparison

Spark SQL


Page 3: Apache Spark

An Overview


Page 4: Apache Spark

Cluster Mode Overview

Spark applications run as independent sets of processes on a cluster

Executor processes run tasks in multiple threads

The driver should run close to the workers (on the same local network if possible)

If you need to operate remotely, it is better to submit operations to the driver over RPC than to run the driver far from the workers

• Coordinator (cluster manager) options:

• Standalone

• Mesos

• YARN

http://spark.apache.org/docs/1.0.1/cluster-overview.html


Page 5: Apache Spark

Spark Core is a "computational engine" responsible for scheduling, distributing, and monitoring applications in a cluster

Higher-level components (Shark, GraphX, Streaming, …) are like libraries in a software project

Tight integration has several benefits:

Simple improvements, minimized costs, combined processing models

Spark - A Unified Stack


Page 6: Apache Spark

Spark Processing Model


(Diagram: the classic MapReduce processing model vs. Spark's in-memory iterative MapReduce)

Page 7: Apache Spark

Spark Goal

Provide distributed memory abstractions for clusters to support applications with working sets

Retain the attractive properties of MapReduce:

Fault tolerance

Data locality

Scalability

Solution: augment the data flow model with “resilient distributed datasets” (RDDs)


Page 8: Apache Spark

Resilient Distributed Datasets (RDDs)

Immutable collection of elements that can be operated on in parallel

Created by transforming data using data flow operators (e.g. map)

Parallel operations on RDDs

Benefits

Consistency is easy

due to immutability

Inexpensive fault tolerance

log lineage

no replicating/checkpointing

Locality-aware scheduling of tasks on partitions

Applicable to a broad variety of applications


Page 9: Apache Spark

RDDs

Immutable collection of objects, partitioned and distributed

Page 10: Apache Spark

Spark Programming Guide

Page 11: Apache Spark

Linking with Spark

Spark 1.2.0 works with Java 6 and higher

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

Importing Spark classes into the program:

groupId = org.apache.spark

artifactId = spark-core_2.10

version = 1.2.0

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.SparkConf;


Page 12: Apache Spark

Initializing Spark - Creating a SparkContext

Tells Spark how to access a cluster

The entry point; the first thing a Spark program must create

This is done through the following constructor:

Example:

Or through SparkConf for advanced configuration

new SparkContext(master, appName, [sparkHome], [jars])


import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext("master_url", "application name",
    ["path_to_spark_home", "path_to_jars"]);

Page 13: Apache Spark

SparkConf

Configuration for a Spark application

Sets various Spark parameters as key-value pairs

SparkConf object contains information about the application

The constructor will load values from any spark.* Java system properties set in the application and from the classpath

Example

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("application name");

JavaSparkContext ctx = new JavaSparkContext(sparkConf);


Page 14: Apache Spark

Loading data into an RDD

Spark's primary unit for data representation

Allows for easy parallel operations on the data

Native collections in Java can serve as the basis for an RDD

The number of partitions can be set manually by passing it as a second parameter to parallelize (e.g. ctx.parallelize(data, 10)).

To load external data from a file, use the textFile method of SparkContext:

textFile(path: String, minSplits: Int )

path: the path of text file

minSplits: min number of splits for Hadoop RDDs

The result is an RDD of strings, with each line of the file being a separate element in the RDD

import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1,2,4));


Page 15: Apache Spark

textFile method

Read a text file and return it as an RDD of Strings

The file can be read from:

a local file system (the file must be available on all nodes when running in distributed mode)

HDFS

Hadoop-supported file system URI


import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = ctx.textFile("file_path", 1);

import org.apache.spark.SparkFiles;

import org.apache.spark.api.java.JavaRDD;

...

ctx.addFile("file_path");

JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_path"));

import org.apache.spark.api.java.JavaRDD;

...

JavaRDD<String> lines = ctx.textFile("hdfs://...");


Page 16: Apache Spark

Manipulating RDD

Transformations: create a new dataset from an existing one

map: works on each individual element in the input RDD and produces a new output element

Transformation functions do not modify the existing elements; rather, they return a new RDD containing the new elements

Actions: return a value to the driver program after running a computation on the dataset

reduce: operates on pairs of elements to aggregate all the data elements of the dataset

import org.apache.spark.api.java.function.Function;

rdd.map(new Function<Integer, Integer>() {

public Integer call(Integer x) { return x+1;}

});

import org.apache.spark.api.java.function.Function2;

rdd.reduce(new Function2<Integer, Integer, Integer>() {

public Integer call(Integer x, Integer y) { return x+y;}

});


Page 17: Apache Spark

RDD Basics

A simple program

This dataset is not loaded in memory

lines is merely a pointer to the file

lineLengths is not immediately computed

Breaks the computation into tasks to run on separate machines

Each machine runs both its part of the map and a local reduction, returning only its answer to the driver program

To use lineLengths again later, we could add the following before the reduce:

This would cause lineLengths to be saved in memory after the first time it is computed.

JavaRDD<String> lines = ctx.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(s -> s.length());

int totalLength = lineLengths.reduce((a, b) -> a + b);

lineLengths.persist();


Page 18: Apache Spark

Functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package

Two ways to create such functions:

1. Use lambda expressions to concisely define an implementation (In Java 8)

2. Implement the Function interfaces in your own class, and pass an instance of it to Spark

JavaRDD<String> lines = sc.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});

int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

class GetLength implements Function<String, Integer> {

public Integer call(String s) { return s.length(); }

}

class Sum implements Function2<Integer, Integer, Integer> {

public Integer call(Integer a, Integer b) { return a + b;}

}

JavaRDD<String> lines = sc.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(new GetLength());

int totalLength = lineLengths.reduce(new Sum());

Passing Functions to Spark

JavaRDD<String> lines = ctx.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(s -> s.length());

int totalLength = lineLengths.reduce((a, b) -> a + b);


Page 19: Apache Spark

Working with Key-Value Pairs

key-value pairs are represented using the scala.Tuple2 class

call new Tuple2(a, b) to create a tuple

access its fields with tuple._1() and tuple._2()

RDDs of key-value pairs

distributed “shuffle” operations (e.g. grouping or aggregating the elements by a key)

Represented by the JavaPairRDD class

JavaPairRDDs can be constructed from JavaRDDs using special versions of the map operations (mapToPair, flatMapToPair)

The JavaPairRDD will have both the standard RDD functions and special key-value functions such as:

reduceByKey

sortByKey

import scala.Tuple2;

...

Tuple2<String, String> tuple = new Tuple2("foo", "bar");

System.out.println(tuple._1() + " " + tuple._2());


Page 20: Apache Spark

Working with Key-Value Pairs

reduceByKey example

to count how many times each line of text occurs in a file

sortByKey example

to sort the pairs alphabetically

and to bring them back to the driver program as an array of objects

import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaRDD;

...

JavaRDD<String> lines = ctx.textFile("data.txt");

JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));

JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

...

counts.sortByKey();

counts.collect();


Page 21: Apache Spark

flatMap

flatMap is a combination of map and flatten

Returns a sequence for each input item rather than a single item; the results are then flattened

Use case: parsing data where some records may fail to parse (failed records can simply be dropped; see the sketch below)
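A minimal sketch (not from the original slides) of that use case in the Spark 1.x Java API: the flatMap returns a singleton list for lines that parse as integers and an empty list for lines that do not, so bad records are simply dropped. The ctx variable and the sample data are assumptions for illustration.

import java.util.Arrays;
import java.util.Collections;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Assume ctx is an existing JavaSparkContext and the RDD holds raw text lines.
JavaRDD<String> raw = ctx.parallelize(Arrays.asList("1", "2", "oops", "4"));

JavaRDD<Integer> parsed = raw.flatMap(new FlatMapFunction<String, Integer>() {
  public Iterable<Integer> call(String line) {
    try {
      // One output element for a parsable line ...
      return Collections.singletonList(Integer.parseInt(line.trim()));
    } catch (NumberFormatException e) {
      // ... and none for a line that fails to parse.
      return Collections.<Integer>emptyList();
    }
  }
});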


http://www.slideshare.net/frodriguezolivera/apache-spark-streaming

Page 22: Apache Spark

RDD Operations


Page 23: Apache Spark

An Example

Page 24: Apache Spark

Counting Words


Page 25: Apache Spark

A Complete Example

Word Counter Program

Package and classes

(Code screenshot, annotated: import the needed classes; the package name will be passed to the Spark submitter)


Page 26: Apache Spark

A Complete Example

Main Class

(Code screenshot, annotated: the main class creates a SparkConf and a SparkContext; the application name will be passed to the Spark submitter; data is loaded into a base RDD)


Page 27: Apache Spark

A Complete Example

JavaRDDs and JavaPairRDDs functions

(Code screenshot, annotated: a JavaPairRDD is constructed from the JavaRDD by creating tuples (key-value pairs); the values for each key are aggregated to count how many times each word occurs in the file, yielding a transformed RDD)


Page 28: Apache Spark

A Complete Example

Printing results

(Code screenshot, annotated: the tuples of the result are accessed and printed; collect is the action that triggers execution)
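The word-counter code itself appears only as screenshots in the original deck; the following is a minimal sketch of what such a program could look like against the Spark 1.2 Java API. The class and package names are taken from the spark-submit command shown on a later slide; the details are an assumed reconstruction, not the author's exact code.

package org.apache.spark.examples;

import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public final class JavaWordCount {
  public static void main(String[] args) throws Exception {
    // The application name is used by the Spark submitter / UI.
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    // Base RDD: one element per line of the input file (path given as argument).
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    // Split each line into words.
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });

    // Create a tuple (word, 1) for each word.
    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    });

    // Aggregate the values for each key to count word occurrences.
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    });

    // collect() is an action: it triggers execution and returns the results to the driver.
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}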


Page 29: Apache Spark

Iteration 1

output = count.collect();

Spark Execution Model


Page 30: Apache Spark

Iteration 2

output = count.reduce(func);

Spark Execution Model


Page 31: Apache Spark

Running Applications on Spark

Page 32: Apache Spark

Building Application

With sbt ($ sbt package)

With maven ($ mvn package)

./src

./src/main

./src/main/java

./src/main/java/app.java

<project>

<artifactId>word-counter</artifactId>

<name>Word Counter</name>

<packaging>jar</packaging>

<version>1.0</version>

<dependencies>

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-core_2.10</artifactId>

<version>1.2.0</version>

</dependency>

</dependencies>

</project>

name := "Word Counter"

organization := "org.apache.spark"

version := "1.0"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

Directory layout

pom.xml

name.sbt


Page 33: Apache Spark

Submitting Application

Starting Spark (Master and Slaves)

Submitting a job

Submission syntax:

./bin/spark-submit \

--class <main-class> \

--master <master-url> \

--deploy-mode <deploy-mode> \

--conf <key>=<value> \

... # other options

<application-jar> \

[application-arguments]

$ sudo ./bin/spark-submit

--class "org.apache.spark.examples.JavaWordCount"

--master spark://127.0.0.1:7077

test/target/word-counter-1.0.jar /var/log/syslog

$ ./sbin/start-all.sh


Page 34: Apache Spark

Spark Streaming


Page 35: Apache Spark

Overview

Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets

Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window

Processed data can be pushed out to filesystems, databases, and live dashboards

Potential for combining batch processing and streaming processing in the same system

You can apply Spark's machine learning and graph processing algorithms on data streams


Page 36: Apache Spark

Run a streaming computation as a series of very small, deterministic batch jobs

Chop up the live stream into batches of X seconds

Spark treats each batch of data as an RDD and processes it using RDD operations

Finally, the processed results of the RDD operations are returned in batches

Batch sizes as low as ½ second, latency of about 1 second

Spark Streaming – How It Works


Page 37: Apache Spark

DStreams (Discretized Streams)

represents a continuous stream of data

is represented as a sequence of RDDs

can be created from

input data streams from sources such as Kafka, Flume, and Kinesis

by applying high-level operations on other DStreams

Example: lines to words
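The lines-to-words example is shown as a diagram in the original slide; a minimal sketch of that step in the Java API, assuming lines is an existing JavaDStream of text lines (e.g. from socketTextStream), could be:

import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

// Each RDD in the words DStream is derived from the corresponding RDD in lines.
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
});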


Page 38: Apache Spark

Running Example - JavaNetworkWordCount

You will first need to run Netcat as a data server by using

Remember that Spark must be installed

Then, in a different terminal, you can start the example by using

Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second.


$ nc -lk 9999

$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999

Page 39: Apache Spark

Spark Streaming Programming Guide


Page 40: Apache Spark

Linking with Spark

Similar to linking for Spark batch processing

Spark 1.2.0 works with Java 6 and higher

To write a Spark application in Java, you need to add a dependency on Spark.

add the following dependency to your Maven project.

add the following dependency to your SBT project.

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-streaming_2.10</artifactId>

<version>1.2.0</version>

</dependency>


libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"

Page 41: Apache Spark

Initializing – Creating StreamingContext

Similar to creating a SparkContext

Using constructor

The batchDuration is the size of the batches

the time interval at which streaming data will be divided into batches

can be created from a SparkConf object

can also be created from an existing JavaSparkContext


import org.apache.spark.SparkConf;

import org.apache.spark.streaming.Duration;

import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

...

JavaSparkContext ctx = ... //existing JavaSparkContext

JavaStreamingContext ssc = new JavaStreamingContext(ctx, Durations.seconds(1));

new StreamingContext(master,appName,batchDuration,[sparkHome],[jars])

Page 42: Apache Spark

Setting the Right Batch Size

batches of data should be processed as fast as they are being generated

the batch interval used may have a significant impact on the data rates that the application can sustain

To figure out the right batch size for an application:

test it with a conservative batch interval and a low data rate

5-10 seconds

If the system is stable (the delay is comparable to the batch size)

try increasing the data rate and/or reducing the batch size

If the system is unstable (the delay is continuously increasing)

go back to the previous stable batch size


Page 43: Apache Spark

Input DStreams and Receivers

Input DStream is associated with a Receiver

except file stream

Receiver

receives the data from a source and stores it in memory for processing

Spark Streaming provides two categories of built-in streaming sources.

Basic sources

like file systems, socket connections, and Akka actors

directly available in the StreamingContext API

Advanced sources

like Kafka, Flume, Kinesis, Twitter, etc.

are available through extra utility classes

Custom sources


Page 44: Apache Spark

Basic Sources

File Streams

will monitor the directory dataDirectory and process any files created in that directory

For simple text files

Socket Streams

Custom Actors

Actors are concurrent processes that communicate by exchanging messages

Queue of RDDs

Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream


streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);

streamingContext.textFileStream(dataDirectory)

streamingContext.actorStream(actorProps, actor-name)

streamingContext.queueStream(queueOfRDDs)

streamingContext.socketStream(String hostname, int port, Function converter, StorageLevel storageLevel)

Page 45: Apache Spark

Advanced Sources

require interfacing with external non-Spark libraries

Twitter

Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven

Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream as shown below

Deploying: Generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.

Flume

Kafka

Kinesis


import org.apache.spark.streaming.twitter.*;

TwitterUtils.createStream(jssc);

Page 46: Apache Spark

Custom Sources

Implement a user-defined receiver
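The slide gives no code here; a minimal sketch of a user-defined receiver against the Spark Streaming receiver API (extending org.apache.spark.streaming.receiver.Receiver) might look like the following. The JavaCustomReceiver name and the socket-reading logic are assumptions for illustration, not the author's code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class JavaCustomReceiver extends Receiver<String> {
  private final String host;
  private final int port;

  public JavaCustomReceiver(String host, int port) {
    super(StorageLevel.MEMORY_AND_DISK_2());
    this.host = host;
    this.port = port;
  }

  @Override
  public void onStart() {
    // Start a thread that receives data; onStart() must not block.
    new Thread(new Runnable() {
      public void run() { receive(); }
    }).start();
  }

  @Override
  public void onStop() {
    // The receiving thread stops itself once isStopped() returns true.
  }

  private void receive() {
    try {
      Socket socket = new Socket(host, port);
      BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
      String line;
      while (!isStopped() && (line = reader.readLine()) != null) {
        store(line);   // hand each received record to Spark Streaming
      }
      reader.close();
      socket.close();
      restart("Trying to connect again");   // reconnect when the connection is lost
    } catch (Exception e) {
      restart("Error receiving data", e);
    }
  }
}

Such a receiver can then be plugged in with ssc.receiverStream(new JavaCustomReceiver(host, port)), which yields an input DStream of the received strings.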


Page 47: Apache Spark

Socket Text Stream

Create an input stream from network source hostname:port

Data is received using a TCP socket

Received bytes are interpreted as UTF-8 encoded, \n-delimited lines

Storage level to use for storing the received objects


socketTextStream(String hostname, int port);

import org.apache.spark.streaming.api.java.JavaStreamingContext;

import org.apache.spark.api.java.StorageLevels;

...

ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);

socketTextStream(String hostname, int port, StorageLevel storageLevel)

Page 48: Apache Spark

Class ReceiverInputDStream

Abstract class for defining any InputDStream

Start a receiver on worker nodes to receive external data

JavaReceiverInputDStream

An interface to ReceiverInputDStream

The abstract class for defining input stream received over the network

Example:

Creates a DStream from text data received over a TCP socket connection


import org.apache.spark.api.java.StorageLevels;

import org.apache.spark.streaming.api.java.JavaStreamingContext;

import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

...

JavaReceiverInputDStream<String> lines =
    ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_ONLY);

Page 49: Apache Spark

Output Operations on DStreams

Allow a DStream's data to be pushed out to external systems

Trigger the actual execution of all the DStream transformations

Similar to actions for RDDs


Output operations and their meaning:

print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.

saveAsTextFiles(prefix, [suffix]): Saves the DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix.

saveAsObjectFiles(prefix, [suffix]): Saves the DStream's contents as a SequenceFile of serialized Java objects.

saveAsHadoopFiles(prefix, [suffix]): Saves the DStream's contents as a Hadoop file.

foreachRDD(func): Applies a function to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. The function is executed in the driver process running the streaming application.
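As a sketch of foreachRDD, assuming an existing JavaPairDStream called wordCounts and the Spark 1.x Java API (where the supplied Function returns Void); in real use the body would write to an external system rather than print:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// Runs once per batch, in the driver.
wordCounts.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() {
  public Void call(JavaPairRDD<String, Integer> rdd) {
    for (Tuple2<String, Integer> t : rdd.collect()) {
      System.out.println(t._1() + ": " + t._2());
    }
    return null;
  }
});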

Page 50: Apache Spark

Persisting (or caching) a dataset in memory across operations

Each node stores any computed partitions in memory and reuses them

Methods

.cache() just memory - for iterative algorithms

.persist() just memory - reuses in other actions on dataset

.persist(storageLevel) storageLevel:

Example:


RDD Persistence


MEMORY_ONLY

MEMORY_ONLY_SER

MEMORY_AND_DISK

MEMORY_AND_DISK_SER

DISK_ONLY

import org.apache.spark.api.java.StorageLevels;

...

JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
    args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);

Page 51: Apache Spark

UpdateStateByKey

To maintain state

Update state with new information

Define the state

Define the state update function

Using updateStateByKey requires checkpointing to be configured


import com.google.common.base.Optional;

...

Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
      @Override public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
        Integer newSum = ... // add the new values to the previous running count
        return Optional.of(newSum);
      }
};

...

JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction);

applied on a DStream containing words

Page 52: Apache Spark

To operate 24/7 and be resilient to failures

Needs to checkpoint enough information to recover from failures

Two types of data that are checkpointed

Metadata checkpointing

To recover from failure of the node running the driver

Includes Configuration; DStream operations; Incomplete batches

Data checkpointing

To cut off the dependency chains

Remove accumulated metadata in stateful operations

To enable checkpointing:

The checkpointing interval of a DStream can be set by using:

a checkpoint interval of 5-10 times the DStream's sliding interval is a good setting to try

dstream.checkpoint(checkpointInterval)

ctx.checkpoint(hdfsPath)

Checkpointing


Page 53: Apache Spark

A Streaming Example

Page 54: Apache Spark

A Complete Example

Network Word Counter Program

Package and classes

(Code screenshot, annotated: import the needed classes; the package name will be passed to the Spark submitter)


Page 55: Apache Spark

A Complete Example

Main Class

(Code screenshot, annotated: the main class creates a SparkConf and a JavaStreamingContext, setting the batch size; the application name will be passed to the Spark submitter; a socket stream is used as the source input DStream)

Page 56: Apache Spark

A Complete Example

JavaDStream and JavaPairDStream functions

(Code screenshot, annotated: a JavaPairDStream is constructed from the JavaDStream by creating tuples (key-value pairs); the values for each key are aggregated to count how many times each word occurs in the stream, yielding a transformed DStream)


Page 57: Apache Spark

A Complete Example

Printing results

(Code screenshot, annotated: print the first ten elements of each batch, start the execution of the streams, and wait for the execution to stop)
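As with the batch example, the code on these slides is shown as screenshots; a minimal sketch of such a network word counter's main method against the Spark 1.2 Java API (the JavaNetworkWordCount name comes from the run-example command earlier; the rest is an assumed reconstruction) could be:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;

SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
// Batch size: divide the stream into batches of 1 second.
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

// Socket stream as the source input DStream.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
    "localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); }
});

// Create (word, 1) tuples and aggregate the values for each key.
JavaPairDStream<String, Integer> wordCounts = words
    .mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    })
    .reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    });

wordCounts.print();        // print the first ten elements of each batch
ssc.start();               // start the execution of the streams
ssc.awaitTermination();    // wait for the execution to stop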

Page 58: Apache Spark

Spark and Storm: A Comparison


Page 59: Apache Spark

Spark vs. Storm

Origin: Spark – UC Berkeley, 2009; Storm – Twitter

Implemented in: Spark – Scala; Storm – Clojure (Lisp-like)

Enterprise Support: Spark – Yes; Storm – No

Source Model: both Open Source

Big Data Processing: Spark – Batch and Stream; Storm – Stream

Processing Type: Spark – processing in short-interval batches; Storm – real time

Latency: Spark – a few seconds; Storm – sub-second

Programming API: Spark – Scala, Java, Python; Storm – any programming language

Guaranteed Data Processing: Spark – exactly once; Storm – at least once

Batch Processing: Spark – Yes; Storm – No

Coordination: both with ZooKeeper


Page 60: Apache Spark

Apache Spark

Ippon USA


Page 61: Apache Spark

Apache Storm


Page 62: Apache Spark

Comparison

Higher throughput than Storm

Spark Streaming: 670k records/sec/node

Storm: 115k records/sec/node

Commercial systems: 100-500k records/sec/node


Page 63: Apache Spark

Spark SQL


Page 64: Apache Spark

Spark SQL

Allows relational queries expressed in SQL to be executed using Spark

Data sources are represented as JavaSchemaRDDs

JavaSchemaRDD

new type of RDD

is similar to a table in a traditional relational database

is composed of Row objects, along with a schema that describes them

can be created from an existing RDD, a JSON dataset, or …


Page 65: Apache Spark

Spark SQL Programming Guide


Page 66: Apache Spark

Initializing - Creating JavaSQLContext

To create a basic JavaSQLContext, all you need is a JavaSparkContext

It is built on top of an existing JavaSparkContext


import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.sql.api.java.JavaSQLContext;

...

...

JavaSparkContext sc = ...; // An existing JavaSparkContext.

JavaSQLContext sqlContext = new JavaSQLContext(sc);

Page 67: Apache Spark

SchemaRDD

SchemaRDD can be operated on

as normal RDDs

as a temporary table

allows you to run SQL queries over it

Converting RDDs into SchemaRDDs

Reflection based approach

Uses reflection to infer the schema of an RDD

More concise code

Works well when we know the schema while writing the application

Programmatic based approach

Construct a schema and then apply it to an existing RDD

More verbose

Allows constructing SchemaRDDs when the columns and their types are not known until runtime


Page 68: Apache Spark

JavaBean

Is just a standard (a convention)

Is a class that encapsulates many objects into a single object

All properties private (using get/set)

A public no-argument constructor

Implements Serializable

Lots of libraries depend on it


public static class Person implements Serializable {

private String name;

private int age;

public String getName() { return name; }

public void setName(String name) { this.name = name; }

public int getAge() { return age; }

public void setAge(int age) { this.age = age; }

}

Page 69: Apache Spark

Reflection based - An Example

Load a text file like people.txt

Convert each line to a JavaBean

people is now an RDD of JavaBeans


JavaRDD<Person> people = sc.textFile("people.txt").map(

new Function<String, Person>() {

public Person call(String line) throws Exception {

String[] parts = line.split(",");

Person person = new Person();

person.setName(parts[0]);

person.setAge(Integer.parseInt(parts[1].trim()));

return person;

}

});

Page 70: Apache Spark

Reflection based - An Example

Apply a schema to an RDD of JavaBeans (people)

Register it as a temporary table

SQL can be run over RDDs that have been registered as tables

The result is a SchemaRDD and supports all the normal RDD operations

The columns of a row in the result can be accessed by ordinal


JavaSchemaRDD schemaPeople = sqlContext.applySchema(people, Person.class);

schemaPeople.registerTempTable("people");

JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");

List<String> teenagerNames = teenagers.map(

new Function<Row, String>() {

public String call(Row row) {

return "Name: " + row.getString(0);

}

}).collect();

Page 71: Apache Spark

Programmatic based

JavaBean classes cannot be defined ahead of time

SchemaRDD can be created programmatically with three steps

Create an RDD of Rows from the original RDD

Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.

Apply the schema to the RDD of Rows via the applySchema method provided by JavaSQLContext.

Example

The structure of records (schema) is encoded in a string

Load a text file as an RDD of strings (each line will be converted to a Row, not a JavaBean, in the next step).


String schemaString = "name age";

JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");

Page 72: Apache Spark

Programmatic based –An Example

Generate the schema based on the string of schema

Convert records of the RDD (people) to Rows


import org.apache.spark.sql.api.java.DataType;

import org.apache.spark.sql.api.java.StructField;

import org.apache.spark.sql.api.java.StructType;

...

List<StructField> fields = new ArrayList<StructField>();

for (String fieldName : schemaString.split(" ")) {
  fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
}

StructType schema = DataType.createStructType(fields);

import org.apache.spark.sql.api.java.Row;

...

JavaRDD<Row> rowRDD = people.map(

new Function<String, Row>() {

public Row call(String record) throws Exception {

String[] fields = record.split(",");

return Row.create(fields[0], fields[1].trim());

}

});

Page 73: Apache Spark

Programmatic based –An Example

Apply the schema to the RDD.

Register the SchemaRDD as a table.

SQL can be run over RDDs that have been registered as tables

The result is a SchemaRDD and supports all the normal RDD operations

The columns of a row in the result can be accessed by ordinal


JavaSchemaRDD peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema);

peopleSchemaRDD.registerTempTable("people");

JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people");

List<String> names = results.map(

new Function<Row, String>() {

public String call(Row row) {

return "Name: " + row.getString(0);

}

}).collect();

Page 74: Apache Spark

JSON Datasets

Infers the schema of a JSON dataset and loads it into a JavaSchemaRDD

Two methods in a JavaSQLContext

jsonFile(): loads data from a directory of JSON files where each line of the files is a JSON object (not a regular multi-line JSON file)

jsonRDD(): loads data from an existing RDD where each element of the RDD is a string containing a JSON object

A JSON file can be like this:
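The sample file is shown as an image in the original slide; a file in the expected format (one JSON object per line, in the style of the people.json file shipped with the Spark examples) looks like:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}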


JavaSchemaRDD people = sqlContext.jsonFile(path);

Page 75: Apache Spark

JSON Datasets

The inferred schema can be visualized using the printSchema() method

The result is something like this:

Register this JavaSchemaRDD as a table

SQL statements can be run by using the sql methods


people.printSchema();
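The output shown on the slide is an image; for a people.json file like the one sketched above, the printed schema looks roughly like this (the exact field names and types depend on the data):

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)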

people.registerTempTable("people");

JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");

Page 76: Apache Spark

JSON Datasets

A JavaSchemaRDD can be created for a JSON dataset represented by an RDD[String] storing one JSON object per string

In-memory collections (e.g. arrays or lists) can be parallelized into such RDDs

Register this JavaSchemaRDD as a table

SQL statements can be run by using the sql methods

.


List<String> jsonData = Arrays.asList(
    "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");

JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);

JavaSchemaRDD anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD);

anotherPeople.registerTempTable("people");

JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");

Page 77: Apache Spark

Thrift JDBC/ODBC server

To start the JDBC/ODBC server:

By default, the server listens on localhost:10000

We can use beeline to test the Thrift JDBC/ODBC server

Connect to the JDBC/ODBC server in beeline with

Beeline will ask for a username and password

Simply enter the username on your machine and a blank password

See existing databases;

Create a database;


$ ./sbin/start-thriftserver.sh

$ ./bin/beeline

beeline> !connect jdbc:hive2://localhost:10000

0: jdbc:hive2://localhost:10000> SHOW DATABASES;

0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;

Page 78: Apache Spark

End

Any questions?
