Spark
Majid Hajibaba
Outline
An Overview on Spark
Spark Programming Guide
An Example on Spark
Running Applications on Spark
Spark Streaming
Spark Streaming Programming Guide
An Example on Spark Streaming
Spark and Storm: A Comparison
Spark SQL
An Overview
Cluster Mode Overview
Spark applications run as independent sets of processes on a cluster
Executor processes run tasks in multiple threads
Driver should be close to the workers
To operate remotely, it is better to send requests to the driver via RPC than to run the driver far from the workers
Cluster manager (coordinator) options:
• Standalone
• Mesos
• YARN
http://spark.apache.org/docs/1.0.1/cluster-overview.html
Spark - A Unified Stack
Core is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications in a cluster
Higher-level components (Shark, GraphX, Streaming, …) are like libraries in a software project
Tight integration has several benefits
Simple improvements, minimized costs, combined processing models
Spark Processing Model
[Figure: MapReduce processing model vs. Spark's in-memory iterative MapReduce]
Spark Goal
Provide distributed memory abstractions for clusters to support apps with working sets
Retain the attractive properties of MapReduce:
Fault tolerance
Data locality
Scalability
Solution: augment the data flow model with “resilient distributed datasets” (RDDs)
Resilient Distributed Datasets (RDDs)
Immutable collection of elements that can be operated on in parallel
Created by transforming data using data flow operators (e.g. map)
Parallel operations on RDDs
Benefits
Consistency is easy, due to immutability
Inexpensive fault tolerance: log the lineage rather than replicating/checkpointing the data
Locality-aware scheduling of tasks on partitions
Applicable to a broad variety of applications
RDDs
Immutable collection of objects, partitioned and distributed
Spark Programming Guide
Linking with Spark
Spark 1.2.0 works with Java 6 and higher
To write a Spark application in Java, you need to add a dependency on
Spark. Spark is available through Maven Central at:
Importing Spark classes into the program:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
Initializing Spark - Creating a SparkContext
Tells Spark how to access a cluster
The entry point; the first thing a Spark program must create
This is done through the following constructor:
Example:
Or through SparkConf for advanced configuration
new SparkContext(master, appName, [sparkHome], [jars])
import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext ctx = new JavaSparkContext("master_url",
    "application name", ["path_to_spark_home", "path_to_jars"]);
SparkConf
Configuration for a Spark application
Sets various Spark parameters as key-value pairs
SparkConf object contains information about the application
The constructor will load values from any spark.* Java system
properties set and the classpath in the application
Example
import org.apache.spark.SparkConf;
SparkConf conf =
new SparkConf().setAppName(appName).setMaster(master);
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf sparkConf = new SparkConf().setAppName("application name");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Loading data into an RDD
Spark's primary unit for data representation
Allows for easy parallel operations on the data
Native collections in Java can serve as the basis for an RDD
The number of partitions can be set manually by passing it as a second parameter to parallelize (e.g. ctx.parallelize(data, 10))
To load external data from a file, use the textFile method in SparkContext:
textFile(path: String, minSplits: Int)
path: the path of the text file
minSplits: minimum number of splits for Hadoop RDDs
The result is an RDD of strings, with each line being a unique element in the RDD
import org.apache.spark.api.java.JavaRDD;
JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1,2,4));
textFile method
Read a text file and return it as an RDD of Strings
Files can be read from
a local file system (must be available on all nodes in distributed mode)
HDFS
any Hadoop-supported file system URI
import org.apache.spark.api.java.JavaRDD;
JavaRDD<String> lines = ctx.textFile("file_path", 1);
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
...
ctx.addFile("file_path");
JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_path"));
import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("hdfs://...");
Manipulating RDD
Transformations: to create a new dataset from an existing one
map: works on each individual element in the input RDD and produces a new
output element
Transformation functions do not transform the existing elements, rather they
return a new RDD with the new elements
Actions: to return a value to the driver program after running a computation
on the dataset
reduce: operates on pairs of elements to aggregate all the data elements of the dataset
import org.apache.spark.api.java.function.Function;
rdd.map(new Function<Integer, Integer>() {
public Integer call(Integer x) { return x+1;}
});
import org.apache.spark.api.java.function.Function2;
rdd.reduce(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer x, Integer y) { return x+y;}
});
RDD Basics
A simple program
This dataset is not loaded in memory
lines is merely a pointer to the file
lineLengths is not immediately computed
Spark breaks the computation into tasks to run on separate machines
Each machine runs both its part of the map and a local reduction, returning only its answer to the driver program
To use lineLengths again later, we could add the following before the reduce:
This would cause lineLengths to be saved in memory after the first time it is computed.
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
lineLengths.persist();
Passing Functions to Spark
Functions are represented by classes implementing the interfaces in the
org.apache.spark.api.java.function package
Two ways to create such functions:
1. Use lambda expressions to concisely define an implementation (In Java 8)
2. Implement the Function interfaces in your own class, and pass an instance of
it to Spark
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new
Function<String, Integer>() {
public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new
Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b)
{ return a + b; }
});
class GetLength implements Function<String, Integer> {
public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
public Integer call(Integer a, Integer b) { return a + b;}
}
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
Working with Key-Value Pairs
Key-value pairs are represented using the scala.Tuple2 class
call new Tuple2(a, b) to create a tuple
access its fields with tuple._1() and tuple._2()
RDDs of key-value pairs
distributed “shuffle” operations (e.g. grouping or aggregating the elements
by a key)
Represented by the JavaPairRDD class
JavaPairRDDs can be constructed from JavaRDDs using special versions of the map operations (mapToPair, flatMapToPair)
The JavaPairRDD will have both standard RDD functions and special key-value functions such as:
reduceByKey
sortByKey
import scala.Tuple2;
...
Tuple2<String, String> tuple = new Tuple2("foo", "bar");
System.out.println(tuple._1() + " " + tuple._2());
Working with Key-Value Pairs
reduceByKey example
to count how many times each line of text occurs in a file
sortByKey example
to sort the pairs alphabetically
and to bring them back to the driver program as an array of objects
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
...
counts.sortByKey();
counts.collect();
flatMap
flatMap is a combination of map and flatten
Return a Sequence rather than a single item; Then flattens the result
Use case: parsing input data where some records may fail to parse; failed records can yield an empty sequence (see the sketch below)
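A minimal sketch of flatMap on an RDD (the input collection is illustrative; ctx is the JavaSparkContext from the earlier examples):
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
...
// Each line yields zero or more words; the per-line lists are then
// flattened into a single RDD of words.
JavaRDD<String> lines = ctx.parallelize(Arrays.asList("an example", "on spark"));
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
// words now contains: "an", "example", "on", "spark"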
http://www.slideshare.net/frodriguezolivera/apache-spark-streaming
RDD Operations
An Example
Counting Words
A Complete Example
Word Counter Program
Package and classes
[Figure: code screenshot showing the package declaration and imports; the package name will be passed to the Spark submitter]
A Complete Example
Main Class
[Figure: code screenshot showing the main class creating a SparkConf (the application name will be passed to the Spark submitter) and a SparkContext, then loading data into a base RDD]
A Complete Example
JavaRDDs and JavaPairRDDs functions
[Figure: code screenshot showing how a JavaPairRDD is constructed from a JavaRDD by creating (key, value) tuples, then aggregating the values for each key to count how many times each word occurs in the file; the result is a transformed RDD]
A Complete Example
Printing results
[Figure: code screenshot showing the collect action and the printing of results by accessing the tuples; a complete listing follows]
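Putting the slides together, a minimal word counter in the spirit of this example (package and class names are illustrative; the lambda style assumes Java 8):
package org.apache.spark.examples;  // illustrative package name

import java.util.Arrays;
import java.util.List;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaWordCount {
  public static void main(String[] args) {
    // The application name is shown in the cluster UI
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    // Base RDD: one element per line of the input file (args[0])
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    // flatMap: split each line into words
    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));

    // mapToPair: create (word, 1) tuples; reduceByKey: aggregate per key
    JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

    // collect() is an action: it triggers the actual computation
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}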
Spark Execution Model
[Figure: execution model for iteration 1, output = count.collect();]
[Figure: execution model for iteration 2, output = count.reduce(func);]
Running Applications on Spark
Building Application
With sbt ($ sbt package)
With Maven ($ mvn package)
Directory layout:
./src
./src/main
./src/main/java
./src/main/java/app.java
pom.xml:
<project>
<artifactId>word-counter</artifactId>
<name>Word Counter</name>
<packaging>jar</packaging>
<version>1.0</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.2.0</version>
</dependency>
</dependencies>
</project>
name.sbt:
name := "Word Counter"
organization := "org.apache.spark"
version := "1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
Submitting Application
Starting Spark (master and slaves):
$ ./sbin/start-all.sh
Submitting a job
Submission syntax:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Example:
$ sudo ./bin/spark-submit \
--class "org.apache.spark.examples.JavaWordCount" \
--master spark://127.0.0.1:7077 \
test/target/word-counter-1.0.jar /var/log/syslog
Spark Streaming
Overview
Data can be ingested from many sources like Kafka, Flume, Twitter,
ZeroMQ, Kinesis or TCP sockets
Data can be processed using complex algorithms expressed with high-
level functions like map, reduce, join and window
Processed data can be pushed out to filesystems, databases, and live
dashboards
Potential for combining batch processing and streaming processing in
the same system
you can apply Spark’s machine learning and graph processing algorithms on
data streams
Spark Streaming – How It Works
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
Batch sizes as low as ½ second, with latency of about 1 second
DStreams (Discretized Streams)
represent a continuous stream of data
are represented as a sequence of RDDs
can be created from
input data streams from sources such as Kafka, Flume, and Kinesis
by applying high-level operations on other DStreams
Example: lines to words (a sketch follows)
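A sketch of the lines-to-words example (assuming lines is a JavaDStream of text lines, as in the network word count below):
import java.util.Arrays;
import org.apache.spark.streaming.api.java.JavaDStream;
...
// flatMap is applied to each RDD in the lines DStream, producing
// the corresponding RDD of words in the words DStream.
JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));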
Running Example - JavaNetworkWordCount
You will first need to run Netcat as a data server by using
Remember that Spark must be installed first
Then, in a different terminal, you can start the example by using
Then, any lines typed in the terminal running the netcat server will be
counted and printed on screen every second.
$ nc -lk 9999
$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
Spark Streaming Programming
Guide
Linking with Spark
Same as for Spark batch processing
Spark 1.2.0 works with Java 6 and higher
To write a Spark application in Java, you need to add a dependency on
Spark.
add the following dependency to your Maven project:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.2.0</version>
</dependency>
add the following dependency to your SBT project:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"
Initializing – Creating StreamingContext
Analogous to creating a SparkContext
Using constructor
The batchDuration is the size of the batches
the time interval at which streaming data will be divided into batches
can be created from a SparkConf object
can also be created from an existing JavaSparkContext
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
...
JavaSparkContext ctx = ... //existing JavaSparkContext
JavaStreamingContext ssc =
new JavaStreamingContext(ctx, Durations.seconds(1));
new StreamingContext(master,appName,batchDuration,[sparkHome],[jars])
Setting the Right Batch Size
batches of data should be processed as fast as they are being generated
the batch interval used may have significant impact on the data rates
figure out the right batch size for an application
test it with a conservative batch interval (5-10 seconds) and a low data rate
If the system is stable (the delay is comparable to the batch size):
increase the data rate and/or reduce the batch size
If the system is unstable (the delay is continuously increasing):
revert to the previous stable batch size
Input DStreams and Receivers
Every input DStream (except file streams) is associated with a Receiver
Receiver
receives the data from a source and
stores it in memory for processing
Spark Streaming provides two categories of built-in streaming sources.
Basic sources
like file systems, socket connections, and Akka actors
directly available in the StreamingContext API
Advanced sources
like Kafka, Flume, Kinesis, Twitter, etc.
are available through extra utility classes
Custom sources
Basic Sources
File Streams
will monitor the directory dataDirectory and process any files created in that directory
For simple text files
Socket Streams
Custom Actors
Actors are concurrent processes that communicate by exchanging messages
Queue of RDDs
Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream
streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);
streamingContext.textFileStream(dataDirectory)
streamingContext.actorStream(actorProps, actor-name)
streamingContext.queueStream(queueOfRDDs)
streamingContext.socketStream(String hostname, int port, Function converter, StorageLevel storageLevel)
Advanced Sources
require interfacing with external non-Spark libraries
Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven
Programming: Import the TwitterUtils class and create a DStream with
TwitterUtils.createStream as shown below
Deploying: Generate an uber JAR with all the dependencies (including the
dependency spark-streaming-twitter_2.10 and its transitive dependencies) and
then deploy the application. This is further explained in the Deploying section.
Flume
Kafka
Kinesis
import org.apache.spark.streaming.twitter.*;
TwitterUtils.createStream(jssc);
Custom Sources
implement a user-defined receiver (a minimal sketch follows)
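A minimal sketch of a user-defined receiver, following the pattern of the JavaCustomReceiver example in the Spark docs (the receiving loop here is a placeholder; a real receiver would read from a socket or similar source):
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// A custom source extends Receiver and calls store() for each record.
public class CustomReceiver extends Receiver<String> {
  public CustomReceiver() {
    super(StorageLevel.MEMORY_AND_DISK_2());
  }
  @Override
  public void onStart() {
    // Start a thread that receives data and pushes it into Spark
    new Thread(new Runnable() {
      public void run() {
        while (!isStopped()) {
          store("received record");  // placeholder for real receiving logic
        }
      }
    }).start();
  }
  @Override
  public void onStop() {
    // Nothing to do: isStopped() returns true and the thread exits
  }
}
// Used as:
// JavaReceiverInputDStream<String> stream = ssc.receiverStream(new CustomReceiver());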
Socket Text Stream
Create an input stream from network source hostname:port
Data is received using a TCP socket
Received bytes are interpreted as UTF-8 encoded, \n-delimited lines
Storage level to use for storing the received objects
socketTextStream(String hostname, int port);
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.StorageLevels;
...
ssc.socketTextStream("localhost", 9999,
StorageLevels.MEMORY_AND_DISK_SER);
socketTextStream(String hostname, int port, StorageLevel storageLevel)
Class ReceiverInputDStream
Abstract class for defining any InputDStream
Start a receiver on worker nodes to receive external data
JavaReceiverInputDStream
An interface to ReceiverInputDStream
The class for defining input streams received over the network
Example:
Creates a DStream from text data received over a TCP socket connection
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
...
JavaReceiverInputDStream<String> lines =
ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_ONLY);
Output Operations on DStreams
Allow DStream’s data to be pushed out to external systems
Trigger the actual execution of all the DStream transformations
Similar to actions for RDDs
Output Operation: Meaning
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.
saveAsTextFiles(prefix, [suffix]): Saves the DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix.
saveAsObjectFiles(prefix, [suffix]): Saves the DStream's contents as SequenceFiles of serialized Java objects.
saveAsHadoopFiles(prefix, [suffix]): Saves the DStream's contents as Hadoop files.
foreachRDD(func): Applies a function to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. The function is executed in the driver process running the streaming application. (A sketch follows.)
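As a sketch of foreachRDD, assuming a wordCounts JavaPairDStream like the one built in the network word count example later in these slides (any real sink, such as a database writer, would go where the print statement is):
// For each RDD generated from the stream, run an action on it.
// In Spark 1.2 the Java foreachRDD function returns Void, hence return null.
wordCounts.foreachRDD(rdd -> {
  System.out.println("Batch size: " + rdd.count());  // count() is an action
  return null;
});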
RDD Persistence
Persisting (or caching) a dataset in memory across operations
Each node stores any computed partitions in memory and reuses them
Methods
.cache()               memory only; useful for iterative algorithms
.persist()             memory only; reuses the dataset in other actions
.persist(storageLevel) with storageLevel one of:
Example:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
import org.apache.spark.api.java.StorageLevels;
...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER);
UpdateStateByKey
To maintain state
Update state with new information
Define the state
Define the state update function
using updateStateByKey requires checkpointing to be configured
import com.google.common.base.Optional;
...
Function2<List<Integer>, Optional<Integer>, Optional<Integer>>
updateFunction = new Function2<List<Integer>, Optional<Integer>,
Optional<Integer>>() {
@Override public Optional<Integer> call(List<Integer> values,
Optional<Integer> state) {
Integer newSum = state.or(0);        // start from the previous running count
for (Integer value : values) {       // add the new values
newSum += value;
}
return Optional.of(newSum);
}};
...
JavaPairDStream<String, Integer> runningCounts =
pairs.updateStateByKey(updateFunction);
applied on a DStream containing (word, 1) pairs
Checkpointing
To operate 24/7 and be resilient to failures
Needs to checkpoint enough information to recover from failures
Two types of data that are checkpointed
Metadata checkpointing
To recover from failure of the node running the driver
Includes Configuration; DStream operations; Incomplete batches
Data checkpointing
To cut off the dependency chains
Remove accumulated metadata in stateful operations
To enable checkpointing:
The checkpoint interval of a DStream can be set using:
A checkpoint interval of 5-10 times the sliding interval is a good starting point
dstream.checkpoint(checkpointInterval)
ctx.checkpoint(hdfsPath)
A Streaming Example
A Complete Example
Network Word Counter Program
Package and classes
[Figure: code screenshot showing the package declaration and imports; the package name will be passed to the Spark submitter]
A Complete Example
Main Class
[Figure: code screenshot showing the main class creating a SparkConf (the application name will be passed to the Spark submitter) and a JavaStreamingContext, setting the batch size via the batch duration, with a socket stream as the input DStream]
A Complete Example
JavaDStream and JavaPairDStream functions
[Figure: code screenshot showing how a JavaPairDStream is constructed from a JavaDStream by creating (key, value) tuples, then aggregating the values for each key to count how many times each word occurs in the stream; the result is a transformed DStream]
A Complete Example
Printing results
[Figure: code screenshot: print() outputs the first ten elements of each batch, start() begins the execution of the streams, and awaitTermination() waits for the execution to stop; a complete listing follows]
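Putting the streaming pieces together, a minimal network word counter in the spirit of this example (package and class names are illustrative; the lambda style assumes Java 8):
package org.apache.spark.examples.streaming;  // illustrative package name

import java.util.Arrays;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class JavaNetworkWordCount {
  public static void main(String[] args) {
    // Batch size of 1 second
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc =
        new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Input DStream from a socket source (e.g. nc -lk 9999)
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);

    // lines -> words -> (word, 1) -> counts per batch
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")));
    JavaPairDStream<String, Integer> wordCounts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    wordCounts.print();      // prints the first ten elements of each batch
    ssc.start();             // start the execution of the streams
    ssc.awaitTermination();  // wait for the execution to stop
  }
}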
Spark and Storm: A Comparison
Spark vs. Storm
                            Spark                  | Storm
Origin:                     UC Berkeley, 2009      | Twitter
Implemented in:             Scala                  | Clojure (Lisp-like)
Enterprise support:         Yes                    | No
Source model:               Open source            | Open source
Big data processing:        Batch and stream       | Stream
Processing type:            Short interval batches | Real time
Latency:                    A few seconds          | Sub-second
Programming API:            Scala, Java, Python    | Any PL
Guaranteed data processing: Exactly once           | At least once
Batch processing:           Yes                    | No
Coordination:               With ZooKeeper         | ZooKeeper
Apache Spark
[Figure: Apache Spark processing model (source: Ippon USA)]
Apache Storm
[Figure: Apache Storm processing model]
Comparison
Higher throughput than Storm
Spark Streaming: 670k records/sec/node
Storm: 115k records/sec/node
Commercial systems: 100-500k records/sec/node
Spark SQL
Spark SQL
Allows relational queries expressed in SQL to be executed using Spark
Data sources are represented as JavaSchemaRDDs
JavaSchemaRDD
new type of RDD
is similar to a table in a traditional relational database
are composed of Row objects along with a schema that describes them
can be created from an existing RDD, a JSON dataset, or …
Spark SQL Programming Guide
Initializing - Creating JavaSQLContext
To create a basic JavaSQLContext, all you need is a JavaSparkContext
It must be based on a SparkContext
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
...
JavaSparkContext sc = ...; // An existing JavaSparkContext.
JavaSQLContext sqlContext = new JavaSQLContext(sc);
SchemaRDD
SchemaRDD can be operated on
as normal RDDs
as a temporary table
allows you to run SQL queries over it
Converting RDDs into SchemaRDDs
Reflection based approach
Uses reflection to infer the schema of an RDD
More concise code
Works well when we know the schema while writing the application
Programmatic based approach
Construct a schema and then apply it to an existing RDD
More verbose
Allows to construct SchemaRDDs when the columns and types are not known until
runtime
JavaBean
Is just a standard (a convention)
Is a class that encapsulates many objects into a single object
All properties private (using get/set)
A public no-argument constructor
Implements Serializable
Lots of libraries depend on it
public static class Person implements Serializable {
private String name;
private int age;
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public int getAge() { return age; }
public void setAge(int age) { this.age = age; }
}
Reflection based - An Example
Load a text file like people.txt
Convert each line to a JavaBean
people now is an RDD of JavaBeans
JavaRDD<Person> people = sc.textFile("people.txt").map(
new Function<String, Person>() {
public Person call(String line) throws Exception {
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
}
});
Reflection based - An Example
Apply a schema to an RDD of JavaBeans (people)
Register it as a temporary table
SQL can be run over RDDs that have been registered as tables
The result is a SchemaRDD, which supports all the normal RDD operations
The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD schemaPeople =
sqlContext.applySchema(people, Person.class);
schemaPeople.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(
new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
Programmatic based
When JavaBean classes cannot be defined ahead of time
SchemaRDD can be created programmatically with three steps
Create an RDD of Rows from the original RDD
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via applySchema method provided by
JavaSQLContext.
Example
The structure of records (the schema) is encoded in a string
Load a text file; its records will be converted to Rows in the next step
String schemaString = "name age";
JavaRDD<String> people =
sc.textFile("examples/src/main/resources/people.txt");
Programmatic based – An Example
Generate the schema based on the string of schema
Convert records of the RDD (people) to Rows
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
...
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName: schemaString.split(" ")) {
fields.add(DataType.createStructField(fieldName,
DataType.StringType, true));}
StructType schema = DataType.createStructType(fields);
import org.apache.spark.sql.api.java.Row;
...
JavaRDD<Row> rowRDD = people.map(
new Function<String, Row>() {
public Row call(String record) throws Exception {
String[] fields = record.split(",");
return Row.create(fields[0], fields[1].trim());
}
});
Programmatic based – An Example
Apply the schema to the RDD.
Register the SchemaRDD as a table.
SQL can be run over RDDs that have been registered as tables
The result is a SchemaRDD, which supports all the normal RDD operations
The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD peopleSchemaRDD =
sqlContext.applySchema(rowRDD, schema);
peopleSchemaRDD.registerTempTable("people");
JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people");
List<String> names = results.map(
new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
JSON Datasets
Infer the schema of a JSON dataset and load it as a JavaSchemaRDD
Two methods in a JavaSQLContext
jsonFile(): loads data from a directory of JSON files where each line of the files is a JSON object (not a regular multi-line JSON file)
jsonRDD(): loads data from an existing RDD where each element of the RDD is a string containing a JSON object
A JSON file can be like this:
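(one JSON object per line; these records are illustrative)
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}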
JavaSchemaRDD people = sqlContext.jsonFile(path);
JSON Datasets
The inferred schema can be visualized using the printSchema()
The result is something like this:
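(assuming the people dataset above, with name and age fields)
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)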
Register this JavaSchemaRDD as a table
SQL statements can be run by using the sql methods
people.printSchema();
people.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
JSON Datasets
A JavaSchemaRDD can be created for a JSON dataset represented by an RDD[String] storing one JSON object per string
An in-memory list (e.g. built with Arrays.asList) can be parallelized into such an RDD
Register this JavaSchemaRDD as a table
SQL statements can be run by using the sql methods
List<String> jsonData =
Arrays.asList("{\"name\":\"Yin\",\"address\":
{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);
JavaSchemaRDD anotherPeople =
sqlContext.jsonRDD(anotherPeopleRDD);
anotherPeople.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
Thrift JDBC/ODBC server
To start the JDBC/ODBC server:
By default, the server listens on localhost:10000
We can use beeline to test the Thrift JDBC/ODBC server
Connect to the JDBC/ODBC server in beeline with
Beeline will ask for a username and password
Simply enter the username on your machine and a blank password
List existing databases:
Create a database:
$ ./sbin/start-thriftserver.sh
$ ./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> SHOW DATABASES;
0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;
End
Any questions?