Apache Spark

Majid Hajibaba

Outline

An Overview on Spark

Spark Programming Guide

An Example on Spark

Running Applications on Spark

Spark Streaming

Spark Streaming Programming Guide

An Example on Spark Streaming

Spark and Storm: A Comparison

Spark SQL

15 January 2015, Majid Hajibaba - Spark

An Overview


Cluster Mode Overview

Spark applications run as independent sets of processes on a cluster

Executor processes run tasks in multiple threads

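The driver should run close to the worker nodes

To operate remotely, open an RPC to the driver and have it submit operations from nearby, rather than running the driver far from the workers

Coordinator (cluster manager):

• Standalone

• Mesos

• YARN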

http://spark.apache.org/docs/1.0.1/cluster-overview.html


The Spark core is a “computational engine” responsible for scheduling, distributing, and monitoring applications in a cluster

Higher-level components (Shark, GraphX, Streaming, …) are like libraries in a software project

Tight integration has several benefits:

Simple improvements, minimized costs, combined processing models

Spark - A Unified Stack


Spark Processing Model


In memory iterative MapReduce

MapReduce

Processing Model

Spark Goal

Provide distributed memory abstractions for clusters to support apps with working sets

Retain the attractive properties of MapReduce:

Fault tolerance

Data locality

Scalability

Solution: augment the data flow model with “resilient distributed datasets” (RDDs)


Resilient Distributed Datasets (RDDs)

Immutable collection of elements that can be operated on in parallel

Created by transforming data using data flow operators (e.g. map)

Parallel operations on RDDs

Benefits

Consistency is easy, due to immutability

Inexpensive fault tolerance: log the lineage of each RDD instead of replicating or checkpointing its data

Locality-aware scheduling of tasks on partitions

Applicable to a broad variety of applications
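The lineage idea behind the fault-tolerance benefit can be pictured without Spark. The sketch below is plain Java, not the Spark API; the class LineageSketch and its methods are hypothetical names chosen for illustration. It stores only the source data and the list of functions applied to it, so a lost result can be rebuilt by replaying those functions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration of lineage-based recovery; NOT Spark code.
final class LineageSketch {
    private final List<Integer> source;  // stable input partition
    private final List<Function<Integer, Integer>> lineage = new ArrayList<>(); // logged steps
    private List<Integer> materialized;  // derived data, may be lost

    LineageSketch(List<Integer> source) {
        this.source = Collections.unmodifiableList(new ArrayList<>(source));
    }

    // Logging a transformation is cheap: only the function is recorded.
    // The receiver is left unchanged and a new dataset description is returned.
    LineageSketch map(Function<Integer, Integer> f) {
        LineageSketch next = new LineageSketch(source);
        next.lineage.addAll(lineage);
        next.lineage.add(f);
        return next;
    }

    // Compute (or recompute) the derived data by replaying the lineage.
    List<Integer> compute() {
        if (materialized == null) {
            List<Integer> out = new ArrayList<>();
            for (Integer x : source) {
                Integer v = x;
                for (Function<Integer, Integer> f : lineage) {
                    v = f.apply(v);
                }
                out.add(v);
            }
            materialized = out;
        }
        return materialized;
    }

    // Simulate losing the in-memory result, e.g. after a node failure.
    void lose() {
        materialized = null;
    }
}
```

Calling lose() and then compute() again yields the same values, because nothing but the recorded functions and the source is needed to rebuild them.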


RDDs


Immutable collection of objects

Partitioned and distributed

Spark Programming Guide

Linking with Spark

Spark 1.2.0 works with Java 6 and higher

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0

Importing Spark classes into the program:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;


Initializing Spark - Creating a SparkContext

Tells Spark how to access a cluster

The entry point: the first thing a Spark program must do is create a SparkContext

This is done through the following constructor:

new SparkContext(master, appName, [sparkHome], [jars])

Example:

// sparkHome and the jar path are optional
JavaSparkContext ctx = new JavaSparkContext("master_url",
    "application name", "path_to_spark_home", "path_to_jars");

Or through SparkConf for advanced configuration

SparkConf

Configuration for a Spark application

Sets various Spark parameters as key-value pairs

SparkConf object contains information about the application

The constructor loads values from any spark.* Java system properties set in the application, and from the classpath

Example

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

SparkConf sparkConf = new SparkConf().setAppName("application name");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);


Loading data into an RDD

An RDD is Spark's primary unit for data representation

It allows for easy parallel operations on the data

Native collections in Java can serve as the basis for an RDD:

JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1, 2, 4));

The number of partitions can be set manually by passing it as a second parameter to parallelize (e.g. ctx.parallelize(data, 10)).

To load external data from a file, use the textFile method of SparkContext:

textFile(path: String, minSplits: Int)

path: the path of the text file

minSplits: minimum number of splits for Hadoop RDDs

The result is an RDD of strings, with each line of the file being one element of the RDD


textFile method

Read a text file and return it as an RDD of Strings

The file can be read from

a local file system (the file must be available at the same path on all nodes in distributed mode)

JavaRDD<String> lines = ctx.textFile("file_path", 1);

or, after shipping a local file to every node:

import org.apache.spark.SparkFiles;
...
ctx.addFile("file_path");
// SparkFiles.get takes the name of the added file, not its full path
JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_name"));

HDFS, or any other Hadoop-supported file system URI

JavaRDD<String> lines = ctx.textFile("hdfs://...");


Manipulating RDD

Transformations: create a new dataset from an existing one

map: works on each individual element of the input RDD and produces a new output element

Transformation functions do not modify the existing elements; they return a new RDD with the new elements

Actions: return a value to the driver program after running a computation on the dataset

reduce: combines elements pairwise to aggregate all the elements of the dataset

import org.apache.spark.api.java.function.Function;

rdd.map(new Function<Integer, Integer>() {

public Integer call(Integer x) { return x+1;}

});

import org.apache.spark.api.java.function.Function2;

rdd.reduce(new Function2<Integer, Integer, Integer>() {

public Integer call(Integer x, Integer y) { return x+y;}

});


RDD Basics

A simple program

JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

textFile: the dataset is not loaded in memory; lines is merely a pointer to the file

map: lineLengths is not immediately computed, because transformations are lazy

reduce: Spark breaks the computation into tasks to run on separate machines

Each machine runs both its part of the map and a local reduction

Each local reduction returns only its answer to the driver program

To use lineLengths again later, we could add the following before the reduce:

lineLengths.persist();

This would cause lineLengths to be saved in memory after the first time it is computed.
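The difference between lazy transformations, actions, and persist() can be modeled in a few lines of plain Java. This is a sketch of the behaviour only, not Spark code; LazyDataset, materialize, and the computations counter are hypothetical names introduced for the illustration.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Plain-Java model of lazy evaluation and persist(); NOT Spark code.
final class LazyDataset<T> {
    private final Supplier<List<T>> recipe; // how to compute the data
    private boolean persist = false;
    private List<T> cached;
    int computations = 0;                   // how often the recipe actually ran

    LazyDataset(Supplier<List<T>> recipe) {
        this.recipe = recipe;
    }

    // Like a transformation: builds a new recipe and computes nothing yet.
    <R> LazyDataset<R> map(Function<T, R> f) {
        return new LazyDataset<R>(
            () -> materialize().stream().map(f).collect(Collectors.toList()));
    }

    // Like persist(): keep the result after the first computation.
    LazyDataset<T> persist() {
        persist = true;
        return this;
    }

    // Like an action: forces the computation (or reuses the kept result).
    List<T> materialize() {
        if (cached != null) {
            return cached;
        }
        computations++;
        List<T> result = recipe.get();
        if (persist) {
            cached = result;
        }
        return result;
    }
}
```

Without persist(), every call to materialize() reruns the recipe; with it, the recipe runs once and later calls reuse the stored list, which mirrors why persisting lineLengths pays off when it is used by more than one action.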


Passing Functions to Spark

Functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package

Two ways to create such functions:

1. Use lambda expressions to concisely define an implementation (in Java 8)

JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

2. Implement the Function interfaces in your own class, either as an anonymous inner class or a named one, and pass an instance of it to Spark

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

class GetLength implements Function<String, Integer> {
  public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) { return a + b; }
}

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());


Working with Key-Value Pairs

Key-value pairs are represented using the scala.Tuple2 class

Call new Tuple2(a, b) to create a tuple

Access its fields with tuple._1() and tuple._2()

import scala.Tuple2;
...
Tuple2<String, String> tuple = new Tuple2<String, String>("foo", "bar");
System.out.println(tuple._1() + " " + tuple._2());

RDDs of key-value pairs

Support distributed “shuffle” operations (e.g. grouping or aggregating the elements by a key)

Are represented by the JavaPairRDD class

JavaPairRDDs can be constructed from JavaRDDs using special versions of the map operations (mapToPair, flatMapToPair)

The JavaPairRDD has both the standard RDD functions and special key-value ones, such as:

reduceByKey

sortByKey


Working with Key-Value Pairs

reduceByKey example

to count how many times each line of text occurs in a file

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
...
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

sortByKey example

to sort the pairs alphabetically, and to bring them back to the driver program as a list of objects

...
// sortByKey returns a new RDD, so collect from its result
List<Tuple2<String, Integer>> sorted = counts.sortByKey().collect();
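What the reduceByKey and sortByKey steps compute can be checked on a single machine with plain Java collections. The sketch below is an analogy rather than Spark code, and LineCountSketch and countLines are hypothetical names: each line contributes a (line, 1) pair, values for equal keys are merged with addition, and a TreeMap stands in for the sort by key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of mapToPair + reduceByKey + sortByKey; NOT Spark code.
final class LineCountSketch {
    static Map<String, Integer> countLines(List<String> lines) {
        // TreeMap keeps its keys sorted, standing in for sortByKey().
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // The pair (line, 1) is merged per key with (a, b) -> a + b.
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=1, b=2}
        System.out.println(countLines(Arrays.asList("b", "a", "b")));
    }
}
```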


flatMap

flatMap is a combination of map and flatten

The function returns a sequence rather than a single item; the sequences are then flattened into one result

Use case: parsing data where some records may fail to parse (a failed record yields an empty sequence)
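The parsing use case can be sketched with Java streams, whose flatMap has the same map-then-flatten shape. This is plain Java rather than the Spark API, and FlatMapSketch, parse, and parseAll are hypothetical names; integers are assumed as the record type only to keep the example short.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of flatMap for fallible parsing; NOT Spark code.
final class FlatMapSketch {
    // Returns a one-element list on success and an empty list on failure.
    static List<Integer> parse(String s) {
        try {
            return Collections.singletonList(Integer.parseInt(s.trim()));
        } catch (NumberFormatException e) {
            return Collections.emptyList();
        }
    }

    // Maps every record to a (possibly empty) list, then flattens the lists.
    static List<Integer> parseAll(List<String> records) {
        return records.stream()
                .flatMap(r -> parse(r).stream())
                .collect(Collectors.toList());
    }
}
```

Records that cannot be parsed simply disappear from the output, with no nulls or sentinel values to filter afterwards.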


http://www.slideshare.net/frodriguezolivera/apache-spark-streaming

RDD Operations


An Example

Counting Words


A Complete Example

Word Counter Program

Package and classes

Import needed classes

Package’s name (will be passed to spark-submit)


JavaWordCount.java

A Complete Example

Main Class

Creating a SparkContext

Creating a SparkConf

Application name


Loading data into an RDD

Base RDD


A Complete Example

JavaRDDs and JavaPairRDDs functions

Construct JavaPairRDDs from JavaRDDs

Count how many times each word of text occurs in a file

Values for each key are aggregated

Create a tuple (key-value pair)

Transformed RDD


A Complete Example

Printing results

accessing tuples

action


Iteration 1

output = count.collect();

Spark Execution Model


Iteration 2

output = count.reduce(func);

Spark Execution Model


Running Applications on Spark

Building Application

With sbt ($ sbt package)

With maven ($ mvn package)

Directory layout

./src
./src/main
./src/main/java
./src/main/java/app.java

pom.xml

<project>
  <!-- modelVersion and groupId are required by Maven; this groupId mirrors the sbt organization below -->
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <artifactId>word-counter</artifactId>
  <name>Word Counter</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
</project>

name.sbt

name := "Word Counter"
organization := "org.apache.spark"
version := "1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"


Submitting Application

Starting Spark (Master and Slaves)

$ ./sbin/start-all.sh

Submitting a job

Submission syntax:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Example:

$ sudo ./bin/spark-submit \
  --class "org.apache.spark.examples.JavaWordCount" \
  --master spark://127.0.0.1:7077 \
  test/target/word-counter-1.0.jar /var/log/syslog


Spark Streaming


Overview

Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets

Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window

Processed data can be pushed out to filesystems, databases, and live dashboards

Potential for combining batch processing and streaming processing in the same system

You can apply Spark’s machine learning and graph processing algorithms on data streams


Spark Streaming: How It Works

Run a streaming computation as a series of very small, deterministic batch jobs

Chop up the live stream into batches of X seconds

Spark treats each batch of data as RDDs and processes them using RDD operations

Finally, the processed results of the RDD operations are returned in batches

Batch sizes as low as ½ second, latency of about 1 second
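The chopping step can be pictured with a small plain-Java sketch. It is not Spark Streaming code: MicroBatchSketch and chop are hypothetical names, and events are assumed to be non-negative timestamps in milliseconds. Each event is assigned to the batch whose interval contains it, and every resulting batch could then be processed like an ordinary small dataset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of discretizing a stream into batches; NOT Spark code.
final class MicroBatchSketch {
    // Groups event timestamps (ms, assumed non-negative) into consecutive
    // batches of batchMs, keyed by the start time of each batch.
    static Map<Long, List<Long>> chop(List<Long> timestampsMs, long batchMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestampsMs) {
            long batchStart = (t / batchMs) * batchMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }
}
```

With a batch size of 1000 ms, events at 0, 400 and 999 ms fall into the batch starting at 0, an event at 1000 ms opens the next batch, and an event at 2500 ms lands in the batch starting at 2000.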


DStreams (Discretized Streams)

A DStream represents a continuous stream of data

It is represented as a sequence of RDDs

It can be created from

input data streams from sources such as Kafka, Flume, and Kinesis

high-level operations applied on other DStreams

Example: lines to words


Running Example - JavaNetworkWordCount

Remember that Spark must be installed

You will first need to run Netcat as a data server by using

$ nc -lk 9999

Then, in a different terminal, you can start the example by using

$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999

Then, any lines typed in the terminal running the Netcat server will be counted and printed on screen every second.

Spark Streaming Programming Guide


Linking with Spark

Same as for Spark batch processing

Spark 1.2.0 works with Java 6 and higher

To write a Spark Streaming application in Java, you need to add a dependency on Spark Streaming.

Add the following dependency to your Maven project:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

Add the following dependency to your SBT project:

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"

Initializing – Creating StreamingContext

Similar to SparkContext

Using the constructor:

new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])

The batchDuration is the size of the batches: the time interval at which streaming data will be divided into batches

A JavaStreamingContext can be created from a SparkConf object:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

It can also be created from an existing JavaSparkContext:

import org.apache.spark.streaming.Durations;
...
JavaSparkContext ctx = ... // existing JavaSparkContext
JavaStreamingContext ssc = new JavaStreamingContext(ctx, Durations.seconds(1));

Setting the Right Batch Size

Batches of data should be processed as fast as they are being generated

The batch interval used may have a significant impact on the data rates the application can sustain

To figure out the right batch size for an application:

Test it with a conservative batch interval (say, 5-10 seconds) and a low data rate

If the system is stable (the delay is comparable to the batch size), try increasing the data rate and/or reducing the batch size

If the system is unstable (the delay is continuously increasing), go back to the previous stable batch size
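The tuning procedure above can be written down as a small search loop. The sketch below is plain Java under stated assumptions, not part of Spark: BatchSizeTuning, continuouslyIncreasing, and tune are hypothetical names, and the caller is assumed to supply a function that runs the application at a given batch interval and returns the observed per-batch delays in milliseconds.

```java
import java.util.function.LongFunction;

// Plain-Java sketch of the batch-size tuning procedure; NOT Spark code.
final class BatchSizeTuning {
    // "Unstable" is modeled as: every observed delay is larger than the previous one.
    static boolean continuouslyIncreasing(long[] delaysMs) {
        if (delaysMs.length < 2) {
            return false;
        }
        for (int i = 1; i < delaysMs.length; i++) {
            if (delaysMs[i] <= delaysMs[i - 1]) {
                return false;
            }
        }
        return true;
    }

    // Tries candidates ordered from conservative (large) to aggressive (small)
    // and returns the last stable one, or -1 if even the first is unstable.
    static long tune(long[] candidatesMs, LongFunction<long[]> measureDelays) {
        long lastStable = -1;
        for (long batchMs : candidatesMs) {
            if (continuouslyIncreasing(measureDelays.apply(batchMs))) {
                break; // unstable: fall back to the previous stable size
            }
            lastStable = batchMs;
        }
        return lastStable;
    }
}
```

A real measurement would also need to decide how many batches to observe and how much noise to tolerate; the strict "always increasing" test here is the simplest reading of the rule of thumb.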


Input DStreams and Receivers

Every input DStream, except a file stream, is associated with a Receiver

A Receiver receives the data from a source and stores it in memory for processing

Spark Streaming provides two categories of built-in streaming sources.

Basic sources

like file systems, socket connections, and Akka actors

directly available in the StreamingContext API

Advanced sources

like Kafka, Flume, Kinesis, Twitter, etc.

are available through extra utility classes

Custom sources


Basic Sources

File Streams

will monitor the directory dataDirectory and process any files created in that directory

streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);

For simple text files

streamingContext.textFileStream(dataDirectory)

Socket Streams

streamingContext.socketStream(String hostname, int port, Function converter, StorageLevel storageLevel)

Custom Actors

Actors are concurrent processes that communicate by exchanging messages

streamingContext.actorStream(actorProps, actor-name)

Queue of RDDs

Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream

streamingContext.queueStream(queueOfRDDs)

Advanced Sources

Advanced sources require interfacing with external non-Spark libraries

Twitter

Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies

Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream as shown below

import org.apache.spark.streaming.twitter.*;
TwitterUtils.createStream(jssc);

Deploying: Generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.

Flume

Kafka

Kinesis

Custom Sources

implement a user-defined receiver


Socket Text Stream

Create an input stream from network source hostname:port

socketTextStream(String hostname, int port);

Data is received using a TCP socket

Received bytes are interpreted as UTF-8 encoded, \n-delimited lines

Optionally takes the storage level to use for storing the received objects

socketTextStream(String hostname, int port, StorageLevel storageLevel)

import org.apache.spark.api.java.StorageLevels;
...
ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);

Class ReceiverInputDStream

Abstract class for defining any InputDStream that has to start a receiver on worker nodes to receive external data

JavaReceiverInputDStream

A Java-friendly interface to ReceiverInputDStream, the abstract class for defining any input stream that receives data over the network

Example:

Creates a DStream from text data received over a TCP socket connection

import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
...
JavaReceiverInputDStream<String> lines =
    ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_ONLY);

Output Operations on DStreams

Allow a DStream’s data to be pushed out to external systems

Trigger the actual execution of all the DStream transformations

Similar to actions for RDDs

Output Operation: Meaning

print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.

saveAsTextFiles(prefix, [suffix]): Saves the DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix.

saveAsObjectFiles(prefix, [suffix]): Saves the DStream's contents as a SequenceFile of serialized Java objects.

saveAsHadoopFiles(prefix, [suffix]): Saves the DStream's contents as a Hadoop file.

foreachRDD(func): Applies a function to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. The function is executed in the driver process running the streaming application.

RDD Persistence

Persisting (or caching) a dataset in memory across operations

Each node stores any computed partitions in memory and reuses them

Methods

.cache(): memory only, useful for iterative algorithms

.persist(): memory only, the data is reused by other actions on the dataset

.persist(storageLevel): storageLevel can be, for example:

MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY

Example:

...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
    args[0], Integer.parseInt(args[1]),
    StorageLevels.MEMORY_AND_DISK_SER);

UpdateStateByKey

To maintain state and update it with new information:

Define the state

Define the state update function

Using updateStateByKey requires checkpointing to be enabled

import com.google.common.base.Optional;
...
Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
  new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
    @Override
    public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
      Integer newSum = ... // add the new values with the previous running count
      return Optional.of(newSum);
    }
  };
...
// applied on a DStream containing words, as (word, 1) pairs
JavaPairDStream<String, Integer> runningCounts =
    pairs.updateStateByKey(updateFunction);
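One possible body for the state update function is sketched below in plain Java so that it can be run on its own. It is an assumption-laden stand-in rather than the slide's code: it uses java.util.Optional from Java 8 instead of the Guava Optional imported above, and RunningCountSketch and update are hypothetical names. The logic is the running count described in the comment: start from the previous count, or 0 if the key has no state yet, and add the values that arrived in the current batch.

```java
import java.util.List;
import java.util.Optional;

// Plain-Java sketch of a running-count update function; NOT Spark code.
final class RunningCountSketch {
    // values: the new values for one key in the current batch
    // state:  the previous running count for that key, if any
    static Optional<Integer> update(List<Integer> values, Optional<Integer> state) {
        int newSum = state.orElse(0);
        for (int v : values) {
            newSum += v;
        }
        return Optional.of(newSum);
    }
}
```

For a word seen three times in the first batch and once in the second, the state goes from absent to 3 and then to 4.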

Checkpointing

To operate 24/7 and be resilient to failures, a streaming application needs to checkpoint enough information to recover from failures

Two types of data are checkpointed

Metadata checkpointing

To recover from failure of the node running the driver

Includes configuration, DStream operations, and incomplete batches

Data checkpointing

To cut off the dependency chains

Removes accumulated metadata in stateful operations

To enable checkpointing:

ctx.checkpoint(hdfsPath)

The checkpoint interval of a DStream can be set by using

dstream.checkpoint(checkpointInterval)

A checkpoint interval of 5-10 times the sliding interval of the DStream is a good setting to try


A Streaming Example

A Complete Example

Network Word Counter Program

Package and classes

Import needed classes

Package’s name



JavaNetworkWordCount.java

A Complete Example

Main Class

Creating a JavaStreamingContext

Creating a SparkConf

Application name

Setting batch size

Socket stream as source

Input DStream

A Complete Example

JavaDStream and JavaPairDStream functions

Construct JavaPairDStream from JavaDStream

Count how many times each word of text occurs in a stream

Values for each key are aggregated

Create a tuple (key-value pair)

Transformed DStream


A Complete Example

Printing results

Print the first ten elements

Start the execution of the streams

Wait for the execution to stop

Spark and Storm: A Comparison


Spark vs. Storm

                            Spark                                  Storm
Origin                      UC Berkeley, 2009                      Twitter
Implemented in              Scala                                  Clojure (Lisp-like)
Enterprise Support          Yes                                    No
Source Model                Open Source                            Open Source
Big Data Processing         Batch and Stream                       Stream
Processing Type             Processing in short interval batches   Real time
Latency                     A few seconds                          Sub-second
Programming API             Scala, Java, Python                    Any PL
Guarantee Data Processing   Exactly once                           At least once
Batch Processing            Yes                                    No
Coordination With           ZooKeeper                              ZooKeeper


Apache Spark

Ippon USA


Apache Storm


Comparison

Higher throughput than Storm

Spark Streaming: 670k records/sec/node

Storm: 115k records/sec/node

Commercial systems: 100-500k records/sec/node


Spark SQL


Spark SQL

Allows relational queries expressed in SQL to be executed using Spark

Data sources are represented as JavaSchemaRDDs

JavaSchemaRDD

A new type of RDD

Similar to a table in a traditional relational database

Composed of Row objects along with a schema that describes them

Can be created from an existing RDD, a JSON dataset, or …


Spark SQL Programming Guide


Initializing - Creating JavaSQLContext

To create a basic JavaSQLContext, all you need is a JavaSparkContext

It must be based on an existing Spark context

import org.apache.spark.sql.api.java.JavaSQLContext;
...
JavaSparkContext sc = ...; // An existing JavaSparkContext.
JavaSQLContext sqlContext = new JavaSQLContext(sc);

SchemaRDD

SchemaRDD can be operated on

as normal RDDs

as a temporary table

allows you to run SQL queries over it

Converting RDDs into SchemaRDDs

Reflection based approach

Uses reflection to infer the schema of an RDD

More concise code

Works well when we know the schema while writing the application

Programmatic based approach

Construct a schema and then apply it to an existing RDD

More verbose

Allows constructing SchemaRDDs when the columns and their types are not known until runtime


JavaBean

A JavaBean is just a standard (a convention)

It is a class that encapsulates many objects into a single object

All properties are private (accessed through getters and setters)

A public no-argument constructor

Implements Serializable

Lots of libraries depend on it


public static class Person implements Serializable {

private String name;

private int age;

public String getName() { return name; }

public void setName(String name) { this.name = name; }

public int getAge() { return age; }

public void setAge(int age) { this.age = age; }

}
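The reason the JavaBean convention matters for the reflection based approach is that property names and types can be discovered from the getters and setters alone. The sketch below shows that discovery with the standard java.beans.Introspector; it is not how Spark SQL is implemented, and BeanSchemaSketch and inferSchema are hypothetical names. The Person bean is repeated inside the sketch only so that it compiles on its own.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of inferring a schema from a JavaBean; NOT Spark code.
final class BeanSchemaSketch {
    // Same shape as the Person bean on the slide.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Maps each bean property name to its Java type, discovered via reflection.
    static Map<String, Class<?>> inferSchema(Class<?> beanClass) {
        Map<String, Class<?>> schema = new TreeMap<>();
        try {
            // Stopping at Object.class skips the "class" property from getClass().
            for (PropertyDescriptor p
                    : Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
                schema.put(p.getName(), p.getPropertyType());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("cannot introspect " + beanClass, e);
        }
        return schema;
    }
}
```

For Person this yields two columns, name of type String and age of type int, which is the information a schema needs.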

Reflection based - An Example

Load a text file like people.txt

Convert each line to a JavaBean

people now is an RDD of JavaBeans


JavaRDD<Person> people = sc.textFile("people.txt").map(

new Function<String, Person>() {

public Person call(String line) throws Exception {

String[] parts = line.split(",");

Person person = new Person();

person.setName(parts[0]);

person.setAge(Integer.parseInt(parts[1].trim()));

return person;

}

});

Reflection based - An Example

Apply a schema to an RDD of JavaBeans (people)

Register it as a temporary table

JavaSchemaRDD schemaPeople =
    sqlContext.applySchema(people, Person.class);
schemaPeople.registerTempTable("people");

SQL can be run over RDDs that have been registered as tables

JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");

The result is a SchemaRDD and supports all the normal RDD operations

The columns of a row in the result can be accessed by ordinal

List<String> teenagerNames = teenagers.map(
  new Function<Row, String>() {
    public String call(Row row) {
      return "Name: " + row.getString(0);
    }
  }).collect();

Programmatic based

Used when JavaBean classes cannot be defined ahead of time

A SchemaRDD can be created programmatically in three steps

1. Create an RDD of Rows from the original RDD

2. Create the schema, represented by a StructType, matching the structure of Rows in the RDD created in Step 1

3. Apply the schema to the RDD of Rows via the applySchema method provided by JavaSQLContext

Example

The structure of records (schema) is encoded in a string

String schemaString = "name age";

Load a text file

JavaRDD<String> people =
    sc.textFile("examples/src/main/resources/people.txt");

Programmatic based - An Example

Generate the schema based on the string of schema

import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
...
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName: schemaString.split(" ")) {
  fields.add(DataType.createStructField(fieldName,
      DataType.StringType, true));
}
StructType schema = DataType.createStructType(fields);

Convert records of the RDD (people) to Rows

import org.apache.spark.sql.api.java.Row;
...
JavaRDD<Row> rowRDD = people.map(
  new Function<String, Row>() {
    public Row call(String record) throws Exception {
      String[] fields = record.split(",");
      return Row.create(fields[0], fields[1].trim());
    }
  });

Programmatic based - An Example

Apply the schema to the RDD.

Register the SchemaRDD as a table.

SQL can be run over RDDs that have been registered as tables

The result is a SchemaRDD and supports all the normal RDD operations

The columns of a row in the result can be accessed by ordinal


JavaSchemaRDD peopleSchemaRDD =

sqlContext.applySchema(rowRDD, schema);

peopleSchemaRDD.registerTempTable("people");

JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people");

List<String> names = results.map(

new Function<Row, String>() {

public String call(Row row) {

return "Name: " + row.getString(0);

}

}).collect();

JSON Datasets

Spark SQL can infer the schema of a JSON dataset and load it as a JavaSchemaRDD

Two methods in a JavaSQLContext

jsonFile(): loads data from a directory of JSON files where each line of the files is a JSON object (a regular multi-line JSON file is not supported)

JavaSchemaRDD people = sqlContext.jsonFile(path);

jsonRDD(): loads data from an existing RDD where each element of the RDD is a string containing a JSON object

A JSON file can be like this: one self-contained JSON object per line

JSON Datasets

The inferred schema can be visualized using printSchema()

people.printSchema();

The result is a tree listing each field with its inferred type

Register this JavaSchemaRDD as a table

people.registerTempTable("people");

SQL statements can be run by using the sql method

JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");

JSON Datasets

A JavaSchemaRDD can be created for a JSON dataset represented by an RDD[String] storing one JSON object per string

A native Java collection (here built with Arrays.asList) can be parallelized into such an RDD

List<String> jsonData = Arrays.asList(
    "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);
JavaSchemaRDD anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD);

Register this JavaSchemaRDD as a table

anotherPeople.registerTempTable("anotherPeople");

SQL statements can be run by using the sql method

// this dataset has no age column, so only the name is selected
JavaSchemaRDD names = sqlContext.sql("SELECT name FROM anotherPeople");

Thrift JDBC/ODBC server

To start the JDBC/ODBC server:

$ ./sbin/start-thriftserver.sh

By default, the server listens on localhost:10000

We can use beeline to test the Thrift JDBC/ODBC server

$ ./bin/beeline

Connect to the JDBC/ODBC server in beeline with

beeline> !connect jdbc:hive2://localhost:10000

Beeline will ask for a username and password

Simply enter the username on your machine and a blank password

See existing databases:

0: jdbc:hive2://localhost:10000> SHOW DATABASES;

Create a database:

0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;

End

Any questions?

