D08 Spark Intro · Dataframe Spark Streaming Near real-time response MLib Machine Learning Library...

Post on 21-May-2020


CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 8 Spark Intro Modified Sep 26, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.



Created at UC Berkeley’s AMPLab

2009 Project started
2014 May Version 1.0
2016 July Version 2.0.2
2017 July Version 2.2.0

Programming interface for Java, Python, Scala, R

Interactive shell for Python, Scala, R (experimental)

Runs on Linux, Mac, Windows

Cluster manager Native Spark cluster Hadoop YARN Apache Mesos

File System HDFS MapR File System Cassandra OpenStack Swift S3

Pseudo-Distributed Mode Single machine Uses local file system

Python vs Scala on Spark


Scala is faster than Python
But that is not so important here
Most of the computation on Spark is done in Spark

Using Python with Spark Python data has to be

Converted between Python format and Scala/Java format Sent between Python process and JVM

Time Line


1991 - Java project started

1995 - Java 1.0 released, Design Patterns book published

2000 - Java 1.3

2001 - Scala project started

2002 - Nutch started

2004 - Google MapReduce paper

Scala version 1 released

2005 - F# released

2006 - Hadoop split from Nutch

Scala version 2 released

2007 - Clojure released

2009 - Spark project started

2012 - Hadoop 1.0

2014 - Spark 1.0

Major Parts of Spark


Spark Core Resilient Distributed Dataset (RDD)

Spark SQL SQL, csv, json Dataframe

Spark Streaming Near real-time response

MLlib Machine Learning Library Statistics, regression, clustering, dimension reduction, feature extraction Optimization




Ecosystem of packages, libraries and systems on top of Spark Core

Unstructured API Structured API

Resilient Distributed Datasets (RDD) Accumulators Broadcast variables

DataFrames Datasets Spark SQL

Newer, faster, higher level Preferred over Unstructured

Basic Architecture


Local Mode


Driver Process




We will start using local mode

Use local mode to Develop Spark code



Connection to Spark cluster Runs on master node Used to create RDDs, accumulators, broadcast variables Only one SparkContext per JVM stop() the current SparkContext before starting another

SparkContext org.apache.spark.SparkContext Scala version

JavaSparkContext org.apache.spark.api.java.JavaSparkContext Java version

Entry point for Unstructured API
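The one-context-per-JVM rule can be sketched as follows. This is a minimal illustration, assuming local mode; the object name, the helper `localContext`, and the application names are hypothetical and not part of the course material.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextLifecycle {
  // Builds a local SparkContext; the app name is only a label (hypothetical helper).
  def localContext(appName: String): SparkContext =
    new SparkContext(new SparkConf().setAppName(appName).setMaster("local[2]"))

  def main(args: Array[String]): Unit = {
    val first = localContext("First")
    println(first.parallelize(1 to 10).count()) // an action run through the first context
    first.stop()                                // stop it before creating another one

    val second = localContext("Second")         // allowed only because first was stopped
    println(second.isStopped)
    second.stop()
  }
}
```

Creating the second context without calling stop() on the first would normally fail, which is why the sketch stops each context explicitly.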



Contains a SparkContext

Entry point to use Dataset & DataFrame

Connection to Spark cluster Runs on master node
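A small sketch of how a SparkSession relates to its SparkContext, assuming local mode. The object and method names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionExample {
  // Returns (rows counted through the structured API, elements counted through the RDD API).
  def counts(spark: SparkSession): (Long, Long) = {
    val viaSession = spark.range(5).count()                  // Dataset created from the session
    val sc = spark.sparkContext                              // the SparkContext the session contains
    val viaContext = sc.parallelize(Seq("a", "b", "c")).count()
    (viaSession, viaContext)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Session Example")
      .master("local[2]")
      .getOrCreate()
    println(counts(spark))
    spark.stop() // also stops the contained SparkContext
  }
}
```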


Major Data Structures


Resilient Distributed Datasets (RDDs) Fault-tolerant collection of elements that can be operated on in parallel

Dataset & Dataframes Fault-tolerant collection of elements that can be operated on in parallel Rows & Columns JSON, csv, SQL tables Part of SparkSQL



RDD & Dataset Divided into partitions Each partition is on different machine

Resilient & Distributed


Distributed Partitions on different machines

Resilient Each partition can be replicated on multiple machines Data structure knows how to reproduce operations
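Partitioning can be inspected directly. The sketch below assumes local mode, where all partitions live in one JVM rather than on different machines; the object name and the choice of four partitions are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object PartitionExample {
  // Number of elements held by each partition of an RDD split four ways.
  def partitionSizes(spark: SparkSession): Array[Int] = {
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
    rdd.glom().map(_.length).collect() // glom turns each partition into one array
  }

  // Repartitioning produces a new RDD; the original keeps its partitioning.
  def repartitionedCount(spark: SparkSession): Int =
    spark.sparkContext.parallelize(1 to 100, numSlices = 4).repartition(2).getNumPartitions

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Partitions").master("local[2]").getOrCreate()
    println(partitionSizes(spark).mkString(", "))
    println(repartitionedCount(spark))
    spark.stop()
  }
}
```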

Basic Operations


RDDs, Dataframes, Datasets Immutable

Transformations Create new dataset (RDD) from existing one Lazy Only done when needed by an action Examples

map, filter, sample, union, distinct, groupByKey, repartition

Actions Return results to driver program Examples

reduce, collect, count, first, take
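Laziness can be observed with an accumulator that counts how often the function passed to map runs. This is a sketch under the assumption of a single local run with no task retries; the object and accumulator names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LazyExample {
  // Returns the number of map calls seen before and after the action.
  def callsBeforeAndAfter(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")
    // Transformation: only recorded in the lineage, nothing runs yet.
    val doubled = sc.parallelize(1 to 10).map { n => calls.add(1); n * 2 }
    val before = calls.sum
    doubled.count() // action: forces the map to run on every element
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Lazy").master("local[2]").getOrCreate()
    println(callsBeforeAndAfter(spark))
    spark.stop()
  }
}
```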

Actions & Transformations on DataSet



View the Spark Scala API

Starting Spark Scala REPL



From Spark installation

scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@508abc74

scala> spark res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1c618295

Provided variables - Spark Scala REPL only

Starting spark shell


Al pro 9->spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/24 19:57:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:58:03 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1506308274361).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

Sample Interaction


where - Transformation count - Action

scala> val range = spark.range(100)
range: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val rangeWithLabel = range.toDF("number")
rangeWithLabel: org.apache.spark.sql.DataFrame = [number: bigint]

scala> val divisibleBy2 = rangeWithLabel.where("number % 2 = 0")
divisibleBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]

scala> divisibleBy2.count()
res2: Long = 50

Sample Interaction


scala> val filePath = "/Users/whitney/test/README.md"
filePath: String = /Users/whitney/test/README.md

scala> val textFile = spark.read.textFile(filePath)
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res3: Long = 103

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

scala> linesWithSpark.count()
res5: Long = 20

spark.read returns DataFrameReader


Can read csv jdbc json ORC Parquet text
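A sketch of the two ways to use a DataFrameReader: naming the format and calling load, or calling the format-specific method. The temporary file, its contents, and the object name are assumptions made so the example is self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReaderExample {
  // Writes two JSON records to a temporary file and returns its path (illustrative data).
  def sampleJson(): String = {
    val path = Files.createTempFile("people", ".json")
    val lines = Seq("""{"name":"Andy","age":30}""", """{"name":"Justin","age":19}""")
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // Long form: name the format, then load the path.
  def longForm(spark: SparkSession, path: String): DataFrame =
    spark.read.format("json").load(path)

  // Short form: the format-specific method on DataFrameReader.
  def shortForm(spark: SparkSession, path: String): DataFrame =
    spark.read.json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Reader Example").master("local[2]").getOrCreate()
    val path = sampleJson()
    longForm(spark, path).show()
    shortForm(spark, path).show()
    spark.stop()
  }
}
```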

Setting number of Workers


./bin/spark-shell --master local[4]
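Inside a program the same setting can be passed to the session builder instead of the command line. A minimal sketch, with a hypothetical object name:

```scala
import org.apache.spark.sql.SparkSession

object MasterExample {
  // Builds a local session that runs tasks on four worker threads.
  def fourThreadSession(): SparkSession =
    SparkSession.builder().appName("Four Threads").master("local[4]").getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = fourThreadSession()
    println(spark.sparkContext.master)             // the master string the context was started with
    println(spark.sparkContext.defaultParallelism) // default number of partitions for new RDDs
    spark.stop()
  }
}
```

Note that getOrCreate returns an already running session if one exists, in which case the master setting of that session stays in effect.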

Application using SBT


name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"


Sample Program Using RDDs


import org.apache.spark.{SparkConf, SparkContext}

object MasterConnect {
  def main(args: Array[String]): Unit = {
    val filePath = "/Users/whitney/test/README.md"
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val minPartitions = 2
    val logData = sc.textFile(filePath, minPartitions)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Sample Program Using Spark SQL


import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Word Count")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "/Users/whitney/test/README.md"
    val textFile = spark.read.textFile(filePath)
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    val sparkCount = linesWithSpark.count()
    println(sparkCount)
  }
}



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/24 19:53:17 INFO SparkContext: Running Spark version 2.2.0
17/09/24 19:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:53:19 INFO SparkContext: Submitted application: Datasets Test
17/09/24 19:53:19 INFO SecurityManager: Changing view acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing view acls groups to:
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls groups to:
17/09/24 19:53:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/09/24 19:53:20 INFO Utils: Successfully started service 'sparkDriver' on port 61753.
17/09/24 19:53:20 INFO SparkEnv: Registering MapOutputTracker
17/09/24 19:53:20 INFO SparkEnv: Registering BlockManagerMaster
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/24 19:53:20 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-45dac785-124b-4203-a39e-895866d2cca3
17/09/24 19:53:20 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/24 19:53:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/24 19:53:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/24 19:53:21 INFO SparkUI: Bound SparkUI to, and started at
17/09/24 19:53:21 INFO Executor: Starting executor ID driver on host localhost
17/09/24 19:53:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61754.

DataFrame, DataSet & RDD


What are they

What is the difference

When to use which one

Which languages can use them



+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with rows and Columns

Schema Column labels Column types

Row org.apache.spark.sql.Row

Partitioner Distributes DataFrame among cluster

Plan Series of transformations to perform on DataFrame

Languages Scala, Java, JVM languages, Python, R

Optimized Spark Catalyst Optimizer



Same as DataFrame except for Rows

Programmer defines Row class Scala case class Java Bean

Difference from DataFrame Compiler knows column names and column types in DataSet

Compile time error checking

Better data layout

Languages Scala, Java, JVM languages
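A sketch of a typed Dataset whose row class is a Scala case class. The `Person` class, the sample data, and the object name are assumptions for illustration; the point is that field access inside the lambda is checked by the compiler.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Row class for the Dataset, defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Option[Int])

object DatasetExample {
  def people(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // brings toDS() and the Person encoder into scope
    Seq(Person("Andy", Some(30)), Person("Justin", Some(19)), Person("Michael", None)).toDS()
  }

  // A misspelled field such as p.agee would be rejected at compile time.
  def adults(spark: SparkSession): Dataset[Person] =
    people(spark).filter(p => p.age.exists(_ > 21))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Dataset Example").master("local[2]").getOrCreate()
    adults(spark).show()
    spark.stop()
  }
}
```

With a DataFrame the same filter would be written against a column name in a string, so a misspelling would only surface when the query is analyzed at run time.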



+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table No information about types

No compile time or runtime type checking

Far fewer optimizations No Catalyst Optimizer No space optimization

Example - Same data RDD 33.3 MB DataFrame 7.3 MB

Languages Java, Scala Python, R - not recommended

Shares same basic operations as DataFrames & DataSets

Spark Types


Java Types are not space efficient “abcd” - 48 bytes

Spark has its own types

Special memory representation of each type Space efficient Cache aware

Spark         Scala   Python        Python API
ByteType      Byte    int or long   ByteType()
ShortType     Short   int or long   ShortType()
IntegerType   Int     int or long   IntegerType()
LongType      Long    int or long   LongType()
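On the Scala side these types appear when a schema is written by hand. The sketch below builds a one-row DataFrame using the four integer types from the table; the column names and object name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{ByteType, IntegerType, LongType, ShortType, StructField, StructType}

object TypesExample {
  // One column per Spark integer type; the Scala values below must match these types.
  val schema: StructType = StructType(Seq(
    StructField("b", ByteType, nullable = false),
    StructField("s", ShortType, nullable = false),
    StructField("i", IntegerType, nullable = false),
    StructField("l", LongType, nullable = false)))

  def build(spark: SparkSession): DataFrame = {
    val rows = spark.sparkContext.parallelize(Seq(Row(1.toByte, 2.toShort, 3, 4L)))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Types Example").master("local[2]").getOrCreate()
    build(spark).printSchema()
    spark.stop()
  }
}
```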

Structured versus Unstructured


Structured = DataSet, DataFrame

Unstructured = RDD

Typed versus Untyped


Typed = DataSet

Untyped = DataFrame

Some Sample Data


{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}
{"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15}
{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62}
{"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":62}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":588}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":40}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Moldova","count":1}

JSON flight Data 2015

United States Bureau of Transportation statistics

Spark: The Definitive Guide, Zaharia & Chambers, O’Reilly Media, Inc, 2017-10-??



scala> val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
jsonFlightFile: String = /Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json

scala> val flightData2015 = spark.read.json(jsonFlightFile) flightData2015: org.apache.spark.sql.DataFrame =

[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flightData2015.take(2) res3: Array[org.apache.spark.sql.Row] =

Array([United States,Romania,15], [United States,Croatia,1])

Explain - Spark Plan


scala> flightData2015.explain()
== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>


scala> val sortedFlightData2015 = flightData2015.sort("count") sortedFlightData2015: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> sortedFlightData2015.explain()
== Physical Plan ==
*Sort [count#46L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#46L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema:


scala> sortedFlightData2015.take(2) res6: Array[org.apache.spark.sql.Row] =

Array([United States,Singapore,1], [Moldova,United States,1])

Conceptual Plan



Spark stores the plan in case it needs to recompute the result



scala> val jsonSchema = spark.read.json(jsonFlightFile).schema jsonSchema: org.apache.spark.sql.types.StructType =

StructType( StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,LongType,true))

scala> val flightData2015 = spark.read.json(jsonFlightFile) flightData2015: org.apache.spark.sql.DataFrame =

[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

StructField
name: The name of this field.
dataType: The data type of this field.
nullable: Indicates if values of this field can be null values.
metadata



Spark was able to infer the schema since JSON object has labels and types

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}

Other data formats are less structured

Reading CSV


+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

name,age
Andy,30
Justin,19
Michael,

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

scala> val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"

scala> val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@288aaeaf

scala> reader.option("header",true)
scala> reader.option("inferSchema",true)

scala> val df = reader.csv(peopleFile)
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

Reading CSV


scala> df.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

scala> df.schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Some CSV options


encoding
sep (separator)
header
inferSchema
ignoreLeadingWhiteSpace
nullValue
dateFormat
timestampFormat

mode
PERMISSIVE - sets fields to null on a corrupt record
DROPMALFORMED - ignores whole corrupt records
FAILFAST - throws an exception on a corrupt record
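Instead of inferring the schema, a reader can be given one explicitly together with some of the options above. A sketch, assuming a small temporary copy of the people data; the file, object name, and helper names are illustrative.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CsvSchemaExample {
  // Writes the three-person CSV used in these notes to a temporary file.
  def samplePeopleCsv(): String = {
    val path = Files.createTempFile("people", ".csv")
    val text = "name,age\nAndy,30\nJustin,19\nMichael,\n"
    Files.write(path, text.getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)))

  // header: skip the first line; mode FAILFAST: throw on a corrupt record instead of hiding it.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("mode", "FAILFAST")
      .schema(schema)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CSV Schema").master("local[2]").getOrCreate()
    read(spark, samplePeopleCsv()).show()
    spark.stop()
  }
}
```

Supplying the schema avoids the extra pass over the file that inferSchema needs, and the empty age for Michael is read as null because the column is nullable.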


import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object PeopleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
    val reader = spark.read
    reader.option("header", true)
    reader.option("inferSchema", true)
    val df = reader.csv(peopleFile)
    df.show
    df.printSchema
    spark.stop
  }
}

Type Inference Issues


val reader = spark.read What type is reader?

scala> val reader: DataFrameReader = spark.read <console>:23: error: not found: type DataFrameReader val reader: DataFrameReader = spark.read

scala> val reader: org.apache.spark.sql.DataFrameReader = spark.read reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@3982832f

scala> import org.apache.spark.sql.DataFrameReader import org.apache.spark.sql.DataFrameReader

scala> val reader: DataFrameReader = spark.read reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@e6d4002

We can select columns


name,age
Andy,30
Justin,19
Michael,

scala> val names = df.select("name")
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

We can select columns


name,age
Andy,30
Justin,19
Michael,

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> val names = df.select(col("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

If you Don’t lk Abrvtns


name,age
Andy,30
Justin,19
Michael,

scala> import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.column

scala> val names = df.select(column("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Column Operations


scala> val older = df.select(col("name"), col("age").plus(1))
older: org.apache.spark.sql.DataFrame = [name: string, (age + 1): int]

scala> older.show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+

scala> older.printSchema
root
 |-- name: string (nullable = true)
 |-- (age + 1): integer (nullable = true)

scala> val older = df.select($"name", $"age" + 1)

Column Operations - Java vs Scala


Java or Scala
df.select(col("name"), col("age").plus(1))

Scala Only
df.select($"name", $"age" + 1)
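The generated column name "(age + 1)" is awkward to refer to later. One option is to give the expression an alias, sketched here with an in-memory copy of the people data; the alias `age_next_year` and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnExample {
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for toDF on a local Seq
    val data: Seq[(String, Option[Int])] =
      Seq(("Andy", Some(30)), ("Justin", Some(19)), ("Michael", None))
    data.toDF("name", "age")
  }

  // Same computation as col("age").plus(1), but with a readable column name.
  def older(spark: SparkSession): DataFrame =
    people(spark).select(col("name"), (col("age") + 1).alias("age_next_year"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Column Example").master("local[2]").getOrCreate()
    older(spark).show()
    spark.stop()
  }
}
```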


scala> val adult = older.filter($"age" > 21)
scala> val adult = older.filter(col("age") > 21)
scala> val adult = older.filter(col("age").gt(21))
adult: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name: string, (age + 1): int]

scala> adult.show
+----+---------+
|name|(age + 1)|
+----+---------+
|Andy|       31|
+----+---------+

scala> adult.explain
== Physical Plan ==
*Project [name#104, (age#105 + 1) AS (age + 1)#123]
+- *Filter (isnotnull(age#105) && (age#105 > 21))
   +- *FileScan csv [name#104,age#105] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,21)], ReadSchema: struct<name:string,age:int>


scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+
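Besides count, other aggregates can be computed over a whole DataFrame with agg. A sketch on an in-memory copy of the people data (object and method names are illustrative); aggregates such as avg and max skip null values.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{avg, max}

object AggregateExample {
  // Returns a single Row holding the average and the maximum age.
  def ageStats(spark: SparkSession): Row = {
    import spark.implicits._
    val data: Seq[(String, Option[Int])] =
      Seq(("Andy", Some(30)), ("Justin", Some(19)), ("Michael", None))
    val df = data.toDF("name", "age")
    df.agg(avg("age"), max("age")).first() // Michael's null age is ignored by both aggregates
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Aggregate Example").master("local[2]").getOrCreate()
    println(ageStats(spark))
    spark.stop()
  }
}
```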

Saving DataFrames


{"name":"Andy","age":30}
{"name":"Justin","age":19}
{"name":"Michael"}

json, parquet, jdbc, orc, libsvm, csv, text


scala> df.write.format("json").save("people.json")

Produces a directory: people.json Contents:

_SUCCESS 0 Byte file
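A sketch of a save followed by a reload, assuming a temporary output location and an in-memory copy of the people data. The "overwrite" save mode is an addition here so the example can be rerun; the object name is hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object SaveExample {
  // Saves the people data as JSON and reads it back; returns the output path and the reloaded data.
  def saveAndReload(spark: SparkSession): (String, DataFrame) = {
    import spark.implicits._
    val data: Seq[(String, Option[Int])] =
      Seq(("Andy", Some(30)), ("Justin", Some(19)), ("Michael", None))
    val df = data.toDF("name", "age")
    val out = Files.createTempDirectory("spark-save").resolve("people.json").toString
    df.write.mode("overwrite").format("json").save(out) // out becomes a directory of part files
    (out, spark.read.json(out))                         // reading takes the directory as its path
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Save Example").master("local[2]").getOrCreate()
    val (out, reloaded) = saveAndReload(spark)
    println(out)
    reloaded.show()
    spark.stop()
  }
}
```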


Using SQL


scala> df.createOrReplaceTempView("people")

scala> val sqlExample = spark.sql("SELECT * FROM people")
sqlExample: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> sqlExample.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+



show is an action with no return value
It only prints out the values
So the result can not be used

What happens on cluster? Actions return value to master node But often run in batch mode
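A temporary view can be queried with ordinary SQL clauses, and the result is again a DataFrame. A sketch with an in-memory copy of the people data; the object name is illustrative and the view name follows the notes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlExample {
  // Registers the people data as a view and selects the names of people older than 21.
  def adultNames(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val data: Seq[(String, Option[Int])] =
      Seq(("Andy", Some(30)), ("Justin", Some(19)), ("Michael", None))
    data.toDF("name", "age").createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21") // rows with a null age do not match
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SQL Example").master("local[2]").getOrCreate()
    adultNames(spark).show()
    spark.stop()
  }
}
```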

Collect - Returns a result


scala> val data = sqlExample.collect data: Array[org.apache.spark.sql.Row] = Array([Andy,30], [Justin,19], [Michael,null])

scala> data(0) res22: org.apache.spark.sql.Row = [Andy,30]

scala> data(0)(0) res23: Any = Andy
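Indexing a Row gives a value of type Any. Typed accessors avoid the casts, as in this sketch on an in-memory copy of the people data (object and method names are hypothetical). A null check comes before getInt because a primitive Int cannot hold null.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object RowAccessExample {
  def namesAndAges(spark: SparkSession): Array[(String, Option[Int])] = {
    import spark.implicits._
    val data: Seq[(String, Option[Int])] =
      Seq(("Andy", Some(30)), ("Justin", Some(19)), ("Michael", None))
    val rows: Array[Row] = data.toDF("name", "age").collect()
    rows.map { row =>
      val name = row.getAs[String]("name")                          // typed access by column name
      val age = if (row.isNullAt(1)) None else Some(row.getInt(1))  // typed access by position
      (name, age)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Row Access").master("local[2]").getOrCreate()
    namesAndAges(spark).foreach(println)
    spark.stop()
  }
}
```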