Getting started in Apache Spark and Flink (with Scala) - Part II


Transcript of Getting started in Apache Spark and Flink (with Scala) - Part II

Page 1: Getting started in Apache Spark and Flink (with Scala) - Part II

11.07.2016 | Spark tutorial

Getting started in Apache Spark and Flink (with Scala)
Alexander Panchenko, Gerold Hintz, Steffen Remus

Page 2: Getting started in Apache Spark and Flink (with Scala) - Part II


Outline

Scala
- basics of the Scala programming language

Spark
- motivation / what you get on top of MapReduce
- basics of Spark: RDDs, transformations, actions, shuffling
- "tricks" useful in the Spark context

Spark hands-on session
- run a Spark notebook and solve easy tasks
- set up a Spark project & submit a job to the cluster

Flink
- theory
- differences from Spark

Page 3: Getting started in Apache Spark and Flink (with Scala) - Part II


Three main benefits of using Spark

1. Spark is easy to use: you can develop applications on your laptop, using a high-level API

2. Spark is fast, enabling interactive use and complex algorithms

3. Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines

This tutorial is based on the book by the creators of Spark: Karau H., Konwinski A., Wendell P., Zaharia M. "Learning Spark: Lightning-Fast Big Data Analysis." O'Reilly, 2015

Page 4: Getting started in Apache Spark and Flink (with Scala) - Part II


Data Science Tasks

Experimentation: development of the model
- Python, MATLAB, R
- iPython notebooks
- Interactive computing
- Easy to use

Production: using the model
- Java, Scala, C++/C
- Unit tests
- Fault tolerance
- No interactive computing
- Scalability

Scala + Spark can be used for both!

Page 5: Getting started in Apache Spark and Flink (with Scala) - Part II


A Brief History of Spark

Spark is an open source project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab. Research papers were published about Spark at academic conferences soon after its creation in 2009. In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming. It is currently one of the most active projects written in Scala.

Page 6: Getting started in Apache Spark and Flink (with Scala) - Part II


What Is Apache Spark?

Spark Core: resilient distributed dataset (RDD)
Spark SQL: Hive tables, Parquet, JSON, Datasets

Page 7: Getting started in Apache Spark and Flink (with Scala) - Part II


What Is Apache Spark?

Components for distributed execution in Spark

Page 8: Getting started in Apache Spark and Flink (with Scala) - Part II


Spark Runtime Architecture

The components of a distributed Spark application

Page 9: Getting started in Apache Spark and Flink (with Scala) - Part II


Spark Runtime Architecture

The master/slave architecture has one central coordinator and many distributed workers:
- The central coordinator is called the driver
- The driver communicates with distributed workers called executors
- The driver is the process where the main() method of your program runs
- The driver is responsible for converting a user program into tasks and for scheduling tasks on executors

Page 10: Getting started in Apache Spark and Flink (with Scala) - Part II


Downloading Spark and Getting Started

Download a version "Pre-built for Hadoop 2.X and later": http://spark.apache.org/downloads.html

Directories that come with Spark:

README.md
Contains short instructions for getting started with Spark.

bin
Contains executable files that can be used to interact with Spark in various ways (e.g., the Spark shell, which we cover later in this tutorial).

core, streaming, python, ...
Contain the source code of major components of the Spark project.

examples
Contains some helpful Spark standalone jobs that you can look at and run to learn about the Spark API.

Page 11: Getting started in Apache Spark and Flink (with Scala) - Part II


Introduction to Spark’s Scala Shell

Run: bin/spark-shell

Type the Scala line-count example into the shell. We can then run parallel operations on the RDD, such as counting the lines of text in the file or printing the first one.
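A minimal sketch of the shell session, following the book's first example (README.md from the Spark directory is the assumed input):

    scala> val lines = sc.textFile("README.md")  // create an RDD from a text file
    scala> lines.count()                         // count the number of elements (lines)
    scala> lines.first()                         // first element, i.e. the first line of the file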

Page 12: Getting started in Apache Spark and Flink (with Scala) - Part II


Filtering: lambda functions

Filtering example (Scala):
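A sketch of the Scala version, assuming the lines RDD from the previous slide:

    // keep only the lines that mention Python, using a function literal (lambda)
    val pythonLines = lines.filter(line => line.contains("Python"))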

The same filter in Java 7 requires an anonymous inner class implementing Spark's Function interface; Java 8 can use a lambda expression, which is nearly as concise as the Scala version.

Page 13: Getting started in Apache Spark and Flink (with Scala) - Part II


Standalone Spark Applications

Link to Spark in your build file (Maven or SBT), then write a sample class, e.g. word count:
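A sketch of such a word count class, close to the book's example (the object name and taking the input/output paths from args are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("wordCount")
        val sc = new SparkContext(conf)

        val input = sc.textFile(args(0))                    // load the input file
        val words = input.flatMap(line => line.split(" "))  // split each line into words
        val counts = words.map(word => (word, 1))           // pair each word with a count of 1
                          .reduceByKey(_ + _)               // sum the counts per word
        counts.saveAsTextFile(args(1))                      // save the result as a text file
      }
    }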

Page 14: Getting started in Apache Spark and Flink (with Scala) - Part II


Standalone Spark Applications

SBT build file
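A minimal sketch of such a build file (the project name is an assumption; Spark 1.6 / Scala 2.10 matches the tutorial's era):

    // build.sbt
    name := "learning-spark-mini-example"
    version := "0.0.1"
    scalaVersion := "2.10.6"
    // "provided" because the cluster supplies Spark at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"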

Build JAR and run it:
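A sketch of packaging and submitting the job (the JAR name follows from the build file above):

    $ sbt clean package
    $ spark-submit --class WordCount \
        target/scala-2.10/learning-spark-mini-example_2.10-0.0.1.jar \
        input.txt output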

Page 15: Getting started in Apache Spark and Flink (with Scala) - Part II


Programming with RDDs

RDD -- Resilient Distributed Dataset
- An immutable distributed collection of objects
- Each RDD is split into multiple partitions
- Partitions may be computed on different nodes

Creating an RDD:
- by loading an external dataset
- by distributing a collection of objects
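A sketch of both ways, following the book's examples:

    val lines = sc.textFile("README.md")                        // load an external dataset
    val words = sc.parallelize(List("pandas", "i like pandas")) // distribute a local collection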

Page 16: Getting started in Apache Spark and Flink (with Scala) - Part II


Programming with RDDs

Once created, RDDs offer two types of operations: transformations and actions.
- Transformations construct a new RDD from a previous one
- Actions compute a result based on an RDD, and either return it to the driver program or save it to an external storage system, e.g. HDFS

RDDs are recomputed each time you run an action. To reuse an RDD you need to persist it in memory.
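A sketch following the book's persist example (the lines RDD from the earlier slides is assumed):

    val pythonLines = lines.filter(line => line.contains("Python"))
    pythonLines.persist()  // keep the RDD in memory across actions
    pythonLines.count()    // first action: computes and caches pythonLines
    pythonLines.first()    // second action: reuses the cached data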

Page 17: Getting started in Apache Spark and Flink (with Scala) - Part II


Spark Execution Steps (Shell & Standalone)

1. Create some input RDDs from external data.

2. Transform them to define new RDDs using transformations like filter().

3. Persist any intermediate RDDs that will need to be reused.

4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
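Put together, the four steps look roughly like this (a sketch based on the book's log-file example; log.txt is an assumed input):

    val inputRDD = sc.textFile("log.txt")                           // 1. create an input RDD
    val errorsRDD = inputRDD.filter(line => line.contains("ERROR")) // 2. transform it
    errorsRDD.persist()                                             // 3. persist for reuse
    println(errorsRDD.count())                                      // 4. actions launch the computation
    println(errorsRDD.first())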

Page 18: Getting started in Apache Spark and Flink (with Scala) - Part II


RDD Operations: Transformations

The filter() operation does not mutate the existing inputRDD; it returns a pointer to an entirely new RDD, so inputRDD can still be reused later in the program.
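For example, a sketch following the book's log example (inputRDD as above):

    val errorsRDD = inputRDD.filter(line => line.contains("ERROR"))
    val warningsRDD = inputRDD.filter(line => line.contains("WARN")) // inputRDD is reused, not mutated
    val badLinesRDD = errorsRDD.union(warningsRDD)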

Page 19: Getting started in Apache Spark and Flink (with Scala) - Part II


RDD Operations: Actions

Actions return some result and launch the actual computation; take() retrieves a small number of elements.
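A sketch, continuing the badLinesRDD example from the previous slide:

    println("Input had " + badLinesRDD.count() + " concerning lines")
    println("Here are 10 examples:")
    badLinesRDD.take(10).foreach(println)  // fetch 10 elements to the driver and print them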

Page 20: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Element-wise transformations: map() and filter() produce a new mapped or filtered RDD from an input RDD.

Squaring the values in an RDD:
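A sketch, following the book's example:

    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)       // square each element
    println(result.collect().mkString(","))  // prints: 1,4,9,16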

Page 21: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Element-wise transformations: splitting lines into multiple words with flatMap().

Difference between flatMap() and map() on an RDD:
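A sketch of the book's flatMap() example, with the map() difference noted in comments:

    val lines = sc.parallelize(List("hello world", "hi"))
    val words = lines.flatMap(line => line.split(" "))
    words.first()  // returns "hello"
    // map(line => line.split(" ")) would give an RDD of arrays: {["hello", "world"], ["hi"]};
    // flatMap() flattens the results into a single RDD of words: {"hello", "world", "hi"}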

Page 22: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Some simple set operations:
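A sketch of the set operations (the example data mirrors the book's figure):

    val rdd1 = sc.parallelize(List("coffee", "coffee", "panda", "monkey", "tea"))
    val rdd2 = sc.parallelize(List("coffee", "monkey", "kitty"))
    rdd1.distinct()          // {coffee, panda, monkey, tea}
    rdd1.union(rdd2)         // keeps duplicates: {coffee, coffee, coffee, panda, monkey, monkey, tea, kitty}
    rdd1.intersection(rdd2)  // removes duplicates: {coffee, monkey}
    rdd1.subtract(rdd2)      // {panda, tea}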

Page 23: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Basic RDD transformations on an RDD containing {1, 2, 3, 3}:
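A sketch of these transformations (results as in the book's Table 3-2):

    val rdd = sc.parallelize(List(1, 2, 3, 3))
    rdd.map(x => x + 1)       // {2, 3, 4, 4}
    rdd.flatMap(x => x to 3)  // {1, 2, 3, 2, 3, 3, 3}
    rdd.filter(x => x != 1)   // {2, 3, 3}
    rdd.distinct()            // {1, 2, 3}
    rdd.sample(false, 0.5)    // a nondeterministic sample, e.g. {1, 2, 3}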

Page 24: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}:
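A sketch (results as in the book's Table 3-3):

    val rdd = sc.parallelize(List(1, 2, 3))
    val other = sc.parallelize(List(3, 4, 5))
    rdd.union(other)         // {1, 2, 3, 3, 4, 5}
    rdd.intersection(other)  // {3}
    rdd.subtract(other)      // {1, 2}
    rdd.cartesian(other)     // {(1, 3), (1, 4), (1, 5), (2, 3), ..., (3, 5)}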

Page 25: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Basic actions on an RDD containing {1, 2, 3, 3}:
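A sketch of the basic actions (results as in the book's Table 3-4):

    val rdd = sc.parallelize(List(1, 2, 3, 3))
    rdd.collect()       // Array(1, 2, 3, 3): return all elements to the driver
    rdd.count()         // 4
    rdd.countByValue()  // Map(1 -> 1, 2 -> 1, 3 -> 2)
    rdd.take(2)         // Array(1, 2)
    rdd.top(2)          // Array(3, 3)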

Page 26: Getting started in Apache Spark and Flink (with Scala) - Part II


Common Transformations and Actions

Basic actions on an RDD containing {1, 2, 3, 3} (continued):
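And a sketch of the aggregating actions on the same rdd:

    rdd.reduce((x, y) => x + y)   // 9: combine elements pairwise
    rdd.fold(0)((x, y) => x + y)  // 9: like reduce(), but with a zero value
    // aggregate(): accumulate a (sum, count) pair, then compute the average
    val (sum, cnt) = rdd.aggregate((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),  // add a value into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))  // merge two accumulators
    val avg = sum.toDouble / cnt             // 2.25
    rdd.foreach(println)  // apply a function to each element (runs on the executors)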

Page 27: Getting started in Apache Spark and Flink (with Scala) - Part II


Persistence (Caching)

Without persistence the result is computed twice (double execution); persisting lets you reuse it.
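A sketch following the book's example (input is an RDD of integers, as on the earlier slides):

    import org.apache.spark.storage.StorageLevel

    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    result.persist(StorageLevel.DISK_ONLY)   // without this line, each action recomputes result
    println(result.count())                  // computes and persists the RDD
    println(result.collect().mkString(","))  // reuses the persisted data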

Persistence levels: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY (the _SER levels store objects serialized; the _AND_DISK levels spill to disk when the data does not fit in memory).

Page 28: Getting started in Apache Spark and Flink (with Scala) - Part II


Working with Key/Value Pairs

Pair RDDs are a useful building block in many programs: they allow you to act on each key in parallel or to regroup data. For instance:
- reduceByKey() aggregates data separately for each key
- join() merges two RDDs by grouping elements with the same key

Creating pair RDDs = creating RDDs of Scala tuples. Creating a pair RDD using the first word as the key:
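A sketch of the book's example:

    val pairs = lines.map(line => (line.split(" ")(0), line))  // key = first word of the line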

Page 29: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
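A sketch of the first group (results as in the book's Table 4-1):

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    rdd.reduceByKey((x, y) => x + y)  // {(1, 2), (3, 10)}: combine values with the same key
    rdd.groupByKey()                  // {(1, [2]), (3, [4, 6])}
    rdd.mapValues(x => x + 1)         // {(1, 3), (3, 5), (3, 7)}: apply to values, keep keys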

Page 30: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)}), continued:
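And a sketch of the remaining ones:

    rdd.flatMapValues(x => x to 5)  // {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)}
    rdd.keys                        // {1, 3, 3}
    rdd.values                      // {2, 4, 6}
    rdd.sortByKey()                 // the same RDD, sorted by key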

Page 31: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Transformations on two pair RDDs (example: rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}):
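A sketch (results as in the book's Table 4-2):

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    val other = sc.parallelize(List((3, 9)))
    rdd.subtractByKey(other)   // {(1, 2)}: drop elements whose key is in other
    rdd.join(other)            // {(3, (4, 9)), (3, (6, 9))}: inner join
    rdd.rightOuterJoin(other)  // {(3, (Some(4), 9)), (3, (Some(6), 9))}
    rdd.leftOuterJoin(other)   // {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
    rdd.cogroup(other)         // {(1, ([2], [])), (3, ([4, 6], [9]))}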

Page 32: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Scala's partial function syntax (case) lets you destructure the key/value tuple of a pair RDD directly.

A simple filter on the second element:
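A sketch following the book (pairs is the (first word, line) RDD from the earlier slide):

    pairs.filter { case (key, value) => value.length < 20 }  // keep pairs whose line is short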

Page 33: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Word and document counts can be computed with the same pattern; here, a per-key average with reduceByKey() and mapValues():
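A sketch of the book's per-key average:

    val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
    val sumCount = rdd.mapValues(x => (x, 1))                            // value -> (value, 1)
                      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // per-key (sum, count)
    val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
    // {("panda", 0.5), ("pink", 3.5), ("pirate", 3.0)}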

Page 34: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Word count example revisited:
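A sketch of the book's word count in pair-RDD form (the input path is an assumption):

    val input = sc.textFile("README.md")
    val words = input.flatMap(x => x.split(" "))
    val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)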

Page 35: Getting started in Apache Spark and Flink (with Scala) - Part II


Transformations on Pair RDDs

Example of a join (inner join is the default):
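A sketch in the spirit of the book's store example (the names and data are illustrative):

    val storeAddress = sc.parallelize(List(
      ("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave")))
    val storeRating = sc.parallelize(List(("Ritual", 4.9), ("Philz", 4.8)))
    storeAddress.join(storeRating)  // (store, (address, rating)): only keys present in both RDDs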

Page 36: Getting started in Apache Spark and Flink (with Scala) - Part II


Actions Available on Pair RDDs

Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)}):
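A sketch (results as in the book's Table 4-3):

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    rdd.countByKey()    // Map(1 -> 1, 3 -> 2): count elements per key
    rdd.collectAsMap()  // Map(1 -> 2, 3 -> 6): as a map, one value kept per key
    rdd.lookup(3)       // Seq(4, 6): all values for a key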

Page 37: Getting started in Apache Spark and Flink (with Scala) - Part II


Example: PageRank

links – (pageID, linkList) – a list of neighbors of each page
ranks – (pageID, rank) – the current rank for each page
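A sketch close to the book's PageRank example (10 iterations; the number of partitions and the file paths are assumptions):

    import org.apache.spark.HashPartitioner

    // Partitioning links once keeps the repeated join from reshuffling it every iteration
    val links = sc.objectFile[(String, Seq[String])]("links")
                  .partitionBy(new HashPartitioner(100))
                  .persist()
    var ranks = links.mapValues(_ => 1.0)  // start every page with rank 1.0

    for (i <- 0 until 10) {
      val contributions = links.join(ranks).flatMap {
        case (pageId, (pageLinks, rank)) =>
          pageLinks.map(dest => (dest, rank / pageLinks.size))  // spread rank over neighbors
      }
      ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
    }
    ranks.saveAsTextFile("ranks")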

Page 38: Getting started in Apache Spark and Flink (with Scala) - Part II


Important topics not covered in this intro

MLlib
- machine learning in a distributed way
- basic linear algebra in a distributed way: sparse and dense vectors and matrices

Partitioning
- no free lunch: there is no automagic scaling of an arbitrary algorithm
- making an algorithm efficient = trying to minimize shuffling of the data

Spark SQL, Spark 2.0, Datasets, DataFrames
- something like Python's pandas or R's DataFrame
- great for interactive data mining and for working with CSV files