Getting started in Apache Spark and Flink (with Scala) - Part II


Alexander Panchenko, Gerold Hintz, Steffen Remus
Spark tutorial, 11.07.2016

Outline

- Scala: basics of the Scala programming language
- Spark: motivation / what you get on top of MapReduce; basics of Spark: RDDs, transformations, actions, shuffling; tricks useful in a Spark context
- Spark hands-on session: run a Spark notebook and solve easy tasks; set up a Spark project & submit a job to the cluster
- Flink: theory; differences from Spark

Three main benefits of using Spark

- Spark is easy to use: you can develop applications on your laptop, using a high-level API.
- Spark is fast, enabling interactive use and complex algorithms.
- Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines.

This tutorial is based on the book by the creators of Spark: Karau H., Konwinski A., Wendell P., Zaharia M. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.

Data Science Tasks

Experimentation (development of the model):
- Python, MATLAB, R
- iPython notebooks
- interactive computing
- easy to use

Production (using the model):
- Java, Scala, C++/C
- unit tests
- fault tolerance
- no interactive computing
- scalability

Scala + Spark can be used for both!

A Brief History of Spark

- Spark is an open source project.
- Spark started in 2009 as a research project in the UC Berkeley RAD Lab.
- Research papers about Spark were published at academic conferences soon after its creation in 2009.
- In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming.
- It is currently one of the most active projects written in Scala.

What Is Apache Spark?

- Spark Core: resilient distributed datasets (RDDs)
- Spark SQL: Hive tables, Parquet, JSON, Datasets

What Is Apache Spark?

[Figure: components for distributed execution in Spark]

Spark Runtime Architecture

[Figure: the components of a distributed Spark application]

Spark Runtime Architecture

- A master/slave architecture with one central coordinator and many distributed workers.
- The central coordinator is called the driver; it communicates with distributed workers called executors.
- The driver is the process where the main() method of your program runs.
- The driver converts a user program into tasks and schedules those tasks on executors.

Downloading Spark and Getting Started

Download a version pre-built for Hadoop 2.X and later: http://spark.apache.org/downloads.html

Directories that come with Spark:
- README.md: short instructions for getting started with Spark.
- bin: executable files for interacting with Spark in various ways (e.g., the Spark shell, which we cover below).
- core, streaming, python, ...: the source code of the major components of the Spark project.
- examples: helpful standalone Spark jobs that you can look at and run to learn about the Spark API.

Introduction to Spark's Scala Shell

Run: bin/spark-shell, then type the line-count example in the shell:
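The shell snippet itself was a screenshot; a minimal sketch following the Learning Spark line-count example, assuming Spark's README.md is in the working directory:

    val lines = sc.textFile("README.md") // create an RDD from a text file
    lines.count()                        // count the number of items (lines) in the RDD
    lines.first()                        // first item, i.e. the first line of README.md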

We can run parallel operations on the RDD, such as counting the lines of text in the file or printing the first one

Filtering: lambda functions

Filtering example (Scala):
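The Scala snippet was a screenshot; a sketch reusing the lines RDD from the previous slide:

    // keep only the lines that mention Python
    val pythonLines = lines.filter(line => line.contains("Python"))
    pythonLines.first()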

Filtering examples in Java 7 (an anonymous inner class) and Java 8 (a lambda) were shown for comparison.

Standalone Spark Applications

Link against Spark (Maven or SBT), then write a sample class, e.g. word count:
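The class was shown as a screenshot; a minimal word-count sketch, assuming input and output paths are passed as arguments (the object name WordCount is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("wordCount")
        val sc = new SparkContext(conf)
        val input  = sc.textFile(args(0))                   // load input data
        val words  = input.flatMap(line => line.split(" ")) // split lines into words
        val counts = words.map(word => (word, 1))           // pair each word with a count of 1
                          .reduceByKey(_ + _)               // sum the counts per word
        counts.saveAsTextFile(args(1))                      // write the result as text
      }
    }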

Standalone Spark Applications

SBT build file
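The build file was a screenshot; a minimal build.sbt sketch (the exact Spark and Scala versions are assumptions here):

    name := "learning-spark-mini-example"
    version := "0.0.1"
    scalaVersion := "2.10.6"
    // "provided": the cluster supplies the Spark jars at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"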

Build JAR and run it:
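The commands were also a screenshot; roughly, sbt package builds the JAR under target/, and bin/spark-submit --class WordCount <path-to-jar> <input> <output> runs it (the paths here are placeholders).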

Programming with RDDs

- RDD: Resilient Distributed Dataset
- An immutable distributed collection of objects.
- Each RDD is split into multiple partitions.
- Partitions may be computed on different nodes.

Creating an RDD:
- loading an external dataset
- distributing a collection of objects
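The two creation snippets were screenshots; a sketch of both ways:

    // 1) load an external dataset
    val lines = sc.textFile("README.md")
    // 2) distribute a local collection of objects
    val lines2 = sc.parallelize(List("pandas", "i like pandas"))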

Programming with RDDs

Once created, RDDs offer two types of operations: transformations and actions.
- Transformations construct a new RDD from a previous one.
- Actions compute a result based on an RDD and either return it to the driver program or save it to an external storage system, e.g. HDFS.
- RDDs are recomputed each time you run an action; to reuse an RDD you need to persist it in memory, as sketched below:
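A minimal sketch of persisting an RDD so a second action reuses it:

    val lengths = lines.map(line => line.length)
    lengths.persist()                         // keep the RDD in memory once computed
    println(lengths.count())                  // first action: computes and caches
    println(lengths.collect().mkString(","))  // second action: served from the cache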

Spark Execution Steps (Shell & Standalone)

1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.

RDD Operations: Transformations

- The filter() operation does not mutate the existing inputRDD.
- It returns a pointer to an entirely new RDD.
- inputRDD can still be reused later in the program, e.g.:
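The example was a screenshot; a sketch in the spirit of the Learning Spark log-mining example, assuming a log.txt input:

    val inputRDD = sc.textFile("log.txt")
    val errorsRDD = inputRDD.filter(line => line.contains("error"))
    // inputRDD is unchanged, so it can be filtered again
    val warningsRDD = inputRDD.filter(line => line.contains("warning"))
    val badLinesRDD = errorsRDD.union(warningsRDD)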

RDD Operations: Actions

Actions return a result and launch the actual computation:

take() to retrieve a small number of elements
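A sketch, continuing the badLinesRDD example from the previous slide:

    println("Input had " + badLinesRDD.count() + " concerning lines")
    // take() ships only a handful of elements to the driver,
    // unlike collect(), which fetches the entire RDD
    badLinesRDD.take(10).foreach(println)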

Common Transformations and Actions

Element-wise transformations. [Figure: mapped and filtered RDD derived from an input RDD]

Squaring the values in an RDD:
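The snippet was a screenshot; a sketch using map():

    val input = sc.parallelize(List(1, 2, 3, 4))
    val squared = input.map(x => x * x)
    println(squared.collect().mkString(","))  // 1,4,9,16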

Common Transformations and Actions

Element-wise transformations: splitting lines into multiple words:
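A sketch contrasting the two operations:

    val lines = sc.parallelize(List("hello world", "hi"))
    val words = lines.flatMap(line => line.split(" "))  // RDD of "hello", "world", "hi"
    val arrays = lines.map(line => line.split(" "))     // RDD of Array("hello","world"), Array("hi")
    words.first()                                       // "hello"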

Difference between flatMap() and map() on an RDD: flatMap() flattens the iterators returned per element into one collection, while map() keeps one output value (here, an array) per input element.

Common Transformations and Actions

Some simple set operations:
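The table was a screenshot; a sketch of the set-like operations on RDDs:

    val rdd1 = sc.parallelize(List("coffee", "coffee", "panda", "monkey", "tea"))
    val rdd2 = sc.parallelize(List("coffee", "monkey", "kitty"))
    rdd1.distinct()          // remove duplicates (requires a shuffle)
    rdd1.union(rdd2)         // all elements from both RDDs, duplicates kept
    rdd1.intersection(rdd2)  // elements in both, duplicates removed
    rdd1.subtract(rdd2)      // elements only in rdd1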

Common Transformations and Actions

Basic RDD transformations on an RDD containing {1, 2, 3, 3}:
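The table itself was an image; a sketch of the usual entries, following the Learning Spark table:

    val rdd = sc.parallelize(List(1, 2, 3, 3))
    rdd.map(x => x + 1)        // {2, 3, 4, 4}
    rdd.flatMap(x => x.to(3))  // {1, 2, 3, 2, 3, 3, 3}
    rdd.filter(x => x != 1)    // {2, 3, 3}
    rdd.distinct()             // {1, 2, 3}
    rdd.sample(false, 0.5)     // a nondeterministic sample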

Common Transformations and Actions

Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}:
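Again the table was an image; a sketch:

    val a = sc.parallelize(List(1, 2, 3))
    val b = sc.parallelize(List(3, 4, 5))
    a.union(b)         // {1, 2, 3, 3, 4, 5}
    a.intersection(b)  // {3}
    a.subtract(b)      // {1, 2}
    a.cartesian(b)     // {(1, 3), (1, 4), ... (3, 5)}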

Common Transformations and Actions

Basic actions on an RDD containing {1, 2, 3, 3}:

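The two table slides were images; a combined sketch of the common actions:

    val rdd = sc.parallelize(List(1, 2, 3, 3))
    rdd.collect()                // Array(1, 2, 3, 3): all elements to the driver
    rdd.count()                  // 4
    rdd.countByValue()           // Map(1 -> 1, 2 -> 1, 3 -> 2)
    rdd.take(2)                  // Array(1, 2)
    rdd.top(2)                   // Array(3, 3)
    rdd.reduce((x, y) => x + y)  // 9
    rdd.fold(0)((x, y) => x + y) // 9
    rdd.aggregate((0, 0))(                     // compute sum and count at once
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2))  // (9, 4)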

Persistence (Caching)

Without persistence an RDD is recomputed for every action (double execution); persisting the result lets later actions reuse it:

Persistence levels:
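The levels table was an image; the available levels appear as comments in this sketch:

    import org.apache.spark.storage.StorageLevel

    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    // other levels: MEMORY_ONLY_SER, MEMORY_AND_DISK,
    // MEMORY_AND_DISK_SER, DISK_ONLY
    result.persist(StorageLevel.MEMORY_ONLY)
    println(result.count())                  // computes and caches
    println(result.collect().mkString(","))  // reuses the cached result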

Working with Key/Value Pairs

- Pair RDDs are a useful building block in many programs.
- They allow you to act on each key in parallel or to regroup data.
- For instance: reduceByKey() aggregates data for each key; join() merges two RDDs by grouping elements with the same key.

Creating pair RDDs means creating RDDs of Scala tuples. For example, creating a pair RDD using the first word as the key:
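The snippet was a screenshot; a sketch:

    val lines = sc.parallelize(List("holden likes coffee", "panda likes tea"))
    // key each line by its first word
    val pairs = lines.map(line => (line.split(" ")(0), line))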

Transformations on Pair RDDs

Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)}):

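The table spanned two image slides; a combined sketch:

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    rdd.reduceByKey((x, y) => x + y)  // {(1, 2), (3, 10)}
    rdd.groupByKey()                  // {(1, [2]), (3, [4, 6])}
    rdd.mapValues(x => x + 1)         // {(1, 3), (3, 5), (3, 7)}
    rdd.keys                          // {1, 3, 3}
    rdd.values                        // {2, 4, 6}
    rdd.sortByKey()                   // {(1, 2), (3, 4), (3, 6)}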

Transformations on Pair RDDs

Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}):
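A sketch of the two-RDD table:

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    val other = sc.parallelize(List((3, 9)))
    rdd.subtractByKey(other)   // {(1, 2)}
    rdd.join(other)            // {(3, (4, 9)), (3, (6, 9))}
    rdd.rightOuterJoin(other)  // {(3, (Some(4), 9)), (3, (Some(6), 9))}
    rdd.leftOuterJoin(other)   // {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
    rdd.cogroup(other)         // {(1, ([2], [])), (3, ([4, 6], [9]))}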

Transformations on Pair RDDs

Using partial-function syntax for pair RDDs in Scala.

Simple filter on second element:
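A sketch of both spellings, reusing the pairs RDD of (first word, line) tuples from above:

    // partial-function (case) syntax destructures the tuple
    pairs.filter { case (key, value) => value.length < 20 }
    // equivalent positional syntax
    pairs.filter(p => p._2.length < 20)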

Transformations on Pair RDDs

Word and document counts; per-key average with reduceByKey() and mapValues():
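The per-key average was a screenshot; a sketch following the Learning Spark example:

    val nums = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
    val avg = nums.mapValues(x => (x, 1))                            // (value, count)
                  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // per-key sums
                  .mapValues { case (sum, cnt) => sum.toDouble / cnt }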

Transformations on Pair RDDs

Word count example revisited:
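A sketch of the pair-RDD word count, plus the shortcut:

    val input = sc.textFile("README.md")
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    // shortcut: countByValue() returns a local Map on the driver
    val counts2 = words.countByValue()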

Transformations on Pair RDDs

Example of a join (inner join is the default):
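A sketch with illustrative store data:

    val storeAddress = sc.parallelize(List(("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave")))
    val storeRating = sc.parallelize(List(("Ritual", 4.9), ("Philz", 4.8)))
    // inner join: keeps only keys present in both RDDs
    storeAddress.join(storeRating)  // ("Ritual", ("1026 Valencia St", 4.9)), ...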

Actions Available on Pair RDDs

Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)}):
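A sketch of the actions table:

    val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    rdd.countByKey()    // Map(1 -> 1, 3 -> 2)
    rdd.collectAsMap()  // a local Map; only one value per duplicate key survives
    rdd.lookup(3)       // Seq(4, 6)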

Example: PageRank

- links: (pageID, linkList), the list of neighbors of each page
- ranks: (pageID, rank), the current rank for each page
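The transcript breaks off here; a sketch of the standard Spark PageRank loop over these two RDDs, with a toy link graph (0.15/0.85 are the usual damping constants):

    import org.apache.spark.HashPartitioner

    // toy graph: each page and the pages it links to
    val links = sc.parallelize(List(
        ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))))
      .partitionBy(new HashPartitioner(2))
      .persist()                           // links are reused in every iteration
    var ranks = links.mapValues(_ => 1.0)  // initial rank of every page

    for (i <- 0 until 10) {
      // each page sends rank / #links to every neighbor
      val contribs = links.join(ranks).flatMap {
        case (page, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }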
