Introduction to Apache Spark

download Introduction to  Apache Spark

of 45

  • date post

  • Category


  • view

  • download


Embed Size (px)


Introduction to Apache Spark. Patrick Wendell - Databricks. What is Spark?. Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop. Efficient. Usable. Rich APIs in Java, Scala , Python Interactive shell. General execution graphs In-memory storage. - PowerPoint PPT Presentation

Transcript of Introduction to Apache Spark

Developing with Apache Spark

Introduction to Apache SparkPatrick Wendell - DatabricksPreamble: * Excited to kick off first day of training * This first tutorial is about using Spark CORE * Weve got a curriculum jammed packed with material, so lets go ahead and get started1What is Spark?EfficientGeneral execution graphsIn-memory storageUsableRich APIs in Java, Scala, PythonInteractive shellFast and Expressive Cluster Computing Engine Compatible with Apache Hadoop2-5 less codeUp to 10 faster on disk,100 in memoryGeneralize the map/reduce framework


The Spark Community

+You!One of the most exciting things youll findGrowing all the timeNASCAR slideIncluding several sponsors of this event are just starting to get involvedIf your logo is not up here, forgive us its hard to keep up!3Todays TalkThe Spark programming model

Language and deployment choices

Example algorithm (PageRank)Spark Programming ModelKey Concept: RDDsResilient Distributed DatasetsCollections of objects spread across a cluster, stored in RAM or on DiskBuilt through parallel transformationsAutomatically rebuilt on failureOperationsTransformations(e.g. map, filter, groupBy)Actions(e.g. count, collect, save)Write programs in terms of operations on distributed datasetsRDD Colloquially referred to as RDDs (e.g. caching in RAM)Lazy operations to build RDDs from other RDDsReturn a result or write it to storage

6Example: Log MiningLoad error messages from a log into memory, then interactively search for various patternslines = spark.textFile(hdfs://...)errors = lines.filter(lambda s: s.startswith(ERROR))messages = s: s.split(\t)[2])messages.cache()

Block 1Block 2Block 3WorkerWorkerWorkerDrivermessages.filter(lambda s: mysql in s).count()messages.filter(lambda s: php in s).count(). . .tasksresultsCache 1Cache 2Cache 3Base RDDTransformed RDDActionFull-text search of Wikipedia60GB on 20 EC2 machine0.5 sec vs. 20s for on-diskAdd variables to the functions in functional programming

7Scaling DownGracefully8Fault RecoveryRDDs track lineage information that can be used to efficiently recompute lost datamsgs = textFile.filter(lambda s: s.startsWith(ERROR)) .map(lambda s: s.split(\t)[2])HDFS FileFiltered RDDMapped RDDfilter(func = startsWith())map(func = split(...))9Programming with RDDsSparkContextMain entry point to Spark functionalityAvailable in shell as variable scIn standalone programs, youd make your own (see later for details)Creating RDDs# Turn a Python collection into an RDDsc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3sc.textFile(file.txt)sc.textFile(directory/*.txt)sc.textFile(hdfs://namenode:9000/path/file)

# Use existing Hadoop InputFormat (Java/Scala only)sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformationsnums = sc.parallelize([1, 2, 3])

# Pass each element through a functionsquares = x: x*x) // {1, 4, 9}

# Keep elements passing a predicateeven = squares.filter(lambda x: x % 2 == 0) // {4}

# Map each element to zero or more othersnums.flatMap(lambda x: => range(x))# => {0, 0, 1, 0, 1, 2}Range object (sequence of numbers 0, 1, , x-1)All lazy13Basic Actionsnums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collectionnums.collect() # => [1, 2, 3]

# Return first K elementsnums.take(2) # => [1, 2]

# Count number of elementsnums.count() # => 3

# Merge elements with an associative functionnums.reduce(lambda x, y: x + y) # => 6

# Write elements to a text filenums.saveAsTextFile(hdfs://file.txt)Launch computations14Working with Key-Value PairsSparks distributed reduce transformations operate on RDDs of key-value pairsPython: pair = (a, b) pair[0] # => a pair[1] # => bScala: val pair = (a, b)pair._1 // => apair._2 // => bJava:Tuple2 pair = new Tuple2(a, b); pair._1 // => apair._2 // => bSome Key-Value Operationspets = sc.parallelize( [(cat, 1), (dog, 1), (cat, 2)])pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map sidelines = sc.textFile(hamlet.txt)counts = lines.flatMap(lambda line: line.split( )) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y)Example: Word Countto be ornot to betobeornottobe(to, 1)(be, 1)(or, 1)(not, 1)(to, 1)(be, 1)(be, 2)(not, 1)(or, 1)(to, 2)Other Key-Value Operationsvisits = sc.parallelize([ (index.html,, (about.html,, (index.html, ])

pageNames = sc.parallelize([ (index.html, Home), (about.html, About) ])

visits.join(pageNames) # (index.html, (, Home))# (index.html, (, Home))# (about.html, (, About))

visits.cogroup(pageNames) # (index.html, ([,], [Home]))# (about.html, ([], [About]))Setting the Level of ParallelismAll the pair RDD operations take an optional second parameter for number of tasks

words.reduceByKey(lambda x, y: x + y, 5)words.groupByKey(5)visits.join(pageViews, 5)Using Local VariablesAny external variables you use in a closure will automatically be shipped to the cluster:

query = sys.stdin.readline()pages.filter(lambda x: query in x).count()

Some caveats:Each task gets a new copy (updates arent sent back)Variable must be Serializable / Pickle-ableDont use fields of an outer object (ships all of it!)Under The Hood: DAG SchedulerGeneral task graphsAutomatically pipelines functionsData locality awarePartitioning awareto avoid shuffles= cached partition= RDDjoinfiltergroupByStage 3Stage 1Stage 2A:B:C:D:E:F:mapNOT a modified version of Hadoop21More RDD OperatorsmapfiltergroupBysortunionjoinleftOuterJoinrightOuterJoinreducecountfoldreduceByKeygroupByKeycogroupcrosszipsampletakefirstpartitionBymapWithpipesave ...How to Run SparkLanguage SupportStandalone ProgramsPython, Scala, & Java

Interactive ShellsPython & Scala

PerformanceJava & Scala are faster due to static typingbut Python is often finePythonlines = sc.textFile(...)lines.filter(lambda s: ERROR in s).count()Scalaval lines = sc.textFile(...)lines.filter(x => x.contains(ERROR)).count()JavaJavaRDD lines = sc.textFile(...);lines.filter(new Function() { Boolean call(String s) { return s.contains(error); }}).count();Interactive ShellThe Fastest Way to Learn SparkAvailable in Python and ScalaRuns as an application on an existing Spark ClusterOR Can run locally

The barrier to entry for working with the spark API is minimal25import sysfrom pyspark import SparkContext

if __name__ == "__main__": sc = SparkContext( local, WordCount, sys.argv[0], None) lines = sc.textFile(sys.argv[1]) counts = lines.flatMap(lambda s: s.split( )) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda x, y: x + y)


or a Standalone Applicationimport;

JavaSparkContext sc = new JavaSparkContext( masterUrl, name, sparkHome, new String[] {app.jar}));

import org.apache.spark.SparkContextimport org.apache.spark.SparkContext._

val sc = new SparkContext(url, name, sparkHome, Seq(app.jar))

Cluster URL, or local / local[N]App nameSpark install path on clusterList of JARs with app code (to ship)Create a SparkContextScalaJavafrom pyspark import SparkContext

sc = SparkContext(masterUrl, name, sparkHome, []))

PythonMention assembly JAR in Maven etcUse the shell with your own JAR`27Add Spark to Your ProjectScala / Java: add a Maven dependency on

groupId: org.spark-projectartifactId:spark-core_2.10version: 0.9.0

Python: run program with our pyspark script

Administrative GUIshttp://:8080 (by default)Software ComponentsSpark runs as a library in your program (1 instance per app)Runs tasks locally or on clusterMesos, YARN or standalone modeAccesses storage systems via Hadoop InputFormat APICan use HBase, HDFS, S3, Your applicationSparkContextLocal threadsCluster managerWorkerSpark executorWorkerSpark executorHDFS or other storageJust pass local or local[k] as master URLDebug using local debuggersFor Java / Scala, just run your program in a debuggerFor Python, use an attachable debugger (e.g. PyDev)Great for development & unit testsLocal ExecutionCluster ExecutionEasiest way to launch is EC2:

./spark-ec2 -k keypair i id_rsa.pem s slaves \ [launch|stop|start|destroy] clusterNameSeveral options for private clusters:Standalone mode (similar to Hadoops deploy scripts)MesosHadoop YARNAmazon EMR: Example Application: PageRankTo Skip to the next section, go to slide 5133Example: PageRankGood example of a more complex algorithmMultiple stages of map & reduceBenefits from Sparks in-memory cachingMultiple iterations over the same dataBasic IdeaGive pages ranks (scores) based on links to themLinks from many pages high rankLink from a high-rank page high rankImage:

Algorithm1. each page at a rank of 1On each iteration, have page p contributerankp / |neighborsp| to its neighborsSet each pages rank to 0.15 + 0.85 contribs36AlgorithmStart each page at a rank of 1On each iteration, have page p contributerankp / |neighborsp| to its neighborsSet each pages rank to 0.15 + 0.85 contribs each page at a rank of 1On each iteration, have page p contributerankp / |neighborsp| to its neighborsSet each pages rank to 0.15