Yet another intro to Apache Spark
simon-lia-jonassen -
A brief intro to Apache Spark
– You eat, I talk…
Spark Framework
• Efficient data processing via in-memory RDD.
• A rich data-flow API (Java, Scala and Python).
• An interactive shell (Scala and Python).
• Execution environment running in Local and Standalone modes, or on
top of Hadoop/Yarn, Apache Mesos, Amazon EC2.
• Several extensions on top of the core engine:
• Spark SQL, Spark Streaming, MLlib and GraphX.
Get It Running
$ git clone https://github.com/apache/spark
$ export JAVA_HOME=...
$ spark/sbt/sbt assembly
$ spark/sbin/start-master.sh
$ spark/sbin/start-slave.sh --master spark://localhost:7077
Resilient Distributed Datasets (RDD)
• Immutable data collection partitioned across the nodes.
• Data-flow model with parallel transformations and actions.
• Transformations are lazy, the actual computation is done only on actions.
• Recompute partitions on failure from the computation graph (lineage).
• Can be persisted to memory and/or disk for future reuse.
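The laziness and lineage ideas above can be sketched in plain Python. This is a toy model, not the Spark API: `ToyRDD` and all of its methods are hypothetical names invented for illustration, and it assumes everything runs in a single process.

```python
# Toy model of a lazy, lineage-tracking dataset. NOT the Spark API:
# ToyRDD and its methods are hypothetical names used only for illustration.

class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None, name="source"):
        self._partitions = partitions  # only set for source datasets
        self.parent = parent           # dependency on the parent dataset
        self.fn = fn                   # per-partition function applied to parent data
        self.name = name
        self.computed = 0              # how many partition computations ran on this node

    # Transformations: add a node to the graph, compute nothing.
    def map(self, f):
        return ToyRDD(None, self, lambda part: [f(x) for x in part], "map")

    def filter(self, pred):
        return ToyRDD(None, self, lambda part: [x for x in part if pred(x)], "filter")

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node._partitions)

    def compute_partition(self, i):
        # Recompute one partition by walking the lineage back to the source.
        self.computed += 1
        if self.parent is None:
            return list(self._partitions[i])
        return self.fn(self.parent.compute_partition(i))

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)

    # Actions: these trigger the actual computation.
    def collect(self):
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute_partition(i))
        return out

    def count(self):
        return len(self.collect())


source = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.lineage())             # map <- filter <- source
print(source.computed)                     # 0: no action has run yet
print(evens_squared.collect())             # [4, 16, 36]
print(evens_squared.compute_partition(1))  # [16, 36]: one "lost" partition rebuilt alone
```

The last call is the point of lineage: a single partition can be rebuilt from its parents without touching the others.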
Transformations and Actions
• Transformations
• filter, map, flatMap, group/sort/reduceByKey, distinct, union,
intersection, cartesian, subtract, join, cogroup, sample
• Actions
• count, collect, reduce, take, takeSample, foreach, first, saveAsTextFile
• Persistence
• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER,
MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, etc.
Hello World! (pyspark)
>>> file = sc.textFile(".../spark/README.md")
>>> file.first()
u'# Apache Spark'
>>> file.filter(lambda line: "Spark" in line).count()
19
>>> wordCounts = (file.flatMap(lambda line: line.split())
...               .map(lambda word: (word, 1))
...               .reduceByKey(lambda a, b: a + b))
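For readers without a Spark shell at hand, here is a rough plain-Python equivalent of what the flatMap / map / reduceByKey chain computes. It is a single-process sketch only; the sample `lines` are made up and stand in for the README.

```python
from collections import Counter
from itertools import chain

# Made-up sample input standing in for the lines of README.md.
lines = ["# Apache Spark", "Spark is a fast engine", "Run Spark locally"]

# flatMap: split every line and flatten the results into one stream of words.
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with the count 1.
pairs = ((word, 1) for word in words)

# reduceByKey: combine the values of equal keys with a + b.
word_counts = Counter()
for word, n in pairs:
    word_counts[word] += n

print(word_counts["Spark"])                         # 3
print(sum(1 for line in lines if "Spark" in line))  # 3, like filter(...).count()
```

Like the Spark version, the generator stages do no work until the final loop consumes them.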
Advanced RDD
• Data sets can be cached in memory for repeated access.
• Data that does not fit in RAM can be stored on disk.
• The user can decide partitioning for better join performance.
• Each RDD is represented as
• a set of partitions
• a set of dependencies on parent RDDs
• a function for computing it from its parents
• metadata about partitioning and data placement
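Why user-chosen partitioning helps joins can be shown with a small sketch. It assumes both data sets are hash-partitioned by key with the same number of partitions, so matching keys always land in the same partition index and each partition pair can be joined locally. `hash_partition` and `local_join` are hypothetical helper names, not Spark functions, and the data is made up.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by its key's hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join two co-located partitions without looking at any other partition."""
    index = {}
    for key, value in left_part:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, rv in right_part for lv in index.get(key, [])]


# Made-up sample data keyed by an integer user id.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
visits = [(1, "/home"), (3, "/docs"), (1, "/faq")]

n = 2
users_p = hash_partition(users, n)
visits_p = hash_partition(visits, n)

# Same partitioner on both sides: partition i only ever joins with partition i.
joined = []
for i in range(n):
    joined.extend(local_join(users_p[i], visits_p[i]))

print(sorted(joined))
```

If the two sides used different partitioners, every partition on one side could hold matches for every partition on the other, and the data would have to be moved first.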
RDD: Narrow vs Wide Dependencies
• Narrow: each parent partition has no more than one child partition.
• Can do pipelined execution (operator chaining).
• Easier recovery - need to recompute only the lost partitions and
they can be computed in parallel on different nodes.
• Wide: a parent partition can have multiple child partitions.
• Needs shuffling.
• During computation (on an action), the parent partitions are (or at
least were) materialized before the shuffle.
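The difference between the two dependency types can be sketched in plain Python. This is a toy single-process model with made-up data; `shuffle` and `reduce_by_key` are hypothetical helpers, and the character-sum partitioner is an assumption chosen only to keep the output deterministic.

```python
from collections import defaultdict

# Three input partitions of (key, value) pairs (made-up data).
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]

# Narrow dependency: output partition i depends only on input partition i,
# so several such steps can be pipelined without moving data.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]


def shuffle(parts, num_out):
    """Wide dependency: every input partition may feed every output partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            # Toy deterministic partitioner (an assumption, not Spark's).
            out[sum(map(ord, k)) % num_out][k].append(v)
    return out


def reduce_by_key(parts, num_out, op):
    """Shuffle by key, then fold the values of each key inside its partition."""
    result = []
    for bucket in shuffle(parts, num_out):
        reduced = {}
        for k, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = op(acc, v)
            reduced[k] = acc
        result.append(reduced)
    return result


totals = reduce_by_key(doubled, 2, lambda a, b: a + b)
print(totals)  # [{'b': 4}, {'a': 6, 'c': 2}]
```

Losing one partition of `doubled` means recomputing one input partition; losing one partition of `totals` means reading from all of them again.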
Comparison to DSM and Map-Reduce
• Spark has an expressive API and support for Scala/Java/Python.
• Spark does efficient scheduling and recovery.
• Spark is best suited to iterative batch data-flow operations on large
data sets.
• For ML and graph applications it has shown a 20x speedup due to
the elimination of I/O and deserialization.
Spark Platform
• Spark SQL
• Provides Hive compatible SQL access and JDBC/ODBC.
• GraphX
• Provides a flexible API for graph processing.
• Includes a variety of graph algorithms for computing PageRank,
connected components, triangle count, SVD++, label propagation, etc.
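As a reminder of what an iterative graph algorithm like PageRank does, here is a minimal plain-Python sketch. It does not use the GraphX API; the function name, the edge list and the parameter defaults (50 iterations, damping 0.85) are all assumptions made for illustration.

```python
def pagerank(edges, num_iters=50, damping=0.85):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        contrib = {n: 0.0 for n in nodes}
        for src, dsts in out_links.items():
            if dsts:
                # Split the node's rank evenly over its outgoing links.
                share = ranks[src] / len(dsts)
                for dst in dsts:
                    contrib[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    contrib[n] += ranks[src] / len(nodes)
        ranks = {n: (1 - damping) / len(nodes) + damping * contrib[n] for n in nodes}
    return ranks


# Made-up graph: a cycle a -> b -> c -> a, plus d pointing at c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Each iteration reads the full rank table produced by the previous one, which is the access pattern that benefits from keeping data in memory between steps.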
Spark Platform
• Spark Streaming
• Provides a flexible streaming API based on micro-batch processing.
• Includes methods for stream source definitions, transformations and
window operations.
• MLlib
• Provides a set of ML algorithms for classification (logistic
regression, SVM, naive Bayes), linear regression, clustering (k-means),
matrix decomposition (SVD/PCA) and collaborative filtering (ALS).
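The micro-batch and window ideas from the Spark Streaming bullets above can be sketched in plain Python. This is not the Spark Streaming API: `micro_batches` and `windowed_counts` are hypothetical helpers, the events are made up, and the window is measured in number of batches.

```python
from collections import deque


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    if not batches:
        return []
    # Keep empty batches too, so every interval is represented.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def windowed_counts(batches, window_length):
    """Count the events in a sliding window covering the last N batches."""
    window = deque(maxlen=window_length)
    counts = []
    for batch in batches:
        window.append(batch)
        counts.append(sum(len(b) for b in window))
    return counts


# Made-up events as (timestamp in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d"), (3.7, "e"), (4.0, "f")]

batches = micro_batches(events, batch_interval=1.0)
print(batches)                      # [['a', 'b'], ['c'], [], ['d', 'e'], ['f']]
print(windowed_counts(batches, 2))  # [2, 3, 1, 2, 3]
```

Each batch is an ordinary finite collection, which is why the same transformations used for batch data carry over to streams.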
Personal impressions
• The interactive shell is awesome!
• Good documentation and lots of examples, but the source code being in
Scala is =/
• Tons of info messages are distracting, and error messages on teardown
are spooky.
• MLlib lacks methods for data cleaning/transformation, model validation
and exploration.
References
• Zaharia et al., 2012: Resilient distributed datasets: A fault-tolerant
abstraction for in-memory cluster computing. (Paper of the week!)
• http://spark.apache.org/
• Slideshare presentations: one, two, three, four, five.
Thanks!