Yet another intro to Apache Spark

A brief intro to Apache Spark

– You eat, I talk…

Spark Framework

• Efficient data processing via in-memory RDDs.

• A rich data-flow API (Java, Scala and Python).

• An interactive shell (Scala and Python).

• Execution environment running in Local and Standalone modes, or on top of Hadoop/YARN, Apache Mesos, Amazon EC2.

• Several extensions on top of the core engine:

• Spark SQL, Spark Streaming, MLlib and GraphX.

Get It Running

$ git clone https://github.com/apache/spark
$ export JAVA_HOME=...
$ spark/sbt/sbt assembly
$ spark/sbin/start-master.sh
$ spark/sbin/start-slave.sh --master spark://localhost:7077

Resilient Distributed Datasets (RDD)

• Immutable data collection partitioned across the nodes.

• Data-flow model with parallel transformations and actions.

• Transformations are lazy; the actual computation happens only when an action is invoked.

• Lost partitions are recomputed on failure from the computation graph (lineage).

• Can be persisted to memory and/or disk for future reuse.
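The lazy-transformation and lineage ideas above can be sketched in plain Python. This is a toy, single-process model and not Spark's implementation; `ToyRDD`, `square` and `calls` are hypothetical names introduced only for illustration.

```python
class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, op=None):
        self._source = source  # base data, only set on the root dataset
        self._parent = parent  # lineage: the dataset this one derives from
        self._op = op          # turns the parent's iterator into this one's

    def map(self, f):
        # Transformation: only records what to do, computes nothing.
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        # Transformation: also lazy.
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)))

    def _iterate(self):
        # Walk the lineage back to the source and replay the recorded steps.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._iterate())

    def collect(self):
        # Action: triggers the actual computation.
        return list(self._iterate())

    def count(self):
        # Action: triggers the computation, keeps only the size.
        return sum(1 for _ in self._iterate())


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(source=range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has run yet
result = evens.collect()          # the action runs the whole chain
calls_after_action = len(calls)
```

In this sketch a second action such as `evens.count()` replays the whole chain from the source, which is the cost that persisting a dataset is meant to avoid.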

Transformations and Actions

• Transformations

• filter, map, flatMap, groupByKey, sortByKey, reduceByKey, distinct, union, intersection, cartesian, subtract, join, cogroup, sample

• Actions

• count, collect, reduce, take, takeSample, foreach, first, saveAsTextFile

• Persistence

• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, etc.

Hello World! (pyspark)

>>> file = sc.textFile(".../spark/README.md")
>>> file.first()
u'# Apache Spark'
>>> file.filter(lambda line: "Spark" in line).count()
19
>>> wordCounts = (file.flatMap(lambda line: line.split())
...               .map(lambda word: (word, 1))
...               .reduceByKey(lambda a, b: a + b))
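To see what the flatMap / map / reduceByKey chain computes, here is a rough plain-Python equivalent that runs without a cluster. The sample `lines` and the helper `reduce_by_key` are assumptions made up for this sketch; they are not part of the Spark API and the counts are not those of the real README.

```python
from itertools import chain

# Hypothetical stand-in for the lines of README.md.
lines = [
    "# Apache Spark",
    "Spark is a fast engine",
    "a fast and general engine",
]

# flatMap: split every line and flatten the results into one list of words.
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]


def reduce_by_key(func, key_value_pairs):
    # reduceByKey: combine the values of equal keys with the given function.
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


word_counts = reduce_by_key(lambda a, b: a + b, pairs)

# filter + count: number of lines that mention "Spark".
spark_lines = sum(1 for line in lines if "Spark" in line)
```

Unlike this sketch, Spark applies the same steps per partition and merges the per-key results across nodes.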

Advanced RDD

• Data sets can be cached in memory for repeated access.

• Data that does not fit in RAM can be stored on disk.

• The user can control partitioning to improve join performance.

• Each RDD is represented as:

• a set of partitions

• a set of dependencies on parent RDDs

• a function for computing it from its parents

• metadata about partitioning and data placement

RDD: Narrow vs Wide Dependencies

• Narrow: each parent partition has no more than one child partition.

• Can do pipelined execution (operator chaining).

• Easier recovery: only the lost partitions need to be recomputed, and they can be computed in parallel on different nodes.

• Wide: each parent partition may have multiple child partitions.

• Needs shuffling.

• During computation (on an action), parent partitions are (or at least were) materialized before the shuffle.
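The difference between the two dependency types can be sketched with lists standing in for partitions. This is a toy illustration, not how Spark schedules work; `narrow_map_values`, `wide_reduce_by_key`, `bucket_of` and the sample data are hypothetical.

```python
from collections import defaultdict

# Two parent partitions holding (key, value) pairs; the data is made up.
parent_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]


def narrow_map_values(partitions, f):
    # Narrow: child partition i depends only on parent partition i,
    # so each partition can be processed (and recomputed) on its own.
    return [[(k, f(v)) for k, v in part] for part in partitions]


def bucket_of(key, num_partitions):
    # Deterministic toy partitioner (assumption: keys are strings).
    return sum(key.encode("utf-8")) % num_partitions


def wide_reduce_by_key(partitions, f, num_partitions):
    # Wide: every parent partition may feed every child partition,
    # so records are first shuffled into buckets chosen by key.
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for k, v in part:
            shuffled[bucket_of(k, num_partitions)][k].append(v)
    # After the shuffle all values of a key sit in one bucket.
    result = []
    for bucket in shuffled:
        reduced = {}
        for k, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = f(acc, v)
            reduced[k] = acc
        result.append(sorted(reduced.items()))
    return result


doubled = narrow_map_values(parent_partitions, lambda v: v * 2)
summed = wide_reduce_by_key(doubled, lambda a, b: a + b, num_partitions=2)
```

Losing one partition of `doubled` only requires redoing one list comprehension, while losing one partition of `summed` requires input from every parent partition.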

Comparison to DSM and Map-Reduce

• Spark has an expressive API and support for Scala/Java/Python.

• Spark does efficient scheduling and recovery.

• Spark is best suited to iterative batch data-flow operations on large data sets.

• For ML and graph applications it has shown a 20x speedup by eliminating I/O and deserialization.

Spark Platform

• Spark SQL

• Provides Hive-compatible SQL access and JDBC/ODBC.

• GraphX

• Provides a flexible API for graph processing.

• Includes a variety of graph algorithms for computing PageRank, connected components, triangle count, SVD, label propagation, etc.

Spark Platform

• Spark Streaming

• Provides a flexible streaming API based on micro-batch processing.

• Includes methods for stream source definitions, transformations and window operations.

• MLlib

• Provides a set of ML algorithms for classification (logistic regression, SVM, naive Bayes), linear regression and clustering (k-means), matrix decomposition (SVD/PCA) and collaborative filtering (ALS).
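The micro-batch and window ideas listed under Spark Streaming can be illustrated with a small plain-Python sketch. It is not the Spark Streaming API: `micro_batches`, `windowed_counts` and the event data are hypothetical, and timestamps are assumed to be non-negative seconds.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    last_batch = max(int(ts // batch_interval) for ts, _ in events)
    batches = [[] for _ in range(last_batch + 1)]
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    return batches


def windowed_counts(batches, window_length):
    """For each batch, count values over the last `window_length` batches."""
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_length + 1): i + 1]
        results.append(Counter(v for batch in window for v in batch))
    return results


# Made-up event stream: (seconds since start, word).
events = [(0.2, "spark"), (0.7, "rdd"), (1.1, "spark"), (2.5, "spark"), (2.9, "rdd")]

batches = micro_batches(events, batch_interval=1.0)
counts = windowed_counts(batches, window_length=2)
```

Each entry of `counts` plays the role of one windowed result emitted per batch interval.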

Personal impressions

• The interactive shell is awesome!

• Good documentation and lots of examples; the source code being in Scala is =/

• Tons of info messages are distracting; error messages on teardown are spooky.

• MLlib lacks methods for data cleaning/transformation, model validation and exploration.

References

• Zaharia et al., 2012: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. (Paper of the week!)

• http://spark.apache.org/

• Slideshare presentations:

• http://www.slideshare.net/deanchen11/scala-bay-spark-talk?qid=1ad47f0f-2256-450c-8dba-f36a54e46e03&v=default&b=&from_search=37

• http://www.slideshare.net/cloudera/spark-webinar-92314-dl?qid=1ad47f0f-2256-450c-8dba-f36a54e46e03&v=default&b=&from_search=46

• http://www.slideshare.net/pacoid/how-spark-fits-into-the-big-data-landscape?qid=1ad47f0f-2256-450c-8dba-f36a54e46e03&v=qf1&b=&from_search=1

• http://www.slideshare.net/pacoid/brief-intro-to-apache-spark-stanford-icme?qid=1ad47f0f-2256-450c-8dba-f36a54e46e03&v=qf1&b=&from_search=2

• http://www.slideshare.net/cloudera/spark-devwebinarslides-final?qid=1ad47f0f-2256-450c-8dba-f36a54e46e03&v=qf1&b=&from_search=3

Thanks!
