Yet another intro to Apache Spark

A brief intro to Apache Spark – You eat, I talk…

Transcript of Yet another intro to Apache Spark

Page 1: Yet another intro to Apache Spark

A brief intro to Apache Spark

– You eat, I talk…

Page 2: Yet another intro to Apache Spark

Spark Framework

• Efficient data processing via in-memory RDDs.

• A rich data-flow API (Java, Scala and Python).

• An interactive shell (Scala and Python).

• Execution environment running in Local and Standalone modes, or on top of Hadoop/YARN, Apache Mesos, or Amazon EC2.

• Several extensions on top of the core engine:

• Spark SQL, Spark Streaming, MLlib and GraphX.


Page 3: Yet another intro to Apache Spark

Get It Running

$ git clone https://github.com/apache/spark

$ export JAVA_HOME=...

$ spark/sbt/sbt assembly

$ spark/sbin/start-master.sh

$ spark/sbin/start-slave.sh spark://localhost:7077
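
Once the master and a worker are up, an interactive shell can be attached to the standalone cluster. A minimal sketch, assuming the checkout above and the default master port 7077, and a Spark version whose bin/pyspark accepts spark-submit options such as --master:

$ spark/bin/pyspark --master spark://localhost:7077
>>> sc.parallelize(range(1000)).sum()
499500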


Page 4: Yet another intro to Apache Spark

Resilient Distributed Datasets (RDD)

• Immutable data collection partitioned across the nodes.

• Data-flow model with parallel transformations and actions.

• Transformations are lazy; the actual computation is done only when an action runs (see the sketch after this list).

• Recompute partitions on failure from the computation graph (lineage).

• Can be persisted to memory and/or disk for future reuse.
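
A minimal pyspark sketch of these properties (the file name is hypothetical, shell echoes omitted); nothing is read or computed until the action on the last line:

>>> lines = sc.textFile("logs.txt")                  # lazy: only records the lineage
>>> errors = lines.filter(lambda l: "ERROR" in l)    # still lazy
>>> errors.persist()                                 # mark for in-memory reuse
>>> errors.count()                                   # action: triggers the computation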


Page 5: Yet another intro to Apache Spark

Transformations and Actions

• Transformations

• filter, map, flatMap, group/sort/reduceByKey, distinct, union, intersection, cartesian, subtract, join, cogroup, sample

• Actions

• count, collect, reduce, take, takeSample, foreach, first, saveAsTextFile

• Persistence

• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. (see the sketch below).
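
For an RDD that is reused across several actions, an explicit storage level can be requested. A minimal sketch (made-up data, shell echoes omitted):

>>> from pyspark import StorageLevel
>>> pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
>>> sums = pairs.reduceByKey(lambda a, b: a + b)      # transformation: lazy
>>> sums.persist(StorageLevel.MEMORY_AND_DISK)        # keep in RAM, spill to disk if needed
>>> sorted(sums.collect())                            # action: computes and caches the RDD
[('a', 4), ('b', 2)]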


Page 6: Yet another intro to Apache Spark

Hello World! (pyspark)

>>> file = sc.textFile(".../spark/README.md")

>>> file.first()

u'# Apache Spark'

>>> file.filter(lambda line: "Spark" in line).count()

19

>>> wordCounts = file.flatMap(lambda line: line.split()) \
...                  .map(lambda word: (word, 1)) \
...                  .reduceByKey(lambda a, b: a + b)
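
wordCounts is still just a recipe at this point; an action is what triggers the computation. A possible continuation of the session (hypothetical, output not shown) would pull the three most frequent words:

>>> wordCounts.takeOrdered(3, key=lambda kv: -kv[1])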


Page 7: Yet another intro to Apache Spark

Advanced RDD

• Data sets can be cached in memory for repeated access.

• Data that does not fit in RAM can be stored on disk.

• The user can control the partitioning for better join performance (see the sketch after this list).

• Each RDD is represented as

• a set of partitions

• a set of dependencies on parent RDDs

• a function for computing it from its parents

• metadata about partitioning and data placement
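
A minimal sketch of user-controlled partitioning (the RDDs and the partition count are made up for illustration): pre-partitioning a pair RDD and persisting it lets repeated joins against it reuse that partitioning instead of re-shuffling it every time.

>>> users = sc.parallelize([(i, "user%d" % i) for i in range(1000)])
>>> users = users.partitionBy(100).persist()      # hash-partition by key, keep in memory
>>> events = sc.parallelize([(i % 1000, "click") for i in range(10000)])
>>> users.join(events).count()                    # the join can reuse users' partitioning
10000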


Page 8: Yet another intro to Apache Spark

RDD: Narrow vs Wide Dependencies

• Narrow: each parent partition has no more than one child partition.

• Can do pipelined execution (operator chaining).

• Easier recovery: only the lost partitions need to be recomputed, and they can be recomputed in parallel on different nodes.

• Wide: a parent partition can feed multiple child partitions.

• Needs shuffling.

• During computation (when an action runs) the parent partitions are materialized before the shuffle (see the sketch below).
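
The shuffle boundary shows up in an RDD's lineage. A minimal sketch (the exact format of the printed lineage varies by version, so it is not shown here; it typically includes a shuffled stage):

>>> pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))   # narrow: map
>>> sums = pairs.reduceByKey(lambda a, b: a + b)                    # wide: needs a shuffle
>>> print(sums.toDebugString())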


Page 9: Yet another intro to Apache Spark

Comparison to DSM (Distributed Shared Memory) and Map-Reduce

• Spark has an expressive API and support for Scala/Java/Python.

• Spark does efficient scheduling and recovery.

• Spark is best suited for iterative batch data-flow operations on large data sets.

• For ML and graph applications it has shown a 20x speedup, due to the elimination of disk I/O and deserialization.


Page 10: Yet another intro to Apache Spark

Spark Platform

• Spark SQL

• Provides Hive-compatible SQL access and JDBC/ODBC (see the sketch after this list).

• GraphX

• Provides a flexible API for graph processing.

• Includes a variety of graph algorithms for computing PageRank, connected components, triangle count, SVD, label propagation, etc.
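
A minimal Spark SQL sketch in pyspark (assumes a Spark version with the DataFrame API and an SQLContext; names and data are made up; GraphX has no Python API, so it is not shown):

>>> from pyspark.sql import SQLContext, Row
>>> sqlContext = SQLContext(sc)
>>> people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=45)])
>>> df = sqlContext.createDataFrame(people)
>>> df.registerTempTable("people")
>>> sqlContext.sql("SELECT name FROM people WHERE age > 40").collect()
[Row(name=u'Bob')]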


Page 11: Yet another intro to Apache Spark

Spark Platform

• Spark Streaming

• Provides a flexible streaming API based on micro-batch processing.

• Includes methods for stream source definitions, transformations and window operations.

• MLlib

• Provides a set of ML algorithms for classification (logistic regression, SVM, naive Bayes), linear regression, clustering (k-means), matrix decomposition (SVD/PCA) and collaborative filtering (ALS); a small example follows.
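
A minimal MLlib sketch using k-means (the points are made up for illustration):

>>> from pyspark.mllib.clustering import KMeans
>>> points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
>>> model = KMeans.train(points, k=2, maxIterations=10)
>>> model.predict([0.0, 0.1]) == model.predict([0.1, 0.0])
True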


Page 12: Yet another intro to Apache Spark

Personal impressions

• The interactive shell is awesome!

• Good documentation and lots of examples; the source code is in Scala =/

• Tons of info messages are distracting, and error messages on teardown are spooky.

• MLlib lacks methods for data cleaning/transformation, model validation and exploration.


Page 14: Yet another intro to Apache Spark

Thanks!