Intro to Apache Spark


Transcript of Intro to Apache Spark

Page 1: Intro to Apache Spark

© Cloudera, Inc. All rights reserved.

Intro to Apache Spark
Anand Iyer, Senior Product Manager, Cloudera

Page 2: Intro to Apache Spark


Target Audience

• New to Spark, or have very rudimentary knowledge of Spark
• Have basic knowledge of MapReduce

If you are an advanced Spark developer, you are unlikely to get much out of this talk:
• No performance-tuning or debugging tips

Page 3: Intro to Apache Spark


Spark: Easy and Fast Big Data

• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell

• Fast to Run
  • General execution graphs
  • In-memory caching

Page 4: Intro to Apache Spark

Easy-to-code API

Page 5: Intro to Apache Spark

RDD: Resilient Distributed Datasets
An abstraction to represent the large distributed sets of data being processed.

RDDs are:
• Broken up into partitions, which are distributed across nodes
  • In practice, RDDs usually have between 100 and 10K partitions
• Operated upon in parallel, partition by partition
• Immutable
• Fault-tolerant via the concept of lineage
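To make the partitioning idea concrete, here is a minimal, Spark-free sketch in plain Python. The round-robin scheme and partition count are illustrative assumptions, not Spark's actual partitioner:

```python
# Illustrative only: split a dataset into N partitions round-robin,
# the way an RDD's records are spread across nodes.
def partition(data, num_partitions):
    return [data[i::num_partitions] for i in range(num_partitions)]

parts = partition(list(range(10)), 3)
# Each partition can now be processed in parallel on a different node.
```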

Page 6: Intro to Apache Spark

Spark jobs are DAGs of operations on RDDs

Operations on RDDs:
• Transformations: create a new RDD from existing RDDs
• Actions: run a computation on an RDD and return values to the driver

[Diagram: a DAG of RDDs A through G, combined via map, filter, groupBy, and join transformations, ending in a take action]
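The transformation/action split can be mimicked with Python's own lazy iterators. This is only a local analogue (no cluster involved), but it shows the key behavior: transformations build up a plan, and nothing runs until an action forces it.

```python
data = range(1, 6)

# "Transformations": lazily describe the computation; nothing runs yet
doubled = map(lambda x: x * 2, data)
big = filter(lambda x: x > 4, doubled)

# "Action": forcing the iterator finally executes the whole chain
result = list(big)
```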

Page 7: Intro to Apache Spark


Rich Expressive API

• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...

Page 8: Intro to Apache Spark


Example: Logistic Regression

sc = SparkContext(...)
rawData = sc.textFile("hdfs://...")
data = rawData.map(parserFunc).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
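The same style of gradient update can be run locally without Spark. The sketch below uses the standard logistic-loss gradient for labels in {-1, +1} on a tiny hand-made dataset; the data, learning rate, and iteration count are all illustrative:

```python
import math

# toy linearly separable points with labels in {-1, +1}
X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
Y = [1.0, 1.0, -1.0, -1.0]

w = [0.0, 0.0]
for _ in range(100):
    # per-point gradient of the logistic loss (the "map" step),
    # summed over the dataset (the "reduce" step)
    g = [0.0, 0.0]
    for (x1, x2), y in zip(X, Y):
        margin = y * (w[0] * x1 + w[1] * x2)
        coef = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        g[0] += coef * x1
        g[1] += coef * x2
    w[0] -= 0.1 * g[0]
    w[1] -= 0.1 * g[1]

# the learned separator classifies the training points correctly
preds = [1.0 if w[0] * x1 + w[1] * x2 > 0 else -1.0 for x1, x2 in X]
```

In the Spark version, the inner loop over points becomes a `map` over a cached RDD and the sum becomes a `reduce`; only the small weight vector travels back to the driver each iteration.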

Page 9: Intro to Apache Spark


Execution model and Spark Internals

Page 10: Intro to Apache Spark


Driver & Executors

• Driver: master node
  • One Driver per Spark app
  • Runs the main(...) function of your app

• Executors: worker nodes

Page 11: Intro to Apache Spark


Logical graph to physical execution plan

[Diagram: the logical DAG of RDDs A through G (map, filter, groupBy, join, take) broken into stages; cached partitions are highlighted]

• The execution graph is broken into Stages
  • Each Stage consists of multiple Tasks
• A Task is the unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split where data needs to be "shuffled"
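Pipelining within a stage can be sketched in plain Python: narrow operations like map and filter are fused into a single pass over each partition, so no intermediate dataset is materialized between them. This is a simplified illustration, not Spark's actual scheduler:

```python
def pipelined(partition):
    # map and filter are fused: each record flows through both
    # operations before the next record is touched
    for x in partition:
        y = x * 2          # map
        if y > 4:          # filter, pipelined with the map above
            yield y

result = list(pipelined([1, 2, 3, 4]))
```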

Page 12: Intro to Apache Spark

Shuffle

• Redistributes data among partitions
  • Triggered by operations like reduce, groupBy, join
• Hash keys to buckets
  • Identical to the MapReduce shuffle
• Shuffle entails writes to disk
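The "hash keys to buckets" step can be sketched as follows. This is a simplified stand-in for Spark's hash partitioning, and the bucket count is arbitrary:

```python
def bucket_for(key, num_buckets):
    # every node applies the same deterministic rule, so all records
    # with the same key land in the same output partition
    return hash(key) % num_buckets

records = [("apple", 1), ("banana", 2), ("apple", 3)]
buckets = {}
for key, value in records:
    buckets.setdefault(bucket_for(key, 4), []).append((key, value))
```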

Page 13: Intro to Apache Spark


Spark WebUI lets you visualize DAG

Page 14: Intro to Apache Spark


Drivers & Executors revisited

• Driver
  • One Driver per Spark app
  • Runs the main(...) function of your app
  • Creates the logical DAG and physical execution plan
  • Schedules Tasks
  • Receives and collects the results of Actions

• Executors
  • Hold RDD partitions
  • Execute Tasks as scheduled by the Driver
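A loose local analogy for the driver/executor split, using a thread pool as the "executors". This is purely illustrative; real executors are separate processes on worker nodes:

```python
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2], [3, 4], [5, 6]]

def task(partition):
    # one Task per partition, run on a "worker"
    return sum(x * x for x in partition)

# the "driver" schedules tasks and collects the results of the action
with ThreadPoolExecutor(max_workers=3) as workers:
    partials = list(workers.map(task, partitions))
total = sum(partials)
```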

Page 15: Intro to Apache Spark


Spark runs on Cluster Managers

• Spark does not manage the cluster of machines
• Runs on YARN, Mesos, or Standalone (a cluster manager built specifically for Spark)

Page 16: Intro to Apache Spark


Why is Spark Fast?

Page 17: Intro to Apache Spark


Memory management leads to greater performance

Trends:
• ½ price every 18 months
• 2x bandwidth every 3 years

Typical server today:
• 128-384 GB memory
• 12-24 cores
• 50 GB per sec memory bandwidth

Memory can be an enabler for high-performance big data applications

Page 18: Intro to Apache Spark


Persisting or Caching RDDs

• If an RDD will be re-used, persist it to prevent re-computation
  • Very common in iterative algorithms
• By default, cached RDDs are held in memory
• But memory may not suffice
  • MEMORY_AND_DISK persistence: spill the partitions that don't fit in memory to disk
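Why caching matters can be shown with a tiny local sketch: each action re-runs the whole lineage unless the intermediate result is materialized. The counter and parse function here are made up for illustration:

```python
calls = {"n": 0}

def parse(x):          # stands in for an expensive transformation
    calls["n"] += 1
    return x * 2

raw = [1, 2, 3]

# "uncached": two actions each re-run the transformation
total = sum(parse(x) for x in raw)
count = sum(1 for x in raw if parse(x) > 0)
uncached_calls = calls["n"]        # parsed twice per element

# "cached": materialize once, then both actions reuse it
calls["n"] = 0
cached = [parse(x) for x in raw]
total, count = sum(cached), len(cached)
cached_calls = calls["n"]          # parsed once per element
```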

Page 19: Intro to Apache Spark


Lineage for Fault-Tolerance

[Diagram: the DAG of RDDs A through G; a lost partition can be recomputed by replaying the lineage of map, filter, groupBy, and join operations that produced it]
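A minimal sketch of the lineage idea: each "RDD" records only its parent and the function that derives it, so a lost result can always be recomputed by replaying that chain. This is a toy model, not Spark's implementation:

```python
class TinyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent, self.fn, self.data = parent, fn, data

    def map(self, fn):
        # a transformation records lineage; nothing is computed here
        return TinyRDD(parent=self, fn=fn)

    def compute(self):
        # recompute from the lineage chain unless already materialized
        if self.data is not None:
            return self.data
        return [self.fn(x) for x in self.parent.compute()]

base = TinyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 10)
result = derived.compute()   # recoverable even if never stored
```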


Page 24: Intro to Apache Spark

Lineage Truncation

[Diagram: the same DAG of RDDs, with persisted or shuffle-materialized RDDs marked]

Lineage gets truncated at an RDD when:
• The RDD is persisted to memory or disk
• The RDD is already materialized on disk due to a shuffle


Page 29: Intro to Apache Spark


Summary of what makes Spark fast

• Maximizes use of memory
  • Re-used RDDs can be explicitly cached to prevent re-computation
• Leverages lineage and pipelining to minimize writing intermediate data to disk
• Efficient Task scheduler
  • Ensures worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
  • Compact binary in-memory data representation, etc.
  • More details in subsequent slides

Page 30: Intro to Apache Spark


Spark will replace MapReduce
To become the standard execution engine for Hadoop

Page 31: Intro to Apache Spark


Spark Streaming

Page 32: Intro to Apache Spark


Spark Streaming

• Incoming data is represented as DStreams (Discretized Streams)
  • Data is commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)

Page 33: Intro to Apache Spark


Discretized Stream

• The incoming data stream is broken down into micro-batches
  • Micro-batch size is user defined, usually 0.3 to 1 second
  • Micro-batches are disjoint
• Each micro-batch is an RDD
  • Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput
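Discretization can be sketched locally by grouping timestamped events into disjoint micro-batches by arrival time. The event format and batch size are illustrative:

```python
def micro_batches(events, batch_seconds):
    # events are (timestamp, value) pairs; each batch covers one
    # disjoint [k*batch, (k+1)*batch) interval, like one RDD
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // batch_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
batches = micro_batches(events, 1.0)
```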

Page 34: Intro to Apache Spark


Windowed DStreams

• Defined by specifying a window size and a step size
  • Both are multiples of the micro-batch size
• Operations are invoked on each window's data
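Windowing over micro-batches can be sketched as a sliding slice, where both the window size and the step are counted in micro-batches (the sizes here are arbitrary):

```python
def windows(batches, window_size, step):
    # each window spans `window_size` consecutive micro-batches,
    # advancing by `step` micro-batches at a time
    return [batches[i:i + window_size]
            for i in range(0, len(batches) - window_size + 1, step)]

batches = [[1], [2], [3], [4], [5]]
result = windows(batches, window_size=3, step=2)
```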

Page 35: Intro to Apache Spark


Maintain and update arbitrary state: updateStateByKey(...)
• Define an initial state
• Provide a state-update function
• Continuously update with new information
• State is maintained as an RDD, updated via a Transformation

Examples:
• Running count of words seen in a text stream
• Per-user session state from an activity stream

Note: requires periodic checkpointing to fault-tolerant storage, every N (~10-15) micro-batches
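The updateStateByKey pattern can be sketched locally as a fold over batches: for each key, combine that batch's new values with the previous state. Running word count is one example; the function names here are illustrative:

```python
def update_state(new_counts, prev_count):
    # state-update function: fold new per-batch counts into the
    # running total (None means the key has no prior state)
    return (prev_count or 0) + sum(new_counts)

state = {}
batches = [["spark", "spark", "hadoop"], ["spark", "flume"]]
for batch in batches:
    per_key = {}
    for word in batch:
        per_key.setdefault(word, []).append(1)
    for word, ones in per_key.items():
        state[word] = update_state(ones, state.get(word))
```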

Page 36: Intro to Apache Spark


Spark SQL & Dataframes

Page 37: Intro to Apache Spark


Dataframes

• A distributed collection of data organized as named, typed columns

• Like RDDs, they consist of partitions, can be cached, and are fault-tolerant via lineage

• Can be constructed from:
  • Structured data files: JSON, Avro, Parquet, etc.
  • Tables in Hive
  • Tables in an RDBMS
  • Existing RDDs, by programmatically applying a schema

Page 38: Intro to Apache Spark


Spark SQL

• SQL statements to process DataFrames

• Embed SQL statements in your Scala, Java, or Python Spark application

• Queries can also be issued via JDBC/ODBC
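The pattern Spark SQL enables, mixing SQL with host-language code over the same data, can be sketched with Python's built-in sqlite3 as a local stand-in (no Spark involved; the table name and schema are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ann", 34), ("bob", 17), ("cal", 52)])

# SQL handles the relational part of the job...
adults = conn.execute(
    "SELECT name FROM users WHERE age >= 18 ORDER BY name").fetchall()

# ...then back to "regular" code for everything else
names = [row[0].upper() for row in adults]
```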

Page 39: Intro to Apache Spark


Why Spark SQL? Ease of programming

• Easy to code against schema'd records

• For non-complex operations on relational data, SQL is often an easier alternative to code

• Embed SQL in your Scala, Java, or Python applications to seamlessly mix "regular" Spark for complex operations with SQL

Page 40: Intro to Apache Spark


Why Spark SQL? Performance

SQL is processed by a query optimizer, enabling automatic optimizations:

• Compressed memory format (as opposed to Java serialized objects in RDDs)
• Predicate pushdown (read less data to reduce IO)
• Optimal pipelining of operations
• Cost-based optimizer
• ...

Page 41: Intro to Apache Spark


MLlib
A collection of popular machine learning algorithms:
• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, LDA
• Recommender systems: ALS
• Dimensionality reduction: PCA and SVD
• Feature engineering: TF-IDF, Word2Vec, etc.
• Statistical functions: chi-squared test, Pearson correlation, etc.

Pipelines API: chain together feature engineering, training, and model validation into one pipeline

Page 42: Intro to Apache Spark

Thank You
And of course... we are hiring!!!