Apache Spark: A Unified Engine for Big Data Processing Presented by: Huanyi Chen

§ Engine?
  § Converts one form of data into other useful forms
§ Unified?
  § Supports multiple types of conversions

§ What is Apache Spark? (Engine)

§ How can it perform multiple types of conversions over big data? (Unified)

What is Apache Spark?

§ A framework like MapReduce

§ Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

§ An RDD is a read-only, partitioned collection of records.
§ Transformations
  § Create new RDDs from existing ones (map, filter, join, etc.)
§ Actions
  § Return a value to the application
  § Or export data to a storage system
§ Persistence
  § Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
§ Partitioning
  § Users can ask that an RDD’s elements be partitioned across machines based on a key in each record.
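The RDD operations listed above can be illustrated with a small single-process model. This is only a sketch: `ToyRDD`, `parallelize`, and `partition_by_key` are hypothetical names written for this example and are not the Spark API, and the "partitions" are plain Python lists rather than data spread across machines. What it does show is that transformations only record work, actions trigger it, and a persisted RDD is computed once and then reused.

```python
class ToyRDD:
    """Toy, in-process stand-in for an RDD (an assumption, not Spark's API)."""

    def __init__(self, compute, num_partitions):
        self._compute = compute            # function: partition index -> list of records
        self.num_partitions = num_partitions
        self._persist = False
        self._cache = {}                   # partition index -> computed records

    @staticmethod
    def parallelize(records, num_partitions=2):
        records = list(records)
        # Round-robin split of the input into partitions.
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda i: list(parts[i]), num_partitions)

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, f):
        return ToyRDD(lambda i: [f(x) for x in self._partition(i)],
                      self.num_partitions)

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self._partition(i) if pred(x)],
                      self.num_partitions)

    # Persistence: remember computed partitions for later reuse.
    def persist(self):
        self._persist = True
        return self

    def _partition(self, i):
        if not self._persist:
            return self._compute(i)
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    # Actions: force evaluation and return a value to the caller.
    def collect(self):
        return [x for i in range(self.num_partitions) for x in self._partition(i)]

    def count(self):
        return len(self.collect())


def partition_by_key(pairs, num_partitions, partitioner=hash):
    """Place each (key, value) record in the partition chosen by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partitioner(key) % num_partitions].append((key, value))
    return parts


calls = []                                 # records every time square() really runs

def square(x):
    calls.append(x)
    return x * x

nums = ToyRDD.parallelize(range(1, 7), num_partitions=2)
squares = nums.map(square).persist()       # transformation + persistence hint
evens = squares.filter(lambda x: x % 2 == 0)
computed_before_action = len(calls)        # still 0: nothing has run yet

result = evens.collect()                   # action: triggers the computation
total = squares.count()                    # action: served from the persisted data
```

After both actions, `square` has run exactly once per input record, because the second action reads the persisted partitions instead of recomputing them.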

§ Lineage
  § An RDD has enough information about how it was derived from other datasets to recompute its partitions.
§ Narrow dependencies
  § Each partition of the parent RDD is used by at most one partition of the child RDD
§ Wide dependencies
  § Each partition of the parent RDD may be used by multiple child partitions
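The difference between the two dependency types, and why lineage makes recovery cheap for narrow ones, can be sketched in plain Python. The helper names `map_partition` and `group_by_key_partition` are hypothetical and the modulo partitioner assumes integer keys; this is a toy model of the idea, not how Spark implements it.

```python
def map_partition(f, parent_parts, i):
    """Narrow dependency: child partition i reads only parent partition i."""
    return [f(record) for record in parent_parts[i]]


def group_by_key_partition(parent_parts, i, num_child_parts):
    """Wide dependency: child partition i must scan every parent partition."""
    groups = {}
    for part in parent_parts:
        for key, value in part:
            # Toy partitioner (assumes integer keys).
            if key % num_child_parts == i:
                groups.setdefault(key, []).append(value)
    return sorted(groups.items())


def to_upper(pair):
    key, value = pair
    return (key, value.upper())


# Parent dataset with two partitions of (key, value) records.
parent = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

# Narrow: each child partition is built from exactly one parent partition.
upper = [map_partition(to_upper, parent, i) for i in range(len(parent))]

# Simulate losing child partition 1, then rebuild it from lineage.
# Only parent partition 1 is read; partition 0 is untouched.
upper[1] = None
upper[1] = map_partition(to_upper, parent, 1)

# Wide: every child partition needs records from all parent partitions,
# so rebuilding one child partition means rereading the whole parent.
grouped = [group_by_key_partition(parent, i, 2) for i in range(2)]
```

Here key `1` appears in both parent partitions, so the child partition that owns it cannot be produced without looking at both.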

Example: run an action on RDD G

MapReduce vs Spark

[Figure: MapReduce ecosystem compared with Spark ecosystem]

Higher-Level Libraries

SQL and DataFrames

§ DataFrames = RDDs + Schema = Tables

§ Spark SQL’s DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.
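To make "records plus a schema" and "inline UDF" concrete, here is a minimal pure-Python sketch. `ToyDataFrame`, `with_column`, and `select` are hypothetical names for this illustration only; in PySpark the corresponding real calls are `pyspark.sql.functions.udf` and `DataFrame.withColumn`, which this toy does not reproduce.

```python
class ToyDataFrame:
    """Toy table: a list of row tuples plus column names (not Spark's DataFrame)."""

    def __init__(self, schema, rows):
        self.schema = list(schema)
        self.rows = [tuple(row) for row in rows]
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row does not match schema")

    def with_column(self, name, udf, *input_cols):
        """Append a column computed by an ordinary Python function (the 'UDF')."""
        idx = [self.schema.index(col) for col in input_cols]
        new_rows = [row + (udf(*(row[i] for i in idx)),) for row in self.rows]
        return ToyDataFrame(self.schema + [name], new_rows)

    def select(self, *cols):
        """Keep only the named columns, in the order given."""
        idx = [self.schema.index(col) for col in cols]
        return ToyDataFrame(cols, [tuple(row[i] for i in idx) for row in self.rows])


people = ToyDataFrame(["name", "age"], [("Ann", 34), ("Bob", 17)])

# The UDF is defined inline as a lambda: no packaging or registration step.
adults = people.with_column("is_adult", lambda age: age >= 18, "age")
summary = adults.select("name", "is_adult")
```

The schema is what lets the function be addressed to a column by name (`"age"`) instead of by tuple position.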

UDF in MySQL

UDF in Spark SQL

Spark Streaming

[Figure: continuous operator processing model compared with discretized stream processing model]
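The discretized stream model treats a stream as a sequence of small batches, each processed with an ordinary batch computation. The sketch below models that idea in plain Python; `discretize` and `word_count` are hypothetical helpers, the timestamps are made up for illustration, and none of this is the Spark Streaming API.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals that received no events are skipped in this toy version.
    """
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[batch_id] for batch_id in sorted(batches)]


def word_count(batch):
    """An ordinary batch computation, applied to one small batch at a time."""
    return dict(Counter(batch))


# Illustrative events: (arrival time in seconds, word).
events = [(0.2, "spark"), (0.7, "rdd"), (1.1, "spark"), (1.9, "spark"), (2.5, "rdd")]

batches = discretize(events, batch_interval=1.0)
counts_per_batch = [word_count(batch) for batch in batches]
```

Because each batch is just a small dataset, the same batch logic can be reused for streaming input, which is the point of the discretized model.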

GraphX

§ On graph computation alone, GraphX does not beat specialized graph-parallel systems

§ But it outperforms them across a full graph analytics pipeline

MLlib

§ More than 50 common algorithms for distributed model training

§ Supports pipeline construction on Spark

§ Integrates well with other Spark libraries

Why use Apache Spark?

§ Ecosystem

§ Competitive performance

§ Low cost of sharing data

§ Low latency of MapReduce steps

§ Control over bottleneck resources

Apache Spark in 2016

§ Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.

§ From 2010 to 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.

Apache Spark Today

§ What is Apache Spark?
  § Apache Spark = MapReduce + RDDs
§ How can it perform multiple types of conversions over big data?
  § Higher-level libraries enable Apache Spark to handle different types of big data workloads

“Try Apache Spark if you are new to the big data processing world”

Huanyi Chen

Q&A

§ What issues arise from persisting data in memory? For example, garbage collection?
§ What are the Parallel Random Access Machine model and the Bulk Synchronous Parallel model? Can these two models describe any computation in the distributed world?
§ Will optimizing one library cause other libraries to lose performance?
§ Is using memory as storage really the next generation of storage?
