with Apache Spark Getting Started · 2020-06-07 · Apache Spark API’s Introducing Delta Lake A...

Getting Started with Apache Spark

Welcome and Housekeeping

● You should have received instructions on how to participate in the training session

● If you have questions, you can use the Q&A window in GoToWebinar

● The slides and a recording of the session will be made available to you after the event

About Your Instructor

Doug Bateman is Director of Training and Education at Databricks. Prior to this role he was Director of Training at NewCircle.

Apache Spark - Genesis and Open Source

Spark was originally created at the AMPLab at UC Berkeley. The original creators went on to found Databricks.

Spark was created to bring data and machine learning together.

Spark was donated to the Apache Software Foundation to create the Apache Spark open source project.

VISION

Accelerate innovation by unifying data science, engineering and business

SOLUTION

Unified Analytics Platform

WHO WE ARE

● Original creators of Apache Spark

● 2000+ global companies use our platform across the big data & machine learning lifecycle

Introducing Delta Lake

A New Standard for Building Data Lakes

● Open format based on Parquet

● With transactions

● Apache Spark APIs

Apache Spark - A Unified Analytics Engine

Apache Spark

“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”

● Research project at UC Berkeley in 2009

● APIs: Scala, Java, Python, R, and SQL

● Built by more than 1,200 developers from more than 200 companies

HOW TO PROCESS LOTS OF DATA?

M&Ms

Spark Cluster

One Driver and many Executor JVMs
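
The driver/executor split can be pictured with the candy-counting idea from the M&Ms slide: one coordinator hands out piles, several workers each count their own pile, and the coordinator adds up the partial counts. The sketch below is plain Python, not Spark. Threads stand in for executor JVMs, and the names `split_into_partitions`, `count_colors`, and `driver_count` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def count_colors(partition):
    """Work done by one 'executor': count the colors in its own pile."""
    return Counter(partition)


def driver_count(items, num_executors=4):
    """The 'driver': split the work, hand it out, merge the partial results."""
    partitions = split_into_partitions(items, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_counts = list(pool.map(count_colors, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # add each executor's counts to the running total
    return total


candies = ["red", "blue", "green", "red", "yellow", "blue", "red", "green"]
totals = driver_count(candies, num_executors=3)
print(dict(totals))
```

The result does not depend on how many workers are used; only the amount of work each one does changes, which is the property a cluster relies on when scaling out.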

Spark APIs

● RDD

● DataFrame

● Dataset

RDD

Resilient: Fault-tolerant

Distributed: Computed across multiple nodes

Dataset: Collection of partitioned data

● Immutable once constructed

● Track lineage information

● Operations on collection of elements in parallel

Transformations and Actions

Transformations    Actions
Filter             Count
Sample             Take
Union              Collect
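
The difference between the two columns is when work happens: a transformation only describes a new dataset, while an action actually runs the computation and returns a result. The toy class below illustrates that idea in plain Python. It is not Spark's implementation; `MiniRDD` and its methods are hypothetical stand-ins that borrow the names from the slide, and everything runs in a single process.

```python
from itertools import islice


class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazy, and aware of its lineage."""

    def __init__(self, compute, lineage):
        self._compute = compute        # zero-argument function returning an iterator
        self.lineage = tuple(lineage)  # names of the steps that lead to this dataset

    @classmethod
    def from_list(cls, data):
        frozen = tuple(data)           # copy, so later changes to `data` have no effect
        return cls(lambda: iter(frozen), ["from_list"])

    # Transformations: return a new MiniRDD and do no work yet.
    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._compute() if predicate(x)),
                       self.lineage + ("filter",))

    def union(self, other):
        def compute():
            yield from self._compute()
            yield from other._compute()
        return MiniRDD(compute, self.lineage + ("union",))

    # Actions: run the recorded steps and return a result to the caller.
    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(islice(self._compute(), n))

    def collect(self):
        return list(self._compute())


calls = []


def is_even(x):
    calls.append(x)                    # record each evaluation to make laziness visible
    return x % 2 == 0


numbers = MiniRDD.from_list(range(10))
evens = numbers.filter(is_even)        # transformation: the predicate has not run
evaluated_before_action = len(calls)   # still 0 at this point
both = evens.union(MiniRDD.from_list([100, 102]))
first_three = both.take(3)             # action: evaluates only as much as it needs
everything = both.collect()            # action: recomputes from the recorded steps
print(both.lineage, first_three, everything)
```

In this toy, every action re-runs the recorded steps from the start, which is also why keeping the lineage is enough to rebuild a lost result.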

DataFrame

Data with columns (built on RDDs)

Improved performance via optimizations

Datasets

DataFrame vs. Dataset

DATAFRAMES

Why Switch to DataFrames?

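# Assumes `sc` is the SparkContext provided by a notebook or the pyspark shell.
from pyspark.sql.functions import avg

dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])

# RDD: average age per name, computed from (sum, count) pairs
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# DataFrame: the same aggregation
dataDF = dataRDD.toDF(["name", "age"])
dataDF.groupBy("name").agg(avg("age"))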

● User-friendly API
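
To see what the RDD version has to spell out by hand, the same three steps can be traced in plain Python on the slide's data: pair each age with a count of 1, add the pairs per name, then divide. This is a local sketch with no Spark involved; `reduce_by_key` is a hypothetical helper that mimics the per-key combining step on a single machine.

```python
def reduce_by_key(pairs, combine):
    """Combine all values that share a key, one pair at a time."""
    combined = {}
    for key, value in pairs:
        combined[key] = combine(combined[key], value) if key in combined else value
    return list(combined.items())


data = [("Jim", 20), ("Anne", 31), ("Jim", 30)]

# Step 1 (map): attach a count of 1 to every age.
with_counts = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): add ages and counts per name.
sums = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 3 (map): divide the summed age by the count.
averages = {name: total / count for name, (total, count) in sums}
print(averages)  # {'Jim': 25.0, 'Anne': 31.0}
```

The DataFrame version expresses all three steps as a single `groupBy(...).agg(avg(...))`, which is the point of the "user-friendly API" bullet.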

Why Switch to DataFrames?

● User-friendly API

Benefits:

■ SQL/DataFrame queries

■ Tungsten and Catalyst optimizations

■ Uniform APIs across languages

Why Switch to DataFrames?

Wrapper to create logical plan
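
One way to picture "a wrapper that creates a logical plan" is an object that stores a description of the query rather than any data, so the plan can be inspected and rewritten before anything runs. The sketch below is a plain-Python toy under that assumption. `PlanFrame` and `merge_filters` are hypothetical names, and the single rewrite rule shown here only hints at the kind of work an optimizer does; it is not a description of Catalyst's actual rules.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PlanFrame:
    """Toy DataFrame-like wrapper: holds a plan (a tuple of steps), not data."""
    steps: Tuple[tuple, ...] = ()

    def filter(self, description: str, predicate: Callable[[dict], bool]):
        return PlanFrame(self.steps + (("filter", description, predicate),))

    def select(self, *columns: str):
        return PlanFrame(self.steps + (("select", columns),))

    def explain(self):
        return [step[:2] for step in self.steps]  # omit the predicate functions

    def run(self, rows):
        """Execute the plan over a list of dict rows."""
        for step in self.steps:
            if step[0] == "filter":
                rows = [r for r in rows if step[2](r)]
            else:
                rows = [{c: r[c] for c in step[1]} for r in rows]
        return rows


def merge_filters(plan: PlanFrame) -> PlanFrame:
    """Toy rewrite rule: fuse consecutive filters into a single filter step."""
    merged = []
    for step in plan.steps:
        if step[0] == "filter" and merged and merged[-1][0] == "filter":
            _, prev_desc, prev_pred = merged[-1]
            _, desc, pred = step
            merged[-1] = ("filter", f"({prev_desc}) AND ({desc})",
                          lambda r, a=prev_pred, b=pred: a(r) and b(r))
        else:
            merged.append(step)
    return PlanFrame(tuple(merged))


people = [{"name": "Jim", "age": 20}, {"name": "Anne", "age": 31}, {"name": "Jim", "age": 30}]
query = (PlanFrame()
         .filter("age > 25", lambda r: r["age"] > 25)
         .filter("name = 'Jim'", lambda r: r["name"] == "Jim")
         .select("name", "age"))
optimized = merge_filters(query)
print(optimized.explain())
print(optimized.run(people))
```

Because the query is only a description until `run` is called, the rewritten plan returns the same rows as the original while making one pass over the data for both conditions.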

Catalyst: Under the Hood

Still Not Convinced?

Structured APIs in Spark

WHY SWITCH FROM MAPREDUCE TO SPARK?

Spark vs. MapReduce

When to Use Spark

● Scale out: Model or data too large to process on a single machine

● Speed up: Benefit from faster results

Questions?

Further Training Options: http://bit.ly/DBTrng

● Live Onsite Training

● Live Online

● Self-Paced

Meet one of our Spark experts: http://bit.ly/ContactUsDB