Post on 21-May-2020
Intro to PySpark Workshop
Garren Staubli
Sr. Data Engineer
@gstaubli
Resources: garrens.com/pyspark124 | #PySparkWorkshop
Working with Spark since 2015
• Batch analytics in Spark + Hive, Pig, and Hadoop MapReduce
• Real-time big data reporting using Spark/Impala/CDH
• Spark Structured Streaming + ML apps for real-time decision making
Do I know what I’m talking about?
50+ answers for Spark
Main Points
• Apache Spark
• Sample App Walkthrough
• Interactive Azure Jupyter Notebook
• Python-specific Spark advice
• Resources to continue learning
About Apache Spark
Structured | Spark.ML | GraphFrames
• Lazily evaluated (transformations vs. actions)
• Immutable
Spark application (Driver):

    spark = SparkSession.builder \
        .appName(name="PySpark Intro") \
        .master("local[*]") \
        .getOrCreate()
[Architecture diagram: the SparkSession in the Driver connects to the Master (Cluster Manager), which assigns work to Slave (Worker) nodes; each Worker runs an Executor, and each Executor runs Tasks. See the Spark documentation for the detailed architecture.]
About Apache Spark | Spark SQL
Spark SQL is not about SQL.
Spark SQL is about more than SQL.
About Apache Spark | 2 Kinds of Actions
About Apache Spark | Modern vs Legacy
About Apache Spark | Modern Optimization
About Apache Spark | Planning
Walkthrough | Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName(name="PySpark Intro") \
    .master("local[*]") \
    .getOrCreate()
Cluster manager options: local, standalone, YARN, Mesos, and Kubernetes
Walkthrough | Read CSV into DataFrame
green_trips = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("green_tripdata_2017-06.csv")
inferSchema = "true" forces an eager pass over the data to infer column types; the default is "false"
Walkthrough | Behind the Scenes: UI
Walkthrough | DataFrame Schema
green_trips.printSchema()
• Eagerly evaluated when inferSchema = true
• Lazily evaluated when inferSchema = false
You guessed it… We’re hiring!