Programming in Spark using PySpark

Mostafa Elzoghbi
Sr. Technical Evangelist, Microsoft

@MostafaElzoghbi
http://mostafa.rocks

Session Objectives & Takeaways
• Programming Spark
• Spark Program Structure
• Working with RDDs
• Transformations versus Actions
• Lambda, Shared Variables (Broadcast vs accumulators)
• Visualizing big data in Spark
• Spark in the cloud (Azure)
• Working with cluster types, notebooks, scaling

Python Spark (pySpark)
• We are using the Python programming interface to Spark (pySpark)
• pySpark provides an easy-to-use programming abstraction and parallel runtime: “Here’s an operation, run it on all of the data”

• RDDs are the key concept

Apache Spark Driver and Workers
• A Spark program is two programs: a driver program and a worker program
• Worker programs run on cluster nodes or in local threads
• RDDs (Resilient Distributed Datasets) are distributed across the workers

Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of cluster to use

Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» The pySpark shell and Databricks cloud automatically create the sc variable
» IPython and standalone programs must use a constructor to create a new SparkContext
• Use the SparkContext to create RDDs
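Outside the pySpark shell, the context has to be built by hand. The following is a minimal sketch, assuming PySpark is installed. The helper names local_master and make_context and the default app name are hypothetical, chosen only for illustration. The master strings "local" and "local[K]" select local threads; a standalone cluster would instead use a URL of the form spark://HOST:PORT.

```python
def local_master(threads=None):
    """Build a master string for local mode: 'local' or 'local[K]'."""
    return "local" if threads is None else "local[{}]".format(threads)


def make_context(master, app_name="pyspark-demo"):
    """Create a SparkContext for the given master.

    app_name is an illustrative default, not something the slides specify.
    """
    # Imported here so local_master() stays usable on machines without PySpark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    return SparkContext(conf=conf)
```

For example, sc = make_context(local_master(4)) would run the workers in four local threads; calling sc.stop() releases the context when the program is done.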

Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
• You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming existing RDDs
» from files in HDFS or any other storage system
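The three construction routes can be sketched as follows, assuming an existing SparkContext named sc (as the pySpark shell provides). The function names and the path argument are placeholders for illustration, not part of the original slides.

```python
def double(x):
    """Element-wise function applied by the map transformation below."""
    return x * 2


def build_rdds(sc, path):
    """Build one RDD per construction route; `path` is a placeholder file path."""
    from_list = sc.parallelize([1, 2, 3, 4])  # from an existing Python list
    from_rdd = from_list.map(double)          # by transforming an existing RDD
    from_file = sc.textFile(path)             # from a file: one element per line
    return from_list, from_rdd, from_file
```

None of these calls reads or computes any data yet; they only describe the datasets.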

RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or on disk

Creating an RDD
• Create RDDs from Python collections (lists)
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, and a directory or glob wildcard: /data/201404*

Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count
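A short sketch of that create, transform, act sequence, assuming an existing SparkContext named sc. The helper names and the minimum word length are illustrative choices.

```python
def is_long(word, min_len=5):
    """Predicate used by filter: keep words of at least min_len characters."""
    return len(word) >= min_len


def explore(sc, words):
    """Create an RDD from a list, apply two transformations, then two actions."""
    rdd = sc.parallelize(words)                    # create from a Python list
    long_words = rdd.filter(is_long)               # transformation: keep long words
    upper = long_words.map(lambda w: w.upper())    # transformation: uppercase them
    return upper.collect(), rdd.count()            # actions: fetch results, count input
```

Only collect and count trigger work on the cluster; filter and map just extend the plan.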

Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results are not computed right away; instead, Spark remembers the set of transformations applied to the base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
• Think of this as a recipe for creating the result
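Lazy evaluation can be observed without a cluster. The sketch below is a plain-Python analogy, not Spark itself: Python 3's built-in map is also lazy, so building the "recipe" runs nothing until the result is consumed, much as an RDD transformation runs nothing until an action is called. The names are illustrative.

```python
calls = []  # records every element that actually gets processed


def traced_square(x):
    """Square x and note that the computation really happened."""
    calls.append(x)
    return x * x


recipe = map(traced_square, [1, 2, 3])  # lazy: no element has been squared yet
ran_before = list(calls)                # snapshot taken before consuming: still empty

result = list(recipe)                   # consuming the iterator plays the role of an action
```

After the last line, calls holds all three inputs, while ran_before shows that nothing had run when the recipe was merely defined.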

Python lambda Functions
• Small anonymous functions (not bound to a name)
lambda a, b: a+b
» returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
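A few plain-Python uses of lambda where a function object is expected. The sample data is made up for illustration; the same lambdas could be passed to RDD methods such as map, filter, and reduce.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

doubled = list(map(lambda x: x * 2, numbers))        # apply to every element
evens = list(filter(lambda x: x % 2 == 0, numbers))  # keep elements that pass the test
total = reduce(lambda a, b: a + b, numbers)          # fold pairs of elements into one sum

# A lambda as a sort key: order words by their length.
by_length = sorted(["spark", "rdd", "python"], key=lambda w: len(w))
```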

Spark Actions
• Cause Spark to execute the recipe to transform the source
• Mechanism for getting results out of Spark

Spark Program Lifecycle
1. Create RDDs from external data or parallelize a collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse -- IMPORTANT
4. Perform actions to execute parallel computation and produce results
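The lifecycle can be sketched end to end as below, assuming an existing SparkContext named sc. The function names and the even-squares computation are illustrative, not from the slides.

```python
def is_even(n):
    return n % 2 == 0


def square(n):
    return n * n


def even_square_stats(sc, numbers):
    """Count and sum the squares of the even numbers, reusing one cached RDD."""
    rdd = sc.parallelize(numbers)               # 1. create an RDD from a collection
    squares = rdd.filter(is_even).map(square)   # 2. lazily transform into a new RDD
    squares.cache()                             # 3. mark for reuse across actions
    count = squares.count()                     # 4. first action computes and caches
    total = squares.sum()                       #    second action reuses cached data
    return count, total
```

Without the cache() call, the second action would recompute the filter and map from the original data.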

pySpark Shared Variables
• Broadcast Variables
» Efficiently send a large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes

At the driver:
>>> broadcastVar = sc.broadcast([1, 2, 3])

At a worker:
>>> broadcastVar.value
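The lookup-table use of a broadcast variable might look like this, assuming an existing SparkContext named sc. The function names and the country-code data are hypothetical.

```python
def resolve(code, table):
    """Look up a code in the table, falling back to 'unknown'."""
    return table.get(code, "unknown")


def country_names(sc, codes, table):
    """Translate codes to names using a table broadcast once to the workers."""
    lookup = sc.broadcast(table)  # read-only copy made available on every worker
    # Inside the task, .value reads the worker's local copy of the table.
    return sc.parallelize(codes).map(lambda c: resolve(c, lookup.value)).collect()
```

For example, country_names(sc, ["EG", "US", "XX"], {"EG": "Egypt", "US": "United States"}) would map the unknown code "XX" to "unknown".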

• Accumulators
» Aggregate values from workers back to the driver
» Only the driver can access the value of an accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in an RDD across workers

>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
...     global accum
...     accum += x
...
>>> rdd.foreach(f)
>>> accum.value
10
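The error-counting use of an accumulator can be sketched as follows, assuming an existing SparkContext named sc. The validity rule (a line must parse as an integer) and the function names are made up for illustration.

```python
def is_valid(line):
    """Illustrative rule: a record is valid if it parses as an integer."""
    try:
        int(line)
        return True
    except ValueError:
        return False


def count_bad_records(sc, lines):
    """Count invalid records across all workers with an accumulator."""
    bad = sc.accumulator(0)

    def check(line):
        if not is_valid(line):
            bad.add(1)  # tasks only add to the accumulator; they never read it

    sc.parallelize(lines).foreach(check)  # foreach is an action, so this runs now
    return bad.value                      # the total is read back on the driver
```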

Visualizing Big Data in the browser
• Challenges:
• Manipulating large data can take a long time
Memory: caching -> scale clusters
CPU: parallelism -> scale clusters
• We have more data points than possible pixels
> Summarize: aggregation, pivoting (more data than pixels)
> Model: clustering, classification, dimensionality reduction, etc.
> Sample: approximate (faster) and exact sampling
• Internal tools: Matplotlib, ggplot, D3, SVG, and more
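Two of these strategies, summarizing and sampling, can be sketched in plain Python. This is an illustration of the idea on an in-memory list, not the Spark API; the function names, bin count, and seed are arbitrary choices.

```python
import random


def bin_counts(values, n_bins):
    """Summarize: aggregate values into n_bins equal-width buckets (a histogram)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def sample_points(values, k, seed=0):
    """Sample: pick at most k points without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(values, min(k, len(values)))
```

Either result is small enough to plot: ten bar heights instead of a million points, or a few hundred sampled points. On an RDD, the sample and takeSample methods serve the sampling role.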

Spark Kernels and MAGIC keywords
• The PySpark kernel supports a set of %%MAGIC keywords
• It also supports the built-in IPython magics, including %%sh
• Auto visualization
• Magic keywords:
» %%sql: Spark SQL
» %%lsmagic: list all supported magic keywords (important)
» %env: set an environment variable
» %run: execute Python code
» %who: list all variables of global scope
• Run code from a different kernel in a notebook

Spark in Azure
Hadoop clusters in Azure are packaged under the “HDInsight” service

Spark in Azure
• Create clusters in a few clicks
• Apache Spark is available only on Linux-based clusters
• Multiple HDP versions
• Comes preloaded with: SSH, Hive, Oozie, DLS, VNets
• Multiple storage options:
» Azure Storage
» ADL store
• External metadata store in a SQL Server database for Hive and Oozie
• All notebooks are stored in the storage account associated with the Spark cluster
• Zeppelin notebook is available on certain Spark versions but not all

Programming Spark Apps in HDInsight
• Supports four kernels in Jupyter in HDInsight Spark clusters in Azure

Spark Apps using Jupyter

References
• Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Visualizations for Databricks: https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/15%20Visualizations/0%20Visualizations%20Overview.html
• SPARKHub by Databricks: https://sparkhub.databricks.com/resources/

Thank you
• Check out the big data articles on my blog: http://mostafa.rocks
• Follow me on Twitter: @MostafaElzoghbi
• Want some help in building cloud solutions? Contact me to learn more.
