PySpark Best Practices by Juliet Hougland

Juliet Hougland
Spark Summit Europe 2015
@j_houg

PySpark Best Practices

RDDs

(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())

[Diagram: the four partitions of an HDFS file are read into an RDD; each transformation (map, filter) produces a new RDD with the same four partitions; count() aggregates the final RDD's partitions into a single result.]

Thanks: Kostas Sakellis
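The helpers to_series and has_outlier are not defined on the slide; a minimal sketch of what they might look like (the parsing format and the outlier rule are assumptions, and sc is the SparkContext from the shell):

    import numpy as np

    def to_series(line):
        # Parse one comma-separated line of measurements into a NumPy array.
        return np.array([float(x) for x in line.split(",")])

    def has_outlier(series):
        # Illustrative rule: any point more than three standard deviations
        # from the mean counts as an outlier.
        return bool(np.any(np.abs(series - series.mean()) > 3 * series.std()))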

Spark Execution Model

PySpark Execution Model

PySpark Driver Program

(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())

Function closures need to be executed on worker nodes by a Python process.

How do we ship Python functions around?

(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())

Pickle!

https://flic.kr/p/c8N4sE

Pickle!

(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())
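PySpark pickles each function, together with the variables it closes over, on the driver and ships the bytes to the Python worker processes, which unpickle it and apply it to each partition. A rough illustration of the mechanism, not PySpark's exact code path, using PySpark's bundled cloudpickle (which, unlike the standard pickle module, can serialize lambdas and closures):

    import pickle
    from pyspark import cloudpickle  # PySpark's bundled pickler

    def make_outlier_filter(threshold):
        # A closure: has_outlier captures `threshold` from the enclosing scope.
        def has_outlier(series):
            return max(series) > threshold
        return has_outlier

    fn = make_outlier_filter(3.0)
    payload = cloudpickle.dumps(fn)   # what the driver ships to each worker
    restored = pickle.loads(payload)  # what the worker does on the other side
    print(restored([1.0, 5.0]))       # True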

Best Practices for Writing PySpark

REPLs and Notebooks

https://flic.kr/p/5hnPZp

Share your code

https://flic.kr/p/sw2cnL

Standard Python Project

my_pyspark_proj/
    awesome/
        __init__.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py

What is the shape of a PySpark job?

https://flic.kr/p/4vWP6U

PySpark Structure?

• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54

Shout out to my colleagues in the UK

PySpark Structure?

my_pyspark_proj/
    awesome/
        __init__.py
        DataIO.py
        Featurize.py
        Model.py
    bin/
    docs/
    setup.py
    tests/
        __init__.py
        awesome_tests.py
        resources/
            data_source_sample.csv

Simple Main Method
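The code for this slide did not survive the transcript. A minimal sketch of the kind of main method the job outline above suggests; the awesome.* module functions and the CLI flags are illustrative assumptions, not the talk's actual code:

    import argparse

    from pyspark import SparkConf, SparkContext

    # Hypothetical modules following the project layout above.
    from awesome import DataIO, Featurize, Model

    def main():
        # Parse CLI args & configure the Spark app.
        parser = argparse.ArgumentParser(description="Run the awesome job")
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        args = parser.parse_args()

        sc = SparkContext(conf=SparkConf().setAppName("awesome"))

        raw = DataIO.read(sc, args.input)                 # read in data
        features = Featurize.create_labeled_points(raw)   # raw data into features
        model = Model.train(features)                     # fancy maths with Spark
        DataIO.write(model, args.output)                  # write out data

    if __name__ == "__main__":
        main()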

Write Testable Code

• Write a function for anything inside a transformation
• Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py:

    class Featurize(object):

        @staticmethod
        def label(single_record):
            ...
            return label_as_a_double

        @staticmethod
        def descriptive_name_of_feature1():
            ...
            return a_double

        @staticmethod
        def create_labeled_point(data_usage_rdd, sms_usage_rdd):
            ...
            return LabeledPoint(label, [feature1])

Write Serializable Code

• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossible-ish
• DB connections => use mapPartitions instead (see the sketch below)

https://flic.kr/p/za5cy
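A database connection cannot be pickled, so it cannot ride along in a closure; mapPartitions lets each worker open one connection per partition instead. A minimal sketch, where db_connect and lookup_label stand in for whatever client library you actually use:

    def enrich_partition(records):
        # Runs on the worker: open one connection per partition
        # rather than shipping a connection object from the driver.
        conn = db_connect("db-host:5432")  # hypothetical connection helper
        try:
            for record in records:
                yield (record, conn.lookup_label(record))  # hypothetical query
        finally:
            conn.close()

    enriched = rdd.mapPartitions(enrich_partition)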

Testing with SparkTestingBase

• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
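A sketch of what a unit test looks like with the Python side of spark-testing-base, assuming its SparkTestingBaseTestCase, which supplies a local SparkContext as self.sc (Featurize.label is the hypothetical static method from Featurize.py above):

    import unittest

    from sparktestingbase.testcase import SparkTestingBaseTestCase

    from awesome.Featurize import Featurize  # hypothetical module from the layout above

    class FeaturizeTest(SparkTestingBaseTestCase):
        def test_labels_are_doubles(self):
            rdd = self.sc.parallelize(["record-1", "record-2"])
            labels = rdd.map(Featurize.label).collect()
            for label in labels:
                self.assertIsInstance(label, float)

    if __name__ == "__main__":
        unittest.main()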

Testing Suggestions

• Unit test as much as possible
• Integration test the whole flow
• Use a sample of real data
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions
  • Correctness of results

https://flic.kr/p/tucHHL
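For example, RDDs with empty partitions are easy to construct in a unit test; continuing the test-case style above, a sketch:

    def test_handles_empty_partitions(self):
        # Four partitions but only two elements, so at least two partitions are empty.
        rdd = self.sc.parallelize([1.0, 2.0], numSlices=4)
        # mapPartitions-based code must cope with the empty iterators.
        sums = rdd.mapPartitions(lambda part: [sum(part)]).collect()
        self.assertEqual(len(sums), 4)
        self.assertEqual(sum(sums), 3.0)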

Best Practices for Running PySpark

Writing distributed code is the easy part…

Running it is hard.

Get Serious About Logs

• Get the YARN application id from the Web UI or console
• yarn logs <app-id>
• Quiet down Py4J (see the sketch below)
• Log any records that fail to process
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
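Quieting Py4J on the Python side usually means raising the py4j logger's level; on Spark 1.4+ you can also lower the JVM-side chatter at runtime. A sketch:

    import logging

    # Py4J logs every JVM round trip; raise its level to cut the noise.
    logging.getLogger("py4j").setLevel(logging.ERROR)

    # Spark 1.4+: quiet Spark's own JVM logs at runtime.
    sc.setLogLevel("WARN")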

Know your environment

• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
• Spark versions before 1.4.0 require the same version of Python on the driver and workers

Complex Dependencies

Many Python Environments

The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON.

It can be set in spark-env.sh:

    if [ -z "${PYSPARK_PYTHON}" ]; then
        export PYSPARK_PYTHON=<path>
    fi

http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

Thank You! Questions?

@j_houg juliet@cloudera.com