PySpark Best Practices
Juliet Hougland
Sept 2015
@j_houg
Spark
• Core written in Scala, operates on the JVM
• Also has Python and Java APIs
• Hadoop friendly
  • Input from HDFS, HBase, Kafka
  • Management via YARN
• Interactive REPL
• ML library: MLlib
Spark MLlib
• Model building and evaluation
• Fast
• Basics covered:
  • LR, SVM, decision trees
  • PCA, SVD
  • K-means
  • ALS
• Algorithms expect RDDs of consistent types (e.g. LabeledPoints), as sketched below
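A minimal sketch of that constraint against the Spark 1.x MLlib API (the toy data and app name are illustrative): every algorithm consumes an RDD of LabeledPoint, a double label plus a feature vector.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="mllib-sketch")

# Each record becomes a LabeledPoint of a consistent shape.
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.1]),
    LabeledPoint(1.0, [2.0, 1.0]),
    LabeledPoint(1.0, [2.2, 1.4]),
])

model = LogisticRegressionWithSGD.train(training, iterations=10)
print(model.predict([2.0, 1.0]))  # 0 or 1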
RDDs

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

[Diagram: a file in HDFS is read into an RDD of four partitions; each transformation (map, filter) produces a new RDD with the same four partitions; count() aggregates a result back to the driver. Thanks: Kostas Sakellis]
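To make that pipeline concrete, here is a hedged sketch: to_series and has_outlier are hypothetical stand-ins (the deck never defines them), and sc is the usual SparkContext. Note that map and filter are lazy transformations; only the count() action triggers execution.

# Hypothetical implementations, for illustration only.
def to_series(line):
    return [float(x) for x in line.split(",")]

def has_outlier(series):
    mean = sum(series) / len(series)
    return any(abs(x - mean) > 3.0 for x in series)

# map/filter only build up a lineage; count() runs the whole pipeline.
n = (sc.textFile("hdfs://…", 4)
     .map(to_series)
     .filter(has_outlier)
     .count())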
Spark Execution Model
PySpark Execution Model
PySpark Driver Program

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()

Function closures need to be executed on worker nodes by a Python process.
How do we ship around Python functions?

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
Pickle!

sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
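The mechanics, illustrated with plain pickle (a hedged sketch; PySpark actually bundles cloudpickle, which can additionally serialize lambdas and closures by value):

import pickle

def to_series(line):
    return [float(x) for x in line.split(",")]

# Serialize the function to bytes and rebuild it elsewhere; this is how
# driver-side functions reach the worker Python processes.
payload = pickle.dumps(to_series)
restored = pickle.loads(payload)
print(restored("1.0,2.0,3.0"))  # [1.0, 2.0, 3.0]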
Best Practices for Writing PySpark
REPLs and Notebooks
https://flic.kr/p/5hnPZp
Share your code
https://flic.kr/p/sw2cnL
Standard Python Project

my_pyspark_proj/
  awesome/
    __init__.py
  bin/
  docs/
  setup.py
  tests/
    __init__.py
    awesome_tests.py
What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U
PySpark Structure?

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data

https://flic.kr/p/ZW54
Shout out to my colleagues in the UK
PySpark Structure?

my_pyspark_proj/
  awesome/
    __init__.py
    DataIO.py
    Featurize.py
    Model.py
  bin/
  docs/
  setup.py
  tests/
    __init__.py
    awesome_tests.py
    resources/
      data_source_sample.csv

• Parse CLI args & configure Spark app
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
Simple Main Method
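A minimal sketch of what such a main method might look like, assuming the hypothetical DataIO, Featurize, and Model modules from the layout above (the functions called on them are illustrative):

import sys
from pyspark import SparkContext

# Modules from the project layout above; their function names here
# are assumed for illustration.
from awesome import DataIO, Featurize, Model

def main(argv):
    in_path, out_path = argv[1], argv[2]   # parse CLI args
    sc = SparkContext(appName="awesome")   # configure the Spark app
    raw = DataIO.read(sc, in_path)         # read in data
    features = Featurize.featurize(raw)    # raw data into features
    model = Model.train(features)          # fancy maths with Spark
    DataIO.write(model, out_path)          # write out data

if __name__ == "__main__":
    main(sys.argv)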
Write Testable Code

• Write a function for anything inside a transformation
  • Make it static
• Separate feature generation or data standardization from your modeling

Featurize.py

@staticmethod
def label(single_record):
    ...
    return label_as_a_double

@staticmethod
def descriptive_name_of_feature1():
    ...
    return a_double

@staticmethod
def create_labeled_point(data_usage_rdd, sms_usage_rdd):
    ...
    return LabeledPoint(label, [feature1])
Write Serializable Code

• Functions and the contexts they need to execute (closures) must be serializable
• Keep functions simple. I suggest static methods.
• Some things are impossiblish
  • DB connections => use mapPartitions instead (see the sketch below)

https://flic.kr/p/za5cy
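A hedged sketch of the mapPartitions pattern, given some RDD rdd of records; some_db_client and its calls are hypothetical stand-ins for a real database driver:

def save_partition(records):
    import some_db_client                     # hypothetical DB library
    conn = some_db_client.connect("db-host")  # opened on the worker,
                                              # never pickled on the driver
    for record in records:
        conn.insert(record)
    conn.close()
    return []

# One connection per partition instead of one per record; count() is
# just an action to force evaluation.
rdd.mapPartitions(save_partition).count()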
Testing with SparkTestingBase

• Provides a SparkContext and configures the Spark master
• Quiets Py4J
• https://github.com/holdenk/spark-testing-base
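A hedged sketch of a unit test built on the project's Python test case (module path as documented in its README; verify against your version). has_outlier is a hypothetical function under test:

from sparktestingbase.testcase import SparkTestingBaseTestCase

def has_outlier(series):              # hypothetical function under test
    return max(series) > 100.0

class FeaturizeTest(SparkTestingBaseTestCase):
    # The base class supplies a local SparkContext as self.sc and
    # quiets Py4J logging.
    def test_filter_keeps_only_outliers(self):
        rdd = self.sc.parallelize([[1.0, 2.0], [1.0, 900.0]])
        result = rdd.filter(has_outlier).collect()
        self.assertEqual(result, [[1.0, 900.0]])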
Testing Suggestions

• Unit test as much as possible
• Integration test the whole flow
• Test for:
  • Deviations of data from the expected format
  • RDDs with empty partitions (see the example below)
  • Correctness of results

https://flic.kr/p/tucHHL
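Empty partitions are easy to manufacture in a test; a small example:

# Asking for more partitions than there are elements guarantees that
# some partitions are empty, a good edge case for per-partition code.
rdd = sc.parallelize([1, 2], numSlices=4)
print(rdd.glom().collect())  # e.g. [[], [1], [], [2]]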
Best Practices for Running PySpark
Writing distributed code is the easy part…
Running it is hard.
Get Serious About Logs

• Get the YARN application id from the web UI or console
• yarn logs -applicationId <app-id>
• Quiet down Py4J (see the sketch below)
• Log records that have trouble getting processed
• Earlier exceptions are more relevant than later ones
• Look at both the Python and Java stack traces
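One way to quiet Py4J from the driver, via standard Python logging (a minimal sketch):

import logging

# The py4j logger is chatty; raising its level keeps driver output readable.
logging.getLogger("py4j").setLevel(logging.ERROR)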
Know your environment

• You may want to use Python packages on your cluster
• Actively manage dependencies on your cluster
  • Anaconda or virtualenv works well for this
• Spark versions before 1.4.0 require the same version of Python on the driver and workers
Complex Dependencies
Many Python Environments

• The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON
• It can be set in spark-env.sh:

if [ -n "${PYSPARK_PYTHON}" ]; then
  export PYSPARK_PYTHON=<path>
fi
Thank You
Questions?
@j_houg