Performant data processing with PySpark, SparkR and DataFrame API


Transcript of Performant data processing with PySpark, SparkR and DataFrame API

Page 1: Performant data processing with PySpark, SparkR and DataFrame API

Performant data processing with PySpark, SparkR and DataFrame API

Ryuji Tamagawa from Osaka

Many Thanks to Holden Karau, for the discussion we had about this talk.

Page 2: Performant data processing with PySpark, SparkR and DataFrame API

Agenda

Who am I ?

Spark

Spark and non-JVM languages

DataFrame APIs come to the rescue

Examples

Page 3: Performant data processing with PySpark, SparkR and DataFrame API

Who am I?

Software engineer working for Sky, covering everything from architecture design to troubleshooting in the field

Translator working with O’Reilly Japan

‘Learning Spark’ is my 27th book

Awarded Rakuten Tech Award Silver 2010 for translating ‘Hadoop: The Definitive Guide’

A bed for 6 cats

Page 4: Performant data processing with PySpark, SparkR and DataFrame API

Works of 2015

Available January 2016?

Page 5: Performant data processing with PySpark, SparkR and DataFrame API

Works of the past

Page 6: Performant data processing with PySpark, SparkR and DataFrame API

Motivation for today’s talk

I want to deal with my ‘Big’ data,

WITH PYTHON !!

Page 7: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark

Page 8: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark

You may already have heard a lot about it

Fast, distributed data processing framework with high-level APIs

Written in Scala, runs on the JVM

[Diagram: the Hadoop ecosystem stack: OS, HDFS, YARN, MapReduce, HBase, Hive etc., Impala etc. (in-memory SQL engine), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]

Page 9: Performant data processing with PySpark, SparkR and DataFrame API

Why it’s fastDo not need to write temporary data to storage every time

Do not need to invoke JVM process every time

[Diagram: MapReduce vs. Spark. In MapReduce, every map and reduce step needs its own JVM invocation plus HDFS I/O. In Spark, a single executor JVM runs the whole chain (f1 reads data into an RDD, f2 through f7 follow), keeping RDDs in memory and touching storage only to read input, persist (f4), and shuffle (f5).]

Page 10: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark and non-JVM languages

Page 11: Performant data processing with PySpark, SparkR and DataFrame API

Spark supports non-JVM languages

Shells:

PySpark, for Python users

SparkR, for R users

GUI environments: Jupyter, RStudio

You can write application code in these languages

Page 12: Performant data processing with PySpark, SparkR and DataFrame API

The Web UI tells us a lot

http://<address>:4040

Page 13: Performant data processing with PySpark, SparkR and DataFrame API

Performance problems with those languages

Data processing in those languages may be several times slower than in JVM languages

The reason lies in the architecture

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Page 14: Performant data processing with PySpark, SparkR and DataFrame API

The choices you have had

Learn Scala

Write (more lines of) code in Java

Use non-JVM languages with more CPU cores to make up the performance gap

Page 15: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs come to the rescue !

Page 16: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame

Tabular data with a schema, built on top of RDDs

Successor of SchemaRDD (since 1.4)

Has rich set of APIs for data operation

Or, you can simply use SQL!
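
Not from the original slides: a minimal PySpark sketch of both styles, assuming Spark 1.4+ with a SQLContext named sqlContext and a hypothetical people.json file holding name and age columns.

# Build a DataFrame and query it, first with the DataFrame API, then with SQL
df = sqlContext.read.json("people.json")

# DataFrame API: schema-aware operations that run inside the JVM
df.filter(df["age"] >= 20).select("name", "age").show()

# Or simply use SQL over the same data
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 20").show()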

Page 17: Performant data processing with PySpark, SparkR and DataFrame API

Do it within JVM

When you call DataFrame APIs from non-JVM Languages, data will not be transferred between JVM and the language runtime

Obviously, the performance is almost the same as with JVM languages

Only code goes through

Page 18: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs compared to RDD APIs by Examples

[Diagram: an RDD-style filter written as a Python lambda, lambda items: items[0] == ‘abc’. The cached DataFrame sits in the executor JVM; rows are transferred to the Python worker to evaluate the lambda, and the resulting data is transferred back to the JVM and on to the driver.]

Page 19: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs compared to RDD APIs by Examples

[Diagram: the same filter expressed with the DataFrame API, filter(df["_1"] == "abc"). Only the code is transferred to the executor JVM; the cached DataFrame is filtered entirely inside the JVM, and only the resulting DataFrame goes back to the driver.]
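
A small sketch (not in the slides) of the two filters pictured above, assuming a SparkContext sc, a SQLContext sqlContext, and made-up tuple data.

# Hypothetical data, just to make the two styles concrete
rdd = sc.parallelize([("abc", 1), ("xyz", 2)])
df = sqlContext.createDataFrame(rdd)   # columns default to _1, _2
df.cache()

# RDD style: the lambda runs in Python workers, so every row is
# serialized from the JVM to Python and back again
rdd.filter(lambda items: items[0] == "abc").collect()

# DataFrame style: the predicate is shipped as an expression and
# evaluated inside the JVM; only the result crosses the boundary
df.filter(df["_1"] == "abc").collect()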

Page 20: Performant data processing with PySpark, SparkR and DataFrame API

Watch out for UDFs

You can write UDFs in Python

You can use lambdas in Python, too

Once you use them, data flows between the two worlds

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

df.select(slen(df.name)).collect()

Page 21: Performant data processing with PySpark, SparkR and DataFrame API

Make it small first, then use UDFs

Filter or sample your ‘big’ data with DataFrame APIs

Then use UDFs

The SQL optimizer does not take UDFs into account when making plans (so far)

‘BIG’ data in DataFrame → filtering with ‘native APIs’ → ‘Small’ data in DataFrame → whatever operation with UDFs

Page 22: Performant data processing with PySpark, SparkR and DataFrame API

Make it small first, then use UDFs

Filter or sample your ‘big’ data with DataFrame APIs

Then use UDFs

The SQL optimizer does not take UDFs into account when making plans (so far)

slen = udf(lambda s: len(s), IntegerType())

sqc.sql('select … from df where fname like "tama%" and slen(name)').collect()

processed first!
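
A hedged sketch of the pattern from these two slides, assuming a DataFrame df with hypothetical fname and name string columns.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

# Shrink the data with native DataFrame predicates first (stays in the JVM),
# then apply the Python UDF only to the much smaller result
small = df.filter(df["fname"].like("tama%"))
small.select(small["name"], slen(small["name"]).alias("name_len")).collect()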

Page 23: Performant data processing with PySpark, SparkR and DataFrame API

Ingesting Data

It’s slow to deal with files like CSVs from a non-JVM driver

So, convert raw data to ‘DataFrame-native’ formats like Parquet first

You can then process such files directly from the JVM processes (executors), even when using non-JVM languages

[Diagram: ingesting local data through a non-JVM driver. Local data on the driver machine flows through the Python driver and Py4J into a DataFrame in the executor JVM, and is then written out to HDFS (Parquet).]
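
Not in the slides: one way to do the one-time conversion, assuming the external spark-csv package (com.databricks.spark.csv) is available and a hypothetical raw_data.csv input and HDFS output path.

# Read the CSV inside the JVM via the spark-csv data source and write Parquet once
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .options(header="true", inferSchema="true")
       .load("raw_data.csv"))
raw.write.parquet("hdfs:///data/raw_parquet")

# Later jobs read the DataFrame-native format directly in the executors
df = sqlContext.read.parquet("hdfs:///data/raw_parquet")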

Page 24: Performant data processing with PySpark, SparkR and DataFrame API

Ingesting Data

[Diagram: once the data is already in HDFS as Parquet, only code goes from the Python driver through Py4J to the executor JVM; the DataFrame is read from HDFS (Parquet) and processed entirely inside the JVM on the cluster, not on the driver machine.]

It’s slow to deal with files like CSVs from a non-JVM driver

So, convert raw data to ‘DataFrame-native’ formats like Parquet first

You can then process such files directly from the JVM processes (executors), even when using non-JVM languages

Page 25: Performant data processing with PySpark, SparkR and DataFrame API

Appendix: Parquet

Page 26: Performant data processing with PySpark, SparkR and DataFrame API

Parquet: a general-purpose file format for analytic workloads

Columnar storage: reduces I/O significantly

High compression rate

Projection pushdown: read only the columns you need

Today’s workloads are becoming CPU-intensive: Parquet is designed for very fast reads and is aware of CPU internals
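
Not in the slides: a small PySpark illustration of what columnar storage and projection pushdown look like in practice, reusing the hypothetical Parquet path from the ingestion example.

# Selecting a few columns from Parquet reads only those column chunks,
# rather than scanning whole rows
df = sqlContext.read.parquet("hdfs:///data/raw_parquet")
subset = df.select("name", "fname").filter(df["fname"].like("tama%"))
subset.explain()   # physical plan shows the column-pruned Parquet scan
subset.show()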