Performant data processing with PySpark, SparkR and DataFrame API


Transcript of Performant data processing with PySpark, SparkR and DataFrame API

Page 1: Performant data processing with PySpark, SparkR and DataFrame API

Performant data processing with PySpark, SparkR and DataFrame API

Ryuji Tamagawa from Osaka

Many Thanks to Holden Karau, for the discussion we had about this talk.

Page 2: Performant data processing with PySpark, SparkR and DataFrame API

Agenda

Who am I ?

Spark

Spark and non-JVM languages

DataFrame APIs come to the rescue

Examples

Page 3: Performant data processing with PySpark, SparkR and DataFrame API

Who am I?

Software engineer working for Sky, covering everything from architecture design to troubleshooting in the field

Translator working with O’Reilly Japan

‘Learning Spark’ is my 27th book

Awarded Rakuten Tech Award Silver 2010 for translating ‘Hadoop: The Definitive Guide’

A bed for 6 cats

Page 4: Performant data processing with PySpark, SparkR and DataFrame API

Works of 2015

Available January 2016?

Page 5: Performant data processing with PySpark, SparkR and DataFrame API

Works of the past

Page 6: Performant data processing with PySpark, SparkR and DataFrame API

Motivation for today’s talk

I want to deal with my ‘Big’ data,

WITH PYTHON !!

Page 7: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark

Page 8: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark

You may already have heard a lot about it

Fast, distributed data processing framework with high-level APIs

Written in Scala, runs on the JVM

[Diagram: the Hadoop ecosystem stack: OS, HDFS, YARN, MapReduce, HBase, Hive etc., Impala etc. (in-memory SQL engine), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]

Page 9: Performant data processing with PySpark, SparkR and DataFrame API

Why it’s fastDo not need to write temporary data to storage every time

Do not need to invoke JVM process every time

[Diagram: MapReduce vs. Spark. In MapReduce, every map and reduce step needs its own JVM invocation plus HDFS I/O. In Spark, a single executor JVM runs the whole chain (f1 reads data into an RDD, f2 through f7 follow), keeping RDDs in memory and touching storage only to read input, persist (f4), and shuffle (f5).]

Page 10: Performant data processing with PySpark, SparkR and DataFrame API

Apache Spark and non-JVM languages

Page 11: Performant data processing with PySpark, SparkR and DataFrame API

Spark supports non-JVM languages

Shells:

PySpark, for Python users

SparkR, for R users

GUI environments: Jupyter, RStudio

You can write application code in these languages

Page 12: Performant data processing with PySpark, SparkR and DataFrame API

The Web UI tells us a lot

http://<address>:4040

Page 13: Performant data processing with PySpark, SparkR and DataFrame API

Performance problems with those languages

Data processing in those languages may be several times slower than in JVM languages

The reason lies in the architecture

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Page 14: Performant data processing with PySpark, SparkR and DataFrame API

The choices you have had

Learn Scala

Write (more lines of) code in Java

Use non-JVM languages with more CPU cores to make up the performance gap

Page 15: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs come to the rescue !

Page 16: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame

Tabular data with a schema, built on top of RDDs

Successor of SchemaRDD (since 1.4)

Has rich set of APIs for data operation

Or, you can simply use SQL!
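
Not from the original slides: a minimal PySpark sketch of both styles, assuming Spark 1.4+ with a SQLContext named sqlContext and a hypothetical people.json file holding name and age columns.

# Build a DataFrame and query it, first with the DataFrame API, then with SQL
df = sqlContext.read.json("people.json")

# DataFrame API: schema-aware operations that run inside the JVM
df.filter(df["age"] >= 20).select("name", "age").show()

# Or simply use SQL over the same data
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 20").show()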

Page 17: Performant data processing with PySpark, SparkR and DataFrame API

Do it within JVM

When you call DataFrame APIs from non-JVM Languages, data will not be transferred between JVM and the language runtime

Obviously, the performance is almost the same as with JVM languages

Only code goes through

Page 18: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs compared to RDD APIs by Examples

[Diagram: an RDD-style filter written as a Python lambda, lambda items: items[0] == ‘abc’. The cached DataFrame sits in the executor JVM; rows are transferred to the Python worker to evaluate the lambda, and the resulting data is transferred back to the JVM and on to the driver.]

Page 19: Performant data processing with PySpark, SparkR and DataFrame API

DataFrame APIs compared to RDD APIs by Examples

[Diagram: the same filter expressed with the DataFrame API, filter(df["_1"] == "abc"). Only the code is transferred to the executor JVM; the cached DataFrame is filtered entirely inside the JVM, and only the resulting DataFrame goes back to the driver.]
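
A small sketch (not in the slides) of the two filters pictured above, assuming a SparkContext sc, a SQLContext sqlContext, and made-up tuple data.

# Hypothetical data, just to make the two styles concrete
rdd = sc.parallelize([("abc", 1), ("xyz", 2)])
df = sqlContext.createDataFrame(rdd)   # columns default to _1, _2
df.cache()

# RDD style: the lambda runs in Python workers, so every row is
# serialized from the JVM to Python and back again
rdd.filter(lambda items: items[0] == "abc").collect()

# DataFrame style: the predicate is shipped as an expression and
# evaluated inside the JVM; only the result crosses the boundary
df.filter(df["_1"] == "abc").collect()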

Page 20: Performant data processing with PySpark, SparkR and DataFrame API

Watch out for UDFs

You can write UDFs in Python

You can use lambdas in Python, too

Once you use them, data flows between the two worlds

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

df.select(slen(df.name)).collect()

Page 21: Performant data processing with PySpark, SparkR and DataFrame API

Make it small first, then use UDFs

Filter or sample your ‘big’ data with DataFrame APIs

Then use UDFs

The SQL optimizer does not take UDFs into account when making plans (so far)

‘BIG’ data in DataFrame → filtering with ‘native APIs’ → ‘Small’ data in DataFrame → whatever operation with UDFs

Page 22: Performant data processing with PySpark, SparkR and DataFrame API

Make it small first, then use UDFs

Filter or sample your ‘big’ data with DataFrame APIs

Then use UDFs

The SQL optimizer does not take UDFs into account when making plans (so far)

slen = udf(lambda s: len(s), IntegerType())

sqc.sql('select … from df where fname like "tama%" and slen(name)').collect()

processed first!
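
A hedged sketch of the pattern from these two slides, assuming a DataFrame df with hypothetical fname and name string columns.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

# Shrink the data with native DataFrame predicates first (stays in the JVM),
# then apply the Python UDF only to the much smaller result
small = df.filter(df["fname"].like("tama%"))
small.select(small["name"], slen(small["name"]).alias("name_len")).collect()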

Page 23: Performant data processing with PySpark, SparkR and DataFrame API

Ingesting Data

It’s slow to deal with files like CSVs from a non-JVM driver

So, convert raw data to ‘DataFrame-native’ formats like Parquet first

You can then process such files directly from the JVM processes (executors), even when using non-JVM languages

[Diagram: ingesting local data through a non-JVM driver. Local data on the driver machine flows through the Python driver and Py4J into a DataFrame in the executor JVM, and is then written out to HDFS (Parquet).]
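
Not in the slides: one way to do the one-time conversion, assuming the external spark-csv package (com.databricks.spark.csv) is available and a hypothetical raw_data.csv input and HDFS output path.

# Read the CSV inside the JVM via the spark-csv data source and write Parquet once
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .options(header="true", inferSchema="true")
       .load("raw_data.csv"))
raw.write.parquet("hdfs:///data/raw_parquet")

# Later jobs read the DataFrame-native format directly in the executors
df = sqlContext.read.parquet("hdfs:///data/raw_parquet")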

Page 24: Performant data processing with PySpark, SparkR and DataFrame API

Ingesting Data

[Diagram: once the data is already in HDFS as Parquet, only code goes from the Python driver through Py4J to the executor JVM; the DataFrame is read from HDFS (Parquet) and processed entirely inside the JVM on the cluster, not on the driver machine.]

It’s slow to deal with files like CSVs from a non-JVM driver

So, convert raw data to ‘DataFrame-native’ formats like Parquet first

You can then process such files directly from the JVM processes (executors), even when using non-JVM languages

Page 25: Performant data processing with PySpark, SparkR and DataFrame API

Appendix: Parquet

Page 26: Performant data processing with PySpark, SparkR and DataFrame API

Parquet: a general-purpose file format for analytic workloads

Columnar storage: reduces I/O significantly

High compression rate

Projection pushdown: read only the columns you need

Today’s workloads are becoming CPU-intensive: Parquet is designed for very fast reads and is aware of CPU internals
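
Not in the slides: a small PySpark illustration of what columnar storage and projection pushdown look like in practice, reusing the hypothetical Parquet path from the ingestion example.

# Selecting a few columns from Parquet reads only those column chunks,
# rather than scanning whole rows
df = sqlContext.read.parquet("hdfs:///data/raw_parquet")
subset = df.select("name", "fname").filter(df["fname"].like("tama%"))
subset.explain()   # physical plan shows the column-pruned Parquet scan
subset.show()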