Next-generation Python Big Data Tools, powered by Apache Arrow


Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis 2012
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}

• Mostly work in Python and Cython/C/C++


In process: Python for Data Analysis: 2nd Edition. Coming late 2016 / early 2017


Python + Big Data: The State of things

• See “Python and Apache Hadoop: A State of the Union” from February 17
• Areas where much more work is needed
  • Binary file format read/write support (e.g. Parquet files)
  • File system libraries (HDFS, S3, etc.)
  • Client drivers (Spark, Hive, Impala, Kudu)
  • Compute system integration (Spark, Impala, etc.)


Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow


Arrow in a Slide

• New Top-level Apache Software Foundation project
• Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!

Calcite

Cassandra

Deeplearning4j

Drill

Hadoop

HBase

Ibis

Impala

Kudu

Pandas

Parquet

Phoenix

Spark

Storm

R


Apache Arrow: What is it?

• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data

• Targets: Impala, Kudu, Parquet, Spark


Focus on CPU Efficiency

[Figure: the same four-row table stored in a traditional (row-oriented) memory buffer and in an Arrow (column-oriented) memory buffer]

        session_id    timestamp          source_ip
Row 1   1331246660    3/8/2012 2:44PM    99.155.155.225
Row 2   1331246351    3/8/2012 2:38PM    65.87.165.114
Row 3   1331244570    3/8/2012 2:09PM    71.10.106.181
Row 4   1331261196    3/8/2012 6:46PM    76.102.156.138

Traditional Memory Buffer: values stored row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values stored column by column (session_id, timestamp, source_ip)

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
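The row-versus-column distinction above can be illustrated with a minimal sketch. It uses only Python's standard library as a stand-in and is not Arrow's actual implementation; the variable names are hypothetical, and the values are the four records from the figure.

```python
import sys
from array import array

# Row-oriented layout: each record keeps its three fields together.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented layout: all values of one field sit next to each other.
# array("q") stores signed 64-bit integers in one contiguous buffer.
session_id = array("q", (r[0] for r in rows))
timestamp = [r[1] for r in rows]
source_ip = [r[2] for r in rows]

# A scan over one field reads only that field's contiguous buffer...
largest_columnar = max(session_id)
# ...while the row layout has to visit every record to pull out one field.
largest_rowwise = max(r[0] for r in rows)

# Constant-time value access: items have a fixed width, so the byte offset
# of the i-th value is simply i * itemsize.
raw = memoryview(session_id).cast("B")
i = 2
offset = i * session_id.itemsize
third = int.from_bytes(raw[offset:offset + session_id.itemsize],
                       sys.byteorder, signed=True)
```

Both scans return the same answer; the difference is in how much unrelated data each one has to step over.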


High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

[Figure, Today: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data pairwise, with a "Copy & Convert" step on each connection]
[Figure, With Arrow: the same systems all connect to shared Arrow Memory]
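The contrast between "copy & convert" and a shared memory format can be sketched with the standard library alone. This is an illustration under assumptions, not Arrow code: pickle stands in for any system-specific wire format, and a memoryview stands in for a consumer that understands the producer's layout.

```python
import pickle
from array import array

# Producer-side data: four float64 values in one contiguous buffer.
values = array("d", [1.5, 2.5, 3.5, 4.5])

# Copy & convert: serialize into a wire format, then deserialize into a
# brand new object on the consumer side.
wire_bytes = pickle.dumps(values)
copied = pickle.loads(wire_bytes)

# Shared format: the consumer views the producer's buffer directly.
# No bytes are copied and nothing is parsed.
shared = memoryview(values)

# A later change by the producer is visible through the shared view,
# but not in the deserialized copy.
values[0] = 10.0
```

After the last line, `shared[0]` reflects the update while `copied[0]` still holds the old value, which shows that only the pickled path duplicated the data.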


Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/


Real World Example: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)

• Read speeds close to disk IO performance

[Figure: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers)]
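The general idea of "column buffers followed by a metadata block" can be shown with a toy layout. This is explicitly not the Feather format: the JSON footer, the 4-byte length trailer, and the function names `write_toy_frame` and `read_toy_column` are all assumptions made for illustration, whereas Feather describes its arrays with Google flatbuffers.

```python
import io
import json
import struct
from array import array

def write_toy_frame(stream, columns):
    """Write each column buffer, then a footer describing where each one is."""
    meta = []
    for name, arr in columns.items():
        offset = stream.tell()
        data = arr.tobytes()
        stream.write(data)
        meta.append({"name": name, "typecode": arr.typecode,
                     "offset": offset, "nbytes": len(data)})
    footer = json.dumps(meta).encode("utf-8")
    stream.write(footer)
    # A trailing 4-byte little-endian length lets a reader find the footer.
    stream.write(struct.pack("<I", len(footer)))

def read_toy_column(stream, name):
    """Read one column by seeking straight to its buffer."""
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(stream.read(footer_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    stream.seek(entry["offset"])
    out = array(entry["typecode"])
    out.frombytes(stream.read(entry["nbytes"]))
    return out

# Round trip through an in-memory "file".
buf = io.BytesIO()
write_toy_frame(buf, {"x": array("q", [1, 2, 3]),
                      "y": array("d", [0.5, 1.5, 2.5])})
y = read_toy_column(buf, "y")
```

Because the footer records each buffer's offset and size, reading column `y` never touches the bytes of column `x`, and the column bytes are loaded as-is rather than parsed value by value.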


Real World Example: Feather File Format for Python and R

R:
    library(feather)
    # df is an existing data frame
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python:
    import feather
    # df is an existing pandas DataFrame
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)


Apache Parquet: Binary columnar storage format

• I just became a Parquet committer!
• github.com/apache/parquet-cpp
• Python users will soon be able to read Parquet files via PyArrow
• parquet-cpp <-> PyArrow <-> pandas


Language Bindings
• Target Languages
  • Java (beta)
  • CPP (underway)
  • Python & Pandas (underway)
  • R
  • Julia
• Initial Focus
  • Read a structure
  • Write a structure
  • Manage Memory


pandas and Arrow in context


RPC & IPC: Moving Data Between Systems

RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored I/O
  • Scatter/gather reads/writes against socket

IPC
• Alpha implementation using memory mapped files
  • Moving data between Python and Drill
• Working on shared allocation approach
  • Shared reference counting and well-defined ownership semantics
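The memory-mapped-file approach to IPC can be sketched as follows. This is a single-process illustration using Python's `mmap` module, not the Arrow alpha implementation; the file name `column.bin` is hypothetical, and in a real setting the producer and consumer would be separate processes agreeing on the buffer layout.

```python
import mmap
import os
import tempfile
from array import array

# Producer: write one column of 64-bit integers to a file.
column = array("q", [10, 20, 30, 40])
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(column.tobytes())

# Consumer: map the file and reinterpret the bytes as 64-bit integers.
# The values are read straight from the mapping; nothing is deserialized.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as base, base.cast("q") as view:
            total = sum(view)
            second = view[1]
            n_values = len(view)
```

The nested `with` blocks release the views before the mapping is closed, a small-scale version of the ownership question the slide raises for shared allocations.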


Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase

Python, R, Julia, …?


Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL engine passes input partitions 0 … n - 1 to user-supplied Python code; a Python function maps each input to an output, and output partitions 0 … n - 1 are returned to the SQL engine]
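The partition-wise flow in the figure can be sketched in a few lines. The function `run_udf` is a hypothetical stand-in for the SQL engine, not an API of Spark, Drill, or Impala; the point is only that the user function receives a whole column batch per partition rather than one row at a time.

```python
from array import array

def run_udf(partitions, func):
    """Stand-in for the engine: call the user function once per partition."""
    return [func(part) for part in partitions]

def user_function(batch):
    # User-supplied Python code: takes a column batch, returns a column batch.
    return array("d", (x * 2.0 for x in batch))

# Input partitions 0 … n - 1, each a contiguous buffer of float64 values.
in_partitions = [
    array("d", [1.0, 2.0]),
    array("d", [3.0]),
    array("d", [4.0, 5.0]),
]

# Output partitions 0 … n - 1, one per input partition.
out_partitions = run_udf(in_partitions, user_function)
```

With a shared columnar format, the batches handed to `user_function` would not need to be converted on the way in or out, which is where the faster UDFs mentioned later would come from.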


What’s Next
• Parquet for Python & C++
  • Using Arrow as intermediary
• Available IPC Implementation
• Spark, Drill Integration
  • Faster UDFs, Storage interfaces


Apache Arrow in practice


Get Involved
• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow


Thank you Wes McKinney @wesmckinn Views are my own
