Apache Arrow and Python: The latest

Apache Arrow and Python in context. Wes McKinney (@wesmckinn), Data Science Summit, 2016-07-12. © Cloudera, Inc. All rights reserved.

Page 1: Apache Arrow and Python: The latest

Apache Arrow and Python in context

Wes McKinney @wesmckinn

Data Science Summit 2016-07-12

Page 2: Apache Arrow and Python: The latest

Me

• Data Science Tools at Cloudera

• Creator of pandas

• Wrote Python for Data Analysis (2012; 2nd edition coming in 2017)

• Open source projects

• Python {pandas, Ibis, statsmodels}

• Apache {Arrow, Parquet, Kudu (incubating)}

• Mostly work in Python and Cython/C/C++

Page 3: Apache Arrow and Python: The latest

WrangleConf - July 28 in San Francisco

http://wrangleconf.com

Storytelling from real-world data science work (and BBQ, of course)

Page 4: Apache Arrow and Python: The latest

Python + Big Data: The State of things

• See “Python and Apache Hadoop: A State of the Union” from February 17

• Areas where much more work is needed

• Binary file format read/write support (e.g. Parquet files)

• File system libraries (HDFS, S3, etc.)

• Client drivers (Spark, Hive, Impala, Kudu)

• Compute system integration (Spark, Impala, etc.)

Page 5: Apache Arrow and Python: The latest

Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow

Page 6: Apache Arrow and Python: The latest

Arrow in a Slide

• New Top-level Apache Software Foundation project

• Announced Feb 17, 2016

• Focused on Columnar In-Memory Analytics

1. 10-100x speedup on many workloads

2. Common data layer enables companies to choose best-of-breed systems

3. Designed to work with any programming language

4. Support for both relational and complex data as-is

• Developers from 13+ major open source projects involved

Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, pandas, Parquet, Phoenix, Spark, Storm, R

Page 7: Apache Arrow and Python: The latest

High Performance Sharing & Interchange

Today

• Each system has its own internal memory format

• 70-80% CPU wasted on serialization and deserialization

• Similar functionality implemented in multiple projects

With Arrow

• All systems utilize the same memory format

• No overhead for cross-system communication

• Projects can share functionality (e.g., Parquet-to-Arrow reader)
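
The contrast between per-system formats and a shared memory format can be illustrated with a small sketch. This is not Arrow code: it uses only Python's standard library (`array`, `pickle`, `memoryview`) as a stand-in, and the variable names are made up for the illustration. The serialized hand-off produces an independent copy, while the shared-buffer hand-off reads the producer's bytes in place.

```python
import pickle
from array import array

# A "producer" system holding one column of C ints (typically 32-bit).
column = array("i", range(1000))

# Hand-off by serialization: the consumer receives an independent copy,
# and both sides spend CPU time encoding and decoding it.
payload = pickle.dumps(column)
copied = pickle.loads(payload)

# Hand-off through a shared memory layout: the consumer wraps the
# producer's buffer. memoryview does not copy the underlying bytes.
shared = memoryview(column)

# A later write by the producer is visible through the shared view only.
column[0] = 42
copied_sees_update = copied[0] == 42
shared_sees_update = shared[0] == 42
```

Here `copied_sees_update` ends up false and `shared_sees_update` true, which is the in-process analogue of two systems agreeing on one memory format instead of converting between their own.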

Page 8: Apache Arrow and Python: The latest

Apache Arrow: What is it?

• http://arrow.apache.org

• Specification matters more than Implementation

• A standardized in-memory representation for columnar data

• Enables

• High-performance in-memory analytics (think “pandas internals”)

• Cheap data interchange amongst systems, little or no serialization

• Flexible support for complex JSON-like data

• Targets: Impala, Kudu, Parquet, Spark
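
To make "a standardized in-memory representation for columnar data" more concrete, here is a rough sketch loosely modeled on the Arrow layout: a nullable integer column stored as a validity bitmap plus a contiguous values buffer, and a nested (list) column stored as offsets into one flat child buffer. The helper names are hypothetical and the code uses only the standard library; it illustrates the idea, not the Arrow implementation.

```python
from array import array

def build_int_column(values):
    """Pack ints/None into (validity bitmap, contiguous values buffer)."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = array("i", [0] * len(values))
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # set bit i: slot i holds a value
            data[i] = v
    return bitmap, data

def get_int(bitmap, data, i):
    """Return slot i, or None if its validity bit is unset."""
    is_valid = (bitmap[i // 8] >> (i % 8)) & 1
    return data[i] if is_valid else None

def build_list_column(lists):
    """Store nested lists as offsets into a single flat child buffer."""
    offsets = array("i", [0])
    child = array("i")
    for item in lists:
        child.extend(item)
        offsets.append(len(child))  # end of this list = start of the next
    return offsets, child

def get_list(offsets, child, i):
    """Slice list i out of the flat child buffer."""
    return child[offsets[i]:offsets[i + 1]].tolist()
```

For example, `build_list_column([[1, 2], [], [3]])` yields offsets `0, 2, 2, 3` over the child values `1, 2, 3`, so JSON-like nesting is expressed with flat buffers rather than per-row objects.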

Page 9: Apache Arrow and Python: The latest

Focus on CPU Efficiency

Traditional Memory Buffer

Arrow Memory Buffer

• Cache Locality

• Super-scalar & vectorized operation

• Minimal Structure Overhead

• Constant value access, with minimal structure overhead

• Operate directly on columnar compressed data
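
The difference between a traditional (row-oriented) buffer and a columnar buffer can be sketched with the standard library. The record shape `(id, price)` and all names below are assumptions made for the illustration. In the row buffer, reading one field across all records means striding over the other fields; in the columnar buffers, each field is a contiguous run and element `i` sits at a fixed offset.

```python
import struct
from array import array

rows = [(1, 1.5), (2, 2.5), (3, 3.5)]  # hypothetical (id, price) records

# Row-oriented buffer: the fields of one record sit next to each other.
ROW = struct.Struct("<id")  # int32 followed by float64, 12 bytes per record
row_buf = b"".join(ROW.pack(*r) for r in rows)

# Column-oriented buffers: all values of one field sit next to each other.
ids = array("i", (r[0] for r in rows))
prices = array("d", (r[1] for r in rows))

def total_price_rows():
    # Must step over every id to reach each price.
    return sum(ROW.unpack_from(row_buf, i * ROW.size)[1] for i in range(len(rows)))

def total_price_columnar():
    # Scans one contiguous buffer that holds nothing but prices.
    return sum(prices)

def price_at(i):
    # Constant-time access: element i lives at byte offset i * itemsize.
    return prices[i]
```

Both totals agree; the columnar version simply touches less memory per useful value, which is where the cache-locality and vectorization benefits listed above come from.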

Page 10: Apache Arrow and Python: The latest

Example: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format

• Written by Wes McKinney (Python) and Hadley Wickham (R)

• Read speeds close to disk IO performance

Page 11: Apache Arrow and Python: The latest

Real World Example: Feather File Format for Python and R

R

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Page 12: Apache Arrow and Python: The latest

In progress: Parquet on HDFS for pandas users

[Diagram] Stack: pandas; pyarrow; libarrow, libarrow_io; Parquet files in HDFS / filesystems

[Diagram] Component labels: Arrow-Parquet adapter; parquet-cpp; native libhdfs, other filesystem interfaces; raw filesystem interface; data structures; Python wrapper classes

[Diagram] Layer labels: Python + C extensions; C++ libraries

Page 13: Apache Arrow and Python: The latest

Language Bindings

• Target Languages

• Java (beta)

• C++ (underway)

• Python & pandas (underway)

• R

• Julia

• Initial Focus

• Read a structure

• Write a structure

• Manage Memory

Page 14: Apache Arrow and Python: The latest

RPC & IPC: Moving Data Between Systems

RPC

• Avoid Serialization & Deserialization

• Layer TBD: focused on supporting vectored I/O

• Scatter/gather reads/writes against socket
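
As a rough illustration of scatter/gather I/O against a socket, the sketch below sends a column's validity bitmap and values buffer in one `sendmsg` call without concatenating them, and receives them straight into preallocated buffers with `recvmsg_into`. The 8-byte length header and the function names are hypothetical framing invented for this example, not an Arrow wire format. It assumes a Unix-like platform (where `sendmsg` is available) and ignores partial sends and reads, which a real transport would have to handle.

```python
import socket
import struct
from array import array

HEADER = struct.Struct("<II")  # byte lengths of the two buffers (made-up framing)

def send_column(sock, validity, values):
    header = HEADER.pack(len(validity), len(values) * values.itemsize)
    # Gather write: three separate buffers leave in a single call.
    sock.sendmsg([header, validity, values])

def recv_column(sock):
    validity_len, values_len = HEADER.unpack(sock.recv(HEADER.size))
    validity = bytearray(validity_len)
    values = array("i", bytes(values_len))  # zero-filled, correctly sized
    # Scatter read: incoming bytes land directly in the destination buffers.
    sock.recvmsg_into([validity, values])
    return validity, values

# Usage: a connected pair of sockets standing in for two systems.
left, right = socket.socketpair()
send_column(left, bytearray([0b101]), array("i", [1, 0, 3]))
validity, values = recv_column(right)
left.close()
right.close()
```

The point of the design is that the buffers are never packed into an intermediate message object on either side; the socket reads from and writes to the column buffers themselves.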

IPC

• Alpha implementation using memory mapped files

• Moving data between Python and Drill

• Working on shared allocation approach

• Shared reference counting and well-defined ownership semantics
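
The memory-mapped-file approach to IPC can be sketched with Python's standard `mmap` module. This is an assumption-laden illustration rather than the alpha implementation mentioned above: the file holds one raw column of C ints with no metadata, and `write_column` / `open_column` are hypothetical helper names. The reader maps the file and views it as typed values without copying it into a new Python object.

```python
import mmap
import os
import tempfile
from array import array

def write_column(path, values):
    """Producer: dump the raw values buffer to a file."""
    with open(path, "wb") as f:
        values.tofile(f)

def open_column(path):
    """Consumer: map the file and view it as C ints without copying."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mapped, memoryview(mapped).cast("i")

# Usage: one process writes, another (here, the same one) maps and reads.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, array("i", [10, 20, 30]))

mapped, view = open_column(path)
total = sum(view)
second = view[1]

# Ownership matters: the view must be released before the mapping closes.
view.release()
mapped.close()
```

The explicit `release()` before `close()` is a small-scale version of the ownership question raised in the last bullet: whoever hands out views of shared memory needs a rule for when that memory may go away.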

Page 15: Apache Arrow and Python: The latest

Executing data science languages in the compute layer

Page 16: Apache Arrow and Python: The latest

Real World Example: Python With Spark, Drill, Impala

Page 17: Apache Arrow and Python: The latest

What’s on the horizon

• Parquet for Python & C++

• Using Arrow as intermediary

• IPC Implementation + Java/C++ interop

• Spark, Drill Integration

• Faster UDFs, Storage interfaces

Page 18: Apache Arrow and Python: The latest

Get Involved

• Join the community

[email protected]

• Slack: https://apachearrowslackin.herokuapp.com/

• http://arrow.apache.org

• @ApacheArrow

Page 19: Apache Arrow and Python: The latest

Thank you

Wes McKinney @wesmckinn

Views are my own