Next-generation Python Big Data Tools, powered by Apache Arrow

© Cloudera, Inc. All rights reserved. Next-generation Python Big Data Tools, powered by Apache Arrow. Wes McKinney, @wesmckinn. SF Big Analytics Meetup, 2016-04-05.

Transcript of Next-generation Python Big Data Tools, powered by Apache Arrow

Page 1: Next-generation Python Big Data Tools, powered by Apache Arrow

Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney, @wesmckinn
SF Big Analytics Meetup, 2016-04-05

Page 2: Next-generation Python Big Data Tools, powered by Apache Arrow

Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++

Page 3: Next-generation Python Big Data Tools, powered by Apache Arrow

In process: Python for Data Analysis, 2nd Edition
Coming late 2016 / early 2017

Page 4: Next-generation Python Big Data Tools, powered by Apache Arrow

Python + Big Data: The State of things

• See “Python and Apache Hadoop: A State of the Union” from February 17
• Areas where much more work needed
  • Binary file format read/write support (e.g. Parquet files)
  • File system libraries (HDFS, S3, etc.)
  • Client drivers (Spark, Hive, Impala, Kudu)
  • Compute system integration (Spark, Impala, etc.)

Page 5: Next-generation Python Big Data Tools, powered by Apache Arrow

Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow

Page 6: Next-generation Python Big Data Tools, powered by Apache Arrow

Arrow in a Slide

• New Top-level Apache Software Foundation project
  • Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  • A significant % of the world’s data will be processed through Arrow!

[Figure: logos of the projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R]

Page 7: Next-generation Python Big Data Tools, powered by Apache Arrow

Apache Arrow: What is it?

• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
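
The "complex JSON-like data" point can be made concrete with a small sketch. In the Arrow columnar format, a variable-length list column is stored as one flat values buffer plus an offsets buffer, with validity information marking nulls. The code below imitates that idea using only the Python standard library. The helper names (`encode_list_column`, `get_list`) are hypothetical and not part of any Arrow API, and validity is simplified to one byte per row rather than a packed bitmap.

```python
from array import array


def encode_list_column(rows):
    """Flatten a list-of-lists column into (validity, offsets, values).

    Hypothetical helper: a simplified imitation of a columnar list layout,
    using one validity byte per row instead of a packed bitmap.
    """
    validity = array("B")      # 1 = value present, 0 = null
    offsets = array("i", [0])  # offsets[i]..offsets[i + 1] bound row i
    values = array("q")        # all child values, stored contiguously
    for row in rows:
        if row is None:
            validity.append(0)
        else:
            validity.append(1)
            values.extend(row)
        offsets.append(len(values))
    return validity, offsets, values


def get_list(validity, offsets, values, i):
    """Return row i as a Python list, or None if the row is null."""
    if not validity[i]:
        return None
    return list(values[offsets[i]:offsets[i + 1]])


rows = [[1, 2, 3], None, [], [4, 5]]
validity, offsets, values = encode_list_column(rows)
decoded = [get_list(validity, offsets, values, i) for i in range(len(rows))]
```

Here nested rows of different lengths end up in three flat buffers, and any row can still be located with two offset lookups.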

Page 8: Next-generation Python Big Data Tools, powered by Apache Arrow

Focus on CPU Efficiency

         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138

[Figure: the same four rows laid out as a Traditional Memory Buffer (values stored row by row) and as an Arrow Memory Buffer (values stored column by column)]

• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
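
A rough illustration of the two layouts, using the four records from this slide and only the Python standard library. This is a sketch of the idea, not how Arrow itself is implemented: the row layout is a list of tuples, and the columnar layout keeps `session_id` in a packed `array` of 64-bit integers.

```python
from array import array

# Row-oriented: one tuple per record, fields interleaved.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: one contiguous buffer (or list) per field.
columns = {
    "session_id": array("q", (r[0] for r in rows)),  # packed 64-bit ints
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}

# Scanning one field walks every record in the row layout...
max_from_rows = max(r[0] for r in rows)
# ...but only a single contiguous buffer in the columnar layout.
max_from_columns = max(columns["session_id"])

# Fixed-width columns give constant-time access by position.
third_session = columns["session_id"][2]
bytes_per_value = columns["session_id"].itemsize
```

Because every `session_id` has the same width and sits next to its neighbours, a scan stays within one buffer, which is what makes cache-friendly and vectorized processing possible.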

Page 9: Next-generation Python Big Data Tools, powered by Apache Arrow

High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

[Figure: Today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through "Copy & Convert" steps. With Arrow, the same systems share Arrow Memory.]
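
The difference between "copy and convert" and a shared memory format can be sketched inside a single Python process. This is only an analogy under stated assumptions: `pickle` stands in for whatever serialization two systems would use, and a `memoryview` over an `array` stands in for a consumer that understands the producer's buffer layout. Neither is an Arrow API.

```python
import pickle
from array import array

column = array("q", range(1_000))

# "Copy & convert": the producer serializes, the consumer deserializes.
# Both steps cost CPU time and yield an independent copy of the data.
payload = pickle.dumps(column)
copied = pickle.loads(payload)

# Shared memory format: the consumer wraps the producer's buffer directly.
shared = memoryview(column)

column[0] = 42
copy_sees_update = copied[0] == 42  # the copy is detached from the source
view_sees_update = shared[0] == 42  # the view reads the same memory
same_size = shared.nbytes == len(column) * column.itemsize
shared.release()
```

The view reflects the producer's update because no bytes were duplicated, while the deserialized copy does not.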

Page 10: Next-generation Python Big Data Tools, powered by Apache Arrow

Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/

Page 11: Next-generation Python Big Data Tools, powered by Apache Arrow

Real World Example: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance

[Figure: layout of a Feather file: Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers)]
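
The general shape of such a file, column buffers written back to back followed by metadata that says where each one lives, can be sketched with the standard library. Everything specific here is an assumption for illustration: the function names `write_toy_file` and `read_toy_column`, the JSON footer (standing in for the flatbuffers metadata), and the trailing 4-byte footer length. This is not the actual Feather format.

```python
import io
import json
import struct
from array import array


def write_toy_file(stream, columns):
    """Write each column's raw bytes, then a JSON metadata footer.

    Hypothetical layout for illustration only; JSON stands in for the
    flatbuffers metadata and this is not the real Feather format.
    """
    meta = []
    for name, col in columns.items():
        offset = stream.tell()
        stream.write(col.tobytes())
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": offset, "length": len(col)})
    footer = json.dumps(meta).encode("utf-8")
    stream.write(footer)
    stream.write(struct.pack("<I", len(footer)))  # footer size goes last


def read_toy_column(stream, name):
    """Locate one column through the footer and read only its bytes."""
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(stream.read(footer_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    col = array(entry["typecode"])
    stream.seek(entry["offset"])
    col.frombytes(stream.read(entry["length"] * col.itemsize))
    return col


buf = io.BytesIO()
write_toy_file(buf, {"session_id": array("q", [1331246660, 1331246351]),
                     "score": array("d", [0.5, 1.5])})
scores = read_toy_column(buf, "score")
```

Reading a column is a seek plus one bulk read into a typed buffer, with no per-value parsing, which is the property that lets read speed approach disk IO speed.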

Page 12: Next-generation Python Big Data Tools, powered by Apache Arrow

Real World Example: Feather File Format for Python and R

R

    library(feather)

    # df: an existing data frame
    path <- "my_data.feather"
    write_feather(df, path)

    df <- read_feather(path)

Python

    import feather

    # df: an existing pandas DataFrame
    path = 'my_data.feather'

    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Page 13: Next-generation Python Big Data Tools, powered by Apache Arrow

Apache Parquet: Binary columnar storage format

• I just became a Parquet committer!
• github.com/apache/parquet-cpp
• Python users will soon be able to read Parquet files via PyArrow
• parquet-cpp <-> PyArrow <-> pandas

Page 14: Next-generation Python Big Data Tools, powered by Apache Arrow

Language Bindings

• Target Languages
  • Java (beta)
  • CPP (underway)
  • Python & Pandas (underway)
  • R
  • Julia
• Initial Focus
  • Read a structure
  • Write a structure
  • Manage memory

Page 15: Next-generation Python Big Data Tools, powered by Apache Arrow

pandas and Arrow in context

Page 16: Next-generation Python Big Data Tools, powered by Apache Arrow

RPC & IPC: Moving Data Between Systems

RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored IO
  • Scatter/gather reads/writes against socket

IPC
• Alpha implementation using memory mapped files
  • Moving data between Python and Drill
• Working on shared allocation approach
  • Shared reference counting and well-defined ownership semantics
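
The memory-mapped-file approach to IPC can be sketched with the Python standard library. This is an illustration of the general technique under simple assumptions (one column of 64-bit integers in native byte order, a temporary scratch file, producer and consumer in the same script). It is not the Arrow IPC implementation.

```python
import mmap
import os
import tempfile
from array import array

# Producer: write a column of 64-bit integers to a file as raw bytes.
column = array("q", [10, 20, 30, 40])
fd, path = tempfile.mkstemp(suffix=".bin")  # hypothetical scratch file
with os.fdopen(fd, "wb") as f:
    f.write(column.tobytes())

# Consumer: map the file and reinterpret the bytes in place.
with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    # The views are released (in reverse order) before the map is closed.
    with memoryview(mapped) as raw, raw.cast("q") as view:
        total = sum(view)  # values are read straight from the mapping
        last = view[-1]

os.remove(path)
```

The consumer never parses or deserializes anything: it agrees with the producer on the layout and reads the mapped bytes directly. The nested `with` blocks also hint at why ownership semantics matter, since the mapping cannot be closed while views onto it are still alive.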

Page 17: Next-generation Python Big Data Tools, powered by Apache Arrow

Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase

Python, R, Julia, …?

Page 18: Next-generation Python Big Data Tools, powered by Apache Arrow

Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL engine passes input partitions (in partition 0 … in partition n - 1) as input to user-supplied Python code, one Python function call per partition; each output becomes an output partition (out partition 0 … out partition n - 1) handed back to the SQL engine]
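
The per-partition flow in this diagram can be sketched in plain Python. The driver `run_udf`, the example function `add_duration_ms`, and the column names are all hypothetical, and a list comprehension stands in for the SQL engine. The point is only that the user function receives a whole columnar batch per partition rather than one row at a time.

```python
from array import array


def run_udf(partitions, func):
    """Apply func to each input partition and collect output partitions.

    Hypothetical driver standing in for the SQL engine; each partition is
    a dict of column name -> array, passed to the function as one batch.
    """
    return [func(part) for part in partitions]


def add_duration_ms(batch):
    """Example user-supplied function: derive a new column per batch."""
    start, end = batch["start_ms"], batch["end_ms"]
    return {"duration_ms": array("q", (e - s for s, e in zip(start, end)))}


in_partitions = [
    {"start_ms": array("q", [0, 100]), "end_ms": array("q", [50, 175])},
    {"start_ms": array("q", [200]), "end_ms": array("q", [260])},
]
out_partitions = run_udf(in_partitions, add_duration_ms)
durations = [list(p["duration_ms"]) for p in out_partitions]
```

If the engine and the Python process shared one columnar memory format, the batches could be handed over without conversion in either direction.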

Page 19: Next-generation Python Big Data Tools, powered by Apache Arrow

What’s Next

• Parquet for Python & C++
  • Using Arrow as intermediary
• Available IPC implementation
• Spark, Drill integration
  • Faster UDFs, storage interfaces

Page 20: Next-generation Python Big Data Tools, powered by Apache Arrow

Apache Arrow in practice

Page 21: Next-generation Python Big Data Tools, powered by Apache Arrow

Get Involved

• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow

Page 22: Next-generation Python Big Data Tools, powered by Apache Arrow

Thank you
Wes McKinney, @wesmckinn
Views are my own