Ibis:ScalingPythonAnaly=cs on...

29
1 © Cloudera, Inc. All rights reserved. Ibis: Scaling Python Analy=cs on Hadoop and Impala Wes McKinney, Budapest BI Forum 20151014 @wesmckinn

Transcript of Ibis:ScalingPythonAnaly=cs on...

Page 1: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

1  ©  Cloudera,  Inc.  All  rights  reserved.  

Ibis:  Scaling  Python  Analy=cs  on  Hadoop  and  Impala  Wes  McKinney,  Budapest  BI  Forum  2015-­‐10-­‐14  @wesmckinn  

Page 2: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

2  ©  Cloudera,  Inc.  All  rights  reserved.  

Me  

• R&D  at  Cloudera  •  Serial  creator  of  structured  data  tools  /  user  interfaces  • Mathema=cian  —  MIT  ‘07  •  “Professional  SQL  programmer”  2007-­‐2010  (@  AQR)  • Created  pandas  (Python  library)  in  2008  • Wrote  bestseller  Python  for  Data  Analysis  2012  •  Founder  of  DataPad    

Page 3: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

3  ©  Cloudera,  Inc.  All  rights  reserved.  

Python  is  popular…  

• Python  has  become  a  standard  language  of  data  science  • Why  is  it  popular?  • Maximizes  produc=vity  for  data  engineers  and  data  scien=sts  • Build  robust  socware  and  do  interac=ve  data  analysis  with  100%  Python  code    • Easy-­‐to-­‐learn  and  makes  happy  and  produc=ve  data  teams    • Large,  diverse  open  source  development  community  • Comprehensive  libraries:  data  wrangling,  ML,  visualiza=on,  etc.  

• Main  use  case:  data  science  &  engineering  swiss  army  knife  on  small-­‐to-­‐medium  size  data  

Page 4: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

4  ©  Cloudera,  Inc.  All  rights  reserved.  

…but  Python  does  not  scale  today  

• Python  ecosystem  confined  to  single-­‐node  analysis  • Great  for  smaller  data  sets  • Requires  sampling  or  aggrega=ons  for  larger  data  • Distributed  tools  compromise  in  various  ways  

• Extrac=ng  samples  or  aggrega=ons  for  larger  data  means:  • “Scales”  by  losing  more  fidelity  • Addi=onal  ETL  overhead  to  extract  samples/aggrega=ons  • Loss  of  produc=vity  with  mul=ple  languages,  tools,  etc  • Blocks  certain  analysis  and  use  cases  

Page 5: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

5  ©  Cloudera,  Inc.  All  rights  reserved.  

Industry  Analy=cs   Scien=fic  Compu=ng  

Heterogeneous  data          Flat  tables  and  JSON  Spark  /  MapReduce  SQL  DFS-­‐friendly  /  streaming  data  formats  More  physical  machines  

Homogeneous  data          Mul=dimensional  arrays  HPC  tools  Linear  algebra  Scien=fic  data  formats  Fewer  physical  machines  

Some  simplis=c  generaliza=ons  

Page 6: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

6  ©  Cloudera,  Inc.  All  rights  reserved.  

Industry  Analy=cs   Scien=fic  Compu=ng  

Heterogeneous  data          Flat  tables  and  JSON  Spark  /  MapReduce  SQL  DFS-­‐friendly  /  streaming  data  formats  More  physical  machines  

Homogeneous  data          Mul=dimensional  arrays  HPC  tools  Linear  algebra  Scien=fic  data  formats  (e.g.  HDF5)  Fewer  physical  machines  

Some  simplis=c  generaliza=ons  

Python:  heavy  investment,    generally  

Python:  light  investment,  generally  

Page 7: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

7  ©  Cloudera,  Inc.  All  rights  reserved.  

pandas  

• Hugely  popular  Python  table  /  “data  frame”  library  • Labeled  table,  array,  and  =me  series  data  structures  

• Popular  for  data  prepara=on,  ETL,  and  in-­‐memory  analy=cs  • Built  using  Python’s  scien=fic  compu=ng  stack  • User  API  /  domain  specific  language  • Bespoke  in-­‐memory  analy=cs  /  rela=onal  algebra  engine  •  IO  interfaces  (CSV,  SQL,  etc.)  • Expanded  data  type  system  (beyond  NumPy)  

•  Supports  flat  data  only  (or  semi-­‐structured  data  that  can  be  flaqened)  

Page 8: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

8  ©  Cloudera,  Inc.  All  rights  reserved.  

Many  SQL  engines  

…  and  more  

Page 9: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

9  ©  Cloudera,  Inc.  All  rights  reserved.  

The  “Great  Decoupling”  for  Big  Data  UI

Ibis, SQL, Spark API, …

ComputeAnalytic SQL, Spark, MapReduce

StorageHDFS, Kudu, HBase

Page 10: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

10  ©  Cloudera,  Inc.  All  rights  reserved.  

A  sample  big  data  architecture  

Kafka

Kafka

Kafka

Kafka

Application dataHDFS

JSON Spark/MapReduce

Columnar storage

Analytic SQL Engine

User

SQL

Page 11: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

11  ©  Cloudera,  Inc.  All  rights  reserved.  

Nested  /  Complex  types  support  

• Arrays,  structs,  maps,  and  unions  as  first-­‐class  value  types  • Analyze  JSON-­‐like  data  directly  without  flaqening  or  normaliza=on  • Most  new  SQL  engines  have  some  level  of  support  •  Impala  • Presto  • Drill  • BigQuery  • Spark  SQL  • Hive  • …  

Page 12: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

12  ©  Cloudera,  Inc.  All  rights  reserved.  

Ibis  in  a  nutshell  

•  For  Python  programmers  doing  analy=cs  in  industry  • Project  Blog:  hqp://blog.ibis-­‐project.org  •  Joint  project  with  Impala  team  @  Cloudera  • Apache-­‐licensed,  open  source  hqp://github.com/cloudera/ibis    • Cracing  a  compelling  Python-­‐on-­‐Hadoop  user  experience  • Remove  SQL  coding  from  user  workflows  • Develop  high  performance  Python  extension  APIs  

Page 13: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

13  ©  Cloudera,  Inc.  All  rights  reserved.  

Ibis  in  a  nutshell,  cont’d  

• Composable  Python  DSL  (“Ibis  expressions”)  makes  hand-­‐coding  SQL  SELECT  statements  unnecessary  •  Ibis  for  SQL  Programmers:  hqp://docs.ibis-­‐project.org/sql.html  • Development  roadmap  targets  Impala  (C++  /  LLVM)  query  engine  • …  but  SQL  compiler  toolchain  is  general  purpose  

• Current  supports  Impala  and  SQLite,  but  soon  other  dialects  • We  welcome  external  contributors  for  other  Analy=c  SQL  engines  

Page 14: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

14  ©  Cloudera,  Inc.  All  rights  reserved.  

Page 15: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

15  ©  Cloudera,  Inc.  All  rights  reserved.  

Benefits  of  Ibis  

• Maximize  developer  produc=vity  • Mirrors  single-­‐node  Python  experience  • Solve  big  data  problems  without  leaving  Python  • Leverage  Python  skills,  ecosystem,  and  tools  

• Python  as  first-­‐class  language  for  Hadoop  • Full-­‐fidelity  analysis  without  extrac=ons  • Python  analysis  at  any  scale  • Na=ve  hardware  speeds  for  a  broad  set  of  use  cases  

Page 16: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

16  ©  Cloudera,  Inc.  All  rights  reserved.  

Brief  interac=ve  demo  

Page 17: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

17  ©  Cloudera,  Inc.  All  rights  reserved.  

Ibis/Impala  Joint  Roadmap  

• More  natural  data  modeling  • Complex  types  support  

•  Integra=on  with  full  Python  data  ecosystem  • Advanced  analy=cs  +  machine  learning  • Enable  use  of  performance  compu=ng  tools  

• User  extensibility  with  na=ve  performance  •  In-­‐memory  columnar  format  • Python-­‐to-­‐LLVM  IR  compila=on  

• Workflow  and  usability  tools  

Page 18: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

18  ©  Cloudera,  Inc.  All  rights  reserved.  

Execu=ng  data  science  languages  in  the  compute  layer  

UIIbis, SQL, Spark API, …

ComputeAnalytic SQL, Spark, MapReduce

StorageHDFS, Kudu, HBase

Python, R, Julia, …?

Page 19: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

19  ©  Cloudera,  Inc.  All  rights  reserved.  

Enabling  interoperability  with  big  data  systems  

• Distributed  /  MPP  query  engines:  implemented  in  a  host  language  • Typically  C/C++  or  Java/Scala  

• User-­‐defined  func=ons  (UDFs)  through  various  means  •  Implement  in  host  language  •  Implement  in  user  language  through  some  external  language  protocol  (ocen  RPC-­‐based)  

• External  UDFs  are  usually  very  slow  (cf:  PL/Python,  PySpark,  etc.)  

Page 20: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

20  ©  Cloudera,  Inc.  All  rights  reserved.  

What  are  UDFs  good  for?  

• Note:  industry  data  scien=sts  have  libraries  containing  100s  of  UDFs  for  Hive  or  other  distributed  query  engines  

• Custom  data  transforma=ons  • Custom  domain  logic  (date  /  =me  /  data  types)  • Custom  data  types  • Custom  aggrega=ons  (incl.  machine  learning  /  sta=s=cs  expressible  as  reduc=ons)  

Page 21: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

21  ©  Cloudera,  Inc.  All  rights  reserved.  

Why  are  external  UDFs  slow?  

•  Serializa=on  /  deserializa=on  overhead  •  Scalar  vs  vectorized  computa=ons  • RPC  overhead  

Page 22: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

22  ©  Cloudera,  Inc.  All  rights  reserved.  

Example:  Vectoriza=on  for  interpreted  languages  

SUM(CASE WHEN x > y THEN x ELSE x + y END)

Page 23: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

23  ©  Cloudera,  Inc.  All  rights  reserved.  

Vectorized  vs  Interpreted  perf  

Page 24: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

24  ©  Cloudera,  Inc.  All  rights  reserved.  

How  to  make  them  fast?  

• Common  run=me  memory  representa=on  for  tabular  data  •  Share-­‐memory  (zero-­‐copy  or  memcpy-­‐only)  external  UDF  protocol  • Vectorized  UDF  interface  (for  interpreted  languages)  •  Impala  is  uniquely  posi=oned  to  play  well  with  Ibis  • Best-­‐in-­‐class  performance  and  scalability  • C++  and  LLVM-­‐based  (JIT  compiler)  run=me  • Unified,  efficient  data  interchange  amongst  Ibis,  Impala,  and  Kudu  will  enable  high  performance  real  =me  analy=cs  from  Python  

Page 25: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

25  ©  Cloudera,  Inc.  All  rights  reserved.  

Memory  representa=on  

• Many  query  engines  are  standardizing  on  in-­‐memory  columnar  rep’n  of  materialized  transient  data  •  Impala:  hqp://blog.cloudera.com/blog/2015/07/whats-­‐next-­‐for-­‐impala-­‐more-­‐reliability-­‐usability-­‐and-­‐performance-­‐at-­‐even-­‐greater-­‐scale/  • Apache  Drill:  hqps://drill.apache.org/faq/  

•  Industry-­‐standard  serializa=on  format:  Apache  Parquet  • hqps://parquet.apache.org/  

Page 26: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

26  ©  Cloudera,  Inc.  All  rights  reserved.  

Serializa=on  vs  In-­‐memory  

•  Serializa=on  formats  (e.g.  Parquet)    • Op=mize  for  IO  /  DFS  throughput  at  expense  of  CPU/memory  bus  throughput  • Do  not  consider  random  access  or  in-­‐memory  analy=cs  as  a  goal  

• No  standardized  in-­‐memory  containers  for  materialized  data  from  file  /  RPC  protocols  (Parquet,  Thric,  protobuf,  Avro,  etc.)  

Page 27: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

27  ©  Cloudera,  Inc.  All  rights  reserved.  

Standardized  in-­‐memory  columnar  (IMC)  

• Compact  in-­‐memory  representa=on  for  semistructured  data  • Part  of  Impala’s  upcoming  dev  roadmap  •  Some  prior  IMC-­‐for-­‐SQL  work:  Apache  Drill  •  Standardized  memory  representa=on  means  data  can  be  shared  without  serializa=on  • Create  a  canonical  C/C++  implementa=on  for  use  in  Python  /  R  /  Julia  

Page 28: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

28  ©  Cloudera,  Inc.  All  rights  reserved.  

Ibis’s  Vision  

• Uncompromised  Python  experience  • 100%  Python  end-­‐to-­‐end  user  workflows    • Enable  integra=on  with  the  exis=ng  Python  data  ecosystem  (pandas,  scikit-­‐learn,  NumPy,  etc)  

•  Interac=ve  at  big  data  scale  • Full-­‐fidelity  analysis  without  extrac=ons  • Scalability  for  big  data  • Na=ve  hardware  speeds  for  a  broad  set  of  use  cases  

Page 29: Ibis:ScalingPythonAnaly=cs on HadoopandImpalabiconsulting.hu/letoltes/2015budapestbi/budapestbiforum... · 2015-10-27 · HDFS, Kudu, HBase ©"Cloudera,"Inc."All"rights"reserved."

29  ©  Cloudera,  Inc.  All  rights  reserved.  

Thank  you  Wes  McKinney  @wesmckinn  Views  are  my  own