1 © Cloudera, Inc. All rights reserved.
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney, Budapest BI Forum 2015-10-14
@wesmckinn
Me
• R&D at Cloudera
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT ‘07
• “Professional SQL programmer” 2007-2010 (@ AQR)
• Created pandas (Python library) in 2008
• Wrote bestseller Python for Data Analysis 2012
• Founder of DataPad
Python is popular…
• Python has become a standard language of data science
• Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy-to-learn and makes happy and productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Main use case: data science & engineering swiss army knife on small-to-medium size data
…but Python does not scale today
• Python ecosystem confined to single-node analysis
  • Great for smaller data sets
  • Requires sampling or aggregations for larger data
• Distributed tools compromise in various ways
• Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analysis and use cases
Some simplistic generalizations
Industry Analytics:
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
Scientific Computing:
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines
Python: heavy investment, generally
Python: light investment, generally
pandas
• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
Many SQL engines
… and more
The “Great Decoupling” for Big Data
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
A sample big data architecture
[Diagram labels: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]
Nested / Complex types support
• Arrays, structs, maps, and unions as first-class value types
• Analyze JSON-like data directly without flattening or normalization
• Most new SQL engines have some level of support
  • Impala
  • Presto
  • Drill
  • BigQuery
  • Spark SQL
  • Hive
  • …
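A small sketch of the contrast, in plain Python rather than any SQL engine's syntax. The `orders` records and their field names are made up for illustration. The same per-order total is computed twice: once after flattening the nested array into rows, and once directly on the array-of-structs field.

```python
# Hypothetical JSON-like records: each order holds an array of structs.
orders = [
    {"order_id": 1, "items": [{"sku": "a", "qty": 2, "price": 3.0},
                              {"sku": "b", "qty": 1, "price": 10.0}]},
    {"order_id": 2, "items": [{"sku": "a", "qty": 5, "price": 3.0}]},
]


def flatten(records):
    """Normalize nested records into flat rows (the extra step nested types avoid)."""
    return [
        {"order_id": rec["order_id"], **item}
        for rec in records
        for item in rec["items"]
    ]


def revenue_flat(records):
    """Revenue per order, computed from the flattened table."""
    totals = {}
    for row in flatten(records):
        key = row["order_id"]
        totals[key] = totals.get(key, 0.0) + row["qty"] * row["price"]
    return totals


def revenue_nested(records):
    """Revenue per order, computed directly on the nested 'items' field."""
    return {
        rec["order_id"]: sum(i["qty"] * i["price"] for i in rec["items"])
        for rec in records
    }


print(revenue_flat(orders))
print(revenue_nested(orders))
```

Both functions return the same totals. The nested version skips the intermediate flat table, which is the step that first-class complex types are meant to remove.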
Ibis in a nutshell
• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Joint project with Impala team @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance Python extension APIs
Ibis in a nutshell, cont’d
• Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
• Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
• Development roadmap targets Impala (C++ / LLVM) query engine
• … but SQL compiler toolchain is general purpose
• Currently supports Impala and SQLite, but soon other dialects
• We welcome external contributors for other Analytic SQL engines
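To illustrate what "composable expressions that compile to SELECT statements" can look like, here is a deliberately tiny sketch. The `Table` class and its methods are hypothetical and are not the Ibis API; they only mimic the idea of building a query step by step and rendering SQL at the end. SQLite from the standard library stands in for the backend.

```python
import sqlite3


class Table:
    """Toy immutable table expression. NOT the Ibis API, only an illustration."""

    def __init__(self, name, filters=(), group=(), aggs=()):
        self.name, self.filters, self.group, self.aggs = name, filters, group, aggs

    def filter(self, predicate):
        return Table(self.name, self.filters + (predicate,), self.group, self.aggs)

    def group_by(self, *columns):
        return Table(self.name, self.filters, columns, self.aggs)

    def aggregate(self, **named_exprs):
        aggs = tuple(f"{expr} AS {alias}" for alias, expr in named_exprs.items())
        return Table(self.name, self.filters, self.group, aggs)

    def compile(self):
        """Render the accumulated expression as one SELECT statement."""
        select = ", ".join(self.group + self.aggs) or "*"
        sql = f"SELECT {select} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group:
            sql += " GROUP BY " + ", ".join(self.group)
        return sql


# Made-up sample data in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, fare REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Budapest", 4.0), ("Budapest", 6.0), ("Vienna", 9.0), ("Vienna", 1.0)],
)

# Each step returns a new expression, so intermediate results can be reused.
expr = (
    Table("trips")
    .filter("fare > 2")
    .group_by("city")
    .aggregate(total="SUM(fare)")
)

sql = expr.compile()
print(sql)
rows = con.execute(sql + " ORDER BY city").fetchall()
print(rows)
```

The user never writes the SELECT statement by hand; the expression object carries the query and a compiler step produces the dialect-specific SQL. Real Ibis expressions are typed and far richer than this string-based toy.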
Benefits of Ibis
• Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools
• Python as first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases
Brief interactive demo
Ibis/Impala Joint Roadmap
• More natural data modeling
  • Complex types support
• Integration with full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of performance computing tools
• User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation
• Workflow and usability tools
Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
Enabling interoperability with big data systems
• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)
What are UDFs good for?
• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
Why are external UDFs slow?
• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
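A rough sketch of the first and third points, under stated assumptions: `pickle` stands in for whatever wire format an engine uses, no real RPC happens, and `plus_one` is a placeholder UDF. The point is only to count how many times data crosses the boundary when a function is called per row versus per batch.

```python
import pickle


def plus_one(value):
    """Stand-in for a user-defined function."""
    return value + 1


def call_per_row(values):
    """Simulate an external UDF invoked once per row: every value is
    serialized, handed over, deserialized, processed, and sent back."""
    out, round_trips = [], 0
    for v in values:
        request = pickle.dumps(v)                 # engine -> UDF process
        result = plus_one(pickle.loads(request))
        response = pickle.dumps(result)           # UDF process -> engine
        out.append(pickle.loads(response))
        round_trips += 1
    return out, round_trips


def call_per_batch(values):
    """Same work, but the whole column crosses the boundary once."""
    request = pickle.dumps(values)
    results = [plus_one(v) for v in pickle.loads(request)]
    response = pickle.dumps(results)
    return pickle.loads(response), 1


column = list(range(10_000))
row_out, row_trips = call_per_row(column)
batch_out, batch_trips = call_per_batch(column)
print(row_trips, batch_trips)
```

The results are identical, but the per-row path pays the serialization and hand-off cost once per value. With a real RPC hop, each of those round trips also adds latency.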
Example: Vectorization for interpreted languages
SUM(CASE WHEN x > y THEN x ELSE x + y END)
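One way to picture the two evaluation styles for this expression in Python, assuming NumPy is available. The function names and the random test data are illustrative only, and this is not how any particular engine evaluates it.

```python
import numpy as np


def sum_case_scalar(xs, ys):
    """Row-at-a-time: the interpreter evaluates the branch once per row."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x if x > y else x + y
    return float(total)


def sum_case_vectorized(xs, ys):
    """Column-at-a-time: each operation runs over whole arrays in compiled code."""
    return float(np.where(xs > ys, xs, xs + ys).sum())


# Made-up input columns; integer-valued floats keep both sums exact.
rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=1000).astype(np.float64)
y = rng.integers(0, 100, size=1000).astype(np.float64)

print(sum_case_scalar(x, y) == sum_case_vectorized(x, y))
```

Both functions compute the same value. The vectorized form makes a handful of interpreter-level calls regardless of row count, which is why a vectorized UDF interface matters for interpreted languages.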
Vectorized vs Interpreted perf
How to make them fast?
• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
• Impala is uniquely positioned to play well with Ibis
  • Best-in-class performance and scalability
  • C++ and LLVM-based (JIT compiler) runtime
• Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python
Memory representation
• Many query engines are standardizing on in-memory columnar rep’n of materialized transient data
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
  • Apache Drill: https://drill.apache.org/faq/
• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/
Serialization vs In-memory
• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
  • Do not consider random access or in-memory analytics as a goal
• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
Standardized in-memory columnar (IMC)
• Compact in-memory representation for semistructured data
• Part of Impala’s upcoming dev roadmap
• Some prior IMC-for-SQL work: Apache Drill
• Standardized memory representation means data can be shared without serialization
• Create a canonical C/C++ implementation for use in Python / R / Julia
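A minimal stdlib-only sketch of the "shared without serialization" idea. It is not the format under discussion, just Python's `array` and `memoryview` used as a stand-in, and the column names and values are made up. A consumer holding a view on a column buffer sees the producer's bytes directly, while a serialized copy is a detached snapshot.

```python
from array import array

# Row-oriented: one Python tuple per record.
rows = [(1, 10.5), (2, 20.0), (3, 30.25)]

# Column-oriented: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))       # 64-bit signed integers
amounts = array("d", (r[1] for r in rows))   # 64-bit floats

# A memoryview exposes the column's buffer without copying it, so a
# consumer reads the same memory the producer wrote.
shared = memoryview(amounts)
amounts[0] = 99.0             # producer updates the column in place
zero_copy_value = shared[0]   # consumer observes the change, no serialization

# A serialized copy is a snapshot and does not see later writes.
snapshot = amounts.tobytes()
amounts[1] = -1.0
copied = array("d")
copied.frombytes(snapshot)

print(zero_copy_value, shared[1], copied[1])
```

The view reflects both in-place updates, while the deserialized copy still holds the old value. A standardized columnar layout would let different runtimes (Python, R, Julia, a C++ engine) take the view-style path instead of the copy-style one.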
Ibis’s Vision
• Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
• Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases
Thank you
Wes McKinney
@wesmckinn
Views are my own