Download - DataFrames: The Good, Bad, and Ugly

Transcript
Page 1: DataFrames: The Good, Bad, and Ugly

1  ©  Cloudera,  Inc.  All  rights  reserved.  

DataFrames:  The  Good,  Bad,  and  Ugly  Wes  McKinney  @wesmckinn  NY  R  Conference,  2015-­‐04-­‐25  

Page 2: DataFrames: The Good, Bad, and Ugly

2  ©  Cloudera,  Inc.  All  rights  reserved.  

Disclaimer:  the  views  presented  in  this  talk  are  my  personal  opinions  and  not  necessarily  those  of  Cloudera  

Page 3: DataFrames: The Good, Bad, and Ugly

3  ©  Cloudera,  Inc.  All  rights  reserved.  

This  talk  

•  Some  commentary  on  all  the  data  frame  interfaces  out  there  • Biased  observaTons  and  cursory  judgments  • Thoughts  on  craVing  high  quality  data  tools  

Page 4: DataFrames: The Good, Bad, and Ugly

4  ©  Cloudera,  Inc.  All  rights  reserved.  

Disclaimer  #2:  This  is  a  nuanced  discussion  

Page 5: DataFrames: The Good, Bad, and Ugly

5  ©  Cloudera,  Inc.  All  rights  reserved.  

Who  am  I?  

•  Father  of  pandas  (2008  -­‐  )    •  Financial  analyTcs  in  R  /  Python  starTng  2007  • 2010-­‐2012  • Hiatus  from  gainful  employment  • Make  pandas  ready  for  primeTme  • Write  "Python  for  Data  Analysis”  

• 2013-­‐2014:  DataPad  with  Chang  She  &  co  • 2014  -­‐  :  Cloudera  

Page 6: DataFrames: The Good, Bad, and Ugly

6  ©  Cloudera,  Inc.  All  rights  reserved.  

What’s  in  a  DataFrame?        

Page 7: DataFrames: The Good, Bad, and Ugly

7  ©  Cloudera,  Inc.  All  rights  reserved.  

What’s  in  a  DataFrame?  A  table  with  some  rows  By  any  other  name                      would  analyze  as  sweet.  

Page 8: DataFrames: The Good, Bad, and Ugly

8  ©  Cloudera,  Inc.  All  rights  reserved.  

Got  a  table?    

Page 9: DataFrames: The Good, Bad, and Ugly

9  ©  Cloudera,  Inc.  All  rights  reserved.  

Got  a  table?  Put  a  DataFrame  (interface)  on  it!  

Page 10: DataFrames: The Good, Bad, and Ugly

10  ©  Cloudera,  Inc.  All  rights  reserved.  

What  is  this  “data  frame”  that  you  speak  of  

• A  table-­‐like  data  structure  • An  API  /  user  interface  for  the  table  • SelecTng  data  • Math  and  relaTonal  algebra  (join,  filter,  etc.)  • File  /  database  IO  • ad  infinitum  

Page 11: DataFrames: The Good, Bad, and Ugly

11  ©  Cloudera,  Inc.  All  rights  reserved.  

Some  axes  of  comparison  

• Data  structure  internals  (types,  in-­‐memory  representaTon,  etc.)  • Basic  table  API  • RelaTonal  algebra  support  • Group-­‐by  /  split-­‐apply-­‐combine  API  • Performance,  memory  use,  evaluaTon  semanTcs  • Missing  data  • Data  Tdying  /  ETL  tools  •  IO  uTliTes  • Domain  specific  tools  (e.g.  Tme  series)  • …  

Page 12: DataFrames: The Good, Bad, and Ugly

12  ©  Cloudera,  Inc.  All  rights  reserved.  

The  Great  Data  Tool  Decoupling™  

• Thesis:  over  Tme,  user  interfaces,  data  storage,  and  execuTon  engines  will  decouple  and  specialize  •  In  fact,  you  should  really  want  this  to  happen  • Share  systems  among  languages  • Reduce  fragmentaTon  and  “lock-­‐in”  • ShiV  developer  focus  to  usability    

• PredicTon:  we’ll  be  there  by  2025;  sooner  if  we  all  get  our  act  together  

Page 13: DataFrames: The Good, Bad, and Ugly

13  ©  Cloudera,  Inc.  All  rights  reserved.  

CraVing  quality  data  tools  

• Quality  /  usefulness  is  usually  forged  by  the  fire  of  basle  

• Real  world  use  cases  and  social  proof  trump  theory  

• Eat  that  dog  food  

• When  in  doubt?  Look  at  the  test  suite.  

Page 14: DataFrames: The Good, Bad, and Ugly

14  ©  Cloudera,  Inc.  All  rights  reserved.  

R  data  frames  

• Thin  layer  on  top  of  R  list  type  • Sequence  of  named  vectors  • Can  have  row  names  (any  R  vector)  

•  Simple  column  and  row  selecTon  API  • AnalyTcs,  data  transformaTon,  etc.  leV  to  base  package  and  add-­‐on  libraries  • Richness  /  usability  comes  largely  from  libraries  

Page 15: DataFrames: The Good, Bad, and Ugly

15  ©  Cloudera,  Inc.  All  rights  reserved.  

Some  awesome  R  data  frame  stuff  

•  “Hadley  Stack”  • dplyr,  Tdyr  •  legacy:  plyr,  reshape2    • ggplot2  

• data.table  (data.frame  +  indices,  fast  algorithms)  •  xts  :  Tme  series    

Page 16: DataFrames: The Good, Bad, and Ugly

16  ©  Cloudera,  Inc.  All  rights  reserved.  

R  data  frames:  rough  edges  

• Copy-­‐on-­‐write  semanTcs  • API  fragmentaTon  /  inconsistency  • Use  the  “Hadley  stack”  for  improved  sanity  

•  Factor  /  String  dichotomy  • stringsAsFactors=FALSE  a  blessing  and  curse  

•  Somewhat  limited  type  system  

Page 17: DataFrames: The Good, Bad, and Ugly

17  ©  Cloudera,  Inc.  All  rights  reserved.  

dplyr  

• Composable  table  API  • Good  example  of  what  the  “decoupled”  future  might  look  like  • New  in-­‐memory  R/RCpp  execuTon  engine  • SQL  backends  for  large  subset  of  API  

Page 18: DataFrames: The Good, Bad, and Ugly

18  ©  Cloudera,  Inc.  All  rights  reserved.  

Spark  DataFrames  

• R/pandas-­‐inspired  API  for  tabular  data  manipulaTon  in  Scala,  Python,  etc.  •  Logical  operaTon  graphs  rewrisen  internally  in  more  efficient  form  • Good  interop  with  Spark  SQL  •  Some  interoperability  with  pandas  • ParTal  API  Decoupling!  (it  sTll  binds  you  to  Spark)  

Page 19: DataFrames: The Good, Bad, and Ugly

19  ©  Cloudera,  Inc.  All  rights  reserved.  

pandas  

•  Several  key  data  structures,  data  frame  among  them  • Considerably  more  complex  internals  than  other  data  frame  libraries  •  Some  good  things  • Born  of  need  • A  “baseries  included”  approach  • Hierarchical  axis  labeling:  addresses  some  hard  use  cases  at  expense  of  semanTc  complexity  • Strong  Tme  series  support  

Page 20: DataFrames: The Good, Bad, and Ugly

20  ©  Cloudera,  Inc.  All  rights  reserved.  

pandas:  rough  edges  

• Axis  labelling  can  get  in  the  way  for  folks  needing  “just  a  table”  • Ceded  control  of  its  type  system  /  data  rep’n  from  day  1  to  NumPy  •  Inefficient  string  handling  (uses  NumPy  object  arrays)  • Missing  data  handling  less  precise  than  other  tools  

Page 21: DataFrames: The Good, Bad, and Ugly

21  ©  Cloudera,  Inc.  All  rights  reserved.  

Julia:  DataFrames.jl  

•  Started  by  Harlan  Harris  &  co  • Part  of  broader  JuliaStats  iniTaTve  • More  R-­‐like  than  pandas-­‐like  • Very  acTve:  >  50  contributors!    

•  STll  comparaTvely  early  • Less  comprehensive  API  • More  limited  IO  capabiliTes  

Page 22: DataFrames: The Good, Bad, and Ugly

22  ©  Cloudera,  Inc.  All  rights  reserved.  

Other  data  frames  

•  Saddle  (Scala)  • Dev’d  by  Adam  Klein  (ex-­‐AQR)  at  Novus  Partners  (fintech  startup)  • Designed  and  used  for  financial  use  cases  

• Deedle  (F#  /  .NET)  • Dev’d  by  AK’s  colleagues  at  BlueMountain  (hedge  fund)  

• GraphLab  /  Dato  • Really  good  C++  data  frame  with  Python  interface  • Dual-­‐licensed:  AGPL  +  Commercial  

• That’s  not  all!  Haskell,  Go,  usw…  

Page 23: DataFrames: The Good, Bad, and Ugly

23  ©  Cloudera,  Inc.  All  rights  reserved.  

We’re  not  done  yet  

• The  future  is  JSON-­‐like  • Support  for  nested  types  /  semi-­‐structured  data  is  sTll  weak  

• Wanted:  Apache-­‐licensed,  community  standard  C/C++  data  frame  that  we  all  use  (R,  Python,  Julia)  • Bring  on  the  Great  Decoupling  

Page 24: DataFrames: The Good, Bad, and Ugly

24  ©  Cloudera,  Inc.  All  rights  reserved.  

Thank  you  @wesmckinn