Why Python is better for Data Science

WHY PYTHON IS BETTER FOR DATA SCIENCE ÍCARO MEDEIROS São Paulo Big Data Meetup São Paulo - SP, 25/11/2015

Transcript of Why Python is better for Data Science

Page 1: Why Python is better for Data Science

WHY PYTHON IS BETTER FOR DATA SCIENCE

ÍCARO MEDEIROS

São Paulo Big Data Meetup, São Paulo - SP, 25/11/2015

Page 2: Why Python is better for Data Science

DATA SCIENTISTS SHOULD DO…

http://berkeleysciencereview.com/article/first-rule-data-science/

Page 3: Why Python is better for Data Science

WHY PYTHON?

▸ General purpose

▸ Smooth learning curve

▸ REPL (IPython!)

▸ Programmer productivity

▸ Popular and mature

▸ Glue language (high level API, low level C/Fortran bindings)

▸ Science ecosystem (growing!)

Page 4: Why Python is better for Data Science

PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS

http://githut.info/

Page 5: Why Python is better for Data Science

PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS

pypl.github.io/PYPL.html

Page 6: Why Python is better for Data Science

AVOID THE TWO LANGUAGE PROBLEM

Page 7: Why Python is better for Data Science

PYTHON CAN BE USED IN THE WHOLE DATA SCIENCE WORKFLOW

https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22

Page 8: Why Python is better for Data Science

ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD…

AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES

Jeff Hammerbacher

http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/

Page 9: Why Python is better for Data Science

OPTIONS FOR PROCESSING PIPELINE

Airflow: https://github.com/airbnb/airflow

Luigi: https://github.com/spotify/luigi
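Both tools describe a pipeline as a graph of tasks with dependencies. The sketch below shows that idea using only the standard library (`graphlib`, Python 3.9+); it is not the Airflow or Luigi API, and the task names (`extract`, `transform`, `load`, `report`) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_pipeline(deps, actions):
    """Run every task after its dependencies and return the execution order."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        actions[task]()
    return order


# Each action just records its own name; a real task would do actual work.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}
order = run_pipeline(dependencies, actions)
print(order)
```

A real scheduler adds retries, scheduling and monitoring on top of this ordering step.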

Page 10: Why Python is better for Data Science

AIRFLOW EXAMPLE

https://github.com/airbnb/airflow

Page 11: Why Python is better for Data Science

REGRESSION ANALYSIS IN PYTHON: EASY

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
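The linked notebook fits ordinary least squares with statsmodels. As a dependency-free sketch of what such a fit computes for a single predictor, the closed-form estimates can be written in a few lines. The function name `fit_line` and the sample data are made up for illustration and do not come from the notebook.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing squared error for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative sample lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

A library such as statsmodels adds what this sketch leaves out: multiple predictors, standard errors and summary statistics.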

Page 12: Why Python is better for Data Science
Page 13: Why Python is better for Data Science

PYTHON <3 BIG DATA

▸ mrjob: map reduce in python (https://pythonhosted.org/mrjob…)

▸ snakebite: pure python HDFS client (https://github.com/spotify/snakebite)

▸ Spark: fast and general engine for large-scale data processing (http://spark.apache.org)
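To show what "map reduce in python" looks like in miniature, here is a word count with a mapper, a reducer and a tiny in-memory driver. This is a sketch of the programming model only, not mrjob's API; `run_job` is a hypothetical stand-in for the shuffle step a real framework performs across machines.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(lines):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


counts = run_job(["python fits all", "Python is popular"])
print(counts)
```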

Page 14: Why Python is better for Data Science

OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]

DataFrame operations are optimized and compiled into JVM bytecode

https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html

Page 15: Why Python is better for Data Science

RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK’

Page 16: Why Python is better for Data Science

RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK’

SO CONCISE
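The slide's code did not survive in this transcript. The sketch below assumes the average is computed in the style of PySpark's `RDD.aggregate(zeroValue, seqOp, combOp)`, carrying a running (sum, count) pair. It imitates those semantics in plain Python over hypothetical in-memory partitions, so it runs without Spark; the helper `aggregate` is illustrative, not a Spark function.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partials with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Hypothetical data split into two partitions.
partitions = [[1, 2], [3, 4]]

total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # add one value to (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two (sum, count) pairs
)
average = total / count
print(average)
```

Carrying the count alongside the sum lets the average be computed in one pass over the data.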

Page 17: Why Python is better for Data Science

COMMUNICATE RESULTS WITH IPYTHON / JUPYTER

Language agnostic :)

Page 18: Why Python is better for Data Science

COMMUNICATE RESULTS WITH IPYTHON / JUPYTER

DEMO TIME

Page 19: Why Python is better for Data Science

MATPLOTLIB / SEABORN / PLOT.LY / BOKEH: SUCH VISUALIZATION!!

Page 20: Why Python is better for Data Science

PYTHON FITS ALL!

Page 21: Why Python is better for Data Science

PYTHON FITS ALL!

Page 22: Why Python is better for Data Science

PYTHON FOR SCIENCE IS

G R O W I N G

Page 23: Why Python is better for Data Science

SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY

#   module      imports  imports/numpy
1   sys         2437939  5.85
2   os          2009086  4.82
3   re          1303009  3.12
4   numpy        416981  1.00
5   warnings     371345  0.89
6   subprocess   344934  0.83
7   django       282097  0.68
8   math         281987  0.68

11  matplotlib   146913  0.35
13  pylab         77817  0.19
14  scipy         69092  0.17
22  pandas        18928  0.05
24  theano         5482  0.051

6 OF THE 25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE

https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement

Page 24: Why Python is better for Data Science

SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION

https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement

import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# With the @ operator
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

$$S = (H\beta - r)^{T} (H V H^{T})^{-1} (H\beta - r)$$

PEP 0465: PROPOSED FEB/14. SINCE PY 3.5 (SEP/15)

2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON

AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
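PEP 465 added the `@` operator to the language through the `__matmul__` special method, so any class can opt in, not only NumPy arrays. A minimal sketch, assuming a toy `Matrix` class (hypothetical, for illustration only) backed by nested lists:

```python
class Matrix:
    """Toy matrix backed by a list of rows; illustrative only."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Called for `self @ other` (Python 3.5+).
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions do not match")
        cols = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in cols]
             for row in self.rows]
        )


A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])
C = A @ B
print(C.rows)
```

NumPy implements the same hook with optimized routines, which is why the `@` version of the formula works on arrays.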

Page 25: Why Python is better for Data Science

SCIENCE STACK IS GETTING BETTER EACH DAY

https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8

Page 26: Why Python is better for Data Science

SCIENCE STACK IS ALWAYS EVOLVING…

https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29

Page 27: Why Python is better for Data Science

CONDA: AUTOMATING ENVIRONMENTS

https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=60

Page 28: Why Python is better for Data Science

THE STACK IS STILL GETTING NEW MEMBERS…

http://www.tensorflow.org/

Page 29: Why Python is better for Data Science

TAKEAWAY MESSAGE

TRY PYTHON. IT WILL BE A ONE WAY TRIP!

Page 30: Why Python is better for Data Science

slides icaromedeiros.com.br

slideshare.net/icaromedeiros

@icaromedeiros