Why Python is better for Data Science

Post on 11-Apr-2017

ÍCARO MEDEIROS

São Paulo Big Data Meetup, São Paulo - SP, 25/11/2015

DATA SCIENTISTS SHOULD DO…

http://berkeleysciencereview.com/article/first-rule-data-science/

WHY PYTHON?

▸ General purpose

▸ Smooth learning curve

▸ REPL (IPython!)

▸ Programmer productivity

▸ Popular and mature

▸ Glue language (high-level API, low-level C/Fortran bindings)

▸ Science ecosystem (growing!)

PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS

http://githut.info/

pypl.github.io/PYPL.html

AVOID THE TWO LANGUAGE PROBLEM

PYTHON CAN BE USED IN THE WHOLE DATA SCIENCE WORKFLOW

https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22

ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD…

AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES

Jeff Hammerbacher

http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/

OPTIONS FOR PROCESSING PIPELINE

Airflow: https://github.com/airbnb/airflow

Luigi: https://github.com/spotify/luigi

AIRFLOW EXAMPLE

https://github.com/airbnb/airflow
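The code shown on this slide did not survive in the transcript. As a rough stand-in, the sketch below illustrates the core idea behind a multistage pipeline: tasks plus a dependency graph, run in dependency order. It uses only the standard library (`graphlib`, Python 3.9+) and is not the Airflow API; `run_pipeline`, the task names and the toy data are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task once, after all of the tasks it depends on."""
    # static_order() yields every task after its upstream tasks.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        # Hand each task the results of its direct upstream tasks.
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# Hypothetical three-stage pipeline: extract -> transform -> load.
tasks = {
    "extract": lambda up: [1, 2, 3, 4],
    "transform": lambda up: [x * 10 for x in up["extract"]],
    "load": lambda up: sum(up["transform"]),
}

# Maps each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)
print(results["load"])
```

A real scheduler such as Airflow or Luigi adds what this sketch leaves out: scheduling, retries, persistence and monitoring.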

REGRESSION ANALYSIS IN PYTHON: EASY

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
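The slide points to the statsmodels OLS notebook. The sketch below is not that notebook: it is a minimal ordinary least squares fit using only NumPy, assuming a single predictor and made-up, noise-free data. The helper name `fit_ols` is hypothetical.

```python
import numpy as np


def fit_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b * x."""
    # Design matrix: a column of ones for the intercept, then the predictor.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]


# Made-up data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

intercept, slope = fit_ols(x, y)
print(intercept, slope)
```

statsmodels wraps the same kind of fit and additionally reports standard errors, confidence intervals and test statistics.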

PYTHON <3 BIG DATA

mrjob: MapReduce in Python

https://pythonhosted.org/mrjob…

snakebite: pure Python HDFS client

https://github.com/spotify/snakebite

Spark: fast and general engine for large-scale data processing

http://spark.apache.org

OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]

DataFrame operations are optimized and compiled into JVM bytecode

https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html

RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK'
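The code on this slide was not preserved in the transcript. The sketch below illustrates the usual way to average a distributed collection: carry a (sum, count) pair through each partition, then merge the per-partition pairs. It is plain Python rather than the Spark API; the `aggregate` helper only mimics the shape of an aggregate call (zero value, per-element function, merge function), and the partitions are made up.

```python
from functools import reduce


def seq_op(acc, value):
    """Fold one element into a (sum, count) accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(a, b):
    """Merge two (sum, count) accumulators from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


def aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical stand-in for a distributed aggregate over partitions."""
    # Each partition is reduced independently, as separate workers would do.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # The partial results are then merged into a single accumulator.
    return reduce(comb_op, partials, zero)


# Made-up data split into three partitions.
partitions = [[1, 2], [3, 4], [5]]

total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
average = total / count
print(average)
```

Carrying the count alongside the sum is what makes the result correct even when partitions have different sizes; averaging the per-partition averages would not be.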

SO CONCISE

COMMUNICATE RESULTS WITH IPYTHON / JUPYTER

Language agnostic :)

DEMO TIME

MATPLOTLIB / SEABORN / PLOT.LY / BOKEH: SUCH VISUALIZATION!!

PYTHON FITS ALL!

PYTHON FOR SCIENCE IS GROWING

SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY

#   module       imports   imports/numpy
1   sys          2437939   5.85
2   os           2009086   4.82
3   re           1303009   3.12
4   numpy        416981    1.00
5   warnings     371345    0.89
6   subprocess   344934    0.83
7   django       282097    0.68
8   math         281987    0.68
11  matplotlib   146913    0.35
13  pylab        77817     0.19
14  scipy        69092     0.17
22  pandas       18928     0.05
24  theano       5482      0.01

6 OF THE 25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE

https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement

SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION

https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement

import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# With the @ operator
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

$$S = (H\beta - r)^{T} (H V H^{T})^{-1} (H\beta - r)$$
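The slide's snippet leaves `H`, `V`, `beta` and `r` undefined. The sketch below fills them in with small made-up arrays (an assumption, chosen only so that the inverted matrix is non-singular) and checks that the nested `np.dot` form and the `@` form give the same result. It needs Python 3.5 or later and NumPy.

```python
import numpy as np
from numpy.linalg import inv

# Made-up inputs: H is 2x3, V is 3x3, beta is 3x1, r is 2x1.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
V = 2.0 * np.eye(3)
beta = np.array([[1.0], [2.0], [3.0]])
r = np.array([[1.0], [1.0]])

# Nested function calls: the formula has to be read inside out.
S_dot = np.dot((np.dot(H, beta) - r).T,
               np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# Infix operator: the code reads left to right, like the formula.
S_at = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

print(np.allclose(S_dot, S_at))
```

Both expressions evaluate the same quadratic form; the only difference is readability.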

PEP 0465: PROPOSED FEB/14. AVAILABLE SINCE PYTHON 3.5 (SEP/15)

2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON

AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES

SCIENCE STACK IS GETTING BETTER EACH DAY

https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8

SCIENCE STACK IS ALWAYS EVOLVING…

https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29

CONDA: AUTOMATING ENVIRONMENTS

https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=60

THE STACK IS STILL GETTING NEW MEMBERS…

http://www.tensorflow.org/

TAKEAWAY MESSAGE

TRY PYTHON. IT WILL BE A ONE-WAY TRIP!

slides icaromedeiros.com.br

slideshare.net/icaromedeiros

@icaromedeiros