Why Python is better for Data Science
WHY PYTHON IS BETTER FOR DATA SCIENCE
ÍCARO MEDEIROS
São Paulo Big Data Meetup
São Paulo - SP, 25/11/2015
DATA SCIENTISTS SHOULD DO…
http://berkeleysciencereview.com/article/first-rule-data-science/
WHY PYTHON?
▸ General purpose
▸ Smooth learning curve
▸ REPL (IPython!)
▸ Programmer productivity
▸ Popular and mature
▸ Glue language (high level API, low level C/Fortran bindings)
▸ Science ecosystem (growing!)
PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS
http://githut.info/
pypl.github.io/PYPL.html
AVOID THE TWO-LANGUAGE PROBLEM
PYTHON CAN BE USED IN THE WHOLE DATA SCIENCE WORKFLOW
https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22
ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD…
AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES
Jeff Hammerbacher
http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/
OPTIONS FOR PROCESSING PIPELINE
Airflow: https://github.com/airbnb/airflow
Luigi: https://github.com/spotify/luigi
AIRFLOW EXAMPLE
https://github.com/airbnb/airflow
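The example shown on this slide is not part of the transcript. As a rough stand-in, the sketch below shows the idea behind tools like Airflow and Luigi: tasks declare their dependencies and a scheduler runs them in order. It uses only the standard library (`graphlib`, Python 3.9+) and is not Airflow's API; the task names `extract`, `transform`, `load` and the helper `run_pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter


def extract():
    """Stage 1: produce raw records (hard-coded here for illustration)."""
    return [3, 1, 2]


def transform(records):
    """Stage 2: put the records in order."""
    return sorted(records)


def load(records):
    """Stage 3: summarize the result for downstream consumers."""
    return {"count": len(records), "max": max(records)}


# Hypothetical task table: name -> (function, names of upstream tasks).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run every task after its upstream tasks and collect the outputs."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        # Each task receives the outputs of the tasks it depends on.
        results[name] = func(*(results[dep] for dep in deps))
    return results


pipeline_results = run_pipeline(TASKS)
```

A real workflow engine adds scheduling, retries and monitoring on top of this dependency-ordering core.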
REGRESSION ANALYSIS IN PYTHON: EASY
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
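The slide points to the statsmodels OLS notebook. As a minimal sketch of the same idea, assuming only NumPy is installed, an ordinary least squares fit can be written with `numpy.linalg.lstsq`. The data is synthetic and the variable names are illustrative; this is not the statsmodels API.

```python
import numpy as np

# Synthetic data for illustration: y = 2x + 1 exactly, with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column, as in ordinary least squares.
X = np.column_stack([np.ones_like(x), x])

# Solve min ||X b - y||^2; rcond=None selects NumPy's default cutoff.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coeffs
```

statsmodels adds standard errors, p-values and a summary table on top of this fit.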
PYTHON <3 BIG DATA
mrjob: MapReduce in Python (https://pythonhosted.org/mrjob…)
snakebite: pure-Python HDFS client (https://github.com/spotify/snakebite)
Spark: fast and general engine for large-scale data processing (http://spark.apache.org)
OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]
DataFrame operations are optimized and compiled into JVM bytecode
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
RDD AVERAGE: EXAMPLE FROM 'LEARNING SPARK'
SO CONCISE
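The code from these slides is missing from the transcript. The sketch below is a pure-Python imitation of the pattern usually used to average an RDD: fold each partition into a (sum, count) pair, then merge the pairs. PySpark exposes this as `RDD.aggregate(zeroValue, seqOp, combOp)`; the `aggregate` helper and the partition data here are hypothetical stand-ins so the example runs without a cluster.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partials with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Hypothetical data, split into two partitions as a cluster might hold it.
partitions = [[1, 2, 3], [4, 5]]

total, count = aggregate(
    partitions,
    (0, 0),                                           # (running sum, running count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # add one value to a partial
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two partials
)
average = total / count
```

Carrying the count next to the sum is what lets partitions be processed independently and merged afterwards.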
COMMUNICATE RESULTS WITH IPYTHON / JUPYTER
Language agnostic :)
DEMO TIME
MATPLOTLIB / SEABORN / PLOT.LY / BOKEH: SUCH VISUALIZATION!!
PYTHON FITS ALL!
PYTHON FOR SCIENCE IS GROWING
SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY
#   module       imports   imports/numpy
1   sys          2437939   5.85
2   os           2009086   4.82
3   re           1303009   3.12
4   numpy         416981   1.00
5   warnings      371345   0.89
6   subprocess    344934   0.83
7   django        282097   0.68
8   math          281987   0.68
11  matplotlib    146913   0.35
13  pylab          77817   0.19
14  scipy          69092   0.17
22  pandas         18928   0.05
24  theano          5482   0.01
6 OF THE 25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE
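The imports/numpy column in the table above is each module's import count divided by numpy's. A small sketch that recomputes it for the six data science libraries, using the counts from the table; the names `IMPORT_COUNTS` and `ratio_to_numpy` are illustrative.

```python
# Import counts copied from the table above (files importing each module).
IMPORT_COUNTS = {
    "numpy": 416981,
    "matplotlib": 146913,
    "pylab": 77817,
    "scipy": 69092,
    "pandas": 18928,
    "theano": 5482,
}


def ratio_to_numpy(counts):
    """Return each module's count relative to numpy, rounded to 2 decimals."""
    base = counts["numpy"]
    return {name: round(n / base, 2) for name, n in counts.items()}


ratios = ratio_to_numpy(IMPORT_COUNTS)
```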
https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION
import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# With the @ operator
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)
$$S = (H\beta - r)^{T} (H V H^{T})^{-1} (H\beta - r)$$
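The snippet above leaves H, V, beta and r undefined. A self-contained sketch, assuming small random matrices with compatible shapes (the shapes, seed and construction of V are illustrative choices, not from the slides), confirms that both spellings compute the same S:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)

# Illustrative shapes: H is 2x3, beta is 3x1, r is 2x1, V is 3x3.
H = rng.normal(size=(2, 3))
beta = rng.normal(size=(3, 1))
r = rng.normal(size=(2, 1))
A = rng.normal(size=(3, 3))
V = A @ A.T + 3 * np.eye(3)  # symmetric positive definite, so H V H^T is invertible

# Nested np.dot calls, as required before Python 3.5.
S_dot = np.dot((np.dot(H, beta) - r).T,
               np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# The same expression with the @ operator from PEP 465.
S_at = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)
```

The `@` version reads left to right like the formula, which is the readability argument PEP 465 makes.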
PEP 0465: PROPOSED FEB/2014. AVAILABLE SINCE PYTHON 3.5 (SEP/2015)
2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON
AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
SCIENCE STACK IS GETTING BETTER EACH DAY
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8
SCIENCE STACK IS ALWAYS EVOLVING…
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29
CONDA: AUTOMATING ENVIRONMENTS
https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=60
THE STACK IS STILL GETTING NEW MEMBERS…
http://www.tensorflow.org/
TAKEAWAY MESSAGE
TRY PYTHON. IT WILL BE A ONE-WAY TRIP!
slides icaromedeiros.com.br
slideshare.net/icaromedeiros
@icaromedeiros