Scientific World in Python
A brief introduction to the SciPy stack
Jiangwei Guo
Data Management Center
January 19, 2017
Jiangwei Guo (Data Management Center) Scientific World in Python January 19, 2017 1 / 32
Outline
1 SciPy movement
2 Core SciPy libs for ML: NumPy, SciPy, pandas, scikit-learn
3 bonus
why Python?
popular and powerful, widely used in industry
simple and elegant, easy to learn
complete ecosystem and active communities
glue language and the standard language for ML APIs
well suited to prototyping and widely used for it
Python’s next steps
“One thing I want to point out are the SciPy and NumPy movements. Those people are introducing Python as a replacement for MatLab. It’s open source, it’s better, they can change it. They are taking it to places where I had never expected Python would travel. They have things like the Jupyter Notebooks that show interactive Python in the browser. There is a lot of incredibly cool work that is happening in that area.”
– Guido van Rossum
SciPy stack
The SciPy stack is a collection of open source software for scientific computing in Python.
It implements and enhances MATLAB-style functionality in Python, and was later followed by Spark MLlib.
It is one of the most active Python communities and is sponsored by NumFOCUS.
SciPy stack - continued
NumPy: fundamental package for numerical computation.
SciPy: a collection of numerical algorithms and domain-specific toolboxes.
Matplotlib: a mature and popular plotting package.
pandas: high-performance, easy-to-use data structures.
Scikits: extra packages for more specific functionality, such as scikit-image, scikit-learn, etc.
IPython: a rich interactive interface; Jupyter provides access through the browser.
NumFOCUS as sponsor
The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research.
Intro to NumPy
NumPy is short for Numerical Python; the latest released version is 1.12.
It is the fundamental, standard package for scientific and numerical computing; its key concept is the N-dimensional array (ndarray).
  - powerful N-dimensional array object
  - a grid of values, all of the same type
  - sophisticated functions and routines
  - indexed by a tuple of nonnegative integers
recommended import style
import numpy as np
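The ndarray properties listed above can be sketched in a few lines. The array contents are illustrative and not from the slides.

```python
import numpy as np

# A 2-dimensional grid of values that all share one dtype
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # number of dimensions: 2
print(a.shape)   # length of each dimension: (2, 3)
print(a.dtype)   # one element type for the whole array

# An element is addressed by a tuple of nonnegative integers
print(a[1, 2])   # 6
```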
Universal functions (ufunc)
operate on ndarrays in an element-by-element fashion.
broadcasting
  - used throughout NumPy to decide how to handle disparately shaped arrays when performing arithmetic operations
  - arrays are broadcastable if one of the following holds:
    1 the arrays all have exactly the same shape
    2 the arrays all have the same number of dimensions, and the length of each dimension is either a common length or 1
    3 arrays that have too few dimensions can have their shapes prepended with a dimension of length 1 to satisfy property 2
type casting
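The broadcasting rules and type casting above can be tried out with a small sketch. The arrays and their names (a, row, col) are made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,), treated as (1, 3) by rule 3
col = np.array([[100], [200]])     # shape (2, 1)

print(a + row)   # row is repeated along axis 0
print(a + col)   # col is repeated along axis 1

# Shapes (2, 3) and (2,) do not satisfy any of the rules
try:
    a + np.array([1, 2])
except ValueError as exc:
    print("not broadcastable:", exc)

# Type casting: integer array plus a float gives a float array
print((a + 0.5).dtype)   # float64
```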
Slicing and Indexing
1 Basic indexing: a slice object (start:stop:step notation inside brackets), an integer, or a tuple of slice objects and integers
Example
a = np.arange(30).reshape(10, -1); a[9]; a[2:7:2]; a[:, 1]
2 Integer array indexing
Example
a = np.arange(6).reshape(3, -1); a[[0, 1, 2], [0, 1, 0]]
3 Boolean array indexing
Example
a = np.arange(10).reshape(5, -1); a[a > 2]
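The three indexing styles can be run side by side as follows. This is a sketch that gives each array its own name (a, b, c) so the results can be compared; the boolean case uses a 5 x 2 shape because 10 elements cannot be reshaped into 4 rows.

```python
import numpy as np

# 1. Basic indexing with integers and slices
a = np.arange(30).reshape(10, -1)   # shape (10, 3)
print(a[9])        # last row: [27 28 29]
print(a[2:7:2])    # rows 2, 4 and 6
print(a[:, 1])     # second column

# 2. Integer array indexing: picks elements (0, 0), (1, 1) and (2, 0)
b = np.arange(6).reshape(3, -1)     # shape (3, 2)
print(b[[0, 1, 2], [0, 1, 0]])      # [0 3 4]

# 3. Boolean array indexing: returns a flat array of matching values
c = np.arange(10).reshape(5, -1)    # shape (5, 2)
print(c[c > 2])                     # [3 4 5 6 7 8 9]
```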
Routines
array manipulation
  - change dimensions, reshape or flatten
  - transpose and transpose-like operations
  - join, concatenate, split, etc.
element-level mathematical routines
Example
a = np.arange(12).reshape(3, -1)
np.sum(a); np.prod(a, axis=0); np.cumsum(a, axis=0)
np.log10(a) (the 0 entry yields -inf with a RuntimeWarning)
ndarray-level mathematical routines
Example
x + y; x - y; x * y; x / y; np.add(x, y) (element-wise)
np.subtract(x, y), np.multiply(x, y), np.divide(x, y)
x.dot(y); np.dot(x, y) (matrix product)
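The routines above can be checked on small inputs. The 2 x 2 matrices x and y below are illustrative values, chosen so that the difference between the element-wise product and the matrix product is easy to see.

```python
import numpy as np

a = np.arange(12).reshape(3, -1)      # shape (3, 4)
print(np.sum(a))                      # 66
print(np.prod(a, axis=0))             # column products: [0 45 120 231]
print(np.cumsum(a, axis=0))           # running sums down each column

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = x * y                   # same as np.multiply(x, y)
matrix_product = x.dot(y)             # same as np.dot(x, y)
print(elementwise)                    # [[ 5. 12.] [21. 32.]]
print(matrix_product)                 # [[19. 22.] [43. 50.]]
```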
Intro to SciPy
SciPy is short for Scientific Python; the latest released version is 0.18.1.
SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.
SciPy is rich in high-level numerical routines and contains various toolboxes dedicated to common issues in scientific computing.
With SciPy an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.
difference between NumPy and SciPy
“... NumPy is meant to be a library for numerical arrays, to be used by anybody needing such an object in Python. SciPy is meant to be a library for scientists/engineers, so it aims for more rigorous theoretical mathematics.”
“In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic element-wise functions, et cetera. All numerical code would reside in SciPy.”
“NumPy contains some linear algebra functions, even though these more properly belong in SciPy. In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms.”
“... all of the Numpy functions have been subsumed into the scipy namespace so that all of those functions are available without additionally importing Numpy, the scipy __init__ executes a from numpy import *.”
Sparse matrices (scipy.sparse)
Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.
Advantages of the CSR format
  - efficient arithmetic operations: CSR + CSR, CSR * CSR, etc.
  - efficient row slicing
  - fast matrix-vector products
Disadvantages of the CSR format
  - slow column slicing operations (consider CSC)
  - changes to the sparsity structure are expensive (consider LIL or DOK)
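The CSR trade-offs above can be sketched with scipy.sparse. The matrix values and variable names are illustrative; the conversions follow the slide's advice to use CSC for column slicing and LIL for structural changes.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 0, 0],
                  [0, 0, 2, 0],
                  [0, 3, 0, 0]])
A = sparse.csr_matrix(dense)      # compressed sparse row storage
v = np.array([1, 2, 3, 4])

print(A.nnz)                      # 3 stored non-zero values
print((A + A).toarray())          # CSR + CSR arithmetic
print(A.dot(v))                   # matrix-vector product: [1 6 6]
print(A[1].toarray())             # row slicing is cheap in CSR

# Column slicing: convert to CSC first
A_csc = A.tocsc()
print(A_csc[:, 2].toarray())

# Changing the sparsity structure: edit in LIL, then convert back
A_lil = A.tolil()
A_lil[0, 3] = 9
A2 = A_lil.tocsr()
print(A2.nnz)                     # 4
```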
Intro to pandas
The name pandas derives from "panel data"; the latest released stable version is 0.19.2.
pandas provides fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.
pandas is not an implementation or extension of NumPy/SciPy; it manipulates data in its own way.
The main data structures include Series, DataFrame (much like data.frame in R), Panel, etc.
recommended import style
import pandas as pd
advantages of using pandas over NumPy
pandas is well suited for tabular data, time series data, and arbitrary matrix data with row and column labels.
strong at row- and column-oriented operations, especially with labels
  - columns or rows can be inserted into and deleted from a DataFrame
  - automatic and explicit data alignment: objects can be explicitly aligned to a set of labels
  - intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  - intuitive merging and joining of data sets
pandas supports many statistical methods for variable analysis, such as group-by and pivoting
easy handling of missing data (represented as NaN)
seamless integration with Python data structures and NumPy
robust IO tools for loading and dumping data
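Label alignment, missing-data handling and column insertion or deletion can be sketched as follows. The Series and DataFrame contents are invented for illustration.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Arithmetic aligns on labels; labels present on one side only become NaN
total = s1 + s2
print(total)

# Missing values can then be filled explicitly
filled = total.fillna(0)
print(filled)

# Columns can be inserted and deleted in place
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])
df['z'] = df['x'] * 2
del df['y']
print(df)

# The underlying data is available as a NumPy array
print(df.values)
```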
indexing and slicing
standard Python/NumPy expressions for selecting and setting are intuitive and come in handy
attribute access
the optimized pandas data access methods are recommended: loc/at for label-based access and iloc/iat for position-based access; loc and iloc for block access, at and iat for scalar access
boolean indexing
Example
df[0:2]; df[::-1]; df[:3]; df['a']; df.head()
df.a
df.loc['a':'c', ['A', 'D']]; df.at['a', 'A']
df.iloc[::-1, 2:4]; df.iat[0, 0]
df[df.A > 0]; df[df['a'].isin(['x', 'z'])]
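The access methods can be exercised on a small frame. This sketch assumes a hypothetical DataFrame with row labels a to e and columns A to D; the slides do not define df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-4, 16).reshape(5, 4),
                  index=list('abcde'), columns=list('ABCD'))

first_two = df[0:2]                      # row slice by position
col_a = df['A']                          # column by name, same data as df.A

block = df.loc['a':'c', ['A', 'D']]      # label-based; the slice includes 'c'
scalar = df.at['a', 'A']                 # label-based scalar access

rev_block = df.iloc[::-1, 2:4]           # position-based block
first = df.iat[0, 0]                     # position-based scalar access

positive = df[df.A > 0]                  # boolean indexing on a column
picked = df[df['A'].isin([0, 8])]        # membership test

print(block)
print(positive)
```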
interactive data analysis, as in R
Example
df = pd.read_csv('path/data.csv')
df.head(); df.shape; df.dtypes
df.describe(); df.mean()
df.dropna(how='any'); df.dropna(axis=1) (axis defaults to 0)
df.fillna(value=5); df['d'].fillna(3); df.fillna(df.mean())
df['a'] = df['a'].apply(np.log10)
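The same workflow can be run without a file on disk. This sketch reads a made-up CSV from an in-memory string; the column names a, b, d and the values are assumptions chosen so that missing values appear.

```python
import io

import numpy as np
import pandas as pd

csv_text = """a,b,d
1,10,100
10,,1000
100,30,
"""
df = pd.read_csv(io.StringIO(csv_text))   # empty fields are read as NaN
print(df.shape)                           # (3, 3)
print(df.describe())

rows_ok = df.dropna(how='any')            # drop rows containing any NaN
cols_ok = df.dropna(axis=1)               # drop columns containing any NaN
filled = df.fillna(df.mean())             # fill each column with its mean

df['a'] = df['a'].apply(np.log10)         # transform one column
print(df)
```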
interactive data analysis, as in R - continued
Example
df.groupby('a').sum()
df.groupby(['a', 'b']).mean()
pd.pivot_table(df, values='d', index=['a', 'b'], columns=['c'])
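Group-by and pivoting can be tried on a tiny frame. The data is invented; the value column d is selected explicitly because recent pandas versions refuse to average string columns.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'y', 'y'],
    'b': ['p', 'q', 'p', 'q'],
    'c': ['u', 'v', 'u', 'v'],
    'd': [1.0, 2.0, 3.0, 4.0],
})

sums = df.groupby('a')['d'].sum()             # one total per value of a
means = df.groupby(['a', 'b'])['d'].mean()    # one mean per (a, b) pair

# Rows are (a, b) pairs, columns are values of c, cells are the mean of d
table = pd.pivot_table(df, values='d', index=['a', 'b'], columns=['c'])

print(sums)
print(means)
print(table)
```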
merge dataframes
pd.concat()
Example
assume df1.shape = (3, 4), df2.shape = (3, 4)
pd.concat([df1, df2]), shape of result is (6, 4)
pd.concat([df1, df2], axis=1), shape of result is (3, 8)
df.append(): appends rows to a DataFrame, aligning on column labels by default (removed in pandas 2.0; use pd.concat instead)
pd.merge(left, right, how='inner', on=None, ...): SQL-style merges
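The shapes stated above, and the difference between inner and outer merges, can be checked with a short sketch. The frames df1, df2, left and right hold placeholder data.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.ones((3, 4)), columns=list('ABCD'))

stacked = pd.concat([df1, df2])             # rows appended: shape (6, 4)
side_by_side = pd.concat([df1, df2], axis=1)  # columns appended: shape (3, 8)

left = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'l': [1, 2, 3]})
right = pd.DataFrame({'key': ['k2', 'k3', 'k4'], 'r': [20, 30, 40]})

inner = pd.merge(left, right, how='inner', on='key')   # keys in both frames
outer = pd.merge(left, right, how='outer', on='key')   # keys in either frame

print(stacked.shape, side_by_side.shape)
print(inner)
print(outer)
```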
Intro to scikit-learn
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from SciKit (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The latest stable version is 0.18.1.
sklearn can be used in two typical ways
  - interactively in an interpreter, enhanced by IPython
  - by importing its classes or functions into Python projects
Machine Learning in Python
  - Simple and efficient tools for data mining and data analysis
  - Accessible to everybody, and reusable in various contexts
  - Built on NumPy, SciPy, and matplotlib
  - Open source, commercially usable - BSD license
recommended import style
from sklearn.xxx import xxx
overview of sklearn and ML
dataset transformation
  - feature extraction, efficient for text encoding
  - preprocessing data, such as normalization and encoding categorical variables
  - pipeline and feature union: every step before the last must implement the transform interface; the last step determines what the whole pipeline can do
modeling
  - almost all ML algorithms
    * supervised, unsupervised, semi-supervised
    * regression, classification, clustering, dimensionality reduction, label propagation
  - feature selection
  - ensemble, bagging and boosting (bias-variance tradeoff)
  - scikit-learn wrapper interface for third-party ML libs
“The selected features decide the limits of the model; the different algorithms just approach the limits of performance.”
– nobody
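The pipeline idea can be sketched with a scaler followed by a classifier. The choice of the iris dataset, StandardScaler and LogisticRegression is an assumption made for illustration and is not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The first step must be a transformer; the last step is a classifier,
# so the whole pipeline behaves like a classifier
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)             # fits the scaler, then the classifier
predictions = pipe.predict(X_test)     # scales X_test, then predicts
accuracy = pipe.score(X_test, y_test)  # mean accuracy on the held-out data
print(accuracy)
```

Because the scaler is fitted inside the pipeline, it only ever sees the training split, which avoids leaking test-set statistics into preprocessing.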
overview of sklearn and ML - continued
model evaluation and parameter tuning
  - every predictor must implement a score method
  - cross-validation for evaluating estimator performance, such as k-fold, leave-one-out, etc.
  - standard model evaluation functions, such as roc_curve and auc for classifiers
  - tuning the hyper-parameters of an estimator, such as GridSearchCV, RandomizedSearchCV, etc.
model persistence with pickle/cPickle
datasets and examples
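Cross-validation, grid search and persistence can be combined in one short sketch. The SVC estimator, the parameter grid and the iris dataset are illustrative choices, not prescribed by the slides.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one score per fold
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())

# Exhaustive search over a small, hypothetical hyper-parameter grid
grid = GridSearchCV(SVC(),
                    param_grid={'C': [0.1, 1, 10],
                                'kernel': ['linear', 'rbf']},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Model persistence: serialize the best estimator and load it back
blob = pickle.dumps(grid.best_estimator_)
restored = pickle.loads(blob)
```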
naming policy for scikit-learn
classes
  - CamelCase naming
functions
  - lower-case words joined by underscores
  - fit, predict, score, transform, apply, predict_proba, get_params, set_params, etc.
parameters
  - lower-case words joined by underscores
learned attributes
  - lower-case words joined by underscores, with a trailing underscore (e.g. coef_)
data sets
  - X, y, X_train, X_test
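The naming conventions show up even in a very small model. This sketch uses LinearRegression on made-up data that lies exactly on the line y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# CamelCase class, lower_case constructor parameter
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# Constructor parameters are exposed through get_params
print(model.get_params()['fit_intercept'])

# Learned attributes carry a trailing underscore
print(model.coef_, model.intercept_)
```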
API design philosophy - general principles
1 consistency: all objects share a consistent interface composed of a limited set of methods
2 inspection: constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes
3 non-proliferation of classes: learning algorithms are the only objects to be represented using custom classes
4 composition: meta transformers/estimators and ensemble functions
5 sensible defaults: an appropriate default value is defined for user-defined parameters wherever possible
API design philosophy - data representation
as close as possible to the matrix representation: NumPy arrays for dense data and SciPy sparse matrices for sparse data
for efficiency reasons, the public interface is oriented towards processing batches of samples rather than single samples per API call.
API design philosophy - core interface
estimators
  - define the instantiation mechanisms of algorithm objects
  - expose a fit method for learning a model from training data
  - estimator initialization and actual learning are strictly separated
predictors
  - extend the notion of an estimator by adding a predict method that produces predictions for X_test
  - classifiers can also provide a predict_proba method, which returns class probabilities
  - predictors also provide a score function to assess the estimator’s performance on a batch of input data
transformers
  - modify or filter data before feeding it to a learning algorithm
  - some estimators implement the transformer interface
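The estimator and transformer interfaces can be illustrated with a toy transformer. MeanCenterer is a hypothetical class written for this sketch and is not part of scikit-learn; only BaseEstimator and TransformerMixin come from the library.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Toy transformer that subtracts the column means learned in fit."""

    def __init__(self, with_mean=True):
        # Initialization only stores parameters; nothing is learned here
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state goes into an attribute with a trailing underscore
        if self.with_mean:
            self.mean_ = X.mean(axis=0)
        else:
            self.mean_ = np.zeros(X.shape[1])
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterer()
centered = centerer.fit_transform(X)   # provided by TransformerMixin
print(centerer.get_params())           # provided by BaseEstimator
print(centerer.mean_)                  # [ 2. 20.]
print(centered)                        # [[ -1. -10.] [  1.  10.]]
```

Keeping all learning inside fit is what lets such an object be cloned, grid-searched and placed in a pipeline like any built-in estimator.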
introduction to matplotlib
matplotlib is a Python 2D plotting library that integrates seamlessly with the SciPy stack
the pyplot module provides a MATLAB-like interface
matplotlib can be used in Python scripts, the Python and IPython shells, and the Jupyter notebook
R’s plotting library ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics: a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers
introduction to IPython
IPython is an enhanced interactive Python shell with many useful features, including named inputs and outputs, access to shell commands, improved debugging, and more.
Jupyter Notebook App (formerly IPython Notebook) is an application running inside the browser.
introduction to Anaconda
Anaconda is an easy-to-install, free package manager, environment manager, Python distribution, and collection of over 720 open source packages, with free community support.
Anaconda is the recommended Python distribution for scientific projects.
other scientific libs
TensorFlow: an open-source software library for machine intelligence
gensim: topic modelling for humans
NetworkX: high-productivity software for complex networks
NLTK: Natural Language Toolkit
XGBoost: eXtreme Gradient Boosting
scikit-learn-contrib: scikit-learn compatible projects, such as imbalanced-learn, etc.