The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Transcript of The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 1: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

The Artful Business of Data Mining

Computational Statistics with Open Source Tools

Wednesday 20 March 13

Page 2: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

David Coallier, @davidcoallier

Page 3: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Scientist at Engine Yard (.com)

Page 4: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Find Data

Page 5: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Clean Data

Page 6: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Analyse Data?

Page 7: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Analyse Data

Page 8: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Question Data

Page 9: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Report Findings

Page 10: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Scientist

Page 11: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Janitor

Page 12: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Actual Tasks

Page 13: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“If your model is elegant, it’s probably wrong”

Page 14: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“The Times They Are a-Changin’”

Bob Dylan

Page 15: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Python & R

Page 16: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

SciPy

http://www.scipy.org

Page 17: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats

Page 18: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats

Descriptive Statistics

Page 19: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]

# Summary statistics: size, min/max, mean, variance, skewness, kurtosis
print(describe(s))

Page 20: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats

Probability Distributions

Page 21: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Example: Poisson Distribution

Page 22: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \ge 0$$

Page 23: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from scipy.stats import poisson
# Probability of observing k events, for each k in the list, at rate λ = 2
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)

Page 24: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

print(p.mean())
print(p.sum())
...
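As a cross-check on the Poisson formula, the probability mass function can be written out with the standard library alone. This is a minimal sketch, not SciPy's implementation: `poisson_pmf` is a hypothetical helper name, and for non-negative integer `k` it should agree with `scipy.stats.poisson.pmf` up to floating-point error.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam: lam**k * exp(-lam) / k!"""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Same inputs as the slide: a list of k values with rate 2
ks = [1, 2, 3, 4, 1, 2, 3]
p = [poisson_pmf(k, 2) for k in ks]
print(sum(p) / len(p))  # mean of the listed probabilities
print(sum(p))           # their sum (not 1, because ks repeats and skips values)

# Summed over all k, the probabilities approach 1; 50 terms is plenty for rate 2
total = sum(poisson_pmf(k, 2) for k in range(50))
print(total)
```

The sum of `p` is not a probability in itself; only the sum over every distinct `k` from 0 upwards converges to 1.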

Page 25: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NumPy

http://www.numpy.org/

Page 26: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NumPy

Linear Algebra

Page 27: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Page 28: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

import numpy as np
x = np.array([[1, 0], [0, 1]])
# eig returns (eigenvalues, eigenvectors), in that order
val, vec = np.linalg.eig(x)
# eigvals returns the eigenvalues only
np.linalg.eigvals(x)

Page 29: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

>>> np.linalg.eig(x)
(array([1., 1.]),
 array([[1., 0.],
        [0., 1.]]))
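The eigenvalues reported for the identity matrix can be checked by hand. For a 2x2 matrix they are the roots of the characteristic polynomial λ² - trace·λ + det = 0. The sketch below uses only the standard library; `eigvals_2x2` is a hypothetical helper written for illustration, not part of NumPy, and it handles real eigenvalues only.

```python
import math

def eigvals_2x2(m):
    """Real eigenvalues of a 2x2 matrix from its characteristic polynomial.

    Solves lambda**2 - trace * lambda + det = 0 with the quadratic formula.
    """
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; this sketch handles real ones only")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)

identity = [[1, 0], [0, 1]]
print(eigvals_2x2(identity))  # (1.0, 1.0)
```

For anything larger than 2x2, or for complex eigenvalues, `np.linalg.eig` is the practical choice.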

Page 30: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Matplotlib

Python Plotting

Page 31: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

statsmodels

Advanced Statistics Modeling

Page 32: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NLTK

Natural Language Toolkit

Page 33: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scikit-learn

Machine Learning

Page 34: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from sklearn import tree
# Two training samples with two features each, and their class labels
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

>>> clf.predict([[2., 2.]])
array([1])

Page 35: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

PyBrain

... Machine Learning

Page 36: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

PyMC

Bayesian Inference

Page 37: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Pattern

Web Mining for Python

Page 38: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NetworkX

Study Networks

Page 39: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

MILK

MOAR machine LEARNING!

Page 40: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Pandas

Easy-to-use data structures

Page 41: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

import pandas as pd
x = pd.DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])

# Rows where age > 20: how many there are, and their mean age
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())

Page 42: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

R

Page 43: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

RStudio

The IDE

Page 44: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

lubridate and zoo

Dealing with Dates...

Page 45: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm different timezone
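The mix of formats on this slide is the reason date-handling libraries exist. As a rough illustration in Python (the talk itself points to lubridate and zoo in R), the sketch below tries a fixed list of formats in order and normalises everything to UTC. The helper name `parse_timestamp`, the format list, the numeric `+0000` style of timezone offset, and the choice to treat naive values as UTC are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Candidate formats, tried in order. Order matters: "01/02/03" is ambiguous
# between yy/mm/dd and mm/dd/yy, and the first format that fits wins.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",  # YYYY-mm-dd HH:MM:ss with a numeric offset
    "%y/%m/%d",
    "%m/%d/%y",
    "%y-%m-%d",
]

def parse_timestamp(text):
    """Parse a date string or Unix epoch into a timezone-aware UTC datetime."""
    text = text.strip()
    try:
        # Plain numbers are treated as seconds since the Unix epoch
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: values without a timezone are already in UTC
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError("unrecognised date format: %r" % text)

for raw in ["13/03/20", "03/20/13", "13-03-20",
            "2013-03-20 12:54:54 +0000", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw).isoformat())
```

Every input in the loop resolves to 20 March 2013, which shows how the same day can arrive in five different shapes. Real data also needs a decision about ambiguous strings, which this sketch settles silently by format order.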

Page 46: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

reshape2

Reshape your Data

Page 47: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

ggplot2

Visualise your Data

Page 48: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

RCurl, RJSONIO

Find more Data

Page 49: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Hmisc

Miscellaneous useful functions

Page 50: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

forecast

Can you guess?

Page 51: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

garch

And rugarch

Page 52: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

quantmod

Statistical Financial Trading

Page 53: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

xts

Extensible Time Series

Page 54: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

igraph

Study Networks

Page 55: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

maptools

Read & View Maps

Page 56: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

library(maps)  # map() and the 'state' database come from the maps package
map('state',
    region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = TRUE)

Page 57: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Storage

Page 58: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Oppose “big” Data

Page 59: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“Learn how to sample”
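One way to act on this advice when the data is too large to load at once is reservoir sampling, which draws a uniform random sample of fixed size from a stream in a single pass. The talk does not name a specific technique, so this is an illustration chosen for the example; `reservoir_sample` and its parameters are hypothetical names.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Sample 5 values from a million-element stream without storing the stream
picked = reservoir_sample(range(1000000), 5, seed=42)
print(picked)
```

Memory use depends only on `k`, not on the length of the stream, so the same function works on a file or a database cursor read row by row.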

Page 60: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Experiments

Page 61: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

What Do You Want to Answer?

Page 62: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Understand Your Audience

Page 63: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Scientific Reporting

Page 64: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Busy-ness

Time is money

Page 65: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Public Visualisation

Page 66: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Best Visualisation, Bad Data

Page 67: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Best Forecasting models... Bad Visualisation

Page 68: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 69: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 70: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Seanchaí

Page 71: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 72: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Feel it

Page 73: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 74: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 75: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 76: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“Don’t be scared of bar charts.”

Page 77: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Mathematical Statistics
Engineering Business
Economics
Curiosity

Page 78: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

davidcoallier.github.com

@davidcoallier on Twitter