The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
October 29, 2014
@ksankar // doubleclix.wordpress.com
http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI
“I want to die on Mars but not on impact” -- Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." -- Friedrich Nietzsche
Agenda
Oct 29, 2:00 – 3:15 (75 min) and 3:45 – 5:00 (75 min) = 150 min
o Spark & Data Science DevOps [20 min, 2:00 – 2:20]
  • Spark, Python & Machine Learning
  • Goals/non-goals
  • Intro to Spark
  • Stack, Mechanisms – RDD
  • Datasets : SOTU, Titanic, Frequent Flier
  • Statistical Toolbox : Summary, Correlations
o “Mood Of the Union” [30 min, 2:20 – 2:50]
  • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
  • Map reduce, parse text
o Clustering [25 min, 2:50 – 3:15]
  • K-means for Gallactic Hoppers!
o Break [3:15 – 3:45)
o Predicting Survivors with Classification [30 min, 3:45 – 4:15]
  • Decision Trees
  • NaiveBayes (Titanic data set)
o Linear Regression [10 min, 4:15 – 4:25]
o Recommendation Engine [20 min, 4:25 – 4:45]
  • Collab Filtering w/movie lens
o Discussions/Slack [15 min, 4:45 – 5:00]
Goals & non-goals
Goals
o Understand how to program Machine Learning with Spark & Python
o Focus on programming & ML application
o Give you a focused time to work thru examples
  • Work with me. I will wait if you want to catch-up
o Less theory, more usage - let us see if this works
o As straightforward as possible
  • The programs can be optimized
Non-goals
o Go deep into the algorithms
  • We don’t have sufficient time. The topic can easily be a 5 day tutorial !
o Dive into spark internals
  • That is for another day
o The underlying computation, communication, constraints & distribution is a fascinating subject
  • Paco does a good job explaining them
o A passive talk
  • Nope. Interactive & hands-on
About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata et al
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech)
  • Written Books (Web 2.0, Wireless, Java,…)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School,…
  • Planning MS-CFinance or Statistics or Computational Math
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com
The Nuthead band !
Spark & Data Science DevOps
2:00
Close Encounters
o 1st
  • This Tutorial
o 2nd
  • Do More Hands-on Walkthrough
o 3rd
  • Listen To Lectures
  • More competitions …
Spark Installation
o Install Spark 1.1.0 on your local machine
o https://spark.apache.org/downloads.html
  • Pre-built For Hadoop 2.4 is fine
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it in /usr/local & have a softlink spark to the latest version
Tutorial Materials
o Github : https://github.com/xsankar/cloaked-ironman
o Clone or download zip
o Open terminal
o cd ~/cloaked-ironman
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note : I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.1.0/ spark
o Click on ipython dashboard
o Just look thru the ipython notebooks
Data Science - Context
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
Collect Store Transform
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
  • Data Subsets
  • Attribute sets
o Refine model with
  • Extended Data subsets
  • Engineered Attribute sets
o Validation run across a larger data set
Reason Model Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
Explore Visualize Recommend Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
o Bytes to Business a.k.a. Build the full stack
o Find Relevant Data For Business
o Connect the Dots
Volume
Velocity
Variety
Data Science - Context
Context
Connectedness
Intelligence
Interface
Inference
“Data of unusual size” that can't be brute forced
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
Day in the life of a (super) Model
Intelligence
Inference
Data Representation
Interface
Algorithms
Parameters Attributes
Data (Scoring)
Model Selection
Reason & Learn
Models
Visualize, Recommend, Explore
Model Assessment
Feature Selection Dimensionality Reduction
Data Science Maturity Model & Spark
Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics
Data
• Small Data | Larger Data set | Big Data | Big Data Factory Model
Context
• Local | Domain | Cross-domain + External | Cross-domain + External
Model, Reason & Deploy
• Isolated: Single set of boxes, usually owned by the Model Builders; Departmental
• Integrated: Deploy - Central Analytics Infrastructure; Models still owned & operated by Modelers; Partly Enterprise-wide
• Aggregated: Central Analytics Infrastructure; Model & Reason – by Model Builders; Deploy, Operate – by ops; Residuals and other metrics monitored by modelers; Enterprise-wide
• Automated: Distributed Analytics Infrastructure; AI Augmented models; Model & Reason – by Model Builders; Deploy, Operate – by ops; Data as a monetized service, extending to eco system partners
• Reports | Dashboards | Dashboards + some APIs | Dashboards + Well defined APIs + programming models
Type
• Descriptive & Reactive | + Predictive | + Adaptive | Adaptive
Datasets
• All in the same box | Fixed data sets, usually in temp data spaces | Flexible Data & Attribute sets | Dynamic datasets with well-defined refresh policies
Workload
• Skunk works | Business relevant apps with approx SLAs | High performance appliance clusters | Appliances and clusters for multiple workloads including real time apps; Infrastructure for emerging technologies
Strategy
• Informal definitions | Data definitions buried in the analytics models | Some data definitions | Data catalogue, metadata & Annotations; Big Data MDM Strategy
The Sense & Sensibility of a Data Scientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
Spark-The Stack
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html
RDD – The workhorse of Spark
o Resilient Distributed Datasets
  • Collection that can be operated on in parallel
o Transformations – create RDDs
  • Map, Filter,…
o Actions – Get values
  • Collect, Take,…
o We will apply these operations during this tutorial
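The transformation/action split can be tried out without a cluster. Below is a minimal sketch in plain Python, not Spark code: lazy `map`/`filter` objects stand in for RDD transformations, and `islice` plus `list` stands in for a `take`-style action. The `square` helper and the `calls` counter are illustrative names only, and a real Spark job may evaluate more elements per partition than this toy does.

```python
from itertools import islice

calls = []  # records every element the mapper actually touches


def square(x):
    calls.append(x)
    return x * x


data = range(1, 11)

# "Transformations": nothing is computed yet, only a recipe is built.
squared = map(square, data)
even_squares = filter(lambda v: v % 2 == 0, squared)
computed_before_action = len(calls)

# "Action", similar in spirit to take(2): pulls just enough elements.
first_two = list(islice(even_squares, 2))
computed_after_take = len(calls)

print(computed_before_action, first_two, computed_after_take)
```

The point of the counter is that no work happens until a result is requested, which is the behaviour to keep in mind when chaining RDD operations in the notebooks.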
Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble : Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning
Machine Learning | Cute Math | Artificial Intelligence
Not all MLlib APIs are available in Python (as of 1.1.0)
API Spark 1.1.0 Spark 1.2.0
Java/Scala Python
Basic Statistics ✔ ✔
Linear Models ✔ ✔
Decision Trees ✔ ✔
Random Forest ✖ ✖
Collab Filtering ✔ ✔
Clustering-KMeans ✔ ✔
Clustering-Hierarchical ✖ ✖
SVD ✔ ✖
PCA ✔ ✖
Standard Scaler, Normalizer ✔ ✖
Model Evaluation-PR/ROC
Spark 1.2 MLlib JIRA http://bit.ly/1ywotkm
Statistical Toolbox
o Sample data : Car mileage data
https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
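To make "Summary, Correlations" concrete, here is a small plain-Python sketch of the two computations. The car rows are made-up numbers, not the tutorial's dataset, and the helper names are illustrative. In PySpark the corresponding calls are, to the best of my knowledge, `Statistics.colStats` and `Statistics.corr` in `pyspark.mllib.stat`; treat that mapping as an assumption and check the notebook.

```python
from math import sqrt

# Hypothetical (weight in 1000 lbs, miles per gallon) pairs, made up for illustration.
cars = [(2.6, 21.0), (2.9, 21.0), (2.3, 22.8), (3.2, 21.4), (3.4, 18.7), (3.5, 18.1)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance


def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


weights = [w for w, _ in cars]
mpg = [m for _, m in cars]

summary = {"mean": mean(mpg), "variance": variance(mpg), "min": min(mpg), "max": max(mpg)}
r = pearson(weights, mpg)
print(summary, r)
```

In this invented sample the heavier cars have lower mileage, so the correlation comes out negative.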
“Mood Of the Union” with TF-IDF
2:20
Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country.
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & python ?
o Is it different from Hadoop-MapReduce ?
o Is it better ?
POA (Plan Of Action)
o Collect State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up
o Compute TF-IDF (homework!)
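The word-vector steps in this plan can be sketched in a few lines. This is a plain-Python stand-in, not the notebook's code: the sample text and stop-word list are invented, and `tokenize` is a hypothetical helper. In PySpark the same shape is usually written with `flatMap`, `map` and `reduceByKey` on an RDD.

```python
import re
from collections import Counter

# Tiny stand-in text and stop-word list, both made up for illustration.
speech = "The state of the union is strong. The union faces war, and the economy faces debt."
common_words = {"the", "of", "is", "and", "a", "to"}


def tokenize(line):
    # Lower-case and keep alphabetic runs only.
    return re.findall(r"[a-z]+", line.lower())


# "map": one (word, 1) pair per token; "reduce": sum the ones per word.
pairs = [(w, 1) for w in tokenize(speech)]
freq = Counter()
for word, one in pairs:
    freq[word] += one

# Drop the common words, then sort by count (descending), ties broken alphabetically.
filtered = {w: c for w, c in freq.items() if w not in common_words}
top = sorted(filtered.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top)
```

Sorting on the count rather than the key is the "sort by value" step; on an RDD of (word, count) pairs the key function would be applied the same way.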
Lookout for these interesting Spark features
o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort rdd by value
Read & Create word vector
iPython notebook at https://github.com/xsankar/cloaked-ironman
Remove Common Words – 1 of 3
Remove Common Words – 2 of 3
Remove Common Words – 3 of 3
FDR vs. Barack Obama as reflected by SOTU
Barack Obama vs. Bill Clinton
GWB vs Abe Lincoln as reflected by SOTU
Epilogue
o Interesting Exercise
o Highlights
  • Map-reduce in a couple of lines !
  • But it is not exactly the same as Hadoop Mapreduce (see the excellent blog by Sean Owen [1])
  • Set differences using subtractByKey
  • Ability to sort a map by values (or any arbitrary function, for that matter)
o To Explore as homework:
  • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
  • Haven’t seen it in python for 1.1.
[1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
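The set-difference and sort-by-value highlights can be illustrated with ordinary dictionaries. This is a sketch under stated assumptions: the word counts are invented, and `subtract_by_key` is a hypothetical helper that mimics the behaviour of the RDD `subtractByKey` operation (keep pairs from the left side whose key is absent on the right).

```python
# Hypothetical word counts for two speeches, invented for illustration.
fdr = {"war": 12, "enemy": 7, "production": 5, "people": 9}
obama = {"jobs": 14, "people": 10, "energy": 6, "war": 3}


def subtract_by_key(left, right):
    # Keep the (key, value) pairs of `left` whose key does not occur in `right`.
    return {k: v for k, v in left.items() if k not in right}


new_in_obama = subtract_by_key(obama, fdr)
gone_since_fdr = subtract_by_key(fdr, obama)

# Sort the map by value, largest first.
ranked = sorted(new_in_obama.items(), key=lambda kv: kv[1], reverse=True)
print(ranked, gone_since_fdr)
```

Note that the values on the right-hand side are ignored; only the keys decide what survives.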
Clustering
2:50
Scenario – Clustering with Spark
o InterGallactic Airlines has the GallacticHoppers frequent flyer program & has data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for its frequent flier program.
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in their GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help in this business problem ?
Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide
o Two types:
  • Centroid based clustering – k-means clustering
  • Tree based Clustering – hierarchical clustering
o Spark implements the Scalable Kmeans++
  • Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Lookout for these interesting Spark features
o Application of Statistics toolbox
o Center & Scale RDD
o Filter RDDs
Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
  • train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
  • k = number of clusters to create, default=2
  • initializationMode = The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||
o KMeansModel.predict
  • Maps a point to a cluster
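What train, predict and the error measure do can be shown with a toy version. The sketch below is plain Python and deliberately simplified: it seeds the centers with the first k points instead of "k-means||", runs on a made-up list of points, and the function names `train`, `predict` and `wsse` are illustrative rather than the MLlib implementation.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def predict(centers, point):
    # Index of the nearest center: the role KMeansModel.predict plays.
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))


def train(points, k, max_iterations=100):
    centers = [list(p) for p in points[:k]]  # naive init: first k points (assumption)
    for _ in range(max_iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[predict(centers, p)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def wsse(centers, points):
    # Within-set sum of squared errors.
    return sum(sq_dist(centers[predict(centers, p)], p) for p in points)


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = train(points, k=2)
labels = [predict(centers, p) for p in points]
error = wsse(centers, points)
print(labels, error)
```

Comparing `wsse` across several values of k is the usual way to judge whether adding clusters is still buying anything.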
Data
Read Data & Create RDD
Train & Predict
Calculate error
But Data is not even
So let us center & scale the data and try again
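Centering and scaling means subtracting each column's mean and dividing by its standard deviation, so that a column with large raw values (miles) does not dominate the distance computation. A minimal plain-Python sketch follows; the rows and column meanings are invented, and `column_stats` and `center_and_scale` are hypothetical helper names.

```python
from math import sqrt

# Made-up rows: (miles flown, flights in the last year); note the very different ranges.
rows = [(28000.0, 2.0), (19000.0, 1.0), (41000.0, 6.0), (97000.0, 11.0)]


def column_stats(rows):
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        sqrt(sum((x - m) ** 2 for x in col) / (n - 1))  # sample standard deviation
        for col, m in zip(columns, means)
    ]
    return means, stds


def center_and_scale(rows):
    means, stds = column_stats(rows)
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]


scaled = center_and_scale(rows)
scaled_means, scaled_stds = column_stats(scaled)
print(scaled_means, scaled_stds)
```

After the transformation every column has mean 0 and standard deviation 1, so each feature contributes on the same scale to the squared distances k-means minimizes.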
Looks Good
Let us try with 5 clusters
Let us map the cluster to our data
And interpret them
We have multiple cluster types:
• 1 : Very Active – Give them the most attention
• 3 : Very Active on-line, few flights – Give them on-line coupons
• 4 : Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates
Note :
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions
Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (Home work : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.
Break
3:15
Predicting Survivors with Classification
3:45
Titanic Passenger Metadata
• Small
• 3 Predictors
  • Class
  • Sex
  • Age
• Survived?
Classification - Scenario
o This is a knowledge exercise
o Classify survival from the titanic data
o Gives us a quick dataset to run & test classification
Classifying Classifiers
Statistical Structural
Regression Naïve Bayes
Bayesian Networks
Rule-based Distance-based
Neural Networks
Production Rules Decision Trees
Multi-layer Perceptron
Functional Nearest Neighbor
Linear Spectral Wavelet
kNN Learning vector Quantization
Ensemble
Random Forests
Logistic Regression1
SVM Boosting
1Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
Classifiers
Regression Continuous Variables
Categorical Variables
Decision Trees
k-NN (Nearest Neighbors)
Bias Variance
Model Complexity Over-fitting
Boosting Bagging
CART
Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTree
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
  • Larger = Higher Accuracy
  • Smaller = less communication (as # of bins = number of instances)
o data adaptive – i.e. decision tree samples on the driver and figures out the bin spacing i.e. the places you slice for binning
o intelligent framework - need this for scale
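The two impurity options are standard measures of how mixed the labels in a tree node are. The sketch below shows the textbook formulas in plain Python; it is not MLlib's internal code, and the label lists are made up.

```python
from collections import Counter
from math import log2


def gini(labels):
    # 1 - sum of squared class proportions; 0 means the node is pure.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    # -sum(p * log2(p)) over the classes present; 0 means the node is pure.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


pure = [1, 1, 1, 1]    # e.g. every passenger in the node survived
mixed = [0, 0, 1, 1]   # evenly split node

print(gini(pure), entropy(pure), gini(mixed), entropy(mixed))
```

A split is attractive when the weighted impurity of the child nodes is lower than that of the parent, whichever of the two measures is chosen.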
Lookout for these interesting Spark features
o Concept of Labeled Point & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
Read data & extract features
Create the model
Extract labels & features
Calculate Accuracy & MSE
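Both metrics reduce to simple aggregations over (label, prediction) pairs. A minimal plain-Python sketch with invented pairs follows; on an RDD the same sums would typically be expressed with `map` and `reduce` or `count`.

```python
# (label, prediction) pairs; values are made up for illustration.
labels_and_preds = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]

n = len(labels_and_preds)

# Accuracy: fraction of rows where the prediction equals the label.
correct = sum(1 for label, pred in labels_and_preds if label == pred)
accuracy = correct / n

# Mean squared error: average squared difference between label and prediction.
mse = sum((label - pred) ** 2 for label, pred in labels_and_preds) / n

print(accuracy, mse)
```

With 0/1 labels and 0/1 predictions each wrong row contributes exactly 1 to the squared error, so MSE is just 1 minus accuracy here.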
Use NaiveBayes Algorithm
Decision Tree – Best Practices
maxDepth: Tune with Data/Model Selection
maxBins: Set low, monitor communications, increase if needed
# RDD partitions: Set to # of cores
• Usually the recommendation is that the RDD partitions should be over partitioned, i.e. “more partitions than cores”, because tasks take different times, we need to utilize the compute power and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over partitioning doesn’t help
• Joe Bradley talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
Future …
o Actually we should split the data to training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as Homework
o In 1.2 …
o Random Forest
  • Bagging
  • PR for Random Forest
o Boosting
o Alpine lab sequoia Forest: coordinating merge
o Model Selection Pipeline ; Design Doc
Boosting
o Goal
  • Model Complexity (-)
  • Variance (-)
  • Prediction Accuracy (+)
o “Output of weak classifiers into a powerful committee”
o Final Prediction = weighted majority vote
o Later classifiers get misclassified points
  • With higher weight,
  • So they are forced
  • To concentrate on them
o AdaBoost (Adaptive Boosting)
o Boosting vs Bagging
  • Bagging – independent trees <- Spark shines here
  • Boosting – successively weighted
Random Forests+
o Goal
  • Model Complexity (-)
  • Variance (-)
  • Prediction Accuracy (+)
o Builds large collection of de-correlated trees & averages them
o Improves Bagging by selecting i.i.d* random variables for splitting
o Simpler to train & tune
o “Do remarkably well, with very little tuning required” – ESLII
o Less susceptible to over fitting (than boosting)
o Many RF implementations
  • Original version - Fortran-77 ! By Breiman/Cutler
  • Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ensemble Methods
o Goal
  • Model Complexity (-)
  • Variance (-)
  • Prediction Accuracy (+)
o Two Step
  • Develop a set of learners
  • Combine the results to develop a composite predictor
o Ensemble methods can take the form of:
  • Using different algorithms,
  • Using the same algorithm with different settings
  • Assigning different parts of the dataset to different classifiers
o Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests
o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
o Simpler because it requires only two parameters – no. of predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate OOB predictions
  • Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate
  • Can use this to search for optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
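The out-of-bag (OOB) idea rests on bootstrap sampling: each tree is trained on rows drawn with replacement, and the rows it never saw can be used to test it. The plain-Python sketch below only demonstrates the sampling part, with an arbitrary row count, tree count and seed chosen for illustration; it does not train any trees.

```python
import random


def bootstrap_indices(n, rng):
    # Sample n row indices with replacement.
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows, n_trees = 200, 25

oob_fractions = []
for _ in range(n_trees):
    sample = set(bootstrap_indices(n_rows, rng))
    oob = set(range(n_rows)) - sample  # rows this tree never saw
    oob_fractions.append(len(oob) / n_rows)

avg_oob = sum(oob_fractions) / n_trees
print(avg_oob)
```

On average roughly a third of the rows end up out of bag for each tree (the limit is 1/e, about 0.37), which is why the OOB error behaves like a built-in validation estimate.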
Linear Regression
4:15
Linear Regression - API
LabeledPoint: The features and labels of a data point
LinearModel: weights, intercept
LinearRegressionModelBase: predict()
LinearRegressionModel
LinearRegressionWithSGD: train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel: Least-squares fit with an l_1 penalty term.
LassoWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel: Least-squares fit with an l_2 penalty term.
RidgeRegressionWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
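For reference, the l_1 and l_2 penalty terms mentioned above can be written in their generic textbook form. Here $w$ is the weight vector, $(x_i, y_i)$ are the $n$ training points, and $\lambda$ plays the role of `regParam`. The $1/n$ scaling and the omission of the intercept are assumptions of this sketch; MLlib's exact scaling may differ.

```latex
\text{Least squares:}\quad \min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2

\text{Lasso } (l_1):\quad \min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 + \lambda\,\lVert w\rVert_1

\text{Ridge } (l_2):\quad \min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 + \lambda\,\lVert w\rVert_2^2
```

The l_1 penalty tends to push some weights exactly to zero, while the l_2 penalty shrinks all weights smoothly.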
Basic Linear Regression
Use LR model for prediction & calculate MSE
Step size is important, the model can diverge !
Interesting step size
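Why the step size matters can be seen on a one-feature problem. The sketch below is plain Python using full-batch gradient descent rather than MLlib's `LinearRegressionWithSGD`; the data, the `fit` helper and the two step values are all illustrative assumptions.

```python
def fit(xs, ys, step, iterations=100):
    # Batch gradient descent on mean squared error for y ~ w * x (no intercept).
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad
    return w


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_small = fit(xs, ys, step=0.01)  # converges towards 2.0
w_large = fit(xs, ys, step=0.2)   # overshoots further on every iteration

print(w_small, mse(w_small, xs, ys), w_large)
```

For this data each update multiplies the error in w by (1 - 15 * step), so any step above about 0.133 makes the error grow instead of shrink. Real datasets have a different threshold, which is why the step size has to be tuned.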
Recommendation Engine
4:25
Recommendation & Personalization - Spark
Automated Analytics - Let Data tell story
Feature Learning, AI, Deep Learning
Learning Models - fit parameters as it gets more data
Dynamic Models – model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
  • Item Based
  • User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark (in 1.1.0) implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; AT&T Labs., Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
Read & Parse
Split & Train
Evaluate
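Evaluating a matrix-factorization model comes down to predicting a rating as the dot product of a user's and a product's latent factor vectors, then comparing against held-out ratings. The plain-Python sketch below uses invented rank-2 factors and test ratings; `predict` and `predict_all` are stand-ins that echo the method names above, not the MLlib implementation.

```python
from math import sqrt

# Made-up rank-2 latent factors; a trained model would learn these from ratings.
user_factors = {1: (1.0, 0.5), 2: (0.2, 1.5)}
product_factors = {10: (4.0, 1.0), 20: (1.0, 2.0)}


def predict(user, product):
    # Predicted rating = dot product of the user and product factor vectors.
    u, p = user_factors[user], product_factors[product]
    return sum(a * b for a, b in zip(u, p))


def predict_all(user_products):
    return [((u, p), predict(u, p)) for u, p in user_products]


# Held-out (user, product, rating) triples, also made up.
test_ratings = [(1, 10, 5.0), (1, 20, 2.0), (2, 10, 2.0), (2, 20, 3.0)]

predictions = dict(predict_all([(u, p) for u, p, _ in test_ratings]))
mse = sum((r - predictions[(u, p)]) ** 2 for u, p, r in test_ratings) / len(test_ratings)
rmse = sqrt(mse)
print(predictions, mse, rmse)
```

Keying both the predictions and the actual ratings by (user, product) is the same shape a join on two pair RDDs would produce.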
Epilogue
o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
  • Join (HashJoin)
  • In memory, Grace, Recursive hash join
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
Questions ?
4:45
Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
The Beginning As The End
How did we do ? 4:45