The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

October 29, 2014

@ksankar // doubleclix.wordpress.com

http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI

“I want to die on Mars but not on impact”
-- Elon Musk, interview with Chris Anderson

“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” -- Jerome Seymour Bruner

"There are no facts, only interpretations." - Friedrich Nietzsche

Agenda

o  Spark & Data Science DevOps
   •  Spark, Python & Machine Learning
   •  Goals/non-goals
   •  Intro to Spark
   •  Stack, Mechanisms – RDD
   •  Datasets : SOTU, Titanic, Frequent Flier
   •  Statistical Toolbox
   •  Summary, Correlations
o  “Mood Of the Union”
   •  State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
   •  Map reduce, parse text
o  Clustering
   •  K-means for Gallactic Hoppers!
o  Break [3:15-3:45)
o  Predicting Survivors with Classification
   •  Decision Trees
   •  NaiveBayes (Titanic data set)
o  Linear Regression
o  Recommendation Engine
   •  Collab Filtering w/movie lens
o  Discussions/Slack

Oct 29, 2:00-3:15 (75 min), 3:45-5:00 (75 min) = 150 min
[20] 2:00 – 2:20
[30] 2:20 – 2:50
[25] 2:50 – 3:15
[30] 3:45 – 4:15
[10] 4:15 – 4:25
[20] 4:25 – 4:45
[15] 4:45 – 5:00

Goals & non-goals

Goals

¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming & ML application
¤ Give you a focused time to work thru examples
   § Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
   § The programs can be optimized

Non-goals

¡ Go deep into the algorithms
   •  We don’t have sufficient time. The topic can easily be a 5 day tutorial !
¡ Dive into spark internals
   •  That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
   •  Paco does a good job explaining them
¡ A passive talk
   •  Nope. Interactive & hands-on

About Me

o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
   •  Big Data (Retail, Bioinformatics, Financial, AdTech),
   •  Written Books (Web 2.0, Wireless, Java,…)
   •  Standards, some work in AI,
   •  Guest Lecturer at Naval PG School,…
   •  Planning MS-CFinance or Statistics or Computational Math
o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com

The Nuthead band !

Spark & Data Science DevOps

2:00

Close Encounters

•  1st
   ◦  This Tutorial
•  2nd
   ◦  Do More Hands-on Walkthrough
•  3rd
   ◦  Listen To Lectures
   ◦  More competitions …

Spark Installation

o Install Spark 1.1.0 on your local machine
o https://spark.apache.org/downloads.html
   • Pre-built For Hadoop 2.4 is fine
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it in /usr/local & have a softlink spark to the latest version

Tutorial Materials

o Github : https://github.com/xsankar/cloaked-ironman
o Clone or download zip
o Open terminal
o cd ~/cloaked-ironman
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note :
   o I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.1.0/ spark
o Click on ipython dashboard
o Just look thru the ipython notebooks

Data Science - Context

o  Scalable Model Deployment
o  Big Data automation & purpose built appliances (soft/hard)
o  Manage SLAs & response times
o  Volume
o  Velocity
o  Streaming Data
o  Canonical form
o  Data catalog
o  Data Fabric across the organization
o  Access to multiple sources of data
o  Think Hybrid – Big Data Apps, Appliances & Infrastructure

Collect Store Transform

o  Metadata
o  Monitor counters & Metrics
o  Structured vs. Multi-structured
o  Flexible & Selectable
   §  Data Subsets
   §  Attribute sets
o  Refine model with
   §  Extended Data subsets
   §  Engineered Attribute sets
o  Validation run across a larger data set

Reason Model Deploy

Data Management

Data Science

o  Dynamic Data Sets
o  2 way key-value tagging of datasets
o  Extended attribute sets
o  Advanced Analytics

Explore Visualize Recommend Predict

o  Performance
o  Scalability
o  Refresh Latency
o  In-memory Analytics
o  Advanced Visualization
o  Interactive Dashboards
o  Map Overlay
o  Infographics

¤  Bytes to Business a.k.a. Build the full stack

¤  Find Relevant Data For Business

¤  Connect the Dots

Volume

Velocity

Variety

Data Science - Context

Context

Connectedness

Intelligence

Interface

Inference

“Data of unusual size” that can't be brute forced

o  Three Amigos
o  Interface = Cognition
o  Intelligence = Compute(CPU) & Computational(GPU)
o  Infer Significance & Causality

Day in the life of a (super) Model

Intelligence

Inference

Data Representation

Interface

Algorithms

Parameters Attributes

Data (Scoring)

Model Selection

Reason & Learn

Models

Visualize, Recommend, Explore

Model Assessment

Feature Selection Dimensionality Reduction

Data Science Maturity Model & Spark

Stages, left to right: Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics

Data: Small Data | Larger Data set | Big Data | Big Data Factory Model

Context: Local | Domain | Cross-domain + External | Cross-domain + External

Model, Reason & Deploy
•  Isolated: Single set of boxes, usually owned by the Model Builders; Departmental
•  Integrated: Deploy - Central Analytics Infrastructure; Models still owned & operated by Modelers; Partly Enterprise-wide
•  Aggregated: Central Analytics Infrastructure; Model & Reason – by Model Builders; Deploy, Operate – by ops; Residuals and other metrics monitored by modelers; Enterprise-wide
•  Automated: Distributed Analytics Infrastructure; AI Augmented models; Model & Reason – by Model Builders; Deploy, Operate – by ops; Data as a monetized service, extending to eco system partners

•  Reports
•  Dashboards
•  Dashboards + some APIs
•  Dashboards + Well defined APIs + programming models

Type: Descriptive & Reactive | + Predictive | + Adaptive | Adaptive

Datasets
•  All in the same box
•  Fixed data sets, usually in temp data spaces
•  Flexible Data & Attribute sets
•  Dynamic datasets with well-defined refresh policies

Workload
•  Skunk works
•  Business relevant apps with approx SLAs
•  High performance appliance clusters
•  Appliances and clusters for multiple workloads including real time apps
•  Infrastructure for emerging technologies

Strategy
•  Informal definitions
•  Data definitions buried in the analytics models
•  Some data definitions
•  Data catalogue, metadata & Annotations
•  Big Data MDM Strategy

The Sense & Sensibility of a DataScientist DevOps

Factory = Operational

Lab = Investigative

http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

Spark-The Stack

http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

RDD – The workhorse of Spark

o Resilient Distributed Datasets
   • Collection that can be operated on in parallel
o Transformations – create RDDs
   • Map, Filter,…
o Actions – Get values
   • Collect, Take,…
o We will apply these operations during this tutorial
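The split between transformations and actions can be sketched without Spark at all. The `ToyRDD` class below is a hypothetical stand-in, not part of PySpark: it only records `map` and `filter` calls and runs nothing until an action (`collect` or `take`) asks for values. It illustrates the idea of deferred execution only; a real RDD is also partitioned and distributed.

```python
from itertools import islice


class ToyRDD(object):
    """Hypothetical local stand-in for an RDD (illustration only, not PySpark)."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work.
    def map(self, f):
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._data, self._ops + (("filter", f),))

    def _run(self):
        # Push each element through the recorded pipeline, one at a time.
        for x in self._data:
            keep = True
            for kind, f in self._ops:
                if kind == "map":
                    x = f(x)
                elif not f(x):
                    keep = False
                    break
            if keep:
                yield x

    # Actions: these actually compute values.
    def collect(self):
        return list(self._run())

    def take(self, n):
        return list(islice(self._run(), n))


seen = []  # records which inputs were really processed


def square(x):
    seen.append(x)
    return x * x


rdd = ToyRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
seen_before_action = list(seen)  # still empty: nothing has run yet
first_two = rdd.take(2)          # the action triggers the work
seen_after_take = list(seen)     # only as many inputs as take(2) needed
```

Here `first_two` is `[1, 9]`, and `square` was only called on the inputs 1, 2 and 3 to produce it.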

Algorithm spectrum

o  Regression
o  Logit
o  CART
o  Ensemble : Random Forest
o  Clustering
o  KNN
o  Genetic Alg
o  Simulated Annealing
o  Collab Filtering
o  SVM
o  Kernels
o  SVD
o  NNet
o  Boltzmann Machine
o  Feature Learning

Machine Learning | Cute Math | Artificial Intelligence

Not all MLlib APIs are available in Python (as of 1.1.0)

API                            Spark 1.1.0               Spark 1.2.0
                               Java/Scala    Python
Basic Statistics               ✔             ✔
Linear Models                  ✔             ✔
Decision Trees                 ✔             ✔
Random Forest                  ✖             ✖
Collab Filtering               ✔             ✔
Clustering-KMeans              ✔             ✔
Clustering-Hierarchical        ✖             ✖
SVD                            ✔             ✖
PCA                            ✔             ✖
Standard Scaler, Normalizer    ✔             ✖
Model Evaluation-PR/ROC

Spark 1.2 MLlib JIRA http://bit.ly/1ywotkm

Statistical Toolbox

o Sample data : Car mileage data

https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
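As a rough picture of what a correlation in the statistical toolbox measures, here is a plain-Python Pearson correlation. The weight and mileage numbers are made up for illustration (they are not the tutorial's car mileage data) and are deliberately perfectly linear. In the notebook the same quantities would come from `pyspark.mllib.stat.Statistics` (`colStats`, `corr`) rather than from hand-written code.

```python
import math


def mean(xs):
    return sum(xs) / float(len(xs))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up illustrative values: heavier cars, lower mileage (exactly linear).
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [34.0, 30.0, 26.0, 22.0, 18.0]

r = pearson(weight, mpg)  # close to -1: a perfect negative linear relationship
```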

“Mood Of the Union” with TF-IDF

2:20

Scenario – Mood Of the Union

o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country.
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop-MapReduce ?
o Is it better ?

POA (Plan Of Action)

o Collect State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up
o Compute TF-IDF (homework!)

Lookout for these interesting Spark features

o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort rdd by value
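The word-vector steps can be tried locally before touching Spark. This is a plain-Python sketch under stated assumptions: the two "speeches" are invented sentences, and `STOP_WORDS` is a tiny hypothetical list, not the one used in the notebook. In the notebook the same steps would be RDD operations such as `flatMap`, `map`, `reduceByKey`, `filter` and `subtractByKey`.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; the notebook uses its own, longer list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "our", "we", "is"}


def word_counts(text):
    """Parse text into lowercase words, drop common words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


# Invented sentences standing in for two speeches.
speech_a = "The economy and the war: we fight the war to save our economy."
speech_b = "We invest in the internet and the economy of the future."

counts_a = word_counts(speech_a)
counts_b = word_counts(speech_b)

# Sort by value (descending count), breaking ties alphabetically.
top_a = sorted(counts_a.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

# Set difference: words that appear in speech_b but not in speech_a.
new_in_b = sorted(set(counts_b) - set(counts_a))
```

`top_a` comes out as `[("economy", 2), ("war", 2)]` and `new_in_b` as `["future", "internet", "invest"]`.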

Read & Create word vector

iPython notebook at https://github.com/xsankar/cloaked-ironman

Remove Common Words – 1 of 3


FDR vs. Barack Obama as reflected by SOTU

Barack Obama vs. Bill Clinton

GWB vs Abe Lincoln as reflected by SOTU

Epilogue

o Interesting Exercise
o Highlights
   •  Map-reduce in a couple of lines !
   •  But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen [1])
   •  Set differences using subtractByKey
   •  Ability to sort a map by values (or any arbitrary function, for that matter)
o To Explore as homework:
   •  TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
   •  Haven’t seen it in python for 1.1.

[1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
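For the TF-IDF homework, a plain-Python sketch of the arithmetic may help before looking for a Spark API. The three tiny "documents" are invented. The smoothed form log((m + 1) / (df + 1)) follows the formula given in the MLlib feature-extraction documentation; treat the rest as an illustration, not as the library's implementation.

```python
import math
from collections import Counter

# Invented toy corpus: each document is already split into words.
docs = {
    "doc_a": ["war", "economy", "war"],
    "doc_b": ["economy", "internet"],
    "doc_c": ["economy", "health"],
}

m = len(docs)  # number of documents

# Document frequency: in how many documents does each term appear?
df = Counter(term for words in docs.values() for term in set(words))


def idf(term):
    """Smoothed inverse document frequency, log((m + 1) / (df + 1))."""
    return math.log((m + 1.0) / (df[term] + 1.0))


def tf_idf(words):
    """Term frequency times IDF for every term in one document."""
    tf = Counter(words)
    return {term: tf[term] * idf(term) for term in tf}


scores_a = tf_idf(docs["doc_a"])
# "economy" appears in every document, so its score is 0;
# "war" is specific to doc_a, so it scores highest there.
```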

Clustering

2:50

Scenario – Clustering with Spark

o InterGallactic Airlines has the GallacticHoppers frequent flyer program & has data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for its frequent flier program.
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in their GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem ?

Clustering - Theory

o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide
o Two types:
   • Centroid based clustering – k-means clustering
   • Tree based Clustering – hierarchical clustering
o Spark implements the Scalable Kmeans++
   • Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

o Application of Statistics toolbox
o Center & Scale RDD
o Filter RDDs

Clustering - API

o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||
o KMeansModel.predict
o Maps a point to a cluster
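To see what train and predict do conceptually, here is a minimal single-machine k-means (Lloyd's iterations) in plain Python. It is a sketch under assumptions: the points are made up, and the initial centers are simply the first k points, whereas MLlib uses "random" or "k-means||" initialization as described above. The helper names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centers):
    """Index of the nearest center: the role predict plays for one point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, k, max_iterations=100):
    # Naive initialization (assumption for this sketch): the first k points.
    centers = [list(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = [
            [sum(col) / float(len(g)) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centers = kmeans(points, k=2)
labels = [closest(p, centers) for p in points]
```

The resulting `labels` alternate between cluster 0 and cluster 1, matching the two groups in the toy data.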

Data

Read Data & Create RDD

Train & Predict

Calculate error
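The error used to compare clusterings can be sketched as the Within Set Sum of Squared Errors (WSSE), assuming the usual definition: the squared distance from each point to its nearest center, summed over all points. The points and centers below are made up, and the notebook's exact error function may differ in detail.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wsse(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up points: two tight pairs, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

error_k1 = wsse(points, one_center)   # 104.0
error_k2 = wsse(points, two_centers)  # 4.0
```

The error always shrinks as centers are added, which is why the lowest WSSE alone is not a good reason to pick a large k.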

But Data is not even

So let us center & scale the data and try again
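Centering and scaling means subtracting each column's mean and dividing by its standard deviation, so that a column measured in thousands does not dominate the distance calculation. A plain-Python sketch with made-up rows follows; it uses the population standard deviation, which is an assumption of this sketch (a library scaler may use the sample standard deviation instead).

```python
import math


def center_and_scale(rows):
    """Standardize each column to mean 0 and (population) std 1."""
    cols = list(zip(*rows))
    means = [sum(c) / float(len(c)) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / float(len(c)))
        for c, m in zip(cols, means)
    ]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up rows: (miles flown, number of flights), on very different scales.
rows = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
scaled = center_and_scale(rows)
# After scaling both columns carry the same values, so neither dominates.
```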

Looks Good

Let us try with 5 clusters

Let us map the cluster to our data

And interpret them

We have multiple cluster types:
•  1 : Very Active – Give them the most attention
•  3 : Very Active on-line, few flights – Give them on-line coupons
•  4 : Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates

Note :
•  This is just a sample interpretation.
•  In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
•  Maybe 3 is more suited to create targeted promotions


Epilogue

o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (Homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.

Break

3:15

Predicting Survivors with Classification

3:45

Titanic Passenger Metadata
•  Small
•  3 Predictors
   •  Class
   •  Sex
   •  Age
•  Survived?

Classification - Scenario

o This is a knowledge exercise
o Classify survival from the titanic data
o Gives us a quick dataset to run & test classification


Classifying Classifiers

[Figure: taxonomy of classifiers]

•  Statistical
   •  Regression: Logistic Regression [1]
   •  Naïve Bayes
   •  Bayesian Networks
•  Structural
   •  Rule-based: Production Rules, Decision Trees
   •  Distance-based
      •  Functional: Linear, Spectral Wavelet
      •  Nearest Neighbor: kNN, Learning vector Quantization
   •  Neural Networks: Multi-layer Perceptron
•  Ensemble: Random Forests, Boosting
•  SVM

[1] Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

Classifiers

Regression Continuous Variables

Categorical Variables

Decision Trees

k-NN (Nearest Neighbors)

Bias Variance

Model Complexity Over-fitting

Boosting Bagging

CART

Classification - Spark API

o  Logistic Regression
o  SVMWithSGD
o  DecisionTree
o  Data as LabeledPoint (we will see in a moment)
o  DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o  impurity – “entropy” or “gini”
o  maxBins = control to throttle communication at the expense of accuracy
   •  Larger = Higher Accuracy
   •  Smaller = less communication (as # of bins = number of instances)
o  data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o  intelligent framework - need this for scale

o  Concept of Labeled Point & how to create an RDD of LPs
o  Print the tree
o  Calculate Accuracy & MSE from RDDs
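The two impurity options, "gini" and "entropy", both score how mixed the labels in a tree node are. The plain-Python sketch below uses the standard textbook definitions on made-up label lists; it is meant to show what the parameter chooses between, not how MLlib computes it internally.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = float(len(labels))
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(labels).values())


# Made-up node contents: 1 = survived, 0 = did not survive.
mixed = [0, 0, 1, 1]  # evenly mixed node, the worst case for two classes
pure = [1, 1, 1, 1]   # pure node, nothing left to split

gini_mixed, gini_pure = gini(mixed), gini(pure)
entropy_mixed, entropy_pure = entropy(mixed), entropy(pure)
```

A pure node scores 0 under both measures; the evenly mixed node scores 0.5 (gini) and 1 bit (entropy). A tree split is chosen to reduce this impurity the most.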

Read data & extract features


Create the model

Extract labels & features

Calculate Accuracy & MSE
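Both metrics are simple aggregates over (label, prediction) pairs. The sketch below uses a made-up list of four pairs in plain Python; on an RDD of such pairs the same sums would be expressed with `filter(...).count()` and `map(...).reduce(...)`.

```python
# Made-up (label, prediction) pairs, as a classifier might produce them.
labels_and_preds = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

n = float(len(labels_and_preds))

# Accuracy: fraction of points where the prediction matches the label.
accuracy = sum(1 for label, pred in labels_and_preds if label == pred) / n

# Mean squared error: average squared difference between label and prediction.
mse = sum((label - pred) ** 2 for label, pred in labels_and_preds) / n
```

With 0/1 labels and 0/1 predictions the two are complementary: here accuracy is 0.75 and MSE is 0.25.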

Use NaiveBayes Algorithm

Decision Tree – Best Practices

maxDepth: Tune with Data/Model Selection

maxBins: Set low, monitor communications, increase if needed

# RDD partitions: Set to # of cores
•  Usually the recommendation is that RDDs should be over-partitioned, i.e. “more partitions than cores”: tasks take different times, we need to utilize the compute power, and in the end they average out
•  But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
•  Joe Bradley talk (reference below) has interesting insights

https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)

Future …

o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as Homework
o In 1.2 …
o Random Forest
   • Bagging
   • PR for Random Forest
o Boosting
o Alpine lab sequoia Forest: coordinating merge
o Model Selection Pipeline ; Design Doc

Boosting

•  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

◦  “Output of weak classifiers into a powerful committee”
◦  Final Prediction = weighted majority vote
◦  Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦  AdaBoost (Adaptive Boosting)
◦  Boosting vs Bagging
   •  Bagging – independent trees <- Spark shines here
   •  Boosting – successively weighted

Random Forests+

•  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

◦  Builds large collection of de-correlated trees & averages them
◦  Improves Bagging by selecting i.i.d* random variables for splitting
◦  Simpler to train & tune
◦  “Do remarkably well, with very little tuning required” – ESLII
◦  Less susceptible to over fitting (than boosting)
◦  Many RF implementations
   •  Original version - Fortran-77 ! By Breiman/Cutler
   •  Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab

* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Ensemble Methods

•  Goal
   ◦  Model Complexity (-)
   ◦  Variance (-)
   ◦  Prediction Accuracy (+)

◦  Two Step
   •  Develop a set of learners
   •  Combine the results to develop a composite predictor
◦  Ensemble methods can take the form of:
   •  Using different algorithms,
   •  Using the same algorithm with different settings
   •  Assigning different parts of the dataset to different classifiers
◦  Bagging & Random Forests are examples of ensemble methods

Ref: Machine Learning In Action

Random Forests

o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
o Simpler because it requires only two parameters – no. of Predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller)
o Error prediction
   •  For each iteration, predict for the data that is not in the sample (OOB data)
   •  Aggregate OOB predictions
   •  Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate
   •  Can use this to search for optimal # of predictors
   •  We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
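The out-of-bag (OOB) idea rests on bootstrap sampling: each tree is trained on rows drawn with replacement, and the rows that were never drawn form a free test set for that tree. The plain-Python sketch below shows only the sampling step; the function name and the row count of 1000 are hypothetical choices for illustration.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob


rng = random.Random(42)  # fixed seed so the run is repeatable
in_bag, oob = bootstrap_split(1000, rng)

# Fraction of rows this tree never saw during training.
oob_fraction = len(oob) / 1000.0
```

For a reasonably large dataset roughly a third of the rows (about 1/e) end up out-of-bag for any one tree, which is what makes the OOB error estimate possible without a separate hold-out set.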

Linear Regression

4:15

Linear Regression - API

LabeledPoint: The features and labels of a data point
LinearModel: weights, intercept
LinearRegressionModelBase: predict()
LinearRegressionModel
LinearRegressionWithSGD
   train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)

LassoModel: Least-squares fit with an l_1 penalty term.
LassoWithSGD
   train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

RidgeRegressionModel: Least-squares fit with an l_2 penalty term.
RidgeRegressionWithSGD
   train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

Basic Linear Regression

Use LR model for prediction & calculate MSE

Step size is important, the model can diverge !

Interesting step size
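Why the step size matters can be seen on a one-parameter problem. The sketch below runs plain batch gradient descent for y ≈ w · x in pure Python on made-up data with true slope 2. It is not the SGD implementation in MLlib, only the same update rule in miniature; the function name is hypothetical.

```python
def fit_slope(xs, ys, step, iterations=100):
    """Batch gradient descent for y ~ w * x (no intercept), minimizing MSE."""
    w = 0.0
    n = float(len(xs))
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad
    return w


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_small = fit_slope(xs, ys, step=0.01)  # converges towards 2.0
w_large = fit_slope(xs, ys, step=1.0)   # overshoots further on every step
```

With the small step each update shrinks the error; with the large step each update overshoots the minimum by more than it corrects, so the weight grows without bound. The same effect is behind a diverging model when the `step` parameter of `train` is too large for unscaled features.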

Recommendation Engine

4:25

Recommendation & Personalization - Spark

Automated Analytics - Let Data tell story
Feature Learning, AI, Deep Learning

Learning Models - fit parameters as it gets more data

Dynamic Models – model selection based on context

o  Knowledge Based
o  Demographic Based
o  Content Based
o  Collaborative Filtering

o  Item Based
o  User Based

o  Latent Factor based

o  User Rating
o  Purchased
o  Looked/Not purchased

Spark (in 1.1.0) implements the user based ALS collaborative filtering

Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu ; AT&T Labs., Florham Park, NJ ; Koren, Y. ; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan

Spark Collaborative Filtering API

o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
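A matrix factorization model predicts a rating as the dot product of a user's latent-factor vector and a product's latent-factor vector, with `rank` setting the length of those vectors. The sketch below uses hand-picked rank-2 factors in plain Python; the numbers, the dictionaries and the helper names are all hypothetical, and ALS itself (the step that learns the factors) is not shown.

```python
# Hypothetical rank-2 latent factors, as ALS might have learned them.
user_factors = {1: [1.0, 0.5], 2: [0.2, 1.0]}
product_factors = {10: [4.0, 0.0], 20: [0.0, 3.0]}


def predict(user, product):
    """Predicted rating: dot product of the user and product factor vectors."""
    return sum(u * p for u, p in zip(user_factors[user], product_factors[product]))


def predict_all(user_product_pairs):
    """Predict for many (user, product) pairs, returning (user, product, rating)."""
    return [(u, p, predict(u, p)) for u, p in user_product_pairs]


predictions = predict_all([(1, 10), (1, 20), (2, 20)])
```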

Read & Parse

Split & Train

Evaluate
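Evaluation boils down to joining actual and predicted ratings on the (user, product) key and averaging the squared differences. The plain-Python sketch below uses made-up ratings held in dictionaries; with pair RDDs keyed by (user, product) the same join would be an RDD `join`.

```python
import math

# Made-up ratings keyed by (user, product).
actual = {(1, 10): 4.0, (1, 20): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 20): 2.5, (2, 10): 5.0, (2, 20): 1.0}

# Join on the key: keep only pairs that have both an actual and a predicted rating.
joined = [(actual[key], predicted[key]) for key in actual if key in predicted]

mse = sum((a - p) ** 2 for a, p in joined) / float(len(joined))
rmse = math.sqrt(mse)
```

The prediction for (2, 20) has no actual rating to compare against, so the join drops it from the error.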

Epilogue

o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
   • Join (HashJoin)
   • In memory, Grace, Recursive hash join

http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx

Questions ?

4:45

Reference

1.  SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark

2.  http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering

3.  http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is

4.  http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/

5.  https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

6.  http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html

7.  http://blogs.gartner.com/matthew-davis/

Essential Reading List

o  A few useful things to know about machine learning - by Pedro Domingos
   •  http://dl.acm.org/citation.cfm?id=2347755
o  The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
   •  http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o  http://www.no-free-lunch.org/
o  Controlling the false discovery rate: a practical and powerful approach to multiple testing Benjamini, Y. and Hochberg, Y. C
   •  http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o  A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
   •  http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o  Avoid these three mistakes, James Faghmo
   •  https://medium.com/about-data/73258b3848a4
o  Leakage in Data Mining: Formulation, Detection, and Avoidance
   •  http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

For your reading & viewing pleasure … An ordered List

①  An Introduction to Statistical Learning
   •  http://www-bcf.usc.edu/~gareth/ISL/
②  ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
   •  http://online.stanford.edu/course/statistical-learning-winter-2014
③  Prof. Pedro Domingos
   •  https://class.coursera.org/machlearning-001/lecture/preview
④  Prof. Andrew Ng
   •  https://class.coursera.org/ml-003/lecture/preview
⑤  Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
   •  https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥  Mathematicalmonk @ YouTube
   •  https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦  The Elements Of Statistical Learning
   •  http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

References:

o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
   •  http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
   •  http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
   •  http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
   •  http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf

The Beginning As The End

How did we do ? 4:45
