Spark Machine Learning - GitHub Pages

19
Spark Machine Learning

Transcript of Spark Machine Learning - GitHub Pages

Page 1: Spark Machine Learning - GitHub Pages

Spark Machine Learning

Page 2: Spark Machine Learning - GitHub Pages

GPU

3 / 21

Page 3: Spark Machine Learning - GitHub Pages

DataFrame

Pandas DataFrames

Pandas DataFrame Python

4 / 21

Page 4: Spark Machine Learning - GitHub Pages

Transformers DataFrame DataFrame

DataFrame

DataFrame

5 / 21

Page 5: Spark Machine Learning - GitHub Pages

Estimators ML

fit() DataFrame

6 / 21

Page 6: Spark Machine Learning - GitHub Pages

Spark MLib 1

2

3

4

58 / 21

Page 7: Spark Machine Learning - GitHub Pages

+ EvaluationModel(Transformer)Training Training data

Testing Test data Predicted priorities

EvaluationPredicted labels +true labels (on test data)

RDD[(Double, Double)] RegressionMetrics

Metric (MSE)Double

Estimator

Transformer

How good is the model?Model selection

Choose among different modelsor model hyperparameters

Evaluator

Page 8: Spark Machine Learning - GitHub Pages

Overview of ML Algorithms• Prediction• Regression• Classification

• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization

feature vector à label

(categorical label)(real value label)

Iterative optimizationPartition data by rows (instances):• Easy to handle billions of rows• Hard to scale # features

• 107 for *Regression, SVM, NB• 103 for DecisionTree

LinearRegression, DecisionTreeLogisticRegression, NaiveBayes

E.g.: given log file, predict priority

Page 9: Spark Machine Learning - GitHub Pages

Overview of ML Algorithms• Prediction• Regression• Classification

• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization

Per-row transformation

Tokenizer, HashingTF, IDF, Word2Vec

Normalizer, StandardScaler

E.g.: convert text to feature vectors

Arguably the most important part of machine learning

Page 10: Spark Machine Learning - GitHub Pages

Overview of ML Algorithms• Prediction• Regression• Classification

• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization

Phrased as matrix factorizationGiven (users x products) matrix with many

missing entries,Find low-rank factorization.Fill in missing entries.

ALS

user à recommended products

E.g.: Recommend movies to users

Partition data by both users and products.

Scales to millions of users and products.

Page 11: Spark Machine Learning - GitHub Pages

Overview of ML Algorithms• Prediction• Regression• Classification

• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization

Iterative optimization.Local optima.

KMeans, GaussianMixutre, LDA

feature vectors à clusters (no labels)

E.g.: Given news articles, automatically group articles by topics

Partition data by rows.Easy to handle billions of rows

Page 12: Spark Machine Learning - GitHub Pages

Overview of ML Algorithms• Prediction• Regression• Classification

• Feature transformation• Recommendation• Clustering• Other (MLLib)• Statistics• Linear algebra• Optimization

ChiSqSelector, Statistics, MultivariateOnlineSummarizer

E.g.: Is model A significantly better than model B?

DenseMatrix, SparseMatrix, EigenValueDecomposition, ...

E.g.: Matrix decomposition

GradientDescent, LBFGS

E.g.: Given function f(x), find x to minimize f(x)

Page 13: Spark Machine Learning - GitHub Pages

Pipeline

Transformers Estimators ML

fit() Pipeline transform()

7 / 21

Page 14: Spark Machine Learning - GitHub Pages

ML Pipelines

Trainmodel

Evaluate

Loaddata

Extractfeatures

Page 15: Spark Machine Learning - GitHub Pages

ML Pipelines

Trainmodel1

Evaluate

Datasource 1Datasource 2

Datasource 3

ExtractfeaturesExtractfeatures

Featuretransform1

Featuretransform2

Featuretransform3

Trainmodel2

Ensemble

Page 16: Spark Machine Learning - GitHub Pages

TunningPipeline Pipeline

MLlib CrossValidator ParamMap

9 / 21

Page 17: Spark Machine Learning - GitHub Pages

CrossValidator folds

k = 3 folds 3

CrossValidator ParamMaps ParamMap

ParamMap

CrossValidator ParamMap

10 / 21

Page 18: Spark Machine Learning - GitHub Pages

DataFrame

df = sqlContext.createDataFrame (data, [“label”, “features”])

LR 10

lr = LogisticRegression(maxIter = 10)

11 / 21

Page 19: Spark Machine Learning - GitHub Pages

model = lr.fit(df)

model.transform(df).show()

12 / 21