Spark Machine Learning - GitHub Pages
Transcript of Spark Machine Learning - GitHub Pages
Spark Machine Learning
GPU
3 / 21
DataFrame
Pandas DataFrames
Pandas DataFrame Python
4 / 21
Transformers DataFrame DataFrame
DataFrame
DataFrame
5 / 21
Estimators ML
fit() DataFrame
6 / 21
Spark MLib 1
2
3
4
58 / 21
+ EvaluationModel(Transformer)Training Training data
Testing Test data Predicted priorities
EvaluationPredicted labels +true labels (on test data)
RDD[(Double, Double)] RegressionMetrics
Metric (MSE)Double
Estimator
Transformer
How good is the model?Model selection
Choose among different modelsor model hyperparameters
Evaluator
Overview of ML Algorithms• Prediction• Regression• Classification
• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization
feature vector à label
(categorical label)(real value label)
Iterative optimizationPartition data by rows (instances):• Easy to handle billions of rows• Hard to scale # features
• 107 for *Regression, SVM, NB• 103 for DecisionTree
LinearRegression, DecisionTreeLogisticRegression, NaiveBayes
E.g.: given log file, predict priority
Overview of ML Algorithms• Prediction• Regression• Classification
• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization
Per-row transformation
Tokenizer, HashingTF, IDF, Word2Vec
Normalizer, StandardScaler
E.g.: convert text to feature vectors
Arguably the most important part of machine learning
Overview of ML Algorithms• Prediction• Regression• Classification
• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization
Phrased as matrix factorizationGiven (users x products) matrix with many
missing entries,Find low-rank factorization.Fill in missing entries.
ALS
user à recommended products
E.g.: Recommend movies to users
Partition data by both users and products.
Scales to millions of users and products.
Overview of ML Algorithms• Prediction• Regression• Classification
• Feature transformation• Recommendation• Clustering• Other• Statistics• Linear algebra• Optimization
Iterative optimization.Local optima.
KMeans, GaussianMixutre, LDA
feature vectors à clusters (no labels)
E.g.: Given news articles, automatically group articles by topics
Partition data by rows.Easy to handle billions of rows
Overview of ML Algorithms• Prediction• Regression• Classification
• Feature transformation• Recommendation• Clustering• Other (MLLib)• Statistics• Linear algebra• Optimization
ChiSqSelector, Statistics, MultivariateOnlineSummarizer
E.g.: Is model A significantly better than model B?
DenseMatrix, SparseMatrix, EigenValueDecomposition, ...
E.g.: Matrix decomposition
GradientDescent, LBFGS
E.g.: Given function f(x), find x to minimize f(x)
Pipeline
Transformers Estimators ML
fit() Pipeline transform()
7 / 21
ML Pipelines
Trainmodel
Evaluate
Loaddata
Extractfeatures
ML Pipelines
Trainmodel1
Evaluate
Datasource 1Datasource 2
Datasource 3
ExtractfeaturesExtractfeatures
Featuretransform1
Featuretransform2
Featuretransform3
Trainmodel2
Ensemble
TunningPipeline Pipeline
MLlib CrossValidator ParamMap
9 / 21
CrossValidator folds
k = 3 folds 3
CrossValidator ParamMaps ParamMap
ParamMap
CrossValidator ParamMap
10 / 21
DataFrame
df = sqlContext.createDataFrame (data, [“label”, “features”])
LR 10
lr = LogisticRegression(maxIter = 10)
11 / 21
model = lr.fit(df)
model.transform(df).show()
12 / 21