Spark MLlib - Training Material

Transcript
Page 1: Spark MLlib - Training Material

Spark MLlib

Page 2: Spark MLlib - Training Material

ABOUT ME

• Experience: Vpon Data Engineer; TWM, Keywear, Nielsen

• Bryan's notes for data analysis: http://bryannotes.blogspot.tw

• Spark.TW

• LinkedIn: https://tw.linkedin.com/pub/bryan-yang/7b/763/a79

Page 3: Spark MLlib - Training Material

Agenda

• Introduction to Machine Learning
• Why MLlib
• Basic Statistics
• K-means
• Logistic Regression
• ALS
• Demo

Page 4: Spark MLlib - Training Material

Introduction to Machine Learning

Page 5: Spark MLlib - Training Material

Definition

• Machine learning is a study of computer algorithms that improve automatically through experience.

Page 6: Spark MLlib - Training Material

Terminology

• Observation
- The basic unit of data.

• Features
- Attributes that can be used to describe each observation in a quantitative manner.

• Feature vector
- An n-dimensional vector of numerical features that represents some object.

• Training/Testing/Evaluation set
- Sets of data used to discover potential predictive relationships.
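As a minimal illustration of the terms above (the numbers are hypothetical), an observation quantified into features becomes a feature vector, and a labeled collection of such vectors forms a training set:

# A minimal sketch with hypothetical values: one observation described
# by three numeric features forms a 3-dimensional feature vector.
feature_vector = [1.0, 0.0, 1.0]          # e.g. color, type, shape

# A training set is a collection of (feature vector, label) pairs.
training_set = [
    ([1.0, 0.0, 1.0], 1),                 # labeled observation
    ([0.0, 1.0, 0.0], 0),
]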

Page 7: Spark MLlib - Training Material

Learning (Training)

• Features:
1. Color: Red/Green
2. Type: Fruit
3. Shape: nearly Circle
etc…

Page 8: Spark MLlib - Training Material

Learning (Training)

ID | Color | Type  | Shape         | Is apple (Label)
1  | Red   | Fruit | Circle        | Y
2  | Red   | Fruit | Circle        | Y
3  | Black | Logo  | nearly Circle | N
4  | Blue  | N/A   | Circle        | N
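To make the table concrete, here is one possible way to feed it to MLlib; the numeric encoding of the categorical features below is an assumption for illustration, not part of the original material:

from pyspark.mllib.regression import LabeledPoint

# Hypothetical encoding: Color Red=0/Black=1/Blue=2, Type Fruit=0/Logo=1/NA=2,
# Shape Circle=0/nearly Circle=1; label Y=1.0, N=0.0.
training = [
    LabeledPoint(1.0, [0.0, 0.0, 0.0]),   # ID 1: Red, Fruit, Circle
    LabeledPoint(1.0, [0.0, 0.0, 0.0]),   # ID 2: Red, Fruit, Circle
    LabeledPoint(0.0, [1.0, 1.0, 1.0]),   # ID 3: Black, Logo, nearly Circle
    LabeledPoint(0.0, [2.0, 2.0, 0.0]),   # ID 4: Blue, N/A, Circle
]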

Page 9: Spark MLlib - Training Material

Categories of Machine Learning

• Classification: predict class from observations.

• Clustering: group observations into meaningful groups.

• Regression: predict value from observations.

Page 10: Spark MLlib - Training Material

Cluster the observations with no Labels

https://en.wikipedia.org/wiki/Cluster_analysis

Page 11: Spark MLlib - Training Material

Cut the observations

http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression

Page 12: Spark MLlib - Training Material

Find a model to describe the observations

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

Page 13: Spark MLlib - Training Material

Machine Learning Pipelines

http://www.nltk.org/book/ch06.html

Page 14: Spark MLlib - Training Material

What is MLlib

Page 15: Spark MLlib - Training Material

What is MLlib

• MLlib is an Apache Spark component focusing on machine learning:

• MLlib is Spark's core ML library

• Developed by the MLbase team in AMPLab

• 80+ contributors from various organizations

• Supports Scala, Python, and Java APIs

Page 16: Spark MLlib - Training Material

Algorithms in MLlib

• Statistics: description, correlation

• Clustering: k-means

• Collaborative filtering: ALS

• Classification: SVMs, naive Bayes, decision trees

• Regression: linear regression, logistic regression

• Dimensionality reduction: SVD, PCA

Page 17: Spark MLlib - Training Material

Why MLlib

• Scalability

• Performance

• User-friendly documentation and APIs

• Cost of maintenance

Page 18: Spark MLlib - Training Material

Performance

Page 19: Spark MLlib - Training Material

Data Type

• Dense vector

• Sparse vector

• Labeled point

Page 20: Spark MLlib - Training Material

Dense & Sparse

• Raw Data:

ID | A B C D E F
1  | 1 0 0 0 0 3
2  | 0 1 0 1 0 2
3  | 1 1 1 0 1 1
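For example, row ID 1 of the table above can be stored either way; a sparse vector records only the non-zero entries (size, indices, values):

from pyspark.mllib.linalg import Vectors

# Row ID 1 (features A..F) as a dense vector: every entry is stored.
dense = Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0, 3.0])

# The same row as a sparse vector: size 6, non-zeros at indices 0 and 5.
sparse = Vectors.sparse(6, [0, 5], [1.0, 3.0])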

Page 21: Spark MLlib - Training Material

Dense vs Sparse

• Training set:
- number of examples: 12 million
- number of features: 500
- sparsity: 10%

• Not only saves storage, but also yields a 4x speed-up

        | Dense | Sparse
Storage | 47 GB | 7 GB
Time    | 240 s | 58 s

Page 22: Spark MLlib - Training Material

Labeled Point

• Dummy variable (1, 0)

• Categorical variable (0, 1, 2, …)

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

Page 23: Spark MLlib - Training Material

DEMO

Page 24: Spark MLlib - Training Material

Descriptive Statistics

• Supported functions:
- count
- max
- min
- mean
- variance
…

• Supported data types:
- Dense
- Sparse
- Labeled Point

Page 25: Spark MLlib - Training Material

Example

from pyspark.mllib.stat import Statistics
from math import sqrt
import numpy as np

# Example data (at least a 2 x 2 matrix)
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.0, 2.0, 3.0, 4.0, 5.0]])

# To RDD
distData = sc.parallelize(data)

# Compute statistic values
summary = Statistics.colStats(distData)
print("Duration Statistics:")
print(" Mean: {}".format(round(summary.mean()[0], 3)))
print(" St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3)))
print(" Max value: {}".format(round(summary.max()[0], 3)))
print(" Min value: {}".format(round(summary.min()[0], 3)))
print(" Total value count: {}".format(summary.count()))
print(" Number of non-zero values: {}".format(summary.numNonzeros()[0]))

Page 26: Spark MLlib - Training Material

DEMO

Page 27: Spark MLlib - Training Material

Quiz 1

• Which statement is NOT a purpose of descriptive statistics?
1. Find the extreme values of the data.
2. Get a preliminary understanding of the data's properties.
3. Describe the full properties of the data.

Page 28: Spark MLlib - Training Material

Correlation

https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples

Page 29: Spark MLlib - Training Material

Example

• Data Set

from pyspark.mllib.stat import Statistics

# Choose the statistical method
corrType = 'pearson'

# Create the test data sets
group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
corr = Statistics.corr(group1, group2, corrType)
print(corr)
# 0.87

      | A B C D E F G H
Case1 | 1 3 4 3 7 5 6 8
Case2 | 1 2 3 4 5 6 7 8

Page 30: Spark MLlib - Training Material

Shape is important

http://2012books.lardbucket.org/books/beginning-psychology/

Page 31: Spark MLlib - Training Material

DEMO

Page 32: Spark MLlib - Training Material

Quiz 2

• In which situation below can Pearson correlation NOT be used?
1. When the two variables have no correlation
2. When the two variables have an exponential relationship
3. When the number of data rows is > 30
4. When the variables are normally distributed

Page 33: Spark MLlib - Training Material

Clustering Model: K-Means

Page 34: Spark MLlib - Training Material

K-means

• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
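A minimal NumPy sketch of the two alternating steps that k-means repeats (an illustration of the idea, not MLlib's implementation):

import numpy as np

def kmeans_step(points, centers):
    # Assignment step: attach each point to the cluster with the nearest mean.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points
    # (assumes every cluster keeps at least one point).
    centers = np.array([points[labels == k].mean(axis=0)
                        for k in range(len(centers))])
    return labels, centers

# Repeating kmeans_step until the labels stop changing yields the clustering.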

Page 35: Spark MLlib - Training Material

Clustering with K-Means

Page 36: Spark MLlib - Training Material

Clustering with K-Means

Page 37: Spark MLlib - Training Material

Clustering with K-Means

Page 38: Spark MLlib - Training Material

Clustering with K-Means

Page 39: Spark MLlib - Training Material

Clustering with K-Means

Page 40: Spark MLlib - Training Material

Clustering with K-Means

Page 41: Spark MLlib - Training Material

Summary

https://en.wikipedia.org/wiki/K-means_clustering

Page 42: Spark MLlib - Training Material

K-Means: Python

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Page 43: Spark MLlib - Training Material

DEMO

Page 44: Spark MLlib - Training Material

Quiz 3

• Which data is not suitable for K-means analysis?

Page 45: Spark MLlib - Training Material

Classification Model: Logistic Regression

Page 46: Spark MLlib - Training Material

Linear regression

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

Page 47: Spark MLlib - Training Material

When the outcome is only 1/0

http://www.slideshare.net/21_venkat/logistic-regression-17406472

Page 48: Spark MLlib - Training Material

Hypotheses function

• hypothesis: h(x) = g(θᵀx)

• sigmoid function: g(z) = 1 / (1 + e^(−z))
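In code, the hypothesis is just the sigmoid applied to a linear combination of the features; a small NumPy sketch:

import numpy as np

def sigmoid(z):
    # Squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h(x) = g(theta^T x): the predicted probability that the label is 1.
    return sigmoid(np.dot(theta, x))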

Page 49: Spark MLlib - Training Material

Cost Function

• Maximum Likelihood estimation

• Per-example cost:
Cost(h(x), y) = −log(h(x)) if y = 1
Cost(h(x), y) = −log(1 − h(x)) if y = 0

• Regularized cost:
J(θ) = −(1/m) Σ [ y log h(x) + (1 − y) log(1 − h(x)) ] + (λ/2m) Σ θⱼ²
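As a plain NumPy sketch (not MLlib's internal code), the regularized cost over m training examples:

import numpy as np

def cost(theta, X, y, lam):
    # Average negative log-likelihood of the logistic model ...
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))
    nll = -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)).mean()
    # ... plus an L2 penalty; by convention the intercept theta[0] is skipped.
    return nll + (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)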

Page 50: Spark MLlib - Training Material

Sample Code

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

Page 51: Spark MLlib - Training Material

Quiz 4

• What situation is not suitable for logistic regression?
1. Data with multiple label points
2. Data with no labels
3. Data with more than 1,000 features
4. Data with dummy features

Page 52: Spark MLlib - Training Material

DEMO

Page 53: Spark MLlib - Training Material

Recommendation Model: Alternating Least Squares

Page 54: Spark MLlib - Training Material

Definition

• Collaborative Filtering (CF) is a class of algorithms that exploit other users' ratings of items (selection and purchase information can also be used) together with the target user's history to recommend items the target user has not yet rated.

• The fundamental assumption behind this approach is that other users' preferences over the items can be used to recommend an item to a user who has not seen or purchased it before.

Page 55: Spark MLlib - Training Material

Definition

• CF differs from content-based methods in that the content of the user or item itself plays no role in the recommendation; what matters is how (rating) and by which users (user) a particular item was rated. (Users' preferences are shared through a group of users who selected, purchased, or rated items similarly.)
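The idea behind ALS: approximate the sparse user-item rating matrix R by a product of two low-rank factor matrices, R ≈ U·Vᵀ, alternating between solving for U with V fixed and for V with U fixed. A toy NumPy sketch with hypothetical ratings (it treats the zeros as real ratings, whereas real ALS masks or weights unrated entries):

import numpy as np

# Hypothetical 3-user x 4-item rating matrix (0 = unrated).
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.]])

rank = 2
U = np.random.rand(3, rank)   # user factors
V = np.random.rand(4, rank)   # item factors

for _ in range(10):
    # Fix V, solve least squares for the user factors, then the reverse.
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T
    V = np.linalg.lstsq(U, R, rcond=None)[0].T

# Predicted rating for user u, item i: the dot product of their factors.
predictions = U.dot(V.T)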

Page 56: Spark MLlib - Training Material

Recommendation

Page 57: Spark MLlib - Training Material

Recommendation

Page 58: Spark MLlib - Training Material

Recommendation

Page 59: Spark MLlib - Training Material

Recommendation

Page 60: Spark MLlib - Training Material
Page 61: Spark MLlib - Training Material

ALS: Python

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Page 62: Spark MLlib - Training Material

DEMO

Page 63: Spark MLlib - Training Material

Quiz 5

• Search for other recommendation algorithms and point out how they differ from ALS:
- Item-based
- User-based
- Content-based

Page 64: Spark MLlib - Training Material

Reference

• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html

• MLlib: Spark's Machine Learning Library
http://www.slideshare.net/jeykottalam/mllib

• Recent Developments in Spark MLlib and Beyond
http://www.slideshare.net/Hadoop_Summit

• Introduction to Machine Learning
http://www.slideshare.net/rahuldausa/introduction-to-machine-learning