Spark MLlib - Training Material


Spark MLlib

About me
- Experience: Vpon Data Engineer; TWM, Keywear, Nielsen
- Bryan's notes for data analysis: http://bryannotes.blogspot.tw
- Spark.TW
- LinkedIn: https://tw.linkedin.com/pub/bryan-yang/7b/763/a79

Agenda
- Introduction to Machine Learning
- Why MLlib
- Basic Statistics
- K-means
- Logistic Regression
- ALS
- Demo

Introduction to Machine Learning

Definition
Machine learning is the study of computer algorithms that improve automatically through experience.

Terminology
- Observation: the basic unit of data.
- Features: attributes that describe each observation in a quantitative manner.
- Feature vector: an n-dimensional vector of numerical features that represents some object.
- Training/testing/evaluation sets: sets of data used to discover potential predictive relationships.

Learning (Training)
Features:
1. Color: Red / Green
2. Type: Fruit
3. Shape: nearly circle
etc.

Learning(Training)

ID  Color  Type   Shape          Is apple (Label)
1   Red    Fruit  Circle         Y
2   Red    Fruit  Circle         Y
3   Black  Logo   Nearly circle  N
4   Blue   N/A    Circle         N
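As a minimal sketch (the numeric encoding below is an assumption, not part of the slides), each row of the table becomes a feature vector plus a label:

# Hypothetical encoding: Color Red=0/Black=1/Blue=2, Type Fruit=0/Logo=1/N/A=2,
# Shape Circle=0/nearly Circle=1; Label Y=1, N=0
observations = [
    ([0, 0, 0], 1),  # ID 1: Red, Fruit, Circle         -> apple
    ([0, 0, 0], 1),  # ID 2: Red, Fruit, Circle         -> apple
    ([1, 1, 1], 0),  # ID 3: Black, Logo, nearly Circle -> not apple
    ([2, 2, 0], 0),  # ID 4: Blue, N/A, Circle          -> not apple
]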

Categories of Machine Learning
- Classification: predict a class from observations.
- Clustering: group observations into meaningful groups.
- Regression: predict a value from observations.

Cluster the observations with no Labels

https://en.wikipedia.org/wiki/Cluster_analysis

Cut the observations

http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression

Find a model to describe the observations

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

Machine Learning Pipelines

http://www.nltk.org/book/ch06.html

What is MLlib

What is MLlib
MLlib is an Apache Spark component focused on machine learning:
- MLlib is Spark's core ML library
- Developed by the MLbase team in AMPLab
- 80+ contributors from various organizations
- Supports Scala, Python, and Java APIs

Algorithms in MLlib
- Statistics: description, correlation
- Clustering: k-means
- Collaborative filtering: ALS
- Classification: SVMs, naive Bayes, decision trees
- Regression: linear regression, logistic regression
- Dimensionality reduction: SVD, PCA

Why MLlib
- Scalability
- Performance
- User-friendly documentation and APIs
- Cost of maintenance

Performance

Data Types
- Dense vector
- Sparse vector
- Labeled point

Dense & Sparse
Raw Data:

ID  A  B  C  D  E  F
1   1  0  0  0  0  3
2   0  1  0  1  0  2
3   1  1  1  0  1  1
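As a small sketch (the variable names are illustrative), row ID 1 of the raw data above can be stored either densely or sparsely with pyspark.mllib.linalg:

from pyspark.mllib.linalg import Vectors

# Row ID 1, features A-F: [1, 0, 0, 0, 0, 3]
dense_row = Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0, 3.0])

# Sparse form: vector size 6, non-zero values only at indices 0 and 5
sparse_row = Vectors.sparse(6, [0, 5], [1.0, 3.0])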

Dense vs Sparse
Training set:
- number of examples: 12 million
- number of features: 500
- sparsity: 10%
Using sparse vectors not only saves storage but also gives a ~4x speed-up:

         Dense   Sparse
Storage  47 GB   7 GB
Time     240 s   58 s

Labeled Point
- Dummy variable (1, 0)
- Categorical variable (0, 1, 2, ...)

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

DEMO

Descriptive Statistics
Supported functions:
- count
- max
- min
- mean
- variance
Supported data types:
- Dense
- Sparse
- Labeled Point

Example

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
from math import sqrt
import numpy as np

## example data (a 2 x 2 matrix at least)
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.0, 2.0, 3.0, 4.0, 5.0]])

## to RDD
distData = sc.parallelize(data)

## Compute statistic values
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0], 3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3))
print " Max value: {}".format(round(summary.max()[0], 3))
print " Min value: {}".format(round(summary.min()[0], 3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])

DEMO

Quiz 1
Which statement is NOT a purpose of descriptive statistics?
1. Find the extreme values of the data.
2. Get a preliminary understanding of the data's properties.
3. Describe the full properties of the data.

Correlation

https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples

Example Data Set

       A  B  C  D  E  F  G  H
Case1  1  3  4  3  7  5  6  8
Case2  1  2  3  4  5  6  7  8

from pyspark.mllib.stat import Statistics

## correlation type
corrType = 'pearson'

## data
group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
corr = Statistics.corr(group1, group2, corrType)
print corr
# 0.87

Shape is important

http://2012books.lardbucket.org/books/beginning-psychology/
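Pearson only measures linear association, which is why the shape of the data matters. As a sanity check, Spark's Statistics.corr also supports Spearman (rank) correlation; a minimal sketch reusing the groups from the example above:

from pyspark.mllib.stat import Statistics

group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])

# Spearman works on ranks, so it is less sensitive to the shape of the relationship
spearman = Statistics.corr(group1, group2, method='spearman')
print spearman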

DEMO

Quiz 2
In which situation below can Pearson correlation NOT be used?
1. When the two variables have no correlation
2. When the two variables are related exponentially (non-linearly)
3. When the number of data rows is > 30
4. When the variables are normally distributed

Clustering Model: K-Means

K-meansK-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
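Equivalently (a standard formulation, not spelled out on the slide), K-means searches for clusters and centers that minimize the within-cluster sum of squares, which is what the WSSSE computed in the code example later measures:

$\arg\min_{S}\ \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $S_1, \dots, S_k$ are the clusters and $\mu_i$ is the mean of cluster $S_i$.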

Clustering with K-Means


Summary

https://en.wikipedia.org/wiki/K-means_clustering

K-Means: Python

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
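The number of clusters k must be chosen by the user; a common heuristic (not shown on the slides) is to train for several candidate values of k and look for an "elbow" in the WSSSE curve. A rough sketch reusing parsedData and error() from above; the candidate k values are arbitrary:

# Compare WSSSE for several candidate values of k
for k in [2, 3, 4, 5, 6]:
    clusters = KMeans.train(parsedData, k, maxIterations=10, runs=10,
                            initializationMode="random")
    wssse = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("k = %d, WSSSE = %.3f" % (k, wssse))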

DEMO

Quiz 3
Which data is not suitable for K-means analysis?

Classification Model: Logistic Regression

linear regression

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

When outcome is only 1/0

http://www.slideshare.net/21_venkat/logistic-regression-17406472

Hypothesis Function
hypothesis: $h_\theta(x) = g(\theta^T x)$
sigmoid function: $g(z) = \dfrac{1}{1 + e^{-z}}$

Maximum Likelihood Estimation
Cost function:
$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$
$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$

Regularized cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\big[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
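A small NumPy sketch of the pieces above (illustrative only; MLlib's LogisticRegressionWithSGD implements its own optimized version of this internally):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    # Negative average log-likelihood plus an L2 penalty on the weights
    h = sigmoid(X.dot(theta))
    m = len(y)
    log_likelihood = y * np.log(h) + (1 - y) * np.log(1 - h)
    return -log_likelihood.mean() + (lam / (2.0 * m)) * np.sum(theta ** 2)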

Sample Code

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
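The error above is measured on the training data itself; to follow the training/testing split from the Terminology slide, the data can be held out first with RDD.randomSplit (a sketch; the 70/30 ratio and seed are arbitrary choices):

# Hold out part of the data for testing
trainData, testData = parsedData.randomSplit([0.7, 0.3], seed=42)
model = LogisticRegressionWithSGD.train(trainData)

testLabelsAndPreds = testData.map(lambda p: (p.label, model.predict(p.features)))
testErr = testLabelsAndPreds.filter(lambda (v, p): v != p).count() / float(testData.count())
print("Test Error = " + str(testErr))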

Quiz 4
Which situation is not suitable for logistic regression?
1. Data with multiple label values
2. Data with no labels
3. Data with more than 1,000 features
4. Data with dummy features

DEMO

Recommendation Model: Alternating Least Squares

Definition
Collaborative Filtering (CF) is a family of algorithms that exploit other users' ratings of items (selection or purchase information can also be used) together with the target user's history to recommend items the target user has not yet rated. The fundamental assumption behind this approach is that other users' preferences over the items can be used to recommend an item to a user who has not yet seen or purchased it.

Definition
CF differs from content-based methods in that the attributes of the user or the item itself do not drive the recommendation; what matters is how (the rating) and by which users a particular item was rated. (Users' preferences are shared through the group of users who selected, purchased, or rated items similarly.)
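In sketch form, ALS realizes CF through matrix factorization: the sparse user-item rating matrix $R$ is approximated by two low-rank factor matrices that are solved for alternately, each step being an ordinary least-squares problem:

$R \approx U V^T, \qquad \min_{U,V} \sum_{(i,j)\ \mathrm{observed}} \big(R_{ij} - u_i^T v_j\big)^2 + \lambda \big(\lVert U \rVert^2 + \lVert V \rVert^2\big)$

Holding $V$ fixed, each user factor $u_i$ has a closed-form least-squares solution; then $U$ is fixed and $V$ is updated, repeating until convergence. The rank parameter in the code example below is the number of latent factors.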

Recommendation


ALS: Python

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')) \
              .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
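Beyond predictAll, the trained model can score a single user/product pair or return top-N recommendations; a usage sketch with illustrative IDs (recommendProducts is available on MatrixFactorizationModel in Spark 1.4+):

# Predicted rating for one (user, product) pair; the IDs are illustrative
print(model.predict(1, 2))

# Top 3 product recommendations for user 1
for rec in model.recommendProducts(1, 3):
    print(rec)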

DEMO

Quiz 5
Search for other recommendation algorithms and point out the differences from ALS:
- Item-based
- User-based
- Content-based

Reference
- Machine Learning Library (MLlib) Guide: http://spark.apache.org/docs/1.4.1/mllib-guide.html
- MLlib: Spark's Machine Learning Library: http://www.slideshare.net/jeykottalam/mllib
- Recent Developments in Spark MLlib and Beyond: http://www.slideshare.net/Hadoop_Summit
- Introduction to Machine Learning: http://www.slideshare.net/rahuldausa/introduction-to-machine-learning