Spark MLlib - Training Material


Spark MLlib

About me
- Experience: Vpon Data Engineer; TWM, Keywear, Nielsen
- Bryan's notes for data analysis: http://bryannotes.blogspot.tw
- Spark.TW
- LinkedIn

Agenda
- Introduction to Machine Learning
- Why MLlib
- Basic Statistics
- K-means
- Logistic Regression
- ALS
- Demo

Introduction to Machine Learning

Definition
Machine learning is the study of computer algorithms that improve automatically through experience.

Terminology
- Observation: the basic unit of data.
- Features: attributes that can be used to describe each observation in a quantitative manner.
- Feature vector: an n-dimensional vector of numerical features that represents some object.
- Training/testing/evaluation set: the set of data used to discover potential predictive relationships.

Learning (Training)
Features:
1. Color: Red/Green
2. Type: Fruit
3. Shape: nearly Circle
etc.


ID  Color  Type   Shape          is apple (Label)
1   Red    Fruit  Circle         Y
2   Red    Fruit  Circle         Y
3   Black  Logo   nearly Circle  N
4   Blue   N/A    Circle         N
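To make the table concrete, the rows above can be encoded as (label, feature-vector) pairs. This is a framework-free sketch; the integer codes for the categorical values are an illustrative assumption, not part of the slides:

```python
# Map each categorical value to an integer code (illustrative assumption).
color_map = {"Red": 0, "Black": 1, "Blue": 2}
type_map = {"Fruit": 0, "Logo": 1, "N/A": 2}
shape_map = {"Circle": 0, "nearly Circle": 1}

rows = [
    ("Red", "Fruit", "Circle", 1),          # is apple: Y
    ("Red", "Fruit", "Circle", 1),          # is apple: Y
    ("Black", "Logo", "nearly Circle", 0),  # is apple: N
    ("Blue", "N/A", "Circle", 0),           # is apple: N
]

# Each observation becomes a label plus a numeric feature vector.
dataset = [
    (label, [color_map[c], type_map[t], shape_map[s]])
    for c, t, s, label in rows
]
print(dataset[0])  # (1, [0, 0, 0])
```

This is exactly the shape of data (label + feature vector) that MLlib's LabeledPoint, shown later, is designed to hold.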

Categories of Machine Learning
- Classification: predict a class from observations.
- Clustering: group observations into meaningful groups.
- Regression: predict a value from observations.

Cluster the observations with no Labels

Cut the observations

Find a model to describe the observations

Machine Learning Pipelines

What is MLlib

MLlib is an Apache Spark component focusing on machine learning:
- MLlib is Spark's core ML library
- Developed by the MLbase team in AMPLab
- 80+ contributions from various organizations
- Supports Scala, Python, and Java APIs

Algorithms in MLlib
- Statistics: description, correlation
- Clustering: k-means
- Collaborative filtering: ALS
- Classification: SVMs, naive Bayes, decision trees, logistic regression
- Regression: linear regression
- Dimensionality reduction: SVD, PCA

Why MLlib
- Scalability
- Performance
- User-friendly documentation and APIs
- Cost of maintenance


Data Types
- Dense vector
- Sparse vector
- Labeled point

Dense & Sparse
Raw Data:



Dense vs Sparse
Training set:
- number of examples: 12 million
- number of features: 500
- sparsity: 10%
The sparse representation not only saves storage, but also yielded a 4x speed-up.

         Dense   Sparse
Storage  47 GB   7 GB
Time     240 s   58 s
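The storage numbers on this slide can be sanity-checked with a back-of-envelope calculation. This is a sketch assuming 8-byte float values and 4-byte integer column indices, which the slide does not state:

```python
n_examples = 12_000_000
n_features = 500
sparsity = 0.10  # fraction of entries that are non-zero

# Dense: every entry stored as an 8-byte float.
dense_bytes = n_examples * n_features * 8

# Sparse: only non-zero entries, each carrying an 8-byte value
# plus a 4-byte column index.
nnz = int(n_examples * n_features * sparsity)
sparse_bytes = nnz * (8 + 4)

gb = 1000 ** 3
print(round(dense_bytes / gb), "GB dense")       # ~48 GB, close to the 47 GB above
print(round(sparse_bytes / gb, 1), "GB sparse")  # ~7.2 GB, close to the 7 GB above
```

At 10% sparsity the sparse format pays 12 bytes per stored entry but stores 10x fewer entries, which is where the roughly 7x storage saving comes from.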

Labeled Point
- Dummy variable (1, 0)
- Categorical variable (0, 1, 2, ...)

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))


Descriptive Statistics
Supported functions:
- count
- max
- min
- mean
- variance
Supported data types:
- Dense
- Sparse
- Labeled Point

Example

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
from math import sqrt
import numpy as np

## example data (at least a 2 x 2 matrix)
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.0, 2.0, 3.0, 4.0, 5.0]])

## to RDD
distData = sc.parallelize(data)

## compute statistic values
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0], 3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3))
print " Max value: {}".format(round(summary.max()[0], 3))
print " Min value: {}".format(round(summary.min()[0], 3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
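The same column statistics can be cross-checked without Spark. A minimal NumPy sketch on the same 2 x 5 matrix (colStats aggregates over rows, so column 0 holds the two values 1.0 and 1.0):

```python
import numpy as np

# Same data as the Spark example above.
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.0, 2.0, 3.0, 4.0, 5.0]])

mean = data.mean(axis=0)             # per-column mean
variance = data.var(axis=0, ddof=1)  # per-column sample variance
count = data.shape[0]                # number of rows (observations)

print(mean[0], variance[0], count)  # 1.0 0.0 2
```

Because both rows are identical, every column has zero variance, which is what the Spark example's standard-deviation line would report for this data.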


Quiz 1
Which statement is NOT a purpose of descriptive statistics?
1. Find the extreme values of the data.
2. Get a preliminary understanding of the data's properties.
3. Describe the full properties of the data.


Example
Data Set

from pyspark.mllib.stat import Statistics

## choose correlation type
corrType = 'pearson'

## compute correlation
group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
corr = Statistics.corr(group1, group2, corrType)
print corr
# ~0.89

Data set:

       A  B  C  D  E  F  G  H
Case1  1  3  4  3  7  5  6  8
Case2  1  2  3  4  5  6  7  8
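The Pearson coefficient for the two cases above can be verified without Spark using NumPy's correlation matrix:

```python
import numpy as np

# The two cases from the data set above.
case1 = [1, 3, 4, 3, 7, 5, 6, 8]
case2 = [1, 2, 3, 4, 5, 6, 7, 8]

# corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two series.
r = np.corrcoef(case1, case2)[0, 1]
print(round(r, 2))  # 0.89
```

Statistics.corr on the equivalent RDDs returns the same value, since both compute the standard Pearson product-moment correlation.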

Shape is important


Quiz 2
In which situation below can Pearson correlation NOT be used?
1. When two variables have no correlation
2. When two variables have a nonlinear (e.g., exponential) relationship
3. When the number of data rows > 30
4. When the variables are normally distributed

Clustering Model
K-Means

K-meansK-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Clustering with K-Means



K-Means: Python

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
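To show what KMeans.train is doing under the hood, here is a minimal single-machine sketch of Lloyd's algorithm in NumPy. It uses a naive "first k points" initialization for determinism (MLlib's "random" mode samples instead), and omits convergence checks:

```python
import numpy as np

def kmeans(points, k, n_iter=10):
    """Minimal Lloyd's-algorithm k-means on an (n, d) array."""
    centers = points[:k].copy()  # naive deterministic init
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs; k-means should recover one cluster per blob.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, labels = kmeans(pts, k=2)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

MLlib distributes the assignment step across the cluster (a map over the RDD) and aggregates the per-cluster sums in the update step, but the iteration structure is the same.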


Quiz 3
Which data is not suitable for K-means analysis?

Classification Model
Logistic Regression

Linear Regression

When the outcome is only 1/0

Hypothesis function
hypothesis: h_θ(x) = g(θ^T x)
sigmoid function: g(z) = 1 / (1 + e^(-z))

Maximum likelihood estimation
Cost function:
Cost(h_θ(x), y) = -log(1 - h_θ(x))  if y = 0
Cost(h_θ(x), y) = -log(h_θ(x))      if y = 1
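These textbook formulas can be evaluated directly to see why the cost function works, a small framework-free sketch:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z): squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    """Per-example cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(sigmoid(0))              # 0.5: an undecided prediction
print(round(cost(0.9, 1), 3))  # 0.105: confident and correct -> low cost
print(round(cost(0.9, 0), 3))  # 2.303: confident and wrong -> high cost
```

The asymmetry is the point: a confident wrong prediction is penalized much more heavily than a hesitant one, which is what drives the maximum-likelihood fit.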


Sample Code

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

Quiz 4
What situation is not suitable for logistic regression?
1. Data with multiple label classes
2. Data with no labels
3. Data with more than 1,000 features
4. Data with dummy features


Recommendation Model
Alternating Least Squares

Definition
Collaborative filtering (CF) is a family of algorithms that exploit other users' ratings of items (selection and purchase information can also be used), together with the target user's history, to recommend an item the target user has not yet rated. The fundamental assumption behind this approach is that other users' preferences over the items can be used to recommend an item to a user who has not seen or purchased it before.

Definition
CF differs from content-based methods in that the content of the user or item itself plays no role in the recommendation; what matters is how (rating) and by which users (user) a particular item was rated. (Users' preferences are shared through the group of users who selected, purchased, or rated items similarly.)
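The "alternating" idea behind ALS can be shown in a minimal NumPy sketch on a toy rating matrix. This is an illustration only: it assumes a fully observed matrix and omits the regularization and missing-entry handling that MLlib's ALS implements:

```python
import numpy as np

# Toy 4-users x 3-items rating matrix (fully observed for simplicity).
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])

rank = 2
rng = np.random.RandomState(0)
U = rng.rand(R.shape[0], rank)  # user factors
V = rng.rand(R.shape[1], rank)  # item factors

for _ in range(20):
    # Fix V and solve the least-squares problem for U,
    # then fix U and solve for V -- hence "alternating".
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T
    V = np.linalg.lstsq(U, R, rcond=None)[0].T

# U @ V.T is the model's predicted rating matrix.
mse = ((R - U @ V.T) ** 2).mean()
print(round(mse, 4))  # small residual: rank-2 factors fit R closely
```

Each half-step is an ordinary least-squares solve, which is why the subproblems are cheap and parallelize well; MLlib distributes exactly these solves across the cluster.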





ALS: Python

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))


Quiz 5
Search for other recommendation algorithms and point out their differences from ALS:
- Item-based
- User-based
- Content-based

Reference
- Machine Learning Library (MLlib) Guide
- Spark's Machine Learning Library
- Developments in Spark MLlib and Beyond
- Introduction to Machine Learning