MLI - An API for Distributed Machine Learning

Sarang Dev


MLI - API

● Simplify the development of high-performance, scalable, distributed algorithms.

● Targets common ML problems related to data loading, feature extraction, model training.

● Usability: Comparable to MATLAB and R

● Scalability: Matches low-level systems like GraphLab and Vowpal Wabbit


Big Picture: MLbase

● ML Optimizer: This layer aims to automate the task of ML pipeline construction.

● MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.

● MLlib: Apache Spark's distributed ML library. Many features in MLlib have been borrowed from ML Optimizer and MLI. Maintained by Spark community.


Installing MLI

● https://github.com/amplab/MLI

● Java 7 (not compatible with Java 8)

● Scala 2.9.3

● Spark 0.8.0

● Needs some changes in the build files to compile

https://drive.google.com/open?id=0B64IP8kXPIDpTVE0NmFaanFWOUU

● Uses sbt (an interactive build tool) for building

● Run these commands at the sbt prompt:

> compile

> assembly (builds a jar under target)


MLI Interfaces

MLTable

● MLTable is an object that provides a familiar table-like interface to a developer and is designed to mimic a SQL table.

● It is an interface for processing semi-structured, mixed-type data.

● Once data is featurized, it can be cast into an MLNumericTable, which is a convenience type that most ML algorithms will expect as input.
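
The step from a mixed-type table to a numeric one can be pictured with a small plain-Python sketch. This is not MLI's API: the function `to_numeric_table`, the sample rows, and the choice of one-hot encoding are all assumptions made for illustration.

```python
# Hypothetical illustration only: none of these names come from MLI.
from typing import Dict, List, Union

Row = Dict[str, Union[str, float, int]]


def to_numeric_table(rows: List[Row], categorical: List[str]) -> List[List[float]]:
    """One-hot encode the categorical columns and pass numeric columns through."""
    # Sorted set of values seen in each categorical column.
    levels = {c: sorted({str(r[c]) for r in rows}) for c in categorical}
    numeric_cols = [c for c in rows[0] if c not in categorical]
    table = []
    for r in rows:
        features = [float(r[c]) for c in numeric_cols]
        for c in categorical:
            features.extend(1.0 if str(r[c]) == v else 0.0 for v in levels[c])
        table.append(features)
    return table


# Made-up mixed-type rows: one numeric column, one string column.
rows = [
    {"age": 31, "city": "Buffalo"},
    {"age": 45, "city": "Albany"},
    {"age": 27, "city": "Buffalo"},
]
numeric = to_numeric_table(rows, categorical=["city"])
```

Every row of `numeric` is a list of floats of the same length, which is the shape of input a numeric-table type is meant to guarantee.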


MLI Interfaces

LocalMatrix

● LocalMatrix provides linear algebra primitives on partitions of data. The partitions are automatically determined by the system.
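
To see why per-partition linear algebra is useful, here is a hedged plain-Python sketch (not the LocalMatrix API). It computes the Gram matrix of a data matrix by computing a small local product on each partition and summing the results. The helper names `gram` and `add` and the two partitions are assumptions for illustration.

```python
# Hypothetical illustration only: not the LocalMatrix API.
from typing import List

Matrix = List[List[float]]


def gram(block: Matrix) -> Matrix:
    """Return block^T * block for one partition of rows."""
    d = len(block[0])
    return [[sum(row[i] * row[j] for row in block) for j in range(d)]
            for i in range(d)]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two made-up partitions of a 4 x 2 data matrix.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

partials = [gram(p) for p in partitions]  # "map": one local result per partition
total = partials[0]
for p in partials[1:]:
    total = add(total, p)                 # "reduce": combine the local results
```

The combined `total` equals the Gram matrix of all four rows taken together, so no single machine ever needs the full data matrix.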

Optimizer, Algorithm, and Model

● Can implement algorithms using the Algorithm interface, which should return a model as specified by the Model interface.

● Optimization techniques are used to converge to an approximate solution while iterating over the data.
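
The division of labour between the three roles can be sketched in plain Python. The class and function names below are hypothetical and do not reproduce MLI's interfaces. The sketch assumes one-dimensional linear regression trained by batch gradient descent on mean squared error.

```python
# Hypothetical illustration only: these are not MLI's interfaces.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinearModel:
    """Plays the role of a Model: holds parameters and makes predictions."""
    weight: float
    bias: float

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias


def gradient_descent(data: List[Tuple[float, float]], lr: float,
                     steps: int) -> Tuple[float, float]:
    """Plays the role of an Optimizer: iterates over the data toward a solution."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w and b.
        gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
        gb = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * gw
        b -= lr * gb
    return w, b


class LinearRegressionAlgorithm:
    """Plays the role of an Algorithm: train() returns a Model."""

    def train(self, data: List[Tuple[float, float]], lr: float = 0.05,
              steps: int = 2000) -> LinearModel:
        w, b = gradient_descent(data, lr, steps)
        return LinearModel(w, b)


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
model = LinearRegressionAlgorithm().train(data)
```

Keeping the optimizer separate means the same training loop could be reused by other algorithms that only differ in their gradient.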


Using MLI

● ADD_JARS=<path to mli jar> spark-shell

We can perform all the tasks in a spark-shell, which always has an initialized Spark context.

import mli.feat._
import mli.interface._

val mc = new MLContext(sc)
val inputTable = mc.loadFile("/home/sarang/Downloads/sample.txt").cache() // MLTable

// c is the column on which we want to perform N-gram extraction
// n is the N-gram length, e.g., n=2 corresponds to bigrams
// k is the number of top N-grams we want to use (sorted by N-gram frequency)
val (featurizedData, ngfeaturizer) = NGrams.extractNGrams(inputTable, c=0, n=2, k=10, stopWords = NGrams.stopWords)
val (scaledData, featurizer) = Scale.scale(featurizedData.filter(_.nonZeros.length > 5).cache(), 0, ngfeaturizer)
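
What top-k N-gram featurization does conceptually can be shown with a small plain-Python sketch for n=2. It is not MLI's implementation. The function names, the sample documents, dropping stop words before forming bigrams, using raw counts as feature values, and the alphabetical tie-break are all assumptions.

```python
# Hypothetical illustration only: not how MLI's NGrams is implemented.
from collections import Counter
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(text: str, stop_words: Set[str]) -> List[Bigram]:
    """Lower-case, drop stop words, then pair each token with its successor."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return list(zip(tokens, tokens[1:]))


def extract_top_bigrams(docs: List[str], k: int, stop_words: Set[str]):
    """Return the k most frequent bigrams and one count vector per document."""
    counts = Counter(bg for doc in docs for bg in bigrams(doc, stop_words))
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [bg for bg, _ in ranked[:k]]
    features = []
    for doc in docs:
        doc_counts = Counter(bigrams(doc, stop_words))
        features.append([float(doc_counts[bg]) for bg in vocab])
    return vocab, features


docs = [
    "the spark shell starts the spark context",
    "the spark shell loads data",
    "spark context loads data",
]
vocab, features = extract_top_bigrams(docs, k=2, stop_words={"the"})
```

Each document becomes a fixed-length numeric vector with one entry per selected bigram, which is what lets the result be scaled and passed to a learning algorithm.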


Spark

Engine for large-scale data processing

● Speed: Run programs up to 100x faster than Hadoop MapReduce in memory

● Ease of Use: Write applications in Java, Scala, Python, R

● Libraries: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming


MLlib

MLlib is a standard component of Spark providing machine learning primitives on top of Spark.

● Scalability

● Performance

● User-friendly APIs

● Integration with Spark and its other components

● Support for Java, Scala, Python


MLlib

● Classification: logistic regression, naive Bayes,...

● Regression: generalized linear regression, isotonic regression,...

● Decision trees, random forests, and gradient-boosted trees

● Recommendation: alternating least squares (ALS)

● Clustering: K-means, Gaussian mixtures (GMMs),...

● Topic modeling: latent Dirichlet allocation (LDA)

● Feature transformations: standardization, normalization, hashing,...

● Model evaluation and hyper-parameter tuning

● ML Pipeline construction

● ML persistence: saving and loading models and Pipelines

● Survival analysis: accelerated failure time model

● Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan

● Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA),...

● Statistics: summary statistics, hypothesis testing,...


Data Types in MLlib

● Local vector: A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine.

● Labeled point: A labeled point is a local vector, either dense or sparse, associated with a label/response.

E.g. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

● Local matrix

● Distributed matrix

– RowMatrix

– IndexedRowMatrix

– CoordinateMatrix

– BlockMatrix
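
The difference between a dense and a sparse local vector can be sketched in plain Python. This is an illustration of the idea, not MLlib's classes: the helpers `to_sparse` and `to_dense` and the tuple layout `(size, indices, values)` are assumptions.

```python
# Hypothetical illustration only: not MLlib's vector classes.
from typing import List, Tuple

Sparse = Tuple[int, List[int], List[float]]


def to_sparse(dense: List[float]) -> Sparse:
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size: int, indices: List[int], values: List[float]) -> List[float]:
    """Rebuild the full vector, filling unlisted positions with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 3.0]  # same values as the LabeledPoint example
size, indices, values = to_sparse(dense)

# A labeled point pairs a label with either representation of the vector.
labeled_point = (1.0, (size, indices, values))
```

The sparse form stores only the 0-based positions and values of the non-zero entries, which saves memory when most features are zero.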


DataFrames in Spark SQL

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6. The Dataset API is available in Scala and Java; Python does not support the Dataset API.

A DataFrame can play the role that the MLTable interface plays in MLI.

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")


Using MLlib

● Example: K-means clustering (partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean)

Implement using PySpark

Implement using Scala
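
Before turning to the library versions, the definition above can be made concrete with a minimal plain-Python sketch of Lloyd's algorithm, the usual iterative procedure for K-means. It is not MLlib's implementation. The function names, the made-up 2-D points, and seeding with two hand-picked initial centers are assumptions.

```python
# Hypothetical illustration only: not MLlib's KMeans implementation.
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points: List[Point], centers: Sequence[Point],
           max_iterations: int = 10) -> List[Point]:
    centers = list(centers)
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return centers


def wssse(points: List[Point], centers: Sequence[Point]) -> float:
    """Within Set Sum of Squared Errors: squared distance to the nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, centers=[points[0], points[3]])
cost = wssse(points, centers)
```

On this toy data the two centers settle near (0.1, 0.1) and (9.1, 9.1). The library implementations add distributed execution and smarter initialization on top of the same two alternating steps.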


K-means PySpark

from numpy import array
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel

conf = SparkConf().setAppName("KMeans").setMaster("local")
sc = SparkContext(conf=conf)

# Load and parse the data: one point per line, coordinates separated by spaces
data = sc.textFile("/home/sarang/Downloads/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data into k=2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors:
# the squared distance from each point to its assigned cluster center
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model (the load path must match the save path exactly)
clusters.save(sc, "/home/sarang/KMeansModel")
sameModel = KMeansModel.load(sc, "/home/sarang/KMeansModel")

We can also use the pyspark shell instead, where the SparkContext sc is already created.


K-means Scala

import org.apache.spark.ml.clustering.KMeans

val dataset = spark.read.format("libsvm").load("/home/sarang/Downloads/kmeans_data1.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)

// Print the cluster centers.
model.clusterCenters.foreach(println)


Conclusion

● MLI is outdated and most of its features have been included in MLlib.

● MLlib can act as a powerful tool for machine learning.


References

● MLI Tutorial : http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html

● MLlib : http://spark.apache.org/docs/latest/mllib-guide.html