MLlib and Machine Learning on Spark
Apache Spark
MLlib and Machine Learning on Spark
Petr Zapletal
Cake Solutions

Apache Spark and Big Data
- History and market overview
- Installation
- MLlib and Machine Learning on Spark
- Porting R code to Scala and Spark
- Concepts - Core, SQL, GraphX, Streaming
- Spark's distributed programming model
- Deployment

Table of contents
- Machine Learning Introduction
- Spark ML Support - MLlib
- Machine Learning Techniques
- Tips & Considerations
- ML Pipelines
- Q & A

Machine Learning
- Subfield of Artificial Intelligence (AI)
- Construction & study of systems that can learn from data
- Computers act without being explicitly programmed
- Can be seen as building blocks to make computers behave more intelligently

Terminology
- Features: each item is described by a number of features
- Samples: a sample is an item to process (document, picture, row in db, graph, ...)
- Feature vector: n-dimensional vector of numerical features representing some sample
- Labelled data: data with known classification results

Categories
- Supervised learning: labelled data are available
- Unsupervised learning: no labelled data are available
- Semi-supervised learning: mix of supervised and unsupervised learning; usually a small part of the data is labelled
- Reinforcement learning: the model continuously learns and relearns based on its actions and the effects/rewards of those actions (reward feedback)

"Reinforcement learning (RL) and supervised learning are usually portrayed as distinct methods of learning from experience. RL methods are often applied to problems involving sequential dynamics and optimization of a scalar performance objective, with online exploration of the effects of actions. Supervised learning methods, on the other hand, are frequently used for problems involving static input-output mappings and minimization of a vector error signal, with no explicit dependence on how training examples are gathered. As discussed by Barto and Dietterich (this volume), the key feature distinguishing RL and supervised learning is whether training information from the environment serves as an evaluation signal or as an error signal."

Applications
- Speech recognition
- Effective web search
- Recommendation systems
- Computer vision
- Information retrieval
- Spam filtering
- Computational finance
- Fraud detection
- Medical diagnosis
- Stock market analysis
- Structural health monitoring
- ...

MLlib Introduction
- Spark's scalable machine learning library
- Common learning algorithms and utilities

Benefits of MLlib
- Part of Spark
- Integrated workflow
- Scala, Java & Python API
- Broad coverage of applications & algorithms
- Rapid improvements in speed & robustness
- Ongoing development & large community
- Easy to use, well documented

Typical Steps in ML Pipeline
spark-1.3.0-snapshot

Data Types
- Vector: both dense and sparse vectors
- LabeledPoint: labelled data point for supervised learning
- Rating: rating of a product by a user, used for recommendation
- Various Models: result of a training algorithm, used for predicting unknown data
- Matrices
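
To make the dense/sparse distinction concrete, here is a minimal plain-Python sketch. It is not MLlib's Vector or LabeledPoint API: the (size, indices, values) tuple layout and the helper names `to_dense` and `sparse_dot` are assumptions chosen for illustration.

```python
def to_dense(size, indices, values):
    """Expand a sparse vector (size, indices, values) into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(size, indices, values, dense):
    """Dot product of a sparse vector with a dense list, touching only non-zeros."""
    assert len(dense) == size
    return sum(v * dense[i] for i, v in zip(indices, values))


# The same 4-dimensional vector in both representations.
dense = [1.0, 0.0, 0.0, 3.0]
sparse = (4, [0, 3], [1.0, 3.0])

# A labelled point is simply a label paired with a feature vector (hypothetical layout).
labeled_point = (1.0, sparse)
```

The sparse form stores only the non-zero entries, which matters when feature vectors have thousands of dimensions but few non-zero values.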

Feature Extraction & Basic Statistics
- Several classes for common operations
- Scaling, normalization, statistical summary, correlation, numeric RDD operations, sampling, random generators
- Word extraction (TF-IDF): generating feature vectors from text documents/web pages
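
As an illustration of the scaling step, the sketch below standardizes each column to zero mean and unit standard deviation in plain Python. The function name `standardize` and the use of the sample standard deviation (dividing by n - 1) are assumptions for this example, not a description of how MLlib implements scaling.

```python
import math


def standardize(rows):
    """Scale every column to zero mean and unit sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    # Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two features on very different scales end up directly comparable.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(rows)
```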

Term Frequency-Inverse Document Frequency, or TF-IDF, is a simple way to generate feature vectors from text documents (e.g. web pages). It computes two statistics for each term in each document: the term frequency, TF, which is the number of times the term occurs in that document, and the inverse document frequency, IDF, which measures how (in)frequently a term occurs across the whole document corpus. The product of these values, TF × IDF, shows how relevant a term is to a specific document (i.e. if it is common in that specific document but rare in the whole corpus).
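
The two statistics can be sketched in a few lines of plain Python. This is an illustration only: the function name `tf_idf` is hypothetical, and the smoothed IDF formula log((N + 1) / (df + 1)) is one common variant, assumed here for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    # Term frequency is the raw count of the term in the document.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


docs = [
    "spark makes big data simple".split(),
    "spark mllib makes machine learning scalable".split(),
    "machine learning learns from data".split(),
]
weights = tf_idf(docs)
```

In the first document, "big" occurs in no other document and so receives a higher weight than "spark", which also appears in the second document.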

Classification
- Classify samples into predefined categories
- Supervised learning
- Binary classification (SVMs, logistic regression)
- Multiclass classification (decision trees, naive Bayes)
- Spam vs. non-spam, fruit vs. logo, ...
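
A binary classifier of the logistic-regression kind can be sketched as follows. This is a plain-Python illustration trained with batch gradient descent, not MLlib's implementation; the names `train_logistic` and `predict`, the step size and the iteration count are assumptions for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, labels, step=0.5, iterations=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(points), len(points[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(points, labels):
            # Prediction error for this sample: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(d):
                grad_w[j] += error * x[j] / n
            grad_b += error / n
        weights = [w - step * g for w, g in zip(weights, grad_w)]
        bias -= step * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0


# One feature; samples below roughly 2.5 are class 0, above are class 1.
points = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(points, labels)
```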

Logistic regression: data are labelled 1 or 0, so it is used for classification.

Regression
- Predict value from observations, many techniques
- Predicted values are continuous
- Supervised learning
- Linear least squares, Lasso, ridge regression, decision trees
- House prices, stock exchange, power consumption, height of person, ...

A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.

http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use
http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Linear least squares is a statistical method that fits a linear model by minimizing the sum of squared differences between observed and predicted values. Lasso (least absolute shrinkage and selection operator) is a regularized version of least squares.

Linear Regression Example
- Method run trains the model
- Parameters are set with the setters setNumIterations and setIntercept
- Stochastic Gradient Descent (SGD) algorithm is used to minimize the loss function
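
The idea behind SGD-based linear regression can be sketched in plain Python. This is not the MLlib code from the slide: `sgd_linear_regression`, its `step` and `epochs` parameters and the fixed seed are assumptions for illustration, playing a role loosely similar to the setters mentioned above.

```python
import random


def sgd_linear_regression(xs, ys, step=0.02, epochs=1000, seed=42):
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error^2 with respect to w and b, one sample at a time.
            w -= step * error * xs[i]
            b -= step * error
    return w, b


# Noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_regression(xs, ys)
```

Each update uses a single sample, which is what distinguishes stochastic from batch gradient descent; the step size has to be small enough for the updates to stay stable.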

Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.

SGD is a great general-purpose optimization algorithm, and it is easy to implement. I would generally use it first, before trying something more complicated. I believe SGD is just as good as, if not superior to, L-BFGS in the not highly varying (and sometimes even convex) optimization surfaces common in current NLP models. (I would nonetheless be interested in a controlled comparison between SGD and L-BFGS using the Berkeley cache-flushing trick.)

Clustering
- Grouping objects into groups (~ clusters) of high similarity
- Unsupervised learning: groups are not predefined
- Number of clusters must be defined
- K-means, Gaussian Mixture Model (EM algorithm), Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA)
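
K-means, the first technique listed, can be sketched in plain Python as below. The function names are hypothetical, and taking the first k points as initial centers is a simplifying assumption; real implementations use more careful initialization.

```python
def closest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def k_means(points, k, iterations=20):
    """Alternate between assigning points to centers and recomputing centers."""
    centers = [list(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


# Two well-separated groups; k must be chosen up front.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.1)]
centers = k_means(points, k=2)
```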
The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
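
The alternation between the E and M steps can be sketched for a one-dimensional mixture of Gaussians. This is a plain-Python illustration under stated assumptions: the function name `em_gmm_1d`, the caller-supplied initial parameters and the small variance floor are choices made for this example only.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture, starting from the given initial parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: re-estimate parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
    return means, variances, weights


data = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
means, variances, weights = em_gmm_1d(
    data, means=[1.0, 9.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```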

Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a graph given pairwise similarities as edge properties, described in Lin and Cohen, Power Iteration Clustering. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster vertices.

Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
- Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
- Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
- Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.

LDA takes in a collection of documents as vectors of word counts. It learns clustering using expectation-maximization on the likelihood function. After fitting on the documents, LDA provides:
- Topics: inferred topics, each of which is a probability distribution over terms (words).
- Topic distributions for documents: for each document in the training set, LDA gives a probability distribution over topics.

Collaborative Filtering
- Used for recommender systems
- Creates and analyses matrix of ratings, predicts missing entries
- Explicit (given rating) vs implicit (views, clicks, likes, shares, ...) feedback
- Alternating least squares (ALS)
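
The alternating idea behind ALS can be sketched with a deliberately simplified model: one latent factor per user and per item, and no regularization. Production implementations use higher ranks and a regularization term; the function name `als_rank1` and the data below are hypothetical, and the sketch assumes every user and item has at least one observed rating.

```python
def als_rank1(ratings, n_users, n_items, iterations=100):
    """ratings: {(user, item): value} holding the observed entries only."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user factor has a closed-form least-squares solution.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = sum(r * item_f[i] for i, r in obs) / sum(item_f[i] ** 2 for i, _ in obs)
        # Fix user factors; solve for each item factor in the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = sum(r * user_f[u] for u, r in obs) / sum(user_f[u] ** 2 for u, _ in obs)
    return user_f, item_f


# 3 users x 3 items; the rating of user 2 for item 2 is missing.
ratings = {
    (0, 0): 2.0, (0, 1): 1.0, (0, 2): 4.0,
    (1, 0): 4.0, (1, 1): 2.0, (1, 2): 8.0,
    (2, 0): 6.0, (2, 1): 3.0,
}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=3)
predicted = user_f[2] * item_f[2]  # prediction for the missing entry
```

Because the observed ratings here follow an exact rank-1 pattern, the predicted value lands very close to 12, the value that pattern implies for the missing entry.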

Dimensionality Reduction
- Process of reducing the number of variables under consideration
- Performance needs, removing non-informative dimensions, plotting, ...
- Principal Component Analysis (PCA): ignoring non-informative dimensions
- Singular Value Decomposition (SVD)
  - factorizes a matrix into 3 descriptive matrices
  - storage savings, noise reduction
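
One way to see what PCA does is to compute the first principal component by hand. The sketch below builds the sample covariance matrix and finds its dominant eigenvector by power iteration; this is an illustrative approach, not how MLlib computes PCA. The function name and the all-ones start vector are assumptions, and that start vector would fail if it happened to be orthogonal to the dominant direction.

```python
import math


def first_principal_component(rows, iterations=100):
    """Dominant eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    vec = [1.0] * d  # start vector (assumption, see lead-in)
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


# Points lying close to the line y = x: one direction carries almost all the variance.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
component = first_principal_component(rows)

# Projecting onto the component reduces each 2-D sample to a single number.
projected = [sum(x * c for x, c in zip(row, component)) for row in rows]
```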

http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

Tips
- Preparing features
  - each algorithm is only as good as its input features
  - probably the most important step in ML
  - correct scaling, labelling for each algorithm
- Algorithm configuration
  - performance varies greatly according to params
- Caching RDD for reuse
  - most of the algorithms are iterative
  - input dataset should be cached (cache() method) before passing into an MLlib algorithm
- Recognizing sparsity

Overfitting
- Model is overtrained to the training data
- Model describes random errors or noise instead of the underlying relationship
- Results in poor predictive performance

Data Partitioning
- Supervised learning
- Partitioning labelled data
- Labelled data
  - Training set
    - set of samples used for learning
    - experiments with algorithm parameters
  - Test set
    - testing the fitted model
    - must not tune the model any further
- Common separation: 70/30

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Performance
10-100x faster than Hadoop & Mahout
Steady Performance Gains

Pipeline API
- Pipeline is a series of algorithms (feature transformation, model fitting, ...)
- Easy workflow construction
- Distribution of parameters into each stage
- MLlib is easier to use
- Uses uniform dataset representation: SchemaRDD from Spark SQL
  - multiple named columns (similar to SQL table)

Conclusion
- What is Machine Learning
- Machine Learning Use Cases & Techniques
- Spark's Machine Learning library - MLlib
- Tips for using MLlib and Spark

Questions