Mahout ppt 5 - vvtesh.co.in/teaching/bigdata-2020/slides/studentppt/mahout.pdf

Apache Mahout Anubhab Chatterjee Ashwary Sharma Department of Data Science 15th April, 2020




Acknowledgement

Big Data and Hadoop Course

Prof. Venkatesh Vinayakarao

Chennai Mathematical Institute


Background - Bits to Big Bytes

▪ The continuous and gradual evolution of standalone computing systems

▪ Rise of digital networks and distributed systems, growth of cloud use and availability

▪ Big data: storage and processing challenges

▪ Open source software solutions

▪ Distributed backends: Hadoop, Spark etc.


The Question, Want and Need

▪ Promising match: machine learning and distributed computing

▪ Power of machine learning - updatability, learnability

▪ Fuel for updating - input data and computational power; provisions of the cloud at comparatively low costs

▪ Storage needs - adequately met for now

▪ Processing needs - open-source software still leaves much to be desired

▪ What can we do with what we have?


Voilà!

▪ Apache Mahout is a framework ‘for creating scalable, performant ML applications’ - primarily focused on linear algebra.

▪ A project of the Apache Software Foundation

➢ Open Source

➢ Free

➢ Apache developers’ community

▪ Initial release - v0.1 on 7th April, 2009


Apache Mahout - v0.14

▪ ‘Framework for distributed implementation of machine learning algorithms’

➢ Stable release - 6th March, 2019

▪ Java and Scala are the supported high-level languages (HLLs)

▪ Features

➢ ‘Distributed linear algebra framework’

➢ ‘Mathematically expressive Scala DSL’ – Samsara

➢ ‘Support for Multiple Distributed Backends’

➢ Engine bindings and native solvers


Mahout v0.14 Exhibition Code

Pre-processing code using Spark

import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

// Each block reads a tab-separated LastFM file, keeps two columns as a pair,
// drops the header row, and wraps the result in an IndexedDataset.
val userTagsRDD = sc.textFile("/path/to/lastfm/user_taggedartists.dat")
  .map(line => line.split("\t"))
  .map(a => (a(0), a(2)))
  .filter(_._1 != "userID")
val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)

val userArtistsRDD = sc.textFile("/path/to/lastfm/user_artists.dat")
  .map(line => line.split("\t"))
  .map(a => (a(0), a(1)))
  .filter(_._1 != "userID")
val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc)

val userFriendsRDD = sc.textFile("/path/to/lastfm/user_friends.dat")
  .map(line => line.split("\t"))
  .map(a => (a(0), a(1)))
  .filter(_._1 != "userID")
val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc)

Mahout recommender code

import org.apache.mahout.math.cf.SimilarityAnalysis

val artistReccosLlrDrmListByArtist = SimilarityAnalysis.cooccurrencesIDSs(
  Array(userArtistsIDS, userTagsIDS, userFriendsIDS),
  maxInterestingItemsPerThing = 20,
  maxNumInteractions = 500,
  randomSeed = 1234)

This sample shows how little code it takes to call the standard ‘recommender’ algorithm from the Mahout library on the LastFM dataset, using Scala.

Note that the implementation details, for both in-core and distributed data, are handled automatically by the library.

Source: https://mahout.apache.org/docs/latest/tutorials/cco-lastfm/
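The `Llr` in the variable name `artistReccosLlrDrmListByArtist` suggests that item pairs are scored with a log-likelihood ratio (LLR). As a rough illustration of what such a score measures, here is a minimal pure-Python sketch of the LLR for a 2x2 table of co-occurrence counts. It assumes Dunning's entropy-based formulation; the function names are illustrative and are not Mahout's API.

```python
import math

def x_log_x(x):
    """Return x * ln(x), taking 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 contingency table.

    k11: users who interacted with both item A and item B
    k12: users who interacted with item A only
    k21: users who interacted with item B only
    k22: users who interacted with neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Floating-point rounding can push the result slightly below zero; clamp it.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))
```

A table whose rows and columns are independent scores close to zero, while a table where the two items almost always appear together scores high, which is why the score is useful for keeping only the "interesting" co-occurrences.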


Features - Mahout v0.14

▪ Distributed Linear Algebra Framework

➢ DRMs - distributed row matrices data type

➢ In-core and distributed executions available separately for linear algebra operations

➢ Implemented algorithms available: distributed QR decomposition, distributed stochastic PCA, distributed stochastic SVD

➢ Machine learning algorithms’ core computations are linear algebraic in nature: QR decomposition, PCA etc.


Features - Mahout v0.14

▪ Mathematically Expressive Scala DSL - Samsara

➢ Code that represents mathematical equations is easier to write and reads like the mathematics itself

➢ ‘Mahout runs inline with your Java/Scala application’

➢ ‘Write once run everywhere’

Example: equation to code

\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]

val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
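To make the mapping between the equation and the one-line Samsara expression concrete, the sketch below spells out each term with plain Python lists. It assumes that `cross` denotes the outer product of two vectors and `dot` the inner product, and that `bt` holds the transpose of B; the helper names are hypothetical, and small dense lists stand in for Mahout's matrix types.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def outer(u, v):
    """Outer product u v^T of two vectors."""
    return [[x * y for y in v] for x in u]

def dot(u, v):
    """Inner product u^T v of two vectors."""
    return sum(x * y for x, y in zip(u, v))

def compute_g(b, c, s_q, xi):
    """Evaluate G = B B^T - C - C^T + (s_q s_q^T) (xi^T xi) term by term."""
    bbt = matmul(b, transpose(b))   # B B^T  (bt.t %*% bt, with bt = B^T)
    ct = transpose(c)               # C^T    (c.t)
    sst = outer(s_q, s_q)           # s_q s_q^T
    scale = dot(xi, xi)             # xi^T xi, a scalar
    n = len(bbt)
    return [[bbt[i][j] - c[i][j] - ct[i][j] + sst[i][j] * scale
             for j in range(n)] for i in range(n)]
```

The explicit version needs a handful of helpers and nested loops, which is the verbosity the DSL removes.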


Features - Mahout v0.14

▪ Support for multiple backends

➢ Spark is the recommended backend, while Hadoop, H2O, Lucene etc. are supported since different paradigms are better suited to different computational needs.

➢ Samsara makes code backend agnostic.

➢ Standalone capability - for prototyping

▪ Engine bindings and native solving optimization

▪ Logical/Physical optimisations

▪ DRM -> BLAS (Basic Linear Algebra Subprograms)


What’s In A Name!!

▪ ‘Mahout’ is derived from the Hindi word ‘Mahavat’, meaning the rider of an elephant

▪ The symbolism of the chosen name for the library becomes apparent since the main intent was to provide scalable machine learning capabilities over Hadoop (whose mascot is an elephant).


What the ‘Elephant’ offers: Out-of-the-box algorithms

▪ Collaborative Filtering

➢ Item-based Collaborative Filtering

➢ Matrix Factorization with Alternating Least Squares

➢ Matrix Factorization with Alternating Least Squares on Implicit Feedback

▪ Classification

➢ Naive Bayes

➢ Complementary Naive Bayes

➢ Random Forest

▪ Dimensionality Reduction

➢ Lanczos Algorithm

➢ Stochastic SVD

➢ Principal Component Analysis

▪ Clustering

➢ Canopy Clustering

➢ k-Means Clustering

➢ Fuzzy k-Means

➢ Streaming k-Means

➢ Spectral Clustering

▪ Topic Models

➢ Latent Dirichlet Allocation

▪ Miscellaneous

➢ Frequent Pattern Matching

➢ RowSimilarityJob

➢ ConcatMatrices

➢ Collocations


The Bigshots Riding the ‘Elephant’

▪ Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.

▪ Foursquare helps you find places, food, and entertainment available in a particular area. It uses Mahout’s recommender engine.

▪ Twitter uses Mahout for user interest modelling.

▪ Yahoo! uses Mahout for pattern mining.

▪ 365MEDIA uses Mahout’s classification and collaborative filtering algorithms in its real-time system named UPTIME and 365Media/Social.


Getting Started: Taming the ‘Beast’

▪ The following demonstrations cover two use cases, clustering and a recommendation system, to show how easy Mahout’s built-in algorithms are to use.

▪ The demonstrations assume that a Hadoop environment is up and running on your system. (PS: If not, Google it! It’s complicated.) (PPS: If Google fails to help you out, contact the instructor. Trust us, he is better at helping his students!)

▪ To install Mahout, go to http://www.apache.org/dyn/closer.cgi/mahout/ and download mahout-distribution-0.7.tar.gz. Uncompress the archive:

What’s done is done. The joy is in the doing. - Shakespeare


Use Case #1: Movie Recommender Engine

▪ The recommender engine accepts any file whose lines contain a userId, an itemId and an optional preference value, separated by tabs. The userId and itemId must be integers; the preference value can be an integer or a double. The GroupLens MovieLens dataset provides movie ratings in this format. You can download it: MovieLens 100k.

▪ Uncompress the archive:

▪ Copy the file u.data to HDFS:

▪ Run the Mahout Recommender:

hadoop jar <MAHOUT DIRECTORY>/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input u.data --output output

▪ And ‘Abra-Kadabra’ we are done!!! I swear!! Don’t believe it??
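To give a feel for what a co-occurrence recommender does with the tab-separated input described above, here is a deliberately simplified, single-machine sketch. It is not what `RecommenderJob` runs internally (that is a series of MapReduce jobs); the scoring rule used here, summing raw co-occurrence counts, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def parse_preferences(lines):
    """Parse 'userId<TAB>itemId[<TAB>preference]' lines into {user: {item: pref}}."""
    prefs = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user, item = int(fields[0]), int(fields[1])
        # The preference column is optional; default to 1.0 when it is absent.
        prefs[user][item] = float(fields[2]) if len(fields) > 2 and fields[2] else 1.0
    return dict(prefs)

def cooccurrence_counts(prefs):
    """Count, for each ordered item pair, how many users interacted with both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts

def recommend(prefs, user, top_n=10):
    """Rank items the user has not seen by summed co-occurrence with seen items."""
    counts = cooccurrence_counts(prefs)
    seen = prefs.get(user, {})
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties are broken by item id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

With the real dataset, the lines of `u.data` could be passed to `parse_preferences` directly, since extra trailing columns are ignored.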


Use Case #1: Compiling Results

▪ With the argument “-s SIMILARITY_COOCCURRENCE”, we tell the recommender which item similarity formula to use. Mahout computes the recommendations by running several Hadoop MapReduce jobs. After 30-50 minutes, the jobs finish and each user has the 10 movies they are most likely to enjoy, based on the co-occurrence of each movie in users’ reviews.

▪ To copy and merge the files from HDFS to your local filesystem, type:

▪ Output:

A little Python code on the output gives the actual results as shown:
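The post-processing script itself is not part of the slides. A minimal sketch of what it might look like follows, assuming each output line has the form `userId<TAB>[itemId:score,itemId:score,...]` and that titles come from pipe-separated `itemId|title|...` lines as in the MovieLens `u.item` file. Both formats and the function names are assumptions here, so check them against your own output.

```python
def parse_recommendations(line):
    """Parse one output line, assumed to be 'userId<TAB>[itemId:score,...]'."""
    user, _, payload = line.strip().partition("\t")
    recs = []
    for pair in payload.strip("[]").split(","):
        if not pair:
            continue  # empty recommendation list
        item, _, score = pair.partition(":")
        recs.append((int(item), float(score)))
    return int(user), recs

def load_titles(lines):
    """Map item ids to titles from 'itemId|title|...' lines."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2:
            titles[int(fields[0])] = fields[1]
    return titles

def describe(line, titles):
    """Return (userId, [(title, score), ...]) for one output line."""
    user, recs = parse_recommendations(line)
    # Fall back to the raw id when a title is missing.
    return user, [(titles.get(item, str(item)), score) for item, score in recs]
```

Calling `describe` on each line of the merged output file, with `titles` built from the item file, yields readable recommendations per user.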


Use Case #2: Clustering

▪ Clustering organizes the items of a given collection into groups based on the similarity between the items. For example, online news publishing applications group their news articles using clustering. We will work with a generic sample text file.

▪ Make sure Hadoop is up and running. Create directories in the Hadoop file system to store the input file, sequence files, and clustered data using the following command:

▪ Copy Input File to HDFS (Reminder: we are doing this with a random text file):

▪ Using Mahout’s seqdirectory utility convert the given input file into sequence file format:

▪ Here is an example for better understanding:


Use Case #2: Canopy Algorithm

▪ Canopy clustering is a simple and fast clustering technique available in Mahout. The objects are treated as points in a space. It is often used as an initial step for other clustering techniques such as k-means clustering. You can run a Canopy job using the following syntax:

Here is the output

The resulting canopies are shown superimposed upon the sample data. Each canopy is represented by two circles, with radius T1 and radius T2. This is a nice representation of the data but it still has lots of room for improvement.
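The roles of the two radii are easier to see in code. Below is a minimal single-machine sketch of canopy clustering, assuming the usual formulation with T1 > T2: points closer than T1 to a canopy center join that canopy, and points closer than T2 are removed from further consideration. It illustrates the idea only and is not Mahout's distributed implementation; the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def canopy_clusters(points, t1, t2, distance=euclidean):
    """Group points into (possibly overlapping) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unprocessed point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely bound: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays available
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

Because a point between T2 and T1 stays available, it can belong to several canopies. The canopy centers are what a follow-up algorithm such as k-means would typically take as its initial centroids.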


Prospective Future

▪ Backend support: support for new and existing backends is envisioned as needed.

▪ Samsara enables ‘write once run everywhere’.

▪ Version 14.1 is on the horizon, with heavy design upgrades (strange version numbering).

▪ The Apache community is active and thriving, with Mahout and Spark among its most vitalised projects. Here’s to hoping!


Takeaways

▪ The Apache Mahout framework enables easy and efficient deployment of machine learning and linear algebra algorithms on distributed systems.

▪ The framework provides many features, most importantly support for multiple distributed backends and Samsara, a mathematically expressive Scala DSL.

▪ It is free and open-source software, favorably adopted and highly performant in relevant domains.

▪ A long period of support and fruitful development arguably lies ahead.


Thank You!