Artificial Intelligence Layer: Mahout, MLLib, and other projects

Artificial Intelligence Layer Mahout, MLLib, & other projects Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015

Page 1: Artificial Intelligence Layer: Mahout, MLLib, and other projects

Artificial Intelligence Layer

Mahout, MLLib, & other projects

Víctor Sánchez Anguix

Universitat Politècnica de València

MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image

Course 2014/2015

Page 2: Artificial Intelligence Layer: Mahout, MLLib, and other projects

Artificial Intelligence Layer: Mahout, MLLib & other projects. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Core technologies like DFS and MR (e.g., Hadoop)

➢ ETL for transforming data (e.g., Pig)

➢ Alternative core/ETL technology (e.g., Spark)

➢ Now we can build AI tools from scratch

So far...

Page 3: Artificial Intelligence Layer: Mahout, MLLib, and other projects

Can I save some work with existing code?

Page 4: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Write some UDF wrappers for Weka in Pig/Spark

➢ Use connectors to R and Python

➢ Parallelize execution of multiple non-distributed algorithms

Actually, we can...
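The third option above can be sketched in plain Scala. This is only an illustration under stated assumptions: `fitSequential` is a hypothetical stand-in for any non-distributed learner (it is not a Weka, R, or Python call), and the candidate learning rates are made up. Each configuration is an independent task, so the runs can execute concurrently even though no single run is parallel.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelRuns {
  // Hypothetical sequential learner: fits y = w * x by gradient descent
  // and returns (learningRate, meanSquaredError).
  def fitSequential(data: Seq[(Double, Double)], learningRate: Double): (Double, Double) = {
    var w = 0.0
    for (_ <- 1 to 200) {
      val grad = data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.size
      w -= learningRate * grad
    }
    val mse = data.map { case (x, y) => math.pow(w * x - y, 2) }.sum / data.size
    (learningRate, mse)
  }

  // Launch one independent run per configuration and keep the best one.
  def bestLearningRate(data: Seq[(Double, Double)], rates: Seq[Double]): Double = {
    val runs = Future.sequence(rates.map(r => Future(fitSequential(data, r))))
    Await.result(runs, 1.minute).minBy(_._2)._1
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 50).map(i => (i / 50.0, 3.0 * i / 50.0)) // samples of y = 3x
    println(bestLearningRate(data, Seq(0.001, 0.01, 0.1, 0.5)))
  }
}
```

The same pattern applies when the tasks are distributed over a cluster instead of local threads: the speed-up comes from running many small jobs at once, not from making any one of them faster.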

Page 5: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Still problematic if algorithm instances are very big

➢ They are not really parallel algorithms

➢ Use parallel algorithms to tackle big problems:

○ Apache Mahout

○ Apache Spark

But we can do better!

Page 6: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Collection of parallel AI & ML algorithms

➢ MapReduce algorithms, now being migrated to Spark

➢ Latest major release: Mahout 0.9 (February 2014)

http://mahout.apache.org/

Apache Mahout

Page 7: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Clustering algorithms:

○ K-means (parallel)

○ Fuzzy K-means (parallel)

○ Spectral K-means (parallel)

➢ Classification algorithms:

○ Logistic regression (non-parallel)

○ Naive Bayes (parallel)

○ Random Forest (parallel)

○ Multilayer perceptron (non-parallel)

Apache Mahout: Algorithms

Page 8: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Dimensionality reduction:

○ Singular Value Decomposition (parallel)

○ PCA (parallel)

○ Lanczos decomposition (parallel)

○ QR decomposition (parallel)

➢ Text algorithms:

○ TF-IDF (parallel)

Apache Mahout: Algorithms
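As a reminder of what the TF-IDF entry above computes, here is a minimal single-machine sketch in plain Scala. It uses the textbook weighting (raw term count times ln(N / document frequency)); library implementations typically use their own smoothed variants, so this is not Mahout's code. The sample documents are invented for illustration.

```scala
object TfIdf {
  // tf(t, d): number of times term t occurs in document d.
  // idf(t):   ln(N / df(t)), with df(t) the number of documents containing t.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    docs.map { doc =>
      doc.groupBy(identity).map { case (t, occ) => t -> occ.size * math.log(n / df(t)) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus, one document per string.
    val docs = Seq("big data tools", "big graphs", "data mining tools").map(_.split(" ").toSeq)
    tfIdf(docs).foreach(println)
  }
}
```

A term that occurs in every document gets weight 0, which is why TF-IDF down-weights very common words.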

Page 9: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Just type mahout in the shell

➢ A list of the available algorithms is printed

➢ Typing mahout algorithm_name prints the help for that specific algorithm

➢ Executing the distributed algorithms requires Hadoop and a DFS

Mahout from shell

Page 10: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ mahout recommenditembased:

○ --input: file whose rows are user_id item_id pairs representing purchases

○ --output: where Mahout should store the results

○ --usersFile: users for whom recommendations are computed

○ --itemsFile: items that may be recommended

○ --booleanData (-b): true (in our case, binary data)

○ --similarityClassname: SIMILARITY_LOGLIKELIHOOD or SIMILARITY_TANIMOTO_COEFFICIENT (both suited to binary data, as in our case)

Mahout example: Item-based Collaborative filtering
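To give an intuition for why a measure such as the Tanimoto coefficient fits binary purchase data, the following plain Scala sketch computes it between two items from (user_id, item_id) rows. It only illustrates the measure; it does not reproduce how Mahout's distributed job is implemented, and the purchase rows are hypothetical.

```scala
object Tanimoto {
  // Tanimoto (Jaccard) coefficient between two sets of buyers:
  // |A ∩ B| / |A ∪ B|, defined here as 0 when both sets are empty.
  def similarity(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // Turn (user_id, item_id) purchase rows into item -> set of buyers.
  def buyersByItem(purchases: Seq[(Int, Int)]): Map[Int, Set[Int]] =
    purchases.groupBy(_._2).map { case (item, rows) => item -> rows.map(_._1).toSet }

  def main(args: Array[String]): Unit = {
    // Hypothetical purchases: users 1-3 bought item 10, users 2-4 bought item 20.
    val purchases = Seq((1, 10), (2, 10), (3, 10), (2, 20), (3, 20), (4, 20))
    val buyers = buyersByItem(purchases)
    println(similarity(buyers(10), buyers(20)))
  }
}
```

Only the presence or absence of a purchase matters here, so no rating values are needed.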

Page 11: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Execute:

mahout recommenditembased \
  --input data/purchases_mahout.tsv \
  --output mahout_cf \
  --usersFile data/users_mahout.tsv \
  --itemsFile data/valid_products_mahout.tsv \
  --booleanData true \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD

Mahout example: Item-based Collaborative filtering

Page 12: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Machine learning library inside Spark

➢ Completely distributed

➢ It is bundled with Spark!

MLLib

Page 13: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Classification & Regression:

○ Support Vector Machines

○ Logistic Regression

○ Linear Regression

○ Random Forests

➢ Clustering:

○ K-means

MLLib: Algorithms

Page 14: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Dimensionality reduction:

○ Singular Value Decomposition

○ PCA

➢ Clustering:

○ K-means

➢ Collaborative filtering:

○ Recommender based on Alternating Least Squares (ALS) matrix factorization

MLLib: Algorithms

Page 15: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Let us apply K-means on the iris data set

MLLib: K-Means example

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val numClusters = 3
val numIteration = 20

// Read the CSV file from HDFS and split each line into its fields
val data_iris = sc.textFile("hdfs:///user/sanguix/data/iris.csv").map(l => l.split(",", -1))

// Keep the first four (numeric) columns of each row as a dense feature vector
val parsedData = data_iris.map(r => Vectors.dense(Array(r(0).toDouble, r(1).toDouble, r(2).toDouble, r(3).toDouble))).cache()

// Train K-means with 3 clusters and at most 20 iterations
val clusters = KMeans.train(parsedData, numClusters, numIteration)
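KMeans.train hides the iterations it performs. As a rough picture of what happens on each of them, here is a single-machine sketch of the classic assign/update loop (Lloyd's algorithm) in plain Scala. It is not MLlib's implementation, which distributes the work and chooses its initial centers differently; the object name, the fixed initial centers, and the toy points are assumptions made for illustration.

```scala
object Lloyd {
  type Point = Vector[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Repeat the assign/update steps a fixed number of times.
  // A center that attracts no points is left where it is.
  def kMeans(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centers, _) =>
      val groups = points.groupBy(closest(_, centers))
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical one-dimensional data with two obvious groups.
    val points = Seq(Vector(0.0), Vector(1.0), Vector(10.0), Vector(11.0))
    println(kMeans(points, Seq(Vector(0.0), Vector(10.0)), 5))
  }
}
```

In the distributed setting, the assignment step is a map over the data and the update step is an aggregation per cluster, which is why K-means parallelizes well.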

Page 16: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Spark built-in library for graphs

➢ Algorithms:

○ PageRank

○ (Strongly) connected components

○ Label propagation

○ Other basic graph operations

Other projects: GraphX
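For readers who have not met PageRank before, the sketch below shows the basic power iteration on a small edge list in plain Scala. It is a single-machine illustration, not the library's implementation: the function name, the damping factor of 0.85, and the assumption that every vertex has at least one outgoing edge are choices made for this example.

```scala
object PageRankSketch {
  // edges: (source, target) pairs. Assumes every vertex has an outgoing edge.
  def pageRank(edges: Seq[(Int, Int)], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, t) => Seq(s, t) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (s, out) => s -> out.size }
    val n = vertices.size
    val start = vertices.map(_ -> 1.0 / n).toMap
    (1 to iterations).foldLeft(start) { (ranks, _) =>
      // Each vertex splits its current rank evenly among its out-neighbours.
      val received = edges
        .map { case (s, t) => t -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (t, shares) => t -> shares.map(_._2).sum }
      vertices.map(v => v -> ((1 - damping) / n + damping * received.getOrElse(v, 0.0))).toMap
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: vertex 3 is linked to by both 1 and 2.
    val edges = Seq((1, 2), (1, 3), (2, 3), (3, 1))
    println(pageRank(edges, 20))
  }
}
```

Each iteration only needs a vertex's rank and its outgoing edges, which is what makes the algorithm a natural fit for graph frameworks that pass messages along edges.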

Page 17: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Graph framework over Hadoop

➢ Specialized for building algorithms for graphs

➢ Latest major release: Giraph 1.1.0 (November 2014)

http://giraph.apache.org/

Other projects: Giraph

Page 18: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Distributed framework for machine learning

➢ Originally created at Carnegie Mellon

➢ Algorithms:

○ Collaborative filtering

○ Text analysis

○ PageRank

○ Deep learning

➢ Latest release: GraphLab 2.2 (July 2013)

https://github.com/graphlab-code/graphlab

Other projects: GraphLab

Page 19: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ ML library on top of Hadoop/Spark

➢ Algorithms:

○ Random Forests

○ Generalized Linear Model

○ Deep learning

○ K-Means

➢ Latest release: H2O 2.8.4.4 (February 2015)

https://github.com/h2oai/h2o-dev

Other projects: H2O

Page 20: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Large-scale data processing engine in Java/Scala

➢ In-memory collections

➢ Latest release: Flink 0.8.0 (January 2015)

http://flink.apache.org/

Other projects: Apache Flink

Page 21: Artificial Intelligence Layer: Mahout, MLLib, and other projects

➢ Mahout in Action. Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Manning Publications (2011)

➢ Apache Mahout Cookbook. Piero Giacomelli. Packt Publishing (2013)

➢ StackOverflow

Extra information
