Artificial Intelligence Layer: Mahout, MLLib, and other projects
Artificial Intelligence Layer
Mahout, MLLib, & other projects
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Artificial Intelligence Layer: Mahout, MLLib & other projects. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
So far...
➢ Core technologies like DFS and MR (e.g., Hadoop)
➢ ETL for transforming data (e.g., Pig)
➢ Alternative core/ETL technology (e.g., Spark)
➢ Now we can build AI tools from scratch
Can I save some work with existing code?
Actually, we can...
➢ Write some UDF wrappers for Weka in Pig/Spark
➢ Use connectors to R and Python
➢ Parallelize execution of multiple non-distributed algorithms
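The last option, running several non-distributed algorithms side by side, can be sketched in plain Scala. This is a minimal illustration, not code from the course: fitSlope is a hypothetical stand-in for any sequential learner (for example a Weka call), and the runs are spread over local threads with Futures rather than over a cluster.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelRuns {
  // Hypothetical stand-in for a sequential (non-distributed) learner:
  // fits the one-parameter model y = w * x by least squares.
  def fitSlope(data: Seq[(Double, Double)]): Double = {
    val num = data.map { case (x, y) => x * y }.sum
    val den = data.map { case (x, _) => x * x }.sum
    num / den
  }

  // Launch one independent training run per data set and wait for all of them.
  def fitAll(dataSets: Seq[Seq[(Double, Double)]]): Seq[Double] = {
    val runs = dataSets.map(ds => Future(fitSlope(ds)))
    Await.result(Future.sequence(runs), 1.minute)
  }

  def main(args: Array[String]): Unit = {
    // Toy data sets (assumed for illustration only).
    val dataSets = Seq(
      Seq((1.0, 2.0), (2.0, 4.0)), // generated from y = 2x
      Seq((1.0, 3.0), (2.0, 6.0))  // generated from y = 3x
    )
    println(fitAll(dataSets))
  }
}
```

Each run is still a sequential algorithm that has to fit on a single machine; only the set of runs is executed in parallel.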
But we can do better!
➢ Still problematic if problem instances are very large
➢ They are not really parallel algorithms
➢ Use parallel algorithms to tackle big problems:
○ Apache Mahout
○ Apache Spark
Apache Mahout
➢ Collection of parallel AI & ML algorithms
➢ MapReduce algorithms, now moving to Spark
➢ Latest major release: Mahout 0.9 (February 2014)
http://mahout.apache.org/
Apache Mahout: Algorithms
➢ Clustering algorithms:
○ K-means (parallel)
○ Fuzzy K-means (parallel)
○ Spectral K-means (parallel)
➢ Classification algorithms:
○ Logistic regression (non-parallel)
○ Naive Bayes (parallel)
○ Random Forest (parallel)
○ Multilayer perceptron (non-parallel)
Apache Mahout: Algorithms
➢ Dimensionality reduction:
○ Singular Value Decomposition (parallel)
○ PCA (parallel)
○ Lanczos decomposition (parallel)
○ QR decomposition (parallel)
➢ Text algorithms:
○ TF-IDF (parallel)
Mahout from shell
➢ Just type mahout in the shell
➢ A list of available algorithms will be printed
➢ Typing mahout algorithm_name prints the help for that specific algorithm
➢ Executing distributed algorithms requires Hadoop and DFS
Mahout example: Item-based Collaborative filtering
➢ mahout recommenditembased:
○ --input: file with user_id item_id rows representing purchases
○ --output: where Mahout should store results
○ --usersFile: users we should generate recommendations for
○ --itemsFile: items we can recommend
○ --booleanData (-b): true (in our case, binary data)
○ --similarityClassname: SIMILARITY_LOGLIKELIHOOD or SIMILARITY_TANIMOTO_COEFFICIENT (in our case, binary data)
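For boolean purchase data, the Tanimoto coefficient behind SIMILARITY_TANIMOTO_COEFFICIENT is the number of users who bought both items divided by the number of users who bought at least one of them. The sketch below computes it in plain Scala on hypothetical toy data; usersByItem and tanimoto are illustrative names, and this is not Mahout's distributed implementation.

```scala
object TanimotoSketch {
  // Purchases as (userId, itemId) pairs, the same shape as the --input rows.
  // Returns, for every item, the set of users who bought it.
  def usersByItem(purchases: Seq[(Int, Int)]): Map[Int, Set[Int]] =
    purchases.groupBy(_._2).map { case (item, rows) => item -> rows.map(_._1).toSet }

  // Tanimoto (Jaccard) coefficient: |A intersect B| / |A union B|.
  def tanimoto(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: users 1 to 3, items 10, 20 and 30.
    val purchases = Seq((1, 10), (1, 20), (2, 10), (2, 20), (3, 10), (3, 30))
    val byItem = usersByItem(purchases)
    // Items 10 and 20 share two of three buyers; items 20 and 30 share none.
    println(tanimoto(byItem(10), byItem(20)))
    println(tanimoto(byItem(20), byItem(30)))
  }
}
```

An item-based recommender then scores an item the user has not bought yet by its similarity to the items the user already bought.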
Mahout example: Item-based Collaborative filtering
➢ Execute:
mahout recommenditembased --input data/purchases_mahout.tsv --output mahout_cf --usersFile data/users_mahout.tsv --itemsFile data/valid_products_mahout.tsv --booleanData true --similarityClassname SIMILARITY_LOGLIKELIHOOD
MLLib
➢ Machine learning library inside Spark
➢ Completely distributed
➢ It is bundled with Spark!
MLLib: Algorithms
➢ Classification & Regression:
○ Support Vector Machines
○ Logistic Regression
○ Linear Regression
○ Random Forests
➢ Clustering:
○ K-means
MLLib: Algorithms
➢ Dimensionality reduction:
○ Singular Value Decomposition
○ PCA
➢ Collaborative filtering:
○ ALS (alternating least squares) matrix factorization recommender
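MLLib: K-Means example
➢ Let us apply K-means on the iris data set
// Run in spark-shell, where sc (the SparkContext) is already defined
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val numClusters = 3
val numIteration = 20
// Read the CSV file from HDFS and split every line into its fields
val data_iris = sc.textFile("hdfs:///user/sanguix/data/iris.csv").map(l => l.split(",", -1))
// Keep the first four (numeric) columns of every row as a dense feature vector
val parsedData = data_iris.map(r => Vectors.dense(Array(r(0).toDouble, r(1).toDouble, r(2).toDouble, r(3).toDouble))).cache()
// Train K-means with 3 clusters and at most 20 iterations
val clusters = KMeans.train(parsedData, numClusters, numIteration)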
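To make clear what a K-means training call iterates on, here is a minimal single-machine sketch of the standard K-means (Lloyd) iteration in plain Scala. It is an illustration under assumptions, not MLlib's code: the toy points and the initial centres are hypothetical, and MLlib additionally distributes the work over the cluster and chooses the initial centres itself.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centre closest to p.
  def closest(p: Point, centres: Seq[Point]): Int =
    centres.indices.minBy(i => squaredDistance(p, centres(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.head.indices.map(d => points.map(_(d)).sum / points.size).toVector

  // One iteration: assign every point to its closest centre, then move each
  // centre to the mean of its points (a centre with no points stays put).
  def step(points: Seq[Point], centres: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(p, centres))
    centres.indices.map(i => groups.get(i).map(mean).getOrElse(centres(i)))
  }

  // Repeat the iteration a fixed number of times.
  def train(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((centres, _) => step(points, centres))

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: two well-separated groups in the plane.
    val points = Seq(Vector(1.0, 1.0), Vector(1.0, 3.0), Vector(9.0, 9.0), Vector(9.0, 11.0))
    val centres = train(points, initial = Seq(points(0), points(2)), iterations = 20)
    centres.foreach(println)
  }
}
```

The assignment step is independent for every point, which is why it parallelizes well across the partitions of a distributed data set.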
Other projects: GraphX
➢ Spark built-in library for graphs
➢ Algorithms:
○ PageRank
○ (Strongly) connected components
○ Label propagation
○ Other basic graph operations
Other projects: Giraph
➢ Graph framework over Hadoop
➢ Specialized for building algorithms for graphs
➢ Latest major release: Giraph 1.1.0 (Nov. 2014)
http://giraph.apache.org/
Other projects: GraphLab
➢ Distributed framework for machine learning
➢ Originally created at Carnegie Mellon
➢ Algorithms:
○ Collaborative filtering
○ Text analysis
○ PageRank
○ Deep learning
➢ Latest release: GraphLab 2.2 (July 2013)
https://github.com/graphlab-code/graphlab
Other projects: H2O
➢ ML library on top of Hadoop/Spark
➢ Algorithms:
○ Random Forests
○ Generalized Linear Model
○ Deep learning
○ K-Means
➢ Latest release: H2O 2.8.4.4 (February 2015)
https://github.com/h2oai/h2o-dev
Other projects: Apache Flink
➢ Large-scale data processing engine in Java/Scala
➢ In-memory collections
➢ Latest release: Flink 0.8.0 (January 2015)
http://flink.apache.org/
Extra information
➢ Mahout in Action. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. Ed. Manning Publications (2011)
➢ Apache Mahout Cookbook. Piero Giacomelli. Ed. Packt Publishing (2013)
➢ StackOverflow