Seattle Scalability Mahout


Talk given at the Seattle Scalability / NoSQL / Hadoop / etc MeetUp on March 31, 2010

Transcript of Seattle Scalability Mahout

Page 1: Seattle Scalability Mahout

Numerical Recipes in Hadoop

Principal SDE, LinkedIn
Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer
Author, Lucene in Depth (Manning MM/DD/2010)

Jake Mannix
linkedin/in/jakemannix
twitter/pbrane
[email protected]
[email protected]

Page 2: Seattle Scalability Mahout

A Mathematician’s Apology

• Full-text search:
  – Score documents matching “query string”

• Collaborative filtering recommendation:
  – Users who liked {those} also liked {these}

• (Social/web)-graph proximity:
  – People/pages “close” to {this} are {these}

What mathematical structure describes all of these?

Page 3: Seattle Scalability Mahout

Matrix Multiplication!

Page 4: Seattle Scalability Mahout

Full-text Search

• Vector Space Model of IR
• Corpus as a term-document matrix
• Query as a bag-of-words vector
• Full-text search is then just a matrix-vector product of the corpus matrix with the query vector (sketched below)
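In symbols (notation assumed here, not taken from the slide): with m terms and n documents,

$$ A \in \mathbb{R}^{m \times n} \ (\text{term-document matrix}), \qquad q \in \mathbb{R}^{m} \ (\text{bag-of-words query}), \qquad \text{scores} = A^{T} q \in \mathbb{R}^{n} $$

i.e. scoring every document against the query is a single matrix-vector multiplication.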

Page 5: Seattle Scalability Mahout

Collaborative Filtering

• User preference matrix
  – (and item-item similarity matrix)

• Input user as a vector of preferences
• (simple) item-based CF recommendations are then just the item-item similarity matrix applied to that preference vector
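In symbols (notation assumed, as above): with A the users-by-items preference matrix and u the input user's preference vector,

$$ S = A^{T} A \ (\text{item-item similarity}), \qquad r = S\,u = A^{T} A\,u $$

so the recommendation scores r are again just matrix-vector products.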

Page 6: Seattle Scalability Mahout

Graph Proximity

• Adjacency matrix
  – 2nd-degree adjacency matrix: its square

• Input all of a user’s “friends” or page links as a vector
• A (weighted) distance measure of 1st- through 3rd-degree connections is then a weighted sum of powers of the adjacency matrix applied to that vector
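In symbols (the weights and notation here are assumptions, not from the slide): with adjacency matrix G and f the 0/1 vector of a user's friends or a page's links,

$$ p = \left( \alpha_1 G + \alpha_2 G^{2} + \alpha_3 G^{3} \right) f $$

gives a proximity score for every node, with the α's weighting 1st-, 2nd-, and 3rd-degree connections.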

Page 7: Seattle Scalability Mahout

Dictionary

Applications → Linear Algebra

Doc-based sharding → Row-based distributed matrix
Term-based sharding → Column-based distributed matrix
Indexing → Matrix transpose
PageRank → Coefficients of first eigenvector
Predict all CF ratings → Rebuild preference matrix

Page 8: Seattle Scalability Mahout

How does this help?

• In Search:
  – Latent Semantic Indexing (LSI)
  – probabilistic LSI
  – Latent Dirichlet Allocation

• In Recommenders:
  – Singular Value Decomposition
  – Layered Restricted Boltzmann Machines (Deep Belief Networks)

• In Graphs:
  – PageRank
  – Spectral Decomposition / Spectral Clustering

Page 9: Seattle Scalability Mahout

Often use “Dimensional Reduction”

• To alleviate the sparse Big Data problem of “the curse of dimensionality”

• Used to improve recall and relevance
  – in general: smooth the metric on your data set

Page 10: Seattle Scalability Mahout

New applications with Matrices

If Search is finding a doc vector by multiplying the corpus matrix with the query vector,

and users query with that data represented as a matrix Q (one query vector per session),

giving implicit feedback based on click-through per session as a matrix C:

Page 11: Seattle Scalability Mahout

… continued

• Then the product C^T Q has the form (docs-by-terms) for search!

• This approach has been used by Ted Dunning at Veoh
  – (and probably others)
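A sketch of the shapes involved (Q and C are named in the slides; the row/column conventions below are assumptions): with s sessions, m query terms, and n documents,

$$ Q \in \mathbb{R}^{s \times m}, \qquad C \in \mathbb{R}^{s \times n}, \qquad C^{T} Q \in \mathbb{R}^{n \times m} \ (\text{docs-by-terms}) $$

so the click-through-weighted matrix C^T Q can be indexed and searched exactly like an ordinary term-document corpus.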

Page 12: Seattle Scalability Mahout

Linear Algebra performance tricks

• Naïve item-based recommendations:
  – Calculate the item-item similarity matrix
  – Calculate item recs by applying it to the user’s preference vector

• Express the two steps as one, in matrix notation
• Re-writing that product as a sum:
  – one factor is the vector of preferences for user “v”
  – the other is the vector of preferences for item “i”
  – The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
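In symbols (a sketch in assumed notation): with A the users-by-items preference matrix, rows a_v the per-user preference vectors, and u the target user's preferences,

$$ r = (A^{T} A)\,u = A^{T} (A\,u), \qquad A^{T} A = \sum_{v} a_v^{T} a_v $$

so the item-item similarity matrix never needs to be materialized: each user row contributes one outer product (equivalently, on the right-hand form, one scaled row), which is exactly the kind of per-row contribution a MapReduce pass can accumulate.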

Page 13: Seattle Scalability Mahout

Item Recommender via Hadoop
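A plain-Java sketch of the computation such a Hadoop job distributes (illustrative only; the array types, method names, and toy data here are not Mahout's, and a real job would stream sparse rows rather than hold dense arrays in memory):

// Illustrative sketch: dense in-memory arrays stand in for what a Hadoop job
// would stream as SequenceFile<IntWritable,VectorWritable> rows.
public class ItemRecommenderSketch {

  /** Accumulate A^T A one user row at a time (the per-row work a mapper/combiner would do). */
  static double[][] itemSimilarity(double[][] prefs, int numItems) {
    double[][] sim = new double[numItems][numItems];
    for (double[] userRow : prefs) {                 // one "map" call per user row
      for (int i = 0; i < numItems; i++) {
        if (userRow[i] == 0.0) continue;
        for (int j = 0; j < numItems; j++) {
          sim[i][j] += userRow[i] * userRow[j];      // outer-product contribution
        }
      }
    }
    return sim;
  }

  /** recs = (A^T A) * u : score every item against one user's preference vector. */
  static double[] recommend(double[][] sim, double[] userPrefs) {
    double[] recs = new double[sim.length];
    for (int i = 0; i < sim.length; i++) {
      for (int j = 0; j < sim.length; j++) {
        recs[i] += sim[i][j] * userPrefs[j];
      }
    }
    return recs;
  }

  public static void main(String[] args) {
    double[][] prefs = {            // 3 users x 4 items (toy data)
        {4, 0, 5, 0},
        {0, 3, 4, 1},
        {5, 0, 0, 2}
    };
    double[][] sim = itemSimilarity(prefs, 4);
    double[] recs = recommend(sim, new double[] {4, 0, 5, 0});
    System.out.println(java.util.Arrays.toString(recs));
  }
}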

Page 14: Seattle Scalability Mahout

Apache Mahout

• Apache Mahout currently on release 0.3
  – http://lucene.apache.org/mahout

• Will be a “Top Level Project” soon (before 0.4)
  – ( http://mahout.apache.org )

• “Scalable Machine Learning with commercially friendly licensing”

Page 15: Seattle Scalability Mahout

Mahout Features

• Recommenders
  – absorbed the Taste project
• Classification (Naïve Bayes, C-Bayes, more)
• Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…)
• Fast non-distributed linear mathematics
  – absorbed the classic CERN Colt project
• Distributed Matrices and decomposition
  – absorbed the Decomposer project
• mahout shell-script analogous to $HADOOP_HOME/bin/hadoop
  – $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
  – $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
  – etc…

• Taste web-app for real-time recommendations

Page 16: Seattle Scalability Mahout

DistributedRowMatrix

• Wrapper around a SequenceFile<IntWritable,VectorWritable>

• Distributed methods like:
  – Matrix transpose();
  – Matrix times(Matrix other);
  – Vector times(Vector v);
  – Vector timesSquared(Vector v);

• To get the SVD: pass it into DistributedLanczosSolver:
  – LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
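For reference (standard Lanczos background, not text from the slide), this is why timesSquared(Vector v) is the workhorse: the solver computes the rank-k truncated SVD

$$ A \approx U_k \Sigma_k V_k^{T} $$

by building a Krylov basis from repeated applications v -> A^T A v (exactly what timesSquared does, in a distributed pass over the rows), followed by a small in-memory eigen-decomposition of the resulting tridiagonal matrix.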

Page 17: Seattle Scalability Mahout

Questions?

Contact:

[email protected]

[email protected]

http://twitter.com/pbrane

http://www.decomposer.org/blog

http://www.linkedin.com/in/jakemannix

Page 18: Seattle Scalability Mahout

Appendix

• There are lots of ways to deal with sparse Big Data; many (though not all) have to cope with a feature-space whose dimensionality grows beyond reasonable limits, and the right technique depends heavily on your data…

• That having been said, there are some general techniques

Page 19: Seattle Scalability Mahout

Dealing with Curse of Dimensionality

• Sparseness means fast, but overlap is too small
• Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?

• If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…

Page 20: Seattle Scalability Mahout

Solution A: Matrix decomposition

• Singular Value Decomposition (truncated)

– the “best” low-rank approximation to your matrix
– Used in Latent Semantic Indexing (LSI)
– For graphs: spectral decomposition
– Collaborative filtering (Netflix leaderboard)

• Issues: very computation intensive
  – no parallelized open-source packages (but see Apache Mahout)

• Makes things too dense

Page 21: Seattle Scalability Mahout

SVD: continued

• Hadoop impl. in Mahout (Lanczos)
  – O(N·d·k) for a rank-k SVD on N docs with d elements each

• Density can be dealt with by doing Canopy Clustering offline

• But it only extracts linear feature mixes
• Also, it is still very computation- and I/O-intensive (k passes over the data set); are there better dimensionality-reduction methods?

Page 22: Seattle Scalability Mahout

Solution B: Stochastic Decomposition

co-occurrence-based kernel + online Random Projection + SVD

Page 23: Seattle Scalability Mahout

Co-occurrence-based kernel

• Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best)

• “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}

• {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…}

• Dim(features) goes from 10^5 to 10^8+ (yikes!)
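A minimal sketch of this kind of kernel expansion (the tokenization and class name are illustrative assumptions; the real pipeline would also run the Log-Likelihood Ratio test to keep only the best phrases, e.g. keeping “disney on ice” while dropping “on ice”):

import java.util.ArrayList;
import java.util.List;

public class CooccurrenceKernelSketch {

  /** Expand a token stream into unigram features plus adjacent-pair bigram features. */
  static List<String> expand(String text) {
    String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9 ]", "").split("\\s+");
    List<String> features = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      features.add(tokens[i]);                           // unigram
      if (i + 1 < tokens.length) {
        features.add(tokens[i] + " " + tokens[i + 1]);   // bigram
      }
    }
    return features;
  }

  public static void main(String[] args) {
    // "Disney on Ice was Amazing!" -> unigrams plus all adjacent bigrams
    System.out.println(expand("Disney on Ice was Amazing!"));
  }
}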

Page 24: Seattle Scalability Mahout

Online Random Projection

• Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix
• Or project each nGram down to a random (but sparse) 10^3-dim vector:
  – V = {123876244 => 1.3} (tf-IDF of “disney”)
  – V’ = c * {h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1} (c = 1.3 / sqrt(3))
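A sketch of that sparse hashed projection (the three-probe scheme and the 1/sqrt(3) scaling follow the slide; the hash function, dimension constant, and class name are illustrative assumptions):

public class SparseRandomProjectionSketch {

  static final int DIM = 1000;   // the "merely" 10^3 target dimensions

  /** A simple integer-mixing hash standing in for h(i); any well-mixed hash would do. */
  static int h(int i) {
    i ^= (i >>> 16);
    i *= 0x45d9f3b;
    i ^= (i >>> 16);
    return Math.floorMod(i, DIM);
  }

  /** Project one (featureId => weight) entry onto up to three hashed coordinates. */
  static double[] project(int featureId, double weight) {
    double[] v = new double[DIM];
    double c = weight / Math.sqrt(3.0);     // c = 1.3 / sqrt(3) in the slide's example
    v[h(featureId)] += c;
    v[h(h(featureId))] += c;
    v[h(h(h(featureId)))] += c;
    return v;
  }

  public static void main(String[] args) {
    // V = {123876244 => 1.3}  ->  V' with (up to) three non-zeros of weight 1.3/sqrt(3)
    double[] projected = project(123876244, 1.3);
    System.out.println(java.util.Arrays.stream(projected).filter(x -> x != 0.0).count());
  }
}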

Page 25: Seattle Scalability Mahout

Outer-product and Sum

• Take the 10^3-dim projected vectors and outer-product them with themselves
• The result is a 10^3 x 10^3 matrix
• Sum these in a Combiner
• All results go to a single Reducer, where you compute…
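In symbols (assumed notation): if v_d are the projected vectors, the reducer accumulates

$$ M = \sum_{d} v_d\, v_d^{T} \in \mathbb{R}^{10^3 \times 10^3} $$

which is small enough to hold, and decompose, entirely in memory.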

Page 26: Seattle Scalability Mahout

SVD

• SVD it quickly (it fits in memory)
• Over and over again (as new data comes in)
• Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity).

• SVD-projected vectors can be assigned immediately to nearest clusters if desired
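The further projection step, in the same assumed notation: if the small accumulated matrix decomposes as M ≈ V_k Λ_k V_k^T, each already-random-projected vector v gets the “semantic” projection

$$ \tilde{v} = V_k^{T}\, v $$

and cluster assignment can then be done on the short vector ṽ directly.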

Page 27: Seattle Scalability Mahout

References

• Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061

• Sparse hashing/projection: John Langford et al., “Vowpal Wabbit”, http://hunch.net/~vw/