Seattle Scalability Mahout


Talk given at the Seattle Scalability / NoSQL / Hadoop / etc MeetUp on March 31, 2010

Transcript of Seattle Scalability Mahout

Page 1: Seattle Scalability Mahout

Numerical Recipes in Hadoop

Principal SDE, LinkedIn
Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer
Author, Lucene in Depth (Manning MM/DD/2010)

Jake Mannix
linkedin/in/jakemannix
twitter/pbrane
[email protected]
[email protected]

Page 2: Seattle Scalability Mahout

A Mathematician’s Apology

• Full-text search:
  – Score documents matching “query string”

• Collaborative filtering recommendation:
  – Users who liked {those} also liked {these}

• (Social/web)-graph proximity:
  – People/pages “close” to {this} are {these}

What mathematical structure describes all of these?

Page 3: Seattle Scalability Mahout

Matrix Multiplication!

Page 4: Seattle Scalability Mahout

Full-text Search

• Vector Space Model of IR
• Corpus as a term-document matrix
• Query as a bag-of-words vector
• Full-text search is then just a matrix-vector product of the corpus matrix with the query vector (sketched below)
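In symbols (notation assumed here, not taken from the slide): with m terms and n documents,

$$ A \in \mathbb{R}^{m \times n} \ (\text{term-document matrix}), \qquad q \in \mathbb{R}^{m} \ (\text{bag-of-words query}), \qquad \text{scores} = A^{T} q \in \mathbb{R}^{n} $$

i.e. scoring every document against the query is a single matrix-vector multiplication.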

Page 5: Seattle Scalability Mahout

Collaborative Filtering

• User preference matrix
  – (and item-item similarity matrix)

• Input user as a vector of preferences
• (simple) item-based CF recommendations are then just the item-item similarity matrix applied to that preference vector
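In symbols (notation assumed, as above): with A the users-by-items preference matrix and u the input user's preference vector,

$$ S = A^{T} A \ (\text{item-item similarity}), \qquad r = S\,u = A^{T} A\,u $$

so the recommendation scores r are again just matrix-vector products.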

Page 6: Seattle Scalability Mahout

Graph Proximity

• Adjacency matrix
  – 2nd-degree adjacency matrix: its square

• Input all of a user’s “friends” or page links as a vector
• A (weighted) distance measure of 1st- through 3rd-degree connections is then a weighted sum of powers of the adjacency matrix applied to that vector
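In symbols (the weights and notation here are assumptions, not from the slide): with adjacency matrix G and f the 0/1 vector of a user's friends or a page's links,

$$ p = \left( \alpha_1 G + \alpha_2 G^{2} + \alpha_3 G^{3} \right) f $$

gives a proximity score for every node, with the α's weighting 1st-, 2nd-, and 3rd-degree connections.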

Page 7: Seattle Scalability Mahout

Dictionary

Applications → Linear Algebra

Doc-based sharding → Row-based distributed matrix
Term-based sharding → Column-based distributed matrix
Indexing → Matrix transpose
PageRank → Coefficients of first eigenvector
Predict all CF ratings → Rebuild preference matrix

Page 8: Seattle Scalability Mahout

How does this help?

• In Search:
  – Latent Semantic Indexing (LSI)
  – probabilistic LSI
  – Latent Dirichlet Allocation

• In Recommenders:
  – Singular Value Decomposition
  – Layered Restricted Boltzmann Machines (Deep Belief Networks)

• In Graphs:
  – PageRank
  – Spectral Decomposition / Spectral Clustering

Page 9: Seattle Scalability Mahout

Often use “Dimensional Reduction”

• To alleviate the sparse Big Data problem of “the curse of dimensionality”

• Used to improve recall and relevance
  – in general: smooth the metric on your data set

Page 10: Seattle Scalability Mahout

New applications with Matrices

If Search is finding a doc vector by multiplying the corpus matrix with the query vector,

and users query with that data represented as a matrix Q (one query vector per session),

giving implicit feedback based on click-through per session as a matrix C:

Page 11: Seattle Scalability Mahout

… continued

• Then the product C^T Q has the form (docs-by-terms) for search!

• This approach has been used by Ted Dunning at Veoh
  – (and probably others)
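A sketch of the shapes involved (Q and C are named in the slides; the row/column conventions below are assumptions): with s sessions, m query terms, and n documents,

$$ Q \in \mathbb{R}^{s \times m}, \qquad C \in \mathbb{R}^{s \times n}, \qquad C^{T} Q \in \mathbb{R}^{n \times m} \ (\text{docs-by-terms}) $$

so the click-through-weighted matrix C^T Q can be indexed and searched exactly like an ordinary term-document corpus.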

Page 12: Seattle Scalability Mahout

Linear Algebra performance tricks

• Naïve item-based recommendations:
  – Calculate the item-item similarity matrix
  – Calculate item recs by applying it to the user’s preference vector

• Express the two steps as one, in matrix notation
• Re-writing that product as a sum:
  – one factor is the vector of preferences for user “v”
  – the other is the vector of preferences for item “i”
  – The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
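In symbols (a sketch in assumed notation): with A the users-by-items preference matrix, rows a_v the per-user preference vectors, and u the target user's preferences,

$$ r = (A^{T} A)\,u = A^{T} (A\,u), \qquad A^{T} A = \sum_{v} a_v^{T} a_v $$

so the item-item similarity matrix never needs to be materialized: each user row contributes one outer product (equivalently, on the right-hand form, one scaled row), which is exactly the kind of per-row contribution a MapReduce pass can accumulate.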

Page 13: Seattle Scalability Mahout

Item Recommender via Hadoop
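A plain-Java sketch of the computation such a Hadoop job distributes (illustrative only; the array types, method names, and toy data here are not Mahout's, and a real job would stream sparse rows rather than hold dense arrays in memory):

// Illustrative sketch: dense in-memory arrays stand in for what a Hadoop job
// would stream as SequenceFile<IntWritable,VectorWritable> rows.
public class ItemRecommenderSketch {

  /** Accumulate A^T A one user row at a time (the per-row work a mapper/combiner would do). */
  static double[][] itemSimilarity(double[][] prefs, int numItems) {
    double[][] sim = new double[numItems][numItems];
    for (double[] userRow : prefs) {                 // one "map" call per user row
      for (int i = 0; i < numItems; i++) {
        if (userRow[i] == 0.0) continue;
        for (int j = 0; j < numItems; j++) {
          sim[i][j] += userRow[i] * userRow[j];      // outer-product contribution
        }
      }
    }
    return sim;
  }

  /** recs = (A^T A) * u : score every item against one user's preference vector. */
  static double[] recommend(double[][] sim, double[] userPrefs) {
    double[] recs = new double[sim.length];
    for (int i = 0; i < sim.length; i++) {
      for (int j = 0; j < sim.length; j++) {
        recs[i] += sim[i][j] * userPrefs[j];
      }
    }
    return recs;
  }

  public static void main(String[] args) {
    double[][] prefs = {            // 3 users x 4 items (toy data)
        {4, 0, 5, 0},
        {0, 3, 4, 1},
        {5, 0, 0, 2}
    };
    double[][] sim = itemSimilarity(prefs, 4);
    double[] recs = recommend(sim, new double[] {4, 0, 5, 0});
    System.out.println(java.util.Arrays.toString(recs));
  }
}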

Page 14: Seattle Scalability Mahout

Apache Mahout

• Apache Mahout currently on release 0.3
  – http://lucene.apache.org/mahout

• Will be a “Top Level Project” soon (before 0.4)
  – ( http://mahout.apache.org )

• “Scalable Machine Learning with commercially friendly licensing”

Page 15: Seattle Scalability Mahout

Mahout Features

• Recommenders
  – absorbed the Taste project
• Classification (Naïve Bayes, C-Bayes, more)
• Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…)
• Fast non-distributed linear mathematics
  – absorbed the classic CERN Colt project
• Distributed Matrices and decomposition
  – absorbed the Decomposer project
• mahout shell-script analogous to $HADOOP_HOME/bin/hadoop
  – $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
  – $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
  – etc…

• Taste web-app for real-time recommendations

Page 16: Seattle Scalability Mahout

DistributedRowMatrix

• Wrapper around a SequenceFile<IntWritable,VectorWritable>

• Distributed methods like:
  – Matrix transpose();
  – Matrix times(Matrix other);
  – Vector times(Vector v);
  – Vector timesSquared(Vector v);

• To get the SVD: pass it into DistributedLanczosSolver:
  – LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
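For reference (standard Lanczos background, not text from the slide), this is why timesSquared(Vector v) is the workhorse: the solver computes the rank-k truncated SVD

$$ A \approx U_k \Sigma_k V_k^{T} $$

by building a Krylov basis from repeated applications v -> A^T A v (exactly what timesSquared does, in a distributed pass over the rows), followed by a small in-memory eigen-decomposition of the resulting tridiagonal matrix.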

Page 17: Seattle Scalability Mahout

Questions?

Contact:

[email protected]

[email protected]

http://twitter.com/pbrane

http://www.decomposer.org/blog

http://www.linkedin.com/in/jakemannix

Page 18: Seattle Scalability Mahout

Appendix

• There are lots of ways to deal with sparse Big Data; many (though not all) have to cope with a feature-space whose dimensionality grows beyond reasonable limits, and the right technique depends heavily on your data…

• That having been said, there are some general techniques

Page 19: Seattle Scalability Mahout

Dealing with Curse of Dimensionality

• Sparseness means fast, but overlap is too small
• Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?

• If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…

Page 20: Seattle Scalability Mahout

Solution A: Matrix decomposition

• Singular Value Decomposition (truncated)

– the “best” low-rank approximation to your matrix
– Used in Latent Semantic Indexing (LSI)
– For graphs: spectral decomposition
– Collaborative filtering (Netflix leaderboard)

• Issues: very computation intensive
  – no parallelized open-source packages (but see Apache Mahout)

• Makes things too dense

Page 21: Seattle Scalability Mahout

SVD: continued

• Hadoop impl. in Mahout (Lanczos)
  – O(N·d·k) for a rank-k SVD on N docs with d elements each

• Density can be dealt with by doing Canopy Clustering offline

• But it only extracts linear feature mixes
• Also, it is still very computation- and I/O-intensive (k passes over the data set); are there better dimensionality-reduction methods?

Page 22: Seattle Scalability Mahout

Solution B: Stochastic Decomposition

co-occurrence-based kernel + online Random Projection + SVD

Page 23: Seattle Scalability Mahout

Co-occurrence-based kernel

• Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best)

• “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}

• {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…}

• Dim(features) goes from 10^5 to 10^8+ (yikes!)
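A minimal sketch of this kind of kernel expansion (the tokenization and class name are illustrative assumptions; the real pipeline would also run the Log-Likelihood Ratio test to keep only the best phrases, e.g. keeping “disney on ice” while dropping “on ice”):

import java.util.ArrayList;
import java.util.List;

public class CooccurrenceKernelSketch {

  /** Expand a token stream into unigram features plus adjacent-pair bigram features. */
  static List<String> expand(String text) {
    String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9 ]", "").split("\\s+");
    List<String> features = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      features.add(tokens[i]);                           // unigram
      if (i + 1 < tokens.length) {
        features.add(tokens[i] + " " + tokens[i + 1]);   // bigram
      }
    }
    return features;
  }

  public static void main(String[] args) {
    // "Disney on Ice was Amazing!" -> unigrams plus all adjacent bigrams
    System.out.println(expand("Disney on Ice was Amazing!"));
  }
}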

Page 24: Seattle Scalability Mahout

Online Random Projection

• Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix
• Or project each nGram down to a random (but sparse) 10^3-dim vector:
  – V = {123876244 => 1.3} (tf-IDF of “disney”)
  – V’ = c * {h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1} (c = 1.3 / sqrt(3))
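A sketch of that sparse hashed projection (the three-probe scheme and the 1/sqrt(3) scaling follow the slide; the hash function, dimension constant, and class name are illustrative assumptions):

public class SparseRandomProjectionSketch {

  static final int DIM = 1000;   // the "merely" 10^3 target dimensions

  /** A simple integer-mixing hash standing in for h(i); any well-mixed hash would do. */
  static int h(int i) {
    i ^= (i >>> 16);
    i *= 0x45d9f3b;
    i ^= (i >>> 16);
    return Math.floorMod(i, DIM);
  }

  /** Project one (featureId => weight) entry onto up to three hashed coordinates. */
  static double[] project(int featureId, double weight) {
    double[] v = new double[DIM];
    double c = weight / Math.sqrt(3.0);     // c = 1.3 / sqrt(3) in the slide's example
    v[h(featureId)] += c;
    v[h(h(featureId))] += c;
    v[h(h(h(featureId)))] += c;
    return v;
  }

  public static void main(String[] args) {
    // V = {123876244 => 1.3}  ->  V' with (up to) three non-zeros of weight 1.3/sqrt(3)
    double[] projected = project(123876244, 1.3);
    System.out.println(java.util.Arrays.stream(projected).filter(x -> x != 0.0).count());
  }
}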

Page 25: Seattle Scalability Mahout

Outer-product and Sum

• Take the 10^3-dim projected vectors and outer-product them with themselves
• The result is a 10^3 x 10^3 matrix
• Sum these in a Combiner
• All results go to a single Reducer, where you compute…
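In symbols (assumed notation): if v_d are the projected vectors, the reducer accumulates

$$ M = \sum_{d} v_d\, v_d^{T} \in \mathbb{R}^{10^3 \times 10^3} $$

which is small enough to hold, and decompose, entirely in memory.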

Page 26: Seattle Scalability Mahout

SVD

• SVD it quickly (it fits in memory)
• Over and over again (as new data comes in)
• Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity).

• SVD-projected vectors can be assigned immediately to nearest clusters if desired
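The further projection step, in the same assumed notation: if the small accumulated matrix decomposes as M ≈ V_k Λ_k V_k^T, each already-random-projected vector v gets the “semantic” projection

$$ \tilde{v} = V_k^{T}\, v $$

and cluster assignment can then be done on the short vector ṽ directly.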

Page 27: Seattle Scalability Mahout

References

• Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061

• Sparse hashing/projection: John Langford et al., “Vowpal Wabbit”, http://hunch.net/~vw/