Matrix methods for Hadoop
Matrix Methods with Hadoop
David F. Gleich, Assistant Professor, Computer Science, Purdue University
David Gleich · Purdue 1
Slides bit.ly/10SIe1A Code github.com/dgleich/matrix-hadoop-tutorial
A bit of philosophy …
Image from rockysprings, deviantart, CC share-alike
Matrix computations
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$
Outcomes
Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations.
Understand how to use Hadoop to compute these matrix methods at scale for BigData. Example: recommenders with social network info.
Understand some of the issues that could arise.
Ideal outcomes
How to use techniques from matrix computations in order to solve your problems quickly!
1986
Taking the red pill …
Image from rockysprings, deviantart, CC share-alike
Matrix computations Physics
Statistics Engineering
Graphics Bioinformatics
Databases Machine learning
Information retrieval Computer vision Social networks
matrix computations ≠ linear algebra
A SQL statement as a matrix computation
How do I find the average rating for each product?
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
    SELECT
      p.product_id,
      p.name,
      AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
      ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC
This SQL statement is a matrix computation!
Image from rockysprings, deviantart, CC share-alike
    SELECT ... AVG(pr.rating) ... GROUP BY p.product_id
product_ratings
    pid   uid   rating
    pid8  uid2  4
    pid9  uid9  1
    pid2  uid9  5
    pid9  uid5  5
    pid6  uid8  4
    pid1  uid2  4
    pid3  uid4  4
    pid5  uid9  2
    pid9  uid8  4
    pid9  uid9  1
Is a matrix!
pid1 pid2 pid3 pid4 pid5 pid6 pid7 pid8 pid9
But it’s a weird matrix: missing entries!
product_ratings is a matrix, but it’s a weird matrix.
[Figure: the ratings drawn as a sparse matrix over pid1 … pid9, with only the rated entries filled in.]
Matrix → SELECT AVG(r) ... GROUP BY pid → Vector (average of ratings)
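But it’s a weird matrix and not a linear operator
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
$$\operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$$
Here “$A_{i,j} \neq 0$” counts 1 when the entry is present and 0 otherwise.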
matrix computations ≠ linear algebra
… but there is a linear operator hiding …
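$$\operatorname{avg}(A) = Pe, \qquad P = \begin{bmatrix} A_{1,1} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & A_{1,2} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & \cdots \\ A_{2,1} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & A_{2,2} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & \cdots \\ \vdots & & \ddots \end{bmatrix}$$
$e$ is the vector of all ones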
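A small check of the identity avg(A) = Pe can be sketched in plain Python. This is only an illustration: the dict-of-rows storage, the ratings, and the helper names `row_average`, `row_normalized`, and `times_ones` are assumptions made for this sketch, not part of the tutorial code.

```python
# Sparse matrix as {row: {col: value}}; only present (rated) entries are stored.
# Hypothetical ratings chosen for illustration.
A = {
    "pid1": {"uid2": 4.0},
    "pid5": {"uid9": 2.0},
    "pid9": {"uid5": 5.0, "uid8": 4.0, "uid9": 1.0},
}

def row_average(A):
    """avg(A): sum of the present entries of each row divided by their count."""
    return {i: sum(row.values()) / len(row) for i, row in A.items()}

def row_normalized(A):
    """P: each present entry A[i][j] divided by the number of present entries in row i."""
    return {i: {j: v / len(row) for j, v in row.items()} for i, row in A.items()}

def times_ones(P):
    """P e with e the vector of all ones, which is just the row sums of P."""
    return {i: sum(row.values()) for i, row in P.items()}

avg = row_average(A)
Pe = times_ones(row_normalized(A))
```

The nonlinear part (counting the present entries) is absorbed into building P; once P exists, the average is an ordinary matrix-vector product with e.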
Hadoop, MapReduce, and Matrix Methods
MapReduce
The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Express algorithms in “data-local operations”.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.
[Figure: input blocks 1 to 5 feed map tasks (M); the shuffle routes map output to reduce tasks (R).]
Input stored in triplicate
Map output persisted to disk before shuffle
Reduce input/output on disk
Data scalable
Fault-tolerance by design
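The map, shuffle, reduce pattern described above can be sketched as a tiny in-memory simulation. This is a teaching sketch under stated assumptions: `run_mapreduce` is a hypothetical helper, not a Hadoop API, and it ignores distribution, disk persistence, and fault tolerance entirely. Word counting is used as the example job.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce round in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: all values with the same key land in the same group.
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        # Reduce: each key is processed once, together with all of its values.
        output.extend(reducer(key, groups[key]))
    return output

def wc_map(document):
    for word in document.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

docs = ["matrix bigdata hadoop", "bigdata hadoop", "matrix bigdata"]
result = run_mapreduce(docs, wc_map, wc_reduce)
```

The only communication between records happens in the grouping step, which is the point of the "one type of communication" remark.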
wordcount is a matrix computation too
    map(document):
        for word in document:
            emit (word, 1)
    reduce(word, counts):
        emit (word, sum(counts))
[Figure: documents (D) in blocks 1 to 5 emit pairs such as (matrix,1), (bigdata,1), (hadoop,1), which the shuffle groups by word.]
wordcount is a matrix computation too
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
The rows of $A$ are doc1, doc2, …, docm.
word count $= \operatorname{colsum}(A) = A^T e$, where $e$ is the vector of all ones.
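To make the identity concrete, the following sketch builds a small dense document-term matrix and checks that its column sums are the word counts. The three documents and the variable names are invented for illustration; a real corpus would need a sparse representation.

```python
from collections import Counter

docs = ["matrix bigdata hadoop", "bigdata hadoop bigdata", "matrix"]
vocab = sorted({w for d in docs for w in d.split()})

# A[i][j] = number of times word j appears in document i.
A = [[d.split().count(w) for w in vocab] for d in docs]

# e is the vector of all ones, one entry per document (row of A).
e = [1] * len(docs)

# (A^T e)_j = sum_i A[i][j] * e[i], i.e. the j-th column sum of A.
colsum = [sum(A[i][j] * e[i] for i in range(len(docs))) for j in range(len(vocab))]
word_count = dict(zip(vocab, colsum))

# Reference answer computed the usual way.
reference = Counter(w for d in docs for w in d.split())
```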
inverted index is a matrix computation too
The document-term matrix $A$ has rows doc1, doc2, …, docm:
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Its transpose has rows term1, term2, …, termn:
$$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$$
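A minimal sketch of the same idea on a sparse matrix stored by row: transposing the document-term matrix yields, for every term, the documents that contain it. The data and the `transpose` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Sparse document-term matrix A stored by row: {doc: {term: count}}.
A = {
    "doc1": {"matrix": 2, "hadoop": 1},
    "doc2": {"hadoop": 3},
    "doc3": {"matrix": 1, "bigdata": 4},
}

def transpose(A):
    """Return A^T in the same {row: {col: value}} layout."""
    At = defaultdict(dict)
    for row_key, row in A.items():
        for col_key, value in row.items():
            At[col_key][row_key] = value
    return dict(At)

# Each row of A^T lists the documents containing that term.
inverted_index = transpose(A)
```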
A recommender system with social info
product_ratings
    pid   uid   rating
    pid8  uid2  4
    pid9  uid9  1
    pid2  uid9  5
    pid9  uid5  5
    pid6  uid8  4
    pid1  uid2  4
    pid3  uid4  4
    pid5  uid9  2
    pid9  uid8  4
    pid9  uid9  1
friends_links
    uid6  uid1
    uid8  uid9
    uid7  uid7
    uid7  uid4
    uid6  uid2
    uid7  uid1
    uid3  uid1
    uid1  uid8
    uid7  uid3
    uid9  uid1
A recommender system with social info
Recommend each item based on the average rating of all trusted users.
$R$ is the matrix built from product_ratings (labels pid1, pid2, …); $S$ is the matrix built from friends_links (labels uid1, uid2, …).
“$X = S R^T$” with something that is almost a matrix-matrix product:
$$X_{uid,pid} = \left( \sum_{uid2} S_{uid,uid2} R_{uid2,pid} \right) \cdot \left( \sum_{uid2} \text{“}S_{uid,uid2} \text{ and } R_{uid2,pid} \neq 0\text{”} \right)^{-1}$$
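The formula for X can be evaluated directly on small sparse inputs, as in the sketch below. The user and product names, the dict layout (S indexed as `S[uid][uid2]`, R indexed as `R[uid2][pid]` to match the subscripts in the formula), and the function name `recommend` are all assumptions for illustration.

```python
from collections import defaultdict

# S[uid][uid2] = 1 when uid trusts uid2; only nonzero entries are stored.
S = {"u1": {"u2": 1, "u3": 1}, "u2": {"u3": 1}}
# R[uid2][pid] = rating given by uid2 to pid; only existing ratings are stored.
R = {"u2": {"p1": 4, "p2": 2}, "u3": {"p1": 5}}

def recommend(S, R):
    """X[uid, pid] = (sum of trusted ratings) / (number of trusted ratings)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for uid, trusted in S.items():
        for uid2, s in trusted.items():
            for pid, r in R.get(uid2, {}).items():
                sums[(uid, pid)] += s * r      # the matrix-matrix product part
                counts[(uid, pid)] += 1        # the "both entries nonzero" count
    return {key: sums[key] / counts[key] for key in sums}

X = recommend(S, R)
```

The division by the count is what makes this only "almost" a matrix-matrix product.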
Tools I like
hadoop streaming, dumbo, mrjob, hadoopy, C++
Tools I don’t use but other people seem to like …
pig, java, hbase, mahout, Eclipse, Cassandra
Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
hadoop streaming
the map function is a program
    (key,value) pairs are sent via stdin
    output (key,value) pairs go to stdout
the reduce function is a program
    (key,value) pairs are sent via stdin
    keys are grouped
    output (key,value) pairs go to stdout
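As a rough sketch of what such programs look like, the two functions below process text lines the way a streaming mapper and reducer would, assuming tab-separated key and value on each line. In a real job each function would be a separate script reading `sys.stdin` and printing its output; here sorting stands in for the shuffle so the example runs locally. The function names are illustrative.

```python
import itertools

def mapper(lines):
    """Map step: read raw text lines, emit 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

def reducer(lines):
    """Reduce step: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(value) for _, value in group))

# Local stand-in for the cluster: sorting plays the role of the shuffle.
mapped = sorted(mapper(["Hadoop matrix", "matrix Matrix"]))
reduced = list(reducer(mapped))
```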
mrjob from
a wrapper around hadoop streaming for map and reduce functions in python
    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):
        def mapper(self, _, line):
            # emit each lower-cased word with a count of 1
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            # add up the counts for each word
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()
How can Hadoop streaming possibly be fast?

    Hadoop streaming   Iter 1 QR   Iter 1 Total   Iter 2 Total   Overall Total
    frameworks         (secs.)     (secs.)        (secs.)        (secs.)
    Dumbo              67725       960            217            1177
    Hadoopy            70909       612            118            730
    C++                15809       350            37             387
    Java               -           436            66             502

Synthetic data test: 100,000,000-by-500 matrix (~500GB)
Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+Atlas
Custom C++ TypedBytes reader/writer with Atlas
New non-streaming Java implementation too
All timing results from the Hadoop job tracker.
C++ in streaming beats a native Java implementation.
(Table from David Gleich (Sandia), MapReduce 2011, slide 16/22.)
500 GB matrix. Computing the R in a QR factorization. See my next talk!
Example available from github.com/dgleich/mrtsqr for verification.
mrjob could be faster if it used typedbytes for intermediate storage; see https://github.com/Yelp/mrjob/pull/447
Matrix-vector product
$$Ax = y, \qquad y_i = \sum_k A_{ik} x_k$$
Follow along! matrix-hadoop/codes/smatvec.py
Where do matrix-vector products arise?
Google’s PageRank
Computing cosine-similarity between one document and all other documents
Predictions from kernel methods
Computing averages (the example above)
Matrix-vector product
A is stored by row
    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061
x is stored entry-wise
    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080
Follow along! matrix-hadoop/codes/smatvec.py
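Reading the two text formats shown above might look like the following sketch. It assumes each matrix line is a row index followed by (column, value) pairs and each vector line is an index followed by a value, as in the samples; the helper names `parse_matrix_row` and `parse_vector_entry` are hypothetical.

```python
def parse_matrix_row(line):
    """'i j1 v1 j2 v2 ...' -> (i, [(j1, v1), (j2, v2), ...])."""
    vals = line.split()
    row = int(vals[0])
    entries = [(int(vals[k]), float(vals[k + 1]))
               for k in range(1, len(vals), 2)]
    return row, entries

def parse_vector_entry(line):
    """'i v' -> (i, v)."""
    idx, val = line.split()
    return int(idx), float(val)

row, entries = parse_matrix_row("0 0 0.125 3 1.024 4 0.121")
idx, val = parse_vector_entry("1 -0.98")
```

A vector line always has exactly two tokens and a matrix line has an odd number of tokens, which is how the mapper later tells them apart.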
Matrix-vector product (in pictures)
Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row $i$
Reduce 2: output $\sum_k A_{ik} x_k$, which gives $y$
Map 1: align on columns
    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                 # row
                   (float(vals[1]),))       # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],             # column
                       (row,                # i, Aij
                        float(vals[i+1])))
Reduce 1: output $A_{ik} x_k$ keyed on row $i$
    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)
Note that you should use a secondary sort to avoid reading both in memory.
Reduce 2: output $\sum_k A_{ik} x_k$
    def sumred(self, key, vals):
        yield (key, sum(vals))
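The three steps can be chained in a local simulation to see the whole matrix-vector product at once. This is a sketch, not the tutorial's smatvec.py: the functions are plain functions rather than mrjob methods, `shuffle` is a hypothetical stand-in for the Hadoop shuffle, and the data is the sample matrix and vector shown earlier.

```python
from collections import defaultdict

MATRIX = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
          "3 4 -1.45", "4 2 0.061"]
VECTOR = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                       # vector entry, keyed by its index k
        yield (vals[0], (float(vals[1]),))
    else:                                    # matrix row, re-keyed by column k
        for k in range(1, len(vals), 2):
            yield (vals[k], (vals[0], float(vals[k + 1])))

def joinred(key, vals):
    xk = sum(v[0] for v in vals if len(v) == 1)
    for v in vals:
        if len(v) == 2:
            yield (v[0], v[1] * xk)          # A_ik * x_k keyed by row i

def sumred(key, vals):
    yield (key, sum(vals))

stage1 = shuffle(p for line in MATRIX + VECTOR for p in joinmap(line))
stage2 = shuffle(p for k, vs in stage1.items() for p in joinred(k, vs))
y = dict(p for k, vs in stage2.items() for p in sumred(k, vs))
```

Keys stay as strings here because that is how they come out of `line.split()`.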
Matrix-matrix product
$$AB = C, \qquad C_{ij} = \sum_k A_{ik} B_{kj}$$
Follow along! matrix-hadoop/codes/matmat.py
Matrix-matrix product
A is stored by row
    $ head samples/smat_10_5_A.txt
    0 0 0.599 4 -1.53
    1
    2 2 0.260
    3
    4 0 0.267 1 0.839
B is stored by row
    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
Follow along! matrix-hadoop/codes/matmat.py
Matrix-matrix product (in pictures)
Input: A and B
Map 1: align on columns
Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$
Reduce 2: output $\sum_k A_{ik} B_{kj}$, which gives $C$
Matrix-matrix product (in code)
Map 1: align on columns
    def joinmap(self, key, line):
        mtype = self.parsemat()
        vals = line.split()
        row = vals[0]
        rowvals = \
            [(vals[i], float(vals[i+1]))
                for i in range(1, len(vals), 2)]
        if mtype == 1:
            # matrix A, output by col
            for val in rowvals:
                yield (val[0], (row, val[1]))
        else:
            # matrix B, output the whole row keyed by its row index
            yield (row, (rowvals,))
Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$
    def joinred(self, key, vals):
        # load the data into memory
        brow = []
        acol = []
        for val in vals:
            if len(val) == 1:
                brow.extend(val[0])
            else:
                acol.append(val)

        for (bcol, bval) in brow:
            for (arow, aval) in acol:
                yield ((arow, bcol), aval*bval)
Reduce 2: output $\sum_k A_{ik} B_{kj}$
    def sumred(self, key, vals):
        yield (key, sum(vals))
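The same join-then-sum pattern can be run locally on two tiny matrices to check the result by hand. This is a sketch under assumptions: the functions are standalone rather than mrjob methods, the matrix type is passed in explicitly instead of coming from `self.parsemat()`, `shuffle` is a hypothetical stand-in for the Hadoop shuffle, and the 2-by-2 matrices are invented.

```python
from collections import defaultdict

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], both stored by row.
A_ROWS = ["0 0 1.0 1 2.0", "1 1 3.0"]
B_ROWS = ["0 0 4.0", "1 0 5.0 1 6.0"]

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(mtype, line):
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[k], float(vals[k + 1])) for k in range(1, len(vals), 2)]
    if mtype == 1:                   # matrix A: key each entry by its column k
        for col, val in rowvals:
            yield (col, (row, val))
    else:                            # matrix B: key the whole row by its row k
        yield (row, (rowvals,))

def joinred(key, vals):
    brow, acol = [], []
    for val in vals:
        if len(val) == 1:
            brow.extend(val[0])      # entries (j, B_kj) of row k of B
        else:
            acol.append(val)         # entries (i, A_ik) of column k of A
    for bcol, bval in brow:
        for arow, aval in acol:
            yield ((arow, bcol), aval * bval)

def sumred(key, vals):
    yield (key, sum(vals))

mapped = [p for line in A_ROWS for p in joinmap(1, line)]
mapped += [p for line in B_ROWS for p in joinmap(2, line)]
stage1 = shuffle(mapped)
stage2 = shuffle(p for k, vs in stage1.items() for p in joinred(k, vs))
C = dict(p for k, vs in stage2.items() for p in sumred(k, vs))
```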
Our social recommender
$R^T$, $S$
Follow along! matrix-hadoop/recsys/recsys.py
R is stored entry-wise
    $ gunzip -c data/rating.txt.gz
    139431556 591156 5
    139431556 1312460676 5
    139431556 204358 4
    139431556 368725 5
Columns: Object ID, User ID, Rating
S is stored entry-wise
    $ gunzip -c data/rating.txt.gz
    3287060356 232085 -1
    3288305540 709420 1
    3290337156 204418 -1
    3294138244 269243 -1
Columns: My ID, Other ID, Trust
Social recommender (in code)
Map 1: align on columns
    def joinmap(self, key, line):
        parts = line.split('\t')
        if len(parts) == 8:  # ratings
            objid = parts[0].strip()
            uid = parts[1].strip()
            rat = int(parts[2])
            yield (uid, (objid, rat))
        elif len(parts) == 4:  # trust
            myid = parts[0].strip()
            otherid = parts[1].strip()
            value = int(parts[2])
            if value > 0:
                yield (otherid, (myid,))
Conceptually, the first step is the same as the matrix-matrix product. We reorganize the data by user-id to be able to map the trust relationships.
Reduce 1: output $A_{ik} B_{kj}$ keyed on $(i,j)$
    def joinred(self, key, vals):
        tusers = []   # uids that trust key
        ratobjs = []  # objs rated by uid=key
        for val in vals:
            if len(val) == 1:
                tusers.append(val[0])
            else:
                ratobjs.append(val)

        for (objid, rat) in ratobjs:
            for uid in tusers:
                yield ((uid, objid), rat)
Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.
Reduce 2: in place of $\sum_k A_{ik} B_{kj}$, output a smoothed average
    def avgred(self, key, vals):
        s = 0.
        n = 0
        for val in vals:
            s += val
            n += 1
        # the smoothed average of ratings
        yield (key,
               (s + self.options.avg) / float(n + 1))
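Put together, the recommender steps can be exercised locally on a handful of records. This is a sketch and not recsys.py itself: the records are invented, they are passed as tuples instead of tab-separated lines, `shuffle` is a hypothetical stand-in for the Hadoop shuffle, and `PRIOR` is an assumed value standing in for `self.options.avg`.

```python
from collections import defaultdict

RATINGS = [("objA", "u2", 5), ("objA", "u3", 3), ("objB", "u3", 4)]  # (object, user, rating)
TRUST = [("u1", "u2", 1), ("u1", "u3", 1), ("u2", "u3", -1)]         # (me, other, trust)
PRIOR = 3.0  # assumed smoothing value, standing in for self.options.avg

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap():
    for objid, uid, rat in RATINGS:
        yield (uid, (objid, rat))            # ratings keyed by the rating user
    for myid, otherid, value in TRUST:
        if value > 0:                        # keep positive trust only
            yield (otherid, (myid,))         # trust keyed by the trusted user

def joinred(key, vals):
    tusers = [v[0] for v in vals if len(v) == 1]   # users that trust `key`
    ratobjs = [v for v in vals if len(v) == 2]     # objects rated by `key`
    for objid, rat in ratobjs:
        for uid in tusers:
            yield ((uid, objid), rat)

def avgred(key, vals):
    # smoothed average: one extra pseudo-rating equal to PRIOR
    yield (key, (sum(vals) + PRIOR) / float(len(vals) + 1))

stage1 = shuffle(joinmap())
stage2 = shuffle(p for k, vs in stage1.items() for p in joinred(k, vs))
X = dict(p for k, vs in stage2.items() for p in avgred(k, vs))
```

With these inputs only u1 receives recommendations, because the single trust link from u2 is negative and is dropped in the map step.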
Better ways to store matrices in Hadoop
Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.
No need for “integer” keys that fall between 1 and n!
Tall-and-skinny matrices are common in BigData
[Figure: A partitioned by rows into submatrices A1, A2, A3, A4.]
A : m x n, m ≫ n
Key is an arbitrary row-id
Value is the 1 x n array for a row
Each submatrix Ai is the input to a map task.
Double-precision floating point was designed for the era where “big” was 1000-10000
Error analysis of summation
    s = 0
    for i = 1 to n:
        s = s + x[i]
A simple summation formula has error that is not always small if n is a billion.
$$\mathrm{fl}(x + y) = (x + y)(1 + \varepsilon)$$
$$\left| \mathrm{fl}\Big(\sum_i x_i\Big) - \sum_i x_i \right| \le n \mu \sum_i |x_i|, \qquad \mu \approx 10^{-16}$$
If your application matters then watch out for this issue. Use quad-precision arithmetic or compensated summation instead.
Compensated Summation
“Kahan summation algorithm” on Wikipedia
    s = 0.; c = 0.
    for i = 1 to n:
        y = x[i] - c
        t = s + y
        c = (t - s) - y
        s = t
Mathematically, c is always zero. On a computer, c can be non-zero. The parentheses matter!
$$\left| \mathrm{fl}(\mathrm{csum}(x)) - \sum_i x_i \right| \le (\mu + n \mu^2) \sum_i |x_i|, \qquad \mu \approx 10^{-16}$$
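The difference between the two summation loops is easy to observe in Python, whose floats are double precision. The sketch below is illustrative only: the input (one million copies of 0.1) is an arbitrary choice, and `math.fsum` from the standard library is used as the accurately rounded reference.

```python
import math

def naive_sum(xs):
    """Plain left-to-right summation."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

def kahan_sum(xs):
    """Compensated (Kahan) summation; c carries the rounding error of each add."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y   # the parentheses matter: this recovers the lost low-order bits
        s = t
    return s

xs = [0.1] * 1000000
exact = math.fsum(xs)
naive_err = abs(naive_sum(xs) - exact)
kahan_err = abs(kahan_sum(xs) - exact)
```

Comparing `naive_err` with `kahan_err` shows the effect of the extra `nµ` factor in the naive bound versus the `µ + nµ²` factor in the compensated one.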
Collaborators, Friends, and People who have taught me
MRTSQR: Paul Constantine (Stanford), Austin Benson (Stanford), James Demmel (Berkeley)
Simform: Jeremy Templeton (Sandia), Joe Ruthruff (Sandia), Yangyang Hou (Purdue), Joe Nichols (Stanford)
Sandia MapReduce: Todd Plantenga, Tammy Kolda, Justin Basilico (now Netflix)
Others: Margot Gerritsen (Stanford)
Grants: Sandia CSAR
Questions?
Image from rockysprings, deviantart, CC share-alike