Distributed CUR Decomposition for Bi-Clustering
Stephen Kline, Kevin Shaw
June 1, 2016
{sakline, keshaw}@stanford.edu
Stanford University, CME 323 Final Project


Page 1: Distributed CUR Decomposition for Bi-Clustering (rezab/classes/cme323/S16/projects_slides/kline_shaw.pdf)


Page 2: CUR as an alternative to SVD (e.g., biclustering)

[Figure: Movie Ratings matrix (sparse / huge); labels: Complete ratings for select Users, Complete ratings for select Movies]

SVD: accurate but heavy, less interpretable (rotated space)

CUR: less accurate but light, more interpretable*

*As archetypal users and movies

[Figure: User-Movie Biclusters; axes: Movies, Users]

Biclustering was originally developed in the context of DNA microarrays.

Biclustering also has potential in other areas and adds interpretability.

Source: Source Code for Biology and Medicine (April 2013), "The non-negative matrix factorization toolbox for biological data mining"

Page 3: Review of SVD: $A = U \Sigma V^T$

• PRO: High accuracy
o k singular values/vectors produce the best rank-k approximation to A

• CON: High computation/space requirements
o In our biclustering application with MovieLens data, the distributed SVD is the "roughly square" case, which uses ARPACK (vs. the "tall and skinny" case, which uses the $A^T A$ trick)

[Figure: block diagram of the SVD factors of A. Labels: A sparse / huge; factor blocks labeled dense / full, dense / full, dense / big, dense / big, sparse / small]
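The PRO bullet above can be checked on a toy matrix. The following is a minimal NumPy sketch, not the project's distributed implementation: the helper name `best_rank_k` and the small random matrix are assumptions made for illustration only.

```python
import numpy as np

def best_rank_k(A, k):
    """Rank-k approximation of A built from its top k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))      # small dense stand-in for a ratings matrix
A2 = best_rank_k(A, 2)

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2, "fro")
tail = np.sqrt(np.sum(s[2:] ** 2))
```

Here `err` and `tail` agree up to rounding, which is the sense in which the truncated SVD is the best rank-k approximation.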

Page 4: Background on A = CUR

• CUR trades accuracy for computation/space savings

• C/R = cols/rows from A

• U = pseudo-inverse of W (intersection of C and R)

[Figure: block diagram of A = CUR. Labels: A sparse / huge; C sparse / big; R sparse / big; U = Pinv(W) dense / small, where W is the intersection of C and R (call it W, very small)]

• Col/RowSelect() alg samples w/ replacement (allows duplicates)

• Pinv(W) calculated via SVD(W)

• Accuracy better for large datasets

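The bullets above can be put together in a small serial sketch. This is an illustration under stated assumptions, not the authors' code: the names `cur` and `pinv_via_svd` are hypothetical, sampling probabilities proportional to squared column/row norms are assumed (the slides do not state the weighting), and no rescaling of the sampled columns or rows is applied.

```python
import numpy as np

def pinv_via_svd(W, tol=1e-10):
    """Pseudo-inverse of W from its SVD, dropping near-zero singular values."""
    Uw, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > tol * s[0]
    # pinv(W) = V * diag(1/s) * Uw^T, restricted to the kept singular values.
    return (Vt[keep].T / s[keep]) @ Uw[:, keep].T

def cur(A, k, rng):
    """Sample k columns and k rows with replacement; U is the pseudo-inverse of W."""
    sq = A ** 2
    col_p = sq.sum(axis=0) / sq.sum()   # assumption: squared-norm probabilities
    row_p = sq.sum(axis=1) / sq.sum()
    cols = rng.choice(A.shape[1], size=k, replace=True, p=col_p)  # duplicates allowed
    rows = rng.choice(A.shape[0], size=k, replace=True, p=row_p)
    C, R = A[:, cols], A[rows, :]
    W = A[np.ix_(rows, cols)]           # intersection of C and R
    return C, pinv_via_svd(W), R

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))  # exactly rank 2
C, U, R = cur(A, k=8, rng=rng)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

On an exactly low-rank matrix the product C U R recovers A up to rounding whenever W has the same rank as A; on real ratings data the reconstruction is only approximate.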

Page 5: Design Decisions for Distributed CUR

[Figure: block diagram of A = CUR. Labels: A sparse / huge; C sparse / big; R sparse / big; U dense / small]

Key Design Decision: Distribute two instances of A, avoiding future all-to-all communications

Only necessary to store C and R as a set of indices into A. Compute U locally.

There are multiple variations of CUR. We selected the algorithm as presented in Drineas et al., 2006, "Fast Monte Carlo Algorithms for Matrices III", which (for example) does not remove duplicate cols/rows as some others do.

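The idea of keeping C and R only as indices into A can be sketched as follows. The container `CURIndices` and the helper `build` are hypothetical names for illustration, the matrix is a small dense NumPy array rather than a distributed one, and U is taken as the pseudo-inverse of W, as on the background slide.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CURIndices:
    """Hypothetical container: C and R are kept only as index arrays into A."""
    cols: np.ndarray   # sampled column indices (duplicates allowed)
    rows: np.ndarray   # sampled row indices (duplicates allowed)
    U: np.ndarray      # small dense matrix, computed locally

    def entry(self, A, i, j):
        """Approximate A[i, j] as C[i, :] @ U @ R[:, j] without materializing C or R."""
        return A[i, self.cols] @ self.U @ A[self.rows, j]

def build(A, cols, rows):
    W = A[np.ix_(rows, cols)]   # small intersection block, cheap to handle locally
    return CURIndices(cols, rows, np.linalg.pinv(W, rcond=1e-10))

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 2)) @ rng.standard_normal((2, 9))   # exactly rank 2
model = build(A, cols=np.array([0, 3, 3, 7]), rows=np.array([1, 1, 5, 8]))
approx = model.entry(A, 4, 6)
```

Only the two index arrays and the small U need to be stored alongside A; individual entries of the approximation are then computed on demand.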

Page 6: Serial vs. Distributed CUR: Asymptotics

Serial

• Build C and R:
o Generate probabilities: $O(mn)$
o Create C matrix: $O(mk)$
o Create R matrix: $O(nk)$

• Construct U:
o Compute $C^T C$: $O(mk^2)$
o SVD of $C^T C$: $O(k^3)$
o Compute A and B: $O(k^3)$
o $U = AB^T$: $O(k^3)$

Distributed (communication cost and computation time)

• Build C and R:

• Generate probabilities: $O(mn + p)$ cost, $O(\text{max dense})$ time
o Create 2 RDDs by Row/Col partition: $O(mn)$ cost, all-to-all (AtoA)
o Both instances: reduce to Row/Col sums: $O(\text{max dense})$ time, no communication
o One instance: reduce Row sum to total: $O(p)$ cost, $O(\log p)$ time
o Broadcast total to calculate probs: $O(p)$ cost, $O(\log p)$ time

• Create C/R matrices:
o Locally sample k rows/cols: $O(k)$
o Broadcast sample to RDDs: $O(pk)$ cost, $O(k \log p)$ time

• Construct U:
o Same as Serial (less opportunity to distribute)
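The distributed probability step can be simulated in plain NumPy to make the reduce and broadcast pattern concrete. This is a sketch under assumptions, not Spark code: a list of row blocks stands in for the partitions of a row-partitioned RDD, the function name `row_probabilities` is hypothetical, and the "sums" are assumed to be sums of squared entries.

```python
import numpy as np

def row_probabilities(partitions):
    """partitions: list of row blocks of A, one per simulated worker."""
    # Each worker reduces its own rows to sums of squares (no communication).
    local_sums = [np.sum(block ** 2, axis=1) for block in partitions]
    # Reduce: one scalar per worker is combined into the global total.
    total = sum(float(s.sum()) for s in local_sums)
    # Broadcast: every worker divides its local sums by the same total.
    return [s / total for s in local_sums]

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 5))
partitions = np.array_split(A, 3, axis=0)    # p = 3 simulated workers
probs = np.concatenate(row_probabilities(partitions))

# The driver samples k row indices locally; only these k indices are broadcast.
k = 4
sampled_rows = rng.choice(A.shape[0], size=k, replace=True, p=probs)
```

The column probabilities would be computed the same way on the second, column-partitioned instance of A, which is why the row and column sums themselves need no communication.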

Page 7: Biclustering: Distributed CUR vs. SVD (Empirics)