Big Arctic Data - ku · 2014-04-17
Big Arctic Data
Evangelos (Vagelis) Papalexakis, School of Computer Science, Carnegie Mellon University
Arctic Analysis 2014, Greenland
Roadmap
• Motivation & Introduction
Eric Fisher, “See something, say something”
http://socialgraph.blogspot.com/2010/12/facebook-map-of-world-visualising.html
How big is big?
• 100 hours of video a minute uploaded to YouTube
• 28 million Wikipedia pages
• 1 billion Facebook users
• 6 billion Flickr photos
• Need many data centers to store the data
Slide adapted from: http://graphlab.com/learn/presentations.html
Picture from: http://web.netenrich.com/Portals/128884/images/FB_SERVER_040_x900.jpg
Definition – The 3 V’s
• Volume: hard to store
• Variety: very diverse/rich
• Velocity: coming in faster than we can handle
Success story
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
• Target assigns every customer an ID number, tied to their credit card (or name, or email)
  • Also gathers any additional information
• A combination of lotions and multivitamins was a strong predictor for the early stages of pregnancy
• Target figured out that a girl was pregnant before her father did
  • Sent her flyers with baby-related merchandise
  • Father was furious
  • After a pregnancy test, they found out that the girl was indeed pregnant
  • More impressive: Target was able to estimate the due date somewhat accurately
Success Story 2
• Google Translate
  • Uses large-scale “dirty” data instead of hoping for high-quality annotated data
  • Many training instances found “in the wild”
  • Lets the data guide the machine translation instead of using very complicated models
Alon Halevy et al. The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 2009
But I don’t have that much data! Why should I care?
• Even with small/medium data one can benefit by borrowing ideas
  • Speed up algorithms
  • Memory efficiency
Roadmap
• Matlab is great
Matlab is great!
• Powerful tool
• Great implementations of matrix algorithms
  • Eigen-decomposition
  • Singular Value Decomposition
  • Basic matrix operations
• Vector-based operations
• Instant “debugging” by plotting
• All of the above make it a great prototyping tool for math-intensive data analysis
Data representation matters
• Original data size is often deceptive
• Data that we analyze ends up being much smaller in terms of
  • Storage necessary
  • Number of observations
• Need to represent data carefully
Sparse vs. dense storage
Liquid-Chromatography Mass-Spectrometry (LC-MS) measurements are usually treated as two-way arrays, i.e., samples by peaks. The original raw data is a three-way array, and we can explore its underlying structure by taking advantage of sparsity.
• Raw data: 29 mixtures × 18989 m/z values × 1054 retention times, 1.55% dense
  • Sparse storage: ~275 MB
  • Dense storage: ~4.4 GB
• Usually converted into a set of peaks (mixtures × peaks); each peak is a (m/z, retention time) pair
• Note that this is a very small data set with only 29 samples!
Slide borrowed from Evrim Acar
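The storage figures for the raw LC-MS array can be checked with a little arithmetic. The sketch below is a rough estimate, not the original computation: it assumes 8-byte double-precision values and a coordinate-style sparse layout that keeps three 8-byte indices plus one value per nonzero entry.

```python
# Dimensions and density taken from the slide.
n_mixtures, n_mz, n_rt = 29, 18989, 1054
density = 0.0155  # 1.55% of the entries are nonzero

# Assumption: 8-byte values and 8-byte indices.
BYTES_PER_VALUE = 8
BYTES_PER_INDEX = 8

n_entries = n_mixtures * n_mz * n_rt
dense_bytes = n_entries * BYTES_PER_VALUE

# Coordinate storage: three indices plus one value per nonzero entry.
n_nonzero = round(n_entries * density)
sparse_bytes = n_nonzero * (3 * BYTES_PER_INDEX + BYTES_PER_VALUE)

MIB, GIB = 2**20, 2**30
print(f"entries: {n_entries:,}")
print(f"dense:   {dense_bytes / GIB:.2f} GiB")
print(f"sparse:  {sparse_bytes / MIB:.0f} MiB")
```

Under these assumptions the dense array needs a few gigabytes and the sparse one a few hundred megabytes, in line with the ~4.4 GB and ~275 MB quoted above.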
Tensor Toolbox
• Matlab toolbox for tensor computations
• Support for sparse tensor storage and computation
  • Matlab does not natively support sparse tensors
  • Careful implementation of sparse computations for efficiency [1]
• Available at http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.5.html
[1] Bader & Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM Journal on Scientific Computing, 2007
Matlab Parallel Computing Toolbox
• Support for parallel computations
• Provides “parallel for” (parfor)
  • Shared-memory parallel execution
  • Loop iterations have to be independent
  • Need to write them carefully…
  • …but it pays off!
• Can run code on multiple cores/CPUs or even clusters
• Can run random restarts of an algorithm in parallel
• Later today: example of using the above with sampling for fast PARAFAC
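The "independent iterations" requirement and the random-restart use case can be illustrated outside Matlab. The sketch below is a Python analogue, not parfor itself. The objective function, the local search, and all names are hypothetical. It uses a thread pool to keep the example short; in CPython, CPU-bound work would need a process pool to actually run in parallel.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(x):
    # Hypothetical non-convex 1-D objective with more than one local minimum.
    return (x * x - 4) ** 2 + x

def one_restart(seed):
    # Each restart owns its random generator, so iterations share no state.
    rng = random.Random(seed)
    x = rng.uniform(-3.0, 3.0)
    step = 0.5
    for _ in range(200):
        candidate = x + rng.uniform(-step, step)
        if objective(candidate) < objective(x):
            x = candidate
    return objective(x), x

def best_of_restarts(seeds, workers=4):
    # Independent restarts can be handed to workers in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_restart, seeds))
    return min(results), results

best, all_results = best_of_restarts(range(8))
print("best objective %.4f at x = %.4f" % best)
```

Because every restart depends only on its own seed, the parallel run returns exactly what a sequential loop over the same seeds would.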
Roadmap
• Map/Reduce
Map/Reduce Motivation
• Developed by Google
  • Many terabytes of crawled webpages (mainly text)
  • Need to create an inverted index
    • For each word, find how many documents contain it
    • Useful for web search
  • Many “cheap”/commodity machines at their disposal
    • Faulty and inefficient as individual units
    • Potentially very powerful if combined
The Map/Reduce Framework
• Map/Reduce:
  • Provides a distributed file system (GFS, the Google File System) where files are stored in the cloud
  • Sees everything as <key, value> pairs
  • Provides a Map() function
    • Emits <key, value> records; the system gathers records with the same key on one worker machine
  • Provides a Reduce() function
    • Tells the system how to combine the values of all records with the same key
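The Map()/Reduce() contract can be sketched on a single machine. The example below is an illustrative simulation, not Hadoop or Google code: `run_map_reduce`, `map_fn`, `reduce_fn`, and the three tiny documents are all made up. It computes, for each word, how many documents contain it.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit <word, 1> once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, 1

def reduce_fn(word, values):
    # Combine all values that share the same key.
    return word, sum(values)

def run_map_reduce(records, map_fn, reduce_fn):
    # "Shuffle" step: group emitted values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical input: <document id, text> pairs.
docs = [
    ("d1", "big data in the arctic"),
    ("d2", "big tensors big graphs"),
    ("d3", "arctic data"),
]
doc_freq = run_map_reduce(docs, map_fn, reduce_fn)
print(doc_freq)
```

The programmer writes only `map_fn` and `reduce_fn`; grouping by key is the framework's job, which is what lets it spread the work over many machines.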
The Map/Reduce Framework
• Abstracts the computation into a Map() & Reduce() pair
• Can have chains of Map/Reduce operations
  • Most non-elementary algorithms need more than one Map/Reduce operation!
• The programmer does not need to know details about the cluster
Apache Hadoop
• Open-source M/R implementation by Apache
• Provides HDFS (Hadoop Distributed File System)
• Mostly programmed in Java & Python
Hadoop’s inner workings by example
Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
Map function
Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
Reduce function
Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
Putting it all together
Image from: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
Matrix Multiplication in Hadoop
• Slightly more complicated example
• Have two matrices A (m×n) and B (n×p), stored (in a single file) with one entry per line (matrix name, row index, column index, value):
A 0 0 5
A 0 1 4
A 1 1 2
B 0 0 7
B 0 1 1
…
• How can we multiply them on Hadoop?
Map()
http://importantfish.com/one-step-matrix-multiplication-with-hadoop/
Key ideas:
• Mapper has to emit <k,v> pairs with (i, k) as the key
• (i, k) is a single element of A*B
• Inner index j is fixed
• Iterates over k = 0..p−1 (for entries of A) and over i = 0..m−1 (for entries of B)
Reduce()
http://importantfish.com/one-step-matrix-multiplication-with-hadoop/
Key ideas:
• Each reducer works on one element (i, k) of A*B
• Collects all a_ij and b_jk where i and k are fixed
• Calculates the sum of products for the (i, k)-th element
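The Map() and Reduce() key ideas for one-step matrix multiplication can be put together in a small single-machine simulation. This is a sketch under stated assumptions, not the code from the linked post: the function names and the 2×2 input matrices are made up, and the shuffle step is emulated with a dictionary.

```python
from collections import defaultdict

def map_fn(record, m, p):
    # record is (matrix name, row, column, value).
    name, row, col, value = record
    if name == "A":
        i, j = row, col
        for k in range(p):        # a_ij contributes to every element (i, k)
            yield (i, k), ("A", j, value)
    else:
        j, k = row, col
        for i in range(m):        # b_jk contributes to every element (i, k)
            yield (i, k), ("B", j, value)

def reduce_fn(key, values):
    # Pair a_ij with b_jk on the shared inner index j, then sum the products.
    a = {j: v for name, j, v in values if name == "A"}
    b = {j: v for name, j, v in values if name == "B"}
    return key, sum(a[j] * b[j] for j in a if j in b)

def multiply(records, m, p):
    # Emulated shuffle: group mapper output by key (i, k).
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record, m, p):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Hypothetical inputs: A = [[1, 2], [0, 3]], B = [[4, 5], [6, 0]],
# stored sparsely (zero entries are simply absent).
records = [
    ("A", 0, 0, 1), ("A", 0, 1, 2), ("A", 1, 1, 3),
    ("B", 0, 0, 4), ("B", 0, 1, 5), ("B", 1, 0, 6),
]
product = multiply(records, m=2, p=2)
print(product)
```

Note that each input entry is replicated p or m times by the mapper, which is the price paid for finishing the product in a single Map/Reduce pass.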
Why should I care?
• Easy to program
  • No need to be a C++/MPI expert!
  • No parallel programming knowledge needed!
• Portable
  • Anything you write runs on any Hadoop cluster
• Scalable
  • You can run your code on 1 or 1000 machines without changes!
• Fault tolerant
  • Even when cluster nodes fail, the job finishes
Shortcomings
• If data fits in memory, could be much slower than in-memory approaches!
• Iterative algorithms
  • At the end of every iteration M/R has to write things to HDFS (disk)
  • At the beginning of every iteration M/R has to read things from HDFS
  • Slows down iterative algorithms!
  • Ways around it:
    • HaLoop https://code.google.com/p/haloop/
    • Twister http://www.iterativemapreduce.org/
Applications
• Graph Mining
  • Pegasus
  • HEigen
• Machine Learning
  • Mahout
• Tensor Analysis
  • GigaTensor
Graph Mining
• We are given a graph, e.g. who-talks-to-whom
• In Graph Mining we are interested in
  • Finding regular patterns in the graph
    • Degree distribution of nodes
    • Graph diameter
    • # connected components
    • # triangles, clustering coefficient
    • PageRank
  • Finding anomalies
    • Nodes that are “special”
    • Potential spammers/fraudsters in our example
Pegasus
• Many graph mining tasks can be reduced to a “generalized matrix-vector product”
  • Generalized:
    • Relax multiply to combine
    • Relax sum to aggregate
  • Different choices for combine & aggregate give us different graph features
    • PageRank: combine=multiply, aggregate=sum
    • Connected components: combine=multiply, aggregate=min
• Pegasus:
  • Introduces the above abstraction
  • Provides an efficient & scalable Hadoop implementation
  • Project page: http://www.cs.cmu.edu/~pegasus/
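The generalized matrix-vector product can be sketched in a few lines. The example below is an in-memory illustration, not the Pegasus implementation: the function names and the small graph are made up, and the final step that keeps the smaller of the old and new label is an assumption added here so that labels never increase.

```python
def generalized_matvec(edges, vector, combine, aggregate):
    # edges: list of (i, j, m_ij); returns the new vector as a dict.
    partial = {}
    for i, j, m_ij in edges:
        partial.setdefault(i, []).append(combine(m_ij, vector[j]))
    return {i: aggregate(values) for i, values in partial.items()}

def connected_components(n_nodes, undirected_edges):
    # Store each undirected edge in both directions with weight 1.
    edges = [(i, j, 1) for i, j in undirected_edges]
    edges += [(j, i, 1) for i, j in undirected_edges]
    labels = {v: v for v in range(n_nodes)}   # start: own id as label
    while True:
        # combine = multiply, aggregate = min, as on the slide.
        proposed = generalized_matvec(edges, labels, lambda m, v: m * v, min)
        # Assumption: keep the smaller of the old and the proposed label.
        updated = {v: min(labels[v], proposed.get(v, labels[v])) for v in labels}
        if updated == labels:
            return labels
        labels = updated

# Hypothetical graph: components {0, 1, 2}, {3, 4}, and the isolated node 5.
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)
```

Swapping in `combine = multiply` and `aggregate = sum` turns the same loop body into an ordinary matrix-vector product, the core step of PageRank.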
Pegasus
1.4M nodes, 6.3M edges
Triangle Counting
• Triangle: a set of three nodes connected to each other
  • E.g. two people get introduced by a mutual friend at a party, completing a triangle in the social network
• Triangle counts:
  • An unusual number of triangles among nodes can indicate fraudsters/spammers
• Direct relation between #triangles and the eigenvalue decomposition of the adjacency matrix of the graph
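The relation to the eigenvalue decomposition comes from a standard identity: for an undirected graph with adjacency matrix A and eigenvalues λ_i, the number of triangles is trace(A³)/6, and trace(A³) equals the sum of λ_i³. The sketch below checks the trace form against brute-force counting on a small made-up graph. It uses plain Python lists and does not compute eigenvalues.

```python
from itertools import combinations

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_via_trace(A):
    # trace(A^3) counts closed walks of length 3; each triangle is counted
    # 6 times (3 starting nodes x 2 directions).
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 6

def triangles_brute_force(A):
    n = len(A)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if A[i][j] and A[j][k] and A[i][k])

# Hypothetical 5-node undirected graph with triangles {0,1,2} and {2,3,4}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
A = [[0] * 5 for _ in range(5)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

print(triangles_via_trace(A), triangles_brute_force(A))
```

On large graphs the appeal of the eigenvalue form is that a few of the largest-magnitude eigenvalues often approximate the sum well, so a full cube of A is never needed.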
HEigen – Eigenvalue Decomposition
• Scalable tool for computing the eigenvalue decomposition
• Uses the Lanczos algorithm with Selective Orthogonalization
• Uses selective parallelization to choose which subtasks to parallelize
  • Frobenius norm & small intermediate eigendecompositions are run locally
U Kang et al, Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation, PAKDD 2011
HEigen
U Kang et al, Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation, PAKDD 2011
Machine Learning – Mahout
• Apache’s Hadoop Machine Learning Toolbox
  • Matrix Factorization (SVD, NMF)
  • K-means clustering
  • Topic Modeling (LDA)
  • Logistic Regression
  • Naïve Bayes Classification
  • Many more
• Download at: https://mahout.apache.org/
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
Motivation
• Suppose we have Knowledge Base data
  • E.g. Read the Web Project / Never Ending Language Learner (NELL) at CMU
  • Subject – verb – object triplets, mined from the web
  • Many gigabytes of data!
• How do we find potential new synonyms to a word using this knowledge base?
• Working Problem: NELL dataset: 24M subjects, 24M objects, 46M verbs
CP/PARAFAC decomposition
• Decompose X into a sum of rank-one tensors
$$X \approx a_1 \circ b_1 \circ c_1 + \dots + a_F \circ b_F \circ c_F$$
Objective function:
$$\min_{A,B,C} \left\| X - \sum_{f=1}^{F} a_f \circ b_f \circ c_f \right\|_F^2$$
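A sum of rank-one tensors is easy to build by hand for small sizes. The sketch below is illustrative only: the factor vectors, the sizes, and the function names are made up, and the fit measure is assumed to be the squared Frobenius norm of the difference. Each factor is a list of F vectors a_f, b_f, c_f.

```python
def outer3(a, b, c):
    # Rank-one tensor: entry (i, j, k) is a[i] * b[j] * c[k].
    return [[[ai * bj * ck for ck in c] for bj in b] for ai in a]

def cp_reconstruct(A, B, C):
    # A, B, C are lists of F factor vectors (a_f, b_f, c_f).
    I, J, K = len(A[0]), len(B[0]), len(C[0])
    X = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for a, b, c in zip(A, B, C):
        R = outer3(a, b, c)
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    X[i][j][k] += R[i][j][k]
    return X

def squared_error(X, Y):
    # Squared Frobenius norm of X - Y.
    return sum((x - y) ** 2
               for Xi, Yi in zip(X, Y)
               for Xij, Yij in zip(Xi, Yi)
               for x, y in zip(Xij, Yij))

# Hypothetical rank-2 example (F = 2) producing a 2 x 3 x 2 tensor.
A = [[1.0, 2.0], [0.0, 1.0]]
B = [[1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
C = [[1.0, 1.0], [3.0, 0.0]]
X = cp_reconstruct(A, B, C)
print(X[1][0][0], squared_error(X, cp_reconstruct(A, B, C)))
```

A decomposition algorithm works in the opposite direction: given X, it searches for factors that make `squared_error` small.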
ALS algorithm for CP/PARAFAC
• Objective function is non-convex!
• Linear in each of the variables when the other two are fixed (each subproblem is a linear least-squares problem)
• Most popular approach: Alternating Least Squares (ALS)
  • Fix B, C and optimize for A
  • Fix A, C and optimize for B
  • Fix A, B and optimize for C
• Block coordinate descent algorithm
• Monotone convergence to a local optimum
ALS Zoom-In: Intermediate Data Explosion
• Unfold/matricize X into X_(1)
• CP/PARAFAC property:
$$X_{(1)} \approx A (C \odot B)^T$$
• Khatri-Rao product (column-wise Kronecker product), of size JK × F:
$$(C \odot B) = [\, C(:,1) \otimes B(:,1) \;\; \dots \;\; C(:,F) \otimes B(:,F) \,]$$
• (C ⊙ B) can be very large
• Materializing it is a showstopper!
• Intermediate Data Explosion
• Same issues for B and C!
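The size problem is easy to see once the Khatri-Rao product is written out. The sketch below is a plain-Python illustration with made-up inputs. The dimensions in the size estimate are hypothetical and are not the NELL figures; 8-byte values are assumed.

```python
def khatri_rao(C, B):
    # C is K x F and B is J x F (lists of rows); the result is (J*K) x F.
    # Column f of the result is the Kronecker product of C[:, f] and B[:, f].
    K, J, F = len(C), len(B), len(C[0])
    return [[C[k][f] * B[j][f] for f in range(F)]
            for k in range(K) for j in range(J)]

C = [[1, 2], [3, 4]]            # K = 2, F = 2
B = [[1, 0], [0, 1], [2, 2]]    # J = 3, F = 2
P = khatri_rao(C, B)
print(len(P), len(P[0]))        # number of rows (J*K) and columns (F)

def dense_size_bytes(J, K, F, bytes_per_value=8):
    # Storage needed if the JK x F product were materialized densely.
    return J * K * F * bytes_per_value

# Hypothetical mode sizes of one million each, with F = 10 components.
print(dense_size_bytes(J=10**6, K=10**6, F=10) / 1e12, "TB")
```

The row count is the product of two mode sizes, so even sparse inputs with millions of rows lead to an intermediate matrix far larger than the data itself.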
Main Idea
• Avoiding Intermediate Data Explosion
• Size of intermediate data (NELL):
  • Naïve (before): 100 PB
  • Proposed (after): 1.5 GB
Results
• GigaTensor solved 100x larger problems than the current state of the art
Figure labels: GigaTensor, Out of Memory, 100x
BREAK
Roadmap
• Other Distributed Approaches
![Page 48: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/48.jpg)
Other Distributed Approaches
• Map/Reduce has certain flaws
• What if we incorporate knowledge about the problem in the computational model?
• Three approaches (with Graph flavor)
  - GraphLab
  - Pregel
  - GraphChi
![Page 49: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/49.jpg)
GraphLab
• Map/Reduce is perfect for embarrassingly data-parallel computations
  - WordCount is a good example
  - No data dependencies
• In ML applications there usually are
  - Data dependencies
  - Iterative algorithms
• GraphLab
  - Expresses data dependencies as a Graph
  - Performs computations distributed on that Graph
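To make "embarrassingly data-parallel" concrete, here is a minimal single-process sketch of WordCount in the map/reduce style. The function names `map_phase` and `reduce_phase` and the two sample documents are illustrative assumptions, not part of any Map/Reduce framework API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair per word; each document is handled independently."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Sum the counts of each word across all mapper outputs."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Hypothetical input: two tiny "documents".
documents = ["big arctic data", "big data big tensors"]

# The map calls could run in parallel: no document depends on another.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(mapped)
```

Because no map call needs the output of another, the work splits cleanly across machines. Iterative ML algorithms usually lack this property, which is the gap GraphLab targets.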
![Page 50: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/50.jpg)
GraphLab
High-level idea
• Update
  - Analogous to Map()
  - Unlike Map(), can also be applied to overlapping pieces of the problem
• Sync
  - Analogous to Reduce()
  - Also applies to overlapping parts of the problem
Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
![Page 51: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/51.jpg)
GraphLab Applications
• Not restricted to Graph computations
• Can express many problems in this way:
  - Least squares regression
  - Lasso regression
  - Matrix Factorization
• Active community (software package and annual conference) http://graphlab.com/index.html
![Page 52: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/52.jpg)
Pregel
• Google’s response to Graph Computations
• Vertex-centric computations
  A vertex can:
  - Receive messages
  - Send messages to other vertices
  - Modify its state
  - Modify the Graph topology
• Can express algorithms such as PageRank or Shortest Paths this way
• Very scalable (runs on Google’s various Graphs)
• Easy to program (15 lines of code for PageRank)
• Internal to Google
Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing, SIGMOD’10
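The vertex-centric model can be imitated on one machine to show why PageRank fits it so naturally. The sketch below is a toy simulation, not Pregel's API (which is internal to Google). The function name `pagerank_supersteps`, the damping factor 0.85 and the three-vertex graph are assumptions made for illustration.

```python
def pagerank_supersteps(out_edges, num_supersteps=30, damping=0.85):
    """Toy single-process imitation of vertex-centric PageRank.

    Assumes every vertex appears as a key of out_edges and has at least
    one outgoing edge (no dangling vertices).
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}  # per-vertex state
    for _ in range(num_supersteps):
        inbox = {v: [] for v in out_edges}
        # Each vertex sends rank / out_degree along its outgoing edges.
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t].append(rank[v] / len(targets))
        # Each vertex updates its own state from the messages it received.
        rank = {v: (1 - damping) / n + damping * sum(inbox[v])
                for v in out_edges}
    return rank

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(graph)
```

Each loop iteration plays the role of one superstep: messages sent in one step are only visible in the next, so vertices never read each other's state directly.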
![Page 53: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/53.jpg)
GraphChi
• GraphLab & Pregel run on clusters
• What about a single machine?
• GraphChi
  - Single machine
  - Disk-based storage (local)
  - Breaks large graph into small parts
  - Uses parallel sliding windows to process parts
• Performance comparable to distributed approaches!
Aapo Kyrola et al. GraphChi: Large-Scale Graph Computation on Just a PC, USENIX’12
![Page 54: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/54.jpg)
Roadmap
• Databases
![Page 55: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/55.jpg)
Databases
• (Relational) Database Systems
  - Store data in “Relations” (tables)
  - Issue queries on the data
  - Typically SQL (Structured Query Language)
e.g. table STUDENT with entries (name, student_id, gpa)
Find all students with gpa >= 3.5:
SELECT * FROM STUDENT WHERE gpa >= 3.5;
![Page 56: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/56.jpg)
Example
STUDENT

| Name | Student_id | GPA |
|---|---|---|
| Rasmus | 1 | 4 |
| Evrim | 2 | 4 |
| Vagelis | 3 | 3 |

SELECT * FROM STUDENT WHERE gpa >= 3.5;

Result:

| Name | Student_id | GPA |
|---|---|---|
| Rasmus | 1 | 4 |
| Evrim | 2 | 4 |
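The example can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch: the choice of SQLite, the in-memory database, the column types and the `ORDER BY` clause (added only to make the row order deterministic) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (name TEXT, student_id INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?)",
    [("Rasmus", 1, 4.0), ("Evrim", 2, 4.0), ("Vagelis", 3, 3.0)],
)

# Same query as on the slide, with a fixed row order.
rows = conn.execute(
    "SELECT * FROM STUDENT WHERE gpa >= 3.5 ORDER BY student_id"
).fetchall()
conn.close()
```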
![Page 57: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/57.jpg)
Joins of two tables
• We have two tables:
  - STUDENT(name, student_id, gpa)
  - TAKES_CLASS(student_id, class_name)
• We can ask: What do students with gpa >= 3.5 take?
  SELECT DISTINCT class_name FROM STUDENT JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id WHERE STUDENT.gpa >= 3.5;
![Page 58: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/58.jpg)
Example
STUDENT

| Name | Student_id | GPA |
|---|---|---|
| Rasmus | 1 | 4 |
| Evrim | 2 | 4 |
| Vagelis | 3 | 3 |

TAKES_CLASS

| Class_name | Student_id |
|---|---|
| Chemometrics 101 | 1 |
| Databases 201 | 1 |
| Chemometrics 101 | 2 |
| Chemometrics 101 | 3 |

SELECT DISTINCT class_name FROM STUDENT JOIN TAKES_CLASS ON STUDENT.student_id = TAKES_CLASS.student_id WHERE STUDENT.gpa >= 3.5;

Result:

| Class_name |
|---|
| Chemometrics 101 |
| Databases 201 |
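The join can be checked the same way with `sqlite3`. In this sketch the query is written with the standard `DISTINCT` keyword. The in-memory database, the column types and the `ORDER BY` clause are assumptions added for a deterministic, runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (name TEXT, student_id INTEGER, gpa REAL);
    CREATE TABLE TAKES_CLASS (class_name TEXT, student_id INTEGER);
    INSERT INTO STUDENT VALUES ('Rasmus', 1, 4), ('Evrim', 2, 4), ('Vagelis', 3, 3);
    INSERT INTO TAKES_CLASS VALUES
        ('Chemometrics 101', 1), ('Databases 201', 1),
        ('Chemometrics 101', 2), ('Chemometrics 101', 3);
""")

# Classes taken by at least one student with gpa >= 3.5, each listed once.
classes = [row[0] for row in conn.execute("""
    SELECT DISTINCT class_name
    FROM STUDENT JOIN TAKES_CLASS
      ON STUDENT.student_id = TAKES_CLASS.student_id
    WHERE STUDENT.gpa >= 3.5
    ORDER BY class_name
""")]
conn.close()
```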
![Page 59: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/59.jpg)
Databases
• That’s all nice…
• But, why would we want to use it?
![Page 60: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/60.jpg)
Matrix operations in DBMS
• Say that we have two matrices A, B
• Store them in a DB as
  - A(row, col, value)
  - B(row, col, value)
• Then
  SELECT A.row, B.col, SUM(A.value*B.value) FROM A JOIN B ON A.col = B.row GROUP BY A.row, B.col;
• Gives us A*B!
• http://stackoverflow.com/questions/6582191/sql-query-for-multiplication
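A small worked instance of the query, again using `sqlite3`. To stay clear of SQL keywords, this sketch renames the slide's columns (row, col, value) to `i`, `j`, `val`. The helper `to_triples` and the two 2x2 matrices are illustrative assumptions.

```python
import sqlite3

def to_triples(matrix):
    """Convert a dense list-of-lists matrix to sparse (i, j, val) triples."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]

conn = sqlite3.connect(":memory:")
for name, matrix in (("A", [[1, 2], [3, 4]]), ("B", [[5, 6], [7, 8]])):
    conn.execute(f"CREATE TABLE {name} (i INTEGER, j INTEGER, val REAL)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", to_triples(matrix))

# Join on the shared inner index, then sum the products per output cell.
product = conn.execute("""
    SELECT A.i, B.j, SUM(A.val * B.val)
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
conn.close()
```

Only non-zero entries are stored, so for sparse matrices the join touches far fewer rows than a dense triple loop would.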
![Page 61: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/61.jpg)
What else can we do?
• We can find eigenvectors of a matrix A
• Simply do Power Iteration
  - Start with random x
  - Do x(i) = A*x(i-1), rescaling x to unit norm, until x converges (to the dominant eigenvector)
• Series of Matrix-Vector multiplications
• SQL can do that
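Power iteration with the matrix-vector product done in SQL might look like the following sketch. The table layout `A(i, j, val)` and `x(i, val)`, the fixed start vector (the slide starts from a random x) and the fixed count of 50 iterations in place of a convergence test are all assumptions for illustration. The rescaling to unit norm keeps the iterate from growing without bound.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
conn.execute("CREATE TABLE x (i INTEGER PRIMARY KEY, val REAL)")
# Symmetric 2x2 example [[2, 1], [1, 2]] with eigenvalues 3 and 1.
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 2.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 2.0)])
# Fixed start vector so the run is reproducible.
conn.executemany("INSERT INTO x VALUES (?, ?)", [(0, 1.0), (1, 0.0)])

for _ in range(50):
    # One matrix-vector product y = A * x as a join plus GROUP BY.
    y = conn.execute("""
        SELECT A.i, SUM(A.val * x.val)
        FROM A JOIN x ON A.j = x.i
        GROUP BY A.i
    """).fetchall()
    norm = math.sqrt(sum(v * v for _, v in y))
    # Write the rescaled vector back as the next iterate.
    conn.execute("DELETE FROM x")
    conn.executemany("INSERT INTO x VALUES (?, ?)",
                     [(i, v / norm) for i, v in y])

eigenvector = [v for _, v in conn.execute("SELECT i, val FROM x ORDER BY i")]
conn.close()
```

For this matrix the iterate approaches (1, 1)/sqrt(2), the eigenvector of the largest eigenvalue.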
![Page 62: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/62.jpg)
Why should I care?
• Re-usable
  - Write a library of queries, use it at will
• Portable
  - SQL is a standard, so any DBMS supports basic SQL operations
• Scalable
  - DBMS are the industrial workhorses
  - Optimized for efficiency & speed
![Page 63: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/63.jpg)
NoSQL
• Traditional RDBMS implement features like
  - Concurrency control
  - Data integrity
• which are necessary when doing DB transactions
  - e.g. a bank DB needs to make sure that all transactions are either committed or rolled back
  - Data should be consistent
• Not really necessary for Data Analysis
  - Data is usually immutable
![Page 64: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/64.jpg)
NoSQL
• Drop the concurrency control
• Drop data integrity constraints
• What’s left is NoSQL systems
• NoSQL sometimes means “Not only SQL”
  - Some NoSQL systems support SQL-like queries
  - Others have their own language
![Page 65: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/65.jpg)
SciDB
• Data Management and Analysis System
• Minimal support for transactions
• Data is stored as vectors
• Provides a high-level front-end
  - Currently in R
  - Soon in Python, Matlab etc
• All computation & storage takes place in Database server
![Page 66: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/66.jpg)
Roadmap
• Sampling
![Page 67: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/67.jpg)
Sampling
• Very powerful technique
• Reduces data size
• If done carefully, preserves data characteristics
• Can speed up and scale up computations at a small cost
• Today:
  - CUR decomposition
  - TensorCUR
  - ParCube
![Page 68: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/68.jpg)
Analysis using SVD
SVD: $A \approx U \Sigma V^T$
[Figure: A is users × products, U is users × latent groups, V^T is latent groups × products]
• Sometimes, hard to interpret cols of U, V
  - Might not directly correspond to something in the data
• (Alternative) CUR decomposition:
  - Instead of latent approximation, use actual cols & rows of A
![Page 69: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/69.jpg)
CUR Decomposition
$A \approx C U R$
• C contains cols of A sampled at random
• R contains rows of A sampled at random
• U = pinv(C)*A*pinv(R)
• If A is sparse then C, R are sparse too!
  - Not true for SVD
Mahoney et al. CUR matrix decompositions for improved data analysis, PNAS 2009
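One way to read the formula for U: once the sampled columns C and rows R are fixed, pinv(C)*A*pinv(R) is a least-squares fit of the middle factor. In the note below, $C^{+}$ and $R^{+}$ denote Moore-Penrose pseudoinverses (the slide's pinv), and the sizes $m, n, c, r$ are notation introduced here, not taken from the slide.

$$
U = C^{+} A R^{+} \in \arg\min_{X \in \mathbb{R}^{c \times r}} \lVert A - C X R \rVert_F,
\qquad A \in \mathbb{R}^{m \times n},\; C \in \mathbb{R}^{m \times c},\; R \in \mathbb{R}^{r \times n}.
$$

So the sampling decides which actual columns and rows are kept, and U is then the best "glue" between them in the Frobenius norm.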
![Page 70: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/70.jpg)
CUR discussion
• Good for cases when we can’t interpret latent dimensions
• Directly interpretable factors
• Retains sparsity on factors
![Page 71: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/71.jpg)
Tensor CUR
• Extension of the CUR decomposition to tensors
• Assumes that third mode is “special”, e.g. time
• Approximates tensor as
  - A is n1 x n2 x n3
  - C is n1 x n2 x c (where c is small)
Mahoney et al. Tensor-CUR decompositions for tensor-based data, SIAM J. Matrix Anal. Appl. 2008
![Page 72: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/72.jpg)
Speeding up and parallelizing tensor decompositions
• Given a large tensor or matrix-tensor couple
• How can we decompose them on a single machine (possibly multi-core)?
• Idea: Use sampling and parallelization:
  - ParCube (ECML-PKDD 2012): Approximate, Parallel PARAFAC
  - Turbo-SMT (SIAM SDM 2014): Approximate, Parallel Coupled Matrix-Tensor Factorization
![Page 73: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/73.jpg)
PARCUBE: The big picture
[Figure: PARCUBE pipeline diagram]
Break up tensor into small pieces using sampling
Fit dense PARAFAC decomposition on small sampled tensors
Match columns and distribute non-zero values to appropriate indices in original (non-sampled) space
• Sampling selects small portion of indices
• PARAFAC vectors a_i, b_i, c_i will be sparse by construction
![Page 74: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/74.jpg)
Putting the pieces together
• Say we have matrices As from each sample
• Possibly have re-ordering of factors
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the “upper” part (by construction)
Proposition: Under mild conditions, the algorithm will stitch components correctly & output what exact PARAFAC would
![Page 75: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/75.jpg)
Up to 200x speedup
![Page 76: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/76.jpg)
Neurosemantics
• Brain Scan Data*
• 9 persons in fMRI machine
• Presented with 60 concrete nouns
• 7s pause between nouns to ‘neutralize’ activity
[Figure: brain scans per noun (e.g. airplane, dog); axis labels: nouns, voxels, questions. These images don’t correspond to the right words!]
*Mitchell et al. Predicting human brain activity associated with the meanings of nouns. Science, 2008
Data available @ http://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html
![Page 77: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/77.jpg)
Neurosemantics Results
![Page 78: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/78.jpg)
Roadmap
• Streaming & Sketching
![Page 79: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/79.jpg)
Problem 1
• You are given a series of N numbers
• N is much larger than anything you can store
• You see this series of numbers only once
• Suppose you can store M numbers
• How can you sample M of those numbers uniformly at random?
![Page 80: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/80.jpg)
Data Streams
• The previous problem is a Data Stream problem
• We are going to see:
  - Sketching
  - Streaming Algorithms
• Even without the streaming constraint, such algorithms offer useful insights!
  - Make algorithms faster
  - Make algorithms more space efficient
![Page 81: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/81.jpg)
Reservoir Sampling
• R stores the M numbers of our sample
• For the first M numbers that we see, we add them to R
• After R is full, we need to decide if we add a sample:
  - For the i-th number of the stream, say S[i]:
    Generate a random integer j uniformly in the range 1…i
    If j ≤ M then R[j] = S[i]
    Otherwise ignore S[i]
  - The probability of adding S[i] to R is M/i, so it decreases as the stream goes on
  - Can prove (by induction) that the sample is uniformly random.
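The procedure translates almost line by line into code. The following is a minimal sketch: the function name `reservoir_sample`, the optional `rng` argument (used here only to make the run reproducible) and the example stream are assumptions for illustration.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Keep a uniform random sample of m items from a stream seen only once."""
    rng = rng or random.Random()
    reservoir = []
    for i, value in enumerate(stream, start=1):  # i is the 1-based position
        if i <= m:
            reservoir.append(value)              # fill phase: keep the first m
        else:
            j = rng.randint(1, i)                # uniform integer in 1..i
            if j <= m:
                reservoir[j - 1] = value         # overwrite slot j
            # otherwise the item is ignored
    return reservoir

# Hypothetical stream: the numbers 1..10000, sample size 5.
sample = reservoir_sample(range(1, 10_001), m=5, rng=random.Random(0))
```

Memory use is M items no matter how long the stream is, and the stream length N never needs to be known in advance.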
![Page 82: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/82.jpg)
Problem 2
• We have a stream of N numbers
• Again, we can’t store them
• Say, we call them "vector a"
• How can we answer:
  - Point queries: e.g. give me a(i)
  - Dot products: given two big vectors a & b, what is $a^T b$?
![Page 83: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/83.jpg)
CountMin sketch preliminaries
• We have a d × w matrix C
• A set of d hash functions {1…N} → {1…w}
• Vector a is represented in an incremental fashion
  - At time t the state of the vector is $[a_1(t), a_2(t), \ldots, a_N(t)]$
  - We see updates of its coordinates over time, e.g. update $(i_t, c_t)$: $a_{i_t}(t) = a_{i_t}(t-1) + c_t$
![Page 84: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/84.jpg)
CountMin Sketch
• When we see update $(i_t, c_t)$
  - For j = 1…d update $C[j, h_j(i_t)] = C[j, h_j(i_t)] + c_t$
• See Graham Cormode, Count-Min Sketch, Springer Encyclopedia of Database Systems
![Page 85: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/85.jpg)
CountMin at work
• How can I estimate a(i)?
  - $a_{est}(i) = \min_j C[j, h_j(i)]$, for j = 1…d
  - Error guarantee: $a_{est}(i) \le a(i) + \varepsilon \lVert a \rVert_1$ where
• How can I estimate $a^T b$?
  - Treat each of $C_a$, $C_b$ as d vectors of dimension w (their rows)
  - $a^T b$ can be estimated as the minimum inner product between corresponding rows of $C_a$, $C_b$
  - With prob. 1-δ the estimate is at most $\varepsilon \lVert a \rVert_1 \lVert b \rVert_1$ more than the true value
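The update, point query and inner-product estimate fit in a short class. This is a sketch under stated assumptions: the class name `CountMinSketch`, the hash family h_j(i) = ((a·i + b) mod p) mod w with random a, b, the seed handling and the toy vectors are illustrative choices, and all updates are assumed non-negative (which is what makes every estimate an overestimate).

```python
import random

class CountMinSketch:
    PRIME = 2_147_483_647  # 2**31 - 1, assumed larger than any key used

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]  # the d x w matrix C
        # One (a, b) pair per row defines that row's hash function.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]

    def _bucket(self, row, key):
        a, b = self.hashes[row]
        return ((a * key + b) % self.PRIME) % self.width

    def update(self, key, amount=1):
        """Process update (key, amount): add amount to one cell per row."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, key)] += amount

    def point_query(self, key):
        """Estimate a(key) as the minimum over the d cells the key maps to."""
        return min(self.table[row][self._bucket(row, key)]
                   for row in range(self.depth))

    def inner_product(self, other):
        """Estimate a^T b; both sketches must share shape and hash functions."""
        return min(sum(x * y for x, y in zip(row_a, row_b))
                   for row_a, row_b in zip(self.table, other.table))

# Same seed for both sketches so they use identical hash functions.
sketch_a = CountMinSketch(depth=4, width=64, seed=1)
sketch_b = CountMinSketch(depth=4, width=64, seed=1)
true_a = {1: 5, 2: 3, 7: 10}
true_b = {2: 4, 7: 1, 9: 6}
for key, count in true_a.items():
    sketch_a.update(key, count)
for key, count in true_b.items():
    sketch_b.update(key, count)
```

Collisions can only add to a cell, so taking the minimum over rows picks the least-contaminated estimate. In the Count-Min literature the width is typically set from ε and the depth from δ.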
![Page 86: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/86.jpg)
Streams of Co-evolving Time-Series
[Figure 1: (a) Sensor measurements, (b) Hidden variables.] Illustration of problem. Sensors measure chlorine in drinking water and show a daily, near sinusoidal periodicity during phases 1 and 3. During phase 2, some of the sensors are “stuck” due to a major leak. The extra hidden variable introduced during phase 2 captures the presence of a new trend. SPIRIT can also tell us which sensors participate in the new, “abnormal” trend (e.g., close to a construction site). In phase 3, everything returns to normal.
• We are given n sensors
• We record their activity over time
• We would like to track their PCA as new measurements become available
![Page 87: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/87.jpg)
SPIRIT
• SPIRIT
  - Adapts number of principal components k
  - Adapts the loadings
  - Tracks the scores/hidden variables
  - Does all of the above efficiently
Papadimitriou et al. Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005
![Page 88: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/88.jpg)
Tensor Stream
Given: [Figure]
track its decomposition without re-computing
![Page 89: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/89.jpg)
Tensor Streams
• At least two approaches exist
  - Jimeng Sun et al. Incremental Tensor Analysis: Theory and Applications, ACM TKDD 2008 (amongst others, uses SPIRIT)
  - Nion & Sidiropoulos, Adaptive Algorithms to Track the PARAFAC Decomposition of a Third-Order Tensor, IEEE TSP 2009
![Page 90: Big Arctic Data - ku · 2014-04-17 · 29 mixtures 18989 m/z value LC‐MS LC‐MS 1054 retention time 1.55% dense Sparse storage ~ 275 MB ~ 4.4 GB Liquid‐Chromatography Mass‐Spectrometry](https://reader034.fdocuments.net/reader034/viewer/2022042203/5ea412b36b5a44242a7ea041/html5/thumbnails/90.jpg)
The End
Web: www.cs.cmu.edu/~epapalex
Code: www.cs.cmu.edu/~epapalex/code.html
email: [email protected]
Questions?