Graphical Models for the Internet (Part1)
-
Upload
jun-wang -
Category
Technology
-
view
370 -
download
0
Transcript of Graphical Models for the Internet (Part1)
![Page 1: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/1.jpg)
Graphical Models for the InternetAmr Ahmed & Alexander Smola
Yahoo! Research & Australian National University, Santa Clara, [email protected], [email protected]
![Page 2: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/2.jpg)
Thanks
JoeyGonzalez
YuchengLow
QirongHo
ShravanNarayanamurthy
VanjaJosifovski
Choon HuiTeo
EricXing
JamesPetterson
JakeEisenstein
Shuang HongYang
VishyVishwanathan
MarkusWeimer
AlexandrosKaratzoglou
MohamedAly
SergiyMatyusevich
![Page 3: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/3.jpg)
1. Data on the Internet
![Page 4: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/4.jpg)
• Tiny (2 cores)(512MB, 50MFlops, 1000 samples)
• Small (4 cores)(4GB, 10GFlops, 100k samples)
• Medium (16 cores)(32GB, 100GFlops, 1M samples)
• Large (256 cores)(512GB, 1TFlops, 100M samples)
• Massive... need to work hard get it to work
Size calibration
![Page 5: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/5.jpg)
Data• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme) >10B useful webpages
![Page 6: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/6.jpg)
The Web for $100k/month• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
• 10 billion pages(this is a small subset, maybe 10%)10k/page = 100TB ($10k for disks or EBS 1 month )
• 1000 machines10ms/page = 1 dayafford 1-10 MIP/page($20k on EC2 for 0.68$/h)
• 10 Gbit link($10k/month via ISP or EC2)
• 1 day for raw data• 300ms/page roundtrip• 1000 servers for 1 month
($70k on EC2 for 0.085$/h)
![Page 7: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/7.jpg)
Data - Identity & Graph
100M-1B vertices
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
![Page 8: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/8.jpg)
Data - User generated content
>1B images, 40h video/minute
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
![Page 9: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/9.jpg)
Data - User generated content
>1B images, 40h video/minute
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
crawl it
![Page 10: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/10.jpg)
>1B texts
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
Data - Messages
![Page 11: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/11.jpg)
>1B texts
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme) impossible without NDA
Data - Messages
![Page 12: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/12.jpg)
Data - User Tracking
alex.smola.org
>1B ‘identities’
• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
![Page 13: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/13.jpg)
Data - User Tracking• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)
![Page 14: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/14.jpg)
Personalization• 100-1000M users
• Spam filtering• Personalized targeting
& collaborative filtering• News recommendation• Advertising
• Large parameter space(25 parameters = 100GB)
• Distributed storage(need it on every server)
• Distributed optimization• Model synchronization• Time dependence• Graph structure
![Page 15: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/15.jpg)
• Ads
• Click feedback
• Emails
• Tags
• Editorial data is very expensive! Do not use!
• Graphs
• Document collections
• Email/IM/Discussions
• Query stream
(implicit) Labels no labels(typical case)
![Page 16: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/16.jpg)
Challenges• Scale
• Billions of instances (documents, clicks, users, ads)• Rich data structure (ontology, categories, tags)• Model does not fit into single machine
• Modeling• Plenty of unlabeled data, temporal structure, side information• User-understandable structure• Solve problem. Don’t simply apply clustering/LDA/PCA/ICA• We only cover building blocks
• Inference• 10k-100k clusters/discrete objects, 1M-100M unique tokens• Communication
![Page 17: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/17.jpg)
• Tools• Load distribution, balancing and synchronization• Clustering, Topic Models
• Models• Dynamic non-parametric models• Sequential latent variable models
• Inference Algorithms• Distributed batch• Sequential Monte Carlo
• Applications• User profiling• News content analysis & recommendation
Roadmap
![Page 18: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/18.jpg)
2. Basic Tools
![Page 19: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/19.jpg)
Systems
Algorithms run on MANY REAL and FAULTY boxesnot Turing machines. So we need to deal with it.
![Page 20: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/20.jpg)
• High Performance ComputingVery reliable, custom built, expensive
• Consumer hardwareCheap, efficient, easy to replicate,Not very reliable, deal with it!
Commodity Hardware
![Page 21: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/21.jpg)
Slide from talk of Jeff Deanhttp://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf
![Page 22: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/22.jpg)
Map Reduce• 1000s of (faulty) machines• Lots of jobs are mostly embarrassingly parallel
(except for a sorting/transpose phase)• Functional programming origins
• Map(key,value)processes each (key,value) pair and outputs a new (key,value) pair
• Reduce(key,value)reduces all instances with same key to aggregate
• Example - extremely naive wordcount• Map(docID, document)
for each document emit many (wordID, count) pairs• Reduce(wordID, count)
sum over all counts for given wordID and emit (wordID, aggregate)from Ramakrishnan, Sakrejda, Canon, DoE 2011
![Page 23: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/23.jpg)
Map Reduce• 1000s of (faulty) machines• Lots of jobs are mostly embarrassingly parallel
(except for a sorting/transpose phase)• Functional programming origins
• Map(key,value)processes each (key,value) pair and outputs a new (key,value) pair
• Reduce(key,value)reduces all instances with same key to aggregate
• Example - extremely naive wordcount• Map(docID, document)
for each document emit many (wordID, count) pairs• Reduce(wordID, count)
sum over all counts for given wordID and emit (wordID, aggregate)
![Page 24: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/24.jpg)
Map Reduce
Ghemawat & Dean, 2003
map(key,value) reduce(key,value)
easy fault tolerance (simply restart workers)
moves computation to data
disk based inter process communication
![Page 25: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/25.jpg)
Map Reduce• Map must be stateless in blocks• Reduce must be commutative in data• Fault tolerance
• Start jobs where the data is (nodes run the filesystem, too)• Restart machines if maps fail (have replicas)• Restart reducers based on intermediate data
• Good fit for many algorithms• Good if only a small number of MapReduce iterations needed• Need to request machines at each iteration (time consuming)• State lost in between maps• Communication only via file I/O• Need to wait for last reducer
![Page 26: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/26.jpg)
Map Reduce• Map must be stateless in blocks• Reduce must be commutative in data• Fault tolerance
• Start jobs where the data is (nodes run the filesystem, too)• Restart machines if maps fail (have replicas)• Restart reducers based on intermediate data
• Good fit for many algorithms• Good if only a small number of MapReduce iterations needed• Need to request machines at each iteration (time consuming)• State lost in between maps• Communication only via file I/O• Need to wait for last reducer
unsuitable for algorithms with many iterations
![Page 27: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/27.jpg)
Many alternatives• Dryad/LINQ
Microsoft - directed acyclic graphs• S4
Yahoo - streaming directed acyclic graphs• Pregel
Google - bulk synchronous processing• YARN
Use Hadoop scheduler directly• Mesos, Hadoop workalikes & patches
![Page 28: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/28.jpg)
Clustering
![Page 29: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/29.jpg)
ClusteringDensity Estimation
Clustering
x
x
θ
y
p(x, �) = p(�)nY
i=1
p(xi|�)
p(x, y, �) = p(⇥)KY
k=1
p(�k)nY
i=1
p(yi|⇥)p(xi|�, yi) θ
![Page 30: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/30.jpg)
ClusteringDensity Estimation
Clustering
x
x
θ
y
p(x, �) = p(�)nY
i=1
p(xi|�)
p(x, y, �) = p(⇥)KY
k=1
p(�k)nY
i=1
p(yi|⇥)p(xi|�, yi) θ
find θfind θ
![Page 31: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/31.jpg)
ClusteringDensity Estimation
Clustering
x
x
θ
y
p(x, �) = p(�)nY
i=1
p(xi|�)
p(x, y, �) = p(⇥)KY
k=1
p(�k)nY
i=1
p(yi|⇥)p(xi|�, yi) θ
find θfind θ
log-concave
general nonlinear
![Page 32: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/32.jpg)
Clustering• Optimization problem
• Options• Direct nonconvex optimization (e.g. BFGS)• Sampling (draw from the joint distribution)
for memory efficiency• Variational approximation
(concave lower bounds aka EM algorithm)
maximize
�
X
y
p(x, y, �)
maximize
�log p(⇥) +
KX
k=1
log p(�k) +nX
i=1
log
X
yi�Y[p(yi|⇥)p(xi|�, yi)]
![Page 33: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/33.jpg)
Clustering• Integrate out θ
• Y is coupled• Sampling• Collapsed p
x
y
θ• Integrate out y
• Nonconvex optimization problem
• EM algorithm
x
θ
x
Y
p(y|x) � p({x} | {xi : yi = y} ⇥Xfake)p(y|Y ⇥ Yfake)
![Page 34: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/34.jpg)
Clustering• Integrate out θ
• Y is coupled• Sampling• Collapsed p
x
y
θ• Integrate out y
• Nonconvex optimization problem
• EM algorithm
x
θ
x
Y
p(y|x) � p({x} | {xi : yi = y} ⇥Xfake)p(y|Y ⇥ Yfake)
![Page 35: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/35.jpg)
Clustering• Integrate out θ
• Y is coupled• Sampling• Collapsed p
x
y
θ• Integrate out y
• Nonconvex optimization problem
• EM algorithm
x
θ
x
Y
p(y|x) � p({x} | {xi : yi = y} ⇥Xfake)p(y|Y ⇥ Yfake)
![Page 36: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/36.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
![Page 37: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/37.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
(b,g) - draw p(.,g)
![Page 38: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/38.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
(b,g) - draw p(.,g)(g,g) - draw p(g,.)
![Page 39: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/39.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)
![Page 40: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/40.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)(b,g) - draw p(b,.)
![Page 41: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/41.jpg)
Gibbs sampling• Sampling:
Draw an instance x from distribution p(x)• Gibbs sampling:
• In most cases direct sampling not possible• Draw one set of variables at a time
0.45 0.050.05 0.45
(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)(b,g) - draw p(b,.)(b,b) ...
![Page 42: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/42.jpg)
Gibbs sampling for clustering
![Page 43: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/43.jpg)
Gibbs sampling for clustering
randominitialization
![Page 44: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/44.jpg)
Gibbs sampling for clustering
sample cluster labels
![Page 45: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/45.jpg)
Gibbs sampling for clustering
resamplecluster model
![Page 46: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/46.jpg)
Gibbs sampling for clustering
resamplecluster labels
![Page 47: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/47.jpg)
Gibbs sampling for clustering
resamplecluster model
![Page 48: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/48.jpg)
Gibbs sampling for clustering
resamplecluster labels
![Page 49: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/49.jpg)
Gibbs sampling for clustering
resamplecluster model e.g. Mahout Dirichlet Process Clustering
![Page 50: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/50.jpg)
Topic models
![Page 51: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/51.jpg)
Grouping objects
![Page 52: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/52.jpg)
Grouping objects
Singapore
![Page 53: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/53.jpg)
Grouping objects
![Page 54: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/54.jpg)
Grouping objects
![Page 55: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/55.jpg)
Grouping objects
airline
restaurant
university
![Page 56: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/56.jpg)
Grouping objects
Australia
Singapore
USA
![Page 57: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/57.jpg)
Topic Models
USA airline
Singaporeairline
Singapore food
USA food
Singapore university
Australia university
![Page 58: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/58.jpg)
Clustering & Topic ModelsClustering
?
group objectsby prototypes
![Page 59: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/59.jpg)
Clustering & Topic ModelsClustering
?
group objectsby prototypes
Topics
decompose objectsinto prototypes
![Page 60: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/60.jpg)
Clustering & Topic Models
x
y
θ
prior
cluster probability
cluster label
instance x
y
θ
prior
topic probability
topic label
instance
clustering Latent Dirichlet Allocation
α α
![Page 61: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/61.jpg)
Clustering & Topic Models
DocumentsmembershipCluster/
topicdistributions
x =
clustering: (0, 1) matrixtopic model: stochastic matrixLSI: arbitrary matrices
![Page 62: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/62.jpg)
Topics in text
Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003
![Page 63: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/63.jpg)
Joint Probability Distribution
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(⇤, z,⌅, x|�,⇥)
=KY
k=1
p(⌅k|⇥)mY
i=1
p(⇤i|�)
m,miY
i,j
p(zij |⇤i)p(xij |zij ,⌅)
![Page 64: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/64.jpg)
sample zindependently
sample θindependently
Joint Probability Distribution
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(⇤, z,⌅, x|�,⇥)
=KY
k=1
p(⌅k|⇥)mY
i=1
p(⇤i|�)
m,miY
i,j
p(zij |⇤i)p(xij |zij ,⌅)
sample Ψindependently
![Page 65: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/65.jpg)
sample zindependently
sample θindependently
Joint Probability Distribution
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(⇤, z,⌅, x|�,⇥)
=KY
k=1
p(⌅k|⇥)mY
i=1
p(⇤i|�)
m,miY
i,j
p(zij |⇤i)p(xij |zij ,⌅)
sample Ψindependently slow
![Page 66: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/66.jpg)
Collapsed Sampler
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(z, x|�,⇥)
=mY
i=1
p(zi|�)kY
k=1
p({xij |zij = k} |⇥)
![Page 67: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/67.jpg)
sample zsequentially
Collapsed Sampler
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(z, x|�,⇥)
=mY
i=1
p(zi|�)kY
k=1
p({xij |zij = k} |⇥)
![Page 68: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/68.jpg)
sample zsequentially
Collapsed Sampler
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(z, x|�,⇥)
=mY
i=1
p(zi|�)kY
k=1
p({xij |zij = k} |⇥)fast
![Page 69: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/69.jpg)
Collapsed Sampler
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(z, x|�,⇥)
=mY
i=1
p(zi|�)kY
k=1
p({xij |zij = k} |⇥)
n�ij(t, w) + �t
n�i(t) +P
t �t
n�ij(t, d) + �t
n�i(d) +P
t �t
Griffiths & Steyvers, 2005
![Page 70: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/70.jpg)
Collapsed Sampler
xij
zij
θi
language prior
topic probability
topic label
instance
α
ψkβ
p(z, x|�,⇥)
=mY
i=1
p(zi|�)kY
k=1
p({xij |zij = k} |⇥)fast
n�ij(t, w) + �t
n�i(t) +P
t �t
n�ij(t, d) + �t
n�i(d) +P
t �t
Griffiths & Steyvers, 2005
![Page 71: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/71.jpg)
Sequential Algorithm (Gibbs sampler)
• For 1000 iterations do• For each document do
• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table
![Page 72: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/72.jpg)
Sequential Algorithm (Gibbs sampler)
this kills parallelism
• For 1000 iterations do• For each document do
• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table
![Page 73: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/73.jpg)
3. Design Principles
![Page 74: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/74.jpg)
Scaling Problems
![Page 75: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/75.jpg)
3 Problems
meanvariance
cluster weight
data cluster ID
![Page 76: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/76.jpg)
3 Problems
global statedata local state
![Page 77: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/77.jpg)
3 Problems
too big for single machine
huge only local
![Page 78: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/78.jpg)
3 Problems
data
local state
global state
Vanilla LDA
User profiling
global state
![Page 79: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/79.jpg)
3 Problems
data
local state
global state
Vanilla LDA
User profiling
global state
![Page 80: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/80.jpg)
3 Problems
global stateis too large
does not fit into memory
network load & barriers
does not fit into memory
local stateis too large
![Page 81: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/81.jpg)
3 Problems
global stateis too large
does not fit into memory
network load & barriers
does not fit into memory
local stateis too large
stream local data from disk
![Page 82: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/82.jpg)
3 Problems
global stateis too large
does not fit into memory
network load & barriers
does not fit into memory
local stateis too large
stream local data from disk
asynchronous synchronization
![Page 83: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/83.jpg)
3 Problems
global stateis too large
does not fit into memory
network load & barriers
does not fit into memory
local stateis too large
stream local data from disk
asynchronous synchronization
partial view
![Page 84: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/84.jpg)
Global state synchronization
![Page 85: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/85.jpg)
Lock and Barrier
• Changes in μ affect distribution in z
• Changes in z affect distribution in μ(in particular for a collapsed sampler)
• Lock z, then recompute μ• Lock all but single zi (for collapsed sampler)
![Page 86: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/86.jpg)
Variable replication
• No locks between machines to access z
• Synchronization mechanism for global μ needed
• In LDA this is the local copy of the (topic,word) counts
1 copy per machinebackground sync
![Page 87: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/87.jpg)
Distribution
globalreplica
rack
cluster
![Page 88: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/88.jpg)
Distribution
globalreplica
rack
cluster
![Page 89: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/89.jpg)
Message Passing• Start with common state• Child stores old and new state• Parent keeps global state• Transmit differences asynchronously
• Inverse element for difference• Abelian group for commutativity
(sum, log-sum, cyclic group, exponential families)
local to global global to local
x x+ (xglobal � x
old)
x
old x
global
� x� x
old
x
old x
x
global x
global + �
![Page 90: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/90.jpg)
Distribution• Dedicated server for variables
• Insufficient bandwidth (hotspots)• Insufficient memory
• Select server via consistent hashing(random trees a la Karger et al. 1999 if needed)
m(x) = argminm2M
h(x,m)
![Page 91: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/91.jpg)
Distribution & fault tolerance• Storage is O(1/k) per machine• Communication is O(1) per machine• Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)• O(k) open connections per machine• O(1/k) throughput per machine
m(x) = argminm2M
h(x,m)
![Page 92: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/92.jpg)
Distribution & fault tolerance• Storage is O(1/k) per machine• Communication is O(1) per machine • Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)• O(k) open connections per machine• O(1/k) throughput per machine
m(x) = argminm2M
h(x,m)
![Page 93: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/93.jpg)
Distribution & fault tolerance• Storage is O(1/k) per machine• Communication is O(1) per machine • Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)• O(k) open connections per machine• O(1/k) throughput per machine
m(x) = argminm2M
h(x,m)
![Page 94: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/94.jpg)
Synchronization Protocol
![Page 95: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/95.jpg)
Synchronization• Data rate between machines is O(1/k)• Machines operate asynchronously (barrier free)• Solution
• Schedule message pairs• Communicate with r random machines simultaneously
local
global
r=1
![Page 96: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/96.jpg)
Synchronization• Data rate between machines is O(1/k)• Machines operate asynchronously (barrier free)• Solution
• Schedule message pairs• Communicate with r random machines simultaneously
local
global
r=1
![Page 97: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/97.jpg)
Synchronization• Data rate between machines is O(1/k)• Machines operate asynchronously (barrier free)• Solution
• Schedule message pairs• Communicate with r random machines simultaneously
local
global
r=1
![Page 98: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/98.jpg)
Synchronization• Data rate between machines is O(1/k)• Machines operate asynchronously (barrier free)• Solution
• Schedule message pairs• Communicate with r random machines simultaneously• Use Luby-Rackoff PRPG for load balancing
• Efficiency guarantee
4 simultaneous connections are sufficient
![Page 99: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/99.jpg)
Architecture
![Page 100: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/100.jpg)
Sequential Algorithm (Gibbs sampler)
• For 1000 iterations do• For each document do
• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table
![Page 101: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/101.jpg)
Sequential Algorithm (Gibbs sampler)
this kills parallelism
• For 1000 iterations do• For each document do
• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table
![Page 102: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/102.jpg)
Distributed asynchronous sampler• For 1000 iterations do (independently per computer)
• For each thread/core do• For each document do
• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message
• In parallel update local (word, topic) table• In parallel update global (word, topic) table
![Page 103: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/103.jpg)
Distributed asynchronous sampler
continuoussync
barrier free
concurrentcpu hdd net
minimal view
• For 1000 iterations do (independently per computer)• For each thread/core do
• For each document do• For each word in the document do
• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message
• In parallel update local (word, topic) table• In parallel update global (word, topic) table
![Page 104: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/104.jpg)
Multicore Architecture
• Decouple multithreaded sampling and updating (almost) avoids stalling for locks in the sampler
• Joint state table• much less memory required• samplers syncronized (10 docs vs. millions delay)
• Hyperparameter update via stochastic gradient descent• No need to keep documents in memory (streaming)
tokens
topics
file
combiner
count
updater
diagnostics
&
optimization
output to
filetopics
samplersampler
samplersampler
sampler
Intel Threading Building Blocks
joint state table
![Page 105: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/105.jpg)
Cluster Architecture
• Distributed (key,value) storage via ICE• Background asynchronous synchronization
• single word at a time to avoid deadlocks• no need to have joint dictionary• uses disk, network, cpu simultaneously
sampler sampler sampler sampler
iceiceiceice
![Page 106: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/106.jpg)
Cluster Architecture
• Distributed (key,value) storage via ICE• Background asynchronous synchronization
• single word at a time to avoid deadlocks• no need to have joint dictionary• uses disk, network, cpu simultaneously
sampler sampler sampler sampler
iceiceiceice
![Page 107: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/107.jpg)
Making it work• Startup
• Naive: randomly initialize topics on each node (read from disk if already assigned - hotstart)
• Forward sampling for startup much faster• Aggregate changes on the fly
• Failover• State constantly being written to disk
(worst case we lose 1 iteration out of 1000)• Restart via standard startup routine
• Achilles heel: need to restart from checkpoint if even a single machine dies.
![Page 108: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/108.jpg)
Easily extensible• Better language model (topical n-grams)
can process millions of users (vs 1000s)• Conditioning on side information (upstream)
estimate topic based on authorship, source, joint user model ...
• Conditioning on dictionaries (downstream)integrate topics between different languages
• Time dependent sampler for user modelapproximate inference per episode
![Page 109: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/109.jpg)
Google LDA Mallet Irvine’08 Irvine’09 Yahoo LDA
Multicore no yes yes yes yes
Cluster MPI no MPI point 2 point ICE
State table dictionarysplit
separatesparse separate separate joint
sparse
Schedule synchronousexact
synchronousexact
synchronousexact
asynchronousapproximate
messages
asynchronousexact
![Page 110: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/110.jpg)
Speed (2010 numbers)• 1M documents per day on 1 computer
(1000 topics per doc, 1000 words per doc)• 350k documents per day per node
(context switches & memcached & stray reducers)• 8 Million docs (Pubmed)
(sampler does not burn in well - too short doc)• Irvine: 128 machines, 10 hours• Yahoo: 1 machine, 11 days • Yahoo: 20 machines, 9 hours
• 20 Million docs (Yahoo! News Articles) • Yahoo: 100 machines, 12 hours
![Page 111: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/111.jpg)
Fast sampler
• 8 Million documents, 1000 topics, {100,200,400} machines, LDA• Red (symmetric latency bound message passing)• Blue (asynchronous bandwidth bound message passing & message scheduling)
• 10x faster synchronization time• 10x faster snapshots• Scheduling improves 10% already on 150 machines
![Page 112: Graphical Models for the Internet (Part1)](https://reader033.fdocuments.net/reader033/viewer/2022060108/5550b6a7b4c905fa618b4b55/html5/thumbnails/112.jpg)
• Tools• Load distribution, balancing and synchronization• Clustering, Topic Models
• Models• Dynamic non-parametric models• Sequential latent variable models
• Inference Algorithms• Distributed batch• Sequential Monte Carlo
• Applications• User profiling• News content analysis & recommendation
Roadmap