DiscoRank: optimizing discoverability on SoundCloud
Transcript of DiscoRank: optimizing discoverability on SoundCloud
![Page 1: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/1.jpg)
DiscoRank: Optimizing Discoverability on SoundCloud
Amélie Anglade
![Page 2: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/2.jpg)
• Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery
![Page 3: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/3.jpg)
DISCOVERABILITY?
![Page 4: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/4.jpg)
![Page 5: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/5.jpg)
![Page 6: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/6.jpg)
![Page 7: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/7.jpg)
PAGERANK
![Page 8: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/8.jpg)
WEB AND PAGERANK
• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks
• The (Page)rank of a node depends on the link structure of the graph
![Page 9: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/9.jpg)
RANDOM SURFER
![Page 10: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/10.jpg)
RANDOM SURFER
[Figure: graph with nodes A, B, C, D; three edges each labelled 1/3]
![Page 11: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/11.jpg)
![Page 12: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/12.jpg)
RANDOM SURFER
Nodes visited more often:
• Nodes with many incoming links
• Nodes whose incoming links come from frequently visited nodes
[Figure: graph with nodes A, B, C, D, E]
![Page 13: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/13.jpg)
COMPUTING THE PAGERANK
[Figure: graph with nodes A, B, C, D, E]
Adjacency matrix A
Transition probability matrix M
Probability distribution of surfer’s position
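The three objects named on this slide can be sketched in a few lines of Python. The edges below are a hypothetical example (the graph on the slide image is not recoverable), and the function names are illustrative only. The sketch builds the adjacency matrix, row-normalises it into the transition probability matrix, and moves the surfer's position distribution forward by one step.

```python
# Hypothetical 5-node graph; the actual edges on the slide are not known.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
         ("C", "A"), ("D", "C"), ("D", "E"), ("E", "A")]


def adjacency_matrix(nodes, edges):
    """A[i][j] = 1 if there is a link from node i to node j, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        a[index[source]][index[target]] = 1
    return a


def transition_matrix(a):
    """Row-normalise A: M[i][j] = probability of moving from i to j."""
    m = []
    for row in a:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("dead-end node: needs the teleport step")
        m.append([value / out_degree for value in row])
    return m


def step(x, m):
    """One move of the surfer: new_x[j] = sum over i of x[i] * M[i][j]."""
    n = len(x)
    return [sum(x[i] * m[i][j] for i in range(n)) for j in range(n)]


A = adjacency_matrix(NODES, EDGES)
M = transition_matrix(A)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at node A
x1 = step(x0, M)                # A has three out-links, so B, C, D get 1/3 each
print(dict(zip(NODES, x1)))
```

Repeating `step` gives the distribution after two moves, three moves, and so on.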
![Page 14: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/14.jpg)
![Page 15: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/15.jpg)
![Page 16: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/16.jpg)
![Page 17: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/17.jpg)
![Page 18: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/18.jpg)
![Page 19: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/19.jpg)
TELEPORT
[Figure: graph with nodes A, B, C, D, E]
![Page 20: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/20.jpg)
![Page 21: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/21.jpg)
![Page 22: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/22.jpg)
TELEPORT
If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N.
[Figure: graph with nodes A, B, C, D, E; teleport edges each labelled 1/N]
![Page 23: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/23.jpg)
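TELEPORT
At regular node: invoke teleport operation with probability α and standard random walk with probability (1 - α)
[Figure: graph with nodes A, B, C, D, E; teleport edges labelled 1/N; the choice at a node labelled α (teleport) and 1 - α (random walk)]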
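As a rough sketch of the rule above: at a regular node the surfer teleports with probability α (landing on each of the N nodes with probability 1/N) and otherwise follows a link. The sketch assumes the usual convention that a dead-end node (no outgoing links) always teleports. The value α = 0.15 and the small matrix are hypothetical.

```python
def teleport_step(x, m, alpha):
    """One move of the surfer with teleportation.

    x     : current probability distribution over the N nodes
    m     : transition matrix; an all-zero row marks a dead-end node
    alpha : probability of teleporting from a regular node
    """
    n = len(x)
    new_x = [0.0] * n
    for i in range(n):
        if sum(m[i]) == 0:
            # Dead end (assumed convention): teleport with probability 1.
            for j in range(n):
                new_x[j] += x[i] / n
        else:
            # Regular node: teleport with alpha, walk with 1 - alpha.
            for j in range(n):
                new_x[j] += x[i] * (alpha / n + (1.0 - alpha) * m[i][j])
    return new_x


# Hypothetical 3-node example: node 2 has no outgoing links.
M_EXAMPLE = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
ALPHA = 0.15  # illustrative value; the slides do not state one

from_regular = teleport_step([1.0, 0.0, 0.0], M_EXAMPLE, ALPHA)
from_dead_end = teleport_step([0.0, 0.0, 1.0], M_EXAMPLE, ALPHA)
print(from_regular, from_dead_end)
```

Because every node now receives at least α/N from each regular node, the surfer can never get stuck.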
![Page 24: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/24.jpg)
COMPUTING THE PAGERANK
The probability distribution of the surfer’s position at any time is a vector.
That vector converges to a steady state: the PageRank vector.
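The convergence can be illustrated with plain power iteration: apply the surfer's update until the vector stops changing. This is a minimal sketch, not the production implementation. The tolerance, the value of α and the example graph are assumptions, and the matrix is assumed to have no dead-end rows.

```python
def pagerank(m, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate x <- alpha/N + (1 - alpha) * x M until x stops changing."""
    n = len(m)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_x = [
            alpha / n + (1.0 - alpha) * sum(x[i] * m[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


# Hypothetical hub graph: node 0 links to 1 and 2, both link back to 0.
HUB = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
ranks = pagerank(HUB)
print(ranks)  # node 0, which receives every link, gets the largest share
```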
![Page 25: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/25.jpg)
PAGERANK EQUATION
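The equation on this slide survives only as an image. A standard form that is consistent with the teleport slide, assuming x is a row vector holding the surfer's position distribution, M is the transition probability matrix, N is the number of nodes and α is the teleport probability, would be:

```latex
x^{(t+1)} = \frac{\alpha}{N}\,\mathbf{1}^{\top} + (1-\alpha)\, x^{(t)} M,
\qquad
\pi = \frac{\alpha}{N}\,\mathbf{1}^{\top} + (1-\alpha)\, \pi M,
\qquad
\sum_{i=1}^{N} \pi_i = 1
```

Here π is the steady state of the iteration, i.e. the PageRank vector. The notation on the original slide may differ.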
![Page 26: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/26.jpg)
SOUNDCLOUD DISCORANK
![Page 27: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/27.jpg)
![Page 28: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/28.jpg)
DISCORANK
[Figure: graph with nodes A, B, C, D, E; node types: User, Track, Playlist; edge types: favorite, follow, featured in]
![Page 29: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/29.jpg)
UNIVERSAL SEARCH
• Search across People, Sounds, Sets, Groups
• A single rank vector that contains all entities
• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist
  • ...
• New big (but sparse) adjacency matrix:
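One way to picture the weighted, mixed-entity graph is as a sparse dict-of-dicts in which each event adds its type's weight to an edge. The weights, entity ids and event names below are hypothetical; the slides do not give the actual values.

```python
from collections import defaultdict

# Hypothetical per-event-type weights (not from the talk).
EVENT_WEIGHTS = {"favorite": 1.0, "follow": 2.0, "featured_in": 0.5}

# (source entity, event type, target entity); ids are made up.
EVENTS = [
    ("user:1", "favorite", "track:7"),
    ("user:1", "follow", "user:2"),
    ("user:2", "favorite", "track:7"),
    ("track:7", "featured_in", "playlist:3"),
]


def weighted_adjacency(events, weights):
    """Sparse adjacency: only edges that actually occur are stored."""
    adjacency = defaultdict(dict)
    for source, event_type, target in events:
        row = adjacency[source]
        row[target] = row.get(target, 0.0) + weights[event_type]
    return dict(adjacency)


def transition_probabilities(adjacency):
    """Normalise each row so its outgoing weights sum to 1."""
    probabilities = {}
    for source, row in adjacency.items():
        total = sum(row.values())
        probabilities[source] = {t: w / total for t, w in row.items()}
    return probabilities


ADJ = weighted_adjacency(EVENTS, EVENT_WEIGHTS)
P = transition_probabilities(ADJ)
print(P["user:1"])
```

Users, tracks and playlists share one node space, so a single rank vector covers all of them.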
![Page 30: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/30.jpg)
![Page 31: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/31.jpg)
BACK TO EXPLORE
• How do we identify content that is trending?
• The more recent an event (listen, favorite, etc.), the higher its weight
• Multiply each event (= edge) by a time decay:
• New adjacency matrix:
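The decay function itself is only on the slide image. As an illustration, the sketch below assumes an exponential decay with a hypothetical half-life of 7 days, so an event's contribution halves every week. The real function and constants may differ.

```python
def decayed_weight(base_weight, age_days, half_life_days=7.0):
    """Weight of a single event after exponential time decay (assumed form)."""
    return base_weight * 0.5 ** (age_days / half_life_days)


def edge_weight(event_ages_days, base_weight=1.0, half_life_days=7.0):
    """An edge's weight is the sum of its decayed events."""
    return sum(
        decayed_weight(base_weight, age, half_life_days)
        for age in event_ages_days
    )


# Two tracks with three listens each: one recent, one more than a month old.
fresh = edge_weight([0.0, 1.0, 2.0])
stale = edge_weight([30.0, 45.0, 60.0])
print(fresh, stale)
```

With the same number of events, the recently played track gets the heavier edges and therefore a higher rank, which is what surfaces trending content.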
![Page 32: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/32.jpg)
PERFORMANCE OPTIMIZATION
![Page 33: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/33.jpg)
A VERY LARGE GRAPH
• Millions of entities (= nodes) and events (= edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank in real time
![Page 34: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/34.jpg)
USING SPARSITY
• Re-mapping entity ids
• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids
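Delta encoding of an ordered adjacency list can be sketched as follows: store the first "to" id, then only the gaps between consecutive ids. The variable-length byte encoding used here to show the space saving is an assumption for illustration; the talk does not say how the deltas are serialised, and the ids are made up.

```python
def delta_encode(sorted_ids):
    """Replace each id by its difference from the previous id."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original ids with a running sum."""
    ids, total = [], 0
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids


def varint_bytes(values):
    """Assumed serialisation: 7 bits per byte, high bit = 'more follows'."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value)
    return bytes(out)


# One "from" node with several sorted "to" node ids (hypothetical values).
to_ids = [1000003, 1000010, 1000012, 1000250]
deltas = delta_encode(to_ids)
print(deltas, len(varint_bytes(to_ids)), len(varint_bytes(deltas)))
```

Small gaps fit in one or two bytes, which is why sorting the "to" ids first pays off.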
![Page 35: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/35.jpg)
VERSIONED DISCORANK
• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch once a week
• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as a prior for the DiscoRank computation
• Side effect:
  • Also allows for experimentation
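Using the previous run as a prior amounts to warm-starting the iteration. In the sketch below the old rank vector is extended with a small share for entities that appeared since the last run, renormalised, and used as the starting point. The graphs, α and the way new entities are initialised are all assumptions.

```python
def power_iterate(m, start, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration from a given starting vector; returns (ranks, iterations)."""
    n = len(m)
    x = list(start)
    for iteration in range(1, max_iter + 1):
        new_x = [
            alpha / n + (1.0 - alpha) * sum(x[i] * m[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            return x, iteration
    return x, max_iter


def extend_prior(previous, new_size):
    """Give new entities a uniform share, then renormalise to sum to 1."""
    padded = list(previous) + [1.0 / new_size] * (new_size - len(previous))
    total = sum(padded)
    return [value / total for value in padded]


# Previous run: 3 nodes. New segment: node 3 joins and links to/from node 0.
OLD_M = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
NEW_M = [[0.0, 1/3, 1/3, 1/3], [1.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

old_rank, _ = power_iterate(OLD_M, [1/3] * 3)
warm_rank, warm_iterations = power_iterate(NEW_M, extend_prior(old_rank, 4))
cold_rank, cold_iterations = power_iterate(NEW_M, [0.25] * 4)
print(warm_iterations, cold_iterations)
```

Both starts reach the same steady state. When the graph changes only a little between runs, the warm start typically gets there in fewer iterations.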
![Page 36: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/36.jpg)
INTEGRATION IN OUR INFRASTRUCTURE
• MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we reload it into ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank
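The slides do not say how the two scores are combined. Purely as an illustration, the sketch below boosts the text-relevance score by the item's DiscoRank normalised to the best-ranked hit. The formula, the `rank_weight` parameter and the sample hits are hypothetical.

```python
def combined_score(lucene_score, discorank, max_discorank, rank_weight=1.0):
    """Hypothetical combination: text relevance boosted by normalised rank."""
    normalised = discorank / max_discorank if max_discorank > 0 else 0.0
    return lucene_score * (1.0 + rank_weight * normalised)


# (item id, Lucene score, DiscoRank); all values are made up.
HITS = [
    ("track:1", 2.0, 0.001),
    ("track:2", 1.8, 0.004),
    ("track:3", 0.5, 0.004),
]

best_rank = max(rank for _, _, rank in HITS)
ranked = sorted(
    HITS,
    key=lambda hit: combined_score(hit[1], hit[2], best_rank),
    reverse=True,
)
print([item_id for item_id, _, _ in ranked])
```

A multiplicative boost keeps text relevance in charge: a popular item that barely matches the query (track:3 here) still ranks below a good match.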
![Page 37: DiscoRank: optimizing discoverability on SoundCloud](https://reader033.fdocuments.net/reader033/viewer/2022052900/5562c242d8b42aaf178b4c41/html5/thumbnails/37.jpg)
Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com