DataEngConf: The Science of Virality at BuzzFeed
Upload: hakka-labs
Category: Software
Transcript of DataEngConf: The Science of Virality at BuzzFeed
![Page 1: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/1.jpg)
![Page 2: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/2.jpg)
HISTORY OF VIRALITY
![Page 3: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/3.jpg)
![Page 4: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/4.jpg)
THE DATA
![Page 5: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/5.jpg)
THE DATA: OLD VERSION
Article being viewed
User viewing article
Time of pageview
Referring domain
![Page 6: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/6.jpg)
THE DATA: NEW VERSION
Article being viewed
Time of pageview
Referring domain
User viewing article
Referring User
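
A minimal sketch of the two record layouts listed above, assuming hypothetical field names (the slides list the fields but not a schema). The only difference between the old and new versions is the referring user, which is what turns a pageview into an edge between two users.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Pageview:
    """One pageview log record. Field names are illustrative assumptions."""
    article_id: str
    user_id: str                           # user viewing the article
    timestamp: float                       # time of pageview, seconds since epoch
    referring_domain: Optional[str]        # e.g. "facebook.com"; None if unknown
    referring_user: Optional[str] = None   # new-version field; absent in old logs


def as_edge(pv: Pageview) -> Optional[Tuple[str, str]]:
    """Return the (referrer, viewer) edge a pageview implies, or None if unknown."""
    if pv.referring_user is None:
        return None
    return (pv.referring_user, pv.user_id)


old_style = Pageview("a1", "u2", 1000.0, "facebook.com")
new_style = Pageview("a1", "u2", 1000.0, "facebook.com", referring_user="u1")
```

With the old layout `as_edge` has nothing to return, which is why the graph has to be guessed from timing alone.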
![Page 7: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/7.jpg)
DIFFERENT PERSPECTIVE:
Pageviews are a process on a graph!
![Page 8: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/8.jpg)
WHAT THE GRAPH LOOKS LIKE:
![Page 9: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/9.jpg)
WHAT THE PROCESS LOOKS LIKE:
![Page 10: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/10.jpg)
WHAT THE DATA LOOKS LIKE:
![Page 11: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/11.jpg)
WHAT CAN YOU DO WITH OLD PAGEVIEWS?
(Educated)
Guess!
![Page 12: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/12.jpg)
CONNIE
![Page 13: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/13.jpg)
OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE
Probabilistic: You can infer connections that aren’t there!
Error Prone: Graph statistics can be sensitive to small changes in the graph
Gets larger when differences in pageview times get smaller
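
The formula on this slide did not survive extraction, so the following is only a sketch of the general idea: give each earlier viewer a weight that grows as the gap between pageview times shrinks, then normalize. The exponential kernel, the `tau` parameter, and the function name are assumptions for illustration, not necessarily the model used in the talk.

```python
import math


def parent_probabilities(pageview_times, user, tau=3600.0):
    """Guess who referred `user`, given only pageview times.

    pageview_times: dict mapping user -> pageview time in seconds.
    Returns a dict mapping each earlier viewer to a probability.
    The exponential decay (time scale tau) is an assumed kernel.
    """
    t_child = pageview_times[user]
    weights = {}
    for other, t in pageview_times.items():
        dt = t_child - t
        if other != user and dt > 0:        # only earlier viewers can be parents
            weights[other] = math.exp(-dt / tau)
    total = sum(weights.values())
    if not total:
        return {}
    return {u: w / total for u, w in weights.items()}


views = {"a": 0.0, "b": 600.0, "c": 3000.0}
probs = parent_probabilities(views, "c")    # "b" viewed closer in time to "c" than "a" did
```

Because every earlier viewer gets a nonzero weight, the inferred graph can contain connections that never existed, which is the "probabilistic" caveat above.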
![Page 14: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/14.jpg)
SIMPLIFIED VERSION:
Observe:
Guess:
![Page 15: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/15.jpg)
SIMPLIFIED VERSION:
Guess:
Reality:
![Page 16: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/16.jpg)
Check out a toy implementation here!
github.com/akellehe/pyconnie
![Page 17: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/17.jpg)
NEW GRAPH RECONSTRUCTION: TRIVIAL
These are actually Unique Visitors …
![Page 18: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/18.jpg)
LIFE IS A LITTLE MESSY…
This is more like what the Pageview graph looks like
![Page 19: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/19.jpg)
PROBLEM: DATA MUNGING
• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?
![Page 20: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/20.jpg)
PROBLEM: STREAMLINING ANALYSIS
• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
![Page 21: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/21.jpg)
DEFINE DATA STRUCTURES!
• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!
![Page 22: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/22.jpg)
PROPAGATION SET
For one article
For the site (or other set of articles, S)
![Page 23: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/23.jpg)
PROPAGATION SET
Pageviews to article b in time T
Pageviews to the site in time T
The simplest data structure. Just a representation of the raw pageview logs.
Represented as a generator of UserEdge objects
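
A minimal sketch of what "a generator of UserEdge objects" might look like. The `UserEdge` fields, the `propagation_set` function, and its parameters are assumptions; the slides name the object but do not show its definition.

```python
from collections import namedtuple

# Hypothetical shape for a UserEdge: who referred whom, to which article, when.
UserEdge = namedtuple("UserEdge", ["source", "target", "article", "time"])


def propagation_set(log_rows, articles=None, start=float("-inf"), end=float("inf")):
    """Lazily yield UserEdge objects for pageviews in the window [start, end).

    log_rows: iterable of (source, target, article, time) tuples.
    articles: optional set of article ids (the set S); None means the whole site.
    """
    for source, target, article, time in log_rows:
        if articles is not None and article not in articles:
            continue
        if start <= time < end:
            yield UserEdge(source, target, article, time)


rows = [
    ("promo:fb", "u1", "b", 10.0),
    ("u1", "u2", "b", 20.0),
    ("u1", "u3", "c", 25.0),
    ("u2", "u4", "b", 99.0),
]
# Pageviews to article b in the window T = [0, 50)
edges_b = list(propagation_set(rows, articles={"b"}, start=0.0, end=50.0))
```

A generator keeps the structure close to the raw logs: nothing is materialized until a downstream structure (graph, forest) consumes it.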
![Page 24: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/24.jpg)
PROPAGATION GRAPH
![Page 25: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/25.jpg)
PROPAGATION GRAPH
![Page 26: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/26.jpg)
PROPAGATION GRAPH
![Page 27: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/27.jpg)
INFLUENCE GRAPH
Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user
![Page 28: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/28.jpg)
CONSIDER:
![Page 29: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/29.jpg)
PROPAGATION FOREST
![Page 30: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/30.jpg)
PROPAGATION FOREST
The propagation graph is great, but we’d also like a concept like unique visitors!
If there is attribution ordering in the graph, we can trace content back to its source!
![Page 31: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/31.jpg)
PROPAGATION FOREST: FIRST PARENT ATTRIBUTION
n pageviews, one UV
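
First-parent attribution can be sketched as keeping, for each user, only the earliest edge pointing at them, so that n pageviews by the same user collapse to one unique visitor. The tuple layout and function name below are assumptions for illustration.

```python
def first_parent_attribution(edges):
    """Keep only the earliest in-edge for every target user.

    edges: iterable of (source, target, time) tuples.
    Returns the surviving edges sorted by time.
    """
    first = {}
    for source, target, time in edges:
        if target not in first or time < first[target][2]:
            first[target] = (source, target, time)
    return sorted(first.values(), key=lambda e: e[2])


edges = [
    ("promo", "u1", 1.0),
    ("u1", "u2", 2.0),   # u2's first pageview: u1 is the first parent
    ("u3", "u2", 5.0),   # later pageviews by u2 are dropped
    ("u1", "u2", 7.0),
    ("u1", "u3", 3.0),
]
forest_edges = first_parent_attribution(edges)
```

After this step every user has at most one in-edge, which is the property the forest construction relies on.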
![Page 32: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/32.jpg)
PROPAGATION FOREST gets the credit
![Page 33: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/33.jpg)
RESULT: ALL GRAPHS ARE FORESTS
Promotions have indegree 0, users have indegree 1
total edges in connected components:
Trees!
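
The edge-count expression on this slide was lost in extraction. As a hedged illustration of the kind of consistency check it suggests, the sketch below uses the standard fact that a tree on n nodes has n - 1 edges: it verifies that no node has more than one in-edge and that every connected component has exactly one fewer edge than nodes. The function name and input layout are assumptions.

```python
from collections import defaultdict


def is_forest(edges):
    """edges: list of (source, target) pairs. True if the graph is a forest."""
    nodes = {n for edge in edges for n in edge}

    indegree = defaultdict(int)
    for _, target in edges:
        indegree[target] += 1
    if any(d > 1 for d in indegree.values()):
        return False                     # a user with two parents

    # Union-find over undirected connectivity to label components.
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for s, t in edges:
        parent[find(s)] = find(t)

    comp_nodes = defaultdict(int)
    comp_edges = defaultdict(int)
    for n in nodes:
        comp_nodes[find(n)] += 1
    for s, _ in edges:
        comp_edges[find(s)] += 1

    # Each component must have exactly (nodes - 1) edges to be a tree.
    return all(comp_edges[c] == comp_nodes[c] - 1 for c in comp_nodes)
```

A check like this is cheap enough to run as a unit test on every pre-processed graph.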
![Page 34: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/34.jpg)
CAREFUL FOR EDGE CASES: MISSING DATA?
All connected components should be rooted at a promotion source.
What happens if we lose the first edge (e.g. use the wrong T)?
![Page 35: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/35.jpg)
PROPAGATION FOREST: CYCLE BREAKING
Consider …
Cycle is not broken by first-parent attribution
Traversal algorithms go on forever!
![Page 36: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/36.jpg)
PROPAGATION FOREST: CYCLE BREAKING
Consider …
As long as they’re not equal, they can be ordered, say
Then, there is a node in the cycle with an out-edge younger than its in-edge:
The original pageview for that node must have been lost. Cut the in-edge (FPA!).
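
The argument above can be sketched as follows, reading "younger" as "earlier in time": within a cycle, the source of the earliest edge passed the article on before its own recorded in-edge, so that in-edge is the one to cut. The function name and tuple layout are assumptions, and the input is assumed to have already been through first-parent attribution (at most one in-edge per user) with distinct edge times.

```python
def break_cycles(edges):
    """Cut one in-edge per cycle so that parent-walks always terminate.

    edges: list of (source, target, time), at most one in-edge per target.
    Returns the remaining edges sorted by time.
    """
    in_edge = {t: (s, t, tm) for s, t, tm in edges}
    state = {}  # node -> "active" while on the current walk, "done" afterwards

    for start in list(in_edge):
        path = []
        node = start
        # Walk upstream (child -> parent) until a root or a visited node.
        while node in in_edge and node not in state:
            state[node] = "active"
            path.append(node)
            node = in_edge[node][0]
        if state.get(node) == "active":
            # We walked back onto the current path: a cycle.
            cycle = path[path.index(node):]
            earliest = min((in_edge[n] for n in cycle), key=lambda e: e[2])
            # The source of the earliest edge shared before its recorded
            # in-edge, so its original pageview was lost: cut that in-edge.
            del in_edge[earliest[0]]
        for n in path:
            state[n] = "done"

    return sorted(in_edge.values(), key=lambda e: e[2])


cyclic = [("u1", "u2", 5.0), ("u2", "u3", 6.0), ("u3", "u1", 9.0), ("u3", "u4", 10.0)]
acyclic = break_cycles(cyclic)   # u1 becomes a root; its in-edge from u3 is cut
```

After the cut, the affected node has indegree 0, so it behaves like a root whose true promotion source is unknown.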
![Page 37: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/37.jpg)
SUCCESS!
Cycle-breaking + FPA = Trees!
Each tree is the UV graph downstream from a promotion source: promotion attribution!
Additional Benefits:
Most information diffusion analyses model trees growing on graphs.
Many algorithms simplify when run on trees!
![Page 38: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/38.jpg)
SUPERTREE
We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?
Merge all the roots (promotion sources) together into one “super-node”
The whole forest becomes a SuperTree!
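
A minimal sketch of the merge, assuming the forest is given as (source, target) pairs and that roots are the nodes with no in-edge. The `build_supertree` name and the `"SUPER"` label are hypothetical.

```python
def build_supertree(forest_edges, super_root="SUPER"):
    """Merge every root of a forest into a single super-node.

    forest_edges: list of (source, target) pairs forming a forest.
    Returns the edge list of the resulting single tree, sorted.
    """
    targets = {t for _, t in forest_edges}
    sources = {s for s, _ in forest_edges}
    roots = sources - targets            # promotion sources: no in-edge
    merged = set()
    for s, t in forest_edges:
        merged.add((super_root if s in roots else s, t))
    return sorted(merged)


forest = [("facebook", "u1"), ("u1", "u2"), ("twitter", "u3")]
supertree = build_supertree(forest)
```

Any algorithm written for a single rooted tree can then be run once on the supertree instead of once per promotion source.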
![Page 39: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/39.jpg)
SUPERTREE: EXAMPLE
![Page 40: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/40.jpg)
SUPERTREE: EXAMPLE
![Page 41: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/41.jpg)
APPLICATION: LARGE SCALE DATA VIS
![Page 42: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/42.jpg)
WHY IS IT SLOW?
Layouts often consider repelling each node from every other: O(n^2) time complexity
Good for a few thousand nodes
![Page 43: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/43.jpg)
OPENORD: SIMULATED ANNEALING
Linear main layout
Quadratic settling phase
Implemented in Gephi
![Page 44: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/44.jpg)
OPENORD
Good for ~10k Users
Slow for ~100k Users
Messy! (if you skip the quadratic step!)
![Page 45: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/45.jpg)
TAKE ADVANTAGE OF TREE STRUCTURE!
Traverse the tree to decide where to place nodes!
![Page 46: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/46.jpg)
H3 LAYOUT
Each parent is in the center of a hemisphere.
Children are laid out on the surface of the hemisphere
They become centers of smaller hemispheres (if they’re parents)
Etc.
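
The recursion described above can be sketched in plain Euclidean space. This is a deliberately simplified illustration: the actual H3 algorithm works in hyperbolic space and sizes each hemisphere by its subtree, whereas here every hemisphere faces the same direction and shrinks by a fixed factor. The function name and all parameters are assumptions, and this is not the pyh3 API.

```python
import math


def hemisphere_layout(children, root, radius=1.0, shrink=0.5):
    """Place each node's children on a hemisphere centred on that node.

    children: dict mapping a node to a list of its child nodes.
    Returns a dict mapping node -> (x, y, z).
    """
    positions = {root: (0.0, 0.0, 0.0)}
    stack = [(root, radius)]
    while stack:                          # one traversal of the tree: linear time
        parent, r = stack.pop()
        kids = children.get(parent, [])
        px, py, pz = positions[parent]
        phi = math.pi / 4                 # fixed polar angle on the upper hemisphere
        for i, kid in enumerate(kids):
            theta = 2 * math.pi * i / len(kids)   # spread children around a ring
            positions[kid] = (
                px + r * math.sin(phi) * math.cos(theta),
                py + r * math.sin(phi) * math.sin(theta),
                pz + r * math.cos(phi),
            )
            stack.append((kid, r * shrink))       # children get smaller hemispheres
    return positions


tree = {"root": ["a", "b", "c"], "a": ["a1", "a2"]}
layout = hemisphere_layout(tree, "root")
```

The point of the sketch is the cost: each node is placed once during a traversal, with no pairwise repulsion between nodes.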
![Page 47: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/47.jpg)
![Page 48: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/48.jpg)
![Page 49: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/49.jpg)
![Page 50: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/50.jpg)
![Page 51: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/51.jpg)
A NEW IMPLEMENTATION
pip install pyh3
![Page 52: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/52.jpg)
WITH D3
![Page 53: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/53.jpg)
MORE APPLICATIONS
![Page 54: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/54.jpg)
ATTRIBUTION
Instead of
![Page 55: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/55.jpg)
CASCADE PREDICTION
![Page 56: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/56.jpg)
GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!
![Page 57: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/57.jpg)
TEST THE INFLUENTIALS HYPOTHESIS
![Page 58: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/58.jpg)
IMPROVE CONTENT TARGETING
![Page 59: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/59.jpg)
FINDING THE CAUSES OF VIRALITY
Consider fitting a model:
User features, content features, context features, user pair features
![Page 60: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/60.jpg)
UNDER CONSTRUCTION:
Online Regression!
Real-time feature weights tell which features correlate with propagation probabilities!
Drives hypothesis-building!
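
The slides do not say which model is being built, so the following is only a sketch of what online regression could look like here: a logistic regression updated one example at a time with stochastic gradient descent, whose weights can be read off at any moment. The class, the `same_city` feature, and the learning rate are all hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression trained one observation at a time (plain SGD)."""

    def __init__(self, feature_names, learning_rate=0.1):
        self.weights = {name: 0.0 for name in feature_names}
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Estimated probability that the content propagates along this edge."""
        z = self.bias + sum(self.weights[k] * v for k, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, propagated):
        """One gradient step on the log-loss; propagated is 1 or 0."""
        error = propagated - self.predict(features)
        self.bias += self.learning_rate * error
        for k, v in features.items():
            self.weights[k] += self.learning_rate * error * v


model = OnlineLogisticRegression(["same_city"])
# Toy stream: propagation happens exactly when the user pair shares a city.
for _ in range(200):
    model.update({"same_city": 1.0}, 1)
    model.update({"same_city": 0.0}, 0)
```

Inspecting `model.weights` during the stream shows which features currently correlate with propagation; as the slide says, that suggests hypotheses to test rather than establishing causes.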
![Page 61: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/61.jpg)
THE TEAM
![Page 62: DataEngConf: The Science of Virality at BuzzFeed](https://reader035.fdocuments.net/reader035/viewer/2022062900/58eceffc1a28ab5e228b4605/html5/thumbnails/62.jpg)