Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)
-
Upload
anton-konushin -
Category
Education
-
view
1.130 -
download
0
description
Transcript of Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)
![Page 1: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/1.jpg)
Original Plan
1. Algorithms for BIG graphs
• The centrality of centrality
• How to store BIG Graphs (WebGraph Framework)
• Four Degrees of Separation
• Diameter and Radius
2. Big Data and Memory Errors
![Page 2: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/2.jpg)
Slightly Revised Plan
1. Algorithms for BIG graphs
• The centrality of centrality
• Four Degrees of Separation
• Diameter (no Radius)
• How to store BIG Graphs (WebGraph Framework)
2. Big Data and Memory Errors
![Page 3: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/3.jpg)
Four Degrees of Separa.on
![Page 4: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/4.jpg)
Literature • Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains'”) suggested that any two persons are distanced by at most six friendship links
• Just an (op.mis.c) posi.vis.c statement about combinatorial explosion
• Used by John Guare's in his 1990 eponymous play (and 1993 movie by Fred Schepisi)
4
![Page 5: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/5.jpg)
The Sociologists
• M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s)
• A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav.Sci. 1961)
• S. Milgram, An experimental study of the small world problem. (Sociometry, 1969)…
5
![Page 6: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/6.jpg)
Milgram’s ques.on
• “Given two individuals selected randomly from the popula.on, what is the probability that the minimum number of intermediaries required to link them is 0, 1, 2, . . . , k?”
6
![Page 7: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/7.jpg)
Milgram’s ques.on
• What is the distance distribu.on of the acquaintance graph? – how many pairs are friends? – how many are not friends but have a friend in common?
– … • Note on “distance”:
– sociologists measure the degrees of separa.on – as computer scien.sts we measure the graph-‐theore.c distance (just add one)
7
![Page 8: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/8.jpg)
Milgram’s experiment 296 people (star.ng popula.on) asked to dispatch a parcel to a single individual (target) • Target: a Boston stockholder
• Star.ng popula.on selected as follows: – 100 were random Boston inhabitants (group A) – 100 were random Nebraska stockholders (group B) – 96 were random Nebraska inhabitants (group C)
• Rule of the game: parcels could be directly sent only to someone the sender knows personally (“first-‐name acquaintance”)
8
![Page 9: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/9.jpg)
Milgram’s experiment
9
![Page 10: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/10.jpg)
Milgram’s experiment • Actually completed 64 chains (22%)
• 453 intermediaries happened to be involved in the experiments
• Average distance of the completed chains was 6.2, ranging from 5.4 (Boston group) to 6.7 (random Nebraska inhabitants)
• Distance 6.7 was 5.7 degrees of separa.on (thus six degrees of separa.on) 10
![Page 11: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/11.jpg)
New Milgram’s experiment?
• Can we reproduce Milgram’s experiment on a large scale?
• How can one compute or approximate the distance distribu6on of a given huge friendship graph? (such as Facebook)
11
![Page 12: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/12.jpg)
Graph distances and distribu.on
• The distance d(x,y) is the length of the shortest path from x to y – d(x,y) = ∞ if one cannot go from x to y
• For undirected graphs d(x,y) = d(y,x) • For every t, count the number of pairs (x,y) such that d(x,y) = t (distance distribu.on)
• The frac.on of pairs at distance t is (the density func.on of) a distribu.on
12
![Page 13: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/13.jpg)
Previous experiments On-‐line social networks: • 6.6 degrees of separa.on on a one-‐month MSN Messenger communica.on graph (Leskovec and Horvitz [LH08]) – 180 M nodes and 1.3 G arcs
• 3.67 degrees of separa.on in Twioer [KLPM10] – 5 G follows – Is this meaningful? In Twioer, links created without permission at both ends…
13
![Page 14: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/14.jpg)
• 4.74 degrees of separa6on in Facebook (Backstrom et al. [Backstrom+12]) – 712 M people and 69 G friendship links
14
![Page 15: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/15.jpg)
Hyper ANF
• New tool for studying/compu.ng distance distribu.on of very large graphs
• Diffusion-‐based approximated algorithm • Based on WebGraph (Boldi et al [BV04]) and ANF [Palmer et al., 2002]
• It uses HyperLogLog counters [Flajolet et al.,2007] and broadword programming for low-‐level paralleliza.on
15
![Page 16: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/16.jpg)
Compu.ng Neighborhood Func.on • Neighborhood func.on N(t): for each t, number of pairs at distance ≤ t – provides data about how fast the “average ball” around each node expands
• Many breadth-‐first visits: O(mn), need direct access L • Sampling: a frac.on of breadth-‐first visits, very unreliable results on graphs that are not strongly connected, needs direct access L
• Edith Cohen’s [JCSS 1997] size es.ma.on framework: very powerful (strong theore.cal guarantees) but does not scale really well, needs direct access L
16
![Page 17: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/17.jpg)
Alterna.ve: Diffusion
• Basic idea: Palmer et al [PGF02] • Let Bt(x) be the ball of radius t around x (nodes at distance at most t from x)
• Clearly B0(x)={x} • But also Bt+1(x) = ∪x→y Bt(y)∪{x} • So we can compute balls by enumera.ng the arcs x→y and performing unions (on sets)
• The neighborhood func.on at t is given by the sum of the sizes of the balls of radius t, i.e., Bt(x1), Bt(x2), …, Bt(xn) !
17
![Page 18: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/18.jpg)
Easy but Costly: Approximate • Every set needs O(n) bits: overall O(n2). Too many.
• All we need is cardinality es.mators: es.mate # of dis.nct elements (nodes) in very large mul.sets (balls around nodes), presented as massive streams
• Do not es.mate cardinality exactly: use probabilis.c coun.ng to get approximate es.mates
• Wish to choose approximate sets such that unions can be computed quickly
18
![Page 19: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/19.jpg)
ANF and HyperANF ANF [PGF02]: • Diffusion with Mar.n-‐Flajolet (MF) counters (log n + c space)
• MF counters can be combined with OR HyperANF [BRV11a]: • Diffusion with HyperLogLog counters [Flajolet+07] (loglog n space)
• Use broadword programming to combine HyperLogLog counters quickly!
19
![Page 20: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/20.jpg)
HyperLogLog counters
20
![Page 21: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/21.jpg)
21
Rough Intui.on: let x be unknown cardinality of M. Each substream will contain approximately (x/m) different elements. Then, its Max-‐parameter should be close to log2(x/m). Harmonic mean of quan..es 2Max (mZ in our nota.on) likely to be of the order of (x/m). Thus, m2Z should be of the order of x. Constant αm to correct systema.c mul.plica.ve bias in m2Z .
![Page 22: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/22.jpg)
HyperLogLog counters • Instead of actually coun.ng, observe a sta.s.cal feature of a set (stream) of elements
• Feature: # of trailing zeroes of the value of a very good hash func.on
• Keep track of maximum Max (log log n bits!) • # of dis.nct elements ∝ 2Max
• Important: counter of stream AB is simply maximum of counters of A and B!
• With 40 bits can count up to 4 billion with standard devia.on of 6%
22
![Page 23: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/23.jpg)
Many many counters… • To increase confidence, need several counters (usually 2b, b ≥ 4) and take their harmonic mean
• Thus each set is represented by a list of small counters
• How small? Typically 5-‐bit is enough: 232 > 1G. For huge graphs, 7 bits (unlikely > 7 bits!)
• To compute the union of two sets these must be maximized one-‐by-‐one
• Extrac.ng by shi�s, maximizing and pu�ng back by shi�s is unbearably slow
• Mar.n–Flajolet (ANF) just OR the features! • Exponen.al reduc.on in space (from log n to loglog n) but unions get more complicated….
23
![Page 24: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/24.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
![Page 25: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/25.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0
![Page 26: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/26.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
=
![Page 27: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/27.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124 1 0 1 0
![Page 28: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/28.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
1 1 1 1
1 0 1 0
![Page 29: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/29.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
![Page 30: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/30.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
1 1 1 1 0 0 0 0
![Page 31: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/31.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
1 1 1 1 0 0 0 0 -‐
=
![Page 32: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/32.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
1 1 1 1 0 0 0 0 -‐
= 127 0 127 0 1 0 1 0
![Page 33: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/33.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
1 1 1 1 0 0 0 0 -‐
= 127 0 127 0 1 0 1 0
![Page 34: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/34.jpg)
Broadword Programming 8 bits
9 0 3 2
7 3 3 6
1 1 1 1
0 0 0 0 -‐
= 2 125 0 124
0 1 0 1 1 1 1 1
1 1 1 1 0 0 0 0 -‐
= 127 0 127 0 1 0 1 0
![Page 35: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/35.jpg)
Real speed?
• Large size: HADI [Kang et al., 2010] is a Hadoop-‐conscious implementa.on of ANF. Takes 30 minutes on a 200K-‐node graph (on one of the 50 world largest supercomputers).
• HyperANF does the same in couple of minutes on a worksta.on (tens min on a laptop).
35
![Page 36: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/36.jpg)
Experiments
• 24-‐core machine with: – 72 GB of RAM – 1 TB of disk space
![Page 37: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/37.jpg)
Experiments (.me)
Ran experiments on snapshots of facebook • Jan 1, 2007 • Jan 1, 2008 • ... • Jan 1, 2011 • May, 2011 (721.1M nodes, 68.7G edges)
37
![Page 38: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/38.jpg)
Experiments (dataset)
Considered: • �: the whole facebook graph • it / se: only Italian / Swedish users • it+se: only Italian & Swedish users • us: only US users Based on users’ current geo-‐IP loca.on
38
![Page 39: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/39.jpg)
Facebook distance distribu.on
39
4,74
![Page 40: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/40.jpg)
Distance distribu.ons
40
![Page 41: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/41.jpg)
Average Distance
41
![Page 42: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/42.jpg)
Average Degree and Density (�)
42 Density defined as 2m / n (n-‐1)
![Page 43: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/43.jpg)
Diameter (�)
43
![Page 44: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/44.jpg)
How to compute the diameter?
44
Lower bounds byproduct of HyperANF runs What about the exact diameter? Computa.on was rela.vely fast: diameter of Facebook required 10 hours of computa.on on machine with 1TB of RAM (although 256GB would have been sufficient)
![Page 45: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/45.jpg)
The WebGraph Framework
(Boldi & Vigna 2003)
![Page 46: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/46.jpg)
“The” Web graph
• Set U of URLs. Directed graph with: – U as set of nodes – an arc from x to y iff the page with URL x has a hyperlink poin.ng to URL y.
• Transpose graph: reverse all arcs (useful for several advanced ranking algs, i.e., HITS)
• Web graph is HUGE! (≈ 100 M nodes, 1 G links) 46
Page A hyperlink Page B Anchor
![Page 47: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/47.jpg)
Storing the Web graph • What does it mean “to store (part of) the Web graph”?
– being able to know the successors of each node in a reasonable .me (e.g., much less than 1 ms/link)
– having a simple way to know the node corresponding to a URL (e.g., minimal perfect hash)
– having a simple way to know the URL corresponding to a node (e.g., front-‐coded lists).
• W.l.o.g.,assume each URL represented by an integer (0, 1, . . . , n−1, where n = |U|)
47
![Page 48: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/48.jpg)
Adjacency lists
• The set of neighbors of a node • E.g., for a 4 billion page web, need 32 bits per node (each URL represented by an integer)
• Naively, this demands 64 bits to represent each hyperlink
• That’s too much: Web graph with 1 G links would require more than 64Gb
Sec. 20.4
![Page 49: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/49.jpg)
(Some) History • Connec.vity Server [Bharat+98] ≈ 32 bits/link • Algorithms for separable graphs [BBK03] ≈ 5 bits/link
• WebBase [Cho+06] ≈ 5.6 bits/link • LINK database [Randall+02] ≈ 4.5 bits/link • WebGraph [Boldi Vigna 04] ≈ 3 bits/link
hop://webgraph.dsi.unimi.it/
49
![Page 50: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/50.jpg)
Exploit features of Web Graphs 1. Locality: usually most links in a page have naviga.onal nature and so are local (i.e., they point to other pages on the same host). Source and target of those links share a long common prefix. If URLs are sorted lexicographically, the index of source and target are close to each other. E.g.:. – www.stanford.edu/alchemy – www.stanford.edu/biology – www.stanford.edu/biology/plant – www.stanford.edu/biology/plant/copyright – www.stanford.edu/biology/plant/people – www.stanford.edu/chemistry Literature reports that on average 80% of the links are local
![Page 51: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/51.jpg)
Locality Property: with URLs lexicographically ordered, for many arcs x→y we have o�en small |x − y| Improvement: represent the successors y1 < y2 < ·∙·∙·∙ < yk of x using their gaps instead:
y1 − x, y2 − y1 − 1, ... , yk − yk−1 − 1 All integers non-‐nega.ve, except for first one. To avoid this, use following coding for first integer:
51
![Page 52: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/52.jpg)
Naïve Representa.on and with Gaps
52
![Page 53: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/53.jpg)
2. Similarity: Pages that are close to each other (in lexicographic order) tend to have many common successors. This is because many naviga.onal links are the same on the same local cluster of pages (even non-‐naviga.onal links are o�en copied from one page to another on the same host)
Exploit features of Web Graphs
![Page 54: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/54.jpg)
Similarity • Property: URLs that are close in lexicographic order are likely to belong to the same site (probably to the same level of the site hierarchy)
• Consequence: they are likely to have similar successor lists
• Improvement: code successors list by referen6a6on – a reference r which tells us to start from the list of x − r (if r > 0)
– a bit string which tells us which successors must be copied – a list of extra-‐nodes for the remaining nodes Choice of r cri.cal: r chosen as value between 0 and W (fixed parameter, windows size) that gives the best compression. Larger W yields beoer compression rate but slower and more memory-‐consuming compression and decompression 54
![Page 55: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/55.jpg)
Reference compression
55
![Page 56: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/56.jpg)
Differen.al compression Look at copy list as sequence of 1-‐blocks and 0-‐blocks Length of each block decremented by 1, except for 1st block 1st copy block always assumed to be a 1-‐block (so 1st copy block is 0 if copy list starts with 0) Last block always omioed: its value can be deduced from # blocks and outdegree
56
![Page 57: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/57.jpg)
Differen.al compression
57
Copy blocks specify by inclusion/exclusion sublists that must be alterna.vely copied or discarded Using typical codes, such as γ coding, copying en.rely a list costs 1 bit: can code a link in less than 1 bit!
![Page 58: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/58.jpg)
Exploit features of Web Graphs 3. Consecu6vity: Many links within same page are likely to be consecu.ve (w.r.t. lexicographic order). Mainly due to two dis.nct phenomena: 1. Most pages contain sets of naviga.onal links which
point to a fixed level of the hierarchy. Because of hierarchical nature of URLs, links in pages at booom leve l o f h ierarchy tend to be ad jacent lexicographically.
2. In the transposed Web graph: pages that are high in the site hierarchy (e.g., the home page) are pointed to by most pages of the site.
![Page 59: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/59.jpg)
Consecu.vity
• More in general, consecu.vity is the dual of distance-‐one similarity. If a graph is easily compressible using similarity at distance one, its transpose must sport large intervals of consecu.ve links, and viceversa.
59
![Page 60: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/60.jpg)
Intervaliza.on • Exploit consecu.vity among nodes. Instead of compressing them directly using gaps, first isolate subsequences corresponding to integer intervals (only intervals with length above threshold Lmin considered). Compress list of extra nodes as follows: – A list of integer intervals: each interval represented by its le� extreme and its length; le� extremes compressed thru difference with previous right extreme -‐2 , interval lengths decremented by Lmin
– A list of residuals (remaining integers), compressed through differences.
60
![Page 61: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/61.jpg)
Lmin = 2 Interval [15,19]: le� extreme 15-‐15=0, length 5-‐2=3
Interval [23,24]: le� extreme 23-‐19-‐2=2, length 2-‐2=0 Residuals are 13, 203, 315, 1034, which gives 13-‐15=-‐2 (encoded as 2|-‐2|+1=5), 203-‐13-‐1=189, 315-‐203-‐1=111, 1034-‐315-‐1=718.
61
![Page 62: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/62.jpg)
62
Intervaliza.on
![Page 63: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/63.jpg)
Choices in the reference scheme • How do you choose the reference node for x? • You consider the successor lists of the last W nodes, but. . . you do not consider lists which would cause a recursive reference of more than R (maximum reference count) chains. No limit on R: accessing adjacency list of x may require a decompression of all lists up to x!
• Tuning parameter R essen.al for deciding the ra.o compression/speed: small R gives worst compression but shorter (random) access .mes.
• W essen.ally decreases compression .me only. 63
![Page 64: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/64.jpg)
Implementa.on
• Random access to successor lists is implemented lazily through a cascade of iterators.
• Each series of interval and each reference cause the crea.on of an iterator; the same happens for references.
• The results of all iterators are then merged. • The advantage of laziness is that we never have to build an actual list of successors in memory, so the overhead is limited to the number of actual reads, not to the number of successors lists that would be necessary to re-‐create a given one.
64
![Page 65: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/65.jpg)
Access speed • Access speed to a compressed graph is commonly
measured in the .me required to access a link (≈ 300 ns for WebGraph).
• This quan.ty, however, is strongly dependent on the architecture (e.g., cache size), and, even more, on low-‐level op.miza.ons (e.g., hard-‐coding of the first codewords of an instantaneaous code).
• To compare speeds reliably, we need public data, that anyone can access, and a common framework for the low-‐level opera.ons.
• A first step is hop://webgraph-‐data.dsi.unimi.it/. Freely available data to compare compression techniques.
65
![Page 66: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/66.jpg)
Summary WebGraph provides methods to manage very large (web) graphs. It consists of: • ζ codes: par.cularly suitable for storing web graphs (or integers with power law distribu.on in a certain exponen.al range)
• Algorithms for compressing web graphs that exploit referen.a.on, intervaliza.on and ζ codes to provide high compression ra.o
• Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay decompression un.l it is actually necessary
66
![Page 67: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/67.jpg)
Conclusions WebGraph combines new codes, new insights on the structure of the Web graph and new algorithmic techniques to achieve a very high compression ra.o, while s.ll retaining a good access speed (can it be beoer?). So�ware is highly tunable: can experiment with dozens of codes, algorithmic techniques and compression parameters, and there is a large unexplored space of combina.ons. A theore.cally interes.ng ques.on is how to combine op.mally differen.al compression and intervaliza.on: can you do beoer than current greedy approach (first copy as much as you can, then intervalize)? 67
![Page 68: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/68.jpg)
Conclusions The compression techniques are specialized for Web Graphs.
The average link size decreases with the increase of the graph.
The average link access .me increases with the increase of the graph.
ζ-‐codes seems to achieve great trade-‐off between avg. bit size and access .me.
![Page 69: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/69.jpg)
References 1/2 1. [AJB00] Albert, Jeong, and Barabási. Error and aoack tolerance of complex networks. Nature,
406:378–382, 2000. 2. [Backstrom+12] Backstrom, Boldi, Rosa, Ugander, and Vigna. 2012. Four degrees of separa.on.
In Proceedings of the 3rd Annual ACM Web Science Conference (WebSci '12) 3. [BBK03] Blandford, Blelloch and Kash. 2003. Compact representa.ons of separable graphs.
In Proceedings of the 14th ACM-‐SIAM symposium on Discrete algorithms (SODA '03) 4. [Bharat+98] Bharat, Broder, Henzinger, Kumar, and Venkatasubramanian. 1998. The connec.vity
server: fast access to linkage informa.on on the Web. In Proceedings of the seventh interna.onal conference on World Wide Web 7 (WWW7)
5. [Bra01] Brandes. A faster algorithm for betweenness centrality. Journal of Mathema.cal Sociology, 25(2):163–177, 2001.
6. [BRV11a] Boldi, Rosa, and Vigna. 2011. HyperANF: approxima.ng the neighbourhood func.on of very large graphs on a budget. In Proceedings of the 20th interna.onal conference on World wide web (WWW '11)
7. [BRV11b] Boldi, Rosa, and Vigna. 2011. Robustness of social networks: compara.ve results based on distance distribu.ons. In Proceedings of the Third interna.onal conference on Social informa.cs (SocInfo'11)
8. [BV04] Boldi and Vigna. 2004. The webgraph framework I: compression techniques. In Proceedings of the 13th interna.onal conference on World Wide Web (WWW '04)
9. [Bor05] Borga�. Centrality and network flow. Social Networks, 27(1):55–71, 2005. 10. [Cho+06] Cho, Garcia-‐Molina, Haveliwala, Lam, Paepcke, Raghavan, and Wesley. 2006. Stanford
WebBase components and applica.ons. ACM Trans. Internet Technol. 6 69
![Page 70: Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)](https://reader035.fdocuments.net/reader035/viewer/2022081400/554e9b79b4c905fc368b5276/html5/thumbnails/70.jpg)
References 2/2 11. [Crescenzi+12] Crescenzi, Grossi, Habib, Lanzi, and Marino. On compu.ng the diameter of real-‐
world undirected graphs. Theore.cal Computer Science, 2012. 12. [Flajolet+07] Flajolet , Fusy , Gandouet and Meunier. 2008. HyperLogLog: the analysis of a near-‐
op.mal cardinality es.ma.on algorithm. In Proceedings of the 2007 interna.onal conference on Analysis of Algorithms (AOFA ’07)
13. [Firmani+12] Firmani, Italiano, Laura, Orlandi, and Santaroni. 2012. Compu.ng strong ar.cula.on points and strong bridges in large scale graphs. InProceedings of the 11th interna.onal conference on Experimental Algorithms (SEA'12)
14. [Italiano+12] Italiano, Laura, Santaroni, “Finding strong ar.cula.on points and strong bridges in linear .me”, Theore.cal Computer Science, vol. 447, 74–84, 2012.
15. [KLPM10] Kwak, Lee, Park, and Moon. 2010. What is Twioer, a social network or a news media?. In Proceedings of the 19th interna.onal conference on World wide web (WWW '10)
16. [LH08] Leskovec and Horvitz. 2008. Planetary-‐scale views on a large instant-‐messaging network. In Proceedings of the 17th interna.onal conference on World Wide Web (WWW '08)
17. [Mislove+07] Mislove, Marcon, Gummadi, Druschel, and Bhaoacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (IMC '07)
18. [PGF02] Palmer, Gibbons, and Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD interna.onal conference on Knowledge discovery and data mining (KDD '02)
19. [Randall+02] Randall, Stata, Wiener, and Wickremesinghe. 2002. The Link Database: Fast Access to Graphs of the Web. In Proceedings of the Data Compression Conference (DCC '02)
20. [RAK07] Raghavan, Albert, and Kumara. Near linear .me algorithm to detect community structures in large-‐scale networks. Physical Review E (Sta.s.cal, Nonlinear, and So� Maoer Physics), 76(3), 2007.
70