Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and...

19
Approximate Sorting of Data Streams with Limited Storage Farzad Farnoud Eitan Yakoobi Jehoshua Bruck 20th International Computing and Combinatorics Conference (COCOON'14)

Transcript of Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and...

Page 1: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Approximate Sorting of Data Streams with Limited Storage

Farzad Farnoud!Eitan Yakoobi!Jehoshua Bruck

20th International Computing and Combinatorics Conference (COCOON'14)

Page 2: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Source

Data Stream

User with limited storage

Application Rank Correlation Testing (Spearman’s, Kendall’s) Preference Learning

Sorting with Limited Storage

Netflix, Amazon servers

Movies:!1/week

Sensors

Sensor data:!Tbs/day

Page 3: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Problem Statement

s1s2· · ·sn

| {z }· · ·

m storage cells

AlgorithmY: !

approximation of X

| {z }Stream s

X: permutation defined !by the ordering of s

❖ If si<sj then i appears before j in X!❖ To store stream elements, m cells are available; no limitation on other

types of storage!

❖ Algorithm can compare any two elements residing in storage!

❖ Deterministic algorithms, X is a random permutation!

❖ Performance measure: permutation distortion between X and Y

Page 4: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Example❖ Suppose X=253461 (s2<s5<s3<s4<s6<s1) and m=3

| {z }storage cells

s1s2s3s4s5s6

s2<s1 s2<s3<s1 s2<s3<s4 s2<s5<s4 s2<s4<s6

❖ Output, e.g. Y=235146 (s2<s3<s5<s1<s4<s6)

Page 5: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Related Work!

❖ J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

❖ G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited storage. ACM SIGMOD 1998!

❖ Sudipto Guha and Andrew McGregor. Approximate quantiles and the order of the stream. In Proc. 25th ACM Symposium on Principles of Database Systems, pp. 273– 279, 2006.!

❖ A. Chakrabarti, T. S. Jayram, and M. Patrascu. Tight lower bounds for selection in randomly ordered streams. SODA 2008

Page 6: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Performance Measures

❖ Kendall tau distortion:!

★ Counts # of pairwise disagreements ( = # of transpositions of adjacent elements taking X to Y )!

★ Example: d𝜏(312,123)=2 since 312→132→123

❖ Weighted Kendall distortion!

❖ Chebyshev distortion:!

★ Maximum error in the rank of any element!

★ Example: dc(35124,12345)=3

s1s2· · ·sn

| {z }· · ·

m storage cells

Algorithm

Y: !approximation of X

| {z }Stream s

X: permutation defined !by the ordering of s

Page 7: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Universal Bounds: Kendall DistortionTheorem: For any algorithm with storage m=µn and average Kendall distortion D=𝛿n, if 𝛿 is bounded away from zero, then!

x = W0HyL

x = W-1HyL

0.5 1.0 1.5 2.0 2.5y

-5

-4

-3

-2

-1

1x

y = xex

-5 -4 -3 -2 -1 1x

0.5

1.0

1.5

2.0

2.5

y

µ � �:�

����

H(�+ �)�+�

�(�+ R(�))

Page 8: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Universal Bounds: Kendall DistortionTheorem: For any algorithm with storage m=µn and average Kendall distortion D=𝛿n, if 𝛿 is bounded away from zero, then!

µ � �:�

����

H(�+ �)�+�

�(�+ R(�))

❖ As 𝛿 increases, we asymptotically have µ ≥ 1/(e2𝛿)(1+o(1))

0.0 0.5 1.0 1.5 2.0 2.5 3.0d

0.2

0.4

0.6

0.8

1.0m

Page 9: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Universal Bounds: Kendall Distortion Proof outline:!

❖ Let C={y1,y2,y3,…} denote the set of possible Y’s for a given algorithm!

❖ Since algorithm is deterministic, |C| ≤ m!mn-m!

❖ C can be viewed as a rate-distortion code [Wang et al 13, Farnoud et al 14]!

❖ For distortion D, B(D): size of ball of radius D!

❖ If we have B(D), eliminating |C| gives the result

Page 10: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Universal Bounds: Chebyshev Distortion

Theorem: For any algorithm with storage m=µn and average Chebyshev distortion D=𝛿n, with 2/n≤𝛿≤1/2,

µ � �:�

��(H/�)��

��Q

�(�+ R(�))

n =100

n =10000.0 0.1 0.2 0.3 0.4 0.5

d

0.05

0.10

0.15

0.20

0.25

0.30

0.35m

❖ For any fixed 𝛿 as n increases, storage requirement becomes a vanishing fraction of n

❖ Constant distortion needs at least constant μ

Page 11: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Algorithm❖ A simple algorithm: !

❖ Store the first m-1 elements of the stream, s1,…,sm-1, as pivots!

❖ Compare each new element with the pivots!

❖ Example: Suppose X=263415 (s2<s6<s3<s4<s1<s5) and m=3:

| {z }storage cells

s1s2s3s4s5s6

s2<s1 s2<s3<s1 s2<s4<s1 s2<s1<s5 s2<s6<s1

❖ Output Y=234615, d𝜏(263415,234615)=2, dc(263415,234615)=2

Page 12: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Algorithm: Kendall DistortionTheorem: The algorithm asymptotically requires at most a constant factor as much storage as an optimal algorithm for the same Kendall distortion.

Lower bound

Proposed algorithm

0.0 0.5 1.0 1.5 2.0 2.5 3.0d

0.2

0.4

0.6

0.8

1.0m❖ For small 𝛿,

❖ Proposed alg: µ≤1 !

❖ Opt. alg: µ bounded away from 0!

❖ For large 𝛿,

❖ Proposed alg: µ≤1/(2𝛿) (1+o(1)) !

❖ Opt. alg: µ≥1/(e2𝛿) (1+o(1))

Page 13: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Algorithm: Chebyshev DistortionTheorem: If the proposed algorithm has storage m=µn and average Chebyshev distortion D=𝛿n, with 𝛿≤1/2 and 𝛿 bounded away from 0, then μ≤W-1(-𝛿/e)/(𝛿n).

Lower bound

Proposed algorithmn =500

0.1 0.2 0.3 0.4 0.5d

0.2

0.4

0.6

0.8

1.0m❖ If 𝛿 is bounded away from

0, we need at most a constant times as much storage!

❖ For vanishing distortion, better algorithm and/or bounds are needed

Page 14: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Algorithm: Chebyshev DistortionProof outline:!

❖ Given Y, X is unknown only in segments bounded by pivots: If Y=2347156, then Xϵ{2347156, 2437156, 2374165, 2437165,…}

❖ Chebyshev distortion is bounded by the length of the longest segment!

❖ Coupling with a randomly broken stick of length n into m parts!

❖ Statistics of the length of the longest piece are well known [Holst’80]!

❖ 𝛿n = E[dc(X,Y)] ≤ E[length of longest piece of stick] ≤ n ln(me) / m

❖ (-m𝛿) e(-m𝛿) ≤ -𝛿/e

Page 15: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Weighted Kendall: Why?Indegree PageRank Katz Harmonic Closeness

United States United States United States United States Kharqan Rural DistrictList of sovereign states Animal List of sovereign states United Kingdom Talageh-ye SoflaAnimal List of sovereign states United Kingdom World War II Talageh-ye OlyaEngland France France France Greatest Remix Hits (Whigfield album)France Germany Animal Germany Suzhou HSR New TownAssociation football Association football World War II Association football Suzhou Lakeside New CityUnited Kingdom England England English language MepirodipineGermany India Association football China List of MPs . . . M–NCanada United Kingdom Germany Canada List of MPs . . . O–RWorld War II Canada Canada India List of MPs . . . S–TIndia Arthropod India Latin List of MPs . . . U–ZAustralia Insect Australia World War I List of MPs . . . J–LLondon World War II London England List of MPs . . . CJapan Japan Italy Italy List of MPs . . . F–IItaly Australia Japan Russia List of MPs . . . A–BArthropod Village New York City Europe List of MPs . . . D–EInsect Italy English language Australia Esmaili-ye SoflaNew York City Poland China European Union Esmaili-ye OlyaEnglish language English language Poland Catholic Church Levels of organization (ecology)Village Nationa Reg. of Hist. Places World War I London Jacques Moeschal (architect)

Table 1: Top 20 pages of the English version of Wikipedia following five different centrality measures.

are below the threshold 0.8, under which we are supposed to see considerable changes. The correlation of closenesswith harmonic centrality, moreover, is even more pathological: it is the largest correlation.

An obvious observation is that, maybe, the score is lowered by a large discordance in the rest of the rankings.Table 2 tries to verify this intuition by listing the top pages that are associated with the Wordnet category “scientist” inthe Yago2 ontology data [8]. These pages have considerably lower score (their rank is below 300), yet the first threerankings are almost identical. Harmonic centrality is still slightly different (Linnaeus is absent, and actually ranks 21),which tells us that the Kendall’s ⌧ is not giving completely unreasonable data. Nonetheless, closeness continues toprovide apparently random results.

We have actually to delve deep into Wikipedia, beyond rank 100 000 using the category “cocktail” to see that, finally,things settle down (Table 5). While closeness still displays a few quirks, the rankings start to stabilize.

To understand what happens in the very low-rank region, in Table 4 we provide Kendall’s ⌧ as in Table 3, butrestricting the computation to nodes of indegree 1 and 2. As it is immediately evident, after stabilization the low-rankregion is fraught with noise and all correlation values drop significantly.

The very high correlation between closeness and harmonic centrality is, actually, not strange: on the nodes reachablefrom giant connected component of our Wikipedia snapshot (89% of the nodes) they agree almost exactly, as closenessis the reciprocal of a denormalized arithmetic mean, whereas harmonic centrality is the reciprocal of a denormalizedharmonic mean [2]. Even if the remaining 11% of the nodes is completely out of place, making closeness useless,Kendall’s ⌧ tells us that it should be interchangeable with harmonic centrality. At the same time, Kendall’s ⌧ tells usthat indegree is very different from PageRank, which again goes completely against our empirical evidence.

In the rest of the paper, we will try to approach in a systematic manner these problems by defining a new weightedcorrelation index for scores with ties.

4 Definitions and ToolsIn his 1945 paper about ranking with ties [13], Kendall, starting from an observation of Daniels [4], reformulates hiscorrelation index using a definition similar in spirit to that of an inner product, which will be the starting point of ourproposal: we consider two real-valued vectors r and s (to be thought as scores) with indices in [n]; then, let us define

hr, si :=X

i<j

sgn(ri

� rj

) sgn(si

� sj

),

4

Indegree PageRank Katz Harmonic ClosenessCarl Linnaeus Carl Linnaeus Carl Linnaeus Aristotle Noël Bernard (botanist)Aristotle Aristotle Aristotle Albert Einstein Charles CoquelinThomas Jefferson Thomas Jefferson Thomas Jefferson Thomas Jefferson Markku KivinenMargaret Thatcher Charles Darwin Albert Einstein Charles Darwin Angiolo Maria ColomboniPlato Plato Charles Darwin Thomas Edison Om Prakash (historian)Charles Darwin Albert Einstein Karl Marx Alexander Graham Bell Michel MandjesKarl Marx Karl Marx Plato Nikola Tesla Kees PosthumusAlbert Einstein Pliny the Elder Margaret Thatcher William James F. Wolfgang SchnellVladimir Lenin Vladimir Lenin Vladimir Lenin Isaac Newton Christof EbertSigmund Freud Johann Wolfgang von Goethe Isaac Newton Karl Marx Reese ProsserJ. R. R. Tolkien Margaret Thatcher Ptolemy Charles Sanders Peirce David TullochJohann Wolfgang von Goethe Ptolemy Johann Wolfgang von Goethe Noam Chomsky Kim HawtreySpider-Man Sigmund Freud Pliny the Elder Enrico Fermi Patrick J. MillerPliny the Elder Isaac Newton Benjamin Franklin Ptolemy Mikel KingBenjamin Franklin Benjamin Franklin J. R. R. Tolkien John Dewey Albert Perry BrighamLeonardo da Vinci J. R. R. Tolkien Thomas Edison Johann Wolfgang von Goethe Gordon Wagner (economist)Isaac Newton Immanuel Kant Sigmund Freud Bertrand Russell George Henry ChasePtolemy Leonardo da Vinci Immanuel Kant Plato Charles C. HornImmanuel Kant Pierre André Latreille Leonardo da Vinci John von Neumann Paul GoldsteneGeorge Bernard Shaw Thomas Edison Noam Chomsky Vladimir Lenin Robert Stanton Avery

Table 2: Top 20 pages of Wikipedia following five different centrality measures and restricting pages to Yago2 Wordnetcategory “scientist”. The global rank of these items is beyond 300.

Ind. PR Katz Harm. Cl.Indegree 1 0.75 0.90 0.62 0.55PageRank 0.75 1 0.75 0.61 0.56Katz 0.90 0.75 1 0.70 0.62Harmonic 0.62 0.61 0.70 1 0.92Closeness 0.55 0.56 0.62 0.92 1

Table 3: Kendall’s ⌧ between Wikipedia centrality measures.

where

sgn(x) :=

8><

>:

1 if x > 0;0 if x = 0;�1 if x < 0.

Indices of score vectors in summations belong to [n] throughout the paper. Note that

hr,↵si = h↵r, si = sgn(↵)hr, si,

which reminds of the analogous property for inner products, and that hr,�i = h�, ri = 0 if r is constant. Followingthe analogy, we can define

krk :=

phr, ri,

sok↵rk = | sgn(↵)| · krk.

The norm thus defined measures the “untieness” of r: it is zero if and only if r is a constant vector, and it has maximumvalue

pn(n� 1)/2 when all components of r are distinct.

We can now define Kendall’s ⌧ between two vectors r and s with nonnull norm as a normalized inner product, in away formally identical to cosine similarity:

⌧(r, s) :=hr, si

krk · ksk . (1)

We recall that if r and s have no ties, the definition reduces to the classical “normalized difference of concordances anddiscordances”, as the denominator is exactly n(n�1)/2. The definition above is exactly that proposed by Kendall [13],albeit we use a different formalism.

The form of (1) suggests that to obtain a weighted correlation index it would be natural to define a weighted innerproduct

hr, siw

:=

X

i<j

sgn(ri

� rj

) sgn(si

� sj

)w(i, j),

5

Rankings of Wiki pages with different methods: correlations and top-20 pages [Vigna’14]

Ranking of Wikipedia pages:!

❖ Rankings with different methods for important items are all very similar!

❖ This is not reflected by the Kendall tau correlation!

❖ The correlation coefficient is affected by items with low ranks

Page 16: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Distortion with Weighted Kendall

ÊÊÊÊÊÊÊÊÊÊÊÊÊ

Ê

Ê

Ê

100 200 300 400 500Length Hin 1000sL

100

200

300

400

500

600Time Hin msL

❖ Weighted Kendall distortion: [F, Milenkovic 13]!

★ Weight wi for transposing ith and (i+1)st elements!

★ Can be used to penalize mistakes in higher positions more!

★ Example: w1 =2, w2 =1, dw(312,123)=3 since 312→132→123

❖ Axiomatically derived based on Kemeny’s axioms!

❖ No need for ground truth!

❖ Fast computation: O(n lg n)with small constant for monotonically decreasing weights

Page 17: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Distortion with Weighted Kendall❖ What should the ranks of pivots be if errors in higher positions are penalized more?!

❖ Example: Linearly decreasing weight function: wi = 1+ c (n-i-1) with c>0:

• The pivots are chosen more closely at the top to provide better accuracy for highly ranked items!

• Optimum positions for pivots is asymptotically independent from c!

0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9" 1"

m=2"

m=3"

m=4"

m=5"

Page 18: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Conclusion❖ Provided bounds on the performance of algorithms for sorting with

limited storage!

❖ Proposed an algorithm and showed that it is asymptotically optimal for !★ Kendall distortion (up to a constant factor), and!★ Chebyshev distortion (up to a constant factor and for 𝛿 bounded away from

0)!

❖ Future work and open problems:!★ What is the best possible algorithm if only the last m are remembered? !★ Tighter bounds for Kendall distortion!★ Better algorithm/bounds for Chebyshev distortion when 𝛿 is small

Page 19: Approximate Sorting - University of Virginiaffh8x/d/t/Streaming-COCOON-25m.pdf · Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.!

Thank You!