Finding Frequent Items in Probabilistic Data

June 12, 2008, SIGMOD 2008

Qin Zhang, Hong Kong University of Science & Technology
Feifei Li, Florida State University
Ke Yi, Hong Kong University of Science & Technology



Motivation

Identifying frequent items is important:

• network traffic monitoring
• answering iceberg queries
• association rule mining, ...

Processing uncertain data is also important:

• sensor readings
• fuzzy data integration, ...

This paper: find frequent items (heavy hitters) in uncertain data.


Previous work on heavy hitters in certain data

For a parameter φ, an item t is a φ-heavy hitter of a bag W if m_t^W > φ · |W|.

Approximate version of heavy hitters:

• return all the φ-heavy hitters
• do not return any t with m_t^W < (φ − ε) · |W|
• items in between: arbitrary

Prior work:

• Misra and Gries (Sci. Comput. Programming 1982)
• Demaine et al. (ESA 2002)
• Manku & Motwani (VLDB 2002)
• Cormode & Muthukrishnan (VLDB 2002)
• Karp et al. (TODS 2003)
• Cormode et al. (SIGMOD 2004)
• Manjhi et al. (ICDE 2005)
• Lee & Ting (PODS 2006)
• Metwally et al. (TODS 2006)
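The approximate heavy hitter guarantee above can be illustrated with a counter-based summary in the style of Misra and Gries. This is only a minimal sketch, not the code of any cited paper: the function names, the choice of ceil(1/ε) counters, and the sample stream are assumptions made for illustration.

```python
import math

def misra_gries(stream, num_counters):
    """Keep at most num_counters counters; each estimate undercounts the
    true frequency by at most len(stream) / (num_counters + 1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < num_counters:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def approx_heavy_hitters(stream, phi, eps):
    """Report every phi-heavy hitter and no item below (phi - eps) * |W|."""
    stream = list(stream)
    counters = misra_gries(stream, math.ceil(1.0 / eps))
    threshold = (phi - eps) * len(stream)
    return {t for t, est in counters.items() if est > threshold}

# Hypothetical bag W: "a" is a 0.5-heavy hitter, "b" is well below (phi - eps)|W|.
W = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(10)]
result = approx_heavy_hitters(W, phi=0.5, eps=0.1)
print(result)
```

Because the estimates never overcount, an item below (φ − ε)·|W| cannot pass the threshold, while a true φ-heavy hitter loses less than ε·|W| and always passes.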

The Probabilistic Model

The x-tuple model (proposed in the TRIO system):

T1: (a, p(a)), (b, p(b))
T2: (a, p′(a))

In the x-tuple T1, a occurs with Pr. p(a), b occurs with Pr. p(b), and nothing occurs with Pr. 1 − p(a) − p(b).

[Figure: a sequence of x-tuples T1, T2, T3, ..., Ti, ..., Tm; which items are the heavy hitters?]

D: the uncertain database, consisting of T1 and T2.

W: a possible world of D.

W        Pr[W]
∅        (1 − p(a) − p(b)) · (1 − p′(a))
a        p(a) · (1 − p′(a)) + (1 − p(a) − p(b)) · p′(a)
b        p(b) · (1 − p′(a))
aa       p(a) · p′(a)
ab       p(b) · p′(a)

Let R denote a random possible world.

|R|: the number of items in R.

m_t^R: the frequency of item t in R.
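The possible-world semantics can be checked by brute-force enumeration on a tiny database. The sketch below is illustrative only: the representation of an x-tuple as a list of (item, probability) pairs and the concrete numbers p(a) = 0.5, p(b) = 0.3, p′(a) = 0.4 are assumptions, not values from the slides.

```python
from itertools import product

def possible_worlds(xtuples):
    """Yield (world, probability); a world is a sorted tuple of the items that occur."""
    choices = []
    for xt in xtuples:
        # An x-tuple yields one of its alternatives, or nothing (None)
        # with the leftover probability.
        leftover = 1.0 - sum(p for _, p in xt)
        alternatives = list(xt)
        if leftover > 1e-12:
            alternatives.append((None, leftover))
        choices.append(alternatives)
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p  # x-tuples are independent of each other
        world = tuple(sorted(item for item, _ in combo if item is not None))
        yield world, prob

def world_distribution(xtuples):
    """Merge identical worlds, e.g. the two different ways of obtaining the world 'a'."""
    dist = {}
    for world, prob in possible_worlds(xtuples):
        dist[world] = dist.get(world, 0.0) + prob
    return dist

# Hypothetical probabilities for T1 and T2.
D = [[("a", 0.5), ("b", 0.3)], [("a", 0.4)]]
dist = world_distribution(D)
for world, prob in sorted(dist.items()):
    print(world, round(prob, 4))
```

Enumeration is exponential in the number of x-tuples, so it only serves as a reference for small examples.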


Ehh and Phh

An intuitive definition: t is a φ-expected heavy hitter (Ehh) of D if

E[m_t^R] > φ · E[|R|]

Problems with Ehh (finding 0.5-heavy hitters):

D1 = {(a, 0.9), (b, 0.1)}, {(c, 1)}. With Pr. 0.9, R = {a, c}; with Pr. 0.1, R = {b, c}. a is not a 0.5-expected heavy hitter. But a has a 90% chance of being a 0.5-heavy hitter!

D2 = {(a, 0.5), (b, 0.5)}. a is a 0.5-expected heavy hitter, but only has a 50% chance of being a 0.5-heavy hitter.


Follow the "probabilistic thresholding" framework (Dalvi and Suciu, VLDB 2004).

A more rigorous definition: t is a (φ, τ)-probabilistic heavy hitter (Phh) of D if

Pr[m_t^R > φ|R|] > τ
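The two definitions can be compared directly on a small database by enumerating all possible worlds. The following is a sketch under stated assumptions: the helper names are hypothetical, x-tuples are lists of (item, probability) pairs, and φ = 0.48 is used instead of 0.5 so that the strict inequalities in both definitions have no ties on this example.

```python
from itertools import product

def worlds(xtuples):
    """Yield (list_of_items, probability) for every possible world."""
    choices = []
    for xt in xtuples:
        leftover = max(0.0, 1.0 - sum(p for _, p in xt))
        choices.append(list(xt) + [(None, leftover)])
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob > 0.0:
            yield [item for item, _ in combo if item is not None], prob

def is_ehh(xtuples, t, phi):
    """Ehh test: E[m_t^R] > phi * E[|R|], using linearity of expectation."""
    exp_freq = sum(p for xt in xtuples for item, p in xt if item == t)
    exp_size = sum(p for xt in xtuples for _, p in xt)
    return exp_freq > phi * exp_size

def hh_probability(xtuples, t, phi):
    """Pr[m_t^R > phi * |R|], summed over all possible worlds."""
    return sum(prob for world, prob in worlds(xtuples)
               if world.count(t) > phi * len(world))

def is_phh(xtuples, t, phi, tau):
    return hh_probability(xtuples, t, phi) > tau

# a and b are alternatives of one x-tuple; c always occurs.
D1 = [[("a", 0.9), ("b", 0.1)], [("c", 1.0)]]
phi = 0.48
print(is_ehh(D1, "a", phi), hh_probability(D1, "a", phi), is_phh(D1, "a", phi, tau=0.8))
```

On this input the expected-value test rejects a, while the thresholded probability accepts it for any τ below the probability that a occurs.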

Summary of main results

1. Give low-degree polynomial-time algorithms for computing the exact Phh for offline data.

2. Design both space- and time-efficient algorithms to compute the approximate Phh for streaming data, with theoretically guaranteed accuracy and space/time bounds.

3. Establish a tradeoff between the accuracy and the per-tuple processing time of the proposed approximation algorithms.


Algorithm for offline data

m: the number of x-tuples; n: the number of distinct items.

For a single item t, use dynamic programming (DP).

Main idea: calculate Pr[item t appears i times and items other than t appear j times in the first k x-tuples of D] for all i, j, k.

The running time of the DP is O(m^3).

Thus, if we do this for every item, the running time would be O(n m^3).

However, we can reduce the running time by almost a factor of n using the pruning lemma (next slide).
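The DP over (i, j, k) can be sketched as follows for a single item t. This is an illustrative reconstruction from the one-line description above, not the authors' implementation; the function name and the list-of-pairs representation of x-tuples are assumptions.

```python
def phh_probability(xtuples, t, phi):
    """Return Pr[m_t^R > phi * |R|] for item t in O(m^3) time.

    dp[i][j] = Pr[t appears i times and other items appear j times
                  in the x-tuples processed so far].
    """
    m = len(xtuples)
    dp = [[0.0] * (m + 1) for _ in range(m + 1)]
    dp[0][0] = 1.0
    for k, xt in enumerate(xtuples, start=1):
        p_t = sum(p for item, p in xt if item == t)        # this x-tuple yields t
        p_other = sum(p for item, p in xt if item != t)    # it yields another item
        p_none = max(0.0, 1.0 - p_t - p_other)             # it yields nothing
        new = [[0.0] * (m + 1) for _ in range(m + 1)]
        for i in range(k):
            for j in range(k - i):                         # i + j <= k - 1 so far
                cur = dp[i][j]
                if cur == 0.0:
                    continue
                new[i][j] += cur * p_none
                new[i + 1][j] += cur * p_t
                new[i][j + 1] += cur * p_other
        dp = new
    # t is a heavy hitter in a world with i copies of t and i + j items in total.
    return sum(dp[i][j]
               for i in range(m + 1)
               for j in range(m + 1 - i)
               if i > phi * (i + j))

# Hypothetical example: a and b are alternatives of one x-tuple, c always occurs.
D = [[("a", 0.9), ("b", 0.1)], [("c", 1.0)]]
print(phh_probability(D, "a", 0.48))
```

Each of the m steps updates O(m^2) table cells, which gives the O(m^3) bound per item stated on the slide.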


The Pruning Lemma

The following lemma gives an upper bound on Pr[m_t^R > φ|R|] depending on E[m_t^R] / E[|R|]:

Pr[m_t^R > φ|R|] ≤ (2/φ) · E[m_t^R] / E[|R|] + e^{−E[|R|]/8}

The exponential term is small when E[|R|] is large.

Example: if φ = 0.1 and τ = 0.6, then

E[m_t^R] / E[|R|] < 0.02 → Pr[m_t^R > φ|R|] < 0.6

Since Σ_t E[m_t^R] = E[|R|], at most 50 items can have E[m_t^R] / E[|R|] ≥ 0.02, so checking 50 items is enough!
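The pruning step can be sketched as a filter that keeps only the items whose upper bound exceeds τ. This is a minimal illustration that assumes the bound has the form (2/φ) · E[m_t^R] / E[|R|] + e^{−E[|R|]/8}; the function names and the synthetic data are hypothetical.

```python
import math
from collections import defaultdict

def pruning_bounds(xtuples, phi):
    """Upper bound on Pr[m_t^R > phi * |R|] for every item t."""
    expected = defaultdict(float)
    for xt in xtuples:
        for item, p in xt:
            expected[item] += p                 # E[m_t^R] by linearity of expectation
    expected_size = sum(expected.values())      # E[|R|] = sum over t of E[m_t^R]
    small = math.exp(-expected_size / 8.0)      # the "small" exponential term
    return {t: (2.0 / phi) * e / expected_size + small
            for t, e in expected.items()}

def candidates(xtuples, phi, tau):
    """Items that survive pruning: their bound is still above tau."""
    return {t for t, bound in pruning_bounds(xtuples, phi).items() if bound > tau}

# Hypothetical skewed data: one frequent item and 100 items that occur once.
xtuples = ([[("hot", 1.0)] for _ in range(100)]
           + [[(f"rare{i}", 1.0)] for i in range(100)])
cands = candidates(xtuples, phi=0.1, tau=0.6)
print(cands)
```

Only the surviving candidates need the cubic-time DP, which is where the factor-of-n saving comes from.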


The algorithm:

[Figure: all items (a, b, c, d, e, f, g) → compute E[m_t^R] and apply the pruning lemma → candidates (a, c, d, e) → DP → Phhs (c, d)]

Now the running time is O(m^3 / (φτ)).


Approximation algorithms for streaming data

Approximation is necessary since calculating the exact frequency of heavy hitters requires Ω(n) memory (Alon et al.).

An item t is an approximate Phh if Pr[m_t^R > φ|R|] > τ, and not an approximate Phh if Pr[m_t^R > (φ − ε)|R|] < (1 − θ)τ.

We propose algorithms with the following guarantees:

• find all approximate (φ, τ)-Phhs with probability at least 1 − δ
• space: O((1/(ε θ^2 τ)) · log(1/(δφτ)))
• processing time: O((1/(θ^2 τ)) · log(1/(δφτ)) + log(1/ε)), further improved to O(log(1/(δφτε)))


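The basic sampling algorithm

The idea follows the average-median technique of Alon et al. (JCSS 99).

• Maintain l groups G_1, ..., G_i, ..., G_l, where l = 2 ln(1/δ′).
• Each group G_i holds k sampled possible worlds W_i1, W_i2, ..., W_ik, where k = 8/(θ^2 τ).
• Compute the heavy hitters in each W_ij using Space-Saving (Metwally et al., TODS 06).
• For each group, let Y_i^t = (number of possible worlds in G_i in which t is a heavy hitter) / k.
• Finally, let Y^t = Median{Y_1^t, Y_2^t, ..., Y_l^t}.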


The basic sampling algorithm

Y^t > (1 − θ/2)τ → t is a (φ, τ)-Phh.
Y^t ≤ (1 − θ/2)τ → t is not a Phh.

This is correct with probability at least 1 − δ′ for any particular item t.

Setting δ′ = (φτ/4) · δ is enough, since we only need to consider at most 3/(φτ) candidate Phhs, by the Pruning Lemma.
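The average-median estimator and its decision rule can be sketched offline as below. Several simplifications are assumptions of this sketch rather than the slides' method: it samples worlds from a stored database instead of a stream, it counts exactly inside each sampled world instead of running Space-Saving, and all function names are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import median

def sample_world(xtuples, rng):
    """Instantiate every x-tuple independently to obtain one possible world."""
    world = []
    for xt in xtuples:
        u, acc = rng.random(), 0.0
        for item, p in xt:
            acc += p
            if u < acc:
                world.append(item)
                break
    return world

def heavy_hitters(world, phi):
    """Exact phi-heavy hitters of one world (stand-in for Space-Saving)."""
    counts = Counter(world)
    return {t for t, c in counts.items() if c > phi * len(world)}

def basic_sampling_phh(xtuples, phi, tau, theta, delta_prime, seed=0):
    rng = random.Random(seed)
    l = math.ceil(2 * math.log(1 / delta_prime))   # number of groups
    k = math.ceil(8 / (theta ** 2 * tau))          # sampled worlds per group
    groups = []
    for _ in range(l):
        hits = Counter()                           # worlds of this group where t is a HH
        for _ in range(k):
            hits.update(heavy_hitters(sample_world(xtuples, rng), phi))
        groups.append(hits)
    items = {item for xt in xtuples for item, _ in xt}
    # Y^t is the median over the groups of the per-group averages Y_i^t.
    y = {t: median(g[t] / k for g in groups) for t in items}
    return {t for t, v in y.items() if v > (1 - theta / 2) * tau}, y

# Hypothetical data: "a" always occurs 10 times, "b" occurs at most 5 times.
D = [[("a", 1.0)] for _ in range(10)] + [[("b", 0.5)] for _ in range(5)]
phhs, y = basic_sampling_phh(D, phi=0.5, tau=0.8, theta=0.5, delta_prime=0.1)
print(phhs, y)
```

Every arriving item touches all l · k sampled worlds in this scheme, which explains the large per-item processing time.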

The improved sampling algorithm

Problem of the basic algorithm: the processing time for each item is too large, O(1/(θ^2 τ)).

Solution: reduce the sampling rate! Within a group, each arriving item t is distributed among W_i1, W_i2, ..., W_ik as follows:

• with Pr. 1/k^2: send t to all of W_i1, ..., W_ik
• with Pr. (k − 1)/k^2: don't send t
• the rest: select one W_ij uniformly at random and send t to it

The resulting samples are pairwise independent.

Now the processing time per x-tuple is O(log(1/(δφτε))).
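The three-way routing rule can be checked numerically. Under it, each W_ij receives an item with probability 1/k^2 + (1 − 1/k) · (1/k) = 1/k, and two fixed worlds both receive it only in the first case, with probability 1/k^2 = (1/k) · (1/k), which is what pairwise independence requires. The simulation below is a sketch; the function names, k = 4, and the trial count are assumptions.

```python
import random

def route_item(k, rng):
    """Return the indices j of the worlds W_ij that receive the current item."""
    u = rng.random()
    if u < 1.0 / k**2:
        return list(range(k))      # with Pr. 1/k^2: send to all k worlds
    if u < 1.0 / k:
        return []                  # with Pr. (k - 1)/k^2: don't send
    return [rng.randrange(k)]      # the rest: one world, uniformly at random

def estimate_rates(k, trials, seed=0):
    """Empirical Pr[W_i1 receives] and Pr[W_i1 and W_i2 both receive]."""
    rng = random.Random(seed)
    in_first = in_both = 0
    for _ in range(trials):
        targets = route_item(k, rng)
        if 0 in targets:
            in_first += 1
            if 1 in targets:
                in_both += 1
    return in_first / trials, in_both / trials

marginal, joint = estimate_rates(k=4, trials=200_000)
print(marginal, joint)
```

Since an item is sent to a single world in most cases, the expected number of worlds updated per item in a group is constant rather than k.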

Experiments - the data sets

Data sets:

• movie: from the MystiQ project; has a total of approximately 100,000 x-tuples, most of which have only one alternative, but some have a few. It contains probabilistic movie records reflecting the matching probability as a result of data integration from multiple sources.
• wcday46
• zipfu1.60

Experiments - the power of pruning

Effectiveness of the pruning lemma: for skewed data sets, more than 90% of the items are pruned.

Experiments - basic, improved streaming algorithm

Varying m: φ = 0.01, τ = 0.8, δ = 0.05, θ = 0.05, ε = 0.001.

[Figure: running time]

[Figure: memory usage]

Conclusion

We have

• formalized the notion of probabilistic heavy hitters following the commonly adopted possible world query semantics in uncertain databases.

• presented efficient algorithms with theoretical guarantees for both offline and streaming data, under the widely adopted x-relation model.

Future work includes handling distributed data and, more interestingly, supporting other uncertain data models.

The End

Thank you

Q and A

Experiments - basic, improved streaming algorithm

Varying m: φ = 0.01, τ = 0.8, δ = 0.05, θ = 0.05, ε = 0.001.

[Figure: recall]

[Figure: precision]

Experiments - generalized algorithm

Tradeoff in cost/accuracy, varying s: δ = 0.05, θ = 0.05, φ = 0.01, τ = 0.8, ε = 0.001.

For s/k as small as 0.05, the accuracy is already very close to perfect: a 20-fold speedup over the basic scheme!