Finding Frequent Items in Probabilistic Datalifeifei/papers/phh.pdfEhh and Phh An intuitive...

1-1

Finding Frequent Items in Probabilistic Data

June 12, 2008SIGMOD 2008

Qin Zhang, Hong Kong University of Science & Technology

Feifei Li, Florida State University

Ke Yi, Hong Kong University of Science & Technology

2-1

Motivation

Identifying frequent items is important

network traffic monitoring

answering iceberg queries

association rule mining......

2-2

Motivation





Also, processing uncertain data

sensor reading

fuzzy data integration......

2-3

Motivation





This paper: find frequent items in uncertain data

Also, processing uncertain data

sensor reading

fuzzy data integration......

(heavy hitters)

3-1

Previous work on heavy hitters in certain data

For a parameter φ, an item t is the φ-heavy hitter of a bag Wif mW

t > φ · |W |.

3-2



t > φ · |W |.Approximate version heavy hitters

• return all the φ-heavy hitters

• not return those t with mWt < (φ − ε) · |W |

• items in between: arbitrary

3-3



t > φ · |W |.Approximate version heavy hitters

• return all the φ-heavy hitters

• not return those t with mWt < (φ − ε) · |W |

• items in between: arbitrary

Demaine et. al. (ESA 2002)Manku & Motwani (VLDB 2002)

Karp et. al. (TODS 2003)

Cormode & Muthukrishnan (VLDB 2002)

Cormode et. al. (SIGMOD 2004)Manjhi et. al. (ICDE 2005)Lee & Ting (PODS 2006)

Metwally et. al. (TODS 2006)

Misra and Gries (Sci. Comput. Programming 1982)

4-1

The Probabilistic Model

The x-tuple model (proposed in the TRIO system)

T1 (a, p(a)), (b, p(b))T2 (a, p′(a))

x-tuple

Tm

Ti

T2 T1

T3

? heavy hitters

4-2



T1 (a, p(a)), (b, p(b))T2 (a, p′(a))

x-tuple

Tm

Ti

T2 T1

T3

? heavy hitters

a occurs with Pr p(a),b occurs with Pr p(b),nothing occurs with Pr1 − p(a) − p(b).

4-3



T1 (a, p(a)), (b, p(b))T2 (a, p′(a))

D: the uncertain database, consistsof T1 and T2

W : a possible world of D

W Pr[W ]∅ (1 − p(a) − p(b))(1 − p′(a))a p(a)(1 − p′(a))

+(1 − p(a) − p(b))p′(a)b p(b)(1 − p′(a))aa p(a)p′(a)ab p(b)p′(a)

x-tuple

Tm

Ti

T2 T1

T3

? heavy hitters

4-4



T1 (a, p(a)), (b, p(b))T2 (a, p′(a))

D: the uncertain database, consistsof T1 and T2

W : a possible world of D

W Pr[W ]∅ (1 − p(a) − p(b))(1 − p′(a))a p(a)(1 − p′(a))

+(1 − p(a) − p(b))p′(a)b p(b)(1 − p′(a))aa p(a)p′(a)ab p(b)p′(a)

x-tuple

Tm

Ti

T2 T1

T3

? heavy hitters

Let R denote a random possible world|R|: the number of items in R.mR

t : the frenquency of item t in R.

5-1

Ehh and Phh

An intuitive definitiont is a φ-expected heavy hitter (Ehh) of D if

E[mRt ] > φ · E[|R|]

5-2

Ehh and Phh


E[mRt ] > φ · E[|R|]

Problems with Ehh (finding 0.5-heavy hitters. )

D1 = (a, 0.9), (b, 0.1), (c, 1) .a is not a 0.5-expected heavy hitter.But, a has a 90% chance of being a 0.5-heavy hitter!

with Pr. 0.9 R = a, cwith Pr. 0.1 R = b, c

5-3

Ehh and Phh


E[mRt ] > φ · E[|R|]

Problems with Ehh (finding 0.5-heavy hitters. )

D1 = (a, 0.9), (b, 0.1), (c, 1) .a is not a 0.5-expected heavy hitter.But, a has a 90% chance of being a 0.5-heavy hitter!

with Pr. 0.9 R = a, cwith Pr. 0.1 R = b, c

D2 = (a, 0.5), (b, 0.5).a is a 0.5-expected heavy hitter,but only has a 50% chance of being a 0.5-heavy hitter.

5-4

Ehh and Phh


E[mRt ] > φ · E[|R|]

Follow ”probabilistic thresholding” framework(Dalvi and Suciu VLDB 2004)

A more rigorous definitiont is a (φ, τ)-probabilistic heavy hitter (Phh) of D if

Pr[mRt > φ|R|] > τ

5-5

Ehh and Phh


E[mRt ] > φ · E[|R|]

Follow ”probabilistic thresholding” framework(Dalvi and Suciu VLDB 2004)

A more rigorous definitiont is a (φ, τ)-probabilistic heavy hitter (Phh) of D if

Pr[mRt > φ|R|] > τ

6-1

Summary of main results

1. Give low degree polynomial-time algorithms for com-puting the exact PHH for offline data.

2. Design both space and time-efficient algorithms to com-pute the approximate Phh for streaming data, with the-oretically guaranteed accuracy and space/time bounds.

3. Establish a tradeoff between the accuracy and the per-tuple processing time of the proposed approximationalgorithms.

7-1

Algorithm for offline data

For a single item t, dynamic programming(DP).

The running time of DP O(m3).

Thus, if we do this for every item, the running timewould be O(nm3)

m: the number of x-tuples, n: the number of distinct items

Main idea: calculate Pr[item t appears i times and itemsother than t appear j times in the first k x-tuples of D]for all i, j, k.

7-2

Algorithm for offline data

For a single item t, dynamic programming(DP).

The running time of DP O(m3).

However, we can reduce the running time by almosta factor of n using the pruning lemma (next page).

Thus, if we do this for every item, the running timewould be O(nm3)

m: the number of x-tuples, n: the number of distinct items

Main idea: calculate Pr[item t appears i times and itemsother than t appear j times in the first k x-tuples of D]for all i, j, k.

8-1

The following lemma gives an upper bound onPr[mR

t > φ|R|] depending on E[mRt ]/E[|R|].

The Prunning Lemma

Pr[mRt > φ|R|] ≤ 2

φ

E[mRt ]

E[|R|]+ e−

18E[|R|]

(small)

8-2



The Prunning Lemma

Pr[mRt > φ|R|] ≤ 2

φ

E[mRt ]

E[|R|]+ e−

18E[|R|]

(small)

If φ = 0.1, τ = 0.6

E[mRt ]

E[|R|] < 0.02 → Pr[mRt > φ|R|] < 0.6

since∑

t E(mRt ) = E(|R|), checking 50 items is enough!

8-3



The Prunning Lemma

Pr[mRt > φ|R|] ≤ 2

φ

E[mRt ]

E[|R|]+ e−

18E[|R|]

(small)

Now running time is O( 1φτ m3).

The algorithm.

ac

bed

f compute E[mRt ] +

prunning lemma

ac

ed

g DP cd

Phhs

9-1

Approximation algorithms for streaming data

An item t is an approximate Phh ifPr[mR

t > φ|R|] > τ , and not an approximate Phh ifPr[mR

t > (φ− ε)|R|] < (1− θ)τ .

Approximation is necessary since calculating the exact fre-quency of heavy hitters require Ω(n) memory (Alon et. al.)

9-2




t > (φ− ε)|R|] < (1− θ)τ .


finds all approximate (φ, τ)-Phh with proba-bility at least 1− δ.

space O( 1εθ2τ log( 1

δφτ ))

processing time: O( 1θ2τ log( 1

δφτ )+log(1/ε))

We propose algorithms with the following guarantees.

9-3




t > (φ− ε)|R|] < (1− θ)τ .


finds all approximate (φ, τ)-Phh with proba-bility at least 1− δ.

space O( 1εθ2τ log( 1

δφτ ))

processing time: O( 1θ2τ log( 1

δφτ )+log(1/ε))

further improve to : O(log( 1δφτε ))

We propose algorithms with the following guarantees.

10-1

The basic sampling algorithm

The idea follows from Alon et. al. (JCSS 99) Average-Median

G1

Gi

Gl (l = 2 ln(1/δ′))

10-2



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )

10-3



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )

compute HH in each Wij

using Space-Saving (Met-wally et. al. TODS 06)

10-4



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )



10-5



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )



10-6



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )



10-7



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )



Y ti = # possible worlds in which t is a heavy hitter /k

10-8



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )




Y t1

Y ti

Y tl

10-9



G1

Gi

Gl (l = 2 ln(1/δ′))

Wi1

Wi2

Wik (k = 8θ2τ )




Y t1

Y ti

Y tl

Finally, let Y t = MedianY t1 , Y t

2 , . . . Y tl .

11-1


Y t > (1− θ/2)τ −→ t is a (φ, τ)-Phh.Y t ≤ (1− θ/2)τ −→ t is not a Phh.

11-2



Correct with probability at least 1 − δ′ for any particularitem t.

11-3



Correct with probability at least 1 − δ′ for any particularitem t.

Setting δ′ = φτ4 δ is enough since we only need to consider

at most 3φτ candidates Phh, by the Prunning Lemma.

12-1

The improved sampling algorithm

Problem of the basic algorithm: the processing timefor each item is too large! O( 1

θ2τ )

12-2



θ2τ )

Solution: reduce the sampling rate!

Wi1

Wik

Wi2

With Pr. 1/k2

t

t

t

12-3



θ2τ )


Wi1

Wik

Wi2

With Pr. (k − 1)/k2

don’t send

12-4



θ2τ )


Wi1

Wik

Wi2

The rest

t

select a Wij

uniformly atrandom

12-5



θ2τ )


Wi1

Wik

Wi2

The rest

t

select a Wij

uniformly atrandom

pairwise in-dependent

12-6



θ2τ )


Wi1

Wik

Wi2

The rest

t

select a Wij

uniformly atrandom

Now processing time per x-tuple: O(log( 1δφτε )).

pairwise in-dependent

13-1

Experiments - the data sets

Data sets.

movie from the MystiQ project; has a total of ap-proximately 100, 000 x-tuples, most of which haveonly one alternative, but some have a few.

It contains probabilistic movie records reflecting thematching probability as a result of data integrationfrom multiple sources.

wcday46zipfu1.60

14-1

Experiments - the power of prunning

Effectiveness of the pruning lemma, where for skewed datasets, more than 90% of the items are pruned.

15-1

Experiments - basic, improved streaming algorithmVarying m: φ = 0.01, τ = 0.8, δ = 0.05, θ = 0.05, ε = 0.001.

running time

16-1


memory usage

17-1

Conclusion

We have

• formalized the notion of probabilistic heavy hit-ters following the commonly adopted possibleworld query semantics in uncertain databases.

• presented efficient algorithms with theoreticalguarantees for both offline and streaming data,under the widely adopted x-relation model.

Future work includes handling distributed data, andmore interestingly, supporting other uncertain datamodels.

18-1

The End

T HANK YOUQ and A

19-1


recall

20-1


precision

21-1

Experiments - generalized algorithm

Tradeoff in cost/accuracy, varying s, δ = 0.05, θ = 0.05,φ = 0.01, τ = 0.8, ε = 0.001.

For s/k as small as 0.05, its accuracy is already very closeto perfect. 20-fold speedup from the basic scheme!

Finding Frequent Items in Probabilistic Datalifeifei/papers/phh.pdfEhh and Phh An intuitive...

Documents

Transcript of Finding Frequent Items in Probabilistic Datalifeifei/papers/phh.pdfEhh and Phh An intuitive...