One-Pass Wavelet Decompositions of Data Streams, TKDE May 2002. Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss. Presented by James Chan, CSCI 599 Multidimensional Databases, Fall 2003.

One-Pass Wavelet Decompositions of Data Streams

TKDE May 2002

Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss

Presented by James Chan
CSCI 599 Multidimensional Databases

Fall 2003

Outline of Talk

• Introduction

• Background

• Proposed Algorithm

• Experiments

• End Notes

Streaming Applications

• Telephone Call Duration

• Call Detail Record (CDR)

• IP Traffic Flow

• Bank ATM Transactions

• Mission Critical Tasks:
  – Fraud
  – Security
  – Performance Monitoring

Data Stream Model

Data Stream Problem
• One Pass – no backtracking
• Unbounded Data – algorithms require small memory usage
• Continuous – need to run in real time

Figure: Data Streams → Stream Processing Engine (Synopsis in Memory) → (Approximate) Answer

Data Stream Strategies

• Many stream algorithms produce approximate answers and have:
  – Deterministic Bounds: answers are within $\pm\epsilon$
  – Probabilistic Bounds: answers are within $\pm\epsilon$ with high success probability $(1-\delta)$

Data Stream Strategies

• Windows: New elements expire after time t

• Samples: Approximate entire domain with a sample

• Histograms: Partitioning element domain values into buckets (Equi-depth, V-Opt)

• Wavelets: Haar, Construction and maintenance (difficult for large domain)

• Sketch Techniques: estimate of L2 norm of a signal

Proposed Stream Model

Background: Cash Register vs. Aggregate

• Cash Register: each incoming item is a domain value i; it increments or decrements the range value a(i) at that index

• Aggregate: each incoming item is a range value a(i); it sets the value at that index directly

Note: Examples in this paper assume
  – each cash register element adds +1 unit
  – no duplicate elements in aggregate models
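
The two models can be contrasted with a minimal sketch, assuming the simplifications in the note above (cash register items add +1, aggregate indices never repeat). The function names and the (index, value) pair encoding of aggregate items are illustrative assumptions, not notation from the paper.

```python
from collections import Counter

def signal_from_cash_register(items, n):
    """Each item is a domain index i; every arrival adds +1 to a[i]."""
    counts = Counter(items)
    return [counts[i] for i in range(n)]

def signal_from_aggregate(items, n):
    """Each item is an (i, a_i) pair that arrives once and carries the final value."""
    signal = [0] * n
    for i, value in items:
        signal[i] = value
    return signal

# Five events hitting indices 3, 1, 3, 0, 3 versus the same totals pre-aggregated.
print(signal_from_cash_register([3, 1, 3, 0, 3], 4))       # [1, 1, 0, 3]
print(signal_from_aggregate([(3, 3), (0, 1), (1, 1)], 4))  # [1, 1, 0, 3]
```

Both streams describe the same signal a; what differs is how much work the stream processor must do to recover it.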

Background: Cash Register vs. Aggregate

Two models, Cash Register (domain) and Aggregate (range), by arrival order:

• Ordered: easiest (e.g., time series)
• Unordered: general and challenging (e.g., network volume)
• Contiguous: for Cash Register, same as unordered aggregate; for Aggregate, n/a

Background: Wavelet Basics

• Wavelet transforms capture trends in a signal

• Typical transform involves log n passes

• Each pass splits its input into two half-length sets: pairwise averages and pairwise differences (n/2 of each on the first pass)

• Process repeated on the averages

• Output: wavelet coefficients – one overall average and n−1 difference coefficients
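
The averaging-and-differencing passes above can be sketched as follows. This is a minimal, un-normalized illustration; the function name and the convention (first element minus second, divided by 2) are assumptions chosen to match the example signal 2 2 0 2 3 5 4 4 used later in the deck.

```python
def haar_decompose(signal):
    """Un-normalized Haar transform by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    details = []
    while len(averages) > 1:  # log2(n) passes in total
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Differences from this pass are coarser than those collected so far.
        details = [(x - y) / 2 for x, y in pairs] + details
        averages = [(x + y) / 2 for x, y in pairs]
    return averages + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The output has one average followed by n−1 = 7 difference coefficients, as the last bullet states.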

Background: Haar Wavelet Notation

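• High pass filter: $\{1/\sqrt{2},\, -1/\sqrt{2}\}$

• Low pass filter: $\{1/\sqrt{2},\, 1/\sqrt{2}\}$

• Input: signal $a = [a_1, a_2, \ldots, a_n]$

• Basis Coefficients: $d_{j,k} = s_j \langle a, \psi_{j,k} \rangle$

• Coefficients: $[w_0, w_1, \ldots, w_{n-1}] = \{c_{0,0}\} \cup \{d_{j,k}\}$

• Scaling Factor: $s_j = \sqrt{N/2^j}$

• Psi Vectors (un-normalized) $\psi_{j,k}$, for example with N = 8:
  $\psi_{2,0} = [1, 1, -1, -1, 0, 0, 0, 0]$
  $\psi_{3,2} = [0, 0, 1, -1, 0, 0, 0, 0]$
  $\psi_{3,0} = [1, -1, 0, 0, 0, 0, 0, 0]$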

Background: Haar Wavelet Example

Background: Small B Representation

• Most signals in nature have a small-B representation

• Keep only the B largest wavelet coefficients to estimate the energy of the signal

• Additional coefficients do little to reduce the sum squared error (SSE)

Energy: $\|R\|_2^2$

SSE: $\|a - R\|_2^2$, where R is the B-term representation of the signal a
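
A small numerical illustration of the energy and SSE definitions, assuming an orthonormal Haar basis so that the SSE of the best B-term representation equals the energy of the dropped coefficients (Parseval). The function names and the test signal are illustrative choices, not taken from the paper.

```python
import math

def haar_orthonormal(signal):
    """Orthonormal Haar transform: pairwise sums and differences scaled by 1/sqrt(2)."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(x - y) / math.sqrt(2) for x, y in pairs] + details
        approx = [(x + y) / math.sqrt(2) for x, y in pairs]
    return approx + details

def sse_of_top_b(signal, b):
    """SSE of the best B-term representation = energy of the coefficients left out."""
    energies = sorted((c * c for c in haar_orthonormal(signal)), reverse=True)
    return sum(energies[b:])

a = [2, 2, 0, 2, 3, 5, 4, 4]
energy = sum(x * x for x in a)  # 78
for b in (0, 1, 2, 4, 8):
    print(b, round(sse_of_top_b(a, b), 6), "out of", energy)
```

For this signal the two largest coefficients already account for all but 5 of the 78 units of energy, which is the sense in which later coefficients help little.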

Background: Storage

• Highest B wavelet coefficients

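• log N straddling coefficients, one per level of the wavelet tree

Figure: wavelet tree over the original signal [2, 2, 0, 2, 3, 5, 4, 4]. Root average: 2.75. Detail coefficients by level, top to bottom: −1.25; 0.5, 0; 0, −1, −1, 0. Each coefficient contributes with sign + to its left subtree and − to its right subtree.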

Background: Bounding Theorems

Theorem 1
• The top B coefficients can be maintained with O(B + log N) storage (B is the number of wavelet coefficients kept)
• Time to process each new data item is O(B + log N) in the ordered aggregate model

Theorem 2
• Any algorithm that calculates the 2nd largest wavelet coefficient of the signal in the unordered cash register or unordered aggregate model uses at least N/polylog(N) space
• This holds even if:
  – You only care about the existence of the coefficient, not its value
  – You only calculate it up to a factor of 2

Proposed Algorithm: Overview

• Avoid keeping anything of size N (the domain size) in memory

• Estimate wavelet coefficients using sketches, which have size log(N)

• The sketch is maintained in memory and is updated as data entries stream in

What’s a Sketch?

• Distortion Parameter: $\epsilon$ (epsilon)
• Failure Probability: $\delta$ (delta)
• Failure Threshold: $\eta$ (eta)
• Original Signal: a
• Random vector of {−1,+1}s: r
• Seed for r: s
• Atomic Sketch: $\langle a, r \rangle$, the dot product of a and r
• Sketch: $O(\log(N/\delta)/\epsilon^2)$ atomic sketches

• We use the same index j for an atomic sketch, its seed $s^j$, and its random vector $r^j$; a sketch is the collection of atomic sketches over all j

Updating a Sketch

• Cash Register
  – Add the corresponding $r_i^j$ to each of the j atomic sketches

• Aggregate
  – Add the corresponding $a(i)\, r_i^j$ to each of the j atomic sketches

Use a generator G that takes in the seed $s^j$, which is of size log(N), to compute $r_i^j = G(s^j, i)$
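
The update rule can be sketched as below. The `sign` function is a hypothetical stand-in for the generator G: it derives a ±1 value from a hash of (seed, i) so that no length-N vector is ever stored, but it does not provide the independence guarantees of the generator used in the paper. The class and method names are likewise illustrative.

```python
import hashlib

def sign(seed, i):
    """Hypothetical stand-in for r_i^j = G(s^j, i): a +/-1 value recomputed on demand."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1

class Sketch:
    def __init__(self, seeds):
        self.seeds = list(seeds)
        self.atomic = [0] * len(self.seeds)  # atomic[j] tracks <a, r^j>

    def update(self, i, amount=1):
        """Apply a[i] += amount by adding amount * r_i^j to every atomic sketch."""
        for j, seed in enumerate(self.seeds):
            self.atomic[j] += amount * sign(seed, i)

# Cash register: each arriving index adds +1. Aggregate: call update(i, a_i) once per index.
sketch = Sketch(seeds=range(4))
for i in [3, 1, 3, 0, 3]:
    sketch.update(i)
print(sketch.atomic)
```

Because each update touches only the stored atomic sketches, memory stays proportional to the number of seeds rather than to the domain size N.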

Reed Muller Generator

• Pseudo-random generator meeting these requirements:
  – Variables are 4-wise independent
    • Expected value of the product of any 4 distinct r is 0
  – Requires O(log N) space for seeding
  – Performs computation in polylog(N) time

{0} {d} {c} …. {d,c,b,a}

Estimation of Inner Product

Figure: the estimate X is the median of $O(\log(1/\delta))$ means; each mean averages $O(1/\epsilon^2)$ independent atomic-sketch products.

Boosting Accuracy and Confidence

• Improve accuracy to $\epsilon$ by averaging over more independent copies of $\sum_i a_i r_i^j \sum_i b_i r_i^j$ for each average

• Improve confidence by increasing the number of averages to take the median over

Figure: X = median of $O(\log(1/\delta))$ means; each mean is taken over $O(1/\epsilon^2)$ copies of $\sum_i a_i r_i^j \sum_i b_i r_i^j$
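
The median-of-means step can be sketched on its own. This is a minimal illustration with an assumed function name; `values` stands for the per-j products of atomic sketches, and `group_size` plays the role of the number of copies averaged per mean.

```python
import statistics

def median_of_means(values, group_size):
    """Average within consecutive groups, then take the median of the group means.

    The last group may be shorter if len(values) is not a multiple of group_size.
    """
    groups = [values[k:k + group_size] for k in range(0, len(values), group_size)]
    return statistics.median(statistics.mean(g) for g in groups)

# Nine estimates of a quantity near 5, one of them badly off.
estimates = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(median_of_means(estimates, 3))  # 5
print(statistics.mean(estimates))     # about 15.1, dragged up by the outlier
```

Averaging narrows the spread of each group mean, while the median keeps a few bad groups from dominating the final answer.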

Using the sketches

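$\langle a, b \rangle \approx \sum_i a_i r_i^j \sum_i b_i r_i^j$

• We can approximate $\langle a, \psi \rangle$ to maintain the top B coefficients
• Note a point query is $\langle a, e_i \rangle$, where $e_i$ is a vector with a 1 at index i and 0s everywhere else

Atomic sketches in memory: $\langle a, r^j \rangle$, $\langle b, r^j \rangle$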
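
A point query can be sketched as follows, assuming explicit random ±1 vectors for readability (the algorithm itself regenerates entries from seeds instead of storing them). Since the sketch of $e_i$ is just $r_i^j$, each per-j product is $\langle a, r^j \rangle \cdot r_i^j$. All names and parameter values here are illustrative.

```python
import random
import statistics

def make_signs(n, num_atomic, rng):
    """One explicit +/-1 vector r^j per atomic sketch (illustration only)."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(num_atomic)]

def atomic_sketches(signal, signs):
    """atomic[j] = <a, r^j>."""
    return [sum(x * s for x, s in zip(signal, r)) for r in signs]

def point_query(atomic, signs, i, group_size):
    """Estimate a[i] = <a, e_i> as a median of means of <a, r^j> * r^j[i]."""
    products = [t * r[i] for t, r in zip(atomic, signs)]
    means = [statistics.mean(products[k:k + group_size])
             for k in range(0, len(products), group_size)]
    return statistics.median(means)

rng = random.Random(7)
signs = make_signs(n=8, num_atomic=60, rng=rng)
a = [0, 0, 0, 0, 0, 50, 0, 0]
atomic = atomic_sketches(a, signs)
print(point_query(atomic, signs, i=5, group_size=12))
```

With a single nonzero entry every product equals 50, so the query at index 5 is exact; once other entries carry weight, their cross terms add noise that the averaging and median steps have to suppress.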

Maintaining Top B Coefficients

• At most Log N +1 coefficient updates

• May need to approximate straddling coefficients to aggregate with already existing or near variables

• Compare updates with top B and update top B if necessary

Figure: wavelet tree marking which coefficients are updated and which are unaffected when $a_i$ changes.
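
Which coefficients change when $a_i$ is updated can be computed directly. This is a minimal sketch assuming the coefficients are stored as [overall average, coarsest detail, ..., finest details] with the details laid out like an implicit binary tree; that layout and the function name are assumptions for illustration.

```python
def affected_coefficients(i, n):
    """Indices of the log2(n) + 1 coefficients that change when a[i] changes."""
    if n < 2 or n & (n - 1) or not 0 <= i < n:
        raise ValueError("n must be a power of two >= 2 and 0 <= i < n")
    node = (n + i) // 2      # finest-level detail covering position i
    path = []
    while node >= 1:         # walk up to the coarsest detail at index 1
        path.append(node)
        node //= 2
    return [0] + path[::-1]  # index 0 is the overall average

print(affected_coefficients(5, 8))           # [0, 1, 3, 6]
print(len(affected_coefficients(0, 1024)))   # 11, i.e. log2(1024) + 1
```

Only these entries need to be re-estimated and compared against the current top B after an update.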

Algorithm Space and Time

Their algorithm uses polylog(N) space and polylog(N) per-item time to maintain the top B terms (approximately)

Experiments

• Data: one week of AT&T call detail (unordered cash register model)

• Modes
  – Batch: query only between intervals
  – Online: query anytime

• Direct Point: compute the sketch estimate of $\langle e_i, a \rangle$ ($e_i$ is the zero vector except for a 1 at i)

• Direct Wavelets: estimate all supporting coefficients and use wavelet reconstruction to calculate point a(i)

• Top B: reconstruction of the point is done with the Top B coefficients (maintained by sketch)

Top B – Day 0

Top B - 1 Week

(fixed-set) Value updates only; no replacement

Sketch Size on Accuracy

Heavy Hitters

• Points that contribute significantly to the energy of the signal

• Direct point estimates are very accurate for heavy hitters but only rough for non-heavy hitters

• Adaptive Greedy pursuit: by removing the first heavy hitter from the signal, you improve the accuracy of calculating the next biggest heavy hitter

• However an error is introduced with each subtraction of a heavy hitter
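
The subtraction step in adaptive greedy pursuit relies on sketches being linear: the sketch of the signal with a heavy hitter removed can be computed from the sketch alone. The following is a minimal illustration with explicit ±1 vectors and assumed function names; it shows only the removal step, not the full pursuit loop.

```python
import random

def atomic_sketches(signal, signs):
    """atomic[j] = <a, r^j> for each explicit +/-1 vector r^j."""
    return [sum(x * s for x, s in zip(signal, r)) for r in signs]

def remove_point(atomic, signs, i, value):
    """Sketch of (a - value * e_i), computed from the sketch of a alone."""
    return [t - value * r[i] for t, r in zip(atomic, signs)]

rng = random.Random(0)
signs = [[rng.choice((-1, 1)) for _ in range(4)] for _ in range(6)]
a = [0, 10, 0, 3]

# Remove the heavy hitter at index 1 without touching the original signal.
residual = remove_point(atomic_sketches(a, signs), signs, i=1, value=10)
print(residual == atomic_sketches([0, 0, 0, 3], signs))  # True
```

Here the removed value is exact. In the streaming setting it is itself an estimate, so each subtraction leaves a small error in the residual sketch, which is the caveat in the last bullet.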

Processing Heavy Hitters

Adaptive Greedy Pursuit

End Notes

• First provable guarantees for Haar wavelets over data streams

• Can estimate Haar coefficients $c_i = \langle a, \psi_i \rangle$

• Top B is updated in per-item time polynomial in log N and B

• This paper is superseded by "Fast, Small-space algorithms for approximate histogram maintenance", STOC 2002
  – Discusses how to select top B and find heavy hitters