Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku, Rajeev Motwani, Stanford University, VLDB 2002

Approximate Frequency Counts over Data Streams

Gurmeet Singh Manku, Rajeev Motwani

Stanford University

VLDB 2002

Introduction

Data arrive as a continuous “stream”

This differs from a traditional stored DB:
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers

Frequent itemset mining on offline databases vs data streams

Often, level-wise algorithms are used to mine offline databases
- At least 2 database scans are needed
- Ex: Apriori algorithm

Level-wise algorithms cannot be applied to mine data streams
- Cannot go through the data stream multiple times

Challenges of streaming

Single pass

Limited Memory

Enumeration of itemsets

Purpose

Present algorithms for computing frequency counts that exceed a user-specified threshold
- Simple
- Low memory footprint
- Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
- Can be deployed for singleton items and can handle variable-sized sets of items

Main contributions of the paper:
- Proposed 2 algorithms to find the frequent items that appear in a data stream of items
- Extended the algorithms to find frequent itemsets

Notations

Let N denote the current length of the stream

Let s ∈ (0,1) denote the support threshold

Let ε ∈ (0,1) denote the error tolerance, with ε << s

Approximation guarantees

All itemsets whose true frequency exceeds sN are reported

No itemset whose true frequency is less than (s – ε)N is output

Estimated frequencies are less than the true frequencies by at most εN

Example

s = 0.1%

ε should be one-tenth or one-twentieth of s; here ε = 0.01%

Property 1: all elements whose frequency exceeds 0.1% are output.

Property 2: NO element whose frequency is below 0.09% is output.

Elements whose frequency is between 0.09% and 0.1% may or may not be output.

Property 3: estimated frequencies are less than the true frequencies by at most 0.01%.
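To make the percentages concrete, the short sketch below turns s and ε into absolute counts for a given stream length. The stream length of 1,000,000 and the function name `guarantee_bounds` are assumptions chosen only for illustration; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def guarantee_bounds(n, support, epsilon):
    """Translate s and epsilon (given as decimal strings) into counts for a stream of length n."""
    s, eps = Fraction(support), Fraction(epsilon)
    return {
        "must_report_above": s * n,           # Property 1: frequency above sN is always output
        "never_report_below": (s - eps) * n,  # Property 2: frequency below (s - eps)N is never output
        "max_undercount": eps * n,            # Property 3: estimates are low by at most eps*N
    }


# s = 0.1% and eps = 0.01%, as on the slide; N = 1,000,000 is a hypothetical stream length.
bounds = guarantee_bounds(1_000_000, "0.001", "0.0001")
```

With these assumed numbers, elements seen more than 1000 times are always reported, elements seen fewer than 900 times never are, and any reported count is low by at most 100.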

Problem definition

An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties

Goal: devise algorithms that support an ε-deficient synopsis using as little main memory as possible

The Algorithms for frequent Items

Each transaction contains only 1 item

Two algorithms proposed:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm

Features:
- Sampling is used
- The frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported

Sticky Sampling Algorithm

Create counters by sampling

[Figure: a stream of numeric elements, some of which are sampled to create counters]

Sticky Sampling Algorithm

User input:
- Support threshold s
- Error tolerance ε
- Probability of failure δ

Counts are kept in a data structure S

Each entry in S is of the form (e, f), where:
- e: item
- f: frequency of e since the entry was inserted in S

Output the entries in S where f ≥ (s – ε)N

Sticky Sampling Algorithm

r : sampling rate

Sampling an element with rate = r means selecting the element with probability = 1/r

Sticky Sampling Algorithm

Initially, S is empty and r = 1.

For each incoming element e:
    if (e exists in S)
        increment corresponding f
    else {
        sample element with rate r
        if (sampled)
            add entry (e,1) to S
        else
            ignore
    }

Sampling rate

Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ = probability of failure

- The first 2t elements are sampled at rate = 1
- The next 2t at rate = 2
- The next 4t at rate = 4
- and so on…
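The schedule above can be written as a small lookup: the rate doubles each time the stream position passes 2t, 4t, 8t, and so on. This is only a sketch; the function names are hypothetical, and the natural logarithm is assumed for "log".

```python
import math


def sticky_t(support, epsilon, delta):
    """t = (1/eps) * log(1/(s*delta)); natural log is an assumption of this sketch."""
    return (1.0 / epsilon) * math.log(1.0 / (support * delta))


def sampling_rate(position, t):
    """Sampling rate r in force for the element at 1-based stream position `position`."""
    rate, boundary = 1, 2 * t
    while position > boundary:
        # Past the current block: the rate doubles and the next block ends at twice the boundary.
        rate *= 2
        boundary *= 2
    return rate
```

For example, with a hypothetical t = 10, positions 1 to 20 use rate 1, positions 21 to 40 use rate 2, and positions 41 to 80 use rate 4.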

Sticky Sampling Algorithm

Whenever the sampling rate r changes:
    for each entry (e,f) in S
        repeat {
            toss an unbiased coin
            if (toss is not successful)
                diminish f by one
            if (f == 0) {
                delete entry from S
                break
            }
        } until toss is successful
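Putting the per-element step, the rate schedule, and the rate-change adjustment together, a minimal Python sketch of Sticky Sampling might look as follows. The class and method names, the `seed` parameter, and the use of the natural logarithm are assumptions of this sketch, not details taken from the slides.

```python
import math
import random


class StickySampling:
    def __init__(self, support, epsilon, delta, seed=None):
        self.support = support
        self.epsilon = epsilon
        # t = (1/eps) * log(1/(s*delta)); natural log is assumed here.
        self.t = (1.0 / epsilon) * math.log(1.0 / (support * delta))
        self.rate = 1
        self.boundary = 2 * self.t   # the first 2t elements use rate 1
        self.n = 0                   # stream length N so far
        self.counts = {}             # the data structure S: e -> f
        self.rng = random.Random(seed)

    def _on_rate_change(self):
        for e in list(self.counts):
            # Toss an unbiased coin until it succeeds; each failure costs one count.
            while self.rng.random() < 0.5:
                self.counts[e] -= 1
                if self.counts[e] == 0:
                    del self.counts[e]
                    break

    def add(self, e):
        self.n += 1
        if self.n > self.boundary:   # entering the next block: double the rate
            self.rate *= 2
            self.boundary *= 2
            self._on_rate_change()
        if e in self.counts:
            self.counts[e] += 1
        elif self.rng.random() < 1.0 / self.rate:   # sample with probability 1/r
            self.counts[e] = 1

    def query(self):
        threshold = (self.support - self.epsilon) * self.n
        return {e: f for e, f in self.counts.items() if f >= threshold}
```

While the rate is still 1, every new element is sampled and the counts are exact; once the rate grows, a stored count can only be lower than the true frequency, never higher.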

Lossy Counting

The data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions

Buckets are labeled with bucket ids, starting from 1

The current bucket id is bcurrent, whose value is ⌈N/w⌉

fe: true frequency of an element e in the stream seen so far

Each entry in data structure D is of the form (e, f, ∆)
- e: item
- f: frequency of e
- ∆: the maximum possible error in f

Lossy Counting

∆ is the maximum number of times e could have occurred in the first bcurrent – 1 buckets (this value is exactly bcurrent – 1)

Once an entry is inserted into D, its ∆ value is unchanged

Lossy Counting

Initially, D is empty.

Receive element e:
    if (e exists in D)
        increment its frequency (f) by 1
    else
        create a new entry (e, 1, bcurrent – 1)

At a bucket boundary, prune D by the following rule: (e, f, ∆) is deleted if f + ∆ ≤ bcurrent

When the user requests a list of items with threshold s, output those entries in D where f ≥ (s – ε)N

Lossy Counting

function prune(D, b)
    for each entry (e, f, ∆) in D do
        if f + ∆ ≤ b do
            remove the entry from D
        endif
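The insert, prune, and query steps described above fit in a few lines of Python. The following is a minimal sketch; the class name `LossyCounter` and its method names are hypothetical, and a plain dictionary stands in for the data structure D.

```python
import math


class LossyCounter:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width w
        self.n = 0                            # stream length N so far
        self.entries = {}                     # D: e -> [f, delta]

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.width)
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            self.entries[e] = [1, b_current - 1]
        if self.n % self.width == 0:          # bucket boundary: prune D
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > b_current
            }

    def query(self, support):
        threshold = (support - self.epsilon) * self.n
        return {e: f for e, (f, _) in self.entries.items() if f >= threshold}
```

As a usage example, with ε = 0.25 (bucket width 4) and the stream a, b, a, c, a, d, a, e, every element other than `a` is pruned at a bucket boundary, and `query(0.5)` reports only `a` with a count of 4.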

Lossy Counting

[Figure: frequency counts. D is empty at the start, and the counts from the first window are added to it.]

At each window boundary, remove the entries for which f + ∆ ≤ bcurrent

Lossy Counting

[Figure: frequency counts. The counts from the next window are added to the existing frequency counts.]

At each window boundary, remove the entries for which f + ∆ ≤ bcurrent

Lossy Counting

Lossy Counting guarantees that:
- When deletion occurs, bcurrent ≤ εN
- If entry (e, f, ∆) is deleted, then fe ≤ bcurrent, where fe is the actual frequency count of e
- Hence, if entry (e, f, ∆) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN

Sticky Sampling vs Lossy Counting

Sticky Sampling is non-deterministic, while Lossy Counting is deterministic

Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling

Sticky Sampling vs Lossy Counting

Lossy counting is superior by a large factor

Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled

Lossy counting is good at pruning low frequency elements quickly

The more complex case: finding frequent itemsets

The Lossy Counting algorithm is extended to find frequent itemsets

Each transaction in the data stream contains a set of items

Finding frequent itemsets

[Figure: a stream of transactions]

Finding frequent itemsets

Input: a stream of transactions; each transaction is a set of items from I

N: length of the stream

User specifies two parameters: support s, error ε

Challenge:
- handling variable sized transactions
- avoiding explicit enumeration of all subsets of any transaction

Finding frequent itemsets

Data structure D: a set of entries of the form (set, f, ∆)
- set: subset of items

Transactions are divided into buckets of w = ⌈1/ε⌉ transactions
- w: # of transactions in each bucket
- bcurrent: current bucket id

Finding frequent itemsets

Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.

β: # of buckets in main memory in the current batch being processed.

Finding frequent itemsets

D’s operations:

UPDATE_SET updates and deletes entries in D
- For each entry (set, f, ∆), count the occurrences of set in the batch and update the entry
- If the updated entry satisfies f + ∆ ≤ bcurrent, remove it from D

NEW_SET inserts new entries into D
- If a set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, bcurrent – β)

Finding frequent itemsets

If fset ≥ εN, the set has an entry in D

If (set, f, ∆) ∈ D, then the true frequency fset satisfies the inequality f ≤ fset ≤ f + ∆

When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s – ε)N

β needs to be a large number. Any subset of I that occurs β + 1 times or more contributes to D.
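The batch-level UPDATE_SET and NEW_SET operations can be sketched as below. Note the simplification: this sketch enumerates subsets of each transaction naively, up to a hypothetical `max_size`, which is exactly what the actual design tries to avoid. The function name, the dictionary layout of D, and the choice to treat "does not occur in D" as "had no entry before this batch" are all assumptions.

```python
from collections import Counter
from itertools import combinations


def process_batch(D, batch, b_current, beta, max_size=3):
    """Apply UPDATE_SET, then NEW_SET, for one batch of transactions.

    D maps frozenset(itemset) -> [f, delta]; b_current is the bucket id
    reached at the end of the batch; beta is the number of buckets in it.
    max_size (hypothetical) caps the naive subset enumeration.
    """
    batch_counts = Counter()
    for txn in batch:
        items = sorted(set(txn))
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                batch_counts[frozenset(combo)] += 1

    known = set(D)  # itemsets that had an entry before this batch

    # UPDATE_SET: add batch occurrences, then delete entries with f + delta <= b_current.
    for itemset in known:
        entry = D[itemset]
        entry[0] += batch_counts[itemset]
        if entry[0] + entry[1] <= b_current:
            del D[itemset]

    # NEW_SET: insert sets seen at least beta times that had no entry before.
    for itemset, f in batch_counts.items():
        if itemset not in known and f >= beta:
            D[itemset] = [f, b_current - beta]
    return D
```

For instance, with β = 2 and buckets of two transactions, a first batch of {a,b}, {a,b}, {a,c}, {b} creates entries for {a}, {b}, and {a,b}; a second batch of {a}, {a,b}, {c}, {c} then deletes {b} and {a,b} and inserts {c} with ∆ = 2.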

Buffer: repeatedly reads a batch of buckets of transactions into available main memory

Trie: maintains the data structure D

SetGen: generates subsets of item-id’s along with their frequency counts in the current batch
- Not all possible subsets need to be generated
- If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered

Three modules

BUFFER: repeatedly reads in a batch of transactions into available main memory

TRIE: maintains the data structure D

SUBSET-GEN: operates on the current batch of transactions

TRIE and SUBSET-GEN together implement UPDATE_SET and NEW_SET

Module 1 - Buffer

- Read a batch of transactions
- Transactions are laid out one after the other in a big array
- A bitmap is used to remember transaction boundaries
- After reading in a batch, BUFFER sorts each transaction by its item-id’s

[Figure: Window 1 to Window 6 of the stream, held in main memory]
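A rough sketch of the BUFFER layout described above: all transactions share one flat array, and a parallel bitmap marks where each transaction ends. The convention that a 1 bit marks the last item of a transaction, and the function names, are assumptions of this sketch.

```python
def pack_batch(transactions):
    """Lay transactions out back to back; boundary[i] == 1 marks the last item of a transaction."""
    items, boundary = [], []
    for txn in transactions:
        ordered = sorted(txn)            # each transaction is sorted by item-id
        if not ordered:
            continue                     # empty transactions carry no items
        items.extend(ordered)
        boundary.extend([0] * (len(ordered) - 1) + [1])
    return items, boundary


def unpack_batch(items, boundary):
    """Recover the list of sorted transactions from the flat array and bitmap."""
    transactions, current = [], []
    for item, is_last in zip(items, boundary):
        current.append(item)
        if is_last:
            transactions.append(current)
            current = []
    return transactions
```

For example, packing the transactions [3, 1, 2], [5], and [9, 4] gives the array 1 2 3 5 4 9 with bitmap 0 0 1 1 0 1.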

Module 2 - TRIE

[Figure: sets with frequency counts, drawn as a forest of nodes (50, 40, 30, 31, 29, 32, 45, 42) and stored as the array 50 40 30 31 29 45 32 42]

Module 2 – TRIE cont…

- Nodes are labeled {item-id, f, ∆, level}
- Children of any node are ordered by their item-id’s
- Root nodes are also ordered by their item-id’s
- A node represents an itemset consisting of the item-id’s in that node and all its ancestors
- TRIE is maintained as an array of entries of the form {item-id, f, ∆, level} (pre-order of the trees). This is equivalent to a lexicographic ordering of the subsets it encodes.
- No pointers; the levels compactly encode the underlying tree structure.
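To illustrate how the level field alone can encode the tree, the sketch below walks a pre-order array and rebuilds the itemset each entry stands for. It assumes root nodes have level 0; the function name and the tuple layout are hypothetical.

```python
def decode_trie(entries):
    """entries: list of (item_id, f, delta, level) tuples in pre-order; roots are assumed to be level 0.

    Returns a list of (itemset, f, delta), where itemset is the tuple of
    item-ids on the path from the root down to that node.
    """
    path = []
    decoded = []
    for item_id, f, delta, level in entries:
        del path[level:]          # discard nodes that are not ancestors of this entry
        path.append(item_id)
        decoded.append((tuple(path), f, delta))
    return decoded


example = [
    (1, 5, 0, 0),   # {1}
    (2, 3, 0, 1),   # {1, 2}
    (3, 2, 0, 2),   # {1, 2, 3}
    (3, 4, 0, 1),   # {1, 3}
    (2, 6, 0, 0),   # {2}
]
decoded = decode_trie(example)
```

Because children and roots are ordered by item-id, the decoded itemsets come out in lexicographic order, which is what lets the array be merged with a lexicographically ordered stream of subsets in one pass.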

Module 3 - SetGen

[Figure: SetGen reads BUFFER and emits the frequency counts of subsets in lexicographic order (3 3 3 4 2 2 1 2 1 3 1 1)]

SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
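The pruning rule can be sketched with a depth-first enumeration that produces subsets in lexicographic order and only extends a subset when it survives. This is an illustration, not the module's actual implementation: the `survives` callback stands in for "made its way into TRIE", and pruning here only stops lexicographic extensions of a rejected subset.

```python
from collections import Counter


def generate_subsets(batch, survives):
    """Yield (itemset, count) in lexicographic order, skipping extensions of rejected itemsets.

    batch: iterable of transactions (iterables of comparable item-ids).
    survives(itemset, count): hypothetical callback; True if the itemset ends up in TRIE.
    """
    def extend(prefix, suffixes):
        # suffixes: for each transaction containing prefix, the items after prefix's last item
        counts = Counter(item for suffix in suffixes for item in suffix)
        for item in sorted(counts):
            itemset = prefix + (item,)
            yield itemset, counts[item]
            if survives(itemset, counts[item]):
                narrowed = [s[s.index(item) + 1:] for s in suffixes if item in s]
                yield from extend(itemset, narrowed)

    yield from extend((), [tuple(sorted(set(txn))) for txn in batch])


batch = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b",)]
kept = list(generate_subsets(batch, lambda itemset, count: count >= 3))
```

With the threshold of 3 used here, ("a", "b") occurs only twice, so it is reported but not extended, and ("a", "b", "c") is never generated.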

Overall Algorithm

[Figure: SUBSET-GEN reads BUFFER and produces subset frequency counts (3 3 3 4 2 2 1 2 1 3 1 1), which are combined with TRIE to produce the new TRIE]

Conclusion

Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items

Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic

Lossy Counting can be extended to find frequent itemsets

Thank you very much…