Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window
Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen
2005/9/23 Presenter: 董原賓
The Characteristics of data streams
Continuity: Data continuously arrive at a high rate
Expiration: Data can be read only once
Infinity: The total amount of data is unbounded
The requirements of data streams
Time-sensitivity: the model must adapt itself as time passes over the continuous stream
Approximation: answers can only be approximate, because the past data cannot be stored
Adjustability: owing to the unlimited amount of data, a mechanism that adapts itself to the available resources is needed
Definition
t : a time point
p : a time period
Basic block B : the transactions arriving in [t−p+1, t]; the basic block numbered i is denoted Bi
|W| : the length of the window, in basic blocks
θ : the support threshold
[Figure: a basic block B of period p ending at time t, holding the transactions abc, ac, and acd]
Definition
TS : time-sensitive sliding window
TSi : the TS that consists of the |W| consecutive basic blocks from Bi−|W|+1 to Bi
∑i : the number of transactions in TSi
[Figure: with |W| = 3, TSi covers the blocks Bi−2, Bi−1, Bi; their five transactions give ∑i = 5]
Time-sensitive sliding window
The buffer continuously consumes transactions and pours them block-by-block into the system
Two accuracy guarantees are provided: no false dismissal (NFD, recall-oriented) and no false alarm (NFA, precision-oriented)
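The block-by-block buffering above can be sketched as follows. This is a minimal illustration; the class and method names are ours, not from the paper.

```python
from collections import deque

class TimeSensitiveWindow:
    """A sketch of the time-sensitive sliding window TS_i."""

    def __init__(self, w):
        self.w = w              # |W|: window length, in basic blocks
        self.blocks = deque()   # the basic blocks B_{i-|W|+1} .. B_i

    def pour(self, block):
        """Pour one basic block (a list of transactions) into the window.

        Returns the expired block B_{i-|W|}, if any, so that its
        support counts can be discounted afterwards.
        """
        self.blocks.append(block)
        if len(self.blocks) > self.w:
            return self.blocks.popleft()
        return None

    def num_transactions(self):
        """Sigma_i: the number of transactions currently in TS_i."""
        return sum(len(b) for b in self.blocks)
```

Pouring a fourth block into a window of |W| = 3 returns the first block as expired, while ∑i only counts the blocks still inside the window.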
New itemset insertion
Each frequent itemset is inserted into PFP (the potentially frequent-itemset pool) in the form (ID, Items, Acount, Pcount), recording a unique identifier, the items in the itemset, the accumulated count, and the potential count, respectively.
Acount accumulates its exact support counts in the subsequent basic blocks, while Pcount estimates the maximum possible sum of its support counts in the past basic blocks.
New itemset insertion
Check every frequent itemset discovered in Bi to see whether it is already kept in PFP. If it is, we increase its Acount.
Otherwise, we create a new entry in PFP and estimate its Pcount as the largest integer that is less than θ×∑i−1
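The insertion step can be sketched like this. It is our own minimal rendering: itemsets key the pool directly instead of numeric IDs.

```python
import math

def insert_frequent(pfp, frequent_in_bi, theta, sigma_prev):
    """Insert the itemsets found frequent in B_i into PFP (a sketch).

    pfp maps itemset -> [Acount, Pcount]; frequent_in_bi maps each
    itemset frequent in B_i to its support count there; sigma_prev is
    the number of transactions in the previous window TS_{i-1}.
    """
    for itemset, count in frequent_in_bi.items():
        if itemset in pfp:
            pfp[itemset][0] += count   # already kept: grow Acount
        else:
            # Pcount: the largest integer strictly below theta * sigma_prev
            pcount = math.ceil(theta * sigma_prev) - 1
            pfp[itemset] = [count, pcount]
```

With θ = 0.4 and ∑i−1 = 6, a newly seen itemset gets Pcount 2, the largest integer below 2.4, matching the bc entry of the example slides.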
Old itemset update
For each itemset that is in PFP (the potentially frequent-itemset pool) but not frequent in Bi, we compute its support count in Bi by scanning the buffer and use it to update the itemset's Acount.
An itemset is then deleted from PFP if the sum of its Acount and Pcount is less than θ×∑i
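A sketch of this update-and-prune step; the names are ours, and `count_in_block` stands in for the buffer scan.

```python
def update_and_prune(pfp, frequent_in_bi, count_in_block, theta, sigma_i):
    """Refresh PFP entries that were not frequent in B_i, then prune.

    pfp maps itemset -> [Acount, Pcount]; count_in_block(itemset)
    scans the buffered block B_i for the exact support count;
    sigma_i is the number of transactions in TS_i.
    """
    for itemset in list(pfp):
        if itemset not in frequent_in_bi:
            pfp[itemset][0] += count_in_block(itemset)  # update Acount
        acount, pcount = pfp[itemset]
        if acount + pcount < theta * sigma_i:  # can no longer be frequent
            del pfp[itemset]
```

At t = 2 of the example slides (θ = 0.4, ∑2 = 11), d is refreshed to Acount 5 and survives, while bd (3 + 0) and bc (2 + 2) fall below 4.4 and are deleted.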
DT maintenance
Each itemset in PFP is inserted into DT (the discounting table) in the form (B_ID, ID, Bcount), recording the serial number of the current basic block, the itemset's identifier in PFP, and its support count in the current basic block, respectively
Itemset discounting
Since the transactions in Bi−|W| expire, the support counts of the itemsets kept in PFP are discounted accordingly
If the itemset’s Pcount is nonzero, we subtract the support count thresholds of the expired basic blocks from Pcount
If Pcount is already 0, we subtract Bcount of the corresponding entry in DT from Acount
Each entry in DT where B_ID = i−|W| is removed
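The discounting step might look like the sketch below. The data layout is ours, DT entries are (B_ID, itemset, Bcount) triples, and the bookkeeping of which past threshold to subtract from Pcount is simplified to a single value taken from TA.

```python
def discount(pfp, dt, expired_bid, expired_threshold):
    """Discount PFP when basic block B_{i-|W|} expires (a sketch).

    pfp maps itemset -> [Acount, Pcount]; dt is a list of
    (b_id, itemset, bcount) triples; expired_threshold is a
    support count threshold theta * |B_j| taken from TA.
    """
    bcounts = {iset: bc for (b, iset, bc) in dt if b == expired_bid}
    for itemset, entry in pfp.items():
        if entry[1] > 0:
            # Pcount still covers past blocks: shrink the estimate
            entry[1] = max(entry[1] - expired_threshold, 0)
        elif itemset in bcounts:
            # exact counts known: subtract the expired block's Bcount
            entry[0] -= bcounts[itemset]
    # drop the expired block's entries from DT
    dt[:] = [e for e in dt if e[0] != expired_bid]
```

At t = 4 of the example slides, B1 (threshold 2.4) expires: every Pcount is 0, so the B_ID = 1 Bcounts are subtracted from the Acounts and the B_ID = 1 entries disappear from DT.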
TA update
TA (threshold array): the support count threshold θ×|Bi| is computed for each basic block Bi and stored in an entry of the threshold array
Only the |W|+1 most recent entries are maintained in TA
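TA itself is just a bounded buffer of the last |W|+1 block thresholds. A tiny sketch; the block sizes below are the ones implied by the thresholds shown in the example slides.

```python
from collections import deque

def make_ta(w):
    """Threshold array: holds the |W| + 1 most recent block thresholds."""
    return deque(maxlen=w + 1)

theta, w = 0.4, 3
ta = make_ta(w)
for block_size in (6, 5, 3, 4, 5):   # |B1| .. |B5| implied by the example
    ta.append(theta * block_size)    # store the threshold theta * |B_i|
# ta now holds only the thresholds of B2 .. B5
```

After the fifth block, the oldest threshold (2.4 for B1) has been evicted automatically by the `maxlen` bound.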
Running example (|W| = 3, block size = 1 time period, θ = 0.4)
t = 1: B1 arrives with 6 transactions, so θ×|B1| = 2.4 and ∑1 = 6
Mining B1: frequent a(4) b(4) c(3) d(4) bd(3); infrequent ab(2) ac(2) ad(2) bc(1) cd(2) bcd(1) abd(1) acd(1)
New itemset insertion, PFP (ID, Itemset, Acount, Pcount): (1, a, 4, 0), (2, b, 4, 0), (3, c, 3, 0), (4, d, 4, 0), (5, bd, 3, 0)
DT maintenance, DT (B_ID, ID, Bcount): (1, 1, 4), (1, 2, 4), (1, 3, 3), (1, 4, 4), (1, 5, 3)
TA update: TA holds 2.4
t = 2: B2 arrives with 5 transactions, so θ×|B2| = 2 and ∑2 = 11
Mining B2: frequent a(3) b(2) c(2) bc(2); infrequent d(1) ad(1)
New itemset insertion: the Acounts of a, b, c grow to 7, 6, 5; bc is new, with Pcount set to the largest integer below θ×∑1 = 2.4, i.e. 2, giving entry (6, bc, 2, 2)
Old itemset update: d occurs once in B2, so its Acount becomes 5; bd (3 + 0) and bc (2 + 2) fall below θ×∑2 = 4.4 and are deleted
DT maintenance: entries (2, 1, 3), (2, 2, 2), (2, 3, 2), (2, 4, 1) are added
TA update: TA holds 2.4, 2
t = 3: B3 arrives with 3 transactions, so θ×|B3| = 1.2 and ∑3 = 14
Mining B3: frequent b(2); infrequent c(1) d(1) bd(1)
New itemset insertion: b's Acount grows to 8
Old itemset update: c and d each occur once in B3; PFP is now (1, a, 7, 0), (2, b, 8, 0), (3, c, 6, 0), (4, d, 6, 0)
DT maintenance: entries (3, 1, 0), (3, 2, 2), (3, 3, 1), (3, 4, 1) are added
TA update: TA holds 2.4, 2, 1.2
t = 4: B4 arrives with 4 transactions, so θ×|B4| = 1.6 and ∑4 = 12; B1 expires
Itemset discounting: every Pcount is 0, so the B_ID = 1 Bcounts in DT are subtracted from the Acounts, giving (1, a, 3, 0), (2, b, 4, 0), (3, c, 3, 0), (4, d, 2, 0); the B_ID = 1 entries are then removed from DT
Mining B4: frequent a(4) b(4) ab(4); infrequent c(1) abc(1)
New itemset insertion: a and b grow to 7 and 8; ab is new, with Pcount set to the largest integer below θ×∑3 = 5.6, i.e. 5, giving entry (5, ab, 4, 5)
Old itemset update: c occurs once in B4, reaching Acount 4; c (4 + 0) and d (2 + 0) fall below θ×∑4 = 4.8 and are deleted
DT maintenance: entries (4, 1, 4), (4, 2, 4), (4, 5, 4) are added
TA update: TA holds 2.4, 2, 1.2, 1.6
t = 5: B5 arrives with 5 transactions, so θ×|B5| = 2 and ∑5 = 12; B2 expires
Itemset discounting: a and b have Pcount 0, so their B2 Bcounts (3 and 2) are subtracted, leaving Acounts 4 and 6; ab's Pcount is nonzero, so the oldest threshold its estimate covers, 2.4, is subtracted, leaving Pcount 2.6; the B_ID = 2 entries are removed from DT
Mining B5: frequent b(5) c(5) bc(5); infrequent a(1) ab(1) abc(1)
New itemset insertion: b grows to 11; c and bc are new, each with Pcount set to the largest integer below θ×∑4 = 4.8, i.e. 4, giving entries (6, c, 5, 4) and (7, bc, 5, 4)
Old itemset update: a and ab each occur once in B5; PFP is now (1, a, 5, 0), (2, b, 11, 0), (5, ab, 5, 2.6), (6, c, 5, 4), (7, bc, 5, 4)
DT maintenance: entries (5, 1, 1), (5, 2, 5), (5, 5, 1), (5, 6, 5), (5, 7, 5) are added
TA update: TA holds 2, 1.2, 1.6, 2
Self-adjusting discounting table
In this approach, DT often consumes most of the memory space. When the space limit is reached, an efficient way to reduce the DT size without losing too much accuracy is required
Selective adjustment
Each entry DTk takes the new form (B_ID, ID, Bcount, AVG, NUM, Loss)
DTk.AVG keeps the average support count of all the itemsets merged into DTk, DTk.NUM is the number of itemsets in DTk, and DTk.Loss records the loss of merging DTk into DTk−1
The main idea is to select the entry with the smallest merging loss, called the victim, and merge it into the entry above it
Merging loss and new Bcount
For k > 1 and DTk.B_ID = DTk−1.B_ID:
1. Under NFD (no false dismissal) mode
Bcount = min{DTk.Bcount, DTk−1.Bcount}
DTk.Loss = (DTk.NUM × DTk.AVG + DTk−1.NUM × DTk−1.AVG) − min{DTk.Bcount, DTk−1.Bcount} × (DTk.NUM + DTk−1.NUM)
2. Under NFA (no false alarm) mode
Bcount = max{DTk.Bcount, DTk−1.Bcount}
DTk.Loss = max{DTk.Bcount, DTk−1.Bcount} × (DTk.NUM + DTk−1.NUM) − (DTk.NUM × DTk.AVG + DTk−1.NUM × DTk−1.AVG)
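These two cases can be checked with a small helper (a sketch of ours; each entry is a dict with the fields named above).

```python
def merging_loss(dtk, dtk_prev, mode="NFD"):
    """New Bcount and merging loss for merging DT_k into DT_{k-1}.

    Each entry is a dict with keys Bcount, AVG, NUM; both entries are
    assumed to share the same B_ID, as required above.
    """
    total = dtk["NUM"] * dtk["AVG"] + dtk_prev["NUM"] * dtk_prev["AVG"]
    num = dtk["NUM"] + dtk_prev["NUM"]
    if mode == "NFD":   # no false dismissal: under-count, keep the min
        bcount = min(dtk["Bcount"], dtk_prev["Bcount"])
        loss = total - bcount * num
    else:               # NFA, no false alarm: over-count, keep the max
        bcount = max(dtk["Bcount"], dtk_prev["Bcount"])
        loss = bcount * num - total
    return bcount, loss
```

For instance, merging an entry with Bcount 13, AVG 13, NUM 1 into one with 12, 12, 1 under NFD yields the new Bcount 12 and a loss of (13 + 12) − 12×2 = 1.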
Example (DT_limit = 4, under NFD mode)
DT holds four entries of the form (B_ID, ID, Bcount, AVG, NUM, Loss):
(1, 1, 12, 12, 1, ∞), (1, 3, 13, 13, 1, 1), (1, 4, 2, 2, 1, 11), (1, 5, 10, 10, 1, 8)
The first entry has Loss = ∞ since there is no entry above it to merge into; for the second,
Loss = (1×13 + 1×12) − min{13, 12} × (1 + 1) = 25 − 24 = 1
A new entry (1, 6, 10, 10, 1, 0) arrives, with Loss = (1×10 + 1×10) − min{10, 10} × (1 + 1) = 0 against the entry above it. Since DT_limit is exceeded, the victim with the smallest Loss (ID 3) is merged into the entry above it:
Bcount = min{12, 13} = 12, AVG = (12×1 + 13×1) / (1 + 1) = 12.5, NUM = 2
The Loss of the entry below the merged one is recomputed:
Loss = (1×2 + 2×12.5) − min{2, 12} × (1 + 2) = 27 − 6 = 21
DT becomes: (1, "1,3", 12, 12.5, 2, ∞), (1, 4, 2, 2, 1, 21), (1, 5, 10, 10, 1, 8), (1, 6, 10, 10, 1, 0)
Experiment
Environment: Intel Pentium-M 1.3 GHz CPU, 256 MB main memory, Microsoft Windows XP Professional
The datasets streaming into the system are synthesized via the IBM data generator
[Two further experiment slides contain result charts only]