Efficient Query Filtering for Streaming Time Series
Transcript of Efficient Query Filtering for Streaming Time Series
Li Wei Eamonn Keogh Helga Van Herle Agenor Mafra-Neto
Computer Science & Engineering Dept.
University of California – Riverside
Riverside, CA 92521
{wli, eamonn}@cs.ucr.edu
David Geffen School of Medicine
University of California – Los Angeles
Los Angeles, CA 90095
ISCA Technologies
Riverside, CA 92517
ICDM '05
Outline of Talk
• Introduction to time series
• Time series filtering
• Wedge-based approach
• Experimental results
• Conclusions
What are Time Series?
[Figure: an example time series plotted over time steps 0 to 200, with values between 4.5 and 5.5]
Time series are collections of observations made sequentially in time.
4.7275 4.7083 4.6700 4.6600 4.6617 4.6517 4.6500 4.6500 4.6917 4.7533 4.8233 4.8700 4.8783 4.8700 4.8500 4.8433 4.8383 4.8400 4.8433 . . .
Time Series are Everywhere
[Figure: example time series from different sources. Panels: ECG Heartbeat, Image, Stock, Video]
Time Series Data Mining Tasks
• Clustering
• Classification
• Query by Content
• Rule Discovery (example rule with s = 0.5, c = 0.3)
• Motif Discovery
• Anomaly Detection
• Visualization
Time Series Filtering
Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any of the candidates in C.
[Figure: a set of 12 numbered candidates and a time series, with one subsequence labeled "Matches Q11"]
Filtering vs. Querying
[Figure, filtering: 12 numbered queries, a database, and a subsequence labeled "Matches Q11"]
[Figure, querying: a single query (template), a database of 10 numbered sequences, and the best match]
Euclidean Distance Metric
Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance between them is defined as:
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
[Figure: two time series, Q and C, plotted over time steps 0 to 100]
Early Abandon
During the computation, if the current sum of the squared differences between each pair of corresponding data points exceeds r^2, we can safely stop the calculation.
[Figure: Q and C plotted over time steps 0 to 100, marking the point at which the calculation is abandoned]
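A minimal Python sketch of early abandoning, assuming both sequences are equal-length lists of floats. The function name `early_abandon_distance`, the step counter, and the convention of returning `None` on abandonment are illustrative choices, not taken from the talk.

```python
import math

def early_abandon_distance(q, c, r):
    """Return (distance, steps) if D(q, c) <= r, else (None, steps).

    steps counts the point comparisons performed before stopping.
    """
    if len(q) != len(c):
        raise ValueError("sequences must have the same length")
    threshold = r * r  # compare squared sums, so no sqrt is needed in the loop
    total = 0.0
    steps = 0
    for qi, ci in zip(q, c):
        diff = qi - ci
        total += diff * diff
        steps += 1
        if total > threshold:
            return None, steps  # abandoned: the distance already exceeds r
    return math.sqrt(total), steps

q = [0.0, 0.0, 0.0, 0.0]
c = [0.0, 3.0, 0.0, 0.0]
print(early_abandon_distance(q, c, r=1.0))  # (None, 2)
print(early_abandon_distance(q, c, r=5.0))  # (3.0, 4)
```

With the tighter threshold the loop stops after two of the four points, which is where the savings come from.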
Classic Approach
Individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: 12 numbered candidates and the time series]
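The classic approach can be sketched as a sliding window over the time series, with every window compared to every candidate. This is an illustrative sketch only: the name `classic_filter`, the step-1 sliding window, and the assumption that all candidates have the same length are mine, not the talk's.

```python
def classic_filter(stream, candidates, r):
    """Yield (position, candidate_index) for every match within distance r.

    Every sliding window of the stream is compared to every candidate,
    abandoning each comparison as soon as the squared sum exceeds r^2.
    """
    n = len(candidates[0])  # assumption: all candidates share this length
    threshold = r * r
    for start in range(len(stream) - n + 1):
        window = stream[start:start + n]
        for k, cand in enumerate(candidates):
            total = 0.0
            for qi, ci in zip(window, cand):
                total += (qi - ci) ** 2
                if total > threshold:
                    break  # early abandon this candidate
            else:
                yield start, k  # loop finished without abandoning: a match

stream = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 5.0]
candidates = [[1.0, 2.0, 1.0], [0.0, 0.0, 5.0]]
print(list(classic_filter(stream, candidates, r=0.5)))  # [(1, 0), (4, 1)]
```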
Wedge
[Figure: candidates C1 and C2, the upper and lower sequences U and L that enclose them as wedge W, and a query Q compared against W]
Having candidate sequences C1, .., Ck, we can form two new sequences U and L:
Ui = max(C1i, .., Cki)
Li = min(C1i, .., Cki)
They form the smallest possible bounding envelope that encloses sequences C1, .., Ck.
We call the combination of U and L a wedge, and denote a wedge as W:
W = {U, L}
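A lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:
$$\mathrm{LB\_Keogh}(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$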
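The wedge construction and the lower bound can be sketched in a few lines of Python. The function names `build_wedge` and `lb_keogh`, and the optional `r` argument that adds early abandoning, are illustrative assumptions; sequences are taken to be equal-length lists of floats.

```python
import math

def build_wedge(candidates):
    """Return (U, L): the pointwise max and min over equal-length candidates."""
    upper = [max(points) for points in zip(*candidates)]
    lower = [min(points) for points in zip(*candidates)]
    return upper, lower

def lb_keogh(q, wedge, r=None):
    """Lower bound on the distance from q to every sequence inside the wedge.

    If r is given, return None as soon as the bound is known to exceed r
    (early abandoning).
    """
    upper, lower = wedge
    total = 0.0
    for qi, ui, li in zip(q, upper, lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        # points inside the envelope contribute 0
        if r is not None and total > r * r:
            return None
    return math.sqrt(total)

c1 = [1.0, 2.0, 3.0]
c2 = [2.0, 1.0, 3.0]
w = build_wedge([c1, c2])
print(w)                                   # ([2.0, 2.0, 3.0], [1.0, 1.0, 3.0])
print(lb_keogh([0.0, 1.5, 5.0], w, r=1.0)) # None
```

For the query `[0.0, 1.5, 5.0]` the bound is the square root of 5, which is below the true squared distances of 5.25 to `c1` and 8.25 to `c2`, as a lower bound should be.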
Generalized Wedge
• Use W(1,2) to denote that a wedge is built from sequences C1 and C2.
• Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3.
[Figure: C1 (or W1), C2 (or W2), and C3 (or W3) merged into W(1, 2) and then W((1, 2), 3)]
Wedge Based Approach
• Compare the query to the wedge using LB_Keogh.
• If the LB_Keogh function early abandons, we are done.
• Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: 12 numbered candidates and the time series]
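The steps above can be sketched with nested wedges stored as a small tree. This is an illustration under stated assumptions: the `Wedge` class, `leaf`, `merge`, `within_r`, and `matches` are hypothetical names, and descending into child wedges (rather than jumping straight to the individual candidates) is my reading of the hierarchical nesting described for generalized wedges. A leaf wedge has U = L = the candidate, so LB_Keogh on a leaf is the true Euclidean distance.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Wedge:
    upper: List[float]
    lower: List[float]
    children: List["Wedge"] = field(default_factory=list)
    candidate_id: Optional[int] = None  # set only on leaves (single candidates)

def leaf(candidate_id, seq):
    """Wrap a single candidate as a degenerate wedge (U == L)."""
    return Wedge(list(seq), list(seq), [], candidate_id)

def merge(a, b):
    """Combine two wedges into one whose envelope encloses both."""
    upper = [max(x, y) for x, y in zip(a.upper, b.upper)]
    lower = [min(x, y) for x, y in zip(a.lower, b.lower)]
    return Wedge(upper, lower, [a, b])

def within_r(q, wedge, r):
    """LB_Keogh with early abandoning: False once the bound exceeds r."""
    total = 0.0
    for qi, ui, li in zip(q, wedge.upper, wedge.lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > r * r:
            return False
    return True

def matches(q, wedge_set, r):
    """Return the ids of all candidates within distance r of q."""
    found = []
    stack = list(wedge_set)
    while stack:
        w = stack.pop()
        if not within_r(q, w, r):
            continue  # the whole wedge is pruned
        if w.candidate_id is not None:
            found.append(w.candidate_id)  # on a leaf the bound is the true distance
        else:
            stack.extend(w.children)
    return sorted(found)

# Build W((1,2),3) from three candidates and filter one query against it.
c1, c2, c3 = [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [10.0, 10.0, 10.0]
w123 = merge(merge(leaf(1, c1), leaf(2, c2)), leaf(3, c3))
print(matches([0.0, 0.9, 0.0], [w123], r=0.5))  # [2]
```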
Examples of Wedge Merging
[Figure: a query Q compared against W(1, 2), built from C1 (or W1) and C2 (or W2), and against W((1, 2), 3), built from W(1, 2) and C3 (or W3)]
Hierarchical Clustering
[Figure: hierarchical clustering of C1 (or W1) through C5 (or W5), giving one wedge set for each K]
K = 5: W3, W2, W5, W1, W4
K = 4: W3, W(2,5), W1, W4
K = 3: W3, W(2,5), W(1,4)
K = 2: W((2,5),3), W(1,4)
K = 1: W(((2,5),3), (1,4))
Which wedge set to choose?
Which Wedge Set to Choose?
• Test all k wedge sets on a representative sample of data.
• Choose the wedge set which performs the best.
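One way to read this selection step is as a search for the wedge set with the lowest step count on the sample. The sketch below is a simplification under my own assumptions: each wedge is a flat list of candidates (no nesting), a wedge that is not pruned falls back to checking its members one by one, and "performs the best" is taken to mean the fewest point comparisons. The names `lb_steps`, `cost`, and `choose_wedge_set` are hypothetical.

```python
def lb_steps(q, upper, lower, r):
    """Return (pruned, steps) for LB_Keogh with early abandoning."""
    total, steps = 0.0, 0
    for qi, ui, li in zip(q, upper, lower):
        steps += 1
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > r * r:
            return True, steps
    return False, steps

def cost(q, wedge_set, r):
    """Steps needed to filter q against one wedge set.

    Each wedge is a list of candidates. The envelope is recomputed here
    for brevity; a real implementation would build it once.
    """
    steps = 0
    for members in wedge_set:
        if len(members) > 1:
            upper = [max(p) for p in zip(*members)]
            lower = [min(p) for p in zip(*members)]
            pruned, used = lb_steps(q, upper, lower, r)
            steps += used
            if pruned:
                continue  # every member of this wedge is ruled out
        for cand in members:
            steps += lb_steps(q, cand, cand, r)[1]  # U == L == candidate
    return steps

def choose_wedge_set(wedge_sets, sample, r):
    """Return the wedge set with the fewest total steps on the sample."""
    return min(wedge_sets, key=lambda ws: sum(cost(q, ws, r) for q in sample))

c1, c2, c3 = [0.0] * 4, [0.0, 0.1, 0.0, 0.0], [5.0] * 4
k3 = [[c1], [c2], [c3]]   # K = 3: no merging
k2 = [[c1, c2], [c3]]     # K = 2: the two similar candidates share a wedge
k1 = [[c1, c2, c3]]       # K = 1: one wide wedge
sample = [[9.0] * 4, [2.5] * 4]
print([sum(cost(q, ws, 0.5) for q in sample) for ws in (k3, k2, k1)])  # [6, 4, 8]
print(choose_wedge_set([k3, k2, k1], sample, 0.5) is k2)               # True
```

In this toy sample the single wide wedge loses because one query falls inside its envelope and cannot be pruned, which illustrates why the best K depends on the data.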
Upper Bound on Wedge Based Approach
• Wedge based approach seems to be efficient when comparing a set of time series to a large batch dataset.
• But what about streaming time series?
  – Streaming algorithms are limited by their worst case.
  – Being efficient on average does not help.
• Worst case
[Figure: a subsequence compared against C1 (or W1), C2 (or W2), C3 (or W3), W(1, 2), and W((1, 2), 3)]
• Triangular inequality: if dist(W((2,5),3), W(1,4)) >= 2r, a subsequence that is within r of one wedge cannot be within r of the other, so it cannot fail on both wedges.
[Figure: the wedge sets for K = 5 through K = 1 from the clustering slide, with a subsequence less than r from one of the wedges W((2,5),3) and W(1,4), which are at least 2r apart]
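The triangular inequality argument can be written out as follows. This is a sketch for a generic distance d that is symmetric and satisfies the triangle inequality, with S the incoming subsequence and A, B standing for the two wedges. How the talk defines dist between two wedges is not shown in this transcript, so treating it as such a distance is an assumption.

```latex
d(A, B) \le d(A, S) + d(S, B)
\quad\text{so}\quad
\bigl( d(S, A) < r \ \text{and}\ d(S, B) < r \bigr)
\;\Longrightarrow\; d(A, B) < 2r .
```

Taking the contrapositive: if d(A, B) >= 2r, then S cannot be within r of both A and B at once.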
Experimental Setup
• Datasets
  – ECG Dataset
  – Stock Dataset
  – Audio Dataset
• We measure the number of computational steps used by the following methods:
  – Brute force
  – Brute force with early abandoning (classic)
  – Our approach (Atomic Wedgie)
  – Our approach with random wedge set (AWR)
ECG Dataset
• Batch time series
  – 650,000 data points (half an hour's ECG signals)
• Candidate set
  – 200 time series of length 40
  – 4 types of patterns
    • left bundle branch block beat
    • right bundle branch block beat
    • atrial premature beat
    • ventricular escape beat
• r = 0.5
• Upper Bound: 2,120 (8,000 for brute force)

Algorithm       Number of Steps
brute force     5,199,688,000
classic         210,190,006
Atomic Wedgie   8,853,008
AWR             29,480,264

[Figure: bar chart of the number of steps (axis scale ×10^9) for brute force, classic, Atomic Wedgie, and AWR]
Stock Dataset
• Batch time series
  – 2,119,415 data points
• Candidate set
  – 337 time series with length 128
  – 3 types of patterns
    • head and shoulders
    • reverse head and shoulders
    • cup and handle
• r = 4.3
• Upper Bound: 18,048 (43,136 for brute force)

Algorithm       Number of Steps
brute force     91,417,607,168
classic         13,028,000,000
Atomic Wedgie   3,204,100,000
AWR             10,064,000,000

[Figure: bar chart of the number of steps (axis scale ×10^10) for brute force, classic, Atomic Wedgie, and AWR]
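The brute-force figures on the ECG and Stock slides are consistent with a simple count, sketched here. The reading is an assumption on my part: a sliding window of step 1 over the batch series, with every point of every candidate compared for each window. The function name `brute_force_steps` is hypothetical, and datasets processed with a different windowing scheme are not covered by this formula.

```python
def brute_force_steps(num_points, num_candidates, length):
    """Return (steps per subsequence, total steps) for brute force."""
    per_subsequence = num_candidates * length    # cost for each arriving subsequence
    num_subsequences = num_points - length + 1   # sliding window with step 1
    return per_subsequence, per_subsequence * num_subsequences

print(brute_force_steps(650_000, 200, 40))      # ECG:   (8000, 5199688000)
print(brute_force_steps(2_119_415, 337, 128))   # Stock: (43136, 91417607168)
```

The first value of each pair matches the brute-force upper bound quoted on the slide, and the second matches the brute-force row of the table.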
Audio Dataset
• Batch time series
  – 37,583,512 data points (one hour's sound)
• Candidate set
  – 68 time series with length 51
  – 3 species of harmful mosquitoes
    • Culex quinquefasciatus
    • Aedes aegypti
    • Culiseta spp
• Sliding window: 11,025 (1 second)
• Step: 5,512 (0.5 second)
• r = 2
• Upper Bound: 2,929 (6,868 for brute force)

Algorithm       Number of Steps
brute force     57,485,160
classic         1,844,997
Atomic Wedgie   1,144,778
AWR             2,655,816

[Figure: bar chart of the number of steps (axis scale ×10^7) for brute force, classic, Atomic Wedgie, and AWR]
Conclusions
• We introduce the problem of time series filtering.
• Combining similar sequences into a wedge is quite a promising idea.
• We have provided the upper bound of the cost of the algorithm to compute the fastest arrival rate we can guarantee to handle.
Questions?
All datasets used in this talk can be found at
http://www.cs.ucr.edu/~wli/ICDM05/