Time Series Data Mining Group SAXually Explicit Images: Finding Unusual Shapes Li Wei Eamonn Keogh...
Date posted: 20-Dec-2015
Time Series Data Mining Group
SAXually Explicit Images: Finding Unusual Shapes
Li Wei, Eamonn Keogh, Xiaopeng Xi
Computer Science & Engineering Dept., University of California – Riverside
Also available as a Google Tech Talk; search Google for “Keogh SAXually”
http://video.google.com/videoplay?docid=6642985254445857159
Outline
• Shape Representations and Distance Measures
• Shape Discords (i.e. unusual shapes)
• Algorithm
  – Shape Discords Discovery Framework
  – Approximating the Optimal Ordering
• Empirical Evaluation
• Conclusion
Shape Datasets
[Figure: example shape datasets: butterflies, skulls, fruit fly wings, leaves, sea animals, nematodes, arrowheads, petroglyphs, lizards]
Shape Representations
[Figure: a shape and its 1D time series representation; x-axis from 0 to 1200]
We can convert shapes into a 1D signal. By doing this we remove information about scale and offset. But we must deal with rotation in our algorithms …
There are three ways to be rotation invariant: Landmarking, Rotation Invariant Features, Brute Force Rotation Alignment…
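As a rough illustration of converting a shape into a 1D signal, the sketch below walks an ordered boundary and records each point's distance to the centroid. This assumes a centroid-distance representation with z-normalization to remove scale and offset. The function name `centroid_distance_series`, the `n_samples` parameter and the ellipse data are hypothetical and not taken from the talk.

```python
import math

def centroid_distance_series(boundary, n_samples=64):
    """Turn an ordered list of (x, y) boundary points into a 1D signal."""
    cx = sum(x for x, _ in boundary) / len(boundary)
    cy = sum(y for _, y in boundary) / len(boundary)
    # Distance from the centroid to each boundary point (translation invariant).
    dists = [math.hypot(x - cx, y - cy) for x, y in boundary]
    # Resample to a fixed length by taking evenly spaced boundary points.
    step = len(dists) / n_samples
    series = [dists[int(i * step)] for i in range(n_samples)]
    # z-normalize so that overall size (scale) and mean level (offset) vanish.
    mean = sum(series) / n_samples
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n_samples)
    if std == 0:  # a perfect circle has a flat signature
        return [0.0] * n_samples
    return [(v - mean) / std for v in series]

# Hypothetical example: an ellipse, and the same ellipse enlarged and moved.
ellipse = [(4 * math.cos(2 * math.pi * t / 200), 2 * math.sin(2 * math.pi * t / 200))
           for t in range(200)]
moved = [(3 * x + 10, 3 * y - 5) for x, y in ellipse]
```

The two ellipses yield the same signal up to floating-point error. Starting the walk at a different boundary point would circularly shift the signal, which is the rotation problem the following slides address.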
Landmarking*: Find the One “True” Rotation
[Figure: Orangutan, Northern Gray-Necked Owl Monkey and Owl Monkey (species unknown) shapes, labeled A, B, C, compared under Generic Landmark Alignment and Best Rotation Alignment]
• Domain-Specific Landmarking: find some fixed point in your domain, e.g. the nose on a face, the stem of a leaf, the tail of a fish …
• Generic Landmarking: find the major axis of the shape and use that as the canonical alignment.
• Problem: it does not work in many cases.
* Xie, J. AND Heng P. Shape Modeling Using Automatic Landmarking. MICCAI 2005.
Rotation Invariant Features*
• Possible features include: ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, and histograms.
• Problem: in throwing away rotation information, some useful information is inevitably thrown away as well.
[Figure: Red Howler Monkey, Mantled Howler Monkey, Orangutan (juvenile), Borneo Orangutan and Orangutan, comparing the histogram of the distances between two randomly chosen points on the perimeter of the shape with the 1D centroid representation]
* Cardone, A., Gupta, S. K., and Karnik, M. A Survey of Shape Similarity Assessment Algorithms for Product Design and Manufacturing Applications. ASME Journal, 2003
Brute Force Rotation Alignment
[Figure: time series labeled C1 to C13]
• Idea: achieve true rotation invariance by exhaustive brute force search over all possible rotations.
• Rotation Matrix: given a time series C of length n, its possible rotations constitute a rotation matrix $\mathbf{C}$ of size n by n:

$$\mathbf{C} = \begin{Bmatrix} c_1, & c_2, & \ldots, & c_{n-1}, & c_n \\ c_2, & \ldots, & c_{n-1}, & c_n, & c_1 \\ \vdots & & & & \vdots \\ c_n, & c_1, & c_2, & \ldots, & c_{n-1} \end{Bmatrix}$$

• Rotation Invariant Euclidean Distance (RED):

$$RED(Q, C) = \min_{1 \le j \le n} ED(Q, C_j)$$

• Problem: high computational cost.
We have forcefully shown that this is the right representation; see our VLDB 2006 paper.
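A minimal sketch of the brute force rotation-invariant distance described on this slide, assuming equal-length time series stored as Python lists. The helper names `euclidean`, `rotations` and `red` are illustrative and not from the talk.

```python
import math

def euclidean(q, c):
    """Plain Euclidean distance ED between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def rotations(c):
    """Rows of the rotation matrix: all n circular shifts of c."""
    return [c[j:] + c[:j] for j in range(len(c))]

def red(q, c):
    """Rotation-invariant distance: the minimum of ED(q, C_j) over all rows C_j."""
    return min(euclidean(q, c_j) for c_j in rotations(c))
```

Each call to `red` performs n Euclidean distance computations of length n, so a single comparison costs O(n^2); this is the high computational cost noted on the slide.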
Shape Discord
• The shape that is least similar to the other shapes in a dataset (i.e. the shape with the largest distance to its nearest match)
[Figure: example discords: the 1st Discord in a subset of the SQUID Dataset; the 1st Discord, Specimen 20773; and the 1st Discord (Castroville Cornertang) and 2nd Discord (Martindale point)]
Brute Force Shape Discord Discovery
Algorithm [dist, index] = BruteForce_Search(S)
  best_so_far_dist = 0
  best_so_far_index = NaN
  For p = 1 to |S|                                   // for each shape in the dataset (row)
    nearest_neighbor_dist = infinity
    For q = 1 to |S|                                 // find the distance to its nearest neighbor (column)
      If p != q
        If Dist(Cp, Cq) < nearest_neighbor_dist
          nearest_neighbor_dist = Dist(Cp, Cq)
        End
      End
    End
    If nearest_neighbor_dist > best_so_far_dist      // check whether it is a better candidate for discord
      best_so_far_dist = nearest_neighbor_dist
      best_so_far_index = p
    End
  End
  Return [best_so_far_dist, best_so_far_index]
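The brute force search can be sketched in Python as below. The distance function is passed in as a parameter; in the example, plain numbers and absolute difference stand in for shapes and a shape distance, which is purely an assumption to keep the sketch short. Indices are 0-based, unlike the 1-based pseudocode.

```python
import math

def brute_force_discord(shapes, dist):
    """Return (distance, index) of the shape farthest from its nearest neighbor."""
    best_so_far_dist = 0.0
    best_so_far_index = None
    for p in range(len(shapes)):
        nearest_neighbor_dist = math.inf
        for q in range(len(shapes)):
            if p != q:
                d = dist(shapes[p], shapes[q])
                if d < nearest_neighbor_dist:
                    nearest_neighbor_dist = d
        # A shape whose nearest neighbor is farther away is a better discord.
        if nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
    return best_so_far_dist, best_so_far_index

# Toy stand-in data: numbers instead of shapes, absolute difference as distance.
toy_shapes = [1.0, 1.2, 0.9, 5.0, 1.1]
toy_result = brute_force_discord(toy_shapes, lambda a, b: abs(a - b))
```

For m shapes this always makes m(m - 1) distance calls, which is the baseline the later slides try to beat.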
Observations from Brute Force Algorithm
Pairwise distances between six shapes, with each row's nearest-neighbor distance (nn_dist) and its effect on best_so_far_dist (bsf_dist):

      1      2      3      4      5      6     nn_dist   comments
1     -     19.1    5.9   29.3   19.5   18.4     5.9     bsf_dist = 5.9
2    19.1    -     10.1   29.0    2.4    3.0     2.4     2.4 < 5.9
3     5.9   10.1    -     28.1    4.1    8.4     4.1     4.1 < 5.9
4    29.3   29.0   28.1    -     26.7   28.8    26.7     bsf_dist = 26.7
5    19.5    2.4    4.1   26.7    -      3.4     2.4     2.4 < 26.7
6    18.4    3.0    8.4   28.8    3.4    -       3.0     3.0 < 26.7

[Figure: the same distance matrix shown for three search strategies: Brute Force, Early Abandon, Magic]
Order Matters!
Heuristic Shape Discord Discovery
Algorithm [dist, index] = Heuristic_Search(S, Outer, Inner)
  best_so_far_dist = 0
  best_so_far_index = NaN
  For each index p given by heuristic Outer          // consider discord candidates in Outer order
    nearest_neighbor_dist = infinity
    For each index q given by heuristic Inner        // visit other shapes in Inner order
      If p != q
        If Dist(Cp, Cq) < nearest_neighbor_dist
          nearest_neighbor_dist = Dist(Cp, Cq)
        End
        If Dist(Cp, Cq) < best_so_far_dist           // apply early abandoning
          break
        End
      End
    End
    If nearest_neighbor_dist > best_so_far_dist
      best_so_far_dist = nearest_neighbor_dist
      best_so_far_index = p
    End
  End
  Return [best_so_far_dist, best_so_far_index]
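To make the effect of ordering concrete, here is a sketch of the heuristic search that also counts distance calls. It reuses the six-shape distance matrix from the brute force observations slide (0-based indices, so shape 4 is index 3). The sketch records the nearest-neighbor distance before testing for early abandonment, so an abandoned candidate can never be mistaken for a new best. The names `heuristic_discord`, `sequential` and `magic`, and the particular "magic" ordering, are illustrative assumptions.

```python
import math

def heuristic_discord(dist, outer, inner):
    """Discord search with early abandoning; returns (dist, index, distance_calls)."""
    best_dist, best_idx, calls = 0.0, None, 0
    for p in outer:
        nn = math.inf
        for q in inner(p):
            if p == q:
                continue
            d = dist(p, q)
            calls += 1
            nn = min(nn, d)
            if d < best_dist:
                break  # p has a neighbor closer than the best discord so far
        if nn > best_dist:
            best_dist, best_idx = nn, p
    return best_dist, best_idx, calls

# Distance matrix from the slide (diagonal set to 0).
D = [
    [0.0, 19.1, 5.9, 29.3, 19.5, 18.4],
    [19.1, 0.0, 10.1, 29.0, 2.4, 3.0],
    [5.9, 10.1, 0.0, 28.1, 4.1, 8.4],
    [29.3, 29.0, 28.1, 0.0, 26.7, 28.8],
    [19.5, 2.4, 4.1, 26.7, 0.0, 3.4],
    [18.4, 3.0, 8.4, 28.8, 3.4, 0.0],
]
lookup = lambda p, q: D[p][q]

# Index order for both loops: early abandoning alone.
sequential = heuristic_discord(lookup, range(6), lambda p: range(6))
# An idealized ordering: the true discord first, nearest shapes first.
magic = heuristic_discord(lookup, [3, 0, 1, 2, 4, 5],
                          lambda p: sorted(range(6), key=lambda q: D[p][q]))
```

On this matrix, index order with early abandoning needs 20 distance calls and the idealized ordering needs 10, against 30 for brute force; both find shape 4 with distance 26.7.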
Observations from Heuristic Algorithm
Observation 1
• We do not need a perfect outer ordering.
• It is enough that, among the first few shapes examined, there is at least one that has a large distance to its nearest neighbor.
We want the early-abandoning test, Dist(Cp, Cq) < best_so_far_dist, to be true as often as possible!
Observation 2
• We do not need a perfect inner ordering.
• It is enough that, among the first few shapes examined, there is at least one whose distance to the candidate is less than the current value of best_so_far_dist.
Approximating the Optimal Ordering
• Step 1: symbolize the time series
• Step 2: use locality-sensitive hashing to estimate similarity between shapes
• Step 3: generate heuristics for outer and inner loops
• Keep in mind:
  – Outer heuristic (invoked only once) can take at most O(m) to calculate.
  – Inner heuristic (invoked m times) can take at most O(1) to calculate.
SAX: Symbolic Aggregate approXimation
[Figure: a time series converted to the SAX word baabccbc]
• Lower bounds Euclidean distance
• Achieves dimensionality reduction
• There are now well over 100 SAX papers, see www.cs.ucr.edu/~eamonn/SAX.htm
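For readers unfamiliar with SAX, the sketch below shows the usual three steps: z-normalize, average over equal-width frames (PAA), then map each average to a letter using breakpoints that split a standard normal distribution into equiprobable regions. It assumes the series length is a multiple of the word length and that the series is not constant. The function name `sax_word` and its parameters are illustrative.

```python
import bisect
from statistics import NormalDist, mean, pstdev

def sax_word(series, word_length, alphabet="abcd"):
    """Symbolize a numeric series as a SAX word of the given length."""
    mu, sigma = mean(series), pstdev(series)  # sigma == 0 for a constant series
    z = [(v - mu) / sigma for v in series]
    # PAA: mean of each equal-width frame (length assumed divisible by word_length).
    frame = len(z) // word_length
    paa = [mean(z[i * frame:(i + 1) * frame]) for i in range(word_length)]
    # Breakpoints cut the standard normal into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[bisect.bisect(breakpoints, v)] for v in paa)
```

With a four-letter alphabet the breakpoints are approximately -0.67, 0 and 0.67, so a square wave such as `[0, 0, 10, 10, 0, 0, 10, 10]` with word length 4 becomes `adad`.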
Locality-sensitive Hash Function*
• Consider a string s of length w over an alphabet Σ and k indices i_1, …, i_k chosen uniformly at random from the set {1, …, w}. The locality-sensitive hash function f is defined as

$$f(s) = s[i_1], s[i_2], \ldots, s[i_k]$$

• For example, f maps the strings adad and dada to aa and dd.
• Property
  – Strings similar to each other are more likely to be hashed to the same value.
* Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. Locality-Preserving Hashing in Multidimensional Spaces. STOC 1997.
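The hash function above amounts to reading a string at k randomly chosen positions. A small sketch, assuming the k indices are distinct and using 0-based positions; `random_indices` and `lsh` are hypothetical names.

```python
import random

def random_indices(w, k, seed=None):
    """Pick k positions from a word of length w (assumed distinct, 0-based)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(w), k))

def lsh(s, indices):
    """f(s): the characters of s at the chosen positions, in order."""
    return "".join(s[i] for i in indices)
```

With positions 0 and 2 (indices 1 and 3 in the slide's 1-based notation), `lsh("adad", [0, 2])` gives `aa` and `lsh("dada", [0, 2])` gives `dd`. Two strings that differ in only a few positions are likely to agree on the sampled ones, which is why similar strings tend to collide.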
[Figure: images A) and B) with their time series representations (x-axis 0 to 1200) and SAX words adad and daca]
Because of rotations, similar shapes may not be hashed to the same value.
Rotation Invariant Locality-sensitive Hash Function
• Consider a string s of length w over an alphabet Σ and k indices i_1, …, i_k chosen uniformly at random from the set {1, …, w}. The rotation invariant locality-sensitive hash function f′ is defined as

$$f'(s) = \{\, p[i_1], p[i_2], \ldots, p[i_k] \mid p \in \mathrm{LSHIFTS}(s) \,\}$$

where LSHIFTS(s) is the set of all possible left shifts of string s.

Image   SAX Word   Shifts                    LSH Values
A)      adad       adad, dada                aa, dd
B)      daca       daca, acad, cada, adac    dc, aa, cd
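The rotation invariant hash can be sketched by hashing every left shift of the word and collecting the results in a set. The function names are illustrative, and positions are 0-based, so the slide's example corresponds to positions 0 and 2.

```python
def lshifts(s):
    """All left shifts (circular rotations) of the string s."""
    return [s[j:] + s[:j] for j in range(len(s))]

def rotation_invariant_lsh(s, indices):
    """f'(s): the set of hash values over every left shift of s."""
    return {"".join(p[i] for i in indices) for p in lshifts(s)}

hash_a = rotation_invariant_lsh("adad", [0, 2])
hash_b = rotation_invariant_lsh("daca", [0, 2])
```

For the two SAX words in the figure this gives `{"aa", "dd"}` and `{"dc", "aa", "cd"}`, which share the value `aa`. Because every rotation of a word has the same set of left shifts, rotating the input does not change the result.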
Generating Heuristics
[Figure: pipeline from the images (Image 1, Image 4, …) and their time series (Time Series 1, Time Series 4, …) through the Shifts Array and the hash Buckets (aa, ab, ac, ba, …, bd, ca, cd, db, dc, dd) to the m by m Collision Matrix]
• Outer order: examine shapes in ascending order of the largest number of collisions each shape has with any other shape.
• Inner order: when candidate shape i is considered in the outer loop, the inner loop examines the other shapes in descending order of the number of collisions they have with shape i.
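The two orderings can be sketched from a collision matrix as follows. This is a simplified illustration under stated assumptions: a single set of hash positions is used, collisions are counted once per shared bucket, and the orderings are produced by sorting, which ignores the O(m) and O(1) budgets mentioned earlier. All function names and the four example words are hypothetical.

```python
from collections import defaultdict

def collision_matrix(words, indices):
    """Count, for each pair of shapes, the hash buckets they share."""
    buckets = defaultdict(set)
    for shape_id, w in enumerate(words):
        for j in range(len(w)):          # hash every left shift of the word
            p = w[j:] + w[:j]
            buckets["".join(p[i] for i in indices)].add(shape_id)
    m = len(words)
    coll = [[0] * m for _ in range(m)]
    for members in buckets.values():
        for a in members:
            for b in members:
                if a != b:
                    coll[a][b] += 1
    return coll

def outer_order(coll):
    """Shapes with few collisions first: likely discords."""
    return sorted(range(len(coll)), key=lambda i: max(coll[i]))

def inner_order(coll, i):
    """Shapes colliding most with shape i first: likely near neighbors."""
    return sorted((q for q in range(len(coll)) if q != i),
                  key=lambda q: -coll[i][q])

words = ["adad", "daca", "bbbb", "dada"]
coll = collision_matrix(words, [0, 2])
```

Here `bbbb` collides with nothing, so it comes first in the outer order, while for candidate `adad` the inner order starts with `dada`, its rotation.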
The Utility of Shape Discords
[Figure: example discords, including Heliconius melpomene (The Postman) and Heliconius erato (Red Passion Flower Butterfly) with a time series plot (x-axis 0 to 900), a 1st Discord (Dacrocyte), and shapes labeled A to G with groups A, D, E and B, C, F]
The Utility of Heuristic Ordered Search
• Datasets
  – Homogeneous: 10,000 projectile points
  – Heterogeneous: 5,844 objects
• Measurement
  – number of distance function calls by each approach / number of distance function calls by brute force
[Figure: two plots, Projectile Points (m from 50 to 10,000) and Heterogeneous Dataset (m from 50 to 5,844). x-axis: Number of Time Series in database (m); y-axis: 0 to 1; series: Brute Force, Random, Our method]
Just using early abandoning (which is an original idea in this context) is 3 or 4 orders of magnitude faster than brute force; the Magic heuristic is a further order of magnitude faster.
Conclusion & Future Work
• We define shape discords.
• We introduce a heuristic-based algorithm to efficiently find discords and demonstrate its utility in various domains.
• Future Work
  – Investigate image discords using not only shape but also texture/color
  – Conduct field studies of shape discord discovery in anthropology and archeology
Thank you!
Questions?