1 Clustering Preliminaries Applications Euclidean/Non-Euclidean Spaces Distance Measures.
Date posted: 19-Dec-2015
1
Clustering Preliminaries
Applications
Euclidean/Non-Euclidean Spaces
Distance Measures
2
The Problem of Clustering
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.
3
Example
[Figure: scatter plot of points, each drawn as an x, falling into a few visible groups]
4
Problems With Clustering
Clustering in two dimensions looks easy.
Clustering small amounts of data looks easy.
And in most cases, looks are not deceiving.
5
The Curse of Dimensionality
Many applications involve not 2, but 10 or 10,000 dimensions.
High-dimensional spaces look different: almost all pairs of points are at about the same distance.
6
Example: Curse of Dimensionality
Assume random points within a bounding box, e.g., values between 0 and 1 in each dimension.
In 2 dimensions: a variety of distances between 0 and 1.41.
In 10,000 dimensions, the difference in any one dimension follows a triangular distribution.
7
Example – Continued
The law of large numbers applies.
Actual distance between two random points is the sqrt of the sum of squares of essentially the same set of differences.
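The effect can be checked with a small simulation. This is only a sketch: the function name, the sample size of 50 points, and the use of 1,000 rather than 10,000 dimensions (to keep it fast) are illustrative choices, not part of the slides.

```python
import math
import random


def pairwise_distance_spread(dim, n_points=50, seed=0):
    """Return (mean, std/mean) of L2 distances between random points in [0,1]^dim."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return mean, std / mean


for dim in (2, 10, 1000):
    mean, rel = pairwise_distance_spread(dim)
    print(f"dim={dim:5d}  mean distance={mean:.3f}  std/mean={rel:.3f}")
```

The relative spread (std/mean) should shrink as the dimension grows, which is the sense in which almost all pairs end up at about the same distance.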
8
Example High-Dimension Application: SkyCat
A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands).
Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
Sloan Sky Survey is a newer, better version.
9
Example: Clustering CD’s (Collaborative Filtering)
Intuitively: music divides into categories, and customers prefer a few categories. But what are categories really?
Represent a CD by the customers who bought it.
Similar CD’s have similar sets of customers, and vice-versa.
10
The Space of CD’s
Think of a space with one dimension for each customer.
Values in a dimension may be 0 or 1 only.
A CD’s point in this space is (x1, x2,…, xk), where xi = 1 iff the i th customer bought the CD.
Compare with boolean matrix: rows = customers; cols. = CD’s.
11
Space of CD’s – (2)
For Amazon, the dimension count is tens of millions.
An alternative: use minhashing/LSH to get Jaccard similarity between “close” CD’s.
1 minus Jaccard similarity can serve as a (non-Euclidean) distance.
12
Example: Clustering Documents
Represent a document by a vector (x1, x2,…, xk), where xi = 1 iff the i th word (in some order) appears in the document.
It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words.
Documents with similar sets of words may be about the same topic.
13
Aside: Cosine, Jaccard, and Euclidean Distances
As with CD’s, we have a choice when we think of documents as sets of words or shingles:
1. Sets as vectors: measure similarity by the cosine distance.
2. Sets as sets: measure similarity by the Jaccard distance.
3. Sets as points: measure similarity by Euclidean distance.
14
Example: DNA Sequences
Objects are sequences of {C,A,T,G}.
Distance between sequences is edit distance, the minimum number of inserts and deletes needed to turn one into the other.
Note there is a “distance,” but no convenient space in which points “live.”
15
Distance Measures
Each clustering problem is based on some kind of “distance” between points.
Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean
16
Euclidean Vs. Non-Euclidean
A Euclidean space has some number of real-valued dimensions and “dense” points.
There is a notion of “average” of two points.
A Euclidean distance is based on the locations of points in such a space.
A Non-Euclidean distance is based on properties of points, but not their “location” in a space.
17
Axioms of a Distance Measure
d is a distance measure if it is a function from pairs of points to real numbers such that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
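The four axioms can be spot-checked on a finite sample of points. The sketch below is illustrative only: the helper names and sample points are made up for this example, and squared Euclidean distance is brought in purely as a counterexample (it is not discussed on the slides). Passing on a sample does not prove that a function is a distance measure; failing does prove that it is not.

```python
from itertools import product


def satisfies_distance_axioms(d, points, tol=1e-12):
    """Check the four axioms for d on every pair and triple from a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # axiom 1: non-negative
            return False
        if (d(x, y) <= tol) != (x == y):       # axiom 2: zero iff same point
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # axiom 3: symmetric
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # axiom 4: triangle inequality
            return False
    return True


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    # Counterexample: squaring breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0, 0), (1, 0), (2, 0), (1, 3)]
print(satisfies_distance_axioms(manhattan, sample))
print(satisfies_distance_axioms(squared_euclidean, sample))
```

On this sample the squared distance from (0,0) to (2,0) is 4, while going through (1,0) costs only 1 + 1 = 2, so axiom 4 fails.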
18
Some Euclidean Distances
L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
The most common notion of “distance.”
L1 norm: sum of the absolute differences in each dimension.
Manhattan distance = distance if you had to travel along coordinates only.
19
Examples of Euclidean Distances
x = (5,5)
y = (9,8)
L2-norm: dist(x,y) = √(4² + 3²) = 5
L1-norm: dist(x,y) = 4 + 3 = 7
[Figure: right triangle between x and y with legs 4 and 3 and hypotenuse 5]
20
Another Euclidean Distance
L∞ norm: d(x,y) = the maximum of the absolute differences between x and y in any dimension.
Note: the maximum is the limit as n goes to ∞ of what you get by taking the n th power of the differences, summing and taking the n th root.
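A short sketch of the three norms, using the points x = (5,5) and y = (9,8) from the earlier example. The function names are illustrative. The loop shows the n th-root-of-n th-powers value approaching the maximum difference as the exponent grows.

```python
def lp_distance(x, y, p):
    """n th root of the sum of n th powers of the absolute differences (here n = p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """Maximum absolute difference over all dimensions."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (5, 5), (9, 8)
print("L1  :", lp_distance(x, y, 1))
print("L2  :", lp_distance(x, y, 2))
print("Linf:", linf_distance(x, y))

# As p grows, the Lp distance approaches the L-infinity distance.
for p in (1, 2, 4, 16, 64):
    print(p, lp_distance(x, y, p))
```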
21
Non-Euclidean Distances
Jaccard distance for sets = 1 minus ratio of sizes of intersection and union.
Cosine distance = angle between vectors from the origin to the points in question.
Edit distance = minimum number of inserts and deletes to change one string into another.
22
Jaccard Distance for Sets (Bit-Vectors)
Example: p1 = 10111; p2 = 10011.
Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
d(x,y) = 1 – (Jaccard similarity) = 1/4.
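The same calculation as a sketch in code, treating each bit string as the set of positions holding a 1. The function name is illustrative, and returning 0 for two all-zero vectors is an assumption made here, since the slides do not cover that case.

```python
def jaccard_distance_bits(p1, p2):
    """Jaccard distance between two equal-length bit strings such as "10111"."""
    if len(p1) != len(p2):
        raise ValueError("bit vectors must have the same length")
    intersection = sum(a == "1" and b == "1" for a, b in zip(p1, p2))
    union = sum(a == "1" or b == "1" for a, b in zip(p1, p2))
    if union == 0:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1 - intersection / union


print(jaccard_distance_bits("10111", "10011"))
```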
23
Why J.D. Is a Distance Measure
d(x,x) = 0 because x ∩ x = x ∪ x.
d(x,y) = d(y,x) because union and intersection are symmetric.
d(x,y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|.
d(x,y) ≤ d(x,z) + d(z,y) is trickier; see the next slide.
24
Triangle Inequality for J.D.
1 - |x ∩ z|/|x ∪ z| + 1 - |y ∩ z|/|y ∪ z| ≥ 1 - |x ∩ y|/|x ∪ y|
Remember: |a ∩ b|/|a ∪ b| = probability that minhash(a) = minhash(b).
Thus, 1 - |a ∩ b|/|a ∪ b| = probability that minhash(a) ≠ minhash(b).
25
Triangle Inequality – (2)
Claim: prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)]
Proof: whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true.
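As a sanity check rather than a proof, the inequality can be tested by brute force over every triple of non-empty subsets of a small universe. This sketch uses exact fractions to avoid rounding; the four-element universe and the function names are arbitrary choices for illustration, and empty sets are left out because their Jaccard similarity is undefined.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard_distance(a, b):
    """Exact Jaccard distance between two non-empty sets."""
    return 1 - Fraction(len(a & b), len(a | b))


def nonempty_subsets(universe):
    items = sorted(universe)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def triangle_inequality_holds(universe):
    """True if d(x,y) <= d(x,z) + d(z,y) for all non-empty subsets x, y, z."""
    subsets = nonempty_subsets(universe)
    return all(
        jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y)
        for x, y, z in product(subsets, repeat=3)
    )


print(triangle_inequality_holds({"a", "b", "c", "d"}))
```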
26
Similar Sets and Clustering
We can use minhashing + LSH to find quickly those pairs of sets with low Jaccard distance.
We can cluster sets (points) using J.D.
But we only know some distances – the low ones.
Thus, clusters are not always connected components.
27
Example: Clustering + J.D.
[Figure: four sets drawn as points: {a,b,c}, {b,c,e,f}, {d,e,f}, {a,b,d,e}]
Similarity threshold = 1/3; distance < 2/3
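The pairwise Jaccard distances for these four sets can be listed with a short sketch. The labels S1 to S4 are hypothetical names introduced here for readability; they do not appear on the slide.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical labels for the four sets in the example.
sets = {
    "S1": {"a", "b", "c"},
    "S2": {"b", "c", "e", "f"},
    "S3": {"d", "e", "f"},
    "S4": {"a", "b", "d", "e"},
}


def jaccard_distance(a, b):
    """Exact Jaccard distance between two non-empty sets."""
    return 1 - Fraction(len(a & b), len(a | b))


def pairwise_distances(named_sets):
    return {
        (m, n): jaccard_distance(named_sets[m], named_sets[n])
        for m, n in combinations(sorted(named_sets), 2)
    }


distances = pairwise_distances(sets)
for pair, dist in distances.items():
    print(pair, dist)
```

Note that the pair {b,c,e,f} and {a,b,d,e} sits exactly at distance 2/3, so whether it counts as "close" depends on whether the threshold comparison is strict.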
28
Cosine Distance
Think of a point as a vector from the origin (0,0,…,0) to its location.
Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1.p2/|p2||p1|.
Example: p1 = 00111; p2 = 10011.
p1.p2 = 2; |p1| = |p2| = √3.
cos(θ) = 2/3; θ is about 48 degrees.
29
Cosine-Measure Diagram
[Figure: vectors p1 and p2 separated by angle θ; the projection of p1 onto p2 has length p1.p2/|p2|]
d(p1, p2) = θ = arccos(p1.p2/|p2||p1|)
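A minimal sketch of this formula, applied to the vectors p1 = 00111 and p2 = 10011 from the cosine-distance example. The function name is illustrative, both vectors are assumed to be non-zero, and the clamp guards against rounding pushing the cosine slightly outside [-1, 1].

```python
import math


def cosine_distance_degrees(p1, p2):
    """Angle in degrees between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    cos_theta = dot / (norm1 * norm2)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


p1 = [0, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
print(round(cosine_distance_degrees(p1, p2), 2))
```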
30
Why C.D. Is a Distance Measure
d(x,x) = 0 because arccos(1) = 0.
d(x,y) = d(y,x) by symmetry.
d(x,y) ≥ 0 because angles are chosen to be in the range 0 to 180 degrees.
Triangle inequality: physical reasoning. If I rotate an angle from x to z and then from z to y, I can’t rotate less than from x to y.
31
Edit Distance
The edit distance of two strings is the minimum number of inserts and deletes of characters needed to turn one into the other.
Equivalently: d(x,y) = |x| + |y| - 2|LCS(x,y)|.
LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y.
32
Example: LCS
x = abcde ; y = bcduve.
Turn x into y by deleting a, then inserting u and v after d.
Edit distance = 3.
Or, LCS(x,y) = bcde.
Note: |x| + |y| - 2|LCS(x,y)| = 5 + 6 - 2*4 = 3 = edit distance.
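The LCS formula translates directly into code. This is a sketch using the standard dynamic-programming table for LCS length, kept to two rows; the function names are illustrative.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of strings x and y."""
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous prefix of x
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2|LCS(x,y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(lcs_length("abcde", "bcduve"))
print(edit_distance("abcde", "bcduve"))
```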
33
Why Edit Distance Is a Distance Measure
d(x,x) = 0 because 0 edits suffice.
d(x,y) = d(y,x) because insert/delete are inverses of each other.
d(x,y) ≥ 0: no notion of negative edits.
Triangle inequality: changing x to z and then to y is one way to change x to y.
34
Variant Edit Distances
Allow insert, delete, and mutate (change one character into another).
Minimum number of inserts, deletes, and mutates also forms a distance measure.
Ditto for any set of operations on strings.
Example: substring reversal OK for DNA sequences.
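A sketch of the insert/delete/mutate variant, commonly known as Levenshtein distance (a name the slides do not use). Each operation is assumed to cost 1, and the function name is illustrative.

```python
def levenshtein(x, y):
    """Minimum number of inserts, deletes, and mutates turning x into y."""
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # mutate cx into cy, or match for free
            ))
        prev = curr
    return prev[len(y)]


print(levenshtein("abc", "abd"))
print(levenshtein("abcde", "bcduve"))
```

With mutation allowed, abc and abd are one edit apart, whereas the insert/delete-only distance between them is 2 (delete c, insert d).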