Statistics 202: Data Mining
Week 3
Jonathan Taylor
Based in part on slides from the textbook and on slides of Susan Holmes
October 10, 2012
Part I
Graphical exploratory data analysis
Graphical summaries of data
Exploratory data analysis
A preliminary exploration of the data to better understand its characteristics.
Motivations:
Helping to select the right tool for preprocessing or analysis.
Making use of humans' abilities to recognize patterns.
Pioneered by John Tukey, one of the giants of 20th century statistics.
Our focus
Visual summary statistics
Quantitative summary statistics
Extraction of data slices
Graphical summaries of data
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
Visualization of data is one of the most powerful and appealing techniques for data exploration.
Humans have a well-developed ability to analyze large amounts of information that is presented visually.
Goals: to detect general patterns and trends and/or to detect outliers and unusual patterns.
Stem and leaf plot
The decimal point is at the |
1 | 01223333333444444444444
1 | 555555555555556666666777799
2 |
2 |
3 | 033
3 | 55678999
4 | 000001112222334444
4 | 5555555566677777888899999
5 | 000011111111223344
5 | 55566666677788899
6 | 0011134
6 | 6779
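In base R, a display like this comes from stem(); the one above appears to be the iris petal lengths:

stem(iris$Petal.Length)  # stems split in halves; the decimal point is at the "|"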
Graphical summaries of data
Histogram
Usually shows the distribution of values of a single variable.
Divide the values into bins and show a bar plot of the number of objects in each bin.
The height of each bar indicates the number of objects if all bins are of the same width.
If bins are of different widths, then often it is the area of the bar that indicates the number of objects in that bin.
Shape of histogram depends on the number of bins.
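A minimal sketch in base R; the breaks argument suggests the number of bins, as in the next two figures:

hist(iris$Petal.Length, breaks = 10)   # bar heights are counts per bin
hist(iris$Petal.Length, breaks = 30)   # more bins change the apparent shape
hist(iris$Petal.Length, freq = FALSE)  # with freq = FALSE, total bar area is 1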
A histogram with 10 breaks
A histogram with 30 breaks
Graphical summaries of data
Density
A smoothed / infinitesimal version of a histogram.
The height of the curve, f(x), is proportional to the probability that a value falls in the interval [x, x + dx].
A density is non-negative, f(x) ≥ 0, and integrates to 1:
∫_{−∞}^{∞} f(x) dx = 1 = 100%
The integral over any interval [a, b] is the frequency / probability that a value falls in this interval.
The standard normal density
[Figure: the standard normal density; horizontal axis in units, vertical axis in % per unit.]
The area between z = −0.7 and z = 0.7 is 51.61%.
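This area can be verified in base R:

pnorm(0.7) - pnorm(-0.7)  # 0.5160727, i.e. about 51.61%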
Estimated density of petal.length within each iris type
Graphical summaries of data
Density estimate
The simplest way to estimate a density from a sample {x1, . . . , xn} is to use a kernel density estimator.
Definition
f̂(x) = (1/n) ∑_{i=1}^{n} (1/h) φ((x − x_i)/h)
where h is the width of the bump attached to each sample point.
R chooses h automatically.
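A sketch with base R's density(), which uses a Gaussian kernel by default and picks h (the bandwidth, bw) automatically:

d <- density(iris$Petal.Length)              # bw chosen by R's default rule
d$bw                                         # the bandwidth that was picked
plot(density(iris$Petal.Length, bw = 1))     # oversmoothed, as in the next figure
plot(density(iris$Petal.Length, bw = 0.05))  # undersmoothed, as in the figure after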
Estimated density bw=1
Estimated density bw=0.05
Graphical summaries of data
Distribution function
Associated to a density f is its distribution function
F(x) = ∫_{−∞}^{x} f(u) du
with 0 ≤ F(x) ≤ 1.
Given a sample {x1, . . . , xn}, we define its ECDF (empirical cumulative distribution function) as
F̂(x) = (1/n) ∑_{i=1}^{n} 1_{(−∞,x]}(x_i)
where
1_{(−∞,x]}(x_i) = 1 if −∞ < x_i ≤ x, and 0 if x_i > x.
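In base R, ecdf() returns F̂ as a step function:

Fhat <- ecdf(iris$Petal.Length)
Fhat(2.5)    # fraction of observations ≤ 2.5
plot(Fhat)   # the ECDF as a step function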
ECDF of petal.length for each iris type
Graphical summaries of data
Quantile function
The inverse of the distribution function:
F(Q(p)) = p
for 0 ≤ p ≤ 1.
For a sample {x1, . . . , xn}, an estimated quantile Q̂(p) is chosen so that approximately np of the data points are less than Q̂(p).
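Base R's quantile() computes such estimates:

quantile(iris$Petal.Length, probs = c(0.25, 0.5, 0.75))  # estimated Q̂(p) for three values of p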
Quantile plots of petal.length for each iris type
Boxplot
Boxplot of all variables across iris types
Boxplot of petal.length for each iris type
Contour of 2D density estimate
Heatmap of 2D density estimate
Correlation (case × case) matrix of iris data
Part II
Multidimensional scaling
Multidimensional scaling
A visual tool
Recall that the PCA scores were
X̃V = UΔ
where X̃ = HXS^{−1/2} = UΔV^T.
Above, U are the eigenvectors of
X̃X̃^T = HXS^{−1}X^TH.
Also, XS^{−1}X^T is a measure of similarity between cases.
Multidimensional scaling
Classical multidimensional scaling
Given a similarity matrix A, recall the relationship between a similarity and a "distance":
D_ij = (A_ii − 2A_ij + A_jj)^{1/2}.
Now, consider the matrix B with entries
B_ij = −(1/2) D_ij^2.
Finally, consider the matrix
C = HBH.
Multidimensional scaling
Classical multidimensional scaling
If A = XS^{−1}X^T is the similarity matrix for a (scaled) data matrix X, then
B = A − (1/2)(ν1^T + 1ν^T)
  = XS^{−1}X^T − (1/2)(ν1^T + 1ν^T)
where ν = diag(A).
Therefore, since H1 = 0,
C = HBH = HAH = X̃X̃^T = UΔ^2 U^T.
In short, the eigenvectors of C yield the PCA scores.
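A quick numerical check in base R (a sketch using the iris measurements; cmdscale() implements classical MDS):

X <- scale(iris[, 1:4])           # centred, scaled data, playing the role of X̃
mds <- cmdscale(dist(X), k = 2)   # classical MDS on Euclidean distances
pca <- prcomp(X)$x[, 1:2]         # first two PCA scores
max(abs(abs(mds) - abs(pca)))     # near zero: they agree up to column sign flips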
Olympic PCA scores
Olympic MDS scores
MDS vs. PCA
MDS vs. PCA
Multidimensional scaling
Classical multidimensional scaling
We can form B, C for any dissimilarity matrix D.
The matrix C will be symmetric, so it will be diagonalizable as C = WΛW^T with W of size n × k and rank(C) = k. In the general case the diagonal entries Λ_ii are not necessarily non-negative.
This leads to a Euclidean representation
(WΛ^{1/2})_{n×k}
with each row of WΛ^{1/2} being a point in Euclidean space.
Taking the first two columns of WΛ^{1/2} gives a two-dimensional representation.
Multidimensional scaling
Classical multidimensional scaling
The points (WΛ^{1/2})[, 1:2] are an optimal Euclidean embedding in the sense that their interpoint distances are chosen to be close to the dissimilarities in D.
If the dissimilarity is Euclidean and the points lie in a 2-dimensional plane in R^n, then the interpoint distances will be identical . . .
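The same recipe works for any table of distances. A sketch with base R's built-in eurodist (road distances between European cities; the slides use US capitals, but the idea is identical):

loc <- cmdscale(eurodist, k = 2)                 # 2-D configuration from distances alone
plot(loc[, 1], -loc[, 2], type = "n", asp = 1)   # flip the y-axis so north points up
text(loc[, 1], -loc[, 2], labels(eurodist), cex = 0.7)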
Distances between US capitals
Distances between US capitals
Map of U.S. capitals
MDS of Iris data, ℓ2
Iris PCA scores
MDS of Iris data, ℓ1
MDS of Iris data, ℓ∞ / sup norm
MDS of Iris data, ℓ20
MDS of Iris data, ℓ0.2
Other applications
Manifold learning
The goal of manifold learning is to find "low-dimensional" representations of X that are not necessarily linear.
Examples of techniques:
ISOMAP
Laplacian eigenmaps / diffusion geometry
Local linear embedding
Manifold learning
Manifold learning
Graph distance
In the "bottleneck" example, we already have a metric (Euclidean distance), so we can use MDS.
But we might want to emphasize the fact that the groups are "barely connected"...
How can we emphasize this bottleneck in terms of a distance?
ISOMAP and the diffusion geometry method do this by creating a graph and defining a new metric based on the graph.
Manifold learning
Graph distances
In our "bottleneck" picture, let's connect mutual k nearest neighbours to form a graph G_k.
That is, insert an edge (i, j) if either
1. i is within j's k nearest neighbours; or
2. j is within i's k nearest neighbours.
Let
D_{G_k}(i, j) = length of the shortest path between vertices i and j in G_k.
This is a metric on G_k.
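A sketch of this construction, assuming the igraph and FNN packages (neither is part of base R):

library(igraph)  # graph construction and shortest paths
library(FNN)     # fast k-nearest-neighbour lookup

X <- as.matrix(iris[, 1:4])
n <- nrow(X); k <- 5
nn <- get.knn(X, k = k)$nn.index                     # each point's k nearest neighbours
edges <- cbind(rep(seq_len(n), times = k), as.vector(nn))
G_k <- graph_from_edgelist(edges, directed = FALSE)  # undirected: the "either" rule above
E(G_k)$weight <- sqrt(rowSums((X[edges[, 1], ] - X[edges[, 2], ])^2))  # Euclidean edge lengths
D_Gk <- distances(G_k)                               # shortest-path distances D_{G_k}(i, j)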
Manifold learning
ISOMAP
Shove D_{G_k} through classical MDS.
ISOMAP can "flatten" data if there is a low-dimensional Euclidean configuration whose Euclidean distances well approximate the graph distances D_{G_k}.
This means that the data has to be "flat" if ISOMAP is to recover it.
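Continuing the sketch above, the final ISOMAP step is just classical MDS on the graph distances:

iso <- cmdscale(as.dist(D_Gk), k = 2)  # 2-D ISOMAP embedding
plot(iso, col = iris$Species)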
Manifold learning
Part III
Multidimensional arrays (data cubes)
Multidimensional arrays
Data cubes
It is sometimes useful to summarize data into a multidimensional array, with one axis per "category."
In the unemployment data, we might summarize the results by tuples (state, period, variable).
In the Iris data, we might summarize by (petal.width, petal.length, iris.type).
In order to summarize by petal.width and petal.length we might form categories: low, medium, high.
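A sketch in base R: discretize the two petal measurements and cross-tabulate into a 3-way array (the low / medium / high cut points are arbitrary here):

pw <- cut(iris$Petal.Width, 3, labels = c("low", "medium", "high"))
pl <- cut(iris$Petal.Length, 3, labels = c("low", "medium", "high"))
cube <- table(petal.width = pw, petal.length = pl, iris.type = iris$Species)
cube["low", , ]   # one slice of the cube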
Iris grouped by petal.width, petal.length, iris.type
Iris grouped by petal.width, petal.length, iris.type
Slices of the cube.
Drilling down
Resolution of a data cube
In the process of forming a data cube for unemployment across states, we had already aggregated over county.
We could make another cube with county-level resolution. This would be drilling down.
The operation of going from county level to state level is rolling up.
Such operations are often best handled by database tools rather than R, but R does support multidimensional arrays, as the sketch below shows.
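Rolling up the iris cube from above amounts to summing out an axis (apply() accepts dimension names as margins):

rollup <- apply(cube, c("petal.length", "iris.type"), sum)  # aggregate out petal.width
rollup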