Topological Data Analysis for Detecting Hidden Patterns in Data
Susan Holmes
Statistics, Stanford, CA 94305.
Joint work with Persi Diaconis, Mehrdad Shahshahani and
Sharad Goel.
Thanks to Harold Widom, Gunnar Carlsson, John Chakerian,
Leonid Pekelis for discussions, and NSF grant DMS 0241246
for funding.
À la recherche du temps perdu: Gradients and Ordination
Many popular multivariate methods based on spectral decompositions of distances or transformed distances (Multidimensional Scaling, kernel PCA, correspondence analysis, metric MDS) aim to detect hidden underlying structure in points in high dimensions.
A first type of dependence is a hidden gradient, placing points close to a curve in high-dimensional space. Ecologists and archaeologists have long known to look for horseshoes or arches, which are symptomatic of such structure.
We take a political science example with data from the 2005
U.S. House of Representatives roll call votes. MDS and
kernel PCA, in this case, output two ‘horseshoes’ that are
characteristic of dimensionality reduction techniques.
PCA: Dimension Reduction
PCA seeks to replace the original (centered) matrix X by a
matrix of lower rank, this can be solved by doing the singular
value decomposition of X:
X = USV′, with U′DU = In, V′QV = Ip, and S diagonal.
XX′ = US²U′, with U′DU = In and S² = Λ.
PCA is a linear nonparametric multivariate method for dimension reduction.
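As a quick illustration (a minimal sketch, not code from the talk), the SVD computation behind PCA can be written in a few lines of R; Y is assumed to be an n × p data matrix:

  ## Sketch: PCA through the SVD of the centered data matrix.
  pca_svd <- function(Y, r = 2) {
    X <- scale(Y, center = TRUE, scale = FALSE)    # column-center Y
    s <- svd(X)                                    # X = U S V'
    list(scores   = s$u[, 1:r] %*% diag(s$d[1:r]), # rank-r coordinates
         loadings = s$v[, 1:r],                    # principal axes
         eig      = s$d^2)                         # eigenvalues of XX'
  }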
Ordination: Finding Time (Le temps perdu...)
Early studies in archaeology aimed at seriation in time. Guttman, Kendall and Ter Braak have pointed out and studied the arch or horseshoe effect.
Here is a linguistic example where I dated the works of Plato according to their sentence endings, using a particular distance between the books called the chi-square distance:
As an example we take data analysed by Cox and Brandwood [?], who wanted to seriate Plato's works using the proportion of sentence endings in a given book with a given stress pattern. We propose the use of correspondence analysis on the table of frequencies of sentence endings; for a detailed analysis see Charnomordic and Holmes [?].
The first 10 profiles (as percentages) look as follows:
       Rep Laws Crit Phil  Pol Soph  Tim
UUUUU  1.1  2.4  3.3  2.5  1.7  2.8  2.4
-UUUU  1.6  3.8  2.0  2.8  2.5  3.6  3.9
U-UUU  1.7  1.9  2.0  2.1  3.1  3.4  6.0
UU-UU  1.9  2.6  1.3  2.6  2.6  2.6  1.8
UUU-U  2.1  3.0  6.7  4.0  3.3  2.4  3.4
UUUU-  2.0  3.8  4.0  4.8  2.9  2.5  3.5
--UUU  2.1  2.7  3.3  4.3  3.3  3.3  3.4
-U-UU  2.2  1.8  2.0  1.5  2.3  4.0  3.4
-UU-U  2.8  0.6  1.3  0.7  0.4  2.1  1.7
-UUU-  4.6  8.8  6.0  6.5  4.0  2.3  3.3
. . . etc (there are 32 rows in all)
The eigenvalues of the chi-square distance matrix (displayed in the scree plot; see [?]) show that two axes out of a possible 6 (the matrix is of rank 6) provide a summary of 85% of the departure from independence; this suggests that a planar representation will provide a good visual summary of the data.
Eigenvalue  inertia %  cumulative %
1  0.09170  68.96   68.96
2  0.02120  15.94   84.90
3  0.00911   6.86   91.76
4  0.00603   4.53   96.29
5  0.00276   2.07   98.36
6  0.00217   1.64  100.00
Correspondence Analysis of Plato's Works
[Figure: the correspondence analysis map of the books; Axis #1: 69%, Axis #2: 16%; points labeled Rep, Laws, Crit, Phil, Pol, Soph, Tim.]
We can see from the plot that there is a seriation that, as in most cases, follows a parabola or arch [?], from Laws at one extreme (the latest work) to the Republic (the earliest).
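As a minimal sketch (assuming tab is the 32 × 7 matrix of sentence-ending counts; this is not the original analysis code), correspondence analysis can be computed from the SVD of the standardized residuals:

  ca_sketch <- function(tab) {
    P  <- tab / sum(tab)                     # correspondence matrix
    rw <- rowSums(P); cl <- colSums(P)       # row and column masses
    S  <- diag(1/sqrt(rw)) %*% (P - rw %o% cl) %*% diag(1/sqrt(cl))
    sv <- svd(S)                             # SVD of standardized residuals
    list(rows    = diag(1/sqrt(rw)) %*% sv$u %*% diag(sv$d),
         cols    = diag(1/sqrt(cl)) %*% sv$v %*% diag(sv$d),
         inertia = sv$d^2)                   # principal inertias
  }

Plotting the first two columns of rows reproduces the arch in the map above.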
Examples from Ecology
The Boomlake plant data:
Biplot representing both species and locations
Blue circles with letters are species scores
Sampling locations are green circles with numbers.
Sample 1 is actually in the lake, and sample 12 is far away.
Species are located close to the samples they occur in. If you looked carefully into the data matrix, you would find that species R and Q are strictly aquatic, while species F is a dryland plant. There is an arch effect.
Reference to site on ordination:
http://ordination.okstate.edu/CA.htm
Psychological Data
Color confusion data (Ekman, 1954):
     w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w674
1    0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.16
2    0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.14
3    0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.03
4    0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.04
5    0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.00
6    0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.01
7    0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.00
8    0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.02
9    0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.23
10   0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.28
11   0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.55
12   0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.68
13   0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.76
14   0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00
Results
Color Confusion: Screeplot
[Figure: scree plot of the MDS eigenvalues (class.col$eig) for the color confusion data.]
Planar configuration
[Figure: planar configuration, cmdscale(colorc)[,1] against cmdscale(colorc)[,2], with the 14 stimuli labeled 1 to 14.]
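A sketch of how such figures can be produced (ekman is assumed to hold the 14 × 14 similarity matrix above; taking d = 1 − s for the dissimilarity is one common choice, and the name colorc is simply read off the axis labels):

  colorc <- 1 - ekman                  # similarities to dissimilarities
  diag(colorc) <- 0
  mds <- cmdscale(as.dist(colorc), k = 2, eig = TRUE)
  plot(mds$eig[1:10], type = "b")      # scree plot
  plot(mds$points, type = "n")         # planar configuration
  text(mds$points, labels = 1:14)      # the stimuli trace out an arch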
Metric Multidimensional Scaling
Schoenberg (1935)
Decomposition of Distances
If we started with original data Y in R^p that are not centered, we apply the centering matrix:
X = HY, with H = I − (1/n)11′ and 1′ = (1, 1, 1, . . . , 1).
Call B = XX′. If D⁽²⁾ is the matrix of squared distances between rows of X in the Euclidean coordinates, we can show that
−(1/2) H D⁽²⁾ H = B.
We can go backwards from a matrix D to X by taking the
eigendecomposition of B in much the same way that PCA
provides the best rank r approximation for data by taking
the singular value decomposition of X, or the
eigendecomposition of XX ′.
X(r) = U S(r) V′, with S(r) = diag(s1, s2, . . . , sr, 0, . . . , 0).
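This recipe is easy to code directly; a minimal sketch in R, where D2 is a matrix of squared interpoint distances:

  mds_from_D2 <- function(D2, r = 2) {
    n <- nrow(D2)
    H <- diag(n) - matrix(1/n, n, n)   # centering matrix H
    B <- -0.5 * H %*% D2 %*% H         # B = XX'
    e <- eigen(B, symmetric = TRUE)
    e$vectors[, 1:r] %*% diag(sqrt(pmax(e$values[1:r], 0)))
  }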
Another approach: Markov Chain associated to the data
Consider data points X = {x1, . . . , xn} in a metric space
(X , d).
We define a matrix on X that preferentially moves to nearby states via the transition kernel
K(xi, xj) = e^{−d(xi,xj)} / ∑_{k=1}^n e^{−d(xi,xk)}.
K has stationary distribution
π(xi) ∝ ∑_{k=1}^n e^{−d(xi,xk)},
and furthermore, (K,π) is reversible:
π(xi)K(xi, xj) = π(xj)K(xj, xi).
Because K is reversible, it is diagonalizable in L2(X, π) in a
real orthonormal basis of eigenfunctions f1, . . . , fn with
corresponding real eigenvalues,
1 = λ1 ≥ λ2 ≥ · · · ≥ λn > −1.
f1 ≡ 1 since K is stochastic. Having fixed an orthonormal basis of eigenfunctions, the k-dimensional MDS is defined to be
Γ : xi ↦ yi = (f2(xi), . . . , fk+1(xi)).
We are generally interested in k ≪ n, for example, k ≤ 3.
Γ is an optimal mapping of X into R^k in the sense that it minimizes
∑_{1≤i,j≤n} π(xi)K(xi, xj) ‖yi − yj‖²
over all Γ : X → R^k such that
1. ∑_{i=1}^n Γ(p)(xi)Γ(q)(xi)π(xi) = δ_pq, 1 ≤ p, q ≤ k
2. ∑_{i=1}^n Γ(p)(xi)π(xi) = 0, 1 ≤ p ≤ k,
where Γ(p)(xi) is the pth coordinate of Γ(xi) ∈ R^k. Condition 1 says that the coordinate functions of Γ are orthonormal in L²(π), and condition 2 says that they are also orthogonal to the constant functions.
Intuitively, Γ maps similar points in X (as measured via q(xi, xj) = π(xi)K(xi, xj)) to nearby points in R^k.
In the preceding, we started with a metric d on X and built a similarity S(xi, xj) = e^{−d(xi,xj)}, which in turn led to a Gram matrix G. We could instead define G via alternative measures of similarity, e.g.
S(xi, xj) = sup_{xk,xl} d(xk, xl) − d(xi, xj).
More generally, we could begin with an arbitrary reversible Markov chain on X, bypassing the metric d altogether.
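A minimal sketch of this construction (d is assumed to be an n × n matrix of pairwise distances; the symmetrization used to diagonalize K is one standard device, not necessarily the one used in the talk):

  kernel_map <- function(d, k = 2) {
    S   <- exp(-d)                       # similarity from the metric
    K   <- S / rowSums(S)                # row-normalized transition kernel
    pst <- rowSums(S) / sum(S)           # stationary distribution pi
    A   <- diag(sqrt(pst)) %*% K %*% diag(1/sqrt(pst))  # symmetrized kernel
    e   <- eigen((A + t(A)) / 2, symmetric = TRUE)
    f   <- diag(1/sqrt(pst)) %*% e$vectors   # eigenfunctions of K
    f[, 2:(k + 1)]                       # drop the constant f1
  }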
The Voting Data
We are going to carefully analyze the output of
multidimensional scaling applied to the 2005 U.S. House of
Representatives roll call votes. The resultant 3-dimensional
mapping of legislators shows ‘horseshoes’ that are
characteristic of a number of dimensionality reduction
techniques, including principal components analysis and
correspondence analysis.
These patterns are heuristically attributed to a latent
ordering of the data, e.g. the ranking of politicians within a
left-right spectrum.
Roll Call Data
We apply the eigendecomposition algorithm to members of
the 2005 U.S. House of Representatives with the distance
between legislators defined via roll call votes [?].
A full House consists of 435 members, and in 2005 there
were 671 roll calls. The first two roll calls were a call of the
House by States and the election of the Speaker, and so were
excluded from our analysis. Hence, the data can be ordered into a 435 × 669 matrix Y = (y_ij), with y_ij ∈ {1/2, −1/2, 0} indicating, respectively, a vote of ‘yea’, ‘nay’, or ‘not voting’ by Representative i on roll call j.
We further restricted our analysis to the 401 Representatives that voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a 401 × 669 matrix V of voting data.
The Data
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  -1 -1  1 -1  0  1  1  1  1  1
2  -1 -1  1 -1  0  1  1  1  1  1
3   1  1 -1  1 -1  1  1 -1 -1 -1
4   1  1 -1  1 -1  1  1 -1 -1 -1
5   1  1 -1  1 -1  1  1 -1 -1 -1
6  -1 -1  1 -1  0  1  1  1  1  1
7  -1 -1  1 -1 -1  1  1  1  1  1
8  -1 -1  1 -1  0  1  1  1  1  1
9   1  1 -1  1 -1  1  1 -1 -1 -1
10 -1 -1  1 -1  0  1  1  0  0  0
This step removed, for example, the Speaker of the House, Dennis Hastert (R-IL), who by custom votes only when his vote would be decisive, and Robert T. Matsui (D-CA), who passed away at the start of the term.
We define a distance between legislators as
d̂(li, lj) = (1/669) ∑_{k=1}^{669} |v_ik − v_jk|.
Roughly, d̂(li, lj) is the proportion of roll calls on which legislators li and lj disagreed. This interpretation would be exact if not for the possibility of ‘not voting’.
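In R this distance is one line (a sketch, assuming V is the 401 × 669 vote matrix coded 1/2, −1/2, 0):

  Dhat <- as.matrix(dist(V, method = "manhattan")) / ncol(V)  # d-hat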
Since we now have data points in a metric space, we can
apply the MDS algorithm. The figure shows the results of a
3-dimensional MDS mapping. The most striking feature of the mapping is that the data separate into ‘twin horseshoes’.
In the next figure we have added color to indicate the
political party affiliation of each Representative (blue for
Democrat, red for Republican, and green for the lone
independent–Rep. Bernie Sanders of Vermont). The output
from MDS is qualitatively similar to that obtained from other
dimensionality reduction techniques, such as principal
components analysis applied directly to the voting matrix V .
We build and analyze a model for the data in an effort to understand and interpret these pictures. Roughly, our theory predicts that the Democrats, for example, are ordered along the blue curve in correspondence to their political ideology, i.e. how far they lean to the left.
We discuss connections between the theory and the data. In particular, we explain why, in the data, legislators at the political extremes are not quite at the tips of the MDS curves, but rather are positioned slightly toward the center. Briefly, this is because there are distinct groups of relatively liberal Republicans, which accordingly exhibit quite different voting patterns.
[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes.]
[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes; color has been added to indicate the party affiliation of each representative.]
A Model for the Data
Following the standard paradigm of placing politicians within a left-right spectrum, it is natural to identify legislators li, 1 ≤ i ≤ n, with points in the interval I = [0, 1] in correspondence with their political ideologies. We define the distance between legislators to be
d(li, lj) = |li − lj|.
This assumption that legislators can be isometrically mapped
into an interval is key to our analysis.
To apply MDS to the voting data, we defined a distance
between legislators via roll call votes. We now introduce a
‘cut-point model’ for voting that connects our distance d
above to the data-based roll call distance.
The Model: Each bill 1 ≤ k ≤ m on which the legislators
vote is represented as a pair
(Ck, Pk) ∈ [0, 1]× {0, 1}.
We can think of Pk as indicating whether the bill is liberal
(Pk = 0) or conservative (Pk = 1), and we can take Ck to be
the cut-point between legislators that vote ‘yea’ or ‘nay’. Let
Vik ∈ {1/2,−1/2} indicate how legislator li votes on bill k.
Then, in this model,
Vik = 1/2 − Pk if li ≤ Ck, and Vik = Pk − 1/2 if li > Ck.
As described, the model has n+ 2m parameters, one for
each legislator and two for each bill. We reduce the number
of parameters by assuming that the cut-points are
independent random variables uniform on I. Then,
P(Vik ≠ Vjk) = d(li, lj)   (1)
since legislators li and lj take opposite sides on a given bill if and only if the cut-point Ck divides them. Observe that the Pk do not affect the probability above.
Define the empirical distance between legislators li and lj by
d̂m(li, lj) = (1/m) ∑_{k=1}^m |Vik − Vjk| = (1/m) ∑_{k=1}^m 1{Vik ≠ Vjk}.
By (1), we can estimate the distance d between legislators by the distance d̂, which is computable from the voting record. In particular,
lim_{m→∞} d̂m(li, lj) = d(li, lj) a.s.
since we assumed the cut-points are independent. More
precisely, we have the following result:
Lemma. For m ≥ log(n/√ε)/ε²,
P( |d̂m(li, lj) − d(li, lj)| ≤ ε for all 1 ≤ i, j ≤ n ) ≥ 1 − ε.
Proof. By the Hoeffding inequality, for fixed li and lj,
P( |d̂m(li, lj) − d(li, lj)| > ε ) ≤ 2e^{−2mε²}.
Consequently,
P( ∪_{1≤i<j≤n} { |d̂m(li, lj) − d(li, lj)| > ε } ) ≤ ∑_{1≤i<j≤n} P( |d̂m(li, lj) − d(li, lj)| > ε ) ≤ n(n − 1) e^{−2mε²} ≤ ε
for m ≥ log(n/√ε)/ε², and the result follows.
In our model we identified latent variables with points in the
interval I = [0, 1] and accordingly defined the distance
between them to be d(li, lj) = |li − lj|. This general
description seems to be reasonable in a number of
applications. We then built a simple model for the data that
facilitated empirical approximation of this distance. This
second step depends heavily on the application. In the rest
of the paper, we simply assume that the distance d can be
reasonably approximated from the data.
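A small simulation sketch of the cut-point model (all the specific choices, n = 100 legislators, m = 1000 bills, Pk ≡ 0, are illustrative; by (1) the Pk do not affect the distances):

  set.seed(1)
  n <- 100; m <- 1000
  l <- (1:n) / n                        # legislators uniform on I = [0,1]
  C <- runif(m)                         # independent uniform cut-points
  V <- outer(l, C, function(li, ck) ifelse(li <= ck, 1/2, -1/2))
  Dhat <- as.matrix(dist(V, method = "manhattan")) / m   # empirical d-hat
  plot(cmdscale(Dhat, k = 2))           # a single horseshoe appears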
Analysis of the Model
In this section we analyze the MDS algorithm applied to
metric models satisfying
d(xi, xj) = |i/n− j/n|.
This corresponds to the case in which legislators are
uniformly spaced in I: li = i/n.
Similarity and Transition Matrices
Given a distance d on a state space X , there are several ways
to build a similarity S. Two standard transformations are:
1. S1(xi, xj) = e^{−d(xi,xj)}
2. S2(xi, xj) = sup_{zi,zj} d(zi, zj) − d(xi, xj)
Once we have a similarity, we can define a Gram/kernel matrix K by normalizing the rows. That is,
K(xi, xj) = S(xi, xj) / ∑_{xk} S(xi, xk).
To ease the analysis, we sometimes instead normalize the similarity matrix by the average row sum
z = (1/n) ∑_{xi} ∑_{xj} S(xi, xj).
That is, we set K(xi, xj) = S(xi, xj)/z.
Eigenvectors and Horseshoes
We find approximate eigenfunctions and eigenvalues for
models that satisfy
d(xi, xj) = |i/n− j/n|
with Gram matrices that are built with either a linear
similarity or an exponential similarity. The eigenfunctions are
found by continuizing the discrete Gram matrix, and then
solving the corresponding integral equation
∫₀¹ K(x, y) f(y) dy = λ f(x).
Standard matrix perturbation theory can then be applied to
recover approximate eigenfunctions for the original, discrete
kernel.
The eigenfunctions that we derive are in agreement with
those arising from the voting data, and lend considerable
insight into our data analysis problem and also into general
features of MDS mappings.
Approximate Eigenfunctions
We now state a classical perturbation result that relates two
different notions of an approximate eigenfunction. For more
refined estimates, see Parlett [?].
Theorem. Consider an n × n symmetric matrix A with eigenvalues λ1 ≤ · · · ≤ λn. If, for ε > 0,
‖Af − λf‖₂ ≤ ε
for some f, λ with ‖f‖₂ = 1, then A has an eigenvalue λk such that |λk − λ| ≤ ε.
If we further assume that s = min_{i:λi≠λk} |λi − λk| > ε, then A has an eigenfunction fk such that Afk = λk fk and ‖f − fk‖₂ ≤ ε/(s − ε).
Remark. The second statement of the theorem allows non-simple eigenvalues, but requires that the eigenvalues corresponding to distinct eigenspaces be well-separated.
Remark. The eigenfunction bound of the theorem is asymptotically tight in ε, as the following example illustrates. Consider the matrix
A = [ λ  0 ; 0  λ + s ]
with s > 0. For ε < s define the function
f = ( √(1 − ε²/s²), ε/s )′.
Then ‖f‖₂ = 1 and ‖Af − λf‖₂ = ε. The theorem guarantees that there is an eigenfunction fk with eigenvalue λk such that |λ − λk| ≤ ε. Since the eigenvalues of A are λ and λ + s, and since s > ε, we must have λk = λ. Let Vk = {fk : Afk = λk fk} = {c e1 : c ∈ R}, where e1 is the first standard basis vector. Then
min_{fk ∈ Vk} ‖f − fk‖₂ = ‖f − (f · e1)e1‖₂ = ε/s.
The bound of the theorem, ε/(s − ε), is only slightly larger.
Proof of the Approximate Eigenfunction Theorem
Proof. First we show that min_i |λi − λ| ≤ ε. If min_i |λi − λ| = 0 we are done; otherwise A − λI is invertible. Then
‖f‖₂ ≤ ‖(A − λI)⁻¹‖ · ‖(A − λI)f‖₂ ≤ ε‖(A − λI)⁻¹‖.
Since the eigenvalues of (A − λI)⁻¹ are 1/(λ1 − λ), . . . , 1/(λn − λ), by symmetry
‖(A − λI)⁻¹‖ = 1 / min_i |λi − λ|.
The result now follows since ‖f‖₂ = 1.
Set λk = argmin_i |λi − λ|, and consider an orthonormal basis g1, . . . , gm of the associated eigenspace Eλk. Define fk to be the projection of f onto Eλk:
fk = ⟨f, g1⟩g1 + · · · + ⟨f, gm⟩gm.
Then fk is an eigenfunction with eigenvalue λk. Writing f = fk + (f − fk), we have
(A − λI)f = (A − λI)fk + (A − λI)(f − fk) = (λk − λ)fk + (A − λI)(f − fk).
Since f − fk ∈ E⊥λk, by symmetry we have
⟨fk, A(f − fk)⟩ = ⟨Afk, f − fk⟩ = ⟨λk fk, f − fk⟩ = 0.
Consequently, ⟨fk, (A − λI)(f − fk)⟩ = 0 and by Pythagoras
‖Af − λf‖₂² = (λk − λ)²‖fk‖₂² + ‖(A − λI)(f − fk)‖₂².
In particular, ε ≥ ‖Af − λf‖₂ ≥ ‖(A − λI)(f − fk)‖₂. For λi ≠ λk, |λi − λ| ≥ s − ε. The result now follows since for h ∈ E⊥λk,
‖(A − λI)h‖₂ ≥ (s − ε)‖h‖₂.
Centering Kernel matrices
If our kernel K is renormalized so that it has row sums 1, that is,
K1n = 1n,
then 1n is an eigenvector of K with eigenvalue 1. As a consequence, if we recenter K by applying the centering matrix H = I − (1/n)1n1n′, then for any eigenvector v of K orthogonal to 1n,
KHv = Kv − (1/n)K1n(1n′v) = Kv = λv,
and also HKHv = λHv = λv. So we will not bother to recenter the K matrix.
Linear Similarity
When we make a continuous version of the discrete Kernel
matrix Kn, we get the continuous kernel
K(x, y) = (3/2)[1 − |x − y|].
Once we guess that the solutions to the corresponding
integral equation are trigonometric, verifying this is
straightforward. We start with a simple integral computation.
Lemma. For a ≠ 0,
∫₀¹ cos(ax + b)[1 − |c − x|] dx = (2/a²) cos(ac + b) − (1/a²)[a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b)].
In particular,
1. For odd integers k,
∫₀¹ sin(kπ(x − 1/2))[1 − |c − x|] dx = (2/(kπ)²) sin(kπ(c − 1/2)).
2. For solutions to (a/2)tan(a/2) = 1,
∫₀¹ cos[a(x − 1/2)][1 − |c − x|] dx = (2/a²) cos[a(c − 1/2)].
Proof. The result follows from a straightforward calculation. Set
fc(x) = cos(ax + b)[1 − |c − x|].
Then
∫₀¹ fc(x) dx = (1 − c)∫₀ᶜ cos(ax + b) dx + ∫₀ᶜ x cos(ax + b) dx + (1 + c)∫_c¹ cos(ax + b) dx − ∫_c¹ x cos(ax + b) dx.
Integration by parts shows that
∫ x cos(ax + b) dx = (x/a) sin(ax + b) + (1/a²) cos(ax + b).
Substituting into the above, we have
∫₀¹ fc(x) dx = (1/a²)[a(1 − c) sin(ac + b) − a(1 − c) sin b + a(1 + c) sin(a + b) − a(1 + c) sin(ac + b) + ac sin(ac + b) + cos(ac + b) − cos b − a sin(a + b) − cos(a + b) + ac sin(ac + b) + cos(ac + b)].
At a = kπ and b = 0 for k an odd integer,
a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = 0,
and so
∫₀¹ cos(kπx)[1 − |c − x|] dx = (2/(kπ)²) cos(kπc).
Since for odd k,
sin(kπ(x − 1/2)) = cos(kπx − π(k + 1)/2) = (−1)^{(k+1)/2} cos(kπx),
the first part of the lemma follows. At b = −a/2, where a is a solution to (a/2)tan(a/2) = 1,
a sin b − ac sin b − ac sin(a + b) + cos b + cos(a + b) = −a sin(a/2) + 2 cos(a/2) = 0.
Consequently,
∫₀¹ cos(ax − a/2)[1 − |c − x|] dx = (2/a²) cos(ac − a/2)
for a a solution to (a/2)tan(a/2) = 1.
The solutions of (a/2) tan(a/2) = 1 occur at approximately
a = 2kπ for integers k. More precisely, we have the following
result.
Lemma. The positive solutions of (a/2)tan(a/2) = 1 lie in the set
(0, π) ∪ ∪_{k=1}^∞ (2kπ, 2kπ + 2/(kπ)),
with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.
Proof. Let f(θ) = (θ/2)tan(θ/2). Then f is an even function, so a is a solution to f(θ) = 1 if and only if −a is a solution. Since f′(θ) = (1/2)tan(θ/2) + (θ/4)sec²(θ/2), f(θ) is non-negative and increasing in the first and second quadrants, and furthermore
f(2kπ) = 0 < 1 < +∞ = lim_{θ→(2k+1)π⁻} f(θ).
The third and fourth quadrants have no solutions since f(θ) ≤ 0 in those regions. This shows that the solutions to f(θ) = 1 lie in the intervals
∪_{k=0}^∞ (2kπ, 2kπ + π),
with exactly one solution per interval. Recall that the power series expansion of tan θ for |θ| < π/2 is
tan θ = θ + θ³/3 + 2θ⁵/15 + 17θ⁷/315 + · · · .
In particular, for 0 ≤ θ < π/2, tan θ ≥ θ. Finally, for k ∈ Z≥1,
f(2kπ + 2/(kπ)) = (kπ + 1/(kπ)) tan(kπ + 1/(kπ)) = (kπ + 1/(kπ)) tan(1/(kπ)) ≥ (kπ + 1/(kπ))(1/(kπ)) > 1,
which gives the result.
Remark. The first few positive solutions of (a/2)tan(a/2) = 1 are
1. a = 1.72066717803876 . . .
2. a = 6.85123691896346 . . .
3. a = 12.87459635834389 . . .
4. a = 19.05866881072393 . . .
Lemma. For 1 ≤ i, j ≤ n, let
Kn(xi, xj) = (3/(2n))(1 − |i − j|/n).
Set fn,a(xi) = cos(a(i/n − 1/2)), where a is a positive solution to (a/2)tan(a/2) = 1, and set gn,k(xi) = sin(kπ(i/n − 1/2)) for k ≥ 1 an odd integer. Then
|Kn fn,a(xi) − (3/a²) fn,a(xi)| ≤ (a + 1)/n
and
|Kn gn,k(xi) − (3/(kπ)²) gn,k(xi)| ≤ (kπ + 1)/n.
That is, fn,a and gn,k are approximate eigenfunctions of Kn, with approximate eigenvalues proportional to their squared periods.
Proof. Once we guess that f and g are approximate eigenfunctions of Kn, the proof of this fact follows from the integral computation in the previous Lemma. We have
Kn fn,a(xi) = (3/(2n)) ∑_{j=1}^n cos(a(j/n − 1/2))[1 − |i/n − j/n|]
= (3/2) ∫₀¹ cos(a(x − 1/2))[1 − |i/n − x|] dx + (3/2)Rn
= (3/a²) fn,a(xi) + (3/2)Rn by the Lemma,
where the error term satisfies
|Rn| ≤ M/(2n) for M ≥ sup_{0≤x≤1} |(d/dx) cos(a(x − 1/2))[1 − |i/n − x|]|
by the standard right-hand rule error bound. In particular, we can take M = a + 1, independent of i, from which the result for fn,a follows. The case of gn,k is analogous.
Lemma. For 1 ≤ i, j ≤ n, set
Kn(xi, xj) = (3/(2n))(1 − |i − j|/n)
and let λ1, . . . , λn be the eigenvalues of Kn.
1. For positive solutions a of (a/2)tan(a/2) = 1,
min_{1≤i≤n} |λi − 3/a²| ≤ 2(a + 1)/√n.
2. For odd integers k ≥ 1,
min_{1≤i≤n} |λi − 3/(kπ)²| ≤ (kπ + 1)/√n.
Remark. By the preceding remark, the first few values of 3/a² for solutions to (a/2)tan(a/2) = 1 are
1. 1.01327541515878 . . .
2. 0.06391212873818 . . .
3. 0.01809897627265 . . .
4. 0.00825916473010 . . .
and the first few values of 3/(kπ)² for k ≥ 1 an odd integer are
1. 0.30396355092701 . . .
2. 0.03377372788078 . . .
3. 0.01215854203708 . . .
4. 0.00620333777402 . . .
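These approximations are easy to check numerically; a sketch (the choice n = 200 is arbitrary):

  n  <- 200
  Kn <- (3 / (2 * n)) * (1 - abs(outer(1:n, 1:n, "-")) / n)
  head(eigen(Kn, symmetric = TRUE)$values, 4)
  ## approximately 1.0133, 0.3040, 0.0639, 0.0338,
  ## matching the values 3/a^2 and 3/(k pi)^2 listed above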
Exponential Transformation of Similarity
The case of exponential similarity is analogous to that of
linear similarity. Continuizing the discrete Gram matrix Kn, we get the kernel
K(x, y) = (e/2) e^{−|x−y|}.
Once again, we find trigonometric solutions to Kf = λf .
Lemma. For constants a, c ∈ R,
∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = 2cos[a(c − 1/2)]/(1 + a²) + (e^{−c} + e^{c−1})(a sin(a/2) − cos(a/2))/(1 + a²)
and
∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = 2sin[a(c − 1/2)]/(1 + a²) + (e^{−c} − e^{c−1})(a cos(a/2) + sin(a/2))/(1 + a²).
In particular,
1. For a such that a tan(a/2) = 1,
∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = 2cos[a(c − 1/2)]/(1 + a²).
2. For a such that a cot(a/2) = −1,
∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx = 2sin[a(c − 1/2)]/(1 + a²).
Proof. The lemma follows from a straightforward integration. First split the integral into two pieces:
∫₀¹ e^{−|x−c|} cos[a(x − 1/2)] dx = ∫₀ᶜ e^{x−c} cos[a(x − 1/2)] dx + ∫_c¹ e^{c−x} cos[a(x − 1/2)] dx.
By integration by parts applied twice,
∫ e^{x−c} cos[a(x − 1/2)] dx = [a e^{x−c} sin(a(x − 1/2)) + e^{x−c} cos(a(x − 1/2))]/(1 + a²)
and
∫ e^{c−x} cos[a(x − 1/2)] dx = [a e^{c−x} sin(a(x − 1/2)) − e^{c−x} cos(a(x − 1/2))]/(1 + a²).
Evaluating these expressions at the appropriate limits of integration gives the first statement of the lemma. The computation of ∫₀¹ e^{−|x−c|} sin[a(x − 1/2)] dx is analogous.
The solutions of a tan(a/2) = 1 are approximately 2kπ for integers k, and the solutions of a cot(a/2) = −1 are approximately (2k + 1)π.
Lemma.
1. The positive solutions of a tan(a/2) = 1 lie in the set
(0, π) ∪ ∪_{k=1}^∞ (2kπ, 2kπ + 1/(kπ)),
with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.
2. The positive solutions of a cot(a/2) = −1 lie in the set
∪_{k=0}^∞ ((2k + 1)π, (2k + 1)π + 1/(kπ + π/2)),
with exactly one solution per interval. Furthermore, a is a solution if and only if −a is a solution.
Remark. The first few positive solutions of a tan(a/2) = 1 are
1. a = 1.30654237418881 . . .
2. a = 6.58462004256417 . . .
3. a = 12.72324078413133 . . .
4. a = 18.95497141084159 . . .
and the first few positive solutions of a cot(a/2) = −1 are
1. a = 3.67319440630425 . . .
2. a = 9.63168463569187 . . .
3. a = 15.83410536933241 . . .
4. a = 22.08165963594259 . . .
Lemma. For 1 ≤ i, j ≤ n, let
Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}.
Set fn,a(xi) = cos(a(i/n − 1/2)), where a is a positive solution to a tan(a/2) = 1, and set gn,a(xi) = sin(a(i/n − 1/2)), where a is a positive solution to a cot(a/2) = −1. Then
|Kn fn,a(xi) − (e/(1 + a²)) fn,a(xi)| ≤ 2(a + 1)/n,
|Kn gn,a(xi) − (e/(1 + a²)) gn,a(xi)| ≤ 2(a + 1)/n.
That is, fn,a and gn,a are approximate eigenfunctions of Kn.
Lemma. For 1 ≤ i, j ≤ n, set
Kn(xi, xj) = (e/(2n)) e^{−|i−j|/n}
and let λ1, . . . , λn be the eigenvalues of Kn.
1. For positive solutions a of a tan(a/2) = 1,
min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.
2. For positive solutions a of a cot(a/2) = −1,
min_{1≤i≤n} |λi − e/(1 + a²)| ≤ 4(a + 1)/√n.
Remark. The first few values of e/(1 + a²) for solutions to a tan(a/2) = 1 are
1. 1.00414799895293 . . .
2. 0.06128160783626 . . .
3. 0.01668877420197 . . .
4. 0.00754468546867 . . .
The first few values of e/(1 + a²) for solutions to a cot(a/2) = −1 are
1. 0.18756657740212 . . .
2. 0.02898902316936 . . .
3. 0.01079887885138 . . .
4. 0.00556341289490 . . .
Horseshoes and Twin Horseshoes
The 2-dimensional mapping is built out of the second and third eigenfunctions of the Gram matrix. Above we computed several approximate eigenfunctions and eigenvalues for the Gram matrix arising from the voting model. The linear and exponential similarity cases are analogous, and so we only consider the latter here. In this case, we have the approximate eigenfunctions
1. fn,1(xi) = cos(1.3065(i/n − 1/2)) with eigenvalue λ ≈ 1.004
2. fn,2(xi) = sin(3.6732(i/n − 1/2)) with eigenvalue λ ≈ 0.1876
3. fn,3(xi) = cos(6.5846(i/n − 1/2)) with eigenvalue λ ≈ 0.06128.
[Figure: the approximate eigenfunctions f1, f2 and f3.]
[Figure: a horseshoe that results from plotting Λ : xi ↦ (f2(xi), f3(xi)).]
In particular, from Λ it is possible to deduce the relative
order of the representatives in the interval I. Since −f2 is
also an eigenfunction, it is not in general possible to
determine the absolute order knowing only that Λ comes
from the eigenfunctions.
You need a crib!
Voting Data
With the voting data, we see not one but two horseshoes. To see how this can happen, consider the two-population state space X = {x1, . . . , xn1, y1, . . . , yn2} with distances d(xi, xj) = |i/n1 − j/n1|, d(yi, yj) = |i/n2 − j/n2| and d(xi, yj) = +∞. This leads to the partitioned Gram matrix
Kn1+n2 = [ Kn1  0  ]
         [ 0    Kn2 ].
The approximate eigenfunctions and eigenvalues that we found above for the single population model can now be used to build higher dimensional eigenspaces for the two population model. In particular, we have the following approximate eigenspaces.
Eigenspace with eigenvalue λ ≈ 1.004, containing the orthogonal functions
gn,1(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,1(x_{i−n1}) · 1{n1 < i ≤ n1 + n2}
gn,2(xi) = √((n1 + n2)/n1) fn1,1(xi) · 1{1 ≤ i ≤ n1} − √((n1 + n2)/n2) fn2,1(x_{i−n1}) · 1{n1 < i ≤ n1 + n2}
Eigenspace with eigenvalue λ ≈ 0.1876, containing the orthogonal functions
gn,3(xi) = a √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} + √((n1 + n2)/n2) fn2,2(x_{i−n1}) · 1{n1 < i ≤ n1 + n2}
gn,4(xi) = √((n1 + n2)/n1) fn1,2(xi) · 1{1 ≤ i ≤ n1} − a √((n1 + n2)/n2) fn2,2(x_{i−n1}) · 1{n1 < i ≤ n1 + n2}
These functions are graphed below for the case n1 = n2 and a = 1/5. Moreover, plotting the 3-dimensional mapping Λ : xi ↦ (g2(xi), g3(xi), g4(xi)) results in twin horseshoes.
[Figure: approximate eigenfunctions g1, g2, g3 and g4 for the Gram matrix arising from the two population model.]
[Figure: twin horseshoes that result from plotting Λ : xi ↦ (g2(xi), g3(xi), g4(xi)).]
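A sketch reproducing the twin horseshoes from the block-diagonal kernel (exponential similarity; the sizes n1 = n2 = 100 are illustrative):

  n1 <- n2 <- 100
  d1 <- abs(outer(1:n1, 1:n1, "-")) / n1
  Kblock <- function(d, n) (exp(1) / (2 * n)) * exp(-d)
  Kn <- rbind(cbind(Kblock(d1, n1), matrix(0, n1, n2)),
              cbind(matrix(0, n2, n1), Kblock(d1, n2)))
  e <- eigen(Kn, symmetric = TRUE)
  pairs(e$vectors[, 2:4])   # pairwise plots of g2, g3, g4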
The approximate eigenfunctions derived above are stable to noise. Numerically, this is the case, as seen below. That figure was generated by adding normal N(0, 1/5) noise to the Gram matrix K200 before normalizing by the average row sum. The specific form of the noise does not noticeably affect the results.
[Figure: the leading eigenfunctions of the noisy Gram matrix K200, plotted against index 1 to 200.]
Connecting the Model to the Data
When we apply eigendecomposition to the voting data, the first few eigenvalues are:
1
0.17709857573272 . . .
0.01037622989886 . . .
0.00831940284881 . . .
0.00484075498479 . . .
0.00344207632723 . . .
0.00266158512355 . . .
0.00248175112290 . . .
[Figure: the re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2005 U.S. House of Representatives roll call votes. Colors indicate political parties.]
Since legislators are not a priori ordered, the eigenfunctions
are difficult to interpret. However, our model suggests the
following ordering: Split the legislators into two groups G1
and G2 based on the sign of f2(xi); then the norm of f3 is
larger on one group, say G1, so we sort G1 based on
increasing values of f3, and similarly, sort G2 via f4.
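A sketch of this re-indexing heuristic (f2, f3, f4 are taken to be the relevant eigenvector columns from the MDS output; the tie-breaking details are guesses):

  reorder_mds <- function(f2, f3, f4) {
    G1 <- which(f2 >= 0); G2 <- which(f2 < 0)
    if (sum(f3[G1]^2) < sum(f3[G2]^2)) {   # make G1 the f3-dominant group
      tmp <- G1; G1 <- G2; G2 <- tmp
    }
    c(G1[order(f3[G1])], G2[order(f4[G2])])   # within-group sorts
  }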
[Figure: the re-indexed second, third and fourth eigenfunctions output by the MDS algorithm applied to the 2004 U.S. House of Representatives roll call votes. Colors indicate political parties.]
Our analysis suggests that if legislators are in fact isometrically embedded in the interval I (relative to the roll call distance), then the MDS rank will be consistent with the order of legislators in the interval. This appears to be the case in the data; see, for instance, the following figure, which shows a graph of d(li, ·) for selected legislators li. For example, as we would predict, d(l1, ·) is an increasing function and d(ln, ·) is decreasing. Moreover, the data seem to be in rough agreement with the metric assumption of our two population model, namely that the two groups are well-separated and that the within-group distance is given by d(li, lj) = |i/n − j/n|.
[Figure: the empirical roll-call-derived distance function d(li, ·) for selected legislators li = 1, 90, 181, 182, 290, 401. The x-axis orders legislators according to their MDS rank.]
Our voting model suggests that the MDS obtained ordering
of legislators should correspond to political ideology. To test
this, we compared the MDS results to the assessment of
legislators by Americans for Democratic Action [?]. Each
year, ADA selects 20 votes it considers the most important
during that session, for example, the Patriot Act
reauthorization. Legislators are assigned a Liberal Quotient:
the percentage of those 20 votes on which the Representative
voted in accordance with what ADA considered to be the
liberal position. For example, a representative who voted the
liberal position on all 20 votes would receive an LQ of 100%.
The figure below shows a plot of LQ vs. the MDS derived rank.
[Figure: comparison of the MDS derived rank for Representatives with the Liberal Quotient as defined by Americans for Democratic Action.]
The scatter in this figure arises because this notion of proximity, although related, does not correspond directly to political ideology. The MDS and ADA rankings complement one another in the sense that together they facilitate identification of two distinct, yet relatively liberal, groups of Republicans. That is, although these two groups are relatively liberal, they are considered to be liberal for different reasons.
[Figure: comparison of the MDS derived rank for Representatives with the National Journal's liberal score.]
Practical Questions:
• Which transformations of distances work well for detecting gradients? √(1 − exp(−d(x, y))) works well in practice.
• Are most Toeplitz eigenvectors simple to approximate?
• Can we prove that the eigenvectors are robust to noise? (For instance, the physicists Bohigas, Bogomolny and Schmit show that for uniformly distributed points on a segment (the one-dimensional Anderson model) the eigenstructure is the same.)
• How do we extend this to a two-dimensional (spatial) gradient?
A little immunology
T-lymphocyte cells (T-cells) originally derive from stem cells
of the bone marrow. At around the time of birth,
lymphocytes derived in this way leave the marrow and pass
to the thymus gland in the chest, where they multiply.
The lymphocytes are processed by the thymus gland, so that
between them they carry the genetic information necessary
to react with a multitude of possible antigens.
Biological Questions
• Do cancer patients show differential expression in any genes
expressed in T-cells?
• Are there any differences between naive, effector and memory T-cells?
• What are the steps involved in T-cell differentiation?
Differences between the three cell types?
• Linear Model: N → E → M → Apop
• Parallel Model: N → M and N → E → Apop
Genes differentially expressed
Using the variance-stabilized data (vsn) and multtest with Westfall and Young's maxT, I ranked the genes by their adjusted p-value. I made my collaborator choose a stopping point on the list: 156 significant genes.
MDS Analysis
Transform the data from continuous to discrete: the cutoff was decided through genes known to be expressed in some arrays and not in others (biological, not statistical, criteria). 87% of the variation lies in the first plane:
[Figure: MDS of the arrays in the first plane (Kt.ev$vectors[, 29:30]); points are labeled EFF, MEM and NAI.]
Topological Problems in Spaces of Phylogenetic Trees
Biology now requires the use of non-standard parameters, generalising work done on multivariate Euclidean spaces to spaces of parameters that are not embeddable in Euclidean structures. Visualisation of distances often provides much more information than the simple distributions.
Less symmetrical Phylogenies
Linguists use trees to map out the history of languages, but their trees have an ancient form and a novel form, and so do not have symmetry between siblings.
Examples include :
• Comparing Phylogenetic trees from different DNA data.
• Comparing Bootstrap Trees with the tree computed from
the original data sets.
• Comparing Hierarchical clustering trees on melanoma
patients.
• Constructing confidence sets for non-standard data.
• Testing for mixtures of trees (Mossel and Vigoda show how important this can be).
• Trying to detect horizontal gene transfer.
• Output of many trees sampled from a Bayesian posterior
distribution on trees.
• Sets of trees built with different data (DNA trees, behavioral trees, phenotypic trees).
• Confidence regions of trees from Bayesian posteriors or
Bootstrap resamples.
• Neighborhood explorations: how many neighbours? What are the curvatures of the boundaries?