Linear Least Squares and its applications in distance matrix methods


Page 1: Linear Least Squares  and its applications in distance matrix methods

Linear Least Squares and its applications in distance matrix methods

Presented by Shai Berkovich

June, 2007

Seminar in Phylogeny, CS236805

Based on the paper by Olivier Gascuel

Page 2: Linear Least Squares  and its applications in distance matrix methods

Contents

Background and Motivation

LS in general

LS in phylogeny

UNJ algorithm

LS sense of UNJ

Page 3: Linear Least Squares  and its applications in distance matrix methods

Distance Matrix Methods

A major family of phylogenetic methods has been the distance matrix methods. The general idea: calculate a measure of the distance between each pair of species, and then find a tree that predicts the observed set of distances as closely as possible.

Page 4: Linear Least Squares  and its applications in distance matrix methods

Distance Matrix Methods

This leaves out all information from higher-order combinations of character states, reducing the data matrix to a simple table of pairwise distances; computer simulation studies show, however, that the amount of information about the phylogeny that is lost is remarkably small.

(As we already saw: Neighbor-Joining and its robustness to noise.)

Page 5: Linear Least Squares  and its applications in distance matrix methods

Additivity

Definition: A distance matrix D is additive if there exists a tree with positive edge weights such that

$D_{ij} = \sum_{v_k \in \mathrm{path}(i,j)} v_k$

where the $v_k$ are the edges on the path between species i and j.

Theorem [Waterman et al., 1977]: Given an additive n x n distance matrix D, there is a unique edge-weighted tree (without nodes of degree 2) in which n nodes are labeled s1, s2, ..., sn so that the length of the path between si and sj equals Dij. Furthermore, this unique tree consistent with D can be reconstructed in O(n²) time.

Page 6: Linear Least Squares  and its applications in distance matrix methods

Distance-Based reconstruction

Input: distance matrix D

Output: edge-weighted tree T (if D is additive, then DT = D; otherwise, return a tree best 'fitting' the input D).

[Figure: a tree over the leaves A-E with edge lengths 0.05, 0.10, 0.07, 0.03, 0.05, 0.08, 0.06, shown next to its distance matrix:]

    A     B     C     D     E
A   0     0.23  0.16  0.20  0.17
B   0.23  0     0.23  0.17  0.24
C   0.16  0.23  0     0.15  0.11
D   0.20  0.17  0.15  0     0.21
E   0.17  0.24  0.11  0.21  0

The matrix itself carries no topology!

D satisfies the metric conditions:

$D_{ij} \ge 0$
$D_{ij} = D_{ji}$
$D_{ij} \le D_{ik} + D_{kj}$

Page 7: Linear Least Squares  and its applications in distance matrix methods

Approximation

In practice, the distance matrix between molecular sequences will not be additive.

So, we want to find a tree T whose distance matrix approximates the given one.

Algorithms give exact results when operating on an additive matrix, but their behavior becomes less clear when a real (non-additive) matrix is handled.

Page 8: Linear Least Squares  and its applications in distance matrix methods

LS Overview

Linear least squares is a mathematical optimization technique for finding an approximate solution to a system of linear equations that has no exact solution. This usually happens when the number of equations (m) is greater than the number of variables (n).

Page 9: Linear Least Squares  and its applications in distance matrix methods

LS Overview

In mathematical terms, we want to find a solution $\hat{x}$ for the "equation"

$Ax \approx b$

where A is a known m-by-n matrix (usually with m > n), x is an unknown n-dimensional parameter vector, and b is a known m-dimensional measurement vector.

Page 10: Linear Least Squares  and its applications in distance matrix methods

LS Overview

Euclidean norm: on $\mathbb{R}^n$, the notion of the length of a vector $x = [x_1, x_2, \ldots, x_n]$ is captured by the formula

$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

This gives the ordinary distance from the origin to the point x.

More precisely, we want to minimize the squared Euclidean norm of the residual Ax - b, that is, the quantity

$\|Ax - b\|^2 = ([Ax]_1 - b_1)^2 + ([Ax]_2 - b_2)^2 + \cdots + ([Ax]_m - b_m)^2$

where $[Ax]_i$ denotes the i-th component of the vector Ax. Hence the name "least squares".

Page 11: Linear Least Squares  and its applications in distance matrix methods

LS Overview

Fact: the squared norm of v is $v^T v$.

What do we do when we want to minimize? Expand the squared norm and set the derivative to zero:

$\|Ax - b\|^2 = (Ax - b)^T (Ax - b) = (Ax)^T (Ax) - 2 b^T (Ax) + b^T b$

$\frac{d}{dx}\left[(Ax)^T (Ax) - 2 b^T (Ax) + b^T b\right] = 2 A^T A \hat{x} - 2 A^T b = 0$

$A^T A \hat{x} = A^T b$

Page 12: Linear Least Squares  and its applications in distance matrix methods

LS Overview

Note that this corresponds to a system of linear equations. The matrix ATA on the left-hand side is a square matrix, which is invertible if A has full column rank (that is, if the rank of A is n). In that case, the solution of the system of linear equations is unique and given by

$\hat{x} = (A^T A)^{-1} A^T b$
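As a quick illustration (mine, not from the slides), the normal equations can be solved directly in numpy; A and b below are made-up example data:

```python
import numpy as np

# Made-up overdetermined system: m = 4 equations, n = 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.9, 2.1, 2.9, 4.2])

# Normal equations: (A^T A) x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq solves the same problem and is numerically
# preferable when A^T A is ill-conditioned.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, x_ls)
```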

Page 13: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

Input:
1. Distance matrix D
2. Tree topology

We should obtain the same tree back, since this dissimilarity matrix is additive.

[Figure: the same tree over leaves A-E (edge lengths 0.05, 0.10, 0.07, 0.03, 0.05, 0.08, 0.06) and its distance matrix, as on the Distance-Based reconstruction slide.]

Page 14: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

The measure that we use is the measure of discrepancy between the observed and expected distances:

$Q(T) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left(D_{ij} - d_{ij}\right)^2$

where the $w_{ij}$ are weights that differ between the different LS methods: $1$, $1/D_{ij}$, or $1/D_{ij}^2$.

(Intuition: larger observed distances are typically noisier, so the non-trivial weightings give them less importance.)
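A small Python sketch (mine, not from the slides) of evaluating Q(T) for given observed and tree distances; the optional weight matrix covers the weightings above:

```python
import numpy as np

def q_discrepancy(D, d, w=None):
    """Q(T) = sum over pairs i < j of w_ij * (D_ij - d_ij)^2.
    D: observed distances, d: tree distances (symmetric matrices)."""
    D, d = np.asarray(D, dtype=float), np.asarray(d, dtype=float)
    if w is None:
        w = np.ones_like(D)  # ordinary LS; pass 1/D or 1/D**2 for weighted variants
    i, j = np.triu_indices_from(D, k=1)  # count each unordered pair once
    return float(np.sum(w[i, j] * (D[i, j] - d[i, j]) ** 2))
```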

Page 15: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

$d_{ij} = \sum_k x_{ij,k} \, v_k$

We introduce an indicator variable $x_{ij,k}$, which is 1 if branch $v_k$ lies on the path from species i to species j, and 0 otherwise. Then

$Q(T) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left(D_{ij} - \sum_k x_{ij,k} v_k\right)^2$

[Figure: the 5-leaf tree over A-E with edges labeled v1-v7.]

For this tree, for example:

$d_{12} = 1 v_1 + 1 v_2 + 0 v_3 + 0 v_4 + 0 v_5 + 0 v_6 + 1 v_7$
$d_{13} = 1 v_1 + 0 v_2 + 1 v_3 + 0 v_4 + 0 v_5 + 1 v_6 + 0 v_7$
...
$d_{45} = 0 v_1 + 0 v_2 + 0 v_3 + 1 v_4 + 1 v_5 + 1 v_6 + 1 v_7$

Page 16: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

Differentiating

$Q(T) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left(D_{ij} - \sum_k x_{ij,k} v_k\right)^2$

with respect to each edge length $v_k$ and setting the derivative to zero:

$\frac{dQ}{dv_k} = -2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, x_{ij,k} \left(D_{ij} - \sum_{k'} x_{ij,k'} v_{k'}\right) = 0$

With $w_{ij} = 1$ this gives one linear equation per edge; for the 5-leaf tree (prop. 1):

$D_{AB} + D_{AC} + D_{AD} + D_{AE} = 4 v_1 + v_2 + v_3 + v_4 + v_5 + 2 v_6 + 2 v_7$
$D_{AB} + D_{BC} + D_{BD} + D_{BE} = v_1 + 4 v_2 + v_3 + v_4 + v_5 + 2 v_6 + 2 v_7$
...
$D_{AB} + D_{AD} + D_{BC} + D_{CD} + D_{BE} + D_{DE} = 2 v_1 + 3 v_2 + 2 v_3 + 3 v_4 + 2 v_5 + 4 v_6 + 6 v_7$

Page 17: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

The number of equations equals the number of edges, so we have a unique solution if the matrix is of full column rank. What matrix? Collecting the observed distances into the vector D (ordered AB, AC, AD, AE, BC, BD, BE, CD, CE, DE) and the indicators into the 10 x 7 topology matrix

X =
1 1 0 0 0 0 1
1 0 1 0 0 1 0
1 0 0 1 0 0 1
1 0 0 0 1 1 0
0 1 1 0 0 1 1
0 1 0 1 0 0 0
0 1 0 0 1 1 1
0 0 1 1 0 1 1
0 0 1 0 1 0 0
0 0 0 1 1 1 1

the equations above read

$X^T D = (X^T X) \, v$

and hence

$v = (X^T X)^{-1} X^T D$

Page 18: Linear Least Squares  and its applications in distance matrix methods

LS in phylogeny

Example: a 3-leaf tree over A, B, C with pendant edges v1, v2, v3, and the distance matrix

    A   B   C
A   0   10  12
B   10  0   8
C   12  8   0

$X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$

$X^T X = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$

$(X^T X)^{-1} = \begin{pmatrix} 3/4 & -1/4 & -1/4 \\ -1/4 & 3/4 & -1/4 \\ -1/4 & -1/4 & 3/4 \end{pmatrix}$

$v = (X^T X)^{-1} X^T D = \ldots$
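The elided computation can be verified numerically; a minimal sketch, assuming (per the X above) that v1, v2, v3 are the pendant edges of A, B, C:

```python
import numpy as np

X = np.array([[1, 1, 0],     # path A-B uses v1, v2
              [1, 0, 1],     # path A-C uses v1, v3
              [0, 1, 1]],    # path B-C uses v2, v3
             dtype=float)
D = np.array([10.0, 12.0, 8.0])          # D_AB, D_AC, D_BC

v = np.linalg.solve(X.T @ X, X.T @ D)    # (X^T X) v = X^T D
print(v)                                 # [7. 3. 5.]
assert np.allclose(X @ v, D)             # D is additive here, so the fit is exact
```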

Page 19: Linear Least Squares  and its applications in distance matrix methods

LS in Phylogeny

When we have weighted LS, the previous equations can be written

$X^T W D = (X^T W X) \, v$

$v = (X^T W X)^{-1} X^T W D$

where W is a diagonal matrix with the distance weights on the main diagonal.

Simulations usually show that LS gives better performance than NJ.
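Extending the previous sketch to weighted LS is a one-line change (the choice w_ij = 1/D_ij^2 below is just one of the weightings mentioned earlier):

```python
import numpy as np

X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
D = np.array([10.0, 12.0, 8.0])
W = np.diag(1.0 / D**2)                        # diagonal weight matrix

v = np.linalg.solve(X.T @ W @ X, X.T @ W @ D)  # (X^T W X) v = X^T W D
# Since this D is additive, the weighted fit coincides with the
# ordinary one: v == [7, 3, 5] and X @ v reproduces D exactly.
```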

Page 20: Linear Least Squares  and its applications in distance matrix methods

LS in Phylogeny

One can imagine an LS method that, for each tree topology, formed the matrix, inverted it, and obtained the estimates. This can be done, but it is computationally burdensome, even if not all topologies are examined.

Inversion of the matrix costs O(n³) for a tree with n tips, and in principle each tree topology should be considered.

Page 21: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

Recall the NJ algorithm:

1. Begin with a star tree and all sequences as nodes in L.
2. Find the pair of nodes {A, B} with minimum Q_A,B.
3. Create and insert a new joint node K with branch lengths
   d_A,K = ½ (d_A,B + r_A - r_B)
   d_B,K = ½ (d_A,B + r_B - r_A)
4. For the remaining nodes, update the distance to K as
   d_K,C = ½ (d_A,C + d_B,C - d_A,B)
5. Insert K and remove A, B from L.
6. Repeat steps 2-5 until only two nodes are left.

A sketch of steps 3-4 appears below.
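A minimal sketch of steps 3-4 (mine; it assumes the usual NJ convention r_i = sum_j d_ij / (n - 2) for the net divergences, which the slide does not spell out):

```python
import numpy as np

def nj_join(d, a, b):
    """Branch lengths of the new node K joining a and b, plus the
    updated distances from K to every other node (NJ steps 3-4)."""
    n = d.shape[0]
    r = d.sum(axis=1) / (n - 2)             # net divergences r_i
    d_ak = 0.5 * (d[a, b] + r[a] - r[b])    # d_A,K
    d_bk = 0.5 * (d[a, b] + r[b] - r[a])    # d_B,K
    d_k = 0.5 * (d[a] + d[b] - d[a, b])     # d_K,C for all remaining C
    return d_ak, d_bk, d_k
```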

Page 22: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

Although the NJ algorithm is widely used and has yielded satisfactory simulation results, certain questions remain:

The proof of correctness of the selection criterion (Saitou & Nei) was contested, but a complete proof has still not been provided.

The NJ reduction formula gives identical importance to the nodes x and y, even if one corresponds to a group of several objects and the other is a single object.

Page 23: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

The manner in which the edge lengths are estimated is inexact in terms of LS when the agglomerated nodes represent not individual objects but rather groups of objects.

The paper provides answers to these questions, but we will concentrate on the last one.

(The weighted/unweighted terminology is often misunderstood.)

Page 24: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

Definitions
• E = {1, 2, ..., n} is a set of n objects (leaves).
• $\Delta = (\delta_{ij})$ is a dissimilarity matrix over E.
• By removing an edge e from T we constitute a bipartition $\{X, \bar{X}\}$, where X may be viewed in two ways: as a subset of E, or as a rooted subtree of T whose root is situated at the extremity of the edge e.
• T denotes any valued tree.
• T' denotes its structure.

Page 25: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

Definitions
• The cardinality of X is the number of leaves in the subtree X, also denoted $n_x$.
• S = (s_ij) is the adjusted tree generated by LS.
• S' is the tree structure associated with the adjusted tree.
• Let $\{X, \bar{X}\}$ and $\{Y, \bar{Y}\}$ be two bipartitions of S'. When $X \cap Y = \emptyset$, we set

$s_{XY} = \sum_{i \in X, \, j \in Y} s_{ij}$   and   $\bar{s}_{XY} = \frac{1}{n_x n_y} \sum_{i \in X, \, j \in Y} s_{ij}$

Page 26: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

Definitions
• As well as, for the dissimilarities:

$\delta_{XY} = \sum_{i \in X, \, j \in Y} \delta_{ij}$   and   $\bar{\delta}_{XY} = \frac{1}{n_x n_y} \sum_{i \in X, \, j \in Y} \delta_{ij}$

• The flow of a rooted subtree X with root x:

$f_X = \sum_{i \in X} s_{ix}$   and   $\bar{f}_X = \frac{1}{n_x} f_X$

Page 27: Linear Least Squares  and its applications in distance matrix methods

Our Model

The estimates are unbiased, i.e., for every i, j

$\delta_{ij} = d_{ij} + \epsilon_{ij}$

where the noise variables $\epsilon_{ij}$ are i.i.d. (the result of real observations and measurements).

The paper states that it is coherent to use an unweighted approach, which allocates the same level of importance to each of the initial objects. Furthermore, within this model it is justified to use the "ordinary" LS criterion as opposed to the "generalized" one, which takes into account the variances and covariances of the estimates.

Page 28: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

1. Initialize the running matrix to the input dissimilarities: $\delta_{ij} \leftarrow \delta_{ij}$.
2. Initialize the number of remaining nodes: r <- n.
3. Initialize the numbers of objects per node: $n_i \leftarrow 1$, i in {1, ..., n}.
4. While the number of nodes r is greater than 3:
   { Compute the sums $R_i$, i in {1, ..., r}.
     Find the pair {x, y} to be agglomerated by maximizing $Q_{xy}$ (1) (the reverse of NJ, which minimizes its criterion).
     Create the node u, and set $n_u \leftarrow n_x + n_y$.
     Estimate the lengths of the edges (x, u) and (y, u) using (2).
     Reduce the running matrix using (3).
     Decrease the number of nodes: r <- r - 1. }
5. Create a central node and compute the last three edge lengths using (2).
6. Output the tree found.

Page 29: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm

1. Selection criterion (as in NJ):

$Q_{xy} = R_x + R_y - (r - 2) \, \delta_{xy}$,  where $R_z = \sum_{i=1}^{r} \delta_{zi}$  (1)

2. Estimation formula:

$\hat{d}_{xu} = \frac{1}{2} \delta_{xy} + \frac{1}{2(n - n_x - n_y)} \sum_{i=1, \, i \neq x,y}^{r} n_i \left(\delta_{xi} - \delta_{yi}\right)$  (2)

($\hat{d}_{yu}$ is obtained by symmetry.)

3. Reduction formula:

$\delta_{ui} = w_x \delta_{xi} + w_y \delta_{yi} - w_x \hat{d}_{xu} - w_y \hat{d}_{yu}$,  where $w_x = \frac{n_x}{n_u}$, $w_y = \frac{n_y}{n_u}$  (3)

We won't prove (1) and (3).
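Putting (1)-(3) together, here is a compact Python sketch of the UNJ loop. It is my reading of the reconstructed formulas above, not code from the paper; the final resolution of the last three nodes (step 5) is omitted:

```python
import numpy as np

def unj(delta):
    """Sketch of UNJ using the selection (1), estimation (2) and
    reduction (3) formulas; delta is symmetric with zero diagonal."""
    d = np.asarray(delta, dtype=float).copy()
    nodes = list(range(d.shape[0]))     # labels of the remaining nodes
    n_obj = {i: 1 for i in nodes}       # n_i: objects per node
    edges, new_id = [], d.shape[0]
    while len(nodes) > 3:
        r = len(nodes)
        R = d.sum(axis=1)               # R_z = sum_i delta_zi
        # (1) choose {x, y} maximizing Q_xy = R_x + R_y - (r - 2) delta_xy
        q = R[:, None] + R[None, :] - (r - 2) * d
        np.fill_diagonal(q, -np.inf)
        x, y = np.unravel_index(np.argmax(q), q.shape)
        rest = [i for i in range(r) if i not in (x, y)]
        sizes = np.array([n_obj[nodes[i]] for i in rest], dtype=float)
        # (2) LS estimates of the lengths of edges (x,u) and (y,u)
        dxu = 0.5 * d[x, y] + \
            (sizes * (d[x, rest] - d[y, rest])).sum() / (2 * sizes.sum())
        dyu = d[x, y] - dxu             # by symmetry
        # (3) reduction, weighting x and y by their subtree sizes
        nx, ny = n_obj[nodes[x]], n_obj[nodes[y]]
        wx, wy = nx / (nx + ny), ny / (nx + ny)
        row = wx * d[x] + wy * d[y] - wx * dxu - wy * dyu
        new_d = np.zeros((r - 1, r - 1))
        new_d[:-1, :-1] = d[np.ix_(rest, rest)]
        new_d[-1, :-1] = new_d[:-1, -1] = row[rest]
        edges += [(nodes[x], new_id, dxu), (nodes[y], new_id, dyu)]
        n_obj[new_id] = nx + ny
        nodes = [nodes[i] for i in rest] + [new_id]
        d, new_id = new_d, new_id + 1
    return nodes, d, edges
```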

Page 30: Linear Least Squares  and its applications in distance matrix methods

Conservation Property 1

Given the dissimilarity matrix $\Delta = (\delta_{ij})$, S the adjusted tree and S' its structure, we have (Vach 1989): for every bipartition $\{X, \bar{X}\}$ of S',

$\delta_{X\bar{X}} = s_{X\bar{X}}$  (and $\bar{\delta}_{X\bar{X}} = \bar{s}_{X\bar{X}}$)

Proof: we saw that

$X^T D = (X^T X) \, v = X^T (X v)$

Let's have a closer look at these matrices.

Page 31: Linear Least Squares  and its applications in distance matrix methods

Conservation Property 1

Let n be the number of leaves, q = 2n - 3 the number of edges, and $m = \binom{n}{2}$ the number of distances.

Xv - the m x 1 vector of tree path lengths between the leaves.
D - the m x 1 vector of dissimilarity distances.
$X^T (Xv)$ - the q x 1 vector giving, for each edge, the sum of all "interleaf" tree paths that pass over that edge.
$X^T D$ - the q x 1 vector giving, for each edge, the sum of all observed distances whose path passes over that edge (slide 16).

Since $X^T D = X^T (Xv)$, these per-edge sums coincide, and the sum over an edge is exactly the sum over the bipartition that the edge induces. The property is established.
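A numeric illustration (my own example, not from the slides): fit a quartet topology to a deliberately non-additive D and check that the per-edge bipartition sums are conserved:

```python
import numpy as np

# Quartet AB|CD: rows = pairs (AB, AC, AD, BC, BD, CD);
# columns = pendant edges e1..e4 and the internal edge e5.
X = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 1],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)
D = np.array([3.0, 7.5, 8.2, 6.9, 7.7, 4.1])  # non-additive observations

v = np.linalg.solve(X.T @ X, X.T @ D)         # OLS edge lengths
s = X @ v                                     # fitted tree distances s_ij
# For every edge, the sum of fitted distances crossing it equals the
# sum of observed ones: X^T s == X^T D, i.e. Conservation Property 1.
assert np.allclose(X.T @ s, X.T @ D)
```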

Page 32: Linear Least Squares  and its applications in distance matrix methods

Conservation Property 2

For every ternary node of S' and for every pair X, Y of subtrees associated with this node, we have

$\delta_{XY} = s_{XY}$  (and $\bar{\delta}_{XY} = \bar{s}_{XY}$)

Proof: let X, Y, Z be the three subtrees meeting at the node u. According to prop. 1 we have

$\delta_{X\bar{X}} = s_{X\bar{X}}$,  $\delta_{Z\bar{Z}} = s_{Z\bar{Z}}$,  $\delta_{Y\bar{Y}} = s_{Y\bar{Y}}$

and since $\bar{X} = Y \cup Z$, the first of these reads

$\delta_{XY} + \delta_{XZ} = s_{XY} + s_{XZ}$  (*)

Page 33: Linear Least Squares  and its applications in distance matrix methods

Conservation Property 2

Writing (*) for each of the three subtrees gives

$\delta_{XY} + \delta_{XZ} = s_{XY} + s_{XZ}$
$\delta_{XY} + \delta_{YZ} = s_{XY} + s_{YZ}$
$\delta_{XZ} + \delta_{YZ} = s_{XZ} + s_{YZ}$

Adding the first two equations and subtracting the third yields $2\delta_{XY} = 2 s_{XY}$, i.e.

$\delta_{XY} = s_{XY}$  (and $\bar{\delta}_{XY} = \bar{s}_{XY}$)

The property is established.

Page 34: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

Using the definition of $s_{XY}$ we can rewrite (every path from a leaf of X to a leaf of Y crosses the edges (x, u) and (y, u)):

$s_{XY} = n_Y f_X + n_X n_Y (s_{xu} + s_{yu}) + n_X f_Y$

Using prop. 2 we can write

$\delta_{XY} = n_Y f_X + n_X n_Y (s_{xu} + s_{yu}) + n_X f_Y$
$\delta_{XZ} = n_Z f_X + n_X n_Z (s_{xu} + s_{zu}) + n_X f_Z$
$\delta_{ZY} = n_Y f_Z + n_Z n_Y (s_{zu} + s_{yu}) + n_Z f_Y$

and by solving these equations we obtain

$s_{x,u} = \frac{1}{2} \bar{\delta}_{XY} + \frac{1}{2} \bar{\delta}_{XZ} - \frac{1}{2} \bar{\delta}_{YZ} - \bar{f}_X$  (4)

Page 35: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

Let us consider the agglomerative procedure as described in the algorithm: at the p-th step it remains to resolve r = n - p + 1 nodes, some of which are subtrees. After choosing x and y, Z can be viewed as the union of the r - 2 remaining subtrees I, some of which consist of a single root:

$Z = \bigcup_{I \neq X, Y} I$

Page 36: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

Thus, we may rewrite expression (4) as

$\hat{s}_{xu} = \frac{1}{2} \bar{\delta}_{XY} + \frac{1}{2(n - n_u)} \sum_{I \neq X, Y} n_I \left(\bar{\delta}_{XI} - \bar{\delta}_{YI}\right) - \bar{f}_X$  (5)

where

$n = n_X + n_Y + \sum_{I \neq X, Y} n_I$,  so that  $n - n_u = n_Z$.

(*) The sum over Z decomposes over the subtrees I:

$\frac{1}{2} \bar{\delta}_{XZ} - \frac{1}{2} \bar{\delta}_{YZ} = \frac{1}{2 n_X n_Z} \sum_{i \in X, \, j \in Z} \delta_{ij} - \frac{1}{2 n_Y n_Z} \sum_{i \in Y, \, j \in Z} \delta_{ij} = \frac{1}{2(n - n_u)} \sum_{I \neq X, Y} n_I \left(\bar{\delta}_{XI} - \bar{\delta}_{YI}\right)$

Page 37: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

Now we will prove by induction the following two statements. For each iteration of the algorithm:

(a) for every pair of already-resolved subtrees I, J, represented by the current nodes i, j:
$\delta_{i,j} = \bar{\delta}_{I,J} - \bar{f}_I - \bar{f}_J$

(b) formula (2) and equation (5) are equal.

Important: at each step, the evaluation of (b) is based on the result of (a) from the previous step, and the evaluation of (a) is based on the result of (b) at the current step.

Page 38: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

Base: at the first step the "weight" of each node is 1, i.e., every subtree is a single leaf. Thus, for each node i, $f_i = 0$, so

$\delta_{i,j} = \bar{\delta}_{I,J}$

and (a) holds. (b) is also achieved, because $\delta_{i,j} = \bar{\delta}_{I,J}$ in the first iteration makes (2) and (5) coincide.

Step: assume that (a) and (b) are maintained during step p. We now show that they are also maintained at step p + 1.

Page 39: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

(a) We must check that the hypothesis is maintained for the new node u. Starting from the reduction formula (3), applying the induction hypothesis (a) to $\delta_{xi}$ and $\delta_{yi}$, and using (b) at the current step ($\hat{d}_{xu} = s_{xu}$, $\hat{d}_{yu} = s_{yu}$):

$\delta_{ui} = w_x \delta_{xi} + w_y \delta_{yi} - w_x \hat{d}_{xu} - w_y \hat{d}_{yu}$
$= w_x (\bar{\delta}_{XI} - \bar{f}_X - \bar{f}_I) + w_y (\bar{\delta}_{YI} - \bar{f}_Y - \bar{f}_I) - w_x s_{xu} - w_y s_{yu}$

With $w_x = n_x / n_u$, $w_y = n_y / n_u$ and $\bar{\delta}_{UI} = \frac{n_x}{n_u} \bar{\delta}_{XI} + \frac{n_y}{n_u} \bar{\delta}_{YI}$, this becomes

$= \bar{\delta}_{UI} - \bar{f}_I - \frac{n_x (\bar{f}_X + s_{xu}) + n_y (\bar{f}_Y + s_{yu})}{n_u} = \bar{\delta}_{UI} - \bar{f}_U - \bar{f}_I$

since the flow of the new subtree U satisfies

$\bar{f}_U = \frac{n_x (\bar{f}_X + s_{xu}) + n_y (\bar{f}_Y + s_{yu})}{n_u}$

Thus (a) is maintained.

Page 40: Linear Least Squares  and its applications in distance matrix methods

Formula (2) is correct

(b) We prove the correctness of (b) for step p + 2, i.e., that (5) reduces to formula (2) applied to the running matrix. By (a), $\bar{\delta}_{XY} = \delta_{xy} + \bar{f}_X + \bar{f}_Y$ and $\bar{\delta}_{XI} - \bar{\delta}_{YI} = \delta_{xi} - \delta_{yi} + \bar{f}_X - \bar{f}_Y$. Substituting into (5):

$\hat{s}_{xu} = \frac{1}{2} \bar{\delta}_{XY} + \frac{1}{2(n - n_u)} \sum_{I \neq X, Y} n_I (\bar{\delta}_{XI} - \bar{\delta}_{YI}) - \bar{f}_X$

$= \frac{1}{2} (\delta_{xy} + \bar{f}_X + \bar{f}_Y) + \frac{1}{2(n - n_u)} \sum_{I \neq X, Y} n_I (\delta_{xi} - \delta_{yi} + \bar{f}_X - \bar{f}_Y) - \bar{f}_X$

Since $\sum_{I \neq X, Y} n_I = n - n_u$, the flow terms contribute $\frac{1}{2}(\bar{f}_X + \bar{f}_Y) + \frac{1}{2}(\bar{f}_X - \bar{f}_Y) - \bar{f}_X = 0$, and we are left with

$= \frac{1}{2} \delta_{xy} + \frac{1}{2(n - n_u)} \sum_{i \neq x, y} n_i (\delta_{xi} - \delta_{yi}) = \hat{d}_{xu}$

which is formula (2). (b) is correct.

Page 41: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm - implications

The time complexity of UNJ is O(n³). The property we proved may be exploited within an O(n²) algorithm, allowing the LS estimation of the edge lengths of any binary tree with fixed structure.

I.e., finding the tree topology with UNJ and then the LS edge estimates costs O(n³) + O(n²).

Page 42: Linear Least Squares  and its applications in distance matrix methods

UNJ algorithm - implications

This new version derives from the original version of Saitou & Nei (1987) (the weighted version) and also from Vach (1989) (concerning the length estimation). Simulations show that UNJ surpasses NJ when the data closely follow the chosen model. For certain tree structures, we obtain up to a 50% error reduction in terms of the ability to recover the true tree structure.