Post on 15-Jan-2016
Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees
Ph.D. candidate: Lan Liu
Advisor: Tao Jiang
Outline
- The haplotype inference problem
    - Biological background
    - Approximation and complexity of MRHC
    - Efficient algorithms for ZRHC
    - A linear-time algorithm for loop-free ZRHC
- The tagSNP selection problem
- The minimum common integer partition problem
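Introduction: basic concepts
Example: Mendelian experiment
[Figure: genotypes at a locus and the paternal and maternal haplotypes they resolve into.]
Locus: a marker position on the chromosome.
Genotype: the unordered pair of alleles at each locus, e.g. 1 2.
Haplotype: the alleles a member inherits together from one parent.
Homozygous locus: both alleles are the same, e.g. 1 1 or 2 2.
Heterozygous locus: the two alleles differ, e.g. 1 2.
PS value: for a heterozygous locus written as (paternal maternal), 1 2 has PS value = 0 and 2 1 has PS value = 1.
Mendelian Law: one haplotype comes from the mother and the other comes from the father.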
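Notations and Recombinant
Genotype (two allele rows over four loci): 1122 / 2222
One haplotype configuration for this genotype: 1222 | 2122
[Figure: a father with haplotypes 1111 | 2222 and a mother with haplotypes 2222 | 2222. A child with haplotypes 1111 | 2222 needs 0 recombinants. A child with haplotypes 1122 | 2222 needs 1 recombinant, because its paternal haplotype switches from the father's 1111 to the father's 2222 between loci 2 and 3.]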
Pedigree
An example: British Royal Family
[Figure: pedigree of the British royal family, from Elizabeth II and Prince Philip through their four children and those children's spouses to their grandchildren.]
A mating loop: a cycle inside the pedigree.
Haplotype Reconstruction
- Haplotype: useful, expensive
- Genotype: cheaper to obtain
[Figure: the same genotypes (1 2 at each of two loci) with two candidate haplotype configurations, (a) and (b).]
Reconstruct haplotypes from genotypes
Problem Definitions
MRHC: Given a pedigree and the genotype information for each member, find a haplotype configuration for each member that obeys the Mendelian law, such that the number of recombinants is minimized.
ZRHC: MRHC restricted to zero recombinants.
Loop-free ZRHC: zero recombinants, on a pedigree with no mating loops.
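To make the recombinant count concrete, here is a minimal Python sketch, assuming haplotypes are given as equal-length strings of alleles. The function name `min_recombinants` is hypothetical and not from the talk; it counts the fewest switches between one parent's two haplotypes needed to produce a given child haplotype.

```python
def min_recombinants(parent_a, parent_b, child):
    """Fewest switches between a parent's two haplotypes that yield `child`.

    Returns None if some child allele matches neither parental haplotype.
    """
    INF = float("inf")
    cost_a, cost_b = 0, 0  # best cost so far when currently copying from a / b
    for a, b, c in zip(parent_a, parent_b, child):
        new_a = min(cost_a, cost_b + 1) if a == c else INF
        new_b = min(cost_b, cost_a + 1) if b == c else INF
        cost_a, cost_b = new_a, new_b
    best = min(cost_a, cost_b)
    return None if best == INF else best
```

With a father carrying 1111 and 2222, the child haplotype 1111 costs 0 recombinants and 1122 costs 1. ZRHC asks for configurations where this count is 0 for every parent-child pair.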
Approximation and Complexity of MRHC
The known hardness results for MRHC

Problem                              Hardness
2-locus-MRHC                         NP-hard [LJ03]
Tree-MRHC with bounded #members      P [LJ03]
Tree-MRHC with bounded #loci         P [DLJ03]
Tree-MRHC                            NP-hard [DLJ03]

2-locus-MRHC: 2 loci
Tree-MRHC: pedigree having no mating loops
Our Hardness and Approximation Results

Binary-tree-MRHC
    Hardness: NP-hard
2-locus-MRHC*
    Lower bound of approx. ratio: any f(n)
    Assumption: P ≠ NP
    The lower bound holds for: 2-locus-MRHC*(4,1)
Binary-tree-MRHC*
    Lower bound of approx. ratio: any f(n)
    Assumption: P ≠ NP
    The lower bound holds for: Binary-tree-MRHC*(1,1)
2-locus-MRHC
    Lower bound of approx. ratio: any constant
    Assumption: P ≠ NP, the Unique Games Conjecture [Khot02]
    Upper bound of approx. ratio: O(sqrt(log n))
    The lower bound holds for: 2-locus-MRHC(16,15)
Tree-MRHC
    Lower bound of approx. ratio: any constant
    Assumption: P ≠ NP, the Unique Games Conjecture
    The lower bound holds for: Tree-MRHC(1,u), Tree-MRHC(u,1)

Tree-MRHC: no mating loop
Binary-tree-MRHC: 1 mate, 1 child
Binary-tree-MRHC*: 1 mate, 1 child, missing data
2-locus-MRHC: 2 loci
2-locus-MRHC*: 2 loci with missing data
The ZRHC problem
Problem definition: Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.
Previous work
Li and Jiang introduced a system of linear equations over F(2) and presented an O(m^3 n^3) time algorithm for ZRHC [LJ03], where m is #loci and n is #members in the pedigree.
Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.
Methods based on fast matrix multiplication could achieve an asymptotic running time of O(k^2.376) on k equations with k unknowns.
The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].
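Our Result
We present a much faster algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n).
[Figure: the linear system Ax = b is shrunk in two stages.]
1. Ax = b with O(mn) equations on O(mn) unknowns
2. After transformation: Ax = b with O(mn) equations on O(n) unknowns
3. After redundancy elimination: Ax = b with O(n log^2 n log log n) equations on O(n) unknowns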
The New Linear System
m: #loci; n: #members in the pedigree.
Unknowns:
p_j: the paternal haplotype vector of a member j.
h_{j1,j}: the scalar indicating inheritance information between a parent j1 and a child j.
[Figure: a father j1, a mother j2 and their child j at four loci. Member j carries the two haplotypes p_j and p_j + w_j (likewise p_{j1}, p_{j1} + w_{j1} and p_{j2}, p_{j2} + w_{j2}); the edges from the parents to the child are labelled h_{j1,j} and h_{j2,j}. In the example, p_{j1}[2] = 1 and p_{j1}[3] = 0.]
The Linear System
O(mn) equations on O(mn) unknowns.
Given a homozygous locus i on a member j (with a child j1), p_j[i] and p_{j1}[i] are pre-determined.
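The pre-determined entries can be read directly off the genotypes. The sketch below is illustrative only: it assumes alleles are coded 1 and 2, that p-values are stored as bits with allele 1 mapped to 0 and allele 2 mapped to 1, and that the genotypes are consistent with the Mendelian law. The names `het_vector` and `predetermined_p` are hypothetical.

```python
def het_vector(genotype):
    """w[i] = 1 if locus i is heterozygous, else 0."""
    return [int(a != b) for a, b in genotype]


def predetermined_p(genotypes, parents_of):
    """Return {(member, locus): bit} for p-variables fixed by homozygous loci.

    genotypes: {member: [(allele, allele), ...]} with alleles 1 or 2.
    parents_of: {child: (father, mother)}.
    Bits encode alleles as allele - 1 (an assumed encoding).
    """
    fixed = {}
    for j, genotype in genotypes.items():
        for i, (a, b) in enumerate(genotype):
            if a == b:                       # homozygous locus on j itself
                fixed[(j, i)] = a - 1
    for child, (father, mother) in parents_of.items():
        for i, (c1, c2) in enumerate(genotypes[child]):
            fa, fb = genotypes[father][i]
            ma, mb = genotypes[mother][i]
            if fa == fb:                     # father can only transmit fa
                fixed[(child, i)] = fa - 1
            elif ma == mb:                   # mother transmits ma, so the
                fixed[(child, i)] = (c1 + c2 - ma) - 1  # paternal allele is the other
    return fixed
```

Every p-variable not listed in the result remains an unknown of the linear system.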
Pedigree Graph
[Figure: a pedigree with genotypes on nine members (1 to 9) and the corresponding pedigree graph G.]
Pedigree graph G: #edges ≤ 2n
Locus Graph
Locus graph G_i = (V, E_i), where E_i = {(k, j) | k is a parent of j, w_k[i] = 1}.
Example: locus graph for the 3rd locus
[Figure: (a) genotype information at the locus for the nine pedigree members; (b) the locus graph, with pre-determined p-values (0 or 1) or "?" on the vertices, h-variables h_{1,4}, h_{4,9}, h_{8,9} and h_{6,8} on the edges, and zero-weight edges marked separately.]
p-variables: variables on vertices.
h-variables: variables on edges, shared by all locus graphs.
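The definition of E_i translates almost literally into code. This is a small sketch under assumed data structures: `parents_of` maps each child to its parents and `w[k]` is the heterozygosity vector of member k. Both names, and the toy data in the usage note, are hypothetical.

```python
def locus_graph(parents_of, w, i):
    """Edges E_i = {(k, j) : k is a parent of j and w[k][i] == 1}."""
    return sorted(
        (k, j)
        for j, parents in parents_of.items()
        for k in parents
        if w[k][i] == 1
    )
```

For example, with `parents_of = {4: (1, 2), 9: (4, 8)}` and member 2 homozygous at locus 0, the edge (2, 4) is left out of the locus graph for locus 0.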
An Observation
For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the summation of h-variables along the path is a constant. We can use paths to denote constraints!
(Proof sketch) Assume the path in locus graph G_i connects two pre-determined vertices j0 and jk. Each edge on the path gives one equation, where each d is a constant:
    p_{j0}[i] + d_{j0,j1} + h_{j0,j1} = p_{j1}[i]
    p_{j1}[i] + d_{j1,j2} + h_{j1,j2} = p_{j2}[i]
    ...
    p_{jk-1}[i] + d_{jk-1,jk} + h_{jk-1,jk} = p_{jk}[i]
Adding these equations over F(2) cancels the intermediate p-variables, so
    h_{j0,j1} + h_{j1,j2} + ... + h_{jk-1,jk} = p_{j0}[i] + p_{jk}[i] + d_{j0,j1} + ... + d_{jk-1,jk},
which is a constant because p_{j0}[i] and p_{jk}[i] are pre-determined.
Examples of Linear Constraints
[Figure: three locus graphs on the example pedigree, with pre-determined p-values or "?" on the vertices.]
(a) 1st locus graph: h_{6,8} + h_{8,9} = 1
(b) 2nd locus graph: h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0
(c) 3rd locus graph: h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0
Linear Constraints
Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient.
Moreover, we can upper bound #constraints in each locus graph by O(n), while the trivial analysis gives an upper bound of O(n^2).
Total #constraints = O(mn).
The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G=(V,E) and genotype {gj}
output: a general solution of {pj}
begin
Step 1. Preprocessing
Step 2. Linear constraint generation on h-variables
Step 3. Solve h-variables by Gaussian Elimination
Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.
end
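Step 3 solves a linear system over F(2). As a rough illustration, and not the talk's optimized method, the sketch below runs textbook Gaussian elimination with each equation packed into a Python integer. The function name `solve_gf2` and the input format are assumptions made for this example.

```python
def solve_gf2(equations, num_vars):
    """Solve a linear system over F(2) by Gaussian elimination.

    equations: list of (vars, rhs), meaning the XOR of the listed variable
    indices equals rhs (0 or 1). Returns one solution as a list of bits with
    free variables set to 0, or None if the system is inconsistent.
    """
    mask = (1 << num_vars) - 1
    pivots = []  # (pivot variable, row); rows are fully reduced
    for vars_, rhs in equations:
        row = rhs << num_vars          # bit num_vars holds the right-hand side
        for v in vars_:
            row ^= 1 << v
        for v, prow in pivots:         # eliminate known pivot variables
            if (row >> v) & 1:
                row ^= prow
        coeffs = row & mask
        if coeffs == 0:
            if row >> num_vars:        # reduced to 0 = 1
                return None
            continue                   # redundant equation
        v = coeffs.bit_length() - 1    # new pivot variable
        pivots = [(u, prow ^ row if (prow >> v) & 1 else prow)
                  for u, prow in pivots]
        pivots.append((v, row))
    solution = [0] * num_vars
    for v, row in pivots:
        solution[v] = row >> num_vars
    return solution
```

Numbering the eight h-variables of the three example constraints above as 0 to 7 and passing the three equations returns an assignment that satisfies all of them; redundant equations are simply skipped.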
Our method: solve h-variables and p-variables separately.
    O(mn) linear equations on O(n) h-variables.
Traditional method: solve h-variables and p-variables together.
    O(mn) equations on O(mn) unknowns: O(mn) p-variables and O(n) h-variables.
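Our Method
[Figure, repeated from "Our Result": Ax = b with O(mn) equations on O(mn) unknowns; after transformation, O(mn) equations on O(n) unknowns; after redundancy elimination, O(n log^2 n log log n) equations on O(n) unknowns.]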
Redundant Equation Elimination
An observation
Given a cycle j0, j1, j2, ..., jk-1, jk, assume that there are constraints between each pair of vertices. Originally, there are O(k^2) constraints. Notice that they are not independent. We can replace the original constraints by an equivalent set of constraints of size O(k).
[Figure: the cycle j0, j1, ..., jk with pairwise constraints such as j0 ~ j2, j2 ~ jk-1 and j0 ~ jk-1.]
Remove the redundant equations without solving them!
Redundant Equation Elimination
Key lemma
Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.
Elkin, Emek, Spielman and Teng show that every graph has a spanning tree with average stretch O(log^2 n log log n).
The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, O(n log^2 n log log n).
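To illustrate the definition of stretch, the following sketch computes the stretch of every non-tree edge of a simple connected graph. It uses an ordinary BFS spanning tree as a stand-in; it does not implement the low-stretch tree construction of Elkin et al., and the function name `edge_stretches` is hypothetical.

```python
from collections import deque


def edge_stretches(n, edges):
    """Build a BFS spanning tree and return the stretch of each non-tree edge.

    n: number of vertices, labelled 0..n-1; edges: list of (u, v) pairs of a
    simple connected undirected graph. The stretch of (u, v) is the length of
    the tree path between u and v.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    parent, depth = [None] * n, [0] * n
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v], depth[v] = u, depth[u] + 1
                queue.append(v)

    def tree_distance(u, v):
        dist = 0
        while u != v:                 # walk the deeper endpoint upward
            if depth[v] > depth[u]:
                u, v = v, u
            u = parent[u]
            dist += 1
        return dist

    return {
        (u, v): tree_distance(u, v)
        for u, v in edges
        if parent[u] != v and parent[v] != u
    }
```

On a 5-cycle, the single non-tree edge closes a cycle of length 5, so its stretch is 4.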
The Loop-Free ZRHC problem
Problem definition: Given a pedigree without mating loops and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.
Constraint Graphs
Given the constraints in a pedigree graph, we can construct the corresponding constraint graph.

Pedigree graph                                        Constraint graph
vertex v                                              vertex v
a constraint for the path connecting vertices         an edge (j, k) with weight b
j and k, with the sum of h-variables along
the path being b

An example
[Figure: (a) a pedigree graph on vertices 1 to 5 with constraints; (b) the corresponding constraint graph.]

Constraint (path)    Sum of h-variables
(1,5)                1
(1,2)                1
(2,4)                0
(2,5)                0
A Key Lemma
There exists a solution to the loop-free ZRHC problem if and only if the weight sum of every cycle C is 0 in the corresponding constraint graph.
(Proof sketch)
Only if: Each h-variable occurs an even number of times in the constraint set S corresponding to C. The sum of the h-variables in S is equal to the weight sum of C. Hence the weight sum of C is 0.
If: Done by a construction later.
[Figure: (a) the pedigree graph on vertices 1 to 5; (b) the corresponding constraint graph.]
The constraints in S are not independent!
The constraints forming a spanning forest in the constraint graph are sufficient to represent all constraints.
There are at most n-1 independent constraints.
We can construct an injective mapping f from the independent constraints to edges in the pedigree graph.
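The cycle condition of the lemma can be checked without enumerating cycles. A minimal sketch, assuming the constraint graph is given as a list of weighted edges (j, k, b) with b in {0, 1}; the function name `constraints_consistent` is hypothetical. It labels each vertex with a bit so that the labels of the endpoints of every edge differ by the edge weight, which is possible exactly when every cycle has weight sum 0 over F(2).

```python
from collections import deque


def constraints_consistent(edges):
    """Check that every cycle of the constraint graph has weight sum 0 mod 2.

    edges: list of (j, k, b) with b in {0, 1}.
    """
    adj = {}
    for j, k, b in edges:
        adj.setdefault(j, []).append((k, b))
        adj.setdefault(k, []).append((j, b))

    potential = {}
    for start in adj:
        if start in potential:
            continue
        potential[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, b in adj[u]:
                if v not in potential:
                    potential[v] = potential[u] ^ b
                    queue.append(v)
                elif potential[v] != potential[u] ^ b:
                    return False      # found a cycle with weight sum 1
    return True
```

The four example constraints (1,5): 1, (1,2): 1, (2,4): 0, (2,5): 0 pass this check, since the only cycle 1, 2, 5 has weight 1 + 0 + 1 = 0. The BFS tree edges used here also form a spanning forest of the constraint graph.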
A Mapping from Constraints to Edges
[Figure: (a) a spanning forest for the constraint graph; (b) the pedigree graph on vertices 1 to 5.]

Constraint (path)    Sum of h-variables    Mapped edge
(1,2)                1                     (2,3)
(2,4)                0                     (3,4)
(2,5)                0                     (4,5)

Each constraint is mapped to an edge on the path corresponding to the constraint.
The ZRHC-PHASE algorithm (recap)
Step 3 of ZRHC_PHASE solves the h-variables by Gaussian elimination. It takes O(n^3) time!
Solving h-variables
In order to obtain a linear-time algorithm, we want to avoid the Gaussian elimination method.
An observation: Given a constraint along a path j0, j1, ..., jk-1, jk:
    h_{j0,j1} + h_{j1,j2} + ... + h_{jk-1,jk} = b
We can solve the constraint in the following way:
- Assign the h-variables on edges (j0, j1), (j1, j2), ..., (jk-2, jk-1) arbitrarily.
- Assign the h-variable on the last edge (jk-1, jk) the fixed value that satisfies the constraint: h_{jk-1,jk} = h_{j0,j1} + ... + h_{jk-2,jk-1} + b.
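The observation for a single path constraint can be sketched in a few lines, assuming the path is given as a list of vertices and arithmetic is over F(2). The function name `solve_path_constraint` is hypothetical; it handles one constraint in isolation, not the full interaction between constraints that the mapping f resolves.

```python
import random


def solve_path_constraint(path, b, rng=random):
    """Assign h-values on the edges of `path` so that they sum to b over F(2).

    path: list of vertices j0, j1, ..., jk (k >= 1). All edges but the last
    get arbitrary bits; the last edge is forced by the constraint.
    """
    edges = list(zip(path, path[1:]))
    h = {e: rng.randint(0, 1) for e in edges[:-1]}
    h[edges[-1]] = (sum(h.values()) + b) % 2
    return h
```

Whatever bits are drawn for the first k-1 edges, the returned values always add up to b modulo 2.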
Solving h-variables Based on the Mapping f
We have constructed the injective mapping f : S -> E, where S is the constraint set and E is the edge set.
h-variables can be solved by a single BFS traversal.
[Figure: the example pedigree graph on vertices 1 to 5, with each edge marked as in f(S) or not in f(S) and labelled with its assigned h-value.]

Constraint    Mapped edge    Sum of h-variables
(1,2)         (2,3)          1
(2,4)         (3,4)          0
(2,5)         (4,5)          0

We solve h-variables as follows:
- For each h-variable corresponding to an edge e not in f(S), assign an arbitrary value.
- For each h-variable corresponding to an edge e in f(S), assign a fixed value based on the constraint f^{-1}(e), such that the constraint is satisfied.
The tagSNP selection problem
Motivation
With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in dbSNP.
We aim to select a subset of informative SNPs (i.e. tagSNPs) to reduce the cost of genotyping all SNPs and performing disease association mapping.
r^2 Linkage Disequilibrium Statistics
Given a pair of genetic markers 1 and 2, with alleles A, a at marker 1 and B, b at marker 2:

                  marker 2
                  B        b
marker 1    A     p_AB     p_Ab     p_A.
            a     p_aB     p_ab     p_a.
                  p_.B     p_.b

r^2 statistic:
    r^2 = (p_AB - p_A. p_.B)^2 / (p_A. (1 - p_A.) p_.B (1 - p_.B))

If r^2 is no less than a given threshold r0, marker 1 can tag marker 2, and marker 2 can tag marker 1.
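The r^2 formula is easy to evaluate once the three frequencies are known. A small sketch, with hypothetical function names `r_squared` and `can_tag`, taking the haplotype frequency p_AB and the marginal frequencies p_A. and p_.B as plain floats:

```python
def r_squared(p_AB, p_A, p_B):
    """r^2 LD statistic from the haplotype frequency p_AB and the marginal
    allele frequencies p_A (marker 1) and p_B (marker 2)."""
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic marker")
    return (p_AB - p_A * p_B) ** 2 / denom


def can_tag(p_AB, p_A, p_B, r0):
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(p_AB, p_A, p_B) >= r0
```

Two markers in perfect LD (p_AB = p_A. = p_.B = 0.5) give r^2 = 1, while independent markers (p_AB = p_A. p_.B) give r^2 = 0.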
The TagSNP Selection Problem
Given a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | v_j1 and v_j2 are in V, and r^2(v_j1, v_j2) is no less than a given threshold r0}, we want to select a subset V' of V of minimum cardinality, such that for any v in V there exists a v' in V' with r^2(v, v') no less than r0.
If we define G = (V, E), a tagSNP set is equivalent to a dominating set of G.
[Figure: (a) SNP markers 1 to 6 and their LD patterns in a population; (b) tagSNPs for the population.]
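Since a tagSNP set is a dominating set of G, a standard greedy heuristic gives a quick upper bound. The sketch below is a generic greedy dominating-set routine with a hypothetical name `greedy_tag_snps`; it is not claimed to match the talk's GreedyTag algorithm, whose details are not given here.

```python
def greedy_tag_snps(markers, ld_pairs):
    """Greedy dominating-set heuristic on G = (V, E).

    markers: iterable of sortable marker ids (V); ld_pairs: iterable of
    (u, v) pairs whose r^2 meets the threshold (E). Returns the chosen tags.
    """
    covers = {v: {v} for v in markers}       # closed neighbourhood of v
    for u, v in ld_pairs:
        covers[u].add(v)
        covers[v].add(u)

    untagged = set(covers)
    tags = []
    while untagged:
        # pick the marker that tags the most still-untagged markers
        best = max(sorted(covers), key=lambda v: len(covers[v] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags
```

The result is always a valid tagSNP set, but the greedy choice does not guarantee minimum cardinality.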
TagSNP Selection across Populations
In two populations with different evolutionary histories, a pair of SNPs having remarkably different marker frequencies and very weak LD may show strong LD in the admixed population.
Therefore, tagSNPs picked from the combined populations or one of the populations might not be sufficient to capture the variations in all populations.
Problem Definition
Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum common tagSNP set for each of the populations.
The above problem is called the minimum common tagSNP selection problem (MCTS).
[Figure: (a) SNP markers 1 to 6 and their LD patterns in two populations, Population 1 and Population 2; (b) the minimum tagSNP set for these two populations.]
Our Algorithms
The MCTS problem can be easily formulated as an integer linear program.
We first apply some data reduction rules, then use one of the following algorithms:
- A greedy algorithm: GreedyTag
- A Lagrangian relaxation algorithm: LRTag
We calculate both the upper bound (i.e. the number of tagSNPs obtained by our algorithms) and the lower bound (i.e. the minimum number of tagSNPs needed).
Lower bound: GreedyTag_lb and LRTag_lb
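One natural way to write MCTS as an integer linear program is sketched below. The notation is an assumption made for this note, not taken from the talk: x_v is 1 when marker v is chosen as a tagSNP, E_p is the LD pattern set of population p, and N_p[v] is v together with its neighbours in E_p.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{v \in V} x_v \\
\text{subject to}\quad & \sum_{u \in N_p[v]} x_u \ge 1
    && \text{for every population } p \text{ and every } v \in V, \\
& x_v \in \{0, 1\} && \text{for every } v \in V,
\end{aligned}
\qquad N_p[v] = \{v\} \cup \{u : (u, v) \in E_p\}.
```

Relaxing the covering constraints with multipliers is the usual starting point for a Lagrangian relaxation approach, which also yields a lower bound on the optimum.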
Experimental Results
We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).
There are four populations in the HapMap data:
- CEU: European descendants.
- CHB: Chinese people from Beijing.
- JPT: Japanese people from Tokyo.
- YRI: Yoruba people of Ibadan, Nigeria.
We get tagSNPs for the following two datasets:
- ENCODE regions: all 10 ENCODE regions, with 10,859 markers in total.
- Human genome: chromosomes 1 to 22, with 2,862,454 markers in total.
Experimental Results for ENCODE Regions
We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
The gap between LRTag_lb and LRTag is at most two for each ENCODE region and six in total for all ENCODE regions when the r^2 threshold is 0.5. There is no gap when the r^2 threshold is 0.8.
Experimental Results for Human Genome
The gap between our solution and the lower bound is 1061 SNPs when the r^2 threshold is 0.5, given the entire human genome with 2,862,454 SNPs. The gap is 142 SNPs when the r^2 threshold is 0.8.
The numbers of tagSNPs selected by our algorithms are almost optimal.
The minimum common integer partition problem
Problem Definitions
P(n): given a positive integer n, a partition is a multiset of positive integers, say {n1, n2, ..., nr}, s.t. n1 + n2 + ... + nr = n.
Example: given n = 4, {2,2} is a P(4); given n = 3, {3} is a P(3). #P(100) = 190,569,292.
IP(S): given a multiset S = {x1, ..., xm}, an integer partition is a disjoint union of partitions P(x1), ..., P(xm).
Example: given S = {3,3,4}, {2,2,3,3} is an IP({3,3,4}).
CIP(S1, S2, ..., Sk): given multisets S1, S2, ..., Sk, a common integer partition of all multisets.
Example: given S = {3,3,4}, T = {2,2,6}, {2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T).
MCIP(S1, S2, ..., Sk): a common integer partition with the minimum cardinality.
Example: {2,2,3,3} is an MCIP(S,T).
MCIP is NP-hard.
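The definitions can be checked mechanically on small inputs. The sketch below verifies whether a candidate multiset is a common integer partition, using exhaustive backtracking; it only verifies a candidate and does not search for a minimum one. The function names are hypothetical, and the running time is exponential in the worst case.

```python
def is_integer_partition(parts, targets):
    """True if the multiset `parts` can be split into groups whose sums are
    exactly the elements of the multiset `targets` (backtracking search)."""
    parts = sorted(parts, reverse=True)
    if sum(parts) != sum(targets) or any(p <= 0 for p in parts):
        return False
    remaining = list(targets)

    def place(idx):
        if idx == len(parts):
            return all(r == 0 for r in remaining)
        tried = set()                 # skip targets with equal remaining room
        for t in range(len(remaining)):
            r = remaining[t]
            if r >= parts[idx] and r not in tried:
                tried.add(r)
                remaining[t] -= parts[idx]
                if place(idx + 1):
                    return True
                remaining[t] += parts[idx]
        return False

    return place(0)


def is_common_integer_partition(parts, *multisets):
    """True if `parts` is an integer partition of every given multiset."""
    return all(is_integer_partition(parts, s) for s in multisets)
```

On the example above, both {2,2,3,3} and {1,1,2,2,4} are accepted for S = {3,3,4} and T = {2,2,6}, while S itself is rejected because it is not an integer partition of T.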
Biological Applications (1)
The distance between two strings:
    a b c d e f g h i j k h
    h i j k h e f g a b c d
Genetic distance between two genomes
[Figure: the two strings cut into the common substrings a b c d, e f g and h i j k h.]
Minimum Common Substring Partition
Biological Applications (2)
MCIP is a special case of Minimum Common Substring Partition (MCSP).
MCIP(S', T') with S' = {x1, x2, ..., xm} and T' = {y1, y2, ..., yn} corresponds to MCSP(S, T) with
    S = a^x1 |- a^x2 |- ... |- a^xm
    T = a^y1 -| a^y2 -| ... -| a^yn
where a^x denotes x copies of the letter a, and |- and -| are separator symbols.
Our Result
2-MCIP: MCIP on two input multisets
k-MCIP: MCIP on k input multisets
APX-hard: there is a constant c > 1 such that the problem cannot be approximated within ratio c unless P = NP.

Problem           Approximation upper bound    Approximation lower bound
2-MCIP            5/4                          APX-hard
k-MCIP (k > 2)    3k(k-1)/(3k-2)               APX-hard
Conclusion and Future Work
- The haplotype inference problem
    - Biological background
    - Approximation and complexity of MRHC
    - Efficient algorithms for ZRHC
    - A linear-time algorithm for loop-free ZRHC
- The tagSNP selection problem
- The minimum common integer partition problem
References
L. Liu and T. Jiang. Linear-Time Reconstruction of Zero-Recombinant Mendelian Inheritance on Pedigrees without Mating Loops. In submission.
L. Liu, Y. Wu, S. Lonardi and T. Jiang. Efficient Algorithms for Genome-wide TagSNP Selection across Populations via Linkage Disequilibrium Criterion. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).
Y. Wu, L. Liu, T. Close and S. Lonardi. Deconvoluting the BAC-gene Relationship Using a Physical Map. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).
J. Xiao, L. Liu, L. Xia and T. Jiang. Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA'2007), pp. 655-664.
X. Chen, L. Liu, Z. Liu and T. Jiang. On the Minimum Common Integer Partition Problem. In Proc. of the 6th Conference on Algorithms and Complexity, Rome, Italy, pp. 236-247.
L. Liu, X. Chen, J. Xiao and T. Jiang. Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem. In Proc. of the 16th Annual International Symposium on Algorithms and Computation (ISAAC'05), pp. 370-379. [Best paper nominations: 5.35%]. To appear in Theoretical Computer Science.
Thanks for your time and attention!