Ph.D. candidate: Lan Liu

Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees

Advisor: Tao Jiang

Outline

- The haplotype inference problem
  - Biological background
  - Approximation and complexity of MRHC
  - Efficient algorithms for ZRHC
  - A linear-time algorithm for loop-free ZRHC
- The tagSNP selection problem
- The minimum common integer partition problem

Introduction Basic concepts

Example: Mendelian experiment

Genotype

Haplotype

[Figure: alleles at each locus. Heterozygous locus: alleles 1 2 (PS value = 0) or 2 1 (PS value = 1). Homozygous locus: alleles 2 2 or 1 1.]

Mendelian Law: one haplotype comes from the mother and the other comes from the father.
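As a minimal sketch of the Mendelian law at a single locus, the check below tests whether a child's unordered genotype can be formed from one paternal and one maternal allele. The function name and the tuple encoding of genotypes are assumptions made for illustration, not notation from the slides.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's unordered genotype at one locus can be
    formed by taking one allele from the father and one from the mother.

    Each genotype is a pair of alleles, e.g. (1, 2).
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child_sorted
        for f, m in product(father, mother)
    )


if __name__ == "__main__":
    print(mendelian_consistent((1, 2), (2, 2), (1, 2)))  # consistent
    print(mendelian_consistent((1, 1), (1, 1), (1, 2)))  # not consistent
```

The check is per locus only; it says nothing about recombinants, which depend on how alleles at several loci are grouped into haplotypes.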

paternal maternal

Notations and Recombinant

Genotype

Haplotype Configuration

[Figure: two father-mother-child trios. In the first, the child's haplotypes are inherited with 0 recombinants; in the second, with 1 recombinant.]

Pedigree

Camilla, Duchess of Cornwall
Peter Phillips
Zara Phillips
Diana, Princess of Wales
Prince William of Wales
Prince Henry of Wales
Princess Beatrice of York
Princess Eugenie of York
Lady Louise Windsor
Prince Charles, Prince of Wales
Princess Anne, Princess Royal
Commander Timothy Laurence
Prince Andrew, Duke of York
Sarah Margaret Ferguson
Prince Edward, Earl of Wessex
Sophie Rhys-Jones
Elizabeth II of the United Kingdom
Prince Philip, Duke of Edinburgh
Captain Mark Phillips

An example: British Royal Family

A mating loop: a cycle inside the pedigree.

Haplotype Reconstruction
- Haplotype: useful, expensive
- Genotype: cheaper to obtain

[Figure: example genotypes and the haplotypes reconstructed from them.]

Reconstruct haplotypes from genotypes

Problem Definitions

MRHC: Given a pedigree and the genotype information for each member, find a haplotype configuration for each member that obeys the Mendelian law, such that the number of recombinants is minimized.

ZRHC: zero-recombinant

Loop-free-ZRHC: zero recombinant, pedigree with no mating loops


Approximation and Complexity of MRHC

The known hardness results for MRHC

Problem                            Hardness
2-locus-MRHC                       NP-hard [LJ03]
Tree-MRHC with bounded #members    P [LJ03]
Tree-MRHC with bounded #loci       P [DLJ03]
Tree-MRHC                          NP-hard [DLJ03]

2-locus-MRHC: 2 loci
Tree-MRHC: pedigree having no mating loops

Our Hardness and Approximation Results

Hardness
- Binary-tree-MRHC: NP-hard

2-locus-MRHC*
- Lower bound of approx.: any f(n)
- Assumption: P ≠ NP
- The lower bound holds for: 2-locus-MRHC*(4,1)

Binary-tree-MRHC*
- Lower bound of approx.: any f(n)
- Assumption: P ≠ NP
- The lower bound holds for: Binary-tree-MRHC*(1,1)

2-locus-MRHC
- Lower bound of approx.: any constant
- Assumption: P ≠ NP, the Unique Games Conjecture [Khot02]
- Upper bound of approx.: O(sqrt(log n))
- The lower bound holds for: 2-locus-MRHC(16,15)

Tree-MRHC
- Lower bound of approx.: any constant
- Assumption: P ≠ NP, the Unique Games Conjecture
- The lower bound holds for: Tree-MRHC(1,u), Tree-MRHC(u,1)

Tree-MRHC: no mating loop
Binary-tree-MRHC: 1 mate, 1 child
Binary-tree-MRHC*: 1 mate, 1 child, missing data
2-locus-MRHC: 2 loci
2-locus-MRHC*: 2 loci with missing data


The ZRHC problem

Problem definition
Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Previous work
- Li and Jiang introduced a system of linear equations over F[2] and presented an O(m^3 n^3) time algorithm for ZRHC [LJ03], where m is #loci and n is #members in the pedigree.
- Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.
- Methods based on fast matrix multiplication algorithms could achieve an asymptotic running time of O(k^2.376) on k equations with k unknowns.
- The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].

Our Result

We present a much faster algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n).

[Figure: Ax=b, O(mn) -> (transformation) -> Ax=b, O(mn) -> (redundancy elimination) -> O(n log^2 n log log n)]

The New Linear System

m: #loci
n: #members in pedigree

Unknowns
- p_j: the paternal haplotype vector of a member j.
- h_{j1,j}: the scalar encoding the inheritance information between a parent j1 and a child j.

The New Linear System

[Figure: a father j1, a mother j2 and their child j over four loci. Each member k carries the haplotypes p_k and p_k + w_k; the edges from the parents to the child are labeled h_{j1,j} and h_{j2,j}. In the example, p_{j1,2} = 1 and p_{j1,3} = 0.]

The Linear System

O(mn) equations on O(mn) unknowns.

Given a homozygous locus i on a member j (with a child j1), p_j[i] and p_{j1}[i] are pre-determined.

Pedigree Graph

[Figure: a pedigree with genotypes and its pedigree graph G.]

#edges ≤ 2n

Locus Graph

G_i = (V, E_i), where E_i = {(k,j) | k is a parent of j, w_k[i] = 1}

[Figure: (a) genotype info; (b) the locus graph G_i for the 3rd locus, with zero-weight edges marked and edge variables such as h_{4,9} and h_{8,9}.]

p-variables: variables on vertices.
h-variables: variables on edges, shared by all locus graphs.

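An Observation

For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the summation of the h-variables along the path is a constant. We can use paths to denote constraints!

(proof sketch) Assume the path in locus graph G_i connects two pre-determined vertices j_0 and j_k. Along its edges we have

p_{j0}[i] + d_{j0,j1} + h_{j0,j1} = p_{j1}[i]
p_{j1}[i] + d_{j1,j2} + h_{j1,j2} = p_{j2}[i]
p_{j2}[i] + d_{j2,j3} + h_{j2,j3} = p_{j3}[i]
...
p_{jk-1}[i] + d_{jk-1,jk} + h_{jk-1,jk} = p_{jk}[i]

Adding these equations over F[2] cancels the intermediate p-variables:

h_{j0,j1} + h_{j1,j2} + ... + h_{jk-1,jk} = p_{j0}[i] + p_{jk}[i] + d_{j0,j1} + d_{j1,j2} + ... + d_{jk-1,jk}

The right-hand side is a constant, since p_{j0}[i] and p_{jk}[i] are pre-determined and the d-terms are known.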

Examples of Linear Constraints

(a) 1st locus graph: h_{6,8} + h_{8,9} = 1
(b) 2nd locus graph: h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} =
(c) 3rd locus graph: h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0

Linear Constraints

Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient.

Moreover, we can upper bound #constraints in each locus graph as O(n), while the trivial analysis gives an upper bound O(n^2).

Total #constraints = O(mn).

The ZRHC-PHASE algorithm

Algorithm ZRHC_PHASE

input: a pedigree G=(V,E) and genotype {gj}

output: a general solution of {pj}

Step 1. Preprocessing

Step 2. Linear constraint generation on h-variables

Step 3. Solve h-variables by Gaussian Elimination

Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.
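Step 3 above solves a linear system over F[2]. The following is a minimal sketch of Gaussian elimination over F[2], assuming each equation is encoded as an integer bitmask of the h-variables it contains plus a right-hand-side bit. The function name and the encoding are illustrative assumptions; the slides do not specify an implementation.

```python
def solve_gf2(rows, rhs, num_vars):
    """Solve A x = b over F[2] by Gaussian elimination.

    rows[i] is an int bitmask: bit v is set when variable v appears in
    equation i; rhs[i] is 0 or 1. Returns one solution as a list of bits
    (free variables set to 0), or None if the system is inconsistent.
    """
    pivots = {}  # pivot variable -> (row mask, rhs), kept fully reduced
    for mask, b in zip(rows, rhs):
        # Reduce the new equation by every pivot row found so far.
        for v, (pmask, pb) in pivots.items():
            if (mask >> v) & 1:
                mask ^= pmask
                b ^= pb
        if mask == 0:
            if b:
                return None  # the equation reads 0 = 1
            continue  # redundant equation, nothing to record
        v = mask.bit_length() - 1  # highest remaining variable is the pivot
        # Eliminate the new pivot variable from the earlier pivot rows.
        for u, (pmask, pb) in list(pivots.items()):
            if (pmask >> v) & 1:
                pivots[u] = (pmask ^ mask, pb ^ b)
        pivots[v] = (mask, b)
    x = [0] * num_vars
    for v, (_, b) in pivots.items():
        x[v] = b  # free variables are 0, so each pivot equals its rhs
    return x


if __name__ == "__main__":
    # h0 + h1 = 1, h1 + h2 = 0, h0 + h2 = 1 (the third equation is redundant)
    print(solve_gf2([0b011, 0b110, 0b101], [1, 0, 1], 3))
```

Setting free variables to 0 yields one particular solution; a general solution, as in the algorithm's stated output, would keep the free variables symbolic.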

Our method: solve h-variables and p-variables separately.
O(mn) linear equations on O(n) h-variables.

Traditional method: solve h-variables and p-variables together.
O(mn) equations on O(mn) unknowns: O(mn) p-variables and O(n) h-variables.

Our Method

[Figure: Ax=b, O(mn) -> (transformation) -> Ax=b, O(mn) -> (redundancy elimination) -> O(n log^2 n log log n)]

Redundant Equation Elimination

An observation

Given a cycle j_0, j_1, ..., j_{k-1}, assume that there are constraints among each pair of vertices (e.g. j_0 ~ j_2, j_0 ~ j_{k-1}, j_2 ~ j_{k-1}). Originally, there are O(k^2) constraints. Notice that they are not independent. We can replace the original constraints by an equivalent set of constraints of size O(k).

Remove the redundant equations without solving them!
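To illustrate why the pairwise constraints are not independent, the sketch below builds the coefficient vectors of all O(k^2) pairwise path constraints and computes their rank over F[2]. The encoding is an assumption made for illustration: edge e joins j_e and j_{e+1}, and the constraint for a pair (a, b) with a < b sums the h-variables on edges a, ..., b-1. Note that this only measures the rank; it is not the redundancy-elimination procedure of the talk, which avoids solving the equations.

```python
def gf2_rank(masks):
    """Rank over F[2] of vectors given as int bitmasks (XOR basis)."""
    basis = {}  # highest set bit -> basis vector with that leading bit
    for m in masks:
        while m:
            top = m.bit_length() - 1
            if top not in basis:
                basis[top] = m
                break
            m ^= basis[top]  # clears the leading bit of m
    return len(basis)


def pairwise_path_constraints(k):
    """Coefficient vectors for every vertex pair (a, b), a < b, where the
    path from j_a to j_b uses edges a, a+1, ..., b-1."""
    masks = []
    for a in range(k):
        for b in range(a + 1, k):
            masks.append(sum(1 << e for e in range(a, b)))
    return masks


if __name__ == "__main__":
    k = 6
    masks = pairwise_path_constraints(k)
    print(len(masks), "constraints, rank", gf2_rank(masks))
```

For k = 6 there are 15 pairwise constraints but only 5 independent coefficient vectors, in line with the O(k) bound stated above.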

Key lemma

Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.

Elkin, Emek, Spielman and Teng show that we can embed any graph in a low-stretch spanning tree with average stretch O(log^2 n log log n).

The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, O(n log^2 n log log n).
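The definition of stretch can be made concrete with a short sketch. It uses a plain BFS spanning tree, which is an assumption chosen for brevity and is not the low-stretch tree construction of Elkin et al.; the function names are illustrative.

```python
from collections import deque


def bfs_tree(adj, root):
    """Return parent and depth maps of a BFS spanning tree of a connected
    undirected graph given as an adjacency dict."""
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth


def stretch(parent, depth, k, j):
    """Length of the unique tree path between vertices k and j."""
    length = 0
    while k != j:
        if depth[k] < depth[j]:
            k, j = j, k
        k = parent[k]  # move the deeper (or equally deep) endpoint up
        length += 1
    return length


if __name__ == "__main__":
    # A 5-cycle 0-1-2-3-4-0; the BFS tree from 0 leaves out edge (2, 3).
    adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    parent, depth = bfs_tree(adj, 0)
    print(stretch(parent, depth, 2, 3))
```

In this example the non-tree edge (2, 3) has stretch 4, and the cycle it closes has length stretch + 1 = 5, which is how the sum of stretches bounds the sum of cycle lengths.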

Redundant Equation Elimination


The Loop-Free ZRHC problem

Problem definition
Given a pedigree without mating loops and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Constraint Graphs

Given the constraints in a pedigree graph, we can construct the corresponding constraint graph.

Pedigree graph -> Constraint graph
- vertex v -> vertex v
- a constraint for the path connecting vertices j and k, with the sum of h-variables along the path being b -> an edge (j, k) with weight b

An example
(a) A pedigree graph with constraints
(b) Corresponding constraint graph

Constraints: (1,5), (1,2), (2,4), (2,5)
Sum of h-variables: 0 for (2,4), 0 for (2,5)

A Key Lemma

There exists a solution to the loop-free ZRHC problem if and only if the weight sum of every cycle C is 0 in the corresponding constraint graph.

(proof sketch)

"=>": Each h-variable occurs an even number of times in the constraint set S corresponding to C, so the sum of the h-variables in S is 0 over F[2]. The sum of the h-variables in S is equal to the weight sum of C. Hence the weight sum of C is 0.

"<=": Done by a construction later.
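The cycle condition of the lemma can be checked without enumerating cycles. The sketch below assumes edge weights are bits and sums are taken over F[2]; it labels vertices by BFS so that the labels of the two endpoints of every edge differ by exactly the edge weight, which is possible if and only if every cycle has weight sum 0. The function name and input encoding are illustrative assumptions.

```python
from collections import deque


def all_cycles_sum_to_zero(num_vertices, weighted_edges):
    """Return True iff every cycle of the constraint graph has weight
    sum 0 over F[2]. weighted_edges is a list of (j, k, b), b in {0, 1}."""
    adj = [[] for _ in range(num_vertices)]
    for j, k, b in weighted_edges:
        adj[j].append((k, b))
        adj[k].append((j, b))
    label = [None] * num_vertices
    for start in range(num_vertices):
        if label[start] is not None:
            continue
        label[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, b in adj[u]:
                expected = label[u] ^ b
                if label[v] is None:
                    label[v] = expected
                    queue.append(v)
                elif label[v] != expected:
                    return False  # this edge closes a cycle of weight 1
    return True


if __name__ == "__main__":
    print(all_cycles_sum_to_zero(3, [(0, 1, 1), (1, 2, 0), (0, 2, 1)]))
    print(all_cycles_sum_to_zero(3, [(0, 1, 1), (1, 2, 0), (0, 2, 0)]))
```

The check runs in time linear in the size of the constraint graph.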

(a) The pedigree graph
(b) Corresponding constraint graph

The constraints in S are not independent!

The constraints forming a spanning forest in the constraint graph are sufficient to represent all constraints.

There are at most n-1 independent constraints. We can construct an injective mapping f from the independent constraints to edges in the pedigree graph.

A Mapping from Constraints to Edges

(a) A spanning forest for the constraint graph
(b) The pedigree graph

Mapping
constraint   edge
(1,2)        (2,3)
(2,4)        (3,4)
(2,5)        (4,5)

Sum of h-variables: 0 for (2,4), 0 for (2,5)

Each constraint is mapped to an edge on the path corresponding to the constraint.

The ZRHC-PHASE algorithm

Algorithm ZRHC_PHASE

input: a pedigree G=(V,E) and genotype {gj}

output: a general solution of {pj}

Step 1. Preprocessing

Step 2. Linear constraint generation on h-variables

Step 3. Solve h-variables by Gaussian Elimination

Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.

Gaussian elimination (Step 3) takes O(n^3) time!

Solving h-variables

In order to obtain a linear-time algorithm, we want to avoid the Gaussian elimination method.

An observation

Given a constraint along a path j_0, j_1, ..., j_{k-1}, j_k:

h_{j0,j1} + h_{j1,j2} + ... + h_{jk-1,jk} = b

We can solve the constraint in the following way:
- Assign the h-variables on edges (j_0, j_1), (j_1, j_2), ..., (j_{k-2}, j_{k-1}) arbitrarily.
- Assign the h-variable on the last edge (j_{k-1}, j_k) the fixed value that satisfies the constraint: h_{jk-1,jk} = h_{j0,j1} + ... + h_{jk-2,jk-1} + b.
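The observation for a single path constraint can be sketched as follows: all but the last h-variable are chosen arbitrarily, and the last one is forced. The function name, the bit-list encoding and the use of a seeded random generator for the "arbitrary" choices are assumptions made for illustration.

```python
import random


def solve_path_constraint(k, b, rng=None):
    """Return bits h[0..k-1] for a path with k >= 1 edges such that
    h[0] + h[1] + ... + h[k-1] = b over F[2]."""
    if rng is None:
        rng = random.Random(0)
    # Arbitrary values on the first k-1 edges.
    h = [rng.randint(0, 1) for _ in range(k - 1)]
    # The last edge takes the value that makes the sum equal to b.
    last = b
    for value in h:
        last ^= value
    h.append(last)
    return h


if __name__ == "__main__":
    h = solve_path_constraint(5, 1)
    print(h, sum(h) % 2)
```

With several constraints sharing edges, the order in which edges are fixed matters; the talk handles this through the mapping f and a BFS traversal, which this single-constraint sketch does not cover.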

Solving h-variables Based on the Mapping f

We have constructed the injective mapping f : S -> E, where S is the constraint set and E is the edge set.

h-variables can be solved by a single BFS Traversal.

[Figure: the pedigree graph, with each edge marked as in f(S) or not in f(S).]

Mapping
constraint   edge     sum of h-variables
(1,2)        (2,3)    1
(2,4)        (3,4)    0
(2,5)        (4,5)    0

We solve h-variables as follows:
- For each h-variable corresponding to an edge e not in f(S), assign an arbitrary value.
- For each h-variable corresponding to an edge e in f(S), assign a fixed value based on the constraint f^{-1}(e), such that the constraint is satisfied.


Motivation

With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in dbSNP.

We aim to select a subset of informative SNPs (i.e. tagSNPs) to save the cost for genotyping all SNPs and performing disease association mapping.

r^2 Linkage Disequilibrium Statistics

Given a pair of genetic markers 1 and 2, with haplotype frequencies:

                   marker 2
                   B       b
marker 1    A      p_AB    p_Ab    p_A.
            a      p_aB    p_ab    p_a.
                   p_.B    p_.b

r^2 statistic:

r^2 = (p_AB - p_A. p_.B)^2 / ( p_A. (1 - p_A.) p_.B (1 - p_.B) )

If r^2 is no less than a given threshold r_0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).
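The r^2 statistic above translates directly into code. This is a minimal sketch; the function and parameter names are assumptions, with p_AB the frequency of haplotype AB and p_A, p_B the marginal frequencies written p_A. and p_.B on the slide.

```python
def r_squared(p_AB, p_A, p_B):
    """r^2 linkage disequilibrium statistic for two biallelic markers."""
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("a marker is monomorphic, so r^2 is undefined")
    return (p_AB - p_A * p_B) ** 2 / denominator


def can_tag(p_AB, p_A, p_B, r0):
    """True when r^2 is no less than the threshold r0."""
    return r_squared(p_AB, p_A, p_B) >= r0


if __name__ == "__main__":
    # Haplotypes AB and ab only, each with frequency 0.5: complete LD.
    print(r_squared(0.5, 0.5, 0.5))
    # Independent markers: p_AB = p_A * p_B.
    print(r_squared(0.25, 0.5, 0.5))
```

The two examples give the extreme values: r^2 = 1 when the markers are perfectly correlated and r^2 = 0 when they are independent.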

The TagSNP Selection Problem

Given a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | r^2(v_j1, v_j2) is no less than a given threshold r_0, v_j1 and v_j2 are in V}, we want to select a subset V' of minimum cardinality, such that for any v in V, there exists a v' in V' where r^2(v, v') is no less than r_0.

If we define G=(V, E), a tagSNP set is equivalent to a dominating set on G.
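Since a tagSNP set is a dominating set on G, a standard greedy dominating-set heuristic gives a feasible (not necessarily minimum) tagSNP set. The sketch below is that generic heuristic under the assumption that every marker tags itself; it is not claimed to be the GreedyTag algorithm of the talk, and the function name is illustrative.

```python
def greedy_tag_snps(markers, ld_pairs):
    """Greedy dominating set: repeatedly pick the marker that tags the
    largest number of still-untagged markers.

    markers is a list of marker ids; ld_pairs lists the pairs (u, v)
    whose r^2 is no less than the threshold.
    """
    neighbors = {v: {v} for v in markers}  # each marker tags itself
    for u, v in ld_pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)
    untagged = set(markers)
    tags = []
    while untagged:
        best = max(markers, key=lambda v: len(neighbors[v] & untagged))
        tags.append(best)
        untagged -= neighbors[best]
    return tags


if __name__ == "__main__":
    print(greedy_tag_snps([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

In the example, marker 2 tags markers 1, 2 and 3, and marker 4 tags markers 4 and 5, so two tagSNPs suffice.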

(a) SNP markers and their LD patterns in a population

: tagSNP

(b) TagSNPs for the population

TagSNP Selection across Populations

In two populations with different evolutionary histories, a pair of SNPs having remarkably different marker frequencies and very weak LD may show strong LD in the admixed population.

Therefore, tagSNPs picked from the combined populations or one of the populations might not be sufficient to capture the variations in all populations.

Problem Definition

Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum common tagSNP set for each of the populations.

The above problem is called the minimum common tagSNP selection problem (MCTS).

Population 1 Population 2

(a) SNP markers and their LD patterns in two populations.

Population 1 Population 2

: tagSNP

(b) The minimum TagSNP set for these two populations.

Our Algorithms

The MCTS problem can be easily formulated by integer linear programming.

We first apply some data reduction rules, then use one of the following algorithms:
- A greedy algorithm: GreedyTag
- A Lagrangian relaxation algorithm: LRTag

We calculate both the upper bound (i.e. the number of the tagSNPs obtained by our algorithms) and the lower bound (i.e. the minimum number of tagSNPs needed).

Lower bound: GreedyTag_lb and LRTag_lb
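One way the integer linear programming formulation of MCTS might be written is sketched below. The notation is an assumption, not taken from the slides: x_v is 1 when marker v is selected, and N_p[v] is the set consisting of v itself together with every marker u whose r^2 with v in population p is no less than r_0. The sketch also assumes that every marker tags itself and that all populations share the same marker set V.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{v \in V} x_v \\
\text{subject to} \quad & \sum_{u \in N_p[v]} x_u \ge 1
                          && \text{for every population } p \text{ and every } v \in V, \\
                        & x_v \in \{0, 1\}
                          && \text{for every } v \in V.
\end{align*}
```

Each covering constraint states that marker v is tagged in population p by at least one selected marker, so a feasible solution is a common tagSNP set for all populations.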

Experimental Result

We apply our algorithms on real HapMap data (release #19, NCBI build 34, October 2005).

There are four populations in HapMap data.
- CEU: European descendants.
- CHB: Chinese people from Beijing.
- JPT: Japanese people from Tokyo.
- YRI: Yoruba people of Ibadan, Nigeria.

We get tagSNPs for the following two datasets:
- ENCODE regions: all 10 ENCODE regions, with 10,859 markers in total.
- Human genome: chromosomes 1-22, with 2,862,454 markers in total.

Experiment Result for ENCODE Regions

We compare our GreedyTag and LRTag with MultiPop-TagSelect(MPS).

The gap between LRTag_lb and LRTag is at most two for each ENCODE region, and six in total over all ENCODE regions, with the r^2 threshold being 0.5. There is no gap with the r^2 threshold being 0.8.

Experiment Result for Human Genome

The gap between our solution and the lower bound is 1061 SNPs with the r^2 threshold being 0.5, given the entire human genome with 2,862,454 SNPs. The gap is 142 SNPs with the r^2 threshold being 0.8.

The numbers of tagSNPs selected by our algorithms are almost optimal.


Problem Definitions

P(n): given an integer n, a partition is a multiset of positive integers, say {n_1, n_2, ..., n_r}, such that n_1 + n_2 + ... + n_r = n.
Example: given n=4, {2,2} is a P(4); given n=3, {3} is a P(3).
#P(100) = 190,569,292

IP(S): given a multiset S = {x_1, ..., x_m}, an integer partition is a disjoint union of partitions P(x_1), ..., P(x_m), one for each element of S.
Example: given S = {3, 3, 4}, {2,2,3,3} is an IP({3,3,4}).

CIP(S_1, S_2, ..., S_k): given multisets S_1, S_2, ..., S_k, a common integer partition is a multiset that is an integer partition of every S_i.
Example: given S = {3, 3, 4} and T = {2, 2, 6}, {2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T).

MCIP(S_1, S_2, ..., S_k): a common integer partition with the minimum cardinality.
Example: {2,2,3,3} is an MCIP(S,T).

MCIP is NP-hard.
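The definitions above can be checked on the slide's examples with a small sketch. The partition counter is the standard dynamic program; the common-partition check is a brute-force backtracking search meant only for tiny inputs, and it assumes all integers are positive. Function names are illustrative.

```python
def count_partitions(n):
    """Number of partitions of n into positive integers, #P(n)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


def is_integer_partition(parts, multiset):
    """True if `parts` can be split into groups whose sums are exactly
    the elements of `multiset` (exhaustive search, small inputs only)."""
    parts = sorted(parts, reverse=True)
    if sum(parts) != sum(multiset):
        return False
    remaining = list(multiset)

    def place(i):
        if i == len(parts):
            return all(r == 0 for r in remaining)
        tried = set()  # skip targets with the same remaining capacity
        for idx, r in enumerate(remaining):
            if r >= parts[i] and r not in tried:
                tried.add(r)
                remaining[idx] -= parts[i]
                if place(i + 1):
                    return True
                remaining[idx] += parts[i]
        return False

    return place(0)


def is_common_integer_partition(parts, *multisets):
    """True if `parts` is an integer partition of every given multiset."""
    return all(is_integer_partition(parts, s) for s in multisets)


if __name__ == "__main__":
    S, T = [3, 3, 4], [2, 2, 6]
    print(count_partitions(100))
    print(is_common_integer_partition([2, 2, 3, 3], S, T))
    print(is_common_integer_partition([1, 1, 2, 2, 4], S, T))
```

The search is exponential in the worst case, which is consistent with the problem being NP-hard; it is a checker for examples, not an algorithm for MCIP.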

Biological Applications (1)

The distance between two strings:

a b c d e f g h i j k h
h i j k h e f g a b c d

Genetic distance between two genomes

Minimum Common Substring Partition

Biological Applications (2)

MCIP is a special case of Minimum Common Substring Partition (MCSP).

MCIP(S',T'), with S' = {x_1, x_2, ..., x_m} and T' = {y_1, y_2, ..., y_n}

MCSP(S,T), with
S = aa...a |- aa...a |- ... aa...a (runs of lengths x_1, x_2, ..., x_m)
T = aa...a -| aa...a -| ... aa...a (runs of lengths y_1, y_2, ..., y_n)

Our Result

2-MCIP: MCIP on two input multisets
k-MCIP: MCIP on k input multisets
APX-hard: there is a constant c > 1 such that the problem cannot be approximated within ratio c unless P = NP.

Problem          Approximation upper bound    Approximation lower bound
2-MCIP           5/4                          APX-hard
k-MCIP (k>2)     3k(k-1)/(3k-2)               APX-hard

Conclusion and Future Work


References

L. Liu and T. Jiang. Linear-Time Reconstruction of Zero-Recombinant Mendelian Inheritance on Pedigrees without Mating Loops. In submission.

L. Liu, Y. Wu, S. Lonardi and T. Jiang. Efficient Algorithms for Genome-wide TagSNP Selection across Populations via Linkage Disequilibrium Criterion. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).

Y. Wu, L. Liu, T. Close and S. Lonardi. Deconvoluting the BAC-gene Relationship Using a Physical Map. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).

J. Xiao, L. Liu, L. Xia and T. Jiang. Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA'2007), pp. 655-664.

X. Chen, L. Liu, Z. Liu and T. Jiang. On the Minimum Common Integer Partition Problem. In Proc. of the 6th Conference on Algorithms and Complexity, Rome, Italy, pp. 236-247.

L. Liu, X. Chen, J. Xiao and T. Jiang. Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem. In Proc. of the 16th Annual International Symposium on Algorithms and Computation (ISAAC'05), pp. 370-379. [Best paper nominations: 5.35%]. To appear in Theoretical Computer Science.

Thanks for your time and attention!
