"Maximum Likelihood Haplotyping for General Pedigrees." Fishelson M., Dovgolevsky N. and Geiger D....

"Maximum Likelihood Haplotyping for General Pedigrees."

Fishelson M., Dovgolevsky N. and Geiger D. Human Heredity, 2005

Overview

- Genetic linkage: basic definitions
- Superlink:
  - Preprocessing
  - Finding optimal computation order
  - Solving problem via Elim-Max
- Experimental results

Human Genome

Most human cells contain 46 chromosomes:

2 sex chromosomes (X, Y): XY in males, XX in females.

22 pairs of chromosomes, named autosomes.

Genetic Information

Gene – basic unit of genetic information. Genes determine the inherited characteristics.

Genome – the collection of genetic information.

Chromosomes – storage units of genes.

Genotype Vs. Phenotype

Genotype - genetic constitution of an individual, inherited instructions it carries, which may or may not be expressed.

Phenotype - any observed quality of an organism (such as morphology, development, or behavior)

Chromosome Logical Structure

Genetic marker – a known, observable DNA sequence variation, which may arise due to mutation or alteration at a genomic locus.

A marker may be short (SNP) or long (minisatellite).

Locus – the location of a marker: a fixed position on a chromosome.

Allele – one variant form of a marker.

Locus 1. Possible alleles: A1, A2

Locus 2. Possible alleles: B1, B2, B3

Alleles - the ABO (Blood types) locus example

Multiple alleles: A, B, O. O is recessive to A. B is dominant over O. A and B are codominant.

Genotype      Phenotype
O/O           O
A/B           AB
B/B, B/O      B
A/A, A/O      A

Mendel’s first law

Characters are controlled by pairs of genes which separate during the formation of the reproductive cells (meiosis)

[Figure: segregation of the gene pair A a into gametes A and a]

Sexual Reproduction

[Figure: sexual reproduction. Labels: meiosis, gametes (sperm, egg), zygote]

Mendel’s second law

When two or more pairs of genes segregate simultaneously, they do so independently.

Gene pairs: A a; B b

Possible gametes: A B, A b, a B, a b

$P_{AB} = P_A P_B$, $P_{Ab} = P_A P_b$, $P_{aB} = P_a P_B$, $P_{ab} = P_a P_b$

Hardy–Weinberg equilibrium

Genotype frequencies in a population remain constant, or are in equilibrium, from generation to generation unless specific disturbing influences are introduced.

Allele frequencies: f(B) = p, f(b) = q, with p + q = 1.

The three possible genotype frequencies in the offspring: f(BB) = p^2, f(Bb) = 2pq, f(bb) = q^2.
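As a quick numerical check of the Hardy–Weinberg proportions, the following minimal sketch computes the three genotype frequencies from one allele frequency. The function name `hardy_weinberg` and the example value p = 0.7 are illustrative assumptions, not part of the slides.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequencies f(B) = p, f(b) = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p  # f(b)
    return {"BB": p * p, "Bb": 2 * p * q, "bb": q * q}


# Hypothetical allele frequency chosen only for illustration.
freqs = hardy_weinberg(0.7)
for genotype, freq in freqs.items():
    print(genotype, round(freq, 4))
```

The three frequencies always sum to 1, because p^2 + 2pq + q^2 = (p + q)^2.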

Recombination During Meiosis

Recombinant gametes

Genetic recombination is the process by which a strand of genetic material (usually DNA; but can also be RNA) is broken and then joined to a different DNA molecule. In humans recombination commonly occurs during meiosis as chromosomal crossover between paired chromosomes.

Linkage

2 genes on separate chromosomes assort independently at meiosis.

Recombination can occur with small probability at any location along a chromosome.

2 genes far apart on the same chromosome can also assort independently at meiosis.

2 genes close together on the same chromosome pair do not assort independently at meiosis.

A recombination frequency well below 50% between 2 genes shows that they are linked: they tend to be inherited together.

Linkage Maps

Let U and V be 2 genes on the same chromosome.

In every meiosis, chromatids cross over at random along the chromosome.

If the chromatids cross over between U & V, then a recombinant is produced.

The farther apart U & V are, the greater the chance that a crossover occurs between them, and hence the greater the chance of recombination between them.

Relative distance between two genes

- can be calculated using the offspring of an organism showing two linked genetic traits, and finding the percentage of the offspring where the two traits do not run together. The higher the percentage of descendants that does not show both traits, the further apart on the chromosome they are.

Recombination Fraction

$0 \;(\text{linkage}) \le P(\text{recombination}) = \theta \le 0.5 \;(\text{no linkage})$

• The recombination fraction θ between two loci is the proportion of meioses in which a recombination occurs between the two loci.

• θ is a monotone, nonlinear function of the physical distance separating the loci on the chromosome.

Centimorgan (cM)

1 cM (or 1 genetic map unit, m.u.) is the distance between genes for which the recombination frequency is 1%, that is genes for which one product of meiosis in 100 is recombinant.
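The slides do not name the function that relates map distance to the recombination fraction. One common choice, used here purely as an assumption for illustration, is Haldane's map function, θ = (1 − e^(−2d)) / 2 with d in Morgans. It is monotone, nonlinear, close to 1% per cM for small distances, and saturates at 0.5. The function name `haldane_theta` is hypothetical.

```python
import math


def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in centimorgans.

    Uses Haldane's map function, which assumes no genetic interference.
    """
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


for cm in (1, 10, 50, 200):
    print(cm, "cM ->", round(haldane_theta(cm), 4))
```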

Haplotype

- a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci.

Haplotype Resolution

Given the genotypes for a number of individuals, the haplotypes can be inferred by haplotype resolution or haplotype phasing techniques. These methods work by applying the observation that certain haplotypes are common in certain genomic regions.

Methods:

- combinatorial approach

- likelihood functions

An organism's genotype may not uniquely define its haplotype

      AA       AT                TT
GG    AG AG    AG TG             TG TG
GC    AG AC    AG TC or AC TG    TG TC
CC    AC AC    AC TC             TC TC
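The ambiguity in the doubly heterozygous case can be reproduced with a small enumeration. This is a minimal sketch: the function name `consistent_haplotype_pairs` and the representation of a genotype as one unordered allele pair per locus are assumptions made for the example.

```python
from itertools import product


def consistent_haplotype_pairs(genotypes):
    """Enumerate the unordered haplotype pairs compatible with unordered genotypes.

    genotypes: one (allele, allele) tuple per locus, e.g. [("A", "T"), ("G", "C")].
    Returns a set of frozensets; a frozenset with one element is a homozygous pair.
    """
    pairs = set()
    # At each locus, decide which of the two alleles sits on the first haplotype.
    for choice in product(*[((a, b), (b, a)) for a, b in genotypes]):
        first = "".join(x for x, _ in choice)
        second = "".join(y for _, y in choice)
        pairs.add(frozenset((first, second)))
    return pairs


ambiguous = consistent_haplotype_pairs([("A", "T"), ("G", "C")])
unambiguous = consistent_haplotype_pairs([("A", "A"), ("G", "C")])
print(len(ambiguous), len(unambiguous))
```

With k heterozygous loci there are 2^(k-1) compatible pairs, which is why pedigree and population data are needed to choose among them.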

SUPERLINK:

Multipoint linkage analysis – estimate the recombination fraction between a disease gene and known loci on the chromosome.

Haplotyping problem - infer the two haplotypes of each individual from the measured unordered genotypes (using pedigree genotype data and population genotype data)

Both problems can be defined via maximizing a suitable likelihood function.

Bayesian networks for representation of pedigree data

- Allow pedigrees to be represented in a detailed manner
- Allow independence assumptions to be encoded

Pedigree – defines a joint distribution over the genotypes and phenotypes of the individuals represented in the pedigree.

Random variables representing a pedigree

Separate single-locus allele lists for the two haplotypes (paternal and maternal)

Genetic loci: $G_{i,jp}$, $G_{i,jm}$ – the specific alleles of locus j in individual i's paternal and maternal haplotypes.

Phenotypes: $P_{i,j}$ – for each individual i and phenotype j, denotes the value of phenotype j for individual i.

Selector variables: $S_{i,jp}$, $S_{i,jm}$ – denote the selection made by the meiosis that resulted in i's genetic makeup at locus j.

Formally, if a denotes i's father, then

$G_{i,jp} = G_{a,jp}$ if $S_{i,jp} = 0$

$G_{i,jp} = G_{a,jm}$ if $S_{i,jp} = 1$

Local probability tables:

Transmission models: $\Pr(G_{i,jp} \mid G_{a,jp}, G_{a,jm}, S_{i,jp})$ and $\Pr(G_{i,jm} \mid G_{b,jp}, G_{b,jm}, S_{i,jm})$, where a and b are i's father and mother in the pedigree. These tables are deterministic, namely, they consist solely of 0s and 1s. The first table equals 1 if $G_{i,jp} = G_{a,jp}$ and $S_{i,jp} = 0$, or if $G_{i,jp} = G_{a,jm}$ and $S_{i,jp} = 1$. In all other cases it equals 0. The second table is defined analogously.

Penetrance model or marker model: $\Pr(P_{i,j} \mid G_{i,jp}, G_{i,jm})$. Penetrance is the proportion of individuals carrying a particular variant of a gene (an allele or genotype) who also express a particular trait (the phenotype).

These tables are also deterministic. The table equals 1 if $P_{i,j} = (G_{i,jp}, G_{i,jm})$ or if $P_{i,j} = (G_{i,jm}, G_{i,jp})$. Otherwise it equals 0. The assumption underlying these models is that there are no measurement errors.
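The two deterministic tables can be written as small functions. This is a sketch under stated assumptions: the function names are hypothetical, alleles are plain integers, the selector takes only the values 0 and 1, and a marker reading is an unordered pair of alleles.

```python
def transmission_prob(child_allele, father_paternal, father_maternal, selector):
    """Pr(G_{i,jp} = child_allele | G_{a,jp}, G_{a,jm}, S_{i,jp} = selector).

    selector 0 passes on the father's paternal allele, selector 1 his maternal one.
    """
    inherited = father_paternal if selector == 0 else father_maternal
    return 1.0 if child_allele == inherited else 0.0


def marker_prob(observed_pair, paternal_allele, maternal_allele):
    """Pr(P_{i,j} = observed_pair | G_{i,jp}, G_{i,jm}) for an error-free marker.

    The observed genotype is unordered, so both orderings are accepted.
    """
    return 1.0 if sorted(observed_pair) == sorted((paternal_allele, maternal_allele)) else 0.0


print(transmission_prob(1, 1, 2, selector=0))  # child carries the father's paternal allele
print(marker_prob((2, 1), 1, 2))               # reading 2/1 matches the ordered genotype 1|2
```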

Local probability tables:

Recombination model: $\Pr(S_{i,1p}) = \Pr(S_{i,1m}) = 0.5$, together with $\Pr(S_{i,jp} \mid S_{i,j-1p}, \theta_{j-1})$ and $\Pr(S_{i,jm} \mid S_{i,j-1m}, \theta_{j-1})$, where $\theta_{j-1}$ is the known or unknown recombination fraction between locus j-1 and locus j. The recombination fractions between the markers are specified by the user in the input to SUPERLINK. These recombination models do not take genetic interference into account.

General population allele frequencies: $\Pr(G_{i,jp})$, $\Pr(G_{i,jm})$, when i is a founder (an individual whose biological parents are not included in the pedigree). The use of these models is based on the assumptions of Hardy-Weinberg equilibrium and linkage equilibrium.

Each of these probability tables is called a factor.
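The recombination factor has a simple form: the selector switches value between adjacent loci with probability θ and keeps it with probability 1 − θ. The sketch below multiplies these factors along one haplotype. The function names and the example θ values are assumptions for illustration.

```python
def selector_transition(s_curr, s_prev, theta):
    """Pr(S_{i,j} = s_curr | S_{i,j-1} = s_prev, theta_{j-1})."""
    return theta if s_curr != s_prev else 1.0 - theta


def selector_path_prob(selectors, thetas):
    """Probability of a full selector sequence along one haplotype.

    selectors: selector values at loci 1..n; thetas: the n-1 recombination fractions.
    """
    prob = 0.5  # Pr(S_{i,1}) = 0.5 at the first locus
    for prev, curr, theta in zip(selectors, selectors[1:], thetas):
        prob *= selector_transition(curr, prev, theta)
    return prob


# One recombination between loci 2 and 3, with hypothetical recombination fractions.
print(selector_path_prob([0, 0, 1], [0.1, 0.2]))
```

Counting the positions where consecutive selectors differ gives the number of recombination events implied by a haplotype configuration.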

A fragment of a Bayesian network representation of parents-child interaction in a 3-loci analysis.

Superlink solves:

For haplotyping – Most Probable Explanation problem:

Superlink represents joint distribution over selector variables (S) and the genetic loci variables of founders (F), and non-founders (N) in factored form:

Haplotyping problem

A maximum-likelihood haplotype configuration of a pedigree is a maximum-likelihood assignment to all the genetic loci variables:

Since we are interested in determining the most likely gene flow as well, we seek a joint maximum-likelihood assignment to the selector variables and the genetic loci variables of founders

Since genetic loci variables of non-founders, N, are a function of the genetic loci variables of founders and the selector variables, solving equation above is equivalent to:
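The factored form and the three maximization problems referred to above can be sketched from the local probability tables defined earlier. This is a reconstruction under stated assumptions rather than a copy of the slide equations: e stands for the observed data, a(i) and b(i) for the father and mother of individual i, and the starred symbols for the maximizing assignments.

```latex
% Factored joint distribution: founder priors, transmission, recombination, marker/penetrance
\Pr(S, F, N, e) =
  \prod_{i \in \text{founders}} \prod_{j} \Pr(G_{i,jp}) \Pr(G_{i,jm})
  \cdot \prod_{i \in \text{non-founders}} \prod_{j}
      \Pr(G_{i,jp} \mid G_{a(i),jp}, G_{a(i),jm}, S_{i,jp})
      \Pr(G_{i,jm} \mid G_{b(i),jp}, G_{b(i),jm}, S_{i,jm})
  \cdot \prod_{i \in \text{non-founders}} \Pr(S_{i,1p}) \Pr(S_{i,1m})
      \prod_{j \ge 2} \Pr(S_{i,jp} \mid S_{i,j-1p}, \theta_{j-1})
                      \Pr(S_{i,jm} \mid S_{i,j-1m}, \theta_{j-1})
  \cdot \prod_{i} \prod_{j} \Pr(P_{i,j} = e_{i,j} \mid G_{i,jp}, G_{i,jm})

% Maximum-likelihood haplotype configuration
(F^*, N^*) = \arg\max_{F, N} \Pr(F, N \mid e)

% Joint assignment that also fixes the gene flow
(S^*, F^*, N^*) = \arg\max_{S, F, N} \Pr(S, F, N \mid e)

% Equivalent problem, because N is a function of F and S
(S^*, F^*) = \arg\max_{S, F} \Pr(S, F \mid e)
```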

Algorithm:

- Preprocessing
  - Value elimination
  - Variable trimming
  - Allele recoding
- Finding optimal computation order
- Solving problem via Elim-Max (elim-mpe) combined with conditioning
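Elim-Max is variable elimination with maximization in place of summation, followed by back-substitution to recover the maximizing assignment. The following is a generic, minimal sketch of that idea on a toy two-variable network. It is not Superlink's implementation, it omits conditioning, and the factor representation, function name, and numbers are assumptions for illustration.

```python
from itertools import product


def elim_max(factors, domains, order):
    """Max-product variable elimination.

    factors: list of (scope, table); scope is a tuple of variable names and
             table maps a tuple of values (in scope order) to a probability.
    domains: dict mapping each variable to a list of its possible values.
    order:   elimination order covering every variable.
    Returns (probability of the best assignment, the assignment as a dict).
    """
    factors = list(factors)
    trace = []  # per eliminated variable: (variable, scope, best value per context)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        scope = tuple(sorted({v for s, _ in bucket for v in s} - {var}))
        maxed, best = {}, {}
        for values in product(*(domains[v] for v in scope)):
            context = dict(zip(scope, values))
            scored = []
            for x in domains[var]:
                context[var] = x
                p = 1.0
                for s, table in bucket:
                    p *= table[tuple(context[v] for v in s)]
                scored.append((p, x))
            maxed[values], best[values] = max(scored, key=lambda t: t[0])
        factors.append((scope, maxed))
        trace.append((var, scope, best))
    value = 1.0
    for _, table in factors:  # every remaining factor now has an empty scope
        value *= table[()]
    assignment = {}
    for var, scope, best in reversed(trace):  # back-substitute the maximizers
        assignment[var] = best[tuple(assignment[v] for v in scope)]
    return value, assignment


# Toy network A -> B with hypothetical numbers.
domains = {"A": [0, 1], "B": [0, 1]}
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}),
]
best_prob, best_assignment = elim_max(factors, domains, order=["B", "A"])
print(best_prob, best_assignment)
```

The size of the intermediate tables depends on the elimination order, which is why finding a good order precedes this step.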

Value elimination

1st step – performed directly on the graph representation of the pedigree, before transforming it into a Bayesian network.

2nd step – performed on the local probability tables that annotate the nodes of the constructed Bayesian network.

1st step

- based on the fact that the possible genotypes of an individual can be inferred from the genotypes of that individual's relatives.

Downward update: the child is updated according to parent

Ex. - a child can only have allele 1 or 2 in paternal haplotype of this locus

1st step

Upward update: parent is updated according to the children

Ex. - both children got allele 1 from their mother. So, father's genotype must be 3|4

These updates work in local manner, but when each update is propagated through the pedigree graph, it results in global update.
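The two local updates can be sketched as set operations. This is a simplified illustration, not the procedure used in Superlink: the function names are hypothetical, genotypes are tuples of two integer alleles, and only a single parent-child pair is considered.

```python
def downward_update(child_candidates, parent_genotypes):
    """Keep only the child's alleles (on the haplotype inherited from this parent)
    that the parent can actually transmit."""
    transmissible = {allele for genotype in parent_genotypes for allele in genotype}
    return child_candidates & transmissible


def upward_update(parent_genotypes, child_alleles_from_parent):
    """Keep only the parent genotypes that contain at least one allele
    the child may have received from this parent."""
    return {g for g in parent_genotypes if set(g) & child_alleles_from_parent}


# Father typed 1|2: the child's paternal allele can only be 1 or 2.
print(downward_update({1, 2, 3, 4}, {(1, 2)}))
# Child received allele 1 from this parent: genotype 3|4 is ruled out for the parent.
print(upward_update({(1, 2), (3, 4)}, {1}))
```

Repeating both updates over all parent-child pairs until nothing changes propagates the local constraints through the pedigree.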

2nd Step

A value of a variable is invalid if all entries of some probability table that correspond to that value equal zero.
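The test on a single table can be sketched as follows. The table layout (a dict from assignment tuples to probabilities) and the function name `invalid_values` are assumptions made for the example.

```python
def invalid_values(table, var_index, domain):
    """Values of one variable for which every matching table entry is zero.

    table: dict mapping assignment tuples to probabilities.
    var_index: position of the variable inside each assignment tuple.
    domain: the variable's current set of possible values.
    """
    return {
        value
        for value in domain
        if all(p == 0 for assignment, p in table.items() if assignment[var_index] == value)
    }


# Hypothetical table over (X, Y): every entry with X = 2 is zero, so X = 2 is invalid.
table = {(1, 1): 1.0, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.0}
print(invalid_values(table, 0, {1, 2}))
```

Removing an invalid value from one table can zero out entries in neighbouring tables, so the check is typically repeated until no further values are removed.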

Variable trimming

Variables that correspond to leaves in the Bayesian network for which no data exists can be trimmed without altering the likelihood computation.

Allele recoding

Reduces the number of genotypes that need to be summed over, and hence accelerates the computations.

One method is lumping all alleles that do not appear in the pedigree into a single allele whose population frequency is the sum of frequencies of the lumped alleles (Lange et al., 1988; Schaffer, 1996).

A more efficient method, which recodes the paternal and maternal allele lists of each individual separately, has been suggested by O'Connell and Weeks (1995), and implemented in vitesse.

The allele recoding algorithm implemented in superlink is based on the ideas of set-recoding and fuzzy inheritance defined in vitesse.
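The simpler lumping method described above can be sketched in a few lines. This illustrates only the pedigree-wide lumping, not the per-individual set-recoding of vitesse or Superlink. The function name, the label "other", and the frequencies are assumptions for the example.

```python
def lump_unobserved_alleles(frequencies, observed, lumped_name="other"):
    """Merge all alleles that do not appear in the pedigree into one allele.

    frequencies: dict mapping allele -> population frequency.
    observed: set of alleles that appear in the pedigree data.
    The merged allele's frequency is the sum of the merged frequencies.
    """
    recoded = {a: f for a, f in frequencies.items() if a in observed}
    lumped = sum(f for a, f in frequencies.items() if a not in observed)
    if lumped > 0:
        recoded[lumped_name] = lumped
    return recoded


# Hypothetical four-allele marker where only alleles 1 and 3 are seen in the pedigree.
recoded = lump_unobserved_alleles({1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}, observed={1, 3})
print(recoded)
```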

allele-recoding algorithm

An allele of a person P is defined to be transmitted if the following two conditions are fulfilled:

(i) the allele appears in the ordered genotype list of a typed descendant D of P, as inherited from P;

(ii) there is some path from P to D containing only untyped descendants in the pedigree, namely, D is the nearest typed descendant of P on that path.

The remaining alleles are defined to be non-transmitted. In terms of determining recombination events, a person's non-transmitted alleles are indistinguishable from one another by data, and can therefore be combined into a single representative allele.

Allele-recoding algorithm (part 2)

The probability of the assignment found for the regular case (without allele recoding) is the same as the one found in the case of allele recoding.

Computation order

Main approaches:

- Elston-Stewart algorithm – processes one nuclear family after another. Good for large pedigrees with a few markers.
- Lander-Green algorithm – processes one locus after another. Good for small to medium-sized pedigrees with a large number of markers.

Superlink uses a novel approach: the problem is reduced to operations on the moralized graph of the Bayesian network.

When a vertex is eliminated from the graph, its neighbors are connected to form a clique. The cost of eliminating vertex v from graph $G_i$ is

$$C_{G_i}(v) = \prod_{u \in N_{G_i}(v)} w(u),$$

where $N_{G_i}(v)$ represents the set of neighbors of v in $G_i$, including v itself, and w(v) is the weight of v, namely, the number of possible values of variable $X_v$. In the case when there is no memory limitation, we aim to find an elimination order $\hat{X}_\alpha$ which satisfies

$$\hat{X}_\alpha = \arg\min_{\alpha} C(X_\alpha), \quad \text{where} \quad C(X_\alpha) = \sum_{i=1}^{n} C_{G_i}(X_{\alpha(i)}),$$

and α denotes a permutation on {1, ..., n}. $G_i$, i = 2, ..., n, denotes the sequence of residual graphs obtained from a given graph $G_1 = G$ by eliminating its vertices in the order $X_{\alpha(1)}, \ldots, X_{\alpha(i-1)}$.
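The total cost of an elimination order can be computed by simulating the eliminations. The sketch below assumes that eliminating v costs the product of the weights of v and its current neighbours (the "total state space" reading of the cost) and that the total is the sum over all eliminations. The adjacency-dict representation and the names are illustrative.

```python
from math import prod


def elimination_cost(adjacency, weights, order):
    """Total state space of eliminating the vertices in the given order.

    adjacency: dict vertex -> set of neighbours (undirected graph).
    weights: dict vertex -> number of possible values of that variable.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    total = 0
    for v in order:
        nbrs = graph.pop(v)
        total += weights[v] * prod(weights[u] for u in nbrs)
        # Connect the remaining neighbours into a clique and detach v.
        for u in nbrs:
            graph[u].discard(v)
            graph[u] |= nbrs - {u}
    return total


# Hypothetical path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
w = {"a": 2, "b": 3, "c": 2}
print(elimination_cost(path, w, ["b", "a", "c"]))  # eliminating the middle vertex first
print(elimination_cost(path, w, ["a", "c", "b"]))  # eliminating the end vertices first
```

Even on this three-vertex example the two orders give different totals, which is what the search over orders exploits.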

Cost Function

$C(X_\alpha)$ – cost function (total state space): an approximate measure of the time and space complexity of the computation.

If the heaviest clique created does not fit into memory, conditioning is needed.

Cost function for conditioning

Let $\beta = (\beta_1, \ldots, \beta_n)$ be a vector where $\beta_i \in \{0, 1\}$. A constrained elimination sequence $X_{\alpha,\beta} = ((X_{\alpha(1)}, \ldots, X_{\alpha(n)}), \beta)$ is a sequence of vertices along with a binary vector β such that vertex $X_{\alpha(i)}$ is eliminated if $\beta_i = 0$ and conditioned on if $\beta_i = 1$.

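Goal – to find, for a memory threshold T, a constrained elimination sequence of minimal cost in which no table created during elimination exceeds T.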

Algorithm for finding a combined order of elimination and conditioning

(for both haplotyping and likelihood computation)

Preprocessing step – application of reduction rules. These were initially designed for the weighted treewidth problem and can significantly reduce the size of the graph.

Application of several stochastic-greedy algorithms

Reduction rules

Variable low – represents the largest lower bound known for the weighted treewidth of the original graph.

Simplicial rule: Let v be a simplicial vertex in $G_i$, namely, its set of neighbors forms a clique. The simplicial rule removes v from the graph and updates the variable low: low = max(low, nw(v)).

Almost simplicial rule: A vertex v is called an almost simplicial vertex in $G_i$ if all its neighbors, except one, u, form a clique. Vertex v is removed if low >= nw(v) and w(v) >= w(u).

nw(v) denotes the product of the weights of the vertices in $N_{G_i}(v)$, that is, of v and its neighbors.
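The simplicial rule can be sketched as a loop that keeps removing simplicial vertices and raising the lower bound. The sketch assumes nw(v) is the product of the weights of v and its neighbours. The function names and the example graph are hypothetical, and the almost simplicial rule is not covered.

```python
from itertools import combinations
from math import prod


def is_simplicial(graph, v):
    """True if the neighbours of v are pairwise adjacent (they form a clique)."""
    return all(b in graph[a] for a, b in combinations(graph[v], 2))


def apply_simplicial_rule(graph, weights, low=0):
    """Repeatedly remove simplicial vertices (in place) and return the updated bound."""
    changed = True
    while changed:
        changed = False
        for v in list(graph):
            if is_simplicial(graph, v):
                nw = weights[v] * prod(weights[u] for u in graph[v])
                low = max(low, nw)
                for u in graph.pop(v):
                    graph[u].discard(v)
                changed = True
    return low


# Hypothetical graph: triangle a-b-c with a pendant vertex d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weights = {"a": 2, "b": 2, "c": 2, "d": 2}
low = apply_simplicial_rule(graph, weights)
print(low, graph)
```

This example graph is chordal, so the rule alone removes every vertex. On general graphs a residual graph remains for the greedy search.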

Stochastic-greedy algorithms

All are based on the same procedure SG(). Input:

- weighted undirected graph G(V, E, w)
- threshold T (memory limitation)
- cost functions C1 and C2 (these vary between the three algorithms)

The next vertex to be eliminated is chosen according to C1. The next vertex to be conditioned on is chosen according to C2.

The procedure runs many times; each time it finds a new elimination order and compares it to the previous one.

Three algorithms

Min-Weight (Min-W): C1 – product of the weights of the vertex's neighbours.

Min-Fill: C1 – number of edges that need to be added to the graph after elimination of the vertex.

Weighted Min-Fill (Wmin-Fill): the weight of an edge is the product of the weights of its constituent vertices. C1 – sum of the weights of the edges that need to be added due to the vertex's elimination.

Notation used on the slide: the set of neighbours of X in $G_i$; the number of neighbours; the functions in $G_i$ that include X.
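The two fill-based cost functions can be sketched directly from their descriptions. The adjacency-dict representation and the function names are assumptions, and vertex names are assumed to be sortable (strings here).

```python
from itertools import combinations


def fill_edges(graph, v):
    """Edges that must be added between v's neighbours when v is eliminated."""
    return [(a, b) for a, b in combinations(sorted(graph[v]), 2) if b not in graph[a]]


def min_fill_cost(graph, weights, v):
    """Min-Fill: the number of fill edges."""
    return len(fill_edges(graph, v))


def weighted_min_fill_cost(graph, weights, v):
    """Weighted Min-Fill: sum over fill edges of the product of endpoint weights."""
    return sum(weights[a] * weights[b] for a, b in fill_edges(graph, v))


# Hypothetical star graph: centre c with leaves x, y, z.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
star_weights = {"c": 2, "x": 2, "y": 3, "z": 4}
print(min_fill_cost(star, star_weights, "c"), weighted_min_fill_cost(star, star_weights, "c"))
```

Eliminating a leaf adds no edges under either measure, while eliminating the centre first adds all three edges between the leaves.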

Three algorithms

None of the algorithms is better than the others in all cases, so each of them is run for a certain percentage of the total optimization time: %MW, %MF and %WMF denote the percentage of iterations spent running Min-Weight, Min-Fill and Weighted Min-Fill, respectively.

N – the total number of iterations – is estimated according to the complexity of the problem at hand, which in turn is estimated by the cost of the elimination order found by the deterministic-greedy Min-Weight algorithm.

Deterministic-greedy Min-Weight

Deterministic: each iteration eliminates the vertex with the minimal elimination cost according to the Min-Weight cost function.

Stochastic: each iteration flips a coin to determine which of three candidate vertices to eliminate.
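The difference between the deterministic and stochastic variants can be sketched with one function. Several simplifications are assumptions of this sketch rather than details from the slides: the three candidates are the three cheapest vertices under Min-Weight, the random choice among them is uniform, ties are broken by vertex name, and conditioning (cost function C2 and threshold T) is left out.

```python
import random
from math import prod


def min_weight_cost(graph, weights, v):
    """Min-Weight cost: product of the weights of v's current neighbours."""
    return prod(weights[u] for u in graph[v])


def greedy_order(adjacency, weights, rng=None, candidates=3):
    """Build an elimination order greedily.

    With rng=None the cheapest vertex is always taken (deterministic greedy).
    With an rng, one of the `candidates` cheapest vertices is picked at random.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    order = []
    while graph:
        ranked = sorted(graph, key=lambda v: (min_weight_cost(graph, weights, v), v))
        v = rng.choice(ranked[:candidates]) if rng is not None else ranked[0]
        nbrs = graph.pop(v)
        for u in nbrs:  # connect the neighbours into a clique and detach v
            graph[u].discard(v)
            graph[u] |= nbrs - {u}
        order.append(v)
    return order


chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
chain_weights = {"a": 2, "b": 3, "c": 2}
print(greedy_order(chain, chain_weights))                    # deterministic run
print(greedy_order(chain, chain_weights, random.Random(0)))  # one stochastic run
```

Running the stochastic variant many times and keeping the cheapest order found is the pattern described for procedure SG().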

Experimental results

- Evaluation of the optimization algorithm
  - Stochastic algorithm
  - Benchmarks
  - Total running time
  - Reduction rules
- Evaluation of the haplotyping algorithm

Distribution of algorithms that found the lower cost

Algorithm     Share of cases
Min-Weight    4%
MCS           9%
WMCS          7%
Min-Fill      25%
Wmin-Fill     76%

The Min-Weight heuristic does not provide results as good as those of Weighted Min-Fill and Min-Fill, but it is the fastest and works well when conditioning is needed.

The MCS and Weighted-MCS heuristics were found to contribute very little when applied after Min-Weight, and thus are not implemented.

Comparison of likelihood computation using the old and new (1.4) versions of Superlink

Simulation study

The Superlink haplotyping algorithm was tested on a complex pedigree of moderate size (Lin, 1996). This pedigree consists of 27 individuals and is highly inbred. Until now, only an approximate haplotype analysis was possible for it: Genehunter removes 12 individuals from the pedigree in order to perform the computations, and no previous exact algorithm could produce the maximum likelihood haplotype configuration. Superlink obtained a maximum likelihood haplotype configuration in several minutes.

Simulation study

Testing correctness

Implementing three independent versions of the algorithm, and comparing the results obtained by all three versions.

Each version was implemented by different people, to assure an independent evaluation.

Correctness: the software finds a haplotype configuration of maximum likelihood given the assumptions of Hardy-Weinberg and Linkage equilibrium.

Tested 60 data sets consisting of 5 to 150 individuals and up to 200 markers. In all tested data sets, all three versions produced haplotype configurations with the same likelihood.

There is usually more than one maximum-likelihood haplotype configuration, and hence, various algorithms often produce different haplotype configurations.

Testing accuracy

Using Superlink, approximate haplotyping can be compared with the optimal solution on larger pedigrees than was previously possible. This experiment tested the accuracy of a state-of-the-art program that uses MCMC, called simwalk2. 75 random data sets consisting of 15 to 50 individuals and up to 10 markers were tested. Simwalk2 found a maximum likelihood assignment in 45 out of the 75 data sets. In the other 30 data sets, the average difference in the log-likelihood of the assignment found by simwalk2 compared to the maximum likelihood assignment was merely 1%.

Testing accuracy

Example of different outputs by SIMWALK2 and SUPERLINK.

The two haplotype configurations are quite similar.

Many of the differences involve different phases in the haplotypes of founders. Such information cannot be discerned from the data, and hence such differences are meaningless.

However, the haplotype configuration found by SIMWALK2 contains 9 recombination events whereas the haplotype configuration found by SUPERLINK contains merely 7 recombination events. The positions of 5 of the recombination events found by both programs are the same. The other 2 recombination events found by SUPERLINK are in different positions than those found by SIMWALK2.

The likelihood of the haplotype configuration found by SUPERLINK is 4.2 times higher than the one reported by SIMWALK2.

Published Disease Data

Two published data sets, from a study of Krabbe disease and from a study of Episodic Ataxia (EA), were analysed.

The first data set consists of 9 individuals typed at 8 polymorphic markers. The second data set consists of 29 individuals, all typed at 9 polymorphic markers except for the founders in the first two generations.

For the Krabbe data set, the most likely haplotype configuration obtained by Superlink is identical to the one obtained by MCMC via simwalk2, by Lin and Speed, and by pedphase.

For the Episodic Ataxia data set, the most probable configuration differs from the one obtained by simwalk2 in the position of one recombination event. The only difference is the genotype phase in the fourth marker of individuals 1007 and 113. This configuration is also very similar to the one found by Lin and Speed.
