Reconstructing Sibling Relationships from Genotyping Data

Saad Sheikh, Department of Computer Science, University of Illinois at Chicago

[Figure: images captioned "Brothers!", "?" and "?"]

• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology

• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.

• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

Lemon sharks, Negaprion brevirostris

2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest

Biological Motivation

Basic Genetics

• Gene: unit of inheritance
• Allele: actual genetic sequence
• Locus: location of an allele in the entire genetic sequence
• Diploid: 2 alleles at each locus

Diploid Siblings

• Siblings: two children with the same parents
• Question: given a set of children, find the sibling groups

Each locus holds a pair of alleles:

father  (.../...), (a/b), (.../...), (.../...)
mother  (.../...), (c/d), (.../...), (.../...)
child   (.../...), (e/f), (.../...), (.../...)

• At each locus the child has one allele from the father and one from the mother
• Recombination

Microsatellites (STR)

Advantages:
• Codominant (easy inference of genotypes and allele frequencies)
• Many heterozygous alleles per locus
• Possible to estimate other population parameters
• Cheaper than SNPs

But:
• Few loci

And:
• Large families
• Self-mating
• …

[Figure: a CA-repeat microsatellite read from the 5' end (CACACACA...), with three alleles #1, #2 and #3 that differ in the number of repeats]

Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3

Sibling Reconstruction Problem

Animal  Locus1  Locus2
1       1/2     11/22
2       1/3     33/44
3       1/4     33/55
4       1/3     77/66
5       1/3     33/44
6       1/3     33/77
7       1/5     88/22
8       1/6     22/22

(each entry is allele1/allele2)

Sibling Groups:
• 2, 4, 5, 6
• 1, 3
• 7, 8

S={P1={2,4,5,6},P2={1,3},P3={7,8}}

Existing Methods

Method                          Approach                                      Error detection  Assumptions
Almudevar & Field (1999, 2003)  Minimal sibling groups under likelihood       No               Minimal sibgroups, representative allele frequencies
KinGroup (2004)                 Markov Chain Monte Carlo/ML                   No               Allele frequencies etc. are representative
Family Finder (2003)            Partition population using likelihood graphs  No               Allele frequencies etc. are representative
Pedigree (2001)                 Markov Chain Monte Carlo/ML                   No               Allele frequencies etc. are representative
COLONY (2004)                   Simulated annealing                           Yes              Monogamy for one sex
Fernandez & Toro (2006)         Simulated annealing                           No               Co-ancestry matrix is a good measure; parents can be reconstructed or are available

KINSHIP

David C. Queller and Keith F. Goodnight.

Computer software for performing likelihood tests of pedigree relationship using genetic markers.

Molecular Ecology, 8:1231–1234, 1999.

KINSHIP

• First software and likelihood measure for sibling/kinship reconstruction
• Estimates a ratio of two likelihoods: primary vs. null hypothesis
• Assumes population frequencies are known
• Probability of sharing an allele: R is the probability of alleles being identical by descent
  • Rp = Probability(Xp = Yp)
  • Rm = Probability(Xm = Ym)

Relationship                  Rp   Rm
Mother–offspring              0.0  1.0
Father–offspring              1.0  0.0
Full siblings                 0.5  0.5
Full sisters (haplodiploid)   1.0  0.5
Half siblings (maternal)      0.0  0.5
Cousins (maternal)            0.0  0.3
Unrelated                     0.0  0.0

Haploid Likelihood

Two individuals X = <X> and Y = <Y>

• If X = Y:
  Likelihood = Pr(drawing X) × Pr(Y = X) = Px(R + (1 - R)Px)
• Otherwise:
  Likelihood = Pr(drawing X) × Pr(Y ≠ X) = Px(1 - R)Py

Diploid Individuals

Diploid individuals X = <Xp/Xm>, Y = <Yp/Ym>

Assumptions:
• We know which alleles are the mother's and which are the father's
• No inbreeding
  • Likelihood = Likelihoodp × Likelihoodm
• Loci are independent
  • Total likelihood is a product of likelihoods across loci

Calculating Likelihood

Population frequencies: Pxm, Pxp, Pym, Pyp

Likelihoods:

           Xp = Yp                                          Xp ≠ Yp
Xm = Ym    Pxm(Rm + (1 - Rm)Pxm) × Pxp(Rp + (1 - Rp)Pxp)    Pxm(Rm + (1 - Rm)Pxm) × Pxp(1 - Rp)Pyp
Xm ≠ Ym    Pxm(1 - Rm)Pym × Pxp(Rp + (1 - Rp)Pxp)           Pxm(1 - Rm)Pym × Pxp(1 - Rp)Pyp
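The per-locus likelihoods above translate into a few lines of code. The following is a minimal sketch, not the KINSHIP implementation: it assumes each genotype is given as a (paternal, maternal) pair of allele labels, that `freq` maps each allele to its population frequency, and that the function names and the example frequencies are purely illustrative.

```python
# (Rp, Rm) pairs taken from the relationship table in the slides.
FULL_SIBS = (0.5, 0.5)
UNRELATED = (0.0, 0.0)

def side_likelihood(x, y, r, freq):
    """Haploid likelihood for one parental side: X carries x, Y carries y."""
    px, py = freq[x], freq[y]
    if x == y:
        return px * (r + (1 - r) * px)
    return px * (1 - r) * py

def locus_likelihood(X, Y, rp, rm, freq):
    """Diploid likelihood at one locus: paternal side times maternal side."""
    (xp, xm), (yp, ym) = X, Y
    return side_likelihood(xp, yp, rp, freq) * side_likelihood(xm, ym, rm, freq)

def likelihood_ratio(X_loci, Y_loci, primary, null, freqs):
    """Product over independent loci of primary/null likelihoods."""
    ratio = 1.0
    for X, Y, freq in zip(X_loci, Y_loci, freqs):
        ratio *= (locus_likelihood(X, Y, *primary, freq)
                  / locus_likelihood(X, Y, *null, freq))
    return ratio

# Hypothetical single-locus example with two equally frequent alleles.
freq = {"a": 0.5, "b": 0.5}
example = likelihood_ratio([("a", "b")], [("a", "b")], FULL_SIBS, UNRELATED, [freq])
```

The ratio is undefined when the null hypothesis gives a likelihood of zero (for instance a null with R = 1 and mismatching alleles), so a real tool would need to guard against that case.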

Likelihood Ratios

• Independent likelihood is not very reliable or meaningful
• Different ratios => different loci
• Ratio != statistical significance
• Simulations are used to determine P-values

Statistical Significance

• Randomly generate an individual X using allele frequencies
• Draw Y using Rm and Rp
  • First allele: copy X's allele with probability Rm, or vice versa
  • Second allele: copy X's allele with probability Rp, or vice versa
• Draw a large number of such <X,Y> pairs
• The value of the ratio that excludes 95% of such pairs is at P=0.05 significance

Family Finder

Jen Beyer and B. May.

A graph-theoretic approach to the partition of individuals into full-sib families.

Molecular Ecology, 12:2243–2250, 2003.

Graph-Theory?

• Build a graph of all individuals
• Connect individuals with edges representing relationships
• Assign the likelihood ratio Full Sib/Unrelated as the distance measure
• Filter using the likelihood ratio at the 0.05 significance level
• Find a cut

Algorithm

1. Calculate LFS/LUR likelihood ratios for all pairs.
2. Build a graph representing the full-sib relationships.
3. Find the connected components in the graph and store them in a queue.
4. While the queue is not empty do:
   1. Remove a component from the queue and calculate its score.
   2. Build a GH cut tree for the component.
   3. For each cut with less than 1/3 the total number of edges in the component do:
      • Score the components that would result if the cut's edges were removed.
      • If the scores are the best found so far, then store them.
   4. If the best scores found are higher than the score for the original component, then separate the families and put them in the queue for further analysis. Otherwise save the original component as a result family.
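The first three steps of the algorithm (pairwise ratios, graph construction, connected components) can be sketched as follows. This is only an illustration under stated assumptions: `ratio` is a hypothetical callable returning the LFS/LUR likelihood ratio for a pair, `threshold` stands in for the significance cut-off, and the GH cut-tree refinement in the loop is deliberately left out.

```python
from collections import deque

def full_sib_components(individuals, ratio, threshold):
    """Build the full-sib graph and return its connected components in a queue."""
    adjacency = {x: set() for x in individuals}
    for i, x in enumerate(individuals):
        for y in individuals[i + 1:]:
            # Keep an edge only when the pair passes the (assumed) cut-off.
            if ratio(x, y) >= threshold:
                adjacency[x].add(y)
                adjacency[y].add(x)

    queue, seen = deque(), set()
    for start in individuals:
        if start in seen:
            continue
        component, frontier = set(), [start]
        seen.add(start)
        while frontier:
            node = frontier.pop()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                frontier.append(neighbour)
        queue.append(component)
    return queue

# Hypothetical ratios: unlisted pairs get a low ratio of 0.1.
ratios = {(1, 2): 5.0, (2, 3): 4.0, (4, 5): 6.0}
components = full_sib_components(
    [1, 2, 3, 4, 5, 6], lambda x, y: ratios.get((x, y), 0.1), threshold=3.0
)
```

Each component in the queue would then be scored and, where a cut improves the score, split further as described in step 4.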

Example

Score the components and Keep the best cuts

Conclusion – Family Finder

• Some theoretical basis
• Efficiently computable
• Produces reasonably good results for many loci
• A lot of assumptions because of the Goodnight & Queller measure
• Requires a significant number of loci (8+)
• Works well only when families are almost equal in size

Parsimony = Occam's Razor
• "entities must not be multiplied beyond necessity"
• "plurality should not be posited without necessity"

“Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor.”

• Wikipedia

Min Sib groups = Most Parsimonious explanation

Parsimony

http://en.wikipedia.org/wiki/Occam's_razor

Mendelian Constraints

4-allele rule: siblings have at most 4 different alleles in a locus
• Yes: 3/3, 1/3, 1/5, 1/6
• No: 3/3, 1/3, 1/5, 1/6, 3/2

2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
• a = number of distinct alleles
• R = number of alleles that appear with 3 others or are homozygous
• Yes: 3/3, 1/3, 1/5
• No: 3/3, 1/3, 1/5, 1/6
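Both rules are simple counting checks on the genotypes at one locus. The sketch below implements them exactly as stated on the slide, assuming each genotype is a pair of allele labels; the function names are illustrative, and no claim is made that the counting condition alone captures every detail of the published 2-allele property.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at the locus."""
    alleles = {a for g in genotypes for a in g}
    return len(alleles) <= 4

def two_allele_ok(genotypes):
    """2-allele rule as stated on the slide: a + R <= 4."""
    alleles = {a for g in genotypes for a in g}
    partners = {a: set() for a in alleles}
    homozygous = set()
    for a, b in genotypes:
        if a == b:
            homozygous.add(a)
        else:
            partners[a].add(b)
            partners[b].add(a)
    # R counts alleles that are homozygous or appear with 3 other alleles.
    r = sum(1 for a in alleles if a in homozygous or len(partners[a]) >= 3)
    return len(alleles) + r <= 4

# The "Yes"/"No" examples from the slide.
yes_4 = four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)])
no_4 = four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)])
yes_2 = two_allele_ok([(3, 3), (1, 3), (1, 5)])
no_2 = two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)])
```

In the last example a = 4 and R = 2 (allele 3 is homozygous and allele 1 appears with three others), so a + R = 6 and the group is rejected.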

Min Sibgroups Reconstruction

Find the minimum number of sibling groups necessary to explain the given cohort.

Minimum Set Cover:
• Cohort as universe U
• Individuals as elements of U
• Covering groups C include all genetically feasible sibling groups
• NP-complete even when we know the sibsets have at most 3 individuals
• Hard to approximate (Ashley et al. 09)
• ILP formulation (Chaovalitwongse et al. 08)

Minimum Set Cover

Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm} where each Si is a subset of U

Find: the smallest number of sets in S whose union is the universe U

$\min_{I \subseteq [m]} |I|$ such that $\bigcup_{i \in I} S_i = U$

• Minimum Set Cover is NP-hard
• (1 + ln n)-approximable (sharp)

2-Allele Min Set Cover

1. Generate all maximal feasible sibling groups (sets) that satisfy the 2-allele property using the "2-Allele Algorithm" [ISMB 2007; Bioinformatics 23(13)]
2. Use Min Set Cover to find the minimum sibling groups
   • Optimally using ILP (CPLEX)
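The slides solve the cover step optimally with an ILP solver. As a lighter illustration of the same step, here is the standard greedy approximation for set cover, which is a different technique from the ILP used by the authors. The candidate groups in the example are hypothetical and are assumed to have already passed the feasibility checks.

```python
def greedy_set_cover(universe, candidate_groups):
    """Repeatedly pick the group that covers the most uncovered individuals."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(candidate_groups, key=lambda g: len(uncovered & g))
        if not uncovered & best:
            raise ValueError("candidate groups do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

# Hypothetical feasible groups over a cohort of 8 individuals.
groups = [{2, 4, 5, 6}, {1, 3}, {7, 8}, {1, 2}, {5, 6, 7}]
cover = greedy_set_cover(range(1, 9), groups)
```

The greedy choice gives the logarithmic approximation guarantee quoted for Minimum Set Cover but is not guaranteed to find the minimum number of sibling groups.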

2-Allele Algorithm Overview

• Generate candidate sets from all pairs of individuals
• Compare every set to every individual x:
  • If x can be added to the set without affecting "accommodability" or violating the 2-allele property: add it
  • If the "accommodability" is affected, but the 2-allele property is still satisfied: create a new copy of the set, and add x to it
  • Otherwise ignore the individual and compare the next

ID  alleles
1   1/2
2   2/3
3   2/1
4   1/3
5   3/2
6   1/4

Canonical families

Possible genotypes: 1/1, 1/2, 1/3, 1/4, 2/2, 2/3, 2/4, 3/4, 3/3, 4/4

[Figure: tables listing the canonical families built from these genotypes]

ID  alleles
1   55/43
2   43/114
3   43/55
4   55/114
5   114/43
6   55/78

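Examples

[Figure: three cases of comparing an individual against a candidate group, labeled "Add", "New Group Add (won't accommodate (2/2))" and "Can't add (a+R = 4)". Genotypes shown: 1/3, 2/1, 2/3, 2/1, 3/2; then 1/4, 1/2, 3/4 with 3/2; 1/4, 1/2, 3/2 with 3/2; and 1/4, 1/2, 1/1 with 1/5.]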

Testing and Validation: Protocol

1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
   • Partition distance, Gusfield '03
4. Compare results to other sibship methods

Real Data

• Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
• Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
• Ants (Leptothorax acervorum), Hammond et al., 1999: the ants dataset [16] comes from a haplodiploid species. The data consists of 377 diploid worker ants


Random Data Generation

• Generate F females and M males (F=M=5, 10, 15)
• Each with l loci (l=2, 4, 6)
• Each locus with a alleles
  • a[uniform]=5, 10, 15
  • a[nonuniform]=4 (12-4-1-1)
• Generate f families
  • f[uniform]=2, 5, 10
  • f[nonuniform]=5
• For each family select female+male uniformly at random
• For each parent pair generate o offspring
  • o[uniform]=2, 5, 10
  • o[nonuniform]=25-10-10-4-1
• For each offspring for each locus choose allele outcome uniformly at random
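The uniform variant of this generation protocol can be sketched in a few lines. This is an illustrative reading of the slide rather than the authors' generator: the function name, the parameter names and the seed are assumptions, alleles are numbered 1..a, and nothing prevents two families from drawing the same parent pair.

```python
import random

def simulate_cohort(n_parents=5, n_loci=2, n_alleles=5,
                    n_families=2, n_offspring=5, seed=0):
    """Return (cohort, truth): offspring genotypes and their true family ids."""
    rng = random.Random(seed)

    def random_parent():
        # One (allele1, allele2) pair per locus, alleles drawn uniformly.
        return [(rng.randint(1, n_alleles), rng.randint(1, n_alleles))
                for _ in range(n_loci)]

    females = [random_parent() for _ in range(n_parents)]
    males = [random_parent() for _ in range(n_parents)]

    cohort, truth = [], []
    for family in range(n_families):
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(n_offspring):
            # At each locus: one allele from the father, one from the mother.
            child = [(rng.choice(f_locus), rng.choice(m_locus))
                     for f_locus, m_locus in zip(father, mother)]
            cohort.append(child)
            truth.append(family)
    return cohort, truth

cohort, truth = simulate_cohort(seed=1)
```

The `truth` labels give the known sibgroups needed by the validation protocol, so a reconstruction can be compared against them.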

Results

Summary (Min Sib Groups)

2-Allele Min Set Cover
• First combinatorial approach
• Makes no assumptions other than parsimony
• Works consistently and comparatively

Sibling Reconstruction
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors

Answer: Consensus

Consensus Methods

• Combine multiple solutions to a problem to generate one unified solution
• C: S* → S
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. phylogenetic trees

[Diagram: S1, S2, ..., Sk → Consensus → S]

Strict Consensus

• Only Pareto Optimality and Anti-Pareto Optimality are enforced
• All solutions must agree on equivalence: $x \equiv_S y$ if and only if $x \equiv_{S_i} y$ for every $i$
• All disputed individuals go to singletons

Strict Consensus

S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}

S={{1,2},{3},{4},{5},{6,7}}

5 sibling groups? When 3 can do?
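Strict consensus amounts to keeping two individuals together only when every solution does. A minimal sketch, assuming each solution is a partition of the same individuals given as a list of sets (the function name is illustrative):

```python
def strict_consensus(solutions):
    """Group individuals that share a group in every input solution."""
    signature = {}
    for solution in solutions:
        for group_id, group in enumerate(solution):
            for x in group:
                # Record which group x belongs to in this solution.
                signature.setdefault(x, []).append(group_id)

    groups = {}
    for x, sig in signature.items():
        # Identical signatures mean "together in all solutions".
        groups.setdefault(tuple(sig), set()).add(x)
    return sorted(groups.values(), key=min)

S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
strict = strict_consensus([S1, S2, S3])
```

On the three solutions from the slide this yields the five groups {1,2}, {3}, {4}, {5}, {6,7}, which illustrates why strict consensus tends to produce many singletons.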

Majority Consensus

• A majority of solutions determines the final solution
• Two individuals are together if a majority of solutions vote in their favour
• Violates transitivity: A ≡ B ∧ B ≡ C ⇒ A ≡ C

S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}

1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4

Voting Consensus
• Majority under closure
• Results in large monolithic groups

Majority Consensus

Voting Consensus: 1 ≡ 5 ?

S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}

S={{1,2,3,4,5},{6,7}}
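The difference between majority and voting consensus is easy to reproduce in code. The sketch below counts pairwise votes and then takes the transitive closure with a small union-find; it assumes each solution is a list of sets over comparable individuals, and the function names are illustrative.

```python
from itertools import combinations

def majority_pairs(solutions):
    """Pairs of individuals placed together by a strict majority of solutions."""
    votes = {}
    for solution in solutions:
        for group in solution:
            for pair in combinations(sorted(group), 2):
                votes[pair] = votes.get(pair, 0) + 1
    return {pair for pair, count in votes.items() if 2 * count > len(solutions)}

def voting_consensus(solutions):
    """Majority under closure: merge along every majority pair."""
    individuals = sorted({x for sol in solutions for g in sol for x in g})
    parent = {x: x for x in individuals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in majority_pairs(solutions):
        parent[find(a)] = find(b)

    groups = {}
    for x in individuals:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=min)

S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
pairs = majority_pairs([S1, S2, S3])
voting = voting_consensus([S1, S2, S3])
```

Here (1,3) and (3,4) are majority pairs while (1,4) is not, so the majority relation is not transitive; closing it merges individuals 1 through 5 into one group.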

Consensus Methods

• Commonly used consensus methods don't work [AAAI-MPREF08]
• Strict Consensus produces too many singletons
• Majority violates transitivity AND doesn't work for error-tolerance

Distance-based Consensus

Algorithm
• Compute a consensus solution S={g1,...,gk}
• Search for a good solution near S

[Diagram: S1, S2, ..., Sk → Consensus → S → Search (using fd and fq) → Ss]

Distance-based Consensus

Needs
• A distance function fd: S × S → R
• A quality function fq: S → R

What is the catch? [Sheikh et al. CSB 2008]
• Optimization of fd, fq or an arbitrary linear combination is NP-complete
• Reduction from the 2-Allele Min Set Cover problem

A Greedy Approach

Algorithm
• Compute a strict consensus
• While distance is not too large
  • Merge two nearest sibgroups

Quality: fq = n - |C|

Distance function: fd(C,C') = cost of merging groups in C to obtain C'

A Greedy Approach

S1={ {1,2,3}, {4,5}, {6,7} }
S2={ {1,2,3}, {4}, {5,6,7} }
S3={ {1,2}, {3,4,5}, {6,7} }

Strict Consensus S={ {1,2}, {3}, {4}, {5}, {6,7} }

        {1,2}  {3}  {4}  {5}  {6,7}
{1,2}   -      3.5  1.1  2.5  5.1
{3}     0.5    -    0.3  0.5  0.1
{4}     1.0    3.0  -    0.6  1.1
{5}     2.0    1.2  3.5  -    4.9
{6,7}   0.6    0.9  1.2  4.1  -

After merging {3} and {6,7}:

         {1,2}  {3,6,7}  {4}  {5}
{1,2}    -      3.5      1.1  2.5
{3,6,7}  1.7    -        3.1  2.2
{4}      1.0    3.0      -    0.6
{5}      2.0    1.2      3.5  -

S={ {1,2}, {3,6,7}, {4}, {5} }

Greedy Consensus

Distance function (sibgroup, sibgroup): cost of assigning all individuals

$f_d(P_i, P_j) = \min\Big( \sum_{X \in P_i} f_{assign}(P_j, X),\ \sum_{X \in P_j} f_{assign}(P_i, X) \Big)$

Distance function (sibgroup, individual):
• Benefit: alleles and allele pairs shared
• Cost: minimum edit distance

$f_{assign}(P_i, X) = \begin{cases} \text{benefit} & \text{if } X \text{ can be a member of } P_i \\ \text{cost} & \text{if } X \text{ cannot be a member of } P_i \end{cases}$

Greedy Consensus

Algorithm
• Compute a strict consensus
• While distance is not too large
  • Merge the two sibgroups which will minimize the TOTAL merging cost
  • Store the new merging cost in the merged set
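The greedy loop can be sketched as below. This is a simplified illustration, not the authors' implementation: `pair_cost` is a hypothetical callable standing in for the merging cost, `max_cost` stands in for "distance is not too large", and the example reads the first cost table of the earlier slide as row-group to column-group costs, treating any pair not listed there as infinitely expensive.

```python
def greedy_merge(groups, pair_cost, max_cost):
    """Start from a consensus and merge the cheapest pair while it is affordable."""
    groups = [frozenset(g) for g in groups]
    while len(groups) > 1:
        cost, i, j = min(
            (pair_cost(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(len(groups))
            if i != j
        )
        if cost > max_cost:
            break
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return sorted((set(g) for g in groups), key=min)

# Strict consensus and cost table from the slide (diagonal omitted).
labels = [{1, 2}, {3}, {4}, {5}, {6, 7}]
matrix = [
    [None, 3.5, 1.1, 2.5, 5.1],
    [0.5, None, 0.3, 0.5, 0.1],
    [1.0, 3.0, None, 0.6, 1.1],
    [2.0, 1.2, 3.5, None, 4.9],
    [0.6, 0.9, 1.2, 4.1, None],
]
costs = {
    (frozenset(labels[i]), frozenset(labels[j])): matrix[i][j]
    for i in range(5) for j in range(5) if i != j
}
merged = greedy_merge(
    labels, lambda a, b: costs.get((a, b), float("inf")), max_cost=0.5
)
```

With this threshold the only merge is {3} with {6,7} at cost 0.1, which reproduces the solution S={ {1,2}, {3,6,7}, {4}, {5} } shown on the slide. A fuller version would recompute costs for merged groups instead of looking them up.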

Error-Tolerant Approach

[Diagram: Locus 1, Locus 2, Locus 3, ..., Locus k → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]

Results

>90% accuracy for all real data

Results

Results

Impossibility Result

A consensus method CANNOT be all of these [Arrow 1963, Mirkin 1975]:
• Fair
• Independent
• Pareto Optimal

Biologically [AAAI-MPREF 2008]:
• The subset of individuals chosen will impact the consensus considerably

Problems

• Parametric
• Does NOT outperform other algorithms on:
  • Biological data
  • Smaller families
  • High allele frequencies

Auto Greedy Consensus

• Change costs to average per-locus costs
• Compare max group error on a per-locus basis
• Treat cost and benefit independently
• In order to qualify a merge:
  • Cost <= maxcost
  • Benefit >= minbenefit
  • Benefit = max benefit among possible merges

Results

Summary (Consensus)

• First consensus method for Sibship Reconstruction
  • Majority won't work
• First combinatorial approach for Error-Tolerant Sibship Reconstruction
  • Fewer assumptions
  • More efficient
• Distance-based Consensus is NP-Hard
• New non-parametric consensus

Parsimony: Alternate Objectives

Min number of sibgroups is just ONE way to interpret parsimony

Alternate objectives:
• Sibship that minimizes number of parents
  • Very hard! Connection to Raz's Parallel Repetition Theorem
• Sibship that minimizes number of matings
• Sibship that maximizes family size
• Sibship that tries to satisfy uniform allele distributions

Parsimony: Minimize Parents

Problem statement:
• Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized

Observations and challenges:
• MinParents: intractable, inapproximable
  • Reduction from Min-Rep Problem (Raz's Parallel Repetition Theorem)
• There may be $O(2^{|loci|})$ potential parents for a sibgroup
• Self-mating (plants) may or may not be allowed

Is MinParents = MinSibgroups?

Not necessarily…

Parent  Genotype at Locus 1
P1      1/10
P2      2/20
P3      3/30
P4      4/40
P5      3/50

Child  Parents  Genotype
A      P1-P2    1/20
B      P1-P2    2/10
C      P1-P2    10/20
D      P1-P3    1/30
E      P1-P3    10/3
F      P4-P2    4/20
G      P4-P2    40/2
H      P4-P3    40/3
I      P4-P3    4/30
J      P4-P5    4/50

Min Parents Meta Approach

1. Generate M, a set of covering groups
2. Cover a subset S of covering groups
3. For each group x in S
   1. Generate parent pairs for x
   2. Insert parent vertices into graph G (if needed)
   3. Connect the parents in each parent pair
4. Cover the minimum vertices necessary to (doubly) cover all the individuals

Example:
M={{1,2},{3,6,7},{3,5},{2,4},{1,6},{2,5},{6,7}}
S={{1,2,4},{3,5},{6,7}}
X={3,5}
Parent pairs for X: {F=5/10, M=2/20}, {F=5/20, M=2/10}

[Figure: graph for X={3,5} with parent vertices 5/10, 2/20, 5/20 and 2/10]

Covering Groups

Different approaches to selecting a subset of maximal feasible groups:
• Greedy Min Set Cover
• K-Greedy Min Set Covers
• All sets! (nearing optimality)

Forget maximal feasible sibling groups:
• Generate K random minimal feasible sibling reconstructions

Generating Parents

• The number of generated parents is just too many!
• Mine association rules across loci: {A,B}locus1 => {C,D}locus2
• Use association rules to filter parents: {A,B}locus1 => {C,D}locus2 OR {C',D'}locus2
• Polygamy => high-confidence association rules
• No polygamy => Min Parents = Min Groups
• If self-mating is not allowed, odd cycles must be disallowed

Covering Vertices

• Heuristic: while all vertices are not covered
  • Select the vertex that will cover the most uncovered individuals
• MIP formulation

Results

Legend:
• M1: k-greedy cover with optimal graph cover
• M2: greedy set cover with optimal graph cover
• M3: randomized cover with optimal graph cover
• M4: k-greedy with graph heuristics
• M5: greedy set cover with graph heuristic

Results

Complexity Results

• The reduction is from a version of the Parallel Repetition Theorem, and it holds even if we know all the parents and just need to find the minimum set of parents to choose!
• But what is the parallel repetition theorem?

[Diagram: a 2-prover 1-round proof system and the label cover problem for bipartite graphs, with small inapproximability; boosting (Raz's parallel repetition theorem) leads to the parallel repetition of the 2-prover 1-round proof system and the label cover problem for some kind of "graph product" for bipartite graphs, with larger inapproximability. Two links are labeled "restriction", and the Unique Games Conjecture appears as a separate node.]

We need some version of Raz’s parallel repetition theorem that is suitable for us

Fortunately, the following two papers helped:

U. Feige, A threshold of ln n for approximating set-cover, Journal of the ACM, 1998

G. Kortsarz, R. Krauthgamer and J. R. Lee, Hardness of Approximating Vertex-Connectivity Network Design Problems, SIAM J. of Computing, 2004

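Inapproximability for MINREP (Raz's parallel repetition theorem)

Let $L \in NP$ and let $x$ be an input instance of $L$. There is a reduction from $L$ to MINREP, running in $O(n^{\mathrm{polylog}(n)})$ time, such that

• $x \in L \Rightarrow OPT \le \alpha + \beta$
• $x \notin L \Rightarrow OPT \ge (\alpha + \beta) \cdot 2^{\log^{1-\varepsilon}(|A| + |B|)}$

where $0 < \varepsilon < 1$ is any constant.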

MINREP (minimum representative) problem

Input graph G: a bipartite graph on vertex sets A and B
• A is split into α partitions A1, A2, …, Aα, all of equal size
• B is split into β partitions B1, B2, B3, …, Bβ, all of equal size

Associated "super"-graph H: the partitions A1, …, Aα are the A "super"-nodes and B1, …, Bβ are the B "super"-nodes
• $(A_1, B_2) \in H$ if there exist $u \in A_1$ and $v \in B_2$ such that $(u, v) \in G$
• In this case, the edge $(u, v) \in G$ is a witness of the super-edge $(A_1, B_2) \in H$

MINREP goal
• Valid solution: $A' \subseteq A$ and $B' \subseteq B$ such that $A' \cup B'$ contains a witness for every super-edge
• Objective: minimize the size of the solution $|A' \cup B'|$
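The MINREP definition can be made concrete with a validity check and an exhaustive search on a tiny instance. This is a sketch for illustration only: the vertex names, the `part_of` mapping and the example graph are hypothetical, and the brute-force search is exponential, which is consistent with the hardness result rather than a practical solver.

```python
from itertools import combinations

def super_edges(edges, part_of):
    """Super-edges of H induced by the edges of G."""
    return {(part_of[u], part_of[v]) for u, v in edges}

def is_valid(chosen, edges, part_of):
    """True if the chosen vertices contain a witness for every super-edge."""
    witnessed = {(part_of[u], part_of[v])
                 for u, v in edges if u in chosen and v in chosen}
    return witnessed == super_edges(edges, part_of)

def brute_force_minrep(vertices, edges, part_of):
    """Smallest valid vertex subset, found by trying sizes in increasing order."""
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_valid(set(subset), edges, part_of):
                return set(subset)

# Hypothetical instance: alpha = beta = 2, two vertices per partition.
part_of = {"a1": "A1", "a2": "A1", "a3": "A2", "a4": "A2",
           "b1": "B1", "b2": "B1", "b3": "B2", "b4": "B2"}
edges = [("a1", "b1"), ("a1", "b3"), ("a3", "b1"),
         ("a3", "b3"), ("a2", "b2"), ("a4", "b4")]
solution = brute_force_minrep(sorted(part_of), edges, part_of)
```

In this instance one representative per partition suffices, so the optimum has size α + β = 4.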

Informally,
• given a set of children,
• given a candidate set of parents,
• assuming we believe in the Mendelian inheritance law,
• assuming that the parents tried to be as monogamous as possible,

can we partition the children into a set of full sibling groups? (A full sibling group has the same pair of parents.)

MINREP can be reduced to this problem, which shows that it is hard

Conclusions

• Parsimony-based combinatorial optimization works best with the least amount of information
• Parsimony-based combinatorial optimization is NP-hard and inapproximable
• First combinatorial approach for Error-Tolerant Sibship Reconstruction
  • Fewer assumptions
  • More efficient
• Other parsimony-based optimization objectives are possible
  • Min Parents is interesting and hard!

Future Work

• Better heuristics for Min Parents?
• Other parsimony objectives
• Further analysis of when objectives give the same results

Sibship Reconstruction Project

• Mary Ashley, UIC
• W. Art Chaovalitwongse, Rutgers
• Isabel Caballero, UIC
• Ashfaq Khokhar, UIC
• Tanya Berger-Wolf, UIC
• Priya Govindan, UIC
• Bhaskar DasGupta, UIC
• Chun-An (Joe) Chou, Rutgers

Thank You!! Questions?
