Reconstructing Sibling Relationships from Genotyping Data
Saad Sheikh
Department of Computer Science
University of Illinois at Chicago
Brothers!
?
?
• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.
• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier
Lemon sharks, Negaprion brevirostris
2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest
Biological Motivation
Gene: unit of inheritance
Allele: actual genetic sequence
Locus: location of allele in entire genetic sequence
Diploid: 2 alleles at each locus
Basic Genetics
Diploid Siblings
Siblings: two children with the same parents
Question: given a set of children, find sibling groups
[Figure: inheritance at one locus; labels: locus, allele]
father: (.../...), (a/b), (.../...), (.../...)
mother: (.../...), (c/d), (.../...), (.../...)
child: (.../...), (e/f), (.../...), (.../...)
Of the child's two alleles: one from father, one from mother
recombination
Microsatellites (STR)
Advantages:
Codominant (easy inference of genotypes and allele frequencies)
Many heterozygous alleles per locus
Possible to estimate other population parameters
Cheaper than SNPs
But:
Few loci
And:
Large families
Self-mating…
[Figure: three microsatellite alleles, #1, #2 and #3, that differ in the number of CA repeats (5'-CACACACA...)]
Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3
Sibling Reconstruction Problem
Animal  Locus 1  Locus 2    (allele1/allele2)
1       1/2      11/22
2       1/3      33/44
3       1/4      33/55
4       1/4      77/66
5       1/3      33/44
6       1/3      33/77
7       1/5      88/22
8       1/6      22/22

Sibling Groups:
2, 4, 5, 6
1, 3
7, 8
S={P1={2,4,5,6},P2={1,3},P3={7,8}}
Existing Methods
Method | Approach | Error Detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated Annealing | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated Annealing | No | Co-ancestry matrix is a good measure; parents can be reconstructed or are available
KINSHIP
David C. Queller and Keith F. Goodnight.
Computer software for performing likelihood tests of pedigree relationship using genetic markers.
Molecular Ecology, 8:1231–1234, 1999.
KINSHIP
First software and likelihood measure for sibling/kinship reconstruction
Estimates a ratio of two likelihoods: Primary vs. Null Hypothesis
Assumes Population Frequencies are known
Probability of sharing allele
R: probability of alleles being identical by descent
Rp = Probability(Xp = Yp)
Rm = Probability(Xm = Ym)

Relationship                  Rp    Rm
Mother–offspring              0.0   1.0
Father–offspring              1.0   0.0
Full siblings                 0.5   0.5
Full sisters (haplodiploid)   1.0   0.5
Half siblings (maternal)      0.0   0.5
Cousins (maternal)            0.0   0.3
Unrelated                     0.0   0.0
Haploid Likelihood
Two individuals X = <X> and Y = <Y>
If X = Y:
Likelihood = Pr(drawing X) × Pr(X = Y) = $P_X\,(R + (1 - R)\,P_X)$
Otherwise:
Likelihood = Pr(drawing X) × Pr(X ≠ Y) = $P_X\,(1 - R)\,P_Y$
Diploid Individuals
Diploid individuals X = <Xp/Xm>, Y = <Yp/Ym>
Assumptions:
We know which alleles are mother's and father's
No inbreeding
Likelihood = Likelihood_p × Likelihood_m
Loci are independent
Total likelihood is a product of likelihoods across loci
Calculating Likelihood
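Population frequencies: $P_{Xm}, P_{Xp}, P_{Ym}, P_{Yp}$
Likelihoods:
$X_m = Y_m,\ X_p = Y_p$: $P_{Xm}(R_m + (1 - R_m)P_{Xm}) \times P_{Xp}(R_p + (1 - R_p)P_{Xp})$
$X_m = Y_m,\ X_p \neq Y_p$: $P_{Xm}(R_m + (1 - R_m)P_{Xm}) \times P_{Xp}(1 - R_p)P_{Yp}$
$X_m \neq Y_m,\ X_p = Y_p$: $P_{Xm}(1 - R_m)P_{Ym} \times P_{Xp}(R_p + (1 - R_p)P_{Xp})$
$X_m \neq Y_m,\ X_p \neq Y_p$: $P_{Xm}(1 - R_m)P_{Ym} \times P_{Xp}(1 - R_p)P_{Yp}$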
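The four cases above all factor into one paternal term times one maternal term, so they can be sketched with a single per-allele helper. This is a minimal illustration, not the KINSHIP implementation: the function names, the (paternal, maternal) tuple layout and the frequency dictionary are assumptions made for the example.

```python
def allele_likelihood(x, y, r, freq):
    """Per-allele term: draw x from the population, then observe y given IBD probability r."""
    if x == y:
        return freq[x] * (r + (1.0 - r) * freq[x])
    return freq[x] * (1.0 - r) * freq[y]


def diploid_likelihood(x, y, rp, rm, freq):
    """Likelihood for one locus; genotypes are (paternal, maternal) tuples (assumed phased)."""
    xp, xm = x
    yp, ym = y
    return allele_likelihood(xp, yp, rp, freq) * allele_likelihood(xm, ym, rm, freq)


def likelihood_ratio(x_loci, y_loci, primary, null, freqs):
    """Product over independent loci of primary-hypothesis / null-hypothesis likelihoods.

    primary and null are (Rp, Rm) pairs, e.g. (0.5, 0.5) for full siblings
    and (0.0, 0.0) for unrelated individuals.
    """
    ratio = 1.0
    for x, y, freq in zip(x_loci, y_loci, freqs):
        ratio *= (diploid_likelihood(x, y, primary[0], primary[1], freq)
                  / diploid_likelihood(x, y, null[0], null[1], freq))
    return ratio
```

For example, with two equally frequent alleles and both individuals typed a/b, `likelihood_ratio([("a", "b")], [("a", "b")], (0.5, 0.5), (0.0, 0.0), [{"a": 0.5, "b": 0.5}])` compares the full-sibling hypothesis against the unrelated one.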
Likelihood Ratios
Independent Likelihood is not very reliable or meaningful
Different Ratios => Different Loci
Ratio != Statistical Significance
Simulations used to determine P-values
Statistical Significance
Randomly generate an individual X using allele frequencies
Draw Y using Rm and Rp
First allele: copy X's allele with probability Rm, or vice versa
Second allele: copy X's allele with probability Rp, or vice versa
Draw a large number of such <X,Y> pairs
The value of the ratio that excludes 95% of such pairs is at P=0.05 significance
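The simulation described above can be sketched as follows for a single locus. This is an illustrative sketch under assumptions, not the KINSHIP code: all function names are hypothetical, the hypothesis the pairs are simulated under is passed in as `simulate_under`, and the null hypothesis is assumed to be "unrelated" (Rp = Rm = 0) so that the denominator is never zero.

```python
import random


def draw_allele(freq, rng):
    """Draw one allele according to the population frequencies."""
    alleles = list(freq)
    return rng.choices(alleles, weights=[freq[a] for a in alleles])[0]


def draw_pair(freq, rp, rm, rng):
    """Draw X from the population, then Y by copying X's alleles with probabilities Rp and Rm."""
    x = (draw_allele(freq, rng), draw_allele(freq, rng))  # (paternal, maternal)
    yp = x[0] if rng.random() < rp else draw_allele(freq, rng)
    ym = x[1] if rng.random() < rm else draw_allele(freq, rng)
    return x, (yp, ym)


def allele_likelihood(x, y, r, freq):
    if x == y:
        return freq[x] * (r + (1 - r) * freq[x])
    return freq[x] * (1 - r) * freq[y]


def likelihood_ratio(x, y, primary, null, freq):
    num = (allele_likelihood(x[0], y[0], primary[0], freq)
           * allele_likelihood(x[1], y[1], primary[1], freq))
    den = (allele_likelihood(x[0], y[0], null[0], freq)
           * allele_likelihood(x[1], y[1], null[1], freq))
    return num / den


def ratio_threshold(freq, simulate_under, primary, null,
                    n_pairs=10000, quantile=0.95, seed=0):
    """Ratio value below which the given fraction of simulated pairs falls."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_pairs):
        x, y = draw_pair(freq, simulate_under[0], simulate_under[1], rng)
        ratios.append(likelihood_ratio(x, y, primary, null, freq))
    ratios.sort()
    return ratios[int(quantile * (n_pairs - 1))]
```

With `quantile=0.95`, an observed ratio above the returned value lies outside 95% of the simulated pairs, which corresponds to the P=0.05 level on the slide.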
Family Finder
Jen Beyer and B. May.
A graph-theoretic approach to the partition of individuals into full-sib families.
Molecular Ecology, 12:2243–2250, 2003.
Graph-Theory?
Build a graph of all individuals
Connect individuals with edges representing relationships
Assign likelihood ratio Full Sib/Unrelated as distance measure
Filter using likelihood ratio at 0.05 significance level
Find a cut
Algorithm
Calculate LFS/LUR likelihood ratios for all pairs
Build a graph representing the full-sib relationships
Find the connected components in the graph and store them in a queue
While the queue is not empty do
  Remove a component from the queue and calculate its score
  Build a GH (Gomory-Hu) cut tree for the component
  For each cut with less than 1/3 the total number of edges in the component do
    Score the components that would result if the cut's edges were removed
    If the scores are the best found so far, then store them
  If the best scores found are higher than the score for the original component, then separate the families and put them in the queue for further analysis
  Otherwise save the original component as a result family
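The first three steps of the algorithm (threshold the pairwise ratios, build the graph, queue its connected components) can be sketched as below. This is only an illustration with hypothetical names; the scoring and cut-tree refinement steps are deliberately left out, and the `ratios` dictionary keyed by index pairs is an assumed input format.

```python
from collections import deque


def full_sib_components(n, ratios, threshold):
    """Connected components of the graph whose edges are pairs with ratio >= threshold.

    n is the number of individuals (indexed 0..n-1); ratios maps (i, j) to the
    full-sib / unrelated likelihood ratio of that pair.
    """
    adjacency = {i: set() for i in range(n)}
    for (i, j), ratio in ratios.items():
        if ratio >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen = set()
    queue = deque()  # components waiting for further analysis
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        queue.append(sorted(component))
    return queue
```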
Example
Score the components and Keep the best cuts
Conclusion – Family Finder
Some theoretical basis
Efficiently computable
Produces reasonably good results for many loci
A lot of assumptions because of the Goodnight & Queller measure
Requires a significant number of loci: 8+
Works well only when families are almost equal size
Parsimony = Occam's Razor
"entities must not be multiplied beyond necessity"
"plurality should not be posited without necessity"
“Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor.”
• Wikipedia
Min Sib groups = Most Parsimonious explanation
Parsimony
4-allele rule: siblings have at most 4 different alleles in a locus
Yes: 3/3, 1/3, 1/5, 1/6
No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Yes: 3/3, 1/3, 1/5
No: 3/3, 1/3, 1/5, 1/6
Mendelian Constraints
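Both rules can be checked directly on the genotypes of a candidate group at one locus. The sketch below is an illustration with hypothetical function names; it reads "appear with 3 others" as "paired with at least three distinct other alleles", which is an interpretation of the slide rather than a quoted definition.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at this locus."""
    return len({allele for g in genotypes for allele in g}) <= 4


def two_allele_ok(genotypes):
    """2-allele rule: a + R <= 4 at this locus."""
    alleles = {allele for g in genotypes for allele in g}
    partners = {allele: set() for allele in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    a = len(alleles)
    r = sum(1 for allele in alleles
            if allele in homozygous or len(partners[allele]) >= 3)
    return a + r <= 4


# The examples from the slide:
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # Yes case
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # No case
print(two_allele_ok([(3, 3), (1, 3), (1, 5)]))                   # Yes case
print(two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))           # No case
```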
Find the minimum number of Sibling Groups necessary to explain the given cohort
Minimum Set Cover:
Cohort as universe U
Individuals as elements of U
Covering Groups C include all genetically feasible sibling groups
NP-complete even when sibsets are known to have size at most 3
Hard to approximate (Ashley et al. 09)
ILP formulation (Chaovalitwongse et al. 08)
Min Sibgroups Reconstruction
$\min_{I \subseteq [m]} |I| \quad \text{such that} \quad \bigcup_{i \in I} S_i = U$
Minimum Set Cover
Given: universe U = {1, 2, …, n}
collection of sets S = {S1, S2, …, Sm} where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U
Minimum Set Cover is NP-hard
(1 + ln n)-approximable (sharp)
1. Generate all maximal feasible sibling groups (sets) that satisfy 2-allele property using “2-Allele Algorithm” [ISMB 2007; Bioinformatics 23(13)]
2. Use Min Set Cover to find the minimum sibling groups
Optimally using ILP (CPLEX)
2-Allele Min Set Cover
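The slides solve the cover step optimally with an ILP. As a contrast, here is a minimal sketch of the standard greedy heuristic for Minimum Set Cover, which is the algorithm behind the (1 + ln n) approximation bound mentioned earlier. It is not the authors' ILP; the function name and input format are assumptions.

```python
def greedy_set_cover(universe, candidate_sets):
    """Repeatedly pick the candidate set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidate_sets, key=lambda s: len(uncovered & set(s)))
        gained = uncovered & set(best)
        if not gained:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(set(best))
        uncovered -= gained
    return chosen
```

In the sibling setting, `universe` would be the cohort and `candidate_sets` the maximal feasible sibling groups produced by the 2-Allele Algorithm.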
Generate candidate sets by all pairs of individuals
Compare every set to every individual x
If x can be added to the set without affecting "accommodability" or violating the 2-allele property: add it
If the "accommodability" is affected, but the 2-allele property is still satisfied: create a new copy of the set, and add x to it
Otherwise ignore the individual, compare the next
2-Allele Algorithm Overview
ID  alleles
1   1/2
2   2/3
3   2/1
4   1/3
5   3/2
6   1/4
Canonical families
1/1 1/2 1/3 1/4 2/2 2/3 2/4 3/4 3/3 4/4
[Figure: offspring genotype sets of the canonical families]
ID  alleles
1   55/43
2   43/114
3   43/55
4   55/114
5   114/43
6   55/78
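The table above appears to be the same six individuals as the earlier 1/2, 2/3, 2/1, ... table, with the raw allele sizes replaced by small labels in order of first appearance. A minimal sketch of such a relabeling for one locus follows; the function name is hypothetical and the ordering convention is an assumption inferred from the two tables.

```python
def canonical_relabel(genotypes):
    """Rename alleles 1, 2, 3, ... in order of first appearance at this locus."""
    label = {}
    relabeled = []
    for genotype in genotypes:
        relabeled.append(tuple(label.setdefault(allele, len(label) + 1)
                               for allele in genotype))
    return relabeled


raw = [(55, 43), (43, 114), (43, 55), (55, 114), (114, 43), (55, 78)]
print(canonical_relabel(raw))
```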
1/3
2/1
2/3
2/1
3/2
Add
New Group Add (won’t accommodate (2/2))
Can’t add (a+R =4)
Examples
1/41/ 2 3/ 4
3/ 2
1/41/ 2 3/ 2
3/ 2
1/41/ 2 1/ 1
1/ 5
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
   Partition distance, Gusfield '03
4. Compare results to other sibship methods
Testing and Validation: Protocol
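The partition distance used in step 3 is the minimum number of individuals that must be removed so that the two partitions become identical on the remaining individuals. The sketch below computes it by brute force over assignments of groups, which is only practical for a handful of groups; it illustrates the definition and is not the evaluation code used for the results. The function name is hypothetical.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of elements to remove so that partitions p and q coincide."""
    p = [frozenset(group) for group in p]
    q = [frozenset(group) for group in q]
    if set().union(*p) != set().union(*q):
        raise ValueError("partitions must be over the same individuals")
    n = sum(len(group) for group in p)
    small, large = sorted((p, q), key=len)
    best_kept = 0
    # Try every one-to-one assignment of the groups in `small` to groups in `large`
    # and keep the assignment that preserves the most individuals.
    for perm in permutations(range(len(large)), len(small)):
        kept = sum(len(small[i] & large[j]) for i, j in enumerate(perm))
        best_kept = max(best_kept, kept)
    return n - best_kept
```

For larger instances the same quantity can be obtained with a maximum-weight bipartite matching instead of enumerating permutations.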
Salmon (Salmo salar) - Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
Shrimp (Penaeus monodon) - Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
Ants (Leptothorax acervorum) - Hammond et al., 1999: the ants dataset [16] is from a haplodiploid species. The data consist of 377 diploid worker ants
Real Data
Random Data Generation
Generate F females and M males (F=M=5, 10, 15)
Each with l loci (l=2, 4, 6)
Each locus with a alleles
a[uniform]=5, 10, 15; a[nonuniform]=4 (12-4-1-1)
Generate f families
f[uniform]=2, 5, 10; f[nonuniform]=5
For each family select female+male uniformly at random
For each parent pair generate o offspring
o[uniform]=2, 5, 10; o[nonuniform]=25-10-10-4-1
For each offspring for each locus choose allele outcome uniformly at random
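The uniform case of the generation protocol can be sketched as below. This is a minimal illustration with hypothetical names and defaults, not the generator used for the reported results; in particular it does not prevent two families from drawing the same parent pair, and it does not implement the nonuniform settings.

```python
import random


def random_parent(n_loci, n_alleles, rng):
    """A parent is a list of (allele, allele) genotypes, one per locus, drawn uniformly."""
    alleles = list(range(1, n_alleles + 1))
    return [tuple(rng.choices(alleles, k=2)) for _ in range(n_loci)]


def simulate_cohort(n_females=5, n_males=5, n_loci=4, n_alleles=10,
                    n_families=5, offspring_per_family=5, seed=0):
    """Return (cohort, truth): offspring genotypes and the family index of each offspring."""
    rng = random.Random(seed)
    females = [random_parent(n_loci, n_alleles, rng) for _ in range(n_females)]
    males = [random_parent(n_loci, n_alleles, rng) for _ in range(n_males)]
    cohort, truth = [], []
    for family in range(n_families):
        mother = rng.choice(females)
        father = rng.choice(males)
        for _ in range(offspring_per_family):
            # At each locus the child gets one allele from each parent, chosen uniformly.
            child = [(rng.choice(father[locus]), rng.choice(mother[locus]))
                     for locus in range(n_loci)]
            cohort.append(child)
            truth.append(family)
    return cohort, truth
```

The `truth` list gives the known sibgroups against which a reconstruction can be compared.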
Results
2-Allele Min Set Cover
First combinatorial approach
Makes no assumptions other than parsimony
Works consistently and comparatively well
Sibling Reconstruction
Growing number of methods
Biologists need (one) reliable reconstruction
Genotyping errors
Answer: Consensus
Summary (Min Sib Groups)
Combine multiple solutions to a problem to generate one unified solution
C: S* → S
Based on Social Choice Theory
Commonly used where the real solution is not known, e.g. Phylogenetic Trees
Consensus Methods
Consensus...
S1 S2 Sk
S
Only Pareto Optimality and Anti-Pareto Optimality are enforced
All solutions must agree on equivalence
All disputed individuals go to singletons
Strict Consensus
Strict Consensus
5 Sibling Groups? When 3 can do?
S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}
S={{1,2},{3},{4},{5},{6,7}}
$\forall i:\ x \equiv_{S_i} y \iff x \equiv_S y$
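Reading the strict consensus as "two individuals stay together only if every solution puts them together", it can be sketched by grouping individuals on the tuple of group indices they receive across the solutions. The function name is hypothetical and the sketch assumes every solution partitions the same set of individuals.

```python
def strict_consensus(solutions):
    """Keep x and y together only if they share a group in every input solution."""
    individuals = sorted(set().union(*solutions[0]))

    def signature(x):
        # Index of the group containing x in each solution.
        return tuple(next(i for i, group in enumerate(solution) if x in group)
                     for solution in solutions)

    groups = {}
    for x in individuals:
        groups.setdefault(signature(x), set()).add(x)
    return sorted(groups.values(), key=min)


s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3, 4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(strict_consensus([s1, s2, s3]))
```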
Majority of solutions determine the final solution
Two individuals are together if a majority of solutions vote in their favour
Violates Transitivity: A ≡ B ∧ B ≡ C ⇒ A ≡ C
Majority Consensus
S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}
1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4
Voting Consensus
Majority under closure
Results in large monolithic groups
Majority Consensus
Voting Consensus 1 ≡ 5 ?
S1={{1,2,3},{4,5},{6,7}}
S2={{1,2,3,4},{5,6,7}}
S3={{1,2},{3,4,5},{6,7}}
S={{1,2,3,4,5},{6,7}}
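"Majority under closure" can be sketched by taking every pair that a strict majority of solutions places together and merging such pairs transitively with a union-find structure. The function name is hypothetical; the example reproduces the large group {1,2,3,4,5} from the slide, in which 1 and 5 end up together although no solution groups them.

```python
from itertools import combinations


def voting_consensus(solutions):
    """Transitive closure of the 'together in a majority of solutions' relation."""
    individuals = sorted(set().union(*solutions[0]))
    parent = {x: x for x in individuals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in combinations(individuals, 2):
        votes = sum(any(x in group and y in group for group in solution)
                    for solution in solutions)
        if 2 * votes > len(solutions):
            parent[find(x)] = find(y)

    groups = {}
    for x in individuals:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=min)


s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3, 4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(voting_consensus([s1, s2, s3]))
```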
Commonly used consensus methods don't work [AAAI-MPREF08]
Strict Consensus produces too many singletons
Majority violates transitivity AND doesn't work for error-tolerance
Consensus Methods
Algorithm
Compute a consensus solution S={g1,...,gk}
Search for a good solution near S
Distance-based Consensus
[Figure: solutions S1, S2, ..., Sk are combined by consensus into S; a search guided by fd and fq then yields Ss]
Needs
A Distance Function fd: S x S → R
A Quality Function fq: S → R
What is the Catch? [Sheikh et al. CSB 2008]
Optimization of fd, fq or an arbitrary linear combination is NP-Complete
Reduction from the 2-Allele Min Set Cover Problem
Distance-based Consensus
Algorithm
Compute a strict consensus
While distance is not too large
  Merge two nearest sibgroups
Quality: fq=n-|C|
Distance Function
fd(C,C')=cost of merging groups in C to obtain C'
A Greedy Approach
A Greedy Approach
S1={ {1,2,3}, {4,5}, {6,7} }
S2={ {1,2,3}, {4}, {5,6,7} }
S3={ {1,2}, {3,4,5}, {6,7} }
Strict Consensus S={ {1,2}, {3}, {4}, {5}, {6,7} }

        {1,2}  {3}   {4}   {5}   {6,7}
{1,2}   -      3.5   1.1   2.5   5.1
{3}     0.5    -     0.3   0.5   0.1
{4}     1.0    3.0   -     0.6   1.1
{5}     2.0    1.2   3.5   -     4.9
{6,7}   0.6    0.9   1.2   4.1   -

S={ {1,2}, {3,6,7}, {4}, {5} }

          {1,2}  {3,6,7}  {4}   {5}
{1,2}     -      3.5      1.1   2.5
{3,6,7}   1.7    -        3.1   2.2
{4}       1.0    3.0      -     0.6
{5}       2.0    1.2      3.5   -
Distance Function (sibgroup, sibgroup)
Cost of assigning all individuals
$f_d(P_i, P_j) = \min\left( \sum_{X \in P_i} f_{assign}(P_j, X),\ \sum_{X \in P_j} f_{assign}(P_i, X) \right)$
Distance Function (sibgroup, individual)
Benefit: alleles and allele pairs shared
Cost: minimum edit distance
$f_{assign}(P_i, X) = \begin{cases} \text{benefit} & \text{if } X \text{ can be a member of } P_i \\ \text{cost} & \text{if } X \text{ cannot be a member of } P_i \end{cases}$
Greedy Consensus
Algorithm
Compute a strict consensus
While distance is not too large
  Merge two sibgroups which will minimize the TOTAL merging cost
  Store the new merging cost in the merged set
Greedy Consensus
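The merge loop can be sketched as follows, starting from the groups of a strict consensus. This is an illustration under assumptions: `assign_cost(group, x)` stands in for the slide's assignment function and is supplied by the caller, `max_cost` is a hypothetical stopping parameter for "distance is not too large", and the bookkeeping of per-set merging costs is reduced to a running total.

```python
def greedy_consensus(groups, assign_cost, max_cost):
    """Repeatedly merge the cheapest pair of groups while the total cost stays within max_cost."""
    groups = [set(g) for g in groups]

    def merge_cost(a, b):
        # Cheaper of: assigning all of a into b, or all of b into a.
        return min(sum(assign_cost(b, x) for x in a),
                   sum(assign_cost(a, x) for x in b))

    total = 0.0
    while len(groups) > 1:
        cost, i, j = min((merge_cost(groups[i], groups[j]), i, j)
                         for i in range(len(groups))
                         for j in range(i + 1, len(groups)))
        if total + cost > max_cost:
            break
        total += cost
        groups[i] |= groups[j]
        del groups[j]  # j > i, so index i is unaffected
    return groups, total
```

A toy cost such as `lambda group, x: min(abs(x - y) for y in group)` is enough to exercise the loop; a real cost would be built from shared alleles and edit distance as described on the slide.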
Error-Tolerant Approach
[Figure: Locus 1, Locus 2, Locus 3, ..., Locus k → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]
Results
>90% accuracy for all real data
A consensus method CANNOT be all of these [Arrow 1963, Mirkin 1975]
Fair
Independent
Pareto Optimal
Biologically [AAAI-MPREF 2008]
The subset of individuals chosen will impact the consensus considerably
Impossibility Result
Parametric
Does NOT outperform other algorithms on:
Biological data
Smaller families
High Allele Frequencies
Problems
Change costs to average per locus costs
Compare max group error on per locus basis
Treat cost and benefit independently
In order to qualify a merge:
Cost <= maxcost
Benefit >= minbenefit
Benefit = max benefit among possible merges
Auto Greedy Consensus
Results
First consensus method for Sibship Reconstruction
Majority won't work
First combinatorial approach for Error-Tolerant Sibship Reconstruction
Fewer Assumptions
More Efficient
Distance-based Consensus is NP-Hard
New non-parametric consensus
Summary (Consensus)
Min number of sibgroups is just ONE way to interpret parsimony
Alternate Objectives
Sibship that minimizes number of parents
Very Hard! Connection to Raz's Parallel Repetition Theorem
Sibship that minimizes number of matings
Sibship that maximizes family size
Sibship that tries to satisfy uniform allele distributions
Parsimony: Alternate Objectives
Problem Statement:
Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized
Observations and Challenges:
MinParents: intractable, inapproximable
Reduction from Min-Rep Problem (Raz's Parallel Repetition Theorem)
There may be $O(2^{|loci|})$ potential parents for a sibgroup
Self-mating (plants) may or may not be allowed
Parsimony: Minimize Parents
Not Necessarily…
Is MinParents = MinSibgroups?
Parent  Genotype at Locus 1
P1      1/10
P2      2/20
P3      3/30
P4      4/40
P5      3/50

Child  Parents  Genotype
A      P1-P2    1/20
B      P1-P2    2/10
C      P1-P2    10/20
D      P1-P3    1/30
E      P1-P3    10/3
F      P4-P2    4/20
G      P4-P2    40/2
H      P4-P3    40/3
I      P4-P3    4/30
J      P4-P5    4/50
1. Generate M, a set of covering groups
2. Cover a subset S of covering groups
3. For each group x in S
   1. Generate Parent Pairs for x
   2. Insert parent vertices into graph G (if needed)
   3. Connect the parents in each parent pair
4. Cover the minimum vertices necessary to (doubly) cover all the individuals
Min Parents Meta Approach
M={{1,2},{3,6,7},{3,5},{2,4},{1,6},{2,5},{6,7}}
S={{1,2,4},{3,5},{6,7}}
X={3,5}
Parent pairs for X: {F=5/10, M=2/20}, {F=5/20, M=2/10}
[Figure: graph with parent vertices 5/10, 2/20, 5/20, 2/10]
Different approaches to selecting a subset of maximal feasible groups
Greedy Min Set Cover
K-Greedy Min Set Covers
All Sets! (Nearing optimality)
Forget maximal feasible sibling groups
Generate K random minimal feasible sibling reconstructions
Covering Groups
The number of generated parents is just too many!
Mine Association Rules across loci {A,B}locus1 => {C,D}locus2
Use Association Rules to filter parents {A,B}locus1 => {C,D}locus2 OR {C’,D’}locus2
Polygamy => High Confidence Association Rules
No Polygamy => Min Parents = Min Groups
If self-mating is not allowed, odd-cycles must be disallowed
Generating Parents
Heuristic
While not all vertices are covered
  Select the vertex that will cover the most uncovered individuals
MIP Formulation
Covering Vertices
Results
Legend:
M1: k-greedy cover with optimal graph cover
M2: greedy set cover with optimal graph cover
M3: Randomized cover with optimal graph cover
M4: k-greedy with graph heuristics
M5: greedy set cover with graph heuristic
The problem stays hard even if we know all the parents and just need to find the minimum set of parents to choose! The reduction is from a version of the Parallel Repetition theorem.
But, what is the parallel repetition theorem?
Complexity Results
2-prover 1-round proof system
label cover problem for bipartite graphs
small inapproximability
boosting (Raz's parallel repetition theorem)
parallel repetition of 2-prover 1-round proof system
label cover problem for some kind of "graph product" for bipartite graphs
larger inapproximability
Unique games conjecture
restriction restriction
We need some version of Raz’s parallel repetition theorem that is suitable for us
Fortunately, the following two papers helped:
U. Feige, A threshold of ln n for approximating set-cover, Journal of the ACM, 1998
G. Kortsarz, R. Krauthgamer and J. R. Lee, Hardness of Approximating Vertex-Connectivity Network Design Problems, SIAM J. of Computing, 2004
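Inapproximability for MINREP (Raz's parallel repetition theorem)
Let $L \in NP$ and x be an input instance of L
Reduction $L \to$ MINREP in $O(n^{\mathrm{polylog}(n)})$ time
$x \in L \Rightarrow OPT \le \alpha + \beta$
$x \notin L \Rightarrow OPT \ge (\alpha + \beta) \cdot 2^{\log^{1-\varepsilon}(|A| + |B|)}$
$0 < \varepsilon < 1$ is any constant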
MINREP (minimum representative) problem
Input graph G on node sets A and B
A: α partitions A1, A2, …, Aα, all of equal size
B: β partitions B1, B2, B3, …, Bβ, all of equal size
Associated "super"-graph H
A "super"-nodes: A1, A2, …, Aα
B "super"-nodes: B1, B2, B3, …, Bβ
(A1,B2) ∈ H if ∃ u ∈ A1 and v ∈ B2 such that (u,v) ∈ G
In this case, edge (u,v) ∈ G is a witness of the super-edge (A1,B2) ∈ H
MINREP goal
Valid solution: A' ⊆ A and B' ⊆ B such that A' ∪ B' contains a witness for every super-edge
Objective: minimize the size of the solution |A' ∪ B'|
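The validity condition can be checked mechanically: compute the super-edges induced by G, then the super-edges that still have a witness with both endpoints in the chosen node set. The sketch below uses hypothetical names and an assumed input format (an edge list plus a node-to-partition map); it only verifies a solution and does not search for a minimum one.

```python
def is_valid_minrep_solution(edges, part_of, chosen):
    """True if the chosen nodes contain a witness edge for every super-edge.

    edges: iterable of (u, v) with u in A and v in B
    part_of: maps each node to the name of its partition (its super-node)
    chosen: the set A' ∪ B'
    """
    super_edges = {(part_of[u], part_of[v]) for u, v in edges}
    witnessed = {(part_of[u], part_of[v])
                 for u, v in edges if u in chosen and v in chosen}
    return witnessed == super_edges


edges = [("a1", "b1"), ("a2", "b2"), ("a3", "b1")]
part_of = {"a1": "A1", "a2": "A1", "a3": "A2", "b1": "B1", "b2": "B2"}
print(is_valid_minrep_solution(edges, part_of, {"a1", "a3", "b1"}))
print(is_valid_minrep_solution(edges, part_of, {"a1", "a2", "a3", "b1", "b2"}))
```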
Informally,
given a set of children
given a candidate set of parents
assuming we believe in the Mendelian inheritance law
assuming that the parents tried to be as monogamous as possible
can we partition the children into a set of full siblings
(full sibling group has the same pair of parents)
Can reduce MINREP to show that this problem is hard
Parsimony-based combinatorial optimization works best with the least amount of information
Parsimony-based combinatorial optimization is NP-hard and inapproximable
First combinatorial approach for Error-Tolerant Sibship Reconstruction
Fewer Assumptions
More Efficient
Other parsimony-based optimization objectives are possible
Min Parents is interesting and hard!
Conclusions
Better heuristics for Min Parents?
Other parsimony objectives
Further analysis of when objectives give same results
Future Work
Sibship Reconstruction Project
Mary Ashley, UIC
W. Art Chaovalitwongse, Rutgers
Isabel Caballero, UIC
Ashfaq Khokhar, UIC
Tanya Berger-Wolf, UIC
Priya Govindan, UIC
Bhaskar DasGupta, UIC
Chun-An (Joe) Chou, Rutgers
Thank You!!
Questions?