A coalescent computational platform to predict strength of association for clinical samples
![Page 1: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/1.jpg)
A coalescent computational platform to predict strength of association for clinical samples
Gabor T. Marth
Department of Biology, Boston [email protected]
Genomic studies and the HapMap
March 15-18, 2005
Oxford, United Kingdom
![Page 2: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/2.jpg)
Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
[Figure panels: CEPH European samples; Yoruban samples]
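As a concrete reading of "strength of allelic association", the plots later in this talk use r² between pairs of markers. The sketch below computes r² from phased haplotypes. It assumes haplotypes are encoded as strings of '0' and '1' (as in the "000" ... "111" labels on later slides); the function names are illustrative and not taken from the talk.

```python
from itertools import combinations


def r_squared(haplotypes, i, j):
    """r^2 between biallelic markers i and j.

    haplotypes: list of equal-length strings of '0'/'1' (assumed encoding).
    Returns None when either marker is monomorphic, where r^2 is undefined.
    """
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return None
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    return d * d / denom


def pairwise_r2(haplotypes):
    """r^2 for every marker pair (i, j) with i < j."""
    m = len(haplotypes[0])
    return {(i, j): r_squared(haplotypes, i, j)
            for i, j in combinations(range(m), 2)}
```

Two markers that always co-occur give r² = 1, and two statistically independent markers give r² = 0.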
![Page 3: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/3.jpg)
Across samples from a single population?
(random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)
![Page 4: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/4.jpg)
Possible consequence for marker performance
Markers selected based on the allele structure of the HapMap reference samples…
… may not work well in another set of samples such as those used for a clinical study.
![Page 5: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/5.jpg)
How to assess sample-to-sample variability?
1. Understand fundamental characteristics of a given genome region, e.g. estimate the local recombination rate from the data (McVean et al. Science 2004)
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly
3. A desirable alternative would be to generate such additional sets by computational means
![Page 6: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/6.jpg)
Towards a marker selection tool
1. select markers (tag SNPs) with standard methods
2. generate computational samples
3. test the performance of markers across consecutive sets of computational samples
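The three steps can be sketched as follows. The talk does not say which "standard method" is used for tag selection, so the greedy r²-threshold tagger below is a stand-in chosen for illustration; the 0.8 threshold, the '0'/'1' string encoding, and all function names are assumptions.

```python
def r2(haps, i, j):
    """r^2 between markers i and j; treated as 0.0 if either is monomorphic."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else (p_ij - p_i * p_j) ** 2 / denom


def greedy_tags(haps, threshold=0.8):
    """Step 1: pick tags on the reference set until every marker is captured."""
    m = len(haps[0])
    covers = {i: {j for j in range(m) if j == i or r2(haps, i, j) >= threshold}
              for i in range(m)}
    untagged, tags = set(range(m)), []
    while untagged:
        # Take the marker that captures the most still-untagged markers;
        # ties go to the lowest index.
        best = max(range(m), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return tags


def fraction_captured(haps, tags, threshold=0.8):
    """Step 3: fraction of markers captured by the tags in another sample set."""
    m = len(haps[0])
    captured = sum(any(j == t or r2(haps, t, j) >= threshold for t in tags)
                   for j in range(m))
    return captured / m


reference = ["000", "001", "110", "111"]   # markers 0 and 1 fully associated
other_set = ["000", "010", "100", "111"]   # association differs in this set
tags = greedy_tags(reference)
print(tags, fraction_captured(reference, tags), fraction_captured(other_set, tags))
```

In step 2 the second set would come from the computational samples; here it is a hand-written toy set, just to show tags chosen on one set capturing fewer markers in another.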
![Page 7: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/7.jpg)
Generating additional computational haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This models the two sets as "related" to each other, being drawn from a single population.
2. Enforce data relevance by requiring that the first set reproduce the observed haplotype structure of the HapMap reference samples. Calculate the "degree of relevance" as the data likelihood (the probability that the genealogy produces the observed haplotypes).
3. Use the second haplotype set, induced by the same mutations, as our computational samples.
4. In subsequent statistics, weight each such set in proportion to the data likelihood calculated in step 2.
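The weighting in the last step can be illustrated with a short sketch. It assumes each computational set comes with a log-likelihood from the data-relevance step and that the statistic of interest (for example r² for one marker pair) has already been computed per set; the function names and the use of log-likelihoods are assumptions made for numerical convenience, not details from the talk.

```python
import math


def likelihood_weights(log_likelihoods):
    """Normalized weights proportional to exp(log-likelihood)."""
    top = max(log_likelihoods)          # subtract the max to avoid underflow
    raw = [math.exp(l - top) for l in log_likelihoods]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mean(values, weights):
    """Likelihood-weighted average of a per-set statistic."""
    return sum(v * w for v, w in zip(values, weights))


# Two hypothetical sets whose data likelihoods are in a 1:3 ratio.
weights = likelihood_weights([math.log(1.0), math.log(3.0)])
print(weights, weighted_mean([0.2, 0.6], weights))
```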
![Page 8: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/8.jpg)
Generating computational samples
Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low, even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem.
We propose a method to generate approximate M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).
![Page 9: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/9.jpg)
Approximating M-site haplotypes as composites of overlapping K-site haplotypes
1. generate K-site sets
2. build M-site composites
![Page 10: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/10.jpg)
Piecing together neighboring K-site sets
[Figure: two histograms of haplotype counts (y-axis 0 to 20) over the 3-site haplotypes "000", "001", "010", "011", "100", "101", "110", "111", one for each of two neighboring K-site sets. Haplotype orderings shown: 000 100 001 101 010 110 011 111 and 000 001 010 011 100 101 110 111]
The hope is that the constraint at the overlapping markers preserves long-range marker association.
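One way to picture the piecing-together step is sketched below. The slides do not spell out the joining rule, so this is an assumption: neighboring K-site sets overlap by K-1 markers, and each partial composite is extended with a randomly chosen, not yet used haplotype from the next set that agrees with it at the overlapping markers. The function name and the '0'/'1' string encoding are illustrative.

```python
import random
from collections import defaultdict


def build_composites(k_site_sets, k, rng=None):
    """Join consecutive K-site haplotype sets that overlap by K-1 markers.

    k_site_sets: list of sets; each set is a list of N strings of length k.
    Returns N composite haplotypes of length k + len(k_site_sets) - 1.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that neighbors overlap")
    rng = rng or random.Random(0)
    overlap = k - 1
    composites = list(k_site_sets[0])
    for next_set in k_site_sets[1:]:
        # Pool the next set's haplotypes by their first K-1 alleles.
        pools = defaultdict(list)
        for hap in next_set:
            pools[hap[:overlap]].append(hap)
        for pool in pools.values():
            rng.shuffle(pool)
        extended = []
        for comp in composites:
            pool = pools.get(comp[-overlap:])
            if not pool:
                raise ValueError("neighboring sets disagree at the overlapping markers")
            # Append only the one new allele contributed by the next set.
            extended.append(comp + pool.pop()[overlap:])
        composites = extended
    return composites


first = ["000", "011", "110", "111"]    # markers 1-3
second = ["001", "110", "101", "111"]   # markers 2-4
print(build_composites([first, second], k=3))
```

Each composite reproduces both K-site sets exactly, but any association between markers more than K-1 apart is carried only through the overlap, which is why long-range association can be lost.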
![Page 11: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/11.jpg)
Building composite haplotypes
[Figure: twelve histograms of haplotype counts (y-axis 0 to 20) over the 3-site haplotypes "000", "001", "010", "011", "100", "101", "110", "111"]
![Page 12: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/12.jpg)
Initial results: 3-site composite haplotypes
a typical 3-site composite
30 CEPH HapMap reference individuals (60 chr)
![Page 13: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/13.jpg)
3-site composite vs. data
[Figure: scatter plot of r² (3-site composite) against r² (data), both axes 0 to 1]
![Page 14: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/14.jpg)
3-site composites: the “best case”
[Figure: scatter plot of r² ("exact" 3-site composite) against r² (data), both axes 0 to 1, with "short-range" and "long-range" marker pairs labeled]
The "best-case" 3-site scenario: composite of exact 3-site sub-haplotypes
![Page 15: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/15.jpg)
Variability across sets
The purpose of the composite haplotype sets is to model sample variance across consecutive data sets.
But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.
![Page 16: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/16.jpg)
4-site composite haplotypes
[Figure: a 4-site composite; scatter plot of r² (4-site composite #2) against r² (data), both axes 0 to 1]
![Page 17: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/17.jpg)
“Best-case” 4 site composites
Composite of exact 4-site sub-haplotypes
[Figure: scatter plot of r² ("exact" 4-site composite) against r² (data), both axes 0 to 1]
![Page 18: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/18.jpg)
Variability across 4-site composites
![Page 19: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/19.jpg)
Variability across 4-site composites
[Figure: two scatter plots, both axes 0 to 1: r² (data #2) against r² (data #1), and r² (4-site composite #5) against r² (4-site composite #1)]
… is comparable to the variability across data sets.
![Page 20: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/20.jpg)
Technical/algorithmic improvements
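1. un-phased genotypes
2. markers with unknown ancestral state
3. dealing with uninformative markers
4. taking into account local recombination rate
[Illustrations: un-phased genotypes (AC)(CG)(AT)(CT) resolved into haplotypes A G A C and C C T T; a marker with alleles A and C whose ancestral state is unknown (?); binary haplotype data 01101000010101110111010000010101011110100001010111001101000010101110]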
![Page 21: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/21.jpg)
Software engineering aspects: efficiency
Currently, we run fresh Coalescent simulations at each K-site (several hours per region). This discards most Coalescent genealogies as irrelevant.
The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Haplotype sets resulting from matches can be loaded into, stored in, and retrieved from a database efficiently.
4 HapMap populations × 1 million K-sites × 1,000 comp. sets × 50 bytes ≈ 200 Gigabytes
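The storage estimate can be checked with a few lines of arithmetic. The factors are the ones quoted on the slide; the variable names are illustrative.

```python
populations = 4            # HapMap populations
k_sites = 1_000_000        # one K-site per genotyped SNP
sets_per_k_site = 1_000    # stored sets per K-site
bytes_per_set = 50

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
print(total_bytes)            # total size in bytes
print(total_bytes / 10**9)    # decimal gigabytes
print(total_bytes / 2**30)    # binary gibibytes
```

The product is 2 × 10^11 bytes, which is 200 GB in decimal units and a little under 187 GiB in binary units.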
![Page 22: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/22.jpg)
Acknowledgements
Eric Tsung
Aaron Quinlan
Ike Unsal
Eva Czabarka (Dept. Mathematics, William & Mary)
![Page 23: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/23.jpg)
Testing markers with composite sets
[Figure: two scatter plots, both axes 0 to 1: r² (4-site composite #1) against r² (data), and r² (4-site composite #2) against r² (data)]
![Page 24: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/24.jpg)
Using the HapMap
1. genotype a set of reference samples
2. compute strength of association
3. select a smaller set of markers that capture most of the information present in the complete set of markers
4. use these markers in clinical studies
![Page 25: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/25.jpg)
Allele structure varies among populations
CEPH European samples
Yoruban samples
![Page 26: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/26.jpg)
Data probability for composite haplotypes
(motivation from composite likelihood methods for recombination rate estimation e.g. by Hudson, Clark, Wall)
Pr(composite) = Pr(K-site1) × Pr(K-site1 ~ K-site2) × Pr(K-site2) × Pr(K-site2 ~ K-site3) × Pr(K-site3)
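The product above is most safely evaluated in log space, since each factor can be very small. The sketch below assumes the per-K-site probabilities Pr(K-site i) and the overlap terms Pr(K-site i ~ K-site i+1) are already available as numbers; how those terms are computed is not specified on the slide, and the function name is illustrative.

```python
import math


def composite_log_probability(k_site_probs, overlap_probs):
    """log Pr(composite) as the sum of logs of all factors in the product.

    k_site_probs:  [Pr(K-site 1), Pr(K-site 2), ...]
    overlap_probs: [Pr(K-site 1 ~ K-site 2), Pr(K-site 2 ~ K-site 3), ...]
    """
    if len(overlap_probs) != len(k_site_probs) - 1:
        raise ValueError("need exactly one overlap term per neighboring pair")
    return (sum(math.log(p) for p in k_site_probs)
            + sum(math.log(p) for p in overlap_probs))


# Hypothetical values for three K-sites and their two overlaps.
log_p = composite_log_probability([0.5, 0.25, 0.5], [0.5, 0.5])
print(log_p, math.exp(log_p))
```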
![Page 27: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/27.jpg)
Generating K-site haplotypes
[Figure: reference data and three histograms of haplotype counts (y-axis 0 to 20) over the 3-site haplotypes "000", "001", "010", "011", "100", "101", "110", "111"]
1 match per 100 to 10,000 Coalescent genealogies (K = 3, 4)
![Page 28: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/28.jpg)
Example: CFTR gene
Hinds et al. Science, 2005
![Page 29: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/29.jpg)
4-site composite haplotypes
4-site composite #1 4-site composite #2
HapMap data
![Page 30: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/30.jpg)
4-site composites vs. data
[Figure: two scatter plots, both axes 0 to 1: r² (4-site composite) against r² (data), and r² (4-site composite #2) against r² (data)]
![Page 31: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/31.jpg)
Why should this work?
Tease apart two questions: (1) to what degree K-site composites preserve long-range correlations between markers (really, the quality of the approximation), and (2) the variability across different sets (what we are interested in).
![Page 32: A coalescent computational platform to predict strength of association for clinical samples](https://reader036.fdocuments.net/reader036/viewer/2022062521/56814f53550346895dbcfb0d/html5/thumbnails/32.jpg)
Example: 4-site approximation
4-site composite #1 4-site composite #2
4-site composite #3 4-site composite #4