Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes
from the 1000 Genomes Project
Presented by Qing Duan
Dr. Yun Li group
UNC at Chapel Hill
09-13-2012
Outline
• Imputation
  – Study samples: WHI African American and Hispanic samples
  – Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release)
    • Number of markers in reference haplotypes: ~38M
• Post-imputation quality assessment
  – Evaluation of imputation quality by comparing with actual genotypes from Metabochip genotyping
  – Estimation of the total number of QC+ markers and the number of QC+ indels
QC on WHI Genotypes
• QC was performed within the African American and Hispanic samples separately for autosomes and chromosome X.
• We excluded markers meeting any of the following criteria:
  – Hardy-Weinberg equilibrium: HW p-value < 1e-6
  – Genotype completeness: < 90%
  – Minor allele frequency
    • Chromosomes 1-22: MAF < 1%
    • Chromosome X: singleton or monomorphic markers
With thanks to Eric Yi Liu
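The exclusion criteria above can be sketched as a single predicate. This is a minimal illustration, not the pipeline used for WHI: the `MarkerStats` record and its field names are hypothetical, the per-marker statistics are assumed to be precomputed, and "singleton or monomorphic" is interpreted here as a minor allele count of 1 or 0.

```python
from dataclasses import dataclass


@dataclass
class MarkerStats:
    """Hypothetical per-marker summary; all fields are assumed precomputed."""
    name: str
    chrom: str               # "1".."22" or "X"
    hwe_p: float             # Hardy-Weinberg equilibrium test p-value
    call_rate: float         # fraction of samples with a non-missing genotype
    maf: float               # minor allele frequency
    minor_allele_count: int  # number of copies of the minor allele


def passes_qc(m: MarkerStats) -> bool:
    """Return True if the marker survives the exclusion criteria."""
    if m.hwe_p < 1e-6:       # Hardy-Weinberg filter
        return False
    if m.call_rate < 0.90:   # genotype completeness filter
        return False
    if m.chrom == "X":
        # Assumption: monomorphic = count 0, singleton = count 1.
        return m.minor_allele_count > 1
    return m.maf >= 0.01     # autosomes: exclude MAF < 1%


markers = [
    MarkerStats("m1", "1", 0.5, 0.99, 0.20, 3000),
    MarkerStats("m2", "1", 1e-8, 0.99, 0.20, 3000),  # fails HWE
    MarkerStats("m3", "X", 0.5, 0.99, 0.0001, 1),    # singleton on chrX
]
qc_plus = [m.name for m in markers if passes_qc(m)]
```

Keeping the chromosome X rule on allele counts rather than on a frequency cutoff mirrors the slide, where only singleton and monomorphic markers are dropped on chromosome X.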
Summary of samples and GWAS QC+ markers
• Number of individuals
  – WHI_AA: 8,421 / WHI_HA: 3,587
• Number of markers

            Chr1-22                ChrX
            WHI_AA     WHI_HA      WHI_AA    WHI_HA
  Total          860,510                36,889
  QC+       829,370    834,826     35,411    35,035

Note: imputation of chromosome X is still in progress, so the chromosome X results will be available soon.
Reference Haplotypes
• The complete set of 1000G Phase I Integrated Release version 3 haplotypes in VCF format (March 2012 release)
  – A total of 2,184 haplotypes
  – A total of ~38M markers
    • including singleton and monomorphic sites
  – About 1.4M markers are short indels and large deletions; the rest are SNPs.
Note on reference haplotypes
• A more recent reduced set of reference haplotypes, with singleton and monomorphic markers removed, is also available.
  – Number of markers: ~30M
  – Every marker in the reduced set is included in the complete set of reference haplotypes.
  – We expect singleton and monomorphic markers to have little influence on imputation quality, because:
    • Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included
    • Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
Two-step genotype imputation -- Procedure
• Step 1: Pre-phasing (MaCH1)
  – WHI African American and Hispanic samples were phased separately.
• Step 2: Genotype imputation (minimac)
  – WHI African American and Hispanic samples were imputed separately.
  – Haplotype-to-haplotype imputation: the haplotypes pre-phased in step 1 are imputed using the complete set of reference haplotypes from the 1000 Genomes Project.
Two-step genotype imputation -- Computational costs
• Phasing and imputation strategy
  – Split chromosomes into segments
  – Phase / impute each segment
  – Ligate segments back to chromosomes

Computational costs                       WHI_AA                    WHI_HA
Phasing
  Split strategy (sample genotypes)       Core region: 3000 markers; flanking: 500 markers each (same for both)
  # segments after splitting              277                       278
  Median run time                         ~245 hours (~10 days)     ~63 hours (~3 days)
Imputation
  Split strategy (reference haplotypes)   Core region: 5 Mb;        Core region: 20 Mb;
                                          flanking: 500 Kb each     flanking: 500 Kb each
  # segments after splitting              520                       150
  Median run time                         ~41 hours (~2 days)       ~71 hours (~3 days)
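The split step can be illustrated with a short sketch. It assumes a chromosome is represented simply by its number of markers and that consecutive core regions tile the chromosome, each extended by a flanking region on both sides; the function name and the tuple layout are hypothetical, and the actual splitting code used for WHI is not shown in these slides.

```python
def split_into_segments(n_markers, core=3000, flank=500):
    """Split marker indices [0, n_markers) into overlapping segments.

    Returns (seg_start, core_start, core_end, seg_end) tuples with half-open
    index ranges. The core regions tile the chromosome without overlap; the
    flanks overlap neighbouring segments and are clipped at chromosome ends.
    """
    if n_markers < 0 or core <= 0 or flank < 0:
        raise ValueError("n_markers and flank must be >= 0 and core must be > 0")
    segments = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        seg_start = max(0, core_start - flank)
        seg_end = min(n_markers, core_end + flank)
        segments.append((seg_start, core_start, core_end, seg_end))
    return segments


# Example: a hypothetical chromosome with 7,000 genotyped markers.
segments = split_into_segments(7000)
```

After each segment is phased or imputed, only the core part of each segment would be kept when ligating, so the flanks serve purely as context at segment boundaries. The same logic applies to the imputation step if positions are measured in base pairs (for example a 5 Mb core with 500 Kb flanks) instead of marker counts.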
Summary of imputation results -- Before QC

                                      WHI_AA     WHI_HA
Number of individuals                 8,421      3,587
Total number of imputed markers           38,050,692
Number of imputed indels                   1,380,758
File size (all files gz compressed)   170 G      71 G

Note: Markers with a missing quality filter in the 1000G reference haplotypes are excluded from imputation. We found that all excluded markers are of type “MERGED_DEL”.
Evaluation of imputation quality -- Introduction
• Main idea
  – Compare imputed dosages with actual genotypes
• Quality metric
  – Dosage r2: squared correlation coefficient between imputed dosages (continuous values between 0 and 2) and actual genotypes (coded as 0, 1 and 2)
    • True imputation accuracy (range 0 to 1)
  – Rsq: estimated dosage r2
    • Estimated imputation accuracy
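As a concrete illustration of the dosage r2 metric, the sketch below computes the squared Pearson correlation between imputed dosages and actual genotypes for one marker. The function name is hypothetical, and returning NaN when either vector has no variation is a choice made for this example, not something stated in the slides.

```python
import math


def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2) and
    actual genotypes (0, 1, 2) for a single marker across individuals."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return math.nan  # correlation is undefined without variation
    return (cov * cov) / (var_d * var_g)


# Example: dosages that track the true genotypes closely.
r2 = dosage_r2([0.1, 1.0, 1.9, 1.1, 0.0], [0, 1, 2, 1, 0])
```

Rsq, by contrast, is estimated from the imputation output alone, so it can be obtained for markers without actual genotypes; dosage r2 requires genotyped data such as the Metabochip calls.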
Evaluation of imputation quality -- Study design
[Figure: a matrix of imputed dosages is compared with a matrix of actual genotypes (Metabochip) to calculate dosage r2]
• Individuals used in evaluation
  • 1,962 WHI African American samples
• Markers used in evaluation
  • Overlapping markers between 1000G and Metabochip but not on Affymetrix 6.0 (all 22 autosomes)
  • Minor allele frequency (MAF) is defined within the 1,962 individuals
Estimation of imputation quality -- Results
• We recommend Rsq QC thresholds of 0.7, 0.6 and 0.3 for the MAF categories 0.1~0.5%, 0.5~1% and >1%, respectively
  – The thresholds are chosen such that an average Rsq greater than 0.8 is achieved in each MAF category (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
• Estimation based on imputation quality assessment
  – Total number of markers passing QC
  – Total number of indels passing QC
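The MAF-dependent thresholds can be expressed as a small lookup. This is a sketch under stated assumptions: the function names are hypothetical, the handling of the exact category boundaries (0.5% and 1%) is a guess, a marker is treated as passing when its Rsq is at least the threshold, and markers with MAF below 0.1% are treated as failing because the slides give no threshold for them.

```python
def rsq_threshold(maf):
    """Recommended Rsq cutoff for a MAF given as a fraction (0.01 = 1%).

    Returns None for MAF below 0.1%, where no threshold is recommended.
    Boundary handling at 0.5% and 1% is an assumption of this sketch.
    """
    if maf > 0.01:
        return 0.3
    if maf >= 0.005:
        return 0.6
    if maf >= 0.001:
        return 0.7
    return None


def passes_rsq_qc(maf, rsq):
    """True if the marker's Rsq meets the cutoff of its MAF category."""
    threshold = rsq_threshold(maf)
    if threshold is None:
        return False  # assumption: no recommended cutoff means not QC+
    return rsq >= threshold


# Example: the same Rsq of 0.5 passes for a common marker
# but fails for a low-frequency one.
common_ok = passes_rsq_qc(0.20, 0.5)
rare_ok = passes_rsq_qc(0.002, 0.5)
```

Stricter cutoffs for rarer markers reflect that a given Rsq tends to correspond to lower true accuracy when the minor allele is rare.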
Estimation of imputation quality -- Summary
• The values are estimated because:
  – Estimated Rsq cutoffs
    • Evaluation is based on markers on Metabochip
  – Estimated MAF
    • MAF of imputed markers is calculated based on imputed dosages
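One common way to estimate MAF from imputed dosages is sketched below. It assumes each dosage is the expected number of copies (0 to 2) of one fixed allele per individual, so the mean dosage divided by two estimates that allele's frequency; the function name is hypothetical and the slides do not show the exact calculation used.

```python
def maf_from_dosages(dosages):
    """Estimate minor allele frequency from per-individual dosages (0..2)."""
    if not dosages:
        raise ValueError("at least one dosage is required")
    # Frequency of the counted allele: total expected copies / total alleles.
    freq = sum(dosages) / (2 * len(dosages))
    # The minor allele is whichever of the two alleles is rarer.
    return min(freq, 1.0 - freq)


# Example: four individuals, counted allele mostly absent.
maf = maf_from_dosages([0.0, 0.0, 0.2, 0.0])
```

Because the dosages are themselves estimates, a MAF computed this way is also an estimate, which is why the MAF category assigned to each imputed marker is approximate.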
Estimation based on imputation quality assessment -- Note
• The values are estimated because:
  – Estimated QC thresholds for WHI Hispanic samples
    • We assumed that WHI Hispanics have Rsq cutoffs in each MAF category similar to those of WHI African Americans
    • We will do a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data
  – Estimated QC thresholds for indels
    • Rsq cutoffs are set based on the evaluation of SNPs. We assumed that indels have Rsq cutoffs in each MAF category similar to those of SNPs
Estimation based on imputation quality assessment -- Note (cont’d)
Estimation based on imputation quality assessment -- Total number of markers passing QC
Note: Markers include both SNPs and indels
Estimation based on imputation quality assessment -- Number of indels passing QC
Summary
• We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women’s Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release)
• Summary of imputation results before and after QC

                                      WHI_AA                     WHI_HA
                                      Before QC    After QC      Before QC    After QC
Number of individuals                 8,421        8,421         3,587        3,587
Total number of markers               38,050,692   18,940,103    38,050,692   15,214,231
Number of indels                      1,380,758    1,219,538     1,380,758    1,126,704
File size (all files gz compressed)   170 G        102 G         71 G         33 G
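To put the before/after counts in proportion, the fraction of markers retained after QC can be computed directly from the table. The sketch below only restates the numbers reported above; the dictionary layout and function name are illustrative.

```python
def retained_fraction(before, after):
    """Fraction of markers that remain after QC."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return after / before


# (before QC, after QC) counts copied from the summary table.
counts = {
    ("WHI_AA", "markers"): (38_050_692, 18_940_103),
    ("WHI_AA", "indels"): (1_380_758, 1_219_538),
    ("WHI_HA", "markers"): (38_050_692, 15_214_231),
    ("WHI_HA", "indels"): (1_380_758, 1_126_704),
}

retained = {key: retained_fraction(b, a) for key, (b, a) in counts.items()}
for (cohort, kind), frac in retained.items():
    print(f"{cohort} {kind}: {frac:.1%} retained after QC")
```

On these numbers, roughly half of all imputed markers are retained in WHI_AA and about 40% in WHI_HA. The After QC values remain estimates for the reasons given in the notes on Rsq cutoffs and MAF.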