Ensemble‐based imputation for genomic selection: an application to Angus cattle

Chuanyu Sun, Xiao‐Lin Wu, Kent A. Weigel, Guilherme J.M. Rosa, Stewart Bauck, Brent W. Woodward, Robert D. Schnabel, Jeremy F. Taylor, Daniel Gianola

Introduction

Why imputation?

7K SNP → 50K SNP

Reduce genotyping costs

Improve accuracy of genomic selection (GS) relative to low-density SNP?

Introduction

Imputation principle

Imputation methods fall into two groups:

1) Family-based algorithms: use linkage and Mendelian segregation rules

2) Population-based algorithms: use linkage disequilibrium information between the missing SNP and the observed flanking SNPs

Introduction

Imputation software

• Beagle
• IMPUTE
• fastPHASE
• AlphaImpute
• findhap
• Fimpute
• PHASEBOOK
• CHROMIBD
• PLINK
• MACH

Introduction

Imputed genotypes may be inconsistent among different programs.

How to resolve such inconsistencies is another challenge in imputation.

Introduction

Ensemble learning algorithms:

Combine predictions from multiple models and yield classification results that are more robust than those of individual procedures

Machine learning:

Imputation of genotypes is a classification problem

Each imputation method is an “independent” classifier

AdaBoost:

One of the most widely used ensemble methods

Introduction

Objectives:

Investigate performance of an ensemble approach to imputation of high‐density genotypes.

Application to imputation of 50K genotypes from 7K genotypes in Angus cattle.

Data and Methods

50K genotypes: 3,058 animals, 49,083 SNP, with QC

Animals split into two groups: 2,281 animals and 777 animals

Imputation done chromosome by chromosome

777 animals: bootstrap, 50 replications

Data and Methods

Focused on three representative chromosomes only:

• Chromosome 1 (longest)
• Chromosome 16 (moderate size)
• Chromosome 28 (one of the shortest)

Six imputation software packages:

• Beagle
• IMPUTE
• fastPHASE1.4
• findhap
• AlphaImpute
• Fimpute

AdaBoost‐like ensemble algorithm

Data and Methods

AdaBoost-like ensemble algorithm

Let x = imputed genotypes, y = observed (“true”) SNP genotypes, and T = the number of independent classifiers

Initialize: each sample (animal) was initialized with an equal weight.

Training: For t = 1, 2, …, T classifiers

1. Call classifier t, which in turn generates hypothesis $h_t$

2. At a given SNP locus, calculate the error of $h_t$:

$$\varepsilon_t = \sum_{i=1}^{N} W_t(i)\, I\big(h_t(x_i) \neq y_i\big)$$

3. Set the weight of classifier t:

$$\alpha_t = \log\frac{1 - \varepsilon_t}{\varepsilon_t}$$

4. Update weight distribution $W_t$:

$$W_{t+1}(i) = W_t(i)\, \exp\big(\alpha_t\, I(h_t(x_i) \neq y_i)\big)$$
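A minimal Python sketch of the training loop above, for a single SNP locus. It assumes the standard AdaBoost form: a weighted error, a classifier weight of log((1 − error) / error), and exponential up-weighting of misclassified animals. The function name `train_ensemble`, the 0/1/2 genotype coding, the clipping of the error and the renormalisation of the weights are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def train_ensemble(predictions, y):
    """Return one voting weight (alpha) per classifier.

    predictions: (T, N) genotype calls (coded 0, 1, 2) from T classifiers,
                 listed in the order in which they are called.
    y:           (N,) observed ("true") genotypes for the same animals.
    """
    predictions = np.asarray(predictions)
    y = np.asarray(y)
    T, N = predictions.shape
    w = np.full(N, 1.0 / N)              # equal initial weight per animal
    alphas = np.zeros(T)
    for t in range(T):
        miss = predictions[t] != y       # I(h_t(x_i) != y_i)
        eps = np.sum(w * miss)           # weighted error of classifier t
        # Assumption: keep eps away from 0 and 1 so the log stays finite.
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alphas[t] = np.log((1 - eps) / eps)
        w = w * np.exp(alphas[t] * miss) # up-weight misclassified animals
        w = w / w.sum()                  # assumption: renormalise to sum to 1
    return alphas

# Hypothetical toy data: 8 animals, 2 classifiers.
y = np.array([0, 1, 2, 1, 0, 2, 1, 1])
h1 = y.copy(); h1[0] = 1                 # one wrong call
h2 = y.copy(); h2[1:4] = 0               # three wrong calls
print(train_ensemble(np.stack([h1, h2]), y))
```

Because the weights carried into round t depend on the classifiers called before it, the same set of classifiers can receive different voting weights when called in a different order.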

Data and Methods

AdaBoost-like ensemble algorithm

Test: In the test set, each “unknown” genotype is classified via so-called “weighted majority voting”.

1. Compute the total vote received by each genotype (class) $g_j$:

$$v_j = \sum_{t=1}^{T} \alpha_t\, I\big(h_t(x'_i) = g_j\big), \quad j = 1, 2, 3$$

where $\alpha_t$ is the weight of classifier t obtained in training and $x'_i$ is the imputed genotype of test animal i.

2. Assign the genotype (class) that receives the highest total vote as the final (“putatively true”) genotype.
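The weighted majority vote can be sketched in a few lines of Python. This is an illustration under assumptions: genotypes are coded 0, 1, 2, the classifier weights `alphas` are taken as given from training, and ties go to the first genotype in `genotypes` (the slides do not say how ties are handled). The name `weighted_vote` is hypothetical.

```python
import numpy as np

def weighted_vote(test_predictions, alphas, genotypes=(0, 1, 2)):
    """Final genotype call per test animal by weighted majority voting.

    test_predictions: (T, M) genotype calls from T classifiers for M animals.
    alphas:           (T,) voting weight of each classifier.
    """
    test_predictions = np.asarray(test_predictions)
    alphas = np.asarray(alphas, dtype=float)
    # votes[j, m] = sum over t of alpha_t * I(h_t(x_m) == g_j)
    votes = np.stack([
        (alphas[:, None] * (test_predictions == g)).sum(axis=0)
        for g in genotypes
    ])
    # Pick the genotype with the highest total vote (first one wins ties).
    return np.asarray(genotypes)[votes.argmax(axis=0)]

# Hypothetical example: 3 classifiers, 3 test animals.
calls = [[0, 1, 2],
         [0, 2, 2],
         [1, 2, 0]]
print(weighted_vote(calls, alphas=[2.0, 0.5, 0.6]))  # [0 1 2]
```

In the second animal, the first classifier alone (weight 2.0) outvotes the other two combined (0.5 + 0.6), which is how a highly weighted classifier can override a simple majority.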

Classifiers working together!!

Simulation:

Genotypes for 800 animals and one marker locus simulated: genotypes are “AA”, “AB” and “BB”.

Classifiers with different accuracy of imputation simulated

Final results obtained as:

$$v_j = \sum_{t=1}^{T} \alpha_t\, I\big(h_t(x'_i) = g_j\big), \quad j = 1, 2, 3$$
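A simulation along these lines can be sketched as follows. Only the 800 animals, the single locus and the three genotypes come from the slide; everything else is an assumption chosen for illustration: equal genotype frequencies, four classifiers with accuracies 0.90 to 0.75, independent errors, a 400/400 training/test split, and the standard AdaBoost weight formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
GENOS = np.array(["AA", "AB", "BB"])     # coded 0, 1, 2 below

def simulate_classifier(truth, accuracy, rng):
    """Copy the true genotypes, then make a wrong call with prob. 1 - accuracy."""
    calls = truth.copy()
    wrong = rng.random(truth.size) >= accuracy
    shift = rng.integers(1, 3, size=truth.size)   # 1 or 2: always a different class
    calls[wrong] = (truth[wrong] + shift[wrong]) % 3
    return calls

def adaboost_like(train_calls, y_train, test_calls):
    """Train voting weights on the training set and vote on the test set."""
    T, N = train_calls.shape
    w = np.full(N, 1.0 / N)                       # equal initial weights
    votes = np.zeros((3, test_calls.shape[1]))
    for t in range(T):
        miss = train_calls[t] != y_train
        eps = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)
        w *= np.exp(alpha * miss)                 # up-weight misclassified animals
        w /= w.sum()
        for j in range(3):                        # weighted vote for genotype j
            votes[j] += alpha * (test_calls[t] == j)
    return votes.argmax(axis=0)

truth = rng.integers(0, 3, size=800)              # 800 animals, one locus
accuracies = [0.90, 0.85, 0.80, 0.75]             # assumed classifier accuracies
calls = np.stack([simulate_classifier(truth, a, rng) for a in accuracies])

train, test = slice(0, 400), slice(400, 800)      # assumed split
ensemble = adaboost_like(calls[:, train], truth[train], calls[:, test])

individual_acc = (calls[:, test] == truth[test]).mean(axis=1)
ensemble_acc = (ensemble == truth[test]).mean()
print("individual:", individual_acc.round(3))
print("ensemble:  ", round(float(ensemble_acc), 3))
print("first calls:", GENOS[ensemble[:5]])
```

With independent errors, classifiers rarely agree on the same wrong genotype, so the vote tends to recover calls that any single classifier misses. Real imputation programs share information and are unlikely to err independently, which is why the slides put “independent” in quotation marks.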

Results: individual performance

Imputation accuracy by software package

Results: implementation

AdaBoost-like ensemble algorithm

Given six packages, there were 6! = 720 possible orderings (permutations), each defining a unique ensemble system

For each chromosome, there were 720 × 50 = 36,000 jobs

Distributed high-throughput computing was used on the University of Wisconsin Condor systems and the Open Science Grid
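The job count follows directly from enumerating classifier orderings. The short Python check below lists every ordering of the six packages and multiplies by the 50 bootstrap replications; the variable names are illustrative only.

```python
from itertools import permutations
from math import factorial

packages = ["Beagle", "IMPUTE", "fastPHASE", "findhap", "AlphaImpute", "Fimpute"]

# Each ordering of the six packages defines one ensemble system.
orders = list(permutations(packages))
n_bootstrap = 50

jobs_per_chromosome = len(orders) * n_bootstrap
print(len(orders), factorial(6))   # 720 720
print(jobs_per_chromosome)         # 36000
```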

Results: accuracy and uncertainty

AdaBoost-like ensemble algorithm

1 = “Beagle3.3”; 2 = “IMPUTE2.0”; 3 = “fastPHASE1.4”; 4 = “findhap version 2”; 5 = “AlphaImpute”; 6 = “Fimpute version 2”; 7 to 11 = five ensemble systems.

Results

AdaBoost‐like ensemble algorithm

Ensemble method better than any of the six packages.

Sequence of packages in voting affects imputation accuracy

Ensemble systems with the best accuracies had Beagle as first classifier, followed by one or two methods that utilized pedigree information.

Results

Application to Angus population: LOW DENSITY PANEL

Based on a 7K-genotype panel, medium-density (50K) genotypes involving 29 chromosomes were imputed for 3,078 animals using the six imputation packages and five ensemble systems.

All five ensembles had Beagle and AlphaImpute as the first two classifiers.

• Bgl‐Alp‐fhap‐Imp‐fPH‐Fimp
• Bgl‐Alp‐Fimp‐Imp‐fPH‐fhap
• Bgl‐Alp‐Fimp‐fPH‐Imp‐fhap
• Bgl‐Alp‐Imp‐fhap‐fPH‐Fimp
• Bgl‐Alp‐Imp‐Fimp‐fPH‐fhap

Results

[Figure: results by chromosome (1 to 29) for EnS1-5, Bgl, Imp, fPh, fhap, Alp and Fimp]

Conclusions

Ensemble methods resolve inconsistencies among programs and can further increase imputation accuracy

Bgl and Fimp were the most accurate software packages

Best ensembles: those with Bgl as first classifier, followed by one or two packages that use pedigree information during imputation.

Improvement depended on the order of classifiers in the ensemble systems.
