Short introduction to BLUPF90 family...

22/05/2014

Short introduction to BLUPF90 family programs

Ignacio Aguilar Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay [email protected]

BLUPF90 Family of Programs

• Developed by Ignacy Misztal and collaborators at the University of Georgia

• Collection of software for mixed model computations in animal breeding (also plant breeding, forest breeding, etc.)

BLUPF90 family programs

• Set of programs designed to:

– Help in teaching courses on mixed models

– Simplify programming using Fortran 90

– Provide a general program that supports models of different complexity:

• Linear and threshold-linear models with multiple correlated effects, multiple-trait animal models and dominance

• The general philosophy of the programs is described in:

– "Complex Model, More Data: Simple Programming?" Misztal, I. 1999. Interbull Bull.


• Consists of several programs:

– Estimation of variance components

• REMLF90, AIREMLF90, (thr)GIBBSxF90

– Solver of mixed model equations

• As before, plus

• BLUPF90

• BLUP90IOD (large-scale data)

– Approximation of accuracy

• ACCF90 (large-scale data)

• Support for genomic selection

http://nce.ads.uga.edu/wiki/doku.php


• All programs are controlled by the SAME parameter file.

• Extra options can be used to set non-default behaviour of each program

• Understanding the parameter file usually solves most problems





BLUPF90 parameter file

[Figure: layout of the parameter file; the random effect block is repeated for each random effect]

Data file

• Free format, i.e. at least one space separates columns

• TABs are not valid column separators

• Some programs (e.g. MS Excel) export flat files with TAB separators !!

• Only numbers: integers or reals

• For reals, the decimal separator is "." not ","

• A lone "." is not a missing value

• All effects need to be renumbered consecutively from 1 (see RENUMF90 later)
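The format rules above can be checked before running any program. The following is a minimal sketch, not part of the BLUPF90 family: the function name `check_data_lines` and the sample records are hypothetical, and it only flags TABs and fields that do not parse as numbers.

```python
def check_data_lines(lines):
    """Return (line_number, problem) pairs for a space-separated numeric data file."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if "\t" in line:
            problems.append((number, "TAB separator"))
        for field in line.split():
            try:
                float(field)  # integers and reals with "." as decimal separator
            except ValueError:
                problems.append((number, f"non-numeric field {field!r}"))
    return problems


# Hypothetical records: a clean line, a TAB, a decimal comma and a lone "."
sample = ["1 2 35.4", "2\t1 40.1", "3 1 38,2", "4 2 ."]
for number, problem in check_data_lines(sample):
    print(number, problem)
```

Running the check on the exported file before the analysis is cheaper than diagnosing a wrong number of records afterwards.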


Number of traits / effects

• No restriction on the number of traits or effects

• But memory requirements and computing time increase exponentially with them

Effects section

• As many rows as NUMBER_OF_EFFECTS

• In this section the model for each trait is defined

• Different models per trait are supported

• If an effect is missing for one trait, use 0

• Each row has as many columns as NUMBER_OF_TRAITS, followed by the number of levels and the type of effect

RANDOM_RESIDUAL VALUES

• This matrix should be a square matrix with dimension equal to NUMBER_OF_TRAITS

• Use zero (0.0) to indicate uncorrelated residual effects between traits

• E.g. for 3 traits:

43.1 0.0 0.0

0.0 5.1 3.2

0.0 3.2 10.3
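Starting (co)variance matrices should be symmetric and positive definite. A minimal sketch of such a check, written in plain Python and applied to the 3-trait example above; the function names are hypothetical and the Cholesky attempt is just one common way to test positive definiteness.

```python
def is_symmetric(m, tol=1e-12):
    """True if m is square and m[i][j] == m[j][i] within tol."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(i))


def is_positive_definite(m):
    """Attempt a Cholesky factorization of a symmetric matrix (lower triangle used)."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                chol[i][j] = s ** 0.5
            else:
                chol[i][j] = s / chol[j][j]
    return True


residual = [[43.1, 0.0, 0.0],
            [0.0, 5.1, 3.2],
            [0.0, 3.2, 10.3]]
print(is_symmetric(residual), is_positive_definite(residual))
```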

Random effect definition

• RANDOM_GROUP

– Number(s) of the effect(s) from the list of effects

– Correlated effects should be consecutive, e.g. maternal effects, random regression models

• RANDOM_TYPE

– diagonal, add_animal, add_sire, add_an_upg, add_an_upginb, user_file, user_file_i or par_domin

• FILE

– Pedigree file, parental dominance file or user file

• (CO)VARIANCES

– Square matrix with dimension equal to number_of_traits*number_of_correlated_effects

(CO)VARIANCES structure

• Assuming 3 traits (T1-T3) and 3 correlated effects (E1-E3), rows and columns are ordered by effect and, within effect, by trait:

         E1          E2          E3
         T1 T2 T3    T1 T2 T3    T1 T2 T3
E1  T1
    T2
    T3
…

RANDOM_TYPE

• diagonal

– For permanent environment effects; assumes no correlation between levels of the effect

• add_sire

– Creates a relationship matrix using sire and maternal grandsire

– Pedigree file:

• individual number, sire number, maternal grandsire number

• add_animal

– Creates a relationship matrix using sire and dam information

– Pedigree file:

• animal number, sire number, dam number

RANDOM_TYPE

• add_an_upg

– As before, but using rules for unknown parent groups (UPG)

– Pedigree file:

• animal number, sire number, dam number, parent code

• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals

• Parent code = 3 - number of known parents

– 1: both parents known

– 2: one parent known

– 3: both parents unknown

• add_an_upginb

– As before, but using rules for unknown parent groups and inbreeding

– Pedigree file:

• animal number, sire number, dam number, inb/upg code

• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals

• inb/upg code = 4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]

• ms (md) is 0 if the sire (dam) is known and 1 otherwise

• Fs (Fd) is the inbreeding coefficient of the sire (dam)
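The two codes can be computed directly from the definitions above. This is a sketch under stated assumptions: the function names are hypothetical, and whether the programs expect the inb/upg code rounded to an integer is not stated on the slide, so the raw value is returned.

```python
def parent_code(sire_known, dam_known):
    """Parent code for add_an_upg: 3 minus the number of known parents."""
    return 3 - (int(sire_known) + int(dam_known))


def inb_upg_code(sire_known, dam_known, f_sire=0.0, f_dam=0.0):
    """inb/upg code = 4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]."""
    ms = 0 if sire_known else 1  # 1 when the sire is unknown
    md = 0 if dam_known else 1   # 1 when the dam is unknown
    return 4000.0 / ((1 + md) * (1 - f_sire) + (1 + ms) * (1 - f_dam))


print(parent_code(True, True), inb_upg_code(True, True))      # non-inbred, both parents known
print(parent_code(False, False), inb_upg_code(False, False))  # both parents unknown
```

With non-inbred parents the code is 2000 when both are known and 1000 when both are unknown, which is a quick sanity check on a prepared pedigree file.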

RANDOM_TYPE

• user_file

– A matrix is read from a file; only the upper or lower triangle is stored

– Matrix file:

• row, col, value

• user_file_i

– As before, but the matrix will be inverted

• par_domin

– A parental dominance file created by program RENDOM

– File format:

• s-d s-sd s-dd ss-d ds-d ss-sd ss-dd ds-sd ds-dd code

Pedigree files

• As with data files, columns in pedigree files are separated by at least one SPACE !!

• TABs are not supported !!

• The order of columns depends on the type of random effect

• Duplicate pedigree records are not checked !!

• Identification numbers need to be coded sequentially from 1 to the maximum number of animals

• No particular order of records is required !!
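Because duplicates are not checked by the programs, a quick pre-check is worthwhile. A minimal sketch for an add_animal style pedigree (animal, sire, dam); the function name is hypothetical and unknown parent groups are deliberately not handled.

```python
def check_pedigree(rows):
    """rows: list of (animal, sire, dam) integer triples; returns a list of problems."""
    problems = []
    ids = [animal for animal, _sire, _dam in rows]
    unique_ids = set(ids)
    if len(ids) != len(unique_ids):
        problems.append("duplicated animal records")
    if sorted(unique_ids) != list(range(1, len(unique_ids) + 1)):
        problems.append("animal numbers are not consecutive from 1")
    return problems


# Record order does not matter, so an unsorted but complete pedigree passes
print(check_pedigree([(3, 1, 2), (1, 0, 0), (2, 0, 0)]))
print(check_pedigree([(1, 0, 0), (1, 0, 0), (4, 0, 0)]))
```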

Programs Options

• Program behavior can be modified by adding lines with OPTION at the end of the parameter file

• OPTION option_name x1 x2 …

• option_name: each program has its own set of options

• The number of optional parameters (x1, x2, …) that control the behavior depends on the option.

BLUPF90

• BLUPF90 computes generalized solutions by several methods:

– Preconditioned conjugate gradient (PCG): default iterative method, fast

– Successive over-relaxation (SOR): an iterative method based on Gauss-Seidel

– Direct solution using sparse Cholesky factorization (FSPAK): greater memory requirements

• The values of the solutions change between methods, but estimable functions should be the same

• Prediction error variances can be obtained using the sparse inverse (FSPAK)

BLUPF90 options

Example of parameter file BLUPF90

From blupf90.pdf documentation:


Parameter File Model


FAQ or frequent problems

• Wrong data file or pedigree file name !!

– The program does not stop if a file with the given name does not exist

– Check the output for the data file name and the number of records and pedigree records read

• Wrong column positions or formats for observations and effects

• Misspelled keywords

– The program may stop

• (Co)variance matrices not symmetric or not positive definite

– The program may not stop

• Large numbers (e.g. 305-day milk yield of 10,000 kg) combined with a large number of records in Gibbs sampling programs

– Scale down, e.g. 10,000 / 1,000 = 10

Data preparation and renumbering

• In general, data files and pedigree files can be created with any software (e.g. SAS, R, Python, etc.)

• But all cross-classified effects (including pedigree) need to be renumbered sequentially from 1 to the maximum number of levels of each effect.

• No alphanumeric columns

• Columns have to be separated by at least one space !!
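Renumbering by hand amounts to mapping each original level to a consecutive integer. A minimal sketch, assuming levels are numbered in order of first appearance; the function name and the herd labels are hypothetical, and the ordering used by RENUMF90 may differ.

```python
def renumber(values):
    """Map original levels to 1..n in order of first appearance."""
    table = {}
    codes = []
    for value in values:
        if value not in table:
            table[value] = len(table) + 1
        codes.append(table[value])
    return codes, table


herds = ["HERD_B", "HERD_A", "HERD_B", "HERD_C"]
codes, table = renumber(herds)
print(codes)  # [1, 2, 1, 3]
print(table)  # keep this cross-reference to translate solutions back
```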


RENUMF90

• A renumbering program for the BLUPF90 family of programs

• Supports:

– multiple traits

– different effects per trait

– alphanumeric and numeric fields

– unknown parent groups

– covariates for random regression models

• Provides data statistics

• Traces back the pedigree of individuals in the data file and performs comprehensive pedigree checking

• Creates files to be used by BLUPF90 family programs:

– renf90.par: parameter file

– renf90.dat: recoded data

– renaddxx.ped: renumbered pedigree + statistics

– renf90.tables: cross-reference between renumbered and original levels

RENUMF90 files

• Data file and pedigree file are flat files

• Columns separated by at least one SPACE

• No TABs !!!! (the current version checks for them)

• Input files cannot contain the character #

• Missing sires/dams must have code 0

• Code 00 is treated as a known animal

• It has its own parameter file, not the same as for the other programs !!!!

RENUMF90 parameter file

• Based on keywords in capital letters, each followed by one or more lines with the corresponding data item

• Keywords need to be typed exactly

• Keywords need to occur in sequential order !!!

• Lines starting with # are treated as comments and are ignored

RENUMF90 keywords

All these keywords are mandatory !!! Leave a blank line where necessary.

RENUMF90 keywords: sections

• Effect section

• Random effect section

• Random effect files section

• Pedigree options section

• Unknown parent group section

• Random regression group section

• Random effect (Co)Variances section

RENUMF90 keywords

• The section starting from EFFECTS can be repeated as many times as there are effects in the model

• Correlated effects are controlled by an option

• If (co)variances for any effect are missing, a default matrix with 1.0 on the diagonal and 0.1 off the diagonal will be used.

– WARNING: for EM-REML the convergence rate is better if starting values are too large rather than too small !!!

RENUMF90 keywords

• Creation of interaction effects

• Extra options section

• Options passed to BLUPF90

– All lines that begin with the keyword OPTION are passed to the parameter file renf90.par

– This allows the process to be automated with scripts

– For example:

OPTION sol se

RENUMF90 output files

Pedigree file: renaddxx.ped

• Column structure:

1. Animal number (from 1)

2. Parent 1 number or UPG number for parent 1

3. Parent 2 number or UPG number for parent 2

4. 3 minus number of known parents

5. Known or estimated year of birth (0 if not provided)

6. Number of known parents; if the animal has a genotype: 10 + number of known parents

7. Number of records

8. Number of progeny as parent 1

9. Number of progeny as parent 2

10. Original animal ID


Renumbering tables: renf90.tables

• For each cross-classified effect, a table is created with:

– Original ID, count, consecutive number

• Useful to:

– translate solutions from BLUPF90 programs back into the original alphanumeric values

– check counts of records by level

Example of RENUMF90 parameter file


Inbreeding program

• INBUPGF90

– Calculates inbreeding coefficients

– Alphanumeric identification of individuals

– Different methods:

• Regular inbreeding (Meuwissen & Luo)

• Missing parent information (VanRaden's method)

– Calculates expected future inbreeding for a set of defined matings

– Calculates relationships between animals

– Outputs a reordered pedigree with parent ID < animal ID

INBUPGF90

• No parameter file

• Controlled by command-line arguments

inbupgf90 --pedfile file_name

• See the wiki for a full description of options

Different Models with BLUPF90

http://nce.ads.uga.edu/wiki/doku.php?id=faq

Estimation of variance components

• Several methods available

– REML

– Bayesian methods via Gibbs sampling

• Review article:

– Misztal, I. 2008. Reliable computing in estimation of variance components. J. Anim. Breed. Genet. 125:363-370.

REML

• Maximizes the likelihood with respect to the parameters

• Different ways to find the maximum:

– Derivative-free (DF), e.g. MTDFREML

– Using first derivatives (EM-REML)

– Using second derivatives (AI-REML)

EM-REML

• Traditionally regarded as the most reliable

• But

– Slow

– Can fail if starting values are smaller than the 'true' parameters

– Use bigger starting values

– Does not produce standard errors of estimates

AI-REML

• Much faster than EM-REML

• Provides estimates of standard errors

• BUT

– For complex models and poor starting values:

• Slow convergence

• Parameter estimates outside the parameter space

– In some cases initial rounds with EM-REML help

Bayesian – Gibbs Sampling

• Implementation

– Solving of mixed model equations (Gauss-Seidel)

– Adding 'noise' to solutions

– Sampling of variance components from chi-square or Wishart distributions

• Samples are from the marginal posterior distribution after the burn-in period

Programs for estimation of variance components

• remlf90 -> EM-REML

• airemlf90 -> AI-REML

• Gibbs sampling

– gibbsf90: blupf90 transformed into a Gibbs sampler, slow

– gibbs1f90: optimized version

– gibbs2f90: improved mixing with correlated random effects

– gibbs3f90: heterogeneous residual variance classes

– thrgibbs1f90: threshold-linear mixed models

REMLF90 OPTIONS

AIREMLF90 OPTIONS

GIBBS SAMPLING PROGRAMS

• Extra input is required when running Gibbs sampling programs

• As with other programs: name of parameter file?

• Parameters to set up the MCMC chain:

– number of samples and length of burn-in

– Give n to store every n-th sample? (1 means store all samples)

Gibbs Sampling output files

• Default files

– gibbs_samples

– fort.99

• Solution files, only if requested by options

• Other files, only useful for continuation

– binary_final_solutions

– last_solutions

Gibbs Sampling OPTIONS

Heterogeneous residual variances: GIBBS3F90

Threshold models: THRGIBBS1F90

Post Gibbs analysis

• Program postgibbsf90 uses the output files

– gibbs_samples

– fort.99

• from all Gibbs sampling programs

– gibbs1f90

– gibbs2f90

– gibbs3f90

– thrgibbs1f90

POSTGIBBSF90

• Calculates statistics for variance components from the posterior distribution:

– mean

– median

– mode

– standard deviation

– HPD 95

– effective sample size

– autocorrelations

• Creates graphs with the trace of the chain and histograms of variance components

POSTGIBBSF90 output files

• "postgibbs_samples"

– file containing all Gibbs samples for additional post-analyses, e.g. posterior distributions of heritabilities, correlations, covariance functions of random regression models

• "postmean"

– file containing posterior means, in a matrix format that matches the parameter file

• "postsd"

– file containing posterior standard deviations

• "postout"

– statistics

How to run POSTGIBBSF90

• Interactive program

• As with other programs: name of parameter file?

• Parameters to select the samples used to calculate posterior statistics:

– Burn-in?

• Sets the number of samples to discard for the posterior analysis

• In the first run use 0 to check convergence

– Give n to read every n-th sample? (1 means read all samples)

• This number should be equal to or greater than the one used in the Gibbs programs

• Asks the user to enter an option to:

– generate graphs of traces or histograms

– exit

Using Gibbs Sampling programs

• For a new analysis use burn-in equal to 0

– Allows looking at the full chain with postgibbsf90

– Posterior samples can be extracted later

• For long jobs, store only every n-th sample with n > 1, e.g. 10

– Decreases the size of gibbs_samples

• DIC for model comparison is provided in the output of the Gibbs programs

– BUT a burn-in should be used for it to be meaningful
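Discarding the burn-in and thinning afterwards is simple list slicing once the samples are stored. A minimal sketch, not the postgibbsf90 implementation: the function name and the toy chain are hypothetical, and `burn_in` is counted here in stored samples.

```python
from statistics import mean, stdev


def posterior_summary(samples, burn_in=0, every=1):
    """Drop the first burn_in stored samples, keep every n-th of the rest, summarize."""
    kept = samples[burn_in::every]
    return {"n": len(kept), "mean": mean(kept), "sd": stdev(kept)}


# Toy chain for one variance component: the first values are far from the stationary region
chain = [50.0, 30.0, 12.0, 10.0, 11.0, 9.0, 10.0, 10.0]
print(posterior_summary(chain))                      # burn-in 0: look at the full chain first
print(posterior_summary(chain, burn_in=2, every=2))  # then discard and thin
```

Storing the full chain and choosing the burn-in afterwards is what makes the second call possible without rerunning the sampler.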

General comments for all programs

• Output printed to the terminal is NOT saved in any file !!!

• Use redirection or pipes to store output in log files:

echo renf90.par | blupf90 | tee blup.log

or

echo renf90.par | remlf90 | tee reml.log

For programs that require more than one parameter

gibbs2f90 <<AA > gibbs.log

renf90.par

1000 0

10

AA

• Or using a single line:

printf "exmr99s\n1000 0\n10\n" | gibbs2f90 > gibbs.log

General OUTPUT from all programs

• Check file names

• Check model

• Check (co)variances

• Check number of records and pedigree read

• Check maximum number of columns to read

Useful commands for Linux

• Access the server with an ssh client, e.g. PuTTY

• To display graphical windows, use an X11 server for Windows, e.g. Xming

• Commands in Linux are case sensitive !!

• Several tutorials on the web !!

• unixcombined.pdf from Misztal's web page: http://nce.ads.uga.edu/~ignacy/ads8200/unixcombined.pdf

• Unix_en.pdf from the genomeek blog: http://genomeek.wordpress.com/manuels/unix-et-awk/

– http://dl.dropbox.com/u/22940514/Unix_En.pdf

Basic Commands

pwd show working directory

ls list files in working directory

ll as before but with more information

mkdir d make a directory d

cd d change to directory d

cat file list the complete file

less file list file page-by-page

Copy and moving commands

To copy a file

cp /home/ignacio/course/lab/lab1/is .

To copy a directory

cp -r /home/ignacio/course/lab/lab1 .

To move file aa to bb in folder test

mv aa ./test/bb

To delete

rm yy delete the file yy

rm -r xx delete the folder xx

Other popular commands

head file print first 10 lines

tail file print last 10 lines

wc -l file count lines

grep text file find lines that contain text

cat file1 file2 concatenate files

sort sort file

cut cut specific columns

join join lines of two files on specific columns

paste paste lines of two files

expand replace TABs with spaces

uniq retain unique lines of a sorted file

Redirections & pipe

aa < bb

program aa reads from file bb

blupf90 < in

aa > bb

program aa writes to file bb

blupf90 < in > log

"|" and "tee"

program blupf90 reads the name of the parameter file and writes output to the terminal and to file log

echo par.b90 | blupf90 | tee log

AWK

• Very useful and fast command to work with

text files

• Can be used as a database query program

• Select specific columns or create new ones

• Select specific rows matching some criteria

AWK

Extract the solutions for a particular effect (2) and print EBV and accuracies (r^2)

awk '{ if ($2==2) print $3,$4,1-$5**2/20}' solutions
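The same extraction can be sketched in Python for readers who prefer it to awk. The column layout (trait, effect, level, solution, s.e.) is inferred from the awk command above, and the constant 20 is assumed to be the additive genetic variance; both are assumptions, as are the function name and the sample lines.

```python
def ebv_and_reliability(lines, effect, additive_variance):
    """Return (level, solution, 1 - se^2 / additive_variance) for one effect."""
    results = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[1] != str(effect):
            continue  # skips the header line and other effects
        level, solution, se = int(fields[2]), float(fields[3]), float(fields[4])
        results.append((level, solution, 1.0 - se ** 2 / additive_variance))
    return results


# Hypothetical content of a solutions file
solutions = [
    "trait/effect level solution s.e.",
    "1 1 1 10.5 0.9",
    "1 2 1 2.0 2.0",
    "1 2 2 -1.0 4.0",
]
for row in ebv_and_reliability(solutions, effect=2, additive_variance=20.0):
    print(*row)
```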

Count records by sire

awk '$2>0{ print $2}' ped | sort | uniq -c

Process CSV files

awk 'BEGIN {FS=","} { print $1,$2,$3}' pedigree.txt


Data simulation (including genomics) QMSim software

Zulma G. Vitezica

INRA-INPT, GenPhySE, Castanet-Tolosan 31326 France

[email protected]

QMSim: why use it?

• It was designed to simulate large-scale genotyping data in multiple and complex livestock pedigrees

• A wide variety of genome architectures, from the infinitesimal model to a single-locus model

• It is a user-friendly tool for simulating data

• Computationally efficient in terms of both time and memory

QMSim†: where to find it?

• The code is written in C++

• Executable files are freely available for Windows, Linux and Mac (last update: July 12, 2013) at:

http://www.aps.uoguelph.ca/~msargol/qmsim/

†Sargolzaei & Schenkel (2009), Bioinformatics 25:680-681.

How is the simulation carried out?

In 2 steps:

First step: a historical population is simulated

– in order to create initial LD and

– to establish mutation-drift equilibrium

– expansion and contraction of the population

Second step: one or multiple recent population structures are generated

Parameter file

• It must be in ASCII format

• It consists of five main sections

• The order of commands within each section is not important

• All commands end with a semicolon

• No semicolon → error message and the program exits.

1. Global parameters section

An arbitrary title

Output:

.---------------------------------------.
| Example 1 - 10k SNP panel |
`---------------------------------------'
Initial seed is backed up in [r_ex01/seed].
parameter file is backed up in [r_ex01/ex01.prm].

Parameter file: ex01.prm

Output folder: r_ex01/

The random number generator (RNG*) requires a seed file.

If it is not specified → the RNG will be seeded from the system clock

For each run the initial seed numbers are backed up in the output folder → this allows the run to be repeated!

* Mersenne Twister algorithm (Matsumoto & Nishimura, 1998)


Overall heritability (Polygenic + QTL)

QTL effect is simulated

Only polygenic effect is simulated

Both, polygenic and QTL effects are simulated

Range: 0 - 10,000


A sex-limited trait like milk yield

When males do not have records, but selection or culling is based on:

– EBVs → OK

– Phenotypes → males will be randomly selected or culled


Parameter file

2. Historical population section

To create initial LD

Evolutionary forces: mutation and drift (no selection, no migration)

Random mating: union of gametes randomly sampled from the male and female gametic pools

Discrete generations

Only a single historical population


hg_size = v1 [v2]

Historical generation sizes

v1: the historical generation size (range: 2 – 100,000)

v2: the historical generation number (range: 0 – 150,000)

Examples shown: constant size of 420; gradual decrease in size from 2000 to 200; expansion in the last historical generation from 100 to 3000

Historical bottleneck or expansion can be simulated

LD in livestock extends over longer distances than in humans





nmfhg s first historical generation

nmlhg s last historical generation

Default : equal number of males and females

Sex ratio will be constant across historical generations. It can be changed in the last generation

Number of males


Parameter file

8




3. Population section

One or multiple recent populations

For the first defined recent population (i.e. p1), founders must come

from the last historical population

For subsequent populations (i.e. p2), founders can be chosen from one or more

(up to 10) previously defined populations (i.e. p1)

Multiple recent populations can be analyzed

separately (one pedigree for each population) or jointly (by creating one pedigree for all populations) for inbreeding and EBV


Parameters for the founders

Number of male/female

to be selected

It indicates from which population the base animals must

be selected

hp: historical population (last historical generation)

Type of selection

select: rnd (default), phen, tbv and ebv /l : to select low values /h : to select high values

Choosing founders for a population

9

Choosing founders for a population for F2 design

Crossing between populations/lines

is allowed

Migration can be simulated

Choosing founders for a population for migration

10


ls: number of progeny per dam

ls: Probability of the litter sizes

Litter size


pmp: range 0-1, default is equal to 0.5

pmp: 0.5 /fix_litter Sex ratio will be fixed within litters (progeny of a dam)

Sex ratio

11


Mating design

rnd (default), rnd_ug (a dam can mate with more than one sire in each generation), p_assort (similarity), n_assort (dissimilarity), minf and maxf (inbreeding is minimized or maximized in the next generation)

Assortative mating is based on phen, ebv or tbv

Replacement

sr : 40% of sires will be replaced in all generations

sr : 0.4 [1] 0.5 [5] → 40% of sires will be culled from generation 1 to 5, and 50% from generation 5 to the last generation

sr : 1, discrete generations (default)


Selection and culling designs

rnd, phen, tbv, ebv and age (age only for culling)

/l or /h to select low or high values

Breeding value estimation method

Population-specific parameters for saving outputs

data: save individuals' data except their genotypes (file name: 'population name'_data_'replicate number'.txt)

stat: save brief statistics on simulated data

genotype: save genotype data (e.g. p1_mrk_007.txt, p1_qtl_007.txt)


Parameter file




4. Genome section

Number of chromosomes: 10 chrlen : range 1-5,000 cM

Marker information

Example – 10k SNP panel

Samples from uniform distribution

in each replicate

All marker loci will have 2 alleles

In the first historical generation, then drift

and mutation

14




4. Genome section

nqloci: range 1-50,000 on the chromosome

QTL information


Samples from uniform distribution

in each replicate

Equal allele frequencies in

the first historical

generation

Nb of QTL alleles in the first historical generation (all:

same number)

It will be sampled from gamma distribution with shape 0.4




More genome information


In recurrent mutation, no new allele is generated.

Default: infinite-allele model SNP recurrent mutations are generally very rare and no evidence

that mutation contributes to erosion of LD between SNP ( Ardlie et al., 2002)

Other possibilities :

Missing marker/QTL genotypes Genotyping errors can be simulated (marker/QTL)

15


Parameter file

5. Output section

Save brief statistics on historical population

Save allele effects

Marker and QTL linkage map (GWAS)

16

Marker and QTL linkage map

p1_data_001.txt

QMSim outputs

p1_stat_001.txt

17

p1_mrk_001.txt

Marker and QTL linkage map

18

Save allele effects

QMSim

To create LD

Dense marker map QTL + polygenic

Population expansion or bottleneck

Multiple recent populations / lines

Crossing between populations / lines

A single historical population

Sex limited traits

No fixed effects -

+

Only additive effects

Conclusion

19

Reference population Phenotypes Genotypes Pedigree

Population Phenotypes Genotypes Pedigree

Estimation SNP effects

Calculation of GEBVs

Comparison between GEBV and TBV (EBV) to obtain accuracy

Candidates to selection

Genomic selection : validation

Example of simulation

Generation -1000 to -5

Generation -5 to -1

Generation 1 to 9

Generation 10

Random mating (N=100)

Expanded to N=3000

200 males x 2,000 females / generation

Pedigree recording and genotyping start

Validation data: candidates to selection

Training data


Bases for Genomic Prediction

A. Legarra

INRA, Toulouse, France

2

Linkage disequilibrium

• « Gametic phase disequilibrium »

Statistical association between alleles at two loci on the same chromosome

– Loci: places

– Alleles: alternative forms of a gene (A, B, 0)

– Phase: notion of being on the same chromosome (of a pair) or coming from the same origin (sire or dam)

Biallelic case

• Assume we genotype 5 individuals, thus 10 chromosomes (and that we know the phase)

AB AB ab aB ab ab Ab AB Ab AB

• Now we compute allelic frequencies

p(A)=0.6

p(B)=0.5

if independent, p(AB)=0.3, p(ab)=0.2

The expected proportions are:

    A    a
B   0.3  0.2
b   0.3  0.2

Biallelic case

p(A)=0.6

p(B)=0.5

in reality:

    A    a
B   0.4  0.1
b   0.2  0.3

vs. expected

    A    a
B   0.3  0.2
b   0.3  0.2

More AB & ab than expected !!

This is linkage disequilibrium
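The comparison of observed and expected haplotype frequencies can be reproduced by counting the ten phased chromosomes listed for the biallelic case. A minimal sketch; the variable names are illustrative, and D is the usual excess of AB haplotypes over the product of allele frequencies.

```python
from collections import Counter

# The ten phased chromosomes from the biallelic example
haplotypes = "AB AB ab aB ab ab Ab AB Ab AB".split()
n = len(haplotypes)

observed = {hap: count / n for hap, count in Counter(haplotypes).items()}
p_A = sum(hap[0] == "A" for hap in haplotypes) / n
p_B = sum(hap[1] == "B" for hap in haplotypes) / n

# Expected frequencies if the two loci were independent
expected = {
    "AB": p_A * p_B,
    "Ab": p_A * (1 - p_B),
    "aB": (1 - p_A) * p_B,
    "ab": (1 - p_A) * (1 - p_B),
}
D = observed["AB"] - expected["AB"]

for hap in ("AB", "Ab", "aB", "ab"):
    print(hap, observed[hap], round(expected[hap], 3))
print("D =", round(D, 3))
```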

Linkage disequilibrium

• Is a statistical concept

• Describes non-random association of two loci

– Nothing more, so why is it useful?

• Two loci in LD are most often (very) close

– This is because LD breaks down with recombination

• Linkage disequilibrium between two loci decays on average with distance

• Hence it serves to map genes

Where does it come from?

• Because chromosomes are transmitted together

– Within known families (« linkage analysis »)

– Within the history of a population (« populational linkage disequilibrium », or « linkage disequilibrium » for short)

• This distinction is rather artificial

– Remember: a population is a very old, large family

Populational linkage disequilibrium

• Assume we mix two populations (say Churra and Merino)

• Or, that Adam was

– and Eve

– The first generation is an F1

– Then animals are mixed at random

• What do we get after many generations?


• The chromosomes become a fine-grained mosaic of grey and black

• However, complete mixture is difficult to attain

• Some people distinguish LD and pedigree relationships

• It's pretty much the same thing

A stretch (= chromosomal segment) is conserved because it comes from the same ancestor (co-ancestry).

• The value of LD (e.g. r2) observed at large distances is a function of recent relationships

• … at short distances it is a function of distant relationships

The « existence » of only a few conserved stretches at the same place creates LD. LD is therefore an over-representation of segments from a few gametes that existed in the population some time ago.

Within-family linkage disequilibrium

• Consider this male, with haplotypes A-B and a-b, who has 8 progeny

Recombination fraction: 0.50

These are the chromosomes in the sons (i.e. the gametes the male transmitted):

Ab AB aB ab Ab AB aB ab

We found linkage equilibrium in one generation

• Consider this male, with haplotypes A-B and a-b, who has 8 progeny

Recombination fraction: 0.25

Gametes transmitted:

Ab AB aB ab AB ab AB ab

Due to non-recombination, linkage disequilibrium has been generated:

    A      a
B   0.375  0.125
b   0.125  0.375


• Assume now there are two males

Male 1 (haplotypes A-B and a-b) transmits: Ab AB aB ab AB ab AB ab

Male 2 (haplotypes A-b and a-B) transmits: AB Ab ab aB Ab aB Ab aB

Within the family of male 1:

    A      a
B   0.375  0.125
b   0.125  0.375

Within the family of male 2:

    A      a
B   0.125  0.375
b   0.375  0.125

Pooling both families:

    A     a
B   0.25  0.25
b   0.25  0.25

No overall linkage disequilibrium

• Why tracing QTLs within family is easy

Measures of LD: r2

r is the correlation between two loci if we say « A » = 1, « a »=0

« B » = 1, « b »=0

• Not free from problems but can be understood by

statisticians (and breeders)

• The sample size needed to achieve a given power is

proportional to 1/r2 (Pritchard Przeworski 2001 Am J Hum Genet 69:1)

• Everybody uses it to describe things in genomic selection.

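$$ r = \frac{f(AB) - pq}{\sqrt{p(1-p)\,q(1-q)}} = \frac{D}{\sqrt{p(1-p)\,q(1-q)}} $$

where p and q are the frequencies of alleles A and B, and D = f(AB) - pq.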
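Coding the alleles as 0/1 makes r an ordinary correlation, which can be checked numerically. A minimal sketch using the ten phased chromosomes from the biallelic example earlier in these slides; the function name is hypothetical.

```python
from math import sqrt


def ld_r(haplotypes):
    """Correlation between two loci with 'A' = 1, 'a' = 0 and 'B' = 1, 'b' = 0."""
    n = len(haplotypes)
    x = [1 if hap[0] == "A" else 0 for hap in haplotypes]
    y = [1 if hap[1] == "B" else 0 for hap in haplotypes]
    p = sum(x) / n                                  # frequency of A
    q = sum(y) / n                                  # frequency of B
    f_ab = sum(xi * yi for xi, yi in zip(x, y)) / n  # frequency of haplotype AB
    d = f_ab - p * q
    return d / sqrt(p * (1 - p) * q * (1 - q))


haplotypes = "AB AB ab aB ab ab Ab AB Ab AB".split()
r = ld_r(haplotypes)
print(round(r, 3), round(r ** 2, 3))
```

For this sample D = 0.1 and p(1-p)q(1-q) = 0.06, so r2 = 0.01 / 0.06 = 1/6.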


Bayesian Inference

Gibbs sampling

• Iterative procedure

– Construct a joint distribution p(A,B,C)

• Typically this distribution contains phenotypes + a priori information + likelihood

• We want to draw inferences from this distribution, for instance the expected value of A

– Sampling

• Sample A from p(A|B,C)

• Sample B from p(B|A,C)

• Sample C from p(C|A,B)

• Sample A from p(A|B,C)

• Sample B from p(B|A,C)

• Sample C from p(C|A,B)

• …
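The alternating scheme can be illustrated on a toy problem. The sketch below is an assumption-laden example, not a genomic model: it uses only two unknowns (A and B) with a standard bivariate normal joint distribution and correlation rho, for which both conditional distributions are normal with mean rho times the other variable and variance 1 - rho^2.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    conditional_sd = (1.0 - rho * rho) ** 0.5
    a, b = 0.0, 0.0  # arbitrary starting values
    samples = []
    for _ in range(n_samples):
        a = rng.gauss(rho * b, conditional_sd)  # sample A from p(A|B)
        b = rng.gauss(rho * a, conditional_sd)  # sample B from p(B|A)
        samples.append((a, b))
    return samples


samples = gibbs_bivariate_normal(rho=0.5, n_samples=20000)
kept = samples[1000:]  # discard a burn-in
mean_a = sum(a for a, _b in kept) / len(kept)
print("posterior mean of A (expected near 0):", round(mean_a, 2))
```

The inference on A is the average of the retained samples of A, not the last value of the chain.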


Gibbs sampling

• Iterative procedure in two steps

– Burn-in

• Some iterations of "burn-in"

• Typical trace along the iterations

Gibbs sampling

– The final result at the end of the iterations is NOT the solution looked for (this contrasts with REML or Gibbs)

– No clear measure of convergence

– We accumulate information over the post-burn-in iterations

• Solution = average of the post-burn-in iterations

• Example:

• $\hat{a}_i^{final} = \frac{1}{n}\sum \tilde{a}_i$, where $\tilde{a}_i$ is sampled over n iterations

• E.g. in BayesC, $\hat{\delta}_i = 0.3$ means that over 1000 iterations, 300 times $\tilde{\delta}_i = 1$ and 700 times $\tilde{\delta}_i = 0$

Models for Genomic selection

• Single marker

• Whole-genome (multiple marker) genomic selection

Single marker

• Assume there is a marker in complete LD with a QTL

• For example, the polymorphism in the halothane gene (HAL), which is a predictor of bad meat quality in swine

• Estimating breeding values including the marker is a piece of cake

• y_i = marker effect in animal i + e

– We substitute the true, possibly unknown gene by a proxy observed marker and estimate the effects of the latter using a linear model

– We can include an additional polygenic genetic value of animal i

Base model

• y = … + Za + e

– Z = incidence matrix of marker effects

– a = marker effects

– e = residuals

3 individuals, 1 marker with 4 alleles:

$$ Za = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} $$

• This can be solved, for example, by least squares
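Each row of Z simply counts how many copies of each allele an individual carries. A minimal sketch that rebuilds the matrix of this example from genotypes; the function name is hypothetical and the genotypes (alleles numbered 1 to 4) are read off the rows of Z.

```python
def allele_incidence(genotypes, n_alleles):
    """Build Z: one row per individual, one column per allele, entries are allele counts."""
    z = []
    for allele_1, allele_2 in genotypes:
        row = [0] * n_alleles
        row[allele_1 - 1] += 1
        row[allele_2 - 1] += 1
        z.append(row)
    return z


# Individual 1 carries alleles 2 and 3, individual 2 is homozygous for allele 1,
# individual 3 carries alleles 2 and 4
genotypes = [(2, 3), (1, 1), (2, 4)]
Z = allele_incidence(genotypes, n_alleles=4)
for row in Z:
    print(row)
```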

28

Single marker

• This is fine if we know which markers are good predictors of which genes

• But this is rarely the case

– It can be shown that you miss a lot of information by trying to locate the QTLs

– And the effects of those that you find are certainly exaggerated

• Go to notes

Whole genome

• If we don't select QTL regions we skip the problem of bias

• Therefore :

– Genetic value = sum of effects of all regions

• We effectively treat all regions as being carriers of a

QTL

– How do we estimate the effects of all regions?


Whole genome

• The simplest approach is to extend single marker analysis

• Do multiple marker regression

• You want to cover all the genome => many

markers

32

Multiple marker additive model

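$$Za = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 \\ 2 & 0 & 2 & 0 & 0 & 0 \\ 0 & 2 & 1 & 0 & 0 & 1 \\ 0 & 2 & 1 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} a_{1,1} \\ a_{1,2} \\ a_{2,1} \\ a_{2,2} \\ a_{2,3} \\ a_{2,4} \end{pmatrix}$$

4 individuals, 2 markers each: 2 alleles in 1st marker, 4 alleles in 2nd marker

• y = Za + e

– Z = incidence matrix of marker effects

– a = marker effects

– e = residuals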


Estimating SNP effects

• The simultaneous estimates of many markers by least squares are very poor

– if we have many more SNPs than individuals

– They are thus terribly bad for genomic predictions as well

• Even if we had many individuals, there is a missing piece of information:

– we think that most SNPs do not have an effect, or at least not a big one

– this is « prior » information

• Can we do something?

34

Best Predictor as a Bayesian estimator

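$$\hat{a} = E(a \mid y) = \frac{\int a \, p(y \mid a) \, p(a) \, da}{\int p(y \mid a) \, p(a) \, da}$$

– $\hat{a}$: estimate of SNP effects

– $p(a)$: « prior » (how we think SNP effects are)

– $p(y \mid a)$: « likelihood » (how SNP effects affect the phenotype)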


Best Predictor as a « penalized » estimator

• Statisticians & « machine learners » aim at using « penalized » estimators (Ridge regression, Elastic net…)

• A penalized estimator is the same as a « best predictor » (or as a Bayesian estimator) as before, with the prior now called « penalization »

In the reference population:

Get markers' genotypes ($Z_r$)

Get phenotypes ($y$)

Estimate marker effects $a$ from $y = 1\mu + Z_r a + e$, possibly with a Bayesian model

In the candidates:

Get markers' genotypes ($Z_c$)

Take estimates $\hat{a}$ from above

Estimate breeding values as $\hat{u}_c = Z_c \hat{a}$

• Go to notes

38

A priori Distributions for marker effects

• Several distributions for SNP effects have been

proposed

– Normal (Meuwissen et al., Genetics 2001; VanRaden, JDS 2008) → BLUP_SNP or GBLUP or RR-BLUP or « Ridge regression »

– BayesA, BayesB, (Meuwissen et al. 2001; Habier et al., 2011)

– Mixture of normal , BayesC(Pi) (Van Raden JDS 2008,

Habier et al., 2011)

– (Bayesian) Lasso (Usai et al., 2009; De los Campos, et al., 2009)


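Prior variances for SNP effects

Normal: $a_i \sim N(0, \sigma_a^2)$, with $Var(a_i) = \sigma_a^2$

BayesA: $a_i \sim t(0, S_a^2, \nu)$, with $Var(a_i) = S_a^2 \, \nu / (\nu - 2)$

BayesCPi: $a_i = 0$ with $Pr = \pi$ and $a_i \sim N(0, \sigma_a^2)$ with $Pr = 1 - \pi$, so that $Var(a_i) = (1 - \pi)\,\sigma_a^2$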

40

Normal distribution

[Figure: standard normal density, dnorm(x), for x from -4 to 4]

Few « big » effects

$a_i \sim N(0, \sigma_a^2)$

Normal equations for genomic selection

(BLUP_SNP)

• If we assume normality there are closed

expressions for â

• This is called « BLUP », and also « genomic

BLUP » , BLUP_SNP, or GBLUP, but also « ridge

regression » or Random Regression-BLUP

– I will keep GBLUP for the use of the genomic

relationship matrix

– and BLUP_SNP for the direct estimation of SNP effects

42

Mixed model equations for BLUP_SNP

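• Henderson's MME

• Z'Z is not diagonal

• Var(a)=D is diagonal if we assume uncorrelated SNP effects

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}$$

$$D = I\sigma_a^2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \sigma_a^2$$

Could (will) be something different !!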

Solving BLUP_SNP: GS with Residual Update

• How to estimate SNP effects efficiently (Legarra and Misztal, J. Dairy Sci. 2008) (reinvented many times)

• Let us assume a random SNP model ("BLUP_SNP")

• Mixed model equations can be solved by direct inversion (1 iteration) or by Gauss-Seidel, PCG or Jacobi (iterative methods, useful for large matrices).

• MCMC (BayesB, etc) can be done starting from Gauss-Seidel

• The number of effects (SNP) n is much larger than the number of records m, and the matrix Z'Z is dense. A typical example (2000 records, 20000 SNP):

44

Efficient solvers for BLUP_SNP:

• Gauss-Seidel with Residual Update: ( Legarra and Misztal J.

Dairy Sci. 2008) (reinvented many times) implemented in GS3

– Forms the basis of the Gibbs sampling algorithms in BayesC, etc.

– Iterate on:

1. Estimate SNP effect

2. Correct data for this SNP effect

• Preconditioned Conjugate Gradients (not in GS3)

– Estimate all SNP simultaneously using a "search" based on residuals at each iteration

The size of the MME

Model: y = Za + e, with Z of size m × n (40,000,000 elements)

MME: (Z'Z + Iλ) â = Z'y, with Z'Z (dense) of size n × n (400,000,000 elements)

Much bigger! Is this memory efficient? Easy to solve?

46

Reordering Gauss Seidel

• Gauss Seidel uses the conditional mean for the i-th

effect, corrected by the other effects:

• $(z_i'z_i + \lambda)\,\hat{a}_i^{l+1} = z_i'(y - Z\hat{a} + z_i\hat{a}_i^{l})$

• Note that we are correcting for $\hat{a}_i$, so we put it back

Reordering Gauss Seidel

• Gauss Seidel uses the conditional mean for the i-th

effect, corrected by the other effects:

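• $(z_i'z_i + \lambda)\,\hat{a}_i^{l+1} = z_i'(y - Z\hat{a} + z_i\hat{a}_i^{l})$

• Correcting for $Z\hat{a}$ takes 20000 operations

• This is the residual $\hat{e}$, isn't it?

• Use alternative formula

$(z_i'z_i + \lambda)\,\hat{a}_i^{l+1} = z_i'\hat{e} + z_i'z_i\hat{a}_i^{l}$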

Reordering the error term

• Still we need to compute $\hat{e}$ at each iteration

• Actually only $\hat{a}_i$ changed

• It can be shown that $\hat{e}$ can be « updated »

$\hat{e}^{l+1} = \hat{e}^{l} - z_i(\hat{a}_i^{l+1} - \hat{a}_i^{l})$

– Hence « GSRU », Gauss-Seidel with Residual Updating

– Some machine learning literature calls this « backfitting »

GSRU in Figure

[Figure: y = Za + e drawn as blocks]

1- Gauss-Seidel: $(z_i'z_i + \lambda)\,\hat{a}_i^{l+1} = z_i'\hat{e} + z_i'z_i\hat{a}_i^{l}$

GSRU in Figure

[Figure: y = Za + e drawn as blocks]

2- Residual Updating: $\hat{e}^{l+1} = \hat{e}^{l} - z_i(\hat{a}_i^{l+1} - \hat{a}_i^{l})$

Fortran pseudocode

Double precision:: xpx(neq),y(ndata),e(ndata),X(ndata,neq), &
  sol(neq),lambda,lhs,rhs,val
do i=1,neq
  xpx(i)=dot_product(X(:,i),X(:,i)) !form diagonal of X'X
enddo
e=y
do until convergence
  do i=1,neq
    !form lhs X'R-1X + G-1
    lhs=xpx(i)/vare+1/vara
    ! form rhs with y corrected by other effects (formula 1) !X'R-1y
    rhs=dot_product(X(:,i),e)/vare +xpx(i)*sol(i)/vare
    ! do Gauss Seidel
    val=rhs/lhs
    ! MCMC sample solution from its conditional (commented out here)
    ! val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    !update sol
    sol(i)=val
  enddo
enddo
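For readers who want to run the algorithm, the pseudocode can be sketched in plain Python. This is an illustrative sketch, not the GS3 implementation: the function name `gsru`, the stopping rule, the absence of fixed effects and the single shrinkage parameter `lam` (standing for the ratio of residual to SNP variance) are assumptions, and the toy data are invented.

```python
def gsru(Z, y, lam, max_rounds=5000, tol=1e-20):
    """Gauss-Seidel with residual update for (Z'Z + I*lam) a = Z'y."""
    m, n = len(Z), len(Z[0])
    cols = [[Z[r][i] for r in range(m)] for i in range(n)]  # columns z_i
    zz = [sum(v * v for v in col) for col in cols]          # diagonal of Z'Z
    a = [0.0] * n
    e = list(y)                  # residuals equal y because a starts at 0
    for _ in range(max_rounds):
        change = 0.0
        for i in range(n):
            # rhs with y corrected for all other effects (formula 1)
            rhs = sum(z * r for z, r in zip(cols[i], e)) + zz[i] * a[i]
            new = rhs / (zz[i] + lam)
            diff = new - a[i]
            # update residuals with the change in this effect (formula 2)
            e = [r - z * diff for z, r in zip(cols[i], e)]
            a[i] = new
            change += diff * diff
        if change < tol:         # stop when solutions no longer move
            break
    return a, e


# Toy data: 4 individuals, 6 allele effects (invented phenotypes)
Z = [[1, 1, 0, 1, 1, 0],
     [2, 0, 2, 0, 0, 0],
     [0, 2, 1, 0, 0, 1],
     [0, 2, 1, 1, 0, 0]]
y = [1.2, -0.5, 0.3, 0.8]
a_hat, e_hat = gsru(Z, y, lam=1.0)
print([round(v, 4) for v in a_hat])
```

Only the columns of Z and the residual vector are touched; Z'Z is never formed, which is the point of the method.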

52

Preconditioned Conjugate Gradients

• The other method of choice to solve large systems of

equations (e.g. Strandén and Lidauer, 1998; Tsuruta et

al., 2001)

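• Based on repeated computations of the product Ax

• Can easily be done for genomic models as $W'(Wx) + \Sigma^{-1}x$ at a cost of 3nm operations

• PCG is much faster (but less general)

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}$$

$$Ax = (W'W + \Sigma^{-1})\,x = t$$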

Preconditioned Conjugate Gradients for BLUP_SNP

[Figure: log10(convergence) against round (0 to 10000) with real data (Holstein), for PCG and GSRU]

• PCG is much faster

• GSRU convergence is slow for large data sets (or you really need to wait)

• Still, EBV's seem identical, possibly because errors in SNP estimates cancel out when summing

54

BLUP_SNP parameters

• How do we get the variance of SNP effects, $\sigma_a^2$, from a genetic variance $\sigma_g^2$?

• The formula comes from the sampling variance of the covariates in Z that link SNP effects to data

– i.e., we try to explain all genetic variance as if « caused » by SNP effects, and these SNP effects have a variance of $\sigma_a^2$

• Assumes Hardy-Weinberg and linkage equilibrium

$$\sigma_a^2 = \frac{\sigma_g^2}{2\sum_{all\ SNPs} p_i(1 - p_i)}$$
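As a worked illustration of this conversion, a minimal sketch follows. The function name and the allele frequencies are invented for the example.

```python
def snp_variance(sigma2_g, freqs):
    """Variance of SNP effects from the genetic variance and allele frequencies."""
    # 2 * sum of p_i (1 - p_i) over all SNPs
    denom = 2.0 * sum(p * (1.0 - p) for p in freqs)
    return sigma2_g / denom


# Four SNPs at frequency 0.5: 2 * 4 * 0.25 = 2, so sigma2_a = sigma2_g / 2
sigma2_a = snp_variance(1.0, [0.5, 0.5, 0.5, 0.5])
print(sigma2_a)
```

With tens of thousands of SNPs the denominator is large, so each SNP gets a very small variance.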


Residual variance with pseudo-data

• What is the residual variance if we use DYD's?

– DYD is a "milk yield" assigned to a bull (had it been a cow)

• DYD = performance of the daughters, corrected by dams' BVs and other effects. Ideally:

$$2\,DYD_i = u_i + \frac{1}{n_i}\sum_j 2\phi_j + \frac{1}{n_i}\sum_j 2e_j = u_i + \varepsilon_i$$

– $u_i$: bull BV

– $\phi_j$: Mendelian sampling of his daughters

– $e_j$: « true » residuals

– $\varepsilon_i$: « pseudo » residuals

– $n_i$ = « edc », equivalent daughter contribution

But $Var(\phi) = \frac{1}{2}\sigma_u^2$

And therefore $Var(\varepsilon_i) = \frac{1}{n_i}\left(2\sigma_u^2 + 4\sigma_e^2\right)$

• Residual variances with deregressed proofs can be found in Garrick et al. 2009

Estimating variances = BayesC (with π=0)

• It simply consists in a BLUP_SNP where we estimate (and simultaneously « integrate out ») $\sigma_a^2$ and $\sigma_e^2$

– i.e., a regular Gibbs sampler applied to SNPs instead of EBV's (Gibbs-SNP ??)

– Legarra et al., 2008 (we didn't call it BayesC), Habier et al., 2011

• Pretty straightforward from GSRU

• You can as well estimate $\sigma_a^2$ and $\sigma_e^2$ using « BayesC » and take them as known in BLUP_SNP (e.g. as in REML+BLUP analysis)

$$y = Xb + Za + e$$

$$a \mid \sigma_a^2 \sim MVN(0, I\sigma_a^2); \quad \sigma_a^2 \sim \chi^{-2}(\nu_a, S_a^2)$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2); \quad \sigma_e^2 \sim \chi^{-2}(\nu_e, S_e^2)$$

Fortran pseudocode for BayesC ...

do j=1,niter
  do i=1,neq
    !form lhs
    lhs=xpx(i)/vare+1/vara
    ! form rhs with y corrected by other effects (formula 1)
    rhs=dot_product(X(:,i),e)/vare +xpx(i)*sol(i)/vare
    ! MCMC sample solution from its conditional
    val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    !update sol
    sol(i)=val
  enddo
  ! draw variance components
  ss=sum(sol**2)+nua*Sa
  vara=ss/chi(nua+nsnp)
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
enddo
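A runnable sketch of this sampler in plain Python is given below, assuming no fixed effects. The function name, the prior settings and the toy data are illustrative assumptions; a chi-squared draw with k degrees of freedom is obtained as a gamma draw with shape k/2 and scale 2.

```python
import random


def bayesc_pi0(Z, y, n_iter=300, burn_in=100,
               nu_a=4.0, S_a=0.1, nu_e=4.0, S_e=1.0, seed=1):
    """Gibbs sampler for SNP effects and both variances (all SNPs in the model)."""
    rng = random.Random(seed)
    m, n = len(Z), len(Z[0])
    cols = [[Z[r][i] for r in range(m)] for i in range(n)]
    zz = [sum(v * v for v in col) for col in cols]
    a = [0.0] * n
    e = list(y)                      # residuals equal y because a starts at 0
    var_a, var_e = S_a, S_e          # start variances at their prior scales
    sum_a = [0.0] * n
    sum_var_a = sum_var_e = 0.0
    kept = 0
    for it in range(n_iter):
        for i in range(n):
            lhs = zz[i] / var_e + 1.0 / var_a
            rhs = (sum(z * r for z, r in zip(cols[i], e)) + zz[i] * a[i]) / var_e
            # sample the effect from its full conditional N(rhs/lhs, 1/lhs)
            new = rng.gauss(rhs / lhs, (1.0 / lhs) ** 0.5)
            diff = new - a[i]
            e = [r - z * diff for z, r in zip(cols[i], e)]  # residual update
            a[i] = new
        # scaled inverted chi-squared draws for the variance components
        ss = sum(v * v for v in a) + nu_a * S_a
        var_a = ss / rng.gammavariate((nu_a + n) / 2.0, 2.0)
        ss = sum(r * r for r in e) + nu_e * S_e
        var_e = ss / rng.gammavariate((nu_e + m) / 2.0, 2.0)
        if it >= burn_in:            # accumulate post-burn-in samples only
            kept += 1
            sum_a = [s + v for s, v in zip(sum_a, a)]
            sum_var_a += var_a
            sum_var_e += var_e
    return [s / kept for s in sum_a], sum_var_a / kept, sum_var_e / kept


Z = [[1, 1, 0, 1, 1, 0],
     [2, 0, 2, 0, 0, 0],
     [0, 2, 1, 0, 0, 1],
     [0, 2, 1, 1, 0, 0]]
y = [1.2, -0.5, 0.3, 0.8]
post_a, post_var_a, post_var_e = bayesc_pi0(Z, y)
print([round(v, 3) for v in post_a], round(post_var_a, 3), round(post_var_e, 3))
```

With only four records the posterior is dominated by the priors; the toy run only shows the mechanics of the loop.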


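Estimate of this SNP

BLUP_SNP solution: $\hat{a}_{BLUP} = \dfrac{x'y / \sigma_e^2}{x'x / \sigma_e^2 + 1/\sigma_a^2}$

Least squares solution (e.g. in GWAS): $\hat{a}_{GWAS} = \dfrac{x'y / \sigma_e^2}{x'x / \sigma_e^2}$

$\sigma_a^2$ is the variance of the SNP

In BLUP_SNP, we shrink the least squares estimate towards 0 because usually $1/\sigma_a^2$ is a large number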
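The shrinkage can be checked numerically with a small sketch for one SNP. The function names, the covariate `x` and the variances are invented for the example.

```python
def gwas_estimate(x, y):
    """Least squares estimate of a single SNP effect: x'y / x'x."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / xx


def blup_estimate(x, y, var_e, var_a):
    """Shrunken estimate: (x'y / var_e) / (x'x / var_e + 1 / var_a)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return (xy / var_e) / (xx / var_e + 1.0 / var_a)


x = [0.0, 1.0, 2.0, 1.0]          # allele counts of one SNP in 4 animals
y = [0.0, 1.0, 2.0, 1.0]          # phenotypes (x'y = 6, x'x = 6)
ls = gwas_estimate(x, y)                              # 6 / 6
shrunk = blup_estimate(x, y, var_e=1.0, var_a=0.5)    # 6 / (6 + 2)
almost_ls = blup_estimate(x, y, var_e=1.0, var_a=1e12)
print(ls, shrunk, almost_ls)
```

The smaller `var_a` is, the larger the term added to the denominator and the stronger the shrinkage; a very large `var_a` gives back the least squares estimate.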



• So, the estimate is much smaller than the GWAS estimate

• But we can fit all SNPs simultaneously

• And this provides unbiased (in some sense) estimates

• However, the result is very confusing for QTL detection and it is unclear how this works for "large" QTLs:

[Figure: squared SNP solutions (snps$solution^2) against SNP index, 0 to 10000]


This suggests an iterative/adaptive strategy

$$\hat{a}_i^{BLUP} = \frac{x'y / \sigma_e^2}{x'x / \sigma_e^2 + 1/\sigma_{a_i}^2}$$

$\sigma_{a_i}^2$ is the variance of THIS SNP

If $\sigma_{a_i}^2 \to \infty$ we get the least squares estimate

The more important the SNP, the larger $\sigma_{a_i}^2$

– BayesA, etc. (see later)

BayesA

• We « estimate » a different $\sigma_a^2$ for each SNP

– this estimate is horribly bad

– but SNP solutions correspond to a model with « t » distributions

• Pretty straightforward from GSRU

$$y = Xb + Za + e$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2); \quad \sigma_e^2 \sim \chi^{-2}(\nu_e, S_e^2)$$

Representation as « t »: $a_i \sim t(0, S_a^2, \nu_a)$

Meuwissen et al. representation: $a_i \mid \sigma_{a_i}^2 \sim N(0, \sigma_{a_i}^2); \quad \sigma_{a_i}^2 \sim \chi^{-2}(\nu_a, S_a^2)$

Normal vs. BayesA

[Figure: normal density, dnorm(x), compared with the BayesA prior, for x from -5 to 5]

Big effects are more likely in BayesA

Fortran pseudocode for BayesA ...

do j=1,niter
  do i=1,neq
    !form lhs
    lhs=xpx(i)/vare+1/vara(i)
    ! form rhs with y corrected by other effects (formula 1)
    rhs=dot_product(X(:,i),e)/vare +xpx(i)*sol(i)/vare
    ! MCMC sample solution from its conditional
    val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    !update sol
    sol(i)=val
    ! draw the variance of this SNP
    ss=sol(i)**2+nua*Sa
    vara(i)=ss/chi(nua+1)
  enddo
  ! draw residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
enddo

BayesB (mixture with t distribution)

[Figure: t density with 4 degrees of freedom, dt(x, 4), for x from -4 to 4]

A fraction π of markers has null effects.

Otherwise a t distribution (big effects are not unlikely)

$a_i \sim \pi \cdot 0 + (1 - \pi)\, t(0, S_a^2, \nu_a)$

BayesB

• e.g. Meuwissen et al., 2001

• What if some SNP had no effect in BayesA?

– This is the original idea of BayesB

– needs the probability that a given SNP is in the model or not

– can be computed by MCMC but is notoriously more difficult (see for instance Villanueva et al., doi: 10.2527/jas.2010-3814)

$$y = Xb + Za + e$$

$$a_i \mid \delta_i \begin{cases} \sim t(0, S_a^2, \nu_a) & \text{if } \delta_i = 1 \\ = 0 & \text{if } \delta_i = 0 \end{cases}$$

$$\delta_i = \begin{cases} 0 & \text{with probability } \pi \\ 1 & \text{with probability } 1 - \pi \end{cases}$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2); \quad \sigma_e^2 \sim \chi^{-2}(\nu_e, S_e^2)$$

π is fixed

Mixture distribution or BayesC(Pi)

[Figure: normal density, dnorm(x), for x from -4 to 4]

A fraction π of markers has null (or almost null) effects.

Otherwise they are normal

$a_i \sim \pi \cdot 0 + (1 - \pi)\, N(0, \sigma_a^2)$

BayesCPi

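• e.g. Habier et al., 2011 (see also Rohan Fernando course notes)

• What if some SNP had no effect?

– This is the original idea of BayesB

– needs the probability that a given SNP is in the model or not

– can be computed by MCMC

$$y = Xb + Za + e$$

$$a_i \mid \sigma_a^2, \delta_i \begin{cases} \sim N(0, \sigma_a^2) & \text{if } \delta_i = 1 \\ = 0 & \text{if } \delta_i = 0 \end{cases}; \quad \sigma_a^2 \sim \chi^{-2}(\nu_a, S_a^2)$$

$$\delta_i = \begin{cases} 0 & \text{with probability } \pi \\ 1 & \text{with probability } 1 - \pi \end{cases}$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2); \quad \sigma_e^2 \sim \chi^{-2}(\nu_e, S_e^2)$$

π can be fixed or estimated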

BayesCPi

• Algorithm consists in a BLUP_SNP by GSRU where we estimate (and simultaneously « integrate out ») $\sigma_a^2$ and $\sigma_e^2$

– for each SNP we compute the probability of it being « in » the model (indicator variable h)

• This was a nightmare in original BayesB

• R. Fernando found a simple way of computing it (course notes: http://www.ans.iastate.edu/stud/courses/short/2010short.html ) that is « like » GSRU

– we can equally estimate the proportion π or fix it beforehand

Fortran pseudocode for BayesCPi ...

do j=1,niter
  do i=1,neq
    ...
    ! compute likelihood for state 1 (i -> in model) and 0 (not in model)
    ! Notes by RLF (2010, Bayesian Methods in
    ! Genome Association Studies, p 47/67)
    v1=xpx(i)*vare+(xpx(i)**2)*vara
    v0=xpx(i)*vare
    rj=rhs*vare ! because rhs=X'R-1(y corrected)
    ! prob state delta=0
    like2=density_normal((/rj/),v0) !rj = N(0,v0)
    ! prob state delta=1
    like1=density_normal((/rj/),v1) !rj = N(0,v1)
    ! add prior for delta
    like2=like2*pi; like1=like1*(1-pi)
    !standardize (total is computed first so that both use the same denominator)
    total=like2+like1
    like2=like2/total; like1=like1/total
    delta(i)=sample(states=(/0,1/),prob=(/like2,like1/))
    if(delta(i)==1) then
      val=normal(rhs/lhs,1d0/lhs)
    else
      val=0
    endif
    ...
  enddo
  pi=1-beta(count(delta==1)+apriori_included,count(delta==0)+apriori_not_included)
  ss=sum(sol**2)+nua*Sa
  vara=ss/chi(nua+count(delta==1))
  ...
enddo

http://www.ans.iastate.edu/stud/courses/short/2010short.html

BayesCPi

• So far this looks simple

• But BayesCPi has many details & caveats

– How to run the Gibbs sampler?

• Rule of thumb: iterate n times the number of markers, with 100 > n > 5

– (need to find the good combination of markers)

– Do we estimate or fix π? At which values?

– What do we get as results?

BayesCPi

• Parameter π (or (1 − π)) is the proportion of SNPs in the model

• Do we estimate or fix it?

– In theory we can estimate it

– In practice it is very tricky

• Colombani et al. could estimate it in Holstein but not in Montbéliarde (estimate too imprecise)

– Usually we "fix" it to 1/1000 (50 SNP out of 50,000) for QTL detection and to 1/100 for genomic selection

– Or, we put a uniform prior on π for genomic selection

BayesCPi

• Parameter π and genetic variance

• In the case of BayesCPi, $\sigma_g^2 = 2\pi \sum p_i q_i \, \sigma_a^2$

• So, to recover all genetic variance from SNPs, we need to modify π and $\sigma_a^2$ at the same time

– Then $\sigma_a^2 = \dfrac{\sigma_g^2}{2\pi \sum p_i q_i}$

• So, π = 0.001 implies that $\sigma_a^2$ is 1000 times larger than in BLUP_SNP and estimates are less "shrunken"
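The factor of 1000 can be verified with a short sketch. The function name and the allele frequencies are invented; `pi_in` stands for the proportion of SNPs assumed to be in the model, as in the formula on this slide.

```python
def snp_variance_mixture(sigma2_g, freqs, pi_in=1.0):
    """SNP variance when only a proportion pi_in of SNPs is in the model."""
    # sigma2_g = 2 * pi_in * sum(p q) * sigma2_a, solved for sigma2_a
    sum_pq = sum(p * (1.0 - p) for p in freqs)
    return sigma2_g / (2.0 * pi_in * sum_pq)


freqs = [0.1, 0.3, 0.5, 0.2] * 250           # 1000 invented allele frequencies
all_in = snp_variance_mixture(1.0, freqs)    # BLUP_SNP case, every SNP in the model
few_in = snp_variance_mixture(1.0, freqs, pi_in=0.001)
ratio = few_in / all_in
print(ratio)
```

Changing `pi_in` without rescaling the SNP variance would change the total genetic variance implied by the model.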

75

BayesCPi

• Output of BayesCPi

– For each SNP, the marginal posterior probability of being "in the model"

– Not a single subset of SNPs "in the model"

effect level solution sderror p

2 1 0.49637122E-02 0.63842196E-01 0.69375000E-02

2 2 0.49501460E-03 0.17864670E-01 0.10375000E-02

2 3 0.38664734E-04 0.79524430E-02 0.32500000E-03

2 4 0.18222423E-04 0.59148438E-02 0.25000000E-03

2 5 0.21643136E-03 0.11477947E-01 0.53750000E-03

2 6 -0.55016190E-03 0.28990326E-01 0.97500000E-03

2 7 0.94168849E-04 0.74293395E-02 0.28750000E-03

• This implies that most SNP are in LD with some QTL somewhere

• Sometimes, a single SNP stands out → large QTL

[Figure: posterior probability of inclusion (Andres$p) against position (index 0 to 1400); OAR12, Salle et al. (JAS)]

BayesCPi

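• How do we declare a "positive" QTL?

• We have no p-values in this analysis

– Bayesians insist on using the Bayes Factor for that (Wakefield; Bertrand & Stephens, etc.) but there are no clear rules on how

• Construct the Bayes Factor:

$$BF = \frac{p(\text{SNP in the model} \mid \text{data}) \,/\, p(\text{SNP not in the model} \mid \text{data})}{p(\text{SNP in the model}) \,/\, p(\text{SNP not in the model})} = \frac{\text{posterior « odds »}}{\text{prior « odds »}}$$

In our case:

$$BF_i = \frac{1 - \pi}{\pi} \cdot \frac{p(\delta_i = 1 \mid y)}{1 - p(\delta_i = 1 \mid y)}$$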

BayesCPi

• What thresholds for BF?

• Some people suggest using permutations → too long

• Use a scale adapted from Kass & Raftery (1995), used in QTL detection by Varona et al. (2001, GSE) and Vidal et al. (2005, JAS)

– BF = 3-20 "suggestive"

– BF = 20-150 "strong"

– BF > 150 "very strong"

• We don't need correction for multiple testing (Bonferroni):

– all SNP were introduced at the same time

– And the prior already « penalizes » their estimates
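The Bayes Factor and the scale above can be put together in a small sketch. The function names are hypothetical, `prior_prob` is the prior probability of a SNP being in the model, and the handling of the boundary values 3, 20 and 150 is an assumption, since the scale does not say where the limits themselves fall.

```python
def bayes_factor(post_prob, prior_prob):
    """Posterior odds of being in the model divided by prior odds."""
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


def classify(bf):
    """Label a Bayes Factor using the 3 / 20 / 150 thresholds."""
    if bf > 150.0:
        return "very strong"
    if bf >= 20.0:
        return "strong"
    if bf >= 3.0:
        return "suggestive"
    return "not notable"


# A SNP sampled "in the model" in half of the iterations, prior of 1/100
bf = bayes_factor(post_prob=0.5, prior_prob=0.01)   # 1 / (0.01 / 0.99)
label = classify(bf)
print(round(bf, 2), label)
```

A posterior probability that looks modest can still correspond to a large Bayes Factor when the prior probability is small.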

[Figure: Bayes Factor (Andres$BF) against position (index 0 to 1400), with a « Very strong » label; OAR12, Salle et al. (JAS)]

Lasso (double exponential)

[Figure: double exponential density, dexp(abs(x), 4), for x from -4 to 4]

Often a marker has an almost null effect

Otherwise big effects are not unlikely

$p(a_i \mid \lambda) = \dfrac{\lambda}{2} \exp(-\lambda |a_i|)$

Lasso

Hierarchical representation of Lasso

• y : data

• a : SNP effects

$$y = Xb + Za + e$$

$$p(a_i \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda |a_i|)$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2)$$

[Figure: distribution of SNP effects, double exponential density dexp(abs(x), 4), for x from -4 to 4]

Bayesian Lasso

• This Bayesian Lasso is being used for genomic selection (De los Campos et al., 2009)

• The following is largely from Legarra et al. (Genetical Res., 2011)

• In regular Lasso, λ is typically computed by cross-validation

– which depends strongly on the constitution of the training & validation data sets

– and is tricky to compute

• the Bayesian Lasso (Park & Casella 2008) uses an equivalent hierarchical model:

$$y = Xb + Za + e$$

$$p(a \mid \tau) \sim N(0, D\sigma^2), \quad p(e) \sim N(0, I\sigma^2)$$

$$D = diag(\tau_1^2, \tau_2^2, \ldots, \tau_n^2), \quad p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2} e^{-\lambda^2 \tau_i^2 / 2}$$

$\tau^2$ are « weights » of $\sigma^2$

Tibshirani's BL

• Assume SNP effects have a different « variance » set to $\sigma^2 = 1$

• This is more similar to BLUP_SNP, BayesA, BayesC, etc.

– Equivalent to Tibshirani's original Lasso

$$p(a \mid \tau) \sim N(0, D), \quad p(e) \sim N(0, I\sigma_e^2)$$

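Bayesian Lasso vs. BayesA

$$p(a \mid \tau) \sim N(0, D), \quad p(e) \sim N(0, I\sigma_e^2)$$

$$D = diag(\tau_1^2, \tau_2^2, \ldots, \tau_n^2)$$

$\tau^2$ are « variances » of SNP effects ($\sigma_{a_i}^2$ in BayesA)

Distribution of the variances:

– BL: exponential, $p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2} e^{-\lambda^2 \tau_i^2 / 2}$

– BayesA: inverted chi-squared, $\tau_i^2 \sim \chi^{-2}(\nu, S_a^2)$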

Fortran pseudocode for BL

...

do j=1,niter
  do i=1,neq
    !form lhs
    lhs=xpx(i)/vare+1/tau2(i)
    ...
    sol(i)=val
    ! draw the variance of this SNP
    ss=sol(i)**2
    tau2(i)=1d0/rinvGauss(lambda2/ss,lambda2)
  enddo
  ! draw residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
  ! update lambda
  ...
enddo

Bayesian Lasso

• It gives different weights to larger SNPs

• Mixing is better than BayesCPi

• Performance in Genomic Selection is as good

(Colombani et al., 2013, JDS)

• But there is no clear notion of which SNP is a QTL, and a few papers with "lasso for QTL" are disappointing.

Advice

• Use everything: GBLUP, BayesCPi, Bayesian Lasso

• GBLUP is very good if variances are correct

– If not, do estimate them (G matrix + REML)

• Bayesian methods are sensitive to parameters

– BayesB and BayesCPi are more sensitive than BayesA and Bayesian Lasso

– Seems that the notion of « SNP with no effect » is incorrect

• Multiple marker methods need attention to details: correct priors and initial values, computation time, verification of convergence

• Details are typically overlooked by most users !!!


Quantitative genetics of markers

• Go to notes


Equivalences

• Pedigree (Malécot) relationships assume we have 2N founder alleles

• Then we genotype individual 9

• In this case,

– molecular coancestry = Malécot IBD coancestry

• However SNPs have 2 alleles

– What happens then to these equivalences?

[Figure: pedigree of individual 9, with the founder alleles labelled 1 to 8]

With SNPs…

• Let us imagine that to each one of the 2N founder alleles we assign at random a tag saying if the allele is A or a, with probability p and q=1-p

• Then we genotype 9

• Can we say which ancestral alleles (1 to 8) individual 9 inherited?

[Figure: pedigree with the founder alleles labelled 1 to 8]

With SNPs…

• The molecular coancestry between two individuals i and j, $f_{M_{ij}}$, is the probability that two alleles are equal (alike in state)

• either because they are identical by descent, or

• because they are not identical by descent but equal in the base population.

$$f_{M_{ij}} = p^2 + q^2 + 2pq\,f_{ij}$$

Real results (AMASGEN)

• 9 real French bulls among 1827 genotyped, ~50000 SNPs

• Very complex pedigree, simplified graph:

[Figure: simplified pedigree graph of the 9 bulls]

Pedigree-based relationship

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 1.00 0.51 0.57 0.51 0.26 0.15 0.15 0.14 0.14

[2,] 0.51 1.01 0.30 0.33 0.17 0.17 0.12 0.11 0.11

[3,] 0.57 0.30 1.07 0.30 0.20 0.12 0.18 0.11 0.12

[4,] 0.51 0.33 0.30 1.01 0.17 0.18 0.11 0.11 0.11

[5,] 0.26 0.17 0.20 0.17 1.00 0.56 0.51 0.52 0.53

[6,] 0.15 0.17 0.12 0.18 0.56 1.06 0.31 0.32 0.32

[7,] 0.15 0.12 0.18 0.11 0.51 0.31 1.01 0.30 0.29

[8,] 0.14 0.11 0.11 0.11 0.52 0.32 0.30 1.02 0.30

[9,] 0.14 0.11 0.12 0.11 0.53 0.32 0.29 0.30 1.03

Cousin relationships ~0.125

Little inbreeding

"first G" genomic relationship

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 0.82 0.40 0.43 0.38 0.12 0.04 0.04 0.01 0.10

[2,] 0.40 0.91 0.18 0.24 0.02 0.05 -0.04 -0.04 0.04

[3,] 0.43 0.18 0.88 0.19 0.07 0.00 0.07 -0.02 0.05

[4,] 0.38 0.24 0.19 0.86 0.02 -0.01 -0.02 0.01 0.03

[5,] 0.12 0.02 0.07 0.02 0.73 0.34 0.30 0.31 0.35

[6,] 0.04 0.05 0.00 -0.01 0.34 0.85 0.15 0.14 0.18

[7,] 0.04 -0.04 0.07 -0.02 0.30 0.15 0.80 0.14 0.17

[8,] 0.01 -0.04 -0.02 0.01 0.31 0.14 0.14 0.80 0.17

[9,] 0.10 0.04 0.05 0.03 0.35 0.18 0.17 0.17 0.85

Relationships among cousins are ~0

Less than 1 in the diagonal

Negative coefficients

$$G = \frac{ZZ'}{2\sum_{all\ SNPs} p_i(1 - p_i)}$$
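How such a matrix is built, and why negative coefficients appear, can be seen in a minimal sketch. The function name and the genotypes are invented; the sketch assumes genotypes coded 0/1/2 and centred with allele frequencies observed in the same animals.

```python
def genomic_relationship(M):
    """G = ZZ' / (2 sum p(1-p)), with Z the centred 0/1/2 genotype matrix."""
    n, m = len(M), len(M[0])
    # observed allele frequency of each SNP
    p = [sum(M[i][k] for i in range(n)) / (2.0 * n) for k in range(m)]
    # centre each genotype by twice the allele frequency
    Z = [[M[i][k] - 2.0 * p[k] for k in range(m)] for i in range(n)]
    denom = 2.0 * sum(pk * (1.0 - pk) for pk in p)
    G = [[sum(Z[i][k] * Z[j][k] for k in range(m)) / denom
          for j in range(n)] for i in range(n)]
    return G, p


# Toy genotypes: 4 animals, 3 SNPs
M = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1],
     [1, 2, 1]]
G, p = genomic_relationship(M)
total = sum(sum(row) for row in G)
for row in G:
    print([round(v, 3) for v in row])
```

Because each column of Z sums to zero when observed frequencies are used, the elements of G sum to zero, so some off-diagonal coefficients have to be negative.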


Genomic relationship matrix G

$$p(u_2) = N(0, G\sigma_u^2)$$

Assume that G is computed according to VanRaden (2008), using observed allelic frequencies

This implies that the average BV of genotyped individuals ($u_2$) is 0

This is possibly NOT the case if there is SELECTION

An improved G

Relative to the pedigree base population, the average BV of genotyped individuals ($u_2$) has a value possibly different from 0, say µ

$$p(u_2 \mid \mu) = N(1\mu, G\sigma_u^2)$$

µ is random because of finite size (drift)

$$p(\mu) = N(0, \alpha\sigma_u^2)$$

so that

$$p(u_2 \mid \alpha) = N(0, (G + 11'\alpha)\sigma_u^2) = N(0, G^*\sigma_u^2)$$

i.e. substituting G by G* = G + 11'α

How to find the value for α?

µ is the average BV of genotyped individuals: $\mu = \frac{1}{n} 1'u_2$

µ from either pedigree or single-step:

$$p_p(\mu) = N\left(0, \frac{1}{n^2}\, 1'A_{22}1 \, \sigma_u^2\right), \quad p_s(\mu) = N\left(0, \frac{1}{n^2}\, 1'(G + 11'\alpha)1 \, \sigma_u^2\right)$$

Assume traditional BLUP is unbiased.

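How to find the value for α?

µ from either pedigree or single-step:

$$Var_p(\mu) = \sigma_u^2 \, \frac{1}{n^2} \sum_i \sum_j A_{22\,i,j}, \quad Var_s(\mu) = \sigma_u^2 \left(\alpha + \frac{1}{n^2} \sum_i \sum_j G_{i,j}\right)$$

(as the 1'…1 are simply summations)

If we equate both variances of µ:

$$\alpha = \frac{1}{n^2}\left(\sum_i \sum_j A_{22\,i,j} - \sum_i \sum_j G_{i,j}\right)$$

α is simply the difference between the means of $A_{22}$ and G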
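A small numerical sketch of this adjustment follows. The function names and the two toy matrices are invented; `alpha` is the name used here for the constant added to every element of G.

```python
def mean_of(M):
    """Mean of all elements of a square matrix."""
    n = len(M)
    return sum(sum(row) for row in M) / (n * n)


def add_base_constant(G, A22):
    """Return G* = G + 11'alpha with alpha = mean(A22) - mean(G)."""
    alpha = mean_of(A22) - mean_of(G)
    G_star = [[g + alpha for g in row] for row in G]
    return G_star, alpha


A22 = [[1.0, 0.5],
       [0.5, 1.0]]     # pedigree relationships of the genotyped animals
G = [[0.9, 0.1],
     [0.1, 0.8]]       # genomic relationships of the same animals
G_star, alpha = add_base_constant(G, A22)
print(round(alpha, 3), round(mean_of(G_star), 3), round(mean_of(A22), 3))
```

After the adjustment the mean of G* equals the mean of the pedigree relationships, which is the condition obtained by equating the two variances of the average breeding value.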

What does α mean?

$$\alpha = \frac{1}{n^2}\left(\sum_i \sum_j A_{22\,i,j} - \sum_i \sum_j G_{i,j}\right)$$

α accounts for the fact that $u_2$ are related through pedigree more than G is able to reflect

This is because we do not know base allelic frequencies to construct the 'correct' G


From Wright's $F_{ST}$, another interpretation of α is possible...

The $F_{ST}$ can be defined as the mean relationship between gametes in a recent population with respect to an older base population

$$F_{old} = F_{new} + (1 - F_{new})\,F_{ST}$$

Powell et al., 2010

$A_{22}$ involves relationships of genotyped individuals with reference to the base population, and G corresponds to relationships within the current population. Consequently, α is equal to twice $F_{ST}$

$$G^* = \left(1 - \tfrac{1}{2}\alpha\right)G + 11'\alpha, \qquad F_{ST} = \tfrac{1}{2}\left(mean(A_{22}) - mean(G)\right)$$

Which G must we use ?

– $G^* = G + 11'\alpha$: 1st moment (mean) of $u_2$; sum(G) = sum($A_{22}$)

– $G^* = \dfrac{trace(A_{22})}{trace(G)}\,G$: 2nd moment (variance) of $u_2$; AvgD(G) = AvgD($A_{22}$)

– $G^* = (1 - \frac{1}{2}\alpha)G + 11'\alpha$: mean & variance of $u_2$ (assumption of random mating); both AvgD(G) = AvgD($A_{22}$) and sum(G) = sum($A_{22}$)

Which G must we use ?

– $G^* = (1 - \frac{1}{2}\alpha)G + 11'\alpha$: mean & variance of $u_2$ (assumption of random mating)

– $G^* = aG + 11'b$: mean & variance of $u_2$ (no random mating); both AvgD(G) = AvgD($A_{22}$) and sum(G) = sum($A_{22}$); preGSf90; Christensen et al., 2012

Ex. real pig population: (1 − 0.5α) = 0.851, a = 0.859

GBLUP

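• GBLUP is a « BLUP » constructed with G so defined

– Substitute G for A

• As in regular BLUP, we can include animals with genotype but without phenotype

$$G = \frac{ZZ'}{2\sum p_i(1 - p_i)}$$

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{g} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$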

GBLUP issue

• Strandén & Christensen (2011) showed that G constructed with « centered » coding is positive semi-definite (singular, has no inverse)

• In dairy cattle, we typically use

$$G = 0.99\,\frac{ZZ'}{2\sum p_i(1 - p_i)} + 0.01\,A$$

or something similar: the first term holds the « pure » genomic relationships, the second the pedigree relationships

GBLUP issue

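• We can use equations for singular G which don't require inversion (Harville, 1976; Henderson, 1984)

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ G\sigma_u^2 W'R^{-1}X & G\sigma_u^2 W'R^{-1}W + I \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ G\sigma_u^2 W'R^{-1}y \end{pmatrix}$$

or

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}WG\sigma_u^2 \\ G\sigma_u^2 W'R^{-1}X & G\sigma_u^2 W'R^{-1}WG\sigma_u^2 + G\sigma_u^2 \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{g} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ G\sigma_u^2 W'R^{-1}y \end{pmatrix}, \qquad \hat{u} = G\sigma_u^2\,\hat{g}$$

(This has been reinvented many times: Misztal et al., 2009; VanKaam, 2012; RKHS: De los Campos et al., 2009; etc)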

GBLUP

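• GBLUP gives identical results to BLUP_SNP if we fit equivalent variances in both

BLUP_SNP:

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + I\sigma_a^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}$$

GBLUP (W = I because all animals with genotype have phenotype):

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I \\ I'R^{-1}X & I'R^{-1}I + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ I'R^{-1}y \end{pmatrix}$$

with

$$G = \frac{ZZ'}{2\sum p_i(1 - p_i)}, \qquad \sigma_u^2 = 2\sum_{all\ SNPs} p_i(1 - p_i)\,\sigma_a^2$$

$$\hat{u}_{from\ GBLUP} = Z\,\hat{a}_{from\ BLUP\_SNP}$$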

GBLUP

• In BLUP_SNP, (young) animals without phenotype do NOT enter into SNP estimation

– We get their EBV's as $\hat{u}_{young} = Z_{young}\hat{a}$

• In GBLUP we have three options which give the same result:

1. Include them in the analysis with no record of their own (as in classical pedigree BLUP)

• EBV's $\hat{u}_{young}$ are obtained in the solutions

2. Use multivariate normality (selection index stuff):

• $\hat{u}_{young} = G_{young,old}\,G_{old,old}^{-1}\,\hat{u}_{old}$

3. Backsolve for SNP effects and then use $\hat{u}_{young} = Z_{young}\hat{a}$

GBLUP with more animals than phenotypes

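Let $y = Xb + Wu + e$, with $u = \begin{pmatrix} u_{old} \\ u_{young} \end{pmatrix}$ and, e.g., $W = \begin{pmatrix} I & 0 \end{pmatrix}$ (only the « old » animals have data)

Let genotypes be $Z = \begin{pmatrix} Z_{old} \\ Z_{young} \end{pmatrix}$, then $G = \dfrac{ZZ'}{2\sum pq} = \begin{pmatrix} G_{old} & G_{old,young} \\ G_{young,old} & G_{young} \end{pmatrix}$

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$

gives the same solutions for $\hat{u}_{old}$ as

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I \\ I'R^{-1}X & I'R^{-1}I + G_{old}^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u}_{old} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ I'R^{-1}y \end{pmatrix}$$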

GBLUP

• We can jump from GBLUP to BLUP_SNP

SNP effects from GEBV's (Henderson, 1973; Strandén and Garrick, 2009):

$$\hat{a} = DZ'G^{-1}\sigma_u^{-2}\,\hat{u}$$

i.e. (covariance SNPs-BVs) × (variance BVs)$^{-1}$; usually

$$\hat{a} = \frac{Z'G^{-1}\hat{u}}{2\sum p_i q_i}$$

GEBV's from SNP effects:

$$\hat{u} = Z\hat{a}$$

Multiple trait GBLUP

Introducing multiple traits is so well known that nobody cared to publish it

Let genetic and residual covariances be the n × n matrices (n = number of traits) $G_0$ and $R_0$; then multiple trait GBLUP is (e.g. Henderson, 1984; Mrode & Thompson 2005),

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1} \otimes G_0^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$

where $R = I \otimes R_0$.

All models fitted in BLUP fit in GBLUP

GBLUP

Some advantages of GBLUP:

• It fits nicely into existing BLUP software

• …and into existing theory (REML, maternal effects, test-day…multiple traits…Single Step)

• Provides measures of accuracy from the inverse of the LHS

• Accommodates all animals

Drawbacks:

• Can't easily accommodate major genes (unless using weights in the construction of G → see later)

• Computation of G and inversion might be challenging

Caveat

• By defining a genomic relationship matrix, we define a genetic base

– All inference will refer to this genetic base. Quoting Strandén & Christensen http://www.gsejournal.org/content/43/1/25 :

The bad news

« Reliabilities of estimated genomic breeding values calculated using elements of the

inverse of the coefficient matrix depend on the allele coding because different allele

coding methods imply different models » [The same applies for reliabilities computed

from any method fitting SNP effects like BayesA, etc.]


GREML, G-Gibbs…

Use of G to estimate variance components…

It can be done with remlf90, gibbs*f90, AsReml, TM…

The result will refer to an ideal population with whatever allelic frequencies we introduced in the computation of G.

Single Step GBLUP


Why 2-step procedure

• y = µ + Za + e

– y = data

– Z = incidence matrix of marker effects

– a = marker effects

– e = residuals
• Most often, genotyped animals (bulls) do not have data (trait record)

• Further, most animals with phenotype are not genotyped (e.g. cows)

• This limits practical applications

• Need to get pseudo-data for genotyped animals


Pseudo-data

• So we need pseudo-data

• EBV’s

• DYD’s

4

Pseudo-data

• EBV's

• The problem with EBV's is that they already share information among individuals

• e.g., a dam's EBV = own yield + parent average + progeny contribution

• But we are including information of the sire in the cow, yet not all SNPs of the sire are in the cow

Pseudo-data

• Also, EBV’s are correlated

• The correlation depends on the amount of data and distribution across fixed effects and families

• EBVs of two cows are correlated, if they belong to the same herd, even if they are not related

• EBVs of two bulls are correlated if they have daughters in the same herds

• This is not serious in dairy cattle, but might be, e.g., in swine

$u \mid y \sim N(\hat{u}, C^{uu})$

Pseudo-data

• DYD's avoid part of these problems (VanRaden & Wiggans, 1991)

• DYD = daughter yield deviation

• Record of the daughter, corrected by environmental effects and dam’s EBV

• Thus DYD = 0.5 BV sire + mendelian sampling

• E(DYD)=0.5 BV sire

• YD’s exist for cows

– YD = record –environmental effects


Pseudo-data

Problems of DYD’s / YD’s

• YD's are not very reliable and are subject to preferential treatment

• DYD's are not reliable for many species (sheep, swine)

• Hard to define for some species/traits (maternal effects)

• Extremely complex procedure

• Loss of generality

8

Proposals for overall relationship matrix

(Legarra et al., 2009 JDS 92:4656; Christensen & Lund, GSE 42:2; Misztal et al., JDS 92:4648; Aguilar et al. JDS 93:743)

• Not a big loss in assuming normality for SNP effects (VanRaden et al. JDS 92:19; Hayes et al. JDS 92:433)

• G is then easy to construct

• Can we include G in the relationship matrix?

• If we construct an overall relationship matrix with good properties, then we can just do BLUP with all data and animals


• Things would be simple if we had genomic relationships for everyone (Legarra et al., 2009)

• Things would be simple if we could add genotypes for all animals (Christensen et al., 2010)


Single Step as a missing data problem

• We can see genotype as a missing data problem (Christensen & Lund, 2010)

• Use the prediction and the distribution of the prediction (otherwise the procedure does not work)

12

Missing data

Fill-in missing data: data augmentation

• « data augmentation refers to a scheme of augmenting the observed data so as to make it more easy to analyze » (Tanner & Wong, 1987)

– Two flavors: EM and Bayesian (posterior distributions)

– For instance: pretending (temporarily) that you know the true BV's simplifies REML → EM-REML, or provides full conditionals for Gibbs

• Augmenting = adding genotypes

Inferring genotypes

• Genotypes in some individuals can be inferred, but only to some extent

• This is feasible for key individuals (ancestors with many progeny genotyped)

• Or by imputing data from parents into an animal genotyped with a SNP chip

• Typically done using « LD » patterns

• FImpute, findhap, AlphaImpute, Beagle, etc.

• These methods do not extend well to non-genotyped individuals

Example:

By simulation, they know that…

• Accuracy of prediction of genotype is quite good, but not perfect

• They don't even try « far » animals

• Need a simpler way for « far » animals

Inferring genotypes

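• There is Gengler's gene content prediction (J. Dairy Sci. 91:1652)

• Linear approximation to the imputation problem

• This method can be applied to any member of a pedigree

Let

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

with 1: non genotyped and 2: genotyped. Then

$$\hat{z}_{\text{non genotyped}} = E(z_{\text{non genotyped}} \mid z_{\text{genotyped}}) = A_{1,2} A_{2,2}^{-1} z_{\text{genotyped}}$$

$$Var(\hat{z}_{\text{non genotyped}}) = Var(z_{\text{non genotyped}} \mid z_{\text{genotyped}}) = \left(A_{1,1} - A_{1,2} A_{2,2}^{-1} A_{2,1}\right) 2pq$$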

Inferring genotypes

• Instead of working with individual SNP effects, we will define

– u=Za

– i.e., the genetic value is the sum of SNP effects

– We’re not really interested in a themselves but in u (we know from GBLUP that we can jump from one to the other)

– Moreover, we’re interested in the distribution of u’s, so that we can compute their covariances and put them into the MME


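Christensen & Lund key idea:

$$u = \begin{pmatrix} u_g \\ u_{ng} \end{pmatrix} = \begin{pmatrix} Z_g \\ Z_{ng} \end{pmatrix} a$$

(u: breeding values; a: SNP effects; ng: « non genotyped », g: « genotyped »)

Christensen & Lund use $Var(A) = E(Var(A \mid B)) + Var(E(A \mid B))$ to consider the prediction of the genotype and its variance, with $E(Z_{ng} \mid Z_g)$ and $Var(Z_{ng} \mid Z_g)$ taken from Gengler's results.

Resulting in:

$$Var(u) = \begin{pmatrix} Z_g \\ \hat{Z}_{ng} \end{pmatrix} Var(a) \begin{pmatrix} Z_g' & \hat{Z}_{ng}' \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & Var(\hat{Z}_{ng})\, Var(a) \end{pmatrix}$$

Re-create GBLUP… (scaling $1 / 2\sum p_i q_i$)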

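$$Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$

Covariances of all animals (Christensen & Lund, 2010)

1: « non genotyped » 2: « genotyped »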

• Incredibly: $H^{-1}$ is very simple:

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

– $A^{-1}$: inverse of the regular pedigree relationship matrix

– $G^{-1}$: correcting for genomic relationships…

– $-A_{22}^{-1}$: …and avoiding « double counting »

• Things would be simple if we had genomic relationships for everyone (Legarra et al., 2009)

• Things would be simple if we could add genotypes for all animals (Christensen et al., 2010)

Overall modification

• Look at A as a « prior » relationship and at G as an « observed » relationship

– G is observed for some individuals only, whose « a priori » relationship matrix was A22

• Try to construct a « posterior » relationship matrix

Joint distributions

Unconditional distribution of genetic values of genotyped individuals (after seeing their genotypes!):

$$p(u_2) = N(0, G)$$

Conditional distribution of non-genotyped individuals (because they have no genotypes, this depends only on pedigree):

$$p(u_1 \mid u_2) = N\left(A_{12}A_{22}^{-1}u_2,\; A_{11} - A_{12}A_{22}^{-1}A_{21}\right)$$

Joint distribution:

$$p(u_1, u_2) = p(u_2)\, p(u_1 \mid u_2)$$

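Joint distributions

…for those inclined to algebra

$$
\begin{aligned}
p(u_1, u_2) &= p(u_1 \mid u_2)\, p(u_2) \\
&\propto \exp\left[-0.5\,(u_1 - A_{12}A_{22}^{-1}u_2)'\, A^{11}\, (u_1 - A_{12}A_{22}^{-1}u_2)\right] \exp\left[-0.5\, u_2' G^{-1} u_2\right] \\
&= \exp\left\{-0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & -A^{11}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A^{11} & G^{-1} + A_{22}^{-1}A_{21}A^{11}A_{12}A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}\right\} \\
&= \exp\left\{-0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} + G^{-1} - A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}\right\}
\end{aligned}
$$

where $A^{ij}$ are the blocks of $A^{-1}$. The first exponential is the prediction of non genotyped from genotyped; the second holds the 'genomic' relationships.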

Joint distributions

$$p(u_2) = N(0, G)$$

$$p(u_1 \mid u_2) = N\left(A_{12}A_{22}^{-1}u_2,\; A_{11} - A_{12}A_{22}^{-1}A_{21}\right)$$

$$Var(u_2) = G$$

$$Var(u_1) = A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21}$$

because Var(Xt) = XVar(t)X'

$$Cov(u_1, u_2) = A_{12}A_{22}^{-1}G$$

because Cov(Xt,t) = XVar(t)

$$Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$

(first row and column: non genotyped; second row and column: genotyped)

Covariances of all animals (Legarra et al. 2009; Aguilar et al., 2010; Christensen & Lund, 2010)

• $G$ comes from genotypes

• $A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21}$: this is the variance of prediction of genotypes from genotyped to non-genotyped

• $A_{11} - A_{12}A_{22}^{-1}A_{21}$: this is the error in the prediction

• $A_{12}A_{22}^{-1}G$: the prediction « generates » a covariance

Overall modification: example



This is the regular relationship matrix. Assume now that animals 9 to 12 have a genomic relationship of 0.7

These parents are now related

This guy is now inbred

G
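The effect described on this slide can be checked by hand on a much smaller case than the slide's pedigree. The sketch below is only an illustration under stated assumptions: a hypothetical trio (one non-genotyped progeny, its two unrelated genotyped parents) and a made-up genomic relationship of 0.7 between the parents. The helper names (`matmul`, `inv2`, `add`) are not from any BLUPF90 program.

```python
def matmul(a, b):
    """Plain-list matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add(a, b, sign=1.0):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical trio: 1 = non genotyped progeny, 2 = its two genotyped parents
A11 = [[1.0]]                    # progeny is not inbred according to pedigree
A12 = [[0.5, 0.5]]               # progeny-parent pedigree relationships
A22 = [[1.0, 0.0], [0.0, 1.0]]   # parents unrelated according to pedigree
G = [[1.0, 0.7], [0.7, 1.0]]     # assumed genomic relationship of 0.7

B = matmul(A12, inv2(A22))                      # A12 A22^-1
H12 = matmul(B, G)                              # A12 A22^-1 G
H11 = add(add(A11, matmul(B, transpose(A12)), -1.0),
          matmul(matmul(B, G), transpose(B)))   # A11 - B A21 + B G B'

print("H11 =", H11)   # diagonal of the progeny in H
print("H12 =", H12)   # covariance of the progeny with each parent
```

In this toy case the diagonal of the progeny rises above 1 (inbreeding of about half the parents' genomic relationship), which is the same mechanism the slide points at: related parents make the progeny inbred.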

• Incredibly: $H^{-1}$ is very simple:

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

– $A^{-1}$: inverse of the regular pedigree relationship matrix

– $G^{-1}$: correcting for genomic relationships…

– $-A_{22}^{-1}$: …and avoiding « double counting »

Single step GBLUP

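$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + H^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

W: incidence matrix of animals on data

A: pedigree relationship matrix

G: this G could be any matrix describing « genomic » covariances of breeding values; it does not restrict to VanRaden's (2008) GBLUP

A22: pedigree matrix among genotyped individuals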

Single step GBLUP

• So the Single Step GBLUP is like regular BLUP changing one small submatrix !!!

• It is almost too simple to be true…


Some properties of H

• Always positive semi-definite

• Positive definite & invertible iff G is invertible

• In practice, if G is too different (wrong pedigree or genotyping) from A22, this gives lots of numerical problems

• If everyone is genotyped, Single Step is GBLUP

• If no one is genotyped, Single Step is BLUP


Single Step as an extra random effect

• Legarra & Ducrocq, 2012

• Decompose Breeding Values in « classical » and deviations:

– $Var(u_2) = G$

– $u_2 = u_2^* + d_2$, $Var(u_2^*) = A_{22}$, $Var(d_2) = G - A_{22}$

• Regress, using pedigree, deviations for « ungenotyped » individuals:

– $d_1 = A_{12}A_{22}^{-1}d_2$; $Var(d_1) = A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21}$

• After quite some algebra, you get to the same results

Alternative derivations

• Why do they all agree?

• Some use genotypes, some use breeding values in the algebra.

• Because breeding values = sum of genotypes, the same rules for variances and covariances apply and the derivation is identical

• So far SSGBLUP is the most serious option for a general method for genomic evaluations in practice

• Large body of practical results in dairy cattle & sheep, poultry, swine

• Besides our group, USA, NZ, DK, Fin, are giving SSGBLUP serious tests


Problems of SSGBLUP

Most of these problems exist for the other methods (BayesA etc.)

• Assumption p(u2)=N(0,G)

– If there is selection, mean is not 0 (« tuning » solves it: see Vitezica later)

• Same genetic variance in genotyped and ungenotyped animals

– solved with « tuning »

• Non normality (i.e. major genes)

– Can be solved using G=ZDZ' with « weights » for SNP (Legarra et al., 2011; Zhang et al., 2010; Wang et al., Gen. Res. 2012)

– unclear for multiple traits (but also in other methods like BayesB)

• Assumption that « A » is fair. This is false if:

– pedigree is incorrect

– distant relationships are too different from reality (Hill & Weir 2010)

– solution: cut pedigrees that are too long

• Unknown parent groups / several breeds

– Need to modify H to include them (Misztal et al., 2013)

SSGBLUP vs. rest

| | SSGBLUP | GBLUP, BayesA | Non parametric |
|---|---|---|---|
| Deregressions from « regular BLUP » | Not needed | Complex | Complex |
| Bias due to genomic selection | Immune | Affected | Affected |
| Bias due to classic selection | Immune | Affected | Affected |
| Major genes | Complex | OK | OK |
| Long MCMC | No | Yes (except GBLUP) | No |
| Matrix inversions | Yes (but work in progress) | No (except GBLUP) | No |
| Computation of accuracies | Complex (but work in progress) | Easy (provided deregressions were OK) | Undefined |
| Multiple trait | Straightforward (if no major genes) | Easy for GBLUP (if no major genes); complex otherwise | ??? |
| Get marker effects | Yes (after backsolving) | Yes | Sometimes |

Computing stuff

• Computing $G^{-1}$ and $A_{22}^{-1}$ is a challenge.

– perhaps only in dairy cattle?

• But see Ignacio's talk

• Future strategies (Legarra & Ducrocq, JDS 2012)

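• Unsymmetric Single Step

$$
\begin{pmatrix}
X'R^{-1}X & X'R^{-1}W_1 & X'R^{-1}W_2 & 0 & 0 \\
W_1'R^{-1}X & W_1'R^{-1}W_1 + A^{11}\sigma_u^{-2} & W_1'R^{-1}W_2 + A^{12}\sigma_u^{-2} & 0 & 0 \\
W_2'R^{-1}X & W_2'R^{-1}W_1 + A^{21}\sigma_u^{-2} & W_2'R^{-1}W_2 + A^{22}\sigma_u^{-2} & -I\sigma_u^{-2} & I\sigma_u^{-2} \\
0 & 0 & -I & A_{22} & 0 \\
0 & 0 & -I & 0 & G
\end{pmatrix}
\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \\ \hat{l} \\ \hat{\gamma} \end{pmatrix}
=
\begin{pmatrix} X'R^{-1}y \\ W_1'R^{-1}y \\ W_2'R^{-1}y \\ 0 \\ 0 \end{pmatrix}
$$

This can be computed efficiently because we use $G$, NOT $G^{-1}$, and $A_{22}$, not $A_{22}^{-1}$. We don't even need to construct them.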

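• Iterative Single Step

Solve

$$
\begin{pmatrix}
X'X & X'W_1 & X'W_2 \\
W_1'X & W_1'W_1 + A^{11}\sigma_u^{-2} & A^{12}\sigma_u^{-2} \\
W_2'X & A^{21}\sigma_u^{-2} & W_2'W_2 + A^{22}\sigma_u^{-2}
\end{pmatrix}
\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}
=
\begin{pmatrix} X'y \\ W_1'y \\ W_2'y \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ (\hat{l} - \hat{\gamma})\sigma_u^{-2} \end{pmatrix}
\qquad (9)
$$

Solve $A_{22}\hat{l} = \hat{u}_2$ and $G\hat{\gamma} = \hat{u}_2$ for $\hat{l}$ and $\hat{\gamma}$ (10)

New solutions are a weighted average of the solutions to (9-10) and the former solutions at the previous iteration:

$$
\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{t}
= w \begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{*}
+ (1 - w) \begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{t-1},
\quad
\hat{l}^{t} = w\,\hat{l}^{*} + (1 - w)\,\hat{l}^{t-1}
\quad \text{and} \quad
\hat{\gamma}^{t} = w\,\hat{\gamma}^{*} + (1 - w)\,\hat{\gamma}^{t-1}
$$

with w<=1, where * marks the solution of MME (9), t-1 the old solution and t the new solution.

This can be done efficiently as well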

Results

• (Using simulated data) all strategies arrived at the same solution

• Some can be converted to iteration on data procedures

• Reasonable computing time:

– 2 s « regular » (with G and A22 already inverted)

– 47 s « unsymmetric »

– 286 s « iterative »

[Figure: Convergence. Convergence criterion (0 to -12) against iteration (0 to 200) for Regular SSGBLUP, Unsymmetric extended SSGBLUP and Iterative SSGBLUP]

More results

• Lacaune dairy sheep

• 5000 individuals (males) genotyped, 1 500 000 animals

• ~3 000 000 equations

[Figure: log10(convergence) (-5 to -15) against iteration (0 to 5000) for Regular Single Step and Unsymmetric equations]

• Livestock paper for more details

Forming Single-step mixed model equation and quality control

Ignacio Aguilar

Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay

[email protected]

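Single-Step to genomic evaluation

• Traditional genetic evaluation

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha A^{-1} \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$

• Single-step genomic evaluation

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$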

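Multiple-step Genomic Selection

[Diagram: Pedigree and Records "Y" go into BLUP, giving EBV and pseudo observations (de-regressed EBVs); pseudo observations and SNPs go into BayesX, GBLUP, etc.; the final index combines GEBV*w1, PA*w2 and SPA*w3]

Single-Step Genomic Selection

[Diagram: Pedigree, SNPs and Records "Y" go into a single BLUP, giving EBVs]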

Single-Step evaluation

• Unified approach with pedigree, phenotypic and genomic marker information considered simultaneously

• Pedigree-based relationships augmented by genomic relationship matrix (Misztal et al. 2009)

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix} \begin{pmatrix} \hat{\beta} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$

$$H = A + A_{\Delta}$$

$A$ - conventional numerator relationship matrix

$A_{\Delta}$ - matrix modified to account for genomic relationships

Single step genomic evaluation

• Inverses

– Numerator relationship matrix

– Pedigree relationships between genotyped animals

– Genomic relationships

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

Aguilar et al., 2010

Christensen & Lund, 2010

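Matrix-vector operations in PCG with genomic information

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$

$$LHS \cdot p = \begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix} \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}$$

is accumulated as the sum of three contributions:

• Contributions due to records: $\begin{pmatrix} X'Xp_1 + X'Zp_2 \\ Z'Xp_1 + Z'Zp_2 \end{pmatrix}$

• Contributions due to relationships: $\begin{pmatrix} 0 \\ A^{-1}\alpha\, p_2 \end{pmatrix}$

• Contributions due to genomics: $\begin{pmatrix} 0 \\ 0 \\ (G^{-1} - A_{22}^{-1})\alpha\, p_{2g} \end{pmatrix}$, added only to the rows of genotyped animals ($p_{2g}$ is the part of $p_2$ for genotyped animals)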
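The decomposition above means that $H^{-1}$ never has to be stored: the genomic part is a small dense block added to the rows of genotyped animals. The following is a minimal sketch of that idea with plain Python lists; the function name `hinv_times`, the index list `genotyped` and the toy numbers are assumptions for illustration, not the code used inside the BLUPF90 programs.

```python
def matvec(m, v):
    """Dense matrix-vector product with plain lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def hinv_times(p2, Ainv, GimA22i, genotyped, alpha):
    """Return alpha * H^-1 * p2 without forming H^-1.

    Ainv      : inverse of the pedigree relationship matrix (all animals)
    GimA22i   : inv(G) - inv(A22), rows/columns in the order of `genotyped`
    genotyped : positions of the genotyped animals inside p2
    """
    out = [alpha * x for x in matvec(Ainv, p2)]         # due to relationships
    p2g = [p2[i] for i in genotyped]                    # gather genotyped part
    for i, x in zip(genotyped, matvec(GimA22i, p2g)):   # due to genomics
        out[i] += alpha * x
    return out

# Toy example: 3 animals, animals at positions 1 and 2 are genotyped
Ainv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
GimA22i = [[0.2, -0.1], [-0.1, 0.2]]
result = hinv_times([1.0, 2.0, 3.0], Ainv, GimA22i, [1, 2], alpha=2.0)
print(result)
```

In a real evaluation $A^{-1}$ would be sparse and handled by its own routine; the dense `Ainv` here only keeps the sketch short.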

Extra matrices required for single-step

• Inverses

– Pedigree relationships between genotyped animals

– Genomic relationships

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

OPTIONS – BLUPF90 parameter file

• Genomic programs

– controlled by adding OPTION commands to the parameter file

– OPTION SNP_file marker.geno.clean

– Read 2 files:

• marker.geno.clean

• marker.geno.clean.XrefID

Printout: Same heading as other programs

• All options that were entered in the parameter file should be here !! If not, check that keywords are correct (upper and lower case)

• Check number of animals and individuals with genotypes

Printout

• Information from genotype file.

• The format is detected from the first line !!! So all genotypes should start in the same column !!!

• Number of SNP is also determined by the first line !!

Output Files

• GimA22i

– Stores the content of inv(G) – inv(A22)

– Only when running preGSf90, not in application programs

• freqdata.count

– Contains the estimated allele frequency before QC

• freqdata.count.after.clean

– Contains allele frequencies as used in calculations, and the remove code

– For removed SNP these will be zero

• Gen_call_rate

– List of animals removed by low call rate

• Gen_conflicts

– Report of animals with Mendelian conflicts

Quality control. By default exclude:

• MAF

– SNP with MAF < 0.05

• Call rate

– SNP with call rate < 0.90

– Individuals with call rate < 0.90

• Monomorphic

– Exclude monomorphic SNP. ONLY when MAF <> 0

• Parent-progeny conflicts (SNP & Individuals)

– Exclusion -> opposite homozygous

– For SNP: > 10% of parent-progeny exclusions from the total of pairs evaluated

– For Individuals: > 2% of parent-progeny exclusions from the total number of SNP
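As a rough illustration of the MAF and SNP call-rate filters, the sketch below computes both statistics per SNP from genotype codes 0/1/2 with 5 for missing. It is not the preGSf90 implementation: the function names are hypothetical, and computing the allele frequency from called genotypes only is an assumption. The default thresholds are the ones listed above.

```python
MISSING = 5  # missing-genotype code used in the SNP file

def snp_stats(genotypes):
    """Per-SNP (call rate, MAF) from rows of 0/1/2/5 genotype codes."""
    n = len(genotypes)
    stats = []
    for j in range(len(genotypes[0])):
        called = [row[j] for row in genotypes if row[j] != MISSING]
        call_rate = len(called) / n
        # allele frequency from called genotypes only (assumption)
        p = sum(called) / (2 * len(called)) if called else 0.0
        stats.append((call_rate, min(p, 1 - p)))
    return stats

def failing_snps(genotypes, minfreq=0.05, callrate=0.90):
    """Indices (0-based) of SNP that fail the MAF or call-rate threshold."""
    return [j for j, (cr, maf) in enumerate(snp_stats(genotypes))
            if cr < callrate or maf < minfreq]

# Toy data: 4 animals x 4 SNP
genotypes = [
    [0, 1, 2, 5],
    [1, 1, 2, 5],
    [2, 0, 2, 0],
    [1, 1, 2, 1],
]
flagged = failing_snps(genotypes)
print(flagged)  # SNP 2 is monomorphic, SNP 3 has a call rate of 0.5
```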

Control default values

• For MAF

– OPTION minfreq x

• Call rate

– OPTION callrate x

– OPTION callrateAnim x

• Mendelian conflicts

– OPTION exclusion_threshold x

– OPTION exclusion_threshold_snp x

Parent-progeny conflicts

• Presence of these conflicts results in an H matrix that is not positive definite !!!

• Problems in estimation of variance components by REML, programs do not converge, etc.

• Solution:

– Report all conflicts, with counts for each individual as parent or progeny to trace the conflicts

– Remove progeny genotype

• maybe not the best option

• But results in a positive-definite H matrix !!!

Parent-progeny conflicts

• OPTION verify_parentage x

– 0: no action

– 1: only detect

– 2: detect and search for an alternate parent; no change to any file. Not yet implemented

– 3: detect and eliminate progenies with conflicts (default)
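The detection step relies on the "opposite homozygous" rule from the quality-control slide. A minimal sketch of that rule for one parent-progeny pair is shown below; the function name and the toy genotypes are hypothetical, and the 2% figure is the default individual threshold quoted earlier.

```python
def opposite_homozygous(parent, progeny, missing=5):
    """Count SNP where parent and progeny are opposite homozygotes (0 vs 2).

    Returns (number of conflicts, number of SNP compared); SNP with a
    missing call in either animal are skipped.
    """
    conflicts = compared = 0
    for a, b in zip(parent, progeny):
        if a == missing or b == missing:
            continue
        compared += 1
        if {a, b} == {0, 2}:
            conflicts += 1
    return conflicts, compared

parent = [0, 2, 1, 0, 5, 2]
progeny = [2, 2, 1, 0, 0, 0]
conflicts, compared = opposite_homozygous(parent, progeny)
rate = conflicts / compared
print(conflicts, compared, rate > 0.02)  # this pair would be flagged
```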

SNP map file (optional)

• OPTION chrinfo xxxx

• For some genomic analyses (GWAS) or checks

• Format:

– snp number

• Index number of SNP in the sorted map by chr and position

– chromosome number

– position

• First row corresponds to first column SNP in genotype file !!!

Other Options

• If OPTION chrinfo is provided, we can exclude selected chromosomes:

– OPTION excludeCHR n1 n2 n3 ...

• or inform which are sex chromosomes:

– OPTION sex_chr n

– Chr > n will be excluded only for checks (e.g. parent-progeny), but not in calculations

Saving ‘clean’ files

• SNP excluded from QC are set as missing (i.e. Code=5)

• Excluded individuals are treated as unrelated in G and A22

– For individual i: G[i,:] = 0; G[:,i] = 0; G[i,i] = 1; same for A22, so G-A22 will cancel out

• OPTION saveCleanSNPs

• Save clean genotype data with excluded SNP and individuals

– For example, for a SNP_file gt

– Clean files will be:

• gt_clean

• gt_clean_XrefID

– Removed will be output in files:

• gt_SNPs_removed

• gt_Animals_removed

Inspection of Diagonal of G

• High diagonal elements from G

• Mislabeled samples, individuals from other populations/lines

• Problems with sample, low call rate

• By default values > 1.6 are excluded from analysis. Threshold can be changed with:

OPTION threshold_diagonal_g x

Simeone et al., 2011 JABG

Potential duplicate samples

• All samples are checked with each other

– x = G(i,j)/sqrt(G(i,i)*G(j,j))

– Values of x > 0.90 are printed in the output
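The check can be sketched in a few lines. This is an illustration only: the function name `likely_duplicates` and the toy matrix are made up, and the 0.90 threshold is the one stated above.

```python
from math import sqrt

def likely_duplicates(G, threshold=0.90):
    """Pairs (i, j, x) with x = G(i,j)/sqrt(G(i,i)*G(j,j)) above the threshold."""
    pairs = []
    n = len(G)
    for i in range(n):
        for j in range(i + 1, n):
            x = G[i][j] / sqrt(G[i][i] * G[j][j])
            if x > threshold:
                pairs.append((i, j, x))
    return pairs

# Toy G: samples 0 and 1 look like the same animal
G = [
    [1.00, 0.98, 0.10],
    [0.98, 1.02, 0.05],
    [0.10, 0.05, 0.95],
]
pairs = likely_duplicates(G)
for i, j, x in pairs:
    print(i, j, round(x, 3))
```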

Correlation off-diagonal G vs A

• Compute correlation for all elements of A > 0.02

• Potential problems with matching genotype file and pedigree file

• For low values (<0.5) => print a warning !!!!

• For low values (<0.3) => program stops !!!

• If you still want to go …

– OPTION thrStopCorAG -1
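A minimal sketch of such a check is given below, assuming a plain Pearson correlation between the off-diagonal elements of G and the matching elements of A22, restricted to pairs with A > 0.02. The function name and the toy matrices are hypothetical; the exact statistic computed by the program may differ.

```python
from math import sqrt

def cor_offdiag(G, A, min_a=0.02):
    """Pearson correlation of off-diagonal G and A elements where A > min_a."""
    pairs = [(G[i][j], A[i][j])
             for i in range(len(A)) for j in range(i + 1, len(A))
             if A[i][j] > min_a]
    n = len(pairs)
    mean_g = sum(g for g, _ in pairs) / n
    mean_a = sum(a for _, a in pairs) / n
    sga = sum((g - mean_g) * (a - mean_a) for g, a in pairs)
    sgg = sum((g - mean_g) ** 2 for g, _ in pairs)
    saa = sum((a - mean_a) ** 2 for _, a in pairs)
    return sga / sqrt(sgg * saa)

# Toy matrices for 4 genotyped animals (symmetric, only values are made up)
A = [[1.0, 0.5, 0.25, 0.0],
     [0.5, 1.0, 0.25, 0.5],
     [0.25, 0.25, 1.0, 0.125],
     [0.0, 0.5, 0.125, 1.0]]
G = [[1.0, 0.52, 0.20, 0.01],
     [0.52, 1.0, 0.30, 0.45],
     [0.20, 0.30, 1.0, 0.10],
     [0.01, 0.45, 0.10, 1.0]]
r = cor_offdiag(G, A)
print(round(r, 3))
```

A low value from such a check usually points to genotypes assigned to the wrong animals rather than to a problem in G itself.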

Looking for stratification in populations

• OPTION plotpca

– (only preGSf90, not in application programs)

– Plot the first 2 PC

• OPTION extra_info_pca filename col

– File with variables (alphanumeric) to plot PC with different colors for different classes

– Same order as genotype file

Use in application programs

• Use renumf90 for proper renumbering and creation of cross reference ID and parameter file

• If large number of genotypes

– Precompute inv(G)-inv(A22) (PreGSF90)

– Modify parameter file to read GimA22i

– BLUPF90, REMLF90

• Generally all steps can be put in a script file to facilitate running programs

Genome-Wide Association Mapping Including Phenotypes from Relatives without Genotypes

Ignacio Aguilar

Instituto Nacional de Investigación Agropecuaria


[email protected]

Slides from H. Wang (Joy)

Classical GWAS

• Test single marker one at time

• Simple linear regression for the SNP effects

• Other fixed effects can be fit (contemporary group)

• Polygenic breeding value can be used

Genomic Selection

• Considers all genetic associations derived from markers

• Methods (Bayes A, Bayes B, Bayes Lasso) provide solutions to SNP effects

• Then Genome-Wide Association Analysis (GWAS) can be performed

• Accounts for population stratifications and cryptic relatedness

Non-Genotyped individuals with

Phenotype

• Recorded information from non-genotyped individuals cannot easily be incorporated in single marker regression and Bayesian methods

• Although information can be incorporated by accumulating data from relatives, e.g. EBV

• But problems with

– heterogeneity from different sources

– Loss of information

– bias

– Computational cost with MCMC method for large number of genotypes and markers

Single-step GBLUP

• Integrates all available information

– Phenotypes

– Genotypes

– Pedigree

• Limitation of ssGBLUP, in contrast to BayesX methods

– infinitesimal model, i.e. same variance for all SNP effects

ssGWAS

• Combining methods

– Unequal variances

– Use all available information like in ssGBLUP

• Improve Accuracy of estimation of GEBV

– For breeding and selection

– Precision for estimation of SNP effects for GWAS

SNP variances

• Zhang et al. 2010 presented a method to estimate weights for SNP variances without sampling, i.e. non-MCMC methods

• SNP weights: function of squares of SNP effects

• Incorporate weights into genomic relationship matrix

• Similar approach by Sun et al., 2012 PlosOne

BUT both approaches cannot utilize phenotypes of un-genotyped individuals !!!

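As a reminder:

• GBLUP → BLUP_SNP:

$$\hat{a}_g = Z\hat{u}$$

($a_g$: genetic value of genotyped animals; $u$: SNP effect)

• As:

$$\sigma_a^2 = \left[\sum_{all\ SNPs} 2p_i(1-p_i)\right]\sigma_u^2$$

• And:

$$G = \frac{ZDZ'}{\sum_{all\ SNPs} 2p_i(1-p_i)} = ZDZ'\,\frac{\sigma_u^2}{\sigma_a^2} = ZDZ'\lambda$$

($D$: weight matrix)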

SNP weights

• SNP weights derived from SNP effects

• Zhang et al., 2010 PlosOne

• Matrix D, diagonal matrix, with un-equal variances for each SNP

• SNP effects from GEBV's (Henderson, 1973; Strandén and Garrick, 2009):

$$\hat{u} = \frac{\sigma_u^2}{\sigma_a^2}\, DZ'G^{-1}\hat{a}_g = DZ'[ZDZ']^{-1}\hat{a}_g$$

• Also, for each SNP effect (i-th):

$$\hat{\sigma}_{u,i}^2 = \hat{u}_i^2\; 2p_i(1-p_i)$$

• Note: this is just variance of SNP effects, NOT the same concept for genetic variance

postGSf90 par files

1) Parameter files:

(1) BLUPF90 (and preGSf90 for S1)

(2) postGSf90

2) OPTIONs:

BLUPf90 / PreGSf90:

OPTION SNP_file marker.geno.clean

OPTION saveGInverse

OPTION weightedG w # A vector with length = M

postGSf90:

OPTION SNP_file marker.geno.clean

OPTION ReadGInverse

OPTION chrinfo mapfile #format: snpID chr pos

OPTION weightedG w

# OPTION which_weight 1

# OPTION SNP_moving_average n

# OPTION Manhattan_plot

3) Document: http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90#gwas_options_postgsf90

Computing algorithm

• Denote t as an iteration number and i as the i-th SNP

1. t=0, D(t)=I, G(t)=ZD(t)Z'λ

2. Compute $\hat{a}_g$ by ssGBLUP

3. Calculate $\hat{u}_{(t)} = \lambda D_{(t)} Z' G_{(t)}^{-1} \hat{a}_g$

4. Calculate $d^*_{i(t+1)} = \hat{u}^2_{i(t)}\, 2p_i(1-p_i)$ for all SNPs (Zhang et al., 2010)

5. Normalize $D_{(t+1)} = \dfrac{tr(D_{(0)})}{tr(D^*_{(t+1)})}\, D^*_{(t+1)}$

6. Calculate $G_{(t+1)} = ZD_{(t+1)}Z'\lambda$

7. t=t+1

8. Exit, or loop to step 2 (S2) or 3 (S1)
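Steps 4 and 5 of the algorithm are simple element-wise operations and can be sketched as below. This is an illustration, not the postGSf90 code: the function name `update_weights` and the numbers are hypothetical, and `trace0` stands for tr(D(0)), which equals the number of SNP when D(0)=I.

```python
def update_weights(u_hat, p, trace0):
    """One weight update: d*_i = u_i^2 * 2 p_i (1 - p_i), rescaled to trace0.

    u_hat  : current SNP effect estimates
    p      : allele frequencies of the same SNP
    trace0 : trace of the starting weight matrix D(0)
    """
    d_star = [u * u * 2.0 * pi * (1.0 - pi) for u, pi in zip(u_hat, p)]
    scale = trace0 / sum(d_star)           # normalisation of step 5
    return [scale * d for d in d_star]

u_hat = [0.1, -0.2, 0.0, 0.3]
p = [0.5, 0.5, 0.2, 0.1]
d_new = update_weights(u_hat, p, trace0=len(u_hat))
print([round(d, 3) for d in d_new])
```

The rescaling keeps the trace of D constant across iterations, so only the relative weights of the SNP change; a SNP with a zero estimated effect gets a zero weight.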

Simulated data

1. QMSim

2. Simple model: $y = 1\mu + Z_a a + e$

3. 10 QTLs w. 3000 SNP markers on 2 chromosomes

4. N = 15,800; Ng = 1500

5. h2=0.5, all due to QTLs (No Polygen)

6. 10 replications

Different Scenarios

• Scenario 1

– Run only one BLUP and get GEBV

– Estimate SNP effects from GEBV using weighted Genomic matrix

– Multiple trait or random correlated effects

• Scenario 2

– Get EBVs with weighted genomic relationship matrix

– Estimate SNP effects from GEBV using updated solutions

– Single trait analysis - fit one genomic relationship matrix

postGSf90 bash script

• Scenario 1:

# run 1 time GBLUP to get GEBVs:
echo par.b90 | blupf90 | tee log.blupf90

# run x times PreGSf90 – postGSf90 to get SNPeff:
for i in 1 2 3 4 5 6 7 8 … … x
do
  echo par.b90 | preGSf90 | tee log_preGS_$i
  echo postpar.b90 | postGSf90 | tee logpost_$i
  cp snp_sol snp_sol_$i   #format: tr, eff, snpID, chr, pos, sol, w
  cp chrsnp chrsnp_$i
  cp w w_$i
  awk '{ print $7 }' snp_sol > w
done

• Scenario 2:

for i in 1 2 3 4 5 6 7 8 … … x
do
  echo par.b90 | blupf90 | tee logpre_$i
  cp solutions solutions_$i
  echo postpar.b90 | postGSf90 | tee logpost_$i
  cp snp_sol snp_sol_$i
  cp chrsnp chrsnp_$i
  cp w w_$i
  awk '{ print $7 }' snp_sol > w
done

Methods

1. Single marker model: WOMBAT

2. BayesB using de-regressed proofs : GENSEL

3. ssGBLUP: S1 & S2

Manhattan plot of S1

Manhattan plot of S2

Manhattan plot of BayesB

Manhattan plot of WOMBAT

Accuracy of (G)EBVs

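| Method | Accuracy |
|---|---|
| BLUP, EBVs | 0.81 (0.01) |
| BLUP, DP | 0.77 (0.01) |
| ssGBLUP, it1* | 0.87 (0.01) |
| ssGBLUP, it2 | 0.89 (0.01) |
| ssGBLUP, it3 | 0.88 (0.01) |
| ssGBLUP, it4 | 0.88 (0.02) |
| ssGBLUP, it5 | 0.88 (0.02) |
| ssGBLUP, it6 | 0.87 (0.02) |
| ssGBLUP, it7 | 0.87 (0.02) |
| ssGBLUP, it8 | 0.87 (0.02) |
| BayesB_DP, NW† | 0.88 (0.02) |
| BayesB_DP, c=0.1 | 0.88 (0.02) |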

Accuracy of SNP effects

Table 3. Average correlations (standard deviations) between QTL effects and sum of cluster of m SNP effects using ssGBLUP

| S1* | 1† | 2 | 4 | 8 | 16 | 40 |
|---|---|---|---|---|---|---|
| it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08) |
| it2 | 0.46 (0.07) | 0.66 (0.05) | 0.78 (0.02) | 0.82 (0.02) | 0.81 (0.02) | 0.63 (0.08) |
| it3 | 0.43 (0.07) | 0.64 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08) |
| it4 | 0.42 (0.07) | 0.63 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08) |
| it5 | 0.41 (0.07) | 0.63 (0.05) | 0.76 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.08) |
| it6 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07) |
| it7 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07) |
| it8 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.60 (0.07) |

| S2 | 1 | 2 | 4 | 8 | 16 | 40 |
|---|---|---|---|---|---|---|
| it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08) |
| it2 | 0.44 (0.09) | 0.65 (0.06) | 0.77 (0.03) | 0.82 (0.03) | 0.81 (0.02) | 0.63 (0.06) |
| it3 | 0.41 (0.08) | 0.62 (0.05) | 0.75 (0.03) | 0.79 (0.03) | 0.79 (0.03) | 0.65 (0.06) |
| it4 | 0.40 (0.07) | 0.61 (0.05) | 0.73 (0.03) | 0.77 (0.03) | 0.78 (0.03) | 0.64 (0.06) |
| it5 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.76 (0.04) | 0.77 (0.04) | 0.64 (0.06) |
| it6 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |
| it7 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |
| it8 | 0.40 (0.07) | 0.60 (0.05) | 0.71 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |

* S1: update weights for SNP effects but not for GEBVs; S2: update weights for both GEBVs and SNP effects in each iteration.

† Number of SNPs (i.e. m ranges from 1 to 40) in each cluster.

Variances Explained by segments

• Several ISU works propose to present results from GWAS using variance explained by windows of adjacent SNP

• Fan et al. 2011, Onteru et al. 2011, Peters et al. 2012, etc.

• Potential use of bootstrap to get significance of detected QTL

Windows Variances

• a = Zu for only SNP in segment

• a = EBV derived from segment

• Get sample variance Var(a) from genotyped individuals
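The window computation can be sketched as follows: for each genotyped individual, sum the SNP effects of the segment weighted by its genotypes, then take the sample variance across individuals. The function name, the window definition by start index and size, and the toy data are assumptions for illustration.

```python
from statistics import variance

def window_variance(Z, u, start, size):
    """Sample variance of a = Z u restricted to SNP start .. start+size-1.

    Z : genotype rows of the genotyped individuals (one row per animal)
    u : SNP effect solutions, one per column of Z
    """
    a = [sum(row[k] * u[k] for k in range(start, start + size)) for row in Z]
    return variance(a)   # sample variance (n - 1 in the denominator)

Z = [[0, 1, 2, 1],
     [1, 1, 0, 2],
     [2, 0, 1, 0]]
u = [0.1, -0.2, 0.3, 0.05]
v = window_variance(Z, u, start=0, size=2)
print(round(v, 4))
```

Dividing each window variance by the total genetic variance gives the proportion explained by the segment, which is what is usually plotted.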

POSTGSF90 Options


Output files from POSTGSF90

QTL-MAS workshop 2010

G = ZDZ'/k, a = DZ'(ZDZ')^{-1}u

G = ZZ'/k, a = Z'(ZZ')^{-1}u

cor(ebv,tbv)=0.68 cor(ebv,tbv)=0.70

Single-Step GWAS Conception Rate

• Multiple-Trait US Holsteins Service records from AI

• ~ 5 million records, ~ 2.5 million pedigrees

• ~ 5,600 genotyped bulls

• Computing time

– Complete evaluation 2 h

– Estimates of SNPs 2 min

Single-Step GWAS Heat Stress

• Multiple-Trait Test-Day model heat tolerance

• ~ 90 million records, ~ 9 million pedigrees

• ~ 3,800 genotyped bulls

• Computing time

– Complete evaluation ~ 16 h

Milk yield no Heat stress Heat stress

27/05/2014


Creation and handling of genomic

relationship matrices with preGSf90

Ignacio Aguilar Instituto Nacional de Investigación Agropecuaria


[email protected]

Genomic Relationship Matrix - G

• G = ZZ'/k

– Z = matrix for SNP marker

– Dimension Z= n*p

– n animals,

– p markers

Data file with SNP marker

HOWTO: Creation of Genomic Matrix

• Read SNP marker information => M

• Get 'means' to center

– Calculate allele frequency from observed genotypes (pi)

– pi= sum(SNPcodei)/2n

• Matrix for center W(3,p)

| Genotype | SNP 1 | SNP 2 | .. |
|---|---|---|---|
| 0 | 0 - 2p1 | 0 - 2p2 | .. |
| 1 | 1 - 2p1 | 1 - 2p2 | .. |
| 2 | 2 - 2p1 | 2 - 2p2 | .. |

• Center matrix Z = W(M), for example with M =

2 1 2
0 1 0
.. .. ..
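The steps above can be put together in a few lines. The sketch below follows the slide (allele frequencies from the observed genotypes, centering by 2p, scaling by k = 2 sum(p(1-p))) but is not the preGSf90 code: the function name is hypothetical, missing genotypes are not handled, and the third row of the toy M is made up to complete the slide's two example rows.

```python
def genomic_matrix(M):
    """G = ZZ'/k from a genotype matrix M coded 0/1/2 (no missing values)."""
    n, m = len(M), len(M[0])
    p = [sum(row[j] for row in M) / (2 * n) for j in range(m)]   # pi = sum(code)/2n
    Z = [[row[j] - 2 * p[j] for j in range(m)] for row in M]     # centered genotypes
    k = 2 * sum(pj * (1 - pj) for pj in p)                       # scaling factor
    G = [[sum(Z[i][s] * Z[t][s] for s in range(m)) / k for t in range(n)]
         for i in range(n)]
    return G, p

M = [[2, 1, 2],
     [0, 1, 0],
     [1, 0, 1]]
G, p = genomic_matrix(M)
for row in G:
    print([round(x, 3) for x in row])
```

Because Z is centered with frequencies from the same animals, each row of G sums to zero, which is a quick sanity check on small examples. The triple loop is also exactly the n^2 * p cost discussed on the next slide.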

Creation of Genomic

• Issues

– Large number of genotyped individuals

– Large number of SNP markers

– Matrix multiplication ~ cost n^2 * p

Large amount of data put in (cache) memory for doing 'matmul' for each pair of animals and indirect memory access (center)

Memory hierarchy

Matrix multiplication

• Matrix multiplication

– Several methods

• Intrinsic matmul (good for small examples !!!)

• "do-loops"

• Packages (BLAS, LAPACK)

– Non-optimized

– Optimized (ATLAS, MKL, etc.)

– Several Compilers

• Perform automatic optimization

– Vectorize loops

– Detect permuted loops

• Can use OpenMP directives for parallelization

Memory Hierarchy

[Diagram: CPU #1 and CPU #2, each with its own cache memory (256 kb – 16Mb); access between CPU and cache is fast. Both reach the main memory (1Gb – 128Gb); access to main memory is slow]

Alternative codes to create G matrix

Original: Gmatrix.f90 (VanRaden, 2009)

! centered value Z(M(i,k),k) is looked up inside the innermost loop
Do i=1,n
  Do j=i,n
    S=0
    Do k=1,p
      S=S+Z(M(i,k),k)*Z(M(j,k),k)
    End do
    G(i,j)=S/sqrt(d(i)*d(j))
    G(j,i)=G(i,j)
  End do
End do

Optimize Indirect Memory Access - OPTM

! centered genotypes are stored once in X, then used directly
Do k=1,p
  X(:,k)=Z(M(:,k),k)
End do
Do i=1,n
  Do j=i,n
    S=0
    Do k=1,p
      S=S+X(i,k)*X(j,k)
    End do
    G(i,j)=S/sqrt(d(i)*d(j))
    G(j,i)=G(i,j)
  End do
End do

Optimize Memory and Loops - OPTML

! G is assumed to be initialised to zero before the accumulation
Do k=1,p
  X(:,k)=Z(M(:,k),k)
End do
Do i=1,n
  Do j=1,n
    Do k=1,p
      G(i,j)=G(i,j)+X(i,k)*X(j,k)
    End do
  End do
End do
! scaling is done in a separate pass
Do i=1,n
  Do j=1,n
    G(i,j)=G(i,j)/sqrt(d(i)*d(j))
  End do
End do

Testing: 6500 genotyped animals, 40k SNPs

CPU time for alternative codes for G matrix and machines

| Processor | Cache | Original | OPTM | OPTML |
|---|---|---|---|---|
| Xeon 3.5 GHz | 6 MB | 24 m | 26 m | 7 m |
| Opteron 3.0 GHz | 1 MB | 265 m | 59 m | 17 m |

CPU time (m) with alternative codes and compilers (Opteron 3.02 GHz, 1 MB cache memory)

| Compiler | Original | OPTM | OPTML |
|---|---|---|---|
| Intel | 265 | 59 | 17 |
| Absoft | 241 | 60 | 34 |
| gfortran | 213 | 63 | >1 day |

PreGSf90 program

• From BLUPF90 package

• Uses a genomic module

• Creation and handling of genomic relationship matrices and relationship based on pedigree

• Different methods to optimize calculations using parallel processing

Input files

• Same parameter file as for all BLUPf90 programs

– But with "OPTION SNP_file xxxx"

– indicates to run genomic subroutines

• Pedigree file

• Marker information (SNP file)

• Cross Reference file for renumbered ID

– Links genotype files with codes in pedigree, etc.

SNP map file (optional)

• For some genomic analyses or checks

• Format:

– snp number

• Index number of SNP in the sorted map

– chromosome number

– position

• First row corresponds to first column SNP in genotype file !!!

OPTIONS – BLUPF90 parameter file

• PreGSF90

– controlled by adding OPTION commands to the parameter file

– OPTION SNP_file marker.geno.clean

– Read 2 files:

• marker.geno.clean

• marker.geno.clean.XrefID

RENUMF90

• Add keyword to the "animal effect": SNP_FILE marker_geno_clean

• Renumber tool to prepare:

– data

– pedigree

– genotypes

– parameter files for BLUPF90 programs including PREGSF90

• Check wiki:

• http://nce.ads.uga.edu/wiki/doku.php

Parameters file

RENUMF90

renum.par

BLUPF90

renf90.par

Pedigree file from RENUMF90

• 1 - animal number

• 2 - parent 1 number or UPG

• 3 - parent 2 number or UPG

• 4 - 3 minus number of known parents

• 5 - known or estimated year of birth

• 6 - number of known parents; if animal is genotyped 10 + number of known parents

• 7 - number of records

• 8 - number of progenies as parent 1

• 9 - number of progenies as parent 2

• 10 - original animal ID

SNP file & Cross Reference Id

SNP File

Cross Reference ID

First col: Identification, could be alphanumeric

Second col: SNP markers {codes: 0,1,2 and 5 for missing}

Pedigree File (from RENUMF90)

Original ID

Renumber ID

Genomic Matrix default options

• G = ZZ'/k as in VanRaden, 2008

• With:

– Z centered using allele frequencies estimated from the genotyped individuals

– k = 2 sum ( p * (1-p))

• G = G*0.95 + A*0.05 (to invert)

• Tuning of G (see Z. Vitezica's work)

– Adjust G to have mean of diagonals and off-diagonals equal to A

Genomic Matrix Options

• OPTION whichG x

– 1: G=ZZ'/k (default) (VanRaden, 2008)

– 2: G=ZDZ'/n; D=1/2p(1-p) (Amin et al., 2007; Leutenegger et al., 2003)

– 3: As 2 with modification UAR (Yang et al., 2010)

– Euclidean distance matrix, not fully implemented yet

• OPTION weightedG file

– Read weights to create G=ZDZ'

– Weighting Z*= Z sqrt(D) => G = Z*Z*' = ZDZ'

• OPTION whichScale x

– 1: 2Σ(p(1-p)) (default) (VanRaden, 2008)

– 2: trace(ZZ')/n (Legarra 2009, Hayes 2009, Forni et al 2011)

– 3: correction (Gianola et al., 2009)

Genomic Matrix Options

• OPTION whichfreq x

– 0: read from file freqdata or other specified

– 1: 0.5

– 2: current calculated from genotypes (default)

• OPTION FreqFile file

– Reads allele frequencies from a file

• OPTION maxsnps x

– Set the maximum length of string for reading marker data from file => BovineHD chip


Options for Blending G and A

• OPTION AlphaBeta alpha beta

– G = alpha*Gr + beta*A

• OPTION tunedG

– 0: no adjustment

– 1: mean(diag(G))=1, mean(offdiag(G))=0

– 2: mean(diag(G))=mean(diag(A)), mean(offdiag(G))=mean(offdiag(A)) (default)

– 3: mean(G)=mean(A)

– 4: Use Fst adjustment. Powell et al. (2010) & Vitezica et al. (2011)
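Blending is a plain weighted sum of the two matrices, and its main practical effect is to make a singular raw G invertible. The sketch below illustrates this with a hypothetical `blend` function and a deliberately singular 2 x 2 example (two animals with identical genotypes); the default weights 0.95 and 0.05 are the ones given on the default-options slide.

```python
def blend(Gr, A22, alpha=0.95, beta=0.05):
    """G = alpha*Gr + beta*A22, element by element."""
    return [[alpha * g + beta * a for g, a in zip(rg, ra)]
            for rg, ra in zip(Gr, A22)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

Gr = [[1.0, 1.0], [1.0, 1.0]]     # raw G of two identical genotypes: singular
A22 = [[1.0, 0.0], [0.0, 1.0]]    # pedigree relationships: unrelated animals
G = blend(Gr, A22)
print(det2(Gr), round(det2(G), 4))  # the blended matrix has a positive determinant
```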

Creation of 'raw' genomic matrix

• Tricks:

• Use dummy pedigree

1 0 0

2 0 0

…

• Change blending parameters

– OPTION AlphaBeta 0.99 0.01

• No adjustment for compatibility with A

– OPTION tunedG 0

G = 0.99*G + 0.01*I

Storing and Reading Matrices

• PreGSF90:

– Facilitate the implementation of single-step

– Matrix A is replaced by H with:

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

– Default output is the matrix GimA22i, to be included in application programs (BLUPF90, REMLF90..)

• BUT: intermediate matrices could be stored for examination, use in application programs, etc.

• Matrices that can be stored:

– A22, inv(A22), G, inv(G), GmA22, inv(GmA22), inv(H)

• All matrices are stored in same format:

– upper triangle

– By default in binary format

– But to store in text (Ascii) format, use: OPTION saveAscii

• Values

– i j val

– i & j refer to the row number in the genotype file !!!!!

– Renumber ID could be obtained from the XrefID file


To save our 'raw' genomic matrix:

• OPTION saveG [all]

– If the optional all is present, all intermediate G matrices will be saved !!!

or its inverse

• OPTION saveGInverse

– Only the final matrix G, after blending, scaling, etc. is inverted !!!

• Look in wiki for keywords for other matrices

Storing with Original IDs

• Some matrices could be stored in text files with the original IDs extracted from renaddxx.ped created by the RENUMF90 program (col #10)

• For example:
– OPTION saveGOrig
– OPTION saveDiagGOrig
– OPTION saveHinvOrig

• Values
– origID_i, origID_j, val


OUTPUT

• Only GimA22i, other requested matrix files, and some reports are stored.

• The main log is printed to the screen!

• Use redirection '>'
• or, better, the command tee to save it in a log file.

• This allows you to both save and see the messages from the program

• echo renf90.par | preGSf90 | tee pregs.log

Printout: Same heading as other programs

All options that were entered in the parameter file should be listed here! If not, check that the keywords are correct (upper and lower case).

Check the number of animals and individuals with genotypes.


Printout

Information from genotype file.

The format is detected from the first line! So all genotypes should start in the same column!

The number of SNPs is also determined by the first line!
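Because the layout is taken from the first line, it can help to check the genotype file before running. A minimal sketch, assuming each line holds an ID, one or more spaces, and then the genotype string without internal spaces (this layout and the function name are assumptions for illustration):

```python
def check_genotype_lines(lines):
    """Compare every line with the layout of the first line.

    Returns ((start_column, n_snp), problems): start_column is the 0-based
    column where the genotype string starts on the first line, n_snp is its
    length, and problems lists the 1-based numbers of lines that differ.
    """
    def layout(line):
        text = line.rstrip("\n")
        parts = text.split()
        if len(parts) != 2:
            return None  # not 'ID genotypes'
        genotype = parts[1]
        return text.rindex(genotype), len(genotype)

    first = layout(lines[0])
    if first is None:
        raise ValueError("first line is not in the assumed 'ID genotypes' layout")
    problems = [number for number, line in enumerate(lines, start=1)
                if layout(line) != first]
    return first, problems
```

Lines flagged here either start the genotype string in a different column or carry a different number of SNPs than the first line.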

Inspecting stored matrices

• Avoid opening them with text editors: the files are huge!

• For example:
• 1500 genotyped individuals => 1,125,750 rows

• Inspection can be done with Unix commands:
– head G => first 10 lines
– tail G => last 10 lines
– less G => scroll document by line/page
– wc -l G => count number of lines; a good check against the number of genotypes (n): expected rows = n*(n+1)/2
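The row-count check can also be scripted. A minimal sketch, assuming the matrix was saved in text format with one element per line; the function names are hypothetical:

```python
def expected_rows(n_genotyped):
    """Number of stored elements in the upper triangle, diagonal included."""
    return n_genotyped * (n_genotyped + 1) // 2


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path) as handle:
        return sum(1 for _ in handle)
```

For 1500 genotyped individuals, `expected_rows(1500)` gives 1125750, the figure quoted above; comparing it with `count_lines` on the saved file is the same check as wc -l.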


head G

GBLUP, GREML, GGIBBS

• Using BLUPF90 programs to perform genomic selection using a genomic relationship matrix

• Using only phenotypes or pseudo-phenotypes (DYD, DP, EBV) for genotyped individuals only


Two ways: user_file

• By user defined files for covariances of random effects

• Look at Tricks in the wiki for more details: http://nce.ads.uga.edu/wiki/doku.php

• Special type of random effect in the BLUPF90 parameter file

• Gi created by PreGSF90 can be used here!

By 'fake' single-step GBLUP

• Same trick as before:

– Dummy pedigree with the number of individuals equal to the number of individuals with genotypes

– Little blending with A (identity matrix) to create the inverse (OPTION AlphaBeta 0.99 0.01)

– No adjustment for means of A (OPTION tunedG 0)

– Parameter file includes:

• Random effect defined as add_animal

• OPTION SNP_file xxxx





By 'fake' single-step GBLUP

• Runs can be done either in:

– Several steps
• 1 run preGSf90 and store G inverse
• 2 modify parameter file for BLUP, adding OPTION readGimA22i
• 3 run BLUPF90

– 'One-Step'
• 1 run BLUPF90 or REMLF90

RENUMF90 ren.par => BLUPF90 renf90.par


PreGSf90 inside BLUPF90 ??

• Almost all programs from the package support creation of genomic relationship matrices, Hinv, etc.

• OPTION SNP_file xxxx

• Why preGSF90?
– Same genomic relationship matrix for several models, traits, etc. Just do it once and store.
– Uses optimized subroutines for efficient matrix multiplication and inversion, with support for parallel processing

Matrix multiplication subroutines

• Optimized memory and loops (compiler optimization)

• dgemm subroutine from BLAS

• Optimized dgemm (ATLAS or MKL libraries*)
– Serial
– Parallel (automatic use of OpenMP)

* Intel Fortran Compiler


Matrix multiplication using 40k SNPs

[Figure: log10 CPU time (s) versus number of animals (0 to 35,000); series labels: BLAS dgemm, OPTML, Optimized dgemm; curve annotations: ~6.4 h and ~3.8 h]

Speedup for matrix multiplications

[Figure: speedup (1 to 4.5) versus number of animals (0 to 35,000) for 2, 3 and 4 threads]

Speedup = time using one thread / time using n threads


Efficient methods to construct genomic relationship matrices

Elapsed time for different numbers of individuals

Number of      Genomic Relationship Matrix
genotypes      Creation      Inversion
10k            0.6 m         0.1 m
30k            5.4 m         3 m
50k            15 m          14 m
70k            30 m          36.4 m
100k           60 m          106 m

BLADE INIALB 24 cpu

Creating a subset of the relationship matrix (A22)

• Create a relationship matrix for only genotyped animals (~thousands)

• Full pedigree (~millions)

• Trace only ancestors of genotyped animals (reduced, but still a large number for the A matrix)


Relationship Matrix of Genotyped Animals

• Colleau's algorithm to create A22

• No need to have an explicit A matrix

• Method uses "matrix-vector" multiplication with a decomposition of the A matrix

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$

Example: A times a vector

Pedigree
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    2    0    0
[3,]    3    1    2

Matrix P
     [,1] [,2] [,3]
[1,]  0.0  0.0  0.0
[2,]  0.0  0.0  0.0
[3,]  0.5  0.5  0.0

Matrix (I-P)^-1
     [,1] [,2] [,3]
[1,]  1.0
[2,]  0.0  1.0
[3,]  0.5  0.5  1.0

Matrix (I-P)^-1'
     [,1] [,2] [,3]
[1,]  1.0  0.0  0.5
[2,]       1.0  0.5
[3,]            1.0

Matrix D
     [,1] [,2] [,3]
[1,]  1.0
[2,]       1.0
[3,]            0.5

Vector r2
     [,1]
[1,]   10
[2,]   20
[3,]   30

Vector q = (I-P)^-1' r2
     [,1]
[1,]   25
[2,]   35
[3,]   30

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$
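The product of A and a vector in this example can be reproduced without ever forming A. The sketch below is an illustration, not the program's implementation. It assumes animals are numbered 1..n in pedigree order with parents before progeny and 0 for an unknown parent, and it computes D ignoring inbreeding (1, 0.75 or 0.5 for zero, one or two known parents). The function name is hypothetical.

```python
def a_times_vector(pedigree, r):
    """Return (q, v) with q = [(I-P)^-1]' r and v = A r, never forming A.

    pedigree : list of (animal, sire, dam) tuples, animals 1..n in order,
               parents listed before progeny, 0 = unknown parent
    r        : list of n numbers
    """
    n = len(pedigree)
    # Step 1: q = [(I-P)^-1]' r, sweeping from youngest to oldest animal
    q = [float(x) for x in r]
    for animal, sire, dam in reversed(pedigree):
        for parent in (sire, dam):
            if parent:
                q[parent - 1] += 0.5 * q[animal - 1]
    # Steps 2 and 3: v = (I-P)^-1 D q, sweeping from oldest to youngest
    v = [0.0] * n
    for animal, sire, dam in pedigree:
        d = 1.0 - 0.25 * ((sire != 0) + (dam != 0))  # no-inbreeding assumption
        v[animal - 1] = d * q[animal - 1]
        for parent in (sire, dam):
            if parent:
                v[animal - 1] += 0.5 * v[parent - 1]
    return q, v
```

For the pedigree and vector r2 of the example, the first returned vector is the q shown above (25, 35, 30); the second is the full product of A and r2.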


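• For each genotyped animal i in A22, multiply A by a vector r_2 that has 1 in the position of animal i and 0 elsewhere; the elements of the result that belong to genotyped animals give row i of A22, A22(i, .):

$$v_2 = A r_2 = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r_2$$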
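Looping over the genotyped animals gives A22 one row at a time. A minimal sketch under assumptions: animals are numbered 1..n in pedigree order with parents before progeny, 0 marks an unknown parent, and inbreeding is ignored when computing D. The function names are hypothetical and the code is not the program's implementation.

```python
def a_column(pedigree, i):
    """Column i (1-based) of A, computed as A r with r the unit vector for i."""
    n = len(pedigree)
    q = [0.0] * n
    q[i - 1] = 1.0
    # Backward sweep (youngest to oldest): q = [(I-P)^-1]' r
    for animal, sire, dam in reversed(pedigree):
        for parent in (sire, dam):
            if parent:
                q[parent - 1] += 0.5 * q[animal - 1]
    # Forward sweep (oldest to youngest): v = (I-P)^-1 D q
    v = [0.0] * n
    for animal, sire, dam in pedigree:
        d = 1.0 - 0.25 * ((sire != 0) + (dam != 0))  # no-inbreeding assumption
        v[animal - 1] = d * q[animal - 1]
        for parent in (sire, dam):
            if parent:
                v[animal - 1] += 0.5 * v[parent - 1]
    return v


def a22(pedigree, genotyped):
    """Relationship matrix among the genotyped animals (1-based ids)."""
    rows = []
    for i in genotyped:
        column = a_column(pedigree, i)
        rows.append([column[j - 1] for j in genotyped])
    return rows
```

Only one vector of length n is held at a time, which is consistent with the low memory use reported for the Colleau method compared with the tabular method.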

Relationship Matrix of Genotyped Animals

Tabular method vs. Colleau algorithm

             Tabular*      Colleau method
CPU Time     311 s         45 s
Memory       12.1 GB       322 MB

Testing: 6,500 genotyped Holsteins, 57,000 pedigrees

* Gmatrix.f90 (VanRaden, 2009)
