Source: nce.ads.uga.edu/wiki/lib/exe/fetch.php?media=slides_2014_lva.pdf

22/05/2014

Short introduction to BLUPF90 family programs

Ignacio Aguilar

Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay

BLUPF90 Family of Programs

• Developed by Ignacy Misztal and collaborators at University of Georgia

• Collection of software for mixed model computation in animal breeding (plant, forest breeding, etc.)


BLUPF90 family programs

• Set of programs to:

– Help in teaching courses on mixed models

– Simplify programming using Fortran 90

– Provide a general program that supports models of different complexity:

• Linear and threshold-linear models with multiple correlated effects, multiple-trait animal models and dominance

• General philosophy of the programs described here:

– "Complex Models, More Data: Simple Programming?"

Misztal, I. 1999. Interbull Bull.

BLUPF90 family programs

• Consists of several programs:

– Estimation of variance components

• REMLF90, AIREMLF90, (thr)GIBBSxF90

– Solvers of mixed model equations

• As before, plus

• BLUPF90

• BLUP90IOD (large-scale data)

– Approximation of accuracy

• ACCF90 (large-scale data)

• Support for genomic selection


http://nce.ads.uga.edu/wiki/doku.php

BLUPF90 family programs

• All programs are controlled by the SAME parameter file.

• Extra options can be used to set non-default behaviour of each program

• Understanding the parameter file usually solves most problems


BLUPF90 parameter file

[Figure: annotated parameter file; the random-effect block is repeated for each random effect]

Data file

• Free format, i.e. at least one space to separate columns

• TABs are not valid column separators

• Some programs (e.g. MS Excel) export flat files with TAB separators !!

• Only numbers: integers or reals

• For reals, the decimal separator is "." not ","

• A lone "." is not a missing value

• All effects need to be renumbered consecutively from 1 (see RENUMF90 later)
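The data file rules above can be checked before a run. A minimal sketch, assuming each record is one text line; the function name `check_data_line` is hypothetical and is not part of the BLUPF90 programs:

```python
def check_data_line(line):
    """Return a list of problems found in one line of a data file.

    Checks only the rules listed on the slide: no TABs, numeric fields only,
    "." as decimal separator, and no lone "." used as a missing value.
    """
    problems = []
    if "\t" in line:
        problems.append("TAB separator")
    for field in line.split():
        if "," in field:
            problems.append(f"comma in field {field!r}")
            continue
        try:
            float(field)  # accepts integers and reals written with "."
        except ValueError:
            problems.append(f"non-numeric field {field!r}")
    return problems


# Example: a lone "." is reported, because it is not a valid number.
print(check_data_line("1 . 3.5"))
```

Running it over every line of a file and printing only the lines with a non-empty result is usually enough to spot TABs exported by spreadsheet software.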


Number of traits / effects

• No restriction on the number of traits or effects

• But memory requirements and computing time increase exponentially with them

Effects section

• As many rows as NUMBER_OF_EFFECTS

• In this section the model for each trait is defined

• Different models per trait are supported

• If an effect is missing for one trait, use 0

• Each row has as many columns as NUMBER_OF_TRAITS, followed by the number of levels and the type of effect


RANDOM_RESIDUAL VALUES

• This matrix should be a square matrix with dimension equal to NUMBER_OF_TRAITS

• Use zero (0.0) to indicate uncorrelated residual effects between traits

• e.g. for 3 traits:

43.1 0.0 0.0

0.0 5.1 3.2

0.0 3.2 10.3

Random effect definition

• RANDOM_GROUP

– Number(s) of the effect(s) from the list of effects

– Correlated effects should be consecutive, e.g. maternal effects, random regression models

• RANDOM_TYPE

– diagonal, add_animal, add_sire, add_an_upg, add_an_upginb, user_file, user_file_i or par_domin

• FILE

– Pedigree file, parental dominance file or user file

• (CO)VARIANCES

– Square matrix with dimension equal to number_of_traits*number_of_correlated_effects


(CO)VARIANCES structure

• Assuming 3 traits (T1-T3) and 3 correlated effects (E1-E3), rows and columns are ordered by effect and, within effect, by trait:

             E1          E2          E3
          T1 T2 T3    T1 T2 T3    T1 T2 T3
E1  T1
    T2
    T3
...
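The ordering of the (CO)VARIANCES matrix can be written as a small index rule. A minimal sketch, assuming the effect-major, trait-minor layout shown above; the function names are hypothetical:

```python
def cov_labels(n_traits, n_effects):
    """Row/column labels of the matrix: effects in the outer loop, traits inside."""
    return [
        f"E{e}T{t}"
        for e in range(1, n_effects + 1)
        for t in range(1, n_traits + 1)
    ]


def cov_position(effect, trait, n_traits):
    """1-based row (or column) of a given trait within a given correlated effect."""
    return (effect - 1) * n_traits + trait


# 3 traits and 3 correlated effects give a 9 x 9 matrix.
print(cov_labels(3, 3))
print(cov_position(2, 1, 3))  # trait 1 of effect 2
```

With this rule, the covariance between T1 of E1 and T1 of E2 sits in row 1, column 4 of the matrix.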

RANDOM_TYPE

• diagonal

– For permanent environment effects; assumes no correlation between levels of the effect

• add_sire

– To create a relationship matrix using sire and maternal grandsire

– Pedigree file:

• individual number, sire number, maternal grandsire number

• add_animal

– To create a relationship matrix using sire and dam information

– Pedigree file:

• animal number, sire number, dam number


RANDOM_TYPE

• add_an_upg

– As before but using rules for unknown parent groups

– Pedigree file:

• animal number, sire number, dam number, parent code

• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals

• Parent code = 3 - number of known parents

– 1 both parents known

– 2 one parent known

– 3 both parents unknown

• add_an_upginb

– As before but using rules for unknown parent groups and inbreeding

– Pedigree file:

• animal number, sire number, dam number, inb/upg code

• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals

• inb/upg code = 4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]

• ms (md) is 0 if the sire (dam) is known and 1 otherwise

• Fs (Fd) is the inbreeding coefficient of the sire (dam)
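Both codes can be computed directly from the rules above. A minimal sketch that transcribes the two formulas as written on the slide; the function names are hypothetical, and treating the inbreeding coefficient of an unknown parent as 0 is an assumption:

```python
def parent_code(sire_known, dam_known):
    """Parent code for add_an_upg: 3 minus the number of known parents."""
    return 3 - (int(sire_known) + int(dam_known))


def inb_upg_code(sire_known, dam_known, f_sire=0.0, f_dam=0.0):
    """inb/upg code for add_an_upginb, as given on the slide:
    4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]
    """
    ms = 0 if sire_known else 1  # 1 when the sire is missing
    md = 0 if dam_known else 1   # 1 when the dam is missing
    return 4000.0 / ((1 + md) * (1 - f_sire) + (1 + ms) * (1 - f_dam))


# Non-inbred parents: both known, one known, none known.
print(inb_upg_code(True, True))    # 2000.0
print(inb_upg_code(True, False))   # 4000 / 3
print(inb_upg_code(False, False))  # 1000.0
```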

RANDOM_TYPE

• user_file

– A matrix is read from a file. Only the upper or lower triangle of the matrix is stored

– Matrix file:

• row, col, value

• user_file_i

– As before but the matrix will be inverted

• par_domin

– A parental dominance file created by program RENDOM

– File format:

• s-d s-sd s-dd ss-d ds-d ss-sd ss-dd ds-sd ds-dd code


Pedigree files

• As with data files, columns in pedigree files are separated by at least one SPACE!!

• TABs are not supported !!

• The order of columns depends on the type of the random effect

• Duplicate pedigree records are not checked!!

• Identification numbers need to be coded sequentially from 1 to the maximum number of animals

• No order is required !!!!
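Because duplicates are not checked by the programs, a quick check before the run can help. A minimal sketch for an add_animal style pedigree (animal, sire, dam), assuming 0 codes an unknown parent; the function name `check_pedigree` is hypothetical:

```python
from collections import Counter


def check_pedigree(rows):
    """rows: list of (animal, sire, dam) integers; 0 is assumed to mean unknown.

    Reports duplicated animals, gaps in the 1..max numbering, and parents
    that never appear as an animal of their own.
    """
    ids = [animal for animal, _, _ in rows]
    counts = Counter(ids)
    known = set(ids)
    max_id = max(ids, default=0)
    return {
        "duplicates": sorted(i for i, c in counts.items() if c > 1),
        "missing": sorted(set(range(1, max_id + 1)) - known),
        "unlisted_parents": sorted(
            {p for _, sire, dam in rows for p in (sire, dam)
             if p != 0 and p not in known}
        ),
    }


# The rows do not need to be sorted: no order is required.
print(check_pedigree([(3, 1, 2), (1, 0, 0), (2, 0, 0), (3, 1, 2), (5, 3, 6)]))
```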

Program options

• Program behavior can be modified by adding lines starting with OPTION at the end of the parameter file

• OPTION option_name x1 x2 …

• option_name: each program has its own definition of options

• The number of optional parameters (x1, x2, …) to control the behavior depends on the option.


BLUPF90

• BLUPF90 computes generalized solutions by several methods:

– Preconditioned Conjugate Gradient (PCG): default iterative method, fast

– Successive over-relaxation (SOR): an iterative method based on Gauss-Seidel

– Direct solution using sparse Cholesky factorization (FSPAK): greater memory requirements

• The values of the solutions change between methods, but estimable functions should be the same

• Prediction error variances can be obtained using the sparse inverse (FSPAK)

BLUPF90 options


BLUPF90 options cont.



Example of parameter file BLUPF90

From blupf90.pdf documentation:

http://nce.ads.uga.edu/wiki/doku.php

Parameter File Model
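A minimal sketch of how the pieces of a parameter file fit together (this is not the example from blupf90.pdf). It assumes a hypothetical single-trait animal model with the observation in column 3, a fixed effect with 2 levels in column 1 and the animal effect with 6 levels in column 2. The keywords DATAFILE, OBSERVATION(S), WEIGHT(S) and the EFFECTS: header line do not appear on these slides and should be checked against the BLUPF90 documentation; the function name is hypothetical.

```python
def build_parameter_file(datafile, pedfile, residual_var, animal_var):
    """Return the text of a single-trait animal model parameter file.

    The column positions and numbers of levels are hypothetical values
    chosen only for illustration.
    """
    lines = [
        "DATAFILE", datafile,
        "NUMBER_OF_TRAITS", "1",
        "NUMBER_OF_EFFECTS", "2",
        "OBSERVATION(S)", "3",
        "WEIGHT(S)", "",  # blank line: no weights
        "EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT [EFFECT NESTED]",
        "1 2 cross",      # effect 1: column 1, 2 levels
        "2 6 cross",      # effect 2: animal, column 2, 6 levels
        "RANDOM_RESIDUAL VALUES", str(residual_var),
        "RANDOM_GROUP", "2",
        "RANDOM_TYPE", "add_animal",
        "FILE", pedfile,
        "(CO)VARIANCES", str(animal_var),
    ]
    return "\n".join(lines) + "\n"


print(build_parameter_file("data.txt", "ped.txt", 1.0, 0.5))
```

The random-effect block (RANDOM_GROUP to (CO)VARIANCES) would be repeated for each additional random effect.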


FAQ or frequent problems

• Wrong data file or pedigree file name !!

– The program does not stop if the file name does not exist

– Check the output for the data file name and the number of records and pedigree lines read

• Wrong positions or formats for observations and effects

• Misspelling of keywords

– Program may stop

• (Co)variance matrices not symmetric or not positive definite

– Program may not stop

• Large numbers (e.g. 305-day milk yield of 10,000 kg) + large number of records with Gibbs sampling programs

– Scale down, e.g. 10,000 / 1,000 = 10
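Since the programs may not stop on a bad (co)variance matrix, it can be checked beforehand. A minimal sketch in plain Python, using a Cholesky factorization as the positive-definiteness test; the function names are hypothetical and this is not how the BLUPF90 programs check their input:

```python
import math


def is_symmetric(m, tol=1e-10):
    """True if m is square and m[i][j] equals m[j][i] within tol."""
    n = len(m)
    return all(len(row) == n for row in m) and all(
        abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(i)
    )


def is_positive_definite(m):
    """Cholesky factorization succeeds only for symmetric positive definite matrices."""
    if not is_symmetric(m):
        return False
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:  # non-positive pivot: not positive definite
                    return False
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True


# The 3-trait residual matrix used as an example earlier in the slides.
residual = [[43.1, 0.0, 0.0], [0.0, 5.1, 3.2], [0.0, 3.2, 10.3]]
print(is_positive_definite(residual))
```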

Data preparation and renumbering

• In general, data files and pedigree files can be created with any software (e.g. SAS, R, Python, etc.)

• But all cross-classified effects (including the pedigree) need to be renumbered sequentially from 1 to the maximum number of levels of each effect.

• No alphanumeric columns

• Columns have to be separated by at least one space !!
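The renumbering requirement can be illustrated with a few lines of code. A minimal sketch that recodes the levels of one effect to 1..n in order of first appearance; the function name is hypothetical, and RENUMF90 may assign the numbers in a different order:

```python
def renumber(values):
    """Recode original level codes (e.g. alphanumeric herd names) to 1..n.

    Returns the recoded column and the cross-reference table
    {original code: consecutive number}.
    """
    table = {}
    recoded = []
    for value in values:
        if value not in table:
            table[value] = len(table) + 1  # next unused consecutive number
        recoded.append(table[value])
    return recoded, table


recoded, table = renumber(["HERD_B", "HERD_A", "HERD_B", "HERD_C"])
print(recoded)  # [1, 2, 1, 3]
print(table)
```

Keeping the table is what allows solutions to be translated back to the original codes after the analysis.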


RENUMF90

• A renumbering program for the BLUPF90 family of programs

• Supports:

– multiple traits

– different effects per trait

– alphanumeric and numeric fields

– unknown parent groups

– covariates for random regression models

• Provides data statistics

• Traces back the pedigree related to individuals in the data file and performs comprehensive pedigree checking

• Creates files to be used by the BLUPF90 family programs

– renf90.par - parameter file

– renf90.dat - recoded data

– renaddxx.ped - renumbered pedigree + statistics

– renf90.tables - cross-reference file linking the renumbered codes to the original data

RENUMF90 files

• Data file and pedigree file are flat files

• Columns separated by at least one SPACE

• No TABs !!!! (the current version checks for them)

• Input files cannot contain the character #

• Missing sires/dams must have code 0

• Codes 00 are treated as a known animal

• It has its own parameter file!!!! Not the same as for the other programs !!!!


RENUMF90 parameter file

• Based on keywords in capital letters, each followed by one or more lines with the corresponding data item

• Keywords need to be typed exactly

• Keywords need to occur in sequential order !!!

• Lines starting with # are treated as comments and are ignored

RENUMF90 keywords

All these keywords are mandatory!!!

Leave a blank line in the cases where it is necessary


RENUMF90 keywords

Effect section

RENUMF90 keywords

Random effect section


RENUMF90 keywords

Random effect files section

RENUMF90 keywords

Pedigree options section


RENUMF90 keywords

Unknown parent group section

RENUMF90 keywords

Random regression group section


RENUMF90 keywords

Random effect (Co)Variances section

RENUMF90 keywords

• The section starting from EFFECTS can be repeated as many times as there are effects in the model

• Correlated effects are controlled by an option

• If (co)variances for any effect are missing, a default matrix with 1.0 on the diagonal and 0.1 off the diagonal will be used.

– WARNING: for EM-REML the convergence rate is improved if starting values are too large rather than too small !!!


RENUMF90 keywords

Creation of interaction effects

RENUMF90 keywords

extra options sections


RENUMF90 keywords

options passed to BLUPF90

• All lines that begin with the keyword OPTION are passed to the parameter file renf90.par

• This allows the process to be automated with scripts

• For example:

– OPTION sol se

RENUMF90 output files

Pedigree file: renaddx.ped

• Column structure:

1. Animal number (from 1)

2. Parent 1 number or UPG number for parent 1

3. Parent 2 number or UPG number for parent 2

4. 3 minus number of known parents

5. Known or estimated year of birth (0 if not provided)

6. Number of known parents; if the animal has a genotype: 10 + number of known parents

7. Number of records

8. Number of progeny as parent 1

9. Number of progeny as parent 2

10. Original animal ID
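The ten columns can be read into named fields for later use. A minimal sketch, assuming space-separated columns in the order listed above; the class and function names and the sample line are hypothetical:

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    animal: int
    parent1: int
    parent2: int
    code: int               # column 4: 3 minus number of known parents
    year_of_birth: int      # column 5: 0 if not provided
    known_parents: int      # column 6 without the genotype flag
    genotyped: bool         # column 6 is 10 + known parents when genotyped
    n_records: int
    n_progeny_parent1: int
    n_progeny_parent2: int
    original_id: str


def parse_ped_line(line):
    """Split one line of the renumbered pedigree into named fields."""
    f = line.split()
    col6 = int(f[5])
    return PedRecord(
        int(f[0]), int(f[1]), int(f[2]), int(f[3]), int(f[4]),
        col6 % 10, col6 >= 10,
        int(f[6]), int(f[7]), int(f[8]), f[9],
    )


# Hypothetical line: animal 3, both parents known, genotyped, 4 records.
print(parse_ped_line("3 1 2 1 2010 12 4 0 1 A003"))
```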


RENUMF90 output files

Renumbering tables: renf90.tables

• For each cross-classified effect, tables are created with:

– Original ID, count, consecutive number

• Useful to:

– translate solutions from the BLUPF90 programs into the original alphanumeric values

– check counts of records by level

Example of RENUMF90 parameter file


RENUMF90 output files

Inbreeding program

• INBUPGF90

– Calculates inbreeding coefficients

– Alphanumeric identification of individuals

– Different methods:

• Regular inbreeding (Meuwissen & Luo)

• Missing parent information (VanRaden's method)

– Calculates expected future inbreeding for a set of defined matings

– Calculates relationships between animals

– Outputs a reordered pedigree with parent ID < animal ID
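To make the idea of an inbreeding coefficient concrete, a minimal sketch using the classical tabular method follows. This is a swapped-in textbook technique for small pedigrees, not the Meuwissen & Luo or VanRaden algorithms used by INBUPGF90. It assumes animals numbered 1..n with parent ID < animal ID and 0 for an unknown parent; the function name is hypothetical.

```python
def inbreeding_tabular(pedigree):
    """pedigree: list of (animal, sire, dam), numbered 1..n, parents before progeny.

    Builds the full relationship matrix A (memory grows with n squared, so
    this is only suitable for small pedigrees) and returns {animal: F},
    where F = A[animal][animal] - 1.
    """
    n = len(pedigree)
    # Row and column 0 stand for the unknown parent and stay at zero.
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for animal, sire, dam in sorted(pedigree):
        for other in range(1, animal):
            a = 0.5 * (A[other][sire] + A[other][dam])
            A[animal][other] = A[other][animal] = a
        A[animal][animal] = 1.0 + 0.5 * A[sire][dam]
    return {animal: A[animal][animal] - 1.0 for animal, _, _ in pedigree}


# Animals 3 and 4 are full sibs; animal 5 is their progeny.
ped = [(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 2), (5, 3, 4)]
print(inbreeding_tabular(ped))  # animal 5 has F = 0.25
```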


INBUPGF90

• No parameter file

• Controlled by command-line arguments

inbupgf90 --pedfile file_name

• See the wiki for a full description of the options

Different Models with BLUPF90

http://nce.ads.uga.edu/wiki/doku.php?id=faq


Estimation of variance components

• Several methods available

– REML

– Bayesian methods via Gibbs Sampling

• Review article:

– Misztal, I. 2008. Reliable computing in estimation of variance components. J. Anim. Breed. Genet. 125:363-370.

REML

• Maximizes the likelihood with respect to the parameters

• Different ways to get the maximum

– Derivative-free (DF), e.g. MTDFREML

– Using first derivatives (EM-REML)

– Using second derivatives (AI-REML)


EM-REML

• Traditionally regarded as the most reliable

• But

– Slow

– Could fail if starting parameters are smaller than the 'true' parameters

– Use bigger values

– Does not generate standard errors of the estimates

AI REML

• Much faster than EM-REML

• Provides estimates of standard errors

• BUT

– For complex models and poor starting values

• Slow convergence

• Parameter estimates outside the parameter space

– In some cases initial rounds with EM-REML help


Bayesian - Gibbs Sampling

• Implementation

– Solving of mixed model equations (Gauss-Seidel) + adding 'noise' to solutions

– Sampling of variance components from chi-square or Wishart distributions

• Samples from the marginal posterior distribution after the burn-in period

Programs for estimation of variance components

• remlf90 -> EM-REML

• airemlf90 -> AI-REML

• Gibbs Sampling

– gibbsf90: blupf90 transformed into a Gibbs sampler, slow

– gibbs1f90: optimized version

– gibbs2f90: improved mixing with correlated random effects

– gibbs3f90: heterogeneous residual variance classes

– thrgibbs1f90: threshold-linear mixed models


REMLF90 OPTION

AIREMLF90 OPTIONS


AIREMLF90 OPTIONS

AIREMLF90 OPTIONS


GIBBS SAMPLING PROGRAMS

• Extra inputs are required when running Gibbs sampling programs

• As in the other programs: name of parameter file?

• Parameters to set up the MCMC chain

– Number of samples and length of burn-in

– Give n to store every n-th sample? (1 means store all samples)

Gibbs Sampling

Output files

• Default files

– gibbs_samples

– fort.99

• Solution files only if they are requested through options

• Other files, only useful for continuation

– binary_final_solutions

– last_solutions


Gibbs Sampling OPTIONS

Gibbs Sampling OPTIONS


heterogeneous residual variances

GIBBS3F90

Threshold models

THRGIBBS1F90


Post Gibbs analysis

• Program postgibbsf90 uses the output files

– gibbs_samples

– fort.99

• from all Gibbs sampling programs

– gibbs1f90

– gibbs2f90

– gibbs3f90

– thrgibbs1f90

POSTGIBBSF90

• Calculates statistics for variance components from the posterior distribution

– means

– medians

– modes

– standard deviations

– HPD 95

– effective sample size

– auto-correlations

• Creates graphs with the trace of the chain and histograms of the variance components
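Some of these statistics are simple to reproduce from a list of samples. A minimal sketch for one variance component, assuming the samples are already in a Python list; the function names are hypothetical and this is not the POSTGIBBSF90 implementation (the HPD here is the shortest interval containing 95% of the kept samples):

```python
import math
import statistics


def hpd_interval(samples, prob=0.95):
    """Shortest interval that contains a fraction `prob` of the samples."""
    x = sorted(samples)
    n = len(x)
    # Number of samples inside the interval; the small offset guards
    # against floating-point noise in prob * n.
    k = max(1, math.ceil(prob * n - 1e-9))
    widths = [(x[i + k - 1] - x[i], i) for i in range(n - k + 1)]
    _, start = min(widths)
    return x[start], x[start + k - 1]


def summarize(samples, burn_in=0, thin=1):
    """Posterior summaries after discarding burn-in and keeping every thin-th sample."""
    kept = samples[burn_in::thin]
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
        "sd": statistics.stdev(kept),
        "hpd95": hpd_interval(kept),
    }


# A skewed toy chain: the HPD interval excludes the extreme value.
print(hpd_interval([1.0, 2.0, 3.0, 4.0, 100.0], prob=0.8))  # (1.0, 4.0)
```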


POSTGIBBSF90

Output Files

• "postgibbs_samples"

– file containing all Gibbs samples for additional post-analyses, e.g. posterior distributions of heritabilities, correlations, covariance functions of random regression models

• "postmean"

– file containing posterior means, in a matrix format that matches the parameter file

• "postsd"

– file containing posterior standard deviations

• "postout"

– statistics

How to run POSTGIBBSF90

• Interactive program

• As in the other programs: name of parameter file?

• Parameters to select the samples used to calculate posterior statistics

– Burn-in?

• Sets the number of samples to discard for the posterior analysis

• In the first run use 0 to see convergence

– Give n to read every n-th sample? (1 means read all samples)

• This number should be equal to or greater than the one used in the Gibbs programs

• Asks the user to enter an option to

– Generate graphs of trace or histograms

– Exit


Using Gibbs Sampling programs

• For a new analysis use burn-in equal to 0

– Allows looking at the full chain with postgibbsf90

– Posterior samples can be extracted later

• For long jobs, store only every n-th sample with n > 1, e.g. 10

– Decreases the size of gibbs_samples

• DIC for model comparison is provided in the output of the Gibbs programs

– BUT a burn-in should be used for it to be meaningful

General comments for all programs

• Output that is printed to the terminal is not SAVED in any file !!!

• Use redirection or pipes to store outputs in log files:

echo renf90.par | blupf90 | tee blup.log

or

echo renf90.par | remlf90 | tee reml.log


For programs that require more than one parameter

gibbs2f90 <<AA > gibbs.log
renf90.par
1000 0
10
AA

• Or using a single line

• printf "exmr99s \n 1000 0 \n 10 \n" | gibbs2f90 > gibbs.log
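The same answers can be fed to the program from a script. A minimal sketch, assuming the program reads the parameter file name, then "samples burn-in", then the thinning interval from standard input, as in the shell examples above; the function names are hypothetical:

```python
import shutil
import subprocess


def gibbs_answers(parameter_file, n_samples, burn_in, thin):
    """Text that would otherwise be typed at the program's prompts."""
    return f"{parameter_file}\n{n_samples} {burn_in}\n{thin}\n"


def run_gibbs(program, answers, log_file):
    """Run the program with the answers on stdin and save its output to a log."""
    if shutil.which(program) is None:
        raise FileNotFoundError(f"{program} not found on PATH")
    with open(log_file, "w") as log:
        subprocess.run([program], input=answers, text=True, stdout=log, check=True)


# Same inputs as the here-document example: 1000 samples, burn-in 0, store every 10th.
print(gibbs_answers("renf90.par", 1000, 0, 10))
```

Calling `run_gibbs("gibbs2f90", gibbs_answers("renf90.par", 1000, 0, 10), "gibbs.log")` would then reproduce the here-document example, provided the executable is installed.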

General OUTPUT from all programs

Check file names

Check model


General OUTPUT from all programs

Check (co)variances

Check number of records and pedigree lines read

Check maximum number of columns to read


Useful commands for Linux

• Access to the server using an ssh client, e.g. PuTTY

• To run graphic windows, an X11 server for Windows: Xming

• Commands in Linux are case sensitive !!

• Several tutorials on the WEB !!

• unixcombined.pdf from Misztal web page:

– http://nce.ads.uga.edu/~ignacy/ads8200/unixcombined.pdf

• Unix_en.pdf from genomeek blog:

– http://genomeek.wordpress.com/manuels/unix-et-awk/

– http://dl.dropbox.com/u/22940514/Unix_En.pdf

Basic Commands

pwd          show working directory
ls           list files in working directory
ll           as ls but with more information
mkdir d      make a directory d
cd d         change to directory d
cat file     list the complete file
less file    list file page-by-page


Copy and moving commands

To copy a file

cp /home/ignacio/course/lab/lab1/is .

To copy a directory

cp -r /home/ignacio/course/lab/lab1 .

To move file aa to bb in folder test

mv aa ./test/bb

To delete

rm yy        delete the file yy
rm -r xx     delete the folder xx

Other popular commands

head file         print first 10 lines
tail file         print last 10 lines
wc -l file        count lines
grep text file    find lines that contain text
cat file1 file2   concatenate files
sort              sort file
cut               cut specific columns
join              join lines of two files on specific columns
paste             paste lines of two files
expand            replace TABs with spaces
uniq              retain unique lines of a sorted file


Redirections & pipe

aa < bb        program aa reads from file bb

blupf90 < in

aa > bb        program aa writes to file bb

blupf90 < in > log

"|" and "tee"

program blupf90 reads the name of the parameter file and writes output to the terminal and to the file log

echo par.b90 | blupf90 | tee log

AWK

• Very useful and fast command to work with text files

• Can be used as a database query program

• Select specific columns or create new ones

• Select specific rows matching some criteria


AWK

Extract the solutions for a particular effect (2) and print EBVs and reliabilities (r^2, squared accuracies), with 20 as the additive genetic variance in this example

awk '{ if ($2==2) print $3,$4,1-$5**2/20}' solutions

Count records by sire

awk '$2>0{ print $2}' ped | sort | uniq -c

Process CSV files

awk 'BEGIN {FS=","} { print $1,$2,$3}' pedigree.txt
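The first awk command can also be written as a short function, which makes the formula explicit: r^2 = 1 - s.e.^2 / additive variance. A minimal sketch, assuming the columns of the solutions file are trait, effect, level, solution and s.e. (as implied by the awk command) and that non-numeric header lines should be skipped; the function name is hypothetical:

```python
def reliabilities(lines, effect, additive_variance):
    """Return (level, solution, r2) for one effect of a solutions file.

    r2 = 1 - se**2 / additive_variance, where se**2 is taken as the
    prediction error variance.
    """
    out = []
    for line in lines:
        f = line.split()
        # Skip header or incomplete lines and lines of other effects.
        if len(f) < 5 or not f[1].isdigit() or int(f[1]) != effect:
            continue
        level, solution, se = int(f[2]), float(f[3]), float(f[4])
        out.append((level, solution, 1.0 - se ** 2 / additive_variance))
    return out


example = [
    "trait/effect level  solution s.e.",
    "1 1 1 0.5 1.0",
    "1 2 1 2.0 2.0",
    "1 2 2 -1.0 4.0",
]
for row in reliabilities(example, effect=2, additive_variance=20.0):
    print(row)
```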


Data simulation (including genomics)

QMSim software

Zulma G. Vitezica*

* INRA-INPT, GenPhySE, Castanet-Tolosan 31326 France

QMSim: why use it?

• It was designed to simulate large-scale genotyping data in multiple and complex livestock pedigrees

• A wide variety of genome architectures, from the infinitesimal model to a single-locus model

• It is a user-friendly tool for simulating data

• Computationally efficient in terms of both time and memory


QMSim†: where to find it?

The code is written in C++

Executable files are freely available for Windows, Linux and Mac at (last update: July 12, 2013):

http://www.aps.uoguelph.ca/~msargol/qmsim/

†Sargolzaei & Schenkel (2009), Bioinformatics 25:680-681.

How is the simulation carried out?

In 2 steps:

First step: a historical population is simulated

– in order to create initial LD and

– to establish mutation-drift equilibrium

– expansion and contraction of the population

Second step: one or multiple recent population structures are generated


Parameter file

• It must be in ASCII format

• It consists of five main sections

• The order of commands within each section is not important

• All commands end with a semicolon

• No semicolon → error message and the program exits.

1. Global parameters section

An arbitrary title

Parameter file: ex01.prm

Output folder: r_ex01/

Output

 .---------------------------------------.
 |       Example 1 - 10k SNP panel       |
 `---------------------------------------'
Initial seed is backed up in [r_ex01/seed].
parameter file is backed up in [r_ex01/ex01.prm].

The random number generator (RNG*) requires a seed file.

If it is not specified → the RNG will be seeded from the system clock

For each run the initial seed numbers will be backed up in the output folder → this allows the run to be repeated !

* Mersenne Twister algorithm (Matsumoto & Nishimura, 1998)


1. Global parameters section

Overall heritability (Polygenic + QTL)

QTL effect is simulated

Only polygenic effect is simulated

Both, polygenic and QTL effects are simulated

Range: 0 - 10,000

1. Global parameters section

A sex-limited trait, like milk yield

When males do not have records, but selection or culling is based on:

EBVs → OK

Phenotypes → males will be randomly selected or culled


It consists of five main sections

Parameter file

2. Historical population section

• To create initial LD

• Evolutionary forces: mutation and drift (no selection, no migration)

• Random mating: union of gametes randomly sampled from the male and female gametic pools

• Discrete generations

• Only a single historical population


2. Historical population section

Historical generation sizes

hg_size = v1 [v2]

v1: the historical generation size. Range: 2 to 100,000

v2: the historical generation number. Range: 0 to 150,000

Constant size of 420


2. Historical population section

Gradual decrease in size from 2000 to 200

Expansion in the last historical generation from 100 to 3000

Historical bottleneck or expansion can be simulated

LD in livestock extends over longer distances than in humans
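The size scenarios above (constant, gradual decrease, expansion in the last generation) can be pictured as (size, generation) pairs joined by straight lines. A minimal sketch, assuming the change between two specified generations is linear (the slides only say "gradual") and using a hypothetical 1000 historical generations; the function name is hypothetical and this is not QMSim code:

```python
def historical_sizes(points):
    """points: list of (size, generation) pairs in increasing generation order.

    Returns {generation: size}, with sizes interpolated linearly between
    consecutive pairs and rounded to whole animals.
    """
    sizes = {}
    for (s0, g0), (s1, g1) in zip(points, points[1:]):
        for g in range(g0, g1 + 1):
            frac = (g - g0) / (g1 - g0)
            sizes[g] = round(s0 + frac * (s1 - s0))
    return sizes


# Gradual decrease from 2000 to 200 over a hypothetical 1000 generations.
decrease = historical_sizes([(2000, 0), (200, 1000)])
print(decrease[0], decrease[500], decrease[1000])

# Constant size of 100, then expansion to 3000 in the last generation.
expansion = historical_sizes([(100, 0), (100, 999), (3000, 1000)])
print(expansion[999], expansion[1000])
```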


2. Historical population section

Number of males

nmfhg: number of males in the first historical generation

nmlhg: number of males in the last historical generation

Default: equal number of males and females

The sex ratio will be constant across historical generations. It can be changed in the last generation

It consists of five main sections

Parameter file


3. Population section

One or multiple recent populations

For the first defined recent population (i.e. p1), founders must come from the last historical population

For subsequent populations (i.e. p2), founders can be chosen from one or more (up to 10) previously defined populations (i.e. p1)

Multiple recent populations can be analyzed separately (one pedigree for each population) or jointly (by creating one pedigree for all populations) for inbreeding and EBV

3. Population section

Choosing founders for a population

Parameters for the founders:

– Number of males/females to be selected

– The population from which the base animals must be selected (hp: historical population, i.e. last historical generation)

– Type of selection, select: rnd (default), phen, tbv and ebv

/l : to select low values

/h : to select high values


Choosing founders for a population for F2 design

Crossing between populations/lines is allowed

Migration can be simulated

Choosing founders for a population for migration


3. Population section

ls: number of progeny per dam

ls: Probability of the litter sizes

Litter size

3. Population section

pmp: range 0-1, default is equal to 0.5

pmp: 0.5 /fix_litter Sex ratio will be fixed within litters (progeny of a dam)

Sex ratio


3. Population section

rnd (default), rnd_ug (a dam can mate with more than one sire in each generation), p_assort (similarity), n_assort (dissimilarity), minf and maxf (inbreeding is minimized or maximized, respectively, in the next generation)

Assortative mating based on phen, ebv or tbv

Mating design

3. Population section

sr : 40% of sires will be replaced in all generations

sr : 0.4 [1] 0.5 [5]: 40% of sires will be culled for generation 1 to 5, and 50% from generation 5 to last generation

Replacement

sr : 1, discrete generations (default)


3. Population section

rnd, phen, tbv, ebv and age (age only for culling)

/l or /h to select low or high values

Selection and culling designs

Breeding value estimation method

Population specific parameters for saving outputs

data: save individual's data except their genotype (file name: 'population name'_data_'replicate number'.txt)
stat: save brief statistics on simulated data
genotype: save genotype data

p1_mrk_007.txt

p1_qtl_007.txt


It consists of five main sections

Parameter file


4. Genome section

Number of chromosomes: 10 chrlen : range 1-5,000 cM

Marker information

Example – 10k SNP panel

Samples from uniform distribution

in each replicate

All marker loci will have 2 alleles

In the first historical generation, then drift

and mutation


4. Genome section

nqloci: range 1-50,000 on the chromosome

QTL information

Example – 10k SNP panel

Samples from uniform distribution

in each replicate

Equal allele frequencies in

the first historical

generation

Nb of QTL alleles in the first historical generation (all:

same number)

It will be sampled from gamma distribution with shape 0.4


More genome information

Example – 10k SNP panel

In recurrent mutation, no new allele is generated.

Default: infinite-allele model

SNP recurrent mutations are generally very rare, and there is no evidence that mutation contributes to erosion of LD between SNPs (Ardlie et al., 2002)

Other possibilities :

Missing marker/QTL genotypes Genotyping errors can be simulated (marker/QTL)


It consists of five main sections

Parameter file

5. Output section

Save brief statistics on historical population

Save allele effects

Marker and QTL linkage map (GWAS)


Marker and QTL linkage map

p1_data_001.txt

QMSim outputs

p1_stat_001.txt


p1_mrk_001.txt

Marker and QTL linkage map


Save allele effects

QMSim

To create LD

Dense marker map QTL + polygenic

Population expansion or bottleneck

Multiple recent populations / lines

Crossing between populations / lines

A single historical population

Sex limited traits

No fixed effects -

+

Only additive effects

Conclusion


Reference population Phenotypes Genotypes Pedigree

Population Phenotypes Genotypes Pedigree

Estimation SNP effects

Calculation of GEBVs

Comparison between GEBV and TBV (EBV) to obtain accuracy

Candidates to selection

Genomic selection : validation

Example of simulation

Generation -1000 to -5

Generation -5 to -1

Generation 1 to 9

Generation 10

Random mating (N=100)

Expanded to N=3000

200 ♂ × 2,000 ♀ / generation

Pedigree recording and genotyping start

Validation data: candidates to selection

Training data


Bases for Genomic Prediction

A. Legarra

INRA, Toulouse, France

2

Linkage disequilibrium

• « Gametic phase disequilibrium »: statistical association between alleles at two loci in the same chromosome

– Loci: places

– Alleles: alternative forms of a gene (A, B, 0)

– Phase: notion of being in the same chromosome (of a pair) or coming from the same origin (sire or dam)


Biallelic case

• Assume we genotype 5 individuals, thus 10 chromosomes (and that we know the phase)

AB AB ab aB ab ab Ab AB Ab AB

• Now we compute allelic frequencies

4

Biallelic case

p(A)=0.6

p(B)=0.5

if independent, p(AB)=0.3, p(ab)=0.2

The expected proportions are:

      A     a
B    0.3   0.2
b    0.3   0.2


Biallelic case

p(A)=0.6

p(B)=0.5

in reality (frequencies counted from the 10 chromosomes):

      A     a
B    0.4   0.1
b    0.2   0.3

vs. expected

      A     a
B    0.3   0.2
b    0.3   0.2

More AB & ab than expected !!

This is linkage disequilibrium
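The comparison of observed and expected haplotype frequencies can be checked numerically. The sketch below is only an illustration in Python (the function name `ld_stats` is made up for this example); it counts the ten phased chromosomes listed earlier and computes D = f(AB) − p(A)p(B) and r².

```python
from collections import Counter

# The ten phased chromosomes from the biallelic example.
haplotypes = "AB AB ab aB ab ab Ab AB Ab AB".split()

def ld_stats(haps):
    """Return (p_A, p_B, f_AB, D, r2) for two biallelic loci (illustrative helper)."""
    n = len(haps)
    counts = Counter(haps)
    p_a = sum(h[0] == "A" for h in haps) / n   # frequency of allele A
    p_b = sum(h[1] == "B" for h in haps) / n   # frequency of allele B
    f_ab = counts["AB"] / n                    # observed frequency of haplotype AB
    d = f_ab - p_a * p_b                       # disequilibrium coefficient D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return p_a, p_b, f_ab, d, r2

p_a, p_b, f_ab, d, r2 = ld_stats(haplotypes)
print(p_a, p_b, f_ab, round(d, 3), round(r2, 3))
```

Here D is positive (0.4 observed against 0.3 expected for AB), which is the excess of AB and ab haplotypes noted above.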

6

Linkage disequilibrium

• Is a statistical concept

• Describes non-random association of two loci

– Nothing more, so, why is it useful?

• Two loci in LD most often are (very) close

– This is because LD breaks down with recombination

• Linkage disequilibrium of two loci decays on average with the distance

• Hence it serves to map genes


Where does it come from?

• Because chromosomes are transmitted together

– Within known families (« linkage analysis »)

– Within the history of a population (« populational linkage disequilibrium » or « linkage disequilibrium » in short)

• This distinction is rather artificial

– Remember: a population is a very old, large family

8

Populational linkage disequilibrium

• Assume we mix two populations (say Churra and Merino)

• Or, that Adam was [chromosome image] and Eve [chromosome image]

– The first generation is an F1

– Then animals are mixed at random

• What do we get after many generations?


Populational linkage disequilibrium

• The chromosomes become a fine-grained mosaic of grey and black

• However, complete mixture is difficult to attain

10

Populational linkage disequilibrium

• Some people distinguish LD and pedigree relationships

• It's pretty much the same thing

A stretch (= chromosomal segment) is conserved because it comes from the same ancestor (co-ancestry).

• The value of LD (e.g. r2) observed at large distances is a function of recent relationships

• … at short distances it is a function of distant relationships

The « existence » of only a few conserved stretches at the same place creates LD. LD is therefore an over-representation of segments from a few gametes that existed in the population some time ago.


Within-family linkage disequilibrium

• Consider this male, with haplotypes AB / ab, who has 8 progeny

Recombination fraction: 0.50

These are the chromosomes in the sons (i.e. the gametes the male transmitted):

Ab AB aB ab Ab AB aB ab

We found linkage equilibrium in one generation

12

Within-family linkage disequilibrium

• Consider this male, with haplotypes AB / ab, who has 8 progeny

Recombination fraction: 0.25

Gametes transmitted: Ab AB aB ab AB ab AB ab

Due to non-recombination linkage disequilibrium has been generated

      A       a
B    0.375   0.125
b    0.125   0.375


Within-family linkage disequilibrium

• Assume now there are two males

Male 1, haplotypes AB / ab. Gametes transmitted: Ab AB aB ab AB ab AB ab

Male 2, haplotypes Ab / aB. Gametes transmitted: AB Ab ab aB Ab aB Ab aB

14

Within-family linkage disequilibrium

• Assume now there are two males

Within-family linkage disequilibrium, male 1 (haplotypes AB / ab):

      A       a
B    0.375   0.125
b    0.125   0.375

Within-family linkage disequilibrium, male 2 (haplotypes Ab / aB):

      A       a
B    0.125   0.375
b    0.375   0.125


Within-family linkage disequilibrium

• Assume now there are two males

Pooling the gametes of both families:

      A      a
B    0.25   0.25
b    0.25   0.25

No overall linkage disequilibrium

• Why tracing QTLs within family is easy
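The two-family example can be reproduced with a few lines of Python. This is a sketch under the assumption that the two sires carry haplotypes AB / ab and Ab / aB and transmit the eight gametes shown on the slides; the helper names are hypothetical.

```python
def hap_freqs(gametes):
    """Frequencies of the four possible two-locus haplotypes."""
    n = len(gametes)
    return {h: gametes.count(h) / n for h in ("AB", "Ab", "aB", "ab")}

def ld_d(gametes):
    """Disequilibrium D = f(AB) - p(A) p(B)."""
    f = hap_freqs(gametes)
    p_a = f["AB"] + f["Ab"]
    p_b = f["AB"] + f["aB"]
    return f["AB"] - p_a * p_b

family1 = "Ab AB aB ab AB ab AB ab".split()   # progeny of the sire AB / ab
family2 = "AB Ab ab aB Ab aB Ab aB".split()   # progeny of the sire Ab / aB

d1 = ld_d(family1)                  # positive D within family 1
d2 = ld_d(family2)                  # negative D within family 2
d_pooled = ld_d(family1 + family2)  # the two families cancel out
print(d1, d2, d_pooled)
```

Each family shows disequilibrium of the same size but opposite sign, so the pooled sample shows none. This is why linkage can be traced within a family even when the population as a whole is in equilibrium.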


Measures of LD: r2

r is the correlation between two loci if we say « A » = 1, « a » = 0, « B » = 1, « b » = 0

$$ r = \frac{f(AB) - pq}{\sqrt{p(1-p)\,q(1-q)}} = \frac{D}{\sqrt{p(1-p)\,q(1-q)}} $$

with p = p(A), q = p(B) and D = f(AB) − pq

• Not free from problems but can be understood by statisticians (and breeders)

• The sample size needed to achieve a given power is proportional to 1/r2 (Pritchard & Przeworski 2001 Am J Hum Genet 69:1)

• Everybody uses it to describe things in genomic selection.


Bayesian Inference

Gibbs sampling

• Iterative procedure

– Construct a joint distribution p(A,B,C)

• Typically this distribution contains phenotypes + a priori information + likelihood

• Want to draw inferences from this distribution, for instance the expected value of A

– Sampling

• Sample A from p(A|B,C)

• Sample B from p(B|A,C)

• Sample C from p(C|A,B)

• Sample A from p(A|B,C)

• Sample B from p(B|A,C)

• Sample C from p(C|A,B)

• …
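As a toy illustration of this sampling scheme (not one of the genomic models discussed later), the sketch below runs a Gibbs sampler for two standard normal variables A and B with correlation rho, where each conditional is normal with mean rho times the other variable. The function name, the seed, the starting values and the chain length are arbitrary choices for the example.

```python
import random
from math import sqrt

def gibbs_bivariate_normal(rho, n_iter=20000, burn_in=1000, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = sqrt(1.0 - rho * rho)       # conditional standard deviation
    a, b = 5.0, -5.0                 # deliberately poor starting values
    kept = []
    for it in range(n_iter):
        a = rng.gauss(rho * b, sd)   # sample A from p(A | B)
        b = rng.gauss(rho * a, sd)   # sample B from p(B | A)
        if it >= burn_in:            # discard the burn-in iterations
            kept.append((a, b))
    return kept

samples = gibbs_bivariate_normal(rho=0.8)
mean_a = sum(a for a, _ in samples) / len(samples)        # estimate of E(A), true value 0
mean_ab = sum(a * b for a, b in samples) / len(samples)   # estimate of E(AB), true value 0.8
print(len(samples), round(mean_a, 2), round(mean_ab, 2))
```

The inference (here the expected value of A) is the average over the retained samples, not the last sampled value.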


Gibbs sampling

• Iterative procedure in two steps

– Burn-in

• Some iterations of "burn-in"

• Typical trace along the iterations

Gibbs sampling

– The final result at the end of the iterations is NOT the solution looked for (this contrasts with REML)

– No clear measure of convergence

– We accumulate information over the post-burn-in iterations

• Solution = average of the post-burn-in iterations

• Example:

• $\hat{a}_i^{final} = \frac{1}{n} \sum \tilde{a}_i$, where $\tilde{a}_i$ is sampled over n iterations

• E.g., in BayesC, $\hat{\delta}_i = 0.3$ means that over 1000 iterations, 300 times $\tilde{\delta}_i = 1$ and 700 times $\tilde{\delta}_i = 0$


Models for Genomic selection

• Single marker

• Whole-genome (multiple marker) genomic

selection


Single marker

• Assume there is a marker in complete LD with a QTL

• For example, the polymorphism in the halothane gene (HAL), which is a predictor of bad meat quality in swine

26

Single marker

• Estimating breeding values including the marker is a piece of cake

• yi = marker effect in animal i + e

– We substitute the true, possibly unknown gene by a proxy observed marker and estimate effects of the latter using a linear model

– We can include an additional polygenic genetic value of animal i


Base model

• y = … + Za + e

– Z = incidence matrix of marker effects

– a = marker effect

– e = residuals

$$ Za = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} $$

3 individuals, 1 marker with 4 alleles

• This can be solved, for example, by least squares
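The incidence matrix of the example can be built by counting allele copies. In the sketch below the genotypes 2/3, 1/1 and 2/4 are an assumption: they are simply the allele pairs that reproduce the three rows of Z shown above, and the function name is illustrative.

```python
def allele_incidence(genotypes, n_alleles):
    """Row i counts how many copies of each allele individual i carries."""
    rows = []
    for alleles in genotypes:
        row = [0] * n_alleles
        for allele in alleles:
            row[allele - 1] += 1     # alleles are numbered from 1
        rows.append(row)
    return rows

# Assumed genotypes (allele pairs) of the 3 individuals at a 4-allele marker.
genotypes = [(2, 3), (1, 1), (2, 4)]
Z = allele_incidence(genotypes, n_alleles=4)
for row in Z:
    print(row)
```

Every row sums to 2 because each individual carries two alleles at the marker.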

28

Single marker

• This is fine if we know which markers are good predictors of which genes

• But this is rarely the case

– It can be shown that you miss a lot of information by trying to locate the QTLs

– And the effects of those that you find are certainly exaggerated


• Go to notes

30

Whole genome

• If we don't select QTL regions we skip the problem of bias

• Therefore:

– Genetic value = sum of effects of all regions

• We effectively treat all regions as being carriers of a QTL

– How do we estimate the effects of all regions?


Whole genome

• The simplest approach is an extension of single marker analysis

• Do multiple marker regression

• You want to cover all the genome => many markers

32

Multiple marker additive model

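4 individuals, 2 markers each (2 alleles in 1st marker, 4 alleles in 2nd marker)

• y = Za + e

– Z = incidence matrix of marker effects

– a = marker effect

– e = residuals

$$ Za = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 \\ 2 & 0 & 2 & 0 & 0 & 0 \\ 0 & 2 & 1 & 0 & 0 & 1 \\ 0 & 2 & 1 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} a_{1,1} \\ a_{1,2} \\ a_{2,1} \\ a_{2,2} \\ a_{2,3} \\ a_{2,4} \end{pmatrix} $$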


Estimating SNP effects

• The simultaneous estimates of many markers by least squares are very poor

– if we have many more SNPs than individuals

– They are thus terribly bad for genomic predictions as well

• Even if we had many individuals, there is a missing piece of information:

– we think that most SNPs do not have an effect, or at least not a big one

– this is « prior » information

• Can we do something?

34

Best Predictor as a Bayesian estimator

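$$ \hat{a} = E(a \mid y) = \frac{\int a \, p(y \mid a) \, p(a) \, da}{\int p(y \mid a) \, p(a) \, da} $$

– $\hat{a}$: estimate of SNP effects

– $p(a)$: « prior » (how we think SNP effects are)

– $p(y \mid a)$: « likelihood » (how SNP effects affect the phenotype)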


Best Predictor as a « penalized » estimator

• Statisticians & « machine learners » often use « penalized » estimators (Ridge regression, Elastic net…)

• A penalized estimator is the same as the « best predictor » (or Bayesian estimator) above, with the prior now called « penalization »

35

In the reference population:

– Get markers' genotypes ($Z_r$)

– Get phenotypes ($y$)

– Estimate marker effects $a$ from $y = 1\mu + Z_r a + e$, possibly with a Bayesian model

In the candidates:

– Get markers' genotypes ($Z_c$)

– Take estimates $\hat{a}$ from above

– Estimate breeding values as $\hat{u}_c = Z_c \hat{a}$
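The last step is a plain matrix-by-vector product of the candidates' genotypes and the estimated marker effects. A minimal sketch follows, with made-up marker effects and genotypes (three SNPs coded 0/1/2, two candidates); none of these numbers come from real data.

```python
def predict_gebv(z_candidates, a_hat):
    """Genomic EBV of each candidate: row of Z_c times the SNP estimates."""
    return [sum(z * a for z, a in zip(row, a_hat)) for row in z_candidates]

a_hat = [0.5, -0.2, 0.1]        # hypothetical SNP effect estimates
Z_c = [[0, 1, 2],               # hypothetical genotypes of candidate 1
       [2, 2, 0]]               # hypothetical genotypes of candidate 2
u_hat = predict_gebv(Z_c, a_hat)
print([round(u, 3) for u in u_hat])
```

No phenotype of the candidates is needed, only their genotypes and the estimates obtained in the reference population.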


• Go to notes

38

A priori Distributions for marker effects

• Several distributions for SNP effects have been proposed

– Normal (Meuwissen et al., Genetics 2001; Van Raden JDS 2008) → BLUP_SNP or GBLUP or RR-BLUP or « Ridge regression »

– BayesA, BayesB (Meuwissen et al. 2001; Habier et al., 2011)

– Mixture of normals, BayesC(Pi) (Van Raden JDS 2008; Habier et al., 2011)

– (Bayesian) Lasso (Usai et al., 2009; De los Campos et al., 2009)


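Prior variances for SNP effects

Normal: $a_i \sim N(0, \sigma^2_a)$, with $Var(a_i) = \sigma^2_a$

BayesA: $a_i \sim t(0, S^2_a, \nu)$, with $Var(a_i) = S^2_a \, \nu / (\nu - 2)$

BayesCPi: $a_i = 0$ with $Pr = \pi$ and $a_i \sim N(0, \sigma^2_a)$ with $Pr = 1 - \pi$, so that $Var(a_i) = (1 - \pi) \sigma^2_a$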

40

Normal distribution

[Figure: density of the normal distribution, dnorm(x), for x from -4 to 4]

Few « big » effects

$a_i \sim N(0, \sigma^2_a)$


Normal equations for genomic selection

(BLUP_SNP)

• If we assume normality there are closed expressions for â

• This is called « BLUP », and also « genomic BLUP », BLUP_SNP, or GBLUP, but also « ridge regression » or Random Regression-BLUP

– I will keep GBLUP for the use of the genomic relationship matrix

– and BLUP_SNP for the direct estimation of SNP effects

42

Mixed model equations for BLUP_SNP

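• Henderson's MME

• Z'Z is not diagonal

• Var(a) = D is diagonal if we assume uncorrelated SNP effects

$$ \begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix} $$

$$ D = I\sigma^2_a = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \sigma^2_a $$

D could (will) be something different !!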


Solving BLUP_SNP: GS with Residual Update

• How to estimate SNP effects efficiently (Legarra and Misztal, J. Dairy Sci. 2008) (reinvented many times)

• Let us assume a random SNP model (“BLUP_SNP”)

• Mixed model equations can be solved by direct inversion (1 iteration) or by Gauss-Seidel, PCG or Jacobi (iterative methods, useful for large matrices).

• MCMC (BayesB, etc.) can be done starting from Gauss-Seidel

• The number of effects (SNP) n is much larger than the number of records m, and the matrix Z'Z is dense. A typical example (2000 records, 20000 SNP):

44

Efficient solvers for BLUP_SNP:

• Gauss-Seidel with Residual Update (Legarra and Misztal, J. Dairy Sci. 2008) (reinvented many times), implemented in GS3

– Forms the basis of the Gibbs sampling algorithms in BayesC, etc.

– Iterate on:

1. Estimate SNP effect

2. Correct data for this SNP effect

• Preconditioned Conjugate Gradients (not in GS3)

– Estimate all SNP simultaneously using “search” based on residuals at each iteration


The size of the MME

Model: y = Za, where Z has m rows (records) and n columns (SNP): 40,000,000 elements

Equations: Z'Z â = Z'y, where Z'Z (dense) is n × n: 400,000,000 elements

Much bigger! Is this memory efficient? Easy to solve?
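The element counts follow directly from the typical example of 2000 records and 20000 SNP. The short calculation below reproduces them and, assuming 8-byte double precision storage (an assumption of this sketch, not stated on the slide), converts them to memory.

```python
m = 2000           # number of records
n = 20000          # number of SNP effects
BYTES_PER_VALUE = 8   # assumed double precision storage

z_elements = m * n        # elements of Z (m x n)
ztz_elements = n * n      # elements of the dense Z'Z (n x n)

z_gb = z_elements * BYTES_PER_VALUE / 1e9
ztz_gb = ztz_elements * BYTES_PER_VALUE / 1e9
print(z_elements, ztz_elements, z_gb, ztz_gb)
```

Under that assumption Z'Z needs ten times the memory of Z, which is the motivation for solvers that only ever work with the columns of Z.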

46

Reordering Gauss Seidel

• Gauss Seidel uses the conditional mean for the i-th effect, corrected by the other effects:

• $(z_i'z_i + \lambda) \, \hat{a}_i^{l+1} = z_i'(y - Z\hat{a} + z_i \hat{a}_i^{l})$

• Note that we are correcting for $\hat{a}_i$, so we put it back


Reordering Gauss Seidel

• Gauss Seidel uses the conditional mean for the i-th effect, corrected by the other effects:

• $(z_i'z_i + \lambda) \, \hat{a}_i^{l+1} = z_i'(y - Z\hat{a} + z_i \hat{a}_i^{l})$

• Correcting for Zâ takes 20000 op.

• This is the residual ê, isn't it?

• Use the alternative formula

$(z_i'z_i + \lambda) \, \hat{a}_i^{l+1} = z_i'\hat{e} + z_i'z_i \, \hat{a}_i^{l}$

48

Reordering the error term

• Still we need to compute ê at each iteration

• Actually only âi changed

• It can be shown that ê can be « updated »

$\hat{e}^{l+1} = \hat{e}^{l} - z_i(\hat{a}_i^{l+1} - \hat{a}_i^{l})$

– Hence « GSRU », Gauss Seidel with Residual Updating

– Some machine learning literature calls this « backfitting »


GSRU in Figure

[Figure: block diagram of the model y = Za]

1- Gauss-Seidel: $(z_i'z_i + \lambda) \, \hat{a}_i^{l+1} = z_i'\hat{e} + z_i'z_i \, \hat{a}_i^{l}$

2- Residual Updating: $\hat{e}^{l+1} = \hat{e}^{l} - z_i(\hat{a}_i^{l+1} - \hat{a}_i^{l})$


Fortran pseudocode

    double precision :: xpx(neq), y(ndata), e(ndata), X(ndata,neq), &
                        sol(neq), lambda, lhs, rhs, val, vare, vara

    do i=1,neq
       xpx(i)=dot_product(X(:,i),X(:,i))   ! form diagonal of X'X
    enddo
    sol=0                                  ! start from null solutions ...
    e=y                                    ! ... so the residual equals y
    do until convergence
       do i=1,neq
          ! form lhs X'R-1X + G-1
          lhs=xpx(i)/vare+1/vara
          ! form rhs X'R-1y with y corrected by other effects (formula 1)
          rhs=dot_product(X(:,i),e)/vare + xpx(i)*sol(i)/vare
          ! do Gauss Seidel
          val=rhs/lhs
          ! MCMC sample solution from its conditional (commented out here)
          ! val=normal(rhs/lhs,1d0/lhs)
          ! update e with current estimate (formula 2)
          e=e - X(:,i)*(val-sol(i))
          ! update sol
          sol(i)=val
       enddo
    enddo
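For readers who want to run the algorithm, here is a small Python sketch of GSRU. It is not the GS3 implementation: it solves (X'X + λI)a = X'y, that is, the equations of the pseudocode multiplied by the residual variance with λ = vare/vara, and the stopping rule and the tiny data set are assumptions made for this example.

```python
def gsru(X, y, lam, n_rounds=200, tol=1e-16):
    """Gauss-Seidel with residual update for (X'X + lam*I) a = X'y."""
    n_data, n_eq = len(X), len(X[0])
    cols = [[X[r][i] for r in range(n_data)] for i in range(n_eq)]
    xpx = [sum(v * v for v in col) for col in cols]    # diagonal of X'X
    sol = [0.0] * n_eq
    e = list(y)                                        # residual when sol = 0
    for _ in range(n_rounds):
        change = 0.0
        for i, col in enumerate(cols):
            # rhs: data corrected for all other effects (formula 1)
            rhs = sum(x * r for x, r in zip(col, e)) + xpx[i] * sol[i]
            val = rhs / (xpx[i] + lam)
            delta = val - sol[i]
            # update the residual with the change in this effect (formula 2)
            e = [r - x * delta for r, x in zip(e, col)]
            sol[i] = val
            change += delta * delta
        if change < tol:                               # assumed stopping rule
            break
    return sol, e

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up covariates
y = [1.0, 2.0, 3.0]                        # made-up records
sol, e = gsru(X, y, lam=1.0)
print([round(s, 4) for s in sol])
```

Only the diagonal of X'X and the columns of X are used, so the dense X'X is never formed.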

52

Preconditioned Conjugate Gradients

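• The other method of choice to solve large systems of equations (e.g. Strandén and Lidauer, 1998; Tsuruta et al., 2001)

• Based on repeated computations of Ax, where A is the left-hand side of the MME

• Can easily be done for genomic models as W'(Wx) + Σ⁻¹x at a cost of 3nm operations

• PCG is much faster (but less general)

$$ \begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix} $$

$$ Ax = (W'W + \Sigma^{-1}) x $$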


Preconditioned Conjugate Gradients for BLUP_SNP

[Figure: log10(convergence) against round (0 to 10,000) with real data (Holstein), for PCG and GSRU]

PCG is much faster. GSRU convergence is slow for large data sets (or you really need to wait). Still, EBVs seem identical, possibly because errors in SNP estimates cancel out when summing.

54

BLUP_SNP parameters

• How do we get the variance of SNP effects, σ²a, from a genetic variance σ²g?

• The formula comes from the sampling variance of the covariates in Z that link SNP effects to data

– i.e., we try to explain all genetic variance as if « caused » by SNP effects, and these SNP effects have a variance of σ²a

• Assumes Hardy-Weinberg and linkage equilibrium

$$ \sigma^2_a = \frac{\sigma^2_g}{2 \sum_{i \in \text{all SNPs}} p_i (1 - p_i)} $$
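A quick numerical check of this formula, with hypothetical inputs (a genetic variance of 1 and 1000 SNPs, all at allele frequency 0.5); the function name is illustrative.

```python
def snp_variance(sigma2_g, allele_freqs):
    """Variance of SNP effects: sigma2_g / (2 * sum of p(1-p) over all SNPs)."""
    denom = 2.0 * sum(p * (1.0 - p) for p in allele_freqs)
    return sigma2_g / denom

freqs = [0.5] * 1000                 # hypothetical allele frequencies
sigma2_a = snp_variance(1.0, freqs)  # 1 / (2 * 1000 * 0.25)
print(sigma2_a)
```

With more SNPs, or with more intermediate frequencies, the same genetic variance is split into smaller per-SNP variances.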


Residual variance with pseudo-data

• What is the residual variance if we use DYDs?

– DYD is a “milk yield” assigned to a bull (had it been a cow)

• DYD = performance of the daughters, corrected by dams' BVs and other effects. Ideally:

$$ 2 \, DYD_i = u_i + 2 \frac{1}{n_i} \sum_j \phi_j + 2 \frac{1}{n_i} \sum_j e_j = u_i + \varepsilon_i $$

– $u_i$: bull BV

– $\phi_j$: Mendelian sampling of his daughters

– $e_j$: « true » residuals

– $\varepsilon_i$: « pseudo » residuals

But $Var(\phi) = \frac{1}{2} \sigma^2_u$

And therefore

$$ Var(\varepsilon_i) = \frac{1}{n_i} \left( 2\sigma^2_u + 4\sigma^2_e \right) $$

ni = « edc », equivalent daughter contribution

• Residual variances with deregressed proofs can be found in Garrick et al. 2009


Estimating variances = BayesC (with π=0)

• It simply consists in a BLUP_SNP where we estimate (and simultaneously « integrate out ») σ²a and σ²e

– i.e., a regular Gibbs sampler applied to SNPs instead of EBVs (Gibbs-SNP ??)

– Legarra et al., 2008 (we didn't call it BayesC), Habier et al., 2011

• Pretty straightforward from GSRU

• You can as well estimate σ²a and σ²e using « BayesC » and take them as known in BLUP_SNP (e.g. as in REML+BLUP analysis)

$$ y = Xb + Za + e $$

$$ a \mid \sigma^2_a \sim MVN(0, I\sigma^2_a); \quad \sigma^2_a \sim \nu_a S^2_a \chi^{-2}_{\nu_a} $$

$$ e \mid \sigma^2_e \sim MVN(0, R\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e \chi^{-2}_{\nu_e} $$

58

Fortran pseudocode for BayesC ...

    do j=1,niter
       do i=1,neq
          ! form lhs X'R-1X + G-1
          lhs=xpx(i)/vare+1/vara
          ! form rhs with y corrected by other effects (formula 1)
          rhs=dot_product(X(:,i),e)/vare + xpx(i)*sol(i)/vare
          ! MCMC sample solution from its conditional
          val=normal(rhs/lhs,1d0/lhs)
          ! update e with current estimate (formula 2)
          e=e - X(:,i)*(val-sol(i))
          ! update sol
          sol(i)=val
       enddo
       ! draw variance components
       ss=sum(sol**2)+nua*Sa
       vara=ss/chi(nua+nsnp)
       ss=sum(e**2)+nue*Se
       vare=ss/chi(nue+ndata)
    enddo
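The variance draws in the pseudocode divide a sum of squares by a chi-square deviate. The Python sketch below shows one way to do that step with the standard library, using the fact that a chi-square with df degrees of freedom is a gamma variable with shape df/2 and scale 2. The SNP solutions and the prior values `nua` and `Sa` are made up for the illustration.

```python
import random

def draw_scaled_inv_chi2(ss, df, rng):
    """Draw ss / chi2(df); chi2(df) is sampled as Gamma(shape=df/2, scale=2)."""
    return ss / rng.gammavariate(df / 2.0, 2.0)

rng = random.Random(2014)
sol = [0.1, -0.2, 0.05, 0.0, 0.15]   # hypothetical current SNP solutions
nua, Sa = 4.0, 0.01                  # hypothetical prior degrees of freedom and scale
ss = sum(s * s for s in sol) + nua * Sa
df = nua + len(sol)

# Repeated draws, only to show the distribution the sampler draws from.
draws = [draw_scaled_inv_chi2(ss, df, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
print(round(mean_draw, 4), round(ss / (df - 2.0), 4))
```

Inside the Gibbs sampler a single draw per iteration is taken; the long run here just checks that the draws average close to ss/(df − 2), the mean of this distribution.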


Estimate of this SNP

BLUP_SNP solution:

$$ \hat{a}_{BLUP} = \frac{x'y / \sigma^2_e}{x'x / \sigma^2_e + 1 / \sigma^2_a} $$

where σ²a is the variance of the SNP

Least squares solution (e.g. in GWAS):

$$ \hat{a}_{GWAS} = \frac{x'y / \sigma^2_e}{x'x / \sigma^2_e} $$

In BLUP_SNP, we shrink the least squares estimate towards 0 because usually 1/σ²a is a large number
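A numerical illustration of the shrinkage, with made-up values of x'y, x'x and the two variances (none of them come from the slides):

```python
def snp_estimates(xpy, xpx, vare, vara):
    """Return the least squares (GWAS) and BLUP_SNP estimates of one SNP effect."""
    gwas = (xpy / vare) / (xpx / vare)
    blup = (xpy / vare) / (xpx / vare + 1.0 / vara)
    return gwas, blup

# Hypothetical values: 1/vara = 1000 is as large as x'x/vare.
gwas, blup = snp_estimates(xpy=50.0, xpx=1000.0, vare=1.0, vara=0.001)
print(gwas, blup)

# With a very large SNP variance the shrinkage disappears.
gwas_big, blup_big = snp_estimates(xpy=50.0, xpx=1000.0, vare=1.0, vara=1e12)
print(gwas_big, blup_big)
```

In the first case the BLUP_SNP estimate is half the least squares estimate; in the second the two are practically equal.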


Estimate of this SNP

• So, the estimate is much smaller than the GWAS estimate

• But we can fit all SNPs simultaneously

• And this provides unbiased (in some sense) estimates

• However, the result is very confusing for QTL detection, and it is unclear how these estimates behave for “large” QTLs:

[Figure: squared SNP solutions (snps$solution^2, up to about 1e-04) plotted against SNP index (0 to 10,000)]

Estimate of this SNP

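This suggests an iterative/adaptive strategy

$$ \hat{a}_{i,BLUP} = \frac{x'y / \sigma^2_e}{x'x / \sigma^2_e + 1 / \sigma^2_{a_i}} $$

where $\sigma^2_{a_i}$ is the variance of THIS SNP

If $\sigma^2_{a_i} \rightarrow \infty$ we get the least squares estimate

The more important the SNP, the larger $\sigma^2_{a_i}$

– BayesA, etc. (see later)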


BayesA

• We « estimate » a different σ²a for each SNP

– this estimate is horribly bad

– but SNP solutions correspond to a model with « t » distributions

• Pretty straightforward from GSRU

$$ y = Xb + Za + e $$

$$ e \mid \sigma^2_e \sim MVN(0, R\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e \chi^{-2}_{\nu_e} $$

Representation as « t »:

$$ a_i \sim t(0, S^2_a, \nu_a) $$

Meuwissen et al. representation:

$$ a_i \mid \sigma^2_{a,i} \sim N(0, \sigma^2_{a,i}); \quad \sigma^2_{a,i} \sim \nu_a S^2_a \chi^{-2}_{\nu_a} $$


Normal vs. BayesA

[Figure: density of the normal prior, dnorm(x), compared with the BayesA prior, for x from -5 to 5]

Big effects are more likely in BayesA

66

Fortran pseudocode for BayesA ...

    do j=1,niter
       do i=1,neq
          ! form lhs, with a SNP-specific variance
          lhs=xpx(i)/vare+1/vara(i)
          ! form rhs with y corrected by other effects (formula 1)
          rhs=dot_product(X(:,i),e)/vare + xpx(i)*sol(i)/vare
          ! MCMC sample solution from its conditional
          val=normal(rhs/lhs,1d0/lhs)
          ! update e with current estimate (formula 2)
          e=e - X(:,i)*(val-sol(i))
          ! update sol
          sol(i)=val
          ! draw the variance of this SNP
          ss=sol(i)**2+nua*Sa
          vara(i)=ss/chi(nua+1)
       enddo
       ! draw residual variance
       ss=sum(e**2)+nue*Se
       vare=ss/chi(nue+ndata)
    enddo


BayesB (mixture with t distribution)

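[Figure: density of a t distribution, dt(x, 4), for x from -4 to 4]

A fraction π of markers has null effects.

Otherwise a t distribution (big effects are not unlikely)

$a_i = 0$ with probability $\pi$; $a_i \sim t(0, S^2_a, \nu_a)$ with probability $1 - \pi$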

68

BayesB

• e.g. Meuwissen et al., 2001

• What if some SNP had no effect in BayesA?

– This is the original idea of BayesB

– needs the probability that a given SNP is in the model or not

– can be computed by MCMC but is notoriously more difficult (see for instance Villanueva et al., doi: 10.2527/jas.2010-3814)

$$ y = Xb + Za + e $$

$$ a_i \mid \delta_i \sim \begin{cases} t(0, S^2_a, \nu_a) & \text{if } \delta_i = 1 \\ 0 & \text{if } \delta_i = 0 \end{cases} $$

$$ \delta_i = \begin{cases} 0 & \text{with probability } \pi \\ 1 & \text{with probability } 1 - \pi \end{cases} $$

$$ e \mid \sigma^2_e \sim MVN(0, R\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e \chi^{-2}_{\nu_e} $$

π is fixed


Mixture distribution or BayesC(Pi)

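[Figure: density of the normal distribution, dnorm(x), for x from -4 to 4]

A fraction π of markers has null (or almost null) effects.

Otherwise they are normal

$a_i = 0$ with probability $\pi$; $a_i \sim N(0, \sigma^2_a)$ with probability $1 - \pi$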

70

BayesCPi

• e.g. Habier et al., 2011 (see also Rohan Fernando course notes)

• What if some SNP had no effect?

– This is the original idea of BayesB

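– needs the probability that a given SNP is in the model or not

– can be computed by MCMC

$$y = Xb + Za + e$$

$$a_i \mid \sigma_a^2, \delta_i \sim \begin{cases} N(0, \sigma_a^2) & \text{if } \delta_i = 1 \\ 0 & \text{if } \delta_i = 0 \end{cases}$$

$$\delta_i = \begin{cases} 1 & \text{with probability } 1-\pi \\ 0 & \text{with probability } \pi \end{cases}$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2)$$

π can be fixed or estimated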


BayesCPi

• The algorithm consists of a BLUP_SNP by GSRU where we estimate (and simultaneously « integrate out ») σ²a and σ²e

– for each SNP we compute the probability of it being « in » the model (indicator variable δ)

• This was a nightmare in the original BayesB

• R. Fernando found a simple way of computing it (course notes:

http://www.ans.iastate.edu/stud/courses/short/2010short.html ) that is

« like » GSRU

– we can equally estimate the proportion π or fix it beforehand

Fortran pseudocode for BayesCPi ...

do j=1,niter
  do i=1,neq
    ...
    ! compute likelihood for state 1 (i -> in model) and 0 (not in model)
    ! Notes by RLF (2010, Bayesian Methods in
    ! Genome Association Studies, p 47/67)
    v1=xpx(i)*vare+(xpx(i)**2)*vara
    v0=xpx(i)*vare
    rj=rhs*vare ! because rhs=X'R-1(y corrected)
    ! prob state delta=0
    like2=density_normal((/rj/),v0) ! rj ~ N(0,v0)
    ! prob state delta=1
    like1=density_normal((/rj/),v1) ! rj ~ N(0,v1)
    ! add prior for delta
    like2=like2*pi; like1=like1*(1-pi)
    ! standardize (store the total first so both are divided by the same value)
    tot=like2+like1
    like2=like2/tot; like1=like1/tot
    delta(i)=sample(states=(/0,1/),prob=(/like2,like1/))
    if(delta(i)==1) then
      val=normal(rhs/lhs,1d0/lhs)
    else
      val=0
    endif
    ...
  enddo
  ! draw pi (proportion of SNPs not in the model)
  pi=1-beta(count(delta==1)+apriori_included,count(delta==0)+apriori_not_included)
  ! draw the variance of SNP effects
  ss=sum(sol**2)+nua*Sa
  vara=ss/chi(nua+count(delta==1))
  ...
enddo
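The indicator step of this pseudocode can be isolated as a small function. The following Python sketch is illustrative only: the names `log_normal_density` and `prob_in_model` are made up for the example, `pi` is taken as the prior probability of a null effect (as in the pseudocode), and the computation is done on the log scale to avoid underflow, which is a choice of this sketch rather than something stated in the slides.

```python
import math

def log_normal_density(x, var):
    # Log density of N(0, var) evaluated at x
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)

def prob_in_model(rj, xpx, vare, vara, pi):
    """Probability that delta_i = 1 given rj = x'(y corrected for other effects).

    pi is the prior probability that the SNP has a null effect (0 < pi < 1).
    """
    v0 = xpx * vare                     # variance of rj if the SNP is out
    v1 = xpx * vare + xpx ** 2 * vara   # variance of rj if the SNP is in
    log0 = log_normal_density(rj, v0) + math.log(pi)
    log1 = log_normal_density(rj, v1) + math.log(1.0 - pi)
    m = max(log0, log1)                 # subtract the max before exponentiating
    w0 = math.exp(log0 - m)
    w1 = math.exp(log1 - m)
    return w1 / (w0 + w1)
```

A draw of the indicator is then a Bernoulli trial with this probability; when `vara` is 0 the data carry no information and the function returns the prior value `1 - pi`.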


BayesCPi

• So far this looks simple

• But BayesCPi has many details & caveats

– How to run the Gibbs sampler?

• Rule of thumb: iterate n times the number of markers, with 100 > n > 5

– (need to find the good combination of markers)

– Do we estimate or fix π? At which values?

– What do we get as results?

BayesCPi

• Parameter π (or (1 − π)) sets the number of SNPs in the model, as a proportion of all SNPs

• Do we estimate or fix it?

– In theory we can estimate it

– In practice it is very tricky

• Colombani et al. could estimate it in Holstein but not in Montbéliarde (estimate too imprecise)

– Usually we « fix » it to 1/1000 (50 SNP out of 50,000) for QTL detection and to 1/100 for genomic selection

– Or, we put a uniform prior on π for genomic selection


BayesCPi

• Parameter π and genetic variance

• In the case of BayesCPi, $\sigma_g^2 = 2\pi \sum p_i q_i \sigma_a^2$ (here π is the proportion of SNPs in the model)

• So, to recover all genetic variance from SNPs, we need to modify π and $\sigma_a^2$ at the same time

– Then $\sigma_a^2 = \dfrac{\sigma_g^2}{2\pi \sum p_i q_i}$

• So, π = 0.001 implies that $\sigma_a^2$ is 1000 times larger than in BLUP_SNP and estimates are less « shrunken »
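The relation between π and the SNP variance can be checked numerically. This is a minimal sketch; the function name `marker_variance` and the allele frequencies are invented for the illustration, and π is taken as the proportion of SNPs in the model, as in the formula above.

```python
def marker_variance(sigma2_g, pi, freqs):
    """SNP-effect variance implied by a genetic variance sigma2_g.

    pi: proportion of SNPs in the model; freqs: allele frequencies p_i.
    """
    sum_pq = sum(p * (1.0 - p) for p in freqs)
    return sigma2_g / (2.0 * pi * sum_pq)

freqs = [0.5, 0.5, 0.5, 0.5]                 # hypothetical frequencies
blup_snp = marker_variance(2.0, 1.0, freqs)   # all SNPs in the model
bayes_c = marker_variance(2.0, 0.001, freqs)  # 1 SNP out of 1000 in the model
ratio = bayes_c / blup_snp
```

With these inputs `ratio` is 1000 up to rounding, which is the factor quoted in the text.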

BayesCPi

• Output of BayesCPi

– For each SNP, the marginal posterior probability of being « in the model »

– Not a single subset of SNPs « in the model »

effect level solution sderror p

2 1 0.49637122E-02 0.63842196E-01 0.69375000E-02

2 2 0.49501460E-03 0.17864670E-01 0.10375000E-02

2 3 0.38664734E-04 0.79524430E-02 0.32500000E-03

2 4 0.18222423E-04 0.59148438E-02 0.25000000E-03

2 5 0.21643136E-03 0.11477947E-01 0.53750000E-03

2 6 -0.55016190E-03 0.28990326E-01 0.97500000E-03

2 7 0.94168849E-04 0.74293395E-02 0.28750000E-03

• This implies that most SNP are in LD with some QTL somewhere

• Sometimes, a single SNP stands out → large QTL

[Figure: posterior probability of each SNP being in the model, by position on OAR12; Salle et al. (JAS)]

BayesCPi

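• How do we declare a « positive » QTL?

• Have no p-values in this analysis

– Bayesians insist on using the Bayes Factor for that (Wakefield; Bertrand & Stephens, etc.) but there are no clear rules on how

• Construct the Bayes Factor as posterior « odds » divided by prior « odds »:

$$BF = \frac{p(\text{SNP in the model} \mid \text{data}) \, / \, p(\text{SNP not in the model} \mid \text{data})}{p(\text{SNP in the model}) \, / \, p(\text{SNP not in the model})}$$

In our case (with π the prior probability of a SNP being in the model):

$$BF_i = \frac{1-\pi}{\pi} \, \frac{p(\delta_i = 1 \mid y)}{1 - p(\delta_i = 1 \mid y)}$$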


BayesCPi

• What thresholds for BF?

• Some people suggest using permutations → too long

• Use a scale adapted from Kass & Raftery (1995), used in QTL detection by Varona et al. (2001, GSE) and Vidal et al. (2005, JAS)

– BF = 3-20 « suggestive »

– BF = 20-150 « strong »

– BF > 150 « very strong »

• We don't need correction for multiple testing (Bonferroni):

– all SNP were introduced at the same time

– and the prior already « penalizes » their estimates
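As an illustration, the Bayes Factor and the scale above can be computed from the posterior probability column of the BayesCPi output. This is a sketch under assumptions: the function names are hypothetical, `prior_prob` is the prior probability of a SNP being in the model, and the value 0.001 used below is only an example, not a value taken from the analysis shown.

```python
def bayes_factor(post_prob, prior_prob):
    """Posterior odds of 'SNP in the model' divided by prior odds."""
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds

def evidence_label(bf):
    # Scale quoted in the text: 3-20, 20-150 and above 150
    if bf > 150:
        return "very strong"
    if bf >= 20:
        return "strong"
    if bf >= 3:
        return "suggestive"
    return "not declared"

# Posterior probability of the first SNP in the output table, assumed prior 0.001
bf = bayes_factor(0.0069375, 0.001)
label = evidence_label(bf)
```

The Bayes Factor depends on the prior odds, so the same posterior probability leads to a different BF if π is changed.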

[Figure: Bayes Factor (BF) by position on OAR12, with the « very strong » threshold marked; Salle et al. (JAS)]

Lasso (double exponential)

[Figure: double exponential density, dexp(abs(x), 4), for x from -4 to 4]

Often a marker has an almost null effect

Otherwise big effects are not unlikely

$$p(a_i \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda |a_i|)$$

Lasso

Hierarchical representation of Lasso

• y : data

• a : SNP effects

$$y = Xb + Za + e$$

$$p(a_i \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda |a_i|) \quad \text{(distribution of SNP effects)}$$

$$e \mid \sigma_e^2 \sim MVN(0, I\sigma_e^2)$$

[Figure: double exponential density, dexp(abs(x), 4), for x from -4 to 4]

• This Bayesian Lasso is being used for genomic selection (De los Campos et al., 2009)

• The following is largely from Legarra et al. (Genetical Res., 2011)

Bayesian Lasso

• In regular Lasso, λ is typically computed by cross-validation

– which depends strongly on the constitution of the training & validation data sets

– and is tricky to compute

• the Bayesian Lasso (Park & Casella 2008) uses an equivalent hierarchical model:

$$y = Xb + Za + e$$

$$p(a \mid \tau) \sim N(0, D\sigma^2)$$

$$p(e) \sim N(0, I\sigma^2)$$

$$D = \mathrm{diag}(\tau_1^2, \tau_2^2, \ldots, \tau_n^2), \qquad p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2} e^{-\lambda^2 \tau_i^2 / 2}$$

$\tau^2$ are « weights » of $\sigma^2$

• Assume SNP effects have a different « variance », i.e. set $\sigma^2 = 1$

• This is more similar to BLUP_SNP, BayesA, BayesC, etc.

– Equivalent to Tibshirani's original Lasso

$$y = Xb + Za + e$$

$$p(a \mid \tau) \sim N(0, D)$$

$$p(e) \sim N(0, I\sigma_e^2)$$

Tibshirani's BL

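Bayesian Lasso vs. BayesA

$$y = Xb + Za + e$$

$$p(a \mid \tau) \sim N(0, D), \qquad p(e) \sim N(0, I\sigma_e^2)$$

$$D = \mathrm{diag}(\tau_1^2, \tau_2^2, \ldots, \tau_n^2)$$

$\tau^2$ are « variances » of SNP effects ($\sigma_{ai}^2$ in BayesA)

Distribution of the variances:

– BL: exponential, $p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2} e^{-\lambda^2 \tau_i^2 / 2}$

– BayesA: scaled inverted chi-squared, with degrees of freedom ν and scale $S_a^2$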

Fortran pseudocode for BL

...

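do j=1,niter
  do i=1,neq
    ! form lhs; tau2(i) is the variance of SNP effect i
    lhs=xpx(i)/vare+1/tau2(i)
    rhs=dot_product(X(:,i),e)/vare+xpx(i)*sol(i)/vare
    val=normal(rhs/lhs,1d0/lhs)
    e=e-X(:,i)*(val-sol(i))
    sol(i)=val
    ! draw the variance of this SNP effect
    ! (1/tau2 is inverse Gaussian with mean sqrt(lambda2/ss) and shape lambda2;
    !  arguments are kept as on the slide, check the convention of rinvGauss)
    ss=sol(i)**2
    tau2(i)=1d0/rinvGauss(lambda2/ss,lambda2)
  enddo
  ! draw the residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
  ! update lambda
  ...
enddo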


Bayesian Lasso

• It gives different weights to larger SNPs

• Mixing is better than BayesCPi

• Performance in Genomic Selection is as good

(Colombani et al., 2013, JDS)

• But there is no clear notion of what SNP is a QTL, and a few papers with « lasso for QTL » are disappointing.


Advice

• Use everything: GBLUP, BayesCPi, Bayesian Lasso

• GBLUP is very good if variances are correct

– If not, do estimate them (G matrix + REML)

• Bayesian methods are sensitive to parameters

– BayesB, Cpi are more sensitive than BayesA, Bayesian Lasso

– Seems that the notion of « SNP with no effect » is incorrect

• Multiple marker methods need attention to details: correct priors and initial values, computation time, verification of convergence

• Details are typically overlooked by most users !!!


Quantitative genetics of markers

• Go to notes


Equivalences

• Pedigree (Malécot) relationships assume we have 2N founder alleles

• Then we genotype individual 9

• In this case,

– molecular coancestry = Malécot IBD coancestry

• However SNPs have 2 alleles

– How do these equivalences hold then?

[Figure: pedigree diagram with founder alleles labelled 1 to 8]

With SNPs…

• Let us imagine that to each one of the 2N founder alleles we assign at random a tag saying if the allele is A or a, with probability p and q=1-p

• Then we genotype 9

• Can we say which ancestral allele (1 to 8) individual 9 inherited?

[Figure: pedigree diagram with founder alleles labelled 1 to 8]

With SNPs…

• The molecular coancestry between two individuals i and j will be

– the probability that two alleles are equal (alike in state), fMij,

• either because they have become identical by descent, or

• because they are not identical by descent but equal in the base population.

$$f_{Mij} = p^2 + q^2 + 2pq f_{ij}$$
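A quick numerical check of the molecular coancestry formula, written as a hedged sketch (the function name is made up for the example). It also verifies the equivalent form f + (1 − f)(p² + q²): alleles are alike in state either because they are identical by descent, or because they are not but happen to carry the same tag.

```python
def molecular_coancestry(f_ij, p):
    """Probability that two alleles are alike in state at a biallelic SNP.

    f_ij: pedigree (IBD) coancestry; p: frequency of allele A in the base.
    """
    q = 1.0 - p
    return p * p + q * q + 2.0 * p * q * f_ij

p, f = 0.3, 0.25
direct = molecular_coancestry(f, p)
two_cases = f + (1.0 - f) * (p * p + (1.0 - p) ** 2)
```

Even unrelated individuals (f = 0) have a molecular coancestry of p² + q², which is why molecular and pedigree coancestries are not on the same scale.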


Real results (AMASGEN)

• 9 real French bulls among 1827 genotyped, ~50000 SNPs

• Very complex pedigree, simplified graph:

[Figure: simplified pedigree graph of the 9 bulls]


Pedigree-based relationship

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 1.00 0.51 0.57 0.51 0.26 0.15 0.15 0.14 0.14

[2,] 0.51 1.01 0.30 0.33 0.17 0.17 0.12 0.11 0.11

[3,] 0.57 0.30 1.07 0.30 0.20 0.12 0.18 0.11 0.12

[4,] 0.51 0.33 0.30 1.01 0.17 0.18 0.11 0.11 0.11

[5,] 0.26 0.17 0.20 0.17 1.00 0.56 0.51 0.52 0.53

[6,] 0.15 0.17 0.12 0.18 0.56 1.06 0.31 0.32 0.32

[7,] 0.15 0.12 0.18 0.11 0.51 0.31 1.01 0.30 0.29

[8,] 0.14 0.11 0.11 0.11 0.52 0.32 0.30 1.02 0.30

[9,] 0.14 0.11 0.12 0.11 0.53 0.32 0.29 0.30 1.03

Cousin relationships ~0.125

Little inbreeding

« First G » genomic relationship

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]

[1,] 0.82 0.40 0.43 0.38 0.12 0.04 0.04 0.01 0.10

[2,] 0.40 0.91 0.18 0.24 0.02 0.05 -0.04 -0.04 0.04

[3,] 0.43 0.18 0.88 0.19 0.07 0.00 0.07 -0.02 0.05

[4,] 0.38 0.24 0.19 0.86 0.02 -0.01 -0.02 0.01 0.03

[5,] 0.12 0.02 0.07 0.02 0.73 0.34 0.30 0.31 0.35

[6,] 0.04 0.05 0.00 -0.01 0.34 0.85 0.15 0.14 0.18

[7,] 0.04 -0.04 0.07 -0.02 0.30 0.15 0.80 0.14 0.17

[8,] 0.01 -0.04 -0.02 0.01 0.31 0.14 0.14 0.80 0.17

[9,] 0.10 0.04 0.05 0.03 0.35 0.18 0.17 0.17 0.85

Relationships among cousins are ~0

Less than 1 in the diagonal

Negative coefficients

$$G = \frac{ZZ'}{2 \sum_{\text{all SNPs}} p_i (1 - p_i)}$$
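A minimal sketch of this G, assuming genotypes coded as 0/1/2 copies of the reference allele and allele frequencies computed from the genotyped animals themselves (the function name and the toy genotypes are invented for the example):

```python
def genomic_relationship(genotypes):
    """G = ZZ' / (2 * sum p_i (1 - p_i)) with Z centered by observed frequencies.

    genotypes: list of rows, one per animal, with 0/1/2 allele counts.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2.0 * n) for k in range(m)]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    # Centered gene content: z - 2p
    Z = [[row[k] - 2.0 * freqs[k] for k in range(m)] for row in genotypes]
    return [[sum(a * b for a, b in zip(Z[i], Z[j])) / scale
             for j in range(n)] for i in range(n)]

G = genomic_relationship([[0, 1, 2, 1],
                          [1, 1, 0, 2],
                          [2, 0, 1, 1]])
total = sum(sum(row) for row in G)
```

Because each column of Z sums to zero when observed frequencies are used, all elements of G sum to zero, so negative off-diagonal coefficients are unavoidable.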

Genomic relationship matrix G

$$p(u_2) = N(0, G\sigma_u^2)$$

Assume that G is computed according to VanRaden (2008), using observed allelic frequencies.

This implies that the average BV of genotyped individuals (u2) is 0.

This is possibly NOT the case if there is SELECTION


An improved G

Relative to the pedigree base population, the average BV of genotyped individuals (u2) has a value possibly different from 0, say μ:

$$p(u_2 \mid \mu) = N(1\mu, G\sigma_u^2)$$

μ is random because of finite size (drift):

$$p(\mu) = N(0, \alpha\sigma_u^2)$$

so that

$$p(u_2) = N(0, (G + 11'\alpha)\sigma_u^2) = N(0, G^*\sigma_u^2)$$

i.e. as before, but substituting G for G* = G + 11'α

μ is the average BV of genotyped individuals:

$$\mu = \frac{1}{n} 1'u_2$$

How to find the value for α?

μ from either pedigree or single-step:

$$p(\mu_p) = N\left(0, \frac{1}{n^2} 1'A_{22}1 \, \sigma_u^2\right)$$

$$p(\mu_s) = N\left(0, \frac{1}{n^2} 1'(G + 11'\alpha)1 \, \sigma_u^2\right)$$

Assume traditional BLUP is unbiased.

How to find the value for α?

μ from either pedigree or single-step. As the 1'1 are simply summations:

$$Var(\mu_p) = \sigma_u^2 \frac{1}{n^2} \sum_i \sum_j A_{22\,i,j}$$

$$Var(\mu_s) = \sigma_u^2 \left( \alpha + \frac{1}{n^2} \sum_i \sum_j G_{i,j} \right)$$

If we equate both variances of μ:

$$\alpha = \frac{1}{n^2} \left( \sum_i \sum_j A_{22\,i,j} - \sum_i \sum_j G_{i,j} \right)$$

α is simply the difference between the means of A22 and G

What does α mean?

α accounts for the fact that u2 are related through pedigree more than G is able to reflect.

This is because we do not know base allelic frequencies to construct the 'correct' G

What does α mean?

From Wright's FST, another interpretation of α is possible. The FST can be defined as the mean relationship between gametes in a recent population with respect to an older base population (Powell et al., 2010):

$$F_{old} = F_{new} + (1 - F_{new}) F_{ST}$$

A22 involves relationships of genotyped individuals with reference to the base population, and G corresponds to relationships within the current population. Consequently, α is equal to twice FST:

$$G^* = \left(1 - \tfrac{1}{2}\alpha\right) G + 11'\alpha$$

$$F_{ST} = \tfrac{1}{2} \, \mathrm{mean}(A_{22} - G)$$

Which G must we use ?

1st moment (mean) of u2:

$$G^* = G + 11'\alpha, \qquad \text{sum}(G) = \text{sum}(A_{22})$$

2nd moment (variance) of u2:

$$G^* = \frac{\mathrm{trace}(A_{22})}{\mathrm{trace}(G)} G, \qquad \text{AvgD}(G) = \text{AvgD}(A_{22})$$

Mean & variance of u2 (assumption of random mating):

$$G^* = \left(1 - \tfrac{1}{2}\alpha\right) G + 11'\alpha$$

Both, AvgD(G)=AvgD(A22) and sum(G)=sum(A22)

Mean & variance of u2 (no random mating):

$$G^* = aG + 11'b$$

Both, AvgD(G)=AvgD(A22) and sum(G)=sum(A22); preGSf90, Christensen et al., 2012

Ex. real pig population: (1 − 0.5α) = 0.851, a = 0.859
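The two conditions AvgD(G)=AvgD(A22) and sum(G)=sum(A22) give two linear equations in a and b, which can be solved directly. The sketch below is an illustration only (function names and the 2 × 2 matrices are invented; this is not the preGSf90 code): it computes α as the difference between means, and a and b from the two conditions.

```python
def mean_all(M):
    # Average of all elements of a square matrix
    return sum(sum(row) for row in M) / (len(M) * len(M[0]))

def mean_diag(M):
    # Average of the diagonal elements (AvgD)
    return sum(M[i][i] for i in range(len(M))) / len(M)

def alpha_shift(A22, G):
    """alpha such that mean(G + 11'alpha) = mean(A22)."""
    return mean_all(A22) - mean_all(G)

def scale_and_shift(A22, G):
    """a, b such that aG + 11'b matches A22 in average diagonal and overall mean."""
    a = (mean_diag(A22) - mean_all(A22)) / (mean_diag(G) - mean_all(G))
    b = mean_all(A22) - a * mean_all(G)
    return a, b

def tuned(G, a, b):
    return [[a * x + b for x in row] for row in G]

A22 = [[1.0, 0.5], [0.5, 1.0]]   # hypothetical pedigree relationships
G = [[0.8, 0.2], [0.2, 0.9]]     # hypothetical genomic relationships
alpha = alpha_shift(A22, G)
a, b = scale_and_shift(A22, G)
G_star = tuned(G, a, b)
```

After tuning, `G_star` has the same average diagonal and the same overall mean as `A22`, which is what the two conditions require.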

GBLUP

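• GBLUP is a « BLUP » constructed with G so defined

– Substitute A by G

$$G = \frac{ZZ'}{2 \sum p_i (1 - p_i)}$$

• As in regular BLUP, we can include animals with genotype but without phenotype

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$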

GBLUP issue

• Strandén & Christensen (2011) showed that G constructed with « centered » coding is positive semi-definite (has no inverse)

• In dairy cattle, we typically use

$$G = 0.99 \, \frac{ZZ'}{2 \sum p_i (1 - p_i)} + 0.01 \, A$$

or something similar, where the first term holds the « pure » genomic relationships and the second the pedigree relationships

GBLUP issue

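• We can use equations for singular G which don't require inversion (Harville, 1976; Henderson, 1984)

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ G\sigma_u^2 W'R^{-1}X & G\sigma_u^2 W'R^{-1}W + I \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ G\sigma_u^2 W'R^{-1}y \end{pmatrix}$$

(This has been reinvented many times: Misztal et al., 2009; VanKaam, 2012; RKHS: De los Campos et al., 2009; etc)

or

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}WG\sigma_u^2 \\ G\sigma_u^2 W'R^{-1}X & G\sigma_u^2 W'R^{-1}WG\sigma_u^2 + G\sigma_u^2 \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{g} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ G\sigma_u^2 W'R^{-1}y \end{pmatrix}, \qquad \hat{u} = G\sigma_u^2 \hat{g}$$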

GBLUP

• GBLUP gives identical results to BLUP_SNP if we fit equivalent variances in both

BLUP_SNP:

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + I\sigma_a^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{a} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}$$

$$G = \frac{ZZ'}{2 \sum p_i (1 - p_i)}, \qquad \sigma_u^2 = 2 \sum_{\text{all SNPs}} p_i (1 - p_i) \, \sigma_a^2$$

GBLUP (W = I because all animals with genotype have phenotype):

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I \\ I R^{-1}X & I R^{-1} I + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ I R^{-1}y \end{pmatrix}$$

$$\hat{u}_{\text{from GBLUP}} = Z \hat{a}_{\text{from BLUP\_SNP}}$$
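The « equivalent variances » condition follows from writing the breeding values as sums of SNP effects. As a short worked step (assuming, as in BLUP_SNP, that all SNP effects share the same variance and are independent):

```latex
\begin{aligned}
u &= Za, \qquad Var(a) = I\sigma_a^2 \\
Var(u) &= Z \, Var(a) \, Z' = ZZ'\sigma_a^2 \\
       &= \frac{ZZ'}{2\sum p_i(1-p_i)} \; 2\sum p_i(1-p_i)\,\sigma_a^2
        = G\sigma_u^2
\end{aligned}
```

so both models describe the same covariance structure of u when $\sigma_u^2 = 2\sum p_i(1-p_i)\,\sigma_a^2$.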

GBLUP

• In BLUP_SNP, (young) animals without phenotype do NOT enter into SNP estimation

– We get their EBV's as $\hat{u}_{young} = Z_{young}\hat{a}$

• In GBLUP we have three options which give the same result:

1. Include them in the analysis with no record of their own (as in classical pedigree BLUP)

• EBV's $\hat{u}_{young}$ are obtained in the solutions

2. Use multivariate normality (selection index stuff):

• $\hat{u}_{young} = G_{young,old} \, G_{old,old}^{-1} \, \hat{u}_{old}$

3. Backsolve for SNP effects and then use $\hat{u}_{young} = Z_{young}\hat{a}$

GBLUP with more animals than phenotypes

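Let $y = Xb + Wu + e$, with $u = \begin{pmatrix} u_{old} \\ u_{young} \end{pmatrix}$; e.g. $W = \begin{pmatrix} I & 0 \end{pmatrix}$ (only the « old » animals have data)

Let genotypes be $Z = \begin{pmatrix} Z_{old} \\ Z_{young} \end{pmatrix}$, then

$$G = \frac{ZZ'}{2 \sum p_i q_i} = \begin{pmatrix} G_{old} & G_{old,young} \\ G_{young,old} & G_{young} \end{pmatrix}$$

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$

gives the same solutions for $\hat{u}_{old}$ as

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I \\ I R^{-1}X & I R^{-1} I + G_{old}^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u}_{old} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ I R^{-1}y \end{pmatrix}$$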

GBLUP

• We can jump from GBLUP to BLUP_SNP

GEBV's from SNP effects:

$$\hat{u} = Z\hat{a}$$

SNP effects from GEBV's (Henderson, 1973; Strandén and Garrick, 2009), as (covariance SNPs-BVs) × (variance BVs)⁻¹:

$$\hat{a} = \sigma_a^2 D Z' \left( G\sigma_u^2 \right)^{-1} \hat{u}$$

usually

$$\hat{a} = \frac{Z'G^{-1}\hat{u}}{2 \sum p_i q_i}$$

Multiple trait GBLUP

Introducing multiple traits is so well known that nobody cared to publish it

Let genetic and residual covariances be the n × n matrices (n = number of traits) $G_0$ and $R_0$; then multiple trait GBLUP is (e.g. Henderson, 1984; Mrode & Thompson 2005)

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + G^{-1} \otimes G_0^{-1} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$

where $R = I \otimes R_0$.

All models fitted in BLUP fit in GBLUP

GBLUP

Some advantages of GBLUP:

• It fits nicely into existing BLUP software

• …and into existing theory (REML, maternal effects, test-day… multiple traits… Single Step)

• Provides measures of accuracy from the inverse of the LHS

• Accommodates all animals

Inconveniences:

• Can't easily accommodate major genes (unless using weights in the construction of G → see later)

• Computation of G and inversion might be challenging

Caveat

• By defining a genomic relationship matrix, we define a genetic base

– All inference will refer to this genetic base. Quoting Strandén & Christensen http://www.gsejournal.org/content/43/1/25 :

The bad news

« Reliabilities of estimated genomic breeding values calculated using elements of the

inverse of the coefficient matrix depend on the allele coding because different allele

coding methods imply different models » [The same applies for reliabilities computed

from any method fitting SNP effects like BayesA, etc.]


GREML, G-Gibbs…

Use of G to estimate variance components…

It can be done with remlf90, gibbs*f90, AsReml, TM…

The result will refer to an ideal population with whatever allelic

frequencies we introduced in the computation of G.

Single Step GBLUP

Why 2-step procedure

• y= µ + Za + e

– y = data

– Z= incidence matrix of marker effects

– a= marker effect

– e=residuals

• Most often, genotyped animals (bulls) do not have data (trait record)

• Further, most animals with phenotype are not genotyped (e.g. cows)

• This limits practical applications

• Need to get pseudo-data for genotyped animals


Pseudo-data

• So we need pseudo-data

• EBV’s

• DYD’s

4

Pseudo-data

• EBV's

• The problem with EBV's is that they already share information among individuals

• e.g., a dam EBV is = own yield + parent average + progeny contribution

• But we are including information of the sire in the cow, yet not all SNPs of the sire are in the cow


Pseudo-data

• Also, EBV’s are correlated

• The correlation depends on the amount of data and distribution across fixed effects and families

• EBVs of two cows are correlated, if they belong to the same herd, even if they are not related

• EBVs of two bulls are correlated if they have daughters in the same herds

• This is not serious in dairy cattle, but might be, e.g., in swine

$$u \mid y \sim N(\hat{u}, C^{uu})$$

Pseudo-data

• DYD's avoid part of these problems (VanRaden & Wiggans, 1991)

• DYD = daughter yield deviation

• Record of the daughter, corrected by environmental effects and dam’s EBV

• Thus DYD = 0.5 BV sire + mendelian sampling

• E(DYD)=0.5 BV sire

• YD’s exist for cows

– YD = record –environmental effects


Pseudo-data

Problems of DYD's / YD's

• YD's are not very reliable and are subject to preferential treatment

• DYD's are not reliable for many species (sheep, swine)

• Hard to define for some species/traits (maternal effects)

• Extremely complex procedure

• Loss of generality

Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, GSE 42:2; Misztal et al., JDS

92:4648; Aguilar et al JDS 93:743)

• Not big loss in assuming normality for SNP effects (Van Raden et al. JDS 92:19; Hayes et al. JDS 92:433)

• G easy to be constructed then

• Can we include G in the relationship matrix?

• If we construct an overall relationship matrix with good properties, then we can just do BLUP with all data and animals


• Things would be simple if we had genomic relationships for everyone (Legarra et al., 2009)

• Things would be simple if we could add genotypes for all animals (Christensen et al., 2010)


Single Step as a missing data problem

• We can see genotype as a missing data problem (Christensen & Lund, 2010)

• Use the prediction and the distribution of the prediction (otherwise the procedure does not work)

Missing data

Fill-in missing data: data augmentation

• « data augmentation refers to a scheme of augmenting the observed data so as to make it more easy to analyze » (Tanner & Wong, 1987)

– Two flavors: EM and Bayesian (posterior distributions)

– For instance: pretending (temporarily) that you know the true BV's simplifies REML (EM-REML), or provides full conditionals for Gibbs

• Augmenting = adding genotypes


Inferring genotypes

• Genotypes in some individuals can be inferred, but only to some extent

• This is feasible for key individuals (ancestors with many progeny genotyped)

• Or by imputing data from parents into an animal genotyped with a SNP chip

• Typically done using « LD » patterns

• Fimpute, findhap, Alpha impute, Beagle, etc

• These methods do not extend well to non-genotyped individuals

Example:

By simulation, they know that…

• Accuracy of prediction of genotype is quite good, but not perfect

• They don't even try « far » animals

• Need a simpler way for « far » animals

Inferring genotypes

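• There is Gengler's gene content prediction (J. Dairy Sci. 91:1652)

• Linear approximation to the imputation problem

• This method can be applied to any member of a pedigree

Let

$$A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}$$

with 1: non genotyped, 2: genotyped. For centered gene content (z − 2p):

$$\hat{z}_{\text{non genotyped}} = E(z_{\text{non genotyped}} \mid z_{\text{genotyped}}) = A_{1,2} A_{2,2}^{-1} z_{\text{genotyped}}$$

$$Var(\hat{z}_{\text{non genotyped}} \mid z_{\text{genotyped}}) = 2pq \left( A_{1,1} - A_{1,2} A_{2,2}^{-1} A_{2,1} \right)$$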

Inferring genotypes

• Instead of working with individual SNP effects, we will define

– u=Za

– i.e., the genetic value is the sum of SNP effects

– We’re not really interested in a themselves but in u (we know from GBLUP that we can jump from one to the other)

– Moreover, we’re interested in the distribution of u’s, so that we can compute their covariances and put them into the MME


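ng: « non genotyped », g: « genotyped »

Breeding values as sums of SNP effects:

$$u = \begin{pmatrix} u_g \\ u_{ng} \end{pmatrix} = \begin{pmatrix} Z_g \\ Z_{ng} \end{pmatrix} a$$

Christensen & Lund key idea: use $Var(A) = E(Var(A \mid B)) + Var(E(A \mid B))$ to consider the prediction of the genotype, $E(Z_{ng} \mid Z_g)$, and its variance, $Var(Z_{ng} \mid Z_g)$, using Gengler's results.

Resulting in:

$$Var(u) = \begin{pmatrix} Z_g \\ \hat{Z}_{ng} \end{pmatrix} Var(a) \begin{pmatrix} Z_g' & \hat{Z}_{ng}' \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & Var(\hat{Z}_{ng}) \, Var(a) \end{pmatrix}$$

where Var(a) is scaled by $1 / (2 \sum p_i q_i)$, which re-creates GBLUP…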

$$H = Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$

Covariances of all animals (Christensen & Lund, 2010)

1: « non genotyped », 2: « genotyped »

• Incredibly: H⁻¹ is very simple:

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

– $A^{-1}$: inverse of the regular pedigree relationship matrix

– $G^{-1}$: correcting for genomic relationships…

– $-A_{22}^{-1}$: …and avoiding « double counting »


Overall modification

• Look at A as a « prior » relationship and at G as an « observed » relationship

– G is observed for some individuals only, whose « a priori » relationship matrix was A22

• Try to construct a « posterior » relationship matrix

Joint distributions

Unconditional distribution of genetic values of genotyped individuals (after seeing their genotypes!):

$$p(u_2) = N(0, G)$$

Conditional distribution of non-genotyped individuals (because they have no genotypes, this depends only on pedigree):

$$p(u_1 \mid u_2) = N\left( A_{12}A_{22}^{-1}u_2, \; A_{11} - A_{12}A_{22}^{-1}A_{21} \right)$$

Joint distribution:

$$p(u_1, u_2) = p(u_1 \mid u_2) \, p(u_2)$$

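Joint distributions

…for those inclined to algebra

$$p(u_1, u_2) = p(u_1 \mid u_2) \, p(u_2)$$

$$\propto \exp\left[ -0.5 \, (u_1 - A_{12}A_{22}^{-1}u_2)' A^{11} (u_1 - A_{12}A_{22}^{-1}u_2) \right] \exp\left[ -0.5 \, u_2' G^{-1} u_2 \right]$$

$$= \exp\left[ -0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & -A^{11}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A^{11} & G^{-1} + A_{22}^{-1}A_{21}A^{11}A_{12}A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \right]$$

$$= \exp\left[ -0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} + G^{-1} - A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \right]$$

where $A^{11}$, $A^{12}$, $A^{21}$, $A^{22}$ are the blocks of $A^{-1}$. The term $G^{-1}$ comes from the « genomic » relationships; the terms in A come from the prediction of non genotyped from genotyped.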

Joint distributions

$$p(u_2) = N(0, G)$$

$$p(u_1 \mid u_2) = N\left( A_{12}A_{22}^{-1}u_2, \; A_{11} - A_{12}A_{22}^{-1}A_{21} \right)$$

$$Var(u_2) = G$$

$$Var(u_1) = A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21}$$

because Var(Xt) = XVar(t)X'

$$Cov(u_1, u_2) = A_{12}A_{22}^{-1}G$$

because Cov(Xt,t) = XVar(t)

1 11 12

2 21 22

1 1 1 111 12 22 21 12 22 22 21 12 22

122 21

=Var

u H HH

u H H

A A A A A A GA A A A G

GA A G

non genotyped

genotyped

Covariances of all animals Legarra et al. 2009; Aguilar et al., 2010; Christensen & Lund, 2010

Page 137: Short introduction to BLUPF90 family programsnce.ads.uga.edu/wiki/lib/exe/fetch.php?media=slides_2014_lva.pdf · 22/05/2014 7 (CO)VARIANCES structure • Assuming a 3 trait (T1-T3)

22/05/2014

15

30

1 11 12

2 21 22

1 1 1 111 12 22 21 12 22 22 21 12 22

122 21

=Var

u H HH

u H H

A A A A A A GA A A A G

GA A G

Covariances of all animals

G comes from genotypes

This is the variance of prediction of genotypes from genotyped to

non-genotyped

This is the error in the prediction

The prediction « generates » a covariance

Overall modification: example

• This is the regular relationship matrix. Assume now that animals 9 to 12 have a genomic relationship of 0.7

• These parents are now related

• This animal is now inbred

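• Incredibly, $H^{-1}$ is very simple. With

$$
H = Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} =
\begin{pmatrix}
A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\
GA_{22}^{-1}A_{21} & G
\end{pmatrix}
$$

the inverse is

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

• $A^{-1}$: inverse of the regular pedigree relationship matrix

• $G^{-1}$: correcting for genomic relationships…

• $-A_{22}^{-1}$: …and avoiding « double counting »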

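Single step GBLUP

$$
\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + H^{-1}\sigma_u^{-2} \end{pmatrix}
\begin{pmatrix} \hat b \\ \hat u \end{pmatrix} =
\begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}
$$

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

• W: incidence matrix of animals on data

• A: pedigree relationship matrix

• G: this G could be any matrix describing « genomic » covariances of breeding values; it is not restricted to VanRaden’s (2008) GBLUP

• A22: pedigree matrix among genotyped individuals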


Single step GBLUP

• So the Single Step GBLUP is like regular BLUP changing one small submatrix !!!

• It is almost too simple to be true…

Some properties of H

• Always positive semi-definite

• Positive definite and invertible if and only if G is invertible

• In practice, if G is too different (wrong pedigree or genotyping) from A22, this gives lots of numerical problems

• If everyone is genotyped, Single Step is GBLUP

• If no one is genotyped, Single Step is BLUP
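
The two limiting cases can be checked from the inverse $H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$. A short sketch, using only symbols defined in these slides:

```latex
% Everyone genotyped: the non-genotyped block is empty, so A = A_{22}
H^{-1} = A_{22}^{-1} + \left(G^{-1} - A_{22}^{-1}\right) = G^{-1}
\quad\Rightarrow\quad H = G \quad \text{(GBLUP)}

% No one genotyped: the genotyped block is empty, so there is no correction
H^{-1} = A^{-1}
\quad\Rightarrow\quad H = A \quad \text{(BLUP)}
```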


Single Step as an extra random effect

• Legarra & Ducrocq, 2012

• Decompose breeding values in « classical » and deviations:
– $Var(u_2) = G$
– $u_2 = u_2^* + d_2$, $Var(u_2^*) = A_{22}$, $Var(d_2) = G - A_{22}$

• Regress, using pedigree, deviations for « ungenotyped » individuals:
– $d_1 = A_{12}A_{22}^{-1}d_2$; $Var(d_1) = A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21}$

• After quite some algebra, you get to the same results
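
The omitted algebra can be sketched in a few lines. The sketch writes $u_1 = u_1^* + d_1$ with $Var(u_1^*) = A_{11}$ and $Cov(u_1^*, u_2^*) = A_{12}$, by analogy with the genotyped animals, and assumes (this is not stated on the slide) that the classical values and the deviations are uncorrelated. Under those assumptions the blocks of H from the joint-distribution derivation are recovered:

```latex
\begin{aligned}
Var(u_1) &= Var(u_1^*) + Var(d_1) \\
         &= A_{11} + A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21} \\
         &= A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} \\[4pt]
Cov(u_1, u_2) &= Cov(u_1^*, u_2^*) + Cov(d_1, d_2) \\
         &= A_{12} + A_{12}A_{22}^{-1}(G - A_{22}) = A_{12}A_{22}^{-1}G
\end{aligned}
```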

Alternative derivations

• Why do they all agree?

• Some use genotypes, some use breeding values in the algebra.

• Because breeding values = sum of genotypes, the same rules for variances and covariances apply and the derivation is identical


• So far SSGBLUP is the most serious option for a general method for genomic evaluations in practice

• Large body of practical results in dairy cattle & sheep, poultry, swine

• Besides our group, USA, NZ, DK, Fin, are giving SSGBLUP serious tests


Problems of SSGBLUP

Most of these problems exist for the other methods (BayesA etc.)

• Assumption p(u2)=N(0,G)

– If there is selection, mean is not 0 (« tuning » solves it: see Vitezica later)

• Same genetic variance in genotyped and ungenotyped animals

– solved with « tuning »

• Non normality (i.e. major genes)

– Can be solved using G = ZDZ’ with « weights » for SNP (Legarra et al., 2011; Zhang et al., 2010; Wang et al., Gen. Res. 2012)

– unclear for multiple traits (but also in other methods like BayesB)

• Assumption that « A » is fair. This is false if:

– pedigree is incorrect

– distant relationships are too different from reality (Hill & Weir 2010)

– solution: cut pedigrees that are too long

• Unknown parent groups / several breeds

– Need to modify H to include them (Misztal et al., 2013)


SSGBLUP vs. rest

| | SSGBLUP | GBLUP, BayesA | Non parametric |
|---|---|---|---|
| Deregressions from « regular BLUP » | Not needed | Complex | Complex |
| Bias due to genomic selection | Immune | Affected | Affected |
| Bias due to classic selection | Immune | Affected | Affected |
| Major genes | Complex | OK | OK |
| Long MCMC | No | Yes (except GBLUP) | No |
| Matrix inversions | Yes (but work in progress) | No (except GBLUP) | No |
| Computation of accuracies | Complex (but work in progress) | Easy (provided deregressions were OK) | Undefined |
| Multiple trait | Straightforward (if no major genes) | Easy for GBLUP (if no major genes); complex otherwise | ??? |
| Get marker effects | Yes (after backsolving) | Yes | Sometimes |

Computing stuff

• Computing $G^{-1}$ and $A_{22}^{-1}$ is a challenge.

– perhaps only in dairy cattle?

• But see Ignacio's talk

• Future strategies (Legarra & Ducrocq, JDS 2012)


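• Unsymmetric Single Step

$$
\begin{pmatrix}
X'R^{-1}X & X'R^{-1}W_1 & X'R^{-1}W_2 & 0 & 0 \\
W_1'R^{-1}X & W_1'R^{-1}W_1 + A^{11}\sigma_u^{-2} & W_1'R^{-1}W_2 + A^{12}\sigma_u^{-2} & 0 & 0 \\
W_2'R^{-1}X & W_2'R^{-1}W_1 + A^{21}\sigma_u^{-2} & W_2'R^{-1}W_2 + A^{22}\sigma_u^{-2} & -I\sigma_u^{-2} & I\sigma_u^{-2} \\
0 & 0 & -I\sigma_u^{-2} & A_{22}\sigma_u^{-2} & 0 \\
0 & 0 & -I\sigma_u^{-2} & 0 & G\sigma_u^{-2}
\end{pmatrix}
\begin{pmatrix} \hat b \\ \hat u_1 \\ \hat u_2 \\ \hat\lambda \\ \hat\gamma \end{pmatrix}
=
\begin{pmatrix} X'R^{-1}y \\ W_1'R^{-1}y \\ W_2'R^{-1}y \\ 0 \\ 0 \end{pmatrix}
$$

where $\hat\lambda = A_{22}^{-1}\hat u_2$ and $\hat\gamma = G^{-1}\hat u_2$.

This can be computed efficiently because we use $G$, NOT $G^{-1}$, and $A_{22}$, not $A_{22}^{-1}$. We don’t even need to construct them.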

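• Iterative Single Step

New solutions are a weighted average of the solutions to (9-10) and the former solutions at the previous iteration.

Solve (9):

$$
\begin{pmatrix}
X'X & X'W_1 & X'W_2 \\
W_1'X & W_1'W_1 + A^{11}\sigma_u^{-2} & A^{12}\sigma_u^{-2} \\
W_2'X & A^{21}\sigma_u^{-2} & W_2'W_2 + A^{22}\sigma_u^{-2}
\end{pmatrix}
\begin{pmatrix} \hat b \\ \hat u_1 \\ \hat u_2 \end{pmatrix}
=
\begin{pmatrix} X'y \\ W_1'y \\ W_2'y \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ (\hat\lambda - \hat\gamma)\sigma_u^{-2} \end{pmatrix}
$$

Solve $A_{22}\hat\lambda = \hat u_2$ and $G\hat\gamma = \hat u_2$ for $\hat\lambda$ and $\hat\gamma$.

Update, with $w \le 1$ (new solution = w × solution of MME (9) + (1 − w) × old solution):

$$
\begin{pmatrix} \hat b \\ \hat u_1 \\ \hat u_2 \end{pmatrix}^{t}
= w \begin{pmatrix} \hat b \\ \hat u_1 \\ \hat u_2 \end{pmatrix}^{*}
+ (1 - w) \begin{pmatrix} \hat b \\ \hat u_1 \\ \hat u_2 \end{pmatrix}^{t-1},
\qquad
\hat\lambda^{t} = w\hat\lambda^{*} + (1 - w)\hat\lambda^{t-1}
\quad\text{and}\quad
\hat\gamma^{t} = w\hat\gamma^{*} + (1 - w)\hat\gamma^{t-1}
$$

This can be done efficiently as well.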


Results

• (Using simulated data) all strategies arrived at the same solution

• Some can be converted to iteration-on-data procedures

• Reasonable computing time:
– 2 s « regular » (with G and A22 already inverted)
– 47 s « unsymmetric »
– 286 s « iterative »

[Figure: Convergence. x-axis: iteration (0 to 200); y-axis: convergence (−12 to 0); curves: Regular SSGBLUP, Unsymmetric extended SSGBLUP, Iterative SSGBLUP]

More results

• Lacaune dairy sheep

• 5000 individuals (males) genotyped, 1 500 000 animals

• ~3 000 000 equations

[Figure: log10(convergence) (−15 to −5) against iteration (0 to 5000); curves: Regular Single Step, Unsymmetric equations]

• Livestock paper for more details


Forming Single-step mixed model equation and quality control

Ignacio Aguilar

Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay

[email protected]

Single-Step to genomic evaluation

• Traditional genetic evaluation

$$
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha A^{-1} \end{pmatrix}
\begin{pmatrix} b \\ u \end{pmatrix} =
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
$$

• Single-step genomic evaluation

$$
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix}
\begin{pmatrix} b \\ u \end{pmatrix} =
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
$$

Multiple-step Genomic Selection

[Diagram: Records “Y” and Pedigree → BLUP → EBV → pseudo-observations (de-regressed EBVs); pseudo-observations and SNPs → BayesX, GBLUP, etc. → GEBV; index combining GEBV*w1, PA*w2 and SPA*w3]

Single-Step Genomic Selection

[Diagram: Records “Y”, Pedigree and SNPs → BLUP → EBVs]

Single-Step evaluation

• Unified approach with pedigree, phenotypic and genomic marker information considered simultaneously

• Pedigree-based relationships augmented by genomic relationship matrix (Misztal et al. 2009)

$$
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix}
\begin{pmatrix} \hat\beta \\ \hat u \end{pmatrix} =
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
$$

$$H = A + A_\Delta$$

• $A$: conventional numerator relationship matrix

• $A_\Delta$: matrix modified to account for genomic relationships

Single step genomic evaluation

$$
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix}
\begin{pmatrix} b \\ u \end{pmatrix} =
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
$$

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

(Aguilar et al., 2010; Christensen & Lund, 2010)

• Inverses

– Numerator relationship matrix ($A^{-1}$)

– Pedigree relationships between genotyped animals ($A_{22}^{-1}$)

– Genomic relationships ($G^{-1}$)

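Matrix-vector operations in PCG with genomic information

$$
LHS \cdot p =
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix}
\begin{pmatrix} p_1 \\ p_2 \end{pmatrix}
=
\underbrace{\begin{pmatrix} X'Xp_1 + X'Zp_2 \\ Z'Xp_1 + Z'Zp_2 \end{pmatrix}}_{\text{contributions due to records}}
+
\underbrace{\begin{pmatrix} 0 \\ A^{-1}\alpha\, p_2 \end{pmatrix}}_{\text{contributions due to relationships}}
+
\underbrace{\begin{pmatrix} 0 \\ 0 \\ (G^{-1} - A_{22}^{-1})\alpha\, p_{2g} \end{pmatrix}}_{\text{contributions due to genomics}}
$$

Here $p_{2g}$ is the part of $p_2$ that belongs to genotyped animals; the last vector is partitioned into fixed effects, non-genotyped animals and genotyped animals.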

Extra matrices required for single-step

$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

• Inverses

– Pedigree relationships between genotyped animals ($A_{22}^{-1}$)

– Genomic relationships ($G^{-1}$)

OPTIONS – BLUPF90 parameter file

• Genomic programs

– controlled by adding OPTION commands to the parameter file

– OPTION SNP_file marker.geno.clean

– Read 2 files:

• marker.geno.clean

• marker.geno.clean.XrefID

Printout: same heading as other programs

• All options that were entered in the parameter file should be listed here! If not, check that the keywords are correct (upper and lower case)

• Check the number of animals and of individuals with genotypes

Printout

• Information from the genotype file. The format is detected from the first line, so all genotypes should start in the same column!

• The number of SNP is also determined by the first line!

Output Files

• GimA22i
– Stores the content of inv(G) – inv(A22)
– Only for preGSf90 runs, not in application programs

• freqdata.count
– Contains the estimated allele frequency before QC

• freqdata.count.after.clean
– Contains allele frequencies as used in calculations, and a removal code
– For removed SNP these will be zero

• Gen_call_rate
– List of animals removed by low call rate

• Gen_conflicts
– Report of animals with Mendelian conflicts

Quality control. By default exclude:

• MAF
– SNP with MAF < 0.05

• Call rate
– SNP with call rate < 0.90
– Individuals with call rate < 0.90

• Monomorphic
– Exclude monomorphic SNP. ONLY when MAF <> 0

• Parent-progeny conflicts (SNP & individuals)
– Exclusion -> opposite homozygous
– For SNP: > 10 % of parent-progeny exclusions from the total of pairs evaluated
– For individuals: > 2 % of parent-progeny exclusions from the total number of SNP

Control default values

• For MAF

– OPTION minfreq x

• Call rate

– OPTION callrate x

– OPTION callrateAnim x

• Mendelian conflicts

– OPTION exclusion_threshold x

– OPTION exclusion_threshold_snp x
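
As an illustration of how these options end up in a parameter file, here is a minimal shell sketch. The function name and the file names are hypothetical, and the three values simply restate the defaults listed on the quality-control slide; the Mendelian-conflict options would be added in the same way once their values are chosen. The sketch assumes the input file ends with a newline.

```shell
#!/bin/sh
# Sketch: copy a BLUPF90 parameter file and append explicit QC thresholds.
# append_qc_options and both file names are illustrative, not part of BLUPF90.
append_qc_options() {
    par_in=$1    # existing parameter file
    par_out=$2   # copy that receives the OPTION lines
    cp "$par_in" "$par_out"
    # The values below restate the documented defaults.
    cat >> "$par_out" <<'EOF'
OPTION minfreq 0.05
OPTION callrate 0.90
OPTION callrateAnim 0.90
EOF
}
```

A call such as `append_qc_options renf90.par renf90_qc.par` leaves the original file untouched, which makes it easy to compare runs with different thresholds.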

Parent-progeny conflicts

• Presence of these conflicts can result in an H matrix that is not positive definite!

• Problems in the estimation of variance components by REML: programs do not converge, etc.

• Solution:

– Report all conflicts, with counts for each individual as parent or progeny to trace the conflicts

– Remove progeny genotype

• maybe not the best option

• But results in a positive-definite H matrix !!!

Parent-progeny conflicts

• OPTION verify_parentage x

– 0: no action

– 1: only detect

– 2: detect and search for an alternate parent; no change to any file. Not yet implemented

– 3: detect and eliminate progenies with conflicts (default)

SNP map file (optional)

• OPTION chrinfo xxxx

• For some genomic analyses (GWAS) or checks

• Format:

– snp number

• Index number of SNP in the sorted map by chr and position

– chromosome number

– position

• First row corresponds to the first SNP column in the genotype file!

Other Options

• If OPTION chrinfo is provided, selected chromosomes can be excluded:

– OPTION excludeCHR n1 n2 n3 ...

• or we can indicate which are sex chromosomes:

– OPTION sex_chr n

– Chromosomes numbered > n are excluded only from the parent-progeny checks, not from the calculations

Saving ‘clean’ files

• SNP excluded by QC are set as missing (i.e. code = 5)

• Excluded individuals are treated as unrelated in G and A22
– For individual i: G[i,:] = 0; G[:,i] = 0; G[i,i] = 1; same for A22, so G − A22 will cancel out

• OPTION saveCleanSNPs

• Saves clean genotype data without the excluded SNP and individuals
– For example, for a SNP_file gt
– Clean files will be:
• gt_clean
• gt_clean_XrefID
– Removed SNP and animals will be output in files:
• gt_SNPs_removed
• gt_Animals_removed

Inspection of Diagonal of G

• High diagonal elements of G can indicate:

– Mislabeled samples, individuals from other populations/lines

– Problems with the sample, low call rate

• By default, values > 1.6 are excluded from the analysis. The threshold can be changed with:

OPTION threshold_diagonal_g x

Simeone et al., 2011 JABG

Potential duplicate samples

• All samples are checked with each other

– x = G(i,j)/sqrt(G(i,i)*G(j,j))

– Values of x > 0.90 are printed in the output
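
The check can be illustrated on a small matrix with awk. This is only a sketch of the formula x = G(i,j)/sqrt(G(i,i)*G(j,j)), not the preGSf90 implementation; the function name and the input layout (one row of a dense G per line, whitespace separated) are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: list pairs (i, j) whose scaled genomic relationship exceeds a threshold.
# $1 = file with a small dense G matrix, $2 = threshold (defaults to 0.90).
flag_duplicates() {
    awk -v thr="${2:-0.90}" '
        # Store every row of G in a two-dimensional array.
        { for (j = 1; j <= NF; j++) g[NR, j] = $j; n = NR }
        END {
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++) {
                    x = g[i, j] / sqrt(g[i, i] * g[j, j])
                    if (x > thr) printf "%d %d %.3f\n", i, j, x
                }
        }' "$1"
}
```

Each output line holds the two row numbers and the value of x, so flagged pairs can be traced back to sample IDs through the cross-reference file.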

Correlation off-diagonal G vs A

• Compute the correlation for all elements of A > 0.02

• Low correlations point to potential problems matching the genotype file and the pedigree file

• For low values (< 0.5) => a warning is printed

• For low values (< 0.3) => the program stops

• If you still want to go on:

– OPTION thrStopCorAG -1

Looking for stratification in populations

• OPTION plotpca

– (only preGSf90, not in application programs)

– Plot the first 2 PC

• OPTION extra_info_pca filename col

– File with variables (alphanumeric) to plot PC with different colors for different classes

– Same order as genotype file

Use in application programs

• Use renumf90 for proper renumbering and creation of the cross-reference ID file and the parameter file

• If large number of genotypes

– Precompute inv(G)-inv(A22) (PreGSF90)

– Modify parameter file to read GimA22i

– BLUPF90, REMLF90

• Generally all steps can be in a script file to facilitate running programs
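
A driver script for these steps might look like the following sketch. It assumes that renumf90, preGSf90 and blupf90 are on the PATH and read the parameter file name from standard input (the `echo par | program | tee log` pattern used elsewhere in these slides); the helper functions, the log file names and the DRY_RUN switch are illustrative only. The edit that makes the application program read GimA22i is left as a comment because it depends on the parameter file.

```shell
#!/bin/sh
# Sketch of a driver script; set DRY_RUN=1 to print the commands instead of running them.
run_step() {
    par=$1; prog=$2; log=$3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "echo $par | $prog | tee $log"
    else
        echo "$par" | "$prog" | tee "$log"
    fi
}

run_pipeline() {
    run_step renum.par  renumf90 log.renum     # renumber data, pedigree and genotypes
    run_step renf90.par preGSf90 log.pregs     # QC and inv(G) - inv(A22)
    # (modify renf90.par here so that the application program reads GimA22i)
    run_step renf90.par blupf90  log.blupf90   # solve the mixed model equations
}
```

Running `DRY_RUN=1; run_pipeline` first is a cheap way to review the sequence before launching a long evaluation.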

Genome-Wide Association Mapping Including Phenotypes from Relatives without Genotypes

Ignacio Aguilar

Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay

[email protected]

Slides from H. Wang (Joy)


Classical GWAS

• Test single marker one at time

• Simple linear regression for the SNP effects

• Other fixed effects can be fit (contemporary group)

• Polygenic breeding value can be used

Genomic Selection

• Considers all genetic associations derived from markers

• Methods (Bayes A, Bayes B, Bayes Lasso) provide solutions to SNP effects

• Then Genome-Wide Association Analysis (GWAS) can be performed

• Accounts for population stratification and cryptic relatedness

Non-Genotyped individuals with Phenotype

• Recorded information from non-genotyped individuals cannot easily be incorporated in single-marker regression and Bayesian methods

• Information can still be incorporated by accumulating data from relatives, e.g. EBV

• But there are problems with:

– heterogeneity from different sources

– loss of information

– bias

– computational cost with MCMC methods for large numbers of genotypes and markers

Single-step GBLUP

• Integrates all available information

– Phenotypes

– Genotypes

– Pedigree

• Limitation of ssGBLUP, in contrast to BayesX methods

– infinitesimal model, i.e. same variance for all SNP effects

ssGWAS

• Combining methods

– Unequal variances

– Use all available information like in ssGBLUP

• Improve accuracy of estimation of GEBV

– For breeding and selection

– Precision of estimation of SNP effects for GWAS

SNP variances

• Zhang et al. 2010 presented a method to estimate weights for SNP variances without sampling, i.e. a non-MCMC method

• SNP weights: function of squares of SNP effects

• Incorporate weights into the genomic relationship matrix

• Similar approach by Sun et al., 2012 PlosOne

BUT both approaches cannot utilize phenotypes of un-genotyped individuals!

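As a reminder:

• GBLUP is equivalent to BLUP_SNP

• As: $\hat a_g = Z\hat u$ ($\hat a_g$: genetic value of genotyped animals; $\hat u$: SNP effect)

• And:

$$G = \frac{ZDZ'}{\sum_{i}^{allSNPs} 2p_i(1 - p_i)} = ZDZ'\frac{\sigma_u^2}{\sigma_a^2} = ZDZ'\lambda$$

($D$: weight matrix)

$$\sigma_a^2 = \left[\sum_{i}^{allSNPs} 2p_i(1 - p_i)\right]\sigma_u^2$$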

SNP weights

• SNP weights derived from SNP effects

• Zhang et al., 2010 PlosOne

• Matrix D, diagonal matrix, with un-equal variances for each SNP

• SNP effects from GEBVs (Henderson, 1973; Strandén and Garrick, 2009):

$$\hat u = \frac{\sigma_u^2}{\sigma_a^2} D Z' G^{-1} \hat a_g = DZ'[ZDZ']^{-1}\hat a_g$$

• Also, for each (i-th) SNP effect:

$$\hat\sigma_{u,i}^2 = \hat u_i^2\, 2p_i(1 - p_i)$$

• Note: this is just the variance of SNP effects, NOT the same concept as genetic variance

postGSf90 par files

1) Parameter files:

(1) BLUPF90 (and preGSf90 for S1)

(2) postGSf90

2) OPTIONs:

BLUPF90 / preGSf90:

    OPTION SNP_file marker.geno.clean
    OPTION saveGInverse
    OPTION weightedG w           # A vector with length = M

postGSf90:

    OPTION SNP_file marker.geno.clean
    OPTION ReadGInverse
    OPTION chrinfo mapfile       # format: snpID chr pos
    OPTION weightedG w
    # OPTION which_weight 1
    # OPTION SNP_moving_average n
    # OPTION Manhattan_plot

3) Document: http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90#gwas_options_postgsf90

Computing algorithm

• Denote t as an iteration number and i as the i-th SNP

1. $t = 0$, $D_{(t)} = I$, $G_{(t)} = ZD_{(t)}Z'\lambda$

2. Compute $\hat a_g$ by ssGBLUP

3. Calculate $\hat u_{(t)} = \lambda D_{(t)} Z' G_{(t)}^{-1} \hat a_g$

4. Calculate $d^*_{i(t+1)} = \hat u_{i(t)}^2\, 2p_i(1 - p_i)$ for all SNPs (Zhang et al., 2010)

5. Normalize $D_{(t+1)} = \dfrac{tr(D_{(0)})}{tr(D^*_{(t+1)})} D^*_{(t+1)}$

6. Calculate $G_{(t+1)} = ZD_{(t+1)}Z'\lambda$

7. $t = t + 1$

8. Exit, or loop to step 2 (S2) or to step 3 (S1)
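
Steps 4 and 5 can be sketched with awk. The example assumes a hypothetical two-column input file (SNP effect û_i, allele frequency p_i, one SNP per line) and the function name is illustrative; because D(0) = I, normalizing to tr(D(0)) means the new weights sum to the number of SNPs.

```shell
#!/bin/sh
# Sketch: d*_i = u_i^2 * 2 p_i (1 - p_i), rescaled so that the weights sum to the
# number of SNPs (the trace of the identity matrix used at t = 0).
snp_weights() {
    awk '
        { d[NR] = $1 * $1 * 2 * $2 * (1 - $2); tot += d[NR] }
        END {
            if (tot == 0) exit 1          # all effects zero: weights are undefined
            for (i = 1; i <= NR; i++) printf "%.4f\n", d[i] * NR / tot
        }' "$1"
}
```

The output is one weight per line, in SNP order, which is the layout of the weight vector read through OPTION weightedG in the scripts on the following slides.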

Simulated data

1. QMSim

2. Simple model: $y = 1\mu + Z_a a + e$

3. 10 QTLs with 3000 SNP markers on 2 chromosomes

4. N = 15,800; Ng = 1500

5. h2 = 0.5, all due to QTLs (no polygenic effect)

6. 10 replications

Different Scenarios

• Scenario 1
– Run only one BLUP and get GEBV
– Estimate SNP effects from GEBV using weighted genomic matrix
– Multiple trait or random correlated effects

• Scenario 2
– Get EBVs with weighted genomic relationship matrix
– Estimate SNP effects from GEBV using updated solutions
– Single trait analysis - fit one genomic relationship matrix

postGSf90 bash script

• Scenario 1:

    # run GBLUP once to get GEBVs:
    echo par.b90 | blupf90 | tee log.blupf90

    # run preGSf90 and postGSf90 x times to get SNP effects:
    for i in 1 2 3 4 5 6 7 8 … … x
    do
        echo par.b90 | preGSf90 | tee log_preGS_$i
        echo postpar.b90 | postGSf90 | tee logpost_$i
        cp snp_sol snp_sol_$i            # format: tr, eff, snpID, chr, pos, sol, w
        cp chrsnp chrsnp_$i
        cp w w_$i                        # keep the weights used in this round
        awk '{ print $7 }' snp_sol > w   # column 7 holds the new weights
    done

• Scenario 2:

    for i in 1 2 3 4 5 6 7 8 … … x
    do
        echo par.b90 | blupf90 | tee logpre_$i
        cp solutions solutions_$i
        echo postpar.b90 | postGSf90 | tee logpost_$i
        cp snp_sol snp_sol_$i
        cp chrsnp chrsnp_$i
        cp w w_$i                        # keep the weights used in this round
        awk '{ print $7 }' snp_sol > w   # column 7 holds the new weights
    done

Methods

1. Single marker model: WOMBAT

2. BayesB using de-regressed proofs : GENSEL

3. ssGBLUP: S1 & S2


Manhattan plot of S1

Manhattan plot of S2


Manhattan plot of BayesB

Manhattan plot of WOMBAT


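Accuracy of (G)EBVs

| Method | Setting | Accuracy |
|---|---|---|
| BLUP | EBVs | 0.81 (0.01) |
| BLUP | DP | 0.77 (0.01) |
| ssGBLUP | it1* | 0.87 (0.01) |
| ssGBLUP | it2 | 0.89 (0.01) |
| ssGBLUP | it3 | 0.88 (0.01) |
| ssGBLUP | it4 | 0.88 (0.02) |
| ssGBLUP | it5 | 0.88 (0.02) |
| ssGBLUP | it6 | 0.87 (0.02) |
| ssGBLUP | it7 | 0.87 (0.02) |
| ssGBLUP | it8 | 0.87 (0.02) |
| BayesB_DP | NW† | 0.88 (0.02) |
| BayesB_DP | c=0.1 | 0.88 (0.02) |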

Accuracy of SNP effects

Table 3. Average correlations (standard deviations) between QTL effects and sum of cluster of m SNP effects using ssGBLUP

| S1* | 1† | 2 | 4 | 8 | 16 | 40 |
|---|---|---|---|---|---|---|
| it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08) |
| it2 | 0.46 (0.07) | 0.66 (0.05) | 0.78 (0.02) | 0.82 (0.02) | 0.81 (0.02) | 0.63 (0.08) |
| it3 | 0.43 (0.07) | 0.64 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08) |
| it4 | 0.42 (0.07) | 0.63 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08) |
| it5 | 0.41 (0.07) | 0.63 (0.05) | 0.76 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.08) |
| it6 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07) |
| it7 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07) |
| it8 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.60 (0.07) |

| S2 | 1 | 2 | 4 | 8 | 16 | 40 |
|---|---|---|---|---|---|---|
| it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08) |
| it2 | 0.44 (0.09) | 0.65 (0.06) | 0.77 (0.03) | 0.82 (0.03) | 0.81 (0.02) | 0.63 (0.06) |
| it3 | 0.41 (0.08) | 0.62 (0.05) | 0.75 (0.03) | 0.79 (0.03) | 0.79 (0.03) | 0.65 (0.06) |
| it4 | 0.40 (0.07) | 0.61 (0.05) | 0.73 (0.03) | 0.77 (0.03) | 0.78 (0.03) | 0.64 (0.06) |
| it5 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.76 (0.04) | 0.77 (0.04) | 0.64 (0.06) |
| it6 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |
| it7 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |
| it8 | 0.40 (0.07) | 0.60 (0.05) | 0.71 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06) |

* S1: update weights for SNP effects but not for GEBVs; S2: update weights for both GEBVs and SNP effects in each iteration.

† Number of SNPs (i.e. m ranges from 1 to 40) in each cluster.

Variances Explained by segments

• Several works from ISU propose presenting GWAS results as the variance explained by windows of adjacent SNP

• Fan et al. 2011, Onteru et al. 2011, Peters et al. 2012, etc.

• Potential use of bootstrap to get the significance of detected QTL

Windows Variances

• a = Zu, using only the SNP in the segment

• a = EBV derived from the segment

• Get the sample variance Var(a) from genotyped individuals

POSTGSF90 Options


Output files from POSTGSF90


QTL-MAS workshop 2010

G = ZDZ'/k, a = DZ'(ZDZ')^{-1}u

G = ZZ'/k, a = Z'(ZZ')^{-1}u

cor(ebv,tbv)=0.68 cor(ebv,tbv)=0.70

Single-Step GWAS Conception Rate

• Multiple-trait model, US Holsteins, service records from AI

• ~ 5 million records, ~ 2.5 million pedigrees

• ~ 5,600 genotyped bulls

• Computing time

– Complete evaluation: 2 h

– Estimates of SNP effects: 2 min

Single-Step GWAS Heat Stress

• Multiple-trait test-day model for heat tolerance

• ~ 90 million records, ~ 9 million pedigrees

• ~ 3,800 genotyped bulls

• Computing time

– Complete evaluation: ~ 16 h

Milk yield no Heat stress Heat stress


27/05/2014


Creation and handling of genomic

relationship matrices with preGSf90

Ignacio Aguilar Instituto Nacional de Investigación Agropecuaria

INIA Las Brujas, Uruguay

[email protected]

Genomic Relationship Matrix - G

• G = ZZ'/k

– Z = matrix for SNP marker

– Dimension Z= n*p

– n animals,

– p markers

Data file with SNP marker


HOWTO: Creation of Genomic Matrix

• Read SNP marker information => M

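• Get 'means' to center

– Calculate allele frequency from observed genotypes (p_i)

– p_i = sum(SNPcode_i)/2n

• Matrix for centering W(3,p): one row per genotype code, one column per SNP

| Code | SNP 1 | SNP 2 | .. |
|---|---|---|---|
| 0 | 0 - 2p1 | 0 - 2p2 | .. |
| 1 | 1 - 2p1 | 1 - 2p2 | .. |
| 2 | 2 - 2p1 | 2 - 2p2 | .. |

• Centered matrix Z = W(M), where M holds the genotype codes, e.g. rows (2 1 2 ..), (0 1 0 ..), ..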
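
The frequency step can be sketched with awk. The sketch assumes a genotype file with an ID in the first field and an unbroken string of genotype codes in the second, and it skips any code other than 0, 1 or 2 (the slides use 5 for missing), so n becomes the number of animals with a called genotype at that SNP. The function name is illustrative and this is not the preGSf90 code.

```shell
#!/bin/sh
# Sketch: allele frequency per SNP, p_i = sum(SNPcode_i) / (2 n_i),
# where n_i counts the animals with a called genotype (0, 1 or 2) at SNP i.
allele_freq() {
    awk '
        {
            m = length($2)
            for (k = 1; k <= m; k++) {
                c = substr($2, k, 1)
                if (c == "0" || c == "1" || c == "2") { s[k] += c; n[k]++ }
            }
        }
        END {
            for (k = 1; k <= m; k++)
                printf "%d %.4f\n", k, (n[k] ? s[k] / (2 * n[k]) : 0)
        }' "$1"
}
```

Each output line gives the SNP index and its frequency; these are the p_i used in the centering terms 0 - 2p_i, 1 - 2p_i and 2 - 2p_i.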

Creation of Genomic

• Issues

– Large number of genotyped individuals

– Large number of SNP markers

– Matrix multiplication ~ cost n^2 * p

A large amount of data is put in (cache) memory for doing 'matmul' for each pair of animals, with indirect memory access (centering)

Memory hierarchy


Matrix multiplication

• Matrix multiplication

– Several methods

• Intrinsic matmul (good for small examples!)

• "do-loops"

• Packages (BLAS, LAPACK)

– Non-optimized

– Optimized (ATLAS, MKL, etc.)

– Several Compilers

• Perform automatic optimization

– Vectorize loops

– Detect permuted loops

• Can use OpenMP directives for parallelization

Memory Hierarchy

[Diagram: CPU #1 and CPU #2 each have their own cache memory (256 kb – 16 Mb) and share the main memory (1 Gb – 128 Gb). Access between CPU and cache is fast; access between cache and main memory is slow.]

Alternative codes to create G matrix

Original:

    ! Centered values are looked up for every product (indirect memory access)
    Do i=1,n
       Do j=i,n
          S=0
          Do k=1,p
             S=S+Z(M(i,k),k)*Z(M(j,k),k)
          End do
          G(i,j)=S/sqrt(d(i)*d(j))
          G(j,i)=G(i,j)
       End do
    End do

Optimize Indirect Memory Access - OPTM:

    ! Build the centered matrix X once, then use direct access
    Do k=1,p
       X(:,k)=Z(M(:,k),k)
    End do
    Do i=1,n
       Do j=i,n
          S=0
          Do k=1,p
             S=S+X(i,k)*X(j,k)
          End do
          G(i,j)=S/sqrt(d(i)*d(j))
          G(j,i)=G(i,j)
       End do
    End do

Optimize Memory and Loops - OPTML:

    ! Plain triple loop that the compiler can permute and vectorize
    Do k=1,p
       X(:,k)=Z(M(:,k),k)
    End do
    G=0   ! G must start at zero before accumulating
    Do i=1,n
       Do j=1,n
          Do k=1,p
             G(i,j)=G(i,j)+X(i,k)*X(j,k)
          End do
       End do
    End do
    Do i=1,n
       Do j=1,n
          G(i,j)=G(i,j)/sqrt(d(i)*d(j))
       End do
    End do

Gmatrix.f90 (VanRaden, 2009)

CPU time for alternative codes for G matrix and machines

Testing: 6500 genotyped animals, 40k SNPs

| Processor | Cache | Original | OPTM | OPTML |
|---|---|---|---|---|
| Xeon 3.5 GHz | 6 MB | 24 m | 26 m | 7 m |
| Opteron 3.0 GHz | 1 MB | 265 m | 59 m | 17 m |

CPU time (m) with alternative codes and compilers

Testing: 6500 genotyped animals, 40k SNPs; Opteron 3.02 GHz, 1 MB cache memory

| Compiler | Original | OPTM | OPTML |
|---|---|---|---|
| Intel | 265 | 59 | 17 |
| Absoft | 241 | 60 | 34 |
| gfortran | 213 | 63 | >1 day |

PreGSf90 program

• From BLUPF90 package

• Uses a genomic module

• Creation and handling of genomic relationship

matrices and relationship based on pedigree

• Different methods to optimize calculations

using parallel processing


Input files

• Same parameter file as for all BLUPF90 programs
– But with "OPTION SNP_file xxxx"
– indicates that the genomic subroutines should be run

• Pedigree file

• Marker information (SNP file)

• Cross-reference file for renumbered IDs
– Links genotype files with codes in pedigree, etc.

SNP map file (optional)

• For some genomic analyses or checks

• Format:

– snp number

• Index number of SNP in the sorted map

– chromosome number

– position

• The first row corresponds to the SNP in the first column of the genotype file !!!
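A minimal sketch of reading such a map file, assuming three whitespace-separated integer columns (SNP number, chromosome number, position). The function names and the example rows are hypothetical and not part of the BLUPF90 package.

```python
def parse_map(lines):
    """Parse map lines of the form: snp_number chromosome position."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        snp, chrom, pos = line.split()[:3]
        records.append((int(snp), int(chrom), int(pos)))
    return records

def check_map(records, n_snps):
    """True if the map has exactly one row per SNP column of the genotype file."""
    return len(records) == n_snps

# Hypothetical map with three SNPs; row order follows the genotype columns.
example = ["1 1 1000", "2 1 2500", "3 2 300"]
records = parse_map(example)
```

Because row order must follow the column order of the genotype file, comparing the row count with the number of SNPs is a cheap first check.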


OPTIONS - BLUPF90 parameter file

• PreGSF90
– Controlled by adding OPTION commands to the parameter file
– OPTION SNP_file marker.geno.clean
– Reads 2 files:
  • marker.geno.clean
  • marker.geno.clean.XrefID

RENUMF90

• Add the keyword SNP_FILE to the "animal effect": SNP_FILE marker_geno_clean
• Renumbering tool that prepares:
– data
– pedigree
– genotypes
– parameter files for BLUPF90 programs, including PREGSF90
• Check the wiki: http://nce.ads.uga.edu/wiki/doku.php


Parameter files

• RENUMF90: renum.par
• BLUPF90: renf90.par

Pedigree file from RENUMF90

• 1 - animal number

• 2 - parent 1 number or UPG

• 3 - parent 2 number or UPG

• 4 - 3 minus number of known parents

• 5 - known or estimated year of birth

• 6 - number of known parents; if the animal is genotyped, 10 + number of known parents

• 7 - number of records

• 8 - number of progenies as parent 1

• 9 - number of progenies as parent 2

• 10 - original animal ID
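The column layout above can be read with a short parser. This is a sketch only: the record type, the function name, and the example line are hypothetical, and the columns are assumed to be whitespace-separated.

```python
from typing import NamedTuple

class PedRecord(NamedTuple):
    animal: int
    parent1: int
    parent2: int
    known_parents: int
    genotyped: bool
    original_id: str

def parse_ped_line(line):
    """Parse one line of the 10-column pedigree file written by RENUMF90."""
    f = line.split()
    code = int(f[5])  # column 6: known parents, plus 10 if genotyped
    return PedRecord(
        animal=int(f[0]),
        parent1=int(f[1]),
        parent2=int(f[2]),
        known_parents=code % 10,
        genotyped=code >= 10,
        original_id=f[9],
    )

# Hypothetical line: animal 3, both parents known, genotyped.
rec = parse_ped_line("3 1 2 1 2010 12 1 0 0 ANIMAL_C")
```

Column 6 packs two facts into one integer, so the sketch splits it with a modulo and a comparison.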


SNP file & Cross Reference ID

SNP File
• First column: identification, could be alphanumeric
• Second column: SNP markers {codes: 0, 1, 2, and 5 for missing}

Cross Reference ID
• Links the Original ID with the Renumber ID used in the Pedigree File (from RENUMF90)
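A small sketch of reading one line of the SNP file. It assumes the genotypes are written as a single string of one-digit codes with no separators; that layout, the function name, and the example line are assumptions for illustration.

```python
VALID_CODES = {"0", "1", "2", "5"}  # 5 marks a missing genotype

def parse_snp_line(line):
    """Split a line into (ID, list of integer genotype codes)."""
    animal_id, genotypes = line.split()
    bad = set(genotypes) - VALID_CODES
    if bad:
        raise ValueError(f"unexpected genotype codes for {animal_id}: {sorted(bad)}")
    return animal_id, [int(c) for c in genotypes]

# Hypothetical line: alphanumeric ID followed by 7 SNP codes.
animal_id, codes = parse_snp_line("ANIMAL_A 2101502")
n_missing = codes.count(5)
```

Counting code 5 per animal gives a quick call-rate check before running the genomic programs.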

Genomic Matrix default options

• G = ZZ'/k as in VanRaden, 2008
• With:
– Z centered using allele frequencies estimated from the genotyped individuals
– k = 2 sum(p * (1-p))
• G = G*0.95 + A*0.05 (to invert)
• Tuning of G (see the work of Z. Vitezica)
– Adjust G to have means of diagonals and off-diagonals equal to those of A
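The default construction can be written out in a few lines. The following is a toy sketch in plain Python, not the preGSf90 implementation: the function names are hypothetical, genotypes are assumed to be coded 0/1/2 with no missing values, and an identity matrix stands in for A.

```python
def vanraden_g(geno):
    """G = ZZ'/k with Z centered by 2p and k = 2 * sum(p * (1 - p))."""
    n, m = len(geno), len(geno[0])
    # Allele frequencies estimated from the genotyped individuals.
    p = [sum(row[s] for row in geno) / (2.0 * n) for s in range(m)]
    Z = [[row[s] - 2.0 * p[s] for s in range(m)] for row in geno]
    k_scale = 2.0 * sum(ps * (1.0 - ps) for ps in p)
    return [[sum(Z[i][s] * Z[j][s] for s in range(m)) / k_scale
             for j in range(n)] for i in range(n)]

def blend(G, A, alpha=0.95, beta=0.05):
    """Blend G with A so that the result can be inverted."""
    n = len(G)
    return [[alpha * G[i][j] + beta * A[i][j] for j in range(n)]
            for i in range(n)]

geno = [[0, 2], [2, 0], [1, 1]]                  # 3 animals, 2 SNPs
G = vanraden_g(geno)
A = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
Gb = blend(G, A)
```

In this toy case the third animal is heterozygous at both SNPs, so its row of G is zero and G is singular; blending adds 0.05 to that diagonal element, which is the reason for the blending step.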


Genomic Matrix Options

• OPTION whichG x
– 1: G=ZZ'/k (default) (VanRaden, 2008)
– 2: G=ZDZ'/n; D=1/(2p(1-p)) (Amin et al., 2007; Leutenegger et al., 2003)
– 3: As 2 with modification UAR (Yang et al., 2010)
– Euclidean distance matrix, not fully implemented yet
• OPTION weightedG file
– Read weights to create G=ZDZ'
– Weighting Z*= Z sqrt(D) => G = Z*Z*' = ZDZ'
• OPTION whichScale x
– 1: 2Σ(p(1-p)) (default) (VanRaden, 2008)
– 2: trace(ZZ')/n (Legarra 2009, Hayes 2009, Forni et al 2011)
– 3: correction (Gianola et al., 2009)

Genomic Matrix Options

• OPTION whichfreq x

– 0: read from file freqdata or other specified

– 1: 0.5

– 2: current calculated from genotypes (default)

• OPTION FreqFile file

– Reads allele frequencies from a file

• OPTION maxsnps x

– Set the maximum length of string for reading marker data from file => BovineHD chip


Options for Blending G and A

• OPTION AlphaBeta alpha beta
– G = alpha*Gr + beta*A
• OPTION tunedG
– 0: no adjustment
– 1: mean(diag(G))=1, mean(offdiag(G))=0
– 2: mean(diag(G))=mean(diag(A)), mean(offdiag(G))=mean(offdiag(A)) (default)
– 3: mean(G)=mean(A)
– 4: Use Fst adjustment. Powell et al. (2010) & Vitezica et al. (2011)
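One way to satisfy both conditions of tuning option 2 is a linear transformation G* = a + b*G, with a and b solved from the two mean constraints. Whether preGSf90 uses exactly this form is an assumption here. The function names are hypothetical, and `A22` denotes the pedigree relationships among the genotyped animals.

```python
def mean_diag_offdiag(M):
    """Return (mean of diagonal, mean of off-diagonal) of a square matrix."""
    n = len(M)
    diag = sum(M[i][i] for i in range(n)) / n
    off = sum(M[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return diag, off

def tune_g(G, A22):
    """Rescale G as a + b*G so its diagonal and off-diagonal means match A22."""
    dg, og = mean_diag_offdiag(G)
    da, oa = mean_diag_offdiag(A22)
    b = (da - oa) / (dg - og)   # assumes dg != og
    a = da - b * dg
    return [[a + b * x for x in row] for row in G]

G = [[1.2, 0.2], [0.2, 0.8]]
A22 = [[1.0, 0.5], [0.5, 1.0]]
Gt = tune_g(G, A22)
tuned_means = mean_diag_offdiag(Gt)
```

Option 1 is the same calculation with targets 1 and 0 in place of the means of A.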

Creation of 'raw' genomic matrix

• Tricks:
• Use a dummy pedigree:
  1 0 0
  2 0 0
• Change blending parameters
– OPTION AlphaBeta 0.99 0.01
• No adjustment for compatibility with A
– OPTION tunedG 0

G = 0.99*G + 0.01*I


Storing and Reading Matrices

• PreGSF90:
– Facilitates the implementation of single-step
– Matrix A is replaced by H, with:

$$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix}$$

– Default output is the matrix GimA22i, to be included in application programs (BLUPF90, REMLF90, ...)
• BUT: intermediate matrices can be stored for examination, use in application programs, etc.
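The role of GimA22i (that is, G^{-1} - A22^{-1}) can be illustrated with a toy sketch: it is added to the block of A^{-1} that belongs to the genotyped animals. The function name, the dense list-of-lists storage, and the numbers are hypothetical; real programs work with sparse matrices.

```python
def h_inverse(a_inv, gim_a22i, genotyped):
    """Add G^-1 - A22^-1 to the genotyped block of A^-1.

    genotyped: 0-based positions of the genotyped animals within a_inv,
               in the same order as the rows of gim_a22i.
    """
    h_inv = [row[:] for row in a_inv]  # copy so a_inv is left unchanged
    for r, i in enumerate(genotyped):
        for c, j in enumerate(genotyped):
            h_inv[i][j] += gim_a22i[r][c]
    return h_inv

# Toy example: 3 animals, animals 1 and 3 genotyped (positions 0 and 2).
a_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gim_a22i = [[0.1, 0.2], [0.2, 0.3]]
h_inv = h_inverse(a_inv, gim_a22i, [0, 2])
```

Rows and columns of non-genotyped animals are untouched, which is why only the small matrix GimA22i needs to be stored.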

Storing and Reading Matrices

• Matrices that can be stored:
– A22, inv(A22), G, inv(G), GmA22, inv(GmA22), inv(H)
• All matrices are stored in the same format:
– Upper triangle
– By default in binary format
– But to store in text (ASCII) format:
  • Use: OPTION saveAscii
• Values
– i j val
– i & j refer to the row number in the genotype file !!!!!
– Renumbered ID can be obtained from the XrefID file
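A sketch of turning the "i j val" text format back into a full symmetric matrix, assuming 1-based indices and whitespace-separated fields. The function name and the example lines are hypothetical.

```python
def read_upper_triangle(lines):
    """Build a full symmetric matrix from 'i j val' lines (upper triangle)."""
    entries = []
    n = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        i_str, j_str, val_str = line.split()
        i, j, val = int(i_str), int(j_str), float(val_str)
        entries.append((i, j, val))
        n = max(n, i, j)
    full = [[0.0] * n for _ in range(n)]
    for i, j, val in entries:
        full[i - 1][j - 1] = val
        full[j - 1][i - 1] = val  # mirror into the lower triangle
    return full

# Hypothetical content of a small matrix saved in text format.
lines = ["1 1 1.0", "1 2 0.25", "2 2 1.1"]
full = read_upper_triangle(lines)
```

For real files, pass an open file object instead of a list, and only for small n: a dense matrix for thousands of genotypes will not fit comfortably in Python lists.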


Storing and Reading Matrices

To save our 'raw' genomic matrix:
• OPTION saveG [all]
– If the optional all is present, all intermediate G matrices will be saved !!!
or its inverse:
• OPTION saveGInverse
– Only the final matrix G, after blending, scaling, etc., is inverted !!!
• Look in the wiki for keywords for other matrices

Storing with Original IDs

• Some matrices can be stored in text files with the original IDs extracted from renaddxx.ped created by the RENUMF90 program (col #10)
• For example:
– OPTION saveGOrig
– OPTION saveDiagGOrig
– OPTION saveHinvOrig
• Values
– origID_i, origID_j, val


OUTPUT

• Only GimA22i, other requested matrix files, and some reports are stored.
• The main log is printed to the screen !!!
• Use redirection '>'
• or, better, the command tee to save it in a log file.
• This allows you to both save and see the messages from the program
• echo renf90.par | preGSf90 | tee pregs.log

Printout: Same heading as other programs

All options that were entered in the parameter file should be here !! If not, check that the keywords are correct (upper and lower case)

Check the number of animals and of individuals with genotypes


Printout

Information from the genotype file.
The format is detected from the first line !!!
So all genotypes should start in the same column !!!
The number of SNPs is also determined by the first line !!

Looking at stored matrices

• Avoid opening them with text editors: huge files !!!
• For example:
• 1500 genotyped individuals => 1,125,750 rows
• Inspection can be done with Unix commands:
– head G => first 10 lines
– tail G => last 10 lines
– less G => scroll document by line/page
– wc -l G => count number of lines; good for checking against the number of genotypes (n): rows = n*(n+1)/2
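The line-count check is simple arithmetic and can be scripted. A minimal sketch with hypothetical function names:

```python
import math

def expected_rows(n):
    """Number of stored elements in the upper triangle, including the diagonal."""
    return n * (n + 1) // 2

def n_from_rows(rows):
    """Invert rows = n*(n+1)/2; raise if the count is not a triangular number."""
    n = (math.isqrt(8 * rows + 1) - 1) // 2
    if n * (n + 1) // 2 != rows:
        raise ValueError("row count does not match any n*(n+1)/2")
    return n

rows_1500 = expected_rows(1500)
n_back = n_from_rows(rows_1500)
```

Feeding the output of wc -l to n_from_rows should give back the number of genotyped individuals reported in the log; a ValueError suggests a truncated file.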


head G

GBLUP, GREML, GGIBBS

• Using BLUPF90 programs to perform genomic selection with a genomic relationship matrix
• Using only phenotypes or pseudo-phenotypes (DYD, DP, EBV) for genotyped individuals only


Two ways: user_file

• By user-defined files for covariances of random effects
• Look at Tricks in the wiki for more details: http://nce.ads.uga.edu/wiki/doku.php
• Special type of random effect in the BLUPF90 parameter file
• Gi created by PreGSF90 can be used here!

By 'fake' single-step GBLUP

• Same trick as before:
– Dummy pedigree with the number of individuals equal to the number of individuals with genotypes
– Little blending with A (identity matrix) to create the inverse (OPTION AlphaBeta 0.99 0.01)
– No adjustment for means of A (OPTION tunedG 0)
– Parameter file includes:
  • Random effect defined as add_animal
  • OPTION SNP_file xxxx
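The dummy pedigree is just one line per genotyped individual with both parents unknown, so that A is an identity matrix. A minimal sketch for generating it; the function name and the output file name are hypothetical.

```python
def dummy_pedigree(n_genotyped):
    """Return pedigree lines 'i 0 0' for i = 1..n_genotyped (unknown parents)."""
    return [f"{i} 0 0" for i in range(1, n_genotyped + 1)]

ped_lines = dummy_pedigree(3)

def write_dummy_pedigree(path, n_genotyped):
    """Write the dummy pedigree to a text file, one animal per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(dummy_pedigree(n_genotyped)) + "\n")

# Example (not executed here): write_dummy_pedigree("dummy.ped", 6500)
```

The animal numbers must match the renumbered IDs in the cross-reference file for the genotypes.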


By 'fake' single-step GBLUP

• Runs can be done either in:
– Several steps
  • 1: run preGSf90 and store G inverse
  • 2: modify the parameter file for BLUP, adding OPTION readGimA22i
  • 3: run BLUPF90
– 'One-Step'
  • 1: run BLUPF90 or REMLF90

RENUMF90 (ren.par) => BLUPF90 (renf90.par)


PreGSf90 inside BLUPF90 ??

• Almost all programs from the package support creation of genomic relationship matrices, Hinv, etc.
• OPTION SNP_file xxxx
• Why preGSF90?
– Same genomic relationship matrix for several models, traits, etc. Just do it once and store.
– Uses optimized subroutines for efficient matrix multiplication and inversion, with support for parallel processing

Matrix multiplication subroutines

• Optimized memory and loops (compiler optimization)
• dgemm subroutine from BLAS
• Optimized dgemm (ATLAS or MKL libraries*)
– Serial
– Parallel (automatic use of OpenMP)
* Intel Fortran Compiler


Matrix multiplication using 40k SNPs

[Figure: CPU time (s), log10 scale from 1 to 100,000, against number of animals (0 to 35,000) for three methods: OPTML, BLAS dgemm, and Optimized dgemm. Annotations on the plot: ~6.4 h and ~3.8 h.]

Speedup for matrix multiplications

[Figure: speedup (1 to 4.5) against number of animals (0 to 35,000) for 2, 3, and 4 threads.]

Speedup = time using one thread / time using n threads


Efficient methods to construct genomic relationship matrices

Elapsed time for different numbers of individuals (genomic relationship matrix)

Number of genotypes   Creation   Inversion
10k                   0.6 m      0.1 m
30k                   5.4 m      3 m
50k                   15 m       14 m
70k                   30 m       36.4 m
100k                  60 m       106 m

BLADE INIALB 24 cpu

Creating a subset of the relationship matrix (A22)

• Create a relationship matrix for genotyped animals only (~thousands)
• Full pedigree (~millions)
• Trace only the ancestors of genotyped animals (reduced, but still a large number for the A matrix)


Relationship Matrix of Genotyped Animals

• Colleau's algorithm to create A22
• No need to have an explicit A matrix
• Method uses "matrix-vector" multiplication with a decomposition of the A matrix

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$

Example: A times a vector

Pedigree (animal, parent 1, parent 2)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    2    0    0
[3,]    3    1    2

Matrix P
     [,1] [,2] [,3]
[1,]  0.0  0.0  0.0
[2,]  0.0  0.0  0.0
[3,]  0.5  0.5  0.0

Matrix (I-P)^{-1}
     [,1] [,2] [,3]
[1,]  1.0
[2,]  0.0  1.0
[3,]  0.5  0.5  1.0

Matrix [(I-P)^{-1}]'
     [,1] [,2] [,3]
[1,]  1.0  0.0  0.5
[2,]       1.0  0.5
[3,]            1.0

Matrix D (diagonal)
     [,1] [,2] [,3]
[1,]  1.0
[2,]       1.0
[3,]            0.5

Vector r
     [,1]
[1,]   10
[2,]   20
[3,]   30

Vector q = [(I-P)^{-1}]' r
     [,1]
[1,]   25
[2,]   35
[3,]   30

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$
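The product can be computed directly from the pedigree, without forming any matrix. The following is a sketch of that idea for the three-animal example. It assumes animals are numbered 1..n with parents listed before their progeny, and that the diagonal of D is supplied by the caller; the function name is hypothetical.

```python
def a_times_vector(ped, d, r):
    """Compute v = A r as (I-P)^-1 D [(I-P)^-1]' r using only the pedigree.

    ped: list of (parent1, parent2) per animal, 0 = unknown parent
    d:   diagonal of D
    r:   vector to multiply
    """
    n = len(ped)
    # Step 1: q = [(I-P)^-1]' r, sweeping from the youngest animal to the oldest.
    q = list(r)
    for i in range(n - 1, -1, -1):
        for parent in ped[i]:
            if parent > 0:
                q[parent - 1] += 0.5 * q[i]
    # Steps 2 and 3: v = (I-P)^-1 (D q), sweeping from oldest to youngest.
    v = [0.0] * n
    for i in range(n):
        from_parents = sum(0.5 * v[parent - 1] for parent in ped[i] if parent > 0)
        v[i] = d[i] * q[i] + from_parents
    return v

ped = [(0, 0), (0, 0), (1, 2)]   # animal 3 has parents 1 and 2
d = [1.0, 1.0, 0.5]
r = [10, 20, 30]
v = a_times_vector(ped, d, r)
```

The first sweep reproduces q = (25, 35, 30) from the example, and the result v = (25, 35, 45) agrees with multiplying r by the explicit A of this pedigree.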


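• For each genotyped animal i in A22: r is a vector of zeros with a 1 in the position of animal i; v = A r is then column i of A, and its elements for the genotyped animals give A22(i,.)

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$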
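Putting the pieces together, A22 can be assembled one genotyped animal at a time. This is a toy sketch under the same assumptions as the worked example (animals numbered 1..n, parents before progeny, diagonal of D given); the function names are hypothetical and no attempt is made to restrict the sweeps to ancestors of genotyped animals.

```python
def a_times_vector(ped, d, r):
    """v = A r via two sweeps over the pedigree (no explicit A)."""
    n = len(ped)
    q = list(r)
    for i in range(n - 1, -1, -1):          # youngest to oldest
        for parent in ped[i]:
            if parent > 0:
                q[parent - 1] += 0.5 * q[i]
    v = [0.0] * n
    for i in range(n):                       # oldest to youngest
        v[i] = d[i] * q[i] + sum(0.5 * v[p - 1] for p in ped[i] if p > 0)
    return v

def build_a22(ped, d, genotyped):
    """Rows of A for the genotyped animals, restricted to genotyped columns."""
    a22 = []
    for g in genotyped:
        r = [0.0] * len(ped)
        r[g - 1] = 1.0                       # unit vector for animal g
        v = a_times_vector(ped, d, r)
        a22.append([v[h - 1] for h in genotyped])
    return a22

ped = [(0, 0), (0, 0), (1, 2)]
d = [1.0, 1.0, 0.5]
a22 = build_a22(ped, d, genotyped=[1, 3])
```

Only one vector of length n is held at a time, which is consistent with the small memory footprint reported for the Colleau method.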

Relationship Matrix of Genotyped Animals

Tabular method vs. Colleau algorithm

Testing: 6,500 genotyped Holsteins, 57,000 pedigrees

           Tabular*   Colleau method
CPU time   311 s      45 s
Memory     12.1 GB    322 MB

* Gmatrix.f90 (VanRaden, 2009)