22/05/2014
Short introduction to BLUPF90
family programs
Ignacio Aguilar Instituto Nacional de Investigación Agropecuaria
INIA Las Brujas, Uruguay [email protected]
BLUPF90 Family of Programs
• Developed by Ignacy Misztal and collaborators at the University of Georgia
• Collection of software for mixed model computation in animal breeding (also plant breeding, forest breeding, etc.)
BLUPF90 family programs
• Set of programs to:
– Help in teaching courses on mixed models
– Simplify programming using Fortran 90
– Provide a general program that supports different levels of model complexity:
• Linear and threshold-linear models with multiple correlated effects, multiple-trait animal models and dominance
• General philosophy of the programs described here:
– "Complex Model, More Data: Simple Programming?" Misztal, I. 1999. Interbull Bull.
BLUPF90 family programs
• Consist of several programs:
– Estimation of variance components
• REMLF90, AIREMLF90, (thr)GIBBSxF90
– Solver of mixed model equations
• The programs above, plus
• BLUPF90
• BLUP90IOD (large-scale data)
– Approximation of accuracy
• ACCF90 (large-scale data)
• Support for genomic selection
http://nce.ads.uga.edu/wiki/doku.php
BLUPF90 family programs
• All programs are controlled by the SAME parameter file.
• Extra options can be used to set non-default behaviour of each program
• Understanding the parameter file usually solves most problems
BLUPF90 parameter file
Repeat for each
Random effect
Data file
• Free format, i.e. at least one space to separate columns
• TABs are not valid column separators
• Some programs (e.g. MS Excel) export flat files with TAB separators!
• Only numbers: integers or reals
• For reals, the decimal separator is "." not ","
• A lone "." is not a missing value
• All effects need to be renumbered consecutively from 1 (see RENUMF90 later)
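The rules above can be checked before running any program. The following is a minimal sketch, not part of the BLUPF90 distribution; the function name `check_data_line` and the example lines are hypothetical.

```python
def check_data_line(line):
    """Return a list of problems found in one line of a data file."""
    problems = []
    if "\t" in line:
        problems.append("TAB separator")
    for field in line.split():
        if "," in field:
            # A comma usually means a decimal comma or a CSV export
            problems.append(f"comma in field {field!r}")
            continue
        try:
            float(field)  # only integers or reals are accepted
        except ValueError:
            problems.append(f"non-numeric field {field!r}")
    return problems

good = "1 3 120.5 2"             # hypothetical valid record
bad = "1\t3 120,5 A12 ."         # TAB, decimal comma, alphanumeric, lone "."
print(check_data_line(good))
print(check_data_line(bad))
```

Running such a check on every line of the data and pedigree files catches the TAB and decimal-comma problems that spreadsheet exports tend to introduce.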
Number of traits / effects
• No restriction for number of traits or effects
• But memory requirements and computing
time increase exponentially with them
Effects section
• As many rows as NUMBER_OF_EFFECTS
• In this section the model for each trait is defined
• Different models per trait are supported
• If an effect is missing for one trait, use 0
• Each row has as many columns as NUMBER_OF_TRAITS, followed by the number of levels and the type of effect
RANDOM_RESIDUAL VALUES
• This matrix should be a square matrix with dimension equal to NUMBER_OF_TRAITS
• Use zero (0.0) to indicate uncorrelated residual effects between traits
• e.g. for 3 traits:
43.1 0.0 0.0
0.0 5.1 3.2
0.0 3.2 10.3
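A (co)variance matrix must be symmetric and positive definite. The sketch below checks both properties for the 3-trait example in plain Python; the function names are hypothetical and the Cholesky attempt is only one of several possible tests.

```python
import math

def is_symmetric(m, tol=1e-10):
    """True if m is square and m[i][j] == m[j][i] within tol."""
    n = len(m)
    return all(len(row) == n for row in m) and all(
        abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(i))

def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds only if m is positive definite."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

R = [[43.1, 0.0, 0.0],
     [0.0, 5.1, 3.2],
     [0.0, 3.2, 10.3]]             # residual matrix from the example
bad = [[1.0, 2.0], [2.0, 1.0]]     # hypothetical invalid matrix (covariance too large)
print(is_symmetric(R), is_positive_definite(R))
print(is_positive_definite(bad))
```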
Random effect definition
• RANDOM_GROUP
– Number(s) of the effect(s) from the list of effects
– Correlated effects should be consecutive, e.g. maternal effects, random regression models
• RANDOM_TYPE
– diagonal, add_animal, add_sire, add_an_upg, add_an_upginb, user_file, user_file_i or par_domin
• FILE
– Pedigree file, parental dominance file or user file
• (CO)VARIANCES
– Square matrix with dimension equal to number_of_traits*number_of_correlated_effects
(CO)VARIANCES structure
• Assuming 3 traits (T1-T3) and 3 correlated effects (E1-E3), rows and columns are ordered by effect and, within effect, by trait:
             E1          E2          E3
          T1 T2 T3    T1 T2 T3    T1 T2 T3
E1  T1
    T2
    T3
…
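The ordering in the layout above (effects as the outer block, traits inside each effect) can be written as a simple index rule. This is a sketch of that rule only; `covariance_index` is a hypothetical helper, not a BLUPF90 function.

```python
def covariance_index(effect, trait, n_traits):
    """1-based row/column of (effect, trait): effects are the outer block, traits the inner one."""
    return (effect - 1) * n_traits + trait

n_traits, n_effects = 3, 3
dimension = n_traits * n_effects   # number_of_traits * number_of_correlated_effects
labels = [f"E{e}T{t}"
          for e in range(1, n_effects + 1)
          for t in range(1, n_traits + 1)]
print(dimension)
print(labels)
print(covariance_index(2, 1, n_traits))   # position of effect 2, trait 1
```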
RANDOM_TYPE
• diagonal
– For permanent environment effects; assumes no correlation between levels of the effect
• add_sire
– Creates a relationship matrix using sire and maternal grandsire
– Pedigree file:
• individual number, sire number, maternal grandsire number
• add_animal
– Creates a relationship matrix using sire and dam information
– Pedigree file:
• animal number, sire number, dam number
RANDOM_TYPE
• add_an_upg
– As add_animal but using rules for unknown parent groups (UPG)
– Pedigree file:
• animal number, sire number, dam number, parent code
• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals
• Parent code = 3 - number of known parents
– 1 both parents known
– 2 one parent known
– 3 both parents unknown
• add_an_upginb
– As before but using rules for unknown parent groups and inbreeding
– Pedigree file:
• animal number, sire number, dam number, inb/upg code
• a missing sire/dam can be replaced by a UPG number, usually greater than the maximum number of animals
• inb/upg code = 4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]
• ms (md) is 0 if the sire (dam) is known and 1 otherwise
• Fs (Fd) is the inbreeding coefficient of the sire (dam)
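The two codes can be computed directly from the formulas above. This is a minimal sketch; the function names are hypothetical, and any rounding the programs expect in the pedigree file is not stated here, so the value is left as a real number.

```python
def parent_code(sire_known, dam_known):
    """Parent code for add_an_upg: 3 minus the number of known parents."""
    return 3 - (int(sire_known) + int(dam_known))

def inb_upg_code(sire_known, dam_known, f_sire=0.0, f_dam=0.0):
    """inb/upg code = 4000 / [(1+md)(1-Fs) + (1+ms)(1-Fd)]."""
    ms = 0 if sire_known else 1   # 1 when the sire is unknown
    md = 0 if dam_known else 1    # 1 when the dam is unknown
    return 4000.0 / ((1 + md) * (1 - f_sire) + (1 + ms) * (1 - f_dam))

print(parent_code(True, True), inb_upg_code(True, True))      # both parents known, non-inbred
print(parent_code(False, True), inb_upg_code(False, True))    # sire unknown
print(parent_code(False, False), inb_upg_code(False, False))  # both parents unknown
```

With non-inbred parents the code is 2000 when both are known and 1000 when both are unknown.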
RANDOM_TYPE
• user_file
– A matrix is read from a file. Only the upper or lower triangle is stored
– Matrix file:
• row, col, value
• user_file_i
– As before but the matrix will be inverted
• par_domin
– A parental dominance file created by program RENDOM
– File format:
• s-d s-sd s-dd ss-d ds-d ss-sd ss-dd ds-sd ds-dd code
Pedigree files
• As with data files, columns in pedigree files are separated by at least one SPACE!
• TABs are not supported!
• The order of columns depends on the type of the random effect
• Duplicate pedigree records are not checked!
• Identification numbers need to be coded sequentially from 1 to the maximum number of animals
• No particular order of records is required!
Programs Options
• Program behavior can be modified by adding lines with OPTION at the end of the parameter file
• OPTION option_name x1 x2 …
• option_name: each program has its own definition of options
• The number of optional parameters (x1, x2, …) to control the behavior depends on the option.
BLUPF90
• BLUPF90 computes generalized solutions by several methods:
– Preconditioned Conjugate Gradient (PCG): default iterative method, fast
– Successive over-relaxation (SOR): an iterative method based on Gauss-Seidel
– Direct solution using sparse Cholesky factorization (FSPAK): greater memory requirements
• The values of the solutions change between methods, but estimable functions should be the same
• Prediction error variances can be obtained using the sparse inverse (FSPAK)
BLUPF90 options
Example of parameter file BLUPF90
From blupf90.pdf documentation:
http://nce.ads.uga.edu/wiki/doku.php
Parameter File Model
FAQ or Frequent Problems
• Wrong data file or pedigree file name!
– The program does not stop if the file name is wrong or the file does not exist
– Check the output for the data file name and the number of data and pedigree records read
• Wrong positions or formats for observations and effects
• Misspelled keywords
– Program may stop
• (Co)variance matrices not symmetric or not positive definite
– Program may not stop
• Large numbers (e.g. 305-day milk yield of 10,000 kg) combined with a large number of records in Gibbs sampling programs
– Scale down, e.g. 10,000 / 1,000 = 10
Data preparation and renumbering
• In general, data files and pedigree files can be created with any software (e.g. SAS, R, Python)
• But all cross-classified effects (including pedigree) need to be renumbered sequentially from 1 to the maximum number of levels of each effect
• No alphanumeric columns
• Columns have to be separated by at least one space!
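Renumbering a cross-classified effect by hand amounts to building a lookup table from original levels to consecutive integers. The sketch below numbers levels in order of first appearance, which is an assumption made for illustration; RENUMF90 may assign numbers in a different order. The herd labels are hypothetical.

```python
def renumber(values):
    """Map each distinct level to a consecutive integer starting at 1 (order of first appearance)."""
    codes = {}
    recoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
        recoded.append(codes[v])
    return recoded, codes

herds = ["HF_A", "HF_C", "HF_A", "JE_B", "HF_C"]   # hypothetical alphanumeric levels
recoded, table = renumber(herds)
print(recoded)   # numeric column to write to the data file
print(table)     # cross-reference table: original level -> new number
```

Keeping the cross-reference table makes it possible to translate solutions back to the original levels afterwards.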
RENUMF90
• A renumbering program for the BLUPF90 family of programs
• Supports:
– multiple traits
– different effects per trait
– alphanumeric and numeric fields
– unknown parent groups
– covariates for random regression models
• Provides data statistics
• Traces back the pedigree of individuals in the data file and performs comprehensive pedigree checking
• Creates files to be used by BLUPF90 family programs:
– renf90.par - parameter file
– renf90.dat - recoded data
– renaddxx.ped - renumbered pedigree + statistics
– renf90.tables - cross-reference file linking renumbered levels to the original data
RENUMF90 files
• Data file and pedigree file are flat files
• Columns separated by at least one SPACE
• No TABs! (the current version checks for them)
• Input files cannot contain the character #
• Missing sires/dams must have code 0
• Code 00 is treated as a known animal
• It has its own parameter file, not the same as for the other programs!
RENUMF90 parameter file
• Based on keywords in capital letters, each followed by one or more lines with the corresponding data item
• Keywords need to be typed exactly
• Keywords need to occur in sequential order!
• Lines starting with # are treated as comments and are ignored
RENUMF90 keywords
All these keywords are mandatory!
Leave a blank line where necessary
RENUMF90 keywords: Effect section
RENUMF90 keywords: Random effect section
RENUMF90 keywords: Random effect files section
RENUMF90 keywords: Pedigree options section
RENUMF90 keywords: Unknown parent group section
RENUMF90 keywords: Random regression group section
RENUMF90 keywords: Random effect (co)variances section
RENUMF90 keywords
• The section starting from EFFECTS can be repeated as many times as there are effects in the model
• Correlated effects are controlled by an option
• If (co)variances for any effect are missing, a default matrix with 1.0 on the diagonal and 0.1 off the diagonal will be used.
– WARNING: for EM-REML the convergence rate is better if starting values are too large rather than too small!
RENUMF90 keywords: Creation of interaction effects
RENUMF90 keywords: Extra options section
RENUMF90 keywords: Options passed to BLUPF90
• All lines that begin with the keyword OPTION are passed to the parameter file renf90.par
• This allows the process to be automated with scripts
• For example:
– OPTION sol se
RENUMF90 output files
Pedigree file: renaddxx.ped
• Column structure:
1. Animal number (from 1)
2. Parent 1 number or UPG number for parent 1
3. Parent 2 number or UPG number for parent 2
4. 3 minus the number of known parents
5. Known or estimated year of birth (0 if not provided)
6. Number of known parents; if the animal has a genotype: 10 + number of known parents
7. Number of records
8. Number of progeny as parent 1
9. Number of progeny as parent 2
10. Original animal ID
RENUMF90 output files
Renumbering tables: renf90.tables
• For each cross-classified effect, tables are created with:
– Original ID, count, consecutive number
• Useful to:
– translate solutions from BLUPF90 programs back into the original alphanumeric values
– check counts of records by level
Example of RENUMF90 parameter file
RENUMF90 output files
Inbreeding program
• INBUPGF90
– Calculates inbreeding coefficients
– Alphanumeric identification of individuals
– Different methods:
• Regular inbreeding (Meuwissen & Luo)
• Missing parent information (VanRaden's method)
– Calculates expected future inbreeding for a set of defined matings
– Calculates relationships between animals
– Outputs a reordered pedigree with parent ID < animal ID
INBUPGF90
• No parameter file
• Controlled by arguments
inbupgf90 --pedfile file_name
• See wiki for full description of options
Different Models with BLUPF90
http://nce.ads.uga.edu/wiki/doku.php?id=faq
Estimation of variance components
• Several methods available
– REML
– Bayesian methods via Gibbs Sampling
• Review article:
– Misztal, I. 2008. Reliable computing in estimation
of variance components. J. Anim. Breed. Genet.
125:363-370.
REML
• Maximizes the likelihood with respect to
parameters
• Different ways to get maximum
– Derivative-free (DF), e.g. MTDFREML
– Using first derivatives (EM-REML)
– Using second derivatives (AI-REML)
EM-REML
• Traditionally regarded as the most reliable
• But
– Slow
– Could fail if starting parameters are smaller than the 'true' parameters
– Use bigger values
– Does not generate standard errors of estimates
AI REML
• Much faster than EM-REML
• Provides estimates of standard errors
• BUT
– For complex models and poor starting values
• Slow convergence
• Parameter estimates outside the parameter space
– In some cases initial rounds with EM-REML help
Bayesian: Gibbs Sampling
• Implementation
– Solving of mixed model equations (Gauss-Seidel) + adding 'noise' to solutions
– Sampling of variance components from chi-square or Wishart distributions
• Samples from the marginal posterior distribution after the burn-in period
Programs for estimation
of variance components
• remlf90 -> EM-REML
• airemlf90 -> AI-REML
• Gibbs Sampling
– gibbsf90: blupf90 transformed into a Gibbs sampler, slow
– gibbs1f90: optimized version
– gibbs2f90: improved mixing with correlated random effects
– gibbs3f90: heterogeneous residual variance classes
– thrgibbs1f90: threshold-linear mixed models
REMLF90 OPTION
AIREMLF90 OPTIONS
GIBBS SAMPLING PROGRAMS
• Extra inputs are required when running Gibbs sampling programs
• As in other programs: name of parameter file?
• Parameters to set up the MCMC chain:
– Number of samples and length of burn-in
– Give n to store every n-th sample? (1 means store all samples)
Gibbs Sampling
Output files
• Default files
– gibbs_samples
– fort.99
• Solution files, only if requested by options
• Other files, only useful for continuation
– binary_final_solutions
– last_solutions
Gibbs Sampling OPTIONS
heterogeneous residual variances
GIBBS3F90
Threshold models
THRGIBBS1F90
Post Gibbs analysis
• Program postgibbsf90 uses the output files
– gibbs_samples
– fort.99
• from any of the Gibbs sampling programs
– gibbs1f90
– gibbs2f90
– gibbs3f90
– thrgibbs1f90
POSTGIBBSF90
• Calculates statistics for variance components from the posterior distribution:
– mean
– median
– mode
– standard deviation
– HPD 95
– effective sample size
– auto-correlations
• Create graphs with trace of the chain and histogram of variance components
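Some of the statistics listed above are easy to reproduce for a single chain. The sketch below is not POSTGIBBSF90 code: the function name, the toy chain and the choice of a lag-1 autocorrelation are all assumptions made for illustration.

```python
import statistics

def summarize_chain(samples, burn_in=0, thin=1):
    """Posterior summaries after discarding burn_in samples and keeping every thin-th one."""
    kept = samples[burn_in::thin]
    mean = statistics.fmean(kept)
    # Lag-1 autocorrelation of the kept samples
    num = sum((a - mean) * (b - mean) for a, b in zip(kept, kept[1:]))
    den = sum((a - mean) ** 2 for a in kept)
    return {
        "n": len(kept),
        "mean": mean,
        "median": statistics.median(kept),
        "sd": statistics.stdev(kept),
        "lag1": num / den,
    }

# Hypothetical chain: the first four values are still moving away from the start
chain = [5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
print(summarize_chain(chain, burn_in=0))
print(summarize_chain(chain, burn_in=4))
```

Comparing the two summaries shows how early samples pull the posterior mean away from the region where the chain settles.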
POSTGIBBSF90
Output Files
• "postgibbs_samples"
– file containing all Gibbs samples for additional post-analyses, e.g. posterior distributions of heritabilities, correlations, covariance functions of random regression models
• "postmean"
– file containing posterior means, in a matrix format that matches the parameter file
• "postsd"
– file containing posterior standard deviations
• "postout"
– statistics
How to run POSTGIBBSF90
• Interactive program
• As in other programs: name of parameter file?
• Parameters to select the samples used to calculate posterior statistics:
– Burn-in?
• Sets the number of samples to discard for the posterior analysis
• In the first run use 0 to check convergence
– Give n to read every n-th sample? (1 means read all samples)
• This number should be equal to or greater than the one used in the Gibbs programs
• Asks the user to enter an option to:
– Generate graphs of the trace or histograms
– Exit
Using Gibbs Sampling programs
• For a new analysis use burn-in equal to 0
– Allows inspecting the full chain with postgibbsf90
– Posterior samples can be extracted later
• For long jobs, store every k-th sample with k > 1, e.g. 10
– Decreases the size of gibbs_samples
• DIC for model comparison is provided in the output of the Gibbs programs
– BUT a burn-in should be used for it to be meaningful
General comments for all programs
• Output that is printed to the terminal is not SAVED in any file!
• Use redirection or pipes to store output in log files:
echo renf90.par | blupf90 | tee blup.log
or
echo renf90.par | remlf90 | tee reml.log
For programs that require more than one input
gibbs2f90 <<AA > gibbs.log
renf90.par
1000 0
10
AA
• Or using a single line:
• printf "exmr99s\n1000 0\n10\n" | gibbs2f90 > gibbs.log
General OUTPUT from all programs
• Check file names
• Check model
• Check (co)variances
• Check number of records and pedigree read
• Check maximum number of columns to read
Useful commands for Linux
• Access the server with an ssh client, e.g. PuTTY
• To display graphical windows from Windows, use an X11 server, e.g. Xming
• Commands in Linux are case sensitive!
• Several tutorials are available on the web:
• unixcombined.pdf from Misztal's web page: http://nce.ads.uga.edu/~ignacy/ads8200/unixcombined.pdf
• Unix_en.pdf from the genomeek blog: http://genomeek.wordpress.com/manuels/unix-et-awk/
– http://dl.dropbox.com/u/22940514/Unix_En.pdf
Basic Commands
pwd show working directory
ls list files in working directory
ll as before but with more information
mkdir d make a directory d
cd d change to directory d
cat file list the complete file
less file list file page-by-page
Copy and moving commands
To copy a file
cp /home/ignacio/course/lab/lab1/is .
To copy a directory
cp -r /home/ignacio/course/lab/lab1 .
To move file aa to bb in folder test
mv aa ./test/bb
To delete
rm yy delete the file yy
rm -r xx delete the folder xx
Other popular commands
head file print first 10 lines
tail file print last 10 lines
wc -l file count lines
grep text file find lines that contain text
cat file1 file2 concatenate files
sort sort file
cut cut specific columns
join join lines of two files on specific columns
paste paste lines of two files
expand replace TABs with spaces
uniq retain unique lines of a sorted file
Redirections & pipe
aa < bb      program aa reads from file bb
             blupf90 < in
aa > bb      program aa writes to file bb
             blupf90 < in > log
"|" and "tee"
program blupf90 reads the name of the parameter file and writes output to the terminal and to the file log
echo par.b90 | blupf90 | tee log
AWK
• Very useful and fast command to work with
text files
• Can be used as a database query program
• Select specific columns or create new ones
• Select specific rows matching some criteria
AWK
Extract solutions for a particular effect (2) and print EBV and accuracies (r^2)
awk '{ if ($2==2) print $3,$4,1-$5**2/20}' solutions
Count records by sire
awk '$2>0{ print $2}' ped | sort | uniq -c
Process CSV files
awk 'BEGIN {FS=","} { print $1,$2,$3}' pedigree.txt
Data simulation (including genomics): QMSim software
Zulma G. Vitezica*
* INRA-INPT, GenPhySE, Castanet-Tolosan 31326 France
QMSim: why use it?
• It was designed to simulate large-scale genotyping data in multiple and complex livestock pedigrees
• It supports a wide variety of genome architectures, from the infinitesimal model to a single-locus model
• It is a user-friendly tool for simulating data
• It is computationally efficient in terms of both time and memory
QMSim†: where to find it?
• The code is written in C++
• Executable files are freely available for Windows, Linux and Mac (last update: July 12, 2013) at:
http://www.aps.uoguelph.ca/~msargol/qmsim/
†Sargolzaei & Schenkel (2009), Bioinformatics 25:680-681.
How is the simulation carried out?
In 2 steps:
• First step: a historical population is simulated
– in order to create initial LD
– to establish mutation-drift equilibrium
– with expansion and contraction of the population
• Second step: one or multiple recent population structures are generated
Parameter file
• It must be in ASCII format
• It consists of five main sections
• The order of commands within each section is not important
• All commands end with a semicolon
• No semicolon → error message and the program exits
1. Global parameters section
• An arbitrary title
• Parameter file: ex01.prm
• Output folder: r_ex01/
Output:
.---------------------------------------.
|       Example 1 - 10k SNP panel       |
`---------------------------------------'
Initial seed is backed up in [r_ex01/seed].
parameter file is backed up in [r_ex01/ex01.prm].
• The random number generator (RNG*) requires a seed file
• If it is not specified → the RNG will be seeded from the system clock
• For each run the initial seed numbers are backed up in the output folder → this allows the run to be repeated!
* Mersenne Twister algorithm (Matsumoto & Nishimura, 1998)
1. Global parameters section
Overall heritability (Polygenic + QTL)
QTL effect is simulated
Only polygenic effect is simulated
Both, polygenic and QTL effects are simulated
Range: 0 - 10,000
1. Global parameters section
A sex limited trait like milk yield
When males do not have records, but selection or culling is based on:
– EBVs → OK
– Phenotypes → males will be randomly selected or culled
2. Historical population section
• To create initial LD
• Evolutionary forces: mutation and drift (no selection, no migration)
• Random mating: union of gametes randomly sampled from the male and female gametic pools
• Discrete generations
• Only a single historical population
2. Historical population section
Historical generation sizes: hg_size = v1 [v2]
• v1: the historical generation size (range: 2 - 100,000)
• v2: the historical generation number (range: 0 - 150,000)
• Constant size of 420
2. Historical population section
Gradual decrease in size from 2000 to 200
Expansion in the last historical generation from 100 to 3000
Historical bottleneck or expansion can be simulated
LD in livestock extends over longer distances than in humans
2. Historical population section
Number of males
• nmfhg: first historical generation
• nmlhg: last historical generation
• Default: equal number of males and females
• The sex ratio is constant across historical generations. It can be changed in the last generation
3. Population section
• One or multiple recent populations
• For the first defined recent population (e.g. p1), founders must come from the last historical population
• For subsequent populations (e.g. p2), founders can be chosen from one or more (up to 10) previously defined populations (e.g. p1)
• Multiple recent populations can be analyzed separately (one pedigree for each population) or jointly (by creating one pedigree for all populations) for inbreeding and EBV
3. Population section
Choosing founders for a population
Parameters for the founders:
• Number of males/females to be selected
• The population from which the base animals must be selected
– hp: historical population (last historical generation)
• Type of selection
– select: rnd (default), phen, tbv or ebv
– /l: select low values
– /h: select high values
Choosing founders for a population for F2 design
Crossing between populations/lines
is allowed
Migration can be simulated
Choosing founders for a population for migration
3. Population section
Litter size
• ls: number of progeny per dam
• ls: probability of the litter sizes
3. Population section
Sex ratio
• pmp: proportion of male progeny, range 0-1, default 0.5
• pmp: 0.5 /fix_litter → the sex ratio will be fixed within litters (progeny of a dam)
3. Population section
Mating design
• rnd (default), rnd_ug (a dam can mate with more than one sire in each generation), p_assort (similarity), n_assort (dissimilarity), minf and maxf (inbreeding is minimized or maximized in the next generation)
• Assortative mating is based on phen, ebv or tbv
3. Population section
Replacement
• sr: 40% of sires will be replaced in all generations
• sr: 0.4 [1] 0.5 [5] → 40% of sires will be culled from generation 1 to 5, and 50% from generation 5 to the last generation
• sr: 1 → discrete generations (default)
3. Population section
Selection and culling designs
• rnd, phen, tbv, ebv and age (age only for culling)
• /l or /h to select low or high values
Breeding value estimation method
Population-specific parameters for saving outputs
• data: save individuals' data except their genotypes (file name: 'population name'_data_'replicate number'.txt)
• stat: save brief statistics on the simulated data
• genotype: save genotype data
p1_mrk_007.txt
p1_qtl_007.txt
4. Genome section
Example: 10k SNP panel
• Number of chromosomes: 10
• chrlen: range 1-5,000 cM
Marker information
• Samples from a uniform distribution in each replicate
• All marker loci will have 2 alleles
• In the first historical generation, then drift and mutation
4. Genome section
Example: 10k SNP panel
• nqloci: range 1-50,000 on the chromosome
QTL information
• Samples from a uniform distribution in each replicate
• Equal allele frequencies in the first historical generation
• Number of QTL alleles in the first historical generation (all: same number)
• QTL allele effects will be sampled from a gamma distribution with shape 0.4
More genome information
Example – 10k SNP panel
• In recurrent mutation, no new allele is generated
• Default: infinite-allele model
• SNP recurrent mutations are generally very rare, and there is no evidence that mutation contributes to erosion of LD between SNPs (Ardlie et al., 2002)
Other possibilities:
• Missing marker/QTL genotypes
• Genotyping errors can be simulated (marker/QTL)
5. Output section
Save brief statistics on historical population
Save allele effects
Marker and QTL linkage map (GWAS)
QMSim outputs
• p1_data_001.txt
• p1_stat_001.txt
• p1_mrk_001.txt
• Marker and QTL linkage map
• Saved allele effects
Conclusion: QMSim
+ To create LD
+ Dense marker map
+ QTL + polygenic
+ Population expansion or bottleneck
+ Multiple recent populations / lines
+ Crossing between populations / lines
+ Sex limited traits
- A single historical population
- No fixed effects
- Only additive effects
Genomic selection: validation
• Reference population: phenotypes, genotypes, pedigree
• Population: phenotypes, genotypes, pedigree
• Estimation of SNP effects
• Calculation of GEBVs
• Comparison between GEBV and TBV (EBV) to obtain accuracy
• Candidates to selection
Example of simulation
Generation -1000 to -5
Generation -5 to -1
Generation 1 to 9
Generation 10
Random mating (N=100)
Expanded to N=3000
200 ♂ x 2,000 ♀ / generation
Pedigree recording and genotyping start
Validation data: candidates to selection
Training data
Bases for Genomic Prediction
A. Legarra
INRA, Toulouse, France
Linkage disequilibrium
• Also called « gametic phase disequilibrium »
• Statistical association between alleles at two loci on the same chromosome
– Loci: places
– Alleles: alternative forms of a gene (A, B, 0)
– Phase: notion of being on the same chromosome (of a pair) or coming from the same origin (sire or dam)
Biallelic case
• Assume we genotype 5 individuals, thus 10 chromosomes (and that we know the phase):
AB AB ab aB ab ab Ab AB Ab AB
• Now we compute allelic frequencies
Biallelic case
p(A)=0.6
p(B)=0.5
If the two loci were independent: p(AB)=0.3, p(ab)=0.2
The expected proportions are:
     A     a
B   0.3   0.2
b   0.3   0.2
Biallelic case
p(A)=0.6
p(B)=0.5
In reality (observed proportions among the 10 chromosomes):
     A     a
B   0.4   0.1
b   0.2   0.3
vs. expected:
     A     a
B   0.3   0.2
b   0.3   0.2
More AB & ab than expected !!
This is linkage disequilibrium
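The frequencies in this example can be recomputed from the 10 phased chromosomes. The sketch below counts haplotypes, derives the allele frequencies and computes the disequilibrium coefficient D = p(AB) - p(A)p(B); the symbol D is standard notation rather than something defined on these slides.

```python
from collections import Counter

haplotypes = "AB AB ab aB ab ab Ab AB Ab AB".split()
n = len(haplotypes)

# Observed haplotype frequencies
freq = {h: c / n for h, c in Counter(haplotypes).items()}

# Allele frequencies at each locus
p_A = sum(h[0] == "A" for h in haplotypes) / n
p_B = sum(h[1] == "B" for h in haplotypes) / n

# Disequilibrium: observed minus expected frequency of AB
D = freq["AB"] - p_A * p_B

print(freq)
print(p_A, p_B, round(D, 3))
```

A positive D means that AB (and therefore ab) chromosomes are more frequent than expected under independence.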
Linkage disequilibrium
• Is a statistical concept
• Describes the non-random association of two loci
– Nothing more. So why is it useful?
• Two loci in LD are most often (very) close
– This is because LD breaks down with recombination
• Linkage disequilibrium between two loci decays on average with distance
• Hence it serves to map genes
Where does it come from?
• Because chromosomes are transmitted together
– Within known families (« linkage analysis »)
– Within the history of a population (« populational linkage
disequilibrium » or « linkage disequilibrium » in short)
• This distinction is rather artificial
– Remember: a population is a very old, large family
Populational linkage disequilibrium
• Assume we mix two populations (say Churra
and Merino)
• Or, that Adam carried one type of chromosome and Eve another (shown in grey and black in the original figure)
– The first generation is an F1
– Then animals are mated at random
• What do we get after many generations?
Populational linkage disequilibrium
• The chromosomes become a fine-grained mosaic of grey and black
• However, complete mixture is difficult to attain
Populational linkage disequilibrium
• Some people distinguish LD and pedigree relationships
• It's pretty much the same thing
• A stretch (= chromosomal segment) is conserved because it comes from the same ancestor (co-ancestry)
• The value of LD (e.g. r2) observed at large distances is a function of recent relationships
• … at short distances it is a function of distant relationships
• The « existence » of only a few conserved stretches at the same place creates LD. LD is therefore an over-representation of segments from a few gametes that existed in the population some time ago.
Within-family linkage disequilibrium
• Consider this male, with haplotypes A-B and a-b, who has 8 progeny
• Recombination fraction: 0.50
• These are the chromosomes in the sons (i.e. the gametes the male transmitted):
Ab AB aB ab Ab AB aB ab
• We find linkage equilibrium in one generation
Within-family linkage disequilibrium
• Consider this male, with haplotypes A-B and a-b, who has 8 progeny
• Recombination fraction: 0.25
• Gametes transmitted:
Ab AB aB ab AB ab AB ab
• Due to non-recombination, linkage disequilibrium has been generated:
      A       a
B   0.375   0.125
b   0.125   0.375
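The within-family frequencies follow from the recombination fraction c: each non-recombinant gamete has expected frequency (1 - c)/2 and each recombinant gamete c/2. A small sketch, assuming a sire with haplotypes AB and ab (the function names are hypothetical):

```python
def gamete_frequencies(c):
    """Expected gamete frequencies from a sire with haplotypes AB / ab and recombination fraction c."""
    non_rec = (1 - c) / 2
    rec = c / 2
    return {"AB": non_rec, "ab": non_rec, "Ab": rec, "aB": rec}

def disequilibrium(freq):
    """D = p(AB) - p(A) p(B) among the transmitted gametes."""
    p_A = freq["AB"] + freq["Ab"]
    p_B = freq["AB"] + freq["aB"]
    return freq["AB"] - p_A * p_B

for c in (0.50, 0.25):
    f = gamete_frequencies(c)
    print(c, f, disequilibrium(f))
```

With c = 0.50 all four gametes are equally frequent and D is zero; with c = 0.25 the non-recombinant gametes are in excess and D is positive.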
Within-family linkage disequilibrium
• Assume now there are two males
• Male 1, with haplotypes A-B and a-b, transmits:
Ab AB aB ab AB ab AB ab
      A       a
B   0.375   0.125
b   0.125   0.375
• Male 2, with haplotypes A-b and a-B, transmits:
AB Ab ab aB Ab aB Ab aB
      A       a
B   0.125   0.375
b   0.375   0.125
Within-family linkage disequilibrium
• Pooling the progeny of the two males:
      A      a
B   0.25   0.25
b   0.25   0.25
• No overall linkage disequilibrium
• Why tracing QTLs within family is easy
Measures of LD: r2
• r is the correlation between two loci if we code « A » = 1, « a » = 0 and « B » = 1, « b » = 0
• Not free from problems but can be understood by statisticians (and breeders)
• The sample size needed to achieve a given power is proportional to 1/r2 (Pritchard & Przeworski 2001, Am J Hum Genet 69:1)
• Everybody uses it to describe things in genomic selection.
$$ r = \frac{f(AB) - pq}{\sqrt{p(1-p)\,q(1-q)}} = \frac{D}{\sqrt{p(1-p)\,q(1-q)}} $$
with p = p(A), q = p(B) and D = f(AB) - pq
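As a check on the definition, r can be computed both from the formula and as an ordinary correlation between 0/1 indicator variables. The sketch reuses the 10 phased chromosomes from the biallelic example earlier in these slides; the variable names are arbitrary.

```python
import math

haplotypes = "AB AB ab aB ab ab Ab AB Ab AB".split()
n = len(haplotypes)

# Code "A" = 1, "a" = 0 and "B" = 1, "b" = 0
x = [1 if h[0] == "A" else 0 for h in haplotypes]
y = [1 if h[1] == "B" else 0 for h in haplotypes]

p = sum(x) / n                                   # p(A)
q = sum(y) / n                                   # p(B)
f_AB = sum(a * b for a, b in zip(x, y)) / n      # frequency of AB chromosomes
D = f_AB - p * q

# r from the LD formula
r_formula = D / math.sqrt(p * (1 - p) * q * (1 - q))

# r as a Pearson correlation between the two indicator variables
cov = sum((a - p) * (b - q) for a, b in zip(x, y)) / n
var_x = sum((a - p) ** 2 for a in x) / n
var_y = sum((b - q) ** 2 for b in y) / n
r_corr = cov / math.sqrt(var_x * var_y)

print(round(r_formula, 4), round(r_corr, 4), round(r_formula ** 2, 4))
```

Both routes give the same value, which is why r2 is easy to interpret as a squared correlation.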
Bayesian Inference
Gibbs sampling
• Iterative procedure
– Construct a joint distribution p(A,B,C)
• Typically this distribution contains phenotypes + a priori information + likelihood
• We want to draw inferences from this distribution, for instance the expected value of A
– Sampling
• Sample A from p(A|B,C)
• Sample B from p(B|A,C)
• Sample C from p(C|A,B)
• Sample A from p(A|B,C)
• Sample B from p(B|A,C)
• Sample C from p(C|A,B)
• …
Gibbs sampling
• Iterative procedure in two steps
– Burn-in
• Some iterations of "burn-in"
• Typical trace along the iterations
Gibbs sampling
– The final result at the end of the iterations is NOT the solution we are looking for (this contrasts with REML or Gibbs)
– No clear measure of convergence
– We accumulate information over the post-burn-in iterations
• Solution = average of the post-burn-in iterations
• Example:
• $\hat{a}_i^{\text{final}} = \frac{1}{n} \sum \tilde{a}_i$, where $\tilde{a}_i$ is sampled over n iterations
• E.g., in BayesC, $\hat{\delta}_i = 0.3$ means that over 1000 iterations, 300 times $\tilde{\delta}_i = 1$ and 700 times $\tilde{\delta}_i = 0$
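The scheme of alternating conditional draws, discarding a burn-in and averaging the remaining samples can be illustrated with a toy problem. The sketch below is not one of the genomic models discussed here: it assumes a standard bivariate normal with correlation rho, for which both conditional distributions are known normals, and uses two variables instead of three.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, burn_in, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)       # conditional standard deviation
    a, b = 5.0, -5.0                   # deliberately poor starting values
    kept = []
    for it in range(n_iter):
        a = rng.gauss(rho * b, sd)     # sample A from p(A | B)
        b = rng.gauss(rho * a, sd)     # sample B from p(B | A)
        if it >= burn_in:              # keep only post-burn-in samples
            kept.append((a, b))
    return kept

samples = gibbs_bivariate_normal(rho=0.8, n_iter=20000, burn_in=1000)
n = len(samples)

# The estimate is the average over the post-burn-in iterations
mean_a = sum(a for a, _ in samples) / n
mean_b = sum(b for _, b in samples) / n
cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in samples) / n
var_a = sum((a - mean_a) ** 2 for a, _ in samples) / n
var_b = sum((b - mean_b) ** 2 for _, b in samples) / n
est_rho = cov_ab / math.sqrt(var_a * var_b)

print(n, round(mean_a, 3), round(est_rho, 3))
```

No single iteration is the answer; the posterior mean and correlation are recovered only by averaging over the kept samples.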
Models for Genomic selection
• Single marker
• Whole-genome (multiple marker) genomic
selection
Single marker
• Assume there is a marker in complete LD with
a QTL
• For example, the polymorphism in the halothane
gene (HAL) which is a predictor of bad meat quality
in swine
Single marker
• Estimating breeding values including the marker is a piece of cake
• yi = marker effect in animal i + e
– We substitute the true, possibly unknown gene by a proxy observed marker and estimate the effects of the latter using a linear model
– We can include an additional polygenic genetic value of animal i
Base model
• y = … + Za + e
– Z = incidence matrix of marker effects
– a = marker effect
– e = residuals
$$Za = \begin{pmatrix} 0 & 1 & 1 & 0\\ 2 & 0 & 0 & 0\\ 0 & 1 & 0 & 1 \end{pmatrix}\begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4 \end{pmatrix}$$
3 individuals, 1 marker with 4 alleles
• This can be solved, for example, by least squares
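As a small sketch of how such an incidence matrix could be filled, the code below counts allele copies per individual. The genotypes (allele pairs 2/3, 1/1 and 2/4) are an assumption, read back from the matrix on the slide; the function name `allele_incidence` is hypothetical.

```python
def allele_incidence(genotypes, n_alleles):
    """Row i counts how many copies of each allele individual i carries."""
    Z = []
    for allele_pair in genotypes:
        row = [0] * n_alleles
        for allele in allele_pair:        # alleles are numbered 1..n_alleles
            row[allele - 1] += 1
        Z.append(row)
    return Z

# Assumed genotypes of the 3 individuals at the single 4-allele marker
genotypes = [(2, 3), (1, 1), (2, 4)]
Z = allele_incidence(genotypes, n_alleles=4)
for row in Z:
    print(row)
```

Each row sums to 2, one count per allele copy of a diploid individual.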
Single marker
• This is fine if we know which markers are good predictors of which genes
• But this is rarely the case
– It can be shown that you miss a lot of information by trying to locate the QTLs
– And the effects of those that you find are certainly exaggerated
• Go to notes
Whole genome
• If we don't select QTL regions we skip the problem of bias
• Therefore :
– Genetic value = sum of effects of all regions
• We effectively treat all regions as being carriers of a
QTL
– How do we estimate the effects of all regions?
Whole genome
• The simplest approach is an extension of single marker analysis
• Do multiple marker regression
• You want to cover all the genome => many markers
Multiple marker additive model
$$Za = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0\\ 2 & 0 & 2 & 0 & 0 & 0\\ 0 & 2 & 1 & 0 & 0 & 1\\ 0 & 2 & 1 & 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} a_{1,1}\\ a_{1,2}\\ a_{2,1}\\ a_{2,2}\\ a_{2,3}\\ a_{2,4} \end{pmatrix}$$
2 alleles in 1st marker
4 alleles in 2nd marker
4 individuals, 2 markers each
• y = Za + e
– Z = incidence matrix of marker effects
– a = marker effect
– e = residuals
Estimating SNP effects
• The simultaneous estimates of many markers by least squares are very poor
– if we have many more SNPs than individuals
– They are thus terribly bad for genomic predictions as well
• Even if we had many individuals, there is a missing piece of information:
– we think that most SNPs do not have an effect, or at least not a big one
– this is « prior » information
• Can we do something?
Best Predictor as a Bayesian estimator
$$\hat a = E(a \mid y) = \frac{\int a\, p(y \mid a)\, p(a)\, da}{\int p(y \mid a)\, p(a)\, da}$$
– $\hat a$: estimate of SNP effects
– $p(a)$: « Prior » (how we think SNP effects are)
– $p(y \mid a)$: « Likelihood » (how SNP effects affect the phenotype)
Best Predictor as a « penalized » estimator
• Statisticians & « machine learners » often use « penalized » estimators (Ridge regression, Elastic net…)
• A penalized estimator is the same as a « best predictor » (or a Bayesian estimator) as before, with the prior now called « penalization »
In the reference population:
– Get markers' genotypes ($Z_r$)
– Get phenotypes ($y$)
– Estimate marker effects $a$ from $y = 1\mu + Z_r a + e$, possibly with a Bayesian model
In the candidates:
– Get markers' genotypes ($Z_c$)
– Take estimates $\hat a$ from above
– Estimate breeding values as $\hat u_c = Z_c \hat a$
• Go to notes
A priori Distributions for marker effects
• Several distributions for SNP effects have been
proposed
– Normal (Meuwissen et al., Genetics 2001; VanRaden, JDS 2008) → BLUP_SNP or GBLUP or RR-BLUP or « Ridge regression »
– BayesA, BayesB (Meuwissen et al., 2001; Habier et al., 2011)
– Mixture of normals, BayesC(Pi) (VanRaden, JDS 2008; Habier et al., 2011)
– (Bayesian) Lasso (Usai et al., 2009; De los Campos et al., 2009)
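Prior variances for SNP effects
– Normal: $a_i \sim N(0, \sigma^2_a)$, with $Var(a_i) = \sigma^2_a$
– BayesA: $a_i \sim t(0, S^2_a, \nu_a)$, with $Var(a_i) = S^2_a\,\nu_a/(\nu_a - 2)$
– BayesCPi: $a_i = 0$ with $\Pr = \pi$ and $a_i \sim N(0, \sigma^2_a)$ with $\Pr = 1 - \pi$, so $Var(a_i) = (1 - \pi)\,\sigma^2_a$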
Normal distribution
[Figure: normal density, dnorm(x) against x]
Few « big » effects
$a_i \sim N(0, \sigma^2_a)$
Normal equations for genomic selection
(BLUP_SNP)
• If we assume normality there are closed
expressions for â
• This is called « BLUP », and also « genomic
BLUP » , BLUP_SNP, or GBLUP, but also « ridge
regression » or Random Regression-BLUP
– I will keep GBLUP for the use of the genomic
relationship matrix
– and BLUP_SNP for the direct estimation of SNP effects
Mixed model equations for BLUP_SNP
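• Henderson's MME
• Z'Z is not diagonal
• Var(a)=D is diagonal if we assume uncorrelated SNP effects
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat a \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ Z'R^{-1}y \end{pmatrix}$$
$$D = I\sigma^2_a = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}\sigma^2_a$$
D could (will) be something different !!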
Solving BLUP_SNP: GS with Residual Update
• How to estimate SNP effects efficiently (Legarra and Misztal, J. Dairy Sci. 2008) (reinvented many times)
• Let us assume a random SNP model ("BLUP_SNP")
• Mixed model equations can be solved by direct inversion (1 iteration) or by Gauss-Seidel, PCG or Jacobi (iterative methods, useful for large matrices).
• MCMC (BayesB, etc.) can be done starting from Gauss-Seidel
• The number of effects (SNP) n is much larger than the number of records m, and the matrix Z'Z is dense. A typical example (2000 records, 20000 SNP):
Efficient solvers for BLUP_SNP:
• Gauss-Seidel with Residual Update: ( Legarra and Misztal J.
Dairy Sci. 2008) (reinvented many times) implemented in GS3
– Form the basis of the Gibbs Sampling Algorithms in BayesC, etc.
– Iterate on:
1. Estimate SNP effect
2. Correct data for this SNP effect
• Preconditioned Conjugate Gradients (not in GS3)
– Estimate all SNP simultaneously using a "search" based on residuals at each iteration
The size of the MME
Model: y = Za, with Z of size m × n (40,000,000 elements)
Equations: Z'Z â = Z'y, with Z'Z (dense) of size n × n (400,000,000 elements)
Much bigger! Is this memory efficient? Easy to solve?
Reordering Gauss Seidel
• Gauss Seidel uses the conditional mean for the i-th effect, corrected by the other effects:
• $(z_i'z_i + \lambda)\,\hat a_i^{l+1} = z_i'(y - Z\hat a + z_i\hat a_i^{l})$
• Note that we are correcting for $\hat a_i$, so we put it back
• Correcting for $Z\hat a$ takes 20000 operations
• This is the residual $\hat e$, isn't it?
• Use the alternative formula $(z_i'z_i + \lambda)\,\hat a_i^{l+1} = z_i'\hat e + z_i'z_i\,\hat a_i^{l}$
Reordering the error term
• Still we need to compute $\hat e$ at each iteration
• Actually only $\hat a_i$ changed
• It can be shown that $\hat e$ can be « updated »: $\hat e^{l+1} = \hat e^{l} - z_i(\hat a_i^{l+1} - \hat a_i^{l})$
– Hence « GSRU », Gauss Seidel with Residual Updating
– Some machine learning literature calls this « backfitting »
GSRU in Figure
[Figure: the model y = Za + e, with the two steps applied to SNP i]
1- Gauss-Seidel: $(z_i'z_i + \lambda)\,\hat a_i^{l+1} = z_i'\hat e + z_i'z_i\,\hat a_i^{l}$
2- Residual Updating: $\hat e^{l+1} = \hat e^{l} - z_i(\hat a_i^{l+1} - \hat a_i^{l})$
Fortran pseudocode
double precision :: xpx(neq), y(ndata), e(ndata), X(ndata,neq), &
                    sol(neq), lambda, lhs, rhs, val
do i=1,neq
  xpx(i)=dot_product(X(:,i),X(:,i)) ! form diagonal of X'X
enddo
sol=0d0 ! start from null solutions, so that e=y
e=y
do until convergence
  do i=1,neq
    ! form lhs: X'R-1X + G-1
    lhs=xpx(i)/vare+1/vara
    ! form rhs with y corrected by other effects (formula 1): X'R-1y
    rhs=dot_product(X(:,i),e)/vare + xpx(i)*sol(i)/vare
    ! do Gauss Seidel
    val=rhs/lhs
    ! MCMC sample solution from its conditional (commented out here)
    ! val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    ! update sol
    sol(i)=val
  enddo
enddo
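The pseudocode above can be tried on a toy data set with the following sketch. It is a plain-Python transcription under stated assumptions, not the GS3 implementation: `gsru`, the 4 × 3 matrix `X`, the vector `y` and the fixed number of rounds are hypothetical, and the single ratio `lam` stands for vare/vara (the pseudocode's lhs and rhs both multiplied by vare).

```python
def gsru(X, y, lam, n_rounds=200):
    """Gauss-Seidel with residual update for (X'X + lam*I) sol = X'y."""
    n_data, n_eq = len(X), len(X[0])
    cols = [[X[r][i] for r in range(n_data)] for i in range(n_eq)]
    xpx = [sum(v * v for v in col) for col in cols]   # diagonal of X'X
    sol = [0.0] * n_eq
    e = list(y)                                       # residual when sol = 0
    for _ in range(n_rounds):
        for i, col in enumerate(cols):
            # formula 1: x_i'e plus the current effect "put back"
            rhs = sum(c * r for c, r in zip(col, e)) + xpx[i] * sol[i]
            val = rhs / (xpx[i] + lam)
            # formula 2: update the residual with the change in this effect
            change = val - sol[i]
            e = [r - c * change for r, c in zip(e, col)]
            sol[i] = val
    return sol, e

X = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]   # toy covariates
y = [1.0, 2.0, 3.0, 4.0]                            # toy records
sol, e = gsru(X, y, lam=1.0)
print(sol)
```

At convergence each equation satisfies x_i'e = lam * sol_i, which is a convenient check that the residual has been kept in step with the solutions.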
Preconditioned Conjugate Gradients
• The other method of choice to solve large systems of equations (e.g. Strandén and Lidauer, 1998; Tsuruta et al., 2001)
• Based on repeated computations of Ax, where Ax = t are the mixed model equations:
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + D^{-1} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat a \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ Z'R^{-1}y \end{pmatrix}, \qquad Ax = (W'W + \Sigma^{-1})\,x = t$$
• Can easily be done for genomic models as $W'(Wx) + \Sigma^{-1}x$ at a cost of 3nm operations
• PCG is much faster (but less general)
Preconditioned Conjugate Gradients for BLUP_SNP
[Figure: log10(convergence) against round (0 to 10000) with real data (Holstein), for PCG and GSRU]
• PCG is much faster
• GSRU convergence is slow for large data sets (or you really need to wait)
• Still, EBV’s seem identical, possibly because errors in SNP estimates cancel out when summing.
BLUP_SNP parameters
• How do we get the variance of SNP effects, $\sigma^2_a$, from a genetic variance $\sigma^2_g$?
• The formula comes from the sampling variance of the covariates in Z that link SNP effects to data
– i.e., we try to explain all genetic variance as if « caused » by SNP effects, and these SNP effects have a variance of $\sigma^2_a$
• Assumes Hardy-Weinberg and linkage equilibrium
$$\sigma^2_a = \frac{\sigma^2_g}{2\sum_{\text{all SNPs}} p_i(1-p_i)}$$
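A quick numerical sketch of this conversion follows. The function name `snp_variance`, the genetic variance of 100 and the 1000 SNPs at frequency 0.5 are assumptions made up for the illustration.

```python
def snp_variance(sigma2_g, allele_freqs):
    """sigma2_a = sigma2_g / (2 * sum of p_i * (1 - p_i))."""
    two_sum_pq = 2.0 * sum(p * (1.0 - p) for p in allele_freqs)
    return sigma2_g / two_sum_pq

freqs = [0.5] * 1000              # hypothetical allele frequencies
sigma2_a = snp_variance(100.0, freqs)
print(sigma2_a)                   # 100 / (2 * 1000 * 0.25)
```

With many SNPs the variance assigned to each one becomes very small, which is why 1/σ²a is large in the equations.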
Residual variance with pseudo-data
• What is the residual variance if we use DYD's?
– DYD is a "milk yield" assigned to a bull (had it been a cow)
• DYD = performance of the daughters, corrected by dams' BVs and other effects. Ideally:
$$2\,DYD_i = u_i + \frac{1}{n_i}\sum_j 2\phi_j + \frac{1}{n_i}\sum_j 2e_j = u_i + \epsilon_i$$
– $u_i$: bull BV
– $\phi_j$: Mendelian sampling of his daughters
– $e_j$: « true » residuals
– $\epsilon_i$: « pseudo » residuals
• But $Var(\phi) = \frac{1}{2}\sigma^2_u$
• And therefore $Var(\epsilon_i) = \frac{1}{n_i}\left(2\sigma^2_u + 4\sigma^2_e\right)$
• $n_i$ = « edc », equivalent daughter contribution
• Residual variances with deregressed proofs can be found in Garrick et al. 2009
Estimating variances = BayesC (with π=0)
• It simply consists in a BLUP_SNP where we estimate (and simultaneously « integrate out ») $\sigma^2_a$ and $\sigma^2_e$
– i.e., a regular Gibbs sampler applied to SNPs instead of EBV's (Gibbs-SNP ??)
– Legarra et al., 2008 (we didn't call it BayesC), Habier et al., 2011
• Pretty straightforward from GSRU
• You can as well estimate $\sigma^2_a$ and $\sigma^2_e$ using « BayesC » and take them as known in BLUP_SNP (e.g. as in REML+BLUP analysis)
$$y = Xb + Za + e$$
$$a \mid \sigma^2_a \sim MVN(0, I\sigma^2_a); \quad \sigma^2_a \sim \nu_a S^2_a\,\chi^{-2}_{\nu_a}$$
$$e \mid \sigma^2_e \sim MVN(0, I\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e\,\chi^{-2}_{\nu_e}$$
Fortran pseudocode for BayesC
...
do j=1,niter
  do i=1,neq
    ! form lhs
    lhs=xpx(i)/vare+1/vara
    ! form rhs with y corrected by other effects (formula 1)
    rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
    ! MCMC sample solution from its conditional
    val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    ! update sol
    sol(i)=val
  enddo
  ! draw variance components
  ss=sum(sol**2)+nua*Sa
  vara=ss/chi(nua+nsnp)
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
enddo
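The variance draws in the pseudocode (`ss/chi(df)`) are scaled inverse chi-squared draws. A minimal sketch with the Python standard library is given below; `draw_variance` and the numbers passed to it are hypothetical, and the chi-squared deviate is obtained as a gamma deviate with shape df/2 and scale 2.

```python
import random

def draw_variance(sum_of_squares, prior_df, prior_scale, n, rng):
    """Draw a variance from its scaled inverse chi-squared full conditional."""
    ss = sum_of_squares + prior_df * prior_scale   # e.g. sum(sol**2) + nua*Sa
    df = prior_df + n                              # e.g. nua + nsnp
    chi2 = rng.gammavariate(df / 2.0, 2.0)         # chi-squared draw with df degrees of freedom
    return ss / chi2

rng = random.Random(42)
# Assumed values: 50 SNP effects with sum of squares 50, prior with 4 df and scale 1
draws = [draw_variance(50.0, 4.0, 1.0, 50, rng) for _ in range(5)]
print(draws)
```

Each Gibbs round uses one such draw for σ²a and another for σ²e; the draws scatter around ss/(df − 2).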
Estimate of this SNP
BLUP_SNP solution:
$$\hat a_{BLUP} = \frac{x'y/\sigma^2_e}{x'x/\sigma^2_e + 1/\sigma^2_a}$$
where $\sigma^2_a$ is the variance of the SNP
Least squares solution (e.g. in GWAS):
$$\hat a_{GWAS} = \frac{x'y/\sigma^2_e}{x'x/\sigma^2_e}$$
In BLUP_SNP, we shrink the least squares estimate towards 0 because usually $1/\sigma^2_a$ is a large number
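The amount of shrinkage is easy to see with numbers. In this sketch the values of x'y, x'x and the variances are invented for the illustration, and `snp_estimates` is a hypothetical helper.

```python
def snp_estimates(xy, xx, vare, vara):
    """Least squares (GWAS-like) and BLUP_SNP estimates of one SNP effect."""
    gwas = (xy / vare) / (xx / vare)
    blup = (xy / vare) / (xx / vare + 1.0 / vara)
    return gwas, blup

# Assumed values: x'y = 50, x'x = 1000, residual variance 1, SNP variance 0.001
gwas, blup = snp_estimates(xy=50.0, xx=1000.0, vare=1.0, vara=0.001)
print(gwas, blup)
```

Here 1/σ²a equals x'x/σ²e, so the BLUP_SNP estimate is half the least squares estimate; a smaller SNP variance shrinks it further.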
Estimate of this SNP
• So, the estimate is much smaller than the GWAS estimate
• But we can fit all SNPs simultaneously
• And this provides unbiased (in some sense) estimates
• However, the result is very confusing for QTL detection, and it is unclear how these estimates behave for "large" QTLs:
[Figure: squared SNP solutions (snps$solution^2) against SNP index]
Estimate of this SNP
This suggests an iterative/adaptive strategy
$$\hat a_{i,BLUP} = \frac{x'y/\sigma^2_e}{x'x/\sigma^2_e + 1/\sigma^2_{a_i}}$$
where $\sigma^2_{a_i}$ is the variance of THIS SNP
If $\sigma^2_{a_i} \to \infty$ we get the least squares estimate
The more important the SNP, the larger $\sigma^2_{a_i}$
- BayesA, etc. (see later)
BayesA
• We « estimate » a different $\sigma^2_a$ for each SNP
– this estimate is horribly bad
– but SNP solutions correspond to a model with « t » distributions
• Pretty straightforward from GSRU
$$y = Xb + Za + e$$
$$e \mid \sigma^2_e \sim MVN(0, I\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e\,\chi^{-2}_{\nu_e}$$
Representation as « t »: $a_i \sim t(0, S^2_a, \nu_a)$
Meuwissen et al. representation: $a_i \mid \sigma^2_{a_i} \sim N(0, \sigma^2_{a_i})$; $\sigma^2_{a_i} \sim \nu_a S^2_a\,\chi^{-2}_{\nu_a}$
Normal vs. BayesA
[Figure: normal density, dnorm(x) against x, compared with the BayesA prior]
big effects are more likely in BayesA
Fortran pseudocode for BayesA
...
do j=1,niter
  do i=1,neq
    ! form lhs
    lhs=xpx(i)/vare+1/vara(i)
    ! form rhs with y corrected by other effects (formula 1)
    rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
    ! MCMC sample solution from its conditional
    val=normal(rhs/lhs,1d0/lhs)
    ! update e with current estimate (formula 2)
    e=e - X(:,i)*(val-sol(i))
    ! update sol
    sol(i)=val
    ! draw the variance of this SNP
    ss=sol(i)**2+nua*Sa
    vara(i)=ss/chi(nua+1)
  enddo
  ! draw the residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
enddo
BayesB (mixture with t distribution)
[Figure: t density with 4 degrees of freedom, dt(x, 4) against x]
A fraction π of markers has null effects.
Otherwise a t distribution (big effects are not unlikely)
$a_i \sim \pi \cdot 0 + (1 - \pi)\, t(0, S^2_a, \nu_a)$
BayesB
• e.g. Meuwissen et al., 2001
• What if some SNP had no effect in BayesA?
– This is the original idea of BayesB
– needs the probability that a given SNP is in the model or not
– can be computed by MCMC but is notoriously more difficult (see for instance Villanueva et al., doi: 10.2527/jas.2010-3814)
$$y = Xb + Za + e$$
$$a_i \mid \delta_i \sim \begin{cases} t(0, S^2_a, \nu_a) & \text{if } \delta_i = 1\\ 0 & \text{if } \delta_i = 0 \end{cases}$$
$$\delta_i = \begin{cases} 0 & \text{with probability } \pi\\ 1 & \text{with probability } 1 - \pi \end{cases}$$
$$e \mid \sigma^2_e \sim MVN(0, I\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e\,\chi^{-2}_{\nu_e}$$
π is fixed
Mixture distribution or BayesC(Pi)
[Figure: normal density, dnorm(x) against x]
A fraction π of markers has null (or almost null) effects.
Otherwise they are normal
$a_i \sim \pi \cdot 0 + (1 - \pi)\, N(0, \sigma^2_a)$
BayesCPi
• e.g. Habier et al., 2011 (see also Rohan Fernando course notes)
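• What if some SNP had no effect?
– This is the original idea of BayesB
– needs the probability that a given SNP is in the model or not
– can be computed by MCMC
$$y = Xb + Za + e$$
$$a_i \mid \sigma^2_a, \delta_i \sim \begin{cases} N(0, \sigma^2_a) & \text{if } \delta_i = 1\\ 0 & \text{if } \delta_i = 0 \end{cases}$$
$$\sigma^2_a \sim \nu_a S^2_a\,\chi^{-2}_{\nu_a}$$
$$\delta_i = \begin{cases} 0 & \text{with probability } \pi\\ 1 & \text{with probability } 1 - \pi \end{cases}$$
$$e \mid \sigma^2_e \sim MVN(0, I\sigma^2_e); \quad \sigma^2_e \sim \nu_e S^2_e\,\chi^{-2}_{\nu_e}$$
π can be fixed or estimated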
BayesCPi
• Algorithm consists in a BLUP_SNP by GSRU where we estimate (and simultaneously « integrate out ») $\sigma^2_a$ and $\sigma^2_e$
– for each SNP we compute the probability of it being « in » the model (indicator variable δ)
• This was a nightmare in original BayesB
• R Fernando found out a simple way of computing it (course notes: http://www.ans.iastate.edu/stud/courses/short/2010short.html ) that is « like » GSRU
– we can equally compute the proportions π or fix them previously
Fortran pseudocode for BayesCPi
...
do j=1,niter
  do i=1,neq
    ...
    ! compute likelihood for state 1 (i -> in model) and 0 (not in model)
    ! Notes by RLF (2010, Bayesian Methods in
    ! Genome Association Studies, p 47/67)
    v1=xpx(i)*vare+(xpx(i)**2)*vara
    v0=xpx(i)*vare
    rj=rhs*vare ! because rhs=X'R-1(y corrected)
    ! prob state delta=0
    like2=density_normal((/rj/),v0) !rj = N(0,v0)
    ! prob state delta=1
    like1=density_normal((/rj/),v1) !rj = N(0,v1)
    ! add prior for delta
    like2=like2*pi; like1=like1*(1-pi)
    ! standardize (same denominator for both states)
    tot=like2+like1
    like2=like2/tot; like1=like1/tot
    delta(i)=sample(states=(/0,1/),prob=(/like2,like1/))
    if(delta(i)==1) then
      val=normal(rhs/lhs,1d0/lhs)
    else
      val=0
    endif
    ...
  enddo
  pi=1-beta(count(delta==1)+apriori_included,count(delta==0)+apriori_not_included)
  ss=sum(sol**2)+nua*Sa
  vara=ss/chi(nua+count(delta==1))
  …
enddo
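The inclusion step can be isolated in a few lines. The sketch below follows the pseudocode's convention that `pi` is the prior probability of state 0 (SNP not in the model); the function names and the numeric values are assumptions, and `rj` stands for x_i'(y corrected for all other effects).

```python
import math

def normal_density(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(rj, xpx, vare, vara, pi_null):
    """P(delta_i = 1 | everything else) for one SNP."""
    v0 = xpx * vare                       # Var(rj) if the SNP has no effect
    v1 = xpx * vare + xpx ** 2 * vara     # Var(rj) if the SNP is in the model
    like0 = normal_density(rj, v0) * pi_null
    like1 = normal_density(rj, v1) * (1.0 - pi_null)
    return like1 / (like0 + like1)        # same denominator for both states

# Assumed values: x'x = 100, residual variance 1, SNP variance 0.01, pi = 0.9
p_weak = inclusion_probability(0.0, 100.0, 1.0, 0.01, 0.9)
p_strong = inclusion_probability(60.0, 100.0, 1.0, 0.01, 0.9)
print(p_weak, p_strong)
```

A SNP whose corrected right-hand side is close to zero is less likely to be sampled into the model than its prior suggests, while a large value pushes the probability towards 1.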
BayesCPi
• So far this looks simple
• But BayesCPi has many details & caveats
– How to run the Gibbs sampler?
• Rule of thumb: iterate n times the number of markers, with 100 > n > 5
– (need to find the good combination of markers)
– Do we estimate or fix π? At which values?
– What do we get as results?
BayesCPi
• Parameter π (or (1 − π)) is the proportion of SNPs in the model
• Do we estimate or fix it?
– In theory we can estimate it
– In practice it is very tricky
• Colombani et al. could estimate it in Holstein but not in Montbéliarde (estimate too imprecise)
– Usually we "fix" it to 1/1000 (50 SNP out of 50,000) for QTL detection and to 1/100 for genomic selection
– Or, we put a uniform prior on π for genomic selection
BayesCPi
• Parameter π and genetic variance
• In the case of BayesCPi, $\sigma^2_g = 2\pi\sum p_i q_i\,\sigma^2_a$
• So, to recover all genetic variance from SNPs, we need to modify π and $\sigma^2_a$ at the same time
– Then $\sigma^2_a = \frac{\sigma^2_g}{2\pi\sum p_i q_i}$
• So, π = 0.001 implies that $\sigma^2_a$ is 1000 times larger than in BLUP_SNP and estimates are less "shrunken"
BayesCPi
• Output of BayesCPi
– For each SNP, the marginal posterior probability of being "in the model"
– Not a single subset of SNPs "in the model"
effect level solution sderror p
2 1 0.49637122E-02 0.63842196E-01 0.69375000E-02
2 2 0.49501460E-03 0.17864670E-01 0.10375000E-02
2 3 0.38664734E-04 0.79524430E-02 0.32500000E-03
2 4 0.18222423E-04 0.59148438E-02 0.25000000E-03
2 5 0.21643136E-03 0.11477947E-01 0.53750000E-03
2 6 -0.55016190E-03 0.28990326E-01 0.97500000E-03
2 7 0.94168849E-04 0.74293395E-02 0.28750000E-03
• This implies that most SNP are in LD with some QTL somewhere
• Sometimes, a single SNP stands out → large QTL
[Figure: posterior probability of each SNP being in the model against position, OAR12, Salle et al. (JAS)]
BayesCPi
• How do we declare a "positive" QTL?
• Have no p-values in this analysis
– Bayesians insist on using the Bayes Factor for that (Wakefield; Bertrand & Stephens, etc.) but there are no clear rules how
• Construct the Bayes Factor as the posterior « odds » divided by the prior « odds »:
$$BF = \frac{p(\text{SNP in the model} \mid \text{data})\,/\,p(\text{SNP not in the model} \mid \text{data})}{p(\text{SNP in the model})\,/\,p(\text{SNP not in the model})}$$
In our case:
$$BF_i = \frac{1 - \pi}{\pi}\;\frac{p(\delta_i = 1 \mid y)}{1 - p(\delta_i = 1 \mid y)}$$
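As a sketch, the Bayes Factor can be computed directly from the posterior probability reported for each SNP. Here `prior_prob` is the prior probability of the SNP being in the model (π in the formula above); the function name and the example probabilities are assumptions.

```python
def bayes_factor(post_prob, prior_prob):
    """Posterior odds of 'SNP in the model' divided by the prior odds."""
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds

# Hypothetical SNP: posterior probability 0.3 with a prior probability of 0.001
bf = bayes_factor(0.3, 0.001)
print(bf)
```

A posterior probability equal to the prior gives BF = 1 (the data did not change the odds), whatever the value of the prior.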
BayesCPi
• What thresholds for BF?
• Some people suggest using permutations → too long
• Use a scale adapted from Kass & Raftery (1995), used in QTL detection by Varona et al. (2001, GSE) and Vidal et al. (2005, JAS)
– BF = 3-20: "suggestive"
– BF = 20-150: "strong"
– BF > 150: "very strong"
• We don't need correction for multiple testing (Bonferroni):
– all SNP were introduced at the same time
– And the prior already « penalizes » their estimates
[Figure: BF of each SNP against position, OAR12, Salle et al. (JAS), with the « Very strong » level marked]
Lasso (double exponential)
[Figure: double exponential density, dexp(abs(x), 4) against x]
Often a marker has an almost null effect
Otherwise big effects are not unlikely
$p(a_i) = \frac{\lambda}{2}\exp(-\lambda|a_i|)$
Lasso
Hierarchical representation of Lasso
• y : data
• a : SNP effects
$$y = Xb + Za + e$$
$$a_i \mid \lambda \sim \frac{\lambda}{2}\exp(-\lambda|a_i|)$$
$$e \mid \sigma^2_e \sim MVN(0, I\sigma^2_e)$$
[Figure: distribution of SNP effects, dexp(abs(x), 4) against x]
Bayesian Lasso
• This Bayesian Lasso is being used for genomic selection (De los Campos et al., 2009)
• The following is largely from Legarra et al. (Genetical Res., 2011)
• In regular Lasso, λ is typically computed by cross-validation
– which depends strongly on the constitution of the training & validation data sets
– and is tricky to compute
• The Bayesian Lasso (Park & Casella 2008) uses an equivalent hierarchical model:
$$y = Xb + Za + e$$
$$p(a \mid \tau) \sim N(0, D\sigma^2)$$
$$p(e) \sim N(0, I\sigma^2)$$
$$D = diag(\tau^2_1, \tau^2_2, \ldots, \tau^2_n); \quad p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2}\, e^{-\lambda^2\tau^2_i/2}$$
$\tau^2$ are « weights » of $\sigma^2$
• Assume SNP effects have a different « variance », set to $\sigma^2 = 1$
• This is more similar to BLUP_SNP, BayesA, BayesC, etc.
– Equivalent to Tibshirani's original Lasso
$$y = Xb + Za + e$$
$$p(a \mid \tau) \sim N(0, D)$$
$$p(e) \sim N(0, I\sigma^2_e)$$
Tibshirani's BL
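Bayesian Lasso vs. BayesA
$$y = Xb + Za + e$$
$$p(a \mid \tau) \sim N(0, D); \quad p(e) \sim N(0, I\sigma^2_e)$$
$$D = diag(\tau^2_1, \tau^2_2, \ldots, \tau^2_n)$$
$\tau^2$ are « variances » of SNP effects ($\sigma^2_{a_i}$ in BayesA)
Distribution of the variances:
– BL (exponential): $p(\tau \mid \lambda) = \prod_i \frac{\lambda^2}{2}\, e^{-\lambda^2\tau^2_i/2}$
– BayesA (inverted chi-squared): $\tau^2_i \sim \nu S^2_a\,\chi^{-2}_{\nu}$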
Fortran pseudocode for BL
...
do j=1,niter
  do i=1,neq
    ! form lhs
    lhs=xpx(i)/vare+1/tau2(i)
    rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
    val=normal(rhs/lhs,1d0/lhs)
    e=e - X(:,i)*(val-sol(i))
    sol(i)=val
    ! draw the variance of this SNP
    ss=sol(i)**2
    tau2(i)=1d0/rinvGauss(lambda2/ss,lambda2)
  enddo
  ! draw the residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
  ! update lambda
  ...
enddo
Bayesian Lasso
• It gives different weights to larger SNPs
• Mixing is better than BayesCPi
• Performance in Genomic Selection is as good (Colombani et al., 2013, JDS)
• But there is no clear notion of which SNP is a QTL, and the few papers with "lasso for QTL" are disappointing.
Advice
• Use everything: GBLUP, BayesCPi, Bayesian Lasso
• GBLUP is very good if variances are correct
– If not, do estimate them (G matrix + REML)
• Bayesian methods are sensitive to parameters
– BayesB, BayesCPi are more sensitive than BayesA, Bayesian Lasso
– Seems that the notion of « SNP with no effect » is incorrect
• Multiple marker methods need attention to details: correct priors and initial values, computation time, verification of convergence
• Details are typically overlooked by most users !!!
Quantitative genetics of markers
• Go to notes
Equivalences
• Pedigree (Malécot)
relationships assumes we have
2N founder alleles
• Then we genotype individual 9
• In this case,
– molecular coancestry = Malécot
IBD coancestry
• However SNPs have 2 alleles
– How are then these equivalences?
[Figure: pedigree with founder alleles numbered 1 to 8; individual 9 carries alleles 3 and 2]
With SNPs…
• Let us imagine that to each one of the 2N founder alleles we assign at random a tag saying if the allele is A or a, with probability p and q=1-p
• Then we genotype individual 9
• Can we say which ancestral alleles (1 to 8) individual 9 inherited?
[Figure: the same pedigree, founder alleles numbered 1 to 8]
With SNPs…
• The molecular coancestry between two individuals i and j, $f_{M_{ij}}$, is the probability that two alleles are equal (alike in state),
• either because they are identical by descent, or
• because they are not identical by descent but equal in the base population.
$$f_{M_{ij}} = p^2 + q^2 + 2pq\,f_{ij}$$
Real results (AMASGEN)
• 9 real French bulls among 1827 genotyped, ~50000 SNPs
• Very complex pedigree, simplified graph:
[Figure: simplified pedigree graph of the 9 bulls]
Pedigree-based relationship
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1.00 0.51 0.57 0.51 0.26 0.15 0.15 0.14 0.14
[2,] 0.51 1.01 0.30 0.33 0.17 0.17 0.12 0.11 0.11
[3,] 0.57 0.30 1.07 0.30 0.20 0.12 0.18 0.11 0.12
[4,] 0.51 0.33 0.30 1.01 0.17 0.18 0.11 0.11 0.11
[5,] 0.26 0.17 0.20 0.17 1.00 0.56 0.51 0.52 0.53
[6,] 0.15 0.17 0.12 0.18 0.56 1.06 0.31 0.32 0.32
[7,] 0.15 0.12 0.18 0.11 0.51 0.31 1.01 0.30 0.29
[8,] 0.14 0.11 0.11 0.11 0.52 0.32 0.30 1.02 0.30
[9,] 0.14 0.11 0.12 0.11 0.53 0.32 0.29 0.30 1.03
Cousin relationships ~0.125
Little inbreeding
"first G" genomic relationship
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.82 0.40 0.43 0.38 0.12 0.04 0.04 0.01 0.10
[2,] 0.40 0.91 0.18 0.24 0.02 0.05 -0.04 -0.04 0.04
[3,] 0.43 0.18 0.88 0.19 0.07 0.00 0.07 -0.02 0.05
[4,] 0.38 0.24 0.19 0.86 0.02 -0.01 -0.02 0.01 0.03
[5,] 0.12 0.02 0.07 0.02 0.73 0.34 0.30 0.31 0.35
[6,] 0.04 0.05 0.00 -0.01 0.34 0.85 0.15 0.14 0.18
[7,] 0.04 -0.04 0.07 -0.02 0.30 0.15 0.80 0.14 0.17
[8,] 0.01 -0.04 -0.02 0.01 0.31 0.14 0.14 0.80 0.17
[9,] 0.10 0.04 0.05 0.03 0.35 0.18 0.17 0.17 0.85
Relationships among cousins are ~0
Less than 1 in the diagonal
Negative coefficients
$$G = \frac{ZZ'}{2\sum_{\text{all SNPs}} p_i(1-p_i)}$$
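A minimal sketch of this computation follows, assuming gene content coded 0/1/2 and allele frequencies observed in the genotyped animals themselves; the function name and the 3 × 4 genotype matrix are invented for the illustration (real applications use tens of thousands of SNPs and dedicated software such as preGSf90).

```python
def genomic_relationship(M):
    """G = ZZ' / (2 * sum p(1-p)), with Z the gene content centered by 2p."""
    n, m = len(M), len(M[0])
    # observed allele frequency at each SNP
    p = [sum(M[i][j] for i in range(n)) / (2.0 * n) for j in range(m)]
    Z = [[M[i][j] - 2.0 * p[j] for j in range(m)] for i in range(n)]
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / scale
             for j in range(n)] for i in range(n)]

M = [[0, 1, 2, 1],     # hypothetical genotypes: 3 animals, 4 SNPs
     [1, 1, 0, 2],
     [2, 0, 1, 1]]
G = genomic_relationship(M)
for row in G:
    print([round(g, 3) for g in row])
```

Because Z is centered with the observed frequencies, every row of G sums to zero, which is why negative coefficients appear.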
Genomic relationship matrix G
$$p(u_2) = N(0, G\sigma^2_u)$$
• Assume that G is computed according to VanRaden 2008, using observed allelic frequencies
• This implies that the average BV of genotyped individuals ($u_2$) is 0
• This is possibly NOT the case if there is SELECTION
An improved G
• Relative to the pedigree base population, the average BV of genotyped individuals ($u_2$) has a value possibly different from 0, say μ:
$$p(u_2 \mid \mu) = N(1\mu, G\sigma^2_u)$$
• μ is random because of finite size (drift):
$$p(\mu) = N(0, \alpha\sigma^2_u)$$
• This gives the same model as before, but substituting G by G* = G + 11'α:
$$p(u_2) = N(0, (G + 11'\alpha)\sigma^2_u) = N(0, G^*\sigma^2_u)$$
How to find the value for α ?
• μ is the average BV of genotyped individuals: $\mu = \frac{1}{n}1'u_2$
• The distribution of μ comes from either pedigree or single-step:
$$p_p(\mu) = N\left(0, \tfrac{1}{n^2}\,1'A_{22}1\,\sigma^2_u\right)$$
$$p_s(\mu) = N\left(0, \tfrac{1}{n^2}\,1'(G + 11'\alpha)1\,\sigma^2_u\right)$$
• Assume traditional BLUP is unbiased.
How to find the value for α ?
• If we equate both variances of μ (from either pedigree or single-step), and as the 1'…1 products are simply summations:
$$Var_p(\mu) = \frac{1}{n^2}\sum_i\sum_j A_{22\,i,j}\;\sigma^2_u$$
$$Var_s(\mu) = \left(\frac{1}{n^2}\sum_i\sum_j G_{i,j} + \alpha\right)\sigma^2_u$$
$$\alpha = \frac{1}{n^2}\left(\sum_i\sum_j A_{22\,i,j} - \sum_i\sum_j G_{i,j}\right)$$
• α is simply the difference between the means of $A_{22}$ and G
What does α mean ?
• α accounts for the fact that $u_2$ are related through pedigree more than G is able to reflect
• This is because we do not know base allelic frequencies to construct the 'correct' G
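The difference between means is straightforward to compute. In the sketch below the two 2 × 2 matrices are just the upper-left corners of the pedigree and genomic matrices printed earlier, used as a toy input; the function names are hypothetical, and the constant is added to every element of G as in G* = G + 11'α.

```python
def mean_of_all_elements(K):
    """Average of all elements of a square matrix (1'K1 / n^2)."""
    n = len(K)
    return sum(sum(row) for row in K) / (n * n)

def alpha_from_means(A22, G):
    """Difference between the means of A22 and G."""
    return mean_of_all_elements(A22) - mean_of_all_elements(G)

def add_constant(G, alpha):
    """G* = G + 11'alpha: the same constant is added to every element."""
    return [[g + alpha for g in row] for row in G]

A22 = [[1.00, 0.51], [0.51, 1.01]]   # toy pedigree relationships
G = [[0.82, 0.40], [0.40, 0.91]]     # toy genomic relationships
alpha = alpha_from_means(A22, G)
G_star = add_constant(G, alpha)
print(alpha, G_star)
```

After the correction the mean of G* matches the mean of A22, i.e. both matrices imply the same variance for the average breeding value.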
What does α mean ?
• From Wright's $F_{ST}$, another interpretation of α is possible...
• The $F_{ST}$ can be defined as the mean relationship between gametes in a recent population with respect to an older base population (Powell et al., 2010):
$$F_{old} = F_{new} + (1 - F_{new})\,F_{ST}$$
• $A_{22}$ involves relationships of genotyped individuals with reference to the base population, and G corresponds to relationships within the current population. Consequently, α is equal to twice $F_{ST}$:
$$G^* = \left(1 - \tfrac{1}{2}\alpha\right)G + 11'\alpha$$
$$F_{ST} = \tfrac{1}{2}\,\text{mean}(A_{22} - G)$$
Which G must we use ?
• 1st moment (mean) of $u_2$: G* = G + 11'α, so that sum(G*) = sum(A22)
• 2nd moment (variance) of $u_2$: G* = [trace(A22)/trace(G)] G, so that AvgD(G*) = AvgD(A22)
• Mean & variance of $u_2$ (assumption of random mating): G* = (1 − ½α)G + 11'α, so that both AvgD(G*) = AvgD(A22) and sum(G*) = sum(A22)
• Mean & variance of $u_2$ (no random mating): G* = aG + 11'b, so that both AvgD(G*) = AvgD(A22) and sum(G*) = sum(A22) (preGSf90; Christensen et al., 2012)
• Ex. real pig population: (1 − 0.5α) = 0.851, a = 0.859
GBLUP
• GBLUP is a « BLUP »
constructed with G so
defined
– Substitute G for A
• As in regular BLUP, we can
include animals with
genotype but without
phenotype
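$$G = \frac{ZZ'}{2\sum p_i(1-p_i)}$$
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W\\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ W'R^{-1}y \end{pmatrix}$$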
GBLUP issue
• Strandén & Christensen (2011) showed that G constructed with « centered » coding is positive semi-definite (has no inverse)
• In dairy cattle, we typically use
$$G = 0.99\,\frac{ZZ'}{2\sum p_i(1-p_i)} + 0.01\,A$$
or something similar
– first term: « pure » genomic relationships
– second term: pedigree relationships
GBLUP issue
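• We can use equations for singular G which don't require inversion (Harville, 1976; Henderson, 1984)
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W\\ \sigma^2_u GW'R^{-1}X & \sigma^2_u GW'R^{-1}W + I \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ \sigma^2_u GW'R^{-1}y \end{pmatrix}$$
or
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}WG\sigma^2_u\\ \sigma^2_u GW'R^{-1}X & \sigma^2_u GW'R^{-1}WG\sigma^2_u + G\sigma^2_u \end{pmatrix}\begin{pmatrix} \hat b\\ \hat g \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ \sigma^2_u GW'R^{-1}y \end{pmatrix}, \qquad \hat u = \sigma^2_u G\hat g$$
(This has been reinvented many times: Misztal et al., 2009; VanKaam, 2012; RKHS: De los Campos et al., 2009; etc)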
GBLUP
• GBLUP gives identical
results to BLUP_SNP if we fit
equivalent variances in both
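BLUP_SNP:
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + I\sigma_a^{-2} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat a \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ Z'R^{-1}y \end{pmatrix}$$
GBLUP, with $G = \frac{ZZ'}{2\sum p_i(1-p_i)}$ and $\sigma^2_u = 2\sum_{\text{all SNPs}} p_i(1-p_i)\,\sigma^2_a$:
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I\\ I'R^{-1}X & I'R^{-1}I + G^{-1}\sigma_u^{-2} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ I'R^{-1}y \end{pmatrix}$$
$$\hat u_{\text{from GBLUP}} = Z\hat a_{\text{from BLUP\_SNP}}$$
W = I because all genotyped animals have a phenotype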
GBLUP
• In BLUP_SNP, (young) animals without phenotype do NOT enter into SNP estimation
– We get their EBV's as $\hat u_{young} = Z_{young}\hat a$
• In GBLUP we have three options which give the same result:
1. Include them in the analysis with no record of their own (as in classical pedigree BLUP)
• EBV's $\hat u_{young}$ are obtained in the solutions
2. Use multivariate normality (selection index stuff):
• $\hat u_{young} = G_{young,old}\,G^{-1}_{old,old}\,\hat u_{old}$
3. Backsolve for SNP effects and then use $\hat u_{young} = Z_{young}\hat a$
GBLUP with more animals than phenotypes
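Let $y = Xb + Wu + e$, $u = \begin{pmatrix} u_{old}\\ u_{young} \end{pmatrix}$; e.g. $W = \begin{pmatrix} I & 0 \end{pmatrix}$
Only the « old » animals have data
Let genotypes be $Z = \begin{pmatrix} Z_{old}\\ Z_{young} \end{pmatrix}$, then $G = \frac{ZZ'}{2\sum pq} = \begin{pmatrix} G_{old} & G_{old,young}\\ G_{young,old} & G_{young} \end{pmatrix}$
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W\\ W'R^{-1}X & W'R^{-1}W + G^{-1}\sigma_u^{-2} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ W'R^{-1}y \end{pmatrix}$$
Gives the same solutions for $\hat u_{old}$ as
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}I\\ I'R^{-1}X & I'R^{-1}I + G_{old}^{-1}\sigma_u^{-2} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u_{old} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ I'R^{-1}y \end{pmatrix}$$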
GBLUP
• We can jump from GBLUP
to BLUP_SNP
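SNP effects from GEBV's (Henderson, 1973; Strandén and Garrick, 2009):
$$\hat a = DZ'G^{-1}\sigma_u^{-2}\,\hat u$$
where DZ' is the covariance SNPs-BVs and $G^{-1}\sigma_u^{-2}$ is (variance of BVs)$^{-1}$; usually
$$\hat a = \frac{Z'G^{-1}\hat u}{2\sum p_i q_i}$$
GEBV's from SNP effects:
$$\hat u = Z\hat a$$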
Multiple trait GBLUP
• Introducing multiple traits is so well known that nobody cared to publish it
• Let genetic and residual covariances be the n × n matrices $G_0$ and $R_0$ (n = number of traits); then multiple trait GBLUP is (e.g. Henderson, 1984; Mrode & Thompson 2005)
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W\\ W'R^{-1}X & W'R^{-1}W + G^{-1} \otimes G_0^{-1} \end{pmatrix}\begin{pmatrix} \hat b\\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y\\ W'R^{-1}y \end{pmatrix}$$
where $R = I \otimes R_0$.
• All models fitted in BLUP fit in GBLUP
GBLUP
Some advantages of GBLUP:
• It fits nicely into existing BLUP software
• …and into existing theory (REML, maternal effects, test-day… multiple traits… Single Step)
• Provides measures of accuracy from the inverse of the LHS
• Accommodates all animals
Drawbacks:
• Can't easily accommodate major genes (unless using weights in the construction of G → see later)
• Computation of G and inversion might be challenging
Caveat
• By defining a genomic relationship matrix, we define a genetic base
– All inference will refer to this genetic base. Quoting Strandén & Christensen http://www.gsejournal.org/content/43/1/25 :
The bad news
« Reliabilities of estimated genomic breeding values calculated using elements of the
inverse of the coefficient matrix depend on the allele coding because different allele
coding methods imply different models » [The same applies for reliabilities computed
from any method fitting SNP effects like BayesA, etc.]
GREML, G-Gibbs…
Use of G to estimate variance components…
It can be done with remlf90, gibbs*f90, AsReml, TM…
The result will refer to an ideal population with whatever allelic frequencies we introduced in the computation of G.
Single Step GBLUP
Why 2-step procedure
• y= µ + Za + e
– y = data
– Z= incidence matrix of marker effects
– a= marker effect
– e=residuals
• Most often, genotyped animals (bulls) do not have data (trait record)
• Further, most animals with phenotype are not genotyped (e.g. cows)
• This limits practical applications
• Need to get pseudo-data for genotyped animals
Pseudo-data
• So we need pseudo-data
• EBV’s
• DYD’s
Pseudo-data
• EBV’s
• The problem with EBV’s is that they already share information among individuals
• e.g., a dam EBV is = own yield + parent average + progeny contribution
• But we are including information of the sire in the cow, yet not all SNPs of the sire are in the cow
Pseudo-data
• Also, EBV’s are correlated
• The correlation depends on the amount of data and distribution across fixed effects and families
• EBVs of two cows are correlated, if they belong to the same herd, even if they are not related
• EBVs of two bulls are correlated if they have daughters in the same herds
• This is not serious in dairy cattle, but might be, e.g., in swine
$$u \mid y \sim N(\hat u, C^{uu})$$
Pseudo-data
• DYD’s avoid part of these problems (VanRaden & Wiggans, 1991)
• DYD = daughter yield deviation
• Record of the daughter, corrected by environmental effects and dam’s EBV
• Thus DYD = 0.5 BV sire + mendelian sampling
• E(DYD)=0.5 BV sire
• YD’s exist for cows
– YD = record –environmental effects
Pseudo-data
Problems of DYD’s / YD’s
• YD’s are not very reliable and are subject to preferential treatment
• DYD’s are not reliable for many species (sheep, swine)
• Hard to define for some species/traits (maternal effects)
• Extremely complex procedure
• Loss of generality
Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, GSE 42:2; Misztal et al., JDS
92:4648; Aguilar et al JDS 93:743)
• Not big loss in assuming normality for SNP effects (Van Raden et al. JDS 92:19; Hayes et al. JDS 92:433)
• G easy to be constructed then
• Can we include G in the relationship matrix?
• If we construct an overall relationship matrix with good properties, then we can just do BLUP with all data and animals
• Things would be simple if we had genomic relationships for everyone (Legarra et al., 2009)
• Things would be simple if we could add genotypes for all animals (Christensen et al., 2010)
Single Step as a missing data problem
• We can see genotype as a missing data problem (Christensen & Lund, 2010)
• Use the prediction and the distribution of the prediction (if not the procedure does not work)
Missing data
Fill-in missing data: data augmentation
• « data augmentation refers to a scheme of augmenting the observed data so as to make it more easy to analyze » (Tanner & Wong, 1987)
– Two flavors: EM and Bayesian (Posterior distributions)
– For instance: pretending (temporarily) that you know the true BV’s simplifies REML → EM-REML, or provides full conditionals for Gibbs
• Augmenting = adding genotypes
Inferring genotypes
• Genotypes in some individuals can be inferred, but only to some extent
• This is feasible for key individuals (ancestors with many progeny genotyped)
• Or by imputing data from parents into an animal genotyped with a SNP chip
• Typically done using « LD » patterns
• Fimpute, findhap, Alpha impute, Beagle, etc
• These methods do not extend well to non-genotyped individuals
Example:
By simulation, they know that…
• Accuracy of prediction of genotype is quite good, but not perfect
• They don’t even try « far » animals
• Need a simpler way for « far » animals
Inferring genotypes
• There is Gengler’s gene content prediction (J. Dairy Sci. 91:1652)
• Linear approximation to the imputation problem
• This method can be applied to any member of a pedigree
Let $A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}$, with 1 = non genotyped and 2 = genotyped
$$\hat z_{\text{non genotyped}} = E(z_{\text{non genotyped}} \mid z_{\text{genotyped}}) = A_{12}A_{22}^{-1}\,(z_{\text{genotyped}} - 2p)$$
$$Var(\hat z_{\text{non genotyped}}) = Var(z_{\text{non genotyped}} \mid z_{\text{genotyped}}) = 2pq\,(A_{11} - A_{12}A_{22}^{-1}A_{21})$$
Inferring genotypes
• Instead of working with individual SNP effects, we will define
– u=Za
– i.e., the genetic value is the sum of SNP effects
– We’re not really interested in a themselves but in u (we know from GBLUP that we can jump from one to the other)
– Moreover, we’re interested in the distribution of u’s, so that we can compute their covariances and put them into the MME
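Christensen & Lund key idea (g: « genotyped », ng: « non genotyped »):
$$u = \begin{pmatrix} u_g \\ u_{ng} \end{pmatrix} = \begin{pmatrix} Z_g \\ Z_{ng} \end{pmatrix} a$$
(breeding values on the left, SNP effects on the right)
Christensen & Lund use $Var(A) = E(Var(A|B)) + Var(E(A|B))$ to consider the prediction of the genotype and its variance, $E(Z_{ng}|Z_g)$ and $Var(Z_{ng}|Z_g)$, both using Gengler's results.
Resulting in:
$$Var(u) = \begin{pmatrix} Z_g \\ \hat{Z}_{ng} \end{pmatrix} Var(a) \begin{pmatrix} Z_g' & \hat{Z}_{ng}' \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & Var(\hat{Z}_{ng})\, Var(a) \end{pmatrix}$$
With the scaling $1/(2\sum p_i q_i)$, the first term re-creates GBLUP…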
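$$Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$
1: « non genotyped » 2: « genotyped »
Covariances of all animals (Christensen & Lund, 2010)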
• Incredibly: $H^{-1}$ is very simple:
$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$
– $A^{-1}$: inverse of the regular pedigree relationship matrix
– $G^{-1}$: correcting for genomic relationships…
– $-A_{22}^{-1}$: …and avoiding « double counting »
Overall modification
• Look at A as a « prior » relationship and at G as an « observed » relationship
– G is observed for some individuals only, whose « a priori » relationship matrix was A22
• Try to construct a « posterior » relationship matrix
Joint distributions
• Unconditional distribution of genetic values of genotyped individuals (after seeing their genotypes!):
$$p(u_2) = N(0, G)$$
• Conditional distribution of non-genotyped individuals (because they have no genotypes, this depends only on pedigree):
$$p(u_1 \mid u_2) = N(A_{12}A_{22}^{-1}u_2,\; A_{11} - A_{12}A_{22}^{-1}A_{21})$$
• Joint distribution:
$$p(u_1, u_2) = p(u_1 \mid u_2)\, p(u_2)$$
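Joint distributions
…for those inclined to algebra ($A^{ij}$ are the blocks of $A^{-1}$):
$$p(u_1, u_2) = p(u_1 \mid u_2)\, p(u_2)$$
$$\propto \exp[-0.5\,(u_1 - A_{12}A_{22}^{-1}u_2)' A^{11} (u_1 - A_{12}A_{22}^{-1}u_2)]\; \exp[-0.5\, u_2' G^{-1} u_2]$$
$$= \exp\left\{-0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & -A^{11}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A^{11} & G^{-1} + A_{22}^{-1}A_{21}A^{11}A_{12}A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}\right\}$$
$$= \exp\left\{-0.5 \begin{pmatrix} u_1' & u_2' \end{pmatrix} \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & G^{-1} + A^{22} - A_{22}^{-1} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}\right\}$$
– First exponential: prediction of non genotyped from genotyped
– Second exponential: « genomic » relationships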
Joint distributions
$$p(u_2) = N(0, G)$$
$$p(u_1 \mid u_2) = N(A_{12}A_{22}^{-1}u_2,\; A_{11} - A_{12}A_{22}^{-1}A_{21})$$
$$Var(u_2) = G$$
$$Var(u_1) = A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21}$$
because Var(Xt) = XVar(t)X'
$$Cov(u_1, u_2) = A_{12}A_{22}^{-1}G$$
because Cov(Xt,t) = XVar(t)
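$$Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$
1: non genotyped, 2: genotyped
Covariances of all animals (Legarra et al., 2009; Aguilar et al., 2010; Christensen & Lund, 2010)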
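Covariances of all animals
$$Var\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = H = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} + A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21} & A_{12}A_{22}^{-1}G \\ GA_{22}^{-1}A_{21} & G \end{pmatrix}$$
• $H_{22} = G$: G comes from genotypes
• $A_{12}A_{22}^{-1}GA_{22}^{-1}A_{21}$: this is the variance of prediction of genotypes from genotyped to non-genotyped
• $A_{11} - A_{12}A_{22}^{-1}A_{21}$: this is the error in the prediction
• $H_{12} = A_{12}A_{22}^{-1}G$: the prediction « generates » a covariance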
Overall modification: example
This is the regular relationship matrix. Assume now that animals 9 to 12 have a genomic relationship of 0.7
Overall modification: example
• These parents are now related
• This animal is now inbred
• G: block of the genotyped animals
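The covariance blocks of H can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the three-animal pedigree (one non-genotyped progeny with two genotyped, unrelated parents), the genomic relationship of 0.7 between the parents, and the helper names `matmul`, `inverse` and `build_h` are all hypothetical and not part of the BLUPF90 programs.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def build_h(a11, a12, a22, g):
    """Blocks of H, with 1 = non-genotyped and 2 = genotyped animals."""
    p = matmul(a12, inverse(a22))  # A12 A22^-1, regression on genotyped animals
    g_minus_a22 = [[x - y for x, y in zip(rg, ra)] for rg, ra in zip(g, a22)]
    # H11 = A11 + A12 A22^-1 (G - A22) A22^-1 A21
    correction = matmul(matmul(p, g_minus_a22), transpose(p))
    h11 = [[x + y for x, y in zip(ra, rc)] for ra, rc in zip(a11, correction)]
    h12 = matmul(p, g)             # A12 A22^-1 G
    return h11, h12, transpose(h12), g


# Hypothetical example: animal 1 (non-genotyped) is the progeny of animals 2 and 3.
a11 = [[1.0]]
a12 = [[0.5, 0.5]]
a22 = [[1.0, 0.0], [0.0, 1.0]]   # parents unrelated according to pedigree
g = [[1.0, 0.7], [0.7, 1.0]]     # parents related according to genotypes
h11, h12, h21, h22 = build_h(a11, a12, a22, g)
print(h11, h12)
```

In this toy case the progeny's diagonal element rises from 1 to about 1.35 and its covariance with each parent from 0.5 to about 0.85, which mirrors the two remarks on the slide: the parents become related and their progeny becomes inbred.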
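Single step GBLUP
$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}W \\ W'R^{-1}X & W'R^{-1}W + H^{-1}\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ W'R^{-1}y \end{pmatrix}$$
$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$
W: incidence matrix of animals on data
A: pedigree relationship matrix
G: genomic relationship matrix. This G could be any matrix describing « genomic » covariances of breeding values; it is not restricted to VanRaden's (2008) GBLUP
A22: pedigree matrix among genotyped individuals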
Single step GBLUP
• So the Single Step GBLUP is like regular BLUP changing one small submatrix !!!
• It is almost too simple to be true…
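A small numerical check of the "one small submatrix" statement, written as a sketch: it builds H^-1 = A^-1 + [0 0; 0 G^-1 - A22^-1] for a hypothetical three-animal pedigree (animal 1 non-genotyped, parents 2 and 3 genotyped with a genomic relationship of 0.7) and inverts it back to recover H. The function names and the example values are assumptions made for illustration only.

```python
def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def h_inverse(a, g, genotyped):
    """A^-1 with (G^-1 - A22^-1) added on the rows/columns of genotyped animals."""
    a22 = [[a[i][j] for j in genotyped] for i in genotyped]
    hinv = inverse(a)
    g_inv, a22_inv = inverse(g), inverse(a22)
    for r, i in enumerate(genotyped):
        for c, j in enumerate(genotyped):
            hinv[i][j] += g_inv[r][c] - a22_inv[r][c]
    return hinv


# Hypothetical pedigree: animal 0 is the progeny of animals 1 and 2 (0-based).
a = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
g = [[1.0, 0.7], [0.7, 1.0]]      # genomic relationships of animals 1 and 2
hinv = h_inverse(a, g, genotyped=[1, 2])
h = inverse(hinv)                 # back to H, only to inspect the result
for row in h:
    print([round(v, 4) for v in row])
```

Only the 2 x 2 block of the genotyped animals in A^-1 is touched, yet the recovered H also changes the elements of the non-genotyped progeny (about 1.35 on the diagonal and 0.85 with each parent).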
Some properties of H
• Always positive semi-definite
• Positive definite & invertible iff G is invertible
• In practice, if G is too different from A22 (wrong pedigree or genotyping), this gives lots of numerical problems
• If everyone is genotyped, Single Step is GBLUP
• If no one is genotyped, Single Step is BLUP
Single Step as an extra random effect
• Legarra & Ducrocq, 2012
• Decompose breeding values into a « classical » part and deviations:
– $Var(u_2) = G$
– $u_2 = u_2^* + d_2$, $Var(u_2^*) = A_{22}$, $Var(d_2) = G - A_{22}$
• Regress, using pedigree, deviations for « ungenotyped » individuals:
– $d_1 = A_{12}A_{22}^{-1}d_2$; $Var(d_1) = A_{12}A_{22}^{-1}(G - A_{22})A_{22}^{-1}A_{21}$
• After quite some algebra, you get to the same results
Alternative derivations
• Why do they all agree?
• Some use genotypes, some use breeding values in the algebra.
• Because breeding values = sum of genotypes, the same rules for variances and covariances apply and the derivation is identical
• So far SSGBLUP is the most serious option for a general method for genomic evaluations in practice
• Large body of practical results in dairy cattle & sheep, poultry, swine
• Besides our group, USA, NZ, DK, Fin, are giving SSGBLUP serious tests
Problems of SSGBLUP
Most of these problems exist for the other methods (BayesA etc.)
• Assumption p(u2)=N(0,G)
– If there is selection, mean is not 0 (« tuning » solves it: see Vitezica later)
• Same genetic variance in genotyped and ungenotyped animals
– solved with « tuning »
• Non normality (i.e. major genes)
– Can be solved using G=ZDZ’ with « weights » for SNP (Legarra et al., 2011; Zhang et al., 2010; Wang et al., Gen. Res. 2012)
– unclear for multiple traits (but also in other methods like BayesB)
• Assumption that « A » is fair. This is false if:
– pedigree is incorrect
– distant relationships are too different from reality (Hill & Weir 2010)
– solution: cut pedigrees that are too long
• Unknown parent groups / several breeds
– Need to modify H to include them (Misztal et al., 2013)
SSGBLUP vs. rest
Criterion | SSGBLUP | GBLUP, BayesA | Non parametric
Deregressions from « regular BLUP » | Not needed | Complex | Complex
Bias due to genomic selection | Immune | Affected | Affected
Bias due to classic selection | Immune | Affected | Affected
Major genes | Complex | OK | OK
Long MCMC | No | Yes (except GBLUP) | No
Matrix inversions | Yes (but work in progress) | No (except GBLUP) | No
Computation of accuracies | Complex (but work in progress) | Easy (provided deregressions were OK) | Undefined
Multiple trait | Straightforward (if no major genes) | Easy for GBLUP (if no major genes), complex otherwise | ???
Get marker effects | Yes (after backsolving) | Yes | Sometimes
Computing stuff
• Computing $G^{-1}$ and $A_{22}^{-1}$ is a challenge.
– perhaps only in dairy cattle?
• But see Ignacio's talk
• Future strategies (Legarra & Ducrocq, JDS 2012)
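• Unsymmetric Single Step
$$\begin{pmatrix}
X'R^{-1}X & X'R^{-1}W_1 & X'R^{-1}W_2 & 0 & 0 \\
W_1'R^{-1}X & W_1'R^{-1}W_1 + A^{11}\sigma_u^{-2} & W_1'R^{-1}W_2 + A^{12}\sigma_u^{-2} & 0 & 0 \\
W_2'R^{-1}X & W_2'R^{-1}W_1 + A^{21}\sigma_u^{-2} & W_2'R^{-1}W_2 + A^{22}\sigma_u^{-2} & -I\sigma_u^{-2} & I\sigma_u^{-2} \\
0 & 0 & -I & A_{22} & 0 \\
0 & 0 & -I & 0 & G
\end{pmatrix}
\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \\ \hat{l} \\ \hat{\gamma} \end{pmatrix}
=
\begin{pmatrix} X'R^{-1}y \\ W_1'R^{-1}y \\ W_2'R^{-1}y \\ 0 \\ 0 \end{pmatrix}$$
The last two rows give $A_{22}\hat{l} = \hat{u}_2$ and $G\hat{\gamma} = \hat{u}_2$, so the third row carries $(G^{-1} - A_{22}^{-1})\sigma_u^{-2}\hat{u}_2$, as in $H^{-1}$.
This can be computed efficiently because we use G, NOT $G^{-1}$, and $A_{22}$, not $A_{22}^{-1}$. We don't even need to construct them.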
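• Iterative Single Step
Solve (9), the pedigree-based mixed model equations with the genomic part moved to the right-hand side:
$$\begin{pmatrix}
X'X & X'W_1 & X'W_2 \\
W_1'X & W_1'W_1 + A^{11}\sigma_u^{-2} & W_1'W_2 + A^{12}\sigma_u^{-2} \\
W_2'X & W_2'W_1 + A^{21}\sigma_u^{-2} & W_2'W_2 + A^{22}\sigma_u^{-2}
\end{pmatrix}
\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}
=
\begin{pmatrix} X'y \\ W_1'y \\ W_2'y \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ (\hat{l} - \hat{\gamma})\sigma_u^{-2} \end{pmatrix} \quad (9)$$
Solve $A_{22}\hat{l} = \hat{u}_2$ and $G\hat{\gamma} = \hat{u}_2$ for $\hat{l}$ and $\hat{\gamma}$
New solutions are a weighted average of the solutions to (9-10), marked *, and the former solutions at the previous iteration, with w<=1:
$$\begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{t} = w \begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{*} + (1 - w) \begin{pmatrix} \hat{b} \\ \hat{u}_1 \\ \hat{u}_2 \end{pmatrix}^{t-1}$$
$$\hat{l}^{t} = w\,\hat{l}^{*} + (1 - w)\,\hat{l}^{t-1} \quad and \quad \hat{\gamma}^{t} = w\,\hat{\gamma}^{*} + (1 - w)\,\hat{\gamma}^{t-1}$$
This can be done efficiently as well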
Results
• (Using simulated data) all strategies arrived at the same solution
• Some can be converted to iteration on data procedures
• Reasonable computing time:
– 2 s « regular » (with G and A22 already inverted)
– 47 s « unsymmetric »
– 286 s « iterative »
[Figure: Convergence. x axis: iteration (0 to 200); y axis: convergence (0 to -12); series: Regular SSGBLUP, Unsymmetric extended SSGBLUP, Iterative SSGBLUP]
More results
• Lacaune dairy sheep
• 5000 individuals (males) genotyped, 1 500 000 animals
• ~3 000 000 equations
[Figure: x axis: iteration (0 to 5000); y axis: log10(convergence) (-5 to -15); series: Regular Single Step, Unsymmetric equations]
• Livestock paper for more details
Forming Single-step mixed model
equation and quality control
Ignacio Aguilar
Instituto Nacional de Investigación Agropecuaria
INIA Las Brujas, Uruguay
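Single-Step to genomic evaluation
• Traditional genetic evaluation
$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha A^{-1} \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$
• Single-step genomic evaluation
$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$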
Multiple-step Genomic Selection
• Records “Y” + Pedigree → BLUP → EBV
• EBV → pseudo observations (de-regressed EBVs)
• Pseudo observations + SNPs → BayesX, GBLUP, etc. → GEBV
• Index: GEBV*w1, PA*w2, SPA*w3
Single-Step Genomic Selection
• Records “Y” + Pedigree + SNPs → BLUP → EBVs
• Unified approach with pedigree, phenotypic
and genomic markers information considered
simultaneously
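• Pedigree-based relationships augmented by genomic relationship matrix (Misztal et al. 2009)
$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \alpha H^{-1} \end{pmatrix} \begin{pmatrix} \hat{\beta} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$
$$H = A + A_{\Delta}$$
$A$ - conventional numerator relationship matrix
$A_{\Delta}$ - matrix modified to account for genomic relationships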
Single step genomic evaluation
• Inverses
– Numerator relationship matrix ($A^{-1}$)
– Pedigree relationships between genotyped animals ($A_{22}^{-1}$)
– Genomic relationships ($G^{-1}$)
$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$
Aguilar et al., 2010
Christensen & Lund, 2010
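Matrix-vector operations in PCG with genomic information
$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix} \begin{pmatrix} b \\ u \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}$$
$$LHS * p = \begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{-1}\alpha \end{pmatrix} \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} X'Xp_1 + X'Zp_2 \\ Z'Xp_1 + Z'Zp_2 \end{pmatrix} + \begin{pmatrix} 0 \\ A^{-1}\alpha\, p_2 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ (G^{-1} - A_{22}^{-1})\alpha\, p_{2g} \end{pmatrix}$$
– First term: contributions due to records
– Second term: contributions due to relationships
– Third term: contributions due to genomics ($p_{2g}$: elements of $p_2$ for genotyped animals)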
Extra matrices required for single-step
• Inverses
– Pedigree relationships between genotyped animals ($A_{22}^{-1}$)
– Genomic relationships ($G^{-1}$)
$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$
OPTIONS – BLUPF90 parameter file
• Genomic programs
– controlled by adding OPTION commands to the parameter file
– OPTION SNP_file marker.geno.clean
– Read 2 files:
• marker.geno.clean
• marker.geno.clean.XrefID
Printout: same heading as other programs
• All options that were entered in the parameter file should be listed here! If not, check that the keywords are correct (upper and lower case)
• Check the number of animals and of individuals with genotypes
Printout: information from genotype file
• The format is detected from the first line, so all genotypes should start in the same column!
• The number of SNP is also determined by the first line!
Output Files
• GimA22i
– Stores the content of inv(G) – inv(A22)
– Only when running preGSf90, not in application programs
• freqdata.count
– Contains the estimated allele frequency before QC
• freqdata.count.after.clean
– Contains allele frequencies as used in calculations, with the removal code
– For removed SNP these will be zero
• Gen_call_rate
– List of animals removed by low call rate
• Gen_conflicts
– Report of animals with Mendelian conflicts
Quality control. By default exclude:
• MAF
– SNP with MAF < 0.05
• Call rate
– SNP with call rate < 0.90
– Individuals with call rate < 0.90
• Monomorphic
– Exclude monomorphic SNP. ONLY when MAF <> 0
• Parent-progeny conflicts (SNP & individuals)
– Exclusion -> opposite homozygous
– For SNP: > 10% of parent-progeny exclusions from the total of pairs evaluated
– For individuals: > 2% of parent-progeny exclusions from the total number of SNP
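The MAF and call-rate filters above can be sketched in a few lines. This is a minimal illustration, not the preGSf90 implementation: the function names, the tiny genotype table and the order in which the checks are applied are assumptions; only the genotype codes (0, 1, 2, and 5 for missing) and the default thresholds come from the slides.

```python
MISSING = 5  # genotype code used for a missing call


def snp_stats(column):
    """Call rate and minor allele frequency of one SNP (list of codes 0/1/2/5)."""
    called = [g for g in column if g != MISSING]
    call_rate = len(called) / len(column)
    freq = sum(called) / (2 * len(called)) if called else 0.0
    return call_rate, min(freq, 1.0 - freq)


def quality_control(genotypes, min_maf=0.05, min_call_snp=0.90, min_call_animal=0.90):
    """Return the indices of SNP and of animals that pass the default thresholds."""
    n_snp = len(genotypes[0])
    kept_snps = []
    for k in range(n_snp):
        call_rate, maf = snp_stats([row[k] for row in genotypes])
        # A monomorphic SNP has MAF = 0 and is dropped by the MAF threshold.
        if call_rate >= min_call_snp and maf >= min_maf:
            kept_snps.append(k)
    kept_animals = [i for i, row in enumerate(genotypes)
                    if sum(g != MISSING for g in row) / n_snp >= min_call_animal]
    return kept_snps, kept_animals


# Hypothetical data: 4 animals (rows) by 4 SNP (columns).
genotypes = [
    [0, 1, 2, 5],
    [1, 1, 2, 5],
    [2, 0, 2, 1],
    [1, 5, 2, 0],
]
kept_snps, kept_animals = quality_control(genotypes)
print(kept_snps, kept_animals)
```

In this toy table only the first SNP and the third animal survive: the second and fourth SNP fail the call rate, the third SNP is monomorphic, and three animals have a call rate of 0.75.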
Control default values
• For MAF
– OPTION minfreq x
• Call rate
– OPTION callrate x
– OPTION callrateAnim x
• Mendelian conflicts
– OPTION exclusion_threshold x
– OPTION exclusion_threshold_snp x
Parent-progeny conflicts
• Presence of these conflicts results in a non-positive-definite H matrix !!!
• Problems in estimation of variance components by REML, programs do not converge, etc.
• Solution:
– Report all conflicts, with counts for each individual as parent or progeny, to trace the conflicts
– Remove progeny genotype
• maybe not the best option
• but results in a positive-definite H matrix !!!
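A parent-progeny conflict is an opposite-homozygous pair (parent 0 and progeny 2, or the reverse). The sketch below counts such pairs for one parent-progeny pair; the function name and the example genotypes are hypothetical, and the 2% limit is the default for individuals quoted in the quality-control slide.

```python
def opposite_homozygotes(parent, progeny, missing=5):
    """Count SNP where parent and progeny are opposite homozygotes (0 vs 2)."""
    compared = conflicts = 0
    for g_parent, g_progeny in zip(parent, progeny):
        if g_parent == missing or g_progeny == missing:
            continue  # a missing call cannot be evaluated
        compared += 1
        if {g_parent, g_progeny} == {0, 2}:
            conflicts += 1
    return conflicts, compared


# Hypothetical genotypes of a parent and its recorded progeny at 6 SNP.
parent = [0, 2, 1, 0, 5, 2]
progeny = [2, 2, 1, 1, 0, 0]
conflicts, compared = opposite_homozygotes(parent, progeny)
conflict_rate = conflicts / compared
flagged = conflict_rate > 0.02  # default limit for individuals in the slides
print(conflicts, compared, flagged)
```

Keeping the counts per individual, both as parent and as progeny, is what makes it possible to trace whether the wrong animal is the parent or the progeny.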
Parent-progeny conflicts
• OPTION verify_parentage x
– 0: no action
– 1: only detect
– 2: detect and search for an alternate parent; no
change to any file. Not yet implemented
– 3: detect and eliminate progenies with conflicts
(default)
SNP map file (optional)
• OPTION chrinfo xxxx
• For some genomic analyses (GWAS) or checks
• Format:
– snp number
• Index number of SNP in the sorted map by chr and position
– chromosome number
– position
• First row corresponds to first column SNP in
genotype file !!!
Other Options
• If OPTION chrinfo is provided, we can exclude selected chromosomes:
– OPTION excludeCHR n1 n2 n3 ...
• or inform which are sex chromosomes:
– OPTION sex_chr n
– Chr > n will be excluded only for the parent-progeny check, but not in calculations
Saving ‘clean’ files
• SNP excluded by QC are set as missing (i.e. code = 5)
• Excluded individuals are treated as unrelated in G and A22
– For individual i: G[i,:] = 0; G[:,i] = 0; G[i,i] = 1; same for A22, so G-A22 will cancel out
• OPTION saveCleanSNPs
• Saves clean genotype data (excluded SNP and individuals removed)
– For example, for a SNP_file gt
– Clean files will be:
• gt_clean
• gt_clean_XrefID
– Removed items will be output in files:
• gt_SNPs_removed
• gt_Animals_removed
Inspection of Diagonal of G
• High diagonal elements from G
– Mislabeled samples, individuals from other populations/lines
– Problems with sample, low call rate
• By default values > 1.6 are excluded from analysis. The threshold can be changed with:
OPTION threshold_diagonal_g x
Simeone et al., 2011 JABG
Potential duplicate samples
• All samples are checked with each other
– x = G(i,j)/sqrt(G(i,i)*G(j,j))
– Values of x > 0.90 are printed in the output
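The duplicate check is the genomic relationship of a pair scaled by the two diagonal elements, so that two copies of the same sample give a value close to 1. A minimal sketch, with a hypothetical 3 x 3 matrix G and an assumed function name:

```python
from math import sqrt


def potential_duplicates(g, threshold=0.90):
    """List the pairs (i, j, x) with x = G(i,j) / sqrt(G(i,i) * G(j,j)) > threshold."""
    pairs = []
    n = len(g)
    for i in range(n):
        for j in range(i + 1, n):
            x = g[i][j] / sqrt(g[i][i] * g[j][j])
            if x > threshold:
                pairs.append((i, j, x))
    return pairs


# Hypothetical G: samples 0 and 1 look like the same animal genotyped twice.
g = [[1.02, 0.99, 0.10],
     [0.99, 0.98, 0.12],
     [0.10, 0.12, 1.00]]
pairs = potential_duplicates(g)
print(pairs)
```

All pairs are compared, so the cost grows with the square of the number of genotyped animals.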
Correlation off-diagonal G vs A
• Compute correlation for all elements of A > 0.02
• Potential problems with matching genotype file and pedigree
file
• For low values (< 0.5) => print a warning !!!!
• For very low values (< 0.3) => program stops !!!
• If you still want to go on …
– OPTION thrStopCorAG -1
Looking for stratification in populations
• OPTION plotpca
– (only preGSf90, not in application programs)
– Plot the first 2 PC
• OPTION extra_info_pca filename col
– File with variables (alphanumeric) to plot PC with different colors for different classes
– Same order as genotype file
Use in application programs
• Use renumf90 for proper renumbering and creation of the cross reference ID and parameter file
• If large number of genotypes
– Precompute inv(G)-inv(A22) (PreGSF90)
– Modify parameter file to read GimA22i
– BLUPF90, REMLF90
• Generally all steps can be in a script file to facilitate running programs
Genome-Wide Association Mapping
Including Phenotypes from Relatives
without Genotypes
Ignacio Aguilar
Instituto Nacional de Investigación Agropecuaria
INIA Las Brujas, Uruguay
Slides from H. Wang (Joy)
Classical GWAS
• Test single marker one at time
• Simple linear regression for the SNP effects
• Other fixed effects can be fit (contemporary group)
• Polygenic breeding value can be used
Genomic Selection
• Considers all genetic associations derived from markers
• Methods (Bayes A, Bayes B, Bayes Lasso) provide solutions to SNP effects
• Then Genome-Wide Association Analysis (GWAS) can be performed
• Accounts for population stratifications and cryptic relatedness
Non-Genotyped individuals with
Phenotype
• Recorded information from non-genotyped individuals cannot easily be incorporated in single marker regression and Bayesian methods
• Information can still be incorporated by accumulating data from relatives, e.g. EBV
• But problems with
– heterogeneity from different sources
– loss of information
– bias
– computational cost with MCMC methods for large numbers of genotypes and markers
Single-step GBLUP
• Integrates all available information
– Phenotypes
– Genotypes
– Pedigree
• Limitation of ssGBLUP, in contrast to BayesX methods
– infinitesimal model, i.e. same variance for all SNP effects
ssGWAS
• Combining methods
– Unequal variances
– Use all available information like in ssGBLUP
• Improve Accuracy of estimation of GEBV
– For breeding and selection
– Precision for estimation of SNP effects for GWAS
SNP variances
• Zhang et al. 2010, presented a method to
estimate weights for SNP variances without
sampling i.e. non MCMC methods
• SNP weights: function of squares of SNP
effects
• Incorporate weights into genomic relationship
matrix
• Similar approach by Sun et al., 2012 PlosOne
BUT both approaches cannot utilize phenotypes of un-genotyped individuals !!!
As a reminder:
• GBLUP ↔ BLUP_SNP:
$$\hat{a}_g = Z\hat{u}$$
($\hat{a}_g$: genetic value of genotyped animals; $\hat{u}$: SNP effect)
• As:
$$G = \frac{ZDZ'\sigma_u^2}{\sigma_a^2} = \frac{ZDZ'}{\sum_{allSNPs} 2p_i(1-p_i)} = ZDZ'\lambda$$
(D: weight matrix)
• And:
$$\sigma_a^2 = \left[2\sum_{allSNPs} p_i(1-p_i)\right]\sigma_u^2$$
SNP weights
• SNP weights derived from SNP effects
• Zhang et al., 2010 PlosOne
• Matrix D, diagonal matrix, with unequal variances for each SNP
• SNP effects from GEBV’s (Henderson, 1973; Strandén and Garrick, 2009):
$$\hat{u} = \frac{\sigma_u^2}{\sigma_a^2} DZ'G^{-1}\hat{a}_g = DZ'[ZDZ']^{-1}\hat{a}_g$$
• Also, for each SNP effect (i-th):
$$\hat{\sigma}_{u,i}^2 = \hat{u}_i^2\, 2p_i(1-p_i)$$
• Note: this is just the variance of SNP effects, NOT the same concept as genetic variance
postGSf90 par files
1) Parameter files:
(1) BLUPF90 (and preGSf90 for S1)
(2) postGSf90
2) OPTIONs:
BLUPf90 / PreGSf90:
OPTION SNP_file marker.geno.clean
OPTION saveGInverse
OPTION weightedG w # A vector with length = M
postGSf90:
OPTION SNP_file marker.geno.clean
OPTION ReadGInverse
OPTION chrinfo mapfile # format: snpID chr pos
OPTION weightedG w
# OPTION which_weight 1
# OPTION SNP_moving_average n
# OPTION Manhattan_plot
3) Document: http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90#gwas_options_postgsf90
Computing algorithm
• Denote t as an iteration number and i as the i-th SNP
1. t=0, $D_{(t)} = I$, $G_{(t)} = ZD_{(t)}Z'\lambda$
2. Compute $\hat{a}_g$ by ssGBLUP
3. Calculate $\hat{u}_{(t)} = \lambda D_{(t)} Z' G_{(t)}^{-1} \hat{a}_g$
4. Calculate $d^*_{i(t+1)} = \hat{u}^2_{i(t)}\, 2p_i(1-p_i)$ for all SNPs (Zhang et al., 2010)
5. Normalize $D_{(t+1)} = \dfrac{tr(D_{(0)})}{tr(D^*_{(t+1)})}\, D^*_{(t+1)}$
6. Calculate $G_{(t+1)} = ZD_{(t+1)}Z'\lambda$
7. t=t+1
8. Exit, or loop to step 2 (S2) or 3 (S1)
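Steps 4 and 5 of the algorithm (new weights from squared SNP effects, then normalisation so that the trace of D stays constant) can be sketched as follows. The function name and the example values are hypothetical; D(0) = I is taken from step 1, so tr(D(0)) equals the number of SNP.

```python
def update_snp_weights(snp_effects, allele_freqs):
    """One round of steps 4-5: d*_i = u_i^2 * 2 p_i (1 - p_i), rescaled to tr(D(0))."""
    raw = [u * u * 2.0 * p * (1.0 - p) for u, p in zip(snp_effects, allele_freqs)]
    # D(0) is the identity, so its trace is the number of SNP.
    scale = len(raw) / sum(raw)
    return [scale * d for d in raw]


# Hypothetical SNP solutions and allele frequencies for 3 SNP.
u_hat = [0.1, -0.2, 0.0]
freqs = [0.5, 0.5, 0.2]
weights = update_snp_weights(u_hat, freqs)
print(weights)
```

The resulting vector is what a file of weights (one value per SNP) would hold before the next round; a SNP with a zero solution receives a zero weight, which is one reason to monitor the weights across iterations.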
Simulated data
1. QMSim
2. Simple model: $y = 1\mu + Z_a a + e$
3. 10 QTLs with 3000 SNP markers on 2 chromosomes
4. N = 15,800; Ng = 1500
5. h2 = 0.5, all due to QTLs (no polygenic effect)
6. 10 replications
Different Scenarios
• Scenario 1
– Run only one BLUP and get GEBV
– Estimate SNP effects from GEBV using weighted genomic matrix
– Multiple trait or random correlated effects
• Scenario 2
– Get EBVs with weighted genomic relationship matrix
– Estimate SNP effects from GEBV using updated solutions
– Single trait analysis: fit one genomic relationship matrix
postGSf90 bash script
• Scenario 1:
# run GBLUP once to get GEBVs:
echo par.b90 | blupf90 | tee log.blupf90
# run x times PreGSf90 – postGSf90 to get SNPeff:
for i in 1 2 3 4 5 6 7 8 … … x
do
  echo par.b90 | preGSf90 | tee log_preGS_$i
  echo postpar.b90 | postGSf90 | tee logpost_$i
  cp snp_sol snp_sol_$i
  # format of snp_sol: tr, eff, snpID, chr, pos, sol, w
  cp chrsnp chrsnp_$i
  cp w w_$i
  # column 7 of snp_sol becomes the weight file for the next round
  awk '{ print $7 }' snp_sol > w
done
• Scenario 2:
for i in 1 2 3 4 5 6 7 8 … … x
do
  echo par.b90 | blupf90 | tee logpre_$i
  cp solutions solutions_$i
  echo postpar.b90 | postGSf90 | tee logpost_$i
  cp snp_sol snp_sol_$i
  cp chrsnp chrsnp_$i
  cp w w_$i
  awk '{ print $7 }' snp_sol > w
done
Methods
1. Single marker model: WOMBAT
2. BayesB using de-regressed proofs : GENSEL
3. ssGBLUP: S1 & S2
Manhattan plot of S1
Manhattan plot of S2
Manhattan plot of BayesB
Manhattan plot of WOMBAT
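Accuracy of (G)EBVs
Method | Case | Accuracy
BLUP | EBVs | 0.81 (0.01)
BLUP | DP | 0.77 (0.01)
ssGBLUP | it1* | 0.87 (0.01)
ssGBLUP | it2 | 0.89 (0.01)
ssGBLUP | it3 | 0.88 (0.01)
ssGBLUP | it4 | 0.88 (0.02)
ssGBLUP | it5 | 0.88 (0.02)
ssGBLUP | it6 | 0.87 (0.02)
ssGBLUP | it7 | 0.87 (0.02)
ssGBLUP | it8 | 0.87 (0.02)
BayesB_DP | NW† | 0.88 (0.02)
BayesB_DP | c=0.1 | 0.88 (0.02)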
Accuracy of SNP effects
Table 3. Average correlations (standard deviations) between QTL effects and sum of cluster of m SNP effects using ssGBLUP
S1* | 1† | 2 | 4 | 8 | 16 | 40
it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08)
it2 | 0.46 (0.07) | 0.66 (0.05) | 0.78 (0.02) | 0.82 (0.02) | 0.81 (0.02) | 0.63 (0.08)
it3 | 0.43 (0.07) | 0.64 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08)
it4 | 0.42 (0.07) | 0.63 (0.05) | 0.77 (0.02) | 0.81 (0.02) | 0.80 (0.02) | 0.62 (0.08)
it5 | 0.41 (0.07) | 0.63 (0.05) | 0.76 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.08)
it6 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07)
it7 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.61 (0.07)
it8 | 0.41 (0.07) | 0.62 (0.05) | 0.75 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.60 (0.07)
S2 | 1 | 2 | 4 | 8 | 16 | 40
it1 | 0.53 (0.07) | 0.68 (0.05) | 0.79 (0.03) | 0.81 (0.02) | 0.80 (0.03) | 0.62 (0.08)
it2 | 0.44 (0.09) | 0.65 (0.06) | 0.77 (0.03) | 0.82 (0.03) | 0.81 (0.02) | 0.63 (0.06)
it3 | 0.41 (0.08) | 0.62 (0.05) | 0.75 (0.03) | 0.79 (0.03) | 0.79 (0.03) | 0.65 (0.06)
it4 | 0.40 (0.07) | 0.61 (0.05) | 0.73 (0.03) | 0.77 (0.03) | 0.78 (0.03) | 0.64 (0.06)
it5 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.76 (0.04) | 0.77 (0.04) | 0.64 (0.06)
it6 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06)
it7 | 0.40 (0.07) | 0.60 (0.05) | 0.72 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06)
it8 | 0.40 (0.07) | 0.60 (0.05) | 0.71 (0.04) | 0.75 (0.04) | 0.76 (0.04) | 0.63 (0.06)
* S1: update weights for SNP effects but not for GEBVs; S2: update weights for both GEBVs and SNP effects in each iteration.
† Number of SNPs (i.e. m ranges from 1 to 40) in each cluster.
Variances Explained by segments
• Several ISU works propose to present results from GWAS using variance explained by windows of adjacent SNP
• Fan et al. 2011, Onteru et al. 2011, Peters et al. 2012, etc.
• Potentially use of bootstrap to get significance of detected QTL
Windows Variances
a = Zu for only SNP in segment
a = EBV derived from segment
Get sample variance Var(a)
from genotyped individuals
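The window computation can be sketched as follows: the breeding value attributable to a segment is a = Zu restricted to the SNP of that segment, and its sample variance is taken over the genotyped individuals. The centered genotypes, SNP solutions and function name below are hypothetical, and expressing the result as a share of the variance of the whole-genome values is an assumption added here for readability, not something stated on the slide.

```python
from statistics import variance


def window_breeding_values(z, snp_effects, start, stop):
    """a = Z u using only the SNP in columns start..stop-1 (one value per animal)."""
    return [sum(row[k] * snp_effects[k] for k in range(start, stop)) for row in z]


# Hypothetical centered genotypes (3 genotyped animals by 4 SNP) and SNP solutions.
z = [[1.0, 0.0, -1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [-1.0, -1.0, 1.0, 0.0]]
u_hat = [0.5, 0.0, 0.2, 0.1]

a_window = window_breeding_values(z, u_hat, 0, 2)          # segment of SNP 1 and 2
a_total = window_breeding_values(z, u_hat, 0, len(u_hat))  # all SNP
var_window = variance(a_window)                            # sample variance (n - 1)
share = var_window / variance(a_total)
print(var_window, share)
```

Because neighbouring windows are correlated through linkage disequilibrium, the shares of all windows do not have to add up to one.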
POSTGSF90 Options
Output files from POSTGSF90
QTL-MAS workshop 2010
• $G = ZDZ'/k$, $a = DZ'(ZDZ')^{-1}u$
• $G = ZZ'/k$, $a = Z'(ZZ')^{-1}u$
• cor(ebv,tbv)=0.68 cor(ebv,tbv)=0.70
Single-Step GWAS Conception Rate
• Multiple-Trait US Holsteins Service records
from AI
• ~ 5 million records, ~ 2.5 million pedigrees
• ~ 5,600 genotyped bulls
• Computing time
– Complete evaluation 2 h
– Estimates of SNPs 2 min
Single-Step GWAS Heat Stress
• Multiple-Trait Test-Day model heat tolerance
• ~ 90 million records, ~ 9 million pedigrees
• ~ 3,800 genotyped bulls
• Computing time
– Complete evaluation ~ 16 h
Milk yield no Heat stress Heat stress
27/05/2014
Creation and handling of genomic
relationship matrices with preGSf90
Ignacio Aguilar Instituto Nacional de Investigación Agropecuaria
INIA Las Brujas, Uruguay
Genomic Relationship Matrix - G
• G = ZZ'/k
– Z = matrix for SNP marker
– Dimension Z= n*p
– n animals,
– p markers
Data file with SNP marker
HOWTO: Creation of Genomic Matrix
• Read SNP marker information => M
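• Get 'means' to center
– Calculate allele frequency from observed genotypes (pi)
– pi = sum(SNPcodei)/2n
• Matrix for centering W(3,p), one row per genotype code (0, 1, 2):
$$W = \begin{pmatrix} 0 - 2p_1 & 0 - 2p_2 & \cdots \\ 1 - 2p_1 & 1 - 2p_2 & \cdots \\ 2 - 2p_1 & 2 - 2p_2 & \cdots \end{pmatrix}$$
• Center matrix Z = W(M), i.e. each genotype code in M is replaced by the entry of W for that code and SNP, for example
$$M = \begin{pmatrix} 2 & 1 & 2 & \cdots \\ 0 & 1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$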
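The HOWTO above can be sketched end to end on a tiny example: allele frequencies from the observed codes, the 3 x p lookup table W, the centered matrix Z = W(M), and G = ZZ'/k with k = 2 sum(p(1-p)) as in the default option described later. The function name and the genotype matrix are hypothetical, and missing genotypes (code 5) are not handled in this sketch.

```python
def genomic_matrix(m):
    """G = ZZ'/k from a matrix of genotype codes (animals in rows, SNP in columns)."""
    n, n_snp = len(m), len(m[0])
    # Allele frequencies from observed genotypes: p_k = sum(codes) / 2n
    p = [sum(row[k] for row in m) / (2.0 * n) for k in range(n_snp)]
    # W(3, p): centered value for each genotype code (rows 0, 1, 2) and each SNP
    w = [[code - 2.0 * p[k] for k in range(n_snp)] for code in (0, 1, 2)]
    # Z = W(M): look up the centered value of every observed code
    z = [[w[row[k]][k] for k in range(n_snp)] for row in m]
    scale = 2.0 * sum(pk * (1.0 - pk) for pk in p)
    g = [[sum(zi[k] * zj[k] for k in range(n_snp)) / scale for zj in z] for zi in z]
    return g, p


# Hypothetical genotypes: 3 animals by 3 SNP, codes 0/1/2.
m = [[2, 1, 2],
     [0, 1, 0],
     [1, 0, 1]]
g, p = genomic_matrix(m)
for row in g:
    print([round(v, 3) for v in row])
```

Because Z is centered with frequencies computed from the same animals, every row of G sums to zero, which is one reason why the raw matrix has to be blended before it can be inverted.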
Creation of Genomic
• Issues
– Large number of genotyped individuals
– Large number of SNP markers
– Matrix multiplication ~ cost n^2 * p
Large amount of data put in (cache) memory for doing 'matmul' for each pair of animals, and indirect memory access (center)
Matrix multiplication
• Matrix multiplication
– Several methods
• Intrinsic matmul (good for small examples !!!)
• "do-loops"
• Packages (BLAS, LAPACK)
– Non-optimized
– Optimized (ATLAS, MKL, etc.)
– Several Compilers
• Perform automatic optimization
– Vectorize loops
– Detect permuted loops
• Can use OpenMP directives for parallelization
Memory Hierarchy
• CPU #1 and CPU #2, each with its own cache memory (256 kB – 16 MB): fast
• Main memory (1 GB – 128 GB): slow
Alternative codes to create G matrix

Original
  ! Z is the 3 x p lookup table of centered values, M holds the genotype codes
  do i=1,n
    do j=i,n
      S=0
      do k=1,p
        S=S+Z(M(i,k),k)*Z(M(j,k),k)
      end do
      G(i,j)=S/sqrt(d(i)*d(j))
      G(j,i)=G(i,j)
    end do
  end do

Optimize Indirect Memory Access - OPTM
  ! Expand the lookup once, so the inner loop reads X directly
  do k=1,p
    X(:,k)=Z(M(:,k),k)
  end do
  do i=1,n
    do j=i,n
      S=0
      do k=1,p
        S=S+X(i,k)*X(j,k)
      end do
      G(i,j)=S/sqrt(d(i)*d(j))
      G(j,i)=G(i,j)
    end do
  end do

Optimize Memory and Loops - OPTML
  do k=1,p
    X(:,k)=Z(M(:,k),k)
  end do
  G=0   ! the accumulator must start at zero
  do i=1,n
    do j=1,n
      do k=1,p
        G(i,j)=G(i,j)+X(i,k)*X(j,k)
      end do
    end do
  end do
  ! Scaling is done in a separate pass
  do i=1,n
    do j=1,n
      G(i,j)=G(i,j)/sqrt(d(i)*d(j))
    end do
  end do

Gmatrix.f90 (VanRaden, 2009)
CPU time for alternative codes for G matrix and machines
Testing: 6500 genotyped animals, 40k SNPs
Processor | Cache | Original | OPTM | OPTML
Xeon 3.5 GHz | 6 MB | 24 m | 26 m | 7 m
Opteron 3.0 GHz | 1 MB | 265 m | 59 m | 17 m

CPU time (m) with alternative codes and compilers
Testing: 6500 genotyped animals, 40k SNPs; Opteron 3.02 GHz, 1 MB cache memory
Compiler | Original | OPTM | OPTML
Intel | 265 | 59 | 17
Absoft | 241 | 60 | 34
gfortran | 213 | 63 | >1 day
PreGSf90 program
• From BLUPF90 package
• Uses a genomic module
• Creation and handling of genomic relationship
matrices and relationship based on pedigree
• Different methods to optimize calculations
using parallel processing
Input files
• Same parameter file as for all BLUPf90 programs
– But with "OPTION SNP_file xxxx"
– indicates to run genomic subroutines
• Pedigree file
• Marker information (SNP file)
• Cross Reference file for renumbered ID
– Links genotype files with codes in pedigree, etc.
SNP map file (optional)
• For some genomic analyses or checks
• Format:
– snp number
• Index number of SNP in the sorted map
– chromosome number
– position
• First row corresponds to first column SNP in genotype file !!!
OPTIONS – BLUPF90 parameter file
• PreGSF90
– controlled by adding OPTION commands to the parameter file
– OPTION SNP_file marker.geno.clean
– Reads 2 files:
• marker.geno.clean
• marker.geno.clean.XrefID
RENUMF90
• Add keyword to the "animal effect": SNP_FILE marker_geno_clean
• Renumbering tool that prepares:
– data
– pedigree
– genotypes
– parameter files for BLUPF90 programs including PREGSF90
• Check wiki:
• http://nce.ads.uga.edu/wiki/doku.php
Parameters file
RENUMF90
renum.par
BLUPF90
renf90.par
Pedigree file from RENUMF90
• 1 - animal number
• 2 - parent 1 number or UPG
• 3 - parent 2 number or UPG
• 4 - 3 minus number of known parents
• 5 - known or estimated year of birth
• 6 - number of known parents;
if animal is genotyped 10 + number of known parents
• 7 - number of records
• 8 - number of progenies as parent 1
• 9 - number of progenies as parent 2
• 10 - original animal ID
SNP file & Cross Reference Id
SNP File
Cross Reference ID
First col: Identification, could be alphanumeric
Second col: SNP markers {codes: 0,1,2 and 5 for missing}
Pedigree File (from RENUMF90)
Original ID
Renumber ID
Genomic Matrix default options
• G* = ZZ'/k as in VanRaden, 2008
• With:
– Z centered using allele frequencies estimated from the genotyped individuals
– k = 2 sum(p * (1-p))
• G = 0.95*G* + 0.05*A (to invert)
• Tuning of G (see Z. Vitezica work)
– Adjust G to have mean of diagonals and off-diagonals equal to A
Genomic Matrix Options
• OPTION whichG x
– 1: G=ZZ'/k (default) (VanRaden, 2008)
– 2: G=ZDZ'/n; D=1/2p(1-p) (Amin et al., 2007; Leutenegger et al., 2003)
– 3: As 2 with modification UAR (Yang et al., 2010)
– Euclidean distance matrix, not fully implemented yet
• OPTION weightedG file
– Read weights to create G=ZDZ'
– Weighting Z*= Z sqrt(D) => G = Z*Z*' = ZDZ'
• OPTION whichScale x
– 1: 2Σ(p(1-p)) (default) (VanRaden, 2008)
– 2: trace(ZZ')/n (Legarra 2009, Hayes 2009, Forni et al 2011)
– 3: correction (Gianola et al., 2009)
Genomic Matrix Options
• OPTION whichfreq x
– 0: read from file freqdata or other specified
– 1: 0.5
– 2: current calculated from genotypes (default)
• OPTION FreqFile file
– Reads allele frequencies from a file
• OPTION maxsnps x
– Set the maximum length of string for reading marker data from file => BovineHD chip
Options for Blending G and A
• OPTION AlphaBeta alpha beta
– G = alpha*Gr + beta*A
• OPTION tunedG
– 0: no adjustment
– 1: mean(diag(G))=1, mean(offdiag(G))=0
– 2: mean(diag(G))=mean(diag(A)), mean(offdiag(G))=mean(offdiag(A)) (default)
– 3: mean(G)=mean(A)
– 4: Use Fst adjustment. Powell et al. (2010) & Vitezica et al. (2011)
Creation of 'raw' genomic matrix
• Tricks:
• Use dummy pedigree
1 0 0
2 0 0
…
• Change blending parameters
– OPTION AlphaBeta 0.99 0.01
• No adjustment for compatibility with A
– OPTION tunedG 0
G = 0.99*G + 0.01*I
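Why a little blending is needed "to invert" can be seen on a two-animal example. In the sketch below the raw G of two animals with identical genotypes is singular, and blending with the identity (the A of a dummy pedigree) makes it positive definite. The function names and values are hypothetical; only the rule G = alpha*G + beta*A comes from the slides.

```python
def blend(g_raw, a22, alpha=0.95, beta=0.05):
    """G = alpha * G_raw + beta * A22, element by element."""
    return [[alpha * x + beta * y for x, y in zip(rg, ra)] for rg, ra in zip(g_raw, a22)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical raw G of two animals with identical genotypes: singular.
g_raw = [[1.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # A of a dummy pedigree (unrelated animals)
g_blend = blend(g_raw, identity, alpha=0.99, beta=0.01)
print(det2(g_raw), det2(g_blend))
```

The off-diagonal element shrinks from 1 to 0.99 while the diagonal stays at 1, so the determinant becomes positive and the matrix can be inverted.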
Storing and Reading Matrices
• PreGSF90:
– Facilitates the implementation of single-step
– Matrix A is replaced by H with:
$$H^{-1} = A^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$
– Default output is the matrix GimA22i, to be included in application programs (BLUPF90, REMLF90..)
• BUT: intermediate matrices could be stored for examination, use in application programs, etc.
• Matrices that can be stored:
– A22, inv(A22), G, inv(G), GmA22, inv(GmA22), inv(H)
• All matrices are stored in the same format:
– upper triangle
– By default in binary format
– But to store in text (ASCII) format use: OPTION saveAscii
• Values
– i j val
– i & j refer to the row number in the genotype file !!!!!
– Renumbered ID can be obtained from the XrefID file
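A stored matrix in text format can be read back into a full symmetric matrix with a few lines. This sketch assumes the "i j val" layout described above, with 1-based indices and only the upper triangle present; the function name and the sample values are hypothetical, and binary files (the default) are not covered.

```python
import io


def read_upper_triangle(lines, n):
    """Fill a full n x n symmetric matrix from 'i j val' records (1-based indices)."""
    full = [[0.0] * n for _ in range(n)]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        i_txt, j_txt, val_txt = line.split()
        i, j, val = int(i_txt) - 1, int(j_txt) - 1, float(val_txt)
        full[i][j] = val
        full[j][i] = val  # mirror the upper triangle
    return full


# Hypothetical content of a small text file saved with OPTION saveAscii.
sample = io.StringIO("1 1 1.02\n1 2 0.48\n2 2 0.97\n")
g = read_upper_triangle(sample, n=2)
print(g)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and the row numbers would be mapped to renumbered IDs through the XrefID file.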
Storing and Reading Matrices
To save our 'raw' genomic matrix:
• OPTION saveG [all]
– If the optional all is present, all intermediate G matrices will be saved!!!
or its inverse
• OPTION saveGInverse
– Only the final matrix G, after blending, scaling, etc., is inverted !!!
• Look in wiki for keywords for other matrices
Storing with Original IDs
• Some matrices could be stored in text files with the original IDs extracted from renaddxx.ped created by the RENUMF90 program (col #10)
• For example:
– OPTION saveGOrig
– OPTION saveDiagGOrig
– OPTION saveHinvOrig
• Values
– origID_i, origID_j, val
OUTPUT
• Only GimA22i, other requested matrix files, and some reports are stored.
• The main log is printed to the screen !!!
• Use redirection '>'
• or better the command tee to save it in a log file.
• This allows you to both save and see the messages from the program
• echo renf90.par | preGSf90 | tee pregs.log
Looking stored matrices
• Avoid opening them with text editors: huge files !!!
• For example:
• 1500 genotyped individuals => 1,125,750 rows
• Inspection could be done by Unix commands:
– head G => first 10 lines
– tail G => last 10 lines
– less G => scroll document by line/page
– wc -l G => count number of lines; good for checking against the number of genotypes (n): rows = n*(n+1)/2
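The line-count check is simple arithmetic: an upper triangle including the diagonal has n(n+1)/2 elements. A small sketch with hypothetical helper names, which also goes the other way and recovers n from a line count such as the one reported by `wc -l`:

```python
from math import isqrt


def expected_rows(n):
    """Number of stored elements in the upper triangle of an n x n matrix."""
    return n * (n + 1) // 2


def genotypes_from_rows(rows):
    """Invert rows = n(n+1)/2; fail if the count cannot come from a full triangle."""
    n = (isqrt(8 * rows + 1) - 1) // 2
    if expected_rows(n) != rows:
        raise ValueError("line count does not match a complete upper triangle")
    return n


print(expected_rows(1500))
print(genotypes_from_rows(1125750))
```

A line count that does not correspond to any whole n is a quick hint that the file is truncated or that some animals were dropped.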
head G
GBLUP, GREML, GGIBBS
• Using BLUPF90 programs to perform genomic
selection with a genomic relationship matrix
• Using only phenotypes or pseudo-phenotypes
(DYD, DP, EBV) of genotyped individuals only
Two ways: user_file
• By user-defined files for covariances of random effects
• Look at Tricks in the wiki for more details: http://nce.ads.uga.edu/wiki/doku.php
• Special type of random effect in the BLUPF90 parameter file
• Gi created by PreGSF90 can be used here!
By 'fake' single-step GBLUP
• Same trick as before:
– Dummy pedigree with the number of individuals equal
to the number of individuals with genotypes
– A little blending with A (identity matrix) to create
the inverse (OPTION AlphaBeta 0.99 0.01)
– No adjustment for means of A (OPTION tunedG 0)
– The parameter file includes:
• Random effect defined as add_animal
• OPTION SNP_file xxxx
By 'fake' single-step GBLUP
• Runs can be done either by:
– Several steps
• 1. Run preGSf90 and store G inverse
• 2. Modify the parameter file for BLUP,
adding OPTION readGimA22i
• 3. Run BLUPF90
– 'One-Step'
• 1. Run BLUPF90 or REMLF90
Workflow: RENUMF90 (ren.par) => BLUPF90 (renf90.par)
PreGSf90 inside BLUPF90?
• Almost all programs in the package support creation of genomic relationship matrices, Hinv, etc.
• OPTION SNP_file xxxx
• Why preGSF90?
– Same genomic relationship matrix for several models,
traits, etc. Just compute it once and store it.
– Uses optimized subroutines for efficient matrix multiplication and inversion, with support for parallel processing
Matrix multiplication subroutines
• Optimized memory and loops (compiler optimization)
• dgemm subroutine from BLAS
• Optimized dgemm (ATLAS or MKL libraries*)
– Serial
– Parallel (automatic use of OpenMP)
* Intel Fortran Compiler
Matrix multiplication using 40k SNPs
[Figure: CPU time (s, log10 scale, 1 to 100,000) against number of animals (0 to 35,000). Series: BLAS dgemm, OPTML, Optimized dgemm. Annotated times: ~6.4 h and ~3.8 h.]
Speedup for matrix multiplications
[Figure: speedup (1 to 4.5) against number of animals (0 to 35,000). Series: 2 threads, 3 threads, 4 threads.]
Speedup = time using one thread / time using n threads
Efficient methods to construct genomic relationship matrices

Genomic relationship matrix: elapsed time for different numbers of individuals

Number of genotypes   Creation   Inversion
10k                   0.6 min    0.1 min
30k                   5.4 min    3 min
50k                   15 min     14 min
70k                   30 min     36.4 min
100k                  60 min     106 min

Hardware: BLADE INIALB, 24 CPUs
Creating a subset of the relationship matrix (A22)
• Create a relationship matrix for genotyped animals only (~ thousands)
• Full pedigree (~ millions)
• Trace only the ancestors of genotyped animals (reduced, but
still a large number for the A matrix)
Relationship Matrix of Genotyped Animals
• Colleau's algorithm to create A22
• No need to have an explicit A matrix
• The method uses "matrix-vector" multiplication with a decomposition of the A matrix
$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$
Example: A times a vector

Pedigree
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    2    0    0
[3,]    3    1    2

Matrix P
     [,1] [,2] [,3]
[1,]  0.0  0.0  0.0
[2,]  0.0  0.0  0.0
[3,]  0.5  0.5  0.0

Matrix (I-P)^-1
     [,1] [,2] [,3]
[1,]  1.0
[2,]  0.0  1.0
[3,]  0.5  0.5  1.0

Matrix (I-P)^-1'
     [,1] [,2] [,3]
[1,]  1.0  0.0  0.5
[2,]       1.0  0.5
[3,]            1.0

Matrix D (diagonal)
     [,1] [,2] [,3]
[1,]  1.0
[2,]       1.0
[3,]            0.5

Vector r
     [,1]
[1,]   10
[2,]   20
[3,]   30

First step: q = (I-P)^-1' r

Vector q
     [,1]
[1,]   25
[2,]   35
[3,]   30

$$v = A r = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r$$
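The three-animal example above can be carried through all three products in plain Python. This is only a sketch: the matrices are copied from the example, the helper names are hypothetical, and the intermediate name w (for D q) is introduced here for illustration.

```python
# Matrices copied from the three-animal example (animal 3 has parents 1 and 2)
T = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.5, 0.5, 1.0]]      # (I - P)^-1
D = [1.0, 1.0, 0.5]        # diagonal elements of D
r = [10.0, 20.0, 30.0]


def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def a_times_vector(t_inv, d_diag, vec):
    """v = (I-P)^-1 D (I-P)^-1' vec, computed right to left."""
    q = matvec(transpose(t_inv), vec)           # q = (I-P)^-1' r
    w = [d * q_i for d, q_i in zip(d_diag, q)]  # w = D q
    v = matvec(t_inv, w)                        # v = (I-P)^-1 w
    return q, w, v


q, w, v = a_times_vector(T, D, r)
print(q)  # [25.0, 35.0, 30.0], the vector q of the example
print(w)  # [25.0, 35.0, 15.0]
print(v)  # [25.0, 35.0, 45.0], equal to A r
```

Working right to left means only matrix-vector products are needed, so the product A r is obtained without ever forming A.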
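• For each genotyped animal i in A22:
– r_2 is a vector with 1 in the position of animal i and 0 elsewhere
– A times r_2 gives column i of A; its elements for the genotyped animals form row A22(i,.)
$$v = A r_2 = (I - P)^{-1} D \left[(I - P)^{-1}\right]' r_2$$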
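A minimal sketch of how A22 could be built one row at a time with pedigree-based matrix-vector products, in the spirit of Colleau's algorithm. Assumptions (not from the slides): animals are numbered 1..n with parents before progeny, 0 means unknown parent, and inbreeding is ignored, so the diagonal of D is 1, 0.75 or 0.5 for zero, one or two known parents. Function names are hypothetical.

```python
def a_times_vector(pedigree, r):
    """Return A r without forming A.

    pedigree[i-1] = (sire, dam) of animal i; 0 = unknown parent.
    Assumes parents precede progeny and no inbreeding.
    """
    n = len(pedigree)
    q = list(r)
    # q = (I-P)^-1' r: youngest to oldest, pass half of q to each parent
    for i in range(n - 1, -1, -1):
        for parent in pedigree[i]:
            if parent > 0:
                q[parent - 1] += 0.5 * q[i]
    # w = D q: Mendelian sampling variance, ignoring inbreeding
    w = []
    for (sire, dam), q_i in zip(pedigree, q):
        known = (sire > 0) + (dam > 0)
        w.append((1.0 - 0.25 * known) * q_i)
    # v = (I-P)^-1 w: oldest to youngest, add half of each parent's value
    v = [0.0] * n
    for i in range(n):
        v[i] = w[i] + sum(0.5 * v[p - 1] for p in pedigree[i] if p > 0)
    return v


def build_a22(pedigree, genotyped):
    """One A-times-vector product per genotyped animal gives one row of A22."""
    n = len(pedigree)
    a22 = []
    for g in genotyped:
        r = [0.0] * n
        r[g - 1] = 1.0                      # 1 in the position of animal g
        v = a_times_vector(pedigree, r)     # column g of A
        a22.append([v[h - 1] for h in genotyped])
    return a22


if __name__ == "__main__":
    pedigree = [(0, 0), (0, 0), (1, 2)]     # animal 3 has parents 1 and 2
    print(build_a22(pedigree, [1, 3]))      # [[1.0, 0.5], [0.5, 1.0]]
```

Only vectors of length n are held in memory at any time, which is why this kind of approach needs far less memory than tabulating the full A matrix.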
Relationship Matrix of Genotyped Animals
Tabular method vs. Colleau algorithm

            Tabular*   Colleau method
CPU time    311 s      45 s
Memory      12.1 GB    322 MB

Testing: 6,500 genotyped Holsteins; 57,000 pedigrees
* Gmatrix.f90 (VanRaden, 2009)