Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...
Transcript of Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...
Marker Based Infinitesimal Model for
Quantitative Trait Analysis
Shizhong Xu
Department of Botany and Plant Sciences
University of California
Riverside, CA 92521
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Quantitative Trait
Quantitative Genetics Model
Phenotype = Genotype + Environment
Infinitesimal Model
• Infinite number of genes
• Infinitely small effect of each gene
• Effect of an individual gene is not recognizable
• Collective effect of all genes are studied using
pedigree information (genetic relationship)
• Best linear unbiased prediction (BLUP)
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Marker Based Infinitesimal Model
1
1
0
( ) ( )
p
j jk k jk
j jk k jk
L
j j j
y Z
y Z
y Z d
Different from Longitudinal Data Analysis
0
( ) ( )
( ) ( ) ;
L
j j j
j j
y Z d
y t t t
Numerical Integration
Bin Effect Model
1
0
1
1
( ) ( )
( ) ( )
j jk k jk
L
j j j
m
j j k k k jk
m
j jk k jk
y Z
y Z d
y Z
y Z
Bin Effects
Dense markers
Bin Bin
1
1( )
kp
jk jhk
Z Z hp
Recombination Breakpoint Data
1
0
1Marker: ( )
1Breakpoint: ( )
k
k
p
jk jhk
jk j
k
Z Z hp
Z Z d
0
1( )
k
jk j
k
Z Z d
8 21 0 0.8
10 10jkZ
What Does a Bin Effect Represent?
1
0
0 0
1( )
1( ) ( )
size of bin k
uniform variable
k
k k
m
j jk k jk
jk j
k
k k
k
k
y Z
Z Z d
d d
Assumptions of the Infinitesimal Model
• High linkage disequilibrium within a bin
• Homogeneous genetic effect within a bin
High Linkage Disequilibrium
0
20
1( )
number of crossovers, inversely
related to linkage disequilibrium
1lim var( ) , high linkage disequilibrium (F )
2
lim var( ) 0, low linkage disequilibrium
Larger v
k
k
k
jk j
k
k
jk
jk
Z Z d
Z
Z
ar( ) means higher powerjkZ
Range of Var(Z)
22
22
20 0 0
2
2 1 1 1lim var( ) lim lim
4 2 2
2 1 1lim var( ) lim lim 0
4 2
k
k
k k
k
k
k k k
k
kjk
k
jk
k
k
eZ e
eZ e
0 var( ) 0.5
0
choose var( ) as close to 0.5 as possible
but with the number of bins small enough to be
handled by a program for a given sample size
jk
k
jk
Z
Z
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Adaptive Model Relaxes the Two
Assumptions
• High linkage disequilibrium within a bin
- prevent var(Z) from being zero
• Homogeneous genetic effect within a bin
- make all effects positive
Redefine the Bin Size by the Number
of Markers Within a Bin
1
1
1
1( )
( )
number of markers in bin k
k
k
m
j jk k jk
p
jk jhk
p
k k kh
k
y Z
Z Z hp
p h
p
Weighted Average Effect of a Bin
1 1
* * 1
1 1
1Unweighted: ( ); ( )
1Weighted: ( ) ( ); ( ) ( )
k k
k k
p p
jk j k k kh hk
p p
jk j kh hk
Z Z h p hp
Z w h Z h w h hp
* *
1
m
j jk k jk
y Z
Weight System
1
1
1
1 ˆ ˆDefine | | = mean(| |)
ˆwhere is the least squares estimate
of marker within bin
The weight for marker is defined as
ˆ ˆˆ
ˆmean(| |)ˆ| |
k
k
p
k hhk
h
k h hh k h p
hh
c b bp
b
h k
h
p b bw c b
bb
Weighted Var(Z*) > 0
* * * *
21
2
21
2
21
1var( ) var ( ) 2 cov ( ), ( )
1 1 1 2 (1 2 )
2 2
1 1 , when no linkage disequilibrium (1 2 )
2
k k
k k
k
p p
jk j j jh l hk
p p
h h l hlh l hk
p
h hlhk
Z Z h Z h Z lp
w w wp
wp
0
0
Homogenization of Marker
Effects Within Bin
* 1
1 1 1
*
1
( ) ˆ( ) | |ˆ
( )where (a constant)
ˆ
ˆ ˆ| | 0 as long as one 0
k k k
k
p p p
k h k k k hh h hh
h
p
k h hh
hw h c c p b
b
h
b
b b
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Measurement of Prediction (Cross Validation)
2
1
2
1
2
1ˆ( ) , Mean Squared Error
1( ) , Phenotypic Variance
, Squared Correlation
n
j jj
n
j jj
MSE y yn
MSY y yn
MSY MSER
MSY
Simulation Experiment
• Genome size = 2,500 cM
• Number of markers = 120,000
• Marker interval = 0.02 cM
• Cross validation (MSE)
• Design I = 20 QTL
• Design II = Clustered polygenic model
• Design III = Polygenic model
• Design IV = Design I with 2,500 x100 cM
True QTL Effect
0 500 1000 1500 2000 2500
-6-4
-20
24
68
Pa
ram
ete
r se
ttin
g
true values
Eff
ect
Position (cM)
Estimated Bin Effects0 500 1000 1500 2000 2500
Estim
ate
(a) Δ = 1cM m = 2400
p = 50
-6-2
26
Eff
ect
0 500 1000 1500 2000 2500
Estim
ate
(b) Δ = 2cM m = 1200
p = 100
-6-2
26
Eff
ect
0 500 1000 1500 2000 2500
Estim
ate
(c) Δ = 5cM m = 480
p = 250
-6-2
26
Eff
ect
0 500 1000 1500 2000 2500
Estim
ate
(d) Δ = 10cM
m = 240 p = 500
-6-2
26
Eff
ect
0 500 1000 1500 2000 2500
Estim
ate
(e) Δ = 20cM
m = 120 p = 1000
-6-2
26
Eff
ect
0 500 1000 1500 2000 2500
Estim
ate
(f) Δ = 40cM m = 60 p = 2000
-6-2
26
Eff
ect
Position (cM)
0 500 1000 1500 2000 2500
Estim
ate
(g) Δ = 100cM m = 24
p = 5000
-6-2
26
0 500 1000 1500 2000 2500
Estim
ate
(h) Δ = 150cM m = 16
p = 7500
-6-2
26
0 500 1000 1500 2000 2500
Estim
ate
(i) Δ = 300cM m = 8
p = 15000
-6-2
26
0 500 1000 1500 2000 2500
Estim
ate
(j) Δ = 600cM
m = 4 p = 30000
-6-2
26
0 500 1000 1500 2000 2500
Estim
ate
(k) Δ = 1200cM
m = 2 p = 60000
-50
510
0 500 1000 1500 2000 2500
Estim
ate
(l) Δ = 2400cM m = 1 p = 120000
-50
510
Position (cM)
True and Estimated QTL Effect
0 500 1000 1500 2000 2500
-6-4
-20
24
68
Pa
ram
ete
r se
ttin
gtrue values
Eff
ect
Position (cM)
0 500 1000 1500 2000 2500
Estim
ate
Δ = 20cM
m = 120
p = 1000
-6-4
-20
24
68
Eff
ect
Position (cM)
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
0 20 40 60 80 100
40
50
60
70
80
90
(a) 2=10, h
2=0.638
Me
an
sq
ua
red
err
or
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
0 20 40 60 80 100
50
60
70
80
90
100
(b) 2=20, h
2=0.581
Bin size (cM)
Me
an
sq
ua
red
err
or
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
0 20 40 60 80 100
90
100
120
140
(c) 2=50, h
2=0.457
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
(data$cm[idx])
data
$m
se[idx]
0 20 40 60 80 100140
160
180
200
220 (d)
2=100, h
2=0.337
Bin size (cM)
n=200 n=300 n=400 n=500 n=1000
Figure 1. Mean squared error expressed as a function of bin size for Design I. The mean squared
errors were obtained from 100 replicated simulations. The overall proportion of the phenotypic
variance contributed by the 20 simulated QTL was calculated using 2 264.41/ (64.41 26.53 )h . Each panel contains the result of five different sample sizes (n).
The phenotypic variance of the simulated trait is indicated by the light horizontal line in each
panel (each panel represents one of the four different scenarios).
4.0 4.5 5.0 5.5
05
01
00
15
0
log10cmlist * 5000
$M
SE
1v
log10cmlist * 5000
Me
an
sq
ua
red
err
or
Bin size (log10 cM)
Figure 6. Mean squared error for the simulated data under design IV (low linkage disequilibrium)
plotted against the bin size. The sample size of the simulated population was 500n . The
residual error variance was 2 20 , corresponding to 2 0.777h . The filled circles indicate the
MSE under the infinitesimal model while the open circles indicate the MSE under the adaptive
infinitesimal model. The dashed horizontal line represents the phenotypic variance of the
simulated trait (89.71).
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Rice Tiller Number (Yu et al. 2011)
• Number of recombinant inbred lines: 210
• Number of SNP: 270,820
• Number of natural bins: 1619
• Number of artificial bins: vary from small to large
• Method: Empirical Bayes (eBayes)
• Cross validation: MSE and R-square
Yu et al. 2007, PLoS One 6(3) e17595
Figure 5. The MSE (curve in the left panel) and the R-square (curve in the right panel)
of the rice tiller number trait analysis, expressed as a function of bin size (artificial
bins). The black dashed horizontal line in the left panel is the phenotypic variance. The
red dashed horizontal line in the left panel is the MSE of the natural bin (without
breakpoints with bin) analysis. The red dashed horizontal line in the right panel is the
R-square of the natural bin analysis. R-square increased from 0.42 to 0.55.
Beef Cattle Data Analysis
• Trait = carcass weight
• Number of beef = 922
• Number of SNP markers = 40809
• Number of chromosomes = 29
• Methods = unweighted and weighted
5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
40
04
50
50
05
50
60
06
50
70
0
Bin size (log10 bp)
Me
an
sq
ua
red
err
or
Figure 7. Mean squared error for the carcass trait of beef cattle plotted against the bin size. The
filled circles indicate the MSE under the infinitesimal model while the open circles indicate the
MSE under the adaptive infinitesimal model. The dashed horizontal line represents the
phenotypic variance of the simulated trait (670.36). The blue horizontal line along with the two
dotted lines represents the MSE and the standard deviation of the MSE in the situation where the
bin size was one (one marker per bin). The sample size was 921n and the number of SNP
markers was 40809p . The bin size was defined as log10 bp. For example, the largest bin size
10log bp 8.5 means that the bin size contains 58.5 10 base pairs.
5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
40
04
50
50
05
50
60
06
50
70
0
Bin size (log10 bp)
Me
an
sq
ua
red
err
or
Figure 7. Mean squared error for the carcass trait of beef cattle plotted against the bin size. The
filled circles indicate the MSE under the infinitesimal model while the open circles indicate the
MSE under the adaptive infinitesimal model. The dashed horizontal line represents the
phenotypic variance of the simulated trait (670.36). The blue horizontal line along with the two
dotted lines represents the MSE and the standard deviation of the MSE in the situation where the
bin size was one (one marker per bin). The sample size was 921n and the number of SNP
markers was 40809p . The bin size was defined as log10 bp. For example, the largest bin size
10log bp 8.5 means that the bin size contains 58.5 10 base pairs.
5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
40
04
50
50
05
50
60
06
50
70
0
Bin size (log10 bp)
MS
E
0.2940.30.2980.281
0.3050.3260.3240.3240.333
0.306
0.204
0.122
-0.0020.0010.001-0.001
0.0260.0170.0240.0260.039
0.0850.0750.087
Marker Analysis p = 40809
MSE = 600
R^2 = (670-600)/670 = 0.09
0.09
Table 1. Mean squared error (MSE) and R-square values obtained from the 10-fold cross
validation analysis for the beef carcass trait using five competing models and the
proposed bin model.
Model MSE2 R-square
eBayes 648.11 0.0332
G-Blup 632.46 0.0565
BayesB-1 655.59 0.0220
BayesB-21 658.19 0.0182
Lasso 603.75 0.0994
Bin model 447.10 0.3330
1The Pi value for BayesB-2 is set at 0.95.
2The phenotypic variance of the beef carcass trait is 670.36. The magnitude of MSE value
smaller than 670.36 indicates the effectiveness of the model predictability.
Outline
• Quantitative trait and the infinitesimal model
• Infinitesimal model using marker information
• Adaptive infinitesimal model
• Simulation studies
• Rice and beef cattle data analyses
Acknowledgements
• Zhiqiu Hu (postdoc)
• Qifa Zhang (rice data)
• Zhiqiun Wang (beef data)
• USDA Grant 2007-02784
Thank You !