Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...

Marker Based Infinitesimal Model for

Quantitative Trait Analysis

Shizhong Xu

Department of Botany and Plant Sciences

University of California

Riverside, CA 92521

Outline

• Quantitative trait and the infinitesimal model

• Infinitesimal model using marker information

• Adaptive infinitesimal model

• Simulation studies

• Rice and beef cattle data analyses

Quantitative Trait

Quantitative Genetics Model

Phenotype = Genotype + Environment

Infinitesimal Model

• Infinite number of genes

• Infinitely small effect of each gene

• Effect of an individual gene is not recognizable

• Collective effect of all genes are studied using

pedigree information (genetic relationship)

• Best linear unbiased prediction (BLUP)

Outline






Marker Based Infinitesimal Model

1

1

0

( ) ( )

p

j jk k jk

j jk k jk

L

j j j

y Z

y Z

y Z d

Different from Longitudinal Data Analysis

0

( ) ( )

( ) ( ) ;

L

j j j

j j

y Z d

y t t t

Numerical Integration

Bin Effect Model

1

0

1

1

( ) ( )

( ) ( )

j jk k jk

L

j j j

m

j j k k k jk

m

j jk k jk

y Z

y Z d

y Z

y Z

Bin Effects

Dense markers

Bin Bin

1

1( )

kp

jk jhk

Z Z hp

Recombination Breakpoint Data

1

0

1Marker: ( )

1Breakpoint: ( )

k

k

p

jk jhk

jk j

k

Z Z hp

Z Z d

0

1( )

k

jk j

k

Z Z d

8 21 0 0.8

10 10jkZ

What Does a Bin Effect Represent?

1

0

0 0

1( )

1( ) ( )

size of bin k

uniform variable

k

k k

m

j jk k jk

jk j

k

k k

k

k

y Z

Z Z d

d d

Assumptions of the Infinitesimal Model

• High linkage disequilibrium within a bin

• Homogeneous genetic effect within a bin

High Linkage Disequilibrium

0

20

1( )

number of crossovers, inversely

related to linkage disequilibrium

1lim var( ) , high linkage disequilibrium (F )

2

lim var( ) 0, low linkage disequilibrium

Larger v

k

k

k

jk j

k

k

jk

jk

Z Z d

Z

Z

ar( ) means higher powerjkZ

Range of Var(Z)

22

22

20 0 0

2

2 1 1 1lim var( ) lim lim

4 2 2

2 1 1lim var( ) lim lim 0

4 2

k

k

k k

k

k

k k k

k

kjk

k

jk

k

k

eZ e

eZ e

0 var( ) 0.5

0

choose var( ) as close to 0.5 as possible

but with the number of bins small enough to be

handled by a program for a given sample size

jk

k

jk

Z

Z

Outline






Adaptive Model Relaxes the Two

Assumptions

• High linkage disequilibrium within a bin

- prevent var(Z) from being zero

• Homogeneous genetic effect within a bin

- make all effects positive

Redefine the Bin Size by the Number

of Markers Within a Bin

1

1

1

1( )

( )

number of markers in bin k

k

k

m

j jk k jk

p

jk jhk

p

k k kh

k

y Z

Z Z hp

p h

p

Weighted Average Effect of a Bin

1 1

* * 1

1 1

1Unweighted: ( ); ( )

1Weighted: ( ) ( ); ( ) ( )

k k

k k

p p

jk j k k kh hk

p p

jk j kh hk

Z Z h p hp

Z w h Z h w h hp

* *

1

m

j jk k jk

y Z

Weight System

1

1

1

1 ˆ ˆDefine | | = mean(| |)

ˆwhere is the least squares estimate

of marker within bin

The weight for marker is defined as

ˆ ˆˆ

ˆmean(| |)ˆ| |

k

k

p

k hhk

h

k h hh k h p

hh

c b bp

b

h k

h

p b bw c b

bb

Weighted Var(Z*) > 0

* * * *

21

2

21

2

21

1var( ) var ( ) 2 cov ( ), ( )

1 1 1 2 (1 2 )

2 2

1 1 , when no linkage disequilibrium (1 2 )

2

k k

k k

k

p p

jk j j jh l hk

p p

h h l hlh l hk

p

h hlhk

Z Z h Z h Z lp

w w wp

wp

0

0

Homogenization of Marker

Effects Within Bin

* 1

1 1 1

*

1

( ) ˆ( ) | |ˆ

( )where (a constant)

ˆ

ˆ ˆ| | 0 as long as one 0

k k k

k

p p p

k h k k k hh h hh

h

p

k h hh

hw h c c p b

b

h

b

b b

Outline






Measurement of Prediction (Cross Validation)

2

1

2

1

2

1ˆ( ) , Mean Squared Error

1( ) , Phenotypic Variance

, Squared Correlation

n

j jj

n

j jj

MSE y yn

MSY y yn

MSY MSER

MSY

Simulation Experiment

• Genome size = 2,500 cM

• Number of markers = 120,000

• Marker interval = 0.02 cM

• Cross validation (MSE)

• Design I = 20 QTL

• Design II = Clustered polygenic model

• Design III = Polygenic model

• Design IV = Design I with 2,500 x100 cM

True QTL Effect

0 500 1000 1500 2000 2500

-6-4

-20

24

68

Pa

ram

ete

r se

ttin

g

true values

Eff

ect

Position (cM)

Estimated Bin Effects0 500 1000 1500 2000 2500

Estim

ate

(a) Δ = 1cM m = 2400

p = 50

-6-2

26

Eff

ect

0 500 1000 1500 2000 2500

Estim

ate

(b) Δ = 2cM m = 1200

p = 100

-6-2

26

Eff

ect

0 500 1000 1500 2000 2500

Estim

ate

(c) Δ = 5cM m = 480

p = 250

-6-2

26

Eff

ect

0 500 1000 1500 2000 2500

Estim

ate

(d) Δ = 10cM

m = 240 p = 500

-6-2

26

Eff

ect

0 500 1000 1500 2000 2500

Estim

ate

(e) Δ = 20cM

m = 120 p = 1000

-6-2

26

Eff

ect

0 500 1000 1500 2000 2500

Estim

ate

(f) Δ = 40cM m = 60 p = 2000

-6-2

26

Eff

ect

Position (cM)

0 500 1000 1500 2000 2500

Estim

ate

(g) Δ = 100cM m = 24

p = 5000

-6-2

26

0 500 1000 1500 2000 2500

Estim

ate

(h) Δ = 150cM m = 16

p = 7500

-6-2

26

0 500 1000 1500 2000 2500

Estim

ate

(i) Δ = 300cM m = 8

p = 15000

-6-2

26

0 500 1000 1500 2000 2500

Estim

ate

(j) Δ = 600cM

m = 4 p = 30000

-6-2

26

0 500 1000 1500 2000 2500

Estim

ate

(k) Δ = 1200cM

m = 2 p = 60000

-50

510

0 500 1000 1500 2000 2500

Estim

ate

(l) Δ = 2400cM m = 1 p = 120000

-50

510

Position (cM)

True and Estimated QTL Effect

0 500 1000 1500 2000 2500

-6-4

-20

24

68

Pa

ram

ete

r se

ttin

gtrue values

Eff

ect

Position (cM)

0 500 1000 1500 2000 2500

Estim

ate

Δ = 20cM

m = 120

p = 1000

-6-4

-20

24

68

Eff

ect

Position (cM)

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

0 20 40 60 80 100

40

50

60

70

80

90

(a) 2=10, h

2=0.638

Me

an

sq

ua

red

err

or

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

0 20 40 60 80 100

50

60

70

80

90

100

(b) 2=20, h

2=0.581

Bin size (cM)

Me

an

sq

ua

red

err

or

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

0 20 40 60 80 100

90

100

120

140

(c) 2=50, h

2=0.457

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

(data$cm[idx])

data

$m

se[idx]

0 20 40 60 80 100140

160

180

200

220 (d)

2=100, h

2=0.337

Bin size (cM)

n=200 n=300 n=400 n=500 n=1000

Figure 1. Mean squared error expressed as a function of bin size for Design I. The mean squared

errors were obtained from 100 replicated simulations. The overall proportion of the phenotypic

variance contributed by the 20 simulated QTL was calculated using 2 264.41/ (64.41 26.53 )h . Each panel contains the result of five different sample sizes (n).

The phenotypic variance of the simulated trait is indicated by the light horizontal line in each

panel (each panel represents one of the four different scenarios).

4.0 4.5 5.0 5.5

05

01

00

15

0

log10cmlist * 5000

$M

SE

1v

log10cmlist * 5000

Me

an

sq

ua

red

err

or

Bin size (log10 cM)

Figure 6. Mean squared error for the simulated data under design IV (low linkage disequilibrium)

plotted against the bin size. The sample size of the simulated population was 500n . The

residual error variance was 2 20 , corresponding to 2 0.777h . The filled circles indicate the

MSE under the infinitesimal model while the open circles indicate the MSE under the adaptive

infinitesimal model. The dashed horizontal line represents the phenotypic variance of the

simulated trait (89.71).

Outline






Rice Tiller Number (Yu et al. 2011)

• Number of recombinant inbred lines: 210

• Number of SNP: 270,820

• Number of natural bins: 1619

• Number of artificial bins: vary from small to large

• Method: Empirical Bayes (eBayes)

• Cross validation: MSE and R-square

Yu et al. 2007, PLoS One 6(3) e17595

Figure 5. The MSE (curve in the left panel) and the R-square (curve in the right panel)

of the rice tiller number trait analysis, expressed as a function of bin size (artificial

bins). The black dashed horizontal line in the left panel is the phenotypic variance. The

red dashed horizontal line in the left panel is the MSE of the natural bin (without

breakpoints with bin) analysis. The red dashed horizontal line in the right panel is the

R-square of the natural bin analysis. R-square increased from 0.42 to 0.55.

Beef Cattle Data Analysis

• Trait = carcass weight

• Number of beef = 922

• Number of SNP markers = 40809

• Number of chromosomes = 29

• Methods = unweighted and weighted

5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5

40

04

50

50

05

50

60

06

50

70

0

Bin size (log10 bp)

Me

an

sq

ua

red

err

or

Figure 7. Mean squared error for the carcass trait of beef cattle plotted against the bin size. The

filled circles indicate the MSE under the infinitesimal model while the open circles indicate the

MSE under the adaptive infinitesimal model. The dashed horizontal line represents the

phenotypic variance of the simulated trait (670.36). The blue horizontal line along with the two

dotted lines represents the MSE and the standard deviation of the MSE in the situation where the

bin size was one (one marker per bin). The sample size was 921n and the number of SNP

markers was 40809p . The bin size was defined as log10 bp. For example, the largest bin size

10log bp 8.5 means that the bin size contains 58.5 10 base pairs.

5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5

40

04

50

50

05

50

60

06

50

70

0

Bin size (log10 bp)

MS

E

0.2940.30.2980.281

0.3050.3260.3240.3240.333

0.306

0.204

0.122

-0.0020.0010.001-0.001

0.0260.0170.0240.0260.039

0.0850.0750.087

Marker Analysis p = 40809

MSE = 600

R^2 = (670-600)/670 = 0.09

0.09

Table 1. Mean squared error (MSE) and R-square values obtained from the 10-fold cross

validation analysis for the beef carcass trait using five competing models and the

proposed bin model.

Model MSE2 R-square

eBayes 648.11 0.0332

G-Blup 632.46 0.0565

BayesB-1 655.59 0.0220

BayesB-21 658.19 0.0182

Lasso 603.75 0.0994

Bin model 447.10 0.3330

1The Pi value for BayesB-2 is set at 0.95.

2The phenotypic variance of the beef carcass trait is 670.36. The magnitude of MSE value

smaller than 670.36 indicates the effectiveness of the model predictability.

Outline






Acknowledgements

• Zhiqiu Hu (postdoc)

• Qifa Zhang (rice data)

• Zhiqiun Wang (beef data)

• USDA Grant 2007-02784

Thank You !

Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...

Documents

Transcript of Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...