A tour of solving a Ridge regression model

Gregor Gorjanc, John M. Hickey

www.alphagenes.roslin.ed.ac.uk @GregorGorjanc

The plan

Data example

Direct/exact solve

Iterative solve via Gauss-Seidel

Markov chain Monte Carlo (MCMC)

A small example

Genotypes (rows: individuals, columns: loci)

Individual  Locus 1  Locus 2  Locus 3  Locus 4
A           A/A      B/B      A/B      A/A
B           A/A      B/B      A/A      A/A
C           A/B      B/B      B/B      B/B
D           B/B      A/B      A/A      A/A

Allele dosages

Individual  Locus 1  Locus 2  Locus 3  Locus 4
A           0        2        1        0
B           0        2        0        0
C           1        2        2        2
D           2        1        0        0
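The dosage table follows from the genotype table by counting B alleles (A/A = 0, A/B = 1, B/B = 2). A minimal R sketch of that conversion is given here; the object names `GenoChar`, `CountB` and `Geno` are illustrative and are not taken from the course files.

```r
# Genotypes from the example: rows are individuals A-D, columns are loci 1-4
GenoChar <- matrix(
  c("A/A", "B/B", "A/B", "A/A",
    "A/A", "B/B", "A/A", "A/A",
    "A/B", "B/B", "B/B", "B/B",
    "B/B", "A/B", "A/A", "A/A"),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), paste0("Locus", 1:4))
)

# Allele dosage = number of B alleles in a genotype (0, 1 or 2)
CountB <- function(g) sum(strsplit(g, "/", fixed = TRUE)[[1]] == "B")

# Apply the counter to every cell; dimnames are carried over
Geno <- apply(GenoChar, c(1, 2), CountB)
print(Geno)
```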

Genes and markers

Let's pick locus 1 as a gene (=causal locus) and loci 2 and 3 as markers

Individual  Locus 1 (gene)  Locus 2 (marker)  Locus 3 (marker)  Locus 4
A           0               2                 1                 0
B           0               2                 0                 0
C           1               2                 2                 2
D           2               1                 0                 0

Simulate phenotypes

Quantitative genetic model

P = Mean + G + E + G×E
  = Mean + (A + D + I) + E + (…)×E

P ≈ Mean + A + E

Simulate phenotypes

•  Population mean 10 units (mean of reference genotype, A/A)

•  Allele substitution effect 2 units (a change of mean when substituting allele A for B)

•  Breeding value = Allele sub. effect * Allele dosage

•  True phenotype = Pop. mean + Breeding value

Simulate phenotypes

Individual  Gene  Population mean  Breeding value  True phenotype
A           0     10               0×2=0           10
B           0     10               0×2=0           10
C           1     10               1×2=2           12
D           2     10               2×2=4           14
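The table can be reproduced in a few lines of R. This is a sketch under the assumptions stated on the slides (population mean 10, allele substitution effect 2); the variable names are illustrative. As a check, the sample variance of these four breeding values is 11/3 ≈ 3.67, which agrees with the Va value used when the noise variance is worked out.

```r
# Allele dosage at the gene (locus 1) for individuals A-D
Gene <- c(A = 0, B = 0, C = 1, D = 2)

PopMean  <- 10  # mean of the reference genotype A/A
AlphaSub <- 2   # allele substitution effect

# Breeding value = allele substitution effect * allele dosage
BreedingValue <- AlphaSub * Gene

# True phenotype = population mean + breeding value
TruePhen <- PopMean + BreedingValue

# Additive genetic variance in this tiny sample (var() divides by n - 1)
Va <- var(BreedingValue)

print(cbind(Gene, BreedingValue, TruePhen))
print(Va)
```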

Simulate phenotype

•  Observed phenotype = True phenotype + Noise

•  Sample noise from a Gaussian distribution: Noise ~ Normal(0, Ve)

•  How much noise?
   –  Target h^2 of 0.3, where h^2 = Va/(Va + Ve)
   –  Work out Ve if Va = 3.67 units^2
   –  Ve = Va(1 - h^2)/h^2 = 3.67 × (1 - 0.3)/0.3 = 8.56 units^2
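The arithmetic and the sampling step might look as follows in R. This is only a sketch: the seed is a hypothetical choice, so the sampled noise will not match the values shown on the slides. Note that `rnorm()` expects a standard deviation, so the variance Ve has to be square-rooted.

```r
Va <- 3.67  # additive genetic variance from the slides
h2 <- 0.3   # target heritability

# Rearranging h2 = Va / (Va + Ve) for Ve
Ve <- Va * (1 - h2) / h2
print(round(Ve, 2))

# True phenotypes of individuals A-D
TruePhen <- c(A = 10, B = 10, C = 12, D = 14)

set.seed(1)  # hypothetical seed, only to make the sketch reproducible
Noise <- rnorm(n = length(TruePhen), mean = 0, sd = sqrt(Ve))
ObsPhen <- TruePhen + Noise
print(round(cbind(TruePhen, Noise, ObsPhen), 1))
```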

Simulate phenotypes

Individual  Gene  True phenotype  Noise  Observed phenotype
A           0     10               2.3   12.3
B           0     10              -5.9    4.1
C           1     12               3.6   15.6
D           2     14               5.0   19.0

We will work with the true phenotype so that we all get the same solutions

R

•  File 01_Data.R

•  Run the code (step by step)

•  Which marker should capture most of the gene's effect?

•  Will the estimated marker effect be positive or negative?

R

•  The marker at locus 2 has the strongest correlation with the gene

•  The gene effect is positive but the correlation is negative, so the estimated marker effect will likely be negative.
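The correlations behind these answers can be checked directly. The sketch below is not the contents of 01_Data.R; it rebuilds the dosage matrix from the slides and correlates the gene (locus 1) with the two markers (loci 2 and 3). The object names are illustrative.

```r
# Allele dosages: rows are individuals A-D, columns are loci 1-4
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0),
               nrow = 4, byrow = TRUE)

Gene    <- Geno[, 1]    # causal locus
Markers <- Geno[, 2:3]  # loci 2 and 3

# Correlation of the gene with each marker
# (approximately -0.87 for locus 2 and -0.09 for locus 3)
Cor <- drop(cor(Gene, Markers))
print(round(Cor, 2))
```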

Summary

Data example

Direct/exact solve



The tasks

1)  Set up the model

2)  Estimate the model parameters (=solve the system of equations)

3)  Estimate/predict (genomic) breeding values

4)  Evaluate accuracy of breeding values (in the training set!!!)

Using R’s lm() function

•  File 02_Estimate_lm.R

•  Use R functions

NOTE: this is not a ridge regression model – just a linear model without any shrinkage/penalization

Using R’s lm() function

    > summary(LmFit)

    Call:
    lm(formula = Phen ~ 1 + Geno[, Cols])

    Residuals:
             1          2          3          4
    -6.667e-01  3.333e-01  3.333e-01  2.776e-17

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)    18.3333     1.7951  10.213   0.0621 .
    Geno[, Cols]1  -4.3333     1.1055  -3.920   0.1590
    Geno[, Cols]2   1.0000     0.5774   1.732   0.3333
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.8165 on 1 degrees of freedom
    Multiple R-squared:  0.9394,  Adjusted R-squared:  0.8182
    F-statistic:  7.75 on 2 and 1 DF,  p-value: 0.2462
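The fit above can be reproduced from the example data. The names `Phen`, `Geno` and `Cols` come from the Call line, but how they are constructed here is an assumption (true phenotypes as the response, loci 2 and 3 as the marker columns), so this is a sketch rather than the contents of 02_Estimate_lm.R.

```r
# Allele dosages: rows are individuals A-D, columns are loci 1-4
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0),
               nrow = 4, byrow = TRUE)

Phen <- c(10, 10, 12, 14)  # true phenotypes (assumed to be the response)
Cols <- 2:3                # marker columns (assumed: loci 2 and 3)

# Ordinary least squares: intercept plus two marker effects, no shrinkage
LmFit <- lm(Phen ~ 1 + Geno[, Cols])
print(summary(LmFit))

Coef <- unname(coef(LmFit))
```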

Do it yourself

•  File 03_Estimate_Direct_Solve.R

•  Model

Do it yourself

•  System of equations

Do it yourself

•  Solve the system

•  Predict phenotype

•  Standard errors

Do it yourself

•  Solutions

•  Standard errors

•  Predictions

•  Accuracy (in training!!!!!!!!!!)
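The steps listed on these slides can be sketched as below. The sketch assumes the usual linear model y = Xb + e with normal equations X'X b = X'y, and it estimates the residual variance from the residuals so that the standard errors can be compared with the lm() output. It is not the contents of 03_Estimate_Direct_Solve.R, and the object names are illustrative.

```r
Phen <- c(10, 10, 12, 14)          # true phenotypes
X <- cbind(Intercept = 1,
           M1 = c(2, 2, 2, 1),     # marker at locus 2
           M2 = c(1, 0, 2, 0))     # marker at locus 3

# System of equations: (X'X) b = X'y
LHS <- crossprod(X)
RHS <- crossprod(X, Phen)

# Solutions
Sol <- drop(solve(LHS, RHS))

# Predictions and residual variance (n - p degrees of freedom)
PhenHat <- drop(X %*% Sol)
Vres <- sum((Phen - PhenHat)^2) / (nrow(X) - ncol(X))

# Standard errors from the inverse of the LHS
SE <- sqrt(diag(solve(LHS)) * Vres)

# Accuracy in the training set: correlation of observed and predicted
Acc <- cor(Phen, PhenHat)

print(round(cbind(Sol, SE), 4))
print(round(Acc, 4))
```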

Ridge regression

•  File 04_Estimate_Direct_Solve_Prior.R

•  Assume that we know variance components
   –  Vm = Va/nMarkers = 3.67/2 = 1.83
   –  Ve = 8.56

Ridge regression - system

Results

•  Solutions

•  Standard errors

•  Predictions
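A sketch of the ridge system with known variance components follows. It assumes the standard form (X'X + diag(0, λ, λ)) b = X'y with λ = Ve/Vm, where the intercept is left unpenalized, and it scales the inverse of the LHS by Ve to obtain standard errors. These choices are assumptions and may differ from 04_Estimate_Direct_Solve_Prior.R.

```r
Phen <- c(10, 10, 12, 14)
X <- cbind(Intercept = 1,
           M1 = c(2, 2, 2, 1),
           M2 = c(1, 0, 2, 0))

# Known variance components from the slides
Va <- 3.67
Ve <- 8.56
nMarkers <- 2
Vm <- Va / nMarkers
Lambda <- Ve / Vm  # shrinkage parameter for marker effects

# Ridge system: add Lambda to the marker diagonals only (assumption)
LHS <- crossprod(X) + diag(c(0, Lambda, Lambda))
RHS <- crossprod(X, Phen)

Sol <- drop(solve(LHS, RHS))       # solutions
SE <- sqrt(diag(solve(LHS)) * Ve)  # standard errors
PhenHat <- drop(X %*% Sol)         # predictions

# Least-squares solutions for comparison (no shrinkage)
SolLS <- drop(solve(crossprod(X), RHS))
print(round(cbind(SolLS, Sol, SE), 4))
```

Comparing `SolLS` with `Sol` shows the marker effects being pulled towards zero by the penalty.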

Direct vs. iterative methods

•  Direct methods
   –  PRO: get estimates (=cond. means) and variance of estimates (=cond. variances)
   –  CON: can be expensive to solve for big datasets

•  Iterative methods
   –  PRO: can be solved for VERY large systems
   –  CON: get only estimates (NOTE: can easily extend to get variance of estimates as well as other stuff → full Bayesian analysis via MCMC)

Gauss-Seidel with residual update

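1)  Set up the diagonal of X'X
2)  Define the working vector (residuals), initially equal to the phenotypes
3)  Initialize the solutions (to zero)
4)  Iterate until convergence
    –  Iterate over parameters
       1)  Add the parameter's current contribution back to the working vector
       2)  Set up the LHS diagonal element
       3)  Set up the RHS element
       4)  Estimate the parameter (RHS/LHS)
       5)  Remove the parameter's new contribution from the working vector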

GSRU in R

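    XpX <- colSums(X*X)                  # diagonal of X'X
    E <- Phen                            # working vector, starts as the phenotypes
    Sol <- rep(0, times=nCov)            # initial solutions
    Iter <- 2                            # the source starts the counter at 2
    while (Iter <= nIter) {
      Eps <- 0                           # total change in solutions this sweep
      CovOrder <- sample(x=1:nCov)       # visit parameters in random order
      for(j in CovOrder) {
        E <- E + X[, j]*Sol[j]           # add current contribution back
        LHS <- XpX[j]                    # LHS diagonal element
        RHS <- sum(X[, j]*E)             # RHS element
        New <- RHS/LHS                   # new estimate
        E <- E - X[, j]*New              # remove new contribution
        Eps <- Eps + abs(New - Sol[j])   # accumulate change for convergence check
        Sol[j] <- New
      }
      Iter <- Iter + 1
      if (Eps < 1e-8) break              # converged
    }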

•  File 05_Estimate_GSRU.R

Convergence

GSRU for ridge regression

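1)  Set up the diagonal of X'X
2)  Define the working vector (residuals), initially equal to the phenotypes
3)  Initialize the solutions (to zero)
4)  Iterate until convergence
    –  Iterate over parameters
       1)  Add the parameter's current contribution back to the working vector
       2)  Set up the LHS diagonal element, adding Ve/Vm for marker effects
       3)  Set up the RHS element
       4)  Estimate the parameter (RHS/LHS)
       5)  Remove the parameter's new contribution from the working vector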

File 06_Estimate_GSRU_Prior.R
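A self-contained sketch of GSRU with a ridge penalty is shown below. It is not the contents of 06_Estimate_GSRU_Prior.R: the function `GSRU()` and its arguments are hypothetical, the parameters are visited in a fixed order (instead of a random order) to keep the result deterministic, and the intercept is assumed to be unpenalized. Setting all penalties to zero gives the plain least-squares GSRU.

```r
# Gauss-Seidel with residual update; Lambda holds one penalty per column of X
GSRU <- function(X, Phen, Lambda, nIter = 1000, Tol = 1e-8) {
  nCov <- ncol(X)
  XpX <- colSums(X * X)        # diagonal of X'X
  E <- Phen                    # working vector (residuals)
  Sol <- rep(0, times = nCov)  # initial solutions
  for (Iter in seq_len(nIter)) {
    Eps <- 0
    for (j in seq_len(nCov)) {
      E <- E + X[, j] * Sol[j]    # add current contribution back
      LHS <- XpX[j] + Lambda[j]   # LHS diagonal element plus penalty
      RHS <- sum(X[, j] * E)      # RHS element
      New <- RHS / LHS            # new estimate
      E <- E - X[, j] * New       # remove new contribution
      Eps <- Eps + abs(New - Sol[j])
      Sol[j] <- New
    }
    if (Eps < Tol) break          # converged
  }
  Sol
}

Phen <- c(10, 10, 12, 14)
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))
Ve <- 8.56
Vm <- 1.83
Lambda <- c(0, Ve / Vm, Ve / Vm)  # no penalty on the intercept (assumption)

SolGSRU <- GSRU(X, Phen, Lambda)

# Direct solve of the same system for comparison
SolDirect <- drop(solve(crossprod(X) + diag(Lambda), crossprod(X, Phen)))
print(round(rbind(SolGSRU, SolDirect), 4))
```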

Convergence

Shrinkage in action!!! (regularization, penalization)

Markov chain Monte Carlo (MCMC)

•  A method to obtain numerical approximations of multi-dimensional integrals
   –  Monte Carlo = sampling from distributions
   –  Markov chain = the current value depends only on the previous value, not on any earlier values

•  Very useful tool in Bayesian analysis

MCMC for ridge regression

•  Not needed in this case!!!

•  Model (means unknown, variances known)

•  Posterior (we know how to solve this)

MCMC for ridge regression

•  File 07_Estimate_MCMC.R

•  Let's test whether we get the same standard errors as with the direct solve

•  It's easy to implement on top of GSRU
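One way to build the sampler on top of GSRU is single-site Gibbs sampling: instead of setting each parameter to RHS/LHS, draw it from a normal distribution with mean RHS/LHS and variance Ve/LHS. The sketch below assumes this scheme with a flat prior on the intercept; the seed, chain length and burn-in are hypothetical choices, and the code is not the contents of 07_Estimate_MCMC.R.

```r
set.seed(2024)  # hypothetical seed, only for reproducibility

Phen <- c(10, 10, 12, 14)
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))
Ve <- 8.56
Vm <- 1.83
Lambda <- c(0, Ve / Vm, Ve / Vm)  # flat prior on the intercept (assumption)

nCov <- ncol(X)
nSamp <- 20000  # retained samples (hypothetical)
nBurn <- 1000   # burn-in (hypothetical)
XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
Samples <- matrix(NA_real_, nrow = nSamp, ncol = nCov)

for (s in seq_len(nBurn + nSamp)) {
  for (j in seq_len(nCov)) {
    E <- E + X[, j] * Sol[j]
    LHS <- XpX[j] + Lambda[j]
    RHS <- sum(X[, j] * E)
    # GSRU would use RHS/LHS directly; Gibbs samples around it
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]
  }
  if (s > nBurn) Samples[s - nBurn, ] <- Sol
}

# Summarize the chains
PostMean <- colMeans(Samples)
PostSD <- apply(Samples, 2, sd)

# Direct solve for comparison
C <- crossprod(X) + diag(Lambda)
DirectSol <- drop(solve(C, crossprod(X, Phen)))
DirectSE <- sqrt(diag(solve(C)) * Ve)
print(round(rbind(PostMean, DirectSol, PostSD, DirectSE), 2))
```

Up to Monte Carlo error, the posterior means and standard deviations should agree with the solutions and standard errors from the direct solve.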

Traceplot

Summarize the chains

•  Posterior means

•  Posterior standard deviations

MCMC for FULL ridge regression

•  Needed in this case!!!

•  Model (means and variances unknown)

•  Posterior (tricky …)

MCMC for FULL ridge regression

•  File 08_Estimate_MCMC_Including_Variances.R

•  As before, but also sample the variance components in addition to the means (intercept and two marker effects)
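A sketch of such a sampler is given below. The slides do not state the priors, so scaled inverse chi-square priors are assumed here for both variances, with hypothetical degrees of freedom and scales; the seed and chain length are also hypothetical. With only four records the results depend heavily on these choices, so the output should not be expected to match the summary table, and the code is not the contents of 08_Estimate_MCMC_Including_Variances.R.

```r
set.seed(2024)  # hypothetical seed

Phen <- c(10, 10, 12, 14)
X <- cbind(1, c(2, 2, 2, 1), c(1, 0, 2, 0))
nInd <- nrow(X)
nCov <- ncol(X)
nMarkers <- nCov - 1

# Assumed scaled inverse chi-square priors (hypothetical df and scales)
NuE <- 4; S2E <- 8.56
NuM <- 4; S2M <- 1.83

nSamp <- 5000
nBurn <- 500
XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
Ve <- S2E  # starting values
Vm <- S2M
Samples <- matrix(NA_real_, nrow = nSamp, ncol = nCov + 2,
                  dimnames = list(NULL, c("Intercept", "b1", "b2", "Ve", "Vm")))

for (s in seq_len(nBurn + nSamp)) {
  # Sample the means as in the known-variance sampler
  for (j in seq_len(nCov)) {
    E <- E + X[, j] * Sol[j]
    LHS <- XpX[j] + if (j == 1) 0 else Ve / Vm  # no penalty on the intercept
    RHS <- sum(X[, j] * E)
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]
  }
  # Sample the variances from their scaled inverse chi-square conditionals
  Ve <- (sum(E^2) + NuE * S2E) / rchisq(n = 1, df = nInd + NuE)
  Vm <- (sum(Sol[-1]^2) + NuM * S2M) / rchisq(n = 1, df = nMarkers + NuM)
  if (s > nBurn) Samples[s - nBurn, ] <- c(Sol, Ve, Vm)
}

# Posterior means and standard deviations
print(round(cbind(Mean = colMeans(Samples), SD = apply(Samples, 2, sd)), 2))
```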

Traceplot 1

Traceplot 2

Summarize the chains

Parameter  Mean  SD      Parameter  Mean  SD
Ve         3.5   2.8     Intercept  14.4  3.0
Vm         4.4   4.4     b1         -1.8  1.7
Va         6.7   6.7     b2          0.3  1.0
h2         0.63  0.19

Acknowledgements

•  Roslin
   –  Gregor Gorjanc
   –  Janez Jenko
   –  Mara Battagin
   –  Stefan Edwards
   –  Serap Gonen
   –  Chris Gaynor
   –  Anne-Michelle Faux
   –  Roberto Antolin
   –  John Woolliams
   –  Bruce Whitelaw

•  NIAB
   –  Ian Mackay
   –  Alison Bentley

•  Genus
   –  Alan Mileham
   –  Matthew Cleveland
   –  William Herring

•  Aviagen
   –  Andreas Kranis
   –  Kellie Watson

•  Further information
   –  www.alphagenes.roslin.ed.ac.uk
   –  @GregorGorjanc @HickeyJohn

•  Vacancies
   –  Two post-doc positions currently available

Funding
