A tour of solving a Ridge regression model
Gregor Gorjanc, John M. Hickey
www.alphagenes.roslin.ed.ac.uk @GregorGorjanc
The plan
Data example
Direct/exact solve
Iterative solve via Gauss-Seidel
Monte Carlo Markov Chain
A small example
Locus
Individual 1 2 3 4
A A/A B/B A/B A/A
B A/A B/B A/A A/A
C A/B B/B B/B B/B
D B/B A/B A/A A/A
Allele dosages
Locus
Individual 1 2 3 4
A 0 2 1 0
B 0 2 0 0
C 1 2 2 2
D 2 1 0 0
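The dosage table follows from the genotype table by counting the B alleles in each genotype (A/A = 0, A/B = 1, B/B = 2). A minimal R sketch of that recoding; the object names `GenoChar`, `CountB` and `Geno` are illustrative assumptions and need not match those used in 01_Data.R:

```r
# Genotypes of the four individuals (rows) at the four loci (columns)
GenoChar <- matrix(c("A/A", "B/B", "A/B", "A/A",
                     "A/A", "B/B", "A/A", "A/A",
                     "A/B", "B/B", "B/B", "B/B",
                     "B/B", "A/B", "A/A", "A/A"),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("A", "B", "C", "D"), 1:4))

# Allele dosage = number of B alleles in a genotype such as "A/B"
CountB <- function(x) sum(strsplit(x, split = "/")[[1]] == "B")

# Apply the recoding to every cell of the genotype matrix
Geno <- apply(GenoChar, MARGIN = c(1, 2), FUN = CountB)
print(Geno)
```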
Genes and markers
Let's pick locus 1 as the gene (= causal locus) and loci 2 and 3 as markers
Locus
Individual 1 2 3 4
A 0 2 1 0
B 0 2 0 0
C 1 2 2 2
D 2 1 0 0
Simulate phenotypes
Quantitative genetic model
P = Mean + G + E + G×E
  = Mean + (A + D + I) + E + (…)×E
P ≈ Mean + A + E
Simulate phenotypes
• Population mean 10 units (mean of reference genotype, A/A)
• Allele substitution effect 2 units (a change of mean when substituting allele A for B)
• Breeding value = Allele sub. effect * Allele dosage
• True phenotype = Pop. mean + Breeding value
Simulate phenotypes
Individual  Gene  Population mean  Breeding value  True phenotype
A           0     10               0×2=0           10
B           0     10               0×2=0           10
C           1     10               1×2=2           12
D           2     10               2×2=4           14
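The table above can be reproduced in a few lines of R. This is a sketch only; the variable names (`Gene`, `PopMean`, `AlphaSub`, `BreedingValue`, `TruePhen`) are assumptions chosen for illustration:

```r
# Allele dosage at the causal locus (locus 1) for individuals A-D
Gene <- c(A = 0, B = 0, C = 1, D = 2)

PopMean  <- 10  # population mean (mean of the reference genotype A/A)
AlphaSub <- 2   # allele substitution effect

# Breeding value = allele substitution effect * allele dosage
BreedingValue <- AlphaSub * Gene

# True phenotype = population mean + breeding value
TruePhen <- PopMean + BreedingValue

print(cbind(Gene, PopMean, BreedingValue, TruePhen))
```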
Simulate phenotype
• Observed phenotype = True phenotype + Noise
• Sample noise from a Gaussian distribution: Noise ~ Normal(0, Ve)
• How much noise?
– Target h2 of 0.3, where h2 = Va/(Va+Ve)
– Work out Ve if Va = 3.67 units²
– Ve = Va(1-h2)/h2 = 3.67 × (1-0.3)/0.3 = 8.56 units²
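The Ve calculation and the noise sampling can be sketched in R as follows. The seed is an arbitrary assumption, so the sampled noise will not match the values shown on the slides, and the variable names are illustrative:

```r
BreedingValue <- c(0, 0, 2, 4)   # breeding values of individuals A-D

Va <- var(BreedingValue)         # additive genetic variance, about 3.67
h2 <- 0.3                        # target heritability

# Rearranging h2 = Va/(Va + Ve) gives Ve = Va*(1 - h2)/h2, about 8.56
Ve <- Va * (1 - h2) / h2

set.seed(1)                      # arbitrary seed (assumption)
# rnorm() takes a standard deviation, hence sqrt(Ve)
Noise <- rnorm(n = length(BreedingValue), mean = 0, sd = sqrt(Ve))

ObsPhen <- 10 + BreedingValue + Noise
print(round(c(Va = Va, Ve = Ve), 2))
```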
Simulate phenotypes
Individual  Gene  True phenotype  Noise  Observed phenotype
A           0     10              2.3    12.3
B           0     10              -5.9   4.1
C           1     12              3.6    15.6
D           2     14              5.0    19.0
We will work with the true phenotype so that we all get the same solutions
R
• File 01_Data.R
• Run the code (step by step)
• Which marker should capture the effect of the gene the most?
• Will the estimated marker effect be positive or negative?
R
• Marker 2 has the strongest correlation with the gene
• Gene effect is positive, but correlation is negative, so marker effect estimate will likely be negative.
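A quick way to check this is to correlate the gene dosage with each marker dosage. The sketch below assumes that "marker 2" refers to locus 2 of the dosage table; the variable names are illustrative:

```r
Gene   <- c(0, 0, 1, 2)  # locus 1 (the gene)
Locus2 <- c(2, 2, 2, 1)  # first marker
Locus3 <- c(1, 0, 2, 0)  # second marker

# Correlation of each marker with the gene
CorLocus2 <- cor(Gene, Locus2)
CorLocus3 <- cor(Gene, Locus3)
print(round(c(Locus2 = CorLocus2, Locus3 = CorLocus3), 2))
```

Both correlations are negative, and the one for locus 2 is much larger in absolute value, which is why a negative marker effect estimate is expected there.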
Summary
Data example
Direct/exact solve
Iterative solve via Gauss-Seidel
Monte Carlo Markov Chain
The tasks
1) Setup the model
2) Estimate the model parameters (=solve the system of equations)
3) Estimate/predict (genomic) breeding values
4) Evaluate accuracy of breeding values (in the training set!!!)
Using R’s lm() function
• File 02_Estimate_lm.R
• Use R functions
NOTE: this is not a ridge regression model – just a linear model without any shrinkage/penalization
Using R’s lm() function
> summary(LmFit)
Call:
lm(formula = Phen ~ 1 + Geno[, Cols])

Residuals:
         1          2          3          4
-6.667e-01  3.333e-01  3.333e-01  2.776e-17

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    18.3333     1.7951  10.213   0.0621 .
Geno[, Cols]1  -4.3333     1.1055  -3.920   0.1590
Geno[, Cols]2   1.0000     0.5774   1.732   0.3333
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8165 on 1 degrees of freedom
Multiple R-squared: 0.9394, Adjusted R-squared: 0.8182
F-statistic: 7.75 on 2 and 1 DF, p-value: 0.2462
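The fit above can be reproduced with the sketch below. It assumes that `Geno` holds the allele dosages, that `Cols` selects loci 2 and 3, and that `Phen` is the true phenotype; the actual contents of 02_Estimate_lm.R may differ:

```r
# Allele dosages: individuals A-D in rows, loci 1-4 in columns
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)

# True phenotype: population mean 10 + substitution effect 2 * gene dosage
Phen <- 10 + 2 * Geno[, 1]

Cols <- c(2, 3)                       # loci used as markers
LmFit <- lm(Phen ~ 1 + Geno[, Cols])  # plain linear model, no shrinkage
print(coef(LmFit))
```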
Do it yourself
• File 03_Estimate_Direct_Solve.R
• Model
Do it yourself
• System of equations
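The model and system of equations are not reproduced in this transcript. A standard formulation that is consistent with the lm() fit is sketched below; the symbols ($\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, $\mathbf{e}$, $\sigma^2_e$) are assumptions, not necessarily the notation used on the slides:

```latex
% Model: phenotypes = intercept + marker effects + residual
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e},
\qquad
\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{m}_1 & \mathbf{m}_2 \end{bmatrix},
\qquad
\boldsymbol{\beta} = \begin{bmatrix} \mu & b_1 & b_2 \end{bmatrix}^{\top},
\qquad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)

% System of equations (normal equations) and variance of the estimates
\mathbf{X}^{\top}\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^{\top}\mathbf{y},
\qquad
\operatorname{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^{\top}\mathbf{X})^{-1}\sigma^2_e
```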
Do it yourself
• Solve the system
• Predict phenotype
• Standard errors
Do it yourself
• Solutions
• Standard errors
• Predictions
• Accuracy (in training!!!!!!!!!!)
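The steps above can be sketched with base R matrix functions. This is an illustration under the assumption that the system is the ordinary least squares normal equations; the object names need not match 03_Estimate_Direct_Solve.R:

```r
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]       # true phenotype
X <- cbind(1, Geno[, c(2, 3)])   # intercept + two markers

# System of equations: (X'X) Sol = X'y
LHS <- crossprod(X)
RHS <- crossprod(X, Phen)
Sol <- drop(solve(LHS, RHS))     # solutions

# Predictions and residual variance
Pred  <- drop(X %*% Sol)
VeHat <- sum((Phen - Pred)^2) / (nrow(X) - ncol(X))

# Standard errors from the inverse of the LHS
SE <- sqrt(diag(solve(LHS)) * VeHat)

# Accuracy in the training set: correlation of prediction and true phenotype
Acc <- cor(Pred, Phen)

print(round(cbind(Sol, SE), 4))
print(round(Acc, 4))
```

The solutions and standard errors agree with the lm() summary shown earlier, which is a useful sanity check on the hand-built system.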
Ridge regression
• File 04_Estimate_Direct_Solve_Prior.R
• Assume that we know the variance components
– Vm = Va/nMarkers = 3.67/2 = 1.83
– Ve = 8.56
Ridge regression - system
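The system itself is not reproduced in this transcript. A common form of the ridge regression system, assuming the intercept is not penalized and both markers share the variance ratio $\lambda = V_e / V_m$, is:

```latex
% Ridge regression system: variance ratio added to the marker diagonals only
\left( \mathbf{X}^{\top}\mathbf{X} + \boldsymbol{\Lambda} \right) \hat{\boldsymbol{\beta}}
  = \mathbf{X}^{\top}\mathbf{y},
\qquad
\boldsymbol{\Lambda} = \operatorname{diag}(0, \lambda, \lambda),
\qquad
\lambda = \frac{V_e}{V_m} = \frac{8.56}{1.83} \approx 4.7

% Conditional (posterior) variance of the estimates, given known variances
\operatorname{Var}(\boldsymbol{\beta} \mid \mathbf{y})
  = \left( \mathbf{X}^{\top}\mathbf{X} + \boldsymbol{\Lambda} \right)^{-1} V_e
```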
Results
• Solutions
• Standard errors
• Predictions
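A minimal R sketch of the ridge regression direct solve follows. It assumes an unpenalized intercept and the variance ratio Ve/Vm added to the marker diagonals; the object names are illustrative and need not match 04_Estimate_Direct_Solve_Prior.R:

```r
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]       # true phenotype
X <- cbind(1, Geno[, c(2, 3)])   # intercept + two markers

# Known variance components
nMarkers <- 2
Va <- 3.67
Vm <- Va / nMarkers
Ve <- 8.56

# Penalty: 0 for the intercept (assumption), Ve/Vm for each marker
Lambda <- c(0, rep(Ve / Vm, times = nMarkers))

LHS <- crossprod(X) + diag(Lambda)
RHS <- crossprod(X, Phen)
SolRidge <- drop(solve(LHS, RHS))        # solutions
SE   <- sqrt(diag(solve(LHS)) * Ve)      # standard errors (Ve known)
Pred <- drop(X %*% SolRidge)             # predictions

# Unpenalized solutions for comparison
SolOLS <- drop(solve(crossprod(X), RHS))
print(round(cbind(SolOLS, SolRidge, SE), 3))
```

Comparing the two columns of solutions shows the marker effects being pulled towards zero by the penalty.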
Summary
Data example
Direct/exact solve
Iterative solve via Gauss-Seidel
Monte Carlo Markov Chain
Direct vs. iterative methods
• Direct methods
– PRO: get estimates (= cond. means) and variance of estimates (= cond. variances)
– CON: can be expensive to solve for big datasets
• Iterative methods
– PRO: can be solved for VERY large systems
– CON: get only estimates (NOTE: can easily extend to get variance of estimates as well as other stuff → full Bayesian analysis via MCMC)
Gauss-Seidel with residual update
1) Set up the diagonal of X'X
2) Define the working vector
3) Initialize solutions
4) Iterate until convergence
– Iterate over parameters
  1) Add to working vector
  2) Set up LHS diagonal element
  3) Set up RHS element
  4) Estimate
  5) Remove from working vector
GSRU in R
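XpX <- colSums(X*X)                # diagonal of X'X
E <- Phen                          # working vector
Sol <- rep(0, times=nCov)          # initial solutions
Iter <- 2
while (Iter <= nIter) {
  Eps <- 0
  CovOrder <- sample(x=1:nCov)     # random order of parameters
  for (j in CovOrder) {
    E <- E + X[, j]*Sol[j]         # add to working vector
    LHS <- XpX[j]                  # LHS diagonal element
    RHS <- sum(X[, j]*E)           # RHS element
    New <- RHS/LHS                 # estimate
    E <- E - X[, j]*New            # remove from working vector
    Eps <- Eps + abs(New - Sol[j]) # total change in this iteration
    Sol[j] <- New
  }
  Iter <- Iter + 1
  if (Eps < 1e-8) break            # converged
}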
• File 05_Estimate_GSRU.R
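The GSRU loop needs `X`, `Phen`, `nCov` and `nIter` to be defined first. The sketch below wraps the same steps in a function and applies it to the example data; the function name `GSRU`, its arguments, the iteration limit and the seed are all assumptions made for illustration:

```r
# Gauss-Seidel with residual update for the system X'X Sol = X'Phen
GSRU <- function(X, Phen, nIter = 10000, Tol = 1e-8) {
  nCov <- ncol(X)
  XpX <- colSums(X * X)             # diagonal of X'X
  E <- Phen                         # working vector (residuals when Sol = 0)
  Sol <- rep(0, times = nCov)       # initial solutions
  for (Iter in seq_len(nIter)) {
    Eps <- 0
    for (j in sample(x = 1:nCov)) { # random order of parameters
      E <- E + X[, j] * Sol[j]      # add to working vector
      New <- sum(X[, j] * E) / XpX[j]
      E <- E - X[, j] * New         # remove from working vector
      Eps <- Eps + abs(New - Sol[j])
      Sol[j] <- New
    }
    if (Eps < Tol) break            # converged
  }
  list(Sol = Sol, nIter = Iter)
}

Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]          # true phenotype
X <- cbind(1, Geno[, c(2, 3)])      # intercept + two markers

set.seed(1)                         # arbitrary seed for the random order
Fit <- GSRU(X = X, Phen = Phen)
print(Fit)
```

Because the intercept and the first marker are strongly collinear in this tiny dataset, the iteration needs many passes before the change between passes falls below the tolerance.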
Convergence
GSRU for ridge regression
1) Set up the diagonal of X'X
2) Define the working vector
3) Initialize solutions
4) Iterate until convergence
– Iterate over parameters
  1) Add to working vector
  2) Set up LHS diagonal element
  3) Set up RHS element
  4) Estimate
  5) Remove from working vector
File 06_Estimate_GSRU_Prior.R
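For ridge regression, a common way to adapt the scheme is to add the variance ratio Ve/Vm to the LHS diagonal element of each marker. The sketch below does this under the assumption of an unpenalized intercept; the function name `GSRURidge` and its arguments are hypothetical and need not match 06_Estimate_GSRU_Prior.R:

```r
# Gauss-Seidel with residual update for (X'X + diag(Lambda)) Sol = X'Phen
GSRURidge <- function(X, Phen, Lambda, nIter = 10000, Tol = 1e-8) {
  nCov <- ncol(X)
  XpX <- colSums(X * X)             # diagonal of X'X
  E <- Phen                         # working vector
  Sol <- rep(0, times = nCov)       # initial solutions
  for (Iter in seq_len(nIter)) {
    Eps <- 0
    for (j in sample(x = 1:nCov)) {
      E <- E + X[, j] * Sol[j]      # add to working vector
      LHS <- XpX[j] + Lambda[j]     # LHS diagonal element with penalty
      RHS <- sum(X[, j] * E)        # RHS element
      New <- RHS / LHS              # estimate (shrunken if Lambda[j] > 0)
      E <- E - X[, j] * New         # remove from working vector
      Eps <- Eps + abs(New - Sol[j])
      Sol[j] <- New
    }
    if (Eps < Tol) break            # converged
  }
  list(Sol = Sol, nIter = Iter)
}

Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]          # true phenotype
X <- cbind(1, Geno[, c(2, 3)])      # intercept + two markers

Ve <- 8.56
Vm <- 3.67 / 2
Lambda <- c(0, Ve / Vm, Ve / Vm)    # no penalty on the intercept (assumption)

set.seed(1)                         # arbitrary seed for the random order
FitRidge <- GSRURidge(X = X, Phen = Phen, Lambda = Lambda)
print(FitRidge)
```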
Convergence
Shrinkage in action!!! (= regularization = penalization)
Summary
Data example
Direct/exact solve
Iterative solve via Gauss-Seidel
Monte Carlo Markov Chain
Monte Carlo Markov Chain
• A method to obtain a numerical approximation of multi-dimensional integrals
– Monte Carlo = sampling from distributions
– Markov Chain = current value depends only on the previous value, but not on the value before the previous value
• Very useful tool in Bayesian analysis
MCMC for ridge regression
• Not needed in this case!!!
• Model (means unknown, variances known)
• Posterior (we know how to solve this)
MCMC for ridge regression
• File 07_Estimate_MCMC.R
• Let's test if we will get the same standard errors as with the direct solve
• It's easy to implement on top of GSRU
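One way to build the sampler on top of GSRU is to replace the point estimate RHS/LHS with a draw from a normal distribution with mean RHS/LHS and variance Ve/LHS (a Gibbs sampler). The sketch below assumes known variances and an unpenalized intercept; the chain length, burn-in and seed are arbitrary assumptions, and the code need not match 07_Estimate_MCMC.R:

```r
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]          # true phenotype
X <- cbind(1, Geno[, c(2, 3)])      # intercept + two markers
nCov <- ncol(X)

Ve <- 8.56
Vm <- 3.67 / 2
Lambda <- c(0, Ve / Vm, Ve / Vm)    # no penalty on the intercept (assumption)

set.seed(123)                       # arbitrary seed
nBurn <- 1000                       # burn-in samples (assumption)
nSamp <- 20000                      # stored samples (assumption)

XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
Samples <- matrix(NA_real_, nrow = nSamp, ncol = nCov)

for (s in seq_len(nBurn + nSamp)) {
  for (j in 1:nCov) {
    E <- E + X[, j] * Sol[j]        # add to working vector
    LHS <- XpX[j] + Lambda[j]
    RHS <- sum(X[, j] * E)
    # Sample from the full conditional instead of taking RHS/LHS
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]        # remove from working vector
  }
  if (s > nBurn) Samples[s - nBurn, ] <- Sol
}

PostMean <- colMeans(Samples)       # posterior means
PostSD <- apply(Samples, 2, sd)     # posterior standard deviations
print(round(cbind(PostMean, PostSD), 2))
```

Up to Monte Carlo error, the posterior means and standard deviations should agree with the solutions and standard errors of the ridge regression direct solve.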
Traceplot
Traceplot
Summarize the chains
• Posterior means
• Posterior standard deviations
MCMC for FULL ridge regression
• Needed in this case!!!
• Model (means and variances unknown)
• Posterior (tricky …)
MCMC for FULL ridge regression
• File 08_Estimate_MCMC_Including_Variances.R
• As before, but also sample the variance components in addition to the means (intercept and two marker effects)
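A sketch of how the variance components could be sampled alongside the means is given below. The priors are not stated in this transcript, so the scaled inverse chi-square priors used here (degrees of freedom `dfE`, `dfM` and scales `ScaleE`, `ScaleM`) are purely hypothetical, as are the chain length, burn-in and seed. The output is therefore not expected to match the summary table from 08_Estimate_MCMC_Including_Variances.R:

```r
Geno <- matrix(c(0, 2, 1, 0,
                 0, 2, 0, 0,
                 1, 2, 2, 2,
                 2, 1, 0, 0), nrow = 4, byrow = TRUE)
Phen <- 10 + 2 * Geno[, 1]          # true phenotype
X <- cbind(1, Geno[, c(2, 3)])      # intercept + two markers
nInd <- nrow(X)
nCov <- ncol(X)
nMarkers <- nCov - 1

# Hypothetical scaled inverse chi-square priors (assumption)
dfE <- 4; ScaleE <- 8.56 * (dfE - 2) / dfE
dfM <- 4; ScaleM <- 1.83 * (dfM - 2) / dfM

set.seed(123)                       # arbitrary seed
nBurn <- 1000                       # burn-in samples (assumption)
nSamp <- 20000                      # stored samples (assumption)

XpX <- colSums(X * X)
E <- Phen
Sol <- rep(0, times = nCov)
Ve <- 8.56                          # starting values
Vm <- 1.83
Samples <- matrix(NA_real_, nrow = nSamp, ncol = nCov + 2,
                  dimnames = list(NULL, c("Intercept", "b1", "b2", "Ve", "Vm")))

for (s in seq_len(nBurn + nSamp)) {
  Lambda <- c(0, rep(Ve / Vm, times = nMarkers))
  # Sample the means (intercept and marker effects) as with known variances
  for (j in 1:nCov) {
    E <- E + X[, j] * Sol[j]
    LHS <- XpX[j] + Lambda[j]
    RHS <- sum(X[, j] * E)
    Sol[j] <- rnorm(n = 1, mean = RHS / LHS, sd = sqrt(Ve / LHS))
    E <- E - X[, j] * Sol[j]
  }
  # Sample the variances from scaled inverse chi-square full conditionals
  Ve <- (sum(E^2) + dfE * ScaleE) / rchisq(n = 1, df = nInd + dfE)
  Vm <- (sum(Sol[-1]^2) + dfM * ScaleM) / rchisq(n = 1, df = nMarkers + dfM)
  if (s > nBurn) Samples[s - nBurn, ] <- c(Sol, Ve, Vm)
}

print(round(cbind(Mean = colMeans(Samples), SD = apply(Samples, 2, sd)), 2))
```

With only four records, the posterior of the variance components is dominated by the prior, so the choice of `dfE`, `dfM`, `ScaleE` and `ScaleM` matters a great deal here.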
Traceplot 1
Traceplot 2
Summarize the chains
Parameter  Mean  SD
Ve         3.5   2.8
Vm         4.4   4.4
Va         6.7   6.7
h2         0.63  0.19
Intercept  14.4  3.0
b1         -1.8  1.7
b2         0.3   1.0
Summary
Data example
Direct/exact solve
Iterative solve via Gauss-Seidel
Monte Carlo Markov Chain
Acknowledgements
• Roslin: Gregor Gorjanc, Janez Jenko, Mara Battagin, Stefan Edwards, Serap Gonen, Chris Gaynor, Anne-Michelle Faux, Roberto Antolin, John Woolliams, Bruce Whitelaw
• NIAB: Ian Mackay, Alison Bentley
• Genus: Alan Mileham, Matthew Cleveland, William Herring
• Aviagen: Andreas Kranis, Kellie Watson
• Further information: www.alphagenes.roslin.ed.ac.uk, @GregorGorjanc, @HickeyJohn
• Vacancies: two post-doc positions currently available
Funding