Introduction to Biostatistics The 2nd Lecture


[email protected]

2021/4/16

Simple regression analysis

Changes in one variable may affect another, such as the relationship between breeding and cultivation conditions and the growth of animals and plants. One of the statistical methods to model the relationship between such variables is regression analysis. By statistically modeling the relationship between variables, it becomes possible to understand the causal relationship that exists between variables, and to predict one variable from another. Here, first, we will discuss simple regression analysis, which models the relationship between two variables as a “linear relationship”. The mechanism of simple regression analysis will be explained using the analysis of rice data (Zhao et al. 2011, Nature Communications 2: 467) as an example.

First, read the rice data in the same way as before. Before entering the following commands, change your R working directory to the directory (folder) where the two input files (RiceDiversityPheno.csv, RiceDiversityLine.csv) are located.

# this data set was analyzed in Zhao 2011 (Nature Communications 2:467)
pheno <- read.csv("RiceDiversityPheno.csv")
line <- read.csv("RiceDiversityLine.csv")
line.pheno <- merge(line, pheno, by.x = "NSFTV.ID", by.y = "NSFTVID")
head(line.pheno)[,1:12]

##   NSFTV.ID GSOR.ID        IRGC.ID Accession.Name Country.of.origin  Latitude
## 1        1  301001 To be assigned       Agostano             Italy 41.871940
## 2        3  301003         117636  Ai-Chiao-Hong             China 27.902527
## 3        4  301004         117601       NSF-TV 4             India 22.903081
## 4        5  301005         117641       NSF-TV 5             India 30.472664
## 5        6  301006         117603       ARC 7229             India 22.903081
## 6        7  301007 To be assigned          Arias         Indonesia -0.789275
##   Longitude Sub.population     PC1     PC2     PC3     PC4
## 1  12.56738            TEJ -0.0486  0.0030  0.0752 -0.0076
## 2 116.87256            IND  0.0672 -0.0733  0.0094 -0.0005
## 3  87.12158            AUS  0.0544  0.0681 -0.0062 -0.0369
## 4  75.34424       AROMATIC -0.0073  0.0224 -0.0121  0.2602
## 5  87.12158            AUS  0.0509  0.0655 -0.0058 -0.0378
## 6 113.92133            TRJ -0.0293 -0.0027 -0.0677 -0.0085

Prepare the analysis data by extracting only the data used for the simple regression analysis from the data just read. Here we analyze the relationship between plant height (Plant.height) and flowering timing (Flowering.time.at.Arkansas). In addition, the principal component scores (PC1 to PC4) representing the genetic background, to be used later, are also extracted. Also, remove samples with missing values in advance.

# extract variables for regression analysis
data <- data.frame(
  height = line.pheno$Plant.height,
  flower = line.pheno$Flowering.time.at.Arkansas,
  PC1 = line.pheno$PC1,
  PC2 = line.pheno$PC2,
  PC3 = line.pheno$PC3,
  PC4 = line.pheno$PC4)
data <- na.omit(data)
head(data)

##      height    flower     PC1     PC2     PC3     PC4
## 1 110.9167  75.08333 -0.0486  0.0030  0.0752 -0.0076
## 2 143.5000  89.50000  0.0672 -0.0733  0.0094 -0.0005
## 3 128.0833  94.50000  0.0544  0.0681 -0.0062 -0.0369
## 4 153.7500  87.50000 -0.0073  0.0224 -0.0121  0.2602
## 5 148.3333  89.08333  0.0509  0.0655 -0.0058 -0.0378
## 6 119.6000 105.00000 -0.0293 -0.0027 -0.0677 -0.0085

First, visualize the relationship between the two variables.

# look at the relationship between plant height and flowering time
plot(data$height ~ data$flower)


As shown in the above figure, the earlier the flowering time, the shorter the plant height; the later the flowering time, the taller the plant height.
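The strength of this tendency can also be quantified with the correlation coefficient; this is a quick check using the base R function cor.

# correlation between flowering time and plant height
cor(data$flower, data$height)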

Let's create a simple regression model that explains the variation in plant height by the difference in flowering timing.

# perform single linear regression
model <- lm(height ~ flower, data = data)

The result of the regression analysis (the estimated model) is assigned to “model”. Use the function “summary” to display the result of the regression analysis.

# show the result
summary(model)

## 
## Call:
## lm(formula = height ~ flower, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.846 -13.718   0.295  13.409  61.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
## flower       0.67287    0.07797   8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 371 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
## F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

I will explain the results displayed by executing the above command in order.

Call:
lm(formula = height ~ flower, data = data)

This is a repeat of the command you entered earlier. If you get this output right after you type the command, it may not seem to be useful information. However, if you make multiple regression models and compare them as described later, it may be useful because you can reconfirm the model employed in the analysis. Here, assuming that the plant height is y_i and the flowering timing is x_i, the regression analysis is performed with the model

$$y_i = \mu + \beta x_i + \epsilon_i$$

As mentioned earlier, x_i is called the independent variable or explanatory variable, and y_i is called the dependent variable or response variable. μ and β are called the parameters of the regression model, and ε_i is called the error. Also, μ is called the population intercept and β is called the population regression coefficient.

In addition, since it is not possible to directly know the true values of the parameters μ and β of the regression model, estimation is performed based on samples. The estimates of the parameters μ and β, which are estimated from the sample, are called the sample intercept and the sample regression coefficient, respectively. The values of μ and β estimated from the samples are denoted by m and b, respectively. Since m and b are values estimated from the samples, they are random variables that vary depending on the samples selected by chance. Therefore, they follow probability distributions. Details will be described later.

Residuals:
    Min      1Q  Median      3Q     Max 
-43.846 -13.718   0.295  13.409  61.594

This output gives an overview of the distribution of the residuals. You can use this information to check the regression model. For example, the model assumes that the expected value (average) of the error is 0; you can check whether the median is close to it. You can also check whether the distribution is symmetrical around 0, i.e., whether the maximum and minimum, or the first and third quartiles, have almost the same absolute value. In this example, the maximum value is slightly larger than the minimum value, but otherwise no major issues are found.
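These points can also be checked directly from the residuals themselves; below is a minimal sketch using the base R functions resid, summary, and quantile.

# check the location and symmetry of the residual distribution
summary(resid(model))
quantile(resid(model), probs = c(0, 0.25, 0.5, 0.75, 1))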

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
flower       0.67287    0.07797   8.630  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimates of the parameters μ and β, i.e., m and b, and their standard errors, t values, and p values are shown. Asterisks at the end of each line represent significance levels: one star represents 5%, two stars 1%, and three stars 0.1%.

Residual standard error: 19 on 371 degrees of freedom
Multiple R-squared: 0.1672, Adjusted R-squared: 0.1649
F-statistic: 74.48 on 1 and 371 DF, p-value: < 2.2e-16

The first line shows the standard deviation of the residuals. This is the value represented by s, where s² is the estimate of the error variance σ². The second line shows the coefficient of determination R². This index and the adjusted R² represent how well the regression explains the variation of y. The third line shows the result of the F test, which represents the significance of the regression model. It is a test under the hypothesis (null hypothesis) that all regression coefficients are 0; if this p value is very small, the null hypothesis is rejected and the alternative hypothesis (the regression coefficients are not 0) is adopted.
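These statistics can also be extracted individually from the object returned by summary; the component names used below (sigma, r.squared, adj.r.squared, fstatistic) are the standard components of a summary.lm object.

# extract individual statistics from the summary
res.sum <- summary(model)
res.sum$sigma          # residual standard error (s)
res.sum$r.squared      # coefficient of determination R^2
res.sum$adj.r.squared  # adjusted R^2
res.sum$fstatistic     # F value with its degrees of freedom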

Let's look at the results of the regression analysis graphically. First, draw a scatter plot and add the regression line.

# again, plot the two variables
plot(data$height ~ data$flower)
abline(model, col = "red")

Page 5: Introduction to Biostatistics The 2nd Lecture

5

5

Next, calculate and plot the values of y obtained when the model is fitted to the data.

# calculate fitted values
height.fit <- fitted(model)
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")


The values of y calculated by fitting the model all lie on a straight line.

An observed value y is expressed as the sum of the variation explained by the regression model and the error which is not explained by the regression. Let's visualize the error in the figure and check this relationship.

# plot residuals
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")
segments(data$flower, height.fit,
         data$flower, height.fit + resid(model), col = "gray")


The value of y is expressed as the sum of the value of y calculated by fitting the model (green points) and the residual of the model (gray line segments).
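This decomposition can be verified numerically; fitted and resid are base R functions, and unname is used only to drop the element names before comparison.

# observed values equal fitted values plus residuals
all.equal(data$height, unname(fitted(model) + resid(model)))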

Let's use the regression model to predict y for x = (60, 80, ..., 140), values which were not actually observed.

# predict unknown data
height.pred <- predict(model, data.frame(flower = seq(60, 140, 20)))
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")
segments(data$flower, height.fit,
         data$flower, height.fit + resid(model), col = "gray")
points(seq(60, 140, 20), height.pred, pch = 2, col = "blue")


All the predicted values again lie on the regression line.

Quiz 1

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

Go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture. Then, register your nickname and wait for the quiz to start.

Method for calculating the parameters of a regression model

Here we will explain how the parameters of a regression model are calculated. Let's also calculate the regression coefficients ourselves using R commands.

As mentioned earlier, the simple regression model is

$$y_i = m + b x_i + \epsilon_i$$

There are various criteria for what is considered “optimal”; here we consider minimizing the error ε_i across the data. Since errors can take both positive and negative values, they cancel each other out in a simple sum. So we consider minimizing the sum of squared errors (SSE). That is, consider the m and b that minimize the following equation:

$$SSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left\{ y_i - (m + b x_i) \right\}^2 \qquad (1)$$

The following figure shows the change in SSE for various values of m and b. The commands to draw the figure are a little complicated, but they are as follows:

# visualize the plane for optimization
x <- data$flower
y <- data$height
m <- seq(0, 100, 1)
b <- seq(0, 2, 0.02)
sse <- matrix(NA, length(m), length(b))
for(i in 1:length(m)) {
  for(j in 1:length(b)) {
    sse[i, j] <- sum((y - m[i] - b[j] * x)^2)
  }
}
persp(m, b, sse, col = "green")

Draw the graph using the “plotly” package.

# draw the figure with plotly
library(plotly)  # load the plotly package (assumed to be installed)
df <- data.frame(m, b, sse)
plot_ly(data = df, x = ~m, y = ~b, z = ~sse) %>% add_surface()

It should be noted that at the point where SSE reaches its minimum in the figure above, SSE should not change (the slope of the tangent is zero) even when m or b changes slightly. Therefore, the coordinates of the minimum point can be determined by partially differentiating equation (1) with respect to m and b and setting the derivatives to zero. That is, we should obtain the values of m and b that satisfy

$$\frac{\partial SSE}{\partial m} = 0, \qquad \frac{\partial SSE}{\partial b} = 0$$

The method of calculating the parameters of a regression model by minimizing the sum of squared errors in this way is called the least squares method.
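As a numerical check of the least squares solution, SSE can also be minimized directly with the general-purpose optimizer optim and compared with the coefficients from lm. This is a sketch; the function name sse.fun and the starting values c(0, 0) are arbitrary choices.

# minimize SSE numerically and compare with the lm estimates
sse.fun <- function(par, x, y) sum((y - par[1] - par[2] * x)^2)
optim(c(0, 0), sse.fun, x = data$flower, y = data$height)$par  # approx. least squares solution
coef(model)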


Note that the m minimizing SSE satisfies

$$\begin{aligned}
\frac{\partial SSE}{\partial m} &= -2\sum_{i=1}^{n}(y_i - m - b x_i) = 0 \\
\Leftrightarrow\ &\sum_{i=1}^{n} y_i - n m - b \sum_{i=1}^{n} x_i = 0 \\
\Leftrightarrow\ &m = \frac{\sum_{i=1}^{n} y_i}{n} - b\,\frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - b\bar{x}
\end{aligned}$$

Also, the b minimizing SSE satisfies

$$\begin{aligned}
\frac{\partial SSE}{\partial b} &= -2\sum_{i=1}^{n} x_i (y_i - m - b x_i) = 0 \\
\Leftrightarrow\ &\sum_{i=1}^{n} x_i y_i - m \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0 \\
\Leftrightarrow\ &\sum_{i=1}^{n} x_i y_i - n(\bar{y} - b\bar{x})\bar{x} - b \sum_{i=1}^{n} x_i^2 = 0 \\
\Leftrightarrow\ &\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - b\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = 0 \\
\Leftrightarrow\ &b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{SSXY}{SSX}
\end{aligned}$$

Here, SSXY and SSX are the sum of products of deviations of x and y, and the sum of squares of deviations of x, respectively:

$$\begin{aligned}
SSXY &= \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\
&= \sum_{i=1}^{n} x_i y_i - \bar{x}\sum_{i=1}^{n} y_i - \bar{y}\sum_{i=1}^{n} x_i + n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{y}\bar{x} + n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{aligned}$$

$$\begin{aligned}
SSX &= \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}$$

The values of m and b that minimize SSE are the estimates of the parameters; let these estimates simply be denoted m and b. That is,

$$b = \frac{SSXY}{SSX}, \qquad m = \bar{y} - b\bar{x}$$

Now let's calculate the regression coefficients based on the above equations. First, calculate the sum of products of deviations and the sum of squares of deviations.

# calculate sum of squares (ss) of x and ss of xy
n <- length(x)
ssx <- sum(x^2) - n * mean(x)^2
ssxy <- sum(x * y) - n * mean(x) * mean(y)

First, we calculate the slope b.

# calculate b
b <- ssxy / ssx
b

## [1] 0.6728746

Then calculate the intercept m.

# calculate m
m <- mean(y) - b * mean(x)
m

## [1] 58.05464

Let's draw the regression line based on the calculated estimates.

# draw scatter plot and regression line
plot(y ~ x)
abline(m, b)


Let's make sure that the regression line is the same as the one obtained with the function lm used earlier.

Note that once the regression parameters μ and β are estimated, it is possible to calculate ŷ_i, the value of y corresponding to a given x_i. That is,

$$\hat{y}_i = m + b x_i$$

This makes it possible to calculate the value of y when the model is fitted to the observed x, or to predict y when only the value of x is known. Here, let's calculate the values of y when the model is fitted to the observed x, and draw a scatter plot of the observed versus fitted values.

# calculate fitted values
y.hat <- m + b * x
lim <- range(c(y, y.hat))
plot(y, y.hat, xlab = "Observed", ylab = "Fitted", xlim = lim, ylim = lim)
abline(0, 1)


Let's calculate the correlation coefficient between the two to find the degree of agreement between the observed and fitted values.

# calculate correlation between observed and fitted values
cor(y, y.hat)

## [1] 0.408888

In fact, the square of this correlation coefficient is the proportion of the variation of y explained by the regression (the coefficient of determination, or R² value). Let's compare these two statistics.

# compare the square of the correlation and R2
cor(y, y.hat)^2

## [1] 0.1671894

summary(model)

## 
## Call:
## lm(formula = height ~ flower, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.846 -13.718   0.295  13.409  61.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
## flower       0.67287    0.07797   8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 371 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
## F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

Quiz 2

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

If you closed the page of the quiz, go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture.

Significance test of a regression model

When the linear relationship between variables is strong, a regression line fits the observations well, and the relationship between the two variables can be modeled well by the regression line. However, when the linear relationship between variables is not clear, modeling with a regression line does not work well. Here, as a method to objectively confirm the goodness of fit of an estimated regression model, we will explain a test using analysis of variance.

First, let's go back to the simple regression again.

model <- lm(height ~ flower, data = data)

The significance of the obtained regression model can be tested using the function anova.

# analysis of variance of regression
anova(model)

## Analysis of Variance Table
## 
## Response: height
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## flower      1  26881 26881.5  74.479 < 2.2e-16 ***
## Residuals 371 133903   360.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As a result of the analysis of variance, the flowering time term is highly significant (p < 0.001), confirming the goodness of fit of the regression model in which flowering timing influences plant height.

Analysis of variance for regression models involves the following calculations. First of all, the “sum of squares explained by the regression” can be calculated as follows:

$$\begin{aligned}
SSR &= \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}\left\{ m + b x_i - (m + b\bar{x}) \right\}^2 \\
&= b^2 \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
&= b^2 \cdot SSX = b \cdot SSXY
\end{aligned}$$

Also, the sum of squares of deviations from the mean of the observed values y is expressed as the sum of the sum of squares SSR explained by the regression and the residual sum of squares SSE. That is,

$$\begin{aligned}
SSY &= \sum_{i=1}^{n}(y_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
&= SSE + SSR
\end{aligned}$$

since the cross term vanishes:

$$\begin{aligned}
2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
&= 2\sum_{i=1}^{n}(y_i - m - b x_i)\left\{ m + b x_i - (m + b\bar{x}) \right\} \\
&= 2b\sum_{i=1}^{n}\left\{ y_i - (\bar{y} - b\bar{x}) - b x_i \right\}(x_i - \bar{x}) \\
&= 2b\sum_{i=1}^{n}\left\{ y_i - \bar{y} - b(x_i - \bar{x}) \right\}(x_i - \bar{x}) \\
&= 2b\,(SSXY - b \cdot SSX) = 0
\end{aligned}$$

Let's actually calculate these using the above equations. First, calculate SSR and SSE.

# calculate sum of squares of regression and error
ssr <- b * ssxy
ssr

## [1] 26881.49

ssy <- sum(y^2) - n * mean(y)^2
sse <- ssy - ssr
sse

## [1] 133903.2


Next, calculate the mean squares, which are the sums of squares divided by their degrees of freedom.

# calculate mean squares of regression and error
msr <- ssr / 1
msr

## [1] 26881.49

mse <- sse / (n - 2)
mse

## [1] 360.9251

Finally, the mean square of the regression is divided by the mean square of the error to calculate the F value. Furthermore, calculate the p value corresponding to the calculated F value.

# calculate F value
f.value <- msr / mse
f.value

## [1] 74.47943

# calculate p value for the F value
1 - pf(f.value, 1, n - 2)

## [1] 2.220446e-16

The results obtained agree with the results calculated earlier using the function anova.

These results are also included in the output of the regression analysis displayed using the function “summary”.

# check the summary of the result of regression analysis
summary(model)

## 
## Call:
## lm(formula = height ~ flower, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.846 -13.718   0.295  13.409  61.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
## flower       0.67287    0.07797   8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 371 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
## F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

“Residual standard error” is the square root of the mean square of the residuals.


# square root of mse
sqrt(mse)

## [1] 18.99803

“Multiple R-squared” (R²) is a value called the coefficient of determination, which is the ratio of SSR to SSY.

# R squared
ssr / ssy

## [1] 0.1671894

“Adjusted R-squared” (R²_adj) is a value called the adjusted coefficient of determination, which can be calculated as follows.

# adjusted R squared
(ssy / (n - 1) - mse) / (ssy / (n - 1))

## [1] 0.1649446

Also, “F-statistic” matches the F value and its p value shown as the effect of flowering time in the analysis of variance. In addition, squaring the t value calculated for the regression coefficient of the flowering time term gives the F value (8.630² = 74.477).
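This relation can be confirmed numerically; coefficients and fstatistic are standard components of a summary.lm object.

# the squared t value of the flower term equals the F value
summary(model)$coefficients["flower", "t value"]^2
summary(model)$fstatistic[1]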

R² and R²_adj can also be expressed using SSR, SSY, and SSE as follows:

$$R^2 = \frac{SSR}{SSY} = 1 - \frac{SSE}{SSY}$$

$$R^2_{adj} = 1 - \frac{n-1}{n-p} \cdot \frac{SSE}{SSY}$$

Here, p is the number of parameters included in the model; p = 2 for a simple regression model. It can be seen that R²_adj applies a larger adjustment (weighs the residual sum of squares more heavily) as the number of parameters included in the model increases.
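As a check, the formula with p reproduces the adjusted R² computed earlier; here p = 2, and n, sse, and ssy are the objects calculated above.

# adjusted R squared from the formula
p <- 2
1 - (n - 1) / (n - p) * sse / ssy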

Quiz 3

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

If you closed the page of the quiz, go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture.

Distributions that the estimated regression coefficients follow

As mentioned earlier, the estimates m and b of the regression parameters μ and β are values estimated from samples and are random variables that depend on the samples chosen by chance. Thus, the estimates m and b have probability distributions. Here we consider the distributions that the estimates follow.

The estimate b follows the normal distribution

$$b \sim N\!\left(\beta,\ \frac{\sigma^2}{SSX}\right)$$

Here, σ² is the residual variance, σ² = Var(y_i) = Var(ε_i).


The estimate m follows the normal distribution

$$m \sim N\!\left(\mu,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{SSX}\right]\right)$$

Although the true value of the error variance σ² is unknown, it can be replaced by the residual variance s². That is,

$$s^2 = \frac{SSE}{n-2}$$

This value is the mean square of the residuals calculated during the analysis of variance.
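A small simulation can illustrate the distribution of b: if data are repeatedly generated from the model with a known β and σ, the estimates of b scatter around β with variance close to σ²/SSX. This is a sketch; the parameter values (intercept 58, slope 0.7, sd 19) are arbitrary choices near the estimates obtained above.

# simulate the sampling distribution of b
set.seed(123)
beta <- 0.7
sigma <- 19
x.sim <- data$flower
ssx.sim <- sum((x.sim - mean(x.sim))^2)
b.sim <- replicate(1000, {
  y.sim <- 58 + beta * x.sim + rnorm(length(x.sim), sd = sigma)
  sum((x.sim - mean(x.sim)) * y.sim) / ssx.sim  # least squares slope
})
mean(b.sim)       # should be close to beta
var(b.sim)        # should be close to sigma^2 / SSX
sigma^2 / ssx.sim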

At this time, the statistic on b

$$t = \frac{b - b_0}{s / \sqrt{SSX}}$$

follows the t distribution with n − 2 degrees of freedom under the null hypothesis

$$H_0:\ \beta = b_0$$

First, test the null hypothesis H₀: β = 0 for b.

# test beta = 0
t.value <- (b - 0) / sqrt(mse/ssx)
t.value

## [1] 8.630147

This statistic follows the t distribution with 371 degrees of freedom under the null hypothesis. Let's draw a graph of the distribution.

# visualize the t distribution under H0
s <- seq(-10, 10, 0.2)
plot(s, dt(s, n - 2), type = "l")
abline(v = t.value, col = "green")
abline(v = - t.value, col = "gray")


Compared with the distribution followed under the null hypothesis, the obtained value of t appears to be large. Let's perform a t test. Note that it is a two-tailed test.

# perform t test
2 * (1 - pt(abs(t.value), n - 2)) # two-sided test

## [1] 2.220446e-16

The result of this test was already displayed in the regression analysis results.

# check the summary of the model
summary(model)

## 
## Call:
## lm(formula = height ~ flower, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.846 -13.718   0.295  13.409  61.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
## flower       0.67287    0.07797   8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 371 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
## F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

The hypothesis test performed above can be performed for any β₀. For example, let's test the null hypothesis H₀: β = 0.5.

# test beta = 0.5
t.value <- (b - 0.5) / sqrt(mse/ssx)
t.value

## [1] 2.217253

2 * (1 - pt(abs(t.value), n - 2))

## [1] 0.02721132

The result is significant at the 5% level.

Now let us test m. First, let's test the null hypothesis H₀: μ = 0.

# test mu = 0
t.value <- (m - 0) / sqrt(mse * (1/n + mean(x)^2 / ssx))
t.value

## [1] 8.383389

2 * (1 - pt(abs(t.value), n - 2))

## [1] 1.110223e-15

This result was also already calculated.

# check the summary of the model again
summary(model)

## 
## Call:
## lm(formula = height ~ flower, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.846 -13.718   0.295  13.409  61.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.05464    6.92496   8.383 1.08e-15 ***
## flower       0.67287    0.07797   8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 371 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
## F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

(It may be due to rounding error that the p values do not match completely.)


Finally, let's test the null hypothesis H₀: μ = 70.

# test mu = 70
t.value <- (m - 70) / sqrt(mse * (1/n + mean(x)^2 / ssx))
t.value

## [1] -1.724971

2 * (1 - pt(abs(t.value), n - 2))

## [1] 0.08536545

The result was not significant even at the 5% level.

Quiz 4

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

If you closed the page of the quiz, go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture.

Confidence intervals for regression coefficients and fitted values

The function “predict” has various uses. First, let's use the function with the estimated regression model alone. Then, the values of y when the model is fitted to the observed data are calculated. The values ŷ are exactly the same as those calculated by the function “fitted”.

# fitted values
pred <- predict(model)
head(pred)

##        1        2        3        4        5        6 
## 108.5763 118.2769 121.6413 116.9312 117.9966 128.7065

head(fitted(model))

##        1        2        3        4        5        6 
## 108.5763 118.2769 121.6413 116.9312 117.9966 128.7065

By setting the options “interval” and “level”, you can calculate the confidence interval of y at the specified confidence level when fitting the model (the 95% confidence interval at the default setting).

# calculate confidence interval
pred <- predict(model, interval = "confidence", level = 0.95)
head(pred)

##        fit      lwr      upr
## 1 108.5763 105.8171 111.3355
## 2 118.2769 116.3275 120.2264
## 3 121.6413 119.4596 123.8230
## 4 116.9312 114.9958 118.8665
## 5 117.9966 116.0540 119.9391
## 6 128.7065 125.4506 131.9623

Let's visualize the 95% confidence interval of y using the function “predict”.


# draw confidence bands
pred <- data.frame(flower = 50:160)
pc <- predict(model, interval = "confidence", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "red")

Next, we will consider the case of predicting unknown data. Let us use the predict function to draw the 95% prediction interval, i.e., the interval for the value of y that would be observed for unknown data. The predicted value ỹ is subject to additional variation due to error.

pc <- predict(model, interval = "prediction", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "green")


With the added error, the variability of the predicted ỹ is greater than that of the estimated ŷ.

Note that for a particular x, the confidence intervals for the estimated y, i.e., ŷ, and the predicted y, i.e., ỹ, can be found as follows. Here, we find the 99% intervals when x = 120.

# estimate the confidence intervals for the estimate and prediction of y
pred <- data.frame(flower = 120)
predict(model, interval = "confidence", newdata = pred, level = 0.99)

##        fit      lwr      upr
## 1 138.7996 131.8403 145.7589

predict(model, interval = "prediction", newdata = pred, level = 0.99)

##        fit      lwr      upr
## 1 138.7996 89.12106 188.4781

Quiz 5

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

If you closed the page of the quiz, go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture.


Polynomial regression model and multiple regression model

So far, we have applied to the data a regression model that represents the relationship between two variables with a straight line. Let's extend the regression model a bit.

First, let's perform regression by a method called polynomial regression. In polynomial regression, the model is

$$y_i = \mu + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_q x_i^q + \epsilon_i$$

In this way, regression is performed using second- or higher-order terms of x. First, let's perform regression using the first- and second-order terms of x.

# polynomial regression
model.quad <- lm(height ~ flower + I(flower^2), data = data)
summary(model.quad)

## 
## Call:
## lm(formula = height ~ flower + I(flower^2), data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.57 -13.60   0.97  12.91  64.83 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.082326  27.019440  -1.076 0.282473    
## flower        2.662663   0.601878   4.424 1.28e-05 ***
## I(flower^2)  -0.011130   0.003339  -3.333 0.000945 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.74 on 370 degrees of freedom
## Multiple R-squared:  0.1915, Adjusted R-squared:  0.1871 
## F-statistic: 43.81 on 2 and 370 DF,  p-value: < 2.2e-16

It can be seen that the proportion of the variation of y explained by the model (the coefficient of determination R²) is larger for the polynomial regression model than for the simple regression model.

However, as will be mentioned later, you should not judge that the polynomial regression model is superior based only on this value. This is because a polynomial regression model has more parameters than a simple regression model, and thus more flexibility when fitting the model to data. It is easy to improve the fit of the model to the data by increasing its flexibility. In extreme cases, the model can fit the data completely when it has as many parameters as the size of the data (in that case, the coefficient of determination R² is exactly 1). Therefore, careful selection based on statistical criteria is required when selecting the best model. This will be discussed later.
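The extreme case can be illustrated with a tiny sketch using made-up data: a polynomial with as many parameters as data points reproduces the data exactly.

# a 5-parameter polynomial fits 5 data points perfectly (R^2 = 1)
x.toy <- 1:5
y.toy <- c(2.1, 3.9, 6.2, 7.8, 10.3)  # arbitrary toy values
model.toy <- lm(y.toy ~ x.toy + I(x.toy^2) + I(x.toy^3) + I(x.toy^4))
summary(model.toy)$r.squared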

# plot(data$height ~ data$flower)
pred <- data.frame(flower = 50:160)
pc <- predict(model.quad, interval = "confidence", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "red")


When the timing of flowering is over 120 days after sowing, it can be seen that the reliability of the model is low.

Next, let's plot the fitted values of the simple and polynomial regression models against the observed values.

# compare predicted and observed values
lim <- range(c(data$height, fitted(model), fitted(model.quad)))
plot(data$height, fitted(model), xlab = "Observed", ylab = "Expected",
     xlim = lim, ylim = lim)
points(data$height, fitted(model.quad), col = "red")
abline(0, 1)


The above figure shows the relationship between the fitted and observed values for the simple regression model (black) and the second-order polynomial model (red).

Let's test whether the improvement in the explanatory power of the second-order polynomial model is statistically significant. The significance is tested with an F test of whether the difference between the residual sums of squares of the two models is sufficiently large relative to the residual sum of squares of the model that contains the other (here, Model 2 contains Model 1).

# compare error variance between two models
anova(model, model.quad)

## Analysis of Variance Table
## 
## Model 1: height ~ flower
## Model 2: height ~ flower + I(flower^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    371 133903                                  
## 2    370 129999  1    3903.8 11.111 0.0009449 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that the difference in residual variance between the two models is highly significant (p < 0.001). In other words, Model 2 has significantly more explanatory power than Model 1.

Now let's fit a third-order polynomial regression model and test whether it is significantly more descriptive than the second-order model.


# extend polynomial regression model to a higher dimensional one...
model.cube <- lm(height ~ flower + I(flower^2) + I(flower^3), data = data)
summary(model.cube)

## 
## Call:
## lm(formula = height ~ flower + I(flower^2) + I(flower^3), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.699 -13.705   1.031  13.240  65.840 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.001e+02  8.541e+01  -1.172   0.2419  
## flower       5.029e+00  2.765e+00   1.818   0.0698 .
## I(flower^2) -3.664e-02  2.929e-02  -1.251   0.2118  
## I(flower^3)  8.898e-05  1.015e-04   0.877   0.3813  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.75 on 369 degrees of freedom
## Multiple R-squared:  0.1931, Adjusted R-squared:  0.1866 
## F-statistic: 29.44 on 3 and 369 DF,  p-value: < 2.2e-16

# compare error variance between two models
anova(model.quad, model.cube)

## Analysis of Variance Table
## 
## Model 1: height ~ flower + I(flower^2)
## Model 2: height ~ flower + I(flower^2) + I(flower^3)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    370 129999                           
## 2    369 129729  1    270.17 0.7685 0.3813

The third-order model has slightly better explanatory power than the second-order model. However, the difference is not statistically significant. In other words, extending the second-order model to a third-order model is not a good idea.

Finally, let's apply a multiple linear regression model:

$$y_i = \mu + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_q x_{qi} + \epsilon_i$$

In this way, regression is performed using multiple explanatory variables (x_{1i}, x_{2i}, ..., x_{qi}). In the first lecture, we confirmed graphically that plant height varies depending on the genetic background. Here, we will create a multiple regression model that explains plant height using the genetic background expressed as the scores of four principal components (PC1 to PC4).

# multi-linear regression with genetic background
model.wgb <- lm(height ~ PC1 + PC2 + PC3 + PC4, data = data)
summary(model.wgb)

## 
## Call:
## lm(formula = height ~ PC1 + PC2 + PC3 + PC4, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.89 -11.65   0.15  11.05  72.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.2608     0.8811 133.080  < 2e-16 ***
## PC1         181.6572    18.2977   9.928  < 2e-16 ***
## PC2          83.5334    17.9920   4.643 4.79e-06 ***
## PC3         -88.6432    18.1473  -4.885 1.55e-06 ***
## PC4         122.1351    18.2476   6.693 8.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17 on 368 degrees of freedom
## Multiple R-squared:  0.3388, Adjusted R-squared:  0.3316 
## F-statistic: 47.14 on 4 and 368 DF,  p-value: < 2.2e-16

anova(model.wgb)

## Analysis of Variance Table
## 
## Response: height
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## PC1         1  28881 28881.3  99.971 < 2.2e-16 ***
## PC2         1   5924  5924.2  20.506 8.040e-06 ***
## PC3         1   6723  6723.2  23.272 2.063e-06 ***
## PC4         1  12942 12942.3  44.799 8.163e-11 ***
## Residuals 368 106314   288.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see that the coefficient of determination of this regression model is higher than that of the polynomial regression model. The results of the analysis of variance show that all principal components are significant and need to be included in the regression.

Finally, let's combine the polynomial regression model with the multiple regression model.

# multi-linear regression with all information
model.all <- lm(height ~ flower + I(flower^2) + PC1 + PC2 + PC3 + PC4,
                data = data)
summary(model.all)

## 
## Call:
## lm(formula = height ~ flower + I(flower^2) + PC1 + PC2 + PC3 + 
##     PC4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.589 -10.431   0.542  10.326  65.390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.739160  24.725955   1.041  0.29857    
## flower        1.633185   0.543172   3.007  0.00282 ** 
## I(flower^2)  -0.006598   0.002974  -2.219  0.02713 *  
## PC1         141.214491  18.547296   7.614 2.29e-13 ***
## PC2          83.552448  17.231568   4.849 1.84e-06 ***
## PC3         -45.310663  18.647979  -2.430  0.01559 *  
## PC4         119.638954  17.369423   6.888 2.48e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.17 on 366 degrees of freedom
## Multiple R-squared:  0.4045, Adjusted R-squared:  0.3947 
## F-statistic: 41.43 on 6 and 366 DF,  p-value: < 2.2e-16

# compare error variance
anova(model.all, model.wgb)

## Analysis of Variance Table
## 
## Model 1: height ~ flower + I(flower^2) + PC1 + PC2 + PC3 + PC4
## Model 2: height ~ PC1 + PC2 + PC3 + PC4
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    366  95753                                 
## 2    368 106314 -2    -10561 20.184 4.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The effect of the genetic background on plant height is very large, but it can also be seen that the model's explanatory power improves if the effect of flowering timing is also added.

Lastly, let's compare the first regression model and the final multiple regression model by plotting the observed values against the fitted values.

# compare between the simplest and final models
lim <- range(data$height, fitted(model), fitted(model.all))
plot(data$height, fitted(model), xlab = "Observed", ylab = "Fitted",
     xlim = lim, ylim = lim)
points(data$height, fitted(model.all), col = "red")
abline(0, 1)


As a result, we can see that the explanatory power of the model is greatly improved by considering the genetic background and the second-order term. On the other hand, it can also be seen that the two varieties and lines whose flowering timing is late (after 180 days) cannot be sufficiently explained even by the final model. There may be room to improve the model, such as adding new factors as independent variables.

Quiz 6

Now, let's solve a practice question here. Practice questions will be presented in the lecture.

If you closed the page of the quiz, go to https://www.menti.com/ie9s1dzmdz and type in the number I will tell you in the lecture.

Experimental design and analysis of variance

When trying to draw conclusions based on experimental results, there is always the presence of errors in the observed values. Errors are inevitable no matter how precise the experiment is; especially in field experiments, errors are caused by small environmental variations in the field. Therefore, experimental design is a method devised to obtain objective conclusions without being affected by errors.

First of all, what is most important in planning experiments is Fisher's three principles:

1. Replication: In order to be able to perform statistical tests on experimental results, we repeat the same process. For example, evaluate one variety multiple times. The experimental unit corresponding to one replication is called a plot.


2. Randomization: An operation that makes the effect of errors random is called randomization. For example, in the field trial example, varieties are randomly assigned to plots in the field using dice or random numbers.

3. Local control: Local control means dividing the field into blocks and managing the environmental conditions in each block to be as homogeneous as possible. In the example of the field trial, the area of the field is divided into small units called blocks to make the cultivation environment within each block as homogeneous as possible. It is easier to homogenize each block than to homogenize the cultivation environment of the whole field.

The experimental method of dividing the field into several blocks and making the cultivation environment as homogeneous as possible within the blocks is called the randomized block design. In the randomized block design, the field is divided into blocks, and varieties are randomly assigned within each block. The number of blocks is equal to the number of replications.

Next, I will explain the method of statistical testing in the randomized block design through a simple simulation. First, let's set the “seed” of the random number generator before starting the simulation. A random seed is a source value for generating pseudorandom numbers.

# set a seed for random number generation
set.seed(12)

Let's start the simulation. Here, consider a field where 16 plots are arranged in a 4 × 4 grid, and think about a situation in which there is a gradient of soil fertility in the field.

# The blocks have unequal fertility among them
field.cond <- matrix(rep(c(4, 2, -2, -4), each = 4), nrow = 4)
field.cond

##      [,1] [,2] [,3] [,4]
## [1,]    4    2   -2   -4
## [2,]    4    2   -2   -4
## [3,]    4    2   -2   -4
## [4,]    4    2   -2   -4

Here, it is assumed that there is an effect of +4 where the soil fertility is high and −4 where it is low.

Now, we arrange blocks according to Fisher's three principles. The blocks are arranged to reflect the difference in soil fertility well.

# set block to consider the heterogeneity of field condition
block <- c("I", "II", "III", "IV")
blomat <- matrix(rep(block, each = 4), nrow = 4)
blomat

##      [,1] [,2] [,3]  [,4]
## [1,] "I"  "II" "III" "IV"
## [2,] "I"  "II" "III" "IV"
## [3,] "I"  "II" "III" "IV"
## [4,] "I"  "II" "III" "IV"

Next, randomly arrange the varieties within each block according to Fisher's three principles. Let's prepare for that first.

# assume that there are four varieties
variety <- c("A", "B", "C", "D")


# sample the order of the four varieties randomly
sample(variety)

## [1] "B" "D" "C" "A"

sample(variety)

## [1] "C" "B" "A" "D"

Let's allocate the varieties randomly to each block.

# allocate the varieties randomly to each column of the field
varmat <- matrix(c(sample(variety), sample(variety),
                   sample(variety), sample(variety)), nrow = 4)
varmat

##      [,1] [,2] [,3] [,4]
## [1,] "D"  "B"  "B"  "D" 
## [2,] "B"  "A"  "D"  "A" 
## [3,] "C"  "D"  "C"  "B" 
## [4,] "A"  "C"  "A"  "C"

Consider the differences in the genetic values of the four varieties. Let the genetic values of varieties A to D be +4, +2, −2, and −4, respectively.

# simulate genetic ability of the varieties
g.value <- matrix(NA, 4, 4)
g.value[varmat == "A"] <- 4
g.value[varmat == "B"] <- 2
g.value[varmat == "C"] <- -2
g.value[varmat == "D"] <- -4
g.value

##      [,1] [,2] [,3] [,4]
## [1,]   -4    2    2   -4
## [2,]    2    4   -4    4
## [3,]   -2   -4   -2    2
## [4,]    4   -2    4   -2

Environmental variations are generated as random numbers from a normal distribution with a mean of 0 and a standard deviation of 2.5.

# simulate error variance (variation due to the heterogeneity of local environment)
e.value <- matrix(rnorm(16, sd = 2.5), 4, 4)
e.value

##           [,1]      [,2]       [,3]       [,4]
## [1,] -1.547611  2.232424  0.1911757  1.8861892
## [2,] -2.207789  3.909922 -2.1251164 -0.8860432
## [3,]  1.098536  2.524477 -3.2760349 -1.1552031
## [4,]  3.110199 -0.690938 -4.2153578  4.7493152

Although the above command generates random numbers, I think you will get the same values as the textbook. This is because the generated random numbers are pseudorandom numbers and are generated according to certain rules. Note that if you change the value of the random seed, the same values as above will not be generated. Also, different random numbers are generated each time you run the command.

Finally, the overall mean, the gradient of soil fertility, the genetic values of the varieties, and the variation due to the local environment are added together to generate simulated observed values of the trait.

# simulate phenotypic values
grand.mean <- 50
simyield <- grand.mean + field.cond + g.value + e.value
simyield

##          [,1]     [,2]     [,3]     [,4]
## [1,] 48.45239 56.23242 50.19118 43.88619
## [2,] 53.79221 59.90992 41.87488 49.11396
## [3,] 53.09854 50.52448 42.72397 46.84480
## [4,] 61.11020 49.30906 47.78464 48.74932

Before performing the analysis of variance, reshape the data from matrices into vectors and rebundle them.

# unfold a matrix to a vector
as.vector(simyield)

##  [1] 48.45239 53.79221 53.09854 61.11020 56.23242 59.90992 50.52448 49.30906
##  [9] 50.19118 41.87488 42.72397 47.78464 43.88619 49.11396 46.84480 48.74932

as.vector(varmat)

## [1] "D" "B" "C" "A" "B" "A" "D" "C" "B" "D" "C" "A" "D" "A" "B" "C"

as.vector(blomat)

##  [1] "I"   "I"   "I"   "I"   "II"  "II"  "II"  "II"  "III" "III" "III" "III"
## [13] "IV"  "IV"  "IV"  "IV"

Below, the data are bundled into a data frame.

# create a dataframe for the analysis of variance
simdata <- data.frame(variety = as.vector(varmat),
                      block = as.vector(blomat),
                      yield = as.vector(simyield))
head(simdata, 10)

##    variety block    yield
## 1        D     I 48.45239
## 2        B     I 53.79221
## 3        C     I 53.09854
## 4        A     I 61.11020
## 5        B    II 56.23242
## 6        A    II 59.90992
## 7        D    II 50.52448
## 8        C    II 49.30906
## 9        B   III 50.19118
## 10       D   III 41.87488


Let's plot the created data using the function interaction.plot.

# draw interaction plot
interaction.plot(simdata$block, simdata$variety, simdata$yield)

It can be seen that the difference between blocks is as large as the difference between varieties.

Let's perform an analysis of variance using the prepared data.

# perform the analysis of variance (ANOVA) with simulated data
res <- aov(yield ~ block + variety, data = simdata)
summary(res)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## block        3 239.11   79.70  11.614 0.00190 **
## variety      3 159.52   53.17   7.748 0.00728 **
## Residuals    9  61.77    6.86                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that both the block and variety effects are highly significant. Note that the former is not the subject of testing; it is incorporated into the model in order to estimate the variety effect correctly.
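The estimated effects themselves can be inspected with model.tables, a standard helper for aov objects.

# show the estimated block and variety effects
model.tables(res, type = "effects")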

The analysis of variance described above can also be performed using the function “lm” for estimating regression models.


# perform ANOVA with a linear model
res <- lm(yield ~ block + variety, data = simdata)
anova(res)

## Analysis of Variance Table
## 
## Response: yield
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## block      3 239.109  79.703 11.6138 0.001898 **
## variety    3 159.518  53.173  7.7479 0.007285 **
## Residuals  9  61.765   6.863                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Randomized complete block design and completely randomized design

Local control, one of Fisher's three principles, is very important for performing highly accurate experiments in fields with high heterogeneity between plots. Here, assuming the same environmental conditions as before, let's consider performing an experiment without setting up blocks.

In the previous simulation experiment, we treated each column as a block and placed A, B, C, and D randomly within that block. Here we will assign the plots of the 4 varieties × 4 replicates completely randomly across the whole field. An experiment in which no blocks are arranged and plots are arranged completely randomly is called a “completely randomized design.”

# completely randomize the plots of each variety in a field
varmat.crd <- matrix(sample(varmat), nrow = 4)
varmat.crd

##      [,1] [,2] [,3] [,4]
## [1,] "A"  "C"  "C"  "D" 
## [2,] "B"  "A"  "B"  "D" 
## [3,] "A"  "C"  "B"  "A" 
## [4,] "D"  "B"  "C"  "D"

This time, note that the frequency of appearance of each variety varies from row to row, since the varieties are randomly assigned across the entire field.

The genetic effects are assigned according to the positions of the varieties in the completely random arrangement.

# simulate genetic ability of the varieties
g.value.crd <- matrix(NA, 4, 4)
g.value.crd[varmat.crd == "A"] <- 4
g.value.crd[varmat.crd == "B"] <- 2
g.value.crd[varmat.crd == "C"] <- -2
g.value.crd[varmat.crd == "D"] <- -4
g.value.crd

##      [,1] [,2] [,3] [,4]
## [1,]    4   -2   -2   -4
## [2,]    2    4    2   -4
## [3,]    4   -2    2    4
## [4,]   -4    2   -2   -4


As in the previous simulation experiment, the overall mean, the gradient of soil fertility, the genetic effects of the varieties, and the variation due to the local environment are summed up.

# simulate phenotypic values
simyield.crd <- grand.mean + g.value.crd + field.cond + e.value
simyield.crd

##          [,1]     [,2]     [,3]     [,4]
## [1,] 56.45239 52.23242 46.19118 43.88619
## [2,] 53.79221 59.90992 47.87488 41.11396
## [3,] 59.09854 52.52448 46.72397 48.84480
## [4,] 53.11020 53.30906 41.78464 46.74932

The data are bundled into a data frame.

# create a dataframe for the analysis of variance
simdata.crd <- data.frame(variety = as.vector(varmat.crd),
                          yield = as.vector(simyield.crd))
head(simdata.crd, 10)

##    variety    yield
## 1        A 56.45239
## 2        B 53.79221
## 3        A 59.09854
## 4        D 53.11020
## 5        C 52.23242
## 6        A 59.90992
## 7        C 52.52448
## 8        B 53.30906
## 9        C 46.19118
## 10       B 47.87488

Now let's perform the analysis of variance on the data generated in the simulation. Unlike the previous experiment, we did not set blocks. Thus, we perform a regression analysis with a model that only includes the varietal effect and does not include the block effect.

# perform ANOVA
res <- lm(yield ~ variety, data = simdata.crd)
anova(res)

## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## variety    3 218.12  72.705  3.1663 0.06392 .
## Residuals 12 275.55  22.962                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the above example, the varietal effect is not significant. This is considered to be because the spatial heterogeneity in the field makes the error large, so that the genetic differences between varieties cannot be estimated with sufficient accuracy.

The above simulation experiment was repeated 100 times (the code is shown below). As a result, in the experiments using the randomized complete block design, the varietal effect was detected (at the 5% significance level) in 94 out of 100 experiments, but it was detected only 54 times with the completely randomized design. In addition, when the significance level was set to 1%, the varietal effect was detected 74 and 21 times, respectively (in the case of the completely randomized design, the varietal effect was missed 79 times!). From these results, it can be seen that adopting the randomized complete block design is effective when there is among-replication heterogeneity such as a gradient of soil fertility. In order to make a time-consuming and labor-intensive experiment as efficient as possible, it is important to design the experiment properly.

# perform multiple simulations
n.rep <- 100
p.rbd <- rep(NA, n.rep)
p.crd <- rep(NA, n.rep)
for(i in 1:n.rep) {
  # experiment with randomized block design
  varmat <- matrix(c(sample(variety), sample(variety),
                     sample(variety), sample(variety)), nrow = 4)
  g.value <- matrix(NA, 4, 4)
  g.value[varmat == "A"] <- 4
  g.value[varmat == "B"] <- 2
  g.value[varmat == "C"] <- -2
  g.value[varmat == "D"] <- -4
  e.value <- matrix(rnorm(16, sd = 2.5), 4, 4)
  simyield <- grand.mean + field.cond + g.value + e.value
  simdata <- data.frame(variety = as.vector(varmat),
                        block = as.vector(blomat),
                        yield = as.vector(simyield))
  res <- lm(yield ~ block + variety, data = simdata)
  p.rbd[i] <- anova(res)$Pr[2]

  # experiment with completely randomized design
  varmat.crd <- matrix(sample(varmat), nrow = 4)
  g.value.crd <- matrix(NA, 4, 4)
  g.value.crd[varmat.crd == "A"] <- 4
  g.value.crd[varmat.crd == "B"] <- 2
  g.value.crd[varmat.crd == "C"] <- -2
  g.value.crd[varmat.crd == "D"] <- -4
  simyield.crd <- grand.mean + g.value.crd + field.cond + e.value
  simdata.crd <- data.frame(variety = as.vector(varmat.crd),
                            yield = as.vector(simyield.crd))
  res <- lm(yield ~ variety, data = simdata.crd)
  p.crd[i] <- anova(res)$Pr[1]
}
sum(p.rbd < 0.05) / n.rep

## [1] 0.94

sum(p.crd < 0.05) / n.rep

## [1] 0.54

sum(p.rbd < 0.01) / n.rep

## [1] 0.74

sum(p.crd < 0.01) / n.rep

## [1] 0.21


Report assignment

1. Using the number of seeds per panicle (Seed.number.per.panicle) as the dependent variable y and panicle length (Panicle.length) as the independent variable x, fit a simple regression model (y_i = μ + βx_i + ε_i) and find the sample intercept m and the sample regression coefficient b, which are estimates of μ and β.

2. Test the null hypothesis H₀: β = 0.02.

3. Test the null hypothesis H₀: μ = 5.

4. Answer the 95% confidence intervals of the estimated y, i.e., ŷ, and the predicted y, i.e., ỹ, for x = 27.

5. Fit the polynomial regression model y_i = μ + β₁x_i + β₂x_i² + ε_i using the first- and second-order terms of x, and answer the coefficient of determination R² and the adjusted coefficient of determination R²_adj.

6. Compare the regression model in 5 with the regression model in 1 by analysis of variance, and consider the validity of including the second-order term of x in the regression model.

7. Fit the polynomial regression model y_i = μ + β₁x_i + β₂x_i² + β₃x_i³ + ε_i using the first- to third-order terms of x, and answer the coefficient of determination R² and the adjusted coefficient of determination R²_adj.

8. Compare the regression model in 7 with the regression model in 5 by analysis of variance, and examine the validity of extending the second-order polynomial regression model to a third-order polynomial regression model.

Submission method:

• Create a report as a PDF file and submit it to ITC-LMS.
• When you cannot submit your report to ITC-LMS due to some issue, send the report to [email protected].
• Make sure to write your affiliation, student number, and name at the beginning of the report.
• The deadline for submission is May 14th.