Lab notes for Statistics for Social Sciences II: Multivariate Techniques

BSc in International Studies and BSc in International Studies & Political Science, Carlos III University of Madrid, 2016/2017

Eduardo García-Portugués

Last updated: 2022-02-19, v12.4


Contents

1 Introduction
  1.1 Software employed
  1.2 Why this software?
  1.3 Installation in your own computer
  1.4 R Commander basics
  1.5 Datasets for the course
  1.6 Main references and credits

2 Simple linear regression
  2.1 Examples and applications
  2.2 Some R basics
  2.3 Model formulation and estimation by least squares
  2.4 Assumptions of the model
  2.5 Inference for the model coefficients
  2.6 Prediction
  2.7 ANOVA and model fit
  2.8 Nonlinear relationships
  2.9 Exercises and case studies

3 Multiple linear regression
  3.1 Examples and applications
  3.2 Model formulation and estimation by least squares
  3.3 Assumptions of the model
  3.4 Inference for model parameters
  3.5 Prediction
  3.6 ANOVA and model fit
  3.7 Model selection
  3.8 Model diagnostics and multicollinearity

4 Logistic regression
  4.1 More R basics
  4.2 Examples and applications
  4.3 Model formulation and estimation by maximum likelihood
  4.4 Assumptions of the model
  4.5 Inference for model parameters
  4.6 Prediction
  4.7 Deviance and model fit
  4.8 Model selection and multicollinearity

5 Principal component analysis
  5.1 Examples and applications

6 Cluster analysis
  6.1 𝑘-means clustering
  6.2 Agglomerative hierarchical clustering

A Glossary of important R commands

B Use of qualitative predictors in regression

C Multinomial logistic regression

D Reporting with R and R Commander

E Group project


Chapter 1

Introduction

Welcome to the lab notes for Statistics for Social Sciences II: Multivariate Techniques. Throughout these notes we will see how to effectively implement the statistical methods presented in the lectures. The exposition we follow is based on learning by analyzing datasets and real-case studies, always with the help of statistical software. While doing so, we will illustrate the key insights of some multivariate techniques and the adequate use of advanced statistical software.

Be advised that these notes are neither an exhaustive, rigorous nor comprehensive treatment of the broad statistical branch known as Multivariate Analysis. They are just a helpful resource for implementing the specific topics covered in this limited course.

1.1 Software employed

The software we will employ in this course is available in all UC3M computer labs. We will use two pieces of software:

• R. A free open-source software environment for statistical computing and graphics. Virtually all the statistical methods you can think of are available in R. Currently, it is the dominant statistical software (at least among statisticians).

• R Commander. A Graphical User Interface (GUI) designed to make R accessible to non-specialists through friendly menus. Essentially, it translates simple menu instructions into R code.

The only thing you need to do to run R Commander in any UC3M computer is:

1. Run 'Start' -> 'R3.3.1' -> 'R3.3.1 (consola)'. A black console will open. Do not panic!

2. Type inside


library(Rcmdr)

Congratulations on your first piece of R code, you have just loaded a package!

3. If Rcmdr is installed, then R Commander will open automatically and you are ready to go. In case you accidentally close R Commander, type

Commander()

If Rcmdr is not installed, then type

install.packages("Rcmdr", dep = TRUE)

and say 'Yes' to the next pop-ups regarding the installation of the personal library. This will download and install Rcmdr and all the related packages from a CRAN mirror ('Spain (A Coruña) [https]' usually works fine – try a different one if you experience problems). Wait for the downloading and installation of the packages. When it is done, just type

library(Rcmdr)

and you are ready to go.

An important warning about UC3M computer labs:

Every file you save locally (including installed packages) will be wiped out after you close your session. So be sure to save your valuable files at the end of the lesson.

The exception is the folder 'C:/TEMP', where all the files you save will be accessible to everyone that logs in to the computer!

In UC3M computers, R and R Commander are only available in Spanish. To have them in English, you need to do a workaround:

1. Create a shortcut to R in your desktop. To do so, go to 'C:/Archivos de programa/R/R-3.3.1/bin/', right-click on 'R.exe' and choose 'Enviar a' -> 'Escritorio (crear acceso directo)'.

2. Modify the properties of the shortcut. Right-click on the shortcut and choose 'Propiedades'. Then append to the 'Destino' field the text Language=en (separated by a space, see Figure 1.1). Click 'Aplicar' and then 'OK'.

3. Run that shortcut and then type library(Rcmdr).

Tip: if you save the modified shortcut in 'C:/TEMP', it could be available the next time you log in.


Alternatively, you can bring your own laptop and save all your files in it, see Section 1.3.

1.2 Why this software?

There are many advanced commercial statistical software packages, such as SPSS, Excel (with commercial add-ons), Minitab, Stata, SAS, etc. We will rely on the combo R (R Core Team, 2015) + R Commander (Fox, 2005) due to some noteworthy advantages:

1. Free and open-source. (Free as in beer, free as in speech.) No software licenses are needed. This means that you can readily use it outside UC3M computer labs, without limitations on the period or purpose of use.

2. Scalable complexity and extensibility. R Commander creates R code that you can see and eventually understand. Once you begin to get a feeling for it, you will realize that it is faster to type the right commands than to navigate through menus. In addition, R Commander has 39 high-quality plug-ins (September, 2016), so the procedures available through menus will not fall short easily.

3. R is the leading computer language in statistics. Any statistical analysis that you can imagine is already available in R through its almost 9000 free packages (September, 2016). Some of them contain a good number of ready-to-use datasets or methods for data acquisition from accredited sources.

4. R Commander produces high-quality graphs easily. R Commander, through the plug-in KMggplot2, interfaces the ggplot2 library, which delivers high-quality, publication-level graphs (sample gallery). It is considered one of the best and most elegant graphing packages available nowadays.

5. Great report generation. R Commander integrates R Markdown, which is a framework able to create .html, .pdf and .docx reports directly from the outputs of R. That means you can deliver high-quality, reproducible and beautiful reports with little effort. For example, these notes have been created with an extension of R Markdown.

In summary, R Commander eases the learning curve of R and provides a powerful way of creating and reporting statistical analyses. An intermediate knowledge of R Commander + R will notably improve your quantitative skills, therefore making an important distinction in your graduate profile (it is a fact that many social scientists tend to lack a proper quantitative formation). So I encourage you to take full advantage of this great opportunity!


1.3 Installation in your own computer

You are allowed to bring your own laptop to the labs. This may have a series of benefits, such as admin privileges, saving all your files locally and a deeper familiarization with the software. But keep in mind:

If you plan to use your personal laptop, you are responsible for the right setup of the software (and laptop) prior to the lab lesson.

Regardless of your choice, at some point you will probably need to run the software outside UC3M computer labs. This is what you have to do in order to install R + R Commander in your own computer:

1. In Mac OS X, first download and install XQuartz and log out and back on your Mac OS X account (this is an important step). Be sure that your Mac OS X system is up-to-date.

2. Download the latest version of R for Windows or Mac OS X.

3. Install R. In Windows, be sure to select the 'Startup options' and then choose 'SDI' in the 'Display Mode' options. Leave the rest of the installation options as default.

4. Open R ('R x64 X.X.X' in 64-bit Windows, 'R i386 X.X.X' in 32-bit Windows and 'R.app' in Mac OS X) and type:

install.packages(c("Rcmdr", "RcmdrMisc", "RcmdrPlugin.TeachingDemos",
                   "RcmdrPlugin.FactoMineR", "RcmdrPlugin.KMggplot2"),
                 dep = TRUE)

Say 'Yes' to the pop-ups regarding the installation of the personal library and choose the CRAN mirror (the server from which you will download packages). 'Spain (A Coruña) [https]' usually works fine – try a different one if you experience problems.

5. To launch the R Commander, run R and then

library(Rcmdr)

Mac OS X users. To prevent an occasional freezing of R and R Commander by the OS, go to 'Tools' -> 'Manage Mac OS X app nap for R.app...' and select 'off (recommended)'.

If you are a Linux user, follow the corresponding instructions here and here.

By default, R and R Commander will have menus and messages in the language of your OS. If you want them in English, a simple option is to change the OS language to English and reboot. If you want to stick with your OS language, other options are:

• Windows. Create a shortcut in your desktop, either to 'R x64 X.X.X' or to 'R i386 X.X.X'. Add a distinctive descriptor to its name, for example 'R x64 X.X.X ENGLISH'. Then right-click on it, select 'Properties' and append to the 'Target' field the text Language=en (separated by a space, see Figure 1.1). Analogously, use Language=es for Spanish, Language=it for Italian, etc. Click 'Apply' and then 'OK'. Use this shortcut to launch R (and then R Commander) in the chosen language.

Figure 1.1: Modification of the R shortcut properties in Windows.

• Mac OS X. Open R.app and simply run

system("defaults write org.R-project.R force.LANG en_GB.UTF-8")

Then close 'R.app' and relaunch it. Analogously, replace en_GB above by es_ES or it_IT if you want to switch back to Spanish or Italian, for example.

1.4 R Commander basics

When you start R Commander you will see a window similar to Figure 1.2. This GUI has the following items:

1. Drop-down menus. They are pretty self-explanatory. A quick summary:


Figure 1.2: Main window of R Commander, with the plug-ins RcmdrPlugin.FactoMineR, RcmdrPlugin.KMggplot2 and RcmdrPlugin.TeachingDemos loaded.

• 'File'. Saving options for different files (.R, .Rmd, .txt output and .RData). The latter corresponds to saving the workspace of the session.

• 'Edit'. Basic editing within the GUI text boxes ('Copy', 'Paste', 'Undo', …).

• 'Data'. Import, manage and manipulate datasets.

• 'Statistics'. Perform statistical analyses, such as 'Summaries' and 'Fit models'.

• 'Graphs'. Compute the available graphs (depending on the kind of variables) for the active dataset.

• 'Models'. Graphical, numerical and inferential analyses for the active model.

• 'Distributions'. Operations for continuous/discrete distributions: sampling, computation of probabilities and quantiles and plotting of density and distribution functions.

• 'Tools'. Options for R Commander. Here you can 'Load Rcmdr plug-in(s)...', which lets you expand the number of menus available via plug-ins. R Commander will need to restart prior to loading a plug-in. (A minor inconvenience is that the text boxes in the GUI will be wiped out after restarting. The workspace is kept after restarting, so the models and datasets will be available – but not selected as active.)

• 'Help'. Several help resources.

2. Dataset manipulation. Select the active dataset among the list of loaded datasets. Edit (very basic) and view the active dataset.

3. Model selector. Select an active model among the available ones to work with.

4. Switch to script or report mode. Switch between the generated R code and the associated R Markdown code.

5. Input panel. The R code generated by the drop-down menus appears here and is passed to the output console. You can type code here and run it without using the drop-down menus.

6. Submit/Generate report button. Passes and runs the selected R code to the output console (keyboard shortcut: 'Control' + 'R').

7. Output console. Shows the commands that were run (red) and their visible result (blue), if any.

8. Messages. Displays possible error messages, warnings and notes.

When you close R Commander you will be asked which files to save: the 'script file' and 'R Markdown file' (contents of the two tabs of panel 5, respectively) and the 'output file' (panel 7). If you want to save the workspace, you have to do it through the 'File' menu.

Focus on understanding the purpose of each element in the GUI. When performing the real-case studies we will take care of explaining the many features of R Commander step by step.

1.5 Datasets for the course

This is a handy list with a small description and download link for all the relevant datasets used in the course. To download them, simply save the link as a file in your browser.

• pisa.csv (download). Contains 65 rows corresponding to the countries that took part in the PISA study. Each row has the variables Country, MathMean, MathShareLow, MathShareTop, ReadingMean, ScienceMean, GDPp, logGDPp and HighIncome. The logGDPp is the logarithm of the GDPp, which is taken in order to avoid scale distortions.

• US_apportionment.xlsx (download). Contains the 50 US states entitled to representation in the US House of Representatives. The recorded variables are State, Population2010 and Seats2013–2023.

• EU_apportionment.txt (download). Contains 28 rows with the member states of the EU (Country), the number of seats assigned under different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011) and the states' populations (Population2010, Population2013).

• least-squares.RData (download). Contains a single data.frame, named leastSquares, with 50 observations of the variables x, yLin, yQua and yExp. These are generated as 𝑋 ∼ 𝒩(0, 1), 𝑌lin = −0.5 + 1.5𝑋 + 𝜀, 𝑌qua = −0.5 + 1.5𝑋² + 𝜀 and 𝑌exp = −0.5 + 1.5 ⋅ 2^𝑋 + 𝜀, with 𝜀 ∼ 𝒩(0, 0.5²). The purpose of the dataset is to illustrate the least squares fitting.

• assumptions.RData (download). Contains the data frame assumptions with 200 observations of the variables x1, …, x9 and y1, …, y9. The purpose of the dataset is to identify which regression y1 ~ x1, …, y9 ~ x9 fulfills the assumptions of the linear model. The dataset moreAssumptions.RData (download) has the same structure.

• cpus.txt (download) and gpus.txt (download). The datasets contain 102 and 35 rows, respectively, of commercial CPUs and GPUs released from the first models up to the present. The variables in the datasets are Processor, Transistor count, Date of introduction, Manufacturer, Process and Area.

• hap.txt (download). Contains data for 20 advanced economies in the time period 1946–2009, measured for 31 variables. Among those, the variable dRGDP represents the real GDP growth (as a percentage) and debtgdp represents the percentage of public debt with respect to the GDP.

• wine.csv (download). The dataset is formed by the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year) and the population of France in the year of the vintage (FrancePop).

• Boston.xlsx (download). The dataset contains 14 variables describing 506 suburbs in Boston. Among those variables, medv is the median house value, rm is the average number of rooms per house and crim is the per capita crime rate. The full description is available in ?Boston.

• assumptions3D.RData (download). Contains the data frame assumptions3D with 200 observations of the variables x1.1, …, x1.8, x2.1, …, x2.8 and y.1, …, y.8. The purpose of the dataset is to identify which regression y.1 ~ x1.1 + x2.1, …, y.8 ~ x1.8 + x2.8 fulfills the assumptions of the linear model.

• challenger.txt (download). Contains data for 23 Space Shuttle launches. There are 8 variables. Among them: temp, the temperature in Celsius degrees at the time of launch, and fail.field and fail.nozzle, indicators of whether there were incidents in the O-rings of the field joints and nozzles of the solid rocket boosters.

• eurojob.txt (download). Contains data for employment in 26 European countries. There are 9 variables, giving the percentage of employment in 9 sectors: Agr (Agriculture), Min (Mining), Man (Manufacture), Pow (Power), Con (Construction), Ser (Services), Fin (Finance), Soc (Social) and Tra (Transport).

• Chile.txt (download). Contains data for 2700 respondents to a survey on voting intentions in the 1988 Chilean national plebiscite. There are 8 variables: region, population, sex, age, education, income, statusquo (scale of support for the status quo) and vote. vote is a factor with levels A (abstention), N (against Pinochet), U (undecided), Y (for Pinochet). Available in R through the package car and data(Chile).

• USArrests.txt (download). Arrest statistics for Assault, Murder and Rape in each of the 50 US states in 1973. The percent of the population living in urban areas, UrbanPop, is also given. Available in R through data(USArrests).

• USJudgeRatings.txt (download). Lawyers' ratings of state judges in the US Superior Court. The dataset contains 43 observations of 12 variables measuring the performance of a judge when conducting a trial. Available in R through data(USJudgeRatings).

• la-liga-2015-2016.xlsx (download). Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.

• pisaUS2009.csv (download). Reading score of 3663 US students in the PISA test, with 23 variables informing about the student profile and family background.
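Although the import menus (see Section 2.1) are the intended workflow, the plain-text and .RData files above can also be read directly from the console. A minimal sketch, assuming the files sit in your current working directory; the separator and header choices below are assumptions that must match the actual files:

# Comma-separated file with a header row (pisa.csv)
pisa <- read.csv("pisa.csv", header = TRUE)

# Plain-text file with a header row (EU_apportionment.txt)
EU <- read.table("EU_apportionment.txt", header = TRUE)

# .RData files restore the objects they contain (here, the data frame leastSquares)
load("least-squares.RData")

# Quick inspection of what was imported
str(pisa)
head(EU)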

1.6 Main references and credits

The following great reference books have been used extensively for preparing these notes:

• James et al. (2013) (linear regression, logistic regression, PCA, clustering),
• Peña (2002) (linear regression, logistic regression, PCA, clustering),
• Bartholomew et al. (2008) (PCA).


The icons used in the notes were designed by madebyoliver, freepik and roundicons from Flaticon.

In addition, these notes are possible due to the existence of these incredible pieces of software: Xie (2016a), Xie (2016b), Allaire et al. (2016) and R Core Team (2015).

All material in these notes is licensed under CC BY-NC-SA 4.0.


Chapter 2

Simple linear regression

The simple linear regression is a simple but useful statistical model. In short, it allows us to analyze, in a proper way, the (assumed) linear relation between two variables, 𝑋 and 𝑌. It does so by considering the model

𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀

which in Chapter 3 will be extended to multiple linear regression.
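To make the model concrete, here is a minimal simulation sketch (the coefficient values −0.5 and 1.5 and the error standard deviation 0.5 are arbitrary choices for illustration, not part of any course dataset):

# Simulate n observations from Y = beta0 + beta1 * X + eps
set.seed(1)                   # for reproducibility
n <- 100
beta0 <- -0.5; beta1 <- 1.5   # "true" coefficients (arbitrary)
x <- rnorm(n)                 # predictor X ~ N(0, 1)
eps <- rnorm(n, sd = 0.5)     # error eps ~ N(0, 0.5^2)
y <- beta0 + beta1 * x + eps  # response generated by the model

# Fitting the model by least squares recovers estimates close to beta0 and beta1
lm(y ~ x)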

To convince you why simple linear regression is useful, let's begin by seeing what it can do in real-case scenarios!

2.1 Examples and applications

2.1.1 Case study I: PISA scores and GDPp

The Programme for International Student Assessment (PISA) is a study carried out by the Organization for Economic Co-operation and Development (OECD) in 65 countries with the purpose of evaluating the performance of 15-year-old pupils on mathematics, science and reading. A phenomenon observed over the years is that wealthy countries tend to achieve larger average scores. The purpose of this case study, motivated by the OECD (2012a) report, is to answer two questions related to the previous statement:

• Q1. Is the educational level of a country influenced by its economic wealth?
• Q2. If so, to what precise extent?

The pisa.csv file (download) contains 65 rows corresponding to the countries that took part in the PISA study. The data was obtained by merging the statlink in OECD (2012b) with The World Bank (2012) data. Each row has the following variables: Country; MathMean, ReadingMean and ScienceMean (the average performance of the students in mathematics, reading and science); MathShareLow and MathShareTop (percentages of students with a low and top performance in mathematics); GDPp and logGDPp (the Gross Domestic Product per capita and its logarithm); HighIncome (whether the country has a GDPp larger than $20000 or not). The GDPp of a country is a measure of how many economic resources are available per citizen. The logGDPp is the logarithm of the GDPp, taken in order to avoid scale distortions. A small subset of the data is shown in Table 2.1.

Table 2.1: First 10 rows of the pisa dataset for a selection of variables. Note the NA (Not Available) in Chinese Taipei (or Taiwan).

Country               MathMean  ReadingMean  ScienceMean   logGDPp  HighIncome
Shanghai-China             613          570          580   8.74267       FALSE
Singapore                  573          542          551  10.90506        TRUE
Hong Kong SAR, China       561          545          555  10.51074        TRUE
Chinese Taipei             560          523          523        NA          NA
Korea                      554          536          538  10.10455        TRUE
Macao SAR, China           538          509          521  11.25344        TRUE
Japan                      536          538          547  10.75152        TRUE
Liechtenstein              535          516          525  11.91278        TRUE
Switzerland                531          509          515  11.32911        TRUE
Netherlands                523          511          522  10.80922        TRUE

We definitely need a way of summarizing this amount of information!

We are going to do the following. First, import the data into R Commander and do a basic manipulation of it. Second, fit a linear model and interpret its output. Finally, visualize the fitted line and the data.

1. Import the data into R Commander.

• Go to 'Data' -> 'Import data' -> 'from text file, clipboard, or URL...'. A window like Figure 2.1 will pop up. Select the appropriate formatting options of the data file: whether the first row contains the names of the variables, what is the indicator for missing data, what is the field separator and what is the decimal point character. Then click 'OK'.

– Inspecting the data file in a text editor will give you the right formatting choices for importing the data.

• Click on 'View data set' to check that the importation was fine. If the data looks weird, then recheck the structure of the data file and restart from the above point.


Figure 2.1: Data importation options.

• Since each row corresponds to a different country, we are going to name the rows with the value of the variable Country. To that end, go to 'Data' -> 'Active data set' -> 'Set case names...', select the variable Country and click 'OK'. The dataset should look like Figure 2.2. (The R code equivalent to this whole importation step is sketched after this list.)

Figure 2.2: Correct importation of the pisa dataset.

– In UC3M computers, altering the location of a downloaded file may cause errors in its importation to R Commander!

Example:


∗ Default download path: 'C:/Users/g15s4021/Downloads/pisa.csv'. Importation from that path works fine.

∗ If you move the file to another location (e.g. to 'C:/Users/g15s4021/Desktop/pisa.csv'), the importation generates an error.
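For reference, a minimal sketch of the R code that these two menu actions generate; the separator, decimal and missing-value options are assumptions that must match what you selected in Figure 2.1:

# 'Import data' -> 'from text file, clipboard, or URL...'
pisa <- read.table("C:/Users/g15s4021/Downloads/pisa.csv", header = TRUE,
                   sep = ",", dec = ".", na.strings = "NA")

# 'Set case names...': use Country as the row names and drop it as a column
rownames(pisa) <- pisa$Country
pisa$Country <- NULL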

2. Fit a simple linear regression.

• Go to 'Statistics' -> 'Fit models' -> 'Linear regression...'. A window like Figure 2.3 will pop up.

Figure 2.3: Window for performing simple linear regression.

Select the response variable. This is the variable denoted by 𝑌 that we want to predict/explain. Then select the explanatory variable (also known as the predictor). It is denoted by 𝑋 and is the variable used to predict/explain 𝑌. Recall the form of the linear model:

𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀

In our case 𝑌 = MathMean and 𝑋 = logGDPp, so select them and click 'OK'1.

– If you want to deselect an option in an R Commander menu, use 'Control' + 'Mouse click'.

– Four buttons are common in the menus of R Commander:

∗ 'OK': executes the selected action, then closes the window.

∗ 'Apply': executes the selected action but leaves the window open. Useful if you are experimenting with different options.

1In principle, you could pick more than one explanatory variable using the 'Control' or 'Shift' keys, but that corresponds to the multiple linear regression (covered in Chapter 3).


∗ 'Reset': resets the fields and boxes of the window to their defaults.

∗ 'Cancel': exits the window without performing any action.

• The window in Figure 2.3 generates this code and output:

pisaLinearModel <- lm(MathMean ~ logGDPp, data = pisa)
summary(pisaLinearModel)

## 
## Call:
## lm(formula = MathMean ~ logGDPp, data = pisa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.924  -29.109    1.381   20.239  176.166 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   185.16      61.36   3.018  0.00369 ** 
## logGDPp        28.79       6.13   4.696 1.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.48 on 62 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2624, Adjusted R-squared:  0.2505 
## F-statistic: 22.06 on 1 and 62 DF,  p-value: 1.512e-05

This is the linear model of MathMean regressed on logGDPp (first line) and its summary (second line). The summary gives the coefficients of the line and the 𝑅² ('Multiple R-squared'), which – as we will see in Section 2.7 – can be regarded as an indicator of the strength of the linear relation between the variables. (𝑅² = 1 is a perfect linear fit – all the points lie on a line – and 𝑅² = 0 is the poorest fit.)

The fitted regression line is MathMean = 185.16 + 28.79 × logGDPp. The slope coefficient is positive, which indicates that there is a positive correlation between the wealth of a country and its performance in the PISA Mathematics test (this answers Q1). Hence, the evidence that wealthy countries tend to achieve larger average scores is indeed true (at least for the Mathematics test). We can be more precise on the effect of the wealth of a country: according to the fitted linear model, an increase of 1 unit in the logGDPp of a country is associated with achieving, on average, 28.79 additional points in the test (Q2).
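The fitted line can also be evaluated from the console. A minimal sketch using the pisaLinearModel object above (the value logGDPp = 10, roughly a GDPp of $22000, is just an illustrative choice):

# Fitted intercept and slope
coef(pisaLinearModel)

# Expected MathMean for a country with logGDPp = 10, i.e. 185.16 + 28.79 * 10
predict(pisaLinearModel, newdata = data.frame(logGDPp = 10))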


3. Visualize the fitted regression line.

• Go to 'Graphs' -> 'Scatterplot...'. A window with two panels will pop up (Figures 2.4 and 2.5).

Figure 2.4: Scatterplot window, 'Data' panel.

Figure 2.5: Scatterplot window, 'Options' panel. Remember to tick the 'Least-squares line' box in order to display the fitted regression line.

On the 'Data' panel, select the 𝑋 and 𝑌 variables to be displayed in the scatterplot. On the 'Options' panel, check the 'Least-squares line' box and choose to identify '3' points 'Automatically'2.

2The decision of which points are the most different from the rest is done automatically by a method known as the Mahalanobis depth.


This will identify the three3 most different observations of the data.

• The following R code will be generated. It produces a scatterplot of MathMean vs logGDPp, with its corresponding regression line.

scatterplot(MathMean ~ logGDPp, reg.line = lm, smooth = FALSE, spread = FALSE,
            id.method = 'mahal', id.n = 3, boxplots = FALSE, span = 0.5,
            ellipse = FALSE, levels = c(.5, .9),
            main = "Average Math score vs. logGDPp", pch = c(16), data = pisa)

[Scatterplot of MathMean vs. logGDPp with the least-squares line, titled "Average Math score vs. logGDPp".]

There are three clear outliers4: Vietnam, Shanghai-China and Qatar. The first two are non high-income economies that perform exceptionally well in the test (although Shanghai-China is a cherry-picked region of China). On the other hand, Qatar is a high-income economy that has really poor scores.

We can identify countries that are above and below the linear trend in the plot. This is particularly interesting: we can assess whether a country is performing better or worse with respect to its expected PISA score according to its economic status (this adds more insight into Q2). To do so, we want to display the text labels of the points in the scatterplot. We can take a shortcut: copy and run in the input panel the next piece of code. It is a slightly modified version of the previous code (what are the differences?).

3The default GUI option is set to identify '2' points. However, we know after a preliminary plot that there are three very different points in the dataset, hence this particular choice.

4The outliers have a considerable impact on the regression line, as we will see later.


scatterplot(MathMean ~ logGDPp, reg.line = lm, smooth = FALSE, spread = FALSE,
            id.method = 'mahal', id.n = 65, id.cex = 0.75, boxplots = FALSE,
            span = 0.5, ellipse = FALSE, levels = c(.5, .9),
            main = "Average Math score vs. logGDPp", pch = c(16), cex = 0.75,
            data = pisa)

[Scatterplot of MathMean vs. logGDPp with the least-squares line and all country labels displayed, titled "Average Math score vs. logGDPp".]

If you understood the previous analysis, then you should be able to perform the next ones on your own.

Repeat the regression analysis (steps 2–3) for:

– ReadingMean regressed on logGDPp. Are the results similar to MathMean on logGDPp?

– MathMean regressed on ReadingMean. Compare it with MathMean on ScienceMean. Which pair of variables has the highest linear relation? Is that something expected?

Save the new models with different names to avoid overwriting the previous models!

2.1.2 Case study II: Apportionment in the EU and US

Apportionment is the process by which seats in a legislative body are distributed among administrative divisions entitled to representation.

— Wikipedia article on Apportionment (politics)


The European Parliament and the US House of Representatives are two of the most important macro legislative bodies in the world. The distribution of seats in both chambers is designed to represent the different states that conform the federation (US) or union (EU). Both chambers were created under very different historical and political circumstances, which is reflected in the kinds of apportionment that they present. More specifically:

• In the US, the apportionment is neatly fixed by the US Constitution. Each of the 50 states is apportioned a number of seats that corresponds to its share of the total population of the 50 states, according to the most recent decennial census. Every state is guaranteed at least 1 seat. There are 435 seats.

• Until now, the apportionment in the EU was set by treaties (Nice, Lisbon), in which negotiations between countries took place. The last accepted composition gives an allocation of seats based on the principle of “degressive proportionality”5 and somewhat vague guidelines. It concludes with a commitment to establish a system to “allocate the seats between Member States in an objective, fair, durable and transparent way, translating the principle of degressive proportionality”. The Cambridge Compromise (Grimmett et al., 2011) was a proposal in that direction that was not effectively implemented. Currently, every state is guaranteed a minimum of 6 seats and a maximum of 96, for a grand total of 750 seats.

We know that there exist qualitative dissimilarities between both chambers, but we cannot be more specific with the description at hand. The purpose of this case study is to quantify and visualize the differences between the apportionments of the two chambers and how the simple linear regression can add insights on what is actually going on with the EU apportionment. The questions we want to answer are:

• Q1. Can we quantify which chamber is more proportional?
• Q2. What are the over-represented and under-represented states in both chambers?
• Q3. How can we quantify the ‘degressive proportionality’ in the EU apportionment system? Was the Cambridge Compromise proposing a fairer representation?

Let’s begin by reading the data:

1. The US_apportionment.xlsx file (download) contains the 50 US states entitled to representation. The variables are State, Population2010 (from the last census) and Seats2013–2023. This is an Excel file that we can read using 'Data' -> 'Import data' -> 'from Excel file...'. A window will pop up, asking for the right options. We set them as in Figure 2.6, since we want the variable State to be the case names. After clicking on 'View dataset', the data should look like Figure 2.7.

5Less populated states are given more weight than their corresponding proportional share.


Figure 2.6: Importation of an Excel file.

Figure 2.7: Correct importation of the US dataset.

2. The EU_apportionment.txt file (download) contains 28 rows with the member states of the EU (Country), the number of seats assigned under different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011) and the countries' populations6 (Population2010, Population2013).

For this file, you should know how to:

(a) Inspect the file in a text editor and determine its formatting.
(b) Decide the right importation options and load it with the name EU.
(c) Set the case names as the variable Country.

6According to EuroStat and the population stated in the Cambridge Compromise report.


Table 2.2: The EU dataset with Country set as the case names.

                Population2010  Seats2011  CamCom2011  Population2013  Seats2014
Germany               81802257         99          96        80523746         96
France                64714074         74          85        65633194         74
United Kingdom        62008048         73          81        63896071         73
Italy                 60340328         73          79        59685227         73
Spain                 45989016         54          62        46704308         54
Poland                38167329         51          52        38533299         51
Romania               21462186         33          32        20020074         32
Netherlands           16574989         26          26        16779575         26
Greece                11305118         22          19        11161642         21
Belgium               10839905         22          19        11062508         21
Portugal              10637713         22          18        10516125         21
Czech Republic        10506813         22          18        10487289         21
Hungary               10014324         22          18         9908798         21
Sweden                 9340682         20          17         9555893         20
Austria                8375290         19          16         8451860         18
Bulgaria               7563710         18          15         7284552         17
Denmark                5534738         13          12         5602628         13
Slovakia               5424925         13          12         5426674         13
Finland                5351427         13          12         5410836         13
Ireland                4467854         12          11         4591087         11
Crotia                 4425747         NA          NA         4262140         11
Lithuania              3329039         12          10         2971905         11
Latvia                 2248374          9           8         2058821          8
Slovenia               2046976          8           8         2023825          8
Estonia                1340127          6           7         1324814          6
Cyprus                  803147          6           6          865878          6
Luxembourg              502066          6           6          537039          6
Malta                   412970          6           6          421364          6

We start by analyzing the US dataset. If there is indeed a direct proportionality in the apportionment, we would expect a direct, 1:1, relation between the ratios of seats and the population per state. Let's start by constructing these variables:

1. Switch the active dataset to US. An alternative way to do so is via 'Data' -> 'Active data set' -> 'Select active data set...'.

2. Go to 'Data' -> 'Manage variables in active dataset...' -> 'Compute new variable...'.

3. Create the variable RatioSeats2013.2023 as shown in Figure 2.8. Be careful not to overwrite the variable Seats2013.2023.


Figure 2.8: Creation of the new variable RatioSeats2013.2023. The expression to compute is Seats2013.2023/sum(Seats2013.2023).

4. 'View dataset' to check that the new variable is available.

Repeat steps 1–3, conveniently adapted, to create the new variable RatioPopulation2010. (An equivalent console sketch for this step is given below.)
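For reference, a minimal console sketch of what 'Compute new variable...' does for both ratios (assuming the US dataset has already been imported as above):

# Share of seats and share of population per state
US$RatioSeats2013.2023 <- US$Seats2013.2023 / sum(US$Seats2013.2023)
US$RatioPopulation2010 <- US$Population2010 / sum(US$Population2010)

# Check the new columns
head(US[, c("RatioSeats2013.2023", "RatioPopulation2010")])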

Let's fit a regression line to the US data, with RatioSeats2013.2023 as the response and RatioPopulation2010 as the explanatory variable. If we name the model as appUS, you should get the following code and output:

appUS <- lm(RatioSeats2013.2023 ~ RatioPopulation2010, data = US)
summary(appUS)

## 
## Call:
## lm(formula = RatioSeats2013.2023 ~ RatioPopulation2010, data = US)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.118e-03 -4.955e-04 -3.144e-05  4.087e-04  1.269e-03 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.0001066  0.0001275  -0.836    0.407    
## RatioPopulation2010  1.0053307  0.0042872 234.498   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0006669 on 48 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 5.499e+04 on 1 and 48 DF,  p-value: < 2.2e-16

The fitted regression line is RatioSeats2013.2023 = 0.000 + 1.005 × RatioPopulation2010 and it has an 𝑅² = 0.9991 ('Multiple R-squared'), which means that the data is almost perfectly linearly distributed. Furthermore, the intercept coefficient is not significant for the regression. This is seen in the column 'Pr(>|t|)', which gives the 𝑝-values for the null hypotheses 𝐻0 ∶ 𝛽0 = 0 and 𝐻0 ∶ 𝛽1 = 0, respectively. The null hypothesis 𝐻0 ∶ 𝛽0 = 0 is not rejected (𝑝-value = 0.407; non-significant) whereas 𝐻0 ∶ 𝛽1 = 0 is rejected (𝑝-value ≈ 0; significant)7. Hence, we can conclude that the apportionment of seats in the US House of Representatives is indeed directly proportional to the population of each state (this partially answers Q1).
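These quantities can also be extracted programmatically instead of read off the printed summary. A minimal sketch with the appUS model fitted above:

# Estimates, standard errors, t values and p-values as a matrix
summary(appUS)$coefficients

# Coefficient of determination R^2
summary(appUS)$r.squared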

If we make the scatterplot for the US dataset, we can see the almost perfect (up to integer rounding) 1:1 relation between the ratios “state seats”/“total seats” and “state population”/“aggregated population”. We can set the scatterplot to automatically label the '25' most different points (select the numeric box with the mouse and type '25' – the arrow buttons are limited to '10') with their case names. As seen in Figure 2.9, there is no state clearly over- or under-represented (Q2).


Figure 2.9: The apportionment in the US House of Representatives compared with a linear fit.

Let's switch to the EU dataset, for which we will focus on the 2011 variables. A quick way of visualizing this dataset and, in general, of visualizing multivariate data (up to a moderate number of dimensions) is to use a matrix scatterplot. Essentially, it displays the scatterplots between all the pairs of variables. To do it, go to 'Graphs' -> 'Scatterplot matrix...' and select the variables to be displayed. If you select them as in Figures 2.10 and 2.11, you should get an output like Figure 2.12.

Figure 2.10: Scatterplot matrix window, 'Data' panel.

Figure 2.11: Scatterplot matrix window, 'Options' panel. Be sure to tick the 'Least-squares line' box in order to display the fitted regression line.
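The menu generates a call to the scatterplotMatrix function of the car package. A minimal sketch with only the essential arguments (the display options ticked in Figure 2.11 add further arguments whose names vary between car versions, so they are omitted here):

library(car)  # attached automatically by R Commander

# Matrix of pairwise scatterplots for the three selected variables of EU
scatterplotMatrix(~ CamCom2011 + Population2010 + Seats2011, data = EU)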

The scatterplot matrix has a central panel displaying one-variable summary plots: histogram, density estimate, boxplot and QQ-plot. Experiment and understand them.

Figure 2.12: Scatterplot matrix for the variables CamCom2011, Population2010 and Seats2011 of the EU dataset, with boxplots in the central panels.

The most interesting panels in Figure 2.12 for our study are CamCom2011 vs. Population2010 – panel (1,2) – and Seats2011 vs. Population2010 – panel (3,2). At first sight, it seems that the Cambridge Compromise was favoring a fairer allocation of seats than what was actually being used in the EU parliament in 2011 (recall the step-wise patterns in (3,2)). Let's explore in depth the scatterplot of Seats2011 vs Population2010.

There are some countries clearly harmed and others benefited by this apportionment. For example, France and Spain are under-represented and, on the other hand, Germany, Hungary and the Czech Republic are over-represented (Q2).

Let's compute the regression line of Seats2011 on Population2010, which we save in the model appEU2011.

appEU2011 <- lm(Seats2011 ~ Population2010, data = EU)
summary(appEU2011)

## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7031 -1.9511  0.0139  1.9799  3.2898 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.910e+00  5.661e-01   13.97 2.58e-13 ***
## Population2010 1.078e-06  1.915e-08   56.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.289 on 25 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9922, Adjusted R-squared:  0.9919 
## F-statistic:  3171 on 1 and 25 DF,  p-value: < 2.2e-16

Figure 2.13: Seats2011 vs Population2010 in the EU dataset.

The fitted line is Seats2011 = 7.91 + 1.078 × 10⁻⁶ × Population2010. The intercept is not zero and, indeed, the fitted intercept is significantly different from zero. Therefore, there is no proportionality in the apportionment. Note that the fitted slope, despite being very small (why?), is also significantly different from zero. The 𝑅² is slightly smaller than in the US dataset, but definitely very high. Two conclusions stem from this analysis:

• The US House of Representatives is a proportional chamber whereas the EU parliament is definitely not, although it is close to perfect linearity (this completes Q1).

• The principle of degressive proportionality, in practice, means an almost linear allocation of seats with respect to population (Q3). The main point is the presence of a non-zero intercept – that is, a minimum number of seats corresponding to a country – in order to over-represent smaller countries with respect to their corresponding proportional share.
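To see the effect of the non-zero intercept, a short sketch evaluating the fitted line Seats2011 = 7.91 + 1.078 × 10⁻⁶ × Population2010 for a small and a large country (Malta and France, with their 2010 populations from Table 2.2):

# Predicted seats for Malta and France from the fitted line
predict(appEU2011, newdata = data.frame(Population2010 = c(412970, 64714074)))
# Malta:  7.91 + 1.078e-6 * 412970   = about 8.4 seats (6 actual seats)
# France: 7.91 + 1.078e-6 * 64714074 = about 77.7 seats (74 actual seats)
# An exactly proportional share would give Malta less than 1 seat, so the
# intercept is precisely what over-represents the smaller countries.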

The question that remains to be answered is whether the Cambridge Compromise was favoring a fairer allocation of seats than the 2011 official agreement. In Figure 2.12 we can see that indeed it seems like that, but there is an outlier outside the linear pattern: Germany. There is an explanation for that: the EU commission imposed a cap on the maximum number of seats per country, 96, to the development of the Cambridge Compromise. With this rule, Germany is notably under-represented.

In order to avoid this distortion, we will exclude Germany from our comparison. To do so, we specify a '-1' in the 'Subset expression' field of either 'Linear regression...' or 'Scatterplot...'. This tells R to exclude the first row of the EU dataset, corresponding to Germany. Then, we compare the linear models for the official allocation, appEUNoGer2011, and the Cambridge Compromise, appCamComNoGer2011. The outputs are the following.

appEUNoGer2011 <- lm(Seats2011 ~ Population2010, data = EU, subset = -1)
summary(appEUNoGer2011)

## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU, subset = -1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5197 -2.0722 -0.2192  2.0179  3.2865 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.099e+00  5.638e-01   14.37 2.78e-13 ***
## Population2010 1.060e-06  2.212e-08   47.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.227 on 24 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9897, Adjusted R-squared:  0.9892 
## F-statistic:  2296 on 1 and 24 DF,  p-value: < 2.2e-16

appCamComNoGer2011 <- lm(CamCom2011 ~ Population2010, data = EU, subset = -1)
summary(appCamComNoGer2011)

## 
## Call:
## lm(formula = CamCom2011 ~ Population2010, data = EU, subset = -1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47547 -0.22598  0.01443  0.27471  0.46766 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.459e+00  7.051e-02   77.42   <2e-16 ***
## Population2010 1.224e-06  2.766e-09  442.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2784 on 24 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.957e+05 on 1 and 24 DF,  p-value: < 2.2e-16

We see that the Cambridge Compromise has a larger 𝑅² and a lower intercept than the official allocation of seats. This means that it favors a more proportional allocation, which is fairer in the sense that the deviations from the linear trend are smaller (Q3). We conclude the case study by illustrating both fits.


Figure 2.14: Seats2011 vs Population2010 in the EU dataset, Germany excluded.



Figure 2.15: CamCom2011 vs Population2010 in the EU dataset, Germany excluded.

In 2014 a new EU apportionment was negotiated, collected in Seats2014, according to the population of 2013, Population2013, and due to the inclusion of Croatia in the EU. Answer these questions:

– Which countries were the most favored and unfavored by such apportionment?
– Was the apportionment proportional?
– Was the degree of linearity higher or lower than the 2011 apportionment? (Exclude Germany.)
– Was the degree of linearity higher or lower than the Cambridge Compromise for 2011? (Exclude Germany.)

We have performed a decent number of operations in R Commander. If we have to exit the session, we can save the data and models in an .RData file, which contains all the objects we have computed so far (but not the code – this has to be saved separately).

To exit R Commander, save all your progress and reload it later, do:

1. Save the .RData file. Go to 'File' -> 'Save R workspace as...'.


2. Save the .R file. Go to 'File' -> 'Save script as...'.

3. Exit R Commander + R. Go to 'File' -> 'Exit' -> 'From Commander and R'. Choose to not save any file.

4. Start R Commander and load your files:

– the .RData file in 'Data' -> 'Load data set...',
– the .R file in 'File' -> 'Open script file...'.

If you just want to save a dataset, you have two options:

– 'Data' -> 'Active data set' -> 'Save active data set...': it will be saved as an .RData file. The easiest way of importing it back into R.

– 'Data' -> 'Active data set' -> 'Export active data set...': it will be saved as a text file with the format that you choose. Useful for exporting data to other programs.
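The same saving and loading can be done from the console. A minimal sketch (the file names below are hypothetical examples):

# Save the whole workspace (datasets and models) and load it back later
save.image("lab-session.RData")
load("lab-session.RData")

# Save a single dataset as .RData, or export it as a text file
save(EU, file = "EU.RData")
write.table(EU, file = "EU-export.txt", sep = "\t", row.names = TRUE)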

2.2 Some R basics

By this time you have probably realized that some pieces of R code are repeated over and over, and that it is simpler to just modify them than to navigate the menus. For example, the functions lm and scatterplot always appear related with linear models and scatterplots. It is important to know some of the R basics in order to understand what these pieces of code are actually doing. Do not worry, the menus will always be there to generate the proper code for you – but you need to have a general idea of the code.

In the following sections, type – do not copy and paste systematically – the code in the 'R Script' panel and send it to the output panel (on the selected expression, either with the 'Submit' button or with 'Control' + 'R').

We begin with the lm function, since it is the one you are most used to. In the following, you should get the same outputs (which are preceded by ## [1]).

2.2.1 The lm function

We are going to employ the EU dataset from Section 2.1.2, with the case names set as the Country. In case you do not have it loaded, you can download it here as an .RData file.

# First of all, this is a comment. Its purpose is to explain what the code is doing
# Comments are preceded by a #

# lm has the syntax: lm(formula = response ~ explanatory, data = data)


# For example (you need to load first the EU dataset)
mod <- lm(formula = Seats2011 ~ Population2010, data = EU)

# We have saved the linear model into mod, which now contains all the output of lm
# You can see it by typing
mod
## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
## 
## Coefficients:
##    (Intercept)  Population2010  
##      7.910e+00       1.078e-06  

# mod is indeed a list of objects whose names are
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"        

# We can access these elements by $
# For example
mod$coefficients
##    (Intercept) Population2010 
##   7.909890e+00   1.078486e-06 

# The residuals
mod$residuals
##        Germany         France United Kingdom          Italy          Spain 
##     2.86753139    -3.70310468    -1.78469388     0.01391858    -3.50839420 
##         Poland        Romania    Netherlands         Greece        Belgium 
##     1.92718471     1.94344541     0.21421832     1.89769973     2.39942538 
##       Portugal Czech Republic        Hungary         Sweden        Austria 
##     2.61748660     2.75866040     3.28980283     2.01631620     2.05747784 
##       Bulgaria        Denmark       Slovakia        Finland        Ireland 
##     1.93275540    -0.87902697    -0.76059520    -0.68132864    -0.72840765 
##      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
##     0.49978824    -1.33472983    -2.11752493    -3.35519827    -2.77607293 
##     Luxembourg          Malta 
##    -2.45136132    -2.35527254 

# The fitted values
mod$fitted.values
##        Germany         France United Kingdom          Italy          Spain 
##      96.132469      77.703105      74.784694      72.986081      57.508394 
##         Poland        Romania    Netherlands         Greece        Belgium 
##      49.072815      31.056555      25.785782      20.102300      19.600575 
##       Portugal Czech Republic        Hungary         Sweden        Austria 
##      19.382513      19.241340      18.710197      17.983684      16.942522 
##       Bulgaria        Denmark       Slovakia        Finland        Ireland 
##      16.067245      13.879027      13.760595      13.681329      12.728408 
##      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
##      11.500212      10.334730      10.117525       9.355198       8.776073 
##     Luxembourg          Malta 
##       8.451361       8.355273 

# Summary of the model
sumMod <- summary(mod)
sumMod
## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7031 -1.9511  0.0139  1.9799  3.2898 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.910e+00  5.661e-01   13.97 2.58e-13 ***
## Population2010 1.078e-06  1.915e-08   56.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.289 on 25 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9922, Adjusted R-squared:  0.9919 
## F-statistic:  3171 on 1 and 25 DF,  p-value: < 2.2e-16

The following table contains a handy cheat sheet of equivalences between R code and some of the statistical concepts associated to linear regression.

R                                   Statistical concept
x                                   Predictor 𝑋1, …, 𝑋𝑛
y                                   Response 𝑌1, …, 𝑌𝑛
data <- data.frame(x = x, y = y)    Sample (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛)
model <- lm(y ~ x, data = data)     Fitted linear model
model$coefficients                  Fitted coefficients 𝛽0, 𝛽1
model$residuals                     Residuals 𝜀1, …, 𝜀𝑛
model$fitted.values                 Fitted values 𝑌1, …, 𝑌𝑛
model$df.residual                   Degrees of freedom 𝑛 − 2
summaryModel <- summary(model)      Summary of the fitted linear model
summaryModel$sigma                  Fitted residual standard deviation 𝜎̂
summaryModel$r.squared              Coefficient of determination 𝑅²
summaryModel$fstatistic             𝐹-test
anova(model)                        ANOVA table
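A quick usage sketch of some rows of this table, reusing the mod and sumMod objects created above:

# Elements of the fitted model
mod$coefficients     # fitted intercept and slope
mod$df.residual      # degrees of freedom, n - 2

# Elements of the summary
sumMod$sigma         # fitted residual standard deviation
sumMod$r.squared     # coefficient of determination R^2

# ANOVA table of the fit
anova(mod)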

Do the following:

– Compute the regression of CamCom2011 on Population2010. Save that model as the variable myModel.
– Access the objects residuals and coefficients of myModel.
– Compute the summary of myModel and store it as the variable summaryMyModel.
– Access the object sigma of summaryMyModel.
– Repeat the previous steps changing the names of myModel and summaryMyModel to otherMod and infoOtherMod, respectively.

Now you know how to fit and summarize a linear model with a few keystrokes. Let's see more of the basics of R – it will be useful for the next sections.

2.2.2 Simple computations

# These are some simple operations
# The console can act as a simple calculator
1.0 + 1.1
## [1] 2.1
2 * 2
## [1] 4
3/2
## [1] 1.5
2^3
## [1] 8
1/0
## [1] Inf
0/0
## [1] NaN

# Use ; for performing several operations in the same line
(1 + 3) * 2 - 1; 1 + 3 * 2 - 1
## [1] 7
## [1] 6

# Mathematical functions
sqrt(2); 2^0.5
## [1] 1.414214
## [1] 1.414214
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
exp(1)
## [1] 2.718282
log(10); log10(10); log2(10)
## [1] 2.302585
## [1] 1
## [1] 3.321928
sin(pi); cos(0); asin(0)
## [1] 1.224647e-16
## [1] 1
## [1] 0

# Remember to complete the expressions
1 +
(1 + 3
## Error: <text>:4:0: unexpected end of input
## 2: 1 +
## 3: (1 + 3
##    ^

2.2.3 Variables and assignment

# Any operation that you perform in R can be stored in a variable (or object)
# with the assignment operator "<-"
a <- 1

# To see the value of a variable, we simply type it
a
## [1] 1

# A variable can be overwritten
a <- 1 + 1

# Now the value of a is 2 and not 1, as before
a
## [1] 2


# Careful with capitalization
A
## Error in eval(expr, envir, enclos): object 'A' not found

# Different
A <- 3
a; A
## [1] 2
## [1] 3

# The variables are stored in your workspace (.RData file)
# A handy tip to see what variables are in the workspace
ls()
##  [1] "a"                  "A"                  "appCamComNoGer2011"
##  [4] "appEU2011"          "appEUNoGer2011"     "appUS"             
##  [7] "EU"                 "mod"                "pisa"              
## [10] "pisaLinearModel"    "sumMod"             "US"                
# Now you know which variables can be accessed!

# Remove variables
rm(a)
a
## Error in eval(expr, envir, enclos): object 'a' not found

Do the following:

– Store −123 in the variable b.
– Get the log of the square of b. (Answer: 9.624369)
– Remove variable b.

2.2.4 Vectors

# These are vectors - arrays of numbers
# We combine numbers with the function c
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1]  1.5  0.0  5.0 -3.4

# A handy way of creating sequences is the operator :
# Sequence from 1 to 5
1:5
## [1] 1 2 3 4 5

# Storing some vectors
myData <- c(1, 2)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
myData
## [1] 1 2
myData2
## [1] -4.12  0.00  1.10  1.00  3.00  4.00

# Entry-wise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4

# If you want to access a position of a vector, use [position]
myData[1]
## [1] 1
myData2[6]
## [1] 4

# You also can change elements
myData[1] <- 0
myData
## [1] 0 2

# Think on what you want to access...
myData2[7]
## [1] NA
myData2[0]
## numeric(0)

# If you want to access all the elements except a position, use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12  1.10  1.00  3.00  4.00

# Also with vectors as indexes
myData2[1:2]
## [1] -4.12  0.00
myData2[myData]
## [1] 0

# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0

# But do not mix positive and negative indexes!
myData2[c(-1, 2)]
## Error in myData2[c(-1, 2)]: only 0's may be mixed with negative subscripts

Do the following:

– Create the vector 𝑥 = (1, 7, 3, 4).
– Create the vector 𝑦 = (100, 99, 98, …, 2, 1).
– Compute 𝑥₂ + 𝑦₄ and cos(𝑥₃) + sin(𝑥₂)e^(−𝑦₂). (Answers: 104, -0.9899925)
– Set 𝑥₂ = 0 and 𝑦₂ = −1. Recompute the previous expressions. (Answers: 97, 2.785875)
– Index 𝑦 by 𝑥 + 1 and store it as z. What is the output? (Answer: z is c(-1, 100, 97, 96))

2.2.5 Some functions

# Functions take arguments between parenthesis and transform them into an output
sum(myData)
## [1] 2
prod(myData)
## [1] 0

# Summary of an object
summary(myData)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0     0.5     1.0     1.0     1.5     2.0

# Length of the vector
length(myData)
## [1] 2

# Mean, standard deviation, variance, covariance, correlation
mean(myData)
## [1] 1
var(myData)
## [1] 2
cov(myData, myData^2)
## [1] 4
cor(myData, myData * 2)
## [1] 1
quantile(myData)
##   0%  25%  50%  75% 100%
##  0.0  0.5  1.0  1.5  2.0

# Maximum and minimum of vectors
min(myData)
## [1] 0
which.min(myData)
## [1] 1

# Usually the functions have several arguments, which are set by "argument = value"
# In this case, the second argument is a logical flag to indicate the kind of sorting
sort(myData) # If nothing is specified, decreasing = FALSE is assumed
## [1] 0 2
sort(myData, decreasing = TRUE)
## [1] 2 0

# Don't know what are the arguments of a function? Use args and help!
args(sort)
## function (x, decreasing = FALSE, ...)
## NULL
?sort

Do the following:

– Compute the mean, median and variance of 𝑦. (Answers: 49.5, 49.5, 843.6869)
– Do the same for 𝑦 + 1. What are the differences?
– What is the maximum of 𝑦? Where is it placed?
– Sort 𝑦 increasingly and obtain the 5th and 76th positions. (Answer: c(4, 75))
– Compute the covariance between 𝑦 and 𝑦. Compute the variance of 𝑦. Why do you get the same result?

2.2.6 Matrices, data frames and lists

# A matrix is an array of vectors
A <- matrix(1:4, nrow = 2, ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Another matrix
B <- matrix(1, nrow = 2, ncol = 2, byrow = TRUE)
B
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1

# Binding by rows or columns
rbind(1:3, 4:6)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
cbind(1:3, 4:6)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

# Entry-wise operations
A + 1
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
A * B
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Accessing elements
A[2, 1] # Element (2, 1)
## [1] 2
A[1, ] # First row
## [1] 1 3
A[, 2] # Second column
## [1] 3 4

# A data frame is a matrix with column names
# Useful when you have multiple variables
myDf <- data.frame(var1 = 1:2, var2 = 3:4)
myDf
##   var1 var2
## 1    1    3
## 2    2    4

# You can change names
names(myDf) <- c("newname1", "newname2")
myDf
##   newname1 newname2
## 1        1        3
## 2        2        4

# The nice thing is that you can access variables by its name with the $ operator
myDf$newname1
## [1] 1 2

# And create new variables also (it has to be of the same
# length as the rest of variables)
myDf$myNewVariable <- c(0, 1)
myDf
##   newname1 newname2 myNewVariable
## 1        1        3             0
## 2        2        4             1

# A list is a collection of arbitrary variables
myList <- list(myData = myData, A = A, myDf = myDf)

# Access elements by names
myList$myData
## [1] 0 2
myList$A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
myList$myDf
##   newname1 newname2 myNewVariable
## 1        1        3             0
## 2        2        4             1

# Reveal the structure of an object
str(myList)
## List of 3
##  $ myData: num [1:2] 0 2
##  $ A     : int [1:2, 1:2] 1 2 3 4
##  $ myDf  :'data.frame': 2 obs. of  3 variables:
##   ..$ newname1     : int [1:2] 1 2
##   ..$ newname2     : int [1:2] 3 4
##   ..$ myNewVariable: num [1:2] 0 1
str(myDf)
## 'data.frame': 2 obs. of  3 variables:
##  $ newname1     : int  1 2
##  $ newname2     : int  3 4
##  $ myNewVariable: num  0 1

# A less lengthy output
names(myList)
## [1] "myData" "A"      "myDf"

Do the following:

– Create a matrix called M with rows given by y[3:5], y[3:5]^2 and log(y[3:5]).
– Create a data frame called myDataFrame with column names "y", "y2" and "logy" containing the vectors y[3:5], y[3:5]^2 and log(y[3:5]), respectively.
– Create a list, called l, with entries for x and M. Access the elements by their names.
– Compute the squares of myDataFrame and save the result as myDataFrame2.
– Compute the log of the sum of myDataFrame and myDataFrame2. Answer:

##          y       y2     logy
## 1 9.180087 18.33997 3.242862
## 2 9.159678 18.29895 3.238784
## 3 9.139059 18.25750 3.234656

2.3 Model formulation and estimation by least squares

The simple linear model is a statistical tool for describing the relation between two random variables, 𝑋 and 𝑌. For example, in the pisa dataset, 𝑋 could be ReadingMean and 𝑌 = MathMean. The simple linear model is constructed by assuming that the linear relation

𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀 (2.1)

holds between 𝑋 and 𝑌. In (2.1), 𝛽0 and 𝛽1 are known as the intercept and slope, respectively. 𝜀 is a random variable with mean zero and independent from 𝑋. It describes the error around the mean, or the effect of other variables that we do not model. Another way of looking at (2.1) is

𝔼[𝑌 |𝑋 = 𝑥] = 𝛽0 + 𝛽1𝑥, (2.2)

since 𝔼[𝜀|𝑋 = 𝑥] = 0.

The Left Hand Side (LHS) of (2.2) is the conditional expectation of 𝑌 given 𝑋. It represents how the mean of the random variable 𝑌 is changing according to a particular value, denoted by 𝑥, of the random variable 𝑋. With the RHS, what we are saying is that the mean of 𝑌 is changing in a linear fashion with respect to the value of 𝑋. Hence the interpretation of the coefficients:

• 𝛽0: is the mean of 𝑌 when 𝑋 = 0.
• 𝛽1: is the increment in mean of 𝑌 for an increment of one unit in 𝑋 = 𝑥.

If we have a sample (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛) for our random variables 𝑋 and 𝑌, we can estimate the unknown coefficients 𝛽0 and 𝛽1. In the pisa dataset, the sample consists of the observations for ReadingMean and MathMean. A possible way of estimating (𝛽0, 𝛽1) is by minimizing the Residual Sum of Squares (RSS):

$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.$$

In other words, we look for the estimators (𝛽̂0, 𝛽̂1) such that

$$(\hat\beta_0, \hat\beta_1) = \arg\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} \mathrm{RSS}(\beta_0, \beta_1).$$

It can be seen that the minimizers of the RSS⁸ are

$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X, \quad \hat\beta_1 = \frac{s_{xy}}{s_x^2}, \tag{2.3}$$

where:

• $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is the sample mean.
• $s_x^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ is the sample variance. The sample standard deviation is $s_x = \sqrt{s_x^2}$.
• $s_{xy} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)$ is the sample covariance. It measures the degree of linear association between $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$. Once scaled by $s_x s_y$, it gives the sample correlation coefficient, $r_{xy} = \frac{s_{xy}}{s_x s_y}$.

As a consequence of $\hat\beta_1 = r_{xy}\frac{s_y}{s_x}$, 𝛽̂1 always has the same sign as $r_{xy}$.

There are some important points hidden behind the choice of the RSS as the error criterion for obtaining (𝛽̂0, 𝛽̂1):

• Why the vertical distances and not horizontal or perpendicular? Because we want to minimize the error in the prediction of 𝑌! Note that the treatment of the variables is not symmetrical⁹.

• Why the squares in the distances and not the absolute value? Due to mathematical convenience. Squares are nice to differentiate and are closely related with the normal distribution.

⁸They are unique and always exist. They can be obtained by solving $\frac{\partial}{\partial\beta_0}\mathrm{RSS}(\beta_0, \beta_1) = 0$ and $\frac{\partial}{\partial\beta_1}\mathrm{RSS}(\beta_0, \beta_1) = 0$.
⁹In Chapter 5 we will consider perpendicular distances.

Figure 2.16 illustrates the influence of the distance employed in the sum of squares. Try to minimize the sum of squares for the different datasets. Is the best choice of intercept and slope independent of the type of distance?

Figure 2.16: The effect of the kind of distance in the error criterion. The choices of intercept and slope that minimize the sum of squared distances for a kind of distance are not the optimal for a different kind of distance. Application also available here.

The data of the figure has been generated with the following code:

# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567) # Fixes the seed for the random generator
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)

The minimizers of the error in the above illustration are indeed the coefficients given by the lm function. Check this for the three types of responses: yLin, yQua and yExp.

The population regression coefficients, (𝛽0, 𝛽1), are not the same as the estimated regression coefficients, (𝛽̂0, 𝛽̂1):

– (𝛽0, 𝛽1) are the theoretical and always unknown quantities (except under controlled scenarios).
– (𝛽̂0, 𝛽̂1) are the estimates computed from the data. In particular, they are the output of lm. They are random variables, since they are computed from the random sample (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛).

In an abuse of notation, the term regression line is often used to denote both the theoretical (𝑦 = 𝛽0 + 𝛽1𝑥) and the estimated (𝑦 = 𝛽̂0 + 𝛽̂1𝑥) regression lines.

Once we have the least squares estimates (𝛽̂0, 𝛽̂1), we can define the next two concepts:

• The fitted values 𝑌̂1, …, 𝑌̂𝑛, where

$$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i, \quad i = 1, \ldots, n.$$

They are the vertical projections of 𝑌1, …, 𝑌𝑛 into the fitted line (see Figure 2.16).

• The residuals (or estimated errors) 𝜀̂1, …, 𝜀̂𝑛, where

$$\hat\varepsilon_i = Y_i - \hat Y_i, \quad i = 1, \ldots, n.$$

They are the vertical distances between the actual data (𝑋𝑖, 𝑌𝑖) and the fitted data (𝑋𝑖, 𝑌̂𝑖). Hence, another way of writing the minimum RSS is $\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 = \sum_{i=1}^n \hat\varepsilon_i^2$.
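A minimal numerical check of these two definitions, assuming the x and yLin vectors simulated above are in the workspace (the object fit below is introduced just for this sketch):

# Fitted values and residuals computed "by hand" and compared with lm's
fit <- lm(yLin ~ x)
yHat <- fit$coefficients[1] + fit$coefficients[2] * x   # beta0Hat + beta1Hat * x
head(cbind(yHat, fit$fitted.values))                    # Both columns coincide
res <- yLin - yHat                                      # Residuals
head(cbind(res, fit$residuals))                         # Both columns coincide
sum(res^2)                                              # Minimum RSS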

To conclude this section, we check that the regression coefficients given by lm are indeed the ones given in (2.3).

# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744  1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept)           x
##  -0.6153744   1.3950973

Adapt the code conveniently for doing the same checking with

– 𝑋 = ReadingMean and 𝑌 = MathMean from the pisa dataset.
– 𝑋 = logGDPp and 𝑌 = MathMean.
– 𝑋 = Population2010 and 𝑌 = Seats2013.2023 from the US dataset.
– 𝑋 = Population2010 and 𝑌 = Seats2011 from the EU dataset.

2.4 Assumptions of the model

Why do we need assumptions? To make inference on the model parameters. In other words, to infer properties about the unknown population coefficients 𝛽0 and 𝛽1 from the sample (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛).

The assumptions of the linear model are:

i. Linearity: 𝔼[𝑌 |𝑋 = 𝑥] = 𝛽0 + 𝛽1𝑥.
ii. Homoscedasticity: 𝕍ar[𝜀𝑖] = 𝜎2, with 𝜎2 constant for 𝑖 = 1, …, 𝑛.
iii. Normality: 𝜀𝑖 ∼ 𝒩(0, 𝜎2) for 𝑖 = 1, …, 𝑛.
iv. Independence of the errors: 𝜀1, …, 𝜀𝑛 are independent (or uncorrelated, 𝔼[𝜀𝑖𝜀𝑗] = 0, 𝑖 ≠ 𝑗, since they are assumed to be normal).

A good one-line summary of the linear model is (independence is assumed)

𝑌 |𝑋 = 𝑥 ∼ 𝒩(𝛽0 + 𝛽1𝑥, 𝜎2)

Recall:

– Nothing is said about the distribution of 𝑋. Indeed, 𝑋 could be deterministic (called fixed design) or random (random design).
– The linear model assumes that 𝑌 is continuous due to the normality of the errors. However, 𝑋 can be discrete!
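The one-line summary can be made concrete with a small simulation (a sketch with arbitrary values of 𝛽0, 𝛽1, 𝜎 and 𝑥, not tied to any dataset of the course):

# For a fixed value x0 of the predictor, the response is normally distributed
# with mean beta0 + beta1 * x0 and standard deviation sigma
beta0 <- -0.5; beta1 <- 1.5; sigma <- 1; x0 <- 2
set.seed(12345)
yCond <- rnorm(n = 1000, mean = beta0 + beta1 * x0, sd = sigma)
mean(yCond)  # Close to beta0 + beta1 * x0 = 2.5
sd(yCond)    # Close to sigma = 1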

Figure 2.17: The key concepts of the simple linear model. The yellow band denotes where the 95% of the data is, according to the model.

Figures 2.18 and 2.19 represent situations where the assumptions of the model are respected and violated, respectively. For the moment, we will focus on building the intuition for checking the assumptions visually. In Chapter 3 we will see more sophisticated methods for checking the assumptions. We will also see what the possible fixes to the failure of assumptions are.

The dataset assumptions.RData (download) contains the variables x1, …, x9 and y1, …, y9. For each regression y1 ~ x1, …, y9 ~ x9:

– Check whether the assumptions of the linear model are being satisfied (make a scatterplot with a regression line).
– State which assumption(s) are violated and justify your answer.

2.5 Inference for the model coefficients

The assumptions introduced in the previous section allow us to specify the distribution of the random variables 𝛽̂0 and 𝛽̂1. As we will see, this is a key point for making inference on 𝛽0 and 𝛽1.

Figure 2.18: Perfectly valid simple linear models (all the assumptions are verified).

The distributions are derived conditionally on the sample predictors 𝑋1, …, 𝑋𝑛. In other words, we assume that the randomness of 𝑌𝑖 = 𝛽0 + 𝛽1𝑋𝑖 + 𝜀𝑖, 𝑖 = 1, …, 𝑛, comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors 𝑥1, …, 𝑥𝑛.

2.5.1 Distributions of the fitted coefficients

The distributions of 𝛽̂0 and 𝛽̂1 are:

$$\hat\beta_0 \sim \mathcal{N}\left(\beta_0, \mathrm{SE}(\hat\beta_0)^2\right), \quad \hat\beta_1 \sim \mathcal{N}\left(\beta_1, \mathrm{SE}(\hat\beta_1)^2\right) \tag{2.4}$$

where

$$\mathrm{SE}(\hat\beta_0)^2 = \frac{\sigma^2}{n}\left[1 + \frac{\bar x^2}{s_x^2}\right], \quad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{n s_x^2}. \tag{2.5}$$

Recall that an equivalent form for (2.4) is (why?)

$$\frac{\hat\beta_0 - \beta_0}{\mathrm{SE}(\hat\beta_0)} \sim \mathcal{N}(0, 1), \quad \frac{\hat\beta_1 - \beta_1}{\mathrm{SE}(\hat\beta_1)} \sim \mathcal{N}(0, 1).$$

Some important remarks on (2.4) and (2.5) are

Figure 2.19: Problematic simple linear models (a single assumption does not hold).

• Bias. Both estimates are unbiased. That means that their expectations are the true coefficients.

• Variance. The variances SE(𝛽̂0)² and SE(𝛽̂1)² have an interesting interpretation in terms of their components:

  – Sample size 𝑛. As the sample size grows, the precision of the estimators increases, since both variances decrease.
  – Error variance 𝜎2. The more disperse the error is, the less precise the estimates are, since the more vertical variability is present.
  – Predictor variance s²ₓ. If the predictor is spread out (large s²ₓ), then it is easier to fit a regression line: we have information about the data trend over a long interval. If s²ₓ is small, then all the data is concentrated on a narrow vertical band, so we have a much more limited view of the trend.
  – Mean x̄. It has influence only on the precision of 𝛽̂0. The larger x̄ is, the less precise 𝛽̂0 is.

Figure 2.20: Illustration of the randomness of the fitted coefficients (𝛽̂0, 𝛽̂1) and the influence of 𝑛, 𝜎2 and s²ₓ. The sample predictors 𝑥1, …, 𝑥𝑛 are fixed and new responses 𝑌1, …, 𝑌𝑛 are generated each time from a linear model 𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀. Application also available here.

The problem with (2.4) and (2.5) is that 𝜎2 is unknown in practice, so we need to estimate 𝜎2 from the data. We do so by computing the sample variance of the residuals 𝜀̂1, …, 𝜀̂𝑛. First note that the residuals have zero mean. This can be easily seen by replacing 𝛽̂0 = 𝑌̄ − 𝛽̂1𝑋̄:

$$\bar{\hat\varepsilon} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y + \hat\beta_1 \bar X - \hat\beta_1 X_i) = 0. \tag{2.6}$$

Due to this, we can estimate 𝜎2 by computing a rescaled sample variance of the residuals:

$$\hat\sigma^2 = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n - 2}.$$

Note the 𝑛 − 2 in the denominator, instead of 𝑛! 𝑛 − 2 is the number of degrees of freedom, which is the number of data points minus the number of already fitted parameters. The interpretation is that "we have consumed 2 degrees of freedom of the sample on fitting 𝛽̂0 and 𝛽̂1".
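Both facts can be verified numerically; the sketch below reuses the x and yLin vectors simulated before (assumed to be in the workspace):

# (1) The residuals have zero mean; (2) sigmaHat matches summary()$sigma
fit <- lm(yLin ~ x)
mean(fit$residuals)                          # Numerically zero
n <- length(yLin)
sigma2Hat <- sum(fit$residuals^2) / (n - 2)  # Rescaled sample variance of residuals
sqrt(sigma2Hat)                              # "By hand"
summary(fit)$sigma                           # Residual standard error from summary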

If we use the estimate 𝜎̂2 instead of 𝜎2, we get different – and more useful – distributions for 𝛽̂0 and 𝛽̂1:

$$\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{SE}}(\hat\beta_0)} \sim t_{n-2}, \quad \frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{SE}}(\hat\beta_1)} \sim t_{n-2} \tag{2.7}$$

where $t_{n-2}$ represents the Student's 𝑡 distribution¹⁰ with 𝑛 − 2 degrees of freedom and

$$\widehat{\mathrm{SE}}(\hat\beta_0)^2 = \frac{\hat\sigma^2}{n}\left[1 + \frac{\bar x^2}{s_x^2}\right], \quad \widehat{\mathrm{SE}}(\hat\beta_1)^2 = \frac{\hat\sigma^2}{n s_x^2} \tag{2.8}$$

are the estimates of SE(𝛽̂0)² and SE(𝛽̂1)². The LHS of (2.7) is called the 𝑡-statistic because of its distribution. The interpretation of (2.8) is analogous to the one of (2.5).
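As a check of (2.8), the estimated standard errors can be computed by hand and compared with the "Std. Error" column of summary; the sketch below again uses the simulated x and yLin (assumed to be in the workspace):

# Standard errors of the coefficients "by hand" vs. summary's Std. Error
fit <- lm(yLin ~ x)
n <- length(x)
sigma2Hat <- sum(fit$residuals^2) / (n - 2)
sx2 <- mean((x - mean(x))^2)                  # Sample variance with denominator n
sqrt(sigma2Hat / n * (1 + mean(x)^2 / sx2))   # SE of the intercept
sqrt(sigma2Hat / (n * sx2))                   # SE of the slope
summary(fit)$coefficients[, "Std. Error"]     # Reported by summary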

2.5.2 Confidence intervals for the coefficients

Due to (2.7) and (2.8), we can have the 100(1 − 𝛼)% Confidence Intervals (CI) for the coefficients:

$$\left(\hat\beta_j \pm \widehat{\mathrm{SE}}(\hat\beta_j)\, t_{n-2;\alpha/2}\right), \quad j = 0, 1, \tag{2.9}$$

where $t_{n-2;\alpha/2}$ is the 𝛼/2-upper quantile of the $t_{n-2}$ (see Figure 2.21). Usually, 𝛼 = 0.10, 0.05, 0.01 are considered.

Do you need to remember the above equations? No, although you need to fully understand them. R + R Commander will compute everything for you through the functions lm, summary and confint.

This random CI contains the unknown coefficient 𝛽𝑗 with a probability of 1 − 𝛼. Note also that the CI is symmetric around 𝛽̂𝑗. A simple way of understanding this concept is as follows. Suppose you have 100 samples generated according to a linear model. If you compute the CI for a coefficient, then in approximately 100(1 − 𝛼) of the samples the true coefficient would actually be inside the random CI. This is illustrated in Figure 2.22.
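This interpretation can also be illustrated with a small simulation (a sketch under an arbitrary true model; the coverage proportion should be close to 0.95):

# Proportion of 95% CIs for the slope that contain the true beta1
set.seed(123)
M <- 100; n <- 50; beta0 <- -0.5; beta1 <- 1.5
covers <- logical(M)
for (i in 1:M) {
  xSim <- rnorm(n)
  ySim <- beta0 + beta1 * xSim + rnorm(n)
  ci <- confint(lm(ySim ~ xSim))["xSim", ]
  covers[i] <- (ci[1] < beta1) & (beta1 < ci[2])
}
mean(covers)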

Let’s see how we can compute the CIs for 𝛽0 and 𝛽1 in practice. We do it in the first regression of the assumptions dataset. Assuming you have loaded the dataset, in R we can simply type:

mod1 <- lm(y1 ~ x1, data = assumptions)
confint(mod1)
##                  2.5 %     97.5 %
## (Intercept) -0.2256901  0.1572392
## x1          -0.5587490 -0.2881032

¹⁰The Student’s 𝑡 distribution has heavier tails than the normal, which means that large observations in absolute value are more likely. $t_n$ converges to a 𝒩(0, 1) when 𝑛 is large. For example, with 𝑛 larger than 30, the normal is a good approximation.

Figure 2.21: The Student's 𝑡 distribution for the 𝑡-statistics associated to null intercept and slope, for the y1 ~ x1 regression of the assumptions dataset.

In this example, the 95% confidence interval for 𝛽0 is (−0.2257, 0.1572). For 𝛽1 it is (−0.5587, −0.2881). Therefore, we can say with 95% confidence that x1 has a negative effect on y1. If the CI for 𝛽1 were (−0.2256901, 0.1572392), we could not arrive at the same conclusion, since the CI contains both positive and negative numbers.

By default, the confidence interval is computed for 𝛼 = 0.05. You can change this on the level argument, for example:

confint(mod1, level = 0.90) # alpha = 0.10
##                    5 %       95 %
## (Intercept) -0.1946762  0.1262254
## x1          -0.5368291 -0.3100231
confint(mod1, level = 0.95) # alpha = 0.05
##                  2.5 %     97.5 %
## (Intercept) -0.2256901  0.1572392
## x1          -0.5587490 -0.2881032
confint(mod1, level = 0.99) # alpha = 0.01
##                  0.5 %     99.5 %
## (Intercept) -0.2867475  0.2182967
## x1          -0.6019030 -0.2449492

Figure 2.22: Illustration of the randomness of the CI for 𝛽0 at 100(1 − 𝛼)% confidence. The plot shows 100 random CIs for 𝛽0, computed from 100 random datasets generated by the same linear model, with intercept 𝛽0. The illustration for 𝛽1 is completely analogous. Application also available here.

Note that the larger the confidence of the interval, the longer, and thus less useful, it is. For example, the interval (−∞, ∞) contains any coefficient with a 100% confidence, but it is completely useless.

If you want to compute the CIs through R Commander (assuming the dataset has been loaded and is the active one), then do the following:

1. Fit the linear model ('Statistics' -> 'Fit models' -> 'Linear regression...').
2. Go to 'Models' -> 'Confidence intervals...' and then input the 'Confidence Level'.

Compute the CIs (95%) for the coefficients of the regressions:

– y2 ~ x2
– y6 ~ x6
– y7 ~ x7

Do you think all of them are meaningful? Which ones are and why? (Recall: inference on the model makes sense if the assumptions of the model are verified.)

Compute the CIs for the coefficients of the following regressions:

– MathMean ~ ScienceMean (pisa)
– MathMean ~ ReadingMean (pisa)
– Seats2013.2023 ~ Population2010 (US)
– CamCom2011 ~ Population2010 (EU)

For the above regressions, can we conclude with a 95% confidence that the effect of the predictor on the response is positive?

A CI for 𝜎2 can also be computed, but it is less important in practice. The formula is:

$$\left(\frac{(n-2)\hat\sigma^2}{\chi^2_{n-2;\alpha/2}}, \frac{(n-2)\hat\sigma^2}{\chi^2_{n-2;1-\alpha/2}}\right)$$

where $\chi^2_{n-2;q}$ is the 𝑞-upper quantile of the 𝜒2 distribution¹¹ with 𝑛 − 2 degrees of freedom, $\chi^2_{n-2}$. Note that the CI is not symmetric around 𝜎̂2.
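A minimal sketch of how this CI can be computed in R, using again the simulated x and yLin (the names fit and alpha are introduced for the sketch):

# 95% CI for sigma^2 based on the chi-squared upper quantiles
fit <- lm(yLin ~ x)
n <- length(yLin)
sigma2Hat <- summary(fit)$sigma^2
alpha <- 0.05
c(lwr = (n - 2) * sigma2Hat / qchisq(alpha / 2, df = n - 2, lower.tail = FALSE),
  upr = (n - 2) * sigma2Hat / qchisq(1 - alpha / 2, df = n - 2, lower.tail = FALSE))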

Compute the CI for 𝜎2 for the regression of MathMean on logGDPp in the pisa dataset. Do it for 𝛼 = 0.10, 0.05, 0.01.

– To compute $\chi^2_{n-2;q}$, you can do:
  ∗ In R, qchisq(p = q, df = n - 2, lower.tail = FALSE).
  ∗ In R Commander, go to 'Distributions' -> 'Continuous distributions' -> 'Chi-squared distribution' -> 'Chi-squared quantiles' and then select 'Upper tail'. Input 𝑞 as the 'Probabilities' and 𝑛 − 2 as the 'Degrees of freedom'.
– To compute 𝜎̂2, use summary(lm(MathMean ~ logGDPp, data = pisa))$sigma^2. Remember that there are 65 countries in the study.

Answers: c(1720.669, 3104.512), c(1635.441, 3306.257) and c(1484.639, 3752.946).

¹¹$\chi^2_n$ is the distribution of the sum of the squares of 𝑛 random variables 𝒩(0, 1).


2.5.3 Testing on the coefficients

The distributions in (2.7) also allow us to conduct a formal hypothesis test on the coefficients 𝛽𝑗, 𝑗 = 0, 1. For example, the test for significance (shortcut for significantly different from zero) is especially important, that is, the test of the hypotheses

𝐻0 ∶ 𝛽𝑗 = 0

for 𝑗 = 0, 1. The test of 𝐻0 ∶ 𝛽1 = 0 is especially interesting, since it allows us to answer whether the variable 𝑋 has a significant linear effect on 𝑌. The statistic used for testing for significance is the 𝑡-statistic

$$\frac{\hat\beta_j - 0}{\widehat{\mathrm{SE}}(\hat\beta_j)},$$

which is distributed as a $t_{n-2}$ under (the veracity of) the null hypothesis.

Remember the analogy of hypothesis testing vs a trial, as given in the table below.

Hypothesis testing              Trial

Null hypothesis 𝐻0              Accused of committing a crime. It has the “presumption of
                                innocence”, which means that it is not guilty until there is
                                enough evidence supporting its guilt.

Sample 𝑋1, …, 𝑋𝑛                Collection of small pieces of evidence supporting innocence
                                and guilt.

Statistic 𝑇𝑛                    Summary of the evidence presented by the prosecutor and the
                                defense lawyer.

Distribution of 𝑇𝑛 under 𝐻0     The judge conducting the trial. Evaluates the evidence
                                presented by both sides and presents a verdict for 𝐻0.

Significance level 𝛼            1 − 𝛼 is the strength of evidence required by the judge for
                                condemning 𝐻0. The judge allows evidence that on average
                                condemns 100𝛼% of the innocents! 𝛼 = 0.05 is considered a
                                reasonable level.

𝑝-value                         Decision of the judge. If the 𝑝-value < 𝛼, 𝐻0 is declared
                                guilty. Otherwise, it is declared not guilty.

𝐻0 is rejected                  𝐻0 is declared guilty: there is strong evidence supporting
                                its guilt.

𝐻0 is not rejected              𝐻0 is declared not guilty: either it is innocent or there is
                                not enough evidence supporting its guilt.

More formally, the 𝑝-value is defined as:

The 𝑝-value is the probability of obtaining a statistic more unfavourable to 𝐻0 than the observed, assuming that 𝐻0 is true.

Therefore, if the 𝑝-value is small (smaller than the chosen level 𝛼), it is unlikely that the evidence against 𝐻0 is due to randomness. As a consequence, 𝐻0 is rejected. If the 𝑝-value is large (larger than 𝛼), then it is more likely that the evidence against 𝐻0 is merely due to the randomness of the data. In this case, we do not reject 𝐻0.

The null hypothesis 𝐻0 is tested against the alternative hypothesis, 𝐻1. If 𝐻0 is rejected, it is rejected in favor of 𝐻1. The alternative hypothesis can be bilateral, such as

𝐻0 ∶ 𝛽𝑗 = 0 vs 𝐻1 ∶ 𝛽𝑗 ≠ 0

or unilateral, such as

𝐻0 ∶ 𝛽𝑗 ≥ (≤)0 vs 𝐻1 ∶ 𝛽𝑗 < (>)0

For the moment, we will focus only on the bilateral case.

The connection between a 𝑡-test for 𝐻0 ∶ 𝛽𝑗 = 0 and the CI for 𝛽𝑗, both at level 𝛼, is the following.

Is 0 inside the CI for 𝛽𝑗?

– Yes ↔ do not reject 𝐻0.
– No ↔ reject 𝐻0.

The tests for significance are built-in in the summary function, as we glimpsed in Section 2.1.2. For mod1, we have:

summary(mod1)
##
## Call:
## lm(formula = y1 ~ x1, data = assumptions)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.13678 -0.62218 -0.07824  0.54671  2.63056
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03423    0.09709  -0.353    0.725
## x1          -0.42343    0.06862  -6.170 3.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9559 on 198 degrees of freedom
## Multiple R-squared:  0.1613, Adjusted R-squared:  0.157
## F-statistic: 38.07 on 1 and 198 DF,  p-value: 3.772e-09

The Coefficients block of the output of summary contains the next elements regarding the test 𝐻0 ∶ 𝛽𝑗 = 0 vs 𝐻1 ∶ 𝛽𝑗 ≠ 0:

• Estimate: least squares estimate 𝛽̂𝑗.
• Std. Error: estimated standard error $\widehat{\mathrm{SE}}(\hat\beta_j)$.
• t value: 𝑡-statistic $\frac{\hat\beta_j}{\widehat{\mathrm{SE}}(\hat\beta_j)}$.
• Pr(>|t|): 𝑝-value of the 𝑡-test.
• Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1: codes indicating the size of the 𝑝-value. The more stars, the more evidence supporting that 𝐻0 does not hold.
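As a quick check, the last two columns can be recovered by hand from the first two (a sketch reusing the mod1 object fitted above):

# t-statistics and two-sided p-values computed from Estimate and Std. Error
coefs <- summary(mod1)$coefficients
tStat <- coefs[, "Estimate"] / coefs[, "Std. Error"]
tStat                                                          # "t value" column
2 * pt(abs(tStat), df = mod1$df.residual, lower.tail = FALSE)  # "Pr(>|t|)" column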

In the above output for summary(mod1), 𝐻0 ∶ 𝛽0 = 0 is not rejected at any reasonable level for 𝛼 (that is, 0.10, 0.05 and 0.01). Hence 𝛽̂0 is not significantly different from zero and 𝛽0 is not significant for the regression. On the other hand, 𝐻0 ∶ 𝛽1 = 0 is rejected at any level 𝛼 larger than the 𝑝-value, 3.77e-09. Therefore, 𝛽1 is significant for the regression (and 𝛽̂1 is significantly different from zero).

For the assumptions dataset, do the next:

– Regression y7 ~ x7. Check that:
  ∗ The intercept is not significant for the regression at any reasonable level 𝛼.
  ∗ The slope is significant for any 𝛼 ≥ 10⁻⁷.
– Regression y6 ~ x6. Assume the linear model assumptions are verified.
  ∗ Check that 𝛽0 is significantly different from zero at any level 𝛼.
  ∗ For which 𝛼 = 0.10, 0.05, 0.01 is 𝛽1 significantly different from zero?

Re-analyze the significance of the coefficients in Seats2013.2023 ~ Population2010 and Seats2011 ~ Population2010 for the US and EU datasets, respectively.

2.6 Prediction

The forecast of 𝑌 from 𝑋 = 𝑥 in the linear model is approached in two different ways:

1. Inference on the conditional mean of 𝑌 given 𝑋 = 𝑥, 𝔼[𝑌 |𝑋 = 𝑥]. This is a deterministic quantity, which equals 𝛽0 + 𝛽1𝑥 in the linear model.
2. Prediction of the conditional response 𝑌 |𝑋 = 𝑥. This is a random variable, which in the linear model is distributed as 𝒩(𝛽0 + 𝛽1𝑥, 𝜎2).

Let's study first the inference on the conditional mean. 𝛽0 + 𝛽1𝑥 is estimated by 𝑦̂ = 𝛽̂0 + 𝛽̂1𝑥. Thus, a deterministic quantity is estimated by a random variable.

Moreover, it can be shown that the 100(1 − 𝛼)% CI for 𝛽0 + 𝛽1𝑥 is

$$\left(\hat y \pm t_{n-2;\alpha/2}\sqrt{\frac{\hat\sigma^2}{n}\left(1 + \frac{(x - \bar x)^2}{s_x^2}\right)}\right). \tag{2.10}$$

Some important remarks on (2.10) are:

• Bias. The CI is centered around 𝑦̂ = 𝛽̂0 + 𝛽̂1𝑥, which obviously depends on 𝑥.

• Variance. The variance that determines the length of the CI is $\frac{\hat\sigma^2}{n}\left(1 + \frac{(x - \bar x)^2}{s_x^2}\right)$. Its interpretation is very similar to the one given for 𝛽̂0 and 𝛽̂1 in Section 2.5:

  – Sample size 𝑛. As the sample size grows, the length of the CI decreases, since the estimates 𝛽̂0 and 𝛽̂1 become more precise.
  – Error variance 𝜎2. The more disperse 𝑌 is, the less precise 𝛽̂0 and 𝛽̂1 are, hence the more variance on estimating 𝛽0 + 𝛽1𝑥.
  – Predictor variance s²ₓ. If the predictor is spread out (large s²ₓ), then it is easier to "anchor" the regression line. This helps on reducing the variance, but up to a certain limit: there is a variance component purely dependent on the error!
  – Centrality (𝑥 − x̄)². The more extreme 𝑥 is, the wider the CI becomes. This is due to the "leverage" of the slope estimate 𝛽̂1: a small deviation from the true 𝛽1 is magnified when 𝑥 is far away from x̄, hence the more variability in these points. The minimum is achieved at 𝑥 = x̄, but it does not correspond to zero variance.

Figure 2.23: Illustration of the CIs for the conditional mean and response. Note how the length of the CIs is influenced by 𝑥, especially for the conditional mean. Application also available here.

Figure 2.23 helps to visualize these concepts interactively.

The prediction and computation of CIs can be done with the R function predict. The objects required for predict are: first, the output of lm; second, a data.frame containing the locations 𝑥 where we want to predict 𝛽0 + 𝛽1𝑥. To illustrate the use of predict, we are going to use the pisa dataset. In case you do not have it loaded, you can download it here as an .RData file.

# Plot the data and the regression line (alternatively, using R Commander)
scatterplot(MathMean ~ ReadingMean, data = pisa, smooth = FALSE)

[Scatterplot of MathMean versus ReadingMean for the pisa dataset, with the fitted regression line.]

# Fit a linear model (alternatively, using R Commander)
model <- lm(MathMean ~ ReadingMean, data = pisa)
summary(model)
##
## Call:
## lm(formula = MathMean ~ ReadingMean, data = pisa)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.039 -10.151  -2.187   7.804  50.241
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.65638   19.87265  -3.153  0.00248 **
## ReadingMean   1.13083    0.04172  27.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.72 on 63 degrees of freedom
## Multiple R-squared:  0.921, Adjusted R-squared:  0.9198
## F-statistic: 734.5 on 1 and 63 DF,  p-value: < 2.2e-16

# Data for which we want a prediction for the mean
# Important! You have to name the column with the predictor name!
newData <- data.frame(ReadingMean = 400)

# Prediction
predict(model, newdata = newData)
##        1
## 389.6746

# Prediction with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(model, newdata = newData, interval = "confidence")
##        fit      lwr     upr
## 1 389.6746 382.3782 396.971
predict(model, newdata = newData, interval = "confidence", level = 0.95)
##        fit      lwr     upr
## 1 389.6746 382.3782 396.971

# Other levels
predict(model, newdata = newData, interval = "confidence", level = 0.90)
##        fit      lwr    upr
## 1 389.6746 383.5793 395.77
predict(model, newdata = newData, interval = "confidence", level = 0.99)
##        fit      lwr      upr
## 1 389.6746 379.9765 399.3728

# Predictions for several values
summary(pisa$ReadingMean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     384     441     488     474     509     570
newData2 <- data.frame(ReadingMean = c(400, 450, 500, 550))
predict(model, newdata = newData2, interval = "confidence")
##        fit      lwr      upr
## 1 389.6746 382.3782 396.9710
## 2 446.2160 441.8363 450.5957
## 3 502.7574 498.2978 507.2170
## 4 559.2987 551.8587 566.7388

For the pisa dataset, do the following:

– Regress MathMean on logGDPp excluding Shanghai-China, Vietnam and Qatar (use subset = -c(1, 16, 62)). Name the fitted model modExercise.
– Show the scatterplot with regression line (use subset = -c(1, 16, 62)).
– Compute the estimate for the conditional mean of MathMean for a logGDPp of 10.263 (Spain's). What is the CI at 𝛼 = 0.05?
– Do the same with the logGDPp of Sweden, Denmark, Italy and United States.
– Check that modExercise$fitted.values is the same as predict(modExercise, newdata = data.frame(logGDPp = pisa$logGDPp)) (except for the three countries omitted). Why is that so?

Let's study now the prediction of the conditional response 𝑌 |𝑋. 𝑌 |𝑋 is predicted by 𝑦̂ = 𝛽̂0 + 𝛽̂1𝑥 (the estimated conditional mean). So we estimate the unknown response 𝑌 |𝑋 simply by its conditional mean. This is clearly a different situation: now we estimate a random variable by another random variable. As a consequence, there is a price to pay in terms of extra variability and this is reflected in the 100(1 − 𝛼)% CI for 𝑌 |𝑋:

$$\left(\hat y \pm t_{n-2;\alpha/2}\sqrt{\hat\sigma^2 + \frac{\hat\sigma^2}{n}\left(1 + \frac{(x - \bar x)^2}{s_x^2}\right)}\right) \tag{2.11}$$

The CI (2.11) is very similar to (2.10), but there is a key difference: it is always longer due to the extra term 𝜎̂2. In Figure 2.23 you can visualize the differences between both CIs.

Similarities and differences in the prediction of the conditional mean 𝔼[𝑌 |𝑋 = 𝑥] and the conditional response 𝑌 |𝑋 = 𝑥:

– Similarities. The estimate is the same, 𝑦̂ = 𝛽̂0 + 𝛽̂1𝑥. Both CIs are centered at 𝑦̂ and share the term $\frac{\hat\sigma^2}{n}\left(1 + \frac{(x - \bar x)^2}{s_x^2}\right)$ in the variance.
– Differences. 𝔼[𝑌 |𝑋 = 𝑥] is deterministic and 𝑌 |𝑋 = 𝑥 is random. Therefore, the variance is larger for the prediction of 𝑌 |𝑋 = 𝑥: there is an extra 𝜎̂2 term in the variance of its prediction.


The prediction of a new observation can be done via the function predict, which also provides confidence intervals.

# Prediction with 95% confidence interval (the default) CI: (lwr, upr)
predict(model, newdata = newData, interval = "prediction")
##        fit      lwr      upr
## 1 389.6746 357.4237 421.9255

# Other levels
predict(model, newdata = newData, interval = "prediction", level = 0.90)
##        fit      lwr      upr
## 1 389.6746 362.7324 416.6168
predict(model, newdata = newData, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 389.6746 346.8076 432.5417

# Predictions for several values
predict(model, newdata = newData2, interval = "prediction")
##        fit      lwr      upr
## 1 389.6746 357.4237 421.9255
## 2 446.2160 414.4975 477.9345
## 3 502.7574 471.0277 534.4870
## 4 559.2987 527.0151 591.5824

# Comparison with the mean CI
predict(model, newdata = newData2, interval = "confidence")
##        fit      lwr      upr
## 1 389.6746 382.3782 396.9710
## 2 446.2160 441.8363 450.5957
## 3 502.7574 498.2978 507.2170
## 4 559.2987 551.8587 566.7388
predict(model, newdata = newData2, interval = "prediction")
##        fit      lwr      upr
## 1 389.6746 357.4237 421.9255
## 2 446.2160 414.4975 477.9345
## 3 502.7574 471.0277 534.4870
## 4 559.2987 527.0151 591.5824

Redo the third and fourth points of the previous exercise with CIs for the conditional response. In addition, check if the MathMean scores of Sweden, Denmark, Vietnam and Qatar are inside or outside the prediction CIs.

2.7 ANOVA and model fit

2.7.1 ANOVA

As we have seen in Sections 2.5 and 2.6, the variance of the error, 𝜎2, plays a fundamental role in the inference for the model coefficients and prediction. In this section we will see how the variance of 𝑌 is decomposed into two parts, each one corresponding to the regression and to the error, respectively. This decomposition is called the ANalysis Of VAriance (ANOVA).

Before explaining ANOVA, it is important to recall an interesting result: the mean of the fitted values 𝑌̂1, …, 𝑌̂𝑛 is the mean of 𝑌1, …, 𝑌𝑛. This is easily seen if we plug in the expression of 𝛽̂0:

$$\frac{1}{n}\sum_{i=1}^n \hat Y_i = \frac{1}{n}\sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 X_i) = \hat\beta_0 + \hat\beta_1 \bar X = (\bar Y - \hat\beta_1 \bar X) + \hat\beta_1 \bar X = \bar Y.$$

The ANOVA decomposition considers the following measures of variation related with the response:

• SST = $\sum_{i=1}^n (Y_i - \bar Y)^2$, the total sum of squares. This is the total variation of 𝑌1, …, 𝑌𝑛, since SST = $n s_y^2$, where $s_y^2$ is the sample variance of 𝑌1, …, 𝑌𝑛.

• SSR = $\sum_{i=1}^n (\hat Y_i - \bar Y)^2$, the regression sum of squares¹². This is the variation explained by the regression line, that is, the variation from 𝑌̄ that is explained by the estimated conditional mean 𝑌̂𝑖 = 𝛽̂0 + 𝛽̂1𝑋𝑖. SSR = $n s_{\hat y}^2$, where $s_{\hat y}^2$ is the sample variance of 𝑌̂1, …, 𝑌̂𝑛.

• SSE = $\sum_{i=1}^n (Y_i - \hat Y_i)^2$, the sum of squared errors¹³. It is the variation around the conditional mean. Recall that SSE = $\sum_{i=1}^n \hat\varepsilon_i^2 = (n-2)\hat\sigma^2$, where $\hat\sigma^2$ is the sample variance of 𝜀̂1, …, 𝜀̂𝑛.

The ANOVA decomposition is

$$\underbrace{\mathrm{SST}}_{\text{Variation of } Y_i\text{'s}} = \underbrace{\mathrm{SSR}}_{\text{Variation of } \hat Y_i\text{'s}} + \underbrace{\mathrm{SSE}}_{\text{Variation of } \hat\varepsilon_i\text{'s}} \tag{2.12}$$

or, equivalently (dividing by 𝑛 in (2.12)),

$$\underbrace{s_y^2}_{\text{Variance of } Y_i\text{'s}} = \underbrace{s_{\hat y}^2}_{\text{Variance of } \hat Y_i\text{'s}} + \underbrace{(n-2)/n \times \hat\sigma^2}_{\text{Variance of } \hat\varepsilon_i\text{'s}}.$$

The graphical interpretation of (2.12) is shown in Figures 2.24 and 2.25.
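The decomposition (2.12) can be checked numerically; the sketch below assumes the model object for MathMean ~ ReadingMean fitted in Section 2.6 is still in the workspace:

# SST = SSR + SSE for the MathMean ~ ReadingMean fit
y <- model$fitted.values + model$residuals   # Responses used in the fit
SST <- sum((y - mean(y))^2)
SSR <- sum((model$fitted.values - mean(y))^2)
SSE <- sum(model$residuals^2)
c(SST = SST, "SSR + SSE" = SSR + SSE)        # Both coincide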

The ANOVA table summarizes the decomposition of the variance. Here it is given in the layout employed by R.

            Degrees of freedom   Sum Squares   Mean Squares    𝐹-value                   𝑝-value
Predictor   1                    SSR           SSR/1           (SSR/1)/(SSE/(𝑛 − 2))     𝑝
Residuals   𝑛 − 2                SSE           SSE/(𝑛 − 2)

The anova function in R takes a model as an input and returns the ANOVA table. In R Commander, the ANOVA table can be computed by going to 'Models' -> 'Hypothesis tests' -> 'ANOVA table...'. In the 'Type of Tests' option, select 'Sequential ("Type I")'. (This types anova() for you…)

¹²Recall that SSR is different from RSS (Residual Sum of Squares, Section 2.3).
¹³Recall that SSE and RSS (for (𝛽̂0, 𝛽̂1)) are just different names for referring to the same quantity: SSE = $\sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 = \mathrm{RSS}(\hat\beta_0, \hat\beta_1)$.

# Fit a linear model (alternatively, using R Commander)
model <- lm(MathMean ~ ReadingMean, data = pisa)
summary(model)
##
## Call:
## lm(formula = MathMean ~ ReadingMean, data = pisa)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.039 -10.151  -2.187   7.804  50.241
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.65638   19.87265  -3.153  0.00248 **
## ReadingMean   1.13083    0.04172  27.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.72 on 63 degrees of freedom
## Multiple R-squared:  0.921, Adjusted R-squared:  0.9198
## F-statistic: 734.5 on 1 and 63 DF,  p-value: < 2.2e-16

# ANOVA table
anova(model)
## Analysis of Variance Table
##
## Response: MathMean
##             Df Sum Sq Mean Sq F value    Pr(>F)
## ReadingMean  1 181525  181525  734.53 < 2.2e-16 ***
## Residuals   63  15569     247
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “𝐹-value” of the ANOVA table represents the value of the 𝐹-statistic $\frac{\mathrm{SSR}/1}{\mathrm{SSE}/(n-2)}$. This statistic is employed to test

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0,$$

that is, the hypothesis of no linear dependence of 𝑌 on 𝑋. The result of this test is completely equivalent to the 𝑡-test for 𝛽1 that we saw in Section 2.5 (this is something specific for simple linear regression – the 𝐹-test will not be equivalent to the 𝑡-test for 𝛽1 in Chapter 3). It happens that

$$F = \frac{\mathrm{SSR}/1}{\mathrm{SSE}/(n-2)} \overset{H_0}{\sim} F_{1,n-2},$$

where $F_{1,n-2}$ is the Snedecor's 𝐹 distribution¹⁴ with 1 and 𝑛 − 2 degrees of freedom. If 𝐻0 is true, then 𝐹 is expected to be small since SSR will be close to zero. The 𝑝-value of this test is the same as the 𝑝-value of the 𝑡-test for 𝐻0 ∶ 𝛽1 = 0.

Recall that the 𝐹-statistic, its 𝑝-value and the degrees of freedom are also given in the output of summary.
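As a sketch of these connections (using the model object from above), the 𝐹-statistic can be recomputed from the ANOVA table and, in simple linear regression, it coincides with the square of the 𝑡-statistic for 𝛽1:

# F-statistic from the ANOVA table, its p-value and the link with the t-test
tab <- anova(model)
Fval <- tab["ReadingMean", "Mean Sq"] / tab["Residuals", "Mean Sq"]
Fval                                                            # F-statistic
summary(model)$coefficients["ReadingMean", "t value"]^2         # Square of the t-statistic
pf(Fval, df1 = 1, df2 = model$df.residual, lower.tail = FALSE)  # p-value of the F-test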

For the regressions y6 ~ x6 and y7 ~ x7 in the assumptions dataset, compute their ANOVA tables. Check that the 𝑝-values of the 𝑡-test for 𝛽1 and the 𝐹-test are the same.

2.7.2 The 𝑅2

The coefficient of determination 𝑅2 is closely related with the ANOVA decomposition. 𝑅2 is defined as

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SSR}}{\mathrm{SSR} + \mathrm{SSE}} = \frac{\mathrm{SSR}}{\mathrm{SSR} + (n-2)\hat\sigma^2}.$$

𝑅2 measures the proportion of variation of the response variable 𝑌 that is explained by the predictor 𝑋 through the regression. The proportion of total variation of 𝑌 that is not explained is $1 - R^2 = \frac{\mathrm{SSE}}{\mathrm{SST}}$. Intuitively, 𝑅2 measures the tightness of the data cloud around the regression line, since it is related directly with 𝜎̂2. Check in Figure 2.25 how changing the value of 𝜎2 (not 𝜎̂2, but 𝜎̂2 is obviously dependent on 𝜎2) affects the 𝑅2.

The 𝑅2 is related with the sample correlation coefficient

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n (X_i - \bar X)^2}\sqrt{\sum_{i=1}^n (Y_i - \bar Y)^2}}$$

and it can be seen that $R^2 = r_{xy}^2$. Interestingly, it also holds that $R^2 = r_{y\hat y}^2$, that is, the square of the sample correlation coefficient between 𝑌1, …, 𝑌𝑛 and 𝑌̂1, …, 𝑌̂𝑛 is 𝑅2, a fact that is not immediately evident. This can be easily seen by first noting that

$$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i = (\bar Y - \hat\beta_1 \bar X) + \hat\beta_1 X_i = \bar Y + \hat\beta_1 (X_i - \bar X) \tag{2.13}$$

and then replacing (2.13) into

$$r_{y\hat y}^2 = \frac{s_{y\hat y}^2}{s_y^2 s_{\hat y}^2} = \frac{\left(\sum_{i=1}^n (Y_i - \bar Y)(\hat Y_i - \bar Y)\right)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2 \sum_{i=1}^n (\hat Y_i - \bar Y)^2} = \frac{\left(\sum_{i=1}^n (Y_i - \bar Y)(\bar Y + \hat\beta_1(X_i - \bar X) - \bar Y)\right)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2 \sum_{i=1}^n (\bar Y + \hat\beta_1(X_i - \bar X) - \bar Y)^2} = r_{xy}^2.$$

¹⁴The $F_{n,m}$ distribution arises as the quotient of two independent random variables $\chi^2_n$ and $\chi^2_m$: $\frac{\chi^2_n/n}{\chi^2_m/m}$.


The equality $R^2 = r_{y\hat y}^2$ is still true for the multiple linear regression, e.g. 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝜀. On the contrary, there is no coefficient of correlation between three or more variables, so $r_{x_1 x_2 y}$ does not exist. Hence, $R^2 = r_{xy}^2$ is a specific fact for simple linear regression.

The result $R^2 = r_{xy}^2 = r_{y\hat y}^2$ can be checked numerically and graphically with the next code.

# Responses generated following a linear model
set.seed(343567) # Fixes seed, allows to generate the same random data
x <- rnorm(50)
eps <- rnorm(50)
y <- -1 + 2 * x + eps

# Regression model
reg <- lm(y ~ x)
yHat <- reg$fitted.values

# Summary
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5834 -0.6015 -0.1466  0.6006  3.0419
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.9027     0.1503  -6.007 2.45e-07 ***
## x             2.1107     0.1443  14.623  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.059 on 48 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.8129
## F-statistic: 213.8 on 1 and 48 DF,  p-value: < 2.2e-16

# Square of the correlation coefficient
cor(y, x)^2
## [1] 0.8166752
cor(y, yHat)^2
## [1] 0.8166752

# Plots
scatterplot(y ~ x, smooth = FALSE)
scatterplot(y ~ yHat, smooth = FALSE)

[Scatterplots of y versus x and of y versus yHat, both with fitted regression lines.]

We conclude this section by pointing out two common sins regarding the use of 𝑅2. First, recall two important concepts regarding the application of any regression model in practice, in particular the linear model:

1. Correctness. The linear model is built on certain assumptions, such as the ones we saw in Section 2.4. All the inferential results are based on these assumptions being true!¹⁵ A model is formally correct whenever the assumptions on which it is based are not violated in the data.

2. Usefulness. The usefulness of the model is a more subjective concept, but it is usually measured by the accuracy in the prediction and explanation of the response 𝑌 by the predictor 𝑋. For example, 𝑌 = 0𝑋 + 𝜀 is a valid linear model, but it is completely useless for predicting 𝑌 from 𝑋.

Figure 2.25 shows a fitted regression line for a small dataset, for various levels of 𝜎2. All the linear models are correct by construction, but the ones with a larger 𝑅2 are more useful for predicting/explaining 𝑌 from 𝑋, since this is done in a more precise way.

𝑅2 does not measure the correctness of a linear model but its usefulness (for prediction, for explaining the variance of 𝑌), assuming the model is correct.

Trusting blindly the 𝑅2 can lead to catastrophic conclusions, since the model may not be correct. Here is a counterexample of a linear regression performed on data that clearly does not satisfy the assumptions discussed in Section 2.4, but which despite this has a large 𝑅2. Note how biased the predictions for 𝑥 = 0.35 and 𝑥 = 0.65 will be!

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps

# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.53525 -0.18020  0.02811  0.16882  0.46896
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.87190    0.05860   14.88   <2e-16 ***
## x           -1.69268    0.09359  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7671
## F-statistic: 327.1 on 1 and 98 DF,  p-value: < 2.2e-16

¹⁵If the assumptions are not satisfied (mismatch between what is assumed to happen in theory and what the data is), then the inference results may be misleading.

# But prediction is obviously problematic
scatterplot(y ~ x, smooth = FALSE)

[Scatterplot of y versus x with the fitted regression line.]

So remember:

A large 𝑅2 means nothing if the assumptions of the model do not hold. 𝑅2 is the proportion of variance of 𝑌 explained by 𝑋, but, of course, only when the linear model is correct.

2.8 Nonlinear relationships

The linear model is termed linear not because the regression curve is a line, but because the effects of the parameters 𝛽0 and 𝛽1 are linear. Indeed, the predictor 𝑋 may exhibit a nonlinear effect on the response 𝑌 and still be a linear model! For example, the following models can be transformed into simple linear models:

1. 𝑌 = 𝛽0 + 𝛽1𝑋² + 𝜀
2. 𝑌 = 𝛽0 + 𝛽1 log(𝑋) + 𝜀
3. 𝑌 = 𝛽0 + 𝛽1(𝑋³ − log(|𝑋|) + 2𝑋) + 𝜀

The trick is to work with the transformed predictor (𝑋², log(𝑋), …), instead of with the original variable 𝑋. Then, rather than working with the sample (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛), we consider the transformed sample (𝑋̃1, 𝑌1), …, (𝑋̃𝑛, 𝑌𝑛) with (for the above examples):

1. 𝑋̃𝑖 = 𝑋𝑖², 𝑖 = 1, …, 𝑛.
2. 𝑋̃𝑖 = log(𝑋𝑖), 𝑖 = 1, …, 𝑛.
3. 𝑋̃𝑖 = 𝑋𝑖³ − log(|𝑋𝑖|) + 2𝑋𝑖, 𝑖 = 1, …, 𝑛.
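Before the worked example that follows, here is a minimal sketch of the trick for model 2 above, with simulated data and arbitrary coefficients (function calls such as log() can be used directly inside a formula; the I() wrapper, discussed at the end of this section, is only needed for arithmetic operators):

# Fitting Y = beta0 + beta1 * log(X) + eps by regressing on the transformed predictor
set.seed(987654)
xSim <- rexp(n = 100, rate = 1)                  # A positive predictor
ySim <- 1 + 2 * log(xSim) + rnorm(100, sd = 0.5)
summary(lm(ySim ~ log(xSim)))$coefficients       # Estimates close to 1 and 2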

An example of this simple but powerful trick is given as follows. The left panel of Figure 2.27 shows the scatterplot for some data y and x, together with its fitted regression line. Clearly, the data does not follow a linear pattern, but a nonlinear one. In order to identify which one it might be, we compare it against the set of mathematical functions displayed in Figure 2.26. We see that the shape of the point cloud is similar to 𝑦 = 𝑥². Hence, y might be better explained by the square of x, x^2, rather than by x. Indeed, if we plot y against x^2 in the right panel of Figure 2.27, we can see that the fit of the regression line is much better.

In conclusion, with a simple trick we have drastically increased the explanation of the response. However, there is a catch: knowing which transformation is required in order to linearise the relation between response and predictor is a kind of art which requires a good eye. This is partially alleviated by the extension of this technique to deal with polynomials rather than monomials, as we will see in Chapter 3. For the moment, we will consider only the transformations displayed in Figure 2.26. Figure 2.28 shows different transformations linearizing nonlinear data patterns.

If you apply a nonlinear transformation, namely 𝑓, and fit the linear model 𝑌 = 𝛽0 + 𝛽1𝑓(𝑋) + 𝜀, then there is no point in also fitting the model resulting from the negative transformation −𝑓. The model with −𝑓 is exactly the same as the one with 𝑓 but with the sign of 𝛽1 flipped!

As a rule of thumb, use Figure 2.26 with the transformations to compare it with the data pattern, then choose the most similar curve and finally apply the corresponding function with positive sign.

As you might have realized, applying nonlinear transformations to the predictors is a simple trick that extends enormously the functionality of the linear model. This is particularly useful in real applications, where linearity is hardly verified (for example, in the PISA case study of Section 2.1.1, we employed logGDPp instead of GDPp due to its higher linearity with MathMean).

Let's see how we can compute transformations of our predictors and perform a linear regression with them. The data for the above example is the following:

# Data
x <- c(-2, -1.9, -1.7, -1.6, -1.4, -1.3, -1.1, -1, -0.9, -0.7, -0.6,
       -0.4, -0.3, -0.1, 0, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1, 1.1, 1.3,
       1.4, 1.6, 1.7, 1.9, 2, 2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3, 3.1,
       3.3, 3.4, 3.6, 3.7, 3.9, 4, 4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5)
y <- c(1.4, 0.4, 2.4, 1.7, 2.4, 0, 0.3, -1, 1.3, 0.2, -0.7, 1.2, -0.1,
       -1.2, -0.1, 1, -1.1, -0.9, 0.1, 0.8, 0, 1.7, 0.3, 0.8, 1.2, 1.1,
       2.5, 1.5, 2, 3.8, 2.4, 2.9, 2.7, 4.2, 5.8, 4.7, 5.3, 4.9, 5.1,
       6.3, 8.6, 8.1, 7.1, 7.9, 8.4, 9.2, 12, 10.5, 8.7, 13.5)

# Data frame (a matrix with column names)
nonLinear <- data.frame(x = x, y = y)

In order to perform a simple linear regression on x^2, and not on x, we need to compute a new variable in our dataset that contains the square of x. We can do it in two equivalent ways:

1. Through R Commander. In Section 2.1.2 we saw how to create a new variable in our active dataset (remember Figure 2.8). Go to 'Data' -> 'Manage variables in active dataset...' -> 'Compute new variable...'. Set the 'New variable name' to x2 and the 'Expression to compute' to x^2.

2. Through R. Just type:

# We create a new column inside nonLinear, called x2, that contains
# nonLinear$x^2
nonLinear$x2 <- nonLinear$x^2

# Check the variables
names(nonLinear)
## [1] "x"  "y"  "x2"

With either of the two previous points you will have a new variable called x2. If you wish to remove it, you can do it by either typing

# Empties the column named x2
nonLinear$x2 <- NULL

or, in R Commander, by going to 'Data' -> 'Manage variables in active data set' -> 'Delete variables from data set...' and selecting to remove x2.

Now we are ready to perform the regression. If you do it directly through R, you will obtain:

mod1 <- lm(y ~ x, data = nonLinear)
summary(mod1)
##
## Call:
## lm(formula = y ~ x, data = nonLinear)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5268 -1.7513 -0.4017  0.9750  5.0265
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.9771     0.3506   2.787   0.0076 **
## x             1.4993     0.1374  10.911 1.35e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.005 on 48 degrees of freedom
## Multiple R-squared:  0.7126, Adjusted R-squared:  0.7067
## F-statistic: 119 on 1 and 48 DF,  p-value: 1.353e-14

mod2 <- lm(y ~ x2, data = nonLinear)
summary(mod2)
##
## Call:
## lm(formula = y ~ x2, data = nonLinear)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.0418 -0.5523 -0.1465  0.6286  1.8797
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.05891    0.18462   0.319    0.751
## x2           0.48659    0.01891  25.725   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9728 on 48 degrees of freedom
## Multiple R-squared:  0.9324, Adjusted R-squared:  0.931
## F-statistic: 661.8 on 1 and 48 DF,  p-value: < 2.2e-16


A fast way of performing and summarizing the quadratic fit is

summary(lm(y ~ I(x^2), data = nonLinear))

Some remarks about this expression:

– The I() function wrapping x^2 is fundamental when applying arithmetic operations in the predictor. The symbols +, *, ^, … have a different meaning when input in a formula, so it is required to use I() to indicate that they must be interpreted in their arithmetic meaning and that the result of the expression denotes a new predictor variable. For example, use I((x - 1)^3 - log(3 * x)) if you want to apply the transformation (x - 1)^3 - log(3 * x).

– We are computing the summary of lm directly, without using an intermediate variable for storing the output of lm. This is perfectly valid and very handy, but note that you will not be able to access the information outputted by lm, only the one from summary.

Load the dataset assumptions.RData. We are going to work with the regressions y2 ~ x2, y3 ~ x3, y8 ~ x8 and y9 ~ x9, in order to identify which transformation of Figure 2.26 gives the best fit. (For the purpose of illustration, we do not care if the assumptions are respected.) For these, do the following:

– Find the transformation that yields the largest 𝑅2.
– Compare the original and the transformed linear models.

Some hints:

– y2 ~ x2 has a negative dependence, so look at the right panel of the transformations figure.
– y3 ~ x3 seems to have just a subtle nonlinearity… Will it be worth attempting a transformation?
– For y9 ~ x9, also try with exp(-abs(x9)), log(abs(x9)) and 2^abs(x9). (abs computes the absolute value.)


2.9 Exercises and case studies

2.9.1 Data importation

Import the following datasets into R Commander indicating the right formatting options:

– iris.txt
– ads.csv
– auto.txt
– Boston.xlsx
– anscombe.RData
– airquality.txt
– wine.csv
– world-population.RData
– la-liga-2015-2016.xlsx
– wdi-countries.txt

Create the file datasets.RData for saving all the datasets together.

There are a good number of datasets directly available in R:

– To inspect them, go to 'Data' -> 'Data in packages' -> 'List datasets in packages' and you will get a long list of the available datasets, the package where they are stored and a small description about them. Or, in R, simply type data().
– To load a dataset, go to 'Data' -> 'Data in packages' -> 'Read dataset from an attached package...' and select the package and dataset. Or, in R, simply type data(nameDataset, package = "namePackage").
– To get help on the dataset, go to 'Help' -> 'Help on active dataset (if available)' or simply type help(nameDataset).

As you can see, this is a handy way of accessing a good number of datasets directly from R.

Import the datasets Titanic (datasets package), PublicSchools (sandwich package) and USArrests (datasets package). Describe briefly the characteristics of each dataset (dimensions, variables, context).


2.9.2 Simple data management

Perform the following data management operations. Remember to select the adequate dataset as the active one:

– Load datasets.RData.
– Establish the case names in la-liga-2015-2016 as the variable Team (if they were not set).
– Establish the case names in wdi-countries as the variable Country.
– In la-liga-2015-2016, create a new variable named Goals.wrt.mean, defined as Goals - mean(Goals).
– In wdi-countries, create a new variable that standardizes¹⁶ the GDP.growth. Call it GDP.growth.
– Delete the variable Species from iris.
– For la-liga-2015-2016, go to 'Edit dataset' and change the Points of Getafe to 40. To do so, click on the cell, change the content and click OK or select a new cell to save changes. Do not hit 'Enter' or you will add a new column!
– Explore the menu options of 'Edit dataset' for adding and removing rows/columns. It is a useful feature for simple edits.
– Create a newDatasets.RData file saving all the modified datasets.
– Restart R Commander and then load newDatasets.RData (ignore the error 'ERROR: There is more than one object in the file...' and check that all the datasets are indeed available).

2.9.3 Computing simple linear regressions

Import the iris dataset, either from iris.txt or datasets.RData. This dataset contains measurements for 150 iris flowers. The purpose of this exercise is to do the following analyses through R Commander while inspecting and understanding the outputted code to identify what parts are changing, how and why.

– Fit the regression line for Petal.Width (response) on Petal.Length (predictor) and summarize it.

– Make the scatterplot of Petal.Width (y) vs. Petal.Length (x) with a regression line.

– Set the 'Graph title' to “iris dataset: petal width vs. petal length”, the 'x-axis label' to “petal length” and the 'y-axis label' to “petal width”.

– Identify the 5 most outlying points 'Automatically'.


– Redo the linear regression and scatterplot excluding the points labeled as outliers (exclude them in 'Subset expression' with a -c(...)).

– Check that the summary for the fitted line and the scatterplot displayed are coherent.

– Make the matrix scatterplot for the four variables, including 'Least-squares lines'.

– Set the 'On Diagonal' plots to 'Histograms' and 'Boxplots'.
– Set the 'Graph title' to “iris matrix scatterplot”.
– Identify the 5 most outlying points 'Automatically'.
– Modify the code to identify 15 points.
– Compute the regression line for the plot in the fourth row and the second column and create the scatterplot for it.
– Redo the scatterplot by selecting the option 'Plot by groups...' and then selecting 'Species'.

The last scatterplots are an illustration of Simpson's paradox. The paradox arises when there are two or more well-defined groups in the data, they all have positive (negative) correlation, but, taken as a whole dataset, the correlation is the opposite.
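The paradox is easy to reproduce by simulation. The following sketch (not part of the original notes; all names and numeric values are arbitrary) builds two groups with positive within-group correlations but a negative overall correlation:

# Two well-defined groups
set.seed(1)
g <- rep(1:2, each = 50)
x <- rnorm(100, mean = 5 * g)        # group 2 takes larger x values on average
y <- 2 * x - 15 * g + rnorm(100)     # within each group, y grows with x
cor(x[g == 1], y[g == 1])            # positive
cor(x[g == 2], y[g == 2])            # positive
cor(x, y)                            # negative for the whole dataset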

2.9.4 R basics

Answer briefly in your own words:

– What is the operator <- doing? What does it mean, for example, that a <- 1:5?

– What is the difference between a matrix and a data frame? Why is the latter useful?

– What are the differences between a vector and a matrix?
– What is c employed for?
– Consider the expression lm(a ~ b, data = myData).

  ∗ What does lm stand for?
  ∗ What does a ~ b mean? What are the roles of a and b?
  ∗ Is myData a matrix or a data frame?
  ∗ What must be the relation between myData and a and b?
  ∗ Explain the differences with lm(b ~ a, data = myData).

– What are the differences between running a <- 1; a, a <- 1, and 1?

– What are the differences between a list and a data frame? What are their common parts?

– Why is $ employed? How can you know in which variables you can use $?

– If you have a vector x, what are x^2 and x + 1 doing to its elements?

Do the following:

– Create the vectors 𝑥 = (1.17, 0.41, 0.34, 1.11, 1.02, 0.22, −0.24, −0.27, −0.40, −1.38) and 𝑦 = (3.63, 1.69, 0.27, 5.83, 2.64, 1.33, 1.22, −0.62, 1.29, −0.43).

– Set the positions 3, 4 and 8 of 𝑥 to 0. Set the positions 1, 4, 9 of 𝑦 to 0.5, -0.75 and 0.3, respectively.

– Create a new vector $z$ containing $\log(x^2) - y^3\sqrt{\exp(x)}$.
– Create the vector $t = (1, 4, 9, 16, 25, \ldots, 100)$.
– Access all the elements of $t$ except the third and fifth.
– Create the matrix $A = \begin{pmatrix} 1 & -3 \\ 0 & 2 \end{pmatrix}$. Hint: use rbind or cbind.
– Using $A$, what is a short way (less code) of computing $B = \begin{pmatrix} 1 + \sqrt{2}\sin(3) & -3 + \sqrt{2}\sin(3) \\ 0 + \sqrt{2}\sin(3) & 2 + \sqrt{2}\sin(3) \end{pmatrix}$?
– Compute A*B. Check that it makes sense with the results of A[1, 1] * B[1, 1], A[1, 2] * B[1, 2], A[2, 1] * B[2, 1] and A[2, 2] * B[2, 2]. Why?

– Create a data frame named worldPopulation such that:
  ∗ the first variable is called Year and contains the values c(1915, 1925, 1935, 1945, 1955, 1965, 1975, 1985, 1995, 2005, 2015).
  ∗ the second variable is called Population and contains the values c(1798.0, 1952.3, 2197.3, 2366.6, 2758.3, 3322.5, 4061.4, 4852.5, 5735.1, 6519.6, 7349.5).
– Write names(worldPopulation). Access the two variables.
– Create a new variable in worldPopulation called logPopulation that contains log(Population).
– Compute the standard deviation, mean and median of the variables in worldPopulation.
– Regress logPopulation on Year. Save the result as mod.
– Compute the summary of the model and save it as sumMod.
– Do a str on A, worldPopulation, mod and sumMod.
– Access the 𝑅2 and $\hat\sigma$ in sumMod.
– Check that 𝑅2 is the same as the squared correlation between predictor and response, and also the squared correlation between response and mod$fitted.values.


2.9.5 Model formulation and estimation

Answer the following conceptual questions in your own words:

– What is the difference between $(\beta_0, \beta_1)$ and $(\hat\beta_0, \hat\beta_1)$?
– Is $\hat\beta_0$ a random variable? What about $\hat\beta_1$? Justify your answer.
– What function are the least squares estimates minimizing? Is the choice of the kind of distances (horizontal, vertical, perpendicular) important?
– What is the justification for the use of a vertical distance in the RSS?
– Is 𝜎2 affecting the 𝑅2 (indirectly or directly)? Why?
– What are the residuals? What is their interpretation?
– What are the fitted values? What is their interpretation?
– What is the relation of $\hat\beta_1$ with 𝑟𝑥𝑦?

Finally, check that the regression line goes through $(\bar{X}, \bar{Y})$, in other words, that $\bar{Y} = \hat\beta_0 + \hat\beta_1\bar{X}$.
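A quick way of carrying out this last check numerically (a sketch with a generic simple fit; any of the previous regressions would work equally well, and modCheck is just an arbitrary name):

# The fitted line passes through (mean of predictor, mean of response)
modCheck <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(modCheck)[1] + coef(modCheck)[2] * mean(iris$Petal.Length)  # equals...
mean(iris$Petal.Width)                                           # ...the mean of the response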

2.9.6 Assumptions of the linear model

The dataset moreAssumptions.RData (download) contains the variables x1, …, x9 and y1, …, y9. For each regression y1 ~ x1, …, y9 ~ x9, describe whether the assumptions of the linear model are being satisfied or not. Justify your answer and state which assumption(s) you think are violated.

2.9.7 Nonlinear relations

Load moreAssumptions. For the regressions y1 ~ x1, y2 ~ x2, y6 ~ x6 and y9 ~ x9, identify which nonlinear transformation yields the largest 𝑅2. For that transformation, check whether the assumptions are verified.

Hints: use the transformations in Figure 2.26 for the first three regressions. For y9 ~ x9, try with (5 - x9)^2, abs(x9 - 5) and abs(x9 - 5)^3.

2.9.8 Case study: Moore's law

Moore's law (Moore, 1965) is an empirical law that states that the power of a computer doubles approximately every two years. More precisely:

Moore's law is the observation that the number of transistors in a dense integrated circuit [e.g. a CPU] doubles approximately every two years.

— Wikipedia article on Moore's law

Translated into a mathematical formula, Moore's law is

$$\text{transistors} \approx 2^{\text{years}/2}.$$

Applying logarithms to both sides gives (why?)

$$\log(\text{transistors}) \approx \frac{\log(2)}{2}\,\text{years}.$$

We can write the above formula more generally as

$$\log(\text{transistors}) = \beta_0 + \beta_1\,\text{years} + \varepsilon,$$

where 𝜀 is a random error. This is a linear model!

The dataset cpus.txt (download) contains the transistor counts for the CPUs that appeared in the time range 1971–2015. For this data, do the following:

– Import the data conveniently and name it cpus.
– Show a scatterplot of Transistor.count vs. Date.of.introduction with a linear regression.
– Are the assumptions verified in Transistor.count ~ Date.of.introduction? Which ones are more “problematic”?
– Create a new variable, named Log.Transistor.count, containing the logarithm of Transistor.count.
– Show a scatterplot of Log.Transistor.count vs. Date.of.introduction with a linear regression.
– Are the assumptions verified in Log.Transistor.count ~ Date.of.introduction? Which ones are more “problematic”?
– Regress Log.Transistor.count ~ Date.of.introduction.
– Summarize the fit. What are the estimates $\hat\beta_0$ and $\hat\beta_1$? Is $\hat\beta_1$ close to $\frac{\log(2)}{2}$?
– Compute the CI for 𝛽1 at 𝛼 = 0.05. Is $\frac{\log(2)}{2}$ inside it? What happens at levels 𝛼 = 0.10, 0.01?
– We want to forecast the average log-number of transistors for the CPUs to be released in 2017. Compute the adequate prediction and CI.
– A new CPU design is expected for 2017. What is the range of log-number of transistors expected for it, at a 95% level of confidence?
– Compute the ANOVA table for Log.Transistor.count ~ Date.of.introduction. Is 𝛽1 significant?


The dataset gpus.txt (download) contains the transistor counts for the GPUs that appeared in the period 1997–2016. Repeat the previous analysis for this dataset.

2.9.9 Case study: Growth in a time of debt

In the aftermath of the 2007–2008 financial crisis, the paper Growth in a time of debt (Reinhart and Rogoff, 2010), from Carmen M. Reinhart and Kenneth Rogoff (both at Harvard), provided important economic support for pro-austerity policies. The paper claimed that for levels of external debt in excess of 90% of the GDP, the GDP growth of a country was dramatically different than for lower levels of external debt. Therefore, it concluded the existence of a magical threshold – 90% – below which the level of external debt must be kept in order to have a growing economy. Figure 2.29, extracted from Reinhart and Rogoff (2010), illustrates the main finding.

Herndon et al. (2013) replicated the analysis of Reinhart and Rogoff (2010) and found that “selective exclusion of available data, coding errors and inappropriate weighting of summary statistics lead to serious miscalculations that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies”. The authors concluded that “both mean and median GDP growth when public debt levels exceed 90% of GDP are not dramatically different from when the public debt/GDP ratios are lower”. As a consequence, Reinhart and Rogoff (2010) led to an unjustified support for the adoption of austerity policies for countries with various levels of public debt.

You can read the full story at BBC, The New York Times and The Economist. Also, the video in Figure 2.30 contains a quick summary of the story by Nobel Prize laureate Paul Krugman.

Herndon et al. (2013) made the data of the study publicly available. You can download it here.

The dataset hap.txt (download) contains data for 20 advanced economies in the time period 1946–2009 and is the data source for the aforementioned papers. The variable dRGDP represents the real GDP growth (as a percentage) and debtgdp represents the percentage of public debt with respect to the GDP.

– Import the data and save it as hap.
– Set the case names of hap as Country.Year.
– Summarize dRGDP and debtgdp. What are their minimum and maximum values?


– What is the correlation between dRGDP and debtgdp? What is the standard deviation of each variable?
– Show the scatterplot of dRGDP vs. debtgdp with the regression line. Is this coherent with what was stated in the video at 1:30?
– Do you see any gap in the data around 90%? Is there any substantial change in dRGDP around there?
– Compute the linear regression of dRGDP on debtgdp and summarize the fit.
– What are the fitted coefficients? What are their standard errors? What is the 𝑅2?
– Compute the ANOVA table. How many degrees of freedom are there? What is the SSR? What is the SSE? What is the 𝑝-value for 𝐻0 ∶ 𝛽1 = 0?
– Is SSR larger than SSE? Is this coherent with the resulting 𝑅2?
– Are 𝛽0 and 𝛽1 significant for the regression at level 𝛼 = 0.05? And at levels 𝛼 = 0.10, 0.01?
– Compute the CIs for the coefficients. Can we conclude that the effect of debtgdp on dRGDP is positive at 𝛼 = 0.05? And negative?
– Predict the average growth for levels of debt of 60%, 70%, 80%, 90%, 100% and 110%. Compute the 95% CIs for all of them.
– Predict the growth for the previous levels of debt. Also compute the CIs for them. Is there a marked difference in the CIs for debt levels below and above 90%?
– Which assumptions of the linear model do you think are satisfied? Should we blindly trust the inferential results obtained assuming that the assumptions were satisfied?


Figure 2.24: Visualization of the ANOVA decomposition. SST measures the variation of 𝑌1, … , 𝑌𝑛 with respect to $\bar{Y}$. SSR measures the variation with respect to the conditional means, $\hat\beta_0 + \hat\beta_1 X_i$. SSE collects the variation of the residuals.

Figure 2.25: Illustration of the ANOVA decomposition and its dependence on 𝜎2 and $\hat\sigma^2$. Application also available here.


Figure 2.26: Some common nonlinear transformations and their negative counterparts. Recall the domain of definition of each transformation.



Figure 2.27: Left: quadratic pattern when plotting 𝑌 against 𝑋. Right: linearized pattern when plotting 𝑌 against 𝑋2.

Figure 2.28: Illustration of the choice of the nonlinear transformation. Application also available here.


Figure 2.29: The magical threshold of 90% external debt.

Figure 2.30: CNN interview with Paul Krugman (key point from 1:16 to 1:40), broadcast in 2013.


Chapter 3

Multiple linear regression

The multiple linear regression is an extension of the simple linear regression seen in Chapter 2. If the simple linear regression employed a single predictor 𝑋 to explain the response 𝑌, the multiple linear regression employs multiple predictors 𝑋1, … , 𝑋𝑘 for explaining a single response 𝑌:

𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝜀

To convince you why it is useful, let's begin by seeing what it can deliver in real-case scenarios!

3.1 Examples and applications

3.1.1 Case study I: The Bordeaux equation

Calculate the winter rain and the harvest rain (in millimeters). Add summer heat in the vineyard (in degrees centigrade). Subtract 12.145. And what do you have? A very, very passionate argument over wine.

— “Wine Equation Puts Some Noses Out of Joint”, The New York Times, 04/03/1990

This case study is motivated by the study of Princeton professor Orley Ashenfelter (Ashenfelter et al., 1995) on the quality of red Bordeaux vintages. The study became mainstream after disputes with the wine press, especially with Robert Parker, Jr., one of the most influential wine critics in America. See a short review of the story at the Financial Times (Google's cache) and at the video in Figure 3.3.

Red Bordeaux wines have been produced in Bordeaux, one of the most famous and prolific wine regions in the world, in a very similar way for hundreds of years.


However, the quality of vintages varies greatly from one season to another due to a long list of random factors, such as the weather conditions. Because Bordeaux wines taste better when they are older (young wines are astringent; when the wines age, they lose their astringency), there is an incentive to store the young wines until they are mature. Due to the important difference in taste, it is hard to determine the quality of the wine when it is so young just by tasting it, because it is going to change substantially when the aged wine is in the market. Therefore, being able to predict the quality of a vintage is valuable information for investing resources, for determining a fair price for vintages and for understanding what factors are affecting the wine quality. The purpose of this case study is to answer:

• Q1. Can we predict the quality of a vintage effectively?
• Q2. What is the interpretation of such a prediction?

The wine.csv file (download) contains 27 red Bordeaux vintages. The data is the one originally employed by Ashenfelter et al. (1995), except for the inclusion of the variable Year, the exclusion of NAs and the reference price used for the wine. The original source is here. Each row has the following variables:

• Year: year in which grapes were harvested to make wine.
• Price: logarithm of the average market price for Bordeaux vintages according to 1990–1991 auctions.¹ This is a nonlinear transformation of the response (hence different from what we did in Section 2.8) made to linearize the response.
• WinterRain: winter rainfall (in mm).
• AGST: Average Growing Season Temperature (in Celsius degrees).
• HarvestRain: harvest rainfall (in mm).
• Age: age of the wine measured as the number of years stored in a cask.
• FrancePop: population of France at Year (in thousands).

The quality of the wine is quantified as the Price, a clever way of quantifying a qualitative measure. The data is shown in Table 3.1.

Table 3.1: First 15 rows of the wine dataset.

Year  Price   WinterRain  AGST     HarvestRain  Age  FrancePop
1952  7.4950  600         17.1167  160          31   43183.57
1953  8.0393  690         16.7333  80           30   43495.03
1955  7.6858  502         17.1500  130          28   44217.86
1957  6.9845  420         16.1333  110          26   45152.25
1958  6.7772  582         16.4167  187          25   45653.81
1959  8.0757  485         17.4833  187          24   46128.64
1960  6.5188  763         16.4167  290          23   46584.00
1961  8.4937  830         17.3333  38           22   47128.00
1962  7.3880  697         16.3000  52           21   48088.67
1963  6.7127  608         15.7167  155          20   48798.99
1964  7.3094  402         17.2667  96           19   49356.94
1965  6.2518  602         15.3667  267          18   49801.82
1966  7.7443  819         16.5333  86           17   50254.97
1967  6.8398  714         16.2333  118          16   50650.41
1968  6.2435  610         16.2000  292          15   51034.41

¹ In Ashenfelter, Ashmore and Lalonde (1995), this variable is expressed relative to the price of the 1961 vintage, regarded as the best one ever recorded. In other words, they consider Price - 8.4937 as the price variable.

Let's begin by summarizing the information in Table 3.1. First, import the dataset correctly into R Commander and 'Set case names...' as the variable Year. Let's summarize and inspect the data in two ways:

1. Numerically. Go to 'Statistics' -> 'Summaries' -> 'Active data set'.

summary(wine)
##     Price        WinterRain       AGST        HarvestRain
## Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0
## 1st Qu.:6.508   1st Qu.:543.5   1st Qu.:16.15   1st Qu.: 88.0
## Median :6.984   Median :600.0   Median :16.42   Median :123.0
## Mean   :7.042   Mean   :608.4   Mean   :16.48   Mean   :144.8
## 3rd Qu.:7.441   3rd Qu.:705.5   3rd Qu.:17.01   3rd Qu.:185.5
## Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0
##      Age           FrancePop
## Min.   : 3.00   Min.   :43184
## 1st Qu.: 9.50   1st Qu.:46856
## Median :16.00   Median :50650
## Mean   :16.19   Mean   :50085
## 3rd Qu.:22.50   3rd Qu.:53511
## Max.   :31.00   Max.   :55110

Additionally, other summary statistics are available in 'Statistics' -> 'Summaries' -> 'Numerical summaries...'.

2. Graphically. Make a scatterplot matrix with all the variables. Add the 'Least-squares lines', 'Histograms' on the diagonals and choose to identify 2 points.

scatterplotMatrix(~ Age + AGST + FrancePop + HarvestRain + Price + WinterRain,
                  reg.line = lm, smooth = FALSE, spread = FALSE, span = 0.5,
                  ellipse = FALSE, levels = c(.5, .9), id.n = 2,
                  diagonal = 'histogram', data = wine)

Recall that the objective is to predict Price. Based on the above matrix scatterplot, the best way to predict Price with a simple linear regression seems to be with AGST or HarvestRain. Let's see which one yields the larger 𝑅2.



Figure 3.1: Scatterplot matrix for wine.

modAGST <- lm(Price ~ AGST, data = wine)
summary(modAGST)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.78370 -0.23827 -0.03421  0.29973  0.90198
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.5469     2.3641  -1.500 0.146052
## AGST          0.6426     0.1434   4.483 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4819 on 25 degrees of freedom
## Multiple R-squared: 0.4456, Adjusted R-squared: 0.4234
## F-statistic: 20.09 on 1 and 25 DF, p-value: 0.0001425


modHarvestRain <- lm(Price ~ HarvestRain, data = wine)
summary(modHarvestRain)
##
## Call:
## lm(formula = Price ~ HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.03792 -0.27679 -0.07892  0.40434  1.21958
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.679856   0.241911  31.747  < 2e-16 ***
## HarvestRain -0.004405   0.001497  -2.942  0.00693 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5577 on 25 degrees of freedom
## Multiple R-squared: 0.2572, Adjusted R-squared: 0.2275
## F-statistic: 8.658 on 1 and 25 DF, p-value: 0.00693

In Price ~ AGST, the intercept is not significant for the regression but the slope is, and AGST has a positive effect on the Price. For Price ~ HarvestRain, both intercept and slope are significant and the effect is negative.

Complete the analysis by computing the linear models Price ~ FrancePop, Price ~ Age and Price ~ WinterRain. Name them as modFrancePop, modAge and modWinterRain. Check if the intercepts and slopes are significant for the regression.

If we do the simple regressions of Price on the remaining predictors, we obtain a table like this for the 𝑅2:

Predictor     𝑅2
AGST          0.4456
HarvestRain   0.2572
FrancePop     0.2314
Age           0.2120
WinterRain    0.0181
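These 𝑅2's can also be reproduced programmatically. A possible sketch (the object names predictors and r2 are arbitrary):

# R^2 of each simple regression Price ~ predictor
predictors <- c("AGST", "HarvestRain", "FrancePop", "Age", "WinterRain")
r2 <- sapply(predictors, function(pred) {
  summary(lm(as.formula(paste("Price ~", pred)), data = wine))$r.squared
})
round(r2, 4)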

A natural question to ask is:

Can we combine these simple regressions to increase both the 𝑅2 and the prediction accuracy for Price?


The answer is yes, by means of the multiple linear regression. In order to make our first one, go to 'Statistics' -> 'Fit models' -> 'Linear model...'. A window like Figure 3.2 will pop up.

Figure 3.2: Window for performing multiple linear regression.

Set the response as Price and add the rest of the variables as predictors, in the form Age + AGST + FrancePop + HarvestRain + WinterRain. Note the use of + for including all the predictors. This does not mean that they are all summed and then the regression is done on the sum!² Instead, this notation is designed to resemble the multiple linear model:

𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝜀

If the model is named modWine1, we get the following summary when clicking 'OK':

modWine1 <- lm(Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain, data = wine)
summary(modWine1)
##
## Call:
## lm(formula = Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain,
##     data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46541 -0.24133  0.00413  0.18974  0.52495
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384
## Age          1.377e-02  5.821e-02   0.237  0.81531
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.7868
## F-statistic: 20.19 on 5 and 21 DF, p-value: 2.232e-07

² If you wanted to do so, you would need the function I() to indicate that + is not including predictors in the model, but is acting as a sum operator: Price ~ I(Age + AGST + FrancePop + HarvestRain + WinterRain).

The main difference with simple linear regressions is that we have more rows in the 'Coefficients' section, since these correspond to each of the predictors. The fitted regression is Price = −2.343 + 0.013 × Age + 0.614 × AGST − 0.000 × FrancePop − 0.003 × HarvestRain + 0.001 × WinterRain. Recall that the 'Multiple R-squared' has almost doubled with respect to the best simple linear regression!³ This tells us that we can explain up to 82.78% of the Price variability by the predictors.

³ The 𝑅2 for the multiple linear regression 𝑌 = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝜀 is not the sum of the 𝑅2's for the simple linear regressions 𝑌 = 𝛽0 + 𝛽𝑗𝑋𝑗 + 𝜀, 𝑗 = 1, … , 𝑘.

Note however that several terms are not significant for the regression: FrancePop, Age and the intercept. This is an indication of an excess of predictors adding little information to the response. Note the almost perfect correlation between FrancePop and Age shown in Figure 3.1: one of them is not adding any extra information to explain Price. This complicates the model unnecessarily and, more importantly, it has the undesirable effect of making the coefficient estimates less precise. We opt to remove the predictor FrancePop from the model since it is exogenous to the wine context.

Two useful tips about lm's syntax for including/excluding predictors faster:

– Price ~ . -> includes all the variables in the dataset as predictors. It is equivalent to Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain.

– Price ~ . - FrancePop -> includes all the variables except the ones preceded by - as predictors. It is equivalent to Price ~ Age + AGST + HarvestRain + WinterRain.

Then, the model without FrancePop is


modWine2 <- lm(Price ~ . - FrancePop, data = wine)
summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.46024 -0.23862  0.01347  0.18601  0.53443
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703  1.6880876  -2.163  0.04167 *
## WinterRain   0.0011667  0.0004820   2.420  0.02421 *
## AGST         0.6163916  0.0951747   6.476 1.63e-06 ***
## HarvestRain -0.0038606  0.0008075  -4.781 8.97e-05 ***
## Age          0.0238480  0.0071667   3.328  0.00305 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08

All the coefficients are significant at level 𝛼 = 0.05. Therefore, there is no clear redundant information. In addition, the 𝑅2 is very similar to that of the full model, but the 'Adjusted R-squared', a weighting of the 𝑅2 to account for the number of predictors used by the model, is slightly larger. This means that, relative to the number of predictors used, modWine2 explains more variability of Price than modWine1. Later in this chapter we will see the precise meaning of the adjusted 𝑅2.
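Both quantities can be extracted directly from the summaries. A quick check (sketch, using the accessors of summary.lm):

summary(modWine1)$r.squared       # 0.8278, slightly larger with 5 predictors
summary(modWine2)$r.squared       # 0.8275
summary(modWine1)$adj.r.squared   # 0.7868
summary(modWine2)$adj.r.squared   # 0.7962, larger despite using one predictor less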

The comparison of the coefficients of both models can be done with 'Models' -> 'Compare model coefficients...':

compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain,
##   data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
##
##               Model 1   Model 2
## (Intercept)     -2.34     -3.65
## SE               7.70      1.69
##
## Age           0.01377   0.02385
## SE            0.05821   0.00717
##
## AGST           0.6144    0.6164
## SE             0.0980    0.0952
##
## FrancePop   -2.21e-05
## SE           1.27e-04
##
## HarvestRain -0.003837 -0.003861
## SE           0.000837  0.000808
##
## WinterRain   0.001153  0.001167
## SE           0.000499  0.000482
##

Note how the coefficients for modWine2 have smaller standard errors than those of modWine1.

As a conclusion, modWine2 is a model that explains 82.75% of the variability in a non-redundant way and with all its coefficients significant. Therefore, we have a formula for effectively explaining and predicting the quality of a vintage (this answers Q1).

The interpretation of modWine2 agrees with well-known facts in viticulture that make perfect sense (Q2):

• Higher temperatures are associated with better quality (higher priced) wine.

• Rain before the growing season is good for the wine quality, but during harvest it is bad.

• The quality of the wine improves with age.

Although these were known facts, keep in mind that the model allows us to quantify the effect of each variable on the wine quality and provides us with a precise way of predicting the quality of future vintages.
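For example, a prediction for a hypothetical new vintage could be sketched as follows (the weather values below are made up purely for illustration; prediction is treated in detail in Section 3.5):

# Hypothetical weather conditions for a new vintage
newVintage <- data.frame(WinterRain = 650, AGST = 17, HarvestRain = 100, Age = 2)
predict(modWine2, newdata = newVintage)  # predicted (log-)price for that vintage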

Create a new variable in wine named PriceOrley, defined as Price - 8.4937. Check that the model PriceOrley ~ . - FrancePop - Price essentially coincides with the formula given in the second paragraph of the Financial Times article (Google's cache). (There are a couple of typos in the article's formula: the Age term is missing and the ACGS coefficient has an extra zero. I emailed the author; his answer: “Thanks for the heads up on this. Ian Ayres.”)


Figure 3.3: ABC interview with Orley Ashenfelter, broadcast in 1992.

3.1.2 Case study II: Housing values in Boston

The second case study is motivated by Harrison and Rubinfeld (1978), who proposed a hedonic model for determining the willingness of house buyers to pay for clean air. A hedonic model is a model that decomposes the price of an item into separate components that determine its price. For example, a hedonic model for the price of a house may decompose its price into the house characteristics, the kind of neighborhood and the location. The study of Harrison and Rubinfeld (1978) employed data from the Boston metropolitan area, containing 506 suburbs and 14 variables. The Boston dataset is available through the file Boston.xlsx (download) and through the dataset Boston in the MASS package (load MASS by 'Tools' -> 'Load package(s)...').

The description of the related variables can be found in ?Boston and Harrison and Rubinfeld (1978)⁴, but we summarize here the most important ones as they appear in Boston. They are aggregated into five topics:

• Dependent variable: medv, the median value of owner-occupied homes (in thousands of dollars).

• Structural variables indicating the house characteristics: rm (average number of rooms “in owner units”) and age (proportion of owner-occupied units built prior to 1940).

• Neighborhood variables: crim (crime rate), zn (proportion of residential areas), indus (proportion of non-retail business area), chas (river limitation), tax (cost of public services in each community), ptratio (pupil-teacher ratio), black (variable 1000(𝐵 − 0.63)^2, where 𝐵 is the black proportion of population – low and high values of 𝐵 increase housing prices) and lstat (percent of lower status of the population).

⁴ But be aware of the changes in units for medv, black, lstat and nox.


• Accessibility variables: dis (distances to five Boston employment centers) and rad (accessibility to radial highways – a larger index denotes better accessibility).

• Air pollution variable: nox, the annual concentration of nitrogen oxide (in parts per ten million).

A summary of the data is shown below:

summary(Boston)
##       crim                zn             indus            chas
## Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
## 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
## Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
## Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
## 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
## Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       nox              rm             age              dis
## Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
## 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
## Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
## Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
## 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
## Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       rad              tax           ptratio          black
## Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
## 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
## Median : 5.000   Median :330.0   Median :19.05   Median :391.44
## Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
## 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
## Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      lstat            medv
## Min.   : 1.73   Min.   : 5.00
## 1st Qu.: 6.95   1st Qu.:17.02
## Median :11.36   Median :21.20
## Mean   :12.65   Mean   :22.53
## 3rd Qu.:16.95   3rd Qu.:25.00
## Max.   :37.97   Max.   :50.00

The two goals of this case study are:

• Q1. Quantify the influence of the predictor variables on the housing prices.
• Q2. Obtain the “best possible” model for decomposing the housing values and interpret it.

We begin by making an exploratory analysis of the data with a matrix scatterplot. Since the number of variables is high, we opt to plot only five variables: crim, dis, medv, nox and rm. Each of them represents one of the five topics in which the variables were classified.


scatterplotMatrix(~ crim + dis + medv + nox + rm, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5, .9),
                  id.n = 0, diagonal = 'density', data = Boston)


Figure 3.4: Scatterplot matrix for crim, dis, medv, nox and rm from the Boston dataset.

The diagonal panels show an estimate of the unknown density of each variable. Note the peculiar distribution of crim, very concentrated at zero, and the asymmetry in medv, with a second mode associated with the most expensive properties. Inspecting the individual panels, it is clear that some nonlinearity exists in the data. For simplicity, we disregard that analysis for the moment (but see the final exercise).

Let's fit a multiple linear regression for explaining medv. There are a good number of variables now and some of them might be of little use for predicting medv. However, there is no clear intuition about which predictors will yield better explanations of medv with the information at hand. Therefore, we can start by fitting a linear model on all the predictors:

modHouse <- lm(medv ~ ., data = Boston)
summary(modHouse)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.595  -2.730  -0.518   1.777  26.199
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288
## chas         2.687e+00  8.616e-01   3.118 0.001925 **
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

There are a couple of non-significant variables, but so far the model has an 𝑅2 = 0.74 and the fitted coefficients are consistent with what would be expected. For example, crim, tax, ptratio and nox have negative effects on medv, while rm, rad and chas have positive ones. However, the non-significant coefficients do not improve the model significantly, but only add artificial noise and decrease the overall accuracy of the coefficient estimates!

Let's polish the previous model a little. Instead of manually removing each non-significant variable to reduce the complexity, we employ an automatic tool in R called stepwise model selection. It has different flavors, which we will see in detail in Section 3.7, but essentially this powerful tool usually ends up selecting “a” best model: a model that delivers the maximum fit with the minimum number of variables.

The stepwise model selection is located at 'Models' -> 'Stepwise model selection...' and is always applied on the active model. Apply it with the default options, saving the result as modBest:


modBest <- stepwise(modHouse, direction = 'backward/forward', criterion = 'BIC')
##
## Direction: backward/forward
## Criterion: BIC
##
## Start: AIC=1648.81
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
##     tax + ptratio + black + lstat
##
##           Df Sum of Sq   RSS    AIC
## - age      1      0.06 11079 1642.6
## - indus    1      2.52 11081 1642.7
## <none>                 11079 1648.8
## - chas     1    218.97 11298 1652.5
## - tax      1    242.26 11321 1653.5
## - crim     1    243.22 11322 1653.6
## - zn       1    257.49 11336 1654.2
## - black    1    270.63 11349 1654.8
## - rad      1    479.15 11558 1664.0
## - nox      1    487.16 11566 1664.4
## - ptratio  1   1194.23 12273 1694.4
## - dis      1   1232.41 12311 1696.0
## - rm       1   1871.32 12950 1721.6
## - lstat    1   2410.84 13490 1742.2
##
## Step: AIC=1642.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
##     ptratio + black + lstat
##
##           Df Sum of Sq   RSS    AIC
## - indus    1      2.52 11081 1636.5
## <none>                 11079 1642.6
## - chas     1    219.91 11299 1646.3
## - tax      1    242.24 11321 1647.3
## - crim     1    243.20 11322 1647.3
## - zn       1    260.32 11339 1648.1
## - black    1    272.26 11351 1648.7
## + age      1      0.06 11079 1648.8
## - rad      1    481.09 11560 1657.9
## - nox      1    520.87 11600 1659.6
## - ptratio  1   1200.23 12279 1688.4
## - dis      1   1352.26 12431 1694.6
## - rm       1   1959.55 13038 1718.8
## - lstat    1   2718.88 13798 1747.4
##


## Step: AIC=1636.48
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
##     black + lstat
##
##           Df Sum of Sq   RSS    AIC
## <none>                 11081 1636.5
## - chas     1    227.21 11309 1640.5
## - crim     1    245.37 11327 1641.3
## - zn       1    257.82 11339 1641.9
## - black    1    270.82 11352 1642.5
## + indus    1      2.52 11079 1642.6
## - tax      1    273.62 11355 1642.6
## + age      1      0.06 11081 1642.7
## - rad      1    500.92 11582 1652.6
## - nox      1    541.91 11623 1654.4
## - ptratio  1   1206.45 12288 1682.5
## - dis      1   1448.94 12530 1692.4
## - rm       1   1963.66 13045 1712.8
## - lstat    1   2723.48 13805 1741.5

Note the different steps: it starts with the full model and, when a variable appears with a + (e.g. + age), it means that the variable has been excluded at that step (and adding it back is being reconsidered). The procedure seeks to minimize an information criterion (BIC or AIC)⁵. An information criterion balances the fitness of a model with the number of predictors employed. Hence, it determines objectively the best model: the one that minimizes the information criterion. Remember to save the output to a variable if you want to keep the final model (you need to do this in R)!
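As a quick sanity check (sketch), the information criteria of the initial and the selected models can be compared directly. Note that BIC() and AIC() use a different additive constant than the values printed by stepwise(), so only the comparison between models is meaningful:

BIC(modHouse, modBest)  # the selected model attains the smaller BIC
AIC(modHouse, modBest)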

The summary of the final model is:

summary(modBest)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
##     tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.5984  -2.7386  -0.5046   1.7273  26.2373
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 **
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 **
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

⁵ Although note that the printed messages always display 'AIC' even if you choose 'BIC'.

Let's compute the confidence intervals at level 𝛼 = 0.05:

confint(modBest)
##                    2.5 %       97.5 %
## (Intercept) 26.384649126  46.29764088
## crim        -0.172817670  -0.04400902
## zn           0.019275889   0.07241397
## chas         1.040324913   4.39710769
## nox        -24.321990312 -10.43005655
## rm           3.003258393   4.59989929
## dis         -1.857631161  -1.12779176
## rad          0.175037411   0.42417950
## tax         -0.018403857  -0.00515209
## ptratio     -1.200109823  -0.69293932
## black        0.004037216   0.01454447
## lstat       -0.615731781  -0.42937513

We have quantified the influence of the predictor variables on the housing prices (Q1) and we can conclude that, in the final model and with confidence level 𝛼 = 0.05:

• zn, chas, rm, rad and black have a significantly positive influence on medv.

• crim, nox, dis, tax, ptratio and lstat have a significantly negative influence on medv.

The model employed in Harrison and Rubinfeld (1978) is different from modBest. In the paper, several nonlinear transformations of the predictors (remember Section 2.8) and the response are done to improve the linear fit. Also, different units are used for medv, black, lstat and nox. The authors considered these variables:

– Response: log(1000 * medv)
– Linear predictors: age, black / 1000 (this variable corresponds to their (𝐵 − 0.63)^2), tax, ptratio, crim, zn, indus and chas.
– Nonlinear predictors: rm^2, log(dis), log(rad), log(lstat / 100) and (10 * nox)^2.

Do the following:

1. Check if the model with such predictors corresponds to the one in the first column, Table VII, page 100 of Harrison and Rubinfeld (1978) (open-access paper available here). To do so, save this model as modelHarrison and summarize it. Hint: the formula should be something like I(log(1000 * medv)) ~ age + I(black / 1000) + ... + I(log(lstat / 100)) + I((10 * nox)^2).

2. Make a stepwise selection of the variables in modelHarrison (use defaults) and save it as modelHarrisonSel. Summarize it.

3. Which model has a larger 𝑅2? And adjusted 𝑅2? Which is simpler and has more significant coefficients?

3.2 Model formulation and estimation by least squares

The multiple linear model extends the simple linear model by describing the relation between the random variables 𝑋1, … , 𝑋𝑘 and 𝑌. For example, in the last model for the wine dataset, we had 𝑘 = 4 variables 𝑋1 = WinterRain, 𝑋2 = AGST, 𝑋3 = HarvestRain and 𝑋4 = Age and 𝑌 = Price. Therefore, as in Section 2.3, the multiple linear model is constructed by assuming that the linear relation

𝑌 = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝜀 (3.1)

holds between the predictors 𝑋1, … , 𝑋𝑘 and the response 𝑌. In (3.1), 𝛽0 is the intercept and 𝛽1, … , 𝛽𝑘 are the slopes, respectively. 𝜀 is a random variable with mean zero and independent from 𝑋1, … , 𝑋𝑘. Another way of looking at (3.1) is

𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘, (3.2)

since 𝔼[𝜀|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 0.

The LHS of (3.2) is the conditional expectation of 𝑌 given 𝑋1, … , 𝑋𝑘. It represents how the mean of the random variable 𝑌 is changing according to particular values, denoted by 𝑥1, … , 𝑥𝑘, of the random variables 𝑋1, … , 𝑋𝑘. With the RHS,


what we are saying is that the mean of 𝑌 changes in a linear fashion with respect to the values of 𝑋1, … , 𝑋𝑘. Hence the interpretation of the coefficients:

• 𝛽0: is the mean of 𝑌 when 𝑋1 = … = 𝑋𝑘 = 0.
• 𝛽𝑗, 1 ≤ 𝑗 ≤ 𝑘: is the increment in the mean of 𝑌 for an increment of one unit in 𝑋𝑗 = 𝑥𝑗, provided that the remaining variables 𝑋1, … , 𝑋𝑗−1, 𝑋𝑗+1, … , 𝑋𝑘 do not change (see the worked identity below).
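This reading of 𝛽𝑗 follows directly from (3.2): keeping all the predictors except 𝑋𝑗 fixed and increasing 𝑥𝑗 by one unit, the two conditional means differ by

$$\mathbb{E}[Y \mid X_j = x_j + 1, \text{rest fixed}] - \mathbb{E}[Y \mid X_j = x_j, \text{rest fixed}] = \beta_j(x_j + 1) - \beta_j x_j = \beta_j.$$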

Figure 3.5 illustrates the geometrical interpretation of a multiple linear model: a plane in the (𝑘 + 1)-dimensional space. If 𝑘 = 1, the plane is the regression line for simple linear regression. If 𝑘 = 2, then the plane can be visualized in a three-dimensional plot.

Figure 3.5: The least squares regression plane 𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 and its dependence on the kind of squared distance considered. Application also available here.

The estimation of 𝛽0, 𝛽1, … , 𝛽𝑘 is done as in simple linear regression, by minimizing the Residual Sum of Squares (RSS). First we need to introduce some helpful matrix notation. In the following, boldface is used for distinguishing vectors and matrices from scalars:

• A sample of (𝑋1, … , 𝑋𝑘, 𝑌 ) is (𝑋11, … , 𝑋1𝑘, 𝑌1), … , (𝑋𝑛1, … , 𝑋𝑛𝑘, 𝑌𝑛), where 𝑋𝑖𝑗 denotes the 𝑖-th observation of the 𝑗-th predictor 𝑋𝑗. We denote by X𝑖 = (𝑋𝑖1, … , 𝑋𝑖𝑘) the 𝑖-th observation of (𝑋1, … , 𝑋𝑘), so the sample simplifies to (X1, 𝑌1), … , (X𝑛, 𝑌𝑛).

• The design matrix contains all the information of the predictors and a column of ones:

$$\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}_{n\times(k+1)}.$$

• The vector of responses $\mathbf{Y}$, the vector of coefficients $\boldsymbol{\beta}$ and the vector of errors are, respectively⁶,

$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n\times 1}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}_{(k+1)\times 1} \quad \text{and} \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n\times 1}.$$

Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

𝑌𝑖 = 𝛽0 + 𝛽1𝑋𝑖1 + ⋯ + 𝛽𝑘𝑋𝑖𝑘 + 𝜀𝑖, 𝑖 = 1, … , 𝑛,

into something as compact as

Y = X𝛽 + 𝜀.

Recall that if 𝑘 = 1 we have the simple linear model. In this case:

$$\mathbf{X} = \begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}_{n\times 2} \quad \text{and} \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}_{2\times 1}.$$

The RSS for the multiple linear regression is

$$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_k X_{ik})^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}). \quad (3.3)$$

The RSS aggregates the squared vertical distances from the data to a regression plane given by 𝛽. Remember that the vertical distances are considered because we want to minimize the error in the prediction of 𝑌. The least squares estimators are the minimizers of the RSS⁷:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}} \mathrm{RSS}(\boldsymbol{\beta}).$$

⁶ The vectors are regarded as column matrices.
⁷ They are unique and always exist.


Luckily, thanks to the matrix form of (3.3), it is simple to compute a closed-form expression for the least squares estimates:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \quad (3.4)$$

There are some similarities between (3.4) and $\hat\beta_1 = (s_x^2)^{-1} s_{xy}$ from the simple linear model: both are related to the covariance between X and 𝑌 weighted by the variance of X.

The data of the illustration has been generated with the following code:

# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567) # Fixes the seed for the random generator
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
yLin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- -0.5 + x1^2 + 0.5 * x2 + eps
yExp <- -0.5 + 0.5 * exp(x2) + x3 + eps

# Data
leastSquares3D <- data.frame(x1 = x1, x2 = x2, yLin = yLin,
                             yQua = yQua, yExp = yExp)

Let's check that indeed the coefficients given by lm are the ones given by equation (3.4) for the regression yLin ~ x1 + x2.

# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
##          [,1]
##    -0.5702694
## x1  0.4832624
## x2  0.3214894


# Output from lm
mod <- lm(yLin ~ x1 + x2, data = leastSquares3D)
mod$coefficients
## (Intercept)          x1          x2
##  -0.5702694   0.4832624   0.3214894

Compute $\hat{\boldsymbol{\beta}}$ for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2 and yExp ~ x2 + x3 using:

– equation (3.4) and
– the function lm.

Check that the fitted plane and the coefficient estimates are coherent.

Once we have the least squares estimates $\hat{\boldsymbol{\beta}}$, we can define the next two concepts:

• The fitted values $\hat{Y}_1, \ldots, \hat{Y}_n$, where

$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}, \quad i = 1, \ldots, n.$$

They are the vertical projections of 𝑌1, … , 𝑌𝑛 onto the fitted plane (see Figure 3.5). In matrix form, plugging in (3.4),

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{H}\mathbf{Y},$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix because it “puts the hat on $\mathbf{Y}$”. What it does is to project $\mathbf{Y}$ onto the regression plane (see Figure 3.5 and the small check after this list).

• The residuals (or estimated errors) $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$, where

$$\hat\varepsilon_i = Y_i - \hat{Y}_i, \quad i = 1, \ldots, n.$$

They are the vertical distances between the actual data and the fitted data.
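A small numerical check of the hat matrix (a sketch reusing the X, Y and mod objects created above; not part of the original notes):

# H projects Y onto the column space of X, reproducing the fitted values of lm
H <- X %*% solve(t(X) %*% X) %*% t(X)
head(cbind(H %*% Y, mod$fitted.values))  # the two columns coincide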

We conclude with an insight on the relation of multiple and simple linear regressions. It is illustrated in Figure 3.6.

Consider the multiple linear model 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝜀 and its associated simple linear models 𝑌 = 𝛼0 + 𝛼1𝑋1 + 𝜀 and 𝑌 = 𝛾0 + 𝛾1𝑋2 + 𝜀. Assume that we have a sample (𝑋11, 𝑋12, 𝑌1), … , (𝑋𝑛1, 𝑋𝑛2, 𝑌𝑛). Then, in general, 𝛼0 ≠ 𝛽0, 𝛼1 ≠ 𝛽1, 𝛾0 ≠ 𝛽0 and 𝛾1 ≠ 𝛽1. That is, in general, the inclusion of a new predictor changes the coefficient estimates.

The data employed in Figure 3.6 is:


Figure 3.6: The regression plane (blue) and its relation with the simple linear regressions (green lines). The red points represent the sample for (𝑋1, 𝑋2, 𝑌 ) and the black points the subsamples for (𝑋1, 𝑋2) (bottom), (𝑋1, 𝑌 ) (left) and (𝑋2, 𝑌 ) (right).

set.seed(212542)
n <- 100
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n, mean = x1, sd = 3)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)

With the above data, check how the fitted coefficients change for y ~ x1, y ~ x2 and y ~ x1 + x2.

3.3 Assumptions of the model

Some probabilistic assumptions are required for performing inference on the model parameters. In other words, to infer properties about the unknown population coefficients 𝛽 from the sample (X1, 𝑌1), … , (X𝑛, 𝑌𝑛).

Figure 3.7: The key concepts of the multiple linear model when 𝑘 = 2. The space between the yellow planes denotes where the 95% of the data lies, according to the model.

The assumptions of the multiple linear model are an extension of those of the simple linear model:

i. Linearity: 𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘.
ii. Homoscedasticity: 𝕍ar(𝜀𝑖) = 𝜎2, with 𝜎2 constant for 𝑖 = 1, … , 𝑛.
iii. Normality: 𝜀𝑖 ∼ 𝒩(0, 𝜎2) for 𝑖 = 1, … , 𝑛.
iv. Independence of the errors: 𝜀1, … , 𝜀𝑛 are independent (or uncorrelated, 𝔼[𝜀𝑖𝜀𝑗] = 0, 𝑖 ≠ 𝑗, since they are assumed to be normal).

A good one-line summary of the linear model is the following (independence is assumed):

𝑌 |(𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) ∼ 𝒩(𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘, 𝜎2). (3.5)

Recall:

– Compared with simple linear regression, the only different assumption is linearity.

– Nothing is said about the distribution of 𝑋1, … , 𝑋𝑘. They could be deterministic or random. They could be discrete or continuous.

– 𝑋1, … , 𝑋𝑘 are not required to be independent between them.

– 𝑌 has to be continuous, since the errors are normal – recall (2.1).

Figure 3.8 represents situations where the assumptions of the model are respected and violated, for the situation with two predictors. Clearly, the inspection of the scatterplots for identifying strange patterns is more complicated than in simple linear regression – and here we are dealing with only two predictors. In Section 3.8 we will see more sophisticated methods for checking whether the assumptions hold or not for an arbitrary number of predictors.

Figure 3.8: Valid (all the assumptions are verified) and problematic (a single assumption does not hold) multiple linear models, when there are two predictors. Application also available here.

To conclude this section, let's see how to make a 3D scatterplot with the regression plane, in order to evaluate visually how good the fit of the model is. We will do it with the iris dataset, which can be imported in R simply by running data(iris). In R Commander go to 'Graphs' -> '3D Graphs' -> '3D scatterplot...'. A window like Figures 3.9 and 3.10 will pop up. The options are similar to the ones for 'Graphs' -> 'Scatterplot...'.

If you select the options as shown in Figures 3.9 and 3.10, you should get


Figure 3.9: 3D scatterplot window, 'Data' panel.

Figure 3.10: 3D scatterplot window, 'Options' panel. Remember to tick the 'Linear least-squares fit' box in order to display the fitted regression plane.

something like this:

data(iris)
scatter3d(Petal.Length ~ Petal.Width + Sepal.Length, data = iris, fit = "linear",
          residuals = TRUE, bg = "white", axis.scales = TRUE, grid = TRUE,
          ellipsoid = FALSE, id.method = 'mahal', id.n = 2)


3.4 Inference for model parameters

The assumptions introduced in the previous section allow us to specify the distribution of the random vector $\hat{\boldsymbol{\beta}}$. The distribution is derived conditionally on the sample predictors X1, … , X𝑛. In other words, we assume that the randomness of Y = X𝛽 + 𝜀 comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors x1, … , x𝑛.

3.4.1 Distributions of the fitted coefficients

The distribution of $\hat{\boldsymbol{\beta}}$ is:

$$\hat{\boldsymbol{\beta}} \sim \mathcal{N}_{k+1}\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \quad (3.6)$$

where 𝒩𝑚 is the 𝑚-dimensional normal, that is, the extension of the usual normal distribution to deal with 𝑚 random variables⁸. The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:

• Bias. The estimates are unbiased.

• Variance. It depends on:

  – Sample size 𝑛. Hidden inside X𝑇X. As 𝑛 grows, the precision of the estimators increases.
  – Error variance 𝜎2. The larger 𝜎2 is, the less precise $\hat{\boldsymbol{\beta}}$ is.
  – Predictor sparsity (X𝑇X)−1. The sparser the predictor is (small |(X𝑇X)−1|), the more precise $\hat{\boldsymbol{\beta}}$ is.

The problem with (3.6) is that 𝜎2 is unknown in practice, so we need to estimate 𝜎2 from the data. We do so by computing a rescaled sample variance of the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$:

$$\hat\sigma^2 = \frac{\sum_{i=1}^{n} \hat\varepsilon_i^2}{n - k - 1}.$$

Note the 𝑛 − 𝑘 − 1 in the denominator. Now 𝑛 − 𝑘 − 1 are the degrees of freedom, the number of data points minus the number of already fitted parameters (𝑘 slopes and 1 intercept). As in simple linear regression, the mean of the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$ is zero.
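In R, this estimate corresponds to the squared 'Residual standard error' reported by summary. A quick check with the mod object fitted in Section 3.2 (a sketch; any lm fit would do):

n <- nobs(mod)                        # number of observations
k <- length(coef(mod)) - 1            # number of slopes
sum(residuals(mod)^2) / (n - k - 1)   # hat sigma^2 computed from its definition
summary(mod)$sigma^2                  # the same value, as reported by summary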

If we use the estimate $\hat\sigma^2$ instead of 𝜎2, we get more useful distributions, this time for the individual $\hat\beta_j$'s:

$$\frac{\hat\beta_j - \beta_j}{\mathrm{SE}(\hat\beta_j)} \sim t_{n-k-1}, \qquad \mathrm{SE}(\hat\beta_j)^2 = \hat\sigma^2 v_j \quad (3.7)$$

⁸ With 𝑚 = 1, the density of a 𝒩𝑚 corresponds to a bell-shaped curve. With 𝑚 = 2, the density is a surface similar to a bell.


where 𝑡𝑛−𝑘−1 represents the Student's 𝑡 distribution with 𝑛 − 𝑘 − 1 degrees of freedom and

𝑣𝑗 is the 𝑗-th element of the diagonal of (X𝑇X)−1.

The LHS of (3.7) is the 𝑡-statistic for 𝛽𝑗, 𝑗 = 0, … , 𝑘. They are employed for building confidence intervals and hypothesis tests.

3.4.2 Confidence intervals for the coefficients

Thanks to (3.7), we can have the 100(1 − 𝛼)% CI for the coefficient 𝛽𝑗, 𝑗 = 0, … , 𝑘:

$$\left(\hat\beta_j \pm \mathrm{SE}(\hat\beta_j)\, t_{n-k-1;\alpha/2}\right) \quad (3.8)$$

where 𝑡𝑛−𝑘−1;𝛼/2 is the 𝛼/2-upper quantile of the 𝑡𝑛−𝑘−1. Note that with 𝑘 = 1 we have the same CI as in (2.9).

Let's see how we can compute the CIs. We return to the wine dataset, so in case you do not have it loaded, you can download it here as an .RData file. We analyze the CIs for the coefficients of Price ~ Age + WinterRain.

# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)

# Confidence intervals at 95%
confint(mod)
##                     2.5 %      97.5 %
## (Intercept)  4.746010626 7.220074676
## Age          0.007702664 0.064409106
## WinterRain  -0.001030725 0.002593278

# Confidence intervals at other levels
confint(mod, level = 0.90)
##                       5 %        95 %
## (Intercept)  4.9575969417 7.008488360
## Age          0.0125522989 0.059559471
## WinterRain  -0.0007207941 0.002283347
confint(mod, level = 0.99)
##                     0.5 %      99.5 %
## (Intercept)  4.306650310 7.659434991
## Age         -0.002367633 0.074479403
## WinterRain  -0.001674299 0.003236852

In this example, the 95% confidence interval for 𝛽0 is (4.7460, 7.2201), for 𝛽1 it is (0.0077, 0.0644) and for 𝛽2 it is (−0.0010, 0.0026). Therefore, we can say with 95% confidence that the coefficient of WinterRain is not significant. But in Section 3.1.1 we saw that it was significant in the model Price ~ Age + AGST


+ HarvestRain + WinterRain! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimates, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model Price ~ Age + WinterRain, with 𝛼 = 0.05, the coefficient of WinterRain is not significant. Note that this does not mean that it will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.

Compute and interpret the CIs for the coefficients, at levels 𝛼 = 0.10, 0.05, 0.01, for the following regressions:

– medv ~ . - lstat - chas - zn - crim (Boston)
– nox ~ chas + zn + indus + lstat + dis + rad (Boston)
– Price ~ WinterRain + HarvestRain + AGST (wine)
– AGST ~ Year + FrancePop (wine)

3.4.3 Testing on the coefficients

The distributions in (3.7) also allow us to conduct formal hypothesis tests on the coefficients 𝛽𝑗, 𝑗 = 0, … , 𝑘. For example, the test for significance is especially important:

𝐻0 ∶ 𝛽𝑗 = 0

for 𝑗 = 0, … , 𝑘. The test of 𝐻0 ∶ 𝛽𝑗 = 0 with 1 ≤ 𝑗 ≤ 𝑘 is especially interesting, since it allows us to answer whether the variable 𝑋𝑗 has a significant linear effect on 𝑌. The statistic used for testing significance is the 𝑡-statistic

$$\frac{\hat{\beta}_j - 0}{\mathrm{SE}(\hat{\beta}_j)},$$

which is distributed as a 𝑡𝑛−𝑘−1 under (the veracity of) the null hypothesis. 𝐻0 is tested against the bilateral alternative hypothesis 𝐻1 ∶ 𝛽𝑗 ≠ 0.

Remember two important insights regarding hypothesis testing.

In a hypothesis test, the 𝑝-value measures the degree of veracity of 𝐻0 according to the data. The rule of thumb is the following:

Is the 𝑝-value lower than 𝛼?

– Yes → reject 𝐻0.
– No → do not reject 𝐻0.

The connection between the 𝑡-test for 𝐻0 ∶ 𝛽𝑗 = 0 and the CI for 𝛽𝑗, both at level 𝛼, is the following.


Is 0 inside the CI for 𝛽𝑗?

– Yes ↔ do not reject 𝐻0.
– No ↔ reject 𝐻0.
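This connection can be checked numerically; a minimal sketch, reusing the mod object fitted above for Price ~ Age + WinterRain:

# p-values of the t-tests for H0: beta_j = 0
summary(mod)$coefficients[, 4]
# 95% CIs: H0 is rejected at alpha = 0.05 exactly when 0 lies outside the CI
confint(mod, level = 0.95)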

The tests for significance are built into the summary function, as we saw in Section 3. For mod, the regression of Price ~ Age + WinterRain, we have:

summary(mod)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88964 -0.51421 -0.00066  0.43103  1.06897
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427  0.5993667   9.982 5.09e-10 ***
## Age         0.0360559  0.0137377   2.625   0.0149 *
## WinterRain  0.0007813  0.0008780   0.890   0.3824
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared:  0.2371, Adjusted R-squared:  0.1736
## F-statistic:  3.73 on 2 and 24 DF,  p-value: 0.03884

The unilateral test 𝐻0 ∶ 𝛽𝑗 ≥ 0 (respectively, 𝐻0 ∶ 𝛽𝑗 ≤ 0) vs 𝐻1 ∶ 𝛽𝑗 < 0 (𝐻1 ∶ 𝛽𝑗 > 0) can be done by means of the CI for 𝛽𝑗. If 𝐻0 is rejected, this allows us to conclude that 𝛽𝑗 is significantly negative (positive) and that, for the considered regression model, 𝑋𝑗 has a significant negative (positive) effect on 𝑌. We have been doing these tests using the following rule of thumb:

Is the CI for 𝛽𝑗 below (above) 0 at level 𝛼?

– Yes → reject 𝐻0 at level 𝛼. Conclude that 𝑋𝑗 has a significant negative (positive) effect on 𝑌 at level 𝛼.

– No → the criterion is not conclusive.


3.5 Prediction

As in the simple linear model, the forecast of 𝑌 from X = x (that is, 𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) can be approached in two different ways:

1. Inference on the conditional mean of 𝑌 given X = x, 𝔼[𝑌 |X = x]. This is a deterministic quantity, which equals 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘.

2. Prediction of the conditional response 𝑌 |X = x. This is a random variable distributed as 𝒩(𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘, 𝜎2).

The prediction and computation of CIs can be done with the R function predict (unfortunately, there is no R Commander shortcut for this one). The objects required by predict are: first, the output of lm; second, a data.frame containing the locations x = (𝑥1, … , 𝑥𝑘) where we want to predict 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘. The prediction is $\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$.

It is mandatory to name the columns of the data frame with the same names as the predictors used in lm. Otherwise, predict will generate an error; see below.

To illustrate the use of predict, we return to the wine dataset.

# Fit a linear model for the price on WinterRain, HarvestRain and AGST
modelW <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modelW)
##
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.62816 -0.17923  0.02274  0.21990  0.62859
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9506001  1.9694011  -2.514  0.01940 *
## WinterRain   0.0012820  0.0005765   2.224  0.03628 *
## HarvestRain -0.0036242  0.0009646  -3.757  0.00103 **
## AGST         0.7123192  0.1087676   6.549 1.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.7069
## F-statistic:  21.9 on 3 and 23 DF,  p-value: 6.246e-07


# Data for which we want a prediction
# Important! You have to name the columns with the predictor names!
weather <- data.frame(WinterRain = 500, HarvestRain = 123, AGST = 18)

## Prediction of the mean

# Prediction of the mean at 95% - the defaults
predict(modelW, newdata = weather)
##        1
## 8.066342

# Prediction of the mean with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modelW, newdata = weather, interval = "confidence")
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507
predict(modelW, newdata = weather, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.714178 8.418507

# Other levels
predict(modelW, newdata = weather, interval = "confidence", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.774576 8.358108
predict(modelW, newdata = weather, interval = "confidence", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 7.588427 8.544258

## Prediction of the response

# Prediction of the mean at 95% - the defaults
predict(modelW, newdata = weather)
##        1
## 8.066342

# Prediction of the response with 95% confidence interval (the default)
# CI: (lwr, upr)
predict(modelW, newdata = weather, interval = "prediction")
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508
predict(modelW, newdata = weather, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 8.066342 7.273176 8.859508


# Other levels
predict(modelW, newdata = weather, interval = "prediction", level = 0.90)
##        fit      lwr      upr
## 1 8.066342 7.409208 8.723476
predict(modelW, newdata = weather, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 8.066342 6.989951 9.142733

# Predictions for several values
weather2 <- data.frame(WinterRain = c(500, 200), HarvestRain = c(123, 200),
                       AGST = c(17, 18))
predict(modelW, newdata = weather2, interval = "prediction")
##        fit      lwr      upr
## 1 7.354023 6.613835 8.094211
## 2 7.402691 6.533945 8.271437

For the wine dataset, do the following:

– Regress WinterRain on HarvestRain and AGST. Name the fitted model modExercise.

– Compute the estimate for the conditional mean of WinterRain for HarvestRain = 123.0 and AGST = 16.15. What is the CI at 𝛼 = 0.01?

– Compute the estimate for the conditional response for HarvestRain = 125.0 and AGST = 15. What is the CI at 𝛼 = 0.10?

– Check that modExercise$fitted.values is the same as predict(modExercise, newdata = data.frame(HarvestRain = wine$HarvestRain, AGST = wine$AGST)). Why is this so?

Similarities and differences in the prediction of the conditional mean 𝔼[𝑌 |X = x] and the conditional response 𝑌 |X = x:

– Similarities. The estimate is the same, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$. Both CIs are centered at $\hat{y}$.

– Differences. 𝔼[𝑌 |X = x] is deterministic and 𝑌 |X = x is random. Therefore, the variance is larger for the prediction of 𝑌 |X = x than for the prediction of 𝔼[𝑌 |X = x].


3.6 ANOVA and model fit

3.6.1 ANOVA

The ANOVA decomposition for multiple linear regression is quite analogous to the one in simple linear regression. The ANOVA decomposes the variance of 𝑌 into two parts, corresponding to the regression and to the error, respectively. Since the difference between simple and multiple linear regression is the number of predictors – the response 𝑌 is unique in both cases – the ANOVA decompositions are highly similar, as we will see.

As in simple linear regression, the mean of the fitted values $\hat{Y}_1, \ldots, \hat{Y}_n$ is the mean of 𝑌1, … , 𝑌𝑛. This is an important result that can be checked using matrix notation. The ANOVA decomposition considers the following measures of variation related to the response:

• $\mathrm{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2$, the total sum of squares. This is the total variation of 𝑌1, … , 𝑌𝑛, since $\mathrm{SST} = n s_y^2$, where $s_y^2$ is the sample variance of 𝑌1, … , 𝑌𝑛.

• $\mathrm{SSR} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$, the regression sum of squares. This is the variation explained by the regression plane, that is, the variation of 𝑌 that is explained by the estimated conditional mean $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_k X_{ik}$. $\mathrm{SSR} = n s_{\hat{y}}^2$, where $s_{\hat{y}}^2$ is the sample variance of $\hat{Y}_1, \ldots, \hat{Y}_n$.

• $\mathrm{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$, the sum of squared errors⁹. This is the variation around the conditional mean. Recall that $\mathrm{SSE} = \sum_{i=1}^n \hat{\varepsilon}_i^2 = (n - k - 1)\hat{\sigma}^2$, where $\hat{\sigma}^2$ is the sample variance of $\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_n$.

The ANOVA decomposition is exactly the same as in simple linear regression:

$$\underbrace{\mathrm{SST}}_{\text{Variation of } Y_i\text{'s}} = \underbrace{\mathrm{SSR}}_{\text{Variation of } \hat{Y}_i\text{'s}} + \underbrace{\mathrm{SSE}}_{\text{Variation of } \hat{\varepsilon}_i\text{'s}} \qquad (3.9)$$

or, equivalently (dividing by 𝑛 in (3.9)),

$$\underbrace{s_y^2}_{\text{Variance of } Y_i\text{'s}} = \underbrace{s_{\hat{y}}^2}_{\text{Variance of } \hat{Y}_i\text{'s}} + \underbrace{(n - k - 1)/n \times \hat{\sigma}^2}_{\text{Variance of } \hat{\varepsilon}_i\text{'s}}.$$

Notice the 𝑛 − 𝑘 − 1 instead of simple linear regression's 𝑛 − 2, which is the main change. The graphical interpretation of (3.9) when 𝑘 = 2 is shown in Figures 3.11 and 3.12.
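As a quick numerical check of (3.9), a minimal sketch assuming the wine dataset is loaded (any lm fit would do):

# Verify SST = SSR + SSE for a fitted model
fit <- lm(Price ~ Age + WinterRain, data = wine)
SST <- sum((wine$Price - mean(wine$Price))^2)
SSR <- sum((fit$fitted.values - mean(wine$Price))^2)
SSE <- sum(fit$residuals^2)
c(SST = SST, SSRplusSSE = SSR + SSE) # both entries coincide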

The ANOVA table summarizes the decomposition of the variance.

⁹SSE and RSS are two names for the same quantity (that appears in different contexts): $\mathrm{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \cdots - \hat{\beta}_k X_{ik})^2 = \mathrm{RSS}(\hat{\beta})$.


             Degrees of freedom   Sum of Squares   Mean Squares    F-value                    p-value
Predictors   k                    SSR              SSR/k           (SSR/k) / (SSE/(n-k-1))    p
Residuals    n - k - 1            SSE              SSE/(n-k-1)

The “𝐹-value” of the ANOVA table represents the value of the 𝐹-statistic $\frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)}$. This statistic is employed to test

𝐻0 ∶ 𝛽1 = … = 𝛽𝑘 = 0 vs. 𝐻1 ∶ 𝛽𝑗 ≠ 0 for any 𝑗,

that is, the hypothesis of no linear dependence of 𝑌 on 𝑋1, … , 𝑋𝑘 (the plane is completely flat, with no inclination). If 𝐻0 is rejected, it means that at least one 𝛽𝑗 is significantly different from zero. It happens that

$$F = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n - k - 1)} \overset{H_0}{\sim} F_{k,\,n-k-1},$$

where 𝐹𝑘,𝑛−𝑘−1 is Snedecor's 𝐹 distribution with 𝑘 and 𝑛 − 𝑘 − 1 degrees of freedom. If 𝐻0 is true, then 𝐹 is expected to be small since SSR will be close to zero (little variation is explained by the regression model, since $\hat{\beta} \approx 0$). The 𝑝-value of this test is not the same as the 𝑝-value of the 𝑡-test for 𝐻0 ∶ 𝛽1 = 0; that only happens in simple linear regression because 𝑘 = 1!

The “ANOVA table” is a broad concept in statistics, with different variants. Here we are only covering the basic ANOVA table from the relation SST = SSR + SSE. However, further sophistications are possible when SSR is decomposed into the variations contributed by each predictor. In particular, for multiple linear regression R's anova implements a sequential (type I) ANOVA table, which is not the previous table!

The anova function in R takes a model as input and returns the following sequential ANOVA table¹⁰:

              Degrees of freedom   Sum of Squares   Mean Squares   F-value                      p-value
Predictor 1   1                    SSR_1            SSR_1/1        (SSR_1/1) / (SSE/(n-k-1))    p_1
Predictor 2   1                    SSR_2            SSR_2/1        (SSR_2/1) / (SSE/(n-k-1))    p_2
...           ...                  ...              ...            ...                          ...
Predictor k   1                    SSR_k            SSR_k/1        (SSR_k/1) / (SSE/(n-k-1))    p_k
Residuals     n - k - 1            SSE              SSE/(n-k-1)

¹⁰More complex – included here just for clarification of anova's output.

Here SSR𝑗 represents the regression sum of squares associated with the inclusion of 𝑋𝑗 in the model with predictors 𝑋1, … , 𝑋𝑗−1, that is:

SSR𝑗 = SSR(𝑋1, … , 𝑋𝑗) − SSR(𝑋1, … , 𝑋𝑗−1).

The 𝑝-values 𝑝1, … , 𝑝𝑘 correspond to the testing of the hypotheses

𝐻0 ∶ 𝛽𝑗 = 0 vs. 𝐻1 ∶ 𝛽𝑗 ≠ 0,

carried out inside the linear model 𝑌 = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑗𝑋𝑗 + 𝜀. This is like the 𝑡-test for 𝛽𝑗 in the model with predictors 𝑋1, … , 𝑋𝑗.

Let's see how we can compute both ANOVA tables in R. The sequential table is simple: use anova. We illustrate it with the Boston dataset.

# Load data
library(MASS)
data(Boston)

# Fit a linear model
model <- lm(medv ~ crim + lstat + zn + nox, data = Boston)
summary(model)
##
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.972  -3.956  -1.344   2.148  25.076
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.93462    1.72076  17.977   <2e-16 ***
## crim        -0.08297    0.03677  -2.257   0.0245 *
## lstat       -0.90940    0.05040 -18.044   <2e-16 ***
## zn           0.03493    0.01395   2.504   0.0126 *
## nox          5.42234    3.24241   1.672   0.0951 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.169 on 501 degrees of freedom
## Multiple R-squared:  0.5537, Adjusted R-squared:  0.5502


## F-statistic: 155.4 on 4 and 501 DF, p-value: < 2.2e-16

# ANOVA table with sequential test
anova(model)
## Analysis of Variance Table
##
## Response: medv
##            Df  Sum Sq Mean Sq  F value    Pr(>F)
## crim        1  6440.8  6440.8 169.2694   < 2e-16 ***
## lstat       1 16950.1 16950.1 445.4628   < 2e-16 ***
## zn          1   155.7   155.7   4.0929   0.04360 *
## nox         1   106.4   106.4   2.7967   0.09509 .
## Residuals 501 19063.3    38.1
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The last p-value is the one of the last t-test
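The entries of the sequential table can be traced back to the definition of SSR𝑗; a minimal sketch, reusing the Boston data and the model loaded above:

# Type I sum of squares for lstat: SSR(crim, lstat) - SSR(crim)
m1 <- lm(medv ~ crim, data = Boston)
m2 <- lm(medv ~ crim + lstat, data = Boston)
sum((fitted(m2) - mean(Boston$medv))^2) - sum((fitted(m1) - mean(Boston$medv))^2)
# Compare with the "Sum Sq" entry for lstat in anova(model)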

In order to compute the simplified ANOVA table, we need to rely on an ad-hoc function¹¹. The function takes as input a fitted lm:

# This function computes the simplified anova from a linear model
simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)

}

¹¹You will need to run this piece of code whenever you want to call simpleAnova, since it is not part of R nor R Commander.


# Simplified ANOVA
simpleAnova(model)
## Analysis of Variance Table
##
## Response: medv
##             Df Sum Sq Mean Sq F value    Pr(>F)
## Predictors   4  23653  5913.3  155.41 < 2.2e-16 ***
## Residuals  501  19063    38.1
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Recall that the 𝐹-statistic, its 𝑝-value and the degrees of freedom are also given in the output of summary.

Compute the ANOVA table for the regression Price ~ WinterRain + AGST + HarvestRain + Age in the wine dataset. Check that the 𝑝-value for the 𝐹-test given by summary and by simpleAnova is the same.

3.6.2 The 𝑅2

The coefficient of determination 𝑅2 is defined as in simple linear regression:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SSR}}{\mathrm{SSR} + \mathrm{SSE}} = \frac{\mathrm{SSR}}{\mathrm{SSR} + (n - k - 1)\hat{\sigma}^2}.$$

𝑅2 measures the proportion of variation of the response variable 𝑌 that is explained by the predictors 𝑋1, … , 𝑋𝑘 through the regression. Intuitively, 𝑅2 measures the tightness of the data cloud around the regression plane. Check in Figure 3.12 how changing the value of 𝜎2 (not $\hat{\sigma}^2$, but $\hat{\sigma}^2$ is obviously dependent on 𝜎2) affects the 𝑅2. Also, as we saw in Section 2.7, $R^2 = r^2_{y\hat{y}}$, that is, the square of the sample correlation coefficient between 𝑌1, … , 𝑌𝑛 and $\hat{Y}_1, \ldots, \hat{Y}_n$ is 𝑅2.

Blindly trusting the 𝑅2 can lead to catastrophic conclusions in model selection. Here is a counterexample of a multiple regression where the 𝑅2 is apparently large but the assumptions discussed in Section 3.3 are clearly not satisfied.

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)


y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.78737 -0.20946  0.01031  0.19652  1.05351
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.788812   0.096418   8.181  1.1e-12 ***
## x1          -2.540073   0.154876 -16.401  < 2e-16 ***
## x2           0.002283   0.020954   0.109    0.913
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared:  0.744, Adjusted R-squared:  0.7388
## F-statistic: 141 on 2 and 97 DF,  p-value: < 2.2e-16

# But prediction is obviously problematic
scatter3d(y ~ x1 + x2, fit = "linear")

Remember that:

– 𝑅2 does not measure the correctness of a linear model but its usefulness, assuming the model is correct.

– 𝑅2 is the proportion of variance of 𝑌 explained by 𝑋1, … , 𝑋𝑘, but, of course, only when the linear model is correct.

We conclude by pointing out a nice connection between the 𝑅2, the ANOVA decomposition and the least squares estimator $\hat{\beta}$:

The ANOVA decomposition gives another interpretation of the least-squares estimates: $\hat{\beta}$ contains the estimated coefficients that maximize the 𝑅2 (among all the possible estimates we could think of). To see this, recall that

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{RSS}(\hat{\beta})}{\mathrm{SST}},$$


so if $\mathrm{RSS}(\hat{\beta}) = \min_{\beta \in \mathbb{R}^{k+1}} \mathrm{RSS}(\beta)$, then 𝑅2 is maximal for $\hat{\beta}$!

3.6.3 The 𝑅2Adj

As we saw, these are equivalent forms for 𝑅2:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\hat{\sigma}^2}{\mathrm{SST}} \times (n - k - 1). \qquad (3.10)$$

The SSE in the numerator always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the 𝑅2 always increases with 𝑘. Why is that so? Intuitively, because the complexity – hence the flexibility – of the model augments when we use more predictors to explain 𝑌. Mathematically, because when 𝑘 approaches 𝑛 − 1 the second term in (3.10) is reduced and, as a consequence, 𝑅2 grows.

The adjusted 𝑅2 is an important quantity specifically designed to cover this flaw of the 𝑅2, which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models without systematically favoring more complex models. This alternative coefficient is defined as

$$R^2_{\mathrm{Adj}} = 1 - \frac{\mathrm{SSE}/(n - k - 1)}{\mathrm{SST}/(n - 1)} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \times \frac{n - 1}{n - k - 1} = 1 - \frac{\hat{\sigma}^2}{\mathrm{SST}} \times (n - 1). \qquad (3.11)$$

The 𝑅2Adj is independent of 𝑘, at least explicitly. If 𝑘 = 1, then 𝑅2Adj is almost 𝑅2 (practically identical if 𝑛 is large). Both (3.10) and (3.11) are quite similar except for the last factor, which in the latter does not depend on 𝑘. Therefore, (3.11) will only increase if $\hat{\sigma}^2$ is reduced with 𝑘 – in other words, if the new variables contribute to the reduction of variability around the regression plane.

The different behavior of 𝑅2 and 𝑅2Adj can be visualized with a small simulation. Suppose that we generate a random dataset with 𝑛 = 200 observations of a response 𝑌 and two predictors 𝑋1, 𝑋2. That is, the sample $\{(X_{i1}, X_{i2}, Y_i)\}_{i=1}^n$ with

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1).$$

To this data, we add 196 garbage predictors that are completely independent from 𝑌. Therefore, we end up with 𝑘 = 198 predictors. Now we compute the 𝑅2(𝑗) and 𝑅2Adj(𝑗) for the models

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \varepsilon,$$


with 𝑗 = 1, … , 𝑘 and we plot them as the curves (𝑗, 𝑅2(𝑗)) and (𝑗, 𝑅2Adj(𝑗)). Since 𝑅2 and 𝑅2Adj are random variables, we repeat the procedure 100 times to have a measure of their variability.

Figure 3.13 contains the results of this experiment. As you can see, 𝑅2 increases linearly with the number of predictors considered, although only the first two were important! On the contrary, 𝑅2Adj only increases with the first two variables and then stays flat on average, but it has a huge variability when 𝑘 approaches 𝑛 − 2. This is a consequence of the explosive variance of $\hat{\sigma}^2$ in that degenerate case (as we will see in Section 3.7). The experiment evidences that 𝑅2Adj is more adequate than the 𝑅2 for evaluating the fit of a multiple linear regression.

An example of a simulated dataset considered in the experiment of Figure 3.13:

# Generate data
k <- 198
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, k - 2))
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors
summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables
summary(lm(y ~ X[, 1:22], data = data))

The 𝑅2Adj no longer measures the proportion of variation of 𝑌 explained by the regression, but the result of correcting this proportion by the number of predictors employed. As a consequence of this, 𝑅2Adj ≤ 1 but it can be negative!

The next code illustrates a situation where we have two predictors completely independent from the response. The fitted model has a negative 𝑅2Adj.

# Three independent variables
set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2
summary(lm(y ~ x1 + x2))


##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5081 -0.5021 -0.0191  0.5286  2.4750
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.97024    0.10399   9.330 3.75e-15 ***
## x1           0.09003    0.10300   0.874    0.384
## x2          -0.05253    0.11090  -0.474    0.637
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared:  0.009797, Adjusted R-squared:  -0.01062
## F-statistic: 0.4799 on 2 and 97 DF,  p-value: 0.6203

Construct more predictors (x3, x4, …) by sampling 100 points from a normal (rnorm(100)). Check that when the predictors are added to the model, the 𝑅2Adj decreases and the 𝑅2 increases.

3.7 Model selection

In Section 3.1.1 we briefly saw that the inclusion of more predictors is not free: there is a price to pay in terms of more variability in the coefficients. Indeed, there is a maximum number of predictors 𝑘 that can be considered in a linear model for a sample size 𝑛: 𝑘 ≤ 𝑛 − 2. Or, equivalently, there is a minimum sample size 𝑛 required for fitting a model with 𝑘 predictors: 𝑛 ≥ 𝑘 + 2.

The interpretation of this fact is simple if we think about the geometry for 𝑘 = 1 and 𝑘 = 2:

• If 𝑘 = 1, we need at least 𝑛 = 2 points to fit a line uniquely. However, this line gives no information on the vertical variation around it and hence $\hat{\sigma}^2$ cannot be estimated (applying its formula, we would have $\hat{\sigma}^2 = \infty$). Therefore we need at least 𝑛 = 3 points, or in other words 𝑛 ≥ 𝑘 + 2 = 3.

• If 𝑘 = 2, we need at least 𝑛 = 3 points to fit a plane uniquely. But this plane gives no information on the variation of the data around it and hence $\hat{\sigma}^2$ cannot be estimated. Therefore we need 𝑛 ≥ 𝑘 + 2 = 4.


Another interpretation is the following:

The fitting of a linear model with 𝑘 predictors involves the estimation of 𝑘 + 2 parameters (𝛽, 𝜎2) from 𝑛 data points. The closer 𝑘 + 2 and 𝑛 are, the more variable the estimates $(\hat{\beta}, \hat{\sigma}^2)$ will be, since less information is available for estimating each one. In the limit case 𝑛 = 𝑘 + 2, each sample point determines a parameter estimate.

The degrees of freedom 𝑛 − 𝑘 − 1 quantify the increase in the variability of $(\hat{\beta}, \hat{\sigma}^2)$ when 𝑛 − 𝑘 − 1 decreases. For example:

• 𝑡𝑛−𝑘−1;𝛼/2 appears in (3.7) and influences the length of the CIs for 𝛽𝑗, see (3.8). It also influences the length of the CIs for the prediction. As Figure 3.14 shows, when the degrees of freedom decrease, 𝑡𝑛−𝑘−1;𝛼/2 increases, thus the intervals become wider.

• $\hat{\sigma}^2 = \frac{1}{n-k-1} \sum_{i=1}^n \hat{\varepsilon}_i^2$ influences the 𝑅2 and 𝑅2Adj. If no relevant variables are added to the model, then $\sum_{i=1}^n \hat{\varepsilon}_i^2$ will not change substantially. However, the factor $\frac{1}{n-k-1}$ will grow as 𝑘 augments, inflating $\hat{\sigma}^2$ and its variance. This is exactly what happened in Figure 3.13.

Now that we have shed more light on the problem of having an excess of predictors, we turn the focus to selecting the most adequate predictors for a multiple regression model. This is a challenging task without a unique solution and, what is worse, without a method that is guaranteed to work in all cases. However, there is a well-established procedure that usually gives good results: stepwise regression. Its principle is to compare multiple linear regression models with different predictors (and, of course, with the same response).

Before introducing the method, we need to understand what an information criterion is. An information criterion balances the fitness of a model with the number of predictors employed. Hence, it objectively determines the best model as the one that minimizes the information criterion. Two common criteria are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). Both are based on a balance between the model fitness and its complexity:

$$\mathrm{BIC}(\mathrm{model}) = \underbrace{-2 \log \mathrm{lik}(\mathrm{model})}_{\text{Model fitness}} + \underbrace{\mathrm{npar}(\mathrm{model}) \times \log n}_{\text{Complexity}}, \qquad (3.12)$$

where lik(model) is the likelihood of the model (how well the model fits the data) and npar(model) is the number of parameters of the model, 𝑘 + 2 in the case of a multiple linear regression model with 𝑘 predictors. The AIC replaces log 𝑛 by 2 in (3.12), so it penalizes complex models less. This is one of the reasons why the BIC is preferred by some practitioners for model comparison. Another reason is that the BIC is consistent in selecting the true model: if enough data is provided, the BIC is guaranteed to select the data-generating model among a list of candidate models.


The BIC and AIC can be computed in R through the functions BIC and AIC. They take a model as the input.

# Load iris dataset
data(iris)

# Two models with different predictors
mod1 <- lm(Petal.Length ~ Sepal.Width, data = iris)
mod2 <- lm(Petal.Length ~ Sepal.Width + Petal.Width, data = iris)

# BICs
BIC(mod1)
## [1] 579.7856
BIC(mod2) # Smaller -> better
## [1] 208.0366
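The AIC is computed analogously (a minimal sketch; the numerical outputs are not reproduced here):

# AICs: same syntax as BIC
AIC(mod1)
AIC(mod2) # Again, the smaller the better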

# Check the summaries
summary(mod1)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.7721 -1.4164  0.1719  1.2094  4.2307
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.0632     0.9289   9.757  < 2e-16 ***
## Sepal.Width  -1.7352     0.3008  -5.768 4.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 148 degrees of freedom
## Multiple R-squared:  0.1836, Adjusted R-squared:  0.178
## F-statistic: 33.28 on 1 and 148 DF,  p-value: 4.513e-08
summary(mod2)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Width + Petal.Width, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.33753 -0.29251 -0.00989  0.21447  1.24707
##
## Coefficients:


##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.25816    0.31352   7.203 2.84e-11 ***
## Sepal.Width -0.35503    0.09239  -3.843  0.00018 ***
## Petal.Width  2.15561    0.05283  40.804  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4574 on 147 degrees of freedom
## Multiple R-squared:  0.9338, Adjusted R-squared:  0.9329
## F-statistic:  1036 on 2 and 147 DF,  p-value: < 2.2e-16

Let's go back to the selection of predictors. If we have 𝑘 predictors, a naive procedure would be to check all the possible models that can be constructed with them and then select the best one in terms of BIC/AIC. The problem is that there are $2^{k+1}$ possible models! Fortunately, the stepwise procedure helps us navigate this ocean of models. The function takes as input a model employing all the available predictors.

# Explain NOx in Boston dataset
mod <- lm(nox ~ ., data = Boston)

# With BIC
modBIC <- stepwise(mod, trace = 0)
##
## Direction:  backward/forward
## Criterion:  BIC
summary(modBIC)
##
## Call:
## lm(formula = nox ~ indus + age + dis + rad + ptratio + medv,
##     data = Boston)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.117146 -0.034877 -0.005863  0.031655  0.183363
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.7649531  0.0347425  22.018  < 2e-16 ***
## indus        0.0045930  0.0005972   7.691 7.85e-14 ***
## age          0.0008682  0.0001381   6.288 7.03e-10 ***
## dis         -0.0170889  0.0020226  -8.449 3.24e-16 ***
## rad          0.0033154  0.0003730   8.888  < 2e-16 ***
## ptratio     -0.0130209  0.0013942  -9.339  < 2e-16 ***
## medv        -0.0021057  0.0003413  -6.170 1.41e-09 ***
## ---


## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05485 on 499 degrees of freedom
## Multiple R-squared:  0.7786, Adjusted R-squared:  0.7759
## F-statistic: 292.4 on 6 and 499 DF,  p-value: < 2.2e-16

# Different search directions
stepwise(mod, trace = 0, direction = "forward")
##
## Direction:  forward
## Criterion:  BIC
##
## Call:
## lm(formula = nox ~ dis + indus + age + rad + ptratio + medv,
##     data = Boston)
##
## Coefficients:
## (Intercept)         dis       indus         age         rad     ptratio
##   0.7649531  -0.0170889   0.0045930   0.0008682   0.0033154  -0.0130209
##        medv
##  -0.0021057
stepwise(mod, trace = 0, direction = "backward")
##
## Direction:  backward
## Criterion:  BIC
##
## Call:
## lm(formula = nox ~ indus + age + dis + rad + ptratio + medv,
##     data = Boston)
##
## Coefficients:
## (Intercept)       indus         age         dis         rad     ptratio
##   0.7649531   0.0045930   0.0008682  -0.0170889   0.0033154  -0.0130209
##        medv
##  -0.0021057

# With AIC
modAIC <- stepwise(mod, trace = 0, criterion = "AIC")
##
## Direction:  backward/forward
## Criterion:  AIC
summary(modAIC)
##
## Call:
## lm(formula = nox ~ crim + indus + age + dis + rad + ptratio +


##     medv, data = Boston)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.122633 -0.035593 -0.004273  0.030938  0.182914
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.7750308  0.0349903  22.150  < 2e-16 ***
## crim        -0.0007603  0.0003753  -2.026   0.0433 *
## indus        0.0044875  0.0005976   7.509 2.77e-13 ***
## age          0.0008656  0.0001377   6.288 7.04e-10 ***
## dis         -0.0175329  0.0020282  -8.645  < 2e-16 ***
## rad          0.0037478  0.0004288   8.741  < 2e-16 ***
## ptratio     -0.0132746  0.0013956  -9.512  < 2e-16 ***
## medv        -0.0022716  0.0003499  -6.491 2.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05468 on 498 degrees of freedom
## Multiple R-squared:  0.7804, Adjusted R-squared:  0.7773
## F-statistic: 252.8 on 7 and 498 DF,  p-value: < 2.2e-16

The model selected by stepwise is a good starting point for further additions or deletions of predictors. For example, in modAIC we could remove crim.
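A minimal sketch of how such a manual refinement could be compared with modAIC (update refits a model after modifying its formula):

# Drop crim from modAIC and compare both models by BIC and AIC
modAIC2 <- update(modAIC, . ~ . - crim)
BIC(modAIC, modAIC2)
AIC(modAIC, modAIC2)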

When applying stepwise for BIC/AIC, different final models might be selected depending on the choice of direction. This is the interpretation:

– "backward": starts from the full model, removes predictors sequentially.

– "forward": starts from the simplest model, adds predictors sequentially.

– "backward/forward" (default) and "forward/backward": combinations of the above.

The advice is to try several of these methods and retain the one with minimum BIC/AIC. Set trace = 0 to omit the lengthy output of information on the search procedure.

stepwise assumes that no NAs (missing values) are present in the data. It is advised to remove the missing values in the data beforehand, since their presence might lead to errors. To do so, employ data = na.omit(dataset) in the call to lm (if your dataset is dataset).

We conclude by highlighting a caveat on the use of the BIC and AIC: they are constructed assuming that the sample size 𝑛 is much larger than the number of parameters in the model (𝑘 + 2). Therefore, they will work reasonably well if 𝑛 ≫ 𝑘 + 2, but if this is not true they may favor unrealistically complex models. An illustration of this phenomenon is Figure 3.15, which is the BIC/AIC version of Figure 3.13 for the experiment done in Section 3.6. The BIC and AIC curves tend to have local minima close to 𝑘 = 2 and then increase. But when 𝑘 + 2 gets close to 𝑛, they quickly drop down. Note also how the BIC penalizes complexity more than the AIC, which is flatter.

3.8 Model diagnostics and multicollinearity

As we saw in Section 3.3, checking the assumptions of the multiple linear model through the data scatterplots becomes tricky even when 𝑘 = 2. To solve this issue, a series of diagnostic plots have been designed in order to evaluate graphically and in a simple way the validity of the assumptions. For illustration, we retake the wine dataset (download).

mod <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)

We will focus only on three plots:

1. Residuals vs. fitted values plot. This plot serves mainly to check the linearity, although lack of homoscedasticity or independence can also be detected. Here is an example:

plot(mod, 1)


[Figure: "Residuals vs Fitted" plot for lm(Price ~ Age + AGST + HarvestRain + WinterRain). Fitted values on the horizontal axis, residuals on the vertical axis; observations 1959, 1977 and 1973 are labelled.]

Under linearity, we expect the red line (a nonlinear fit of the mean of the residuals) to be almost flat. This means that the trend of 𝑌1, … , 𝑌𝑛 is linear with respect to the predictors. Heteroskedasticity can also be detected in the form of irregular vertical dispersion around the red line. Dependence between residuals can be detected (though it is harder) in the form of non-randomly spread residuals.

2. QQ-plot. Checks the normality:

plot(mod, 2)

[Figure: "Normal Q-Q" plot for lm(Price ~ Age + AGST + HarvestRain + WinterRain). Theoretical quantiles on the horizontal axis, standardized residuals on the vertical axis; observations 1959, 1977 and 1973 are labelled.]


Under normality, we expect the points (sample quantiles of the standardized residuals vs. theoretical quantiles of a 𝒩(0, 1)) to align with the diagonal line, which represents the ideal position of the points if those were sampled from a 𝒩(0, 1). It is usual to have larger departures from the diagonal in the extremes than in the center, even under normality, although these departures are clearer if the data is non-normal.

3. Scale-location plot. Serves for checking the homoscedasticity. It is similar to the first diagnostic plot, but now with the residuals standardized and transformed by a square root (of their absolute value). This change turns the task of spotting heteroskedasticity from looking for irregular vertical dispersion patterns into looking for nonlinear patterns, which is somewhat simpler.

plot(mod, 3)

[Figure: "Scale-Location" plot for lm(Price ~ Age + AGST + HarvestRain + WinterRain). Fitted values on the horizontal axis, square root of the absolute standardized residuals on the vertical axis; observations 1959, 1977 and 1973 are labelled.]

Under homoscedasticity, we expect the red line to be almost flat. If there are consistent nonlinear patterns, then there is evidence of heteroskedasticity.

If you type plot(mod), several diagnostic plots will be shown sequentially. In order to advance them, hit 'Enter' in the R console.
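Alternatively, the three plots used in this section can be displayed together; a minimal sketch (par(mfrow = ...) splits the plotting window, nothing specific to this dataset):

# Show the three diagnostic plots side by side
par(mfrow = c(1, 3))
plot(mod, 1) # Residuals vs. fitted values
plot(mod, 2) # QQ-plot
plot(mod, 3) # Scale-location
par(mfrow = c(1, 1)) # Reset the plotting layout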

The next figures present datasets where the assumptions are satisfied and violated.

Load the dataset assumptions3D.RData (download) and compute the regressions y.3 ~ x1.3 + x2.3, y.4 ~ x1.4 + x2.4, y.5 ~ x1.5 + x2.5 and y.8 ~ x1.8 + x2.8. Use the three diagnostic plots to test the assumptions of the linear model.

A common problem that arises in multiple linear regression is multicollinearity. This is the situation in which two or more predictors are highly linearly related with each other. Multicollinearity has important effects on the fit of the model:

• It reduces the precision of the estimates. As a consequence, the signs of fitted coefficients may be reversed and valuable predictors may appear as non-significant.

• It is difficult to determine how each of the highly related predictors affects the response, since one masks the other. This may result in numerical instabilities.

An approach to detect multicollinearity is to compute the correlation matrix between the predictors with cor (in R Commander: 'Statistics' -> 'Summaries' -> 'Correlation matrix...').

cor(wine)
##                  Price  WinterRain        AGST HarvestRain         Age
## Price        1.0000000  0.13488004  0.66752483 -0.50718463  0.46040873
## WinterRain   0.1348800  1.00000000 -0.32113230 -0.26798907 -0.05118354
## AGST         0.6675248 -0.32113230  1.00000000 -0.02708361  0.29488335
## HarvestRain -0.5071846 -0.26798907 -0.02708361  1.00000000  0.05884976
## Age          0.4604087 -0.05118354  0.29488335  0.05884976  1.00000000
## FrancePop   -0.4810720  0.02945091 -0.30126148 -0.03201463 -0.99227908
##               FrancePop
## Price       -0.48107195
## WinterRain   0.02945091
## AGST        -0.30126148
## HarvestRain -0.03201463
## Age         -0.99227908
## FrancePop    1.00000000

Here we can see what we already knew from Section 3.1.1: that Age and Year are perfectly linearly related and that Age and FrancePop are highly linearly related.

However, it is not enough to inspect pairwise correlations in order to get rid of multicollinearity. Here is a counterexample:

# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)


x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
y <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4 + rnorm(100)
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicious
cor(data)
##            x1          x2         x3         x4           y
## x1  1.0000000  0.38254782  0.2142011 -0.5261464  0.31194689
## x2  0.3825478  1.00000000  0.5167341  0.5673174 -0.04428223
## x3  0.2142011  0.51673408  1.0000000  0.2500123 -0.77482655
## x4 -0.5261464  0.56731738  0.2500123  1.0000000 -0.28677304
## y   0.3119469 -0.04428223 -0.7748265 -0.2867730  1.00000000

A better approach is to compute the Variance Inflation Factor (VIF) of each coefficient 𝛽𝑗. This is a measure of how linearly dependent 𝑋𝑗 is on the rest of the predictors:

$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$

where $R^2_{X_j \mid X_{-j}}$ is the 𝑅2 from the regression of 𝑋𝑗 onto the remaining predictors. The next rule of thumb gives direct insight into which predictors are multicollinear:

• VIF close to 1: absence of multicollinearity.

• VIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest VIF.

The VIF is computed by vif, which takes a linear model as argument (in R Commander: 'Models' -> 'Numerical diagnostics' -> 'Variance-inflation factors'). We continue with the previous example.

# Abnormal variance inflation factors: largest for x4, we remove it
modMultiCo <- lm(y ~ x1 + x2 + x3 + x4)
vif(modMultiCo)
##        x1        x2        x3        x4
## 26.361444 29.726498  1.416156 33.293983

# Without x4
modClean <- lm(y ~ x1 + x2 + x3)

# Comparison
summary(modMultiCo)
##
## Call:


## lm(formula = y ~ x1 + x2 + x3 + x4)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9762 -0.6663  0.1195  0.6217  2.5568
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0622     0.1034  10.270  < 2e-16 ***
## x1            0.9224     0.5512   1.673  0.09756 .
## x2            1.6399     0.5461   3.003  0.00342 **
## x3           -3.1652     0.1086 -29.158  < 2e-16 ***
## x4           -0.5292     0.5409  -0.978  0.33040
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.028 on 95 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9108
## F-statistic: 253.7 on 4 and 95 DF,  p-value: < 2.2e-16
summary(modClean)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.91297 -0.66622  0.07889  0.65819  2.62737
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0577     0.1033   10.24  < 2e-16 ***
## x1            1.4495     0.1162   12.47  < 2e-16 ***
## x2            1.1195     0.1237    9.05 1.63e-14 ***
## x3           -3.1450     0.1065  -29.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.028 on 96 degrees of freedom
## Multiple R-squared:  0.9135, Adjusted R-squared:  0.9108
## F-statistic:   338 on 3 and 96 DF,  p-value: < 2.2e-16

# Variance inflation factors normal
vif(modClean)
##       x1       x2       x3
## 1.171942 1.525501 1.364878
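To connect the vif output with the definition above, the VIF of x4 can also be computed by hand (a minimal sketch, reusing the simulated x1, x2, x3, x4 from this example):

# VIF of x4 from its definition: regress x4 on the remaining predictors
r2_x4 <- summary(lm(x4 ~ x1 + x2 + x3))$r.squared
1 / (1 - r2_x4) # should match vif(modMultiCo)["x4"]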


Figure 3.11: Visualization of the ANOVA decomposition when 𝑘 = 2. SST measures the variation of 𝑌1, … , 𝑌𝑛 with respect to $\bar{Y}$. SSR measures the variation with respect to the estimated conditional means, $\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2}$. SSE collects the variation of the residuals.


Figure 3.12: Illustration of the ANOVA decomposition and its dependence on 𝜎2 and $\hat{\sigma}^2$. Application also available here.


Figure 3.13: Comparison of 𝑅2 and 𝑅2Adj for 𝑛 = 200 and 𝑘 ranging from 1 to 198. 𝑀 = 100 datasets were simulated with only the first two predictors being significant. The thicker curves are the mean of each color's curves.

Figure 3.14: Effect of df = 𝑛 − 𝑘 − 1 in 𝑡df;𝛼/2 for 𝛼 = 0.10, 0.05, 0.01.


Figure 3.15: Comparison of BIC and AIC for 𝑛 = 200 and 𝑘 ranging from 1 to 198. 𝑀 = 100 datasets were simulated with only the first two predictors being significant. The thicker curves are the mean of each color's curves.


Figure 3.16: Residuals vs. fitted values plots for datasets respecting (left column) and violating (right column) the linearity assumption.


Figure 3.17: QQ-plots for datasets respecting (left column) and violating (right column) the normality assumption.


Figure 3.18: Scale-location plots for datasets respecting (left column) and violating (right column) the homoscedasticity assumption.


Chapter 4

Logistic regression

As we saw in Chapters 2 and 3, linear regression assumes that the response variable 𝑌 is continuous. In this chapter we will see how logistic regression can deal with a discrete response 𝑌. The simplest case is with 𝑌 being a binary response, that is, a variable encoding two categories. In general, we assume that we have 𝑋1, … , 𝑋𝑘 predictors for explaining 𝑌 (multiple logistic regression) and cover the peculiarities for 𝑘 = 1 as particular cases.

4.1 More R basics

In order to implement some of the contents of this chapter we need to cover more R basics, mostly related to flexible plotting that is not implemented directly in R Commander. The R functions we will see are also very useful for simplifying some R Commander approaches.

In the following sections, type – do not just copy and paste systematically – the code in the 'R Script' panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by ## [1]).

4.1.1 Data frames revisited

# Let's begin importing the iris dataset
data(iris)

# names gives you the variables in the data frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# The beginning of the data
head(iris)


##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# So we can access variables by $ or as in a matrix
iris$Sepal.Length[1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[1:10, 1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[3, 1]
## [1] 4.7

# Information on the dimension of the data frame
dim(iris)
## [1] 150   5

# str gives the structure of any object in R
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Recall the species variable: it is a categorical variable (or factor),
# not a numeric variable
iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

# Factors can only take certain values
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"

# If a file contains a variable with character strings as observations (either
# encapsulated by quotation marks or not), the variable will become a factor
# when imported into R


Do the following:

– Import auto.txt into R as the data frame auto. Check how the character strings in the file give rise to factor variables.

– Get the dimensions of auto and show the beginning of the data.

– Retrieve the fifth observation of horsepower in two different ways.

– Compute the levels of name.

4.1.2 Vector-related functions

# The function seq creates sequences of numbers equally separated
seq(0, 1, by = 0.1)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# You can shorten the latter argument
seq(0, 1, l = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# Repeat a number
rep(0, 5)
## [1] 0 0 0 0 0

# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
##  [1]  3  2  1  0 -1  5  4  3  2  1

# Another way
myVec[length(myVec):1]
##  [1]  3  2  1  0 -1  5  4  3  2  1

# Count repetitions in your data
table(iris$Sepal.Length)
##
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1 6.2
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6   4
## 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
##   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1
table(iris$Species)
##


##     setosa versicolor  virginica
##         50         50         50

Do the following:

– Create the vector 𝑥 = (0.3, 0.6, 0.9, 1.2).

– Create a vector of length 100 ranging from 0 to 1 with entries equally separated.

– Compute the amount of zeros and ones in x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0). Check that they are the same as in rev(x).

– Compute the vector (0.1, 1.1, 2.1, ..., 100.1) in four different ways using seq and rev. Do the same but using : instead of seq. (Hint: add 0.1)

4.1.3 Logical conditions and subsetting

# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x != y
# They return TRUE or FALSE

# Smaller than
0 < 1
## [1] TRUE

# Greater than
1 > 1
## [1] FALSE

# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>"!
## [1] TRUE

# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<"!
## [1] FALSE

# Equal
1 == 1 # Tests equality. Remember: "==" and not "="!
## [1] TRUE

# Unequal
1 != 0 # Tests inequality
## [1] TRUE


# TRUE is encoded as 1 and FALSE as 0
TRUE + 1
## [1] 2
FALSE + 1
## [1] 1

# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
## [1] FALSE  TRUE FALSE  TRUE FALSE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE
x != y
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE

# Subsetting of vectors
x
## [1] 1 2 3 4 5
x[x >= 2]
## [1] 2 3 4 5
x[x < 3]
## [1] 1 2

# Easy way to work with parts of the data
data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
##   x y
## 1 0 1
## 2 1 2
## 3 3 3
## 4 3 4
## 5 0 5

# Data such that x is zero
data0 <- data[data$x == 0, ]
data0
##   x y
## 1 0 1
## 5 0 5

# Data such that x is larger than 2
data2 <- data[data$x > 2, ]
data2
##   x y


## 3 3 3
## 4 3 4

# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
##  [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6
## [20] 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4 3.5 3.2 3.5 3.8
## [39] 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3 3.6 3.2 3.2 3.8 3.2 3.3
## [58] 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4

# Problem - what happened?
data[x > 2, ]
##   x y
## 3 3 3
## 4 3 4
## 5 0 5

# In an example
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
summary(iris[iris$Sepal.Width > 3, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.400   Min.   :3.100   Min.   :1.000   Min.   :0.1000
##  1st Qu.:5.000   1st Qu.:3.200   1st Qu.:1.450   1st Qu.:0.2000
##  Median :5.400   Median :3.400   Median :1.600   Median :0.4000
##  Mean   :5.684   Mean   :3.434   Mean   :2.934   Mean   :0.9075
##  3rd Qu.:6.400   3rd Qu.:3.600   3rd Qu.:5.000   3rd Qu.:1.8000
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5000
##        Species
##  setosa    :42
##  versicolor: 8
##  virginica :17


# On the factor variable only == and != make sense
summary(iris[iris$Species == "setosa", ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
##        Species
##  setosa    :50
##  versicolor: 0
##  virginica : 0

# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
##
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width >
##     3)
##
## Coefficients:
##  (Intercept)  Petal.Length
##      3.59439      -0.05455
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
##
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width >
##     3)
##
## Coefficients:
##  (Intercept)  Petal.Length
##      3.59439      -0.05455
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset

# Same thing for the subset field in R Commander's menus


# AND operator &
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# OR operator |
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# Both operators are useful for checking for ranges of data
y
## [1] 0 3 1 5 2
index1 <- (y <= 3) & (y > 0)
y[index1]
## [1] 3 1 2
index2 <- (y < 2) | (y > 4)
y[index2]
## [1] 0 1 5

# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.400   Min.   :3.100   Min.   :1.200   Min.   :0.100
##  1st Qu.:4.925   1st Qu.:3.125   1st Qu.:1.500   1st Qu.:0.200
##  Median :5.950   Median :3.200   Median :4.450   Median :1.400
##  Mean   :5.781   Mean   :3.245   Mean   :3.460   Mean   :1.145
##  3rd Qu.:6.700   3rd Qu.:3.400   3rd Qu.:5.375   3rd Qu.:2.075
##  Max.   :7.200   Max.   :3.400   Max.   :6.000   Max.   :2.500
##        Species
##  setosa    :20
##  versicolor: 8
##  virginica :14


Do the following for the iris dataset:

– Compute the subset corresponding to Petal.Length either smaller than 1.5 or larger than 2. Save this dataset as irisPetal.

– Compute and summarize a linear regression of Sepal.Width on Petal.Width + Petal.Length for the dataset irisPetal. What is the 𝑅2? (Solution: 0.101)

– Check that the previous model is the same as regressing Sepal.Width on Petal.Width + Petal.Length for the dataset iris with the appropriate subset expression.

– Compute the variance of Petal.Width when Petal.Width is smaller than or equal to 1.5 and larger than 0.3. (Solution: 0.1266541)

4.1.4 Plotting functions

# plot is the main function for plotting in R
# It has a different behavior depending on the kind of object that it receives

# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)

[Figure: "Residuals vs Fitted" plot for lm(Sepal.Width ~ Sepal.Length). Fitted values on the horizontal axis, residuals on the vertical axis; observations 16, 34 and 61 are labelled.]


# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")

[Figure: scatterplot of iris$Sepal.Width against iris$Sepal.Length, titled "Sepal.Length vs Sepal.Width".]

# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))

[Figure: the same scatterplot with both axes limited to (0, 10).]

# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)


y <- x^2
plot(x, y)

[Figure: scatterplot of the points of the parabola y = x^2 for x in (-1, 1).]

plot(x, y, main = "A dotted parabola")

[Figure: the same points, now titled "A dotted parabola".]

plot(x, y, main = "A parabola", type = "l")


[Figure: the parabola drawn as a line, titled "A parabola".]

plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)

−1.0 −0.5 0.0 0.5 1.0

0.0

0.2

0.4

0.6

0.8

1.0

A red and thick parabola

x

y

# Plotting a more complicated curve between -pi and pix <- seq(-pi, pi, l = 50)y <- (2 + sin(10 * x)) * x^2plot(x, y, type = "l") # Kind of rough...

Page 163: Lab notes for Statistics for Social Sciences II ...

4.1. MORE R BASICS 163

−3 −2 −1 0 1 2 3

05

1015

2025

x

y

# More detailed plotx <- seq(-pi, pi, l = 500)y <- (2 + sin(10 * x)) * x^2plot(x, y, type = "l")

−3 −2 −1 0 1 2 3

05

1015

2025

x

y

# Remember that we are joining points for creating a curve!

# For more options in the plot customization see?plot## Help on topic 'plot' was found in the following packages:##

Page 164: Lab notes for Statistics for Social Sciences II ...

164 CHAPTER 4. LOGISTIC REGRESSION

## Package Library## base /Library/Frameworks/R.framework/Resources/library## graphics /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library###### Using the first match ...?par

# plot is a first level plotting function. That means that whenever is called,# it creates a new plot. If we want to add information to an existing plot, we# have to use a second level plotting function such as points, lines or abline

plot(x, y) # Create a plotlines(x, x^2, col = "red") # Add linespoints(x, y + 10, col = "blue") # Add pointsabline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x

−3 −2 −1 0 1 2 3

05

1015

2025

x

y

4.1.5 DistributionsThe operations on distributions described here are implemented in R Comman-der through the menu 'Distributions', but is convenient for you to grasp howare they working.# R allows to sample [r], compute density/probability mass [d],# compute distribution function [p] and compute quantiles [q] for several# continuous and discrete distributions. The format employed is [rdpq]name,# where name stands for:# - norm -> Normal

Page 165: Lab notes for Statistics for Social Sciences II ...

4.1. MORE R BASICS 165

# - unif -> Uniform# - exp -> Exponential# - t -> Student's t# - f -> Snedecor's F# - chisq -> Chi squared# - pois -> Poisson# - binom -> Binomial# More distributions:?Distributions

# Sampling from a Normal - 100 random points from a N(0, 1)rnorm(n = 10, mean = 0, sd = 1)## [1] -1.8367426 -1.3366952 -0.4906582 1.0215158 0.1637865 2.5039127## [7] 1.3113124 -0.1352548 0.1846896 -0.6373963

# If you want to have always the same result, set the seed of the random number# generatorset.seed(45678)rnorm(n = 10, mean = 0, sd = 1)## [1] 1.4404800 -0.7195761 0.6709784 -0.4219485 0.3782196 -1.6665864## [7] -0.5082030 0.4433822 -1.7993868 -0.6179521

# Plotting the density of a N(0, 1) - the Gauss bellx <- seq(-4, 4, l = 100)y <- dnorm(x = x, mean = 0, sd = 1)plot(x, y, type = "l")

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

y

Page 166: Lab notes for Statistics for Social Sciences II ...

166 CHAPTER 4. LOGISTIC REGRESSION

# Plotting the distribution function of a N(0, 1)x <- seq(-4, 4, l = 100)y <- pnorm(q = x, mean = 0, sd = 1)plot(x, y, type = "l")

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

x

y

# Computing the 95% quantile for a N(0, 1)qnorm(p = 0.95, mean = 0, sd = 1)## [1] 1.644854

# All distributions have the same syntax: rname(n,...), dname(x,...), dname(p,...)# and qname(p,...), but the parameters in ... change. Look them in ?Distributions# For example, here is the same for the uniform distribution

# Sampling from a U(0, 1)set.seed(45678)runif(n = 10, min = 0, max = 1)## [1] 0.9251342 0.3339988 0.2358930 0.3366312 0.7488829 0.9327177 0.3365313## [8] 0.2245505 0.6473663 0.0807549

# Plotting the density of a U(0, 1)x <- seq(-2, 2, l = 100)y <- dunif(x = x, min = 0, max = 1)plot(x, y, type = "l")

Page 167: Lab notes for Statistics for Social Sciences II ...

4.1. MORE R BASICS 167

−2 −1 0 1 2

0.0

0.2

0.4

0.6

0.8

1.0

x

y

# Computing the 95% quantile for a U(0, 1)qunif(p = 0.95, min = 0, max = 1)## [1] 0.95

# Sampling from a Bi(10, 0.5)set.seed(45678)samp <- rbinom(n = 200, size = 10, prob = 0.5)table(samp) / 200## samp## 1 2 3 4 5 6 7 8 9## 0.010 0.060 0.115 0.220 0.210 0.215 0.115 0.045 0.010

# Plotting the probability mass of a Bi(10, 0.5)x <- 0:10y <- dbinom(x = x, size = 10, prob = 0.5)plot(x, y, type = "h") # Vertical bars

Page 168: Lab notes for Statistics for Social Sciences II ...

168 CHAPTER 4. LOGISTIC REGRESSION

0 2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

0.25

x

y

# Plotting the distribution function of a Bi(10, 0.5)x <- 0:10y <- pbinom(q = x, size = 10, prob = 0.5)plot(x, y, type = "h")

0 2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

x

y

Do the following:

– Compute the 90%, 95% and 99% quantiles of a 𝐹 distributionwith df1 = 1 and df2 = 5. (Answer: c(4.060420, 6.607891,16.258177))

Page 169: Lab notes for Statistics for Social Sciences II ...

4.1. MORE R BASICS 169

– Plot the distribution function of a 𝑈(0, 1). Does it make sensewith its density function?

– Sample 100 points from a Poisson with lambda = 5.– Sample 100 points from a 𝑈(−1, 1) and compute its mean.– Plot the density of a 𝑡 distribution with df = 1 (use a sequence

spanning from -4 to 4). Add lines of different colors with thedensities for df = 5, df = 10, df = 50 and df = 100. Do yousee any pattern?

4.1.6 Defining functions# A function is a way of encapsulating a block of code so it can be reused easily# They are useful for simplifying repetitive tasks and organize the analysis# For example, in Section 3.7 we had to make use of simpleAnova for computing# the simple ANOVA table in multiple regression.

# This is a silly function that takes x and y and returns its sumadd <- function(x, y) {x + y

}

# Calling add - you need to run the definition of the function first!add(1, 1)## [1] 2add(x = 1, y = 2)## [1] 3

# A more complex function: computes a linear model and its posterior summary.# Saves us a few keystrokes when computing a lm and a summarylmSummary <- function(formula, data) {model <- lm(formula = formula, data = data)summary(model)

}

# UsagelmSummary(Sepal.Length ~ Petal.Width, iris)#### Call:## lm(formula = formula, data = data)#### Residuals:## Min 1Q Median 3Q Max## -1.38822 -0.29358 -0.04393 0.26429 1.34521

Page 170: Lab notes for Statistics for Social Sciences II ...

170 CHAPTER 4. LOGISTIC REGRESSION

#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 4.77763 0.07293 65.51 <2e-16 ***## Petal.Width 0.88858 0.05137 17.30 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.478 on 148 degrees of freedom## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16

# Recall: there is no variable called model in the workspace.# The function works on its own workspace!model#### Call:## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)#### Coefficients:## (Intercept) crim lstat zn nox## 30.93462 -0.08297 -0.90940 0.03493 5.42234

# Add a line to a plotaddLine <- function(x, beta0, beta1) {lines(x, beta0 + beta1 * x, lwd = 2, col = 2)

}

# Usageplot(x, y)addLine(x, beta0 = 0.1, beta1 = 0)

Page 171: Lab notes for Statistics for Social Sciences II ...

4.2. EXAMPLES AND APPLICATIONS 171

0 2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

x

y

4.2 Examples and applications4.2.1 Case study: The Challenger disasterThe Challenger disaster occurred on the 28th January of 1986, when the NASASpace Shuttle orbiter Challenger broke apart and disintegrated at 73 secondsinto its flight, leading to the deaths of its seven crew members. The accidentdeeply shocked the US society, in part due to the attention the mission hadreceived because of the presence of Christa McAuliffe, who would have been thefirst astronaut-teacher. Because of this, NASA TV broadcasted live the launchto US public schools, which resulted in millions of school children witnessingthe accident. The accident had serious consequences for the NASA credibilityand resulted in an interruption of 32 months in the shuttle program. ThePresidential Rogers Commission (formed by astronaut Neil A. Armstrong andNobel laureate Richard P. Feynman, among others) was created to investigatethe disaster.

The Rogers Commission elaborated a report (Presidential Commission on theSpace Shuttle Challenger Accident, 1986) with all the findings. The commissiondetermined that the disintegration began with the failure of an O-ring sealin the solid rocket motor due to the unusual cold temperatures (-0.6 Celsius degrees) during the launch. This failure produced a breach ofburning gas through the solid rocket motor that compromised the whole shuttlestructure, resulting in its disintegration due to the extreme aerodynamic forces.The problematic with O-rings was something known: the night beforethe launch, there was a three-hour teleconference between motor engineers andNASA management, discussing the effect of low temperature forecasted for thelaunch on the O-ring performance. The conclusion, influenced by Figure 4.2a,

Page 172: Lab notes for Statistics for Social Sciences II ...

172 CHAPTER 4. LOGISTIC REGRESSION

Figure 4.1: Challenger launch and posterior explosion, as broadcasted live byNBC in 28/01/1986.

was:

“Temperature data [are] not conclusive on predicting primary O-ringblowby.”

The Rogers Commission noted a major flaw in Figure 4.2a: the flights withzero incidents were excluded from the plot because it was felt that theseflights did not contribute any information about the temperature ef-fect (Figure 4.2b). The Rogers Commission concluded:

“A careful analysis of the flight history of O-ring performance wouldhave revealed the correlation of O-ring damage in low temperature”.

The purpose of this case study, inspired by Dalal et al. (1989), is to quantifywhat was the influence of the temperature in the probability of having at leastone incident related with the O-rings. Specifically, we want to address thefollowing questions:

• Q1. Is the temperature associated with O-ring incidents?• Q2. In which way was the temperature affecting the probability of O-ring

incidents?• Q3. What was the predicted probability of an incident in an O-ring for the

temperature of the launch day?

To try to answer these questions we have the challenger dataset (download).The dataset contains (shown in Table 4.1) information regarding the state of thesolid rocket boosters after launch1 for 23 flights. Each row has, among others,the following variables:

1After the shuttle exits the atmosphere, the solid rocket boosters separate and descend toland using a parachute where they are carefully analyzed.

Page 173: Lab notes for Statistics for Social Sciences II ...

4.2. EXAMPLES AND APPLICATIONS 173

Figure 4.2: Number of incidents in the O-rings (filed joints) versus temperatures.Panel a includes only flights with incidents. Panel b contains all flights (withand without incidents).

• fail.field, fail.nozzle: binary variables indicating whether there wasan incident with the O-rings in the field joints or in the nozzles of the solidrocket boosters. 1 codifies an incident and 0 its absence. On the analysis,we focus on the O-rings of the field joint as being the most determinantsfor the accident.

• temp: temperature in the day of launch. Measured in Celsius degrees.• pres.field, pres.nozzle: leak-check pressure tests of the O-rings.

These tests assured that the rings would seal the joint.

Table 4.1: The challenger dataset.

flight date fail.field fail.nozzle temp1 12/04/81 0 0 18.92 12/11/81 1 0 21.13 22/03/82 0 0 20.65 11/11/82 0 0 20.06 04/04/83 0 1 19.4

Page 174: Lab notes for Statistics for Social Sciences II ...

174 CHAPTER 4. LOGISTIC REGRESSION

7 18/06/83 0 0 22.28 30/08/83 0 0 22.89 28/11/83 0 0 21.141-B 03/02/84 1 1 13.941-C 06/04/84 1 1 17.241-D 30/08/84 1 1 21.141-G 05/10/84 0 0 25.651-A 08/11/84 0 0 19.451-C 24/01/85 1 1 11.751-D 12/04/85 0 1 19.451-B 29/04/85 0 1 23.951-G 17/06/85 0 1 21.151-F 29/07/85 0 0 27.251-I 27/08/85 0 0 24.451-J 03/10/85 0 0 26.161-A 30/10/85 1 0 23.961-B 26/11/85 0 1 24.461-C 12/01/86 1 1 14.4

Let’s begin the analysis by replicating Figures 4.2a and 4.2b and checking thatlinear regression is not the right tool for answering Q1–Q3. For that, we maketwo scatterplots of nfails.field (number of total incidents in the field joints)versus temp, the first one excluding the launches without incidents (subset =nfails.field > 0) and the second one for all the data. Doing it through RCommander as we saw in Chapter 2, you should get something similar to:scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,

boxplots = FALSE, data = challenger, subset = nfails.field > 0)

Page 175: Lab notes for Statistics for Social Sciences II ...

4.2. EXAMPLES AND APPLICATIONS 175

12 14 16 18 20 22 24

1.0

1.2

1.4

1.6

1.8

2.0

temp

nfai

ls.fi

eld

scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,boxplots = FALSE, data = challenger)

15 20 25

0.0

0.5

1.0

1.5

2.0

temp

nfai

ls.fi

eld

There is a fundamental problem in using linear regression for this data: theresponse is not continuous. As a consequence, there is no linearity and theerrors around the mean are not normal (indeed, they are strongly non normal).Let’s check this with the corresponding diagnostic plots:mod <- lm(nfails.field ~ temp, data = challenger)par(mfrow = 1:2)plot(mod, 1)plot(mod, 2)

Page 176: Lab notes for Statistics for Social Sciences II ...

176 CHAPTER 4. LOGISTIC REGRESSION

−0.2 0.2 0.6 1.0

−0.

50.

00.

51.

01.

52.

0

Fitted values

Res

idua

ls

Residuals vs Fitted

21

142

−2 −1 0 1 2

−1

01

23

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

21

14

2

Albeit linear regression is not the adequate tool for this data, it is able to detectthe obvious difference between the two plots:

1. The trend for launches with incidents is flat, hence suggestingthere is no dependence on the temperature (Figure 4.2a). This wasone of the arguments behind NASA’s decision of launching the rocket ata temperature of -0.6 degrees.

2. However, the trend for all launches indicates a clear negative de-pendence between temperature and number of incidents! (Figure4.2b). Think about it in this way: the minimum temperature for a launchwithout incidents ever recorded was above 18 degrees, but the Challengerwas launched at -0.6 without clearly knowing the effects of such low tem-peratures.

Instead of trying to predict the number of incidents, we will concentrate onmodeling the probability of expecting at least one incident given the temperature,a simpler but also revealing approach. In other words, we look to estimate thefollowing curve:

𝑝(𝑥) = ℙ(incident = 1|temperature = 𝑥)

from fail.field and temp. This probability can not be properly modeled asa linear function like 𝛽0 + 𝛽1𝑥, since inevitably will fall outside [0, 1] for somevalue of 𝑥 (some will have negative probabilities or probabilities larger thanone). The technique that solves this problem is the logistic regression. Theidea behind is quite simple: transform a linear model 𝛽0 + 𝛽1𝑥 – which is aimedfor a response in ℝ – so that it yields a value in [0, 1]. This is achieved by the

Page 177: Lab notes for Statistics for Social Sciences II ...

4.2. EXAMPLES AND APPLICATIONS 177

logistic function

logistic(𝑡) = 𝑒𝑡

1 + 𝑒𝑡 = 11 + 𝑒−𝑡 . (4.1)

The logistic model in this case is

ℙ(incident = 1|temperature = 𝑥) = logistic (𝛽0 + 𝛽1𝑥) = 11 + 𝑒−(𝛽0+𝛽1𝑥) ,

with 𝛽0 and 𝛽1 unknown. Let’s fit the model to the data by estimating 𝛽0 and𝛽1.

In order to fit a logistic regression to the data, go to 'Statistics' -> 'Fitmodels' -> 'Generalized linear model...'. A window like Figure 4.3 willpop-up, which you should fill as indicated.

Figure 4.3: Window for performing logistic regression.

A code like this will be generated:nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)summary(nasa)#### Call:## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.0566 -0.7575 -0.3818 0.4571 2.2195##

Page 178: Lab notes for Statistics for Social Sciences II ...

178 CHAPTER 4. LOGISTIC REGRESSION

## Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) 7.5837 3.9146 1.937 0.0527 .## temp -0.4166 0.1940 -2.147 0.0318 *## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 28.267 on 22 degrees of freedom## Residual deviance: 20.335 on 21 degrees of freedom## AIC: 24.335#### Number of Fisher Scoring iterations: 5exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")## (Intercept) temp## 1965.9743592 0.6592539

The summary of the logistic model is notably different from the linear regression,as the methodology behind is quite different. Nevertheless, we have tests forthe significance of each coefficient. Here we obtain that temp is significantlydifferent from zero, at least at a level 𝛼 = 0.05. Therefore we can concludethat the temperature is indeed affecting the probability of an incidentwith the O-rings (answers Q1).

The precise interpretation of the coefficients will be given in the next section.For now, the coefficient of temp, 𝛽1, can be regarded the “correlation betweenthe temperature and the probability of having at least one incident”. Thiscorrelation, as evidenced by the sign of 𝛽1, is negative. Let’s plot the fittedlogistic curve to see that indeed the probability of incident and temperature arenegatively correlated:# Plot dataplot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",

ylab = "Incident probability")

# Draw the fitted logistic curvex <- seq(-1, 30, l = 200)y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))y <- 1 / (1 + y)lines(x, y, col = 2, lwd = 2)

# The Challengerpoints(-0.6, 1, pch = 16)text(-0.6, 1, labels = "Challenger", pos = 4)

Page 179: Lab notes for Statistics for Social Sciences II ...

4.2. EXAMPLES AND APPLICATIONS 179

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

Temperature

Inci

dent

pro

babi

lity

Challenger

At the sight of this curve and the summary of the model we can conclude thatthe temperature was increasing the probability of an O-ring incident(Q2). Indeed, the confidence intervals for the coefficients show a significativenegative correlation at level 𝛼 = 0.05:confint(nasa, level = 0.95)## 2.5 % 97.5 %## (Intercept) 1.3364047 17.7834329## temp -0.9237721 -0.1089953

Finally, the probability of having at least one incident with the O-ringsin the launch day was 0.9996 according to the fitted logistic model (Q3).This is easily obtained:predict(nasa, newdata = data.frame(temp = -0.6), type = "response")## 1## 0.999604

Be aware that type = "response" has a different meaning in logistic regression.As you can see it does not return a CI for the prediction as in linear models.Instead, type = "response" means that the probability should be returned,instead of the value of the link function, which is returned with type = "link"(the default).

Recall that there is a serious problem of extrapolation in the prediction, whichmakes it less precise (or more variable). But this extrapolation, together withthe evidences raised by a simple analysis like we did, should have been strongarguments for postponing the launch.

To conclude this section, we refer to a funny and comprehensive exposition byJuan Cuesta (University of Cantabria) on the flawed statistical analysis thatcontributed to the Challenger disaster.

Page 180: Lab notes for Statistics for Social Sciences II ...

180 CHAPTER 4. LOGISTIC REGRESSION

Figure 4.4: The Challenger disaster and other elegant applications of statisticsin complex problems.

4.3 Model formulation and estimation by maxi-mum likelihood

As we saw in Section 3.2, the multiple linear model described the relation be-tween the random variables 𝑋1, … , 𝑋𝑘 and 𝑌 by assuming the linear relation

𝑌 = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝜀.

Since we assume 𝔼[𝜀|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 0, the previous equation wasequivalent to

𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘, (4.2)

where 𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] is the mean of 𝑌 for a particular value of theset of predictors. As remarked in Section 3.3, it was a necessary condition that𝑌 was continuous in order to satisfy the normality of the errors, hence the linearmodel assumptions. Or in other words, the linear model is designed for acontinuous response.

The situation when 𝑌 is discrete (naturally ordered values; e.g. number of fails,number of students) or categorical (non-ordered categories; e.g. territorial di-visions, ethnic groups) requires a special treatment. The simplest situation iswhen 𝑌 is binary (or dichotomic): it can only take two values, codified for con-venience as 1 (success) and 0 (failure). For example, in the Challenger casestudy we used fail.field as an indicator of whether “there was at least anincident with the O-rings” (1 = yes, 0 = no). For binary variables there isno fundamental distinction between the treatment of discrete and categoricalvariables.

Page 181: Lab notes for Statistics for Social Sciences II ...

4.3. MODEL FORMULATION AND ESTIMATION BY MAXIMUM LIKELIHOOD181

More formally, a binary variable is known as a Bernoulli variable, which is thesimplest non-trivial random variable. We say that 𝑌 ∼ Ber(𝑝), 0 ≤ 𝑝 ≤ 1, if

𝑌 = { 1, with probability 𝑝,0, with probability 1 − 𝑝,

or, equivalently, if ℙ[𝑌 = 1] = 𝑝 and ℙ[𝑌 = 0] = 1 − 𝑝, which can be writtencompactly as

ℙ[𝑌 = 𝑦] = 𝑝𝑦(1 − 𝑝)1−𝑦, 𝑦 = 0, 1. (4.3)

Recall that a binomial variable with size 𝑛 and probability 𝑝, Bi(𝑛, 𝑝), is obtainedby summing 𝑛 independent Ber(𝑝) (so Ber(𝑝) is the same as Bi(1, 𝑝)). This iswhy we need to use a family = "binomial" in glm, to indicate that the responseis binomial.

A Bernoulli variable 𝑌 is completely determined by the probability 𝑝.So do its mean and variance:

– 𝔼[𝑌 ] = 𝑝 × 1 + (1 − 𝑝) × 0 = 𝑝– 𝕍ar[𝑌 ] = 𝑝(1 − 𝑝)

In particular, recall that ℙ[𝑌 = 1] = 𝔼[𝑌 ] = 𝑝.

This is something relatively uncommon (on a 𝒩(𝜇, 𝜎2), 𝜇 determinesthe mean and 𝜎2 the variance) that has important consequences forthe logistic model: we do not need a 𝜎2.

Are these Bernoulli variables? If so, which is the value of 𝑝 and whatcould the codes 0 and 1 represent?

– The toss of a fair coin.– A variable with mean 𝑝 and variance 𝑝(1 − 𝑝).– The roll of a dice.– A binary variable with mean 0.5 and variance 0.45.– The winner of an election with two candidates.

Assume then that 𝑌 is a binary/Bernoulli variable and that 𝑋1, … , 𝑋𝑘 arepredictors associated to 𝑌 (no particular assumptions on them). The purposein logistic regression is to estimate

𝑝(𝑥1, … , 𝑥𝑘) = ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘],this is, how the probability of 𝑌 = 1 is changing according to particular values,denoted by 𝑥1, … , 𝑥𝑘, of the random variables 𝑋1, … , 𝑋𝑘. 𝑝(𝑥1, … , 𝑥𝑘) = ℙ[𝑌 =1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] stands for the conditional probability of 𝑌 = 1 given𝑋1, … , 𝑋𝑘. At sight of (4.2), a tempting possibility is to consider the model

𝑝(𝑥1, … , 𝑥𝑘) = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘.

Page 182: Lab notes for Statistics for Social Sciences II ...

182 CHAPTER 4. LOGISTIC REGRESSION

However, such a model will run into serious problems inevitably: negative prob-abilities and probabilities larger than one. A solution is to consider a functionto encapsulate the value of 𝑧 = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘, in ℝ, and map it back to[0, 1]. There are several alternatives to do so, based on distribution functions𝐹 ∶ ℝ ⟶ [0, 1] that deliver 𝑦 = 𝐹(𝑧) ∈ [0, 1] (see Figure 4.5). Different choicesof 𝐹 give rise to different models:

• Uniform. Truncate 𝑧 to 0 and 1 when 𝑧 < 0 and 𝑧 > 1, respectively.• Logit. Consider the logistic distribution function:

𝐹(𝑧) = logistic(𝑧) = 𝑒𝑧

1 + 𝑒𝑧 = 11 + 𝑒−𝑧 .

• Probit. Consider the normal distribution function, this is, 𝐹 = Φ.

−3 −2 −1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

x

Pro

babi

lity

Linear regressionUniformLogitProbit

Figure 4.5: Different transformations mapping the response of a simple linearregression 𝑧 = 𝛽0 + 𝛽1𝑥 to [0, 1].

The logistic transformation is the most employed due to its tractability, in-terpretability and smoothness. Its inverse, 𝐹 −1 ∶ [0, 1] ⟶ ℝ, known as thelogit function, is

logit(𝑝) = logistic−1(𝑝) = log 𝑝1 − 𝑝 .

This is a link function, this is, a function that maps a given space (in this case[0, 1]) into ℝ. The term link function is employed in generalized linear models,which follow exactly the same philosophy of the logistic regression – mappingthe domain of 𝑌 to ℝ in order to apply there a linear model. We will concentrate

Page 183: Lab notes for Statistics for Social Sciences II ...

4.3. MODEL FORMULATION AND ESTIMATION BY MAXIMUM LIKELIHOOD183

here exclusively on the logit as a link function. Therefore, the logistic model is

𝑝(𝑥1, … , 𝑥𝑘) = logistic(𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘) = 11 + 𝑒−(𝛽0+𝛽1𝑥1+⋯+𝛽𝑘𝑥𝑘) .

(4.4)

The linear form inside the exponent has a clear interpretation:

• If 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘 = 0, then 𝑝(𝑥1, … , 𝑥𝑘) = 12 (𝑌 = 1 and 𝑌 = 0 are

equally likely).• If 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘 < 0, then 𝑝(𝑥1, … , 𝑥𝑘) < 1

2 (𝑌 = 1 less likely).• If 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘 > 0, then 𝑝(𝑥1, … , 𝑥𝑘) > 1

2 (𝑌 = 1 more likely).

To be more precise on the interpretation of the coefficients 𝛽0, … , 𝛽𝑘 we needto introduce the odds. The odds is an equivalent way of expressing thedistribution of probabilities in a binary variable. Since ℙ[𝑌 = 1] = 𝑝and ℙ[𝑌 = 0] = 1 − 𝑝, both the success and failure probabilities can be inferredfrom 𝑝. Instead of using 𝑝 to characterize the distribution of 𝑌 , we can use

odds(𝑌 ) = 𝑝1 − 𝑝 = ℙ[𝑌 = 1]

ℙ[𝑌 = 0] . (4.5)

The odds is the ratio between the probability of success and the probability offailure. It is extensively used in betting2 due to its better interpretability. Forexample, if a horse 𝑌 has a probability 𝑝 = 2/3 of winning a race (𝑌 = 1), thenthe odds of the horse is

odds = 𝑝1 − 𝑝 = 2/3

1/3 = 2.

This means that the horse has a probability of winning that is twice larger thanthe probability of losing. This is sometimes written as a 2 ∶ 1 or 2 × 1 (spelled“two-to-one”). Conversely, if the odds of 𝑌 is given, we can easily know what isthe probability of success 𝑝, using the inverse of (4.5):

𝑝 = ℙ[𝑌 = 1] = odds(𝑌 )1 + odds(𝑌 ) .

For example, if the odds of the horse were 5, that would correspond to a prob-ability of winning 𝑝 = 5/6.

Recall that the odds is a number in [0, +∞]. The 0 and +∞ valuesare attained for 𝑝 = 0 and 𝑝 = 1, respectively. The log-odds (or logit)is a number in [−∞, +∞].

2Recall that the result of a bet is binary: you either win or lose the bet.

Page 184: Lab notes for Statistics for Social Sciences II ...

184 CHAPTER 4. LOGISTIC REGRESSION

We can rewrite (4.4) in terms of the odds (4.5). If we do so, we have:

odds(𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) = 𝑝(𝑥1, … , 𝑥𝑘)1 − 𝑝(𝑥1, … , 𝑥𝑘)

= 𝑒𝛽0+𝛽1𝑥1+⋯+𝛽𝑘𝑥𝑘

= 𝑒𝛽0𝑒𝛽1𝑥1 … 𝑒𝛽𝑘𝑥𝑘 (4.6)

or, taking logarithms, the log-odds (or logit)

log(odds(𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘)) = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘. (4.7)

The conditional log-odds (4.7) plays here the role of the conditional mean formultiple linear regression. Therefore, we have an analogous interpretation forthe coefficients:

• 𝛽0: is the log-odds when 𝑋1 = … = 𝑋𝑘 = 0.• 𝛽𝑗, 1 ≤ 𝑗 ≤ 𝑘: is the additive increment of the log-odds for an in-

crement of one unit in 𝑋𝑗 = 𝑥𝑗, provided that the remaining variables𝑋1, … , 𝑋𝑗−1, 𝑋𝑗+1, … , 𝑋𝑘 do not change.

The log-odds is not so easy to interpret as the odds. For that reason, an equiv-alent way of interpreting the coefficients, this time based on (4.6), is:

• 𝑒𝛽0 : is the odds when 𝑋1 = … = 𝑋𝑘 = 0.• 𝑒𝛽𝑗 , 1 ≤ 𝑗 ≤ 𝑘: is the multiplicative increment of the odds for an

increment of one unit in 𝑋𝑗 = 𝑥𝑗, provided that the remaining variables𝑋1, … , 𝑋𝑗−1, 𝑋𝑗+1, … , 𝑋𝑘 do not change. If the increment in 𝑋𝑗 is of 𝑟units, then the multiplicative increment in the odds is (𝑒𝛽𝑗)𝑟.

As a consequence of this last interpretation, we have:

If 𝛽𝑗 > 0 (respectively, 𝛽𝑗 < 0) then 𝑒𝛽𝑗 > 1 (𝑒𝛽𝑗 < 1) in (4.6). There-fore, an increment of one unit in 𝑋𝑗, provided that the remainingvariables 𝑋1, … , 𝑋𝑗−1, 𝑋𝑗+1, … , 𝑋𝑘 do not change, results in an incre-ment (decrement) of the odds, this is, in an increment (decrement) ofℙ[𝑌 = 1].

Since the relationship between 𝑝(𝑋1, … , 𝑋𝑘) and 𝑋1, … , 𝑋𝑘 is notlinear, 𝛽𝑗 does not correspond to the change in 𝑝(𝑋1, … , 𝑋𝑘)associated with a one-unit increase in 𝑋𝑗.

Let’s visualize this concepts quickly with the output of the Challenger casestudy:

Page 185: Lab notes for Statistics for Social Sciences II ...

4.3. MODEL FORMULATION AND ESTIMATION BY MAXIMUM LIKELIHOOD185

nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)summary(nasa)#### Call:## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.0566 -0.7575 -0.3818 0.4571 2.2195#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) 7.5837 3.9146 1.937 0.0527 .## temp -0.4166 0.1940 -2.147 0.0318 *## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 28.267 on 22 degrees of freedom## Residual deviance: 20.335 on 21 degrees of freedom## AIC: 24.335#### Number of Fisher Scoring iterations: 5exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")## (Intercept) temp## 1965.9743592 0.6592539

# Plot dataplot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",

ylab = "Incident probability")

# Draw the fitted logistic curvex <- seq(-1, 30, l = 200)y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))y <- 1 / (1 + y)lines(x, y, col = 2, lwd = 2)

# The Challengerpoints(-0.6, 1, pch = 16)text(-0.6, 1, labels = "Challenger", pos = 4)

Page 186: Lab notes for Statistics for Social Sciences II ...

186 CHAPTER 4. LOGISTIC REGRESSION

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

Temperature

Inci

dent

pro

babi

lity

Challenger

The exponentials of the estimated coefficients are:

• 𝑒 𝛽0 = 1965.974. This means that, when the temperature is zero, the fittedodds is 1965.974, so the probability of having an incident (𝑌 = 1) is1965.974 times larger than the probability of not having an incident (𝑌 =0). In other words, the probability of having an incident at temperaturezero is 1965.974

1965.974+1 = 0.999.• 𝑒 𝛽1 = 0.659. This means that each Celsius degree increment in the temper-

ature multiplies the fitted odds by a factor of 0.659 ≈ 23 , hence reducing

it.

The estimation of 𝛽 = (𝛽0, 𝛽1, … , 𝛽𝑘) from a sample (X1, 𝑌1), … , (X𝑛, 𝑌𝑛) isdifferent than in linear regression. It is not based on minimizing the RSS buton the principle of Maximum Likelihood Estimation (MLE). MLE is based onthe following leitmotiv: what are the coefficients 𝛽 that make the samplemore likely? Or in other words, what coefficients make the model moreprobable, based on the sample. Since 𝑌𝑖 ∼ Ber(𝑝(X𝑖)), 𝑖 = 1, … , 𝑛, thelikelihood of 𝛽 is

lik(𝛽) =𝑛

∏𝑖=1

𝑝(X𝑖)𝑌𝑖(1 − 𝑝(X𝑖))1−𝑌𝑖 . (4.8)

lik(𝛽) is the probability of the data based on the model. Therefore, it isa number between 0 and 1. Its detailed interpretation is the following:

• ∏𝑛𝑖=1 appears because the sample elements are assumed to be indepen-

dent and we are computing the probability of observing the whole sample(X1, 𝑌1), … , (X𝑛, 𝑌𝑛). This probability is equal to the product of the prob-abilities of observing each (X𝑖, 𝑌𝑖).

Page 187: Lab notes for Statistics for Social Sciences II ...

4.3. MODEL FORMULATION AND ESTIMATION BY MAXIMUM LIKELIHOOD187

• 𝑝(X𝑖)𝑌𝑖(1 − 𝑝(X𝑖))1−𝑌𝑖 is the probability of observing (X𝑖, 𝑌𝑖), as givenby (4.3). Remember that 𝑝 depends on 𝛽 due to (4.4).

Usually, the log-likelihood is considered instead of the likelihood for stabilityreasons – the estimates obtained are exactly the same and are

�� = arg max𝛽∈ℝ𝑘+1

log lik(𝛽).

Unfortunately, due to the non-linearity of the optimization problem there areno explicit expressions for ��. These have to be obtained numerically by meansof an iterative procedure (the number of iterations required is printed in theoutput of summary). In low sample situations with perfect classification, theiterative procedure may not converge.

Figure 4.6 shows how the log-likelihood changes with respect to the values for(𝛽0, 𝛽1) in three data patterns.

Figure 4.6: The logistic regression fit and its dependence on 𝛽0 (horizontaldisplacement) and 𝛽1 (steepness of the curve). Recall the effect of the sign of𝛽1 in the curve: if positive, the logistic curve has an ‘s’ form; if negative, theform is a reflected ‘s’. Application also available here.

The data of the illustration has been generated with the following code:

Page 188: Lab notes for Statistics for Social Sciences II ...

188 CHAPTER 4. LOGISTIC REGRESSION

# Dataset.seed(34567)x <- rnorm(50, sd = 1.5)y1 <- -0.5 + 3 * xy2 <- 0.5 - 2 * xy3 <- -2 + 5 * xy1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# DatadataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)

Let’s check that indeed the coefficients given by glm are the ones that maximizethe likelihood given in the animation of Figure 4.6. We do so for y ~ x1.mod <- glm(y1 ~ x, family = "binomial", data = dataMle)mod$coefficients## (Intercept) x## -0.1691947 2.4281626

For the regressions y ~ x2 and y ~ x3, do the following:

– Check that the true 𝛽 is close to maximizing the likelihood com-puted in Figure 4.6.

– Plot the fitted logistic curve and compare it with the one inFigure 4.6.

In linear regression we relied on least squares estimation, in otherwords, the minimization of the RSS. Why do we need MLE in logisticregression and not least squares? The answer is two-fold:

1. MLE is asymptotically optimal when estimating unknownparameters in a model. That means that when the sample size 𝑛is large, it is guaranteed to perform better than any otherestimation method. Therefore, considering a least squares ap-proach for logistic regression will result in suboptimal estimates.

2. In multiple linear regression, due to the normality assump-tion, MLE and least squares estimation coincide. So MLEis hidden under the form of the least squares, which is a moreintuitive estimation procedure. Indeed, the maximized likelihoodlik(��) in the linear model and the RSS are intimately related.

Page 189: Lab notes for Statistics for Social Sciences II ...

4.4. ASSUMPTIONS OF THE MODEL 189

As in the linear model, the inclusion of a new predictor changesthe coefficient estimates of the logistic model.

4.4 Assumptions of the modelSome probabilistic assumptions are required for performing inference on themodel parameters 𝛽 from the sample (X1, 𝑌1), … , (X𝑛, 𝑌𝑛). These assumptionsare somehow simpler than the ones for linear regression.

Figure 4.7: The key concepts of the logistic model.

The assumptions of the logistic model are the following:

i. Linearity in the logit3: logit(𝑝(x)) = log 𝑝(x)1−𝑝(x) = 𝛽0 +𝛽1𝑥1 +⋯+𝛽𝑘𝑥𝑘.

ii. Binariness: 𝑌1, … , 𝑌𝑛 are binary variables.iii. Independence: 𝑌1, … , 𝑌𝑛 are independent.

3An equivalent way of stating this assumption is 𝑝(x) = logistic(𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘).

Page 190: Lab notes for Statistics for Social Sciences II ...

190 CHAPTER 4. LOGISTIC REGRESSION

A good one-line summary of the logistic model is the following (independenceis assumed)

𝑌 |(𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) ∼ Ber (logistic(𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑘𝑥𝑘))

= Ber ( 11 + 𝑒−(𝛽0+𝛽1𝑥1+⋯+𝛽𝑘𝑥𝑘) ) . (4.9)

There are three important points of the linear model assumptions missing inthe ones for the logistic model:

• Why is homoscedasticity not required? As seen in the previous section,Bernoulli variables are determined only by the probability of success, inthis case 𝑝(x). That determines also the variance, which is variable, sothere is heteroskedasticity. In the linear model, we have to control 𝜎2

explicitly due to the higher flexibility of the normal.• Where are the errors? The errors played a fundamental role in the linear

model assumptions, but are not employed in logistic regression. The errorsare not fundamental for building the linear model but just a helpful con-cept related to least squares. The linear model can be constructed withouterrors as (3.5), which has a logistic analogous in (4.9).

• Why is normality not present? A normal distribution is not adequate toreplace the Bernoulli distribution in (4.9) since the response 𝑌 has to bebinary and the Normal or other continuous distribution would put yieldillegal values for 𝑌 .

Recall that:

– Nothing is said about the distribution of 𝑋1, … , 𝑋𝑘. They couldbe deterministic or random. They could be discrete or continu-ous.

– 𝑋1, … , 𝑋𝑘 are not required to be independent betweenthem.

4.5 Inference for model parametersThe assumptions on which the logistic model is constructed allow to specify whatis the asymptotic distribution of the random vector ��. Again, the distribution isderived conditionally on the sample predictors X1, … , X𝑛. In other words, weassume that the randomness of 𝑌 comes only from 𝑌 |(𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) ∼Ber(𝑝(x)) and not from the predictors. To denote this, we employ lowercase forthe sample predictors x1, … , x𝑛.

There is an important difference between the inference results for the linearmodel and for logistic regression:

Page 191: Lab notes for Statistics for Social Sciences II ...

4.5. INFERENCE FOR MODEL PARAMETERS 191

• In linear regression the inference is exact. This is due to the niceproperties of the normal, least squares estimation and linearity. As a con-sequence, the distributions of the coefficients are perfectly known assumingthat the assumptions hold.

• In logistic regression the inference is asymptotic. This means thatthe distributions of the coefficients are unknown except for large samplesizes 𝑛, for which we have approximations. The reason is the more com-plexity of the model in terms of non-linearity. This is the usual situationfor the majority of the regression models.

4.5.1 Distributions of the fitted coefficientsThe distribution of �� is given by the asymptotic theory of MLE:

�� ∼ 𝒩𝑘+1 (𝛽, 𝐼(𝛽)−1) (4.10)

where ∼ must be understood as approximately distributed as […] when 𝑛 → ∞for the rest of this chapter. 𝐼(𝛽) is known as the Fisher information matrix, andreceives that name because it measures the information available in the samplefor estimating 𝛽. Therefore, the larger the matrix is, the more precise is theestimation of 𝛽, because that results in smaller variances in (4.10). The inverseof the Fisher information matrix is

𝐼(𝛽)−1 = (X𝑇 VX)−1, (4.11)

where V is a diagonal matrix containing the different variances for each 𝑌𝑖(remember that 𝑝(x) = 1/(1 + 𝑒−(𝛽0+𝛽1𝑥1+⋯+𝛽𝑘𝑥𝑘))):

V =⎛⎜⎜⎜⎝

𝑝(X1)(1 − 𝑝(X1))𝑝(X2)(1 − 𝑝(X2))

⋱𝑝(X𝑛)(1 − 𝑝(X𝑛))

⎞⎟⎟⎟⎠

In the case of the multiple linear regression, 𝐼(𝛽)−1 = 𝜎2(X𝑇 X)−1 (see (3.6)),so the presence of V here is revealing the heteroskedasticity of the model.

The interpretation of (4.10) and (4.11) gives some useful insights on what con-cepts affect the quality of the estimation:

• Bias. The estimates are asymptotically unbiased.

• Variance. It depends on:

– Sample size 𝑛. Hidden inside X𝑇 VX. As 𝑛 grows, the precision ofthe estimators increases.

– Weighted predictor sparsity (X𝑇 VX)−1. The more sparse the predic-tor is (small |(X𝑇 VX)−1|), the more precise �� is.

Page 192: Lab notes for Statistics for Social Sciences II ...

192 CHAPTER 4. LOGISTIC REGRESSION

The precision of �� is affected by the value of 𝛽, which is hiddeninside V. This contrasts sharply with the linear model, where theprecision of the least squares estimator was not affected by the value ofthe unknown coefficients (see (3.6)). The reason is partially due to theheteroskedasticity of logistic regression, which implies a dependenceof the variance of 𝑌 in the logistic curve, hence in 𝛽.

Figure 4.8: Illustration of the randomness of the fitted coefficients ( 𝛽0, 𝛽1) andthe influence of 𝑛, (𝛽0, 𝛽1) and 𝑠2

𝑥. The sample predictors 𝑥1, … , 𝑥𝑛 are fixedand new responses 𝑌1, … , 𝑌𝑛 are generated each time from a logistic model𝑌 |𝑋 = 𝑥 ∼ Ber(𝑝(𝑋)). Application also available here.

Similar to linear regression, the problem with (4.10) and (4.11) is that V isunknown in practice because it depends on 𝛽. Plugging-in the estimate �� to 𝛽in V results in V. Now we can use V to get

𝛽𝑗 − 𝛽𝑗SE( 𝛽𝑗)

∼ 𝒩(0, 1), SE( 𝛽𝑗)2 = 𝑣2𝑗 (4.12)

Page 193: Lab notes for Statistics for Social Sciences II ...

4.5. INFERENCE FOR MODEL PARAMETERS 193

where𝑣𝑗 is the 𝑗-th element of the diagonal of (X𝑇 VX)−1.

The LHS of (3.7) is the Wald statistic for 𝛽𝑗, 𝑗 = 0, … , 𝑘. They are employedfor building confidence intervals and hypothesis tests.

4.5.2 Confidence intervals for the coefficientsThanks to (4.12), we can have the 100(1 − 𝛼)% CI for the coefficient 𝛽𝑗, 𝑗 =0, … , 𝑘:

( 𝛽𝑗 ± SE( 𝛽𝑗)𝑧𝛼/2) (4.13)

where 𝑧𝛼/2 is the 𝛼/2-upper quantile of the 𝒩(0, 1). In case we are interestedin the CI for 𝑒𝛽𝑗 , we can just simply take the exponential on the above CI. Sothe 100(1 − 𝛼)% CI for 𝑒𝛽𝑗 , 𝑗 = 0, … , 𝑘, is

𝑒( 𝛽𝑗± SE( 𝛽𝑗)𝑧𝛼/2).

Of course, this CI is not the same as (𝑒 𝛽𝑗 ± 𝑒 SE( 𝛽𝑗)𝑧𝛼/2), which is not a CI for

𝑒 𝛽𝑗 .

Let’s see how we can compute the CIs. We return to the challenger dataset,so in case you do not have it loaded, you can download it here. We analyze theCI for the coefficients of fail.field ~ temp.# Fit modelnasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)

# Confidence intervals at 95%confint(nasa)## Waiting for profiling to be done...## 2.5 % 97.5 %## (Intercept) 1.3364047 17.7834329## temp -0.9237721 -0.1089953

# Confidence intervals at other levelsconfint(nasa, level = 0.90)## Waiting for profiling to be done...## 5 % 95 %## (Intercept) 2.2070301 15.7488590## temp -0.8222858 -0.1513279

# Confidence intervals for the factors affecting the oddsexp(confint(nasa))## Waiting for profiling to be done...## 2.5 % 97.5 %

Page 194: Lab notes for Statistics for Social Sciences II ...

194 CHAPTER 4. LOGISTIC REGRESSION

## (Intercept) 3.8053375 5.287456e+07## temp 0.3970186 8.967346e-01

In this example, the 95% confidence interval for 𝛽0 is (1.3364, 17.7834) and for𝛽1 is (−0.9238, −0.1090). For 𝑒𝛽0 and 𝑒𝛽1 , the CIs are (3.8053, 5.2874 × 107)and (0.3070, 0.8967), respectively. Therefore, we can say with a 95% confidencethat:

• When temp=0, the probability of fail.field=1 is significantly lagerthan the probability of fail.field=0 (using the CI for 𝛽0). Indeed,fail.field=1 is between 3.8053 and 5.2874 × 107 more likely thanfail.field=0 (using the CI for 𝑒𝛽0).

• temp has a significantly negative effect in the probability of fail.field=1(using the CI for 𝛽1). Indeed, each unit increase in temp produces areduction of the odds of fail.field by a factor between 0.3070 and 0.8967(using the CI for 𝑒𝛽1).

Compute and interpret the CIs for the exponentiated coefficients, atlevel 𝛼 = 0.05, for the following regressions (challenger dataset):

– fail.field ~ temp + pres.field– fail.nozzle ~ temp + pres.nozzle– fail.field ~ temp + pres.nozzle– fail.nozzle ~ temp + pres.field

The interpretation of the variables is given above Table 4.1.

4.5.3 Testing on the coefficientsThe distributions in (4.12) also allow to conduct a formal hypothesis test on thecoefficients 𝛽𝑗, 𝑗 = 0, … , 𝑘. For example, the test for significance:

𝐻0 ∶ 𝛽𝑗 = 0

for 𝑗 = 0, … , 𝑘. The test of 𝐻0 ∶ 𝛽𝑗 = 0 with 1 ≤ 𝑗 ≤ 𝑘 is especially interesting,since it allows to answer whether the variable 𝑋𝑗 has a significant effect onℙ[𝑌 = 1]. The statistic used for testing for significance is the Wald statistic

𝛽𝑗 − 0SE( 𝛽𝑗)

,

which is asymptotically distributed as a 𝒩(0, 1) under the (veracity of) the nullhypothesis. 𝐻0 is tested against the bilateral alternative hypothesis 𝐻1 ∶ 𝛽𝑗 ≠ 0.

The tests for significance are built-in in the summary function. However, a noteof caution is required when applying the rule of thumb:

Page 195: Lab notes for Statistics for Social Sciences II ...

4.5. INFERENCE FOR MODEL PARAMETERS 195

Is the CI for 𝛽𝑗 below (above) 0 at level 𝛼?

• Yes → reject 𝐻0 at level 𝛼.• No → the criterion is not conclusive.

The significances given in summary and the output of confint areslightly incoherent and the previous rule of thumb does not ap-ply. The reason is because MASS’s confint is using a more sophis-ticated method (profile likelihood) to estimate the standard error of

𝛽𝑗, SE( 𝛽𝑗), and not the asymptotic distribution behind Wald statistic.

By changing confint to R’s default confint.default, the results ofthe latter will be completely equivalent to the significances in summary,and the rule of thumb still be completely valid. For the contents of thiscourse we prefer confint.default due to its better interpretability.

To illustrate this we consider the regression of fail.field ~ temp +pres.field:# Significances with asymptotic approximation for the standard errorsnasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",

data = challenger)summary(nasa2)#### Call:## glm(formula = fail.field ~ temp + pres.field, family = "binomial",## data = challenger)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.2109 -0.6081 -0.4292 0.3498 2.0913#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) 6.642709 4.038547 1.645 0.1000## temp -0.435032 0.197008 -2.208 0.0272 *## pres.field 0.009376 0.008821 1.063 0.2878## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 28.267 on 22 degrees of freedom## Residual deviance: 19.078 on 20 degrees of freedom## AIC: 25.078##

Page 196: Lab notes for Statistics for Social Sciences II ...

196 CHAPTER 4. LOGISTIC REGRESSION

## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation - coherent with summaryconfint.default(nasa2, level = 0.90)## 5 % 95 %## (Intercept) -0.000110501 13.28552771## temp -0.759081468 -0.11098301## pres.field -0.005132393 0.02388538confint.default(nasa2, level = 0.99)## 0.5 % 99.5 %## (Intercept) -3.75989977 17.04531697## temp -0.94249107 0.07242659## pres.field -0.01334432 0.03209731

# CIs with profile likelihood - incoherent with summaryconfint(nasa2, level = 0.90) # intercept still significant## Waiting for profiling to be done...## 5 % 95 %## (Intercept) 0.945372123 14.93392497## temp -0.845250023 -0.16532086## pres.field -0.004184814 0.02602181confint(nasa2, level = 0.99) # temp still significant## Waiting for profiling to be done...## 0.5 % 99.5 %## (Intercept) -1.86541750 21.49637422## temp -1.17556090 -0.04317904## pres.field -0.01164943 0.03836968

For the previous exercise, check the differences of using confint orconfint.default for computing the CIs.

4.6 PredictionPrediction in logistic regression focuses mainly on predicting the values of thelogistic curve

𝑝(𝑥1, … , 𝑥𝑘) = ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 11 + 𝑒−(𝛽0+𝛽1𝑥1+⋯+𝛽𝑘𝑥𝑘)

by means of

𝑝(𝑥1, … , 𝑥𝑘) = ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘] = 11 + 𝑒−( 𝛽0+ 𝛽1𝑥1+⋯+ 𝛽𝑘𝑥𝑘)

.

From the perspective of the linear model, this is the same as predicting theconditional mean (not the conditional response) of the response, but this

Page 197: Lab notes for Statistics for Social Sciences II ...

4.6. PREDICTION 197

time this conditional mean is also a conditional probability. The predictionof the conditional response is not so interesting since it follows immediately from𝑝(𝑥1, … , 𝑥𝑘):

𝑌 |(𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) = { 1, with probability 𝑝(𝑥1, … , 𝑥𝑘),0, with probability 1 − 𝑝(𝑥1, … , 𝑥𝑘).

As a consequence, we can predict 𝑌 as 1 if 𝑝(𝑥1, … , 𝑥𝑘) > 12 and as 0 if

𝑝(𝑥1, … , 𝑥𝑘) < 12 .

Let’s focus then on how to make predictions and compute CIs in practice withpredict. Similarly to the linear model, the objects required for predict are:first, the output of glm; second, a data.frame containing the locations x =(𝑥1, … , 𝑥𝑘) where we want to predict 𝑝(𝑥1, … , 𝑥𝑘). However, there are twodifferences with respect to the use of predict for lm:

• The argument type. type = "link", gives the predictions in the log-odds,this is, returns log ��(𝑥1,…,𝑥𝑘)

1−��(𝑥1,…,𝑥𝑘) . type = "response" gives the predictionsin the probability space [0, 1], this is, returns 𝑝(𝑥1, … , 𝑥𝑘).

• There is no interval argument for using predict for glm. That meansthat there is no easy way of computing CIs for prediction.

Since it is a bit cumbersome to compute by yourself the CIs, we can code thefunction predictCIsLogistic so that it computes them automatically for you,see below.# Data for which we want a prediction# Important! You have to name the column with the predictor name!newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds - the defaultpredict(nasa, newdata = newdata, type = "link")## 1## 7.833731

# Prediction of the conditional probabilitypredict(nasa, newdata = newdata, type = "response")## 1## 0.999604

# Function for computing the predictions and CIs for the conditional probabilitypredictCIsLogistic <- function(object, newdata, level = 0.95) {

# Compute predictions in the log-oddspred <- predict(object = object, newdata = newdata, se.fit = TRUE)

# CI in the log-oddsza <- qnorm(p = (1 - level) / 2)

Page 198: Lab notes for Statistics for Social Sciences II ...

198 CHAPTER 4. LOGISTIC REGRESSION

lwr <- pred$fit + za * pred$se.fitupr <- pred$fit - za * pred$se.fit

# Transform to probabilitiesfit <- 1 / (1 + exp(-pred$fit))lwr <- 1 / (1 + exp(-lwr))upr <- 1 / (1 + exp(-upr))

# Return a matrix with column names "fit", "lwr" and "upr"result <- cbind(fit, lwr, upr)colnames(result) <- c("fit", "lwr", "upr")return(result)

}

# Simple callpredictCIsLogistic(nasa, newdata = newdata)## fit lwr upr## 1 0.999604 0.4838505 0.9999999# The CI is large because there is no data around temp = -0.6 and# that makes the prediction more variable (and also because we only# have 23 observations)

For the challenger dataset, do the following:

– Regress fail.nozzle on temp and pres.nozzle.– Compute the predicted probability of fail.nozzle=1 for temp=

15 and pres.nozzle= 200. What is the predicted probabilityfor fail.nozzle=0?

– Compute the confidence interval for the two predicted probabili-ties at level 95%.

Finally, Figure 4.9 gives an interactive visualization of the CIs for the conditionalprobability in simple logistic regression. Their interpretation is very similar tothe CIs for the conditional mean in the simple linear model, see Section 2.6 andFigure 2.23.

4.7 Deviance and model fitThe deviance is a key concept in logistic regression. Intuitively, it measures thedeviance of the fitted logistic model with respect to a perfect modelfor ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘]. This perfect model, known as the saturatedmodel, denotes an abstract model that fits perfectly the sample, this is, the

Page 199: Lab notes for Statistics for Social Sciences II ...

4.7. DEVIANCE AND MODEL FIT 199

Figure 4.9: Illustration of the CIs for the conditional probability in the simplelogistic regression. Application also available here.

model such that

ℙ[𝑌 = 1|𝑋1 = 𝑋𝑖1, … , 𝑋𝑘 = 𝑋𝑖𝑘] = 𝑌𝑖, 𝑖 = 1, … , 𝑛.

This model assigns probability 0 or 1 to 𝑌 depending on the actual value of 𝑌𝑖.To clarify this concept, Figure 4.10 shows a saturated model and a fitted logisticregression.

More precisely, the deviance is defined as the difference of likelihoods betweenthe fitted model and the saturated model:

𝐷 = −2 log lik(��) + 2 log lik(saturated model).

Since the likelihood of the saturated model is exactly one4, then the devianceis simply another expression of the likelihood:

𝐷 = −2 log lik(��).

As a consequence, the deviance is always larger or equal than zero, beingzero only if the fit is perfect.

4The probability of the sample according to the saturated is 1 – replace 𝑝(X𝑖) = 𝑌𝑖 in(4.8).

Page 200: Lab notes for Statistics for Social Sciences II ...

200 CHAPTER 4. LOGISTIC REGRESSION

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

x

y

Fitted logistic modelA saturated modelThe null model

Figure 4.10: Fitted logistic regression versus a saturated model (several arepossible depending on the interpolation between points) and the null model.

A benchmark for evaluating the magnitude of the deviance is the null deviance,

𝐷0 = −2 log lik( 𝛽0),which is the deviance of the worst model, the one fitted without anypredictor, to the perfect model:

𝑌 |(𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘) ∼ Ber(logistic(𝛽0)).

In this case, 𝛽0 = logit( 𝑚𝑛 ) = log

𝑚𝑛

1− 𝑚𝑛

where 𝑚 is the number of 1’s in 𝑌1, … , 𝑌𝑛(see Figure 4.10).

The null deviance serves for comparing how much the model has improved byadding the predictors 𝑋1, … , 𝑋𝑘. This can be done by means of the 𝑅2 statis-tic, which is a generalization of the determination coefficient in multiple linearregression:

𝑅2 = 1 − 𝐷𝐷0

= 1 − deviance(fitted logistic, saturated model)deviance(null model, saturated model) . (4.14)

This global measure of fit shares some important properties with the determi-nation coefficient in linear regression:

1. It is a quantity between 0 and 1.2. If the fit is perfect, then 𝐷 = 0 and 𝑅2 = 1. If the predictors do not add

anything to the regression, then 𝐷 = 𝐷0 and 𝑅2 = 0.

Page 201: Lab notes for Statistics for Social Sciences II ...

4.7. DEVIANCE AND MODEL FIT 201

In logistic regression, 𝑅2 does not have the same interpretation as inlinear regression:

– Is not the percentage of variance explained by the logisticmodel, but rather a ratio indicating how close is the fit to beingperfect or the worst.

– It is not related to any correlation coefficient.

The 𝑅2 in (4.14) is valid for the whole family of generalized linearmodels, for which linear and logistic regression are particular cases.The connexion between (4.14) and the determination coefficient isgiven by the expressions of the deviance and null the deviance for thelinear model:

𝐷 = SSE (or 𝐷 = RSS) and 𝐷0 = SST.

Let’s see how these concepts are given by the summary function:# Summary of modelnasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)summaryLog <- summary(nasa)summaryLog # 'Residual deviance' is the deviance; 'Null deviance' is the null deviance#### Call:## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.0566 -0.7575 -0.3818 0.4571 2.2195#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) 7.5837 3.9146 1.937 0.0527 .## temp -0.4166 0.1940 -2.147 0.0318 *## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 28.267 on 22 degrees of freedom## Residual deviance: 20.335 on 21 degrees of freedom## AIC: 24.335##

Page 202: Lab notes for Statistics for Social Sciences II ...

202 CHAPTER 4. LOGISTIC REGRESSION

## Number of Fisher Scoring iterations: 5

# Null model - only interceptnull <- glm(fail.field ~ 1, family = "binomial", data = challenger)summaryNull <- summary(null)summaryNull#### Call:## glm(formula = fail.field ~ 1, family = "binomial", data = challenger)#### Deviance Residuals:## Min 1Q Median 3Q Max## -0.852 -0.852 -0.852 1.542 1.542#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) -0.8267 0.4532 -1.824 0.0681 .## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 28.267 on 22 degrees of freedom## Residual deviance: 28.267 on 22 degrees of freedom## AIC: 30.267#### Number of Fisher Scoring iterations: 4

# Computation of the R^2 with a function - useful for repetitive computationsr2Log <- function(model) {

summaryLog <- summary(model)1 - summaryLog$deviance / summaryLog$null.deviance

}

# R^2r2Log(nasa)## [1] 0.280619r2Log(null)## [1] -4.440892e-16

Another way of evaluating the model fit is its predictive accuracy. Themotivation is that most of the times we are interested simply in classifying, foran observation of the predictors, the value of 𝑌 as either 0 or 1, but not in

Page 203: Lab notes for Statistics for Social Sciences II ...

4.7. DEVIANCE AND MODEL FIT 203

predicting the value of 𝑝(𝑥1, … , 𝑥𝑘) = ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘]. Theclassification in prediction is simply done by the rule

𝑌 = { 1, 𝑝(𝑥1, … , 𝑥𝑘) > 12 ,

0, 𝑝(𝑥1, … , 𝑥𝑘) < 12 .

The overall predictive accuracy can be summarized with the hit matrix

Reality vs. classified 𝑌 = 0 𝑌 = 1𝑌 = 0 Correct0 Incorrect01𝑌 = 1 Incorrect10 Correct1

and with the hit ratio Correct0+Correct1𝑛 . The hit matrix is easily computed with

the table function. The function, whenever called with two vectors, computesthe cross-table between the two vectors.# Fitted probabilities for Y = 1nasa$fitted.values## 1 2 3 4 5 6 7## 0.42778935 0.23014393 0.26910358 0.32099837 0.37772880 0.15898364 0.12833090## 8 9 10 11 12 13 14## 0.23014393 0.85721594 0.60286639 0.23014393 0.04383877 0.37772880 0.93755439## 15 16 17 18 19 20 21## 0.37772880 0.08516844 0.23014393 0.02299887 0.07027765 0.03589053 0.08516844## 22 23## 0.07027765 0.82977495

# Classified Y's
yHat <- nasa$fitted.values > 0.5

# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
## yHat
## FALSE TRUE
## 0 16 0
## 1 3 4

# Hit ratio (ratio of correct classification)
(16 + 4) / 23 # Manually
## [1] 0.8695652
sum(diag(tab)) / sum(tab) # Automatically
## [1] 0.8695652


It is important to recall that the hit matrix will always be biased towards unrealistically good classification rates if it is computed on the same sample used for fitting the logistic model. A familiar analogy is asking your mother (the data) whether you (the model) are a good-looking human being (good predictive accuracy) – the answer will be highly positively biased. To get a fair hit matrix, the right approach is to randomly split the sample into two parts: a training dataset, used for fitting the model, and a test dataset, used for evaluating the predictive accuracy.
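To make this splitting idea concrete, here is a minimal sketch with the challenger data used above (purely illustrative: with only 23 observations the split is too small to be reliable, and the 70% proportion and the seed are arbitrary choices):

# Illustrative train/test split of the challenger data
set.seed(1234)
n <- nrow(challenger)
indTrain <- sample(1:n, size = round(0.7 * n)) # 70% of the rows for training
train <- challenger[indTrain, ]
test <- challenger[-indTrain, ]

# Fit the logistic model on the training dataset only
nasaTrain <- glm(fail.field ~ temp, family = "binomial", data = train)

# Classify the test dataset and compute its (fair) hit matrix and hit ratio
probTest <- predict(nasaTrain, newdata = test, type = "response")
tabTest <- table(factor(test$fail.field, levels = 0:1),
                 factor(probTest > 0.5, levels = c(FALSE, TRUE)))
tabTest
sum(diag(tabTest)) / sum(tabTest)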

4.8 Model selection and multicollinearity

The discussion in Section 3.7 applies to logistic regression with small changes:

1. The deviance of the model always decreases (and, correspondingly, the likelihood and the 𝑅2 always increase) with the inclusion of more predictors – no matter whether they are significant or not.

2. An excess of predictors in the model is paid for with a larger variability in the estimation of the model, which results in less precise predictions.

3. Multicollinearity may hide significant variables, change their signs, and increase the variability of the estimation.

Information criteria, stepwise, and vif allow us to efficiently fight back against these issues. Let's review them quickly from the perspective of logistic regression.

First, remember that the BIC/AIC information criteria are based on a balance between the model fitness, given by the likelihood, and its complexity. In logistic regression, the BIC is

$$
\mathrm{BIC}(\text{model}) = -2 \log \mathrm{lik}(\hat{\beta}) + (k + 1) \times \log n = D + (k + 1) \times \log n,
$$

where lik(𝛽̂) is the likelihood of the model. The AIC replaces log 𝑛 by 2, hence penalizing model complexity less. The BIC and AIC can be computed in R through the functions BIC and AIC, and we can check manually that they match their definitions.

# Models
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)

# nasa
summary1 <- summary(nasa1)
summary1
##


## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
BIC(nasa1)
## [1] 26.60584
summary1$deviance + 2 * log(dim(challenger)[1])
## [1] 26.60584
AIC(nasa1)
## [1] 24.33485
summary1$deviance + 2 * 2
## [1] 24.33485

# nasa2
summary2 <- summary(nasa2)
summary2
##
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial",
## data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2109 -0.6081 -0.4292 0.3498 2.0913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.642709 4.038547 1.645 0.1000


## temp -0.435032 0.197008 -2.208 0.0272 *
## pres.field 0.009376 0.008821 1.063 0.2878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.078 on 20 degrees of freedom
## AIC: 25.078
##
## Number of Fisher Scoring iterations: 5
BIC(nasa2)
## [1] 28.48469
summary2$deviance + 3 * log(dim(challenger)[1])
## [1] 28.48469
AIC(nasa2)
## [1] 25.07821
summary2$deviance + 3 * 2
## [1] 25.07821

Second, stepwise works analogously to the linear regression situation. Here is an illustration for a binary variable that measures whether a Boston suburb (Boston dataset) is wealthy or not. The binary variable is medv > 25: it is TRUE (1) for suburbs with a median house value larger than $25000 and FALSE (0) otherwise. The cutoff $25000 corresponds to the 25% richest suburbs.

# Boston dataset
data(Boston)

# Model whether a suburb has a median house value larger than 25000$
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626


## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
r2Log(mod)
## [1] 0.628923

# With BIC - ends up with only the significant variables and a similar R^2
modBIC <- stepwise(mod, trace = 0)
##
## Direction: backward/forward
## Criterion: BIC
summary(modBIC)
##
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio +
## lstat, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3077 -0.2970 -0.0947 -0.0005 3.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.556433 3.948818 0.394 0.693469
## indus -0.143236 0.054771 -2.615 0.008918 **
## rm 1.950496 0.441794 4.415 1.01e-05 ***
## dis -0.426830 0.111572 -3.826 0.000130 ***


## rad 0.301060 0.076542 3.933 8.38e-05 ***
## tax -0.010240 0.003631 -2.820 0.004800 **
## ptratio -0.404964 0.112086 -3.613 0.000303 ***
## lstat -0.384823 0.069121 -5.567 2.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 215.03 on 498 degrees of freedom
## AIC: 231.03
##
## Number of Fisher Scoring iterations: 7
r2Log(modBIC)
## [1] 0.6184273

Finally, multicollinearity can also be present in logistic regression. Despite the nonlinear logistic curve, the predictors are combined linearly in (4.4). Due to this, if two or more predictors are highly correlated, the fit of the model will be compromised, since the individual linear effect of each predictor is hard to disentangle from that of the remaining correlated predictors.

In addition to inspecting the correlation matrix and looking for high correlations, a powerful tool to detect multicollinearity is the VIF of each coefficient 𝛽𝑗. The situation is exactly the same as in linear regression, since the VIF only looks into the linear relations of the predictors. Therefore, the rule of thumb is the same as in Section 3.8:

• VIF close to 1: absence of multicollinearity.
• VIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest VIF.

Here is an example illustrating the use of the VIF, through vif, in practice. It also shows how the simple inspection of the correlation matrix is not enough for detecting collinearity in tricky situations.

# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1/(1 + exp(-z)))


data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicious
cor(data)
## x1 x2 x3 x4 y
## x1 1.0000000 0.38254782 0.2142011 -0.5261464 0.20198825
## x2 0.3825478 1.00000000 0.5167341 0.5673174 0.07456324
## x3 0.2142011 0.51673408 1.0000000 0.2500123 -0.49853746
## x4 -0.5261464 0.56731738 0.2500123 1.0000000 -0.11188657
## y 0.2019882 0.07456324 -0.4985375 -0.1118866 1.00000000

# Abnormal generalized variance inflation factors: largest for x4, we remove it
modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
vif(modMultiCo)
## x1 x2 x3 x4
## 27.84756 36.66514 4.94499 36.78817

# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")

# Comparison
summary(modMultiCo)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4743 -0.3796 0.1129 0.4052 2.3887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2527 0.4008 3.125 0.00178 **
## x1 -3.4269 1.8225 -1.880 0.06007 .
## x2 6.9627 2.1937 3.174 0.00150 **
## x3 -4.3688 0.9312 -4.691 2.71e-06 ***
## x4 -5.0047 1.9440 -2.574 0.01004 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.81 on 99 degrees of freedom
## Residual deviance: 59.76 on 95 degrees of freedom
## AIC: 69.76


##
## Number of Fisher Scoring iterations: 7
summary(modClean)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0952 -0.4144 0.1839 0.4762 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9237 0.3221 2.868 0.004133 **
## x1 1.2803 0.4235 3.023 0.002502 **
## x2 1.7946 0.5290 3.392 0.000693 ***
## x3 -3.4838 0.7491 -4.651 3.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.813 on 99 degrees of freedom
## Residual deviance: 68.028 on 96 degrees of freedom
## AIC: 76.028
##
## Number of Fisher Scoring iterations: 6
r2Log(modMultiCo)
## [1] 0.5500437
r2Log(modClean)
## [1] 0.4877884

# Generalized variance inflation factors normal
vif(modClean)
## x1 x2 x3
## 1.674300 2.724351 3.743940

For the Boston dataset, do the following:

1. Compute the hit matrix and hit ratio for the regression I(medv > 25) ~ . (hint: do table(medv > 25, ...)).

2. Fit I(medv > 25) ~ . but now using only the first 300 observations of Boston, the training dataset (hint: use subset).

3. For the previous model, predict the probability of the responses


and classify them into 0 or 1 for the last 206 observations, the testing dataset (hint: use predict on that subset).

4. Compute the hit matrix and hit ratio for the new predictions. Check that the hit ratio is smaller than the one in the first point. The hit ratio on the testing dataset, and not the first hit ratio, is an estimator of how well the model is going to classify future observations.


Chapter 5

Principal component analysis

Principal Component Analysis (PCA) is a powerful multivariate technique designed to summarize the most important features and relations of 𝑘 numerical random variables 𝑋1, … , 𝑋𝑘. PCA does dimension reduction of the original dataset by computing a new set of variables, the principal components PC1, … , PC𝑘, which contain the same information as 𝑋1, … , 𝑋𝑘 but in an ordered way: PC1 explains the most information and PC𝑘 the least.

In PCA there is no response 𝑌 or any other variable that deserves particular attention – all variables are treated equally.

5.1 Examples and applications

5.1.1 Case study: Employment in European countries in the late 70s

The purpose of this case study, motivated by Hand et al. (1994) and Bartholomew et al. (2008), is to reveal the structure of the job market and economy in different developed countries. The final aim is to obtain a meaningful and rigorous plot that is able to show the most important features of the countries in a concise form.

The dataset eurojob (download) contains the data employed in this case study. It contains the percentage of the workforce employed in 1979 in 9 industries for 26 European countries. The industries measured are:

• Agriculture (Agr)
• Mining (Min)
• Manufacturing (Man)


• Power supply industries (Pow)
• Construction (Con)
• Service industries (Ser)
• Finance (Fin)
• Social and personal services (Soc)
• Transport and communications (Tra)

If the dataset is imported into R and the case names are set as Country (important in order to have only numerical variables), then the data should look like this:

Table 5.1: The eurojob dataset.

Country      Agr   Min  Man   Pow  Con   Ser   Fin   Soc   Tra
Belgium      3.3   0.9  27.6  0.9  8.2   19.1  6.2   26.6  7.2
Denmark      9.2   0.1  21.8  0.6  8.3   14.6  6.5   32.2  7.1
France       10.8  0.8  27.5  0.9  8.9   16.8  6.0   22.6  5.7
WGerm        6.7   1.3  35.8  0.9  7.3   14.4  5.0   22.3  6.1
Ireland      23.2  1.0  20.7  1.3  7.5   16.8  2.8   20.8  6.1
Italy        15.9  0.6  27.6  0.5  10.0  18.1  1.6   20.1  5.7
Luxem        7.7   3.1  30.8  0.8  9.2   18.5  4.6   19.2  6.2
Nether       6.3   0.1  22.5  1.0  9.9   18.0  6.8   28.5  6.8
UK           2.7   1.4  30.2  1.4  6.9   16.9  5.7   28.3  6.4
Austria      12.7  1.1  30.2  1.4  9.0   16.8  4.9   16.8  7.0
Finland      13.0  0.4  25.9  1.3  7.4   14.7  5.5   24.3  7.6
Greece       41.4  0.6  17.6  0.6  8.1   11.5  2.4   11.0  6.7
Norway       9.0   0.5  22.4  0.8  8.6   16.9  4.7   27.6  9.4
Portugal     27.8  0.3  24.5  0.6  8.4   13.3  2.7   16.7  5.7
Spain        22.9  0.8  28.5  0.7  11.5  9.7   8.5   11.8  5.5
Sweden       6.1   0.4  25.9  0.8  7.2   14.4  6.0   32.4  6.8
Switz        7.7   0.2  37.8  0.8  9.5   17.5  5.3   15.4  5.7
Turkey       66.8  0.7  7.9   0.1  2.8   5.2   1.1   11.9  3.2
Bulgaria     23.6  1.9  32.3  0.6  7.9   8.0   0.7   18.2  6.7
Czech        16.5  2.9  35.5  1.2  8.7   9.2   0.9   17.9  7.0
EGerm        4.2   2.9  41.2  1.3  7.6   11.2  1.2   22.1  8.4
Hungary      21.7  3.1  29.6  1.9  8.2   9.4   0.9   17.2  8.0
Poland       31.1  2.5  25.7  0.9  8.4   7.5   0.9   16.1  6.9
Romania      34.7  2.1  30.1  0.6  8.7   5.9   1.3   11.7  5.0
USSR         23.7  1.4  25.8  0.6  9.2   6.1   0.5   23.6  9.3
Yugoslavia   48.7  1.5  16.8  1.1  4.9   6.4   11.3  5.3   4.0

So far, we know how to compute summaries for each variable, and how to quantify and visualize relations between variables with the correlation matrix


and the scatterplot matrix. But even for a moderate number of variables like this, their results are hard to process.

# Summary of the data - marginal
summary(eurojob)
## Agr Min Man Pow
## Min. : 2.70 Min. :0.100 Min. : 7.90 Min. :0.1000
## 1st Qu.: 7.70 1st Qu.:0.525 1st Qu.:23.00 1st Qu.:0.6000
## Median :14.45 Median :0.950 Median :27.55 Median :0.8500
## Mean :19.13 Mean :1.254 Mean :27.01 Mean :0.9077
## 3rd Qu.:23.68 3rd Qu.:1.800 3rd Qu.:30.20 3rd Qu.:1.1750
## Max. :66.80 Max. :3.100 Max. :41.20 Max. :1.9000
## Con Ser Fin Soc
## Min. : 2.800 Min. : 5.20 Min. : 0.500 Min. : 5.30
## 1st Qu.: 7.525 1st Qu.: 9.25 1st Qu.: 1.225 1st Qu.:16.25
## Median : 8.350 Median :14.40 Median : 4.650 Median :19.65
## Mean : 8.165 Mean :12.96 Mean : 4.000 Mean :20.02
## 3rd Qu.: 8.975 3rd Qu.:16.88 3rd Qu.: 5.925 3rd Qu.:24.12
## Max. :11.500 Max. :19.10 Max. :11.300 Max. :32.40
## Tra
## Min. :3.200
## 1st Qu.:5.700
## Median :6.700
## Mean :6.546
## 3rd Qu.:7.075
## Max. :9.400

# Correlation matrix
cor(eurojob)
## Agr Min Man Pow Con Ser
## Agr 1.00000000 0.03579884 -0.6710976 -0.40005113 -0.53832522 -0.7369805
## Min 0.03579884 1.00000000 0.4451960 0.40545524 -0.02559781 -0.3965646
## Man -0.67109759 0.44519601 1.0000000 0.38534593 0.49447949 0.2038263
## Pow -0.40005113 0.40545524 0.3853459 1.00000000 0.05988883 0.2019066
## Con -0.53832522 -0.02559781 0.4944795 0.05988883 1.00000000 0.3560216
## Ser -0.73698054 -0.39656456 0.2038263 0.20190661 0.35602160 1.0000000
## Fin -0.21983645 -0.44268311 -0.1558288 0.10986158 0.01628255 0.3655553
## Soc -0.74679001 -0.28101212 0.1541714 0.13241132 0.15824309 0.5721728
## Tra -0.56492047 0.15662892 0.3506925 0.37523116 0.38766214 0.1875543
## Fin Soc Tra
## Agr -0.21983645 -0.7467900 -0.5649205
## Min -0.44268311 -0.2810121 0.1566289
## Man -0.15582884 0.1541714 0.3506925
## Pow 0.10986158 0.1324113 0.3752312
## Con 0.01628255 0.1582431 0.3876621
## Ser 0.36555529 0.5721728 0.1875543


## Fin 1.00000000 0.1076403 -0.2459257
## Soc 0.10764028 1.0000000 0.5678669
## Tra -0.24592567 0.5678669 1.0000000

# Scatterplot matrix
scatterplotMatrix(eurojob, reg.line = lm, smooth = FALSE, spread = FALSE,
                  span = 0.5, ellipse = FALSE, levels = c(.5, .9), id.n = 0,
                  diagonal = 'histogram')

[Figure: scatterplot matrix of the eurojob variables.]

We definitely need a way of visualizing and quantifying the relations between variables for a moderate to large number of variables. PCA will be a handy way of doing so. In a nutshell, what PCA does is:

1. Takes the data for the variables 𝑋1, … , 𝑋𝑘.
2. Using this data, looks for new variables PC1, … , PC𝑘 such that:

• PC𝑗 is a linear combination of 𝑋1, … , 𝑋𝑘, 1 ≤ 𝑗 ≤ 𝑘. That is, PC𝑗 = 𝑎1𝑗𝑋1 + 𝑎2𝑗𝑋2 + ⋯ + 𝑎𝑘𝑗𝑋𝑘.

• PC1, … , PC𝑘 are sorted decreasingly in terms of variance. Hence PC𝑗 has more variance than PC𝑗+1, 1 ≤ 𝑗 ≤ 𝑘 − 1.

• PC𝑗1 and PC𝑗2 are uncorrelated, for 𝑗1 ≠ 𝑗2.
• PC1, … , PC𝑘 have the same information, measured in terms of total variance, as 𝑋1, … , 𝑋𝑘.

3. Produces three key objects:

• Variances of the PCs. They are sorted decreasingly and give an idea of which PCs contain most of the information of the data (the ones with more variance).
• Weights of the variables in the PCs. They give the interpretation of the PCs in terms of the original variables, as they are the coefficients of the linear combination. The weights of the variables 𝑋1, … , 𝑋𝑘 on PC𝑗, namely 𝑎1𝑗, … , 𝑎𝑘𝑗, are normalized: \(a_{1j}^2 + \cdots + a_{kj}^2 = 1\), 𝑗 = 1, … , 𝑘. In R, they are called loadings.
• Scores of the data in the PCs: this is the data with the variables PC1, … , PC𝑘 instead of 𝑋1, … , 𝑋𝑘. The scores are uncorrelated. They are useful for knowing which PCs have more effect on a certain observation.

Hence, PCA rearranges our variables in an information-equivalent, but more convenient, layout where the variables are sorted according to the amount of information they are able to explain. From this position, the next step is clear: stick only with a limited number of PCs such that they explain most of the information (e.g., 70% of the total variance) and do dimension reduction. The effectiveness of PCA in practice depends on the structure present in the dataset. For example, in the case of highly dependent data, it could explain more than 90% of the variability of a dataset with tens of variables with just two PCs.

Let’s see how to compute a full PCA in R.

# The main function - use cor = TRUE to avoid scale distortions
pca <- princomp(eurojob, cor = TRUE)

# What is inside?
str(pca)
## List of 7
## $ sdev : Named num [1:9] 1.867 1.46 1.048 0.997 0.737 ...
## ..- attr(*, "names")= chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ loadings: 'loadings' num [1:9, 1:9] 0.52379 0.00132 -0.3475 -0.25572 -0.32518 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "Agr" "Min" "Man" "Pow" ...
## .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ center : Named num [1:9] 19.131 1.254 27.008 0.908 8.165 ...
## ..- attr(*, "names")= chr [1:9] "Agr" "Min" "Man" "Pow" ...
## $ scale : Named num [1:9] 15.245 0.951 6.872 0.369 1.614 ...
## ..- attr(*, "names")= chr [1:9] "Agr" "Min" "Man" "Pow" ...
## $ n.obs : int 26
## $ scores : num [1:26, 1:9] -1.71 -0.953 -0.755 -0.853 0.104 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:26] "Belgium" "Denmark" "France" "WGerm" ...
## .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
## $ call : language princomp(x = eurojob, cor = TRUE)
## - attr(*, "class")= chr "princomp"

# The standard deviation of each PC


pca$sdev
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## 1.867391569 1.459511268 1.048311791 0.997237674 0.737033056 0.619215363
## Comp.7 Comp.8 Comp.9
## 0.475135828 0.369851221 0.006754636

# Weights: the expression of the original variables in the PCs
# E.g. Agr = 0.524 * PC1 + 0.213 * PC5 + 0.153 * PC6 + 0.806 * PC9
# And also: PC1 = 0.524 * Agr - 0.347 * Man - 0.256 * Pow - 0.325 * Con + ...
# (Because the matrix is orthogonal, so the transpose is the inverse)
pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Agr 0.524 0.213 0.153 0.806
## Min 0.618 -0.201 -0.164 -0.101 -0.726
## Man -0.347 0.355 -0.150 -0.346 -0.385 -0.288 0.479 0.126 0.366
## Pow -0.256 0.261 -0.561 0.393 0.295 0.357 0.256 -0.341
## Con -0.325 0.153 -0.668 0.472 0.130 -0.221 -0.356
## Ser -0.379 -0.350 -0.115 -0.284 0.615 -0.229 0.388 0.238
## Fin -0.454 -0.587 0.280 -0.526 -0.187 0.174 0.145
## Soc -0.387 -0.222 0.312 0.412 -0.220 -0.263 -0.191 -0.506 0.351
## Tra -0.367 0.203 0.375 0.314 0.513 -0.124 0.545
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## Cumulative Var 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
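We can also verify numerically the orthogonality claim made in the comments above (a quick sketch using the pca object just created):

# The weights matrix is orthogonal: its crossproduct is (numerically) the identity,
# which is why the variables can be recovered from the PCs by transposing the weights
L <- unclass(pca$loadings) # plain 9 x 9 matrix of weights
round(t(L) %*% L, 2)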

# Scores of the data on the PCs: how is the data re-expressed into PCs
head(pca$scores, 10)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Belgium -1.7104977 -1.22179120 -0.11476476 0.33949201 -0.32453569 -0.04725409
## Denmark -0.9529022 -2.12778495 0.95072216 0.59394893 0.10266111 -0.82730228
## France -0.7546295 -1.12120754 -0.49795370 -0.50032910 -0.29971876 0.11580705
## WGerm -0.8525525 -0.01137659 -0.57952679 -0.11046984 -1.16522683 -0.61809939
## Ireland 0.1035018 -0.41398717 -0.38404787 0.92666396 0.01522133 1.42419990
## Italy -0.3754065 -0.76954739 1.06059786 -1.47723127 -0.64518265 1.00210439
## Luxem -1.0594424 0.75582714 -0.65147987 -0.83515611 -0.86593673 0.21879618
## Nether -1.6882170 -2.00484484 0.06374194 -0.02351427 0.63517966 0.21197502
## UK -1.6304491 -0.37312967 -1.14090318 1.26687863 -0.81292541 -0.03605094
## Austria -1.1764484 0.14310057 -1.04336386 -0.15774745 0.52098078 0.80190706
## Comp.7 Comp.8 Comp.9
## Belgium -0.34008766 0.4030352 -0.0010904043
## Denmark -0.30292281 -0.3518357 0.0156187715


## France -0.18547802 -0.2661924 -0.0005074307
## WGerm 0.44455923 0.1944841 -0.0065393717
## Ireland -0.03704285 -0.3340389 0.0108793301
## Italy -0.14178212 -0.1302796 0.0056017552
## Luxem -1.69417817 0.5473283 0.0034530991
## Nether -0.30339781 -0.5906297 -0.0109314745
## UK 0.04128463 -0.3485948 -0.0054775709
## Austria 0.41503736 0.2150993 -0.0028164222

# Scatterplot matrix of the scores - they are uncorrelated!
scatterplotMatrix(pca$scores, reg.line = lm, smooth = FALSE, spread = FALSE,
                  span = 0.5, ellipse = FALSE, levels = c(.5, .9), id.n = 0,
                  diagonal = 'histogram')

[Figure: scatterplot matrix of the PC scores (Comp.1 to Comp.9).]

# Means of the variables - before PCA the variables are centered
pca$center
## Agr Min Man Pow Con Ser Fin
## 19.1307692 1.2538462 27.0076923 0.9076923 8.1653846 12.9576923 4.0000000
## Soc Tra
## 20.0230769 6.5461538

# Rescaling done to each variable
# - if cor = FALSE (default), a vector of ones
# - if cor = TRUE, a vector with the standard deviations of the variables


pca$scale
## Agr Min Man Pow Con Ser Fin
## 15.2446654 0.9512060 6.8716767 0.3689101 1.6136300 4.4864045 2.7520622
## Soc Tra
## 6.6969171 1.3644471

# Summary of the importance of components - the third row is key
summary(pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8673916 1.4595113 1.0483118 0.9972377 0.73703306
## Proportion of Variance 0.3874613 0.2366859 0.1221064 0.1104981 0.06035753
## Cumulative Proportion 0.3874613 0.6241472 0.7462536 0.8567517 0.91710919
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.61921536 0.47513583 0.36985122 6.754636e-03
## Proportion of Variance 0.04260307 0.02508378 0.01519888 5.069456e-06
## Cumulative Proportion 0.95971227 0.98479605 0.99999493 1.000000e+00

# Scree plot - the variance of each component
plot(pca)

[Figure: screeplot of the variances of the PCs.]

# With connected lines - useful for looking for the "elbow"
plot(pca, type = "l")


[Figure: screeplot of the variances of the PCs, with connected lines.]

# PC1 and PC2
pca$loadings[, 1:2]
## Comp.1 Comp.2
## Agr 0.523790989 0.05359389
## Min 0.001323458 0.61780714
## Man -0.347495131 0.35505360
## Pow -0.255716182 0.26109606
## Con -0.325179319 0.05128845
## Ser -0.378919663 -0.35017206
## Fin -0.074373583 -0.45369785
## Soc -0.387408806 -0.22152120
## Tra -0.366822713 0.20259185

PCA produces uncorrelated variables from the original set 𝑋1, … , 𝑋𝑘. This implies that:

– The PCs are uncorrelated, but not independent (uncorrelated does not imply independent).

– A variable in 𝑋1, … , 𝑋𝑘 that is uncorrelated with the rest will get a PC associated only to it. In the extreme case where all of 𝑋1, … , 𝑋𝑘 are uncorrelated, they coincide with the PCs (up to sign flips).
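A minimal simulated sketch of this last point (three independent variables with clearly different variances, so that the PCs are well separated; the names and figures are illustrative):

# Three independent variables with different variances
set.seed(123)
simData <- data.frame(a = rnorm(200, sd = 3), b = rnorm(200, sd = 2), c = rnorm(200, sd = 1))

# PCA on the covariance scale: each PC is essentially one of the original
# variables (weights close to a signed identity matrix), sorted by variance
pcaSim <- princomp(simData)
round(unclass(pcaSim$loadings), 2)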

Based on the weights of the variables on the PCs, we can extract the following interpretation:


• PC1 is roughly a linear combination of Agr, with positive weight, and (Man, Pow, Con, Ser, Soc, Tra), with negative weights. So it can be interpreted as an indicator of the kind of economy of a country: agricultural (positive values) or industrial (negative values).

• PC2 has positive weights on (Min, Man, Pow, Tra) and negative weights on (Ser, Fin, Soc). It can be interpreted as a contrast between the heavy-industry sectors and the service sectors, so it tends to be positive in the (then) communist countries and negative in the capitalist countries.

The interpretation of the PCs involves inspecting the weights and interpreting the linear combination of the original variables, which might be separating two clear characteristics of the data.

To conclude, let's see how we can represent our original data in a plot called a biplot, which summarizes all the analysis for two PCs.

# Biplot - plot together the scores for PC1 and PC2 and the
# variables expressed in terms of PC1 and PC2
biplot(pca)

[Figure: biplot of PC1 and PC2 for the eurojob dataset, showing the country scores and the variable arrows.]


5.1.2 Case studies: Analysis of USArrests, USJudgeRatings and La Liga 2015/2016 metrics

The selection of the number of PCs and their interpretation through the weights and biplots are key aspects in a successful application of PCA. In this section we will see examples of both points through the datasets USArrests, USJudgeRatings and La Liga 2015/2016 (download).

The selection of the number of components 𝑙, 1 ≤ 𝑙 ≤ 𝑘,¹ is a tricky problem without a unique and well-established criterion for what the best number of components is. The reason is that selecting the number of PCs is a trade-off between the variance of the original data that we want to explain and the price we want to pay in terms of a more complex dataset. Obviously, except for particular cases,² none of the extreme situations 𝑙 = 1 (potentially low explained variance) or 𝑙 = 𝑘 (same number of PCs as original variables – no dimension reduction) is desirable.

There are several heuristic rules in order to determine the number of components:

1. Select 𝑙 up to a threshold of the percentage of variance explained, such as 70% or 80%. We do so by looking into the third row of the summary(...) of a PCA (see the sketch after this list).

2. Plot the variances of the PCs and look for an "elbow" in the graph whose location gives 𝑙. Ideally, this elbow appears at the PC for which the next PC variances are almost similar and notably smaller when compared with the first ones. Use plot(..., type = "l") for creating the plot.

3. Select 𝑙 based on a threshold for the individual variance of each component. For example, select only the PCs with a variance larger than the mean of the variances of all the PCs. If we are working with standardized variables (cor = TRUE), this amounts to taking the PCs with standard deviation larger than one. We do so by looking into the first row of the summary(...) of a PCA.

In addition to these three heuristics, in practice we might apply a justified bias towards:

4. 𝑙 = 1, 2, since these are the ones that allow a simple graphical representation of the data. Even if the variability explained by the 𝑙 PCs is low (lower than 50%), these graphical representations are usually insightful. 𝑙 = 3 is only a second option, since its graphical representation is more cumbersome (see the end of this section).

5. 𝑙's such that they yield interpretable PCs. Interpreting PCs is not as straightforward as interpreting the original variables. Furthermore, it becomes more difficult the larger the index of the PC is, since it explains less information of the data.

¹ We are implicitly assuming that 𝑛 > 𝑘. Otherwise, the maximum number of PCs would be min(𝑛 − 1, 𝑘).

² For example, if PC1 explains all the variance of 𝑋1, … , 𝑋𝑘, or if the variables are uncorrelated, in which case the PCs will be equal to the original variables.
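The first and third heuristics can be computed directly from a princomp object. The following helper is a minimal sketch (the function name choosePCs and the 70% threshold are illustrative choices, not part of the standard output):

# Number of PCs suggested by the cumulative-variance rule and by the sdev > 1 rule
choosePCs <- function(pca, threshold = 0.7) {
  propVar <- pca$sdev^2 / sum(pca$sdev^2) # proportion of variance of each PC
  cumVar <- cumsum(propVar)               # third row of summary(pca)
  list(byCumulativeVariance = which(cumVar >= threshold)[1], # heuristic 1
       bySdevLargerThanOne = sum(pca$sdev > 1))              # heuristic 3 (cor = TRUE)
}

# For the eurojob PCA of the previous section, both rules suggest 3 PCs
choosePCs(pca)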

Let's see these heuristics in practice with the USArrests dataset (arrest statistics and urban population of US states).

# Load data
data(USArrests)

# Snapshot of the data
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7

# PCA
pcaUSArrests <- princomp(USArrests, cor = TRUE)
summary(pcaUSArrests)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

# Plot of variances (screeplot)
plot(pcaUSArrests, type = "l")


[Figure: screeplot of pcaUSArrests.]

The selections of 𝑙 for this PCA, based on the previous heuristics, are:

1. 𝑙 = 2, since it explains 86% of the variance and 𝑙 = 1 only 62%.
2. 𝑙 = 2, since from 𝑙 = 2 onward the variances are very similar.
3. 𝑙 = 1, since PC2 has standard deviation smaller than 1 (a limit case).
4. 𝑙 = 2 is fine, as it can be easily represented graphically.
5. 𝑙 = 2 is fine, as both components are interpretable, as we will see later.

Therefore, we can conclude that 𝑙 = 2 PCs is a good compromise for representing the USArrests dataset.

Let’s see what happens for the USJudgeRatings dataset (lawyers’ ratings of US Superior Court judges).

# Load data
data(USJudgeRatings)

# Snapshot of the data
head(USJudgeRatings)
## CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
## AARONSON,L.H. 5.7 7.9 7.7 7.3 7.1 7.4 7.1 7.1 7.1 7.0 8.3 7.8
## ALEXANDER,J.M. 6.8 8.9 8.8 8.5 7.8 8.1 8.0 8.0 7.8 7.9 8.5 8.7
## ARMENTANO,A.J. 7.2 8.1 7.8 7.8 7.5 7.6 7.5 7.5 7.3 7.4 7.9 7.8
## BERDON,R.I. 6.8 8.8 8.5 8.8 8.3 8.5 8.7 8.7 8.4 8.5 8.8 8.7
## BRACKEN,J.J. 7.3 6.4 4.3 6.5 6.0 6.2 5.7 5.7 5.1 5.3 5.5 4.8
## BURNS,E.B. 6.2 8.8 8.7 8.5 7.9 8.0 8.1 8.0 8.0 8.0 8.6 8.6

# PCA
pcaUSJudgeRatings <- princomp(USJudgeRatings, cor = TRUE)


summary(pcaUSJudgeRatings)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.1833165 1.05078398 0.5769763 0.50383231 0.290607615
## Proportion of Variance 0.8444586 0.09201225 0.0277418 0.02115392 0.007037732
## Cumulative Proportion 0.8444586 0.93647089 0.9642127 0.98536661 0.992404341
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.193095982 0.140295449 0.124158319 0.0885069038
## Proportion of Variance 0.003107172 0.001640234 0.001284607 0.0006527893
## Cumulative Proportion 0.995511513 0.997151747 0.998436354 0.9990891437
## Comp.10 Comp.11 Comp.12
## Standard deviation 0.0749114592 0.0570804224 0.0453913429
## Proportion of Variance 0.0004676439 0.0002715146 0.0001716978
## Cumulative Proportion 0.9995567876 0.9998283022 1.0000000000

# Plot of variances (screeplot)
plot(pcaUSJudgeRatings, type = "l")

[Figure: screeplot of pcaUSJudgeRatings.]

The selections of 𝑙 for this PCA, based on the previous heuristics, are:

1. 𝑙 = 1, since it alone explains 84% of the variance.
2. 𝑙 = 1, since from 𝑙 = 1 onward the variances are very similar compared to the first one.
3. 𝑙 = 2, since PC3 has standard deviation smaller than 1.
4. 𝑙 = 1, 2 are fine, as they can be easily represented graphically.
5. 𝑙 = 1, 2 are fine, as both components are interpretable, as we will see later.


Based on the previous criteria, we can conclude that 𝑙 = 1 PC is a reasonable compromise for representing the USJudgeRatings dataset.

We now analyze a slightly more complicated dataset. It contains the standings and team statistics for La Liga 2015/2016:

Table 5.2: Selection of variables for La Liga 2015/2016 dataset.

                 Points  Matches  Wins  Draws  Loses
Barcelona        91      38       29    4      5
Real Madrid      90      38       28    6      4
Atlético Madrid  88      38       28    4      6
Villarreal       64      38       18    10     10
Athletic         62      38       18    8      12
Celta            60      38       17    9      12
Sevilla          52      38       14    10     14
Málaga           48      38       12    12     14
Real Sociedad    48      38       13    9      16
Betis            45      38       11    12     15
Las Palmas       44      38       12    8      18
Valencia         44      38       11    11     16
Eibar            43      38       11    10     17
Espanyol         43      38       12    7      19
Deportivo        42      38       8     18     12
Granada          39      38       10    9      19
Sporting Gijón   39      38       10    9      19
Rayo Vallecano   38      38       9     11     18
Getafe           36      38       9     9      20
Levante          32      38       8     8      22

# PCA - we remove the second variable, matches played, since it is constant
pcaLaliga <- princomp(laliga[, -2], cor = TRUE)
summary(pcaLaliga)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.4372781 1.5514051 1.14023547 0.91474383 0.85799980
## Proportion of Variance 0.6563823 0.1337143 0.07222983 0.04648646 0.04089798
## Cumulative Proportion 0.6563823 0.7900966 0.86232642 0.90881288 0.94971086
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.60295746 0.4578613 0.373829925 0.327242606 0.22735805
## Proportion of Variance 0.02019765 0.0116465 0.007763823 0.005949318 0.00287176
## Cumulative Proportion 0.96990851 0.9815550 0.989318830 0.995268148 0.99813991
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.1289704085 0.0991188181 0.0837498223 2.860411e-03


## Proportion of Variance 0.0009240759 0.0005458078 0.0003896685 4.545528e-07
## Cumulative Proportion 0.9990639840 0.9996097918 0.9999994603 9.999999e-01
## Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation 1.238298e-03 1.583337e-08 1.154388e-08 0
## Proportion of Variance 8.518782e-08 1.392753e-17 7.403399e-18 0
## Cumulative Proportion 1.000000e+00 1.000000e+00 1.000000e+00 1

# Plot of variances (screeplot)
plot(pcaLaliga, type = "l")

[Figure: screeplot of pcaLaliga.]

The selections of 𝑙 for this PCA, based on the previous heuristics, are:

1. 𝑙 = 2, 3, since they explain 79% and 86% of the variance, respectively (it depends on the threshold for the variance, 70% or 80%).
2. 𝑙 = 3, since from 𝑙 = 3 onward the variances are very similar compared to the first ones.
3. 𝑙 = 3, since PC4 has standard deviation smaller than 1.
4. 𝑙 = 2 is preferred to 𝑙 = 3.
5. 𝑙 = 1, 2 are fine, as both components are interpretable, as we will see later. 𝑙 = 3 is harder to interpret.

Based on the previous criteria, we can conclude that 𝑙 = 2 PCs is a reasonable compromise for representing the La Liga 2015/2016 dataset.

Let’s focus now on the interpretation of the PCs. In addition to the weights present in the loadings slot, biplot provides a powerful and succinct way of displaying the relevant information for 1 ≤ 𝑙 ≤ 2. The biplot shows:


1. The scores of the data on PC1 and PC2, shown as points (with optional text labels, depending on whether there are case names). This is the representation of the data in the first two PCs.

2. The variables represented in PC1 and PC2 by arrows. These arrows are centered at (0, 0).

Let’s examine the arrow associated to the variable 𝑋𝑗. 𝑋𝑗 is expressed in terms of PC1 and PC2 by the weights 𝑎𝑗1 and 𝑎𝑗2:

𝑋𝑗 = 𝑎𝑗1PC1 + 𝑎𝑗2PC2 + ⋯ + 𝑎𝑗𝑘PC𝑘 ≈ 𝑎𝑗1PC1 + 𝑎𝑗2PC2.

𝑎𝑗1 and 𝑎𝑗2 have the same sign as Cor(𝑋𝑗, PC1) and Cor(𝑋𝑗, PC2), respectively. The arrow associated to 𝑋𝑗 is given by the segment joining (0, 0) and (𝑎𝑗1, 𝑎𝑗2). Therefore:

• If the arrow points right (𝑎𝑗1 > 0), there is positive correlation between 𝑋𝑗 and PC1. Analogously if the arrow points left.
• If the arrow is approximately vertical (𝑎𝑗1 ≈ 0), then 𝑋𝑗 and PC1 are approximately uncorrelated.

Analogously:

• If the arrow points up (𝑎𝑗2 > 0), there is positive correlation between 𝑋𝑗 and PC2. Analogously if the arrow points down.
• If the arrow is approximately horizontal (𝑎𝑗2 ≈ 0), then 𝑋𝑗 and PC2 are approximately uncorrelated.

In addition, the magnitude of the arrow informs about the correlation.

The biplot also provides the direct relation between variables, in view of their expressions in PC1 and PC2. The angle between the arrows of the variables 𝑋𝑗 and 𝑋𝑚 gives an approximation to the correlation between them, Cor(𝑋𝑗, 𝑋𝑚):

• If angle ≈ 0∘, the two variables are highly positively correlated.
• If angle ≈ 90∘, they are approximately uncorrelated.
• If angle ≈ 180∘, the two variables are highly negatively correlated.

The approximation of the correlation by means of the arrow angles is only as good as the percentage of variance explained by PC1 and PC2.
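This sign correspondence can be checked numerically (a quick sketch using the pcaUSArrests object computed above):

# Correlations between the original variables and the first two PCs:
# their signs match the signs of the corresponding weights
round(cor(USArrests, pcaUSArrests$scores[, 1:2]), 2)

# Compare with the weights themselves
round(pcaUSArrests$loadings[, 1:2], 2)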

Let's see an in-depth illustration of the previous concepts for pcaUSArrests:

# Weights and biplot
pcaUSArrests$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Murder 0.536 0.418 0.341 0.649
## Assault 0.583 0.188 0.268 -0.743
## UrbanPop 0.278 -0.873 0.378 0.134


## Rape 0.543 -0.167 -0.818
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
biplot(pcaUSArrests)

[Figure: biplot of PC1 and PC2 for the USArrests dataset, showing the state scores and the arrows of Murder, Assault, UrbanPop and Rape.]

We can extract the following conclusions regarding the arrows and PCs:

• Murder, Assault and Rape are positively correlated with PC1, which might be regarded as an indicator of the presence of crime (positive for more crime, negative for less). These variables are highly correlated between them and their arrows are:

Murder = (0.536, 0.418)
Assault = (0.583, 0.188)
Rape = (0.543, −0.167)

• Murder and UrbanPop are approximately uncorrelated.

• UrbanPop is the variable most correlated with PC2 (positive for a low urban population, negative for a high one). Its arrow is:

UrbanPop = (0.278, −0.873).


Therefore, the biplot shows that states like Florida, South Carolina and California have a high crime rate, whereas states like North Dakota or Vermont have a low crime rate. California, in addition to having a high crime rate, has a large urban population, whereas South Carolina has a low urban population. With the biplot, we can visualize the differences between states according to crime rate and urban population in a simple way.
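These statements can be checked by sorting the scores on PC1 (a minimal sketch; recall that the sign of a PC is arbitrary, so which end corresponds to high crime depends on the princomp run):

# States ordered by their score on the crime-related PC1
crimePC <- sort(pcaUSArrests$scores[, 1])
head(crimePC, 3) # one extreme of the crime indicator
tail(crimePC, 3) # the opposite extreme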

Let's now see the biplot for the USJudgeRatings dataset, which has a clear interpretation:

# Weights and biplot
pcaUSJudgeRatings$loadings

##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## CONT 0.933 0.335
## INTG -0.289 -0.182 0.549 0.174 -0.370 -0.450 -0.334 0.275 -0.109
## DMNR -0.287 -0.198 0.556 -0.124 -0.229 0.395 0.467 0.247 0.199
## DILG -0.304 -0.164 0.321 -0.302 -0.599 0.210 0.355 0.383
## CFMG -0.303 0.168 -0.207 -0.448 0.247 -0.714 -0.143
## DECI -0.302 0.128 -0.298 -0.424 0.393 -0.536 0.302 0.258
## PREP -0.309 -0.152 0.214 0.203 0.335 0.154 0.109 -0.680
## FAMI -0.307 -0.195 0.201 0.507 0.102 0.223
## ORAL -0.313 0.246 0.150 -0.300 0.256
## WRIT -0.311 0.137 0.306 0.238 -0.126 0.475
## PHYS -0.281 -0.154 -0.841 0.118 -0.299 0.266
## RTEN -0.310 0.173 -0.184 -0.256 0.221 -0.756 -0.250
## Comp.11 Comp.12
## CONT
## INTG -0.113
## DMNR 0.134
## DILG
## CFMG 0.166
## DECI -0.128
## PREP -0.319 0.273
## FAMI 0.573 -0.422
## ORAL -0.639 -0.494
## WRIT 0.696
## PHYS
## RTEN 0.286
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
## Cumulative Var 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667 0.750
## Comp.10 Comp.11 Comp.12


## SS loadings 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083
## Cumulative Var 0.833 0.917 1.000
biplot(pcaUSJudgeRatings, cex = 0.75)

[Figure: biplot of PC1 and PC2 for the USJudgeRatings dataset, showing the judge scores and the arrows of the twelve rating variables.]

PC1 gives an indicator, from the lawyers' perspective, of how badly the judge conducts a trial. The variable CONT, which measures the number of contacts between judge and lawyer, is almost uncorrelated with the rest of the variables and is captured by PC2 (hence the ratings of the lawyers are not affected by the number of contacts with the judge). We can identify the high-rated and low-rated judges on the left and right of the plot, respectively.
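The claim about CONT can be checked directly on the correlation matrix (a quick sketch):

# Correlations of CONT with the remaining ratings: all of them are close to zero
round(cor(USJudgeRatings)[, "CONT"], 2)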

Let's see an application of the biplot to La Liga 2015/2016, a dataset with more variables and a harder interpretation of the PCs.

# Weights and biplot
pcaLaliga$loadings

##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Points 0.280 0.136 0.145 0.137
## Wins 0.277 0.219 0.146
## Draws -0.157 -0.633 0.278 0.332 -0.102 0.382
## Loses -0.269 -0.118 -0.215 -0.137 -0.131 -0.313
## Goals.scored 0.271 -0.220 -0.101


## Goals.conceded -0.232 -0.336 -0.178 0.374
## Difference.goals 0.288 -0.171
## Percentage.scored.goals 0.271 -0.219
## Percentage.conceded.goals -0.232 -0.336 -0.178 0.375
## Shots 0.229 -0.299 -0.133 -0.325 -0.272 0.249
## Shots.on.goal 0.252 -0.265 -0.209
## Penalties.scored 0.160 -0.272 -0.410 0.636 -0.389 -0.160
## Assistances 0.271 -0.186 -0.158 0.129 0.176
## Fouls.made -0.189 0.561 0.178 0.213 0.592
## Matches.without.conceding 0.222 0.364 0.163 0.138 0.105 -0.239
## Yellow.cards -0.244 -0.108 0.358 -0.161 0.144
## Red.cards -0.158 -0.340 0.594 -0.192 -0.385 -0.303
## Offsides 0.163 -0.341 0.453 0.426 0.429 -0.283
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## Points 0.135 0.290 0.137 0.213
## Wins 0.263 0.120 0.230
## Draws 0.150 0.126 -0.249
## Loses -0.220 -0.338 -0.171 -0.103 -0.154
## Goals.scored -0.129 0.104 -0.251 0.238 0.289 -0.225 -0.260
## Goals.conceded 0.103 0.225 0.144 0.424
## Difference.goals -0.125 -0.272 0.151 -0.109 -0.374
## Percentage.scored.goals -0.132 0.103 -0.251 0.252 0.297 -0.223 0.501
## Percentage.conceded.goals 0.231 0.161 -0.601
## Shots 0.236 -0.267 0.452 -0.478 0.188
## Shots.on.goal -0.471 0.325 -0.439 0.488 0.192
## Penalties.scored 0.131 0.328
## Assistances 0.154 -0.570 -0.458 -0.504
## Fouls.made -0.363 -0.187 0.135 0.108 -0.135
## Matches.without.conceding 0.282 -0.303 0.388 0.258 -0.554
## Yellow.cards 0.733 -0.369 -0.152 0.219
## Red.cards 0.384 0.216 -0.127
## Offsides -0.302 -0.255 -0.146 0.128
## Comp.15 Comp.16 Comp.17 Comp.18
## Points 0.720 0.401
## Wins -0.272 0.741 -0.267
## Draws 0.118 0.337
## Loses 0.401 0.566 0.136
## Goals.scored 0.511 0.244 -0.438
## Goals.conceded 0.477 -0.181 0.324
## Difference.goals 0.103 -0.375 0.104 0.672
## Percentage.scored.goals -0.568
## Percentage.conceded.goals -0.422
## Shots
## Shots.on.goal
## Penalties.scored
## Assistances


## Fouls.made
## Matches.without.conceding
## Yellow.cards
## Red.cards
## Offsides
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056
## Cumulative Var 0.056 0.111 0.167 0.222 0.278 0.333 0.389 0.444 0.500
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056
## Cumulative Var 0.556 0.611 0.667 0.722 0.778 0.833 0.889 0.944
## Comp.18
## SS loadings 1.000
## Proportion Var 0.056
## Cumulative Var 1.000
biplot(pcaLaliga, cex = 0.75)

[Figure: biplot of PC1 and PC2 for the La Liga 2015/2016 dataset, showing the team scores and the arrows of the team statistics.]

Some interesting highlights:

• PC1 can be regarded as the performance of a team during the season. It is positively correlated with Wins, Points, … and negatively correlated with Draws, Loses, Yellow.cards, … The best-performing teams are not surprising: Barcelona, Real Madrid and Atlético Madrid. On the other hand, among the worst-performing teams are Levante, Getafe and Granada.

• PC2 can be seen as the inefficiency of a team (conceding points with little participation in the game). Using this interpretation, we can see that Rayo Vallecano and Real Madrid were the most inefficient teams and Atlético Madrid and Villarreal the most efficient.

• Offsides is approximately uncorrelated with Red.cards.
• PC3 does not have a clear interpretation.

If you are wondering about the 3D representation of the biplot, it can be computed through:

# Install this package with install.packages("pca3d")
library(pca3d)
pca3d(pcaLaliga, show.labels = TRUE, biplot = TRUE)

Finally, we mention that R Commander has a menu entry for performing PCA: 'Statistics' -> 'Dimensional analysis' -> 'Principal-components analysis...'. Alternatively, the plug-in FactoMineR implements a PCA with more options and graphical outputs. It can be loaded (if installed) in 'Tools' -> 'Load Rcmdr plug-in(s)...' -> 'RcmdrPlugin.FactoMineR' (you will need to restart R Commander). For performing a PCA in FactoMineR, go to 'FactoMineR' -> 'Principal Component Analysis (PCA)'. In that menu you will have more advanced options than in R Commander's PCA.


Chapter 6

Cluster analysis

Cluster analysis is the collection of techniques designed to find subgroups or clusters in a dataset of variables 𝑋1, … , 𝑋𝑘. Depending on the similarities between the observations, these are partitioned into homogeneous groups that are as separated as possible from each other. Clustering methods can be classified into two main categories:

• Partition methods. Given a fixed number of clusters 𝑘, these methods aim to assign each observation of 𝑋1, … , 𝑋𝑘 to a unique cluster, in such a way that the within-cluster variation is as small as possible (the clusters are as homogeneous as possible) while the between-cluster variation is as large as possible (the clusters are as separated as possible).

• Hierarchical methods. These methods construct a hierarchy for the observations in terms of their similarities. This results in a tree-based representation of the data in terms of a dendrogram, which depicts how the observations are clustered at different levels – from the smallest groups of one element to the largest group representing the whole dataset.

We will see the basics of the best-known partition method, namely 𝑘-means clustering, and of agglomerative hierarchical clustering.

6.1 𝑘-means clustering

𝑘-means clustering looks for 𝑘 clusters in the data such that they are as compact as possible and as separated as possible. In clustering terminology, the clusters minimize the within-cluster variation with respect to the cluster centroid while maximizing the between-cluster variation among clusters. The distance used for measuring proximity is the usual Euclidean distance between points. As a consequence, this clustering method tends to yield spherical or rounded clusters and is not adequate for categorical variables. A small sketch of the quantity minimized by the algorithm is given right after Figure 6.1.


[Figure 6.1: The 𝑘-means partitions for a two-dimensional dataset with 𝑘 = 1, 2, 3, 4. Centers of each cluster are displayed with an asterisk.]
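To make the objective of 𝑘-means concrete, the following minimal sketch computes by hand the within-cluster sum of squares that the algorithm minimizes (simulated two-dimensional data; all names are illustrative):

# Within-cluster sum of squares minimized by k-means
set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2) # simulated two-dimensional data
km0 <- kmeans(X, centers = 3, nstart = 20)

# Manual computation: squared Euclidean distances to the assigned centroids
sum((X - km0$centers[km0$cluster, ])^2)
km0$tot.withinss # the same quantity, as reported by kmeans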

Let's analyze the possible clusters in a smaller subset of the La Liga 2015/2016 (download) dataset, where the results can be easily visualized. To that end, import the data as laliga.

# We consider only a smaller dataset (Points and Yellow.cards)
head(laliga, 2)
## Points Matches Wins Draws Loses Goals.scored Goals.conceded
## Barcelona 91 38 29 4 5 112 29
## Real Madrid 90 38 28 6 4 110 34
## Difference.goals Percentage.scored.goals Percentage.conceded.goals
## Barcelona 83 2.95 0.76
## Real Madrid 76 2.89 0.89
## Shots Shots.on.goal Penalties.scored Assistances Fouls.made
## Barcelona 600 277 11 79 385
## Real Madrid 712 299 6 90 420
## Matches.without.conceding Yellow.cards Red.cards Offsides
## Barcelona 18 66 1 120
## Real Madrid 14 72 5 114
pointsCards <- laliga[, c(1, 17)]
plot(pointsCards)


[Figure: scatterplot of Points vs. Yellow.cards.]

# kmeans uses a random initialization of the clusters, so the results may vary
# from one call to another. We use set.seed() to have reproducible outputs.
set.seed(2345678)

# kmeans call:
# - centers is the k, the number of clusters.
# - nstart indicates how many different starting assignments should be considered
#   (useful for avoiding suboptimal clusterings)
k <- 2
km <- kmeans(pointsCards, centers = k, nstart = 20)

# What is inside km?
km
## K-means clustering with 2 clusters of sizes 4, 16
##
## Cluster means:
## Points Yellow.cards
## 1 82.7500 78.25
## 2 44.8125 113.25
##
## Clustering vector:
## Barcelona Real Madrid Atlético Madrid Villarreal Athletic
## 1 1 1 2 1
## Celta Sevilla Málaga Real Sociedad Betis
## 2 2 2 2 2
## Las Palmas Valencia Eibar Espanyol Deportivo


## 2 2 2 2 2
## Granada Sporting Gijón Rayo Vallecano Getafe Levante
## 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 963.500 4201.438
## (between_SS / total_SS = 62.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
str(km)
## List of 9
## $ cluster : Named int [1:20] 1 1 1 2 1 2 2 2 2 2 ...
## ..- attr(*, "names")= chr [1:20] "Barcelona" "Real Madrid" "Atlético Madrid" "Villarreal" ...
## $ centers : num [1:2, 1:2] 82.8 44.8 78.2 113.2
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:2] "Points" "Yellow.cards"
## $ totss : num 13691
## $ withinss : num [1:2] 964 4201
## $ tot.withinss: num 5165
## $ betweenss : num 8526
## $ size : int [1:2] 4 16
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"

# between_SS / total_SS gives a criterion to select k similar to PCA.
# Recall that between_SS / total_SS = 100% if k = n
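Building on this, a common way of choosing 𝑘 is to compute the total within-cluster sum of squares for several values of 𝑘 and look for an elbow, in the same spirit as the screeplot of a PCA. A minimal sketch (the range 𝑘 = 1, …, 6 is an arbitrary choice):

# Total within-cluster sum of squares for k = 1, ..., 6
set.seed(2345678)
wss <- sapply(1:6, function(k) {
  kmeans(pointsCards, centers = k, nstart = 20)$tot.withinss
})
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")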

# Centroids of each cluster
km$centers
## Points Yellow.cards
## 1 82.7500 78.25
## 2 44.8125 113.25

# Assignments of observations to the k clusters
km$cluster
## Barcelona Real Madrid Atlético Madrid Villarreal Athletic
## 1 1 1 2 1
## Celta Sevilla Málaga Real Sociedad Betis
## 2 2 2 2 2
## Las Palmas Valencia Eibar Espanyol Deportivo


## 2 2 2 2 2
## Granada Sporting Gijón Rayo Vallecano Getafe Levante
## 2 2 2 2 2

# Plot data with colors according to clusters
plot(pointsCards, col = km$cluster)

# Add the names of the observations above the points
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster,
     pos = 3, cex = 0.75)

[Figure: 𝑘-means clustering of Points vs. Yellow.cards with 𝑘 = 2, team names colored by cluster.]

# Clustering with k = 3
k <- 3
set.seed(2345678)
km <- kmeans(pointsCards, centers = k, nstart = 20)
plot(pointsCards, col = km$cluster)
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster,
     pos = 3, cex = 0.75)


[Figure: Points vs. Yellow.cards with the k = 3 cluster assignments in color and the team names as labels.]

# Clustering with k = 4
k <- 4
set.seed(2345678)
km <- kmeans(pointsCards, centers = k, nstart = 20)
plot(pointsCards, col = km$cluster)
text(x = pointsCards, labels = rownames(pointsCards), col = km$cluster,
     pos = 3, cex = 0.75)

[Figure: Points vs. Yellow.cards with the k = 4 cluster assignments in color and the team names as labels.]


So far, we have only taken the information of two variables for performing clustering. Using PCA, we can visualize the clustering performed with all the available variables in the dataset.

By default, kmeans does not standardize variables, which will affect the clustering result. As a consequence, the clustering of a dataset will be different if one variable is expressed in millions or in tenths. If you want to avoid this distortion, you can use scale to automatically center and standardize a data frame (the result will be a matrix, so you need to transform it to a data frame again).

# Work with standardized data (and remove Matches)
laligaStd <- data.frame(scale(laliga[, -2]))
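As a quick check (a minimal sketch, assuming laligaStd has just been created as above), the standardized columns should have mean zero and standard deviation one:

# Sanity check of the standardization: means are (numerically) zero, sds are one
round(colMeans(laligaStd), 10)
apply(laligaStd, 2, sd)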

# Clustering with all the variables - unstandardized data
set.seed(345678)
kme <- kmeans(laliga, centers = 3, nstart = 20)
kme$cluster
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               1               1               3               2               3
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               2               2               2               3               3
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               3               2               2               2               3
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               2               2               2               2               2
table(kme$cluster)
##
##  1  2  3
##  2 12  6

# Clustering with all the variables - standardized data
set.seed(345678)
kme <- kmeans(laligaStd, centers = 3, nstart = 20)
kme$cluster
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               2               2               2               1               1
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               1               1               1               1               1
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               1               1               3               3               1
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               3               3               3               3               3
table(kme$cluster)
##
##  1  2  3
## 10  3  7

# PCA
pca <- princomp(laliga[, -2], cor = TRUE)
summary(pca)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     3.4372781 1.5514051 1.14023547 0.91474383 0.85799980
## Proportion of Variance 0.6563823 0.1337143 0.07222983 0.04648646 0.04089798
## Cumulative Proportion  0.6563823 0.7900966 0.86232642 0.90881288 0.94971086
##                            Comp.6    Comp.7      Comp.8      Comp.9    Comp.10
## Standard deviation     0.60295746 0.4578613 0.373829925 0.327242606 0.22735805
## Proportion of Variance 0.02019765 0.0116465 0.007763823 0.005949318 0.00287176
## Cumulative Proportion  0.96990851 0.9815550 0.989318830 0.995268148 0.99813991
##                             Comp.11      Comp.12      Comp.13      Comp.14
## Standard deviation     0.1289704085 0.0991188181 0.0837498223 2.860411e-03
## Proportion of Variance 0.0009240759 0.0005458078 0.0003896685 4.545528e-07
## Cumulative Proportion  0.9990639840 0.9996097918 0.9999994603 9.999999e-01
##                             Comp.15      Comp.16      Comp.17 Comp.18
## Standard deviation     1.238298e-03 1.583337e-08 1.154388e-08       0
## Proportion of Variance 8.518782e-08 1.392753e-17 7.403399e-18       0
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00       1

# Biplot (the scores of the first two PCs)
biplot(pca)


[Figure: biplot of the first two principal components (Comp.1 and Comp.2), showing the teams and the variable loadings.]

# Redo the biplot with colors indicating the cluster assignments
plot(pca$scores[, 1:2], col = kme$cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = kme$cluster)

[Figure: scores on the first two PCs, colored by the clustering computed with all the standardized variables and labeled with the team names.]

# Recall: this is a visualization with PC1 and PC2 of the clustering done with
# all the variables, not just PC1 and PC2

# Clustering with only the first two PCs - different and less accurate result,
# but still insightful
set.seed(345678)
kme2 <- kmeans(pca$scores[, 1:2], centers = 3, nstart = 20)
plot(pca$scores[, 1:2], col = kme2$cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = kme2$cluster)

[Figure: scores on the first two PCs, colored by the clustering computed with only the first two PCs.]

𝑘-means can also be performed through the help of R Commander. To do so, go to 'Statistics' -> 'Dimensional Analysis' -> 'Clustering' -> 'k-means cluster analysis...'. If you do this for the USArrests dataset after rescaling it, select 'Assign clusters to the data set', and name the 'Assignment variable' as 'KMeans', you should get something like this:

# Load data and scale it
data(USArrests)
USArrests <- as.data.frame(scale(USArrests))

# Statistics -> Dimensional Analysis -> Clustering -> k-means cluster analysis...
.cluster <- KMeans(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests),
                   centers = 2, iter.max = 10, num.seeds = 10)

.cluster$size  # Cluster Sizes
## [1] 20 30
.cluster$centers  # Cluster Centroids
##   new.x.Assault new.x.Murder new.x.Rape new.x.UrbanPop
## 1     1.0138274     1.004934  0.8469650      0.1975853
## 2    -0.6758849    -0.669956 -0.5646433     -0.1317235
.cluster$withinss  # Within Cluster Sum of Squares
## [1] 46.74796 56.11445
.cluster$tot.withinss  # Total Within Sum of Squares
## [1] 102.8624
.cluster$betweenss  # Between Cluster Sum of Squares
## [1] 93.1376
remove(.cluster)
.cluster <- KMeans(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests),
                   centers = 2, iter.max = 10, num.seeds = 10)
.cluster$size  # Cluster Sizes
## [1] 20 30
.cluster$centers  # Cluster Centroids
##   new.x.Assault new.x.Murder new.x.Rape new.x.UrbanPop
## 1     1.0138274     1.004934  0.8469650      0.1975853
## 2    -0.6758849    -0.669956 -0.5646433     -0.1317235
.cluster$withinss  # Within Cluster Sum of Squares
## [1] 46.74796 56.11445
.cluster$tot.withinss  # Total Within Sum of Squares
## [1] 102.8624
.cluster$betweenss  # Between Cluster Sum of Squares
## [1] 93.1376
biplot(princomp(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop, USArrests)),
       xlabs = as.character(.cluster$cluster))

[Figure: biplot of the principal components of the rescaled USArrests data (Assault, Murder, Rape, UrbanPop), with the observations labeled by their cluster assignment (1 or 2).]


USArrests$KMeans <- assignCluster(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop,
                                               USArrests), USArrests, .cluster$cluster)

remove(.cluster)

How many clusters 𝑘 do we need in practice? There is not a single answer: the advice is to try several and compare. Inspecting the 'between_SS / total_SS' output for a good trade-off between the number of clusters and the percentage of total variation explained usually gives a good starting point for deciding on 𝑘.
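For example, a minimal sketch (assuming the laligaStd data frame created above) that computes 'between_SS / total_SS' for a grid of values of 𝑘 and plots it; we then look for the 𝑘 after which the gains in explained variation flatten out:

# Proportion of total variation explained by the clustering, for k = 1, ..., 10
set.seed(345678)
propExplained <- sapply(1:10, function(k) {
  km <- kmeans(laligaStd, centers = k, nstart = 20)
  km$betweenss / km$totss
})
plot(1:10, propExplained, type = "b", xlab = "k", ylab = "between_SS / total_SS")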

For the iris dataset, do sequentially:

1. Apply scale to the dataset and save it as irisStd. Note: the fifth variable is a factor, so you must skip it.
2. Fix the seed to 625365712.
3. Run 𝑘-means with 20 runs for 𝑘 = 2, 3, 4. Save the results as km2, km3 and km4.
4. Compute the PCA of irisStd.
5. Plot the first two scores, colored by the assignments of km2.
6. Do the same for km3 and km4.
7. Which 𝑘 do you think gives the most sensible partition based on the previous plots?

6.2 Agglomerative hierarchical clustering

Hierarchical clustering starts by considering that each observation is its own cluster, and then sequentially merges the clusters with the lowest degree of dissimilarity 𝑑 (the lower the dissimilarity, the larger the similarity). For example, if there are three clusters, 𝐴, 𝐵 and 𝐶, and their dissimilarities are 𝑑(𝐴, 𝐵) = 0.1, 𝑑(𝐴, 𝐶) = 0.5, 𝑑(𝐵, 𝐶) = 0.9, then the three clusters will be reduced to just two: (𝐴, 𝐵) and 𝐶.
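As a quick illustration, a minimal sketch that feeds exactly these toy dissimilarities to hclust (the function is discussed in detail below) and checks that 𝐴 and 𝐵 are indeed fused first:

# Toy dissimilarity matrix for the clusters A, B and C
D <- matrix(c(0.0, 0.1, 0.5,
              0.1, 0.0, 0.9,
              0.5, 0.9, 0.0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
tree <- hclust(as.dist(D), method = "single")
tree$merge  # First row: A and B are merged (the smallest dissimilarity, 0.1)
plot(tree)  # Dendrogram: (A, B) are fused first, then C joins them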

The advantages of hierarchical clustering are several:

• We do not need to specify a fixed number of clusters 𝑘.
• The clusters are naturally nested within each other, something that does not happen in 𝑘-means. It is possible to visualize this nested structure through the dendrogram.
• It can deal with categorical variables, through the specification of proper dissimilarity measures. In particular, it can deal with numerical variables using the Euclidean distance.

The linkage employed by hierarchical clustering refers to how the clusters are fused:


[Figure 6.2 panels: the two-dimensional dataset ("Data", x vs. y) and its dendrograms for the "average", "complete" and "single" linkages (Height on the vertical axis).]

Figure 6.2: The hierarchical clustering for a two-dimensional dataset with complete, single and average linkages.

• Complete. Takes the maximal dissimilarity between all the pairwise dissimilarities between the observations in cluster A and cluster B.

• Single. Takes the minimal dissimilarity between all the pairwise dissimilarities between the observations in cluster A and cluster B.

• Average. Takes the average dissimilarity between all the pairwise dissimilarities between the observations in cluster A and cluster B.

Hierarchical clustering is quite sensitive to the kind of dissimilarity employed and the kind of linkage used. In addition, the hierarchical property might force the clusters to behave unnaturally. In particular, single linkage may result in extended, chained clusters in which a single observation is added at each new level. As a consequence, complete and average linkages are usually recommended in practice.

Let's illustrate how to perform hierarchical clustering in laligaStd.

# Compute dissimilarity matrix - in this case Euclidean distance
d <- dist(laligaStd)

# Hierarchical clustering with complete linkage
treeComp <- hclust(d, method = "complete")
plot(treeComp)


[Figure: cluster dendrogram of laligaStd with complete linkage.]

# With average linkage
treeAve <- hclust(d, method = "average")
plot(treeAve)


[Figure: cluster dendrogram of laligaStd with average linkage.]

# With single linkage
treeSingle <- hclust(d, method = "single")
plot(treeSingle)  # Chaining


[Figure: cluster dendrogram of laligaStd with single linkage; note the chaining.]

# Set the number of clusters after inspecting visually the dendrogram for "long"
# groups of hanging leaves
# These are the cluster assignments
cutree(treeComp, k = 2)  # (Barcelona, Real Madrid) and (rest)
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               1               1               1               2               2
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               2               2               2               2               2
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               2               2               2               2               2
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               2               2               2               2               2
cutree(treeComp, k = 3)  # (Barcelona, Real Madrid), (Atlético Madrid) and (rest)
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               1               1               2               3               3
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               3               3               3               3               3
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               3               3               3               3               3
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               3               3               3               3               3

# Compare differences - treeComp makes more sense than treeAve
cutree(treeComp, k = 4)
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               1               1               2               3               3
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               3               3               3               3               3
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               3               4               4               4               3
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               4               4               4               4               4
cutree(treeAve, k = 4)
##       Barcelona     Real Madrid Atlético Madrid      Villarreal        Athletic
##               1               1               2               3               3
##           Celta         Sevilla          Málaga   Real Sociedad           Betis
##               3               3               3               3               3
##      Las Palmas        Valencia           Eibar        Espanyol       Deportivo
##               3               3               3               3               4
##         Granada  Sporting Gijón  Rayo Vallecano          Getafe         Levante
##               3               3               3               3               3

# We can plot the results in the first two PCs, as we did in k-means
cluster <- cutree(treeComp, k = 2)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)

[Figure: scores on the first two PCs, colored by the k = 2 cut of treeComp and labeled with the team names.]

cluster <- cutree(treeComp, k = 3)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)

[Figure: scores on the first two PCs, colored by the k = 3 cut of treeComp and labeled with the team names.]

cluster <- cutree(treeComp, k = 4)
plot(pca$scores[, 1:2], col = cluster)
text(x = pca$scores[, 1:2], labels = rownames(pca$scores), pos = 3, col = cluster)

[Figure: scores on the first two PCs, colored by the k = 4 cut of treeComp and labeled with the team names.]

If categorical variables are present, replace dist by daisy from the cluster package (you need to do first library(cluster)). For example, let's cluster the iris dataset.

# Load data
data(iris)

# The fifth variable is a factor
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Compute dissimilarity matrix using the Gower dissimilarity measure
# This dissimilarity is able to handle both numerical and categorical variables
# daisy automatically detects whether there are factors present in the data and
# applies Gower (otherwise it applies the Euclidean distance)
library(cluster)
d <- daisy(iris)
tree <- hclust(d)

# 3 main clusters
plot(tree)


[Figure: cluster dendrogram of the iris dataset with the Gower dissimilarity and complete linkage.]

# The clusters correspond to the Species
cutree(tree, k = 3)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
table(iris$Species, cutree(tree, k = 3))
##
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 50  0
##   virginica   0  0 50

Performing hierarchical clustering in practice depends on several decisions that may have big consequences on the final output:

– What kind of dissimilarity and linkage should be employed? There is not a single answer: try several and compare.
– Where to cut the dendrogram? The general advice is to look for groups of branches hanging for a long space and cut on their top (see the sketch after this list).

Despite the general advice, there is not a single and best solution for the previous questions. What is advisable in practice is to analyze several choices, report the general patterns that arise, and point out the different features of the data that the methods expose.
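A minimal sketch of the second point (assuming the treeComp dendrogram computed above): cutree can also cut at a given height h, and the number of clusters then follows from where that cut crosses the branches.

# Cut the complete-linkage dendrogram at height h = 8: branches that merge
# above that height end up in different clusters
cutree(treeComp, h = 8)

# Fixing k instead lets R choose the corresponding cut height
cutree(treeComp, k = 3)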

Hierarchical clustering can also be performed through the help of R Commander. To do so, go to 'Statistics' -> 'Dimensional Analysis' -> 'Clustering' -> 'Hierar...'. If you do this for the USArrests dataset after rescaling, you should get something like this:

HClust.1 <- hclust(dist(model.matrix(~-1 + Assault + Murder + Rape + UrbanPop,
                                     USArrests)), method = "complete")
plot(HClust.1, main = "Cluster Dendrogram for Solution HClust.1",
     xlab = "Observation Number in Data Set USArrests",
     sub = "Method=complete; Distance=euclidian")

[Figure: "Cluster Dendrogram for Solution HClust.1" for the rescaled USArrests dataset (Method=complete; Distance=euclidian), with the US states as leaves.]

Import the eurojob (download) dataset and standardize it properly. Perform a hierarchical clustering analysis for the three kinds of linkages seen.


Appendix A

Glossary of important R commands

Basic usage

The following table contains important R commands for the basic usage of R.

• Assign values to a variable: <- (e.g., x <- 1)
• Compute several expressions at once: ; (e.g., x <- 1; 2 + 2; 3 * 8)
• Create vectors by concatenating numbers: c (e.g., c(1, 2, -1))
• Create sequential integer vectors: : (e.g., 1:10)
• Create a matrix by columns: cbind (e.g., cbind(1:3, c(0, 2, 0)))
• Create a matrix by rows: rbind (e.g., rbind(1:3, c(0, 2, 0)))
• Create a data frame: data.frame (e.g., data.frame(name1 = c(-1, 3), name2 = c(0.4, 1)))
• Create a list: list (e.g., list(obj1 = c(-1, 3), obj2 = -1:5, obj3 = rbind(1:2, 3:2)))
• Access elements of a...
  – ... vector: [] (e.g., c(0.5, 2)[1]; c(0.5, 2)[-1]; c(0.5, 2)[2:1])
  – ... matrix: [, ] (e.g., cbind(1:2, 3:4)[1, 2]; cbind(1:2, 3:4)[1, ])
  – ... data frame: [, ] and $ (e.g., data.frame(name1 = c(-1, 3), name2 = c(0.4, 1))$name1; data.frame(name1 = c(-1, 3), name2 = c(0.4, 1))[2, 1])
  – ... list: $ (e.g., list(x = 2, y = 7:0)$y)
• Summarize any object: summary (e.g., summary(1:10))

Linear regression

Some useful commands for performing simple and multiple linear regression are given in the next table. We assume that:

• dataset is an imported dataset such that
  – resp is the response variable
  – pred1 is the first predictor
  – pred2 is the second predictor
  – ...
  – predk is the last predictor
• model is the result of applying lm
• newPreds is a data.frame with variables named as the predictors
• num is 1, 2 or 3
• level is a number between 0 and 1

• Fit a simple linear model: lm(response ~ pred1, data = dataset)
• Fit a multiple linear model...
  – ... on two predictors: lm(response ~ pred1 + pred2, data = dataset)
  – ... on all predictors: lm(response ~ ., data = dataset)
  – ... on all predictors except pred1: lm(response ~ . - pred1, data = dataset)
• Summarize linear model (coefficient estimates, standard errors, 𝑡-values, 𝑝-values for 𝐻0 ∶ 𝛽𝑗 = 0, the estimate of 𝜎 (Residual standard error), degrees of freedom, 𝑅2, Adjusted 𝑅2, 𝐹-test, 𝑝-value for 𝐻0 ∶ 𝛽1 = … = 𝛽𝑘 = 0): summary(model)
• ANOVA decomposition: anova(model)
• CIs coefficients: confint(model, level = level)
• Prediction: predict(model, newdata = new)
• CIs predicted mean: predict(model, newdata = new, interval = "confidence", level = level)
• CIs predicted response: predict(model, newdata = new, interval = "prediction", level = level)
• Variable selection: stepwise(model)
• Multicollinearity detection: vif(model)
• Compare model coefficients: compareCoefs(model1, model2)
• Diagnostic plots: plot(model, num)

More basic usage

The following table contains more important R commands for basic usage. We assume the following dataset is available:

data <- data.frame(x = 1:10, y = c(-1, 2, 3, 0, 3, 1, -1, 3, 0, -1))

Data frame management
– variable names: names (e.g., names(data))
– structure: str (e.g., str(data))
– dimensions: dim (e.g., dim(data))
– beginning: head (e.g., head(data))

Vector related functions
– create sequences: seq (e.g., seq(0, 1, l = 10); seq(0, 1, by = 0.25))
– reverse a vector: rev (e.g., rev(1:5))
– length of a vector: length (e.g., length(1:5))
– count repetitions in a vector: table (e.g., table(c(1:5, 4:2)))

Logical conditions
– relational operators: <, <=, >, >=, ==, != (e.g., 1 < 0; 1 <= 1; 2 > 1; 3 >= 4; 1 == 0; 1 != 0)
– "and": & (e.g., TRUE & FALSE)
– "or": | (e.g., TRUE | FALSE)

Subsetting
– vector: data$x[data$x > 0]; data$x[data$x > 2 & data$x < 8]
– data frame: data[data$x > 0, ]; data[data$x < 2 | data$x > 8, ]

Distributions
– sampling: rxxxx (e.g., rnorm(n = 10, mean = 0, sd = 1))
– density: dxxxx (e.g., x <- seq(-4, 4, l = 20); dnorm(x = x, mean = 0, sd = 1))
– distribution: pxxxx (e.g., x <- seq(-4, 4, l = 20); pnorm(q = x, mean = 0, sd = 1))
– quantiles: qxxxx (e.g., p <- seq(0.1, 0.9, l = 10); qnorm(p = p, mean = 0, sd = 1))

Plotting
– scatterplot: plot (e.g., plot(rnorm(100), rnorm(100)))
– plot a curve: plot, seq (e.g., x <- seq(0, 1, l = 100); plot(x, x^2, type = "l"))
– add lines: lines (e.g., x <- seq(0, 1, l = 100); plot(x, x^2 + rnorm(100, sd = 0.1)); lines(x, x^2, col = 2, lwd = 2))

Logistic regression

Some useful commands for performing logistic regression are given in the next table. We assume that:

• dataset is an imported dataset such that
  – resp is the binary response variable
  – pred1 is the first predictor
  – pred2 is the second predictor
  – ...
  – predk is the last predictor
• model is the result of applying glm
• newPreds is a data.frame with variables named as the predictors
• level is a number between 0 and 1

• Fit a simple logistic model: glm(response ~ pred1, data = dataset, family = "binomial")
• Fit a multiple logistic model...
  – ... on two predictors: glm(response ~ pred1 + pred2, data = dataset, family = "binomial")
  – ... on all predictors: glm(response ~ ., data = dataset, family = "binomial")
  – ... on all predictors except pred1: glm(response ~ . - pred1, data = dataset, family = "binomial")
• Summarize logistic model (coefficient estimates, standard errors, Wald statistics ('z value'), 𝑝-values for 𝐻0 ∶ 𝛽𝑗 = 0, Null deviance, deviance ('Residual deviance'), AIC, number of iterations): summary(model)
• CIs coefficients: confint(model, level = level); confint.default(model, level = level)
• CIs exp-coefficients: exp(confint(model, level = level)); exp(confint.default(model, level = level))
• Prediction: predict(model, newdata = new, type = "response")
• CIs predicted probability: not immediate. Use predictCIsLogistic(model, newdata = new, level = level) as seen in Section 4.6
• Variable selection: stepwise(model)
• Multicollinearity detection: vif(model)
• 𝑅2: not immediate. Use r2Log(model = model) as seen in Section 4.8
• Hit matrix: table(data$resp, model$fitted.values > 0.5)

Principal component analysis

Some useful commands for performing principal component analysis are given in the next table. We assume that:

• dataset is an imported dataset with several non-categorical variables (the variables must be continuous or discrete).
• pca is a PCA object, this is, the output of princomp.

• Compute a PCA...
  – ... unnormalized (if variables have the same scale): princomp(dataset)
  – ... normalized (if variables have different scales): princomp(dataset, cor = TRUE)
• Summarize PCA (standard deviation explained by each PC, proportion of variance explained by each PC, cumulative proportion of variance explained up to a given component): summary(pca)
• Weights: pca$loadings
• Scores: pca$scores
• Standard deviations of the PCs: pca$sdev
• Means of the original variables: pca$center
• Screeplot: plot(pca); plot(pca, type = "l")
• Biplot: biplot(pca)


Appendix B

Use of qualitative predictors in regression

An important situation not covered in Chapters 2, 3 and 4 is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors, also known as categorical variables or, in R's terminology, factors, are ubiquitous in social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented in statistical software such as R.

Two levels

The simplest case is the situation with two levels, this is, the binary case covered in logistic regression. There we saw that a binary variable 𝐶 with two levels (for example, a and b) could be represented as

$$D = \begin{cases} 1, & \text{if } C = b, \\ 0, & \text{if } C = a. \end{cases}$$

𝐷 now is a dummy variable: it codifies with zeros and ones the two possible levels of the categorical variable. An example of 𝐶 could be gender, which has levels male and female. The dummy variable associated is 𝐷 = 0 if the gender is male and 𝐷 = 1 if the gender is female.

The advantage of this dummification is its interpretability in regression models. Since level a corresponds to 0, it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and codify the rest as departures from it with ones.

The previous interpretation translates easily to regression models. Assume that the dummy variable 𝐷 is available together with other predictors 𝑋1, … , 𝑋𝑘. Then:


• Linear model

𝔼[𝑌 |𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘, 𝐷 = 𝑑] = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝛽𝑘+1𝐷.

The coefficient associated to 𝐷 is easily interpretable. 𝛽𝑘+1 is the increment in mean of 𝑌 associated to changing 𝐷 = 0 (reference) to 𝐷 = 1, while the rest of the predictors are fixed. Or in other words, 𝛽𝑘+1 is the increment in mean of 𝑌 associated to changing the level of the categorical variable from a to b.

• Logistic model

ℙ[𝑌 = 1|𝑋1 = 𝑥1, … , 𝑋𝑘 = 𝑥𝑘, 𝐷 = 𝑑] = logistic(𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘 + 𝛽𝑘+1𝐷).

We have two interpretations of 𝛽𝑘+1, either in terms of log-odds or odds:

– 𝛽𝑘+1 is the additive increment in log-odds of 𝑌 associated to changing the level of the categorical variable from a (reference, 𝐷 = 0) to b (𝐷 = 1).

– 𝑒𝛽𝑘+1 is the multiplicative increment in odds of 𝑌 associated to changing the level of the categorical variable from a (reference, 𝐷 = 0) to b (𝐷 = 1).

R does the dummification automatically (translates a categorical variable 𝐶 into its dummy version 𝐷) if it detects that a factor variable is present in the regression model. Let's see an example of this in linear and logistic regression.

# Load the Boston dataset
library(MASS)
data(Boston)

# Structure of the data
str(Boston)
## 'data.frame': 506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable measuring if the suburb is close to the river (1)
# or not (0). In this case it is not codified as a factor but as a 0 or 1.

# Summary of a linear model
mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.540  -5.421  -1.878   2.575  30.134
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.61403    0.41862  56.409  < 2e-16 ***
## chas         5.57772    1.46926   3.796 0.000165 ***
## crim        -0.40598    0.04339  -9.358  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared:  0.1744, Adjusted R-squared:  0.1712
## F-statistic: 53.14 on 2 and 503 DF,  p-value: < 2.2e-16
# The coefficient associated to chas is 5.57772. That means that if the suburb
# is close to the river, the mean of medv increases in 5.57772 units.
# chas is significant (the presence of the river adds a valuable information
# for explaining medv)

# Create a binary response (1 expensive suburb, 0 inexpensive)
Boston$expensive <- Boston$medv > 25

# Summary of a logistic model
mod <- glm(expensive ~ chas + crim, data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = expensive ~ chas + crim, family = "binomial", data = Boston)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -1.26764  -0.84292  -0.67854  -0.00099   2.87470
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.82159    0.12217  -6.725 1.76e-11 ***
## chas         1.04165    0.36962   2.818  0.00483 **
## crim        -0.22816    0.05265  -4.333 1.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 513.44  on 503  degrees of freedom
## AIC: 519.44
##
## Number of Fisher Scoring iterations: 6
# The coefficient associated to chas is 1.04165. That means that if the suburb
# is close to the river, the log-odds of expensive increases by 1.04165.
# Alternatively, the odds of expensive increases by a factor of exp(1.04165).
# chas is significant (the presence of the river adds a valuable information
# for explaining medv)

More than two levels

Let's see now the case with more than two levels, for example, a categorical variable 𝐶 with levels a, b and c. If we take a as the reference level, this variable can be represented by two dummy variables:

$$D_1 = \begin{cases} 1, & \text{if } C = b, \\ 0, & \text{if } C \neq b \end{cases}$$

and

$$D_2 = \begin{cases} 1, & \text{if } C = c, \\ 0, & \text{if } C \neq c. \end{cases}$$

Then 𝐶 = 𝑎 is represented by 𝐷1 = 𝐷2 = 0, 𝐶 = 𝑏 is represented by 𝐷1 = 1, 𝐷2 = 0 and 𝐶 = 𝑐 is represented by 𝐷1 = 0, 𝐷2 = 1. The interpretation of the regression models with the presence of 𝐷1 and 𝐷2 is very similar to the one before. For example, for the linear model, the coefficient associated to 𝐷1 gives the increment in mean of 𝑌 when the category of 𝐶 changes from a to b. The coefficient for 𝐷2 gives the increment in mean of 𝑌 when it changes from a to c.
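A minimal sketch of this dummy coding (the factor below is made up just for illustration): model.matrix shows the dummy variables that R builds internally, with the first level as the reference.

# A toy factor with levels a, b and c (a is the reference level by default)
C <- factor(c("a", "b", "c", "b", "a"))

# Columns Cb and Cc are the dummies D1 and D2; C = a corresponds to both being 0
model.matrix(~ C)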

In general, if we have a categorical variable with 𝐽 levels, then the number of dummy variables required is 𝐽 − 1. Again, R does the dummification automatically for you if it detects that a factor variable is present in the regression model.

# Load dataset - factors in the last column
data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##

# Summary of a linear model
mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.79424 -0.21874  0.00899  0.20255  0.73103
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 **
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16
# Speciesversicolor (D1) coefficient: -0.72356. The average increment of
# Sepal.Length when the species is versicolor instead of setosa (reference).
# Speciesvirginica (D2) coefficient: -1.02350. The average increment of
# Sepal.Length when the species is virginica instead of setosa (reference).
# Both dummy variables are significant


# How to set a different level as reference (versicolor)
iris$Species <- relevel(iris$Species, ref = "versicolor")

# Same estimates except for the dummy coefficients
mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.79424 -0.21874  0.00899  0.20255  0.73103
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       1.44770    0.28149   5.143 8.68e-07 ***
## Sepal.Width       0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length      0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width      -0.31516    0.15120  -2.084  0.03889 *
## Speciessetosa     0.72356    0.24017   3.013  0.00306 **
## Speciesvirginica -0.29994    0.11898  -2.521  0.01280 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference).
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference).
# Both dummy variables are significant

# CIs for the coefficients of the model
confint(mod2)
##                       2.5 %      97.5 %
## (Intercept)       0.8913266  2.00408209
## Sepal.Width       0.3257653  0.66601260
## Petal.Length      0.6937939  0.96469395
## Petal.Width      -0.6140049 -0.01630542
## Speciessetosa     0.2488500  1.19827390
## Speciesvirginica -0.5351144 -0.06475727
# The CI for Speciessetosa is significantly positive and the CI for
# Speciesvirginica is significantly negative. Therefore, there are significant
# differences in the mean of Sepal.Length between the species


Do not codify a categorical variable as a discrete variable. This constitutes a major methodological flaw that will invalidate the subsequent statistical analysis.

For example, if you have a categorical variable party with levels partyA, partyB and partyC, do not encode it as a discrete variable taking the values 1, 2 and 3, respectively. If you do so:

– You implicitly assume an order in the levels of party, since partyA is closer to partyB than to partyC.
– You implicitly assume that partyC is three times larger than partyA.
– The codification is completely arbitrary: why not consider 1, 1.5 and 1.75 instead?

The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.


Appendix C

Multinomial logistic regression

The logistic model can be generalized to categorical variables 𝑌 with more than two possible levels, namely {1, … , 𝐽}. Given the predictors 𝑋1, … , 𝑋𝑘, multinomial logistic regression models the probability of each level 𝑗 of 𝑌 by

$$p_j(\mathbf{x}) = \mathbb{P}[Y = j | X_1 = x_1, \ldots, X_k = x_k] = \frac{e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{kj} X_k}}{1 + \sum_{l=1}^{J-1} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{kl} X_k}} \tag{C.1}$$

for 𝑗 = 1, … , 𝐽 − 1 and (for the last level 𝐽)

$$p_J(\mathbf{x}) = \mathbb{P}[Y = J | X_1 = x_1, \ldots, X_k = x_k] = \frac{1}{1 + \sum_{l=1}^{J-1} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{kl} X_k}}. \tag{C.2}$$

Note that (C.1) and (C.2) imply that $\sum_{j=1}^{J} p_j(\mathbf{x}) = 1$ and that there are (𝐽 − 1) × (𝑘 + 1) coefficients ((𝐽 − 1) intercepts and (𝐽 − 1) × 𝑘 slopes). Also, (C.2) reveals that the last level, 𝐽, is given a different treatment. This is because it is the reference level (it could be a different one, but it is traditional to choose the last one).

The multinomial logistic model has an interesting interpretation in terms of logistic regressions. Taking the quotient between (C.1) and (C.2) gives

$$\frac{p_j(\mathbf{x})}{p_J(\mathbf{x})} = e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{kj} X_k} \tag{C.3}$$

for 𝑗 = 1, … , 𝐽 − 1. Therefore, applying a logarithm to both sides we have:

$$\log \frac{p_j(\mathbf{x})}{p_J(\mathbf{x})} = \beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{kj} X_k. \tag{C.4}$$


This equation is indeed very similar to (4.7). If 𝐽 = 2, it is the same up to a change in the codes for the levels: the logistic regression giving the probability of 𝑌 = 1 versus 𝑌 = 2. On the LHS of (C.4) we have the logarithm of the ratio of two probabilities and on the RHS a linear combination of the predictors. If the probabilities on the LHS were complementary (if they added up to one), then we would have a log-odds and hence a logistic regression for 𝑌. This is not the situation, but it is close: instead of odds and log-odds, we have ratios and log-ratios of non-complementary probabilities. It also gives a good insight on what the multinomial logistic regression is: a set of 𝐽 − 1 independent "logistic regressions" for the probability of 𝑌 = 𝑗 versus the probability of the reference 𝑌 = 𝐽.

Equation (C.3) also gives an interpretation of the coefficients of the model, since

$$p_j(\mathbf{x}) = e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{kj} X_k} \, p_J(\mathbf{x}).$$

Therefore:

• 𝑒𝛽0𝑗 is the ratio 𝑝𝑗(0)/𝑝𝐽(0) between the probabilities of 𝑌 = 𝑗 and 𝑌 = 𝐽 when 𝑋1 = … = 𝑋𝑘 = 0. If 𝑒𝛽0𝑗 > 1 (equivalently, 𝛽0𝑗 > 0), then 𝑌 = 𝑗 is more likely than 𝑌 = 𝐽. If 𝑒𝛽0𝑗 < 1 (𝛽0𝑗 < 0), then 𝑌 = 𝑗 is less likely than 𝑌 = 𝐽.

• 𝑒𝛽𝑙𝑗, 𝑙 ≥ 1, is the multiplicative increment of the ratio 𝑝𝑗(x)/𝑝𝐽(x) for an increment of one unit in 𝑋𝑙 = 𝑥𝑙, provided that the remaining variables 𝑋1, … , 𝑋𝑙−1, 𝑋𝑙+1, … , 𝑋𝑘 do not change. If 𝑒𝛽𝑙𝑗 > 1 (equivalently, 𝛽𝑙𝑗 > 0), then 𝑌 = 𝑗 becomes more likely than 𝑌 = 𝐽 for each increment in 𝑋𝑙. If 𝑒𝛽𝑙𝑗 < 1 (𝛽𝑙𝑗 < 0), then 𝑌 = 𝑗 becomes less likely than 𝑌 = 𝐽.

The following code illustrates how to compute a basic multinomial regression in R.

# Package included in R that implements multinomial regression
library(nnet)

# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile)
summary(Chile)
##  region     population     sex           age        education
##  C :600   Min.   :  3750   F:1379   Min.   :18.00   P   :1107
##  M :100   1st Qu.: 25000   M:1321   1st Qu.:26.00   PS  : 462
##  N :322   Median :175000            Median :36.00   S   :1120
##  S :718   Mean   :152222            Mean   :38.55   NA's:  11
##  SA:960   3rd Qu.:250000            3rd Qu.:49.00
##           Max.   :250000            Max.   :70.00
##           NA's   :1
##      income         statusquo          vote
##  Min.   :  2500   Min.   :-1.80301   A   :187
##  1st Qu.:  7500   1st Qu.:-1.00223   N   :889
##  Median : 15000   Median :-0.04558   U   :588
##  Mean   : 33876   Mean   : 0.00000   Y   :868
##  3rd Qu.: 35000   3rd Qu.: 0.96857   NA's:168
##  Max.   :200000   Max.   : 2.04859
##  NA's   :98       NA's   :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)

# Fit of the model done by multinom: Response ~ Predictors
# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258
## iter  10 value 2310.201176
## iter  20 value 2135.385060
## final  value 2132.416452
## converged

# Each row of coefficients gives the coefficients of the logistic
# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile,
##     maxit = 1000)
##
## Coefficients:
##   (Intercept)         age educationPS educationS  statusquo
## N   0.3002851 0.004829029   0.4101765 -0.1526621 -1.7583872
## U   0.8722750 0.020030032  -1.0293079 -0.6743729  0.3261418
## Y   0.5093217 0.016697208  -0.4419826 -0.6909373  1.8752190
##
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## N   0.3315229 0.006742834   0.2659012  0.2098064 0.1292517
## U   0.3183088 0.006630914   0.2822363  0.2035971 0.1059440
## Y   0.3333254 0.006915012   0.2836015  0.2131728 0.1197440
##
## Residual Deviance: 4264.833
## AIC: 4294.833

# Set a different level as the reference (N) for facilitating interpretations
Chile$vote <- relevel(Chile$vote, ref = "N")


mod2 <- multinom(vote ~ age + education + statusquo, data = Chile,
                 maxit = 1e3)
## # weights:  24 (15 variable)
## initial  value 3476.826258
## iter  10 value 2393.713801
## iter  20 value 2134.438912
## final  value 2132.416452
## converged
summary(mod2)
## Call:
## multinom(formula = vote ~ age + education + statusquo, data = Chile,
##     maxit = 1000)
##
## Coefficients:
##   (Intercept)         age educationPS educationS statusquo
## A  -0.3002035 -0.00482911  -0.4101274  0.1525608  1.758307
## U   0.5720544  0.01519931  -1.4394862 -0.5217093  2.084491
## Y   0.2091397  0.01186576  -0.8521205 -0.5382716  3.633550
##
## Std. Errors:
##   (Intercept)         age educationPS educationS statusquo
## A   0.3315153 0.006742654   0.2658887  0.2098012 0.1292494
## U   0.2448452 0.004819103   0.2116375  0.1505854 0.1091445
## Y   0.2850655 0.005700894   0.2370881  0.1789293 0.1316567
##
## Residual Deviance: 4264.833
## AIC: 4294.833
exp(coef(mod2))
##   (Intercept)       age educationPS educationS statusquo
## A   0.7406675 0.9951825   0.6635657  1.1648133  5.802607
## U   1.7719034 1.0153154   0.2370495  0.5935052  8.040502
## Y   1.2326171 1.0119364   0.4265095  0.5837564 37.846937
# Some highlights:
# - intercepts do not have too much interpretation (correspond to age = 0).
#   A possible solution is to center age by its mean (so age = 0 would
#   represent the mean of the ages)
# - both age and statusquo increase the probability of voting Y, A or U
#   with respect to voting N -> conservativeness increases with age
# - both age and statusquo increase more the probability of voting Y and U
#   than A -> elderly and status quo supporters are more decided to participate
# - a PS level of education increases the probability of voting N. Same for
#   a S level of education, but more prone to A

# Prediction of votes - three profiles of voters
newdata <- data.frame(age = c(23, 40, 50),
                      education = c("PS", "S", "P"),
                      statusquo = c(-1, 0, 2))

# Probabilities of belonging to each class
predict(mod2, newdata = newdata, type = "probs")
##             N           A          U          Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313

# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
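As a sanity check of (C.4) (a minimal sketch, assuming mod2 and newdata from above), the log-ratio of the predicted probabilities of U versus the reference N for the first voter profile should match the linear predictor built from coef(mod2):

# Log-ratio of the predicted probabilities for the first profile
probs <- predict(mod2, newdata = newdata, type = "probs")
log(probs[1, "U"] / probs[1, "N"])

# Linear predictor in (C.4) for that profile:
# (Intercept), age = 23, educationPS = 1, educationS = 0, statusquo = -1
sum(coef(mod2)["U", ] * c(1, 23, 1, 0, -1))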

Multinomial logistic regression will suffer from numerical instabilities, and its iterative algorithm might even fail to converge, if the levels of the categorical variable are very separated (e.g., two clearly separated data clouds, each corresponding to a different level of the categorical variable).

The multinomial model employs (𝐽 − 1)(𝑘 + 1) parameters. It is easy to end up with complex models – that require a large sample size to be fitted properly – if the response has more than a few levels and there are several predictors. For example, with 5 levels and 8 predictors we will have 36 parameters. Estimating this model with 50-100 observations will probably result in overfitting.


Appendix D

Reporting with R and R Commander

A nice feature of R Commander is that it integrates seamlessly with R Markdown, which is able to create .html, .pdf and .docx reports directly from the outputs of R. Depending on the kind of report that we want, we will need the following auxiliary software [1]:

• .html. No extra software is required.
• .docx and .rtf. You must install Pandoc, a document converter software. Download it here.
• .pdf (only recommended for experts). An installation of LaTeX, additionally to Pandoc, is needed. Download LaTeX here.

The workflow is simple. Once you have done some statistical analysis, either by using R Commander's menus or R code directly, you will end up with an R script, on the 'R Script' tab, that contains all the commands you have run so far. Switch then to the 'R Markdown' tab and you will see the commands you have entered in a different layout, which essentially encapsulates the code into chunks delimited by ```{r} and ```. This will generate a report once you click on the 'Generate report' button.

Let's illustrate this process through an example. Suppose we were analyzing the Boston dataset, as we did in Section 3.1.2. Ideally [2], our final script would be something like this:

[1] Alternatively, 'Tools' -> 'Install auxiliary software [if not already installed]' will redirect you to the download links for the auxiliary software.
[2] This is, assuming we have performed the right steps in the analysis without making any mistake.

# A simple and non-exhaustive analysis for the price of the houses in the Boston
# dataset. The purpose is to quantify, by means of a multiple linear model,
# the effect of 14 variables in the price of a house in the suburbs of Boston.

# Import data
library(MASS)
data(Boston)

# Make a multiple linear regression of medv in the rest of variables
mod <- lm(medv ~ ., data = Boston)
summary(mod)

# Check the linearity assumption
plot(mod, 1)  # Clear non-linearity

# Let's consider the transformations given in Harrison and Rubinfeld (1978)
modTransf <- lm(I(log(medv * 1000)) ~ I(rm^2) + age + log(dis) +
                  log(rad) + tax + ptratio + I(black / 1000) +
                  I(log(lstat / 100)) + crim + zn + indus + chas +
                  I((10 * nox)^2), data = Boston)
summary(modTransf)

# The non-linearity is more subtle now
plot(modTransf, 1)

# Look for the best model in terms of the BIC
modTransfBIC <- stepwise(modTransf)
summary(modTransfBIC)

# Let's explore the most significant variables, to see if the model can be
# reduced drastically in complexity
mod3D <- lm(I(log(medv * 1000)) ~ I(log(lstat / 100)) + crim, data = Boston)
summary(mod3D)

# With only 2 variables, we explain the 72% of variability.
# Compared with the 80% with 10 variables, it is an important improvement
# in terms of simplicity.

# Let's add these variables to the dataset, so we can call scatterplotMatrix
# and scatter3d through R Commander's menu
Boston$logMedv <- log(Boston$medv * 1000)
Boston$logLstat <- log(Boston$lstat / 100)

# Visualize the pair-by-pair relations of the response and two predictors
scatterplotMatrix(~ crim + logLstat + logMedv, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE,
                  levels = c(.5, .9), id.n = 0, diagonal = 'histogram',
                  data = Boston)

# Visualize the full relation between the response and the two predictors
scatter3d(logMedv ~ crim + logLstat, data = Boston, fit = "linear",
          residuals = TRUE, bg = "white", axis.scales = TRUE, grid = TRUE,
          ellipsoid = FALSE)

This contains all the major points in the analysis, which can now be expanded and detailed. You can download the script here, open it through 'File' -> 'Open script file...' and run it by yourself in R Commander. If you do so, and then switch to the R Markdown tab, you will see this:

---title: "Replace with Main Title"author: "Your Name"date: "AUTOMATIC"---

```{r, echo = FALSE, message = FALSE}# Include this code chunk as-is to set optionsknitr::opts_chunk$set(comment=NA, prompt=TRUE)library(Rcmdr)library(car)library(RcmdrMisc)```

```{r, echo = FALSE}# Include this code chunk as-is to enable 3D graphslibrary(rgl)knitr::knit_hooks$set(webgl = hook_webgl)```

```{r}
# A simple and non-exhaustive analysis for the price of the houses in the Boston
```

```{r}
# dataset. The purpose is to quantify, by means of a multiple linear model,
```

```{r}
# the effect of 14 variables in the price of a house in the suburbs of Boston.
```

```{r}
# Import data
```

```{r}
library(MASS)
```

```{r}
data(Boston)
```

```{r}
# Make a multiple linear regression of medv in the rest of variables
```

```{r}
mod <- lm(medv ~ ., data = Boston)
```

```{r}
summary(mod)
```

[More outputs - omitted]

The complete, lengthy, file can be downloaded here. This is an R Markdown file, which has extension .Rmd. As you can see, by default, R Commander will generate a code chunk like

```{r}
code line
```

for each code line you run in R Commander. You will probably want to modify this crude report manually by merging chunks of code, removing comments or adding more information in between chunks of code. To do so, go to 'Edit' -> 'Edit Markdown document'. Here you can also remove unnecessary chunks of code resulting from any mistake or irrelevant analyses.

The following file (download) could be a final report. Pay attention to the numerous changes with respect to the previous one:

---title: "What makes a house valuable?"subtitle: "A reproducible analysis in the Boston suburbs"author: "Outstanding student 1, Awesome student 2 and Great student 3"date: "31/11/16"---

Page 283: Lab notes for Statistics for Social Sciences II ...

283

```{r, echo = FALSE, message = FALSE, warning = FALSE}# include this code chunk as-is to set optionsknitr::opts_chunk$set(comment=NA, prompt=TRUE)library(Rcmdr)library(car)library(RcmdrMisc)```

```{r, echo = FALSE, message = FALSE, warning = FALSE}# include this code chunk as-is to enable 3D graphslibrary(rgl)knitr::knit_hooks$set(webgl = hook_webgl)```

This short report shows a simple and non-exhaustive analysis for the price ofthe houses in the `Boston` dataset. The purpose is to quantify, by means of amultiple linear model, the effect of 14 variables in the price of a house inthe suburbs of Boston.

We start by importing the data into R and considering a multiple linearregression of `medv` (median house value) in the rest of variables:```{r}# Import datalibrary(MASS)data(Boston)```

```{r}mod <- lm(medv ~ ., data = Boston)summary(mod)```The variables `indus` and `age` are non-significant in this model. Also,although the adjusted R-squared is high, there seems to be a clearnon-linearity:```{r}plot(mod, 1)```

In order to bypass the non-linearity, we are going to consider the non-linear transformations given in Harrison and Rubinfeld (1978) for both the response and the predictors:
```{r}
modTransf <- lm(I(log(medv * 1000)) ~ I(rm^2) + age + log(dis) +
                  log(rad) + tax + ptratio + I(black / 1000) +
                  I(log(lstat / 100)) + crim + zn + indus + chas +
                  I((10*nox)^2), data = Boston)
summary(modTransf)
```
The adjusted R-squared is now higher and, what is more important, the non-linearity is now more subtle (it is still not linear, but closer than before):
```{r}
plot(modTransf, 1)
```

However, `modTransf` has more non-significant variables. Let's see if we can improve on the previous model by removing some of them. To do so, we look for the best model in terms of the Bayesian Information Criterion (BIC) by `stepwise`:
```{r}
modTransfBIC <- stepwise(modTransf, trace = 0)
summary(modTransfBIC)
```
The resulting model has a slightly higher adjusted R-squared than `modTransf`, with all the variables significant.

We explore the most significant variables to see if the model can be reduced drastically in complexity.
```{r}
mod3D <- lm(I(log(medv * 1000)) ~ I(log(lstat / 100)) + crim, data = Boston)
summary(mod3D)
```

It turns out that **with only 2 variables, we explain the 72% of variability**. Compared with the 80% with 10 variables, it is an important improvement in terms of simplicity: the logarithm of `lstat` (percent of lower status of the population) and `crim` (crime rate) alone explain the 72% of the variability in the house prices.

We add these variables to the dataset, so we can call `scatterplotMatrix` and `scatter3d` through R Commander,
```{r}
Boston$logMedv <- log(Boston$medv * 1000)
Boston$logLstat <- log(Boston$lstat / 100)
```
and conclude with the visualization of:

1. the pair-by-pair relations of the response and the two predictors;
2. the full relation between the response and the two predictors.
```{r, warning = FALSE}
# 1
scatterplotMatrix(~ crim + logLstat + logMedv, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE,
                  levels = c(.5, .9), id.n = 0, diagonal = 'histogram',
                  data = Boston)
```
```{r, webgl = TRUE}
# 2
scatter3d(logMedv ~ crim + logLstat, data = Boston, fit = "linear",
          residuals = TRUE, bg = "white", axis.scales = TRUE, grid = TRUE,
          ellipsoid = FALSE)
```

When we click on 'Generate report' for the above R Markdown file, we shouldget the following output files:

• .html: visualize and download. Once it is produced, this file is difficult to modify, but very easy to distribute (anyone with a browser can see it).

• .docx: visualize and download. Easy to modify in a document processor like Microsoft Office. Easy to distribute.

• .rtf: download. Easy to modify in a document processor, but not very elegant.

• .pdf: visualize and download. Elegant and easy to distribute, but hard to modify once it is produced.

For advanced users, there is a lot of information on mastering R Markdown here, using RStudio, a more advanced framework than R Commander.
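For instance, the same report can be rendered to several of the formats listed above directly from the R console with the rmarkdown package; a minimal sketch, assuming the report is saved as `report.Rmd` in the working directory:

```{r, eval = FALSE}
library(rmarkdown)

# Render the (hypothetical) report.Rmd to different output formats
render("report.Rmd", output_format = "html_document")
render("report.Rmd", output_format = "word_document")
render("report.Rmd", output_format = "pdf_document")  # requires a LaTeX installation
```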


Appendix E

Group project

Groups

You will team up in groups of 3 to 5 members. It is up to you to form the groups based on your grade expectations, affinity, complementary skills, etc. You must communicate the group compositions no later than November the 30th by a (single) email to [email protected] detailing the members of the group and, if you have it, a preliminary description of the topic of the project (~ 3 lines).

Aim of the project

You will analyze a real dataset of your choice using the statistical methodology that we have seen in the lessons and labs. The purpose is to demonstrate that you know how to apply and interpret some of the studied statistical techniques (such as simple/multiple linear regression, logistic regression, or any other methods covered in the course) in a real-case scenario that is appealing to you.

Structure of the report

Use the following mandatory structure when writing your report:

0. Abstract. Provide a concise summary of the project. It must not exceed 250 words.

1. Introduction. State the problem to be studied. Provide some context, the question(s) that you want to address, a motivation of its importance, references, etc. Remember how we introduced the case studies covered in the course as a template (but you will need to elaborate more).

2. Statistical analysis. Make use of some of the aforementioned statistical techniques, those that are most convenient for your particular case study. You can choose between covering several at a more superficial level, or one or two in more depth. Justify their adequacy and obtain analyses, explaining how you did it, in the form of plots and summaries. Provide a critical discussion of the outputs and give insights about them.

3. Conclusions. Summary of what was addressed in the project and of the most important conclusions. Takeaway messages. The conclusions are not required to be spectacular, but fair and honest in terms of what you discovered.

4. References. Refer to the sources of information that you have employed (for the data, for information on the data, for the statistical analyses, etc.).

Mandatory format guidelines:

• Structure: title, authors, abstract, first section, second section and so on. Like this. Do not use a cover.

• Font size: 12 points.

• Spacing: single space, single column.

• Length limit: less than 5000 words and 15 pages. You are not required to make use of all the space.

Grading

All students in a group will be graded evenly. Take this into account when forming the group. The grading is on a scale of 0-10 (plus 2 bonus points) and will be performed according to the following breakdown:

• Originality of the problem studied and data acquisition process (up to 2 points).

• Statistical analyses presented and their depth (up to 3 points). At least two different techniques should be employed (simple and multiple linear regression count as different, but the use of other techniques as well is strongly encouraged). Graded depending on their adequacy to the problem studied and the evidence you demonstrate about your knowledge of them.

• Accuracy of the interpretation of the analyses (up to 2 points). Graded depending on the detail and rigor of the insights you elaborate from the statistical analyses.

• Reproducibility of the study (1.5 points). Awarded if the code for reproducing the study, as well as the data, is provided in a ready-to-use way (e.g. the outputs from R Commander’s report mode along with the data).

• Presentation of the report (1.5 points). This involves the correct usage of English, the readability, the conciseness and the overall presentation quality.

• Excellence (2 bonus points). Awarded for creativity of the analysis, use of advanced statistical tools, use of points briefly covered in lessons/labs, advanced insights into the methods, completeness of the report, use of advanced presentation tools, etc. Only awarded if the sum of regular points is above 7.5.

The ratio “quality of the project”/“group size” might be taken into account in extreme cases (e.g. a poor report written by 5 people, an extremely good report written by 3 people).

Academic fraud

Evidence of academic fraud will have serious consequences, such as a zero grade for the whole group and the reporting of the fraud to the pertinent academic authorities. Academic fraud includes (but is not limited to) plagiarism, use of sources without proper credit, project outsourcing, and the use of external tutoring not mentioned explicitly.

Tips

• Think about a topic that could be reused for other subjects, or take inspiration from previous projects you did. In that way, this project could serve as the quantification of another subject’s project. If you do this, add an explicit mention in the report.

• Data sources. Here are some useful data sources:
  – A list of all the datasets included in R. See Section 2.9.1 of the lab notes for how to load them (a minimal sketch is also given after this list).
  – Some datasets employed in the course.
  – The World Bank contains a huge collection of economic and sociological variables for countries and regions, for long periods of time.
  – SIPRI contains several databases about international transfers of arms.
  – The Global Health Observatory is the World Health Organization’s main health statistics repository.
  – Sport statistics (teams, players) are a great source if you like sports. Sport webpages usually have a section on statistics.

• Inspiration for the project’s topic.
  – The case studies covered (and left as exercises) in the lab notes might serve as a good starting point for defining a project.
  – The Economist usually has some good and up-to-date political/economic analyses that could serve as motivation.
  – Try to quantify the impact on society of certain laws (traffic, education, gender violence, etc.).
  – Is there a continuous variable that you would like to predict from others? (linear regression)
  – Is there a binary variable that you would like to predict from others? (logistic regression)
  – Would you like to assess which combination of variables explains most of the variability of your data, so you can visualize it in 2D or 3D? (principal component analysis)
  – Would you like to aggregate individuals according to several characteristics in order to classify them? (clustering)

• Use R Commander’s report mode (Appendix D) to simplify the generation of graphs and summaries directly from the statistical analysis. Use that code to make the analysis reproducible.

• Make use of office hours before it is too late.

• Pro-tip: if you come to my office with a printed draft of the project, I can provide you some quick feedback on what could be improved.
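As referenced in the data sources above, the datasets bundled with R can be listed and loaded directly from the console; a minimal sketch (the `swiss` dataset is just an example):

```{r, eval = FALSE}
# List all the datasets available in the attached packages
data()

# Load one of them into the workspace and take a quick look, e.g. swiss
data(swiss)
head(swiss)
summary(swiss)
```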

Deadline

Submit the reports before December the 23rd at 16:59 through Aula Global. Not by email. Reports received after the deadline will not be evaluated.


Bibliography

Allaire, J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., and Hyndman, R. (2016). rmarkdown: Dynamic Documents for R. R package version 1.0.9013.

Ashenfelter, O., Ashmore, D., and Lalonde, R. (1995). Bordeaux wine vintage quality and the weather. CHANCE, 8(4):7–14.

Bartholomew, D. J., Steele, F., Galbraith, J., and Moustaki, I. (2008). Analysis of multivariate social science data. CRC Press.

Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84(408):945–957.

Fox, J. (2005). The R Commander: A basic statistics graphical user interface to R. Journal of Statistical Software, 14(9):1–42.

Grimmett, G., Laslier, J.-F., Pukelsheim, F., Ramirez Gonzalez, V., Rose, R., Slomczynski, W., Zachariasen, M., and Życzkowski, K. (2011). The allocation between the EU member states of the seats in the European Parliament - Cambridge Compromise. Technical report.

Hand, D. J., Daly, F., McConway, K., Lunn, D., and Ostrowski, E. (1994). A handbook of small data sets. CRC Press.

Harrison, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102.

Herndon, T., Ash, M., and Pollin, R. (2013). Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning, volume 103 of Springer Texts in Statistics. Springer, New York. With applications in R.

Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8):114–117.


OECD (2012a). Does money buy strong performance in PISA? PISA in Focus, (13):1–4.

OECD (2012b). PISA 2012 Results: What Students Know and Can Do (Volume I, Revised edition, February 2014): Student Performance in Mathematics, Reading and Science. OECD Publishing, Paris.

Peña, D. (2002). Análisis de Datos Multivariantes. McGraw-Hill, Madrid.

Presidential Commission on the Space Shuttle Challenger Accident (1986). Report of the Presidential Commission on the Space Shuttle Challenger Accident (Vols. 1 & 2). Washington, DC.

R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Reinhart, C. M. and Rogoff, K. S. (2010). Growth in a time of debt. Working Paper 15639, National Bureau of Economic Research.

The World Bank, W. D. I. (2012). GDP per capita.

Xie, Y. (2016a). bookdown: Authoring Books with R Markdown. R package version 0.1.6.

Xie, Y. (2016b). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.14.