
Statistics in Climate Sciences I + II, 2013–2014

Part I: Multivariate Statistics

Part II: Time Series

Part III: Statistical analysis of extreme values

Dr. Michel Piot


Bibliography

Agresti, A. and C. Franklin (2009). Statistics: The Art and Science of Learning from Data (Second ed.). London: Pearson – Prentice Hall.

Auer, I. et al. (2007). Histalp – historical instrumental climatological surface time series of the Greater Alpine Region. International Journal of Climatology, 17–46.

Begert, M. (2008). Die Repräsentativität der Stationen im Swiss National Basic Climatological Network (Swiss NBCN). Arbeitsberichte der MeteoSchweiz, 217. Bundesamt für Meteorologie und Klimatologie.

Brockwell, P. J. and R. A. Davis (2002). Introduction to Time Series and Forecasting (Second ed.). New York: Springer Verlag.

Ceppi, P., P. M. Della-Marta, and C. Appenzeller (2008). Extreme Value Analysis of Wind Speed Observations over Switzerland. Arbeitsberichte der MeteoSchweiz, 219. Zurich: Bundesamt für Meteorologie und Klimatologie.

Everitt, B. S. and A. Skrondal (2010). The Cambridge Dictionary of Statistics (Fourth ed.). Cambridge: Cambridge University Press.

Freund, R. J. (1979). Multicollinearity etc., some "new" examples. American Statistical Association Proceedings of Statistical Computing Section, 111–112.

Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 453–467.

Gabriel, K. R. (1972). Analysis of meteorological data by means of canonical decomposition and biplots. Journal of Applied Meteorology, 1071–1077.

Johnson, R. A. and G. K. Bhattacharyya (2011). Statistics: Principles and Methods (Sixth ed.). International Student Version. Wiley.

Johnson, R. A. and D. W. Wichern (2007). Applied Multivariate Statistical Analysis (Sixth ed.). Upper Saddle River: Pearson International Edition.

Katz, R. W., M. B. Parlange, and P. Naveau (2002). Statistics of extremes in hydrology. Advances in Water Resources, 1287–1304.

Kaufmann, P. and C. D. Whiteman (1999). Cluster-analysis classification of wintertime wind patterns in the Grand Canyon region. Journal of Applied Meteorology, 1131–1147.

Keller, C. (1921). Schweiz. Junk's Natur-Führer. Berlin: Verlag von W. Junk.

Kockelkorn, U. (2012). Statistik für Anwender. Berlin: Springer Spektrum.

MeteoSchweiz (2013). Klimabulletin Jahr 2012. Zurich: MeteoSchweiz.

Milton, J. S. and J. C. Arnold (1995). Introduction to Probability and Statistics (Third ed.). New York: McGraw-Hill.

OcCC (2008). Das Klima ändert – was nun? Der neue UN-Klimabericht (IPCC 2007) und die wichtigsten Ergebnisse aus Sicht der Schweiz. Bern: OcCC – Organe consultatif sur les changements climatiques.

Pruscha, H. (2006). Statistisches Methodenbuch. Berlin: Springer Verlag.

Reiss, R.-D. and M. Thomas (2007). Statistical Analysis of Extreme Values (Third ed.). Basel: Birkhäuser.

Rencher, A. C. (1995). Methods of Multivariate Analysis. New York: John Wiley & Sons.

Schlittgen, R. and B. H. J. Streitberg (2001). Zeitreihenanalyse (Ninth ed.). München: Oldenbourg Verlag.

Schuenemeyer, J. H. and L. J. Drew (2011). Statistics for Earth and Environmental Scientists. New York: Wiley.

Seber, G. A. F. (1984). Multivariate Observations. New York: Wiley Series in Probability and Statistics.

Sharma, S. (1996). Applied Multivariate Techniques. New York: John Wiley & Sons.

Swiss Federal Office of Energy (2013). Schweizerische Gesamtenergiestatistik 2012. Bern: Swiss Federal Office of Energy.

von Storch, H. and F. W. Zwiers (2001). Statistical Analysis in Climate Research. Cambridge: Cambridge University Press.

Wilks, D. S. (2006). Statistical Methods in the Atmospheric Sciences (Second ed.). Burlington: Academic Press.

Xoplaki, E., J. Luterbacher, R. Burkard, I. Patrikas, and P. Maheras (2000). Connection between the large-scale 500 hPa geopotential height fields and precipitation over Greece during wintertime. Climate Research, 129–146.

Zwiers, F. W. and H. von Storch (2004). On the role of statistics in climate research. International Journal of Climatology, 665–680.


Contents

I Multivariate Statistics

1 Introduction
  1.1 Some Notation
  1.2 Objectives
  1.3 Data Sets and Applications
  1.4 Classification of Multivariate Methods

2 Linear Algebra
  2.1 Introduction
  2.2 Supplement

3 Sample Geometry and Random Sampling
  3.1 Geometry of the Sample
  3.2 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
  3.3 Generalized Variance
  3.4 Sample Mean, Covariance, and Correlation as Matrix Operations
  3.5 Sample Values of Linear Combinations of Variables

4 Multivariate Normal Distribution
  4.1 Introduction
  4.2 Supplement
  4.3 Detecting Outliers

5 Inferences about a Mean Vector
  5.1 Plausibility of µ0 as a Value for a Normal Population Mean
  5.2 Confidence Regions and Simultaneous Comparisons of Component Means
  5.3 Large Sample Inference about a Population Mean Vector

6 Comparisons of Several Multivariate Means
  6.1 Paired Comparisons
  6.2 Comparing Mean Vectors from Two Populations
  6.3 Comparing Several Multivariate Population Means (One-way MANOVA)

7 Linear Regression Models
  7.1 Multiple Linear Regression
  7.2 Least Squares Estimation
  7.3 Inferences about the Regression Model
  7.4 Inferences from the Estimated Regression Function
  7.5 Model Checking and other Aspects of Regression

8 Principal Components Analysis
  8.1 Introduction
  8.2 Population Principal Components
  8.3 Summarizing Sample Variation by Principal Components

9 Canonical Correlation Analysis
  9.1 Introduction
  9.2 Canonical Variates and Canonical Correlations
  9.3 Interpreting the Population Canonical Variables
  9.4 Sample Canonical Variates and Sample Canonical Correlations
  9.5 Canonical Correlation Analysis applied to Fields and Forecasting with Canonical Correlation Analysis

10 Discrimination and Classification
  10.1 Introduction
  10.2 Separation and Classification for Two Populations

11 Clustering, Distance Methods and Ordination
  11.1 Introduction
  11.2 Similarity Measures
  11.3 Hierarchical Clustering Methods
  11.4 Nonhierarchical Clustering Methods
  11.5 Clustering based on Statistical Models
  11.6 Multidimensional Scaling
  11.7 Biplots for viewing Sampling Units and Variables

II Time Series

12 Introduction
  12.1 Simple Time Series Models
  12.2 Stationary Models and the Autocorrelation Function
  12.3 Estimation and Elimination of Trend and Seasonal Components
  12.4 Testing the Estimated Noise Sequence

13 Stationary Processes
  13.1 Basic Properties
  13.2 Linear Processes
  13.3 Introduction to Autoregressive Moving Average (ARMA) Processes
  13.4 Forecasting Stationary Time Series

14 ARMA Models
  14.1 ARMA(p, q) Processes
  14.2 Autocorrelation Function and Partial Autocorrelation Function of an ARMA(p, q) Process
  14.3 Forecasting ARMA Processes

15 Modeling and Forecasting with ARMA Processes
  15.1 Preliminary Estimation
  15.2 Maximum Likelihood Estimation
  15.3 Diagnostic Checking
  15.4 Forecasting
  15.5 Order Selection

16 Nonstationary and Seasonal Time Series Models
  16.1 ARIMA Models
  16.2 SARIMA Models

III Statistical analysis of extreme values

17 Statistical Analysis of Extreme Values
  17.1 Applications of statistical analysis of extreme values
  17.2 Modeling by Extreme Value Distributions
  17.3 Modeling by Generalized Pareto Distributions


Part I

Multivariate Statistics


1 Introduction

These lecture notes are intended solely for the individual use of the students attending the course "Statistics in Climate Sciences". The text in this first part is a summary of Johnson and Wichern (2007), which is highly recommended as a reference book for multivariate statistical analysis. The more applied and therefore complementary textbook of Schuenemeyer and Drew (2011) gives a comprehensive treatment of statistical applications for solving real-world environmental problems.

Of course there are numerous other textbooks dealing with multivariate methods in statistics. If the focus is set on a mathematical approach, consider Seber (1984). Rencher (1995) can be used as an alternative to Johnson and Wichern (2007). Finally, several textbooks address questions from the environmental and atmospheric sciences, e.g. Wilks (2006) and von Storch and Zwiers (2001).

To understand the methods of multivariate statistics, at least one course in univariate statistics is required. Several textbooks can be recommended for reviewing the basics of statistics: Agresti and Franklin (2009) choose a concept-driven approach, which places more emphasis on why statistics is important in the real world and less emphasis on probability. A nice book in German with a modern design is Kockelkorn (2012). Two traditional introductions to the principles and methods of statistics are Milton and Arnold (1995) and Johnson and Bhattacharyya (2011).

1.1 Some Notation

A critical distinction for the analyst to make is sample versus population. A unit is a single object whose characteristics are of interest. A population comprises all units of interest in a study. In most earth science applications, the population is large to infinite; in air quality studies, it may be the troposphere. A sample is a subset of a population. A statistic is a number derived from a sample. The method used to obtain a sample (the sampling plan) determines the type of inferences that can be made. Generally, in earth science applications, the sample size will be small with respect to the population size.

A random variable X associates a numerical value with each outcome of an experiment. The word "random" serves as a reminder of the fact that, beforehand, we do not know the outcome of an experiment or its associated value of X. Source: Johnson and Bhattacharyya (2011).

A random sample is either a set of n independent and identically distributed random variables, or a sample of n individuals selected from a population in such a way that each sample of the same size is equally likely. Source: Everitt and Skrondal (2010).

The notations that are used in these lecture notes to represent populations and samples are those commonly used in the statistics literature. Statistics involves the use of random variables. There are two types of random variables, continuous and discrete.


• An uppercase italic letter denotes a random variable. For example, X may denote the actual temperature in Switzerland.

• A lowercase italic letter refers to a specific element of a population, for example x. A sample of size n with actual temperatures at different stations in Switzerland is x1, . . . , xn.

• A boldface uppercase italic letter denotes a random vector, e.g. X = (X1, . . . , Xp)′. A boldface lowercase italic letter refers to a specific element of a population, for example x.

• A boldface uppercase letter denotes a matrix, e.g. X.

• Population attributes are generally unknown and are usually denoted by Greek letters. For example, the population mean and standard deviation of the actual temperature in Switzerland are typically denoted by µ and σ, respectively.

• Statistics are typically designated by putting a "hat" over the parameter, as in µ̂ and σ̂ for the sample mean and sample standard deviation, respectively, or with upper- or lowercase italic letters. For example, X̄ is the mean of a sample of X's, and S may be used to represent the sample standard deviation; x̄ and s represent specific values. Both the hat notation and the italic letters are used in these lecture notes.

1.2 Objectives

Multivariate analysis consists of a collection of methods that can be used when several measurements are made on each individual or object in one or more samples. We will refer to the measurements as variables and to the individuals or objects as units or subjects.

The objectives of scientific investigations to which multivariate methods naturally lend themselves include the following:

1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.

2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.

3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?

4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.


5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

1.3 Data Sets and Applications

1.3.1 Random Sampling and Multivariate Normal Distribution: Climate Time Series in Europe between 1900-2002

Example (Climate Time Series). The Histalp database consists of monthly homogenized records of temperature, pressure, precipitation, sunshine and cloudiness for the Greater Alpine Region (see Figure 1.1). The longest temperature and air pressure series extend back to 1760, precipitation to 1800, cloudiness to the 1840s and sunshine to the 1880s. A systematic quality control procedure has been applied to the series, and a high number of inhomogeneities (more than 2500) and outliers (more than 5000) have been detected and removed. All Histalp data are provided free of charge.

Figure 1.1: Greater Alpine Region with all stations of the Histalp project. Source: Histalp homepage (http://www.zamg.ac.at/histalp/).

Table 1.1 shows some stations of the Histalp database for which the mean winter temperatures (December-February) are given for the years 1900-2002. Table 1.2 shows an extract of the data set. Source: Histalp database.

Further reading. Auer et al. (2007) describes the Histalp database.


Table 1.1: Longitude (°E), latitude (°N) and altitude (m) for several weather stations in Europe. Source: Histalp database.

Station                      Longitude   Latitude   Altitude
Badgastein (A)                   13.13      47.12       1100
Bern (CH)                         7.43      46.95        565
Bregenz (A)                       9.73      47.50        424
Davos (CH)                        9.85      46.78       1590
Genf (CH)                         6.15      46.20        405
Grosser St. Bernhard (CH)         7.18      45.87       2472
Graz (A)                         15.45      47.08        377
Hohenpeissenberg (D)             11.02      47.80        986
Innsbruck (A)                    11.38      47.27        609
Milano (I)                        9.00      45.47        122
Munich (D)                       11.55      48.18        525
Nice (F)                          7.20      43.65          4
Säntis (CH)                       9.35      47.25       2500
Schmittenhöhe (A)                12.73      47.33       1973
Sils Maria (CH)                   9.77      46.43       1802
Sonnblick (A)                    12.95      47.05       3105
Vienna (A)                       16.35      48.22        209

1.3.2 Comparisons of Means: Bern-Chur-Zurich

Example (Bern-Chur-Zurich). Consider the two stations Bern-Zollikofen (552 m) and Zurich-Fluntern (555 m) in the Swiss Midland and Chur (556 m) in the Swiss Alps, which have practically the same altitude. It is interesting to compare several climatic variables of these three stations. We will analyze the annual mean pressure in hPa, the annual mean temperature in °C, the annual precipitation in mm and the annual sunshine duration in hours. The data set, provided by MeteoSchweiz, covers the years 1901-2010.

Table 1.3 shows different comparisons of means of one or more than one station with one or more than one variable. Depending on the number of stations and variables of interest, different statistical methods are used.

Figure 1.2 shows a scatterplot matrix of the annual values of pressure, temperature, precipitation and sunshine duration for Bern for the period 1901-2010. Thus one station with four variables is of interest. Does the first column show a positive correlation for the pairs time-temperature and time-pressure? Obviously no trends can be found in the time-precipitation and time-sunshine duration patterns. Furthermore, the negative correlation between sunshine duration and precipitation is not surprising.


Table 1.2: Winter temperatures from 1900-1930 for some Swiss stations. Source: Histalp database.

Year   Bern   Davos   Genf   Gr. St. Bernhard   Säntis   Sils Maria
1900   -2.3    -7.5   -0.1              -9.4     -9.2         -9.2
1901   -0.7    -4.7    0.9              -7.6     -7.2         -6.4
1902   -0.9    -5.4    1.3              -7.0     -6.6         -7.1
1903   -1.5    -6.2    0.9              -9.5     -8.0         -7.9
1904   -1.4    -7.1    0.0              -8.8     -8.6         -8.4
1905   -0.9    -6.9    0.7              -7.7     -8.1         -8.5
1906   -3.0    -8.1   -1.1             -10.7    -10.7         -8.4
1907   -1.0    -5.4    0.9              -7.6     -7.5         -7.1
1908   -2.8    -8.1   -0.7              -9.8     -9.4         -9.6
1909    0.4    -4.6    2.8              -8.4     -8.1         -6.3
1910   -1.3    -6.0    1.1              -7.5     -7.5         -7.3
1911    1.7    -3.1    3.4              -6.2     -6.0         -5.4
1912    0.2    -4.8    1.8              -6.7     -6.5         -6.9
1913   -1.6    -6.0    0.2              -7.5     -7.1         -6.9
1914    0.4    -5.4    2.3              -9.9     -9.0         -7.5
1915    2.3    -3.6    3.5              -6.8     -6.6         -5.2
1916   -1.8    -6.7   -0.2              -9.8     -9.0         -8.4
1917   -2.3    -6.6   -0.2              -8.3     -7.7         -8.5
1918    0.3    -5.7    2.2              -8.2     -7.8         -7.3
1919    1.9    -4.5    3.0              -6.5     -6.1         -6.3
1920    1.0    -5.0    2.2              -6.9     -6.6         -6.4
1921   -0.4    -5.6    1.4              -8.1     -8.2         -6.9
1922    0.2    -5.7    1.8              -8.1     -8.4         -7.0
1923   -1.7    -7.5    0.5              -9.5     -9.7         -8.9
1924    0.8    -4.3    2.3              -6.7     -5.8         -6.8
1925    1.1    -4.4    2.7              -7.1     -7.3         -6.8
1926   -0.9    -6.0    0.9              -8.6     -8.1         -7.2
1927    0.8    -4.5    2.4              -7.4     -7.1         -6.8
1928   -4.2    -9.5   -1.9             -10.9    -10.9        -10.7
1929    1.4    -4.6    2.9              -8.1     -6.7         -6.8
1930   -0.3    -7.0    1.7              -9.5     -9.2         -8.1

Figure 1.3 shows the scatterplot matrix of sunshine duration for Bern, Chur and Zurich. In this case we compare one variable at three stations. Obviously there is a strong positive correlation between Bern and Zurich, whereas the relationship between Chur and the two other stations is still positive but not as strong.

To get an idea of the shapes of the distributions of the different variables for the three stations, we can use boxplots (see Figure 1.4). The pressure boxplot shows that the range and median are similar for Bern and Zurich, whereas Chur has a greater range and a lower median. A more detailed analysis of the pressure data for Chur would show that there is an inhomogeneity in the time series starting in 1980, with pressure values noticeably higher than those from 1901-1979. As a consequence, before using the pressure data for Chur it would be important to find an explanation for this behavior. The temperature boxplot suggests that the annual mean temperature is significantly higher for Chur than for Zurich and Bern, whereas the boxplot of the annual sunshine duration does not support such a statement.


Table 1.3: Overview of different methods for the comparison of means, depending on the number of samples and variables.

                          1 variable                         more than 1 variable (pressure,
                          (temperature)                      temperature, precipitation, sunshine)
                          paired         unpaired            paired            unpaired

1 sample (Bern)           —              t-test              —                 Hotelling's T²
                                         H0: µ = µ0                            H0: µ = µ0
                                         Chapter 5.1.1                         Chapter 5.1.2

2 samples (Bern,          t-test         t-test              Hotelling's T²    Hotelling's T²
Zurich)                   H0: µD = 0     H0: µ1 = µ2         H0: µD = 0        H0: µ1 = µ2
                          Chapter 6.1.1  Chapter 6.1.2       Chapter 6.2       Chapter 6.2

m > 2 samples (Bern,      —              ANOVA               —                 MANOVA
Chur, Zurich)                            H0: µ1 = ... = µm                     H0: µ1 = ... = µm
                                         Chapter 6.3.1                         Chapter 6.3.2

Remark. Homoscedasticity means that the population variances are equal; heteroscedasticity means that they are not. Boxplots are an important graphical tool to help evaluate the homoscedasticity assumption. The classic test of homoscedasticity is that of Bartlett. An alternative is Levene's test, which is less sensitive to the normality assumption. For further explanations consider Schuenemeyer and Drew (2011), p. 67f.
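As an illustration, the following Python sketch applies both tests (assuming NumPy and SciPy are available; the three samples are simulated stand-ins, not the actual MeteoSchweiz series):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # simulated stand-ins for three annual sunshine-duration series
    bern = rng.normal(1700, 120, size=110)
    chur = rng.normal(1650, 150, size=110)
    zurich = rng.normal(1680, 125, size=110)

    # Bartlett's test: classic, but sensitive to non-normality
    stat_b, p_b = stats.bartlett(bern, chur, zurich)
    # Levene's test: more robust against departures from normality
    stat_l, p_l = stats.levene(bern, chur, zurich)

    print("Bartlett: statistic = %.2f, p-value = %.3f" % (stat_b, p_b))
    print("Levene:   statistic = %.2f, p-value = %.3f" % (stat_l, p_l))
    # a small p-value speaks against the homoscedasticity assumption

A small p-value in either test would lead us to reject the hypothesis of equal population variances for the three stations.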

1.3.3 Linear Regression: Basel

(Simple) linear regression involves a response, which is a continuous variable, and a single explanatory variable. Multiple (linear) regression is a model in which a continuous response variable is regressed on a number of explanatory variables. Source: Everitt and Skrondal (2010).

Example (Basel). Basel-Binningen has a very long climate time series, starting in 1755 with temperature and since 1864 also with pressure, precipitation, humidity, cloudiness and heating degree days. The data set is provided by MeteoSchweiz. Figure 1.5 shows the relationships between the different variables. We will use this data set not only for linear regression but also for time series analysis in Part II.

1.3.4 Principal Components Analysis: Weather Report 2007

Principal components analysis is a procedure for analyzing multivariate data which transforms the original variables into new ones that are uncorrelated and account for decreasing proportions of the variance in the data. The aim of the method is to reduce the dimensionality of the data. The new variables, the principal components, are defined as linear functions of the original variables. Source: Everitt and Skrondal (2010).



Figure 1.2: Scatterplot matrix of the four variables of Bern for the time period 1901-2010. Data set: Bern-Chur-Zurich.

Example (Weather Report). MeteoSchweiz publishes annual weather reports which are a summary of the last year's weather. In these lecture notes we will often refer to the data of the Weather Report 2007. Table 1.4 shows several climatic variables for fifty Swiss stations in the year 2007:

• Sunshine: sunshine duration, norm in % (percentage compared to the period 1960-1990), relative (percentage of possible sunshine).

• Air temperature: mean, norm in % (percentage compared to the period 1960-1990), absolute minimum (lowest temperature of the year with the corresponding date), absolute maximum (highest temperature of the year with the corresponding date).

• Heating degree days (when the daily mean temperature falls below 12 °C, the heating degree days are the difference between 20 °C and the mean temperature).

• Precipitation: sum, norm in % (percentage compared to the period 1960-1990), maximum in one day with the corresponding date, number of days with more than 0.9 mm.



Figure 1.3: Scatterplot matrix of sunshine duration for Bern, Chur and Zurich for the time period 1901-2010. Data set: Bern-Chur-Zurich.

1.3.5 Canonical Correlation Analysis: Soil Evaporation

Canonical correlation analysis is a method of analysis for investigating the relationship between two groups of variables, by finding linear functions of one of the sets of variables that maximally correlate with linear functions of the variables in the other set. Source: Everitt and Skrondal (2010).

Example (Soil Evaporation). Table 1.5 shows an extract of 46 consecutive days from June 6 through July 21. The observed variables are maximum (maxst), minimum (minst), and average soil temperature (avst); maximum (maxat), minimum (minat),


Table 1.4: Weather Report 2007 for Switzerland. Source: MeteoSchweiz Jahreswetterbericht 2007.



Figure 1.4: Boxplots of annual mean pressure, mean temperature, precipitation and sunshine duration for Bern, Chur and Zurich for the time period 1901-2010. Data set: Bern-Chur-Zurich.

and average air temperature (avat); maximum (maxh), minimum (minh), and average relative humidity (avh); total wind in miles per day (wind) and the daily amount of evaporation from the soil (evap). The three "average" measurements are integrated: average soil temperature is the integrated area under the daily soil temperature curve, average air temperature is the integrated area under the daily air temperature curve, and average relative humidity is the integrated area under the daily relative humidity curve.

1.3.6 Discrimination and Classification: El Niño

With discriminant analysis we aim to assess whether or not a set of variables distinguishes or discriminates between two (or more) groups of individuals. The discrimination results in a classification rule (often also known as an allocation rule) that may be used to assign a new observation to one of the groups. Source: Everitt and Skrondal (2010).


Table 1.5: Extract of a data set with soil and air temperatures, humidity, wind and evaporation variables. Source: Freund (1979).

day  month  maxst  minst  avst  maxat  minat  avat  maxh  minh  avh  wind  evap
  6      6     84     65   147     85     59   151    95    40   398   273    30
  7      6     84     65   149     86     61   159    94    28   345   140    34
  8      6     79     66   142     83     64   152    94    41   388   318    33
  9      6     81     67   147     83     65   158    94    50   406   282    26
 10      6     84     68   167     88     69   180    93    46   379   311    41
 11      6     74     66   131     77     67   147    96    73   478   446     4
 12      6     73     66   131     78     69   159    96    72   462   294     5
 13      6     75     67   134     84     68   159    95    70   464   313    20
 14      6     84     68   161     89     71   195    95    63   430   455    31
 15      6     86     72   169     91     76   206    93    56   406   604    38
 16      6     88     73   178     91     76   208    94    55   393   610    43
 17      6     90     74   187     94     76   211    94    51   385   520    47
 18      6     88     72   171     94     75   211    96    54   405   663    45
 19      6     88     72   171     92     70   201    95    51   392   467    45
 20      6     81     69   154     87     68   167    95    61   448   184    11
 21      6     79     68   149     83     68   162    95    59   436   177    10
 22      6     84     69   160     87     66   173    95    42   392   173    30
 23      6     84     70   160     87     68   177    94    44   392    76    29
 24      6     84     70   168     88     70   169    95    48   398    72    23
 25      6     77     67   147     83     66   170    97    60   431   183    16

Example (El Niño). El Niño is defined by prolonged warming of the Pacific Ocean sea surface temperatures compared with the average value. The accepted definition is a warming of at least 0.5 °C averaged over the east-central tropical Pacific Ocean. Typically, this anomaly happens at irregular intervals of two to seven years and lasts nine months to two years; the average period length is five years. When this warming occurs for only seven to nine months, it is classified as an El Niño "condition"; when it occurs for longer, it is classified as an El Niño "episode". Similarly, La Niña conditions and episodes are defined for cooling. Source: Wikipedia.

In Ecuador the El Niño phenomenon is a very important factor for fishermen, so one of the questions is how an El Niño year can be predicted. At the beginning we analyze the past: we have measured the mean temperature, the precipitation and the mean pressure of June at Guayaquil, and we know whether it was an El Niño or a La Niña year. This step, where we form two classes, is called discrimination (or separation) of the data set. With the classification (or allocation) we derive a rule that can be used to optimally assign new objects to the labeled classes.
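A minimal sketch of such an allocation rule in Python, assuming NumPy and purely hypothetical June measurements at Guayaquil (temperature in °C, precipitation in mm, pressure in hPa); it uses the standard linear classification rule for two populations with equal covariance matrices, treated in Chapter 10, and is not a method prescribed here:

    import numpy as np

    rng = np.random.default_rng(0)
    # hypothetical labeled June observations: (temperature, precipitation, pressure)
    nino = rng.multivariate_normal([27.0, 300.0, 1010.0], np.diag([1.0, 900.0, 4.0]), size=15)
    nina = rng.multivariate_normal([24.5, 150.0, 1013.0], np.diag([1.0, 900.0, 4.0]), size=15)

    m1, m2 = nino.mean(axis=0), nina.mean(axis=0)
    # pooled sample covariance matrix (equal-covariance assumption)
    S_pool = ((len(nino) - 1) * np.cov(nino, rowvar=False)
              + (len(nina) - 1) * np.cov(nina, rowvar=False)) / (len(nino) + len(nina) - 2)

    a = np.linalg.solve(S_pool, m1 - m2)     # discriminant coefficient vector
    cutoff = a @ (m1 + m2) / 2               # midpoint between the projected group means

    x_new = np.array([26.5, 280.0, 1011.0])  # a new June to be allocated
    print("El Nino year" if a @ x_new >= cutoff else "La Nina year")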

1.3.7 Cluster Analysis: Weather Report 2007

Cluster analysis is a method for constructing a sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual. Source: Everitt and Skrondal (2010).

Example (Weather Report). Table 1.4 shows the Weather Report for 2007 in Switzerland for fifty stations. A descriptive analysis of the data set often consists of grouping


stations with similar characteristics. Therefore we define measures to compare the distances of multidimensional points with each other. With the help of a dendrogram, in which the distances between stations are displayed, we can form a reasonable number of clusters.

1.4 Classification of Multivariate Methods

1.4.1 Classification of Sharma (1996)

Consider a data set consisting of n observations on p variables, and further assume that the p variables can be divided into two groups or subsets. Statistical methods for analyzing these types of data sets are referred to as dependence methods. The dependence methods test for the presence or absence of relationships between the two sets of variables. However, if the researcher designates the variables in one subset as independent variables and the variables in the other subset as dependent variables, then the objective of the dependence methods is to determine whether the set of independent variables affects the set of dependent variables individually and/or jointly.

On the other hand, data sets do exist for which it is impossible to conceptually designate one set of variables as dependent and another set of variables as independent. For these types of data sets the objectives are to identify how and why the variables are related among themselves. Statistical methods for analyzing these types of data sets are called interdependence methods.

Dependence Methods

Dependence methods can be further classified according to:

1. The number of independent variables – one or more than one.

2. The number of dependent variables – one or more than one.

3. The type of measurement scale used for the dependent variables (i.e., metric or nonmetric).

4. The type of measurement scale used for the independent variables (i.e., metric or nonmetric).

Figure 1.6 gives a list of the statistical methods classified according to the above criteria.

Interdependence Methods

As mentioned previously, situations do exist in which it is impossible or incorrect to delineate one set of variables as independent and another set as dependent. In these situations the major objective of data analysis is to understand or identify why and how the variables are correlated among themselves. Figure 1.7 gives a list of interdependence multivariate methods. The multivariate methods for the case of two variables are the same as the methods for more than two variables and, consequently, are not discussed separately.

1.4.2 Classification of Rencher (1995)

We will list four basic types of (continuous) multivariate data and then briefly describe some possible analyses. Some writers would consider this an oversimplification and would prefer elaborate tree diagrams of data structure. However, most data sets can fit into one of these categories, and the simplicity of this structure makes it easier to remember. The four basic data types are as follows:

1. A single sample with several variables measured on each sampling unit (subject or object).

2. A single sample with two sets of variables measured on each unit.

3. Two samples with several variables measured on each unit.

4. Three or more samples with several variables measured on each unit.

Each data type has extensions, and various combinations of the four are possible. A few examples of analyses for each case will now be given:

1. A single sample with several variables measured on each sampling unit:

(a) Test the hypothesis that the means of the variables have specified values.

(b) Test the hypothesis that the variables are uncorrelated and have a common variance.

(c) Find a small set of linear combinations of the original variables that summarizes most of the variation in the data (principal components).

(d) Express the original variables as linear functions of a smaller set of underlying variables that account for the original variables and their intercorrelations (factor analysis).

2. A single sample with two sets of variables measured on each unit:

(a) Determine the number, the size, and the nature of relationships between the two sets of variables. For example, we may wish to relate a set of interest variables to a set of achievement variables. How much overall correlation is there between these two sets (canonical correlation)?

(b) Find a model to predict one set of variables from the other set (multivariate multiple regression).

3. Two samples with several variables measured on each unit:


(a) Compare the means of the variables across the two samples (Hotelling's T²-test).

(b) Find a linear combination of the variables that best separates the two samples (discriminant analysis).

(c) Find a function of the variables that will accurately allocate the units into the two groups (classification analysis).

4. Three or more samples with several variables measured on each unit:

(a) Compare the means of the variables across the groups (multivariate analysis of variance).

(b) Similar to 3(b).

(c) Similar to 3(c).

Two preconditions for the use of multivariate methods are basic knowledge of linear algebra (Chapter 2) and of the multivariate normal distribution (Chapter 4).

Further reading. Zwiers and von Storch (2004) give a review of the role of statistical analysis in climate research; many further references can be found in this article.



Figure 1.5: Scatterplot matrix of the different variables for Basel for the time period 1755-2010. Data set: Basel.


Figure 1.6: Classification of dependence statistical methods. Source: Sharma (1996).

Figure 1.7: Classification of interdependence statistical methods. Source: Sharma (1996).


2 Linear Algebra

2.1 Introduction

Source: Rencher (1995), pp. 6-41.

2.2 Supplement

Proposition 2.2.1. Let A be a k × k symmetric matrix. Then A has k pairs of eigenvalues and eigenvectors, namely

(λ1, e1), . . . , (λk, ek).

The eigenvectors can be chosen to satisfy 1 = e′1e1 = . . . = e′kek and to be mutually perpendicular. The eigenvectors are unique unless two or more eigenvalues are equal.

2.2.1 Positive Definite Matrices

Positive definite and non-negative definite (or positive semi-definite) matrices are important in probability theory because covariance and correlation matrices are always non-negative definite.

Definition 2.2.2. The spectral decomposition of a k × k symmetric matrix A is given by

A = \lambda_1 e_1 e_1' + \ldots + \lambda_k e_k e_k',

where λ1, . . . , λk are the eigenvalues of A and e1, . . . , ek are the associated normalized eigenvectors. Thus, e′iei = 1 for i = 1, . . . , k, and e′iej = 0 for i ≠ j.

Definition 2.2.3. A quadratic form Q(x) in the k variables x1, . . . , xk is Q(x) = x′Ax, where x′ = (x1, . . . , xk) and A is a k × k symmetric matrix.

Note that the quadratic form can be written as

Q(x) = \sum_{i=1}^{k} \sum_{j=1}^{k} a_{ij} x_i x_j.

Definition 2.2.4. When a k × k symmetric matrix A is such that

0 ≤ x′Ax for all x′ = (x1, . . . , xk), (2.1)

the matrix A is said to be non-negative definite. If equality holds in (2.1) only for the vector x′ = (0, . . . , 0), then A is said to be positive definite. In other words, A is positive definite if

0 < x′Ax

for all vectors x′ ≠ 0.


Example. The matrix

A = \begin{pmatrix} 1 & 2 \\ 2 & 100 \end{pmatrix}

is positive definite, whereas

B = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}

is not.

Remark. Using the spectral decomposition, we can show that a k × k symmetric matrix A is positive definite if and only if every eigenvalue of A is positive.

Remark. A positive definite quadratic form can be interpreted as a squared distance. Conversely, distance is determined from a positive definite quadratic form x′Ax.

Remark. Let the square of the distance from the point x′ = (x1, . . . , xp) to the origin be given by x′Ax, where A is a p × p symmetric positive definite matrix. Then the square of the distance from x to an arbitrary fixed point µ′ = (µ1, . . . , µp) is given by the general expression

(x − µ)′A(x − µ).

Example. Suppose A is positive definite and p = 2. Then the points x′ = (x1, x2) of constant distance c from the origin satisfy

x'Ax = a_{11} x_1^2 + a_{22} x_2^2 + 2 a_{12} x_1 x_2 = c^2.

By the spectral decomposition

A = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2'

we get

x'Ax = \lambda_1 (x'e_1)^2 + \lambda_2 (x'e_2)^2.

Now, λ1(x′e1)² + λ2(x′e2)² = c² is an ellipse because λ1, λ2 > 0 when A is positive definite. We see that x = c λ1^{−1/2} e1 satisfies x′Ax = c². Similarly, x = c λ2^{−1/2} e2 gives the appropriate distance in the e2 direction. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A, with lengths proportional to the reciprocals of the square roots of the eigenvalues (see Figure 2.1).

2.2.2 Square-root Matrices

The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors. Let A be a k × k positive definite matrix with the spectral decomposition

A = \sum_{i=1}^{k} \lambda_i e_i e_i'.

Let the normalized eigenvectors be the columns of another matrix

P = (e_1, \ldots, e_k).


Figure 2.1: The ellipse is the locus of all points with constant statistical distance c from the origin. Source: Johnson and Wichern (2007).

Then

A = \sum_{i=1}^{k} \lambda_i e_i e_i' = P \Lambda P',

where PP′ = P′P = I and

\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{pmatrix}

with λi > 0. Thus

A^{-1} = P \Lambda^{-1} P' = \sum_{i=1}^{k} \frac{1}{\lambda_i} e_i e_i'.

Proposition 2.2.5. The square-root matrix of a positive definite matrix A is

A^{1/2} = \sum_{i=1}^{k} \sqrt{\lambda_i}\, e_i e_i' = P \Lambda^{1/2} P'

and has the following properties:

1. (A^{1/2})′ = A^{1/2} (symmetric),

2. A^{1/2} A^{1/2} = A,

3. (A^{1/2})^{-1} = \sum_{i=1}^{k} \frac{1}{\sqrt{\lambda_i}} e_i e_i' = P \Lambda^{-1/2} P', where Λ^{−1/2} is a diagonal matrix with 1/√λi as the ith diagonal element,

4. A^{1/2} A^{−1/2} = I and A^{−1/2} A^{−1/2} = A^{−1}.

2.2.3 Random Vectors and Matrices

A random vector is a vector whose elements are random variables. Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix is the matrix consisting of the expected values of each of its elements. Specifically, let X = {Xij} be an n × p random matrix. Then the expected value of X, denoted by E(X), is the n × p matrix of numbers

E(\mathbf{X}) = \begin{pmatrix} E(X_{11}) & \cdots & E(X_{1p}) \\ \vdots & \ddots & \vdots \\ E(X_{n1}) & \cdots & E(X_{np}) \end{pmatrix},

where, for each element of the matrix,

E(X_{ij}) = \begin{cases} \int_{-\infty}^{\infty} x_{ij} f_{ij}(x_{ij})\, dx_{ij} & \text{if } X_{ij} \text{ is continuous with probability density function } f_{ij}(x_{ij}), \\ \sum_{\text{all } x_{ij}} x_{ij}\, p_{ij}(x_{ij}) & \text{if } X_{ij} \text{ is discrete with probability function } p_{ij}(x_{ij}). \end{cases}

Proposition 2.2.6. Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then

E(X + Y) = E(X) + E(Y),
E(AXB) = A E(X) B.

2.2.4 Mean Vectors and Covariance Matrices

Suppose X′ = (X1, . . . , Xp) is a p × 1 random vector. Then each element of X is a random variable with its own marginal probability distribution. The marginal means µi and variances σi² are defined as µi = E(Xi) and σi² = E(Xi − µi)², i = 1, 2, . . . , p, respectively.

It will be convenient in later sections to denote the marginal variances by σii rather than by the more traditional σi².


The behavior of any pair of random variables, such as Xi and Xk, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance

σik = E(Xi − µi)(Xk − µk),

where µi and µk, i, k = 1, 2, . . . , p, are the marginal means.

More generally, the collective behavior of the p random variables X1, . . . , Xp or, equivalently, the random vector X′ = (X1, . . . , Xp), is described by a joint probability density function f(x1, . . . , xp) = f(x).

Definition 2.2.7. If the joint probability P(Xi ≤ xi and Xk ≤ xk) can be written as the product of the corresponding marginal probabilities, so that

P(Xi ≤ xi and Xk ≤ xk) = P(Xi ≤ xi) P(Xk ≤ xk)

for all pairs of values xi, xk, then Xi and Xk are said to be statistically independent.

The p continuous random variables X1, X2, . . . , Xp are mutually statistically independent if their joint density can be factored as

f_{12\cdots p}(x_1, \ldots, x_p) = f_1(x_1) \cdots f_p(x_p)

for all p-tuples (x1, . . . , xp).

Remark. Statistical independence has an important implication for covariance:

Cov(Xi, Xk) = 0 if Xi and Xk are independent.

The converse is not true in general.

Example. Consider two random variables X and Y = X²; then Cov(X, Y) = EX³ − EX EX². If X has any density symmetric about zero, then Cov(X, Y) = 0. However, given X we can predict Y perfectly.

Example. Let U and V be two random variables with the same mean and the same variance. Write X = U + V and Y = U − V. Then Cov(X, Y) = E(U² − V²) − E(U + V)E(U − V) = 0, so that X and Y are uncorrelated but not necessarily independent.

The means and covariances of the p × 1 random vector X can be set out as matrices. The expected value of each element is contained in the vector of means µ = E(X), and the p variances σii and the p(p − 1)/2 distinct covariances σik (i < k) are contained in the symmetric variance-covariance matrix

\Sigma = E(X - \mu)(X - \mu)' = \mathrm{Cov}(X) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}.

We shall refer to µ and Σ as the population mean vector and the population variance-covariance matrix, respectively.


Proposition 2.2.8. The linear combination c′X = c1X1 + · · · + cpXp has

mean E(c′X) = c′µ and variance Var(c′X) = c′Σc,

where µ = E(X) and Σ = Cov(X). The linear combinations Z = CX, where C is a q × p matrix, have

µZ = E(Z) = E(CX) = C µX,
ΣZ = Cov(Z) = Cov(CX) = C ΣX C′,

where µX and ΣX are the mean vector and variance-covariance matrix of X, respectively.

Remark. Since 0 ≤ Var(c′X) = c′Σc, it is obvious that the covariance matrix is non-negative definite.

Example. Define

Z1 := X1 − X2,
Z2 := X1 + X2.

We find

\Sigma_Z = \begin{pmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{11} - \sigma_{22} \\ \sigma_{11} - \sigma_{22} & \sigma_{11} + 2\sigma_{12} + \sigma_{22} \end{pmatrix}.

Note that if σ11 = σ22, that is, if X1 and X2 have equal variances, the off-diagonal terms in ΣZ vanish. This demonstrates that the sum and difference of two random variables with identical variances are uncorrelated.
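The following sketch (illustrative, with an arbitrary Σ whose two variances are equal) computes ΣZ = CΣC′ numerically and confirms that the sum and difference are uncorrelated:

    import numpy as np

    Sigma = np.array([[3.0, 1.0],
                      [1.0, 3.0]])    # Cov(X) with sigma_11 = sigma_22

    C = np.array([[1.0, -1.0],        # Z1 = X1 - X2
                  [1.0,  1.0]])       # Z2 = X1 + X2

    Sigma_Z = C @ Sigma @ C.T         # Cov(Z) = C Sigma C'
    print(Sigma_Z)
    # the off-diagonal entries equal sigma_11 - sigma_22 = 0,
    # so Z1 and Z2 are uncorrelated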


3 Sample Geometry and Random Sampling

Random sampling implies that

• measurements taken on different items (or trials) are to some extent unrelated to one another (in general we want independence) and

• the joint distribution of all p variables remains the same for all items.

3.1 Geometry of the Sample

A single multivariate observation is the collection of measurements on p different variables taken on the same item or trial. If n observations have been obtained, the entire data set can be placed in an n × p matrix with n rows (the observations) and p columns (the variables):

X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}.

Remark. We often say that the data are a sample of size n from a p-variate population. The sample then consists of n measurements, each of which has p components.

The data can be plotted in two different ways:

• For the p-dimensional scatterplot, the rows of X represent n points in p-dimensional space (see Figures 3.1 and 3.2). We can write

X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix},

where x′j is the jth observation. The scatterplot of n points in p-dimensional space provides information on the locations and the variability of the points.

• Consider the data as p vectors in n-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors:

X = (y_1 | \ldots | y_p).


Figure 3.1: Three-dimensional plot with the temperatures for Bern, Davos and Grosser St. Bernhard. Data set: Climate Time Series, Section 1.3.1.

3.2 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix

Suppose that the data have not yet been observed, but we intend to collect n sets of measurements on p variables. Before the measurements are made, we treat them as random variables. Each set of measurements Xj on p variables is a random vector, and we have the random matrix

\mathbf{X} = \begin{pmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{np} \end{pmatrix} = \begin{pmatrix} X_1' \\ \vdots \\ X_n' \end{pmatrix}.

Definition 3.2.1. If the row vectors X′1, . . . , X′n represent independent observations from a common joint distribution with density function f(x) = f(x1, . . . , xp), then X1, . . . , Xn are said to form a random sample from f(x).

Remark. Mathematically, X1, . . . , Xn form a random sample if their joint density function is given by the product f(x1) f(x2) · . . . · f(xn), where f(xj) = f(xj1, . . . , xjp) is the density function for the jth row vector.


Figure 3.2: Scatterplot matrix of the temperatures for Bern, Davos and Grosser St. Bernhard. Data set: Climate Time Series, Section 1.3.1.

Remark. Two points connected with the definition of a random sample merit special attention:

• The measurements of the p variables in a single trial will usually be correlated. The measurements from different trials must, however, be independent.

• The independence of measurements from trial to trial may not hold when the variables are likely to drift over time, as with sets of p stock prices or p economic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences.

Proposition 3.2.2. Let X1, . . . , Xn be a random sample from a joint distribution that has mean vector µ and covariance matrix Σ. Then

\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j

is an unbiased estimator of µ, and its covariance matrix is Σ/n, that is,

E(\bar{X}) = \mu, \qquad \mathrm{Cov}(\bar{X}) = \frac{1}{n} \Sigma.

Remark. The unbiased sample variance-covariance matrix is

S = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar{X})(X_j - \bar{X})'.
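A small simulation sketch (assuming NumPy; the mean vector and covariance matrix are arbitrary) computes X̄ and the unbiased S from a random sample and checks the n − 1 divisor against np.cov:

    import numpy as np

    rng = np.random.default_rng(42)
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    # rows of X are the random sample X_1', ..., X_n'
    X = rng.multivariate_normal(mu, Sigma, size=200)
    n = len(X)

    x_bar = X.mean(axis=0)              # unbiased estimator of mu
    D = X - x_bar                       # matrix of deviation rows
    S = D.T @ D / (n - 1)               # unbiased sample covariance matrix

    print(x_bar)                        # close to mu
    print(S)                            # close to Sigma
    print(np.allclose(S, np.cov(X, rowvar=False)))   # np.cov also divides by n-1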

3.3 Generalized Variance

When p variables are observed on each unit, the variation is described by the sample variance-covariance matrix

S = \begin{pmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & \ddots & \vdots \\ s_{1p} & \cdots & s_{pp} \end{pmatrix}, \qquad s_{ik} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k).

The sample covariance matrix contains p variances and p(p − 1)/2 potentially different covariances. Sometimes it is desirable to assign a single numerical value to the variation expressed by S. One choice for such a value is the determinant of S:

generalized sample variance := |S|.

The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when p > 1, some information about the sample is lost in the process. A geometrical interpretation of |S| will help us appreciate its strengths and weaknesses as a descriptive summary.

Example. To develop a formula for |S|, we start with a matrix X with n items and 2 variables. Consider Figure 3.3 with the area generated within the plane by the two deviation vectors d1 := y1 − x̄1 1 and d2 := y2 − x̄2 1. We have

X = \begin{pmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix} = (y_1 | y_2),

d_1 = y_1 - \bar{x}_1 \mathbf{1} = \begin{pmatrix} x_{11} - \bar{x}_1 \\ \vdots \\ x_{n1} - \bar{x}_1 \end{pmatrix},

L_{d_1} = \sqrt{ \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 } = \sqrt{(n-1) s_{11}}.


Figure 3.3: Calculation of the area of the trapezoid. Source: Johnson and Wichern (2007).

Similar results hold for d2 and L_{d2}. By definition,

\cos\theta = \frac{d_1' d_2}{L_{d_1} L_{d_2}},

and therefore cos θ = s12/(√s11 √s22) = r12, where r12 is the sample correlation coefficient. The area of the trapezoid (see Figure 3.3) is

A = L_{d_1} L_{d_2} \sqrt{1 - \cos^2\theta} = \ldots = (n-1) \sqrt{s_{11} s_{22} (1 - r_{12}^2)}.

Furthermore, |S| = s11 s22 (1 − r12²) and therefore |S| = A²/(n − 1)².
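The following sketch (illustrative, with simulated bivariate data) verifies both identities numerically: |S| = s11 s22 (1 − r12²), and |S| = A²/(n − 1)², where A is the area spanned by the two deviation vectors:

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(50, 2))
    X[:, 1] += 0.6 * X[:, 0]        # introduce some correlation

    n = len(X)
    D = X - X.mean(axis=0)          # deviation vectors d1, d2 as columns of D
    S = D.T @ D / (n - 1)

    gen_var = np.linalg.det(S)      # generalized sample variance |S|
    r12 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
    print(np.isclose(gen_var, S[0, 0] * S[1, 1] * (1 - r12**2)))   # True

    # area spanned by d1 and d2: A^2 = det(D'D) = (n-1)^2 |S|
    A = np.sqrt(np.linalg.det(D.T @ D))
    print(np.isclose(gen_var, A**2 / (n - 1)**2))                  # True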

Remark. The following general result for p deviation vectors d1, . . . , dp can be established by induction:

generalized sample variance: |S| = (volume)² (n − 1)^{−p}.

Remark. For a fixed sample size, it is clear from the geometry (see Figure 3.4) that the volume, or |S|, will

1. increase
   • if the length of any di is increased, or
   • if the di's are moved until they are at right angles to one another,

2. decrease
   • if just one sii is small, or
   • if one of the di's lies nearly in the plane formed by the others.

Remark. Consider the measure of distance

(x − µ)′A(x − µ),

where A is a p × p symmetric positive definite matrix. With the choices µ = x̄ and A = S⁻¹, the coordinates x′ = (x1, . . . , xp) of the points with a constant distance c from x̄ satisfy

(x − x̄)′S⁻¹(x − x̄) = c².


Figure 3.4: Large and small generalized sample variance for p = 3. Source: Johnson and Wichern (2007).

It can be shown that the volume of

\{x : (x - \bar{x})' S^{-1} (x - \bar{x}) \le c^2\}

equals k_p |S|^{1/2} c^p, where k_p = \frac{2 \pi^{p/2}}{p\, \Gamma(p/2)} and Γ(·) denotes the gamma function.

Example. Consider three data sets, each with x̄′ = (2, 1) and corresponding covariance matrices

S = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}, r = 0.8; \quad S = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}, r = 0; \quad S = \begin{pmatrix} 5 & -4 \\ -4 & 5 \end{pmatrix}, r = -0.8.

Here the generalized variance |S| takes the same value, |S| = 9, for all three patterns. If we draw the eigenvectors of the three matrices in the scatterplot of the data, we get Figure 3.5. So we conclude that the generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more patterns being compared have nearly the same orientation.

Remark. As the example demonstrates, different correlation structures are not detected by |S|.


3.3.1 Situations in which the generalized sample variance is zero

Proposition 3.3.1. The generalized variance is zero if and only if at least one deviation vector lies in the plane formed by all linear combinations of the others – that is, when the columns of the matrix of deviations X − 1x̄′ are linearly dependent.

Remark. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. Therefore, in a statistical analysis, |S| = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer.

3.3.2 Generalized Variance determined by |R| and its Geometrical Interpretation

The generalized sample variance is affected by the variability of measurements on a single variable. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

The sample covariance matrix of the standardized variables is then R, the sample correlation matrix of the original variables. We define the generalized sample variance of the standardized variables as |R|.

Therefore, we can make the statement that |R| is large when all the rik are nearly zero (the vectors are nearly perpendicular), and it is small when one or more of the rik are nearly +1 or −1.

Remark. The quantities |S| and |R| are connected by the relationship

|S| = (s11 · . . . · spp) |R|.

3.4 Sample Mean, Covariance, and Correlation as Matrix Operations

It is possible to link the calculation of x̄ and S algebraically and directly to X using matrix operations:

\bar{x} = \frac{1}{n} X' \mathbf{1},

S = \frac{1}{n-1} X' \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) X.

With

D^{1/2} = \begin{pmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{pmatrix}

we have

R = D^{-1/2} S D^{-1/2} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}

and

S = D^{1/2} R D^{1/2}.
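The following sketch (illustrative, with simulated data) carries out these matrix operations in NumPy and also checks the relationship |S| = (s11 · . . . · spp)|R| from Section 3.3.2:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(20, 3))        # n = 20 observations, p = 3 variables
    n = len(X)
    one = np.ones((n, 1))               # the vector of ones

    x_bar = X.T @ one / n               # mean vector as a matrix operation
    S = X.T @ (np.eye(n) - one @ one.T / n) @ X / (n - 1)

    D_inv_half = np.diag(1 / np.sqrt(np.diag(S)))
    R = D_inv_half @ S @ D_inv_half     # sample correlation matrix
    print(np.allclose(np.diag(R), 1.0)) # diagonal of R consists of ones

    # |S| = (s_11 * ... * s_pp) * |R|
    print(np.isclose(np.linalg.det(S),
                     np.prod(np.diag(S)) * np.linalg.det(R)))   # True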

3.5 Sample Values of Linear Combinations of Variables

In many multivariate procedures we are led naturally to consider linear combinations of the form

c′X = c1X1 + . . . + cpXp,
b′X = b1X1 + . . . + bpXp,

whose observed values on the jth trial are

c′xj = c1xj1 + . . . + cpxjp, j = 1, . . . , n, (3.1)
b′xj = b1xj1 + . . . + bpxjp, j = 1, . . . , n. (3.2)

For the n derived observations in (3.1) we find:

• the sample mean of c′X is c′x̄,

• the sample variance of c′X is

c' \frac{(x_1 - \bar{x})(x_1 - \bar{x})' + \ldots + (x_n - \bar{x})(x_n - \bar{x})'}{n-1} c = c'Sc,

• the sample covariance of b′X and c′X is b′Sc.


Figure 3.5: Axes of the mean-centered 95% ellipses for the scatterplots. Source: Johnson and Wichern (2007).


4 Multivariate Normal Distribution

4.1 Introduction

Source: Rencher (1995), pp. 94-120.

4.2 Supplement

Contours of constant density for the p-dimensional normal distribution are ellipsoids (see Figure 4.1) defined by the x values such that

(x − µ)′Σ⁻¹(x − µ) = c².

These ellipsoids are centered at µ and have axes ±c √λi ei, where Σei = λi ei for i = 1, . . . , p.

Proposition 4.2.1. The solid ellipsoid of x values satisfying

(x − µ)′Σ⁻¹(x − µ) ≤ χ²p(α)

has probability 1 − α (see Figure 4.2).

The choice c² = χ²p(α), where χ²p(α) is the upper (100α)th percentile of a chi-square distribution with p degrees of freedom, leads to contours that contain 1 − α of the probability.

Proposition 4.2.2. Let X1, . . . , Xn be independent observations from a population with mean µ and finite covariance Σ. Then

√n (X̄ − µ) is approximately Np(0, Σ) (4.1)

and

n (X̄ − µ)′S⁻¹(X̄ − µ) is approximately χ²p (4.2)

for n − p large (see Figure 4.3).
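A simulation sketch of (4.2), assuming NumPy and SciPy (the population parameters are arbitrary): the statistic n(X̄ − µ)′S⁻¹(X̄ − µ) should fall below the χ²p upper 5% cut-off in about 95% of repeated samples when n − p is large:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    mu = np.array([1.0, 2.0, 3.0])
    Sigma = np.diag([1.0, 2.0, 0.5])
    p, n, reps = 3, 200, 2000

    cutoff = stats.chi2.ppf(0.95, df=p)   # upper 5th percentile of chi-square(p)
    cover = 0
    for _ in range(reps):
        X = rng.multivariate_normal(mu, Sigma, size=n)
        x_bar = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        T2 = n * (x_bar - mu) @ np.linalg.solve(S, x_bar - mu)
        cover += T2 <= cutoff

    print(cover / reps)                   # close to 0.95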

4.3 Detecting Outliers

Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. We must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.


Figure 4.1: Joint and marginal probability density functions of a bivariate normally distributed random vector Z = (X, Y). Source: Wikipedia.

Key points regarding outliers

• They may or may not be influence points.

• The larger the data set, the less the influence of individual outliers.

• What appears to be an outlier in a small data set may not be an outlier in a large data set. For example, in a data set of sample size 20, an observation located three standard deviations from the mean will be somewhat unusual; the same observation will not be so unusual in a data set of size 100.

• Outliers often result from measurement error.

• Outliers can result from a mixture of populations.

• What appears as an outlier in a small sample may be an observation drawn from a highly skewed distribution.


Figure 4.2: Scatterplot matrix of Bern, Davos and Bern, Genf with 99%, 95% and 75% confidence regions. Data set: Climate Time Series, Section 1.3.1.

Steps for Detecting Outliers

Outliers are best detected visually whenever this is possible.

1. Make a dot plot for each variable.

2. Make a scatterplot for each pair of variables.

3. Calculate the standardized values zjk = (xjk − xk)/√skk for j = 1, 2, . . . , n and

each column k = 1, 2, . . . , p. Examine these standardized values for large or smallvalues.

4. Calculate the generalized squared distances (xj −x)′S−1(xj −x). Examine thesedistances for unusually large values. In a chi-square plot, these would be the pointsfarthest from the origin (see Figure 4.4).
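Steps 3 and 4 are easy to automate. The sketch below (assuming NumPy and SciPy; the data are simulated, with one unusual observation planted deliberately) flags large standardized values and generalized squared distances beyond a chi-square cut-off:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=100)
    X[0] = [4.0, -4.0, 4.0]               # plant one unusual observation

    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)

    # step 3: standardized values z_jk = (x_jk - xbar_k) / sqrt(s_kk)
    Z = (X - x_bar) / np.sqrt(np.diag(S))
    print(np.where(np.any(np.abs(Z) > 3, axis=1))[0])   # rows with extreme values

    # step 4: generalized squared distances (x_j - xbar)' S^{-1} (x_j - xbar)
    dev = X - x_bar
    d2 = np.einsum("ij,ij->i", dev @ np.linalg.inv(S), dev)
    cutoff = stats.chi2.ppf(0.99, df=3)   # reference quantile for the chi-square plot
    print(np.where(d2 > cutoff)[0])       # flags the planted row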

Recommendations

What should be done with suspected outliers and influence points? The following are recommended (Schuenemeyer and Drew (2011), p. 117):

• Examine the raw data if possible. See if there is evidence that a mistake has been made, such as a recording error or equipment failure; if so, it is proper to delete and/or correct the data value.

• Determine if the outlier is from another population; if so, attempt to separate the populations or delete the outlier.


Figure 4.3: Confidence regions for the mean value of the winter temperatures for Bern, Davos and Bern, Genf with 99%, 95% and 75% confidence levels. Data set: Climate Time Series, Section 1.3.1.

• Determine if the data are from a skewed distribution using analogy, theory, simulation, or other techniques; if so, consider a normalizing transformation such as a log.

• Consider a robust procedure, which will down-weight outliers. Robust procedures are discussed later in the chapter.

• When it is necessary and proper to delete an outlier or influence point, the investigator needs to document it and provide a justification. Observations must not be removed to satisfy an investigator's conjecture or to provide an outcome desired by a sponsor.



Figure 4.4: Normal QQ-plots for Bern, Davos and Grosser St. Bernhard, and chi-square plot. Data set: Climate Time Series, Section 1.3.1.



Figure 4.5: QQ-plots for a normal N(0, 1) distribution, a t(2)-distribution and an F(12, 13)-distribution.
