Statistics in WR: Lecture 1


Page 1: Statistics in WR: Lecture 1

Statistics in WR: Lecture 1

• Key Themes
– Knowledge discovery in hydrology
– Introduction to probability and statistics
– Definition of random variables

• Reading: Helsel and Hirsch, Chapter 1

Page 2: Statistics in WR: Lecture 1

How is new knowledge discovered?

After completing the Handbook of Hydrology in 1993, I asked myself the question: how is new knowledge discovered in hydrology?

I concluded:

• By deduction from existing knowledge

• By experiment in a laboratory

• By observation of the natural environment

Page 3: Statistics in WR: Lecture 1

Deduction – Isaac Newton

• Deduction is the classical path of mathematical physics
– Given a set of axioms
– Then by a logical process
– Derive a new principle or equation

• In hydrology, the St. Venant equations for open channel flow and the Richards equation for unsaturated flow in soils were derived in this way.

(1687) Three laws of motion and law of gravitation

http://en.wikipedia.org/wiki/Isaac_Newton

Page 4: Statistics in WR: Lecture 1

Experiment – Louis Pasteur

• Experiment is the classical path of laboratory science – a simplified view of the natural world is replicated under controlled conditions

• In hydrology, Darcy’s law for flow in a porous medium was found this way.

Pasteur showed that microorganisms cause disease & discovered vaccination

Foundations of scientific medicine http://en.wikipedia.org/wiki/Louis_Pasteur

Page 5: Statistics in WR: Lecture 1

Observation – Charles Darwin

• Observation – direct viewing and characterization of patterns and phenomena in the natural environment

• In hydrology, Horton discovered stream scaling laws by interpretation of stream maps

Published Nov 24, 1859

Most accessible book of great scientific imagination ever written

Page 6: Statistics in WR: Lecture 1

Mean Annual Flow

Page 7: Statistics in WR: Lecture 1

Is there a relation between flow and water quality?

Total Nitrogen in water

Page 8: Statistics in WR: Lecture 1

Are Annual Flows Correlated?

Page 9: Statistics in WR: Lecture 1

CE 397 Statistics in Water Resources, Lecture 2, 2009

David R. Maidment
Dept of Civil Engineering
University of Texas at Austin

Page 10: Statistics in WR: Lecture 1

Key Themes

• Statistics
– Parametric and non-parametric approach
• Data Visualization
• Distribution of data and the distribution of statistics of those data
• Reading: Helsel and Hirsch p. 17-51 (Sections 2.1 to 2.3)
• Slides from Helsel and Hirsch (2002), “Techniques of Water-Resources Investigations of the USGS”, Book 4, Chapter A3.

Page 11: Statistics in WR: Lecture 1

Characteristics of Water Resources Data

• Lower bound of zero
• Presence of “outliers”
• Positive skewness
• Non-normal distribution of data
• Data measured with thresholds (e.g. detection limits)
• Seasonal and diurnal patterns
• Autocorrelation – consecutive measurements are not independent
• Dependence on other uncontrolled variables, e.g. chemical concentration is related to discharge

Page 12: Statistics in WR: Lecture 1

Normal Distribution

From Helsel and Hirsch (2002)

Page 13: Statistics in WR: Lecture 1

Lognormal Distribution

From Helsel and Hirsch (2002)

Page 14: Statistics in WR: Lecture 1

Method of Moments

From Helsel and Hirsch (2002)

Page 15: Statistics in WR: Lecture 1

Statistical measures

• Location (Central Tendency)
– Mean
– Median
– Geometric mean

• Spread (Dispersion)
– Variance
– Standard deviation
– Interquartile range

• Skewness (Symmetry)
– Coefficient of skewness

• Kurtosis (Flatness)
– Coefficient of kurtosis
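The measures listed on this slide can be computed with Python's standard library. The sketch below is only an illustration: the `flows` values are hypothetical, and the skewness and kurtosis use the simple moment-ratio definitions (no small-sample bias correction), which may differ from the estimators used in Helsel and Hirsch or in Excel.

```python
import statistics

def describe(values):
    """Summary measures of location, spread, skewness and kurtosis for positive data."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    # Central moments with divisor n (simple moment-ratio coefficients; an assumption)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),
        "variance": statistics.variance(values),   # sample variance, divisor n - 1
        "std_dev": statistics.stdev(values),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }

flows = [12.0, 15.0, 18.0, 22.0, 30.0, 45.0, 120.0]  # hypothetical annual flows
summary = describe(flows)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For positively skewed data such as these, the mean sits above the median and the geometric mean sits below the mean, which is the pattern the "Characteristics of Water Resources Data" slide describes.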

Page 16: Statistics in WR: Lecture 1

Histogram

Annual Streamflow for the Licking River at Catawba, Kentucky (03253500)

From Helsel and Hirsch (2002)

Page 17: Statistics in WR: Lecture 1

Quantile Plot

From Helsel and Hirsch (2002)

Page 18: Statistics in WR: Lecture 1

Plotting positions

i = rank of the data, with i = 1 the lowest
n = number of data
p = cumulative probability or “quantile” of the data value (its percentile value)
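The plotting-position formula itself did not survive in this transcript. As a sketch, the general form p = (i − a)/(n + 1 − 2a) covers the common choices: a = 0.4 gives the Cunnane formula used by Helsel and Hirsch, and a = 0 gives the Weibull formula. The function name and the data below are hypothetical.

```python
def plotting_positions(values, a=0.4):
    """Return (sorted value, p) pairs with p = (i - a) / (n + 1 - 2a).

    a = 0.4 -> Cunnane: p = (i - 0.4) / (n + 0.2)
    a = 0.0 -> Weibull: p = i / (n + 1)
    """
    ordered = sorted(values)          # rank i = 1 is the lowest value
    n = len(ordered)
    return [(x, (i - a) / (n + 1 - 2 * a)) for i, x in enumerate(ordered, start=1)]

flows = [30.0, 12.0, 45.0, 18.0, 22.0]  # hypothetical data
pairs = plotting_positions(flows)
for value, p in pairs:
    print(f"{value:6.1f}  p = {p:.3f}")
```

Plotting the sorted values against p gives a quantile plot; no data value is assigned p = 0 or p = 1, which matters when p is later converted to a normal quantile.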

Page 19: Statistics in WR: Lecture 1

Normal Distribution Quantile Plot

From Helsel and Hirsch (2002)

Page 20: Statistics in WR: Lecture 1

Probability Plot with Normal Quantiles (Z values)

$q = \bar{q} + z\, s_q$

(data value q plotted against normal quantile z)

From Helsel and Hirsch (2002)

Page 21: Statistics in WR: Lecture 1

Annual Flows From HydroExcel

Annual Flows produced using Pivot Tables in Excel


Page 23: Statistics in WR: Lecture 1

CE 397 Statistics in Water Resources, Lecture 3, 2009


Page 24: Statistics in WR: Lecture 1

Key Themes

• Using HydroExcel for accessing water resources data using web services

• Descriptive statistics and histograms using the Excel Analysis ToolPak

• Reading: Chapter 11 of Applied Hydrology by Chow, Maidment and Mays

Page 25: Statistics in WR: Lecture 1

CE 397 Statistics in Water Resources, Lecture 4, 2009


Page 26: Statistics in WR: Lecture 1

Key Themes

• Frequency and probability functions
• Fitting methods
• Typical distributions
• Reading: Chapter 4 of Helsel and Hirsch pp. 97-116 on Hypothesis tests


Page 28: Statistics in WR: Lecture 1

Method of Moments


Page 29: Statistics in WR: Lecture 1

Maximum Likelihood


Page 30: Statistics in WR: Lecture 1

CE 397 Statistics in Water Resources, Lecture 5, 2009


Page 31: Statistics in WR: Lecture 1

Key Themes

• Using Excel to fit frequency and probability distributions
• Chi Square test and probability plotting
• Beginning hypothesis testing
• Reading: Chapter 3 of Helsel and Hirsch pp. 65-97 on Describing Uncertainty
• Slides from Helsel and Hirsch Chap. 4


Page 33: Statistics in WR: Lecture 1

Statistics in Water Resources, Lecture 6

• Key theme
– T-distribution for distributions where standard deviation is unknown
– Hypothesis testing
– Comparing two sets of data to see if they are different

• Reading: Helsel and Hirsch, Chapter 6 Matched Pair Tests

Page 34: Statistics in WR: Lecture 1

Chi-Square Distribution

http://en.wikipedia.org/wiki/Chi-square_distribution

Page 35: Statistics in WR: Lecture 1

t-, z and ChiSquare

Source: http://en.wikipedia.org/wiki/Student's_t-distribution

Page 36: Statistics in WR: Lecture 1

Normal and t-distributions

Curves shown: Normal; t-distribution for ν = 1, 2, 3, 5, 10, 30

Page 37: Statistics in WR: Lecture 1

Standard Normal and Student - t

• Standard Normal z
– X1, … , Xn are independently and normally distributed with mean μ and standard deviation σ, and
– then $z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$, where $\bar{X}$ is the sample mean, is normally distributed with mean 0 and std dev 1

• Student’s t-distribution
– Applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate Sn
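The practical effect of replacing σ by its sample estimate can be seen in a small simulation. This is a sketch under assumed values (normal data with μ = 10, σ = 2, samples of n = 5, and a cutoff of 1.96); none of these numbers come from the lecture. It counts how often |z| and |t| exceed the cutoff.

```python
import math
import random
import statistics

def tail_fractions(n=5, trials=20000, mu=10.0, sigma=2.0, cutoff=1.96, seed=1):
    """Fraction of samples whose |z| and |t| statistics exceed the cutoff."""
    rng = random.Random(seed)
    z_exceed = 0
    t_exceed = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)                   # sample estimate of sigma
        z = (xbar - mu) / (sigma / math.sqrt(n))       # uses the true sigma
        t = (xbar - mu) / (s / math.sqrt(n))           # uses the estimate s
        z_exceed += abs(z) > cutoff
        t_exceed += abs(t) > cutoff
    return z_exceed / trials, t_exceed / trials

z_frac, t_frac = tail_fractions()
print(f"|z| > 1.96: {z_frac:.3f}   |t| > 1.96: {t_frac:.3f}")
```

The z fraction should be close to 0.05, while the t fraction is noticeably larger for such a small sample: the t-distribution has heavier tails, which is why its critical values are wider than those of the normal.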

Page 38: Statistics in WR: Lecture 1

p-value is the probability of obtaining a test statistic at least as extreme as the one observed, if the null hypothesis (Ho) is true

If the p-value is smaller than the significance level α (commonly 0.05 or 0.025), then reject Ho

If the p-value is larger than α, then do not reject Ho

Page 39: Statistics in WR: Lecture 1

One-sided test

Page 40: Statistics in WR: Lecture 1

Two-sided test
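The difference between the one-sided and two-sided tests can be made concrete for a test statistic that follows the standard normal distribution. The value z = 1.8 and α = 0.05 below are hypothetical, chosen so that the two tests disagree.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-sided upper-tail p, two-sided p) for a standard normal statistic z."""
    std_normal = NormalDist()                    # mean 0, std dev 1
    upper = 1.0 - std_normal.cdf(z)              # P(Z >= z)
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # P(|Z| >= |z|)
    return upper, two_sided

alpha = 0.05                                     # assumed significance level
one_sided, two_sided = p_values(1.8)             # hypothetical test statistic
reject_one_sided = one_sided < alpha
reject_two_sided = two_sided < alpha
print(f"one-sided p = {one_sided:.4f}, reject Ho: {reject_one_sided}")
print(f"two-sided p = {two_sided:.4f}, reject Ho: {reject_two_sided}")
```

The two-sided p-value is twice the one-sided value, so the same statistic can lead to rejecting Ho in a one-sided test and not in a two-sided test at the same α.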

Page 41: Statistics in WR: Lecture 1

Statistics in WR: Lecture 7

• Key Themes
– Statistics for populations and samples
– Suspended sediment sampling
– Testing for differences in means and variances

• Reading: Helsel and Hirsch Chapter 8 Correlation

Page 42: Statistics in WR: Lecture 1

Estimators of the Variance

Maximum Likelihood Estimate for Population variance

Unbiased estimate from a sample

http://en.wikipedia.org/wiki/Variance

Page 43: Statistics in WR: Lecture 1

Bias in the Variance

Common sense would suggest applying the population formula to the sample as well. That estimate is biased because the sample mean is generally somewhat closer to the observations in the sample than the population mean is. This is so because the sample mean is by definition in the middle of the sample, while the population mean may even lie outside the sample. So the deviations from the sample mean will often be smaller than the deviations from the population mean, and if the same formula is applied to both, the variance estimate will on average be somewhat smaller in the sample than in the population.
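A short simulation illustrates this bias. The sketch assumes normal data with σ = 2 (true variance 4) and samples of n = 5; these values are illustrative only. It averages the divide-by-n estimate and the divide-by-(n − 1) estimate over many samples.

```python
import random
import statistics

def average_variance_estimates(n=5, trials=20000, sigma=2.0, seed=1):
    """Average the n-divisor and (n - 1)-divisor variance estimates over many samples."""
    rng = random.Random(seed)
    biased_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased_sum += statistics.pvariance(sample)    # population formula, divides by n
        unbiased_sum += statistics.variance(sample)   # sample formula, divides by n - 1
    return biased_sum / trials, unbiased_sum / trials

biased, unbiased = average_variance_estimates()
print(f"true variance 4.0, n-divisor average {biased:.2f}, (n-1)-divisor average {unbiased:.2f}")
```

The n-divisor average comes out near (n − 1)/n times the true variance, here about 0.8 × 4, while the (n − 1)-divisor average is close to 4.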

Page 44: Statistics in WR: Lecture 1

Suspended Sediment Sampling

http://pubs.usgs.gov/sir/2005/5077/

Page 45: Statistics in WR: Lecture 1

T-test with same variances

Page 46: Statistics in WR: Lecture 1

T-test with different variances
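The bodies of the two t-test slides are not in this transcript. As a sketch of the standard textbook forms (not necessarily the exact formulas shown in the lecture), the pooled test assumes equal variances, and the unequal-variance test (Welch's test) uses a separate variance for each group with the Welch-Satterthwaite degrees of freedom. The function names and data are hypothetical.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)   # pooled variance
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Two-sample t statistic with unequal variances; returns (t, approximate df)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny                                  # squared standard error
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical group 1
y = [2.0, 4.0, 6.0, 8.0, 10.0]    # hypothetical group 2
t_pooled, df_pooled = pooled_t(x, y)
t_welch, df_welch = welch_t(x, y)
print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.2f}")
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ. Converting t to a p-value needs the t-distribution, which the standard library does not provide; a routine such as `scipy.stats.ttest_ind` with its `equal_var` flag covers both cases.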

Page 47: Statistics in WR: Lecture 1

Statistics in WR: Lecture 8

• Key Themes
– Replication in Monte Carlo experiments
– Testing paired differences and analysis of variance
– Correlation

• Reading: Helsel and Hirsch Chapter 9 Simple Regression

Page 48: Statistics in WR: Lecture 1

Statistics of Mean of Replicated Series

Page 49: Statistics in WR: Lecture 1

Patterns of data that all have correlation between x and y of 0.7

Page 50: Statistics in WR: Lecture 1

Monotonic nonlinear correlation

Linear correlation

Non-monotonic correlation

Page 51: Statistics in WR: Lecture 1

Statistics in WR: Lecture 9

• Key Themes
– Using SAS to compute cross-correlation between two data series
– Using Excel to compute autocorrelation of a single data series
– Correlation length and influence of data interval on that
– Lagged Cross-correlation between rainfall and flow

• Reading: Helsel and Hirsch Chapter 12 Trend Analysis

Page 52: Statistics in WR: Lecture 1

Correlation

• Correlation (or cross-correlation) measures the association between two sets of data (x, y)

• Autocorrelation measures the correlation of a dataset with lagged or displaced values of itself (either in time or space), e.g. x(t) with x(t – L), where L is the lag time

• Lagged cross-correlation measures the association between one series y(t), and lagged values of another series x(t – L)
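The three definitions above can be sketched with one helper. This version simply applies the Pearson correlation coefficient to the overlapping pairs y(t), x(t − L); that is a simplification (standard autocorrelation estimators usually use the mean and variance of the whole series), and the rainfall and flow values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_cross_correlation(x, y, lag):
    """Correlation between y(t) and x(t - lag) for a non-negative integer lag."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])   # pairs x(t - lag) with y(t)

def autocorrelation(x, lag):
    """Correlation of a series with its own values lag steps earlier."""
    return lagged_cross_correlation(x, x, lag)

rain = [0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 0.0, 3.0, 0.0, 0.0]   # hypothetical rainfall
flow = [0.0, 0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 0.0, 3.0, 0.0]   # flow responds one step later
r_lag0 = lagged_cross_correlation(rain, flow, 0)
r_lag1 = lagged_cross_correlation(rain, flow, 1)
print(f"lag 0: {r_lag0:.3f}, lag 1: {r_lag1:.3f}")
```

Here the cross-correlation peaks at a lag of one step, which is how a lagged cross-correlation identifies the delay between rainfall and the flow response.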

Page 53: Statistics in WR: Lecture 1

Statistics in WR: Lecture 10

• Key Themes
– Trend analysis using Simple Linear Regression
– Characterization of outliers
– Multiple Linear Regression

• Reading: Helsel and Hirsch Chapter 11 Multiple Linear Regression

• Slides are from Helsel and Hirsch, Chapter 9

Page 54: Statistics in WR: Lecture 1

H&H p.222

Page 55: Statistics in WR: Lecture 1

H&H p.226

Regression Formulas

Page 56: Statistics in WR: Lecture 1

H&H p.227

Regression Formulas

Page 57: Statistics in WR: Lecture 1

Statistics in WR: Lecture 11

• Key Themes
– Simple Linear Regression
– Derivation of the normal equations
– Multiple Linear Regression

• Reading: Helsel and Hirsch Chapter 7 Comparing several independent groups

• Reading: Barnett, Environmental Statistics Chapter 10 Time series methods

• Slides are from Helsel and Hirsch, Chapter 9

Page 58: Statistics in WR: Lecture 1

Regression Assumptions

Page 59: Statistics in WR: Lecture 1

Formulas used in the derivation of the normal equations

Page 60: Statistics in WR: Lecture 1

(1a) Plot the Data: TDS vs LogQ

Page 61: Statistics in WR: Lecture 1

(2) Interpret Regression Statistics

Page 62: Statistics in WR: Lecture 1

A good set of Residuals

Page 63: Statistics in WR: Lecture 1

Multiple Linear Regression

Page 64: Statistics in WR: Lecture 1

Simple vs Complex regression models

Page 65: Statistics in WR: Lecture 1

F-distribution
http://en.wikipedia.org/wiki/F-test

“If U is a Chisquare random variable with m degrees of freedom, V is a Chisquare random variable with n degrees of freedom, and if U and V are independent, then the ratio (U/m)/(V/n) has an F-distribution with (m, n) degrees of freedom.” Haan, Statistical Methods in Hydrology, p.122

The values of the F-statistic are tabulated at:

http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Page 66: Statistics in WR: Lecture 1

Statistics in WR: Lecture 12

• Key Themes
– Regression y|x and x|y
– Adjusted R2
– Time series and seasonal variations

Page 67: Statistics in WR: Lecture 1

R2 and Adjusted R2

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.950344
R Square             0.903154    0.903154347
Adjusted R Square    0.898543    0.89854265
Standard Error       159033.1
Observations         23

ANOVA
                    df    SS             MS             F           Significance F
Regression          1     4.95309E+12    4.95309E+12    195.8399    4.07E-12
Residual (error)    21    5.31122E+11    25291521454
Total (y)           22    5.48421E+12

$R^2 = 1 - \dfrac{SSE}{SS_y}$

$\text{Adj } R^2 = 1 - \dfrac{SSE/(n-p)}{SS_y/(n-1)}$
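The values in the summary output can be checked directly from the sums of squares in the ANOVA table. In this sketch, p = 2 (intercept and slope) is an assumption, chosen because it matches the residual degrees of freedom of 21 with n = 23 observations.

```python
# Sums of squares copied from the ANOVA table above
ss_regression = 4.95309e12
sse = 5.31122e11          # residual (error) sum of squares
ssy = 5.48421e12          # total sum of squares of y
n = 23                    # observations
p = 2                     # fitted coefficients (intercept + slope); assumed from df = 21

r2 = 1 - sse / ssy
adj_r2 = 1 - (sse / (n - p)) / (ssy / (n - 1))
f_stat = (ss_regression / 1) / (sse / (n - p))   # mean square regression / mean square error

print(f"R2 = {r2:.6f}")
print(f"Adjusted R2 = {adj_r2:.6f}")
print(f"F = {f_stat:.2f}")
```

The results agree with the R Square, Adjusted R Square and F entries in the table to within rounding of the tabulated sums of squares.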

Page 68: Statistics in WR: Lecture 1

Time Series Trend: Tide Levels at San Diego

http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA

Page 69: Statistics in WR: Lecture 1

One harmonic

Page 70: Statistics in WR: Lecture 1

Five harmonics

http://en.wikipedia.org/wiki/Fourier_series
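The one-harmonic and five-harmonic fits can be sketched as follows. The code assumes evenly spaced values covering exactly one full cycle (for example 24 hourly or 12 monthly means) and fewer harmonics than half the number of points, in which case the Fourier coefficients reduce to simple sums. The function name and the synthetic data are hypothetical, not taken from the lecture.

```python
import math

def harmonic_fit(y, n_harmonics):
    """Fit mean + n_harmonics cosine/sine pairs to one full cycle of evenly spaced data.

    Returns (mean, [(a_k, b_k), ...], predict) where predict(t) evaluates the fitted series.
    Assumes n_harmonics < len(y) / 2.
    """
    n = len(y)
    mean = sum(y) / n
    coeffs = []
    for k in range(1, n_harmonics + 1):
        a = 2 / n * sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(y))
        b = 2 / n * sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(y))
        coeffs.append((a, b))

    def predict(t):
        return mean + sum(
            a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
            for k, (a, b) in enumerate(coeffs, start=1)
        )

    return mean, coeffs, predict

# Synthetic diurnal cycle: mean 10, first harmonic amplitude 3, second harmonic amplitude 1
hours = range(24)
series = [10 + 3 * math.cos(2 * math.pi * t / 24) + math.sin(2 * math.pi * 2 * t / 24) for t in hours]
mean, coeffs, predict = harmonic_fit(series, 2)
print(f"mean = {mean:.3f}, coefficients = {[(round(a, 3), round(b, 3)) for a, b in coeffs]}")
```

Adding harmonics lets the fitted curve follow sharper features of the cycle, at the cost of two more coefficients per harmonic.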

Page 71: Statistics in WR: Lecture 1

Statistics in WR: Lecture 13

• Key Themes
– ANOVA for sediment data
– Fourier series for diurnal cycles
– Fourier series for seasonal cycles

Page 72: Statistics in WR: Lecture 1

Analysis of Variance (ANOVA)

Assumptions

There are several variants (one factor, two factor, two factor with replication). We will deal just with One Factor ANOVA

Page 73: Statistics in WR: Lecture 1

Single Factor ANOVA

Page 74: Statistics in WR: Lecture 1

Single Factor ANOVA

Page 75: Statistics in WR: Lecture 1

ANOVA Formulas

Page 76: Statistics in WR: Lecture 1

Single Factor ANOVA

Page 77: Statistics in WR: Lecture 1

Groups of Sediment Load Data (Ex3)

TWDB Mean: 189,000 Ton/yr
USGS1 Mean: 218,000 Ton/yr
USGS2 Mean: 97,000 Ton/yr
Overall Mean: 183,000 Ton/yr

Values marked on the figure: Zero; 480,000; 3.5 x 10^6; 5.5 x 10^6