Correlation and Regression Analysis Many engineering design and analysis problems involve factors...


Page 1

Correlation and Regression Analysis

• Many engineering design and analysis problems involve factors that are interrelated and dependent, e.g., (1) runoff volume and rainfall; (2) evaporation, temperature, and wind speed; (3) peak discharge, drainage area, and rainfall intensity; (4) crop yield, irrigation water, and fertilizer.

• Because system behavior is inherently complex and the processes involved are not fully understood, the relationships among the relevant factors or variables are established empirically or semi-empirically.

• Regression analysis is a widely used statistical tool for investigating the relationship between two or more variables that are related in a non-deterministic fashion.

• If a variable $Y$ is related to several variables $X_1, X_2, \ldots, X_K$, their relationship can be expressed, in general, as

$$Y = g(X_1, X_2, \ldots, X_K)$$

where $g(\cdot)$ = general expression for a function;

$Y$ = dependent (or response) variable;

$X_1, X_2, \ldots, X_K$ = independent (or explanatory) variables.

Page 2

Correlation

• When a problem involves two dependent random variables, the degree of linear dependence between the two can be measured by the correlation coefficient $\rho(X,Y)$, which is defined as

$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $\mathrm{Cov}(X,Y)$ is the covariance between random variables $X$ and $Y$, defined as

$$\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - \mu_X \mu_Y$$

where $-\infty < \mathrm{Cov}(X,Y) < \infty$ and $-1 \le \rho(X,Y) \le 1$.

• Various correlation coefficients have been developed in statistics for measuring the degree of association between random variables. The one defined above is called the Pearson product-moment correlation coefficient, or simply the correlation coefficient.

• If the two random variables $X$ and $Y$ are independent, then $\rho(X,Y) = \mathrm{Cov}(X,Y) = 0$. However, the reverse statement is not necessarily true.
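The last point, that zero correlation does not imply independence, can be checked with a small exact calculation. The sketch below uses a made-up discrete distribution (X uniform on {-1, 0, 1} with Y = X², an assumption chosen only for illustration) and evaluates the covariance with exact fractions.

```python
from fractions import Fraction

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is fully determined by X (the two are clearly dependent).
outcomes = [(-1, 1), (0, 0), (1, 1)]
prob = Fraction(1, 3)  # each (x, y) outcome is equally likely


def expect(f):
    """Expected value of f(x, y) over the equally likely outcomes."""
    return sum(prob * f(x, y) for x, y in outcomes)


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

print("E[X] =", mu_x, " E[Y] =", mu_y, " Cov(X, Y) =", cov_xy)
```

The covariance comes out as exactly zero, so the correlation coefficient is zero as well, even though Y is a deterministic function of X.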

Page 3

Cases of Correlation

• Perfectly linearly correlated in opposite direction

• Strongly and positively correlated in linear fashion

• Perfectly correlated in nonlinear fashion, but uncorrelated linearly

• Uncorrelated in linear fashion

Page 4

Calculation of Correlation Coefficient

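• Given a set of $n$ paired sample observations of two random variables, $(x_i, y_i)$, $i = 1, 2, \ldots, n$, the sample correlation coefficient $r$ can be calculated as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$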
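As a rough illustration, the sample correlation coefficient can be computed from paired data with the usual product-moment form: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the toy data below are assumptions made for this sketch.

```python
import math


def sample_correlation(xs, ys):
    """Pearson product-moment correlation of paired samples (x_i, y_i)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Toy data (hypothetical): a perfect increasing line, a perfect
# decreasing line, and a symmetric parabola.
r_up = sample_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_down = sample_correlation([1, 2, 3, 4], [8, 6, 4, 2])
r_parabola = sample_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_up, r_down, r_parabola)
```

The three toy cases correspond to a perfect positive linear relation, a perfect negative linear relation, and a perfect nonlinear relation that is linearly uncorrelated.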

Page 5

Auto-correlation

• Consider the following daily stream flows (in 1000 m^3) in June 2001 at Chung Mei Upper Station (610 ha), located upstream of a river feeding Plover Cove Reservoir. Determine the 1-day auto-correlation coefficient, i.e., $\rho(Q_t, Q_{t+1})$.

Day (t)   Flow Q(t)    Day (t)   Flow Q(t)    Day (t)   Flow Q(t)
   1         8.35        11       313.89        21        20.06
   2         6.78        12       480.88        22        17.52
   3         6.32        13       151.28        23       116.13
   4        17.36        14        83.92        24        68.25
   5       191.62        15        44.58        25       280.22
   6        82.33        16        36.58        26       347.53
   7       524.45        17        33.65        27       771.30
   8       196.77        18        26.39        28       124.20
   9       785.09        19        22.98        29        58.00
  10       562.05        20        21.92        30        44.08

• 29 pairs: $\{(Q_t, Q_{t+1})\} = \{(Q_1, Q_2), (Q_2, Q_3), \ldots, (Q_{29}, Q_{30})\}$

Relevant sample statistics: $n = 29$; $\bar{Q}_t = 186.22$; $S_{Q_t} = 230.06$; $\bar{Q}_{t+1} = 187.45$; $S_{Q_{t+1}} = 229.17$

The 1-day auto-correlation is 0.439.
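The lag-1 calculation can be retraced with a short script. This is a sketch, not the method used on the slide: the flows are typed in from the table, and the helper names are assumptions. It computes the paired-sample statistics and two versions of the coefficient. `r_consistent` uses the same normalization in the numerator and denominator. `r_mixed` divides the cross-product sum by n while using (n - 1)-based standard deviations; this is an inference from the numbers, not something the slides state.

```python
import math

# Daily flows Q(1)..Q(30) in 1000 m^3, typed in from the June 2001 table.
flows = [
    8.35, 6.78, 6.32, 17.36, 191.62, 82.33, 524.45, 196.77, 785.09, 562.05,
    313.89, 480.88, 151.28, 83.92, 44.58, 36.58, 33.65, 26.39, 22.98, 21.92,
    20.06, 17.52, 116.13, 68.25, 280.22, 347.53, 771.30, 124.20, 58.00, 44.08,
]


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation with the (n - 1) divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def lag_pairs(series, lag=1):
    """Return the paired samples (Q_t, Q_{t+lag}) as two lists."""
    return series[:-lag], series[lag:]


def pearson_r(x, y):
    """Product-moment correlation with consistent normalization."""
    mx, my = mean(x), mean(y)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


q_t, q_next = lag_pairs(flows, lag=1)
n = len(q_t)

r_consistent = pearson_r(q_t, q_next)
# Cross-product sum divided by n, standard deviations based on (n - 1).
r_mixed = (
    sum(a * b for a, b in zip(q_t, q_next)) / n - mean(q_t) * mean(q_next)
) / (sample_std(q_t) * sample_std(q_next))

print("n =", n)
print("means:", round(mean(q_t), 2), round(mean(q_next), 2))
print("std devs:", round(sample_std(q_t), 2), round(sample_std(q_next), 2))
print("r (consistent) =", round(r_consistent, 3))
print("r (mixed)      =", round(r_mixed, 3))
```

With these data the mixed form reproduces the quoted 0.439, while the consistently normalized coefficient is roughly 0.455. The two differ only by the factor n/(n - 1), so the gap shrinks as the record gets longer.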

Page 6

[Figure: Chung Mei Upper Daily Flow. Flow (1000 cubic meters) versus Day.]

[Figure: Autocorrelation for June 2001 Daily Flows at Chung Mei Upper, HK. Autocorrelation versus Time lags (Days), lags 1 to 5.]

[Figure: Q(t+1) versus Q(t), both in 1000 m^3.]

Page 7

Regression Models

• Due to the presence of uncertainties, a deterministic functional relationship is generally not very appropriate or realistic.

• The deterministic model form can be modified to account for uncertainties in the model as

$$Y = g(X_1, X_2, \ldots, X_K) + \varepsilon$$

where $\varepsilon$ = model error term with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

• In engineering applications, functional forms commonly used for establishing empirical relationships are

– Additive: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \varepsilon$

– Multiplicative: $Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \cdots X_K^{\beta_K} + \varepsilon$
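To make the two functional forms concrete, the sketch below evaluates the deterministic part of an additive and a multiplicative model and then adds a zero-mean random error to one of them. The coefficient values, the inputs, and the Gaussian error are all assumptions for illustration; the slides only require the error to have zero mean and variance σ².

```python
import random


def additive(beta, x):
    """beta[0] + beta[1]*x[0] + ... + beta[K]*x[K-1]."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))


def multiplicative(beta, x):
    """beta[0] * x[0]**beta[1] * ... * x[K-1]**beta[K]."""
    result = beta[0]
    for b, xj in zip(beta[1:], x):
        result *= xj ** b
    return result


# Hypothetical coefficients and inputs (K = 2 explanatory variables).
beta = [1.0, 0.5, 2.0]
x = [2.0, 3.0]

y_additive = additive(beta, x)              # deterministic part only
y_multiplicative = multiplicative(beta, x)  # deterministic part only

# One simulated observation: deterministic part plus a zero-mean error.
rng = random.Random(0)   # fixed seed so the run is repeatable
sigma = 0.5              # assumed error standard deviation
y_observed = y_additive + rng.gauss(0.0, sigma)

print(y_additive, y_multiplicative, y_observed)
```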

Page 8

Least Squares Method

Suppose that there are $n$ pairs of data, $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, and a plot of these data appears as

[Figure: scatter plot of y versus x]

What is a plausible mathematical model describing the relationship between x and y?

Page 9

Least Squares Method

Consider an arbitrary straight line, $\hat{y} = \beta_0 + \beta_1 x$, to be fitted through these data points. The question is: "Which line is the most representative?"

[Figure: data points and the line $\hat{y} = \beta_0 + \beta_1 x$, with intercept $\beta_0$ and slope $\beta_1$; for the observation $(x_i, y_i)$, the fitted value on the line is $\hat{y}_i$.]

$e_i = y_i - \hat{y}_i$ = error (residual)

Page 10

Least Squares Criterion

• What are the values of $\beta_0$ and $\beta_1$ such that the resulting line "best" fits the data points?

• But wait! What goodness-of-fit criterion should be used to choose among all possible combinations of $\beta_0$ and $\beta_1$?

• The least squares (LS) criterion states that the sum of the squares of the errors (or residuals, deviations) is a minimum. Mathematically, the LS criterion can be written as:

$$\min_{\beta_0, \beta_1} D = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

• Are there any other criteria that could be used?
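A small sketch can make the criterion tangible: it scores a few candidate lines by the sum of squared errors and, as one possible alternative criterion, by the sum of absolute errors. The data, the candidate lines, and the choice of absolute error as the alternative are assumptions for illustration only.

```python
def sum_squared_errors(b0, b1, xs, ys):
    """LS criterion: sum of squared residuals y_i - b0 - b1 * x_i."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


def sum_absolute_errors(b0, b1, xs, ys):
    """An alternative criterion: sum of absolute residuals."""
    return sum(abs(y - b0 - b1 * x) for x, y in zip(xs, ys))


# Hypothetical data and candidate (intercept, slope) pairs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
candidates = [(0.0, 2.0), (0.5, 1.8), (1.0, 1.5)]

for b0, b1 in candidates:
    sse = sum_squared_errors(b0, b1, xs, ys)
    sae = sum_absolute_errors(b0, b1, xs, ys)
    print(f"b0={b0}, b1={b1}: squared={sse:.3f}, absolute={sae:.3f}")

# Candidate with the smallest sum of squared errors.
best = min(candidates, key=lambda c: sum_squared_errors(c[0], c[1], xs, ys))
print("best by LS among candidates:", best)
```

Comparing a handful of candidates only ranks those candidates. Finding the minimizing line over all possible intercepts and slopes requires solving the normal equations.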

Page 11

Normal Equations for LS Criterion

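• The necessary conditions for the minimum value of $D$ are:

$$\frac{\partial D}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial D}{\partial \beta_1} = 0$$

• Expanding the above equations:

$$\frac{\partial D}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$

$$\frac{\partial D}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0$$

which simplify to

$$\sum_{i=1}^{n} y_i - n \beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0$$

$$\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0$$

• Normal equations:

$$n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$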

Page 12

LS Solution (2 Unknowns)

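$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$

$$\hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \hat{\beta}_1 \, \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{y} - \hat{\beta}_1 \bar{x}$$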
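The two-unknown case has a closed-form solution: the slope is (Σ x_i y_i - n x̄ ȳ) / (Σ x_i² - n x̄²) and the intercept is ȳ minus the slope times x̄. A minimal sketch follows; the function name and the toy data are assumptions. It also checks that the fitted residuals satisfy both normal equations.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = fit_line(xs, ys)
residuals = [y - b0_hat - b1_hat * x for x, y in zip(xs, ys)]

# Both normal equations require these sums to vanish at the LS solution.
sum_residuals = sum(residuals)
sum_x_residuals = sum(x * e for x, e in zip(xs, residuals))

print("b0 =", round(b0_hat, 4), " b1 =", round(b1_hat, 4))
print("sum e_i =", round(sum_residuals, 10))
print("sum x_i e_i =", round(sum_x_residuals, 10))
```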

Page 13

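Fitting a Polynomial Equation by LS Method

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

LS criterion:

$$\min_{\beta_0, \beta_1, \ldots, \beta_k} D = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2 - \cdots - \beta_k x_i^k \right)^2$$

Set $\dfrac{\partial D}{\partial \beta_j} = 0$ for $j = 0, 1, 2, \ldots, k$.

Normal equations are:

$$n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i + \cdots + \beta_k \sum_{i=1}^{n} x_i^k = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 + \cdots + \beta_k \sum_{i=1}^{n} x_i^{k+1} = \sum_{i=1}^{n} x_i y_i$$

$$\vdots$$

$$\beta_0 \sum_{i=1}^{n} x_i^k + \beta_1 \sum_{i=1}^{n} x_i^{k+1} + \cdots + \beta_k \sum_{i=1}^{n} x_i^{2k} = \sum_{i=1}^{n} x_i^k y_i$$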
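The polynomial normal equations form a (k+1) by (k+1) linear system whose coefficients are power sums of x (Σ x_i^p for p = 0 to 2k), with right-hand sides Σ x_i^p y_i. The sketch below builds that system and solves it with a small Gaussian-elimination helper. The helper, the function names, and the test data are assumptions for illustration. For high-degree polynomials this system can become badly conditioned, so a library solver would normally be preferred.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, k):
    """Return [b0, b1, ..., bk] from the polynomial normal equations."""
    power_sums = [sum(x ** p for x in xs) for p in range(2 * k + 1)]
    rhs = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(k + 1)]
    # Row r, column c holds sum of x_i^(r + c); the (0, 0) entry equals n.
    a = [[power_sums[r + c] for c in range(k + 1)] for r in range(k + 1)]
    return solve_linear_system(a, rhs)


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]

coeffs = fit_polynomial(xs, ys, k=2)
print([round(c, 6) for c in coeffs])
```

Because the data are noise-free, the fitted coefficients recover the generating polynomial to within rounding error.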

Page 14

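Fitting a Linear Function of Several Variables

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

LS criterion:

$$\min_{\beta_0, \beta_1, \ldots, \beta_k} D = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik} \right)^2$$

Set $\dfrac{\partial D}{\partial \beta_j} = 0$ for $j = 0, 1, 2, \ldots, k$.

Normal equations:

$$n \beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \beta_k \sum_{i=1}^{n} x_{i1} x_{ik} = \sum_{i=1}^{n} x_{i1} y_i$$

$$\vdots$$

$$\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik}^2 = \sum_{i=1}^{n} x_{ik} y_i$$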

Page 15

Matrix Form of Multiple Regression by LS

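$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

(Note: $x_{ij}$ = $i$th observation of the $j$th independent variable)

or $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ in short.

LS criterion is:

$$\min_{\boldsymbol{\beta}} D = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}' \boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})$$

Set $\dfrac{\partial D}{\partial \boldsymbol{\beta}} = \mathbf{0}$, which results in:

$$\mathbf{X}' (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) = \mathbf{0}$$

The LS solutions are:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}$$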
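A dependency-free sketch of the matrix solution is given below. It builds the design matrix with a leading column of ones, forms X'X and X'y, and solves the linear system X'X b = X'y rather than inverting X'X explicitly. That choice is made here for numerical convenience and is not something the slides prescribe. The helper names and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(x_rows, y):
    """Estimate [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in x_rows]   # X with ones column
    columns = list(zip(*design))                     # columns of X
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
           for ci in columns]                        # X'X
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]  # X'y
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data from y = 2 + 0.5*x1 - 1.5*x2.
x_rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in x_rows]

beta_hat = least_squares(x_rows, y)
print([round(b, 6) for b in beta_hat])
```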

Page 16

Measure of Goodness-of-Fit

$R^2$ = Coefficient of Determination

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \varepsilon_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

= 1 - proportion of variation in the dependent variable, $y$, unexplained by the regression equation;

= proportion of variation in the dependent variable, $y$, explained by the regression equation.
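The coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares about the mean of y. The sketch below computes it for a fitted line. The data are hypothetical, and the coefficients 0.05 and 1.99 are the least-squares estimates for that toy data set, used here as given values.

```python
def r_squared(ys, y_hat):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, y_hat))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_residual / ss_total


# Hypothetical data and the LS line fitted to it (b0 = 0.05, b1 = 1.99).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

y_hat = [b0 + b1 * x for x in xs]
r2 = r_squared(ys, y_hat)
print("R^2 =", round(r2, 4))
```

A value close to 1 means the line accounts for nearly all of the variation in y. When the intercept is omitted from the model, this quantity is no longer guaranteed to fall between 0 and 1.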

Page 17

Example 1 (LS Method)

Page 18

Example 1 (LS Method)

Page 19

LS Example

Page 20

LS Example (Matrix Approach)

Page 21

LS Example (by Minitab w/ $\beta_0$)

Page 22

LS Example (by Minitab w/o $\beta_0$)

Page 23

LS Example (Output Plots)