Statistics lecture 11 (chapter 11)


Regression & Correlation

Transcript of Statistics lecture 11 (chapter 11)

Page 1: Statistics lecture 11 (chapter 11)


Page 2: Statistics lecture 11 (chapter 11)

• Analyze the relationship between two quantitative variables
• Correlation determines the strength and direction of the relationship between the variables
• Regression determines a mathematical equation that describes the relationship
• The equation can be used for prediction

Page 3: Statistics lecture 11 (chapter 11)

• Regression Analysis
– X → independent variable
– Y → dependent variable
– The independent variable influences the dependent variable
– Sample consists of n pairs of observations
– Ascertain whether a relation exists
– Examine the nature of the relation
– Obtain an equation that relates Y to X
– Evaluate the magnitude of the change in one variable due to a change in the other variable
– Predict the value of Y for different values of X

Page 4: Statistics lecture 11 (chapter 11)

• Regression Analysis – scatter plot
– Effective way to display the relationship
– X variable on horizontal axis
– Y variable on vertical axis
– Plot a dot for each pair of observations
– Can determine the
  • Form: linear or nonlinear
  • Direction: positive or negative
  • Strength: dots scattered close together indicate a strong relation; large scatter indicates a weak relation

Page 5: Statistics lecture 11 (chapter 11)

• Regression Analysis – scatter plot
– Example
– Two variables
  • Cost of producing units
  • Number of units produced
– Cost depends on the number of units

Number of units (x)    Cost per unit (y), in R
10                     10,00
20                     8,80
30                     7,90
50                     6,20
60                     5,00
80                     4,00
100                    3,50
120                    2,00

[Figure: scatter plot "Relation between units produced and cost of production"; x-axis: Number of units; y-axis: Cost per unit (R)]

From the graph there appears to be a negative relation between number of units and cost: the more units produced, the lower the cost per unit.

Page 6: Statistics lecture 11 (chapter 11)


• Simple linear regression analysis

– Which line fits the data best?

[Figure: scatter plot "Relation between units produced and cost of production"; x-axis: Number of units; y-axis: Cost per unit (R)]

Page 7: Statistics lecture 11 (chapter 11)


• Simple linear regression analysis

– Which line fits the data best?

– Method of least squares

– y = a + b x

• b → slope

• a → y intercept

– ∑ei = 0
– ∑ei² measures the size of the set of errors
– Least squares method
  • Chooses the line for which the sum of the squared errors is the smallest
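The least squares idea can be checked numerically. The sketch below is only an illustration, not part of the lecture: it uses the cost data from the example, computes the least squares line from the closed-form formulas given later in the lecture, and compares its sum of squared errors with two hypothetical alternative lines (the "flatter" and "steeper" coefficients are made up purely for comparison).

```python
# Data from the cost example: units produced (x) and cost per unit in rand (y).
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.0, 8.8, 7.9, 6.2, 5.0, 4.0, 3.5, 2.0]


def sse(a, b):
    """Sum of squared errors e_i = y_i - (a + b*x_i) for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept (closed-form solution, unrounded).
b_ls = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x))
a_ls = y_bar - b_ls * x_bar

# Two hypothetical competing lines, chosen only for comparison.
candidates = {
    "least squares": (a_ls, b_ls),
    "flatter": (9.0, -0.05),
    "steeper": (11.0, -0.09),
}
sse_by_line = {name: sse(a, b) for name, (a, b) in candidates.items()}

# For the least squares line the errors sum to zero (up to floating point).
residual_sum = sum(yi - (a_ls + b_ls * xi) for xi, yi in zip(x, y))

for name, value in sse_by_line.items():
    print(f"{name:>13}: SSE = {value:.4f}")
```

Any other straight line tried in place of the two alternatives should also give a larger sum of squared errors than the least squares line.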

Page 8: Statistics lecture 11 (chapter 11)


• Least squares regression model

– Population regression model

• Y = α + βx + ε

• ε random error

– Sample regression model

• ŷ = a + b x

• b → slope: change in ŷ for a one-unit change in x

• a → y-intercept: value of ŷ when x = 0

Page 9: Statistics lecture 11 (chapter 11)

• Least squares regression model
– ŷ = a + b x

Number of units (x)    Cost per unit (y), in R
10                     10,00
20                     8,80
30                     7,90
50                     6,20
60                     5,00
80                     4,00
100                    3,50
120                    2,00

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033
x̄ = 58,75    ȳ = 5,925

b = Sxy / Sxx and a = ȳ - b·x̄, where
Sxx = ∑x² - (1/n)(∑x)²
Syy = ∑y² - (1/n)(∑y)²
Sxy = ∑xy - (1/n)(∑x)(∑y)

Page 10: Statistics lecture 11 (chapter 11)

• Least squares regression model
– ŷ = a + b x

b = Sxy / Sxx and a = ȳ - b·x̄, where
Sxx = ∑x² - (1/n)(∑x)²
Syy = ∑y² - (1/n)(∑y)²
Sxy = ∑xy - (1/n)(∑x)(∑y)

For the same data (x = number of units, y = cost per unit):
∑x = ?    ∑y = ?
∑x² = ?    ∑y² = ?
∑xy = ?

Calculate Sxx, Syy, Sxy

Page 11: Statistics lecture 11 (chapter 11)

• Least squares regression model
– ŷ = a + b x

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033
x̄ = 58,75    ȳ = 5,925

Sxx = 38300 - (1/8)(470)² = 10687,5
Syy = 335,54 - (1/8)(47,4)² = 54,695
Sxy = 2033 - (1/8)(470)(47,4) = -751,75

b = Sxy / Sxx and a = ȳ - b·x̄
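The sums and the quantities Sxx, Syy and Sxy can be reproduced with a few lines of Python. This is a minimal sketch using only the data from the cost example; the variable names are illustrative and not from the lecture.

```python
# Units produced (x) and cost per unit in rand (y) from the example table.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.0, 8.8, 7.9, 6.2, 5.0, 4.0, 3.5, 2.0]
n = len(x)

# Column sums needed by the formulas.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Sxx, Syy and Sxy as defined on the slides.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

print(f"sum x = {sum_x}, sum y = {sum_y:.1f}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2:.2f}, sum xy = {sum_xy:.0f}")
print(f"Sxx = {s_xx:.1f}, Syy = {s_yy:.3f}, Sxy = {s_xy:.2f}")
```

Note that Sxy is negative, which already signals a negative slope.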

Page 12: Statistics lecture 11 (chapter 11)

• Least squares regression model

Sxx = 10687,5    Syy = 54,695    Sxy = -751,75
x̄ = 58,75    ȳ = 5,925

b = Sxy / Sxx = -751,75 / 10687,5 = -0,07
a = ȳ - b·x̄ = 5,925 - (-0,07)(58,75) = 10,0375

→ ŷ = 10,0375 – 0,07x

Note: Syy is not used here, but we will use it later!
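As a check on the arithmetic, the sketch below recomputes b and a from the values of Sxx, Sxy and the two means. It follows the slides in rounding b to two decimals before computing a; the unrounded intercept is shown as well, since the rounding shifts it slightly. Variable names are illustrative.

```python
# Values taken from the slides.
s_xx = 10687.5
s_xy = -751.75
x_bar = 58.75
y_bar = 5.925

# Slope: b = Sxy / Sxx.
b = s_xy / s_xx
b_rounded = round(b, 2)              # the slides work with -0,07

# Intercept: a = y_bar - b * x_bar.
a_slide = y_bar - b_rounded * x_bar  # uses the rounded slope, as on the slide
a_full = y_bar - b * x_bar           # uses the unrounded slope

print(f"b = {b:.5f} (rounded: {b_rounded})")
print(f"a with rounded b   = {a_slide:.4f}")
print(f"a with unrounded b = {a_full:.4f}")
```

With the rounded slope the intercept is 10,0375 as on the slide; with the unrounded slope it is about 10,06, so predictions can differ in the second decimal depending on when the rounding is done.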

Page 13: Statistics lecture 11 (chapter 11)

• Least squares regression model
– ŷ = a + b x
– ŷ = 10,0375 – 0,07x

[Figure: three sketches of y against x]
– b > 0: positive linear relation
– b < 0: negative linear relation
– b = 0: no relation

Page 14: Statistics lecture 11 (chapter 11)


• Plot least squares regression model

– ŷ = 10,04 – 0,07x

[Figure: scatter plot "Relation between units produced and cost of production" with the regression line; x-axis: Number of units; y-axis: Cost per unit (R)]

If x = 30:

→ ŷ = 10,04 - 0,07(30)

=7,94

If x = 90:

→ ŷ = 10,04 - 0,07(90)

= 3,74

Page 15: Statistics lecture 11 (chapter 11)

EXAMPLE

A car manufacturing business wants to find out how the price of its car models depreciates with age. The business took a sample of 8 models and collected the following information on age (years) and price (R1000):

Age (years)      8    3    6    9    2    5    6    3
Price (R1000)   16   74   38   19  102   36   33   69

Find the equation of the regression line with price as the dependent variable and age as the independent variable.

Page 16: Statistics lecture 11 (chapter 11)

Example answer

Example 11.4, textbook, part 2, page 383
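The textbook's worked answer is not reproduced in this transcript. As a rough check, the sketch below applies the same formulas (b = Sxy / Sxx, a = ȳ - b·x̄) to the age and price data; the function name `fit_line` is an illustrative choice, and the printed coefficients are unrounded results of this calculation, not values quoted from the textbook.

```python
def fit_line(x, y):
    """Return (a, b) of the least squares line y_hat = a + b*x."""
    n = len(x)
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = s_xy / s_xx
    a = sum(y) / n - b * sum(x) / n
    return a, b


# Age in years (independent variable) and price in R1000 (dependent variable).
age = [8, 3, 6, 9, 2, 5, 6, 3]
price = [16, 74, 38, 19, 102, 36, 33, 69]

a, b = fit_line(age, price)
print(f"price_hat = {a:.2f} + ({b:.2f}) * age")
```

The slope is negative, as expected for depreciation: each additional year of age lowers the estimated price.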


Page 17: Statistics lecture 11 (chapter 11)

PREDICTIONS IN REGRESSION ANALYSIS

• A sample regression line is usually obtained for the purpose of prediction
• That is, to estimate the value of Y corresponding to a selected value of x
• Two ways to estimate y:
– Point estimate
– Confidence interval

Page 18: Statistics lecture 11 (chapter 11)

• Prediction with regression model
– Point estimate using ŷ = 10,04 – 0,07x
– What will the estimated cost be if 60 units are produced?
– ŷ = 10,04 – 0,07(60) = R5,84
– What will the estimated cost be if 25 units are produced?
– ŷ = 10,04 – 0,07(25) = R8,29
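The point estimates can be reproduced with a small helper. This is a minimal sketch using the rounded coefficients from the slides (a = 10,04 and b = -0,07); the function name `predict_cost` is illustrative.

```python
# Rounded regression coefficients from the slides.
A = 10.04   # intercept
B = -0.07   # slope


def predict_cost(units):
    """Point estimate of cost per unit (in rand) for a given number of units."""
    return A + B * units


# Point estimates for the x values used in the lecture.
estimates = {units: round(predict_cost(units), 2) for units in (25, 30, 60, 90)}

for units, cost in estimates.items():
    print(f"x = {units:>3}: y_hat = R{cost:.2f}")
```

All four x values lie inside the observed range of 10 to 120 units; the line should be used with caution outside that range.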

Page 19: Statistics lecture 11 (chapter 11)

ERRORS

• Once the regression line has been estimated, every observed value has a corresponding predicted value
• The predicted values all fall exactly on the regression line
• Not all observed values fall on the regression line
• The difference between an observed value and its predicted value is known as an ERROR and is denoted by ei

Page 20: Statistics lecture 11 (chapter 11)

ERRORS
• Since the observed values deviate from the predicted values, the regression equation is not a perfect predictor
• We need to assess how accurately the regression line predicts the values; this is done by analysing the errors ei
• The standard deviation of the errors measures how widely the observed values are spread around the regression line
• The smaller the standard deviation, the more closely the points cluster around the line

Page 21: Statistics lecture 11 (chapter 11)

• Standard deviation of random errors
– ŷ = 10,04 – 0,07x
– ei indicates how the observed and predicted values differ
– Standard deviation of errors measures spread around the line
  • Smaller: points closer to line

ŷ = 10,04 – 0,07(10) = 9,34
ŷ = 10,04 – 0,07(20) = 8,64

Number of units (x)    Cost per unit (y)    Predicted cost per unit (ŷ)    Difference ei = yi - ŷi
10                     10,00                9,34                           0,66
20                     8,80                 8,64                           0,16
30                     7,90                 7,94                           -0,04
50                     6,20                 6,54                           -0,34
60                     5,00                 5,84                           -0,84
80                     4,00                 4,44                           -0,44
100                    3,50                 3,04                           0,46
120                    2,00                 1,64                           0,36
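The table of predicted values and errors can be regenerated as follows. This is a sketch using the rounded line ŷ = 10,04 – 0,07x from the slides; the variable names are illustrative.

```python
# Data and rounded coefficients from the slides.
x = [10, 20, 30, 50, 60, 80, 100, 120]
y = [10.0, 8.8, 7.9, 6.2, 5.0, 4.0, 3.5, 2.0]
A, B = 10.04, -0.07

# One row per observation: (x, y, predicted y, error e = y - predicted y).
rows = []
for xi, yi in zip(x, y):
    y_hat = A + B * xi
    rows.append((xi, yi, round(y_hat, 2), round(yi - y_hat, 2)))

errors = [row[3] for row in rows]

print(f"{'x':>4} {'y':>6} {'y_hat':>6} {'e':>6}")
for xi, yi, y_hat, e in rows:
    print(f"{xi:>4} {yi:>6.2f} {y_hat:>6.2f} {e:>6.2f}")
print(f"sum of errors = {sum(errors):.2f}")
```

With these rounded coefficients the errors sum to about -0,02 rather than exactly 0; the small discrepancy comes from rounding a and b.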

Page 22: Statistics lecture 11 (chapter 11)

• Standard deviation of random errors

se = √[(Syy - b·Sxy) / (n - 2)]
   = √[(54,695 - (-0,07)(-751,75)) / (8 - 2)]
   = 0,588

– Small
– Values close to line
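The standard deviation of the errors can be recomputed from Syy, Sxy and b. The sketch below uses the shortcut formula se = √[(Syy - b·Sxy) / (n - 2)] with the rounded slope b = -0,07, as on the slide; variable names are illustrative.

```python
import math

# Values from the slides.
s_yy = 54.695
s_xy = -751.75
b = -0.07      # rounded slope used in the lecture
n = 8

# Shortcut formula: the unexplained variation is Syy - b*Sxy,
# divided by n - 2 degrees of freedom.
s_e = math.sqrt((s_yy - b * s_xy) / (n - 2))

print(f"s_e = {s_e:.3f}")
```

The divisor is n - 2 rather than n - 1 because two parameters (a and b) are estimated from the sample.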

Page 23: Statistics lecture 11 (chapter 11)

CONFIDENCE INTERVAL FOR PREDICTION

• Different samples from the same population will give different point estimates
• It is likely that different samples from the same population will give different estimated regression lines
• We therefore need to construct a confidence interval for Y, based on one sample, that gives a more reliable estimate of Y
• Generally called a PREDICTION INTERVAL

Page 24: Statistics lecture 11 (chapter 11)

• Confidence interval for prediction
– Point estimate for 60 units
  • ŷ = 10,04 – 0,07(60) = R5,84
– Rather calculate a confidence interval for the mean value of y for a given x value
– Use the t-distribution
– Confidence interval for the mean of y, given x = x0:

CONF(μ y|x0) = a + b·x0 ± t(n-2; 1-α/2) · s(ŷ|x0)

where s(ŷ|x0) = se · √[1/n + (x0 - x̄)² / Sxx]

Page 25: Statistics lecture 11 (chapter 11)

• Confidence interval for prediction

CONF(μ y|x0) = a + b·x0 ± t(n-2; 1-α/2) · s(ŷ|x0)

where s(ŷ|x0) = se · √[1/n + (x0 - x̄)² / Sxx]
              = 0,588 · √[1/8 + (60 - 58,75)² / 10687,5]
              = 0,2080

Page 26: Statistics lecture 11 (chapter 11)

• Confidence interval for prediction
– 95% confidence interval if x = 60

CONF(μ y|x0) = a + b·x0 ± t(n-2; 1-α/2) · s(ŷ|x0)
             = 10,04 - 0,07(60) ± t(8-2; 1-0,025) · (0,2080)
             = 5,84 ± 2,447(0,2080)
             = 5,84 ± 0,508976
             = (5,33 ; 6,35)

– We are 95% sure that the mean cost for 60 units will be between R5,33 and R6,35
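The interval can be reproduced step by step. This sketch uses the rounded values from the slides and hard-codes the critical value 2,447 (6 degrees of freedom, 95% confidence) as read from the t-table in the lecture, rather than computing it; the function name `mean_response_interval` is illustrative.

```python
import math

# Values from the slides.
a, b = 10.04, -0.07   # rounded regression coefficients
s_e = 0.588           # standard deviation of the errors
s_xx = 10687.5
x_bar = 58.75
n = 8
t_crit = 2.447        # t-table value for n - 2 = 6 df, 95% confidence


def mean_response_interval(x0):
    """Confidence interval for the mean of y at x = x0."""
    y_hat = a + b * x0
    s_fit = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    margin = t_crit * s_fit
    return y_hat - margin, y_hat + margin


lower, upper = mean_response_interval(60)
print(f"95% interval at x = 60: (R{lower:.2f} ; R{upper:.2f})")
```

Because of the (x0 - x̄)² term, the interval is narrowest near x̄ = 58,75 and widens as x0 moves away from it.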

Page 27: Statistics lecture 11 (chapter 11)

• Inferences about β (population slope)
– b is the point estimate of β
– The t-distribution is used to make inferences about β
– Confidence interval for β:

CONF(β) = b ± t(n-2; 1-α/2) · sb, where sb = se / √Sxx

– If the confidence interval includes 0: no linear relation
– If the confidence interval does not include 0: there might be a linear relation

Page 28: Statistics lecture 11 (chapter 11)

• Inferences about β (population slope)
– Confidence interval for β

CONF(β) = b ± t(n-2; 1-α/2) · sb

where sb = se / √Sxx = 0,588 / √10687,5 = 0,00569

Page 29: Statistics lecture 11 (chapter 11)

• Inferences about β (population slope)
– Confidence interval for β

CONF(β) = b ± t(n-2; 1-α/2) · sb
        = -0,07 ± 2,447(0,00569)
        = (-0,0839 ; -0,0561)

– We are 95% sure that the population slope lies between -0,0839 and -0,0561
– Interval does not include 0
– There might be a linear relation
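The interval for the slope follows the same pattern. This is a sketch with the rounded values from the slides; the critical value 2,447 is again taken from the t-table rather than computed, and the variable names are illustrative.

```python
import math

# Values from the slides.
b = -0.07        # sample slope (rounded)
s_e = 0.588      # standard deviation of the errors
s_xx = 10687.5
t_crit = 2.447   # t-table value for 6 df, 95% confidence

# Standard error of the slope: s_b = s_e / sqrt(Sxx).
s_b = s_e / math.sqrt(s_xx)

# Confidence interval: b +/- t * s_b.
lower = b - t_crit * s_b
upper = b + t_crit * s_b
contains_zero = lower <= 0 <= upper

print(f"s_b = {s_b:.5f}")
print(f"95% interval for beta: ({lower:.4f} ; {upper:.4f})")
print(f"interval contains 0: {contains_zero}")
```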

Page 30: Statistics lecture 11 (chapter 11)

• Inferences about β (population slope)
– Hypothesis test concerning β

Testing H0: β = 0 for n < 30

Alternative hypothesis    Decision rule: reject H0 if
H1: β ≠ 0                 |t| ≥ t(n-2; 1-α/2)
H1: β > 0                 t ≥ t(n-2; 1-α)
H1: β < 0                 t ≤ -t(n-2; 1-α)

Test statistic: t = b / sb, with sb = se / √Sxx

Page 31: Statistics lecture 11 (chapter 11)

• Solution
– H0 : β = 0
– H1 : β ≠ 0
– α = 0,05
– sb = se / √Sxx = 0,588 / √10687,5 = 0,00569
– t = b / sb = -0,07 / 0,00569 = -12,30
– Critical values: -2,447 and +2,447

[Figure: t-distribution with regions "Reject H0" below -2,447, "Accept H0" between -2,447 and +2,447, "Reject H0" above +2,447]

– Reject H0

At α = 0,05 the slope is not zero: there is a linear relation between number of units and cost per unit

If H1 : β > 0: test for positive slope
If H1 : β < 0: test for negative slope
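The two-sided test can be written out as a short calculation. The sketch below uses the rounded slide values, rounds sb to 0,00569 as the slides do, and takes the critical value 2,447 from the t-table; the variable names are illustrative.

```python
import math

# Values from the slides.
b = -0.07        # sample slope (rounded)
s_e = 0.588      # standard deviation of the errors
s_xx = 10687.5
t_crit = 2.447   # two-sided critical value, 6 df, alpha = 0.05

# Standard error of the slope, rounded as on the slide.
s_b = round(s_e / math.sqrt(s_xx), 5)

# Test statistic for H0: beta = 0.
t_stat = b / s_b

# Two-sided decision rule: reject H0 if |t| >= critical value.
reject_h0 = abs(t_stat) >= t_crit

print(f"s_b = {s_b}")
print(f"t = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The test statistic lies far in the left rejection region, which agrees with the confidence interval for β not containing 0.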

Page 32: Statistics lecture 11 (chapter 11)

• Correlation Analysis
– Strength of linear relationship
– Direction of linear relationship
  • Positive
  • Negative
– Population correlation coefficient ρ (rho)
– Sample correlation coefficient r
– r always between -1 and +1
  • r = 1 perfect positive
  • r = -1 perfect negative
  • r = 0 no linear relationship
  • near 0 weak relationship
  • near -1 or +1 strong relationship

Page 33: Statistics lecture 11 (chapter 11)

Coefficient of correlation

• The coefficient of correlation measures the strength of the linear association between two variables.
• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
– If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.

Page 34: Statistics lecture 11 (chapter 11)

[Figure: six scatter plots of Y against X]
– Perfect positive: r = +1
– High positive: r = +0,9
– Low positive: r = +0,3
– Perfect negative: r = -1
– High negative: r = -0,8
– No correlation: r = 0

Page 35: Statistics lecture 11 (chapter 11)

• Correlation coefficient r

∑x = 470    ∑y = 47,4
∑x² = 38300    ∑y² = 335,54
∑xy = 2033
x̄ = 58,75    ȳ = 5,925

Sxx = 38300 - (1/8)(470)² = 10687,5
Syy = 335,54 - (1/8)(47,4)² = 54,695
Sxy = 2033 - (1/8)(470)(47,4) = -751,75

r = Sxy / √(Sxx · Syy) = -751,75 / √(10687,5 × 54,695) = -0,98

– Strong negative relationship
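The correlation coefficient can be verified directly from Sxx, Syy and Sxy. A minimal sketch with the values from the slides; variable names are illustrative.

```python
import math

# Values from the slides.
s_xx = 10687.5
s_yy = 54.695
s_xy = -751.75

# Sample correlation coefficient: r = Sxy / sqrt(Sxx * Syy).
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"r = {r:.4f}")
```

The sign of r always matches the sign of Sxy, and therefore the sign of the slope b.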

Page 36: Statistics lecture 11 (chapter 11)

• Coefficient of determination r²
– Measures the proportion of the variation in the dependent variable y that can be explained by the independent variable x
– % of total variation in y that is explained by the regression model

r² = (-0,98)² = 0,9604 = 96,04%

– 96% of the variation in the cost per unit is explained by the variation in the number of units produced
– 4% is unexplained
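A short check of the coefficient of determination is sketched below. It computes r² twice: once from the rounded r = -0,98 used on the slide, and once from the unrounded Sxx, Syy and Sxy, to show how much the rounding of r affects the percentage. Variable names are illustrative.

```python
# Values from the slides.
s_xx = 10687.5
s_yy = 54.695
s_xy = -751.75

# r squared from the rounded correlation coefficient, as on the slide.
r_rounded = -0.98
r_sq_slide = r_rounded ** 2

# r squared without rounding r first: Sxy^2 / (Sxx * Syy).
r_sq_full = s_xy ** 2 / (s_xx * s_yy)

print(f"r^2 from rounded r: {r_sq_slide:.4f} ({r_sq_slide:.2%})")
print(f"r^2 unrounded:      {r_sq_full:.4f} ({r_sq_full:.2%})")
```

Both versions support the same conclusion: roughly 96% to 97% of the variation in cost per unit is explained by the number of units produced.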

Page 37: Statistics lecture 11 (chapter 11)

• Hypothesis test concerning the correlation coefficient ρ

Testing H0: ρ = 0 for n < 30

Alternative hypothesis    Decision rule: reject H0 if
H1: ρ ≠ 0                 |t| ≥ t(n-2; 1-α/2)

Test statistic: t = r / √[(1 - r²) / (n - 2)]

Page 38: Statistics lecture 11 (chapter 11)

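• Solution
– H0 : ρ = 0
– H1 : ρ ≠ 0
– α = 0,05
– t = r / √[(1 - r²) / (n - 2)] = -0,98 / √[(1 - (-0,98)²) / (8 - 2)] = -12,06
– Critical values: -2,447 and +2,447

[Figure: t-distribution with regions "Reject H0" below -2,447, "Accept H0" between -2,447 and +2,447, "Reject H0" above +2,447]

– Reject H0

At α = 0,05 the correlation coefficient is not zero: there is a linear relation between number of units and cost per unit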
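The test for ρ can be reproduced in a few lines. This sketch uses the rounded r = -0,98 from the slides and takes the critical value 2,447 from the t-table rather than computing it; variable names are illustrative.

```python
import math

# Values from the slides.
r = -0.98        # sample correlation coefficient (rounded)
n = 8
t_crit = 2.447   # two-sided critical value, 6 df, alpha = 0.05

# Test statistic for H0: rho = 0.
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))

# Two-sided decision rule: reject H0 if |t| >= critical value.
reject_h0 = abs(t_stat) >= t_crit

print(f"t = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```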