Chapter 5 Discovering Relationships Stat-Slide-Show, Copyright 1994-95 by Quant Systems Inc.


Chapter 5

Discovering Relationships

Stat-Slide-Show, Copyright 1994-95 by Quant Systems Inc.

5 - 2

• Discovering Relationships and Patterns in the Data

5 - 3

Statistical Tools

• There are statistical tools which will aid in the discovery of relationships.

• Specifically,

– a graph (called a scatterplot)

– a summary statistical measure (called a correlation coefficient)

– a mathematical model (developed using a technique called regression analysis).

5 - 4

Relationships

• A college admissions counselor would like to know if SAT or ACT scores are related to college performance.

• A student who would like to make an A on an upcoming exam wonders about the relationship between study time and grade received.

5 - 5

Relationships

• A farmer wants to know the relationship between fertilizer and yield of a corn crop.

• A couple seeking to buy their first house wants to know the relationship between house square footage and price.

5 - 6

Discovering a relationship can lead to good things.

• If a guidance counselor knows the relationship between SAT or ACT scores and academic performance at various institutions, then they can use the relationships to recommend colleges.

• If a student knows the relationship between study time and test grade, then the student will be able to choose the grade they wish by studying the amount necessary to achieve the desired grade.

5 - 7

Discovering a relationship can lead to good things.

• If a farmer knows the relationship between fertilizer and corn yield, he can select the fertilizer level that produces the greatest profit.

• If a couple knows the relationship between cost and square footage, they will know when they have a good buy.

5 - 8

Relationships can be represented in many forms.

Relationships Among Ideas

Take, for example, the relationship between enthusiasm and involvement. The diagram suggests that involvement produces enthusiasm and vice versa. However, the exact relationship is not specified with precision.

Enthusiasm

Involvement

5 - 9

Formal Relationships

Mathematical functions

y = 3 + 4x

• This equation specifies an exact relationship between the variables y and x.

• This type of relationship is a deterministic mathematical model.

• The relationship is specified very precisely by the equation.

5 - 10

Discovering Relationships

• When we have data on two variables, we have the possibility of discovering a relationship and building a mathematical model to represent the relationship.

• Is there some general tendency (pattern-relationship) that the data exhibits?

– If two variables are related, there should be some pattern that connects them.

5 - 11

What is a pattern?

• Patterns are structure.

• Patterns connect variables.

• One of the best ways to represent a pattern is with a mathematical model.

5 - 12

Pattern Structure: Upward Sloping Linear Pattern

Note: The points fit the pattern exactly.

X - axis

5 - 13

Pattern Structure: Downward Sloping Linear Pattern

Note: The points tightly fit the pattern. The relationship exhibited by the data pattern is completely captured with a line.

X - axis

5 - 14

Another Upward Sloping Linear Pattern

Note: The data loosely fits the linear pattern. Yet the essence of the pattern is captured with an upward sloping line.

X - axis

5 - 15

Another Upward Sloping Pattern

Note: This picture represents data with a very loose upward sloping pattern. A line could be used to represent the upward sloping nature of the relationship. However, the line would not represent the relationship very well.

X - axis

5 - 16

Data With No Linear Pattern

X - axis

Note: There is no apparent relationship between the x and y variables. Often, it is just as important to know there is no relationship between two variables as it is to discover that one exists.

5 - 17

Let’s look at some bivariate data on High School Graduation Rates vs. Crime Rate (per 100,000) for each of the States.

5 - 18

5 - 19

5 - 20

There is too much data in the previous table to comprehend without the aid of statistical tools.

Let’s look at the data with a scatterplot.

5 - 21

Does there appear to be a pattern in the data?

Note: While the relationship between crime and graduation rate is apparent, the relationship is not very strong.

Crimes Occurring per 100,000 People

5 - 22

Building a Model

5 - 23

Building a Model

• One of the most useful methods of defining a precise relationship between two variables is to create a mathematical model that relates the variables.

• Suppose we know that the relationship between test score and hours of study time is

Test Score = 45 + 3.8 (hrs of study time).

5 - 24

Building a Model

Test Score = 45 + 3.8 (hrs of study time)

• If this mathematical model is accurate, then anyone would be able to control his/her destiny.

• If someone studied for 10 hours, according to the model his/her test score would be:

Test Score = 45 + 3.8 (10) = 83.

5 - 25

Building a Model

• If this score was not sufficiently high, then study 11 hours:

Test Score = 45 + 3.8 (11) = 86.8

or study 12 hours:

Test Score = 45 + 3.8 (12) = 90.6.

• Admittedly, there is no model that can precisely predict a test score just on the basis of time studied; there are many other variables that affect test scores.
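
The arithmetic on these slides can be checked with a few lines of Python. This is only a sketch: the function name `predicted_score` is ours, and the coefficients 45 and 3.8 are the ones the slides assume for illustration.

```python
def predicted_score(hours, intercept=45.0, slope=3.8):
    """Deterministic model from the slides: Test Score = 45 + 3.8 (hrs of study time)."""
    return intercept + slope * hours

# Reproduce the three predictions worked out on the slides.
for hours in (10, 11, 12):
    print(hours, "hours ->", round(predicted_score(hours), 1))
```

Each extra hour of study adds 3.8 points to the prediction, which is what the slope means in this model.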

5 - 26

Suppose a model was available which, although imperfect, fairly reliably predicted test scores based on hours studied.

• New Model:

Test Score = 45 + 3.8(hours of study time) + error

• The new model admits the possibility of error.

• Using the new model to predict someone’s score if they studied 12 hours would yield

Test Score = 45 + 3.8(12) + error = 90.6 + error.

5 - 27

Building a Model

• The predicted test score would still be 90.6, but there is an unknown error associated with the prediction.

• Admitting the possibility of error in our model makes it more realistic and more credible.

• Whether the model will actually be useful will depend on the size of the errors.

• If the errors are small enough (say, at most 5 points) then the predicted value will still be useful for planning purposes.

5 - 28

How would you like your errors?

• If the model admits the possibility of error, gauging the expected magnitude of the error is essential in determining the model’s usefulness.

• A model with an average error of zero and small variation of the error terms will be desirable and should yield useful predictions.

5 - 29

The Linear Model

• A linear relationship is graphically described as a line.

• Mathematically, a line is a set of points that satisfy the functional relationship

y = mx + b,

where m is the slope of the line and b is the point where the function crosses the Y axis, which is called the Y - intercept.

5 - 30

The Linear Model

The relationship in the graph is the linear equation y = 5x + 3.

Thus, m = 5 and b = 3.

[Figure: graph of the line y = 5x + 3 on x and y axes, with the y-intercept b marked]

5 - 31

Parameters of the Linear Equation

• Together, the slope and the intercept are called the parameters of the linear equation.

• The parameters completely define the equation of the line.

5 - 32

The Degree of the Linear Relationship

5 - 33

Measuring the Degree of Linear Relationship

• Nature doesn’t cooperate by requiring all relationships to be straight lines.

• However, if two variables seem to have a positive (upward sloping) or inverse (downward sloping) relationship, it would be useful to:

– define the exact parameters of a regression line for the set of data,

– find the degree to which the data points cluster about the line.

5 - 34

Correlation Coefficient

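The correlation coefficient (r) was developed by Karl Pearson in 1896 to measure how well a set of data points fits a straight line.

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right), \qquad -1 \le r \le 1$$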
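
As a rough illustration of the definition, the sketch below computes r as the sum of products of standardized deviations divided by n - 1. The function name `correlation` is ours, and the sample data is simply a set of points placed exactly on the line y = 3 + 4x from slide 5 - 9, so r should come out at +1.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson r: sum of products of standardized deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Illustrative data: points lying exactly on y = 3 + 4x.
xs = [1, 2, 3, 4, 5]
ys = [3 + 4 * x for x in xs]
r_exact = correlation(xs, ys)      # essentially +1 for a perfect upward line
r_swapped = correlation(ys, xs)    # same value: r does not depend on the order of X and Y
print(round(r_exact, 6), round(r_swapped, 6))
```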

5 - 35

Correlation Coefficient

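For positive relationships,

$$\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

will generally be a positive number.

[Figure: scatterplot divided by the mean of X and the mean of Y, with the regions labeled “Points above the mean of X and the mean of Y” and “Points below the mean of X and the mean of Y”]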

5 - 36

Correlation Coefficient

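For negative relationships,

$$\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

will generally be a negative number.

[Figure: scatterplot divided by the mean of X and the mean of Y, with the regions labeled “Points above the mean of Y but below the mean of X” and “Points below the mean of Y but above the mean of X”]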

5 - 37

Properties of the correlation coefficient

• The correlation coefficient, r, measures the degree of linear relationship: how well the data clusters around a line.

• The value of r is always between -1 and +1.

• A value of r near -1 or +1 means the data is tightly bundled around a line.

• A value of r near -1 or +1 means that it would be easy to predict one of the variables by using the other.

5 - 38

More Properties of the Correlation Coefficient

• Positive association is indicated by a plus sign and is associated with an upward sloping relationship.

• Negative association is indicated by a minus sign and is associated with a downward sloping relationship.

• It does not matter whether you correlate Y with X, or X with Y, you will get the same value for r.

5 - 39

Correlation Pitfalls

5 - 40

Some Correlation Pitfalls

A high correlation does not imply causation.

Suppose that a high correlation has been observed between the weekly sales of ice cream and the number of snake bites.

What could explain this relationship?

[Figure: Ice Cream Sales and Snake Bites]

5 - 41

Common response

• Common response means that both variables (ice cream sales and snake bites) are related to a third variable.

• In this case, high temperatures in the summer cause both ice cream sales and reptile activity to increase.

5 - 42

Does low correlation mean that no relationship exists?

A low correlation could mean that no linear relationship exists.

The correlation measure for these points is going to be very close to zero. Yet, the data does have a very distinct relationship. The relationship is not linear. It is quadratic.

[Figure: scatterplot of y against x showing a quadratic pattern]
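
A small numerical check makes the point concrete. The data below is hypothetical (it is not the data plotted on the slide): seven x values placed symmetrically around zero with y = x squared, so y is completely determined by x and yet the linear correlation vanishes. The helper `correlation` is our own name for the formula from slide 5 - 34.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson r from standardized deviations (sample standard deviations, divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical quadratic data: y is an exact function of x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r = correlation(xs, ys)
print(abs(r) < 1e-9)  # the linear correlation is zero up to rounding
```

A scatterplot would reveal the curve immediately, which is one reason to plot the data before relying on r.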

5 - 43

Confounding

• Another problem that can produce low correlations is confounding.

• Confounding occurs when more than one variable affects the dependent variable, and the effects of the variables cannot be distinguished from each other.

• Suppose that the variable Y is dependent on X. Thus, as X changes, it produces changes in Y.

5 - 44

Confounding

• Such a relationship should produce a significant correlation measure between the two variables.

• But suppose there is another variable Z, which also affects Y. As Z changes so does Y.

• It is certainly possible that changes in Z will mask the changes caused by X.

[Figure: diagram relating the variables X and Z to Y]

5 - 45

The range of X values can affect the correlation coefficient.

If the range of X data is large, the correlation will usually be greater than if the range of the X values is small.

5 - 46

• Defining a Linear Relationship: Regression Analysis

5 - 47

Linear Relationship Between Two Variables

The typical equation of a line:

y = mx + b, where m is the slope and b is the y-intercept.

The regression equation of a line:

y = b0 + b1x, where b0 is the y-intercept and b1 is the slope.

5 - 48

The Relationship Between X and Y

• The value of Y is completely dependent on the value of X.

• Thus, the Y variable is called the dependent variable, and the X variable is called the independent variable.

5 - 49

Example

Let b0 = 8 and b1 = -2.

y = 8 - 2x

[Figure: graph of the line y = 8 - 2x, with x-axis ticks at 1, 3, 5, 7 and y-axis ticks at 1, 3, 5, 7, 9]

5 - 50

Which line best fits the data?

Specifying the relationship between X and Y with a linear model means finding a line that best fits the data in some way.

Problem: There are many lines that could be interpreted as fitting the data.

[Figure: scatterplot of y against x]

5 - 51

How do we measure how close a line is to the data?

• One possible method of choosing the best fitting line is to use the line to predict the Y value for each observation.

• For a given set of data points, one possible line that seems to fit the data reasonably well is Y = 1 + .7X.

5 - 52

The Model: Y = 1 + .7X

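[Figure: X-Y plot of the observed points (2,3), (4,2), (5,6), (8,5), and (9,8), with the point (4,2) labeled as an observed value; the X axis runs from 0 to 10 and the Y axis from 0 to 8]

If we plug X = 4 into our model we get Y = 1 + .7(4) = 3.8.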

5 - 53

How do we measure how close a line is to the data?

• Examine the purpose of the regression line: y = b0 + b1x.

• Once the coefficients of the model (b0,b1) are estimated they become constants.

• The value of X is selected by the user of the model.

5 - 54

How do we measure how close a line is to the data?

• Once the coefficients are estimated and the value of X is chosen, the corresponding value of Y is completely determined.

• Example:

If X = 2, and Y = 1 + .7X, then

Y = 1 + .7(2) = 2.4.

5 - 55

Model Error

• The purpose of the model is to predict Y for some given X.

• But, the model Y = 1 + .7X is wrong.

• In the first historical observation, the observed data for X = 2 is Y = 3, not 2.4 as the model predicts.

• The difference between the observed value of Y and the predicted value of Y is called the model’s error.

error = observed Y - predicted Y

error = 3 - 2.4 = .6

5 - 56

Model Error

• The errors reflect how far each observation is from the line.

• Examining the errors suggests how well the line fits the data.

• If we incorporate the notion of error in our model, it becomes

$$y = b_0 + b_1 x + \text{error}.$$

5 - 57

Model Error

• Once the possibility of error has been admitted, the notation changes when the estimated model is written without the error term.

• Specifically, the dependent variable is referred to as $\hat{y}$ (pronounced “y hat”), the predicted value of y:

$$\hat{y} = b_0 + b_1 x$$

• The symbol y is reserved for the observed value of y. The error for the model for any observation is given by

$$\text{error} = y - \hat{y}.$$

5 - 58

The Sum of Squared Errors (SSE)

The Sum of Squared Errors (SSE) can be used as a criterion for selecting the best fitting line through a set of points.

$$SSE = \sum \text{error}_i^2 = \sum \left(y_i - (b_0 + b_1 x_i)\right)^2 = \sum (y_i - \hat{y}_i)^2$$
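
To see the criterion at work, the sketch below computes the errors and the SSE for the trial line Y = 1 + .7X. It assumes the five points that appear in the transcript of the X-Y plot on slide 5 - 52; the names `points`, `predict`, `errors`, and `sse` are ours.

```python
# Observed (x, y) pairs as they appear in the transcript of slide 5 - 52 (an assumption
# about the plotted data, since the plot itself did not survive extraction).
points = [(2, 3), (4, 2), (5, 6), (8, 5), (9, 8)]

def predict(x, b0=1.0, b1=0.7):
    """Predicted Y from the trial line Y = 1 + .7X."""
    return b0 + b1 * x

# error = observed Y - predicted Y, one per observation
errors = [y - predict(x) for x, y in points]
sse = sum(e ** 2 for e in errors)

print([round(e, 1) for e in errors])  # [0.6, -1.8, 1.5, -1.6, 0.7]
print(round(sse, 2))                  # 8.9
```

The first error, .6, matches the value worked out on slide 5 - 55. Squaring before summing keeps positive and negative errors from cancelling.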

5 - 59

The “Best” Line

• If the Sum of Squared Errors (SSE) is zero, then the model fits the data exactly and the observed data must lie in a straight line.

• The “best” line is the line with the least Sum of Squared Errors.

• If found, it would be called the Least Squares Line since it would have the smallest SSE.

5 - 60

Finding the Least Squares Line

• Defining a line means specifying its slope and Y intercept.

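• The equations for determining the slope and intercept are:

$$b_1 = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$

and

$$b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right).$$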
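
The two formulas translate directly into code. The sketch below is illustrative only: it applies them to the five points that appear in the transcript of slide 5 - 52 (assumed to be the plotted data) and compares the resulting SSE with that of the trial line Y = 1 + .7X. The function names `least_squares` and `sse` are ours.

```python
def least_squares(points):
    """Return (b0, b1) using the summation formulas for the least squares line."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

def sse(points, b0, b1):
    """Sum of squared errors of the line y = b0 + b1 x over the points."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

points = [(2, 3), (4, 2), (5, 6), (8, 5), (9, 8)]
b0, b1 = least_squares(points)
sse_best = sse(points, b0, b1)      # SSE of the least squares line
sse_trial = sse(points, 1.0, 0.7)   # SSE of the trial line Y = 1 + .7X
print(round(b0, 3), round(b1, 3), round(sse_best, 3), round(sse_trial, 3))
```

Under these assumptions the least squares line has a slightly smaller SSE than the trial line, as it must, since no other line can have a smaller one.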

5 - 61

Estimating a Linear Relationship

5 - 62

Estimating a Linear Relationship

Data on age and asking price of a Ford Taurus has been gathered from ten classified car ads in a local newspaper.

5 - 63

Estimating a Linear Relationship

• The estimated coefficients using least squares methods are

b0 = 13,947.4 (the y-intercept) and b1 = -1,787.9 (the slope).

• The estimated model for the price of the Ford Taurus is Price = $13,947 - $1,788 (age), which has a smaller SSE for this data than any other line.

5 - 64

Asking Price vs. Age of Car

[Figure: scatterplot of asking price (axis from 5,000 to 15,000) against age of car (axis from 0 to 5)]

5 - 65

Estimating a Linear Relationship

• Examine the errors produced by the least squares model (Table 5.9.1).

• Note that the sum of the errors equals zero.

$$\sum \text{errors} = 0$$

$$\sum \text{errors}^2 = 14{,}555{,}372$$
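
That the errors of a least squares line sum to zero is not special to the Taurus data; it follows from the intercept formula on slide 5 - 60. A short derivation in the notation of the slides, where $e_i$ is our shorthand for the error of observation i:

$$\sum e_i = \sum \left(y_i - b_0 - b_1 x_i\right) = \sum y_i - n b_0 - b_1 \sum x_i$$

Since $b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right)$, we have $n b_0 = \sum y_i - b_1 \sum x_i$, and therefore

$$\sum e_i = \sum y_i - \left(\sum y_i - b_1 \sum x_i\right) - b_1 \sum x_i = 0.$$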

5 - 66

Estimating a Linear Relationship

A partial output from the Taurus model is shown below.

Predictor    Coef
Constant     13947.4    b0 (estimated Y intercept)
Age          -1787.9    b1 (estimated slope)

s = 1349    R-sq = 83.1%    R-sq(adj) = 81.0%

5 - 67

Interpreting the Regression Equation

5 - 68

Interpreting the Regression Equation

• If Age = 0 then the value of b0 = $13,947 is equivalent to the predicted price for a new Ford Taurus.

Price = $13,947 - $1,788(0) = $13,947

• Since b1 is the estimated slope of the line, it is interpreted to be the change in the dependent variable (price of a Ford Taurus) for a one unit change in the independent variable (age).

5 - 69

Interpreting the Regression Equation

• In our example, the independent variable, age, is expressed in years.

• Thus, for every additional year of age the price declines by $1,788.

• If the interpretation is correct, the model’s predicted price of a two year old Taurus should be $1,788 greater than that of a three year old Taurus.

5 - 70

Interpreting the Regression Equation

• If the Taurus is two years old, then x = 2, and

Price = $13,947 - $1,788(2) = $10,371.

• If the Taurus is three years old, then x = 3, and

Price = $13,947 - $1,788(3) = $8,583.
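
The interpretation of the slope can be verified numerically. In this sketch the function name `taurus_price` is ours, and the rounded coefficients $13,947 and $1,788 are the ones used on the slides.

```python
def taurus_price(age_years, b0=13947, b1=-1788):
    """Estimated model from the slides: Price = $13,947 - $1,788 (age)."""
    return b0 + b1 * age_years

price_new = taurus_price(0)   # 13947, the intercept
price_2 = taurus_price(2)     # 10371
price_3 = taurus_price(3)     # 8583
drop = price_2 - price_3      # 1788, the size of the slope
print(price_new, price_2, price_3, drop)
```

The one year difference equals the slope in absolute value, whichever pair of consecutive ages is chosen.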

5 - 71

The Importance of Errors

5 - 72

The Importance of Errors

• The usefulness of the estimated model depends on the magnitude of the prediction errors you expect the model to produce.

• The Taurus model is

Price = b0 + b1(Age) + error.

5 - 73

The Importance of Errors

• How do we assess the magnitude of the errors for any model?

– To learn about the quality of the model we need data about the prediction errors it produces.

– Using the model to predict the observed outcomes of the dependent variable (Y) produces error data that will be used to evaluate the quality of the model.

5 - 74

The Importance of Errors

• How do you summarize the errors a model produces?

– Large variation in the errors would indicate a model’s prediction was not very reliable.

– Small variation would indicate the model is capable of producing more trustworthy predictions.

5 - 75

Variance of Errors and Standard Deviation

• The definition for the variance of the errors is given by

$$s_e^2 = \frac{\sum (e - \bar{e})^2}{n-2} = \frac{\sum e^2}{n-2} = \frac{SSE}{n-2}.$$

• In the Taurus data the variance of the errors is

$$s_e^2 = \frac{14{,}555{,}372}{10-2} = 1{,}819{,}421,$$

and the standard deviation of the error terms is given by

$$s_e = \sqrt{1{,}819{,}421} = 1{,}348.86.$$
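
The Taurus figures can be reproduced from the SSE and the sample size given on the slides. A minimal sketch (variable names are ours):

```python
from math import sqrt

sse = 14_555_372   # sum of squared errors for the Taurus model (slide 5 - 65)
n = 10             # ten classified ads

variance = sse / (n - 2)      # divisor n - 2, as in the definition above
std_error = sqrt(variance)    # standard deviation of the error terms

print(variance)               # 1819421.5
print(round(std_error, 2))    # 1348.86
```

The result agrees with the value s = 1349 reported in the regression output on slide 5 - 66.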

5 - 76

Evaluating the Fit of a Model

5 - 77

Total Sum of Squares

• Before determining how much variation a model explains, it is necessary to evaluate how much variability exists in the Y variable.

• This quantity is called Total Sum of Squares (TSS) and represents the total variation of the dependent variable (Y).

$$TSS = \sum (y - \bar{y})^2$$

5 - 78

Sum of Squared Errors

• An error represents the model’s inability to predict the variation in Y.

• Adding all the squared errors accumulates the total of all unexplained variation, which is denoted SSE, Sum of Squared Errors.

$$SSE = \sum e^2$$

5 - 79

The Variation in Y

• The variation in Y can be divided into two categories, unexplained and explained.

TSS = unexplained variation + explained variation

TSS = SSE + explained variation

• Denoting explained variation as SSR produces TSS = SSE + SSR.

• Solving this equation for SSR results in SSR = TSS - SSE.

5 - 80

Interpreting SSR

• It would be delightful if the model would explain all of the variability in the Y’s.

• The difference between total variation in the Y’s and the unexplained variation must be the variation that is explained by the regression model.

• That’s why explained variation is called the Sum of Squares of Regression (SSR).

5 - 81

Interpreting SSR

In the Taurus example,

TSS = 86,158,336 and

SSE = 14,555,372; thus

SSR = TSS - SSE

= 86,158,336 - 14,555,372

= 71,602,964.

5 - 82

Interpreting SSR

• The proportion of variation explained by the model is called the coefficient of determination and is denoted as $R^2$:

$$R^2 = \frac{SSR}{TSS}, \qquad 0 \le R^2 \le 1.$$

• For the Taurus data,

$$R^2 = \frac{71{,}602{,}964}{86{,}158{,}336} = .831.$$

• In other words, the estimated model explains about 83% of the variation in prices.
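
The decomposition TSS = SSE + SSR and the resulting coefficient of determination can be checked with the Taurus totals from slide 5 - 81. A minimal sketch (variable names are ours):

```python
tss = 86_158_336   # total sum of squares for the Taurus prices
sse = 14_555_372   # sum of squared errors (unexplained variation)

ssr = tss - sse              # explained variation
r_squared = ssr / tss        # proportion of variation explained
r_squared_alt = 1 - sse / tss  # the same quantity written in terms of SSE

print(ssr)                   # 71602964
print(round(r_squared, 3))   # 0.831
```

The value matches the R-sq = 83.1% line in the regression output on slide 5 - 66.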

5 - 83

Estimating the Linear Relationship Between SAT Scores and Graduating GPA

5 - 84

Estimating the Linear Relationship Between SAT Scores and Graduating GPA

Using the least squares method, the estimated model is given by

Graduating GPA = 1.36 + 0.00141 (Total SAT Score).

5 - 85

Measuring the Fit of the Data

[Figure: scatterplot against SAT Score (horizontal axis from 700 to 1,300), with vertical axis ticks from 2 to 3.8]

5 - 86

Measuring the Fit of the Data

• $R^2$ is a numerical measure that describes fit.

• What percent of the variation in Final Grade Point Average can be explained by the model?

$$R^2 = \frac{SSR}{TSS} = \frac{1.1862}{6.1013} = .1942$$

5 - 87

Fitting a Linear Time Trend

5 - 88

Example

Humans have continued to better their performances in various sporting events.

If we look at world records for women’s freestyle swimming, what is the relationship between distance and time?

World Records in 1993

5 - 89

Questions

a. Use the data to construct a model using time as the dependent variable and distance as the independent variable.

b. What fraction of the variation in time is explained by distance?

c. If there were a 1000 meter race, what would you predict the world record to be?

5 - 90

Solution

a. You should always draw a scatterplot to confirm that an approximate linear relationship exists between the two variables.

[Figure: scatterplot of the world records, with the horizontal axis running from 0 to 1500 and the vertical axis from 0 to 1000]

5 - 91

Solution

Model:

Time = -10.3 + .640(Distance)

Calculations

5 - 92

Solution

The coefficients of the model are calculated as follows:

$$b_1 = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} = \frac{6(1952863.5) - (3050)(1888.99)}{6(3102500) - (3050)^2} = \frac{5955761.5}{9312500} = .639545$$

5 - 93

Solution

$$b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{6}\left(1888.99 - .639545(3050)\right) = -10.27$$

$$\hat{y} = b_0 + b_1 x = -10.3 + .640x$$

5 - 94

Solution

b. $R^2$ = 100%, meaning that 100% of the variation in the observed times is explained by the distance (i.e., this is a very good model).

c. Time = -10.3 + .640(1000) = 629.7 seconds.
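
The whole solution can be retraced from the summary totals on slide 5 - 92, without the raw table of records. The sketch below is a check under that assumption; the variable names are ours, and the four sums are read from the worked calculation of the slope.

```python
# Summary totals taken from the slope calculation on slide 5 - 92.
n = 6
sum_x = 3050          # sum of distances
sum_y = 1888.99       # sum of record times
sum_xy = 1952863.5    # sum of distance * time
sum_x2 = 3102500      # sum of squared distances

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

pred_rounded = -10.3 + 0.640 * 1000   # with the rounded coefficients used on the slide
pred_full = b0 + b1 * 1000            # with the unrounded coefficients

print(round(b1, 6), round(b0, 2))
print(round(pred_rounded, 1), round(pred_full, 2))
```

The two predictions differ by a few tenths of a second only because the slide rounds the coefficients before plugging in the distance.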