Chapter 5 Discovering Relationships Stat-Slide-Show, Copyright 1994-95 by Quant Systems Inc.

Chapter 5

Discovering Relationships

Stat-Slide-Show, Copyright 1994-95 by Quant Systems Inc.

5 - 2

• Discovering Relationships and Patterns in the Data

5 - 3

Statistical Tools

• There are statistical tools which will aid in the discovery of relationships.

• Specifically,

– a graph (called a scatterplot)

– a summary statistical measure (called a correlation coefficient)

– a mathematical model (developed using a technique called regression analysis).

5 - 4

Relationships

• A college admissions counselor would like to know if SAT or ACT scores are related to college performance.

• A student who would like to make an A on an upcoming exam wonders about the relationship between study time and grade received.

5 - 5

Relationships

• A farmer wants to know the relationship between fertilizer and yield of a corn crop.

• A couple seeking to buy their first house wants to know the relationship between house square footage and price.

5 - 6

Discovering a relationship can lead to good things.

• If a guidance counselor knows the relationship between SAT or ACT scores and academic performance at various institutions, then they can use the relationships to recommend colleges.

• If a student knows the relationship between study time and test grade, then the student will be able to choose the grade they wish by studying the amount necessary to achieve the desired grade.

5 - 7

Discovering a relationship can lead to good things.

• If a farmer knows the relationship between fertilizer and corn yield, he can select the fertilizer level that produces the greatest profit.

• If a couple knows the relationship between cost and square footage, they will know when they have a good buy.

5 - 8

Relationships can be represented in many forms.

Relationships Among Ideas

Take, for example, the relationship between enthusiasm and involvement. The relationship implied by the diagram suggests that involvement produces enthusiasm and vice versa. However, the exact relationship is not specified with precision.

[Diagram: Enthusiasm and Involvement, each linked to the other.]

5 - 9

Formal Relationships

Mathematical functions

y = 3 + 4x

• This equation specifies an exact relationship between the variables y and x.

• This type of relationship is a deterministic mathematical model.

• The relationship is specified very precisely by the equation.

5 - 10

Discovering Relationships

• When we have data on two variables, we have the possibility of discovering a relationship and building a mathematical model to represent the relationship.

• Is there some general tendency (pattern-relationship) that the data exhibits?

– If two variables are related, there should be some pattern that connects them.

5 - 11

What is a pattern?

• Patterns are structure.

• Patterns connect variables.

• One of the best ways to represent a pattern is with a mathematical model.

5 - 12

Pattern Structure: Upward Sloping

Linear Pattern

Note: The points fit the pattern exactly.


5 - 13

Pattern Structure: Downward Sloping

Linear Pattern

Note: The points tightly fit the pattern. The relationship exhibited by the data pattern is completely captured with a line.


5 - 14

Another Upward Sloping Linear Pattern

Note: The data loosely fits the linear pattern. Yet the essence of the pattern is captured with an upward sloping line.


5 - 15

Another Upward Sloping Pattern

Note: This picture represents data with a very loose upward sloping pattern. A line could be used to represent the upward sloping nature of the relationship. However, the line would not represent the relationship very well.


5 - 16

Data With No Linear Pattern


Note: There is no apparent relationship between the x and y variables. Often, it is just as important to know there is no relationship between two variables as it is to discover that one exists.

Let’s look at some bivariate data on High School Graduation Rates vs. Crime Rate (per 100,000) for each of the States.


5 - 17

5 - 18

5 - 19

There is too much data in the previous table to comprehend without the aid of statistical tools.

Let’s look at the data with a scatterplot.


5 - 20

5 - 21

Does there appear to be a pattern in the data?

Note: While the relationship between crime and graduation rate is apparent, the relationship is not very strong.

[Scatterplot axis label: Crimes Occurring per 100,000 People]

5 - 22

Building a Model

5 - 23

Building a Model

• One of the most useful methods of defining a precise relationship between two variables is to create a mathematical model that relates the variables.

• Suppose we know that the relationship between test score and hours of study time is

Test Score = 45 + 3.8 (hrs of study time).

5 - 24

Building a Model

Test Score = 45 + 3.8 (hrs of study time)

• If this mathematical model is accurate, then anyone would be able to control his/her destiny.

• If someone studied for 10 hours, according to the model his/her test score would be:

Test Score = 45 + 3.8 (10) = 83.

5 - 25

Building a Model

• If this score was not sufficiently high, then study 11 hours:

Test Score = 45 + 3.8 (11) = 86.8

or study 12 hours:

Test Score = 45 + 3.8 (12) = 90.6.

• Admittedly, there is no model that can precisely predict a test score just on the basis of time studied; there are many other variables that affect test scores.
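The deterministic predictions above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slides; the function name `predicted_score` is a hypothetical label chosen for illustration.

```python
# Deterministic model from the slides:
#   Test Score = 45 + 3.8 * (hours of study time)
# The function name is hypothetical; the coefficients come from the slides.
def predicted_score(hours):
    """Return the test score predicted by the deterministic model."""
    return 45 + 3.8 * hours

for hours in (10, 11, 12):
    # Round to one decimal place to match the figures on the slides.
    print(hours, round(predicted_score(hours), 1))
```

Each extra hour of study raises the predicted score by the slope, 3.8 points.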

Suppose a model was available which, although imperfect, fairly reliably predicted test scores based on hours studied.


• New Model:

Test Score = 45 + 3.8(hours of study time) + error

• The new model admits the possibility of error.

• Using the new model to predict someone’s score if they studied 12 hours would yield

Test Score = 45 + 3.8(12) + error = 90.6 + error.

5 - 26

5 - 27

Building a Model

• The predicted test score would still be 90.6, but there is an unknown error associated with the prediction.

• Admitting the possibility of error in our model makes it more realistic and more credible.

• Whether the model will actually be useful will depend on the size of the errors.

• If the errors are small enough (say, at most 5 points) then the predicted value will still be useful for planning purposes.

5 - 28

How would you like your errors?

• If the model admits the possibility of error, gauging the expected magnitude of the error is essential in determining the model’s usefulness.

• A model with an average error of zero and small variation of the error terms will be desirable and should yield useful predictions.

5 - 29

The Linear Model

• A linear relationship is graphically described as a line.

• Mathematically, a line is a set of points that satisfy the functional relationship

y = mx + b,

where m is the slope of the line and b is the point where the function crosses the Y axis, which is called the Y - intercept.

5 - 30

The Linear Model

The relationship in the graph is the linear equation y = 5x + 3.

Thus, m = 5 and b = 3.

[Figure: the line y = 5x + 3 on x and y axes, with the Y - intercept b marked.]

5 - 31

Parameters of the Linear Equation

• Together, the slope and the intercept are called the parameters of the linear equation.

• The parameters completely define the equation of the line.

5 - 32

The Degree of the Linear Relationship


5 - 33

Measuring the Degree of Linear Relationship

• Nature doesn’t cooperate by requiring all relationships to be straight lines.

• However, if two variables seem to have a positive (upward sloping) or inverse (downward sloping) relationship, it would be useful to:

– define the exact parameters of a regression line for the set of data,

– find the degree to which the data points cluster about the line.

5 - 34

Correlation Coefficient

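The correlation coefficient (r) was developed by Karl Pearson in 1896 to measure how well a set of data points fits a straight line.

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right), \qquad -1 \le r \le 1
\]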
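As a rough illustration, the correlation coefficient can be computed directly as the sum of products of standardized deviations divided by n - 1. The sketch below assumes that form of the formula and uses the five (x, y) points that appear on the X-Y plot later in this chapter; the helper names `mean`, `sample_std`, and `correlation` are hypothetical.

```python
from math import sqrt

# Five observed (x, y) points, taken from the X-Y plot on slide 5 - 52.
xs = [2, 4, 5, 8, 9]
ys = [3, 2, 6, 5, 8]

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Pearson's r: sum of products of standardized deviations over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_std(xs), sample_std(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation(xs, ys)
print(round(r, 3))                    # roughly 0.785
print(round(correlation(ys, xs), 3))  # same value with X and Y swapped
```

Swapping the roles of X and Y leaves every product unchanged, which is why the two printed values agree.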

5 - 35


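For positive relationships

\[
\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
\]

will generally be a positive number.

[Figure: scatterplot divided by lines at the mean of X and the mean of Y. The labeled points lie above the mean of X and the mean of Y, or below the mean of X and the mean of Y.]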

5 - 36


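For negative relationships

\[
\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
\]

will generally be a negative number.

[Figure: scatterplot divided by lines at the mean of X and the mean of Y. The labeled points lie above the mean of Y but below the mean of X, or below the mean of Y but above the mean of X.]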

5 - 37

Properties of the correlation coefficient

• The correlation coefficient, r, measures the degree of linear relationship: how well the data clusters around a line.

• The value of r is always between -1 and +1.

• A value of r near -1 or +1 means the data is tightly bundled around a line.

• A value of r near -1 or +1 means that it would be easy to predict one of the variables by using the other.

5 - 38

More Properties of the Correlation Coefficient

• Positive association is indicated by a plus sign and is associated with an upward sloping relationship.

• Negative association is indicated by a minus sign and is associated with a downward sloping relationship.

• It does not matter whether you correlate Y with X or X with Y; you will get the same value for r.

5 - 39

Correlation Pitfalls

5 - 40

Some Correlation Pitfalls

A high correlation does not imply causation.

Suppose that a high correlation has been observed between the weekly sales of ice cream and the number of snake bites.

What could explain this relationship?

5 - 41

Common response

• Common response means that both variables (ice cream sales and snake bites) are related to a third variable.

• In this case, high temperatures in the summer cause both ice cream sales and reptile activity to increase.

5 - 42

Does low correlation mean that no relationship exists?

A low correlation could mean that no linear relationship exists.

The correlation measure for these points is going to be very close to zero. Yet, the data does have a very distinct relationship. The relationship is not linear. It is quadratic.

[Figure: points on x and y axes following a quadratic curve.]
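A small numerical check makes the point concrete. The data below are hypothetical (they are not the points from the slide): y is an exact quadratic function of x, symmetric about x = 0, and the helper `correlation` is an illustrative name.

```python
from math import sqrt

# Hypothetical data: y depends on x exactly, but the dependence is quadratic.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

def correlation(xs, ys):
    """Pearson's r computed from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r = correlation(xs, ys)
print(r)  # zero: the positive and negative products cancel
```

The points to the left of the mean of X slope downward and those to the right slope upward, so their contributions to r cancel even though y is completely determined by x.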

5 - 43

Confounding

• Another problem that can produce low correlations is confounding.

• Confounding occurs when more than one variable affects the dependent variable, and the effects of the variables cannot be distinguished from each other.

• Suppose that the variable Y is dependent on X. Thus, as X changes, it produces changes in Y.

5 - 44

Confounding

• Such a relationship should produce a significant correlation measure between the two variables.

• But suppose there is another variable Z, which also affects Y. As Z changes so does Y.

• It is certainly possible that changes in Z will mask the changes caused by X.

[Diagram: the variables X and Z both affect Y.]

5 - 45

The range of X values can affect the correlation coefficient.

If the range of X data is large, the correlation will usually be greater than if the range of the X values is small.
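The effect of the range of X can be seen with a small made-up data set. In this sketch, which is an illustration under assumed data rather than anything from the slides, y equals x plus an alternating disturbance of +1 or -1, and the correlation is computed once over the full range of x and once over a narrow slice.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r computed from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: x = 1..10, y = x plus an alternating +1 / -1 disturbance.
xs = list(range(1, 11))
ys = [x + (1 if x % 2 == 1 else -1) for x in xs]

# Same relationship, but only the observations with 4 <= x <= 6.
narrow = [(x, y) for x, y in zip(xs, ys) if 4 <= x <= 6]
narrow_xs = [x for x, _ in narrow]
narrow_ys = [y for _, y in narrow]

r_full = correlation(xs, ys)
r_narrow = correlation(narrow_xs, narrow_ys)
print(round(r_full, 3), round(r_narrow, 3))  # about 0.939 versus 0.655
```

The disturbance is the same size in both cases, but over the narrow range it is large relative to the spread in x, so the points cluster less tightly around a line.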

5 - 46

Defining a Linear Relationship: Regression Analysis

5 - 47

Linear Relationship Between Two Variables

The typical equation of a line:

y = mx + b   (m is the slope, b is the y - intercept)

The regression equation of a line:

y = b0 + b1x   (b0 is the y - intercept, b1 is the slope)

5 - 48

The Relationship Between X and Y

• The value of Y is completely dependent on the value of X.

• Thus, the Y variable is called the dependent variable, and the X variable is called the independent variable.

5 - 49

Example

Let b0 = 8 and b1 = -2.

y = 8 - 2x

[Figure: graph of the line y = 8 - 2x, with x-axis ticks at 1, 3, 5, 7 and y-axis ticks at 1, 3, 5, 7, 9.]

5 - 50

Which line best fits the data?

Specifying the relationship between X and Y with a linear model means finding a line that best fits the data in some way.

Problem: There are many lines that could be interpreted as fitting the data.

[Figure: scatterplot of data on x and y axes.]

5 - 51

How do we measure how close a line is to the data?

• One possible method of choosing the best fitting line is to use the line to predict the Y value for each observation.

• For a given set of data points, one possible line that seems to fit the data reasonably well is Y = 1 + .7X.

5 - 52

The Model: Y = 1 + .7X

X-Y Plot

[Figure: the observed points (2,3), (4,2), (5,6), (8,5), and (9,8), plotted with X from 0 to 10 and Y from 0 to 8.]

For the observed point (4,2): if we plug X = 4 into our model we get Y = 1 + .7(4) = 3.8.

5 - 53


How do we measure how close a line is to the data?

• Examine the purpose of the regression line: y = b0 + b1x.

• Once the coefficients of the model (b0,b1) are estimated they become constants.

• The value of X is selected by the user of the model.

5 - 54


How do we measure how close a line is to the data?

• Once the coefficients are estimated and the value of X is chosen, the corresponding value of Y is completely determined.

• Example:

If X = 2, and Y = 1 + .7X, then

Y = 1 + .7(2) = 2.4.

5 - 55

Model Error

• The purpose of the model is to predict Y for some given X.

• But, the model Y = 1 + .7X is wrong.

• In the first historical observation, the observed data for X = 2 is Y = 3, not 2.4 as the model predicts.

• The difference between the observed value of Y and the predicted value of Y is called the model’s error.

error = observed Y - predicted Y

error = 3 - 2.4 = .6
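The same calculation can be repeated for every observation. The sketch below applies error = observed Y - predicted Y to the five points from the X-Y plot; the names `observations` and `predict` are hypothetical.

```python
# Observed (x, y) points from the X-Y plot on slide 5 - 52.
observations = [(2, 3), (4, 2), (5, 6), (8, 5), (9, 8)]

def predict(x):
    """Predicted Y from the candidate model Y = 1 + .7X."""
    return 1 + 0.7 * x

# error = observed Y - predicted Y, rounded to hide floating-point noise.
errors = [round(y - predict(x), 2) for x, y in observations]
print(errors)  # [0.6, -1.8, 1.5, -1.6, 0.7]
```

Positive errors mark points above the line and negative errors mark points below it.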

5 - 56

Model Error

• The errors reflect how far each observation is from the line.

• Examining the errors suggests how well the line fits the data.

• If we incorporate the notion of error in our model, it becomes

\[
y = b_0 + b_1 x + \text{error}.
\]

5 - 57

Model Error

• Once the possibility of error has been admitted, the notation in the estimated model changes when the error term is not included.

• Specifically, the dependent variable is referred to as \(\hat{y}\) (pronounced “y hat”), the predicted value of y:

\[
\hat{y} = b_0 + b_1 x
\]

• The symbol y is reserved for the observed value of y. The error for the model for any observation is given by

\[
\text{error} = y - \hat{y}.
\]

5 - 58

The Sum of Squared Errors (SSE)

The Sum of Squared Errors (SSE) can be used as a criterion for selecting the best fitting line through a set of points.

\[
SSE = \sum \text{error}_i^2 = \sum \left(y_i - (b_0 + b_1 x_i)\right)^2 = \sum (y_i - \hat{y}_i)^2
\]

5 - 59

The “Best” Line

• If the Sum of Squared Errors (SSE) is zero, then the model fits the data exactly and the observed data must lie in a straight line.

• The “best” line is the line with the least Sum of Squared Errors.

• If found, it would be called the Least Squares Line since it would have the smallest SSE.
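To see SSE acting as a selection criterion, the sketch below scores a few candidate lines against the five points from the X-Y plot. Only Y = 1 + .7X comes from the slides; the other two candidates and the function name `sse` are hypothetical, added for comparison.

```python
# Observed (x, y) points from the X-Y plot on slide 5 - 52.
observations = [(2, 3), (4, 2), (5, 6), (8, 5), (9, 8)]

def sse(b0, b1, data):
    """Sum of squared errors for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

candidates = {
    "Y = 1 + .7X": (1.0, 0.7),   # line used on the slides
    "Y = 2 + .5X": (2.0, 0.5),   # hypothetical alternative
    "Y = 4.8": (4.8, 0.0),       # hypothetical flat line at the mean of Y
}

for label, (b0, b1) in candidates.items():
    print(label, round(sse(b0, b1, observations), 2))
```

Of these three, Y = 1 + .7X has the smallest SSE (8.9 versus 9.5 and 22.8), so it is the best of the candidates tried, although not necessarily the least squares line.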

5 - 60

Finding the Least Squares Line

• Defining a line means specifying its slope and Y intercept.

• The equations for determining the slope and intercept are:

\[
b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}
\]

and

\[
b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right).
\]
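A minimal sketch of these two formulas, assuming the slope is (n Σxy - Σx Σy) / (n Σx² - (Σx)²) and the intercept is (Σy - b1 Σx) / n, applied to the five points from the X-Y plot. The function name `least_squares` is hypothetical.

```python
# Observed (x, y) points from the X-Y plot on slide 5 - 52.
observations = [(2, 3), (4, 2), (5, 6), (8, 5), (9, 8)]

def least_squares(data):
    """Return (b0, b1) for the least squares line through the data."""
    n = len(data)
    sum_x = sum(x for x, _ in data)
    sum_y = sum(y for _, y in data)
    sum_xy = sum(x * y for x, y in data)
    sum_x2 = sum(x * x for x, _ in data)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

b0, b1 = least_squares(observations)
print(round(b0, 3), round(b1, 3))  # about 1.157 and 0.651
```

The result is close to the line Y = 1 + .7X used earlier, which is consistent with that line having been described as fitting the data reasonably well.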

5 - 61

Estimating a Linear Relationship


5 - 62


Data on age and asking price of a Ford Taurus has been gathered from ten classified car ads in a local newspaper.

5 - 63


• The estimated coefficients using least squares methods are

b0 = 13,947.4 (the y-intercept) and b1 = -1,787.9 (the slope).

• The estimated model for the price of the Ford Taurus is Price = $13,947 - $1,788 (age), which has a smaller SSE for this data than any other line.

5 - 64

Asking Price vs. Age of Car

[Scatterplot: asking price (axis from 5,000 to 15,000) against age of car (axis from 0 to 5).]

5 - 65


• Examine the errors produced by the least squares model (Table 5.9.1).

• Note that the sum of the errors equals zero.

\[
\sum \text{errors} = 0
\]

\[
\sum \text{errors}^2 = 14{,}555{,}372
\]

5 - 66


A partial output from the Taurus model is shown below.

Predictor    Coef
Constant     13947.4    b0 (estimated Y intercept)
Age          -1787.9    b1 (estimated slope)

s = 1349    R-sq = 83.1%    R-sq(adj) = 81.0%

5 - 67

Interpreting the Regression Equation


5 - 68


• If Age = 0 then the value of b0 = $13,947 is equivalent to the predicted price for a new Ford Taurus:

Price = $13,947 - $1,788(0) = $13,947

• Since b1 is the estimated slope of the line, it is interpreted to be the change in the dependent variable (price of a Ford Taurus) for a one unit change in the independent variable (age).

5 - 69


• In our example, the independent variable, age, is expressed in years.

• Thus, for every additional year of age the price declines by $1,788.

• If the interpretation is correct, the model’s predicted price of a two year old Taurus should be $1,788 greater than that of a three year old Taurus.

5 - 70


• If the Taurus is two years old, then x = 2, and

Price = $13,947 - $1,788(2) = $10,371.

• If the Taurus is three years old, then x = 3, and

Price = $13,947 - $1,788(3) = $8,583.
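The slope interpretation can be checked numerically. This sketch uses the rounded coefficients from the slides; the constant and function names are hypothetical.

```python
# Rounded coefficients of the estimated Taurus model, in dollars.
B0 = 13_947   # estimated y-intercept (predicted price at age 0)
B1 = -1_788   # estimated slope (change in price per year of age)

def predicted_price(age_years):
    """Predicted asking price for a Taurus of the given age."""
    return B0 + B1 * age_years

for age in (0, 2, 3):
    print(age, predicted_price(age))

# The gap between consecutive ages equals the size of the slope.
print(predicted_price(2) - predicted_price(3))  # 1788
```

The difference between any two ages one year apart is the same, because the model is a straight line.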

5 - 71

The Importance of Errors


5 - 72


• The usefulness of the estimated model depends on the magnitude of the prediction errors you expect the model to produce.

• The Taurus model is

Price = b0 + b1(Age) + error.

5 - 73


• How do we assess the magnitude of the errors for any model?

– To learn about the quality of the model we need data about the prediction errors it produces.

– Using the model to predict the observed outcomes of the dependent variable (Y) produces error data that will be used to evaluate the quality of the model.

5 - 74


• How do you summarize the errors a model produces?

– Large variation in the errors would indicate that the model’s predictions are not very reliable.

– Small variation would indicate the model is capable of producing more trustworthy predictions.

5 - 75

Variance of Errors and Standard Deviation

• The definition for the variance of the errors is given by

\[
s_e^2 = \frac{\sum (e - \bar{e})^2}{n-2} = \frac{\sum e^2}{n-2} = \frac{SSE}{n-2}.
\]

• In the Taurus data the variance of the errors is

\[
s_e^2 = \frac{14{,}555{,}372}{10-2} = 1{,}819{,}421
\]

and the standard deviation of the error terms is given by

\[
s_e = \sqrt{1{,}819{,}421} = 1{,}348.86.
\]
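The Taurus figures can be verified in a few lines, assuming the variance of the errors is SSE / (n - 2) with n = 10 ads. The variable names are hypothetical.

```python
from math import sqrt

sse = 14_555_372   # sum of squared errors for the Taurus model
n = 10             # number of classified ads in the sample

error_variance = sse / (n - 2)         # variance of the errors
error_std = sqrt(error_variance)       # standard deviation of the errors

print(error_variance)        # 1819421.5
print(round(error_std, 2))   # 1348.86
```

The standard deviation agrees with the value s = 1349 reported in the partial output for the Taurus model.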

5 - 76

Evaluating the Fit of a Model


5 - 77

Total Sum of Squares

• Before determining how much variation a model explains, it is necessary to evaluate how much variability exists in the Y variable.

• This quantity is called Total Sum of Squares (TSS) and represents the total variation of the dependent variable (Y).

\[
TSS = \sum (y - \bar{y})^2
\]

5 - 78

Sum of Squared Errors

• An error represents the model’s inability to predict the variation in Y.

• Adding all the squared errors accumulates the total of all unexplained variation, which is denoted SSE, Sum of Squared Errors.

\[
SSE = \sum e^2
\]

5 - 79

The Variation in Y

• The variation in Y can be divided into two categories, unexplained and explained.

TSS = unexplained variation + explained variation

TSS = SSE + explained variation

• Denoting explained variation as SSR produces TSS = SSE + SSR.

• Solving this equation for SSR results in SSR = TSS - SSE.

5 - 80

Interpreting SSR

• It would be delightful if the model would explain all of the variability in the Y’s.

• The difference between total variation in the Y’s and the unexplained variation must be the variation that is explained by the regression model.

• That’s why explained variation is called the Sum of Squares of Regression (SSR).

5 - 81

Interpreting SSR

In the Taurus example,

TSS = 86,158,336 and

SSE = 14,555,372; thus

SSR = TSS - SSE

= 86,158,336 - 14,555,372

= 71,602,964.

5 - 82

Interpreting SSR

• The proportion of variation explained by the model is called the coefficient of determination and is denoted as \(R^2\):

\[
R^2 = \frac{SSR}{TSS}, \qquad 0 \le R^2 \le 1.
\]

• For the Taurus data,

\[
R^2 = \frac{71{,}602{,}964}{86{,}158{,}336} = .831.
\]

• In other words, the estimated model explains about 83% of the variation in prices.
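The decomposition TSS = SSE + SSR and the resulting coefficient of determination can be checked with the Taurus totals. This is a sketch of the arithmetic only; the variable names are hypothetical.

```python
tss = 86_158_336   # total sum of squares for the Taurus prices
sse = 14_555_372   # sum of squared errors (unexplained variation)

ssr = tss - sse            # explained variation
r_squared = ssr / tss      # proportion of variation explained

print(ssr)                   # 71602964
print(round(r_squared, 3))   # 0.831
```

The value matches the R-sq = 83.1% line in the partial output for the Taurus model.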

Estimating the Linear Relationship Between SAT Scores and Graduating GPA

5 - 83

5 - 84

Estimating the Linear Relationship Between SAT Scores and Graduating GPA

Using the least squares method, the estimated model is given by

Graduating GPA = 1.36 + 0.00141 (Total SAT Score).

Measuring the Fit of the Data

[Scatterplot: graduating GPA (axis from 2 to 3.8) against SAT score (axis from 700 to 1,300).]

5 - 85

5 - 86

Measuring the Fit of the Data

• R2 is a numerical measure that describes fit.

• What percent of the variation in Final Grade Point Average can be explained by the model?

\[
R^2 = \frac{SSR}{TSS} = \frac{1.1862}{6.1013} = .1942
\]

5 - 87

Fitting a Linear Time Trend


5 - 88

Example

Humans have continued to better their performances in various sporting events.

If we look at world records for women’s freestyle swimming, what is the relationship between distance and time?

World Records in 1993

5 - 89

Questions

a. Use the data to construct a model using time as the dependent variable and distance as the independent variable.

b. What fraction of the variation in time is explained by distance?

c. If there were a 1000 meter race, what would you predict the world record to be?

5 - 90

Solution

a. You should always draw a scatterplot to confirm that an approximate linear relationship exists between the two variables.

[Scatterplot: time (axis from 0 to 1000) against distance (axis from 0 to 1500).]

5 - 91

Solution

Model:

Time = -10.3 + .640(Distance)

Calculations

5 - 92

Solution

The coefficients of the model are calculated as follows:

\[
b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}
= \frac{6(1952863.5) - (3050)(1888.99)}{6(3102500) - (3050)^2}
= \frac{5955761.5}{9312500} = .639545
\]

5 - 93

Solution

\[
b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = \frac{1}{6}\left(1888.99 - .639545(3050)\right) = -10.27
\]

\[
\hat{y} = b_0 + b_1 x = -10.3 + .640x
\]

5 - 94

Solution

b. R2 = 100%, meaning that 100% of the variation in the observed times is explained by the distance (i.e., this is a very good model).

c. Time = -10.3 + .640(1000) = 629.7 seconds.
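The whole solution can be retraced from the summary totals quoted in the calculations (n = 6, Σx = 3050, Σy = 1888.99, Σxy = 1952863.5, Σx² = 3102500). The sketch below assumes those totals and rounds the coefficients the same way the slides do before predicting; the variable names are hypothetical.

```python
# Summary totals quoted in the solution (x = distance, y = time).
n = 6
sum_x = 3050
sum_y = 1888.99
sum_xy = 1952863.5
sum_x2 = 3102500

# Least squares slope and intercept from the summary totals.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

# Predict a 1000 meter race using the rounded model Time = -10.3 + .640(Distance).
predicted_1000 = round(b0, 1) + round(b1, 3) * 1000

print(round(b1, 6))              # 0.639545
print(round(b0, 2))              # -10.27
print(round(predicted_1000, 1))  # 629.7
```

Using the unrounded coefficients instead would shift the prediction slightly, which is a reminder that reported predictions depend on how the coefficients were rounded.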
