Chapter 4: More about Relationships in 2 variables Ms. Namad.


Introduction

• When two-variable data show a nonlinear relationship, we must develop new techniques for finding an appropriate model.

• 4.1 discusses how we can transform the data to straighten a nonlinear pattern (hardest section).

• 4.2 deals with relationships between categorical variables.

• 4.3 tackles the issue of establishing causation.


4.1 Transforming to achieve linearity

• Scatterplot of brain weight against body weight for 96 species of mammals:

• Scatterplot of brain weight with outliers removed (curved data):

The correlation between brain weight and body weight is r = 0.86, but this is misleading.

If we remove the elephant, the correlation for the other 95 species is r = 0.50.


• We need a linear pattern to do least-squares regression.

• After transformation: scatterplot and LSRL of the logarithm of brain weight against the logarithm of body weight for 96 species of mammals.

The effect is almost magical: the correlation is now r = 0.96.


Transforming (or re-expressing) the data

• Changing the scale of measurement that was used when the data were collected (a change of units) is a LINEAR TRANSFORMATION.

• As we know, linear transformations cannot straighten a curved relationship. To deal with curved data, we transform the data with other, nonlinear methods.

• Common transformations are logarithms and raising to a positive or negative power.


Example

• Scatterplot of Atlantic Ocean rockfish weight versus length. When we cube the length, the data look linear.

• A least-squares regression on the transformed points (length^3, weight) gives the equation:

• weight = 4.066 + 0.0147 × length^3

• If we superimpose this regression equation on the original data set, it matches closely.
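
The cube-then-fit idea can be sketched in Python. The rockfish measurements are not reproduced on the slide, so the lengths below are hypothetical and the weights are generated from the slide's own equation; the fit therefore just recovers 4.066 and 0.0147. The function name predict_weight is also made up for illustration.

```python
import numpy as np

# Hypothetical lengths; NOT the actual rockfish data from the slide.
length = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
# Hypothetical weights built from the slide's equation, so the fit below
# simply recovers the slide's coefficients.
weight = 4.066 + 0.0147 * length**3

# Transform the explanatory variable, then fit an ordinary least-squares line.
length_cubed = length**3
slope, intercept = np.polyfit(length_cubed, weight, 1)

def predict_weight(new_length):
    """Predict weight from an untransformed length by cubing it first."""
    return intercept + slope * new_length**3

print(f"weight = {intercept:.3f} + {slope:.4f} * length^3")
print(f"predicted weight at length 28: {predict_weight(28.0):.1f}")
```

Plotting predict_weight over the original (length, weight) points is the code version of superimposing the regression equation on the original scatterplot.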


Transforming with Powers (don’t memorize)

• Facts about powers:

• The graph of a power with exponent 1 (p = 1) is a straight line.

• Powers greater than 1 give graphs that bend upward. The sharpness of the bend increases as the power increases.

• Powers less than 1 but greater than 0 give graphs that bend downward.

• Powers less than 0 give graphs that decrease as x increases. Greater negative values of p result in graphs that decrease more quickly.

• In this hierarchy the logarithm takes the place of p = 0 (this is not the same as raising to the 0 power, which just gives a horizontal line at y = 1).


Hierarchy of Power transformations at work


Exponential Growth


Examples of Exponential growth

• Bacteria: The count of bacteria after x hours is 2^x

• The value of $24 invested for x years at 6% interest is 24 × 1.06^x

• Both are examples of the exponential growth model y = ab^x for different constants a and b.
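
As a quick check, both examples can be evaluated with one small Python function. The function name exponential_growth and the choice of x = 10 are illustrative assumptions, not part of the slides.

```python
def exponential_growth(a, b, x):
    """Exponential growth model y = a * b**x."""
    return a * b**x

# Bacteria: the count after x hours is 2^x, so a = 1 and b = 2.
bacteria_after_10_hours = exponential_growth(1, 2, 10)

# Investment: $24 at 6% interest for x years is 24 * 1.06^x.
value_after_10_years = exponential_growth(24, 1.06, 10)

print(bacteria_after_10_hours)            # 1024
print(round(value_after_10_years, 2))     # 42.98
```

In both cases y is multiplied by the same factor b every time x goes up by 1, which is what separates exponential growth from linear growth.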


The Logarithm Transformation

• If an exponential model of the form y = ab^x describes the relationship between x and y, then we can use logarithms to transform the data to produce a linear relationship (and vice versa: if transforming the (x, y) data to (x, log y) straightens the data, we know the relationship is exponential).

• So how does this work? Take the log of both sides of y = ab^x:

• log y = log (ab^x)

• = log a + log (b^x)

• = log a + (log b) x Does this look familiar?! It is a line in x with intercept log a and slope log b.
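
The derivation can be checked numerically. This is a minimal sketch assuming exact values from the investment example (a = 24, b = 1.06); real data would scatter around the line rather than fall exactly on it.

```python
import numpy as np

x = np.arange(0, 11)          # years 0 through 10
y = 24 * 1.06**x              # exact exponential values with a = 24, b = 1.06

# Regress log10(y) on x: the intercept estimates log a, the slope estimates log b.
slope, intercept = np.polyfit(x, np.log10(y), 1)

a_hat = 10**intercept         # undo the log to recover a
b_hat = 10**slope             # undo the log to recover b

print(round(a_hat, 3), round(b_hat, 3))
```

The recovered a_hat and b_hat match the constants used to build y, which is the "line with intercept log a and slope log b" from the last step of the algebra.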


Prediction in the Exponential Growth Model

• Regression is often used for prediction. In exponential growth, the logarithms of the responses, rather than the responses themselves, follow a linear pattern, so to predict we need to “undo” the logarithm transformation and return to the original units of measurement.

• Take the bacteria equation y = 2^x, where y is the number of bacteria after x hours. To apply linear regression we take the log of both sides, and the regression equation is log y hat = (log 2)(x).

• To predict the log of the number of bacteria after 15 hours: log y hat = (log 2)(15) = 4.515

• To find the ACTUAL predicted number of bacteria (y hat, not the log of that number) we take the inverse log (10^x) of 4.515.

• On the calculator, press 2nd LOG to get 10^( and enter the unrounded value of (log 2)(15): you get 32,768! Entering the rounded 4.515 gives a slightly smaller answer, so carry the extra decimal places.

• *Note: when the explanatory variable is a calendar year, transform the data to “years since” a starting year so that the values are smaller and don’t create problems when you perform the inverse transformation.
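
The same prediction can be sketched in Python, which also shows why rounding the log before undoing it matters. Variable names are illustrative.

```python
import math

slope = math.log10(2)      # slope of the line: log y-hat = (log 2) x
x = 15                     # predict 15 time units after the start

log_y_hat = slope * x                  # prediction on the log scale
y_hat = 10**log_y_hat                  # undo the log: back to a bacteria count

# Undoing the log after rounding to three decimals loses accuracy.
y_hat_from_rounded = 10**4.515

print(round(log_y_hat, 3))             # 4.515
print(round(y_hat))                    # 32768
print(round(y_hat_from_rounded))       # 32734
```

A difference in the fourth decimal place of the log becomes a difference of about 34 bacteria after the inverse transformation.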


Calculator Example

• 4.5 Some college students collected data on the intensity of light at various depths in a lake:

• Make a scatterplot (depth in L1, light intensity in L2) and describe the form.

• To achieve linearity, take the natural log (ln) of light intensity (define L3 as ln(L2)).

• Calculate the regression equation on the transformed data (so x is depth and y is the ln of light intensity): STAT-CALC-8 LinReg(a+bx) L1, L3

• ln(y hat) = 6.789 - 0.333x

• The intercept estimates the natural log of the light intensity at the surface of the lake (depth 0 meters), while the slope indicates that the natural log of light intensity decreases on average by 0.333 for each one-meter increase in depth. (The units are ln(lumens), not lumens.)

• Construct and interpret a residual plot (X list is L1, Y list is RESID). The plot shows our model is appropriate and r is now strong, so this was a good way to straighten the data.

Depth (meters)    Light Intensity (lumens)
5                 168
6                 120.42
7                 86.31
8                 61.87
9                 44.34
10                31.78
11                22.78
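
For readers without the calculator, the same steps can be sketched in Python with NumPy. The data are the seven (depth, intensity) pairs above; the variable names are assumptions, and np.polyfit stands in for LinReg(a+bx).

```python
import numpy as np

depth = np.array([5, 6, 7, 8, 9, 10, 11], dtype=float)               # L1
intensity = np.array([168, 120.42, 86.31, 61.87, 44.34, 31.78, 22.78])  # L2

ln_intensity = np.log(intensity)                                      # L3 = ln(L2)

# Least-squares line ln(y-hat) = a + b x on the transformed data.
b, a = np.polyfit(depth, ln_intensity, 1)

# Correlation and residuals for the transformed data.
r = np.corrcoef(depth, ln_intensity)[0, 1]
residuals = ln_intensity - (a + b * depth)

print(f"ln(y-hat) = {a:.3f} + ({b:.3f}) x")
print(f"r = {r:.4f}")
print("largest residual:", np.abs(residuals).max())
```

The fitted intercept and slope agree with 6.789 and -0.333, r is very close to -1, and the residuals are tiny, which matches the conclusion that the ln transformation straightens these data.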


• Perform the inverse transformation to express light intensity as an exponential function of depth in the lake (the inverse of ln is e^x on your calculator: 2nd LN):

• y hat = (e^(6.789))(e^(-0.333x))

• * To undo an ln or a log transformation: y hat = e^(a+bx) or y hat = 10^(a+bx)

• Or, to see it in the more familiar exponential form, this is the same as y hat = (e^a)(e^b)^x. NOTE: Either log or ln can be used for the transformation, as long as you undo it with the matching inverse (10^x for log, e^x for ln).

• Construct a scatterplot of the original data with this model superimposed (plot it in y = and go to your original statplot). Is your exponential function a satisfactory model for the data?

• Use your model to predict the light intensity at a depth of 22 meters.

• The actual reading at that depth was .58 lumens.
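
A short Python sketch of this prediction, using the rounded coefficients from the slides (a = 6.789, b = -0.333). The function name predict_intensity is illustrative.

```python
import math

a, b = 6.789, -0.333       # coefficients of ln(y-hat) = a + b x from the slides

def predict_intensity(depth_m):
    """Exponential model y-hat = (e^a)(e^b)^x, the inverse of the ln transformation."""
    return math.exp(a) * math.exp(b) ** depth_m

prediction = predict_intensity(22)
print(round(prediction, 2))    # 0.58
```

The model was fitted to depths of 5 to 11 meters, so 22 meters is an extrapolation; here the prediction still lands close to the actual reading of 0.58 lumens.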


Power Law Models

• Geometry tells us to expect area to go up with the square of a dimension such as diameter:

• Ex: the area of a circle changes with the square of the radius, and so with the square of the diameter x:

πr^2 = π(x/2)^2 = πx^2/4 = (π/4)x^2

• This is a Power Law Model of the form

• y = ax^p (different from the exponential y = ab^x)

• When you take the log of both sides to achieve linearity (log y = log a + p log x), you see that the power p is the slope of the straight line, so the slope is a good estimate of p in the underlying power model. The greater the scatter of the points about the fitted line, the smaller our confidence in the accuracy of this estimate.

• If taking the logs of both variables produces a linear scatterplot, a power law is a reasonable model for the original data.
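
The circle example makes a convenient check of the log-log idea. In this sketch the diameters are arbitrary illustrative values and the areas are computed exactly, so the fit should recover p = 2 and a = π/4 with no scatter.

```python
import math
import numpy as np

diameter = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative values of x
area = math.pi * (diameter / 2) ** 2                  # y = (pi/4) x^2

# Regress log y on log x: the slope estimates p, the intercept estimates log a.
p_hat, log_a_hat = np.polyfit(np.log10(diameter), np.log10(area), 1)
a_hat = 10**log_a_hat

print(round(p_hat, 3))                              # 2.0
print(round(a_hat, 4), round(math.pi / 4, 4))       # 0.7854 0.7854
```

With measured data the points would not fall exactly on the line, and the slope would only approximate the true power.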


Prediction in Power Law Models

• If transforming your data to (log x, log y) straightens it, then you are working with a Power Law model instead of an exponential one (remember, our transformation for exponential functions was (x, log y)).

• Get the a and b for the regression line of the transformed data on your calculator: log y hat = a + b log x

• Undo your ln or log transformation to get the regression equation for the original data:

• y hat = (10^a)(x^b) if you used log, or y hat = (e^a)(x^b) if you used ln


Summary: Exponential vs. Power

• If the relationship is exponential, then a plot of x versus log (y) should be roughly linear. If the relationship follows a power model, then a plot of log (x) versus log (y) should be roughly linear.

• In an exponential model you are transforming the response variable. In a power model you are transforming both.

• Our eyes are a bad judge of curves so we need to do both transformations, make a scatter plot of each, and compare the residual plot and r values of each to see which did a better job of linearizing the data.

• We can fit exponential growth and power models to data by finding the least-squares regression line for the transformed data, then doing the inverse transformation


Summary of what you need to know

• When data doesn’t look straight, try both transformations: (x, y) to (x, log y) and (x, y) to (log x, log y). Natural log (ln) works just as well as log!

• Check which transformation did a better job straightening:

• Make a scatterplot of each transformation. Do LinReg(a+bx) to check your r for each. The stronger the r, the better.

• Also do a residual plot for each transformation to see which better fits the data (for the exponential trial: L1, RESID. For the Power Law trial: L3 (the list holding log x), RESID)

• If the first transformation was better, then an underlying exponential function fits your data. If the second transformation was better, then it’s a power model.

• Find the regression equation for your original untransformed data (shown for log; use e in place of 10 if you used ln):

• If it was exponential, y hat = (10^a)(10^b)^x

• If it was a power model, y hat = (10^a)(x^b)
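
The checklist above can be sketched as one Python function. This is a simplified version that compares only the two r values (the residual plots still need to be examined by eye); the function name compare_transformations is hypothetical, and the data are the lake light-intensity readings from the calculator example.

```python
import numpy as np

def compare_transformations(x, y):
    """Try (x, log y) and (log x, log y); return the better model and its constants."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_exponential = np.corrcoef(x, np.log10(y))[0, 1]
    r_power = np.corrcoef(np.log10(x), np.log10(y))[0, 1]
    if abs(r_exponential) >= abs(r_power):
        b, a = np.polyfit(x, np.log10(y), 1)
        return "exponential", (10**a, 10**b)      # y-hat = (10^a)(10^b)^x
    b, a = np.polyfit(np.log10(x), np.log10(y), 1)
    return "power", (10**a, b)                    # y-hat = (10^a)(x^b)

depth = [5, 6, 7, 8, 9, 10, 11]
intensity = [168, 120.42, 86.31, 61.87, 44.34, 31.78, 22.78]

model, (coefficient, base_or_power) = compare_transformations(depth, intensity)
print(model, round(coefficient, 1), round(base_or_power, 4))
```

For the lake data the exponential trial wins, and the recovered constants agree with e^6.789 and e^(-0.333) from the ln-based fit, which shows that log and ln lead to the same model.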


4.2 Relationships between categorical variables


Categorical variables and marginal distributions

• Some variables are categorical by nature (sex, race, occupation), others are created by grouping values of a quantitative variable into classes.

• The distributions of sex alone and age alone are called marginal distributions because they appear at the bottom and right margins of the two-way table.

Age Group      Female    Male     Total
15-17 yrs          89      61       150
18-24           5,668   4,697    10,365
25-34           1,904   1,589     3,494
35 or older     1,660     970     2,630
Total           9,321   7,317    16,639


Describing Relationships

• Since counts are often hard to compare, we take percents.

• Ex: women make up 54.7% of the traditional college age group, but they make up 63.1% of students 35 and older. Women are more likely than men to return to college after working for a number of years.

• When we compare the % of women in two age groups, we are comparing two conditional distributions (the distribution of sex within each age group).
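
The marginal and conditional percents can be recomputed from the cell counts with a few lines of Python. The dictionary layout and variable names are illustrative; the counts are the ones in the two-way table.

```python
counts = {
    "15-17": {"Female": 89, "Male": 61},
    "18-24": {"Female": 5668, "Male": 4697},
    "25-34": {"Female": 1904, "Male": 1589},
    "35 or older": {"Female": 1660, "Male": 970},
}

# Totals recomputed from the cells; they can differ by 1 from the printed totals.
row_totals = {age: sum(cells.values()) for age, cells in counts.items()}
column_totals = {sex: sum(cells[sex] for cells in counts.values())
                 for sex in ("Female", "Male")}
grand_total = sum(row_totals.values())

# Marginal distribution of sex (percent of all students).
marginal_sex = {sex: 100 * n / grand_total for sex, n in column_totals.items()}

# Conditional distribution: percent female within each age group.
percent_female = {age: 100 * cells["Female"] / row_totals[age]
                  for age, cells in counts.items()}

for age, pct in percent_female.items():
    print(f"{age}: {pct:.1f}% female")
```

The 18-24 and 35-or-older rows give 54.7% and 63.1%, the two conditional percents quoted above.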


Simpson’s Paradox

• Comparing accident victims transported by helicopter with those transported by road, we see that 32% of the helicopter victims died compared with only 24% of the others. This seems discouraging.

                   Heli    Road
Victim died          64     260
Victim survived     136     840
Total               200   1,100


Lurking variable?

• The explanation is that the helicopter is sent mostly to serious accidents so that the victims transported by helicopter are more often seriously injured and likely to die with or without helicopter evacuation. Here is the same data broken down by seriousness of accident:

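              Serious         Less serious
              Heli   Road     Heli   Road
Died            48     60       16    200
Survived        52     40       84    800
Total          100    100      100  1,000

• If you compare less serious accidents, 84% survived by helicopter vs. 80% by road. For serious accidents, 52% survived by helicopter vs. 40% by road. Within each group the helicopter does better, even though it looks worse when the groups are combined.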
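
The reversal is easy to verify by computing the death rates both ways. A minimal Python sketch using the counts from the two tables; the data layout and names are illustrative.

```python
# (died, survived) counts by seriousness of accident and mode of transport.
data = {
    "serious": {"helicopter": (48, 52), "road": (60, 40)},
    "less serious": {"helicopter": (16, 84), "road": (200, 800)},
}

def death_rate(died, survived):
    """Percent of victims who died."""
    return 100 * died / (died + survived)

# Death rates within each seriousness group.
by_severity = {
    severity: {mode: death_rate(*cells) for mode, cells in modes.items()}
    for severity, modes in data.items()
}

# Death rates after combining the two groups.
overall = {}
for mode in ("helicopter", "road"):
    died = sum(data[severity][mode][0] for severity in data)
    survived = sum(data[severity][mode][1] for severity in data)
    overall[mode] = death_rate(died, survived)

print(by_severity)
print({mode: round(rate, 1) for mode, rate in overall.items()})
```

Within each group the helicopter death rate is lower (48% vs. 60%, and 16% vs. 20%), yet the combined rates are 32% vs. about 24%, because half of the helicopter victims come from serious accidents while fewer than 1 in 10 road victims do.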


4.3 Establishing Causation


Beware the post-hoc fallacy

“Post hoc, ergo propter hoc.” (“After this, therefore because of this.”)

To avoid falling for the post-hoc fallacy (assuming that an observed correlation is due to causation), you must put any statement of relationship through sharp inspection.

Causation cannot be established “after the fact.” It can only be established through well-designed experiments. {see Ch 5}


Explaining Association

Strong associations can generally be explained by one of three relationships.

Confounding: x may cause y, but y may instead be caused by a confounding variable z

Common Response: x and y are reacting to a lurking variable z

Causation: x causes y


Causation

Causation is not easily established.

The best evidence for causation comes from experiments that change x while holding all other factors fixed.

Even when direct causation is present, it is rarely a complete explanation of an association between two variables.

Even well established causal relations may not generalize to other settings.


Common Response

“Beware the Lurking Variable”

The observed association between two variables may be due to a third variable.

Both x and y may be changing in response to changes in z.

Consider the fact that students who are smart and who have learned a lot tend to have both high SAT scores and high college grades. The positive correlation is explained by this common response to students’ ability and knowledge.


Confounding

Two variables are confounded when their effects on a response variable cannot be distinguished from each other.

Confounding prevents us from drawing conclusions about causation.

We can help reduce the chances of confounding by designing a well-controlled experiment.


Confounding Example

• The BMIs of mothers and their daughters are strongly correlated. Gene inheritance no doubt explains part of the association between the BMI of daughters and their mothers, but can we use r or r squared to say how much inheritance contributes to the daughters’ BMIs? No!

• Mothers who are overweight also set an example of little exercise, poor eating habits, and lots of television, so their daughters pick up these habits to some extent. The influence of heredity is mixed up with influences from the girls’ environment.

• The mixing of influences is what we call confounding.


Examples

4.41: There is a high positive correlation: nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TVs?

4.42: People who use artificial sweeteners in place of sugar tend to be heavier than people who use sugar. Does artificial sweetener use cause weight gain?

4.43: Women who work in the production of computer chips have abnormally high numbers of miscarriages. The union claimed chemicals cause the miscarriages. Another explanation may be the fact these workers spend a lot of time on their feet.


4.44: People with two cars tend to live longer than people who own only one car. Owning three cars is even better, and so on. What might explain the association?

4.45: Children who watch many hours of TV get lower grades on average than those who watch less TV. Why does this fact not show that watching TV causes low grades?


4.46: Data show that married men (and men who are divorced or widowed) earn more than men who have never been married. If you want to make more money, should you get married?

4.47: High school students who take the SAT, enroll in an SAT coaching course, and take the SAT again raise their mathematics score from an average of 521 to 561. Can this increase be attributed entirely to taking the course?