Unit 5: Transformations to achieve linearity
Transcript of Unit 5: Transformations to achieve linearity
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 1
Unit 5: Transformations to achieve linearity
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Building a solid foundation
• Unit 1: Introduction to simple linear regression

Mastering the subtleties
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression modeling in practice
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 3
In this unit, we’re going to learn about…
• What happens if we fit a linear regression model to data that are nonlinearly related?
• Alternative statistical models that are useful for nonlinear relationships
  – Logarithms—a brief refresher
  – The effects of logarithmic transformation
  – Other nonlinear relationships that can be modeled using logarithmic transformations
• What’s the difference between taking logarithms to base 2, 10 and e?
  – Interpreting the regression of Y on log(X)
  – Interpreting the regression of log(Y) on X
  – Interpreting the regression of log(Y) on log(X)
• How should we select among alternative transformation options: The Rule of the Bulge
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 4
The 10th Grade Math MCAS: % Scoring in the “Advanced” range
ID DISTRICT HOME PCTADV L2HOME
 1 AMESBURY    341.75  33  8.41680
 2 ANDOVER     556.75  56  9.12089
 3 ARLINGTON   476.00  44  8.89482
 4 ASHLAND     390.00  38  8.60733
 5 BELLINGHAM  308.00  15  8.26679
 6 BELMONT     675.00  54  9.39874
 7 BEVERLY     385.00  38  8.58871
 8 BILLERICA   365.00  34  8.51175
 9 BRAINTREE   375.00  39  8.55075
10 BROCKTON    269.90  11  8.07628
11 BURLINGTON  397.00  43  8.63300
12 CAMBRIDGE   749.00  23  9.54882
13 CANTON      472.50  41  8.88417
14 CHELMSFORD  360.45  38  8.49366
15 DANVERS     389.95  31  8.60715
16 DEDHAM      383.25  28  8.58214
17 DRACUT      296.25  27  8.21067
18 DUXBURY     537.50  55  9.07012
. . .
Model: PCTADV = β0 + β1(HOME) + ε

Outcome — The UNIVARIATE Procedure
Variable: PCTADV
  Location                 Variability
  Mean    36.63636         Std Deviation   14.19603
  Median  36.00000         Variance       201.52727
  Mode    38.00000         Range           60.00000

  Stem Leaf              #    Boxplot
   7 1                   1       |
   6 59                  2       |
   6 12                  2       |
   5 5678                4       |
   5 12344               5       |
   4 57                  2       |
   4 1223344             7    +-----+
   3 6678888999         10    *--+--*
   3 011223334           9    |     |
   2 567778899           9    +-----+
   2 011123344           9       |
   1 56799               5       |
   1 1                   1       |
     ----+----+---

Predictor — The UNIVARIATE Procedure
Variable: HOME
  Location                 Variability
  Mean   426.7553          Std Deviation  128.62042
  Median 384.1250          Variance       16543
  Mode   355.0000          Range          633.00000

  Stem Leaf                       #    Boxplot
   8 8                            1       *
   8
   7 5                            1       0
   7 14                           2       0
   6 558                          3       0
   6 0                            1       |
   5 699                          3       |
   5 244                          3       |
   4 5678                         4    +-----+
   4 00012222234                 11    |  +  |
   3 55566666777778888899        20    *-----*
   3 001222333334444             15    +-----+
   2 7                            1       |
   2 4                            1       |
     ----+----+----+----+

[Scatterplot of PCTADV vs. HOME; Wellesley, Cambridge, Newton, Lexington, and Brockton are labeled]

RQ: Is the percentage of 10th graders scoring in the advanced range on the math MCAS primarily a function of a district’s socioeconomic status?

n = 66
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 5
Side Note: Percent Difference vs. Percentage Point Difference
District Noname — Percent Advanced: 10%

Mandated to increase percent advanced by:
• 2 percentage points: Target rate for percent advanced?
• 5 percentage points: Target rate for percent advanced?
• 7 percentage points: Target rate for percent advanced?
• 10 percent: Target rate for percent advanced?
• 50 percent: Target rate for percent advanced?
• 100 percent: Target rate for percent advanced?

A percent difference is a proportional difference that is relative to a particular starting value.
A percentage point difference is an arithmetic difference between two percentages.
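To make the distinction concrete, here is a small illustrative sketch (Python, not part of the course's SAS materials) that fills in the target rates for District Noname's 10% baseline:

```python
# Percent vs. percentage-point differences, starting from a 10% baseline.
# A percentage-POINT increase is additive; a PERCENT increase is multiplicative.

baseline = 10.0  # District Noname: 10% scoring Advanced

def add_points(rate, points):
    """Percentage-point increase: simple addition."""
    return rate + points

def add_percent(rate, pct):
    """Percent increase: proportional to the starting value."""
    return rate * (1 + pct / 100)

# The slide's mandated increases:
print(add_points(baseline, 2))            # -> 12.0
print(add_points(baseline, 5))            # -> 15.0
print(add_points(baseline, 7))            # -> 17.0
print(round(add_percent(baseline, 10)))   # -> 11
print(round(add_percent(baseline, 50)))   # -> 15
print(round(add_percent(baseline, 100)))  # -> 20
```

Note that a "100 percent" increase (doubling, to 20%) is far smaller here than a "15 percentage point" increase would be relative to a different baseline; the starting value matters only for the multiplicative version.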
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 6
What’s the relationship between PctAdv and Home Prices?
PCTADV^ = 0.345 + 0.085(HOME)

[Scatterplot with fitted line; Cambridge is a visible outlier]
The REG Procedure
Dependent Variable: PCTADV

Root MSE          9.11982    R-Square    0.5936
Dependent Mean   36.63636    Adj R-Sq    0.5873

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1      0.34532     3.91746       0.09      0.9300
HOME           1      0.08504     0.00879       9.67      <.0001
If the assumptions held (which they don’t!), we could interpret the slope coefficient as follows:

Each $1,000 difference in median home price is associated, on average, with a 0.085 percentage point difference in the percent of students scoring in the advanced category on MCAS.
or: a $10,000 difference → 0.85 percentage point difference
or: a $100,000 difference → 8.5 percentage point difference
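Because the model is linear, the slope simply scales with the size of the difference in the predictor. A quick illustrative check (Python, using the slope from the SAS output above):

```python
# Scaling a linear slope: with PCTADV-hat = 0.345 + 0.085*HOME and HOME
# measured in $1000s, a k-unit difference in HOME implies a 0.085*k
# percentage-point difference in predicted PCTADV.

slope = 0.08504  # per $1,000 of median home price (from the SAS output)

diff_10k  = slope * 10    # a $10,000 difference
diff_100k = slope * 100   # a $100,000 difference

print(round(diff_10k, 2))   # -> 0.85 percentage points
print(round(diff_100k, 1))  # -> 8.5 percentage points
```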
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 7
What happens if we set aside Cambridge?
The REG Procedure
Dependent Variable: PCTADV

Root MSE          7.37525    R-Square    0.7346
Dependent Mean   36.84615    Adj R-Sq    0.7304

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1     -4.86330     3.28861      -1.48      0.1442
HOME           1      0.09888     0.00743      13.20      <.0001
Without Cambridge: PCTADV^ = -4.863 + 0.099(HOME)   (R² = 73.5%)

Regression line with Cambridge: PCTADV^ = 0.345 + 0.085(HOME)   (R² = 59.4%)
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 8
What happens when we fit a linear model to a non-linear relationship?
[Residual plot: the linear fit leaves systematic runs of residuals—regions of more negative residuals (over-predicting) alternating with regions of more positive residuals (under-predicting)]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 9
What alternative statistical model might be useful here?
Proposed model:  PCTADV = β0 + β1·log2(HOME) + ε
Original model:  PCTADV = β0 + β1(HOME) + ε
What kind of population model would have given rise to these sample data?

The effect of HOME prices is larger when HOME prices are low and smaller when HOME prices are high.
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 10
Logarithmic transformations in everyday life
Musical scales
• Up 1 octave = doubling of CPS (cycles per second)
• Octaves 1 → 2 → 3: CPS rises +110 (doubling), then +220 (doubling), then +440 (doubling)

Richter scale
• Up 1 Richter = 10-fold increase in seismic wave amplitude (ASW)
• Richter 6.0 → amplitude 1,000,000 (Greece, 1999); 7.0 → 10,000,000 (Japan, 1995); 8.0 → 100,000,000 (SF, 1906); 9.0 → 1,000,000,000 (Sumatra, 2004)

Flash drives, then and now
• Each new generation doubles in storage capacity: 16, 32, 64, 128, 256, 512, 1024 MB (1 GB), 2G, 4G, 8G, 16G, …
• M-Systems’ original 16 MB “disgo” (1998), considered the first USB flash drive
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 11
Understanding Logarithms
base^power = x — the power is the logarithm: log_base(x) = power, because raising the base to that power yields x.

Powers of 10:          Powers of 2:
10^1 = 10              2^1 = 2
10^2 = 100             2^2 = 4
10^3 = 1,000           2^3 = 8
10^4 = 10,000          2^4 = 16
10^5 = 100,000         2^5 = 32

These are the logarithms:
log10(10) = 1          log2(2) = 1
log10(100) = 2         log2(4) = 2
log10(1,000) = 3       log2(8) = 3
log10(10,000) = 4      log2(16) = 4
log10(100,000) = 5     log2(32) = 5

Each 1 unit increase in a base-10 logarithm represents a 10-fold increase in x.
Each 1 unit increase in a base-2 logarithm represents a doubling of x.

Raw:  1  2  4  8  16  32  64
Log2: 0  1  2  3   4   5   6
For more on logarithms:Dallal, Logarithms, part I
So… taking logs spreads out the distance between small values and compresses the distance between large values.
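These defining properties are easy to check numerically; an illustrative sketch (Python, not part of the course's SAS materials):

```python
import math

# A logarithm is the power: b**log_b(x) == x.
assert abs(math.log2(32) - 5) < 1e-12      # because 2**5 == 32
assert abs(math.log10(100_000) - 5) < 1e-12  # because 10**5 == 100,000

# Each 1-unit step in log2 doubles x; in log10 it multiplies x by 10.
raw = [1, 2, 4, 8, 16, 32, 64]
print([math.log2(x) for x in raw])  # -> [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Logs spread out small values and compress large ones: the gaps 1->2 and
# 32->64 are very different on the raw scale, but both equal one log2 unit.
print(math.log2(2) - math.log2(1), math.log2(64) - math.log2(32))  # -> 1.0 1.0
```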
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 12
Understanding the effects of logarithmic transformation in the MCAS data
Brockton (HOME = $269.9K → L2HOME = 8.07)
Westwood (HOME = $600K → 9.23)
Wellesley (HOME = $875K → 9.77)

One log unit on the L2HOME scale corresponds to a doubling of the raw scale: 256 × 2 = 512, and 512 × 2 = 1024.

Gapminder
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 13
What happens if we regress PctAdv on log2(HOME)?
The REG Procedure
Dependent Variable: PCTADV

Root MSE          6.77259    R-Square    0.7762
Dependent Mean   36.84615    Adj R-Sq    0.7726

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1   -255.31368    19.78409     -12.91      <.0001
L2HOME         1     33.69919     2.27994      14.78      <.0001
PCTADV^ = -255.32 + 33.70(L2HOME)

L2HOME     HOME     PCTADV^
  8.0       256      13.48
  8.5       362      30.28
  9.0       512      47.08     (+33.70 per unit of L2HOME,
  9.5       724      63.88      i.e., per doubling of HOME)
 10.0     1,024      80.68
Every doubling of median home price is positively associated with a 33.7 percentage point difference in the percentage of students scoring in the advanced range.
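The "per doubling" reading follows directly from log2's defining property; a quick illustrative check (Python, using the fitted coefficients from the SAS output above):

```python
import math

# Fitted model from the SAS output: PCTADV-hat = -255.314 + 33.699 * log2(HOME)
b0, b1 = -255.31368, 33.69919

def pctadv_hat(home):
    return b0 + b1 * math.log2(home)

# Doubling HOME adds exactly b1 to the prediction, whatever the starting value:
# log2(2h) = log2(h) + 1, so the prediction rises by b1 * 1.
for h in (256, 512, 300):
    gain = pctadv_hat(2 * h) - pctadv_hat(h)
    print(round(gain, 2))  # -> 33.7 each time
```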
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 14
Conceptually, what have we done by regressing Y on Log2(X)?
Linear model:  PCTADV^ = -4.863 + 0.099(HOME)
Log model:     PCTADV^ = -255.32 + 33.70·log2(HOME)
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 15
How would we summarize the results of this analysis?
Comparison of regression models predicting percentage of students scoring “Advanced” on 10th grade MCAS (n=65) (Massachusetts Dept of Education, 2005)

Predictor                    Model A        Model B
Intercept                    -4.863         -255.32***
                             (3.289)        (19.78)
Median Home Price            0.099***
                             (0.007)
Log2(Median Home Price)                     33.70***
                                            (2.28)
R²                           73.5%          77.6%

Cell entries are estimated regression coefficients (standard errors). *** p<.001
Note: Cambridge set aside from both models because of its unusually low MCAS performance relative to its unusually high real estate values.

PCTADV^ = -255.32 + 33.70(L2HOME)
Some possible ways to describe the effect:
• The percentage of 10th graders scoring in the advanced range on the math MCAS is an estimated 33.70 points higher for every doubling of district median home prices.
• As median home prices double, the percentage of 10th graders scoring in the advanced range on the math MCAS is 33.70 points higher, on average.
• A doubling in median home prices is associated, on average, with a 33.70 point difference in the percentage of 10th graders scoring in the advanced range on the math MCAS.
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 16
OECD’s Education at a Glance
“The relationship between GDP per capita and expenditure per student is complex. Chart B1.6 shows the co-existence of two different relationships between two distinct groups of countries… Countries with a GDP per capita around US$25,000 demonstrate a clear positive relationship between spending on education per student and GDP per capita. … There is considerable variation in spending on education per student among OECD countries with a GDP per capita greater than $25,000, where the higher GDP per capita, the greater the variation in expenditure devoted to students.” (OECD, 2005)
RQ: What’s the relationship between GDP and PPE in OECD countries?
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 17
Let’s examine the OECD data for ourselves…
country GDP PPE L2PPE
Mexico     9.215   6.0737  2.60258
Poland    10.846   4.8342  2.27327
Slovak    12.255   4.7556  2.24964
Hungary   13.894   8.2048  3.03647
Czech Re  15.102   6.2355  2.64051
Korea     17.016   6.0467  2.59614
Portugal  18.434   6.9602  2.79913
Greece    18.439   4.7306  2.24204
Spain     22.406   8.0205  3.00369
Italy     25.568   8.6357  3.11032
Germany   25.917  10.9990  3.45930
Finland   26.495  11.7676  3.55675
Japan     26.954  11.7158  3.55039
Sweden    27.209  15.7151  3.97408
France    27.217   9.2764  3.21356
Belgium   27.716  12.0187  3.58720
UnK       27.948  11.8222  3.56342
Australia 28.068  12.4160  3.63412
Iceland   28.399   8.2505  3.04449
Austria   28.872  12.4475  3.63778
Nether    29.009  13.1011  3.71162
Denmark   29.231  15.1830  3.92438
Switzerl  30.455  23.7141  4.56768
Ireland   32.646   9.8091  3.29413
Norway    35.482  13.7387  3.78018
US        36.121  20.5454  4.36074
Model: PPE = β0 + β1(GDP) + ε

Outcome — The UNIVARIATE Procedure
Variable: PPE
  Location                Variability
  Mean    10.65453        Std Deviation   4.67585
  Median  10.40407        Variance       21.86353
  Mode    .               Range          18.98350

  Stem Leaf           #    Boxplot
  22 7                1       0
  20 5                1       |
  18                          |
  16                          |
  14 27               2       |
  12 04417            5    +-----+
  10 0788             4    *--+--*
   8 023638           6    |     |
   6 0120             4    +-----+
   4 788              3       |
     ----+----

Predictor — The UNIVARIATE Procedure
Variable: GDP
  Location                Variability
  Mean    24.26592        Std Deviation   7.48505
  Median  27.08150        Variance       56.02600
  Mode    .               Range          26.9060

  Stem Leaf                 #    Boxplot
   3 56                     2       |
   3 03                     2       |
   2 6667778888999         13    +-----+
   2 2                      1    |  +  |
   1 5788                   4    +-----+
   1 124                    3       |
   0 9                      1       |
     ----+----+----

n = 26; Switzerland and the US stand out at the top of the PPE distribution.
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 18
What’s the relationship between PPE and GDP?
PPE^ = -0.88 + 0.48(GDP)

The REG Procedure
Dependent Variable: PPE

Root MSE          3.09519    R-Square    0.5793
Dependent Mean   10.65453    Adj R-Sq    0.5618

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1     -0.88349     2.09666      -0.42      0.6772
GDP            1      0.47548     0.08270       5.75      <.0001

[Scatterplot with fitted line (Switzerland labeled) and residual plot: a region of more negative residuals (over-predicting) flanked by regions of more positive residuals (under-predicting)]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 19
What alternative statistical model might be useful here?
What kind of population model would have given rise to these sample data?
The effect of GDP is relative to its magnitude: its effect is larger when GDP is larger and smaller when GDP is smaller.

Linear model:       PPE = β0 + β1(GDP) + ε
Exponential model:  PPE = 2^(β0 + β1·GDP + ε)

Taking logs:        log2(PPE) = β0 + β1(GDP) + ε
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 20
Exponential growth models in everyday life
Y = 2^(β0 + β1·X)
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 21
How do we fit and interpret the exponential growth model?
The model: Y = 2^(β0 + β1·X)

Why does β1 yield a percentage growth rate?

At X = x1:      Y1 = 2^(β0 + β1·x1)
At X = x1 + 1:  Y2 = 2^(β0 + β1·(x1 + 1)) = 2^(β0 + β1·x1) · 2^β1

So Y2 is 2^β1 times larger than Y1, and:

    percentage growth rate = 100(2^β1 − 1)

How do we solve for this in practice? Use 2 key properties of logs:
1. log(xy) = log(x) + log(y)
2. log(x^p) = p·log(x)

Taking log2 of both sides of the model:

    log2(Y) = log2(2^(β0 + β1·X)) = (β0 + β1·X)·log2(2) = β0 + β1·X

So just regress log2(Y) on X and substitute the estimated slope into the equation for the percentage growth rate to obtain the estimated percentage growth rate per unit change in X.
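The algebra can be verified numerically. The sketch below (illustrative Python; b0 and b1 are made-up coefficients, not estimates from the slides) confirms that a one-unit step in X always multiplies Y by 2^β1:

```python
# Exponential growth model Y = 2**(b0 + b1*X): a one-unit step in X
# multiplies Y by 2**b1, so percentage growth rate = 100*(2**b1 - 1).
b0, b1 = 1.0, 0.10  # hypothetical coefficients, for illustration only

def y(x):
    return 2 ** (b0 + b1 * x)

# The ratio Y2/Y1 for a one-unit step is 2**b1 no matter where you start:
for x in (0, 4, 25):
    print(round(y(x + 1) / y(x), 6))  # same value each time (= 2**0.10)

print(round(100 * (2 ** b1 - 1), 2))  # -> 7.18 (% growth per unit of X)
```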
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 22
What happens if we regress log2(PPE) on GDP?
The REG Procedure
Dependent Variable: L2PPE

Root MSE          0.34547    R-Square    0.7051
Dependent Mean    3.28514    Adj R-Sq    0.6928

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1      1.58837     0.23402       6.79      <.0001
GDP            1      0.06992     0.00923       7.58      <.0001
L2PPE^ = 1.5884 + 0.0699(GDP)

For each $1,000 of GDP, PPE is about 5% higher:

GDP     L2PPE^     PPE^ ($)
10      2.2874      4881.75
11      2.3573      5124.11    (+0.0699 in L2PPE → +$244 ≈ 5%)
20      2.9864      7924.94
21      3.0563      8318.37    (+0.0699 in L2PPE → +$393 ≈ 5%)
30      3.6854     12865.18
31      3.7553     13503.86    (+0.0699 in L2PPE → +$638 ≈ 5%)

%age growth rate = 100(2^0.06992 − 1) = 4.97%
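The slide's arithmetic can be replicated directly (illustrative Python, using the estimates from the SAS output above):

```python
# Verifying the growth-rate arithmetic for L2PPE-hat = 1.5884 + 0.0699*GDP:
# each additional $1,000 of GDP multiplies predicted PPE by 2**b1.
b0, b1 = 1.58837, 0.06992  # estimates from the SAS output

growth = 100 * (2 ** b1 - 1)
print(round(growth, 2))  # -> 4.97, the slide's "about 5%" per $1,000 of GDP

# The same ~5% step appears when we undo the log at any GDP level:
ppe_10 = 2 ** (b0 + b1 * 10)   # predicted PPE (in $1000s) at GDP = 10
ppe_11 = 2 ** (b0 + b1 * 11)
print(round(100 * (ppe_11 / ppe_10 - 1), 2))  # -> 4.97 again
```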
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 23
Side note: what’s a natural logarithm (and why would we ever use it)?

The model: Y = e^(β0 + β1·X)

e = lim (1 + 1/n)^n = 2.718281828…
    n→∞

e^1 = 2.72
e^2 = 7.39
e^3 = 20.09
e^4 = 54.60
e^5 = 148.41

For more about e and natural logs

If β1 is small (less than .25 or so), e^β1 ≈ 1 + β1.

The REG Procedure
Dependent Variable: LnPPE

Root MSE          0.23946    R-Square    0.7051
Dependent Mean    2.27708    Adj R-Sq    0.6928

Parameter Estimates
                    Parameter    Standard
Variable      DF     Estimate       Error    t Value    Pr > |t|
Intercept      1      1.10098     0.16221       6.79      <.0001
GDP            1      0.04847     0.00640       7.58      <.0001

The same logic applies whatever the base—Y = 2^(β0 + β1·X) or, in general, Y = constant^(β0 + β1·X)—so:

    percentage growth rate = 100(e^β1 − 1)

    100(e^0.04847 − 1) ≈ 4.97%; since β1 is small, 100·β1 = 4.8% is a quick approximation.
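Both the limit definition of e and the small-β1 shortcut are easy to check (illustrative Python, using the LnPPE slope from the SAS output above):

```python
import math

# e as a limit: (1 + 1/n)**n approaches 2.718281828... as n grows.
print(round((1 + 1 / 10**6) ** 10**6, 4))  # -> 2.7183 (already close at n = 1e6)

# Growth rate from the natural-log model: exact formula vs. small-b1 shortcut.
b1 = 0.04847  # LnPPE-on-GDP slope from the SAS output
exact = 100 * (math.exp(b1) - 1)  # exact: about 4.97% per $1,000 of GDP
shortcut = 100 * b1               # e**b1 - 1 ~ b1 when b1 is small: 4.8%
print(round(exact, 2), round(shortcut, 1))  # -> 4.97 4.8
```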
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 24
How would we summarize the results of this analysis?
Comparison of regression models predicting Per Pupil Expenditures in OECD countries (n=26) (OECD, 2005)

Predictor              Model A: PPE    Model B: ln(PPE)
Intercept              -0.883          1.101***
                       (2.097)         (0.162)
Per Capita GDP         0.475***        0.048***
                       (0.083)         (0.006)
R²                     57.9%           70.5%

Cell entries are estimated regression coefficients (standard errors). *** p<.001

LnPPE^ = 1.101 + 0.048(GDP)

Another possible way to describe the effect:
• Per capita gross domestic product (GDP) is a strong predictor of per pupil expenditures. If we compare two countries whose GDPs differ by $1,000, we predict that the richer country will have a per pupil expenditure that is about 5% higher.

When comparing models, remember that:
• You’re trying to evaluate whether the model’s assumptions are tenable
• R² is NOT a measure of whether assumptions are tenable
• R² statistics do not tell us which model is “better” (both in general and especially if you’ve transformed Y) (go to appendix)
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 25
Who’s got the biggest brain?
ID SPECIES BRAIN BODY lnBRAIN lnBODY
 1 Lessershort-tailedshrew     0.14     0.01  -1.96611  -5.29832
 2 Littlebrownbat              0.25     0.01  -1.38629  -4.60517
 3 Bigbrownbat                 0.30     0.02  -1.20397  -3.77226
 4 Mouse                       0.40     0.02  -0.91629  -3.77226
 5 Muskshrew                   0.33     0.05  -1.10866  -3.03655
 6 Starnosedmole               1.00     0.06   0.00000  -2.81341
 7 Easter.mericanmole          1.20     0.08   0.18232  -2.59027
 8 Groundsquirrel              4.00     0.10   1.38629  -2.29263
 9 Treeshrew                   2.50     0.10   0.91629  -2.26336
10 Goldenhamster               1.00     0.12   0.00000  -2.12026
11 Molerat                     3.00     0.12   1.09861  -2.10373
12 Galago                      5.00     0.20   1.60944  -1.60944
13 Rat                         1.90     0.28   0.64185  -1.27297
14 Chinchilla                  6.40     0.43   1.85630  -0.85567
15 Owlmonkey                  15.50     0.48   2.74084  -0.73397
. . .
47 Chimpanzee                 440.00    52.16   6.08677   3.95432
48 Sheep                      175.00    55.50   5.16479   4.01638
49 Giantarmadillo              81.00    60.00   4.39445   4.09434
50 Human                     1320.00    62.00   7.18539   4.12713
51 Grayseal                   325.00    85.00   5.78383   4.44265
52 Jaguar                     157.00   100.00   5.05625   4.60517
53 Braziliantapir             169.00   160.00   5.12990   5.07517
54 Donkey                     419.00   187.10   6.03787   5.23164
55 Pig                        180.00   192.00   5.19296   5.25750
56 Gorilla                    406.00   207.00   6.00635   5.33272
57 Okapi                      490.00   250.00   6.19441   5.52146
58 Cow                        423.00   465.00   6.04737   6.14204
59 Horse                      655.00   521.00   6.48464   6.25575
60 Giraffe                    680.00   529.00   6.52209   6.27099
61 Asianelephant             4603.00  2547.00   8.43446   7.84267
62 Africanelephant           5712.00  6654.00   8.65032   8.80297
Model: BRAIN = β0 + β1(BODY) + ε
Source: Allison, T. & Cicchetti, D. V. (1976). Sleep in mammals: Ecological and Constitutional Correlates. Science, 194, 732-734.
RQ: What’s the relationship between brain weight and body weight?
n = 62
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 26
Distribution of BRAIN and BODY
Outcome — The UNIVARIATE Procedure
Variable: BRAIN
  Location               Variability
  Mean    283.1342       Std Deviation   930.27894
  Median   17.2500       Variance        865419
  Mode      1.0000       Range           5712

  Histogram                              #    Boxplot
  5750+*                                 1       *
      .
      .*                                 1       *
      .
      .
      .
      .
      .
      .
      .*                                 1       *
      .*                                 2       0
   250+*****************************    57    +--0--+
      ----+----+----+----+----+----
      * may represent up to 2 counts

Predictor — The UNIVARIATE Procedure
Variable: BODY
  Location               Variability
  Mean    198.7900       Std Deviation   899.15801
  Median    3.3425       Variance        808485
  Mode      0.0230       Range           6654

  Histogram                              #    Boxplot
  6750+*                                 1       *
      .
      .
      .
      .
      .
      .
      .
      .*                                 1       *
      .
      .
      .
      .*                                 2       *
   250+*****************************    58    +--0--+
      ----+----+----+----+----+----
      * may represent up to 2 counts

[The African and Asian elephants are the extreme values in both distributions]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 27
Plots of BRAIN vs. BODY on several scales
[Panels plotting BRAIN vs. BODY on raw, semi-log, and log-log scales; the African and Asian elephants are labeled in each panel]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 28
What’s the relationship between LnBRAIN and LnBODY?
LnBRAIN^ = 2.13 + 0.75(LnBODY)

But how do we interpret the estimated regression coefficient, 0.75?

LnBODY     BODY        LnBRAIN^     BRAIN^
   1          2.72       2.88         17.81
   2          7.39       3.63         37.71
   4         54.60       5.13        169.02
   6        403.43       6.63        757.48
   8       2980.96       8.13       3394.80

Each +1 difference in LnBODY is associated with a +.75 difference in LnBRAIN^.
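The table's pattern follows from the fitted coefficients; an illustrative check (Python, using the estimates from the SAS output above):

```python
import math

# Fitted log-log model from the SAS output:
# LnBRAIN-hat = 2.13479 + 0.75169 * LnBODY
b0, b1 = 2.13479, 0.75169

def ln_brain_hat(body):
    """Predicted ln(brain weight) for a given body weight."""
    return b0 + b1 * math.log(body)

# Each +1 step in LnBODY (an e-fold increase in body weight) adds b1 = 0.75
# to LnBRAIN-hat, i.e. multiplies predicted brain weight by e**0.75:
step = ln_brain_hat(math.e ** 2) - ln_brain_hat(math.e ** 1)
print(round(step, 3))            # -> 0.752
print(round(math.exp(step), 2))  # multiplier for predicted brain weight per e-fold body step
```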
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 29
The proportional growth model: The regression of log(Y) on log(X)
The model: Y = β0 · X^β1 · ε

At X = x1:        Y1 = β0 · x1^β1
At X = 1.01·x1:   Y2 = β0 · (1.01·x1)^β1 = (1.01)^β1 · β0 · x1^β1 = (1.01)^β1 · Y1

So Y2 is (1.01)^β1 times larger than Y1: approximately a β1% change in Y per 1% change in X. Economists call β1 an elasticity.

How do we solve for this in practice? Use 2 key properties of logs:
1. log(xy) = log(x) + log(y)
2. log(x^p) = p·log(x)

Taking natural logs of both sides:

    loge(Y) = loge(β0 · X^β1 · ε)
    loge(Y) = loge(β0) + β1·loge(X) + loge(ε)

So just regress loge(Y) on loge(X) and the slope estimate provides the estimated percentage difference in Y per 1% difference in X.
• A 1% difference in body weight is positively associated with a 0.75% difference in brain weight.
• For every 1% difference in body weight, animal brains differ by about ¾ of a percent.
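The elasticity reading can be confirmed numerically; an illustrative sketch (Python; the body-weight value is arbitrary, since only the ratio matters):

```python
import math

# Elasticity check: in a log-log model, the slope b1 is (approximately)
# the percent change in Y per 1% change in X.
b1 = 0.75169  # LnBRAIN-on-LnBODY slope from the SAS output

body0 = 100.0         # arbitrary illustrative body weight
body1 = body0 * 1.01  # a 1% heavier body

# Change in ln(BRAIN-hat) = b1 * change in ln(BODY):
d_ln = b1 * (math.log(body1) - math.log(body0))
pct = 100 * (math.exp(d_ln) - 1)
print(round(pct, 2))  # -> 0.75: about three-quarters of a percent more brain
```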
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 30
Who has the biggest brain?
[Plot from the log-log regression with labeled species: Human, Rhesus Monkey, Baboon, Owl Monkey, Water Opossum]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 31
Sidebar: Why couldn’t we just take the ratio of BrainWt to BodyWt?
Obs SPECIES BRAIN BODY RATIO
 1 Africanelephant           5712.00  6654.00   0.8584
 2 Cow                        423.00   465.00   0.9097
 3 Pig                        180.00   192.00   0.9375
 4 Braziliantapir             169.00   160.00   1.0563
 5 Wateropossum                 3.90     3.50   1.1143
 6 Horse                      655.00   521.00   1.2572
 7 Giraffe                    680.00   529.00   1.2854
 8 Giantarmadillo              81.00    60.00   1.3500
 9 Jaguar                     157.00   100.00   1.5700
10 Kangaroo                    56.00    35.00   1.6000
11 Asianelephant             4603.00  2547.00   1.8072
12 Okapi                      490.00   250.00   1.9600
13 Gorilla                    406.00   207.00   1.9614
14 Donkey                     419.00   187.10   2.2394
15 Tenrec                       2.60     0.90   2.8889
. . .
47 Vervet                      58.00     4.19  13.8425
48 Chinchilla                   6.40     0.43  15.0588
49 Easter.mericanmole           1.20     0.08  16.0000
50 Rockhyrax(Heterob)          12.30     0.75  16.4000
51 Starnosedmole                1.00     0.06  16.6667
52 Baboon                     179.50    10.55  17.0142
53 Mouse                        0.40     0.02  17.3913
54 Man                       1320.00    62.00  21.2903
55 Treeshrew                    2.50     0.10  24.0385
56 Molerat                      3.00     0.12  24.5902
57 Littlebrownbat               0.25     0.01  25.0000
58 Galago                       5.00     0.20  25.0000
59 Rhesusmonkey               179.00     6.80  26.3235
60 Lessershort-tailedshrew      0.14     0.01  28.0000
61 Owlmonkey                   15.50     0.48  32.2917
62 Groundsquirrel               4.00     0.10  39.6040
RATIO = BRAIN / BODY

Smallest brains? Biggest brains? Middling brains?
[Labeled in the plot: Man, African Elephant, Ground squirrel, Owl monkey, Baboon, Mouse]
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 32
Review: How to fit and interpret models using log-transformed variables
Learning Curve
  Regress Y on log2(X):  Y^ = β0^ + β1^·log2(X)
  Every doubling of X (a 100% difference) is associated with a β1^ difference in Y.

Exponential Growth Model
  Regress loge(Y) on X:  loge(Y)^ = β0^ + β1^·X
  Every 1 unit difference in X is associated with a 100(e^β1^ − 1) % difference in Y (often interpreted as a %age growth rate).

Proportional Growth Model
  Regress loge(Y) on loge(X):  loge(Y)^ = β0^ + β1^·loge(X)
  Every 1% difference in X is associated with a β1^ % difference in Y.
Helpful mnemonic device: If you’ve logarithmically transformed a variable, you’ll be modifying the interpretation of an effect by expressing differences for that variable in percentage, not unit, terms
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 33
Another helpful mnemonic:Mosteller and Tukey’s “Rule of the Bulge”
Up in Y (e.g., Y²)        Down in Y (e.g., log(Y))
Up in X (e.g., X²)        Down in X (e.g., log(X))
Two more important ideas about transformation:
• It’s usually “low cost” to transform X, potentially “higher cost” to transform Y
• If the range of a variable is very large, taking logarithms often helps
[Photos: Fred Mosteller, John Tukey]
Broadly speaking, there are four general shapes that a monotonic nonlinear relationship might take. (We’ll learn about the quadratic shape in Unit 10.)

MCAS/Brain                OECD

If you think of this display as representing plots of Y vs. X, identify the curve that most closely matches your data (and theory, hopefully) and you can linearize the relationship by choosing transformations of X, Y, or both that go in the “direction of the bulge.”
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 34
What’s the big takeaway from this unit?
• Check your assumptions
  – Regression is a very powerful statistical technique, but it’s built on a set of assumptions
  – Before accepting a set of regression results, you should examine the assumptions to make sure they’re tenable
  – A high R² or small p-value cannot tell you whether your assumptions hold
  – Plot your data and plot your residuals
• Many relationships are nonlinear
  – We often begin by assuming linearity, but we often find that the underlying relationship is nonlinear
  – Transformation makes it easy to fit nonlinear models using linear regression techniques
  – Models expressed using transformed variables can be easily interpreted
• Regression as statistical control
  – We often want to do more than just summarize the relationship between variables
  – Regression provides a straightforward strategy that allows us to statistically control for the effects of a predictor and see what’s “left over”
  – Residuals can be easily interpreted as “controlled observations”
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 35
Appendix: Annotated PC-SAS Code for transforming variables
data one;
  infile 'm:\datasets\MCAS.txt';
  input ID 1-2 District $ 4-22 Home 24-29 PctAdv 33-34;
  L2Home = log2(home);
Note that the handouts include only annotations for the new additional SAS code. For the complete program, check program “Unit 5—MCAS analysis” on the website.
*-------------------------------------------------------------*
 Fitting OLS regression model L2PPE on GDP
 Plotting studentized residuals on GDP
*-------------------------------------------------------------*;
proc reg data=one;
  model L2PPE=GDP;
  plot student.*GDP;
  output out=resdat2 r=residual student=student;
  id country;
proc univariate data=resdat2 plot;
  var student;
  id country;

*-----------------------------------------------------------*
 Create new natural log transformation of outcome PPE: Ln(PPE)
*-----------------------------------------------------------*;
data one;
  set one;
  LnPPE = Log(PPE);
“Unit 5—OECD analysis”
The data step can include additional statements to create new variables by transforming variables already included in the data set.

To add log base 2 transformations of variables in the sample, use the following syntax: Newvar = log2(oldvar);

Different transformations can be used, including natural logs (log(var)), squared and cubic versions (var**2 or var**3), inverses (-1/var), and roots (var**.5).
The data step can also be repeated in the middle of the program to add additional new variables to the original data set. Note that you can keep the data set’s original name by using the same name in both the set and data statements.
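For readers not working in SAS, the same transformations look like this in plain Python (an illustrative sketch only; the variable names mirror the SAS code, and the record values are taken from the listings purely for demonstration):

```python
import math

# One illustrative record; the transformations parallel the SAS data step.
record = {"Home": 341.75, "PPE": 6.0737}

record["L2Home"]   = math.log2(record["Home"])   # SAS: L2Home = log2(home);
record["LnPPE"]    = math.log(record["PPE"])     # SAS: LnPPE = Log(PPE);
record["HomeSq"]   = record["Home"] ** 2         # SAS: home**2
record["NegInv"]   = -1 / record["Home"]         # SAS: -1/home
record["RootHome"] = record["Home"] ** 0.5       # SAS: home**.5

print(round(record["L2Home"], 5))  # -> 8.4168, matching the MCAS listing's 8.41680
```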
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 36
Understanding the effects of transformation in the OECD data
Ln(PPE) = 0.69315*log2(PPE)
rLnPPE, L2PPE = 1.00
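The perfect correlation is just the change-of-base rule; a quick illustrative check (Python; the PPE value is Switzerland's, from the listing):

```python
import math

# Change of base: ln(x) = ln(2) * log2(x), and ln(2) = 0.69315...,
# which is why LnPPE and L2PPE are perfectly correlated (r = 1.00):
# one is simply a constant multiple of the other.
x = 23.7141  # Switzerland's PPE, from the listing
print(round(math.log(2), 5))  # -> 0.69315
print(round(abs(math.log(x) - math.log(2) * math.log2(x)), 10))  # -> 0.0
```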
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 37
Appendix: Why you shouldn’t rely solely on R2 statistics to select models
Regression of Y on X:
  Y^ = 83.0729 + 0.0073(X)       R² = 31.6%, t(slope) = 3.26**

Regression of Y on log2(X):
  Y^ = 25.78 + 7.34·log2(X)      R² = 30.9%, t(slope) = 3.21**

ID     X     log2(X)        Y
 1     45     5.4919    110.000
 2     56     5.8074     55.574
 3     96     6.5850     59.762
 4    136     7.0875     65.318
 5    176     7.4594     76.433
 6    216     7.7549     90.033
 7    256     8.0000     41.970
 8    296     8.2095    101.890
 9    336     8.3923     98.228
10    376     8.5546     89.939
11    416     8.7004     50.914
12    456     8.8329     99.551
13    496     8.9542     73.437
14    536     9.0661    133.139
15    576     9.1699     92.485
16    616     9.2668    116.767
17    656     9.3576    108.697
18    696     9.4429    102.030
19   1500    10.5507     80.000
20   1800    10.8138    100.000
21   2000    10.9658     80.000
22   3000    11.5507    100.000
23   4000    11.9658    130.000
24   6000    12.5507    120.000
25   8000    12.9658    140.000

A good model is a model in which your assumptions appear tenable.
© Judith D. Singer, Harvard Graduate School of Education Unit 5/Slide 38
Glossary terms included in Unit 5
• Logarithms• Rule of the bulge• Transformation