EPIB 698D Lecture 6
Raul Cruz-Cano, Spring 2013

Page 1: EPIB 698D Lecture 6

EPIB 698D Lecture 6

Raul Cruz-Cano, Spring 2013

Page 2: EPIB 698D Lecture 6

DETERMINING NORMALITY (OR LACK THEREOF)

• One of the first steps in test selection should be investigating the distribution of the data.

• PROC UNIVARIATE can be implemented to determine whether or not your data are normal.

– If the population from which the data are obtained is normal, the mean and median should be equal or close to equal.

– The skewness coefficient, which is a measure of symmetry, should be near zero. Positive values for the skewness coefficient indicate that the data are right skewed, and negative values indicate that the data are left skewed.

– The kurtosis coefficient, which measures the heaviness of the tails relative to a normal distribution, should also be near zero. Positive values for the kurtosis coefficient indicate that the distribution of the data is steeper (more peaked) than a normal distribution, and negative values indicate that the distribution is flatter than a normal distribution.
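The sign conventions above can be sketched outside SAS as well. Below is a hedged Python illustration using simulated data; scipy's `skew` and `kurtosis` are close analogues of the coefficients PROC UNIVARIATE reports, although SAS applies a slightly different bias correction.

```python
# Illustration of the skewness/kurtosis rule of thumb on simulated data.
# NOTE: scipy's definitions differ slightly from PROC UNIVARIATE's
# bias-corrected coefficients, but the sign interpretation is the same.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=100, scale=10, size=5000)
right_skewed = rng.exponential(scale=10, size=5000)  # right-skewed by design

# For (approximately) normal data both coefficients are near zero.
print("normal:      skew =", round(skew(normal_sample), 2),
      " excess kurtosis =", round(kurtosis(normal_sample), 2))

# An exponential sample has clearly positive skewness.
print("exponential: skew =", round(skew(right_skewed), 2))
```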

Page 3: EPIB 698D Lecture 6

DETERMINING NORMALITY (OR LACK THEREOF)

• The NORMAL option in PROC UNIVARIATE produces a table with tests for normality:

– the Shapiro-Wilk statistic and the EDF goodness-of-fit tests (the Kolmogorov D, Anderson-Darling, and Cramér-von Mises statistics).

– In general, if the p-values are less than 0.05, then the data should be considered non-normally distributed.

– However, it is important to remember that these tests are heavily dependent on sample size. Strikingly non-normal data may have a p-value greater than 0.05 due to a small sample size. Therefore, graphical representations of the data should always be examined.

• The PLOTS option in PROC UNIVARIATE creates low-resolution stem-and-leaf, box, and normal probability plots.

– The stem-and-leaf plot is used to visualize the overall distribution of the data, and the box plot is a graphical representation of the 5-number summary.

– The normal probability plot is designed to investigate whether a variable is normally distributed. If the data are normal, then the plot should display a straight diagonal line. Different departures from the straight diagonal line indicate different types of departures from normality.
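As a hedged cross-check of the decision rule (this is not SAS output), the Shapiro-Wilk test is also available in Python's scipy, and a small example shows the sample-size dependence warned about above:

```python
# Shapiro-Wilk test: p < 0.05 suggests non-normality; p >= 0.05 does NOT
# prove normality, especially for small samples.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

big_nonnormal = rng.exponential(scale=1.0, size=500)
small_nonnormal = big_nonnormal[:8]  # same distribution, tiny sample

w_big, p_big = shapiro(big_nonnormal)
w_small, p_small = shapiro(small_nonnormal)

print(f"n=500: W = {w_big:.3f}, p = {p_big:.4f}")    # clearly rejects
print(f"n=8:   W = {w_small:.3f}, p = {p_small:.4f}")  # may well fail to reject
```

This is why the graphical checks matter: with n = 8 the test has very little power even against a strongly skewed distribution.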

Page 4: EPIB 698D Lecture 6

DETERMINING NORMALITY

• The HISTOGRAM statement in PROC UNIVARIATE will produce high-resolution histograms.

• PROC UNIVARIATE is an invaluable tool in visualizing and summarizing data in order to gain an understanding of the underlying populations from which the data are obtained. To produce these results, the following code can be used:

PROC UNIVARIATE data=datafile normal plots;
   histogram;
   var variable1 variable2 ... variablen;
RUN;

• The determination of the normality of the data should result from evaluation of the graphical output in conjunction with the numerical output.

• In addition, the user might wish to look at subsets of the data; for example, a CLASS statement might be used to stratify by gender.

Page 5: EPIB 698D Lecture 6

Normality Test: Box Plot

DATA relieftime;
   INPUT relief;
   DATALINES;
90
93
93
99
98
100
103
104
99
102
;
PROC UNIVARIATE DATA = relieftime normal plot;
   VAR relief;
   histogram relief / midpoints = 80 to 120 by 5 normal;
RUN;

One-sample t-test?

When used in conjunction with the NORMAL option, the histogram will have a line indicating the shape of a normal distribution with the same mean and variance as the sample.

Page 6: EPIB 698D Lecture 6

Tests for Normality

• The histogram shows most observations falling at the peak of the normal curve.

• The box plot shows that the mean falls on the median (*--+--*), indicating the data are not skewed.

• The formal tests of normality in the output are non-significant, indicating these data come from a normal distribution.

• We can assume the data are normally distributed and proceed with the one-sample t-test.
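For completeness, here is a hedged Python sketch of the one-sample t-test itself, using the relief times from the slide. The hypothesized mean of 100 is an assumed value chosen purely for illustration, since the slide does not state the null hypothesis.

```python
# One-sample t-test on the relief times from the slide.
# ASSUMPTION: H0: mu = 100 is chosen here only for illustration.
from scipy.stats import ttest_1samp

relief = [90, 93, 93, 99, 98, 100, 103, 104, 99, 102]

t_stat, p_value = ttest_1samp(relief, popmean=100)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```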

Page 7: EPIB 698D Lecture 6

Normality Test: Box Plot

DATA study;
   INPUT before after;
   DATALINES;
90 95
87 92
100 104
80 89
95 101
90 105
;
PROC UNIVARIATE DATA = study normal plot;
   VAR before after;
   histogram before after / normal;
RUN;

Paired t-test?

Page 8: EPIB 698D Lecture 6

Tests for Normality

• There are so few data points that the histograms are difficult to interpret.

• The box plots for before and after both show the mean very close to the median, suggesting the data are not skewed.

• The tests of normality for before and after have p-values > alpha, indicating we do not reject the assumption of normality.

• We can proceed with the matched pairs t-test.
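The matched-pairs test can be sketched in Python on the before/after values from the slide (a hedged cross-check, not the SAS output):

```python
# Paired t-test on the before/after blood-pressure study data
# from the slide: tests whether the mean difference is zero.
from scipy.stats import ttest_rel

before = [90, 87, 100, 80, 95, 90]
after  = [95, 92, 104, 89, 101, 105]

t_stat, p_value = ttest_rel(after, before)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```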

Page 9: EPIB 698D Lecture 6

Tests for Normality

DATA response;
   INPUT group $ time;
   DATALINES;
c 80
c 93
c 83
c 89
c 98
t 100
t 103
t 104
t 99
t 102
;
PROC UNIVARIATE DATA = response normal plot;
   class group;
   var time;
   histogram time / midpoints = 80 to 120 by 5 normal;
RUN;

Two-sample t-test?

Page 10: EPIB 698D Lecture 6

A few notes:

• The code has specified that the univariate procedure be performed on the variable time, but that it is done by the class “group.” This way you will have separate summary statistics, plots and histograms for the treatment and control groups.

Page 11: EPIB 698D Lecture 6

Tests for Normality

• The tests for normality for both the treatment and control groups are non-significant (p-value > alpha), indicating we can assume they come from a normal distribution.

• Because each group only has 5 subjects, the histograms are difficult to interpret, but there is no indication of non-normality.

• Proceed with the two-sample t-test.
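As a hedged Python cross-check of the two-sample comparison, using the group data from the slide (scipy's default is the pooled, equal-variance form of the test):

```python
# Two-sample (pooled-variance) t-test: control vs. treatment times.
from scipy.stats import ttest_ind

control   = [80, 93, 83, 89, 98]
treatment = [100, 103, 104, 99, 102]

t_stat, p_value = ttest_ind(control, treatment)  # equal_var=True by default
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```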

Page 12: EPIB 698D Lecture 6

Histograms for control and treatment groups

[Figure: side-by-side histograms of time (percent on the vertical axis, midpoints 80 to 120) for the control group (c) and the treatment group (t).]

Page 13: EPIB 698D Lecture 6

Another Example

• A semiconductor manufacturer produces printed circuit boards that are sampled to determine the thickness of their copper plating.

• The following statements create a data set named Trans, which contains the plating thicknesses (Thick) of 100 boards:

data Trans;
   input Thick @@;
   label Thick = 'Plating Thickness (mils)';
   datalines;
3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512
3.490 3.467 3.498 3.519 3.504 3.469 3.497 3.495 3.518 3.523
3.458 3.478 3.443 3.500 3.449 3.525 3.461 3.489 3.514 3.470
3.561 3.506 3.444 3.479 3.524 3.531 3.501 3.495 3.443 3.458
3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516 3.476
3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520
3.505 3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491
3.510 3.461 3.431 3.502 3.491 3.506 3.439 3.513 3.496 3.539
3.469 3.481 3.515 3.535 3.460 3.575 3.488 3.515 3.484 3.482
3.517 3.483 3.467 3.467 3.502 3.471 3.516 3.474 3.500 3.466
;
run;

Page 14: EPIB 698D Lecture 6

Example

title 'Analysis of Plating Thickness';
proc univariate data=Trans;
   histogram Thick / normal(percents=20 40 60 80 midpercents) name='MyPlot';
run;

Page 15: EPIB 698D Lecture 6

Q-Q Plots

• The following properties of Q-Q plots and probability plots make them useful diagnostics of how well a specified theoretical distribution fits a set of measurements:

– If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x.

– If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b. The slope a and intercept b are visual estimates of the scale and location parameters of the theoretical distribution.

• Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the x-axis of a Q-Q plot is scaled linearly.

• On the other hand, probability plots are more convenient for estimating percentiles or probabilities.
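The slope/intercept reading of a Q-Q plot can be checked numerically. This is a sketch with simulated data, not SAS output: scipy's `probplot` computes the same theoretical-vs-ordered quantiles and returns the fitted least-squares line, whose slope and intercept estimate the scale and location.

```python
# Normal Q-Q plot computed numerically: for N(mu, sigma) data the
# fitted line's slope estimates sigma and its intercept estimates mu.
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.4, scale=0.2, size=50)  # like the diameter data

(theoretical_q, ordered_data), (slope, intercept, r) = probplot(sample, dist="norm")
print(f"slope ~ sigma: {slope:.3f}, intercept ~ mu: {intercept:.3f}, r: {r:.3f}")
```

A correlation r near 1 means the points hug the line, i.e., little departure from normality.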

Page 16: EPIB 698D Lecture 6

Q-Q plots Example

• Data set Measures, which contains the measurements of the diameters of 50 steel rods in the variable Diameter:

data Measures;
   input Diameter @@;
   label Diameter = 'Diameter (mm)';
   datalines;
5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177
5.332 5.399 5.661 5.512 5.252 5.404 5.739 5.525 5.160 5.410
5.823 5.376 5.202 5.470 5.410 5.394 5.146 5.244 5.309 5.480
5.388 5.399 5.360 5.368 5.394 5.248 5.409 5.304 6.239 5.781
5.247 5.907 5.208 5.143 5.304 5.603 5.164 5.209 5.475 5.223
;
run;

Page 17: EPIB 698D Lecture 6

Q-Q plots Example

symbol v=plus;
title 'Normal Q-Q Plot for Diameters';
proc univariate data=Measures noprint;
   qqplot Diameter / normal square vaxis=axis1;
   axis1 label=(a=90 r=0);
run;

Page 18: EPIB 698D Lecture 6

Probability Plots

• The PROBPLOT statement creates a probability plot, which compares ordered variable values with the percentiles of a specified theoretical distribution. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Consequently, you can use a probability plot to determine how well a theoretical distribution models a set of measurements.

• Probability plots are similar to Q-Q plots, which you can create with the QQPLOT statement. Probability plots are preferable for graphical estimation of percentiles, whereas Q-Q plots are preferable for graphical estimation of distribution parameters.

Page 19: EPIB 698D Lecture 6

Probability Plot Example

proc univariate data=Measures;
   probplot Length1 Length2 / normal(mu=10 sigma=0.3) square ctext=blue;
run;

You can check against other distributions: lognormal, gamma, beta, etc.

Page 20: EPIB 698D Lecture 6

Collinearity

• When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors.

• This problem is called collinearity or multicollinearity.

• It is a good idea to find out which variables are nearly collinear with which other variables.

• Consequences of high multicollinearity:

– Increased standard errors of the estimates of the β's (decreased reliability).

– Often confusing and misleading results.

• The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.

Page 21: EPIB 698D Lecture 6

Collinearity

• The COLLIN option in the MODEL statement requests that a collinearity analysis be performed.

• Belsley, Kuh, and Welsch (1980) suggest that, when the condition index is around 10, weak dependencies might be starting to affect the regression estimates. When the condition index is larger than 100, the estimates might have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).

• For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

Page 22: EPIB 698D Lecture 6

Collinearity

• The VIF option in the MODEL statement provides the variance inflation factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is large enough to affect the predicted values.

• The TOL option requests the tolerance values (the reciprocals of the VIFs) for the parameter estimates.
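A hedged Python sketch of what VIF measures (this is not PROC REG itself): each VIF equals 1/(1 - R²), where R² comes from regressing that regressor on all the others; equivalently, the VIFs are the diagonal elements of the inverse correlation matrix of the regressors.

```python
# Computing VIFs by hand as the diagonal of the inverse correlation
# matrix, on simulated data with a deliberately collinear regressor.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # nearly x1 + x2

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

# All three VIFs are huge here because x3 is almost a linear
# combination of x1 and x2.
print(np.round(vif, 1))
```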

Page 23: EPIB 698D Lecture 6

Example: Collinearity

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182
40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168
42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180
47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180
43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176
38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168
45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176
47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170
49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172
51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164
49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176
54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166
50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172
54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188
57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188
48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
run;

Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen-consumption measurements.

Page 24: EPIB 698D Lecture 6

Example: Collinearity

proc reg data=fitness;
   model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse / tol vif collin;
run;

In the output, look for dangerously high VIFs, look at the rows with large condition numbers (CN), and see which predictors have a large proportion of variance in each such row.

Page 25: EPIB 698D Lecture 6

Solution

• Check correlations, then redefine variables:

– Remove or average redundant ones.

• Variable selection, model re-specification:

– Use forward or backward selection in the regression analysis.

Page 26: EPIB 698D Lecture 6

Let’s revisit the examples from previous classes…

Page 27: EPIB 698D Lecture 6

data blood;
   INFILE 'F:\blood.txt';
   INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

data blood1;
   set blood;
   if gender='Female' then sex=1; else sex=0;
   if bloodtype='A' then typeA=1; else typeA=0;
   if bloodtype='B' then typeB=1; else typeB=0;
   if bloodtype='AB' then typeAB=1; else typeAB=0;
   if age_group='Old' then Age_old=1; else Age_old=0;
run;

Check normality and collinearity of RBC, WBC, and cholesterol:

PROC UNIVARIATE DATA = blood1 normal plot;
   class gender;
   var RBC;
   histogram RBC / normal;
   qqplot RBC / normal square;
RUN;

proc reg DATA = blood1;
   model cholesterol = RBC WBC / vif collin;
run;

Page 28: EPIB 698D Lecture 6

title 'Paired Comparison';
data pressure;
   input SBPbefore SBPafter @@;
   diff_BP = SBPafter - SBPbefore;
   datalines;
120 128 124 131 130 131 118 127
140 132 128 125 140 141 135 137
126 118 130 132 126 129 127 135
;
run;

data paired;
   input lossa lossj;
   diff = lossa - lossj;
   datalines;
+4 -8
+3 -10
0 -12
-3 -16
-4 -18
-5 -20
-11 -21
-14 -24
-15 -26
-300 -30
;
run;

Check normality and collinearity of SBPbefore & SBPafter and lossa & lossj.

Page 29: EPIB 698D Lecture 6

proc ttest data = "c:\hsb2";
   class female;
   var write;
run;

data nonparametric;
   input loss diet $;
   datalines;
+4 atkins
+3 atkins
0 atkins
-3 atkins
-4 atkins
-5 atkins
-11 atkins
-14 atkins
-15 atkins
-300 atkins
-8 jenny
-10 jenny
-12 jenny
-16 jenny
-18 jenny
-20 jenny
-21 jenny
-24 jenny
-26 jenny
-30 jenny
;
run;

Check normality of write and loss.

Page 30: EPIB 698D Lecture 6

data Clover;
   input Strain $ Nitrogen @@;
   datalines;
3DOK1  19.4 3DOK1  32.6 3DOK1  27.0 3DOK1  32.1 3DOK1  33.0
3DOK5  17.7 3DOK5  24.8 3DOK5  27.9 3DOK5  25.2 3DOK5  24.3
3DOK4  17.0 3DOK4  19.4 3DOK4   9.1 3DOK4  11.9 3DOK4  15.8
3DOK7  20.7 3DOK7  21.0 3DOK7  20.5 3DOK7  18.8 3DOK7  18.6
3DOK13 14.3 3DOK13 14.4 3DOK13 11.8 3DOK13 11.6 3DOK13 14.2
COMPOS 17.3 COMPOS 19.4 COMPOS 19.1 COMPOS 16.9 COMPOS 20.8
;
run;

Check normality of Nitrogen.
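A hypothetical Python mirror of this per-strain check (data copied from the slide): a Shapiro-Wilk test within each strain, analogous to using a CLASS statement in PROC UNIVARIATE. With only five observations per strain the tests have very little power, so the plots matter most.

```python
# Per-group normality check on the Clover data from the slide.
from scipy.stats import shapiro

nitrogen = {
    "3DOK1":  [19.4, 32.6, 27.0, 32.1, 33.0],
    "3DOK5":  [17.7, 24.8, 27.9, 25.2, 24.3],
    "3DOK4":  [17.0, 19.4, 9.1, 11.9, 15.8],
    "3DOK7":  [20.7, 21.0, 20.5, 18.8, 18.6],
    "3DOK13": [14.3, 14.4, 11.8, 11.6, 14.2],
    "COMPOS": [17.3, 19.4, 19.1, 16.9, 20.8],
}

for strain, values in nitrogen.items():
    stat, p = shapiro(values)
    print(f"{strain}: W = {stat:.3f}, p = {p:.3f}")
```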