Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 5- 1.

Chapter 5: Describing Distributions Numerically

Finding the Center: The Median

When we think of a typical value, we usually look for the center of the distribution.

For a unimodal, symmetric distribution, it’s easy to find the center—it’s just the center of symmetry.

Finding the Center: The Median (cont.)

As a measure of center, the midrange (the average of the minimum and maximum values) is very sensitive to skewed distributions and outliers.

The median is a more reasonable choice for center than the midrange.

Finding the Center: The Median (cont.)

The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.

It has the same units as the data.
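The definition of the median can be sketched in a few lines of Python. This is a minimal illustration, not a method from the text: the function name `median` and the data values are made up, and averaging the two middle values for an even count is a common convention assumed here.

```python
def median(values):
    """Return the middle value of the data once it has been ordered."""
    ordered = sorted(values)            # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]             # odd count: the single middle value
    # Even count: average the two middle values (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [22, 13, 19, 47, 17]             # hypothetical values, deliberately unsorted
print(median(ages))                     # 19
print(median([1, 2, 3, 4]))             # 2.5
```

Two of the five hypothetical values lie below 19 and two lie above it, which is what "half below and half above" means for an odd count.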

Percentiles

A value at the 27th percentile has 27% of the data below it. So if you have a score of 120 on a given standardized exam and you are in the 27th percentile, then 27% of the other exam takers scored below you.

The first quartile is the same as the 25th percentile. The median is the same as the 50th percentile and the second quartile. The third quartile is the same as the 75th percentile.
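The "percent below" idea can be checked directly. The sketch below is only an illustration: `percent_below` is a hypothetical helper, the scores are invented, and counting only values strictly below the score is an assumption (some definitions also count ties).

```python
def percent_below(scores, score):
    """Percentage of the data values that fall strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)


# Hypothetical exam scores, not data from the text.
scores = [90, 100, 105, 110, 115, 120, 125, 130, 140, 150]
print(percent_below(scores, 120))       # 50.0
```

Here 5 of the 10 made-up scores are below 120, so in this small data set a score of 120 sits at the 50th percentile, which is the median.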

Spread: Home on the Range

Always report a measure of spread along with a measure of center when describing a distribution numerically.

The range of the data is the difference between the maximum and minimum values:

Range = max – min

A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.
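A short sketch shows how one extreme value changes the range. The helper name `data_range` and both data sets are assumptions made up for illustration.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)


typical = [13, 17, 19, 22, 24]          # hypothetical values
with_extreme = typical + [47]           # same data plus one extreme value

print(data_range(typical))              # 11
print(data_range(with_extreme))         # 34
```

Adding a single value roughly triples the range even though the other five values are unchanged.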

Spread: The Interquartile Range

The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data.

To find the IQR, we first need to know what quartiles are…

Spread: The Interquartile Range (cont.)

Quartiles divide the data into four equal sections.

The lower quartile is the median of the half of the data below the median.

The upper quartile is the median of the half of the data above the median.

The difference between the quartiles is the IQR, so IQR = upper quartile – lower quartile
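The "median of each half" rule can be sketched as follows. The function name `quartiles` and the data are hypothetical. One detail is an assumption: when the number of values is odd, the median itself is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartile as the medians of the two halves of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for an odd count, skip the median when forming the upper half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(upper_half)


data = [13, 15, 17, 18, 19, 20, 22, 30, 47]   # hypothetical values
q1, q3 = quartiles(data)
print(q1, q3)                                  # 16.0 26.0
print("IQR =", q3 - q1)                        # IQR = 10.0
```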

Spread: The Interquartile Range (cont.)

The lower and upper quartiles are the 25th and 75th percentiles of the data, so…

The IQR contains the middle 50% of the values of the distribution, as shown in Figure 5.3 from the text:

The Five-Number Summary

The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).

Example: The five-number summary for the ages at death for rock concert goers who died from being crushed is

Max     47 years
Q3      22
Median  19
Q1      17
Min     13
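A five-number summary can be computed with a short function. This is a sketch under stated assumptions: `five_number_summary` is a hypothetical name, the quartiles use the median-of-halves rule with the median excluded for an odd count, and the `ages` list is made up so that its summary matches the numbers above. It is not the actual data set.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, median, Q3, and max of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for an odd count, the median is excluded from both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(lower_half),
        "median": median(ordered),
        "q3": median(upper_half),
        "max": ordered[-1],
    }


# Made-up ages chosen to reproduce the summary shown above.
ages = [13, 17, 18, 19, 21, 22, 47]
summary = five_number_summary(ages)
for name, value in summary.items():
    print(name, value)
```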

Rock Concert Deaths: Making Boxplots

A boxplot is a graphical display of the five-number summary.

Boxplots are particularly useful when comparing groups.

Constructing Boxplots

1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.

Constructing Boxplots (cont.)

2. Erect “fences” around the main part of the data.

The upper fence is 1.5 IQRs above the upper quartile.

The lower fence is 1.5 IQRs below the lower quartile.

Note: the fences only help with constructing the boxplot and should not appear in the final display.

Constructing Boxplots (cont.)

3. Use the fences to grow “whiskers.”

Draw lines from the ends of the box up and down to the most extreme data values found within the fences.

If a data value falls outside one of the fences, we do not connect it with a whisker.

Constructing Boxplots (cont.)

4. Add the outliers by displaying any data values beyond the fences with special symbols.

We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.
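Steps 2 through 4 can be sketched numerically without drawing anything. The function `boxplot_parts` and the `ages` list are hypothetical. The quartiles 17 and 22 are the ones reported in the five-number summary for the rock concert data, but the individual ages are made up.

```python
def boxplot_parts(values, q1, q3):
    """Compute fences, whisker ends, outliers, and far outliers for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr        # 1.5 IQRs below the lower quartile
    upper_fence = q3 + 1.5 * iqr        # 1.5 IQRs above the upper quartile
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    # "Far outliers" are farther than 3 IQRs from the quartiles.
    far = [v for v in outliers if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return {
        "fences": (lower_fence, upper_fence),
        # Whiskers reach the most extreme values found within the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "far_outliers": far,
    }


ages = [13, 17, 18, 19, 21, 22, 47]     # made-up ages
parts = boxplot_parts(ages, q1=17, q3=22)
print(parts["fences"])                  # (9.5, 29.5)
print(parts["whiskers"])                # (13, 22)
print(parts["outliers"])                # [47]
print(parts["far_outliers"])            # [47]
```

With an IQR of 5, the fences sit at 9.5 and 29.5. The value 47 lies beyond the upper fence, and because it is also above 22 + 3(5) = 37, it counts as a far outlier.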

Rock Concert Deaths: Making Boxplots (cont.)

Compare the histogram and boxplot for rock concert deaths:

How does each display represent the distribution?

Comparing Groups With Boxplots

The following set of boxplots compares the effectiveness of various coffee containers:

What does this graphical display tell you?

Summarizing Symmetric Distributions

Medians do a good job of identifying the center of skewed distributions.

When we have symmetric data, the mean is a good measure of center.

We find the mean by adding up all of the data values and dividing by n, the number of data values we have.

Summarizing Symmetric Distributions (cont.)

The distribution of pulse rates for 52 adults is generally symmetric, with a mean of 72.7 beats per minute (bpm) and a median of 73 bpm:

The Formula for Averaging

The formula for the mean is given by

$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$

The formula says that to find the mean, we add up the numbers and divide by n.
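The formula translates directly into code. The function name `mean` and the pulse values below are hypothetical; they are not the 52 pulse rates from the text.

```python
def mean(values):
    """Add up the data values and divide by n, the number of values."""
    total = sum(values)
    n = len(values)
    return total / n


pulses = [68, 72, 73, 75, 77]           # hypothetical pulse rates in bpm
print(mean(pulses))                     # 73.0
```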

Mean or Median?

Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance:

Mean or Median? (cont.)

In symmetric distributions, the mean and median are approximately the same in value, so either measure of center may be used.

For skewed data, though, it’s better to report the median than the mean as a measure of center.
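The difference between the two measures shows up in a small comparison. Both data sets are made up for illustration; the second is the first with its largest value stretched far to the right.

```python
from statistics import mean, median

symmetric = [60, 65, 70, 75, 80]        # hypothetical, symmetric about 70
skewed = [60, 65, 70, 75, 180]          # same values with a long right tail

print(mean(symmetric), median(symmetric))   # mean and median agree at 70
print(mean(skewed), median(skewed))         # mean is pulled to 90, median stays 70
```

The median ignores how far the largest value sits from the rest, so it still describes the bulk of the skewed data, while the mean is pulled toward the tail.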

What About Spread? The Standard Deviation

A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.

A deviation is the signed difference between a data value and the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the squared deviations.
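A quick check shows why the deviations have to be squared. The values are hypothetical and were chosen so that the mean is a whole number.

```python
values = [68, 72, 73, 75, 77]               # hypothetical data
mean = sum(values) / len(values)            # 73.0

deviations = [y - mean for y in values]     # [-5.0, -1.0, 0.0, 2.0, 4.0]
squared = [d ** 2 for d in deviations]      # [25.0, 1.0, 0.0, 4.0, 16.0]

print(sum(deviations))                      # 0.0: positives and negatives cancel
print(sum(squared))                         # 46.0: squaring removes the cancellation
```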

What About Spread? The Standard Deviation (cont.)

The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them:

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!

What About Spread? The Standard Deviation (cont.)

The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.

$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
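The variance and standard deviation can be computed step by step and compared with Python's built-in `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The function names `sample_variance` and `sample_sd` and the data are assumptions for illustration.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))


values = [68, 72, 73, 75, 77]               # hypothetical data
print(sample_variance(values))              # 11.5
print(sample_sd(values))                    # about 3.39
print(statistics.stdev(values))             # same value from the standard library
```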

Thinking About Variation

Since Statistics is about variation, spread is an important fundamental concept of Statistics.

Measures of spread help us talk about what we don’t know.

When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small.

When the data values are scattered far from the center, the IQR and standard deviation will be large.

Shape, Center, and Spread

When describing a quantitative variable, always report the shape of its distribution, along with a center and a spread.

If the shape is skewed, report the median and IQR.

If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

What About Outliers?

If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

Note: The median and IQR are not likely to be affected by the outliers.
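Reporting summaries with and without an outlier might look like the sketch below. The data are made up, and treating 47 as the outlier is an assumption for this example.

```python
from statistics import mean, median, stdev

with_outlier = [13, 17, 18, 19, 21, 22, 47]             # hypothetical data
without_outlier = [v for v in with_outlier if v != 47]  # outlier removed

for label, data in [("with", with_outlier), ("without", without_outlier)]:
    print(label,
          "mean:", round(mean(data), 2),
          "sd:", round(stdev(data), 2),
          "median:", median(data))
```

In this made-up example the mean drops by about 4 when the outlier is removed and the standard deviation shrinks sharply, while the median moves only from 19 to 18.5.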

What Can Go Wrong?

Don’t forget to do a reality check—don’t let technology do your thinking for you.

Don’t forget to sort the values before finding the median or percentiles.

Don’t compute numerical summaries of a categorical variable.

Watch out for multiple modes—multiple modes might indicate multiple groups in your data.
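The warning about sorting is easy to demonstrate. In this sketch the values are hypothetical, and picking the element at the middle position stands in for "finding the median" of an odd-sized data set.

```python
values = [22, 13, 47, 17, 19]           # hypothetical data, not yet sorted
middle_index = len(values) // 2

wrong = values[middle_index]            # middle position of the unsorted list
right = sorted(values)[middle_index]    # middle position after sorting

print(wrong)                            # 47, which is actually the maximum
print(right)                            # 19, the median
```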

What Can Go Wrong? (cont.)

Be aware of slightly different methods—different statistics packages and calculators may give you different answers for the same data.

Beware of outliers. Make a picture (make a picture, make a picture).
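The point about slightly different methods can be seen within a single tool. Python's `statistics.quantiles` offers two interpolation methods, `"exclusive"` (the default) and `"inclusive"`. The sketch below applies both to the same made-up data; it only illustrates that conventions differ and says nothing about which one a particular package or calculator uses.

```python
from statistics import quantiles

data = [13, 17, 18, 19, 21, 22, 47]     # hypothetical data

# Each call returns the three cut points [Q1, median, Q3].
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)
print("inclusive:", inclusive)
print("same quartiles?", exclusive == inclusive)
```

Both methods agree on the median for this data set but give different values for Q1 and Q3, so the IQR differs as well.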

What Can Go Wrong? (cont.)

Be careful when comparing groups that have very different spreads. Consider these side-by-side boxplots of cotinine levels:

*Re-expressing to Equalize the Spread of Groups

Here are the side-by-side boxplots of the log(cotinine) values:
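The effect of taking logs can be sketched with invented numbers. The two groups below are not the cotinine data; the second group is simply 100 times the first, which makes the comparison easy to follow. The helper `iqr` is a hypothetical name and relies on the default method of `statistics.quantiles`.

```python
import math
from statistics import quantiles


def iqr(values):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Made-up groups: the second is the first multiplied by 100.
low_group = [1, 2, 3, 4, 5, 6, 8]
high_group = [100, 200, 300, 400, 500, 600, 800]

print(iqr(low_group), iqr(high_group))          # spreads differ by a factor of 100

log_low = [math.log10(v) for v in low_group]
log_high = [math.log10(v) for v in high_group]
print(iqr(log_low), iqr(log_high))              # spreads are now about equal
```

On the log scale, multiplying by 100 becomes adding 2, so the second group is just a shifted copy of the first and the two boxes would have the same height.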

What have we learned?

We can now summarize distributions of quantitative variables numerically.

The 5-number summary displays the min, Q1, median, Q3, and max.

Measures of center include the mean and median.

Measures of spread include the range, IQR, and standard deviation.

We know which measures to use for symmetric distributions and skewed distributions.

What have we learned? (cont.)

We can also display distributions with boxplots.

While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.

Boxplots are useful for comparing groups.