Chapter 1: Looking at Data--Distributions Section 1.1: …ghobbs/STAT_301/Chapter1.pdf ·  ·...


Chapter 1: Looking at Data--Distributions

Section 1.1: Introduction, Displaying Distributions with Graphs

Section 1.2: Describing Distributions with Numbers

Learning goals for this chapter:

Identify categorical and quantitative variables.

Interpret, create (by hand and with SPSS), and know when to use: bar graphs, pie charts, stemplots (standard, back-to-back, split), histograms, and boxplots (regular, modified, side-by-side).

Describe the shape, center, and spread of data distributions.

Define, calculate (by hand and with SPSS), and know when to use measures of center (mean vs. median) and spread (range, 5-number summary, IQR, variance, standard deviation).

Understand what a resistant measure of center and spread is and when this is important.

Use the 1.5 × IQR rule to look for outliers.

Draw a Normal curve in correct proportions and identify the mean/median, standard deviation, middle 68%, middle 95%, and middle 99.7%.

Perform calculations with the empirical rule, both backwards and forwards.

Understand the need for standardization.

Big picture: what do we learn in this chapter?

Individuals vs. Variables

Categorical vs. Quantitative Variables

Graphs:

Bar graphs and pie charts (categorical variables)

Histograms and stemplots (quantitative variables; good for checking for symmetry and skewness)

Boxplots (quantitative variables; graphical display of the 5-number summary; modified boxplots show outliers)

Describing distributions:

Shape (symmetric/skewed, unimodal/bimodal/multimodal)

Center (mean or median)

Spread (usually standard deviation/variance, or IQR from the 5-number summary)

Outliers

If you have a symmetric distribution with no outliers, use the mean and standard deviation.

If you have a skewed distribution and/or outliers, use the 5-number summary instead.


Two components in describing data or information:

Individuals: the objects described by a set of data (people, households, cars, animals, corn, etc.)

Variables: characteristics of individuals (height, yield, length, age, eye color, etc.)

Categorical: places an individual into one of several groups (gender, eye color, college major, hometown, etc.)

Quantitative: takes numerical values for which adding or averaging makes sense (height, weight, age, income, yield, etc.)

Distribution of a variable: describes what values the variable takes and how often it takes those values.

If you have more than one variable in your problem, look at each variable by itself before you look at relationships between the variables.

Example: Identify whether the following questions would give you categorical or

quantitative data.

a) What letter grade did you get in your Calculus class last semester?

b) What was your score on the last exam?

c) Who will you vote for in the next election?

d) How many votes did George W. Bush get?

e) How many red M&Ms are in this bag?

f) Which type of M&Ms has more red ones: peanut or plain?

It’s always a good idea to start by displaying variables graphically before you do any

other statistical analysis. What kind of graph should you use? That depends on

whether you have a categorical or quantitative variable.

Categorical Variables:

Bar graphs or pie charts

Messy room example: In a poll of 200 parents of children ages 6 to 12,

respondents were asked to name the most disgusting things ever found in their

children’s rooms. The results are below (J&C 2005)

Most disgusting thing                              # of parents   % of parents
Food-related                                       106            53%
Animal and insect-related nuisances                22             11%
Clothing (dirty socks and underwear especially)    22             11%
Other                                              50             25%

Bar graph (can use either # of parents, as here, or % of parents):

[Figure: bar graph of Count (0 to 120) by type of disgusting mess (animal, clothing, food, other); cases weighted by # of parents]

Pie chart (needs % of parents):

[Figure: pie chart of type of disgusting mess: animal 11.0%, clothing 11.0%, food 53.0%, other 25.0%; cases weighted by # of parents]
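As a rough illustration of how the counts in the messy room poll turn into the percentages a pie chart needs, here is a small Python sketch. The course itself uses SPSS; the variable names and the text-based bar display are assumptions made only for this example.

```python
# Counts from the messy room poll (200 parents in total).
counts = {
    "food": 106,
    "animal": 22,
    "clothing": 22,
    "other": 50,
}

total = sum(counts.values())

# A pie chart needs each category's share of the whole.
percents = {mess: 100 * n / total for mess, n in counts.items()}

# Crude text bar graph: one '#' per 5 parents (scale chosen for display only).
for mess, n in counts.items():
    bar = "#" * (n // 5)
    print(f"{mess:<9} {bar:<22} {n:>4} {percents[mess]:5.1f}%")
```

Because the categories are not numbers, the bars could be listed in any order without changing what the graph says.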


Quantitative Variables:

Stemplots, histograms, and boxplots (discussed a little later)

Example: You investigate the amount of time students spend online (in minutes).

You study 28 students, and their times are listed below. Show the distribution of

times with a stemplot.

7 20 24 25 25 28 28 30 32 35

42 43 44 45 46 47 48 48 50 51

72 75 77 78 79 83 87 88

To create a stemplot by hand,

1. Put the data in order from smallest to largest.

2. The "stem" will be all digits for a data point except for the last one. Write the stems in a vertical line. (Think of "7" as being "07" so that all the numbers have a digit in the tens place.)

3. The "leaf" will be the next digit (in this case, the ones place) from each data point. Write the leaves after the appropriate stem, in increasing order.

4. It is possible to "trim" any digits that you feel may be unnecessary. For example, if our second data point had been 20.3, we would probably choose to ignore the ".3" for the purposes of the stemplot so that we could create a more reasonable stemplot. If we did not ignore this ".3", then our stems would have been 07, 08, 09, 10, 11, 12, 13, …, 88 with the tenths digits as our leaves. This would show a very flat stemplot in which most stems are empty and no stem has more than two leaves (all leaves would be 0 except for the 3). This would not be helpful to us at all. It makes much more sense to use the tens place for the stem and the ones place as the leaves in this example.

Stemplot

0 | 7
1 |
2 | 0 4 5 5 8 8
3 | 0 2 5
4 | 2 3 4 5 6 7 8 8
5 | 0 1
6 |
7 | 2 5 7 8 9
8 | 3 7 8

A split stemplot just has more stems. There are several ways to split the stems. Here they are split by fives.

0 | 7
1 |
1 |
2 | 0 4
2 | 5 5 8 8
3 | 0 2
3 | 5
4 | 2 3 4
4 | 5 6 7 8 8
5 | 0 1
5 |
6 |
6 |
7 | 2
7 | 5 7 8 9
8 | 3
8 | 7 8
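The by-hand stemplot steps can also be sketched in Python. This is only an illustration, not part of the course software: the function name `stemplot` and its `split` option are hypothetical, and it assumes non-negative data with the tens place as the stem and the ones place as the leaf.

```python
from collections import defaultdict

# Online times (minutes) for the 28 students.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def stemplot(data, split=False):
    """Return stemplot lines (stem = tens, leaf = ones) for non-negative data."""
    values = sorted(int(x) for x in data)      # step 1: order; int() trims decimals
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)             # steps 2 and 3: split off the last digit
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(values[0] // 10, values[-1] // 10 + 1):
        for half in halves:                    # empty stems are kept so gaps show
            leaves = " ".join(str(leaf) for leaf in rows[(stem, half)])
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines

print("\n".join(stemplot(times)))
print()
print("\n".join(stemplot(times, split=True)))
```

Unlike the hand-drawn split stemplot, this sketch always prints both halves of every stem, so it shows an empty `0 |` row for leaves 0 to 4 before the `0 | 7` row.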


Why do we need split stemplots? Sometimes it is easier to see the shape of the data with

more stems. Sometimes a regular stemplot is better. If you’re not sure, try it both ways

and see if a pattern appears.

Try a stemplot and a split stemplot with this data (use the hundreds place for stems):

3, 4, 17, 18, 39, 93, 102, 110, 143, 178, 250, 278, 299, 300.1

Histograms

Sorting the quantitative data into bins. How many bins?

Not too many bins with either 0 or 1 counts.

Not overly summarized so that you lose all the information

Not so detailed that it is no longer a summary

[Figure: three example histograms labeled "Too few bins", "OK", and "Too many bins"]
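To see how the choice of bin width changes a histogram, the sketch below counts the 28 online times into bins of three different widths. The helper `bin_counts` and the particular widths are assumptions chosen for illustration; they are not a rule from the course.

```python
from collections import Counter

# Online times (minutes) for the 28 students.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def bin_counts(data, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter((x // width) * width for x in data)
    lo, hi = min(counts), max(counts)
    # Include empty bins so that gaps in the data stay visible.
    return {start: counts.get(start, 0) for start in range(lo, hi + width, width)}

for width in (50, 10, 2):
    counts = bin_counts(times, width)
    sparse = sum(1 for c in counts.values() if c <= 1)
    print(f"width {width:>2}: {len(counts)} bins, {sparse} with 0 or 1 counts")
```

A width of 50 collapses everything into two bins, while a width of 2 leaves most bins with 0 or 1 counts; a width of 10 gives the same grouping as the stemplot stems.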


Histograms vs. bar graphs:

Histograms: The bars for each interval touch each other.
Bar graphs: The bars for each category do not touch each other. There are spaces between the bars.

Histograms: Have a continuous, quantitative x-axis, with the x-values in order.
Bar graphs: Can have the categories on the x-axis listed in any order (alphabetical, biggest-to-smallest, etc.).

Histograms: Quantitative variables.
Bar graphs: Categorical variables.

Histograms vs. stemplots:

Histograms: Quantitative variables.
Stemplots: Quantitative variables.

Histograms: Good for big data sets, especially if technology is available.
Stemplots: Good for small data sets, convenient for back-of-the-envelope calculations. Rarely found in scientific or lay publications.

Histograms: Use a box to represent each data point.
Stemplots: Use a digit to represent each data point.


You’ve drawn your graph (histogram or stemplot). Now what?

Look for overall pattern and any outliers.

The pattern is described by shape, center, and spread.

1. Shape:

- Number of peaks (unimodal = 1, bimodal = 2, multimodal > 2)

- Where the long tail is:

Symmetric: Median = Mean

Right skewed (long tail on the right): Median < Mean

Left skewed (long tail on the left): Median > Mean

To describe the shape, use a histogram with a smoothed curve highlighting the overall pattern of the distribution (don’t get overly detailed).

2. Center: (If the distribution is symmetric, the mean will equal the median, but

otherwise these numbers are not the same.)

a) Mean: arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

where n = the total # of observations

and $x_i$ = an individual observation

b) Mode: the most common number, biggest peak


c) Median: M, midpoint of the distribution such that ½ the observations are smaller and ½ the observations are larger. The median is not as affected by outliers as the mean is; the median is resistant to outliers.

To find the median:

i. Order the data from smallest to largest.

ii. Count the # of observations (n).

iii. Calculate $\frac{n+1}{2}$ to find the position of the center of the data set.

iv. If n is odd, M is the data point at the center of the data set.

v. If n is even, $\frac{n+1}{2}$ falls between 2 data points, called the "middle pair." M = the average of the middle pair.

Examples of center:

Find the mean and median of the following 7 numbers in Dataset A:

23 25 32.5 33 67 1 -20

Find the mean and median of the following 8 numbers in Dataset B:

1 2 4 6 8 9 12 13
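One way to check your by-hand answers for Datasets A and B is the short Python sketch below. The function names are made up for this example, and `median` simply follows steps i through v above.

```python
dataset_a = [23, 25, 32.5, 33, 67, 1, -20]
dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]

def mean(xs):
    """Arithmetic average: add the observations and divide by n."""
    return sum(xs) / len(xs)

def median(xs):
    ordered = sorted(xs)               # step i: order the data
    n = len(ordered)                   # step ii: count the observations
    center = (n + 1) / 2               # step iii: position of the center (1-based)
    if n % 2 == 1:                     # step iv: odd n, take the center point
        return ordered[int(center) - 1]
    lower = ordered[n // 2 - 1]        # step v: even n, average the middle pair
    upper = ordered[n // 2]
    return (lower + upper) / 2

print("A:", round(mean(dataset_a), 2), median(dataset_a))
print("B:", mean(dataset_b), median(dataset_b))
```

For Dataset A the center position is (7 + 1)/2 = 4, so the median is the 4th ordered value; for Dataset B it is 4.5, so the median is the average of the 4th and 5th ordered values.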

3. Spread:

a) Range = max – min (simplest, not always the most helpful)

b) Variance: $s^2$, average of the squared deviations of the observations from the mean

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

c) Standard Deviation: s, square root of the variance, common way for

measuring how far observations are from the mean

Example of finding the standard deviation by hand: 0, 2, 4

1. Calculate the mean.

2. Calculate the variance.

3. Take the square root of the variance.
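The three steps for the data 0, 2, 4 can be traced in a few lines of Python. This is only a sketch for checking hand work; the variable names are illustrative.

```python
from math import sqrt

data = [0, 2, 4]
n = len(data)

# Step 1: the mean.
x_bar = sum(data) / n                              # (0 + 2 + 4) / 3 = 2.0

# Step 2: the variance, dividing the sum of squared deviations by n - 1.
squared_devs = [(x - x_bar) ** 2 for x in data]    # [4.0, 0.0, 4.0]
variance = sum(squared_devs) / (n - 1)             # 8 / 2 = 4.0

# Step 3: the standard deviation is the square root of the variance.
sd = sqrt(variance)                                # 2.0

print(x_bar, variance, sd)
```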


d) pth percentile: value such that p% of the observations fall at or below it

Median = M = 50th percentile

First Quartile = Q1 = 25th percentile

Third Quartile = Q3 = 75th percentile

How do you find quartiles? Think of them as "mini-medians." Leave the median out, and then find the median of what is left over on the left side (Q1) and what is left over on the right side (Q3). (When n is even, the median is not one of the data points, so nothing is left out: just split the ordered data into a lower half and an upper half.)

Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A:

-20 (Min)   1   23   25 (M)   32.5   33   67 (Max)

Find the 1st and 3rd quartiles of the following 8 numbers in Dataset B:

1 (Min)   2   4   6   8   9   12   13 (Max)

M = 7 (the average of the middle pair, 6 and 8)

e) 5-Number Summary: Min Q1 M Q3 Max

f) Interquartile Range (IQR) = Q3 – Q1

Call an observation a suspected outlier if it is:

> Q3 + 1.5 IQR

OR

< Q1 – 1.5 IQR

g) Boxplots: Graph of the 5-number summary

In a modified boxplot, the lines extend from the box out to the smallest and

largest observations which are NOT outliers. Dots mark any outliers.

(We will always ask for the modified boxplot, but if there are no outliers,

the modified and regular boxplots look exactly the same.)


For the online time example (with 2 additional data points added in), list the 5-number

summary, find any outliers present, and show a boxplot and modified boxplot.

7 20 24 25 25 28 28 30 32 35

42 43 44 45 46 47 48 48 50 51

72 75 77 78 79 83 87 88 135 151

Boxplot for Dataset A with 5-

number summary:

-20, 1, 25, 33, 67

Since there were no outliers in

this dataset, a regular boxplot and

a modified boxplot look exactly

the same for this data.


How do you know which method is best for determining center and spread?

5-Number Summary: better for skewed distributions or distribution with outliers

Mean and Standard Deviation: good for reasonably symmetric distributions free of

outliers.

Always start with a graph!

In the internet time example, here is how the mean/standard deviation and 5-number summary are affected by the outlier:

                      With outlier (151)      With outlier removed from dataset
Mean                  54.77                   51.45
Standard Deviation    32.647                  27.600
5-number summary      7, 30, 46.5, 77, 151    7, 29, 46, 76, 135
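The comparison for the internet time data can be reproduced with a short Python sketch. It uses the standard library's `statistics.mean` and `statistics.stdev` (the sample standard deviation, with the n - 1 divisor); the helper names `five_number_summary` and `describe` are assumptions made for this example, and the quartiles follow the by-hand method from this class.

```python
from statistics import mean, stdev

# The 30 online times, including the two added data points (135 and 151).
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88, 135, 151]

def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    ordered = sorted(xs)
    n = len(ordered)
    lower = ordered[:n // 2]             # left of the median
    upper = ordered[n // 2 + n % 2:]     # right of the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def describe(xs):
    return round(mean(xs), 2), round(stdev(xs), 3), five_number_summary(xs)

with_outlier = describe(times)
without_outlier = describe([t for t in times if t != 151])

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

Removing one observation moves the mean by more than 3 minutes and the standard deviation by about 5, while the median and quartiles each move by at most 1. That is what "resistant" means in practice.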

"The Median vs. the Mean in the Age of Average" by Mike Pesca on NPR’s Day-to-Day, 7/19/06: http://www.npr.org/templates/story/story.php?storyId=5567890

Do you always have to do all of this by hand? NO!

Statistical software packages like SPSS can make life much easier for you, but it’s a good

idea to know how to do these by hand so you can make sense of your output. Also, on

the exam, you won’t have access to a computer.

Read over your SPSS manual and get comfortable with using SPSS. You will have a

chance to practice on the HW for this week, and you will work on it in lab on Friday.

Enter your data, then Analyze--> Descriptive Statistics--> Explore. Follow the

instructions on p. 48 of the SPSS manual.

The output from SPSS for the internet time problem looks like:

Descriptives: Time spent on the web

                                                Statistic   Std. Error
Mean                                            54.77       5.961
95% Confidence Interval for Mean, Lower Bound   42.58
95% Confidence Interval for Mean, Upper Bound   66.96
5% Trimmed Mean                                 52.13
Median                                          46.50
Variance                                        1065.840
Std. Deviation                                  32.647
Minimum                                         7
Maximum                                         151
Range                                           144
Interquartile Range                             48
Skewness                                        1.314       .427
Kurtosis                                        1.977       .833


Notice on the boxplot, it is easy to identify the potential outlier. This would be your

indication that the 5-number summary would be the best way to describe your data. (You

could also try calculating the mean and standard deviation without the outlier for

comparison.)

SPSS can also give you the Quartiles (listed under "Percentiles"), but these are not necessarily the same answers as what you would get by hand. The "weighted average" and "Tukey’s Hinges" are not the same method we use. For this class, whenever we ask you to calculate the Quartiles, we want you to do them by hand.

Stem-and-Leaf Plot

Frequency Stem & Leaf

1.00 0 . 0

9.00 0 . 222222333

10.00 0 . 4444444455

5.00 0 . 77777

3.00 0 . 888

.00 1 .

1.00 1 . 3

1.00 Extremes (>=151)

Stem width: 100

Each leaf: 1 case(s)

[Figure: Histogram of time spent on the web; x-axis: Time spent on the web (0 to 150); y-axis: Frequency (0 to 10); Mean = 54.77, Std. Dev. = 32.647, N = 30]


What if you want to compare the results from two or more different groups? Use

side-by-side boxplots or back-to-back stemplots for your graphs.

 Female | Stem | Male
      9 |  2   |
    8 1 |  3   |
      6 |  4   |
        |  5   | 0
  3 3 0 |  6   | 8 8
8 1 1 0 |  7   | 0 8
  6 5 2 |  8   | 3 4 5 9
  9 9 9 |  9   | 2 2 4 5 6


Preview of Section 1.3 (from Section 1.3)

A z-score tells us how many standard deviations away from the mean an observation is.

$z = \frac{x - \mu}{\sigma}$ (μ = mean, σ = standard deviation)

This is also called getting a standardized value.

Why is standardization useful? For comparing apples to oranges.

Example: (p. 88, Problem 1.99) Jacob scores 16 on the ACT. Emily scores 670 on the

SAT. Assuming that both tests measure scholastic aptitude, who has the higher score?

The SAT scores for 1.4 million students in a recent graduating class were roughly

normal with a mean of 1026 and standard deviation of 209. The ACT scores for more

than 1 million students in the same class were roughly normal with mean of 20.8 and

standard deviation of 4.8.
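A quick way to work this comparison is to standardize both scores, as in the Python sketch below. The function name `z_score` is just an illustrative choice; the means and standard deviations are the ones given in the problem.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_jacob = z_score(16, mean=20.8, sd=4.8)      # ACT
z_emily = z_score(670, mean=1026, sd=209)     # SAT

print(f"Jacob: z = {z_jacob:.2f}")
print(f"Emily: z = {z_emily:.2f}")

# The larger z-score is the better score relative to that test's distribution.
higher = "Jacob" if z_jacob > z_emily else "Emily"
print(higher, "has the higher standardized score")
```

Both scores are below their test's mean, but Jacob's is about 1 standard deviation below while Emily's is about 1.7 standard deviations below, so Jacob has the higher score once both are put on the same scale.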


How else can we use standardization? If the distribution of observations has a bell-

shape, then these standardized values have some special properties. One of these is the

68-95-99.7% Empirical Rule.

Approximately 68% of the observations fall within 1σ of the mean (between μ - 1σ and μ + 1σ).

Approximately 95% of the observations fall within 2σ of the mean (between μ - 2σ and μ + 2σ).

Approximately 99.7% of the observations fall within 3σ of the mean (between μ - 3σ and μ + 3σ).

The most famous bell-shaped distribution is the Normal distribution. We will spend

several lectures talking about it for Section 1.3, and it will be important to everything we

do for the rest of the semester.

[Figure: bell-shaped curve with the horizontal axis marked in standard deviations away from the mean]

The horizontal axis shows standard deviations away from the mean (z-score), so a z-score of -2 could also be written as -2σ, for example.

The mean and the median of a bell-shaped curve are in the middle. This is shown with a 0 because the mean is 0 standard deviations away from itself.

P(μ - 1σ < X < μ + 1σ) = 0.68

P(μ - 2σ < X < μ + 2σ) = 0.95

P(μ - 3σ < X < μ + 3σ) = 0.997


Example: Checking account balances are approximately Normally distributed with a

mean of $1325 and a standard deviation of $25.

a) Between what numbers do 68% of the balances fall?

b) Above what number do 2.5% of the balances lie?

c) Approximately what percent of balances are between 1250 and 1400?
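The three parts of the checking account example can be worked with the empirical rule alone; the Python sketch below does that arithmetic and then cross-checks part (c) against an exact Normal calculation using the standard library's `statistics.NormalDist`. The helper name `within` is an assumption made for this example.

```python
from statistics import NormalDist

mu, sigma = 1325, 25    # mean and standard deviation of the balances

def within(k):
    """Interval covering k standard deviations on either side of the mean."""
    return (mu - k * sigma, mu + k * sigma)

# a) About 68% of balances fall within 1 standard deviation of the mean.
low, high = within(1)                    # (1300, 1350)

# b) About 95% fall within 2 standard deviations, leaving 5% in the two
#    tails, so about 2.5% lie above mu + 2 * sigma.
cutoff = mu + 2 * sigma                  # 1375

# c) Express 1250 and 1400 as standard deviations away from the mean.
k_low = (1250 - mu) / sigma              # -3.0
k_high = (1400 - mu) / sigma             # 3.0, so about 99.7% by the rule

# Cross-check of (c) with the exact Normal distribution.
balances = NormalDist(mu, sigma)
exact_c = balances.cdf(1400) - balances.cdf(1250)

print(low, high, cutoff, k_low, k_high, round(exact_c, 4))
```

Part (b) is the "backwards" direction: it starts from a percentage and works back to a balance, using the symmetry of the curve to split the leftover 5% evenly between the two tails.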