Chapter 1: Exploring Data 1.1 – Displaying Distributions with Graphs

Chapter 1: Exploring Data

1.1 – Displaying Distributions with Graphs

Types of Graphs:

Categorical: Bar Chart, Pie Chart

Quantitative: Dotplot, Histogram, Stemplot, Ogive, Time Plot

Bar graph: Displays categorical variables

How to construct a bar graph:

Step 1: Label your axes and title graph

Step 2: Scale your axes

Step 3: Leave spaces between bars

[Figure: bar graph template with a title, categories on the horizontal axis, and count on the vertical axis]

Side-by-Side bar graph:

Compares two variables of one individual

[Figure: side-by-side bar graph template with a title, a legend, categories on the horizontal axis, and count on the vertical axis]

Example #1 The table shows results of a poll asking adults whether they were looking forward to the Super Bowl game, the commercials, or didn’t plan to watch.

              Male   Female   Total

Game          279    200      479

Commercials   81     156      237

Won’t Watch   132    160      292

Total         492    516      1008

Construct a side-by-side bar chart for their preference based on gender. Note any trends that appear.

[Figure: side-by-side bar graph titled "Reason Looking Forward to Super Bowl"; bars for Game, Commercials, and Won’t Watch grouped by Male and Female; vertical axis from 50 to 300]

Males overwhelmingly watch the Super Bowl for the game, whereas females seem mixed as to why they want to watch it.
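Because the two groups differ in size (492 males, 516 females), the trend is easier to check with percentages than with raw counts. The following is a minimal Python sketch of that check; the variable and function names are illustrative and do not come from the slides.

```python
# Counts copied from the Example #1 table.
counts = {
    "Male":   {"Game": 279, "Commercials": 81,  "Won't Watch": 132},
    "Female": {"Game": 200, "Commercials": 156, "Won't Watch": 160},
}

def column_percentages(table):
    """Return each group's counts as percentages of that group's total."""
    result = {}
    for group, reasons in table.items():
        total = sum(reasons.values())
        result[group] = {r: round(100 * c / total, 1) for r, c in reasons.items()}
    return result

percentages = column_percentages(counts)
for group, reasons in percentages.items():
    print(group, reasons)
```

About 57% of males chose the game, compared with about 39% of females, whose answers are spread more evenly across the three categories.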

Describing Quantitative Distributions:

When describing a graph, use CUSS: Center, Unusual points, Shape, Spread.

C - Center

Mean: Average value. Add up the values, then divide by the number of values.

Mode: Most frequent number. There can be many modes.

Median: Number in the center when the data are lined up in order.

Calculator Tip: To calculate mean and median

Stat – Edit – type in data – exit

Stat – CALC – 1-Var Stats – L1

U - Unusual points

Any data points that stand out as different. Don’t call them outliers yet!

S - Shape

Symmetric: Fold in half, it matches up

Bell/Normal: Special case, don’t say yet!

Uniform: All the same frequencies

Unimodal: One peak in the data

Bimodal: Two peaks in the data

Gaps: Space between the data

Cluster: Several data points grouped together

Skewed Right: The tail (and any unusual points) stretches out to the right

Skewed Left: The tail (and any unusual points) stretches out to the left

S - Spread

Range: Distance between largest and smallest values. Range = Maximum - Minimum

Homogeneous: Data is all in a similar space (small spread)

Dotplot: Dots are used to keep count of the frequency of each number

How to construct a dotplot:

Step 1: Label your axis and title your graph.

Step 2: Mark a dot above the corresponding value

[Figure: dotplot template with a title above a number line spanning the range of the data]

Example #2 The data below give the number of hurricanes classified as major hurricanes in the Atlantic Ocean each year from 1944 through 2006, as reported by NOAA.

3 2 1 2 4 3 7 2 3 3 2 5 2 2 4 2 2

6 0 2 5 1 3 1 0 3 2 1 0 1 2 3 2 1

2 2 2 3 1 1 1 3 0 1 3 2 1 2 1 1 0

5 6 1 3 5 3 3 2 3 6 7 2 6 8

a. Make a dotplot of the data.

[Figure: dotplot titled "Number of Hurricanes Classified as a Major Hurricane (1944-2006)"; horizontal axis from 0 to 8]

b. Describe what you see in a few sentences.
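The dotplot steps can also be sketched in code. The Python example below tallies the Example #2 values and prints a sideways, text-only dotplot (one `*` per observation). It is only an illustration; the function name `text_dotplot` is made up for this sketch.

```python
from collections import Counter

# Values copied from the Example #2 data rows.
hurricanes = [
    3, 2, 1, 2, 4, 3, 7, 2, 3, 3, 2, 5, 2, 2, 4, 2, 2,
    6, 0, 2, 5, 1, 3, 1, 0, 3, 2, 1, 0, 1, 2, 3, 2, 1,
    2, 2, 2, 3, 1, 1, 1, 3, 0, 1, 3, 2, 1, 2, 1, 1, 0,
    5, 6, 1, 3, 5, 3, 3, 2, 3, 6, 7, 2, 6, 8,
]

tally = Counter(hurricanes)

def text_dotplot(counts):
    """Build one line per value, with one '*' per observation of that value."""
    lines = []
    for value in range(min(counts), max(counts) + 1):
        lines.append(f"{value} | " + "*" * counts[value])
    return lines

for line in text_dotplot(tally):
    print(line)
```

The longest row sits at 2, and the rows thin out toward 7 and 8, which is the kind of feature part b asks you to describe.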

• A dotplot is a simple display. It just places a dot along an axis for each case in the data.

• The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot.

• You might see a dotplot displayed horizontally or vertically.

Guidelines for constructing Stemplots (stem and leaf)

1. Put data in order from smallest to largest

2. Separate each value into a STEM and a LEAF. The leaf is a single digit and it is the rightmost digit of the number. The stem will consist of everything else to the left of the leaf.

3. Stems go in a vertical column from small to large and a vertical line is drawn to the right of the stems

4. Leaves are written to the right of their stems from small to large.

Back-to-Back Stemplots: To compare two different sets of data

Split Stemplots: To spread out the data to see more trends if they are grouped together. Leaves will split from 0-4 and 5-9.

Example #3 The data below give the amount of caffeine content (in milligrams) for an 8-ounce serving of popular soft drinks.

20 15 23 29 23 15 23 31 28 35 37 27 24 26 47 28 24 28 28

16 38 36 35 37 27 33 37 25 47 27 29 26 43 43 28 35 31 25

a. Construct a stemplot.

b. Construct a split stemplot.

c. Differences?

a. Caffeine per 8oz of soda

1 | 5 5 6
2 | 0 3 3 3 4 4 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3 5 5 5 6 7 7 7 8
4 | 3 3 7 7

Key: 1 | 5 = 15 mg

b. Caffeine per 8oz of soda (split stems)

1 | 5 5 6
2 | 0 3 3 3 4 4
2 | 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3
3 | 5 5 5 6 7 7 7 8
4 | 3 3
4 | 7 7

Key: 1 | 5 = 15 mg
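The stemplot guidelines above translate fairly directly into code. Here is a rough Python sketch for two-digit whole numbers like the caffeine data. The `stemplot` function and its `split` flag are invented for this illustration, and rows with no leaves are simply left out, whereas a hand-drawn plot would normally still show the empty stem.

```python
from collections import defaultdict

# Values copied from the Example #3 data rows (mg of caffeine per 8 oz).
caffeine = [
    20, 15, 23, 29, 23, 15, 23, 31, 28, 35, 37, 27, 24, 26, 47, 28, 24, 28, 28,
    16, 38, 36, 35, 37, 27, 33, 37, 25, 47, 27, 29, 26, 43, 43, 28, 35, 31, 25,
]

def stemplot(data, split=False):
    """Return stemplot rows; with split=True each stem gets a 0-4 row and a 5-9 row."""
    rows = defaultdict(list)
    for value in sorted(data):            # guideline 1: order the data
        stem, leaf = divmod(value, 10)    # guideline 2: leaf is the last digit
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    # guidelines 3 and 4: stems small to large, leaves small to large
    return [f"{stem} | {' '.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]

print("\n".join(stemplot(caffeine)))
print()
print("\n".join(stemplot(caffeine, split=True)))
```

Comparing the two printouts is one way into part c: the split version breaks the long row of 20s into two rows.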

www.whfreeman.com/tps3e


Most people believe that you need to drink coffee or an energy drink to get a good “buzz” from the caffeine. Below is a table with common caffeine levels of tea, coffee, and energy drinks.

Coffee

133 160 150 103 150 93 150 115 75 75 40

Energy Drink

160 144 100 100 95 83 80 80 80 79 74 50 48

d. Make a back-to-back stemplot. Comment on the difference in caffeine levels between coffee and energy drinks.

 Coffee |    | Energy Drink
      0 |  4 | 8
        |  5 | 0
        |  6 |
    5 5 |  7 | 4 9
        |  8 | 0 0 0 3
      3 |  9 | 5
      3 | 10 | 0 0
      5 | 11 |
        | 12 |
      3 | 13 |
        | 14 | 4
  0 0 0 | 15 |
      0 | 16 | 0

Key: 4 | 8 = 48 mg

http://www.cspinet.org/new/cafchart.htm

Examples of separating a value into a stem and a leaf:

562 → 56 | 2

56.2 → 56 | 2

5.62 → 56 | 2

50 → 5 | 0


Calculator Tip: Sort values from smallest to largest

Stat – Edit – type in data – exit

Stat – SortA – L1

Calculator Tip: Clearing Lists

All Lists: Mem – ClrAllLists – Enter

One List: Stat – Edit – Highlight List name – Clear

Calculator Tip: Deleted a list?

STAT – SetUpEditor – Enter

Calculator Tip: Save a list?

L1 – STO – Any name or Letter

To Retrieve later: 2nd – List

Calculator Tip: Remove a number from list?

Line up number you want to delete, hit DEL

Histogram:

1. Divide the range of data into classes of equal width.

2. Count the number of observations in each class. Ensure no one number falls into two classes

3. Label and scale the axes and title your graph.

4. Draw a bar that represents the count in each class. The base of a bar should cover its class, and the bar height is the class count. Leave no horizontal space between the bars unless the class is empty.

Calculator Tip: Make a histogram. Pg. 59

Stat – Edit – type in data – exit

StatPlot – 1 – On – histogram – L1 – Freq 1

Zoom – ZoomStat (#9)

To adjust the classes: Window

Xmin: Lowest value

Xmax: Highest value

Xscl: Scale on x-axis (width of bars)

Ymin: -0.2 typically

Ymax: Highest frequency (height of bars)

Yscl: Scale on y-axis

Ex. #4: Describe the distribution of the graph.

C: 4-5 words

U: 12 words

S: Unimodal, slight skew right

S: 1 to 12, Range = 11

Example #5: An executive finds that the subscriptions (in millions of people) of the 20 leading American magazines are as follows:

Magazine                    Subscriptions (millions)
Reader’s Digest             17.9
TV Guide                    17.1
National Geographic         10.6
Modern Maturity             9.3
AARP News Bulletin          8.8
Better Homes and Gardens    8
Family Circle               7.2
Woman’s Day                 7
McCall’s                    6.4
Good Housekeeping           5.4
Ladies’ Home Journal        5.3
National Enquirer           4.7
Time                        4.6
Playboy                     4.2
Redbook                     4
The Star                    3.7
Penthouse                   3.5
Newsweek                    3
Cosmopolitan                3
People Weekly               2.8

Make a histogram for the number of subscriptions in intervals of 2 (million) compared to the frequency of that number. Then describe the graph.

Describe the features of the graph in detail.

C: mean = 6.825, median = 5.35

U: 17.1 & 17.9

S: Skewed to the right, unimodal

S: 2.8 to 17.9, range of 15.1

[Figure: histogram titled "Circulation in millions of people of American Magazines"; horizontal axis: Circulation (in millions), 2 to 20; vertical axis: Frequency, 1 to 8]
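The class counts behind this histogram can be tallied with a few lines of Python. The sketch below uses classes of width 2 starting at 2, as the example asks, and treats each class as including its lower edge but not its upper edge, so that no value falls into two classes. The function name and that edge convention are assumptions made for this illustration.

```python
# Subscriptions (in millions) copied from the Example #5 table.
circulation = [17.9, 17.1, 10.6, 9.3, 8.8, 8, 7.2, 7, 6.4, 5.4,
               5.3, 4.7, 4.6, 4.2, 4, 3.7, 3.5, 3, 3, 2.8]

def class_counts(data, start, width, n_classes):
    """Count observations in [start, start+width), [start+width, start+2*width), ..."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)  # which class x belongs to
        counts[index] += 1                 # assumes every x lies inside the classes
    return counts

counts = class_counts(circulation, start=2, width=2, n_classes=9)
for i, c in enumerate(counts):
    low = 2 + 2 * i
    print(f"{low:>2} to <{low + 2:<2} | " + "#" * c)
```

The printed bars pile up on the low end and leave two empty classes before the last pair of values, which matches the skewed-right description and the two unusual points listed above.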

Height of NBA Players

http://bcs.whfreeman.com/tps3e

Page 50: applets: One-variable Statistical calculator

1. How do you determine how many classes to make?

2. When is it good to split the stems on a stemplot?


Types of Graphs, Bar graph, Dotplot, Stemplot, Histogram, Describing a Graph

19 7, 9

4757-58109

3(a&b only)1151

HW

1.1 & 1.2

Day 3

Relative Cumulative Frequency Graph (Ogive):

Shows relative standing of an observation
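The values plotted on an ogive are running totals of the class counts divided by the number of observations. As a small Python sketch, the example below uses class counts for the Example #5 magazine data in width-2 classes starting at 2; those counts were tallied from the table for this illustration and are not given on the slides.

```python
from itertools import accumulate

# Upper edge of each class and the number of magazines in it
# (illustrative input tallied from the Example #5 table).
upper_bounds = [4, 6, 8, 10, 12, 14, 16, 18]
counts = [5, 6, 3, 3, 1, 0, 0, 2]

total = sum(counts)
cumulative = list(accumulate(counts))               # running totals
relative_cumulative = [c / total for c in cumulative]

for bound, share in zip(upper_bounds, relative_cumulative):
    print(f"below {bound}: {share:.0%}")
```

Plotting each share against its upper bound and connecting the points gives the ogive; a flat stretch appears wherever a class is empty.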

Example #6: The President of the United States has to be at least 35 years old and have been born in America. Below is an ogive showing the relative cumulative frequency of the ages at which previous presidents were inaugurated.

a. What percent of presidents were younger than 60? 80%


b. What percent of presidents were between 50 and 55? 30%


c. There is a horizontal line between 35 and 40 years of age. What does that mean?

No presidents were less than 40 years old


d. What is the median age of the current presidents? 55


e. President Obama was 47 when he was inaugurated. What percent of presidents were older than him?

85%

Time Plots: Plots each observation against the time at which it was measured. Always mark the time scale on the horizontal axis and the variable being measured on the y axis.

Trend: A common overall pattern.

Seasonal Variations: A pattern that repeats itself at regular time intervals.

Ex. #7: Identify any trends and describe the time plot.

Seems to fluctuate, peaking in 1983

Chapter 1: Exploring Data

1.2 – Describing Distributions with Numbers

Mean: The average number of a set of data. Add the values in the data set and divide by the number of observations

For n observations,

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \quad \text{or} \quad \bar{x} = \frac{\sum x_i}{n} \]

Ex#8: Find the mean for the two sets of data.

Data set A: 1 1 2 2 3

Data set B: 1 1 2 2 500,000

Data set A: \( \bar{x} = 1.8 \)

Data set B: \( \bar{x} = 100001.2 \)

What happened?

Strongly influenced by unusual values

Variance: Average of the squares of the deviations of the observations from their mean

\[ s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \quad \text{or} \quad s_x^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 \]

Degrees of Freedom: Dividing by n – 1

Standard Deviation: The square root of the variance. Measures the average distance the values are away from the mean.

\[ s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} \]

Calculator Tip: Standard Deviation

1-var stats – L1

Ex#9: Calculate the Standard Deviation by Hand

Data Set: 6, 4, 4, 3, 2, 6, 10

Mean = 5

\[ \sum (x_i - \bar{x})^2 = (6-5)^2 + (4-5)^2 + (4-5)^2 + (3-5)^2 + (2-5)^2 + (6-5)^2 + (10-5)^2 \]

\[ = (1)^2 + (-1)^2 + (-1)^2 + (-2)^2 + (-3)^2 + (1)^2 + (5)^2 = 42 \]

\[ s_x = \sqrt{\frac{1}{7 - 1}(42)} = \sqrt{\frac{42}{6}} = \sqrt{7} \approx 2.64575 \]
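The same hand calculation can be followed step by step in Python and then checked against the standard library's `statistics.stdev`, which also divides by n – 1. This is only a sketch of the arithmetic, and the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

data = [6, 4, 4, 3, 2, 6, 10]

n = len(data)
mean = sum(data) / n                                  # 35 / 7
squared_deviations = [(x - mean) ** 2 for x in data]  # (x_i - mean)^2 for each value
variance = sum(squared_deviations) / (n - 1)          # divide by n - 1, not n
s_x = sqrt(variance)

print(mean, sum(squared_deviations), variance, round(s_x, 5))
print(round(stdev(data), 5))  # library check of the same quantity
```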

Example #10: Using the numbers 1-10, choose 4 numbers so the standard deviation will be the smallest. Then choose 4 numbers so the standard deviation will be the largest. (Repeats are ok)

Smallest: 1, 1, 1, 1 (Sx = 0)

Largest: 1, 1, 10, 10 (Sx = 5.196)

http://www.stat.tamu.edu/~west/ph/stddev.html
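Since there are only 715 ways to pick 4 numbers from 1-10 when repeats are allowed, the answer can be confirmed by trying them all. The following brute-force Python sketch is offered as a check on the reasoning, not as the intended way to solve the problem.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every unordered choice of 4 numbers from 1..10, repeats allowed.
choices = list(combinations_with_replacement(range(1, 11), 4))

smallest = min(choices, key=stdev)  # first choice with the minimum spread
largest = max(choices, key=stdev)   # choice with the maximum spread

print(smallest, stdev(smallest))
print(largest, round(stdev(largest), 3))
```

Any four equal numbers give a standard deviation of 0, so `min` simply reports the first such choice it meets. The largest spread comes from putting two values at each extreme.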

Example #11: Which graph will have the larger standard deviation? Why?

[Figure: five graphs labeled a through e, each with its mean marked]

Properties of the standard deviation and variance:

1. Sensitive to outliers.

2. Some deviations are positive and some are negative (that’s why we square them!). Otherwise, they would add up to zero and tell us nothing about the spread around the mean. Then, to get back to the original units, we take the square root.

3. Standard deviation is always zero or greater, never negative.

4. Values that are very close together have a small standard deviation and those far apart have a large standard deviation.

1.1 & 1.2

Ogives, Time Plot, Mean, Variance, Standard Deviation

64-6989101

13(a&b only), 22, 23, 2639, 4354

Curriculum Night

Day 4 – 1.2

Median: The midpoint or value where half of the data is above the median and half is below the median. (50% mark)

To find the median:

1. Put all the data in order from smallest to largest

2. Cancel off the end data points until you find the middle. If two values remain, the median is their average.

Resistant measure: A good estimate even when there are very unusual values.

Ex#12: Find the median for the two sets of data.

Data set A: 1 1 2 2 3

Data set B: 1 1 2 2 500,000

Data set A: M = 2

Data set B: M = 2

Which one is a resistant measure? Mean or Median?
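A quick way to see which measure is resistant is to compute both for the two data sets, which differ only in their last value. The short Python sketch below uses the standard `statistics` module.

```python
from statistics import mean, median

data_a = [1, 1, 2, 2, 3]
data_b = [1, 1, 2, 2, 500_000]  # same as A except for one extreme value

for name, data in (("A", data_a), ("B", data_b)):
    print(f"Data set {name}: mean = {mean(data)}, median = {median(data)}")
```

The single extreme value drags the mean from 1.8 up to 100001.2 while the median stays at 2, so the median is the resistant measure.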

pth percentile: p percent of the observations fall at or below it

Quartiles:

25th percentile = first quartile = Q1

50th percentile = median = Q2

75th percentile = third quartile = Q3

Five-Number Summary: Min, Q1, M, Q3, Max

Boxplot:

Uses the five-number summary. A box is drawn connecting Q1 and Q3 with a line through the median. Whiskers are drawn to the max and min.

[Figure: boxplot above a number line marking min, Q1, med, Q3, and max; each of the four sections holds 25% of the data]

Interquartile Range: IQR = Q3 – Q1

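Outliers: Data that is far away from the majority of points

To Determine:

Lower Outlier Cutoff: Q1 – 1.5(IQR)

Upper Outlier Cutoff: Q3 + 1.5(IQR)

All values should be between these two numbers. Any value outside them is an outlier.

[Figure: modified boxplot above a number line marking min, Q1, med, Q3, and max, with outliers drawn as asterisks beyond the whiskers]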

Keep in mind, you don’t know how much data is in a boxplot!

Calculator Tip: Boxplots. Pg. 81

Stat – Edit – type data – exit

StatPlot – 1 – on – boxplot (with or without outliers) – L1

Calculator Tip: 5-Number Summary. Pg. 81

Stat – Calc – 1-var Stats – L1

Ex #13: The Fuel Economy of 2004 vehicles is given.

a. Determine the 5-number summary.

13 15 16 16 17 19 20 22 23 23

23 24 25 25 26 28 28 28 29 32

66

Min = 13

Q1 = 18

Med = 23

Q3 = 28

Max = 66

b. Calculate the range and IQR for each data set.

Range = 66 – 13 = 53

IQR = 28 – 18 = 10

c. Make a box plot using the 5-number summary.

[Figure: boxplot on an axis from 10 to 70]

d. Describe the shape, center, and spread.

C: Median = 23

U: 66

S: Skewed Right

S: Range = 53, IQR = 10

e. Are there any potential outliers using the criterion?

Lower: Q1 – 1.5(IQR) = 18 – 1.5(10) = 18 – 15 = 3

Upper: Q3 + 1.5(IQR) = 28 + 1.5(10) = 28 + 15 = 43

Yes, 66 is above 43.

f. Construct a modified boxplot to account for the outlier.

[Figure: modified boxplot on an axis from 10 to 70, with the outlier at 66 marked by an asterisk]
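Parts a, b, and e of this example can be reproduced with a short Python sketch. Software packages use several different quartile conventions; the one assumed here takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, leaving the overall median out of both halves when the count is odd. It was chosen because it reproduces the answers above. The function name is illustrative.

```python
from statistics import median

# Fuel economy values copied from Ex #13.
mpg = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
       23, 24, 25, 25, 26, 28, 28, 28, 29, 32, 66]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    # Skip the middle value itself when the number of observations is odd.
    upper = xs[half + 1:] if len(xs) % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

minimum, q1, med, q3, maximum = five_number_summary(mpg)
iqr = q3 - q1
low_cutoff, high_cutoff = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in mpg if x < low_cutoff or x > high_cutoff]

print(minimum, q1, med, q3, maximum)
print("range:", maximum - minimum, "IQR:", iqr)
print("cutoffs:", low_cutoff, high_cutoff, "outliers:", outliers)
```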

Ozone and Outliers

The 'ozone hole' above Antarctica provides the setting for one of the most infamous outliers in recent history. It is a great story to tell students who wantonly delete outliers from a dataset merely because they are outliers.

In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by some data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal January levels. The puzzle was why the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, hadn't recorded similarly low ozone concentrations. When they examined the data from the satellite, it didn't take long to realize that the satellite was in fact recording these low concentration levels and had been doing so for years. But because the ozone concentrations recorded by the satellite were so low, they were being treated as outliers by a computer program and discarded! The Nimbus 7 satellite had in fact been gathering evidence of low ozone levels since 1976. The damage to our atmosphere caused by chlorofluorocarbons went undetected and untreated for up to nine years because outliers were discarded without being examined.

Moral: Don't just toss out outliers, as they may be the most valuable members of a dataset.

Weight of NBA Players

• Compare the histogram and boxplot for daily wind speeds:

• How does each display represent the distribution?

Matching Histograms and Boxplots

Match each histogram with its boxplot, by writing the letter of the boxplot in the space provided.

1970 Draft

Was the draft fair?

1971 Draft

1.2: Percentile, Median, Quartiles, Boxplot, IQR, Determine outlier

82-84

106-107

33, 36, 37

61(a only), 62

Day 5 – 1.2

Comparing Distributions:

Make sure you actually compare!!!!!!

Don’t just state CUSS, but compare the values

% change in population from 1990 to 2000

Mean and median applet. Pg. 73

http://www.ruf.rice.edu/~lane/stat_sim/descriptive/

http://www.whfreeman.com/tps3e

If the data is uniform or symmetric, use:

Center: Mean

Spread: Standard deviation

If the data is skewed, use:

Center: Median

Spread: Five-number summary, Range, IQR

Who's Counting: It's Mean to Ignore the Median

Reading Economic Numbers from Democratic, Republican Points of View

Aug. 6, 2006: Believe it or not, the difference in the way the Democrats and Republicans react to the performance of the U.S. economy is clarified by a mathematical distinction studied in elementary school. The distinction is between the mean, which the Republicans emphasize, while the Democrats prefer the median. The relevance of this distinction is apparent in the just-released figures on the U.S. economy for 2004, the latest year for which there is complete data. The Republicans chortle that the economy grew at a healthy rate of 4.2 percent. (It's slowed since then.) The Democrats point to data from the Census Bureau for the same year (and earlier as well), indicating that the real median family income fell and that poverty increased.

Example #14: Should you use the mean or median to discuss the center?

a. Average price of home: Median

b. Average age: Median

c. Average height: Mean

d. Average gas mileage for all cars: Mean

Linear Transformation:

Change in the measurement unit where you add or multiply the data

Matching Histograms and Summary Statistics

Match each histogram with a set of summary statistics, by writing the letter in the space provided.

[Figure: five histograms, numbered 1 to 5]

Answers: 1. D, 2. A, 3. B, 4. E, 5. C

A. mean 10.5, standard deviation 1.4, median 10.7, IQR 2.0

B. mean 10.1, standard deviation 2.7, median 10.1, IQR 4.2

C. mean 10.2, standard deviation 2.1, median 10.5, IQR 2.5

D. mean 10.2, standard deviation 4.1, median 11.9, IQR 6.8

E. mean 8.8, standard deviation 2.8, median 8.0, IQR 1.9

Example #15: Consider the following data set: 1, 1, 1, 3, 4, 4, 5, 5. Transform the data.

          Original Data   Multiply by 3   Add 4
Mean      3               9               7
Median    3.5             10.5            7.5
S.D.      1.77            5.32            1.77
Q1        1               3               5
Q3        4.5             13.5            8.5
IQR       3.5             10.5            3.5
Range     4               12              4

[Figure: dotplots and boxplots of the original data (axis 1 to 5), the data multiplied by 3 (axis 3 to 15), and the data plus 4 (axis 1 to 9)]

Conclusion:

Multiply: Changes both center and spread

Add: Changes mean & 5-number summary. Spread doesn’t change.

Middle always Moves

Spread Sometimes Shifts

Mean: \( \mu_{a+bX} = a + b\mu_X \)

Standard Deviation: \( \sigma_{a+bX} = |b|\,\sigma_X \)
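These rules can be checked numerically on the Example #15 data set. The Python sketch below multiplies every value by 3, then separately adds 4 to every value, and prints the mean, median, standard deviation, and range for each version. It is a quick verification, not part of the original slides.

```python
from statistics import mean, median, stdev

original = [1, 1, 1, 3, 4, 4, 5, 5]
tripled = [3 * x for x in original]   # multiply: center and spread both scale
shifted = [x + 4 for x in original]   # add: center moves, spread stays the same

for label, data in (("original", original), ("times 3", tripled), ("plus 4", shifted)):
    print(label, mean(data), median(data), round(stdev(data), 2), max(data) - min(data))
```

Multiplying changes all four numbers by a factor of 3, while adding moves only the mean and median.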

Example #16: True or False.

a. If you add 7 to each entry on a list, that adds 7 to the mean. TRUE

b. If you add 7 to each entry on a list, that adds 7 to the standard deviation. FALSE

c. If you double each entry on a list, that doubles the mean. TRUE

d. If you double each entry on a list, that doubles the standard deviation. TRUE

e. Multiplying each entry on a list changes the mean. TRUE

f. Multiplying each entry on a list changes the standard deviation. TRUE

g. Adding to each entry on a list changes the mean. TRUE

h. Adding to each entry on a list changes the standard deviation. FALSE

Example #17: A college professor gave a test to his students. The test had five questions, each worth 20 points. The summary statistics for the students’ scores on the test are below. After grading the test, the professor realized that, because he had made a typographical error in question number 2, no student was able to answer the question. So he decided to adjust the students’ scores by adding 20 points to each one. What will be the summary statistics for the new, adjusted scores?

Summary Statistics for Scores

                     Original   NEW
Mean                 62         82
Median               60         80
Range                45         45
Standard Deviation   8          8
Q1                   48         68
Q3                   71         91
IQR                  23         23

Example #18: The summary statistics for the property tax per property collected by one county are below. This year, county residents voted to increase property taxes by 2 percent to support the local school system. What will be the summary statistics for the new, increased property taxes?

Summary Statistics for Property Tax

                     Original   NEW
Mean                 12,000     12,240
Median               8,000      8,160
Range                30,000     30,600
Standard Deviation   5,000      5,100
Q1                   5,000      5,100
Q3                   14,000     14,280
IQR                  9,000      9,180
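Examples #17 and #18 both apply the linear transformation rules to summary statistics rather than to raw data. Here is a hedged Python sketch of that bookkeeping. The function `transform_summary` and the two name sets are invented for this illustration, and the sketch assumes a positive multiplier b.

```python
LOCATION = {"Mean", "Median", "Q1", "Q3"}          # shifted by a and scaled by b
SPREAD = {"Range", "Standard Deviation", "IQR"}    # scaled by b only

def transform_summary(summary, a=0.0, b=1.0):
    """Summary statistics of a + b*X given those of X (assumes b > 0)."""
    new = {}
    for name, value in summary.items():
        if name in LOCATION:
            new[name] = a + b * value
        elif name in SPREAD:
            new[name] = b * value
        else:
            raise KeyError(f"unknown statistic: {name}")
    return new

scores = {"Mean": 62, "Median": 60, "Range": 45, "Standard Deviation": 8, "IQR": 23}
taxes = {"Mean": 12_000, "Median": 8_000, "Range": 30_000,
         "Standard Deviation": 5_000, "IQR": 9_000}

new_scores = transform_summary(scores, a=20)    # add 20 points to every score
new_taxes = transform_summary(taxes, b=1.02)    # raise every tax bill by 2 percent

print(new_scores)
print(new_taxes)
```

Adding 20 points moves the mean and median but leaves the range, standard deviation, and IQR alone. A 2 percent increase multiplies every statistic by 1.02.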

1.2: Mean vs. Median, Describing a Graph, Choosing a Summary, Linear Transformations

55-5774-75

828997102

110-111

7, 1027, 31, 32 3540, 4245, 465868, 70

Research Project Due Soon!
