Chapter 1: Exploring Data 1.1 – Displaying Distributions with Graphs
Chapter 1: Exploring Data
1.1 – Displaying Distributions with Graphs
Types of Graphs:
Categorical: Bar Chart, Pie Chart
Quantitative: Dotplot, Histogram, Stemplot, Ogive, Time Plot
Bar graph: Displays categorical variables
How to construct a bar graph:
Step 1: Label your axes and title graph
Step 2: Scale your axes
Step 3: Leave spaces between bars
[Figure: bar graph template with a title, categories on the horizontal axis, and count on the vertical axis]
Side-by-Side bar graph:
Compares two variables of one individual
[Figure: side-by-side bar graph template with a title, a legend, categories on the horizontal axis, and count on the vertical axis]
Example #1 The table shows results of a poll asking adults whether they were looking forward to the Super Bowl game, the commercials, or didn’t plan to watch.
Male Female Total
Game 279 200 479
Commercials 81 156 237
Won’t Watch 132 160 292
Total 492 516 1008
Construct a side-by-side bar chart for their preference based on gender. Note any trends that appear.
[Figure: side-by-side bar graph titled "Reason Looking Forward to Super Bowl"; horizontal axis: Male, Female; vertical axis: count, scaled from 50 to 300; legend: Game, Commercials, Won’t Watch]
Males overwhelmingly watch the Super Bowl for the game, whereas females are split more evenly among the game, the commercials, and not watching.
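Because the two groups are not the same size (492 males, 516 females), the trend is easier to check with percents than with raw counts. The sketch below is one way to compute them from the Example #1 table; the function name `percent_by_gender` is only illustrative.

```python
# Counts from the Example #1 table (rows: preference, columns: gender).
counts = {
    "Game": {"Male": 279, "Female": 200},
    "Commercials": {"Male": 81, "Female": 156},
    "Won't Watch": {"Male": 132, "Female": 160},
}

def percent_by_gender(table):
    """Return {gender: {category: percent of that gender}}."""
    genders = next(iter(table.values())).keys()
    result = {}
    for g in genders:
        total = sum(row[g] for row in table.values())
        result[g] = {cat: round(100 * row[g] / total, 1)
                     for cat, row in table.items()}
    return result

percents = percent_by_gender(counts)
for gender, dist in percents.items():
    print(gender, dist)
```

About 57% of males chose the game, while no single answer reaches 40% among females.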
Describing Quantitative Distributions:
When describing a graph, use CUSS
C - Center
Mean: Average value; add up the values, then divide by the number of values
Median: Number in the center when the data is lined up in order
Mode: Most frequent number. There can be many modes
Calculator Tip: To calculate mean and median
STAT – Edit – type in data – exit
STAT – CALC – 1-Var Stats – L1
U - Unusual points
Any data points that stand out as different
Don’t call them outliers yet!
S - Shape
Symmetric: Fold in half, it matches up
Bell/Normal: Special case, don’t say yet!
Uniform: All the same frequencies
Unimodal: One peak in the data
Bimodal: Two peaks in the data
Gaps: Space between the data
Cluster: Several data points grouped together
Skewed Right: The tail, with any unusual points, stretches to the right
Skewed Left: The tail, with any unusual points, stretches to the left
S - Spread
Range: Distance between largest and smallest values. Range = Maximum - Minimum
Homogeneous: Data is all in a similar space (small spread)
Dotplot: Dots are used to keep count of the frequency of each number
How to construct a dotplot:
Step 1: Label your axis and title your graph.
Step 2: Mark a dot above the corresponding value
[Figure: dotplot template with a title and a number line covering the range of the data]
Example #2 The data below give the number of hurricanes classified as major hurricanes in the Atlantic Ocean each year from 1944 through 2006, as reported by NOAA.
a. Make a dotplot of the data.
3 2 1 2 4 3 7 2 3 3 2 5 2 2 4 2 2
6 0 2 5 1 3 1 0 3 2 1 0 1 2 3 2 1
2 2 2 3 1 1 1 3 0 1 3 2 1 2 1 1 0
5 6 1 3 5 3 3 2 3 6 7 2 6 8
[Figure: dotplot titled "Number of Hurricanes Classified as a Major Hurricane (1944-2006)"; axis from 0 to 8]
b. Describe what you see in a few sentences.
• A dotplot is a simple display. It just places a dot along an axis for each case in the data.
• The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot.
• You might see a dotplot displayed horizontally or vertically.
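A dotplot is essentially a tally of how often each value occurs. The following is a minimal text-based sketch, drawn horizontally with one row per value; the function name `dotplot_rows` is an assumption for illustration, and the sample is only the first ten values from Example #2.

```python
from collections import Counter

def dotplot_rows(values):
    """Return one text row per value on the axis: 'value | dots'."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    # Include every integer on the axis, even those with no observations,
    # so that gaps stay visible.
    return [f"{v} | " + "." * counts[v] for v in range(lo, hi + 1)]

sample = [3, 2, 1, 2, 4, 3, 7, 2, 3, 3]  # first ten values of Example #2
for row in dotplot_rows(sample):
    print(row)
```

Keeping the empty rows matters: a gap in the plot is part of the shape you describe with CUSS.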
Guidelines for constructing Stemplots (stem and leaf)
1. Put data in order from smallest to largest
2. Separate each value into a STEM and a LEAF. The leaf is a single digit and it is the rightmost digit of the number. The stem will consist of everything else to the left of the leaf
3. Stems go in a vertical column from small to large and a vertical line is drawn to the right of the stems
4. Leaves are written to the right of their stems from small to large.
Back-to-Back Stemplots: To compare two different sets of data
Split Stemplots: To spread out the data to see more trends if they are grouped together. Leaves will split from 0-4 and 5-9.
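The guidelines above can be sketched in a few lines of code. This is an illustrative version for non-negative whole numbers only (stem = everything but the last digit, leaf = last digit); the function name `stemplot` and its `split` flag are assumptions, not part of the lesson.

```python
from collections import defaultdict

def stemplot(values, split=False):
    """Return stemplot rows for non-negative integers (leaf = last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        # With split stems, leaves 0-4 go on the first row and 5-9 on the second.
        half = 1 if (split and leaf >= 5) else 0
        groups[(stem, half)].append(leaf)
    stems = range(min(values) // 10, max(values) // 10 + 1)
    halves = (0, 1) if split else (0,)
    rows = []
    for stem in stems:
        for half in halves:
            leaves = " ".join(str(leaf) for leaf in groups.get((stem, half), []))
            rows.append(f"{stem} | {leaves}".rstrip())
    return rows

for row in stemplot([15, 15, 16, 20, 23, 27, 31]):
    print(row)
```

Note that with `split=True` this sketch prints both halves of every stem, including empty ones, so a plot drawn by hand may omit a row that the code shows.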
Example #3 The data below give the amount of caffeine content (in milligrams) for an 8-ounce serving of popular soft drinks.
20 15 23 29 23 15 23 31 28 35 37 27 24 26 47 28 24 28 28
16 38 36 35 37 27 33 37 25 47 27 29 26 43 43 28 35 31 25
a. Construct stemplot.
b. Construct a split stemplot.
Caffeine per 8oz of soda
a.
1 | 5 5 6
2 | 0 3 3 3 4 4 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3 5 5 5 6 7 7 7 8
4 | 3 3 7 7
Key: 1 | 5 = 15 mg
b.
1 | 5 5 6
2 | 0 3 3 3 4 4
2 | 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3
3 | 5 5 5 6 7 7 7 8
4 | 3 3
4 | 7 7
Key: 1 | 5 = 15 mg
c. Differences?
Most people believe that you need to drink coffee or an energy drink to get a good “buzz” off of the caffeine. Below is a table with common caffeine levels of tea, coffee, and energy drinks.
Coffee: 133 160 150 103 150 93 150 115 75 75 40
Energy Drink: 160 144 100 100 95 83 80 80 80 79 74 50 48
d. Make a back-to-back stemplot. Comment on the difference in caffeine levels between coffee and energy drinks.
 Coffee |    | Energy Drink
      0 |  4 | 8
        |  5 | 0
        |  6 |
    5 5 |  7 | 4 9
        |  8 | 0 0 0 3
      3 |  9 | 5
      3 | 10 | 0 0
      5 | 11 |
        | 12 |
      3 | 13 |
        | 14 | 4
  0 0 0 | 15 |
      0 | 16 | 0
Key: 7 | 4 = 74 mg
Source: http://www.cspinet.org/new/cafchart.htm
Examples of separating a value into stem and leaf (the key gives the units):
562 → 56 | 2
56.2 → 56 | 2
5.62 → 56 | 2
50 → 5 | 0
[Figures: a back-to-back stemplot and a split stemplot built from the values 562, 572, 580 and 565, 577, with leaves counted in from both ends toward the median]
Calculator Tip: Sort values from smallest to largest
STAT – Edit – type in data – exit
STAT – SortA – L1
Calculator Tip: Clearing Lists
All Lists: Mem – ClrAllLists – Enter
One List: Stat – Edit – Highlight List name – Clear
Calculator Tip: Deleted a list?
STAT – SetUpEditor – Enter
Calculator Tip: Save a list?
L1 – STO – Any name or Letter
To Retrieve later: 2nd – List
Calculator Tip: Remove a number from list?
Line up number you want to delete, hit DEL
Histogram:
1. Divide the range of data into classes of equal width.
2. Count the number of observations in each class. Ensure no one number falls into two classes
3. Label and scale the axes and title your graph.
4. Draw a bar that represents the count in each class. The base of a bar should cover its class, and the bar height is the class count. Leave no horizontal space between the bars unless the class is empty.
Calculator Tip: Make a histogram. Pg. 59
STAT – Edit – type in data – exit
STAT PLOT – 1 – On – histogram – L1 – Freq 1
ZOOM – ZoomStat (#9)
To adjust the classes (WINDOW):
Xmin: Lowest value
Xmax: Highest value
Xscl: Scale on x-axis (width of bars)
Ymin: -0.2 typically
Ymax: Highest frequency rate (height of bars)
Yscl: Scale on y-axis
Ex. #4: Describe the distribution of the graph.
C: 4-5 words
U: 12 words
S: Unimodal, slight skew right
S: 1 to 12, Range = 11
Example #5: An executive finds that the subscriptions (in millions of people) of the 20 leading American magazines are as follows:
Magazine                   Millions    Magazine                Millions
Reader’s Digest            17.9        Ladies’ Home Journal    5.3
TV Guide                   17.1        National Enquirer       4.7
National Geographic        10.6        Time                    4.6
Modern Maturity            9.3         Playboy                 4.2
AARP News Bulletin         8.8         Redbook                 4
Better Homes and Gardens   8           The Star                3.7
Family Circle              7.2         Penthouse               3.5
Woman’s Day                7           Newsweek                3
McCall’s                   6.4         Cosmopolitan            3
Good Housekeeping          5.4         People Weekly           2.8
Make a histogram for the number of subscriptions in intervals of 2 (million) compared to the frequency of that number. Then describe the graph.
Describe the features of the graph in detail.
C: mean = 6.825, median = 5.35
U: 17.1 & 17.9
S: Skewed to the right, unimodal
S: 2.8 to 17.9, range of 15.1
[Figure: histogram titled "Circulation in millions of people of American Magazines"; horizontal axis: Circulation (in millions), 2 to 20 in steps of 2; vertical axis: Frequency, 1 to 8]
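The class counts behind this histogram can be checked with a short sketch. It assumes classes of width 2 starting at 2, with each class including its lower endpoint and excluding its upper endpoint (so no value falls into two classes); the function name `class_counts` is illustrative.

```python
import math

circulation = [17.9, 17.1, 10.6, 9.3, 8.8, 8, 7.2, 7, 6.4, 5.4,
               5.3, 4.7, 4.6, 4.2, 4, 3.7, 3.5, 3, 3, 2.8]

def class_counts(data, start, width, n_classes):
    """Count observations in classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = math.floor((x - start) / width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[k] += 1
    return counts

counts = class_counts(circulation, start=2, width=2, n_classes=8)
for k, c in enumerate(counts):
    low = 2 + 2 * k
    print(f"{low:>2} to <{low + 2:<2} | " + "#" * c)
```

The two empty classes (12 to 16) are the gap that separates 17.1 and 17.9 from the rest of the data.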
Height of NBA Players
http://bcs.whfreeman.com/tps3e
Page 50: applets: One-variable Statistical calculator
1. How do you determine how many classes to make?
2. When is it good to split the stems on a stemplot?
P21.1
Types of Graphs, Bar graph, Dotplot, Stemplot, Histogram, Describing a Graph
19 7, 9
4757-58109
3(a&b only)1151
HW
1.1 & 1.2
Day 3
Relative Cumulative Frequency Graph (Ogive):
Shows relative standing of an observation
Example #6: The President of the United States has to be at least 35 years old and be born in America. Below is an ogive showing the relative cumulative frequency of the ages at which previous presidents were inaugurated.
a. What percent of presidents were younger than 60? 80%
b. What percent of presidents were between 50 and 55? 30%
c. There is a horizontal line between 35 and 40 years of age. What does that mean?
No presidents were less than 40 years old when inaugurated
d. What is the median age of the presidents at inauguration? 55
e. President Obama was 47 when he was inaugurated. What percent of presidents were older than him? 85%
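The ogive itself is not reproduced here, so the sketch below shows how the points of an ogive are computed using a different data set: the class counts (width 2, from 2 to 18) tallied from the Example #5 magazine circulation data. The function name `relative_cumulative` is an assumption for illustration.

```python
from itertools import accumulate

def relative_cumulative(counts):
    """Return the cumulative share of observations at the end of each class."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

# Class counts for the Example #5 magazine data, classes of width 2 from 2 to 18.
counts = [5, 6, 3, 3, 1, 0, 0, 2]
upper_bounds = [4, 6, 8, 10, 12, 14, 16, 18]
ogive = relative_cumulative(counts)
for bound, share in zip(upper_bounds, ogive):
    print(f"below {bound}: {share:.0%}")
```

The share stays at 90% from 12 through 16, which is the same idea as part c: a flat stretch of an ogive means no observations fall in that interval.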
Time Plots: Plots each observation against the time at which it was measured. Always mark the time scale on the horizontal axis and the variable being measured on the vertical axis.
Trend: A common overall pattern.
Seasonal Variations: A pattern that repeats itself at regular time intervals
Ex. #7: Identify any trends and describe the time plot.
Seems to fluctuate, peaking in 1983
Chapter 1: Exploring Data
1.2 – Describing Distributions with Numbers
Mean: The average number of a set of data. Add the values in the data set and divide by the number of observations.
For n observations,
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ or $\bar{x} = \frac{\sum x_i}{n}$
Ex#8: Find the mean for the two sets of data.
Data set A: 1 1 2 2 3
Data set B: 1 1 2 2 500,000
Data set A: $\bar{x} = 1.8$
Data set B: $\bar{x} = 100001.2$
What happened?
The mean is strongly influenced by unusual values
Variance: Average of the squares of the deviations of the observations from their mean
$s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$ or $s_x^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$
Standard Deviation: The square root of the variance
Measures the average distance the values are away from the mean.
$s_x = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$
Degrees of Freedom: Dividing by n – 1
Calculator Tip: Standard Deviation
STAT – CALC – 1-Var Stats – L1
Ex#9: Calculate the Standard Deviation by Hand
Data Set: 6, 4, 4, 3, 2, 6, 10
Mean = 5
$\sum (x_i - \bar{x})^2 = (6-5)^2 + (4-5)^2 + (4-5)^2 + (3-5)^2 + (2-5)^2 + (6-5)^2 + (10-5)^2$
$= (1)^2 + (-1)^2 + (-1)^2 + (-2)^2 + (-3)^2 + (1)^2 + (5)^2 = 42$
$s_x = \sqrt{\frac{1}{7 - 1}(42)} = \sqrt{\frac{42}{6}} = \sqrt{7} \approx 2.64575$
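The hand calculation can be double-checked step by step. This sketch follows the same order (mean, squared deviations, divide by n - 1, square root) and compares the result with the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [6, 4, 4, 3, 2, 6, 10]
mean = sum(data) / len(data)                        # 5.0
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (len(data) - 1)
s = math.sqrt(variance)
print(mean, sum(squared_deviations), variance, round(s, 5))
```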
Example #10: Using the numbers 1-10, choose 4 numbers so the standard deviation will be the smallest. Then choose 4 numbers so the standard deviation will be the largest. (Repeats are ok)
Smallest: 1, 1, 1, 1 (Sx = 0)
Largest: 1, 1, 10, 10 (Sx = 5.196)
http://www.stat.tamu.edu/~west/ph/stddev.html
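Since there are only 715 ways to pick 4 numbers from 1-10 with repeats allowed, the Example #10 answers can be confirmed by trying all of them. This is a brute-force sketch rather than a proof; any set of four identical numbers ties for the smallest standard deviation, and `min` simply reports the first one it meets.

```python
from itertools import combinations_with_replacement
import statistics

# Every unordered choice of 4 numbers from 1..10, repeats allowed.
candidates = list(combinations_with_replacement(range(1, 11), 4))
smallest = min(candidates, key=statistics.stdev)
largest = max(candidates, key=statistics.stdev)
print(smallest, statistics.stdev(smallest))
print(largest, round(statistics.stdev(largest), 3))
```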
Example #11: Which graph will have the larger standard deviation? Why?
[Figure: five graphs labeled a. through e., each with its mean marked]
Properties of the standard deviation and variance:
1. Sensitive to outliers.
2. Some deviations are positive and some are negative (that’s why we square them!) Otherwise, they would add up to zero and tell us nothing about the spread around the mean. Then, to get the original units, we take the square root.
3. Standard deviation is at least ZERO, or greater, but never negative.
4. Values that are very close together have a small standard deviation and those far apart have a large standard deviation.
1.1 & 1.2
Ogives, Time Plot, Mean, Variance, Standard Deviation
64-6989101
13(a&b only), 22, 23, 2639, 4354
Curriculum Night
Day 4 – 1.2
Median: The midpoint or value where half of the data is above the median and half is below the median. (50% mark)
To find the median:
1. Put all the data in order from smallest to largest
2. Cancel off the end data points until you find the middle. If two values are left in the middle, average them
Resistant measure:
A good estimate even when there are very unusual values.
Ex#12: Find the median for the two sets of data.
Data set A: 1 1 2 2 3
Data set B: 1 1 2 2 500,000
Data set A: M = 2
Data set B: M = 2
Which one is a resistant measure? Mean or Median?
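The two data sets make the comparison easy to run side by side. A minimal sketch using the standard library:

```python
import statistics

data_a = [1, 1, 2, 2, 3]
data_b = [1, 1, 2, 2, 500_000]
for name, data in (("A", data_a), ("B", data_b)):
    print(name, "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```

Changing a single value from 3 to 500,000 moves the mean from 1.8 to 100001.2, while the median stays at 2.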
pth percentile: p percent of the observations fall at or below it
Quartiles:
25th percentile = first quartile = Q1
50th percentile = median = Q2
75th percentile = third quartile = Q3
Five-Number Summary: Min, Q1, M, Q3, Max
Boxplot:
Uses the five-number summary. A box is drawn connecting Q1 and Q3 with a line through the median. Whiskers are drawn to the max and min.
[Figure: boxplot above a number line, marking min, Q1, med, Q3, and max; each of the four sections holds 25% of the data]
Interquartile Range: IQR = Q3 – Q1
Outliers: Data that is away from the majority of points
To Determine:
Lower cutoff: Q1 – 1.5(IQR)
Upper cutoff: Q3 + 1.5(IQR)
All values should be between these two numbers; any value outside them is an outlier
[Figure: modified boxplot above a number line, marking min, Q1, med, Q3, and max, with outliers shown as asterisks]
Keep in mind, you don’t know how much data is in a boxplot!
Calculator Tip: Boxplots. Pg. 81
STAT – Edit – type data – exit
STAT PLOT – 1 – On – boxplot (with or without outliers) – L1
Calculator Tip: 5-Number Summary. Pg. 81
STAT – CALC – 1-Var Stats – L1
Ex #13: The Fuel Economy of 2004 vehicles is given.
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
a. Determine the 5-number summary.
Min = 13
Q1 = 18
Med = 23
Q3 = 28
Max = 66
b. Calculate the range and IQR for the data set.
Range = 66 – 13 = 53
IQR = 28 – 18 = 10
c. Make a box plot using the 5-number summary.
[Figure: boxplot above a number line from 10 to 70]
d. Describe the shape, center, and spread.
C: Median = 23
U: 66
S: Skewed Right
S: Range = 53, IQR = 10
e. Are there any potential outliers using the criterion?
Q1 – 1.5(IQR) = 18 – 1.5(10) = 18 – 15 = 3
Q3 + 1.5(IQR) = 28 + 1.5(10) = 28 + 15 = 43
Yes, 66 is above 43.
f. Construct a modified boxplot to account for the outlier.
[Figure: modified boxplot above a number line from 10 to 70, with the outlier 66 marked by an asterisk]
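The Example #13 numbers can be reproduced with a short sketch. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, leaving the overall median out of both halves when the count is odd; other software may use a different quartile convention and give slightly different values. The function name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves.

    When the number of values is odd, the overall median is left out of
    both halves (an assumption; other conventions exist).
    """
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

mpg = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23, 23,
       24, 25, 25, 26, 28, 28, 28, 29, 32, 66]
minimum, q1, med, q3, maximum = five_number_summary(mpg)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in mpg if x < low_fence or x > high_fence]
print(minimum, q1, med, q3, maximum)
print("IQR:", iqr, "cutoffs:", low_fence, high_fence, "outliers:", outliers)
```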
Ozone and Outliers
The 'ozone hole' above Antarctica provides the setting for one of the most infamous outliers in recent history. It is a great story to tell students who wantonly delete outliers from a dataset merely because they are outliers.
In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by some data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal January levels. The puzzle was why the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, hadn't recorded similarly low ozone concentrations. When they examined the data from the satellite, it didn't take long to realize that the satellite was in fact recording these low concentration levels and had been doing so for years. But because the ozone concentrations recorded by the satellite were so low, they were being treated as outliers by a computer program and discarded! The Nimbus 7 satellite had in fact been gathering evidence of low ozone levels since 1976. The damage to our atmosphere caused by chlorofluorocarbons went undetected and untreated for up to nine years because outliers were discarded without being examined.
Moral: Don't just toss out outliers, as they may be the most valuable members of a dataset.
Weight of NBA Players
• Compare the histogram and boxplot for daily wind speeds:
• How does each display represent the distribution?
Matching Histograms and Boxplots
Match each histogram with its boxplot, by writing the letter of the boxplot in the space provided.
1. D
2. A
3. C
4. E
5. B
1970 Draft
Was the draft fair?
1971 Draft
1.2 Percentile, Median, Quartiles, Boxplot, IQR, Determine outlier
82-84
106-107
33, 36, 37
61(a only), 62
Day 5 – 1.2
Comparing Distributions:
Make sure you actually compare!!!!!!
Don’t just state CUSS, but compare the values
% change in population from 1990 to 2000
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/
Mean and median applet.
www.whfreeman.com/tps3e
Mean and median applet.
Pg. 73
If the data is uniform or symmetric, use:
Center: Mean
Spread: Standard deviation
If the data is skewed, use:
Center: Median
Spread: Five-number summary, Range, IQR
Who's Counting: It's Mean to Ignore the Median
Reading Economic Numbers from Democratic, Republican Points of View
Aug. 6, 2006: Believe it or not, the difference in the way the Democrats and Republicans react to the performance of the U.S. economy is clarified by a mathematical distinction studied in elementary school. The distinction is between the mean, which the Republicans emphasize, while the Democrats prefer the median. The relevance of this distinction is apparent in the just-released figures on the U.S. economy for 2004, the latest year for which there is complete data. The Republicans chortle that the economy grew at a healthy rate of 4.2 percent. (It's slowed since then.) The Democrats point to data from the Census Bureau for the same year (and earlier as well), indicating that the real median family income fell and that poverty increased.
Example #14: Should you use the mean or median to discuss the center?
a. Average price of home: Median
b. Average age: Median
c. Average height: Mean
d. Average gas mileage for all cars: Mean
Linear Transformation:
Change in the measurement unit where you add or multiply the data
Matching Histograms and Summary Statistics
Match each histogram with a set of summary statistics, by writing the letter in the space provided.
[Figure: five histograms, numbered 1 to 5]
Answers: 1. D, 2. A, 3. B, 4. E, 5. C
A. mean 10.5, standard deviation 1.4, median 10.7, IQR 2.0
B. mean 10.1, standard deviation 2.7, median 10.1, IQR 4.2
C. mean 10.2, standard deviation 2.1, median 10.5, IQR 2.5
D. mean 10.2, standard deviation 4.1, median 11.9, IQR 6.8
E. mean 8.8, standard deviation 2.8, median 8.0, IQR 1.9
Example #15: Consider the following data set: 1, 1, 1, 3, 4, 4, 5, 5
Transform the data.
Statistic   Original Data   Multiply by 3   Add 4
Mean        3               9               7
Median      3.5             10.5            7.5
S.D.        1.77            5.32            1.77
Q1          1               3               5
Q3          4.5             13.5            8.5
IQR         3.5             10.5            3.5
Range       4               12              4
[Figures: dotplots and boxplots of the original data (axis 1 to 5), the data multiplied by 3 (axis 3 to 15), and the data with 4 added (axis 1 to 9)]
Conclusion:
Multiply: Changes both center and spread
Add: Changes mean & 5-number summary. Spread doesn’t change.
Middle always Moves
Spread Sometimes Shifts
Mean: $\mu_{a+bX} = a + b\mu_X$
Standard Deviation: $\sigma_{a+bX} = |b|\sigma_X$
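These rules can be checked directly on the Example #15 data set (1, 1, 1, 3, 4, 4, 5, 5). The sketch below recomputes a few summary statistics after multiplying by 3 and after adding 4; the helper name `summarize` is an assumption for illustration.

```python
import statistics

data = [1, 1, 1, 3, 4, 4, 5, 5]

def summarize(values):
    """Return a few measures of center and spread, SD rounded to 2 places."""
    return {"mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": round(statistics.stdev(values), 2),
            "range": max(values) - min(values)}

original = summarize(data)
tripled = summarize([3 * x for x in data])
shifted = summarize([x + 4 for x in data])
for label, stats in (("original", original), ("times 3", tripled),
                     ("plus 4", shifted)):
    print(label, stats)
```

Multiplying changes every statistic by the same factor; adding moves the mean and median but leaves the standard deviation and range alone.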
Example #16: True or False.
a. If you add 7 to each entry on a list, that adds 7 to the mean. TRUE
b. If you add 7 to each entry on a list, that adds 7 to the standard deviation. FALSE
c. If you double each entry on a list, that doubles the mean. TRUE
d. If you double each entry on a list, that doubles the standard deviation. TRUE
e. Multiplying each entry on a list changes the mean. TRUE
f. Multiplying each entry on a list changes the standard deviation. TRUE
g. Adding to each entry on a list changes the mean. TRUE
h. Adding to each entry on a list changes the standard deviation. FALSE
Example #17: A college professor gave a test to his students. The test had five questions, each worth 20 points. The summary statistics for the students’ scores on the test are below. After grading the test, the professor realized that, because he had made a typographical error in question number 2, no student was able to answer the question. So he decided to adjust the students’ scores by adding 20 points to each one. What will be the summary statistics for the new, adjusted scores?
Summary Statistics for Scores   Original   NEW
Mean                            62         82
Median                          60         80
Range                           45         45
Standard Deviation              8          8
Q1                              48         68
Q3                              71         91
IQR                             23         23
Example #18: The summary statistics for the property tax per property collected by one county are below. This year, county residents voted to increase property taxes by 2 percent to support the local school system. What will be the summary statistics for the new, increased property taxes?
Summary Statistics for Property Tax   Original   NEW
Mean                                  12,000     12,240
Median                                8,000      8,160
Range                                 30,000     30,600
Standard Deviation                    5,000      5,100
Q1                                    5,000      5,100
Q3                                    14,000     14,280
IQR                                   9,000      9,180
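Examples #17 and #18 apply the same rule, new value = a + b(old value), to a whole table of summary statistics. The sketch below is one way to express that, assuming b > 0 and taking the smaller quartile value as Q1 and the larger as Q3 (the order that makes IQR = Q3 – Q1); the function name `transform_summary` is illustrative.

```python
def transform_summary(summary, a=0, b=1):
    """Apply new = a + b * old to a dict of summary statistics.

    Measures of center and position shift and scale; measures of spread
    only scale (by |b|). Assumes b > 0 so Q1 and Q3 keep their roles.
    """
    spread = {"Range", "Standard Deviation", "IQR"}
    return {name: (abs(b) * value if name in spread else a + b * value)
            for name, value in summary.items()}

scores = {"Mean": 62, "Median": 60, "Range": 45,
          "Standard Deviation": 8, "Q1": 48, "Q3": 71, "IQR": 23}
taxes = {"Mean": 12_000, "Median": 8_000, "Range": 30_000,
         "Standard Deviation": 5_000, "Q1": 5_000, "Q3": 14_000, "IQR": 9_000}

new_scores = transform_summary(scores, a=20)    # add 20 points
new_taxes = transform_summary(taxes, b=1.02)    # increase by 2 percent
print(new_scores)
print({name: round(value) for name, value in new_taxes.items()})
```

Adding 20 leaves the range, standard deviation, and IQR unchanged, while the 2 percent increase scales every statistic by 1.02.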
1.2 Mean vs. Median, Describing a Graph, Choosing a Summary, Linear Transformations
55-5774-75
828997102
110-111
7, 1027, 31, 32 3540, 4245, 465868, 70
Research Project Due Soon!