2.1 Frequency Distribution
A frequency distribution is a table that shows classes or intervals of data with a count of the number of entries in each class. The frequency, f, of a class is the number of data entries in the class.
Larson/Farber 1
Class Frequency, f
1 – 5 5
6 – 10 8
11 – 15 6
16 – 20 8
21 – 25 5
26 – 30 4
Lower & Upper class limits
Class width = 5
Constructing a Frequency Distribution
1. Decide on the number of classes (usually 5 to 20).
2. Find the class width.
   • Determine the range of the data (Xmax – Xmin).
   • Divide the range by the number of classes and round up to the next convenient number (*).
3. Find the class limits.
   • The minimum data entry can be used as the lower limit of the first class.
   • Remaining lower limits: lower limit of the preceding class + class width.
   • Upper limits: lower limit of the class + class width – 1.
4. Make a tally mark for each data entry in the row of the appropriate class.
5. Count the tally marks to find the total frequency f for each class.
(*) Report the class width as the next successive whole number, even when the quotient is already a whole number. (Ex: 7.3 becomes 8, 7 becomes 8, 7.9 becomes 8.)
Example: Constructing a Frequency Distribution
The data set below lists the number of minutes 50 Internet subscribers spent on the Internet during their most recent session. Construct a frequency distribution that has seven classes.
50 40 41 17 11 7 22 44 28 21 19 23 37 51 54 42 86
41 78 56 72 56 17 7 69 30 80 56 29 33 46 31 39 20
18 29 34 59 73 77 36 39 30 62 54 67 39 31 53 44
Larson/Farber 4th ed. 2
1. Number of classes = 7 (given)
2. Find the class width:
Range / #Classes = (86-7) / 7 ≈ 11.3 ↑ 12
3. Find lower & upper limits of each class.
4. Tally the frequencies
5. Write the frequency for each class
Class Tally Frequency, f
7 – 18 IIII I 6
19 – 30 IIII IIII 10
31 – 42 IIII IIII III 13
43 – 54 IIII III 8
55 – 66 IIII 5
67 – 78 IIII I 6
79 – 90 II 2
Σf = 50
(Class = minutes online; frequency = number of subscribers)
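The construction steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `frequency_distribution` is hypothetical, the data are assumed to be whole numbers, and the class width is always bumped to the next whole number, following the (*) convention.

```python
import math

def frequency_distribution(data, num_classes):
    """Return a list of (lower_limit, upper_limit, frequency) tuples.

    Assumes whole-number data. The width is always raised to the next
    whole number (7.3 -> 8, 7 -> 8), per the (*) convention.
    """
    data_range = max(data) - min(data)
    width = math.floor(data_range / num_classes) + 1
    lower = min(data)  # minimum entry is the first lower class limit
    classes = []
    for _ in range(num_classes):
        upper = lower + width - 1
        freq = sum(1 for x in data if lower <= x <= upper)
        classes.append((lower, upper, freq))
        lower += width
    return classes

minutes = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
           41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
           18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]

for lower, upper, freq in frequency_distribution(minutes, 7):
    print(f"{lower} – {upper}: {freq}")
```

With seven classes, the sketch reproduces the class limits and frequencies in the table above (7 – 18 with 6 entries through 79 – 90 with 2 entries).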
Frequency Distribution (with additional data features)
Class (minutes online)   Frequency, f (# of subscribers)   Midpoint   Relative frequency   Cumulative frequency
7 – 18 6 12.5 0.12 6
19 – 30 10 24.5 0.20 16
31 – 42 13 36.5 0.26 29
43 – 54 8 48.5 0.16 37
55 – 66 5 60.5 0.10 42
67 – 78 6 72.5 0.12 48
79 – 90 2 84.5 0.04 50
Σf = 50   Σ(f/n) = 1
Midpoint calculation: $\text{midpoint} = \frac{(\text{lower class limit}) + (\text{upper class limit})}{2}$
Relative frequency of a class: the portion or percentage of the data that falls in the class.
$\text{relative frequency} = \frac{\text{class frequency}}{\text{sample size}} = \frac{f}{n}$
Cumulative frequency of a class: the sum of the frequency for that class and all previous classes.
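As a rough check of the three definitions above, the following sketch recomputes the extra columns from the class limits and frequencies of the Internet usage table. The variable names are illustrative only.

```python
from itertools import accumulate

# (lower limit, upper limit, frequency) for each class in the table above
classes = [(7, 18, 6), (19, 30, 10), (31, 42, 13), (43, 54, 8),
           (55, 66, 5), (67, 78, 6), (79, 90, 2)]

n = sum(f for _, _, f in classes)                         # sample size
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]      # (lower + upper) / 2
relative = [f / n for _, _, f in classes]                 # f / n
cumulative = list(accumulate(f for _, _, f in classes))   # running total of f

for (lo, hi, f), m, r, c in zip(classes, midpoints, relative, cumulative):
    print(f"{lo} – {hi}  {f:>2}  {m}  {r:.2f}  {c}")
```

The last cumulative frequency equals the sample size, and the relative frequencies add up to 1, which gives two quick sanity checks on any frequency table.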
Class Boundaries
Class   Class boundaries   Frequency, f
7 – 18 6.5 – 18.5 6
19 – 30 18.5 – 30.5 10
31 – 42 30.5 – 42.5 13
43 – 54 42.5 – 54.5 8
55 – 66 54.5 – 66.5 5
67 – 78 66.5 – 78.5 6
79 – 90 78.5 – 90.5 2
Frequency Histogram
• A bar graph that represents the frequency distribution.
• The horizontal scale is quantitative and measures the data values.
• The vertical scale measures the frequencies of the classes.
• Consecutive bars must touch.
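[Figure: two frequency histograms of the Internet usage data, one labeled using class midpoints and one labeled using class boundaries (6.5, 18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5). Horizontal axis: data values; vertical axis: frequency.]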
More than half of the subscribers spent between 19 and 54 minutes on the Internet during their most recent session.
Frequency Polygon
• A line graph that emphasizes the continuous change in frequencies.
Class Midpoint Freq. f
7 – 18 12.5 6
19 – 30 24.5 10
31 – 42 36.5 13
43 – 54 48.5 8
55 – 66 60.5 5
67 – 78 72.5 6
79 – 90 84.5 2
[Figure: frequency polygon titled "Internet Usage". Horizontal axis: time online (in minutes), marked at 0.5, 12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5, 96.5; vertical axis: frequency, from 0 to 14.]
The graph should begin and end on the horizontal axis, so extend the left side to one class width before the first class midpoint and extend the right side to one class width after the last class midpoint.
You can see that the frequency of subscribers increases up to 36.5 minutes and then decreases.
Relative Frequency Histogram
• Same shape and same horizontal scale as the corresponding frequency histogram.
• The vertical scale measures the relative frequencies, not frequencies.
Class   Class boundaries   Frequency, f   Relative frequency
7 – 18 6.5 – 18.5 6 0.12
19 – 30 18.5 – 30.5 10 0.20
31 – 42 30.5 – 42.5 13 0.26
43 – 54 42.5 – 54.5 8 0.16
55 – 66 54.5 – 66.5 5 0.10
67 – 78 66.5 – 78.5 6 0.12
79 – 90 78.5 – 90.5 2 0.04
[Figure: relative frequency histogram; horizontal axis marked at the class boundaries 6.5, 18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5.]
From this graph you can see that 20% of Internet subscribers spent between 18.5 minutes and 30.5 minutes online.
2.2 More Graphs and Displays
Stem-and-leaf plot
• Each number separated into a stem & a leaf.
• Still contains original data values.
Data: 21, 25, 25, 26, 27, 28, 30, 36, 36, 45
Example: 26 has stem 2 and leaf 6.
2 | 1 5 5 6 7 8
3 | 0 6 6
4 | 5
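A stem-and-leaf plot like the one above can be generated with a few lines of Python. This is a minimal sketch that assumes non-negative whole numbers, with the last digit as the leaf and the remaining digits as the stem; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [21, 25, 25, 26, 27, 28, 30, 36, 36, 45]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because every leaf is kept, the original data values can still be read back from the plot.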
Dot plot
• Each data entry is plotted, using a point, above a horizontal axis
Data: 21, 25, 25, 26, 27, 28, 30, 36, 36, 45
[Figure: dot plot of the data above a horizontal axis running from 20 to 45; the entry 26 is highlighted.]
Graphing for Quantitative Data
Examples: Graphing Quantitative Data
The following are the numbers of text messages sent last month by the cellular phone users on one floor of a college dormitory.
155 159 144 129 105 145 126 116 130 114 122 112 112 142 126 118
118 108 122 121 109 140 126 119 113 117 118 109 109 119 148 147 126
139 139 122 78 133 126 123 145 121 134 124 119 132 133 124 129 112
More than 50% of the cellular phone users sent between 110 and 130 text messages.
[Figures: stem-and-leaf plots and dot plot of the text message data.]
More Graphs for Qualitative Data Sets
Pie Chart
• A circle is divided into sectors that represent categories.
• The area of each sector is proportional to the frequency of each category.
Pareto Chart
• A vertical bar graph in which the height of each bar represents frequency or relative frequency.
• The bars are positioned in order of decreasing height, with the tallest bar positioned at the left.
[Figure: Pareto chart; horizontal axis: categories; vertical axis: frequency.]
Example: Pie Chart (Qualitative Data)
The numbers of motor vehicle occupants killed in crashes in 2005 are shown in the table. A pie chart is used to organize the data. (Source: U.S. Department of Transportation, National Highway Traffic Safety Administration)
Vehicle type   Killed (frequency)   Relative frequency     Central angle (degrees)
Cars           18,440               18440/37594 ≈ 0.49     360°(0.49) ≈ 176°
Trucks         13,778               13778/37594 ≈ 0.37     360°(0.37) ≈ 133°
Motorcycles    4,553                4553/37594 ≈ 0.12      360°(0.12) ≈ 43°
Other          823                  823/37594 ≈ 0.02       360°(0.02) ≈ 7°
Σf = 37,594
From the pie chart, you can see that most fatalities in motor vehicle crashes were those involving the occupants of cars.
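The central angles for the pie chart can be recomputed with a short Python sketch. It follows the slide's approach of rounding each relative frequency to two decimal places before multiplying by 360; the variable names are illustrative.

```python
# Occupants killed in crashes, by vehicle type (from the table above)
counts = {"Cars": 18440, "Trucks": 13778, "Motorcycles": 4553, "Other": 823}
total = sum(counts.values())

angles = {}
for vehicle, f in counts.items():
    rel_freq = round(f / total, 2)            # relative frequency, f / Σf
    angles[vehicle] = round(360 * rel_freq)   # central angle in degrees
    print(f"{vehicle}: {rel_freq:.2f} -> {angles[vehicle]} degrees")
```

Because each relative frequency is rounded first, the four angles add up to 359° rather than exactly 360°. Computing the angles from unrounded relative frequencies is an alternative that may shift an angle by a degree.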
Example: Pareto Chart (Qualitative Data)
In a recent year, the retail industry lost $41.0 million in inventory shrinkage. Inventory shrinkage is the loss of inventory through breakage, pilferage, shoplifting, and so on. The causes of the inventory shrinkage are administrative error ($7.8 million), employee theft ($15.6 million), shoplifting ($14.7 million), and vendor fraud ($2.9 million). Use a Pareto chart to organize this data. (Source: National Retail Federation and Center for Retailing Education, University of Florida)
Cause            $ (million)
Admin. error     7.8
Employee theft   15.6
Shoplifting      14.7
Vendor fraud     2.9
From the graph, it is easy to see that the causes of inventory shrinkage that should be addressed first are employee theft and shoplifting.
2.2 More Graphs for Paired Data Sets
(Each entry in one data set corresponds to one entry in a second data set.)
Scatter Plot
• The ordered pairs are graphed as points in a coordinate plane.
• Used to show the relationship between two quantitative variables.
[Figure: scatter plot; axes x and y.]
Time Series
• Data set is composed of quantitative entries taken at regular intervals over a period of time.
• Example: The amount of precipitation measured each day for one month.
[Figure: time series chart; horizontal axis: time; vertical axis: quantitative data.]
Example: Scatter Plot (Paired Data)
The British statistician Ronald Fisher introduced a famous data set called Fisher's Iris data set. This data set describes various physical characteristics, such as petal length and petal width (in millimeters), for three species of iris. The petal lengths form the first data set and the petal widths form the second data set. (Source: Fisher, R. A., 1936)
Each point in the scatter plot represents the petal length and petal width of one flower.
Interpretation: As the petal length increases, the petal width also tends to increase.
Example: Time Series Chart (Paired Data)
The table lists the number of cellular telephone subscribers (in millions) for the years 1995 through 2005. Construct a time series chart for the number of cellular subscribers. (Source: Cellular Telecommunication & Internet Association)
The graph shows that the number of subscribers has been increasing since 1995, with greater increases from 2003 to 2005.
2.3 Measures of Central Tendency (Typical or Central Entry of a Data Set)
Mean Median Mode
Mean (average)
• The sum of all the data entries divided by the number of entries.
• Sigma notation: Σx = add all of the data entries (x) in the data set.
• Population mean: $\mu = \frac{\sum x}{N}$
• Sample mean: $\bar{x} = \frac{\sum x}{n}$
Example: The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the mean flight price.
872 432 397 427 388 782 397
Σx = 872 + 432 + 397 + 427 + 388 + 782 + 397 = 3695
$\bar{x} = \frac{\sum x}{n} = \frac{3695}{7} \approx 527.9$
Mean flight price is about $527.90.
Median
• The value that lies in the middle of the data when the data set is ordered.
• Measures the center of an ordered data set by dividing it into two equal parts.
• If the data set has an odd number of entries, the median is the middle data entry.
• If the data set has an even number of entries, the median is the mean of the two middle data entries.
Example1: The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the median of the flight prices.
872 432 397 427 388 782 397
Order the data and find the middle entry: 388 397 397 427 432 782 872. The median flight price is $427.
Example2: The flight priced at $432 is no longer available. What is the median price of the remaining flights?
388 397 397 427 782 872
$\text{Median} = \frac{397 + 427}{2} = 412$
Mode
• The data entry that occurs with the greatest frequency.
• If no entry is repeated the data set has no mode.
• If two entries occur with the same greatest frequency, each entry is a mode (bimodal).
Example1: The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the mode of the flight prices.
872 432 397 427 388 782 397
• Ordering the data helps to find the mode.
388 397 397 427 432 782 872
The mode of the flight prices is $397.
Example2: At a political debate a sample of audience members was asked to name the political party to which they belong. Their responses are shown in the table. What is the mode of the responses?
Political Party Frequency, f
Democrat 34
Republican 56
Other 21
Did not respond 9
The mode is Republican (the response occurring with the greatest frequency).
Comparing the Mean, Median, and Mode
The mean is a reliable measure because it takes into account every entry of a data set, but it is greatly affected by outliers (data entries that are far removed from the other entries in the data set).
Example: Find the mean, median, and mode of the sample ages of a class shown. Which measure of central tendency best describes a typical entry of this data set? Are there any outliers?
Ages in a class
20 20 20 20 20 20 21
21 21 21 22 22 22 23
23 23 23 24 24 65
Mean: $\bar{x} = \frac{\sum x}{n} = \frac{20 + 20 + \dots + 24 + 65}{20} \approx 23.8$ years
Median: $\frac{21 + 22}{2} = 21.5$ years
Mode: 20 years (the entry occurring with the greatest frequency)
• The mean takes every entry into account, but is influenced by the outlier of 65.
• The median also takes every entry into account, and it is not affected by the outlier.
• In this case the mode exists, but it doesn't appear to represent a typical entry.
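The three measures for the class ages can be checked with Python's standard `statistics` module. The sketch below also recomputes the mean and median without the entry 65, as one informal way to see how strongly the outlier pulls on each measure; that second computation is an addition for illustration and is not part of the slide.

```python
import statistics

ages = [20, 20, 20, 20, 20, 20, 21,
        21, 21, 21, 22, 22, 22, 23,
        23, 23, 23, 24, 24, 65]

mean_age = statistics.mean(ages)      # sum of entries / number of entries
median_age = statistics.median(ages)  # mean of the two middle entries (n is even)
mode_age = statistics.mode(ages)      # entry with the greatest frequency

print(mean_age, median_age, mode_age)

# Illustration only: drop the outlier and compare
without_outlier = [a for a in ages if a != 65]
print(round(statistics.mean(without_outlier), 1),
      statistics.median(without_outlier))
```

Removing the single entry 65 lowers the mean by about two years, while the median moves by only half a year.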
Example: Finding a Weighted Mean
You are taking a class in which your grade is determined from five sources: 50% from your test mean, 15% from your midterm, 20% from your final exam, 10% from your computer lab work, and 5% from your homework. Your scores are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab), and 100 (homework). What is the weighted mean of your scores? If the minimum average for an A is 90, did you get an A?
Source Score, x Weight, w x∙w
Test Mean 86 0.50 86(0.50)= 43.0
Midterm 96 0.15 96(0.15) = 14.4
Final Exam 82 0.20 82(0.20) = 16.4
Computer Lab 98 0.10 98(0.10) = 9.8
Homework 100 0.05 100(0.05) = 5.0
Σw = 1 Σ(x∙w) = 88.6
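$\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{88.6}{1} = 88.6$
The data have varying weights, so a weighted mean is used. The weighted mean of 88.6 is below the minimum of 90, so you did not get an A.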
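The weighted mean above can be reproduced with a short Python sketch. The list names are illustrative, and the cutoff of 90 comes from the example statement.

```python
scores = [86, 96, 82, 98, 100]              # test mean, midterm, final, lab, homework
weights = [0.50, 0.15, 0.20, 0.10, 0.05]    # must line up with scores

# Weighted mean: sum of (score * weight) divided by sum of weights
weighted_mean = sum(x * w for x, w in zip(scores, weights)) / sum(weights)

print(round(weighted_mean, 1))
print("A" if weighted_mean >= 90 else "not an A")
```

Dividing by the sum of the weights keeps the formula valid even when the weights do not add up to exactly 1.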
The Shape of Distributions
Symmetric Distribution
• A vertical line can be drawn through the middle of a graph of the distribution and the resulting halves are approximately mirror images.
Uniform Distribution (rectangular)
• All entries or classes in the distribution have equal or approximately equal frequencies.
• Symmetric.
Skewed Left Distribution (negative skew)
• “Tail” of the graph elongates more to the left.
• The mean is to the left of the median.
Skewed Right Distribution (positive skew)
• “Tail” of graph elongates to the right.
• Mean is to the right of the median.
2.4 Measures of Deviation
Range
• Quantitative data only
• The difference between the maximum and minimum data entries in the set.
• Range = (Xmax - Xmin)
• Advantage: Easy to compute
• Disadvantage: Only uses 2 data entries (not all)
Example: Corporation A hired 10 graduates. The starting salaries for each graduate are shown. Find the range of the starting salaries.
Starting salaries (1000s of dollars)
41 38 39 45 47 41 44 41 37 42
Xmax = 47
Xmin = 37
Range = 47 – 37 = 10 (in 1000s of dollars)
• Variation in data
• How individual data values vary within a given data set
Corporation B’s starting salaries are below:
40 23 41 50 49 32 41 29 52 58
Xmax = 58
Xmin = 23
Range = 58 – 23 = 35
Note: Both corporations' data sets have the same mean, median, and mode. The range shows how varied the data are.
Deviation, Variance, and Standard Deviation
Deviation
• The difference between the data entry, x, and the mean of the data set.
• Population data set: Deviation of x = x – μ
• Sample data set: Deviation of x = x – x̄
Mean: $\mu = \frac{\sum x}{N} = \frac{415}{10} = 41.5$
Salary (1000s of dollars), x   Deviation, x – μ
41 41 – 41.5 = –0.5
38 38 – 41.5 = –3.5
39 39 – 41.5 = –2.5
45 45 – 41.5 = 3.5
47 47 – 41.5 = 5.5
41 41 – 41.5 = –0.5
44 44 – 41.5 = 2.5
41 41 – 41.5 = –0.5
37 37 – 41.5 = –4.5
42 42 – 41.5 = 0.5
Deviations for all data entries in Corporation A's starting salary data set.
Σx = 415 Σ(x – μ) = 0
The sum of deviations = 0. This is true for any data set, so we use the squares of the deviations instead.
Population Variance
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$, where the numerator $\sum (x - \mu)^2$ is the sum of squares, $SS_x$.
Population Standard Deviation
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample Variance
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Sample Standard Deviation
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Note: For grouped data organized into a frequency distribution, use:
$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}}$
(Population) Standard Deviation
Step 1: Find the mean of the data set: $\mu = \frac{\sum x}{N}$
Step 2: Find the deviation of each entry: x – μ
Step 3: Square each deviation: (x – μ)²
Step 4: Add to get the sum of squares: SSx = Σ(x – μ)²
Step 5: Divide by N to get the variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Step 6: Take the square root to get the standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
**Question**
How would the directions change for a SAMPLE standard deviation?
Standard Deviation
The following data represent the midterm grade percentages of all students in an algebra class. Find the standard deviation of the data.
57 55 72 75 84 69 69 90 68 76 85 50 56 13 76 49 93 78 73 60 62 70 38
Number of data values: N = 23
Mean: $\mu = \frac{\sum x}{N} = \frac{1518}{23} = 66$
Grades (x)   Deviation (x – μ)   (x – μ)²
57           -9                  81
55           -11                 121
72           6                   36
75           9                   81
84           18                  324
69           3                   9
69           3                   9
90           24                  576
68           2                   4
76           10                  100
85           19                  361
50           -16                 256
56           -10                 100
13           -53                 2809
76           10                  100
49           -17                 289
93           27                  729
78           12                  144
73           7                   49
60           -6                  36
62           -4                  16
70           4                   16
38           -28                 784
Σ(x – μ) = 0   SSx = Σ(x – μ)² = 7030
Variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{7030}{23} \approx 305.65$
Standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{305.65} \approx 17.48$
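The six steps for the population standard deviation can be traced in Python with the midterm grades. This is a sketch rather than a prescribed method; the last two lines show the sample version (divide by n – 1 instead of N) purely for comparison, since the grades here are treated as a full population.

```python
import math
import statistics

grades = [57, 55, 72, 75, 84, 69, 69, 90, 68, 76, 85, 50,
          56, 13, 76, 49, 93, 78, 73, 60, 62, 70, 38]

N = len(grades)
mu = sum(grades) / N                       # Step 1: mean
deviations = [x - mu for x in grades]      # Step 2: deviation of each entry
squares = [d ** 2 for d in deviations]     # Step 3: squared deviations
ss_x = sum(squares)                        # Step 4: sum of squares
variance = ss_x / N                        # Step 5: population variance
sigma = math.sqrt(variance)                # Step 6: population standard deviation

print(N, mu, ss_x, round(variance, 2), round(sigma, 2))

# For comparison only: treating the same entries as a sample
s = math.sqrt(ss_x / (N - 1))
print(round(s, 2), round(statistics.stdev(grades), 2))
```

The standard library functions `statistics.pstdev` and `statistics.stdev` compute the population and sample versions directly and can be used to cross-check the step-by-step result.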
Using Technology for Calculations
The TI-83/84 calculator can do some of this work for you.
1. <STAT> <ENTER>
2. Choose a column such as L3 and enter data.
3. <STAT>, Arrow over to <CALC> <ENTER>
4. See: 1-Var Stats <2nd> <L3> <ENTER>
5. See Readout such as this
Note: You can also do these functions separately using <LIST> <MATH>.
Interpreting Standard Deviation
• Standard deviation is a measure of the typical amount an entry deviates from the mean.
• The more the entries are spread out, the greater the standard deviation.
Empirical Rule (68 – 95 – 99.7 Rule)
For data with a (symmetric) bell-shaped distribution, the standard deviation has the following characteristics:
• About 68% of the data lie within one standard deviation of the mean.
• About 95% of the data lie within two standard deviations of the mean.
• About 99.7% of the data lie within three standard deviations of the mean.
[Figure: bell-shaped curve with the horizontal axis marked at x̄ – 3s, x̄ – 2s, x̄ – s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s. Areas between consecutive marks, from left to right: 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%.]
Example: Using the Empirical Rule
In a survey conducted by the National Center for Health Statistics, the sample mean height of women in the United States (ages 20-29) was 64 inches, with a sample standard deviation of 2.71 inches. Estimate the percent of the women whose heights are between 64 inches and 69.42 inches.
• Because the distribution is bell-shaped, you can use the Empirical Rule.
[Figure: bell-shaped curve marked at x̄ – 3s = 55.87, x̄ – 2s = 58.58, x̄ – s = 61.29, x̄ = 64, x̄ + s = 66.71, x̄ + 2s = 69.42, x̄ + 3s = 72.13. The region from 64 to 66.71 holds 34% of the data and the region from 66.71 to 69.42 holds 13.5%.]
34% + 13.5% = 47.5% of women are between 64 and 69.42 inches tall.
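The same estimate can be expressed as a small lookup in Python. This sketch assumes a bell-shaped distribution and only handles endpoints that fall a whole number of standard deviations from the mean; the names `BANDS` and `percent_between` are hypothetical.

```python
# Percent of data in each one-standard-deviation band under the
# Empirical Rule, keyed by (lower k, upper k) in standard deviations.
BANDS = {(-3, -2): 2.35, (-2, -1): 13.5, (-1, 0): 34.0,
         (0, 1): 34.0, (1, 2): 13.5, (2, 3): 2.35}

def percent_between(k_low, k_high):
    """Approximate percent of data between mean + k_low*s and mean + k_high*s.

    Only whole-number k values from -3 to 3 are supported.
    """
    return sum(p for (lo, hi), p in BANDS.items() if k_low <= lo and hi <= k_high)

mean, s = 64, 2.71
k_low = round((64 - mean) / s)       # 64 inches is 0 standard deviations from the mean
k_high = round((69.42 - mean) / s)   # 69.42 inches is 2 standard deviations above
print(percent_between(k_low, k_high))
```

Calling `percent_between(-1, 1)` or `percent_between(-2, 2)` recovers the 68% and 95% figures of the rule.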
Chebychev's Theorem
For data with any shape distribution:
• The portion of any data set lying within k standard deviations (k > 1) of the mean is at least $1 - \frac{1}{k^2}$.
• 2 standard deviations (k = 2): $1 - \frac{1}{2^2} = \frac{3}{4}$, or 75%. At least 75% of the data lie within 2 standard deviations of the mean.
• 3 standard deviations (k = 3): $1 - \frac{1}{3^2} = \frac{8}{9}$, or 88.9%. At least 88.9% of the data lie within 3 standard deviations of the mean.
Example: The age distribution for Florida is shown in the histogram. Apply Chebychev's Theorem to the data using k = 2. What can you conclude?
k = 2: μ – 2σ = 39.2 – 2(24.8) = –10.4 (use 0, because age is non-negative)
μ + 2σ = 39.2 + 2(24.8) = 88.8
Conclusion: At least 75% of the population of Florida is between 0 and 88.8 years old.
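The bound and the Florida interval can be computed with a brief Python sketch. The function name `chebychev_lower_bound` is hypothetical, and clamping the lower endpoint at 0 reflects the example's note that age cannot be negative.

```python
def chebychev_lower_bound(k):
    """Minimum portion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

mu, sigma, k = 39.2, 24.8, 2
low = max(0, round(mu - k * sigma, 1))   # ages cannot be negative
high = round(mu + k * sigma, 1)

print(f"At least {chebychev_lower_bound(k):.0%} of ages are between {low} and {high}")
print(f"k = 3 gives at least {chebychev_lower_bound(3):.1%}")
```

Unlike the Empirical Rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (75% rather than about 95% for k = 2).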
2.5 Measures of Position
• Fractiles are numbers that partition (divide) an ordered data set into equal parts.
• Quartiles approximately divide an ordered data set into four equal parts.
  First quartile, Q1: About ¼ of the data fall on or below Q1.
  Second quartile, Q2: About ½ of the data fall on or below Q2 (median).
  Third quartile, Q3: About ¾ of the data fall on or below Q3.
  Interquartile range (IQR): IQR = Q3 – Q1
Example: The test scores of 15 employees enrolled in a CPR training course are listed. Find the first, second, and third quartiles of the test scores.
13 9 18 15 14 21 7 10 11 20 5 18 37 16 17
Step 1: Order the data: 5 7 9 10 11 13 14 15 16 17 18 18 20 21 37
Step 2: Find the median (Q2): Q2 = 15. It splits the data into a lower half (5 7 9 10 11 13 14) and an upper half (16 17 18 18 20 21 37).
Step 3: Find Q1 and Q3 (the medians of the lower and upper halves, respectively): Q1 = 10 and Q3 = 18.
About ¼ of the employees scored 10 or less.
Percentiles: Divide a data set into 100 equal parts.
• Often used in education and health fields. Ex: A student scored in the 95th percentile on the math test, which is better than 95% of the other students.
• Q1 = 25th percentile, Q2 = 50th percentile, Q3 = 75th percentile
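The quartile steps above (order the data, take the median, then take the medians of the lower and upper halves) can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the median is left out of both halves when the number of entries is odd, as in the CPR example. Library routines may use other quartile conventions and can give slightly different values on other data sets.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle entry when n is odd so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

scores = [13, 9, 18, 15, 14, 21, 7, 10, 11, 20, 5, 18, 37, 16, 17]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, "IQR =", q3 - q1)
```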
Box-and-Whisker Plot
• Exploratory data analysis tool that highlights important features of a data set.
• Requires the five-number summary: minimum entry, Q1, Q2, Q3, and maximum entry.
Creating a Box-and-Whisker Plot
1. Find the five-number summary of the data set.
2. Construct a horizontal scale that spans the range of the data.
3. Plot the five numbers above the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at Q2.
5. Draw whiskers from the box to the minimum and maximum entries.
[Figure: box-and-whisker plot with labels for the minimum entry, left whisker, box from Q1 to Q3 with the median Q2 marked, right whisker, and maximum entry.]
Example: Draw a box-and-whisker plot.
Minimum value = 6, Q1 = 10, Q2 = 18, Q3 = 31, Maximum value = 104
About half the scores are between 10 and 31. There is a possible outlier of 104.
The Standard Score (Z-Score)
• The number of standard deviations a given value x falls from the mean μ.
• Negative Z : The x-value is below the mean
• Positive Z : The x-value is above the mean
• Zero Z : The x-value is equal to the mean
$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$
Example: In 2007, Forest Whitaker won the Best Actor Oscar at age 45 for his role in the movie The Last King of Scotland. Helen Mirren won the Best Actress Oscar at age 61 for her role in The Queen. The mean age of all best actor winners is 43.7, with a standard deviation of 8.8. The mean age of all best actress winners is 36, with a standard deviation of 11.5. Find the z-score that corresponds to the age for each actor or actress. Compare results.
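• Forest Whitaker: $z = \frac{x - \mu}{\sigma} = \frac{45 - 43.7}{8.8} \approx 0.15$, which is 0.15 standard deviations above the mean (usual range).
• Helen Mirren: $z = \frac{x - \mu}{\sigma} = \frac{61 - 36}{11.5} \approx 2.17$, which is 2.17 standard deviations above the mean (unusual range).
Unusual scores occur about 5% of the time.
Very unusual scores occur about 0.3% of the time.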
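The two z-scores can be verified with a one-line function in Python. The name `z_score` is illustrative; the ages, means, and standard deviations are the ones given in the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

whitaker = z_score(45, 43.7, 8.8)   # Best Actor winner, age 45
mirren = z_score(61, 36, 11.5)      # Best Actress winner, age 61

print(round(whitaker, 2), round(mirren, 2))
```

Both values are positive, so both winners were older than the mean age for their category, but Helen Mirren's age lies much farther from its mean in standard deviation units.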