Class Session #2 Numerically Summarizing Data
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Central Tendency and Dispersion from Grouped Data
• Measures of Position
Recall the Definitions
• Parameter – a descriptive measure of a population
(p = parameter = population, usually in Greek letters)
• Statistic – a descriptive measure of a sample
(s = statistic = sample, usually in Roman letters)
Common “descriptions”
• ? Average ? – “typical” as described in the news reports
• Give some of today’s examples
• Data distributions’ “characteristics”
– Shape – look at a picture (histogram)
– Center – mean, mode, median
– Spread – range, variance, std. dev.
Central Tendency Definitions
• Arithmetic mean – the sum of all the values of the variable in the data set, divided by the number of observations
• Population arithmetic mean – computed using all the individuals in the population (μ, the Greek letter mu, pronounced “mew”; not the micro sign µ)
• Sample arithmetic mean – computed using the sample data (x̄, read “x-bar”)
• Note: x̄ is a statistic, μ is a parameter
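In symbols, the two means can be sketched as follows. The notation is an assumption (the usual textbook convention): x_i for the individual values, N for the population size, and n for the sample size.

```latex
\mu = \frac{\sum_{i=1}^{N} x_i}{N}
\qquad\qquad
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
```

The arithmetic is the same in both cases; only the set of values being averaged (whole population versus sample) differs.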
More Central Tendency Defs
• Median – the value that lies in the middle of the data when arranged in ascending order (think of the median strip in the middle of a highway)
• Mode – the most frequent observation of the variable in the data set (think “à la mode”: in fashion / on top)
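A minimal sketch of the three measures of center, using Python's standard `statistics` module. The scores are made-up illustration data, not from the class.

```python
from statistics import mean, median, multimode

# Hypothetical sample of seven quiz scores (illustrative only)
scores = [7, 9, 9, 10, 4, 9, 8]

x_bar = mean(scores)        # sum of the values divided by the number of observations
middle = median(scores)     # middle value once the data are sorted
modes = multimode(scores)   # most frequent value(s); a list, because ties are possible

print("mean:", x_bar)
print("median:", middle)
print("mode(s):", modes)
```

`multimode` is used here rather than `mode` because a data set may have no single most frequent value.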
Measures of Dispersion Definitions
• Range (R) – the difference between the largest data value (maximum) & the smallest data value (minimum)
• Deviation about the mean – how far a data value lies from the mean; shows how “spread out” the data are
• ? For both a population and a sample, the sum of all deviations about the mean equals what ?
• ? The square of a non-zero number is ?
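The range and the deviations about the mean can be sketched as below, on made-up data. The positive and negative deviations cancel, so their sum is zero, while the squared deviations are never negative; that is why the variance definitions work with squares.

```python
# Hypothetical data values (illustrative only)
data = [2, 4, 4, 5, 10]

data_range = max(data) - min(data)        # largest value minus smallest value

x_bar = sum(data) / len(data)             # arithmetic mean
deviations = [x - x_bar for x in data]    # each value's deviation about the mean
squared = [d ** 2 for d in deviations]    # squared deviations are all >= 0

print("range:", data_range)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
```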
More Measures of Dispersion Definitions
• Population Variance – the sum of squared deviations about the population mean, divided by the number of observations in the population, N (written σ², “sigma squared”)
• ? i.e. population variance is the mean of the ______ _________ ____ __ _________ ___ ?
Answer: Population variance is the mean of the squared deviations about the population mean
More Measures of Dispersion Definitions
• Sample Variance – the sum of the squared deviations about the sample mean, divided by the number of observations minus one (written s², “s squared”)
• The divisor “n – 1” is the degrees of freedom
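Written out, the two variance definitions look like this. The symbols x_i, N, and n are assumed notation, following the usual convention.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```

The numerators have the same form; the only structural difference is the divisor, N for a population and n – 1 for a sample.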
More Measures of Dispersion Definitions
• Population Standard Deviation – the square root of the population variance (sigma, written as “σ”)
• Sample Standard Deviation – the square root of the sample variance (written as “s”)
• BTW, later we discover that “s” itself is a random variable
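A short sketch that computes variance and standard deviation directly from the definitions and cross-checks them against Python's `statistics` module. The data are made up; the same five values are treated first as a whole population, then as a sample.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data values (illustrative only)
data = [2, 4, 4, 5, 10]

n = len(data)
center = sum(data) / n
sum_sq_dev = sum((x - center) ** 2 for x in data)  # sum of squared deviations

pop_var = sum_sq_dev / n          # population variance: divide by N
samp_var = sum_sq_dev / (n - 1)   # sample variance: divide by n - 1
pop_sd = sqrt(pop_var)            # population standard deviation
samp_sd = sqrt(samp_var)          # sample standard deviation

# The library functions follow the same definitions
print(pop_var, pvariance(data))
print(samp_var, variance(data))
print(pop_sd, pstdev(data))
print(samp_sd, stdev(data))
```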
Empirical Rule for Symmetric Data
• If the distribution is bell shaped, then approximately:
– 68% of the data lie within 1 standard deviation of the mean
– 95% of the data lie within 2 standard deviations of the mean
– 99.7% of the data lie within 3 standard deviations of the mean
• Rule holds for both samples & populations
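As a rough check, the three percentages match an exactly normal (bell-shaped) curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the mean of 100 and standard deviation of 15 are made-up numbers used only to show the resulting intervals.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def empirical_interval(mean, sd, k):
    """Interval from k standard deviations below the mean to k above it."""
    return (mean - k * sd, mean + k * sd)

# Hypothetical bell-shaped variable with mean 100 and standard deviation 15
for k in (1, 2, 3):
    print(k, round(share_within(k), 4), empirical_interval(100, 15, k))
```

Real data are only approximately bell shaped, so observed percentages will usually differ a little from these.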
Supposing Grouped Data
• Approximate mean of a variable from a frequency distribution
– Use the midpoint of each class
– Use the frequency of each class
– Use the number of classes
• Population Mean
• Sample Mean
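A minimal sketch of the grouped-data mean: each class is represented by its midpoint, weighted by the class frequency, and the total is divided by the sum of the frequencies. The frequency distribution and the function name `grouped_mean` are made up for illustration.

```python
# Hypothetical frequency distribution: (lower class limit, upper class limit, frequency)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

def grouped_mean(classes):
    """Approximate the mean from class midpoints and class frequencies."""
    total_frequency = sum(freq for _, _, freq in classes)
    weighted_sum = sum(((low + high) / 2) * freq for low, high, freq in classes)
    return weighted_sum / total_frequency

print(grouped_mean(classes))
```

The result is only an approximation, because every observation in a class is treated as if it sat at the class midpoint.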
Supposing Grouped Data
• Weighted Mean
– Good to use when certain data values have higher importance (or weight)
– [Sum of each value of the variable times its weight] / [sum of weights]
– Examples: Grade Point Average (GPA) and mixed nuts pricing
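The weighted mean formula can be sketched as below with a GPA-style example. The grade points and credit hours are made up, and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical semester: grade points earned, weighted by each course's credit hours
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```

The same function fits the pricing case: values are the prices per pound and weights are the pounds bought.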
Supposing Grouped Data
• Population Variance
– sum of [(midpoint – mean)² times frequency] / [sum of frequencies]
• Sample Variance
– as before, except the denominator is [(sum of frequencies) – 1] (the degrees of freedom thing again)
Supposing Grouped Data
• Population Standard Deviation
– take square root of population variance
• Sample Standard Deviation
– take square root of sample variance
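A sketch of the grouped-data variance and standard deviation, following the verbal formulas on the grouped-data slides. The frequency distribution and the function name `grouped_variance` are made up for illustration.

```python
from math import sqrt

# Hypothetical frequency distribution: (lower class limit, upper class limit, frequency)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

def grouped_variance(classes, sample=False):
    """Approximate variance from class midpoints and frequencies.

    Divides by the sum of frequencies (population) or by that sum minus one (sample).
    """
    midpoints = [(low + high) / 2 for low, high, _ in classes]
    freqs = [freq for _, _, freq in classes]
    total = sum(freqs)
    center = sum(m * f for m, f in zip(midpoints, freqs)) / total
    sum_sq = sum((m - center) ** 2 * f for m, f in zip(midpoints, freqs))
    return sum_sq / (total - 1 if sample else total)

pop_var = grouped_variance(classes)
samp_var = grouped_variance(classes, sample=True)
print(pop_var, sqrt(pop_var))     # population variance and standard deviation
print(samp_var, sqrt(samp_var))   # sample variance and standard deviation
```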
Measures of Position Definition
• z-Score – the distance that a data value is from the mean, in terms of standard deviations. Equals [(data value minus mean) divided by standard deviation]
• Population z-score: z = (x – μ) / σ
• Sample z-score: z = (x – x̄) / s
Measures of Position Definitions
• z-score equals [(data value minus mean) divided by standard deviation]
• Is a “unitless” measure
• Can be “normalized” to get
– Mean of zero
– Standard deviation of one
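One way to see the “mean of zero, standard deviation of one” property: convert every value in a data set to its z-score and summarize the results. The data are made up and treated here as a whole population.

```python
from statistics import mean, pstdev

# Hypothetical population of data values (illustrative only)
data = [2, 4, 4, 5, 10]

mu = mean(data)        # population mean
sigma = pstdev(data)   # population standard deviation

# z-score: (data value minus mean) divided by standard deviation
z_scores = [(x - mu) / sigma for x in data]

z_mean = mean(z_scores)
z_sd = pstdev(z_scores)
print(z_scores)
print(z_mean, z_sd)  # approximately 0 and 1, up to floating-point rounding
```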
Measures of Position Definitions
• z-score purpose is to provide a way to "compare apples and oranges"
• by converting variables with different centers and/or spreads
• to variables with the same center (0) and spread (1).
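A small sketch of the “apples and oranges” comparison. The two exams, their means, and their standard deviations are invented numbers; `z_score` is a hypothetical helper.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical: the same student takes two exams graded on very different scales
exam_a = z_score(82, mean=70, sd=8)      # exam scored out of 100
exam_b = z_score(620, mean=500, sd=100)  # exam scored out of 800

print(exam_a, exam_b)
print("stronger relative result:", "exam A" if exam_a > exam_b else "exam B")
```

The raw scores 82 and 620 cannot be compared directly, but the z-scores can, because both are expressed in standard deviations from their own mean.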
Measures of Position Definition
• Percentiles – the kth percentile of a set of data divides the lower k% of the data from the upper (100 – k)%
• Divide into 100 parts, so 99 percentiles exist
• “P sub k”
• Use to give relative standing of the data
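A hedged sketch of percentiles using `statistics.quantiles` from the Python standard library. With `n=100` it returns the 99 cut points P1 through P99. The data are made up, and the interpolation rule the library uses is an assumption that may differ slightly from the hand method taught in class.

```python
from statistics import quantiles

# Hypothetical data set: the integers 1 through 200
data = list(range(1, 201))

cut_points = quantiles(data, n=100)  # 99 cut points: P1, P2, ..., P99
p90 = cut_points[89]                 # index 89 holds the 90th percentile

# Relative standing: share of the data that falls below P90
share_below = sum(x < p90 for x in data) / len(data)

print(len(cut_points))
print(p90, share_below)
```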
Measures of Position Definition
• Quartiles – divide the data into four equal parts
• Four parts, so three quartiles exist
• “Q sub one, two, or three”
• Q2 is the median of the data
• Q1 is the median of the lower half
• Q3 is the median of the upper half
Numerical summary of data
• Five number summaries
• Interquartile range (Q3 – Q1) is resistant to extreme values
• Compute five number summary
– Min value | Q1 | M | Q3 | max value
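A sketch of the five number summary using the “median of each half” rule for Q1 and Q3. One assumption to flag: when the number of observations is odd, the middle value is left out of both halves, which is a common textbook convention but not the only one. The data and the function name are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, M, Q3, max); Q1 and Q3 are medians of the lower and upper halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]             # values below the median position
    upper_half = values[half + (n % 2):]   # values above it (middle value skipped if n is odd)
    return (min(values), median(lower_half), median(values),
            median(upper_half), max(values))

# Hypothetical data values (illustrative only)
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

minimum, q1, m, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range
print(minimum, q1, m, q3, maximum)
print("IQR:", iqr)
```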
Building a Box Plot – part 1
• 1. Calculate interquartile range (IQR)
• 2. Compute lower & upper fence
– Lower fence = Q1 – 1.5 (IQR)
– Upper fence = Q3 + 1.5 (IQR)
• 3. Draw scale then mark Q1 and Q3
• 4. Box in Q1 to Q3 then mark M
Building a Box Plot – part 2
• 5. Temporarily mark fences with brackets
• 6. Draw line from Q1 to smallest value inside the lower fence and a line from Q3 to largest value inside the upper fence
• 7. Put * for all values outside of the fences
• 8. Erase brackets
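The numeric side of the eight steps (IQR, fences, line ends, and starred values) can be sketched as follows. The data are made up, Q1 and Q3 are passed in as already computed, and `box_plot_parts` is a hypothetical helper; values exactly on a fence are treated as inside, which is an assumption.

```python
def box_plot_parts(data, q1, q3):
    """Compute the numbers needed to draw a box plot by hand."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        # the lines run from the box out to the most extreme values inside the fences
        "line_ends": (min(inside), max(inside)),
        # values outside the fences are marked with *
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }

# Hypothetical data; Q1 = 3.5 and Q3 = 11.0 are the medians of its lower and upper halves
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]
parts = box_plot_parts(data, q1=3.5, q3=11.0)
print(parts)
```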
Distribution based on Boxplot
• Symmetric
– median near center of box
– horizontal lines about same length
• Skewed Right / Positive Skew
– median towards left of box
– right line much longer than left line
• Skewed Left / Negative Skew
– median towards right of box
– left line much longer than right line
Which measure best to report?
• Symmetric distribution
– Mean
– Standard Deviation
• Skewed distribution
– Median
– Interquartile Range
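A small illustration of why the median is the usual choice for skewed data: a single extreme value moves the mean a long way but barely moves the median. The figures are invented for illustration.

```python
from statistics import mean, median

# Hypothetical values in thousands of dollars (illustrative only)
values = [30, 32, 35, 38, 40]
with_extreme = values + [400]   # add one very large value

print(mean(values), median(values))               # roughly symmetric: mean and median agree
print(mean(with_extreme), median(with_extreme))   # the mean is pulled toward the extreme value
```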
Self Quiz
• When can the mean and the median be about equal?
• In the 2000 census conducted by the U.S. Census Bureau, two average household incomes were reported: $41,349 and $55,263. One of these averages is the mean and the other is the median. Which is which and why?
Self Quiz
• The U.S. Department of Housing and Urban Development (HUD) uses the median to report the average price of a home in the United States.
• Why do they do that?
Self Quiz
• A histogram of a set of data indicates that the distribution of the data is skewed right.
• Which measure of central tendency will be larger, the mean or the median?
• Why?
Self Quiz
• If a data set contains 10,000 values arranged in increasing order, where is the median located?
• Matching: (parameter; statistic)
– _____ is a descriptive measure of a population
– _____ is a descriptive measure of a sample
Self Quiz
• A data set will always have exactly one mode. (true or false)
• If the number of observations, n, is odd, then the median, M, is the value calculated by the formula M = (n + 1)/2
Self Quiz
• Find the Sample Mean: 20, 13, 4, 8, 10
• Find the Sample Mean: 83, 65, 91, 87, 84
• Find the Population Mean: 3, 6, 10, 12, 14
Self Quiz
• The median for the given list of six data values is 26.5.
• 7, 12, 21, ___, 41, 50
• What is the missing value?
Self Quiz
• The following data represent the monthly cell phone bills for six randomly selected months.
• $35.34 $42.09 $39.43
• $38.93 $43.39 $49.26
• Compute the mean, median, and mode cell phone bill.
Self Quiz
• Heather and Bill go to the store to purchase nuts, but cannot decide among peanuts, cashews, or almonds. They agree to create a mix. They bought 2.5 pounds of peanuts for $1.30 per pound, 4 pounds of cashews for $4.50 per pound, and 2 pounds of almonds for $3.75 per pound. Determine the price per pound of the mix.