Statistics 12 - rcc-jlo.weebly.com
Statistics 12
Chapter 2: Modeling Distribution of Data
Dr. John Lo, Royal Canadian College
2020-2021
1. Describing a location in a distribution
CHAPTER 2: MODELING DISTRIBUTION OF DATA 2RCC @ 2020/2021
› Example: Here are the scores of all 25 students in a statistics class on their first test:
› The bold score is Jenny’s mark. How did she perform on this test relative to her classmates?
A. Percentiles
› One way to describe Jenny’s location in the distribution of test scores is to tell what percent of students in the class earned scores that were below Jenny’s score.
› That is, we can calculate Jenny’s percentile:
› If the scores are displayed in the form of a stemplot:
› We see that Jenny’s 86 places her fourth from the top of the class.
› Because 21 of the 25 observations (84%) are below her score, Jenny is at the 84th percentile in the class’s test score distribution.
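The percentile calculation can be sketched in Python. This is a minimal illustration: the helper name `percentile_below` and the short score list are hypothetical (the class's 25 scores are not reproduced here). Only the definition, the percent of observations strictly below a given value, comes from the notes.

```python
def percentile_below(values, x):
    """Percent of observations in `values` that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical scores for illustration only (not the class data).
scores = [67, 72, 75, 79, 80, 83, 86, 90, 93, 95]
print(percentile_below(scores, 86))  # 6 of the 10 scores are below 86

# Jenny's case from the notes: 21 of the 25 scores lie below hers.
print(100 * 21 / 25)  # 84.0
```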
› Practice: Here are data from a random sample of 20 female and 20 male students at a large high school regarding the number of pairs of shoes they have.
a) Find and interpret the percentile in the female distribution for the girl with 22 pairs of shoes.
b) Find and interpret the percentile in the male distribution for the boy with 22 pairs of shoes.
c) Who is more unusual? Explain.
B. Cumulative relative frequency graphs
› One of the most common graphs that can be made with percentiles starts with a frequency table for a quantitative variable.
› For example, the following table lists the ages of the first 44 U.S. presidents when they took office:
› This table can be further expanded to include relative frequencies, cumulative frequencies and cumulative relative frequencies:
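As a rough sketch of how such an expanded table is built, the function below adds the three extra columns to a frequency table. The function name, class labels and counts are hypothetical placeholders, not the presidents' data.

```python
from itertools import accumulate


def cumulative_table(classes, freqs):
    """Rows of (class, frequency, relative freq. %, cumulative freq., cumulative relative freq. %)."""
    total = sum(freqs)
    cum_freqs = list(accumulate(freqs))  # running totals of the frequencies
    return [
        (c, f, 100 * f / total, cf, 100 * cf / total)
        for c, f, cf in zip(classes, freqs, cum_freqs)
    ]


# Hypothetical frequency table (illustration only).
classes = ["0-9", "10-19", "20-29", "30-39"]
freqs = [2, 6, 8, 4]
for row in cumulative_table(classes, freqs):
    print(row)
```

The last row always has a cumulative relative frequency of 100%, which is a quick check that the table was built correctly.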
› The points corresponding to the cumulative relative frequency in each class can be plotted to create a cumulative relative frequency graph as follows:
› A cumulative relative frequency graph can be used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution.
› Example: Was Barack Obama, who was first inaugurated at age 47, unusually young?
› Example: Estimate and interpret the 65th percentile of the distribution.
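Reading a cumulative relative frequency graph amounts to linear interpolation between the plotted points. The sketch below works in both directions: from a value to its percentile, and from a percentile to a value. The helper `interpolate` and the plotted points are hypothetical, chosen only to illustrate the idea.

```python
def interpolate(x, xs, ys):
    """Linearly interpolate y at x, given increasing xs and matching ys."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the plotted range")


# Hypothetical graph: class boundaries and cumulative relative frequencies (%).
bounds = [0, 10, 20, 30, 40]
cum_pct = [0, 10, 40, 80, 100]

# Position of an individual: estimated percentile of the value 25.
print(interpolate(25, bounds, cum_pct))

# Locating a percentile: estimated value at the 65th percentile.
print(interpolate(65, cum_pct, bounds))
```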
C. z-scores
› Recall the example about the test scores on page 2.
› Jenny is at the 84th percentile in the class distribution.
› Where does Jenny’s score of 86 fall relative to the mean of this distribution?
› Statistical analysis of the test scores gives the following data (by Excel, Minitab, R or other software):
Number of scores     25       Minimum   67.00
Mean                 80.00    Maximum   93.00
Median               80.00    Q1        76.00
Standard deviation   6.07     Q3        83.50
› Jenny’s score of 86 is about one standard deviation above the mean.
› The process of transforming original data to standard deviation units is called standardization, and the resulting values are called standardized scores or z-scores.
› A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.
› Recall that Jenny’s score on the test was 86; therefore the corresponding z-score is:
$z = \frac{86 - 80}{6.07} \approx 0.99$
› That is, Jenny’s test score is 0.99 standard deviations above the class mean score of 80.
› Note that z-scores can be either positive or negative.
▪ Positive: the original value is above the mean
▪ Negative: the original value is below the mean
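The standardization step can be written as a one-line function. The name `z_score` is a hypothetical helper. The mean (80.00), standard deviation (6.07), Jenny's score (86) and the minimum score (67) are taken from the summary statistics in the notes.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Jenny's score: above the mean, so the z-score is positive.
print(round(z_score(86, mean=80.00, sd=6.07), 2))  # 0.99

# The lowest score in the class: below the mean, so the z-score is negative.
print(round(z_score(67, mean=80.00, sd=6.07), 2))  # -2.14
```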
› Practice: The day after receiving her statistics test result of 86, Jenny earned an 82 on a chemistry test. The distribution of scores was fairly symmetric with a mean of 76 and a standard deviation of 4. Did she do better or worse in the chemistry test than the statistics test?
› To find the z-score for an individual observation, this data value is transformed by subtracting the mean and dividing the difference by the standard deviation.
› What is the effect of adding (or subtracting) a constant?
› How about the effect of multiplying (or dividing) by a constant?
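Both questions can be summarized by one standard result for a linear transformation. The symbols $a$ and $b$ below are introduced here for illustration: every observation $x_i$ is replaced by $a + b\,x_i$.

```latex
y_i = a + b\,x_i
\quad\Longrightarrow\quad
\bar{y} = a + b\,\bar{x},
\qquad
s_y = |b|\,s_x
```

Adding a constant $a$ shifts the measures of center and location (mean, median, quartiles, percentiles) by $a$ but leaves the measures of spread (range, IQR, standard deviation) and the shape unchanged. Multiplying by a positive constant $b$ multiplies the measures of center, location and spread by $b$, again without changing the shape.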
› Example: A group of students in Australia was asked to guess the width of their classroom to the nearest meter. Here are their guesses in order from lowest to highest.
a) Perform a statistical analysis of this study.
b) Note that the actual width of the classroom is 13 meters. Discuss the error distribution of the data set.
c) Convert the error data into the units of feet. Discuss the distribution. Is there any change from that in b)?
a) A dot plot of the data and the summary of the statistical analysis of the data set are given below.
b) The error of each guess is determined by
error = guess − 13
› The results are as follows:
c) Note that there are 3.28 feet per meter; therefore, all data will be transformed by multiplying by a factor of 3.28.
› The results are as follows:
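The two transformations in parts b) and c) can be checked numerically. In this sketch the list of guesses is hypothetical (the students' data are not reproduced here). The actual width of 13 m and the factor of 3.28 feet per meter come from the example.

```python
import statistics

ACTUAL_WIDTH_M = 13    # actual classroom width, from the example
FEET_PER_METER = 3.28  # conversion factor, from the example

# Hypothetical guesses in meters (illustration only).
guesses_m = [8, 10, 12, 13, 15, 16, 18, 22, 40]

errors_m = [g - ACTUAL_WIDTH_M for g in guesses_m]   # subtract a constant
errors_ft = [FEET_PER_METER * e for e in errors_m]   # multiply by a constant


def summary(data):
    """Return (mean, median, sample standard deviation)."""
    return statistics.mean(data), statistics.median(data), statistics.stdev(data)


for label, data in [("guess (m)", guesses_m), ("error (m)", errors_m), ("error (ft)", errors_ft)]:
    mean, median, sd = summary(data)
    print(f"{label:11s} mean={mean:7.2f} median={median:7.2f} sd={sd:6.2f}")
```

Subtracting 13 lowers the mean and median by 13 but leaves the standard deviation unchanged. Multiplying by 3.28 scales all three summaries by 3.28.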
2. Density curves and Normal distributions
A. Density curves
› Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve called a density curve.
› Example: 947 7th-grade students in Gary, Indiana participated in the Iowa Test of Basic Skills (ITBS), and their scores on the vocabulary part were plotted below:
Features:
• Smooth and symmetric
• Unimodal
• No gaps or outliers
› Note that no set of real data is exactly described by a density curve; yet it is still a good approximation that is easy to use and accurate enough for practical use.
› Example:
30.3% of the students actually scored 6 or below, whereas the density curve gives 29.3%.
› The mean and median are used to describe a density curve in the same way as real data sets.
› Schematically:
› A key feature: The mean and median of a symmetric density curve are equal!
B. Normal distribution
› A common type of density curve used in statistics is the Normal curve, whose shape is determined by the mean 𝜇 and standard deviation 𝜎 of the data set:
› The mathematical formula associated with the Normal curve is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
› Graphically, Normal curves look like:
› There are several important facts about Normal curves:
1. All Normal curves have the same overall shape: symmetric, single-peaked (or unimodal), and bell-shaped.
2. Any specific Normal curve is completely described by giving its mean 𝜇 and its standard deviation 𝜎.
3. The mean is located at the center of the symmetric curve and is the same as the median. Changing 𝜇 without changing 𝜎 moves the curve along the horizontal axis without changing its spread.
4. The standard deviation 𝜎 controls the spread of a Normal curve. Curves with larger standard deviations are more spread out.
5. The curve changes its curvature at a distance 𝜎 on either side of the mean.
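Several of these facts can be checked numerically by coding the Normal curve formula directly. This is only a sketch: the function name `normal_pdf` and the particular values of 𝜇 and 𝜎 are illustrative choices, not part of the notes.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 10.0, 2.0  # illustrative values

# Fact 3: the curve is symmetric about the mean.
print(normal_pdf(mu - 1.5, mu, sigma), normal_pdf(mu + 1.5, mu, sigma))

# Fact 3: changing mu only slides the curve along the axis (same peak height).
print(normal_pdf(mu, mu, sigma), normal_pdf(mu + 5, mu + 5, sigma))

# Fact 4: a larger sigma gives a lower, more spread-out curve.
print(normal_pdf(mu, mu, sigma), normal_pdf(mu, mu, 2 * sigma))
```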
› Normal distributions are useful in statistics:
1. They are good descriptions for some distributions of real data such as:
• Scores on large-scale tests (e.g. SAT, IQ)
• Repeated careful measurements (e.g. scientific experiments)
• Characteristics of biological populations (e.g. yields of corn on a farm)
2. They are good approximations to the results of many kinds of chance (i.e., probabilistic) outcomes.
3. Many statistical inference procedures are based on Normal distributions.
› A common feature shared by all Normal curves is called the 68-95-99.7 rule or empirical rule (see p.84 of Unit 1 notes).
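The rule states that in any Normal distribution approximately 68% of the observations fall within 𝜎 of the mean, 95% within 2𝜎 and 99.7% within 3𝜎. A short check using `statistics.NormalDist` from the Python standard library (Python 3.8 or later) is sketched below. The helper name `within_k_sd` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1


def within_k_sd(k):
    """Area under the Normal curve within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * within_k_sd(k):.1f}%")
```

The computed areas are roughly 68.3%, 95.4% and 99.7%, which the rule rounds to 68, 95 and 99.7.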
› Example: The distribution of ITBS vocabulary scores for seventh-graders in Gary, Indiana, is N(6.84, 1.55).
a) What percent of the ITBS vocabulary scores are less than 3.74?
b) What percent of the scores are between 5.29 and 9.94?
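Each cut-off in this example sits a whole number of standard deviations from the mean (3.74 = 𝜇 − 2𝜎, 5.29 = 𝜇 − 𝜎, 9.94 = 𝜇 + 2𝜎), so the 68-95-99.7 rule gives about 2.5% for a) and about 34% + 47.5% = 81.5% for b). As a cross-check, the sketch below computes the areas directly with `statistics.NormalDist`. The variable names are illustrative.

```python
from statistics import NormalDist

itbs = NormalDist(mu=6.84, sigma=1.55)  # N(6.84, 1.55) from the example

# a) Proportion of scores below 3.74 (two standard deviations below the mean).
p_below = itbs.cdf(3.74)

# b) Proportion of scores between 5.29 and 9.94.
p_between = itbs.cdf(9.94) - itbs.cdf(5.29)

print(f"a) {100 * p_below:.1f}%   b) {100 * p_between:.1f}%")
```

The computed values (about 2.3% and 81.9%) differ slightly from the rule-of-thumb answers because 68, 95 and 99.7 are themselves rounded.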
C. Standard Normal distribution
› Based on the 68-95-99.7 rule, we know that all Normal curves share many properties. Indeed, all Normal curves are the same if we measure in units of 𝜎 from the mean 𝜇 as center.
› This is the standardization of data that follow a Normal distribution (recall the definition of z-scores):
$z = \frac{x - \mu}{\sigma}$
› The resulting variable 𝑧 also follows a Normal distribution, with mean 0 and standard deviation 1.
› The new distribution is called the standard Normal distribution:
› The area under a standard Normal curve is equal to 1.
› Any question about what proportion of observations lies in some range of values can be answered by finding an area under the curve.
› Example: What is the proportion of observations that falls within one standard deviation of the mean?
› Solution: The z-scores corresponding to one standard deviation from the mean are:
$z = \frac{(\mu \pm \sigma) - \mu}{\sigma} = \pm 1$
› Based on the 68-95-99.7 rule, approximately 68% of the observations fall between 𝑧 = −1 and 𝑧 = 1.
› In general, the area under the curve for the standard Normal distribution can be determined using the standard Normal table.
› For example, to find the area to the left of 𝑧 = 0.81:
› Example: Find the proportion of observations from the standard Normal distribution that are greater than −1.78.
› Example: Find the proportion of observations from the standard Normal distribution that are between −1.25 and 0.81.
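The three table look-ups in these examples (area to the left, area to the right, area between two values) can be reproduced in software. The sketch below uses `statistics.NormalDist().cdf`, which plays the role of the standard Normal table by returning the area to the left of a given z. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

# Area to the left of z = 0.81: read directly from the table (or cdf).
left_of_081 = Z.cdf(0.81)

# Area to the right of z = -1.78: one minus the area to its left.
right_of_m178 = 1 - Z.cdf(-1.78)

# Area between z = -1.25 and z = 0.81: difference of the two left areas.
between = Z.cdf(0.81) - Z.cdf(-1.25)

print(round(left_of_081, 4), round(right_of_m178, 4), round(between, 4))
```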
› Conversely, a given area under the standard Normal curve can be used to find the corresponding z-score.
› Example: What is the z-score with area 0.90 to its left?
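Working backward from an area to a z-score corresponds to the inverse of the table look-up. In Python this is available as `NormalDist.inv_cdf`. The variable name `z_90` is illustrative.

```python
from statistics import NormalDist

# z-score with area 0.90 to its left under the standard Normal curve.
z_90 = NormalDist().inv_cdf(0.90)
print(round(z_90, 2))  # 1.28
```

With a printed table, the same answer is found by searching the body of the table for the entry closest to 0.90 and reading off the corresponding z.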
› Practice: Tiger Woods practises his golf swing at the driving range by hitting a ball with the same club many, many times. The distance his ball travels off the tee in yards follows a Normal distribution N(304, 8).
a) What percent of Tiger’s drives travel at least 290 yards?
b) What percent of Tiger’s drives travel between 305 and 325 yards?
› Practice: High levels of cholesterol in the blood increase the risk of heart disease. For 14-year-old boys, the distribution of blood cholesterol is approximately Normal with mean 170 milligrams of cholesterol per deciliter of blood (mg/dl) and standard deviation 30 mg/dl. What is the first quartile of the distribution of blood cholesterol?
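One way to check an answer to this kind of "backward" problem is to find the z-score with the required area to its left and then unstandardize with 𝑥 = 𝜇 + 𝑧𝜎. The sketch below assumes the first quartile is taken as the value with area 0.25 to its left. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 170, 30  # mg/dl, from the problem statement

# Step 1: z-score with area 0.25 to its left (standard Normal).
z_q1 = NormalDist().inv_cdf(0.25)

# Step 2: unstandardize, x = mu + z * sigma.
q1 = mu + z_q1 * sigma

print(round(z_q1, 2), round(q1, 1))
```

A table-based solution rounds z to two decimal places, so its answer may differ slightly from the software value.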
D. Assessing Normality
› The Normal distributions provide good models for some distributions of real data such as SAT and IQ tests.
› However, it is not safe to always assume Normal distributions. Some common variables such as personal income and lifetimes of electronic devices are skewed and therefore non-Normal.
› Therefore, it is necessary to develop a strategy for assessing Normality of data sets.
› There are two common ways:
a) Comparison with 68-95-99.7 rule
b) Normal probability plot
› Example: Here are the data on unemployment rates in the 50 U.S. states.
› Are the data close to Normal?
› The histogram below shows the data on unemployment rates:
› Apparently it is non-Normal!
› Statistical analysis yields: 𝜇 = 8.682 and 𝜎 = 2.225.
› If the data distribution is close to Normal, it should fulfil the 68-95-99.7 rule.
› Here is the comparison:
› The percents are quite close to 68%, 95% and 99.7%. Therefore, the data distribution is approximately Normal.
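The comparison can be automated by counting how many observations fall within one, two and three standard deviations of the mean. This is a minimal sketch: the function name `percent_within` and the data list are hypothetical (the 50 state rates are not reproduced here), and the sample standard deviation is assumed.

```python
import statistics


def percent_within(data, k):
    """Percent of observations within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # assumption: sample standard deviation
    inside = sum(1 for x in data if abs(x - mean) <= k * sd)
    return 100 * inside / len(data)


# Hypothetical data for illustration only.
rates = [5.2, 6.1, 6.8, 7.4, 7.9, 8.3, 8.6, 9.0, 9.5, 10.2, 11.1, 13.0]

for k, target in [(1, 68), (2, 95), (3, 99.7)]:
    print(f"within {k} sd: {percent_within(rates, k):.1f}% (Normal: about {target}%)")
```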
› Another approach is to plot the expected z-scores of the data:
$z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}\,x - \frac{\mu}{\sigma}$
› If data are Normal, the plot of 𝑧 against 𝑥 should be a straight line.
› The expected z-score is calculated using the percentile of the data and the z-score table. For example:
Percentile   z-score
1st          -2.326
3rd          -1.881
5th          -1.645
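The expected z-scores can be generated in software rather than read from a table. The sketch below assumes that the i-th smallest of n observations is assigned the percentile (i − 0.5)/n. This convention is an assumption, though for n = 50 it gives the 1st, 3rd and 5th percentiles and the z-scores listed in the table. The function name `expected_z_scores` is hypothetical.

```python
from statistics import NormalDist


def expected_z_scores(n):
    """Expected z-scores for n ordered observations, using percentiles (i - 0.5) / n."""
    z = NormalDist()  # standard Normal distribution
    return [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


z_expected = expected_z_scores(50)
print([round(v, 3) for v in z_expected[:3]])  # [-2.326, -1.881, -1.645]
```

Pairing the sorted data values with `z_expected` gives the points of the Normal probability plot.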
› The resulting Normal probability plot is:
The linear pattern suggests that the data are close to Normal.
› Practice: The survival times in days of 72 guinea pigs after they were injected with infectious bacteria in a medical experiment were recorded.
› Determine if these data are approximately Normally distributed.
› The following are the histogram and the Normal probability plot of the data:
› The distribution is heavily right-skewed. The clear curvature in the Normal probability plot confirms that these data do not follow the Normal distribution.