Statistics 12 - rcc-jlo.weebly.com

Statistics 12 Chapter 2: Modeling Distribution of Data Dr. John Lo Royal Canadian College 2020-2021


Page 1: Statistics 12 - rcc-jlo.weebly.com

Statistics 12

Chapter 2: Modeling Distribution of Data

Dr. John Lo, Royal Canadian College

2020-2021


1. Describing a location in a distribution

CHAPTER 2: MODELING DISTRIBUTION OF DATA 2RCC @ 2020/2021


› Example: Here are the scores of all 25 students in a statistics class on their first test:

› The bold score is Jenny’s mark. How did she perform on this test relative to her classmates?


A. Percentiles

› One way to describe Jenny’s location in the distribution of test scores is to tell what percent of students in the class earned scores that were below Jenny’s score.

› That is, we can calculate Jenny’s percentile:


› If the scores are displayed in the form of a stemplot:

› We see that Jenny’s 86 places her fourth from the top of the class.

› Because 21 of the 25 observations (84%) are below her score, Jenny is at the 84th percentile in the class’s test score distribution.
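
The percentile calculation above can be sketched in Python. The individual scores are not reproduced in this transcript, so the `scores` list below is an assumed data set chosen to agree with the facts stated in these notes (25 scores, 21 of them below 86). The function name `percentile_of` is also only illustrative.

```python
def percentile_of(value, data):
    """Percent of observations in data that are strictly below value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Assumed scores (illustrative; the slide's own list is not shown here)
scores = [79, 81, 80, 77, 73, 83, 74, 93, 78, 80, 75, 67, 73,
          77, 83, 86, 90, 79, 85, 83, 89, 84, 82, 77, 72]

jenny = 86
print(percentile_of(jenny, scores))  # 84.0, since 21 of 25 scores are below 86
```

Note that this sketch counts only scores strictly below the given value, which matches the definition used in these notes; some software also counts ties.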


› Practice: Here are data from a random sample of 20 female and 20 male students at a large high school regarding the number of pairs of shoes they have.

a) Find and interpret the percentile in the female distribution for the girl with 22 pairs of shoes.

b) Find and interpret the percentile in the male distribution for the boy with 22 pairs of shoes.

c) Who is more unusual? Explain.

B. Cumulative relative frequency graphs

› One of the most common graphs that can be made with percentiles starts with a frequency table for a quantitative variable.

› For example, the following table lists the ages of the first 44 U.S. presidents when they took office:


› This table can be further expanded to include relative frequencies, cumulative frequencies and cumulative relative frequencies:


› The points corresponding to the cumulative relative frequency in each class can be plotted to create a cumulative relative frequency graph as follows:
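
A minimal Python sketch of how the expanded table might be built. The class counts below are assumed for illustration: they total 44, but the slide's own table is not reproduced in this transcript, so treat the numbers as placeholders.

```python
from itertools import accumulate

# Assumed frequency table: age class and number of presidents in that class
classes = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
freq = [2, 7, 13, 12, 7, 3]

total = sum(freq)
rel_freq = [f / total for f in freq]            # relative frequency
cum_freq = list(accumulate(freq))               # cumulative frequency
cum_rel_freq = [c / total for c in cum_freq]    # cumulative relative frequency

for row in zip(classes, freq, rel_freq, cum_freq, cum_rel_freq):
    print("{:6} {:3d} {:6.1%} {:3d} {:6.1%}".format(*row))
```

Plotting each cumulative relative frequency at the upper boundary of its class and joining the points gives the graph described above.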


› A cumulative relative frequency graph can be used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution.

› Example: Was Barack Obama, who was first inaugurated at age 47, unusually young?


› Example: Estimate and interpret the 65th percentile of the distribution.


C. z-scores

› Recall the test-score example from the start of this chapter.

› Jenny is at the 84th percentile in the class distribution.

› Where does Jenny’s score of 86 fall relative to the mean of this distribution?


› Statistical analysis of the test scores gives the following data (by Excel, Minitab, R or other software):

Number of scores     25       Minimum   67.00
Mean                 80.00    Maximum   93.00
Median               80.00    Q1        76.00
Standard deviation   6.07     Q3        83.50

› Jenny’s score of 86 is about one standard deviation above the mean.

› The process of transforming original data to standard deviation units is called standardization, and the resulting values are called standardized scores or z-scores.

› A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.


› Recall that Jenny’s score on the test was 86; therefore the corresponding z-score is:

$$z = \frac{86 - 80}{6.07} \approx 0.99$$

› That means Jenny’s test score is 0.99 standard deviations above the mean score (i.e., 80) of the class.

› Note that z-scores can be either positive or negative.

▪ Positive: the original value is above the mean

▪ Negative: the original value is below the mean
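
As a small illustration, the z-score calculation for Jenny’s statistics test can be written as a Python function. The function name `z_score` and the extra score of 74 are choices made for this sketch, not something from the slides.

```python
def z_score(x, mean, sd):
    """Signed number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Jenny's statistics test: score 86, class mean 80, standard deviation 6.07
z = z_score(86, 80, 6.07)
print(round(z, 2))  # 0.99

# A score below the mean (74 is a made-up value) gives a negative z-score
print(z_score(74, 80, 6.07) < 0)  # True
```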

› Practice: The day after receiving her statistics test result of 86, Jenny earned an 82 on a chemistry test. The distribution of scores was fairly symmetric with a mean of 76 and a standard deviation of 4. Did she do better or worse in the chemistry test than the statistics test?


› To find the z-score for an individual observation, this data value is transformed by subtracting the mean and dividing the difference by the standard deviation.

› What is the effect of adding (or subtracting) a constant?


› How about the effect of multiplying (or dividing) by a constant?
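
The standard answers to these two questions can be summarized as follows, for data $x_1, \dots, x_n$ with mean $\bar{x}$ and standard deviation $s_x$. The constants $a$ and $b$ and the transformed values $y_i$ are notation introduced only for this sketch.

```latex
\begin{align*}
y_i = x_i + a \quad &\Rightarrow \quad \bar{y} = \bar{x} + a, \qquad s_y = s_x \\
y_i = b\,x_i \quad &\Rightarrow \quad \bar{y} = b\,\bar{x}, \qquad s_y = |b|\,s_x
\end{align*}
```

In words: adding a constant shifts measures of center and location (mean, median, quartiles, percentiles) but leaves measures of spread unchanged, while multiplying by a constant rescales both center and spread. Neither transformation changes the shape of the distribution.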


› Example: A group of students in Australia was asked to guess the width of their classroom to the nearest meter. Here are their guesses in order from lowest to highest.

a) Perform a statistical analysis of this study.

b) Note that the actual width of the classroom is 13 meters. Discuss the error distribution of the data set.

c) Convert the error data into the units of feet. Discuss the distribution. Is there any change from that in b)?


a) A dot plot of the data and the summary of the statistical analysis of the data set are given below.


b) The error of each guess is determined by

error = guess − 13

› The results are as follows:

c) Note that there are 3.28 feet per meter; therefore, all data will be transformed by multiplying by a factor of 3.28.

› The results are as follows:
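
The two transformations in parts b) and c) can be tried out with a short Python sketch. The students' actual guesses are not reproduced in this transcript, so `guesses_m` is a hypothetical list. Only the actual width (13 m) and the conversion factor (3.28 ft per m) come from the example.

```python
from statistics import mean, stdev

# Hypothetical guesses in meters (the slide's data are not shown here)
guesses_m = [8, 10, 11, 13, 13, 15, 16, 18, 22]

ACTUAL_WIDTH_M = 13
FEET_PER_METER = 3.28

errors_m = [g - ACTUAL_WIDTH_M for g in guesses_m]    # subtract a constant
errors_ft = [e * FEET_PER_METER for e in errors_m]    # multiply by a constant

for label, data in [("guess (m)", guesses_m),
                    ("error (m)", errors_m),
                    ("error (ft)", errors_ft)]:
    print(f"{label:11} mean = {mean(data):7.3f}   sd = {stdev(data):6.3f}")
```

Subtracting 13 lowers the mean by 13 and leaves the standard deviation unchanged; converting to feet multiplies both the mean and the standard deviation by 3.28.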


2. Density curves and normal distributions


A. Density curves

› Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve called a density curve.


› Example: 947 7th-grade students in Gary, Indiana participated in the Iowa Test of Basic Skills (ITBS), and their scores on the vocabulary part were plotted below:

Features:

• Smooth and symmetric

• Unimodal

• No gaps or outliers

› Note that no set of real data is exactly described by a density curve; yet it is still a good approximation that is easy to use and accurate enough for practical use.

› Example:

30.3% of the students actually scored 6 or below, whereas the area under the density curve gives 29.3%.

› The mean and median are used to describe a density curve in the same way as real data sets.


› Schematically:

› A key feature: The mean and median of a symmetric density curve are equal!

B. Normal distribution

› A common type of density curve used in statistics is the Normal curve, whose shape is determined by the mean 𝜇 and standard deviation 𝜎 of the data set:

› The mathematical formula associated with the Normal curve is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

› Graphically, Normal curves look like:

› There are several important facts about Normal curves:

1. All Normal curves have the same overall shape: symmetric, single-peaked (or unimodal), and bell-shaped.

2. Any specific Normal curve is completely described by giving its mean 𝜇 and its standard deviation 𝜎.

3. The mean is located at the center of the symmetric curve and is the same as the median. Changing 𝜇 without changing 𝜎 moves the curve along the horizontal axis without changing its spread.

4. The standard deviation 𝜎 controls the spread of a Normal curve. Curves with larger standard deviations are more spread out.

5. The curve changes its curvature at a distance 𝜎 on either side of the mean.
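
Some of the facts above can be checked numerically with a short Python sketch of the Normal density formula. The parameter values (𝜇 = 10, 𝜎 = 2) are arbitrary, and the helper `second_difference` is only a rough numerical stand-in for the second derivative, introduced for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the Normal density curve N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def second_difference(f, x, h=1e-3):
    """Numerical estimate of the second derivative; its sign shows the curvature."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

mu, sigma = 10.0, 2.0  # arbitrary example values

def f(x):
    return normal_pdf(x, mu, sigma)

# Facts 1 and 3: symmetric about the mean, with the single peak at the mean
print(math.isclose(f(mu - 3), f(mu + 3)))             # True
print(f(mu) > f(mu - 0.1) and f(mu) > f(mu + 0.1))    # True

# Fact 5: the curvature changes sign at a distance sigma from the mean
print(second_difference(f, mu + 0.5 * sigma) < 0)     # True: concave down inside
print(second_difference(f, mu + 1.5 * sigma) > 0)     # True: concave up outside
```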


› Normal distributions are useful in statistics:

1. They are good descriptions for some distributions of real data such as:

• Scores on large-scale tests (e.g. SAT, IQ)

• Repeated careful measurements (e.g. scientific experiments)

• Characteristics of biological populations (e.g. yields of corn on a farm)

2. They are good approximations to the results of many kinds of chance outcomes. (i.e., probabilistic)

3. Many statistical inference procedures are based on Normal distributions.


› A common feature shared by all Normal curves is called the 68-95-99.7 rule or empirical rule (see p.84 of Unit 1 notes).
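
The three percentages in the rule are rounded values of areas under a Normal curve. A minimal Python sketch, using the fact that the area within k standard deviations of the mean equals erf(k/√2); the function name `area_within` is an illustrative choice.

```python
import math

def area_within(k):
    """Area under any Normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.1%}")
```

The printed values are close to 68%, 95% and 99.7%, which is why the rule is only approximate.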


› Example: The distribution of ITBS vocabulary scores for seventh-graders in Gary, Indiana, is N(6.84, 1.55).

a) What percent of the ITBS vocabulary scores are less than 3.74?


b) What percent of the scores are between 5.29 and 9.94?
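
One way to approach parts a) and b) is to express each boundary in standard deviations from the mean and then apply the 68-95-99.7 rule. The Python sketch below does this. It is an illustration of the reasoning rather than the worked solution from the slides, and the variable names are arbitrary.

```python
def z_score(x, mu, sigma):
    """Signed number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

mu, sigma = 6.84, 1.55  # N(6.84, 1.55) from the example

# a) 3.74 is two standard deviations below the mean
z_a = round(z_score(3.74, mu, sigma), 2)
below_a = (100 - 95) / 2          # half of the 5% that lies outside 2 SD

# b) 5.29 and 9.94 are 1 SD below and 2 SD above the mean
z_lo = round(z_score(5.29, mu, sigma), 2)
z_hi = round(z_score(9.94, mu, sigma), 2)
between_b = 68 / 2 + 95 / 2       # 34% below the mean plus 47.5% above it

print(z_a, below_a)               # -2.0 2.5
print(z_lo, z_hi, between_b)      # -1.0 2.0 81.5
```

So, by the rule, about 2.5% of scores are below 3.74 and about 81.5% are between 5.29 and 9.94.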


C. Standard Normal distribution

› Based on the 68-95-99.7 rule, we know that all Normal curves share many properties. Indeed, all Normal curves are the same if we measure in units of 𝜎 from the mean 𝜇 as center.

› This is the standardization of data that follow a Normal distribution (recall the definition of z-scores):

$$z = \frac{x - \mu}{\sigma}$$

› The resulting values 𝑧 also follow a Normal distribution, with mean 0 and standard deviation 1.

› The new distribution is called the standard Normal distribution:

› The area under a standard Normal curve is equal to 1.
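
The effect of standardization on the mean and standard deviation can be checked on any data set. A minimal Python sketch follows; the observations in `values` are hypothetical, and the sample standard deviation is used as an assumption about which formula is intended.

```python
from statistics import mean, stdev

def standardize(data):
    """Convert every observation to a z-score using the data's own mean and SD."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

values = [12.0, 15.5, 9.0, 20.0, 14.5, 11.0]  # hypothetical observations
z = standardize(values)

print(abs(mean(z)) < 1e-9)         # True: the z-scores have mean 0
print(abs(stdev(z) - 1) < 1e-9)    # True: the z-scores have standard deviation 1
```

Standardizing always produces mean 0 and standard deviation 1, but it does not change the shape: the z-scores follow the standard Normal distribution only if the original data were Normal.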


› Any question about what proportion of observations lies in some range of values can be answered by finding an area under the curve.

› Example: What is the proportion of observations that falls within one standard deviation of the mean?

› Solution: The z-scores corresponding to one standard deviation from the mean are:

$$z = \frac{(\mu \pm \sigma) - \mu}{\sigma} = \pm 1$$

› Based on the 68-95-99.7 rule, approximately 68% of the observations fall between 𝑧 = −1 and 𝑧 = 1.

› In general, the area under the curve for the standard Normal distribution can be determined using the standard Normal table.

› For example, to find the area to the left of 𝑧 = 0.81:
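
Software can be used in place of the printed table. As a sketch, Python's standard library provides `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value; the variable names are arbitrary.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area to the left of z = 0.81
area_left = standard_normal.cdf(0.81)
print(round(area_left, 4))  # 0.791
```

This agrees with the table entry of 0.7910 for 𝑧 = 0.81.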


› Example: Find the proportion of observations from the standard Normal distribution that are greater than −1.78.


› Example: Find the proportion of observations from the standard Normal distribution that are between −1.25 and 0.81.
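
Both this example and the previous one (values greater than −1.78) reduce to combining areas to the left of a z-score. A minimal Python sketch using `statistics.NormalDist` is shown here as an illustration, not as the slides' own solution; the variable names are arbitrary.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Proportion greater than -1.78: total area 1 minus the area to the left
greater = 1 - Z.cdf(-1.78)

# Proportion between -1.25 and 0.81: difference of two left-hand areas
between = Z.cdf(0.81) - Z.cdf(-1.25)

print(round(greater, 4))
print(round(between, 4))
```

With a printed table the same steps give 1 − 0.0375 = 0.9625 and 0.7910 − 0.1056 = 0.6854.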


› On the other hand, the area under the curve for the standard Normal distribution can be used to find the corresponding z-score.

› Example: What is the z-score with area 0.90 to its left?
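
Working backwards from an area to a z-score is the inverse of the table lookup. In Python this can be sketched with the `inv_cdf` method of `statistics.NormalDist`:

```python
from statistics import NormalDist

# z-score with area 0.90 to its left under the standard Normal curve
z = NormalDist().inv_cdf(0.90)
print(round(z, 2))  # 1.28
```

In the printed table, this corresponds to finding the entry closest to 0.90 and reading off its row and column.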


› Practice: Tiger Woods practises his golf swing at the driving range by hitting a ball with the same club many, many times. The distance his ball travels off the tee in yards follows a Normal distribution N(304, 8).

a) What percent of Tiger’s drives travel at least 290 yards?

b) What percent of Tiger’s drives travel between 305 and 325 yards?

› Practice: High levels of cholesterol in the blood increase the risk of heart disease. For 14-year-old boys, the distribution of blood cholesterol is approximately Normal with mean 170 milligrams of cholesterol per deciliter of blood (mg/dl) and standard deviation 30 mg/dl. What is the first quartile of the distribution of blood cholesterol?

D. Assessing Normality

› The Normal distributions provide good models for some distributions of real data such as SAT and IQ tests.

› However, it is not safe to always assume Normal distributions. Some common variables such as personal income and lifetimes of electronic devices are skewed and therefore non-Normal.

› Therefore, it is necessary to develop a strategy for assessing Normality of data sets.

› There are two common ways:

a) Comparison with 68-95-99.7 rule

b) Normal probability plot


› Example: Here are the data on unemployment rates in the 50 U.S. states.

› Are the data close to Normal?


› The histogram below shows the data on unemployment rates:

› Apparently it is non-Normal!


› Statistical analysis yields: 𝜇 = 8.682 and 𝜎 = 2.225.

› If the data distribution is close to Normal, it should fulfil the 68-95-99.7 rule.

› Here is the comparison:

› The percents are quite close to 68%, 95% and 99.7%. Therefore, the data distribution is approximately Normal.
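
The comparison with the 68-95-99.7 rule can be automated. The Python sketch below counts the percent of observations within 1, 2 and 3 standard deviations of the mean. The unemployment data are not reproduced in this transcript, so `data` is a hypothetical list, and the function name `percent_within` is illustrative.

```python
from statistics import mean, stdev

def percent_within(data, k):
    """Percent of observations within k standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    inside = sum(1 for x in data if m - k * s <= x <= m + k * s)
    return 100 * inside / len(data)

# Hypothetical data set; replace with the observed values
data = [4.1, 5.0, 5.6, 6.2, 6.8, 7.1, 7.4, 7.9, 8.3, 8.6,
        8.8, 9.1, 9.5, 9.9, 10.4, 10.9, 11.6, 12.3, 13.0, 14.2]

for k, target in [(1, 68), (2, 95), (3, 99.7)]:
    print(f"within {k} SD: {percent_within(data, k):.1f}%  (Normal: {target}%)")
```

Percents far from 68%, 95% and 99.7% would be evidence against Normality; close agreement is consistent with it but is not a proof.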


› Another approach is to plot the expected z-scores of the data:

$$z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}x - \frac{\mu}{\sigma}$$

› If the data are Normal, the plot of 𝑧 against 𝑥 should be a straight line.

› The expected z-score is calculated using the percentile of the data and the z-score table. For example:

Percentile   z-score
1st          -2.326
3rd          -1.881
5th          -1.645
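
The expected z-scores in the table above can be reproduced with a short Python sketch. It assumes that the i-th smallest of n observations is assigned the percentile (i − 0.5)/n, a common convention that matches the 1st, 3rd and 5th percentiles for n = 50; other software may use a slightly different rule. The function name `expected_z_scores` is illustrative.

```python
from statistics import NormalDist

def expected_z_scores(n):
    """Expected z-scores for n ordered observations.

    Assumes the i-th smallest value sits at percentile (i - 0.5) / n.
    """
    Z = NormalDist()
    return [Z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

z = expected_z_scores(50)
print([round(v, 3) for v in z[:3]])  # [-2.326, -1.881, -1.645]
```

Plotting the sorted data values against these expected z-scores gives a Normal probability plot.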

› The resulting Normal probability plot is:

The linear pattern suggests that the data are close to Normal.

› Practice: The survival times in days of 72 guinea pigs after they were injected with infectious bacteria in a medical experiment were recorded.

› Determine if these data are approximately Normally distributed.


› The following are the histogram and the Normal probability plot of the data:

› The distribution is heavily right-skewed. The clear curvature in the Normal probability plot confirms that these data do not follow the Normal distribution.
