Statistics 12 - rcc-jlo.weebly.com

Statistics 12

Chapter 2: Modeling Distribution of Data

Dr. John Lo, Royal Canadian College

2020-2021

1. Describing a location in a distribution

CHAPTER 2: MODELING DISTRIBUTION OF DATA 2RCC @ 2020/2021

› Example: Here are the scores of all 25 students in a statistics class on their first test:

› The bold score is Jenny’s mark. How did she perform on this test relative to her classmates?


A. Percentiles

› One way to describe Jenny’s location in the distribution of test scores is to tell what percent of students in the class earned scores that were below Jenny’s score.

› That is, we can calculate Jenny’s percentile:


› If the scores are displayed in the form of a stemplot:

› We see that Jenny’s 86 places her fourth from the top of the class.

› Because 21 of the 25 observations (84%) are below her score, Jenny is at the 84th percentile in the class’s test score distribution.
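The percentile calculation above can be sketched in Python. The `scores` list is an assumption: the class's score table is not reproduced in these notes, so the values below are an illustrative set chosen to agree with the figures quoted here (25 scores, 21 of them below Jenny's 86). The helper name `percentile_of` is also hypothetical.

```python
def percentile_of(value, data):
    """Percent of observations in `data` that are strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Assumed, illustrative scores (not copied from the slide's table).
scores = [79, 81, 80, 77, 73, 83, 74, 93, 78, 80, 75, 67, 73,
          77, 83, 86, 90, 79, 85, 83, 89, 84, 82, 77, 72]

jenny_percentile = percentile_of(86, scores)
print(f"Jenny is at the {jenny_percentile:.0f}th percentile")
```

Counting only the scores strictly below the value matches the definition used in these notes; some textbooks and software count "at or below" instead, which can give a slightly different answer.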


› Practice: Here are data from a random sample of 20 female and 20 male students at a large high school regarding the number of pairs of shoes they have.

a) Find and interpret the percentile in the female distribution for the girl with 22 pairs of shoes.

b) Find and interpret the percentile in the male distribution for the boy with 22 pairs of shoes.

c) Who is more unusual? Explain.

B. Cumulative relative frequency graphs

› One of the most common graphs that can be made with percentiles starts with a frequency table for a quantitative variable.

› For example, the following table lists the ages of the first 44 U.S. presidents when they took office:


› This table can be further expanded to include relative frequencies, cumulative frequencies and cumulative relative frequencies:


› The points corresponding to the cumulative relative frequency in each class can be plotted to create a cumulative relative frequency graph as follows:


› A cumulative relative frequency graph can be used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution.
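The expansion of a frequency table into relative, cumulative and cumulative relative frequencies can be sketched as follows. The class counts are assumed for illustration: they total 44, but they are not copied from the slide's table, which is not reproduced in these notes.

```python
from itertools import accumulate

# Assumed class counts for illustration; they total 44 but are not
# taken from the slide's frequency table.
classes = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
frequencies = [2, 7, 13, 12, 7, 3]

total = sum(frequencies)
relative = [100 * f / total for f in frequencies]          # percent in each class
cumulative = list(accumulate(frequencies))                 # running count
cumulative_relative = [100 * c / total for c in cumulative]

print(f"{'Age':<7}{'Freq':>5}{'Rel %':>8}{'Cum':>5}{'Cum rel %':>11}")
for age, f, r, c, cr in zip(classes, frequencies, relative,
                            cumulative, cumulative_relative):
    print(f"{age:<7}{f:>5}{r:>8.1f}{c:>5}{cr:>11.1f}")
```

The last cumulative relative frequency is always 100%, which is a quick check that the table has been built correctly.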

› Example: Was Barack Obama, who was first inaugurated at age 47, unusually young?


› Example: Estimate and interpret the 65th percentile of the distribution.


C. z-scores

› Recall the example about the test scores on page 2.

› Jenny is at the 84th percentile in the class distribution.

› Where does Jenny’s score of 86 fall relative to the mean of this distribution?


› Statistical analysis of the test scores gives the following data (by Excel, Minitab, R or other software):

Number of scores     25        Minimum    67.00
Mean                 80.00     Maximum    93.00
Median               80.00     Q1         76.00
Standard deviation   6.07      Q3         83.50

› Jenny’s score of 86 is about one standard deviation above the mean.

› The process of transforming original data to standard deviation units is called standardization, and the resulting values are called standardized scores or z-scores.

› A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.


› Recall that Jenny’s score in the test was 86; therefore the corresponding z-score is:

\[ z = \frac{86 - 80}{6.07} \approx 0.99 \]

› That means Jenny’s test score is 0.99 standard deviations above the mean score (i.e., 80) of the class.

› Note that z-scores can be either positive or negative.

▪ Positive: the original value is above the mean

▪ Negative: the original value is below the mean

› Practice: The day after receiving her statistics test result of 86, Jenny earned an 82 on a chemistry test. The distribution of chemistry scores was fairly symmetric with a mean of 76 and a standard deviation of 4. Did she do better or worse on the chemistry test than on the statistics test?


› To find the z-score for an individual observation, this data value is transformed by subtracting the mean and dividing the difference by the standard deviation.
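As a minimal sketch, the standardization step can be written as a one-line function and applied to the two tests described in these notes. The function name `z_score` is a hypothetical helper; the means and standard deviations are the ones quoted above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_stats = z_score(86, 80, 6.07)   # statistics test
z_chem = z_score(82, 76, 4)       # chemistry test

print(f"statistics z = {z_stats:.2f}")
print(f"chemistry  z = {z_chem:.2f}")
```

Comparing the two z-scores, rather than the raw marks, puts both results on the same scale of standard deviations from each test's own mean.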

› What is the effect of adding (or subtracting) a constant?


› How about the effect of multiplying (or dividing) by a constant?
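One way to explore both questions is to transform a small data set and compare the summary statistics before and after. The guesses below are hypothetical values made up for illustration; the constants 13 and 3.28 are the ones used in the classroom-width example in these notes.

```python
from statistics import mean, stdev

# Hypothetical guesses in meters, for illustration only.
guesses_m = [8, 10, 12, 13, 15, 16, 18, 22]

errors_m = [g - 13 for g in guesses_m]     # subtract a constant
errors_ft = [3.28 * e for e in errors_m]   # multiply by a constant

for label, data in [("guesses (m)", guesses_m),
                    ("errors (m)", errors_m),
                    ("errors (ft)", errors_ft)]:
    print(f"{label:<12} mean = {mean(data):7.3f}   sd = {stdev(data):6.3f}")
```

Subtracting 13 lowers the mean by 13 but leaves the standard deviation unchanged, while multiplying by 3.28 multiplies both the mean and the standard deviation by 3.28. Neither transformation changes the shape of the distribution.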


› Example: A group of students in Australia was asked to guess the width of their classroom to the nearest meter. Here are their guesses in order from lowest to highest.

a) Perform a statistical analysis of this study.

b) Note that the actual width of the classroom is 13 meters. Discuss the error distribution of the data set.

c) Convert the error data into the units of feet. Discuss the distribution. Is there any change from that in b)?


a) A dot plot of the data and the summary of the statistical analysis of the data set are given below.


b) The error of each guess is determined by

error = guess − 13

› The results are as follows:

c) Note that there are 3.28 feet per meter; therefore, all data will be transformed by multiplying by a factor of 3.28.

› The results are as follows:


2. Density curves and normal distributions


A. Density curves

› Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve called a density curve.


› Example: 947 7th-grade students in Gary, Indiana participated in the Iowa Test of Basic Skills (ITBS), and their scores on the vocabulary part were plotted below:


Features:

• Smooth and symmetric

• Unimodal

• No gaps or outliers

› Note that no set of real data is exactly described by a density curve; yet it is still a good approximation that is easy to use and accurate enough for practical use.

› Example:


In the actual data, 30.3% of the students scored 6 or below, whereas the density curve gives 29.3%.

› The mean and median are used to describe a density curve in the same way as real data sets.


› Schematically:

› A key feature: The mean and median of a symmetric density curve are equal!


B. Normal distribution

› A common type of density curve used in statistics is the Normal curve, whose shape is determined by the mean 𝜇 and standard deviation 𝜎 of the data set:


› The mathematical formula associated with the Normal curve is:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \]

› Graphically, Normal curves look like:

› There are several important facts about Normal curves:

1. All Normal curves have the same overall shape: symmetric, single-peaked (or unimodal), and bell-shaped.

2. Any specific Normal curve is completely described by giving its mean 𝜇 and its standard deviation 𝜎.

3. The mean is located at the center of the symmetric curve and is the same as the median. Changing 𝜇 without changing 𝜎 moves the curve along the horizontal axis without changing its spread.

4. The standard deviation 𝜎 controls the spread of a Normal curve. Curves with larger standard deviations are more spread out.

5. The curve changes its curvature at a distance 𝜎 on either side of the mean.
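Some of these facts can be checked numerically. The sketch below evaluates the Normal curve with mean 𝜇 and standard deviation 𝜎 directly; the function name `normal_pdf` and the parameter values are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 1.0   # illustrative parameter values

# Symmetry: equal heights at equal distances on either side of the mean.
left = normal_pdf(mu - 1.3, mu, sigma)
right = normal_pdf(mu + 1.3, mu, sigma)

# Total area under the curve, approximated by a midpoint Riemann sum
# over the interval from mu - 8*sigma to mu + 8*sigma.
n = 20_000
width = 16 * sigma / n
area = sum(normal_pdf(mu - 8 * sigma + (i + 0.5) * width, mu, sigma)
           for i in range(n)) * width

print(f"height at the mean: {normal_pdf(mu, mu, sigma):.4f}")
print(f"symmetric heights:  {left:.4f} and {right:.4f}")
print(f"approximate area:   {area:.4f}")
```

Changing `mu` slides the curve along the axis and changing `sigma` stretches or squeezes it, but the approximate area stays at 1 in every case.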


› Normal distributions are useful in statistics:

1. They are good descriptions for some distributions of real data such as:

• Scores on large-scale tests (e.g. SAT, IQ)

• Repeated careful measurements (e.g. scientific experiments)

• Characteristics of biological populations (e.g. yields of corn on a farm)

2. They are good approximations to the results of many kinds of chance (i.e., probabilistic) outcomes.

3. Many statistical inference procedures are based on Normal distributions.


› A common feature shared by all Normal curves is called the 68-95-99.7 rule or empirical rule (see p.84 of Unit 1 notes).


› Example: The distribution of ITBS vocabulary scores for seventh-graders in Gary, Indiana, is N(6.84, 1.55).

a) What percent of the ITBS vocabulary scores are less than 3.74?


b) What percent of the scores are between 5.29 and 9.94?
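A possible approach to this example: express each boundary as a number of standard deviations from the mean, then read the percent from the 68-95-99.7 rule. The sketch below does this and, as a cross-check that is not part of the original example, compares the rule with the area given by Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

itbs = NormalDist(mu=6.84, sigma=1.55)

# Express each boundary as a number of standard deviations from the mean.
for x in (3.74, 5.29, 9.94):
    print(f"x = {x}: z = {(x - itbs.mean) / itbs.stdev:+.2f}")

# 68-95-99.7 rule: about 2.5% lies below z = -2, and about
# 34% + 47.5% = 81.5% lies between z = -1 and z = +2.
rule_a, rule_b = 2.5, 81.5

# Areas from the Normal model itself, for comparison.
exact_a = 100 * itbs.cdf(3.74)
exact_b = 100 * (itbs.cdf(9.94) - itbs.cdf(5.29))

print(f"a) rule: {rule_a}%   model: {exact_a:.2f}%")
print(f"b) rule: {rule_b}%   model: {exact_b:.2f}%")
```

The rule and the model agree to within about half a percentage point, since 68, 95 and 99.7 are themselves rounded values.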


C. Standard Normal distribution

› Based on the 68-95-99.7 rule, we know that all Normal curves share many properties. Indeed, all Normal curves are the same if we measure in units of 𝜎 from the mean 𝜇 as center.

› This is the standardization of data that follow a Normal distribution (recall the definition of z-scores):

\[ z = \frac{x - \mu}{\sigma} \]

› The resulting variable 𝑧 also follows a Normal distribution, with mean 0 and standard deviation 1.

› The new distribution is called the standard Normal distribution:

› The area under a standard Normal curve is equal to 1.


› Any question about what proportion of observations lies in some range of values can be answered by finding an area under the curve.

› Example: What is the proportion of observations that falls within one standard deviation of the mean?

› Solution: The z-scores corresponding to one standard deviation from the mean are:

\[ z = \frac{(\mu \pm \sigma) - \mu}{\sigma} = \pm 1 \]

› Based on the 68-95-99.7 rule, approximately 68% of the observations fall between 𝑧 = −1 and 𝑧 = 1.

› In general, the area under the curve for the standard Normal distribution can be determined using the standard Normal table.

› For example, to find the area to the left of 𝑧 = 0.81:


› Example: Find the proportion of observations from the standard Normal distribution that are greater than −1.78.


› Example: Find the proportion of observations from the standard Normal distribution that are between −1.25 and 0.81.
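The three table lookups above can be reproduced in software. This sketch uses `statistics.NormalDist` from the Python standard library in place of the printed table, which is a substitution for illustration; its `cdf` method returns the area to the left of a value.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1

area_left = Z.cdf(0.81)                     # area to the left of z = 0.81
area_right = 1 - Z.cdf(-1.78)               # area to the right of z = -1.78
area_between = Z.cdf(0.81) - Z.cdf(-1.25)   # area between z = -1.25 and z = 0.81

print(f"P(z < 0.81)         = {area_left:.4f}")
print(f"P(z > -1.78)        = {area_right:.4f}")
print(f"P(-1.25 < z < 0.81) = {area_between:.4f}")
```

Because the table (and `cdf`) always gives the area to the left, an area to the right is found by subtracting from 1, and an area between two values is found by subtracting one left-hand area from the other.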


› On the other hand, the area under the curve for the standard Normal distribution can be used to find the corresponding z-score.

› Example: What is the z-score with area 0.90 to its left?
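Working backwards from an area to a z-score can also be done in software. As a sketch, `NormalDist.inv_cdf` from the Python standard library plays the role of reading the table in reverse.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1

z_90 = Z.inv_cdf(0.90)   # z-score with area 0.90 to its left
print(f"z = {z_90:.2f}")

# Check: the area to the left of that z-score is 0.90 again.
print(f"area to the left = {Z.cdf(z_90):.2f}")
```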


› Practice: Tiger Woods practises his golf swing at the driving range by hitting a ball with the same club many, many times. The distance his ball travels off the tee in yards follows a Normal distribution N(304, 8).

a) What percent of Tiger’s drives travel at least 290 yards?

b) What percent of Tiger’s drives travel between 305 and 325 yards?
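One way to check an answer to this practice question is to standardize each distance and then find the area under the standard Normal curve. The sketch below assumes the N(304, 8) model stated above and uses `statistics.NormalDist` in place of the table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()          # standard Normal distribution
mu, sigma = 304, 8        # model for the driving distance, in yards

def standardize(x):
    """Convert a distance in yards to a z-score under the N(304, 8) model."""
    return (x - mu) / sigma

# a) at least 290 yards: area to the right of z = (290 - 304) / 8
pct_at_least_290 = 100 * (1 - Z.cdf(standardize(290)))

# b) between 305 and 325 yards: difference of two left-hand areas
pct_between = 100 * (Z.cdf(standardize(325)) - Z.cdf(standardize(305)))

print(f"a) z = {standardize(290):.2f}, percent = {pct_at_least_290:.2f}%")
print(f"b) z from {standardize(305):.3f} to {standardize(325):.3f}, "
      f"percent = {pct_between:.2f}%")
```

A printed table lists z-scores to two decimal places only, so table-based answers to part b) may differ slightly from the software values.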


› Practice: High levels of cholesterol in the blood increase the risk of heart disease. For 14-year-old boys, the distribution of blood cholesterol is approximately Normal with mean 170 milligrams of cholesterol per deciliter of blood (mg/dl) and standard deviation 30 mg/dl. What is the first quartile of the distribution of blood cholesterol?
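A sketch of one way to approach this question: the first quartile has area 0.25 to its left, so find the z-score with that area and then undo the standardization with x = 𝜇 + 𝑧𝜎. The use of `statistics.NormalDist` in place of the table is an assumption made for illustration.

```python
from statistics import NormalDist

mu, sigma = 170, 30       # mg/dl, from the problem statement

# Step 1: z-score with area 0.25 to its left.
z_q1 = NormalDist().inv_cdf(0.25)

# Step 2: unstandardize, x = mu + z * sigma.
q1 = mu + z_q1 * sigma

print(f"z = {z_q1:.2f}")
print(f"first quartile = {q1:.1f} mg/dl")
```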


D. Assessing Normality

› The Normal distributions provide good models for some distributions of real data such as SAT and IQ test scores.

› However, it is not safe to always assume Normal distributions. Some common variables such as personal income and lifetimes of electronic devices are skewed and therefore non-Normal.

› Therefore, it is necessary to develop a strategy for assessing Normality of data sets.

› There are two common ways:

a) Comparison with 68-95-99.7 rule

b) Normal probability plot


› Example: Here are the data on unemployment rates in the 50 U.S. states.

› Are the data close to Normal?


› The histogram below shows the data on unemployment rates:

› At first glance, it appears non-Normal!


› Statistical analysis yields: 𝜇 = 8.682 and 𝜎 = 2.225.

› If the data distribution is close to Normal, it should fulfil the 68-95-99.7 rule.

› Here is the comparison:

› The percents are quite close to 68%, 95% and 99.7%. Therefore, the data distribution is approximately Normal.
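The comparison with the 68-95-99.7 rule amounts to counting how many observations fall within 1, 2 and 3 standard deviations of the mean. A minimal sketch follows; the sample is hypothetical (it is not the unemployment data, which are not reproduced in these notes), and the helper `within_k_sd` assumes the sample standard deviation.

```python
from statistics import mean, stdev

def within_k_sd(data, k):
    """Percent of observations within k standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    inside = sum(1 for x in data if m - k * s <= x <= m + k * s)
    return 100 * inside / len(data)

# Hypothetical sample for illustration; not the unemployment data.
sample = [4.1, 5.6, 6.3, 6.9, 7.2, 7.8, 8.0, 8.3, 8.5, 8.8,
          9.0, 9.3, 9.7, 10.1, 10.4, 10.9, 11.6, 12.2, 13.0, 14.5]

for k, target in [(1, 68), (2, 95), (3, 99.7)]:
    print(f"within {k} SD: {within_k_sd(sample, k):5.1f}%  "
          f"(Normal: about {target}%)")
```

With a small data set the percents can only change in steps (here, 5 percentage points per observation), so "close to" 68, 95 and 99.7 has to be judged with the sample size in mind.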


› Another approach is to plot the expected z-scores of the data:

\[ z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}\,x - \frac{\mu}{\sigma} \]

› If the data are Normal, the plot of 𝑧 against 𝑥 should be a straight line.

› The expected z-score is calculated using the percentile of the data and the z-score table. For example:

Percentile    z-score
1st           -2.326
3rd           -1.881
5th           -1.645

› The resulting Normal probability plot is:


The linear pattern suggests that the data are close to Normal
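The expected z-scores behind a Normal probability plot can be generated as follows. This is a sketch under one stated assumption: the i-th smallest of n observations is placed at percentile (i − 0.5)/n. That convention reproduces the 1st, 3rd and 5th percentiles quoted in these notes when n = 50, but other software may use a slightly different convention.

```python
from statistics import NormalDist

def expected_z_scores(n):
    """Expected z-scores for n ordered observations.

    Assumes the i-th smallest value sits at percentile (i - 0.5) / n,
    one common plotting-position convention.
    """
    Z = NormalDist()
    return [Z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

z = expected_z_scores(50)
for percentile, score in zip(("1st", "3rd", "5th"), z[:3]):
    print(f"{percentile}: {score:.3f}")
```

To draw the plot, sort the data from smallest to largest and plot each observation against the expected z-score in the same position; a roughly straight line suggests the data are close to Normal.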


› Practice: The survival times in days of 72 guinea pigs after they were injected with infectious bacteria in a medical experiment were recorded.

› Determine if these data are approximately Normally distributed.


› The following are the histogram and the Normal probability plot of the data:

› The distribution is heavily right-skewed. The clear curvature in the Normal probability plot confirms that these data do not follow the Normal distribution.

