Measures of Central Tendency, Variation and Shape ...



Measures of Central Tendency, Variation and Shape:

• Central tendency: the way in which a group of data values clusters around a central value (mode, mean and median).
• Variation: how spread out the values are from a central value (measured, for example, by the standard deviation).
• Shape: the pattern of the distribution of data values (e.g. positive skew, normal distribution, negative skew).
• Spread (dispersion): how similar a set of scores are to each other.
• Covariance is a measure of how much two random variables vary together (−∞ to ∞). Correlation is a measure of how strongly two variables are linearly related (−1 to 1).
• The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. The higher the coefficient of variation, the greater the dispersion around the mean.

[Figure: the data in red is more dispersed around the central tendency.]

$z\text{-score} = \dfrac{x - \bar{x}}{s}$

$CV = \dfrac{s}{\bar{x}} \times 100\%$

Chebyshev rule: for any data set, regardless of shape, the percentage of values within k standard deviations of the mean (for k > 1) must be at least:

$\left(1 - \dfrac{1}{k^2}\right) \times 100\%$
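The z-score, coefficient of variation and Chebyshev bound can be checked numerically. The following is a minimal Python sketch, assuming s is the sample standard deviation (n − 1 denominator). The function names and the data set are hypothetical and chosen only for illustration.

```python
import statistics


def z_scores(data):
    """z-score of each value: (x - mean) / s, with s the sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # assumption: sample (n - 1) standard deviation
    return [(x - mean) / s for x in data]


def coefficient_of_variation(data):
    """CV = (s / mean) * 100%."""
    return statistics.stdev(data) / statistics.mean(data) * 100


def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the Chebyshev rule is only informative for k > 1")
    return (1 - 1 / k**2) * 100


def percent_within(data, k):
    """Observed percentage of values within k standard deviations of the mean."""
    inside = [z for z in z_scores(data) if abs(z) <= k]
    return len(inside) / len(data) * 100


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
print(coefficient_of_variation(data))
print(chebyshev_lower_bound(2), percent_within(data, 2))
```

For any data set the observed percentage should be at least the Chebyshev bound; the bound is often far from tight for bell-shaped data.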

Events and Sample Spaces:

• Simple event: an event described by a single characteristic (e.g. rolling a die and getting a 2).
• Sample space: the set of all simple outcomes of the variable under consideration (e.g. when rolling a die, all the possibilities: 1, 2, 3, 4, 5, 6).
• Joint event: an event described by two or more characteristics (e.g. tossing 2 coins and getting 2 heads).
• Complement: all simple outcomes not in an event (e.g. when rolling a die, the complement of “even number” is the event “odd number”).

Probability Rules and Formulas

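• Marginal probability: the probability of a single event on its own, e.g. $P(A) = P(A \text{ and } B) + P(A \text{ and } B')$, where $B'$ is the complement of $B$.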

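• Mutually exclusive events cannot occur simultaneously (e.g. gender).
• Joint probability, P(A and B), is the probability that both events occur. Conditional probability, P(A|B) = P(A and B) / P(B), is the probability of A given that B has occurred.
• General addition rule: P(A or B) = P(A) + P(B) − P(A and B)
• Events A and B are independent if P(A|B) = P(A) or, equivalently, P(B|A) = P(B).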
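The rules above can be verified on the die example by counting outcomes. This is a small Python sketch, assuming a fair six-sided die and the usual definition P(A|B) = P(A and B) / P(B). The helper names and the event "roll is 1 or 2" are hypothetical.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (all simple outcomes equally likely).
SAMPLE_SPACE = frozenset({1, 2, 3, 4, 5, 6})


def prob(event):
    """P(event) when every simple outcome is equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def cond_prob(a, b):
    """Conditional probability P(A|B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)


even = frozenset({2, 4, 6})
low = frozenset({1, 2})    # hypothetical event: "roll is 1 or 2"
odd = SAMPLE_SPACE - even  # complement of "even"

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_even_or_low = prob(even) + prob(low) - prob(even & low)

# Independence check: P(A|B) == P(A)
even_low_independent = cond_prob(even, low) == prob(even)

# Mutually exclusive events share no outcomes, so their joint probability is 0.
even_odd_exclusive = prob(even & odd) == 0

print(p_even_or_low, even_low_independent, even_odd_exclusive)
```

Here "even" and "roll is 1 or 2" turn out to be independent, while "even" and "odd" are mutually exclusive and therefore not independent.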


Sampling Distribution

Sampling distribution: the probability distribution of a given sample statistic with repeated sampling of the population.

Sampling distribution of the mean: the distribution of all possible sample means from samples of a given size for a given population.

Finding $\bar{X}$ for the sampling distribution of the mean:

$\bar{X} = \mu + Z \dfrac{\sigma}{\sqrt{n}}$

where $\bar{X}$ represents the sample mean, $\mu$ the population mean, $\sigma$ the population standard deviation and $n$ the sample size.

Central Limit Theorem:

• The theorem states that, as the sample size (i.e. the number of values in each sample) gets large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the shape of the population distribution.
• It can be used where we know the population is not normally distributed.
• The mean of the sampling distribution is always equal to the mean of the population (because the sample mean is an unbiased estimator).
• The variability of the sample mean decreases as the sample size increases.
• The theorem allows you to make inferences about the population mean without having to know the specific shape of the population distribution.

Types of Survey Sampling Methods

• Simple random sample: one where each item in the frame has an equal chance of being selected.
• Systematic sample: choosing every kth element.
• Stratified sample: items are randomly selected from each of several populations or subpopulations.
• Cluster sample: the frame is divided into representative groups, then all items in randomly selected clusters are chosen (e.g. geographic area).

Confidence Interval for a Mean (standard deviation known):

Level of confidence: represents the percentage of intervals, based on all samples of a certain size, that would contain the population parameter. It is represented by (1 − α) × 100%, where α is the area in the tails of the distribution that is outside the confidence interval. The area in the upper tail of the distribution is α/2 and the area in the lower tail of the distribution is α/2.

Critical value: the value in a distribution that cuts off the required probability in the tail for a given confidence level; here, the value of Z needed for constructing a confidence interval. E.g. for a 95% confidence interval the value of α is 0.05. The critical Z value corresponding to a cumulative area of 0.9750 is 1.96, because there is 0.025 in the upper tail of the distribution and the cumulative area less than Z = 1.96 is 0.975.

Student’s t distribution:

• Student’s t distribution: a continuous probability distribution whose shape depends on the number of degrees of freedom.

• Degrees of freedom: relate to the number of values in the calculation of a statistic that are free to vary.

• If the random variable X is normally distributed, then the following statistic has a t distribution with n − 1 degrees of freedom: $t = \dfrac{\bar{X} - \mu}{S / \sqrt{n}}$

• S is used to estimate the unknown 𝜎.
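As an illustration of the t statistic, the sketch below computes (X̄ − μ) / (S / √n) for a small sample, with S the sample standard deviation standing in for the unknown 𝜎. The function name, the sample and the value of μ are hypothetical.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Return (t, degrees of freedom) for a sample and a hypothesized mean mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # S: sample standard deviation (n - 1 denominator)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1


sample = [4, 6, 8, 10]  # hypothetical sample, assumed drawn from a normal population
t, df = t_statistic(sample, mu=5)
print(t, df)
```

With only four observations the statistic has 3 degrees of freedom, so it would be compared against a t distribution with noticeably heavier tails than Z.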


Properties of the t distribution:

• Appearance is bell shaped.
• The t distribution has more area in the tails and less in the centre than the standardized normal distribution.
• Values of t are more variable than those for Z because the value of 𝜎 is unknown and S is used to estimate it.
• The degrees of freedom, n − 1, are directly related to the sample size n. As the sample size and degrees of freedom increase, S becomes a better estimate of 𝜎 and the t distribution gradually approaches the standardized normal distribution, until the two are virtually identical.
• With a sample size of 120 or more, S estimates 𝜎 precisely enough that there is little difference between the t and Z distributions. Use Z instead of t when the sample size is greater than 120.
• The t distribution assumes that the random variable X is normally distributed. As long as the sample size is large enough and the population is not very skewed, you can use the t distribution to estimate the population mean when 𝜎 is unknown.
• When dealing with a small sample size and a skewed population distribution, the validity of the confidence interval is a concern.

The Concept of Degrees of Freedom:

• In order to calculate $S^2$, you first need to know $\bar{X}$. Therefore, only n − 1 of the sample values are free to vary. This means that you have n − 1 degrees of freedom.
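A short Python sketch can make the idea concrete: once the sample mean is fixed, choosing n − 1 of the values forces the last one. The function name and numbers are hypothetical.

```python
def forced_last_value(free_values, sample_mean):
    """Given n - 1 freely chosen values and the sample mean, return the nth value."""
    n = len(free_values) + 1
    # The n values must sum to n * mean, so the last value is whatever is left over.
    return n * sample_mean - sum(free_values)


free_values = [8, 12, 9, 11]  # n - 1 = 4 values chosen freely
last = forced_last_value(free_values, sample_mean=10)
sample = free_values + [last]
print(last, sum(sample) / len(sample))
```

Only four of the five values could vary, which is why the sample variance of five observations has 4 degrees of freedom.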

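The Confidence Interval Statement:

• Confidence interval for the mean (𝜎 unknown): $\bar{X} \pm t_{n-1} \dfrac{S}{\sqrt{n}}$, where $t_{n-1}$ is the critical value of the t distribution with n − 1 degrees of freedom.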
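The interval calculations can be sketched in Python using only the standard library. Two assumptions are labelled in the code: the 𝜎-known interval is taken in its usual form, X̄ ± Z·𝜎/√n, and the t critical value is passed in from a t table, since the standard library has no t distribution. The function names and the sample are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def z_critical(confidence):
    """Critical Z value that leaves alpha/2 in the upper tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def ci_sigma_known(x_bar, sigma, n, confidence=0.95):
    """Assumed standard form: x_bar +/- Z * sigma / sqrt(n)."""
    half_width = z_critical(confidence) * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def ci_sigma_unknown(sample, t_crit):
    """x_bar +/- t * S / sqrt(n); t_crit is looked up in a t table for n - 1 df."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # S estimates the unknown sigma
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


print(z_critical(0.95))  # approximately 1.96

# Hypothetical sample of 10 weights; 2.2622 is the tabulated two-sided 95%
# critical value for 9 degrees of freedom.
weights = [498, 502, 505, 495, 500, 503, 497, 504, 499, 501]
lower, upper = ci_sigma_unknown(weights, t_crit=2.2622)
print(lower, upper)
```

For the same data, replacing the t critical value with 1.96 would give a narrower interval, which reflects the extra uncertainty from estimating 𝜎 with S.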

Hypothesis-Testing Methodology:

• Hypothesis testing: testing to see whether an observed difference is due to random chance or is statistically significant.

• Null hypothesis: H0: a statement about the value of one or more population parameters which we test and aim to disprove.

• Even though information is only available from the sample, the null hypothesis is written in terms of the population.

• Sample statistic is used to make inferences about the process or thing in its entirety.

• One inference may be that the results observed from the sample data indicate that the null hypothesis is false. That means something else must be true.

• Refers to a specified/hypothesized value of the population parameter (such as 𝜇), not a sample statistic (such as $\bar{X}$).

• Always contains an equals sign regarding the specified value of the population parameter. E.G. 𝜇 = 500 or 𝜇 ≥ 400

• Alternative hypothesis: H1: a statement that we aim to prove about one or more population parameters; the opposite of the null hypothesis.

  § Represents the conclusion reached by rejecting the null hypothesis.
  § Never contains an equals sign regarding the specified value of the population parameter. E.G. 𝜇 ≠ 500 or 𝜇 < 400
• Type I error: occurs when you reject the null hypothesis, H0, when it is true and should not be rejected (e.g. concluding that the mean weight is not 500 when it actually is 500).