Boxplots (Box-and-Whisker Plots): in a boxplot and a modified boxplot, each section contains 25% of the data.
Review - math.utah.edu/zhang/teaching...
Review
• Variables - Categorical vs. Quantitative
• Graphs for distributional information: Pie chart, Bar graph, Histogram, Stemplot, Timeplot, Boxplot
• Overall pattern of the graph: Symmetric/Skewed, Center, Spread, Outlier, Trend
• Measure of center: Mean/Median
• Measure of variability: Quartiles (Q1, Q2, Q3), Range, IQR, 1.5×IQR rule, Outlier, Variance, Standard deviation
• Five-number summary, Boxplot
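The five-number summary is exactly what a boxplot draws, and the 1.5×IQR rule flags outliers for a modified boxplot. A minimal Python sketch with the standard library (the data values are made up, and `statistics.quantiles` with `method="inclusive"` may follow a slightly different quartile convention than a given textbook):

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

data = [5, 7, 8, 9, 10, 11, 12, 13, 40]   # hypothetical sample
print(five_number_summary(data))
print(iqr_outliers(data))   # 40 lies far above Q3 + 1.5*IQR
```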
• Density curve
• Normal distributions / Normal curves
• z-score, Standard normal distribution
• 68–95–99.7 rule, Probabilities for normal distributions
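The 68–95–99.7 rule can be checked numerically against the standard normal distribution; a small sketch using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def within(k):
    """Probability of falling within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

print(round(within(1), 4), round(within(2), 4), round(within(3), 4))
# close to 0.68, 0.95, 0.997 — the 68-95-99.7 rule
```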
• Explanatory variable / Response variable
• Scatterplot: Direction (Positive/Negative), Form (Linear/Nonlinear), Strength, Outlier
• Correlation
• Linear regression: y = a + bx; Slope b, Intercept a, Prediction
• Correlation and regression, r², Residual
• Cautions for regression: Influential observations, Extrapolation, Lurking variables
• Sample / Population
• Random sampling designs: Simple random sample (SRS), Stratified random sample, Multistage sample
• Bad samples: Voluntary response sample, Convenience sample
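An SRS gives every possible sample of size n the same chance of being chosen; `random.sample` implements exactly this selection. The population of 100 IDs below is hypothetical:

```python
import random

population = list(range(1, 101))   # hypothetical student IDs 1..100

random.seed(42)   # fixed seed only to make the illustration reproducible
srs = random.sample(population, k=10)   # every 10-subset equally likely
print(sorted(srs))
```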
• Observational studies & Experimental studies (experiments)
• Treatments / Factors
• Design of experiments:
∗ control (comparison, placebo)
∗ randomization (table of random digits, double-blind)
∗ matched pairs design / Block design
• Probability: Sample space (S) & Events
• Rules for a probability model:
1. for any event A, 0 ≤ P(A) ≤ 1
2. for the sample space S, P(S) = 1
3. if two events A and B are disjoint, then P(A or B) = P(A) + P(B)
4. for any event A, P(A does not occur) = 1 − P(A)
• Discrete probability models / Continuous probability models
• Random variables / Distributions
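The four rules can be verified mechanically on a small discrete model; this sketch uses a fair six-sided die (the events A and B are arbitrary choices for illustration):

```python
from fractions import Fraction

# discrete probability model: fair six-sided die
S = {face: Fraction(1, 6) for face in range(1, 7)}

def P(event):
    """Probability of an event = sum of its outcome probabilities."""
    return sum(S[o] for o in event)

A = {1, 2}   # roll a 1 or 2
B = {5, 6}   # roll a 5 or 6 (disjoint from A)

assert all(0 <= p <= 1 for p in S.values())   # rule 1
assert P(S.keys()) == 1                       # rule 2: P(S) = 1
assert P(A | B) == P(A) + P(B)                # rule 3: disjoint addition
assert P(S.keys() - A) == 1 - P(A)            # rule 4: complement rule
print(P(A), P(A | B))
```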
• Population / Sample; Parameters / Statistics
µ / x̄, σ / s, p / p̂
• Statistics are random variables
• Sampling distribution of the sample mean x̄ for an SRS:
∗ the mean of x̄ equals the population mean µ
∗ the standard deviation of x̄ equals σ/√n, where σ is the population standard deviation and n is the sample size
∗ if the population has a normal distribution, then x̄ ∼ N(µ, σ/√n)
∗ central limit theorem: if the sample size is large (n ≥ 30), then x̄ is approximately normal, i.e. x̄ is approximately N(µ, σ/√n)
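The central limit theorem can be illustrated by simulation; this sketch draws many SRSs of size n = 36 from a skewed (exponential) population with µ = σ = 1 and checks that the distribution of x̄ has mean near µ and standard deviation near σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(0)
n = 36      # sample size; population is exponential with mu = sigma = 1
xbars = [mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]          # 5000 simulated sample means

print(round(mean(xbars), 3))    # close to the population mean mu = 1
print(round(stdev(xbars), 3))   # close to sigma / sqrt(n) = 1/6 ~ 0.167
```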
• Inference about µ with known σ — z-procedures (confidence interval & test of significance)
• Confidence intervals:
∗ form: estimate ± margin of error / interpretation
∗ (x̄ − z*σ/√n, x̄ + z*σ/√n)
∗ z* is determined by the confidence level C — the z-score corresponding to the upper-tail area (1 − C)/2
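The interval above can be sketched directly, with z* obtained from `statistics.NormalDist` rather than a table (the summary numbers passed in are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """(xbar - z* sigma/sqrt(n), xbar + z* sigma/sqrt(n));
    z* cuts off upper-tail area (1 - C)/2."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    m = z_star * sigma / sqrt(n)   # margin of error
    return xbar - m, xbar + m

# hypothetical: sample mean 68 from an SRS of n = 25, known sigma = 3
lo, hi = z_interval(xbar=68.0, sigma=3.0, n=25, conf=0.95)
print(round(lo, 2), round(hi, 2))
```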
• Test of significance:
∗ hypotheses: H0 vs. Ha / H0 : µ = µ0
∗ test statistic: z = (x̄ − µ0)/(σ/√n)
∗ P-value:
· Ha : µ > µ0 — upper-tail probability corresponding to z
· Ha : µ < µ0 — lower-tail probability corresponding to z
· Ha : µ ≠ µ0 — twice the upper-tail probability corresponding to |z|
∗ significance level α and conclusion
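The three P-value cases map directly onto the standard normal cdf; a sketch on hypothetical numbers:

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """z statistic and P-value for H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    Phi = NormalDist().cdf
    if alternative == "greater":     # Ha: mu > mu0 -> upper-tail area
        p = 1 - Phi(z)
    elif alternative == "less":      # Ha: mu < mu0 -> lower-tail area
        p = Phi(z)
    else:                            # Ha: mu != mu0 -> twice upper tail of |z|
        p = 2 * (1 - Phi(abs(z)))
    return z, p

# hypothetical: sample mean 52 from n = 36, testing mu0 = 50 with sigma = 6
z, p = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36, alternative="greater")
print(round(z, 2), round(p, 4))
```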
• Assumptions for z-procedures:
∗ the sample is an SRS
∗ the population has a normal distribution
∗ the population standard deviation σ is known
• The margin of error of a confidence interval is affected by C, σ, and n; to get a level-C confidence interval with margin of error m, we need an SRS with sample size n = (z*σ/m)²
• The significance of a test is also affected by the sample size
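The sample-size formula rounds up to the next whole observation; a small sketch (σ = 15 and m = 3 are invented values):

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, m, conf=0.95):
    """Smallest n giving a level-conf z interval with margin of error <= m."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z_star * sigma / m) ** 2)   # n = (z* sigma / m)^2, rounded up

# hypothetical: sigma = 15, desired margin of error m = 3, 95% confidence
print(sample_size(sigma=15, m=3))
```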
• Inference about µ with unknown σ — t-procedures (confidence interval & test of significance)
• Standard error: s/√n
• t-distribution; degrees of freedom n − 1
• Confidence intervals:
∗ (x̄ − t*s/√n, x̄ + t*s/√n)
∗ t* is determined by the confidence level C — the t-score corresponding to the upper-tail area (1 − C)/2
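With σ unknown, the interval uses the standard error s/√n and a t critical value. The Python standard library has no t distribution, so this sketch hard-codes t* = 2.365 from a t table (95% confidence, df = 7); the data values are invented:

```python
from math import sqrt
from statistics import mean, stdev

data = [9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3]   # hypothetical, n = 8
xbar, s, n = mean(data), stdev(data), len(data)

t_star = 2.365             # t-table value: 95% confidence, df = n - 1 = 7
m = t_star * s / sqrt(n)   # margin of error = t* times the standard error
print(round(xbar - m, 3), round(xbar + m, 3))
```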
• Test of significance:
∗ hypotheses: H0 vs. Ha / H0 : µ = µ0
∗ test statistic: t = (x̄ − µ0)/(s/√n)
∗ P-value (with n − 1 degrees of freedom):
· Ha : µ > µ0 — upper-tail probability corresponding to t
· Ha : µ < µ0 — lower-tail probability corresponding to t
· Ha : µ ≠ µ0 — twice the upper-tail probability corresponding to |t|
∗ significance level α and conclusion
• Inference about two means — µ1 − µ2
• Standard error for x̄1 − x̄2: √(s1²/n1 + s2²/n2)
• Confidence interval for µ1 − µ2:
∗ ((x̄1 − x̄2) − t*√(s1²/n1 + s2²/n2), (x̄1 − x̄2) + t*√(s1²/n1 + s2²/n2))
∗ t* is determined by the confidence level C — the t-score corresponding to the upper-tail area (1 − C)/2
∗ degrees of freedom: the smaller of n1 − 1 and n2 − 1
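The two-sample interval follows the same estimate ± t*·SE pattern; this sketch uses invented summary statistics and hard-codes t* = 2.093 from a t table (95% confidence, df = min(n1, n2) − 1 = 19), since the standard library has no t distribution:

```python
from math import sqrt

# hypothetical summary statistics for two independent SRSs
x1, s1, n1 = 85.0, 8.0, 20
x2, s2, n2 = 81.0, 7.0, 25

se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of x1bar - x2bar
t_star = 2.093                       # t table: 95% conf., df = min(n1, n2) - 1 = 19
diff = x1 - x2
print(round(diff - t_star * se, 2), round(diff + t_star * se, 2))
```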
• Test of significance:
∗ hypotheses: H0 vs. Ha / H0 : µ1 = µ2 (µ1 − µ2 = 0)
∗ test statistic: t = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)
∗ P-value:
· degrees of freedom: the smaller of n1 − 1 and n2 − 1
· Ha : µ1 > µ2 — upper-tail probability corresponding to t
· Ha : µ1 < µ2 — lower-tail probability corresponding to t
· Ha : µ1 ≠ µ2 — twice the upper-tail probability corresponding to |t|
∗ significance level α and conclusion
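Computing the two-sample t statistic and the conservative degrees of freedom is straightforward; the P-value itself would then come from a t table or software, since the standard library has no t distribution (the two data sets are invented):

```python
from math import sqrt
from statistics import mean, stdev

# hypothetical measurements from two independent SRSs
group1 = [12.1, 11.8, 12.6, 12.3, 11.9, 12.4]
group2 = [11.2, 11.5, 11.0, 11.7, 11.3, 11.4]

x1, s1, n1 = mean(group1), stdev(group1), len(group1)
x2, s2, n2 = mean(group2), stdev(group2), len(group2)

t = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)   # two-sample t statistic
df = min(n1, n2) - 1                            # conservative degrees of freedom
print(round(t, 2), df)   # compare against a t-table row for this df
```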
• Inference about a population proportion p — z-procedures (confidence interval & test of significance)
• Sampling distribution of the sample proportion p̂ for an SRS:
∗ the mean of p̂ equals the population proportion p
∗ the standard deviation of p̂ equals √(p(1 − p)/n)
∗ if the sample size is large, p̂ is approximately normal, i.e. p̂ is approximately N(p, √(p(1 − p)/n))
• Standard error of p̂: √(p̂(1 − p̂)/n)
• Inference about population proportion p — z-procedures(confidence interval & test of significance)
• Sampling distribution of the sample proportion p for an SRS:
∗ mean of p equals the population proportion p
∗ standard deviation of p equals
√p(1− p)
n∗ If the sample size is large, p is approximately normal, i.e.
papprox∼ N(p,
√p(1− p)
n)
• Standard error of p:
√p(1− p)
n
• Inference about population proportion p — z-procedures(confidence interval & test of significance)
• Sampling distribution of the sample proportion p for an SRS:
∗ mean of p equals the population proportion p
∗ standard deviation of p equals
√p(1− p)
n∗ If the sample size is large, p is approximately normal, i.e.
papprox∼ N(p,
√p(1− p)
n)
• Standard error of p:
√p(1− p)
n
• Inference about population proportion p — z-procedures(confidence interval & test of significance)
• Sampling distribution of the sample proportion p for an SRS:
∗ mean of p equals the population proportion p
∗ standard deviation of p equals
√p(1− p)
n
∗ If the sample size is large, p is approximately normal, i.e.
papprox∼ N(p,
√p(1− p)
n)
• Standard error of p:
√p(1− p)
n
• Inference about population proportion p — z-procedures(confidence interval & test of significance)
• Sampling distribution of the sample proportion p for an SRS:
∗ mean of p equals the population proportion p
∗ standard deviation of p equals
√p(1− p)
n∗ If the sample size is large, p is approximately normal, i.e.
papprox∼ N(p,
√p(1− p)
n)
• Standard error of p:
√p(1− p)
n
• Inference about population proportion p — z-procedures(confidence interval & test of significance)
• Sampling distribution of the sample proportion p for an SRS:
∗ mean of p equals the population proportion p
∗ standard deviation of p equals
√p(1− p)
n∗ If the sample size is large, p is approximately normal, i.e.
papprox∼ N(p,
√p(1− p)
n)
• Standard error of p:
√p(1− p)
n
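The normal approximation to the sampling distribution of p̂ can be used directly for probability calculations; a short sketch with illustrative values for p and n:

```python
import math

# Illustrative values: population proportion p, sample size n
p, n = 0.5, 100

mean_phat = p                         # mean of p̂ equals p
sd_phat = math.sqrt(p * (1 - p) / n)  # standard deviation of p̂

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Normal approximation: P(p̂ ≤ 0.45) ≈ Φ((0.45 − p)/sd)
prob = phi((0.45 - p) / sd_phat)
```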
• Inference about population proportion p — z-procedures
• Large-sample confidence intervals:
∗ (p̂ − z∗√(p̂(1 − p̂)/n), p̂ + z∗√(p̂(1 − p̂)/n))
∗ z∗ is determined by the confidence level C — the z-score corresponding to the upper tail probability (1 − C)/2
∗ Use it only when np̂ ≥ 15 and n(1 − p̂) ≥ 15
• Plus four confidence intervals:
∗ (p̃ − z∗√(p̃(1 − p̃)/(n + 4)), p̃ + z∗√(p̃(1 − p̃)/(n + 4)))
∗ p̃ = (number of successes in the sample + 2)/(n + 4)
∗ Use it when the confidence level is at least 90% and the sample size n is at least 10
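Both intervals are easy to compute by hand; a sketch in Python with made-up data (60 successes in a sample of 100, 95% confidence, so z∗ = 1.96):

```python
import math

# Illustrative data: 60 successes in n = 100, 95% confidence
successes, n = 60, 100
z_star = 1.96  # z* for 95% confidence (upper tail 0.025)

# Large-sample interval (valid here: np̂ = 60 and n(1 − p̂) = 40 are both ≥ 15)
phat = successes / n
se = math.sqrt(phat * (1 - phat) / n)
large_sample_ci = (phat - z_star * se, phat + z_star * se)

# Plus four interval: pretend we saw 2 extra successes and 2 extra failures
ptilde = (successes + 2) / (n + 4)
se4 = math.sqrt(ptilde * (1 - ptilde) / (n + 4))
plus_four_ci = (ptilde - z_star * se4, ptilde + z_star * se4)
```

With this data the two intervals nearly coincide; the plus-four adjustment matters most for small samples or extreme p̂.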
• Inference about population proportion p — z-procedures
• Test of significance:
∗ hypotheses: H0 vs. Ha / H0 : p = p0
∗ test statistic: z = (p̂ − p0) / √(p0(1 − p0)/n)
∗ P-value:
? Ha : p > p0 — upper tail probability corresponding to z
? Ha : p < p0 — lower tail probability corresponding to z
? Ha : p ≠ p0 — twice the upper tail probability corresponding to |z|
∗ significance level α and conclusion
∗ use this test when np0 ≥ 10 and n(1 − p0) ≥ 10
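The one-proportion z test can be sketched in a few lines of Python; the data here (60 successes out of 100, testing H0 : p = 0.5) is made up for illustration:

```python
import math

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative test of H0: p = 0.5 against Ha: p ≠ 0.5
p0, successes, n = 0.5, 60, 100
phat = successes / n

# z = (p̂ − p0) / sqrt(p0 (1 − p0) / n)
z = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-sided P-value: twice the upper tail probability of |z|
p_value = 2 * (1 - phi(abs(z)))
```

Here np0 = n(1 − p0) = 50 ≥ 10, so the test applies; z = 2 gives a two-sided P-value of about 0.046.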
• Inference about two proportions — p1 − p2
• Sampling distribution of p̂1 − p̂2:
∗ mean of p̂1 − p̂2 is p1 − p2
∗ standard deviation of p̂1 − p̂2 is √(p1(1 − p1)/n1 + p2(1 − p2)/n2)
∗ If the sample sizes are large, p̂1 − p̂2 is approximately normal
• Standard error of p̂1 − p̂2: √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
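The standard error of p̂1 − p̂2 combines the two single-sample variances; a minimal sketch with illustrative counts:

```python
import math

# Illustrative samples: 60/100 successes vs. 40/80 successes
phat1, n1 = 60 / 100, 100
phat2, n2 = 40 / 80, 80

diff = phat1 - phat2

# Standard error of p̂1 − p̂2
se = math.sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
```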
• Inference about two proportions — p1 − p2
• Large-sample confidence intervals:
∗ ((p̂1 − p̂2) − z∗SE, (p̂1 − p̂2) + z∗SE), where SE is the standard error of p̂1 − p̂2:
SE = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
∗ z∗ is determined by the confidence level C — the z-score corresponding to the upper tail probability (1 − C)/2
∗ Use it only when the counts of successes and failures are each 10 or more in both samples
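Putting the standard error together with z∗ gives the large-sample interval for p1 − p2; a sketch with made-up counts at 95% confidence:

```python
import math

# Illustrative data: 60/100 vs. 40/80 successes, 95% confidence
phat1, n1 = 0.60, 100
phat2, n2 = 0.50, 80
z_star = 1.96  # z* for 95% confidence

se = math.sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
diff = phat1 - phat2
ci = (diff - z_star * se, diff + z_star * se)
```

The interval here covers 0, consistent with the two-sided test on this same data being non-significant at the 5% level.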
• Inference about two proportions — p1 − p2
• Plus four confidence intervals:
∗ ((p̃1 − p̃2) − z∗SE, (p̃1 − p̃2) + z∗SE), where SE is the standard error of p̃1 − p̃2:
SE = √(p̃1(1 − p̃1)/(n1 + 2) + p̃2(1 − p̃2)/(n2 + 2))
∗ p̃i = (number of successes in the i-th sample + 1)/(ni + 2), i = 1, 2
∗ Use it when n1 ≥ 5 and n2 ≥ 5
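The two-sample plus-four adjustment adds one success and one failure to each sample before forming the interval; a sketch with the same illustrative counts:

```python
import math

# Illustrative data: 60/100 vs. 40/80 successes, 95% confidence
succ1, n1 = 60, 100
succ2, n2 = 40, 80
z_star = 1.96

# Plus-four estimates: one extra success and one extra failure per sample
pt1 = (succ1 + 1) / (n1 + 2)
pt2 = (succ2 + 1) / (n2 + 2)

se = math.sqrt(pt1 * (1 - pt1) / (n1 + 2) + pt2 * (1 - pt2) / (n2 + 2))
diff = pt1 - pt2
ci = (diff - z_star * se, diff + z_star * se)
```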
• Test of significance:
∗ hypotheses: H0 vs. Ha / H0 : p1 = p2 (p1 − p2 = 0)
∗ pooled sample proportion p̂:
p̂ = (number of successes in both samples combined)/(number of individuals in both samples combined)
∗ test statistic: z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))
∗ P-value:
? Ha : p1 − p2 > 0 — upper tail probability corresponding to z
? Ha : p1 − p2 < 0 — lower tail probability corresponding to z
? Ha : p1 − p2 ≠ 0 — twice the upper tail probability corresponding to |z|
∗ significance level α and conclusion
∗ use this test when the counts of successes and failures are each 5 or more in both samples
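The pooled two-proportion z test can be sketched directly from these formulas; the counts are illustrative:

```python
import math

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative data: 60/100 vs. 40/80 successes, H0: p1 = p2
succ1, n1 = 60, 100
succ2, n2 = 40, 80
phat1, phat2 = succ1 / n1, succ2 / n2

# Pooled sample proportion: all successes over all individuals
pooled = (succ1 + succ2) / (n1 + n2)

# z = (p̂1 − p̂2) / sqrt(p̂(1 − p̂)(1/n1 + 1/n2))
z = (phat1 - phat2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Two-sided P-value
p_value = 2 * (1 - phi(abs(z)))
```

Note the pooled p̂ is used only in the test (where H0 assumes a common p), not in the confidence interval.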
• Chi-square test for a two-way table
• Hypotheses: H0 : there is no relationship between the two variables (row variable and column variable) vs. Ha : there is some relationship
• Compares the observed counts in the cells of the two-way table with the counts that would be expected if H0 were true:
expected count = (row total × column total)/table total
• Chi-square test statistic:
χ² = Σ (observed count − expected count)²/expected count, summed over all cells
• Degrees of freedom of χ²: (r − 1)(c − 1), where r is the number of rows and c is the number of columns
• P-value: the area under the chi-square density curve to the right of the value of the test statistic
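The expected counts and the χ² statistic follow mechanically from the table totals; a sketch on a made-up 2×2 table:

```python
# Illustrative 2×2 table of observed counts
observed = [[30, 20],
            [20, 30]]

r, c = len(observed), len(observed[0])
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(r)) for j in range(c)]
total = sum(row_totals)

# expected count = row total × column total / table total
expected = [[row_totals[i] * col_totals[j] / total for j in range(c)]
            for i in range(r)]

# χ² = sum over cells of (observed − expected)² / expected
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(r) for j in range(c))
df = (r - 1) * (c - 1)
```

Here χ² = 4.0 on 1 degree of freedom, which exceeds the 5% critical value 3.84, so H0 would be rejected at α = 0.05.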
• Chi-square test for goodness of fit
• Null hypothesis: H0 : p1 = p10, p2 = p20, . . . , pk = pk0
• Compares the observed count in each category with the count that would be expected if H0 were true:
expected count for category i = n·pi0
• Chi-square test statistic:
χ² = Σ (observed count − expected count)²/expected count, summed over all categories
• Degrees of freedom of χ²: k − 1, where k is the number of categories
• P-value: the area under the chi-square density curve to the right of the value of the test statistic
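The goodness-of-fit statistic uses the same cell formula with expected counts n·pi0; a sketch with illustrative counts and three equally likely categories:

```python
# Illustrative data: 60 observations in 3 hypothesized equally likely categories
observed = [18, 22, 20]
p0 = [1 / 3, 1 / 3, 1 / 3]      # hypothesized category proportions under H0
n = sum(observed)

expected = [n * p for p in p0]  # expected count for category i = n * pi0
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1          # k − 1 degrees of freedom
```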
• One-way analysis of variance (ANOVA) compares the means of several populations.
• Hypotheses for the ANOVA F-test: H0 : all the populations have the same mean vs. Ha : not all the means are the same
• F = (variation among the sample means)/(variation among individuals within the same sample)
degrees of freedom for the numerator: I − 1; degrees of freedom for the denominator: N − I, where I is the number of populations and N is the total number of observations from the I samples
• Conditions for using ANOVA: independent SRSs from each population; each population is Normally distributed; all populations have the same standard deviation
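The F ratio can be computed from scratch as the between-group mean square over the within-group mean square; a sketch with three small made-up samples:

```python
import statistics

# Three illustrative samples, one from each population
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]

I = len(groups)                  # number of populations
N = sum(len(g) for g in groups)  # total number of observations
grand_mean = sum(sum(g) for g in groups) / N

# Between-group mean square: variation among the sample means
ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
msb = ssb / (I - 1)

# Within-group mean square: variation among individuals in the same sample
ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
msw = ssw / (N - I)

F = msb / msw
df_num, df_den = I - 1, N - I
```

A large F (here 21 on 2 and 6 degrees of freedom) indicates the sample means differ by more than within-sample variation alone would explain.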