WEEK 5: Normal distribution - uqba.org.auuqba.org.au/.../2018/02/ECON1310-Final-Summary.docx · Web...



ECON1310 Final Summary (5-12)

WEEK 5: Normal distribution Text 6.1 – 6.3

Continuous probability: a continuous random variable is a variable that can assume any value on a number line, c.f. discrete counting numbers.

Normal distribution

A continuous probability distribution, bell shaped and symmetrical about the mean. The random variable X has an infinite theoretical range.

Changing μ shifts the distribution left/right (location).

Changing σ increases or decreases the spread.

Standardised normal distribution: Z, where μ = 0 and σ = 1. Area on either side of the mean = 0.5, total area underneath the curve = 1, specific area = P using the table.

$Z = \frac{X - \mu}{\sigma}$

Empirical Rules:
o μ ± 1σ encloses 68.3% of all X values.
o μ ± 2σ encloses 95.4% of all X values.
o μ ± 3σ covers 99.7% of all X values.

1. Write the name of the variable and the given parameters (μ and σ).
2. Express the question in notation and draw a diagram.
3. Convert X to a Z value and rewrite the notation to express the probability in terms of Z.
4. Look up the Z value to find P and use it to calculate the required figure if necessary.
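The four steps above can be checked numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the function name and the example values (μ = 100, σ = 15, X = 115) are made up for illustration. It returns the area to the left of Z, whereas a printed table may give the area between 0 and Z, so the two can differ by 0.5.

```python
from statistics import NormalDist


def normal_probability_below(x, mu, sigma):
    """Return (Z, P(X < x)) for a normal variable with mean mu and sd sigma."""
    z = (x - mu) / sigma            # step 3: convert X to a Z value
    p = NormalDist().cdf(z)         # step 4: area to the left of Z
    return z, p


# Hypothetical example values, not taken from the notes.
z, p = normal_probability_below(x=115, mu=100, sigma=15)
print(f"Z = {z:.2f}, P(X < 115) = {p:.4f}")

# Empirical rule check: area within one standard deviation of the mean.
within_one_sd = NormalDist().cdf(1) - NormalDist().cdf(-1)
print(f"P(mu - sigma < X < mu + sigma) = {within_one_sd:.3f}")
```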

WEEK 6: Sampling distribution of the sample mean and sample proportion
Textbook 7.4 and 7.5

Sampling distribution: the distribution of all possible values of a statistic using the same sample size selected from a population.

Sampling distribution of the sample mean ($\bar{X}$)

Variability of the mean from sample to sample: standard error of the mean, $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. Decreases as the sample size increases – pointier curve.

If the population variable is normally distributed with mean μ and standard deviation σ, the sampling distribution of $\bar{X}$ is also exactly normally distributed with $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.

$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Central limit theorem – the distribution of the population variable of interest may not be normal, or may be unknown. Nevertheless, the sampling distribution of $\bar{X}$ becomes approximately normally distributed if n ≥ 30.
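As a rough illustration of the standard error and the Z transformation for a sample mean, here is a small Python sketch. The function name and the numbers (μ = 50, σ = 12, n = 36) are assumptions chosen for the example, not values from the course.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_probability_below(x_bar, mu, sigma, n):
    """Return (standard error, Z, P(sample mean < x_bar))."""
    se = sigma / sqrt(n)            # standard error of the mean
    z = (x_bar - mu) / se           # Z for the sample mean
    return se, z, NormalDist().cdf(z)


# Hypothetical values: population mean 50, sd 12, sample of 36.
se, z, p = sample_mean_probability_below(x_bar=53, mu=50, sigma=12, n=36)
print(f"standard error = {se:.2f}, Z = {z:.2f}, P = {p:.4f}")
```

Increasing n in the call shrinks the standard error, which is the "pointier curve" effect described above.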

Sampling distribution of the sample proportion


$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$ = standard error of the proportion, where q = 1 − p. Decreases as n increases.

The sampling distribution of the sample proportion, $\hat{p}$, can be approximated by a normal distribution if both np > 5 and nq > 5.

$Z = \frac{\hat{p} - p}{\sqrt{pq/n}}$
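The normal approximation check and the Z value for a sample proportion might be sketched as follows. The function name and example numbers are hypothetical; the np > 5 and nq > 5 rule is the one stated above.

```python
from math import sqrt


def sample_proportion_z(p_hat, p, n):
    """Return Z for a sample proportion p_hat when the population proportion is p."""
    q = 1 - p
    # Normal approximation is only used when both np > 5 and nq > 5.
    if not (n * p > 5 and n * q > 5):
        raise ValueError("normal approximation not appropriate: need np > 5 and nq > 5")
    se = sqrt(p * q / n)            # standard error of the proportion
    return (p_hat - p) / se


# Hypothetical values: population proportion 0.5, sample of 100, observed 0.6.
z = sample_proportion_z(p_hat=0.6, p=0.5, n=100)
print(f"Z = {z:.2f}")
```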

WEEK 7: Confidence Intervals I
Textbook 8.1-8.3

Point and Interval Estimates

A confidence interval is not a probability. 1 – LOC (level of confidence) = LOS (level of significance) or α.

Confidence interval for μ, σ known.

$\bar{X} \pm Z_{crit} \times \frac{\sigma}{\sqrt{n}}$

Based on the sample, the average (variable) in the population is estimated with X% confidence to be between A and B.

Zcrit values: 90% = 1.645, 95% = 1.96, 99% = 2.575

Sampling error: $E = Z_{crit} \times \frac{\sigma}{\sqrt{n}} = \bar{X} - \mu$
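A minimal Python sketch of the σ-known interval is given below. It computes Zcrit with `statistics.NormalDist().inv_cdf` instead of a table, so values can differ slightly from the rounded table figures (for example 2.576 rather than 2.575 at 99%). The function name and the sample figures are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_for_mean(x_bar, sigma, n, loc):
    """Return (lower, upper, E) for a confidence interval for mu with sigma known."""
    alpha = 1 - loc                                 # LOS = 1 - LOC
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # alpha is split across two tails
    e = z_crit * sigma / sqrt(n)                    # sampling error E
    return x_bar - e, x_bar + e, e


# Hypothetical sample: mean 50, known sigma 10, n = 25, 95% confidence.
lower, upper, e = z_interval_for_mean(x_bar=50, sigma=10, n=25, loc=0.95)
print(f"E = {e:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```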

t distribution – σ unknown, only s known.
Value of t depends on degrees of freedom (df) = n − 1 for sample means.
df = the number of observations that are free to vary after the sample mean has been calculated.
Extra uncertainty as s varies between samples.
NOT a normal distribution. Approaches Z as n increases, has thicker tails and is wider.
Assumptions: only s is known, and the population variable is normally distributed or a large sample is used.

Confidence Interval for the population mean

$t = \frac{\bar{X} - \mu_{\bar{X}}}{s/\sqrt{n}}$

$\mu = \bar{X} \pm t_{\alpha/2,\,n-1} \times \frac{s}{\sqrt{n}}$ (where $t_{\alpha/2,\,n-1}$ is tcrit)

Based on the sample, the (variable) in the population is estimated, with x% confidence, to be between A and B.

Confidence interval for the population proportion – use Z.

We have to estimate $\sigma_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$.
Find Zcrit using LOC/2.

$\hat{p} \pm Z_{crit} \times \sqrt{\frac{\hat{p}\hat{q}}{n}}$

Based on the sample, the proportion of (variable) in the population is estimated, with x% confidence, to be between A% and B%.
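The proportion interval can be sketched the same way. This is an illustrative example only: the function name and the sample values (p̂ = 0.4, n = 150) are assumptions, and Zcrit is computed rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_for_proportion(p_hat, n, loc):
    """Return (lower, upper, E) for a confidence interval for a population proportion."""
    z_crit = NormalDist().inv_cdf(1 - (1 - loc) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error of the proportion
    e = z_crit * se
    return p_hat - e, p_hat + e, e


# Hypothetical sample: 40% of 150 respondents, 95% confidence.
lower, upper, e = z_interval_for_proportion(p_hat=0.4, n=150, loc=0.95)
print(f"E = {e:.4f}, CI = ({lower:.4f}, {upper:.4f})")
```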


Find LOC given CI: $\hat{p} = \frac{\text{upper } p + \text{lower } p}{2}$, find standard error, rearrange.

Use p = 0.5 if unsure.

WEEK 8: Confidence Intervals II
Textbook 8.5, 10.2

Sampling Error, E

E = population parameter – sample statistic.
Point estimate ± E = CI.
A sample size can be calculated to estimate a population parameter provided an acceptable sampling error (E) and LOC are specified.

Determining sample size, sigma known

ALWAYS ROUND UP: ≥ implies n is always rounded up to the next whole number, so that the upper and lower bounds are ‘within’ the specified E.
As sample size (n) increases, E decreases.
As spread of data (σ) decreases, E decreases.
As LOC increases, Zcrit increases and so E increases.
To make inferences based on a sample, can only use probability sampling.

Determining Sample Size to estimate the true proportion

Must know the desired LOC (which determines Zcrit), the desired sampling error (E), and the value of p (unknown parameter value). p may come from past info, but if you don’t know, just use 0.5 because it never underestimates n.
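Rearranging E = Zcrit × σ/√n and E = Zcrit × √(pq/n) for n gives n ≥ (Zcrit σ / E)² and n ≥ Zcrit² p(1 − p) / E². The sketch below applies these with the round-up rule; the function names and example inputs are assumptions, and Zcrit is computed rather than taken from a table.

```python
from math import ceil
from statistics import NormalDist


def z_critical(loc):
    """Two-sided critical Z value for a given level of confidence."""
    return NormalDist().inv_cdf(1 - (1 - loc) / 2)


def sample_size_for_mean(sigma, e, loc):
    """Smallest whole n with sampling error at most e (sigma known)."""
    return ceil((z_critical(loc) * sigma / e) ** 2)     # always round up


def sample_size_for_proportion(e, loc, p=0.5):
    """Smallest whole n with sampling error at most e; p = 0.5 is the safe default."""
    return ceil(z_critical(loc) ** 2 * p * (1 - p) / e ** 2)


# Hypothetical requirements at 95% confidence.
print(sample_size_for_mean(sigma=15, e=3, loc=0.95))
print(sample_size_for_proportion(e=0.03, loc=0.95))
```

Passing a p other than 0.5 gives a smaller n, which is why 0.5 never underestimates the required sample size.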

CI Estimation for the Difference between two different population means

Typically don’t know σ, only know s, so use t.
Equal population variance assumption: assume $\sigma_1^2 = \sigma_2^2$.

o A better estimate of the common population variance is a weighted average called the pooled variance, $s_p^2$.

Assumptions

o The variances in both populations of variable X are assumed equal: because we are pooling to get a better estimate.

o Variable X is normally distributed in each population: because typically use small samples and the t distribution is needed in calculating the sampling error.

o Samples are to be independently and randomly selected from the populations: because the variance formula doesn’t allow for covariance.

Formulas:

o $(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,n_1+n_2-2} \times s_{\bar{X}_1 - \bar{X}_2}$
o $\bar{X}_1 - \bar{X}_2$ = point estimate for the difference between the means of the two populations. Then find the t score. Then find the pooled variance and plug it into the root formula. Then compile the formula.
o Don’t forget to interpret the CI and evaluate the assumptions.

Precision versus LOC: a trade-off needs to be made

o Precision refers to a small sampling error (E) and narrow CI width.
o LOC is reflected in Z and t values.


o If n is fixed, a larger LOC results in a larger E. A larger E results in a wider interval estimate and lower precision.
o High LOC = wide interval = not precise (more confident, less precise).
o Low LOC = narrow interval = very precise (less confident, more precise).

Careful when interpreting CIs – assumptions may be violated, so the interval may not be reliable. The t test is robust to minor departures from the assumption that the population variances are equal.
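The pooled-variance interval for the difference between two means can be sketched as below. The standard library has no t distribution, so tcrit is passed in from a t table (here 2.101 for df = 18 at 95%). The function name and sample figures are hypothetical, and the pooled variance and standard error are written out using the usual weighted-average form, which the notes refer to as the "root formula".

```python
from math import sqrt


def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Return (lower, upper, pooled variance) for mu1 - mu2, equal variances assumed."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = sqrt(sp2 * (1 / n1 + 1 / n2))     # standard error of the difference
    diff = x1 - x2                              # point estimate
    e = t_crit * se_diff
    return diff - e, diff + e, sp2


# Hypothetical samples; t_crit = 2.101 is the table value for alpha/2 = 0.025, df = 18.
lower, upper, sp2 = pooled_t_interval(30, 4, 10, 25, 6, 10, t_crit=2.101)
print(f"pooled variance = {sp2:.1f}, CI = ({lower:.3f}, {upper:.3f})")
```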

Using t table backwards:
o Find $\bar{X}$ from the upper and lower limits (average).
o Find the sampling error using the upper and lower limits: $t_{\alpha/2,\,n-1} \times s/\sqrt{n}$ = upper limit − $\bar{X}$.
o Find tcrit using the sampling error value.
o Use tcrit and n − 1 to find α/2, then multiply by 2 to get α, then 1 – α = LOC.

WEEK 9: Hypothesis Testing I
Text – chapter 9

Hypothesis = A claim (assumption, belief, assertion) about an unknown population parameter.

Null Hypothesis H0 refers to the status quo (no change from the past).
o It is assumed H0 is true unless there is sufficient evidence to reject it.
o Always contains an = sign in the statement (=, ≤ or ≥).

Alternate Hypothesis H1

o Challenges the status quo (something new is occurring).
o H1 never contains an equal sign.
o Usually a researcher aims to reject H0.

Writing the Hypothesis test
o For H0 look for:
= is, has not changed, not different to
≥ greater than or equal to, at least, a specific value
≤ less than or equal to, at most, no more than
o For H1 look for:
≠ is not, has changed, is different to
> greater than, more than, has increased
< less than, fewer than, decreased, reduced

Note: H0 and H1 always involve a population parameter, not a statistic.

Hypothesis testing process

Assume the H0 statement is true until we find there is sufficient evidence to reject it using sample stats.
If sufficient evidence is found to reject H0, then the H1 statement is taken as true.
Can never prove H0 is true since the decision is based on sample data, not population data.
H0 and H1 are mutually exclusive and collectively exhaustive.

Terminology

Level of Significance, α
o α defines values of sample statistics that are unlikely to be obtained from a random sample.
o α defines the rejection region of the sampling distribution.
o Selected by a researcher prior to any analysis.


Test statistic, Zcalc or tcalc = found from the transformation process. Uses sample statistics.

Critical value, Zcrit or tcrit, defines the limits of the tail area of the rejection region. Based on α. A sample result in the rejection region is unlikely if H0 is true, therefore reject H0 if the Z/tcalc value is beyond the critical value.

One-tail/directional test
o In a one-tail test, the alpha/LOS is all in the one tail – don’t divide by 2.
o Two-tailed test is for deviations in either direction away from the parameter in H0. The way to identify two-tailed is the word ‘not’ – doesn’t specify whether under or over. Two-tailed tests will use = in H0 (and ≠ in H1).
o One-tailed test is for deviations in one direction only. Key words – increase or decrease. Will involve < or > in H1.
o Conservative approach is to use two-tailed.

Steps to solving problems

1. State H0 and H1.
2. State the decision rule for the appropriate test statistic and sampling distribution, and whether one or two tailed.
3. Calculate the test statistic. Find the critical values and rejection regions from the level of significance. Draw a diagram.
4. Make a decision (reject H0 or do not reject H0).
5. State a conclusion.

Conclusions

In any hypothesis test, the conclusion must always refer to H1 (and NEVER H0).
Conclusion: At the x% level of significance, there IS SUFFICIENT (reject)/INSUFFICIENT (do not reject) evidence to conclude (relate the rest of the conclusion to what H1 refers to).
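The critical value method in the steps above might look like this for a test about a mean with σ known. It is a sketch under assumptions: the function name, the `tail` labels and the example numbers are made up, and Zcrit is computed with `NormalDist().inv_cdf` rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, sigma, n, alpha, tail="two"):
    """Return (Zcalc, Zcrit, reject_H0) for a test about mu with sigma known."""
    z_calc = (x_bar - mu0) / (sigma / sqrt(n))          # test statistic
    if tail == "two":                                   # H1: mu != mu0, alpha split in two
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z_calc) > z_crit
    elif tail == "upper":                               # H1: mu > mu0, all alpha in one tail
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z_calc > z_crit
    elif tail == "lower":                               # H1: mu < mu0
        z_crit = NormalDist().inv_cdf(alpha)
        reject = z_calc < z_crit
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z_calc, z_crit, reject


# Hypothetical test: H0: mu = 50 vs H1: mu != 50 at the 5% level.
z_calc, z_crit, reject = z_test_mean(x_bar=52, mu0=50, sigma=8, n=64, alpha=0.05)
verdict = "sufficient" if reject else "insufficient"
print(f"Zcalc = {z_calc:.2f}, Zcrit = ±{z_crit:.2f}: {verdict} evidence for H1")
```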

Alternative method

Error table

                   H0 true (HOT)       H0 false
Don’t reject H0    Correct (1 − α)     Type II error (β)
Reject H0          Type I error (α)    Correct (1 − β)

Rows and columns are mutually exclusive and collectively exhaustive.
β is NOT α’s complement – they do not add to one.
Type I error – reject a true null hypothesis.
Type II error – do not reject a false null hypothesis.
The probabilities of making Type I and Type II errors are both conditional probabilities.
Not able to make both errors at the same time.
Increasing α will reduce β.
Larger sample size improves the chances of not making either error.
The desired risk level of making an error depends on the costs and consequences.
P(Type II error) = P(in non-rejection zone | H0 is false).

WEEK 10: Hypothesis Testing II
Chapter 9 and 10

Hypothesis Tests for Proportion


Can use the critical value method, but the p-value method is better for comparison.
p-value = area(s) in the tail(s) beyond the test statistic, Zcalc or tcalc.
It is the probability, assuming H0 is true, of getting a sample statistic more extreme than the observed sample value.

Method:
o First check the assumption for approx. normal (both np > 5 and nq > 5).
o State the appropriate null and alternative hypotheses.
o Decision rule: reject H0 if p-value < α.
o Convert the sample statistic (e.g. x bar) to a test statistic (Zcalc or tcalc) (i.e. 0.5 – Z score probability).
o Obtain the p-value = area in the tail(s) beyond the test statistic(s).
o Compare the p-value with α to make a decision.
o Conclusion: at the x% level of significance, there is sufficient/insufficient evidence to conclude that the proportion of freaky fish is different to x% as claimed by the farmer.
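The p-value method for a proportion can be sketched as follows. The function name, the `tail` labels and the example figures (claimed proportion 0.2, sample of 400, observed 0.25) are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def proportion_p_value(p_hat, p0, n, tail="two"):
    """Return (Zcalc, p-value) for a test of H0: p = p0 using the normal approximation."""
    if not (n * p0 > 5 and n * (1 - p0) > 5):
        raise ValueError("need np > 5 and nq > 5 for the normal approximation")
    z_calc = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf
    if tail == "two":           # area in both tails beyond |Zcalc|
        p_value = 2 * (1 - cdf(abs(z_calc)))
    elif tail == "upper":       # area to the right of Zcalc
        p_value = 1 - cdf(z_calc)
    elif tail == "lower":       # area to the left of Zcalc
        p_value = cdf(z_calc)
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z_calc, p_value


# Hypothetical two-tailed test at alpha = 0.05.
z_calc, p_value = proportion_p_value(p_hat=0.25, p0=0.2, n=400)
print(f"Zcalc = {z_calc:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < 0.05}")
```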

Hypothesis Tests for Difference between two means

H0: μ1 – μ2 = 0
H1: μ1 – μ2 ≠ 0

WEEK 11 and 12: Simple Linear Regression
Chapter 13

SLR is an inferential statistics technique allowing conclusions to be made about a population parameter based on a sample statistic.

What can scatter plots reveal?
Is there a linear (straight line) relationship or a curvilinear relationship?
If the relationship looks linear, is the line sloping upward (positive relationship) or downward (negative relationship)?
Is the linear relationship weak/strong?

Correlation coefficient, r
No units, can only have a value between -1 and 1.
r close to 1 (-1) implies a strong positive (negative) linear relationship.
r closer to 0 implies a weaker linear relationship.
r = 0 implies no linear relationship exists.
To calculate r using Excel, use Data>Data Analysis>Correlation.
N.b. correlation does not imply a causal effect.

SLR - Line of best fit using sample data
$\hat{Y}_i = b_0 + b_1 X_i$ >> linear function
Changes in Y are assumed to be caused by changes in X.
Yi = observed/measured Y value for Xi in the sample of size n, used to help estimate the sample linear regression equation. Dependent variable.
$\hat{Y}_i$ = predicted value of Y for a particular chosen Xi (estimated Y).
(0, b0) = y-intercept of the estimated sample regression line.
$e_i = Y_i - \hat{Y}_i$


b1 = slope; its units are the change in Y for a one-unit change in X.
Xi = independent/explanatory variable.

SLR population equation: Yi = β0 + β1Xi + εi

Yi = dependent variable, β0 = population Y intercept, β1 = population slope coefficient, Xi = independent variable. β0 and β1 are the parameters

εi = random error term to allow for a range of values of Y to occur for any given Xi. In the population there may be many different Y values for the same X value.

n.b.: some data points are above the estimated regression line (a positive error) and some below the line (a negative error)

The error at any value of Xi: $e_i = Y_i - \hat{Y}_i$

Observations of the Simple Linear Regression Equation (based on sample)

The error (ei) at any Xi is always measured in the vertical direction.
The slope of the SLR equation (b1) has the same sign as r, i.e. +ve if it slopes upward, -ve if it slopes downward.

Finding the values of b0 and b1 – least squares method

b0 and b1 are obtained by finding the values that minimise the sum of the squared differences between all pairs of Yi and $\hat{Y}_i$:

$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2$. Found using Excel.

Using Excel: Data/Data Analysis/Regression.
Output lists: regression stats, ANOVA table, coefficients table.

SS = Sum of Squares
SSR = Sum of Squares due to Regression
SSE = Sum of Squares of residuals (errors)
SST = Sum of Squares Total
MSE = Mean Square Error
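The notes obtain b0 and b1 from Excel. As a cross-check, the least squares solution has a closed form, b1 = SSxy / SSxx and b0 = Ȳ − b1 X̄, which the sketch below applies to a tiny made-up data set. The function name and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line Y-hat = b0 + b1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)                        # spread of X
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # co-variation
    b1 = ss_xy / ss_xx              # slope
    b0 = y_bar - b1 * x_bar         # intercept: line passes through (x_bar, y_bar)
    return b0, b1


# Made-up sample of five (X, Y) pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(f"Y-hat = {b0:.1f} + {b1:.1f} X")
```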

Interpolation vs. Extrapolation
Interpolation – within the range of observed Xs in the sample.
Extrapolation – beyond the range of observed Xs. Be careful, because the estimated relationship may not hold beyond that range, so extrapolation predictions are doubtful and unreliable.

Sums of Squares
Total variation = explained variation + unexplained variation.
Explained variation: explained by the linear relationship with X (regression).
Unexplained variation: a result of omitted variable(s) or random error.


SST = SSR + SSE

Coefficient of Determination, r²

The proportion of total variation in Y that is explained by variation in X.
r² = SSR/SST = SSR/(SSR + SSE)
Only takes values between 0 and 1.
Shows how useful the SLR model is.
r = ±√r², where the sign is chosen so that the correlation coefficient has the same sign as the slope coefficient b1.

Standard error of estimate, se

The standard deviation of all the (observed) points around the estimated regression line = standard deviation of the error of the model.

$s_e = \sqrt{\frac{SSE}{df}} = \sqrt{\frac{SSE}{n-2}} = \sqrt{MSE}$

Comparing standard errors: se is a measure of variation of observed Y values from the regression line. Small if there is a strong linear relationship, larger if there is a weak linear relationship.

The magnitude of se should always be judged relative to the size of the Y values in the sample data.
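To see how SST, SSR, SSE, r² and se fit together, the following sketch computes them directly for a small made-up data set. The function name, the dictionary keys and the data are assumptions; in the course these figures are read from Excel's regression output.

```python
from math import copysign, sqrt


def regression_summary(xs, ys):
    """Return the sums of squares, r squared, r and se for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]                       # predicted values
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation
    r2 = ssr / sst
    return {
        "SST": sst, "SSR": ssr, "SSE": sse,
        "r2": r2,
        "r": copysign(sqrt(r2), b1),        # r takes the sign of the slope
        "se": sqrt(sse / (n - 2)),          # standard error of estimate, df = n - 2
    }


summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(f"{name} = {value:.4f}")
```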

Confidence interval for β1

$\beta_1 = b_1 \pm t_{\alpha/2,\,n-2} \times s_{b_1}$

The standard error for the slope coefficient (sb1) is in the ANOVA table.
Degrees of Freedom for tcrit in SLR: the degrees of freedom now become df = n − 2 for tcrit because there are two uncertainties (both β0 and β1 are estimated).
Conclusion: the slope of the linear relationship between X and Y in the population is estimated with x% confidence to be between A and B units.
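A short sketch of the slope interval, with tcrit supplied from a t table because the standard library has no t distribution. The function name and the inputs (b1 = 0.6, sb1 = 0.2828, and tcrit = 3.182 for df = 3 at 95%) are illustrative assumptions.

```python
def slope_confidence_interval(b1, sb1, t_crit):
    """Return (lower, upper) for the population slope beta1."""
    e = t_crit * sb1                # sampling error for the slope
    return b1 - e, b1 + e


# Hypothetical regression output with n = 5, so df = n - 2 = 3.
lower, upper = slope_confidence_interval(b1=0.6, sb1=0.2828, t_crit=3.182)
print(f"beta1 is estimated to be between {lower:.3f} and {upper:.3f}")
```

An interval that contains zero, as this one does, is consistent with no linear relationship at that level of confidence.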


Assessing if the linear model is ‘good’
Consider the statistics, CI and hypothesis test, and whether the assumptions are satisfied.
If the model is ‘good’, it can be used to predict the value of Y. Otherwise, prediction will not be reliable.

Assumptions made about ε

1. The relationship between the variables is linear.
a. Observe the scatter plot of (Xi, Yi) sample points.
b. Residual plot – a pattern may indicate a linear model is not appropriate.
2. The error terms are normally distributed with zero mean/expected value. Like the t test, regression is reasonably robust against departure from normality. Thus, inferences about β0 and β1 are not seriously affected if the departure from normality for each value of x is not extreme.
a. A histogram is needed – we can’t check.
3. The error terms have constant variance (homoscedasticity). Thus, the variability of y is the same for low as for high values of x. For serious departure, transformation of the data or weighted regression may assist. The residual plot should show no major changes in spread of errors (in vertical direction) over the range of X values. Non-constant variance = heteroscedasticity.
4. The error terms are independent of one another and occur randomly: particularly important when data are collected over a period of time, as errors in one period may be correlated with those in previous period(s). The residual plot should show no pattern. Several consecutive positive errors followed by several consecutive negative errors (a pattern) = violation.

How to check assumptions – use observations (subjective method)
The residual plot of (Xi, ei) – a plot of all points from the sample data. Used to verify Assumptions 3 and 4 about the error term.
How to do a residual plot in Excel: Data/Data Analysis/Regression and tick the box “Residual Plots”. This plot should be horizontal, like train tracks.
o If time is on the horizontal axis (or observations are ordered as measured), and a pattern in the residuals exists, this violation is called autocorrelation.

Hypothesis testing in SLR (for the slope coefficient, β1)
Testing the significance of the slope coefficient can also indicate how well the model fits the data. Use the ANOVA table.
Slope of 0 = no relationship, is a horizontal line.
If a linear relationship does exist between X and Y, then β1 ≠ 0 and the line slopes upward (β1 is positive) or downward (β1 is negative).
A hypothesis test is needed to see if the regression model slope coefficient is significantly different from zero. If there is a statistically significant difference from zero, then the regression model is useful in predicting Y (provided the four assumptions are satisfied).

For H0: no relationship, the slope is not significant (β1 = 0).
For H1:
o ≠ there is a relationship.
o > upward sloping, positive relationship, as X increases, Y increases.
o < downward sloping, negative relationship, as X increases, Y decreases.
For tcrit: df = n − 2; a two-tailed test (TTT) uses α/2, a one-tailed test (OTT) uses α.


In ANOVA:
o df is n − 2. Find in the Residual df column.
o b1 is the X coefficient, sb1 is the X standard error, t Stat for X is tcalc.
o tcalc is $t = \frac{b_1 - \beta_1}{s_{b_1}}$
o where $s_{b_1} = s_e/\sqrt{SS_{xx}}$

Example conclusion: at the 5% level of significance, there is sufficient evidence to conclude that there is an upward sloping relationship between X and Y.

For the p-value method, do almost the same, except find p in the ANOVA output as the X p-value.
Excel assumes a TTT when reporting the p-value, so if it’s an OTT then the p-value must be halved. This will be the correct p-value at the test statistic.
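The slope test and the halving rule can be sketched as below. The function names and numbers are hypothetical, and tcrit comes from a t table. One detail goes slightly beyond the notes: halving the two-tailed p-value is only right when tcalc falls on the side named in H1; when it falls on the opposite side, the one-tailed p-value is 1 minus half the two-tailed value.

```python
def slope_t_calc(b1, sb1, beta1_h0=0.0):
    """Test statistic for the slope; H0 usually sets beta1 to zero."""
    return (b1 - beta1_h0) / sb1


def one_tailed_p(p_two_tailed, t_calc, tail):
    """Convert a two-tailed p-value (as Excel reports it) to a one-tailed p-value."""
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    in_h1_direction = t_calc > 0 if tail == "upper" else t_calc < 0
    half = p_two_tailed / 2
    return half if in_h1_direction else 1 - half


# Hypothetical output: b1 = 0.6, sb1 = 0.2828, df = 3, two-tailed p-value of 0.124.
t_calc = slope_t_calc(b1=0.6, sb1=0.2828)
t_crit = 2.353                      # table value for alpha = 0.05 in one tail, df = 3
p_one = one_tailed_p(0.124, t_calc, tail="upper")
print(f"tcalc = {t_calc:.3f}, reject H0 (upper tail): {t_calc > t_crit}, p = {p_one:.3f}")
```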