Statistical Analysis Episode #1: Prior Data Analysis Logic of Statistical Inference.


Part #1

Prior Data Analysis (descriptive statistics)


Data screening… and why?

• One needs to get a feel for the data. Understanding the sample data is a MUST before making any statistical inference.

• Examining one variable at a time, and then bivariate relationships, gives you a feel for the preliminary results.

• Early detection of issues (e.g. outliers, missing observations) suggests adjustments to make before moving on.


Univariate analysis

• First, check the distribution of every variable (univariate analysis) … why?

Important terms:

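Skewness – a distribution’s deviation from a perfectly symmetrical shape (positive skewness = longer right tail, negative skewness = longer left tail)

Kurtosis – general peakedness and tail weight of a distribution relative to the normal curve (platykurtic = flatter, leptokurtic = more peaked, mesokurtic = normal-like)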
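
As a rough illustration of the two terms above, the sketch below computes skewness and excess kurtosis as standardized moments using only the Python standard library. The two data lists are made up for this example, and the population (divide-by-n) formulas are an assumption; statistical packages often apply small-sample corrections.

```python
import math

def _standardized_moment(xs, power):
    """Average of ((x - mean) / sd) ** power, using the population sd."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** power for x in xs) / n

def skewness(xs):
    """Third standardized moment: 0 for a perfectly symmetric sample."""
    return _standardized_moment(xs, 3)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3: 0 for a normal distribution."""
    return _standardized_moment(xs, 4) - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with one large value

print("skewness (symmetric):   ", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

A positive skewness value signals a longer right tail; a negative excess kurtosis signals a flatter (platykurtic) shape.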


Bivariate analysis

• Once you know each variable well, you can start looking at their relationships

• 2 categorical variables – crosstabulation (joint distribution of 2 variables)

• 2 continuous variables – scatterplot

• 1 categorical, 1 continuous – compare the boxplots / stem-and-leaf plots across categories
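
For the two-categorical case, a crosstabulation is just a count of every combination of categories. The sketch below builds one with the standard library; the survey records and the variable names (gender, owns_car) are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (gender, owns_car)
records = [
    ("F", "yes"), ("F", "no"), ("M", "yes"),
    ("M", "yes"), ("F", "yes"), ("M", "no"),
]

counts = Counter(records)                      # joint frequency of each pair
rows = sorted({gender for gender, _ in records})
cols = sorted({answer for _, answer in records})

# Print the table: one row per gender, one column per answer.
print("     " + "".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:<5}" + "".join(f"{counts[(r, c)]:>5}" for c in cols))
```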


Part #2

Logic of Statistical Inference


Statistical inference

• Statistics is the analysis of population characteristics by inference from sampling.

• Statistical analysis has two foci: descriptive statistics and statistical inference

Descriptive statistics = organizing and describing data obtained from a sample of observations

Statistical inference = using descriptive statistics from the sample to estimate the value of measures in the population from which the sample was drawn (based on probability theory)

The goal of statistical inference is to make precise estimates of population parameters from sample observations, with known risks of error (random sampling error)


Sampling Distribution

• When conducting research, scientists seldom take more than one sample from a population. This single sample becomes the basis upon which inferences are made. Consider for a moment the possibility of selecting numerous samples using identical random sampling procedures from the same population. We would now have multiple instances of whatever statistic we were interested in examining… the differences between these sample statistics might give us some notion concerning how well our sampling procedure was working.

• Each sample of the same size would provide one observation to be included in the distribution (where the data points are the sample statistics of each sample being drawn from the respective population)


• A sampling distribution is the distribution of a sample statistic that would be obtained if all possible samples of the same size (N) were drawn from a given population.
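
To make the definition concrete, the sketch below lists every possible sample of size 2 from a tiny made-up population and collects the sample means. It assumes sampling without replacement; the population values are hypothetical.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]   # hypothetical tiny population
n = 2                           # size of each sample

# Every possible sample of size n (drawn without replacement) and its mean.
sample_means = [mean(sample) for sample in combinations(population, n)]

print("number of possible samples:", len(sample_means))
print("sampling distribution of the mean:", sorted(sample_means))
print("mean of the sample means:", mean(sample_means))
print("population mean:", mean(population))
```

The sorted list of sample means is the sampling distribution of the mean for this population and sample size; its average matches the population mean.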


Central Limit Theorem

• The distribution of sample means or sample proportions from a large number of samples will approximate a normal distribution when each sample is reasonably large, regardless of the shape of the population from which the samples were drawn.

• It also specifies that the mean of the sampling distribution will be equal to the mean of the population.

• As the sample gets larger, the sample does a better job of estimating the corresponding population parameter. The standard deviation of the sampling distribution is called the “standard error”; it measures random sampling error.
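
A small simulation can illustrate these three points. The sketch below draws repeated samples from a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here purely as an assumption) and checks that the sample means center on the population mean with a spread close to sigma / sqrt(n).

```python
import math
import random
from statistics import mean, stdev

random.seed(0)          # fixed seed so the run is repeatable
n = 50                  # size of each sample
num_samples = 5000      # how many samples to draw

# Exponential population with rate 1: mean 1, standard deviation 1, right-skewed.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

center = mean(sample_means)            # should be near the population mean (1)
spread = stdev(sample_means)           # empirical standard error
theoretical_se = 1.0 / math.sqrt(n)    # sigma / sqrt(n)

print(f"mean of sample means: {center:.3f}")
print(f"empirical standard error: {spread:.3f}")
print(f"theoretical standard error: {theoretical_se:.3f}")
```

Raising n shrinks the standard error, which is the third bullet above in numeric form.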


Normal distribution

• Theoretical probability distribution with a symmetrical, unimodal, bell-shaped curve. It is based on a probability density function.

• ND is bell shaped and has only one mode [particular value that occurs most frequently]

• It is symmetric around mean, not skewed (mean, median, mode are all equal)

• The area of a region under the curve between any two values of a variable equals the probability of observing a value in that range when an observation is randomly selected from the distribution.

• The area between the mean and a given number of standard deviations from the mean is the same for all NDs [the area within one SD on either side of the mean takes in about 68.27% of the observations]
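
The last two bullets can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+) to compute the area within k standard deviations of the mean; the second distribution (mean 100, SD 15) is a hypothetical example used only to show that the area does not depend on the particular ND.

```python
from statistics import NormalDist

def area_within(k, dist=NormalDist()):
    """Area under the curve within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

other = NormalDist(mu=100, sigma=15)   # hypothetical normal distribution

for k in (1, 2, 3):
    print(f"within {k} SD: standard normal {area_within(k):.4f}, "
          f"other ND {area_within(k, other):.4f}")
```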


Question

Which of the following data sets is most likely to be normally distributed? For other choices, explain why you believe they would not follow a normal distribution.

(a) The hand span (measured from the tip of the thumb to the tip of the extended 5th finger) of a random sample of high school seniors.

(b) The annual salaries of all employees of a large shipping company

(c) The annual salaries of a random sample of 50 CEOs of major companies (25 men and 25 women)

(d) The dates of 100 pennies taken from a cash drawer in a convenience store


Question

Assume that the weight of 1-year-old girls in the US is normally distributed with a mean of 9.5 kg and a standard deviation of 1.1 kg. Without using a calculator (use the empirical rule: 68%, 95%, 99.7%), estimate the percentage of 1-year-old girls in the US that meet the following conditions. Draw a sketch and shade the proper region for each problem.

(a) Less than 8.1 kg

(b) Between 7.3 and 11.7 kg.

(c) More than 12.8 kg.


Standard normal distribution

A standard normal distribution is a normal distribution with mean 0 and standard deviation 1.

From the 68-95-99.7 rule we know that for a variable with the standard normal distribution, 68% of the observations fall between -1 and 1 (within 1 standard deviation of the mean of 0), 95% fall between -2 and 2 (within 2 standard deviations of the mean) and 99.7% fall between -3 and 3 (within 3 standard deviations of the mean).

No naturally measured variable has this distribution. However, all other normal distributions are equivalent to this distribution when the unit of measurement is changed to measure standard deviations from the mean. (That's why this distribution is important--it's used to handle problems involving any normal distribution.)


Question

The grades on a marketing research course midterm are normally distributed with a mean of 81 and a standard deviation of 6.3. Calculate the z score for each of the following exam grades. Draw and label a sketch for each example.

(a) 65

(b) 83

(c) 93

(d) 100
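
One way to check the arithmetic for this question is sketched below. A z score is the distance from the mean measured in standard deviations, z = (x - mean) / SD; the variable names are chosen for this example.

```python
mean_grade = 81.0
sd_grade = 6.3
grades = [65, 83, 93, 100]

# z = (x - mean) / sd: how many standard deviations x lies from the mean.
z_scores = {x: (x - mean_grade) / sd_grade for x in grades}

for x, z in z_scores.items():
    print(f"grade {x}: z = {z:.2f}")
```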


Question…

What is the relative frequency of observations below 1.18? That is, find the relative frequency of the event Z < 1.18.

z     .00    .01    ...   .08    .09
0.0   .5000  .5040  ...   .5319  .5359
0.1   .5398  .5438  ...   .5714  .5753
...   ...    ...    ...   ...    ...
1.0   .8413  .8438  ...   .8599  .8621
1.1   .8643  .8665  ...   .8810  .8830
1.2   .8849  .8869  ...   .8997  .9015
...   ...    ...    ...   ...    ...
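
The table lookup (row 1.1, column .08) can be cross-checked in code. This sketch uses the cumulative distribution function of the standard normal from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

standard_normal = NormalDist()            # mean 0, standard deviation 1
p_below = standard_normal.cdf(1.18)       # relative frequency of Z < 1.18

print(f"P(Z < 1.18) = {p_below:.4f}")
```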


Question

Find the value z such that the event Z > z has relative frequency 0.80.
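
This is the reverse lookup: if Z > z has relative frequency 0.80, then Z < z has relative frequency 0.20, so z is the 20th percentile and must be negative. A minimal sketch using the inverse CDF:

```python
from statistics import NormalDist

# P(Z > z) = 0.80 is the same as P(Z < z) = 0.20.
z_value = NormalDist().inv_cdf(0.20)

print(f"z = {z_value:.2f}")
```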


Question

• For borrowers with good credit, the mean debt for revolving and installment accounts is $15,015. Assume the standard deviation is $3,540 and that debt amounts are normally distributed.

What is the probability that the debt for a borrower with good credit is more than $18,000?
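
A sketch of one way to check the answer: standardize $18,000 and take the upper-tail area. The result comes out at roughly 0.20.

```python
from statistics import NormalDist

debt = NormalDist(mu=15015, sigma=3540)

z = (18000 - debt.mean) / debt.stdev      # about 0.84 standard deviations above the mean
p_over_18000 = 1 - debt.cdf(18000)        # upper-tail area beyond $18,000

print(f"z = {z:.2f}, P(debt > 18000) = {p_over_18000:.4f}")
```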


Question

• The average stock price for companies making up the S&P 500 is $30, and the standard deviation is $8.20. Assume the stock prices are normally distributed.

How high does a stock price have to be to put a company in the top 10%?
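
Here the probability is given and the value is unknown, so the inverse CDF applies: the top 10% starts at the 90th percentile. A minimal sketch:

```python
from statistics import NormalDist

prices = NormalDist(mu=30, sigma=8.20)

# The top 10% begins at the 90th percentile of the price distribution.
cutoff = prices.inv_cdf(0.90)

print(f"price needed for the top 10%: ${cutoff:.2f}")
```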


Question

The scores on a statewide geometry exam were normally distributed with μ=72 and σ=8. What fraction of test-takers had a grade between 70 and 72 on the exam? Use the cumulative z-table provided below.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
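
A sketch of the calculation behind this question: a score of 70 is z = -0.25 and a score of 72 is z = 0, so the fraction is the area between those two z values. By symmetry this equals the table entry for z = 0.25 minus 0.5.

```python
from statistics import NormalDist

scores = NormalDist(mu=72, sigma=8)

z_low = (70 - scores.mean) / scores.stdev      # -0.25
z_high = (72 - scores.mean) / scores.stdev     # 0.0
fraction = scores.cdf(72) - scores.cdf(70)     # area between the two scores

print(f"z from {z_low:.2f} to {z_high:.2f}, fraction = {fraction:.4f}")
```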


Question

The scores on a marketing research take-home exam are normally distributed with μ=70.25 and σ=3.

Henry scored 71 on the exam. Henry’s exam grade was higher than what percentage of test-takers?


Summary on Statistical Inference

1. SI involves generalization from sample statistics to population parameters

2. To conduct inferential analysis, we must have a theory that underlies the process. The theory is based on probability

3. There are two kinds of error in samples: bias and random sampling error. Through random selection of the sample, selection bias can be eliminated and random sampling error can be measured.

4. Sampling distributions are theoretical probability distributions that describe the relationship between populations and samples.

5. The standard deviation of the sampling distribution is called the standard error; when estimated from sample statistics, it measures random sampling error.

6. As the size of the sample increases, sampling error and the standard error will decrease.

7. The standard error is used both to develop interval estimates of population parameters and to conduct hypothesis testing.


Confidence Interval

• A sample statistic is not likely to be exactly equal to the population parameter. But we can place an interval around a sample statistic that specifies the range within which the population parameter is likely to fall. The confidence level of a CI is the degree of confidence, expressed as a %, that the interval contains the population parameter (e.g. the population mean) for which we have an estimate calculated from our sample.

• Accuracy & precision = small standard error. The most direct way to get there is to increase N (sample size).

• With a 95% CI, about 5% of the intervals built from repeated samples will erroneously exclude the population value.
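
The "about 5% will miss" idea can be illustrated with a simulation. The sketch below assumes a hypothetical normal population (mean 50, SD 10) with a known SD, draws many samples of size 30, builds a 95% interval around each sample mean, and counts how often the interval contains the true mean.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(1)                        # fixed seed so the run is repeatable
mu, sigma, n = 50.0, 10.0, 30         # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)       # about 1.96 for a 95% interval
se = sigma / math.sqrt(n)             # standard error of the mean

trials = 2000
hits = 0
for _ in range(trials):
    xbar = mean([random.gauss(mu, sigma) for _ in range(n)])
    if xbar - z * se <= mu <= xbar + z * se:
        hits += 1                     # this interval contains the true mean

coverage = hits / trials
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

The printed share should land close to 0.95; the remainder are the intervals that exclude the population value.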


Question

You sample 36 apples from your farm’s harvest of over 200,000 apples. The mean weight of the sample is 112 grams (with a 40 gram sample standard deviation). What is the probability that the mean weight of all 200,000 apples is between 100 and 124 grams?
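
One way to approach this question is sketched below. It assumes the normal (z) approximation with the sample standard deviation standing in for the population value; a t distribution with 35 degrees of freedom would give a slightly different figure. Strictly speaking, the result is a confidence level for the interval 100 to 124 grams.

```python
import math
from statistics import NormalDist

n = 36
sample_mean = 112.0
sample_sd = 40.0

se = sample_sd / math.sqrt(n)              # standard error of the mean
z = (124 - sample_mean) / se               # half-width of the interval in standard errors
confidence = NormalDist().cdf(z) - NormalDist().cdf(-z)

print(f"standard error = {se:.2f} g, z = {z:.2f}, confidence = {confidence:.4f}")
```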


Question

In a local teaching district a technology grant is available to teachers in order to install a cluster of four computers in their classrooms. From the 6250 teachers in the district, 250 were randomly selected and asked if they felt that computers were an essential teaching tool for their classroom. Of those selected, 142 teachers felt that computers were an essential teaching tool.

(1) Calculate a 99% confidence interval for the proportion of teachers who felt that computers are an essential teaching tool.

(2) How could the survey be changed to narrow the confidence interval while maintaining the 99% confidence level?
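
A sketch of part (1), assuming the usual normal-approximation interval for a proportion and ignoring the finite population correction (the sample is 4% of the district, so the correction would narrow the interval only slightly):

```python
import math
from statistics import NormalDist

n = 250
successes = 142
p_hat = successes / n                              # sample proportion

z = NormalDist().inv_cdf(0.995)                    # about 2.576 for 99% confidence
se = math.sqrt(p_hat * (1 - p_hat) / n)            # standard error of a proportion
lower, upper = p_hat - z * se, p_hat + z * se

print(f"p_hat = {p_hat:.3f}, 99% CI = ({lower:.4f}, {upper:.4f})")
```

For part (2), the standard error shrinks as n grows, so surveying more teachers narrows the interval at the same confidence level.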


Hypothesis Testing

• Two hypotheses in the testing process: Null & Research …

• Reject the null hypothesis = the sample evidence is unlikely enough under the null hypothesis that you conclude it is false. This means that the alternative hypothesis is taken to represent the correct state of affairs in the population.

Fail to reject the null hypothesis = you show that the null hypothesis cannot be rejected. There is insufficient evidence to support the argument that you make in your research hypothesis.

You NEVER accept the null hypothesis – because you can NEVER prove that the population means were equal/less/more… We can NEVER be certain.

• Level of significance (α) – the level of risk you are willing to accept. The p value from the test is compared against it: with α = .05, we reject the null hypothesis only when p < .05, so the probability of falsely rejecting a true null hypothesis is at most 5 in 100.
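
The decision rule can be sketched with a one-sample z test. All numbers below are hypothetical and chosen only for illustration (null mean 100, sample mean 103, standard deviation 15, N = 100), and the test assumes the standard deviation is known.

```python
import math
from statistics import NormalDist

null_mean = 100.0      # hypothetical value stated by the null hypothesis
sample_mean = 103.0    # hypothetical sample result
sd = 15.0              # hypothetical (assumed known) standard deviation
n = 100
alpha = 0.05           # level of significance chosen before looking at the data

se = sd / math.sqrt(n)                           # standard error of the mean
z = (sample_mean - null_mean) / se               # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p value

decision = "reject the null" if p_value < alpha else "fail to reject the null"
print(f"z = {z:.2f}, p = {p_value:.4f}: {decision}")
```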
