
Kuwait University

College of Business Administration

Department of Quantitative Methods and Information Systems

Tutorial Stat 220

Chapter 7: sampling and sampling distribution

Chapter 8: confidence interval

Chapter 9: hypothesis testing

Chapter 10: statistical inference based on two samples

Chapter 11: ANOVA

Chapter 12: chi-square test for independence

Chapter 13: simple linear regression

Chapter 14: multiple regression and model building

Done by:

T.A Dalal Al-Odah

T.A Narjes Akbar

T.A Dalal AL-Banwan

Supervised by Dr.Mohammed Qadry Grraph


summer 2014/2015


Chapter 7

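I. Sampling distribution of the sample mean \(\bar{x}\)

Exactly normal

If the population data follow Normal(μ, σ), then the sampling distribution is

\(\bar{x} \sim N\left(\mu_{\bar{x}} = \mu,\ \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\right)\)

Approximately normal

If the population data follow any distribution (not normal) and the sample size is large (\(n \ge 30\)), then the sampling distribution is approximately

\(\bar{x} \sim N\left(\mu_{\bar{x}} = \mu,\ \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\right)\)

(by the central limit theorem, CLT)

\(Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}\)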

Where:

\(\bar{x}\): Sample mean

μ: Population mean

σ: Population standard deviation

\(\sigma_{\bar{x}}\): Standard deviation of the sample mean (standard error)

\(\mu_{\bar{x}}\): Mean of the sample mean

Note: mean = average = rate = expected
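The definitions above can be checked numerically. The following is a minimal sketch, assuming Python's standard `statistics.NormalDist`; the helper name `sample_mean_distribution` is hypothetical, and the numbers are those of Exercise 1 in this section (μ = 80, σ = 25, n = 75).

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_distribution(mu, sigma, n):
    """Return the (approximate) normal sampling distribution of the sample mean."""
    # Mean stays mu; the standard error is sigma / sqrt(n).
    return NormalDist(mu=mu, sigma=sigma / sqrt(n))

# Numbers from Exercise 1: the population is skewed, but n = 75 >= 30, so the CLT applies.
xbar_dist = sample_mean_distribution(80, 25, 75)
standard_error = xbar_dist.stdev
p_between = xbar_dist.cdf(77) - xbar_dist.cdf(72)  # P(72 < x-bar < 77)

print(round(standard_error, 4), round(p_between, 4))
```

Using the cumulative distribution directly avoids rounding Z to two decimals, so results may differ slightly from values read off a Z table.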

EXERCISES

1. The amounts of electric bills for all households in a city have a skewed probability distribution with a mean of $80 and a standard deviation of $25. Let \(\bar{x}\) be the mean amount of electric bills for a random sample of 75 households selected from this city. Find:

a. The mean of the sampling distribution of \(\bar{x}\)

b. The variance of the sampling distribution of \(\bar{x}\)

c. The standard deviation (error) of the sampling distribution of \(\bar{x}\)

d. What is the sampling distribution of the sample mean \(\bar{x}\)?

e. Find the probability that the mean amount of electric bills for a random sample of 75 households selected from this city will be

i. Between $72 and $77

ii. Within $6 of the population mean

iii. More than the population mean by at least $5

iv. Less than the population mean by at least $2

v. Either less than $72 or more than $77


2. The print on the package of 100-watt General Electric soft-white light bulbs claims that these bulbs have an average life of 750 hours. Assume that the lives of such bulbs have a normal distribution with a mean of 750 hours and a standard deviation of 55 hours. Let \(\bar{x}\) be the mean life of a random sample of 25 such bulbs.

a. Find the mean and standard deviation of \(\bar{x}\) and describe its sampling distribution.

b. Find the probability that the mean life of a random sample of 25 such bulbs will be within 15 hours of the population mean.

c. Find the fraction of samples for which \(\bar{x}\) will be less than the population mean by 20 hours or more.

d. Find the percentage of samples for which \(\bar{x}\) will be more than the population mean by at least 20 hours.

e. Find the probability that \(\bar{x}\) will be within 1.5 standard deviations (errors) of the population mean.

REVIEW

1. A quality control inspector periodically checks a production process. This inspector selects simple random samples of 30 finished products and computes the sample mean product weight \(\bar{x}\). Test results over a long period of time show that 2.5% of the \(\bar{x}\) values are over 2.1 kg and 2.5% are less than 1.9 kg.

a. What are the mean and standard deviation for the population of products produced with this process? (μ = 2, σ = 0.2795)

b. Find the probability that \(\bar{x}\) will be within one standard deviation (error) of the population mean. (\(P(\mu - \sigma_{\bar{x}} \le \bar{x} \le \mu + \sigma_{\bar{x}}) = P(-\sigma_{\bar{x}} \le \bar{x} - \mu \le \sigma_{\bar{x}}) = P(-1 \le z \le 1) = 0.682\))

2. A machine makes 3-inch-long nails. The probability distribution of the lengths of these nails is normal with a mean of 3 inches and a standard deviation of 0.1 inch. The quality control inspector takes a sample of 25 nails once a week and calculates the mean length of these nails. If the mean of this sample is either less than 2.95 inches or greater than 3.05 inches, the inspector concludes that the machine needs an adjustment.

a. What is the mean, standard deviation (error) and sampling distribution of the sample mean? (\(\bar{x} \sim N(\mu_{\bar{x}} = 3,\ \sigma_{\bar{x}} = 0.02)\))

b. What is the probability that based on a sample of 35 nails the inspector will conclude that the machine needs an adjustment? (\(P(\bar{x} < 2.95) + P(\bar{x} > 3.05) = 0.003\))

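II. Sampling distribution of the sample proportion \(\hat{P}\)

Approximately normal

If \(nP \ge 5\) and \(n(1-P) \ge 5\), then the sampling distribution is approximately

\(\hat{P} \sim N\left(\mu_{\hat{P}} = P,\ \sigma_{\hat{P}} = \sqrt{\dfrac{P(1-P)}{n}}\right)\)

(by the central limit theorem, CLT)

\(Z = \dfrac{\hat{P} - P}{\sqrt{\dfrac{P(1-P)}{n}}}\)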

Where:

P: Population proportion

\(\hat{P}\): Sample proportion

\(\mu_{\hat{P}}\): Mean of the sample proportion

\(\sigma_{\hat{P}}\): Standard deviation of the sample proportion (standard error)
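As an illustration of these definitions, here is a small sketch that builds the approximate sampling distribution of the sample proportion and checks the normality condition first. The function name `sample_proportion_distribution` is hypothetical; the numbers (P = 0.80, n = 100) come from the battery exercise in this section.

```python
from math import sqrt
from statistics import NormalDist

def sample_proportion_distribution(p, n):
    """Return the approximate normal sampling distribution of the sample proportion."""
    # The normal approximation is only used when nP >= 5 and n(1 - P) >= 5.
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("normal approximation needs nP >= 5 and n(1-P) >= 5")
    return NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))

# Battery exercise: P = 0.80, n = 100.
phat_dist = sample_proportion_distribution(0.80, 100)
se_phat = phat_dist.stdev
# Probability that the sample proportion is within 0.05 of P.
p_within = phat_dist.cdf(0.85) - phat_dist.cdf(0.75)

print(round(se_phat, 4), round(p_within, 4))
```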

EXERCISES

1. A corporation makes auto batteries. The company claims that 80% of its batteries are good for 70 months or longer. Assume that this claim is true. Let \(\hat{P}\) be the proportion in a sample of 100 batteries that are good for 70 months or longer.

a. What is the mean, standard deviation (error), and sampling distribution of the sample proportion?

b. What is the probability that the sample proportion is less than 0.90?

c. What is the probability that this \(\hat{P}\) is within 0.05 of the population proportion?

d. What is the probability that this \(\hat{P}\) is not within 0.05 of the population proportion?

e. What is the probability that \(\hat{P}\) is less than the population proportion by 0.06 or more?

REVIEW

1. Suppose that among the undergraduate students at a very large university 5.9% are international students and 57.8% are female.

a. If 28 students are randomly sampled, what is the probability that fewer than 14 are female? (\(P(x_f < 14) = P\left(\dfrac{x_f}{n} < \dfrac{14}{28}\right) = P(\hat{p}_f < 0.5) = P(z < -0.84) = 0.2005\))

b. If 10 students are randomly sampled, what is the probability that more than 10% are international students? (\(P(\hat{P}_I > 0.1) = P(z > 0.55) = 0.2912\))

2. A machine that is used to make CDs is known to produce 6% defective CDs. The quality control inspector selects a sample of 100 CDs every week and inspects them for being good or defective. If 8% or more of the CDs in the sample are defective, the process is stopped and the machine will be adjusted. What is the probability that based on a sample of 100 CDs the process will be stopped to adjust the machine? (\(P(\hat{p} > 0.08) = P(z > 0.83) = 0.2033\))

Chapter 8

I. Population mean (μ)

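Point estimate of μ: \(\hat{\mu} = \bar{x}\)

Confidence interval (C.I.) of μ

o σ known: \(\bar{x} \pm Z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}\)

o σ unknown, n ≥ 30: \(\bar{x} \pm Z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}\)

o σ unknown, n < 30: \(\bar{x} \pm t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}\)

In each case the term after ± is the margin of error (max error / error).

To find the sample size

\(n = \left(\dfrac{Z_{\alpha/2}\cdot \text{std. dev}}{E}\right)^2\), where E is the margin of error

If we have a C.I. (L, U) then

o Sample mean = \(\dfrac{L+U}{2}\)

o Margin of error = \(\dfrac{U-L}{2} = \dfrac{\text{width (length)}}{2}\)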
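The Z-based interval and the sample-size formula can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, it covers only the Z cases (the t case needs t-table values, which the Python standard library does not provide), and the inputs are taken from Exercises 2 and 3(a) of this section.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Return Z_{alpha/2} for a given confidence level, e.g. 0.90."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_interval(xbar, std_dev, n, confidence):
    """Return (lower, upper) of the Z-based confidence interval for the mean."""
    margin = z_critical(confidence) * std_dev / sqrt(n)
    return xbar - margin, xbar + margin

def sample_size_for_mean(std_dev, error, confidence):
    """Return the smallest n whose margin of error is at most `error`."""
    return ceil((z_critical(confidence) * std_dev / error) ** 2)

# Exercise 2: 90% confidence, E = 3500, sigma = 31500.
n_houses = sample_size_for_mean(31500, 3500, 0.90)

# Exercise 3(a): sigma = 15 is treated as known, n = 22, x-bar = 106, 90% confidence.
lower, upper = z_interval(106, 15, 22, 0.90)

print(n_houses, round(lower, 2), round(upper, 2))
```

The sample size is always rounded up, since rounding down would give a margin of error slightly larger than E.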

Exercises:

1. A random sample of 16 mid-sized cars, which were tested for fuel consumption, gave a mean of 26.4 miles/gallon with standard deviation of 2.3 miles/gallon.

a. Find a 95% confidence interval for the average fuel consumption of a mid-sized car?

b. What assumption(s) are necessary for your answer in (a) to be valid?

c. Find the error of such interval?

d. If we choose a sample of size 100 mid-sized cars, then repeat part (a)?

e. What sample size would be required to reduce the margin of error by 50%?

2. An economist wants to find a 90% confidence interval for the mean sale price of houses in a state. How large a sample should he or she select so that the estimate is within $3500 of the population mean? Assume that the standard deviation for the sale prices of all houses in this state is $31500?


3. IQ tests are designed to yield results that are approximately normally distributed. Researchers think that the population standard deviation is 15. A reporter is interested in estimating the average IQ of employees in a large high-tech firm in California. She gathers the IQ information on 22 employees of this firm and records the sample mean IQ as 106.

a. Compute 90% confidence intervals of the average IQ in this firm.

b. If the C.I is (97.77, 114.23) find the confidence level

4. In analyzing the operating cost for a huge fleet of delivery trucks, a manager takes a sample of 25 cars and calculated the sample mean and variance of the operating cost. Under the assumption that the operating cost has a normal distribution, he found that the 95% confidence interval for the mean operating cost is between 253 and 300 K.D.

a. Find the maximum error of estimate (error bound) for such interval?

b. Find the sample mean and variance?

c. The manager said he is 95% confident that the sample mean lies within such an interval. Do you agree? Why?


d. Construct a 90% confidence interval for the true mean? Find the error of such interval?

Review:

1. To measure the time taken to manufacture a device, a random sample was chosen. The following is the assembly time (the time taken to fix each device, in minutes) for the sample:

8 10 12 15 17

If the sample information is used to estimate the population mean of the assembly time then:

a. Give a point estimate and a 99% confidence interval for the population mean. State the necessary assumptions. (\(\hat{\mu} = \bar{x} = \dfrac{\sum x}{n} = 12.4\), \(\bar{x} \pm t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}\), assumption: normal population)

b. If the population standard deviation is known to be 3, how large is the sample size needed to estimate the mean assembly time with 0.99 confidence and an error margin of one minute? (\(n = \left(\dfrac{Z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\dfrac{2.576 \times 3}{1}\right)^2 = 59.72 \cong 60\))

2. Determine the margin of error for a confidence interval estimate for the population mean of a normal distribution given the following information: Confidence level = 0.98, n = 13, S = 15.68. (M.E = 11.66)


II. Population proportion

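Point estimate of P: \(\hat{P}\)

Confidence interval of P

\(\hat{P} \pm Z_{\alpha/2}\sqrt{\dfrac{\hat{P}(1-\hat{P})}{n}}\)

The term after ± is the margin of error (max error / error).

To find the sample size

\(n = \dfrac{Z_{\alpha/2}^{2}\; p\, q}{E^{2}}\), where \(q = 1 - p\)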
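A short sketch of both formulas follows. The function names are hypothetical; the inputs are the health-club exercise (96 of 240 companies, 97% confidence) and the conservative sample-size review question (E = 0.035, 99% confidence, p taken as 0.5 because no preliminary estimate is given).

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Return Z_{alpha/2} for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_interval(successes, n, confidence):
    """Return the point estimate and margin of error for a population proportion."""
    p_hat = successes / n
    margin = z_critical(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

def sample_size_for_proportion(error, confidence, p=0.5):
    """Return the required n; p = 0.5 gives the most conservative (largest) n."""
    z = z_critical(confidence)
    return ceil(z ** 2 * p * (1 - p) / error ** 2)

p_hat, margin = proportion_interval(96, 240, 0.97)
n_conservative = sample_size_for_proportion(0.035, 0.99)

print(round(p_hat, 3), round(margin, 4), n_conservative)
```

The choice p = 0.5 maximizes p(1 − p), which is why it is used when no prior estimate of the proportion exists.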

Exercises:

1. It is said that happy and healthy workers are efficient and productive. A company that manufactures exercising machines wanted to know the percentage of large companies that provide on-site health club facilities. A sample of 240 such companies showed that 96 of them provide such facilities on site.

a. What is the point estimate of the percentage of all such companies that provide such facilities on site? What is the margin of error associated with this point estimate?

b. Construct a 97% C.I for the percentage of all such companies that provide such facilities on site.

2. A consumer agency wants to estimate the proportion of all drivers who wear seat belts while driving. Assume that a preliminary study has shown that 76% of drivers wear seat belts while driving. How large should the sample size be so that the 99% C.I for the population proportion has a maximum error of 0.03?

3. A college registrar has received complaints about the online registration procedure at her college. She wants to estimate the proportion of all students at this college who are dissatisfied with the online registration procedure. What is the most conservative estimate of the sample size that would limit the maximum error to be within 0.05 of the population proportion for a 90% C.I?

Review

1. A researcher wanted to know the percentage of judges who are in favor of the death penalty. He took a random sample of 15 judges and asked them whether or not they are in favor of the death penalty. The responses of these judges are given here:

Yes No Yes Yes No No No Yes
Yes No Yes Yes Yes No Yes

a. What is the point estimate of the population proportion? What is the margin of error associated with this point estimate? (\(\hat{P} = \dfrac{x}{n} = 0.6\), \(M.E = Z_{\alpha/2}\sqrt{\dfrac{\hat{P}(1-\hat{P})}{n}} = 0.2479\))

b. Make a 95% C.I for the percentage of all judges who are in favor of the death penalty. (0.6 ± 0.24792)

2.

a. How large a sample should be selected so that the maximum error of estimate for a 99% C.I for the population proportion is 0.035, when the value of the sample proportion obtained from a preliminary sample is 0.29? (0.29 ± .035)

b. Find the most conservative sample size that will produce a maximum error of 0.035 for a 99% C.I for p. (Hint: if p is not given, take p = 0.5; n = 1354.25 ≅ 1355)


Chapter 9

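I. Testing hypothesis for μ

1. State the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\))

\(H_0: \mu = \mu_0\) (or ≤, ≥) vs \(H_1: \mu \neq \mu_0\) (or >, <), where \(\mu_0\) denotes the hypothesized value

2. Calculate the test statistic (T.S) for μ

o σ known: \(Z_C = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}\)

o σ unknown, n ≥ 30: \(Z_C = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}\)

o σ unknown, n < 30: \(T_C = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}\)

3. Determine the rejection region (R.R)

o Two-tailed: below \(-Z_{\alpha/2}\) or above \(Z_{\alpha/2}\) (with t: \(-t_{\alpha/2,\,n-1}\) and \(t_{\alpha/2,\,n-1}\))

o Left-tailed: below \(-Z_{\alpha}\) (with t: \(-t_{\alpha,\,n-1}\))

o Right-tailed: above \(Z_{\alpha}\) (with t: \(t_{\alpha,\,n-1}\))

4. Conclusion: we reject \(H_0\) if T.S lies in the R.R

Alternatively, using the p-value:

3. Calculate the p-value

o Two-tailed: the sum of the two tail areas beyond \(-Z_C\) and \(Z_C\)

o Left-tailed: the area to the left of the test statistic

o Right-tailed: the area to the right of the test statistic

4. Conclusion: we reject \(H_0\) if p-value < α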

Type I error:

o What is the type I error? Reject \(H_0\) when \(H_0\) is true.

o What is the probability of type I error? P(type I error) = α = P(Reject \(H_0\) | \(H_0\) is true)

o Note: 1 − α = P(Do not reject \(H_0\) | \(H_0\) is true)

Type II error:

o What is the type II error? Do not reject \(H_0\) when \(H_0\) is false.

o What is the probability of type II error? P(type II error) = β = P(Do not reject \(H_0\) | \(H_0\) is false)

o Note: 1 − β = P(Reject \(H_0\) | \(H_0\) is false) = power of the test

Your decision based on a random sample | The null hypothesis is true | The null hypothesis is false
Reject | Type I error | Correct decision
Do not reject | Correct decision | Type II error
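The four steps for a test about μ can be sketched in code for the Z cases. This is a sketch under stated assumptions: the function name `z_test_mean` and its `tail` argument are hypothetical, and the numbers are those of Exercise 1 below (n = 324, sample mean 1005, s = 330, hypothesized mean 940, α = 0.01, two-tailed).

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, std_dev, n, tail="two"):
    """Return (Z_C, p-value) for a Z test about a population mean."""
    z = (xbar - mu0) / (std_dev / sqrt(n))
    std_normal = NormalDist()
    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # both tail areas
    elif tail == "left":
        p_value = std_normal.cdf(z)                  # area to the left
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z)              # area to the right
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    return z, p_value

# n >= 30 with sigma unknown, so s replaces sigma in the Z statistic.
z_c, p_value = z_test_mean(1005, 940, 330, 324, tail="two")
reject_h0 = p_value < 0.01  # reject H0 when p-value < alpha

print(round(z_c, 2), reject_h0)
```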

Exercises:

1. According to survey by the national retail Association, the average amount that households in the United States planned to spend on gifts, decorations, greeting cards, and food during 2001 holiday season was $940. Suppose that a recent random sample of 324 households showed that they plan to spend an average of $1005 on such items during this year’s holiday season with a standard deviation of $330.

a. Test at the 1% significance level whether the mean of such planned holiday related expenditures for households for this year differs from $940.

1) 2)

3)

4)

b. Find 99% C.I of µ.

c. Use C.I from part (b) to test H0 : μ=940 vs H 1: μ≠ 940

2. A drug company is considering marketing a new local anesthetic. The effective time of the anesthetic that the drug company currently produces has a normal distribution with an average of 7.4 minutes and a standard deviation of 1.2 minutes. To market the new anesthetic, the mean effective time should be less than 7.4 minutes. A sample of size 36 results in a sample mean of 7.1. A hypothesis test will be done to help make a decision.

a. State the null and the alternative hypotheses

b. Compute the test statistic

c. Compute the P-value of the test

d. What is your recommendation to the drug company using a level of significance of 0.01?

3. Insurance companies track life expectancy information to assist in determining the cost of life insurance policies. Last year the average life expectancy of all policyholders was 77 years. A company wants to determine if their clients now have longer life expectancy on average, so they randomly sample 20 of their recent paid policies. The sample has a mean of 78.6 years and a standard deviation of 4.48 years.

a. Write the null and alternative hypotheses

b. What is the value of the test statistic?

c. State your conclusion using α=0.05

d. Considering the result of the test, which type of errors in hypothesis testing could you have made?

e. State your assumptions


II. Testing hypothesis for P

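1. State the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\))

\(H_0: P = P_0\) (or ≤, ≥) vs \(H_1: P \neq P_0\) (or >, <), where \(P_0\) denotes the hypothesized value

2. Calculate the test statistic (T.S) for P

\(Z_C = \dfrac{\hat{P} - P}{\sqrt{\dfrac{P(1-P)}{n}}}\)

3. Determine the rejection region (R.R)

o Two-tailed: below \(-Z_{\alpha/2}\) or above \(Z_{\alpha/2}\)

o Left-tailed: below \(-Z_{\alpha}\)

o Right-tailed: above \(Z_{\alpha}\)

4. Conclusion: we reject \(H_0\) if T.S lies in the R.R

Alternatively, using the p-value:

3. Calculate the p-value

o Two-tailed: the sum of the two tail areas beyond \(-Z_C\) and \(Z_C\)

o Left-tailed: the area to the left of the test statistic

o Right-tailed: the area to the right of the test statistic

4. Conclusion: we reject \(H_0\) if p-value < α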
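For a proportion, the same steps can be sketched as below. The function name `z_test_proportion` is hypothetical; the data are from Exercise 2 of this part (hypothesized proportion 0.64, 52 of 100 shoppers, two-tailed test at α = 0.05).

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(successes, n, p0):
    """Return (Z_C, two-tailed p-value) for a test about a population proportion."""
    p_hat = successes / n
    # The standard error uses the hypothesized proportion p0, not p_hat.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_two_tailed = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_tailed

z_prop, p_prop = z_test_proportion(52, 100, 0.64)
reject_h0_prop = p_prop < 0.05

print(round(z_prop, 2), round(p_prop, 4), reject_h0_prop)
```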

Exercises:

1. A food company is planning to market a new type of frozen yogurt. However, before marketing this yogurt, the company wants to find what percentage of the people like it. The company’s management has decided that it will market this yogurt only if at least 35% of the people like it. The company’s research department selected a random sample of 400 persons and asked them to taste this yogurt. Of these 400 persons, 112 said they like it.

a. Testing at the 2.5% significance level, can you conclude that the company should market this yogurt?

1) 2)

3)

4)

b. What will your decision be in part (a) if the probability of making a type І error is zero?

2. A study by Consumer Reports showed that 64 percent of supermarket shoppers believe supermarket brands to be as good as national name brands.

a. Formulate the hypotheses that can be used to determine whether the percentage of supermarket shoppers who believe that supermarket brands to be as good as national name brands is different from 64%.

b. If a sample of 100 shoppers showed that 52 stating that the supermarket brand was as good as the national brand, what is the value of the test statistic?


c. What is the p-value?

d. At α =0.05, what is your conclusion? Justify your answer.

3. Suppose that in a sample of 1000 employees 23% said that losing their job is the major reason of concern for them.

a. Find a 98% confidence interval for the percentage of employees who said losing their job is the major reason of concern for them.

b. According to your confidence interval obtained in (a) do you believe that percentage is different from 19% and why or why not?

Review:

1. The policy of a company is to deliver on time at least 90% of all the orders it receives from its customers. The quality control inspector at the company usually takes samples of orders delivered and checks if this policy is maintained. A sample of 90 orders taken by this inspector showed that 75 of them were delivered on time. At the 2% significance level, can you conclude that the company’s policy is maintained? Use the p-value to conduct the test.( H 1 :P<0.9 , Zc=−2.21 , p−value=0.0136 , reject H 0 ).


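III. One population variance \(\sigma^2\)

Point estimate

o For one population variance \(\sigma^2\): \(S^2\)

o For one population st. deviation σ: S

Confidence interval for

o One population variance \(\sigma^2\): \(\dfrac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\)

o One population st. deviation σ: \(\sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}}\)

Hypothesis test about one variance \(\sigma^2\)

1. \(H_0: \sigma^2 = \sigma_0^2\) (or ≤, ≥) vs \(H_1: \sigma^2 \neq \sigma_0^2\) (or >, <), where \(\sigma_0^2\) denotes the hypothesized value

2. Calculate the test statistic

\(\chi^2_c = \dfrac{(n-1)S^2}{\sigma^2}\)

3. Determine the rejection region

o Two-tailed: below \(\chi^2_{1-\alpha/2,\,n-1}\) or above \(\chi^2_{\alpha/2,\,n-1}\)

o Left-tailed: below \(\chi^2_{1-\alpha,\,n-1}\)

o Right-tailed: above \(\chi^2_{\alpha,\,n-1}\)

4. Conclusion: Reject \(H_0\) if T.S \(\chi^2_c\) lies in the rejection region

Note: a hypothesis test about one population st. dev (σ) is the same as a hypothesis test about one population variance (\(\sigma^2\)), but you need to convert the hypothesis from σ to \(\sigma^2\).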

Exercises:

1. A professor claims that the variance of the lengths of his lectures is within 2 square minutes. A random sample of 23 of these lectures was timed, and the variance of the lengths of these lectures was found to be 2.7 square minutes. Assume that the lengths of all such lectures by the professor are approximately normally distributed.

a) Find the point estimate of the population variance

b) Make the 98% confidence intervals for the variance and standard deviation of the lengths of all lectures by the professor.

c) Test at the 1% significance level whether the variance of the lengths of all such lectures by the professor exceeds 2 square minutes.


2. An assembly line produces units with a mean weight of 10 and a standard deviation of 0.20. A new process supposedly will produce units with the same mean and a smaller standard deviation. A sample of 20 units produced by the new method has a sample standard deviation of 0.126. At a significance level of 10% can we conclude that the new process has less variation than the old?

Review

1. Automotive parts must be machined to close tolerances to be acceptable to customers. Production specifications call for a maximum variance in the lengths of the parts of 0.0004. Suppose the sample variance for 30 parts turns out to be \(s^2 = 0.0005\). Using α = 0.05, test to see whether the population variance specification is being violated. (\(H_1: \sigma^2 > 0.0004\), \(\chi^2_c = 36.25\), \(\chi^2_{table} = 42.557\), do not reject \(H_0\))
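The review answer above can be reproduced with a few lines. This is only a sketch: the function name is hypothetical, and the critical value 42.557 is copied from the answer above rather than computed, because the Python standard library has no chi-square table.

```python
def chi_square_statistic(n, sample_variance, sigma0_squared):
    """Return the test statistic (n - 1) * s^2 / sigma0^2."""
    return (n - 1) * sample_variance / sigma0_squared

# Automotive parts review question: n = 30, s^2 = 0.0005, sigma0^2 = 0.0004.
chi2_c = chi_square_statistic(30, 0.0005, 0.0004)

# Right-tailed critical value for alpha = 0.05 and 29 d.f., as quoted in the answer.
CHI2_TABLE_05_29 = 42.557
chi2_reject = chi2_c > CHI2_TABLE_05_29  # right-tailed rejection rule

print(round(chi2_c, 2), chi2_reject)
```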


Chapter 10

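I. The difference between two population means (\(\mu_1 - \mu_2\)) for independent samples

Point estimate: \(\bar{X}_1 - \bar{X}_2\)

Case 1: \(\sigma_1\) & \(\sigma_2\) known

C.I: \((\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\)

Test statistic: \(Z_c = \dfrac{(\bar{X}_1 - \bar{X}_2) - D}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}\), with \(Z_c \sim N(0,1)\)

Case 2: \(\sigma_1\) & \(\sigma_2\) unknown, \(n_1\) and \(n_2\) large

C.I: \((\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}\)

Test statistic: \(Z_c = \dfrac{(\bar{X}_1 - \bar{X}_2) - D}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}\), with \(Z_c \sim N(0,1)\)

Case 3: \(\sigma_1\) & \(\sigma_2\) unknown, \(n_1\) and \(n_2\) small, \(\sigma_1^2 = \sigma_2^2\) (homogeneous)

C.I: \((\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\; S_P \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\)

Pooled estimate: \(S_P^2 = \dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\)

Test statistic: \(t_c = \dfrac{(\bar{X}_1 - \bar{X}_2) - D}{S_P\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\), with \(t_c \sim t(n_1+n_2-2)\)

Case 4: \(\sigma_1\) & \(\sigma_2\) unknown, \(n_1\) and \(n_2\) small, \(\sigma_1^2 \neq \sigma_2^2\)

C.I: \((\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,n^*}\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}\)

Test statistic: \(t_c = \dfrac{(\bar{X}_1 - \bar{X}_2) - D}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}\), with \(t_c \sim t(n^*)\)

Degrees of freedom: \(n^* = \dfrac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1-1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2-1}}\)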
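The pooled-variance statistic (Case 3) and the degrees-of-freedom formula for unequal variances (Case 4) can be sketched as follows. The function names are hypothetical; the inputs are the supplier delivery-time summary from the review questions of this chapter (n1 = 10, mean 14, s1 = 4; n2 = 20, mean 12.5, s2 = 2).

```python
from math import sqrt

def pooled_t_statistic(n1, xbar1, s1, n2, xbar2, s2, d0=0.0):
    """Return (t_c, degrees of freedom) using the pooled variance estimate."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    std_error = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    return ((xbar1 - xbar2) - d0) / std_error, n1 + n2 - 2

def unequal_variance_df(n1, s1, n2, s2):
    """Return the degrees of freedom n* used when the variances are not assumed equal."""
    a, b = s1 ** 2 / n1, s2 ** 2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

t_c, df_pooled = pooled_t_statistic(10, 14, 4, 20, 12.5, 2)
df_star = unequal_variance_df(10, 4, 20, 2)

print(round(t_c, 2), df_pooled, round(df_star, 2))
```

The critical value itself still has to be read from a t table, since the standard library offers no t distribution.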

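II. Hypothesis test about homogeneity (equal population variances \(\sigma_1^2 = \sigma_2^2\))

1. \(H_0: \sigma_1^2 = \sigma_2^2\) vs \(H_1: \sigma_1^2 \neq \sigma_2^2\)

2. Calculate the test statistic

\(F_C = \dfrac{S^2_{large}}{S^2_{small}}\)

3. Determine the critical value

\(F_{\alpha/2,\ n-1,\ n-1}\), where the first n − 1 is the numerator degrees of freedom and the second n − 1 is the denominator degrees of freedom

4. Conclusion: Reject \(H_0\) if T.S (\(F_C\)) lies in the rejection region (under the shaded area).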
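A minimal sketch of the F statistic follows, with the larger sample variance placed in the numerator. The function name is hypothetical; the data are the gasoline-price review question of this chapter (s = 0.08 with n = 9, s = 0.03 with n = 14), and the critical value 2.77 is copied from that review answer rather than computed.

```python
def f_statistic(s1, n1, s2, n2):
    """Return (F_C, numerator d.f., denominator d.f.) with the larger variance on top."""
    v1, v2 = s1 ** 2, s2 ** 2
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1
    return v2 / v1, n2 - 1, n1 - 1

f_c, df_num, df_den = f_statistic(0.08, 9, 0.03, 14)

# Table value F_{0.05, 8, 13} quoted in the review answer (alpha = 0.10, two-tailed).
F_TABLE = 2.77
f_reject = f_c > F_TABLE

print(round(f_c, 2), df_num, df_den, f_reject)
```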

Exercises:

1. El-Mraay Dairy company claims that its 8-ounces low-fat yogurt cups contain on the average fewer calories than the 8-ounces low-fat yogurt cups produced by its competitor El-Safy company. In order to check this claim a sample of 50 such cups produced by El-Mraay showed that they contains on the average 144 calories per cup with a standard deviation of 5.4 calories. A sample of 40 cups of El-Safy product showed that they contain on the average 147 calories per cup with a standard deviation 6.3 calories.

a. Make a 98% confidence interval for the difference between the mean number of calories in the 8-ounces low fat yogurt cups produced by the two companies

b. Find the standard error and the margin of error of part (a).

c. Does your C.I obtained in part (a) support the hypothesis that the two means are different? What is the probability of type I error in that case?

d. Test El-Mraay Dairy Company claim with α=0.05


2. Two brands (A and B) of tires are tested to compare their durability. The management of the company claims that brand A is more durable than brand B. Twelve tires from each brand are tested on a machine. The mileages (in hundreds of miles for each tire) have been recorded, giving the following information.

Mileages in hundreds

Brand A: 157 139 188 143 172 144 191 128 177 160 175 162
Brand B: 160 118 150 165 158 159 127 133 170 164 152 142

Brand | Mean | Standard deviation
Brand A | 161.3 | 20.01
Brand B | 149.8 | 16.45

Assuming that the two populations are normally distributed:

a. At the 5% level of significance, test the hypothesis that the two populations are homogeneous (equal variances).

b. Assuming homogeneous populations, test the management claim using α=0.05


3. A company claims that its medicine brand A provides faster relief from pain than another company's medicine brand B. A researcher tested both brands of medicine on two groups of randomly selected patients. The results of the test are given in the following table. The mean and the standard deviation of relief time are in minutes.

Brand | Sample size | Mean of relief time | Standard deviation of relief time
A | 21 | 44 | 12.5
B | 17 | 49 | 7.5

Assuming that relief time is normally distributed:

a. Assuming equal variances, test the company claim at the 0.05 level of significance.

b. Using α=0.05 test the hypothesis of homogeneous population (equal variances)


Review:

1. In order to study the performance of CBA students in Stat. 120, the QMIS department randomly selected 13 female and 12 male students, and their final scores were recorded, giving the following summary statistics:

Group | Sample size | Mean | STD
Female | 13 | 84.15 | 9.90
Male | 12 | 76.58 | 11.27

Assuming that the scores have homogeneous normal distributions, test the hypothesis that female students score more on average than male students (α = 0.05). (\(H_1: \mu_1 > \mu_2\), \(t_c = 1.788\), \(t_{0.05,\,23} = 1.714\), reject \(H_0\))

2. A sample of 18 fathers who were company executives showed that they spend an average of 2.3 hours per week playing with their children, with a standard deviation of 0.54 hours. Another sample of 24 fathers who were medical professionals gave a mean of 4.6 hours per week with a standard deviation of 0.8 hours. Assume that the times spent per week playing with their children by all fathers who are executives and all fathers who are medical professionals have normal distributions with equal standard deviations.

a. Construct a 95% C.I for difference between the mean time spent per week playing with their children by all fathers who are executives and all fathers who are medical professionals. (-2.741, -1.858)

b. Using the above C.I, test whether the mean time spent by all fathers who are executives is equal to that for all fathers who are medical professionals. (H1 :μ1≠ μ2 ,reject H 0)

3. A firm is studying the delivery times of two raw material suppliers. The firm is basically satisfied with supplier A and is prepared to stay with that supplier if the mean delivery time is the same as or less than that of supplier B. However, if the firm finds that the mean delivery time of supplier B is less than that of supplier A, it will begin making raw material purchases from supplier B.

a. What are the null and alternative hypotheses for this situation? (\(H_1: \mu_A > \mu_B\))

b. Assume the independent samples show the following delivery times for the two suppliers:

Supplier A | Supplier B
\(n_1 = 10\) | \(n_2 = 20\)
\(\bar{x}_1 = 14\) days | \(\bar{x}_2 = 12.5\) days
\(s_1 = 4\) | \(s_2 = 2\)

With α = 0.05 and using the t-test with pooled variance, what is your conclusion for the hypothesis from part (a)? What do you recommend in terms of supplier selection? (\(t_c = 1.38\), \(t_{0.05,\,28} = 1.701\), do not reject \(H_0\))


4. In a random sample of nine gasoline stations in city "A", the prices per gallon of unleaded gas have a standard deviation of $0.08 per gallon. In a random sample of 14 gasoline stations in city "B", the prices per gallon have a standard deviation of $0.03 per gallon. Use the 10% significance level to test the null hypothesis that the price per gallon of gasoline is equally variable in the two cities. (\(H_1: \sigma_1^2 \neq \sigma_2^2\), \(F_c = 7.11\), \(F_{0.05,\,8,\,13} = 2.77\), reject \(H_0\))

5. On the basis of data provided by a salary survey, the variance in annual salaries for seniors in accounting firms is approximately 2.1 and the variance in annual salaries for managers in accounting firms is approximately 11.1. The salary data were provided in thousands of euros. Assuming the salary data were based on samples of 25 seniors and 26 managers, test the hypothesis that the population variances in the salaries are equal. At α = 0.05, what is your conclusion? (\(H_1: \sigma_1^2 \neq \sigma_2^2\), \(F_c = 5.29\), \(F_{0.025,\,25,\,24} = 2.26\), reject \(H_0\))


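III. The difference between two population means (\(\mu_1 - \mu_2 = \mu_d\)) for dependent (paired) samples

Point estimate of (\(\mu_1 - \mu_2 = \mu_d\)): \(\hat{\mu}_d = \bar{d}\)

Confidence interval C.I

\(\bar{d} \pm t_{\alpha/2,\,n-1}\,\dfrac{S_d}{\sqrt{n}}\) if σ is unknown and n < 30

Where d: difference between the two variables

\(\bar{d} = \dfrac{\sum d}{n}\)

\(S_d^2 = \dfrac{\sum d^2 - \dfrac{(\sum d)^2}{n}}{n-1}\) or \(S_d^2 = \dfrac{\sum d^2 - n\bar{d}^2}{n-1}\) or \(S_d^2 = \dfrac{\sum (d - \bar{d})^2}{n-1}\)

\(S_d = \sqrt{S_d^2}\)

Hypothesis test

\(T_c = \dfrac{\bar{d} - 0}{S_d/\sqrt{n}}\) if σ is unknown and n < 30, with \(T_c \sim t_{n-1}\)

Note: We will use Z instead of T in both the C.I and the hypothesis test if

o σ is known

o σ is unknown with n ≥ 30

There are three ways to do the test, as mentioned in Chapter 9 of these notes.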
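The paired-sample quantities can be computed directly from the differences. The sketch below uses the before/after weights from the exercise-program review question later in this part (six persons); the variable names are illustrative, and d is taken as before minus after, as that question defines it.

```python
from math import sqrt
from statistics import mean, stdev

# Weights (in pounds) before and after the 12-week program, from the review question.
before = [180, 195, 177, 221, 208, 199]
after = [183, 187, 161, 204, 197, 189]

# Paired differences d = before - after.
d = [b - a for b, a in zip(before, after)]

d_bar = mean(d)          # point estimate of mu_d
s_d = stdev(d)           # sample standard deviation of d (divides by n - 1)
t_paired = (d_bar - 0) / (s_d / sqrt(len(d)))  # test statistic with n - 1 d.f.

print(round(d_bar, 3), round(s_d, 3), round(t_paired, 2))
```

`statistics.stdev` uses the n − 1 divisor, which matches the formulas for \(S_d^2\) given above.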


Exercises:

1. To test the difference between two body shop garages, 10 randomly chosen damaged cars were sent to these two garages (A and B). The following are the estimated repair charges of these garages.

A | B | d = A − B | d²
236 | 310 | |
137 | 187 | |
379 | 392 | |
255 | 232 | |
279 | 321 | |
321 | 318 | |
369 | 389 | |
333 | 288 | |
137 | 167 | |
390 | 432 | |

Assuming that the repair charges are normally distributed:

a. Test the hypothesis that the repair charge at garage A is lower than that at garage B, and state your assumptions.

b. Construct a 95% C.I of the difference between the two means.


2. The manufacturer of a gasoline additive claims that the use of this additive increases gasoline mileage. A random sample of 6 cars was driven for one week with the gasoline additive and then for one week without the gasoline additive. The following table provides the obtained information about the gasoline mileage.

Gasoline mileage | With | Without | D = with − without
Mean | 25.12 | 23.4 | 1.717
Standard deviation | 5.87 | 5.42 | 1.427

a. Compute a 99% confidence interval for the mean difference gasoline mileage?

b. Is it possible to say that the manufacturer’s claim is true? Why? Use α=0.01


Review:

1. A company claims that its 12-week special exercise program significantly reduces weight. A random sample of six persons was selected, and these persons were put on this exercise program for 12 weeks. The following table gives the weights (in pounds) of those six persons before and after the program. Assume that the population of all paired differences is (approximately) normally distributed.

Before | 180 | 195 | 177 | 221 | 208 | 199
After | 183 | 187 | 161 | 204 | 197 | 189

a. Make a 95% confidence interval for the mean of the population paired differences, where a paired difference is equal to the weight before joining this exercise program minus the weight at the end of the 12-week program. (2.278, 17.382)

b. Using the 1% significance level, can you conclude that the mean weight loss for all persons due to this special exercise program is greater than zero? (\(H_1: \mu_d > 0\), \(t_c = 3.35\), \(t_{0.01,\,5} = 3.365\), do not reject \(H_0\))

2. A study was used to test whether a training course is helpful for students to pass a mathematics course. To evaluate the effectiveness of the training course, the test scores of eight students were compared before and after taking the training course. The results are as follows:

Student | Score before | Score after
1 | 46 | 50
2 | 52 | 50
3 | 64 | 71
4 | 67 | 70
5 | 58 | 54
6 | 55 | 61
7 | 60 | 62
8 | 60 | 68

a. Compute a 90% confidence interval for the mean difference in scores. (0.25, 5.75)

b. Is it possible to say that the training course is helpful? Why? (\(H_1: \mu_d > 0\), \(t_c = 2.06\), \(t_{0.05,\,7} = 1.895\))

3. A company is considering installing new machines to assemble its products. The company is considering two types of machines, but it will buy only one type. The company selected 11 assembly workers and asked them to use these two types of machines to assemble products. The time in minutes to assemble one unit of the product on each type of machine for each of these eleven workers is recorded and given to the company statistician, who supplied the following information:

Machine I | 23 | 26 | 19 | 24 | 27 | 22 | 20 | 18 | 17 | 21 | 25
Machine II | 21 | 24 | 23 | 25 | 24 | 28 | 24 | 21 | 17 | 26 | 23

Assuming normality, use a confidence interval for the difference between the average assembly times for the two machines to test the hypothesis that the two machines are the same at α = 0.05. (\(H_1: \mu_d \neq 0\), (−3.462, 0.916), do not reject \(H_0\))

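IV. The difference between two population proportions (\(P_1 - P_2\))

Point estimate of \(P_1 - P_2\): \(\hat{P}_1 - \hat{P}_2\)

Confidence interval C.I

\((\hat{P}_1 - \hat{P}_2) \pm Z_{\alpha/2}\sqrt{\dfrac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(1-\hat{P}_2)}{n_2}}\)

Hypothesis test

\(Z_c = \dfrac{(\hat{P}_1 - \hat{P}_2) - D}{\sqrt{\bar{P}(1-\bar{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\)

Where the combined sample proportion is

\(\bar{P} = \dfrac{X_1 + X_2}{n_1 + n_2}\) or \(\bar{P} = \dfrac{n_1\hat{P}_1 + n_2\hat{P}_2}{n_1 + n_2}\)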
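The two standard errors above differ: the confidence interval uses the separate sample proportions, while the test uses the combined proportion. The sketch below illustrates this with the department-store review question of this part (240 of 800 sales at store A, 279 of 900 at store B); the function name is hypothetical.

```python
from math import sqrt

def two_proportion_summary(x1, n1, x2, n2, d0=0.0):
    """Return (difference, C.I. standard error, pooled standard error, Z_c)."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error for the confidence interval: separate sample proportions.
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Standard error for the test: combined (pooled) sample proportion.
    p_bar = (x1 + x2) / (n1 + n2)
    se_test = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = ((p1 - p2) - d0) / se_test
    return p1 - p2, se_ci, se_test, z

diff, se_ci, se_test, z_two = two_proportion_summary(240, 800, 279, 900)

print(round(diff, 2), round(se_ci, 5), round(se_test, 5), round(z_two, 2))
```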


Exercises:

1. A company has two restaurants in two different areas in Kuwait. The company wants to estimate the percentage of customers who thinks that the food and service at each of these restaurants are excellent. A sample of 200 customers taken from restaurant in area A showed that 118 think that the food and service are excellent in this restaurant. Another sample of 250 customers taken from restaurant in area B showed that 160 think that the food and service are excellent in this restaurant.

a. Find the point estimate of the difference between the two proportions.

b. Construct a 97% C.I of the difference between the two proportions.

c. Find the p-value to test the hypothesis that the proportion of customers who think that the food and service are excellent at the restaurant in area A is lower than the corresponding proportion at the restaurant in area B.

37

d. What is your conclusion if α=0.025?

Review:

1. In a random sample of 800 men aged between 25 and 35, 24% said that they live with one parent. In another sample of 850 women of the same age group, 18% said that they live with one parent. Construct a 95% confidence interval for the difference between the two population proportions. (0.021, 0.099)

2. A company that has many department stores wanted to find at two such stores the percentage of sales for which at least one of the items was returned. A sample of 800 sales randomly selected from store A showed that for 240 of them at least one of the items was returned. Another sample of 900 sales randomly selected from store B showed that for 279 of them at least one of the items was returned.

a. Construct a 98% confidence interval for the difference between the proportions of all sales at the two stores for which at least one of the items was returned. (-0.0621, 0.0421)

b. Find the standard error and the margin of error of the C.I. (0.02236, 0.05211)

c. Using the 1% significance level, can you conclude that the proportions of all sales for which at least one of the items was returned are different at the two stores? ($H_1: P_A \ne P_B$, $Z_c = -0.45$, do not reject $H_0$)

d. Find the p-value for the test mentioned in part (c). (0.6528)

e. Find the standard error of the test. (0.02237)
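The answers to Review problem 2 can be reproduced with the following Python sketch. It is only an illustration: the variable names are hypothetical, and the normal quantile comes from the standard library rather than from a Z table, so the last digits can differ slightly from the table-based answers quoted above.

```python
import math
from statistics import NormalDist

# Review problem 2: returned-item counts and sample sizes for stores A and B.
x_a, n_a = 240, 800
x_b, n_b = 279, 900
p_a, p_b = x_a / n_a, x_b / n_b
diff = p_a - p_b

# 98% confidence interval: uses the unpooled standard error.
z_crit = NormalDist().inv_cdf(1 - 0.02 / 2)
se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
margin = z_crit * se_ci
ci = (diff - margin, diff + margin)

# Two-sided test of H0: P_A = P_B (so D = 0): uses the pooled proportion.
p_pooled = (x_a + x_b) / (n_a + n_b)
se_test = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z_c = (diff - 0) / se_test
p_value = 2 * NormalDist().cdf(-abs(z_c))

print(f"standard error (C.I.) = {se_ci:.5f}, margin of error = {margin:.5f}")
print(f"98% C.I. = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"standard error (test) = {se_test:.5f}, Z_c = {z_c:.2f}, p-value = {p_value:.4f}")
```

Note that the confidence interval and the test use different standard errors: the interval uses the two sample proportions separately, while the test pools them because $H_0$ assumes the two proportions are equal.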

Chapter 11

ANOVA

Assumptions:

1. “k” random independent samples from
2. Normal populations with
3. Equal variances (homogeneous populations)

Hypothesis test:

1. $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ vs $H_1$: at least one population mean is different

Where k: number of samples or groups or populations

2. Test statistic (T.S.), calculated from the ANOVA table

$F_c = \dfrac{MSB\ (MSF)}{MSW\ (MSE)}$

3. Determine the critical value

$F_{\alpha,\ k-1,\ n-k}$

4. Conclusion: Reject $H_0$ if the T.S. ($F_c$) lies in the rejection region R.R. (under the shaded area).

Source of variation   d.f.   Sum of squares (SS)   Mean squares (MS)              T.S.
Between (Factor)      k-1    SSB (SSF)             MSB (MSF) = SSB (SSF)/(k-1)    F_c = MSB (MSF)/MSW (MSE)
Within (Error)        n-k    SSW (SSE)             MSW (MSE) = SSW (SSE)/(n-k)
Total                 n-1    SST                   -

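k: number of samples / groups / populations.

$n = n_1 + n_2 + \dots + n_k$

$T_1 = \sum x_1,\ T_2 = \sum x_2,\ \dots,\ T_k = \sum x_k$ or $T_1 = n_1\bar{x}_1,\ T_2 = n_2\bar{x}_2,\ \dots,\ T_k = n_k\bar{x}_k$

$T = T_1 + T_2 + \dots + T_k$

$S_1^2,\ S_2^2,\ \dots,\ S_k^2$

$\sum x^2 = \sum x_1^2 + \sum x_2^2 + \dots + \sum x_k^2$

SSB (SSF)

$SSB = SST - SSW$

$SSB = MSB\,(k-1)$

$SSB = \left(\dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dots + \dfrac{T_k^2}{n_k}\right) - \dfrac{T^2}{n}$

$SSB = \sum n_i(\bar{x}_i - \bar{x})^2$

SSW (SSE)

$SSW = SST - SSB$

$SSW = MSW\,(n-k)$

$SSW = (n_1-1)S_1^2 + (n_2-1)S_2^2 + \dots + (n_k-1)S_k^2 = \sum (n_i-1)S_i^2$

$SSW = \sum x^2 - \left(\dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dots + \dfrac{T_k^2}{n_k}\right)$

SST

$SST = SSB + SSW$

$SST = \sum x^2 - \dfrac{T^2}{n}$

$MSE = S_p^2 = \dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2 + \dots + (n_k-1)S_k^2}{n-k}$

Note:

$\sum x^2 \ne \left(\sum x\right)^2$

$T_i^2 \ne \sum x_i^2$

$T_i^2 = \left(\sum x_i\right)^2$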

Exercise

1. A consumer agency wanted to find out if the mean time it takes for each of three brands of medicine to provide relief from a headache is the same. The three drugs were administered to three randomly selected samples. The following table gives the time in minutes taken by each patient to get relief from a headache, followed by a Minitab output for this problem.

Drug I   Drug II   Drug III
25       14        44
38       20        39
42       18        54
65       24        58
47                 73
52

Individual 95% CIs for Mean Based on Pooled StDev

Level      N   Mean    StDev
Drug I     6   44.83   13.50   (------*------)
Drug II    4   19.00    4.16   (-------*-------)
Drug III   5   53.60   13.24   (-------*------)

(Axis scale of the interval plot: 16, 32, 48)

a. Complete the analysis of variance table

Source   df   SS   MS   F
Factor
Error
Total

b. Test the consumer agency claim at 5% level of significance

c. Suppose that the hypothesis of equal means has been rejected which of the drugs is different and why?
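As a check on the hand calculation, the ANOVA table for this exercise can be built from the raw data with the shortcut formulas for SSB, SSW and SST given earlier. This is a sketch only: the names are illustrative, and the critical value $F_{.05,2,12} = 3.89$ is an assumption read from an F table.

```python
# Relief times (minutes) for the three drugs, from the exercise table.
samples = {
    "Drug I": [25, 38, 42, 65, 47, 52],
    "Drug II": [14, 20, 18, 24],
    "Drug III": [44, 39, 54, 58, 73],
}

k = len(samples)                                  # number of groups
n = sum(len(x) for x in samples.values())         # total sample size
T = sum(sum(x) for x in samples.values())         # grand total
sum_x2 = sum(v * v for x in samples.values() for v in x)
sum_Ti2_over_ni = sum(sum(x) ** 2 / len(x) for x in samples.values())

SSB = sum_Ti2_over_ni - T ** 2 / n
SSW = sum_x2 - sum_Ti2_over_ni
SST = sum_x2 - T ** 2 / n

MSB = SSB / (k - 1)
MSW = SSW / (n - k)
F_c = MSB / MSW

# Assumption: F_{0.05, 2, 12} = 3.89, read from an F table.
F_crit = 3.89

print(f"Factor  df={k - 1:<3} SS={SSB:9.3f}  MS={MSB:9.3f}  F={F_c:.3f}")
print(f"Error   df={n - k:<3} SS={SSW:9.3f}  MS={MSW:9.3f}")
print(f"Total   df={n - 1:<3} SS={SST:9.3f}")
print("reject H0" if F_c > F_crit else "do not reject H0")
```

The square root of MSW is the pooled standard deviation that Minitab uses for the individual confidence intervals shown in the output.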

2. A panel of trained testers judges the flavor quality of different vanilla frozen desserts (frozen yogurts, ice milks, and other frozen desserts), measured on a scale from 0 to 100. The sample sizes are, respectively, $n_1 = 13$, $n_2 = 8$, and $n_3 = 6$. Below is most of the ANOVA output from the computer:

Source             df   SS     MS     F
Factor (Between)   ?    6364   3182   ?
Error (Within)     24   3031   ?
Total              ?    ?

a. Complete the ANOVA table

b. Test whether there is a significant difference in the flavor quality of the three different desserts

State the null and alternative hypotheses

Find the value of the test statistic

Find the critical value(s). Use the 0.05 significance level

What is your conclusion about the flavor quality of the three different desserts?

What are the assumptions required to make this test?

3. Six hours were selected from each of 3 radio stations, and analysis of variance was performed on the data. Part of the ANOVA table is shown below.

a. Complete ANOVA table

b. At 0.05 significance level, is there a difference in the stations means

Source    df   SS   MS   F
Between   1311.02   13.368
Within
Total

Review:

1. A statistics professor has developed four methods (M1, M2, M3, M4) for teaching a senior level class. He wishes to investigate if there is a difference in the four methods. The professor assigns students to the four teaching methods. The final exam scores for each group were recorded. The four sample sizes and sample means are in the following table:

Method   Sample size   Sample mean
M1       7             79
M2       4             83.75
M3       6             70
M4       5             72.8

Also, you are given that the error (within) sum of squares SSE (SSW) = 861.55. Carry out the ANOVA test using a 1% level of significance.

State the null and alternative hypotheses
Find the value of the test statistic (3.973)
Find the critical value (5.09)
What is your conclusion about the four different methods of teaching? (do not reject $H_0$)
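When only sample sizes and sample means are given, as in Review problem 1, SSB can be computed from $SSB = \sum n_i(\bar{x}_i - \bar{x})^2$. The following Python sketch (illustrative names, with the critical value 5.09 taken from the answer quoted above) reproduces the test statistic.

```python
# Review problem 1: sample sizes and sample means for the four methods.
sizes = [7, 4, 6, 5]
means = [79, 83.75, 70, 72.8]
SSW = 861.55                       # error (within) sum of squares, as given

k = len(sizes)
n = sum(sizes)

# Grand mean = T / n, where T = sum of n_i * xbar_i.
grand_mean = sum(ni * xi for ni, xi in zip(sizes, means)) / n

# SSB = sum of n_i * (xbar_i - xbar)^2
SSB = sum(ni * (xi - grand_mean) ** 2 for ni, xi in zip(sizes, means))
MSB = SSB / (k - 1)
MSW = SSW / (n - k)
F_c = MSB / MSW

F_crit = 5.09                      # F_{0.01, 3, 18}, as quoted in the answer

print(f"SSB = {SSB:.2f}, MSB = {MSB:.2f}, MSW = {MSW:.3f}, F_c = {F_c:.3f}")
print("reject H0" if F_c > F_crit else "do not reject H0")
```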

ANOVA table for Review problem 1:

Source    df   SS       MS       F
Between   3    570.45   190.15   3.973
Within    18   861.55   47.864
Total     21   1432

2. Samples were selected from three populations. The data obtained are given below:

Sample 1   Sample 2   Sample 3
91         77         88
98         87         75
107        84         73
102        95         84
           85         75
           82

a. State the assumptions needed to use the analysis of variance to test the equality of the three population means

b. Test the hypothesis of no difference between the three population means at the 0.05 level of significance. ($F_c = 11.885$, $F_{.05,2,12} = 3.89$, reject $H_0$)

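ANOVA table for Review problem 2:

Source    df   SS       MS       F
Between   2    968.7    484.35   11.885
Within    12   489      40.75
Total     14   1457.7

Chapter 12

Independence

1. $H_0$: the two variables are independent (not related) vs $H_1$: the two variables are dependent (related)

2. Test statistic

$\chi_c^2 = \sum \dfrac{(O-E)^2}{E}$

Where

O: Observed value

E: Expected value

$E = \dfrac{\text{row total} \times \text{column total}}{\text{total}}$

3. Determine the critical value

$\chi^2_{\alpha,\ (r-1)(c-1)}$, where r is the number of rows and c is the number of columns

4. Conclusion: Reject $H_0$ if the T.S. ($\chi_c^2$) lies in the rejection region (under the shaded area).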

Exercises:

1. Let's try an example. 500 elementary school boys and girls are asked which is their favorite color: blue, green, or pink. Results are shown below:

        Blue   Green   Pink   Total
Boys    100    150     20     270
Girls   20     30      180    230
Total   120    180     200    500

Would you conclude that there is a relationship between gender and favorite color?

a. The two hypotheses

$H_0$:          vs          $H_1$:

b. The test statistic

c. The critical value(s), $\chi^2_{\alpha,\ (r-1)(c-1)}$. Use the 0.05 significance level.

d. The conclusion

2. One hundred auto drivers who were stopped by police for some violation were also checked to see if they were wearing seat belt. The following table records the results of this survey

        Wearing seat belt   Not wearing seat belt   Total
Men     34                  21                      55
Women   32                  13                      45
Total   66                  34                      100

For a chi square test of independence for this contingency table:

a. What is the number of degrees of freedom?

b. What is the total of the second row?

c. How many drivers are in the sample ?

d. What are the observed frequencies for the first row?

e. What are the expected frequencies for the second row?

f. What are the observed frequencies for the second column?

g. What are the expected frequencies for the second column?
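The expected frequencies and the chi-square statistic for the seat-belt table can be sketched in Python directly from the formula $E = \text{row total} \times \text{column total} / \text{total}$. The names below are illustrative, and the critical value $\chi^2_{.05,1} = 3.841$ is an assumption read from a chi-square table.

```python
# Observed counts from the seat-belt table
# (rows: men, women; columns: wearing a seat belt, not wearing a seat belt).
observed = [
    [34, 21],
    [32, 13],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# E = row total * column total / total, for every cell.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_c = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Assumption: chi-square critical value for alpha = 0.05 and df = 1 is 3.841.
chi2_crit = 3.841

print("expected frequencies:", expected)
print(f"df = {df}, chi-square = {chi2_c:.4f}")
print("reject H0" if chi2_c > chi2_crit else "do not reject H0")
```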

CHAPTER 13

Simple Linear Regression

I. The population regression model

$y = \beta_0 + \beta_1 x + \varepsilon$

Where
o y: is the dependent variable
o x: is the independent variable
o $\beta_0$: is the y-intercept or constant term
o $\beta_1$: is the slope
o $\varepsilon$: is a random error term

Estimate the population regression model by the sample linear regression model

$\hat{y} = b_0 + b_1 x$

This equation is called the least squares regression line or the prediction equation.

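Sum of squares

$SS_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$ or $SS_{xy} = \sum xy - n\bar{x}\bar{y}$ or $SS_{xy} = \sum (x-\bar{x})(y-\bar{y})$

$SS_{xx} = \sum x^2 - \dfrac{\left(\sum x\right)^2}{n}$ or $SS_{xx} = \sum x^2 - n\bar{x}^2$ or $SS_{xx} = \sum (x-\bar{x})^2$

$SS_{yy} = \sum y^2 - \dfrac{\left(\sum y\right)^2}{n}$ or $SS_{yy} = \sum y^2 - n\bar{y}^2$ or $SS_{yy} = \sum (y-\bar{y})^2$

Estimated value of $\beta_0$ and $\beta_1$

$\hat{\beta}_1 = b_1 = \dfrac{SS_{xy}}{SS_{xx}}$

$\hat{\beta}_0 = b_0 = \bar{y} - b_1\bar{x}$

Prediction value of y

$\hat{y} = b_0 + b_1 x$

Residual (Error)

$e = y - \hat{y}$

II. How to evaluate the estimated model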

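1. Coefficient of Determination

$r^2 = \dfrac{b_1 SS_{xy}}{SS_{yy}} = \dfrac{SSR}{SST}$, with $0 \le r^2 \le 1$

It gives the proportion of the variation in “y” that is explained by the independent variable “x”.

Note: as SSR increases, $r^2$ increases, which indicates a better model.

2. Coefficient of Correlation

$r = \pm\sqrt{r^2}$ (with the same sign as $b_1$)

or

$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = b_1\sqrt{\dfrac{SS_{xx}}{SS_{yy}}}$, with $-1 \le r \le 1$

It measures the strength of the linear relationship between two variables.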


III. Estimation of the variance and standard deviation of random errors

The estimated variance of errors: $\hat{\sigma}_\epsilon^2 = S_e^2 = MSE$

The estimated standard deviation of errors: $\hat{\sigma}_\epsilon = S_e = \sqrt{MSE}$

$MSE = \dfrac{SSE}{n-k} = \dfrac{SS_{yy} - b_1 SS_{xy}}{n-k}$

(k: number of parameters)

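IV. Inferences about $\beta_1$

$b_1 \sim N\left(\mu_{b_1} = \beta_1,\ S_{b_1} = \dfrac{S_e}{\sqrt{SS_{xx}}}\right)$

Population simple linear regression equation: $y = \beta_0 + \beta_1 x + \varepsilon$

Point estimate of $\beta_1$: $\hat{\beta}_1 = b_1$

Confidence interval of $\beta_1$

o If n < 30: $b_1 \pm t_{\alpha/2,\ n-k}\, S_{b_1}$ (k: number of parameters)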

Hypothesis test about $\beta_1$

(Test of $b_1$, or test of whether there is a good relationship between x and y.)

1. Hypotheses

$H_0: \beta_1 = 0$ means there is no relationship between X and Y (no significant linear relationship; X will be dropped from the model)

$H_1: \beta_1 \ne 0$ means there is a relationship between X and Y (significant linear relationship)

$H_1: \beta_1 > 0$ means there is a positive linear relationship

$H_1: \beta_1 < 0$ means there is a negative linear relationship

2. Calculate the test statistic

o If n < 30: $T_c = \dfrac{b_1}{S_{b_1}}$

3. Determine the rejection region, using the critical value

$t_{\alpha,\ n-k}$ (right-tailed test), $-t_{\alpha,\ n-k}$ (left-tailed test), or $\pm t_{\alpha/2,\ n-k}$ (two-tailed test)

4. Conclusion: Reject $H_0$ if the T.S. lies in the rejection region (R.R.)

V. Testing the overall model

1. Hypotheses

$H_0: \beta_1 = 0$: the model is not significant / not useful / not good / not adequate / the data do not fit the model.

$H_1: \beta_1 \ne 0$: the model is significant / useful / good / adequate / the data fit the model.

2. Test statistic (calculated from the ANOVA table)

$F_c = \dfrac{MSR}{MSE}$

3. Critical value

$F_{\alpha,\ k-1,\ n-k}$

4. Conclusion: Reject $H_0$ if the T.S. ($F_c$) lies in the rejection region R.R. (under the shaded area).

ANOVA table

Source of variation   d.f.   Sum of squares (SS)   Mean squares (MS)   T.S.
Regression            k-1    SSR = b1 SSxy         MSR = SSR/(k-1)     F_c = MSR/MSE
Residual Error        n-k    SSE = SST - SSR       MSE = SSE/(n-k)
Total                 n-1    SST = SSyy

k = number of parameters

CHAPTER 14

Multiple Regressions

I. Least squares regression line equation

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots$

$\hat{y}$: Dependent variable (prediction of y)

$x_i$: Independent variable

II. Hypothesis test

1. $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = 0$ (the model is not significant)

vs $H_1$: at least one $\beta$ is not equal to zero (the model is significant)

2. Test statistic

$F_c = \dfrac{MSR}{MSE}$ (calculated from the ANOVA table)

3. Critical value

$F_{\alpha,\ k-1,\ n-k}$

4. Conclusion

Reject $H_0$ if the T.S. ($F_c$) lies in the rejection region R.R. (under the shaded area).

ANOVA table

Source of variation   d.f.   Sum of squares (SS)   Mean squares (MS)   T.S.
Regression            k-1    SSR                   MSR = SSR/(k-1)     F_c = MSR/MSE
Residual Error        n-k    SSE = SST - SSR       MSE = SSE/(n-k)
Total                 n-1    SST = SSyy

(With a single independent variable, SSR = b1 SSxy.)

Exercises:

1. The following table gives information on the temperature in a city and the volume of ice cream (in pounds) sold at an ice cream parlor for a random sample of eight days during the summer of 1999.

Temperature      93    86    77    89    98    102   87    79
Ice cream sold   208   175   123   198   232   277   158   117

$\sum x = 711$, $\sum y = 1488$, $\sum x^2 = 63713$, $\sum y^2 = 297428$, $\sum xy = 135466$, $\bar{x} = 88.88$, $\bar{y} = 186$

a. Find the sums of squares (SS)

b. Find the least squares regression line ($\hat{y} = a + bx$)

c. Give a brief interpretation of the values of a and b

- a ($b_0$): the value of y when x = 0 (the volume of ice cream sold (y) is equal to a = -361.5008 when the temperature (x) is equal to zero)

- b ($b_1$): if we increase x by 1 unit, then y will change (increase or decrease) by the value of b (if we increase the temperature (x) by 1 degree, then the volume of ice cream sold (y) will increase by b = 6.16 pounds)

d. Compute r and r2, explain what they mean

e. Predict the amount of ice cream sold on a day with a temperature of 95°

f. Compute the standard deviation of errors

g. Construct a 99% confidence interval for β1

h. Testing at 1% significance level, can you conclude that β1 is different from zero?
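Parts (a) to (h) of this exercise follow the simple linear regression formulas of Chapter 13, and can be sketched in Python as below. The names are illustrative, and the critical value $t_{.005,6} = 3.707$ is an assumption read from a t table. Because the code does not round $\bar{x}$ and $b_1$, its intercept differs slightly from the rounded value a = -361.5008 quoted in part (c).

```python
import math

# Data from exercise 1 (temperature x, pounds of ice cream sold y).
x = [93, 86, 77, 89, 98, 102, 87, 79]
y = [208, 175, 123, 198, 232, 277, 158, 117]
n = len(x)
k = 2                                    # parameters: beta_0 and beta_1

sum_x, sum_y = sum(x), sum(y)
SS_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
SS_xx = sum(a * a for a in x) - sum_x ** 2 / n
SS_yy = sum(b * b for b in y) - sum_y ** 2 / n

# Least squares estimates.
b1 = SS_xy / SS_xx
b0 = sum_y / n - b1 * sum_x / n

# Coefficient of determination and correlation (r takes the sign of b1).
r2 = b1 * SS_xy / SS_yy
r = math.copysign(math.sqrt(r2), b1)

# Standard deviation of errors and of the slope.
SSE = SS_yy - b1 * SS_xy
s_e = math.sqrt(SSE / (n - k))
s_b1 = s_e / math.sqrt(SS_xx)
t_c = b1 / s_b1

# Assumption: t_{0.005, 6} = 3.707, read from a t table (df = n - k = 6).
t_crit = 3.707
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
y_hat_95 = b0 + b1 * 95

print(f"SSxy = {SS_xy:.3f}, SSxx = {SS_xx:.3f}, SSyy = {SS_yy:.3f}")
print(f"y-hat = {b0:.4f} + {b1:.4f} x,  r = {r:.4f},  r^2 = {r2:.4f}")
print(f"prediction at 95 degrees = {y_hat_95:.2f},  S_e = {s_e:.4f}")
print(f"99% C.I. for beta_1 = ({ci[0]:.3f}, {ci[1]:.3f}),  T_c = {t_c:.2f}")
```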

2. Regression analysis was applied between sales data (in $1000) and advertising data (in $100), and the following information was obtained:

ŷ = 12 + 1.8x, n = 17, SSR = 225, SSE = 75, S_b1 = 0.2683

a. Based on the above estimated regression equation, if advertising is $3000, then the predicted value of sales (in dollars) is

b. The F statistic computed from the above data is

c. The critical F value at α=0.05 is

d. Is the estimated regression model significant? The two hypotheses

Your conclusion

e. The t statistic for testing the significance of the slope is

f. Is the linear relationship between X and Y significant? Use the t-test to answer this question.

The two hypotheses

The critical value(s)

Your conclusion

g. Calculate the 95% confidence interval of the slope of the regression line

h. Develop an analysis of variance table

Source of variation

d.f SS MS
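The summary values in this exercise are enough to fill in the ANOVA table and both test statistics. A minimal Python sketch, with illustrative names and k = 2 parameters assumed for the simple regression model:

```python
# Summary values given in exercise 2.
b0, b1 = 12, 1.8           # y-hat = 12 + 1.8 x
n, k = 17, 2               # k = number of parameters (beta_0 and beta_1)
SSR, SSE = 225, 75
s_b1 = 0.2683

# Advertising is measured in $100 and sales in $1000,
# so $3000 of advertising corresponds to x = 30.
x = 3000 / 100
predicted_sales_dollars = (b0 + b1 * x) * 1000

# ANOVA quantities.
SST = SSR + SSE
MSR = SSR / (k - 1)
MSE = SSE / (n - k)
F_c = MSR / MSE
r2 = SSR / SST

# t statistic for the slope.
t_c = b1 / s_b1

print(f"predicted sales = ${predicted_sales_dollars:,.0f}")
print(f"Regression  df={k - 1:<3} SS={SSR:<6} MS={MSR}")
print(f"Error       df={n - k:<3} SS={SSE:<6} MS={MSE}")
print(f"Total       df={n - 1:<3} SS={SST}")
print(f"F_c = {F_c:.2f}, T_c = {t_c:.3f}, r^2 = {r2:.2f}")
```

In simple linear regression the two tests agree: the square of the t statistic for the slope equals the F statistic, up to rounding in the given value of the standard error of the slope.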

3. The owner of a bowling establishment is interested in the relationship between the price she charges for a game of bowling and the number of games bowled per day. She collected data on the number of games bowled per day at 15 different prices. Fill in the missing entries in the following MINITAB output that was obtained for these data. In this output, X represents the price of a game and Y is the number of games bowled per day.

The regression equation is
y = ____ ____ x

Predictor   Coef      STDEV   T        P
Constant    691.02    21.70   ____     0.000
x           -141.30   ____    -16.83   0.000

S = _____   R-Sq = _____   R-Sq(adj) = 95.3%

Analysis of Variance

Source           DF    SS         MS        F        P
Regression       ___   148484.0   _______   283.13   0.000
Residual Error   ___   _______    _______
Total            ___   _______

4. A researcher wanted to examine the relation between a dependent variable y and an independent variable x. He randomly selected 10 observations, giving the following partial MINITAB output:

A.

Predictor   Coef     SE Coef
Constant    247.97   15.01
X1          -8.172   1.077

S = 15.2158   R-Sq = 87.8%   R-Sq(adj) = 86.3%

1. Write down the estimated regression equation:

2. Test the hypothesis of no regression ($\beta_1 = 0$) using α = 0.05

3. Find the correlation coefficient between y and x1 and comment on its value

B. The researcher decided to add another independent variable x2 to the model in (A). He obtained the following results:

The regression equation is
y = 180 - 6.17 x1 + 0.848 x2

Predictor   Coef     SE Coef   T
Constant    180.07   23.31     _____
X1          ______   ______    -6.46
X2          0.8484   0.2622    _____

S = 10.2957   R-Sq = _______   R-Sq(adj) = 93.7%

Source           DF   SS        MS        F
Regression       2    ______    _______   _____
Residual Error   _    742.0     _______
Total            9    15182.9

1. Complete the missing values in the previous MINITAB output

2. Test at α=0.05 of no regression model (i.e. the overall model does not fit the data).

3. Find a 95% confidence interval for the regression coefficient of x2

4. Test the hypothesis that the true value of the regression coefficient of x1 equal to 0

5. Which of the two models you prefer, the one estimated in (A) above or the one estimated in (B) above? And why?

6. Test the hypothesis that the true value of the regression coefficient of x2 is positive (α = 0.05).

Inference about $\beta_i$

The regression equation is
y = 180 - 6.17 x1 + 0.848 x2

Predictor   Coef     SE Coef   T       p-value
Constant    180.07   23.31     7.725   0.00
X1          -6.17    0.955     -6.46   0.00
X2          0.8484   0.2622    3.236   0.00

S = 10.2957   R-Sq = 0.9511   R-Sq(adj) = 93.7%

Overall model (conduct the F test of model usefulness / test the whole model: $H_0: \beta_1 = \beta_2 = 0$ vs $H_1$: at least one $\beta \ne 0$)

Source           DF   SS        MS        F
Regression       2    14440.9   7220.45   68.117
Residual Error   7    742.0     106
Total            9    15182.9
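The completed entries above can be recovered from the values printed in the partial output of part (B). The following Python sketch shows one way to do it, with illustrative names and k = 3 parameters assumed for the two-variable model.

```python
import math

# Values printed in the partial MINITAB output of part (B).
n, k = 10, 3                        # 3 parameters: beta_0, beta_1, beta_2
SST, SSE = 15182.9, 742.0
b0, se_b0 = 180.07, 23.31
b1, t_b1 = -6.17, -6.46
b2, se_b2 = 0.8484, 0.2622

# ANOVA table entries.
SSR = SST - SSE
MSR = SSR / (k - 1)
MSE = SSE / (n - k)
F_c = MSR / MSE

# Coefficient of determination and standard deviation of errors.
r2 = SSR / SST
s = math.sqrt(MSE)

# Coefficient table entries, using T = Coef / SE Coef.
t_b0 = b0 / se_b0
t_b2 = b2 / se_b2
se_b1 = b1 / t_b1

print(f"SSR = {SSR:.1f}, MSR = {MSR:.2f}, MSE = {MSE:.0f}, F = {F_c:.3f}")
print(f"R-Sq = {r2:.4f}, S = {s:.4f}")
print(f"T(constant) = {t_b0:.3f}, SE(b1) = {se_b1:.3f}, T(b2) = {t_b2:.3f}")
```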

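Notes on the output above:

Coef column: point estimates of $\beta_0, \beta_1, \beta_2$ ($b_0, b_1, b_2$)

SE Coef column: standard deviations of $b_0, b_1, b_2$ ($S_{b_0}, S_{b_1}, S_{b_2}$)

T column: test statistic $\dfrac{b_i}{S_{b_i}}$

p-value column: for a two-tailed test, compare the p-value with α to test the significance of $\beta_i$, that is, the linear relationship between $x_i$ and y

S: estimated standard deviation of error, $\sqrt{MSE}$

R-Sq: coefficient of determination, $\dfrac{SSR}{SST}$

MS of Residual Error: error mean square, $MSE = \dfrac{SSE}{n-k}$ or $MSE = S^2$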

5. The owner of ShowTime Movie Theaters would like to estimate weekly gross revenue as a function of advertising expenditure. Historical data for a sample of 8 weeks follow.

Weekly gross revenue (Y)   Television advertising (X1)   Newspaper advertising (X2)
($1000s)                   ($1000s)                      ($1000s)
96                         5.0                           1.5
90                         2.0                           2.0
95                         4.0                           1.5
92                         2.5                           2.5
95                         3.0                           3.3
94                         3.5                           2.3
94                         2.5                           4.2
94                         3.0                           2.5

A portion of the MINITAB computer output follows:

The regression equation is
y = 83.2 + 2.29 x1 + 1.30 x2

Predictor   Coef   SE Coef   T
Constant           1.574
X1                 0.3041
X2                 0.3207

S = 0.642587   R-Sq = 91.9%

Source           DF   SS       MS   F
Regression            23.435
Residual Error   5
Total

a. What is the estimate of the weekly gross revenue for a week when $3500 is spent on television advertising and $1800 is spent on newspaper advertising?

b. Find and interpret R2

c. When television advertising was the only independent variable, R2=0.653 (65.3%) do you prefer the multiple regression results? Why?

d. Use α = 0.05 to test the hypotheses $H_0: \beta_1 = \beta_2 = 0$ vs $H_1$: $\beta_1$ and/or $\beta_2$ is not equal to zero. Did the estimated regression equation provide a good fit to the data? Explain.

e. Find the mean square error. Find the standard error of the estimate (σ )

f. Use α=0.05 to test the significance of each independent variable. Should X1 or X2 be dropped from the model

TABLES
