Book 2 Notes

7/17/2019 Book 2 Notes

http://slidepdf.com/reader/full/book-2-notes 1/81

Chapter 5 – The normal distribution

5.1 Probability distributions of continuous random variables

A random variable X is called continuous if it can assume any value in some interval, i.e. the number of possible values is infinite. In this case the definition used for a discrete random variable (a list of possible values with their corresponding probabilities) cannot be used, since with infinitely many possible values no such list can be drawn up. For this reason the probability associated with any individual value of a continuous random variable X is taken as 0.

The clustering pattern of the values of X over the interval is described by a mathematical function f(x) called the probability density function. A high (low) clustering of values results in high (low) values of this function. For a continuous random variable X, only probabilities associated with ranges of values (e.g. an interval of values from a to b) are calculated. The probability that the value of X falls between a and b is given by the area between a and b under the curve of the probability density function f(x). For any probability density function the total area under the graph of f(x) is 1.

5.2 Normal distribution

A continuous random variable X is normally distributed (follows a normal distribution) if the probability density function of X is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty.$$

The constants μ and σ can be shown to be the mean and standard deviation respectively of X. These constants completely specify the density function. A graph of the curve describing the probability density function (known as the normal curve) for the case μ = 0 and σ = 1 is shown on the following page.
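The idea that probabilities are areas under the density curve can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `normal_pdf` and `area_under_pdf` and the trapezoidal rule are illustrative choices, and the range −8 to 8 is assumed to be wide enough to stand in for the whole real line.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the N(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_under_pdf(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Trapezoidal approximation of the area under the density between a and b."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

total_area = area_under_pdf(-8.0, 8.0)      # effectively the whole real line
central_area = area_under_pdf(-1.96, 1.96)  # area between -1.96 and 1.96
print(round(total_area, 4), round(central_area, 4))
```

The first number should be very close to 1 (the total area under any density) and the second close to 0.95.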


5.2.1 Properties of the normal distribution

The graph of the function defined above has a symmetric, bell-shaped appearance. The mean μ is located on the horizontal axis where the graph reaches its maximum value. At the two ends of the scale the curve describing the function gets closer and closer to the horizontal axis without actually touching it. Many quantities measured in everyday life have a distribution which closely matches that of a normal random variable, e.g. marks in an exam, weights of products, heights of a male population. The parameter μ shows where the distribution is centrally located and σ the spread of the values around μ. A shorthand way of referring to a random variable X which follows a normal distribution with mean μ and variance σ² is by writing X ~ N(μ, σ²). The next diagram shows graphs of normal distributions for various values of μ and σ².

 


An increase (decrease) in the mean μ results in a shift of the graph to the right (left), e.g. the curve of the distribution with a mean of −2 is moved 2 units to the left. An increase (decrease) in the standard deviation σ results in the graph becoming more (less) spread out, e.g. compare the curves of the distributions with σ² = 0.2, 0.5, 1 and 5.

5.2.2 Empirical example – The normal distribution and a histogram

Consider the scores obtained by 4 500 candidates in a matric mathematics examination.

[Figure: histogram of the examination marks. Horizontal axis: mark (10, 20, ..., 90, More); vertical axis: frequency (0 to 1000).]

The histogram of the marks has an appearance that can be described by a normal curve, i.e. it has a symmetric, bell-shaped appearance. The mean of the marks is approximately 51.9 and the standard deviation approximately 10.


5.3 The Standard Normal Distribution

To find probabilities for a normally distributed random variable, we need to be able to calculate the areas under the graph of the normal distribution. Such areas are obtained from a table showing the cumulative distribution of the normal distribution (see appendix). Since the normal distribution is specified by the mean (μ) and standard deviation (σ), there are many possible normal distributions that can occur. It is impossible to construct a table for each possible mean and standard deviation. This problem is overcome by transforming X, the normal random variable of interest [X ~ N(μ, σ²)], to a standardized normal random variable

$$Z = \frac{X - \mu}{\sigma}.$$

It can be shown that the transformed random variable is normally distributed with μ = 0 and σ = 1, i.e. Z ~ N(0, 1). The random variable Z can be transformed back to X by using the formula

$$X = \mu + Z\sigma.$$

The normal distribution with mean μ = 0 and standard deviation σ = 1 is called the standard normal distribution. The symbol Z is reserved for a random variable with this distribution.

The graph of the standard normal distribution appears below.

Various areas under the above normal curve are shown. The standard normal table gives the area under the curve to the left of the value z. Other types of areas can be found by combining several of the areas as shown in the next examples.


5.4 Calculating probabilities using the standard normal table

The standard normal table is found at the back of your notes.

The areas shown in the table are those under the standard normal curve to the left of the value of z looked up, i.e. P(Z ≤ z), e.g. P(Z ≤ 0.14) = 0.5557.

Note

• For negative values of z less than the minimum value (−3.79) in the table, the probabilities are taken as 0, i.e. P(Z ≤ z) = 0 for z < −3.79.

• For positive values of z greater than the maximum value (3.79) in the table, the probabilities are taken as 1, i.e. P(Z ≤ z) = 1 for z > 3.79.

Examples

In all the examples that follow, Z ~ N(0, 1).

a) P(Z < 1.35) = 0.9115

b) P(Z > −0.47) = 1 − P(Z ≤ −0.47)
   = 1 − 0.3192
   = 0.6808

c) P(−0.47 < Z < 1.35) = P(Z < 1.35) − P(Z < −0.47)
   = 0.9115 − 0.3192
   = 0.5923

d) P(Z > 0.76) = 1 − P(Z < 0.76)
   = 1 − 0.7764
   = 0.2236

e) P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) − P(Z ≤ 0.95)
   = 0.9131 − 0.8289
   = 0.0842

f) P(−1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) − P(Z ≤ −1.96)
   = 0.9750 − 0.0250
   = 0.95
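The table lookups in examples a), b), c) and f) can be cross-checked in software. This is a sketch that uses Python's standard-library `statistics.NormalDist` in place of the printed table; the variable names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution

# cdf(z) is the area under the curve to the left of z, like a table entry.
p_a = Z.cdf(1.35)                    # P(Z < 1.35)
p_b = 1.0 - Z.cdf(-0.47)             # P(Z > -0.47)
p_c = Z.cdf(1.35) - Z.cdf(-0.47)     # P(-0.47 < Z < 1.35)
p_f = Z.cdf(1.96) - Z.cdf(-1.96)     # P(-1.96 <= Z <= 1.96)

for name, p in [("a", p_a), ("b", p_b), ("c", p_c), ("f", p_f)]:
    print(name, round(p, 4))
```

The printed values should agree with the table-based answers to about four decimal places.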

In all the above examples an area was found for a given value of z. It is also possible to find a value of z when an area to its left is given. This can be written as P(Z ≤ $z_\alpha$) = α (α is the Greek letter for "a" and is pronounced "alpha"). In this case $z_\alpha$ has to be found, where α is the area to its left.


Examples

1 Find the value of z that has an area of 0.0344 to its left.

Search the body of the table for the required area (0.0344) and then read off the value of z corresponding to this area. In this case $z_{0.0344}$ = −1.82.

2 Find the value of z that has an area of 0.975 to its left.

Finding 0.975 in the body of the table and reading off the z value gives $z_{0.975}$ = 1.96.

3 Find the value of z that has an area of 0.95 to its left.

When searching the body of the table for 0.95 this value is not found. The z value corresponding to 0.95 can be estimated from the following information obtained from the table.

z       area to left
1.64    0.9495
?       0.95
1.65    0.9505

Since the required area (0.95) is halfway between the 2 areas obtained from the table, the required z can be taken as the value halfway between the two z values that were obtained from the table, i.e.

$$z = \frac{1.64 + 1.65}{2} = 1.645.$$

Exercise: Using the same approach as above, verify that the z value corresponding to an area of 0.05 to its left is −1.645.

At the bottom of the standard normal table selected percentiles $z_\alpha$ are given for different values of α. This means that the area under the normal curve to the left of $z_\alpha$ is α.

Examples:

1 α = 0.900, $z_\alpha$ = 1.282 means P(Z < 1.282) = 0.900.

2 α = 0.995, $z_\alpha$ = 2.576 means P(Z < 2.576) = 0.995.

3 α = 0.005, $z_\alpha$ = −2.576 means P(Z < −2.576) = 0.005.
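Reading the table "backwards" (from an area to a z value) corresponds to the inverse of the cumulative distribution. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` stands in for the table; the variable names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# inv_cdf(alpha) returns the z value with area alpha to its left.
z_0344 = Z.inv_cdf(0.0344)   # table answer: -1.82
z_975 = Z.inv_cdf(0.975)     # table answer: 1.96
z_95 = Z.inv_cdf(0.95)       # interpolated table answer: 1.645

print(round(z_0344, 2), round(z_975, 2), round(z_95, 3))
```

Because the software does not round z to two decimals, it returns the interpolated value for 0.95 directly.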


The standard normal distribution is symmetric with respect to the mean = 0. From this it follows that the area under the normal curve to the right of a positive z entry in the standard normal table is the same as the area to the left of the associated negative entry (−z), i.e.

P(Z ≥ z) = P(Z ≤ −z).

For example, P(Z ≥ 1.96) = 1 − 0.975 = 0.025 = P(Z ≤ −1.96).

 

5.5 Calculating probabilities for any normal random variable

Let X be a N(μ, σ²) random variable and Z a N(0, 1) random variable. Then

$$P(X \le x) = P\!\left(Z \le \frac{x - \mu}{\sigma}\right).$$

Example 1

The height H (in inches) of a population of women is approximately normally distributed with a mean of μ = 63.5 and a standard deviation of σ = 2.75 inches. To calculate the probability that a woman is less than 63 inches tall, we first find the z-score for 63 inches

$$z = \frac{63 - 63.5}{2.75} = -0.18$$

and then use P(H ≤ 63) = P(Z ≤ −0.18) = 0.4286.

This means that 42.86% (a proportion of 0.4286) of women are less than 63 inches tall.

Example 2

The length X (inches) of sardines is a N(4.62, 0.0529) random variable. What proportion of sardines is

(a) longer than 5 inches? (b) between 4.35 and 4.85 inches?

(a) P(X > 5) = P(Z > (5 − 4.62)/0.23)
   = P(Z > 1.65)
   = 1 − P(Z ≤ 1.65)
   = 1 − 0.9505
   = 0.0495.


(b) P(4.35 ≤ X ≤ 4.85) = P((4.35 − 4.62)/0.23 ≤ Z ≤ (4.85 − 4.62)/0.23)
   = P(−1.17 ≤ Z ≤ 1)
   = P(Z ≤ 1) − P(Z ≤ −1.17)
   = 0.8413 − 0.1210
   = 0.7203.
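The two worked examples of this section can be reproduced without standardizing by hand. This is a sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and small differences from the table answers are expected because the table method rounds z to two decimals.

```python
from statistics import NormalDist

# Example 1: heights with mean 63.5 and standard deviation 2.75 inches.
heights = NormalDist(mu=63.5, sigma=2.75)
p_short = heights.cdf(63)                         # P(H <= 63)

# Example 2: sardine lengths, variance 0.0529 so sigma = 0.23.
sardines = NormalDist(mu=4.62, sigma=0.23)
p_long = 1.0 - sardines.cdf(5)                    # P(X > 5)
p_mid = sardines.cdf(4.85) - sardines.cdf(4.35)   # P(4.35 <= X <= 4.85)

print(round(p_short, 4), round(p_long, 4), round(p_mid, 4))
```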

5.6 Finding percentiles by using the standard normal table

The standard normal table can be used to find percentiles for random variables which are normally distributed.

Example

The scores/marks M obtained in a mathematics entrance examination are normally distributed with μ = 514 and σ = 113. Find the score/mark that is the 80th percentile. From the standard normal table, the z-score which is closest to an entry of 0.80 in the body of the table is 0.84 (the actual area to its left is 0.7995). The score/mark which corresponds to a z-score of 0.84 can be found by solving

$$0.84 = \frac{m - 514}{113}$$

for m. This yields m = 608.92, i.e. a score/mark of approximately 609 is better than 80% of all other exam scores/marks.
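The percentile calculation above amounts to finding z for the given area and then transforming back with m = μ + zσ. A sketch of both routes, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

mu, sigma = 514, 113
marks = NormalDist(mu=mu, sigma=sigma)

# Route 1: ask the N(514, 113^2) distribution for its 80th percentile directly.
p80 = marks.inv_cdf(0.80)

# Route 2: find the standard normal z first, then transform back.
z80 = NormalDist().inv_cdf(0.80)
p80_from_z = mu + z80 * sigma

print(round(z80, 4), round(p80, 2))
```

The result differs slightly from 608.92 because the table method rounds z to 0.84.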

Exercises: All these exercises refer to the normal distribution above.

(1) Find $P_3$.

(2) If a person scores in the top 5% of test scores/marks, what is the minimum score/mark they could have received?

(3) If a person scores in the bottom 10% of test scores/marks, what is the maximum score/mark they could have received?

5.7 Computer output

Excel has a built-in function that can be used to find areas under the normal curve for a given z-score or to calculate a z-score that has a given area under the normal curve to its left.

(1) The table below shows areas under the standard normal curve to the left of various z-scores.


z-score    area
−2.5       0.0062
−2         0.0228
−1.5       0.0668
−1         0.1587
−0.5       0.3085
0          0.5
0.5        0.6915
1          0.8413
1.5        0.9332
2          0.9772
2.5        0.9938

(2) The table below shows z-scores for certain areas under the standard normal curve to their left.

area       z-score
0.005      −2.5758
0.01       −2.3263
0.025      −1.96
0.05       −1.6449
0.1        −1.2816
0.2        −0.8416
0.8        0.8416
0.9        1.2816
0.95       1.6449
0.975      1.96
0.99       2.3263
0.995      2.5758
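Tables like the two above can be regenerated with any software that provides the normal cumulative distribution and its inverse. The sketch below uses Python's standard-library `statistics.NormalDist` rather than Excel; the dictionary names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

z_scores = [-2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5]
areas = [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.8, 0.9, 0.95, 0.975, 0.99, 0.995]

# Table (1): area to the left of each z-score.
area_table = {z: round(Z.cdf(z), 4) for z in z_scores}

# Table (2): z-score that has each area to its left.
z_table = {a: round(Z.inv_cdf(a), 4) for a in areas}

for z, area in area_table.items():
    print(z, area)
for area, z in z_table.items():
    print(area, z)
```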


Chapter 6 – Sampling distributions

6.1 Definitions

• A sampling distribution arises when repeated samples are drawn from a particular population (distribution) and a statistic (numerical measure of description of sample data) is calculated for each sample. The interest is then focused on the probability distribution (called the sampling distribution) of the statistic.

• Sampling distributions arise in the context of statistical inference, i.e. when statements are made about a population on the basis of random samples drawn from it.

Example

Suppose all possible samples of size 2 are drawn with replacement from a population with sample space S = {2, 4, 6, 8} and the mean calculated for each sample.

The different values that can be obtained and their corresponding means are shown in the table below.

1st value / 2nd value    2    4    6    8
2                        2    3    4    5
4                        3    4    5    6
6                        4    5    6    7
8                        5    6    7    8

In the above table the row and column entries indicate the two values in the sample (16 possibilities when combining rows and columns). The mean is located in the cell corresponding to these entries, e.g. 1st value = 4, 2nd value = 6 has a mean entry of (4 + 6)/2 = 5.


Assuming that random sampling is used, all the mean values in the above table are equally likely. Under this assumption the following distribution can be constructed for these mean values.

$\bar{x}$              2      3      4      5      6      7      8      sum
Count                  1      2      3      4      3      2      1      16
P($\bar{X} = \bar{x}$)  1/16   1/8    3/16   1/4    3/16   1/8    1/16   1

The above distribution is referred to as the sampling distribution of the mean for random samples of size 2 drawn from this population.

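The mean and variance of the population from which these samples are drawn are μ = 5 and

$$\sigma^2 = \left[\sum x^2 - \left(\sum x\right)^2 / N\right] \div N = (2^2 + 4^2 + 6^2 + 8^2 - 20^2/4) \div 4 = 5.$$

The sampling distribution of the mean has mean and variance

$$\mu_{\bar{X}} = \sum \bar{x}\, P(\bar{X} = \bar{x}) = 5$$

and

$$\sigma^2_{\bar{X}} = \sum \bar{x}^2\, P(\bar{X} = \bar{x}) - \mu_{\bar{X}}^2 = 27.5 - 25 = 2.5.$$

Note that $\mu_{\bar{X}}$ = 5 = μ and that $\sigma^2_{\bar{X}}$ = 2.5 = 5/2 = σ²/2.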
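The sampling distribution in this example is small enough to enumerate completely. The sketch below lists all 16 ordered samples and recomputes the distribution and its mean and variance with exact fractions; the variable names are illustrative and not from the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2  # sample size

# All 16 ordered samples of size 2 drawn with replacement, each equally likely.
means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
counts = Counter(means)
dist = {m: Fraction(c, len(means)) for m, c in sorted(counts.items())}

# Population mean and variance.
mu = Fraction(sum(population), len(population))
var = sum((x - mu) ** 2 for x in population) / len(population)

# Mean and variance of the sampling distribution of the mean.
mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in dist.items())

print(dist)
print(mu, var, mean_of_means, var_of_means)
```

The last line should show that the mean of the sample means equals the population mean and that their variance is the population variance divided by n = 2.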

Consider a population with mean μ and variance σ². It can be shown that the mean and variance of the sampling distribution of the mean, based on a random sample of size n, are given by

$$\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma^2_{\bar{X}} = \sigma^2 / n.$$

$\sigma_{\bar{X}} = \sigma/\sqrt{n}$ is known as the standard error. In the preceding example n = 2.

Sampling distributions can involve different statistics (e.g. sample mean, sample proportion, sample variance) calculated from different sample sizes drawn from different distributions. Some of the important results from statistical theory concerning sampling distributions are summarized in the sections that follow.


6.2 The Central Limit Theorem

The following result is known as the Central Limit Theorem.

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n drawn from a distribution with mean μ and variance σ² (σ² should be finite). Then for sufficiently large n the mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

is approximately normally distributed with mean $\mu_{\bar{X}} = \mu$ and variance $\sigma^2_{\bar{X}} = \sigma^2/n$.

This result can be written as $\bar{X}$ ~ N(μ, σ²/n).

Note:

• The random variable $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ ~ N(0, 1).

• The value of n for which this theorem is valid depends on the distribution from which the sample is drawn. If the sample is drawn from a normal population, the theorem is valid for all n. If the distribution from which the sample is drawn is fairly close to being normal, a value of n > 30 will suffice for the theorem to be valid. If the distribution from which the sample is drawn is substantially different from a normal distribution, e.g. positively or negatively skewed, a value of n much larger than 30 will be needed for the theorem to be valid.

• There are various versions of the central limit theorem. The only other central limit theorem result that will be used here is the following one.

If the population from which the sample is drawn is a Bernoulli distribution (consists of only values of 0 or 1, with probability p of drawing a 1 and probability q = 1 − p of drawing a 0), then $S = \sum_{i=1}^{n} X_i$ follows a binomial distribution with mean $\mu_S = np$ and variance $\sigma^2_S = npq$.

According to the central limit theorem, $\hat{P} = S/n = \sum_{i=1}^{n} X_i / n$ follows a normal distribution with mean $\mu(\hat{P}) = \mu_S/n = np/n = p$ and variance $\sigma^2(\hat{P}) = \sigma^2_S/n^2 = npq/n^2 = pq/n$ when n is sufficiently large. $\hat{P}$ is the proportion of 1's in the sample and can be seen as an estimate of p, the proportion of 1's in the population (distribution from which the sample is drawn).

Using the central limit theorem, it follows that

$$Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0, 1) \quad \text{approximately.}$$
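The mean and variance stated for the sample proportion can be checked by simulation. The following is a sketch only: the values p = 0.3, n = 50 and 20 000 repetitions are illustrative assumptions, not taken from the notes, and the seed is fixed so that a run is repeatable.

```python
import random
from statistics import fmean, pvariance

random.seed(1)                 # fixed seed so the run is repeatable
p, n, reps = 0.3, 50, 20_000   # illustrative values, not from the notes

def sample_proportion(n, p):
    """Proportion of 1's in a Bernoulli(p) sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n

p_hats = [sample_proportion(n, p) for _ in range(reps)]

mean_p_hat = fmean(p_hats)      # should be close to p
var_p_hat = pvariance(p_hats)   # should be close to p * (1 - p) / n

print(round(mean_p_hat, 4), round(var_p_hat, 5), p * (1 - p) / n)
```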

Example:

An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn and the lifetime recorded for each light bulb. Calculate the probability that the mean of this sample

(a) differs from the actual mean lifetime of 800 by not more than 16 hours.

(b) differs from the actual mean lifetime of 800 by more than 16 hours.

(c) is greater than 820 hours.

(d) is less than 785 hours.

Here $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 40/\sqrt{10} = 12.65$.

(a) P(−16 ≤ $\bar{X}$ − 800 ≤ 16) = P(−16/12.65 ≤ Z ≤ 16/12.65)
   = P(−1.26 ≤ Z ≤ 1.26)
   = P(Z ≤ 1.26) − P(Z ≤ −1.26)
   = 0.8962 − 0.1038
   = 0.7924

(b) P(|$\bar{X}$ − 800| > 16) = 1 − P(|$\bar{X}$ − 800| ≤ 16)
   = 1 − 0.7924
   = 0.2076

(c) P($\bar{X}$ > 820) = P(Z > (820 − 800)/12.65)
   = P(Z > 1.58)
   = 1 − 0.9429
   = 0.0571

(d) P($\bar{X}$ < 785) = P(Z < (785 − 800)/12.65)
   = P(Z < −1.19)
   = 0.1170
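The light bulb example can be checked by treating the sample mean as a normal random variable with standard deviation σ/√n. A sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the results differ slightly from the table answers because z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 800, sqrt(1600), 10

# Sampling distribution of the mean: N(mu, sigma^2 / n).
xbar = NormalDist(mu=mu, sigma=sigma / sqrt(n))

p_a = xbar.cdf(816) - xbar.cdf(784)   # within 16 hours of 800
p_b = 1.0 - p_a                       # more than 16 hours away from 800
p_c = 1.0 - xbar.cdf(820)             # greater than 820 hours
p_d = xbar.cdf(785)                   # less than 785 hours

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```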

6.3 The t-distribution (Student's t-distribution)

The central limit theorem states that the statistic $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows a standard normal distribution. If σ is not known, it would be logical to replace σ (in the formula for Z) by its sample estimate S. For small values of the sample size n, the statistic $t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ does not follow a normal distribution. If it is assumed that sampling is done from a population that is approximately a normal population, the distribution of the statistic t follows a t-distribution. This distribution changes with the degrees of freedom = df = n − 1, i.e. for each value of degrees of freedom a different distribution is defined.

The t-distribution was first proposed in a paper by William Gosset in 1908, who wrote the paper under the pseudonym "Student". The t-distribution has the following properties.

• The Student t-distribution is symmetric and bell-shaped, but for smaller sample sizes it shows increased variability when compared to the standard normal distribution (its curve has a flatter appearance than that of the standard normal distribution). In other words, the distribution is less peaked than a standard normal distribution and has thicker tails. As the sample size increases, the distribution approaches a standard normal distribution. For n > 30, the differences are negligible.

• The mean is zero (like the standard normal distribution).

• The distribution is symmetrical about the mean.

• The variance is greater than one, but approaches one from above as the sample size increases (σ² = 1 for the standard normal distribution).
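The thicker tails of the t-distribution for small samples can be seen in a simulation. This is a sketch under stated assumptions: samples of size n = 5 are drawn from a standard normal population, the seed and number of repetitions are arbitrary, and the helper `t_statistic` is an illustrative name.

```python
import random
from math import sqrt

random.seed(2)                 # fixed seed so the run is repeatable
n, reps = 5, 20_000            # small samples, so df = n - 1 = 4
mu, sigma = 0.0, 1.0           # illustrative population parameters

def t_statistic(sample, mu):
    """(sample mean - mu) / (S / sqrt(n)), with S the sample standard deviation."""
    size = len(sample)
    xbar = sum(sample) / size
    s2 = sum((x - xbar) ** 2 for x in sample) / (size - 1)
    return (xbar - mu) / sqrt(s2 / size)

t_values = [t_statistic([random.gauss(mu, sigma) for _ in range(n)], mu)
            for _ in range(reps)]

# Share of t values beyond +/-1.96; a standard normal would give about 0.05.
tail_share = sum(abs(t) > 1.96 for t in t_values) / reps
print(round(tail_share, 3))
```

The share beyond ±1.96 should come out noticeably larger than 0.05, which is the practical meaning of "thicker tails".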

The graph below shows how the t-distribution changes for different values of r (the degrees of freedom).

Tables for the t-distribution

The layout of the t-tables is as follows.

ν = df    0.900    0.95     ..    0.995
1         3.078    6.314    ..    63.66
2         1.886    2.920    ..    9.925
.         .        .              .
∞         1.282    1.645    ..    2.576

The row entry is the degrees of freedom (df) and the column entry (α) the area under the t-curve to the left of the value that appears in the table at the intersection of the row and column entry.

When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the t-distribution is symmetrical around 0 is used, i.e.

P(t ≤ $t_\alpha$) = P(t ≥ $t_{1-\alpha}$) = P(t ≤ −$t_{1-\alpha}$) for α ≤ 0.5.

This means that $t_\alpha = -t_{1-\alpha}$.

Examples

1. For df = 2 and α = 0.995 the entry is 9.925. This means that for the t-distribution with 2 degrees of freedom P(t ≤ 9.925) = 0.995.

2. For df = ∞ and α = 0.95 the entry is 1.645. This means that for the t-distribution with ∞ degrees of freedom P(t ≤ 1.645) = 0.95.

3. For df = ν = 10 and α = 0.10 the value of $t_{0.10}$ such that P(t ≤ $t_{0.10}$) = 0.10 is found from

$t_{0.10} = -t_{1-0.10} = -t_{0.90} = -1.372$.

Note that the percentile values in the last row of the t-distribution table are identical to the corresponding percentile entries in the standard normal table. Since the t-distribution for large samples (degrees of freedom) is the same as the standard normal distribution, their percentiles should be the same.

6.4 The chi-square (χ²) distribution

The chi-square distribution arises in a number of sampling situations. These include the ones described below.

1) Drawing repeated samples of size n from an approximate normal distribution with variance σ² and calculating the variance (S²) for each sample. It can be shown that the quantity

$$\chi^2 = \frac{(n - 1)S^2}{\sigma^2}$$

follows a chi-square distribution with degrees of freedom = n − 1.

2) When comparing sequences of observed and expected frequencies as shown in the table below. The observed frequencies (referring to the number of times values of some variable of interest occur) are obtained from an experiment, while the expected ones arise from some pattern believed to be true.

observed frequency    $f_1$   $f_2$   ..   $f_k$
expected frequency    $e_1$   $e_2$   ..   $e_k$

The quantity

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

can be shown to follow a chi-square distribution with k − 1 degrees of freedom. The purpose of calculating this χ² is to make an assessment as to how well the observed and expected frequencies correspond.
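The χ² quantity for observed and expected frequencies is a simple sum and can be computed directly. The sketch below uses made-up counts for 60 rolls of a die (a fair die would give 10 per face); the data and the function name `chi_square_statistic` are illustrative assumptions, not from the notes.

```python
def chi_square_statistic(observed, expected):
    """Sum of (f_i - e_i)^2 / e_i over all k categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Made-up counts for 60 rolls of a die; a fair die would give 10 per face.
observed = [8, 12, 9, 11, 13, 7]
expected = [10] * 6

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1   # k - 1 degrees of freedom

print(round(chi2, 2), df)
```

The resulting value would then be compared with a percentile of the chi-square distribution with k − 1 degrees of freedom, read from the chi-square table.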

The chi-square curve is different for each value of degrees of freedom. The graph below shows how the chi-square distribution changes for different values of ν (the degrees of freedom).

Unlike the normal and t-distributions, the chi-square distribution is only defined for positive values and is not a symmetrical distribution. As the degrees of freedom increase, the chi-square distribution becomes more and more symmetrical. For a sufficiently large value of degrees of freedom the chi-square distribution approaches the normal distribution.

Tables for the chi-square distribution

The layout of the chi-square tables is as follows.

ν = df    0.005       0.01        ..    0.99     0.995
1         0.000039    0.000157    ..    6.635    7.88
2         0.010025    0.020101    ..    9.21     10.60
.
30        13.79       14.95       ..    50.89    53.67

The row entry is the degrees of freedom (df) and the column entry (α) the area under the chi-square curve to the left of the value that appears in the table at the intersection of the row and column entry.

Examples:

1) For df = 30 and α = 0.01 the entry is 14.95, i.e. $\chi^2_{30;\,0.01} = 14.95$. This means that for the chi-square distribution with 30 degrees of freedom P(χ² ≤ 14.95) = 0.01.

2) For df = 30 and α = 0.995 the entry is 53.67, i.e. $\chi^2_{30;\,0.995} = 53.67$. This means that for the chi-square distribution with 30 degrees of freedom P(χ² ≤ 53.67) = 0.995.

3) For df = 6 and α = 0.95 the entry is 12.59, i.e. $\chi^2_{6;\,0.95} = 12.59$. This means that for the chi-square distribution with 6 degrees of freedom P(χ² ≤ 12.59) = 0.95 or P(χ² > 12.59) = 0.05.

This probability statement is illustrated in the next graph.


6.5 The F-distribution

Random samples of sizes $n_1$ and $n_2$ (sometimes we use m instead of $n_2$) are drawn from normally distributed populations that are labeled 1 and 2 respectively. Denote the variances calculated from these samples by $S_1^2$ and $S_2^2$ respectively and their corresponding population variances by $\sigma_1^2$ and $\sigma_2^2$ respectively. The ratio

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$$

is distributed according to an F-distribution (named after the famous statistician R.A. Fisher) with degrees of freedom $df_1 = n_1 - 1$ (called the numerator degrees of freedom) and $df_2 = n_2 - 1$ (called the denominator degrees of freedom). When $\sigma_1^2 = \sigma_2^2$ the F-ratio is

$$F = \frac{S_1^2}{S_2^2}.$$

The F-distribution is positively skewed, and the F-values can only be positive. The graph below shows plots for a number of F-distributions (F-curves) with $\sigma_1^2 = \sigma_2^2$. These plots are referred to by F($df_1$, $df_2$), e.g. F(33, 10) refers to an F-distribution with 33 degrees of freedom associated with the numerator and 10 degrees of freedom associated with the denominator.

For each combination of $df_1$ and $df_2$ there is a different F-distribution. Three other important distributions are related to special cases of the F-distribution. The square of a standard normal random variable follows an F(1, infinity) distribution, the square of a t-distributed random variable with $n_2$ degrees of freedom follows an F(1, $n_2$) distribution, and a chi-square random variable with $n_1$ degrees of freedom divided by $n_1$ follows an F($n_1$, infinity) distribution.


Tables for the F-distribution

In Table F (at the end of your notes) the 95, 97.5 and 99 percentage points of the F-distribution are given.

The entry in the table corresponding to a pair of ($df_1$, $df_2$) values has an area of α (e.g. α = 0.95) under the F($df_1$, $df_2$) curve to its left (and 1 − α to the right), i.e.

P(F < $F_{df_1,\,df_2;\,\alpha}$) = α.

Examples

1) F(3, 26) = 2.98 has an area (under the F(3, 26) curve) of α = 0.95 to its left and 1 − α = 0.05 to its right (see graph below), i.e. $F_{3,\,26;\,0.95}$ = 2.98.

2) P(F < $F_{4,\,32;\,0.95}$) = 0.95, so $F_{4,\,32;\,0.95}$ = 2.67. See the graph below.

3) Using $df_1$ = 4 and $df_2$ = 7, the value of F that has 1% of the area to the right of it is 7.85.

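Lower tail values from the F-distribution

Only upper tail values (those with large areas below and small areas above) can be read off from the F-tables. Lower tail values can be calculated from the formula:

$$F_{df_1,\,df_2;\,\alpha} = \frac{1}{F_{df_2,\,df_1;\,1-\alpha}}$$

Examples

1) Find the value such that 2.5% of the area under the F(7, 5) curve is to the left of it.

In the above formula $df_1$ = 7, $df_2$ = 5 and α = 0.025. Then $F_{7,\,5;\,0.025} = 1/F_{5,\,7;\,0.975}$, where $F_{5,\,7;\,0.975}$ is read from the table.

2) Find the value such that 1% of the area under the F(10, 9) curve is to the left of it.

In the above formula $df_1$ = 10, $df_2$ = 9 and α = 0.01. Then $F_{10,\,9;\,0.01} = 1/F_{9,\,10;\,0.99}$, where $F_{9,\,10;\,0.99}$ is read from the table.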


 

6.6 Computer output

In Excel, values from the t, chi-square and F-distributions that have a given area under the curve to the right of them can be found by using the TINV(area, df), CHIINV(area, df) and FINV(area, df₁, df₂) functions respectively. For TINV the area is the total area in both tails.

Examples

1 TINV(0.05, 15) = 2.13145. The area under the t(15) curve to the right of 2.13145 is 0.025 and to the left of −2.13145 is 0.025. Thus the total tail area is 0.05.

2 CHIINV(0.01, 14) = 29.14124. The area under the chi-square(14) curve to the right of 29.14124 is 0.01.

3 FINV(0.05, 10, 8) = 3.347163. The area under the F(10, 8) curve to the right of 3.347163 is 0.05.

 

Chapter 7 – Statistical Inference: Estimation for one sample case

7.1 Statistical inference

Statistical inference (inferential statistics) refers to the methodology used to draw conclusions (expressed in the language of probability) about population parameters on the basis of samples drawn from the population.

Examples

1.) The government of a country wants to estimate the proportion of voters (p) in the country that approve of their economic policies.

2.) A manufacturer of car batteries wishes to estimate the average lifetime (μ) of their batteries.

3.) A paint company is interested in estimating the variability (as measured by the variance, σ²) in the drying time of their paints.


The quantities p, μ and σ² that are to be estimated are called population parameters.

A sample estimate of a population parameter is called a statistic. The table below gives examples of some commonly used parameters together with their statistics.

Parameter    Statistic
p            $\hat{p}$
μ            $\bar{x}$
σ²           S²

7.2 Point and interval estimation

• A point estimate of a parameter is a single value (point) that estimates a parameter.

• An interval estimate of a parameter is a range of values from L (lower value) to U (upper value) that estimate a parameter. Associated with this range of values is a probability or percentage chance that this range of values will contain the parameter that is being estimated.

Examples

Suppose the mean time it takes to serve customers at a supermarket checkout counter is to be estimated.

1) The mean service time of 100 customers of (say) $\bar{x}$ = 2.283 minutes is an example of a point estimate of the parameter μ.

2) If it is stated that the probability is 0.95 (95% chance) that the mean service time will be from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an interval estimate of the parameter μ.

The estimation approaches discussed will focus mainly on the interval estimate approach.

 

7.3 Confidence intervals terminology

A confidence interval is a range of values from L (lower value) to U (upper value) that estimate a population parameter θ with 100(1 − α)% confidence.

θ is pronounced "theta".

L is the lower confidence limit.

U is the upper confidence limit.

The interval (L, U) is called the confidence interval.

1 − α is called the confidence coefficient. It is the probability that the confidence interval will contain θ, the parameter that is being estimated.

100(1 − α) is called the confidence percentage.

Example

Consider example 2 of the previous section.

θ, the parameter that is being estimated, is the population mean μ.

L = 1.637, U = 4.009

The confidence interval is the interval (1.637, 4.009).

α = 0.05

The confidence coefficient is (1 − α) = 0.95

The confidence percentage is 100(1 − α) = 95.

In the sections that follow the determination of L and U when estimating the parameters μ, p and σ² will be discussed.

7.4 Confidence interval for the population mean (population variance known)

The determination of the confidence limits is based on the central limit theorem (discussed in the previous chapter). This theorem states that for sufficiently large samples

$\bar{X} \sim N(\mu, \sigma^2/n)$

and so

$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Formulae for the lower and upper confidence limits can be constructed in the following way.

Since $Z \sim N(0, 1)$, it follows from the graph of the standard normal curve that

$P(-1.96 \le Z \le 1.96) = 0.95$

and so

$P\left(-1.96 \le \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$.

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have only the parameter $\mu$ between the inequality signs. This will give

$P\left(\bar{X} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\,\sigma/\sqrt{n}\right) = 0.95$.

Let $L = \bar{X} - 1.96\,\sigma/\sqrt{n}$ and $U = \bar{X} + 1.96\,\sigma/\sqrt{n}$. Then the above formula can be written as

$P(L \le \mu \le U) = 0.95$.

This formula is interpreted in the following way:

Since both L and U are determined by the sample values (which determine $\bar{X}$), they (and the confidence interval) will change for different samples. Since the parameter $\mu$ that is being estimated remains constant, these intervals will either include or exclude $\mu$. The central limit theorem states that such intervals will include the parameter $\mu$ with probability 0.95 (95 out of 100 times).
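The "95 out of 100 times" interpretation can be checked by simulation. The sketch below is only an illustration: the function name `coverage`, the seed, and the choice of normally distributed data with $\mu = 500$, $\sigma = 5$, $n = 30$ are assumptions made for the demonstration, not part of the notes.

```python
import random
import statistics

def coverage(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated intervals x_bar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        # Count the sample if its interval (L, U) includes the true mean
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / trials

# The proportion printed should be close to 0.95
print(coverage(mu=500, sigma=5, n=30, trials=2000))
```

Each simulated sample gives a different interval, while $\mu$ stays fixed; roughly 95% of the intervals should contain it.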


In a practical situation the confidence interval will not be determined by many samples, but by only one sample. Therefore the confidence interval that is calculated in a practical situation will involve replacing the random variable $\bar{X}$ by the sample value $\bar{x}$. Then the above formula for a 95% confidence interval for the population mean $\mu$ becomes

$\left(\bar{x} - 1.96\,\sigma/\sqrt{n},\ \bar{x} + 1.96\,\sigma/\sqrt{n}\right)$.

The percentage of confidence associated with the interval is determined by the value (called the z-multiplier) obtained from the standard normal distribution. In the above formula a z-multiplier of 1.96 determines a 95% confidence interval.

If a different percentage of confidence is required, the z-multiplier needs to be changed. The table below is a summary of z-multipliers needed for different percentages associated with confidence intervals.

confidence percentage   99      95      90
z-multiplier            2.576   1.96    1.645
α                       0.01    0.05    0.10

Calculation of confidence interval for $\mu$ ($\sigma^2$ known)

Step 1: Calculate $\bar{x}$. Values of n, $\sigma^2$ and confidence percentage are given.

Step 2: Look up the z-multiplier for the given confidence percentage.

Step 3: The confidence interval is $\bar{x} \pm \text{z-multiplier} \times \sigma/\sqrt{n}$.

Example

The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected at random and their mean content was found to be 498.5 milliliters. Calculate 95% and 99% confidence intervals for the population mean content of these bottles.

Solution:

95% confidence interval

Substituting $\bar{x} = 498.5$, $n = 30$, $\sigma = 5$, $z = 1.96$ into the above formula gives

$498.5 \pm 1.96 \times 5/\sqrt{30} = (496.71, 500.29)$.

99% confidence interval

Substituting $\bar{x} = 498.5$, $n = 30$, $\sigma = 5$, $z = 2.576$ into the above formula gives

$498.5 \pm 2.576 \times 5/\sqrt{30} = (496.15, 500.85)$.
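The three calculation steps can be sketched in a few lines of Python. The function name `z_interval` and the dictionary `Z_MULTIPLIER` are illustrative names chosen here; the multipliers are copied from the table in the text.

```python
from math import sqrt

# z-multipliers copied from the table in the text
Z_MULTIPLIER = {99: 2.576, 95: 1.96, 90: 1.645}

def z_interval(x_bar, sigma, n, confidence_pct):
    """Confidence interval for the mean when sigma is known."""
    half_width = Z_MULTIPLIER[confidence_pct] * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Cool drink example: x_bar = 498.5, sigma = 5, n = 30
for pct in (95, 99):
    lower, upper = z_interval(498.5, 5, 30, pct)
    print(f"{pct}%: ({lower:.2f}, {upper:.2f})")
```

Rounded to two decimals, this reproduces the intervals of the worked example.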

7.5 Confidence interval for the population mean (population variance not known)

When the population variance ($\sigma^2$) is not known, it is replaced by the sample variance ($S^2$) in the formula for Z mentioned in the previous section. In such a case the quantity

$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$

follows a t-distribution with degrees of freedom = df = n - 1.

The confidence interval formula used in the previous section is modified by replacing the z-multiplier by the t-multiplier that is looked up from the t-distribution.

Calculation of confidence interval for $\mu$ ($\sigma^2$ not known)

Step 1: Calculate $\bar{x}$ and S. Values of n and confidence percentage are given.

Step 2: Look up the t-multiplier for the given confidence percentage and degrees of freedom = n - 1.

Step 3: The confidence interval is $\bar{x} \pm \text{t-multiplier} \times S/\sqrt{n}$.

Example

The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly selected employees at a certain company. The values are given below.

38.2   43.9   38.4   26.2   41.3
42.3   37.5   37.2   41.2   42.3
31     50.1   37.3   36.7   31.8

Calculate 95% and 99% confidence intervals for the population mean time it takes to complete this task.

Solution

n = 15 (given), $\bar{x} = 38.36$, S = 5.78 (calculated from the data)

95% confidence interval:

$\alpha = 0.05$, therefore $\alpha/2 = 0.025$ and $1 - \alpha/2 = 0.975$.

degrees of freedom = df = ν = 15 - 1 = 14

t-multiplier = $t_{14;\,0.975} = 2.145$ (from t-table)

Substituting $\bar{x} = 38.36$, $n = 15$, $S = 5.78$, $t = 2.145$ into the above formula gives

$38.36 \pm 2.145 \times 5.78/\sqrt{15} = (35.16, 41.56)$.

99% confidence interval:

$\alpha = 0.01$, therefore $\alpha/2 = 0.005$ and $1 - \alpha/2 = 0.995$.

degrees of freedom = df = ν = 15 - 1 = 14

t-multiplier = $t_{14;\,0.995} = 2.977$ (from t-table)

Substituting $\bar{x} = 38.36$, $n = 15$, $S = 5.78$, $t = 2.977$ into the above formula gives

$38.36 \pm 2.977 \times 5.78/\sqrt{15} = (33.92, 42.80)$.
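As a check on the arithmetic, the calculation can be sketched in Python using only the standard library. The function name `t_interval` is an illustrative choice; the t-multipliers are not computed but taken from the t-table values quoted in the example (df = 14).

```python
from math import sqrt
from statistics import mean, stdev

# Task completion times (seconds) from the example
times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]

def t_interval(data, t_multiplier):
    """Confidence interval for the mean when sigma is not known."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divisor n - 1)
    half_width = t_multiplier * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# t-multipliers for df = 14, as read from the t-table in the example
for label, t in (("95%", 2.145), ("99%", 2.977)):
    lower, upper = t_interval(times, t)
    print(f"{label}: ({lower:.2f}, {upper:.2f})")
```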

7.6 Confidence interval for population variance

The formulae for the confidence interval of the population variance $\sigma^2$ are based on the fact that

$\dfrac{(n-1)S^2}{\sigma^2}$

follows a chi-square distribution with (n - 1) degrees of freedom. Let $\chi^2(1-\alpha/2)$ and $\chi^2(\alpha/2)$ [also written as $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ respectively] denote the $100(1-\alpha/2)$ and $100(\alpha/2)$ percentile points of the chi-square distribution with (n - 1) degrees of freedom.

These points are shown in the accompanying graph.

For this distribution, it follows from the graph that

$P\left(\chi^2_{\alpha/2} \le \dfrac{(n-1)S^2}{\sigma^2} \le \chi^2_{1-\alpha/2}\right) = 1 - \alpha$.

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have only the parameter $\sigma^2$ between the inequality signs. This will give

$P\left(\dfrac{(n-1)S^2}{\text{upper}} \le \sigma^2 \le \dfrac{(n-1)S^2}{\text{lower}}\right) = 1 - \alpha$,

where

upper = $\chi^2_{1-\alpha/2}$, the larger of the 2 percentile points, and

lower = $\chi^2_{\alpha/2}$, the smaller of the 2 percentile points.

The values of $\alpha$ and $\alpha/2$ are calculated from: confidence percentage = $100(1-\alpha)$, e.g. if confidence percentage = 95, $\alpha = 0.05$ and $\alpha/2 = 0.025$.


Calculation of confidence interval for $\sigma^2$

Step 1: Calculate $S^2$. Values of n and confidence percentage are given.

Step 2: Look up upper and lower chi-square values for the given confidence percentage and degrees of freedom = df.

Step 3: The confidence interval is $\left[\dfrac{(n-1)S^2}{\text{upper}},\ \dfrac{(n-1)S^2}{\text{lower}}\right]$.

Example

Calculate 90% and 95% confidence intervals for the population variance of the time taken to complete the simple task (see previous example).

Solution:

n = 15, $S^2 = 33.3811$ (calculated from the data)

90% confidence interval:

Look up upper and lower chi-square values by using df = ν = 14 and $\alpha = 0.10$.

upper = $\chi^2_{1-\alpha/2} = \chi^2_{0.95} = 23.68$ for ν = 14.

lower = $\chi^2_{\alpha/2} = \chi^2_{0.05} = 6.57$ for ν = 14.

$(n-1)S^2 = 14 \times 33.3811 = 467.34$

The confidence interval is $\left(\dfrac{467.34}{23.68},\ \dfrac{467.34}{6.57}\right) = (19.74, 71.13)$.

95% confidence interval:

Look up upper and lower chi-square values by using df = ν = 14 and $\alpha = 0.05$.

upper = $\chi^2_{1-\alpha/2} = \chi^2_{0.975} = 26.12$ for ν = 14.

lower = $\chi^2_{\alpha/2} = \chi^2_{0.025} = 5.63$ for ν = 14.

$(n-1)S^2 = 14 \times 33.3811 = 467.34$

The confidence interval is $\left(\dfrac{467.34}{26.12},\ \dfrac{467.34}{5.63}\right) = (17.89, 83.01)$.
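The same steps can be sketched in Python. The function name `variance_interval` is an illustrative choice, and the chi-square percentile points are not computed here; they are the table values quoted in the example for df = 14.

```python
from statistics import variance

# Task completion times (seconds) from the earlier example
times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]

def variance_interval(data, chi2_lower, chi2_upper):
    """Confidence interval for the population variance."""
    n = len(data)
    scaled = (n - 1) * variance(data)  # (n - 1) * S^2
    # Dividing by the larger percentile point gives the lower limit
    return scaled / chi2_upper, scaled / chi2_lower

# (lower, upper) chi-square points for df = 14 from the worked example
for label, lo, hi in (("90%", 6.57, 23.68), ("95%", 5.63, 26.12)):
    lower, upper = variance_interval(times, lo, hi)
    print(f"{label}: ({lower:.2f}, {upper:.2f})")
```

Note that the larger chi-square value produces the lower confidence limit, which is why the arguments appear swapped inside the function.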

7.7 Confidence interval for population proportion

In some experiments the interest is in whether or not items possess a certain characteristic of interest e.g. whether a patient improves or not after treatment, whether an item manufactured is acceptable or not, whether an answer to a question is correct or incorrect. The population proportion of items labeled "success" in such an experiment (e.g. patient improves, item is acceptable, answer is correct) is estimated by calculating the sample proportion of "success" items.

The determination of the confidence limits for the population proportion of items labeled "success" is based on the central limit theorem for the sample proportion $\hat{P} = X/n$, where X is the number of items in the sample labeled "success". This theorem states that for sufficiently large samples the sample proportion of "success" items $\hat{P} \sim N(p,\ pq/n)$, where $q = 1 - p$, and hence that

$Z = \dfrac{\hat{P} - \mu(\hat{P})}{\sigma(\hat{P})} = \dfrac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0, 1)$.

Formulae for the lower and upper confidence limits can be constructed in the following way.

Since $Z \sim N(0, 1)$,

$P(-1.96 \le Z \le 1.96) = 0.95$

$P\left(-1.96 \le \dfrac{\hat{P} - p}{\sqrt{pq/n}} \le 1.96\right) = 0.95$.

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have the parameter p (in the numerator) between the inequality signs. This will give

$P\left(\hat{P} - 1.96\sqrt{pq/n} \le p \le \hat{P} + 1.96\sqrt{pq/n}\right) = 0.95$.

Since the confidence interval formula is based on a single sample, the random variable $\hat{P} = X/n$ is replaced by its sample estimate $\hat{p} = x/n$ and the parameters p and $q = 1 - p$ by their respective sample estimates $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$. This gives the following 95% confidence interval for p:

$\left(\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n},\ \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}\right)$.

If the percentage of confidence is to be changed, the z-multiplier is changed according to the values given in the table below.

confidence percentage   99      95      90
z-multiplier            2.576   1.96    1.645
α                       0.01    0.05    0.10

Calculation of confidence interval for p

Step 1: Calculate $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$. Values of x, n and confidence percentage are given.

Step 2: Look up the z-multiplier for the given confidence percentage.

Step 3: The confidence interval is $\hat{p} \pm \text{z-multiplier} \times \sqrt{\hat{p}\hat{q}/n}$.

Example

During a marketing campaign for a new product 176 out of the 200 potential users of this product that were contacted indicated that they would use it. Calculate a 90% confidence interval for the proportion of potential users who would use this product.

Solution:

x = 176, n = 200 so $\hat{p} = 176/200 = 0.88$, $\hat{q} = 1 - \hat{p} = 0.12$.

confidence percentage = 90 (given) so z-multiplier = 1.645 (from above table)

The confidence interval is $0.88 \pm 1.645\sqrt{0.88 \times 0.12/200} = 0.88 \pm 0.0378 = (0.842, 0.918)$.
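A short Python sketch of the same calculation follows. The function name `proportion_interval` is illustrative. Instead of the table, it obtains the z-multiplier from `statistics.NormalDist().inv_cdf` (available in Python 3.8 and later), which gives 1.644854 rather than the rounded 1.645; the interval agrees to three decimals.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence_pct):
    """Large-sample confidence interval for a population proportion."""
    alpha = 1 - confidence_pct / 100
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-multiplier
    p_hat = x / n
    q_hat = 1 - p_hat
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

# Marketing example: 176 of 200 potential users, 90% confidence
lower, upper = proportion_interval(176, 200, 90)
print(f"({lower:.3f}, {upper:.3f})")
```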

7.8 Sample size when estimating the population mean

Consider the formula for the confidence interval of the mean ($\mu$) when $\sigma^2$ is known:

$\bar{x} \pm \text{z-multiplier} \times \sigma/\sqrt{n}$.

The quantity $\text{z-multiplier} \times \sigma/\sqrt{n}$ is known as the error (denoted by E).

The smaller the error, the more accurately the parameter $\mu$ is estimated. Suppose the size of the error is specified in advance and the sample size n is determined to achieve this accuracy. This can be done by solving for n from the equation

$E = \text{z-multiplier} \times \sigma/\sqrt{n}$, which gives

$n = \left(\dfrac{\text{z-multiplier} \times \sigma}{E}\right)^2$.

The z-multiplier is determined by the percentage confidence required in the estimation.


Example

Consider the example on the interval estimation of the mean content of 500 milliliter cool drink bottles. The standard deviation is known to be 5. Suppose it is desired to estimate the mean with 95% confidence and an error that is not greater than 0.8. What sample size is needed to achieve this accuracy?

Solution:

$\sigma = 5$, E = 0.8 (given), z-multiplier = 1.96 (from 95% confidence requirement).

$n = \left(\dfrac{1.96 \times 5}{0.8}\right)^2 = 150.0625$, so n = 151 (n is always rounded up).
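The sample size formula, including the rounding-up rule, can be sketched as follows. The function name `sample_size_for_mean` is an illustrative choice.

```python
from math import ceil

def sample_size_for_mean(z_multiplier, sigma, error):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the error."""
    return ceil((z_multiplier * sigma / error) ** 2)

# Cool drink example: 95% confidence, sigma = 5, E = 0.8
print(sample_size_for_mean(1.96, 5, 0.8))
```

Rounding up (rather than to the nearest integer) guarantees that the achieved error is no larger than the specified E.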

7.9 Sample size for estimation of population proportion

The approach used in determining the sample size for the estimation of the population proportion is much the same as that used when estimating the population mean.

The equation to be solved for n is: $E = \text{z-multiplier} \times \sqrt{pq/n}$.

When solving for n the formula becomes

$n = pq\left(\dfrac{\text{z-multiplier}}{E}\right)^2$.

A practical problem encountered when using this formula is that values for the parameters p and $q = 1 - p$ are needed. Since the purpose of this technique is to estimate p, these values of p and q are obviously not known.

If no information on p is available, the value of p that will give the maximum value of $p(1-p) = pq$ will be taken. It can be shown that p = 0.5 maximizes this expression. This gives max pq = 0.25. Substituting this maximum value in the above formula gives

max $n = 0.25\left(\dfrac{\text{z-multiplier}}{E}\right)^2$.

If more accurate information on the value of p is known (e.g. some range of values), it should be used in the above formula.

As explained before, the z-multiplier is determined by the percentage confidence required in the estimation.

Example

Consider the problem (discussed earlier) of estimating the proportion of potential users who would use a new product. Suppose this proportion is to be estimated with 99% confidence and an error not exceeding 2% (a proportion of 0.02). What sample size is needed to achieve this?

Solution:

E = 0.02 (given), z-multiplier = 2.576 (99% confidence required)

If no information is known about p then

$n = 0.25\left(\dfrac{2.576}{0.02}\right)^2 = 4147.36$, so n = 4148 (rounded up).

But suppose it is known that the value of p is between 0.8 and 0.9. In such a case

max $p(1-p) = pq = 0.8 \times 0.2 = 0.16$ (Why is p = 0.8 used?).

By using this information the value of n can be calculated as

$n = 0.16\left(\dfrac{2.576}{0.02}\right)^2 = 2654.31$, so n = 2655 (rounded up).

The additional information on possible values for p reduces the sample size by 36%.
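Both cases of the example can be sketched with one hypothetical helper, `sample_size_for_proportion`, whose optional `p_range` argument (an assumption of this sketch) carries any prior information about p. Within a range, p(1 - p) is largest at the value closest to 0.5, so that value is used as the worst case.

```python
from math import ceil

def sample_size_for_proportion(z_multiplier, error, p_range=(0.0, 1.0)):
    """Worst-case sample size for estimating p within the given error."""
    low, high = p_range
    # p(1 - p) is largest at the point of the range closest to 0.5
    p = min(max(0.5, low), high)
    return ceil(p * (1 - p) * (z_multiplier / error) ** 2)

# No information on p: worst case p = 0.5
print(sample_size_for_proportion(2.576, 0.02))
# p known to lie between 0.8 and 0.9: worst case p = 0.8
print(sample_size_for_proportion(2.576, 0.02, (0.8, 0.9)))
```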

7.10 Computer output

1. Confidence interval for the mean ($\sigma^2$ known). For the data in the example in section 7.4, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

mean                 498.5
sigma                5
n                    30
z multiplier         1.959964
Confidence interval
lower                496.71
upper                500.29

2. Confidence interval for the mean ($\sigma^2$ not known). For the data in the example in section 7.5, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

mean                 38.36
stand.dev            5.777642
n                    15
t multiplier         2.144787
Confidence interval
lower                35.16
upper                41.56

3. Confidence interval for the variance. For the data in the example in section 7.6, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

variance             33.38114
n                    15
degrees of freedom   14
lower chisq.         5.628726
upper chisq.         26.11895
Confidence interval
lower                17.89
upper                83.03

4. Confidence interval for the proportion of successes. For the data in the example in section 7.7, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

n                    200
x                    176
z multiplier         1.644854
st.error             0.022978
Confidence interval
lower                0.842
upper                0.918

Chapter 8 – Statistical Inference: Testing of hypotheses for one sample

8.1 Formulation of hypotheses and related terminology

Statistical hypothesis

A statistical hypothesis is an assertion (claim) made about a value(s) of a population parameter.

Purpose

The purpose of testing of hypotheses is to determine whether a claim that is made could be true. The conclusion about the truth of such a claim is not stated with absolute certainty, but rather in terms of the language of probability.

Examples of claims to be tested

1) A supermarket receives complaints that the mean content of "1 kilogram" sugar bags that are sold by them is less than 1 kilogram.

2) The variability in the drying time of a certain paint (as measured by the variance) has until recently been 6 minutes. It is suspected that the variability has now increased.

3) A construction company suspects that the proportion of jobs they complete behind schedule is 0.20 (20%). They want to test whether this is indeed the case.

Null and alternative hypotheses

Null hypothesis ($H_0$)

This is a statement concerning the value of the parameter of interest ($\theta$) in a claim that is made. This is formulated as

$H_0: \theta = \theta_0$ (the statement that the parameter $\theta$ is equal to the hypothetical value $\theta_0$).

Alternative hypothesis ($H_1$)

This is a statement about the possible values of the parameter $\theta$ that are believed to be true if $H_0$ is not true. One of the alternative hypotheses shown below will apply.

$H_{1a}: \theta < \theta_0$ or $H_{1b}: \theta > \theta_0$ or $H_{1c}: \theta \ne \theta_0$.

Examples

1) In the first example (above) the parameter of interest is the population mean $\mu$ and the hypotheses to be tested are

$H_0: \mu = 1$ (Population mean is 1 kilogram)

$H_{1a}: \mu < 1$ (Population mean is less than 1 kilogram)

In terms of the general notation stated above $\theta = \mu$ and $\theta_0 = 1$.

2) In the second example (above) the parameter of interest is the population variance $\sigma^2$ and the hypotheses to be tested are

$H_0: \sigma^2 = 6$ (Population variance is 6)

$H_{1b}: \sigma^2 > 6$ (Population variance is greater than 6)

In terms of the general notation stated above $\theta = \sigma^2$ and $\theta_0 = 6$.

3) In the third example (above) the parameter of interest is the population proportion, p, of job completions behind schedule and the hypotheses to be tested are

$H_0: p = 0.20$ (Population proportion is 0.20)

$H_{1c}: p \ne 0.20$ (Population proportion is not equal to 0.20)

In terms of the general notation stated above $\theta = p$ and $\theta_0 = 0.20$.

One and two-sided alternatives

One-sided alternative

This is a hypothesis that specifies the alternative values (to the null hypothesis) in a direction that is either below or above that specified by the null hypothesis.

Example

The alternative hypothesis $H_{1a}$ (see example 1 above) is the alternative that the value of the parameter is less than that stated under the null hypothesis and the alternative $H_{1b}$ (see example 2 above) is the alternative that the value of the parameter is greater than that stated under the null hypothesis.

Two-sided alternative

This is a hypothesis that specifies the alternative values (to the null hypothesis) in directions that can be either below or above that specified by the null hypothesis.

Example

The alternative hypothesis $H_{1c}$ (see example 3 above) is the alternative that the value of the parameter is either greater than that stated under the null hypothesis or less than that stated under the null hypothesis.

8.2 Testing of hypotheses for one sample: Terminology and summary of procedure

The testing procedure and terminology will be explained for the test for the population mean $\mu$ with population variance $\sigma^2$ known.

The hypotheses to be tested are

$H_0: \mu = \mu_0$ versus

$H_{1a}: \mu < \mu_0$ or $H_{1b}: \mu > \mu_0$ or $H_{1c}: \mu \ne \mu_0$.

The data set that is needed to perform the test is $x_1, x_2, \ldots, x_n$, a random sample of size n drawn from the population for which the mean is tested. The test is performed to see whether or not the sample data are consistent with what is stated by the null hypothesis.

The instrument that is used to perform the test is called a test statistic. A test statistic is a quantity calculated from the sample data.

When testing for the population mean, the test statistic used is:

$z_0 = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.

If the difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_0$) is reasonably small, $H_0$ will not be rejected. In this case the sample mean is consistent with the value of the population mean that is being tested. If this difference (and therefore the value of $z_0$) is sufficiently large, $H_0$ will be rejected. In this case the sample mean is not consistent with the value of the population mean that is being tested. In order to decide how large this difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_0$) should be before $H_0$ is rejected, the following should be considered.

Type I error

• A type I error is committed when the null hypothesis is rejected when, in fact, it is true i.e. $H_0$ is wrongly rejected.

• In this example, a type I error is committed when it is decided that the statement $H_0: \mu = \mu_0$ should be rejected when, in fact, it is true.

Type II error

• A type II error is committed when the null hypothesis is not rejected when, in fact, it is false i.e. a decision not to reject $H_0$ is wrong.

• In this example, a type II error is committed when it is decided that the statement $H_0: \mu = \mu_0$ should not be rejected when, in fact, it is false.


The following table gives a summary of possible conclusions and their correctness when performing a test of hypotheses.

Actually true     Conclusion: Reject H0     Conclusion: Do not reject H0
H0 is true        Type I error              Correct conclusion
H0 is false       Correct conclusion        Type II error

A Type I error is often considered to be more serious, and therefore more important to avoid, than a Type II error. The hypothesis testing procedure is therefore designed so that there is a guaranteed small probability of rejecting the null hypothesis wrongly. This probability is never 0 (why?). Mathematically the probability of a type I error can be stated as

P(type I error) = P(Reject $H_0$ | $H_0$ is true) = $\alpha$.

When testing for the population mean

P(type I error) = P(reject $\mu = \mu_0$ | $\mu = \mu_0$ is true) = $\alpha$

P(type II error) = P(do not reject $\mu = \mu_0$ | $\mu = \mu_0$ is false) = $\beta$.

Probabilities of type I and type II errors work in opposite directions. The more reluctant you are to reject $H_0$, the higher the risk of accepting it when, in fact, it is false. The easier you make it to reject $H_0$, the lower the risk of accepting it when, in fact, it is false.

Critical value(s) and critical region

Critical (cut-off) value(s)

• The critical value(s) for tests of hypotheses is (are) a value(s) to which the test statistic is compared in order to determine whether or not the null hypothesis should be rejected.

• The critical value is determined according to the specified value of $\alpha$, the probability of a type I error.

For the test of the population mean the critical value is determined in the following way.

Assuming that $H_0$ is true, the test statistic will follow a standard normal distribution i.e.

$Z_0 = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.

(i) When testing $H_0$ versus the alternative hypothesis $H_{1a}$ ($\mu < \mu_0$), the critical value is the value $Z_\alpha$ which is such that the area under the standard normal curve to the left of $Z_\alpha$ is $\alpha$ i.e. $P(Z_0 < Z_\alpha) = \alpha$. This leaves an area of $1 - \alpha$ to the right of $Z_\alpha$.

The corresponding graph illustrates the case $\alpha = 0.05$ i.e. $P(Z_0 < -1.645) = 0.05$.

(ii) When testing $H_0$ versus the alternative hypothesis $H_{1b}$ ($\mu > \mu_0$), the critical value is the value $Z_{1-\alpha}$ which is such that the area under the standard normal curve to the left of $Z_{1-\alpha}$ is $1 - \alpha$ i.e. $P(Z_0 < Z_{1-\alpha}) = 1 - \alpha$. This leaves an area of $\alpha$ to the right of $Z_{1-\alpha}$.

The corresponding graph illustrates the case $\alpha = 0.05$. This means $1 - \alpha = 0.95$ and thus $P(Z_0 < 1.645) = 0.95$.

(iii) When testing $H_0$ versus the alternative hypothesis $H_{1c}$ ($\mu \ne \mu_0$), the critical values are the values $Z_{1-\alpha/2}$ and $Z_{\alpha/2}$. The area under the standard normal curve to the left of $Z_{1-\alpha/2}$ is $1 - \alpha/2$. The area under the standard normal curve to the left of $Z_{\alpha/2}$ is $\alpha/2$, i.e. $P(Z_0 < Z_{1-\alpha/2}) = 1 - \alpha/2$ and $P(Z_0 < Z_{\alpha/2}) = \alpha/2$.

The area under the normal curve between these two critical values is $1 - \alpha$. The corresponding graph illustrates the case $\alpha = 0.05$ i.e. $P(Z_0 < -1.96 \text{ or } Z_0 > 1.96) = 0.05$.


Critical region (CR)

• The critical region, or rejection region R, is the set of values of the test statistic for which the null hypothesis is rejected.

(i) When testing $H_0$ versus the alternative hypothesis $H_{1a}$, the rejection region is $\{ z_0 \mid z_0 < Z_\alpha \}$.

(ii) When testing $H_0$ versus the alternative hypothesis $H_{1b}$, the rejection region is $\{ z_0 \mid z_0 > Z_{1-\alpha} \}$.

(iii) When testing $H_0$ versus the alternative hypothesis $H_{1c}$, the rejection region is $\{ z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2} \}$.

$H_0$ is rejected when there is a sufficiently large difference between the sample mean $\bar{x}$ and the mean ($\mu_0$) under $H_0$. Such a large difference is called a significant difference (result of the test is significant). The value of $\alpha$ is called the level of significance. It specifies the level beyond which this difference (between $\bar{x}$ and $\mu_0$) is sufficiently large for $H_0$ to be rejected. The value of $\alpha$ is specified prior to performing the test and is often taken as either 0.05 (5% level of significance) or 0.01 (1% level of significance).

When $H_0$ is rejected, it does not necessarily mean that it is not true. It means that according to the sample evidence available it appears not to be true. Similarly when $H_0$ is not rejected, it does not necessarily mean that it is true. It means that there is not sufficient sample evidence to disprove $H_0$.

Critical values for tests based on the standard normal distribution can be found from the selected percentiles listed at the bottom of the pages of the standard normal table.
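As an alternative to the printed table, the critical values can be obtained from the standard normal quantile function. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library; the function names `critical_region` and `in_critical_region` and the labels "less", "greater" and "two-sided" (standing for $H_{1a}$, $H_{1b}$ and $H_{1c}$) are illustrative choices, not notation from the notes.

```python
from statistics import NormalDist

def critical_region(alpha, alternative):
    """Return (lower, upper) critical values; None means that tail is not used."""
    z = NormalDist().inv_cdf  # percentile function of N(0, 1)
    if alternative == "less":        # H1a: reject if z0 < Z_alpha
        return z(alpha), None
    if alternative == "greater":     # H1b: reject if z0 > Z_(1 - alpha)
        return None, z(1 - alpha)
    if alternative == "two-sided":   # H1c: reject in either tail
        return z(alpha / 2), z(1 - alpha / 2)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

def in_critical_region(z0, alpha, alternative):
    """True if the test statistic z0 falls in the rejection region."""
    lower, upper = critical_region(alpha, alternative)
    return (lower is not None and z0 < lower) or (upper is not None and z0 > upper)

print(critical_region(0.05, "less"))
print(critical_region(0.05, "greater"))
print(critical_region(0.05, "two-sided"))
```

For $\alpha = 0.05$ the values agree, after rounding, with the -1.645, 1.645 and ±1.96 used in the text.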

8.3 Test for the population mean (population variance known)

A summary of the steps to be followed in the testing procedure is shown below.

Test for $\mu$ when $\sigma^2$ is known

1 State null and alternative hypotheses.

$H_0: \mu = \mu_0$ versus $H_{1a}: \mu < \mu_0$ or $H_{1b}: \mu > \mu_0$ or $H_{1c}: \mu \ne \mu_0$.

2 Calculate the test statistic $z_0 = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.

3 State the level of significance $\alpha$ and determine the critical value(s) and critical region.

(i) For alternative $H_{1a}$ the critical region is $R = \{ z_0 \mid z_0 < Z_\alpha \}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha} \}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2} \}$.

4 If $z_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

State conclusion in terms of the original problem.

Examples

1) A supermarket receives complaints that the mean content of "1 kilogram" sugar bags that are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is selected from the shelves and the mean found to be 0.987 kilograms. From past experience the standard deviation of the contents of these bags is known to be 0.025 kilograms. Test, at the 5% level of significance, whether this complaint is justified.

Solution:

$H_0: \mu = 1$ (The complaint is not justified)

$H_1: \mu < 1$ (The complaint is justified)

n = 40, $\bar{x} = 0.987$, $\sigma = 0.025$, $\mu_0 = 1$ (given)

Test statistic: $z_0 = \dfrac{0.987 - 1}{0.025/\sqrt{40}} = -3.289$.

$\alpha = 0.05$ so critical region $R = \{ z_0 < Z_{0.05} = -1.645 \}$.

Since $z_0 = -3.289 < -1.645$, $H_0$ is rejected.

Conclusion: The data suggest that the complaint is justified.

2) A supermarket manager suspects that the machine filling "500 gram" containers of coffee is overfilling them i.e. the actual contents of these containers is more than 500 grams. A random sample of 30 of these containers is selected from the shelves and the mean found to be 501.8 grams. From past experience the variance of the contents of these containers is known to be 60 grams². Test at the 5% level of significance whether the manager's suspicion is justified.

Solution:

$H_0: \mu = 500$ (Suspicion is not justified)

$H_1: \mu > 500$ (Suspicion is justified)

n = 30, $\bar{x} = 501.8$, $\sigma^2 = 60$, $\mu_0 = 500$ (given)

Test statistic: $z_0 = \dfrac{501.8 - 500}{\sqrt{60/30}} = 1.273$.

$\alpha = 0.05$ so critical region $R = \{ z_0 > Z_{0.95} = 1.645 \}$.

Since $z_0 = 1.273 < 1.645$, $H_0$ is not rejected.

Conclusion: There is insufficient evidence to conclude that the suspicion is justified.

3) During a quality control exercise the manager of a factory that fills cans of frozen shrimp wants to check whether the mean weights of the cans conform to specifications i.e. the mean of these cans should be 600 grams as stated on the label of the can. He/she wants to guard against either over or under filling the cans. A random sample of 50 of these cans is selected and the mean found to be 595 grams. From past experience the standard deviation of the contents of these cans is known to be 20 grams. Test, at the 5% level of significance, whether the weights conform to specifications. Repeat the test at the 10% level of significance.

Solution:

$H_0: \mu = 600$ (Weights conform to specifications)

$H_1: \mu \ne 600$ (Weights do not conform to specifications)

n = 50, $\bar{x} = 595$, $\sigma = 20$, $\mu_0 = 600$ (given)

Test statistic: $z_0 = \dfrac{595 - 600}{20/\sqrt{50}} = -1.768$.

$\alpha = 0.05$ so critical region $R = \{ z_0 < Z_{0.025} = -1.96 \text{ or } z_0 > Z_{0.975} = 1.96 \}$.

Since $-1.96 < z_0 = -1.768 < 1.96$, $H_0$ is not rejected.

Conclusion: The weights appear to conform to specifications.

Suppose the test is performed at the 10% level of significance. In such a case

$\alpha = 0.10$ so critical region $R = \{ z_0 < Z_{0.05} = -1.645 \text{ or } z_0 > Z_{0.95} = 1.645 \}$.

Since $z_0 = -1.768 < -1.645$, $H_0$ is rejected.

Conclusion: The weights appear not to conform to specifications.

Thus, being less strict about controlling a type I error (changing $\alpha$ from 0.05 to 0.10) results in a different conclusion about $H_0$ (reject instead of do not reject).
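The four-step procedure can be sketched as a single Python function and applied to the three examples. The name `z_test` and the string labels for the alternatives are assumptions of this sketch; critical values come from `statistics.NormalDist().inv_cdf` rather than the table.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha, alternative):
    """Return (z0, reject) for the test of a mean with sigma known."""
    z0 = (x_bar - mu0) / (sigma / sqrt(n))
    z = NormalDist().inv_cdf
    if alternative == "less":          # H1a: mu < mu0
        reject = z0 < z(alpha)
    elif alternative == "greater":     # H1b: mu > mu0
        reject = z0 > z(1 - alpha)
    elif alternative == "two-sided":   # H1c: mu != mu0
        reject = z0 < z(alpha / 2) or z0 > z(1 - alpha / 2)
    else:
        raise ValueError("unknown alternative")
    return z0, reject

# Example 1: sugar bags
print(z_test(0.987, 1, 0.025, 40, 0.05, "less"))
# Example 2: coffee containers (variance 60, so sigma = sqrt(60))
print(z_test(501.8, 500, sqrt(60), 30, 0.05, "greater"))
# Example 3: shrimp cans at the 5% and 10% levels
print(z_test(595, 600, 20, 50, 0.05, "two-sided"))
print(z_test(595, 600, 20, 50, 0.10, "two-sided"))
```

The last two calls use the same data and differ only in $\alpha$, which is what changes the decision in example 3.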

Note

1. In example 1 the alternative hypothesis $H_{1a}$ was used, in example 2 the alternative $H_{1b}$ and in example 3 the alternative $H_{1c}$.

2. Alternatives $H_{1a}$ and $H_{1b}$ [one-sided (tailed) alternatives] are used when there is a particular direction attached to the range of mean values that could be true if $H_0$ is not true.

3. Alternative $H_{1c}$ [two-sided (tailed) alternative] is used when there is no particular direction attached to the range of mean values that could be true if $H_0$ is not true.

4. If, in the above examples, the level of significance had been changed to 1%, the critical values used would have been $Z_{0.01} = -2.326$ (in example 1), $Z_{0.99} = 2.326$ (in example 2) and $Z_{0.005} = -2.576$, $Z_{0.995} = 2.576$ (in example 3).

8.4 Test for the population mean (population variance not known): t-test

When performing the test for the population mean for the case where the population variance is not known, the following modifications are made to the procedure.

• In the test statistic formula the population standard deviation $\sigma$ is replaced by the sample standard deviation S.

• Since the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$ that is used to perform the test follows a t-distribution with n - 1 degrees of freedom, critical values are looked up in the t-tables.

Test for $\mu$ when $\sigma^2$ is not known (t-test)

1 State null and alternative hypotheses.

$H_0: \mu = \mu_0$ versus $H_{1a}: \mu < \mu_0$ or $H_{1b}: \mu > \mu_0$ or $H_{1c}: \mu \ne \mu_0$.

2 Calculate the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$.

3 State the level of significance $\alpha$ and determine the critical value(s) and critical region.

Degrees of freedom = ν = n - 1.

(i) For alternative $H_{1a}$ the critical region is $R = \{ t_0 \mid t_0 < t_\alpha \}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha} \}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha/2} \text{ or } t_0 < t_{\alpha/2} \}$.

4 If $t_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

State conclusion in terms of the original problem.

Examples

A paint manufacturer claims that the average drying time for a new paint is 2 hours (120 minutes). The drying times for 20 randomly selected cans of paint were obtained. The results are shown below.

123   106   139   135
127   128   119   130
131   133   121   136
122   115   116   133
109   120   130   109

Assuming that the sample was drawn from a normal distribution,

(a) test whether the population mean drying time is greater than 2 hours (120 minutes)

(i) at the 5% level of significance.

(ii) at the 1% level of significance.

(b) test, at the 5% level of significance, whether the population mean drying time could be 2 hours (120 minutes).

Solution:

(a) $H_0: \mu = 120$ (mean is 2 hours)

$H_1: \mu > 120$ (mean is greater than 2 hours)

n = 20, $\mu_0 = 120$ (given), $\bar{x} = 124.1$, S = 9.65674 (calculated from the data).

Test statistic $t_0 = \frac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$.

(i) If α = 0.05, 1 - α = 0.95. From the t-distribution table with degrees of freedom = ν = n - 1 = 19, $t_{0.95} = 1.729$.

Critical region $R = \{t_0 > t_{0.95} = 1.729\}$.

Since 1.899 > 1.729, $H_0$ is rejected.

Conclusion: The mean drying time appears to be greater than 2 hours.

(ii) If α = 0.01, 1 - α = 0.99. From the t-distribution table with degrees of freedom = ν = n - 1 = 19, $t_{0.99} = 2.539$.

Critical region $R = \{t_0 > t_{0.99} = 2.539\}$.

Since 1.899 < 2.539, $H_0$ is not rejected.

Conclusion: At the 1% level of significance there is insufficient evidence that the mean drying time is greater than 2 hours.

Thus, being more strict about controlling a type I error (changing α from 0.05 to 0.01) results in a different conclusion about $H_0$ (do not reject instead of reject).

(b) $H_0: \mu = 120$ (mean is 2 hours)

$H_1: \mu \neq 120$ (mean is not equal to 2 hours)

n = 20, $\mu_0 = 120$ (given), $\bar{x} = 124.1$, S = 9.65674 (calculated from the data).

Test statistic: $t_0 = \frac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$ (as calculated in part (a)).

If α = 0.05, α/2 = 0.025, 1 - α/2 = 0.975.

From the t-distribution table with degrees of freedom = ν = n - 1 = 19, $t_{0.025} = -2.093$, $t_{0.975} = 2.093$.

Critical region $R = \{t_0 < -2.093 \text{ or } t_0 > t_{0.975} = 2.093\}$.

Since -2.093 < 1.899 < 2.093, $H_0$ is not rejected.

Conclusion: The data suggest that the mean drying time could be 2 hours.

Note:

• Despite the fact that the same data were used in the above examples, the conclusions were different. In the first test $H_0$ was rejected, but in the next 2 tests $H_0$ was not rejected.

• In the first test the probability of a type I error was set at 5%, while in the second test this was changed to 1%. To achieve this, the critical value was moved from 1.729 to 2.539, resulting in the test statistic value (1.899) being less than (instead of greater than) the critical value.

• In the third test (which has a two-sided alternative hypothesis), the upper critical value was increased to 2.093 (to have an area of 0.025 under the t-curve to its right). Again this resulted in the test statistic value (1.899) being less than (instead of greater than) the critical value.
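The three tests on the drying-time data can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `one_sample_t` is an illustrative choice, and the critical values are copied from the t-table (19 degrees of freedom) rather than computed.

```python
import math
import statistics

# Drying times (minutes) for the 20 cans of paint in the example.
times = [123, 106, 139, 135, 127, 128, 119, 130, 131, 133,
         121, 136, 122, 115, 116, 133, 109, 120, 130, 109]

def one_sample_t(sample, mu0):
    """Return (sample mean, S, t0) for the test of H0: mu = mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (divisor n - 1)
    t0 = (mean - mu0) / (s / math.sqrt(n))
    return mean, s, t0

mean, s, t0 = one_sample_t(times, 120)

# Critical values taken from the t-table with 19 degrees of freedom (assumed inputs).
T_095, T_099, T_0975 = 1.729, 2.539, 2.093

reject_a_i = t0 > T_095          # (a)(i): one-sided, alpha = 0.05
reject_a_ii = t0 > T_099         # (a)(ii): one-sided, alpha = 0.01
reject_b = abs(t0) > T_0975      # (b): two-sided, alpha = 0.05

print(round(mean, 1), round(s, 5), round(t0, 3))
print(reject_a_i, reject_a_ii, reject_b)
```

Only the comparison against the critical value changes between the three tests; the test statistic is the same each time.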

 

8.5 Test for population variance

The test for the population variance is based on $\chi^2 = \frac{(n-1)S^2}{\sigma^2}$ following a chi-square distribution with n - 1 degrees of freedom. The critical values are therefore obtained from the chi-square tables.

Test for the population variance σ²

1 State the null and alternative hypotheses.

$H_0: \sigma^2 = \sigma_0^2$ versus $H_{1a}: \sigma^2 < \sigma_0^2$ or $H_{1b}: \sigma^2 > \sigma_0^2$ or $H_{1c}: \sigma^2 \neq \sigma_0^2$.

2 Calculate the test statistic $\chi_0^2 = \frac{(n-1)S^2}{\sigma_0^2}$.

3 State the level of significance α and determine the critical value(s) and critical region.

Degrees of freedom = ν = n - 1.

(i) For alternative $H_{1a}$ the critical region is $R = \{\chi_0^2 \mid \chi_0^2 < \chi^2_{\alpha}\}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{\chi_0^2 \mid \chi_0^2 > \chi^2_{1-\alpha}\}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{\chi_0^2 \mid \chi_0^2 > \chi^2_{1-\alpha/2} \text{ or } \chi_0^2 < \chi^2_{\alpha/2}\}$.

4 If $\chi_0^2$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

5 State conclusion in terms of the original problem.

For a one-sided test with alternative hypothesis $H_{1b}$ the rejection region is the area under the chi-square curve to the right of $\chi^2_{1-\alpha}$.

[Graph: chi-square density with the rejection region for $H_{1b}$ highlighted]

For a two-sided test with alternative hypothesis $H_{1c}$ the rejection region is the area to the left of $\chi^2_{\alpha/2}$ together with the area to the right of $\chi^2_{1-\alpha/2}$.

[Graph: chi-square density with the rejection region for $H_{1c}$ highlighted]

Example 1

Consider the example on the drying time of the paint discussed in the previous section. Until recently it was believed that the variance in the drying time is 65 minutes². Suppose it is suspected that this variance has increased. Test this assertion at the 5% level of significance.

Solution:

$H_0: \sigma^2 = 65$ (Variance has not increased)

$H_1: \sigma^2 > 65$ (Variance has increased)

n = 20, $\sigma_0^2 = 65$ (given), S = 9.65674 (calculated from the data).

Test statistic: $\chi_0^2 = \frac{19 \times 9.65674^2}{65} = 27.258$.

α = 0.05, 1 - α = 0.95.

From the chi-square distribution table with degrees of freedom = ν = n - 1 = 19, $\chi^2_{0.95} = 30.14$.

Critical region $R = \{\chi_0^2 > \chi^2_{0.95} = 30.14\}$.

Since 27.258 < 30.14, $H_0$ is not rejected.

Conclusion: There is insufficient evidence to conclude that the variance has increased.

Example 2

A manufacturer of car batteries guarantees that their batteries will last, on average, 3 years with a standard deviation of 1 year. Ten of the batteries have lifetimes (in years) of

1.2   2.5   3   3.5   2.8   4   4.3   1.9   0.7   4.3

Test at the 5% level of significance whether the variability guarantee is still valid.

Solution:

$H_0: \sigma^2 = 1$ (Guarantee is valid)

$H_1: \sigma^2 \neq 1$ (Guarantee is not valid)

n = 10, $\sigma_0^2 = 1$ (given), S = 1.26209702, $S^2 = 1.592889$ (calculated from the data).

Test statistic: $\chi_0^2 = \frac{9 \times 1.592889}{1} = 14.336$.

α = 0.05, α/2 = 0.025, 1 - α/2 = 0.975.

From the chi-square distribution table with degrees of freedom = ν = n - 1 = 9, $\chi^2_{0.025} = 2.70$, $\chi^2_{0.975} = 19.02$.

Critical region $R = \{\chi_0^2 < \chi^2_{0.025} = 2.70 \text{ or } \chi_0^2 > \chi^2_{0.975} = 19.02\}$.

Since 2.70 < 14.336 < 19.02, $H_0$ is not rejected.

Conclusion: The data suggest that the guarantee is still valid.
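The chi-square statistic for the battery example can be checked directly from the ten lifetimes. The sketch below assumes the lifetimes as listed in the example and takes the two critical values from the chi-square table (9 degrees of freedom) instead of computing them.

```python
import statistics

# Battery lifetimes (years) from the example.
lifetimes = [1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7, 4.3]
sigma0_sq = 1.0                           # hypothesised variance (1 year squared)

n = len(lifetimes)
s_sq = statistics.variance(lifetimes)     # sample variance S^2 (divisor n - 1)
chi0 = (n - 1) * s_sq / sigma0_sq         # test statistic (n - 1) S^2 / sigma0^2

# Table values for 9 degrees of freedom (assumed inputs, copied from the tables).
CHI_LOWER, CHI_UPPER = 2.70, 19.02

reject = chi0 < CHI_LOWER or chi0 > CHI_UPPER
print(round(s_sq, 6), round(chi0, 3), reject)
```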

8.6 Test for population proportion

The test for the population proportion (p) is based on the fact that the sample proportion $\hat{P} = \frac{X}{n} \sim N\left(p, \frac{p(1-p)}{n}\right)$, where n is the sample size and X the number of items labeled "success" in the sample. From this result it follows that $Z = \frac{\hat{P} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1)$.

For this reason the critical value(s) and critical region are the same as those for the test for the population mean (both based on the standard normal distribution).

Test for the population proportion p

1 State the null and alternative hypotheses.

$H_0: p = p_0$ versus $H_{1a}: p < p_0$ or $H_{1b}: p > p_0$ or $H_{1c}: p \neq p_0$.

2 Calculate the test statistic $z_0 = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.

3 State the level of significance α and determine the critical value(s) and critical region.

(i) For alternative $H_{1a}$ the critical region is $R = \{z_0 \mid z_0 < Z_{\alpha}\}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{z_0 \mid z_0 > Z_{1-\alpha}\}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2}\}$.

4 If $z_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

5 State conclusion in terms of the original problem.

Examples

1) A construction company suspects that the proportion of jobs they complete behind schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind schedule. Test at the 5% level of significance whether this information confirms their suspicion.

Solution:

$H_0: p = 0.20$ (Suspicion is confirmed)

$H_1: p \neq 0.20$ (Suspicion is not confirmed)

n = 80, x = 22 (given), $\hat{p} = \frac{22}{80} = 0.275$, $p_0 = 0.20$.

Test statistic: $z_0 = \frac{0.275 - 0.20}{\sqrt{0.20 \times 0.80/80}} = 1.677$.

α = 0.05.

Critical region: $R = \{z_0 < Z_{0.025} = -1.96 \text{ or } z_0 > Z_{0.975} = 1.96\}$.

Since -1.96 < $z_0$ = 1.677 < 1.96, $H_0$ is not rejected.

Conclusion: The data are consistent with the suspicion that 20% of jobs are completed behind schedule; there is insufficient evidence to reject it.

2) During a marketing campaign for a new product 176 out of the 200 potential users of this product that were contacted indicated that they would use it. Is this evidence that more than 85% of all the potential users will actually use the product? Use α = 0.01.

Solution:

$H_0: p = 0.85$ (85% of all potential users will use the product)

$H_1: p > 0.85$ (More than 85% of all potential users will use the product)

n = 200, x = 176, $p_0 = 0.85$ (given), $\hat{p} = \frac{176}{200} = 0.88$.

Test statistic $z_0 = \frac{0.88 - 0.85}{\sqrt{0.85 \times 0.15/200}} = 1.188$.

α = 0.01, so the critical region is $R = \{z_0 > Z_{0.99} = 2.326\}$.

Since $z_0$ = 1.188 < 2.326, $H_0$ is not rejected.

Conclusion: There is insufficient evidence to conclude that more than 85% of all potential users will use the product.
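Both proportion examples follow the same arithmetic, which can be sketched as below. The helper name `proportion_z` is illustrative; the critical values are obtained from `statistics.NormalDist` in the standard library rather than from the printed normal table.

```python
import math
from statistics import NormalDist

def proportion_z(x, n, p0):
    """Test statistic z0 = (p_hat - p0) / sqrt(p0 (1 - p0) / n)."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

std_normal = NormalDist()                 # N(0, 1)

# Example 1: two-sided test of H0: p = 0.20 at alpha = 0.05.
z1 = proportion_z(22, 80, 0.20)
crit1 = std_normal.inv_cdf(0.975)         # Z_{0.975}, about 1.96
reject1 = abs(z1) > crit1

# Example 2: one-sided test of H0: p = 0.85 against p > 0.85 at alpha = 0.01.
z2 = proportion_z(176, 200, 0.85)
crit2 = std_normal.inv_cdf(0.99)          # Z_{0.99}, about 2.326
reject2 = z2 > crit2

print(round(z1, 3), round(crit1, 3), reject1)
print(round(z2, 3), round(crit2, 3), reject2)
```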

8.7 Computer output

1) The output shown below is obtained when the test for the population mean, for the data in example 1 in section 8.4, is performed by using Excel.

t-Test: Mean

Mean                   124.1
Variance               93.25263158
Observations           20
Hypothesized Mean      120
df                     19
t Stat                 1.898752271
P(T<=t) one-tail       0.036445557
t Critical one-tail    1.729132792

The value of the test statistic is $t_0$ = 1.90 (2 decimal places). From the table P(T > 1.9) = 0.036. This probability is known as the p-value (the probability of getting a t-value more remote than the test statistic). When testing at the 5% level of significance, a p-value below 0.05 will cause the null hypothesis to be rejected.

2) The output shown below is obtained when the test for the population variance in example 1 in section 8.5 (the data in example 1 in section 8.4) is performed by using Excel.

Chi-square test: Variance

Variance                             93.25263
Observations                         20
Hypothesized variance                65
df                                   19
Chi-square stat                      27.25846
P(Chi-square > 27.25846) one-tail    0.098775
Chi-square critical one-tail         30.14353

The values of the test statistic and critical value are the same as in the example in section 8.5. The p-value is 0.098775 (2nd to last entry in the 2nd column in the table above). Since 0.098775 > 0.05 the null hypothesis cannot be rejected at the 5% level of significance.
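The one-tail p-value reported for the t-test can be approximated without a statistics package by integrating the t density numerically. The sketch below is one possible approach, not the method used by the spreadsheet: it applies Simpson's rule to the t density with 19 degrees of freedom, and the function names are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The density is symmetric about 0, so
    P(T > t) = 0.5 - (area under the density between 0 and t).
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p_value = t_upper_tail(1.898752271, 19)
reject_at_5_percent = p_value < 0.05
print(round(p_value, 4), reject_at_5_percent)
```

Comparing the p-value with α gives the same decision as comparing the test statistic with the critical value.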

Chapter 9 – Statistical Inference: Testing of hypotheses for two samples

9.1 Formulation of hypotheses, notation and additional results

The tests discussed in the previous chapter involve hypotheses concerning parameters of a single population and were based on a random sample drawn from a single population of interest. Often the interest is in tests concerning parameters of two different populations (labeled populations 1 and 2) where two random samples (one from each population) are drawn.

Examples

1) Are the mean salaries the same for males and females with the same educational qualifications and work experience?

2) Do smokers and non-smokers have the same mortality rate?

3) Are the variances in drying times for two different types of paints different?

4) Is a particular diet successful in reducing people's weights?

Null and alternative hypotheses

The following hypotheses involving two samples will be tested.

1. The test for equality of two variances. As an example see example 3 above.

2. The test for equality of two means (independent samples). As an example see example 1 above.

3. The test for equality of two means (paired samples). As an example see example 4 above.

4. The test for equality of two proportions. As an example see example 2 above.

The parameters to be used, when testing the hypotheses, are summarized in the table below.

Parameter     population 1     population 2
mean          $\mu_1$          $\mu_2$
variance      $\sigma_1^2$     $\sigma_2^2$
proportion    $p_1$            $p_2$

The following null and alternative hypotheses (as defined in section 8.1) also apply in the two sample case.

$H_0: \theta = \theta_0$ (The statement that the parameter θ is equal to the hypothetical value $\theta_0$).

$H_{1a}: \theta < \theta_0$ or $H_{1b}: \theta > \theta_0$ or $H_{1c}: \theta \neq \theta_0$.

Examples

1) When testing for equality of variances from 2 different populations labeled 1 and 2 the hypotheses are

$H_0: \sigma_1^2 = \sigma_2^2$

$H_{1a}: \sigma_1^2 < \sigma_2^2$ or $H_{1b}: \sigma_1^2 > \sigma_2^2$ or $H_{1c}: \sigma_1^2 \neq \sigma_2^2$.

These hypotheses can also be written as

$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$

$H_{1a}: \frac{\sigma_1^2}{\sigma_2^2} < 1$ or $H_{1b}: \frac{\sigma_1^2}{\sigma_2^2} > 1$ or $H_{1c}: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$.

In terms of the general notation stated above $\theta = \frac{\sigma_1^2}{\sigma_2^2}$ and $\theta_0 = 1$.

2) When testing for equality of means from 2 different populations labeled 1 and 2 the hypotheses are

$H_0: \mu_1 = \mu_2$

$H_{1a}: \mu_1 < \mu_2$ or $H_{1b}: \mu_1 > \mu_2$ or $H_{1c}: \mu_1 \neq \mu_2$.

These hypotheses can also be written as

$H_0: \mu_1 - \mu_2 = 0$

$H_{1a}: \mu_1 - \mu_2 < 0$ or $H_{1b}: \mu_1 - \mu_2 > 0$ or $H_{1c}: \mu_1 - \mu_2 \neq 0$.

In terms of the general notation stated above $\theta = \mu_1 - \mu_2$ and $\theta_0 = 0$.

3) When testing for equality of proportions from 2 different populations labeled 1 and 2 the hypotheses are

$H_0: p_1 = p_2$

$H_{1a}: p_1 < p_2$ or $H_{1b}: p_1 > p_2$ or $H_{1c}: p_1 \neq p_2$.

These hypotheses can also be written as

$H_0: p_1 - p_2 = 0$

$H_{1a}: p_1 - p_2 < 0$ or $H_{1b}: p_1 - p_2 > 0$ or $H_{1c}: p_1 - p_2 \neq 0$.

In terms of the general notation stated above $\theta = p_1 - p_2$ and $\theta_0 = 0$.

Notation

The following notation will be used in the description of the two sample tests.

Measure | notation (population 1) | notation (population 2)
sample size | n (or $n_1$) | m (or $n_2$)
sample | $x_1, x_2, \ldots, x_n$ | $x_1, x_2, \ldots, x_m$
sample mean | $\bar{x}_1$ | $\bar{x}_2$
sample variance (standard deviation) | $S_1^2$ ($S_1$) | $S_2^2$ ($S_2$)
sample proportion | $\hat{p}_1 = \frac{x_n^{*}}{n}$ | $\hat{p}_2 = \frac{x_m^{*}}{m}$

$x_n^{*}$ and $x_m^{*}$ are the numbers of "success" items in the samples from populations 1 and 2 respectively.

Standard error and variance formulae

When testing hypotheses for the difference between two means ($\mu_1 - \mu_2$) or the difference between two proportions ($p_1 - p_2$), formulae for the standard errors of the corresponding sample differences ($\bar{X}_1 - \bar{X}_2$ when testing for the mean, $\hat{P}_1 - \hat{P}_2$ when testing for proportions) will be needed. These formulae are summarized in the following table (in each of the formulae the variance is the square of the standard error).

Sample difference ($\hat{\theta}$) | Condition | Standard error SE($\hat{\theta}$)
$\bar{X}_1 - \bar{X}_2$ | population variances not equal | $\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2}$
$\bar{X}_1 - \bar{X}_2$ | population variances equal, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$ | $\sigma\left(\frac{1}{n} + \frac{1}{m}\right)^{1/2}$
$\hat{P}_1 - \hat{P}_2$ | population proportions not equal | $\left[\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{m}\right]^{1/2}$
$\hat{P}_1 - \hat{P}_2$ | population proportions equal, i.e. $p_1 = p_2 = p$ | $\left[p(1-p)\left(\frac{1}{n} + \frac{1}{m}\right)\right]^{1/2}$

The general form of the test statistic used in most of these tests is $Z = \frac{\hat{\theta} - \theta_0}{SE(\hat{\theta})}$, where Z follows a N(0, 1) distribution. In some small sample cases the test statistic has the general form $t = \frac{\hat{\theta} - \theta_0}{\widehat{SE}(\hat{\theta})}$, where t follows a t-distribution.

Two sample sampling distribution results

1) For sufficiently large random samples (both $n_1, n_2 > 30$) drawn from populations (with known variances) that are not too different from a normal population, the statistic

$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2}}$

follows a N(0, 1) distribution.

2) When $\sigma_1^2 = \sigma_2^2 = \sigma^2$ the above mentioned result still holds, but with $\sigma\left(\frac{1}{n} + \frac{1}{m}\right)^{1/2}$ in the denominator.

3) When the population variances $\sigma_1^2$, $\sigma_2^2$ and $\sigma^2$, referred to in the two above mentioned results, are not known they may be replaced by their sample estimates $S_1^2$, $S_2^2$ and $S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2}$ respectively. In such a case the resulting statistic follows

(i) a N(0, 1) distribution when the sample sizes are large (both $n_1, n_2 > 30$).

(ii) a t-distribution when the sample sizes are small (at least one of $n_1$ or $n_2 \leq 30$). The degrees of freedom will depend on whether $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is true or not. If $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is true, the degrees of freedom is $n_1 + n_2 - 2$.

4) For sufficiently large random samples the statistic

$Z = \frac{\hat{P}_1 - \hat{P}_2 - (p_1 - p_2)}{\left[\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{m}\right]^{1/2}}$

follows a N(0, 1) distribution.

5) When $p_1 = p_2 = p$ the above mentioned result still holds but with $\left[p(1-p)\left(\frac{1}{n} + \frac{1}{m}\right)\right]^{1/2}$ in the denominator.

6) Provided the sample sizes are sufficiently large, the two above mentioned results will still be valid with $p_1$, $p_2$ and $p$ in the denominator replaced by $\hat{p}_1 = \frac{x_n^{*}}{n}$, $\hat{p}_2 = \frac{x_m^{*}}{m}$ and $\hat{p} = \frac{x_n^{*} + x_m^{*}}{n + m}$ respectively.
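The standard errors and pooled estimates above translate directly into small helper functions. The sketch below is illustrative only: the function names are hypothetical, and the numbers in the usage lines are arbitrary values chosen to exercise the formulas.

```python
import math

def se_means(var1, n, var2, m):
    """SE of X1bar - X2bar when the population variances are not equal."""
    return math.sqrt(var1 / n + var2 / m)

def pooled_variance(s1_sq, n, s2_sq, m):
    """Pooled estimate S^2 = ((n-1) S1^2 + (m-1) S2^2) / (n + m - 2)."""
    return ((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2)

def se_means_pooled(s_sq, n, m):
    """SE of X1bar - X2bar under equal variances: S (1/n + 1/m)^(1/2)."""
    return math.sqrt(s_sq) * math.sqrt(1 / n + 1 / m)

def pooled_proportion(x_n, n, x_m, m):
    """Pooled estimate of p: total successes over total sample size."""
    return (x_n + x_m) / (n + m)

def se_proportions_pooled(p_hat, n, m):
    """SE of P1hat - P2hat under p1 = p2 = p, with p replaced by its estimate."""
    return math.sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))

# Usage with arbitrary illustrative values.
print(round(se_means(55, 40, 47, 35), 4))
print(pooled_variance(4.0, 10, 6.0, 10))
p_hat = pooled_proportion(30, 100, 50, 100)
print(p_hat, round(se_proportions_pooled(p_hat, 100, 100), 4))
```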

9.2 Test for equality of population variances (F-test)

A summary of the steps to be followed in the testing procedure is shown below.

Test for $\sigma_1^2 = \sigma_2^2$

Step 1: State null and alternative hypotheses

$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_{1a}: \sigma_1^2 < \sigma_2^2$ or $H_{1b}: \sigma_1^2 > \sigma_2^2$ or $H_{1c}: \sigma_1^2 \neq \sigma_2^2$

Step 2: Calculate the test statistic $F_0 = S_1^2 / S_2^2$

Step 3: State the level of significance α and determine the critical value(s) and critical region.

Degrees of freedom are

$df_1$ = sample size (numerator sample variance) - 1 and

$df_2$ = sample size (denominator sample variance) - 1

(i) For alternative $H_{1a}$ the critical region is $R = \{F_0 \mid F_0 < F_{\alpha}\}$

(ii) For alternative $H_{1b}$ the critical region is $R = \{F_0 \mid F_0 > F_{1-\alpha}\}$

(iii) For alternative $H_{1c}$ the critical region is $R = \{F_0 \mid F_0 < F_{\alpha/2} \text{ or } F_0 > F_{1-\alpha/2}\}$

Step 4: If $F_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

Step 5: State the conclusion in terms of the original problem

 

Example 1

The following sample information about the daily travel expenses of the sales (population 1) and audit (population 2) staff at a certain company was collected.

sales   1048   1080   1168   1320   1088   1136
audit   1040    816   1032   1142   1192    960   1112

Test at the 10% level of significance whether the population variances could be the same.

Solution:

$H_0: \sigma_1^2 = \sigma_2^2$ or $H_0: \sigma_1^2 / \sigma_2^2 = 1$

$H_1: \sigma_1^2 \neq \sigma_2^2$ or $H_1: \sigma_1^2 / \sigma_2^2 \neq 1$

From the above information

$n_1 = 6$, $n_2 = 7$, $S_1^2 = 9593.6$, $S_2^2 = 15884$

Test statistic: $F_0 = 9593.6 / 15884 = 0.604$

$df_1 = 6 - 1 = 5$, $df_2 = 7 - 1 = 6$

α = 0.1, α/2 = 0.05, 1 - α/2 = 0.95

$F_{5,6;0.95} = 4.39$

$F_{5,6;0.05} = 1 / F_{6,5;0.95} = 1 / 4.95 = 0.202$

Critical region $R = \{F_0 \mid F_0 < 0.202 \text{ or } F_0 > 4.39\}$

Since 0.202 < $F_0$ = 0.604 < 4.39, $H_0$ is not rejected.

Conclusion: There is insufficient evidence to conclude that the population variances differ, so they could be the same.

Example 2

The waiting times (minutes) for minor treatments were recorded at two different medical centres. Below is a summary of the calculations made from the samples.

centre   sample size   mean    variance
1        12            25.69    7.200
2        10            27.66   22.017

Test at the 5% level of significance whether the centre 1 population variance is less than that for population 2.

Solution:

$H_0: \sigma_1^2 = \sigma_2^2$ or $H_0: \sigma_1^2 / \sigma_2^2 = 1$

$H_1: \sigma_1^2 < \sigma_2^2$ or $H_1: \sigma_1^2 / \sigma_2^2 < 1$

From the above table: $n_1 = 12$, $n_2 = 10$, $S_1^2 = 7.2$, $S_2^2 = 22.017$

Test statistic: $F_0 = S_1^2 / S_2^2 = 7.2 / 22.017 = 0.327$

α = 0.05, $df_1 = 12 - 1 = 11$, $df_2 = 10 - 1 = 9$.

Critical value: $F_{11,9;0.05} = 1 / F_{9,11;0.95} = 1 / 2.9 = 0.345$

Critical region $R = \{F_0 \mid F_0 < 0.345\}$

Since $F_0$ = 0.327 < 0.345, $H_0$ is rejected.

Conclusion: The evidence suggests that the variance for population 1 is less than that for population 2.
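The F-test of example 1 can be reproduced from the raw travel-expense data. The following sketch uses only the standard library; the critical values are taken from the F-table as quoted in the solution, not computed.

```python
import statistics

# Daily travel expenses from example 1.
sales = [1048, 1080, 1168, 1320, 1088, 1136]          # population 1
audit = [1040, 816, 1032, 1142, 1192, 960, 1112]      # population 2

s1_sq = statistics.variance(sales)        # sample variance, divisor n - 1
s2_sq = statistics.variance(audit)
f0 = s1_sq / s2_sq                        # test statistic F0 = S1^2 / S2^2
df1, df2 = len(sales) - 1, len(audit) - 1

# Two-sided test at alpha = 0.10; table values (assumed inputs).
F_UPPER = 4.39            # F(5, 6) upper 5% point
F_LOWER = 1 / 4.95        # reciprocal of the F(6, 5) upper 5% point

reject = f0 < F_LOWER or f0 > F_UPPER
print(round(s1_sq, 1), round(s2_sq, 1), round(f0, 3), (df1, df2), reject)
```

The lower critical value is obtained from the upper one by swapping the degrees of freedom and taking the reciprocal, as in the worked solution.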

9.3 Test for difference between means for independent samples

i) For large samples (both sample sizes $n_1, n_2 > 30$)

Test for $\mu_1 - \mu_2 = 0$ (large samples, population variances known)

Step 1: State null and alternative hypotheses

$H_0: \mu_1 - \mu_2 = 0$

$H_{1a}: \mu_1 - \mu_2 < 0$ or $H_{1b}: \mu_1 - \mu_2 > 0$ or $H_{1c}: \mu_1 - \mu_2 \neq 0$

Step 2: Calculate the test statistic $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2}}$.

Step 3: State the level of significance α and determine the critical value(s) and critical region.

(i) For alternative $H_{1a}$ the critical region is $R = \{z_0 \mid z_0 < Z_{\alpha}\}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{z_0 \mid z_0 > Z_{1-\alpha}\}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2}\}$.

Step 4: If $z_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

Step 5: State the conclusion in terms of the original problem.

A 100(1 - α)% confidence interval for $\mu_1 - \mu_2$ is given by

$\bar{x}_1 - \bar{x}_2 \pm Z_{1-\alpha/2}\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2}$.

Note: If the population variances $\sigma_1^2$ and $\sigma_2^2$ are not known, they can be replaced in the above formulae by their sample estimates $S_1^2$ and $S_2^2$ respectively with the testing procedure unchanged.

Example 1

Data were collected on the length of short term stay of patients at hospitals. Independent random samples of $n_1 = 40$ male patients (population 1) and $n_2 = 35$ female patients (population 2) were selected. The sample mean stays for male and female patients were $\bar{x}_1 = 9$ days and $\bar{x}_2 = 7$ days respectively. The population variances are known from past experience to be $\sigma_1^2 = 55$ and $\sigma_2^2 = 47$.

(a) Test at the 5% level of significance whether male patients stay longer on average than female patients.

(b) Calculate a 95% confidence interval for the mean difference (in staying time) between males and females.

Solution:

a) $H_0: \mu_1 - \mu_2 = 0$ (mean staying times for males and females the same)

$H_1: \mu_1 - \mu_2 > 0$ (mean staying time for males greater than for females)

Test statistic: $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2}} = \frac{9 - 7}{\left(\frac{55}{40} + \frac{47}{35}\right)^{1/2}} = \frac{2}{1.6486} = 1.213$

α = 0.05, $Z_{0.95} = 1.645$

Critical region $R = \{z_0 \mid z_0 > 1.645\}$

Since $z_0$ = 1.213 < 1.645, $H_0$ cannot be rejected.

Conclusion: There is insufficient evidence to conclude that male patients stay longer on average than female patients.

b) 1 - α = 0.95, α = 0.05, α/2 = 0.025, $Z_{1-\alpha/2} = Z_{0.975} = 1.96$

$\bar{x}_1 - \bar{x}_2 = 2$, $\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2} = 1.6486$ (the denominator value when calculating the test statistic)

Lower confidence limit: $\bar{x}_1 - \bar{x}_2 - Z_{1-\alpha/2}\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2} = 2 - 1.96 \times 1.6486 = -1.231$

Upper confidence limit: $\bar{x}_1 - \bar{x}_2 + Z_{1-\alpha/2}\left(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right)^{1/2} = 2 + 1.96 \times 1.6486 = 5.231$

A 95% confidence interval for the difference in means is (-1.231 ; 5.231).
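The test statistic and confidence interval of this example can be sketched as follows. The function names are illustrative, and the inputs assume the figures used in the solution's arithmetic (a difference in sample means of 9 - 7 = 2 days, variances 55 and 47, sample sizes 40 and 35).

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, mean2, var1, n, var2, m):
    """Return (z0, standard error) for H0: mu1 - mu2 = 0 with known variances."""
    se = math.sqrt(var1 / n + var2 / m)
    return (mean1 - mean2) / se, se

def confidence_interval(mean1, mean2, se, level=0.95):
    """100*level % confidence interval for mu1 - mu2."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)     # Z_{1 - alpha/2}
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

z0, se = two_sample_z(9, 7, 55, 40, 47, 35)
reject = z0 > NormalDist().inv_cdf(0.95)              # one-sided test, alpha = 0.05
lower, upper = confidence_interval(9, 7, se)

print(round(se, 4), round(z0, 3), reject)
print(round(lower, 3), round(upper, 3))
```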

Example 2

Researchers in obesity want to test the effectiveness of dieting with exercise against dieting without exercise. Seventy three patients who were on the same diet were randomly divided into "exercise" ($n_1 = 37$ patients) and "no exercise" groups ($n_2 = 36$ patients). The results of the weight losses (in kilograms) of the patients after 2 months are summarized in the table below.

Diet with exercise group | Diet without exercise group
$\bar{x}_1 = 7.6$ | $\bar{x}_2 = 6.7$
$S_1^2 = 2.3$ | $S_2^2 = 5.9$

Test at the 5% level of significance whether there is a difference in weight loss between the 2 groups.

Solution:

$H_0: \mu_1 - \mu_2 = 0$ (No difference in weight loss)

$H_1: \mu_1 - \mu_2 \neq 0$ (There is a difference in weight loss)

Test statistic: $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\left(\frac{S_1^2}{n} + \frac{S_2^2}{m}\right)^{1/2}} = \frac{7.6 - 6.7}{\left(\frac{2.3}{37} + \frac{5.9}{36}\right)^{1/2}} = \frac{0.9}{0.475} = 1.893$.

α = 0.05, α/2 = 0.025, 1 - α/2 = 0.975,

Critical values: $Z_{0.025} = -1.96$, $Z_{0.975} = 1.96$

Critical region $R = \{z_0 \mid z_0 < -1.96 \text{ or } z_0 > 1.96\}$

Since -1.96 < $z_0$ = 1.893 < 1.96, $H_0$ cannot be rejected.

Conclusion: There is insufficient evidence to suggest a difference in weight loss between the 2 groups.

ii) For small samples (at least one of $n_1$ or $n_2 \leq 30$) from normal populations with variances unknown

The test to be performed in this case will be preceded by a test for equality of population variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$), i.e. the F-test discussed in section 9.2. If the hypothesis of equal variances cannot be rejected, the test described below should be performed. If this hypothesis is rejected, the Welch-Aspin test (not to be discussed here) should be performed. If, in this case, the assumption of samples from normal populations does not hold, a nonparametric test like the Mann-Whitney test (not to be discussed here) should be used.

Test for $\mu_1 - \mu_2 = 0$ (small sample sizes, population variances unknown but equal)

Step 1: State null and alternative hypotheses

$H_0: \mu_1 - \mu_2 = 0$

$H_{1a}: \mu_1 - \mu_2 < 0$ or $H_{1b}: \mu_1 - \mu_2 > 0$ or $H_{1c}: \mu_1 - \mu_2 \neq 0$

Step 2: Calculate the test statistic $t_0 = \frac{\bar{x}_1 - \bar{x}_2}{S\left(\frac{1}{n} + \frac{1}{m}\right)^{1/2}}$ with $S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2}$

Step 3: State the level of significance α and determine the critical value(s) and critical region.

Degrees of freedom = n + m - 2.

(i) For alternative $H_{1a}$ the critical region is $R = \{t_0 \mid t_0 < t_{\alpha}\}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{t_0 \mid t_0 > t_{1-\alpha}\}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{t_0 \mid t_0 > t_{1-\alpha/2} \text{ or } t_0 < t_{\alpha/2}\}$.

Step 4: If $t_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

Step 5: State the conclusion in terms of the original problem.

A 100(1 - α)% confidence interval for $\mu_1 - \mu_2$ is given by

$\bar{x}_1 - \bar{x}_2 \pm t_{n+m-2;\,1-\alpha/2}\, S\left(\frac{1}{n} + \frac{1}{m}\right)^{1/2}$.

Example 1

Consider the example on the comparison of the travel expenses for the sales and audit staff (see section 9.2, example 1 for the F-test).

(a) Test, at the 5% level of significance, whether the mean expenses for the two types of staff could be the same.

(b) Calculate a 95% confidence interval for the difference between the mean expenses for the two types of staff.

Solution:

a) Since the hypothesis of equal population variances was not rejected, the test described above can be performed.

From the data given $\bar{x}_1 = 1140$, $\bar{x}_2 = 1042$, $S_1^2 = 9593.6$ and $S_2^2 = 15884$.

$H_0: \mu_1 - \mu_2 = 0$ (Mean travel expenses for sales and audit staff the same)

$H_1: \mu_1 - \mu_2 \neq 0$ (Mean travel expenses for sales and audit staff not the same)

α = 0.05, α/2 = 0.025, 1 - α/2 = 0.975

From the t-distribution with ν = n + m - 2 = 6 + 7 - 2 = 11 degrees of freedom, $t_{0.975} = 2.201$.

Critical region $R = \{t_0 < -2.201 \text{ or } t_0 > t_{0.975} = 2.201\}$.

$S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2} = \frac{5 \times 9593.6 + 6 \times 15884}{11} = 13024.727$, S = 114.126

Test statistic: $t_0 = \frac{1140 - 1042}{114.126\left(\frac{1}{6} + \frac{1}{7}\right)^{1/2}} = 1.543$.

Since -2.201 < 1.543 < $t_{0.975}$ = 2.201, $H_0$ cannot be rejected.

Conclusion: The data suggest that the mean travel expenses for sales and audit staff could be the same.

b) A 95% confidence interval for the difference between sales and audit staff means is

$(1140 - 1042) \pm 2.201 \times 114.126 \times \left(\frac{1}{6} + \frac{1}{7}\right)^{1/2}$, i.e. (-41.75 ; 237.75).
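The pooled-variance t-test and confidence interval for the travel-expense data can be reproduced as sketched below. The critical value 2.201 is copied from the t-table (11 degrees of freedom) as quoted in the solution; variable names are illustrative.

```python
import math
import statistics

sales = [1048, 1080, 1168, 1320, 1088, 1136]          # population 1
audit = [1040, 816, 1032, 1142, 1192, 960, 1112]      # population 2
n, m = len(sales), len(audit)

# Pooled variance S^2 = ((n-1) S1^2 + (m-1) S2^2) / (n + m - 2).
s_sq = ((n - 1) * statistics.variance(sales)
        + (m - 1) * statistics.variance(audit)) / (n + m - 2)
se = math.sqrt(s_sq) * math.sqrt(1 / n + 1 / m)       # S (1/n + 1/m)^(1/2)

diff = statistics.mean(sales) - statistics.mean(audit)
t0 = diff / se

T_0975 = 2.201                                        # t-table value, 11 df (assumed input)
reject = abs(t0) > T_0975                             # two-sided test, alpha = 0.05
ci = (diff - T_0975 * se, diff + T_0975 * se)         # 95% confidence interval

print(round(s_sq, 3), round(t0, 3), reject)
print(round(ci[0], 2), round(ci[1], 2))
```

Because the interval contains 0, it leads to the same decision as the two-sided test at the 5% level.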

Example 2

A certain hospital has been getting complaints that the response to calls from senior citizens is slower (takes longer time on average) than that to calls from other patients. In order to test this claim, a pilot study was carried out. The results are shown below.

Patient type | sample mean response time | sample standard deviation | sample size
Senior citizens | 5.60 minutes | 0.25 minutes | 18
Others | 5.30 minutes | 0.21 minutes | 13

Test, at the 1% level of significance, whether the complaint is justified.

Solution:

Label the "senior citizens" and "others" populations as 1 and 2 and their population mean response times as $\mu_1$ and $\mu_2$ respectively.

$H_0: \mu_1 - \mu_2 = 0$ (Mean response times the same)

$H_1: \mu_1 - \mu_2 > 0$ (Mean response time for senior citizens longer than for others)

α = 0.01, 1 - α = 0.99

From the t-distribution table with ν = n + m - 2 = 18 + 13 - 2 = 29 degrees of freedom, $t_{0.99} = 2.462$.

Critical region: $R = \{t_0 > t_{0.99} = 2.462\}$.

The hypothesis that the population variances are equal cannot be rejected (perform the F-test to check this). Hence equal variances for the 2 populations can be assumed.

$S^2 = \frac{17 \times 0.25^2 + 12 \times 0.21^2}{29} = 0.0549$

Test statistic: $t_0 = \frac{5.6 - 5.3}{0.2343\left(\frac{1}{18} + \frac{1}{13}\right)^{1/2}} = 3.518$

Since $t_0$ = 3.518 > 2.462, $H_0$ is rejected.

Conclusion: There is sufficient evidence to conclude that the claim is justified, i.e. the mean response time for senior citizens is longer than that for others.

9.4 Test for difference between means for paired (matched) samples

The tests for the difference between means in the previous section assumed independent samples. In certain situations this assumption is not met.

Examples

1) A group of patients going on a diet is weighed before going on the diet and again after having been on the diet for one month. A test to determine whether the diet has reduced their weight is to be performed.

2) The aptitudes of boys and girls for mathematics are to be compared. In order to eliminate the effect of social factors, pairs of brothers and sisters are used in the comparison. Each (brother, sister) pair is given the same test and the mean marks of boys and girls compared.

In each of these situations the two samples cannot be regarded as independent. In the first example two readings (before and after readings) are made on the same subject. In the second example the two samples are matched via a common factor (family connection).

The data layout for the experiments described above is shown below.

sample 1      $x_1$               $x_2$               .....   $x_n$
sample 2      $y_1$               $y_2$               .....   $y_n$
difference    $d_1 = x_1 - y_1$   $d_2 = x_2 - y_2$   .....   $d_n = x_n - y_n$

The mean of the paired differences of the (x, y) values of the two populations is defined as $\mu_d = \mu_1 - \mu_2$. Under the assumption that the differences are sampled from a normal population, hypotheses concerning the mean of the differences $\mu_d$ can be tested by performing a one sample t-test (described in the previous chapter) with the observed differences $d_1, d_2, \ldots, d_n$ as the sample. The mean and standard deviation of these sample differences will be denoted by $\bar{d}$ and $S_d$ respectively.

Test for $\mu_d = 0$ (paired samples)

Step 1: State null and alternative hypotheses

$H_0: \mu_d = 0$

$H_{1a}: \mu_d < 0$ or $H_{1b}: \mu_d > 0$ or $H_{1c}: \mu_d \neq 0$

Step 2: Calculate the test statistic $t_0 = \frac{\bar{d}}{S_d/\sqrt{n}}$.

Step 3: State the level of significance α and determine the critical value(s) and critical region.

Degrees of freedom = ν = n - 1.

(i) For alternative $H_{1a}$ the critical region is $R = \{t_0 \mid t_0 < t_{\alpha}\}$.

(ii) For alternative $H_{1b}$ the critical region is $R = \{t_0 \mid t_0 > t_{1-\alpha}\}$.

(iii) For alternative $H_{1c}$ the critical region is $R = \{t_0 \mid t_0 > t_{1-\alpha/2} \text{ or } t_0 < t_{\alpha/2}\}$.

Step 4: If $t_0$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

Step 5: State the conclusion in terms of the original problem.

A 100(1 - α)% confidence interval for $\mu_d$ is given by

$\bar{d} \pm t\text{-multiplier} \times \frac{S_d}{\sqrt{n}}$,

where the t-multiplier is obtained from the t-tables with n - 1 degrees of freedom with an area 1 - α/2 under the t-curve to the left of it.

Example 1

A bank is considering loan applications for buying each of 10 homes. Two different companies (company 1 and company 2) are asked to do an evaluation of each of these 10 homes. The evaluations (thousands of Rand) for these homes are shown in the table below. The differences are company 2 minus company 1.

Home           1     2     3     4     5     6     7     8     9    10
company 1    750   990  1025  1285  1300   875  1240   880   700  1315
company 2    810  1000  1020  1320  1290   915  1250   910   650  1290
difference    60    10    −5    35   −10    40    10    30   −50   −25

(a) At the 5% level of significance, is there a difference in the mean evaluations for the 2 companies?

(b) Calculate a 95% confidence interval for the difference between the mean evaluations for companies 1 and 2.

Solution:

(a) H0: $\mu_d = 0$ (No difference in mean evaluations)

H1: $\mu_d \neq 0$ (There is a difference in mean evaluations)

$\alpha = 0.05$, $\alpha/2 = 0.025$, $1 - \alpha/2 = 0.975$

From the t-tables with $\nu = n - 1 = 9$ degrees of freedom, $t_{0.975} = 2.262$.

Critical region: $R = \{ t_0 < -2.262 \text{ or } t_0 > 2.262 \}$.

From the above table $\bar{d} = 9.5$, $S_d = 33.1201$, $n = 10$.

Test statistic: $t_0 = \dfrac{9.5}{33.1201/\sqrt{10}} = 0.907$.

Since $-2.262 < t_0 = 0.907 < 2.262$, H0 is not rejected.

(b) A 95% confidence interval is given by

$9.5 \pm 2.262 \times \dfrac{33.1201}{\sqrt{10}}$ = (−14.19 ; 33.19).
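The calculations in Example 1 can be checked with a short script. What follows is a minimal sketch using only the Python standard library; the variable names are illustrative, and the critical value 2.262 is simply copied from the t-tables rather than computed.

```python
import math
import statistics

# Evaluations (thousands of Rand) from the table in Example 1.
company_1 = [750, 990, 1025, 1285, 1300, 875, 1240, 880, 700, 1315]
company_2 = [810, 1000, 1020, 1320, 1290, 915, 1250, 910, 650, 1290]

# Differences taken as company 2 minus company 1, as in the table.
d = [y - x for x, y in zip(company_1, company_2)]
n = len(d)

d_bar = statistics.mean(d)      # mean of the differences
s_d = statistics.stdev(d)       # sample standard deviation (divisor n - 1)
std_err = s_d / math.sqrt(n)    # standard error of the mean difference

# Test statistic for H0: mu_d = 0.
t0 = d_bar / std_err

# Assumption: critical value read from the t-tables (9 df, area 0.975 to the left).
t_crit = 2.262
reject_h0 = abs(t0) > t_crit    # two-sided critical region

# 95% confidence interval for mu_d.
ci = (d_bar - t_crit * std_err, d_bar + t_crit * std_err)

print(f"d_bar = {d_bar}, S_d = {s_d:.4f}, t0 = {t0:.3f}, reject H0: {reject_h0}")
print(f"95% CI: ({ci[0]:.2f} ; {ci[1]:.2f})")
```

The interval contains 0, which agrees with the decision not to reject H0 in part (a).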

Example 2

Each of 15 people going on a diet was weighed before going on the diet and again after having been on the diet for one month. The weights (in kilograms) are shown in the table below.

Person   before   after   difference      Person   before   after   difference
   1        90      85       −5              9       101      99       −2
   2       110     105       −5             10       112     105       −7
   3       124     126        2             11       138     130       −8
   4       116     118        2             12        96      93       −3
   5       105      94      −11             13       102      95       −7
   6        88      84       −4             14       111     102       −9
   7        86      87        1             15        82      83        1
   8        92      87       −5

Test, at the 1% level of significance, whether the mean weight after one month on the diet is less than that before going on the diet.

Solution:

Let $\mu_d$ denote the mean difference between the weight after having been on the diet for one month and before going on the diet (after minus before).

H0: $\mu_d = 0$ (No difference in mean weights)

H1: $\mu_d < 0$ (Mean weight after one month on diet less than before going on diet)

$\alpha = 0.01$

From the t-tables with $\nu = n - 1 = 14$ degrees of freedom, $t_{0.01} = -t_{0.99} = -2.624$.

Critical region: $R = \{ t_0 < -2.624 \}$.

From the above table $\bar{d} = -4$, $S_d = 4.1231$, $n = 15$.

Test statistic: $t_0 = \dfrac{-4}{4.1231/\sqrt{15}} = -3.757$.

Since $t_0 = -3.757 < -2.624$, H0 is rejected.

Conclusion: The mean weight after one month on the diet is less than before going on the diet.
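The one-sided case in Example 2 differs from the two-sided case only in the sign convention and the critical region. The sketch below, with illustrative variable names and the critical value −2.624 copied from the t-tables, shows how the sign of the differences (after minus before) lines up with the lower-tail critical region.

```python
import math
import statistics

# Weights (kg) for persons 1 to 15 from the table in Example 2.
before = [90, 110, 124, 116, 105, 88, 86, 92, 101, 112, 138, 96, 102, 111, 82]
after = [85, 105, 126, 118, 94, 84, 87, 87, 99, 105, 130, 93, 95, 102, 83]

# Differences are after minus before, so weight loss gives negative values.
d = [a - b for a, b in zip(after, before)]
n = len(d)

d_bar = statistics.mean(d)
s_d = statistics.stdev(d)
t0 = d_bar / (s_d / math.sqrt(n))

# Assumption: lower-tail critical value read from the t-tables (14 df, alpha = 0.01).
t_crit = -2.624
reject_h0 = t0 < t_crit   # critical region for H1: mu_d < 0

print(f"d_bar = {d_bar}, S_d = {s_d:.4f}, t0 = {t0:.3f}, reject H0: {reject_h0}")
```

Had the differences been defined as before minus after, the same conclusion would follow from the upper-tail region with H1: $\mu_d > 0$.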

9.5 Test for the difference between proportions for independent samples

When testing for the difference between the proportions of two different populations, the test is based on the sampling distribution results 4-6 described in the first section of this chapter.

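Test for $p_1 - p_2 = 0$

Step 1: State null and alternative hypotheses

H0: $p_1 - p_2 = 0$

H1a: $p_1 - p_2 < 0$ or H1b: $p_1 - p_2 > 0$ or H1c: $p_1 - p_2 \neq 0$

Step 2: Calculate the test statistic

$z_0 = \dfrac{\hat{p}_1 - \hat{p}_2}{\left[\hat{p}(1-\hat{p})\left(\dfrac{1}{n} + \dfrac{1}{m}\right)\right]^{1/2}}$ with $\hat{p} = \dfrac{x_n + x_m}{n + m}$,

where $x_n$ and $x_m$ are the numbers of successes in the samples of sizes $n$ and $m$, and $\hat{p}_1 = x_n/n$ and $\hat{p}_2 = x_m/m$ are the sample proportions.

Step 3: State the level of significance $\alpha$ and determine the critical value(s) and critical region.

(i) For alternative H1a the critical region is $R = \{ z_0 \mid z_0 < Z_{\alpha} \}$.

(ii) For alternative H1b the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha} \}$.

(iii) For alternative H1c the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2} \}$.

Step 4: If $z_0$ lies in the critical region, reject H0, otherwise do not reject H0.

Step 5: State the conclusion in terms of the original problem.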


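A $100(1-\alpha)\%$ confidence interval for $p_1 - p_2$ is given by

$(\hat{p}_1 - \hat{p}_2) \pm Z_{1-\alpha/2} \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{m}}$,

where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions from the samples of sizes $n$ and $m$ respectively.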

Example

A perfume company is planning to market a new fragrance. In order to test the popularity of the fragrance, 120 young women and 150 older women were selected at random and asked whether they liked the new fragrance. The results of the survey are shown below.

          like    did not like    sample size
young      48          72             120
older      72          78             150

(a) Test, at the 5% level of significance, whether older women like the new fragrance more than young women.

(b) Calculate a 95% confidence interval for the difference between the proportions of older and young women who like the fragrance.

Solution:

Let the older and younger women populations be labeled 1 and 2 respectively and $p_1$ and $p_2$ the respective population proportions that like the fragrance.

a) H0: $p_1 - p_2 = 0$

H1: $p_1 - p_2 > 0$

$\alpha = 0.05$

Critical region $R = \{ z_0 > Z_{0.95} = 1.645 \}$.

From the above table $n = 150$, $m = 120$, $x_n = 72$, $x_m = 48$.

$\hat{p} = \dfrac{72 + 48}{150 + 120} = \dfrac{4}{9}$

Test statistic:

$z_0 = \dfrac{\dfrac{72}{150} - \dfrac{48}{120}}{\left[\dfrac{4}{9}\left(1 - \dfrac{4}{9}\right)\left(\dfrac{1}{150} + \dfrac{1}{120}\right)\right]^{1/2}} = \dfrac{0.08}{0.06086} = 1.314$.

Since $z_0 = 1.314 < 1.645$, H0 cannot be rejected.

Conclusion: There is not sufficient evidence to suggest that older women like the new fragrance more than young women.

b) $Z_{0.975} = 1.96$

A 95% confidence interval for the difference between the proportions of older and younger women who like the fragrance is

(−0.0386 ; 0.1986).
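The perfume example can be reproduced numerically as follows. This is a sketch under stated assumptions: the variable names are illustrative, the multipliers 1.645 and 1.96 are copied from the normal tables, and the confidence interval uses the unpooled standard error, which is the version that reproduces the interval quoted above.

```python
import math

# Survey counts: sample size and number who like the fragrance.
n, x_n = 150, 72    # population 1: older women
m, x_m = 120, 48    # population 2: young women

p1_hat = x_n / n
p2_hat = x_m / m

# Pooled proportion, used in the test statistic under H0: p1 - p2 = 0.
p_pooled = (x_n + x_m) / (n + m)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n + 1 / m))
z0 = (p1_hat - p2_hat) / se_pooled

# Assumption: Z_0.95 from the normal tables, upper-tail test for H1: p1 - p2 > 0.
z_crit = 1.645
reject_h0 = z0 > z_crit

# Confidence interval with the unpooled standard error and Z_0.975 = 1.96.
z_mult = 1.96
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n + p2_hat * (1 - p2_hat) / m)
diff = p1_hat - p2_hat
ci = (diff - z_mult * se_unpooled, diff + z_mult * se_unpooled)

print(f"p_hat = {p_pooled:.4f}, z0 = {z0:.3f}, reject H0: {reject_h0}")
print(f"95% CI: ({ci[0]:.4f} ; {ci[1]:.4f})")
```

The pooled estimate is appropriate for the test because H0 asserts a common proportion; the interval makes no such assumption, so each sample contributes its own variance term.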

9.6 Computer output

1) The test for the difference between population means in example 1 in section 9.3(ii) (the data in example 1 in section 9.2) can be performed by using Excel. What follows is the output.

t-Test: Two-Sample Assuming Equal Variances

                                Variable 1    Variable 2
Mean                            1140          1042
Variance                        9593.6        15884
Observations                    6             7
Pooled Variance                 13024.73
Hypothesized Mean Difference    0
df                              11
t Stat                          1.543458
P(T<=t) two-tail                0.150984
t Critical two-tail             2.200985

The p-value is 0.150984 > 0.05. At the 5% level of significance the null hypothesis cannot be rejected.

2) The output shown below is when the test for equality of population variances for the data in example 1 in section 9.2 is performed by using Excel.

F-Test Two-Sample for Variances

                       Variable 1    Variable 2
Mean                   1140          1042
Variance               9593.6        15884
Observations           6             7
df                     5             6
F                      0.603979
P(F<=f)                0.701718
F Critical two-tail    0.143266

The value of the test statistic shown in the above table is $F = \dfrac{S_1^2}{S_2^2} = \dfrac{9593.6}{15884} = 0.603979$. The critical value (last entry under variable 1 in the above table) is $F_{5,6;0.025} = \dfrac{1}{F_{6,5;0.975}} = \dfrac{1}{6.98} = 0.143266$ and the p-value (second to last entry under variable 1 in the above table) is 0.701718. Since 0.701718 > 0.025, the null hypothesis cannot be rejected.
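Several entries in the two Excel tables follow directly from the means, variances and sample sizes. The sketch below recomputes them from those summary figures only (the raw data from section 9.2 are not reproduced here); the variable names are illustrative, and the p-values and critical values are not recomputed because they need t and F distribution functions that the standard library does not provide.

```python
import math

# Summary statistics copied from the Excel output.
n1, mean1, var1 = 6, 1140, 9593.6
n2, mean2, var2 = 7, 1042, 15884

# Pooled variance and degrees of freedom for the equal-variances t-test.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# t statistic for H0: mu1 - mu2 = 0.
t_stat = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# F statistic for equality of variances, with (n1 - 1, n2 - 1) degrees of freedom.
f_stat = var1 / var2

print(f"df = {df}, pooled variance = {pooled_var:.2f}")
print(f"t Stat = {t_stat:.6f}, F = {f_stat:.6f}")
```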

Chapter 10 – Linear Correlation and Regression

10.1 Bivariate data and scatter diagrams

Often two variables are measured simultaneously and relationships between these variables explored. Data sets involving two variables are known as bivariate data sets.

The first step in the exploration of bivariate data is to plot the variables on a graph. From such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can be formed about the nature of the relationship.

Examples

1) The number of copies sold (y) of a new book (measured in thousands of units) is dependent on the advertising budget (x) the publisher commits in a pre-publication campaign (measured in thousands of Rands). The values of x and y for 12 recently published books are shown below.

x      8     9.5    7.2    6.5   10     12     11.5   14.8   17.3   27     30     25
y     12.5  18.6   25.3   24.8   35.7   45.4   44.4   45.8   65.3   75.7   72.3   79.2

[Scatter diagram]

2) In a study of the relationship between the amount of daily rainfall (x) and the quantity of air pollution removed (y), the following data were collected.

Rainfall (centimeters)    Quantity removed (micrograms per cubic meter)
        4.3                          126
        4.5                          121
        5.9                          116
        5.6                          118
        6.1                          114
        5.2                          118
        3.8                          132
        2.1                          141
        7.5                          108

[Scatter diagram]

• In both cases the relationship can be fairly well described by means of a straight line i.e. both these relationships are linear relationships.

• In the first example an increase in y is proportional to an increase in x (positive linear relationship). In the second example a decrease in y is proportional to an increase in x (negative linear relationship).

• In both the examples changes in the values of y are affected by changes in the values of x (not the other way round). The variable x is known as the explanatory (independent) variable and the variable y the response (dependent) variable.

In this section only linear relationships between 2 variables will be explored. The issues to be explored are

1) Measuring the strength of the linear relationship between the 2 variables (the linear correlation problem).

2) Finding the equation of the straight line that will best describe the relationship between the 2 variables (the linear regression problem). Once this line is determined, it can be used to estimate a value of y for a given value of x (linear estimation).

10.2 Linear Correlation

The calculation of the coefficient of correlation (r) is based on the closeness of the plotted points (in the scatter diagram) to the line fitted through them. It can be shown that

$-1 \le r \le 1$.

If the plotted points are closely clustered around this line, r will lie close to either 1 or −1 (depending on whether the linear relationship is positive or negative). The further the plotted points are away from the line, the closer the value of r will be to 0. Consider the scatter diagrams that follow.

[Scatter diagram] Strong positive correlation (r close to 1)

[Scatter diagram] Strong negative correlation (r close to −1)

[Scatter diagram] No pattern (r close to 0)

For a sample of n pairs of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient of correlation can be calculated from the formula

$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$.

Example

Consider the data on the advertising budget (x) and the number of copies sold (y) considered earlier. For this data r can be calculated in the following way.

         x       y        xy          x²         y²
         8      12.5     100         64        156.25
         9.5    18.6     176.7       90.25     345.96
         7.2    25.3     182.16      51.84     640.09
         6.5    24.8     161.2       42.25     615.04
        10      35.7     357        100       1274.49
        12      45.4     544.8      144       2061.16
        11.5    44.4     510.6      132.25    1971.36
        14.8    45.8     677.84     219.04    2097.64
        17.3    65.3    1129.69     299.29    4264.09
        27      75.7    2043.9      729       5730.49
        30      72.3    2169        900       5227.29
        25      79.2    1980        625       6272.64
sum    178.8   545     10032.89    3396.92   30656.5

Substituting $n = 12$, $\sum x = 178.8$, $\sum y = 545$, $\sum xy = 10032.89$, $\sum x^2 = 3396.92$, $\sum y^2 = 30656.5$ into the equation for r gives

$r = \dfrac{12(10032.89) - (178.8)(545)}{\sqrt{\left[12(3396.92) - 178.8^2\right]\left[12(30656.5) - 545^2\right]}} = \dfrac{22948.68}{24961.03} = 0.9194$.

Comment: Strong positive correlation i.e. the increase in the number of copies sold is closely linked with an increase in advertising budget.

Coefficient of determination

The strength of the correlation between 2 variables is proportional to the square of the correlation coefficient ($r^2$). This quantity, called the coefficient of determination, is the proportion of variability in the y variable that is accounted for by its linear relationship with the x variable.

Example

In the above example on copies sold (y) and advertising budget (x), the coefficient of determination = $r^2 = 0.9194^2 = 0.8453$.

This means that 84.53% of the variability in copies sold is explained by its linear relationship with advertising budget.
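The correlation coefficient and the coefficient of determination for the book data can be verified with the sums-based formula. The following is a minimal sketch using only the standard library; the variable names are illustrative and the formula is the usual computational form of the sample correlation coefficient.

```python
import math

# Advertising budget (x, thousands of Rands) and copies sold (y, thousands).
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]
n = len(x)

# The five sums that appear in the last row of the worked table.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Coefficient of correlation from the sums.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Coefficient of determination.
r_squared = r ** 2

print(f"sums: {sum_x:.1f}, {sum_y:.1f}, {sum_xy:.2f}, {sum_x2:.2f}, {sum_y2:.1f}")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```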

10.3 Linear Regression

Finding the equation of the line that best fits the (x, y) points is based on the least squares principle. This principle can best be explained by considering the scatter diagram below.

The scatter diagram is a plot of the DBH (diameter at breast height measured in inches) versus the age (years) for 12 oak trees. The data are shown in the table below.

Age (x)    97     93     88    81    75     57    52     45    28    15     12    11
DBH (y)    12.5   12.5    8     9.5  16.5   11    10.5    9     6     1.5    1     1

According to the least squares principle, the line that "best" fits the plotted points is the one that minimizes the sum of the squares of the vertical deviations (see vertical lines in the graph) between the plotted y and estimated y (values on the line). For this reason the line fitted according to this principle is called the least squares line.

Calculation of least squares linear regression line

The equation for the line to be fitted to the (x, y) points is

$\hat{y} = a + bx$,

where $\hat{y}$ is the fitted y value (the y value on the line, which is different to the observed y value), a is the y-intercept and b the slope of the line.

It can be shown that the coefficients that define the least squares line can be calculated from

$b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$ and $a = \bar{y} - b\bar{x}$.

Example

For the above data on age (x) and DBH (y) the least squares line can be calculated as shown below.

         x      y       xy       x²
        97    12.5   1212.5    9409
        93    12.5   1162.5    8649
        88     8      704      7744
        81     9.5    769.5    6561
        75    16.5   1237.5    5625
        57    11      627      3249
        52    10.5    546      2704
        45     9      405      2025
        28     6      168       784
        15     1.5     22.5     225
        12     1       12       144
        11     1       11       121
sum    654    99     6877.5   47240

Substituting $n = 12$, $\sum x = 654$, $\sum y = 99$, $\sum xy = 6877.5$, $\sum x^2 = 47240$ into the above equation gives

$b = \dfrac{12(6877.5) - (654)(99)}{12(47240) - 654^2} = \dfrac{17784}{139164} = 0.12779$

and

$a = \dfrac{99}{12} - 0.12779 \times \dfrac{654}{12} = 1.285$.

Therefore the equation of the y on x least squares line that can be used to estimate values of y (DBH) based on x (age) is

$\hat{y} = 1.285 + 0.12779x$.

Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by substituting the value of x = 90 into the above equation. Then $\hat{y} = 1.285 + 0.12779 \times 90 = 12.786$.
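The least squares coefficients for the oak tree data can be checked with a few lines of code. This is a sketch using only the standard library; the function name `predict_dbh` is an illustrative choice, not part of the notes.

```python
# Age (years) and DBH (inches) for the 12 oak trees.
age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]
n = len(age)

# Sums from the last row of the worked table.
sum_x = sum(age)
sum_y = sum(dbh)
sum_xy = sum(x * y for x, y in zip(age, dbh))
sum_x2 = sum(x ** 2 for x in age)

# Slope and intercept of the least squares line y_hat = a + b * x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n


def predict_dbh(x):
    """Estimate DBH (inches) for a tree of age x (years) from the fitted line."""
    return a + b * x


print(f"a = {a:.3f}, b = {b:.5f}")
print(f"Estimated DBH at age 90: {predict_dbh(90):.3f}")
```

The estimate at age 90 differs from the hand calculation only in the last decimal place, because the hand calculation uses the rounded coefficients 1.285 and 0.12779.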

A word of caution

• The linear relationship between y and x is often only valid for values of x within a certain range e.g. when estimating the DBH using age as explanatory variable, it should be taken into account that at some age the tree will stop growing. Assuming a linear relationship between age and DBH for values beyond the age where the tree stops growing would be incorrect.

• Only relationships between variables that could be related in a practical sense are explored e.g. it would be pointless to explore the relationship between the number of vehicles in New York and the number of divorces in South Africa. Even if data collected on such variables might suggest a relationship, it cannot be of any practical value.

• If variables are not linearly related, it does not mean that they are not related. There are many situations where the relationships between variables are non-linear.

Example

A plot of the banana consumption (y) versus the price (x) is shown in the graph on the following page. A straight line will not describe this relationship very well, but the non-linear curve shown below will describe it well.

[Graph: banana consumption (y) versus price (x), with non-linear curve]

10.4 Computer output

Consider the data on age (x variable) and DBH (y variable). The output when performing a straight line regression on this data on Excel is shown below.

SUMMARY OUTPUT

Regression Statistics
R Square      0.689307572

ANOVA
              df    SS             MS          F          Significance F
Regression     1    189.3872553    189.3873    22.1862    0.000828626
Residual      10     85.36274468     8.536274
Total         11    274.75

              Coefficients    Standard Error    t Stat      P-value
Intercept     1.285353971     1.702259153       0.755087    0.46761
X Variable    0.12779167      0.027130722       4.71022     0.00083

1 The coefficient of determination in the above table is R square = 0.689307572.

2 The ANOVA (Analysis of Variance) table is constructed to test whether there is a significant linear relationship between X and Y. The p-value for this test is the entry under the Significance F heading in the ANOVA table. Since this p-value < 0.05 (or 0.01), the hypothesis of "no linear relationship between X and Y" can be rejected and it can be concluded that there is a significant linear relationship between X and Y.

3 The third of the tables in the summary output shows the intercept and slope values of the line. These are the first two entries under Coefficients. The remaining columns to the right of the Coefficients column concern the performance of tests for zero intercept and slope. From the intercept and slope p-values (0.46761 and 0.00083 respectively) it can be seen that the intercept is not significantly different from zero at the 5% level of significance (0.46761 > 0.05) but that the slope is significantly different from zero at the 5% or 1% levels of significance (0.00083 < 0.01 < 0.05).

When the correlation coefficient is calculated for the above mentioned data by using Excel, the output is as shown below.

            Column 1    Column 2
Column 1    1
Column 2    0.83025     1

The above table shows that the correlation between x and y is 0.83025.
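Most of the figures in the Excel regression output can be rebuilt from the raw data, which helps to show how the three tables relate to one another. The sketch below uses only the standard library and illustrative variable names; the p-values are omitted because they require F and t distribution functions.

```python
import math

# Age (years) and DBH (inches) for the 12 oak trees.
age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]
n = len(age)

x_bar = sum(age) / n
y_bar = sum(dbh) / n

# Corrected sums of squares and cross products.
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, dbh))

# Least squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# ANOVA decomposition: total = regression + residual.
ss_total = sum((y - y_bar) ** 2 for y in dbh)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(age, dbh))
ss_regr = ss_total - ss_resid

ms_resid = ss_resid / (n - 2)     # residual mean square, n - 2 = 10 df
f_stat = ss_regr / ms_resid       # regression has 1 df, so MS = SS
r_square = ss_regr / ss_total

# Standard error and t statistic for the slope.
se_slope = math.sqrt(ms_resid / s_xx)
t_slope = b / se_slope

# Correlation coefficient between age and DBH.
r = s_xy / math.sqrt(s_xx * ss_total)

print(f"SS: regression {ss_regr:.4f}, residual {ss_resid:.4f}, total {ss_total:.2f}")
print(f"F = {f_stat:.4f}, R Square = {r_square:.6f}")
print(f"slope t Stat = {t_slope:.5f}, r = {r:.5f}")
```

With a single explanatory variable the slope t statistic squared equals F, and the correlation coefficient squared equals R Square, which is why the slope p-value and Significance F agree.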
