The Fisher, Neyman-Pearson Theories of Testing Hypotheses


The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?

Author(s): E. L. Lehmann
Source: Journal of the American Statistical Association, Vol. 88, No. 424 (Dec., 1993), pp. 1242-1249
Published by: American Statistical Association

Stable URL: http://www.jstor.org/stable/2291263 .




Lehmann: Theories of Testing Hypotheses 1243

Fisher with a series of papers culminating in his book Statistical Methods for Research Workers (1925), in which he created a new paradigm for hypothesis testing. He greatly extended the applicability of the t test (to the two-sample problem and the testing of regression coefficients) and generalized it to the testing of hypotheses in the analysis of variance. He advocated 5% as the standard level (with 1% as a more stringent alternative); through applying this new methodology to a variety of practical examples, he established it as a highly popular statistical approach for many fields of science.

A question that Fisher did not raise was the origin of his test statistics: Why these rather than some others? This is the question that Neyman and Pearson considered and which (after some preliminary work in Neyman and Pearson 1928) they later answered (Neyman and Pearson 1933a). Their solution involved not only the hypothesis but also a class of possible alternatives and the probabilities of two kinds of error: false rejection (Error I) and false acceptance (Error II). The "best" test was one that minimized $P_A$(Error II) subject to a bound on $P_H$(Error I), the latter being the significance level of the test. They completely solved this problem for the case of testing a simple (i.e., single distribution) hypothesis against a simple alternative by means of the Neyman-Pearson lemma. For more complex situations, the theory required additional concepts, and working out the details of this program was an important concern of mathematical statistics in the following decades.
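The simple-versus-simple case can be made concrete for normal data. The following Python sketch is an illustration only and rests on assumptions that are not in the text: n independent normal observations with known standard deviation, the hypothesis mu = mu0, and a single alternative mu = mu1 > mu0. The function name and its arguments are hypothetical. Under these assumptions the likelihood ratio is an increasing function of the sample mean, so the most powerful test rejects when the sample mean is large.

```python
from statistics import NormalDist


def most_powerful_z_test(alpha, mu0, mu1, sigma, n):
    """Most powerful level-alpha test of mu = mu0 against mu = mu1 > mu0.

    Returns the cutoff for the sample mean together with the two error
    probabilities (false rejection, false acceptance).
    """
    se = sigma / n ** 0.5                      # standard deviation of the sample mean
    # The likelihood ratio is increasing in the sample mean, so the region
    # "likelihood ratio >= K" is the same as "sample mean >= cutoff".
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    error_one = 1 - NormalDist(mu0, se).cdf(cutoff)   # P_H(Error I)
    error_two = NormalDist(mu1, se).cdf(cutoff)       # P_A(Error II)
    return cutoff, error_one, error_two


cutoff, error_one, error_two = most_powerful_z_test(0.05, 0.0, 1.0, 1.0, 9)
```

With these illustrative inputs the bound on Error I is attained exactly, and the lemma guarantees that no other test with the same bound has a smaller Error II.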

The Neyman-Pearson introduction to the two kinds of error contained a brief statement that was to become the focus of much later debate. "Without hoping to know whether each separate hypothesis is true or false", the authors wrote, "we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong." And in this and the following paragraph they refer to a test (i.e., a rule to reject or accept the hypothesis) as "a rule of behavior".

3. INDUCTIVE INFERENCE VERSUS INDUCTIVE BEHAVIOR

Fisher considered statistics, the science of uncertain inference, able to provide a key to the long-debated problem of induction. He started one paper (Fisher 1932, p. 257) with the statement "Logicians have long distinguished two modes of human reasoning, under the respective names of deductive and inductive reasoning. . . . In inductive reasoning we attempt to argue from the particular, which is typically a body of observational material, to the general, which is typically a theory applicable to future experience." He developed his ideas in more detail in a later paper (Fisher 1935a, p. 39):

. . . everyone who does habitually attempt the difficult task of making sense of figures is, in fact, essaying a logical process of the kind we call inductive, in that he is attempting to draw inferences from the particular to the general. Such inferences we recognize to be uncertain inferences. . . .

He continued in the next paragraph:

Although some uncertain inferences can be rigorously expressed in terms of mathematical probability, it does not follow that mathematical probability is an adequate concept for the rigorous expression of uncertain inferences of every kind. . . . The inferences of the classical theory of probability are all deductive in character. They are statements about the behaviour of individuals, or samples, or sequences of samples, drawn from populations which are fully known. . . . More generally, however, a mathematical quantity of a different kind, which I have termed mathematical likelihood, appears to take its place [i.e., the place of probability] as a measure of rational belief when we are reasoning from the sample to the population.

Neyman did not believe in the need for a special inductive logic but felt that the usual processes of deductive thinking should suffice. More specifically, he had no use for Fisher's idea of likelihood. In his discussion of Fisher's 1935 paper (Neyman, 1935, p. 74, 75) he expressed the thought that it should be possible "to construct a theory of mathematical statistics . . . based solely upon the theory of probability," and went on to suggest that the basis for such a theory can be provided by "the conception of frequency of errors in judgment." This was the approach that he and Pearson had earlier described as "inductive behavior"; in the case of hypothesis testing, the behavior consisted of either rejecting the hypothesis or (provisionally) accepting it.

Both Neyman and Fisher considered the distinction between "inductive behavior" and "inductive inference" to lie at the center of their disagreement. In fact, in writing retrospectively about the dispute, Neyman (1961, p. 142) said that "the subject of the dispute may be symbolized by the opposing terms 'inductive reasoning' and 'inductive behavior.'" How strongly Fisher felt about this distinction is indicated by his statement in Fisher (1973, p. 7) that "there is something horrifying in the ideological movement represented by the doctrine that reasoning, properly speaking, cannot be applied to empirical data to lead to inferences valid in the real world."

4. FIXED LEVELS VERSUS p VALUES

A distinction frequently made between the approaches of Fisher and Neyman-Pearson is that in the latter the test is carried out at a fixed level, whereas the principal outcome of the former is the statement of a p value that may or may not be followed by a pronouncement concerning significance of the result.

The history of this distinction is curious. Throughout the 19th century, testing was carried out rather informally. It was roughly equivalent to calculating an (approximate) p value and rejecting the hypothesis if this value appeared to be sufficiently small. These early approximate methods required only a table of the normal distribution. With the advent of exact small-sample tests, tables of $\chi^2$, $t$, $F$, . . . were also required. Fisher, in his 1925 book and later, greatly reduced the needed tabulations by providing tables not of the distributions themselves but of selected quantiles. (For an explanation of this very influential decision by Fisher see Kendall [1963]. On the other hand Cowles and Davis [1982] argue that conventional levels of three probable errors or two standard deviations, both roughly equivalent [in the normal case] to 5%, were already in place before Fisher.) These tables allow the calculation only of ranges for the p values; however, they are exactly suited for determining the







1244 Journal of the American Statistical Association, December 1993

critical values at which the statistic under consideration becomes significant at a given level. As Fisher wrote in explaining the use of his $\chi^2$ table (1946, p. 80):

In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed $\chi^2$, but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of $\chi^2$ indicate a real discrepancy.
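The contrast between a table of selected quantiles and an exact p value can be sketched numerically. The Python fragment below is an illustration under assumptions not in the text: it uses the normal distribution rather than $\chi^2$ (so that only the standard library is needed), a two-sided test, and an arbitrary set of tabulated levels; the names LEVELS, TABLE, bracket_p, and exact_p are hypothetical. It also checks the arithmetic behind the remark that three probable errors and two standard deviations both correspond roughly to 5% in the normal case.

```python
from statistics import NormalDist

Z = NormalDist()
LEVELS = (0.10, 0.05, 0.02, 0.01)                    # illustrative tabulated levels
TABLE = {a: Z.inv_cdf(1 - a / 2) for a in LEVELS}    # two-sided critical values


def bracket_p(z_obs):
    """Bounds (lower, upper) on the two-sided p value obtainable from TABLE alone."""
    reached = [a for a, crit in TABLE.items() if abs(z_obs) >= crit]
    missed = [a for a, crit in TABLE.items() if abs(z_obs) < crit]
    upper = min(reached) if reached else 1.0         # significant at these levels
    lower = max(missed) if missed else 0.0           # not significant at these
    return lower, upper


def exact_p(z_obs):
    """Exact two-sided p value, which needs the whole distribution function."""
    return 2 * (1 - Z.cdf(abs(z_obs)))


probable_error = Z.inv_cdf(0.75)                     # about 0.6745 standard deviations
tail_three_pe = exact_p(3 * probable_error)          # two-sided tail beyond 3 probable errors
tail_two_sd = exact_p(2.0)                           # two-sided tail beyond 2 standard deviations
```

For an observed value such as 2.1, the table only places the p value between two tabulated levels, while the distribution function gives it exactly.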

Similarly, he also wrote (1935, p. 13) that "it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard . . ."

Fisher's views and those of some of his contemporaries are discussed in more detail by Hall and Selinger (1986).

Neyman and Pearson followed Fisher's adoption of a fixed level. In fact, Pearson (1962, p. 395) acknowledged that they were influenced by "[Fisher's] tables of 5 and 1% significance levels which lent themselves to the idea of choice, in advance of experiment, of the risk of the 'first kind of error' which the experimenter was prepared to take." He was even more outspoken in a letter to Neyman of April 28, 1978 (unpublished; in the Neyman collection of the Bancroft Library, University of California, Berkeley): "If there had not been these % tables available when you and I started work on testing statistical hypotheses in 1926, or when you were starting to talk on confidence intervals, say in 1928, how much more difficult it would have been for us. The concept of the control of 1st kind of error would not have come so readily nor your idea of following a rule of behaviour. Anyway, you and I must be grateful for those two tables in the 1925 Statistical Methods for Research Workers." (For an idea of what the Neyman-Pearson theory might have looked like had it been based on p values instead of fixed levels, see Schweder 1988.)

It is interesting to note that unlike Fisher, Neyman and Pearson (1933a, p. 296) did not recommend a standard level but suggested that "how the balance [between the two kinds of error] should be struck must be left to the investigator," and (1933b, p. 497) "we attempt to adjust the balance between the risks $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$ to meet the type of problem before us."

It is thus surprising that in SMSI Fisher (1973, p. 44-45) criticized the NP use of a fixed conventional level. He objected that

the attempts that have been made to explain the cogency of tests of significance in scientific research, by reference to supposed frequencies of possible statements, based on them, being right or wrong, thus seem to miss the essential nature of such tests. A man who 'rejects' a hypothesis provisionally, as a matter of habitual practice, when the significance is 1% or higher, will certainly be mistaken in not more than 1% of such decisions. . . . However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.

The difference between the reporting of a p value or that of a statement of acceptance or rejection of the hypothesis was linked by Fisher in Fisher (1973, pp. 79-80), to the distinction between drawing conclusions or making decisions.

The conclusions drawn from such tests constitute the steps by which the research worker gains a better understanding of his experimental material, and of the problems which it presents. . . . More recently, indeed, a considerable body of doctrine has attempted to explain, or rather to reinterpret, these tests on the basis of quite a different model, namely as means to making decisions in an acceptance procedure.

Responding to earlier versions of these and related objections by Fisher to the Neyman-Pearson formulation, Pearson (1955, p. 206) admitted that the terms "acceptance" and "rejection" were perhaps unfortunately chosen, but of his joint work with Neyman he said that "from the start we shared Professor Fisher's view that in scientific enquiry, a statistical test is 'a means of learning'" and "I would agree that some of our wording may have been chosen inadequately, but I do not think that our position in some respects was or is so very different from that which Professor Fisher himself has now reached."

The distinctions under discussion are of course related to the argument about "inductive inference" vs. "inductive behavior," but in this debate Pearson refused to participate. He concludes his response to Fisher's 1955 attack with: "Professor Fisher's final criticism concerns the use of the term 'inductive behavior'; this is Professor Neyman's field rather than mine."

5. POWER

As was mentioned in Section 2, a central consideration of the Neyman-Pearson theory is that one must specify not only the hypothesis H but also the alternatives against which it is to be tested. In terms of the alternatives, one can then define the type II error (false acceptance) and the power of the test (the rejection probability as a function of the alternative). This idea is now fairly generally accepted for its importance in assessing the chance of detecting an effect (i.e., a departure from H) when it exists, determining the sample size required to raise this chance to an acceptable level, and providing a criterion on which to base the choice of an appropriate test.
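The first two uses of power can be illustrated with a short calculation. The Python sketch below assumes a setting that is not in the text: a one-sided test of a normal mean with known standard deviation, carried out at level alpha, where delta denotes the size of the departure from H. The function names and default values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()


def power(delta, n, alpha=0.05, sigma=1.0):
    """Rejection probability of the one-sided z test when the mean exceeds H by delta."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - delta * n ** 0.5 / sigma)


def sample_size(delta, target, alpha=0.05, sigma=1.0):
    """Smallest n for which the power against delta reaches the target."""
    root_n = (Z.inv_cdf(1 - alpha) + Z.inv_cdf(target)) * sigma / delta
    return ceil(root_n ** 2)


n_needed = sample_size(0.5, 0.80)      # observations needed for 80% power against delta = 0.5
achieved = power(0.5, n_needed)        # power actually attained with that many observations
```

At delta = 0 the function returns the level alpha itself, which is the sense in which power is the rejection probability viewed as a function of the alternative.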

Fisher never wavered in his strong opposition to these ideas. Following are some of his objections:

1. A type II error consists in falsely accepting H, and Fisher (1935b) emphasized that there is no reason for "believing that a hypothesis has been proved to be true merely because it is not contradicted by the available facts." This is of course correct, but it does not diminish the usefulness of power calculations.

2. A second point Fisher raised is, in modern terminology, that the power cannot be calculated because it depends on the unknown alternative. For example (Fisher 1955, p. 73), he wrote:

The frequency of the 1st class [type I error] . . . is calculable and therefore controllable simply from the specification of the null hypothesis. The frequency of the 2nd kind must depend . . . greatly on how closely they [rival hypotheses] resemble the null hypothesis. Such errors are therefore incalculable . . . merely from the specification of the null hypothesis, and would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures.

(He discussed the same point in Fisher 1947, p. 16-17.)

Fisher was of course aware of the importance of power, as is clear from the following remarks (1947, p. 24): "With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself." The section in which this statement appears is tellingly entitled "Qualitative Methods of Increasing Sensitiveness." Fisher accepted the importance of the concept but denied the possibility of assessing it quantitatively.

Later in the same book Fisher made a very similar distinction regarding the choice of test. Under the heading "Multiplicity of Tests of the Same Hypothesis," he devoted a section (sec. 61) to this topic. Here again, without using the term, he referred to alternatives when he wrote (Fisher 1947, p. 182) that "we may now observe that the same data may contradict the hypothesis in any of a number of different ways." After illustrating how different tests would be appropriate for different alternatives, he continued (p. 185):

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation but has been the occasion of much theoretical discussion among statisticians. The reason for this diversity of view-point is perhaps that the experimenter is thinking in terms of observational values, and is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs. He is, therefore, not usually concerned with the question: To what observational feature should a test of significance be applied?

The idea that there is no need for a theory of test choice, because an experienced experimenter knows what is the appropriate test, is expressed more strongly in a letter to W. E. Hick of October 1951 (Bennett 1990, p. 144), who, in asking about "one-tail" vs. "two-tail" in $\chi^2$, had referred to his lack of knowledge concerning "the theory of critical regions, power, etc.":

I am a little sorry that you have been worrying yourself at all with that unnecessarily portentous approach to tests of significance represented by the Neyman and Pearson critical regions, etc. In fact, I and my pupils throughout the world would never think of using them. If I am asked to give an explicit reason for this I should say that they approach the problem entirely from the wrong end, i.e., not from the point of view of a research worker, with a basis of well grounded knowledge on which a very fluctuating population of conjectures and incoherent observations is continually under examination. In these circumstances the experimenter does know what observation it is that attracts his attention. What he needs is a confident answer to the question "ought I to take any notice of that?" This question can, of course, and for refinement of thought should, be framed as "Is this particular hypothesis overthrown, and if so at what level of significance, by this particular body of observations?" It can be put in this form unequivocally only because the genuine experimenter already has the answers to all the questions that the followers of Neyman and Pearson attempt, I think vainly, to answer by merely mathematical consideration.

6. CONDITIONAL INFERENCE

While Fisher's approach to testing included no detailed consideration of power, the Neyman-Pearson approach failed to pay attention to an important concern raised by Fisher. To discuss this issue, we must begin by considering briefly the different meanings that Fisher and Neyman attach to probability.

For Neyman, the idea of probability is fairly straightforward: It represents an idealization of long-run frequency in a long sequence of repetitions under constant conditions (see, for example, Neyman 1952, p. 27; 1957, p. 9). Later (Neyman 1977), he pointed out that by the law of large numbers, this idea permits an extension: If a sequence of independent events is observed, each with probability $p$ of success, then the long-run success frequency will be approximately $p$ even if the events are not identical. This property adds greatly to the appeal and applicability of a frequentist probability. In particular, it is the way in which Neyman came to interpret the value of a significance level.
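The extension to non-identical events can be checked by simulation. The Python sketch below is illustrative only: the three kinds of event, the value p = 0.05, the number of trials, and the function name are all assumptions introduced here. Each event arises from a different random mechanism, yet each has probability exactly p, as with a sequence of unrelated tests all carried out at the same level.

```python
import random
from math import log
from statistics import NormalDist


def long_run_frequency(p=0.05, trials=200_000, seed=0):
    """Success frequency over independent, non-identical events that each have probability p."""
    rng = random.Random(seed)
    z_cut = NormalDist().inv_cdf(1 - p / 2)   # P(|Z| >= z_cut) = p for a standard normal Z
    e_cut = -log(p)                           # P(E >= e_cut) = p for a unit exponential E
    successes = 0
    for i in range(trials):
        kind = i % 3                          # cycle through three different mechanisms
        if kind == 0:
            hit = rng.random() < p
        elif kind == 1:
            hit = abs(rng.gauss(0.0, 1.0)) >= z_cut
        else:
            hit = rng.expovariate(1.0) >= e_cut
        successes += hit
    return successes / trials


frequency = long_run_frequency()
```

The returned frequency settles near p although no two consecutive events share the same mechanism.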

On the other hand, the meaning of probability is a problem with which Fisher grappled throughout his life. Not surprisingly, his views too underwent some changes. The concept at which he eventually arrived is much broader than Neyman's: "In a statement of probability, the predicand, which may be conceived as an object, as an event, or as a proposition, is asserted to be one of a set of a number, however large, of like entities of which a known proportion, $P$, have some relevant characteristic, not possessed by the remainder. It is further asserted that no subset of the entire set, having a different proportion, can be recognized" (Fisher 1973, p. 113). It is this last requirement, Fisher's version of the "requirement of total evidence" (Carnap 1962, sec. 45), which is particularly important to the present discussion.

Example 1 (Cox 1958). Suppose that we are concerned with the probability $P(X \le x)$, where $X$ is normally distributed as $N(\mu, 1)$ or $N(\mu, 4)$, depending on whether the spin of a fair coin results in heads (H) or tails (T). Here the set of cases in which the coin falls heads is a recognizable subset; therefore, Fisher would not admit the statement

$$P(X \le x) = \tfrac{1}{2}\,\Phi(x - \mu) + \tfrac{1}{2}\,\Phi\!\left(\frac{x - \mu}{2}\right) \tag{1}$$

as legitimate. Instead, he would have required $P(X \le x)$ to be evaluated conditionally as

$$P(X \le x \mid H) = \Phi(x - \mu) \quad\text{or}\quad P(X \le x \mid T) = \Phi\!\left(\frac{x - \mu}{2}\right) \tag{2}$$

depending on the outcome of the spin.

On the other hand, Neyman would have taken (1) to provide the natural assessment of $P(X \le x)$. Despite this preference, there is nothing in the Neyman-Pearson (frequentist) approach to prevent consideration of the conditional probabilities (2). The critical issue from a frequentist viewpoint is what to consider as the relevant replications of the experiment: a sequence of observations from the same normal distribution or a sequence of coin tosses, each followed by an observation from the appropriate normal distribution.
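The two assessments in Example 1 are easy to compare numerically. The following Python sketch evaluates the unconditional statement (1) and the conditional statements (2); the function names and the illustrative values x = 1 and mu = 0 are assumptions introduced here.

```python
from statistics import NormalDist

PHI = NormalDist().cdf      # standard normal distribution function


def unconditional(x, mu):
    """Statement (1): average over the two outcomes of the fair coin."""
    return 0.5 * PHI(x - mu) + 0.5 * PHI((x - mu) / 2)


def conditional(x, mu, coin):
    """Statement (2): probability given the observed outcome of the spin."""
    if coin == "H":
        return PHI(x - mu)          # variance 1 after heads
    if coin == "T":
        return PHI((x - mu) / 2)    # variance 4 after tails
    raise ValueError("coin must be 'H' or 'T'")


after_heads = conditional(1.0, 0.0, "H")
after_tails = conditional(1.0, 0.0, "T")
averaged = unconditional(1.0, 0.0)
```

The averaged value lies between the two conditional values and equals neither, which is the numerical form of the disagreement about which replications are relevant.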

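Consider now the problem of testing $H: \mu = 0$ against the simple alternative $\mu = 1$ on the basis of a sample $X_1, \ldots, X_n$ from the distribution (1). The Neyman-Pearson lemma would tell us to reject H when

$$\frac{\tfrac{1}{2}(2\pi)^{-n/2}\, e^{-\sum (x_i - 1)^2/2} + \tfrac{1}{2}(8\pi)^{-n/2}\, e^{-\sum (x_i - 1)^2/8}}{\tfrac{1}{2}(2\pi)^{-n/2}\, e^{-\sum x_i^2/2} + \tfrac{1}{2}(8\pi)^{-n/2}\, e^{-\sum x_i^2/8}} \ge K, \tag{3}$$

where $K$ is determined so that the probability of (3) when $\mu = 0$ is equal to the specified level $\alpha$.

On the other hand, a Fisherian approach would adjust the test to whether the coin falls H or T and would use the rejection region

$$(2\pi)^{-n/2}\, e^{-\sum (x_i - 1)^2/2} \ge K_1\, (2\pi)^{-n/2}\, e^{-\sum x_i^2/2} \quad\text{when the coin falls H} \tag{4}$$

and

$$(8\pi)^{-n/2}\, e^{-\sum (x_i - 1)^2/8} \ge K_2\, (8\pi)^{-n/2}\, e^{-\sum x_i^2/8} \quad\text{when the coin falls T,} \tag{5}$$

where $K_1$ and $K_2$ are determined so that the null probability of both (4) and (5) is equal to $\alpha$.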

It is easily seen that these two tests are not equivalent. Which one should we prefer? Test (3) has the advantage of being more powerful in the sense that when the full experiment of spinning a coin and then taking observations on $X$ is repeated many times, and when $\mu = 1$, this test will reject the hypothesis more frequently. The second test has the advantage that its conditional level given the outcome of the spin is $\alpha$ both when the outcome is H and when it is T. [The conditional level of the first test will be $< \alpha$ for one of the two outcomes and $> \alpha$ for the other.]
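The conditional test can be worked out explicitly. In each of the regions (4) and (5) the likelihood ratio is an increasing function of the sample mean, so each region has the form "sample mean at least some cutoff". The Python sketch below computes these cutoffs, the conditional levels, and the power at $\mu = 1$; the function name, the choice n = 4, and the level 0.05 are assumptions made for illustration. It does not evaluate test (3), whose constant $K$ has no comparably simple closed form.

```python
from statistics import NormalDist


def conditional_test(n, alpha=0.05):
    """Cutoff, conditional level, and power at mu = 1 for regions (4) and (5)."""
    z = NormalDist().inv_cdf(1 - alpha)
    result = {}
    for coin, sigma in (("H", 1.0), ("T", 2.0)):
        se = sigma / n ** 0.5                         # sd of the sample mean given the coin
        cutoff = z * se                               # reject when the sample mean >= cutoff
        level = 1 - NormalDist(0.0, se).cdf(cutoff)   # conditional level under mu = 0
        power = 1 - NormalDist(1.0, se).cdf(cutoff)   # conditional power under mu = 1
        result[coin] = (cutoff, level, power)
    # A fair coin gives each outcome probability 1/2 in the full experiment.
    overall_power = 0.5 * (result["H"][2] + result["T"][2])
    return result, overall_power


by_coin, overall_power = conditional_test(n=4)
```

Both conditional levels equal the chosen level, while the conditional power is markedly higher after heads, when the observations are less variable.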

Which of these considerations is more important depends on the circumstances. Echoing Fisher, we might say that we prefer the first test in an acceptance sampling situation where interest focuses not on the individual cases but on the long-run frequency of errors, but that we would prefer the second test in a scientific situation where long-run considerations are irrelevant and only the circumstances at hand (i.e., H or T) matter. As Fisher put it (1973, p. 101-102), referring to a different but similar situation: "It is then obvious at the time that the judgment of significance has been decided not by the evidence of the sample, but by the throw of a coin. It is not obvious how the research worker is to be made to forget this circumstance, and it is certain that he ought not to forget it, if he is concerned to assess the weight only of objective observational facts against the hypothesis in question."

The present example is of course artificial, but the same issue arises whenever there exists an ancillary statistic (see, for example, Cox and Hinkley 1974; Lehmann 1986), and it seems to lie at the heart of the cases in which the two theories disagree on specific tests. The two most prominent of these cases are discussed in the next section.

7. TWO EXAMPLES

For many problems, a pure Fisherian or Neyman-Pearsonian approach will lead to the same test. Suppose in particular that the observations $X$ follow a distribution from an exponential family with density

$$p_{\theta,\vartheta}(x) = C(\theta, \vartheta)\, \exp\Bigl[\theta\, U(x) + \sum_{i=1}^{k} \vartheta_i T_i(x)\Bigr] \tag{6}$$

and consider testing the hypothesis

$$H: \theta = \theta_0 \tag{7}$$

against the one-sided alternatives $\theta > \theta_0$. Then Fisher would condition on $T = (T_1, \ldots, T_k)$ and would in the conditional model consider it natural to calculate the p value as the conditional probability of $U \ge u$, where $u$ is the observed value of $U$. At a given level $\alpha$, the result would be declared significant if $U \ge C(t)$, where $C(t)$ is determined by

$$P[U \ge C(t) \mid T = t] = \alpha. \tag{8}$$

A Neyman-Pearson viewpoint would lead to the same test as being uniformly most powerful among all similar tests. But as we have seen in Example 1, the two approaches do not always lead to the same result. We next consider the two examples that have engendered the most controversy.

Example 2: The 2 × 2 table with one fixed margin. Let $X$, $Y$ be two independent binomial variables with success probabilities $p_1$ and $p_2$ and corresponding to $m$ and $n$ trials. The problem of testing $H: p_2 = p_1$ against the alternatives $p_2 > p_1$ is of the form given by (6) and (7) with $\theta = \log[(p_2/q_2)/(p_1/q_1)]$, $T = X + Y$ and $U = Y$. Basically, there is therefore no conflict between the two approaches. However, because of the discreteness of the conditional distribution of $U$ given $t$, condition (8) typically cannot be satisfied. Fisher's exact test then chooses $C(t)$ to be the largest constant for which

$$P[U \ge C(t) \mid T = t] \le \alpha. \tag{9}$$

For small values of $t$, this may lead to conditional levels substantially less than $\alpha$; for small $m$ and $n$, the same may be true for the unconditional level. For this reason, Fisher's exact test has been criticized as being too conservative. Many alternatives have been proposed for which the unconditional level (which is a function of $p_1 = p_2$) is much closer to $\alpha$. Upton (1982) lists 22; for other surveys, see Yates (1984) and Agresti (1992).
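The conservatism of the exact test is easy to exhibit for small tables. In the Python sketch below the conditional distribution of $U = Y$ given $T = t$ under $H$ is hypergeometric, and the cutoff is taken to be the one whose conditional rejection probability is as large as possible without exceeding $\alpha$. The function names and the choices m = n = 5 and $\alpha$ = 0.05 are assumptions made for illustration.

```python
from math import comb


def conditional_tail(c, t, m, n):
    """P[U >= c | T = t] under H, where U = Y and T = X + Y (hypergeometric)."""
    lo, hi = max(0, t - m), min(n, t)          # possible values of Y given T = t
    favourable = sum(comb(n, y) * comb(m, t - y) for y in range(max(c, lo), hi + 1))
    return favourable / comb(m + n, t)


def exact_cutoff(t, m, n, alpha):
    """Cutoff with the largest conditional tail probability not exceeding alpha."""
    lo, hi = max(0, t - m), min(n, t)
    for c in range(lo, hi + 2):                # c = hi + 1 means "never reject"
        if conditional_tail(c, t, m, n) <= alpha:
            return c


def unconditional_level(p, m, n, alpha):
    """Rejection probability of the exact test when p1 = p2 = p."""
    level = 0.0
    for x in range(m + 1):
        for y in range(n + 1):
            if y >= exact_cutoff(x + y, m, n, alpha):
                prob_x = comb(m, x) * p ** x * (1 - p) ** (m - x)
                prob_y = comb(n, y) * p ** y * (1 - p) ** (n - y)
                level += prob_x * prob_y
    return level


conditional_levels = [conditional_tail(exact_cutoff(t, 5, 5, 0.05), t, 5, 5) for t in range(11)]
level_at_half = unconditional_level(0.5, 5, 5, 0.05)
```

For these small margins every conditional level falls well short of the nominal value, and so does the unconditional level, which is their average over the distribution of $T$.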

The issues are similar to those encountered in Example 1. If conditioning is considered appropriate (and in the present case it typically is), and if control of type I error at level $\alpha$ is considered essential, then the only sensible test available is of the form $U \ge C(t)$, where $C(t)$ is the largest value satisfying (9). If, on the other hand, only the unconditional performance is considered relevant, then we may allow the conditional level of the region $U \ge C(t)$ to exceed $\alpha$ for some values of $t$ in such a way that the unconditional level (which is the expected value of the conditional level) gets closer to








stated "I regard the frequency requirement of 'repeated sampling' as including conditional inferences." A common basis for the discussion of various conditioning concepts, such as ancillaries and relevant subsets, thus exists. The proper choice of framework is a problem needing further study.

We conclude by considering some more detailed issues and by reviewing Examples 2 and 3 from the present point of view.

1. Both Neyman-Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice. Neyman-Pearson, in their original formulation of 1933, recommended a balance between the two kinds of error (i.e., between level and power). For a discussion of how to achieve such a balance, see, for example, Sanathanan (1974). Both level and power should of course be considered conditionally whenever conditioning is deemed appropriate. Unfortunately, this is not possible at the planning stage.

2. A second point on which there appears to be no conflict between the two approaches is "truth in advertising." Even if a particular nominal level $\alpha$, say 5%, is the target, when it cannot be achieved because of discreteness the test should not just be described as conservative or liberal relative to the nominal level; instead, the actual (conditional or unconditional) level should be stated. If this level is not known because it depends on unknown parameters, at least its range should be given and, if possible, also an estimated value.

3. In both the 2 × 2 example and the Behrens-Fisher problems, the conflict between the solutions proposed by the two schools is often discussed as that of a desire for a similar test (i.e., one for which the unconditional level is equal to $\alpha$) versus a suitable conditional test. The issue becomes clearer if one asks for the reason that Neyman-Pearson proposed the condition of similarity. The explanation begins with the case of a simple hypothesis where these authors take it for granted that in order to maximize the power, one would want the attained level to be equal to rather than less than $\alpha$. For a composite hypothesis H, they therefore stated that the level should equal $\alpha$ for each of the simple hypotheses making up H. The requirement for similarity thus has its origin in the desire to maximize power, the issue discussed in Section 5.

In the light of (1) and (2), a unified theory less concerned with standard nominal levels might jettison not only the demand for similarity but also that of conservatism relative to a nominal level.

When similarity cannot be achieved and conservatism is not required, various compromise solutions may be available. Thus in the 2 × 2 case of Example 2, one could, for example, select for each $t$ the conditional level closest to $\alpha$. If this seems too permissive, then the rule could be modified by adding a cap on the conditional level beyond which one would not go. Tests with variable conditional level that will sometimes be $< \alpha$ and sometimes $> \alpha$ have been discussed by Barnard (1989) under the name "flexible Fisher." Alternatively, one might give up on a nominal level altogether and instead for each $t$ adjust the level to the attainable (conditional) power.
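The first compromise, with and without a cap, can be sketched as follows. This Python fragment is a hypothetical illustration of the rule as described in the preceding paragraph and is not a reproduction of Barnard's procedure; the function names, the tie-breaking rule (the smaller cutoff wins), and the numerical choices are assumptions.

```python
from math import comb


def conditional_tail(c, t, m, n):
    """P[U >= c | T = t] under H for the 2 x 2 table (hypergeometric)."""
    lo, hi = max(0, t - m), min(n, t)
    favourable = sum(comb(n, y) * comb(m, t - y) for y in range(max(c, lo), hi + 1))
    return favourable / comb(m + n, t)


def closest_level_cutoff(t, m, n, alpha, cap=None):
    """Cutoff whose conditional level is closest to alpha, optionally capped."""
    lo, hi = max(0, t - m), min(n, t)
    candidates = range(lo, hi + 2)             # hi + 1 means "never reject"
    allowed = [c for c in candidates
               if cap is None or conditional_tail(c, t, m, n) <= cap]
    # min() keeps the first of several equally close cutoffs, i.e. the smaller one.
    return min(allowed, key=lambda c: abs(conditional_tail(c, t, m, n) - alpha))


uncapped = closest_level_cutoff(7, 5, 5, 0.05)            # conditional level may exceed alpha
capped = closest_level_cutoff(7, 5, 5, 0.05, cap=0.06)    # never go beyond the cap
```

Without a cap the rule accepts a conditional level above the nominal value when that level is the closest one available; with the cap it falls back to the more conservative cutoff.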

The situation is much more complicated for the Behrens-Fisher problem. On the one hand, the arguments for conditioning on the ratio of the two sample variances seem less compelling; on the other hand, even if this conditioning requirement is accepted, the conditional distribution depends on unknown parameters, and thus it is less clear how to control the conditional level. Robinson's formulation, mentioned in Section 7, provides an interesting possibility but requires much further investigation. But such work can be carried out from the present point of view by combining considerations of both conditioning and power.

To summarize, p values, fixed-level significance statements, conditioning, and power considerations can be combined into a unified approach. When long-term power and conditioning are in conflict, specification of the appropriate frame of reference takes priority, because it determines the meaning of the probability statements. A fundamental gap in the theory is the lack of clear principles for selecting the appropriate framework. Additional work in this area will have to come to terms with the fact that the decision in any particular situation must be based not only on abstract principles but also on contextual aspects.

[Received January 1992. Revised February 1993.]

REFERENCES

Agresti, A. (1992), "A Survey of Exact Inference for Contingency Tables" (with discussion), Statistical Science, 7, 131-177.

Barnard, G. A. (1989), "On Alleged Gains in Power From Lower P Values," Statistics in Medicine, 8, 1469-1477.

Barnett, V. (1982), Comparative Statistical Inference (2nd ed.), New York: John Wiley.

Bartlett, M. S. (1984), "Discussion of 'Tests of Significance for 2 × 2 Contingency Tables,' by F. Yates," Journal of the Royal Statistical Society, Ser. A, 147, 453.

Bennett, J. H. (1990), Statistical Inference and Analysis (Selected Correspondence of R. A. Fisher), Oxford, U.K.: Clarendon Press.

Braithwaite, R. B. (1953), Scientific Explanation, Cambridge, U.K.: Cambridge University Press.

Brown, L. (1967), "The Conditional Level of Student's t Test," Annals of Mathematical Statistics, 38, 1068-1071.

Carlson, R. (1976), "The Logic of Tests of Significance" (with discussion), Philosophy of Science, 43, 116-128.

Carnap, R. (1962), Logical Foundations of Probability (2nd ed.), Chicago: the University of Chicago Press.

Cowles, M., and Davis, C. (1982), "On the Origins of the .05 Level of Statistical Significance," American Psychologist, 37, 553-558.

Cox, D. R. (1958), "Some Problems Connected With Statistical Inference," Annals of Mathematical Statistics, 29, 357-372.

Cox, D. R., and Hinkley, D. V. (1974), Theoretical Statistics, London: Chapman and Hall.

Fisher, R. A. (1925) (10th ed., 1946), Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd.

Fisher, R. A. (1932), "Inverse Probability and the Use of Likelihood," Proceedings of the Cambridge Philosophical Society, 28, 257-261.

Fisher, R. A. (1935a), "The Logic of Inductive Inference," Journal of the Royal Statistical Society, 98, 39-54.

Fisher, R. A. (1935b), "Statistical Tests," Nature, 136, 474.

Fisher, R. A. (1935c) (4th ed., 1947), The Design of Experiments, Edinburgh: Oliver & Boyd.

Fisher, R. A. (1939), "Student," Annals of Eugenics, 9, 1-9.

Fisher, R. A. (1947), The Design of Experiments (4th ed.), New York: Hafner Press.

Fisher, R. A. (1955), "Statistical Methods and Scientific Induction," Journal of the Royal Statistical Society, Ser. B, 17, 69-78.

Fisher, R. A. (1956), "On a Test of Significance in Pearson's Biometrika Tables (No. 11)," Journal of the Royal Statistical Society, Ser. B, 18, 56-60.

Fisher, R. A. (1958), "The Nature of Probability," Centennial Review, 2, 261-274.








Fisher, R. A. (1959), "Mathematical Probability in the Natural Sciences," Technometrics, 1, 21-29.

Fisher, R. A. (1960), "Scientific Thought and the Refinement of Human Reason," Journal of the Operations Research Society of Japan, 3, 1-10.

Fisher, R. A. (1973), Statistical Methods and Scientific Inference (3rd ed.), London: Collins Macmillan.

Gigerenzer, G., et al. (1989), The Empire of Chance, New York: Cambridge University Press.

Hacking, I. (1965), Logic of Statistical Inference, New York: Cambridge University Press.

Hall, P., and Selinger, B. (1986), "Statistical Significance: Balancing Evidence Against Doubt," Australian Journal of Statistics, 28, 354-370.

Hedges, L., and Olkin, I. (1985), Statistical Methods for Meta-Analysis, Orlando, FL: Academic Press.

Hochberg, Y., and Tamhane, A. C. (1987), Multiple Comparison Procedures, New York: John Wiley.

Kendall, M. G. (1963), "Ronald Aylmer Fisher, 1890-1962," Biometrika, 50, 1-15.

Kyburg, H. E., Jr. (1974), The Logical Foundations of Statistical Inference, Boston: D. Reidel.

Linhart, H., and Zucchini, W. (1986), Model Selection, New York: John Wiley.

Linssen, H. N. (1991), "A Table for Solving the Behrens-Fisher Problem," Statistics and Probability Letters, 11, 359-363.

Miller, R. G. (1981), Simultaneous Statistical Inference (2nd ed.), New York: Springer-Verlag.

Morrison, D. E., and Henkel, R. E. (1970), The Significance Test Controversy, Chicago: Aldine.

Neyman, J. (1935), "Discussion of Fisher (1935a)," Journal of the Royal Statistical Society, 98, 74-75.

Neyman, J. (1938), "L'Estimation Statistique Traitee Comme un Probleme Classique de Probabilite," Actualites Scientifiques et Industrielles, 739, 25-57.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability (2nd ed.), Graduate School, Washington, D.C.: U.S. Dept. of Agriculture.

Neyman, J. (1955), "The Problem of Inductive Inference," Communications on Pure and Applied Mathematics, 8, 13-46.

Neyman, J. (1956), "Note on an Article by Sir Ronald Fisher," Journal of the Royal Statistical Society, Ser. B, 18, 288-294.

Neyman, J. (1957), "'Inductive Behavior' as a Basic Concept of Philosophy of Science," Review of the International Statistical Institute, 25, 7-22.

Neyman, J. (1961), "Silver Jubilee of My Dispute With Fisher," Journal of the Operations Research Society of Japan, 3, 145-154.

Neyman, J. (1966), "Behavioristic Points of View on Mathematical Statistics," in On Political Economy and Econometrics: Essays in Honour of Oscar Lange, Warsaw: Polish Scientific Publishers, pp. 445-462.

Neyman, J. (1976), "Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena," Communications in Statistics, Part A-Theory and Methods, 5, 737-751.

Neyman, J. (1977), "Frequentist Probability and Frequentist Statistics," Synthese, 36, 97-131.

Neyman, J., and Pearson, E. S. (1928), "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference," Biometrika, 20A, 175-240, 263-294.

Neyman, J., and Pearson, E. S. (1933a), "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Philosophical Transactions of the Royal Society of London, Ser. A, 231, 289-337.

Neyman, J., and Pearson, E. S. (1933b), "The Testing of Statistical Hypotheses in Relation to Probabilities A Priori," Proceedings of the Cambridge Philosophical Society, 29, 492-510.

Oakes, M. (1986), Statistical Inference: A Commentary for the Social and Behavioral Sciences, New York: John Wiley.

Pearson, E. S. (1955), "Statistical Concepts in Their Relation to Reality," Journal of the Royal Statistical Society, Ser. B, 17, 204-207.

Pearson, E. S. (1962), "Some Thoughts on Statistical Inference," Annals of Mathematical Statistics, 33, 394-403.

Pearson, E. S. (1974), "Memories of the Impact of Fisher's Work in the 1920's," International Statistical Review, 42, 5-8.

Pearson, E. S., and Hartley, H. O. (1954), Biometrika Tables for Statisticians (Table No. 11), New York: Cambridge University Press.

Pedersen, J. G. (1978), "Fiducial Inference," International Statistical Review, 46, 147-170.

Robinson, G. K. (1976), "Properties of Student's t and of the Behrens-Fisher Solution to the Two-Means Problem," The Annals of Statistics, 4, 963-971.

Robinson, G. K. (1982), "Behrens-Fisher Problem," in Encyclopedia of Statistical Sciences (Vol. 1, eds. S. Kotz and N. L. Johnson), New York: John Wiley, pp. 205-209.

Savage, L. J. (1976), "On Rereading R. A. Fisher" (with discussion), Annals of Statistics, 4, 441-500.

Schweder, T. (1988), "A Significance Version of the Basic Neyman-Pearson Theory for Scientific Hypothesis Testing," Scandinavian Journal of Statistics, 15, 225-242.

Seidenfeld, T. (1979), Philosophical Problems of Statistical Inference, Boston: D. Reidel.

Spielman, S. (1974), "The Logic of Tests of Significance," Philosophy of Science, 41, 211-226.

Spielman, S. (1978), "Statistical Dogma and the Logic of Significance Testing," Philosophy of Science, 45, 120-135.

Steger, J. A. (ed.) (1971), Readings in Statistics for the Behavioral Scientist, New York: Holt, Rinehart and Winston.

Stuart, A., and Ord, J. K. (1991), Kendall's Advanced Theory of Statistics, Vol. I (5th ed.), New York: Oxford University Press.

Tukey, J. W. (1960), "Conclusions vs. Decisions," Technometrics, 2, 424-432.

Upton, G. J. G. (1982), "A Comparison of Alternative Tests for the 2 × 2 Comparative Trial," Journal of the Royal Statistical Society, Ser. A, 145, 86-105.

Wallace, D. L. (1980), "The Behrens-Fisher and Fieller-Creasy Problems," in R. A. Fisher: An Appreciation, eds. S. E. Fienberg and D. V. Hinkley, New York: Springer-Verlag, pp. 119-147.

Yates, F. (1984), "Tests of Significance for 2 × 2 Contingency Tables" (with discussion), Journal of the Royal Statistical Society, Ser. A, 147, 426-463.

Zabell, S. L. (1992), "R. A. Fisher and the Fiducial Argument," Statistical Science, 7, 369-387.
