Dipankar Bandyopadhyay, Ph.D.dbandyop/BIOS625/lecture_01new.pdf · Introduction Dipankar...

Introduction

Dipankar Bandyopadhyay, Ph.D.

Department of Biostatistics,Virginia Commonwealth University

D. Bandyopadhyay (VCU) BIOS 625: Categorical Data & GLM 1 / 56

Course logistics

Let Y be a discrete random variable with f (y) = P(Y = y) = py .

Then, the expectation of Y is defined as

E (Y ) =∑y

yf (y)

Similarly, the Variance of Y is defined as

Var(Y ) = E [(Y − E (Y ))2]= E (Y 2)− [E (Y )]2


Conditional probabilities

Let A denote the event that a randomly selected individual from the“population” has heart disease.

Then, P(A) is the probability of heart disease in the “population”.

Let B denote the event that a randomly selected individual from thepopulation has a defining characteristics such as smoking

Then, P(B) is the probability of smoking in the population

Denote

P(A|B) =probability that a randomly selected individualhas characteristic A, given that he has characteristic B

Then by definition,

P(A|B) =P(A and B)

P(B)=

P(AB)

P(B)

provided that P(B) 6= 0

P(A|B) could be interpreted as the probability of that a smoker has heartdisease


Associations

The two characteristics, A and B are associated if

P(A|B) 6= P(A)

Or, in the context of our example–the rate of heart disease dependson smoking status

If P(A|B) = P(A) then A and B are said to be independent


Law of Total Probability

Suppose event B is made up of k mutually exclusive and exhaustivestrata, identified by B1,B2, . . .Bk

If event A occurs at all, it must occur along with one (and only one)of the k exhaustive categories of B.

Since B1,B2, . . .Bk are mutually exclusive

P(A) = P[(A and B1) or (A and B2) or . . . (A and Bk)]= P(AB1) + P(AB2) + . . .+ P(ABk)

=k∑

i=1P(A|Bi )P(Bi )

which is known as the total law of probability

A special case when k = 2 is

P(A) = P(A|B)P(B) + P(A|B ′)P(B ′)

where B ′ is read “not B” – also view this as a weighted average


Application to screening tests

A frequent application of Bayes’ theorem is in evaluating theperformance of a diagnostic test used to screen for diseases

Let D+ be the event that a person does have the disease;

D− be the event that a person does NOT have the disease;

T+ be the event that a person has a POSITIVE test; and

T− be the event that a person has a NEGATIVE test

There are 4 quantities of interest:1 Sensitivity2 Specificity3 Positive Predictive Value (PPV)4 Negative Predictive Value (NPV)


Sensitivity and Specificity

Sensitivity is defined as the probability a test is positive given disease

Sensitivity = P(T+|D+)

Specificity is defined as the probability of a test being negative giventhe absence of disease

Specificity = P(T−|D−)

In Practice, you want to know disease status given a test result


PPV and NPV

PPV is defined as the proportion of people with a positive test resultthat actually have the disease, which is P(D+|T+)

By Bayes’ theorem,

PPV = P(D+|T+) =P(T+|D+)P(D+)

P(T+)

NPV is defined as the proportion of people among those with anegative test who truly do not have the disease (P(D−|T−))

Which by Bayes’ theorem is

NPV = P(D−|T−)

= P(T−|D−)·P(D−)P(T−)

= P(T−|D−)·(1−P(D+))1−P(T+)


As a function of disease prevalence

For both PPV and NPV, the disease prevalence (P(D+)) influencesthe value of the screening test.

Consider the following dataTest result

Disease status Positive Negative Total

Present 950 50 1000Absent 10 990 1000

Sensitivity and Specificity for this test are

Sen = P(T+|D+) = 950/1000 = 0.95

andSpec = P(T−|D−) = 990/1000 = 0.99

However, the real question is what is the probability that an individualhas the disease given a positive test result.


With some easy algebra (substituting definitions into the previousequations), it can be shown that

PPV =Sens · Π

Sens · Π + (1− Spec)(1− Π)

and

NPV =Spec · (1− Π)

Spec · (1− Π) + (1− Sens) · Π

where Π is the disease prevalence (P(D+))

Thus, the PPV and NPV for rare to common disease could becalculated as follows:

Π PPV NPV

1/1,000,000 0.0001 1.01/500 0.16 0.999901/100 0.49 0.99949


Interpretation?

For a rare disease that affects only 1 in a million,1 A negative test result almost guarantees the individual is free from

disease (NOTE: this is a different conclusion of a 99% specificity)2 A positive test result still only indicates that you have a probability of

0.0001 of having the disease (still unlikely–which is why most screeningtests indicate that “additional verification may be necessary”)

However, if the disease is common (say 1 in 100 have it)1 A negative test result would correctly classify 9995 out of 10,000 as

negative, but 5 of 10,000 would be wrongly classified (i.e., they aretruly positive and could go untreated)

2 However, of 100 people that do have a positive test, only 49 wouldactually have the disease (51 would be wrongly screened)

Does the test “work”

It “depends”


Application to Pregnancy Tests

Most home pregnancy tests claims to be “over 99% accurate”

By accurate, the manufactures mean that 99% of samples are“correctly” classified (i.e., pregnant mothers have a positive test,non-pregnant mothers have a negative test)

This measure is flawed in that it is highly dependent on the numberof cases (i.e., pregnant mothers) and controls (i.e., non-pregnantmothers) – FYI: we’ll revisit this concept again in future lectures

However, for sake of illustration, lets consider a sample of 250pregnant mothers and 250 non-pregnant mothers


Example Data–Based on at home pregnancy tests

Suppose we have the following data observed in a clinical trial:

TruthPregnant Not Pregnant

Test + N++ bTest - a N−−

250 250 500

We know that we have 99% accuracy (because the manufactures tell us so), wehave a constraint

N++ + N−−500

≥ 0.99

soN++ + N−− ≥ 495

and for illustrative purposes, let a = 3 and b = 2 so that the following table

results.



Test + 247 2 249Test - 3 248 251

250 250 500

ThenSens = P(T+|D+) = 247/250 = 0.988

andSpec = P(T−|D−) = 248/250 = 0.992

Using these values and simplifying the previous equations for PPV and NPV,

PPV =0.988Π

0.980Π + 0.008

NPV =0.992− 0.992Π

0.992− 0.98Π

where Π is again the “disease rate” (or in this case, the probability of being

pregnant)


Π PPV NPV0.001 0.110022 0.999988

0.01 0.555056 0.9998780.1 0.932075 0.9986580.5 0.991968 0.988048

Here, the “population” at risk is those females, of childbearing age, whoengaged in sexual activity during the previous menstrual cycle, and are atleast 2 days late in the new cycle.

The success rate of birth control may be in the range of 99% andunprotected sex may be in the range of (0.1-0.5)

How do you feel about the marketing claim that the product is “over 99%accurate”?


Different Case-Control Ratio


Test + 397 2 399Test - 3 98 101

400 100 500

ThenSens = P(T+|D+) = 297/400 = 0.9925

andSpec = P(T−|D−) = 98/100 = 0.98

*Note: Sensitivity is now higher and specificity is lower than previously assumed

Π PPV NPV0.001 0.047324 0.999992

0.01 0.333894 0.9999230.1 0.846482 0.999150.5 0.980247 0.992405


Before we begin

Lectures will be primarily from the text, and (usually) posted thenight before.

Sample SAS code is provided on Dr. Agresti’s websitehttp://www.stat.ufl.edu/∼aa/cda/cda.html, and in my notes.There is also a link to a large PDF file with sample R code.

For a better understanding, read the lecture notes AND the text.


Categorical Data Analysis (CDA) ?

What are categorical data?

Agresti’s answer: a variable with a measurement scale consisting of aset of categories


CDA definitions continued...

Response data considered in regression and ANOVA are continuous.Examples:

cholesterol level (milligrams per deciliter)

lifetime of a lab rat (in weeks)

money spent on breakfast cereal (U.S. $)

A categorical variable takes on one of a (usually finite) number ofcategories, or levels. Examples:

eye color (blue, brown, green, other)

political affiliation (Democrat, Republican, other)

cholesterol level (low, normal, high)

Note that a variable can be continuous or categorical depending on howit’s defined.


Quantitative vs. Qualitative Variable Distinctions

Qualitative Variables: Distinct categories differ in quality, not in quantity

Quantitative Variables: Distinct levels have differing amounts of thecharacteristic of interest.

Clearly, a qualitative variable is synonymous with ”nominal” (black, white,green, blue). Also, an interval variable is clearly quantitative (weight inpounds).

However, ordinal variables are a hybrid of both a quantitative andqualitative features. For example, ‘small, medium and large’ can be viewedas a quantitative variable.

At this point, the utility in the variable descriptions may appearunnecessary. However, as the course progresses, the statistical methodspresented will be appropriate for a specific classification of data.


Response or Explanatory?

Variables can be classified as a response or explanatory.

In regression models we seek to model a response as a stochastic functionof explanatory variables, or predictors.

In this course the response will be categorical and the predictors can becategorical, continuous, or discrete.

For example, if we wanted to model political affiliation as a function ofgender and annual salary, the response would be (Republican, Democrat,other), and the two predictors would be annual salary (essentiallycontinuous) and the categorical gender (male, female).


More examples: Nominal verses Ordinal

Nominal variables have no natural ordering to them. e.g. eye color (blue,brown, other), political affiliation (Democrat, Republican, other), favoritemusic type (jazz, folk, rock, rap, country, bluegrass, other), gender (male,female).

Ordinal variables have an obvious order to them. e.g. cancer stage (I, II,III, IV), a taste response to a new salsa (awful, below average, average,good, delicious).

Interval variables are ordinal variables that also have a natural scaleattached to them. e.g. diastolic blood pressure, number of years of posthigh school education. Interval variables are typically discrete numbersthat comprise an interval.Read: Sections 1.1.3, 1.1.4, 1.1.5.


Core Discrete Distributions for CDA

There are three core discrete distributions for categorical data analysis

1 Binomial (with the related Bernoulli distribution)

2 Multinomial

3 Poisson

We will explore each of these in more detail.


Bernoulli Trials

Consider the following,

n independent patients are enrolled in a single arm (only onetreatment) Phase II oncology study.

The outcome of interest is whether or not the experimental treatmentcan shrink the tumor.

Then, the outcome for patient i is

Yi =

{1 if new treatment shrinks tumor (success)0 if new treatment does not shrinks tumor (failure)

,

i = 1, . . . , n


Each Yi is assumed to be independently, identically distributed as aBernoulli random variables with the probability of success as

P(Yi = 1) = p

and the probability of failure is

P(Yi = 0) = 1− p

Then, the probability function is Bernoulli

P(Yi = y) = py (1− p)1−y for y = 0, 1

and is denoted byYi ∼ Bern(p)


Properties of Bernoulli

MEANE (Yi ) = 0 · P(Yi = 0) + 1 · P(Yi = 1)

= 0(1− p) + 1p

= p

VARIANCE

Var(Yi ) = E (Y 2i )− [E (Yi )]2

= E (Yi )− [E (Yi )]2 ; since Y 2i = Yi

= E (Yi )[1− E (Yi )]

= p(1− p)


Binomial Distribution

Let Y be defined as

Y =n∑

i=1

Yi ,

where n is the number of bernoulli trials. We will use Y (the number ofsuccesses) to form test statistics and confidence intervals for p, theprobability of success.

Example 2,Suppose you take a sample of n independent biostatistics professors todetermine how many of them are nerds (or geeks).

We want to estimate the probability of being a nerd given you are abiostatistics professor.


What is the distribution of the number of successes,

Y =n∑

i=1

Yi ,

resulting from n identically distributed, independent trials with

Yi =

{1 if professor i is a nerd (success)0 if professor i is not a nerd (failure)

.

andp = P(Yi = 1); (1− p) = P(Yi = 0)

for all i = 1, . . . , n


The probability function can be shown to be binomial:

P(Y = y) =

(ny

)py (1− p)n−y =

n!

y !(n − y)!py (1− p)n−y ,

wherey = 0, 1, 2, . . . , n

andthe number (

ny

)=

n!

(n − y)!y !

is the number of ways of partitioning n objects into two groups; one groupof size y , and the other of size (n − y).The distribution is denoted by

Y ∼ Bin(n, p)


Properties of the Binomial

MEANE (Y ) = E

(∑ni=1 Yi

)=

∑ni=1 E (Yi )

=∑n

i=1 p = np

[Recall, the expectation of a sum is the sum of the expectations]

VARIANCE

Var(Y ) = Var(∑n

i=1 Yi

)=

∑ni=1 Var(Yi )

=∑n

i=1 p(1− p) = np(1− p)

[Variance of a sum is the sum of the variances if observations areindependent)


Multinomial

Often, a categorical may have more than one outcome of interest. Recallthe previous oncology trial where Yi was defined as

Yi =

{1 if new treatment shrinks tumor (success)0 if new treatment does not shrinks tumor (failure)

However, sometimes is may be more beneficial to describe the outcome interms of

Yi =

1 Tumor progresses in size2 Tumor remains as is3 Tumor decreases in size


Multinomial

Let yij = 1 if subject i has outcome j and yij = 0 else. Then

yi = (yi1, yi2, . . . , yic)

represents a multinomial trial, with∑

j yij = 1 and c representing thenumber of potential levels of Y .

Let nj =∑

i yij denote the number of trials having outcome in category j.The counts (n1, n2, . . . , nc) have the multinomial distribution.

P(n1, n2, · · · , nc−1) =

(n!

n1!n2! · · · nc !

)πn11 π

n22 . . . πncc

How many free parameters among the π’s?


Special Case of a Multinomial

When c = 2, then

P(n1) =

(n!

n1!n2!

)πn11 π

n22

Due to the constraints∑

c nc = n and∑

c π = 1, n2 = n − n1 andπ2 = 1− π1.

Therefore,

P(n1) =

(n!

n1!(n − n1!)

)πn11 (1− π1)n−n1

Note: For most of the class, I will use p for probability, Agresti tends touse π


Poisson

Sometimes, count data does not arrive from a fixed number of trials. Forexample,Let Y = number of babies born at VCU in a given week.

Y does not have a predefined maximum and a key feature of the Poissondistribution is that the variance equals its mean.

The probability that Y = 0, 1, 2, . . . is written as

P(Y = y) =e−µµy

y !

where µ = E (Y ) = Var(Y ).


Proof of Expectation

E [Y ] =∞∑i=0

ie−µµi

i!

= 0·e−µ

0! +∞∑i=1

ie−µµi

i! [See Note 1]

= 0 + µe−µ∞∑i=1

µi−1

(i−1)!

= µe−µ∞∑j=0

µj

j! [See Note 2]

= µ [See Note 3]

Notes:

1 0! = 1 and we separated the 1st term (i=0) of the summation out

2 Let j = i − 1, then if i = 1, . . . ,∞, j = 0, . . . ,∞

3 Since∞∑j=0

µj

j! = eµ by McLaurin expansion of ex


Overdispersion

Overdispersion: Often the variability associated with Poisson andbinomial models is smaller than what is observed in real data.

The increased variance can be attributed to unmeasured, or perhaps latentregressors in the model and thus the resulting count distribution is morecorrectly a mixture of binomial or Poisson distributions, with mixingweights being the proportion of outcomes resulting from specific(unaccounted for) covariate combinations.

We will discuss testing for overdispersion in specific models and remedieslater on.


Connection between Multinomial and Poisson

Let Y = (Y1,Y2,Y3) be independent Poisson with parameters (µ1, µ2, µ3).

e.g. Y1 is number of people that fly to France from Britain this year, Y2

the number who go by train, and Y3 the number who take a ferry. Thetotal number of traveling n = Y1 + Y2 + Y3 is Pois(µ1 + µ2 + µ3).

Conditional on n, the distribution of (Y1,Y2,Y3) is multinomial withparameters n and π = (µ1, µ2, µ3)/µ+ where µ+ = µ1 + µ2 + µ3.

This is especially useful in log-linear models, covered in Chapter 9.


Maximum likelihood [ML] estimation

Let the parameter vector for a model be β = (β1, . . . , βp) where p is thenumber of parameters in the model. Let the outcome variables be randomvariables denoted y = (y1, . . . , yn) and the probability model denoted

p(y1, . . . , yn|β) = p(y|β).

The likelihood of β, denoted L(β), is L(β) = p(y|β) thinking of data y asfixed.


MLE, cont.

For example, if n = (n1, . . . , nc) is mult(n,π) where π = (π1, . . . , πc),then β = (π1, π2, . . . , πc−1) because there are c − 1 free parameters in π.

The likelihood of β is simply the probability of seeing the response datagiven β:

L(β) = p(n1, . . . , nc |β) =

(n

n1 · · · nc

) c∏j=1

πnjj .


MLE, cont.

The maximum likelihood estimator is that value of β that maximizes L(β)for given data:

β = argmaxβ∈BL(β),

where B is the set of values β can take on.

The MLE β makes the observed data as likely as possible. The estimatorturns into an estimate when data are actually seen. For example, if c = 3and n1 = 3, n2 = 5, n3 = 2, then β = (π1, π2) = (0.3, 0.5) and of courseπ3 = 1− (π1 + π2) = 0.2. Thenp(3, 5, 2|π1 = 0.2, π2 = 0.5) ≥ p(3, 5, 2|π1 = p1, π2 = p2) for all values ofp1 and p2.

An estimator is random (i.e. before data are collected and seen they arerandom, and so then is any function of data) whereas an estimate is afixed, known vector (like (0.3,0.5)).


MLE, cont.

MLEs have nice properties for most (but not all) models (p. 9):

They have large sample normal distributions:

β•∼ Np(β, cov(β)) where cov(β) =

[−E

(∂2 logL(β)

∂βj∂βk

)]−1p×p

.

They are asymptotically consistent: β → β (in probability) as thesample size n→∞.

They are asymptotically efficient: var(βj) is smaller than thecorresponding variance of other (asymptotically) unbiased estimators.


Example: MLE for Poisson data

Let Yi ∼ Pois(λti ) where λ is the unknown event rate and ti are knownexposure times. Assume the Y1, . . . ,Yn are independent.

The likelihood of λ is

L(λ) = p(y1, . . . , yn|λ) =n∏

i=1

p(yi |λ) =n∏

i=1

e−tiλ(tiλ)yi/yi !

=

[n∏

i=1

tyiiyi !

]e−λ

∑ni=1 tiλ

∑ni=1 yi = g(t, y)e−λ

∑ni=1 tiλ

∑ni=1 yi .

Then the log-likelihood is

L(λ) = log g(t, y)− λn∑

i=1

ti + log(λ)n∑

i=1

yi .


Poisson MLE, cont.Taking the derivative w.r.t. λ we get

L′(λ) =∂L(λ)

∂λ= −

n∑i=1

ti +1

λ

n∑i=1

yi .

Setting this equal to zero, plugging in Y for y, and solving for λ yields theMLE

λ =

∑ni=1 Yi∑ni=1 ti

.

Now∂2L(λ)

∂λ2= −

∑ni=1 yiλ2

.

Since∑n

i=1 Yi ∼ Pois(λ∑n

i=1 ti ), we have

−E(∂2L(λ)

∂λ2

)= E

(∑ni=1 Yi

λ2

)=λ∑n

i=1 tiλ2

=

∑ni=1 tiλ

.


Poisson MLE, cont.The variance of λ is given by the ‘inverse’ of this ‘matrix’

var(λ)•=

λ∑ni=1 ti

.

The large sample normal result tells us

λ•∼ N

(λ,

λ∑ni=1 ti

).

The standard deviation of λ is estimated to be sd(λ) =√

λ∑ni=1 ti

. Since

we do not know λ, the standard deviation is estimated by the standarderror obtained from estimating λ by its MLE:

se(λ) =

√λ∑ni=1 ti

=

√ ∑ni=1 yi

[∑n

i=1 ti ]2

=

√∑ni=1 yi∑n

i=1 ti.


Poisson MLE, cont.

Example: Say that we record the number of adverse medical events (e.g.operating on the wrong leg) from a hospital over n = 3 different times:t1 = 1 week in 2013, t2 = 4 weeks in 2014, and t3 = 3 weeks in 2015.We’ll assume that the adverse surgical event rate λ (events/week) doesnot change over time and that event counts in different time periods areindependent.

Then Yi ∼ Pois(tiλ) for i = 1, 2, 3. Say we observe y1 = 0, y2 = 3, andy3 = 1. Then λ = (0 + 3 + 1)/(1 + 4 + 3) = 4/8 = 0.5 event/week, or oneevent every other week. Also, se(λ) =

√4/8 = 0.25.

The large sample result tells us then (before data are collected andY = (Y1,Y2,Y3) is random) that

λ•∼ N(λ, 0.252),

useful for constructing hypothesis tests and confidence intervals.


Wald, Likelihood Ratio, and Score tests

These are three ways to perform large sample hypothesis tests based onthe model likelihood.

Wald test

Let M be a m × p matrix. Many hypotheses can be written H0 : Mβ = bwhere b is a known m × 1 vector.

For example, let p = 3 so β = (β1, β2, β3). The test of H0 : β2 = 0 iswritten in matrix terms with M = (0, 1, 0) and b = 0. The hypothesis

H0 : β1 = β2 = β3 has M =

[1 −1 00 1 −1

]and b =

[00

].


Wald test, cont.The large sample result for MLEs is

β•∼ Np(β, cov(β)).

So thenMβ

•∼ Nm(Mβ,Mcov(β)M′).

If H0 : Mβ = b is true then

Mβ − b•∼ Nm(0,Mcov(β)M′).

SoW = (Mβ − b)′[Mcov(β)M′]−1(Mβ − b)

•∼ χ2m.

W is called the Wald statistic and large values of W indicate Mβ is faraway from b, i.e. that H0 is false. The p-value for H0 : Mβ = b is givenby p-value = P(χ2

m >W ).

The simplest, most-used Wald test is the familiar test that a regressioneffect is equal to zero, common to multiple, logistic, Poisson, and ordinalregression models.


Score test

In general, the cov(β) is a function of the unknown β. The Wald testreplaces β by its MLE β yielding cov(β). The score test replaces β by thethe MLE β0 obtained under the constraint imposed by H0

β0 = argmaxβ∈B:Mβ=bL(β).

Let cov(β) be the asymptotic covariance for unconstrained MLE.

The resulting test statistic

S = [ ∂∂β logL(β0)]′[cov(β0)][ ∂∂β logL(β0)]•∼ χ2

m.

Sometimes it is easier to fit the reduced model rather than the full model;the score test allows testing whether new parameters are necessary from afit of a smaller model.


Likelihood Ratio tests

The Likelihood Ratio test [LRT] is easily constructed and carried out fornested models. The full model has parameter vector β and the reducedmodel obtains when H0 : Mβ = b holds. A common example is whenβ = (β1,β2) and we wish to test H0 : β1 = 0 (e.g. a subset of regressioneffects are zero). Let β be the MLE under the full model

β = argmaxβ∈BL(β),

and β0 be the MLE under the constraint imposed by H0

β0 = argmaxβ∈B:Mβ=bL(β).


LRT, cont.

If H0 : Mβ = b is true,

L = −2[logL(β0)− logL(β)]•∼ χ2

m.

The statistic L is the likelihood ratio test statistic for the hypothesisH0 : Mβ = b. The smallest L can be is zero when β0 = β. The moredifferent β is from β0, the larger L is and the more evidence there is thatH0 is false. The p-value for testing H0 is given by p− value = P(χ2

m > L).

To test whether additional parameters are necessary, LRT tests are carriedout by fitting two models: a ‘full’ model with all effects and a ‘reduced’model. In this case the dimension m of M is the difference in the numbersof parameters in the two models.


LRT, cont.

For example, say we are fitting the standard regression model

Yi = β0 + β1xi1 + β2xi2 + β3xi3 + ei

where eiiid∼ N(0, σ2). Then β = (β0, β1, β2, β3, σ

2) and we want to testβ1 = (β2, β3) = (0, 0), that the 2nd and 3rd predictors aren’t needed. Thistest can be written using matrices as

H0 :

[0 0 1 0 00 0 0 1 0

]β0β1β2β3σ2

=

[00

].


LRT, cont.

The likelihood ratio test fits the full model above and computesLf = logLf (β0, β1, β2, β3, σ).

Then the reduced model Yi = β0 + β1xi1 + ei is fit andLr = logLr (β0, β1, σ) computed.

The test statistic is L = −2(Lr − Lf ); a p-value is computed as P(χ22 > L).

If the p-value is less than, say, α = 0.05 we reject H0 : β2 = β3 = 0.

Of course we wouldn’t use this approximate LRT test here! We haveoutlined an approximate test, but there is well-developed theory thatinstead uses a different test statistic with an exact F -distribution.


Comments

Note that:

The Wald test requires maximizing the unrestricted likelihood, anduses non-null standard error.

The score test requires maximizing the restricted likelihood (under anested submodel), and uses the null standard error.

The Likelihood ratio test combines information from both [restrictedand unrestricted] likelihoods.

So the likelihood ratio test uses more information and both Wald andScore tests can be viewed as approximations to the LRT.

However, SAS can “automatically” perform Wald tests of the formH0 : Mβ = b in a contrast statement and so I often use Wald testsbecause they’re easy to get. In large samples the tests are equivalent.


Confidence intervals

A plausible range of values for a parameter βj (from β) is given by aconfidence interval (CI). Recall that a CI has a certain fixed probability ofcontaining the unknown βj before data are collected. After data arecollected, nothing is random any more, and instead of “probability” werefer to “confidence.”

A common way of obtaining confidence intervals is by inverting hypothesistests of H0 : βk = b. Without delving into why this works, a (1− α)100%CI is given by those b such that the p-value for testing H0 : βk = b islarger than α.


CIs, cont.

For Wald tests of H0 : βk = b, the test statistic is W = (βk − b)/se(βk).This statistic is approximately N(0, 1) when H0 : βk = b is true and thep-value is larger than 1− α only when |W | < zα/2 where zα/2 is the1− α/2 quantile of a N(0, 1) random variable. This yields the well knownCI

(βk − zα/2se(βk), βk + zα/2se(βk)).

The likelihood ratio CI operates in the same way, but the log-likelihoodmust be computed for all values of b. We’ll explore the differencesbetween inverting Wald, Score, and LRT for binomial data in theremainder of Chapter 1.


Dipankar Bandyopadhyay, Ph.D.dbandyop/BIOS625/lecture_01new.pdf · Introduction Dipankar...

Documents

Transcript of Dipankar Bandyopadhyay, Ph.D.dbandyop/BIOS625/lecture_01new.pdf · Introduction Dipankar...