New Ways of Looking at Binary Data Fitting in R
Yoon G Kim, ygk1@humboldt.edu
Colloquium Talk
Appetizer

Can we “stabilize” this?

[Figure: y1 (blue) and y2 (red) plotted against x on the raw scale (0-250); the two series oscillate with very different amplitudes]
After taking LOG ...

> y1 <- rep(c(100,200), times=10)
> y2 <- rep(c(10,20), times=10)
> x <- c(1:20)
> data <- cbind(x,y1,y2)
> data[1:3,]
     x  y1 y2
[1,]  1 100 10
[2,]  2 200 20
[3,]  3 100 10
> par(mfrow=c(1,2))
> plot(y1~x, type="l", ylim=c(0,250), col="blue", ylab="")
> lines(y2~x, type="l", col="red")
> plot(log(y1)~x, type="l", ylim=c(0,6), col="blue", ylab="")
> lines(log(y2)~x, type="l", col="red")
Log transformed

[Figure: left panel, y1 and y2 vs x on the raw scale (0-250); right panel, log(y1) and log(y2) vs x (0-6) - on the log scale the two series now have equal amplitude]
Outline

Exploring the options available when the assumptions of classical linear models are untenable.

In this talk: what can we do when the observations are not continuous, and the residuals are neither normally distributed nor identically distributed?
Classical Linear Models

Defined by three assumptions:
(1) the response variable is continuous;
(2) the residuals (ε) are normally distributed; and
(3) the residuals are independently (3a) and identically distributed (3b).

Today, we will consider a range of options available when assumptions (1), (2) and/or (3b) do not hold.
Non-continuous response variable

Many situations exist. The response variable could be:
(1) a count (number of individuals in a population; number of species in a community)
(2) a proportion (proportion "cured" after treatment; proportion of threatened species)
(3) a categorical variable (breeding/non-breeding; different phenotypes)
(4) a strictly positive value (esp. time to success, or time to failure)
( ... ) and so forth
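As a minimal sketch (my own, not from the talk), each of these response types can be simulated with the corresponding distribution in base R; the parameter values are made up for illustration:

```r
# Hypothetical mini-examples of each non-continuous response type
set.seed(1)
count_y <- rpois(5, lambda = 3)                        # (1) a count
prop_y  <- rbinom(5, size = 20, prob = 0.4) / 20       # (2) a proportion
cat_y   <- sample(c("breeding", "non-breeding"), 5, replace = TRUE)  # (3) categorical
time_y  <- rexp(5, rate = 0.2)                         # (4) strictly positive time
```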
Added difficulties

These types of non-continuous variables also tend to deviate from the assumptions of normality (assumption #2) and homoscedasticity (assumption #3b):

(1) A count variable often follows a Poisson distribution (where the variance increases linearly with the mean).
(2) A proportion often follows a Binomial distribution (where the variance reaches a maximum at intermediate values and a minimum at either end: 0% or 100%).
(3) A categorical variable tends to follow a Binomial distribution (when the variable has only two levels) or a Multinomial distribution (when it has more than two levels).
(4) A time to success/failure can follow an exponential distribution or an inverse Gaussian distribution (the latter having a variance that increases much more quickly than the mean).
Fortunately

Many of these situations can be unified under a central framework, since all these distributions (and a few more) belong to the exponential family of distributions.

Canonical form:

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

This is a probability density function (if y is continuous) or a probability mass function (if y is discrete), with:
- θ: canonical (location) parameter
- φ: dispersion parameter

Mean:     E(Y) = b′(θ)
Variance: var(Y) = a(φ) b″(θ)
The Normal distribution

Probability density function:

f(y; μ, σ²) = (1 / √(2πσ²)) exp{ −(y − μ)² / (2σ²) }

Canonical form:

f(y; μ, σ²) = exp{ (yμ − μ²/2) / σ² − y² / (2σ²) − (1/2) log(2πσ²) }

- Canonical (location) parameter: θ = μ
- Dispersion parameter: a(φ) = σ²
- Mean: E(Y) = b′(θ) = μ
- Variance: var(Y) = a(φ) b″(θ) = σ²
The Poisson distribution

Probability mass function:

f(y; λ) = e^(−λ) λ^y / y!

Canonical form:

f(y; λ) = exp{ y ln λ − λ − ln(y!) }

- Canonical (location) parameter: θ = ln λ
- Dispersion parameter: φ = 1
- b(θ) = exp(θ) = λ
- Mean: E(Y) = b′(θ) = λ
- Variance: var(Y) = a(φ) b″(θ) = λ
The Binomial distribution

Probability mass function:

f(y; n, p) = C(n, y) p^y (1 − p)^(n−y)

Canonical form:

f(y; n, p) = exp{ y ln[p / (1 − p)] + n ln(1 − p) + ln C(n, y) }

- Canonical (location) parameter: θ = ln[p / (1 − p)]
- Dispersion parameter: φ = 1
- b(θ) = −n ln(1 − p) = n log(1 + exp θ)
- Mean: E(Y) = b′(θ) = np
- Variance: var(Y) = a(φ) b″(θ) = np(1 − p)
Why is that remotely useful?

1) A single algorithm (maximum likelihood) will cope with all these situations.

2) Different types of variance can be accommodated:
   - When Var is constant -> Normal (Gaussian)
   - When Var increases linearly with the mean -> Poisson
   - When Var has a hump-backed shape -> Binomial
   - When Var increases as the square of the mean -> Gamma (i.e. the coefficient of variation remains constant)
   - When Var increases as the cube of the mean -> inverse Gaussian

3) Most types of data are thus effectively covered.
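These mean-variance relationships are easy to check by simulation; a quick sketch (my own, not from the slides):

```r
# Simulate large samples and compare sample mean and variance
set.seed(1)
y_pois  <- rpois(1e5, lambda = 4)              # Poisson: Var = mean
y_binom <- rbinom(1e5, size = 10, prob = 0.3)  # Binomial: Var = np(1-p) < mean

c(mean(y_pois), var(y_pois))      # both close to 4
c(mean(y_binom), var(y_binom))    # close to 3 and np(1-p) = 2.1
```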
Non-independent Observations

Two ways to cope with non-independent observations:

When the design is balanced ("equal sample sizes"), we can use factors to partition our observations into different "groups" and analyze them as an ANOVA or ANCOVA ... whether the factors are "crossed" or "nested".

When the design is unbalanced ("uneven sample sizes"), mixed-effect models are then called for.
How does it work?

1) You need to specify the family of distributions to use.
2) You need to specify the link function:

g(μᵢ) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ

where g() is the link function and the right-hand side is the linear predictor. For each type of variable, the "natural" link function to use is indicated by the canonical parameter:

Family         Link
Normal         Identity
Poisson        Log
Binomial       Logit: ln[p / (1 − p)]
Gamma          Inverse
Inv. Gaussian  Inverse square
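The logit link and its inverse are available directly in base R as qlogis() and plogis(); a minimal sketch:

```r
p   <- 0.7
eta <- qlogis(p)      # the link: log(p / (1 - p))
log(p / (1 - p))      # the same value computed by hand
plogis(eta)           # the inverse link recovers p = 0.7
```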
Binary variable

The response variable contains only 0's and 1's. The probability that a place is “occupied” is p, and we write

P(Y = y) = p^y (1 − p)^(1−y)

The objective is to determine how the explanatory variable influences p. The family to use is Binomial and the canonical link is logit.

Example: the response is the occupation of territories and the explanatory variable is the resource availability in each territory.

> occupy <- read.table("D:\\STAT999\\RBook\\occupation.txt", header=T)
> dim(occupy)
[1] 150   2
> occupy[1:3,]
  resources occupied
1  14.18154        0
2  18.68306        0
3  20.22156        0
> attach(occupy)

Crawley, M.J. (2007) The R Book: 597-598
Binary variable

> table(occupied)
occupied
 0  1 
58 92 
> modell <- glm(occupied~resources, family=binomial)
> plot(resources, occupied, type="n")
> rug(jitter(resources[occupied==0]))
> rug(jitter(resources[occupied==1]), side=3)
> xv <- 0:1000
> yv <- predict(modell, list(resources=xv), type="response")
> lines(xv, yv)

By default the link for a Binomial is the logit.
[Figure: occupied (0.0-1.0) vs resources (0-1000), with rug marks for the raw 0's and 1's and the fitted logistic curve rising from 0 to 1]
> cutr <- cut(resources, 5)
> tapply(occupied, cutr, sum)
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
         0         10         25         26         31 
> table(cutr)
cutr
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
        31         29         30         29         31 
> probs <- tapply(occupied, cutr, sum)/table(cutr)
> probs
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
 0.0000000  0.3448276  0.8333333  0.8965517  1.0000000 
attr(,"class")
[1] "table"
> probs <- as.vector(probs)
> resmeans <- tapply(resources, cutr, mean)
> resmeans <- as.vector(resmeans)
> points(resmeans, probs, pch=16, cex=2)
> se <- sqrt(probs*(1-probs)/table(cutr))
> up <- probs + as.vector(se)
> down <- probs - as.vector(se)
> for (i in 1:5) {
+   lines(c(resmeans[i], resmeans[i]), c(up[i], down[i]))
+ }
[Figure: the same fitted curve of occupied (0.0-1.0) vs resources (0-1000), now with the empirical proportions per resource class (filled circles) and their +/- 1 SE bars]
Various Link Functions

> grid_x <- seq(10, 990, by=0.5)
> modell_p <- predict(modell, newdata=data.frame(resources=grid_x), type="response")
> modelp <- glm(occupied~resources, family=binomial(link=probit))
> modelp_p <- predict(modelp, newdata=data.frame(resources=grid_x), type="response")
> modelcl <- glm(occupied~resources, family=binomial(link=cloglog))
> modelcl_p <- predict(modelcl, newdata=data.frame(resources=grid_x), type="response")
> modelca <- glm(occupied~resources, family=binomial(link=cauchit))
> modelca_p <- predict(modelca, newdata=data.frame(resources=grid_x), type="response")
To draw ...

> newdata <- data.frame(grid_x, modell_p, modelp_p, modelcl_p, modelca_p)
> library(lattice)
> print(xyplot(modell_p + modelp_p + modelcl_p + modelca_p ~ grid_x,
+   data=newdata, type="l", xlab="resources",
+   ylab="p", lwd=1.5, lty=c(1,2,3,4), col=c(1:4)))
> legend("topleft", legend=c("logit","probit","cloglog","cauchit"),
+   lty=c(1:4), col=c(1:4), lwd=1.5)
> par(new=F)
> points(resmeans, probs, pch=16, cex=2)
> for (i in 1:5) {
+   lines(c(resmeans[i], resmeans[i]), c(up[i], down[i]))
+ }
[Figure: the four fitted curves (logit, probit, cloglog, cauchit) overlaid against resources, with the empirical class proportions and their SE bars]
Binary variable

> summary(modell)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.170 on 149 degrees of freedom
Residual deviance:  97.152 on 148 degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

These tests are only valid if the response variable is indeed binomial. The residual deviance (also called the G-statistic) is

D = 2 Σᵢ₌₁ⁿ [ yᵢ ln(yᵢ / μ̂ᵢ) − (yᵢ − μ̂ᵢ) ]
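For strictly 0/1 data the saturated log-likelihood is zero, so the residual deviance reduces to −2 times the model log-likelihood. A sketch on simulated data (the coefficients here are made up for illustration, not the occupancy data):

```r
set.seed(42)
x  <- runif(200, 0, 1000)
y  <- rbinom(200, 1, plogis(-3.7 + 0.01 * x))  # hypothetical "true" model
m  <- glm(y ~ x, family = binomial)

mu <- fitted(m)
D  <- -2 * sum(y * log(mu) + (1 - y) * log(1 - mu))  # deviance by hand
all.equal(D, deviance(m))   # TRUE
```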
Binary variable

The dispersion parameter (φ) must be estimated from the Pearson residuals:

φ̂ = [1 / (n − p)] Σᵢ₌₁ⁿ (yᵢ − μ̂ᵢ)² / V(μ̂ᵢ)

where n − p is the residual degrees of freedom.

> (dp <- sum(residuals(modell, type="pearson")^2)/modell$df.res)
[1] 0.8472199

This suggests that the variance is 0.85 times the mean: in statistical terms, there is no overdispersion. In biological terms, it suggests that the counts are independent from each other and are not aggregated (i.e. clumped).

Typically, overdispersed count data follow a Negative Binomial distribution, which is not part of the exponential family of distributions. It won't be covered here, but it can be approximated as a quasi-binomial (family="quasibinomial"). If you need it in your future work, you can also try glm.nb (in the MASS package).
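The same dispersion estimate can be reproduced on simulated data; a sketch (simulated coefficients, not the occupancy data):

```r
set.seed(7)
x <- runif(150, 0, 1000)
y <- rbinom(150, 1, plogis(-3.7 + 0.0098 * x))
m <- glm(y ~ x, family = binomial)

# Pearson dispersion: sum of squared Pearson residuals over residual df
phi <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
phi   # near 1 for well-behaved binary data
```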
Binary variable

The summary table can be adjusted with the dispersion parameter:

> summary(modell, dispersion=dp)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.616628  -6.073 1.26e-09 ***
resources    0.009762   0.001443   6.765 1.33e-11 ***
---
(Dispersion parameter for binomial family taken to be 0.8472199)

    Null deviance: 200.170 on 149 degrees of freedom
Residual deviance:  97.152 on 148 degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

These values can now be taken at face value.

How good is the model? 1 − (Res. Dev. / Null Dev.) = 1 − 97.152/200.170 = 51.47%
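This "proportion of deviance explained" can be computed directly from any fitted glm; a sketch on simulated data (made-up coefficients):

```r
set.seed(3)
x <- runif(150, 0, 1000)
y <- rbinom(150, 1, plogis(-3.7 + 0.0098 * x))
m <- glm(y ~ x, family = binomial)

# 1 - Residual deviance / Null deviance
r2 <- 1 - deviance(m) / m$null.deviance
r2
```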
> summary(modell)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.170 on 149 degrees of freedom
Residual deviance:  97.152 on 148 degrees of freedom
AIC: 101.15

> summary(modelp)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.1437759  0.3448511  -6.217 5.08e-10 ***
resources    0.0055046  0.0007811   7.047 1.82e-12 ***

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.170 on 149 degrees of freedom
Residual deviance:  97.024 on 148 degrees of freedom
AIC: 101.02
> summary(modelcl)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.5902574  0.4293153  -6.033 1.60e-09 ***
resources    0.0053519  0.0008337   6.419 1.37e-10 ***

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.17 on 149 degrees of freedom
Residual deviance: 102.30 on 148 degrees of freedom
AIC: 106.30

> summary(modelca)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.540198   1.644250  -3.369 0.000753 ***
resources    0.014612   0.004205   3.475 0.000510 ***

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.17 on 149 degrees of freedom
Residual deviance:  99.69 on 148 degrees of freedom
AIC: 103.69
Bootstrapping

> modell <- glm(occupied~resources, family=binomial)
> bcoef <- matrix(0, 1000, 2)
> for (i in 1:1000) {
+   indices <- sample(1:150, replace=T)
+   x <- resources[indices]
+   y <- occupied[indices]
+   model_b <- glm(y~x, family=binomial)   # refit on the resample (don't overwrite modell)
+   bcoef[i,] <- model_b$coef
+ }
> par(mfrow=c(1,2))
> plot(density(bcoef[,2]), xlab="Coefficient of x", main="")
> abline(v=quantile(bcoef[,2], c(0.025,0.975)), lty=2, col=4)
> plot(density(bcoef[,1]), xlab="Intercept", main="")
> abline(v=quantile(bcoef[,1], c(0.025,0.975)), lty=2, col=4)
[Figure: bootstrap densities of the slope ("Coefficient of x", roughly 0.005-0.020) and the intercept (roughly -8 to -2), with dashed lines at the 2.5% and 97.5% quantiles]
Jackknifing

> jcoef <- matrix(0, 150, 2)
> for (i in 1:150) {
+   modelj <- glm(occupied[-i]~resources[-i], family=binomial)
+   jcoef[i,] <- modelj$coef
+ }
> par(mfrow=c(1,2))
> plot(density(jcoef[,2]), xlab="Coefficient of x", main="")
> abline(v=quantile(jcoef[,2], c(0.025,0.975)), lty=2, col=4)
> plot(density(jcoef[,1]), xlab="Intercept", main="")
> abline(v=quantile(jcoef[,1], c(0.025,0.975)), lty=2, col=4)
[Figure: jackknife densities of the slope (roughly 0.0098-0.0106) and the intercept (roughly -4.00 to -3.70), with dashed lines at the 2.5% and 97.5% quantiles]
C.I.'s

> library(boot)
> reg.boot <- function(regdat, index){
+   x <- resources[index]
+   y <- occupied[index]
+   modell <- glm(y~x, family=binomial)
+   coef(modell)
+ }
> reg.model <- boot(occupy, reg.boot, R=10000)
> boot.ci(reg.model, index=2)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

Intervals :
Level      Normal              Basic         
95%   ( 0.0059,  0.0128 )   ( 0.0051,  0.0120 )

Level     Percentile            BCa          
95%   ( 0.0075,  0.0144 )   ( 0.0070,  0.0132 )
Calculations and Intervals on Original Scale
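The percentile interval reported by boot.ci can be reproduced by hand in base R; a sketch on simulated data (made-up coefficients, and only 500 resamples for speed):

```r
set.seed(1)
x <- runif(150, 0, 1000)
y <- rbinom(150, 1, plogis(-3.7 + 0.0098 * x))

# Resample rows with replacement and collect the refitted slope each time
bslope <- replicate(500, {
  idx <- sample(150, replace = TRUE)
  coef(suppressWarnings(glm(y[idx] ~ x[idx], family = binomial)))[2]
})

quantile(bslope, c(0.025, 0.975))   # percentile 95% CI for the slope
```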
> jack.after.boot(reg.model, index=2)

[Figure: jackknife-after-bootstrap plot for the slope - centred 5, 10, 16, 50, 84, 90 and 95 percentiles of (T* - t) plotted against the standardized jackknife values, with observation numbers along the bottom; one observation (the 108th) stands apart from the rest]
108th observation?

> occupy[105:110,]
    resources occupied
105  703.1783        1
106  710.1274        1
107  716.7298        1
108  717.1994        0
109  733.3538        1
110  736.3060        1
> plot(resources, occupied)
> text(resources[108], occupied[108], "Here", cex=1.5, col="blue", pos=3)

OR

> fat.arrow <- function(size.x=0.5, size.y=0.5, ar.col="red"){
+   size.x <- size.x*(par("usr")[2]-par("usr")[1])*0.1
+   size.y <- size.y*(par("usr")[4]-par("usr")[3])*0.1
+   pos <- locator(1)
+   xc <- c(0,1,0.5,0.5,-0.5,-0.5,-1,0)
+   yc <- c(0,1,1,6,6,1,1,0)
+   polygon(pos$x+size.x*xc, pos$y+size.y*yc, col=ar.col)
+ }
> fat.arrow()
[Figure: occupied (0.0-1.0) vs resources (0-1000), with the 108th observation highlighted]
Yoon G Kim, ygk1@humboldt.edu
Thank You!