
Categorical Data Analysis Second Edition ALAN AGRESTI University of Florida Gainesville, Florida



  • This book is printed on acid-free paper. Copyright © 2002 John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.

    Published simultaneously in Canada.

    No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA

    01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: [email protected].

    For ordering and customer service, call 1-800-CALL-WILEY.

    Library of Congress Cataloging-in-Publication Data Is Available

    ISBN 0-471-36093-7

    Printed in the United States of America

    10 9 8 7 6 5 4 3 2 1

  • To Jacki

  • Contents

    Preface xiii

    1. Introduction: Distributions and Inference for Categorical Data 1

       1.1 Categorical Response Data, 1
       1.2 Distributions for Categorical Data, 5
       1.3 Statistical Inference for Categorical Data, 9
       1.4 Statistical Inference for Binomial Parameters, 14
       1.5 Statistical Inference for Multinomial Parameters, 21
       Notes, 26
       Problems, 27

    2. Describing Contingency Tables 36

       2.1 Probability Structure for Contingency Tables, 36
       2.2 Comparing Two Proportions, 43
       2.3 Partial Association in Stratified 2 × 2 Tables, 47
       2.4 Extensions for I × J Tables, 54
       Notes, 59
       Problems, 60

    3. Inference for Contingency Tables 70

       3.1 Confidence Intervals for Association Parameters, 70
       3.2 Testing Independence in Two-Way Contingency Tables, 78
       3.3 Following-Up Chi-Squared Tests, 80
       3.4 Two-Way Tables with Ordered Classifications, 86
       3.5 Small-Sample Tests of Independence, 91
       3.6 Small-Sample Confidence Intervals for 2 × 2 Tables,* 98
       3.7 Extensions for Multiway Tables and Nontabulated Responses, 101
       Notes, 102
       Problems, 104

    4. Introduction to Generalized Linear Models 115

       4.1 Generalized Linear Model, 116
       4.2 Generalized Linear Models for Binary Data, 120
       4.3 Generalized Linear Models for Counts, 125
       4.4 Moments and Likelihood for Generalized Linear Models,* 132
       4.5 Inference for Generalized Linear Models, 139
       4.6 Fitting Generalized Linear Models, 143
       4.7 Quasi-likelihood and Generalized Linear Models,* 149
       4.8 Generalized Additive Models,* 153
       Notes, 155
       Problems, 156

    5. Logistic Regression 165

       5.1 Interpreting Parameters in Logistic Regression, 166
       5.2 Inference for Logistic Regression, 172
       5.3 Logit Models with Categorical Predictors, 177
       5.4 Multiple Logistic Regression, 182
       5.5 Fitting Logistic Regression Models, 192
       Notes, 196
       Problems, 197

    6. Building and Applying Logistic Regression Models 211

       6.1 Strategies in Model Selection, 211
       6.2 Logistic Regression Diagnostics, 219
       6.3 Inference About Conditional Associations in 2 × 2 × K Tables, 230
       6.4 Using Models to Improve Inferential Power, 236
       6.5 Sample Size and Power Considerations,* 240
       6.6 Probit and Complementary Log-Log Models,* 245
       6.7 Conditional Logistic Regression and Exact Distributions,* 250
       Notes, 257
       Problems, 259

    *Sections marked with an asterisk are less important for an overview.

    7. Logit Models for Multinomial Responses 267

       7.1 Nominal Responses: Baseline-Category Logit Models, 267
       7.2 Ordinal Responses: Cumulative Logit Models, 274
       7.3 Ordinal Responses: Cumulative Link Models, 282
       7.4 Alternative Models for Ordinal Responses,* 286
       7.5 Testing Conditional Independence in I × J × K Tables,* 293
       7.6 Discrete-Choice Multinomial Logit Models,* 298
       Notes, 300
       Problems, 302

    8. Loglinear Models for Contingency Tables 314

       8.1 Loglinear Models for Two-Way Tables, 314
       8.2 Loglinear Models for Independence and Interaction in Three-Way Tables, 318
       8.3 Inference for Loglinear Models, 324
       8.4 Loglinear Models for Higher Dimensions, 326
       8.5 The Loglinear–Logit Model Connection, 330
       8.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions,* 333
       8.7 Loglinear Model Fitting: Iterative Methods and Their Application,* 342
       Notes, 346
       Problems, 347

    9. Building and Extending Loglinear/Logit Models 357

       9.1 Association Graphs and Collapsibility, 357
       9.2 Model Selection and Comparison, 360
       9.3 Diagnostics for Checking Models, 366
       9.4 Modeling Ordinal Associations, 367
       9.5 Association Models,* 373
       9.6 Association Models, Correlation Models, and Correspondence Analysis,* 379
       9.7 Poisson Regression for Rates, 385
       9.8 Empty Cells and Sparseness in Modeling Contingency Tables, 391
       Notes, 398
       Problems, 400

    10. Models for Matched Pairs 409

       10.1 Comparing Dependent Proportions, 410
       10.2 Conditional Logistic Regression for Binary Matched Pairs, 414
       10.3 Marginal Models for Square Contingency Tables, 420
       10.4 Symmetry, Quasi-symmetry, and Quasi-independence, 423
       10.5 Measuring Agreement Between Observers, 431
       10.6 Bradley–Terry Model for Paired Preferences, 436
       10.7 Marginal Models and Quasi-symmetry Models for Matched Sets,* 439
       Notes, 442
       Problems, 444

    11. Analyzing Repeated Categorical Response Data 455

       11.1 Comparing Marginal Distributions: Multiple Responses, 456
       11.2 Marginal Modeling: Maximum Likelihood Approach, 459
       11.3 Marginal Modeling: Generalized Estimating Equations Approach, 466
       11.4 Quasi-likelihood and Its GEE Multivariate Extension: Details,* 470
       11.5 Markov Chains: Transitional Modeling, 476
       Notes, 481
       Problems, 482

    12. Random Effects: Generalized Linear Mixed Models for Categorical Responses 491

       12.1 Random Effects Modeling of Clustered Categorical Data, 492
       12.2 Binary Responses: Logistic-Normal Model, 496
       12.3 Examples of Random Effects Models for Binary Data, 502
       12.4 Random Effects Models for Multinomial Data, 513
       12.5 Multivariate Random Effects Models for Binary Data, 516
       12.6 GLMM Fitting, Inference, and Prediction, 520
       Notes, 526
       Problems, 527

    13. Other Mixture Models for Categorical Data* 538

       13.1 Latent Class Models, 538
       13.2 Nonparametric Random Effects Models, 545
       13.3 Beta-Binomial Models, 553
       13.4 Negative Binomial Regression, 559
       13.5 Poisson Regression with Random Effects, 563
       Notes, 565
       Problems, 566

    14. Asymptotic Theory for Parametric Models 576

       14.1 Delta Method, 577
       14.2 Asymptotic Distributions of Estimators of Model Parameters and Cell Probabilities, 582
       14.3 Asymptotic Distributions of Residuals and Goodness-of-Fit Statistics, 587
       14.4 Asymptotic Distributions for Logit/Loglinear Models, 592
       Notes, 594
       Problems, 595

    15. Alternative Estimation Theory for Parametric Models 600

       15.1 Weighted Least Squares for Categorical Data, 600
       15.2 Bayesian Inference for Categorical Data, 604
       15.3 Other Methods of Estimation, 611
       Notes, 615
       Problems, 616

    16. Historical Tour of Categorical Data Analysis* 619

       16.1 Pearson–Yule Association Controversy, 619
       16.2 R. A. Fisher’s Contributions, 622
       16.3 Logistic Regression, 624
       16.4 Multiway Contingency Tables and Loglinear Models, 625
       16.5 Recent (and Future?) Developments, 629

    Appendix A. Using Computer Software to Analyze Categorical Data 632

    A.1 Software for Categorical Data Analysis, 632
    A.2 Examples of SAS Code by Chapter, 634

    Appendix B. Chi-Squared Distribution Values 654

    References 655

    Examples Index 689

    Author Index 693

    Subject Index 701

  • Preface

    The explosion in the development of methods for analyzing categorical data that began in the 1960s has continued apace in recent years. This book provides an overview of these methods, as well as older, now standard, methods. It gives special emphasis to generalized linear modeling techniques, which extend linear model methods for continuous variables, and their extensions for multivariate responses.

    Today, because of this development and the ubiquity of categorical data in applications, most statistics and biostatistics departments offer courses on categorical data analysis. This book can be used as a text for such courses. The material in Chapters 1–7 forms the heart of most courses. Chapters 1–3 cover distributions for categorical responses and traditional methods for two-way contingency tables. Chapters 4–7 introduce logistic regression and related logit models for binary and multicategory response variables. Chapters 8 and 9 cover loglinear models for contingency tables. Over time, this model class seems to have lost importance, and this edition reduces somewhat its discussion of them and expands its focus on logistic regression.

    In the past decade, the major area of new research has been the development of methods for repeated measurement and other forms of clustered categorical data. Chapters 10–13 present these methods, including marginal models and generalized linear mixed models with random effects. Chapters 14 and 15 present theoretical foundations as well as alternatives to the maximum likelihood paradigm that this text adopts. Chapter 16 is devoted to a historical overview of the development of the methods. It examines contributions of noted statisticians, such as Pearson and Fisher, whose pioneering efforts, and sometimes vocal debates, broke the ground for this evolution.

    Every chapter of the first edition has been extensively rewritten, and some substantial additions and changes have occurred. The major differences are:

    - A new Chapter 1 that introduces distributions and methods of inference for categorical data.
    - A unified presentation of models as special cases of generalized linear models, starting in Chapter 4 and then throughout the text.
    - Greater emphasis on logistic regression for binary response variables and extensions for multicategory responses, with Chapters 4–7 introducing models and Chapters 10–13 extending them for clustered data.
    - Three new chapters on methods for clustered, correlated categorical data, increasingly important in applications.
    - A new chapter on the historical development of the methods.
    - More discussion of “exact” small-sample procedures and of conditional logistic regression.

    In this text, I interpret categorical data analysis to refer to methods for categorical response variables. For most methods, explanatory variables can be qualitative or quantitative, as in ordinary regression. Thus, the focus is intended to be more general than contingency table analysis, although for simplicity of data presentation, most examples use contingency tables. These examples are often simplistic, but should help readers focus on understanding the methods themselves and make it easier for them to replicate results with their favorite software.

    Special features of the text include:

    - More than 100 analyses of “real” data sets.
    - More than 600 exercises at the end of the chapters, some directed towards theory and methods and some towards applications and data analysis.
    - An appendix that shows, by chapter, the use of SAS for performing analyses presented in this book.
    - Notes at the end of each chapter that provide references for recent research and many topics not covered in the text.

    Appendix A summarizes statistical software needed to use the methods described in this text. It shows how to use SAS for analyses included in the text and refers to a Web site (www.stat.ufl.edu/~aa/cda/cda.html) that contains (1) information on the use of other software (such as R, S-Plus, SPSS, and Stata), (2) data sets for examples in the form of complete SAS programs for conducting the analyses, (3) short answers for many of the odd-numbered exercises, (4) corrections of errors in early printings of the book, and (5) extra exercises. I recommend that readers refer to this appendix or specialized manuals while reading the text, as an aid to implementing the methods.

    I intend this book to be accessible to the diverse mix of students who take graduate-level courses in categorical data analysis. But I have also written it with practicing statisticians and biostatisticians in mind. I hope it enables them to catch up with recent advances and learn about methods that sometimes receive inadequate attention in the traditional statistics curriculum.


    The development of new methods has influenced, and been influenced by, the increasing availability of data sets with categorical responses in the social, behavioral, and biomedical sciences, as well as in public health, human genetics, ecology, education, marketing, and industrial quality control. And so, although this book is directed mainly to statisticians and biostatisticians, I also aim for it to be helpful to methodologists in these fields.

    Readers should possess a background that includes regression and analysis of variance models, as well as maximum likelihood methods of statistical theory. Those not having much theory background should be able to follow most methodological discussions. Sections and subsections marked with an asterisk are less important for an overview. Readers with mainly applied interests can skip most of Chapter 4 on the theory of generalized linear models and proceed to other chapters. However, the book has a distinctly higher technical level and is more thorough and complete than my lower-level text, An Introduction to Categorical Data Analysis (Wiley, 1996).

    I thank those who commented on parts of the manuscript or provided help of some type. Special thanks to Bernhard Klingenberg, who read several chapters carefully and made many helpful suggestions, Yongyi Min, who constructed many of the figures and helped with some software, and Brian Caffo, who helped with some examples. Many thanks to Rosyln Stone and Brian Marx for each reviewing half the manuscript and Brian Caffo, I-Ming Liu, and Yongyi Min for giving insightful comments on several chapters. Thanks to Constantine Gatsonis and his students for using a draft in a course at Brown University and providing suggestions. Others who provided comments on chapters or help of some type include Patricia Altham, Wicher Bergsma, Jane Brockmann, Brent Coull, Al DeMaris, Regina Dittrich, Jianping Dong, Herwig Friedl, Ralitza Gueorguieva, James Hobert, Walter Katzenbeisser, Harry Khamis, Svend Kreiner, Joseph Lang, Jason Liao, Mojtaba Ganjali, Jane Pendergast, Michael Radelet, Kenneth Small, Maura Stokes, Tom Ten Have, and Rongling Wu. I thank my co-authors on various projects, especially Brent Coull, Joseph Lang, James Booth, James Hobert, Brian Caffo, and Ranjini Natarajan, for permission to use material from those articles. Thanks to the many who reviewed material or suggested examples for the first edition, mentioned in the Preface of that edition. Thanks also to Wiley Executive Editor Steve Quigley for his steadfast encouragement and facilitation of this project. Finally, thanks to my wife Jacki Levine for continuing support of all kinds, despite the many days this work has taken from our time together.

    ALAN AGRESTI

    Gainesville, Florida
    November 2001

  • CHAPTER 1

    Introduction: Distributions and Inference for Categorical Data

    From helping to assess the value of new medical treatments to evaluating the factors that affect our opinions and behaviors, analysts today are finding myriad uses for categorical data methods. In this book we introduce these methods and the theory behind them.

    Statistical methods for categorical responses were late in gaining the level of sophistication achieved early in the twentieth century by methods for continuous responses. Despite influential work around 1900 by the British statistician Karl Pearson, relatively little development of models for categorical responses occurred until the 1960s. In this book we describe the early fundamental work that still has importance today but place primary emphasis on more recent modeling approaches. Before outlining the topics covered, we describe the major types of categorical data.

    1.1 CATEGORICAL RESPONSE DATA

    A categorical variable has a measurement scale consisting of a set of categories. For instance, political philosophy is often measured as liberal, moderate, or conservative. Diagnoses regarding breast cancer based on a mammogram use the categories normal, benign, probably benign, suspicious, and malignant.

    The development of methods for categorical variables was stimulated by research studies in the social and biomedical sciences. Categorical scales are pervasive in the social sciences for measuring attitudes and opinions. Categorical scales in biomedical sciences measure outcomes such as whether a medical treatment is successful.

    Although categorical data are common in the social and biomedical sciences, they are by no means restricted to those areas. They frequently occur in the behavioral sciences (e.g., type of mental illness, with the categories schizophrenia, depression, neurosis), epidemiology and public health (e.g., contraceptive method at last intercourse, with the categories none, condom, pill, IUD, other), genetics (type of allele inherited by an offspring), zoology (e.g., alligators’ primary food preference, with the categories fish, invertebrate, reptile), education (e.g., student responses to an exam question, with the categories correct and incorrect), and marketing (e.g., consumer preference among leading brands of a product, with the categories brand A, brand B, and brand C). They even occur in highly quantitative fields such as engineering sciences and industrial quality control. Examples are the classification of items according to whether they conform to certain standards, and subjective evaluation of some characteristic: how soft to the touch a certain fabric is, how good a particular food product tastes, or how easy to perform a worker finds a certain task to be.

    Categorical variables are of many types. In this section we provide ways of classifying them and other variables.

    1.1.1 Response–Explanatory Variable Distinction

    Most statistical analyses distinguish between response (or dependent) variables and explanatory (or independent) variables. For instance, regression models describe how the mean of a response variable, such as the selling price of a house, changes according to the values of explanatory variables, such as square footage and location. In this book we focus on methods for categorical response variables. As in ordinary regression, explanatory variables can be of any type.

    1.1.2 Nominal–Ordinal Scale Distinction

    Categorical variables have two primary types of scales. Variables having categories without a natural ordering are called nominal. Examples are religious affiliation (with the categories Catholic, Protestant, Jewish, Muslim, other), mode of transportation to work (automobile, bicycle, bus, subway, walk), favorite type of music (classical, country, folk, jazz, rock), and choice of residence (apartment, condominium, house, other). For nominal variables, the order of listing the categories is irrelevant. The statistical analysis does not depend on that ordering.

    Many categorical variables do have ordered categories. Such variables are called ordinal. Examples are size of automobile (subcompact, compact, midsize, large), social class (upper, middle, lower), political philosophy (liberal, moderate, conservative), and patient condition (good, fair, serious, critical). Ordinal variables have ordered categories, but distances between categories are unknown. Although a person categorized as moderate is more liberal than a person categorized as conservative, no numerical value describes how much more liberal that person is. Methods for ordinal variables utilize the category ordering.


    An interval variable is one that does have numerical distances between any two values. For example, blood pressure level, functional life length of television set, length of prison term, and annual income are interval variables. (An interval variable is sometimes called a ratio variable if ratios of values are also valid.)

    The way that a variable is measured determines its classification. For example, “education” is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor’s, master’s, and doctorate; it is interval when measured by number of years of education, using the integers 0, 1, 2, . . . .

    A variable’s measurement scale determines which statistical methods are appropriate. In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels. For instance, statistical methods for nominal variables can be used with ordinal variables by ignoring the ordering of categories. Methods for ordinal variables cannot, however, be used with nominal variables, since their categories have no meaningful ordering. It is usually best to apply methods appropriate for the actual scale.

    Since this book deals with categorical responses, we discuss the analysis of nominal and ordinal variables. The methods also apply to interval variables having a small number of distinct values (e.g., number of times married) or for which the values are grouped into ordered categories (e.g., education measured as <10 years, 10–12 years, >12 years).

    1.1.3 Continuous–Discrete Variable Distinction

    Variables are classified as continuous or discrete, according to the number of values they can take. Actual measurement of all variables occurs in a discrete manner, due to precision limitations in measuring instruments. The continuous–discrete classification, in practice, distinguishes between variables that take lots of values and variables that take few values. For instance, statisticians often treat discrete interval variables having a large number of values (such as test scores) as continuous, using them in methods for continuous responses.

    This book deals with certain types of discretely measured responses: (1) nominal variables, (2) ordinal variables, (3) discrete interval variables having relatively few values, and (4) continuous variables grouped into a small number of categories.

    1.1.4 Quantitative–Qualitative Variable Distinction

    Nominal variables are qualitative: distinct categories differ in quality, not in quantity. Interval variables are quantitative: distinct levels have differing amounts of the characteristic of interest. The position of ordinal variables in the quantitative–qualitative classification is fuzzy. Analysts often treat them as qualitative, using methods for nominal variables. But in many respects, ordinal variables more closely resemble interval variables than they resemble nominal variables. They possess important quantitative features: each category has a greater or smaller magnitude of the characteristic than another category; and although not possible to measure, an underlying continuous variable is usually present. The political philosophy classification (liberal, moderate, conservative) crudely measures an inherently continuous characteristic.

    Analysts often utilize the quantitative nature of ordinal variables by assigning numerical scores to categories or assuming an underlying continuous distribution. This requires good judgment and guidance from researchers who use the scale, but it provides benefits in the variety of methods available for data analysis.

    1.1.5 Organization of This Book

    The models for categorical response variables discussed in this book resemble regression models for continuous response variables; however, they assume binomial, multinomial, or Poisson response distributions instead of normality. Two types of models receive special attention, logistic regression and loglinear models. Ordinary logistic regression models, also called logit models, apply with binary (i.e., two-category) responses and assume a binomial distribution. Generalizations of logistic regression apply with multicategory responses and assume a multinomial distribution. Loglinear models apply with count data and assume a Poisson distribution. Certain equivalences exist between logistic regression and loglinear models.

    The book has four main units. In the first, Chapters 1 through 3, we summarize descriptive and inferential methods for univariate and bivariate categorical data. These chapters cover discrete distributions, methods of inference, and analyses for measures of association. They summarize the non-model-based methods developed prior to about 1960.

    In the second and primary unit, Chapters 4 through 9, we introduce models for categorical responses. In Chapter 4 we describe a class of generalized linear models having models of this text as special cases. We focus on models for binary and count response variables. Chapters 5 and 6 cover the most important model for binary responses, logistic regression. In Chapter 7 we present generalizations of that model for nominal and ordinal multicategory response variables. In Chapter 8 we introduce the modeling of multivariate categorical response data and show how to represent association and interaction patterns by loglinear models for counts in the table that cross-classifies those responses. In Chapter 9 we discuss model building with loglinear and related logistic models and present some related models.

    In the third unit, Chapters 10 through 13, we discuss models for handling repeated measurement and other forms of clustering. In Chapter 10 we present models for a categorical response with matched pairs; these apply, for instance, with a categorical response measured for the same subjects at two times. Chapter 11 covers models for more general types of repeated categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 12 we present a broad class of models, generalized linear mixed models, that use random effects to account for dependence with such data. In Chapter 13 further extensions and applications of the models from Chapters 10 through 12 are described.

    The fourth and final unit is more theoretical. In Chapter 14 we develop asymptotic theory for categorical data models. This theory is the basis for large-sample behavior of model parameter estimators and goodness-of-fit statistics. Maximum likelihood estimation receives primary attention here and throughout the book, but Chapter 15 covers alternative methods of estimation, such as the Bayesian paradigm. Chapter 16 stands alone from the others, being a historical overview of the development of categorical data methods.

    Most categorical data methods require extensive computations, and statistical software is necessary for their effective use. In Appendix A we discuss software that can perform the analyses in this book and show the use of SAS for text examples. See the Web site www.stat.ufl.edu/~aa/cda/cda.html to download sample programs and data sets and find information about other software.

    Chapter 1 provides background material. In Section 1.2 we review the key distributions for categorical data: the binomial, multinomial, and Poisson. In Section 1.3 we review the primary mechanisms for statistical inference, using maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting significance tests and confidence intervals for binomial and multinomial parameters.

    1.2 DISTRIBUTIONS FOR CATEGORICAL DATA

    Inferential data analyses require assumptions about the random mechanism that generated the data. For regression models with continuous responses, the normal distribution plays the central role. In this section we review the three key distributions for categorical responses: binomial, multinomial, and Poisson.

    1.2.1 Binomial Distribution

    Many applications refer to a fixed number n of binary observations. Let y_1, y_2, ..., y_n denote responses for n independent and identical trials such that P(Y_i = 1) = π and P(Y_i = 0) = 1 − π. We use the generic labels “success” and “failure” for outcomes 1 and 0. Identical trials means that the probability of success π is the same for each trial. Independent trials means that the {Y_i} are independent random variables. These are often called Bernoulli trials. The total number of successes, Y = Σ_{i=1}^n Y_i, has the binomial distribution with index n and parameter π, denoted by bin(n, π).

    The probability mass function for the possible outcomes y for Y is

        p(y) = [n! / (y! (n − y)!)] π^y (1 − π)^(n−y),   y = 0, 1, 2, ..., n,    (1.1)

    where n!/[y!(n − y)!] is the binomial coefficient. Since E(Y_i) = E(Y_i^2) = 1(π) + 0(1 − π) = π,

        E(Y_i) = π   and   var(Y_i) = π(1 − π).

    The binomial distribution for Y = Σ_i Y_i has mean and variance

        μ = E(Y) = nπ   and   σ^2 = var(Y) = nπ(1 − π).

    The skewness is described by E(Y − μ)^3/σ^3 = (1 − 2π)/√[nπ(1 − π)]. The distribution converges to normality as n increases, for fixed π.

    There is no guarantee that successive binary observations are independent or identical. Thus, occasionally, we will utilize other distributions. One such case is sampling binary outcomes without replacement from a finite population, such as observations on gender for 10 students sampled from a class of size 20. The hypergeometric distribution, studied in Section 3.5.1, is then relevant. In Section 1.2.4 we mention another case that violates these binomial assumptions.
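    The binomial formulas above are easy to check numerically. The following sketch is my own illustration, not code from the text; the function and variable names are assumptions, and it uses only the Python standard library to evaluate the pmf in (1.1) and confirm the mean, variance, and skewness formulas by direct moment sums.

    ```python
    # Illustrative check of the bin(n, pi) formulas; names are my own.
    from math import comb, sqrt

    def binom_pmf(y, n, pi):
        """p(y) in (1.1): C(n, y) * pi^y * (1 - pi)^(n - y)."""
        return comb(n, y) * pi**y * (1 - pi)**(n - y)

    n, pi = 10, 0.3
    pmf = [binom_pmf(y, n, pi) for y in range(n + 1)]

    total = sum(pmf)                                         # probabilities sum to 1
    mean = sum(y * p for y, p in enumerate(pmf))             # equals n*pi
    var = sum((y - mean)**2 * p for y, p in enumerate(pmf))  # equals n*pi*(1 - pi)
    skew = sum((y - mean)**3 * p for y, p in enumerate(pmf)) / var**1.5
    # skew matches (1 - 2*pi) / sqrt(n*pi*(1 - pi))
    ```

    With n = 10 and π = 0.3, the moment sums reproduce nπ = 3 and nπ(1 − π) = 2.1 to floating-point accuracy.
    
    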

    1.2.2 Multinomial Distribution

    Some trials have more than two possible outcomes. Suppose that each of nindependent, identical trials can have outcome in any of c categories. Lety s 1 if trial i has outcome in category j and y s 0 otherwise. Theni j i j

    Ž .y s y , y , . . . , y represents a multinomial trial, with Ý y s 1; fori i1 i2 ic j i jŽ .instance, 0, 0, 1, 0 denotes outcome in category 3 of four possible categories.

    Note that y is redundant, being linearly dependent on the others. Leticn sÝ y denote the number of trials having outcome in category j. Thej i i j

    Ž .counts n , n , . . . , n have the multinomial distribution.1 2 cŽ .Let � s P Y s 1 denote the probability of outcome in category j forj i j

    each trial. The multinomial probability mass function is

    n!n n n1 2 cp n , n , . . . , n s � � ��� � . 1.2Ž . Ž .1 2 cy1 1 2 cž /n ! n ! ��� n !1 2 c

  • DISTRIBUTIONS FOR CATEGORICAL DATA 7

    Ž . ŽSince Ý n s n, this is cy1 -dimensional, with n s n y n q ���j j c 1.qn . The binomial distribution is the special case with c s 2.cy1

    For the multinomial distribution,

$$E(n_j) = n\pi_j, \quad \operatorname{var}(n_j) = n\pi_j(1-\pi_j), \quad \operatorname{cov}(n_j, n_k) = -n\pi_j\pi_k. \qquad (1.3)$$

We derive the covariance in Section 14.1.4. The marginal distribution of each $n_j$ is binomial.
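The moments in (1.3) can be checked against scipy's multinomial distribution. In this sketch, $n = 20$ and $\boldsymbol{\pi} = (0.2, 0.3, 0.5)$ are hypothetical values chosen for illustration.

```python
# Numerically check (1.3) using scipy's multinomial covariance
# (n = 20 and pi = (0.2, 0.3, 0.5) are illustrative values).
import numpy as np
from scipy.stats import multinomial, binom

n = 20
pi = np.array([0.2, 0.3, 0.5])

# Covariance matrix: diagonal n*pi_j*(1 - pi_j), off-diagonal -n*pi_j*pi_k
cov = multinomial.cov(n, pi)
assert np.allclose(np.diag(cov), n * pi * (1 - pi))
assert np.isclose(cov[0, 1], -n * pi[0] * pi[1])

# The marginal of n_j is binomial(n, pi_j): its mean and variance
# match E(n_j) = n*pi_j and the diagonal of the covariance matrix
m, v = binom.stats(n, pi[0], moments="mv")
assert np.isclose(m, n * pi[0]) and np.isclose(v, cov[0, 0])
```

The negative off-diagonal covariances reflect the constraint $\sum_j n_j = n$: when one count is high, the others tend to be low.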

    1.2.3 Poisson Distribution

Sometimes, count data do not result from a fixed number of trials. For instance, if $y$ = number of deaths due to automobile accidents on motorways in Italy during this coming week, there is no fixed upper limit $n$ for $y$ (as you are aware if you have driven in Italy). Since $y$ must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson. Its probabilities depend on a single parameter, the mean $\mu$. The Poisson probability mass function (Poisson 1837, p. 206) is

$$p(y) = \frac{e^{-\mu} \mu^y}{y!}, \qquad y = 0, 1, 2, \ldots. \qquad (1.4)$$

It satisfies $E(Y) = \operatorname{var}(Y) = \mu$. It is unimodal with mode equal to the integer part of $\mu$. Its skewness is described by $E(Y-\mu)^3/\sigma^3 = 1/\sqrt{\mu}$. The distribution approaches normality as $\mu$ increases.

The Poisson distribution is used for counts of events that occur randomly over time or space, when outcomes in disjoint periods or regions are independent. It also applies as an approximation for the binomial when $n$ is large and $\pi$ is small, with $\mu = n\pi$. So if each of the 50 million people driving in Italy next week is an independent trial with probability 0.000002 of dying in a fatal accident that week, the number of deaths $Y$ is a bin(50,000,000, 0.000002) variate, or approximately Poisson with $\mu = n\pi = 50{,}000{,}000(0.000002) = 100$.
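The quality of this approximation is easy to see numerically, using the text's own values $n = 50{,}000{,}000$ and $\pi = 0.000002$; the specific $y$ values probed below are illustrative.

```python
# Compare the binomial pmf to its Poisson approximation for large n,
# small pi, with mu = n*pi = 100 (the text's Italy example).
from scipy.stats import binom, poisson

n, pi = 50_000_000, 0.000002
mu = n * pi  # 100

# The two pmfs agree very closely near the mean
for y in (90, 100, 110):
    b = binom.pmf(y, n, pi)
    p = poisson.pmf(y, mu)
    assert abs(b - p) < 1e-6
```

For fixed $\mu = n\pi$, the agreement improves as $n$ grows and $\pi$ shrinks.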

A key feature of the Poisson distribution is that its variance equals its mean. Sample counts vary more when their mean is higher. When the mean number of weekly fatal accidents equals 100, greater variability occurs in the weekly counts than when the mean equals 10.

    1.2.4 Overdispersion

In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is called overdispersion. We assumed above that each person has the same probability of dying in a fatal accident in the next week. More realistically, these probabilities vary, due to factors such as amount of time spent driving, whether the person wears a seat belt, and geographical location. Such variation causes fatality counts to display more variation than predicted by the Poisson model.

Suppose that $Y$ is a random variable with variance $\operatorname{var}(Y \mid \lambda)$ for given $\lambda$, but $\lambda$ itself varies because of unmeasured factors such as those just described. Let $\mu = E(\lambda)$. Then unconditionally,

$$E(Y) = E\left[E(Y \mid \lambda)\right], \qquad \operatorname{var}(Y) = E\left[\operatorname{var}(Y \mid \lambda)\right] + \operatorname{var}\left[E(Y \mid \lambda)\right].$$

When $Y$ is conditionally Poisson (given $\lambda$), for instance, then $E(Y) = E(\lambda) = \mu$ and $\operatorname{var}(Y) = E(\lambda) + \operatorname{var}(\lambda) = \mu + \operatorname{var}(\lambda) > \mu$.

Assuming a Poisson distribution for a count variable is often too simplistic, because of factors that cause overdispersion. The negative binomial is a related distribution for count data that permits the variance to exceed the mean. We introduce it in Section 4.3.4.
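As a preview of that connection: when $\lambda$ follows a gamma distribution, the resulting mixture of Poissons is negative binomial, with variance $\mu + \mu^2/k$ exceeding the mean $\mu$. The sketch below uses illustrative values $\mu = 10$ and gamma shape $k = 5$ (not from the text), mapped to scipy's $(n, p)$ parameterization.

```python
# A gamma mixture of Poissons gives the negative binomial, whose
# variance mu + mu^2/k exceeds its mean mu (overdispersion).
# mu = 10 and k = 5 are illustrative values, not from the text.
from scipy.stats import nbinom

mu, k = 10.0, 5.0        # mean and gamma shape ("size") parameter
p = k / (k + mu)         # map (mu, k) to scipy's (n, p) parameterization
mean, var = nbinom.stats(k, p, moments="mv")

assert abs(mean - mu) < 1e-9
assert abs(var - (mu + mu**2 / k)) < 1e-9   # var = mu + mu^2/k
assert var > mean                            # variance exceeds the mean
```

As $k \to \infty$, the extra term $\mu^2/k$ vanishes and the negative binomial approaches the Poisson.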

Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true distribution is a mixture of different binomial distributions, with the parameter varying because of unmeasured variables. To illustrate, suppose that an experiment exposes pregnant mice to a toxin and then after a week observes the number of fetuses in each mouse's litter that show signs of malformation. Let $n_i$ denote the number of fetuses in the litter for mouse $i$. The mice also vary according to other factors that may not be measured, such as their weight, overall health, and genetic makeup. Extra variation then occurs because of the variability from litter to litter in the probability $\pi$ of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near $n_i$, showing more dispersion than expected for binomial sampling with a single value of $\pi$. Overdispersion could also occur when $\pi$ varies among fetuses in a litter according to some distribution (Problem 1.12). In Chapters 4, 12, and 13 we introduce methods for data that are overdispersed relative to binomial and Poisson assumptions.

    1.2.5 Connection between Poisson and Multinomial Distributions

In Italy this next week, let $y_1$ = number of people who die in automobile accidents, $y_2$ = number who die in airplane accidents, and $y_3$ = number who die in railway accidents. A Poisson model for $(Y_1, Y_2, Y_3)$ treats these as independent Poisson random variables, with parameters $(\mu_1, \mu_2, \mu_3)$. The joint probability mass function for $\{Y_i\}$ is the product of the three mass functions of form (1.4). The total $n = \sum Y_i$ also has a Poisson distribution, with parameter $\sum \mu_i$.

With Poisson sampling the total count $n$ is random rather than fixed. If we assume a Poisson model but condition on $n$, $\{Y_i\}$ no longer have Poisson distributions, since each $Y_i$ cannot exceed $n$. Given $n$, $\{Y_i\}$ are also no longer independent, since the value of one affects the possible range for the others.


For $c$ independent Poisson variates, with $E(Y_i) = \mu_i$, let's derive their conditional distribution given that $\sum Y_i = n$. The conditional probability of a set of counts $\{n_i\}$ satisfying this condition is

$$P\!\left(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c \,\Big|\, \sum_j Y_j = n\right) = \frac{P(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c)}{P\!\left(\sum_j Y_j = n\right)} = \frac{\prod_i \exp(-\mu_i)\,\mu_i^{n_i}/n_i!}{\exp\!\left(-\sum_j \mu_j\right)\left(\sum_j \mu_j\right)^{n}\!/n!} = \left(\frac{n!}{\prod_i n_i!}\right) \prod_i \pi_i^{n_i}, \qquad (1.5)$$

where $\pi_i = \mu_i/\left(\sum_j \mu_j\right)$. This is the multinomial $(n, \{\pi_i\})$ distribution, characterized by the sample size $n$ and the probabilities $\{\pi_i\}$.

Many categorical data analyses assume a multinomial distribution. Such analyses usually have the same parameter estimates as those of analyses assuming a Poisson distribution, because of the similarity in the likelihood functions.

    1.3 STATISTICAL INFERENCE FOR CATEGORICAL DATA

The choice of distribution for the response variable is but one step of data analysis. In practice, that distribution has unknown parameter values. In this section we review methods of using sample data to make inferences about the parameters. Sections 1.4 and 1.5 cover binomial and multinomial parameters.

    1.3.1 Likelihood Functions and Maximum Likelihood Estimation

In this book we use maximum likelihood for parameter estimation. Under weak regularity conditions, such as the parameter space having fixed dimension with true value falling in its interior, maximum likelihood estimators have desirable properties: They have large-sample normal distributions; they are asymptotically consistent, converging to the parameter as $n$ increases; and they are asymptotically efficient, producing large-sample standard errors no greater than those from other estimation methods.

Given the data, for a chosen probability distribution the likelihood function is the probability of those data, treated as a function of the unknown parameter. The maximum likelihood (ML) estimate is the parameter value that maximizes this function. This is the parameter value under which the data observed have the highest probability of occurrence. The parameter value that maximizes the likelihood function also maximizes the log of that function. It is simpler to maximize the log likelihood since it is a sum rather than a product of terms.


We denote a parameter for a generic problem by $\beta$ and its ML estimate by $\hat{\beta}$. The likelihood function is $\ell(\beta)$ and the log-likelihood function is $L(\beta) = \log[\ell(\beta)]$. For many models, $L(\beta)$ has concave shape and $\hat{\beta}$ is the point at which the derivative equals 0. The ML estimate is then the solution of the likelihood equation, $\partial L(\beta)/\partial \beta = 0$. Often, $\beta$ is multidimensional, denoted by $\boldsymbol{\beta}$, and $\hat{\boldsymbol{\beta}}$ is the solution of a set of likelihood equations.

Let $\mathrm{SE}$ denote the standard error of $\hat{\beta}$, and let $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ denote the asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$. Under regularity conditions (Rao 1973, p. 364), $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ is the inverse of the information matrix. The $(j, k)$ element of the information matrix is

$$-E\left(\frac{\partial^2 L(\boldsymbol{\beta})}{\partial \beta_j \, \partial \beta_k}\right). \qquad (1.6)$$

The standard errors are the square roots of the diagonal elements for the inverse information matrix. The greater the curvature of the log likelihood, the smaller the standard errors. This is reasonable, since large curvature implies that the log likelihood drops quickly as $\boldsymbol{\beta}$ moves away from $\hat{\boldsymbol{\beta}}$; hence, the data would have been much more likely to occur if $\boldsymbol{\beta}$ took a value near $\hat{\boldsymbol{\beta}}$ rather than a value far from $\hat{\boldsymbol{\beta}}$.

    1.3.2 Likelihood Function and ML Estimate for Binomial Parameter

The part of a likelihood function involving the parameters is called the kernel. Since the maximization of the likelihood is with respect to the parameters, the rest is irrelevant.

To illustrate, consider the binomial distribution (1.1). The binomial coefficient $\binom{n}{y}$ has no influence on where the maximum occurs with respect to $\pi$.

Thus, we ignore it and treat the kernel as the likelihood function. The binomial log likelihood is then

$$L(\pi) = \log\left[\pi^y (1-\pi)^{n-y}\right] = y \log \pi + (n-y) \log(1-\pi). \qquad (1.7)$$

Differentiating with respect to $\pi$ yields

$$\partial L(\pi)/\partial \pi = y/\pi - (n-y)/(1-\pi) = (y - n\pi)/\left[\pi(1-\pi)\right]. \qquad (1.8)$$

Equating this to 0 gives the likelihood equation, which has solution $\hat{\pi} = y/n$, the sample proportion of successes for the $n$ trials.

Calculating $\partial^2 L(\pi)/\partial \pi^2$, taking the expectation, and combining terms, we get

$$-E\left[\partial^2 L(\pi)/\partial \pi^2\right] = E\left[y/\pi^2 + (n-y)/(1-\pi)^2\right] = n/\left[\pi(1-\pi)\right]. \qquad (1.9)$$


Thus, the asymptotic variance of $\hat{\pi}$ is $\pi(1-\pi)/n$. This is no surprise. Since $E(Y) = n\pi$ and $\operatorname{var}(Y) = n\pi(1-\pi)$, the distribution of $\hat{\pi} = Y/n$ has mean and standard error

$$E(\hat{\pi}) = \pi, \qquad \sigma(\hat{\pi}) = \sqrt{\frac{\pi(1-\pi)}{n}}.$$
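As a quick numerical sketch (not part of the text), maximizing the binomial log likelihood (1.7) with scipy's optimizer reproduces $\hat{\pi} = y/n$; the data $y = 6$ successes in $n = 10$ trials are an illustrative choice.

```python
# Maximize the binomial log likelihood (1.7) numerically and confirm
# that the maximizer is pi-hat = y/n (y = 6, n = 10 are illustrative).
import numpy as np
from scipy.optimize import minimize_scalar

y, n = 6, 10

def neg_log_lik(pi):
    # negative of L(pi) = y log pi + (n - y) log(1 - pi)
    return -(y * np.log(pi) + (n - y) * np.log(1 - pi))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
pi_hat = res.x
assert abs(pi_hat - y / n) < 1e-4

# Estimated standard error from (1.9): sqrt(pi_hat*(1 - pi_hat)/n)
se = (pi_hat * (1 - pi_hat) / n) ** 0.5
```

Here the closed-form solution makes the optimizer unnecessary, but the same numerical approach applies to models without closed-form ML estimates.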

    1.3.3 Wald–Likelihood Ratio–Score Test Triad

Three standard ways exist to use the likelihood function to perform large-sample inference. We introduce these for a significance test of a null hypothesis $H_0$: $\beta = \beta_0$ and then discuss their relation to interval estimation. They all exploit the large-sample normality of ML estimators.

With nonnull standard error $\mathrm{SE}$ of $\hat{\beta}$, the test statistic

$$z = (\hat{\beta} - \beta_0)/\mathrm{SE}$$

has an approximate standard normal distribution when $\beta = \beta_0$. One refers $z$ to the standard normal table to obtain one- or two-sided $P$-values. Equivalently, for the two-sided alternative, $z^2$ has a chi-squared null distribution with 1 degree of freedom (df); the $P$-value is then the right-tailed chi-squared probability above the observed value. This type of statistic, using the nonnull standard error, is called a Wald statistic (Wald 1943).

The multivariate extension for the Wald test of $H_0$: $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ has test statistic

$$W = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)' \left[\operatorname{cov}(\hat{\boldsymbol{\beta}})\right]^{-1} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$

The prime on a vector or matrix denotes the transpose. The nonnull covariance is based on the curvature (1.6) of the log likelihood at $\hat{\boldsymbol{\beta}}$. The asymptotic multivariate normal distribution for $\hat{\boldsymbol{\beta}}$ implies an asymptotic chi-squared distribution for $W$. The df equal the rank of $\operatorname{cov}(\hat{\boldsymbol{\beta}})$, which is the number of nonredundant parameters in $\boldsymbol{\beta}$.

A second general-purpose method uses the likelihood function through the ratio of two maximizations: (1) the maximum over the possible parameter values under $H_0$, and (2) the maximum over the larger set of parameter values permitting $H_0$ or an alternative $H_a$ to be true. Let $\ell_0$ denote the maximized value of the likelihood function under $H_0$, and let $\ell_1$ denote the maximized value generally (i.e., under $H_0 \cup H_a$). For instance, for parameter vector $\boldsymbol{\beta} = (\beta_0, \beta_1)'$ and $H_0$: $\beta_0 = 0$, $\ell_1$ is the likelihood function calculated at the $\boldsymbol{\beta}$ value for which the data would have been most likely; $\ell_0$ is the likelihood function calculated at the $\boldsymbol{\beta}$ value for which the data would have been most likely, when $\beta_0 = 0$. Then $\ell_1$ is always at least as large as $\ell_0$, since $\ell_0$ results from maximizing over a restricted set of the parameter values.


The ratio $\Lambda = \ell_0/\ell_1$ of the maximized likelihoods cannot exceed 1. Wilks (1935, 1938) showed that $-2 \log \Lambda$ has a limiting null chi-squared distribution, as $n \to \infty$. The df equal the difference in the dimensions of the parameter spaces under $H_0 \cup H_a$ and under $H_0$. The likelihood-ratio test statistic equals

$$-2 \log \Lambda = -2 \log(\ell_0/\ell_1) = -2(L_0 - L_1),$$

where $L_0$ and $L_1$ denote the maximized log-likelihood functions.

The third method uses the score statistic, due to R. A. Fisher and C. R. Rao. The score test is based on the slope and expected curvature of the log-likelihood function $L(\beta)$ at the null value $\beta_0$. It utilizes the size of the score function

$$u(\beta) = \partial L(\beta)/\partial \beta,$$

evaluated at $\beta_0$. The value $u(\beta_0)$ tends to be larger in absolute value when $\hat{\beta}$ is farther from $\beta_0$. Denote $-E\left[\partial^2 L(\beta)/\partial \beta^2\right]$ (i.e., the information) evaluated at $\beta_0$ by $\iota(\beta_0)$. The score statistic is the ratio of $u(\beta_0)$ to its null $\mathrm{SE}$, which is $\left[\iota(\beta_0)\right]^{1/2}$. This has an approximate standard normal null distribution. The chi-squared form of the score statistic is

$$\frac{\left[u(\beta_0)\right]^2}{\iota(\beta_0)} = \frac{\left[\partial L(\beta)/\partial \beta\right]_0^2}{-E\left[\partial^2 L(\beta)/\partial \beta^2\right]_0},$$

where the partial derivative notation reflects derivatives with respect to $\beta$ that are evaluated at $\beta_0$. In the multiparameter case, the score statistic is a quadratic form based on the vector of partial derivatives of the log likelihood with respect to $\boldsymbol{\beta}$ and the inverse information matrix, both evaluated at the $H_0$ estimates (i.e., assuming that $\boldsymbol{\beta} = \boldsymbol{\beta}_0$).

    Ž .H estimates i.e., assuming that � s � .0 0Ž .Figure 1.1 is a generic plot of a log-likelihood L for the univariate

    case. It illustrates the three tests of H : s 0. The Wald test uses the0ˆ ˆ 2Ž . Ž .behavior of L at the ML estimate , having chi-squared form rSE .

    ˆ ˆŽ .The SE of depends on the curvature of L at . The score test is basedŽ .on the slope and curvature of L at s 0. The likelihood-ratio test

    ˆŽ .combines information about L at both and s 0. It compares the0ˆlog-likelihood values L at and L at s 0 using the chi-squared1 0 0

    Ž .statistic y2 L y L . In Figure 1.1, this statistic is twice the vertical dis-0 1ˆŽ .tance between values of L at and at 0. In a sense, this statistic uses the

    most information of the three types of test statistic and is the most versatile.As n ™ �, the Wald, likelihood-ratio, and score tests have certain asymp-

    Ž .totic equivalences Cox and Hinkley 1974, Sec. 9.3 . For small to moderatesample sizes, the likelihood-ratio test is usually more reliable than the Waldtest.
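The triad is easy to compute directly in the binomial case. In this sketch, $y = 6$ successes in $n = 10$ trials and the null value $\pi_0 = 0.3$ are illustrative choices, not from the text.

```python
# Compute the Wald, score, and likelihood-ratio statistics for
# H0: pi = pi0 in the binomial case (y = 6, n = 10, pi0 = 0.3 are
# illustrative values, not from the text).
import numpy as np

y, n, pi0 = 6, 10, 0.3
pi_hat = y / n

# Wald: uses the nonnull SE, sqrt(pi_hat*(1 - pi_hat)/n)
z_wald = (pi_hat - pi0) / np.sqrt(pi_hat * (1 - pi_hat) / n)

# Score: uses the null SE, sqrt(pi0*(1 - pi0)/n)
z_score = (pi_hat - pi0) / np.sqrt(pi0 * (1 - pi0) / n)

# Likelihood ratio: -2(L0 - L1), with L(pi) from (1.7)
def L(pi):
    return y * np.log(pi) + (n - y) * np.log(1 - pi)

lr = -2 * (L(pi0) - L(pi_hat))

# The three chi-squared forms are nonnegative but need not agree exactly
assert z_wald**2 > 0 and z_score**2 > 0 and lr > 0
```

In this small sample the three statistics differ noticeably, consistent with the remark above that their equivalence is only asymptotic.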