
Empirical Model Building

JAMES R. THOMPSON

Professor of Statistics, Rice University; Adjunct Professor of Biomathematics, University of Texas M. D. Anderson Cancer Center

WILEY

JOHN WILEY & SONS

New York Chichester Brisbane Toronto Singapore




Copyright © 1989 by John Wiley & Sons, Inc.

All rights reserved. Published simultaneously in Canada.

Reproduction or translation of any part of this work beyond that permitted by Section 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permission Department, John Wiley & Sons, Inc.

Library of Congress Cataloging in Publication Data:

Thompson, James R. (James Robert), 1938-
Empirical model building / James R. Thompson.

p. cm.-(Wiley series in probability and mathematical statistics. Probability and mathematical statistics, ISSN 0271-6232)

Bibliography: p.

ISBN 0-471-60105-5

1. Experimental design. 2. Mathematical models. 3. Mathematical statistics. I. Title. II. Series.

QA279.T49 1989 519.5-dc19 88-20549

CIP

Printed in the United States of America

10 9 8 7 6 5 4 3 2


To My Mother, Mary Haskins Thompson


Preface

Empirical model building refers to a mindset that lends itself to constructing practical models useful in describing and coping with real-world situations. It does not refer to quick and dirty methods, which are used simply because they are ones we understand and we have them readily available in the form of off-the-shelf software. The fraction of real situations which can appropriately be addressed using, say, a linear regression package or a Newton’s optimization routine or an autoregressive moving average forecasting procedure is small indeed. Most successful consultants are aware of this fact, but it escapes the attention of most academics and the orientation of most textbooks.

It is a recurring experience of students who receive their B.A. or their M.S. or their Ph.D. and journey forth into the real world of science, commerce, and industry, that they seldom find any problem quite like those they have been trained to solve in the university. They find that they have been studying tidy methodological packets of information which have rather limited application. If they are fortunate, they will accept the situation and begin to learn model building “on the job” as it were. On the other hand, it does appear a pity that so important a subject be relegated to the school of hard knocks. Consequently, a number of universities have experimented for the last 20 years or so with courses in model building.

The problem with most such courses and with books used in the courses is that they tend to focus on methodologies well understood by the instructor. After all, operations research, numerical analysis, statistics, and physical mathematics are all supposed to be branches of applied mathematics, and so anything presented from such a field is deemed applied. There are model building books that emphasize linear programming, others that emphasize queueing theory, and so on. The fact is, of course, that most “applied mathematics” is as divorced from the real world as algebraic topology and ring theory. Methodology, perceived as an end in itself, is usually a closed system that looks inward rather than a means to some end in the world outside the methodology.

Consultation is a subject very much related to model building. Increasingly, courses on consultation are included in the graduate curricula of departments of statistics. Short courses are frequently given on the topic to practicing professionals. The point of view of some of the instructors of such courses and the authors of texts for such courses seems to be that bedside manner and psychological support are the main contributions of a consultant. When listening to some of the case studies and the recommended approach of the instructors who use them, I have been struck by the similarities with recommended care for the terminally ill. Most consulting clients for whom I have worked over the years do not want tea and sympathy. They want results. They have a problem (or problems) which they usually cannot quite articulate. They want the consultant to formulate the problem and solve it, nothing more nor less. They are completely unmoved by statements on the part of the consultant that “that is not my field.” Most real-world consulting jobs are not anybody’s field. They are problems that have to be attacked de novo. The ability to handle consulting problems is deemed by some to be “a gift.” Some people are supposed to have the facility, whereas others do not, though they may be expert theoreticians. My experience is that model building is an attractive means whereby one develops consulting skills. I have former students who call and write about this or that off-the-wall problem which they were given and were able to formulate and solve thanks to the insights they gained in the Rice University model building course.

The purpose of the course in model building, which I have taught at Rice University for 10 years, has been to try to start with problems rather than with methodologies. The fact that I am a statistician (and an undergraduate chemical engineer) certainly biases my approaches and to some extent the problems I choose to examine. However, the experience of a decade is that the approach is successful to a very considerable degree in preparing individuals to formulate and solve problems in the real world. The students in the course have ranged from Ph.D. candidates to sophomores.

We live in a world in which the collection of subjects required for this or that major may be only marginally relevant to the real world. This or that “hard-nosed” petroleum engineering curriculum, for example, may become as irrelevant to the job market as a curriculum in Etruscan archaeology. Of course, the advocates of a liberal arts education have always cheerfully accepted this fact and realistically informed students that such an education was not supposed to be job oriented. The argument that a diverse educational experience is better preparation than training in a narrow area which may become obsolete in a few years has some appeal. But between the extreme of a completely nontechnical education and one of narrow specialization lies a considerable middle ground. Students have begun to understand that obsolescence can overtake any field. Many expect that they will experience several career changes during their professional lives. Accordingly, model building can be looked on as a kind of “liberal sciences” subject. By exposing the student to a variety of modeling situations, it is to be hoped that the synthesizing powers of the human brain will prepare him or her for other situations that are covered neither by other courses nor by the model building course. Experience indicates that this indeed is the case.

In a classroom setting, I form students into two- and three-person groups, which address 10 model building scenarios, each related to one of the sections in the book. The use of group reports as the primary vehicle of evaluation is another attempt to make the course more representative of situations faced in the real world. A week or two is allowed for the preparation of each report. Although the course is one semester in duration at Rice University, the fact that there are more than twice as many sections in the book as can be covered in a semester provides for selections of material based on the preferences of the instructor and the particular composition of the group.

Chapter 1 is concerned with several models of growth and decay. Section 1.1 is an attempt to look at the kind of “privatized” social security system advocated by some. It shows, among other things, how the advent of the calculator and the computer should cause us to consider alternatives to the “closed form solution.” In this section, some advantages of the “do loop” are given. Section 1.2 examines the motivation for the tax reforms of 1981. Section 1.3 examines the spread between the actual value and the nominal value of a mortgage. Section 1.4 departs from rather well-defined accounting models and goes into the more ambiguous country of population growth. Section 1.5 discusses the consequences of metastatic progression and the development of resistance to chemotherapeutic agents in the treatment of cancer.

Chapter 2 looks at rather more complicated systems in which competition and interaction add to the complexity of the model. Section 2.1 is a highly speculative analysis of the population of ancient Israel. Part of the motivation for this section is to demonstrate how a supposedly sterile data set, like that in the Book of Numbers, may become quite significant when analyzed in the light of a model. Section 2.2 shows the enormous advance in data analysis caused by the data compression technique of John Graunt. Section 2.3 examines some considerations in the modeling of combat situations. More particularly, a model-based argument is made as to the possible value of fortifications in modern combat (General Patton notwithstanding). Section 2.4 shows how the predator-prey model first advocated by Volterra can not only be used to model competition of species but can also be applied to the dramatically different situation of the body’s immune response to cancer.


Section 2.5 considers the relatively trivial subject of pyramid clubs as a precursive analysis of epidemics. Section 2.6 examines the AIDS epidemic and gives an analysis that suggests that the epidemic might have been avoided if public health authorities had simply closed down establishments, such as bathhouses, which encourage high contact rate homosexual activity.

Chapter 3 argues further that the advent of the microprocessor should change the way we carry out modeling. Here, simulation is advocated as an aggregation alternative to the closed form. Section 3.1 examines the use of simulation as an alternative to numerical approximations to the closed form for the pointwise evaluation of models generally described by differential equations. This concept was Johann von Neumann’s original motivation for construction of the digital computer, but the point is made in Section 3.1 that computing has reached the speed and cheapness where we should, in many cases, dispense with the differential equation formulation altogether and go rather from the axioms to the pointwise evaluation of the function. Section 3.2 shows a computer-intensive, but conceptually simple, procedure whereby we can use a data set to construct many “quasi data” sets. Section 3.3 considers the use of simulation-based alternatives for the estimation of parameters characterizing stochastic processes. The SIMEST algorithm dealt with in this section has made possible the modeling of processes not tractable using classical closed form techniques.

Chapter 4 addresses some nonclassical methods of data analysis which are highly interactive with the human visual perception system. Both fall, more or less, under the Radical Pragmatist approach mentioned in the Introduction. Section 4.1 attempts to give an analytical description of John W. Tukey’s exploratory data analysis. Section 4.2 examines the challenges and problems associated with the analysis of higher-dimensional data via nonparametric density estimation.

Chapter 5 takes a contrarian approach to the way several well-established theories are perceived. The primary point made here is that we tend to pay too much attention to the mathematical consequences of axiomatized descriptions of real-world systems without checking carefully as to the adequacy of the axioms to describe these systems. Section 5.1 examines several problems concerned with group consensus, in particular the famous impossibility theorem of Kenneth Arrow. Section 5.2 examines Charles Stein’s proof that the sample mean can always be improved on as an estimate for the mean of a normal distribution for dimensionality greater than two. Section 5.3 examines the fuzzy set theory of Lotfi Zadeh. Section 5.4 argues that quality control, while a subject of great importance, is generally misunderstood, due to an improper anthropomorphization of machines and systems.

Appendix A.1 gives a brief introduction to stochastics. Appendix A.2 presents the robust optimization algorithm of Nelder and Mead.