Readers’ Corner: The Role of Experiments in Computer Science

Peter Fletcher, Department of Computer Science, Keele University, Staffordshire, U.K.

It is sometimes suggested that computing research should be conducted as an empirical science, with all its algorithms, designs and models being subjected to quantitative experimental testing. I question the appropriateness of this view of research and argue that much of computing research does not, and should not, embody quantitative empirical claims. Experiments are, however, sometimes appropriate; I point out some ways in which computing experiments can fall short of normal scientific standards of rigour.

When, why and how should computer scientists conduct experiments? Denning (1980) takes it for granted that computing research must be validated by experiment, in the same way as research in the natural sciences, and that the appropriate questions for investigation are ones of quantitative performance. However, a recent survey by Tichy et al. (1995) shows that by and large computer scientists do not work like experimental scientists. Tichy et al. examined 400 recent articles from the computer science literature and classified them into five categories: formal theory, design and modelling, empirical work, hypothesis testing, and other (including surveys). They discovered that of the design and modelling papers 43% contain no experimental evaluation, and only 31% devote more than 20% of their pages to experimental reports. Tichy et al. regard this as a bad thing and compare it unfavourably with two journals, Optical Engineering and Neural Computation, in which a much higher proportion of space is devoted to reports of experiments. Glass (1995) endorses this attitude, and appears to go further by proposing that even formal theory needs to be validated by empirical observations.

I wish to take issue with these authors on three points: first, their narrow conception of research; secondly, their preoccupation with quantitative performance, which I shall argue is ill-suited to computer science; thirdly, Tichy et al.'s claim that the quality of an experimental report can be measured by a page count.

1. COMPUTER SCIENCE IS NOT A SCIENCE

Sciences such as physics and chemistry proceed by framing hypotheses, designing experiments to test rival hypotheses, which suggest further hypotheses, and so on. The purpose of these sciences is to describe a body of natural phenomena and to make successful predictions.

If we take this as scientific method, then it is clear that "computer science" is not a science, for throughout its history (that is, from Turing (1936-7) to the present) it has mostly not been concerned with natural phenomena or empirical predictions. It follows that research methods which are appropriate to physics and chemistry are not automatically appropriate to computer science.

Computer science is more concerned with qualitative commonalities than quantitative differences, competence rather than performance. The founding idea of the discipline is the Turing-equivalence of all computers, abstracting away from the differences in instruction sets, speed and storage capacity of particular machines (this is true even of complexity theory). Without this essential insight we would have no academic discipline, just a miscellany of knowledge about various machines that do various kinds of computation.

In, for example, the field of language compiling, we are not particularly interested in the fact that top-down parsing is about twice as fast as bottom-up parsing. Far more important are the qualitative issues involved: whether a language can be described by a context-free grammar, whether a grammar is ambiguous, whether an unambiguous grammar can be parsed without lookahead or backtracking, and so on. A demand for quantitative performance results can sometimes be a distraction from underlying competence limitations: if the source language is of the wrong Chomsky type then no amount of optimising the parsing algorithm will make it parsable.
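
To make the qualitative point concrete, here is a minimal sketch (in Python, using a hypothetical toy grammar not taken from the article) of a recursive-descent parser. Because the grammar is unambiguous and LL(1), one token of lookahead suffices and no backtracking is ever needed; this is a property of the grammar itself, independent of how fast any particular implementation runs.

```python
# Illustrative sketch only: a recursive-descent parser for the toy grammar
#   E -> T ('+' T)*,   T -> NUMBER | '(' E ')'
# The grammar is unambiguous and LL(1), so a single token of lookahead
# decides every choice and the parser never backtracks.

import re

def tokenize(text):
    tokens = re.findall(r"\d+|[+()]", text)
    return tokens + ["$"]            # "$" marks end of input

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_expr(self):            # E -> T ('+' T)*
        value = self.parse_term()
        while self.peek() == "+":
            self.expect("+")
            value += self.parse_term()
        return value

    def parse_term(self):            # T -> NUMBER | '(' E ')'
        if self.peek().isdigit():
            value = int(self.peek())
            self.pos += 1
            return value
        self.expect("(")
        value = self.parse_expr()
        self.expect(")")
        return value

def evaluate(text):
    parser = Parser(tokenize(text))
    result = parser.parse_expr()
    parser.expect("$")               # require that all input was consumed
    return result

print(evaluate("(1+2)+3"))           # prints 6
```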

This goes a long way to account for the reluctance of computer scientists to conduct quantitative studies.

Tichy et al. cite a neural network journal as a model of good practice because it devotes a large proportion of its space to experimental reports. My personal view as a neural network researcher is that many workers in this field do such experiments because, having coded a new network architecture or learning rule, they cannot think of anything else to do with it. The neural network literature gives far too little attention to traditional computer science issues of representational adequacy, specification, correctness proof, and relationships between different levels of abstraction. This is only a personal view, but then so is Tichy et al.'s assertion that "Publications of the design and modelling category require reproducible experiments for validation of claims."

2. WHAT IS COMPUTING RESEARCH?

There is a kind of research that consists of seeking the best solution to a previously specified problem. A solution here may be a program, a model of some real-world process, a software development method, a programming language, or whatever, depending on the context. The best solution may be defined by criteria such as correctness (or error rate), execution speed, resource requirements, robustness or ease of use. Clearly any proposed solution needs to be evaluated against the relevant criteria, in comparison with other known solutions; the evaluation may take the form of a correctness proof, a complexity analysis, case studies, or experiments involving a computer system or users. Tichy et al.'s and Glass's remarks make most sense when applied to this kind of research (see also Fenton et al. [1994]).

However, much research (in computing as in every other academic discipline) is not of this sort. Often the problem being studied is only roughly understood at the outset, and a major goal of the research is to delimit it more precisely. An extreme example of this is in artificial intelligence research, where the problem (of duplicating intelligence in a machine) is very ill-understood, and research is significant to the extent that it helps us to understand the problem better. In research of this kind, ideas are evaluated not by being tested against precise criteria but by being argued over. The reason for writing programs, or designing formal notations, is different from the first kind of research. A program is not a solution to a problem but an aid to thinking about the problem. The formality and explicitness required in a program help one to think precisely about vague and ill-specified issues. Exploring "toy" problem domains (such as blocks worlds) helps us to see the consequences of our assumptions, to perceive large gaps in our understanding or distinctions which had previously gone unnoticed. The quantitative performance of a system produced in the course of such research may be of no significance, since the system is not intended for use in any real-life application and since it inevitably contains many arbitrary implementation details. For these reasons the quantitative performance of such systems bears no logical relation to the worth of the ideas behind them. Most of this work would fall into Tichy et al.'s category of 'design and modelling', because it does not consist of proving theorems, and yet it requires no experimental validation because it embodies no empirical claims. (See Fletcher [1991] for further discussion of the purposes of implementation in the context of neural network research.)

Computer science could not consist entirely of research of the first kind (solving precisely specified problems), since someone has to think up the precisely specified problems in the first place and convince us that they are worth pursuing. The problems and criteria need to be continually under review to ensure that they are still academically worthwhile, and this calls for intelligent debate rather than quantitative experiment. My real point is that research is primarily critical scholarship (studying one's own and others' ideas and forming a perspective on them) and only secondarily what is known in industry as research and development.

3. EXPERIMENTS: QUANTITY VS. QUALITY

Some branches of computing research do genuinely involve making quantitative empirical claims: for example, a psychological claim that structured programs are easier to understand than FORTRAN programs, or a sociological claim that one method for managing large software projects works better than another. There are also pseudo-empirical claims, say about the execution speed of a program or the error rate of a neural network, which are strictly speaking mathematical theorems or negations of theorems but where we resort to trial runs because the problem is mathematically intractable; Hooker (1994) stresses the importance of problems of this kind.
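
As a minimal sketch of what such trial runs look like in practice (the functions sort_a and sort_b below are hypothetical stand-ins for two rival implementations, not anything discussed in the article), one times repeated runs on identical inputs precisely because the exact running time cannot be settled by analysis alone.

```python
# Illustrative sketch only: comparing two implementations by trial runs,
# because the "theorem" about which is faster in practice is intractable.

import random
import time

def sort_a(xs):
    return sorted(xs)                # built-in sort

def sort_b(xs):                      # naive insertion sort, for contrast
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def time_trial(fn, data, repeats=10):
    """Return wall-clock time per run, averaged over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(list(data))               # copy so every run sees identical input
    return (time.perf_counter() - start) / repeats

data = [random.random() for _ in range(5000)]
print("sort_a:", time_trial(sort_a, data))
print("sort_b:", time_trial(sort_b, data))
```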

Clearly such claims need to be tested with properly conducted and controlled scientific experiments. Tichy et al. praise a neural network journal in this respect. Their reasoning is: this journal devotes a lot of space to experimental reports; ergo the authors and referees probably attach a high importance to experimental evaluation; ergo the experimental evaluation is probably of high quality. The truth, however, is rather different. In the literature on artificial neural networks generally, serious lapses from normal experimental protocol are tolerated.

- No precautions are taken against experimenter bias. Most of the experimental hypotheses are of the form "My method works twice as fast as the standard method," and of course the experimenter has no difficulty in proving this. Researchers in the life sciences know that such experiments must be conducted "blind" if they are to be of any value.

- Statistical analyses of the significance of differences in error rates and learning curves are usually omitted. Graphs are plotted without error bars. Single results are often quoted without any indication of whether they represent the only trial, the best of a large number of trials, or a mean.

- Sources of systematic error are neglected. For example there is no attempt to calibrate one neural network simulator against another: it is assumed that 'back-propagation' means the same thing on SNNS, PlaNet and the PDP software.
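
A minimal sketch of the kind of reporting called for by the second point (run_trial below is a hypothetical stand-in for a single training-and-evaluation run, not code from any of the cited systems): repeat the experiment over several independent trials, report a mean with its standard error, and test whether the apparent difference between methods is significant, rather than quoting one favourable run.

```python
# Illustrative sketch only: reporting repeated trials instead of a single result.

import random
import statistics

def run_trial(method, seed):
    """Stand-in for one training/evaluation run, returning an error rate.
    In a real study this would use a fresh random initialisation and data split."""
    rng = random.Random(seed)
    base = 0.10 if method == "new" else 0.12
    return base + rng.gauss(0.0, 0.01)

def summarise(errors):
    mean = statistics.mean(errors)
    sem = statistics.stdev(errors) / len(errors) ** 0.5    # standard error of the mean
    return mean, sem

trials = 20
new_errors = [run_trial("new", s) for s in range(trials)]
old_errors = [run_trial("old", s + 1000) for s in range(trials)]

for name, errors in (("new", new_errors), ("old", old_errors)):
    mean, sem = summarise(errors)
    print(f"{name}: error rate {mean:.4f} +/- {sem:.4f} (n={trials})")

# Welch's t statistic for the difference in means; in practice one would
# compare it against the t distribution (e.g. scipy.stats.ttest_ind) to obtain
# a p-value, rather than eyeballing two single numbers.
m1, v1 = statistics.mean(new_errors), statistics.variance(new_errors)
m2, v2 = statistics.mean(old_errors), statistics.variance(old_errors)
t = (m1 - m2) / ((v1 / trials + v2 / trials) ** 0.5)
print("Welch t statistic:", round(t, 2))
```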

Often the assumptions of an experimental report are more interesting than its results. Tichy et al. suggest that "a long description of an uninteresting experiment is likely to be rejected by reviewers." Alas, this is wishful thinking.

4. CONCLUSIONS

As Glass (1995) points out, computing research has evolved in an ad hoc way and the framework of methods and phases “has, until very recently, either not existed or been poorly articulated.” As computer scientists we need to become more self-critical about our research goals and methods, and in particular to identify which aspects of our work genuinely embody empirical claims and hence call for experimental testing.

When we do have occasion to do experiments, we should adopt the same rigorous protocols as practised in the natural and social sciences.

REFERENCES

Denning, P. J., What Is Experimental Computer Science? Commun. ACM 23, 543-544 (1980).

Fenton, N., Pfleeger, S. L., and Glass, R. L., Science and Substance: a Challenge to Software Engineers, IEEE Software 11(4), 86-95 (1994).

Fletcher, P., A Self-Configuring Network, Connection Sci. 3, 35-60 (1991).

Glass, R. L., A Structure-Based Critique of Contemporary Computing Research, J. Sys. Software 28, 3-7 (1995).

Hooker, J. N., Needed: an Empirical Science of Algorithms, Operat. Res. 42, 201-212 (1994).

Tichy, W. F., Lukowicz, P., Prechelt, L., and Heinz, E. A., Experimental Evaluation in Computer Science: A Quantitative Study, J. Sys. Software 28, 9-18 (1995).

Turing, A. M., On Computable Numbers, with an Application to the Entscheidungsproblem, Proc. London Math. Soc., series 2, 42, 230-265 (1936-7).