
Journal of Forecasting, Vol. 5, 117-123 (1986)

Subjective Confidence in Forecasts: A Response to Fischhoff and MacGregor

GEORGE WRIGHT AND PETER AYTON
Psychology Department, City of London Polytechnic, London, U.K.

ABSTRACT Here we evaluate the generalizability of calibration studies which have used general knowledge questions, and argue that on conceptual, methodological and empirical grounds the results have limited applicability to judgemental forecasting. We also review evidence which suggests that judgemental forecast probabilities are influenced by variables such as the desirability, imminence, time period and perceived controllability of the event to be forecast. As these variables do not apply to judgement in the domain of general knowledge, a need for research recognizing and exploring the psychological processes underlying uncertainty about the future is apparent.

KEY WORDS: Calibration; Judgemental forecasting; Subjective probability

This paper reviews recent psychological research that has relevance for judgemental forecasting. Judgemental forecasts are often required in situations where actuarial, or relative frequency, data are unavailable or known to be unreliable. These judgements can be used as inputs to decision analysis and probabilistic information processing systems, which are based on subjective expected utility theory and Bayes’ theorem, respectively. (For a review of these decision aids, see Wright, 1984.) Economists have identified a need for judgemental forecasts when there is thought to be a possibility of ‘turning points’ in time series data. Such turning points may be recognized by observing discontinuous changes in variables that are assumed to have a causal influence on the criterion variable. For example, consider the impact of the miners’ strike in the United Kingdom upon U.K. gross domestic product and U.K. investment.

Fischhoff and MacGregor (1982) reviewed earlier research and reported some new empirical evidence on the psychology of judgement. Fischhoff and MacGregor focused on the calibration of subjective probabilities. Calibration is one measure of the validity of subjective probability assessment. For a person to be perfectly calibrated, assessed probability should equal percentage correct over a number of assessments of equal probability. For example, if you assign a probability of 0.7 to each of ten questions, you should get seven of those questions correct. Similarly, all events that you assess as being certain to occur (1.0 probability assessments) should occur.
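What follows is a minimal sketch, not taken from the paper, of how such a calibration check can be carried out on a set of assessments; the data and function name are purely illustrative.

```python
# Minimal sketch (hypothetical data): group assessments by the probability
# assigned and compare that probability with the proportion actually correct.
from collections import defaultdict

def calibration_table(assessments):
    """assessments: iterable of (assessed_probability, was_correct) pairs.
    Returns {probability: (n_assessments, proportion_correct)}."""
    groups = defaultdict(list)
    for prob, correct in assessments:
        groups[prob].append(correct)
    return {prob: (len(outcomes), sum(outcomes) / len(outcomes))
            for prob, outcomes in sorted(groups.items())}

# Ten questions all assessed at 0.7: perfect calibration requires that
# exactly seven of them turn out to be correct.
sample = [(0.7, c) for c in (1, 1, 1, 0, 1, 0, 1, 1, 0, 1)]
print(calibration_table(sample))  # {0.7: (10, 0.7)}
```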


Fischhoff and MacGregor’s aim was to see whether the results of previous studies that have investigated calibration of probabilities given to general knowledge questions could be generalized to judgemental forecasting. These studies have often used general knowledge items in the form of dichotomous questions such as ‘which canal is longer? (a) Suez Canal (b) Panama Canal’. Subjects are required to indicate the answer they think is correct and then to assess a probability between 0.5 and 1 to indicate their degree of belief in its correctness. General knowledge questions have been extensively used in studies of calibration because subjects’ answers can be immediately and conveniently evaluated by the experimenter. This research has documented the generality of overconfidence in probability assessment. Generally, for all propositions assessed as having a 0.XX probability of being correct, less than XX per cent actually are correct (Lichtenstein and Fischhoff, 1977).

Lichtenstein et al. (1982), in reviewing calibration research, concluded that although the most common form of miscalibration was overconfidence, different calibration curves emerge for tests with different levels of difficulty. Difficulty is measured as the proportion of responses where subjects choose what turned out to be the correct answer. Lichtenstein and Fischhoff (1977) investigated the calibration/difficulty relationship in more detail. They concluded that ‘with increasing knowledge comes decreasing overconfidence until, for those whose percentage correct exceeded 80%, we found moderate underconfidence. This relationship resulted in a non-monotonic relationship between knowledge and calibration, with the best calibration found at approximately 80 per cent correct.’

Fischhoff and MacGregor presented calibration curves based on grouped data for general knowledge and future event questions, and noted that ‘(the) curve pertaining to forecasts looks strikingly like that observed with general knowledge questions of the same difficulty level . . .’ (Fischhoff and MacGregor, 1982, p. 162). However, they did not present comparisons between the sets of general knowledge and forecasting questions on mathematical measures of calibration and overconfidence, which they used extensively in later sections of the paper to compare forecasting performance under different task demands. These mathematical measures summarize individual calibration curves and are weighted by the number of assessments made at each assessment level. So, although the shapes of two calibration curves can be identical, calibration, when computed as a measure, may be quite different in each case because different distributions of responses comprise the two curves. In fact, Fischhoff and MacGregor did note a difference in the distributions for future event and general knowledge questions: overall, fewer certainty responses (probability of 1.0) are applied to future event questions. The shape of their calibration curves shows that higher probabilities tend to be worse calibrated. It therefore appears highly likely that, had they computed the calibration scores, a difference would have emerged.
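A minimal sketch of this weighting effect is given below. It uses the standard calibration index discussed by Lichtenstein et al. (1982) — a weighted mean of squared gaps between assessed probability and proportion correct — and the numbers are hypothetical, chosen only to show that two curves of identical shape can receive different summary scores when the response distributions differ.

```python
# Minimal sketch (hypothetical numbers): the calibration index weights each
# response category by how often it was used, so identical curve shapes with
# different response distributions yield different scores.

def calibration_index(levels, hit_rates, counts):
    """(1/N) * sum_j n_j * (r_j - c_j)**2, where r_j is the probability used
    in category j and c_j is the proportion correct in that category."""
    n_total = sum(counts)
    return sum(n * (r - c) ** 2
               for r, c, n in zip(levels, hit_rates, counts)) / n_total

levels    = [0.6, 0.8, 1.0]       # probability categories used
hit_rates = [0.55, 0.65, 0.75]    # same curve shape in both conditions
general_knowledge = [20, 30, 50]  # many certainty (1.0) responses
future_events     = [50, 30, 20]  # fewer certainty responses

print(calibration_index(levels, hit_rates, general_knowledge))  # ~0.0385
print(calibration_index(levels, hit_rates, future_events))      # ~0.0205
```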

Wright and Wisudha (1982) and Wright (1982) have closely investigated calibration for future-event, past-event and general knowledge questions. For sets of past-event and future-event questions of equal difficulty they found differences in calibration and overconfidence, although the group calibration curves looked much the same. The reason for this was that subjects tended to use more certainty assessments in response to the past-event questions, whereas the future-event questions tended to be seen in terms of probability or complete uncertainty. Figure 1 presents Wright’s (1982) calibration curves. For past-event questions the mathematical measures of calibration and overconfidence were influenced by the high proportion of overconfident high probabilities, whereas the measures computed for the future-event questions were influenced by the high proportion of underconfident low- and mid-range probabilities.

Figure 1. Wright’s (1982) calibration curves [figure not reproduced]

What factors may cause the obtained difference in the realism and distribution of probability assessments given to remembered and then-future events? One main difference may be that people in ordinary life do not naturally put a probability to the veracity of their memories. Fischhoff et al. (1977) have argued that people may believe that they are answering such questions directly from memory without making any inferences. Conversely, judgements about the likelihood of future events contain explicit uncertainty because the correct answer is unknown to both subject and experimenter. The issue is really one of competence versus performance and it can be described by reference to a three-stage model of the cognitive processes involved in answering a question, first described by Phillips and Wright (1977). These stages may be distinguished from each other by their response output. The stages consist of: a certainty response (either ‘yes’ or ‘no’ or a probability estimate of 0 or 1) in stage 1, a response consequent on a refusal to respond probabilistically (either ‘don’t know’ or a 0.5 probability estimate) in stage 2, and a truly probabilistic response (either a probability estimate between 0 and 1 or a corresponding verbal expression) in stage 3. People tend to respond to questions concerning remembered events by stopping at stage 1, whereas responses to future events are determined by stages 2 and 3. These cognitive processes are also influenced by culture (Wright and Phillips, 1980) as well as task. Lichtenstein et al. (1982) have argued that the heuristic ‘anchoring and adjustment’ may, in part, account for overconfidence. In terms of the Phillips and Wright model, the response anchor for past event and general knowledge questions may be perceived certainty. This anchor may have such a dominating influence that any adjustment from the anchor to accommodate recognized uncertainty may be insufficient; hence any assessed probabilities will be too high, yielding overconfidence. Conversely, with future event questions the anchor may be a 0.5 probability response, which indicates an immediate recognition of uncertainty. Insufficient adjustment from this anchor would result in the underconfidence for future-event questions shown in Wright’s (1982) study. It may be that a then-future event question initiates a more thorough memory search than a past-event or almanac question. Plausibly, a multiplicity of possible causal relationships could be perceived as resulting in an outcome which is presented as an alternative answer in a future event question. These speculations convince us that a tenable conceptual rationale can be developed, with some empirical evidence to support it, for the hypothesis that the cognitive processes underlying expressions of uncertainty about the future (judgemental forecasting) are different from those underlying expressions of uncertainty about the past or present (‘almanac’ or general knowledge verification).
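As a rough illustration of how individual responses might be assigned to these stages when scoring a study, here is a minimal sketch; it is our reading of the model, not code from any of the cited papers.

```python
# Minimal sketch (our reading of the Phillips and Wright (1977) stages):
# classify a single probability response by the stage it appears to reflect.

def response_stage(p):
    """Stage 1: certainty (0 or 1); stage 2: refusal to respond
    probabilistically (0.5, i.e. 'don't know'); stage 3: any other
    probability, i.e. a truly probabilistic response."""
    if p in (0.0, 1.0):
        return 1
    if p == 0.5:
        return 2
    return 3

# Past-event answers tend to cluster in stage 1; future-event answers
# are spread over stages 2 and 3.
print([response_stage(p) for p in (1.0, 0.5, 0.7, 0.0, 0.3)])  # [1, 2, 3, 1, 3]
```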

An interesting methodological difference between the studies of Wright (1982) and Fischhoff and MacGregor (1982) prompts further questioning of the applicability of general knowledge calibration to judgemental forecasting. Fischhoff and MacGregor’s questions, at least the ones they give as examples, require judgements about future, assumed certain, events whose possible outcomes can be dichotomized. For example, one question requires subjects to decide which of two teams will win a given baseball game. By contrast, Wright’s task requires subjects to make judgements about uncertain future events, the dichotomy here being whether or not the future event will occur; for example, will a British member of parliament die within the next fourteen days? (a) yes, (b) no. Arguably there is less redundancy with Wright’s task, as the possibility of occurrence of a specific event does not constrain the probability of any other explicit event except, of course, the non-occurrence of that event. The implications of this difference between the two tasks are, in psychological processing terms, unclear. However, one potentially important difference for subjects in the experimental studies should be stressed. Subjects in Fischhoff and MacGregor’s experiments who are unsure, perhaps even totally ignorant of the relative likelihood of the event dichotomy, can select (a) or (b) at random and give a 50 per cent response with impunity to their calibration score as, on average, this will be the proportion correct. They may even consciously appreciate this and amend their strategy accordingly. In this respect Fischhoff and MacGregor’s task is similar to the general knowledge task. The influence of such ‘metacues’ on subjects’ performance on these tasks remains uninvestigated, but it is possible that they introduce artificiality into the experimental responses. Subjects attempting Wright’s task, on the other hand, cannot use 50 per cent responses to the same effect. The proportion of events that do in fact occur is obviously not under the control of the experimenter, and so no response strategy can be adopted by the subject, when ignorant of the likelihood of occurrence, that will minimize miscalibration. If subjects do use 50 per cent responses under these circumstances they will be miscalibrated whenever the proportion of event occurrence differs from 50 per cent, and in direct proportion to the size of that difference. It would seem that many real-life forecasting situations (e.g. weather; Murphy and Winkler, 1974) resemble Wright’s task more closely than Fischhoff and MacGregor’s. Clearly, when it comes to predicting the future, we often need to deal with the probability of occurrence of events that have more complex alternatives than can be specifically and exhaustively represented by two mutually exclusive categories.
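To make the arithmetic of that point explicit, here is a minimal sketch using hypothetical base rates (not data from either study): for a two-alternative question answered at random, a stated probability of 0.5 is calibrated in expectation, whereas for occurrence questions blanket 0.5 responses are miscalibrated by the gap between 0.5 and the base rate of occurrence.

```python
# Minimal sketch (hypothetical base rates): the miscalibration of a blanket
# 0.5 response is the absolute gap between 0.5 and the proportion of
# forecast events that actually occur.

def miscalibration_of_half_responses(occurrence_rate):
    """Gap between the stated probability (0.5) and the observed hit rate."""
    return abs(0.5 - occurrence_rate)

print(miscalibration_of_half_responses(0.5))   # 0.0 - random choice between two alternatives
print(miscalibration_of_half_responses(0.1))   # 0.4 - rare events: heavily miscalibrated
```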

In the next section of this paper we move beyond using almanac questions as a paradigm for investigations of judgemental forecasting and review research that has strong implications for studies of the calibration of subjective forecasts.

SUBJECTIVE PROBABILITIES GIVEN TO FUTURE EVENTS

Milburn (1978) investigated sources of bias in the prediction of future events. He prepared lists of events that were either personal (i.e. could happen to the forecaster) or external (i.e. national or international) in nature. He argued that ‘For most people, events which have a direct impact on their lives are of greater concern than those occurring on a national or international level. Having thought more about personal than external events, it should be much easier to imagine the occurrence of personal events than external events. Thus personal events should be seen as more likely.’ However, this event distinction did not produce a significant difference in subjective forecasts. Milburn also investigated the imminence of a forecast period by eliciting forecasts of an event occurring in each of four successive decades in the future. Half of the events had what he classified as ‘positive’ outcomes (e.g. ‘hunger and poverty are no longer problems in the US’) and the other half had ‘negative’ outcomes (e.g. ‘I have to spend some time in the hospital because of a serious illness’). Milburn found that subjects perceived desirable events as becoming increasingly likely to occur in each of the four successive decades in the future. By contrast, undesirable events were perceived as becoming less likely in each of the four successive decades. Figure 2 presents his findings in detail.

Figure 2. Mean probability estimates for positive and negative events. From Milburn (1978) [figure not reproduced]

However, in a complementary experiment where each subject gave forecasts for only one of the decades, there was a significant downward trend in perceived likelihood for both types of event across time in the future. These results are difficult to account for. Milburn suggested that the availability heuristic (Tversky and Kahneman, 1974) would predict that subjects will feel that, as the world changes, it will become progressively less like the present in successive decades. Thus it should be harder to imagine what the world will be like, and events should be seen as less probable. Desirable events increase in probability over time because, Milburn argues, desirability has a greater effect in more ambiguous circumstances. Owing to the time periods involved, there was no attempt to assess the calibration of the forecasts. Perhaps the most significant result from this research is that when subjects do not have to produce a number of forecasts for successive time periods the pattern of results changes. The influence of imminence does not seem to be straightforward. It interacts with the desirability of the events to be forecast and changes as a function of the method of elicitation of the forecasts.

Zakay (1983) has investigated the effect of the desirability of an event on probability forecasting in a way that controls for alternative explanations of changes in subjective probabilities. He asked subjects to assess probabilities that positive value events (e.g. winning a lottery) and negative value events (e.g. being hurt in a road accident) would happen to themselves and also to someone whom the probability assessor did not know but who is ‘of the same population as yours in terms of age, origin, income, social status and personality’. Zakay found that subjects perceived desirable life events as being more likely to occur to themselves than to another person similar to themselves. Undesirable life events showed the opposite effect. This pattern of responding can be termed an optimistic bias. Zakay suggested that operation of the availability heuristic would tend to retrieve from news reports many instances of undesirable events occurring to others, but not to oneself. The availability heuristic does not, however, explain the findings for desirable events.

A real-life study of the effect of the value of outcomes on forecasting was undertaken by Snyder (1978). He investigated public betting on horse racing and found that punters attempted to recoup losses by selecting even longer-odds horses than usual on the last race of the day. This result replicates that of McGlothlin (1956) and also suggests an interaction between subjective probability and utility.

Weinstein (1980) found similar results to those reported by Zakay, as well as finding that perceived controllability of events correlated positively with the amount of optimistic bias. Controllability was measured on a 5-point subjective rating scale ranging from ‘there is nothing that we can do that will change the likelihood that the event will take place’ to ‘completely controllable’. Indeed, perceived controllability, whether real or imaginary, may have a strong influence on judgements of the likelihood of outcomes (Langer, 1982).

The way in which forecast probabilities vary as a function of the time duration of a forecast period has received little attention. Howell and Burnett (1978) have suggested that short time spans will be influenced to a greater degree by local context effects and less by knowledge of the past frequency of events. Thus subjects may believe in ‘luck’ playing a bigger part in the short term and be prone to unstable, ephemeral influences. They cite the gambler’s fallacy (e.g. the belief that a coin which has come up heads many times in a row will be more likely to come up tails on the next throw) as an illustration of this view. In a follow-up study, Howell and Kerkar (1982) had subjects participate in a task where they served as emergency vehicle dispatchers for a hypothetical city. Each subject gained experience with events (types of emergency calls) generated from different parts of the city by a stationary stochastic process. Subjects then had to answer a series of questions concerning either the observed frequency of specified kinds of events or the probability that those same events would occur on future occasions. However, despite subjects’ ability to make good judgements of observed historic frequency, their forecasts of subsequent events appeared to depend more heavily on less appropriate transient information.

Although these studies did not measure calibration, it would seem that a range of variables may influence the judgements given to the likelihood of future events. The validity of those judgements requires careful examination, as some variables known to influence judgements about the future do not apply to general knowledge. Thus we feel that Fischhoff and MacGregor’s conclusion that ‘calibration for confidence assessments regarding forecasts is largely indistinguishable from that pertaining to general knowledge questions ... [and] one might expect calibration for forecasts to be relatively unaffected by changes in response mode, incentive payments for correct answers, or familiarity with the subject matter’ is too strong a statement. Once the paradigm for investigation of the calibration of probability assessments is broadened, it follows that calibration, based as it is on the distribution of forecast probabilities, will vary as a function of the desirability, imminence, time period and perceived controllability of the event to be forecast.

ACKNOWLEDGEMENT

This research was supported by the Economic and Social Research Council Grant No. C00232037.

REFERENCES

Fischhoff, B. and MacGregor, D., ‘Subjective confidence in forecasts’, Journal of Forecasting, 1 (1982), 155-172.

Fischhoff, B., Slovic, P. and Lichtenstein, S., ‘Knowing with certainty’, Journal of Experimental Psychology: Human Perception and Performance, 3 (1977), 552-564.

Howell, W. C. and Burnett, S. A., ‘Uncertainty measurement: a cognitive taxonomy’, Organizational Behavior and Human Performance, 22 (1978), 45-68.

Howell, W. C. and Kerkar, S. P., ‘A test of task influences in uncertainty measurement’, Organizational Behavior and Human Performance, 30 (1982), 365-390.

Langer, E. J., The Psychology of Control, Beverly Hills: Sage, 1982.

Lichtenstein, S. and Fischhoff, B., ‘Do those who know more also know more about how much they know?’, Organizational Behavior and Human Performance, 20 (1977), 159-183.

Lichtenstein, S., Fischhoff, B. and Phillips, L. D., ‘Calibration of probabilities: the state of the art to 1980’, in Kahneman, D., Slovic, P. and Tversky, A. (eds), Judgment under Uncertainty: Heuristics and Biases, New York: Cambridge University Press, 1982.

McGlothlin, W. H., ‘Stability of choices among uncertain alternatives’, American Journal of Psychology, 69 (1956), 604-615.

Milburn, M. A., ‘Sources of bias in the prediction of future events’, Organizational Behavior and Human Performance, 21 (1978), 17-26.

Murphy, A. H. and Winkler, R. L., ‘Subjective probability forecasting experiments in meteorology: some preliminary results’, Bulletin of the American Meteorological Society, 55 (1974), 1206-1216.

Phillips, L. D. and Wright, G. N., ‘Cultural differences in viewing uncertainty and assessing probabilities’, in Jungermann, H. and de Zeeuw, G. (eds), Decision Making and Change in Human Affairs, Amsterdam: D. Reidel, 1977.

Snyder, W., ‘Decision-making with risk and uncertainty: the case of horse racing’, American Journal of Psychology, 91 (1978), 201-209.

Tversky, A. and Kahneman, D., ‘Judgement under uncertainty: heuristics and biases’, Science, 185 (1974), 1124-1131.

Weinstein, N. D., ‘Unrealistic optimism about future life events’, Journal of Personality and Social Psychology, 39 (1980), 806-820.

Wright, G., ‘Changes in the realism and distribution of probability assessment as a function of question type’, Acta Psychologica, 52 (1982), 165-174.

Wright, G., Behavioural Decision Theory, Harmondsworth: Penguin, 1984 and Beverly Hills: Sage, 1984.

Wright, G. and Phillips, L. D., ‘Cultural variation in probabilistic thinking: alternative ways of dealing with uncertainty’, International Journal of Psychology, 15 (1980), 239-257.

Wright, G. and Wisudha, A., ‘Distribution of probability assessment for almanac and future event questions’, Scandinavian Journal of Psychology, 23 (1982), 219-224.

Zakay, D., ‘The relationship between the probability assessor and the outcomes of an event as a determiner of subjective probability’, Acta Psychologica, 53 (1983), 271-280.

Authors’ biographies: George Wright received his Ph.D. from Brunel University in 1980. He has since published widely on the human aspects of decision-making and forecasting. His publications include Behavioural Decision Theory (Beverly Hills: Sage and Harmondsworth: Penguin, 1984), Behavioural Decision Making (New York: Plenum, 1985), Investigative Design and Statistics (Harmondsworth: Penguin, 1986) and Judgemental Forecasting (Chichester: Wiley, in press). He is currently a senior lecturer at the City of London Polytechnic. Peter Ayton conducted research in memory and language at University College London before joining the Decision Analysis Group at the City of London Polytechnic. His current research activities include the development of statistical methods for individual difference analysis and the study of intuitive statistical concepts.

Authors’ addresses: George Wright and Peter Ayton, Decision Analysis Group, Psychology Department, City of London Polytechnic, Old Castle Street, London El 7NT, U.K.