Spurious skill? Varying event frequencies, non-collapsibility and Simpson’s Paradox in forecast verification
Roger Harbord, [email protected]
EMS & 12th ECAM, 10 September 2015, Sofia
Who was Simpson?
Journal of the Royal Statistical Society, Series B (Methodological). Vol. 13, No. 2, pp. 238‒241
© Royal Statistical Society
© Crown copyright Met Office
Verification of binary forecasts

                     Event observed?
Event forecast?      Yes       No
Yes                  Hits      False alarms
No                   Misses    Correct rejections
• Hit rate H = Hits ∕ ( Hits + Misses )
• False alarm rate F = False alarms ∕ ( False alarms + Correct rejections )
• Peirce skill score (true skill statistic, Hanssen & Kuiper’s discriminant, Youden’s index …)
PSS = H − F
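The two rates and their difference can be computed directly from the table counts. A minimal sketch (illustration only, not code from the talk; the function name is my own):

```python
def peirce_skill_score(hits, false_alarms, misses, correct_rejections):
    """Return (hit rate H, false alarm rate F, PSS = H - F)."""
    H = hits / (hits + misses)
    F = false_alarms / (false_alarms + correct_rejections)
    return H, F, H - F

# Using the December-to-May counts from the next slide:
# H = 0.625, F ~ 0.444, PSS ~ 0.181
H, F, score = peirce_skill_score(35, 56, 21, 70)
```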
Homer’s performance: December to May

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  35     56     91
No                   21     70     91
Total                56     126    182
• Hit rate H = 35 / 56 = 0.625
• False alarm rate F = 56 / 126 = 0.444
• Peirce skill score PSS = H − F = 0.625 − 0.444 = 0.18
two-sided P = 0.025
December to February

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  33     43     76
No                   7      7      14
Total                40     50     90
• Hit rate H = 33 / 40 = 0.825
• False alarm rate F = 43 / 50 = 0.860
• Peirce skill score PSS = H − F = 0.825 − 0.860 = − 0.035
March to May

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  2      13     15
No                   14     63     77
Total                16     76     92
• Hit rate H = 2 / 16 = 0.125
• False alarm rate F = 13 / 76 = 0.171
• Peirce skill score PSS = H − F = 0.125 − 0.171 = − 0.046
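Using the counts from the three tables above, a short sketch (not from the slides) reproduces the reversal: both seasonal tables have negative PSS, yet simply adding the counts gives a positive pooled PSS.

```python
def pss(hits, false_alarms, misses, correct_rejections):
    """Peirce skill score: hit rate minus false alarm rate."""
    H = hits / (hits + misses)
    F = false_alarms / (false_alarms + correct_rejections)
    return H - F

# (hits, false alarms, misses, correct rejections) from the slides
djf = (33, 43, 7, 7)     # December to February: PSS ~ -0.035
mam = (2, 13, 14, 63)    # March to May:         PSS ~ -0.046
pooled = tuple(x + y for x, y in zip(djf, mam))  # December-to-May table

# Both seasons score below zero; the pooled table scores ~ +0.18
```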
Collapsing contingency tables: A geometric approach
Shapiro SH (1982). The American Statistician, 36 (1): 43-46.
[Figure: hit rate plotted against false alarm rate, both axes from 0 to 1]
Simpson’s paradox (‘the reversal paradox’)
• Not limited to the Peirce skill score: the reversal is independent of the measure used, since all sensible performance measures agree on the direction of the effect (positive or negative)
• Not limited to deterministic forecasts of dichotomous events; analogous phenomena occur for:
• Continuous variables (‘Spurious correlation’: Pearson, 1899)
• Probabilistic forecasts
Non-collapsibility (Greenland, Robins & Pearl 1999)
• More generally, the value of a measure overall, compared to the same measure in two or more subgroups, can:
• Reverse
• Change from zero to non-zero or vice versa
• Increase or decrease in magnitude
• In general, conditions for collapsibility do depend on the measure chosen (Shapiro 1982)
Some real data
• Observations from UK surface stations
• Equitable threat score (ETS), also known as the Gilbert skill score
1. Precipitation ≥ 0.5 mm in 6 hours; Met Office global model; combining groups of stations
2. Visibility ≤ 1000 m; Met Office ‘UKV’ model; combining dates and times
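For reference, the ETS has a standard definition in terms of the 2×2 counts, correcting hits for those expected by chance. A sketch (not code from the talk):

```python
def ets(hits, false_alarms, misses, correct_rejections):
    """Equitable threat score (Gilbert skill score)."""
    n = hits + false_alarms + misses + correct_rejections
    # Hits expected by chance under independence of forecast and observation
    hits_random = (hits + false_alarms) * (hits + misses) / n
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)
```

Applied to the December-to-May table above, ets(35, 56, 21, 70) gives 7/84 ≈ 0.083.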
Combining over areas
Combining over times
[Figure: ETS for visibility < 1000 m at UK sites against forecast lead time (6‒36 hours); lines show the mean, median, and 36-month pooled scores]
Time of day
[Figure: climatological base rate as a function of time of day (hours)]
Is this new to verification?

• Non-collapsibility isn’t:
• Hamill TM & Juras J (2006). Measuring forecast skill: is it real skill or is it the varying climatology? Quarterly Journal of the Royal Meteorological Society, Vol. 132, No. 621C, pp. 2905-2923
• Mason I (1989). Dependence of the critical success index on sample climate and threshold probability. Australian Meteorological Magazine, Vol. 37, pp. 75-81
• Yet to find any mention or description of Simpson’s paradox in the forecast verification literature.
Alternatives
• We have seen that it may be misleading to pool data over large areas or time periods simply by adding up the numbers (whether counts, mean squares, Brier scores…)
• But what’s the alternative?
1. Report performance measures only in homogeneous subgroups
• But there may be rather a lot of them, so we often want a summary measure
2. Percentile thresholds
• The issue of non-collapsibility goes away if the base rate (climatological event frequency) is the same for all samples
• True if we use percentile thresholds, i.e. quantiles of the local climatological distribution
• But such thresholds can be harder to interpret
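A minimal sketch of the idea (function names are my own, not from the slides): each station's threshold is a quantile of its own record, so by construction every station ends up with the same base rate.

```python
def local_threshold(values, q=0.9):
    """q-th quantile of a station's own climatology (nearest-rank)."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]

def events_at_percentile(obs_by_station, q=0.9):
    """Flag an event wherever a value exceeds that station's own quantile."""
    return {
        stn: [v > local_threshold(obs, q) for v in obs]
        for stn, obs in obs_by_station.items()
    }
```

With q = 0.9, roughly 10% of values at each station are flagged as events, regardless of how the stations' climatologies differ.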
3. Weighted averaging
1. Estimate the measure within each homogeneous subgroup (e.g. coastal stations in autumn 2013)
2. Summarise these estimates graphically
3. If the estimates are fairly homogeneous: report a single summary measure by taking a weighted average of the subgroup estimates
In statistics, this is known as meta-analysis
(whole literature on how to choose weights, derive confidence intervals ...)
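One common weighting choice from that literature is inverse-variance (fixed-effect) pooling; this is a sketch of that convention, not a prescription from the talk:

```python
import math

def inverse_variance_pool(estimates, variances):
    """Fixed-effect meta-analysis: pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))
```

Subgroups with smaller sampling variance (more data) get proportionally more weight in the pooled score.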
4. Paired comparisons
• Skill scores relative to persistence account for variation with time of year and time of day (24-h persistence) (Mittermaier 2008)
• Commonly used for continuous variables
• Less common for dichotomous variables?
• Can define e.g. ETS skill score relative to persistence in the usual way
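A sketch of "the usual way", i.e. the generic skill-score form (S - S_ref) / (S_perfect - S_ref) with a perfect ETS of 1 (function name hypothetical):

```python
def skill_vs_persistence(ets_forecast, ets_persistence):
    """ETS skill score relative to a persistence reference forecast."""
    return (ets_forecast - ets_persistence) / (1.0 - ets_persistence)
```

A value of 0 means no better than persistence; 1 means perfect.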
Summary & future work
• “Pooling over heterogeneous regions can easily produce misleading results”
• Not clear how big an issue this is in practice in a small region such as the UK
• Future work: produce empirical results on the frequency of Simpson’s Paradox and of substantial non-collapsibility in real data
• Solutions include:
• weighted averaging (meta-analysis)
• Paired comparisons (e.g. to persistence)
Discussion / Questions

• Anything I’ve missed?
• Should we worry more about these issues?
• What do you do in practice?
References
Greenland S, Robins JM, Pearl J (1999). Confounding and collapsibility in causal inference. Statistical Science 14(1), 29-46.
Hogan RJ, Mason IB (2012). Deterministic forecasts of binary events. Chapter 3 in Jolliffe IT, Stephenson DB. Forecast Verification: A Practitioner's Guide in Atmospheric Science. 2nd edition. John Wiley & Sons.
Mittermaier MP (2008). The potential impact of using persistence as a reference forecast on perceived forecast skill. Weather & Forecasting 23, 1022‒1031.
Pearson K (1899). Mathematical Contributions to the Theory of Evolution. VI. Genetic (Reproductive) Selection. Philosophical Transactions of the Royal Society of London. Series A. 192, 259-278.(Specifically, Proposition VI pp. 277‒278 “On the spurious correlation produced by forming a mixture of heterogeneous but uncorrelated materials”)
Shapiro SH (1982). Collapsing Contingency Tables—A Geometric Approach. The American Statistician 36(1), 43-46.
Simpson EH (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 13(2), 238-241.
Simpson EH (2010). Edward Simpson: Bayes at Bletchley Park. Significance 7(2): 76-80.
Wikipedia contributors. Simpson's paradox. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10).
Wikipedia contributors. Edward H. Simpson. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10)
Historical aside: Stigler's law of eponymy
“No scientific discovery is named after its original discoverer.”
(Stephen Stigler, 1980)
• Simpson’s paper doesn’t describe reversal, but rather a fictitious example in which the measure is equal and non-zero in both of two categories, but zero when the categories are collapsed
• Reversal phenomenon named ‘Simpson’s Paradox’ by Colin Blyth in 1972
• Described (with a real-data example) as early as 1934 by Cohen & Nagel
Blyth CR (1972). On Simpson’s Paradox and the Sure-Thing Principle. Journal of the American Statistical Association 67(338), 364‒366.
Cohen MR, Nagel E (1934). An Introduction to Logic and Scientific Method. (Harcourt, Brace & Co.)
Stigler SM (1980). Stigler's law of eponymy. In: Gieryn TF, ed. Science and social structure: a festschrift for Robert K. Merton. (New York Academy of Sciences) pp. 147–57. Republished in Stigler's collection Statistics on the Table: The History of Statistical Concepts and Methods (1999, Harvard University Press)