Spurious skill? Varying event frequencies, non-collapsibility and Simpson’s Paradox in forecast verification
Roger Harbord, [email protected]
EMS & 12th ECAM, 10 September 2015, Sofia
Who was Simpson?
Journal of the Royal Statistical Society, Series B (Methodological). Vol. 13, No. 2, pp. 238‒241
© Royal Statistical Society
© Crown copyright Met Office
Verification of binary forecasts

                     Event observed?
Event forecast?      Yes       No
Yes                  Hits      False alarms
No                   Misses    Correct rejections
• Hit rate H = Hits ∕ ( Hits + Misses )
• False alarm rate F = False alarms ∕ ( False alarms + Correct rejections )
• Peirce skill score (true skill statistic, Hanssen & Kuiper’s discriminant, Youden’s index …)
PSS = H − F
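The two rates and their difference can be computed directly from the table counts. A minimal sketch (illustration only, not code from the talk; the function name is my own):

```python
def peirce_skill_score(hits, false_alarms, misses, correct_rejections):
    """Return (hit rate H, false alarm rate F, PSS = H - F)."""
    H = hits / (hits + misses)
    F = false_alarms / (false_alarms + correct_rejections)
    return H, F, H - F

# Using the December-to-May counts from the next slide:
# H = 0.625, F ~ 0.444, PSS ~ 0.181
H, F, score = peirce_skill_score(35, 56, 21, 70)
```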
Homer’s performance: December to May

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  35     56     91
No                   21     70     91
Total                56     126    182
• Hit rate H = 35 / 56 = 0.625
• False alarm rate F = 56 / 126 = 0.444
• Peirce skill score PSS = H − F = 0.625 − 0.444 = 0.18
two-sided P = 0.025
December to February

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  33     43     76
No                   7      7      14
Total                40     50     90
• Hit rate H = 33 / 40 = 0.825
• False alarm rate F = 43 / 50 = 0.860
• Peirce skill score PSS = H − F = 0.825 − 0.860 = − 0.035
March to May

                     Low visibility observed?
Low vis. forecast?   Yes    No     Total
Yes                  2      13     15
No                   14     63     77
Total                16     76     92
• Hit rate H = 2 / 16 = 0.125
• False alarm rate F = 13 / 76 = 0.171
• Peirce skill score PSS = H − F = 0.125 − 0.171 = − 0.046
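Using the counts from the three tables above, a short sketch (not from the slides) reproduces the reversal: both seasonal tables have negative PSS, yet simply adding the counts gives a positive pooled PSS.

```python
def pss(hits, false_alarms, misses, correct_rejections):
    """Peirce skill score: hit rate minus false alarm rate."""
    H = hits / (hits + misses)
    F = false_alarms / (false_alarms + correct_rejections)
    return H - F

# (hits, false alarms, misses, correct rejections) from the slides
djf = (33, 43, 7, 7)     # December to February: PSS ~ -0.035
mam = (2, 13, 14, 63)    # March to May:         PSS ~ -0.046
pooled = tuple(x + y for x, y in zip(djf, mam))  # December-to-May table

# Both seasons score below zero; the pooled table scores ~ +0.18
```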
Collapsing contingency tables: A geometric approach
Shapiro SH (1982). The American Statistician, 36 (1): 43-46.
[Figure: hit rate plotted against false alarm rate, both axes from 0 to 1]
Simpson’s paradox (‘the reversal paradox’)
• Not limited to the Peirce skill score: the reversal is independent of the measure used, since all sensible performance measures agree on the direction of the effect (positive or negative)
• Not limited to deterministic forecasts of dichotomous events; analogous phenomena occur for:
• Continuous variables (‘Spurious correlation’: Pearson, 1899)
• Probabilistic forecasts
Non-collapsibility (Greenland, Robins & Pearl 1999)
• More generally, the value of a measure overall, compared to the same measure in two or more subgroups, can:
• Reverse
• Change from zero to non-zero or vice versa
• Increase or decrease in magnitude
• In general, conditions for collapsibility do depend on the measure chosen (Shapiro 1982)
Some real data
• Observations from UK surface stations
• Equitable threat score (ETS), also known as the Gilbert skill score
1. Precipitation ≥ 0.5 mm in 6 hours; Met Office global model; combining groups of stations
2. Visibility ≤ 1000 m; Met Office ‘UKV’ model; combining dates and times
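For reference, the ETS has a standard definition in terms of the 2×2 counts, correcting hits for those expected by chance. A sketch (not code from the talk):

```python
def ets(hits, false_alarms, misses, correct_rejections):
    """Equitable threat score (Gilbert skill score)."""
    n = hits + false_alarms + misses + correct_rejections
    # Hits expected by chance under independence of forecast and observation
    hits_random = (hits + false_alarms) * (hits + misses) / n
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)
```

Applied to the December-to-May table above, ets(35, 56, 21, 70) gives 7/84 ≈ 0.083.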
Combining over areas
Combining over times
[Figure: ETS for visibility < 1000 m at UK sites against forecast lead time (6‒36 hours); lines show the mean, median, and 36-month pooled scores]
Time of day
[Figure: climatological base rate as a function of time of day (hours)]
Is this new to verification?

• Non-collapsibility isn’t:
• Hamill TM & Juras J (2006). Measuring forecast skill: is it real skill or is it the varying climatology? Quarterly Journal of the Royal Meteorological Society, Vol. 132, No. 621C, pp. 2905-2923
• Mason I (1989). Dependence of the critical success index on sample climate and threshold probability. Australian Meteorological Magazine, Vol. 37, pp. 75-81
• Yet to find any mention or description of Simpson’s paradox in the forecast verification literature.
Alternatives
• We have seen that it may be misleading to pool data over large areas or time periods simply by adding up the numbers (whether counts, mean squares, Brier scores…)
• But what’s the alternative?
1. Report performance measures only in homogeneous subgroups
• But there may be rather a lot of them, so we often want a summary measure
2. Percentile thresholds
• The issue of non-collapsibility goes away if the base rate (climatological event frequency) is the same for all samples
• True if we use percentile thresholds, i.e. quantiles of the local climatological distribution
• But such thresholds can be harder to interpret
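A minimal sketch of the idea (function names are my own, not from the slides): each station's threshold is a quantile of its own record, so by construction every station ends up with the same base rate.

```python
def local_threshold(values, q=0.9):
    """q-th quantile of a station's own climatology (nearest-rank)."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]

def events_at_percentile(obs_by_station, q=0.9):
    """Flag an event wherever a value exceeds that station's own quantile."""
    return {
        stn: [v > local_threshold(obs, q) for v in obs]
        for stn, obs in obs_by_station.items()
    }
```

With q = 0.9, roughly 10% of values at each station are flagged as events, regardless of how the stations' climatologies differ.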
3. Weighted averaging
1. Estimate the measure within each homogeneous subgroup (e.g. coastal stations in autumn 2013)
2. Summarise these estimates graphically
3. If the estimates are fairly homogeneous: report a single summary measure by taking a weighted average of the subgroup estimates
In statistics, this is known as meta-analysis
(whole literature on how to choose weights, derive confidence intervals ...)
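One common weighting choice from that literature is inverse-variance (fixed-effect) pooling; this is a sketch of that convention, not a prescription from the talk:

```python
import math

def inverse_variance_pool(estimates, variances):
    """Fixed-effect meta-analysis: pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))
```

Subgroups with smaller sampling variance (more data) get proportionally more weight in the pooled score.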
4. Paired comparisons
• Skill scores relative to persistence account for variation with time of year and time of day (24-h persistence) (Mittermaier 2008)
• Commonly used for continuous variables
• Less common for dichotomous variables?
• Can define e.g. ETS skill score relative to persistence in the usual way
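A sketch of "the usual way", i.e. the generic skill-score form (S - S_ref) / (S_perfect - S_ref) with a perfect ETS of 1 (function name hypothetical):

```python
def skill_vs_persistence(ets_forecast, ets_persistence):
    """ETS skill score relative to a persistence reference forecast."""
    return (ets_forecast - ets_persistence) / (1.0 - ets_persistence)
```

A value of 0 means no better than persistence; 1 means perfect.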
Summary & future work
• “Pooling over heterogeneous regions can easily produce misleading results”
• Not clear how big an issue this is in practice in a small region such as the UK
• Future work: produce empirical results on the frequency of Simpson’s Paradox and of substantial non-collapsibility in real data
• Solutions include:
• weighted averaging (meta-analysis)
• Paired comparisons (e.g. to persistence)
Discussion / Questions

• Anything I’ve missed?
• Should we worry more about these issues?
• What do you do in practice?
References
Greenland S, Robins JM, Pearl J (1999). Confounding and collapsibility in causal inference. Statistical Science 14(1), 29-46.
Hogan RJ, Mason IB (2012). Deterministic forecasts of binary events. Chapter 3 in Jolliffe IT, Stephenson DB. Forecast Verification: A Practitioner's Guide in Atmospheric Science. 2nd edition. John Wiley & Sons.
Mittermaier MP (2008). The potential impact of using persistence as a reference forecast on perceived forecast skill. Weather & Forecasting 23, 1022‒1031.
Pearson K (1899). Mathematical Contributions to the Theory of Evolution. VI. Genetic (Reproductive) Selection. Philosophical Transactions of the Royal Society of London. Series A. 192, 259-278.(Specifically, Proposition VI pp. 277‒278 “On the spurious correlation produced by forming a mixture of heterogeneous but uncorrelated materials”)
Shapiro SH (1982). Collapsing Contingency Tables—A Geometric Approach. The American Statistician 36(1), 43-46.
Simpson EH (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 13(2), 238-241.
Simpson EH (2010). Edward Simpson: Bayes at Bletchley Park. Significance 7(2): 76-80.
Wikipedia contributors. Simpson's paradox. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10).
Wikipedia contributors. Edward H. Simpson. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10)
Historical aside: Stigler's law of eponymy
“No scientific discovery is named after its original discoverer.”
(Stephen Stigler, 1980)
• Simpson’s paper doesn’t describe reversal, but rather a fictitious example in which the measure is equal and non-zero in both of two categories, but zero when the categories are collapsed
• Reversal phenomenon named ‘Simpson’s Paradox’ by Colin Blyth in 1972
• Described (with a real-data example) as early as 1934 by Cohen & Nagel
Blyth CR (1972). On Simpson’s Paradox and the Sure-Thing Principle. Journal of the American Statistical Association 67(338), 364‒366.
Cohen MR, Nagel E (1934). An Introduction to Logic and Scientific Method. (Harcourt, Brace & Co.)
Stigler SM (1980). Stigler's law of eponymy. In: Gieryn TF, ed. Science and social structure: a festschrift for Robert K. Merton. (New York Academy of Sciences) pp. 147–57. Republished in Stigler's collection Statistics on the Table: The History of Statistical Concepts and Methods (1999, Harvard University Press)