Verifying cloud and boundary-layer forecasts
Robin Hogan
Ewan O’Connor, Natalie Harvey, Thorwald Stein, Anthony Illingworth, Julien Delanoe, Helen Dacre, Helene Garcon
University of Reading, UK
Chris Ferro, Ian Jolliffe, David Stephenson
University of Exeter, UK
How skillful is a forecast?
• Most model evaluations of clouds test the cloud climatology
– What about individual forecasts?
• Standard measure shows ECMWF forecast “half-life” of ~6 days in 1980 and ~9 days in 2000
– But virtually insensitive to clouds!
[Figure: ECMWF 500-hPa geopotential anomaly correlation]
• Cloud has smaller-scale variations than geopotential height because it is separated by around two orders of differentiation:
– Cloud ~ vertical wind ~ relative vorticity ~ ∇²streamfunction ~ ∇²pressure
– Suggests cloud observations would be a more stringent test of models
[Figure panels: Geopotential height anomaly; Vertical velocity]
Overview
• Desirable properties of verification measures (skill scores)
– Usefulness for rare events
– Equitability: is the “Equitable Threat Score” equitable?
• Testing the skill of cloud forecasts from seven models
– What is the “half life” of a cloud forecast?
• Testing the skill of cloud forecasts from space
– Which cloud types are best forecast and which types worst?
• Testing the skill of boundary-layer type forecasts
– New diagnosis from Doppler lidar
[Figure: cloud fraction. Panels: Chilbolton observations; Met Office Mesoscale Model; ECMWF Global Model; Meteo-France ARPEGE Model; KNMI RACMO Model; Swedish RCA model]
Joint PDFs of cloud fraction
• Raw (1 hr) resolution
– 1 year from Murgtal
– DWD COSMO model
• 6-hr averaging
• …or use a simple contingency table (DWD model, Murgtal):
                     Observed cloud    Observed clear-sky
  Model cloud        a = 7194          b = 4098
  Model clear-sky    c = 4502          d = 41062
Contingency tables
                     Observed cloud    Observed clear-sky
  Model cloud        a: Cloud hit      b: False alarm
  Model clear-sky    c: Miss           d: Clear-sky hit
• For a given set of observed events, there are only 2 degrees of freedom in all possible forecasts (e.g. a & b), because 2 quantities are fixed:
– Number of events that occurred n = a + b + c + d
– Base rate (observed frequency of occurrence) p = (a + c)/n
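As a rough illustration of how the contingency-table quantities fit together, the sketch below computes the sample size, base rate, hit rate and false alarm rate from the DWD/Murgtal counts quoted in the talk. The function name `table_summary` is hypothetical; the formulas H = a/(a+c) and F = b/(b+d) are the ones given later in the talk.

```python
# Counts from the DWD model / Murgtal contingency table quoted in the talk.
a, b, c, d = 7194, 4098, 4502, 41062   # hits, false alarms, misses, clear-sky hits

def table_summary(a, b, c, d):
    """Return sample size, base rate, hit rate and false alarm rate."""
    n = a + b + c + d
    base_rate = (a + c) / n          # p: observed frequency of occurrence
    hit_rate = a / (a + c)           # H: fraction of observed events that were forecast
    false_alarm_rate = b / (b + d)   # F: fraction of non-events forecast as events
    return n, base_rate, hit_rate, false_alarm_rate

n, p, H, F = table_summary(a, b, c, d)
print(f"n = {n}, p = {p:.3f}, H = {H:.3f}, F = {F:.3f}")
```

Because n and p depend only on the observations, any two of the four cell counts are enough to pin down a forecast's table, which is the "2 degrees of freedom" point above.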
Desirable properties of verification measures
1. “Equitable”: all random forecasts receive expected score zero
– Constant forecasts of occurrence or non-occurrence also score zero
2. Difficult to “hedge”
– Some measures reward under- or over-prediction
3. Useful for rare events
– Almost all widely used measures are “degenerate” in that they asymptote to 0 or 1 for vanishingly rare events
4. “Linear”: so that an inverse exponential can be fitted to estimate the half-life
5. Useful for overwhelmingly common events…
6. Base-rate independent…
7. Bounded…
For a full discussion see Hogan and Mason, Chapter 3 of Forecast Verification, 2nd edition.
Skill versus cloud-fraction threshold
• Consider 7 models evaluated over 3 European sites in 2003-2004
– Two equitable measures: Heidke Skill Score and Log of Odds Ratio
– LOR implies skill increases for larger cloud-fraction threshold
– HSS implies skill decreases significantly for larger cloud-fraction threshold
[Figure panels: LOR; HSS]
Extreme dependency scores
• Stephenson et al. (2008) explained this behavior:
– Almost all scores have a meaningless limit as “base rate” p → 0
– HSS tends to zero and LOR tends to infinity
• Solved with their Extreme Dependency Score:
– Problem: inequitable and easy to hedge: just forecast clouds all the time
• Hogan et al. (2009) proposed the symmetric version SEDS:
• Ferro and Stephenson (2011) proposed Symmetric Extremal Dependence Index:
$$\mathrm{SEDI} = \frac{\ln F - \ln H - \ln(1-F) + \ln(1-H)}{\ln F + \ln H + \ln(1-F) + \ln(1-H)}$$
– where hit rate H = a/(a+c) and false alarm rate F = b/(b+d)
– Robust for rare and overwhelmingly common events
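The three extreme-dependency measures can be sketched in a few lines. The EDS and SEDS expressions are not legible in this transcript, so the forms used here are the ones I understand to be published in Stephenson et al. (2008) and Hogan et al. (2009): EDS = 2 ln((a+c)/n) / ln(a/n) − 1 and SEDS = [ln((a+b)/n) + ln((a+c)/n)] / ln(a/n) − 1. Treat them as assumptions and check against those papers. SEDI follows the hit-rate and false-alarm-rate definition above. Function names are hypothetical.

```python
from math import log

def eds(a, b, c, d):
    """Extreme Dependency Score (assumed form, after Stephenson et al. 2008)."""
    n = a + b + c + d
    return 2.0 * log((a + c) / n) / log(a / n) - 1.0

def seds(a, b, c, d):
    """Symmetric Extreme Dependency Score (assumed form, after Hogan et al. 2009)."""
    n = a + b + c + d
    return (log((a + b) / n) + log((a + c) / n)) / log(a / n) - 1.0

def sedi(a, b, c, d):
    """Symmetric Extremal Dependence Index in terms of H and F.

    Undefined (log of zero) when H or F is exactly 0 or 1.
    """
    H = a / (a + c)   # hit rate
    F = b / (b + d)   # false alarm rate
    num = log(F) - log(H) - log(1 - F) + log(1 - H)
    den = log(F) + log(H) + log(1 - F) + log(1 - H)
    return num / den

# Made-up table whose forecasts are independent of the observations (H == F).
no_skill = (10, 30, 20, 60)
print(eds(*no_skill), seds(*no_skill), sedi(*no_skill))

# Hedging EDS: forecast the event every time (c = d = 0) and the score is 1.
print(eds(30, 90, 0, 0))
```

With the no-skill table, SEDS and SEDI come out at zero while EDS does not, and the always-forecast table gets a perfect EDS, which is the hedging problem noted above.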
Skill versus cloud-fraction threshold
• SEDS has much flatter behaviour for all models (except for Met Office, which underestimates high cloud occurrence significantly)
[Figure panels: LOR; HSS; SEDS]
Skill versus height
• Verification using SEDS reveals:
– Skill tends to slowly decrease at tropopause
– Mid-level clouds (4-5 km) most skilfully predicted, particularly by Met Office
– Boundary-layer clouds least skilfully predicted
Asymptotic equitability
• For some measures, expected score for random forecast only tends to zero for a large number of samples: these are asymptotically equitable
• “Equitable Threat Score” is slightly inequitable for n < 30
– Should call it Gilbert Skill Score
• ORSS, EDS, SEDS & SEDI approach zero much more slowly with n
– For events that occur 2% of the time need n > 25,000 before magnitude of expected score is less than 0.01
• Hogan et al. (2010) showed that inequitable measures can be scaled to make them equitable but tricky numerical operation
• Alternatively be sure sample size is large enough and report CIs on verification measures
League table
Columns: (1) truly equitable; (2) asymptotically equitable; (3) linear or nearly linear; (4) useful for rare events; (5) useful for overwhelmingly common events
Measure                                              (1) (2) (3) (4) (5)
Equitably transformed SEDI (tricky to implement)      Y   Y   Y   Y   Y
Symmetric Extremal Dependence Index SEDI              N   Y   Y   Y   Y
Symmetric Extreme Dependency Score SEDS               N   Y   Y   Y   N
Peirce Skill Score PSS / Heidke Skill Score HSS       Y   Y   Y   N   N
Log of Odds Ratio LOR                                 N   Y   Y   N   N
Odds Ratio Skill Score ORSS / Yule’s Q                N   Y   N   N   N
Gilbert Skill Score GSS (formerly ETS)                N   Y   N   N   N
Extreme Dependency Score EDS                          N   N   Y   Y   N
Hit Rate H / False Alarm Rate FAR                     N   N   Y   N   N
Critical Success Index CSI                            N   N   N   N   N
Forecast “half life”
• Fit an inverse-exponential: $S(t) = S_0\, 2^{-t/\tau_{1/2}}$
– S0 is the initial score and τ1/2 is the half-life
• Noticeably longer half-life fitted after 36 hours
– Same thing found for Met Office rainfall forecast (Roberts 2008)
– First timescale due to data assimilation and convective events
– Second due to more predictable large-scale weather systems
[Figure: fitted half-lives for Met Office and DWD forecasts, labelled 2004 and 2007; the fitted values shown range from 2.4 to 4.3 days]
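One simple way to estimate the half-life from scores at several lead times is to note that log2 S(t) is linear in t with slope −1/τ1/2, and fit a straight line. The sketch below does this with ordinary least squares. Fitting in log space is my choice for illustration, not necessarily how the fits in the talk were made, and the input scores are synthetic.

```python
from math import log2

def fit_half_life(lead_times, scores):
    """Least-squares fit of log2(S) = log2(S0) - t / tau_half.

    Returns (S0, tau_half). All scores must be positive.
    """
    y = [log2(s) for s in scores]
    m = len(lead_times)
    t_mean = sum(lead_times) / m
    y_mean = sum(y) / m
    slope = (sum((t - t_mean) * (yi - y_mean) for t, yi in zip(lead_times, y))
             / sum((t - t_mean) ** 2 for t in lead_times))
    intercept = y_mean - slope * t_mean
    return 2.0 ** intercept, -1.0 / slope

# Synthetic, noise-free scores (illustrative only, not data from the talk).
days = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
synthetic = [0.6 * 2.0 ** (-t / 3.0) for t in days]
s0, tau_half = fit_half_life(days, synthetic)
print(f"S0 = {s0:.3f}, half-life = {tau_half:.2f} days")
```

To look for the two timescales mentioned above, the same function could be applied separately to lead times before and after 36 hours.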
A-train verification: July 2006
• Both models underestimate mid- and low-level clouds (partly a snow issue at ECMWF)
• GSS and LOR misleading: skill increases or decreases with cloud fraction
• SEDS and SEDI much more robust
• Highest skill: winter upper-troposphere mid-latitudes
• Lowest skill: tropical and sub-tropical boundary-layer clouds
• Tropical deep convection somewhere in between!
How is the boundary layer modelled?
• Met Office model has explicit boundary-layer types (Lock et al. 2000)
Doppler-lidar retrieval of BL type
• Usually the most probable type has a probability greater than 0.9
• Now apply to two years of data and evaluate the type in the Met Office model
[Figure: most probable boundary-layer type. Labels: II: Stratocu over stable surface layer; IIIb: Stratocumulus-topped mixed layer; Ib: Stratus. Harvey, Hogan and Dacre (2012)]
Forecast skill
[Figure label: random]
Forecast skill: stability
• Surface layer stable?
– Model very skilful (but basically predicting day versus night)
– Better than persistence (predicting yesterday’s observations)
[Figure labels: a, b, c, d; random]
Forecast skill: cumulus
• Cumulus present (given the surface layer is unstable)?
– Much less skilful than in predicting stability
– Significantly better than persistence
[Figure labels: a, b, c, d; random]
Forecast skill: decoupled
• Decoupled (as opposed to well-mixed)?
– Not significantly more skilful than a persistence forecast
[Figure labels: a, b, c, d; random]
Forecast skill: multiple cloud layers?
• Cumulus under stratocumulus as opposed to cumulus alone?
– Not significantly more skilful than a random forecast
– Much poorer than cloud occurrence skill (SEDI 0.5-0.7)
[Figure labels: a, b, c, d; random]
Take-home messages
• Pressure is too easy to forecast; verify with clouds instead!
• Half life of cloud forecasts is 2.5-4 days rather than 9-10 days
• ETS is not strictly equitable: call it Gilbert Skill Score instead
• But GSS and most others are misleading for rare events
• I recommend the Symmetric Extremal Dependence Index
• Global verification shows mid-lat winter ice clouds have most skill, tropical boundary-layer clouds have no skill at all!
• Relevant publications
– Cloud-forecast half-life: Hogan, O’Connor & Illingworth (QJ 2009)
– Asymptotic equitability: Hogan, Ferro, Jolliffe & Stephenson (WAF 2010)
– SEDI: Ferro and Stephenson (WAF 2011)
– Comparison of verification measures and calculation of confidence intervals: Hogan and Mason (2nd Ed of “Forecast Verification” 2011)
– Doppler-lidar BL type: Harvey, Hogan & Dacre (Submitted to QJRMS)
– Global verification: Hogan, Stein, Garcon & Delanoe (ERL in prep)
Cloud fraction in 7 models
• Mean & PDF for 2004 for Chilbolton, Paris and Cabauw (Illingworth et al., BAMS 2007)
[Figure label: 0-7 km]
– All models except DWD underestimate mid-level cloud
– Some have separate “radiatively inactive” snow (ECMWF, DWD); Met Office has combined ice and snow but still underestimates cloud fraction
– Wide range of low cloud amounts in models
– Not enough overcast boxes, particularly in Met Office model
Skill-bias diagrams
[Figure: skill-bias diagram, Hogan and Mason (2011), for a reality with n=16 and p=1/4. Skill labels: positive skill; random forecast; negative skill; best possible forecast; worst possible forecast. Bias labels: under-prediction; no bias; over-prediction. Marked forecasts: random unbiased forecast; constant forecast of non-occurrence; constant forecast of occurrence]
Hedging
“Issuing a forecast that differs from your true belief in order to improve your score” (e.g. Jolliffe 2008)
• Hit rate H=a/(a+c)
– Fraction of events correctly forecast
– Easily hedged by randomly changing some forecasts of non-occurrence to occurrence
[Figure labels: H=0.5; H=0.75; H=1]
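The hedging effect on hit rate can be reproduced with a toy example. The sketch below uses a made-up reality of 16 cases with 4 events (p = 1/4) and three made-up forecasts that issue progressively more "occurrence" forecasts without any extra information. It also prints H − F, the standard definition of the Peirce Skill Score, which is not spelled out in the talk and is included here only for contrast.

```python
def rates(reality, forecast):
    """Return (hit rate H, false alarm rate F) for two equal-length 0/1 sequences."""
    pairs = list(zip(reality, forecast))
    a = sum(1 for r, f in pairs if r and f)          # hits
    b = sum(1 for r, f in pairs if not r and f)      # false alarms
    c = sum(1 for r, f in pairs if r and not f)      # misses
    d = sum(1 for r, f in pairs if not r and not f)  # correct negatives
    return a / (a + c), b / (b + d)

# Illustrative sequences only: 4 events followed by 12 non-events.
reality = [1] * 4 + [0] * 12
forecasts = {
    "original":       [1, 1, 0, 0] + [1] * 6 + [0] * 6,   # 8 forecasts of occurrence
    "hedged":         [1, 1, 1, 0] + [1] * 9 + [0] * 3,   # 12 forecasts of occurrence
    "always occurs":  [1] * 16,                           # 16 forecasts of occurrence
}

results = {name: rates(reality, f) for name, f in forecasts.items()}
for name, (H, F) in results.items():
    print(f"{name:14s} H = {H:.2f}  F = {F:.2f}  H - F = {H - F:.2f}")
```

The hit rate climbs from 0.5 to 1 as more occurrences are forecast, while H − F stays at zero throughout because the false alarm rate rises in step.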
Some reportedly equitable measures
• HSS = [x-E(x)] / [n-E(x)]; x = a+d
• ETS = [a-E(a)] / [a+b+c-E(a)]
• LOR = ln[ad/bc]
• ORSS = [ad/bc – 1] / [ad/bc + 1]
• E(a) = (a+b)(a+c)/n is the expected value of a for an unbiased random forecasting system
• Random and constant forecasts all score zero, so these measures are all equitable, right?
• Simple attempts to hedge will fail for all these measures
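The four measures on this slide translate directly into code. In the sketch below, E(x) for HSS is taken as E(a) + E(d), with E(d) = (c+d)(b+d)/n by analogy with the E(a) given above; that expansion is my assumption, since the slide only states E(a). Function names are hypothetical, and LOR and ORSS are undefined when b or c is zero.

```python
from math import log

def hss(a, b, c, d):
    """Heidke Skill Score: [x - E(x)] / [n - E(x)] with x = a + d."""
    n = a + b + c + d
    e_x = (a + b) * (a + c) / n + (c + d) * (b + d) / n   # assumed: E(a) + E(d)
    return ((a + d) - e_x) / (n - e_x)

def ets(a, b, c, d):
    """Equitable Threat Score (Gilbert Skill Score)."""
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    return (a - e_a) / (a + b + c - e_a)

def lor(a, b, c, d):
    """Log of odds ratio, ln(ad/bc)."""
    return log(a * d / (b * c))

def orss(a, b, c, d):
    """Odds Ratio Skill Score (Yule's Q)."""
    theta = a * d / (b * c)
    return (theta - 1.0) / (theta + 1.0)

murgtal = (7194, 4098, 4502, 41062)   # DWD/Murgtal counts from the talk
no_skill = (10, 30, 20, 60)           # made-up table with ad == bc
for name, fn in [("HSS", hss), ("ETS", ets), ("LOR", lor), ("ORSS", orss)]:
    print(f"{name:4s} Murgtal = {fn(*murgtal):.3f}  no-skill table = {fn(*no_skill):.3f}")
```

All four return zero for the no-skill table, which matches the "score zero at the expected values" observation; whether the expected score is also zero is a separate question.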
Extreme dependency score
• Stephenson et al. (2008) explained this behavior:
– Almost all scores have a meaningless limit as “base rate” p → 0
– HSS tends to zero and LOR tends to infinity
• They proposed the Extreme Dependency Score:
– where n = a + b + c + d
• It can be shown that this score tends to a meaningful limit:
– Rewrite in terms of hit rate H = a/(a + c) and base rate p = (a + c)/n:
– Then assume a power-law dependence of H on p as p → 0:
– In the limit p → 0 we find
– This is useful because random forecasts have hit rate converging to zero at the same rate as base rate: the power-law exponent is 1, so EDS = 0
– Perfect forecasts have constant hit rate with base rate: the power-law exponent is 0, so EDS = 1
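The equations for this derivation did not survive in the transcript. A reconstruction under stated assumptions is sketched below: the EDS form is the one I understand to be given by Stephenson et al. (2008), and the symbols κ (prefactor) and δ (exponent) for the power law are names chosen here, not necessarily those on the original slide. It reproduces the two limits quoted above.

```latex
% Assumed definition (after Stephenson et al. 2008):
\mathrm{EDS} = \frac{2\ln\left[(a+c)/n\right]}{\ln(a/n)} - 1

% Since a/n = H p, with H = a/(a+c) and p = (a+c)/n:
\mathrm{EDS} = \frac{2\ln p}{\ln H + \ln p} - 1
             = \frac{\ln p - \ln H}{\ln p + \ln H}

% Assume a power law H = \kappa\, p^{\delta} as p \to 0:
\mathrm{EDS} = \frac{(1-\delta)\ln p - \ln\kappa}{(1+\delta)\ln p + \ln\kappa}
\;\longrightarrow\; \frac{1-\delta}{1+\delta}
\quad \text{as } p \to 0

% Random forecasts:  \delta = 1 \Rightarrow \mathrm{EDS} \to 0
% Perfect forecasts: \delta = 0 \Rightarrow \mathrm{EDS} \to 1
```

The limit holds because |ln p| grows without bound as p → 0, so the constant ln κ terms become negligible.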
Extreme dependence scores
• Extreme Dependence Score
– Stephenson et al. (2008)
– Inequitable
– Easy to hedge
• Symmetric EDS
– Hogan et al. (2009)
– Asymptotically equitable
– Difficult to hedge
• Symmetric Extremal Dependence Index
– Ferro and Stephenson (2011)
– Base-rate independent
– Robust for both rare and overwhelmingly common events
$$\mathrm{SEDI} = \frac{\ln F - \ln H - \ln(1-F) + \ln(1-H)}{\ln F + \ln H + \ln(1-F) + \ln(1-H)}$$
Which measures are equitable?
• Expected values of a–d for a random forecasting system may score zero:
– S[E(a), E(b), E(c), E(d)] = 0
• But expected score may not be zero!
– E[S(a,b,c,d)] = Σ P(a,b,c,d) S(a,b,c,d)
• Width of random probability distribution decreases for larger sample size n
– A measure is only equitable if positive and negative scores cancel
• ETS & ORSS are asymmetric
[Figure panels: n = 16; n = 80]
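The expected score E[S] = Σ P(a,b,c,d) S(a,b,c,d) can be evaluated exactly for small n by enumerating every possible table. The sketch below assumes a particular random forecasting system: each case is an event with probability p and, independently, forecast as an event with probability q, so (a, b, c, d) is multinomial. Tables where a score is undefined are skipped and the probabilities renormalised. Both choices are assumptions made for illustration and may differ from the treatment in Hogan et al. (2010).

```python
from math import factorial

def pss(a, b, c, d):
    """Peirce Skill Score H - F; None when there are no events or no non-events."""
    if a + c == 0 or b + d == 0:
        return None
    return a / (a + c) - b / (b + d)

def ets(a, b, c, d):
    """Equitable Threat Score; None when the denominator vanishes."""
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    denom = a + b + c - e_a
    if abs(denom) < 1e-12:
        return None
    return (a - e_a) / denom

def expected_score(score, n, p, q):
    """Exact expected score over all tables of size n for random forecasts."""
    pa, pb, pc, pd = p * q, (1 - p) * q, p * (1 - q), (1 - p) * (1 - q)
    total = weight = 0.0
    for a in range(n + 1):
        for b in range(n - a + 1):
            for c in range(n - a - b + 1):
                d = n - a - b - c
                s = score(a, b, c, d)
                if s is None:
                    continue   # undefined table: skipped (an assumption)
                coeff = factorial(n) // (factorial(a) * factorial(b)
                                         * factorial(c) * factorial(d))
                prob = coeff * pa ** a * pb ** b * pc ** c * pd ** d
                total += prob * s
                weight += prob
    return total / weight

for n in (16, 80):
    print(n, expected_score(pss, n, 0.25, 0.25), expected_score(ets, n, 0.25, 0.25))
```

Under these assumptions the expected PSS is zero to rounding error at both sample sizes, while the expected ETS is small but positive and shrinks as n grows, in line with the "asymptotically equitable" label.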
Possible solutions
1. Ensure n is large enough that E(a) > 10
2. Inequitable scores can be scaled to make them equitable:
$$S_\mathrm{equit} = \frac{S - E(S \mid p, q)}{\max(S) - E(S \mid p, q)}$$
– This opens the way to a new class of non-linear equitable measures
3. Report confidence intervals and “p-values” (the probability of a score being achieved by chance)
Key properties for estimating ½ life
• We wish to model the score S versus forecast lead time t as:
$$S(t) = S_0\, e^{-t/\tau} = S_0\, 2^{-t/\tau_{1/2}}$$
– where τ1/2 is forecast “half-life”
• We need linearity
– Some measures “saturate” at high skill end (e.g. Yule’s Q / ORSS)
– Leads to misleadingly long half-life
• ...and equitability
– The formula above assumes that score tends to zero for very long forecasts, which will only occur if the measure is equitable
Why is half-life less for clouds than pressure?
• Different spatial scales? Convection?
– Average temporally before calculating skill scores:
– Absolute score and half-life increase with number of hours averaged
Forecast skill: Nocturnal stratocu
• Stratocumulus present (given a stable surface layer)?
– Marginally more skilful than a persistence forecast
– Much poorer than cloud occurrence skill (SEDI 0.5-0.7)
[Figure labels: a, b, c, d; random]