Verifying cloud and boundary-layer forecasts

Robin Hogan

Ewan O’Connor, Natalie Harvey, Thorwald Stein, Anthony Illingworth, Julien Delanoe, Helen Dacre, Helene Garcon

University of Reading, UK

Chris Ferro, Ian Jolliffe, David Stephenson

University of Exeter, UK


How skillful is a forecast?

• Most model evaluations of clouds test the cloud climatology
  – What about individual forecasts?

• Standard measure shows ECMWF forecast “half-life” of ~6 days in 1980 and ~9 days in 2000
  – But virtually insensitive to clouds!

[Figure: ECMWF 500-hPa geopotential anomaly correlation]

• Cloud has smaller-scale variations than geopotential height because it is separated by around two orders of differentiation:
  – Cloud ~ vertical wind ~ relative vorticity ~ $\nabla^2$ streamfunction ~ $\nabla^2$ pressure
  – Suggests cloud observations would be a more stringent test of models

[Figure panels: geopotential height anomaly; vertical velocity]

Overview

• Desirable properties of verification measures (skill scores)
  – Usefulness for rare events
  – Equitability: is the “Equitable Threat Score” equitable?

• Testing the skill of cloud forecasts from seven models
  – What is the “half life” of a cloud forecast?

• Testing the skill of cloud forecasts from space
  – Which cloud types are best forecast and which types worst?

• Testing the skill of boundary-layer type forecasts
  – New diagnosis from Doppler lidar

[Figure: cloud fraction from Chilbolton observations compared with the Met Office Mesoscale Model, ECMWF Global Model, Meteo-France ARPEGE Model, KNMI RACMO Model and Swedish RCA model]

Joint PDFs of cloud fraction

• Raw (1 hr) resolution
  – 1 year from Murgtal
  – DWD COSMO model

• 6-hr averaging

[Figure: joint PDFs of observed and model cloud fraction, divided into quadrants a, b, c and d]

…or use a simple contingency table (DWD model, Murgtal):

a = 7194    b = 4098
c = 4502    d = 41062

Contingency tables

                   Observed cloud    Observed clear-sky
Model cloud        a: Cloud hit      b: False alarm
Model clear-sky    c: Miss           d: Clear-sky hit

For a given set of observed events, there are only 2 degrees of freedom in all possible forecasts (e.g. a and b), because 2 quantities are fixed:
  – Total number of samples n = a + b + c + d
  – Base rate (observed frequency of occurrence) p = (a + c)/n
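As a quick illustration, the sample size, base rate, hit rate and false alarm rate can be computed directly from the four counts. This is a minimal sketch using the Murgtal counts quoted earlier; the function name `table_summary` is hypothetical, and the hit rate H = a/(a+c) and false alarm rate F = b/(b+d) follow the definitions used elsewhere in this talk.

```python
# Counts quoted for the DWD model at Murgtal.
a, b, c, d = 7194, 4098, 4502, 41062


def table_summary(a, b, c, d):
    """Return sample size, base rate, hit rate and false alarm rate.

    a: cloud hits, b: false alarms, c: misses, d: clear-sky hits.
    (Hypothetical helper, written for illustration only.)
    """
    n = a + b + c + d
    return {
        "n": n,
        "base_rate": (a + c) / n,         # p = (a + c)/n
        "hit_rate": a / (a + c),          # H = a/(a + c)
        "false_alarm_rate": b / (b + d),  # F = b/(b + d)
    }


summary = table_summary(a, b, c, d)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

Once n and p are fixed by the observations, choosing a and b determines c and d, which is the "2 degrees of freedom" point made above.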

Desirable properties of verification measures

1. “Equitable”: all random forecasts receive expected score zero
  – Constant forecasts of occurrence or non-occurrence also score zero
2. Difficult to “hedge”
  – Some measures reward under- or over-prediction
3. Useful for rare events
  – Almost all widely used measures are “degenerate” in that they asymptote to 0 or 1 for vanishingly rare events
4. “Linear”: so that an inverse exponential can be fitted for half-life
5. Useful for overwhelmingly common events…
6. Base-rate independent…
7. Bounded…

For a full discussion see Hogan and Mason, Chapter 3 of Forecast Verification, 2nd edition.

Skill versus cloud-fraction threshold

• Consider 7 models evaluated over 3 European sites in 2003-2004
  – Two equitable measures: Heidke Skill Score (HSS) and Log of Odds Ratio (LOR)
  – LOR implies skill increases for larger cloud-fraction threshold
  – HSS implies skill decreases significantly for larger cloud-fraction threshold

[Figure panels: LOR; HSS]

Extreme dependency scores

• Stephenson et al. (2008) explained this behavior:
  – Almost all scores have a meaningless limit as “base rate” p → 0
  – HSS tends to zero and LOR tends to infinity

• Solved with their Extreme Dependency Score:

  $$\mathrm{EDS} = \frac{2\ln[(a+c)/n]}{\ln(a/n)} - 1$$

  – Problem: inequitable and easy to hedge: just forecast clouds all the time

• Hogan et al. (2009) proposed the symmetric version SEDS:

  $$\mathrm{SEDS} = \frac{\ln[(a+b)/n] + \ln[(a+c)/n]}{\ln(a/n)} - 1$$

• Ferro and Stephenson (2011) proposed the Symmetric Extremal Dependence Index:

  $$\mathrm{SEDI} = \frac{\ln F - \ln H + \ln(1-H) - \ln(1-F)}{\ln F + \ln H + \ln(1-H) + \ln(1-F)}$$

  – where hit rate H = a/(a+c) and false alarm rate F = b/(b+d)
  – Robust for rare and overwhelmingly common events
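The three scores can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not code from the talk: the function names are hypothetical, the EDS and SEDS expressions are written from the published definitions in Stephenson et al. (2008) and Hogan et al. (2009), and no guard is included for tables where a logarithm of zero would occur.

```python
from math import log


def eds(a, b, c, d):
    """Extreme Dependency Score (assumed form: 2 ln[(a+c)/n] / ln(a/n) - 1)."""
    n = a + b + c + d
    return 2.0 * log((a + c) / n) / log(a / n) - 1.0


def seds(a, b, c, d):
    """Symmetric EDS (assumed form: {ln[(a+b)/n] + ln[(a+c)/n]} / ln(a/n) - 1)."""
    n = a + b + c + d
    return (log((a + b) / n) + log((a + c) / n)) / log(a / n) - 1.0


def sedi(a, b, c, d):
    """Symmetric Extremal Dependence Index from hit rate H and false alarm rate F."""
    H = a / (a + c)
    F = b / (b + d)
    numerator = log(F) - log(H) + log(1.0 - H) - log(1.0 - F)
    denominator = log(F) + log(H) + log(1.0 - H) + log(1.0 - F)
    return numerator / denominator


# Hypothetical table with no skill: a equals its random expectation (a+b)(a+c)/n.
random_like = (1, 3, 3, 9)
print("SEDS, SEDI for a random-like table:", seds(*random_like), sedi(*random_like))

# Hypothetical hedge: forecast cloud all the time (c = d = 0).
always_cloud = (4, 12, 0, 0)
print("EDS, SEDS when always forecasting cloud:", eds(*always_cloud), seds(*always_cloud))
```

The second example shows the hedging problem noted above: always forecasting cloud earns the maximum EDS, while SEDS returns zero for the same forecasts.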

Skill versus cloud-fraction threshold

SEDS has much flatter behaviour for all models (except for Met Office which underestimates high cloud occurrence significantly)

[Figure panels: LOR; HSS; SEDS]

Skill versus height

• Verification using SEDS reveals:
  – Skill tends to slowly decrease at tropopause
  – Mid-level clouds (4-5 km) most skilfully predicted, particularly by Met Office
  – Boundary-layer clouds least skilfully predicted

Asymptotic equitability

For some measures, the expected score for a random forecast only tends to zero for a large number of samples: these are asymptotically equitable

• “Equitable Threat Score” is slightly inequitable for n < 30
  – Should call it Gilbert Skill Score

• ORSS, EDS, SEDS & SEDI approach zero much more slowly with n
  – For events that occur 2% of the time need n > 25,000 before magnitude of expected score is less than 0.01

• Hogan et al. (2010) showed that inequitable measures can be scaled to make them equitable, but this is a tricky numerical operation

• Alternatively be sure sample size is large enough and report CIs on verification measures

League table

Columns: TE = truly equitable; AE = asymptotically equitable; L = linear or nearly linear; RE = useful for rare events; CE = useful for overwhelmingly common events

Measure                                             TE  AE  L   RE  CE
Equitably transformed SEDI (tricky to implement)    Y   Y   Y   Y   Y
Symmetric Extremal Dependence Index SEDI            N   Y   Y   Y   Y
Symmetric Extreme Dependency Score SEDS             N   Y   Y   Y   N
Peirce Skill Score PSS / Heidke Skill Score HSS     Y   Y   Y   N   N
Log of Odds Ratio LOR                               N   Y   Y   N   N
Odds Ratio Skill Score ORSS / Yule’s Q              N   Y   N   N   N
Gilbert Skill Score GSS (formerly ETS)              N   Y   N   N   N
Extreme Dependency Score EDS                        N   N   Y   Y   N
Hit Rate H / False Alarm Rate FAR                   N   N   Y   N   N
Critical Success Index CSI                          N   N   N   N   N

Forecast “half life”

• Fit an inverse-exponential:

  $$S(t) = S_0\, 2^{-t/\tau_{1/2}}$$

  – $S_0$ is the initial score and $\tau_{1/2}$ is the half-life

• Noticeably longer half-life fitted after 36 hours
  – Same thing found for Met Office rainfall forecast (Roberts 2008)
  – First timescale due to data assimilation and convective events
  – Second due to more predictable large-scale weather systems

[Figure: skill score versus forecast lead time for each model, including Met Office and DWD, 2004 to 2007; the fitted half-lives annotated on the panels range from 2.4 to 4.3 days]
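One simple way to estimate the half-life is a straight-line fit to log2 of the score against lead time, since log2 S(t) = log2 S0 - t/τ½. The sketch below is an assumption about how such a fit could be done, not the fitting procedure used for the figure; the function name `fit_half_life` and the synthetic scores are hypothetical, and all scores must be positive for the logarithm to be defined.

```python
from math import log2


def fit_half_life(lead_times, scores):
    """Least-squares fit of S(t) = S0 * 2**(-t / half_life) in log2 space.

    Returns (S0, half_life) in the same time units as lead_times.
    """
    x = list(lead_times)
    y = [log2(s) for s in scores]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # equals -1 / half_life
    intercept = mean_y - slope * mean_x    # equals log2(S0)
    return 2.0 ** intercept, -1.0 / slope


# Synthetic scores (not data from the talk): S0 = 0.6, half-life = 3 days.
lead_times = [0.25 * i for i in range(13)]           # 0 to 3 days
scores = [0.6 * 2.0 ** (-t / 3.0) for t in lead_times]

s0, half_life = fit_half_life(lead_times, scores)
print(f"S0 = {s0:.3f}, half-life = {half_life:.2f} days")
```

To reproduce the two-timescale behaviour described above, the same function could be applied separately to lead times before and after 36 hours.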

A-train verification: July 2006

• Both models underestimate mid- and low-level clouds (partly a snow issue at ECMWF)
• GSS and LOR misleading: skill increases or decreases with cloud fraction
• SEDS and SEDI much more robust
• Highest skill: winter upper-troposphere mid-latitudes
• Lowest skill: tropical and sub-tropical boundary-layer clouds
• Tropical deep convection somewhere in between!

How is the boundary layer modelled?

• Met Office model has explicit boundary-layer types (Lock et al. 2000)

Doppler-lidar retrieval of BL type

• Usually the most probable type has a probability greater than 0.9
• Now apply to two years of data and evaluate the type in the Met Office model

[Figure: most probable boundary-layer type, with labels including II: Stratocu over stable surface layer; IIIb: Stratocumulus-topped mixed layer; Ib: Stratus. Harvey, Hogan and Dacre (2012)]

Forecast skill

[Figure: forecast skill with “random” reference]

Forecast skill: stability

• Surface layer stable?
  – Model very skilful (but basically predicting day versus night)
  – Better than persistence (predicting yesterday’s observations)

[Figure: contingency-table quadrants a, b, c, d; “random” reference]

Forecast skill: cumulus

• Cumulus present (given the surface layer is unstable)?
  – Much less skilful than in predicting stability
  – Significantly better than persistence

[Figure: contingency-table quadrants a, b, c, d; “random” reference]

Forecast skill: decoupled

• Decoupled (as opposed to well-mixed)?
  – Not significantly more skilful than a persistence forecast

[Figure: contingency-table quadrants a, b, c, d; “random” reference]

Forecast skill: multiple cloud layers?

• Cumulus under stratocumulus as opposed to cumulus alone?
  – Not significantly more skilful than a random forecast
  – Much poorer than cloud occurrence skill (SEDI 0.5-0.7)

[Figure: contingency-table quadrants a, b, c, d; “random” reference]

Take-home messages

• Pressure is too easy to forecast; verify with clouds instead!
• Half life of cloud forecasts is 2.5-4 days rather than 9-10 days
• ETS is not strictly equitable: call it Gilbert Skill Score instead
• But GSS and most others are misleading for rare events
• I recommend the Symmetric Extremal Dependence Index
• Global verification shows mid-lat winter ice clouds have most skill, tropical boundary-layer clouds have no skill at all!

• Relevant publications
  – Cloud-forecast half-life: Hogan, O’Connor & Illingworth (QJ 2009)
  – Asymptotic equitability: Hogan, Ferro, Jolliffe & Stephenson (WAF 2010)
  – SEDI: Ferro and Stephenson (WAF 2011)
  – Comparison of verification measures and calculation of confidence intervals: Hogan and Mason (2nd Ed of “Forecast Verification” 2011)
  – Doppler-lidar BL type: Harvey, Hogan & Dacre (Submitted to QJRMS)
  – Global verification: Hogan, Stein, Garcon & Delanoe (ERL in prep)

Cloud fraction in 7 models

• Mean & PDF for 2004 for Chilbolton, Paris and Cabauw (Illingworth et al., BAMS 2007)
  – All models except DWD underestimate mid-level cloud
  – Some have separate “radiatively inactive” snow (ECMWF, DWD); Met Office has combined ice and snow but still underestimates cloud fraction
  – Wide range of low cloud amounts in models
  – Not enough overcast boxes, particularly in Met Office model

[Figure: mean and PDF of cloud fraction, 0-7 km]

Skill-bias diagrams

[Figure: skill-bias diagram, Hogan and Mason (2011). Skill axis: negative skill, random forecast, positive skill. Bias axis: under-prediction, no bias, over-prediction. Labelled points: best possible forecast, worst possible forecast, random unbiased forecast, constant forecast of non-occurrence, constant forecast of occurrence. Example: a forecast compared against reality with n = 16, p = 1/4]

Skill-bias diagram

Hedging: “Issuing a forecast that differs from your true belief in order to improve your score” (e.g. Jolliffe 2008)

• Hit rate H = a/(a+c)
  – Fraction of events correctly forecast
  – Easily hedged by randomly changing some forecasts of non-occurrence to occurrence

[Figure labels: H = 0.5, H = 0.75, H = 1]
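A small example makes the hedging problem concrete. This is a minimal sketch with hypothetical data matching the case n = 16, p = 1/4; the function name and the starting forecast are invented for illustration, and the hedge is applied deterministically (every forecast switched to occurrence) rather than randomly.

```python
def hit_and_false_alarm_rate(observed, forecast):
    """Return (H, F) with H = a/(a+c) and F = b/(b+d)."""
    pairs = list(zip(observed, forecast))
    a = sum(1 for o, f in pairs if o and f)          # hits
    b = sum(1 for o, f in pairs if not o and f)      # false alarms
    c = sum(1 for o, f in pairs if o and not f)      # misses
    d = sum(1 for o, f in pairs if not o and not f)  # correct negatives
    return a / (a + c), b / (b + d)


# Hypothetical reality: n = 16 with 4 events, so base rate p = 1/4.
observed = [True] * 4 + [False] * 12

# Hypothetical starting forecast: 2 hits, 2 misses, 2 false alarms.
forecast = [True, True, False, False, True, True] + [False] * 10
print("Original (H, F):", hit_and_false_alarm_rate(observed, forecast))

# Hedge: change every forecast of non-occurrence to occurrence.
hedged = [True] * len(forecast)
print("Hedged   (H, F):", hit_and_false_alarm_rate(observed, hedged))
```

The hit rate reaches its maximum without any gain in real skill, because the false alarm rate rises to its maximum at the same time.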

Some reportedly equitable measures

HSS = [x-E(x)] / [n-E(x)]; x = a+d
ETS = [a-E(a)] / [a+b+c-E(a)]
LOR = ln[ad/bc]
ORSS = [ad/bc – 1] / [ad/bc + 1]

E(a) = (a+b)(a+c)/n is the expected value of a for an unbiased random forecasting system

Random and constant forecasts all score zero, so these measures are all equitable, right?

Simple attempts to hedge will fail for all these measures
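The four measures above can be sketched as follows. This is an illustrative implementation, not code from the talk: the function names are hypothetical, and E(x) is assumed to be E(a) + E(d) with E(d) = (c+d)(b+d)/n, by analogy with the expression given for E(a). LOR and ORSS are undefined when b or c is zero, and no guard is included for that case.

```python
from math import log


def expected_hits(a, b, c, d):
    """E(a) = (a+b)(a+c)/n for an unbiased random forecasting system."""
    n = a + b + c + d
    return (a + b) * (a + c) / n


def hss(a, b, c, d):
    """Heidke Skill Score: [x - E(x)] / [n - E(x)] with x = a + d."""
    n = a + b + c + d
    # Assumption: E(x) = E(a) + E(d), where E(d) = (c+d)(b+d)/n.
    e_x = expected_hits(a, b, c, d) + (c + d) * (b + d) / n
    return (a + d - e_x) / (n - e_x)


def ets(a, b, c, d):
    """Equitable Threat Score: [a - E(a)] / [a + b + c - E(a)]."""
    e_a = expected_hits(a, b, c, d)
    return (a - e_a) / (a + b + c - e_a)


def lor(a, b, c, d):
    """Log of Odds Ratio: ln[ad/bc]."""
    return log(a * d / (b * c))


def orss(a, b, c, d):
    """Odds Ratio Skill Score: [ad/bc - 1] / [ad/bc + 1]."""
    odds_ratio = a * d / (b * c)
    return (odds_ratio - 1.0) / (odds_ratio + 1.0)


# Counts quoted earlier for the DWD model at Murgtal.
murgtal = (7194, 4098, 4502, 41062)
for score in (hss, ets, lor, orss):
    print(f"{score.__name__.upper()}: {score(*murgtal):.3f}")
```

For a table whose counts equal their random expectations, such as (1, 3, 3, 9), all four functions return zero, which is the property that makes these measures look equitable at first sight.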

Extreme dependency score

• Stephenson et al. (2008) explained this behavior:
  – Almost all scores have a meaningless limit as “base rate” p → 0
  – HSS tends to zero and LOR tends to infinity

• They proposed the Extreme Dependency Score:

  $$\mathrm{EDS} = \frac{2\ln[(a+c)/n]}{\ln(a/n)} - 1$$

  – where n = a + b + c + d

• It can be shown that this score tends to a meaningful limit:
  – Rewrite in terms of hit rate H = a/(a+c) and base rate p = (a+c)/n, using a/n = Hp:

  $$\mathrm{EDS} = \frac{\ln p - \ln H}{\ln p + \ln H}$$

  – Then assume a power-law dependence of H on p as p → 0: $H \propto p^{\delta}$
  – In the limit p → 0 we find

  $$\mathrm{EDS} \to \frac{1-\delta}{1+\delta}$$

  – This is useful because random forecasts have hit rate converging to zero at the same rate as base rate: $\delta = 1$ so EDS = 0
  – Perfect forecasts have constant hit rate with base rate: $\delta = 0$ so EDS = 1

Extreme dependence scores

Extreme Dependence Score
  – Stephenson et al. (2008)
  – Inequitable
  – Easy to hedge

Symmetric EDS
  – Hogan et al. (2009)
  – Asymptotically equitable
  – Difficult to hedge

Symmetric Extremal Dependence Index
  – Ferro and Stephenson (2011)
  – Base-rate independent
  – Robust for both rare and overwhelmingly common events

  $$\mathrm{SEDI} = \frac{\ln F - \ln H + \ln(1-H) - \ln(1-F)}{\ln F + \ln H + \ln(1-H) + \ln(1-F)}$$

  – where hit rate H = a/(a+c) and false alarm rate F = b/(b+d)

Which measures are equitable?

• Expected values of a–d for a random forecasting system may score zero:
  – S[E(a), E(b), E(c), E(d)] = 0

• But expected score may not be zero!
  – E[S(a,b,c,d)] = Σ P(a,b,c,d) S(a,b,c,d)

• Width of random probability distribution decreases for larger sample size n
  – A measure is only equitable if positive and negative scores cancel

[Figure: distribution of scores for random forecasts, n = 16 and n = 80; ETS & ORSS are asymmetric]
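The expected score of a random forecasting system can be evaluated exactly for small n by summing over every possible table. The sketch below rests on several assumptions that are not from the talk: observations and forecasts are drawn independently with event probability p and forecast probability q, so (a, b, c, d) follows a multinomial distribution; degenerate tables where ETS is 0/0 are assigned a score of zero; and the function names and the choice p = q = 0.25 are hypothetical.

```python
from math import comb


def ets(a, b, c, d):
    """ETS = [a - E(a)] / [a + b + c - E(a)], with E(a) = (a+b)(a+c)/n."""
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    denominator = a + b + c - e_a
    if abs(denominator) < 1e-12:
        return 0.0  # Assumption: treat the undefined 0/0 case as no skill.
    return (a - e_a) / denominator


def expected_score(score, n, p, q):
    """E[S] = sum over all tables of P(a,b,c,d) * S(a,b,c,d).

    Assumes independent random observations (probability p) and
    forecasts (probability q). Returns (expected score, total probability).
    """
    cell = (p * q, (1 - p) * q, p * (1 - q), (1 - p) * (1 - q))
    expected = 0.0
    total = 0.0
    for a in range(n + 1):
        for b in range(n + 1 - a):
            for c in range(n + 1 - a - b):
                d = n - a - b - c
                ways = comb(n, a) * comb(n - a, b) * comb(n - a - b, c)
                prob = ways * cell[0] ** a * cell[1] ** b * cell[2] ** c * cell[3] ** d
                total += prob
                expected += prob * score(a, b, c, d)
    return expected, total


e16, total16 = expected_score(ets, 16, 0.25, 0.25)
e80, total80 = expected_score(ets, 80, 0.25, 0.25)
print(f"Expected ETS of random forecasts: n=16 -> {e16:.4f}, n=80 -> {e80:.4f}")
```

Under these assumptions the expected ETS is positive rather than zero and shrinks as n grows, which is the asymmetry the figure illustrates. Replacing `ets` with another score function allows the same check for other measures.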

Possible solutions

1. Ensure n is large enough that E(a) > 10
2. Inequitable scores can be scaled to make them equitable:

  $$S_{\mathrm{equit}} = \frac{S - E(S \mid p, q)}{\max(S) - E(S \mid p, q)}$$

  – This opens the way to a new class of non-linear equitable measures
3. Report confidence intervals and “p-values” (the probability of a score being achieved by chance)

Key properties for estimating ½ life

• We wish to model the score S versus forecast lead time t as:

  $$S(t) = S_0\, e^{-t/\tau} = S_0\, 2^{-t/\tau_{1/2}}$$

  – where $\tau_{1/2}$ is forecast “half-life”

• We need linearity
  – Some measures “saturate” at high skill end (e.g. Yule’s Q / ORSS)
  – Leads to misleadingly long half-life

• ...and equitability
  – The formula above assumes that score tends to zero for very long forecasts, which will only occur if the measure is equitable

Why is half-life less for clouds than pressure?

• Different spatial scales? Convection?
  – Average temporally before calculating skill scores
  – Absolute score and half-life increase with number of hours averaged

Forecast skill: nocturnal stratocu

• Stratocumulus present (given a stable surface layer)?
  – Marginally more skilful than a persistence forecast
  – Much poorer than cloud occurrence skill (SEDI 0.5-0.7)

[Figure: contingency-table quadrants a, b, c, d; “random” reference]
