
Embracing uncertainty when reasoning about outcomes

Dr Andy Fugard

Research Fellow

EBPU Masterclass 7

Timetable

10:30 – 10:45 Registration

10:45 – 11:00 Introductions

11:00 – 11:30 What data are you collecting? What’s it for?

11:30 – 13:00 Recap on basics of uncertainty

• Summary stats (mean, median, SD, quantiles/percentiles)

• Reasoning from sample to population

• (Im)precision, confidence/uncertainty intervals and p-values

• Effect size and power

13:00 – 14:00 Lunch

14:00 – 15:00 Statistical techniques often used to assess change in CAMHS

• Reliable change index

• “Added-value”/estimated treatment effect scores

15:00 – 15:15 Coffee break

15:15 – 16:00 Basics of statistical process control

• Run charts

• Control charts

The one slide summary…

• Statistical methods are needed to extract meaning from the noise

– There are a lot of methods

– You might be pleasantly surprised how helpful your (or a friend’s) old undergrad stats textbook is

– Don’t forget about statistician colleagues

• Interpretation of data is equally essential

– Forget the numbers for a moment

– Think: what’s actually happening clinically?

– What do the theories say?

– What do the systematic reviews say helps? NICE Guidelines? Your service users?

Warmup – what data do you collect?

Summary statistics

Data (N = 200) randomly generated using SDQ Parent norms

14 5 15 1 12 7 7 14 19 21 0 9 0 10 0 11 8 3

9 7 7 3 7 11 13 12 13 12 0 12 12 12 11 0 9 11

14 18 9 17 0 5 0 3 8 11 1 2 0 8 16 0 1 3

9 1 8 11 4 4 11 16 9 10 14 14 3 3 10 12 8 8

8 16 7 7 8 13 2 12 10 6 5 5 21 12 17 13 0 0

0 20 10 8 11 0 7 0 9 11 9 7 4 13 4 4 13 11

13 0 4 0 4 18 3 0 12 14 7 8 6 1 14 8 8 12

6 14 16 12 8 8 11 5 2 8 13 6 12 1 19 13 8 8

16 9 6 7 12 8 8 5 1 4 0 18 11 3 12 14 18 0

7 11 0 12 9 20 10 7 13 2 17 12 13 0 2 3 7 15

15 6 16 6 6 6 1 5 2 0 5 7 5 18 12 8 1 0

12 7

What a lot of statistics is about

• Reducing data in various ways

• Uncovering relationships

• Drawing inferences about a population based on a random sample of that population

All stats packages will compute everything on the following slides (and deal with the tricky details). This is meant to build intuitions; don’t compute by hand!

(Arithmetic) Mean:

sum all numbers and divide by N

$$\text{mean} = \frac{14 + 5 + 15 + 1 + 12 + \dots + 7}{200} = 8.2$$

Median: sort & take the middle value

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 2 2 2 2

2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4

4 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6

6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8

8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

8 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 11

11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12

12 12 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13

13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 15 15 15

16 16 16 16 16 16 17 17 17 18 18 18 18 18 19 19 20 20

21 21
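The mean and median of the N = 200 scores can be checked with a few lines of Python. This is only a sketch to build intuition: the `counts` list is a hand tally of the sorted data above (an assumption of this example, not part of the original slides), and any stats package will do the same job.

```python
import statistics

# Tally of the sorted data above: counts[s] is how many of the
# N = 200 scores are equal to s, for s = 0, 1, ..., 21.
counts = [23, 9, 6, 9, 8, 9, 9, 16, 20, 10, 6,
          13, 19, 11, 9, 3, 6, 3, 5, 2, 2, 2]

# Rebuild the sorted list of scores from the tally.
scores = [s for s, c in enumerate(counts) for _ in range(c)]

n = len(scores)
mean = sum(scores) / n                  # sum all numbers and divide by N
median = statistics.median(scores)      # middle of the sorted values
# Quartiles: the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(n, round(mean, 1), median, q1, q3)
```

The same `statistics.quantiles` call with a larger `n` (for example `n=100`) gives other percentiles.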

What about the rest of the data? (Histogram)

Or… (Discrete histogram)

50th Percentile / Median

25th Percentile / 1st quartile

0th Percentile

100th Percentile

Things you can do with percentiles

Commonly seen

• Interquartile range: lower and upper value of the middle 50% of the data

Other possibilities

• The middle 80%

• Or the top 5%

What might be useful?

Things you can do with percentiles: norms (here SDQ)

Standard deviation

• Measure of how spread out values are

• Start with the variance: average* of squared differences between the mean and each value: $(x_1 - \text{mean})^2, (x_2 - \text{mean})^2, \ldots, (x_n - \text{mean})^2$

• Standard deviation is the square root of this

• In original units of variable, e.g., SDQ points, age in years, …

*Or almost: divide by (N−1)
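The steps above can be written out directly. A minimal sketch, using a small set of made-up scores (not data from the slides) and checking the hand-rolled version against Python's built-in `statistics.stdev`, which also divides by N − 1:

```python
import math
import statistics

def sample_sd(values):
    """Standard deviation: square root of the summed squared
    differences from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    variance = sum(squared_diffs) / (n - 1)
    return math.sqrt(variance)

# Made-up scores, for illustration only.
example = [2, 4, 4, 4, 5, 5, 7, 9]

print(round(sample_sd(example), 3))
print(round(statistics.stdev(example), 3))  # same calculation, built in
```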

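Rules of thumb for normally distributed (bell curve) data

[Figure: bell curve with the x-axis marked at the mean and at ±1 SD, ±2 SD and ±3 SD]

• ~68% of data within 1 SD of the mean

• ~95% of data within 2 SD of the mean

• ~99.7% of data within 3 SD of the mean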
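These rules of thumb follow from the normal distribution itself. As a quick check, the share of a normal distribution within k SDs of the mean is erf(k/√2), which Python's standard library can evaluate (the function name `share_within` is just for this sketch):

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```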


Reasoning from sample to population

Thinking about quality of care provided by a service

Is one patient enough?



Two?



Five?

Discuss: what might vary that’s out of control of clinician?

Made up example

• Data from two teams

• 20 patients in each

• Dependent variable: levels of difficulties at end of treatment

Results

Team A: mean 5.2, SD 2.3

Team B: mean 7.4, SD 2.7

[Figure: difficulties (y-axis, 0 to 8) for teams A and B]

What we have

• Sample estimates of the population

– Mean

– Standard deviation

• Means definitely differ in the sample of 40

What we (typically) want to know

• Do the means differ in the population?

• Do the results generalise beyond these 40 patients?

• Would the result replicate in another sample from the same population?

You know the mean outcome for a random sample

What’s it likely to be for the population?

Null hypothesis significance testing

• One way to reason from sample to population

• Requires:

Null hypothesis (H0): e.g., the (population) means are the same

Alternative hypothesis (H1): e.g., the (population) means are different

• We know the sample means were different

• Hope is that we can reject H0

Example: the t statistic

• Computed from sample means, SDs, & Ns

• Gives a standardised measure

• Related to difference in sample means

• A bigger number means a bigger difference

• Closer to zero means a smaller difference

• For our example we get t = 2.8

• Sign gives the direction of the difference, e.g., t = −2.8 would have been in the opposite direction
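To see where t = 2.8 comes from, here is a sketch of the calculation from the two teams' means, SDs and Ns. The slides do not say which variant of the t-test was used; this example assumes the classic equal-variance (pooled) version, which reproduces the reported value, and the function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent-samples t statistic with a pooled variance.

    Returns (t, degrees of freedom). Assumes equal population variances.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / se_diff, df

# Team A: mean 5.2, SD 2.3; Team B: mean 7.4, SD 2.7; 20 patients each.
t, df = pooled_t(5.2, 2.3, 20, 7.4, 2.7, 20)
print(round(t, 1), df)
```

Swapping the two teams in the call flips the sign of t, which is the "direction" mentioned above.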

Pretend…

There is no difference in population means.








Simulate with 10,000 studies where there’s no difference in the population means
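A simulation of this kind can be sketched in a few lines. The details here are assumptions for illustration (two groups of 20 drawn from the same normal distribution, a pooled-variance t, a fixed random seed), not a record of how the original figures were produced.

```python
import math
import random

def t_statistic(a, b):
    """Pooled-variance t statistic for two lists of scores."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

def simulate_null(n_studies=10_000, n_per_group=20, seed=1):
    """t statistics from studies where both groups share one population."""
    rng = random.Random(seed)
    ts = []
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        ts.append(t_statistic(a, b))
    return ts

ts = simulate_null()
share_small = sum(abs(t) < 1 for t in ts) / len(ts)
share_extreme = sum(abs(t) >= 2.8 for t in ts) / len(ts)
print(f"|t| < 1: {share_small:.2f}   |t| >= 2.8: {share_extreme:.3f}")
```

Most simulated studies give a t close to zero, but a small share land far out in either tail even though the population means are identical.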

[Figure: histogram of t from the 10,000 simulated studies; x-axis t from −4 to 4, y-axis frequency from 0 to 1500. Successive slides highlight different regions:]

• Occasionally large positive differences in sample

• Occasionally large negative differences in sample

• Mostly only small differences in sample

But remember: NO DIFFERENCE IN POPULATION

[Figure: the same distribution drawn as a density; x-axis t from −4 to 4, y-axis density from 0.0 to 0.3]

Normalise so blue area = 1

Blue area between two values gives probability of getting those values in sample

Can we find the null distribution without simulation?

• William Gosset

• 1899: Joined Guinness in Dublin

• Developed statistics to help with quality control in brewing

• Published under the pseudonym “Student”

• Worked out the t-distribution [Student (1908). The probable error of a mean. Biometrika 6, 1–25.]

... the t distribution is now computed in all stats software …

What is a p-value then?

• We got t = 2.8 in the sample

• How likely is this, assuming that the population means are the same?

• We don’t want it to be likely

• We want to be in a world where there is a difference!

• The hypothesis didn’t specify a direction

[Figure: t density (x-axis t from −4 to 4, y-axis density from 0.0 to 0.4) with both tails beyond ±2.8 shaded; each tail has area = .004]

t(38) = 2.8, p = .008
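The two shaded tail areas can be computed without simulation by integrating the t density. The sketch below does this with Simpson's rule using only the standard library; in practice stats software has the t distribution built in, and the function names here are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: total area in both tails beyond |t|.

    Integrates the density from 0 to |t| with Simpson's rule
    (steps must be even), then subtracts from one half.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3      # area between 0 and |t|
    one_tail = 0.5 - central_half     # area beyond |t| on one side
    return 2 * one_tail               # no direction specified: both tails

print(round(two_sided_p(2.8, 38), 3))
```

Doubling the one-tail area reflects the point above that the hypothesis did not specify a direction.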

Final step

• Current conventions say that p < 0.05 is “statistically significant”

• So can reject H0

• Note statistically significant doesn’t mean clinically significant – the magnitude of the difference matters!

Confidence intervals

• Give interval where population value is likely to be

• With a given degree of confidence

• 95% confidence means that in 95% of studies, the true population value will be included in the interval
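As a rough illustration, a 95% confidence interval for a mean is the sample mean plus or minus about 1.96 standard errors. The sketch below uses the normal approximation, which is an assumption of this example; for small samples stats packages use a slightly larger multiplier from the t distribution, so their intervals are a little wider.

```python
import math
from statistics import NormalDist

def mean_ci(mean, sd, n, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    se = sd / math.sqrt(n)                          # standard error of the mean
    return mean - z * se, mean + z * se

# Team A from the made-up example: mean 5.2, SD 2.3, 20 patients.
low, high = mean_ci(5.2, 2.3, 20)
print(round(low, 2), round(high, 2))
```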

95% confidence intervals of the means

[Figure series: mean score (y-axis, 0 to 8) with 95% confidence interval for groups A and B; one plot for each result below, as the difference between the groups grows]

t(198) = 0, p = 1

t(198) = 0.676, p = 0.5

t(198) = 1.653, p = 0.1

t(198) = 1.93, p = 0.055

t(198) = 1.972, p = 0.05

t(198) = 2.017, p = 0.045 (Note: CIs can overlap and there can still be a significant difference)

t(198) = 2.601, p = 0.01

t(198) = 2.839, p = 0.005

t(198) = 3.339, p = 0.001

Why this funnel shape?

• Standard error of the mean = $\mathrm{SD} / \sqrt{N}$

• Sample size ↑ … error ↓

• So for smaller services you’d expect greater between-service differences in (sample) means

• Even if there’s no population difference
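The funnel can be reproduced with made-up services that all share one population. In this sketch every service draws its patients from the same normal distribution (mean 8, SD 5 are arbitrary assumptions), and the spread of the service means is compared with SD/√N.

```python
import math
import random
import statistics

def spread_of_service_means(n_per_service, n_services=500,
                            mean=8.0, sd=5.0, seed=7):
    """SD of the sample means across services of a given size,
    when every service draws from the same population."""
    rng = random.Random(seed)
    service_means = [
        sum(rng.gauss(mean, sd) for _ in range(n_per_service)) / n_per_service
        for _ in range(n_services)
    ]
    return statistics.stdev(service_means)

for n in (10, 100, 1000):
    simulated = spread_of_service_means(n)
    expected = 5.0 / math.sqrt(n)   # standard error of the mean
    print(n, round(simulated, 2), round(expected, 2))
```

Smaller services show much more variation in their means, even though nothing differs between them in the population.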

Also happens for proportions: relationship between sample size and recovery rates

[Figure: graphed using public Adult IAPT data downloaded from NHS IC web page]

Effect size and power

• Effect size: how big the effect is

• For example, how large a difference in means or in proportions

• The larger the effect, the easier it is to detect

• More data → more precise estimate of population quantity → more likely to detect an effect = more power

Power analysis

http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/





Examples with proportions

• Compare proportion of families dropping out of treatment between two teams

• Or try to reduce drop-outs

• Or increase proportion of children who say they felt listened to

Minimum sample size needed in each group for test of difference between two proportions (power = 80%, searching for p < .05)

Rows: proportion in Group 1. Columns: proportion in Group 2.

Group 1    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
0.2        195
0.3         59   292
0.4         30    80   355
0.5         18    38    93   387
0.6         12    22    42    97   387
0.7          9    14    23    42    93   355
0.8          6     9    14    22    38    80   292
0.9          5     6     9    12    18    30    59   195
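The slides do not say how the table was calculated. One common approach, assumed here, uses Cohen's effect size h (based on an arcsine transformation of the proportions) with a normal approximation; it comes within about one of the table entries checked. Treat this as a sketch and use a dedicated power tool such as G*Power for real planning.

```python
import math
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect p1 vs p2
    (two-sided test), using Cohen's h and a normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Cohen's h: difference between arcsine-transformed proportions.
    h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
    return 2 * ((z_alpha + z_power) / h) ** 2

for p1, p2 in [(0.1, 0.2), (0.4, 0.5), (0.1, 0.5)]:
    print(p1, p2, round(n_per_group_proportions(p1, p2)))
```

Small differences between proportions (for example 0.4 vs 0.5) need hundreds of cases per group, while large ones need only a handful.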

Examples with means

• Compare average outcomes between two teams

• Or test whether training has any effect on outcomes

• Compare clinicians’ outcomes?

Minimum sample size needed in each group for test of difference between two means (power = 80%, searching for p < .05)

Effect size d (in SD units)    N in each group
0.1                            1571
0.2                            393
0.3                            175
0.4                            99
0.5                            64
0.6                            45
0.7                            33
0.8                            26
0.9                            20
1.0                            17

$$d = \frac{\text{mean}_1 - \text{mean}_2}{\text{SD}}$$
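A quick way to get a feel for these numbers is the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d². This is an assumption of the sketch rather than the method behind the table, which was presumably based on the t distribution; the approximation lands within one of each table entry.

```python
import math
from statistics import NormalDist

def n_per_group_means(d, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a standardised
    mean difference d (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_power = NormalDist().inv_cdf(power)           # about 0.84
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# N in each group from the table above, keyed by effect size d.
table = {0.1: 1571, 0.2: 393, 0.3: 175, 0.4: 99, 0.5: 64,
         0.6: 45, 0.7: 33, 0.8: 26, 0.9: 20, 1.0: 17}

for d, n_table in table.items():
    print(d, n_per_group_means(d), n_table)
```

Halving the effect size roughly quadruples the sample size needed.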

What about interpretation?

Let’s start somewhere where it seems to be easier…

Rizo and Südhof (2002)

Zhang, Wu, Wang, et al (2011)

There’s still a large gap between data and theory!

“My friend told me Chomsky said something very sad. He said that today we don't need theory. All we need to do is tell people, empirically, what is going on. Here, I violently disagree: facts are facts, and they are precious, but they can work in this way or that. Facts alone are not enough. […] I'm sorry, I'm an old-fashioned continental European. Theory is sacred and we need it more than ever.” – Slavoj Žižek, interview in New Statesman, 29 October 2009

Start with the basics

• Data quality

– What’s the return rate?

– Any sign of systematic biases?

– Miscoded data?

• Domain coverage of measure

– Broad spectrum vs. narrow focus

– Are the problems/strengths covered at all?

• Case mix

– Comorbidity

– Emotional problems vs. developmental conditions

“In the absence of randomization, one has to work very hard to demonstrate that unbalanced patient characteristics or referral practices could not have substantially influenced the treatment outcome comparison” – Clark, Fairburn, and Wessely (2008, p. 631)

Discuss: where to look for ideas for factors influencing outcomes?

Some ideas…

• Therapeutic alliance

– Correlates with outcomes (Shirk, Karver, Brown, 2011)

– Parent alliance relates to less frequent cancellations and CYP alliance to outcomes (Hawley & Weisz, 2005)

• Getting feedback on therapy (Bickman, Kelley, Breda, Andrade & Riemer, 2011)

• Practitioner skill (Scott, Carby, Rendu, 2008)

• Steer clear of “service restructuring”?

• Normalising the possibility to change therapist if things aren’t working out? (Evidence?)

Skill and outcome… (Scott, Carby, Rendu, 2008)


Assessing change I: reliable change index

What if patient discharged?

A problem with change scores

Improvement/deterioration could be due to chance and to change in underlying problems

What do you see?

One solution: reliable change indices (Jacobson & Truax 1991)

Reliable change

$$\mathrm{RCI} = \frac{\text{post} - \text{pre}}{SE_{\text{diff}}}$$

where post − pre is the change score and $SE_{\text{diff}}$ is the standard error of the difference:

$$SE_{\text{diff}} = SD_{\text{pre}} \sqrt{2} \sqrt{1 - r}$$

where $SD_{\text{pre}}$ is the SD for the pre-score and $r$ is the reliability, e.g., Cronbach’s α or test-retest reliability

Example

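• $SD_{\text{pre}} = 6.65$, $r = 0.88$

• $SE_{\text{diff}} = SD_{\text{pre}} \sqrt{2} \sqrt{1 - r} = 6.65 \times \sqrt{2} \times \sqrt{1 - 0.88} = 3.257821$

• Then…. $\mathrm{RCI} = \dfrac{\text{post} - \text{pre}}{3.257821}$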

z-scores

• The null distribution is a z-score, i.e.,

– Normally distributed

– Mean 0

– SD 1

• So you can compute a p-value

• See Excel file at http://www.corc.uk.net/resources/downloads/
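The RCI and its p-value can also be computed in a few lines. This is a minimal sketch using the example values SDpre = 6.65 and r = 0.88; the pre and post scores are hypothetical, and the function names are not from any particular package.

```python
import math
from statistics import NormalDist

def se_diff(sd_pre, r):
    """Standard error of the difference, from the pre-score SD
    and the reliability r."""
    return sd_pre * math.sqrt(2) * math.sqrt(1 - r)

def rci(pre, post, sd_pre, r):
    """Reliable change index: change score in units of SEdiff."""
    return (post - pre) / se_diff(sd_pre, r)

def two_sided_p(z):
    """Two-sided p-value for a z-score (standard normal)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical case: score drops from 20 to 13, a change of 7 points.
z = rci(pre=20, post=13, sd_pre=6.65, r=0.88)
print(round(se_diff(6.65, 0.88), 6), round(z, 2), round(two_sided_p(z), 3))
```

The sign of the RCI shows the direction of change; the p-value depends only on its size.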

Change score    RCI     p
1               0.31    0.759
2               0.61    0.539
3               0.92    0.357
4               1.23    0.220
5               1.53    0.125
6               1.84    0.066
7               2.15    0.032
8               2.46    0.014
9               2.76    0.006
10              3.07    0.002
11              3.38    0.001





There are some variations on this theme

• Either Cronbach’s α or test-retest reliability may be used

• Sometimes (e.g., Barkham et al 2012) the SD of the difference is used rather than the time 1 SD

• These values can come from large norm samples (preferred) or sometimes (have to) come from smaller samples

Important message: just tell everyone what you did and be consistent

Another useful thing to do with this: compute a reliable change criterion

• Work out 1.96 × SEdiff (why 1.96?)

• Change greater than this amount is reliable change (with 95% confidence)

• Example

– SEdiff = 3.257821 from example before

– Multiply by 1.96 = around 6.4

– If the change scores are integers, then this means a change of 7 or more (in either direction) is reliable

• Sometimes provided by measure developers
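The criterion is a one-line calculation. The sketch below also shows where 1.96 comes from: 95% of a standard normal distribution lies within ±1.96, so a z-score beyond that has p < .05 (two-sided). Variable names are for illustration only.

```python
import math
from statistics import NormalDist

# 97.5th percentile of the standard normal: 95% of z-scores lie
# between -z_95 and +z_95.
z_95 = NormalDist().inv_cdf(0.975)

se_diff = 3.257821                 # from the earlier example
criterion = z_95 * se_diff         # change must be greater than this

# With integer change scores, the smallest reliable change is the
# first whole number above the criterion.
smallest_reliable_change = math.floor(criterion) + 1

print(round(z_95, 2), round(criterion, 1), smallest_reliable_change)
```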

Assessing change II: “added value” score

The problem

• Many factors other than mental health intervention can change symptom scores

• Examples

– Helpful friends, family, teacher

– Development

– Referral at peak of problems

– Regression to the mean

– Various response biases

– Negative life episodes

Intuition

• Suppose you’re okay at bowling but not fantastic

• You throw: strike!

• What happens on your next turn?

• Analogy:

– Skill + luck

– True levels of difficulties + random variation

Regression to the mean

Think of all questionnaire scores as consisting of a true score and an error component

Measured score = true score + error

(But beware, sometimes some of the “error” is the signal, e.g., a particularly good/distressing day)

Now suppose…

• You measure the same thing twice

• It hasn’t really changed

• And there is no measurement error

[Figure: each line connects a person’s score at time 1 and time 2]

What happens if you add measurement error, but the true score doesn’t change…?
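A small simulation makes the answer concrete. All numbers here are made up for illustration (true scores with mean 10 and SD 4, measurement error with SD 3): each person keeps the same true score, gets fresh error at each time point, and we look at the people who scored highest at time 1.

```python
import random
import statistics

rng = random.Random(42)

# Assumed, made-up parameters: true score ~ Normal(10, 4),
# measurement error ~ Normal(0, 3) at each time point.
true_scores = [rng.gauss(10, 4) for _ in range(5000)]
time1 = [t + rng.gauss(0, 3) for t in true_scores]
time2 = [t + rng.gauss(0, 3) for t in true_scores]  # same true score, new error

# Select the top 10% at time 1, as if referred at the peak of problems.
cutoff = sorted(time1)[int(0.9 * len(time1))]
high = [i for i, score in enumerate(time1) if score >= cutoff]

mean_t1 = statistics.mean(time1[i] for i in high)
mean_t2 = statistics.mean(time2[i] for i in high)
print(round(mean_t1, 1), round(mean_t2, 1))
```

The selected group scores lower on average at time 2, although nobody's true score changed: part of their high time 1 score was luck.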

Example (Parent SDQ in CORC)

r = .30, p << 0.001, N = 18140

Added value score (Goodman and Goodman)

• Developed using BCAMHS 2004 data

• Parent-rated SDQs

• 609 people with clinical problems

• Mostly (84%) not attending CAMHS

• Regression model developed predicting Time 2 from Time 1 scores 6 months earlier

[Figure: scores at outset and 6 months later; non-CAMHS sample modeled]

Produces equation for change due, e.g., to

• regression to mean

• spontaneous recovery

[Figure: scores at outset and 6 months later, with labels: AVS; Change in CAMHS case; Actual score; Predicted non-CAMHS score]

Final added-value score

$$\text{AVS} = \underbrace{2.3 + 0.8\,\text{Total}_1 + 0.2\,\text{Impact}_1 - 0.3\,\text{Emotion}_1}_{\text{predicted T2 score if had received no treatment}} - \underbrace{\text{Total}_2}_{\text{actual T2 score with treatment}}$$
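The formula is simple enough to wrap in a small function. The coefficients are those on the slide; the function name and the example scores are hypothetical, and the notes on sample size that apply to the AVS apply equally to this sketch.

```python
def added_value_score(total1, impact1, emotion1, total2):
    """Added-value score from Parent SDQ scores.

    total1, impact1, emotion1: time 1 total difficulties, impact
    and emotional symptoms scores; total2: time 2 total difficulties.
    """
    # Predicted time 2 score if no treatment had been received.
    predicted_total2 = 2.3 + 0.8 * total1 + 0.2 * impact1 - 0.3 * emotion1
    # Positive values mean the actual score is lower than predicted.
    return predicted_total2 - total2

# Hypothetical case, for illustration only.
print(round(added_value_score(total1=20, impact1=5, emotion1=6, total2=15), 1))
```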

Evidence it works

• Two parenting programme RCTs now supporting the AVS

– Ford, Hutchings, Bywater, Goodman, & Goodman (2009) Br J Psychiatry. 194(6).

– Sebastian Rotheray’s talk at 2012 RCP conf.

• Control group has an AVS ≈ 0

• Treatment group has AVS ≈ treat − control

• Not yet tested for emotional problems

Notes on the AVS http://www.sdqinfo.com/c5.html

• Use only for largish samples, e.g., 100 cases

• Confidence intervals for individual cases are plus or minus 10 points

• Only applies to Parent-reported SDQs

• “Although initial findings on added value scores are promising, they should not be taken too seriously until accumulating experimental data from around the world tells us more about the formula's own strengths and difficulties!”

What about for other measures?

• Difficult to find data for people not receiving treatment

• The Parent-SDQ AVS based on original sample of 7977 cases in general population

• One source is waiting list controls in RCTs

• More to come, initially for school counselling populations (Cooper, Fugard, McArthur, Pybis, in preparation)

A rapid intro to statistical process control

Basic ideas

• Common cause: causes of variation which affect all parts of the system, for instance noisiness in the measurement

• Special cause: causes of variation which do not affect all parts of the system, or not all of the time

Focus on trying to spot special causes, e.g., due to variation in practice

Perla R J et al. BMJ Qual Saf 2011;20:46-51

Run chart

Rules for run charts (giving p < .05)

• Shift: at least 6 consecutive points all above/below median (ignore points on median)

• Trend: at least 5 points all rising/falling (ignore like points)

• Run: too few or too many points on one side of median (statistical tables for how many there have to be; see Perla et al 2011)

• Astronomical point: something visually odd (not probability based for run charts – better to use control charts, coming later)

Columns: (1) total number of data points on the run chart that do not fall on the median; (2) lower limit for the number of runs (fewer runs than this is ‘too few’); (3) upper limit for the number of runs (more runs than this is ‘too many’)

Points off median    Lower limit    Upper limit
10                   3              9
11                   3              10
12                   3              11
13                   4              11
14                   4              12
15                   5              12
16                   5              13
17                   5              13
18                   6              14
19                   6              15
20                   6              16

More in Perla R J et al. BMJ Qual Saf 2011;20:46-51


Fictional example – based on a real service

Clinician    Number of face-to-face sessions this week
Claire       24
Panos        22
Thom         21
Johannes     20
Keith        19
Berit        19
Jenny        19
Polly        16
Bianca       14
Ellie        12

Weekly targets came from above: 20 face-to-face sessions per week

Now easily tracked by service manager

Actual session counts shared each week with everyone

Why is Claire performing better than Ellie?

One reason: random variation

Control chart (see Caulcutt 2004)

Upper/lower control lines (UCL/LCL): mean +/– 3 SD

Upper/lower warning lines (UWL/LWL): mean +/– 2 SD

Process has changed if one of:

• 1 point above UCL

• 1 point below LCL

• 2 consecutive points between UCL and UWL

• 2 consecutive points between LCL and LWL

• 8 consecutive points on same side of mean
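These rules can be checked automatically. The sketch below makes one notable assumption: the mean and SD are estimated from a baseline period using the ordinary sample SD, whereas control chart texts often estimate the SD in other ways. The function name, labels and data are all hypothetical.

```python
import statistics

def control_chart_signals(baseline, new_points):
    """Check new points against limits estimated from a baseline period.

    Returns a list of (index, rule) pairs for every rule triggered.
    Assumes limits of mean +/- 2 SD (warning) and +/- 3 SD (control),
    with the SD taken as the sample SD of the baseline.
    """
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    ucl, lcl = mean + 3 * sd, mean - 3 * sd   # control lines
    uwl, lwl = mean + 2 * sd, mean - 2 * sd   # warning lines

    signals = []
    for i, x in enumerate(new_points):
        if x > ucl:
            signals.append((i, "above UCL"))
        elif x < lcl:
            signals.append((i, "below LCL"))
    for i in range(1, len(new_points)):
        pair = new_points[i - 1:i + 1]
        if all(uwl < x <= ucl for x in pair):
            signals.append((i, "2 between UWL and UCL"))
        if all(lcl <= x < lwl for x in pair):
            signals.append((i, "2 between LCL and LWL"))
    for i in range(7, len(new_points)):
        window = new_points[i - 7:i + 1]
        if all(x > mean for x in window) or all(x < mean for x in window):
            signals.append((i, "8 on same side of mean"))
    return signals

# Made-up weekly session counts for one clinician.
baseline = [18, 20, 22, 19, 21, 20, 18, 22, 20, 20]
print(control_chart_signals(baseline, [20, 25]))
```

A single week of 24 sessions would not trigger anything here, which is the point of the concise summary: monitor for a while before judging.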

Concise summary: monitor for a while before judging!