Master class Data, understanding it, interpreting it and using it.

1

Master class: Data, understanding it, interpreting it and using it.

Ruth Harrell, Liann Brookes-smith

2

Agenda

9.30am – 10.30am
10.30am break
10.45 – 11.30am
11.40 – 12.30pm
12.30 – 1.30pm lunch
1.30 – 2.30pm probability
2.30 – 2.45pm break
2.45 – 3.30pm sampling and curve
3.30 – 4.30pm confidence and risk

3

Introduction

Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~ W.A. Wallis

“There are three kinds of lies: lies, damned lies, and statistics.” ~ Disraeli (according to Mark Twain)

98% of all statistics are made up. ~ Author unknown

Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~ Aaron Levenstein

If you cannot measure it, it does not exist. ~ Author unknown

4

Question to the Room

What are statistics?
Why are data important?
What do you feel about stats?
What do they tell us? E.g. 40% of children in XX area have dental caries: what does that tell us?
List the types of data you are aware of or use in your day-to-day work.

5

Practitioner competencies

Obtain, verify, analyse and interpret data and/or information to improve the health and wellbeing outcomes of a population / community / group – demonstrating:
a. knowledge of the importance of accurate and reliable data / information and the anomalies that might occur
b. knowledge of the main terms and concepts used in epidemiology and the routinely used methods for analysing quantitative and qualitative data
c. ability to make valid interpretations of the data and/or information and communicate these clearly to a variety of audiences

6

Aim for the day

The aim of the day is to improve people's understanding of the data they use, and of how to analyse and interpret it.

This session is concentrating on the data rather than things such as the study design but we are happy to discuss and answer questions on both; you can’t understand what the data is telling you without understanding how it has been collected and the potential for bias.

7

Topics covered

1. Types of data
2. Basic probability and stats
3. Understanding how data is collected
4. Measures of odds and ratios - comparing populations and study results
5. Population sampling - good samples and bad samples
6. Understanding confidence intervals & p values - is the result reliable?
7. How I apply data to what I am doing

8

Types of data

9

Describing the data

We have a responsibility to present data in a way that can be easily understood, and which does not misrepresent the true meaning of the data.

Key decisions are made based on the data – or more accurately people’s impression of the data – so this has an impact on use of resources and eventually on patient care.

Accurate analysis and presentation of the data saves lives!

10

Quantitative vs. Qualitative

Quantitative data measures quantity, i.e. it is numerical.

Qualitative data is usually more descriptive and not measured in numbers.

However, data originally obtained as qualitative information about individual items may give rise to quantitative data if they are summarised by means of counts.

11

Discrete – Continuous

Discrete data can only take certain particular values.

Continuous data can take any value on a scale.

For example, height is continuous, but the number of siblings is discrete.

12

Nominal - Ordinal

Nominal comes from the Latin nomen, meaning 'name', and is used to describe categorical data. There is no quantitative relationship between the different categories (though sometimes a number may be assigned for ease of analysis). An example is ethnicity.

Ordinal data again describes categories, but there is some order to them, though the relationship between them may not be well defined. An example is Agenda for Change pay scales: they are ordered and can therefore be put in sequence, but there is no numerical relationship between them.

13

Transforming the data

Sometimes the data you have isn't in the most effective form for display.

E.g. you have data on weight in kilos.

A list of continuous weights is not intuitive, so you combine each weight with height to calculate BMI and then group people into categories, i.e. underweight, healthy weight, overweight, obese and morbidly obese.

Continuous to ordinal.
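The continuous-to-ordinal step can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the people are made up, and the cut-offs (including the "overweight" band) are an assumption based on commonly used adult BMI thresholds rather than anything stated on the slides.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a continuous BMI value onto an ordered (ordinal) category.

    The cut-offs are an assumption (commonly used adult thresholds).
    """
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "healthy weight"
    if value < 30:
        return "overweight"
    if value < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical (weight in kg, height in m) pairs, for illustration only.
people = [(50, 1.75), (70, 1.75), (85, 1.75), (100, 1.75), (130, 1.75)]
categories = [bmi_category(bmi(w, h)) for w, h in people]
print(categories)
```

Note that a BMI of 24.9 and a BMI of 25.1 land in different categories even though they are almost identical, which is the borderline detail that categorising hides.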

14

Transforming the data (2)

With this you can display more meaningful data

BUT

You lose the detail, such as how many people sit at the edge of each category (borderline). You can't transform it back.

The categories you transform to may not be the best use of the data.

You can also transform data with a calculation, such as taking the “log” of each number; this will sometimes convert skewed data into data that follow a normal curve (discussed later).
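As a rough sketch of the log transformation, the Python below takes a made-up right-skewed list and compares a crude skew indicator (the gap between mean and median, in standard deviations) before and after taking logs. The data and the helper name `mean_median_gap` are hypothetical, and the indicator is only one informal way to look at skew.

```python
import math
import statistics

# Hypothetical right-skewed values (for example lengths of stay in days).
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

# Log transform: only defined for values greater than zero.
logged = [math.log10(v) for v in values]


def mean_median_gap(data):
    """Gap between mean and median, in units of standard deviation.

    A larger positive gap suggests a longer right-hand tail.
    """
    return (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)


print(round(mean_median_gap(values), 2))
print(round(mean_median_gap(logged), 2))
```

On this example the gap is smaller after the transformation, but that is not guaranteed for every data set, so it is still worth plotting the transformed data.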

15

Exercise

Exercise 1 and 2

16

Displaying the data

What are the options?
Tables – simple descriptive, cross tab… (mention pivot table)
Graphs – bar, line, x-y or scatter, pie chart…

17

Basic statistics and probability

Having looked at the raw data and carried out any transformations you felt necessary, you now want to describe the features of this data.

Distributions – plotting the data is the first step in this. You need to consider the shape of the graph before you know how to best analyse the data.

18

Types of graph

Normal

19

Types of graph

Skewed

20

Types of graph

Bimodal

21

Types of graph

Uniform

22

15 minute break!

23

Data measures

Definitions:
Range: the difference between the highest and the lowest values in a set
Mean: the sum of all the measured values divided by the number of measures
Median: the middle measure when the values are put in order
Mode: the measure found most often
Interquartile range: a measure of statistical dispersion, equal to the difference between the upper and lower quartiles
Standard deviation: a measure of how spread out the numbers are

Mean, median and mode
Mean = (sum of observations) / (number of observations)
Mode = the most common observation
Median = the number where 50% of observations are below and 50% are above

24

Standard Deviation and IQR
Std Dev = square root of [ sum of (squared difference between each observation and the mean) / (number of observations - 1) ]

IQR = the difference between the value at the 75th percentile and the value at the 25th percentile

25

Formulas
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$

Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

$x_i$ is each observation, $n$ is the number of observations, $\Sigma$ means ‘sum’
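The measures defined above can be tried out with Python's standard `statistics` module. This is a minimal sketch on a made-up data set; note that software packages use slightly different conventions for quartiles, so the IQR can differ a little between tools.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

value_range = max(data) - min(data)   # highest minus lowest
mean = statistics.mean(data)          # sum of observations / number of observations
median = statistics.median(data)      # middle value once sorted
mode = statistics.mode(data)          # most common observation
sd = statistics.stdev(data)           # sample standard deviation (divides by n - 1)

# Quartiles using the module's default method; other tools may interpolate differently.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(value_range, mean, median, mode, round(sd, 2), iqr)
```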

26

27

Exercise 3

28

Exercise 4

29

How reliable is my data?

Any data missing?
How old is it?
What is the denominator?
Who collected it?
How was it collected?
Ways to avoid making statements about inaccurate data?

30

Describing data

31

Interpret the graph

This graph shows the trend in obesity among adults from 1993 to 2007.

Percentage: of what (all adults presumed; all registered? all resident?) What age is defined as an adult?

Is the increase due to chance or an actual increase?

Data is quantitative/continuous

32

Bias

When looking at data, sometimes the relationship we see is caused by the way in which we are measuring, not by what is actually there.

33

Fudging Rate or Number
You have 50 cases of COPD in area 1, and 150 cases of COPD in area 2. Should you do something in area 2?
Area 1 has a population of 2000
Area 2 has a population of 5000
In area 1 the rate in 50-74 year olds is 20/1000
In area 2 the rate in 50-74 year olds is 42/1000
Area 1’s data was from 2004
Area 2’s data was from 2005-2009
Area 1 is 20/1000, confidence interval (12 – 48 per 1000)
Area 2 is 42/1000, confidence interval (18 – 56 per 1000)
Now what?
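To see why the raw counts mislead, the Python sketch below turns the case counts and populations from this slide into crude rates per 1000, then checks whether the two quoted confidence intervals overlap. The function names are hypothetical, and the overlap check is only an informal first look, not a formal test.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Crude rate: cases per 1000 people in the population."""
    return 1000 * cases / population


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True when two (lower, upper) confidence intervals share any values."""
    return max(a[0], b[0]) <= min(a[1], b[1])


# Counts and populations as given on the slide.
area1_rate = rate_per_1000(50, 2000)
area2_rate = rate_per_1000(150, 5000)
print(area1_rate, area2_rate)  # 25.0 30.0

# Confidence intervals for the 50-74 year old rates, as given on the slide.
overlap = intervals_overlap((12, 48), (18, 56))
print(overlap)  # True
```

Three times as many cases becomes a much smaller gap once population size is taken into account, and the overlapping intervals suggest the age-specific rates may not really differ.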

34

Exercise

Exercise 5

What do these data tell you? Key message?

What would you ask of these data? What further information would you want to know?

35

Basics of probability

Probability is a way of quantifying the judgements that we make all the time – from ‘do I need an umbrella?’ to ‘shall I bet on that horse?’

Probability is measured on a linear scale of 0 to 1 where 0 is impossible and 1 is absolutely certain.

36

Probability
Why is probability relevant to public health?
Probability gives us a quantitative measurement of the chances of something happening, and there are two key ways in which it is used in public health:

It is another word for risk (or, if the impact is positive, benefit). For example, the probability that someone who smokes cigarettes will get lung cancer has been shown to be much higher than for someone who doesn’t smoke.

It helps us to answer the question ‘how likely is it that the observed effect is due to our intervention and not just to chance?’, and is used in all types of studies: testing medical treatments, evaluating the impact of public health interventions, assessing the need of one population compared to another.

37

Probability and risk
Odds – number of events divided by the number of non-events
Risk in exposed – number of events in the exposed group divided by the number of exposed
Risk in unexposed – number of events in the unexposed group divided by the number of unexposed
Relative risk or risk ratio – the ratio of the probability of the event occurring in the exposed group to that in the unexposed group
Absolute risk difference – the difference in risk between the exposed and unexposed
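The measures above can be worked through on a small two-group example. The counts below are made up for illustration and the function names are hypothetical; the sketch computes risk as events divided by group size and odds as events divided by non-events.

```python
def risk(events: int, total: int) -> float:
    """Risk (probability): events divided by everyone in the group."""
    return events / total


def odds(events: int, total: int) -> float:
    """Odds: events divided by non-events in the group."""
    return events / (total - events)


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the event.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

risk_exposed = risk(exposed_events, exposed_total)
risk_unexposed = risk(unexposed_events, unexposed_total)

relative_risk = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed
odds_ratio = odds(exposed_events, exposed_total) / odds(unexposed_events, unexposed_total)

print(round(relative_risk, 2), round(risk_difference, 2), round(odds_ratio, 2))
```

In this example the odds ratio comes out larger than the relative risk, which is typical when the event is fairly common; the two are close only when the event is rare.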

38

Probability cont…

What is the probability of a 6 if you throw an unbiased dice?

What is the probability of a total of 6 if you throw two unbiased dice?
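One way to check answers to questions like these is to list every equally likely outcome and count the ones of interest. The Python sketch below does that with exact fractions; it assumes fair six-sided dice, as the questions state.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

# One die: count the outcomes that show a 6.
p_six = Fraction(sum(1 for f in faces if f == 6), len(faces))

# Two dice: list all 36 ordered pairs and count those that total 6.
pairs = list(product(faces, repeat=2))
p_total_six = Fraction(sum(1 for a, b in pairs if a + b == 6), len(pairs))

print(p_six)        # 1/6
print(p_total_six)  # 5/36
```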

Welcome back!!

I'm not an outlier I just haven't found my distribution yet.

39

40

Exercise

Exercise 6

Worse and early death = 0-3/10
No change = 4-5/10
Cure = 2-6/10

41

Population sampling (1)
In the real world we don’t usually get data from everybody that we are interested in. Why not?
Cost and resources may be too large
People may choose to opt in or out
May have incomplete data (data entry problems etc)

42

Population sampling (2)
So what we need to do is measure a sample of people and infer from that sample what the population looks like. We can do this by tweaking the statistical formula used, but there are two things to consider:

If your sample size is too small you are unlikely to get a reasonable result. You can still use the formula, but you need to bear this in mind when interpreting it.

Think about who you have managed to sample: are they representative of the population? (Imagine walking into a large open plan office with a set of scales and asking people if they would mind being weighed. Who is more likely to volunteer?)

43

Population sampling (3)

If we have a REPRESENTATIVE sample, we can apply a statistical tweak to help us to estimate the figure for the population.

If we don’t (if the sample is biased), though we can carry out the maths, it will always be flawed.

44

Population sampling (4)

Principle –
Measure your sample
Calculate the mean and standard deviation (of the sample)
Calculate the standard error = standard deviation of the sample / √n, where n is the sample size
To estimate the population mean, we say our best guess is that the population mean is equal to the sample mean
Then we can use the standard error to estimate how close we think our estimate is.

First we need to talk about confidence intervals
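As a quick illustration of the standard error step, the Python sketch below divides a sample standard deviation by the square root of the sample size for a few sample sizes. The standard deviation of 10 and the sample sizes are made-up values, and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    return sample_sd / math.sqrt(n)


# Hypothetical sample standard deviation of 10, with growing sample sizes.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
```

Each time the sample size is multiplied by four, the standard error halves, so larger samples give a tighter estimate of the population mean.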

Which one is an insult?

Darling, you are two standard deviations below the mean
Of course you're normal (mean 10, mode 7)
You are mean
Your looks are in the 80th percentile
The difference between you and her is a standard deviation

45

47

Probability, Population Sampling and the Normal Curve

Thinking about our data that fitted the normal curve: by using the mathematical model we can easily calculate probabilities.

The maths tells us that:
The total area under the normal curve is equal to 1.
The probability that any new observation will fall within one standard deviation of the mean is about 68%
The probability that any new observation will fall within two standard deviations of the mean is about 95%
The probability that any new observation will fall within three standard deviations of the mean is about 99.7%
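These percentages can be checked numerically. The Python sketch below uses the error function from the standard `math` module to get the area under a normal curve within k standard deviations of the mean; `prob_within` is a hypothetical helper name.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 1.96):
    print(k, round(prob_within(k), 4))
```

Two standard deviations gives slightly more than 95%; the multiplier that gives 95% more precisely is 1.96, which is why that figure appears in confidence interval formulas.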

48

Examples

49

CERN experiments observe particle consistent with long-sought Higgs boson

Geneva, 4 July 2012. “We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV. The outstanding performance of the LHC and ATLAS and the huge efforts of many people have brought us to this exciting stage,” said ATLAS experiment spokesperson Fabiola Gianotti, “but a little more time is needed to prepare these results for publication.”

At five sigma there is only one chance in nearly two million that a measurement this extreme would appear as a random fluctuation if there were no real effect.

50

Confidence intervals (1)
If we measure one individual’s IQ we can be 95% sure that it would fall between 70 and 130.
This ‘interval’ is called the 95% confidence interval.

We use 95% by convention; sometimes other figures are used such as 98%.
If we measure the heights of a class of children and we have a mean of 1.2m and a standard deviation of 0.1m, what is your estimate for the height of a child randomly selected from the sample?

1.2 +/- 0.2 (mean +/- 2 standard deviations), i.e. 95% of this sample lies between 1.0 and 1.4m

51

Confidence intervals (2)
Reminder: the heights of a class of children have a mean of 1.2m and a standard deviation of 0.1m.

We measure a new child and their height is 1.5m. What does this mean?

This is equal to the mean + 3 standard deviations. This means there was less than a 0.5% chance of seeing this height in a child from this population. That doesn’t mean they are not part of the distribution (0.5% is not that rare), but it would be sensible to check a few things to be sure they are part of the same population (age!).

52

Confidence intervals (3)
This time we are using confidence intervals to estimate our ‘true’ population characteristics based on a sample.
Best estimate of the population mean = measured mean of the sample
Standard error of that estimate = standard deviation of the sample / √(number of measurements in the sample)

Therefore we can say that we are 95% confident that the mean of the population lies between the sample mean +/- 2 x standard error

This implies that our estimate of the mean gets better as n increases, because our error gets smaller.
This is the way we usually use confidence intervals in public health, as we usually measure a sample and infer the population.

Examples – Health Survey for England, household surveys, etc
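Putting the pieces together, here is a minimal Python sketch that estimates a 95% confidence interval for a population mean from a sample. The heights are made up, `mean_ci_95` is a hypothetical helper, and the multiplier 1.96 is used in place of the rounded "2"; for a sample this small a slightly larger multiplier from t tables would normally be used, so treat the result as an approximation.

```python
import math
import statistics


def mean_ci_95(sample):
    """Approximate 95% confidence interval for the population mean.

    Uses sample mean +/- 1.96 x standard error (a large-sample approximation).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean - 1.96 * standard_error, mean + 1.96 * standard_error


# Hypothetical heights in metres, for illustration only.
heights = [1.05, 1.10, 1.15, 1.20, 1.20, 1.25, 1.30, 1.35]
lower, upper = mean_ci_95(heights)
print(round(lower, 3), round(upper, 3))
```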

You are a significant part of my life

P value =9

53

I would never treat you differently to your sisters

Sister 1 CI 4-9 Sister 2 CI 5-11 Sister 3 CI 4-13 ME CI 2-3

54

55

Comparing two samples
The important question is: is there a difference between two populations?
This question might be asked in slightly different ways for different types of study, but is fundamentally the same:
For an RCT you compare the control group with the intervention group
For a cohort you compare the outcomes in those exposed to a risk factor with those not exposed
For a case-control you look at the group with the disease and compare their risk factors to those without the disease
You might look at before and after an intervention was put in place
You might compare one city or country to another

Comparing two samples (2)

56

The important question is – is there a difference between two populations?

57

Comparing two samples (3)
We can calculate the difference between the two populations as:

Mean difference = mean of pop 1 – mean of pop 2

Confidence interval = mean difference +/- 1.96*SE

SE (standard error) is a combination of the standard errors for each sample, built from the sample standard deviations $s_1$, $s_2$ and the sample sizes $n_1$, $n_2$:

$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

(SE can be slightly different for different situations, but this gives you an idea)
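The calculation on this slide can be sketched as a small Python function. The summary figures for the two groups are made up, and `mean_difference_ci` is a hypothetical name; the formula is the one given above, using each sample's standard deviation and size.

```python
import math


def mean_difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Mean difference with an approximate 95% confidence interval.

    SE combines the two samples: sqrt(sd1^2 / n1 + sd2^2 / n2).
    """
    difference = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return difference, difference - z * se, difference + z * se


# Hypothetical summary statistics for two groups of 100 children (heights in metres).
difference, lower, upper = mean_difference_ci(1.25, 0.1, 100, 1.20, 0.1, 100)
print(round(difference, 3), round(lower, 3), round(upper, 3))
```

In this made-up example the interval does not include 0, which would suggest a real difference between the two groups.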

58

T tests
Testing using a t test:
You need to know the mean and standard deviation of both of your samples.
You start with a hypothesis (the null hypothesis): that there is no difference between the two samples (or populations)
You then do some maths:

$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$, where $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Here $\bar{x}_1$, $\bar{x}_2$ are the sample means, $s_1$, $s_2$ the sample standard deviations, and $n_1$, $n_2$ the sample sizes.

59

T tests (2)
So what does t mean?
For large samples, t can be read on the horizontal axis of a normal distribution with mean = 0 and standard deviation = 1
From a table of t values you can read off the probability (p) of seeing a difference at least this large if the two samples came from the same population
Most important value: if t > 1.96 (or t < -1.96) then p < 0.05 (5%). This suggests that the two samples are fundamentally different
By convention, we reject the null hypothesis if p < 0.05
It's good practice to quote the p value, e.g. p = 0.01
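A rough Python sketch of the test: it computes t from summary statistics and then converts it to a two-sided p value using the normal curve, which matches the 1.96 rule on this slide but is only an approximation suitable for large samples (small samples need proper t tables or a statistics package). The input figures are made up and the function names are hypothetical.

```python
import math


def t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in sample means divided by the combined standard error."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


def two_sided_p_normal(t):
    """Two-sided p value using the normal approximation (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical summary statistics for two groups of 100 children (heights in metres).
t = t_statistic(1.25, 0.1, 100, 1.20, 0.1, 100)
p = two_sided_p_normal(t)
print(round(t, 2), round(p, 4))
```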

60

T tests (3)

What do these results mean?

Mean difference = 0, with 95% confidence interval (-1.0, +1.0), p= 0.50

Mean difference = 0.5, with 95% confidence interval (0.1, 0.9), p= 0.049

Mean difference = 1, with 95% confidence interval (-0.1, +1.1), p= 0.055

Mean difference = 1, with 95% confidence interval (0.2, +1.8), p= 0.02

Risk differences
Same principle – the null hypothesis is that there is no difference
For no difference, the 95% confidence interval would include 0
If it does not include 0, then you can be 95% confident that there is a risk difference.
You can also quote a p value

Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02

Would you take the intervention?

61

Risk differences (2)
You can also calculate the number needed to treat (NNT) from this
NNT is the number of people you need to treat to prevent one event from occurring: NNT = 1 / risk difference

Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02

If you treat 100 people you avoid 2 heart attacks. NNT = 50
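The NNT arithmetic is simply one divided by the risk difference, and applying it to the two ends of the confidence interval gives a range for the NNT as well. The Python sketch below uses the figures from the example; `number_needed_to_treat` is a hypothetical helper name.

```python
def number_needed_to_treat(risk_difference: float) -> float:
    """NNT = 1 / absolute risk difference (risk difference given as a proportion)."""
    if risk_difference <= 0:
        raise ValueError("risk difference must be positive to give a meaningful NNT")
    return 1 / risk_difference


# Figures from the example: risk difference 2%, 95% CI 1.5% to 2.4%.
nnt = number_needed_to_treat(0.02)
nnt_if_larger_effect = number_needed_to_treat(0.024)
nnt_if_smaller_effect = number_needed_to_treat(0.015)

print(round(nnt), round(nnt_if_larger_effect, 1), round(nnt_if_smaller_effect, 1))
```

So the point estimate is 50 people treated per heart attack avoided, with a plausible range of roughly 42 to 67.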

62

Risk ratio
A relative measure of risk – very commonly used
Same principle – the null hypothesis is that there is no difference IN THE RATIO OF RISKS
For no difference, the 95% confidence interval would include 1
Why 1 this time? Because if both had the same risk, the ratio would be 1

If it does not include 1, then you can be 95% confident that the risks differ.

You can also quote a p value

63

Odds ratio
A relative measure of risk – very commonly used
Very similar to risk ratio
Used for certain types of study, and produced as the result of some calculations
For no difference, the 95% confidence interval would include 1
If it does not include 1, then you can be 95% confident that there is a difference.
You can also quote a p value

64

Examples
Meta-analysis of the 5 prospective cohort studies (86,092 patients) indicated that individuals with periodontal disease had 1.14 times the risk of developing CHD compared with the controls (relative risk 1.14, 95% CI 1.074-1.213, P < .001)

The risk of VTE was 2.33 for obesity (95% CI, 1.68 to 3.24), 1.51 for hypertension (95% CI, 1.23 to 1.85), 1.42 for diabetes mellitus (95% CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95 to 1.46), and 1.16 for hypercholesterolemia (95% CI, 0.67 to 2.02).
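The "does the interval include 1?" rule can be applied to the VTE figures above with a few lines of Python. The estimates and intervals are copied from the example; the helper name `excludes_null` is hypothetical.

```python
# (estimate, lower 95% limit, upper 95% limit) for each risk factor, as quoted above.
vte_results = {
    "obesity": (2.33, 1.68, 3.24),
    "hypertension": (1.51, 1.23, 1.85),
    "diabetes mellitus": (1.42, 1.12, 1.77),
    "smoking": (1.18, 0.95, 1.46),
    "hypercholesterolemia": (1.16, 0.67, 2.02),
}


def excludes_null(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True when the confidence interval does not contain the 'no difference' value."""
    return not (lower <= null_value <= upper)


significant = [name for name, (_, lo, hi) in vte_results.items() if excludes_null(lo, hi)]
not_significant = [name for name in vte_results if name not in significant]

print(significant)
print(not_significant)
```

For smoking and hypercholesterolemia the interval includes 1, so on these figures alone the data are compatible with no difference in risk, even though the point estimates are above 1.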

65

66

In summary

Your boss says: “Do we need a weight loss service for kids in XXX area?”

1. You collect data: what is the definition of “kids”, is the data accurate, how was it collected, and from what year?

2. Compare the areas: is yours much different, and is there an underlying reason?

3. Is the difference statistically significant?

67

In summary (2)

You look at a service elsewhere (from evidence)

You ask yourself: who was included in this sample, and are they different to my population?

Looking at the odds, what proportion of kids will this work on?

Look to see if the test group was biased compared to the control group

Were the results normally distributed, skewed or other?

68

In summary (3)

Were the results significantly different between the two groups?

Can you rely on these findings?

You have just:
Found the need
Evaluated its accuracy
Reviewed a solution
Looked at effectiveness

WELL DONE!!!

69

Useful websites

Basic maths and probability: http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm

Tutorials on statistics: http://www.stattrek.com/tutorials/statistics-tutorial.aspx
