John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics

Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit

Introductory Statistics

University of Newcastle-upon-Tyne

Course Outline

Data Description
– Mean, Median, Standard Deviation
– Graphs
The Normal Distribution
Populations and Samples
Confidence intervals and p-values
Estimation and Hypothesis testing
– Continuous data
– Categorical data
Regression and Correlation

Course Objectives

To have an understanding of the Normal distribution and its relationship to common statistical analyses

To have an understanding of basic statistical concepts such as confidence intervals and p-values

To know which analysis is appropriate for different types of data

Recommended Textbooks

Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books

Altman DG. Practical Statistics for Medical Research. Chapman and Hall

Bland M. An Introduction to Medical Statistics. Oxford Medical Publications

Campbell MJ & Machin D. Medical Statistics: A Commonsense Approach. Wiley

Other reading

Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18:393-401

www.mas.ncl.ac.uk/~njnsm/medfac/MDPhD/notes.htm

BMJ statistics notes

Types of Data

Numerical Data
– discrete
  number of lesions
  number of visits to GP
– continuous
  height
  lesion area

Types of Data

Categorical
– unordered
  Pregnant/Not pregnant
  married/single/divorced/separated/widowed
– ordered (ordinal)
  minimal/moderate/severe/unbearable
  Stage of breast cancer: I II III IV

Exercise

What type are the following variables?
a) sex
b) diastolic blood pressure
c) diagnosis
d) height
e) family size
f) cancer stage

Types of Data

Outcome/Dependent variable
– outcome of interest
– e.g. survival, recovery

Explanatory/Independent variable
– treatment group
– age
– sex

[Figure: Histogram of Birthweight (grams) at 40 weeks GA]

Summary Statistics

Location
– Mean (average value)
– Median (middle value)
– Mode (most frequently occurring value)

Variability
– Variance/SD
– Range
– Centiles

Birthweights (g) at 40 weeks Gestation

mean = 3441g
median = 3428g
sd = 434g
min = 2050g
max = 4975g
range = 2925g

Boxplot

[Figure: boxplot of T4 cells/mm3 blood sample by group: Hodgkin's (N = 20) and Non-Hodgkin's (N = 20)]

Symmetric Data

mean = median (approx)
standard deviation

Skew Data

median = "typical" value
mean affected by extreme values - larger than median
SD fairly meaningless
centiles (less affected by extreme values/outliers)

Half of all doctors are below average…

Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average

Ref. BMJ 1998; 316:1734-1736

Discrete Data

Principal diagnosis of patients in Tooting Bec Hospital

Diagnosis                 Number of patients
Schizophrenia              474 (32%)
Affective Disorders        277 (19%)
Organic Brain Syndrome     405 (28%)
Subnormality                58 (4%)
Alcoholism                  57 (4%)
Other/Not Known            196 (13%)
Total                     1467

Bar Chart

[Figure: bar chart of Principal Diagnosis of Patients in Tooting Bec Hospital; count (0 to 500) by diagnosis]

Summarising data - Summary

Choosing the appropriate summary statistics and graph depends upon the type of variable you have
– Categorical (unordered/ordered)
– Continuous (symmetric/skew)

The Normal Distribution

$N(\mu, \sigma^2)$
$\mu$: unknown population mean - estimate using sample mean
$\sigma$: unknown population SD - estimate using sample SD
Birthweight is $N(3441, 434^2)$

N(0,1) - Standard Normal Distribution

68% within ± 1 SD units
95% within ± 1.96 SD units
99% within ± 2.58 SD units

$z = (x - \mu)/\sigma$
z - SD units

Birthweight (g) at 40 weeks

95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 - 4561 grams
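The ranges quoted for birthweight can be reproduced with a few lines of Python. This is only a sketch using the standard library; the helper names `central_range` and `z_score` are made up for illustration, and the multipliers come out slightly different from the rounded 1.96 and 2.58 used on the slide.

```python
from statistics import NormalDist

# Values taken from the slide: birthweight ~ N(3441, 434^2).
MEAN_G = 3441
SD_G = 434


def central_range(mean, sd, coverage):
    """Return (lower, upper) limits holding `coverage` of a Normal distribution."""
    # Standard Normal multiplier, e.g. about 1.96 for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mean - z * sd, mean + z * sd


def z_score(x, mean, sd):
    """Express x in SD units: z = (x - mean) / sd."""
    return (x - mean) / sd


if __name__ == "__main__":
    for coverage in (0.68, 0.95, 0.99):
        low, high = central_range(MEAN_G, SD_G, coverage)
        print(f"{coverage:.0%} of birthweights between {low:.0f} g and {high:.0f} g")
    print(f"A 4000 g baby is {z_score(4000, MEAN_G, SD_G):.2f} SDs from the mean")
```

The 99% limits differ from the slide by a gram or two because the exact multiplier is nearer 2.576 than 2.58.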

Further Reading

http://www.mas.ncl.ac.uk/~njnsm/medfac/docs/intro.pdf

Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572

Altman DG, Bland JM (1995) The normal distribution. BMJ 310, 298.

Samples and Populations

Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc

Samples are not interesting in their own right - only to infer information about the population from which they are drawn

Sampling Variation

Populations are unique - samples are not.

Samples and Populations

How much might these estimates vary from sample to sample?

Determine precision of estimates (how close/far away from the population?)

(Artificial) example

Have 5000 measurements of diastolic blood pressure from airline pilots. This accounts for ALL airline pilots and is the population of airline pilots.

(Artificial example - if we had the whole population we wouldn’t need to sample!!)

Since we have the population, we know the true population characteristics. It is these we are trying to estimate from a sample.

[Figure: Population distribution of diastolic BP from Airline Pilots (in mmHg)]

True mean = 78.2
True SD = 9.4

Example

Write each measurement on a piece of paper and put into a hat.

Draw 5 pieces of paper and calculate the mean of the BP.

replace and repeat 49 more times

End up with 50 (different) estimates of mean BP

Sampling Distribution

Each estimate of the mean will be different.
Treat this as a random sample of means
Plot a histogram of the means.
This is an estimate of the sampling distribution of the mean.
Can get the sampling distribution of any parameter in a similar way.
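The hat experiment is easy to imitate in code. The sketch below makes one loud assumption: the real 5000 pilot measurements are not available, so a stand-in population is simulated as Normal with the quoted mean 78.2 and SD 9.4. The function name `sample_means` is hypothetical.

```python
import random
import statistics

# Assumption: a simulated stand-in for the 5000 pilot measurements,
# drawn from a Normal distribution with mean 78.2 and SD 9.4 mmHg.
rng = random.Random(1)
population = [rng.gauss(78.2, 9.4) for _ in range(5000)]


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    for n in (5, 10, 100):
        means = sample_means(population, n, 50, rng)
        print(
            f"N={n:3d}: mean of means = {statistics.mean(means):.1f}, "
            f"SD of means = {statistics.stdev(means):.2f}, "
            f"population SD/sqrt(N) = {statistics.pstdev(population) / n ** 0.5:.2f}"
        )
```

The spread of the 50 means should shrink as the sample size grows, which is the pattern the histograms for N=5, 10 and 100 are meant to show.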

Distribution of the mean

[Figure: histograms of 50 sample means for samples of size N=5, N=10 and N=100, compared with the population ($\mu$ = 78.2, $\sigma$ = 9.4)]

Distribution of the Mean

BUT! Don’t need to take multiple samples

Standard error of the mean = $\sqrt{\text{Sample SD}^2 / N}$

SE of the mean is the SD of the distribution of the sample mean

Distribution of Sample Mean

Distribution of sample mean is Normal regardless of distribution of sample (unless small or very skew sample)

SO
Can apply Normal theory to sample mean also

Distribution of Sample Mean

i.e. 95% of sample means lie within 1.96 SEs of (unknown) true mean
This is the basis for a 95% confidence interval (CI)
95% CI is an interval which on 95% of occasions includes the population mean

Example

57 measurements of FEV1 in male medical students

$\bar{X} = 4.06$ litres, SD $= 0.67$ litres

95% of population lie within $\bar{X} \pm 1.96$ SDs
i.e. within $4.06 \pm 1.96 \times 0.67$, from 2.75 to 5.38 litres

Example

SE $= \sqrt{0.67^2 / 57} = 0.09$

Thus for FEV1 data, 95% chance that the interval $4.06 \pm 1.96 \times 0.09$ contains the true population mean
i.e. between 3.89 and 4.23 litres

This is the 95% confidence interval for the mean
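The FEV1 arithmetic can be checked with a short script. This is a sketch from the summary figures on the slide only (N = 57, mean 4.06, SD 0.67); the function names are illustrative. It also shows the contrast between the reference range (built from the SD) and the confidence interval (built from the SE).

```python
from math import sqrt

# Summary figures from the slide: 57 FEV1 measurements, in litres.
N = 57
MEAN_FEV1 = 4.06
SD_FEV1 = 0.67
Z_95 = 1.96  # Normal multiplier for 95% coverage


def standard_error(sd, n):
    """SE of the mean = SD / sqrt(N)."""
    return sd / sqrt(n)


def reference_range(mean, sd, z=Z_95):
    """Range expected to contain about 95% of individuals."""
    return mean - z * sd, mean + z * sd


def confidence_interval(mean, sd, n, z=Z_95):
    """Normal-theory 95% confidence interval for the population mean."""
    se = standard_error(sd, n)
    return mean - z * se, mean + z * se


if __name__ == "__main__":
    print(f"SE = {standard_error(SD_FEV1, N):.2f} litres")
    low, high = reference_range(MEAN_FEV1, SD_FEV1)
    print(f"95% of individuals: {low:.2f} to {high:.2f} litres")
    low, high = confidence_interval(MEAN_FEV1, SD_FEV1, N)
    print(f"95% CI for the mean: {low:.2f} to {high:.2f} litres")
```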

Confidence Intervals

The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.

Standard Deviations & Standard Errors

The SE is the SD of the sampling distribution (of the mean, say)
SE = SD/√N
Use SE to describe the precision of estimates (for example Confidence intervals)
Use SD to describe the variability of samples, populations or distributions (for example reference ranges)

The t-distribution

When N is small, estimate of SD is particularly unreliable and the distribution of sample mean is not Normal
Distribution is more variable - longer tails
Shape of distribution depends upon sample size
This distribution is called the t-distribution

[Figure: t-distribution compared with N(0,1) for increasing sample size]

N=2: t(1), 95% within ± 12.7
N=10: t(9), 95% within ± 2.26
N=30: t(29), 95% within ± 2.04

t-distribution

As N becomes larger, t-distribution becomes more similar to Normal distribution

Degrees of Freedom (DF) = sample size - 1

DF is a measure of the amount of information contained in the data set

Implications

Confidence interval for the mean
– Sample size < 30: use t-distribution
– Sample size > 30: use either Normal or t distribution
Note: Stats packages (generally) will automatically use the correct distribution for confidence intervals

Example

Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3

Mean = 3.33, SD = 1.03, DF = 6, t(5%) = 2.45
95% CI = $3.33 \pm 2.45 \times 1.03/\sqrt{7}$: 2.38 to 4.28 hours
Normal 95% CI = $3.33 \pm 1.96 \times 1.03/\sqrt{7}$: 2.57 to 4.09 hours. TOO NARROW!!
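The arthritis example can be reproduced from the seven raw values. In this sketch the multiplier 2.45 for 6 DF is copied from the slide rather than computed, since the Python standard library has no t-distribution; the function name `mean_ci` is illustrative. Results may differ from the slide in the second decimal place because the slide rounds the mean and SD first.

```python
from math import sqrt
from statistics import mean, stdev

# Hours of relief for the 7 patients, as listed on the slide.
hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]

T_6DF_95 = 2.45  # t multiplier for 6 DF, taken from the slide
Z_95 = 1.96      # Normal multiplier


def mean_ci(data, multiplier):
    """Confidence interval for the mean: mean +/- multiplier * SD / sqrt(N)."""
    m = mean(data)
    se = stdev(data) / sqrt(len(data))  # stdev uses the N - 1 divisor
    return m - multiplier * se, m + multiplier * se


if __name__ == "__main__":
    print(f"mean = {mean(hours):.2f}, SD = {stdev(hours):.2f}, DF = {len(hours) - 1}")
    t_low, t_high = mean_ci(hours, T_6DF_95)
    z_low, z_high = mean_ci(hours, Z_95)
    print(f"t-based 95% CI:      {t_low:.2f} to {t_high:.2f} hours")
    print(f"Normal-based 95% CI: {z_low:.2f} to {z_high:.2f} hours (too narrow)")
```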

Hypothesis Testing

Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest

In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?

Null Hypothesis

Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure)

Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure)

Hypothesis of no effect/difference

Two-Sample t-test

Two independent samples
Can the two samples be considered to be the same with respect to the variable you are measuring or are they different?
Sample means will ALWAYS be different - real difference or random variation?
ASSUMPTION: Data are normally distributed and SD in each group similar

Two-Sample t-test

24 hour total energy expenditure (MJ/day) in groups of lean and obese women
Do the women differ in their energy expenditure?
Null hypothesis: energy expenditure in lean and obese women is the same

[Figure: Boxplot of 24 hour total energy expenditure (MJ/day) by group: lean (N = 13) and obese (N = 9)]

Two-sample t-test

Summary statistics

        lean    obese
Mean    8.1     10.3
SD      1.2     1.4
N       13      9

Difference in means = 10.3 - 8.1 = 2.2
SE difference = 0.57 (weighted average)

Two-Sample t-test

Test statistic is 2.2/0.57 = 3.9
N1 + N2 - 2 DF (= 20)
Calculate the probability of observing a value at least as extreme as 3.9 if the null hypothesis is true
If the null hypothesis is true, the test statistic should have a t-distribution with 20 df (df = N1+N2-2)

Two-Sample t-test

95% of values from t-distribution with 20 DF lie between -2.09 and +2.09
Probability of observing a value as extreme or more extreme than 3.9 in a t-distribution with 20 df is 0.001
Only a very small probability that the value of 3.9 fits reasonably with a t-distribution with 20 df
Conclude that energy expenditure is significantly different between lean and obese women
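The test statistic can be rebuilt from the summary table. The sketch below assumes that the "weighted average" behind the SE is the usual pooled variance, which the slides do not spell out, and it works from the rounded means and SDs, so it gives an SE near 0.56 rather than exactly 0.57. The function name `pooled_t` is illustrative; the critical value 2.09 is copied from the slide.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from summary data, assuming similar SDs.

    Returns (difference, SE of difference, t statistic, degrees of freedom).
    """
    df = n1 + n2 - 2
    # Pooled variance: average of the two variances weighted by their DF.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean2 - mean1
    return diff, se_diff, diff / se_diff, df


if __name__ == "__main__":
    # lean: mean 8.1, SD 1.2, N 13; obese: mean 10.3, SD 1.4, N 9
    diff, se_diff, t_stat, df = pooled_t(8.1, 1.2, 13, 10.3, 1.4, 9)
    print(f"difference = {diff:.1f} MJ/day, SE = {se_diff:.2f}, t = {t_stat:.1f}, DF = {df}")
    T_CRIT_20DF = 2.09  # from the slide: 95% of t(20) lies within +/- 2.09
    print("outside +/- 2.09:", abs(t_stat) > T_CRIT_20DF)
```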

The P-value

The P-value is the probability of observing a test statistic at least as extreme as that observed if the null hypothesis is true

[Figure: probability density of the t distribution with 20 df, x from -4 to 4]

Confidence Interval for the difference in two means

95% CI = $2.2 - 2.09 \times 0.57$ to $2.2 + 2.09 \times 0.57$
or from 1.05 to 3.41 MJ/day (these limits appear to come from unrounded values; the rounded figures above give about 1.01 to 3.39)
Thus we are 95% confident that obese women use between 1.05 and 3.41 MJ/day energy more than lean women

Confidence Interval or P-value?

Confidence interval!!!
P-value will tell you whether or not there is a statistically significant difference
confidence interval will give information about the size of the difference and the strength of the evidence

Paired t-test

Obvious pairing between observations
– two measurements on each subject (before-after study)
– case-control pairs
Assumption - paired differences are normally distributed
Example - Systolic blood pressure (SBP) measured in 16 middle aged men before and after a standard exercise. Post-exercise SBP - Pre-exercise SBP calculated for each man

[Figure: Boxplot of differences, Post-exercise SBP - Pre-exercise SBP (N = 16)]

Paired t-test

Mean difference = 6.6
SE(Mean) = 1.5
t = 6.6/1.5 = 4.4
Compare with t(15)
P < 0.001
Conclusion - mean systolic blood pressure is higher after exercise than before

Paired t-test

95% confidence interval for the mean difference
$6.6 \pm 2.13 \times 1.5$ = 3.4 to 9.8
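A paired t-test is a one-sample analysis of the within-pair differences. The sketch below shows that step on raw data and then re-checks the slide's summary arithmetic. The individual SBP readings are not given in the slides, so the `before` and `after` lists are invented purely for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (mean difference, SE, t statistic, DF) for paired measurements."""
    diffs = [a - b for a, b in zip(after, before)]
    se = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs), se, mean(diffs) / se, len(diffs) - 1


def ci_from_summary(mean_diff, se, t_crit):
    """Confidence interval: mean difference +/- t multiplier * SE."""
    return mean_diff - t_crit * se, mean_diff + t_crit * se


if __name__ == "__main__":
    # Hypothetical readings (mmHg), NOT the data behind the slide.
    before = [120, 132, 118, 141, 127, 135]
    after = [128, 136, 121, 150, 131, 144]
    print("hypothetical data:", paired_t(before, after))

    # Summary figures from the slide: mean difference 6.6, SE 1.5, t(15) = 2.13.
    low, high = ci_from_summary(6.6, 1.5, 2.13)
    print(f"t = {6.6 / 1.5:.1f}, 95% CI {low:.1f} to {high:.1f}")
```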

Categorical Variables

To investigate the relationship between two categorical variables form contingency table

Hypothesis tests
– Chi-squared test ($\chi^2$ test)
– Fisher’s exact test (small samples)
– McNemar’s test (paired data)

Chi-squared test

Used to test for associations between categorical variables (2 or more distinct outcomes)

Example - a comparison between psychotherapy and usual care for major depression in primary care

Patient Reported Recovery at 8 months

                 Recovered    Not Recovered    Total
Psychotherapy    47 (51%)     46 (49%)          93
Usual Care       18 (20%)     73 (80%)          91
Total            65 (35%)     119 (65%)        184

P<0.001, Chi-square test

Patient Reported Recovery at 8 months

Difference between proportions recovered 30.8%
95% confidence interval for difference 17.7% to 43.8%
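Both results for the recovery table can be recomputed from the four counts. This sketch assumes the uncorrected Pearson chi-squared statistic and a simple Normal-approximation interval for the difference in proportions; the slides do not say exactly which variants were used, and the function names are illustrative.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def p_value_1df(stat):
    """Upper-tail probability of chi-squared with 1 DF: erfc(sqrt(x / 2))."""
    return erfc(sqrt(stat / 2))


def difference_ci(x1, n1, x2, n2, z=1.96):
    """Difference in proportions with a Normal-approximation 95% CI."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Psychotherapy: 47 recovered, 46 not; usual care: 18 recovered, 73 not.
    stat = chi_squared_2x2(47, 46, 18, 73)
    print(f"chi-squared = {stat:.1f} on 1 DF, P = {p_value_1df(stat):.5f}")
    diff, low, high = difference_ci(47, 93, 18, 91)
    print(f"difference = {diff:.1%}, 95% CI {low:.1%} to {high:.1%}")
```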

Larger tables

Similar methods can be applied to larger tables to test the association between two categorical variables

Example - Is there an association between housing tenure and time of delivery of baby (preterm/term).

Null hypothesis: There is no relationship between housing tenure and time of delivery

Relationship between housing tenure and time of delivery

Observed counts, with expected counts under the null hypothesis in brackets

Housing Tenure        Preterm      Term            Total
Owner-occupier        50 (61.7)    849 (837.3)      899
Council Tenant        29 (17.7)    229 (240.3)      258
Private Tenant        11 (12.0)    164 (163.0)      175
Lives with Parents     6 (4.9)      66 (67.1)        72
Other                  3 (2.7)      36 (36.3)        39
Total                 99          1344             1443

Relationship between housing tenure and time of delivery

Test Statistic

$\chi^2 = \frac{(50 - 61.7)^2}{61.7} + \frac{(849 - 837.3)^2}{837.3} + \dots + \frac{(3 - 2.7)^2}{2.7} + \frac{(36 - 36.3)^2}{36.3} = 10.5$

DF = (5-1) × (2-1) = 4
P = 0.03
Thus we have strong evidence of a relationship between housing tenure and time of delivery
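The expected counts, the test statistic and the P-value for the housing table can all be derived from the observed counts. This is a sketch with illustrative function names; the P-value uses a closed form for the chi-squared tail that only holds for an even number of DF, which happens to suit the 4 DF here.

```python
from math import exp, factorial

# Observed counts from the slide: [preterm, term] for each housing tenure.
observed = [
    [50, 849],   # Owner-occupier
    [29, 229],   # Council Tenant
    [11, 164],   # Private Tenant
    [6, 66],     # Lives with Parents
    [3, 36],     # Other
]


def chi_squared(table):
    """Return (chi-squared statistic, DF) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_even_df(stat, df):
    """Upper-tail chi-squared probability; this closed form needs even DF."""
    if df % 2 != 0:
        raise ValueError("closed form used here needs an even number of DF")
    half = stat / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


if __name__ == "__main__":
    stat, df = chi_squared(observed)
    print(f"chi-squared = {stat:.1f}, DF = {df}, P = {p_value_even_df(stat, df):.2f}")
```

Two of the expected counts in this table (4.9 and 2.7) fall below 5, so the caution about small expected values applies to this example as well.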

Notes

Chi-squared test not valid if expected values are small (<5)
– Combine rows or columns to obtain a smaller table with larger expected values
– Use Fisher’s exact test for small tables

McNemar’s test

Appropriate for use with paired or matched (case-control) data with a dichotomous outcome

Example - McNemar’s test

Skaane compared the use of mammography and ultrasound in the assessment of 327 (228 palpable and 99 non-palpable) consecutive malignant tumours confirmed at histology.

Acta radiologica vol 40;486-490 (1999)

McNemar’s test - example

                Mammogram
                Yes     No     Tot.
US     Yes      267     11     278
       No        41      8      49
       Tot.     308     19     327

McNemar’s test - example

308/327 (94%) were picked up by mammography compared with 278/327 (85%) picked up by ultrasound
P<0.001
Conclusion: Mammography is significantly more sensitive in diagnosing tumours than ultrasound in a population of mixed malignant tumours
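McNemar's test uses only the discordant pairs: the 41 tumours seen on mammography but not ultrasound, and the 11 seen on ultrasound but not mammography. The sketch below assumes the chi-squared form of the test with a continuity correction; the slides do not say which version produced P<0.001, and the function name `mcnemar` is illustrative.

```python
from math import erfc, sqrt


def mcnemar(b, c, continuity=True):
    """McNemar chi-squared statistic (1 DF) and P-value from discordant counts.

    b and c are the two off-diagonal cells of the paired 2 x 2 table.
    """
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction (an assumption here)
    stat = diff ** 2 / (b + c)
    # Upper-tail probability of chi-squared with 1 DF.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # US yes / mammogram no = 11; US no / mammogram yes = 41.
    stat, p_value = mcnemar(11, 41)
    print(f"McNemar chi-squared = {stat:.1f}, P = {p_value:.5f}")
```

The 267 tumours found by both methods and the 8 found by neither carry no information about which method is more sensitive, which is why they do not enter the statistic.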

Hypothesis testing - summary

Type of data                    Paired Design                         Unpaired Design
Continuous Quantitative data    Paired (one-sample) t-test            Unpaired (independent samples) t-test
                                Wilcoxon signed rank test             Mann-Whitney U test
Ordered Categorical data        Wilcoxon signed rank test             Mann-Whitney U test
Unordered Categorical data      McNemar's test (2 categories only)    Chi-squared test
                                                                      Fisher's exact test

Adapted from Chinn S. Statistics for the European Respiratory Journal.

Correlation and Regression

Relationship between two continuous variables
– regression
– correlation

Relationship between two continuous variables

3 main purposes for doing this
– to assess whether the two variables are associated (correlation)
– to enable the value of one variable to be predicted from any known value of the other variable (regression)
– to assess the amount of agreement between two variables (method comparison study)

Example

Women from a pre-defined geographical area were invited to have their haemoglobin (Hb) level and packed cell volume measured. They were also asked their age.

Haemoglobin and packed cell volume

[Figure: scatter plot of Haemoglobin level (g/dl) against Packed Cell Volume (%)]

Example - relationships between variables

Association between Hb and PCV?
Hb affects PCV or PCV affects Hb?
Use correlation to measure the strength of an association

Association between Hb and age?
age must affect Hb and not vice versa
Use regression to predict Hb from age

Correlation

Not interested in causation
i.e. does a high PCV cause a high Hb level
Interested in association
i.e. is a high PCV associated with a high Hb level?
sample correlation coefficient
– summarises strength of relationship
– can be used to test the hypothesis that the population correlation coefficient is 0

Correlation Coefficient

dimensionless, from -1 to 1
measures the strength of a linear relationship
+ve - high value of one variable associated with high value of the other
-ve - high value of one variable associated with low value of the other
±1 = exact linear relationship
strictly called Pearson correlation coefficient
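The Pearson coefficient is the sum of cross-products of deviations divided by the square root of the product of the two sums of squares. The sketch below spells that out; the function name `pearson_r` and the small data sets are invented for illustration and are not the Hb/PCV data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    # Hypothetical illustrations of the extremes described above.
    print("exact increasing line:", pearson_r(x, [4 + 2 * v for v in x]))
    print("exact decreasing line:", pearson_r(x, [20 - 3 * v for v in x]))
    print("scattered values:     ", pearson_r(x, [5, 3, 8, 4, 9, 6, 10, 7, 12]))
```

Because the deviations are divided by their own scale, the result has no units and does not change if either variable is rescaled.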

Example Data

[Figure: four scatter plots of Y against X illustrating r = 1, r = -0.4, r = 0.7 and r = 0]

When not to use the correlation coefficient

If the relationship is non-linear
with caution in the presence of outliers
when the variables are measured over more than one distinct group (e.g. disease groups)
when one of the variables is fixed in advance
Assessing agreement

Correlation - example data

[Figure: four scatter plots, y1 against x1, y2 against x2, y3 against x3 and y4 against x4]

Is there an alternative?

If the data are non-linear or there is an outlier
– use Spearman rank correlation coefficient

Haemoglobin and Packed Cell Volume

[Figure: scatter plot of Haemoglobin level (g/dl) against Packed Cell Volume (%), including one outlier]

Without outlier: Pearson = 0.67, Spearman = 0.63
With outlier: Pearson = 0.34, Spearman = 0.48
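The Spearman coefficient is the Pearson coefficient applied to the ranks of the data, which is why a single extreme point disturbs it less. The sketch below is an illustration only: the function names are hypothetical and the PCV/Hb values are invented, so it will not reproduce the figures quoted for the real data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical PCV (%) and Hb (g/dl) values, NOT the study data.
    pcv = [35, 38, 40, 41, 43, 45, 46, 48, 50]
    hb = [11.5, 12.8, 12.1, 13.6, 13.0, 14.9, 14.2, 15.8, 15.1]
    print("without outlier:", pearson_r(pcv, hb), spearman_r(pcv, hb))
    # Add one implausible point to act as an outlier.
    print("with outlier:   ", pearson_r(pcv + [52], hb + [3.0]),
          spearman_r(pcv + [52], hb + [3.0]))
```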

Regression

Assume a change in x will cause a change in y
predict y for a given value of x
usually not logical to believe y causes x
y is the dependent variable (vertical axis)
x is the independent variable (horizontal axis)

Example - Haemoglobin vs Age

[Figure: scatter plot of Haemoglobin level (g/dl) against Age (Years)]

RegressionRegression Logical to assume that increasing age leads to Logical to assume that increasing age leads to

increasing Hbincreasing Hb Not logical to assume Hb affects age!Not logical to assume Hb affects age! Assume underlying true linear relationshipAssume underlying true linear relationship Make an estimate of what that true linear Make an estimate of what that true linear

relationship isrelationship is

Estimating a regression line

How do I identify the ‘best’ straight line?
Least squares estimate
Straight line determined by slope and intercept
y = a + bx
a and b are estimates of the true intercept and slope and are subject to sampling variation
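The least squares estimates of a and b can be sketched as follows. This is an illustration under stated assumptions: the ages and Hb values are hypothetical, and the function name `least_squares` is made up for this sketch. It uses the standard formulas b = Sxy / Sxx and a = mean(y) - b × mean(x).

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical ages (years) and Hb values (g/dl), for illustration only
age = [20, 30, 40, 50, 60]
hb = [11.0, 12.5, 13.0, 15.0, 16.5]

a, b = least_squares(age, hb)
print(f"Hb = {a:.2f} + {b:.3f} x age")
```

A different sample from the same population would give slightly different values of a and b, which is the sampling variation referred to above.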

Regression line of haemoglobin on age

[Figure: scatter plot of haemoglobin (g/dl) against age (years) with the fitted regression line]

Regression of haemoglobin on age

Variable(s) Entered on Step Number 1..  AGE  Age (Years)

Multiple R           .87959
R Square             .77367
Adjusted R Square    .76110
Standard Error      1.17398

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1         84.80397      84.80397
Residual      18         24.80803       1.37822

F = 61.53133    Signif F = .0000

Regression of haemoglobin on age

---------------------- Variables in the Equation ----------------------
Variable             B       SE B    95% Confdnce Intrvl B         T    Sig T
AGE            .134251    .017115     .098295     .170208      7.844    .0000
(Constant)    8.239786    .794261    6.571104    9.908467     10.374    .0000

What does this tell us?

Mean Hb = 8.2 + 0.13 × AGE
95% CI for the slope goes from 0.098 to 0.170
P < 0.0001
Significant relationship between Hb and age
77% of the variability in Hb can be accounted for by age
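The summary figures can be traced back to the regression output with a little arithmetic. The sketch below uses only numbers printed in that output, plus one assumption: the value 2.101 for the two-sided 5% point of the t distribution with 18 degrees of freedom is taken from standard tables, not from the slides.

```python
# Figures copied from the regression output
ss_regression = 84.80397
ss_residual = 24.80803
df_regression = 1
df_residual = 18
slope, se_slope = 0.134251, 0.017115

# Assumption: two-sided 5% point of t with 18 df, from standard tables
t_crit = 2.101

n = df_regression + df_residual + 1   # number of observations

# R Square: proportion of the total variability explained by age
r_squared = ss_regression / (ss_regression + ss_residual)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

# F ratio and residual standard deviation ("Standard Error" in the output)
ms_residual = ss_residual / df_residual
f_ratio = (ss_regression / df_regression) / ms_residual
residual_sd = ms_residual ** 0.5

# t statistic and 95% confidence interval for the slope
t_slope = slope / se_slope
ci_slope = (slope - t_crit * se_slope, slope + t_crit * se_slope)

print(f"R squared = {r_squared:.5f}, adjusted = {adj_r_squared:.5f}")
print(f"F = {f_ratio:.3f}, residual SD = {residual_sd:.5f}")
print(f"t = {t_slope:.3f}, 95% CI for slope = {ci_slope[0]:.3f} to {ci_slope[1]:.3f}")
```

With a single predictor, F is the square of the t statistic for the slope, so the two tests give the same P value.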

How can it be used?

Predict mean Hb for a given age
E.g. What is the mean Hb of a 50 year old?
Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl (14.95 g/dl with the unrounded coefficients 8.239786 and 0.134251)
95% CI for the estimate from 14.4 to 15.5 g/dl

How can it be used?

To calculate reference ranges for the population
E.g. What range would you expect 95% of 50 year olds to lie within? (reference range)
Between 12.4 and 17.5 g/dl
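The two intervals quoted for a 50 year old can be approximately reproduced from the regression output alone. The sketch below is a reconstruction under several assumptions, not the lecturer's own calculation: n = 20 is inferred from the 18 residual degrees of freedom, the t value 2.101 comes from standard tables, and the sum of squares of age about its mean (`sxx`) and the mean age are backed out of the reported standard errors using the usual simple linear regression formulas.

```python
from math import sqrt

# Figures copied from the regression output
intercept, se_intercept = 8.239786, 0.794261
slope, se_slope = 0.134251, 0.017115
ms_residual = 1.37822          # residual mean square

# Assumptions (not stated on the slides)
n = 20                         # residual df (18) + 2 estimated parameters
t_crit = 2.101                 # two-sided 5% point of t with 18 df, from tables

# Back out Sxx and the mean age from the reported standard errors:
#   SE(b)^2 = s^2 / Sxx   and   SE(a)^2 = s^2 * (1/n + mean_age^2 / Sxx)
# The positive square root is taken because ages are positive.
sxx = ms_residual / se_slope ** 2
mean_age = sqrt((se_intercept ** 2 / ms_residual - 1 / n) * sxx)

age = 50
fitted = intercept + slope * age

# Variance of the estimated mean Hb at this age
var_mean = ms_residual * (1 / n + (age - mean_age) ** 2 / sxx)
se_mean = sqrt(var_mean)

# An individual's Hb also varies about the mean, so add the residual variance
se_individual = sqrt(ms_residual + var_mean)

ci_mean = (fitted - t_crit * se_mean, fitted + t_crit * se_mean)
pred_interval = (fitted - t_crit * se_individual, fitted + t_crit * se_individual)

print(f"fitted mean Hb at {age}: {fitted:.2f} g/dl")
print(f"95% CI for the mean: {ci_mean[0]:.1f} to {ci_mean[1]:.1f} g/dl")
print(f"95% prediction interval: {pred_interval[0]:.1f} to {pred_interval[1]:.1f} g/dl")
```

The prediction interval is much wider than the confidence interval because it describes where individual 50 year olds are expected to lie, not how precisely their mean is estimated.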

95% Confidence Interval for the Mean & 95% prediction interval for individuals

[Figure: haemoglobin (g/dl) against age (years), showing the regression line with the 95% confidence interval for the mean and the 95% prediction interval for individuals]

Definitions

Predicted value
– the value predicted by the regression line
– an estimate of the mean value
Residual
– Observed value - predicted value

What assumptions have I made?

The relationship is approximately linear
The residuals have a normal distribution
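Residuals are the quantities that the Normality assumption refers to, so it helps to see how they are obtained. The sketch below fits a least squares line to hypothetical data (not the lecture's) and lists the observed value, predicted value and residual for each subject.

```python
# Hypothetical ages (years) and observed Hb values (g/dl), for illustration only
age = [20, 30, 40, 50, 60]
observed = [11.0, 12.5, 13.0, 15.0, 16.5]

# Least squares slope and intercept
n = len(age)
mean_age = sum(age) / n
mean_hb = sum(observed) / n
slope = (sum((x - mean_age) * (y - mean_hb) for x, y in zip(age, observed))
         / sum((x - mean_age) ** 2 for x in age))
intercept = mean_hb - slope * mean_age

# Predicted value: the point on the regression line at each age
predicted = [intercept + slope * x for x in age]

# Residual: observed value minus predicted value
residuals = [y - p for y, p in zip(observed, predicted)]

for x, y, p, r in zip(age, observed, predicted, residuals):
    print(f"age {x}: observed {y:.2f}, predicted {p:.2f}, residual {r:+.2f}")
```

For a least squares line the residuals sum to zero. In practice the assumptions are usually checked by plotting the residuals, for example against the predictor to look for non-linearity and as a histogram or Normal plot to judge their distribution.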

Multiple Regression

One outcome variable with multiple predictor variables
Residuals assumed to be normally distributed
Predictor variables can be continuous or categorical
No assumptions made about distribution of continuous predictor variables

Multiple Regression

Example. Does the value of packed cell volume improve the prediction of Hb?

Model fitted

Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed cell volume (%)

R² = 83%

Knowledge of packed cell volume improves the prediction of haemoglobin
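The fitted model can be used in the same way as the simple regression line, now with two predictors. The sketch below simply evaluates the equation quoted above. The function name `predicted_hb` and the example subject are hypothetical, and because the coefficients are quoted to one decimal place the result is only approximate.

```python
def predicted_hb(age_years, pcv_percent):
    """Mean Hb (g/dl) from the fitted model as quoted, with rounded coefficients."""
    return 5.2 + 0.1 * age_years + 0.1 * pcv_percent

# Hypothetical subject: 50 years old with a packed cell volume of 45%
print(round(predicted_hb(50, 45), 1))
```

Each coefficient is the estimated change in mean Hb for a one-unit increase in that predictor with the other predictor held fixed.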

Summary

Regression can be used to estimate the numerical relationship between an outcome variable and one or more predictor variables
Correlation coefficient alone is of limited use
