John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics
Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit
Introductory Statistics
University of Newcastle-upon-Tyne
Course Outline
Data Description
– Mean, Median, Standard Deviation
– Graphs
The Normal Distribution
Populations and Samples
Confidence intervals and p-values
Estimation and Hypothesis testing
– Continuous data
– Categorical data
Regression and Correlation
Course Objectives
To have an understanding of the Normal distribution and its relationship to common statistical analyses
To have an understanding of basic statistical concepts such as confidence intervals and p-values
To know which analysis is appropriate for different types of data
Recommended Textbooks
Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books
Altman DG. Practical Statistics for Medical Research. Chapman and Hall
Bland M. An Introduction to Medical Statistics. Oxford Medical Publications
Campbell MJ & Machin D. Medical Statistics: A Commonsense Approach. Wiley
Other reading
Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18:393-401
www.mas.ncl.ac.uk/~njnsm/medfac/MDPhD/notes.htm
BMJ statistics notes
Types of Data
Numerical Data
– discrete: number of lesions, number of visits to GP
– continuous: height, lesion area
Types of Data
Categorical
– unordered: Pregnant/Not pregnant; married/single/divorced/separated/widowed
– ordered (ordinal): minimal/moderate/severe/unbearable; Stage of breast cancer: I II III IV
Exercise
What type are the following variables?
a) sex
b) diastolic blood pressure
c) diagnosis
d) height
e) family size
f) cancer stage
Types of Data
Outcome/Dependent variable
– outcome of interest
– e.g. survival, recovery
Explanatory/Independent variable
– treatment group
– age
– sex
[Figure: Histogram of Birthweight (grams) at 40 weeks GA]
Summary Statistics
Location
– Mean (average value)
– Median (middle value)
– Mode (most frequently occurring value)
Variability
– Variance/SD
– Range
– Centiles
Birthweights (g) at 40 weeks Gestation
mean = 3441g
median = 3428g
sd = 434g
min = 2050g
max = 4975g
range = 2925g
Boxplot
[Figure: boxplots of T4 cells/mm3 blood sample by GROUP (Hodgkin's and Non-Hodgkin's, N = 20 in each group)]
Symmetric Data
mean = median (approx)
standard deviation
Skew Data
median = "typical" value
mean affected by extreme values - larger than median (for positively skewed data)
SD fairly meaningless
centiles (less affected by extreme values/outliers)
Half of all doctors are below average….
Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average
Ref. BMJ 1998; 316:1734-1736
Discrete Data
Principal diagnosis of patients in Tooting Bec Hospital
Diagnosis Number of patients
Schizophrenia 474 (32%)
Affective Disorders 277 (19%)
Organic Brain Syndrome 405 (28%)
Subnormality 58 (4%)
Alcoholism 57 (4%)
Other/Not Known 196 (13%)
Total 1467
Bar Chart
[Figure: bar chart of Principal Diagnosis of Patients in Tooting Bec Hospital; axes: Diagnosis and Count (0 to 500)]
Summarising data - Summary
Choosing the appropriate summary statistics and graph depends upon the type of variable you have
Categorical (unordered/ordered)
Continuous (symmetric/skew)
The Normal Distribution
N(μ, σ²)
μ unknown population mean - estimate using sample mean
σ unknown population SD - estimate using sample SD
Birthweight is N(3441, 434²)
N(0,1) - Standard Normal Distribution
68% within ± 1 SD units
95% within ± 1.96 SD units
99% within ± 2.58 SD units
z = (x − μ)/σ
z - SD units
Birthweight (g) at 40 weeks
95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 - 4561 grams
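The birthweight ranges above are simply mean ± z × SD. A minimal Python sketch, assuming the rounded multipliers 1.96 and 2.58 used on the slide (the helper name normal_range is illustrative, not part of the course material):

```python
from statistics import NormalDist

def normal_range(mean, sd, z):
    """Return (lower, upper) for mean +/- z standard deviations."""
    return mean - z * sd, mean + z * sd

# Birthweight at 40 weeks taken as N(3441, 434^2), as on the slide.
range_95 = normal_range(3441, 434, 1.96)
range_99 = normal_range(3441, 434, 2.58)

# Check how much of the standard Normal N(0,1) lies within each multiplier.
std_normal = NormalDist(0, 1)
coverage_95 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)
coverage_99 = std_normal.cdf(2.58) - std_normal.cdf(-2.58)

print([round(v) for v in range_95], round(coverage_95, 3))
print([round(v) for v in range_99], round(coverage_99, 3))
```

The same function gives a reference range for any variable that is reasonably Normal, once its mean and SD have been estimated.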
Further Reading
http://www.mas.ncl.ac.uk/~njnsm/medfac/docs/intro.pdf
Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572
Altman DG, Bland JM (1995) The normal distribution. BMJ 310, 298
Samples and Populations
Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc
Samples are not interesting in their own right - only to infer information about the population from which they are drawn
Sampling Variation
Populations are unique - samples are not.
Samples and Populations
How much might these estimates vary from sample to sample?
Determine precision of estimates (how close/far away from the population value?)
(Artificial) example
Have 5000 measurements of diastolic blood pressure from airline pilots. This accounts for ALL airline pilots and is the population of airline pilots.
(Artificial example - if we had the whole population we wouldn’t need to sample!!)
Since we have the population, we know the true population characteristics. It is these we are trying to estimate from a sample.
[Figure: Population distribution of diastolic BP from Airline Pilots (in mmHg)]
True mean = 78.2
True SD = 9.4
Example
Write each measurement on a piece of paper and put into a hat.
Draw 5 pieces of paper and calculate the mean of the BP.
replace and repeat 49 more times
End up with 50 (different) estimates of mean BP
Sampling Distribution
Each estimate of the mean will be different.
Treat this as a random sample of means
Plot a histogram of the means.
This is an estimate of the sampling distribution of the mean.
Can get the sampling distribution of any parameter in a similar way.
Distribution of the mean
[Figure: histograms of the Population (μ = 78.2, σ = 9.4) and of the means of 50 samples with N=5, 50 samples with N=10 and 50 samples with N=100]
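The paper-in-a-hat experiment can be imitated in a few lines. This is only a sketch: the real pilot measurements are not available here, so the population is simulated as roughly N(78.2, 9.4²), and the seed and the function name sample_means are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated stand-in for the 5000 diastolic BP values (an assumption).
population = [random.gauss(78.2, 9.4) for _ in range(5000)]
pop_sd = statistics.pstdev(population)

def sample_means(n, repeats=50):
    """Draw `repeats` samples of size n and return the mean of each."""
    return [statistics.mean(random.sample(population, n)) for _ in range(repeats)]

spread = {}
for n in (5, 10, 100):
    means = sample_means(n)
    spread[n] = statistics.stdev(means)  # SD of the 50 sample means
    # Compare the observed spread of the means with population SD / sqrt(n).
    print(n, round(spread[n], 2), round(pop_sd / n ** 0.5, 2))
```

The spread of the sample means shrinks as the sample size grows, and it should sit close to the population SD divided by √N.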
Distribution of the Mean
BUT! Don’t need to take multiple samples
Standard error of the mean = √(Sample SD² / N) = Sample SD / √N
SE of the mean is the SD of the distribution of the sample mean
Distribution of Sample Mean
Distribution of sample mean is Normal regardless of distribution of sample (unless small or very skew sample)
SO
Can apply Normal theory to sample mean also
Distribution of Sample Mean
i.e. 95% of sample means lie within 1.96 SEs of (unknown) true mean
This is the basis for a 95% confidence interval (CI)
95% CI is an interval which on 95% of occasions includes the population mean
Example
57 measurements of FEV1 in male medical students
Example
X̄ = 4.06 litres, SD = 0.67 litres
95% of population lie within X̄ ± 1.96 SDs
i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres
Example
SE = √(0.67² / 57) = 0.09
Thus for FEV1 data, 95% chance that the interval 4.06 ± 1.96 × 0.09 contains the true population mean
i.e. between 3.89 and 4.23 litres
This is the 95% confidence interval for the mean
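The FEV1 arithmetic can be checked with a short Python sketch. It uses only the summary values quoted on the slides (N = 57, mean 4.06, SD 0.67); the variable names are illustrative.

```python
import math

n, mean, sd = 57, 4.06, 0.67   # FEV1 summary values from the slides (litres)

se = sd / math.sqrt(n)         # standard error of the mean

# 95% confidence interval for the population mean uses the SE.
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

# 95% range for individual students uses the SD, so it is much wider.
ref_low = mean - 1.96 * sd
ref_high = mean + 1.96 * sd

print(round(se, 2), round(ci_low, 2), round(ci_high, 2))
print(round(ref_low, 2), round(ref_high, 2))
```

The confidence interval is computed from the unrounded SE; using the rounded 0.09 instead shifts the limits by about 0.01 litres.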
Confidence Intervals
The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.
Standard Deviations & Standard Errors
The SE is the SD of the sampling distribution (of the mean, say)
SE = SD/√N
Use SE to describe the precision of estimates (for example Confidence intervals)
Use SD to describe the variability of samples, populations or distributions (for example reference ranges)
The t-distribution
When N is small, estimate of SD is particularly unreliable and the distribution of sample mean is not Normal
Distribution is more variable - longer tails
Shape of distribution depends upon sample size
This distribution is called the t-distribution
[Figure: t-distribution compared with N(0,1) for N=2, N=10 and N=30]
N=2: t(1), 95% within ± 12.7
N=10: t(9), 95% within ± 2.26
N=30: t(29), 95% within ± 2.04
t-distribution
As N becomes larger, t-distribution becomes more similar to Normal distribution
Degrees of Freedom (DF) = sample size - 1
DF is a measure of the amount of information contained in the data set
Implications
Confidence interval for the mean
» Sample size < 30: use t-distribution
» Sample size > 30: use either Normal or t distribution
Note: Stats packages (generally) will automatically use the correct distribution for confidence intervals
Example
Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
Mean = 3.33, SD = 1.03, DF = 6, t(5%) = 2.45
95% CI = 3.33 ± 2.45 × 1.03/√7, i.e. 2.38 to 4.28 hours
Normal 95% CI = 3.33 ± 1.96 × 1.03/√7, i.e. 2.57 to 4.09 hours - TOO NARROW!!
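The arthritis example can be reproduced from the seven raw values. A minimal sketch, assuming the tabulated critical value 2.447 for t with 6 DF (the slide rounds it to 2.45); the constant names are illustrative.

```python
import math
import statistics

hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]   # hours of relief, from the slide

n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)      # sample SD (divisor n - 1)
se = sd / math.sqrt(n)

T_CRIT_DF6 = 2.447   # two-sided 5% point of t(6), taken from tables
Z_CRIT = 1.96        # two-sided 5% point of N(0,1)

t_ci = (mean - T_CRIT_DF6 * se, mean + T_CRIT_DF6 * se)
z_ci = (mean - Z_CRIT * se, mean + Z_CRIT * se)

print(round(mean, 2), round(sd, 2), n - 1)
print([round(v, 2) for v in t_ci])
print([round(v, 2) for v in z_ci])
```

With only 7 observations the Normal-based interval is noticeably narrower than the t-based one, which is the point the slide makes.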
Hypothesis Testing
Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest
In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?
Null Hypothesis
Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure)
Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure)
Hypothesis of no effect/difference
Two-Sample t-test
Two independent samples
Can the two samples be considered to be the same with respect to the variable you are measuring or are they different?
Sample means will ALWAYS be different - real difference or random variation?
ASSUMPTION: Data are normally distributed and SD in each group similar
Two-Sample t-test
24 hour total energy expenditure (MJ/day) in groups of lean and obese women
Do the women differ in their energy expenditure?
Null hypothesis: energy expenditure in lean and obese women is the same
Boxplot of energy expenditure MJ/day
[Figure: boxplots of 24 hour total energy expenditure (MJ/day) by GROUP: lean (N = 13) and obese (N = 9)]
Two-sample t-test
Summary statistics
        lean    obese
Mean    8.1     10.3
SD      1.2     1.4
N       13      9
Difference in means = 10.3 - 8.1 = 2.2
SE difference = 0.57 (weighted average)
Two Sample t-test
Test statistic is 2.2/0.57 = 3.9
N1 + N2 - 2 DF (= 20)
Calculate the probability of observing a value at least as extreme as 3.9 if the null hypothesis is true
If the null hypothesis is true, the test statistic should have a t-distribution with 20 df (df = N1+N2-2)
Two Sample t-test
95% of values from t-distribution with 20 DF lie between -2.09 and +2.09
Probability of observing a value as extreme or more extreme than 3.9 in a t-distribution with 20 df is 0.001
So a value of 3.9 is very unlikely to have come from a t-distribution with 20 df
Conclude that energy expenditure is significantly different between lean and obese women
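The test statistic can be rebuilt from the summary table. This sketch assumes the "weighted average" is the usual pooled variance; the function name pooled_t is illustrative. Because it starts from the rounded means and SDs, it gives an SE of about 0.56 and t of about 3.95 rather than the slide's 0.57 and 3.9, which were presumably computed from the raw data.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming similar SDs in the two groups."""
    df = n1 + n2 - 2
    # Pooled variance: average of the two variances weighted by their DF.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean2 - mean1
    return diff, se_diff, diff / se_diff, df

# lean: mean 8.1, SD 1.2, N 13; obese: mean 10.3, SD 1.4, N 9
diff, se_diff, t_stat, df = pooled_t(8.1, 1.2, 13, 10.3, 1.4, 9)

T_CRIT_DF20 = 2.09   # two-sided 5% point of t(20), as quoted on the slide
significant = abs(t_stat) > T_CRIT_DF20

print(round(diff, 1), round(se_diff, 2), round(t_stat, 2), df, significant)
```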
The P-value
The P-value is the probability of observing a test statistic at least as extreme as that observed if the null hypothesis is true
[Figure: t distribution with 20 df; Probability plotted against x from -4 to 4]
Confidence Interval for the difference in two means
95% CI = 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57
or from 1.05 to 3.41 MJ/day
Thus we are 95% confident that obese women use between 1.05 and 3.41 MJ/day more energy than lean women
Confidence Interval or P-value?
Confidence interval!!!
P-value will tell you whether or not there is a statistically significant difference
confidence interval will give information about the size of the difference and the strength of the evidence
Paired t-test
Obvious pairing between observations
– two measurements on each subject (before-after study)
– case-control pairs
Assumption - paired differences are normally distributed
Example - Systolic blood pressure (SBP) measured in 16 middle aged men before and after a standard exercise. Post-exercise SBP - Pre-exercise SBP calculated for each man
Boxplot of differences
[Figure: boxplot of Post Exercise SBP - Pre-exercise SBP, N = 16]
Paired t-test
Mean difference = 6.6
SE(Mean) = 1.5
t = 6.6/1.5 = 4.4
Compare with t(15)
P < 0.001
Conclusion - mean systolic blood pressure is higher after exercise than before
Paired t-test
95% confidence interval for the mean difference
6.6 ± 2.13 × 1.5 = 3.4 to 9.8
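The paired analysis is a one-sample calculation on the differences. A minimal sketch using only the summary values quoted on the slides; the critical value 2.13 for t with 15 DF is the one used above, and the variable names are illustrative.

```python
n = 16            # number of men, so DF = n - 1 = 15
mean_diff = 6.6   # mean of (post-exercise SBP - pre-exercise SBP)
se_mean = 1.5     # standard error of the mean difference

df = n - 1
t_stat = mean_diff / se_mean

T_CRIT_DF15 = 2.13   # two-sided 5% point of t(15), as on the slide
ci = (mean_diff - T_CRIT_DF15 * se_mean, mean_diff + T_CRIT_DF15 * se_mean)

print(df, round(t_stat, 1), [round(v, 1) for v in ci])
```

The interval excludes zero, which agrees with the small P-value from the test.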
Categorical Variables
To investigate the relationship between two categorical variables form contingency table
Hypothesis tests
– Chi-squared test (χ² test)
– Fisher’s exact test (small samples)
– McNemar’s test (paired data)
Chi-squared test
Used to test for associations between categorical variables (2 or more distinct outcomes)
Example - a comparison between psychotherapy and usual care for major depression in primary care
Patient Reported Recovery at 8 months
                 Recovered    Not Recovered    Total
Psychotherapy    47 (51%)     46 (49%)         93
Usual Care       18 (20%)     73 (80%)         91
Total            65 (35%)     119 (65%)        184
P<0.001, Chi-square test
Patient Reported Recovery at 8 months
Difference between proportions recovered 30.8%
95% confidence interval for difference 17.7% to 43.8%
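Both results for the recovery table can be checked by hand. The sketch below assumes the usual large-sample (Normal approximation) interval for a difference in proportions and a chi-squared test without continuity correction; the slides do not say which variants were used. The P-value line relies on the fact that, for 1 DF, the chi-squared upper tail equals erfc(√(χ²/2)).

```python
import math

# Observed counts: rows = psychotherapy, usual care; columns = recovered, not.
observed = [[47, 46], [18, 73]]
n1, n2 = sum(observed[0]), sum(observed[1])

# Difference in proportions recovered, with a large-sample 95% CI.
p1, p2 = observed[0][0] / n1, observed[1][0] / n2
diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se, diff + 1.96 * se)

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
col_totals = [sum(col) for col in zip(*observed)]
grand_total = n1 + n2
chi2 = 0.0
for row in observed:
    for obs, col_total in zip(row, col_totals):
        expected = sum(row) * col_total / grand_total
        chi2 += (obs - expected) ** 2 / expected

p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail for 1 DF only

print(round(100 * diff, 1), [round(100 * v, 1) for v in ci])
print(round(chi2, 1), p_value < 0.001)
```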
Larger tables
Similar methods can be applied to larger tables to test the association between two categorical variables
Example - Is there an association between housing tenure and time of delivery of baby (preterm/term)?
Null hypothesis: There is no relationship between housing tenure and time of delivery
Relationship between housing tenure and time of delivery
Observed counts, with the counts expected under the null hypothesis in brackets
Housing Tenure Preterm Term Total
Owner-occupier 50 (61.7) 849 (837.3) 899
Council Tenant 29 (17.7) 229 (240.3) 258
Private Tenant 11 (12.0) 164 (163.0) 175
Lives with Parents 6 (4.9) 66 (67.1) 72
Other 3 (2.7) 36 (36.3) 39
Total 99 1344 1443
Relationship between housing tenure and time of delivery
Test Statistic
χ² = (50 − 61.7)²/61.7 + (849 − 837.3)²/837.3 + … + (3 − 2.7)²/2.7 + (36 − 36.3)²/36.3 = 10.5
DF = (5-1) × (2-1) = 4
P = 0.03
Thus we have strong evidence of a relationship between housing tenure and time of delivery
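The expected counts and the test statistic for the housing table can be recomputed from the observed counts alone. A minimal Python sketch; the P-value line uses the closed form for the chi-squared upper tail with exactly 4 DF, exp(−x/2)(1 + x/2), so it would need changing for a table of a different size.

```python
import math

# Observed (preterm, term) counts by housing tenure, from the table.
observed = {
    "Owner-occupier": (50, 849),
    "Council Tenant": (29, 229),
    "Private Tenant": (11, 164),
    "Lives with Parents": (6, 66),
    "Other": (3, 36),
}

rows = list(observed.values())
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(col_totals)

chi2 = 0.0
for row in rows:
    row_total = sum(row)
    for obs, col_total in zip(row, col_totals):
        # Expected count if tenure and time of delivery were unrelated.
        expected = row_total * col_total / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)   # valid for 4 DF only

print(col_totals, grand_total, df, round(chi2, 1), round(p_value, 2))
```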
Notes
Chi-squared test not valid if expected values are small (<5)
– Combine rows or columns to obtain a smaller table with larger expected values
– Use Fisher’s exact test for small tables
McNemar’s test
Appropriate for use with paired or matched (case-control) data with a dichotomous outcome
Example - McNemar’s test
Skaane compared the use of mammography and ultrasound in the assessment of 327 (228 palpable and 99 non-palpable) consecutive malignant tumours confirmed at histology.
Acta Radiologica vol 40; 486-490 (1999)
McNemar’s test - example
              Mammogram
              Yes    No    Tot.
US    Yes     267    11    278
      No      41     8     49
      Tot.    308    19    327
McNemar’s test - example
308/327 (94%) were picked up by mammography compared with 278/327 (85%) picked up by ultrasound
P<0.001
Conclusion: Mammography is significantly more sensitive in diagnosing tumours than ultrasound in a population of mixed malignant tumours
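McNemar's test uses only the discordant pairs: the 41 tumours detected by mammography alone and the 11 detected by ultrasound alone. The sketch below assumes the simple version of the statistic without continuity correction (the slides do not say which version was used), and gets the P-value from the chi-squared distribution with 1 DF via erfc.

```python
import math

# Discordant pairs from the 2x2 table of paired results.
only_mammogram = 41    # mammogram yes, ultrasound no
only_ultrasound = 11   # ultrasound yes, mammogram no

# McNemar statistic (no continuity correction), compared with chi-squared, 1 DF.
chi2 = (only_mammogram - only_ultrasound) ** 2 / (only_mammogram + only_ultrasound)
p_value = math.erfc(math.sqrt(chi2 / 2))

# Sensitivities quoted on the slide.
sens_mammogram = 308 / 327
sens_ultrasound = 278 / 327

print(round(chi2, 1), p_value < 0.001)
print(round(100 * sens_mammogram), round(100 * sens_ultrasound))
```

The 267 tumours detected by both methods and the 8 missed by both carry no information about which method is more sensitive, so they do not enter the statistic.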
Hypothesis testing - summary
Continuous Quantitative data
– Paired design: Paired (one-sample) t-test; Wilcoxon signed rank test
– Unpaired design: Unpaired (independent samples) t-test; Mann-Whitney U test
Ordered Categorical data
– Paired design: Wilcoxon signed rank test
– Unpaired design: Mann-Whitney U test
Unordered Categorical data
– Paired design: McNemar's test (2 categories only)
– Unpaired design: Chi-squared test; Fisher's exact test
Adapted from Chinn S. Statistics for the European Respiratory Journal.
Correlation and Regression
Relationship between two continuous variables
– regression
– correlation
Relationship between two continuous variables
3 main purposes for doing this
– to assess whether the two variables are associated (correlation)
– to enable the value of one variable to be predicted from any known value of the other variable (regression)
– to assess the amount of agreement between two variables (method comparison study)
Example
Women from a pre-defined geographical area were invited to have their haemoglobin (Hb) level and packed cell volume measured. They were also asked their age.
Haemoglobin and packed cell volume
[Figure: scatter plot of Haemoglobin level (g/dl) against Packed Cell Volume (%)]
Example - relationships between variables
Association between Hb and PCV?
Hb affects PCV or PCV affects Hb?
Use correlation to measure the strength of an association
Association between Hb and age?
age must affect Hb and not vice versa
Use regression to predict Hb from age
Correlation
Not interested in causation
i.e. does a high PCV cause a high Hb level?
Interested in association
i.e. is a high PCV associated with a high Hb level?
sample correlation coefficient
– summarises strength of relationship
– can be used to test the hypothesis that the population correlation coefficient is 0
Correlation Coefficient
dimensionless, from -1 to 1
measures the strength of a linear relationship
+ve - high value of one variable associated with high value of the other
-ve - high value of one variable associated with low value of the other
±1 = exact linear relationship
strictly called Pearson correlation coefficient
Example Data
[Figure: four scatter plots of Y against X, with r = 1, r = -0.4, r = 0.7 and r = 0]
When not to use the correlation coefficient
If the relationship is non-linear
with caution in the presence of outliers
when the variables are measured over more than one distinct group (e.g. disease groups)
when one of the variables is fixed in advance
Assessing agreement
Correlation - example data
[Figure: four scatter plots: y1 against x1, y2 against x2, y3 against x3 and y4 against x4]
Is there an alternative?
If the data are non-linear or there is an outlier
– use Spearman rank correlation coefficient
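The difference between the two coefficients is easy to see in code: Spearman's coefficient is Pearson's coefficient applied to the ranks. The sketch below is written from scratch for illustration, uses made-up data (a cubic curve, not the Hb data), and its ranks helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y rises steadily with x, but not in a straight line.
x = list(range(1, 10))
y_curved = [v ** 3 for v in x]

print(round(pearson(x, y_curved), 2), round(spearman(x, y_curved), 2))
```

Because the ranks ignore how far apart the values are, a single extreme point or a curved but steadily rising relationship affects Spearman's coefficient less than Pearson's.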
Haemoglobin and Packed Cell Volume
[Figure: scatter plot of Haemoglobin level (g/dl) against Packed Cell Volume (%), including one outlier]
Without outlier: Pearson = 0.67, Spearman = 0.63
With outlier: Pearson = 0.34, Spearman = 0.48
Regression
Assume a change in x will cause a change in y
predict y for a given value of x
usually not logical to believe y causes x
y is the dependent variable (vertical axis)
x is the independent variable (horizontal axis)
Example - Haemoglobin vs Age
[Figure: scatter plot of Haemoglobin level (g/dl) against Age (Years)]
Regression
Logical to assume that increasing age leads to increasing Hb
Not logical to assume Hb affects age!
Assume underlying true linear relationship
Make an estimate of what that true linear relationship is
Estimating a regression line
How do I identify the ‘best’ straight line?
least squares estimate
straight line determined by slope and intercept
y = a + bx
a and b are estimates of the true intercept and slope and are subject to sampling variation
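The least squares estimates of a and b have simple closed forms: b is the sum of cross-products of deviations divided by the sum of squared deviations of x, and a makes the line pass through the two means. A minimal sketch with made-up ages and Hb values (not the course data); the function name least_squares is illustrative.

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: line passes through the means
    return a, b

# Hypothetical data for illustration only.
ages = [20, 30, 40, 50, 60]
hb = [11.0, 12.2, 13.9, 14.8, 16.3]

a, b = least_squares(ages, hb)
print(round(a, 2), round(b, 3))
```

A different sample of women would give slightly different values of a and b, which is why the regression output reports a standard error and confidence interval for each.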
Regression line of haemoglobin on age
[Figure: scatter plot of Haemoglobin (g/dl) against Age (years) with the fitted regression line]
Regression of haemoglobin on age

Variable(s) Entered on Step Number 1..  AGE  Age (Years)

Multiple R           .87959
R Square             .77367
Adjusted R Square    .76110
Standard Error      1.17398

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1         84.80397      84.80397
Residual      18         24.80803       1.37822

F = 61.53133    Signif F = .0000
Regression of haemoglobin on age

------------------------- Variables in the Equation -------------------------
Variable           B       SE B   95% Confdnce Intrvl B         T   Sig T
AGE          .134251    .017115    .098295     .170208      7.844   .0000
(Constant)  8.239786    .794261   6.571104    9.908467     10.374   .0000
What does this tell us?
Mean Hb = 8.2 + 0.13 × AGE
95% CI for the slope goes from 0.098 to 0.170
P < 0.0001
Significant relationship between Hb and age
77% of the variability in Hb can be accounted for by age
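The summary figures can be cross-checked against the regression output with a short sketch. All inputs are copied from the output, except the t critical value for 18 degrees of freedom (about 2.101), which is taken from standard t tables and is an assumption of this example rather than something printed in the output.

```python
from math import sqrt

# Values copied from the regression output
ss_regression, ss_residual = 84.80397, 24.80803
df_regression, df_residual = 1, 18
slope, se_slope = 0.134251, 0.017115

# Two-sided 95% t critical value for 18 df (from standard tables; not in the output)
T_CRIT_18 = 2.1009

n = df_regression + df_residual + 1  # number of subjects implied by the df

# R squared: proportion of the total sum of squares explained by age
r_squared = ss_regression / (ss_regression + ss_residual)

# Adjusted R squared penalises for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

# F statistic: regression mean square divided by residual mean square
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

# Residual standard error: square root of the residual mean square
residual_se = sqrt(ss_residual / df_residual)

# t statistic and 95% confidence interval for the slope
t_stat = slope / se_slope
ci_slope = (slope - T_CRIT_18 * se_slope, slope + T_CRIT_18 * se_slope)

print(r_squared, adj_r_squared, f_stat, residual_se, t_stat, ci_slope)
```

With one predictor the F statistic is the square of the t statistic for the slope, so the two tests give the same P value.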
How can it be used?
Predict mean Hb for a given age
E.g. What is the mean Hb of a 50 year old?
Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl (about 15.0 g/dl using the unrounded coefficients 8.2398 and 0.13425)
95% CI for the estimate from 14.4 to 15.5 g/dl
How can it be used?
To calculate reference ranges for the population
E.g. What range would you expect 95% of 50 year olds to lie within? (reference range)
Between 12.4 and 17.5 g/dl
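A rough sketch of how intervals like these can be obtained from the regression output follows. It uses the standard formulas for the confidence interval of the mean and the prediction interval for an individual. Two things are assumptions, not taken from the slides: the t critical value (about 2.101 for 18 degrees of freedom), and the term (age - mean age)² / Sxx, which is set to zero because the mean age and Sxx are not given. That is reasonable only if 50 is close to the mean age of the sample. The function name `intervals_at` is hypothetical.

```python
from math import sqrt

# Unrounded estimates and residual standard error from the regression output
INTERCEPT, SLOPE = 8.239786, 0.134251
RESIDUAL_SE = 1.17398
N = 20  # residual df of 18 plus 2 estimated parameters

# Two-sided 95% t critical value for 18 df (from standard tables; an assumption)
T_CRIT_18 = 2.1009

def intervals_at(age, distance_term=0.0):
    """Fitted mean, 95% CI for the mean, and 95% prediction interval at `age`.

    distance_term stands for (age - mean age)**2 / Sxx. It defaults to 0
    because the slides do not report the mean age or Sxx (an assumption).
    """
    fitted = INTERCEPT + SLOPE * age
    half_mean = T_CRIT_18 * RESIDUAL_SE * sqrt(1 / N + distance_term)
    half_individual = T_CRIT_18 * RESIDUAL_SE * sqrt(1 + 1 / N + distance_term)
    ci_mean = (fitted - half_mean, fitted + half_mean)
    prediction = (fitted - half_individual, fitted + half_individual)
    return fitted, ci_mean, prediction

fitted, ci_mean, prediction = intervals_at(50)
print(f"Fitted mean Hb at 50: {fitted:.2f} g/dl")
print(f"95% CI for the mean: {ci_mean[0]:.1f} to {ci_mean[1]:.1f} g/dl")
print(f"95% prediction interval: {prediction[0]:.1f} to {prediction[1]:.1f} g/dl")
```

The prediction interval is much wider than the confidence interval because it includes the variation between individuals as well as the uncertainty in the estimated mean.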
95% Confidence Interval for the Mean & 95% prediction interval for individuals
[Figure: haemoglobin (g/dl) against age (years), showing the 95% confidence interval for the mean and the 95% prediction interval for individuals]
Definitions
Predicted value
– the value predicted by the regression line
– an estimate of the mean value
Residual
– observed value minus predicted value
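These two definitions can be shown in a few lines. The sketch uses the rounded fitted line from the earlier slide (Mean Hb = 8.2 + 0.13 × age). The three observations are hypothetical, and `predicted_hb` is a name chosen for this example.

```python
def predicted_hb(age):
    """Predicted (mean) Hb in g/dl from the rounded fitted line."""
    return 8.2 + 0.13 * age

# Hypothetical (age in years, observed Hb in g/dl) pairs
observations = [(30, 12.6), (50, 14.1), (65, 17.0)]

# Residual = observed value minus predicted value
residuals = [hb - predicted_hb(age) for age, hb in observations]

for (age, hb), res in zip(observations, residuals):
    print(f"age {age}: observed {hb}, predicted {predicted_hb(age):.2f}, residual {res:+.2f}")
```

A positive residual means the observation lies above the regression line, a negative one below it.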
What assumptions have I made?
The relationship is approximately linear
The residuals have a normal distribution
Multiple Regression
One outcome variable with multiple predictor variables
Residuals assumed to be normally distributed
Predictor variables can be continuous or categorical
No assumptions made about distribution of continuous predictor variables
Multiple Regression
Example. Does the value of packed cell volume improve the prediction of Hb?
Model fitted
Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed cell volume (%)
R² = 83%
Knowledge of packed cell volume improves the prediction of haemoglobin
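A minimal sketch of fitting a model with two predictors by least squares is given below. It solves the normal equations with Gauss-Jordan elimination. The data are hypothetical and noise-free: they are generated exactly from the fitted model quoted above so that the fit can be checked, and they are not the course data. The function name `fit_two_predictors` is chosen only for this example.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Augmented matrix [X'X | X'y], 3 rows by 4 columns
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(3):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][3] for i in range(3)]

# Hypothetical ages (years) and packed cell volumes (%)
ages = [20, 30, 40, 50, 60, 70]
pcvs = [38, 45, 36, 48, 41, 50]

# Noise-free Hb values generated from the model on the slide (illustration only)
hbs = [5.2 + 0.1 * age + 0.1 * pcv for age, pcv in zip(ages, pcvs)]

b0, b1, b2 = fit_two_predictors(ages, pcvs, hbs)
print(f"Mean Hb = {b0:.2f} + {b1:.2f} x age + {b2:.2f} x PCV")

# Predicted mean Hb for a hypothetical 50 year old with a PCV of 45%
print(b0 + b1 * 50 + b2 * 45)
```

With real data the observations scatter around the fitted plane, and R² measures how much of that scatter the two predictors account for.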
Summary
Regression can be used to estimate the numerical relationship between an outcome variable and one or more predictor variables
Correlation coefficient alone is of limited use