Design and Analysis of Clinical Study 9. Analysis of Cross-sectional Study


Dr. Tuan V. Nguyen

Garvan Institute of Medical Research

Sydney, Australia

Overview

• Estimate of prevalence

• Analysis of difference between two proportions

• Analysis of difference among proportions: Chi-square

• Analysis of difference between two means

• Analysis of association I: simple linear regression analysis

• Analysis of association II: multiple regression analysis

Prevalence of Disease

• Prevalence is NOT incidence

• Measures the no. of people in a population who have the disease at a given point in time.

– this measure is called point prevalence, in contrast to the infrequently used period prevalence, which adds the cases existing at the start of a time period to the new cases that occur during that period

• A measure of disease status, disease burden

– in contrast to incidence which measures disease onset events

[Figure: disease timelines of 5 subjects plotted against time, with a vertical line at time T.]

Prevalence: At time T, 2 out of 5 subjects had the disease; P = 2/5 = 0.4

Sampling Variability in Prevalence

• Prevalence in the population ($\pi$) is UNKNOWN

• Sample prevalence (p) is an unbiased estimate of $\pi$.

x = number of diseased individuals in the sample

p = prevalence

N = sample size

• Estimates:

p = x/N

variance of p: $s^2 = \frac{p(1-p)}{N}$

standard error of p: $s = \sqrt{\frac{p(1-p)}{N}}$

95% CI of $\pi$: $p \pm 1.96\,s$

An Example of Calculation of Prevalence

• The prevalence of ABO hemolytic disease in a population is 43 out of 3584 subjects.

• So, the estimated prevalence:

p = 43/3584 = 0.0120

• Standard error of the prevalence:

$s = \sqrt{\frac{0.0120 \times (1 - 0.0120)}{3584}} = 0.0018$

• 95% confidence interval:

0.0120 ± (1.96 × 0.0018) = 0.0084 to 0.0156.
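The same calculation can be sketched in R. The variable names are illustrative, and the interval is the normal-approximation (Wald) interval defined on the previous slide.

```r
# Prevalence, standard error and Wald 95% CI for the ABO hemolytic disease example
x <- 43      # number of diseased individuals
N <- 3584    # sample size

p  <- x / N                      # sample prevalence
se <- sqrt(p * (1 - p) / N)      # standard error of p
ci <- p + c(-1, 1) * 1.96 * se   # lower and upper 95% limits

round(c(prevalence = p, se = se, lower = ci[1], upper = ci[2]), 4)
```

For a prevalence this close to zero, an exact interval (for example from `binom.test(x, N)`) may be preferred over the normal approximation.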

Test for Difference Between Two Proportions

p1 = proportion for group 1

p2 = proportion for group 2

N1 = sample size for group 1

N2 = sample size for group 2

d = p1 – p2

variance of d: $s^2 = \frac{p_1(1-p_1)}{N_1} + \frac{p_2(1-p_2)}{N_2}$

z-test: $z = \frac{d}{s}$

                 Vietnam     Australia
N                700         1287
Osteoporosis     148         345
Prevalence       0.211       0.268
Variance (s2)    0.000238    0.000152

d = 0.268 – 0.211 = 0.057

variance of d:

s2 = 0.000238 + 0.000152 = 0.000390

z-test:

z = 0.057 / sqrt(0.000390) ≈ 2.9

Significant! (|z| > 1.96, so p < 0.05)
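A minimal R sketch of this z-test, using the counts from the table and the unpooled variance defined on the previous slide. The variable names are illustrative.

```r
# Two-proportion z-test: osteoporosis prevalence in Australia vs Vietnam
x <- c(Australia = 345, Vietnam = 148)    # number with osteoporosis
n <- c(Australia = 1287, Vietnam = 700)   # sample sizes

p  <- x / n                                     # sample proportions
d  <- unname(p["Australia"] - p["Vietnam"])     # difference in proportions
s2 <- sum(p * (1 - p) / n)                      # variance of d (unpooled)
z  <- d / sqrt(s2)                              # z statistic
p_value <- 2 * pnorm(-abs(z))                   # two-sided p-value

round(c(d = d, s2 = s2, z = z, p_value = p_value), 5)
```

Working from the unrounded proportions gives a z statistic slightly below the hand calculation, which uses values rounded to three decimals.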

Test for Difference Among Proportions

Observed frequencies

                     Caffeine consumption
Marital status   None    1-150   151-300   300-900   Total
Married           652     1537       598       242    3029
Divorced           36       46        38        21     141
Single            218      327       106        67     718
Total             906     1910       742       330    3888

Row proportions (cell count / row total)

Marital status   None    1-150   151-300   300-900   Total
Married          0.22     0.51      0.20      0.08    1.00
Divorced         0.26     0.33      0.27      0.15    1.00
Single           0.30     0.46      0.15      0.09    1.00
Total            0.23     0.49      0.19      0.08    1.00

Married: 652/3029=0.22 1537/3029=0.51 598/3029=0.20 242/3029=0.08

Divorced: 36/141=0.26 46/141=0.33 38/141=0.27 21/141=0.15

Single: 218/718=0.30 327/718=0.46 106/718=0.15 67/718=0.09

Total: 906/3888=0.23 1910/3888=0.49 742/3888=0.19 330/3888=0.08

Expected frequencies (row total × column total / grand total)

Marital status   None    1-150   151-300   300-900   Total
Married         705.8     1488     578.1     257.1    3029
Divorced         32.9     69.3      26.9      12.0     141
Single          167.3    352.7     137.0      60.9     718
Total             906     1910       742       330    3888

Married: 3029/3888*906=705.8 3029/3888*1910=1488 3029/3888*742=578.1 3029/3888*330=257.1

Divorced: 141/3888*906=32.9 141/3888*1910=69.3 141/3888*742=26.9 141/3888*330=12.0

Single: 718/3888*906=167.3 718/3888*1910=352.7 718/3888*742=137.0 718/3888*330=60.9

Observed (O) with expected (E) in parentheses

Marital status   None      1-150     151-300   300-900
Married           652       1537       598       242     O
                (705.8)    (1488)    (578.1)   (257.1)   E
Divorced           36         46        38        21     O
                 (32.9)    (69.3)     (26.9)    (12.0)   E
Single            218        327       106        67     O
                (167.3)   (352.7)    (137.0)    (60.9)   E

(O - E)^2 / E

Marital status   None    1-150   151-300   300-900   Total
Married          4.11     1.61      0.69      0.89    7.30
Divorced         0.30     7.82      4.57      6.82   19.51
Single          15.36     1.88      7.02      0.60   24.86
Total           19.77    11.31     12.28      8.31   51.66

(652 - 705.8)^2 / 705.8 = 4.11, (1537 - 1488)^2 / 1488 = 1.61, ...

Chi-square = sum of (O - E)^2 / E over all cells = 51.66

df = (rows - 1) × (columns - 1) = (3 - 1) × (4 - 1) = 6

Critical chi-square value for df = 6 and $\alpha$ = 0.05: 12.59

Since 51.66 > 12.59, caffeine consumption differs significantly by marital status.
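The hand calculation can be checked in R with `chisq.test()`. This is a sketch: the matrix layout and the dimension labels are illustrative choices.

```r
# Chi-square test of independence: marital status by caffeine consumption
caffeine <- matrix(
  c(652, 1537, 598, 242,
     36,   46,  38,  21,
    218,  327, 106,  67),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    marital  = c("Married", "Divorced", "Single"),
    caffeine = c("None", "1-150", "151-300", "300-900")
  )
)

test <- chisq.test(caffeine)

round(test$expected, 1)               # expected frequencies under independence
test$statistic                        # chi-square statistic
test$parameter                        # degrees of freedom
critical <- qchisq(0.95, df = 6)      # critical value for alpha = 0.05
critical
```

`qchisq(0.95, df = 6)` is the upper-tail critical value; `qchisq(0.05, df = 6)` would instead return the lower-tail quantile.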

Normal Distribution

$P(X = x) = f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$

[Figure: Probability distribution of height in Vietnamese women; x-axis: Height (130 to 200 cm), y-axis: f(height).]

Distribution of height in Vietnamese women, with a mean of 156 cm and a standard deviation of 4.6 cm. The horizontal axis is height and the vertical axis is the probability for each height.

Application of the Normal Distribution

• The serum cholesterol levels of Californian children have a mean of 175 mg/100ml and a standard deviation of 30 mg/100ml. The distribution of the cholesterol levels is normal.

• 95% of the children should have cholesterol levels between 175 ± (1.96 × 30), that is, between 116 and 234 mg/100ml.

• If we let X be the chol. level for any child, then X can be converted to a variable with mean=0 and SD=1:

Z = (X – 175) / 30

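[Figure: normal curve with two horizontal scales. Z = -1.96, 0 and 1.96 correspond to 116, 175 and 234 mg/100ml; the two tails beyond these limits are labelled "Abnormal?".]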
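A short R sketch of these calculations. The child with a level of 240 mg/100ml is a hypothetical value chosen only to illustrate the Z conversion.

```r
# Cholesterol in Californian children: normal with mean 175 and SD 30 (mg/100ml)
mu    <- 175
sigma <- 30

# Central 95% range of cholesterol levels
limits <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
round(limits, 1)

# Standardise a hypothetical child's level (240 mg/100ml is an assumed value)
x <- 240
z <- (x - mu) / sigma

# Proportion of children expected to have a level above x
upper_tail <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
c(z = z, upper_tail = upper_tail)
```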

Two-group comparison: unpaired t-test

Mean difference:

D = x1 – x2

Variance of D:

T-statistic:

95% Confidence interval:

               Group 1    Group 2
               x11        x21
               x12        x22
               x13        x23
               x14        x24
               x15        x25
               …          …
               x1n        x2n
Sample size    n1         n2
Mean           x1         x2
SD             s1         s2

Two-group comparison: an example

         A        B
         100      122
         108      130
         119      138
         127      142
         132      152
         135      154
         136      176
         164
N        8        7
Mean     127.6    144.9
SD       19.6     17.8

Mean difference:

d = 127.6 – 144.9 = -17.3

Variance of D:

T-statistic:

95% Confidence interval:
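The quantities listed above can be computed in R for groups A and B. This sketch assumes the unpooled (Welch) form of the variance, s1²/n1 + s2²/n2; that choice is an assumption, not something the slide states, and `var.equal = TRUE` gives the pooled alternative.

```r
# Unpaired two-group comparison for the example data
a <- c(100, 108, 119, 127, 132, 135, 136, 164)   # group A
b <- c(122, 130, 138, 142, 152, 154, 176)        # group B

d      <- mean(a) - mean(b)                          # mean difference
s2     <- var(a) / length(a) + var(b) / length(b)    # variance of d (unpooled; assumed form)
t_stat <- d / sqrt(s2)                               # t statistic

welch  <- t.test(a, b)                    # Welch test (R's default)
pooled <- t.test(a, b, var.equal = TRUE)  # pooled-variance alternative

c(d = d, s2 = s2, t = t_stat)
welch$conf.int    # 95% confidence interval for the mean difference
```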

Analysis of Correlation

ID    Age    Chol (mg/100ml)
1     46     3.5
2     20     1.9
3     52     4.0
4     30     2.6
5     57     4.5
6     25     3.0
7     28     2.9
8     36     3.8
9     22     2.1
10    43     3.8
11    57     4.1
12    33     3.0
13    22     2.5
14    63     4.6
15    40     3.2
16    48     4.2
17    28     2.3
18    49     4.0

[Figure: scatter plot of Chol (mg/100ml) against Age.]

Variance, Covariance and Correlation: Theory

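• Let x and y be two random variables from a sample of n observations.

• Measure of variability of x and y: variance

$\mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

$\mathrm{var}(y) = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$

• Measure of covariation between x and y: covariance

$\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

• Coefficient of correlation (r)

$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$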

Positive and Negative Correlation

[Figure: two scatter plots of y against x, one showing a positive correlation (r = 0.9) and one showing a negative correlation (r = -0.9).]

Test of Hypothesis of Correlation

• Hypothesis: H0: r = 0 versus H1: r not equal to 0.

• Step 1: Fisher's z-transformation

$z = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right)$

• Step 2: calculate standard error of z

$SE(z) = \frac{1}{\sqrt{n-3}}$

• Step 3: calculate t-statistic

$t = \frac{z}{SE(z)}$

An Example of Correlation Analysis

ID      Age (x)    Cholesterol (y; mg/100ml)
1       46         3.5
2       20         1.9
3       52         4.0
4       30         2.6
5       57         4.5
6       25         3.0
7       28         2.9
8       36         3.8
9       22         2.1
10      43         3.8
11      57         4.1
12      33         3.0
13      22         2.5
14      63         4.6
15      40         3.2
16      48         4.2
17      28         2.3
18      49         4.0
Mean    38.83      3.33
SD      13.60      0.84

Cov(x, y) = 10.68

$r = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y} = \frac{10.68}{13.60 \times 0.84} = 0.94$

$z = \frac{1}{2} \ln\left(\frac{1 + 0.94}{1 - 0.94}\right) = 1.74$

$SE(z) = \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{15}} = 0.26$

t-statistic = 1.74 / 0.26 = 6.7

Critical t-value with 17 df and alpha = 5% is 2.11

Conclusion: Since 6.7 > 2.11, there is a significant association between age and cholesterol.
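The three steps can be reproduced in R from the raw data. This is a sketch: `atanh()` is used because it is mathematically identical to Fisher's z-transformation, and `cor.test()` is shown only as a built-in cross-check (it uses a t-test with n - 2 df rather than the Fisher transformation).

```r
# Correlation between age and cholesterol, tested with Fisher's z-transformation
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

n    <- length(age)
r    <- cov(age, chol) / (sd(age) * sd(chol))   # same value as cor(age, chol)
z    <- atanh(r)                                # 0.5 * log((1 + r) / (1 - r))
se_z <- 1 / sqrt(n - 3)                         # standard error of z
stat <- z / se_z                                # test statistic

round(c(r = r, z = z, se = se_z, statistic = stat), 3)

cor.test(age, chol)   # built-in test of H0: correlation = 0
```

Because r is not rounded to 0.94 before transformation, z and the test statistic come out slightly smaller than in the hand calculation.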

Simple Linear Regression Analysis

• Only two variables are of interest: one response variable and one predictor variable

• No adjustment is needed for confounding or covariates

• Assessment:

– Quantify the relationship between two variables

• Prediction:

– Make predictions and validate a test

• Control:

– Adjust for confounding effects (in the case of multiple variables)

Linear Regression: Model

• Y : random variable representing a response

• X : random variable representing a predictor variable (predictor, risk factor)

– Both Y and X can be a categorical variable (e.g., yes / no) or a continuous variable (e.g., age).

– If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.

• Model

$Y = \alpha + \beta X + \varepsilon$

$\alpha$ : intercept

$\beta$ : slope / gradient

$\varepsilon$ : random error (variation between subjects in y even if x is constant, e.g., variation in cholesterol for patients of the same age.)

Linear Regression: Assumptions

• The relationship is linear in terms of the parameters;

• X is measured without error;

• The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);

• The random error term ($\varepsilon$) is normally distributed with mean 0 and constant variance.

• If the assumptions are tenable, then:

– The expected value of Y is: $E(Y \mid x) = \alpha + \beta x$

– The variance of Y is: $\mathrm{var}(Y) = \mathrm{var}(\varepsilon) = \sigma^2$

Estimation of Model Parameters

• Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points

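[Figure: straight line through the points A(x1, y1) and B(x2, y2) on x-y axes, with intercept a on the y-axis and the rise dy and run dx between the two points.]

Gradient: $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$

Equation: y = mx + a

What happens if we have more than 2 points?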

Method of Least Squares

• For a series of pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

• Let a and b be sample estimates for parameters $\alpha$ and $\beta$,

• We have a sample equation: Y* = a + bx

• Aim: finding the values of a and b so that the sum of squared differences between Y and Y* is minimal.

• Let SSE = $\sum_{i=1}^{n} (Y_i - a - b x_i)^2$.

• Values of a and b that minimise SSE are called least squares estimates.

Criteria of Estimation

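[Figure: scatter plot of Chol against Age with the fitted line $\hat{y}_i = a + b x_i$; for each observed value $y_i$, the vertical distance to the line is $d_i = y_i - \hat{y}_i$.]

The goal of the least squares estimator (LSE) is to find a and b such that the sum of $d_i^2$ is minimal.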

Least squares Estimates

• After some calculus operations, the results can be shown to be:

$b = \frac{S_{xy}}{S_{xx}} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$

$a = \bar{y} - b\bar{x}$

Where:

$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$

$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

• When the regression assumptions are valid, the estimators a and b have the following properties:

– Unbiased

– Uniformly minimal variance (i.e., efficient)
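As a check on these formulas, the estimates can be computed by hand in R for the age and cholesterol data and compared with `lm()`. This is a sketch; the object names are illustrative.

```r
# Least squares estimates from the closed-form formulas, compared with lm()
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

Sxx <- sum((age - mean(age))^2)                        # sum of squares of x
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))    # sum of cross-products

b <- Sxy / Sxx                      # slope; equals cov(age, chol) / var(age)
a <- mean(chol) - b * mean(age)     # intercept

fit <- lm(chol ~ age)               # the same estimates from R's fitting routine
rbind(by_hand = c(a, b), lm = coef(fit))
```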

Goodness-of-fit

• Now, we have the fitted equation Y = a + bX

• Question: how well does the regression equation describe the actual data?

• Answer: coefficient of determination (R2): the proportion of the variation in Y that is explained by the variation in X.

Partitioning of variations: geometry

[Figure: scatter plot of Chol (Y) against Age (X) with the regression line and a horizontal line at the mean of Y; for one point, the distances SST, SSR and SSE are marked.]

• SST = sum of squared differences between yi and the mean of y.

• SSR = sum of squared differences between the predicted value of y and the mean of y.

• SSE = sum of squared differences between the observed and predicted values of y.

SST = SSR + SSE

• The coefficient of determination is: R2 = SSR / SST
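The partition can be verified numerically in R for the age and cholesterol data. This is a sketch; `sst`, `ssr` and `sse` are illustrative names.

```r
# Partition of the total variation in cholesterol: SST = SSR + SSE
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)

fit <- lm(chol ~ age)

sst <- sum((chol - mean(chol))^2)          # total sum of squares
ssr <- sum((fitted(fit) - mean(chol))^2)   # explained by the regression
sse <- sum(residuals(fit)^2)               # residual (unexplained)

r2 <- ssr / sst                            # coefficient of determination
c(SST = sst, SSR = ssr, SSE = sse, R2 = r2)
```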

Linear Regression Analysis by R

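# Age (years) and cholesterol for the 18 subjects
age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)

# Combine into a data frame; passing data = lipid to lm() makes attach() unnecessary
lipid <- data.frame(age, chol)

# Fit the simple linear regression of cholesterol on age and show the results
results <- lm(chol ~ age, data = lipid)
summary(results)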

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

Interpretation of Model Estimates

Cholesterol = 1.089 + 0.0578(Age)

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***

• Interpretation: Cholesterol increases by 0.0578 mg/100ml for each one-year increase in age. The association between age and cholesterol is statistically significant (p = 1.06e-08).

R-squared = 0.8775 (adjusted R-squared = 0.8698)

• Interpretation: Variation in age “explained” about 88% of the variation in cholesterol.

Prediction

plot(chol ~ age)

abline(results)

[Figure: scatter plot of chol against age with the fitted regression line.]

Regression line: Chol = 1.089 + 0.0578(Age)
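Beyond drawing the line, the fitted model can be used to predict cholesterol at chosen ages. The sketch below uses `predict()` with a 95% prediction interval; the ages 30, 45 and 60 are hypothetical values picked for illustration.

```r
# Predicted cholesterol, with 95% prediction intervals, at selected ages
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6,
          3.2, 4.2, 2.3, 4.0)
lipid <- data.frame(age, chol)

fit <- lm(chol ~ age, data = lipid)

new_ages <- data.frame(age = c(30, 45, 60))   # hypothetical ages (assumed values)
pred <- predict(fit, newdata = new_ages, interval = "prediction", level = 0.95)

cbind(new_ages, round(pred, 2))   # columns: age, fit, lwr, upr
```

All three ages lie within the observed range (20 to 63 years); predictions outside that range would be extrapolation.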

Checking Assumptions

par(mfrow=c(2,2))

plot(results)

[Figure: the four diagnostic plots produced by plot(results): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 6, 8 and 17 are labelled in the first three panels; observations 2, 6 and 8 in the last.]

The Importance of Assumption: BMI and Sexual Attractiveness

# Body mass index and sexual attractiveness score for 44 subjects
bmi <- c(11.0, 12.0, 12.5, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.8, 15.0, 15.0, 15.5, 16.0, 16.5, 17.0, 17.0, 18.0, 18.0, 19.0, 19.0, 20.0, 20.0, 20.0, 20.5, 22.0, 23.0, 23.0, 24.0, 24.5, 25.0, 25.0, 26.0, 26.0, 26.5, 28.0, 29.0, 31.0, 32.0, 33.0, 34.0, 35.5, 36.0, 36.0)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
beauty <- data.frame(bmi, sa)

# Fit a straight line of sa on bmi (data = beauty makes attach() unnecessary)
results <- lm(sa ~ bmi, data = beauty)
summary(results)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.92512 0.64489 7.637 1.81e-09 ***

bmi -0.05967 0.02862 -2.084 0.0432 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom

Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218

F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

Incorrect Functional Form

[Figure: scatter plot of sa against bmi, followed by the four diagnostic plots for the linear fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 10, 20 and 21 are labelled in the first three panels.]

Cubic Regression

results <- lm(sa ~ poly(bmi, 3))
summary(results)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6500     0.1193  30.587  < 2e-16 ***
poly(bmi, 3)1  -2.8228     0.7915  -3.566 0.000957 ***
poly(bmi, 3)2  -5.9749     0.7915  -7.548 3.27e-09 ***
poly(bmi, 3)3   4.0324     0.7915   5.094 8.76e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7915 on 40 degrees of freedom
Multiple R-Squared: 0.7051,     Adjusted R-squared: 0.683
F-statistic: 31.88 on 3 and 40 DF,  p-value: 1.077e-10

SA = 3.65 – 2.82 P1(BMI) – 5.97 P2(BMI) + 4.03 P3(BMI), where P1, P2 and P3 are the orthogonal polynomial terms of degree 1, 2 and 3 generated by poly(bmi, 3), not the raw powers BMI, BMI2 and BMI3.

Sexual Attractiveness and BMI: Cubic Function

bmi.new <- 10:40
sa.pred <- predict(results, data.frame(bmi = bmi.new))
plot(sa ~ bmi)
lines(bmi.new, sa.pred, col = "blue", lwd = 3)

[Figure: scatter plot of sa against bmi with the fitted cubic curve.]
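By default `poly()` builds orthogonal polynomial terms, so the coefficients printed by `summary()` are not coefficients of BMI, BMI squared and BMI cubed. As a sketch, the same curve can be refitted with `raw = TRUE` to obtain coefficients on the raw powers, and the linear and cubic fits can be compared with `anova()`. The object names `fit_linear`, `fit_orth` and `fit_raw` are illustrative.

```r
# Cubic regression of sa on bmi: orthogonal vs raw polynomial terms
bmi <- c(11.0, 12.0, 12.5, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.8, 15.0,
         15.0, 15.5, 16.0, 16.5, 17.0, 17.0, 18.0, 18.0, 19.0, 19.0, 20.0,
         20.0, 20.0, 20.5, 22.0, 23.0, 23.0, 24.0, 24.5, 25.0, 25.0, 26.0,
         26.0, 26.5, 28.0, 29.0, 31.0, 32.0, 33.0, 34.0, 35.5, 36.0, 36.0)
sa  <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5,
         5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2,
         4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0,
         2.1, 2.1, 2.0, 1.8, 1.7)

fit_linear <- lm(sa ~ bmi)                        # straight-line model
fit_orth   <- lm(sa ~ poly(bmi, 3))               # orthogonal cubic (as on the slide)
fit_raw    <- lm(sa ~ poly(bmi, 3, raw = TRUE))   # coefficients on bmi, bmi^2, bmi^3

coef(fit_raw)                    # coefficients of the raw powers of bmi

# Both parameterisations describe the same fitted curve
same_fit <- all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)),
                      tolerance = 1e-6)
same_fit

# F-test: does the cubic model fit better than the straight line?
anova(fit_linear, fit_orth)
```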
