Biostatistics in practice Session 3 Youngju Pak, Ph.D. UCLA Clinical and Translational Science...

Biostatistics in Practice
Session 3

Youngju Pak, Ph.D.

UCLA Clinical and Translational Science Institute

LA BioMed/Harbor-UCLA Medical Center

School of Medicine
http://research.LABioMed.org/Biostat

E-mail: [email protected]

04/20/23

Table of Contents

• Analogy of hypothesis testing
• How to compute a P-value and interpret it
• Understanding the sampling distribution and confidence intervals (CI)
• How to interpret a CI
• The relationship between a P-value and a CI

The procedure of statistical inference

[Diagram: Population (population parameter) → sampling mechanism (random sample or convenience sample) → Sample (sample estimate of the population parameter; descriptive statistics) → C.I.s or P-values for the population parameter]

Analogy for hypothesis testing
Example: a bet between two friends

Suppose you and a friend are playing a “fun” gambling game. Your friend has a coin, which you flip:

• if “tails”, your friend pays you $1;
• if “heads”, you pay your friend $1.

After 10 plays, you have gotten 9 heads. Do you trust your friend? Is this a fair coin? What is your argument?

Statistical Hypothesis Testing

H0: Fair coin (null) vs. Ha: Unfair coin (alternative)

• Assume the coin is “fair” (assume H0 is true).
• You and your friend have to put a threshold value on the definition of “being RARE”. That is, if Prob(# of H = 9 or more | 10 trials) is less than a certain value, say α, then we consider 9 or more heads out of 10 trials to happen RARELY when the coin is fair, and thus to be very unlikely if the coin were fair.
• Then the rule is: if Prob(# of H = 9 or more | 10 trials) < 0.05 (α) ← Type I error rate (= level of significance), then your friend would agree to conclude that it was not a fair coin, and thus to reject H0 in favor of Ha.

Statistical Hypothesis Testing (continued)

• Collect data and provide the “evidence”, assuming H0: Fair coin is true.
  P(# of H = 9 or more | 10 trials) = 11/1024 ≈ 0.011 (1.1%)
• Make a decision:
  P(# of H = 9 or more | 10 trials) ≈ 1.1% < 5%

Thus, this outcome is VERY unlikely if the coin is fair. We found significant evidence to reject H0 in favor of Ha.

Therefore, conclude that it was an UNFAIR coin (thus, the bet is invalid).
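The tail probability in the coin example can be checked directly from the binomial distribution. The following is a minimal sketch; the function name `binomial_upper_tail` is made up for illustration and is not part of the original slides.

```python
from math import comb


def binomial_upper_tail(k_min: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Probability of 9 or more heads in 10 flips if the coin is fair (H0 true)
p_value = binomial_upper_tail(9, 10)
alpha = 0.05  # threshold for "RARE", agreed on before looking at the data

print(f"P(9 or more heads | 10 flips, fair coin) = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Only two outcomes count as "9 or more heads" (exactly 9 or exactly 10), so the probability is (10 + 1)/1024, which is about 1.1%.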

How to interpret a P-value = 1.1%, in general?

• A P-value is predicated on the assumption that H0 is true: it is the probability of a result at least as extreme as the one observed, if H0 holds.
• A P-value is NOT the probability that the alternative is correct.
• A P-value should be used as evidence to DISPROVE H0, not to prove Ha.

How to interpret P-Values: Example
Acute secondary Adrenal Insufficiency (AI) after Traumatic Brain Injury (TBI): a prospective study

Objective: To determine the prevalence, clinical characteristics, and effect of AI on TBI patients

Procedure: 80 TBI and 41 non-TBI patients were followed during the hospitalization up to 9 days, blood samples taken every 8 hours and vital signs recorded every hour.

Subject is AI if 2 successive serum cortisols are low.

Goal: Do Groups Differ By More than is Expected By Chance?

First, need to:

• Specify experimental units (Persons? Blood draws?).

• Specify a single outcome for each unit (e.g., Yes/No (binary) or continuous?).

• Examine raw data, e.g., with a histogram, to check that test assumptions are met.

• Specify the group summary measure to be used (e.g., %, mean, or median over units): descriptive statistics.

• Choose a particular statistical test for the outcome and make inferences with inferential statistics (CI, P-value).

Outcome Type → Statistical Test

Cohan (2005) Crit Care Med;33:2358-66.

Group summary    Statistical test
Medians          Wilcoxon test
%s               Chi-square test
Means            t-test

t-Test for Minimal Mean Arterial Pressure (MAP): Step 1

1. Calculate a standardized quantity for the particular test, a “test statistic”.

Group     N    Mean         Std Dev      SE(Mean)
AI        42   56.1666667   10.7824634   1.66 = 10.78/√42
Non-AI    38   63.4122807   8.7141575    1.41 = 8.71/√38

Diff in Group Means = 63.4 - 56.2 = 7.2 (“Signal”)

$\mathrm{SE}(\mathrm{Diff}) \approx \sqrt{\mathrm{SEM}_1^2 + \mathrm{SEM}_2^2} = \sqrt{1.66^2 + 1.41^2} \approx 2.2$ (“Noise” due to random sampling)

→ Test Statistic = t = (7.2 - 0)/2.2 ≈ 3.28

Signal-to-Noise Ratio
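The signal-to-noise calculation above can be reproduced from the summary statistics alone. This is a sketch using the rounded values shown on the slide; the variable names are illustrative, and because of rounding the result lands near, not exactly on, the slide's 3.28.

```python
from math import sqrt

# Summary statistics from the slide (rounded)
n_ai, mean_ai, sd_ai = 42, 56.2, 10.78
n_non, mean_non, sd_non = 38, 63.4, 8.71

# Standard error of each group mean: SD / sqrt(N)
sem_ai = sd_ai / sqrt(n_ai)
sem_non = sd_non / sqrt(n_non)

# "Signal": difference in group means; "Noise": standard error of that difference
diff = mean_non - mean_ai
se_diff = sqrt(sem_ai**2 + sem_non**2)

# Test statistic: (observed difference - difference expected under H0) / noise
t_stat = (diff - 0) / se_diff

print(f"SEM(AI) = {sem_ai:.2f}, SEM(non-AI) = {sem_non:.2f}")
print(f"Diff = {diff:.1f}, SE(Diff) = {se_diff:.2f}, t = {t_stat:.2f}")
```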

t-Test for Minimal MAP: Step 2

2. Compare the test statistic to what it is expected to be if the groups (that is, the populations they represent) do not differ (H0). Often, t is approximately distributed as a normal bell curve.

[Figure: bell curve of t under H0. About 95% of values are expected to fall between -2 and 2, with 2.5% in each tail; for example, Prob(-2 to -1) is area = 0.14. Observed t = 3.28.]

Does the t-test statistic of 3.28 seem “RARE” to you? Why?

t-Test for Minimal MAP: P-Value

3. Declare the groups to differ if the test statistic is RARE when H0 is true. [How RARE?]

[Figure: distribution of t when H0 is true. 95% of values are expected in the central region; observed t = 3.28; area = 0.0007 in each tail beyond ±3.28.]

P-Value = Prob(t-statistic > 3.28) = 0.0007 (one-sided)

In practice, a two-sided P-value is usually used.

Two-sided P-Value = 2 × one-sided P-value = 2 × 0.0007 = 0.0014 < 0.05

Conclude: the groups differ, since a value as extreme as 3.28 has a <5% chance of occurring if there is no difference in the entire population.

Smaller P-values ↔ more evidence of group differences.
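As a rough check on the tail areas, the sketch below uses the normal bell-curve approximation mentioned in Step 2, with Python's standard-library `NormalDist`. This is an assumption made for simplicity: the slide's 0.0007 presumably comes from a t distribution, so the normal approximation gives a somewhat smaller tail area (about 0.0005), while the conclusion is unchanged.

```python
from statistics import NormalDist

t_observed = 3.28
std_normal = NormalDist()  # mean 0, standard deviation 1

# One-sided P-value: area to the right of the observed statistic under H0
one_sided = 1 - std_normal.cdf(t_observed)

# Two-sided P-value: differences in either direction count as "extreme"
two_sided = 2 * one_sided

alpha = 0.05
print(f"One-sided P (normal approx.) = {one_sided:.5f}")
print(f"Two-sided P (normal approx.) = {two_sided:.5f}")
print("Groups differ" if two_sided < alpha else "Inconclusive")
```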

One sided or Two sided P-Values?

There are other types of t-tests:

• A two-sided P-value assumes that differences (between groups or pre-to-post) are possible in both directions, e.g., increase or decrease.

• A one-sided P-value assumes that these differences can only be either an increase or decrease, or one group can only have higher or lower responses than the other group. This is very rare, and generally not acceptable.

Tests on Percentages

Is 26.3% vs. 61.9% statistically significant (p<0.05), i.e., is the difference so large that it would have a <5% chance of occurring by chance if the groups do not really differ?

Solution: Same theme as for means. Find a test statistic and compare to its expected values if groups do not differ.

See next slide.

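Tests on Percentages

Cannot use t-test for comparing lab data for multiple blood draws per subject.

[Figure: chi-square distribution under H0. Expected value = 1; 95% of values fall below the critical value; observed = 10.2, with area = 0.002 beyond it.]

Here, the signal in the test statistic is a squared quantity, expected to be 1.

Test statistic = 10.2 >> 3.84 (the 5% critical value for 1 degree of freedom, which applies when two groups are compared on a Yes/No outcome), so p<0.05. In fact, p=0.002.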

Tests on Percentages: Chi-Square

The chi-square test statistic (10.2 in the example) is found by first calculating the expected number of AI patients with MAP <60, and the same for non-AI patients, if AI and non-AI patients really do not differ in this respect.

Then, chi-square is found as the sum of the standardized squared differences, $\chi^2 = \sum (\text{Observed} - \text{Expected})^2 / \text{Expected}$.

This should be close to 1, as in the graph on the previous slide, if groups do not differ. The value 10.2 seems too big (extreme) to have happened by chance (probability = 0.002), i.e., if there is no difference among “all” TBI subjects (H0).
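The observed-versus-expected calculation can be sketched as follows. The cell counts are an assumption: they are reconstructed from the reported percentages and group sizes (26/42 ≈ 61.9% and 10/38 ≈ 26.3%), not taken from the paper. The P-value uses the chi-square distribution with 1 degree of freedom and no continuity correction, so it may differ slightly from the 0.002 on the slide, which may reflect a correction or an exact test.

```python
from math import erfc, sqrt

# Assumed 2x2 table reconstructed from the percentages (not from the paper):
# columns are [MAP < 60, MAP >= 60]
observed = [
    [26, 16],  # AI patients (N = 42)
    [10, 28],  # non-AI patients (N = 38)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if the two groups do not differ (H0)
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

# Upper-tail area for a chi-square variable with 1 degree of freedom:
# P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.1f}, p = {p_value:.4f}")
```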

How RARE is being “RARE”?

[Figure: distribution of t under H0. More than 99% of values are expected to fall in the central region, with <0.5% in each tail; observed t = 3.28.]

Convention:

“Too deviant” is < 5% chance → |t| >~2.

Why not choose, say, |t|>3, so that our chances of being wrong are even less, <1%?

Answer: Then the chances of missing a real difference are increased, which is the converse wrong conclusion.

This is analogous to setting the threshold for a diagnostic test of disease.
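The trade-off between the two kinds of wrong conclusion can be illustrated numerically. This sketch assumes the test statistic is normally distributed with standard deviation 1, and uses a hypothetical true signal-to-noise ratio of 2.5; neither the value nor the function names come from the slides.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, used as an approximation for t


def false_positive_rate(cutoff: float) -> float:
    """Chance that |t| exceeds the cutoff when H0 is true."""
    return 2 * (1 - z.cdf(cutoff))


def detection_rate(cutoff: float, true_effect: float) -> float:
    """Chance that |t| exceeds the cutoff when t is centered at true_effect."""
    return (1 - z.cdf(cutoff - true_effect)) + z.cdf(-cutoff - true_effect)


true_effect = 2.5  # hypothetical real difference, in units of SE(Diff)

for cutoff in (2, 3):
    fp = false_positive_rate(cutoff)
    miss = 1 - detection_rate(cutoff, true_effect)
    print(f"|t| > {cutoff}: false positive = {fp:.4f}, missed real effect = {miss:.2f}")
```

Raising the cutoff from 2 to 3 lowers the false-positive rate, but under this assumed effect it more than doubles the chance of missing the real difference.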

A statistically significant result:

• is not necessarily an important or even interesting result;
• may not be scientifically interesting or clinically significant.

With large sample sizes, very small differences may turn out to be statistically significant. In such a case, the practical implications of any findings must be judged on other than statistical grounds.

Statistical significance does not imply practical significance.
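To see how sample size alone can drive statistical significance, consider a sketch with entirely hypothetical numbers: the same small difference (0.5 units, with a standard deviation of 10 in each group) tested with 50 and with 50,000 subjects per group, using a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided P-value for a difference in means (normal approximation)."""
    se_diff = sd * sqrt(2 / n_per_group)  # equal SD and N in both groups
    z_stat = diff / se_diff
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


# Hypothetical: the same tiny difference, very different sample sizes
p_small_study = two_sided_p(diff=0.5, sd=10, n_per_group=50)
p_large_study = two_sided_p(diff=0.5, sd=10, n_per_group=50_000)

print(f"N = 50 per group:     p = {p_small_study:.2f}")
print(f"N = 50,000 per group: p = {p_large_study:.2g}")
```

The difference is identical in both cases; only the precision changes. Whether 0.5 units matters is a clinical question that the P-value cannot answer.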

How to interpret non-significant P-values

Possible answers:

1. There is no difference (H0 is true).
2. There is a real difference (Ha is true), but the study failed to detect it due to a small sample size: a Type II error.

There is no way to determine whether a non-significant difference is the result of a small sample size or because the null hypothesis is correct.

Thus, non-significant P-values should almost always be regarded as INCONCLUSIVE rather than as an indication of no effect (fail to reject the null).

A non-significant P-value does NOT prove H0.

Back to Paper: Normal Range

What is the “normal” range for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?

Answer: 56.2 ± 2(10.8) ≈ 35 to 78

[Figure labels: non-AI: SD = 8.7, N = 38; AI: SD = 10.8, N = 42]

Back to Paper: Confidence Intervals

Δ = 63.4 - 56.2 = 7.2 is the best guess for the difference in mean MAP between “all” AI and non-AI patients.

We are 95% sure that the difference is within ≈ 7.2 ± 2 SE(Diff) = 7.2 ± 2(2.2) = 2.8 to 11.6.

[Figure labels: non-AI: SD = 8.7, N = 38, SE = 1.41; AI: SD = 10.8, N = 42, SE = 1.66; SE(Diff of Means) = 2.2]

$\mathrm{SE}(\mathrm{Diff}) \approx \sqrt{\mathrm{SEM}_1^2 + \mathrm{SEM}_2^2}$
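The two ranges computed from the paper answer different questions: the normal range uses the SD and describes individual patients, while the confidence interval uses the SE and describes the uncertainty in a mean difference. A small sketch with the rounded slide values (the helper name `plus_minus` is illustrative):

```python
def plus_minus(center: float, spread: float, multiplier: float = 2.0):
    """Return (center - multiplier*spread, center + multiplier*spread)."""
    return center - multiplier * spread, center + multiplier * spread


# Normal range for individual AI patients: mean ± 2 SD
range_low, range_high = plus_minus(56.2, 10.8)

# Approximate 95% CI for the difference in means: diff ± 2 SE(Diff)
ci_low, ci_high = plus_minus(7.2, 2.2)

print(f"Normal range (AI patients): {range_low:.0f} to {range_high:.0f}")
print(f"95% CI for mean difference: {ci_low:.1f} to {ci_high:.1f}")
print("Zero is ruled out" if ci_low > 0 or ci_high < 0 else "Zero is not ruled out")
```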

Sampling distribution and CI

• Sampling distribution: the distribution of a statistic (such as a sample mean or a t-test statistic) under repeated sampling from a target population.
• We can calculate a statistic from one random sample and use that statistic as a point estimate for the population.
• How precise that statistic is depends on the sampling distribution of that statistic.
• Since the sample mean is the most commonly used statistic, the sampling distribution of the mean is used most commonly.
• For a simulation of a sampling distribution or of a confidence interval for the sample mean, go to http://onlinestatbook.com/stat_sim/index.html
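The idea of repeated sampling can also be tried locally with a short simulation. This is a sketch under assumed values: a hypothetical normal population with mean 60 and SD 10, samples of size 25, and 2,000 repetitions. It is not a reproduction of the linked applet.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

pop_mean, pop_sd = 60.0, 10.0  # hypothetical population parameters
n = 25                         # size of each sample
n_repeats = 2000               # number of repeated samples

# Draw many samples and record the mean of each one
sample_means = [
    mean([random.gauss(pop_mean, pop_sd) for _ in range(n)])
    for _ in range(n_repeats)
]

# The spread of the sample means is the standard error of the mean
empirical_se = stdev(sample_means)
theoretical_se = pop_sd / sqrt(n)

print(f"Mean of sample means:      {mean(sample_means):.2f}")
print(f"SD of sample means (SE):   {empirical_se:.2f}")
print(f"Theoretical SE = SD/sqrt(n): {theoretical_se:.2f}")
```

The sample means cluster around the population mean, and their standard deviation is close to SD/√n, which is why the SE can be estimated from a single sample.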

Confidence Interval

• When your study is underpowered (e.g., pilot data) or overpowered (e.g., national surveys), the confidence interval gives the range in which the true effect (a population parameter) plausibly lies.
• How well does your sample mean (m) reflect the true mean?
• Generic form of a 95% CI for the mean (proportion):
  Lower limit: sample mean (proportion) - 1.96 × SE
  Upper limit: sample mean (proportion) + 1.96 × SE
  1.96 × SE is also usually called “the margin of error”.
• SE measures the variability in the sampling distribution of the sample mean (or proportion) under repeated sampling.
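The generic form above can be written as a small helper. This is a sketch: the function names are illustrative, the proportion formula is the usual large-sample approximation sqrt(p(1 - p)/n), and applying 61.9% to the AI group of 42 patients is an assumption based on the earlier slides.

```python
from math import sqrt


def ci95(estimate: float, se: float):
    """Generic 95% CI: estimate ± 1.96 * SE (the margin of error)."""
    margin = 1.96 * se
    return estimate - margin, estimate + margin


def se_mean(sd: float, n: int) -> float:
    """Standard error of a sample mean."""
    return sd / sqrt(n)


def se_proportion(p: float, n: int) -> float:
    """Large-sample standard error of a sample proportion."""
    return sqrt(p * (1 - p) / n)


# Mean lowest MAP in the AI group: mean 56.2, SD 10.78, N = 42
mean_ci = ci95(56.2, se_mean(10.78, 42))

# Proportion with MAP < 60, assumed to be 61.9% of the 42 AI patients
prop_ci = ci95(0.619, se_proportion(0.619, 42))

print(f"95% CI for mean MAP:   {mean_ci[0]:.1f} to {mean_ci[1]:.1f}")
print(f"95% CI for proportion: {prop_ci[0]:.2f} to {prop_ci[1]:.2f}")
```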

Revisiting the food additives study

2. Look at the left side of the bottom panel of Figure 3 and recall what we have said about confidence intervals. Would you conclude that there is a change in hyperactivity under Mix A?

3. Repeat question 2 for placebo.

Revisiting the food additives study (cont.)

[Figure annotations: possible values for real effect; zero is “ruled out”.]

4. Do you think that the positive conclusion for question #3 has been "proven"? Yes, with 95% confidence.

5. Do you think that the negative conclusion for question #2 has been "proven"? No, since more subjects would give a narrower confidence interval.

Hypothesis testing makes a Yes or No conclusion about whether there is an effect and quantifies the chances of a correct conclusion either way.

Confidence intervals give possible magnitudes of effects.

Confidence Intervals ↔ Hypothesis tests

[Figure: the food additives study, with confidence intervals illustrating p>0.05, p≈0.05, and p<0.05]

Confidence Intervals ↔ Hypothesis tests

[Figure: 95% confidence intervals for the two groups in the AI study]

Non-overlapping 95% confidence intervals, as here, are sufficient for significant (p<0.05) group differences.

However, non-overlap is not necessary: the intervals can overlap and the groups can still differ significantly.
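Both statements can be checked numerically. The sketch below uses the AI study means and SEs from the earlier slides, plus a second, purely hypothetical pair of groups (means 0 and 3, SE 1 each) chosen so that the intervals overlap while the difference is still significant. A normal approximation is assumed for the P-values.

```python
from math import sqrt
from statistics import NormalDist


def ci95(mean: float, se: float):
    """95% CI for a group mean: mean ± 1.96 * SE."""
    return mean - 1.96 * se, mean + 1.96 * se


def intervals_overlap(a, b) -> bool:
    """True if intervals a = (low, high) and b = (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def two_sided_p(mean_1, se_1, mean_2, se_2) -> float:
    """Two-sided P-value for the difference in means (normal approximation)."""
    z_stat = (mean_2 - mean_1) / sqrt(se_1**2 + se_2**2)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


# AI study: AI group (56.2, SE 1.66) vs. non-AI group (63.4, SE 1.41)
ai_overlap = intervals_overlap(ci95(56.2, 1.66), ci95(63.4, 1.41))
ai_p = two_sided_p(56.2, 1.66, 63.4, 1.41)

# Hypothetical groups whose intervals overlap
hyp_overlap = intervals_overlap(ci95(0.0, 1.0), ci95(3.0, 1.0))
hyp_p = two_sided_p(0.0, 1.0, 3.0, 1.0)

print(f"AI study:     overlap = {ai_overlap}, p = {ai_p:.4f}")
print(f"Hypothetical: overlap = {hyp_overlap}, p = {hyp_p:.4f}")
```

The reason is that the SE of a difference, sqrt(SE1² + SE2²), is smaller than SE1 + SE2, so comparing two separate intervals is a stricter check than testing the difference directly.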

Power of a Study

Statistical power is the sensitivity of a study to detect real effects, if they exist.

It needs to be balanced with the likelihood of wrongly declaring effects when they are non-existent. Today, we have been keeping that error at <5%.

Power is the topic for the next session #4.
