Introduction to Biostatistics - Stony Brook Medicine

48
In collaboration with Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC). Introduction to Biostatistics Jie Yang, Ph.D. Associate Professor Department of Family, Population and Preventive Medicine Director Biostatistical Consulting Core

Transcript of Introduction to Biostatistics - Stony Brook Medicine

Page 1: Introduction to Biostatistics - Stony Brook Medicine

In collaboration with Clinical Translational Science Center (CTSC) and

the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

Introduction to Biostatistics

Jie Yang, Ph.D.

Associate Professor

Department of Family, Population and Preventive Medicine

Director

Biostatistical Consulting Core

Page 2: Introduction to Biostatistics - Stony Brook Medicine

What is Biostatistics

What does a biostatistician do

• Experiment design, clinical trial design

• Descriptive and Inferential analysis

• Result interpretation

What you should bring while consulting

with a biostatistician

OUTLINE

Page 3: Introduction to Biostatistics - Stony Brook Medicine

http://www.youtube.com/watch?v=PbODigCZqL8

WHAT IS BIOSTATISTICS

• The science of biostatistics encompasses the design of

biological/clinical

experiments

the collection,

summarization, and

analysis of data from

those experiments

the interpretation of,

and inference from, the

results How to Lie with Statistics (1954) by Darrell Huff.

Page 4: Introduction to Biostatistics - Stony Brook Medicine

GOAL OF STATISTICS

Sampling

Inference

Probability

TheoryPOPULATION SAMPLE

Descriptive

Statistics

Descriptive

Statistics

Inferential StatisticsSample

Statistics: 𝑿 , 𝒔, 𝒑 ,…

Population

Parameters:

𝝁, 𝝈, 𝝅…

Page 5: Introduction to Biostatistics - Stony Brook Medicine

• Adequate sample size (statistical power)

• Random selection (representative)

PROPERTIES OF A “GOOD” SAMPLE

Sampling Techniques:

1.Simple random sampling

2.Stratified sampling

3.Systematic sampling

4.Cluster sampling

5.Convenience sampling

Page 6: Introduction to Biostatistics - Stony Brook Medicine

STUDY DESIGN

Page 7: Introduction to Biostatistics - Stony Brook Medicine

EXPERIEMENT DESIGN

Completely Randomized Design (CRD)- Randomly assign the experiment units to the treatments

Design with Blocking – dealing with nuisance factor which has

some effect on the response, but of no interest to the experimenter; Without

blocking, large unexplained error leads to less detection power.

1. Randomized Complete Block Design (RCBD) - One single blocking factor

2. Latin Square

Design (two

blocking factor)

3. Cross over Design

(each subject=blocking factor) 4. Balanced Incomplete

Block Design

Page 8: Introduction to Biostatistics - Stony Brook Medicine

EXPERIMENT DESIGN

Factorial Design: similar to randomized block design, but

allowing to test the interaction between two treatment

effects. A significant interaction between A and B tells:

• the effect of A is different at each level of B. Or the

effect of B differs at each level of A.

• it is not very sensible to even be talking about the main

effect of A and B

Experiment with random factors: randomly select n of

the possible levels of the factor of interest. Typically

random factors are categorical.

Split-plot Design: confounding a main effect with blocks

Page 9: Introduction to Biostatistics - Stony Brook Medicine

EVIDENCE PYRAMID

IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on

ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]

Page 10: Introduction to Biostatistics - Stony Brook Medicine

WHAT CAN A STATISTICIAN HELP

DURING STUDY DESIGN PHASE

Blinding/masking and randomization

The number and combination of experimental inventions

The timing of measurements or visits

Collect information on a larger sample or on the same

sample over time

Ways to maximize the efficient use of the available

resources

Even for data management – how to code measures

and what to computerize directly affect the ease even

the feasibility of subsequent analysis

Page 11: Introduction to Biostatistics - Stony Brook Medicine

How a biostatistician analyzes data

Page 12: Introduction to Biostatistics - Stony Brook Medicine

1. Nominal data: unordered categories or classes

e.g. gender, blood type, transplant type

2. Ordinal data: order among categories is important

e.g. disease severity, AE level

3. Discrete data: both ordering and magnitude are important;

often integers or counts, no intermediate values are possible

e.g. # of accidents within a month, # of kids in a family

4. Continuous data: difference between two possible data

values can be arbitrarily small

e.g. height, weight, body temperature, serum level, BP

5. Time to event data: censoring presents

e.g. overall survival

TYPE OF DATA

Page 13: Introduction to Biostatistics - Stony Brook Medicine

DESCRIPTIVE STATISTICS

General goal is to describe the distribution of a single

variable (center, spread, shape, functional form)

Helpful for checking data and assumptions

Stratified (by group) analysis can be done for groups of

interest

Values and comparisons can be visualized and

“estimated” but descriptive statistics alone will provide no

information about our level of confidence in conclusions

Page 14: Introduction to Biostatistics - Stony Brook Medicine

• Mean: average

• Median: the 50th percentile point (median

value);

• Mode: value that occurs most frequently;

unimodal and multimodal

DESCRIPTIVE STATISTICS

1. Measure of central tendency

Page 15: Introduction to Biostatistics - Stony Brook Medicine

• Reporting a measure of center gives only partial

information about a data set.

– Example: Consider the following three datasets:

Dataset 1: 4 5 5 5 6

Dataset 2: 1 3 5 7 9

Dataset 3: 1 5 5 5 9

All the three datasets have identical means and medians.

Datasets 2&3 are more variable than the 1st one.

• It is also important to describe the spread of values

about the center.

DESCRIPTIVE STATISTICS

Page 16: Introduction to Biostatistics - Stony Brook Medicine

• Range= Max –Min

• Inter-Quartile Range (IQR)=Q3-Q1

• Variance, Sample Variance

• Standard Deviation, Sample Standard

Deviation

DESCRIPTIVE STATISTICS

2. Measure of variability

Page 17: Introduction to Biostatistics - Stony Brook Medicine

• An Outlying Value is a value, X, such that

X> Q3+ 1.5(IQR) or

X< Q1–1.5(IQR)

• An Extreme Outlying Value is a value, X, such

that

X> Q3+ 3(IQR) or

X< Q1–3(IQR)

IDENTIFYING POTENTIAL OUTLIERS

Page 18: Introduction to Biostatistics - Stony Brook Medicine

• Median and IQR are generally unaffected by the removal

of outliers but minor changes are possible.

• Mean and Standard Deviation will be affected by the

outlying values.

• Apparent shape of the distribution can also be affected by

outlying values.

• One should never simply remove data values from a

dataset.

• In practice, if the outliers are not errors, sensitivity

analysis will often be conducted or robust statistical

methods will be used.

EFFECTS OF OUTLIERS

Page 19: Introduction to Biostatistics - Stony Brook Medicine

• Summary table

• Bar/Pie chart

• Histogram

• Scatter plot

• Boxplot

1. Outlier

2. Extreme Outlier

3. Modified Boxplot

WAYS OF PRESENTING DATA

Page 20: Introduction to Biostatistics - Stony Brook Medicine

SUMMARY TABLE

side N Mean Stderr Median Min Max

left 14 18.83 6.04 18.25 8.00 30.10

right 14 18.61 5.48 17.75 8.80 28.21

1. By one variable

2. By multiple categorical variables

Location of

Tumor

Radiation

Sequence with

Surgery

Before 2002 After 2002

Lower

(n = 972)

Preoperative 107(19.21%) 65(15.66%)

Postoperative 450(80.79%) 350(84.34%)

Upper

(n = 283)

Preoperative 20(13.16%) 21(16.03%)

Postoperative 132(86.84%) 110(83.97%)

Page 21: Introduction to Biostatistics - Stony Brook Medicine

BAR CHART AND PIE CHART

8.12%

10.35%

12.76%

2.88%3.59%

4.46%

5.75%

7.44%

9.32%

0.00%

2.00%

4.00%

6.00%

8.00%

10.00%

12.00%

14.00%

AGB LSG RYGB

Bariatric surgeries, 2010-2013

ED revisit Admitted from ED Discharged from ED

Diagnosis for Cholecystectomy patients,

2006-2013

Page 22: Introduction to Biostatistics - Stony Brook Medicine

HISTOGRAM AND SCATTER PLOT

0 10 20 30 40 50 60 700

10

20

30

40

50

60

70

DISC measurement by group

Le

ft

Right Control No Surgery Gamma-knife Resection

Page 23: Introduction to Biostatistics - Stony Brook Medicine

BOX-PLOT

One continuous variable and one categorical variable

Page 24: Introduction to Biostatistics - Stony Brook Medicine

OTHER

Page 25: Introduction to Biostatistics - Stony Brook Medicine

THEORETICAL DISTRIBUTIONS

Variable Type of Outcome Theoretical Distribution

Continuous numeric Normal, Log-normal, Exponential,…

Discrete numeric Poisson, Negative Binomial,…

Binary Bernoulli, Binomial,….

Categorical with multiple categories

Multinomial,…

Page 26: Introduction to Biostatistics - Stony Brook Medicine

CONFIDENCE INTERVALS

A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator.

A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that are likely to cover the true value of the parameter.

A confidence intervals starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.

Population Parameters:𝝁, 𝝈, 𝝅…

Sample Statistics: 𝑿 , 𝒔, 𝒑 ,…

Page 27: Introduction to Biostatistics - Stony Brook Medicine

95% CI for μ: P(-??<µ<??)=0.95

CONFIDENCE INTERVALS

Since by central limit theorem, ),(~ 2

xxNx

95.0)96.1/

96.1(

n

xP

95.0)*96.1*96.1( n

xn

xP

Page 28: Introduction to Biostatistics - Stony Brook Medicine

95% Confidence Interval (CI) for µ:

Interpretation 1: You can be 95% sure that the true mean

(μ) will fall within the upper and lower bounds.

Interpretation 2: 95% of the intervals constructed using

sample means (x) will contain the true population mean

(μ).

100(1-α)% CI:

A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html

CONFIDENCE INTERVALS

nx

*96.1

nZx

2/1

Page 29: Introduction to Biostatistics - Stony Brook Medicine

The cartoon guide to Statistics by Gonick

and Smith

Page 30: Introduction to Biostatistics - Stony Brook Medicine

• Using data to test specific hypotheses

• Making decisions based on probability

(instead of subjective impressions)

• Distribution is usually assumed

• Methods that require no distributional

assumptions are called non-parametric or

distribution free

HYPOTHESIS TESTING

Page 31: Introduction to Biostatistics - Stony Brook Medicine

• A well formulated hypothesis will be both

quantifiable and testable, that is, involve

measurable quantities or refer to items that may

be assigned to mutually exclusive categories.

• Takes one of two forms:

“ Some measurable characteristic of a population takes

one of a specific set of values”

“ Some measurable characteristic takes different values in

different populations, the difference has a specific pattern

or a specific set of values”

WHAT IS A HYPOTHESIS

Page 32: Introduction to Biostatistics - Stony Brook Medicine

EXAMPLES

This new drug will lower diastolic blood pressure.

For males over 40 suffering from chronic

hypertension, a 100mg daily dose of this new drug

will lower diastolic blood pressure an average of

10 mm Hg.

Page 33: Introduction to Biostatistics - Stony Brook Medicine

• The Null hypothesis describes some aspect of the

statistical behavior of a set of data and is denoted

H0

• This description is treated as valid unless the

actual behavior of the data contradicts this

assumption

• The Alternative Hypothesis is generally the

“opposite” of the null hypothesis and is denoted

H1

BASIC DEFINITIONS AND

NOTATION

Page 34: Introduction to Biostatistics - Stony Brook Medicine

• The null hypothesis is usually of the form

HYPOTHESIS

• The alternative will take on one of the following forms

Page 35: Introduction to Biostatistics - Stony Brook Medicine

REJECTION REGION IN

HYPOTHESIS TESTING

Page 36: Introduction to Biostatistics - Stony Brook Medicine

ERRORS IN HYPOTHESIS TESTING

The facts Decisions

No Difference Drug is Better

No Difference Correct Type I error: Manufacturer wastes money

developing an ineffective drugs

Drug is better Type II error: Manufacturer misses

opportunity for profit; Public denied access to

effective treatment

Correct

There is a trade off between type I error, α, and type II error, β

Page 37: Introduction to Biostatistics - Stony Brook Medicine

P-VALUE

Definition: the probability of obtaining a test statistic as extreme as or

more extreme than the actual test statistic obtained, given that the null

hypothesis is true.

Other explanation: the α level at which we would be indifferent between

accepting or rejecting H0 given the sample data at hand or the α level at

which the given value of the test statistic is on the borderline between the

acceptance and rejection region

Page 38: Introduction to Biostatistics - Stony Brook Medicine

1. State null (H0) and alternative (H1) hypotheses

2. Choose a significance level, α (usually 0.05 or

0.01)

3. Based on the sample, calculate the test statistic

and calculate p-value based on theoretical

distribution

4. Compare p-value with the significance level

5. Make a decision, and state the conclusion

BASIC STEPS IN HYPOTHESIS TESTING

Page 39: Introduction to Biostatistics - Stony Brook Medicine

• If we wish to conduct a two-sided test of a hypothesis regarding a population parameter with significance level α,

we can do this by constructing a 100(1-α)% confidence

interval and checking to see if the hypothesized value is in

the interval

• In this manner, CIs can be used to conduct Two-Sided

Hypothesis Tests

RELATIONSHIP BETWEEN HT

AND CI

Variable Group Estimate95% Confidence

IntervalP-value

absolute

difference

Control vs. No surgery 0.421 0.239-0.741 0.0036

Control vs. Surgery 0.276 0.158-0.482 <0.0001

No surgery vs. Surgery 0.655 -0.372-1.156 0.1400

Gamma-knife vs. Resection 0.420 0.190-0.925 0.0322

Page 40: Introduction to Biostatistics - Stony Brook Medicine

• Usually we assume an underlying distribution and

the methods used have been based on these

assumptions

• Such methods are parametric statistical methods

since the parametric form of the distribution is

assumed to be known

• If these assumptions are not reasonable and/or

the central limit theorem cannot be applied,

nonparametric procedures should be used

NONPARAMETRIC TESTS

Page 41: Introduction to Biostatistics - Stony Brook Medicine

• These tests can be used in situations where the

data are ordinal or even binary (Yes/No)

• For quantitative data, if there are extreme values,

ranking mediates these values

• In these situations, assumptions may be violated

for parametric tests such as t-test, especially for

small samples and non-parametric methods are

more appropriate

NONPARAMETRIC TESTS

Page 42: Introduction to Biostatistics - Stony Brook Medicine

SUMMARY OF COMMON

NONPARAMETRIC TESTS

Purpose of test

Normal theory-based

test

Corresponding

nonparametric test

To study the central tendency

of a single sample One sample t-test

sign test; Wilcoxon

signed rank test

To compare central

tendencies of two

independent samples

t-test for two independent

samples

Wilcoxon rank-sum test

(Mann-Whitney U test)

To examine a set of

differences paired t-test

Wilcoxon signed-rank

test

To assess the linear

association between two

variables

Pearson correlation

coefficient

Spearman rank

correlation coefficient

To compare three or more

groups

one way analysis of

variance (ANOVA) Kruskal-Wallis test

Page 43: Introduction to Biostatistics - Stony Brook Medicine

The cartoon guide to Statistics by Gonick

and Smith

43

Page 44: Introduction to Biostatistics - Stony Brook Medicine

• Sample size: N– ↑ N → power ↑

• Significance level: α– ↑ α → power ↑

• Effect size: δ– ↑ δ → power ↑

• Variation in the continuous outcome: σ2

– ↓ σ2 → power ↑• One-tailed vs. two-tailed tests

– Power is greater in one-tailed tests than in comparable two-tailed tests

POWER IS AFFECTED BY

44

Page 45: Introduction to Biostatistics - Stony Brook Medicine

• Variables of interest – type of data e.g. continuous, categorical

• Desired power

• Desired significance level

• Effect/difference of clinical importance

• Standard deviations of continuous outcome variables

• One or two-sided tests

SAMPLE SIZE FORMULA BASICS

45

Depends on study designNot hard, but can be VERY algebra intensive

Page 46: Introduction to Biostatistics - Stony Brook Medicine

WHILE CONSULTING WITH A

BIOSTATISTICIAN

Written summary materials are preferred: readable & organized

o Background information about the problem

o A proposal, protocol, or statement of work

o Schematics, such as diagram or flow-chart

o Information about any existing database

o prior studies, pilot data, tests, published and in-house reports

relating to the problem

Make sure the biostatistician understand your needs - Avoid having a

good solution to a wrong question

Any issues in data collection (e.g. missing data) or deviations from the

study protocol (e. g. randomization before baseline tests verified

eligibility).

The biostatistician should be a coauthor given substantive input from the biostatistician. But always get explicit consent first.

Page 47: Introduction to Biostatistics - Stony Brook Medicine

WHEN TO CONSULT A

BIOSTATISTICIAN

Most effective way: to include a biostatistician from the

very beginning of a research project.

After data have been collected: • Bring complete, detailed description of the study design and conduct

• Bring clear exposition of the questions to be addressed

• Note: research questions may not be answered by collected data

Once the data are analyzed:• Checking if conclusions fit the analysis results

• Suggesting best ways to describe and display the data

• Assuring no erroneous or incomplete statements about the findings

• Sometimes data re-analysis may be needed

When You Consult a Statistician... What to Expect (2007) Berman N, Gansky S, Guillon C, Loughin T, Sanchez M (2003)

Page 48: Introduction to Biostatistics - Stony Brook Medicine

Please check our website for future lectures

https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education

THANK YOU!