Fundamental Concepts of Biostatistics Cathy Jenkins, MS Biostatistician II Lisa Kaltenbach, MS Biostatistician II April 17, 2007


Prior to any analysis
  Define the research question(s). Write each one out using no more than one sentence per question.
  Determine a statistical analysis plan to address each research question.
  [Diagram: Analysis, showing Confounder, Predictor, and Outcome]


Population versus Sample
  Population: includes all possible observations of a particular type. Observations may be people, animals, places, or things.
    Ex.: Men & women aged 18 and older; infants; penguins; bodies of water
  Sample: includes only some of the observations, ideally selected in a way that gives every possible observation an equal chance of being included.
    Ex.: Men & women aged 18 and older in the TennCare Database from years 1998-2005; infants with a primary care physician at Vanderbilt during years 1990-2000; all penguins living in the Nashville zoo since 1995; bodies of water included in the Bay Delta & Tributaries Database


Population versus Sample [2]
  In most cases in clinical research, we want to generalize from information about our sample to information about a population.


Descriptive Statistics
  To describe characteristics of the sample
    Ex.: demographics, distributions, frequencies
  May want to describe data with a numerical or graphical summary
  Characteristics of the sample may be continuous or categorical variables


Continuous Variables
  Continuous: a variable that can take on any number of possible values (Ex.: weight).
  Discrete Numeric: a variable whose set of possible values is a finite sequence of numbers (Ex.: pain scale 1 to 5).
  Numerical Summary: Often want to measure the central tendency of the data
    Sample mean: The sum of all of the observations divided by the number of observations. The mean is most informative when the data are roughly symmetric (e.g., normally distributed); it is sensitive to skewness and outliers.
    Sample Median (50th Percentile): Order the observations from smallest to largest:
      If n is odd, then the median is the middle ordered observation.
      If n is even, then the median is the average of the two middle ordered observations.
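As a small illustration of the mean and median defined above, the sketch below uses Python's standard `statistics` module on made-up numbers (the values are hypothetical, not taken from the slides). It also shows how a single extreme value moves the mean while leaving the median unchanged.

```python
from statistics import mean, median

# Hypothetical measurements, chosen only for illustration.
values = [2, 3, 3, 4, 5]
values_with_outlier = [2, 3, 3, 4, 50]  # same data with one extreme value

mean_plain = mean(values)
median_plain = median(values)            # n is odd: the middle ordered value
mean_outlier = mean(values_with_outlier)
median_outlier = median(values_with_outlier)

# n is even: the median is the average of the two middle ordered values.
median_even = median([1, 2, 3, 4])

print("mean / median without outlier:", mean_plain, median_plain)
print("mean / median with outlier:   ", mean_outlier, median_outlier)
print("median of an even-sized list: ", median_even)
```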


Continuous Variables [2]
  Other common percentiles include quartiles (25th, 50th, and 75th percentiles) and deciles (10th, 20th, …, 90th percentiles).
  The p-th percentile is the value that p% of the data are less than or equal to. If p% of the data lie at or below the p-th percentile, it follows that (100 - p)% of the data lie above it.
    Ex.: If the 85th percentile of household income is $60,000, then 85% of the households have incomes of $60,000 or less and the top 15% of households have incomes of $60,000 or more.
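The percentile definition can be sketched with the nearest-rank method, which is only one of several conventions in use (statistical packages often interpolate instead). The function name `nearest_rank_percentile` and the income values are hypothetical and chosen for illustration.

```python
def nearest_rank_percentile(data, p):
    """Return the smallest value with at least p% of the data at or below it.

    Nearest-rank convention; p must be an integer with 0 < p <= 100.
    """
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    ordered = sorted(data)
    n = len(ordered)
    rank = (p * n + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]

# Hypothetical household incomes in thousands of dollars: 10, 20, ..., 200.
incomes = list(range(10, 201, 10))

p85 = nearest_rank_percentile(incomes, 85)
share_at_or_below = sum(1 for x in incomes if x <= p85) / len(incomes)

print("85th percentile:", p85)
print("share of data at or below it:", share_at_or_below)
```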

Measures of Dispersion
  When measurements are collected there will be scatter, dispersion, or variability.
  Sources of dispersion
    Random error: Error due to chance
    Systematic error: Wrong result due to bias
    Biological variability


Continuous Variables [3]
  Minimum: Smallest observed value
  Maximum: Largest observed value
  Range: Difference between max and min (often reported as (min, max))
  Interquartile Range (IQR): Difference between the 75th and 25th percentiles (often reported as (25th, 75th))
  Variance: The average of the squared deviations of the observations from their mean (for a sample, the sum of squared deviations is divided by n - 1).
  Standard Deviation: The square root of the variance
    Measures spread about the mean and should be used only when the mean is chosen as the measure of center
    Has the same unit of measurement as the mean
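The dispersion measures above can be computed with the standard `statistics` module, as in the following sketch on hypothetical data. Note that `variance` and `stdev` use the sample (n - 1) divisor, `pvariance` uses the divisor n, and `quantiles` uses one particular interpolation rule, so quartiles from other software may differ slightly.

```python
from statistics import pvariance, quantiles, stdev, variance

# Hypothetical measurements, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)

# quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

sample_variance = variance(values)       # divides by n - 1
population_variance = pvariance(values)  # divides by n
sample_sd = stdev(values)                # square root of the sample variance

print("range:", value_range, "reported as", (min(values), max(values)))
print("IQR:", iqr, "reported as", (q1, q3))
print("sample variance:", sample_variance)
print("sample standard deviation:", sample_sd)
```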


Continuous Variables [4]
  Graphical Summary:
  [Figure: distribution of Baseline APACHE Score; x-axis: Baseline APACHE Score (0 to 40), y-axis: Density]


Categorical Variables
  Categorical: a variable having only certain possible values (ex.: race).
  Binary: a categorical variable with only two possible values (ex.: gender).
  Ordinal: a categorical variable for which there is a definite ordering of the categories (ex.: severity of lower back pain ordered as none, mild, moderate, and severe).
  Numerical Summary:
    Frequency Distribution: A listing of the distinct values for that characteristic and the number of observations having each value.
    Relative frequency: Proportion of the total number of observations that fall into each category.
    Cumulative frequency: Number (or proportion) of the observations that fall into the current or previously listed categories (may be useful for an ordinal variable).

Ibuprofen Use    Frequency     Relative frequency          Cumulative Frequency
Nonuser          4356 (67%)    4356/6498 = .6703601        4356
Low Dose         1702 (26%)    1702/6498 = .2619267        6058
High Dose        438 (7%)      438/6498 = .06740536        6498
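A frequency distribution of this kind can be tabulated from raw observations as sketched below. The observations are a small hypothetical data set (not the counts from the table), and the category order is supplied explicitly because cumulative frequencies depend on it.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw observations, one label per subject.
observations = ["Nonuser"] * 6 + ["Low Dose"] * 3 + ["High Dose"] * 1

# Explicit ordering of the categories for the table.
order = ["Nonuser", "Low Dose", "High Dose"]

counts = Counter(observations)
total = sum(counts.values())

frequency = [counts[category] for category in order]
relative = [count / total for count in frequency]
cumulative = list(accumulate(frequency))  # running total of the counts

for category, f, r, c in zip(order, frequency, relative, cumulative):
    print(f"{category:<10} {f:>3} {r:>6.2f} {c:>4}")
```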


Relationships between two variables
  Two variables measured on the same observations are associated if some values of the first variable tend to occur more often with some values of the second variable than with other values of that variable.
    Two continuous variables. Ex.: Person’s weight and blood pressure
    Two categorical variables. Ex.: gender and smoking status
    One continuous & one categorical variable. Ex.: blood pressure and gender
  Keep in mind: the relationship between two variables can be strongly influenced by other variables that are lurking in the background


Two Continuous Variables
  A scatterplot shows the relationship between two continuous variables measured on the same observations.
  The values of one variable appear on the x-axis, and the values of the other variable appear on the y-axis.
  Each observation appears as a point in the plot fixed by the values of both variables for that observation.


Two Continuous Variables [2] Graphical Summary: Scatterplot


Two Categorical Variables
  Numerical Summary: Cross-tabulation (or 2-way table)
  Ex.: Clinical Pregnancy by Age Groups

                      Age Group
Clinical Pregnancy    Age <35      35-37       Age >=38    Total
Yes                   136 (40%)    24 (7%)     17 (5%)     177 (53%)
No                    99 (29%)     47 (14%)    14 (4%)     160 (47%)
Total                 235 (70%)    71 (21%)    31 (9%)     337
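The margins and percentages of a cross-tabulation follow directly from the cell counts. The sketch below uses the six cell counts from the Clinical Pregnancy by Age Group example; the dictionary layout and variable names are illustrative choices, and percentages are expressed relative to the grand total.

```python
# Cell counts from the Clinical Pregnancy by Age Group example.
table = {
    "Yes": {"Age <35": 136, "35-37": 24, "Age >=38": 17},
    "No": {"Age <35": 99, "35-37": 47, "Age >=38": 14},
}

age_groups = list(table["Yes"])

# Row and column margins, then the grand total.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in age_groups}
grand_total = sum(row_totals.values())

# Each cell as a rounded percentage of the grand total.
percent = {
    row: {col: round(100 * count / grand_total) for col, count in cells.items()}
    for row, cells in table.items()
}

for row in table:
    cells = "  ".join(f"{table[row][c]} ({percent[row][c]}%)" for c in age_groups)
    print(f"{row:<4} {cells}  total {row_totals[row]}")
print("column totals:", col_totals, "grand total:", grand_total)
```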


One Categorical and One Continuous variable
  Consider descriptive statistics of the continuous variable separately for different values of the categorical variable
  Ex.: Descriptive statistics of birth weight by mother's smoking status during pregnancy

Smoked during pregnancy    Mean Birth weight    Standard Deviation    Frequency
Nonsmoker                  3055.0               752.4                 115
Smoker                     2773.2               660.1                 74
Total                      2944.7               729.0                 189
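The overall mean in a table like this is the frequency-weighted average of the group means. A minimal sketch, using the group frequencies and mean birth weights from the example above (variable names are illustrative):

```python
# (frequency, mean birth weight) for each group, from the example table.
groups = {
    "Nonsmoker": (115, 3055.0),
    "Smoker": (74, 2773.2),
}

total_n = sum(n for n, _ in groups.values())

# Weight each group mean by its number of observations.
overall_mean = sum(n * group_mean for n, group_mean in groups.values()) / total_n

print("total frequency:", total_n)
print("overall mean birth weight:", round(overall_mean, 1))
```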


How to study the relationship between two different variables
  Quantify the relationship: Measure the strength of the relationship (linear, monotonic, …) between two continuous variables.
  Use hypothesis testing: Test theory to see if experimental results only reflect random chance.
  Fit model: Predict one measure of an individual from another.


Quantify the relationship
  Correlation coefficient (r): Quantitative summary of the strength of the relationship between two continuous variables.
    Pearson correlation: focuses on the raw data.
    Spearman correlation: focuses on the ranks of the raw data.
  Coefficient of determination (r^2): Square of the correlation coefficient; the proportion of the variability in one variable that is explained by its linear relationship with the other.
    Not a cause and effect relationship, but quantifies how well one variable predicts another.
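The difference between the Pearson and Spearman coefficients can be seen on a relationship that is monotonic but not linear. The sketch below implements both by hand with the standard library; the helper names (`pearson`, `average_ranks`, `spearman`) and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation computed from the raw data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y increases with x, but along a curve (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
r_squared = r_pearson ** 2

print("Pearson r:", round(r_pearson, 3))
print("Spearman r:", round(r_spearman, 3))
print("r squared:", round(r_squared, 3))
```

Because the ranks of x and y agree perfectly, the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it.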


Hypothesis testing [1]
  Define a null hypothesis for the question of interest that assumes the experimental results are due to chance alone.
  Perform a statistical test to determine whether we can reject or fail to reject the null hypothesis.
  NOTE: Absence of evidence does not mean evidence of absence. In other words, if our test results in a non-significant p-value, we do not “accept” the null hypothesis. Rather, we fail to reject the null hypothesis. It could be that for the same experiment but a different sample we would obtain significant results.
  P-value: the probability, assuming the null hypothesis is true (i.e., that chance alone is at work), of obtaining a result at least as extreme as the one actually observed.
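To make the p-value definition concrete, the following sketch computes an exact two-sided p-value for a hypothetical coin-flip experiment (8 heads in 10 flips, null hypothesis of a fair coin). The function name is illustrative, and "at least as extreme" is taken here to mean "no more probable under the null than the observed count", which is one common convention.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value under the null probability p_null."""
    probabilities = [
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probabilities[successes]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in probabilities if p <= observed * (1 + 1e-9))

# Hypothetical experiment: 8 heads in 10 flips of a coin assumed to be fair.
p_value = binomial_two_sided_p(8, 10)
print("two-sided p-value:", p_value)
```

With these numbers the p-value is 112/1024, about 0.11, so at the conventional 0.05 level we would fail to reject the null hypothesis of a fair coin.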


Hypothesis testing [2]
  Categorical data
    Chi-square tests
      Both row and column variables are nominal
      Row variable nominal; column variable ordinal
      Both row and column variables are ordinal
    Tests whether the distribution of frequencies differs across rows (groups) or whether there is any association between the row and column variables.
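As a rough sketch of the general-association (Pearson) chi-square test, the code below compares observed counts with the counts expected under no association. It reuses the Clinical Pregnancy by Age Group cell counts from the earlier cross-tabulation and treats both variables as nominal, which is an assumption made here for illustration; the tests that exploit ordinal structure are not shown. The p-value uses the closed form for the chi-square distribution with exactly 2 degrees of freedom, so that line does not apply to tables of other sizes.

```python
from math import exp

# Observed counts: rows are Clinical Pregnancy (Yes, No),
# columns are Age <35, 35-37, Age >=38.
observed = [
    [136, 24, 17],
    [99, 47, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if rows and columns were unrelated.
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid only because df == 2: the upper-tail probability is exp(-x / 2).
p_value = exp(-chi_square / 2)

print("chi-square:", round(chi_square, 2), "df:", df, "p-value:", round(p_value, 4))
```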


Hypothesis testing [3]
  Nominal row and column variables
    Example: Given data on the neighborhood in which a person lives and his/her political affiliation, you wish to test whether a person’s political affiliation is associated with where he/she lives.
    H0: No association exists between a person’s political affiliation and the neighborhood in which he/she lives.
    HA: An association exists between a person’s political affiliation and the neighborhood in which he/she lives.


Hypothesis testing [4]
  Nominal row variable and ordinal column variable
    Example: Given data from a study of hours of headache pain relief (ranging from 0 to 6 hours) under three different treatments: placebo, standard, and test treatment.
    H0: No association between hours of pain relief and treatment.
    HA: A shift in row mean hours of headache pain relief exists between the treatment groups.


Hypothesis testing [5]
  Ordinal row and column variables
    Example: Given data assessing how water additives (water, standard, super) affect the washability of clothes (low, medium, high).
    H0: No association between the water additive and the washability of the clothes.
    HA: There is a linear association between water additive and washability of clothes.


Hypothesis Testing [6]
  Continuous variables
    Parametric tests: Make assumptions about the underlying distribution of the data.
      1-sample t-test H0: Mean of data is equal to some fixed value (defined by study question).
      2-sample t-test H0: No difference in means between the two independent groups.
      Paired t-test H0: Mean of difference in paired data is equal to 0.
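A two-sample t statistic can be computed from summary statistics alone. The sketch below uses the group sizes, means, and standard deviations from the birth weight by smoking status example and assumes equal variances in the two groups (the pooled form of the test), which is a choice made here rather than something the slides specify. Turning the statistic into a p-value requires the t distribution, which is not in the Python standard library, so that step is left out.

```python
from math import sqrt

# Summary statistics from the birth weight example: n, mean, standard deviation.
n1, mean1, sd1 = 115, 3055.0, 752.4   # nonsmokers
n2, mean2, sd2 = 74, 2773.2, 660.1    # smokers

# Pooled variance: weighted average of the two sample variances.
pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)

# Standard error of the difference in means under the equal-variance assumption.
standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))

t_statistic = (mean1 - mean2) / standard_error
degrees_of_freedom = n1 + n2 - 2

print("t statistic:", round(t_statistic, 2), "on", degrees_of_freedom, "df")
```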


Hypothesis testing [7]
  Non-parametric tests: Do not assume a particular underlying distribution (such as the normal) for the data.
    Wilcoxon signed rank test: analogous to the one-sample or paired t-test.
      1-sample H0: Median is equal to specified value (defined by study question).
      Paired H0: Median difference is equal to 0.
    Wilcoxon rank sum test: analogous to the two-sample t-test.
      H0: The distribution of the response variable is the same in the two independent groups.
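The Wilcoxon rank sum test works on ranks rather than raw values. The sketch below computes the rank sum for one group and an exact two-sided p-value by enumerating every way the ranks could have been split between the groups. The data are hypothetical, the approach assumes no tied values, and full enumeration is only practical for small samples.

```python
from itertools import combinations

# Hypothetical measurements in two independent groups (no tied values).
group_a = [1.1, 2.3, 3.0]
group_b = [4.2, 5.5, 6.1, 7.0]

# Rank all observations together, smallest = rank 1.
combined = sorted(group_a + group_b)
rank_of = {value: position + 1 for position, value in enumerate(combined)}

w_observed = sum(rank_of[value] for value in group_a)

n_a = len(group_a)
n_total = len(combined)
expected_w = n_a * (n_total + 1) / 2  # mean rank sum under H0

# Under H0 every choice of n_a ranks out of n_total is equally likely.
all_rank_sums = [sum(c) for c in combinations(range(1, n_total + 1), n_a)]
as_extreme = sum(
    1 for s in all_rank_sums
    if abs(s - expected_w) >= abs(w_observed - expected_w)
)
p_value = as_extreme / len(all_rank_sums)

print("rank sum for group A:", w_observed)
print("exact two-sided p-value:", round(p_value, 4))
```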


Modeling [1]: ANOVA
  1-way/2-way: extends the 2-sample t-test (with 1 factor/2 factors) to n groups; compares the mean of a continuous variable across the n groups.
    H0: No difference in means between the n groups.
    HA: At least one group has a different mean than the other (n-1) groups.
  Avoids problems with multiple comparisons.
  Tests whether between-group variability is large relative to within-group variability.
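A minimal sketch of the one-way ANOVA F statistic, which is the ratio of the between-group mean square to the within-group mean square. The three groups are hypothetical data, and the p-value step (which needs the F distribution) is omitted.

```python
from statistics import mean

# Hypothetical measurements for three independent groups.
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]

all_values = [value for group in groups for value in group]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_statistic = (ss_between / df_between) / (ss_within / df_within)

print("F statistic:", round(f_statistic, 2), "on", (df_between, df_within), "df")
```

A large F indicates that the group means differ by more than the scatter within groups would suggest.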


Modeling [2]: Linear regression
  Continuous outcome.
  Assumes the relationship between predictor(s) and outcome is linear.
  Observations are assumed to be independent (i.e., only one observation per subject, no subjects that are related to each other, etc.).
  Number of predictors allowed in the model depends on the sample size.
    Rule of thumb: no more than n/10 predictors, where n = # of subjects.
  Include confounders in the model for better parameter estimates.
  Output is a set of parameter estimates.
    Can give information similar to that obtained from hypothesis testing.
    Allows the investigator to make inferences based on the parameter estimates.
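For a single predictor, the parameter estimates of a linear regression are the least-squares slope and intercept. The sketch below computes them by hand on hypothetical data; real analyses with several predictors and confounders would normally use a statistics package.

```python
# Hypothetical predictor (x) and continuous outcome (y), one pair per subject.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates for the model y = intercept + slope * x.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predict the outcome for a new value of the predictor.
predicted_at_6 = intercept + slope * 6

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("predicted outcome at x = 6:", round(predicted_at_6, 2))
```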


Modeling [3]: Logistic regression
  Categorical outcome, typically binary.
  Number of predictors in model depends on several things:
    Each group has at least 10 subjects.
    Cell counts in a cross-tab table meet certain sample size criteria:
      80% of expected counts are at least 5.
      All other expected counts are greater than 2, with virtually no 0 counts.
  Output is a set of parameter estimates and odds ratios calculated from these parameter estimates.
  Odds ratio (OR): the odds of a certain event in one group divided by the odds of that event in a second group; a way of comparing how likely the event is in the two groups.
    OR = 1: The event is equally likely in both groups.
    OR > 1: The event is more likely in the first group.
    OR < 1: The event is less likely in the first group.
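An unadjusted odds ratio can be computed directly from a 2x2 table. As an illustration, the sketch below compares the odds of clinical pregnancy in the Age <35 group with the odds in the Age >=38 group, using the counts from the earlier cross-tabulation; choosing these two groups is an assumption made for this example, and a logistic regression would additionally adjust for other predictors.

```python
# Counts from the Clinical Pregnancy by Age Group example.
yes_under_35, no_under_35 = 136, 99   # first group: Age <35
yes_38_plus, no_38_plus = 17, 14      # second group: Age >=38

# Odds = number with the event divided by number without it.
odds_under_35 = yes_under_35 / no_under_35
odds_38_plus = yes_38_plus / no_38_plus

odds_ratio = odds_under_35 / odds_38_plus

print("odds, Age <35:", round(odds_under_35, 3))
print("odds, Age >=38:", round(odds_38_plus, 3))
print("odds ratio:", round(odds_ratio, 3))
```

Here the ratio is slightly above 1, meaning the event was somewhat more likely in the first group in this sample.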


Conclusion
  The statistical analysis plan should be devised before collecting data.
  Use the aims of the study to decide the best way to study the relationship between variables of interest: correlations, hypothesis testing, modeling.
  Make use of the daily biostatistics clinics. Refer to http://biostat.mc.vanderbilt.edu for the clinic schedule. Click on the “Clinics” link (5th link from the top).
  Check to see if your department is a part of the collaboration plan for more intensive biostatistics support.