Statistical Analysis I - Penn State...

Statistical Analysis I

Lan Kong, PhD Associate Professor

Department of Public Health Sciences

December 22, 2014

CTSI BERD Research Methods Seminar Series

Biostatistics, Epidemiology, Research Design (BERD)

BERD Goals:
• Match the needs of investigators to the appropriate biostatisticians/epidemiologists/methodologists
• Provide BERD support to investigators
• Offer BERD education to students and investigators via in-person, videoconferenced, and on-line classes

http://ctsi.psu.edu/ctsi-programs/biostatisticsepidemiologyresearch-design/


BERD Seminar Series

Date      Title                                                       Presenter
Sept. 8   Intro to Clinical Research Designs & Cross-sectional Study  Duanping Liao
Sept. 22  Cohort Study                                                Duanping Liao
Oct. 6    Case Control Study                                          Duanping Liao
Oct. 20   Matched Case Control Study                                  Duanping Liao
Nov. 3    Clinical Trials                                             Vern Chinchilli
Nov. 17   Power and Sample Size                                       Allen Kunselman
Dec. 8    Data Management                                             Rosanne Pogash
Dec. 22   Statistical Analysis 1 (HY only)                            Lan Kong
Jan. 5    Statistical Analysis 2 (HY only)                            Mosuk Chow
Jan. 19   Putting it all together in a research proposal/protocol     Duanping Liao
Feb. 2    Meta-Analysis                                               Vern Chinchilli

Statistics Encompasses
• Study design
  - Selection of efficient design (cohort study/case-control study)
  - Sample size
  - Randomization
• Data collection
• Summarizing data
  - Important first step in understanding the data collected
• Analyzing data to draw conclusions
• Communicating the results of analyses

Keys to Successful Collaboration Between Statistician and Investigator: A Two-Way Street

Involve statistician at beginning of project (planning/design phase)

Specific objectives

Communication

avoid jargon

willingness to explain details

Keys to Successful Collaboration: A Two-Way Street

• Respect
  - Knowledge
  - Skills
  - Experience
  - Time
• Embrace statistician as a member of the research team
• Fund statistician on grant application for best collaboration
  - Most statisticians are supported by grants, not by institutional funds

Statistical Analysis

• Describing data
  - Numeric or graphic
• Statistical inference
  - Estimation of parameters of interest
  - Hypothesis testing
  - Regression modeling
• Interpretation and presentation of the results

Describing data: Basic Terms

• Measurement – assignment of a number to something
• Data – collection of measurements
• Sample – collected data
• Population – all possible data
• Variable – a property or characteristic of the population/sample, e.g., gender, weight, blood pressure

Example of data set/sample

Data on albumin and bilirubin levels before and after treatment with a study drug

ID   DRUG  BILI  ALBUMIN  BASE_BIL  BASE_ALB
6    0     0.7   4.2      0.8       3.98
7    0     1.2   3.59     1         4.09
8    0     1.3   3.08     0.3       4
11   0     2.1   3.58     1.4       4.16
13   0     1.1   3.39     0.7       3.85
16   0     0.6   3.8      0.7       3.66
21   0     1.7   3.22     0.6       3.83
2    1     3.6   2.92     1.1       4.14
15   1     1.2   3.72     0.8       3.87
19   1     0.4   3.92     0.7       3.56
24   1     3.6   3.66     2.1       4
34   1     0.8   3.85     0.8       3.7
43   1     0.7   3.78     1.1       3.64

Describing Data

• Types of data
• Summary measures (numeric)
• Visually describing data (graphical)

Types of Variables

Qualitative or Categorical
• Binary (or dichotomous): True/False, Yes/No
• Nominal – no natural ordering: Ethnicity
• Ordinal – categories have natural ranks
  - Degree of agreement (strong, modest, weak)
  - Size of tumor (small, medium, large)

Quantitative
• Ratio – ordered, constant scale, natural zero (age, weight)
• Interval – ordered, constant scale, no natural zero
  - Differences make sense, but ratios do not
  - Temperature in Celsius (30° - 20° = 20° - 10°, but 20° is not twice as hot as 10°)

Types of Measurements for Quantitative Variables
• Continuous: Weight, Height, Age
• Discrete: a countable number of values
  - The number of births, Age in years
• Likert scale: “agree”, “strongly agree”, etc.
  - Somewhere between ordinal and discrete
  - Scales with <= 4 possibilities are usually considered to be ordinal.
  - Scales with >= 7 possibilities are usually considered to be discrete.

Descriptive Statistics

Quantitative variable
• Measure(s) of central location/tendency
  - Mean
  - Median
  - Mode
• Measure(s) of variability (dispersion) describe the spread of the distribution

Summary measures of dispersion/variation
• Minimum and Maximum
• Range = Maximum – Minimum
• Sample variance (abbreviated s²) and standard deviation (s or SD), with denominator = n-1
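As a minimal sketch, the summary measures above can be computed with Python's standard `statistics` module. The values are the BILI column of the example data set shown earlier; the variable names are illustrative only.

```python
import statistics

# BILI column of the example data set (all 13 subjects).
bili = [0.7, 1.2, 1.3, 2.1, 1.1, 0.6, 1.7, 3.6, 1.2, 0.4, 3.6, 0.8, 0.7]

mean = statistics.mean(bili)
median = statistics.median(bili)
modes = statistics.multimode(bili)    # all values tied for most frequent
data_range = max(bili) - min(bili)    # Range = Maximum - Minimum
variance = statistics.variance(bili)  # sample variance, denominator n - 1
sd = statistics.stdev(bili)           # square root of the sample variance

print(f"mean={mean:.3f} median={median} modes={modes}")
print(f"range={data_range:.1f} s^2={variance:.3f} s={sd:.3f}")
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead of n-1, so they are not the sample versions described on the slide.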

Descriptive Statistics (cont.)

Other Measures of Variation
• Interquartile range (IQR): 75th percentile – 25th percentile
• MAD: median absolute deviation
• CV: Coefficient of variation
  - Ratio of SD over sample mean
  - Measures relative variability
  - Independent of measurement units
  - Useful for comparing two or more sets of data
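A short sketch of these three measures, again using the BILI column of the example data set. One assumption to flag: percentile conventions differ between software packages, and `statistics.quantiles` is used here with its default "exclusive" method, so the IQR from other tools may differ slightly.

```python
import statistics

# BILI column of the example data set (all 13 subjects).
bili = [0.7, 1.2, 1.3, 2.1, 1.1, 0.6, 1.7, 3.6, 1.2, 0.4, 3.6, 0.8, 0.7]

# Quartiles: 25th, 50th and 75th percentiles (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(bili, n=4)
iqr = q3 - q1

# MAD: median of the absolute deviations from the median.
med = statistics.median(bili)
mad = statistics.median([abs(x - med) for x in bili])

# CV: sample SD divided by the sample mean (unit-free).
cv = statistics.stdev(bili) / statistics.mean(bili)

print(f"IQR={iqr:.2f} MAD={mad:.2f} CV={cv:.3f}")
```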

Describing data graphically
• Tell the whole story of the data, detect outliers
  - Histogram
  - Stem and Leaf Plot
  - Box Plot

Histogram

Divide range of data into intervals (bins) of equal width. Count the number of observations in each class.

[Figure: histogram of systolic BP (mmHg), x-axis 80 to 160, y-axis: number of men]

• 113 men

• Each bar spans a width of 5 mmHg.

• The height represents the number of individuals in that range of SBP.

Histogram of SBP

[Figure: two histograms of systolic BP (mmHg), x-axis 80 to 160, y-axis: number of men. Panels: bin width = 20 mmHg; bin width = 1 mmHg]
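The binning step behind a histogram, and the effect of the bin width, can be sketched as follows. The 113 original blood pressure values are not in the slides, so this uses simulated data (113 draws from a normal distribution with mean 125 and SD 14); `bin_counts` is a hypothetical helper written for this illustration.

```python
import random
from collections import Counter

def bin_counts(values, width, origin=0.0):
    """Count observations in equal-width bins [left, left + width)."""
    counts = Counter(int((v - origin) // width) for v in values)
    return {origin + k * width: counts[k] for k in sorted(counts)}

# Simulated stand-in for the SBP sample (assumption: not the real data).
rng = random.Random(0)
sbp = [rng.gauss(125, 14) for _ in range(113)]

for width in (20, 5, 1):
    hist = bin_counts(sbp, width)
    print(f"bin width {width:>2} mmHg: {len(hist)} non-empty bins")

# Text histogram with 5 mmHg bins: one '#' per observation.
for left, count in bin_counts(sbp, 5).items():
    print(f"{left:5.0f}-{left + 5:<5.0f} {'#' * count}")
```

Very wide bins hide the shape of the distribution, while very narrow bins leave most bars with only a few observations, which is the contrast the two panels illustrate.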

Stem and Leaf Plot

• Provides a good summary of data structure
• Easy to construct and much less prone to error than the tally method of finding a histogram

2 | 8 8 9
3 | 0 1 1 1 2 3 3 4 4 5 5 5 5 6 6 6 7 7 7 7 8 9 9
4 | 0 0 1 1 1 1 1 2 2 3 3 3 4 4 4 4 5 5 5 6 7 7 8 9
5 | 0 1 1 2 3 4

“stem”: the first digit or digits of the number.
“leaf”: the trailing digit.
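Constructing such a display is mechanical, as this sketch suggests. It assumes the values in the display above are two-digit whole numbers (stem = tens digit, leaf = units digit); the slides do not say what was measured, and `stem_and_leaf` is a hypothetical helper.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers; leaf = last digit."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Rebuild the raw values from the rows of the display (stem -> leaves).
rows = {
    2: "889",
    3: "01112334455556667777899",
    4: "001111122333444455567789",
    5: "011234",
}
data = [10 * stem + int(d) for stem, leaves in rows.items() for d in leaves]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because each leaf is one observation, the length of a row plays the same role as the height of a histogram bar.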

Box Plot: SBP for 113 Males

[Figure: boxplot of systolic blood pressures (y-axis 80 to 160) for the sample of 113 men. Labels: sample median blood pressure, 75th percentile, 25th percentile, largest observation, smallest observation]

Descriptive Statistics (cont.)

Categorical variable
• Frequency (counts) distribution
• Relative frequency (percentages)
• Pie chart
• Bar graph

Describe relationship between two variables
• One quantitative and one categorical
  - Descriptive statistics within each category
  - Side by side boxplots/histograms
• Both quantitative
  - Scatter plot
• Both categorical
  - Contingency table
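Two of these summaries can be sketched with the example data set: descriptive statistics of BILI within each DRUG group, and a contingency table of two categorical variables. The second categorical variable, whether BILI is above BASE_BIL, is derived here purely for illustration and is not part of the original slides.

```python
import statistics
from collections import Counter

# (ID, DRUG, BILI, BASE_BIL) from the example data set.
rows = [
    (6, 0, 0.7, 0.8), (7, 0, 1.2, 1.0), (8, 0, 1.3, 0.3), (11, 0, 2.1, 1.4),
    (13, 0, 1.1, 0.7), (16, 0, 0.6, 0.7), (21, 0, 1.7, 0.6), (2, 1, 3.6, 1.1),
    (15, 1, 1.2, 0.8), (19, 1, 0.4, 0.7), (24, 1, 3.6, 2.1), (34, 1, 0.8, 0.8),
    (43, 1, 0.7, 1.1),
]

# One quantitative (BILI) and one categorical (DRUG) variable.
by_drug = {}
for _id, drug, bili, _base in rows:
    by_drug.setdefault(drug, []).append(bili)
summary = {
    drug: {"n": len(v), "mean": statistics.mean(v), "median": statistics.median(v)}
    for drug, v in by_drug.items()
}
for drug, s in sorted(summary.items()):
    print(f"DRUG={drug}: n={s['n']} mean={s['mean']:.2f} median={s['median']:.2f}")

# Both categorical: DRUG by (BILI above baseline?), an illustrative variable.
table = Counter((drug, bili > base) for _id, drug, bili, base in rows)
print("DRUG  above  not above")
for drug in (0, 1):
    print(f"{drug:<5} {table[(drug, True)]:<6} {table[(drug, False)]}")
```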

Statistical Inference

A process of making inference (an estimate, prediction, or decision) about a population (parameters) based on a sample (statistics) drawn from that population.

[Figure: diagram relating Population and Sample through Inference, with two plots of systolic BP (mmHg): one with y-axis percentage, one with y-axis number of men]

• Population: parameters (fixed, unknown)
• Sample: statistics (vary from sample to sample)

Statistical Inference

Questions to ask in selecting appropriate methods
• Are observation units independent?
• How many variables are of interest?
• Type and distribution of variable(s)?
• One-sample or two-sample problem?
• Are samples independent?
• Parameters of interest (mean, variance, proportion)?
• Sample size sufficient for the chosen method?
(see decision making flow chart in the handout)

Estimation of population mean

• We don’t know the population mean μ but would like to know it.
• We draw a sample from the population.
• We calculate the sample mean $\bar{X}$.
• How close is $\bar{X}$ to μ?
• Statistical theory will tell us how close $\bar{X}$ is to μ.
• Statistical inference is the process of trying to draw conclusions about the population from the sample.

Key Statistical Concept
• Question: How close is the sample mean to the population mean?
• Statistical inference for the sample mean
  - Sample mean will change from sample to sample
  - We need a statistical model to quantify the distribution of sample means (sampling distribution)
  - Assume “normal distribution” for the population data

Normal Distribution
• Normal distribution, denoted by $N(\mu, \sigma^2)$, is characterized by two parameters
  - µ: The mean is the center.
  - σ: The standard deviation measures the spread (variability).

[Figure: normal probability density function, with the mean and the standard deviation marked]

Distribution of Blood Pressure in Men (population)

Y: Blood pressure, Y ~ $N(\mu, \sigma^2)$

Parameters: Mean, µ = 125 mmHg; SD, σ = 14 mmHg

[Figure: normal density of blood pressure with x-axis ticks at 83, 97, 111, 125, 139, 153, 167 mmHg and the 68%, 95%, and 99.7% ranges marked]

The 68-95-99.7 rule for normal distribution applied to the distribution of systolic blood pressure in men.
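The rule can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library with the slide's parameters (µ = 125 mmHg, σ = 14 mmHg); the intervals µ ± 1σ, µ ± 2σ, and µ ± 3σ correspond to the tick marks 111 to 139, 97 to 153, and 83 to 167.

```python
from statistics import NormalDist

# Population model from the slide: Y ~ N(125, 14^2).
sbp = NormalDist(mu=125, sigma=14)

coverage = {}
for k in (1, 2, 3):
    low = sbp.mean - k * sbp.stdev
    high = sbp.mean + k * sbp.stdev
    # Probability that Y falls within k standard deviations of the mean.
    coverage[k] = sbp.cdf(high) - sbp.cdf(low)
    print(f"within {k} SD: {low:.0f} to {high:.0f} mmHg, "
          f"probability {coverage[k]:.4f}")
```

The computed probabilities are approximately 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%.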

Sampling Distribution

The sampling distribution refers to the distribution of the sample statistics (e.g. sample means) over all possible samples of size n that could have been selected from the study population.

If the population data follow a normal distribution $N(\mu, \sigma^2)$, then the sample means follow a normal distribution $N(\mu, \sigma^2/n)$.

What if the population data do not come from a normal distribution?

Central Limit Theorem (CLT)

• If the sample size is large, the distribution of sample means approximates a normal distribution: $\bar{X} \sim N(\mu, \sigma^2/n)$.
• The Central Limit Theorem works even when the population is not normally distributed (or even not continuous).
  http://onlinestatbook.com/stat_sim/sampling_dist/index.html
• For sample means, the standard rule is n > 60 for the Central Limit Theorem to kick in, depending on how “abnormal” the population distribution is. 60 is a worst-case scenario.
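A small simulation gives a feel for the theorem. This is only a sketch under assumptions chosen for illustration: the population is exponential with rate 1 (mean 1, SD 1, strongly right-skewed), each sample has n = 60 observations, and 5000 samples are drawn with a fixed random seed.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 60            # size of each sample
n_samples = 5000  # number of repeated samples

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

se = sigma / math.sqrt(n)  # SD of the sample mean predicted by theory
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
within_2se = sum(abs(m - mu) <= 2 * se for m in sample_means) / n_samples

print(f"mean of sample means: {mean_of_means:.3f} (theory {mu})")
print(f"SD of sample means:   {sd_of_means:.3f} (theory {se:.3f})")
print(f"share within 2 SE:    {within_2se:.3f}")
```

Even though individual observations are far from normal, the sample means cluster around µ with spread close to σ/√n, and roughly 95% of them fall within two standard errors of µ.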

Sampling Distribution

• By CLT, about 95% of the time, the sample mean will be within two standard errors of the population mean.
  - This tells us how “close” the sample statistic should be to the population parameter.
• Standard errors (SE) measure the precision of your sample statistic.
  - A small SE means it is more precise.
• The SE is the standard deviation of the sampling distribution of the statistic.

Standard Error of Sample Mean

• The standard error of the sample mean (SEM) is a measure of the precision of the sample mean:

  $\mathrm{SEM} = \dfrac{\sigma}{\sqrt{n}}$

  where σ is the standard deviation (SD) of the population distribution and n is the sample size.
• The standard deviation is not the standard error of a statistic!

Example

• Measure systolic blood pressure on a random sample of 100 students
  - Sample size n = 100
  - Sample mean $\bar{x}$ = 125 mm Hg
  - Sample SD s = 14.0 mm Hg
• Population SD (σ) can be replaced by sample SD for large sample

  $\mathrm{SEM} = \dfrac{14}{\sqrt{100}} = 1.4$ mmHg
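The same arithmetic in code, as a minimal sketch: `standard_error_of_mean` is a hypothetical helper, and the extra sample sizes (25 and 400) are added here only to show how the SEM shrinks as n grows while the SD stays the same.

```python
import math

def standard_error_of_mean(sd, n):
    """SEM = SD / sqrt(n); the sample SD stands in for sigma when n is large."""
    return sd / math.sqrt(n)

# Slide example: s = 14.0 mmHg, n = 100.
sem = standard_error_of_mean(14.0, 100)
print(f"n = 100: SEM = {sem:.1f} mmHg")

# Same SD, different (hypothetical) sample sizes.
for n in (25, 400):
    print(f"n = {n}: SEM = {standard_error_of_mean(14.0, n):.1f} mmHg")
```

Quadrupling the sample size halves the SEM, whereas the SD describes the spread of individual measurements and does not shrink with n.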

Confidence Interval for population mean
• An approximate 95% confidence interval for population mean µ is $\bar{X} \pm 2 \times \mathrm{SEM}$, or more precisely $\bar{X} \pm 1.96 \times \mathrm{SEM}$.
• $\bar{X}$ is a random variable (it varies from sample to sample), so the confidence interval is random and it has a 95% chance of covering µ before a sample is selected.
• Once a sample is taken, we observe $\bar{X} = \bar{x}$; then either µ is within the calculated interval or it is not.
• The confidence interval gives the range of plausible values for µ.
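The interval for the blood pressure example (n = 100, sample mean 125 mmHg, sample SD 14.0 mmHg) and the "95% chance of covering µ" statement can both be sketched in code. The coverage check assumes a normal population with µ = 125 and σ = 14 and a fixed random seed; `mean_ci` is a hypothetical helper that uses the normal multiplier (about 1.96), not a t multiplier.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_ci(xbar, sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    sem = sd / math.sqrt(n)
    return xbar - z * sem, xbar + z * sem

# Example values: n = 100, sample mean 125 mmHg, sample SD 14.0 mmHg.
low, high = mean_ci(125, 14.0, 100)
print(f"95% CI: ({low:.2f}, {high:.2f}) mmHg")

# Coverage check: draw many samples from an assumed N(125, 14^2) population
# and count how often the interval computed from each sample contains mu.
rng = random.Random(2014)
mu, sigma, n, n_sims = 125, 14, 100, 4000
covered = 0
for _ in range(n_sims):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    lo, hi = mean_ci(statistics.fmean(sample), statistics.stdev(sample), n)
    covered += lo <= mu <= hi
coverage = covered / n_sims
print(f"share of intervals covering mu: {coverage:.3f}")
```

Each simulated interval either contains µ or it does not; the 95% describes the long-run share of intervals that do.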
