Fall 2002 Biostat 511 1
Medical Biometry I(Biostatistics 511)
Instructor: Jim Hughes
Cartoons and images in these notes are from:
Gonick L. Cartoon Guide to Statistics. HarperPerennial, New York, 1993.
Fisher L and van Belle G. Biostatistics: A Methodology for the Health Sciences. Wiley, New York, 1993.
Typical Public Health, Medical or Biological Questions About Populations
• Does formula feeding increase the chance of survival of infants born to HIV positive mothers, compared to breastfeeding, in a developing country?
• How do we estimate the concentration of antibody based on reactivity of serial dilutions?
• Are there trends in mortality and homicide rates by urban setting, age, gender, or race?
• How do we model survival following heart bypass surgery? Are there patient characteristics that predict survival? How does the 1 year, 2 year, 5 year survival of bypass patients compare to individuals treated medically for heart disease?
• How do attitudes toward enrollment in an HIV vaccine study vary by geography, age, or education?
• How does physician experience influence survival of patients with HIV?
Biostatistics 511
• Introduction to the basic concepts of statistics as applied to problems in public health or medicine
• Definitions:
1. Data - numerical facts, measurements, or observations obtained from an investigation aimed at answering a question.
2. Statistics - the science and art of obtaining reliable results and conclusions from data that is subject to variation.
3. Biostatistics - the application of statistics to the biologic sciences, medicine and public health.
Role of Statistics in Public Health and Medicine
Science
1. Idea or Question
2. Collect data/make observations
3. Describe data / observations
4. Assess the strength of evidence for / against the hypothesis
Statistics
1. Math. model / hypothesis
2. Study design
3. Descriptive statistics
4. Inferential statistics
Descriptive Statistics and Exploratory Data Analysis - Univariate
• Types of data
1. Categorical
2. Continuous
• Numerical Summaries
1. Location - mean, median, mode.
2. Spread - range, variance, standard deviation, IQR
3. Shape - skewness
• Graphical Summaries
1. Barplot
2. Stem and Leaf plot
3. Histogram
4. Boxplot
• Mathematical Summaries
1. Density curves
Descriptive Statistics (Exploratory)
• “Exploratory data analysis is detective work - numerical detective work”
• “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone - the first step”
• organization, summarization, and presentation of data
• “Show me the data!”
Tools:
• tables
• graphs
• numerical summaries
John Tukey, Exploratory Data Analysis, Addison-Wesley, 1977
Inferential Statistics (Confirmatory)
• Generalization of conclusions:
sample → population
• Assess strength of evidence
• Make comparisons
• Make predictions
Tools:
• Modeling
• Estimation and Confidence Intervals
• Hypothesis Testing
Example: Effect of seat belt use on accident fatality
                 Seat Belt
Driver           Worn           Not worn
dead             10             20
alive            40             30
Total            50             50
Fatality Rate    10/50 (20%)    20/50 (40%)
But, suppose...
                 Impact Speed < 40 mph      Impact Speed > 40 mph
                 seat belt                  seat belt
Driver           worn      not              worn      not
dead             3         2                7         18
alive            27        18               13        12
Total            30        20               20        30
Fatality Rate    10%       10%              35%       60%
How does this affect your inference?
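To make the comparison concrete, the crude and speed-specific fatality rates can be recomputed from the counts in the two tables. This is only a sketch: the dictionary layout and the names `counts`, `fatality_rate`, `stratum_rates` and `crude_rates` are illustrative choices, not part of the notes.

```python
# Counts copied from the two seat-belt tables.
# Layout (an illustrative choice): counts[speed][belt] = (dead, alive)
counts = {
    "< 40 mph": {"worn": (3, 27), "not worn": (2, 18)},
    "> 40 mph": {"worn": (7, 13), "not worn": (18, 12)},
}


def fatality_rate(dead, alive):
    """Proportion of drivers who died."""
    return dead / (dead + alive)


# Rates within each impact-speed group.
stratum_rates = {
    speed: {belt: fatality_rate(*cells) for belt, cells in belts.items()}
    for speed, belts in counts.items()
}

# Crude rates: pool the two speed groups before dividing.
crude_rates = {}
for belt in ("worn", "not worn"):
    dead = sum(counts[speed][belt][0] for speed in counts)
    alive = sum(counts[speed][belt][1] for speed in counts)
    crude_rates[belt] = fatality_rate(dead, alive)

print("crude:", crude_rates)
for speed, rates in stratum_rates.items():
    print(speed, rates)
```

Pooling reproduces the 20% versus 40% comparison, while the low-speed group shows no difference at all. Note also that belted drivers are concentrated in the low-speed group (30 of 50), whereas unbelted drivers are concentrated in the high-speed group (30 of 50).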
Types of Data
• Categorical (qualitative)
1) Nominal scale - no natural order
- gender, marital status, race
2) Ordinal scale
- severity scale, good/better/best
• Numerical (quantitative)
1) Discrete - (few) integer values
- number of children in a family
2) Continuous - measure to arbitrary precision
- blood pressure, weight
Why bother?
PROPER DISPLAYS
PROPER ANALYSIS
In statistics we deal with data - measurements or observations on individuals (or, more generally, on the “units of observation”).
Categorical data
For categorical data we usually summarize with counts. A simple visual summary is the bar graph.
Notes:
• vertical axis can be count or percent
• in the above example, counts do not add to 74 … individuals can have multiple risk factors
• tabular presentation may be more parsimonious for such data
[Figure: bar graph of risk factor for HIV (Gay, Heter, IVDU, Occup); vertical axis: Count, 0 to 80; N = 74]
Continuous Data
Consider the 11 ages:
21,32,34,34,42,44,46,48,52,56,64
Age is a quantitative variable so a barplot doesn’t make sense. Here we are more interested in characteristics of the distribution of ages:
where is the center of the age distribution (e.g. the average)?
how much does age vary?
are there some values far from the bulk of the data?
We would like some visual tools to help us answer these questions.
Stem and Leaf Diagram
We could group the data and tally the frequencies:
20: X
30: XXX
40: XXXX
50: XX
60: X
But why “hide” the details? Instead, we’ll use the 10’s place as stems and the units as leaves:
Stem | age
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
The stemplot or stem and leaf plot is a quick, informative summary for small datasets.
Stem and Leaf Diagram, construction
• All but the last digit form the stem.
• Stems are stacked vertically from the smallest to the largest.
• The leaf is the last digit in a value and is placed next to the appropriate stem (leaves are ordered outward from smallest to largest).
• Shows macro information - general shape, spread, range.
• Shows micro information - all values shown.
• Fast and easy to construct.
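The construction rules above can be sketched in a few lines of code. This is a minimal illustration for non-negative whole numbers only; the function name `stem_and_leaf` is hypothetical, and empty stems are printed so that gaps in the data stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem and leaf diagram for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):           # sorting orders the leaves within a stem
        stem, leaf = divmod(v, 10)     # all but the last digit, and the last digit
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # stack stems smallest to largest
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem}* | {row}")
    return rows


ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
for line in stem_and_leaf(ages):
    print(line)
```

Run on the 11 ages, this reproduces the diagram shown earlier, with every original value still readable from the display.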
Back-to-back Stem and Leaf
To compare two sets of data, use a back-to-back stem and leaf diagram
[Figure: back-to-back stem and leaf diagram with split stems 9* through 13*; one side labeled Calcium, the other Placebo]
Fig 1. Systolic blood pressure after 12 weeks treatment with daily calcium supplement or placebo
(Unfortunately, you can’t do this in Stata)
Methods for Grouped Data
The stem and leaf effectively groups continuous data into intervals. Let’s extend this idea. The following terms are useful for grouped data:
• frequency - the number of times the value occurs in the data.
• cumulative frequency - the number of observations that are equal to or smaller than the value.
• relative frequency - the % of the time that the value occurs (frequency/N).
• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/N).
Example - Birthweights
Sample of 100 birthweights in ounces. Complete the following table ...
Interval             Midpt   Freq.   Cum. Freq.   Rel. Freq.   Cum. Rel. Freq.
29.5 < W < 69.5       49.5     5
69.5 < W < 89.5       79.5    10
89.5 < W < 99.5       94.5    11
99.5 < W < 109.5     104.5    19
109.5 < W < 119.5    114.5    17
119.5 < W < 129.5    124.5    20
129.5 < W < 139.5    134.5    12
139.5 < W < 169.5    154.5     6
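One way to check a hand-completed table is to compute the four columns from the frequencies. The sketch below uses only the interval endpoints and frequencies given in the table; the variable names are illustrative.

```python
from itertools import accumulate

# Interval endpoints (ounces) and frequencies from the birthweight table.
intervals = [(29.5, 69.5), (69.5, 89.5), (89.5, 99.5), (99.5, 109.5),
             (109.5, 119.5), (119.5, 129.5), (129.5, 139.5), (139.5, 169.5)]
freq = [5, 10, 11, 19, 17, 20, 12, 6]

n = sum(freq)                                  # sample size N
cum_freq = list(accumulate(freq))              # running total of frequencies
rel_freq = [f / n for f in freq]               # frequency / N
cum_rel_freq = [c / n for c in cum_freq]       # cumulative frequency / N

for (lo, hi), f, cf, rf, crf in zip(intervals, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo:6.1f} to {hi:6.1f}  {f:3d}  {cf:4d}  {rf:5.2f}  {crf:5.2f}")
```

The last cumulative frequency should equal N and the last cumulative relative frequency should equal 1, which is a quick sanity check on any such table.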
Histogram
• Similar to a barplot, but used for continuous data.
• Divide the data into intervals.
• A rectangle is constructed with the base being the interval end-points and the height chosen so the area of the rectangle is proportional to the frequency (if the width is one unit for all intervals, then height=frequency).
• Shape can be sensitive to number and choice of intervals (rule of thumb: number of bins is the smaller of √n or 10*log10(n))
• Histograms are more effective for moderate to large datasets.
Note: A histogram is a special type of bargraph where variable interval widths are permitted.
Example - Birthweights
[Figures: two histograms of the birthweight data, one labeled “Right” and one labeled “Wrong”]
Note: You can determine relative frequency and cumulative relative frequency from a histogram.
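Because the birthweight intervals have unequal widths, the bar heights have to be scaled so that area, not height, reflects frequency. The sketch below computes heights as relative frequency per ounce; that scaling is one choice consistent with the area rule, and it is an assumption that a "wrong" histogram would plot raw frequency as height.

```python
# Interval endpoints (ounces) and frequencies from the birthweight table.
intervals = [(29.5, 69.5), (69.5, 89.5), (89.5, 99.5), (99.5, 109.5),
             (109.5, 119.5), (119.5, 129.5), (129.5, 139.5), (139.5, 169.5)]
freq = [5, 10, 11, 19, 17, 20, 12, 6]

n = sum(freq)
widths = [hi - lo for lo, hi in intervals]

# Height chosen so that height * width = relative frequency of the interval.
heights = [(f / n) / w for f, w in zip(freq, widths)]
areas = [h * w for h, w in zip(heights, widths)]

for (lo, hi), f, h in zip(intervals, freq, heights):
    print(f"{lo:6.1f} to {hi:6.1f}  freq={f:2d}  height={h:.5f}")
print("total area:", round(sum(areas), 6))
```

With this scaling the widest interval (29.5 to 69.5) gets a very low bar even though it holds 5 observations, and the relative frequency of any interval can be read back as the area of its rectangle.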
Characteristics of Distributions
Shape
number of modes (peaks)
symmetry
Center
where is the center?
Spread
how much variation?
outliers?
Other features
boundaries
digit preference
Examples
[Figure: three example histograms; vertical axis: Fraction, horizontal axis: Var]
Notation
Suppose we have N measurements of a particular variable. We will denote these N measurements as:
$X_1, X_2, X_3, \ldots, X_N$
where $X_1$ is the first measurement, $X_2$ is the second, etc.
Sometimes it is useful to order the measurements. We denote the ordered measurements as:
$X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(N)}$
where $X_{(1)}$ is the smallest value and $X_{(N)}$ is the largest.
Arithmetic Mean
The arithmetic mean is the most common measure of the central location of a sample. We use $\bar{X}$ to refer to the mean and define it as:
$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
The symbol $\sum$ is shorthand for “sum” over a specified range. For example:
$\sum_{i=1}^{4} X_i = (X_1 + X_2 + X_3 + X_4)$
Some Properties of the Arithmetic Mean
Often we wish to transform variables. Linear changes to variables (i.e. Y = a*X+b) impact the mean in a predictable way:
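(1) Adding (or subtracting) a constant to all values:
$Y_i = X_i + c \Rightarrow \bar{Y} = \bar{X} + c$
(2) Multiplication (or division) by a constant:
$Y_i = cX_i \Rightarrow \bar{Y} = c\bar{X}$
Does this nice behavior happen for any change? NO! (show that $\overline{\log X} \neq \log \bar{X}$)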
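These properties are easy to check numerically. The sketch below uses ten illustrative ages and an arbitrary constant c = 5 (both are assumptions made for the example) and compares the mean of the logs with the log of the mean.

```python
import math
from statistics import mean

x = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]   # illustrative ages in years
c = 5                                           # arbitrary constant

mean_x = mean(x)
mean_shifted = mean([xi + c for xi in x])       # Y = X + c
mean_scaled = mean([c * xi for xi in x])        # Y = c * X

mean_of_logs = mean([math.log(xi) for xi in x]) # mean of log(X)
log_of_mean = math.log(mean_x)                  # log of mean(X)

print(mean_x, mean_shifted, mean_scaled)
print(mean_of_logs, log_of_mean)
```

The shifted and scaled means move exactly as the two rules say, while the mean of the logs comes out smaller than the log of the mean, so the log transform does not pass through the mean.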
Median
Another measure of central tendency is the median - the “middle one”. Half the values are below the median and half are above. Given the ordered sample, X(i), the median is:
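N odd: $\text{Median} = X_{\left(\frac{N+1}{2}\right)}$
N even: $\text{Median} = \frac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right)$
Mode
The mode is the most frequently occurring value in the sample.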
Example: Central Location
Suppose the ages in years of the first 10 subjects enrolled in your study are:
34,24,56,52,21,44,64,34,42,46
Then the mean age of this group is:
$\bar{X} = (34 + 24 + 56 + 52 + 21 + 44 + 64 + 34 + 42 + 46)/10 = 417/10 = 41.7$ years
To find the median, first order the data:
21,24,34,34,42,44,46,52,56,64
$\text{Median} = \frac{1}{2}\left(X_{\left(\frac{10}{2}\right)} + X_{\left(\frac{10}{2}+1\right)}\right) = \frac{1}{2}(42 + 44) = 43$ years
The mode is 34 years.
Suppose the next patient enrolls and their age is 97 years.
How do the mean and median change?
$\bar{X} = (34 + 24 + 56 + 52 + 21 + 44 + 64 + 34 + 42 + 46 + 97)/11 = 514/11 = 46.7$ years
To get the median, order the data:
21,24,34,34,42,44,46,52,56,64,97
$\text{Median} = X_{(6)} = 44$ years
If the age were recorded incorrectly as 977, instead of 97, what would the new median be? What would the new mean be?
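A short calculation shows how differently the two summaries react to one extreme value. The sketch uses the `statistics` module from the Python standard library; the list names are illustrative.

```python
from statistics import mean, median, mode

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]   # first 10 subjects
with_97 = ages + [97]                              # 11th subject, age 97
with_977 = ages + [977]                            # same subject, mis-recorded as 977

for label, data in [("first 10", ages), ("with 97", with_97), ("with 977", with_977)]:
    print(f"{label:9s} mean = {mean(data):6.1f}   median = {median(data)}")

print("mode of first 10:", mode(ages))
```

The median is the same whether the 11th value is 97 or 977, because only the ordering of the values matters. The mean, in contrast, is pulled far above every other observation by the recording error.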
Comparison of Mean and Median
• Mean is sensitive to a few very large (or small) values - “outliers”
• Median is “resistant” to outliers
• Mean is attractive mathematically
• 50% of sample is above the median, 50% of sample is below the median.
Variation is important!
Measures of Spread: Range
The range is the difference between the largest and smallest observations:
Range = Maximum - Minimum = $X_{(N)} - X_{(1)}$
Alternatively, the range may be denoted as the pair of observations:
Range = (Minimum, Maximum) = $(X_{(1)}, X_{(N)})$
The latter form is useful for data quality control.
Disadvantage: the sample range tends to increase with increasing sample size.
In the ages example, for the first 10 subjects, the range is
Range = 64 - 21= 43
or (21,64)
Measures of Spread: Variance
Consider the following two samples:
20,23,34,26,30,22,40,38,37
30,29,30,31,32,30,28,30,30
These samples have the same mean and median, but the second is much less variable. The average “distance” from the center is quite small in the second. We use the variance to describe this feature:
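$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2$
$s^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} X_i^2 - N\bar{X}^2\right]$
$s^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2 / N\right]$
The standard deviation is simply the square root of the variance:
standard deviation = $s = \sqrt{s^2}$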
For the first sample, we obtain:
$\bar{X} = 30$
$\sum_{i=1}^{9} X_i^2 = 8558$
$s^2 = \frac{1}{9-1}\left[8558 - 9(30)^2\right] = \frac{8558 - 8100}{8} = 57.25\ \text{yr}^2$
For the second sample, we obtain:
$\bar{X} = 30$
$\sum_{i=1}^{9} X_i^2 = 8110$
$s^2 = \frac{1}{9-1}\left[8110 - 9(30)^2\right] = \frac{8110 - 8100}{8} = 1.25\ \text{yr}^2$
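The hand calculations can be checked by computing the variance both from the definition and from the sum-of-squares shortcut. This is a sketch; `variance_shortcut` is a hypothetical helper name. Summing the squares of the nine values listed for the first sample directly gives 8558, and for the second sample 8110.

```python
from statistics import mean, stdev, variance

sample_1 = [20, 23, 34, 26, 30, 22, 40, 38, 37]
sample_2 = [30, 29, 30, 31, 32, 30, 28, 30, 30]


def variance_shortcut(x):
    """s^2 = (sum of X_i^2 - N * mean^2) / (N - 1)."""
    n = len(x)
    xbar = sum(x) / n
    return (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)


for name, x in [("sample 1", sample_1), ("sample 2", sample_2)]:
    print(name,
          "mean =", mean(x),
          "sum of squares =", sum(xi ** 2 for xi in x),
          "s^2 =", variance(x),                 # definition, divisor N - 1
          "shortcut =", variance_shortcut(x),
          "s =", round(stdev(x), 2))
```

Both routes give the same variance for each sample, and the second sample's variance is far smaller even though the two means are equal.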
Properties of the variance/standard deviation
• Variance and standard deviation are ALWAYS greater than or equal to zero.
• Linear changes are a little trickier than they were for the mean:
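(1) Add/subtract a constant: $Y_i = X_i + c$
$s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i + c - (\bar{X} + c)\right)^2 = s_X^2$
(2) Multiply/divide by a constant: $Y_i = cX_i$
$s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\sum_{i=1}^{N}(cX_i - c\bar{X})^2 = \frac{1}{N-1}\sum_{i=1}^{N}c^2(X_i - \bar{X})^2 = c^2 s_X^2$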
So what happens to the standard deviation?
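A numerical check can help here. The sketch below uses an illustrative sample and an arbitrary negative constant c = -3 (both assumptions made for the example), so that the role of the sign is visible.

```python
from statistics import stdev, variance

x = [20, 23, 34, 26, 30, 22, 40, 38, 37]   # illustrative sample
c = -3                                      # arbitrary constant; negative on purpose

var_x, sd_x = variance(x), stdev(x)

shifted = [xi + c for xi in x]              # Y = X + c
scaled = [c * xi for xi in x]               # Y = c * X

print("variance:", var_x, variance(shifted), variance(scaled))
print("std dev: ", sd_x, stdev(shifted), stdev(scaled))
```

Adding a constant leaves both the variance and the standard deviation unchanged. Multiplying by c multiplies the variance by c squared and the standard deviation by the absolute value of c, which is why the standard deviation stays non-negative even when c is negative.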
Measures of Spread: Quantiles and Percentiles
The median was the sample value that had 50% of the data below (or above) it.
More generally, we define the pth percentile as the value which has p% of the sample values less than or equal to it.
Let k = p*N/100.
(1) If k is an integer, the pth percentile is the average of $X_{(k)}$ and $X_{(k+1)}$.
(2) If k is not an integer, the pth percentile is $X_{([k]+1)}$.
Here, [k] is the largest integer smaller than k (i.e. truncate the decimal).
Quartiles are the (25,50,75) percentiles. The interquartile range is Q.75-Q.25 and is another useful measure of spread. The middle 50% of the data is found between Q.25 and Q.75.
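The percentile rule translates directly into code. The sketch below follows the two cases as stated; the function name `percentile` and the restriction to 0 < p < 100 are assumptions made for the example. Other software may use different percentile conventions and give slightly different answers.

```python
def percentile(values, p):
    """pth percentile using the rule k = p*N/100 (requires 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    x = sorted(values)                 # x[0] is X(1), x[1] is X(2), ...
    k = p * len(x) / 100
    if k == int(k):                    # case (1): average X(k) and X(k+1)
        k = int(k)
        return (x[k - 1] + x[k]) / 2
    return x[int(k)]                   # case (2): X([k]+1), with [k] = truncated k


ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
q25, q50, q75 = (percentile(ages, p) for p in (25, 50, 75))
print("Q.25 =", q25, " median =", q50, " Q.75 =", q75, " IQR =", q75 - q25)
```

For the 10 ages, k = 5 at p = 50, so the rule averages the 5th and 6th ordered values and agrees with the median formula for even N.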
Boxplot
A graphical display of the quartiles of a dataset, as well as the range. Extremely large or small values are also identified.
[Figure: boxplots of increment in systolic B.P. (vertical axis, -20 to 40) by drug (1, 2, 3, 4)]
Boxplot, construction
1. Order the data
2. Compute the median and draw a line at this value.
3. Compute the hinges, Q.25 and Q.75.
4. Draw lines at the hinges (quartiles) and enclose in a box.
5. Compute the IQR = Q.75- Q.25 .
6. Compute the upper fence = Q.75 + 1.5*IQR and the lower fence = Q.25 - 1.5*IQR. Observations beyond the fences are called outliers.
7. Draw a line (whisker) from Q.75 to the largest non-outlying value
8. Draw a line from Q.25 to the smallest non-outlying value.
9. Mark points outside of the fences (outliers).
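The numbers needed to draw a boxplot can be computed step by step. The sketch below assumes the quartiles are found with the percentile rule k = p*N/100 from these notes (software packages often use slightly different hinge definitions); the names `quantile` and `boxplot_stats` are hypothetical.

```python
def quantile(sorted_x, p):
    """pth percentile of already-sorted data, using k = p*N/100 (0 < p < 100)."""
    k = p * len(sorted_x) / 100
    if k == int(k):
        return (sorted_x[int(k) - 1] + sorted_x[int(k)]) / 2
    return sorted_x[int(k)]


def boxplot_stats(values):
    """Median, hinges, IQR, fences, whisker ends and outliers."""
    x = sorted(values)                                   # step 1
    q1, med, q3 = (quantile(x, p) for p in (25, 50, 75)) # steps 2-3
    iqr = q3 - q1                                        # step 5
    lower_fence = q1 - 1.5 * iqr                         # step 6
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    outliers = [v for v in x if v < lower_fence or v > upper_fence]
    return {
        "median": med, "q1": q1, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whiskers": (min(inside), max(inside)),          # steps 7-8
        "outliers": outliers,                            # step 9
    }


ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46, 97]
stats = boxplot_stats(ages)
print(stats)
```

For the 11 ages, only the value 97 falls beyond the upper fence, so the upper whisker stops at 64 and 97 is marked as a separate point.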
Skewness
Both histograms and boxplots can show us that a distribution is skewed. Skewness refers to the symmetry or lack of symmetry in the shape of the distribution. Neither the mean nor the variance tells us about symmetry. Skewness is based on the average of $(X_i - \bar{X})^3$.
1. Skew = 0; “symmetric”; median = mean
2. Skew > 0; “positive” or “right” skewed; median<mean
3. Skew < 0; “negative” or “left” skewed; median>mean
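A small calculation illustrates the sign of the skewness. The notes only say that skewness is based on the average cubed deviation; dividing by the cubed standard deviation (with divisor N) is a common standardization and is an assumption here, as is the function name `skewness`.

```python
from statistics import mean, median


def skewness(x):
    """Average cubed deviation, divided by the cubed standard deviation (divisor N)."""
    n = len(x)
    xbar = sum(x) / n
    m2 = sum((xi - xbar) ** 2 for xi in x) / n   # average squared deviation
    m3 = sum((xi - xbar) ** 3 for xi in x) / n   # average cubed deviation
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97]   # ages, including 97
left_skewed = [-v for v in right_skewed]                       # mirror image

for name, data in [("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)]:
    print(f"{name:9s} skew = {skewness(data):6.3f}  "
          f"median = {median(data)}  mean = {mean(data):.1f}")
```

Cubing keeps the sign of each deviation, so a long right tail produces a positive value and a long left tail a negative one. In these examples the median also falls below the mean when the skew is positive and above it when the skew is negative.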
Density Curves
We have seen how continuous data can be summarized with a histogram. Although histograms are summaries of the data, they still involve keeping track of a lot of numbers (i.e. the height and location of each bar). Is there a way to summarize the entire distribution of our data with just a few numbers?
YES! We can use a type of mathematical model known as a density curve.
We saw previously that we can use a histogram to determine the relative frequency (= proportion = probability) of obtaining observations in a particular interval.
If a particular density curve provides a good fit to our data then we can use the density curve to approximate these probabilities. In particular, the probability of obtaining an observation in a particular interval is given by the area under the density curve.
Note: For continuous data, it does not make sense to talk about the probability of an individual value (i.e. P(X = 6) = 0.0)
Relative frequency of scores less than 6 from histogram = .303
Probability of scores less than 6 from density curve = .293
Probability density function
1. A function, typically denoted f(x), that gives probabilities based on the area under the curve.
2. f(x) ≥ 0
3. Total area under the function f(x) is 1.0:
$\int_{-\infty}^{\infty} f(x)\,dx = 1.0$
Cumulative distribution function
The cumulative distribution function, F(t), tells us the total probability less than or equal to some value t.
$F(t) = P(X \le t)$
This is analogous to the cumulative relative frequency.
Normal Distribution
• A common model for continuous data
• Bell-shaped curve
takes values between -∞ and +∞
symmetric about mean
mean=median=mode
• Examples
birthweights
blood pressure
CD4 cell counts (perhaps transformed)
Specifying the mean and variance of a normal distribution completely determines the probability distribution function and, therefore, all probabilities (just 2 numbers!).
The normal probability density function is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$
where $\pi \approx 3.14$ (a constant)
Notice that the normal distribution has two parameters:
$\mu$ = the mean of X
$\sigma$ = the standard deviation of X
We write $X \sim N(\mu, \sigma^2)$. The standard normal distribution is a special case where $\mu = 0$ and $\sigma = 1$.
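The normal density is simple to evaluate directly. The sketch below writes the standard formula f(x) = exp(-(x - μ)²/(2σ²)) / (σ√(2π)) as a function, compares it with `NormalDist.pdf` from the Python standard library, and approximates the total area with a crude sum. The function name `normal_pdf`, the grid limits of -8 to 8 and the step size are illustrative choices.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))


# Compare with the standard library at one arbitrary point.
by_formula = normal_pdf(1.3, mu=2.0, sigma=0.5)
by_library = NormalDist(mu=2.0, sigma=0.5).pdf(1.3)
print(by_formula, by_library)

# Approximate the total area under the standard normal curve
# by summing thin rectangles between -8 and 8.
h = 0.001
area = sum(normal_pdf(-8 + i * h) for i in range(16001)) * h
print("approximate total area:", round(area, 6))
```

Only the two parameters enter the function, which is the sense in which the mean and standard deviation determine every probability.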
For a standard normal distribution ...
In general,
~68% of data within 1 $\sigma$ of $\mu$
~95% of data within 2 $\sigma$ of $\mu$
~99.7% of data within 3 $\sigma$ of $\mu$
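These percentages are areas under the normal curve and can be recovered from the cumulative distribution function. The sketch below uses `NormalDist` from the Python standard library; the helper name `prob_within` and the example distribution with mean 120 and standard deviation 15 are illustrative assumptions.

```python
from statistics import NormalDist

standard = NormalDist()            # mean 0, standard deviation 1


def prob_within(k, dist=standard):
    """Area under the density within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sd: {100 * p:.1f}%")

# The same areas hold for any normal distribution, e.g. mean 120 and sd 15.
other = NormalDist(mu=120, sigma=15)
print(round(prob_within(2, other), 4))
```

Because the areas depend only on the number of standard deviations from the mean, the same three percentages apply to every normal distribution.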
Summary
• Types of data
1. Categorical
2. Continuous
• Numerical Summaries
1. Location - mean, median, mode.
2. Spread - range, variance, standard deviation, IQR
3. Shape - skewness
• Graphical Summaries
1. Barplot
2. Stem and Leaf plot
3. Histogram
4. Boxplot
• Mathematical Summaries
1. Density curves