Fall 2002 Biostat 511

Medical Biometry I (Biostatistics 511)

Instructor: Jim Hughes

Cartoons and images in these notes are from:

Gonick L. Cartoon Guide to Statistics. HarperPerennial, New York, 1993.

Fisher L and van Belle G. Biostatistics: A Methodology for the Health Sciences. Wiley, New York, 1993.


Typical Public Health, Medical or Biological Questions About Populations

• Does formula feeding increase the chance of survival of infants born to HIV positive mothers, compared to breastfeeding, in a developing country?

• How do we estimate the concentration of antibody based on reactivity of serial dilutions?

• Are there trends in mortality and homicide rates by urban setting, age, gender, or race?

• How do we model survival following heart bypass surgery? Are there patient characteristics that predict survival? How does the 1 year, 2 year, 5 year survival of bypass patients compare to individuals treated medically for heart disease?

• How do attitudes toward enrollment in an HIV vaccine study vary by geography, age, or education?

• How does physician experience influence survival of patients with HIV?


Biostatistics 511

• Introduction to the basic concepts of statistics as applied to problems in public health or medicine

• Definitions:

1. Data - numerical facts, measurements, or observations obtained from an investigation aimed at answering a question.

2. Statistics - the science and art of obtaining reliable results and conclusions from data that is subject to variation.

3. Biostatistics - the application of statistics to the biologic sciences, medicine and public health.


Role of Statistics in Public Health and Medicine

Science

1. Idea or Question

2. Collect data/make observations

3. Describe data / observations

4. Assess the strength of evidence for / against the hypothesis

Statistics

1. Math. model / hypothesis

2. Study design

3. Descriptive statistics

4. Inferential statistics


Descriptive Statistics and Exploratory Data Analysis - Univariate

• Types of data

1. Categorical

2. Continuous

• Numerical Summaries

1. Location - mean, median, mode.

2. Spread - range, variance, standard deviation, IQR

3. Shape - skewness

• Graphical Summaries

1. Barplot

2. Stem and Leaf plot

3. Histogram

4. Boxplot

• Mathematical Summaries

1. Density curves


Descriptive Statistics (Exploratory)

• “Exploratory data analysis is detective work - numerical detective work”

• “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone- the first step”

• organization, summarization, and presentation of data

• “Show me the data!”

Tools:

• tables

• graphs

• numerical summaries

John Tukey, Exploratory Data Analysis, Addison-Wesley, 1977


Inferential Statistics (Confirmatory)

• Generalization of conclusions:

sample → population

• Assess strength of evidence

• Make comparisons

• Make predictions

Tools:

• Modeling

• Estimation and Confidence Intervals

• Hypothesis Testing


Example: Effect of seat belt use on accident fatality

                 Seat Belt
Driver           Worn           Not worn

dead             10             20
alive            40             30

Total            50             50
Fatality Rate    10/50 (20%)    20/50 (40%)


But, suppose...

                 Impact Speed
                 < 40 mph                 > 40 mph
                 Seat belt                Seat belt
Driver           Worn        Not worn     Worn        Not worn

dead             3           2            7           18
alive            27          18           13          12

Total            30          20           20          30
Fatality Rate    10%         10%          35%         60%

How does this affect your inference?
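The comparison can be checked numerically. The sketch below is illustrative only: it recomputes the fatality rates within each impact-speed group from the counts in the tables and then collapses over speed to recover the overall rates. The names `counts`, `fatality_rate`, `stratified`, and `crude` are made up for this example.

```python
# (dead, alive) counts by impact speed and seat belt use, copied from the tables.
counts = {
    ("< 40 mph", "worn"): (3, 27),
    ("< 40 mph", "not worn"): (2, 18),
    ("> 40 mph", "worn"): (7, 13),
    ("> 40 mph", "not worn"): (18, 12),
}


def fatality_rate(dead, alive):
    """Proportion of drivers who died."""
    return dead / (dead + alive)


# Fatality rate within each speed group.
stratified = {key: fatality_rate(*cell) for key, cell in counts.items()}

# Ignore speed: add the counts over both speed groups for each seat belt status.
crude = {}
for belt in ("worn", "not worn"):
    dead = sum(d for (_, b), (d, _) in counts.items() if b == belt)
    alive = sum(a for (_, b), (_, a) in counts.items() if b == belt)
    crude[belt] = fatality_rate(dead, alive)

for key, rate in stratified.items():
    print(key, f"{rate:.0%}")
print("ignoring speed:", {belt: f"{rate:.0%}" for belt, rate in crude.items()})
```

Within the low-speed group the two rates are equal, while the rates that ignore speed differ by a factor of two, so part of the overall difference reflects who was driving at which speed.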


Types of Data

• Categorical (qualitative)

1) Nominal scale - no natural order

- gender, marital status, race

2) Ordinal scale

- severity scale, good/better/best

• Numerical (quantitative)

1) Discrete - (few) integer values

- number of children in a family

2) Continuous - measure to arbitrary precision

- blood pressure, weight

Why bother?

PROPER DISPLAYS

PROPER ANALYSIS

In statistics we deal with data - measurements or observations on individuals (or, more generally, on the “units of observation”).


Categorical data

For categorical data we usually summarize with counts. A simple visual summary is the bar graph.

[Figure: bar graph of risk factor for HIV; vertical axis: Count (0 to 80); categories: Gay, Heter, IVDU, Occup; N = 74]

Notes:

• vertical axis can be count or percent

• in the above example, counts do not add to 74 because individuals can have multiple risk factors

• tabular presentation may be more parsimonious for such data


Continuous Data

Consider the 11 ages:

21,32,34,34,42,44,46,48,52,56,64

Age is a quantitative variable so a barplot doesn’t make sense. Here we are more interested in characteristics of the distribution of ages:

where is the center of the age distribution (e.g. the average)?

how much does age vary?

are there some values far from the bulk of the data?

We would like some visual tools to help us answer these questions.


Stem and Leaf Diagram

We could group the data and tally the frequencies:

20: X
30: XXX
40: XXXX
50: XX
60: X

But why “hide” the details? Instead, we’ll use the 10’s place as stems and the units as leaves:

Stem | Leaves (age)
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4

The stemplot or stem and leaf plot is a quick, informative summary for small datasets.


Stem and Leaf Diagram, construction

• All but the last digit form the stem.

• Stems are stacked vertically from the smallest to the largest.

• The leaf is the last digit in a value and is placed next to the appropriate stem (leaves run outward from smallest to largest)

• Shows macro information - general shape, spread, range.

• Shows micro information - all values shown.

• Fast and easy to construct.
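The construction rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative whole-number data; the function name `stem_and_leaf` is made up for this example, and showing empty stems is a choice made here so that gaps in the data stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return (stem, leaves) rows: the stem is all but the last digit, the leaf is the last digit."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in increasing order
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Stack stems from smallest to largest, keeping empty stems.
    return [(s, rows.get(s, [])) for s in range(min(rows), max(rows) + 1)]


ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
for stem, leaves in stem_and_leaf(ages):
    print(f"{stem}* | {''.join(str(leaf) for leaf in leaves)}")
```

Run on the 11 ages, this prints one row per stem from 2* to 6*, with every original value recoverable from its stem and leaf.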


Back-to-back Stem and Leaf

To compare two sets of data, use a back-to-back stem and leaf diagram

[Fig 1. Back-to-back stem and leaf diagram (split stems from 9* to 13*; Calcium on one side, Placebo on the other): systolic blood pressure after 12 weeks treatment with daily calcium supplement or placebo]

(Unfortunately, you can’t do this in Stata)


Methods for Grouped Data

The stem and leaf effectively groups continuous data into intervals. Let’s extend this idea. The following terms are useful for grouped data:

• frequency - the number of times the value occurs in the data.

• cumulative frequency - the number of observations that are equal to or smaller than the value.

• relative frequency - the % of the time that the value occurs (frequency/N).

• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/N).


Example - Birthweights

Sample of 100 birthweights in ounces. Complete the following table ...

Interval              Midpt    Freq.   Cum. Freq.   Rel. Freq.   Cum. Rel. Freq.

29.5 < W < 69.5        49.5      5
69.5 < W < 89.5        79.5     10
89.5 < W < 99.5        94.5     11
99.5 < W < 109.5      104.5     19
109.5 < W < 119.5     114.5     17
119.5 < W < 129.5     124.5     20
129.5 < W < 139.5     134.5     12
139.5 < W < 169.5     154.5      6
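One way to check a completed table is to compute the remaining columns from the frequencies. The sketch below is illustrative only; it uses the interval end-points and frequencies listed in the table, and the variable names are made up for this example.

```python
from itertools import accumulate

# Interval end-points (ounces) and frequencies from the birthweight table.
intervals = [(29.5, 69.5), (69.5, 89.5), (89.5, 99.5), (99.5, 109.5),
             (109.5, 119.5), (119.5, 129.5), (129.5, 139.5), (139.5, 169.5)]
freq = [5, 10, 11, 19, 17, 20, 12, 6]

n = sum(freq)                             # sample size N
cum_freq = list(accumulate(freq))         # running total of the frequencies
rel_freq = [f / n for f in freq]          # frequency / N
cum_rel_freq = [c / n for c in cum_freq]  # cumulative frequency / N

for (lo, hi), f, cf, rf, crf in zip(intervals, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo:6.1f} to {hi:6.1f}  {f:3d}  {cf:4d}  {rf:5.2f}  {crf:5.2f}")
```

The last cumulative frequency should equal N and the last cumulative relative frequency should equal 1, which is a quick check on the arithmetic.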


Histogram

• Similar to a barplot, but used for continuous data.

• Divide the data into intervals.

• A rectangle is constructed with the base being the interval end-points and the height chosen so the area of the rectangle is proportional to the frequency (if the width is one unit for all intervals, then height=frequency).

• Shape can be sensitive to number and choice of intervals (rule of thumb: number of bins is the smaller of √n or 10*log10(n))

• Histograms are more effective for moderate to large datasets.

Note: A histogram is a special type of bar graph where variable interval widths are permitted.


Example - Birthweights

[Figures: two histograms of the birthweight data, one labeled “Right” and one labeled “Wrong”]

Note: You can determine relative frequency and cumulative relative frequency from a histogram.
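When interval widths differ, the bar height has to be frequency divided by width for the area to stay proportional to the frequency. The sketch below illustrates this with the birthweight intervals and also evaluates the bin-count rule of thumb. The function names are made up for this example, and rounding the rule of thumb to a whole number is an assumption made here.

```python
import math


def rule_of_thumb_bins(n):
    """Smaller of sqrt(n) and 10*log10(n), rounded to a whole number (rounding is assumed here)."""
    return round(min(math.sqrt(n), 10 * math.log10(n)))


def bar_heights(edges, freq):
    """Height = frequency / width, so that each bar's area is proportional to its frequency."""
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return [f / w for f, w in zip(freq, widths)]


# Interval end-points (ounces) and frequencies from the birthweight table.
edges = [29.5, 69.5, 89.5, 99.5, 109.5, 119.5, 129.5, 139.5, 169.5]
freq = [5, 10, 11, 19, 17, 20, 12, 6]

heights = bar_heights(edges, freq)
print("suggested number of bins:", rule_of_thumb_bins(sum(freq)))
print("heights (count per ounce):", [round(h, 3) for h in heights])
```

The first interval is 40 ounces wide and holds 5 observations, so its bar is much shorter than a 10-ounce interval holding the same count would be.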


Characteristics of Distributions

Shape

number of modes (peaks)

symmetry

Center

where is the center?

Spread

how much variation?

outliers?

Other features

boundaries

digit preference


Examples

[Figure: three example histograms; vertical axis: Fraction; horizontal axis: Var, spanning 0 to 15, -4 to 4, and -2 to 6 respectively]


Notation

Suppose we have N measurements of a particular variable. We will denote these N measurements as:

$X_1, X_2, X_3, \ldots, X_N$

where $X_1$ is the first measurement, $X_2$ is the second, etc.

Sometimes it is useful to order the measurements. We denote the ordered measurements as:

$X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(N)}$

where $X_{(1)}$ is the smallest value and $X_{(N)}$ is the largest.


Arithmetic Mean

The arithmetic mean is the most common measure of the central location of a sample. We use $\bar{X}$ to refer to the mean and define it as:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

The symbol $\sum$ is shorthand for “sum” over a specified range. For example:

$$\sum_{i=1}^{4} X_i = (X_1 + X_2 + X_3 + X_4)$$


Some Properties of the Arithmetic Mean

Often we wish to transform variables. Linear changes to variables (i.e. Y = a*X+b) impact the mean in a predictable way:

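(1) Adding (or subtracting) a constant to all values:

$$Y_i = X_i + c \quad\Rightarrow\quad \bar{Y} = \bar{X} + c$$

(2) Multiplication (or division) by a constant:

$$Y_i = cX_i \quad\Rightarrow\quad \bar{Y} = c\bar{X}$$

Does this nice behavior happen for any change? NO! (show that $\overline{\log X} \neq \log \bar{X}$)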
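A quick numerical check of these properties, offered as an illustration rather than a proof. It uses the 11 ages from the stem and leaf example; the constant `c = 10` is an arbitrary choice for this sketch.

```python
import math
from statistics import fmean

x = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]  # the 11 ages used earlier
c = 10                                            # arbitrary constant for the illustration

mean_x = fmean(x)
mean_shifted = fmean([xi + c for xi in x])        # Y = X + c
mean_scaled = fmean([c * xi for xi in x])         # Y = c * X
mean_of_logs = fmean([math.log(xi) for xi in x])  # mean of log(X)
log_of_mean = math.log(mean_x)                    # log of mean(X)

print("mean:", mean_x)
print("mean after adding c:", mean_shifted)
print("mean after multiplying by c:", mean_scaled)
print("mean of logs vs log of mean:", mean_of_logs, log_of_mean)
```

The shifted and scaled means follow the linear rules, while the mean of the logs and the log of the mean come out different.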


Median

Another measure of central tendency is the median - the “middle one”. Half the values are below the median and half are above. Given the ordered sample, $X_{(i)}$, the median is:

N odd:

$$\text{Median} = X_{\left(\frac{N+1}{2}\right)}$$

N even:

$$\text{Median} = \frac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right)$$

Mode

The mode is the most frequently occurring value in the sample.


Example: Central Location

Suppose the ages in years of the first 10 subjects enrolled in your study are:

34,24,56,52,21,44,64,34,42,46

Then the mean age of this group is:

$$\bar{X} = (34 + 24 + 56 + 52 + 21 + 44 + 64 + 34 + 42 + 46)/10 = 417/10 = 41.7 \text{ years}$$

To find the median, first order the data:

21,24,34,34,42,44,46,52,56,64

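$$\text{Median} = \frac{1}{2}\left(X_{\left(\frac{10}{2}\right)} + X_{\left(\frac{10}{2}+1\right)}\right) = \frac{1}{2}(42 + 44) = 43 \text{ years}$$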

The mode is 34 years.


Suppose the next patient enrolls and their age is 97 years.

How do the mean and median change?

$$\bar{X} = (34 + 24 + 56 + 52 + 21 + 44 + 64 + 34 + 42 + 46 + 97)/11 = 514/11 = 46.7 \text{ years}$$

To get the median, order the data:

21,24,34,34,42,44,46,52,56,64,97

$$\text{Median} = X_{(6)} = 44 \text{ years}$$

If the age were recorded incorrectly as 977, instead of 97, what would the new median be? What would the new mean be?
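The effect of one extreme value can be explored directly. The sketch below is illustrative; it uses Python's standard `statistics` module, and the helper name `center` is made up for this example.

```python
from statistics import mean, median

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # first 10 subjects


def center(sample):
    """Return (mean rounded to one decimal, median) for a sample."""
    return round(mean(sample), 1), median(sample)


first_ten = center(ages)
with_97 = center(ages + [97])    # 11th subject, age recorded correctly
with_977 = center(ages + [977])  # 11th subject, age recorded incorrectly

print("first 10 subjects:", first_ten)
print("adding age 97:", with_97)
print("adding age 977:", with_977)
```

Comparing the last two lines of output shows which summary moves when the recording error is introduced and which one does not.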


Comparison of Mean and Median

• Mean is sensitive to a few very large (or small) values - “outliers”

• Median is “resistant” to outliers

• Mean is attractive mathematically

• 50% of sample is above the median, 50% of sample is below the median.


Variation is important!


Measures of Spread: Range

The range is the difference between the largest and smallest observations:

$$\text{Range} = \text{Maximum} - \text{Minimum} = X_{(N)} - X_{(1)}$$

Alternatively, the range may be denoted as the pair of observations:

$$\text{Range} = (\text{Minimum}, \text{Maximum}) = (X_{(1)}, X_{(N)})$$

The latter form is useful for data quality control.

Disadvantage: the sample range tends to increase as the sample size increases.

In the ages example, for the first 10 subjects, the range is

Range = 64 - 21= 43

or (21,64)


Measures of Spread: Variance

Consider the following two samples:

20,23,34,26,30,22,40,38,37

30,29,30,31,32,30,28,30,30

These samples have the same mean and median, but the second is much less variable. The average “distance” from the center is quite small in the second. We use the variance to describe this feature:

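$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$

$$s^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} X_i^2 - N\bar{X}^2\right)$$

$$s^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} X_i^2 - \Big(\sum_{i=1}^{N} X_i\Big)^2 / N\right)$$

The standard deviation is simply the square root of the variance:

$$\text{standard deviation} = s = \sqrt{s^2}$$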


For the first sample, we obtain:

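$$\bar{X} = 30, \qquad \sum_{i=1}^{9} X_i^2 = 8558$$

$$s^2 = \frac{1}{9-1}\left(8558 - 9(30)^2\right) = \frac{1}{8}\left(8558 - 8100\right) = 57.25 \text{ yr}^2$$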

For the second sample, we obtain:

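$$\bar{X} = 30, \qquad \sum_{i=1}^{9} X_i^2 = 8110$$

$$s^2 = \frac{1}{9-1}\left(8110 - 9(30)^2\right) = \frac{1}{8}\left(8110 - 8100\right) = 1.25 \text{ yr}^2$$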
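The two samples can be run through both forms of the variance formula as a check on the hand calculation. This is a minimal sketch; the function names are made up for this example, and `statistics.variance` from the Python standard library (which also divides by N - 1) is included for comparison.

```python
from statistics import variance


def variance_definition(x):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)


def variance_shortcut(x):
    """(Sum of squares - N * mean^2) / (N - 1), the computational form."""
    n = len(x)
    xbar = sum(x) / n
    return (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)


sample1 = [20, 23, 34, 26, 30, 22, 40, 38, 37]
sample2 = [30, 29, 30, 31, 32, 30, 28, 30, 30]

for name, sample in (("first", sample1), ("second", sample2)):
    print(name, sum(xi ** 2 for xi in sample),
          variance_definition(sample), variance_shortcut(sample), variance(sample))
```

All three computations should agree for each sample; recomputing the sum of squares from the listed values is the step most prone to slips when done by hand.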


Properties of the variance/standard deviation

• Variance and standard deviation are ALWAYS greater than or equal to zero.

• Linear changes are a little trickier than they were for the mean:

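(1) Add/subtract a constant: $Y_i = X_i + c$

$$s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i + c - (\bar{X} + c)\right)^2 = s_X^2$$

(2) Multiply/divide by a constant: $Y_i = cX_i$

$$s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(cX_i - c\bar{X}\right)^2 = \frac{1}{N-1}\sum_{i=1}^{N} c^2\left(X_i - \bar{X}\right)^2 = c^2 s_X^2$$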

So what happens to the standard deviation?
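A numerical experiment can help answer this. The sketch below is illustrative only: it shifts and rescales one of the earlier samples by an arbitrary constant (`c = 3` is an assumption for this example) and compares the variance and standard deviation before and after, using the standard library's `variance` and `stdev`.

```python
from statistics import stdev, variance

x = [20, 23, 34, 26, 30, 22, 40, 38, 37]  # first sample from the variance example
c = 3                                     # arbitrary constant for the illustration

shifted = [xi + c for xi in x]            # Y = X + c
scaled = [c * xi for xi in x]             # Y = c * X

print("original:", variance(x), stdev(x))
print("shifted: ", variance(shifted), stdev(shifted))
print("scaled:  ", variance(scaled), stdev(scaled))
print("ratio of standard deviations (scaled / original):", stdev(scaled) / stdev(x))
```

Adding a constant leaves both measures unchanged, while multiplying by c multiplies the variance by c squared and the standard deviation by the absolute value of c.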


Measures of Spread: Quantiles and Percentiles

The median was the sample value that had 50% of the data below (or above) it.

More generally, we define the pth percentile as the value which has p% of the sample values less than or equal to it.

Let k = p*N/100.

(1) If k is an integer, pth percentile is the average of $X_{(k)}$ and $X_{(k+1)}$.

(2) If k is not an integer, pth percentile is $X_{([k]+1)}$.

Here, [k] is the largest integer smaller than k (i.e. truncate the decimal).

Quartiles are the (25,50,75) percentiles. The interquartile range is Q.75-Q.25 and is another useful measure of spread. The middle 50% of the data is found between Q.25 and Q.75.
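The percentile rule above translates into a short function. This is a sketch under stated assumptions: p is a whole number strictly between 0 and 100, and the function name `percentile` is made up for this example. Other software may use slightly different percentile rules, so results can differ from packaged routines.

```python
def percentile(values, p):
    """p-th percentile (integer p, 0 < p < 100) using the rule k = p*N/100 from these notes."""
    x = sorted(values)                      # x[0] is X(1), x[1] is X(2), ...
    k, remainder = divmod(p * len(x), 100)  # k is [k]; remainder == 0 means p*N/100 is an integer
    if remainder == 0:
        return (x[k - 1] + x[k]) / 2        # average of X(k) and X(k+1)
    return x[k]                             # X([k]+1)


ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
q25, q50, q75 = (percentile(ages, p) for p in (25, 50, 75))
print("quartiles:", q25, q50, q75)
print("interquartile range:", q75 - q25)
```

For the 10 ages the 50th percentile computed this way matches the median found earlier.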


Boxplot

A graphical display of the quartiles of a dataset, as well as the range. Extremely large or small values are also identified.

[Figure: boxplots of Increment in Systolic B.P. (vertical axis, -20 to 40) by Drug (1 to 4)]


Boxplot, construction

1. Order the data

2. Compute the median and draw a line at this value.

3. Compute the hinges, Q.25 and Q.75.

4. Draw lines at the hinges (quartiles) and enclose in a box.

5. Compute the IQR = Q.75- Q.25 .

6. Compute the fences:

upper fence = Q.75 + 1.5*IQR

lower fence = Q.25 - 1.5*IQR

Observations beyond the fences are called outliers.

7. Draw a line (whisker) from Q.75 to the largest non-outlying value

8. Draw a line from Q.25 to the smallest non-outlying value.

9. Mark points outside of the fences (outliers).
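The numerical part of these steps can be sketched as follows. This illustration assumes the quartiles are computed with the percentile rule given earlier in these notes (hinges in other software may differ slightly), and the names `percentile` and `boxplot_summary` are made up for this example. The data are the 11 ages including the 97-year-old.

```python
def percentile(values, p):
    """p-th percentile (integer p, 0 < p < 100) using the rule k = p*N/100 from these notes."""
    x = sorted(values)
    k, remainder = divmod(p * len(x), 100)
    if remainder == 0:
        return (x[k - 1] + x[k]) / 2
    return x[k]


def boxplot_summary(values):
    """Quartiles, fences, whisker ends, and outliers, following the construction steps."""
    x = sorted(values)
    q25, med, q75 = (percentile(x, p) for p in (25, 50, 75))
    iqr = q75 - q25
    lower_fence = q25 - 1.5 * iqr
    upper_fence = q75 + 1.5 * iqr
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    return {
        "median": med, "q25": q25, "q75": q75, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": min(inside),    # smallest non-outlying value
        "whisker_high": max(inside),   # largest non-outlying value
        "outliers": [v for v in x if v < lower_fence or v > upper_fence],
    }


ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46, 97]
summary = boxplot_summary(ages)
for name, value in summary.items():
    print(name, value)
```

Only the drawing is left out; the returned values are exactly the quantities the box, whiskers, and outlier marks are based on.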


Skewness

Both histograms and boxplots can show us that a distribution is skewed. Skewness refers to the symmetry or lack of symmetry in the shape of the distribution. Neither the mean nor the variance tells us about symmetry. Skewness is based on the average of $(X_i - \bar{X})^3$.

1. Skew = 0; “symmetric”; median = mean

2. Skew > 0; “positive” or “right” skewed; median<mean

3. Skew < 0; “negative” or “left” skewed; median>mean
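As a rough illustration, the average cubed deviation can be computed directly. The notes do not say how the average is standardized, so dividing by the 1.5 power of the average squared deviation is an assumption here (it is one common convention). The function names and the small symmetric sample are made up for this example.

```python
from statistics import fmean, median


def third_central_moment(x):
    """Average of (X_i - mean)^3."""
    xbar = fmean(x)
    return fmean([(xi - xbar) ** 3 for xi in x])


def skewness(x):
    """Average cubed deviation divided by (average squared deviation)**1.5 (assumed convention)."""
    xbar = fmean(x)
    m2 = fmean([(xi - xbar) ** 2 for xi in x])
    return third_central_moment(x) / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]                                   # made-up symmetric sample
right_skewed = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97]   # ages including the 97-year-old

for name, sample in (("symmetric", symmetric), ("right skewed", right_skewed)):
    print(name, round(skewness(sample), 3), fmean(sample), median(sample))
```

In the second sample the single large age pulls the mean above the median and makes the skewness positive.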


Density Curves

We have seen how continuous data can be summarized with a histogram. Although histograms are summaries of the data, they still involve keeping track of a lot of numbers (i.e. the height and location of each bar). Is there a way to summarize the entire distribution of our data with just a few numbers?

YES! We can use a type of mathematical model known as a density curve.


Density Curves

We saw previously that we can use a histogram to determine the relative frequency (= proportion = probability) of obtaining observations in a particular interval.

If a particular density curve provides a good fit to our data then we can use the density curve to approximate these probabilities. In particular, the probability of obtaining an observation in a particular interval is given by the area under the density curve.

Note: For continuous data, it does not make sense to talk about the probability of an individual value (i.e. P(X = 6) = 0.0)


Relative frequency of scores less than 6 from histogram = .303

Probability of scores less than 6 from density curve = .293


Probability density function

1. A function, typically denoted f(x), that gives probabilities based on the area under the curve.

2. f(x) ≥ 0

3. Total area under the function f(x) is 1.0.

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.0$$

Cumulative distribution function

The cumulative distribution function, F(t), tells us the total probability less than or equal to some value t.

$$F(t) = P(X \le t)$$

This is analogous to the cumulative relative frequency.


Normal Distribution

• A common model for continuous data

• Bell-shaped curve

takes values between $-\infty$ and $+\infty$

symmetric about mean

mean=median=mode

• Examples

birthweights

blood pressure

CD4 cell counts (perhaps transformed)


Normal Distribution

Specifying the mean and variance of a normal distribution completely determines the probability distribution function and, therefore, all probabilities (just 2 numbers!).

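The normal probability density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$

where

$\pi \approx 3.14$ (a constant)

Notice that the normal distribution has two parameters:

$\mu$ = the mean of X

$\sigma$ = the standard deviation of X

We write $X \sim N(\mu, \sigma^2)$. The standard normal distribution is a special case where $\mu = 0$ and $\sigma = 1$.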


For a standard normal distribution ...

In general,

~68% of data within 1 $\sigma$ of $\mu$

~95% of data within 2 $\sigma$ of $\mu$

~99.7% of data within 3 $\sigma$ of $\mu$
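These percentages are areas under the normal density curve and can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a minimal sketch; the helper name `prob_within` and the example values of 120 and 15 for the mean and standard deviation are made up for this illustration.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma^2) density between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    standard = prob_within(k)                     # standard normal: mu = 0, sigma = 1
    other = prob_within(k, mu=120.0, sigma=15.0)  # hypothetical mean and standard deviation
    print(k, round(standard, 4), round(other, 4))
```

The two columns agree, which illustrates that the rule depends only on the number of standard deviations from the mean, not on the particular mean or standard deviation.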


Summary

• Types of data

1. Categorical

2. Continuous

• Numerical Summaries

1. Location - mean, median, mode.

2. Spread - range, variance, standard deviation, IQR

3. Shape - skewness

• Graphical Summaries

1. Barplot

2. Stem and Leaf plot

3. Histogram

4. Boxplot

• Mathematical Summaries

1. Density curves
