Lecture 4 Chapter 2. Numerical descriptors. Objectives (PSLS Chapter 2) Describing distributions...


Lecture 4 Chapter 2. Numerical descriptors

Objectives (PSLS Chapter 2)

Describing distributions with numbers

Measure of center: mean and median (Meas. Cent. Award)

Measure of spread: quartiles, standard deviation, IQR (Meas. Var. Award)

The five-number summary and boxplots (SUMS Award)

Dealing with outliers (outliers award)

Choosing among summary statistics (All Numeric Awards)

Organizing a statistical problem (Foundational)

Measure of center: the mean

The mean, or arithmetic average

To calculate the average (mean) of a data set, add all values, then divide by the number of individuals. It is the “center of mass.”

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

n is the sample size
x is the variable

Measure of center: the median

The median is the midpoint of a distribution—the number such that

half of the observations are smaller, and half are larger.

1) Sort observations from smallest to largest. n = number of observations

2) The location of the median is (n + 1)/2 in the sorted list.

Rank  Value    Rank  Value    Rank  Value    Rank  Value
  1    0.6       7    2.3      13    3.4      19    4.2
  2    1.2       8    2.3      14    3.6      20    4.5
  3    1.5       9    2.5      15    3.7      21    4.7
  4    1.6      10    2.8      16    3.8      22    4.9
  5    1.9      11    2.9      17    3.9      23    5.3
  6    2.1      12    3.3      18    4.1      24    5.6

n = 24 (n+1)/2 = 12.5

Median = (3.3+3.4)/2 = 3.35

If n is even, the median is the mean of the two center observations

Rank  Value    Rank  Value    Rank  Value    Rank  Value    Rank  Value
  1    0.6       6    2.1      11    2.9      16    3.8      21    4.7
  2    1.2       7    2.3      12    3.3      17    3.9      22    4.9
  3    1.5       8    2.3      13    3.4      18    4.1      23    5.3
  4    1.6       9    2.5      14    3.6      19    4.2      24    5.6
  5    1.9      10    2.8      15    3.7      20    4.5      25    6.1

n = 25 (n+1)/2 = 13 Median = 3.4

If n is odd, the median is the value of the center observation
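The two cases above can be sketched in a few lines of Python. This is only an illustration of the (n + 1)/2 location rule; the function name `median_by_location` and the variable names are chosen here and do not come from the slides. The data are the 24 and 25 sorted values shown above.

```python
def median_by_location(values):
    """Median found with the (n + 1)/2 location rule."""
    ordered = sorted(values)              # step 1: sort smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: (n + 1)/2 is a whole number, i.e. 0-based index n // 2
        return ordered[mid]
    # n even: (n + 1)/2 falls halfway between the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


survival_24 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
               3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6]
survival_25 = survival_24 + [6.1]

print(median_by_location(survival_24))   # about 3.35 (mean of 3.3 and 3.4)
print(median_by_location(survival_25))   # 3.4 (the 13th value)
```

Sorting inside the function means the caller does not have to pass sorted data.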

[Figure: position of the mean and the median for a symmetric distribution and for left-skewed and right-skewed distributions]

Comparing the mean and the median

The median is a measure of center that is resistant to skew and

outliers. The mean is not.
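A quick way to see this resistance is to change one extreme value and recompute both statistics. The sketch below uses Python's standard `statistics` module on the 25 values from the median slide, then replaces the largest value, 6.1, with 7.9 (the value used in the outlier example later in this lecture). The variable names are illustrative only.

```python
from statistics import mean, median

# The 25 values from the median slide.
survival = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
            3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# Same data, with the largest value 6.1 replaced by 7.9.
with_outlier = survival[:-1] + [7.9]

print("mean:  ", round(mean(survival), 3), "->", round(mean(with_outlier), 3))
print("median:", median(survival), "->", median(with_outlier))
```

The median stays at 3.4 because the middle observation has not moved, while the mean shifts upward because every value contributes to the sum.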

Measure of spread: quartiles

M = median = 3.4

Q1= first quartile = 2.2

Q3= third quartile = 4.35

Rank  Value    Rank  Value    Rank  Value    Rank  Value    Rank  Value
  1    0.6       6    2.1      11    2.9      16    3.8      21    4.7
  2    1.2       7    2.3      12    3.3      17    3.9      22    4.9
  3    1.5       8    2.3      13    3.4      18    4.1      23    5.3
  4    1.6       9    2.5      14    3.6      19    4.2      24    5.6
  5    1.9      10    2.8      15    3.7      20    4.5      25    6.1

The first quartile, Q1, is the median

of the values below the median in the

sorted data set.

The third quartile, Q3, is the median

of the values above the median in the

sorted data set.
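The definition above (quartiles as medians of the lower and upper halves) can be sketched as follows. The function name `quartiles` is an illustrative choice, and the rule of leaving the median itself out of both halves when n is odd is an assumption that matches the values on this slide. Other software may use slightly different quartile rules and return slightly different numbers.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3), taking quartiles as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                                   # below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # above the median
    return median(lower), median(ordered), median(upper)


survival = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
            3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(survival)
print(q1, m, q3)   # about 2.2, 3.4, and about 4.35
```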

How fast do skin wounds heal?

Here are the skin healing rate data from 18 newts, measured in micrometers per hour:

28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35

Sorted data:

11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40

Median = ???

Quartiles = ???

Measure of spread: standard deviation

The standard deviation is used to describe the variation around the mean.

To get the standard deviation of a SAMPLE of data:

1) Calculate the variance s^2

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

2) Take the square root to get the standard deviation s

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
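The two steps can be written out directly. The sketch below applies them to the newt healing-rate data from the earlier wound-healing slide and cross-checks the result against `statistics.stdev` from Python's standard library, which also divides by n - 1. The function names `sample_variance` and `sample_sd` are illustrative.

```python
import math
import statistics


def sample_variance(values):
    """Step 1: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Step 2: the standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))


# Healing rates of 18 newts, in micrometers per hour.
healing_rates = [28, 12, 23, 14, 40, 18, 22, 33, 26, 27,
                 29, 11, 35, 30, 34, 22, 23, 35]

print(sample_sd(healing_rates))
print(statistics.stdev(healing_rates))   # library value, for comparison
```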

Learn how to obtain the standard deviation of a sample using a spreadsheet.

A person’s metabolic rate is the rate at which the body consumes energy. Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).

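\[ \bar{x} = \frac{\sum x_i}{n} = 1600 \]

\[ df = n - 1 = 6 \]

\[ \sum (x_i - \bar{x})^2 = 214{,}870 \]

\[ s^2 = \frac{1}{df} \sum (x_i - \bar{x})^2 = \frac{214{,}870}{6} = 35{,}811.7 \]

\[ s = \sqrt{35{,}811.7} = 189.2 \]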
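The arithmetic in this example can be checked using only the totals given on the slide (n = 7 and a sum of squared deviations of 214,870). The individual metabolic rates are not listed here, so this sketch starts from those totals; the variable names are illustrative.

```python
import math

n = 7                    # number of men in the sample
sum_sq_dev = 214_870     # sum of squared deviations from the mean (from the slide)

df = n - 1                        # degrees of freedom
variance = sum_sq_dev / df        # s^2
s = math.sqrt(variance)           # standard deviation

print(df, round(variance, 1), round(s, 1))   # 6 35811.7 189.2
```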

Center and spread in boxplots

“Five-number summary”:

max = 6.1
Q3 = 4.35
median = 3.4
Q1 = 2.2
min = 0.6

Rank  Value    Rank  Value    Rank  Value    Rank  Value    Rank  Value
 25    6.1      20    4.5      15    3.7      10    2.8       5    1.9
 24    5.6      19    4.2      14    3.6       9    2.5       4    1.6
 23    5.3      18    4.1      13    3.4       8    2.3       3    1.5
 22    4.9      17    3.9      12    3.3       7    2.3       2    1.2
 21    4.7      16    3.8      11    2.9       6    2.1       1    0.6

[Figure: boxplot of years until death for Disease X (vertical axis 0 to 7), drawn from the five-number summary]

Boxplots and skewed data

Boxplots show symmetry or skew.

[Figure: boxplots for a symmetric and a right-skewed distribution (Disease X and Multiple Myeloma; vertical axis: years until death, 0 to 15)]
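The five numbers that a boxplot is drawn from can be computed with a short helper. This sketch reuses the "median of each half" rule for the quartiles; the function name `five_number_summary` is an illustrative choice, and plotting itself is left out.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a data set."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                                   # below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # above the median
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])


# Years until death for the 25 Disease X individuals.
disease_x = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

summary = five_number_summary(disease_x)
print(summary)   # min 0.6, Q1 about 2.2, median 3.4, Q3 about 4.35, max 6.1
```

In a boxplot, Q1 and Q3 are the ends of the box and the median is the line inside it. When the median sits near the middle of the box and the whiskers are of similar length, the distribution is roughly symmetric.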

IQR and outliers

The interquartile range (IQR) is the distance between the first and

third quartiles (the length of the box in the boxplot)

IQR = Q3 – Q1

An outlier is an individual value that falls outside the overall pattern.

How far outside the overall pattern does a value have to fall to be

considered a suspected outlier?

Suspected low outlier: any value < Q1 – 1.5 IQR

Suspected high outlier: any value > Q3 + 1.5 IQR

Q3 = 4.35

Q1 = 2.2

Rank  Value    Rank  Value    Rank  Value    Rank  Value    Rank  Value
 25    7.9      20    4.5      15    3.7      10    2.8       5    1.9
 24    5.6      19    4.2      14    3.6       9    2.5       4    1.6
 23    5.3      18    4.1      13    3.4       8    2.3       3    1.5
 22    4.9      17    3.9      12    3.3       7    2.3       2    1.2
 21    4.7      16    3.8      11    2.9       6    2.1       1    0.6

[Figure: boxplot of years until death for Disease X (vertical axis 0 to 8), with individual #25 plotted separately as a suspected outlier]

Interquartile range: Q3 – Q1 = 4.35 – 2.2 = 2.15

Distance to Q3: 7.9 – 4.35 = 3.55

Individual #25 has a survival of 7.9 years, which is 3.55 years

above the third quartile. This is more than 1.5 IQR = 3.225 years.

Individual #25 is a suspected outlier.
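The 1.5 IQR rule used above can be sketched as a small function. The name `suspected_outliers` is illustrative, and the quartiles are passed in as the values given on the slide rather than recomputed.

```python
def suspected_outliers(values, q1, q3):
    """Return the values flagged by the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # suspected low outliers fall below this
    high_fence = q3 + 1.5 * iqr    # suspected high outliers fall above this
    return [x for x in values if x < low_fence or x > high_fence]


# Years until death for Disease X, with individual #25 at 7.9 years.
disease_x = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

print(suspected_outliers(disease_x, q1=2.2, q3=4.35))   # [7.9]
```

With Q1 = 2.2 and Q3 = 4.35 the fences sit at about -1.025 and 7.575 years, so only the 7.9 value is flagged. The original maximum of 6.1 years would not have been.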

Dealing with outliers: Baldi and Moore’s suggestions

What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:

Human error in recording information

Human error in experimentation or data collection

Unexplainable but apparently legitimate wild observations

Are you interested in ALL individuals?

Are you interested only in typical individuals?

Learn. Does the outlier tell you something interesting about

biology?

Don’t discard outliers just to make your data look better, and don’t act as if they did not exist.

Choosing among summary statistics: B & M

Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers.

Plot the mean and use the

standard deviation for error bars.

Otherwise, use the median and the

five-number summary, which can be

plotted as a boxplot.

Describe a distribution with its

S.U.M.S. (shape, unusual points,

middle, and spread).

[Figure: height of 30 women, in inches (vertical axis 58 to 69), shown as a boxplot and as mean ± s.d.]

Deep-sea sediments.

Phytopigment concentrations in deep-sea sediments

collected worldwide show a very strong right-skew.

Which of these two values is the mean and which is the median?

0.015 and 0.009 grams per square meter of bottom surface

Which would be a better summary statistic for these data?