Lecture 3 Chapter 1. Displaying data with graphs.

Objectives (PSLS Chapter 1, plus suppl.)

Individuals and variables

Individuals are the objects described in a set of data. Individuals may be people, animals, plants or things. These are often the units of measure.

Student, shell, trial participant, tomato plant

A variable is any property that characterizes an individual. A variable can take different values for different individuals.

Age, gender, hair color, head circumference, leaf length, flower color

Two types of variables

A variable can be either

quantitative

Some quantity assessed or measured for each individual. We can then report the average of all individuals. Quantitative variables can be continuous or discrete. Continuous variables have units of measure that can be finely subdivided.

Age (in seconds), blood pressure (in mm Hg), leaf length (in cm)

Discrete variables provide counts.

Number of fingers, number of leaves, number of units

categorical

Some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic.

Gender (male, female), blood type (A, B, AB, O), flower color (white, red)

How do you decide if a variable is categorical or quantitative?

Ask:

What are the n individuals examined (in the sample or population)?

What is being recorded about those n individuals?

Is that a number (quantitative) or a statement (categorical)?

Individuals studied   Diagnosis       Age at death
Patient A             Heart disease   56
Patient B             Stroke          70
Patient C             Stroke          75
Patient D             Lung cancer     60
Patient E             Heart disease   80
Patient F             Accident        73
Patient G             Diabetes        69

Age at death: each individual is given a meaningful number (quantitative).

Diagnosis: each individual is given a description (categorical).
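The number-or-statement question can be sketched in a few lines of Python. This is only an illustration using the patient table above; the function name variable_type is made up for this example, and the check is deliberately naive: a numeric code that is not a meaningful number (an ID, for instance) would fool it.

```python
# Patient records copied from the table: (individual, diagnosis, age at death).
patients = [
    ("Patient A", "Heart disease", 56),
    ("Patient B", "Stroke", 70),
    ("Patient C", "Stroke", 75),
    ("Patient D", "Lung cancer", 60),
    ("Patient E", "Heart disease", 80),
    ("Patient F", "Accident", 73),
    ("Patient G", "Diabetes", 69),
]


def variable_type(values):
    """Hypothetical helper: 'quantitative' if every value is a number, else 'categorical'."""
    all_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if all_numbers else "categorical"


diagnoses = [row[1] for row in patients]
ages = [row[2] for row in patients]

n = len(patients)                      # the n individuals examined
diagnosis_type = variable_type(diagnoses)
age_type = variable_type(ages)
mean_age = sum(ages) / n               # an average only makes sense for a quantitative variable

print(f"n = {n}")
print(f"Diagnosis is {diagnosis_type}")
print(f"Age at death is {age_type}, mean = {mean_age}")
```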

A study examined the condition of deer after a particularly nasty winter. The sex and condition (good or poor) of a random sample of 61 deer were noted. Data from such a study could appear in either of these two formats:

Who/what are the individuals? What are the sampling units?

What are the variables, and are they quantitative or categorical?

Raw data

Frequency table

Ways to chart categorical data

Most common ways to graph categorical data:

Bar graphs

Each characteristic, or level, is represented by a bar. The height of a bar represents either the count of individuals with that characteristic, the frequency, or the percent of individuals with that characteristic, the relative frequency.

Mosaic Plots (bivariate)

Mosaic plots are graphical displays that allow the examination of relationships among two or more categorical variables. All dimensions of each bar represent the proportion found in the sample.

Pie charts

Are well known to be bad for human consumption.

Do you like…?

Subject   Carrots   Peas   Spinach
1         yes       yes    yes
2         yes       no     no
3         yes       yes    no
4         no        no     yes
5         yes       no     no
6         no        yes    yes

                   Carrots   Peas   Spinach
Percent who like   67%       50%    33%

Which one do you prefer?

Subject   Preference
1         Peas
2         Carrots
3         Carrots
4         Spinach
5         Carrots
6         Peas

           Percent who prefer
Carrots    50%
Peas       33%
Spinach    17%
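The frequency and relative frequency behind a bar graph can be tallied directly from the raw answers. The sketch below uses the six preferences listed above and draws a rough text bar for each level; the variable names are chosen for this example only.

```python
from collections import Counter

# One preference per subject, in subject order (from the table above).
preferences = ["Peas", "Carrots", "Carrots", "Spinach", "Carrots", "Peas"]

counts = Counter(preferences)          # frequency of each level
n = len(preferences)
# Relative frequency, as a percent rounded to the nearest whole number.
percents = {level: round(100 * c / n) for level, c in counts.items()}

# One text "bar" per level, tallest first; bar length is the count.
for level, c in counts.most_common():
    print(f"{level:<8} {'#' * c:<4} {c} ({percents[level]}%)")
```

Because each subject gives exactly one preference, the counts add up to n. That is not true of the "Do you like...?" table, where one subject can say yes to several vegetables.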

Bar graph only

Percent of current marijuana users in each of four age groups: USA, 2004

Who/what are the individuals?

What are the variables, and are they quantitative or categorical?

Common ways to chart quantitative data

Histograms

This is a summary graph for a variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets.

Dot density plots (aka dotplot)

These are graphs which show every data point. They are useful to describe the pattern of variability in the data.

Line graphs: time plots

Use them when there is a meaningful sequence, like time. The line connecting the points helps emphasize any change over time.

Other graphs to display numerical summaries (see chapter 2)

Raw data: 28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35

Sorted data: 11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40

Making a dotplot

1) Create a single axis representing the quantitative variable’s range

2) Represent each data point as a dot positioned according to its numerical value

3) When two or more data points have the same value, stack them up
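The three steps can be sketched as a text dotplot. This is a rough illustration using the 18 values listed above, and it assumes whole-number data so that each value gets its own column; the function name dotplot_rows is made up for this example.

```python
from collections import Counter

data = [28, 12, 23, 14, 40, 18, 22, 33, 26, 27, 29, 11, 35, 30, 34, 22, 23, 35]


def dotplot_rows(values):
    """Return the rows of a text dotplot, top row first (assumes integer data)."""
    counts = Counter(values)               # how many dots to stack at each value
    lo, hi = min(values), max(values)      # step 1: the axis covers the data range
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # steps 2 and 3: a dot wherever the stack at that value reaches this level
        rows.append(
            "".join("o" if counts[v] >= level else " " for v in range(lo, hi + 1))
        )
    return rows


rows = dotplot_rows(data)
lo, hi = min(data), max(data)
width = hi - lo + 1

for row in rows:
    print(row)
print("-" * width)                                                  # the axis
print(f"{lo}{' ' * (width - len(str(lo)) - len(str(hi)))}{hi}")    # end labels
```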

Making a histogram

1) The range of values that the quantitative variable takes is divided into equal-size intervals, or classes. This makes up the horizontal axis.

2) The vertical axis represents either the frequency (counts) or the relative frequency (percents of total).

3) For each class on the horizontal axis, draw a column. The height of the column represents the count (or percent) of data points that fall in that class interval.

Guinea pig survival time (in days) after inoculation with a pathogen (n = 72)

Let’s build a histogram with classes of size 50, starting at zero (zero is included in the first class).

43 45 53 56 56 57 58 66 67 73 74 79 80 80 81 81 81 82 83 83 84 88 89 91 91 92 92 97 99 99 100 100 101 102 102 102 103 104 107 108 109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184 191 198 211 214 243 249 329 380 403 511 522 598
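As a sketch of the counting step, the class frequencies for these 72 survival times can be computed as follows. The convention assumed here is that each class includes its left endpoint and excludes its right endpoint, so the first class holds values from 0 up to but not including 50; the function name histogram_counts is made up for this example.

```python
survival_days = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82,
    83, 83, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102,
    102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137,
    138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211,
    214, 243, 249, 329, 380, 403, 511, 522, 598,
]


def histogram_counts(values, width, start=0):
    """Count the values falling in each class [start + i*width, start + (i+1)*width)."""
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts


counts = histogram_counts(survival_days, width=50)

# One text column per class; the bar length is the frequency.
for i, c in enumerate(counts):
    low = i * 50
    print(f"{low:>3} to <{low + 50:<3} {'#' * c} {c}")
```

Dividing each count by the total (72) would give the relative frequency instead.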

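[Figure: histogram of guinea pig survival time. Horizontal axis: Survival time (days), 0 to 600 in classes of width 50; vertical axis: Frequency, 0 to 30. Class counts from left to right: 2, 28, 24, 8, 4, 0, 1, 1, 1, 0, 2, 1.]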

Choosing the classes for a histogram

It is an iterative process – try and try again.

Not too many classes with counts of 0 or 1

Not so summarized that you lose all the information

Not so detailed that it is no longer a summary

Try starting with 5 to 10 classes, then refine your class choice. (There isn’t a unique or “perfect” solution.)
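One way to make the trial and error concrete is to try several class widths on the guinea pig survival times (n = 72) and see how many classes result and how many of them hold only 0 or 1 observations. This is a sketch, not a rule for picking the width; the widths tried and the helper names are arbitrary choices for this example.

```python
survival_days = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82,
    83, 83, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102,
    102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137,
    138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211,
    214, 243, 249, 329, 380, 403, 511, 522, 598,
]


def class_counts(values, width, start=0):
    """Count the values in each class [start + i*width, start + (i+1)*width)."""
    counts = [0] * ((max(values) - start) // width + 1)
    for v in values:
        counts[(v - start) // width] += 1
    return counts


def summarize(values, width):
    """Return (number of classes, number of classes holding 0 or 1 values)."""
    counts = class_counts(values, width)
    sparse = sum(1 for c in counts if c <= 1)
    return len(counts), sparse


# Very narrow, moderate, and very wide classes, for comparison.
for width in (10, 50, 100, 300):
    n_classes, sparse = summarize(survival_days, width)
    print(f"width {width:>3}: {n_classes:>2} classes, {sparse} with 0 or 1 counts")
```

A width that leaves many near-empty classes is too detailed, while one that leaves only two or three classes hides the shape of the distribution.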

Statistical Applets: One Variable Statistical Calculator

Art or Science?

http://bcs.whfreeman.com/psls1e/pages/bcs-main.asp?v=category&s=00010&n=99000&i=99010.01&o=%7C00510%7C00520%7C00530%7C00540%7C00550%7C00560%7C00570%7C00010%7C00020%7C00030%7C00040%7C00050%7C00070%7C00080%7C00090%7C01000%7C02000%7C03000%7C04000%7C05000%7C06000%7C07000%7C08000%7C09000%7C10000%7C11000%7C12000%7C1

Interpreting histograms

We look for the overall pattern and for striking deviations from that pattern. We describe the histogram’s:

Shape

Center

Spread

Possible outliers

Most common unimodal distribution shapes

Symmetric distribution

Left skew: The left side extends much farther out than the right side.

Right skew: The right side (side with larger values) extends much farther out than the left side.

Not all distributions have a simple shape (especially with few observations).

Describe the shape of these histograms.

Alaska Florida

Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

Alaska and Florida have unusual percents of elderly in their population.

A large gap in the distribution is typically a sign of an outlier.
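Looking for large gaps can be sketched by sorting the observations and listing the biggest jumps between neighbors. The state percents behind the Alaska and Florida histogram are not given here, so this illustration reuses the guinea pig survival times from earlier in the lecture. It only flags values worth a closer look; it is not a formal outlier rule, and the function name largest_gaps is made up for this example.

```python
survival_days = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82,
    83, 83, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102,
    102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137,
    138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211,
    214, 243, 249, 329, 380, 403, 511, 522, 598,
]


def largest_gaps(values, k=3):
    """Return the k largest gaps between neighboring sorted values as (gap, lower, upper)."""
    ordered = sorted(values)
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return sorted(gaps, reverse=True)[:k]


top_gaps = largest_gaps(survival_days)
for gap, lower, upper in top_gaps:
    print(f"gap of {gap} days between {lower} and {upper}")
```

Observations beyond a large gap should then be checked and, if possible, explained.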

Mauna Loa [CO2]

Mauna Loa Data in a Bar Chart

Graphing time series

[Figure: time plot of monthly CO2 (parts per million), vertical scale 310 to 390; horizontal axis labeled by month and year, March 1958 to March 2008 in ten-year steps.]

Monthly atmospheric CO2 levels recorded at the Mauna Loa Hawaii observatory (March 1958 – August 2009)

Data collected over time are displayed in a time plot, with time on the horizontal axis and the variable of interest on the vertical axis.

We look for a possible trend (a clear overall pattern) and possible cyclical variations (variations with some regularity over time)

Mauna Loa Examples

Graphic Context

From Science News (2/11/2012)

Explanation of the Figure

Caption to the figure:

India and China would see the most deaths prevented by 14 measures reducing methane and soot. The circles at left are proportional to the number of deaths that would be prevented annually by country in 2030.

Bubble Graph

Bar Graph

Graphic Context

From Science News (2/11/2012)

Cleveland’s Hierarchy

(From Stewart, Brandi 2005)

Tufte’s Take-homes

Increase the data to ink ratio

Does each bit of ink provide unique information? Erase everything that is not needed.

Deemphasize non‐data elements

Make the data points darker and bolder than other elements.

Quick notes

Label the axes to avoid repeating percent signs on all data points.

Too many decimal places or trailing zeros are a distraction.

Avoid putting extra dimensions in your charts. Pseudo three‐dimensional charts are difficult to read and provide no additional information.

If you know categories and values for each category, a two‐dimensional chart will be clearer than a pseudo three‐dimensional one.

True three‐dimensional charts are even more difficult to read.

Make grid lines barely perceptible, if used.

Scales matter

How you stretch the axes and choose your scales can give a different impression.

[Figure: four time plots of the same data, death rates from cancer (U.S., 1945–95). Horizontal axis: Years, 1940 to 2000; vertical axis: Death rate (per thousand). Three panels use a vertical scale from 0 to 250; one uses a vertical scale from 120 to 220.]

A picture is worth a thousand words, BUT there is nothing like hard numbers. Look at the scales.
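The effect of the vertical scale can be put in numbers. The sketch below computes what fraction of the axis height a given change occupies under the two vertical scales used in the panels (0 to 250 and 120 to 220). The start and end rates are hypothetical values picked only for illustration, not the actual cancer death rates, and the function name visual_fraction is made up for this example.

```python
def visual_fraction(low, high, axis_min, axis_max):
    """Fraction of the vertical axis spanned by a change from low to high."""
    return (high - low) / (axis_max - axis_min)


# HYPOTHETICAL start and end values, for illustration only (not real data).
start_rate, end_rate = 140, 200

full_scale = visual_fraction(start_rate, end_rate, 0, 250)      # axis from 0 to 250
zoomed_scale = visual_fraction(start_rate, end_rate, 120, 220)  # axis from 120 to 220

print(f"0 to 250 axis:   the change fills {full_scale:.0%} of the plot height")
print(f"120 to 220 axis: the change fills {zoomed_scale:.0%} of the plot height")
```

The same change in the numbers looks much steeper on the narrower scale, which is why the axis values need to be read before judging the graph.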

[Figure: time plot of water salinity (parts per thousand), vertical scale 0 to 20, for each day from 9/30 to 10/7.]

Shark River water salinity in the Everglades National Park, over a seven-day period in the fall of 2009.

Describe these two graphs.
