Lecture 3 Chapter 1. Displaying data with graphs.
Objectives (PSLS Chapter 1, plus suppl.)
Individuals and variables
Individuals are the objects described in a set of data. Individuals may
be people, animals, plants or things. These are often the units of
measure.
Student, shell, trial participant, tomato plant
A variable is any property that characterizes an individual. A variable
can take different values for different individuals.
Age, gender, hair color, head circumference, leaf length, flower color
Two types of variables
A variable can be either
quantitative
Some quantity assessed or measured for each individual. We can
then report the average of all individuals. Quantitative variables can
be continuous or discrete. Continuous variables have units of
measure that can be subdivided as finely as needed.
Age (in seconds), blood pressure (in mm Hg), leaf length (in cm)
Discrete variables provide counts.
Number of fingers, number of leaves, number of units
categorical
Some characteristic describing each individual. We can then report
the count or proportion of individuals with that characteristic.
Gender (male, female), blood type (A, B, AB, O), flower color (white, red)
How do you decide if a variable is categorical or quantitative?
Ask:
What are the n individuals examined (in the sample or population)?
What is being recorded about those n individuals?
Is that a number (quantitative) or a statement (categorical)?
Individuals studied   Diagnosis       Age at death
Patient A             Heart disease   56
Patient B             Stroke          70
Patient C             Stroke          75
Patient D             Lung cancer     60
Patient E             Heart disease   80
Patient F             Accident        73
Patient G             Diabetes        69
Age at death: each individual is given a meaningful number.
Diagnosis: each individual is given a description.
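The number-or-statement question can be sketched in Python for the patient table. This is only a heuristic under a stated assumption: a variable whose every recorded value is a number is treated as quantitative, which would misclassify numeric codes (ID numbers, for example) that carry no arithmetic meaning. The function name classify_variable is hypothetical.

```python
# Patient records copied from the table: (individual, diagnosis, age at death).
patients = [
    ("Patient A", "Heart disease", 56),
    ("Patient B", "Stroke", 70),
    ("Patient C", "Stroke", 75),
    ("Patient D", "Lung cancer", 60),
    ("Patient E", "Heart disease", 80),
    ("Patient F", "Accident", 73),
    ("Patient G", "Diabetes", 69),
]

def classify_variable(values):
    """Heuristic (an assumption): all values numeric -> quantitative."""
    all_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if all_numbers else "categorical"

diagnoses = [row[1] for row in patients]
ages = [row[2] for row in patients]

print("Diagnosis:", classify_variable(diagnoses))
print("Age at death:", classify_variable(ages))

# An average is meaningful only for the quantitative variable.
print("Mean age at death:", sum(ages) / len(ages))
```

For the categorical variable the sensible summary is a count or proportion per diagnosis rather than an average.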
A study examined the condition of deer after a particularly
nasty winter. Sex and condition (good and poor) of a
random sample of 61 deer are noted. Data from such a
study could appear in either of these two formats:
Who/what are the individuals? What are the sampling units?
What are the variables, and are they quantitative or categorical?
Raw data
Frequency table
Ways to chart categorical data
Most common ways to graph categorical data:
Bar graphs
Each characteristic, or level, is represented by a bar. The height of a bar
represents either the count of individuals with that characteristic (the
frequency) or the percent of individuals with that characteristic (the relative
frequency).
Mosaic Plots (bivariate)
Mosaic plots are graphical displays that allow the examination of relationships
among two or more categorical variables. All dimensions of each bar represent
the proportion found in the sample.
Pie charts
Are well known to be bad for human consumption.
Do you like…?
Subject   Carrots   Peas   Spinach
1         yes       yes    yes
2         yes       no     no
3         yes       yes    no
4         no        no     yes
5         yes       no     no
6         no        yes    yes
Percent who like:   Carrots 67%   Peas 50%   Spinach 33%
Which one do you prefer?
Subject   Preference
1         Peas
2         Carrots
3         Carrots
4         Spinach
5         Carrots
6         Peas
Percent who prefer:   Carrots 50%   Peas 33%   Spinach 17%
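As a minimal sketch, the "Which one do you prefer?" answers can be turned into counts (frequencies) and percents (relative frequencies) in Python. The six preferences are copied from the slide; the function name frequency_table is hypothetical. Because each subject names exactly one vegetable, the percents add up to 100.

```python
from collections import Counter

# One answer per subject, in subject order 1 to 6.
preferences = ["Peas", "Carrots", "Carrots", "Spinach", "Carrots", "Peas"]

def frequency_table(values):
    """Map each level to (count, percent of all individuals)."""
    counts = Counter(values)
    n = len(values)
    return {
        level: (count, 100 * count / n)
        for level, count in counts.most_common()
    }

table = frequency_table(preferences)
for level, (count, percent) in table.items():
    print(f"{level:<8} {count:>2} {percent:5.0f}%")
```

The counts or the percents are what the bar heights in a bar graph would represent.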
Bar graph only
Percent of current marijuana users in each of four age groups: USA, 2004
Who/what are the individuals?
What are the variables, and are they quantitative or categorical?
Common ways to chart quantitative data
Histograms
This is a summary graph for a variable. Histograms are useful to understand
the pattern of variability in the data, especially for large data sets.
Dot density plots (aka dotplot)
These are graphs which show every data point. They are useful to describe
the pattern of variability in the data.
Line graphs: time plots
Use them when there is a meaningful sequence, like time. The line connecting
the points helps emphasize any change over time.
Other graphs to display numerical summaries (see chapter 2)
Raw data: 28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35
Sorted data: 11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40
Making a dotplot
1) Create a single axis representing the quantitative variable’s range
2) Represent each data point as a dot positioned according to its
numerical value
3) When two or more data points have the same value, stack them up
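The three steps can be sketched as a plain-text dotplot in Python, using the 18 values listed before the steps. This is an illustrative sketch only: the function name text_dotplot and the choice of one character column per unit on the axis are assumptions, not part of the lecture.

```python
from collections import Counter

data = [28, 12, 23, 14, 40, 18, 22, 33, 26, 27, 29, 11, 35, 30, 34, 22, 23, 35]

def text_dotplot(values):
    """Return text rows: stacked dots on top, a dashed axis as the last row."""
    counts = Counter(values)            # how many points share each value
    lo, hi = min(values), max(values)   # step 1: the axis covers the range
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # steps 2 and 3: a dot at each value, stacked when values repeat
        row = "".join(
            "o" if counts[v] >= level else " " for v in range(lo, hi + 1)
        )
        rows.append(row.rstrip())
    rows.append("-" * (hi - lo + 1))
    return rows

plot = text_dotplot(data)
print("\n".join(plot))
print(f"axis runs from {min(data)} to {max(data)}, one column per unit")
```

Every data point stays visible, which is why a dotplot suits small data sets like this one.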
Making a histogram
1) The range of values that the quantitative variable takes is divided
into equal-size intervals, or classes. This makes up the horizontal axis.
2) The vertical axis represents either
the frequency (counts) or the relative frequency (percents of total).
3) For each class on the horizontal axis, draw a column. The height of
the column represents the count (or percent) of data points that fall in
that class interval.
Guinea pig survival time (in days) after inoculation with a pathogen (n = 72)
Let’s build a histogram with classes of size 50, starting at zero (zero is included in the first class).
43 45 53 56 56 57 58 66 67 73 74 79 80 80 81 81 81 82 83 83 84 88 89 91 91 92 92 97 99 99 100 100 101 102 102 102 103 104 107 108 109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184 191 198 211 214 243 249 329 380 403 511 522 598
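As a sketch of the tallying step, the class counts for these 72 survival times can be computed in Python with classes of width 50 starting at zero. The function name class_counts is hypothetical; each class is taken to include its lower boundary and exclude its upper one, so zero falls in the first class.

```python
survival_days = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82,
    83, 83, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102,
    102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211, 214, 243,
    249, 329, 380, 403, 511, 522, 598,
]

def class_counts(values, width, start=0):
    """Count values per class [start + k*width, start + (k+1)*width)."""
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts

counts = class_counts(survival_days, width=50)
for k, count in enumerate(counts):
    lower = k * 50
    # Each row is one class; the run of '#' plays the role of the column.
    print(f"{lower:>3} to <{lower + 50:<3} {count:>2} {'#' * count}")
```

The printed rows are a sideways histogram: the long rows near 50 to 150 days and the thin tail toward 600 show the shape directly.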
[Figure: histogram of frequency against survival time (days), classes of width 50 from 0 to 600; bar heights 2, 28, 24, 8, 4, 0, 1, 1, 1, 0, 2, 1]
Choosing the classes for a histogram
It is an iterative process: try and try again.
Not too many classes with counts of 0 or 1
Not so summarized that you lose all the information
Not so detailed that it is no longer a summary
Try starting with 5 to 10 classes, then refine your class choice.
(There isn’t a unique or “perfect” solution.)
Statistical Applets: One Variable Statistical Calculator
Art or science?
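The try-and-refine process can be sketched in Python by tallying the guinea pig survival times with several class widths and checking how many classes end up with a count of 0 or 1. The function name summarize_classes and the particular widths tried are assumptions chosen for illustration.

```python
survival_days = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82,
    83, 83, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102,
    102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211, 214, 243,
    249, 329, 380, 403, 511, 522, 598,
]

def summarize_classes(values, width):
    """Return (number of classes, classes with count 0 or 1, largest count)."""
    n_classes = max(values) // width + 1   # classes start at zero
    counts = [0] * n_classes
    for v in values:
        counts[v // width] += 1
    sparse = sum(1 for c in counts if c <= 1)
    return n_classes, sparse, max(counts)

for width in (10, 50, 100, 200):
    n_classes, sparse, largest = summarize_classes(survival_days, width)
    print(f"width {width:>3}: {n_classes:>2} classes, "
          f"{sparse:>2} with count 0 or 1, largest count {largest}")
```

A very small width leaves many nearly empty classes, while a very large width packs most observations into one class; neither extreme summarizes the data well.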
Interpreting histograms
We look for the overall pattern and for striking deviations from that
pattern. We describe the histogram’s:
Shape
Center
Spread
Possible outliers
Most common unimodal distribution shapes
Symmetric distribution
Left skew: The left side extends much farther out than the right side.
Right skew: The right side (side with larger values) extends much farther out than the left side.
Not all distributions have a simple shape
(especially with few observations).
Describe the shape of these histograms.
Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
Alaska and Florida have unusual percents of elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
Mauna Loa [CO2]
Mauna Loa Data in a Bar Chart
Graphing time series
[Figure: time plot of monthly CO2 (parts per million), vertical axis 310 to 390, against year and month, March 1958 to March 2008]
Monthly atmospheric CO2 levels recorded at the Mauna Loa Hawaii observatory (March 1958 – August 2009)
Data collected over time are displayed in a time plot, with time on the horizontal axis and the variable of interest on the vertical axis.
We look for a possible trend (a clear overall pattern) and possible cyclical variations (variations with some regularity over time).
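One simple way to look for a trend separately from cyclical variation is to average each complete year of monthly values. The sketch below uses a synthetic series, not the Mauna Loa measurements: the starting level, the monthly rise, the yearly pattern, and all function names are made up purely for illustration.

```python
# Synthetic illustration only: these are NOT Mauna Loa measurements.
SEASONAL = [0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1]  # repeats yearly, sums to 0

def synthetic_series(n_years, start=315.0, rise_per_month=0.125):
    """Made-up monthly values: a steady rise plus a repeating yearly cycle."""
    return [start + rise_per_month * t + SEASONAL[t % 12]
            for t in range(12 * n_years)]

def annual_means(series):
    """Average each complete block of 12 months; the yearly cycle cancels."""
    return [sum(series[i:i + 12]) / 12
            for i in range(0, len(series) - 11, 12)]

def within_year_deviations(series):
    """Subtract each year's mean, leaving the variation inside each year."""
    means = annual_means(series)
    return [x - means[i // 12]
            for i, x in enumerate(series[:12 * len(means)])]

series = synthetic_series(n_years=3)
trend = annual_means(series)
deviations = within_year_deviations(series)

print("annual means (trend):", trend)
print("first-year deviations:", deviations[:12])
```

The annual means rise steadily, which is the trend; the deviations rise and fall within each year, which is the cyclical variation (plus that year's small share of the rise).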
Mauna Loa Examples
Graphic Context
From Science News (2/11/2012)
Explanation of the Figure
Caption to the figure:
India and China would see the most deaths prevented by 14 measures reducing methane and soot. The circles at left are proportional to the number of deaths that would be prevented annually by country in 2030.
Bubble Graph
Bar Graph
Graphic Context
From Science News (2/11/2012)
Cleveland’s Hierarchy
(From Stewart, Brandi 2005)
Tufte’s Take-homes
Increase the data-to-ink ratio
Does each bit of ink provide unique information? Erase everything that is not needed.
Deemphasize non‐data elements
Make the data points darker and bolder than other elements.
Quick notes
Label the axes to avoid repeating percent signs on all data points.
Too many decimal places or trailing zeros are a distraction.
Avoid putting extra dimensions in your charts. Pseudo three‐dimensional charts are difficult to read and provide no additional information.
If you know categories and values for each category, a two‐dimensional chart will be clearer than a pseudo three‐dimensional one.
True three‐dimensional charts are even more difficult to read.
Make grid lines barely perceptible, if used.
[Figure: three time plots of death rate (per thousand) against years, 1940 to 2000, each with a vertical axis from 0 to 250]
A picture is worth a thousand words, BUT there is nothing like hard numbers.
Look at the scales.
Scales matter
How you stretch the axes and choose your scales can give a different impression.
[Figure: time plot of death rate (per thousand) against years, 1940 to 2000, with the vertical axis running from 120 to 220]
Death rates from cancer (U.S., 1945 – 95)
[Figure: time plot of water salinity (parts per thousand), vertical axis 0 to 20, against date, 9/30 to 10/7]
Shark River water salinity in the Everglades National Park, over a seven-day period in the fall of 2009.
Describe these two graphs.