Visualization - Statistical Methods
Sarah Filippi, University of Oxford
20 October 2015, Michaelmas Term 2015
First step
The starting point of ALL good statistical data analysis is graphical plots and summary statistics of the data.
ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!
Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency.
Graphical excellence
Excellent graphics:
- show the data
- induce the viewer to think about the substance
- avoid bias
- make large complex data sets coherent
- encourage data exploration and debate
Categorical Data
Let’s start by not using one common graph type:
"There is no data that can be displayed in a pie chart that cannot be displayed BETTER in some other type of chart."
J. W. Tukey
And let’s not even think about 3D and exploded pie charts.
What’s the matter with pie charts:
- people are not good at interpreting areas
- small and large slices are relatively distorted
- zero is often a very meaningful number but gets lost
- very hard to compare two pie charts
Barplots are usually a much better choice: barplot(height)
Suppose we have a few ordinal or categorical variables: the interesting questions are then how they vary together. Here is a cross-tabulation of the caffeine consumption (in mg/day) of women in a maternity ward by marital status (a contingency table).
                  0   1-150   151-300   300+
Married         652    1537       598    242
Prev. married    36      46        38     21
Single          218     327       106     67
The next two slides show two graphical representations: a set of barplots (aka bar charts) and a mosaic plot.
Different versions of these plots and other plots for categorical data can be found in package vcd.
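As a rough sketch of how such plots can be produced in base R, the counts from the contingency table are entered by hand below. The object names (caffeine, caffeine_prop) and the axis labels are illustrative choices, not taken from the slides, and the vcd package mentioned above offers other versions.

```r
# Caffeine consumption (mg/day) by marital status, copied from the table.
caffeine <- matrix(
  c(652, 1537, 598, 242,
     36,   46,  38,  21,
    218,  327, 106,  67),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    status   = c("Married", "Prev. married", "Single"),
    caffeine = c("0", "1-150", "151-300", "300+")
  )
)

# Grouped bar plot of raw counts: one cluster of bars per marital status.
barplot(t(caffeine), beside = TRUE, legend.text = TRUE,
        ylab = "Number of women")

# Row proportions put the three groups on a common scale.
caffeine_prop <- prop.table(caffeine, margin = 1)
barplot(t(caffeine_prop), beside = TRUE, legend.text = TRUE,
        ylab = "Proportion within marital status")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(caffeine, main = "Caffeine consumption by marital status")
```

Plotting proportions rather than counts makes the three groups comparable even though the Married group is much larger.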
[Figure: bar plots of caffeine consumption (0, 0-150, 150-300, 300+) for Married, Prev. married and Single; vertical axis: counts, 0 to 1400]
[Figure: caffeine consumption for Married, Prev. married and Single; vertical axis: 0.0 to 1.0]
A special case of a mosaic plot is sometimes called a spineplot.
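A spineplot can be drawn in base R with spineplot(), which accepts a two-way table. The sketch below uses the built-in HairEyeColor data rather than the caffeine table, purely as an illustration; the object names are assumptions.

```r
# Hair and eye colour counts from the built-in HairEyeColor table, summed over sex.
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# Bar widths follow the hair-colour totals; each bar is split by eye colour.
spineplot(hair_eye, main = "Eye colour given hair colour")

# The segment heights are the conditional proportions within each hair colour.
cond_prop <- prop.table(hair_eye, margin = 1)
print(round(cond_prop, 2))
```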
[Figure: spineplot of caffeine consumption by marital status; right-hand axis: proportion, 0.0 to 1.0]
Barplots
Barplots should not (!) be used to compare distributions of data across groups.
Box-and-whisker Plots (boxplots)
[Figure: annotated boxplot (axis from 2000 to 4000) labelling the median, 1st quartile, 3rd quartile, lower whisker, upper whisker and outliers]
There are about as many variations as software designers.
Parallel box plots are often useful to show the differences between subgroups of the data.
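A minimal sketch of parallel boxplots in base R, using the built-in mtcars data as a stand-in (the data set and labels are assumptions, not the data from the slides). The object returned by boxplot() holds the five numbers labelled in the annotated boxplot.

```r
# Fuel consumption by number of cylinders (built-in mtcars data).
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon")

# boxplot() invisibly returns what it drew: each column of bp$stats holds the
# lower whisker, 1st quartile (lower hinge), median, 3rd quartile (upper hinge)
# and upper whisker for one group; bp$out lists the points drawn as outliers.
print(bp$stats)
print(bp$out)
```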
[Figure: parallel boxplots of infant mortality for the GDP groups (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05]; vertical axis: 0 to 150]
Violin plots
Violin plots replace the representation in a boxplot by a variable-width box determined by a density estimate.
See e.g. the help for function vioplot() in the package of that name or the help for function panel.violin() in package lattice.
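One possible way to try panel.violin() is through lattice's bwplot(), sketched here with the singer data that ships with lattice. The data set and axis label are illustrative assumptions.

```r
library(lattice)  # recommended package, distributed with R

# Heights of singers by voice part, from lattice's own 'singer' data set.
violins <- bwplot(voice.part ~ height, data = singer,
                  panel = panel.violin,
                  xlab = "Height (inches)")
print(violins)  # lattice plots are drawn when printed
```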
[Figures: violin plots of infant.mortality (0 to 150) for the GDP groups (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05], drawn vertically and horizontally]
Histograms
- Very convenient for studying the shape of the distribution of the data.
- We choose a set of breakpoints covering the data and count how many points fall into each interval.
- Warning: depending on the software, the bars may show counts, proportions or percentages.
A true histogram has the area of each bar proportional to the count, and total area one. This matters if the breaks are not equally spaced. See function truehist() in package MASS.
How do we choose the number and position of the breaks?
hist(data, prob = FALSE, breaks = breaks)
truehist(data, x0 = x0)  # from package MASS; x0 must be named (the 2nd positional argument is nbins)
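The point about bar areas can be checked numerically. The sketch below uses the built-in Old Faithful data with deliberately unequal breaks (the break points are arbitrary illustrative values) and also prints the bin counts suggested by three standard rules available in base R.

```r
x <- faithful$eruptions  # Old Faithful eruption durations, in minutes

# Deliberately unequal bins: narrow where the data are dense, wide elsewhere.
breaks <- c(1.5, 2, 2.5, 3.5, 4, 4.25, 4.5, 4.75, 5.5)

# With unequal breaks hist() draws densities by default, so bar AREA
# (height times width) is the proportion of points in the bin.
h <- hist(x, breaks = breaks, main = "Unequal bins, density scale",
          xlab = "Duration (min)")

bar_area <- h$density * diff(h$breaks)
print(sum(bar_area))  # total area of the bars

# Default bin count used by hist() (Sturges' rule) and two alternatives.
print(c(Sturges = nclass.Sturges(x),
        Scott   = nclass.scott(x),
        FD      = nclass.FD(x)))
```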
[Figures: four histograms of infant.mortality (0 to 150) with different breaks, shown on frequency and density scales]
[Figures: two histograms of duration (1 to 5). Caption: Duration of Old Faithful eruptions]
Density Plots
- Histograms are density plots: the tops of the bars form a piecewise-constant estimator of the underlying pdf.
- We can use smooth estimates of density. Examples:
  - Kernel density estimates
    \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)
    where K_h is a kernel and h is the bandwidth.
    density(): check arguments bw, from, to, ...
  - Splines or logsplines: a spline is a piecewise polynomial function that is smooth at the places where the polynomial pieces connect.
    logspline() in package polspline
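To make the kernel density formula concrete, the sketch below evaluates it directly with a Gaussian kernel and overlays the result on the output of density(). The choice of a Gaussian kernel, the bw.nrd0() bandwidth and the Old Faithful data are assumptions made for illustration.

```r
x <- faithful$eruptions
h <- bw.nrd0(x)                      # default bandwidth rule used by density()
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)

# Direct evaluation of the estimator with a Gaussian kernel:
# K_h(u) = dnorm(u / h) / h, averaged over the n observations.
kde_manual <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)

# Built-in estimate on the same range, for comparison.
kde_builtin <- density(x, bw = h, from = min(grid), to = max(grid), n = 512)

plot(kde_builtin, main = "Kernel density estimate of eruption duration")
lines(grid, kde_manual, col = "red", lty = 2)
```

The two curves should be nearly indistinguishable; density() uses a faster binned approximation, so small numerical differences are expected.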
[Figures: kernel density and logspline estimates of infant.mortality; horizontal axis: 0 to 150, vertical axis: density]
Rugs
We can add a rug below the x axis to highlight the position of the data. See the help for rug() and also jitter().
Semi-transparent grey rugs are possible by specifying the color: col = rgb(0, 0, 0, 0.25).
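A short sketch of a density plot with a semi-transparent, jittered rug, again using the built-in Old Faithful data as an assumed example (rug_col is an illustrative name).

```r
x <- faithful$eruptions
rug_col <- rgb(0, 0, 0, 0.25)  # black with 25% opacity, which appears grey

plot(density(x), main = "Eruption duration with a rug")

# One tick per observation; jitter() separates tied values and the
# transparency lets overlapping ticks build up into darker regions.
rug(jitter(x), col = rug_col)
```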
[Figures: kernel density and logspline estimates of duration; horizontal axis: 1 to 5]
Scatterplots
The canonical plot of two continuous variables is a scatterplot.
Example from UN dataset:
plot(infant.mortality ~ gdp, UN, cex = 0.5)
plot(infant.mortality ~ gdp, UN, log = "xy", ...)
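The UN data set may not be available in every installation, so the sketch below shows the same linear versus log-log comparison with the mammals data from package MASS. This substitution and the axis labels are assumptions for illustration.

```r
library(MASS)  # recommended package; provides the 'mammals' data set

# Raw scale: a few very large animals squash everything else into a corner.
plot(brain ~ body, data = mammals, cex = 0.5,
     xlab = "Body weight (kg)", ylab = "Brain weight (g)")

# Log-log scale: the points spread out and the trend becomes visible.
plot(brain ~ body, data = mammals, cex = 0.5, log = "xy",
     xlab = "Body weight (kg)", ylab = "Brain weight (g)")
```

Log axes require strictly positive values, which holds for both variables here.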
[Figure: scatterplot of infant.mortality against gdp, linear axes]
[Figure: scatterplot of infant.mortality against gdp, logarithmic axes]
Using scatterplot() from package car.
[Figure: scatterplot of infant.mortality against gdp on logarithmic axes, with labelled points: Tonga, Iraq, Afghanistan, Bosnia, Sao.Tome, Sudan, Gabon, Liberia, Korea.Dem.Peoples.Rep, French.Guiana]
Smoother
- A fitted regression line and a smooth line have been automatically added.
- Such smooth curves often help to highlight trends in a scatterplot, but they can also be deceptive. Smoothers are things we will return to, but see the functions loess.smooth() and smooth.spline().
- With the scatterplot() function, outliers are automatically labelled. It is often best to do this manually with the identify() function.
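The two smoothers named above can be added to a base R scatterplot by hand. This is a sketch on the built-in cars data with default smoothing settings, both of which are assumptions rather than the settings used for the slides.

```r
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Straight-line fit for reference.
abline(lm(dist ~ speed, data = cars), lty = 2)

# Local regression smoother; returns x and y coordinates of the curve.
lo <- loess.smooth(cars$speed, cars$dist)
lines(lo, col = "blue")

# Smoothing spline with the smoothing parameter chosen automatically.
ss <- smooth.spline(cars$speed, cars$dist)
lines(ss, col = "red")

legend("topleft", legend = c("lm", "loess.smooth", "smooth.spline"),
       col = c("black", "blue", "red"), lty = c(2, 1, 1))
```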
It is often useful to visually convey confidence in your plots.
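One simple way to convey confidence is a shaded band around a fitted line. The sketch below assumes a straight-line model of hp on mpg in the built-in mtcars data and a pointwise 95% level; the slides do not say which model or level was used.

```r
fit <- lm(hp ~ mpg, data = mtcars)

# Predictions with pointwise 95% confidence limits on a regular grid.
grid <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(hp ~ mpg, data = mtcars)
# Semi-transparent band over the points, then the fitted line on top.
polygon(c(grid$mpg, rev(grid$mpg)), c(band[, "lwr"], rev(band[, "upr"])),
        col = rgb(0, 0, 0, 0.15), border = NA)
lines(grid$mpg, band[, "fit"])
```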
[Figure: scatterplot of hp against mpg]
Scatterplot with color by types
[Figure: scatterplot of prestige against income, coloured by type (bc, prof, wc)]
scatterplot(prestige ~ income | type, data = Prestige,
            smoother = FALSE, reg.line = FALSE)
Scatterplot Matrices (or pairs plots)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. Title: Anderson's Iris Data -- 3 species]
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21,  # bg only applies to pch 21 to 25
      bg = c("red", "green3", "blue")[unclass(iris$Species)])
Image or contours
The functions image() and contour() are useful to explore three-dimensional data or to illustrate distributions in two dimensions.
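As an illustration, the sketch below evaluates the density of two independent standard normal variables on a grid and displays it with both functions. The choice of distribution and grid is an assumption.

```r
# Density of two independent standard normal variables on a regular grid.
x <- seq(-4, 4, length.out = 101)
y <- seq(-4, 4, length.out = 101)
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))

# image() maps each grid cell to a colour; contour() draws level curves.
image(x, y, z, xlab = "x", ylab = "y")
contour(x, y, z, add = TRUE)
```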
[Figures: contour plot with levels from 0.02 to 0.14 and image plot, both on axes from -4 to 4]
Aspect ratio of plot
The aspect ratio of a plot is very important: Cleveland/McGill recommended an average slope of about 45° as the eye is most sensitive to departures from 45°.
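The slides do not say how to pick such an aspect ratio. One simple heuristic, assumed here for illustration, is to choose the asp argument so that the median absolute slope of the plotted segments is drawn at 45°. The sketch uses the built-in sunspot.year series.

```r
# Yearly sunspot numbers (built-in time series).
x <- as.numeric(time(sunspot.year))
y <- as.numeric(sunspot.year)

# Slopes of consecutive segments in data units (flat segments dropped).
slopes <- diff(y) / diff(x)
slopes <- slopes[slopes != 0]

# In base graphics, asp is the physical length of one y unit relative to one
# x unit, so a segment with data slope s is drawn with slope s * asp.
asp_banked <- 1 / median(abs(slopes))

plot(x, y, type = "l", asp = asp_banked,
     xlab = "Year", ylab = "Sunspots", main = "Banked towards 45 degrees")
```

With this setting the plot is best viewed on a wide, short device, since the plotting region keeps the requested ratio and pads the axes otherwise.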
[Figures: two Normal Q-Q plots (Theoretical Quantiles against Sample Quantiles) drawn with different aspect ratios]
Arranging Several Plots on Single Page
par(mfrow=c(2,3))
for(i in 1:6) { plot(1:10) }
[Figure: six copies of plot(1:10) arranged in 2 rows and 3 columns]
The layout function divides the plotting device into a variable number of rows and columns, with the column widths and the row heights specified in the respective arguments.
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE),
c(3,7), c(5,5),respect=TRUE)
for(i in 1:3) { plot(1:10) }
[Figure: three copies of plot(1:10); two in the top row and one spanning the bottom row]
Make your graphical display easy to understand
- Add labels to your axes (with appropriate font and font size)
- Control the scale of the axes using the arguments xlim and ylim. Also check the argument axes.
- Add clear captions
- Use appropriate colors
- ....
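The checklist above might translate into a call like the following sketch. The data set (cars), the label text, the axis ranges and the colour are illustrative choices only.

```r
plot(dist ~ speed, data = cars,
     xlim = c(0, 30), ylim = c(0, 130),        # explicit axis ranges
     xlab = "Speed (mph)",                     # axis labels with units
     ylab = "Stopping distance (ft)",
     main = "Stopping distance against speed", # short descriptive title
     cex.lab = 1.2, cex.axis = 1.1,            # slightly larger text
     col = "steelblue", pch = 19)
```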
Save a figure in a pdf file
I recommend always saving your figures as pdf, as this makes them easier to include in LaTeX. This can be done in R using the following commands:
pdf("filename.pdf")
...
dev.off()
You can also specify the size of the figure with the options width and height; the measures are in inches.
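A complete version of this pattern, as a sketch: the file name and the 6 by 4 inch size are hypothetical, and the file is written to a temporary directory so the example leaves no clutter.

```r
out_file <- file.path(tempdir(), "histogram.pdf")  # hypothetical file name

pdf(out_file, width = 6, height = 4)  # size in inches
hist(faithful$eruptions, main = "Eruption duration", xlab = "Minutes")
dev.off()                             # finishes writing and closes the file
```

Forgetting dev.off() is a common mistake: the pdf file stays open and may be incomplete until the device is closed.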
To watch at home...
TED talk on The beauty of data visualization:
David McCandless turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections. Good design, he suggests, is the best way to navigate information glut – and it may just change the way we see the world.
Link: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization