Visualization - Statistical Methods


Sarah Filippi, University of Oxford

20 October 2015, Michaelmas Term 2015

First step

ALL good statistical data analysis begins with graphical plots and summary statistics of the data.

ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency.

Graphical excellence

Excellent graphics:

- show the data
- induce the viewer to think about the substance
- avoid bias
- make large complex data sets coherent
- encourage data exploration and debate

Categorical Data

Let’s start by not using one common graph type:

There is no data that can be displayed in a pie chart that cannot be displayed BETTER in some other type of chart. (J. W. Tukey)

And let’s not even think about 3D and exploded pie charts.

What’s the matter with pie charts:

- people are not good at interpreting areas
- small and large slices are relatively distorted
- zero is often a very meaningful number but gets lost
- very hard to compare two pie charts

Barplots are usually a much better choice: barplot(height)

Suppose we have a few ordinal or categorical variables: the interesting questions are then how they vary together. Here is a cross-tabulation of the caffeine consumption (in mg/day) of women in a maternity ward by marital status. (A contingency table.)

                   0   1-150  151-300   300+
Married          652    1537      598    242
Prev. married     36      46       38     21
Single           218     327      106     67

The next two slides show two graphical representations, a set of barplots (aka bar charts), and a mosaic plot. Different versions of these plots and other plots for categorical data can be found in package vcd.
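As a minimal sketch, the contingency table can be typed in by hand and drawn with base R graphics. The object name caff and the dimension names are illustrative choices, not taken from the slides, and the plotting options are only one reasonable set.

```r
# Caffeine consumption (mg/day) by marital status, typed in from the table.
# The object name `caff` and the dimnames are illustrative choices.
caff <- matrix(c(652, 1537, 598, 242,
                  36,   46,  38,  21,
                 218,  327, 106,  67),
               nrow = 3, byrow = TRUE,
               dimnames = list(status   = c("Married", "Prev. married", "Single"),
                               caffeine = c("0", "1-150", "151-300", "300+")))
caff <- as.table(caff)

# One group of bars per marital status, one bar per caffeine level.
barplot(t(caff), beside = TRUE, legend.text = TRUE,
        xlab = "Marital status", ylab = "Count")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(caff, main = "Caffeine consumption by marital status", color = TRUE)
```

Transposing the table puts the caffeine levels in the rows, so that barplot() draws one cluster of bars per marital status.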

[Figure: barplots of caffeine consumption level for Married, Prev. married and Single women]

[Figure: mosaic plot of caffeine consumption level by marital status (Married, Prev. married, Single)]

A special case of a mosaic plot is sometimes called a spineplot.
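A spineplot of the same counts can be sketched with spineplot() from base R graphics. The object name caff is again an illustrative choice.

```r
# Same counts as in the caffeine table; `caff` is an illustrative name.
caff <- as.table(matrix(c(652, 1537, 598, 242,
                           36,   46,  38,  21,
                          218,  327, 106,  67),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(status   = c("Married", "Prev. married", "Single"),
                                        caffeine = c("0", "1-150", "151-300", "300+"))))

# Bar widths show the share of each marital status; the stacked segments
# show the caffeine levels within each status.
spineplot(caff, xlab = "Marital status", ylab = "Caffeine (mg/day)")

# Conditional proportions behind the plot: each row sums to one.
props <- prop.table(caff, margin = 1)
round(props, 2)
```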

[Figure: spineplot of caffeine consumption level by marital status]

Barplots

Barplots should not (!) be used to compare distributions of data across groups.

Box-and-whisker Plots (boxplots)

[Figure: annotated boxplot showing the median, 1st quartile, 3rd quartile, lower whisker, upper whisker and outliers]

There are about as many variations as software designers.

Parallel box plots are often useful to show the differences between subgroups of the data.
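The UN data behind the slides is not part of base R, so the following sketch substitutes the built-in InsectSprays data set to show parallel boxplots and the numbers they are drawn from.

```r
# Built-in data used as a stand-in: insect counts for six sprays (A to F).
b <- boxplot(count ~ spray, data = InsectSprays,
             xlab = "Spray", ylab = "Insect count")

# One column per group: lower whisker, lower hinge (about the 1st quartile),
# median, upper hinge (about the 3rd quartile), upper whisker.
b$stats

# Points beyond the whiskers, which are drawn individually as outliers.
b$out
```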

[Figure: parallel boxplots of infant mortality by GDP group: (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05]]

Violin plots

Violin plots replace the representation in a boxplot by a variable-width box determined by a density estimate.

See e.g. the help for function vioplot() in the package of that name or the help for function panel.violin() in package lattice.

[Figures: two violin plots of infant.mortality by GDP group: (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05]]

Histograms

- Very convenient to study the shape of the distribution of the data.
- We can choose a set of breakpoints covering the data, and count how many points fall into each interval.
- Warning: depending on the software, the vertical axis may show counts, proportions or percentages.

A true histogram has the area of each bar proportional to the count, and total area one. This matters if the breaks are not equally spaced. See function truehist() in package MASS.

How do we choose the number and position of the breaks?

hist(data, prob = FALSE, breaks=breaks)

truehist(data, x0)
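To make the difference between counts and a true histogram concrete, here is a small sketch with base R's hist(). The built-in faithful data and the break points are illustrative choices, not those of the slides.

```r
x <- faithful$eruptions   # built-in Old Faithful eruption durations (minutes)

# Counts on the vertical axis, equally spaced breaks.
h_count <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), freq = TRUE,
                main = "Counts", xlab = "duration")

# Unequally spaced breaks: plot densities so that bar AREA reflects the count.
brk <- c(1.5, 2, 2.5, 3.5, 4, 4.25, 4.5, 4.75, 5.5)
h_dens <- hist(x, breaks = brk, freq = FALSE,
               main = "Density", xlab = "duration")

# Bar areas (density times width) add up to one.
total_area <- sum(h_dens$density * diff(h_dens$breaks))
total_area
```

With unequal breaks, plotting raw counts would exaggerate the wide bins, which is why the density scale is used in the second call.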

[Figures: four histograms of infant.mortality with different choices of breaks, two on a frequency scale and two on a density scale]

[Figures: two histograms of duration on a density scale. Caption: Duration of Old Faithful eruptions]

Density Plots

- Histograms are density plots: the tops of the bars form a piecewise-constant estimator of the underlying pdf.
- We can use smooth estimates of density. Examples:
  - Kernel density estimates

    $$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$

    where $K_h$ is a kernel and $h$ is the bandwidth.

    density() – check arguments bw, from, to...

  - Splines or logsplines: a spline is a piecewise polynomial function which has smooth properties at the places where the polynomial pieces connect.

    logspline() in package polspline
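The kernel density formula can be checked against density() with a direct implementation. This is only a sketch: the Gaussian kernel, the bandwidth h = 0.3 and the built-in faithful data are assumptions made for illustration, and kde() is a hypothetical helper, not a function from any package.

```r
x <- faithful$eruptions   # built-in data, used here for illustration
h <- 0.3                  # bandwidth, chosen by hand for this sketch

# Direct implementation of f_hat(x0) = (1/n) * sum_i K_h(x0 - x_i)
# with a Gaussian kernel, i.e. K_h(u) = dnorm(u, sd = h).
kde <- function(x0, data, h) {
  sapply(x0, function(pt) mean(dnorm(pt - data, sd = h)))
}

d <- density(x, bw = h, kernel = "gaussian")   # R's built-in estimator
f_manual <- kde(d$x, x, h)                     # evaluated on the same grid

plot(d, main = "Kernel density estimate", xlab = "duration")
lines(d$x, f_manual, lty = 2, col = "red")     # should overlay the solid curve

# density() uses a fast binned approximation, so a small difference is expected.
max(abs(f_manual - d$y))
```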

[Figures: kernel density estimate and logspline density estimate of infant.mortality]

Rugs

We can add a rug below the x axis to highlight the position of the data. See the help for rug() and also jitter().

Semi-transparent grey rugs are possible by specifying the color: col=rgb(0,0,0,0.25).
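A minimal sketch of a rug under a density curve, again using the built-in faithful data as a stand-in for the data on the slides:

```r
x <- faithful$eruptions   # illustrative data set

plot(density(x), main = "kernel density", xlab = "duration")

# jitter() breaks ties so that repeated values do not sit on top of each other;
# the fourth argument of rgb() is the opacity (alpha).
rug_col <- rgb(0, 0, 0, 0.25)
xj <- jitter(x)
rug(xj, col = rug_col)
```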

[Figures: kernel density estimate and logspline density estimate of duration]

Scatterplots

The canonical plot of two continuous variables is a scatterplot.

Example from UN dataset:

plot(infant.mortality ~ gdp, UN, cex = 0.5)

plot(infant.mortality ~ gdp, UN, log = "xy", ...)

[Figures: scatterplots of infant.mortality against gdp, on the original scale and on log-log axes]

Using scatterplot() from package car.

[Figure: log-log scatterplot of infant.mortality against gdp, with labelled points: Tonga, Iraq, Afghanistan, Bosnia, Sao.Tome, Sudan, Gabon, Liberia, Korea.Dem.Peoples.Rep, French.Guiana]

Smoother

- A fitted regression line and a smooth line have been automatically added.
- Such smooth curves often help to highlight trends in a scatterplot, but they can also be deceptive. Smoothers are things we will return to, but see the functions loess.smooth() and smooth.spline().
- With the scatterplot() function, outliers are automatically labelled. It is often best to do this manually with the identify() function.
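The two smoothers named above can be added to a base R scatterplot by hand. The sketch below uses the built-in cars data as a stand-in and keeps the default smoothing parameters, which are not necessarily those used by scatterplot().

```r
# `cars` (speed against stopping distance) is a built-in stand-in data set.
plot(dist ~ speed, data = cars, cex = 0.5)

abline(lm(dist ~ speed, data = cars), lty = 2)   # fitted regression line

sm <- loess.smooth(cars$speed, cars$dist)        # local regression smoother
lines(sm, col = "blue")

sp <- smooth.spline(cars$speed, cars$dist)       # smoothing spline
lines(sp, col = "red")
```

Interactive labelling with identify() needs a mouse click on the open device, so it is left out of this non-interactive sketch.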

It is often useful to visually convey confidence in your plots.
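One way to convey confidence is to shade a pointwise confidence band around a fitted curve. The sketch below assumes a simple linear model of hp on mpg in the built-in mtcars data. The slides do not say which model or smoother their figure uses, so this is an illustration rather than a reconstruction.

```r
fit  <- lm(hp ~ mpg, data = mtcars)   # assumed model, for illustration only
grid <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 100))

# Fitted values with a 95% pointwise confidence interval for the mean.
ci <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(hp ~ mpg, data = mtcars, pch = 16)

# Shaded band between the lower and upper confidence limits.
polygon(c(grid$mpg, rev(grid$mpg)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = rgb(0, 0, 0, 0.2), border = NA)
lines(grid$mpg, ci[, "fit"], lwd = 2)
```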

[Figure: scatterplot of hp against mpg]

Scatterplot with color by types

[Figure: scatterplot of prestige against income, with points distinguished by type (bc, prof, wc)]

scatterplot(prestige ~ income | type, data = Prestige,
            smoother = FALSE, reg.line = FALSE)

Scatterplot Matrices (or pairs plots)

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. Title: Anderson's Iris Data -- 3 species]

pairs(iris[1:4], pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])

Image or contours

The functions image() and contour() are useful to explore three-dimensional data or to illustrate distributions in two dimensions.
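As a sketch of both functions, the density of two independent standard normal variables can be evaluated on a grid and displayed. The choice of distribution and grid is an assumption made for illustration.

```r
x <- seq(-4, 4, length.out = 101)
y <- seq(-4, 4, length.out = 101)

# Density of two independent standard normals, evaluated on the grid.
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))

par(mfrow = c(1, 2))
contour(x, y, z)               # contour lines of equal density
image(x, y, z)                 # the same surface as a colour map
contour(x, y, z, add = TRUE)   # overlay the contours on the image
par(mfrow = c(1, 1))
```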

[Figures: contour plot and image plot of a two-dimensional distribution, both axes running from −4 to 4]

Aspect ratio of plot

The aspect ratio of a plot is very important: Cleveland/McGill recommended an average slope of about 45°, as the eye is most sensitive to departures from 45°.

[Figures: the same Normal Q-Q plot (Sample Quantiles against Theoretical Quantiles) drawn with two different aspect ratios]

Arranging Several Plots on Single Page

par(mfrow=c(2,3))

for(i in 1:6) { plot(1:10) }

[Figure: six copies of plot(1:10) arranged in a grid of 2 rows and 3 columns]

The layout function divides the plotting device into a variable number of rows and columns, with the column widths and the row heights specified in the respective arguments.

nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE),

c(3,7), c(5,5),respect=TRUE)

for(i in 1:3) { plot(1:10) }

[Figure: three copies of plot(1:10), two in the top row and one spanning the bottom row]

Make your graphical display easy to understand

- Add labels to your axes (with appropriate font and font size)
- Control the scale of the axes using the arguments xlim and ylim. Also check the argument axes.
- Add clear captions
- Use appropriate colors
- ....
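The points in this list can be sketched in a single call to plot(). The built-in cars data, the label text and the axis limits are illustrative choices.

```r
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",  # axis labels
     main = "Stopping distance against speed",               # title
     xlim = c(0, 30), ylim = c(0, 130),                      # axis ranges
     cex.lab = 1.2, col = "steelblue", pch = 16,             # label size, colour
     axes = FALSE)                                           # suppress default axes

axis(1, at = seq(0, 30, by = 5))      # custom x axis
axis(2, at = seq(0, 120, by = 20))    # custom y axis
box()                                 # frame around the plotting region
```

Setting axes = FALSE and then calling axis() gives full control over tick positions. For most plots the default axes are fine and only xlim and ylim need adjusting.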

Save a figure in a pdf file

I recommend always saving your figures as PDF, as this makes them easier to include in LaTeX. This can be done in R using the following commands:

pdf("filename.pdf")

...

dev.off()

You can also specify the size of the figure with the options width and height; the measurements are in inches.
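A complete example of the pattern, writing to a temporary directory. The file name and the figure size are hypothetical choices for this sketch.

```r
out <- file.path(tempdir(), "example-figure.pdf")   # hypothetical file name

pdf(out, width = 6, height = 4)    # open a PDF device, size in inches
plot(dist ~ speed, data = cars)    # any plotting commands go here
dev.off()                          # close the device so the file is written

file.exists(out)
```

Forgetting dev.off() is a common reason for an empty or unreadable PDF, because the file is only completed when the device is closed.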

To watch at home...

TED talk on The beauty of data visualization:

David McCandless turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections. Good design, he suggests, is the best way to navigate information glut – and it may just change the way we see the world.

Link: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization