04 reports

Post on 13-May-2015

If you’re using a laptop, start installing LaTeX by following the instructions on the website.

Thursday, 2 September 2010

Office hours: before class.

Lab access: you should now have it

Hadley Wickham

Stat405: Statistical reports

1. More subsetting.

2. Missing values.

3. Statistical reports: data, code, graphics & written report

Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10, ]

diamonds <- diamonds[1, 1]
diamonds

# Uh oh!

rm(diamonds)
str(diamonds)

# Phew!
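
Why does rm() bring the data back? The overwrite only creates a copy in your workspace that hides the packaged dataset; removing the copy makes the original visible again. A minimal sketch of the same recovery, using the built-in mtcars dataset as a stand-in (an assumption, chosen so that no extra packages are needed):

```r
# mtcars ships with R's datasets package and stands in for diamonds here.
mtcars <- mtcars[1, 1]   # oops: the name now refers to a single number
mtcars

rm(mtcars)               # removes only the copy in the workspace
str(mtcars)              # the packaged data frame is visible again
```

This only works for datasets that come from a package. Data you created yourself is gone once it is overwritten.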

Your turn

Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.

Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)

equal_dim <- diamonds$x == diamonds$y
equal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0
zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zeros
good <- diamonds[!bad, ]

Missing values

Data errors

Typically removing the entire row because of one error is overkill. Better to selectively replace problem values with missing values.

In R, missing values are indicated by NA.

Expression Guess Actual

5 + NA

NA / 2

sum(c(5, NA))

mean(c(5, NA))

NA < 3

NA == 3

NA == NA

NA behaviour

Missing values propagate

Use is.na() to check for missing values

Many functions (e.g. sum and mean) have na.rm argument to remove missing values prior to computation.
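
The three points above can be checked directly at the console. A short sketch (the variable names are only for illustration):

```r
# Missing values propagate: every one of these results is NA.
a     <- 5 + NA
b     <- NA / 2
total <- sum(c(5, NA))
avg   <- mean(c(5, NA))
lt    <- NA < 3
eq    <- NA == 3
eq_na <- NA == NA          # even NA == NA is NA, not TRUE
print(c(a, b, total, avg, lt, eq, eq_na))

# Use is.na() rather than == to test for missing values.
is.na(c(5, NA))

# na.rm = TRUE drops missing values before computing.
sum(c(5, NA), na.rm = TRUE)
mean(c(5, NA), na.rm = TRUE)
```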

# Can use subsetting + <- to change individual values

diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$z[diamonds$z == 0] <- NA

y_big <- !is.na(diamonds$y) & diamonds$y > 10
diamonds$y[y_big] <- diamonds$y[y_big] / 10
z_big <- !is.na(diamonds$z) & diamonds$z > 6
diamonds$z[z_big] <- diamonds$z[z_big] / 10
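
The !is.na() guard in the code above matters because a comparison with NA is itself NA, and an NA inside a logical index causes trouble. A small sketch on a made-up vector (the values are hypothetical, not taken from diamonds):

```r
y <- c(5, NA, 58.9, 4)           # hypothetical measurements, one missing

unguarded <- y > 10              # the NA in y becomes an NA in the index
y[unguarded]                     # the selection contains a spurious NA

# With an NA in the index, this kind of replacement fails.
result <- tryCatch({
  y2 <- y
  y2[unguarded] <- y2[unguarded] / 10
  "no error"
}, error = function(e) "error")
result

# Guarded version: FALSE & NA is FALSE, so the index has no NAs.
guarded <- !is.na(y) & y > 10
y[guarded] <- y[guarded] / 10
y
```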

Your turn

What happens if you don’t remove missing values? Why?

Statistical reports

Regardless of whether you go into academia or industry, you need to be able to present your findings.

And you should be able to do more than just present them: you should be able to reproduce them.

Data (.csv) +

Code (.r) +

Graphics (.png, .pdf) +

Written report (.tex)

All in one directory

Working directory

Set your working directory to specify where files will be saved by default.

From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R

On Windows: File | Change dir.

On the Mac: ⌘-D
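
The working directory can also be checked and changed from within R using getwd() and setwd(). A minimal sketch, in which tempdir() stands in for your report directory (an assumption made so the example runs anywhere):

```r
old_wd <- getwd()              # the current working directory
setwd(tempdir())               # tempdir() stands in for your report directory

# A relative file name is resolved against the working directory.
writeLines("x,y\n1,2", "example.csv")
saved_here <- file.exists(file.path(getwd(), "example.csv"))
saved_here

setwd(old_wd)                  # go back to where we started
```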

Data

So far we’ve just used built-in datasets

Next week we’ll learn how to use external data

Code

Workflow

At the end of each interactive session, you want a summary of everything you did

Two options:

Save everything that you did with savehistory("filename.r") then remove the unimportant bits

Build up the important bits as you go

Up to you - I prefer the second

R editor

Linux: gedit (copy and paste - see website)

Windows: File | New Script (press F5 to send line)

Mac: File | New document (press command-enter to send)

Code is communication!

Code presentation

Use comments (#) to describe what you are doing and to create scannable headings in your code

Every comma should be followed by a space, and every mathematical operator (+, -, =, *, / etc) should be surrounded by spaces. Parentheses do not need spaces

Lines should be at most 80 characters. If you have to break up a line, indent the following piece

qplot(table,depth,data=diamonds)
qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
qplot(table-depth,data=diamonds,geom="histogram")
qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)

# Table and depth -------------------------

qplot(table, depth, data = diamonds)
qplot(table, depth, data = diamonds) + xlim(50, 70) + ylim(50, 70)

# Is there a linear relationship?
qplot(table - depth, data = diamonds, geom = "histogram")

# This bin width seems the most revealing
qplot(table / depth, data = diamonds, geom = "histogram", binwidth = 0.01) +
  xlim(0.8, 1.2)
# Also tried: 0.05, 0.005, 0.002

Graphics

Saving graphics

# Uses size on screen:
ggsave("my-plot.pdf")
ggsave("my-plot.png")

# Specify size
ggsave("my-plot.pdf", width = 6, height = 6)

# Saves file in working directory
# (where you started R from)
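
ggsave() needs ggplot2 and an existing plot. To see the same idea without any extra packages, here is a sketch that uses base R's pdf() device instead of ggsave() (a deliberate substitution, not what the slides use). It goes through the open, draw, and close steps by hand; the temporary directory and the mtcars data are assumptions for illustration.

```r
# Base R stand-in for ggsave("my-plot.pdf", width = 6, height = 6).
plot_path <- file.path(tempdir(), "my-plot.pdf")

pdf(plot_path, width = 6, height = 6)    # open a PDF device; size in inches
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
invisible(dev.off())                     # close the device to write the file

file.exists(plot_path)
```

With ggsave() the width and height are likewise given in inches by default.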

PDF

Vector based (can zoom in infinitely)

Good for most plots

PNG

Raster based (made up of pixels)

Good for plots with thousands of points

Your turn

Recreate some of the graphics from previous lectures and save them.

Experiment with the scale and height and width settings.

Modify the template to include them.

Written report

Latex

We are going to use the open-source document typesetting system LaTeX to produce our reports.

This is widespread in statistics: if you ever write a journal article, you will probably write it in LaTeX.

(Not so useful if you’re not in grad school)

Edit-Compile-Preview

Edit: a text document with special formatting

Compile: to produce a pdf

Preview: with a pdf viewer

See web page for system specifics.

Latex

Template

Sections

Images

Figures and cross-references

Verbatim input (for code)
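
To make the list above concrete, here is a sketch that writes a tiny LaTeX file from R. It is a hypothetical skeleton, not the course's template.tex: the file names report.tex, my-plot.pdf and analysis.r are assumptions, and the last two must exist in the same directory before the file will compile.

```r
# Hypothetical minimal report, written to a temporary directory.
report_dir <- file.path(tempdir(), "report-sketch")
dir.create(report_dir, showWarnings = FALSE)

tex <- c(
  "\\documentclass{article}",
  "\\usepackage{graphicx}  % images: \\includegraphics",
  "\\usepackage{verbatim}  % code: \\verbatiminput",
  "\\begin{document}",
  "",
  "\\section{Introduction}",
  "Figure~\\ref{fig:my-plot} shows a plot saved from R.",
  "",
  "\\begin{figure}[htbp]",
  "  \\centering",
  "  \\includegraphics[width=0.6\\linewidth]{my-plot.pdf}",
  "  \\caption{A plot saved from R.}",
  "  \\label{fig:my-plot}",
  "\\end{figure}",
  "",
  "\\section{Code}",
  "\\verbatiminput{analysis.r}",
  "",
  "\\end{document}"
)

tex_path <- file.path(report_dir, "report.tex")
writeLines(tex, tex_path)
```

The \label and \ref pair gives the cross-reference, and pdflatex usually has to be run twice before the figure number resolves.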

Your turn

# Get the sample report
wget http://had.co.nz/stat405/resources/sample-report.zip
unzip sample-report.zip

cd sample-report
gedit template.tex &
pdflatex template.tex
evince template.pdf
# Experiment!

Your turn

If not on Linux, follow the instructions on the class website.

If you feel comfortable, start on homework 2.

Homework
