Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·...

17
Computational Statistics Why Use R? Computational Statistics using R and RStudio Randall Pruim Calvin College

Transcript of Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·...

Page 1: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Computational Statistics using R and RStudio

Randall Pruim

Calvin College

Page 2: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Statistics + Computation = Computational Statistics

Modern statistics is done with statistical computing tools. Thereare many possibilities (Minitab, Excel, Stata, SPSS, StatCrunch,. . . ).

Different tools come with different sets of advantages anddisadvantages.

Page 3: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Statistics + Computation = Computational Statistics

Modern statistics is done with statistical computing tools. Thereare many possibilities (Minitab, Excel, Stata, SPSS, StatCrunch,. . . ).

Different tools come with different sets of advantages anddisadvantages.

Page 4: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Why Use R?

Modern statistics is done with statistical computing tools.

• R is one of many reasonable choices.

• Excel is NOT. [Mccullough and Heiser, 2008]Worse (from a teaching perspective), using Excel for statisticsdamages your brain.

• Excel users will constrain themselves to what Excel makes easy,whether or not it is the best way to think about things

• Excel users are not forced to constrain themselves in ways thattrain the mind for statistical reasoning.

• Aside from data entry and rudimentary data manipulation,Excel has no place in statistics education.

• If you find that Excel ”does everything you need” (statistically)this is an indictment on your statistical needs, not a strengthof Excel.

Page 5: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Why Use R?

Modern statistics is done with statistical computing tools.

• R is one of many reasonable choices.

• Excel is NOT. [Mccullough and Heiser, 2008]

Worse (from a teaching perspective), using Excel for statisticsdamages your brain.

• Excel users will constrain themselves to what Excel makes easy,whether or not it is the best way to think about things

• Excel users are not forced to constrain themselves in ways thattrain the mind for statistical reasoning.

• Aside from data entry and rudimentary data manipulation,Excel has no place in statistics education.

• If you find that Excel ”does everything you need” (statistically)this is an indictment on your statistical needs, not a strengthof Excel.

Page 6: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Why Use R?

Modern statistics is done with statistical computing tools.

• R is one of many reasonable choices.• Excel is NOT. [Mccullough and Heiser, 2008]

Excel 2007, like its predecessors, fails a standard setof intermediate-level accuracy tests in three areas:statistical distributions, random number generation, andestimation. Additional errors in specific Excel proceduresare discussed. Microsoft’s continuing inability to correctlyfix errors is discussed. No statistical procedure in Excelshould be used until Microsoft documents that theprocedure is correct; it is not safe to assume thatMicrosoft Excel’s statistical procedures give the correctanswer. Persons who wish to conduct statistical analysesshould use some other package.

Worse (from a teaching perspective), using Excel for statisticsdamages your brain.

• Excel users will constrain themselves to what Excel makes easy,whether or not it is the best way to think about things

• Excel users are not forced to constrain themselves in ways thattrain the mind for statistical reasoning.

• Aside from data entry and rudimentary data manipulation,Excel has no place in statistics education.

• If you find that Excel ”does everything you need” (statistically)this is an indictment on your statistical needs, not a strengthof Excel.

Page 7: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Why Use R?

Modern statistics is done with statistical computing tools.

• R is one of many reasonable choices.

• Excel is NOT. [Mccullough and Heiser, 2008]Worse (from a teaching perspective), using Excel for statisticsdamages your brain.

• Excel users will constrain themselves to what Excel makes easy,whether or not it is the best way to think about things

• Excel users are not forced to constrain themselves in ways thattrain the mind for statistical reasoning.

• Aside from data entry and rudimentary data manipulation,Excel has no place in statistics education.

• If you find that Excel ”does everything you need” (statistically)this is an indictment on your statistical needs, not a strengthof Excel.

Page 8: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

Why Use R?

Modern statistics is done with statistical computing tools.

• R is one of many reasonable choices.

• Excel is NOT. [Mccullough and Heiser, 2008]Worse (from a teaching perspective), using Excel for statisticsdamages your brain.

• Excel users will constrain themselves to what Excel makes easy,whether or not it is the best way to think about things

• Excel users are not forced to constrain themselves in ways thattrain the mind for statistical reasoning.

• Aside from data entry and rudimentary data manipulation,Excel has no place in statistics education.

• If you find that Excel ”does everything you need” (statistically)this is an indictment on your statistical needs, not a strengthof Excel.

Page 9: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• R is free and available on any platform (Mac, PC, Linux, etc.)and also, via RStudio, in a web browser.

This means students have access to R whenever and whereverthey need it.

• R is powerful – you won’t outgrow it.

If one goal is to prepare students for future research, R willgrow with them.

Page 10: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• R produces excellent, publication-quality graphics (×3)

1. base graphics (the original):• no reason to learn it if you don’t already know it• should probably learn (at least) one of the other two

2. lattice (my recommendation for teaching purporses)• formula-based description of plots• the most important plots of one and two variables plus

covariates can all be created very easily

3. ggplot2

• very different syntax from lattice

• requires more data manipulation external to plotting thanlattice does

• for those fluent in R ggplot2 is probably the easier way tocreate elaborate custom plots

Both lattice and ggplot2 provide a high level of usercustomization (colors, sizes, fonts, shapes, etc.) via themesand settings.

Page 11: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• R helps you think about data in ways that are useful forstatistical analysis.

As with most true statistical packages, R prefers dataarranged with variables in columns and observational units inrows. Being able to identify each of these is a crucialpre-requisite to understanding statistical procedures.

• R promotes reproducible research.

R commands provide an exact record of how an analysis wasdone. Commands can be edited, rerun, commented, shared,etc.

• R is up-to-date.

Many new analysis methods appear first in R.

Page 12: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• There are many packages available that automate particulartasks.

The CRAN (Comprehensive R Archive Network) repositorycontains more than 3000 packages that provide a wide rangeof additional capabilities (including many with biologicalapplications.)The Bioconductor repository(http://www.bioconductor.org/) contains nearly 500additional packages that focus on biological andbioinformatics applications.

• R can be combined with other tools.

R can be used within programming languages (like Python) orin scripting environments to automate data analysis pipelines.

Page 13: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• R is popular

A large and growing user base in industry, business, and manyacademic fields is an important factor when choosing astatistics package. A large user base means active support andavailability of teaching/learning materials. It also means thatwe are providing students with a useful technical skill.

• R provides a gentle introduction to general computation.

Although R can be used “one command at a time” (that’s theway it is used in most intro courses), R is a full featuredprogramming language.

Page 14: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• The best is yet to come.

Popularity is increasing

A number of projects are underway that promise to furtherimprove on R, including

• RStudio and its manipulate and shiny features• mosaic package attempts to smooth out some of the

challenges in using R with beginners• . . .

Page 15: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• Although R is very comprehensive, a small pallet of usefulthings can be learned fairly easily.

Anyone who can learn to use a TI calculator, and learnenough R. – rjp

In my experience, troubles with R are usually indicatorsfor other problems that have nothing to do with R. – rjp

Less volume, more creativity.– Mike McCarthy (Green Bay Packers head coach)

Page 16: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

R Advantages

• Although R is very comprehensive, a small pallet of usefulthings can be learned fairly easily.

I would like to say again how much I appreciated theUSCOTS workshop. As it happens, I elected to teach avery small summer offering of our elementary statisticsclass with just two students, both of whom had failed thecourse before and needed it to graduate.

My approach has been to use R as a supplement to atraditional course, but I find that I have been able tospend a lot more time on simulations of all sorts,including re-sampling approaches to hypothesis-testing –including contingency tables – early on.

The sample is small, but these students find theR-approach quite engaging and intuitive. In addition tothe ready access to simulation and manipulationactivities, they seem to get a conceptual leg-up when itcomes to the handling the formulas that dominate thetraditional approach.

Page 17: Computational Statistics using R and RStudio - Calvin …rpruim/talks/Rminis/RAdvantages.pdf ·  · 2012-11-16Computational Statistics using R and RStudio Randall Pruim Calvin College.

Computational Statistics Why Use R?

References

Mccullough, B. D. and Heiser, D. A. (2008).On the accuracy of statistical procedures in microsoft excel2007.Computational Statistics & Data Analysis, 52:4570–4578.available online atwww.pages.drexel.edu/~bdm25/excel2007.pdf.