Introduction to Computational Statistics

Post on 22-Jan-2017

Computational Statistics

Setia Pramana

2015

Course Outline

• Introduction
  – Different Statistical Software
• Data Preparation, Management, Manipulation, Summarization with:
  – SPSS
  – R (R Commander)
  – MS Excel
• Data Tabulation and Visualization

• Generate Different Statistical Distributions (with Rcmdr)
• Simple Linear Regression and Correlation
• Basic R Programming
• Developing a Simple Graphical User Interface in R
• Resampling Methods
• Statistical Inference (point and interval estimation)

• Hypothesis testing: one- and two-sample tests, including the t-test (tests for mean difference, proportion and variance)
• Analysis of Variance (ANOVA): one-way and two-way ANOVA
• Introduction to Design of Experiments
• Final Project

Course Workload

• 20% theory, 80% practice
• Group project (5 students)
• Presentation every week
• R code will be provided
• Slides can be seen at: http://www.slideshare.net/hafidztio/

Reference Books

• John Maindonald and W. John Braun. Data Analysis and Graphics Using R: An Example-Based Approach. 3rd edition. Cambridge University Press: Cambridge. 2010.
• John Fox. The R Commander: A Basic-Statistics Graphical User Interface to R. Journal of Statistical Software, Volume 14, Issue 9, September 2005.
• Chris Beeley. Web Application Development with R Using Shiny. Packt Publishing: Birmingham. 2013.
• SPSS Statistics Base User’s Guide 17.0. Polar Engineering and Consulting: Chicago. 2007.

• Jurusan Komputasi Statistik. Modul Mata Kuliah Komputasi Statistik. 2014.
• G. Jay Kerns. Introduction to Probability and Statistics Using R. E-book. GNU Free Documentation License. 2010.
• Geof H. Givens and Jennifer A. Hoeting. Computational Statistics. 2nd edition. John Wiley and Sons: New Jersey. 2013.
• Jochen Voss. Statistical Computing. E-book. 2011.
• Brent B. Welch, Ken Jones and Jeffrey Hobbs. Practical Programming in Tcl and Tk. 4th edition. Prentice Hall PTR: New Jersey. 2003.

Other Materials

• https://sites.google.com/site/biostatinfocore/home/rworkshop

• https://sites.google.com/site/biostatinfocore/biostatistics-workshop

Introduction

Statistics?

What is Statistics?

• Statistics is “the science which deals with collection, classification and tabulation of numerical facts as the basis for explanation, description and comparison of phenomena”.

Observations on the Bills of Mortality (1662)

Recorded plague-related deaths over 100 years

What is Statistics?

• Exploring data: Using graphical and numerical techniques to study patterns and departures from patterns (in order to interpret data).
• Sampling and experimentation: Clarifying the question, deciding on methods of collection and analysis to produce valid information.
• Anticipating patterns: Exploring random phenomena using probability and simulation. Probability is our tool for anticipating distributions...
• Statistical inference: Estimating population parameters and testing hypotheses.
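The idea of exploring random phenomena by simulation can be illustrated with a small sketch. The example is written in Python purely for illustration and is not part of the course materials. It estimates the probability that two fair dice sum to 7 by simulating many rolls, then compares the estimate with the exact value obtained by counting outcomes. The function names, the seed, and the number of trials are assumptions chosen for this sketch.

```python
import random


def estimate_sum_probability(target, trials, seed=0):
    """Estimate P(sum of two fair dice == target) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == target:
            hits += 1
    return hits / trials


def exact_sum_probability(target):
    """Exact probability, found by counting the 36 equally likely outcomes."""
    favourable = sum(
        1 for a in range(1, 7) for b in range(1, 7) if a + b == target
    )
    return favourable / 36


if __name__ == "__main__":
    estimate = estimate_sum_probability(target=7, trials=100_000)
    exact = exact_sum_probability(7)
    print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```

With more trials the simulated value tends to settle closer to the exact one, which is the sense in which probability lets us anticipate the pattern before collecting data.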

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write” (H. G. Wells)

Areas of Statistics

Two areas of statistics:

• Descriptive Statistics: collection, presentation, and description of sample data.
• Inferential Statistics: making decisions and drawing conclusions about populations.

Descriptive Statistics

What is your conclusion?

The fatality rate is:

– 40% in the group of drivers who did not wear seat belts
– 20% in drivers who did wear seat belts

• Seat belts appear to save lives

Inferential Statistics

• Are results applicable to the population of all drivers? (generalization)

• Does wearing seat belts save lives? (assess strength of evidence)

• Is the fatality rate of those not wearing seat belts higher than the fatality rate of those wearing seat belts? (comparison)

• How many lives can be saved by wearing seat belts? (prediction)

• Do other variables influence the conclusion? For example: the age of driver, alcohol use, type of car, speed at impact (ask more questions)
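The comparison question above can be sketched as a two-proportion z-test. This is an illustrative Python sketch, not the course's own analysis. The slides give only the fatality rates (40% and 20%), not the sample sizes, so the group sizes of 100 drivers each are hypothetical, and the function name is an assumption made for this example.

```python
import math


def two_proportion_z_test(deaths_a, n_a, deaths_b, n_b):
    """One-sided z-test of H0: p_a == p_b against H1: p_a > p_b.

    Returns the z statistic, the one-sided p-value, and a 95% Wald
    confidence interval for the difference p_a - p_b.
    """
    p_a = deaths_a / n_a
    p_b = deaths_b / n_b
    # Pooled proportion, used for the standard error under H0
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se_pooled
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    # Unpooled standard error for the confidence interval
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = 1.96 * se_diff
    return z, p_value, (p_a - p_b - margin, p_a - p_b + margin)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100 unbelted and 20 of 100 belted drivers died
    z, p_value, ci = two_proportion_z_test(40, 100, 20, 100)
    print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
    print(f"95% CI for the difference: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The result depends strongly on the assumed group sizes: the same 40% versus 20% rates observed on far fewer drivers would give much weaker evidence, which is why the descriptive summary alone cannot answer the inferential questions.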

Statistics and Technology

• Electronic technology has had a tremendous effect on the field of statistics.

• Many statistical techniques are repetitive in nature: computers and calculators are good at this.

• Many statistical software packages and tools are available: R, MINITAB, SYSTAT, STATA, SAS, Statgraphics, SPSS, MS Excel, and calculators.
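One example of a repetitive technique that suits a computer is the bootstrap, a resampling method. The sketch below is an illustration in Python, not taken from the course materials: it resamples a small data set with replacement many times and uses the percentiles of the resampled means as an approximate confidence interval. The data values, the function name, and the number of resamples are all hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(data)
    # Repeat the same step many times: resample with replacement, take the mean
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical measurements, used only to demonstrate the procedure
    sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
    low, high = bootstrap_mean_ci(sample)
    print(f"sample mean: {statistics.fmean(sample):.2f}")
    print(f"95% bootstrap CI: ({low:.2f}, {high:.2f})")
```

Doing two thousand resamples by hand would be impractical, while a computer finishes them almost instantly.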

Available Statistical Packages

Proprietary: Excel, SPSS, MINITAB, SAS, Stata, Statistica, and many more.

Free software: LibreOffice Calc, R, CSPro, WinBUGS, Epi Info, and many more.

Microsoft Excel

Which one do you use?

Why?

Statistical Software Used

R is HOT!

http://r4stats.com/articles/popularity/

What is R?

• A language and environment for statistical computing and graphics.

• An integrated suite of software facilities for data manipulation, calculation and graphical display.

• First presented in 1996 by Prof. Ross Ihaka and Robert Gentleman of the University of Auckland, NZ.
• GNU software, and therefore free. Similar to the S language.
• Open source, maintained and developed by a community of developers.
• Works on Windows, Unix, and macOS.

R includes

• An effective data handling and storage facility,
• A suite of operators for calculations on arrays, in particular matrices,
• A large, coherent, integrated collection of intermediate tools for data analysis,
• Graphical facilities for data analysis and display either on-screen or on hardcopy,
• A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

http://www.r-project.org/

Why R?

• It is not only statistical software but also a language

• 5000 add-on packages: lots of ready-made packages (http://cran.r-project.org/web/packages/)

• With many applications: see http://cran.r-project.org/web/views/ and http://www.revolutionanalytics.com/r-language-features-applications-and-extensions#thirdparty

• Access to powerful, cutting-edge analytics

• Flexible (complex or standard statistical practices, Bayesian modelling, GIS map building, building interactive web applications, building interactive tests, etc.)
• We can make our own package and publish it
• Great graphics and data visualization
• Can be used on high-performance computing clusters
• Well supported by the R community (http://www.inside-r.org/r-resources-web)
• And many more…

• Can be integrated with other languages (C/C++, Java).

• R can interact with many data sources and other statistical packages (SAS, Stata, SPSS, and Minitab).

• For high-performance computing tasks, R can use multiple cores, either on a single machine or across a network.

But…..

• R has no warranty

• Command-line interface: difficult for some users.

• Users must learn a new way of thinking about data and data analysis sequence

• That’s all ….. I guess

Companies using R in 2013

• The New York Times routinely uses R for interactive and print data visualization.

• Google has more than 500 R users.
• The FDA supports the use of R for clinical trials of new drugs.
• The National Weather Service uses R to predict the extent of flooding events.
• Zillow uses R to model housing prices.
• The Consumer Financial Protection Bureau uses R and other open source tools.
• Twitter uses R for data science applications on the Twitter database.
• FourSquare uses R to develop its recommendation engine.
• Facebook uses R to model all sorts of user behaviour.

Source: Revolution Analytics

R Library/packages

R base packages, plus add-on packages such as lme4, IsoGene, foreign, survival, zoo, ggplot2, reshape2, and nlme.

My R Packages

• IsoGene
• IsoGeneGUI
• nea
• neaGUI
• biclustGUI
• OCRME
• More detail: http://setiopramono.wordpress.com/r-programming/

R For Cutting Edge Technologies

R Graphics and Visualization

• R provides a wide range of graphics and visualizations
• Basic plots: bar plots, basic 3D plots, heatmaps, etc.
• Geographic maps
• Projection maps
• Social network graphs
• Animated graphics and movies (animation)
• Motion charts (googleVis)
• Interactive graphics (rggobi)
• Image formats: BMP, JPEG, PDF, PNG, etc.
• And many more…

R Graphics

RCircos: https://gjabel.wordpress.com/

A map of worldwide email traffic

Facebook connections between city centers around the world

R Graphical User Interfaces

• R uses a command-line interface, which advanced users prefer: it allows direct control, is more accurate and flexible, and makes the analysis reproducible.

• It requires good knowledge of the language, which is difficult for beginners or less frequent users.

• R provides tools for building GUIs: RGUI

R GUI Projects

• Integrated development environments (IDEs) and script editors that aim to provide feature-rich environments for editing R scripts and code: RStudio (www.rstudio.com) and Architect (www.Openanalytics.eu)

• Web-based applications: Rweb (Banfield, 1999), R.Net (www.u.arizona.edu/~ryckman/Net.php), or gWidgetsWWW (Verzani, 2012).

• Python: OpenMeta-Analyst (Wallace et al, 2012)

• Java: JGR (Java GUI for R), Deducer (Fellows, 2012), and Glotaran (Snellenburg, 2012).

• PHP: R-php (http://dssm.unipa.it/R-php/)

• Other extensions connect R to graphical toolboxes for developing menus and dialog boxes: Tcltk, Gtk.

RStudio

• Download from Rstudio.com

• Powerful IDE (Integrated Development Environment) for R.

RGUI Developed using tcltk

RGUI: RCommander

• Rcommander.com
• Helpful for R beginners
• Install from within R

RGUI using C#: Wires

• Developed by STIS students

• For Spatial Data Analysis

• Still under development…

RGUI: Web Based App

WebBUGS

• Conducting Bayesian Statistical Analysis Online

• Combines OpenBUGS and R

www.webbugs.psychstat.org

RGUI: Shiny

• A new package from RStudio to build interactive web applications with R.
• Really easy!
• Build useful web applications with only a few lines of code, no JavaScript required.
• Self learning: http://shiny.rstudio.com/
• http://www.showmeshiny.com/

RGUI using Shiny: FAST

Figure 5. FAST main page

Dynamic Report Generation

• Sweave
• knitr
• markdown

Want to Learn R? Need Help?

Lots of self-learning resources: http://www.rdatamining.com/resources/onlinedocs

Blogs:

Software   # Blogs   Blogs Source
R          550       R-Bloggers.com
Python     60        SciPy.org
SAS        40        PROC-X.com, sasCommunity.org Planet
Stata      11        Stata-Bloggers.com

User group: Stockholm R User Group, etc… Indonesia/Jakarta?

https://sites.google.com/site/biostatinfocore/introduction-to-r

Need Help?

Number of R- or SAS-related posts to Stack Overflow by week.