Data analysis with R

SHARETHIS DATA ANALYSIS with R Hassan Namarvar

description

The goal of this workshop is to introduce the fundamental capabilities of R as a tool for data analysis. We look at R, a comprehensive statistical analysis language, to get a basic idea of how to analyze real-world data, extract patterns from data, and find causality.

Transcript of Data analysis with R

Page 1: Data analysis with R

SHARETHIS DATA ANALYSIS with R Hassan Namarvar

Page 2: Data analysis with R

WHAT IS R?

• R is a free programming language and software environment for statistical computing and graphics.

• It is similar to the S language, developed at AT&T Bell Labs by Rick Becker, John Chambers, and Allan Wilks.

• R was initially developed by Ross Ihaka and Robert Gentleman (1996), from the University of Auckland, New Zealand.

• R source code is written in C, Fortran, and R.

Page 3: Data analysis with R

R PARADIGMS

Multi-paradigm:
– Array
– Object-oriented
– Imperative
– Functional
– Procedural
– Reflective

Page 4: Data analysis with R

STATISTICAL FEATURES

• Graphical techniques
• Linear and nonlinear modeling
• Classical statistical tests
• Time-series analysis
• Classification
• Clustering
• Machine learning

Page 5: Data analysis with R

PROGRAMMING FEATURES

• R is an interpreted language
• Access R through a command-line interpreter
• Like MATLAB, R supports matrix arithmetic
• Data structures:
– Vectors
– Matrices
– Arrays
– Data frames
– Lists
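As a minimal sketch of the five data structures listed above, the snippet below builds one of each and contrasts the matrix product with element-wise multiplication. All object names and values are illustrative and do not come from the slides.

```r
# Illustrative values only; none of these objects appear in the slides.
v  <- c(4, 7, 23.5)                                   # vector: one type, one dimension
m  <- matrix(1:6, nrow = 2, ncol = 3)                 # matrix: one type, two dimensions
a  <- array(1:24, dim = c(2, 3, 4))                   # array: one type, any number of dimensions
df <- data.frame(id = 1:3, label = c("a", "b", "c"))  # data frame: columns may differ in type
l  <- list(count = 3L, name = "demo", values = v)     # list: arbitrary elements

# Matrix arithmetic: %*% is the matrix product, * is element-wise.
prod_mm <- m %*% t(m)   # 2 x 2 result
elem_mm <- m * m        # 2 x 3 result

str(l)   # compact view of the list's structure
dim(a)   # dimensions of the array
```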

Page 6: Data analysis with R

ADVANTAGES OF R

• One of the most comprehensive statistical analysis packages available
• Outstanding graphical capabilities
• Open source software, reviewed by experts
• R is free and licensed under the GNU General Public License (GPL)
• R has over 5,578 packages as of May 31, 2014
• R is cross-platform: GNU/Linux, Mac, Windows
• R plays well with CSV, SAS, SPSS, Excel, Access, Oracle, MySQL, and SQLite

Page 7: Data analysis with R

HOW TO INSTALL R?

• Download and install the latest version from:
– http://cran.r-project.org
• Install packages from the R console:
– > install.packages('package_name')
• R has its own LaTeX-like documentation:
– > help()

Page 8: Data analysis with R

STARTING WITH R

• In R console:
– > x <- 2
– > x
– > y <- x^2
– > y
– > ls()
– > rm(y)
• Vectors:
– > v <- c(4, 7, 23.5, 76.2, 80)
– > summary(v)
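The numbers reported by summary() can also be computed one at a time. The sketch below reuses the vector from the slide; the standardization step is an added illustration of vectorized arithmetic, not part of the original deck.

```r
v <- c(4, 7, 23.5, 76.2, 80)      # same vector as on the slide

v_mean   <- mean(v)               # arithmetic mean
v_median <- median(v)             # middle value of the sorted vector
v_range  <- range(v)              # c(min, max)

# Arithmetic is vectorized: this standardizes every element at once.
v_scaled <- (v - v_mean) / sd(v)

summary(v)                        # min, quartiles, median, mean, max in one call
```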

Page 9: Data analysis with R

STARTING WITH R

• Histogram:
– > r <- rnorm(100)
– > summary(r)
– > plot(r)
– > hist(r)
• QQ-Plot (Quantile):
– > qqplot(r, rnorm(1000))
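The slide compares the sample against a second random sample. A common alternative, sketched here, is to compare against theoretical normal quantiles with qqnorm() and qqline(). The seed is arbitrary and only makes the run reproducible.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
r <- rnorm(100)

# qqnorm() plots sample quantiles against theoretical normal quantiles
# and (invisibly) returns the plotted coordinates.
qq <- qqnorm(r, main = "Normal Q-Q plot of r")
qqline(r, col = "red") # reference line through the first and third quartiles

# For normally distributed data the points lie close to a straight line,
# so the correlation between the two sets of quantiles is near 1.
qq_cor <- cor(qq$x, qq$y)
qq_cor
```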

Page 10: Data analysis with R

STARTING WITH R

• Factors:
– > g <- c('f', 'm', 'm', 'm', 'f', 'm', 'f', 'm')
– > h <- factor(g)
– > table(g)
• Matrices:
– > r <- rnorm(100)
– > dim(r) <- c(50, 2)
– > r
– > summary(r)
– > M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow=TRUE)
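A short sketch of what the factor and the matrix from this slide contain. The variable names counts, codes, col_means, and Mt are introduced here for illustration.

```r
g <- c("f", "m", "m", "m", "f", "m", "f", "m")
h <- factor(g)

levels(h)               # "f" "m"
counts <- table(h)      # number of observations per level
codes  <- as.integer(h) # underlying integer codes (1 = "f", 2 = "m")

# byrow = TRUE fills the matrix row by row instead of column by column.
M <- matrix(c(45, 23, 66, 77, 33, 44), nrow = 2, ncol = 3, byrow = TRUE)
M[2, 3]                 # 44: second row, third column
col_means <- colMeans(M)
Mt <- t(M)              # transpose: 3 x 2
```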

Page 11: Data analysis with R

STARTING WITH R

• Data Frames:

– > n = c(2, 3, 5)
– > s = c("aa", "bb", "cc")
– > b = c(TRUE, FALSE, TRUE)
– > df = data.frame(n, s, b)

• Built-in Data Set:
– > state.x77
– > st = as.data.frame(state.x77)
– > st$Density = st$Population * 1000 / st$Area
– > summary(st)
– > cor(st)
– > pairs(st)
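Once the density column exists, the data frame can be sorted and filtered like any other table. The sketch below uses the same built-in data set; the names dense, large, and densest and the population threshold are illustrative choices.

```r
st <- as.data.frame(state.x77)

# Population is recorded in thousands and Area in square miles,
# so this is people per square mile.
st$Density <- st$Population * 1000 / st$Area

str(st)                                  # column names and types

# Sort by density, highest first, and keep three columns.
dense <- st[order(-st$Density), c("Population", "Area", "Density")]
head(dense, 3)

# Filter rows with subset(); the threshold is arbitrary.
large <- subset(st, Population > 10000)  # states with more than 10 million people

densest <- rownames(st)[which.max(st$Density)]
densest
```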

Page 12: Data analysis with R

STARTING WITH R

Page 13: Data analysis with R

LINEAR REGRESSION MODEL IN R

• Linear Regression Model:

– > x <- 1:100
– > y <- x^3
– Model: y = a + b · x
– > lm(y ~ x)
– > model <- lm(y ~ x)
– > summary(model)
– > par(mfrow=c(2,2))
– > plot(model)
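To connect the fitted object back to the model y = a + b · x, the estimates can be pulled out with coef(). This is a small sketch on the same x and y; the names ab, a, b, y_hat, and max_diff are introduced here.

```r
x <- 1:100
y <- x^3
model <- lm(y ~ x)

ab <- coef(model)            # named vector: "(Intercept)" and "x"
a  <- ab[["(Intercept)"]]    # intercept a
b  <- ab[["x"]]              # slope b

# Rebuilding the fitted values by hand gives the same numbers as fitted().
y_hat    <- a + b * x
max_diff <- max(abs(y_hat - fitted(model)))

c(a = a, b = b)
```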

Page 14: Data analysis with R

LM MODEL

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-129827 -103680  -29649   85058  292030

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -207070.2    23299.3  -8.887 3.14e-14 ***
x              9150.4      400.6  22.844  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 115600 on 98 degrees of freedom
Multiple R-squared:  0.8419, Adjusted R-squared:  0.8403
F-statistic: 521.9 on 1 and 98 DF,  p-value: < 2.2e-16
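An R-squared of 0.8419 looks respectable, but y = x^3 is not linear in x, and the residuals show it. The sketch below plots the residuals and, as an illustration that is not part of the original slides, adds a quadratic term to show how much of the remaining structure a curved model picks up.

```r
x <- 1:100
y <- x^3

fit_lin  <- lm(y ~ x)            # the model from the slide
fit_quad <- lm(y ~ x + I(x^2))   # illustrative extension: adds a squared term

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared

# A U-shaped residual pattern signals that the straight line misses curvature.
plot(x, resid(fit_lin), ylab = "Residuals of y ~ x")
abline(h = 0, lty = 2)

c(linear = r2_lin, quadratic = r2_quad)
```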

Page 15: Data analysis with R

LM MODEL

Page 16: Data analysis with R

DIAGNOSIS PLOT

Page 17: Data analysis with R

LINEAR REGRESSION MODEL IN R

• Model Built-in Data:

– > colnames(st)[4] = "Life.Exp"
– > colnames(st)[6] = "HS.Grad"
– > model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density, data=st)
– > summary(model1)
– > model2 <- step(model1)
– > model3 = update(model2, .~.-Population)
– > summary(model3)
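step() searches for a smaller model by repeatedly adding or dropping terms and keeping the change that lowers AIC the most. The sketch below rebuilds the full model from the built-in data and compares the AIC before and after. The shorthand Life.Exp ~ . expands to the same eight predictors as on the slide, and trace = 0 only silences the search log.

```r
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"   # original name "Life Exp" contains a space
colnames(st)[6] <- "HS.Grad"    # original name "HS Grad" contains a space
st$Density <- st$Population * 1000 / st$Area

model1 <- lm(Life.Exp ~ ., data = st)   # all remaining columns as predictors
model2 <- step(model1, trace = 0)       # AIC-based stepwise selection, quiet

formula(model2)                         # terms that survived the search

# extractAIC() returns c(edf, AIC) on the scale step() uses internally.
aic_full <- extractAIC(model1)[2]
aic_step <- extractAIC(model2)[2]
c(full = aic_full, stepwise = aic_step)
```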

Page 18: Data analysis with R

LINEAR REGRESSION MODEL IN R

• Confidence limits on Estimated Coefficients:

– > confint(model3)
– > predict(model3, list(Murder=10.5, HS.Grad=48, Frost=100))
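predict() can also return an interval around the point estimate. The sketch below assumes the final model keeps Murder, HS.Grad, and Frost, which is consistent with the predict() call above but is fitted directly here so the example is self-contained. The names fit, ci, new_state, and pred are illustrative.

```r
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

# Assumption: the selected model has these three predictors.
fit <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = st)

ci <- confint(fit, level = 0.95)   # one row per coefficient: lower and upper limit

# Same hypothetical state as on the slide.
new_state <- data.frame(Murder = 10.5, HS.Grad = 48, Frost = 100)

# interval = "prediction" adds lower and upper bounds for a single new observation.
pred <- predict(fit, newdata = new_state, interval = "prediction", level = 0.95)

ci
pred
```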

Page 19: Data analysis with R

OUTLIERS

• Boxplot:

– > v <- rnorm(100)
– > v = c(v, 10)
– > boxplot(v)
– > rug(jitter(v), side=2)
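The points a boxplot draws as outliers can be listed without plotting by using boxplot.stats(). The sketch below also recomputes the upper fence by hand with the default 1.5 multiplier; the seed and the names out, hinges, and upper_fence are illustrative.

```r
set.seed(42)                   # arbitrary seed, for reproducibility only
v <- c(rnorm(100), 10)         # 100 standard normal values plus one planted outlier

# $out holds the values that fall beyond the whiskers.
out <- boxplot.stats(v)$out

# The whiskers extend at most 1.5 times the hinge spread beyond the hinges.
hinges      <- fivenum(v)[c(2, 4)]          # lower and upper hinge
upper_fence <- hinges[2] + 1.5 * diff(hinges)

out
upper_fence
```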

Page 20: Data analysis with R

PROBABILITY DENSITY FUNCTION

• PDF:

– > r <- rnorm(1000)
– > hist(r, probability=TRUE)
– > lines(density(r), col="red")
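Because the sample is drawn from a standard normal distribution, the kernel density estimate can be checked against the true density dnorm(). The sketch below overlays both and confirms numerically that the estimate integrates to roughly 1. The seed and the names d and area are illustrative.

```r
set.seed(7)                             # arbitrary seed, for reproducibility only
r <- rnorm(1000)
d <- density(r)                         # kernel density estimate on an evenly spaced grid

hist(r, probability = TRUE)             # histogram scaled so bar areas sum to 1
lines(d, col = "red")                   # estimated density
curve(dnorm(x), add = TRUE, lty = 2)    # true standard normal density

# Riemann sum over the grid: a density should integrate to about 1.
area <- sum(d$y) * diff(d$x[1:2])
area
```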

Page 21: Data analysis with R

CASE STUDY: SHARETHIS EXAMPLE

• Relationship of clicks to winning price and impressions on ADX
• Data:
– Analyzed ADX hourly impression logs
• Method:
– Detected outliers
– Predicted clicks using a regression tree model

Page 22: Data analysis with R

CASE STUDY: SHARETHIS EXAMPLE

• Outlier Detection:

[Figure: outlier detection plots for clicks and impressions]

Page 23: Data analysis with R

CASE STUDY: SHARETHIS EXAMPLE

• Regression Tree
– One of the most powerful classification/regression methods
– > library(rpart)
– > fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data=x)
– > plot(fit)
– > text(fit)
– > plot(predict(fit), log(x$CLK))
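The ADX impression logs are not included in the deck, so the sketch below runs the same rpart() call on simulated data. The column names match the slide, but every value is synthetic and the relationship between clicks and impressions is an assumption made purely so the example runs.

```r
library(rpart)

set.seed(123)                        # arbitrary seed, for reproducibility only
n <- 500

# Synthetic stand-in for the hourly logs; NOT the real ShareThis data.
x <- data.frame(
  IMP       = round(exp(rnorm(n, mean = 8, sd = 1))) + 1,  # impressions
  AVG_PRICE = runif(n, 0.2, 3),                            # average winning price
  SD_PRICE  = runif(n, 0.05, 1)                            # price standard deviation
)
# Assumed relationship: clicks grow with impressions and price, plus noise.
x$CLK <- round(x$IMP * 0.002 * x$AVG_PRICE * exp(rnorm(n, sd = 0.3))) + 1

fit  <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data = x)
pred <- predict(fit)                 # fitted log-clicks for the training rows

# Root mean squared error on the log scale, measured on the training data.
rmse <- sqrt(mean((pred - log(x$CLK))^2))

printcp(fit)                         # complexity table: error after each split
rmse
```

The error here is measured on the training rows, so it is optimistic; a held-out set would give a fairer estimate.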

Page 24: Data analysis with R

CASE STUDY: SHARETHIS EXAMPLE

• Regression Tree

Page 25: Data analysis with R

CASE STUDY: SHARETHIS EXAMPLE

• Predict Log of Clicks

Page 26: Data analysis with R

CASE STUDY: COLOR DETECTION

• Detect color from product image:

Page 27: Data analysis with R

RESOURCES

• Books:

– An Introduction to Statistical Learning: with Applications in R, G. James, D. Witten, T. Hastie, R. Tibshirani, 2013

– The Art of R Programming: A Tour of Statistical Software Design, N. Matloff, 2011

– R Cookbook (O'Reilly Cookbooks), P. Teetor, 2011

• R Blog:
– http://www.r-bloggers.com