Data analysis with R
SHARETHIS
DATA ANALYSIS with R
Hassan Namarvar
2
WHAT IS R?
• R is a free programming language and software environment for statistical computing and graphics.
• It is similar to the S language, developed at AT&T Bell Labs by Rick Becker, John Chambers, and Allan Wilks.
• R was initially developed by Ross Ihaka and Robert Gentleman (1996) at the University of Auckland, New Zealand.
• R source code is written in C, Fortran, and R.
3
R PARADIGMS
Multi-paradigm:
– Array
– Object-oriented
– Imperative
– Functional
– Procedural
– Reflective
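A minimal sketch of how three of these paradigms look in practice. The names `area`, `area.circle`, and the `circle` class are invented for illustration only.

```r
# Array style: arithmetic applies to whole vectors at once.
x <- c(1, 2, 3, 4)
squares <- x^2

# Functional style: functions are values that can be passed to other functions.
doubled <- sapply(x, function(v) 2 * v)

# Object-oriented style (S3): the generic dispatches on the class attribute.
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
unit_circle <- structure(list(r = 1), class = "circle")
circle_area <- area(unit_circle)
```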
4
STATISTICAL FEATURES
• Graphical techniques
• Linear and nonlinear modeling
• Classical statistical tests
• Time-series analysis
• Classification
• Clustering
• Machine learning
5
PROGRAMMING FEATURES
• R is an interpreted language
• Access R through a command-line interpreter
• Like MATLAB, R supports matrix arithmetic
• Data structures:
– Vectors
– Matrices
– Arrays
– Data Frames
– Lists
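As a rough illustration, each of the five data structures can be created in one line. The variable names and values below are arbitrary.

```r
v   <- c(4, 7, 23.5)                                   # vector: all elements share one type
m   <- matrix(1:6, nrow = 2)                           # matrix: two dimensions, one type
a   <- array(1:24, dim = c(2, 3, 4))                   # array: any number of dimensions
df  <- data.frame(id = 1:3, name = c("a", "b", "c"))   # data frame: columns may differ in type
lst <- list(numbers = v, table = df)                   # list: elements of any kind
```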
6
ADVANTAGES OF R
• One of the most comprehensive statistical analysis packages available
• Outstanding graphical capabilities
• Open source software, reviewed by experts
• R is free and licensed under the GNU General Public License (GPL)
• R has over 5,578 packages as of May 31, 2014
• R is cross-platform: GNU/Linux, Mac, Windows
• R plays well with CSV, SAS, SPSS, Excel, Access, Oracle, MySQL, and SQLite
7
HOW TO INSTALL R?
• Download and install the latest version from:
– http://cran.r-project.org
• Install packages from the R console:
– > install.packages('package_name')
• R has its own LaTeX-like documentation:
– > help()
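One possible pattern for scripts is to install a package only when it is missing. The helper `ensure_package` is a hypothetical name, not part of R; it relies only on `requireNamespace()` and `install.packages()`.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs network access and a configured CRAN mirror
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here.
ok <- ensure_package("stats")
```

Documentation for a specific function can then be opened with `help("lm")` or `?lm`.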
8
STARTING WITH R
• In R console:
– > x <- 2
– > x
– > y <- x^2
– > y
– > ls()
– > rm(y)
• Vectors:
– > v <- c(4, 7, 23.5, 76.2, 80)
– > summary(v)
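A short sketch of what can be done with the vector `v` from the slide. Operations apply element-wise, so no explicit loop is needed; the names `v_mean`, `v_scaled`, `big`, and `s` are illustrative.

```r
v <- c(4, 7, 23.5, 76.2, 80)

v_mean   <- mean(v)       # arithmetic mean of the five values
v_scaled <- v / max(v)    # element-wise division by the largest value
big      <- v[v > 20]     # logical indexing keeps the elements above 20
s        <- summary(v)    # minimum, quartiles, mean, and maximum
```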
9
STARTING WITH R
• Histogram:
– > r <- rnorm(100)
– > summary(r)
– > plot(r)
– > hist(r)
• QQ-Plot (Quantile):
– > qqplot(r, rnorm(1000))
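The slide compares the sample against a second random sample. An alternative, sketched here, is `qqnorm()` with `qqline()`, which compares against theoretical normal quantiles. The seed is arbitrary, and `pdf(NULL)` only suppresses file output; drop it in an interactive session to see the plot.

```r
set.seed(42)                 # arbitrary seed, for reproducibility only
r <- rnorm(100)

pdf(NULL)                    # draw to a null device
qq <- qqnorm(r)              # returns theoretical (x) and sample (y) quantiles
qqline(r, col = "red")       # reference line through the quartiles
dev.off()

# For normal data the points lie close to a straight line,
# so the correlation between the two sets of quantiles is high.
qq_cor <- cor(qq$x, qq$y)
```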
10
STARTING WITH R
• Factors:
– > g <- c('f', 'm', 'm', 'm', 'f', 'm', 'f', 'm')
– > h <- factor(g)
– > table(g)
• Matrices:
– > r <- rnorm(100)
– > dim(r) <- c(50,2)
– > r
– > summary(r)
– > M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow=TRUE)
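A brief sketch of how the factor and the matrix from the slide can be inspected. The helper names (`lv`, `counts`, `first_row`, `elem`, `Mt_dim`) are illustrative.

```r
g <- c("f", "m", "m", "m", "f", "m", "f", "m")
h <- factor(g)
lv     <- levels(h)     # distinct categories, sorted alphabetically
counts <- table(h)      # number of observations per level

M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow = TRUE)
first_row <- M[1, ]     # an empty column index selects the whole row
elem      <- M[2, 3]    # row 2, column 3
Mt_dim    <- dim(t(M))  # the transpose swaps the dimensions
```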
11
STARTING WITH R
• Data Frames:
– > n = c(2, 3, 5)
– > s = c("aa", "bb", "cc")
– > b = c(TRUE, FALSE, TRUE)
– > df = data.frame(n, s, b)
• Built-in Data Set:
– > state.x77
– > st = as.data.frame(state.x77)
– > st$Density = st$Population * 1000 / st$Area
– > summary(st)
– > cor(st)
– > pairs(st)
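Once the derived `Density` column exists, rows and columns can be selected by condition and by name. A small sketch; the threshold of 500 and the names `dense` and `top` are arbitrary choices for illustration.

```r
st <- as.data.frame(state.x77)
st$Density <- st$Population * 1000 / st$Area   # population is stored in thousands

# Rows where the condition holds, restricted to two named columns.
dense <- st[st$Density > 500, c("Population", "Density")]

# Row names carry the state names, so the densest state can be looked up directly.
top <- rownames(st)[which.max(st$Density)]
```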
12
STARTING WITH R
13
LINEAR REGRESSION MODEL IN R
• Linear Regression Model:
– > x <- 1:100
– > y <- x^3
– Model: y = a + b·x
– > lm(y ~ x)
– > model <- lm(y ~ x)
– > summary(model)
– > par(mfrow=c(2,2))
– > plot(model)
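Because y = x^3 is not linear in x, the straight-line model leaves large, structured residuals. One way to see this, sketched here as an addition to the slide, is to compare it with a model that uses x^3 as the regressor; `I()` keeps `^` as arithmetic inside the formula.

```r
x <- 1:100
y <- x^3

linear <- lm(y ~ x)         # the model from the slide: y = a + b * x
cubic  <- lm(y ~ I(x^3))    # regress on x^3 instead

# Residual sum of squares: large for the line, essentially zero for the cubic term.
rss_linear <- sum(resid(linear)^2)
rss_cubic  <- sum(resid(cubic)^2)
```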
14
LM MODEL
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-129827 -103680  -29649   85058  292030

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -207070.2    23299.3  -8.887 3.14e-14 ***
x              9150.4      400.6  22.844  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 115600 on 98 degrees of freedom
Multiple R-squared:  0.8419, Adjusted R-squared:  0.8403
F-statistic: 521.9 on 1 and 98 DF,  p-value: < 2.2e-16
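The numbers in this printout can also be pulled out of the summary object for further use. A minimal sketch; the variable names are illustrative.

```r
x <- 1:100
y <- x^3
s <- summary(lm(y ~ x))

coefs     <- coef(s)                           # matrix: Estimate, Std. Error, t value, Pr(>|t|)
intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["x", "Estimate"]
r2        <- s$r.squared                       # multiple R-squared
resid_se  <- s$sigma                           # residual standard error
```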
15
LM MODEL
16
DIAGNOSIS PLOT
17
LINEAR REGRESSION MODEL IN R
• Model Built-in Data:
– > colnames(st)[4] = "Life.Exp"
– > colnames(st)[6] = "HS.Grad"
– > model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density, data=st)
– > summary(model1)
– > model2 <- step(model1)
– > model3 = update(model2, .~.-Population)
– > summary(model3)
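`step()` searches for a model with a lower AIC by adding or dropping terms. The sketch below rebuilds `st` so that it runs on its own, uses `Life.Exp ~ .` as shorthand for the full formula on the slide, and sets `trace = 0` to silence the search log. Which terms survive is not asserted here.

```r
st <- as.data.frame(state.x77)
st$Density <- st$Population * 1000 / st$Area
colnames(st)[4] <- "Life.Exp"   # originally "Life Exp"
colnames(st)[6] <- "HS.Grad"    # originally "HS Grad"

model1 <- lm(Life.Exp ~ ., data = st)   # all other columns as predictors
model2 <- step(model1, trace = 0)       # AIC-based stepwise selection

aic_full <- AIC(model1)
aic_step <- AIC(model2)
kept     <- attr(terms(model2), "term.labels")   # predictors retained by step()
```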
18
LINEAR REGRESSION MODEL IN R
• Confidence limits on Estimated Coefficients:
– > confint(model3)
– > predict(model3, list(Murder=10.5, HS.Grad=48, Frost=100))
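A self-contained sketch of the same two calls, assuming `model3` ends up with the predictors Murder, HS.Grad, and Frost, as the `predict()` call on the slide suggests. The model is fitted directly here rather than through `step()`, and `interval = "confidence"` adds bounds to the prediction.

```r
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

# Assumed final model; the slide obtains it via step() and update().
model3 <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = st)

ci <- confint(model3, level = 0.95)   # one row per coefficient: lower and upper limit

new_state <- data.frame(Murder = 10.5, HS.Grad = 48, Frost = 100)
pred <- predict(model3, new_state, interval = "confidence")   # columns: fit, lwr, upr
```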
19
OUTLIERS
• Boxplot:
– > v <- rnorm(100)
– > v = c(v,10)
– > boxplot(v)
– > rug(jitter(v), side=2)
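The points a boxplot draws beyond its whiskers can also be obtained without plotting, using `boxplot.stats()`. A minimal sketch with an arbitrary seed:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
v <- c(rnorm(100), 10)           # 100 standard normal values plus one injected outlier

out    <- boxplot.stats(v)$out   # values beyond 1.5 * IQR from the hinges
is_out <- v %in% out             # logical mask marking the flagged values
```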
20
PROBABILITY DENSITY FUNCTION
• PDF:
– > r <- rnorm(1000)
– > hist(r, prob=TRUE)
– > lines(density(r), col="red")
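As a sanity check on the kernel density estimate, the area under the curve should be close to 1, and the estimate can be compared with the true normal density. A sketch with an arbitrary seed; `pdf(NULL)` only suppresses file output.

```r
set.seed(7)                            # arbitrary seed
r <- rnorm(1000)
d <- density(r)                        # kernel density estimate on an even grid

# Riemann-sum approximation of the area under the estimated density.
area_under <- sum(d$y) * diff(d$x)[1]

pdf(NULL)                              # drop this line to see the plot interactively
hist(r, freq = FALSE)                  # histogram on the density scale
lines(d, col = "red")                  # estimated density
curve(dnorm(x), add = TRUE, lty = 2)   # true standard normal density
dev.off()
```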
21
CASE STUDY: SHARETHIS EXAMPLE
• Relationship of clicks to winning price and impressions on ADX:
• Data
– Analyzed ADX hourly impression logs
• Method
– Detected outliers
– Predicted clicks using a regression tree model
22
CASE STUDY: SHARETHIS EXAMPLE
• Outlier Detection:
– [Figure: outlier plots for Clicks and Impressions]
23
CASE STUDY: SHARETHIS EXAMPLE
• Regression Tree
– One of the most powerful classification/regression methods
– > library(rpart)
– > fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data=x)
– > plot(fit)
– > text(fit)
– > plot(predict(fit), log(x$CLK))
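The ADX logs are not part of these slides, so the sketch below fits the same formula to synthetic data. The column names follow the slide; the data-generating process (clicks roughly proportional to impressions with log-normal noise) is purely an assumption made for illustration.

```r
library(rpart)

set.seed(123)                                   # arbitrary seed
n <- 500
x <- data.frame(
  IMP       = round(exp(runif(n, 5, 10))),      # synthetic hourly impressions
  AVG_PRICE = runif(n, 0.5, 3),                 # synthetic average winning price
  SD_PRICE  = runif(n, 0.1, 1)                  # synthetic price standard deviation
)
# Assumed relationship: about 1% click rate with multiplicative noise.
x$CLK <- x$IMP * 0.01 * exp(rnorm(n, sd = 0.3))

fit  <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data = x)
pred <- predict(fit)                            # fitted values on the log scale

# Agreement between predicted and observed log clicks.
fit_cor <- cor(pred, log(x$CLK))
```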
24
CASE STUDY: SHARETHIS EXAMPLE
• Regression Tree
25
CASE STUDY: SHARETHIS EXAMPLE
• Predict Log of Clicks
26
CASE STUDY: COLOR DETECTION
• Detect color from product image:
27
RESOURCES
• Books:
– An Introduction to Statistical Learning: with Applications in R, G. James, D. Witten, T. Hastie, R. Tibshirani, 2013
– The Art of R Programming: A Tour of Statistical Software Design, N. Matloff, 2011
– R Cookbook (O'Reilly Cookbooks), P. Teetor, 2011
• R Blog:– http://www.r-bloggers.com