Data Analytics with R and SQL Server

Data Analytics with R and SQL Server Stéphane Fréchette Thursday March 19, 2015

Transcript of Data Analytics with R and SQL Server

Page 1: Data Analytics with R and SQL Server

Data Analytics with R and SQL Server

Stéphane Fréchette, Thursday March 19, 2015

Page 2: Data Analytics with R and SQL Server

Who am I?

My name is Stéphane Fréchette

SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine.

I have a passion for architecting, designing and building solutions that matter.

Twitter: @sfrechette

Blog: stephanefrechette.com

Email: [email protected]

Page 3: Data Analytics with R and SQL Server

Topics

• What is R?

• Should I use R?

• Data Structures

• Graphics

• Data Manipulation in R

• Connecting to SQL Server

• Demos

• Resources

• Q&A

Page 4: Data Analytics with R and SQL Server

DISCLAIMER

This is neither a course nor a tutorial, but an introduction, a walkthrough to inspire you to further explore and learn more about R and statistical computing.

Page 5: Data Analytics with R and SQL Server

“Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.”

- Wikipedia

Page 6: Data Analytics with R and SQL Server

What is R?

• A programming language and environment for statistical computing and graphics

• R has its origins in the S programming language, created in the 1970s

• Best used to manipulate moderately sized datasets, do statistical analysis and produce data-centric documents and presentations

• Its tools are distributed as packages, which any user can download to customize the R environment

• Cross-platform: runs on Mac, Windows and Unix-based systems

Page 7: Data Analytics with R and SQL Server

Should I use R?

Are you doing statistics?

[Flowchart: Yes / No decision branches]

Where “statistics” can mean machine learning, predictive analytics, data science, anything that falls under a rather broad umbrella…

But if you have some data that makes sense to represent in a table-like structure, and you want to do some cool analytical or statistical stuff with it, R is definitely a good choice…

Page 8: Data Analytics with R and SQL Server

Downloading and Installing R

http://www.r-project.org/ http://www.rstudio.com/

Page 9: Data Analytics with R and SQL Server

The IDE (RStudio)

1. View Files and Data

2. See Workspace and History

3. See Files, Plots, Packages and Help

4. Console

[Screenshot: the RStudio window, with its four panes numbered 1 to 4]

Page 10: Data Analytics with R and SQL Server

Installing Packages

• To use packages in R, one must first install them using the install.packages function

• It downloads the packages from CRAN and installs them so they are ready to be used
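As a minimal sketch of the point above: the helper name install_if_missing is hypothetical (it is not part of R), and it assumes network access to CRAN and a configured CRAN mirror. It calls install.packages only for packages that are not already installed.

```r
# Hypothetical helper: install only the packages that are not yet available.
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package is already installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)  # downloads from CRAN and installs
  }
  invisible(missing_pkgs)
}

# Example call (not run here, since it may download packages):
# install_if_missing(c("dplyr", "RODBC"))
```

Calling install.packages("dplyr") directly works just as well; the guard simply avoids reinstalling a package on every run.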

Page 11: Data Analytics with R and SQL Server

Loading Packages

• To use a particular package in your current R session, one must load it into the R environment using the library or require functions
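The difference between the two loading functions can be sketched as follows. The example uses packages that ship with R so it runs anywhere; the package name notARealPackage123 is made up to show what happens when a package is missing.

```r
# library() stops with an error if the package is not installed.
library(stats)

# require() returns TRUE or FALSE instead, which is handy in conditionals.
has_utils <- require(utils)

# A made-up package name: require() warns and returns FALSE rather than failing.
has_fake <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

print(has_utils)
print(has_fake)
```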

Page 12: Data Analytics with R and SQL Server

Common Data Structures in R

To make the best of the R language, one needs a strong understanding of the basic data types and data structures and how to operate and use them.

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists…

To understand computations in R, two slogans are helpful:

• Everything that exists is an object

• Everything that happens is a function call

John Chambers, creator of the S programming language and core member of the R programming language project.
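The two slogans can be illustrated with a short sketch; the variable x is only an example value.

```r
x <- c(1, 2, 3)

# Everything that exists is an object: data and functions alike have a class.
class(x)    # "numeric"
class(sum)  # "function"

# Everything that happens is a function call: operators are functions too.
three  <- `+`(1, 2)   # same as 1 + 2
second <- `[`(x, 2)   # same as x[2]

print(three)
print(second)
```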

Page 13: Data Analytics with R and SQL Server

Data Structures - Vectors

The simplest structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
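A small sketch of a numeric vector, using made-up temperature readings:

```r
# Create a numeric vector with c() (combine).
temps <- c(21.5, 19.0, 23.2, 18.4)

length(temps)        # number of elements
temps[2]             # second element (indexing starts at 1)
temps * 9 / 5 + 32   # arithmetic is applied element by element
mean(temps)          # summary functions take the whole vector

# Logical indexing keeps only the elements that satisfy a condition.
warm <- temps[temps > 20]
print(warm)
```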

Page 14: Data Analytics with R and SQL Server

Data Structures - Matrices

Matrices are nothing more than 2-dimensional vectors. To define a matrix, use the function matrix.
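A minimal sketch of the matrix function, with example values 1 to 6:

```r
# Build a 2 x 3 matrix; values are filled column by column by default.
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

dim(m)       # dimensions: 2 rows, 3 columns
m[2, 3]      # element in row 2, column 3
m[, 1]       # the whole first column

# Transpose and matrix multiplication.
mt <- t(m)
product <- m %*% mt   # a 2 x 2 matrix
print(product)
```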

Page 15: Data Analytics with R and SQL Server

Data Structures - Data frames

Time series are often stored in data frames. A data frame is like a matrix with names above the columns, except that each column can hold a different type of data. This is nice, because you can call and use one of the columns by name without knowing in which position it is.
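A short sketch of a data frame holding a small, made-up daily series; the column names are illustrative only.

```r
# Each column has a name and its own type (Date, integer, character).
sales <- data.frame(
  day    = as.Date("2015-03-16") + 0:2,
  units  = c(12L, 15L, 9L),
  region = c("East", "West", "East"),
  stringsAsFactors = FALSE
)

# Columns are accessed by name, regardless of their position.
sales$units
sales[["region"]]

str(sales)                  # compact overview of the structure
total_units <- sum(sales$units)
print(total_units)
```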

Page 16: Data Analytics with R and SQL Server

Data Structures - Lists

An R list is an object consisting of an ordered collection of objects known as its components.
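A minimal sketch of a list whose components have different types and lengths; the names and values are made up.

```r
# Components can be of any type, and can be named.
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

person$name             # access a component by name
person[["scores"]][2]   # second element of the "scores" component
person[[3]]             # access a component by position
length(person)          # number of components

# Components can be added after creation.
person$city <- "Ottawa"
names(person)
```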

Page 17: Data Analytics with R and SQL Server

Data Structures - Date and Time

Sys.time() # returns the current system date time

Page 18: Data Analytics with R and SQL Server

Data Structures - Date and Time

Two main (internal) formats for date-time are POSIXct and POSIXlt:

• POSIXct: A compact format that stores a date-time as the number of seconds since 1970-01-01 UTC; typically used to store date-time columns in a data frame

• POSIXlt: A long format that stores a date-time as a list of components (year, month, day, hour and so on), from which the various sub-units of time can be extracted
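The two formats can be compared with a short sketch. The timestamp and the UTC time zone are example values chosen so the output does not depend on the local machine.

```r
# POSIXct: a single number (seconds since 1970-01-01 UTC) with a class attribute.
ct <- as.POSIXct("2015-03-19 14:30:00", tz = "UTC")
unclass(ct)

# POSIXlt: a list-like structure whose components can be extracted by name.
lt <- as.POSIXlt(ct)
lt$hour          # 14
lt$year + 1900   # years are stored as an offset from 1900
lt$mon + 1       # months are stored as 0-11
lt$wday          # day of the week, 0 = Sunday

# Both formats can be turned into text with format().
format(ct, "%Y-%m-%d")
```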

Page 19: Data Analytics with R and SQL Server

Data Structures - Others

Other useful and important special values

• NULL: Typically used for initializing variables. (x = NULL) creates a variable x of length zero. The function is.null() returns TRUE or FALSE and tells whether a variable is NULL or not.

• NA: Used for denoting missing values. (x = NA) creates a variable x holding a missing value. The function is.na() returns TRUE or FALSE and tells whether a value is NA or not.

• NaN: NaN stands for “Not a Number”. It results from undefined operations such as 0/0; some functions, such as sqrt(-1), also print a warning message in the console. The function is.nan() lets you check whether the value of a variable is NaN or not.

• Inf: Inf stands for “Infinity”. (x = 10/0 ; y = -3/0) sets the value of x to Inf and y to -Inf. The function is.infinite() lets you check whether a value is infinite, while is.finite() returns TRUE only for ordinary finite numbers.
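The behavior of these special values can be checked with a short sketch; the variable names are illustrative.

```r
# NULL: an empty object of length zero.
x <- NULL
length(x)     # 0
is.null(x)    # TRUE

# NA: a missing value; most summaries return NA unless told to skip it.
v <- c(1, NA, 3)
is.na(v)                       # FALSE TRUE FALSE
mean(v)                        # NA
mean_v <- mean(v, na.rm = TRUE)

# NaN: the result of an undefined operation. NaN also counts as NA.
nan_val <- 0 / 0
is.nan(nan_val)   # TRUE
is.na(nan_val)    # TRUE

# Inf and -Inf: the result of dividing a non-zero number by zero.
pos <- 10 / 0
neg <- -3 / 0
is.finite(pos)    # FALSE
is.infinite(neg)  # TRUE
```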

Page 20: Data Analytics with R and SQL Server

Graphics

One of the main reasons data analysts and data scientists turn to R is for its strong graphic capabilities.

Basic Graphs:

• These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and scatter plots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).
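A few of these basic graphs can be sketched with base R and the built-in mtcars dataset. Writing to a temporary PDF file is an assumption made so the example also runs without a display; in an interactive session the pdf() and dev.off() lines can be dropped.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)  # open a file-based graphics device

# Histogram and kernel density plot of fuel consumption
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
plot(density(mtcars$mpg), main = "Kernel density of mpg")

# Bar chart of the number of cars per cylinder count
barplot(table(mtcars$cyl), main = "Cars by cylinders", xlab = "Cylinders")

# Boxplot of mpg grouped by cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "mpg")

# Scatter plot with a fitted regression line
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg")
abline(lm(mpg ~ wt, data = mtcars))

dev.off()  # close the device and write the file
```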

Page 21: Data Analytics with R and SQL Server

Graphics

Advanced Graphs:

• Graphical parameters let you change a graph's symbols, fonts, colors, and lines. You can also customize a graph's axes, add reference lines, text annotations and a legend, and combine multiple plots into a single graph.

• The lattice package provides a comprehensive system for visualizing multivariate data, including the ability to create plots conditioned on one or more variables. The ggplot2 package offers an elegant system for generating univariate and multivariate graphs based on a grammar of graphics.
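As a rough sketch of the first point, the following uses only base graphics (not lattice or ggplot2) to set graphical parameters, add a reference line and a legend, and combine two plots in one figure. The temporary PDF output and the chosen colors are assumptions for illustration.

```r
out <- tempfile(fileext = ".pdf")
pdf(out, width = 8, height = 4)

# Combine two plots side by side; par() returns the previous settings.
old_par <- par(mfrow = c(1, 2))

# Left: scatter plot with custom symbol and color, plus a reference line.
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue",
     xlab = "Weight", ylab = "mpg", main = "mpg vs weight")
abline(h = mean(mtcars$mpg), lty = 2)
legend("topright", legend = "Mean mpg", lty = 2)

# Right: boxplot with a custom fill color.
boxplot(mpg ~ cyl, data = mtcars, col = "grey80",
        xlab = "Cylinders", ylab = "mpg", main = "mpg by cylinders")

par(old_par)  # restore the previous graphical parameters
dev.off()
```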

Page 22: Data Analytics with R and SQL Server

Data Manipulation in R

dplyr is an R package for fast and easy data manipulation.

Data manipulation often involves common tasks, such as selecting certain variables, filtering on certain conditions, deriving new variables from existing variables, and so forth. If we think of these tasks as “verbs”, we can define a grammar of sorts for data manipulation.

In dplyr the main verbs (or functions) are:

• filter: select a subset of the rows of a data frame

• arrange: works similarly to filter, except that instead of filtering or selecting rows, it reorders them

• select: select columns of a data frame

• mutate: add new columns to a data frame that are functions of existing columns

• summarize: collapse many values into a single summary value (such as a count or a mean)

• group_by: describe how to break a data frame into groups of rows
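The verbs above can be chained together, as in this minimal sketch. It assumes the dplyr package is installed and uses the built-in mtcars dataset; the derived column kpl and its conversion factor are illustrative choices, not taken from the talk.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                        # keep rows with more than 100 hp
  select(mpg, cyl, hp, wt) %>%                # keep only these columns
  mutate(kpl = mpg * 0.4251) %>%              # derive km per litre from mpg
  group_by(cyl) %>%                           # one group per cylinder count
  summarize(n = n(), mean_kpl = mean(kpl)) %>%  # one summary row per group
  arrange(desc(mean_kpl))                     # order groups by efficiency

print(result)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline style possible.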

Page 23: Data Analytics with R and SQL Server

Demo [dplyr: manipulating data]

Page 24: Data Analytics with R and SQL Server

Connecting R and SQL Server

The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server) through an ODBC interface

Function: Description

odbcConnect(dsn, uid = "", pwd = ""): Open a connection to an ODBC database

sqlFetch(channel, sqtable): Read a table from an ODBC database into a data frame

sqlQuery(channel, query): Submit a query to an ODBC database and return the results

sqlSave(channel, mydf, tablename = sqtable, append = FALSE): Write or update (append = TRUE) a data frame to a table in the ODBC database

sqlDrop(channel, sqtable): Remove a table from the ODBC database

close(channel): Close the connection

Page 25: Data Analytics with R and SQL Server

RODBC Example
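The example shown on this slide is not preserved in the transcript. What follows is a minimal sketch of typical RODBC usage, not the presenter's code. The helper name query_sql_server, the DSN name and the table name are all hypothetical, and the sketch assumes the RODBC package is installed and an ODBC data source for SQL Server has been configured.

```r
# Hypothetical helper: run a query against an ODBC data source and
# return the result as a data frame.
query_sql_server <- function(dsn, query) {
  channel <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 (with a warning) when the connection fails
  if (!inherits(channel, "RODBC")) {
    stop("Could not connect to DSN: ", dsn)
  }
  on.exit(RODBC::odbcClose(channel))  # always close the connection
  RODBC::sqlQuery(channel, query, stringsAsFactors = FALSE)
}

# Example call (not run here; DSN and table names are placeholders):
# df <- query_sql_server("MySqlServerDSN",
#                        "SELECT TOP 10 * FROM dbo.SomeTable")
# head(df)
```

Wrapping the connection in a function with on.exit() is one way to make sure the channel is closed even if the query fails.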

Page 26: Data Analytics with R and SQL Server

Other interface

The RJDBC package provides access to databases through a JDBC interface (requires a JDBC driver from Microsoft).

Page 27: Data Analytics with R and SQL Server

Demo [Let’s analyze: R and SQL Server]

Page 28: Data Analytics with R and SQL Server

Resources

• The R Project for Statistical Computing http://www.r-project.org/

• RStudio http://www.rstudio.com/

• Revolution Analytics http://www.revolutionanalytics.com/

• Shiny http://shiny.rstudio.com/

• {swirl} Learn R, in R http://swirlstats.com/

• R-bloggers http://www.r-bloggers.com/

• Online R resources for Beginners http://bit.ly/1x2q6Gl

• 60+ R resources to improve your data skills http://bit.ly/1BzW4ox

• Stack Overflow - R http://stackoverflow.com/tags/r

• Cerebral Mastication - R Resources http://bit.ly/17YhZj4

• Microsoft JDBC Drivers 4.1 and 4.0 for SQL Server http://bit.ly/1kEgJ7O

Page 29: Data Analytics with R and SQL Server

What Questions Do You Have?

Page 30: Data Analytics with R and SQL Server

Thank you for attending this session