Microsoft R Server for Data Science

Transcript of Microsoft R Server for Data Science

Page 1: Microsoft R Server for Data Science
Page 2: Microsoft R Server for Data Science

Data Science Team

Data Engineering

Data Science

Application Development

Business Acumen

Data Management

Data Dividend

Page 3: Microsoft R Server for Data Science

Typical advanced analytics lifecycle

Stages: Ingest, Transform, Explore, Model, Deploy, Score, Visualize, Measure

Phases: Preparation, Modeling, Operationalization

Diagram labels: Model, Score, ƒ(x)

Page 4: Microsoft R Server for Data Science

Data scientists should be creating and testing models

Data scientists are rare and expensive

(Lifecycle diagram as on Page 3.)

Page 5: Microsoft R Server for Data Science

But the reality is different …

Data scientist focus time

(Lifecycle diagram as on Page 3, annotated with 80%, 5% and 15%.)

Page 6: Microsoft R Server for Data Science

Decisions

Operationalize

Preparation

Model

Page 7: Microsoft R Server for Data Science

• Embrace Open Source

• Evolutionary Path to Cloud

• Democratize Data Science

• Skill Re-Use

• Transparent Scaling

• Facilitate Collaboration

• Decouple Data Science from Platforms

• Leverage Hybrid Cloud Architecture

• Accelerate Experimentation

• Streamline Deployment

Broaden The Talent Pool

Increase Productivity

Modernize Infrastructure

Maximize Innovation

Drive Down TCO

Page 8: Microsoft R Server for Data Science

From Data To Action On Premises

DATA, INTELLIGENCE, ACTION

People

Data Sources

Apps

Sensors and devices

Automated Systems

Microsoft R Server & SQL R Services

Apps

Cortana Intelligence

Page 9: Microsoft R Server for Data Science

Challenges posed by open source R

Lack of Commercial Support

Inadequate Modeling Performance

Complex Deployment Processes

Limited Data Scale

Page 10: Microsoft R Server for Data Science

R from Microsoft brings

Peace of mind

Efficiency

Speed and scalability

Flexibility and agility

Page 11: Microsoft R Server for Data Science

High-performance, Scalable R

Linux, Windows, Hadoop & Teradata

R Server Technology

Page 12: Microsoft R Server for Data Science

Open community: Revolution R Open is now R Open

Commercial: Revolution R Enterprise is now R Server

Page 13: Microsoft R Server for Data Science

Escapes R’s traditional memory limits

Scales predictive modeling using parallelization

Distributes computation across cores & nodes

Minimizes data movement using in-database, in-MapReduce and in-Apache Spark execution
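
The idea of escaping memory limits by working on one block of rows at a time can be illustrated with a small base R sketch. This is not RevoScaleR code and makes no claim about how R Server implements it: the function name chunked_mean_var, the temporary CSV file and the chunk size are all assumptions made for illustration. It streams a file in fixed-size chunks and combines per-chunk results with a standard pairwise update for the mean and variance, so the full data set never has to sit in memory at once.

```r
# Toy illustration of chunk-wise processing in base R (not RevoScaleR).
# chunked_mean_var and its arguments are hypothetical names for this sketch.
chunked_mean_var <- function(path, column, chunk_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once; strip the quotes that write.csv adds.
  header_line <- readLines(con, n = 1L)
  header <- gsub('"', "", strsplit(header_line, ",", fixed = TRUE)[[1]],
                 fixed = TRUE)

  n <- 0       # rows seen so far
  mean_x <- 0  # running mean
  m2 <- 0      # running sum of squared deviations from the mean

  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break          # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    x <- chunk[[column]]

    # Combine this chunk's statistics with the running totals.
    nb <- length(x)
    mb <- mean(x)
    delta <- mb - mean_x
    n_new <- n + nb
    m2 <- m2 + sum((x - mb)^2) + delta^2 * n * nb / n_new
    mean_x <- mean_x + delta * nb / n_new
    n <- n_new
  }

  list(n = n, mean = mean_x, var = m2 / (n - 1))
}

# Usage on synthetic data written to a temporary file.
set.seed(1)
df <- data.frame(delay = rnorm(2500, mean = 10, sd = 5),
                 day = sample(c("Mon", "Tue", "Wed"), 2500, replace = TRUE))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

res <- chunked_mean_var(path, "delay", chunk_rows = 400L)
print(res)
```

Only one chunk of 400 rows is held in memory at any time; the result should agree with mean() and var() computed on the full column up to floating-point rounding.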

Page 14: Microsoft R Server for Data Science
Page 15: Microsoft R Server for Data Science

• Remote Execution

• Transparent Parallelization

• Shared Resource Management

Data Nodes

Corporate Applications

Desktops & Servers

direct web services

Microsoft R Server

Hadoop

Page 16: Microsoft R Server for Data Science

Distributed R: How Does a Remote Compute Context Work?

Algorithm Master

Predictive Algorithm

Big Data

Analyze Blocks In Parallel

Load Block At A Time

Distribute Work, Compile Results

“Pack and Ship” Requests to Remote Environments

Results

Microsoft R Server functions

• A compute context defines where processing takes place.

• E.g., a remote context such as Hadoop MapReduce

• Microsoft R functions are prefixed with rx

• The currently set compute context determines the processing location
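
The "distribute work, compile results" pattern and the role of a compute context can be sketched with a toy analog in base R plus the bundled parallel package. Nothing here is RevoScaleR code: fit_by_blocks, block_crossprods and the context argument are hypothetical names, and the labels "local" and "localpar" are borrowed only as strings. Each block contributes the cross-products needed for least squares, the pieces are summed, and the coefficients are solved for once. The context argument changes only where the per-block work runs, not what is computed.

```r
# Toy analog of block-wise, optionally parallel model fitting.
# Uses only base R and the 'parallel' package that ships with R.
library(parallel)

# Per-block work: cross-products for a regression of y on x with intercept.
block_crossprods <- function(block) {
  X <- cbind(1, block$x)
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
}

# 'context' stands in for a compute context: it selects the executor only.
fit_by_blocks <- function(blocks, context = c("local", "localpar")) {
  context <- match.arg(context)
  parts <- if (context == "local") {
    lapply(blocks, block_crossprods)            # sequential, in-process
  } else {
    cl <- makeCluster(2L)                       # two local worker processes
    on.exit(stopCluster(cl))
    parLapply(cl, blocks, block_crossprods)     # blocks analyzed in parallel
  }
  # Compile results: sum the per-block pieces, then solve once.
  xtx <- Reduce("+", lapply(parts, function(p) p$xtx))
  xty <- Reduce("+", lapply(parts, function(p) p$xty))
  drop(solve(xtx, xty))
}

# Synthetic data split into four blocks of 250 rows.
set.seed(42)
d <- data.frame(x = runif(1000))
d$y <- 3 + 2 * d$x + rnorm(1000, sd = 0.1)
blocks <- split(d, rep(1:4, each = 250))

coef_local <- fit_by_blocks(blocks, "local")
coef_par   <- fit_by_blocks(blocks, "localpar")
coef_lm    <- unname(coef(lm(y ~ x, data = d)))
print(rbind(coef_local, coef_par, coef_lm))
```

Both calls use the same modeling code and should match lm() on the full data up to rounding, which is the property the slide attributes to switching compute contexts.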

Copyright Microsoft Corporation. All rights reserved.

Microsoft R Server “Client”: Console, R IDE or command-line

REMOTE CONTEXT

Microsoft R Server “Server”

Page 17: Microsoft R Server for Data Science

Local parallel processing (Linux or Windows):

### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = localFS)

In Hadoop:

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
hdfsFS

Analytical processing (the same script in both contexts):

### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)

ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for MapReduce

Compute context R script: sets where the model will run

Functional model R script: does not need to change to run in Hadoop


Page 18: Microsoft R Server for Data Science

DeployR

• Web services software development kit for integrating analytics via APIs:

• Java

• JavaScript

• .NET

Integrates R into application infrastructures

Capabilities:

• Enterprise authentication & security

• Horizontal scaling

• Invokes R Scripts from web services calls

• RESTful interface for easy integration

• Works with:

• Web & mobile apps

• Leading BI & Visualization tools

• Business rules and streaming engines

DeployR DevelopR

Page 19: Microsoft R Server for Data Science

On-demand sales forecasting

Real-time social media analysis

Leveraging the power of Office365

Page 20: Microsoft R Server for Data Science

Microsoft R Server provides an opportunity to deliver advanced analytics capabilities to customers who have already invested in storing their data on non-Microsoft platforms such as Hadoop, Teradata and Linux.

Hadoop

- Cloudera CDH, Hortonworks HDP, and HDInsight

Page 21: Microsoft R Server for Data Science
Page 22: Microsoft R Server for Data Science
Page 23: Microsoft R Server for Data Science

Write Once – Deploy Anywhere

R Server portfolio

Cloud

RDBMS

Desktops & Servers

Hadoop & Spark

EDW

R Server Technology

Page 24: Microsoft R Server for Data Science

Included in SQL Server 2016

Reuse and optimize existing R code

Eliminate data movement

In-database deployment

Memory and disk scalability

No R memory limits

Write once, deploy anywhere

Enterprise speed and scale

Near-DB analytics

Parallel threading and processing

Reuse SQL skills for data engineering

Cost effectiveness

Scalability and choice

Simplicity and agility

Page 25: Microsoft R Server for Data Science

• The industry’s broadest R-based platform

• Enterprise scale atop Spark, Hadoop, RDBMSs & EDWs

• Freedom from memory limits

• Choice of Windows and Linux IDEs

• Stable deployment

• Write-once-deploy-anywhere portability

• Investment protection

• Hybrid cloud evolution

Page 26: Microsoft R Server for Data Science
Page 27: Microsoft R Server for Data Science

Introduces the following topics:

1. Creating an R Server on Spark HDInsight cluster

2. Installing RStudio for the cluster

3. Running R using RStudio on the web

Reference: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-r-server-get-started/

Page 28: Microsoft R Server for Data Science

Get Essentials Microsoft Developer Resources and R Server Developer Edition: aka.ms/ch9.th

Microsoft R Server on-premises: www.microsoft.com/R-Server

Microsoft R Server on Azure (Cloud): https://azure.microsoft.com/en-us/marketplace/partners/microsoft-r-products/microsoft-r-server/

Page 29: Microsoft R Server for Data Science
Page 30: Microsoft R Server for Data Science
Page 31: Microsoft R Server for Data Science

What is R?

• A statistics programming language

• A data visualization tool

• Open source

• 2.5+M users

• Taught in most universities

• Thriving user groups worldwide

• 7000+ free algorithms in CRAN

• Scalable to big data

• New and recent grads use it

Language

Platform

Community

Ecosystem

• Rich application & platform integration

Page 32: Microsoft R Server for Data Science

Convergence with Flexibility

Scalable Algorithms

R: Write Once Deploy Anywhere

Templates & Samples

Microsoft R Server Family

R & Python to AML Interop.

Cortana Intelligence

Page 33: Microsoft R Server for Data Science

DistributedR

ScaleR

ConnectR

DevelopR

Code Portability Across Platforms

In the Cloud: Azure HDI / Spark

Workstations & Servers: Linux, Windows

Clustered Systems: Linux Clusters (LSF For Now), Microsoft HPC

EDW: Teradata

Hadoop: Hortonworks, Cloudera, MapR & HDInsight

Page 34: Microsoft R Server for Data Science

DI

R+CRAN

Microsoft R

DistributedR

DeployR DevelopR

ScaleR

ConnectR

Delivers High Performance Parallel Distributed Analytics Across Individual and Clustered Systems

• Cloudera

• Hortonworks

• MapR

• Apache Spark

• IBM Platform LSF

• Microsoft HPC Clusters

• Teradata Database

• Red Hat

• SuSE Servers

• Windows

DistributedR

Page 35: Microsoft R Server for Data Science

RevoDeployR Web Services

Client libraries (JavaScript, Java, .NET)

Desktop Applications (e.g. Excel)

Business Intelligence

PowerBI

Interactive Web or Mobile Applications

HTTP/HTTPS – JSON/XML

Session Management

Authentication

Data/Script Management

Administration

RR

R scripts

End User

Application

Developer

Admin

Data Scientist

Grid Node

R