High Performance Predictive Analytics in R and Hadoop
Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist
Hadoop Summit 2013, June 27, 2013
Innovate with R
Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
Create beautiful and unique data visualizations: as seen in the New York Times, Twitter and Flowing Data
Thriving open-source community: leading edge of analytics research
Fills the talent gap: new graduates prefer R
Download the White Paper: R is Hot (bit.ly/r-is-hot)
R is open source and drives analytic innovation but has some limitations:
Bigger data sizes: memory bound vs. Big Data
Speed of analysis: single threaded vs. scale-out, parallel, high-speed processing
Production support: community support vs. commercial production support
Innovation and scale:
Innovative: 5000+ packages, exponential growth
Combines with open source R packages where needed
Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform
DeployR: Web Services
Software Development Kit
DevelopR: Integrated Development Environment
ConnectR: High Speed & Direct Connectors
Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC
ScaleR: High Performance Big Data Analytics
Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers
DistributedR: Distributed Computing Framework
Platform LSF, MS HPC Server, MS Azure Burst
RevoR: Performance Enhanced Open Source R + CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
Open Source R plus Revolution Analytics performance enhancements
Revolution Analytics value-add components providing power and scale to Open Source R
Our objectives with respect to Hadoop:
Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today
Scalable and High Performance
Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
Overview of Revolution and Hadoop
Revolution currently provides capabilities for using R and Hadoop together:
RHadoop packages: rmr2, rhdfs, rhbase
RevoScaleR package: rapidly read and process data from HDFS
We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.
RHadoop: Map-Reduce with R
Unlock data in Hadoop using only the R language; no need to learn Java, Pig, Python or any other language
RevoScaleR Overview
An R package that adds capabilities to R:
Data import, cleaning, exploration, and transformation
Analytics: descriptive and predictive
Parallel and distributed computing
Visualization
Scales from small local data to huge distributed data
Scales from laptop to server to cluster to cloud
Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
Key ways RevoScaleR enhances R
High Performance Computing (HPC) functions: Parallel/Distributed computing
High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing
XDF file format; rapidly store and extract data
Use results from HPA and HPC functions in other R packages
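The XDF format's speed comes from storing data in chunks so individual blocks of rows can be read without scanning the whole file. The on-disk details of XDF are proprietary and not described in the deck; purely as an illustration of the chunked-storage idea (hypothetical format, not XDF itself), here is a minimal sketch in Python:

```python
import array
import io
import struct

CHUNK_ROWS = 4  # rows per chunk; real systems use much larger blocks

def write_chunked(values, fh):
    """Write a numeric column as length-prefixed binary chunks,
    recording the byte offset of each chunk for direct access."""
    offsets = []
    for start in range(0, len(values), CHUNK_ROWS):
        chunk = array.array("d", values[start:start + CHUNK_ROWS])
        offsets.append(fh.tell())
        fh.write(struct.pack("<I", len(chunk)))  # row-count header
        fh.write(chunk.tobytes())
    return offsets

def read_chunk(fh, offset):
    """Read one chunk directly by offset -- no full-file scan."""
    fh.seek(offset)
    (n,) = struct.unpack("<I", fh.read(4))
    chunk = array.array("d")
    chunk.frombytes(fh.read(8 * n))  # 8 bytes per double
    return list(chunk)

buf = io.BytesIO()
col = [float(i) for i in range(10)]
offsets = write_chunked(col, buf)
second_chunk = read_chunk(buf, offsets[1])  # rows 4..7 only
```

Keeping a per-chunk offset index is what lets a chunk-at-a-time algorithm jump straight to the rows it needs.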
Revolution R Enterprise 9
High Performance Big Data Analytics with ScaleR
Statistical Tests
Machine Learning
Simulation
Descriptive Statistics
Data Visualization
R Data Step
Predictive Models
Sampling
ScaleR: High Performance, Scalable, Parallel External Memory Algorithms
Data Prep, Distillation & Descriptive Analytics

R Data Step:
Data import: delimited, fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables, factor variables, missing value handling
Sort, merge, split
Aggregate by category (means, sums)

Sampling:
Subsample (observations & variables), random sampling

Descriptive Statistics:
Min/max, mean, median (approx.), quantiles (approx.)
Standard deviation, variance, correlation, covariance
Sum of squares (cross-product matrix for set variables)
Pairwise cross tabs, risk ratio & odds ratio
Cross-tabulation of data (standard tables & long form)
Marginal summaries of cross tabulations

Statistical Tests:
Chi-square test, Kendall rank correlation, Fisher's exact test, Student's t-test
ScaleR: High Performance, Scalable, Parallel External Memory Algorithms
Statistical Modeling:
Sum of squares (cross-product matrix for set variables)
Covariance & correlation matrices
Multiple linear regression
Generalized linear models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
Logistic regression
Classification & regression trees
Predictions/scoring for models
Residuals for all models

Data Visualization:
Histogram, line plot, scatter plot, Lorenz curve
ROC curves (actual data and predicted values)

Predictive Models / Cluster Analysis:
K-means

Machine Learning / Classification:
Decision trees

Simulation:
Monte Carlo
HPC Capabilities in RevoScaleR
Execute (essentially) any R function in parallel on nodes and cores
Results from all runs are returned in a list to the user’s laptop
Extensive control over parameters
Extensive control over nodes, cores, and times to run
Ideal for simulations and for running R functions on small amounts of data
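The HPC pattern above (run one function many times in parallel, gather results in a list) can be sketched outside of RevoScaleR. The following Python sketch is only an analogy: a thread pool stands in for cluster nodes and cores, and the simulation function stands in for an arbitrary R function.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, n_draws=10_000):
    """One simulation run -- stands in for an arbitrary R function."""
    rng = random.Random(seed)
    # Monte Carlo estimate of the mean of Uniform(0, 1)
    return sum(rng.random() for _ in range(n_draws)) / n_draws

# Run 8 independent simulations in parallel; the results come back
# as a list, one element per run, on the caller's machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(8)))
```

Because each run is independent, the same code scales from a laptop's cores to a cluster's nodes without changing the simulation function itself.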
Key ScaleR HPA features
Handles an arbitrarily large number of rows in a fixed amount of memory
Scales linearly with the number of rows
Scales linearly with the number of nodes
Scales well with the number of cores per node
Scales well with the number of parameters
Extremely high performance
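The fixed-memory claim above rests on one-pass updating formulas: each chunk updates a small set of running statistics and is then discarded, so memory use never grows with row count. As a minimal sketch of the idea (Welford-style mean and variance over chunks; this is standard textbook technique, not Revolution's code):

```python
def chunks(stream, size):
    """Yield fixed-size chunks so only `size` rows are in memory at once."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def streaming_mean_var(stream, chunk_size=1000):
    """One pass, O(1) state: count, running mean, and sum of squared
    deviations (Welford's algorithm), updated chunk by chunk."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks(stream, chunk_size):
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    return n, mean, m2 / (n - 1)  # sample variance

# 100,001 rows processed with only one chunk resident at a time
n, mean, var = streaming_mean_var(float(i) for i in range(100_001))
```

The row count can grow without bound while the state stays three numbers, which is exactly the property that makes the cost linear in rows.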
Regression comparison using in-memory data: lm() vs rxLinMod()
(Benchmark chart comparing lm() in R, lm() in Revolution R, and rxLinMod() in Revolution R.)
GLM comparison using in-memory data: glm() vs rxGlm()
SAS HPA Benchmarking comparison*: Logistic Regression

                 SAS HPA        Revolution R Enterprise (ScaleR)
Rows of data     1 billion      1 billion
Parameters       "just a few"   7
Time             80 seconds     44 seconds
Data location    In memory      On disk
Nodes            32             5
Cores            384            20
RAM              1,536 GB       80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Revolution R Enterprise Delivers Performance at 2% of the Cost
Allstate compares SAS, Hadoop and R for Big-Data Insurance Models

Approach                  Platform                          Time to fit
SAS                       16-core Sun Server                5 hours
rmr/MapReduce             10-node, 80-core Hadoop cluster   > 10 hours
R                         250 GB Server                     Impossible (> 3 days)
Revolution R Enterprise   5-node, 20-core LSF cluster       5.7 minutes
Generalized linear model, 150 million observations, 70 degrees of freedom. http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
What makes ScaleR so fast?
Lots of things! This kind of performance requires:
Careful architecting that supports performance
Constant, intense focus on all of the details
A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)
Extensive profiling and continuous benchmarking to detect problems and improve code
Specific speed-related factors (1)
Efficient computational algorithms
Efficient memory management: minimize data copying and data conversion
Heavy use of C++ templates; optimal code
Efficient data file format; fast access by row and column
Specific speed-related factors (2)
Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
Categorical variables are handled efficiently
Parallel External Memory Algorithms (PEMAs)
The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time, in parallel across cores and nodes
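The chunk-at-a-time pattern can be sketched concretely. In this Python sketch (an illustration of the general PEMA idea, not Revolution's implementation), each worker folds its chunk into a small "partial state", and partial states merge associatively, so the same code works whether chunks are spread across cores or across nodes; the toy statistic is a mean.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Map step: reduce one chunk of rows to a small partial state."""
    return (len(chunk), sum(chunk))  # (count, sum)

def combine(a, b):
    """Merge two partial states; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1])

def pema_mean(chunks, workers=4):
    """Process chunks in parallel, then fold the partial states."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    state = partials[0]
    for p in partials[1:]:
        state = combine(state, p)
    n, total = state
    return total / n

data = list(range(1, 1001))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
mean = pema_mean(chunks)
```

Only the (count, sum) pairs ever cross between workers, which is why the pattern costs little communication and never needs all rows in memory at once.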
Scalability and portability of Revo's implementation of PEMAs
These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability.
They are independent of the "compute context" (number of cores, computers, distributed computing platform), giving portability across these dimensions.
They are independent of where the data is coming from, giving portability with respect to data sources.
Simplified ScaleR Internal Architecture
Analytics Engine: PEMAs are implemented here (scalable, parallelized, threaded, distributable)
Inter-process Communication: MPI, RPC, Sockets, Files
Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
ScaleR on Hadoop: Base Implementation
Each iteration (pass through the data) is a separate MapReduce job
The mapper produces "intermediate result objects" for each task, which are combined using a combiner and then a reducer
A master process decides if another pass through the data is required
Data can be stored temporarily (or longer) in XDF binary format for increased speed, especially on iterative algorithms
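The control flow described above (map tasks emit intermediate result objects, a reduce step combines them, and a master decides whether another pass is needed) can be simulated with a toy iterative algorithm. This Python sketch is an illustration of that loop under assumed details, not Revolution's code: the algorithm is 1-D k-means, and the chunks stand in for HDFS blocks.

```python
def mapper(chunk, centers):
    """Map task: emit per-center (sum, count) -- the 'intermediate
    result object' for this chunk."""
    out = {c: [0.0, 0] for c in range(len(centers))}
    for x in chunk:
        c = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        out[c][0] += x
        out[c][1] += 1
    return out

def reducer(partials, k):
    """Reduce step: combine intermediate results from all map tasks."""
    total = {c: [0.0, 0] for c in range(k)}
    for p in partials:
        for c, (s, n) in p.items():
            total[c][0] += s
            total[c][1] += n
    return total

def kmeans_1d(chunks, centers, tol=1e-6, max_passes=100):
    """Master loop: one simulated 'MapReduce job' per pass until the
    centers stop moving."""
    for _ in range(max_passes):
        partials = [mapper(ch, centers) for ch in chunks]
        totals = reducer(partials, len(centers))
        new = [totals[c][0] / totals[c][1] if totals[c][1] else centers[c]
               for c in range(len(centers))]
        if max(abs(a - b) for a, b in zip(new, centers)) < tol:
            return new
        centers = new
    return centers

# Two well-separated groups split across three "HDFS blocks"
chunks = [[1.0, 2.0, 3.0], [11.0, 12.0], [13.0, 2.0]]
centers = kmeans_1d(chunks, [0.0, 10.0])
```

Each pass reads the data once; the master only ever sees the tiny combined result, never the raw rows.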
Performance enhancements
A single Map job that handles all iterations, to reduce the overhead of starting multiple jobs
Tree-structured communication and reduction to reduce time for those operations, especially on large clusters
Alternative communication schemes (e.g. sockets) instead of files
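The tree-structured reduction mentioned above combines partial results pairwise in rounds, so n partial results need about log2(n) combining rounds instead of n-1 sequential combines. A minimal sketch of the idea (a generic illustration, not Revolution's code):

```python
def tree_reduce(values, combine):
    """Combine pairwise in rounds: roughly log2(n) rounds instead of
    n-1 sequential combines; the win grows with cluster size."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        vals = [combine(vals[i], vals[i + 1]) if i + 1 < len(vals)
                else vals[i]                    # odd one carries over
                for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

partials = list(range(1, 17))  # 16 per-node partial sums
total, rounds = tree_reduce(partials, lambda a, b: a + b)
```

With 16 nodes the combine finishes in 4 rounds; the rounds in a real cluster run concurrently across node pairs, which is where the latency saving comes from.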
Thank You!
Mario Inchiosa
mario.inchiosa@revolutionanalytics.com
Backup Slides
Functionality in common across HPA algorithms
Ability to do on-the-fly data transformations in the R language:
In-formula: ArrDelay>15 ~ DayOfWeek + …
As transform: Late = ArrDelay>15
As transform function: specify an R function to process an entire chunk of data at a time
Row selection
Both standard weights and frequency weights
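How these three features compose per chunk can be sketched as follows. This is a hypothetical Python illustration, not RevoScaleR code: the transform derives Late = ArrDelay > 15 (from the slides), the filter does row selection, and a frequency weight counts each row's repetitions; the field Freq and the toy rows are invented for the example.

```python
def weighted_share(chunks, transform, row_filter, weight_key):
    """Per chunk: apply the transform, drop filtered rows, and
    accumulate a frequency-weighted numerator and denominator."""
    num = den = 0
    for chunk in chunks:
        for row in chunk:
            row = transform(dict(row))   # on-the-fly derived variable
            if not row_filter(row):
                continue                 # row selection
            w = row[weight_key]          # frequency weight
            num += w * row["Late"]
            den += w
    return num / den

# Toy rows standing in for airline data (ArrDelay/DayOfWeek from the
# slides; Freq is a hypothetical frequency-weight column)
chunks = [
    [{"ArrDelay": 30, "DayOfWeek": 1, "Freq": 2},
     {"ArrDelay": 5,  "DayOfWeek": 6, "Freq": 1}],
    [{"ArrDelay": 20, "DayOfWeek": 3, "Freq": 1},
     {"ArrDelay": 0,  "DayOfWeek": 2, "Freq": 4}],
]

def add_late(row):
    row["Late"] = 1 if row["ArrDelay"] > 15 else 0  # Late = ArrDelay>15
    return row

# Share of late flights on weekdays, weighted by frequency
share = weighted_share(chunks, add_late,
                       lambda r: r["DayOfWeek"] <= 5, "Freq")
```

Because the transform and selection run inside the chunk loop, they add no extra pass over the data.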
Portability of user code
The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop
The same analytics code can be used on a small data.frame in memory and on a huge distributed data file
Sample code for logit on laptop
# Specify local data source
airData <- myLocalDataSource

# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData)