MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR


Transcript of MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Page 1: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Pivotal Confidential–Internal Use Only

BUILT FOR THE SPEED OF BUSINESS

Page 2: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

MADlib Architecture

Page 3: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

MPP (Massively Parallel Processing)

Shared-Nothing Database Architecture

• Master Servers: query planning & dispatch
• Segment Servers: query processing & data storage
• Network Interconnect between the servers
• Access through SQL and MapReduce
• External Sources: loading, streaming, etc.

Page 4: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Architecture

• User Interface
• Functions for Inner Loops (implements ML logic)
• Low-level Abstraction Layer (array operations, C++ to DB type-bridge, …)
• RDBMS Built-in Functions
• C API (HAWQ, GPDB, PostgreSQL)

Implemented in: SQL (generated per specification), C++, Eigen

Page 5: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

How do we implement scalability? Example: Linear Regression

• Finding linear dependencies between variables

y ≈ c0 + c1 · x1 + c2 · x2 + …?

   y   |  x1  | …
-------+------+---
 10.14 | 0    | …
 11.93 | 0.69 | …
 13.57 | 1.1  | …
 14.17 | 1.39 | …
 15.25 | 1.61 | …
 16.15 | 1.79 | …

Table columns: y is the vector of dependent variables; x1, … form the design matrix X.

[Figure: plot of Regressor (y) against Predictor (x1)]

Page 6: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Challenges in computing OLS solution

$$X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}$$

The rows of X are distributed across Segment 1 and Segment 2.

Page 7: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

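Challenges in computing OLS solution

$$X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}, \qquad X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}$$

The rows of X, and therefore the columns of X^T, are distributed across Segment 1 and Segment 2.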

Page 8: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

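Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & \cdots \\ \cdots & \cdots \end{pmatrix}$$

Data across nodes are multiplied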

Page 9: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

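Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ \cdots & \cdots \end{pmatrix}$$

Data across nodes are multiplied!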

Page 10: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

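Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$$

Looks like the result can be decomposed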

Page 11: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

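Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$$

Data across nodes are multiplied!

$$X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix} \begin{pmatrix} e & f \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix} \begin{pmatrix} g & h \end{pmatrix}$$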
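
The decomposition on this slide can be checked numerically. The sketch below is only an illustration: the numeric values, the assignment of two rows to each segment, and the helper names `outer_sum` and `add` are assumptions, not part of the deck or of MADlib. It shows that summing per-row outer products on each segment and then adding the small partial results gives the same matrix as processing all rows in one place.

```python
# Rows of the design matrix X, stored on two hypothetical segments.
# The values and the 2 + 2 split are made up for illustration.
segment_1 = [(1.0, 2.0), (3.0, 4.0)]
segment_2 = [(5.0, 6.0), (7.0, 8.0)]

def outer_sum(rows):
    """Sum of outer products x x^T over the given rows (local to one segment)."""
    width = len(rows[0])
    total = [[0.0] * width for _ in range(width)]
    for row in rows:
        for i in range(width):
            for j in range(width):
                total[i][j] += row[i] * row[j]
    return total

def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

# Each segment works on its own rows; only the 2x2 partial results are combined.
merged = add(outer_sum(segment_1), outer_sum(segment_2))
# Reference: the same computation with all rows in one place.
full = outer_sum(segment_1 + segment_2)
print(merged)  # [[84.0, 100.0], [100.0, 120.0]]
```

Only a matrix whose size depends on the number of columns, not the number of rows, has to cross the network.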

Page 12: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Linear Regression: Streaming Algorithm

How to compute with a single table scan?

$$c = (X^T X)^{-1} X^T y$$

X^T X and X^T y are each accumulated as a running sum (+) over the rows; the inverse is applied once at the end.
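
A minimal sketch of the single-scan idea, assuming the computation is organized as an aggregate with a per-row transition step, a merge step across segments, and a final step. The function names `transition`, `merge`, `final`, and `empty_state` and the data are hypothetical; this is not MADlib's actual code.

```python
def empty_state(n):
    """Zero-initialized running sums: an n x n matrix for X^T X and an n-vector for X^T y."""
    return [[0.0] * n for _ in range(n)], [0.0] * n

def transition(state, x, y):
    """Fold one row (x, y) into the running sums X^T X and X^T y."""
    xtx, xty = state
    for i in range(len(x)):
        xty[i] += x[i] * y
        for j in range(len(x)):
            xtx[i][j] += x[i] * x[j]
    return state

def merge(s1, s2):
    """Combine the states of two segments by adding their sums."""
    xtx = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(s1[0], s2[0])]
    xty = [p + q for p, q in zip(s1[1], s2[1])]
    return xtx, xty

def final(state):
    """Solve (X^T X) c = X^T y by Gaussian elimination with partial pivoting."""
    xtx, xty = state
    n = len(xty)
    a = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    c = [0.0] * n
    for r in reversed(range(n)):
        c[r] = (a[r][n] - sum(a[r][k] * c[k] for k in range(r + 1, n))) / a[r][r]
    return c

# Made-up rows (x1, y) following y = 2 + 3 * x1, split over two segments.
segments = [[(0.0, 2.0), (1.0, 5.0)], [(2.0, 8.0), (3.0, 11.0)]]
states = []
for rows in segments:
    state = empty_state(2)
    for x1, y in rows:
        state = transition(state, [1.0, x1], y)  # leading 1.0 is the intercept term
    states.append(state)

coef = final(merge(states[0], states[1]))
print(coef)  # approximately [2.0, 3.0]
```

Every row is visited exactly once, and the state that is kept and merged has a fixed size.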

Page 13: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Problem solved? … Not Yet

Many ML solutions are iterative without analytical formulations

Initialize problem → Perform single step → Has converged?
• false: perform another single step
• true: return results
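
The flowchart corresponds to a small driver loop. The sketch below is a generic illustration; the function `iterate`, its parameters, and the toy problem are assumptions chosen for the example and do not come from MADlib.

```python
def iterate(initial_state, step, has_converged, max_iterations=100):
    """Run step() until has_converged(previous, current) is true or the budget runs out."""
    state = initial_state
    for iteration in range(1, max_iterations + 1):
        new_state = step(state)
        if has_converged(state, new_state):
            return new_state, iteration
        state = new_state
    return state, max_iterations

# Toy problem: Newton's iteration for the square root of 2.
root, steps = iterate(
    initial_state=1.0,
    step=lambda x: 0.5 * (x + 2.0 / x),
    has_converged=lambda old, new: abs(new - old) < 1e-12,
)
print(root, steps)
```

In a database setting, each call to `step` would be one pass over the data, while the loop itself runs outside the scan.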

Page 14: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

In general, use a convex optimization framework

Each step has an analytical formulation that can be performed in parallel

Gradient Descent

Start at a random point
Repeat
    Determine a descent direction
    Choose a step size
    Update the model
Until stopping criterion is satisfied
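
The loop above can be sketched for the least-squares objective. Several choices here are assumptions made for the illustration: the fixed starting point (instead of a random one, so the run is reproducible), the constant step size, the stopping rule, the data, and the names `partial_gradient` and `gradient_descent`. The point is that each segment's gradient contribution is a plain sum, so the contributions can be computed independently and added.

```python
def partial_gradient(rows, coef):
    """Gradient of the sum of squared residuals over one segment's rows (x, y)."""
    grad = [0.0] * len(coef)
    for x, y in rows:
        residual = sum(c * xi for c, xi in zip(coef, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * residual * xi
    return grad

def gradient_descent(segments, n_coef, step_size=0.04, tolerance=1e-10,
                     max_iterations=10000):
    coef = [0.0] * n_coef  # fixed start, chosen for reproducibility
    for _ in range(max_iterations):
        # Each segment's gradient is computed independently, then summed.
        partials = [partial_gradient(rows, coef) for rows in segments]
        grad = [sum(values) for values in zip(*partials)]
        # Descent direction: negative gradient. Step size: constant.
        new_coef = [c - step_size * g for c, g in zip(coef, grad)]
        # Stopping criterion: the model barely changed.
        if max(abs(n - c) for n, c in zip(new_coef, coef)) < tolerance:
            return new_coef
        coef = new_coef
    return coef

# Made-up rows ([1, x1], y) following y = 2 + 3 * x1, split over two segments.
segments = [
    [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)],
    [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)],
]
coef = gradient_descent(segments, n_coef=2)
print(coef)  # approximately [2.0, 3.0]
```

A constant step size only works when it is small enough for the data at hand; 0.04 was picked by hand for this toy data set.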

Page 16: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Architecture

• User Interface
• High-level Iteration Layer (iteration controller, …)
• Functions for Inner Loops (implements ML logic)
• Low-level Abstraction Layer (array operations, C++ to DB type-bridge, …)
• RDBMS Built-in Functions
• C API (Greenplum, PostgreSQL, HAWQ)

Implemented in: Python, SQL (generated per specification), C++, Eigen

Page 17: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

But not all data scientists speak SQL …

Accessing scalability through R

Page 18: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Why R?

O’Reilly: Strata 2013 Data Science Salary Survey

“The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.”

Page 19: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax + bath + size, data = d)

SQL Code:

SELECT madlib.linregr_train('houses', 'houses_linregr',
                            'price', 'ARRAY[1, tax, bath, size]');

Page 20: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

PivotalR Design Overview

1. R → SQL (PivotalR, connected through RPostgreSQL; no data here)
2. SQL to execute (sent to Database/HAWQ w/ MADlib; data lives here)
3. Computation results (returned to R)

• Syntax is analogous to native R functions
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
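
The design can be illustrated with a client-side object that only composes SQL text and sends it to the database when a result is requested. This is a sketch under stated assumptions: the class `LazyQuery` and its methods are hypothetical, an in-memory SQLite database stands in for the database, and none of it reflects how PivotalR is implemented internally.

```python
import sqlite3

class LazyQuery:
    """Hypothetical client-side query object: it holds SQL text, not data."""

    def __init__(self, conn, table, expr="*"):
        self.conn, self.table, self.expr = conn, table, expr

    def column(self, name):
        # Selecting a column creates a new query object; nothing is fetched.
        return LazyQuery(self.conn, self.table, name)

    def mean(self):
        # Wrap the current expression in an aggregate, still without executing.
        return LazyQuery(self.conn, self.table, f"avg({self.expr})")

    def sql(self):
        return f"select {self.expr} from {self.table}"

    def lookat(self, nrows=None):
        # Only here is SQL sent to the database and the result pulled back.
        query = self.sql()
        if nrows is not None:
            query += f" limit {int(nrows)}"
        return self.conn.execute(query).fetchall()

# Stand-in database with a small made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("create table abalone (id integer, rings integer)")
conn.executemany("insert into abalone values (?, ?)",
                 [(1, 15), (2, 7), (3, 9), (4, 10)])

x = LazyQuery(conn, "abalone")
avg_rings = x.column("rings").mean()
print(avg_rings.sql())     # select avg(rings) from abalone
print(avg_rings.lookat())  # [(10.25,)]
```

The client holds a few bytes of SQL while the table stays in the database; only the aggregated result travels back.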

Page 21: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Demo

Page 22: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Demonstration

# Load the library
library(PivotalR)

# Connect to the database "madlib" on port 14526
db.connect(port = 14526, dbname = "madlib")

# List all the tables in the active connection
db.objects()

# Create an R object that references a table in the database
x <- db.data.frame("madlibtestdata.dt_abalone")

# Report the number of rows and columns in the table
dim(x)

# Column names within the table
names(x)

# Database query object representing "select rings from madlibtestdata.dt_abalone"
x$rings

# Pull 10 rows of data from the table back into the R environment
lookat(x, 10)

# Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
mean(x$rings)

# Execute the query and report back the result
lookat(mean(x$rings))

# Run a linear regression within the database and return a model object
fit <- madlib.lm(rings ~ . - id | sex, data = x)

# Create a query object representing scoring the model in the database
predict(fit, x)

# Query object calculating the mean square error of the model
mean((x$rings - predict(fit, x))^2)

# Add a calculated factor column to the database query object
x$sex <- as.factor(x$sex)

# Calculate a logistic regression model
m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)

# Perform stepwise feature selection
mstep <- step(m0, scope = list(lower = ~age,
                               upper = ~age + factor(marital) + factor(education) +
                                       factor(housing) + factor(loan) + factor(job)))

Page 23: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Class hierarchy

db.obj
• db.data.frame: wrapper of objects in database, e.g. x = db.data.frame("table")
  – db.table
  – db.view
• db.Rquery: resides in R only, e.g. x[,1:2], merge(x, y, by="column")

Diagram labels: Operations / MADlib functions, operation, lookat, as.db.data.frame

Page 24: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Some current features

A wrapper of MADlib

• Generalized linear models (lm, glm)
• Elastic Net (elnet)
• Cross validation (generic.cv)
• ARIMA
• Tree methods (rpart, randomforest)
• Table summary
• predict

And more ... (SQL wrapper)

• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim
• names
• as.factor()
• content

Page 25: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

We’re looking for contributors

• Browse our help pages
  – Start page: madlib.net
  – Github pages
    • github.com/apache/incubator-madlib (SQL)
    • github.com/pivotalsoftware/pivotalr (R)
    • github.com/pivotalsoftware/pymadlib (Python)
• Use our product and report issues:
    • https://issues.apache.org/jira/browse/MADLIB (Issue tracker)
    • [email protected] (User forum)
    • [email protected] (Developer forum)

Page 26: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Credits

Leaders and contributors:

Gavin Sherry
Caleb Welton
Joseph Hellerstein
Christopher Ré
Zhe Wang
Florian Schoppmann
Hai Qian
Shengwen Yang
Xixuan Feng

and many others …

Page 27: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Thank you for your attention

Important links:

Product email: [email protected]

Product site: madlib.net

Page 28: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Backup slides

Page 29: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Performing a linear regression on 10 million rows in seconds

Hellerstein et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 30: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Reminder: Linear-Regression Model

• If residuals are i.i.d. Gaussian with standard deviation σ:
  – max likelihood ⇔ min sum of squared residuals
• First-order conditions for the quadratic objective (in c), the sum of squared residuals, yield the minimizer
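
The formulas on this slide did not survive in the transcript. The block below is the standard textbook derivation, written with the symbols used elsewhere in the deck (X, y, c, σ); the row notation x_i, the noise term ε_i, and the assumption that X^T X is invertible are added here for the sketch and may differ from the original slide.

```latex
% Model: each observation is a linear function of its predictors plus noise.
y_i = c^T x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}

% Log-likelihood: only the last term depends on c.
\log L(c) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - c^T x_i)^2

% Hence maximizing the likelihood is minimizing the sum of squared residuals.
f(c) = \sum_{i=1}^{n} (y_i - c^T x_i)^2 = \lVert y - X c \rVert^2

% First-order condition and minimizer (assuming X^T X is invertible).
\nabla f(c) = -2 X^T (y - X c) = 0
\;\Longrightarrow\; X^T X \, c = X^T y
\;\Longrightarrow\; c = (X^T X)^{-1} X^T y
```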

Page 31: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Linear Regression: Streaming Algorithm

How to compute with a single table scan?

$$c = (X^T X)^{-1} X^T y$$

Page 32: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

PivotalR Architecture

Page 34: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

PL/X Procedural Languages

Page 35: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

PivotalR vs PL/R

PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R

R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)

PL/R
• Interface is SQL client
• Execution is in R
• Parallelism via SQL function invocation
• Supports all of R

psql> CREATE FUNCTION lregr() …
      LANGUAGE PLR;
psql> SELECT lregr( array_agg(interlocks),
                    array_agg(assets),
                    array_agg(nation) )
      FROM t1;

Page 36: MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR

Parallelized R in Pivotal via PL/R: An Example

SQL & R

• R piggy-backs on Pivotal’s parallel architecture
• Minimize data movement
• Build predictive model for each state in parallel

[Figure: per-state data (TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA), each with its own per-state model]
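
The one-model-per-state pattern can be sketched outside the database as follows. Everything here is illustrative: the rows are made up, `fit_line` is a hypothetical helper, and a thread pool stands in for the database's segment-level parallelism. In PL/R itself, grouping and dispatch would be done by the SQL query.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fit_line(points):
    """Least-squares fit of y = intercept + slope * x for one group's (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up rows (state, x, y); only the state codes are taken from the slide.
rows = [("TN", 0.0, 1.0), ("TN", 1.0, 3.0), ("TN", 2.0, 5.0),
        ("CA", 0.0, 4.0), ("CA", 1.0, 3.0), ("CA", 2.0, 2.0)]

# Group the rows by state, as a GROUP BY would.
groups = defaultdict(list)
for state, x, y in rows:
    groups[state].append((x, y))

# One independent model per state; no group needs data from another group.
with ThreadPoolExecutor() as pool:
    models = dict(zip(groups, pool.map(fit_line, groups.values())))
print(models)  # {'TN': (1.0, 2.0), 'CA': (4.0, -1.0)}
```

Because the groups are independent, the data for each state can stay where it is stored and only the fitted models need to be collected.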