PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library.
© Copyright 2011 EMC Corporation. All rights reserved.
Srivatsan Ramanujam, Senior Data Scientist
Greenplum
Agenda
• Greenplum UAP overview
  – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
  – GPDB Architecture
• MADlib
  – Overview
  – Algorithms
  – Working Mechanism
  – Performance Comparison with Mahout
• PyMADlib
  – Overview
  – Demo in IPython Notebook
• Future Directions
  – GPHD and HAWQ
Greenplum Overview
Products
Greenplum Database - Architecture
MPP (Massively Parallel Processing) Shared-Nothing Architecture
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
• Interfaces: SQL, MapReduce
MADlib
MADlib: The Origin
UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
  1- dude, you got skills.
  2- dude, you got mad skills.
• First mention of MAD analytics was at VLDB’09
  – MAD Skills: New Analysis Practices for Big Data
  – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
  – http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
• MADlib project initiated in late 2010
  – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.
Current Modules
• Data Modeling
  – Supervised Learning: Naive Bayes Classification, Linear Regression, Logistic Regression, Multinomial Logistic Regression, Decision Tree, Random Forest, Support Vector Machines, Cox-Proportional Hazards Regression, Conditional Random Field
  – Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, SVD Matrix Factorization, Parallel Latent Dirichlet Allocation
• Descriptive Statistics
  – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
  – Profile
  – Quantile
• Support
  – Array Operations
  – Conjugate Gradient
  – Sparse Vectors
  – Probability Functions
  – Random Sampling
• Inferential Statistics
  – Hypothesis tests
MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net
How does it work? A Linear Regression Example
• Finding linear dependencies between variables
  – y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

Column y is the vector of dependent variables y; the x columns form the design matrix X.
Reminder: Linear-Regression Model
• If the residuals are i.i.d. Gaussian with standard deviation σ:
  – max likelihood ⇔ min sum of squared residuals
• First-order conditions for the quadratic objective (in c), the sum of squared residuals, yield the minimizer.
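The equations on this slide did not survive extraction. A standard statement of ordinary least squares that is consistent with the bullets above is sketched below. The symbols $X$ (design matrix), $y$ (dependent variables) and $c$ (coefficients) follow the previous slide; $\varepsilon$, $n$, $x_i$ (the $i$-th row of $X$) and $\hat{c}$ are notation assumed here for illustration, and the last step assumes $X^T X$ is invertible.

```latex
\begin{align}
  y &= Xc + \varepsilon, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
  f(c) &= \sum_{i=1}^{n} \left( y_i - x_i^T c \right)^2 = \lVert y - Xc \rVert_2^2 \\
  \nabla f(c) &= -2\, X^T (y - Xc) = 0
    \;\Longrightarrow\; X^T X\, \hat{c} = X^T y
    \;\Longrightarrow\; \hat{c} = (X^T X)^{-1} X^T y
\end{align}
```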
Linear Regression: Streaming Algorithm
• How to compute $(X^T X)^{-1} X^T y$ with a single table scan?
  – Intermediate quantities: $X^T X$ and $X^T y$
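One way to picture the single-scan idea is the sketch below: each row contributes an outer product to $X^T X$ and a scaled vector to $X^T y$, so both can be accumulated in one pass and the small system is solved at the end. This is an illustrative pure-Python sketch, not MADlib's implementation; the function names `accumulate` and `solve` are hypothetical, and the rows are the six sample rows shown on the earlier slide.

```python
def accumulate(rows):
    """Single pass over (y, x1, x2) rows, accumulating X^T X and X^T y."""
    k = 3  # intercept plus two features
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)  # leading 1.0 carries the intercept c0
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


# Sample rows (y, x1, x2) from the table on the earlier slide.
rows = [
    (10.14, 0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
xtx, xty = accumulate(rows)
coef = solve(xtx, xty)  # [c0, c1, c2]
print(coef)
```

The accumulators have a fixed size (3×3 and 3 here) regardless of how many rows are scanned, which is what makes a one-pass, constant-memory computation possible.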
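Linear Regression: Parallel Computation
$X^T y = X_1^T y_1 + X_2^T y_2$
• Segment 1: computes $X_1^T y_1$
• Segment 2: computes $X_2^T y_2$
• Master: combines the partial results into $X^T y$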
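The decomposition above can be checked with a small sketch: split the rows across two "segments", let each compute its partial $X^T y$, and add the partial results on the "master". The split point and the helper name `partial_xty` are hypothetical, and the plain lists only stand in for data that would really live on separate segment servers.

```python
def partial_xty(rows):
    """Partial X^T y for one segment's (y, x1, x2) rows."""
    acc = [0.0, 0.0, 0.0]
    for y, x1, x2 in rows:
        for i, xi in enumerate((1.0, x1, x2)):  # 1.0 is the intercept column
            acc[i] += xi * y
    return acc


rows = [
    (10.14, 0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
segment_1, segment_2 = rows[:3], rows[3:]  # arbitrary split for illustration

# Each segment works only on its own rows; the master just adds the vectors.
master = [a + b for a, b in zip(partial_xty(segment_1), partial_xty(segment_2))]
print(master)
```

Because the merge step is plain addition, the same pattern applies to $X^T X$ and extends to any number of segments.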
Performance Comparison: Test Setup on AWB
• AWB (Analytics Workbench)
  – 1000-node cluster located in Las Vegas
  – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
  – 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
  – GPHD 1.1, GPDB 4.2.3
• Mahout v0.7
• MADlib v0.5
  – With small LMF change to allow 4-byte integer values
• Test matrix
  – Data size (# rows/records, # columns/features)
  – Algorithms
  – Algorithm parameters (e.g. convergence threshold, # iterations)
  – GPDB segment / MR (Map-Reduce) task configurations
Performance & Scalability Results (summary)
• Whitepaper coming out shortly!
Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation
[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]
Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]
K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]
K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]
PyMADlib : Python + MADlib = Awesome!
Motivation
SQL is great for many things, but it’s not nearly enough
• SQL is undeniably the most straightforward way to query data
• But it was not necessarily designed for data science
MADlib is a godsend!
• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science
• In-database, explicitly parallel.
• So why do we need anything else?
  – UI is still all in SQL
  – Need to tap into rich visualization libraries
Then which interface is favored by and familiar to data scientists?
• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”
Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
  – They still require non-trivial knowledge of SQL to use effectively
  – Mostly limited to explicitly parallel jobs
  – Primarily a SQL interface to the end user
• SAS HPA = $$$$$
• Need an interface that is:
  – Less SQL, more R/Python/SAS
  – Implicitly parallelized
  – More scalable
The challenge
• MADlib
  – Open source
  – Extremely powerful/scalable
  – Growing algorithm breadth
  – SQL
• Python/R
  – Open source
  – Memory limited
  – High algorithm breadth
  – Language/interface purpose-designed for data science
• SAS
  – High user loyalty
  – Non-HPA is memory limited, HPA requires investment
  – High algorithm breadth
  – Language/interface purpose-designed for data science
• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
Simple solution: Translate Python code into SQL
• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib
• Only strings of SQL and model output transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.
[Diagram: Python sends SQL to execute MADlib over ODBC/JDBC; model output is returned to Python.]
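The translation step described above can be sketched as a function that turns Python arguments into a SQL string, so that only the string and the fitted coefficients would cross the database connection. This is a hypothetical illustration, not PyMADlib's actual API: `build_linregr_sql` is an invented name, and the `madlib.linregr(y, ARRAY[...])` aggregate form is assumed from MADlib documentation of that era and should be checked against the installed version.

```python
import re

# Accept only plain SQL identifiers, since names are spliced into the query text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_linregr_sql(table, dep_var, indep_vars):
    """Build the SQL that asks the database to fit a linear regression.

    The leading 1 in the ARRAY adds the intercept term.
    """
    for name in [table, dep_var, *indep_vars]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    features = ", ".join(["1", *indep_vars])
    return (
        f"SELECT (madlib.linregr({dep_var}, ARRAY[{features}])).* "
        f"FROM {table}"
    )


sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

In a real session the string would be passed to a database cursor and only the single result row fetched back; no table data would be pulled into Python.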
Demo
PyMADlib Tutorial – IPython Notebook Viewer Link
http://nbviewer.ipython.org/5275846
Where do I get it?
$ pip install pymadlib
I don’t have GPDB or MADlib – What do I do?
• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
  – Written permission may be requested from EMC/Greenplum for research use for multi-node installations
• MADlib is free and open-source
  – Downloadable for multiple platforms from https://github.com/madlib/madlib
• PyMADlib is also free and open-source
  – Downloadable from https://github.com/vatsan/pymadlib
Future Directions
Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key technological advantages of the industry-leading Greenplum Database with the scalability and convenience of Hadoop
• SQL Standards Compliant
  – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions
• ACID Compliant
HAWQ – Architecture
Performance: HAWQ [1] vs. Hive vs. Impala [2]
All experiments were run on a 60-node deployment with Analytics Workbench [3]
[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/
HAWQ: Deep Scalable Analytics
What’s inside the box?
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib
Questions?
https://github.com/vatsan/pymadlib
Appendix
Datasets
The following datasets were used in comparing the performance of MADlib with Mahout:
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
  • About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
  • About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
  • About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
  • About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
  • About 400,000 users, 900 movies, and 4.5 million ratings