Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014
@ronert_obst © Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing with Procedural Python
How do we use the PyData stack in data science engagements at Pivotal?
Ian Huston, Ronert Obst Data Scientists, Pivotal
Some Links for this talk
• Simple code examples: https://github.com/ronert/plpython_examples
• IPython notebook rendered with nbviewer: http://bit.ly/1u5t1Mj
About Pivotal
Agile PaaS Big Data
Pivotal™ HD with GemFire XD®
Pivotal™ Greenplum Database
Pivotal™ GemFire®
What do our customers look like?
• Large enterprises with lots of data collected
  – Work with PBs of data, structured & unstructured
• Not able to get what they want out of their data
  – Legacy systems
  – Slow response times
  – Small samples
• Transform into data-driven enterprises
Open Source is Pivotal
Pivotal’s Open Source Contributions
Lots more interesting small projects:
• PyMADlib – Python wrapper for MADlib https://github.com/gopivotal/pymadlib
• PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR
• Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/
• Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql
Typical Engagement Tech Setup
• Platform:
  – Greenplum Analytics Database (GPDB)
  – Pivotal HD Hadoop Distribution + HAWQ (SQL on Hadoop)
• Open Source Options (http://gopivotal.com):
  – Greenplum Community Edition
  – Pivotal HD Community Edition (HAWQ not included)
  – MADlib in-database machine learning library (http://madlib.net)
• Where Python fits in:
  – IPython Notebook for exploratory analysis
  – PL/Python running in-database, with nltk, scikit-learn etc.
  – Pandas, Matplotlib etc.
PL/Python
Intro to PL/Python
• Procedural languages need to be installed on each database used
• Name in SQL is plpythonu; the ‘u’ means untrusted, so you need to be a superuser to install it
• Syntax is like a normal Python function, with the function definition line replaced by an SQL wrapper
-- SQL wrapper
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  # Normal Python
  if a > b:
    return a
  return b
-- SQL wrapper
$$ LANGUAGE plpythonu;
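Because the body between the $$ markers is ordinary Python, it can be prototyped and tested outside the database first. A minimal sketch, assuming a plain function named pymax whose def line stands in for the SQL wrapper:

```python
def pymax(a, b):
    """Plain-Python twin of the PL/Python body: return the larger argument."""
    if a > b:
        return a
    return b


if __name__ == "__main__":
    # Quick local check before pasting the body into CREATE FUNCTION.
    print(pymax(3, 5))
```

Once the logic behaves as expected locally, only the body (not the def line) goes between the $$ markers.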
Returning Results
• Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
• Composite types can be returned by creating a composite type in the database:
CREATE TYPE named_value AS (
  name text,
  value integer
);
• Then you can return a list, tuple or dict (not sets) with the same structure as the composite type:
CREATE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return [ name, value ]
  # or alternatively, as tuple: return ( name, value )
  # or as dict: return { "name": name, "value": value }
  # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
• For functions which return multiple rows, prefix the return type with “setof”
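The equivalent return shapes for a composite type can be illustrated in plain Python. The sketch below is an emulation for intuition only, not PL/Python's internal conversion code; the helper to_row and the class Pair are hypothetical names introduced here.

```python
FIELDS = ("name", "value")  # column order of the named_value composite type


class Pair(object):
    """Hypothetical object exposing .name and .value attributes."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def to_row(result, fields=FIELDS):
    """Normalize a list, tuple, dict or attribute object to a tuple in column order."""
    if isinstance(result, dict):
        # dict keys must match the composite type's column names
        return tuple(result[f] for f in fields)
    if isinstance(result, (list, tuple)):
        # sequences are matched by position, so the length must agree
        if len(result) != len(fields):
            raise ValueError("expected %d items, got %d" % (len(fields), len(result)))
        return tuple(result)
    # anything else is read attribute by attribute
    return tuple(getattr(result, f) for f in fields)


if __name__ == "__main__":
    for shape in (["a", 1], ("a", 1), {"name": "a", "value": 1}, Pair("a", 1)):
        print(to_row(shape))
```

All four shapes normalize to the same row, which is why they are interchangeable as return values. A set has no defined order, so it cannot be matched to columns by position.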
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
Sequence:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;
Generator:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
    yield (name, i)
$$ LANGUAGE plpythonu;
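The difference between the two styles is easiest to see in plain Python. In this sketch the function names make_pair_sequence and make_pair_generator are made up for illustration; the bodies mirror the two slide examples.

```python
def make_pair_sequence(name):
    """Builds every row up front and returns them as one tuple of lists."""
    return ([name, 1], [name, 2], [name, 3])


def make_pair_generator(name):
    """Yields one row at a time, so the full result is never held in memory."""
    for i in range(3):
        yield (name, i)


if __name__ == "__main__":
    print([tuple(row) for row in make_pair_sequence("a")])
    print(list(make_pair_generator("a")))
```

The two slide examples number their rows differently (1 to 3 versus 0 to 2). For large result sets the generator form is usually the gentler choice, because rows are produced lazily.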
Accessing Packages
• On Greenplum DB: to be available, packages must be installed on the individual segment nodes.
  – Can use the “parallel ssh” tool gpssh to conda/pip install
  – Currently Greenplum DB ships with Python 2.6 (!)
• Then just import as usual inside the function:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  import numpy as np
  # the generator expression produces several rows, hence SETOF
  return ((name, int(i)) for i in np.arange(3))
$$ LANGUAGE plpythonu;
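Since an import fails at call time on any node where the package is missing, it can help to check availability first. The following is a rough sketch assuming a modern Python 3 interpreter (importlib.util.find_spec does not exist in the Python 2.6 mentioned above); the helper name missing_packages is hypothetical, and it only handles top-level package names.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "json" ships with Python; the second name is deliberately bogus.
    print(missing_packages(["json", "surely_not_a_real_package_xyz"]))
```

Running a check like this on every segment host is one way to confirm that an installation reached all nodes.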
MPP Architectural Overview
Think of it as multiple PostgreSQL servers
[Diagram: a Master node connected to multiple Workers]
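The master/worker picture can be mimicked in a few lines of Python. This is a toy simulation under stated assumptions, not Greenplum's implementation: rows are spread over segments by a simple modulo on an integer key, threads stand in for worker nodes, and the names distribute and run_on_segments are invented here.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, key, n_segments):
    """Assign each row to a segment by its distribution key (toy modulo hashing)."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[row[key] % n_segments].append(row)
    return segments


def run_on_segments(segments, func):
    """Apply func to every segment in parallel; results come back in segment order."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(func, segments))


def segment_sum(segment):
    """Work done locally on one worker: a partial sum."""
    return sum(row["amount"] for row in segment)


rows = [{"id": i, "amount": i * 10} for i in range(10)]
segments = distribute(rows, "id", 3)
partial_sums = run_on_segments(segments, segment_sum)  # computed on the "workers"
total = sum(partial_sums)                              # combined on the "master"

if __name__ == "__main__":
    print(partial_sums, total)
```

Each worker only touches its own slice of the data, and the master combines small partial results, which is the pattern a PL/Python function inherits when it runs inside an MPP database.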
Benefits of PL/Python
• Easy to bring your code to the data
• When SQL falls short, leverage your Python experience
• Apply Python across petabytes of data with minimal overhead or additional requirements
• Results are already in the database system, ready for further analysis or storage
MADlib
Going Beyond Data Parallelism
• PL/Python only allows us to run ‘n’ models in parallel
• No global model in PL/Python. Solution: MADlib
  • Open Source! https://github.com/madlib/madlib
  • Works on Greenplum DB, PostgreSQL and HAWQ
  • Active development by Pivotal; latest release: v1.6 (July 2014)
  • Downloads and Docs: http://madlib.net/
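The gap between "n models in parallel" and one global model can be shown with simple linear regression. The sketch below illustrates the general idea of combining per-segment summaries into a single fit; it is not MADlib's code, and the function names partial_stats, merge and finalize are chosen here for illustration.

```python
from functools import reduce


def partial_stats(points):
    """Per-segment step: summarize (x, y) pairs as five running sums."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine the summaries of two segments by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def finalize(stats):
    """Turn the merged sums into the least-squares slope and intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / float(n * sxx - sx * sx)
    intercept = (sy - slope * sx) / float(n)
    return slope, intercept


# Three "segments", each holding part of the data for y = 2x + 1.
segments = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9), (5, 11)]]
global_stats = reduce(merge, (partial_stats(seg) for seg in segments))
slope, intercept = finalize(global_stats)

if __name__ == "__main__":
    print(slope, intercept)
```

Calling finalize(partial_stats(seg)) on each segment separately would give three independent models, which is what plain PL/Python offers. The merge step is what turns them into one model over all the data.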
MADlib In-Database Functions

Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
Architecture
[Layer diagram, top to bottom]
• User Interface
• “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
• High-level Abstraction Layer (iteration controller, ...)
• RDBMS Built-In Functions
• Functions for Inner Loops (for streaming algorithms)
• Low-level Abstraction Layer (matrix operations, C++)
• RDBMS Query Processing (Greenplum, PostgreSQL, HAWQ…)
Implementation languages labelled in the diagram: SQL, generated from specification; Python with templated SQL; Python; C++
Examples
In Database Image Processing
• Many use cases
  – Defect detection in manufacturing
  – Tumor detection in medical images
• Challenge: Size of main memory
[Figure: Beck et al. Sci Transl Med 2011.]
Name     Row  Col  R    G    B
img.jpg  331  188  250  249  255
img.jpg  332  188  248  250  255
img.jpg  331  189  249  249  255
Image Smoothing
create or replace function smooth(maxr integer, maxc integer, im_array integer[])
returns integer[] as
$$
  import numpy as np
  import scipy.ndimage as ndi
  # rebuild the 2-D image from the flattened, row-ordered pixel array
  smooth = np.reshape(im_array, (maxr+1, maxc+1))
  # 3x3 moving-average filter
  smooth = ndi.uniform_filter(smooth, size=3)
  # flatten back to a 1-D integer array for the database
  return np.reshape(smooth.astype(int), (maxr+1)*(maxc+1))
$$ language plpythonu;

select im_id,
       smooth(max(row), max(col), array_agg(blue_intensity order by row, col))
from (select im_id, row, col, blue_intensity
      from images_table
      where im_id = 1878
      order by im_id, row, col) t
group by im_id
-- note: DISTRIBUTED BY is only accepted when the result is materialized with CREATE TABLE ... AS
distributed by (im_id);
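The query relies on array_agg(... order by row, col) producing a flat array that np.reshape can fold back into an image. A pure-Python sketch of that round trip follows, assuming zero-based row and column indices (which the maxr+1 and maxc+1 arithmetic implies) and no missing pixels; pixels_to_grid and grid_to_flat are hypothetical helper names.

```python
def pixels_to_grid(pixels):
    """Fold (row, col, value) records into a 2-D list, row by row."""
    ordered = sorted(pixels)  # same ordering as "order by row, col"
    maxr = max(r for r, _, _ in ordered)
    maxc = max(c for _, c, _ in ordered)
    flat = [v for _, _, v in ordered]  # what array_agg would hand to the function
    width = maxc + 1
    if len(flat) != (maxr + 1) * width:
        raise ValueError("image has missing or duplicate pixels")
    return [flat[r * width:(r + 1) * width] for r in range(maxr + 1)]


def grid_to_flat(grid):
    """Flatten the 2-D list back into the 1-D layout returned to the database."""
    return [v for row in grid for v in row]


if __name__ == "__main__":
    pixels = [(1, 0, 30), (0, 1, 20), (0, 0, 10), (1, 1, 40)]
    grid = pixels_to_grid(pixels)
    print(grid, grid_to_flat(grid))
```

If the ordering inside array_agg were dropped, the reshape would still succeed but could scramble the image, so the explicit order by in the aggregate matters.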
Pivotal GNIP Decahose Pipeline
[Pipeline diagram; components as labelled]
• Twitter Decahose (~55 million tweets/day)
• Source: http, Sink: hdfs
• HDFS
• External Tables (PXF)
• Nightly Cron Jobs
• Parallel Parsing of JSON (PL/Python)
• Topic Analysis through MADlib LDA
• Unsupervised Sentiment Analysis (PL/Python)
• D3.js
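The JSON parsing step lends itself to a small Python function that could serve as the body of a PL/Python function applied to each raw record. The sketch below is an assumption about what such a parser might look like, not the pipeline's actual code; the field names id, created_at and text are placeholders and may differ from the feed's real schema.

```python
import json


def parse_tweet(raw):
    """Return (id, created_at, text) for one raw JSON string, or None if unusable."""
    try:
        tweet = json.loads(raw)
    except (TypeError, ValueError):
        # malformed or missing input: skip the record instead of failing the query
        return None
    if not isinstance(tweet, dict) or "id" not in tweet:
        return None
    return (tweet["id"], tweet.get("created_at"), tweet.get("text"))


if __name__ == "__main__":
    print(parse_tweet('{"id": 1, "created_at": "2014-07-26", "text": "hello"}'))
    print(parse_tweet("not json"))
```

Returning None for bad records keeps one malformed line from aborting a batch job over millions of rows.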
Sentiment Analysis – PL/Python Functions
[Pipeline diagram; implemented in PL/Python]
• Part-of-speech tagger [1]: break up tweets into tokens and tag their parts-of-speech
• Phrase Extraction
• Semi-Supervised Sentiment Classification
• Phrasal Polarity Scoring
• Sentiment Scored Tweets: use learned phrasal polarities to score sentiment of new tweets
[1] Parts-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
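The final scoring step can be pictured as a lookup of known phrases in a polarity table. The sketch below is a simplified stand-in rather than the method used in the pipeline: it matches the longest known phrase at each position and sums the polarities, and both the function name score_tokens and the example polarity values are invented for illustration.

```python
def score_tokens(tokens, polarities, max_len=2):
    """Sum the polarity of known phrases, preferring the longest match at each position."""
    total, i = 0.0, 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in polarities:
                total += polarities[phrase]
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no known phrase starts here
    return total


# Hypothetical learned polarities (phrase -> score).
POLARITIES = {"love": 1.0, "good": 0.5, "not good": -1.0}

if __name__ == "__main__":
    print(score_tokens("the food was not good".split(), POLARITIES))
    print(score_tokens("i love this good phone".split(), POLARITIES))
```

Matching the longest phrase first keeps "not good" from also being counted as "good". Whitespace splitting here stands in for the part-of-speech tagger's tokenization.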
Transport Disruption Prediction Pipeline
[Pipeline diagram; components as labelled]
• Transport for London Traffic Disruption feed
• Pivotal Greenplum Database: Deduplication, Feature Creation, Modelling & Machine Learning
• d3.js & NVD3: Interactive SVG figures
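The slide does not say how deduplication is done. One common approach for a feed that re-sends the same event is to keep only the most recent record per identifier, sketched below with entirely hypothetical field names (disruption_id, updated) and a made-up helper name deduplicate.

```python
def deduplicate(records):
    """Keep the latest record for each disruption_id (assumed field names)."""
    latest = {}
    for rec in records:
        key = rec["disruption_id"]
        if key not in latest or rec["updated"] > latest[key]["updated"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["disruption_id"])


if __name__ == "__main__":
    feed = [
        {"disruption_id": 7, "updated": "2014-07-01T10:00", "status": "Active"},
        {"disruption_id": 7, "updated": "2014-07-01T11:00", "status": "Cleared"},
        {"disruption_id": 9, "updated": "2014-07-01T10:30", "status": "Active"},
    ]
    print(deduplicate(feed))
```

ISO-style timestamp strings compare correctly as text, which is why no date parsing is needed in this sketch.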
Get in touch
Feel free to contact me about PL/Python, or more generally about Data Science.
@ronert_obst
www.ronert-obst.com