Use of standards and related issues in predictive analytics

KDD 2016, SF, 2016-08-16. Paco Nathan (@pacoid), Director, Learning Group @ O’Reilly Media

Transcript of Use of standards and related issues in predictive analytics

Page 1: Use of standards and related issues in predictive analytics


Page 2: Use of standards and related issues in predictive analytics

PMML referenced by 86 publications in Safari, 2001-2016 https://www.safaribooksonline.com/search/?query=PMML

Page 3: Use of standards and related issues in predictive analytics

Pattern: PMML for Cascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) https://goo.gl/jk7829

Page 4: Use of standards and related issues in predictive analytics

[Diagram: Cascading flow for the Pattern scoring app. Labeled elements: CustomerOrders, Classify, ScoredOrders, GroupBy token, Count, PMMLModel, FailureTraps, Assert, ConfusionMatrix; M and R markers.]

Pattern: score a model, using a pre-defined Cascading app

cascading.org/projects/pattern
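To make the scoring step concrete, here is a minimal Python sketch of what "score a model from PMML" can look like. It is not Pattern's implementation, which runs on Cascading and Hadoop. The PMML document, the field names (`quantity`, `unit_price`), and the helper functions are all hypothetical, written only for illustration; records that cannot be scored are set aside, loosely mirroring the failure traps in the flow.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML document (assumption: not from the talk).
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="quantity" optype="continuous" dataType="double"/>
    <DataField name="unit_price" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="quantity"/>
      <MiningField name="unit_price"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="quantity" coefficient="2.0"/>
      <NumericPredictor name="unit_price" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Read the intercept and numeric coefficients from a PMML RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefs


def score_orders(orders, intercept, coefs):
    """Score each record; records with missing or bad fields go to a trap list."""
    scored, trapped = [], []
    for order in orders:
        try:
            value = intercept + sum(c * float(order[name]) for name, c in coefs.items())
            scored.append({**order, "score": value})
        except (KeyError, TypeError, ValueError):
            trapped.append(order)
    return scored, trapped


orders = [{"quantity": 2, "unit_price": 10.0}, {"quantity": 1}]
intercept, coefs = load_regression(PMML_DOC)
scored, trapped = score_orders(orders, intercept, coefs)
print(scored)   # one scored record
print(trapped)  # one record missing unit_price
```

The point of the sketch is that the scoring code knows nothing about how the model was trained: the coefficients arrive through the interchange document, so the training tool and the scoring system can differ.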

Page 5: Use of standards and related issues in predictive analytics

[Diagram, circa 2010. Column headings: representation, optimization, evaluation. Stages and labels: ETL into cluster/cloud; data; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; visualize, reporting; actionable results; decisions, feedback. Annotations: foo algorithms, bar developers.]

Algorithms and developer-centric template thinking only go so far in real-world workflows…

Results shown in blue, hard problems highlighted in red

Generalized Workflow for ML Use Cases in Big Data

Page 6: Use of standards and related issues in predictive analytics

Portable Format for Analytics (PFA)

PFA updates the standards w.r.t. more contemporary issues of system architectures used for predictive analytics: distributed processing, in-memory computing, serialization, etc.

http://dmg.org/pfa/docs/motivation/

• much more support for distributed systems

• Avro data types

• forward-looking toward more streaming applications

• fits well with higher layers of abstraction, success of DSLs, etc.
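As a rough illustration of the points above, a PFA document is plain JSON that declares Avro types for `input` and `output` and an `action` to run on each datum. The sketch below is a toy evaluator, not a conformant PFA engine: it assumes a hypothetical document and handles only numeric literals, symbol references, and the `+` and `*` functions. The helper names `evaluate` and `run_action` are made up for this example.

```python
import json

# Hypothetical PFA document: Avro "double" in, Avro "double" out.
PFA_DOC = json.loads("""
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": [{"*": ["input", 2.0]}, 100.0]}
  ]
}
""")


def evaluate(expr, env):
    """Evaluate a tiny subset of PFA expressions (toy; far from the full spec)."""
    if isinstance(expr, (int, float)):
        return float(expr)              # numeric literal
    if isinstance(expr, str):
        return env[expr]                # symbol reference, e.g. "input"
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()    # single-key object = function call
        left, right = (evaluate(a, env) for a in args)
        if name == "+":
            return left + right
        if name == "*":
            return left * right
    raise ValueError(f"unsupported expression: {expr!r}")


def run_action(doc, datum):
    """Apply the document's action to one datum; the last expression is the result."""
    if doc["input"] != "double" or doc["output"] != "double":
        raise ValueError("this sketch only handles double -> double")
    env = {"input": float(datum)}
    result = None
    for expr in doc["action"]:
        result = evaluate(expr, env)
    return result


# Treat a list as a stand-in for a stream of incoming records.
outputs = [run_action(PFA_DOC, x) for x in [1, 2, 3]]
print(outputs)  # [102.0, 104.0, 106.0]
```

Because the whole scoring procedure is data (JSON plus Avro types), it can be serialized, shipped to workers in a distributed or streaming system, and executed per record there.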

Page 7: Use of standards and related issues in predictive analytics

Tuning Spark Streaming for Throughput Gerard Maas, Virdata (2014-12-22)

“One Size Fits All” Doesn’t Anymore

This common architectural pattern requires interchange…

Page 8: Use of standards and related issues in predictive analytics

bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/

IoT alters “velocity” and “volume” dramatically

This growing category of use cases requires interchange…

Page 9: Use of standards and related issues in predictive analytics

Lessons from the success of Apache Spark…

interchange is necessary for the ecosystem

major use cases tend to build their own ML libraries, even though a majority of committers support a common vision and encourage use of a canonical library (MLlib with DataFrames)

when a successful business grows over time, challenges arise by definition: managing separated teams, mergers and acquisitions, increased audits, regulations, etc.

therefore, lack of interchange for analytics represents a serious technical debt and potential liability

Page 10: Use of standards and related issues in predictive analytics

[Diagram: Spark stack. Front ends: Python, SQL, R, Streaming, Advanced Analytics; DataFrame; Tungsten Execution. Physical Execution: CPU-efficient data structures; keep data closer to CPU cache (Tungsten).]

Lessons from the success of Apache Spark…

direct use of “compilers” becomes atypical as abstraction layers become smarter for deferred optimization

Page 11: Use of standards and related issues in predictive analytics

What to suggest for existing standards?

microservices: how to compose models + parameters from multiple/distinct services

support for API definitions in Swagger http://swagger.io/

consider the benefits of Parquet, e.g., how pushdown predicates enable better optimization of workflows
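The pushdown idea can be sketched without any Parquet library. Parquet files store per-row-group column statistics such as min and max, so a reader that receives the filter predicate can skip whole row groups that cannot match. The Python below is a conceptual model only: `RowGroup`, `make_row_group`, and `scan_with_pushdown` are hypothetical names, not a Parquet reader API, and the data is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    rows: List[dict]
    min_value: float   # statistics recorded once, at "write time"
    max_value: float


def make_row_group(rows, column):
    """Build a row group and record min/max statistics for one column."""
    values = [r[column] for r in rows]
    return RowGroup(rows=rows, min_value=min(values), max_value=max(values))


def scan_with_pushdown(row_groups, column, lower_bound):
    """Return rows where column >= lower_bound, skipping groups via statistics."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value < lower_bound:
            continue                      # no row in this group can match
        groups_read += 1                  # only now "read" the actual rows
        matches.extend(r for r in group.rows if r[column] >= lower_bound)
    return matches, groups_read


groups = [
    make_row_group([{"score": s} for s in (1, 2, 3)], "score"),
    make_row_group([{"score": s} for s in (4, 5, 6)], "score"),
    make_row_group([{"score": s} for s in (7, 8, 9)], "score"),
]
matches, groups_read = scan_with_pushdown(groups, "score", 7)
print(len(matches), groups_read)  # 3 1
```

For a standard, the relevance is that a model or workflow description which exposes its filters declaratively lets the storage layer do this kind of skipping, whereas filters buried inside opaque scoring code cannot be pushed down.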

Page 12: Use of standards and related issues in predictive analytics

What to suggest for existing standards?

additional standards emerging for other aspects of workflow definition:

Jupyter http://jupyter.org/

create and share documents that contain live code, equations, visualizations and explanatory text; at heart, a network protocol suite for distributed REPL environments, often along with containerization

see usage in Oriole http://oreilly.com/oriole/index.html

Dat http://dat-data.com/

shares versioned data through a decentralized network

Page 13: Use of standards and related issues in predictive analytics

What to suggest for existing standards?

other lingering issues:

• data lineage / provenance

• metadata drift

• public dialog and law: https://public.resource.org/about/

Page 14: Use of standards and related issues in predictive analytics

presenter:

Just Enough Math O’Reilly (2014) justenoughmath.com

monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/