HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING by Kai Waehner


Kai Wähner, Technology Evangelist

[email protected]

LinkedIn

@KaiWaehner

www.kai-waehner.de

Big Data Spain @Madrid (November 2016)

How to apply big data analytics and machine learning to real-time processing of microservice events

© Copyright 2000-2016 TIBCO Software Inc.

Apply Big Data Analytics to Real Time Processing


Analyze and Act on Critical Business Moments


Key Take-Aways

• Insights are hidden in Historical Data on Big Data Platforms
• Machine Learning and Big Data Analytics find these Insights by building Analytics Models
• Event Processing uses these Models (without Redevelopment) to take Action in Real Time

Agenda

1) Machine Learning and Big Data Analytics
2) Building an Analytic Model
3) Real Time Processing
4) Live Demo
5) Intelligent Microservices

Machine Learning

… allows computers to find hidden insights without being explicitly programmed where to look.

Real World Examples of Machine Learning

• Spam Detection
• Search Results + Product Recommendation
• Picture Detection (Friends, Locations, Products)

Machine Learning is already present in daily life…

Now, every enterprise is beginning to leverage it!

The Next Disruption: Google Beats Go Champion

Example: Decision Tree – Titanic Survival Rate

(Figure: decision tree from Wikipedia; one of the splits is on family size)

Decision Tree – Product Pass / Fail by Equipment Sensor Readings

TV Color Display Problem

Tree splits: Step 8 Temperature (< 122 C vs. >= 122 C), Step 2 Recipe (A vs. B), Step 11 Pressure
Leaf outcomes: Good Product, Bad Product

Decision Tree – Training and Test Data Sets
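The training / test idea can be sketched in a few lines of Python. This is only an illustration: the readings are made up, the 122 C value from the sensor example is reused as the hidden rule that generates the labels, and the "tree" is a single-split stump rather than a full decision tree.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_stump(train):
    """Pick the temperature threshold that misclassifies the fewest training rows."""
    candidates = sorted({temp for temp, _ in train})

    def errors(threshold):
        return sum((temp >= threshold) != is_bad for temp, is_bad in train)

    return min(candidates, key=errors)

def accuracy(threshold, rows):
    """Share of rows where 'temperature >= threshold' agrees with the label."""
    hits = sum((temp >= threshold) == is_bad for temp, is_bad in rows)
    return hits / len(rows)

# Hypothetical readings: (temperature in C, product is bad)
readings = [(t, t >= 122) for t in range(100, 145)]
train, test = train_test_split(readings)
threshold = fit_stump(train)
print(f"learned threshold: {threshold} C")
print(f"training accuracy: {accuracy(threshold, train):.2f}")
print(f"test accuracy:     {accuracy(threshold, test):.2f}")
```

The model only ever sees the training rows; the test rows are held back so that the reported accuracy reflects data the model has not seen.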


Ensemble Tree Algorithms

• Random Forest, Gradient Boosting Machine (GBM)

• Method – Average many simple trees

• Sample the data: fit a simple tree

• Re-sample the data; up-weighting the observations that weren’t fitted well in previous model

• Continue adding trees until fit is good

• Save all the trees and average them

• Better fit + prediction than single trees
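A rough sketch of the "keep adding simple trees" idea, under stated assumptions: each new tree here is fitted to the errors left by the trees so far (gradient-boosting style), which stands in for the up-weighting step described in the bullets. The data, the learning rate, and the number of trees are invented for illustration.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree; assumes at least two distinct x values."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_boosted_stumps(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one by one, each fitted to what the ensemble still gets wrong."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

# Toy data: a curve that a single split cannot capture
xs = list(range(10))
ys = [x * x for x in xs]

single_tree = fit_stump(xs, ys)
ensemble = fit_boosted_stumps(xs, ys)
print("single stump MSE:", round(mse(single_tree, xs, ys), 2))
print("boosted ensemble MSE:", round(mse(ensemble, xs, ys), 2))
```

On this toy curve the ensemble fits the training data more closely than the single stump, which is the "better fit than single trees" point of the slide.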


Closed Loop for Big Data Analytics


Analytics Maturity Model

A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases.

Diagram labels:
• Value to the Organization: Immediate → Long-Term Competitive Advantage
• Analytics Maturity: Measure, Diagnose, Predict, Optimize, Alert, Automate
• Capabilities: Self-service Dashboards, Visual Analytics, Advanced Analytics, Event Processing


The first task in a new analytics project is to define a Business Case!

From a Business Case to Proactive Actions

Diagram labels:
• Business Case / Value Theses: Increase Productivity, Grow Revenue, Reduce Risk
• Assemble Data: SAP, Historian, Production, Well, Completions, G&G, Equipment
• Data Wrangling: Filter, Enrich, Merge, Shape, Explore, Clean
• Develop Model: Pressure, Temperature, Drill Bit Movement, Production Interrupt, Equipment Failure
• Present: Signals, Dashboards, Visualize, GeoLocation
• Decision, Action: Prediction, Action

Agenda

1) Machine Learning and Big Data Analytics
2) Building an Analytic Model
3) Real Time Processing
4) Live Demo
5) Intelligent Microservices

Analytical Pipeline


Analytics Maturity Model


What is Predictive Analytics?


Variety of Data in Enterprises

Diagram labels:
• Local data sources (drag-and-drop): Access, Excel, STDF, flat files, spreadsheets, XML, etc.
• Databases (JDBC/ODBC, OLE DB, SqlClient): MySQL, SQL Server, Oracle, PostgreSQL, Teradata, Netezza, Hadoop, etc.
• Information Services (join, transform, reusable, parameterized, dynamic query for in-memory use): RDBMS, web services, Siebel eBusiness, Oracle E-Business, SAP BW, SAP R/3, Salesforce (SFDC)
• Direct connection / Direct Query (dynamically query and retrieve data for visualization and analysis): Oracle, Teradata, Aster, MSSSAS, MySQL, Netezza, Hadoop, OBIEE, etc.
• Custom GUI-driven data access via SDK
• DATA FABRIC

Data Acquisition – “Smart Recommendation Engine”

Data Munging / Wrangling / Mash-up

Raw transactions:

     cust_id  dept  sku    dollar  gift   date
1    104      C     12003   2.40   FALSE  2016-10-17
2    105      A     12005  62.85   FALSE  2016-10-17
3    102      C     12007  69.23   TRUE   2016-10-17
4    104      B     12004   9.33   FALSE  2016-10-18
5    105      C     12010  14.16   TRUE   2016-10-18
6    101      B     12003  90.43   FALSE  2016-10-19
7    103      C     12005  90.97   FALSE  2016-10-19
n    …        …     …      …       …      …

Per-customer summary:

     cust_id  A      B      C      total   # orders  first_date  last_date
1    100      21.76  23.67   0.00   45.43  2         2016-10-19  2016-10-20
2    101       0.01  74.65   0.00   74.66  3         2016-10-19  2016-10-20
3    102       0.00  60.92  50.29  111.21  6         2016-10-17  2016-10-20
4    103       0.00   0.00  52.30   52.30  2         2016-10-19  2016-10-20
5    104      31.34   9.33   2.40   43.06  4         2016-10-    2016-10-
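The step from transaction rows to one row per customer can be sketched in plain Python. The seven rows are the ones visible in the raw transaction table; the function name and output layout are illustrative assumptions. Because the slide's full data set is not shown, the totals computed here will not match the summary table.

```python
from collections import defaultdict

# Visible rows of the transaction table: (cust_id, dept, sku, dollar, gift, date)
transactions = [
    (104, "C", 12003, 2.40, False, "2016-10-17"),
    (105, "A", 12005, 62.85, False, "2016-10-17"),
    (102, "C", 12007, 69.23, True, "2016-10-17"),
    (104, "B", 12004, 9.33, False, "2016-10-18"),
    (105, "C", 12010, 14.16, True, "2016-10-18"),
    (101, "B", 12003, 90.43, False, "2016-10-19"),
    (103, "C", 12005, 90.97, False, "2016-10-19"),
]

def summarize_by_customer(rows):
    """Pivot spend by department and add order count plus first/last order date."""
    spend = defaultdict(lambda: {"A": 0.0, "B": 0.0, "C": 0.0})
    dates = defaultdict(list)
    for cust_id, dept, _sku, dollar, _gift, date in rows:
        spend[cust_id][dept] += dollar
        dates[cust_id].append(date)

    summary = {}
    for cust_id in sorted(spend):
        per_dept = {dept: round(total, 2) for dept, total in spend[cust_id].items()}
        summary[cust_id] = {
            **per_dept,
            "total": round(sum(spend[cust_id].values()), 2),
            "orders": len(dates[cust_id]),
            "first_date": min(dates[cust_id]),  # ISO dates sort correctly as strings
            "last_date": max(dates[cust_id]),
        }
    return summary

for cust_id, row in summarize_by_customer(transactions).items():
    print(cust_id, row)
```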

Data Munging - Transformations


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models
7. determine optimal factor settings
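EDA is mostly graphical, but one item on the list, detecting outliers, has a simple numeric counterpart. A minimal sketch, assuming the common 1.5 × IQR fence rule; the order values are made up for illustration.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical order values with one suspicious entry
order_values = [2.40, 62.85, 69.23, 9.33, 14.16, 90.43, 90.97, 950.00]
print(find_outliers(order_values))
```

The same fences are what a box plot draws, so the numeric check and the picture tell the same story.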


Exploratory Data Analysis

“The greatest value of a picture is when it forces us to notice what we never expected to see”

John W. Tukey, 1977


Exploratory Data Analysis

Visual Analytics - Interactive Brush-Linked


… and “Inline Data Wrangling” → Ad-hoc data preparation instead of just ETL

Analytics Maturity Model


What is Predictive Analytics?


Which picture represents a model?

A model is a simplification of the truth that helps you with decision making.


Model Building

Employees who write longer emails earn higher salaries!


Model Improvement

(Chart groups: Managers, Staff)


Model Validation

How is the IQ of a kid related to the IQ of his / her mum?
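One way to validate such a model is to fit it on part of the data and score it on the rest. The sketch below uses a least-squares line and k-fold cross-validation. The (mother IQ, child IQ) pairs are synthetic, generated from an invented rule plus deterministic noise, and say nothing about real data.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(model, points):
    """Share of variance in y explained by the model on the given points."""
    slope, intercept = model
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

def k_fold_scores(points, k=5):
    """Hold out every k-th point in turn, fit on the rest, score on the held-out part."""
    scores = []
    for fold in range(k):
        held_out = points[fold::k]
        training = [p for i, p in enumerate(points) if i % k != fold]
        scores.append(r_squared(fit_line(training), held_out))
    return scores

# Synthetic pairs: invented linear rule plus a repeating noise pattern
data = [(85 + i, 40 + 0.6 * (85 + i) + ((i * 7) % 11 - 5)) for i in range(30)]
scores = k_fold_scores(data)
print("R^2 per fold:", [round(s, 2) for s in scores])
```

A model that scores well on its training data but poorly on the held-out folds has memorized noise rather than found a relationship.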


Frameworks and Tooling


“…as a next-generation data discovery capability that automatically finds and explains insights from advanced analytics to business users or citizen data scientists”

Smart Data Discovery (for the Business User)

Leverage Machine Learning without the help of a Data Scientist

Advanced Analytics and Big Data Tools (for Data Scientists)

Many more ….

R Language

• Built for data scientists

• Very active community


R with Revolution Analytics (now Microsoft)


Open Source GPL License (including its restrictions)

http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics

TERR - TIBCO’s Enterprise Runtime for R

TIBCO has rewritten R as a Commercial Compute Engine
• Latest statistics scripting engine: S → S-PLUS® → R → TERR
• Runs R code including CRAN packages

Engine internals rebuilt from scratch at low-level
• Redesigned data objects, memory management
• High performance + Big Data

TERR is licensed from TIBCO
• TERR installs (free) with Spotfire Analyst / Desktop + other TIBCO products
• Spotfire Server can manage all TERR / R scripts, artifacts for reuse
• Standalone Developer Edition
• Supported by TIBCO
• No GPL license issues

Which R to use?


http://www.forbes.com/sites/danwoods/2016/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/


Apache Spark

General data-processing framework → However, focus is especially on analytics (at least these days)

Apache Spark MLlib

Spark ML is Spark’s machine learning library.

Its goal is to make practical machine learning scalable and easy.

It consists of common learning algorithms and utilities, including classification, regression, clustering and collaborative filtering.

H2O.ai

An Extensible Open Source Platform for Analytics

• Best of Breed Open Source Technology
• Easy-to-use Web UI and Familiar Interfaces
• Data Agnostic Support for all Common Database and File Types
• Massively Scalable Big Data Analysis
• Real-time Data Scoring (“Nanofast Scoring Engine”)

http://www.h2o.ai/

Smart Visual Analytics vs. Data Science Tools

Live Demo

TIBCO Spotfire with R / TERR Integration


Let the business user leverage Analytic Models (created by the Data Scientist) to find insights!

Example: Customer Churn with Random Forest Algorithm
• Behind the ‘refresh model’ button lives a ‘random forest algorithm’
• Requires no a priori assumptions at all, it just always works
• The business user doesn’t need to know what random forest is to be empowered by it

Select variables for the model

TIBCO Spotfire with H2O Integration


Example: Predictive Analytics for Manufacturing (“scrap parts as early as possible”)


SaaS Machine Learning

• Managed SaaS service for building ML models and generating predictions

• Integrated into the corresponding cloud ecosystem

• Easy to use, but limited feature set and potential latency issues if combined with external data or applications

http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html


PMML (Predictive Model Markup Language)

• XML-based de facto standard to represent predictive analytic models

• Developed by the Data Mining Group (DMG)

• Easily share models between PMML compliant applications (e.g. between model creation and deployment for operations)
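To make the "share models between applications" point concrete, here is a small sketch that scores a record against a stripped-down, PMML-style tree. The XML fragment is a simplified illustration (no namespace, no data dictionary) rather than a schema-valid PMML document. The field name, threshold, and scores are hypothetical, and only two comparison operators are handled.

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style fragment; a real export carries namespaces and more metadata
PMML_FRAGMENT = """
<PMML version="4.2">
  <TreeModel functionName="classification">
    <Node score="unknown">
      <True/>
      <Node score="good">
        <SimplePredicate field="temperature" operator="lessThan" value="122"/>
      </Node>
      <Node score="bad">
        <SimplePredicate field="temperature" operator="greaterOrEqual" value="122"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

OPERATORS = {
    "lessThan": lambda a, b: a < b,
    "greaterOrEqual": lambda a, b: a >= b,
}

def load_root_node(pmml_text):
    """Parse the document and return the root node of its tree model."""
    return ET.fromstring(pmml_text.strip()).find("TreeModel/Node")

def matches(node, record):
    """Evaluate the node's predicate against a record (dict of field values)."""
    if node.find("True") is not None:
        return True
    predicate = node.find("SimplePredicate")
    if predicate is None:
        return False
    compare = OPERATORS[predicate.get("operator")]
    return compare(float(record[predicate.get("field")]), float(predicate.get("value")))

def score(node, record):
    """Walk down the first matching child until a leaf is reached."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")

root = load_root_node(PMML_FRAGMENT)
print(score(root, {"temperature": 118.0}))
print(score(root, {"temperature": 130.0}))
```

The scoring code knows nothing about how the tree was trained; swapping in a newly exported model means replacing the XML, not the application.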


Agenda

1) Machine Learning and Big Data Analytics
2) Building an Analytic Model
3) Real Time Processing
4) Live Demo
5) Intelligent Microservices

Analytics Maturity Model


What is Prescriptive Analytics?


Traditional Data Processing: “Request – Response”

Store → Analyze → Act

The New Era: Streaming Analytics

Act & Monitor → Analyze → Store

Streaming Analytics: What Is A “Stream”?

Clickstream

Sensors

Usage Data

Logs

• Consists of pieces of data typically generated due to a change of state.

• One or more identifiers
• Timestamp & payload
• Immutable

• Typically unbounded; there is no end to the data.

• Batch dataset: “bounded”.

• Can be raw or derived.
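These properties map naturally onto a small data structure. A minimal sketch, with hypothetical field names and a made-up sensor source: a frozen dataclass gives immutability, and a generator without an end condition models the unbounded stream.

```python
from dataclasses import dataclass, FrozenInstanceError
from itertools import count, islice

@dataclass(frozen=True)
class Event:
    """One piece of stream data: identifier, timestamp, payload; cannot be modified."""
    source_id: str
    timestamp: float
    payload: tuple  # tuple of (key, value) pairs, so the payload is immutable too

def sensor_stream(sensor_id, start=0.0, interval=1.0):
    """Unbounded generator: there is no last event, the consumer decides when to stop."""
    for tick in count():
        reading = (("temperature", 120 + tick % 5),)
        yield Event(sensor_id, start + tick * interval, reading)

# A batch data set is bounded; here we cut a bounded slice out of the stream
first_three = list(islice(sensor_stream("sensor-1"), 3))
for event in first_three:
    print(event)

try:
    first_three[0].timestamp = 99.0
except FrozenInstanceError:
    print("events are immutable")
```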


Streaming Analytics - Processing Pipeline

Pipeline stages:
• Stream Ingest: APIs, Adapters / Channels, Integration, Messaging
• Stream Preprocessing: Transformation, Aggregation, Enrichment, Filtering, Normalization
• Stream Analytics & Processing: Contextual Rules, Windowing, Patterns, Analytics, Deep ML, …
• Stream Outcomes: Process Management, Analytics (Real Time), Applications & APIs, Analytics / DW Reporting, Index / Search

Applying an Analytic Model is just a piece of the puzzle!
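A toy version of such a pipeline, sketched with Python generators, shows why the model is only one step among several. All names, the sensor-to-plant lookup, and the threshold "model" are hypothetical stand-ins.

```python
def ingest(raw_events):
    """Stream ingest: hand over events one at a time."""
    yield from raw_events

def preprocess(events, plant_by_sensor):
    """Filter out incomplete readings and enrich the rest with plant context."""
    for event in events:
        if event.get("temperature") is None:
            continue
        yield {**event, "plant": plant_by_sensor.get(event["sensor"], "unknown")}

def analyze(events, model):
    """Apply an analytic model (any callable) to each event."""
    for event in events:
        yield {**event, "prediction": model(event)}

def act(events):
    """Stream outcome: collect alerts for events predicted to be bad."""
    return [event for event in events if event["prediction"] == "bad"]

def threshold_model(event):
    """Hypothetical stand-in for a trained model."""
    return "bad" if event["temperature"] >= 122 else "good"

raw = [
    {"sensor": "s1", "temperature": 118.0},
    {"sensor": "s2", "temperature": None},
    {"sensor": "s1", "temperature": 125.5},
]
alerts = act(analyze(preprocess(ingest(raw), {"s1": "Plant A"}), threshold_model))
print(alerts)
```

Each stage consumes the previous one lazily, so an event flows through all steps before the next event is read.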


Streaming Analytics: “Windows”

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
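The simplest kind of window is a fixed (tumbling) one: time is cut into equal, non-overlapping slices and each event falls into exactly one of them. A minimal sketch with made-up events; for brevity it processes a finite sample, whereas a streaming engine would emit each window's result as the window closes.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window, keyed by window start time."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical (timestamp in seconds, payload) pairs
events = [(0.5, "a"), (3.2, "b"), (9.9, "c"), (10.0, "d"), (27.4, "e")]
print(tumbling_window_counts(events, window_seconds=10))
```

Sliding and session windows follow the same pattern, except that one event may belong to several windows or that the window boundaries depend on gaps in the data.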


Operational Intelligence and Human Interaction

Actions by Operations: Human decisions in real time, informed by up-to-date information

Machine-to-Machine Automation: Automated action based on models of history combined with live context and business rules

What Kind of Streaming Analytics do you need?

Time to Market: Slow → Fast

• Streaming Concepts: Operators (Filter, Sort, Aggregate)
• Streaming Frameworks: Library (Java, .NET, Python); Query Language (often similar to SQL); Scalability (horizontal and vertical, fail over); Connectivity (technologies, markets, products)
• Streaming Products: Visual IDE (Dev, Test, Debug); Simulation (Feed Testing, Test Generation); Live UI (monitoring, proactive interaction); Maturity (24/7 support, consulting); Integration (out-of-the-box: ESB, MDM, etc.)

Frameworks and Products (no complete list!)

Quadrants: OPEN SOURCE / CLOSED SOURCE × PRODUCT / FRAMEWORK

Microsoft Azure Stream Analytics

Comparison of Stream Processing Frameworks and Products

Slide Deck from JavaOne 2015: http://www.kai-waehner.de/blog/2015/10/25/comparison-of-stream-processing-frameworks-and-products/

Updated slide deck coming in November 2016 (Big Data Spain, Madrid)

Apache Storm – Hello World

http://wpcertification.blogspot.ch/2014/02/helloworld-apache-storm-word-counter.html


AWS Kinesis – Hello World


Visual Coding for Streaming Analytics

• Streaming Operators
• Connectivity
• Visual Development
• Testing & Simulation
• Mature Tooling / Support
• Middleware Integration

Live Visual Analytics UI

• Dynamic aggregation
• Live visualization
• Ad-hoc continuous query
• Alerts
• Action

How to apply analytic models to real time processing without redevelopment?

Stream Processing with models from: H2O.ai, Open Source R, TERR, Spark ML, MATLAB, SAS, PMML

TIBCO StreamBase Connector for R and TERR


TIBCO StreamBase Connector for H2O.ai


TIBCO StreamBase Connector for PMML


Real World Streaming Application for Customer Churn


Closed Loop → Automatically Re-Compute (and Improve) the Analytic Model

Compute your performance metric → Spot not good enough performance → Re-compute model
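The loop can be sketched as a small monitor that tracks recent prediction quality and signals when a re-compute is due. The class name, window size, and accuracy threshold are illustrative assumptions; a real deployment would pick a metric that fits the business case.

```python
from collections import deque

class ModelMonitor:
    """Track accuracy over the most recent predictions and flag a re-compute."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest outcomes drop out
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a prediction turned out to be right."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Compute the performance metric over the current window."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self):
        """Spot not-good-enough performance once the window is full."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy

monitor = ModelMonitor(window_size=4, min_accuracy=0.75)
observations = [("churn", "churn"), ("stay", "stay"), ("churn", "stay"), ("churn", "stay")]
for predicted, actual in observations:
    monitor.record(predicted, actual)
    if monitor.needs_retraining():
        print("accuracy", monitor.accuracy(), "-> re-compute model")
```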


Agenda

1) Machine Learning and Big Data Analytics
2) Building an Analytic Model
3) Real Time Processing
4) Live Demo
5) Intelligent Microservices

Scenario: Predictive Scrapping of Parts in an Assembly Line

Goal: Scrap parts as early as possible automatically to reduce costs in a manufacturing process.

Question: When to scrap a part in Station 1 instead of doing re-work or sending it to Station 2?

Cost per stage: 9€ before Station 1, 7€ at Station 1, 13€ at Station 2; total cost 29€ (or more)

Decision after each station: Scrap?
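One way to turn the question into arithmetic is to compare expected losses. This is a sketch under an assumed cost model, not the rule used in the demo: the three cost figures come from the slide (their assignment to stages is an assumption), re-work is ignored, and scrapping a part that would have been good is assumed to waste exactly the money already spent on it.

```python
COST_BEFORE = 9.0      # EUR spent before Station 1
COST_STATION_1 = 7.0   # EUR spent at Station 1 (assumed assignment)
COST_STATION_2 = 13.0  # EUR spent at Station 2 (assumed assignment)

SUNK_COST = COST_BEFORE + COST_STATION_1

def expected_loss_if_continue(p_defect):
    """The Station 2 cost is wasted whenever the part ends up scrapped anyway."""
    return p_defect * COST_STATION_2

def expected_loss_if_scrap(p_defect):
    """Scrapping a part that would have been good throws away the sunk cost."""
    return (1 - p_defect) * SUNK_COST

def should_scrap(p_defect):
    """Scrap after Station 1 when continuing is expected to lose more money."""
    return expected_loss_if_continue(p_defect) > expected_loss_if_scrap(p_defect)

# Break-even defect probability: p * 13 = (1 - p) * 16  =>  p = 16 / 29
BREAK_EVEN = SUNK_COST / (SUNK_COST + COST_STATION_2)

for p in (0.3, 0.9):
    print(f"p_defect={p}: scrap={should_scrap(p)}")
print(f"break-even probability: {BREAK_EVEN:.3f}")
```

Under these assumptions the predicted defect probability from the analytic model has to exceed roughly 55% before early scrapping pays off; a different cost model moves that threshold.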

Fast Data Architecture for Predictive Maintenance

Architecture labels:
• Data sources: CSV (Batch), JSON (Real Time), XML (Real Time), Oracle RDBMS (Internal Data)
• Streaming Analytics (StreamBase): Aggregate, Rules, Analytics, Correlate, Action
• Operational Analytics (Live Datamart): Continuous query processing, Alerts, Manual action, escalation, Operations Live UI
• Historical Analysis (Data Scientists): Flume, HDFS, Hadoop (Cloudera), Spotfire, R / TERR, H2O
• Formats: Avro, Parquet, …, PMML
• TIBCO Fast Data Platform

TIBCO Spotfire with H2O Integration

Data Discovery / Data Mining (“Are parts that repeat a station more likely scrap parts?”)

TIBCO Live Datamart

Operational Intelligence (“Monitor the manufacturing process and change rules in real time!”)

• Live Datamart Desktop Client
• Live Datamart Web API

TIBCO Spotfire + StreamBase + H2O.ai + Live Datamart

Live Demo

TIBCO Accelerator for Apache Spark

The TIBCO Accelerator for Spark is a TIBCO-engineered, lightweight, open-source fast-start for systems to stream data into Spark, discover patterns in Spark with Spotfire, and operationalize the insights on Big Data.

FUNCTIONAL COMPONENTS

1. Fast Data Preparation for IoT: Dozens of enterprise and IoT data preparation adapters: MQTT, Databases; inbound creation of HDFS, Parquet, HBase, Avro…
2. Spotfire Model Discovery Template: Use Spotfire to explore Spark data lake, create predictive model, train in H2O, and deploy to Streaming Analytics.
3. Operationalize Predictive Models: ZooKeeper deployment to StreamBase nodes living in Spark cluster via H2O, PMML, TERR models.
4. Streaming Analytics for Automation: Automate action based on predictive models – make offers to customers, stop fraudulent transactions, alert.
5. Monitor & Retrain Model: Monitor behavior of model, retrain when necessary.
6. Drag & Drop for Business Solution Developers: Code-free development environment for work with H2O, HDFS, Avro, TERR.

Agenda

1) Machine Learning and Big Data Analytics
2) Building an Analytic Model
3) Real Time Processing
4) Live Demo
5) Intelligent Microservices

Evolving Demands from the Business

AGILITY & SPEED

REDUCED CYCLE TIMES

WEB SCALE

LOWER COST

FAIL FAST


Development of Intelligent Microservices


12 Factor Apps for Cloud Native Microservices

• Codebase: One codebase tracked in revision control, many deploys.
• Dependencies: Explicitly declare and isolate dependencies.
• Config: Store config in the environment.
• Backing Services: Treat backing services as attached resources.
• Build, Release, Run: Strictly separate build and run stages.
• Processes: Execute the app as one or more stateless processes.
• Port Binding: Export services via port binding.
• Concurrency: Scale out via the process model.
• Disposability: Maximize robustness with fast startup and graceful shutdown.
• Dev / Prod Parity: Keep dev, staging, and prod as similar as possible.
• Logs: Treat logs as event streams.
• Admin Processes: Run admin/mgmt tasks as one-off processes.

https://12factor.net/
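As a small illustration of the "Config" factor for an intelligent microservice, the sketch below reads its settings from environment variables. The variable names (MODEL_PATH, SCORE_THRESHOLD, PORT) and their defaults are hypothetical.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceConfig:
    model_path: str
    score_threshold: float
    port: int

def load_config(env=os.environ):
    """Build the configuration from environment variables, with development defaults."""
    return ServiceConfig(
        model_path=env.get("MODEL_PATH", "model.pmml"),
        score_threshold=float(env.get("SCORE_THRESHOLD", "0.5")),
        port=int(env.get("PORT", "8080")),
    )

# The same build can run in dev, staging and prod; only the environment differs
config = load_config({"MODEL_PATH": "/models/churn.pmml", "PORT": "9000"})
print(config)
```

Because the model location is configuration rather than code, a re-computed model can be rolled out by changing the environment and restarting the process.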


Why Containers?

http://www.slideshare.net/andersjanmyr/docker-the-future-of-devops

Containers enable:

• Lightweight deployment

• Automation

• Better resource utilization

• Scaling up and down quickly

• Platform agnostic deployment

• Innovation and Fail Fast Concepts

• Standardization?
  • The Open Container Initiative (OCI)
  • Docker Fork Discussions (!!!)

DevOps Elements – Culture and Technology!

Process

Tools

Automation

Culture

Continuous Integration/Continuous Development

APIs, Microservices, Frequent releases

Collaboration

Develop fast. Fail fast. Change fast.

Visual Analytics + Visual Coding + DevOps = Agile Intelligent Microservices

Application of Analytic Models to other Microservices

Real Time Streaming Analytics

(Figure: event streams along a time axis, 1 to 9)

Apply your intelligent (micro)service to any event.

Microservice event. Application event. Legacy event. IoT event. You name it.


Key Take-Aways

• Insights are hidden in Historical Data on Big Data Platforms
• Machine Learning and Big Data Analytics find these Insights by building Analytics Models
• Event Processing uses these Models (without Redevelopment) to take Action in Real Time

Questions? Please contact me!

Kai Wähner, Technology Evangelist at TIBCO

[email protected]
@KaiWaehner
www.kai-waehner.de
LinkedIn