Automated Analytics at Scale


Transcript of Automated Analytics at Scale

Page 1: Automated Analytics at Scale

Automated Analytics at Scale: Model Management in Streaming Big Data Architectures

Chris Kang

Page 2: Automated Analytics at Scale

Copyright © 2016 Accenture All rights reserved.

• Machine learning allows organizations to proactively discover patterns and predict outcomes for their operations, and improving those insights requires deploying better analytical models on their data.

• Finding the best analytical model requires running thousands of hypotheses on various datasets and comparing models in a brute force approach.

• Currently a model management framework does not exist - that is, an agnostic tool or framework that manages and orchestrates the entire lifecycle of a model.


Challenges of Model Management

Page 3: Automated Analytics at Scale


The Model Management Framework operationalizes analytics to ease the development and deployment of analytical models. The framework provides key benefits to operationalize and democratize access to analytical modeling at scale:

• Captures and templates analytical models created by expert data scientists for easy reuse

• Speeds up development of analytical models through rapid iteration of training and comparing models using a brute-force approach

• Presents a champion-challenger view to visually compare and promote trained models

• Reduces complexity for data scientists to train and deploy models

• Enables business analysts and others to participate in the modeling process

Page 4: Automated Analytics at Scale


The Model Management Framework is essential for an Internet of Things platform. An Internet of Things platform exposes thousands of sensors whose models must be automatically managed and maintained, while also providing easy access to the predicted results.

• Identify desired insights: Identify insights for operationalizing devices/machinery for various purposes: anomaly detection, predictive maintenance, budget and resource optimization.

• Collect data: Collect various types of data (time series or static) and store them in the databases that best fit each data type.

• Analyze: Train the analytical models using the Model Management Framework, or train them with other analytical tools such as R and then onboard them to the framework.

• Actuate and optimize: Set up rules to act on predicted results from thousands of sensors, e.g. schedule maintenance or lower the temperature on a device (see the sketch below).
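To make the actuation step concrete, here is a minimal sketch of a rule acting on a model's predicted results. The thresholds, field names, and actions are illustrative assumptions, not part of any specific framework.

```python
# Illustrative actuation rule over predicted results. The thresholds, field
# names, and actions are assumptions for this sketch only.
def actuate(prediction: dict) -> str:
    """Map one device's predicted results to a proactive action."""
    if prediction["failure_probability"] > 0.8:
        return f"schedule maintenance for {prediction['device_id']}"
    if prediction["predicted_temperature_c"] > 90:
        return f"lower temperature setpoint on {prediction['device_id']}"
    return "no action"

print(actuate({"device_id": "pump-17",
               "failure_probability": 0.92,
               "predicted_temperature_c": 65}))
# -> schedule maintenance for pump-17
```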


Page 5: Automated Analytics at Scale

Background

Page 6: Automated Analytics at Scale


Organizations today have an unprecedented amount of data available because of the Internet of Things, the web, and social media. In order to take advantage of this massive set of data, organizations must build analytics platforms.

Source: IBM, Big Data Hub, 2013

Page 7: Automated Analytics at Scale


Traditional analytics platforms use big data technologies to process and analyze large amounts of data

“Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies) focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data. Determining the value of the collected data becomes the top challenge in all industries.” Source: Svetlana Sicular, Gartner, October 30 2015

The steps to derive value out of the data include collecting, processing, and analyzing the data using a variety of big data tools:

• Data Collection: Store huge volumes of data, in a variety of data types, across multiple data stores for processing.

• Data Processing: Process the data by filtering, transforming, and applying machine learning algorithms using computing engines.

• Analytics and Visualization: Create ad hoc reports on processed data using business intelligence and visualization tools.

[Diagram: example technologies for each stage]


Page 8: Automated Analytics at Scale


Enterprises need access to both historical and real-time data to gain the most value out of big data analytics.

• Real-time data is data that is processed within sub-seconds to seconds from the time it arrives to when the results are derived.

• Batch processing technologies alone are insufficient: in the time it takes to process a batch (hours or days), real-time data has accumulated and is missed, which creates a loss of opportunity for proactive decision making.

Storing data in a fault-tolerant, replicated historic store, processing a large batch of data, and writing the processed data back with batch writes incurs delays that make real-time analytics infeasible.

Queries are directed only at stale data that is hours or days old. The lack of real-time data limits the analytics to ad hoc summarizations and aggregations.

Because of the batch processing delay, by the time the captured data is available for queries, it is stale

Real-time data is missed by the time analytics begins

[Diagram: batch-only pipeline – data flows from storage (historic data store) through batch processing and batch writes into a serving store, where it is queried; real-time data that arrives during the batch window is missed.]


Page 9: Automated Analytics at Scale


The Lambda Architecture empowers real-time analytics by handling data at scale and in real time using a hybrid architecture.

• Designed by Nathan Marz, creator of the Apache Storm project and previously a lead engineer at Twitter, with the goal of building a general architecture for processing big data at scale.

• The architecture separates batch processing on historical data from stream processing on the real-time flow of data, allowing analytics that combine the most up-to-date data with historical data views.

Real-time analytics can now be performed on the most up-to-date data combined with historical views.

BATCH LAYER focuses on processing historical data views for queries

SPEED LAYER handles the complexity of real-time data collection and analysis

[Diagram: Lambda architecture – the batch layer (historic data store → batch processing → batch writes to the serving store) and the speed layer (queue → stream processing → random writes to the serving store) both feed the query/serving layer.]
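As a minimal sketch of the idea (illustrative only, not from the deck): a serving-layer query can merge a precomputed batch view with the speed layer's incremental real-time view, so results always include the freshest data. The dictionaries below stand in for the batch and real-time stores.

```python
# Minimal sketch of a Lambda-style serving query. The batch view is
# recomputed periodically over the historic store; the speed layer keeps a
# small real-time view of events that arrived since the last batch run.
from collections import defaultdict

batch_view = {"sensor-42": 1_250_000}                  # counts from the last batch job
realtime_view = defaultdict(int, {"sensor-42": 317})   # events since that batch run

def query_event_count(sensor_id: str) -> int:
    """Combine the complete-but-stale batch view with the fresh speed-layer delta."""
    return batch_view.get(sensor_id, 0) + realtime_view[sensor_id]

print(query_event_count("sensor-42"))   # 1250317
```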


Page 10: Automated Analytics at Scale


In the Internet of Things, predictive modeling on sensor data allows organizations to discover patterns and predict outcomes for their operations

[Diagram: analytics pipeline – Data Collection (sensors at field sites, NoSQL for unstructured data) → Data Processing (computing engines and stream processors) → Predictive Modeling (machine learning algorithms, model runtime environments) → Proactive Decision Making (predictive results, notifications and alerts, remediation).]

Oil & Gas Producer
• Collects data from over 190,000 sensors
• Ingests 6,000 rows/second and 11 billion rows of data per month – a larger analytics platform than Twitter's
• Has over 3,500 models analyzing data using various algorithms
• Enables the company to examine huge sets of data and discover trends to predict outcomes in operations and exploration efforts

Water Utility Client
• Collects data from sensors placed along pipes in a water distribution network
• Processes data for water flow rate and pressure
• Applies a predictive model to project forward in time and spot spikes or falls that exhibit warning signs of failure
• Uses results from the predictive model to proactively reduce pressure spikes, avoiding leaks, prolonging the longevity of assets, and reducing disruption to customers

• The real value of big data is the insight via the analytics, not just the collection of the data.

• Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed to reactive, decisions on data.


Page 11: Automated Analytics at Scale


The modeling process is iterative, and its lifetime spans both batch-mode model training and real-time prediction. In general, a model creates an output for an unknown target value given a defined set of inputs; in a time-series model, the target value also depends on time as an input.

"I want to deploy a model that can detect if a sensor is faulty in real time." – Data Scientist

Build Model
• Identify required data and how to get it
• Design and validate specific analytic models
• Verify the approach through an initial set of insights on particular environments
Example: Analyzes a variety of machine learning algorithms, identifies the logistic regression model as the most suitable for the problem, and codes the model into a .JAR file.

Train Model
• Prepare historical data for training
• Select model input parameters and the runtime environment
• Train the model on data from the historical batch and/or real-time stream in the runtime environment
Example: Selects input parameters such as the regularization parameter for the logistic regression model, then submits the model to Spark to train it on historical data in HDFS.

Monitor Execution
• Monitor the status of the model as it trains in the runtime environment (e.g. running, succeeded, failed)
• Troubleshoot issues in the runtime environment if necessary
Example: Opens a terminal, SSHes into the Hadoop cluster, and enters commands to verify the status of the model as it is trained.

Compare Models
• Compare trained models in champion-challenger fashion
• Use a brute-force approach to find the best-of-breed model for deploying to the live stream
Example: After iteratively training many models, selects the best-of-breed as the model with the lowest mean squared error.

Operationalize Model
• Deploy the best model on the live stream of data
• Generate predicted results for automated or manual proactive decision making
• Observe results to feed back and fine-tune the model
Example: Submits the model to Spark Streaming to be applied to streaming data ingested from Kafka, and the model predicts in real time whether a sensor will fail.

The lifecycle spans both data science skills and system administration skills.
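As a minimal sketch of the Train and Compare steps for this scenario (illustrative only, not the deck's code): the HDFS paths, column names, and the grid of regularization values below are assumptions. The brute-force champion-challenger search trains one logistic regression per parameter setting and keeps the model with the lowest mean squared error on held-out data.

```python
# Minimal sketch: brute-force champion-challenger training of logistic
# regression models in Spark ML. Paths, column names ("features", "label"),
# and the regularization grid are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sensor-failure-training").getOrCreate()

# Historical batch data with a vector "features" column and a 0/1 "label".
train = spark.read.parquet("hdfs:///data/sensors/train")
test = spark.read.parquet("hdfs:///data/sensors/test")

failure_prob = F.udf(lambda v: float(v[1]), DoubleType())   # P(sensor fails)

champion, best_mse = None, float("inf")
for reg in [0.0, 0.01, 0.1, 1.0]:       # brute force over the regularization parameter
    model = LogisticRegression(regParam=reg).fit(train)
    scored = model.transform(test).withColumn("p1", failure_prob("probability"))
    mse = scored.select(F.avg(F.pow(F.col("p1") - F.col("label"), 2))).first()[0]
    if mse < best_mse:                  # challenger beats the current champion
        champion, best_mse = model, mse

champion.write().overwrite().save("hdfs:///models/sensor-failure/champion")
```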

Page 12: Automated Analytics at Scale


Challenges with Analytical Modeling in the Current State

Page 13: Automated Analytics at Scale


Building, training, and deploying analytical models require a rare combination of data science and engineering skills. The ability to complete the modeling process is limited to specialized individuals who are experts in both data science and engineering.

"The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data." Source: McKinsey Global Institute analysis

Full set of skills needed for model building and deployment:
• Understanding of a variety of machine learning algorithms and pattern recognition, as well as expertise in a domain.
• Ability to build and code accurate models based on the problem space.
• System administrator skills, as well as a deep understanding of big data systems, to deploy models in a runtime environment.

Data Scientist
• Traditional strengths: Mathematics, statistics, machine learning, data mining, pattern recognition, predictive algorithms, domain expertise.
• Potential hurdles with model building and deployment: Troubleshooting and running a runtime environment such as Spark requires advanced system engineering skills, which a data scientist may not be trained in. This can potentially lead to slower development and deployment of predictive models.

Business Analyst
• Traditional strengths: Domain expertise, business processes, requirements gathering.
• Potential hurdles with model building and deployment: Traditional business analysts may lack core skills in data science or data engineering, and the experience to build, train, or deploy models.

Dual Data Scientist and Engineer
• Traditional strengths: A combination of data science skills with software engineering and system administration skills for big data systems.
• Potential hurdles with model building and deployment: May lack domain expertise, in which case it may take longer to build and train relevant models for the use case.


Page 14: Automated Analytics at Scale


Analytical models are not easily reusable or shareable, resulting in siloed analytics work. There is no standard method for sharing models that lets users leverage models created by other data scientists, so the analytics work is siloed. This is true both for freshly built models and for models that have already been trained on a dataset.

Predictive models duplicate and sprawl as data scientists build and train their own individual libraries of models that are not shared. There is no standard for sharing or viewing other data scientists' models.

• Individual libraries of models: Data scientists primarily leverage their own libraries of models and previous datasets they have worked with to select an algorithm and build a model for the current problem.

• Model duplication: As models are built and trained, the same types of models may be built by more than one data scientist, particularly if those types of models are common in the industry's use cases.

• Model sprawl: Over time, as more data scientists build and train more models, the models begin to sprawl and duplicate unnecessarily, making central management of models more difficult.

[Diagram: each data scientist trains and deploys individual models on the runtime execution environments for model training and deployment.]


Page 15: Automated Analytics at Scale


Without a framework, the current approach is too inflexible to support multiple runtime execution environments. It is impractical to scale the number of runtime environments used to train and deploy models with a manual approach.

Runtime environments often cannot support all types of models. As a result, data scientists must spend time learning environments instead of spending that time on analytical modeling.

"I have a model, but I don't know which runtime environment can support it."
"I'm only familiar with R, so I need to learn all the environments to test my model."
"I have a new type of model, so I need to learn another runtime environment."
– Data Scientist

[Diagram: a data scientist with a Spark model that has R dependencies must learn, test, and update against each runtime environment in turn – one environment's dependencies match and it can support the model, another is missing Spark functionality needed to execute it, another is missing a specific R dependency, and another supports all the required R libraries.]

• The data scientist needs to acquire the system administration skills to operate the runtime environments

• Each runtime environment is unique and requires time and energy to learn

• In the worst case, the data scientist must try every runtime environment before successfully finding a match for the predictive model

• As more model types are needed, additional runtime environments must be learned

• Learning additional environments becomes a time-consuming endeavor
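The framework's Runtime Verifier (described later in this deck) automates exactly this matching step. As a minimal illustrative sketch (the dependency descriptors and environment capability sets below are assumptions, not the framework's actual metadata), the check reduces to a set-coverage test:

```python
# Minimal sketch of a "which runtime can support my model?" check. The
# capability strings and model descriptors are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class RuntimeEnv:
    name: str
    capabilities: set = field(default_factory=set)   # e.g. {"spark", "r", "r:glmnet"}

@dataclass
class Model:
    name: str
    requires: set = field(default_factory=set)

def compatible_runtimes(model: Model, environments: list) -> list:
    """Return the environments whose capabilities cover all of the model's dependencies."""
    return [env.name for env in environments if model.requires <= env.capabilities]

envs = [
    RuntimeEnv("spark-cluster", {"spark"}),
    RuntimeEnv("spark-with-r", {"spark", "r", "r:glmnet"}),
    RuntimeEnv("r-server", {"r", "r:glmnet", "r:forecast"}),
]
model = Model("spark-model-with-r-deps", {"spark", "r:glmnet"})
print(compatible_runtimes(model, envs))   # ['spark-with-r']
```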


Page 16: Automated Analytics at Scale


A lack of engineering abstraction makes it difficult to quickly train predictive models on data. Data scientists lose productivity because the process to train models is manual, requiring a manual check of the model's status in the runtime environment as well as system administration to troubleshoot the model in that environment.

The need for abstraction grows as the number of model types and runtime environments increases.

Wasted productivity: time is spent on data engineering instead of on comparing models to find the best-of-breed for deployment.

No abstractions for training or monitoring models on runtime environments

[Diagram: the manual process – the data scientist builds many models on various algorithms, then for each one trains the model, checks its status, and troubleshoots it; this train/check/troubleshoot cycle is repeated for hundreds of models on various runtime environments while trying different input parameters and algorithms to find the best-of-breed model. The result: more time spent on system administration, less time spent on building predictive models.]


Page 17: Automated Analytics at Scale


Model Management Framework for Automated Analytics at Scale

Page 18: Automated Analytics at Scale


Model Management Framework simplifies the training, deployment, and management of a large number of models for a Lambda architecture

Model management is a framework that lets data scientists and other users more easily train and deploy analytical models in various runtime environments on the Lambda architecture. It abstracts the system administration, reduces the complexity of training and deployment, and shares models in a way that is consumable by users across your organization, enabling users such as business analysts to take part in the modeling process.

The framework in this reference architecture proposes:

• Model Store and Trained Model Store: A library of models of commonly used machine learning algorithms that can be trained on a user's historical datasets, as well as trained models that are available to be deployed.

• Model Interface Templates: Interfaces that abstract away the complexity of the machine learning algorithm, allowing users to specify the inputs and outputs of the model.

• Deployment and Scheduler: Automatic training, deployment, and scheduling of models on runtime environments so that users do not need to operate the runtime environments themselves.

• Runtime Verifier: Ability to determine which runtime environments can support a model prior to execution, enabling faster development of trained models.

• Monitoring Service and Metadata Store: A service that monitors the status of the model during its execution in the runtime environment and stores metadata about that execution.

• API: Exposes functionalities with API endpoints for users to verify, train, deploy, and monitor models on runtime environments.

[Diagram: reference architecture – users (data scientists, business analysts) work through the Model Management layer (API, Model Interface Templates, Model Store, Trained Model Store, Metadata Store, Deployment and Scheduler, Runtime Verifier, Monitoring Service), which sits on top of runtime environments for distributed computing, scientific computing, and real-time analytics.]
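As an illustration of how such an API might be consumed (the base URL, endpoint paths, and payload fields below are hypothetical, not the framework's published API), a client could verify, train, and monitor a model with a few REST calls instead of operating the runtime environment directly:

```python
# Hypothetical client for a model-management REST API. The endpoints and
# fields are assumptions that only illustrate the verify -> train -> monitor
# flow described above.
import time
import requests

BASE = "https://model-manager.example.com/api/v1"   # assumed base URL

# 1. Ask the Runtime Verifier which environments can support the model.
verify = requests.post(f"{BASE}/models/sensor-failure/verify").json()
print("compatible runtimes:", verify.get("runtimes", []))

# 2. Submit a training job on historical data via the Deployment and Scheduler.
job = requests.post(
    f"{BASE}/models/sensor-failure/train",
    json={"runtime": "spark",
          "data": "hdfs:///data/sensors/train",
          "params": {"regParam": 0.1}},
).json()

# 3. Poll the Monitoring Service instead of SSHing into the cluster.
status = "RUNNING"
while status not in ("SUCCEEDED", "FAILED"):
    time.sleep(30)
    status = requests.get(f"{BASE}/jobs/{job['id']}").json()["status"]
print("training finished with status:", status)
```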


Page 19: Automated Analytics at Scale


The Model Management Framework provides seamless interfaces along the data analytics pipeline for model creation, deployment, and scheduling.

• Designing for seamless interfaces means connecting the various stages of the modeling pipeline so that domain experts and data scientists can create and update models, and business analysts can extract data insights.

• Model management at scale is specific to large-scale data analytics, which requires distributed resource allocation and communication with various data stores.

The framework in this technical architecture proposes:

• Runtime Environments: Backend runtime environments such as Spark, MapReduce, R, and more interact with distributed resources (e.g. Hadoop) to train and deploy models

• Historical Data Store: Data virtualization interacts with various databases (e.g. Cassandra, Redshift, S3)

• Training, Prediction, and Model Runtime Services: Framework services interact with the Model Runtime Service to deploy models, allocate resources for them, and verify them for execution

• APIs: APIs interact with the framework services for various functionalities

• Online Message Queue: The message queue is fed with real-time data


[Diagram: technical architecture – data scientists and business analysts use a user interface and API that call the Training Service, Prediction Service, and Resource Allocation Service; these services work with the Model Store, Model Metadata Store, and Results Store, and with the Model Runtime Service, which runs models on the runtime environments against historical data storage and an online message queue.]

Page 20: Automated Analytics at Scale


Demo


Page 21: Automated Analytics at Scale


The Model Management Framework covers a number of features to support various perspectives. The framework provides the following features from its services to better serve domain experts, data scientists, and business analysts:

• Automatic model deployment on multiple runtime environments: Automatically prepares a trained model (the saved .jar file) to serve real-time data on multiple runtime environments, with pre-verification prior to execution.

• Modeling algorithm library: A library of machine learning and statistical learning algorithms.

• Model metadata: A model profile that describes the configuration parameters, paths to input/output data, model version, and resource consumption.

• Heterogeneous data stores: Data can be stored in various databases.

• Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as challengers.

• Batch mode and real-time mode: A combination of model training and serving the model on real-time data.

• Model update: Retraining of the current model or re-selection of the champion model.

• Job completion time estimation: An estimate of how soon a job can be completed given the current resources.

• Prediction results query and UI: Access to prediction results from applying a trained model to real-time data, for dashboard display.

• Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality.
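To make the model metadata feature concrete, a model profile might look like the following. This is a hypothetical example; the field names and values are assumptions, not the framework's actual schema.

```python
# Hypothetical model profile for the metadata store (illustrative fields only).
model_metadata = {
    "name": "sensor-failure-logreg",
    "version": 3,
    "algorithm": "logistic_regression",
    "configuration": {"regParam": 0.1, "maxIter": 100},
    "input_data": "hdfs:///data/sensors/train",
    "output_model": "hdfs:///models/sensor-failure/v3",
    "runtime": "spark",
    "resource_consumption": {"executors": 8, "executor_memory_gb": 4},
    "metrics": {"mse": 0.042},
    "role": "champion",   # champion vs. challenger
}
```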

Page 22: Automated Analytics at Scale


Deploy Accenture's Model Management Framework on premises to operationalize analytics in a big data analytics platform. At Accenture Labs, we have a patent-protected invention on the model management framework that showcases its unique capabilities. If you have analytical models running in a big data analytics platform, we can help deploy our model manager in your environment before problems arise as the number of model types and runtime environments you need to support increases.

• Simplified modeling process for data scientists: Abstracts data engineering and presents a champion-challenger view so your data scientists can more quickly train, compare, and promote their models for deployment.

• Analytics for Internet of Things use cases: Processing data from heterogeneous data stores allows data from thousands of sensors to be sent through the modeling pipeline, leveraging the existing platform's analytical capabilities.

• Enabled for real-time analytics: The model manager can deploy prediction jobs that ingest streaming data and apply a trained model for real-time predictions (see the sketch below).

• Greater coverage of runtime environments and models: Extends the capability to support additional runtime environments, increasing the number of model types you can use in your data pipeline.

• Democratized access to analytics: A shared library of models created by experts allows other data scientists and business analysts to leverage the models for their use cases.
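A minimal sketch of such a real-time prediction job, assuming Spark Structured Streaming with a Kafka source; the broker address, topic name, message schema, feature columns, and model path are assumptions for illustration:

```python
# Minimal sketch of a streaming prediction job. Broker, topic, schema,
# feature columns, and model path are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegressionModel

spark = SparkSession.builder.appName("sensor-failure-scoring").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("pressure", DoubleType()),
    StructField("flow_rate", DoubleType()),
])

# Ingest real-time sensor readings from the message queue (Kafka).
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensor-readings")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Assemble features and apply the previously trained champion model.
features = VectorAssembler(inputCols=["pressure", "flow_rate"],
                           outputCol="features").transform(readings)
model = LogisticRegressionModel.load("hdfs:///models/sensor-failure/champion")
predictions = model.transform(features).select("sensor_id", "prediction")

# Stream per-sensor failure predictions out (console sink for this sketch).
predictions.writeStream.outputMode("append").format("console").start().awaitTermination()
```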

Page 23: Automated Analytics at Scale


Contact Information – Accenture Labs

Teresa Tung, Technology Labs, [email protected]

Carl Dukatz, R&D Senior, [email protected]

Chris Kang, R&D Associate, [email protected]

Page 24: Automated Analytics at Scale


Appendix


Page 25: Automated Analytics at Scale


The solution: A new Model Management Framework – simplifying model deployment at scale


A simplified interface

RESULTS
• Enables a catalog approach to finding analytics
• Simplified onboarding of new analytics
• Brute-force approach to retraining and comparing models

• Comprises a model building service, a prediction service, and a resource allocation service

• Supports end-to-end analytical modeling at scale using the Lambda Architecture

• Hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts

Page 26: Automated Analytics at Scale


Benefits of the new framework: unlocking the power of Lambda for data scientists, domain experts, and business analysts


Data scientists and domain experts who generate the models can:
• Select from already captured modeling approaches or onboard their own
• Easily compare models in a champion-challenger fashion

Business analysts who rely on the models' results can select from a catalog of models created by experts.

Page 27: Automated Analytics at Scale


The Model Management Framework differs from other approaches in that it enables big data capability with heterogeneity and scalability.

Other analytics approaches focus on designing and fine-tuning machine learning algorithms to improve accuracy, using modeling tools that are hard to scale or speed up. For example, the WEKA libraries provide comprehensive machine learning algorithms but lack the ability to integrate with big data or manage thousands of models, and Apache Mahout works with Hadoop MapReduce, which is slowed down by frequent writes to disk.

Comparison examples

Model Management Framework:
• I want to run my analytics on a distributed dataset of terabytes or petabytes that is geographically distributed and stored in various databases.
• I want to deploy multiple models on distributed resources and let the framework automatically select the best model based on the metrics I have defined.
• I want to specify the prediction interval and query the results by calling API endpoints.
• I want to always use the up-to-date model by having the framework retrain the current model or select a new champion model.

Other model management:
• I want to improve my SVM classification algorithm by 3% in terms of accuracy with my 300 MB dataset residing on my local disk.
• I want to try various algorithms and fine-tune parameters to see how the accuracy can be improved.
• I want to apply the trained model to new data for prediction by calling the modeling method and specifying where to store the results; I need to try multiple prediction intervals to see which works.
• I want to see the prediction results by plotting the data from the file where the results are saved.
