STRATIO BIG DATA SCIENCE PLATFORM - files.meetup.com Big Data... · Spark: Mllib + GraphX Frequent...

© Stratio 2015. Confidential, All Rights Reserved. STRATIO BIG DATA SCIENCE PLATFORM


STRATIO BIG DATA SCIENCE PLATFORM


Library              | Single-Machine | Distributed | Distributed Graph Algorithms | Visualization | IDE  | Spark Integration | Hadoop Integration | Community
Spark: MLlib+GraphX  | ++             | ++++        | ++++                         | -             | ++   | ++++              | +++                | ++++
R                    | ++++           | +           | -                            | ++++          | ++++ | ++                | +                  | ++++
Scikit-learn         | ++++           | ++          | -                            | ++++          | +++  | ++                | +                  | +++
H2O                  | +              | +++         | -                            | +++           | ++   | +++               | +++                | ++
Apache Mahout        | ++             | +++         | -                            | -             | +    | ++                | +++                | +
Apache SystemML      | ++             | +++         | -                            | -             | ++   | +++               | +++                | ++

No single library provides all of these features at a good or very good level


Spark: MLlib + GraphX

Algorithm name   Single Machine   Distributed

Classification and Regression

Linear Support Vector Machines (SVMs) X X

Logistic regression X X

Linear least squares X X

Lasso L1 regularization X X

Ridge regression X X

Streaming linear regression X X

Isotonic regression X X

Decision Trees X X

Ensembles of decision trees (Random forests, Gradient-boosted trees) X X

Naive Bayes (Multinomial naive Bayes, Bernoulli naive Bayes) X X


Collaborative Filtering

Alternating least squares (ALS) X X

Clustering

K-means X X

Gaussian mixture X X

Power iteration clustering (PIC) (using GraphX as its backend) X X

Latent Dirichlet allocation (LDA) X X

Streaming k-means X X

Dimensionality Reduction

Singular value decomposition X X

Principal component analysis X X

Graph

PageRank X X

Closeness Centrality X X

Betweenness Centrality X X

Triangle Counting X X

Connected Components X X

Strongly Connected Components X X


Spark: MLlib + GraphX

Frequent Pattern Mining

FP-growth X X

Association rules X X

PrefixSpan X X

Feature Extraction and Transformation

Term frequency-inverse document frequency (TF-IDF) X X

Word2Vec X X

StandardScaler X X

Normalizer X X

Feature selection (ChiSqSelector) X X

ElementwiseProduct X X


R

Algorithm name   Single Machine   Distributed

Classification and Regression

Linear Support Vector Machines X

Penalized SVM X

Outliers X

Decision Trees X

Ridge regression X

Naïve Bayes X

Adaboost X

JRip X

…... X

Collaborative Filtering

Alternating least squares (ALS) X

Clustering

K-means X

Hybrid Hierarchical Clustering X

Expectation Maximization (EM) X

Dissimilarity Matrix Calculation X

Hierarchical Clustering X

Bayesian Hierarchical Clustering X

Density-Based Clustering X

K-Cores X

...

Dimensionality Reduction

Singular value decomposition X

Principal component analysis X

Feature Selection X

... X

Frequent Pattern Mining

FP-growth X

arulesNBMiner X

The Apriori Algorithm X

The Eclat Algorithm X

... X

Feature Extraction and Transformation


Scikit-learn

Algorithm name   Single Machine   Distributed

Classification and Regression

Stochastic Gradient Descent X X

Approximate nearest neighbor X X

Locality Sensitive Hashing Forest (LSH) X X

SVM X X

Gaussian Naive Bayes X X

Multinomial Naive Bayes X X

Bernoulli Naive Bayes X X

Logistic Regression X X

Ridge Regression X

Lasso X

Elastic Net X

Multi-task Lasso X

Least Angle Regression X

LARS Lasso X

Orthogonal Matching Pursuit (OMP) X

Bayesian Regression X

…. X

Clustering

K-means X X

Affinity propagation X

Mean-shift X

Spectral clustering X

Ward hierarchical clustering X

Agglomerative clustering X

DBSCAN X

Gaussian mixtures X

Birch X

Dimensionality Reduction

Incremental PCA X

Kernel PCA X

Truncated singular value decomposition and latent semantic analysis X

Sparse coding with a precomputed dictionary X

Generic dictionary learning X

Factor Analysis X

Independent component analysis X

Non-negative matrix factorization (NMF or NNMF) X

Latent Dirichlet Allocation (LDA) X

... X


Scikit-learn

Frequent Pattern Mining

FP-growth X

arulesNBMiner X

The Apriori Algorithm X

The Eclat Algorithm X

... X

Feature Extraction and Transformation

Standardization, or mean removal and variance scaling X

Normalization X

Binarization X

Encoding categorical features X

Imputation of missing values X

Generating polynomial features X

Custom transformers X

Grid Search X

Cross Validation

K-Fold X X

Leave-One-Out - LOO X

Random permutations cross-validation a.k.a. Shuffle & Split X

... X


H2O

Algorithm name   Single Machine   Distributed

Classification and Regression

Generalized Linear Models X X

Distributed Random Forest X X

Naive Bayes X X

Gradient Boosted Regression X X

Gradient Boosted Classification X X

Clustering

K-means X X

Dimensionality Reduction

Principal component analysis X X

Feature Extraction and Transformation X X

Grid Search X X

Deep Learning X X


Apache Mahout

Algorithm name   Single Machine   Distributed

Classification and Regression

Random Forest X X*

Naïve Bayes X X*

Hidden Markov Models X

Multilayer Perceptron X

Logistic Regression X

Collaborative Filtering

Alternating least squares (ALS) X X*

Clustering

K-means X X*

Fuzzy K-Means X X

Streaming K-Means X X*

Spectral Clustering X

Dimensionality Reduction

Singular value decomposition X X*

Principal component analysis X X*

Lanczos Algorithm X X

QR Decomposition X X

Feature Extraction and Transformation X X


Apache SystemML

Algorithm name   Single Machine   Distributed

Classification and Regression

Multinomial Logistic Regression X X

Binary-Class Support Vector Machines X X

Multi-Class Support Vector Machines X X

Naive Bayes X X

Decision Trees X X

Random Forests X X

Linear Regression X X

Stepwise Linear Regression X X

Generalized Linear Models X X

Stepwise Generalized Linear Regression X X

Regression Scoring and Prediction X X

Clustering

K-means X X

Dimensionality Reduction

Matrix Completion via Alternating Minimizations X X

Principal component analysis X X

Feature Extraction and Transformation X X


DATA SCIENCE BIG DATA PLATFORM

•Integration of different open source libraries with distributed machine learning algorithms

•Development environment for every data scientist

•Making real-time decisions based on the models learned by machine learning algorithms

•Integrated with all components of the Stratio Big Data Platform

•Full management of the knowledge life cycle


Roman Martin


DATA SCIENCE BIG DATA PLATFORM


Milestones

Machine learning life cycle with Big Data
Catalog of distributed machine learning algorithms
+30 Big Data components (ingestion, data stores, real-time, visualization, notebook)
74 distributed machine learning algorithms
Low learning curve for a data scientist
Integrated with 4 Data Science development environments (IPython, Spark, Java)


MACHINE LEARNING (LEARN FROM THE PAST TO PREDICT THE FUTURE)

CATALOG OF DISTRIBUTED MACHINE LEARNING ALGORITHMS


[Diagram with three blocks. ML Distributed Algorithms Catalog: Classification & Regression, Recommendation, Graph, ... Development Environment of Data Science: RStudio, iPython, Java & Scala. StratioML: R, Python, Java & Scala.]


Catalog of Distributed Machine Learning Algorithms

Types of Algorithms                      Number of Algorithms
Classification and Regression            37
Collaborative Filtering                  2
Clustering                               10
Dimensionality Reduction                 7
Graph                                    6
Frequent Pattern Mining                  3
Feature Extraction and Transformation    7
Cross Validation                         1
Deep Learning                            1
TOTAL                                    74
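As a quick arithmetic check, the per-category counts in the table above can be summed to confirm the stated total of 74. This is a minimal Python sketch; the dictionary name `catalog_counts` is chosen here purely for illustration.

```python
# Per-category counts copied from the catalog table above.
catalog_counts = {
    "Classification and Regression": 37,
    "Collaborative Filtering": 2,
    "Clustering": 10,
    "Dimensionality Reduction": 7,
    "Graph": 6,
    "Frequent Pattern Mining": 3,
    "Feature Extraction and Transformation": 7,
    "Cross Validation": 1,
    "Deep Learning": 1,
}

total = sum(catalog_counts.values())
print(total)  # 74
```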

The catalog of distributed machine learning algorithms is based on the main open source machine learning libraries available:

● Apache MLlib
● Apache GraphX
● Sparkit-learn
● H2O
● SystemML
● Mahout


StratioML

• Integration with the datastores of the platform:

☑ HDFS, Parquet, HIVE

☑ Cassandra, MongoDB, ElasticSearch

☑ Stratio Postgres Big Data

• Integration of different distributed machine learning algorithm libraries with the various programming languages used by data scientists:

☑ Python

☑ Scala

☑ Java

☑ R

Contains the components that give the data scientist the capabilities to use the Stratio Big Data Platform and the catalog of distributed machine learning algorithms


CATALOG OF DISTRIBUTED MACHINE LEARNING ALGORITHMS: STRATIO CUSTOMER INTELLIGENCE


Stratio Customer Intelligence

The main difficulty in a segmentation, profiling or recommendation problem is defining the criteria, or data, on which the profiling will be based. In the Stratio Big Data Science Platform we always work from these criteria:

➔ Profiling should always be performed based on a particular context, user behavior, user reviews, connections, etc.

➔ The result of profiling based on behavior is a very good source of knowledge from which to make recommendations associated with that behavior.

“Knowledge is neither created nor destroyed, it is transformed”


Stratio Customer Intelligence

In Stratio Customer Intelligence, we have developed a set of algorithms that learn from customer behavior and make recommendations based on that behavior.

1. Clustering users based on their behavior
Clusters are generated from the intrinsic information of each user, based on their behavior. Users are clustered on these types of behavior:

➔ Quantitative data expressing opinion (rating, number of views, clicks, etc.)
➔ Sequence behavior (navigation, actions, etc.)
➔ Relationship with other elements (network of friends, networking, consumed products)

2. User classification (Stratio Profiler)
The next step is to automatically link each user to its assigned cluster using the user's own features / labels.

3. Recommendation System (Stratio Recommender)
With the information generated, recommendations are produced by ranking the predictions based on the tastes of each cluster, adapted to the target user.


Stratio Customer Intelligence: User Clustering

The following is an example of clustering based on user ratings on movies (the behavior of the user)
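The original slide showed a figure at this point. As a rough illustration of the idea only, the sketch below groups users by their rating vectors with a plain k-means loop. The ratings matrix, the choice of k-means, and the function name `kmeans` are assumptions made for this example; the source does not say which clustering algorithm Stratio Customer Intelligence uses.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on lists of numbers, starting from the given centroids."""
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical ratings (rows: users, columns: movies), on a 1-5 scale.
ratings = [
    [5, 4, 1, 1],  # user 0 likes the first two movies
    [4, 5, 2, 1],  # user 1 has similar tastes
    [1, 1, 5, 4],  # user 2 likes the last two movies
    [2, 1, 4, 5],  # user 3 has similar tastes
]

# Seed the centroids with two users so the run is deterministic.
clusters, _ = kmeans(ratings, [list(ratings[0]), list(ratings[2])])
print(clusters)  # [0, 0, 1, 1]
```

On a real data set the rating vectors would be sparse and far larger, so a distributed implementation from the catalog would take the place of this loop.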


Stratio Customer Intelligence: User Profiling

At this point every user has been assigned to a cluster. The classifier seeks to relate each user's default label characteristics, along with any others the data scientist considers relevant, to the user's own cluster. The system takes the following inputs:

➔ Categorical characteristics of users
➔ Relevant features added
➔ Label assignments given by the target cluster

The classifier system generates a new model able to assign users to clusters.
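One way to picture this step is a small categorical naive Bayes model trained on user features, with the cluster labels as the target. The sketch below is an assumption-laden illustration: the feature columns, the data, and the names `train_profiler` and `predict_cluster` are hypothetical, and the source does not state which classifier Stratio Profiler uses.

```python
import math
from collections import Counter, defaultdict

def train_profiler(features, cluster_labels):
    """Fit a tiny categorical naive Bayes model: user features -> cluster."""
    priors = Counter(cluster_labels)
    # counts[cluster][feature_index][value] = number of occurrences
    counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen per feature index
    for row, cluster in zip(features, cluster_labels):
        for i, value in enumerate(row):
            counts[cluster][i][value] += 1
            values[i].add(value)
    return priors, counts, values

def predict_cluster(model, row):
    """Return the cluster with the highest log posterior for one user."""
    priors, counts, values = model
    total = sum(priors.values())

    def score(cluster):
        log_p = math.log(priors[cluster] / total)
        for i, value in enumerate(row):
            # Laplace smoothing so unseen values do not zero the product.
            num = counts[cluster][i][value] + 1
            den = priors[cluster] + len(values[i])
            log_p += math.log(num / den)
        return log_p

    return max(priors, key=score)

# Hypothetical categorical user features: (age band, favourite genre).
users = [("18-25", "action"), ("18-25", "action"),
         ("40-60", "drama"), ("40-60", "drama")]
cluster_labels = [0, 0, 1, 1]  # labels produced by a previous clustering step

model = train_profiler(users, cluster_labels)
print(predict_cluster(model, ("18-25", "action")))  # 0
print(predict_cluster(model, ("40-60", "comedy")))  # 1
```

The second call shows the point of the profiler: a user with no rating history can still be placed in a cluster from categorical features alone.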


Stratio Customer Intelligence: User Recommendation

Recommendation systems are much more powerful when built on sets of users with similar characteristics. Grouping all the users of a cluster yields group-level recommendations. These recommendations are then delivered to a specific user after being cross-checked against that user's history.
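The idea of group recommendations crossed with a user's history can be sketched as follows: score each item by its mean rating among the other members of the user's cluster, then drop the items the user has already rated. The data, the mean-rating score, and the function name `recommend` are assumptions for illustration, not the Stratio Recommender implementation.

```python
def recommend(user, ratings, clusters, top_n=2):
    """Rank items by mean rating within the user's cluster, skipping seen items."""
    peers = [u for u in ratings if clusters[u] == clusters[user] and u != user]
    totals, counts = {}, {}
    for peer in peers:
        for item, rating in ratings[peer].items():
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    seen = set(ratings[user])  # the user's own history
    scores = {item: totals[item] / counts[item]
              for item in totals if item not in seen}
    # Highest mean rating first; ties are broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical ratings and cluster assignments.
ratings = {
    "ana":   {"Alien": 5, "Heat": 4},
    "ben":   {"Alien": 5, "Heat": 5, "Ronin": 4, "Speed": 2},
    "carla": {"Alien": 4, "Ronin": 5, "Speed": 3},
    "dan":   {"Amelie": 5, "Emma": 4},
}
clusters = {"ana": 0, "ben": 0, "carla": 0, "dan": 1}

print(recommend("ana", ratings, clusters))  # ['Ronin', 'Speed']
```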


Stratio Customer Intelligence: Summary

Below is a comparison of the recommendation success ratio of the Stratio Customer Intelligence algorithms versus the Spark ALS recommendation algorithm on the MovieLens dataset.

MovieLens dataset:

Number of ratings: 1,000,209
Number of users: 6,040
Number of movies: 3,706
Number of evaluated recommendations: 114,518

Recommendation hit ratio:

Spark ALS algorithm: 33.63 %
Stratio Recommender: 41.25 %
Stratio Recommender + Stratio Profiling: 57.47 %
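The slides do not define how the success ratio is computed. One common reading, assumed here, is the share of evaluated recommendations that appear in a held-out set of items the user actually liked. The sketch below uses made-up data and a hypothetical function name `hit_ratio`; it does not reproduce the figures above.

```python
def hit_ratio(recommended, relevant):
    """Share of evaluated recommendations found in each user's held-out set."""
    evaluated = hits = 0
    for user, items in recommended.items():
        for item in items:
            evaluated += 1
            if item in relevant.get(user, set()):
                hits += 1
    return hits / evaluated if evaluated else 0.0

# Hypothetical recommendations and held-out "liked" items per user.
recommended = {"ana": ["Ronin", "Speed"], "dan": ["Chocolat"]}
relevant = {"ana": {"Ronin"}, "dan": {"Chocolat"}}

print(f"{hit_ratio(recommended, relevant):.2%}")  # 66.67%
```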


DEVELOPMENT ENVIRONMENT FOR MACHINE LEARNING


IPython - Development Environment on Python

IPython provides:

● A powerful interactive shell

● Developed on Jupyter Notebook

● Integrated with interactive visualization tools and GUIs:
○ wxPython
○ PyQt4/PySide
○ PyGTK
○ Tk
● Flexible and embeddable by different interpreters
● Easy-to-use, optimized tools for distributed processing


RStudio - Development Environment on R

RStudio is an integrated development environment (IDE) for R including:

● Console
● Syntax editor with direct code execution
● Logs
● History
● Debugging
● Workspace management


Java & Scala

There are different development environments for Java and Scala. Two of the most widely used are IntelliJ and Eclipse. These environments provide:

● Syntax editor with direct code execution
● History
● Debugging
● Workspace management
● Integration with different SCMs
● Remote debugging
● Integration with Maven and SBT
● Support for developing plugins


MACHINE LEARNING FOR DECISIONS IN REAL TIME

MACHINE LEARNING LIFE CYCLE WITH BIG DATA


Machine Learning Life Cycle

The aim is to manage the life cycle of knowledge. This requires a recurring, automated, real-time knowledge management process that uses machine learning technology to learn from experience and anticipate problems.

BIG DATA, MACHINE LEARNING, DECISION MAKING, ALERT GENERATION


Postgres Big Data


Knowledge Life Cycle

BIG DATA AND DISTRIBUTED PROCESSING OF THE KNOWLEDGE LIFE-CYCLE IN REAL TIME

[Diagram stages: Ingestion, Store Raw Data, Information Visualization, Data Correlation in Real Time, Data Normalization in Real Time, Big Data Multi-persistence, Decision Making, Data Query, Machine Learning, Data Enrichment in Real Time, Real Time Monitoring]


Ingestion

Standardized data reception
High performance and operational flexibility
Producers and consumers decoupled


Real Time Monitoring

Monitoring throughout the knowledge life cycle
Monitoring dashboards and alert management
Viewing correlated information


Store Raw Data

Centralized management of the knowledge life cycle
Learning without going to the source
Historical storage


Enrichment, Correlation and Decision Making

Enrichment of data based on rules
Processing of complex events
Anomaly detection, fraud, incident analysis, etc. in real time based on learned models
Siddhi CEP


Big Data Multi-persistence

Multi-persistence system
Flexibility in queries
Scalable distributed storage of information
Updating of aggregated data


Machine Learning

Machine learning algorithms
Development environments for machine learning
Full-cycle analysis of the data
Viewing the results of the algorithms


Data Query

Abstraction of the data source
Optimization of the most common queries
Single access interface


Information Visualization

Reporting and dynamic dashboards
Integration with BI tools
Easy, free exploration of the data


DICTIONARY OF DATA

GIVE MEANING TO DATA = KNOWLEDGE


Dictionary of Data

☑ Centralized management of information schemas is a key element of proper data governance. It must necessarily address three key issues:

• Unified schema registration
• Evolution and versioning of schemas over time
• A dictionary of fields for unified mapping

☑ The absence of a technological component of this type poses a significant risk of losing control over the serialization / deserialization of stored data, especially when we consider the time factor and the volatility of unstructured data (changing patterns).

☑ It is critical to give consumers the means to manage data structures and to properly manage their changes over time.

☑ The solution itself relies on the strengths of Kafka and Avro to solve its own schema registration problems, such as:

• Assigning a globally unique ID to each registered schema
• Reliable, replicated schemas
• High-performance distributed architecture

It is necessary to incorporate a semantic layer that serves as a central repository of meta-information.
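To make the registration requirements concrete, the sketch below models a registry that assigns a globally unique ID to each distinct schema and tracks versions per subject. It is an in-memory stand-in written for illustration: the class `SchemaRegistry`, its methods, and the sample Avro-style schemas are all hypothetical and are not the API of Kafka, Avro, or any Stratio component.

```python
import json

class SchemaRegistry:
    """In-memory stand-in for a schema registry (illustrative only)."""

    def __init__(self):
        self._by_id = {}     # global id -> schema (dict)
        self._ids = {}       # canonical JSON text -> global id
        self._versions = {}  # subject -> list of global ids, oldest first

    def register(self, subject, schema):
        """Register a schema under a subject and return its global ID."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            # Assign the next globally unique ID to a schema not seen before.
            new_id = len(self._by_id) + 1
            self._ids[key] = new_id
            self._by_id[new_id] = schema
        schema_id = self._ids[key]
        versions = self._versions.setdefault(subject, [])
        if schema_id not in versions:
            versions.append(schema_id)  # a new version for this subject
        return schema_id

    def latest(self, subject):
        """Return (version number, schema) for the newest version."""
        versions = self._versions[subject]
        return len(versions), self._by_id[versions[-1]]

registry = SchemaRegistry()
v1 = {"type": "record", "name": "Click",
      "fields": [{"name": "user", "type": "string"}]}
v2 = {"type": "record", "name": "Click",
      "fields": [{"name": "user", "type": "string"},
                 {"name": "url", "type": ["null", "string"], "default": None}]}

id1 = registry.register("clicks", v1)
id2 = registry.register("clicks", v2)
print(id1, id2, registry.latest("clicks")[0])  # 1 2 2
print(registry.register("clicks", v1))         # 1 (same schema, same ID)
```

Storing only the ID alongside each serialized record lets a consumer look up the exact schema that wrote it, which is what keeps deserialization under control as schemas change.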


DECISION MAKING AND ALERT GENERATION


Decision Making and Real Time Alert Generation

For true knowledge management it is necessary to automate decision-making and alert generation

STRATIO DECISION

• Managing the flow of knowledge:

☑ Real-time integration of information originating at different times, from different origins and components.

☑ Detecting patterns of information in real time.

• Rule management:

☑ User-configurable rules that automate real-time decision making and alert generation.

• Integration with machine learning algorithms:

☑ Rules that make decisions and generate alerts based on predictive machine learning algorithms.
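A minimal sketch of how configurable rules and a learned model might combine to produce alerts is shown below. Everything in it is assumed for illustration: `fraud_score` is a hypothetical stand-in for a trained model, and the rule tuples and event fields are invented; this is not the Stratio Decision API.

```python
def fraud_score(event):
    """Hypothetical stand-in for a learned model; returns a score in [0, 1]."""
    return min(1.0, event["amount"] / 10_000)

# User-configurable rules: (name, condition, action).
rules = [
    ("high_fraud_score", lambda e: fraud_score(e) >= 0.8, "ALERT"),
    ("foreign_large", lambda e: e["country"] != "ES" and e["amount"] > 3_000, "REVIEW"),
]

def decide(event):
    """Return the (rule name, action) pairs of every rule that fires."""
    return [(name, action) for name, condition, action in rules if condition(event)]

# Hypothetical event stream.
stream = [
    {"id": 1, "amount": 120, "country": "ES"},
    {"id": 2, "amount": 9500, "country": "ES"},
    {"id": 3, "amount": 4000, "country": "FR"},
]

for event in stream:
    print(event["id"], decide(event))
```

Keeping the rules as data rather than code is what would let users reconfigure them without redeploying the model.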


CENTRAL INFORMATION VISUALIZATION: NOT TO STAY ON THE SURFACE


CENTRAL INFORMATION VISUALIZATION

The platform allows you to visualize the entire knowledge life cycle and provides different views of information based on the consumer

STRATIO VIEWER

• Apply knowledge:

☑ Aggregates generated in real time

☑ Analytical reports

☑ Union of information available on different datastores.

☑ Heat maps in real time indicating the origin of the information.

STRATIO EXPLORER

• It provides all views of knowledge:

☑ Raw information

☑ Information standardized to the data dictionary

☑ Information enriched with inference processes

☑ Correlated information (knowledge)

• It allows interaction with all sections of the Platform.

• It allows users to exchange different views of information.


UNIQUE DATA SCIENCE PLATFORM OF BIG DATA COVERING ALL THE LIFE CYCLE OF INFORMATION
