[Strata] Sparkta

SPARKTA: A real-time analytics platform based on Apache Spark. London, May 2015


FIRST SPARK PLATFORM. APR 2014

20+ INTERNATIONAL PROJECTS WITH SPARK

1. PLATFORM OVERVIEW

[Diagram: Stratio platform overview]

Data sources: CRM, ERP, Call Center, BI, internal data, external data

Modules: STRATIO INGESTION, STRATIO STREAMING, STRATIO QUANTUM, STRATIO DEEP, STRATIO CROSSDATA, STRATIO DATAVIS

Customer lake: HDFS, S3, ElasticSearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases

Access: ODBC, JDBC, REST API

Consumers: BI, ad hoc, app

[Diagram: Stratio platform modules and their roles]

STRATIO INGESTION: Ingests, transforms

STRATIO STREAMING: Analyzes and processes real-time streaming

STRATIO CROSSDATA: A unified SQL interface

STRATIO QUANTUM: Machine learning and algorithms

STRATIO DEEP: Processes & combines with Spark

STRATIO DATAVIS: Creates and designs dashboards and reports

[Diagram: Stratio platform and underlying technologies]

Technologies shown alongside the modules: Streaming, Apache Kite, Apache Flume, MLlib

STRATIO DEEP: Combines with Spark data from any source

[Diagram: Stratio platform, data combination through time]

STRATIO CROSSDATA: Consult & analyze. SQL interface

Data combination through time:

• Real-time: ephemeral tables

• Past: stored tables

• Future: Quantum tables

[Diagram: Stratio platform, informational and operational data]

STRATIO CROSSDATA: Query and analyze. SQL interface

INFORMATIONAL + OPERATIONAL WITHOUT NEED TO REPLICATE DATA

Operational: Oracle, DB2, other databases, MongoDB, Teradata

2. REAL-TIME: Beyond cool dashboards

The time is NOW

We all know this story already

Social media and networking sites are a part of the fabric of everyday life, changing the way the world shares and accesses information.

The overwhelming amount of information gathered not only from messages, updates and images but also readings from sensors, GPS signals and many other sources was the origin of a (big) technological revolution.

Remember? VOLUME, VARIETY & VELOCITY

CONFERENCE10

Look at these sexy infographics!

We all love data visualization

Insights from this vast amount of data allow us to learn from the users and explore our own world.

We can follow in real-time the evolution of a topic, an event or even an incident just by exploring aggregated data.


Delivering real-time business in the Internet

But beyond cool visualizations, there are some core services delivered in real-time, using aggregated data to answer common questions in the fastest way.

These services are the heart of the business behind their nice logos.

Site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards…

Aggregated data feeds directly to end users, publishers, and advertisers, among others.


Pushing business processes to perform faster

Digital companies, born to deliver their services in real-time, have changed the expectations of many other businesses.

Real-time information makes it possible for a company to be much more agile than its competitors, improving business answers, gaining insights on their performance…


Listen to your data…

[Diagram: data sources around the CLIENT]

TPV (point-of-sale terminal), accounts, loans and credits, insurance, broker, mortgages, cards, deposits, ATM, online gateway, application logs, social networks, transactions, geolocation, CRM

Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a process.

And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...


…and start delivering real-time services

Real-time monitoring could be really nice, but your company needs to work in the same way as digital companies:

• Rethinking existing processes to deliver them faster, better.

• Creating new opportunities for competitive advantages.


2. REAL-TIME: Challenges at Stratio

Real-time fraud monitoring

[Diagram: DATA RECEIVER, REAL-TIME AGGREGATION, CONSOLIDATION, FRAUD DETECTION, Dashboarding, Reporting]

Leveraging the power of Spark Streaming, we have developed some fraud detection solutions, aggregating data in real-time to work better with machine learning algorithms.


Extract, Transform and Aggregate

By combining Apache Flume and Spark Streaming we have deployed complex topologies to deal with data coming from heterogeneous sources.

The full solution allows us to transform and aggregate data on-the-fly (data cleaning, normalization and enrichment).
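As a rough illustration of the three on-the-fly steps named above, here is a minimal sketch in plain Python. It is not the Flume or Spark Streaming topology from the projects; the event fields, the `COUNTRY_BY_CITY` lookup table and all function names are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical lookup table used for enrichment (not from the talk).
COUNTRY_BY_CITY = {"london": "UK", "madrid": "ES"}


def clean(event):
    """Data cleaning: drop events that lack the fields we aggregate on."""
    if not event.get("city") or event.get("amount") is None:
        return None
    return event


def normalize(event):
    """Normalization: trim and lowercase the city, coerce the amount to float."""
    return {**event, "city": event["city"].strip().lower(), "amount": float(event["amount"])}


def enrich(event):
    """Enrichment: attach a country derived from the normalized city."""
    return {**event, "country": COUNTRY_BY_CITY.get(event["city"], "unknown")}


def transform_and_aggregate(events):
    """Apply the three steps to each event, then sum amounts per country."""
    totals = Counter()
    for raw in events:
        event = clean(raw)
        if event is None:
            continue
        event = enrich(normalize(event))
        totals[event["country"]] += event["amount"]
    return dict(totals)


events = [
    {"city": " London ", "amount": "10.5"},
    {"city": "MADRID", "amount": 4},
    {"city": "", "amount": 3},  # dropped by clean()
    {"city": "london", "amount": 2},
]
print(transform_and_aggregate(events))  # {'UK': 12.5, 'ES': 4.0}
```

In a streaming deployment the same three functions would be applied per micro-batch rather than over a list held in memory.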

[Diagram: REAL-TIME AGGREGATION feeding Dashboarding and Reporting]


Custom data sources and storage

Each project requires specific inputs and data storages, dealing with different kinds of events.

From click stream activity to bank transactions...

[Diagram: DATA STREAM, LOADING, TRANSFORM, CUSTOM LOGS]


Towards a generic real-time aggregation platform

At Stratio, we have implemented several real-time analytics projects based on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.

These technologies were always a perfect fit, but soon we found ourselves writing the same pieces of integration code over and over again.

This is how SPARKTA was born.


3. ELSEWHERE

#1 RainBird from Twitter

Some folks from Twitter shared some thoughts about their real-time needs at Strata (2011).

They worked on a “generic” platform in order to deal with pre-calculated data from a huge number of events.

It allows them to deal with:

• Data Structures

• Hierarchical Aggregation

• Temporal Aggregation

• Multiple Formulas

CURRENT STATE: Still not open source

http://goo.gl/ykvQa


#2 Countandra

Countandra is a hierarchical distributed counting engine exploiting all the excellent write & read performance of Cassandra.

It supports:

• Geographically distributed counting.

• Easy HTTP-based interface to insert counts.

• Hierarchical counting such as com.mywebsite.music.

• Retrieves counts, sums and squares in near real-time.

• Simple HTTP queries provide the desired output in JSON format.

• Queries can be sliced by period such as lasthour, lastyear and so on for minutely, hourly, daily, monthly values.

https://github.com/milindparikh/Countandra

CURRENT STATE: Rather deprecated
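To make "hierarchical counting such as com.mywebsite.music" concrete, here is a minimal in-memory sketch in Python. It assumes that a count on a dotted key is also credited to every ancestor prefix; it does not reproduce Countandra's HTTP interface or its Cassandra storage, and the class and function names are hypothetical.

```python
from collections import defaultdict


def hierarchy_prefixes(key):
    """'com.mywebsite.music' -> ['com', 'com.mywebsite', 'com.mywebsite.music']."""
    parts = key.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


class HierarchicalCounter:
    """Keeps one counter per level of the dotted hierarchy."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, key, n=1):
        # Credit the count to the key itself and to each of its ancestors.
        for prefix in hierarchy_prefixes(key):
            self.counts[prefix] += n

    def get(self, key):
        return self.counts.get(key, 0)


counter = HierarchicalCounter()
counter.add("com.mywebsite.music", 3)
counter.add("com.mywebsite.video", 2)
print(counter.get("com.mywebsite"))        # 5
print(counter.get("com.mywebsite.music"))  # 3
```

Writing to every level at insert time trades extra writes for cheap reads at any level of the hierarchy.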


#3 ThunderRain from Intel

ThunderRain is a Real-Time Analytical Processing (RTAP) example using Spark and Shark, which can be best characterized by the following four salient properties:

• Data continuously streamed in & processed in near real-time

• Real-time data queried and presented in an online fashion

• Real-time and history data combined and mined interactively

• Predominant RAM-based processing

https://github.com/thunderain-project/thunderain

CURRENT STATE: Rather deprecated


#4 TSAR from Twitter

TSAR (the TimeSeries AggregatoR) is a flexible, reusable, end-to-end service architecture on top of Summingbird.

Twitter really needs a truly robust real-time aggregation service considering their scaling and evolving needs.

They realized that many time-series applications call for essentially the same architecture, with only slight variations in the data model.

https://blog.twitter.com/2014/tsar-a-timeseries-aggregator

CURRENT STATE: Still not open source


Towards a generic real-time aggregation platform

Some initiatives have tried to solve this problem, but until now most of them were complex or obsolete while others were not open source.

For this reason, Stratio created SPARKTA: an open source and full-featured platform for real-time analytics, based on Apache Spark.

This is why SPARKTA was conceived.


4. THIS IS SPARKTA

Distributed, high-volume & pluggable analytics framework

Our goals:

Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time services.

• Pure Spark!

• No coding needed, only declarative aggregation workflows

• Data continuously streamed in & processed in near real-time

• Ready to use out of the box

• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)

• High performance

• Scalable and fault tolerant


Sparkta: A first look

[Diagram: DRIVER - SUPERVISOR, AGGREGATION POLICY, AGGREGATION WORKFLOW, QUERY SERVICES, others]

The aggregation policy definition is sent to the engine.

Allows multiple applications to be defined, each of which is bound to a context, executing the aggregation workflow.


Sparkta: Deploy any number of real-time aggregation policies

DRIVER - SUPERVISOR

You can start several workflows at any time, and also stop or monitor them.


Sparkta: Key Technologies

INPUTS: RabbitMQ, ZeroMQ, Twitter, Flume, Kafka, ...

PROCESSING: [logo] + Apache Kite SDK

OUTPUTS: [logos] ...


Sparkta: Define your real-time needs

AGGREGATION POLICY

Remember: no need to code anything.

Define your workflow in a JSON document, including:

INPUT: Where is the data coming from?

OUTPUT(s): Where should aggregate data be stored?

DIMENSION(s): Which fields will you need for your real-time needs?

ROLLUP(s): How do you want to aggregate the dimensions?

TRANSFORMATION(s): Which functions should be applied before aggregation?

SAVE RAW DATA: Do you want to save raw events?
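To give a feel for what such a declarative policy might look like, here is a hedged sketch in Python. The JSON key names (`input`, `outputs`, `dimensions`, `rollups`, `transformations`, `saveRawData`) and the type strings are assumptions invented for this illustration, not Sparkta's actual policy schema; the small validator only checks that the document is internally consistent.

```python
import json

# Sections this sketch treats as mandatory (an assumption, not Sparkta's rules).
REQUIRED_SECTIONS = ("input", "outputs", "dimensions", "rollups")

# Hypothetical policy document: every key name below is illustrative only.
policy_json = """
{
  "name": "tweets-per-minute",
  "saveRawData": false,
  "input": {"type": "twitter"},
  "transformations": [{"type": "parse-datetime", "field": "created_at"}],
  "dimensions": [
    {"name": "hashtag", "field": "hashtag"},
    {"name": "minute", "field": "created_at", "granularity": "minute"}
  ],
  "rollups": [
    {"dimensions": ["hashtag", "minute"], "operators": ["count"]}
  ],
  "outputs": [{"type": "elasticsearch"}]
}
"""


def validate_policy(policy):
    """Return a list of problems; an empty list means the policy is consistent."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in policy]
    declared = {d["name"] for d in policy.get("dimensions", [])}
    for rollup in policy.get("rollups", []):
        for dim in rollup["dimensions"]:
            if dim not in declared:
                problems.append(f"rollup uses undeclared dimension: {dim}")
    return problems


policy = json.loads(policy_json)
print(validate_policy(policy))  # []
```

The point of the declarative form is that a new aggregation becomes a change to this document rather than new integration code.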


Sparkta: Key Technologies

ROLLUPS

• Pass-through

• Time-based: secondly, minutely, hourly, daily, monthly, yearly...

• Hierarchical

• GeoRange: areas with different sizes (rectangles)

OPERATORS

• Max, min, count, sum

• Average, median

• Stdev, variance, count distinct

• Last value

• Full-text search

Kite SDK
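The combination of a time-based rollup with a few of the operators listed above can be sketched as follows. This is a batch illustration in plain Python under assumed names (`GRANULARITY_FORMAT`, `OPERATORS`, `rollup`); it is not Sparkta code, and a streaming engine would maintain these aggregates incrementally instead of recomputing them from a list.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Truncation formats for a few of the time-based granularities.
GRANULARITY_FORMAT = {
    "minutely": "%Y-%m-%d %H:%M",
    "hourly": "%Y-%m-%d %H",
    "daily": "%Y-%m-%d",
}

# Each operator reduces the list of values in a bucket to a single result.
OPERATORS = {
    "count": len,
    "sum": sum,
    "max": max,
    "min": min,
    "avg": mean,
    "median": median,
    "last": lambda values: values[-1],
    "count_distinct": lambda values: len(set(values)),
}


def rollup(events, granularity, operators):
    """Group (timestamp, value) pairs into time buckets and apply operators."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[timestamp.strftime(GRANULARITY_FORMAT[granularity])].append(value)
    return {
        bucket: {name: OPERATORS[name](values) for name in operators}
        for bucket, values in buckets.items()
    }


events = [
    (datetime(2015, 5, 6, 10, 0, 5), 10),
    (datetime(2015, 5, 6, 10, 0, 40), 30),
    (datetime(2015, 5, 6, 10, 1, 10), 20),
    (datetime(2015, 5, 6, 11, 30, 0), 20),
]
hourly = rollup(events, "hourly", ["count", "sum", "max", "avg"])
print(hourly["2015-05-06 10"])
```

The same events can be rolled up at several granularities at once, which is what makes later range queries cheap.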


Sparkta SDK

Extension points: INPUT, OUTPUT(s), DIMENSION(s), OPERATORS, TRANSFORMATION(s)

Sparkta has been conceived as an SDK.

You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators, dimension types.

Add new functions to Apache Kite in order to extend the data cleaning, enrichment and normalization capabilities.
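One common way to expose extension points like these is a small plug-in registry. The sketch below shows the idea for operators only, in Python; the `Operator` base class, the `register` decorator and the `Range` operator are all hypothetical and are not taken from the Sparkta SDK.

```python
from abc import ABC, abstractmethod


class Operator(ABC):
    """Hypothetical extension point: reduces a list of values to one result."""

    name: str

    @abstractmethod
    def apply(self, values):
        ...


# Maps operator names to instances; filled in by the decorator below.
OPERATOR_REGISTRY = {}


def register(cls):
    """Class decorator that makes an operator available by name."""
    OPERATOR_REGISTRY[cls.name] = cls()
    return cls


@register
class Count(Operator):
    name = "count"

    def apply(self, values):
        return len(values)


@register
class Range(Operator):
    """A user-defined operator plugged in without touching the core."""

    name = "range"

    def apply(self, values):
        return max(values) - min(values)


def run(operator_name, values):
    """Look up an operator by the name a policy would reference and apply it."""
    return OPERATOR_REGISTRY[operator_name].apply(values)


print(run("count", [3, 9, 4]))  # 3
print(run("range", [3, 9, 4]))  # 6
```

Inputs, outputs and dimension types could follow the same pattern, each with its own base class and registry.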


5. NEXT STEPS


Next steps in our roadmap (I)

Sparkta is a work in progress, so we still have some nice features to develop…

QUERY SERVICES

Creating a REST services layer to query the aggregated data allows us to isolate the final consumer from the specific data storage.

Features:

- Time ranges

- Aggregation on time ranges

- Best rollup selection

ALARMS

For example, I want to know if we have earned over $3000 in London in the last hour...

Remember operational intelligence!
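The three query features and the alarm example can be sketched together. The Python below assumes a hypothetical in-memory `store` of pre-computed sums keyed by (city, bucket start) at two granularities, and it interprets "best rollup selection" as picking the coarsest granularity whose buckets align with the requested range. None of this is the planned Sparkta API, and the REST layer itself is left out.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Stored granularities and their bucket widths, coarsest first.
GRANULARITIES = [("hourly", timedelta(hours=1)), ("minutely", timedelta(minutes=1))]


def best_rollup(start, end):
    """Pick the coarsest granularity whose buckets align with [start, end)."""
    for name, width in GRANULARITIES:
        aligned_start = (start - EPOCH) % width == timedelta(0)
        aligned_end = (end - EPOCH) % width == timedelta(0)
        if aligned_start and aligned_end:
            return name, width
    raise ValueError("range is not aligned to any stored rollup")


def query_sum(store, city, start, end):
    """Aggregate pre-computed buckets over a time range."""
    name, width = best_rollup(start, end)
    total, bucket = 0, start
    while bucket < end:
        total += store[name].get((city, bucket), 0)
        bucket += width
    return name, total


def alarm(store, city, start, end, threshold):
    """True when the aggregated value over the range exceeds the threshold."""
    _, total = query_sum(store, city, start, end)
    return total > threshold


# Hypothetical pre-aggregated data: earnings per city and time bucket.
store = {
    "hourly": {("London", datetime(2015, 5, 6, 10)): 3200},
    "minutely": {
        ("London", datetime(2015, 5, 6, 10, 0)): 50,
        ("London", datetime(2015, 5, 6, 10, 1)): 70,
    },
}

ten, eleven = datetime(2015, 5, 6, 10), datetime(2015, 5, 6, 11)
print(query_sum(store, "London", ten, eleven))    # ('hourly', 3200)
print(alarm(store, "London", ten, eleven, 3000))  # True
```

Reading one hourly bucket instead of sixty minutely ones is the whole benefit of selecting the best rollup for a range.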


Next steps in our roadmap (II)

WEB APPLICATION

How about a nice web interface to create and manage policies? Forget the JSON file and use your mouse to define the workflow :)

DEPLOYING & MONITORING

We have been working with Spark JobServer & YARN, but it would be nice to support Mesos, for example.

MORE AWESOMENESS

Hey, did you miss something? Do you have a great idea? Let us know!


6. OPEN SOURCE & COMMUNITY

OPEN TO YOUR IDEAS

www.stratio.com

@StratioBD

https://github.com/stratio/sparkta

SPARKTA is fully open source under the Apache 2 License.

We are open to contributors & ideas


7. DEMO TIME

Do you want to try SPARKTA?

Use a full-featured sandbox to start trying SPARKTA. Just open a shell and type:

vagrant init "stratio/sparkta"

vagrant up


Do you want to try SPARKTA?

Getting some real-time stats from #StrataHadoop

Our real-time policy defines some rollups in order to know chatty users, hot hashtags, and heatmaps from StrataConf tweets.

We are using the standard Twitter input from Spark Streaming, ElasticSearch output & Kibana to display results.
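As a toy version of the "chatty users" and "hot hashtags" rollups, the Python sketch below counts tweets per user and occurrences per hashtag over a hand-written list. The tweet dictionaries and the function name are assumptions for this example; it uses neither the Twitter input from Spark Streaming nor ElasticSearch, and it leaves out the heatmap.

```python
import re
from collections import Counter

# Matches a hashtag and captures the tag without the leading '#'.
HASHTAG = re.compile(r"#(\w+)")


def tweet_rollups(tweets):
    """Count tweets per user and occurrences per hashtag (case-insensitive)."""
    users, hashtags = Counter(), Counter()
    for tweet in tweets:
        users[tweet["user"]] += 1
        hashtags.update(tag.lower() for tag in HASHTAG.findall(tweet["text"]))
    return users, hashtags


# Made-up sample tweets, for illustration only.
tweets = [
    {"user": "ana", "text": "Great talk at #StrataHadoop about #Spark"},
    {"user": "bob", "text": "#stratahadoop demo time"},
    {"user": "ana", "text": "Real-time rollups with #spark streaming"},
    {"user": "ana", "text": "Bye #StrataHadoop"},
]
users, hashtags = tweet_rollups(tweets)
print(users.most_common(1))     # [('ana', 3)]
print(hashtags.most_common(2))  # [('stratahadoop', 3), ('spark', 2)]
```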


BIG DATA CHILD'S PLAY