[Strata] Sparkta
[Diagram: Stratio platform overview]
• Modules: Stratio Ingestion, Stratio Streaming, Stratio Quantum, Stratio Deep, Stratio Crossdata, Stratio Datavis
• Data sources: CRM, ERP, call center, BI, internal data, external data
• Customer lake: HDFS, S3, ElasticSearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases
• Access: ODBC, JDBC, REST API
• BI, ad hoc, app
[Diagram: what each Stratio module does]
• Stratio Ingestion: ingests, transforms
• Stratio Streaming: analyzes and processes real-time streaming
• Stratio Crossdata: a unified SQL interface
• Stratio Quantum: machine learning and algorithms
• Stratio Deep: processes and combines with Spark
• Stratio Datavis: creates and designs dashboards and reports
• Data sources: CRM, ERP, call center, BI, internal data, external data
• Customer lake: HDFS, S3, ElasticSearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases
• Access: ODBC, JDBC, REST API
• BI, ad hoc, app
[Diagram: Stratio modules and the technologies shown alongside them]
• Stratio Ingestion: ingests, transforms
• Stratio Streaming: analyzes and processes
• Stratio Crossdata: a unified SQL interface
• Stratio Quantum: machine learning and algorithms
• Stratio Deep: processes and combines with Spark; combines with Spark data from any source
• Stratio Datavis: creates and designs dashboards and reports
• Technologies shown: Streaming, Apache Kite, Apache Flume, MLlib
• Data sources, customer lake and access layer: unchanged from the previous diagram
[Diagram: data combination through time]
• Real-time: ephemeral tables
• Past: stored tables
• Future: Quantum tables
• Stratio Crossdata: consult & analyze, SQL interface
• Technologies shown: Streaming, Apache Kite, Apache Flume, MLlib
• Other modules, data sources, customer lake and access layer: unchanged from the previous diagram
[Diagram: INFORMATIONAL + OPERATIONAL, WITHOUT NEED TO REPLICATE DATA]
• Stratio Crossdata: consult & analyze, SQL interface
• Informational (customer lake): HDFS, S3, ElasticSearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases
• Operational: Oracle, DB2, other databases, MongoDB, Teradata
• Other modules, data sources and access layer: unchanged from the previous diagram
The time is NOW
We all know this story already
Social media and networking sites are a part of the fabric of everyday life, changing the way the world shares and accesses information.
The overwhelming amount of information gathered not only from messages, updates and images but also readings from sensors, GPS signals and many other sources was the origin of a (big) technological revolution.
Remember? VOLUME, VARIETY & VELOCITY
CONFERENCE10
Look at these sexy infographics!
We all love data visualization
Insights from this vast amount of data allow us to learn from the users and explore our own world.
We can follow in real-time the evolution of a topic, an event or even an incident just by exploring aggregated data.
Delivering real-time business on the Internet
But beyond cool visualizations, there are some core services delivered in real-time, using aggregated data to answer common questions in the fastest way.
These services are the heart of the
business behind their nice logos.
Site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards…
Aggregated data feeds directly to end
users, publishers, and advertisers, among
others.
Pushing business processes to perform faster
Digital companies, born to develop their services in real-time, have changed the expectations of many other businesses.
Real-time information makes it possible for a company to be much more agile than its competitors, improving business answers, gaining insights on their performance…
Listen to your data…
[Diagram: data you can listen to]
• Client, TPV (point-of-sale terminal), ATM, online gateway
• Accounts, loans and credits, insurances, broker, mortgages, cards, deposits
• Application logs, social networks, transactions, geolocation, CRM
Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a process.
And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...
…and start delivering real-time services
Real-time monitoring could be really nice, but your company needs to work in the same way as digital companies:
• Rethinking existing processes to deliver them faster, better.
• Creating new opportunities for competitive advantages.
Real-time fraud monitoring
[Diagram: DATA RECEIVER, REAL-TIME AGGREGATION, CONSOLIDATION, FRAUD DETECTION, Dashboarding, Reporting]
Leveraging the power of Spark Streaming, we have developed some fraud detection
solutions, aggregating data in real-time to work better with machine learning
algorithms.
Extract, Transform and Aggregate
By combining Apache Flume and Spark Streaming we have deployed complex
topologies to deal with data coming from heterogeneous sources.
The full solution allows us to transform and aggregate data on-the-fly
(data cleaning, normalization and enrichment)
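The cleaning, normalization and enrichment steps can be pictured with a small, plain-Python sketch. It is not the Flume and Spark Streaming topology from the talk: the event fields (`city`, `amount`), the `COUNTRY_BY_CITY` lookup table and every function name below are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical lookup table used for enrichment (not from the talk).
COUNTRY_BY_CITY = {"london": "UK", "madrid": "ES"}


def clean(event):
    """Data cleaning: drop events that lack the fields we aggregate on."""
    if not event.get("city") or event.get("amount") is None:
        return None
    return event


def normalize(event):
    """Normalization: trimmed lowercase city, amount as a float."""
    return {
        **event,
        "city": event["city"].strip().lower(),
        "amount": float(event["amount"]),
    }


def enrich(event):
    """Enrichment: attach a country derived from the city."""
    return {**event, "country": COUNTRY_BY_CITY.get(event["city"], "unknown")}


def transform(events):
    """Apply the three steps to each event as it arrives."""
    for event in events:
        event = clean(event)
        if event is not None:
            yield enrich(normalize(event))


def total_by_country(events):
    """Aggregate the transformed events: sum of amounts per country."""
    totals = Counter()
    for event in transform(events):
        totals[event["country"]] += event["amount"]
    return dict(totals)


raw = [
    {"city": " London ", "amount": "10.5"},
    {"city": "MADRID", "amount": 4},
    {"city": "", "amount": 7},          # dropped: empty city
    {"city": "london", "amount": None},  # dropped: no amount
    {"city": "Paris", "amount": 1},      # kept, but country is unknown
]
totals = total_by_country(raw)
```

In a streaming deployment the same three functions would run per event or per micro-batch; the aggregation step would then update a running total instead of building one dictionary at the end.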
[Diagram: REAL-TIME AGGREGATION, Dashboarding, Reporting]
Custom data sources and storage
Each project requires specific inputs and data stores, dealing with different kinds of events.
From click stream activity to bank transactions...
[Diagram: DATA STREAM, LOADING, TRANSFORM, CUSTOM LOGS]
Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytic projects based
on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.
These technologies were always a perfect fit, but soon we found ourselves
writing the same pieces of integration code over and over again.
This is how SPARKTA was born.
#1 RainBird from Twitter
Some folks from Twitter shared some thoughts
about their real-time needs at Strata (2011).
They worked on a “generic” platform in order to
deal with pre-calculated data from a huge number
of events.
It allows them to deal with:
• Data Structures
• Hierarchical Aggregation
• Temporal Aggregation
• Multiple Formulas
CURRENT STATE: Still not open source
http://goo.gl/ykvQa
#2 Countandra
Countandra is a hierarchical distributed counting
engine exploiting the excellent write and read
performance of Cassandra.
It supports:
• Geographically distributed counting.
• Easy HTTP-based interface to insert counts.
• Hierarchical counting such as com.mywebsite.music.
• Retrieval of counts, sums and squares in near real-time.
• Simple HTTP queries that return the desired output in JSON format.
• Queries sliced by period, such as last hour or last year, for minutely, hourly, daily and monthly values.
https://github.com/milindparikh/Countandra
CURRENT STATE: Rather deprecated
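Hierarchical counting of a key such as com.mywebsite.music can be read as "one insert increments every prefix of the key". The sketch below shows that idea in memory only. It is not Countandra's code or HTTP API, and the `HierarchicalCounter` class and `prefixes` helper are hypothetical names.

```python
from collections import defaultdict


def prefixes(key):
    """'com.mywebsite.music' -> ['com', 'com.mywebsite', 'com.mywebsite.music']"""
    parts = key.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


class HierarchicalCounter:
    """In-memory stand-in for a hierarchical counting store."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, key, n=1):
        # A single insert updates the key and all of its ancestors.
        for prefix in prefixes(key):
            self.counts[prefix] += n

    def get(self, key):
        # .get avoids creating an entry for keys that were never counted.
        return self.counts.get(key, 0)


counter = HierarchicalCounter()
counter.add("com.mywebsite.music")
counter.add("com.mywebsite.music", 2)
counter.add("com.mywebsite.video")
```

With these three inserts, the count for com.mywebsite.music is 3 and the count for com.mywebsite is 4, because the parent accumulates every child.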
#3 ThunderRain from Intel
ThunderRain is a Real-Time Analytical Processing
(RTAP) example using Spark and Shark, which
can be best characterized by the following four
salient properties:
• Data continuously streamed in & processed in near real-time
• Real-time data queried and presented in an online fashion
• Real-time and history data combined and mined interactively
• Predominant RAM-based processing
https://github.com/thunderain-project/thunderain
CURRENT STATE: Rather deprecated
#4 TSAR from Twitter
TSAR (the TimeSeries AggregatoR) is a
flexible, reusable, end-to-end service
architecture on top of Summingbird.
Twitter really needs a truly robust real-time
aggregation service considering their
scaling and evolving needs.
They realized that many time-series
applications call for essentially the same
architecture, with only slight variations in
the data model.
https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
CURRENT STATE: Still not open source
Towards a generic real-time aggregation platform
Some initiatives have tried to solve this problem, but until now most of them
were complex or obsolete while others were not open source.
For this reason, Stratio created SPARKTA: an open source and full-featured
platform for real-time analytics, based on Apache Spark.
This is why SPARKTA was conceived
Distributed, high-volume & pluggable analytics framework
Our goals:
Since Aryabhatta invented zero, mathematicians such as John von Neumann have
been in pursuit of efficient counting, and architects have constantly built systems that
compute counts quicker. In this age of social media, where hundreds of thousands of events
take place every second, we designed an aggregation engine to deliver real-time services.
• Pure Spark!
• No coding needed, only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault tolerant
Sparkta: A first look
AGGREGATION POLICY
The aggregation policy definition is sent to the engine.
DRIVER - SUPERVISOR
Allows multiple applications to be defined, each of which is bound to a context, executing the aggregation workflow.
[Diagram labels: AGGREGATION WORKFLOW, QUERY SERVICES, others]
Sparkta: Deploy any number of real-time aggregation policies
DRIVER - SUPERVISOR
You can start several workflows at any time, and also stop or monitor them.
Sparkta: Key Technologies
INPUTS: RabbitMQ, ZeroMQ, Flume, Kafka, ...
PROCESSING: ... + Apache Kite SDK
OUTPUTS: ...
Sparkta: Define your real-time needs
AGGREGATION POLICY
Remember: no need to code anything.
Define your workflow in a JSON document, including:
• INPUT: Where is the data coming from?
• OUTPUT(s): Where should aggregated data be stored?
• DIMENSION(s): Which fields will you need for your real-time needs?
• ROLLUP(s): How do you want to aggregate the dimensions?
• TRANSFORMATION(s): Which functions should be applied before aggregation?
• SAVE RAW DATA: Do you want to save raw events?
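To make the six questions concrete, here is a sketch that builds such a document in Python and serializes it to JSON. The key names, the nested structure and the `validate` helper are all assumptions invented for this illustration. They are not Sparkta's actual policy schema, which should be taken from the project repository.

```python
import json

# Hypothetical policy: every key and value below is illustrative only.
policy = {
    "name": "tweets-by-hashtag",
    "input": {"type": "kafka", "topic": "tweets"},
    "outputs": [{"type": "mongodb"}],
    "dimensions": ["hashtag", "timestamp"],
    "rollups": [
        {"dimensions": ["hashtag", "timestamp"], "granularity": "minute"}
    ],
    "transformations": [{"type": "lowercase", "field": "hashtag"}],
    "saveRawData": False,
}

# One entry per question on the slide (again, assumed key names).
REQUIRED_KEYS = (
    "input",
    "outputs",
    "dimensions",
    "rollups",
    "transformations",
    "saveRawData",
)


def validate(candidate):
    """Return the list of required sections missing from a policy."""
    return [key for key in REQUIRED_KEYS if key not in candidate]


document = json.dumps(policy, indent=2)
```

The point of the declarative form is that the whole workflow is data: it can be validated, stored and sent to the engine without compiling anything.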
Sparkta: Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly...
• Hierarchical
• GeoRange: areas with different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
KiteSDK
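A time-based rollup combined with a few of the operators above can be sketched as "group events by (dimension value, time bucket), then apply each operator to the group". This is a batch, in-memory illustration rather than Sparkta's implementation. The event layout (`ts` as Unix seconds, `value`) and the `rollup` function are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, median

# Truncating a timestamp to a bucket is done by formatting it at that precision.
FORMATS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
}

OPERATORS = {
    "count": len,
    "sum": sum,
    "max": max,
    "min": min,
    "avg": mean,
    "median": median,
}


def bucket(ts, granularity):
    """Map a Unix timestamp (seconds, UTC) to its time bucket label."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    return moment.strftime(FORMATS[granularity])


def rollup(events, dimension, granularity, operators):
    """Group values by (dimension value, time bucket) and apply the operators."""
    groups = defaultdict(list)
    for event in events:
        key = (event[dimension], bucket(event["ts"], granularity))
        groups[key].append(event["value"])
    return {
        key: {name: OPERATORS[name](values) for name in operators}
        for key, values in groups.items()
    }


events = [
    {"city": "london", "ts": 0, "value": 10},
    {"city": "london", "ts": 30, "value": 20},
    {"city": "london", "ts": 90, "value": 5},
    {"city": "madrid", "ts": 3600, "value": 7},
]
hourly = rollup(events, "city", "hour", ["count", "sum", "max"])
minutely = rollup(events, "city", "minute", ["avg"])
```

Operators such as count, sum, max and min can be updated incrementally as events stream in, whereas median and count distinct need either the raw values or an approximate data structure.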
Sparkta SDK
INPUT
OUTPUT(s)
DIMENSION(s)
OPERATORS
TRANSFORMATION(s)
Sparkta has been conceived as an SDK.
You can extend several points of the platform to
fulfill your needs, such as adding new inputs,
outputs, operators, dimension types.
Add new functions to Apache Kite in order to
extend the data cleaning, enrichment and
normalization capabilities.
Next steps in our roadmap (I)
Sparkta is a work in progress, so we still have some nice features to
develop…
QUERY
SERVICES
ALARMS
Creating a REST services layer in order to query the
aggregated data allows us to isolate the final consumer
from the specific data storage
Features
- Time ranges
- Aggregation on time ranges
- Best rollup selection
For example, I want to know if we have earned over $3000 in
London in the last hour...
Remember operational intelligence!
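Since this layer is a roadmap item, the following is only a guess at how a time-range query with best rollup selection might behave. It assumes pre-aggregated sums stored per minute and per hour, keyed by (city, bucket start in Unix seconds), and picks the coarsest rollup whose buckets line up with the requested range. All names and the storage layout are hypothetical.

```python
# Bucket sizes in seconds, listed coarsest first (dict order is preserved).
GRANULARITIES = {"hour": 3600, "minute": 60}


def best_rollup(start, end):
    """Pick the coarsest rollup whose buckets align with [start, end)."""
    for name, size in GRANULARITIES.items():
        if start % size == 0 and end % size == 0:
            return name
    raise ValueError("range is not aligned to any stored rollup")


def query_sum(store, city, start, end):
    """Sum the pre-aggregated buckets covering [start, end) for one city."""
    name = best_rollup(start, end)
    size = GRANULARITIES[name]
    table = store[name]
    return sum(table.get((city, t), 0) for t in range(start, end, size))


def alarm(store, city, start, end, threshold):
    """True when the aggregated amount in the range exceeds the threshold."""
    return query_sum(store, city, start, end) > threshold


# Hypothetical pre-aggregated earnings, keyed by (city, bucket start).
store = {
    "hour": {("london", 0): 3200, ("london", 3600): 500},
    "minute": {
        ("london", 0): 1200,
        ("london", 60): 2000,
        ("london", 3600): 500,
    },
}

earned_over_3000 = alarm(store, "london", 0, 3600, threshold=3000)
```

A whole-hour query reads a single hourly bucket instead of sixty minute buckets, which is the benefit of selecting the best rollup; the consumer never needs to know which table answered the question.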
Next steps in our roadmap (II)
WEB APPLICATION
How about a nice web interface to create and manage policies?
Forget the JSON file and use your mouse to define the workflow :)
DEPLOYING & MONITORING
We have been working with Spark JobServer and YARN, but it would be
nice to support Mesos, for example.
MORE AWESOMENESS
Hey, did you miss something? Do you have a great idea?
Let us know!
OPEN TO YOUR IDEAS
www.stratio.com
@StratioBD
https://github.com/stratio/sparkta
SPARKTA is fully open source
Apache 2 License.
We are open to contributors & ideas
Do you want to try SPARKTA?
Use a full-featured sandbox to start trying SPARKTA
Just open a shell and type:
vagrant init "stratio/sparkta"
vagrant up
Do you want to try SPARKTA?
Getting some real-time stats from
#StrataHadoop
Our real-time policy defines some
rollups in order to know chatty users, hot
hashtags, and heatmaps from
StrataConf tweets.
We are using the standard Twitter input
from Spark Streaming, ElasticSearch
output & Kibana to display results