Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Spark Use Case at
Telefónica CyberSecurity (CBS)
Antonio Alcocer
Oscar @omendezsoto
#CassandraSummit 2014
• Stratio is a Big Data Company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
STRATIO: Who are we?
General info
o 1924-2014: 317+ million customers and 130,000+ employees
o 2nd European operator by revenues
o 4th global integrated operator by accesses
o 9th Telco in the Global ranking by market capitalization
o 2nd global operator for investment in R&D
Present in 24 countries
Their main brands
Other brands
Telefónica Global Solutions
Global Security Services
A global infrastructure to
safeguard your business_
Managed Security
CyberSecurity??
Why????
A picture is worth a thousand words - but a film clip, a million!
Don’t worry…
What is Cybersecurity?
“Cybersecurity is the collection of tools, policies… capabilities to protect
the cyber environment and organization and user’s assets. Cybersecurity
strives to prevent unauthorized access to, or manipulation of the integrity,
confidentiality, or availability of, information, and unauthorized
exfiltration of information.”
What does it mean for us?
No rules, just guidelines.
An example of threats
Cassandra OpsCenter
World map
WordPress
C* OpsCenter + Shodan
Other threats
CyberSecurity in numbers
Numbers
• DDoS (23%)
• SQLi (19%)
• Defacement (14%)
• Account Hijacking (9%)
• Unknown (18%)
Threats
Looking for unknown threats
What did Telefónica need?
Joining efforts
Required skills
Use Case Architecture
We have three phases:
• Ingestion: based on Apache Kafka.
• Data fusion: based on Apache Storm.
• Batch & Analytics: based on Cassandra and Spark.
Data Acquisition
• Data comes from several sources:
  • DNS traffic
  • IP
  • Social media
  • Underground sources
  • Government sources
  • …
• Several source consumers pull the data and push it into a Kafka cluster.
• Sources are heterogeneous and their speed is variable.
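The pull-and-push pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `InMemoryTopic` is a hypothetical stand-in for a Kafka topic (a real deployment would use a Kafka producer), and the source names, record shapes, and `run_source_consumer` helper are assumptions made for the example.

```python
import json
import queue


class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic; not a Kafka client."""

    def __init__(self):
        self._messages = queue.Queue()

    def send(self, payload):
        self._messages.put(payload)

    def drain(self):
        """Return every message currently buffered, in arrival order."""
        out = []
        while not self._messages.empty():
            out.append(self._messages.get())
        return out


def run_source_consumer(source_name, pull, topic):
    """Pull all available records from one source and push them to the topic.

    Each message is tagged with its source so a later stage can
    normalize it per source. Returns the number of records pushed.
    """
    pushed = 0
    for record in pull():
        message = {"source": source_name, "raw": record}
        topic.send(json.dumps(message).encode("utf-8"))
        pushed += 1
    return pushed


# Hypothetical sources with different record shapes and volumes.
def pull_dns():
    return ["a.example A 192.0.2.1", "b.example A 192.0.2.2"]


def pull_social():
    return [{"user": "someone", "text": "suspicious link"}]


topic = InMemoryTopic()
counts = {
    "dns": run_source_consumer("dns", pull_dns, topic),
    "social": run_source_consumer("social", pull_social, topic),
}
messages = [json.loads(m) for m in topic.drain()]
```

Tagging each message with its source keeps the consumers simple: they do not need to agree on a schema, which suits heterogeneous sources that arrive at different speeds.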
Data sources
[Diagram labels: Sources (×8), KAFKA, API]
Data fusion
• We use Storm to process and normalize the information.
• The system must fire alerts to the analysts.
• This use case required a Big Data component capable of processing the data and extracting information from it in real time.
• Warnings and alerts are time-sensitive: they must reach analysts quickly to deal with security attacks efficiently.
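As a rough illustration of the normalize-then-alert step, here is a plain Python sketch. It does not use the Storm API; the `Event` schema, the raw message layouts, and the watchlist rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Common schema that every source is normalized into (assumed fields)."""
    source: str
    ip: Optional[str]
    domain: Optional[str]
    timestamp: int  # seconds since the epoch


def normalize(message):
    """Turn a source-specific message into an Event."""
    source = message["source"]
    raw = message["raw"]
    if source == "dns":
        # Assumed raw layout: "<domain> A <ip>"
        domain, _record_type, ip = raw["line"].split()
        return Event(source, ip, domain.lower(), raw["ts"])
    if source == "ip":
        return Event(source, raw["address"], None, raw["ts"])
    raise ValueError(f"unknown source: {source}")


def check_alert(event, watchlist):
    """Return an alert string when the event touches a watched domain, else None."""
    if event.domain in watchlist:
        return (
            f"ALERT: watched domain {event.domain} "
            f"seen at {event.timestamp} (ip {event.ip})"
        )
    return None


watchlist = {"bad.example"}
dns_message = {"source": "dns", "raw": {"line": "Bad.Example A 192.0.2.10", "ts": 1400000000}}
ip_message = {"source": "ip", "raw": {"address": "192.0.2.11", "ts": 1400000005}}

event = normalize(dns_message)
alert = check_alert(event, watchlist)
no_alert = check_alert(normalize(ip_message), watchlist)
```

Normalizing first means the alert rule is written once against the common schema instead of once per source.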
Batch
• The data are saved in Cassandra.
• We use Cassandra directly for the easy queries.
• We use Spark to extract the information that Cassandra cannot serve directly.
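The split between "easy" queries and the rest can be pictured with a small plain Python sketch. It uses neither the Cassandra driver nor Spark; the `table` dictionary is a hypothetical stand-in for a table partitioned by IP, and both function names are made up for the example.

```python
from collections import defaultdict

# Stand-in for a table partitioned by IP: partition key -> rows.
table = {
    "192.0.2.1": [{"domain": "a.example", "ts": 100}, {"domain": "b.example", "ts": 160}],
    "192.0.2.2": [{"domain": "a.example", "ts": 120}],
    "192.0.2.3": [{"domain": "a.example", "ts": 130}],
}


def lookup_by_ip(ip):
    """Easy query: a single-partition read by key."""
    return table.get(ip, [])


def distinct_ips_per_domain():
    """Harder query: needs every partition, so it is a full scan plus a group-by.

    This is the kind of work the talk hands to a batch engine.
    """
    ips_by_domain = defaultdict(set)
    for ip, rows in table.items():
        for row in rows:
            ips_by_domain[row["domain"]].add(ip)
    return {domain: len(ips) for domain, ips in ips_by_domain.items()}


single_partition = lookup_by_ip("192.0.2.2")
cross_partition = distinct_ips_per_domain()
```

The first function touches one key; the second has to visit all of them, which is why it fits a distributed batch job better than a direct key lookup.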
[Diagram labels: Data process, INTEGRATION (×3)]
Why did we use C*?
Because we needed its features:
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
• And we needed a data model:
• Storm normalizes the data by source.
• The primary key combines the source key (e.g. IP) with a timestamp, plus a timesplit column as the clustering key.
• Every data row has view tables that hold the relationships between entities: IP, DNS, Domain…
IP main table
Table name: IP
Columns: IP | timestamp | timesplit | … | Domain | …
Primary Key: ((IP, timestamp), timesplit)
IP view for domain
Table name: IP_Domain
Columns: Domain | timestamp | timesplit | IP1 | … | IPn
Primary Key: ((Domain, timestamp), timesplit)
Domain main table
Table name: domain
Columns: Domain | timestamp | timesplit | … | IP | …
Primary Key: ((domain, timestamp), timesplit)
Domain view for IP
Table name: Domain_IP
Columns: IP | timestamp | timesplit | domain1 | … | domainn
Primary Key: ((IP, timestamp), timesplit)
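One way to read the table layouts above is as CQL `CREATE TABLE` statements with a composite partition key and one clustering column. The sketch below only builds the statement text in Python and has not been run against a cluster. The column types, the use of `list<text>` for the IP1…IPn and domain1…domainn columns, and the `create_table_cql` helper are all assumptions; the slides do not give them.

```python
def create_table_cql(table, key_column, extra_columns):
    """Build a CREATE TABLE statement keyed as ((key_column, timestamp), timesplit)."""
    columns = [(key_column, "text"), ("timestamp", "timestamp"), ("timesplit", "int")]
    columns += extra_columns
    column_lines = "".join(f"  {name} {cql_type},\n" for name, cql_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}"
        f"  PRIMARY KEY (({key_column}, timestamp), timesplit)\n"
        f");"
    )


# Main table for IPs and its view keyed by domain (types are assumptions).
ip_table = create_table_cql("ip", "ip", [("domain", "text")])
ip_domain_view = create_table_cql("ip_domain", "domain", [("ips", "list<text>")])

# Mirror pair: main table for domains and its view keyed by IP.
domain_table = create_table_cql("domain", "domain", [("ip", "text")])
domain_ip_view = create_table_cql("domain_ip", "ip", [("domains", "list<text>")])

print(ip_table)
```

Keeping a second table keyed by the other entity lets each lookup direction be a single-partition read, at the cost of writing every relationship twice.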
What have we learned?
THE BEST OF BOTH WORLDS COMBINED
“Two plus two is four? Sometimes… Sometimes it is five.”
G. Orwell
Combination wins
RISK
Combination = add more and more products to the Stack
Complexity
Hybrid Hadoop + Spark platforms
Hybrid = complexity
RDD-Based Matrices
Interactive
Batch processing
Stream processing
Batch | Interactive [SQL] | Streaming | Machine Learning
Learn just one system
Develop within one framework
Deploy/Manage just one system
Databricks co-founder & CTO Matei Zaharia (source)
Why Spark
1. One stack to rule them all
Be rational not only emotional
The only Pure Spark processing
No Hadoop elements
10+ year old constraints
Lean simplicity
Pure Spark Platform
Former Hadoop or Hybrid Hadoop-Spark Platforms
Lean = Easier deployment, management, and use of the system
STRATIO ADMIN
STRATIO DATAVIS
STRATIO INGESTION
STRATIO CROSSDATA (SPARK)
CASSANDRA | MONGO DB | ELASTICSEARCH | HDFS
STRATIO STREAMING (SPARK STREAMING, SIDDHI)
A real project for a big company, as opposed to a POC, is very demanding
SPARK CERTIFIED
Multiple Combination
https://github.com/Stratio/stratio-meta
API
Elastic S
Full text search + queries
[Diagram labels: C* node (×5), Lucene index (×5)]
SELECT * FROM logs WHERE description MATCH '*Exception';
Stratio Streaming
• Start using Spark Streaming for some Complex Event Processing operations.
https://github.com/Stratio/stratio-streaming
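To give a flavor of what a Complex Event Processing rule looks like, here is a minimal sliding-window sketch in plain Python. It is not the Stratio Streaming, Spark Streaming, or Siddhi API. The `SlidingWindowRule` class, the 60-second window, and the threshold of 3 are assumptions, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowRule:
    """Fire when one key is seen `threshold` times within `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._seen = defaultdict(deque)  # key -> timestamps still inside the window

    def observe(self, key, timestamp):
        """Record one event; return True if the key now meets the threshold."""
        times = self._seen[key]
        times.append(timestamp)
        # Drop events that fell out of the window ending at this event.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        return len(times) >= self.threshold


rule = SlidingWindowRule(window_seconds=60, threshold=3)
events = [
    ("192.0.2.7", 0),
    ("192.0.2.7", 10),
    ("192.0.2.9", 15),
    ("192.0.2.7", 20),   # third hit from the same key inside 60 s
    ("192.0.2.7", 100),  # earlier hits have expired by now
]
fired = [rule.observe(key, ts) for key, ts in events]
```

A streaming engine would evaluate the same kind of rule continuously over the incoming flow and distribute the per-key state across workers.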
DATA JOURNEY THROUGH TIME
PAST
PRESENT
FUTURE
Stored data
Real Time Data
Streaming
ML Algorithms
Ephemeral Tables
Stored Tables
SQL combination: Done
SQL combination: In progress
Quantum Tables
Thanks in advance