Cassandra Summit 2014: Apache Cassandra at Telefonica CBS


Presenter: Antonio Alcocer, Big Data Architect at Stratio

Telefonica is the incumbent telecommunications network operator in Spain and the fourth largest in the world by capitalisation. Cyber security is one of our most successful businesses worldwide. We monitor our clients and protect them from attacks. We analyze millions of records from multiple sources, including social media, DNS records, and the underground internet, to generate alerts and security reports for our clients. This use case required a Big Data component capable of processing the data and extracting its information in real time, because warnings and alerts are time-sensitive when dealing with security attacks. Our original architecture was the typical one used for data fusion systems: several collectors, a processing layer based on legacy systems, and a data store. The initial setup included a MongoDB database and an ad-hoc application. This solution, however, proved unfit for the specific purpose of dispatching alerts. We proposed using Cassandra and Spark instead, and this approach met our original specifications. Our talk explains why we migrated the architecture and how the adopted solution based on Spark and Cassandra solved our problem.

Transcript of Cassandra Summit 2014: Apache Cassandra at Telefonica CBS

  • 1. Spark Use Case at Telefónica CyberSecurity (CBS). Antonio Alcocer, antonio@stratio.com. Oscar 2014

  • 2. Who are we? STRATIO. Stratio is a Big Data company; founded in 2013; commercially launched in 2014; 50+ employees in Madrid; office in San Francisco; certified Spark distribution. #CassandraSummit 2014
  • 3.
  • 4. General info. 1924-2014: 317+ customer with 130.000+ employees; 2nd European operator by revenues; 4th global integrated operator by accesses; 9th telco in the global ranking by market capitalization; 2nd global operator for investment in R+D.
  • 5. Present in 24 countries
  • 6. Their main brands
  • 7. Other brands
  • 8. Telefónica Global Solutions. Global Security Services: a global infrastructure to safeguard your business.
  • 9. Managed Security
  • 10. CyberSecurity??
  • 11. Why???? A picture is worth a thousand words - but a film clip, a million!
  • 12. Don't worry
  • 13. What is Cybersecurity? What does it mean for us? Cybersecurity is the collection of tools, policies, and capabilities used to protect the cyber environment and the assets of the organization and its users. Cybersecurity strives to prevent unauthorized access to information, manipulation of its integrity, confidentiality, or availability, and unauthorized exfiltration of information. No rules, just guidelines.
  • 14. An example of threats: Cassandra OpsCenter, world map, Wordpress.
  • 15. C* OpsCenter + Shodan
  • 16. C* OpsCenter + Shodan
  • 17. Other threats
  • 18. CyberSecurity in numbers
  • 19. Numbers. Threats: DDoS (23%), SQLi (19%), Defacement (14%), Account Hijacking (9%), Unknown (18%).
  • 20. Looking for unknown threats
  • 21. What did Telefonica need?
  • 22. Telefonica first concept test. Based on MongoDB. Problem: the data grew faster than the server. They have their own collectors in Python to connect to MongoDB.
    It could not answer all analyst queries in batch, and it had no real-time processing. (Diagram labels: Collector ×4, Mongo.)
  • 23. Joining efforts
  • 24. Required skills
  • 25. in / Using
  • 26. Use Case Architecture. We have three phases: Ingestion, based on Apache Kafka; Data fusion, based on Apache Storm; Batch & Analytics, based on Cassandra and Spark.
  • 27. Data Acquisition. Data are in several sources: DNS traffic, IP, social media, underground sources, government sources. Several source consumers pull the info and push it into a Kafka cluster. Sources are heterogeneous and their speed is variable. (Diagram labels: Data sources, Sources, KAFKA, API.)
  • 28. Data fusion. We use Storm to process and normalize the information. The system must fire alerts to the analysts. This use case required a Big Data component capable of processing the data and extracting its information in real time. Warnings and alerts are time-sensitive in order to deal efficiently with security attacks.
  • 29. Batch. The data are saved in Cassandra. We use Cassandra directly for the easy queries, and we use Spark to extract the information not accessible to Cassandra directly. (Diagram labels: Data process, INTEGRATION ×3.)
  • 30. Why did we use C*? Because we need its features: P2P architecture, read/write performance, fault tolerance, easy to deploy, CQL.
  • 31. Why did we use C*? And we needed a data model. The data in Storm is normalized by source. The primary key is the source key (e.g. IP) plus a timestamp, with a time split as the clustering key. All data rows have view tables with the relationships between entities: IP, DNS, Domain.
      Table name: IP. Columns: IP, timestamp, timesplit, Domain. Primary Key ((IP, timestamp), timesplit)
      Table name: IP_Domain. Columns: Domain, timestamp, timesplit, IP1 to IPn. Primary Key ((Domain, timestamp), timesplit)
  • 32. Why did we use C*?
      IP main table. Table name: IP. Columns: IP, timestamp, timesplit, Domain. Primary Key ((IP, timestamp), timesplit)
      IP view for domain. Table name: IP_Domain. Columns: Domain, timestamp, timesplit, IP1 to IPn. Primary Key ((Domain, timestamp), timesplit)
      Domain main table. Table name: domain. Columns: Domain, timestamp, timesplit, IP. Primary Key ((domain, timestamp), timesplit)
      IP view for domain. Table name: Domain_IP. Columns: IP, timestamp, timesplit, domain1 to domainn. Primary Key ((IP, timestamp), timesplit)
  • 33. What have we learned?
  • 34. Spark + C*. Spark-Cassandra enables the implementation of any use case or application with any number of users. Applications or dashboards with many users and much data, using a predefined set of queries, are perfectly solved with Cassandra, using very few cluster resources. BI applications or tools with few users (BI analysts or similar) executing open queries are perfectly solved with Spark over Cassandra, using the remaining power of the cluster.
  • 35. Spark processing + NoSQL DB
  • 36. Crossdata. Unifies batch and stream processing using a common language; accesses different data store technologies; combines results among data stores in a single query; interactive console, a Java/Scala API, a REST API, and an ODBC interface; easy-to-use SQL-like language; distributed, scalable, fault-tolerant, P2P and extensible; abstracts users from the underlying data store technologies, so users do not need to be aware of the deployment details to store and query data.
  • 37. Pure Spark processing. CROSSDATA SPARK: Stratio is able to combine, in one query, stored data with streaming data entering the system. Polyglots: Spark integrated with the main NoSQL databases, starting with Cassandra & MongoDB. SPARK FOR ALL: BATCH, INTERACTIVE AND STREAMING.
  • 38. System administration. Admin Application: full deployment on-premise and cloud; platform management and monitoring; system dashboards and reporting; system alerts; support (8*5 or 24*7).
  • 39. System admin. The Admin is the easiest way to install, manage, and monitor Stratio's services and the software necessary to run them. System alerts; supports multiple big data databases; installs only what you need; simple and easy to operate; unified management tool; installs nodes in only 3 steps; simplifies the scalability process.
  • 40. Full text search queries. Telefonica requested the ability to perform full-text searches. We have developed an integration between Lucene and Cassandra. With Crossdata, we simplify the creation syntax and the query syntax, using the match operator.
  • 41. Full text search queries. (Diagram: five C* nodes, each with a Lucene index.) SELECT * FROM logs WHERE description MATCH *Exception;
  • 42. Upcoming challenges
  • 43. Stratio Streaming. Start using Spark Streaming for doing some Complex Event Processing operations.
  • 44. Datavis. A solution for the creation and design of dashboards, reports, and microsites. Connects different data sources (relational databases, NoSQL, REST services, ElasticSearch, CSV...). Datavis allows you to design websites with complete freedom, where the user only has to concentrate on the content, not on the underlying technology.
  • 45. Stratio Ingestion
  • 46. Thanks in advance
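
The data model on slides 31 and 32 pairs each main table with a denormalized view table, both keyed as ((source key, timestamp), timesplit). A minimal in-memory sketch of that idea in Python follows. It is only an illustration, not the speakers' implementation: the `WideRowTable` class, the `split_time` and `record_observation` helpers, the hourly bucket used as `timestamp`, the seconds-within-the-hour `timesplit`, and the sample IPs and domain are all assumptions, since the slides do not say how the time split is computed.

```python
from collections import defaultdict
from datetime import datetime, timezone


class WideRowTable:
    """Hypothetical stand-in for a table with PRIMARY KEY ((key, timestamp), timesplit)."""

    def __init__(self):
        # partition key (source key, time bucket) -> {timesplit: set of related values}
        self._partitions = defaultdict(dict)

    def insert(self, key, bucket, timesplit, value):
        self._partitions[(key, bucket)].setdefault(timesplit, set()).add(value)

    def read_partition(self, key, bucket):
        """Return the rows of one partition, ordered by the clustering column."""
        rows = self._partitions.get((key, bucket), {})
        return [(ts, sorted(values)) for ts, values in sorted(rows.items())]


def split_time(event_time):
    """Assumed bucketing: the hour is the 'timestamp', seconds into the hour the 'timesplit'."""
    bucket = event_time.replace(minute=0, second=0, microsecond=0)
    timesplit = event_time.minute * 60 + event_time.second
    return bucket, timesplit


def record_observation(ip_table, ip_domain_view, ip, domain, event_time):
    """Write the main IP row and the Domain -> IPs view row for one observation."""
    bucket, timesplit = split_time(event_time)
    ip_table.insert(ip, bucket, timesplit, domain)
    ip_domain_view.insert(domain, bucket, timesplit, ip)


ip_table = WideRowTable()
ip_domain_view = WideRowTable()
t0 = datetime(2014, 9, 11, 10, 15, 0, tzinfo=timezone.utc)
t1 = datetime(2014, 9, 11, 10, 20, 30, tzinfo=timezone.utc)
record_observation(ip_table, ip_domain_view, "10.0.0.1", "example.org", t0)
record_observation(ip_table, ip_domain_view, "10.0.0.2", "example.org", t1)

hour = datetime(2014, 9, 11, 10, tzinfo=timezone.utc)
by_ip = ip_table.read_partition("10.0.0.1", hour)
by_domain = ip_domain_view.read_partition("example.org", hour)
print(by_ip)
print(by_domain)
```

Writing every observation to both tables is what lets each lookup ("which domains did this IP resolve to?" and "which IPs served this domain?") be answered from a single partition, which matches the slides' split between easy queries served by Cassandra directly and open queries left to Spark.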