Social Security Company Nexgate's Success Relies on Apache Cassandra
![Page 1: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/1.jpg)
Datastax and Cassandra at Nexgate
Rich Sutton, CTO
Harold Nguyen, Sr. Data Scientist
![Page 2: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/2.jpg)
A Little About Us
Company – Security & Compliance for Social
– Launched April 2013
– Series A from Sierra & WindForce Ventures
– 15 employees, 7 in Engineering (2 Data Scientists)
Security guys from:
Customers:
![Page 3: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/3.jpg)
Key Enterprise Pain Points
① Brand social account sprawl
• Can’t inventory, audit, track social media infrastructure
• Can’t continuously find fake accounts
② Inbound protection for accounts
• Nothing to detect and remediate account anomalies / hacks
• No automated coverage for volumes of inappropriate and malicious content
③ Outbound compliance controls
• Too many admins and apps installed across multiple accounts
• Little or no automated coverage for sensitive and regulated data
Novartis Slapped by the FDA
FINRA begins social compliance audits
Spam
![Page 4: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/4.jpg)
Where Nexgate Fits
Protecting the social account itself
Nexgate
Protect branded accounts and ensure compliance
• Find, audit, and track the actual social accounts of the brand
• Catch & remediate social account hacks, tampering, and misuse
• Remove bad ‘inbound’ content including spam, malware, and acceptable-use violations
• Enforce usage of approved publishing platforms
• Comply with regulations using prebuilt content policies, workflow, and intelligent archiving
Listening Platforms
Mine external social data and conversations
• Find brand ‘mentions’ and present them with inferences
• Provide volumes of market data that is analyzed for trends, share of voice, etc.
• Social CRM identification of key conversations and influencers that may need engagement
Publishing Platforms
Engage audiences and track outcomes
• Build communities
• Deliver content, custom apps, ads with workflow
• Promotions, contests, and campaigns
![Page 5: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/5.jpg)
:001> Content classification is what we do. The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.
![Page 6: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/6.jpg)
:002> We made a lazy storage choice.
![Page 7: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/7.jpg)
:003> Some success forced our hand.
![Page 8: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/8.jpg)
:004> Social data is small and jagged.
• Average 1K all in, content and metadata
• Some common small stuff: time, social IDs, parent, account
• Some common big stuff: content, links
• Lots of disparate stuff, specific to the social platform
![Page 9: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/9.jpg)
:005> Keep in SQL: fixed length, non-null, heavily indexed, group access
Keep in NoSQL: variable length, commonly null, non-indexed, single access, text search
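The SQL / NoSQL split can be sketched as a small routing function. This is only an illustration under stated assumptions: the field names (`item_id`, `posted_at`, `account_id`, `parent_id`) and the `split_item` helper are hypothetical and are not taken from Nexgate's actual schema.

```python
# Hypothetical sketch: route fixed, always-present fields to a relational row
# and everything else to a NoSQL document. Field names are illustrative only.
FIXED_FIELDS = ("item_id", "posted_at", "account_id", "parent_id")


def split_item(item: dict) -> tuple[dict, dict]:
    """Return (sql_row, nosql_doc) for one social media item."""
    missing = [f for f in FIXED_FIELDS if f not in item]
    if missing:
        # The relational side is non-null, so refuse incomplete items.
        raise ValueError(f"missing fixed fields: {missing}")
    sql_row = {f: item[f] for f in FIXED_FIELDS}
    # Variable-length and platform-specific fields; drop nulls and keep the
    # item_id so the document can be fetched by single-key access.
    nosql_doc = {
        k: v for k, v in item.items()
        if k not in FIXED_FIELDS and v is not None
    }
    nosql_doc["item_id"] = item["item_id"]
    return sql_row, nosql_doc


item = {
    "item_id": 42,
    "posted_at": 1380000000,
    "account_id": 7,
    "parent_id": 0,
    "content": "Check out this link!",
    "links": ["http://example.com"],
    "retweet_count": None,  # platform-specific and often null
}
sql_row, nosql_doc = split_item(item)
```

Here the small, indexed fields land in `sql_row`, while the content, links, and any platform-specific extras land in `nosql_doc`, keyed by the same `item_id`.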
![Page 10: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/10.jpg)
:006> Requirements
• Simple, proven horizontal scalability
• Integrated tools for research: search, analysis
• Operational simplicity; nodes all the same
• Enterprise support
![Page 11: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/11.jpg)
:007> Deployment
• Multi-region AWS
• M1 Large instances
• Instance-attached storage
• About to scale again
• Separate dev, test, prod clusters
Datastax:
• Start-up pricing, per-core pricing
• On-site experts, responsive support
![Page 12: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/12.jpg)
Scale of Data
Over 250 million pieces of social media content in total, spread across Facebook, Twitter, YouTube, Google+, and LinkedIn
Currently about half a million new pieces of content per day
– All classified in real time as it comes in
About 50,000 new social media content authors per day
Cassandra is a great choice for a database
– Allows flexibility for the rapidly changing landscape of social media threats
![Page 13: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/13.jpg)
Data throughput
Average reads = 70 / sec
Average writes = 25 / sec
![Page 14: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/14.jpg)
Fighting Spam with Cassandra
Among the many security and compliance classifications that Nexgate provides, we also have powerful spam detection
Spam can be a single link directing to a fraudulent site (screenshots of a Facebook comment):
![Page 15: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/15.jpg)
Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate product):
![Page 16: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/16.jpg)
Social media spam grew by 355% in the first half of 2013.
Get the report at http://nx.gt/SocialSpamReport
![Page 17: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/17.jpg)
Cassandra and Social Media Spam
Can create spam signatures to catch this type of content
...but it would be too slow to catch spam in real time.
Cassandra
![Page 18: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/18.jpg)
Define Your Data Model
Even though Cassandra is a NoSQL schema-less database, it is worth carefully defining the data model
Can’t just “throw data at it” – can make for some really inefficient queries
Define the data model based on how you will query the data
For us, we want to find spam content that has been posted multiple times
– Spammers tend to post same-content messages
![Page 19: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/19.jpg)
Spam Multiplicity Data Model
Typical table in Cassandra
– Wide “unconstrained” rows are a nice feature compared with SQL
Row key -> hash of content
Column key -> unique ID (strictly increasing with time)
Column value -> item_id and time of post
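A minimal in-memory sketch of this wide-row layout, using plain Python dictionaries in place of a Cassandra table. The SHA-1 hash, the whitespace and case normalization, and the names `content_key` and `record_post` are assumptions for illustration; the slides do not say which hash or client code Nexgate uses.

```python
import hashlib
from collections import defaultdict

# row key (hash of content) -> {column key (unique ID) -> (item_id, post time)}
# A nested dict stands in for a Cassandra wide row in this sketch.
spam_table: dict[str, dict[int, tuple[str, int]]] = defaultdict(dict)


def content_key(content: str) -> str:
    """Row key: hash of the lightly normalized content. SHA-1 is an assumption."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def record_post(content: str, unique_id: int, item_id: str, posted_at: int) -> None:
    """Write one column into the row for this content."""
    spam_table[content_key(content)][unique_id] = (item_id, posted_at)


record_post("Win a FREE phone", unique_id=1001, item_id="fb:1", posted_at=1380000000)
record_post("win a free   phone", unique_id=1002, item_id="tw:9", posted_at=1380000060)
record_post("Nice photo!", unique_id=1003, item_id="fb:2", posted_at=1380000120)

# Both phone posts hash to the same row key, so they share one wide row.
row = spam_table[content_key("Win a FREE phone")]
```

Because the row key is derived from the content, all copies of a message end up as columns of the same row, regardless of which account or platform they were posted to.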
![Page 20: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/20.jpg)
Why this Data Model?
Spammers typically post the same content over and over
Easy to determine how many times a same-content post is made: check the number of columns
Will never double count, because re-inserting the same post simply updates the existing column instead of adding a new one
Indexed by the content hash, so quick reads and writes
By reading the column values, can extract the time series of duplicated posts
– Can also map back to the original value – we store actual content indexed by the item_id in another Cassandra table
Cassandra is not a magic bullet – still need a relational database to glue all the pieces of data together
– Batch processing may need other tools like Hadoop
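The counting and double-count points can be illustrated with a single row held in a Python dictionary. This is a sketch, not Nexgate's code: the `upsert` helper and the sample IDs are hypothetical, and the dictionary only mimics the fact that a Cassandra write to an existing column key overwrites that column.

```python
# One wide row: column key (unique ID) -> (item_id, post time).
# In Cassandra this row would be looked up by the hash of the content.
row: dict[int, tuple[str, int]] = {}


def upsert(unique_id: int, item_id: str, posted_at: int) -> None:
    """Writing an existing column key overwrites it, mirroring a Cassandra upsert."""
    row[unique_id] = (item_id, posted_at)


upsert(1002, "tw:9", 1380000060)
upsert(1001, "fb:1", 1380000000)
upsert(1002, "tw:9", 1380000060)  # same post seen again: no new column

multiplicity = len(row)  # number of columns = number of distinct posts

# Column keys increase with time, so sorting by key yields the time series.
time_series = [posted_at for _, (_, posted_at) in sorted(row.items())]
item_ids = [item_id for _, (item_id, _) in sorted(row.items())]
```

The `item_ids` list is what would be used to look up the original content in the separate table keyed by item_id.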
![Page 21: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/21.jpg)
![Page 22: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/22.jpg)
Real-world spam multiplicity
This has become invaluable to us for catching spam content in real time – the following “rant” comment was posted 38 times…
– Brand can more easily moderate given automated tools
In another example, a customer received 25,000 inappropriate messages, and this tool helped us automate content removal
![Page 23: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/23.jpg)
Importance of Keeping All Data
Another way to tackle real-time spam is by identifying spammy users
– Since Cassandra effortlessly keeps all the content we observed, our algorithm takes into account all the posts contributed by an author to determine if they are a spammer
Additionally, it is important to keep all data to train our 100+ classifiers
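The slides do not describe the author-level algorithm, so the following is only a hypothetical illustration of why the full post history matters: it scores an author by the share of their posts whose content they repeated. The `duplicate_ratio` function, the sample histories, and the 0.5 threshold are all assumptions, not Nexgate's classifier.

```python
from collections import Counter


def duplicate_ratio(posts: list[str]) -> float:
    """Fraction of an author's posts whose normalized text appears more than once."""
    if not posts:
        return 0.0
    counts = Counter(" ".join(p.lower().split()) for p in posts)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(posts)


# Hypothetical post histories keyed by author.
history = {
    "author_a": ["Buy followers now", "buy followers now", "Buy followers now", "hello"],
    "author_b": ["Great video", "Thanks for sharing", "Love this"],
}
SPAM_THRESHOLD = 0.5  # arbitrary cut-off for this sketch
flagged = sorted(
    author for author, posts in history.items()
    if duplicate_ratio(posts) >= SPAM_THRESHOLD
)
```

A score like this only becomes meaningful when every post by the author is available, which is the argument for never discarding observed content.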
![Page 24: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/24.jpg)
Tuning Cassandra
Cassandra actually has been humming along quite nicely!
– Barely any tweaking needed from default values
– No deletes (just the nature of our dataset) => repairs rarely need to be run (repair reconciles inconsistencies across replicas of the data, which matters most when there are deletes)
• Fine for us, because repair requires intensive disk I/O
Only times we observed performance issues:
– When the rates of our reads and writes reached a certain threshold
– When the size of the data being inserted was too large
– Heap memory issue with Cassandra 1.1.x
In all cases, Datastax provided a quick and simple solution, mostly just toggling a few parameters in config files and restarting the nodes
![Page 25: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/25.jpg)
Cassandra Community
Community is wonderful - it's really easy to jump on the Cassandra IRC channel and talk to fellow users and developers to get real-time feedback.
– With IRC and mailing list help, implemented composite columns to detect malware sites on the second day of using Cassandra 3 years ago
In fact, when we tested a migration to the latest version of Cassandra and one of our Ruby wrappers didn't play nice with CQL3, I was able to speak directly with the Ruby wrapper author on IRC and got an explanation of why it didn't work.
– The same day, I committed a fix and opened a pull request against the Ruby wrapper on GitHub, and the author looked at it the next morning
Datastax support has been invaluable for providing fast feedback and simple solutions
![Page 26: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/26.jpg)
Datastax Additional Tools
OpsCenter is helpful in debugging performance issues
Solr – used to obtain training data for classifiers by phrase matching
Looking forward:
– Datastax Hadoop support, to look into training on labeled data with MapReduce
![Page 27: Social Security Company Nexgate's Success Relies on Apache Cassandra](https://reader037.fdocuments.net/reader037/viewer/2022103110/54b701014a7959943a8b45bd/html5/thumbnails/27.jpg)
Thank you Datastax and RelateIQ!
Let us show you: nexgate.com/demo
Follow us: @NXGate, facebook.com/NXGate