Real-World Cassandra at ShareThis


Real-World Cassandra at ShareThis: Use Cases, Data Modeling, and Hector

ShareThis + Our Customers: Keys to Unlocking Social

1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES)
2. TAKE YOUR SOCIAL INVENTORY TO MARKET
3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY

Largest Ecosystem For Sharing and Engagement Across The Web

SHARETHIS ECOSYSTEM
120 SOCIAL CHANNELS
211 MILLION PEOPLE (95.1% of the web)
2.4 MILLION PUBLISHERS

Source: ComScore U.S. January 2013; internal numbers, January 2013

ShareThis empowers publishers with solutions that improve and drive value from the social engagement on their sites. People share content that's most relevant to them, with people who they believe will also enjoy it. More than 2.5 million publishers increase eyeballs, engagement, and advertising revenue through the ShareThis sharing platform.

Data Modeling and Why it Matters (Keep it even, Keep it slice-able)

Use Cases: Data Store, High Availability Count Service, High Write Analytics, Real Time Analytics

A New Product: SnapSets

3 - x1.large

A New Product: SnapSets

Use Case: SnapSets, A New Product

Use Case: SnapSets, A New Product (Continued)

CF: Users (userId)
  meta:first_name=Ronald
  meta:last_name=Melencio
  meta:username=ronsharethis
  scrapbook:timestamp:scrapbookId:name=Scrapbook 1
  scrapbook:timestamp:scrapbookId:date_created=Jan 10
  url1:sid:clipID={LOCATION DATA}
  url1:sid:456={LOCATION DATA}

CF: Scrapbooks (scrapbookId)
  clip:timestamp:clipId:url=sharethis.com
  clip:timestamp:clipId:title=Clip 1
  clip:timestamp:clipId:likes=10

CF: Clip (clipId)
  comment:timestamp:commentId={"name":"Ronald","timestamp":"jan 10","comment":"hi"}

CF: Stats (user:userId, application, publisher:pubId)
  meta:total_scrapbooks=1
  meta:total_clips=100
  meta:total_scrapbook_comments=100
  scrapbook:timestamp:scrapbookId:total_comments=10
  scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:likes=10
  scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:dislikes=10
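A minimal Hector sketch of a write against this model (the method, host handling, and keyspace wiring are assumptions for illustration, not the production code): the colon-delimited column names are what keep each user's row evenly shaped and slice-able.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class SnapSetsWriteSketch {
    // Adds a scrapbook under a user's row; keyspace and CF names are illustrative.
    static void addScrapbook(Keyspace keyspace, String userId, String scrapbookId, String name) {
        long ts = System.currentTimeMillis();
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // Users CF: one column per scrapbook field, named
        // scrapbook:<timestamp>:<scrapbookId>:<field>, so a prefix slice on the
        // user's single row returns that user's scrapbooks in time order.
        String prefix = "scrapbook:" + ts + ":" + scrapbookId + ":";
        mutator.addInsertion(userId, "Users", HFactory.createStringColumn(prefix + "name", name));
        mutator.addInsertion(userId, "Users", HFactory.createStringColumn(prefix + "date_created", "Jan 10"));

        mutator.execute();  // single batch mutation
    }
}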

Use Cases: Data Store, High Availability Count Service, High Write Analytics, Real Time Analytics

High Velocity Reads and Writes: Count Service

9 hi1.4xlarge
9 x1.large

Use Case: Count Service for URLs

1 Billion Pageviews per day = 12k pageviews per second

60 Million Social Referrals per day = 720 social referrals per second

1 Million Shares per day = 12 shares per second

No expiration of Data* (3bn rows)

Requires the lowest latency possible

Multiple read requests per page on blogs

Normalize and Hash the URL for a row key

Each social channel is a column

Retrieve the whole row for counts

Fix it by cheating ^_^ *
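A sketch of what that read/write path could look like with Hector, under stated assumptions (the UrlCounts CF name and the MD5 normalization are illustrative, not necessarily the production values): the hashed URL is the row key, each social channel is a counter column, and one slice query returns the whole row of counts.

import java.security.MessageDigest;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.CounterSlice;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceCounterQuery;

public class CountServiceSketch {

    // Illustrative normalization: lower-case, drop the fragment, then MD5-hex
    // the result so row keys stay a fixed, even size.
    static String rowKey(String url) throws Exception {
        String normalized = url.toLowerCase().split("#")[0];
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest(normalized.getBytes("UTF-8"))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // One counter column per social channel, all in the same row.
    static void recordShare(Keyspace ks, String url, String channel) throws Exception {
        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        m.insertCounter(rowKey(url), "UrlCounts",
                HFactory.createCounterColumn(channel, 1L, StringSerializer.get()));
    }

    // Retrieve the whole row: every channel's count in a single slice.
    static void printCounts(Keyspace ks, String url) throws Exception {
        SliceCounterQuery<String, String> q =
                HFactory.createCounterSliceQuery(ks, StringSerializer.get(), StringSerializer.get());
        q.setColumnFamily("UrlCounts");
        q.setKey(rowKey(url));
        q.setRange("", "", false, 200);
        QueryResult<CounterSlice<String>> result = q.execute();
        for (HCounterColumn<String> c : result.get().getColumns()) {
            System.out.println(c.getName() + " = " + c.getValue());
        }
    }
}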

Use Cases: Data Store, High Availability Count Service, High Write Analytics, Real Time Analytics

Insights that Matter: Your Social Analytics Dashboard

Timely Social Analytics: Dive deeper into your most social content

Identify popular articles. Uncover which social channels are driving the most social traffic.

Benchmark your social engagement with SQI. Measure social activity on an hourly, daily, weekly & monthly basis.

12 - x1.large

Use Case: Loading Processed Batch Data

Backend Hadoop stack for processing analytics

58 JSON schemas map tabular data to key/value storage for slicing

MongoDB* did not scale for frequent row-level writes on the same table

Needed to maintain read throughput during the write spikes that hit when analytics runs finished

No TTL*: works daily, doesn't work hourly

Switching from Astyanax to Hector

Using the Hector client through its Java APIs

Use Case: Loading Processed Batch Data (continued)

{"schema":[{"column_name":"publisher","column_type":"UTF8Type", "column_level":"common","column_master":""},{"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""},{"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},{"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},{"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"}, {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""},{"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""}],"row_key_format": "publisher:domain:month","column_family_name": "sqi_table"}

CF -> Data Type
Row -> Publisher:domain:timestamp
Columns -> master:slave = value (topics, categories, URLs, timestamps, etc.)
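Under those schema semantics, a load step could look roughly like the sketch below (the helper name and sample values are hypothetical; the CF name sqi_table and the row-key format come from the schema above): the row key is assembled from the common and partition columns, and each composite_master:composite_slave pair becomes one column in the row.

import me.prettyprint.cassandra.serializers.FloatSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class BatchLoadSketch {
    // Writes one processed record into sqi_table; values are illustrative.
    static void writeRecord(Keyspace ks, String publisher, String domain, String month,
                            String category, float sqi) {
        // row_key_format from the schema: publisher:domain:month
        String rowKey = publisher + ":" + domain + ":" + month;

        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        // column_master "category" + slave column "sqi" -> column name like "news:sqi"
        m.addInsertion(rowKey, "sqi_table",
                HFactory.createColumn(category + ":sqi", sqi,
                        StringSerializer.get(), FloatSerializer.get()));
        m.execute();  // one batch mutation per record
    }
}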

Use Cases: Data Store, High Availability Count Service, High Write Analytics, Real Time Analytics

Insights that Matter: Your Social Analytics Dashboard

Real Time Social Analytics: Dive deeper into your most social content

Identify trending articles in real time. Uncover which social channels are driving the most social traffic.

Benchmark your social engagement with SQI. Measure social activity on an hourly, daily, weekly & monthly basis.

12 - cc1.4xlarge


Insights that Matter... and aren't accessible


Too many columns: unbounded URL / channel sets

Cascading failure

Solutions: Bigger boxes... meh

Split up the columns: split the row keys. Hash URLs and keep stats separate.

Split up the columns: split the CF. Move URLs to their own space.

Split up the columns: split the keyspace. The keyspace is a timestamp.
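One way to realize the keyspace-per-timestamp idea with Hector (a sketch only; the naming pattern and hourly bucket granularity are assumptions): derive the keyspace name from the load's time bucket, create it if it doesn't exist, and point writes at it.

import java.text.SimpleDateFormat;
import java.util.Date;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

public class TimeBucketedKeyspaceSketch {
    // e.g. "rt_2013071518" for the 18:00 hourly bucket on July 15, 2013
    static Keyspace keyspaceForBucket(Cluster cluster, Date bucket) {
        String name = "rt_" + new SimpleDateFormat("yyyyMMddHH").format(bucket);
        if (cluster.describeKeyspace(name) == null) {
            KeyspaceDefinition def = HFactory.createKeyspaceDefinition(name);
            cluster.addKeyspace(def, true);  // block until the schema change propagates
        }
        return HFactory.createKeyspace(name, cluster);
    }
}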

Ask Good Data Modeling Questions

How many rows will there be?

How many columns per row will you need?

How will you slice your data?

What is the maximum number of rows?

What is the maximum number of columns?

Is your data relational?

How long will your data live?

Hector: https://github.com/hector-client/hector/wiki/User-Guide

Hector Imports

import me.prettyprint.cassandra.model.BasicColumnFamilyDefinition;
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ColumnSliceIterator;
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.cassandra.service.template.ColumnFamilyResult;
import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate;
import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate;

import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.exceptions.HectorException;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.CounterQuery;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceCounterQuery;
import me.prettyprint.hector.api.query.SliceQuery;

Hector: Add a keyspace

public static Cluster getCluster(String name, String hosts) {
    return HFactory.getOrCreateCluster(name, hosts);
}

public static KeyspaceDefinition createKeyspaceDefinition(String keyspaceName, int replication) {
    return HFactory.createKeyspaceDefinition(
            keyspaceName,
            ThriftKsDef.DEF_STRATEGY_CLASS,  // "org.apache.cassandra.locator.SimpleStrategy"
            replication,
            null                             // ArrayList of CF definitions
    );
}

public static void addKeyspace(Cluster cluster, KeyspaceDefinition ksDef) {
    KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(ksDef.getName());
    if (keyspaceDef == null) {
        cluster.addKeyspace(ksDef, true);
        System.out.println("Created keyspace: " + ksDef.getName());
    } else {
        System.err.println("Keyspace already exists");
    }
}
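Put together, a hypothetical bootstrap using the helpers above could look like this (cluster name, hosts, keyspace name, and replication factor are placeholders):

Cluster cluster = getCluster("st-cassandra", "cass1:9160,cass2:9160,cass3:9160");
KeyspaceDefinition ksDef = createKeyspaceDefinition("Analytics", 3);
addKeyspace(cluster, ksDef);  // no-op if the keyspace already exists
Keyspace keyspace = HFactory.createKeyspace("Analytics", cluster);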

Hector: Define a CF

public static ColumnFamilyDefinition createGenericColumnFamilyDefinition(String ksName, String cfName, ComparatorType ctName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    columnFamilyDefinition.setDefaultValidationClass(ctName.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}

public static ColumnFamilyDefinition createCounterColumnFamilyDefinition(String ksName, String cfName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    columnFamilyDefinition.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}
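The definitions returned by these helpers still have to be registered with the cluster; a hedged usage sketch (keyspace and CF names are placeholders):

ColumnFamilyDefinition urlCounts = createCounterColumnFamilyDefinition("Analytics", "UrlCounts");
cluster.addColumnFamily(urlCounts, true);  // block until the schema change propagates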