Use Cases for NoSQL in Media

NOSQL in Media Sander Kieft

About me

Manager Core Services at Sanoma

Responsible for all common services, including the Big Data platform

Work:

– Centralized services

– Data platform

– Search

Like:

– Work

– Water(sports)

– Whiskey

– Tinkering: Arduino, Raspberry Pi, soldering stuff

24 April 2015

Sanoma, B2C Publishing and Learning company

2 Finnish newspapers

Over 100 magazines

5 TV channels in Finland and the Netherlands

200+ websites

100 mobile applications on various mobile platforms

Not Only SQL

Generic vs specialized solutions

Data models

Speed

Scalability

Partition tolerance

Availability / Redundancy

Cost per GB

Specialized focus

The CAP (or Brewer's) theorem says:

“it is impossible for a distributed computer system

to simultaneously provide all three of the following

guarantees:

– Consistency

– Availability

– Partition tolerance”

CAP Theorem

[Diagram: triangle with corners A (Availability), C (Consistency) and P (Partition tolerance)]

Availability: each client can always read and write

Partition tolerance: the system works well despite physical network partitions

Consistency: all clients always have the same view of the data

RDBMS

MySQL

Postgres

MS SQL

Oracle

NOSQL

Eventual consistency (Werner Vogels, CTO Amazon)
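
A toy illustration of eventual consistency, assuming two in-memory replicas that accept writes independently during a partition and reconcile afterwards with a last-write-wins rule. The `Replica` class, the logical timestamps and the merge rule are assumptions made for this sketch; real stores use more elaborate mechanisms.

```python
class Replica:
    """In-memory replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Keep the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Anti-entropy step: pull every entry from the other replica.
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)


a, b = Replica(), Replica()
a.write("headline", "Draft", ts=1)      # client 1 writes to replica a
b.write("headline", "Published", ts=2)  # client 2 writes to replica b

# During the partition both replicas stay available but disagree.
before = (a.read("headline"), b.read("headline"))

# Once the partition heals, the replicas exchange state and converge.
a.sync_from(b)
b.sync_from(a)
after = (a.read("headline"), b.read("headline"))
```

Both replicas keep answering reads and writes while they are apart (availability), at the price of briefly returning different values (no strict consistency).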

Various Data models

key-value

column

document stores

map/reduce

graph

search

blob storage

Key/value stores

Photo credits: John Chulick - https://www.flickr.com/photos/chulickphotos/8234894686/

Stores an object under a key

Based on the Dynamo paper (Werner Vogels)

Products:

– Riak

– Memcached/Membase

– Tokyo Cabinet

– Redis

– Voldemort

Use cases:

– Counting

– Top lists

– Caches

– Pre-calculated optimizations

Key/Value buckets

Bucket: A, B, C

Real time stats

User: XXXX, YYYY, ZZZZ

Article: 100, 200, 300

Article_<5 min. TIME>: 50, 100, 150
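
A minimal sketch of the counting and top-list pattern behind the real-time stats, assuming one all-time counter per article plus one counter per article per 5-minute window. A `Counter` stands in for the key/value store, and the key layout (`article:<id>` and `article:<id>:<window start>`) is an assumption modelled on the Article and Article_<5 min. TIME> keys.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60  # 5-minute buckets, as in the Article_<5 min. TIME> key

store = Counter()  # stand-in for a key/value store with atomic increments


def window_start(ts):
    """Round a Unix timestamp down to the start of its 5-minute window."""
    return ts - ts % WINDOW_SECONDS


def record_view(article_id, ts):
    # One all-time counter and one per-window counter per article.
    store[f"article:{article_id}"] += 1
    store[f"article:{article_id}:{window_start(ts)}"] += 1


def top_articles(ts, n=3):
    """Top list for the window containing ts, built by scanning keys."""
    suffix = f":{window_start(ts)}"
    counts = {
        key.split(":")[1]: value
        for key, value in store.items()
        if key.count(":") == 2 and key.endswith(suffix)
    }
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


for article_id, ts in [("100", 0), ("200", 10), ("100", 20), ("100", 310)]:
    record_view(article_id, ts)
```

Scanning all keys keeps the sketch short; a production store would more likely maintain the top list incrementally, for example in a sorted structure updated on every view.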

Document Stores

Stores "records" as documents

Versioning

Easy sharding (document self contained)

Products:

– MongoDB

– CouchDB

– SimpleDB

Use case:

– CMS

– Meta data

– Product catalog

From relational data model to document

[Diagram labels: Product, Properties, Application, Property, Property]
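
A small sketch of moving from a relational model to a document, assuming a product table with a separate properties table. The rows, field names and the `to_document` helper are hypothetical; they only illustrate how child rows end up embedded in one self-contained document.

```python
import json

# Hypothetical relational rows: one product row plus one row per property.
product_rows = [{"product_id": 1, "name": "Magazine app"}]
property_rows = [
    {"product_id": 1, "key": "platform", "value": "iOS"},
    {"product_id": 1, "key": "language", "value": "nl"},
]


def to_document(product, properties):
    """Embed the property rows that belong to a product into one document."""
    return {
        "_id": product["product_id"],
        "name": product["name"],
        "properties": {
            row["key"]: row["value"]
            for row in properties
            if row["product_id"] == product["product_id"]
        },
    }


documents = [to_document(p, property_rows) for p in product_rows]
print(json.dumps(documents[0], indent=2))
```

Because the document carries its own properties, it can be read without a join and placed on any shard, which is what makes sharding easy for self-contained documents.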

Architecture Content Platform

[Diagram: content platform architecture in four columns: Sources, Services, Solutions, Products]

Content Platform Core:

– Search: Solr

– Blob storage: S3 & MT

– Article storage: MongoDB

– Analyse

Other labels in the diagram: MyJour, Item Based Framework, ….CMS, CMS, Editorial reuse-interface, ePub, Digital Template system, WoodWing, Content Portal, Feeds, Noma, Viva, PDF Based Framework, …., HomeDeco, ??, eLinea, Blendle, Google Currents, LINDA. nieuws, NU.nl search

Column stores

Lineage: Google's BigTable paper

Records with many, many columns

Distinguish between hot and cold data

Versioning

Records and columns can be sharded

Products:

– HBase

– Cassandra

– Hypertable

Use cases:

– Analytics

– Messages
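
A rough sketch of the column-store data model, assuming a sparse table where each row can hold an arbitrary set of columns and every cell keeps timestamped versions. The nested dictionaries, the `put`/`get` helpers and the row and column names are assumptions for illustration, not the API of any listed product.

```python
from collections import defaultdict

# row key -> column name -> list of (timestamp, value) versions
table = defaultdict(lambda: defaultdict(list))


def put(row, column, value, ts):
    """Append a new version of a cell; older versions are kept."""
    table[row][column].append((ts, value))


def get(row, column, as_of=None):
    """Return the newest value, optionally the newest at or before as_of."""
    versions = [
        (ts, value)
        for ts, value in table.get(row, {}).get(column, [])
        if as_of is None or ts <= as_of
    ]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


# Columns are created on write, so a row can grow to many, many columns.
put("user:42", "views:2015-04-23", 12, ts=1)
put("user:42", "views:2015-04-24", 7, ts=2)
put("user:42", "views:2015-04-24", 9, ts=3)  # new version of the same cell
```

Reading with `as_of` returns an older version of a cell, which is the versioning behaviour the slide refers to.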

Big Data

Lineage: Google GFS & Map/Reduce

Distributed data storage and processing

Advanced analytics capabilities on raw data

Schema on read

Products:

– Hadoop

– MPP databases

Use cases:

– Ad hoc querying terabytes of data

– Data science: predictive analytics, model training

– Calculate recommendations
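
A single-process sketch of map/reduce with schema on read, assuming raw log lines that are only parsed at query time. The log format and the three phase functions are made up for this example; a framework such as Hadoop would run the same phases distributed over many nodes.

```python
from collections import defaultdict

# Raw, unparsed log lines; structure is applied only when reading them.
raw_lines = [
    "2015-04-24T10:00:01 /article/100 user=a",
    "2015-04-24T10:00:05 /article/200 user=b",
    "2015-04-24T10:01:12 /article/100 user=c",
]


def map_phase(line):
    """Parse one raw line (schema on read) and emit (key, 1) pairs."""
    _timestamp, path, _user = line.split()
    yield path, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


pairs = (pair for line in raw_lines for pair in map_phase(line))
page_views = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

A different question about the same raw lines, such as views per user, only needs a different `map_phase`; the stored data does not have to be remodelled first.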

Big Data at Sanoma

Main use case: reporting and analytics, moving to data science

A/B and MVT testing evaluations

Using QlikView as a front-end

Supply data to other environments (SAS, Advertising, Behavioral Targeting)

Agile process for adding sources, from raw to intermediate to modeled data warehouse

Sanoma standard data platform, used in all Sanoma countries

> 250 dashboard users

40 daily users: analysts & developers

43 source systems, with 125 different sources

400 tables in Hive

Platform:

– Cloudera Hadoop

– 40-60 nodes

– > 400TB storage

– ~2000 jobs/day

Typical data node / task tracker:

– 1-2 CPUs, 4-12 cores

– 2 system disks (RAID 1)

– 4 data disks (2TB, 3TB or 4TB)

– 24-32GB RAM

Sanoma Data lake

Traditional BI vs Big Data approach

© Sanoma Media

Search

Photo credits: http://www.flickr.com/photos/emyanmei/8223998414/

Keyword search can be combined with advanced forms of ranking the results

Most of the fields go to an index

Facets can be used for analytics

Ranker can be replaced with custom logic

Products:

– Solr

– ElasticSearch

– MarkLogic

Use cases:

– Content Search

– Analytics / Faceted

– Percolation
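
A bare-bones sketch of keyword search with facets, assuming an inverted index built over document titles and a `section` field used for facet counts. The documents and helper names are hypothetical, and ranking is left out entirely; engines such as Solr add analysis, scoring and much more.

```python
from collections import Counter, defaultdict

# Hypothetical documents; "section" is the field used for faceting.
docs = {
    1: {"title": "Sailing on the IJsselmeer", "section": "sports"},
    2: {"title": "Whiskey tasting notes", "section": "lifestyle"},
    3: {"title": "Sailing gear review", "section": "lifestyle"},
}

# Inverted index: term -> set of document IDs containing the term.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["title"].lower().split():
        index[term].add(doc_id)


def search(query):
    """Return IDs of documents containing every query term (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


def facet_counts(doc_ids, field):
    """Count the hits per value of a field, e.g. hits per section."""
    return Counter(docs[d][field] for d in doc_ids)


hits = search("sailing")
facets = facet_counts(hits, "section")
```

The facet counts are a small aggregation over the hit set, which is why faceting doubles as a lightweight analytics tool.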

Search

[Diagram: Content and query (Q) combined (Σ) into a result ranking]

Search too

[Diagram: Content, "t" and User combined (Σ) into a result ranking]

Search too

[Diagram: Content, Page and User combined (Σ) into a result ranking]

Search - Percolation

Traditional queries run against an index of existing data

What if the data does not exist at the time of the query?

Percolation registers queries up front and returns the IDs of the queries that match each new document, e.g. for notification when new matches are available

Use case:

– Search for a tweet and, after the initial results, keep receiving newly tweeted items as they come in
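
A minimal sketch of percolation, assuming queries are plain sets of required terms that are stored first and then matched against each incoming document. The `register` and `percolate` functions and the query IDs are hypothetical; they mirror the idea, not any engine's actual API.

```python
# Registered queries: query ID -> set of terms that must all appear.
registered_queries = {}


def register(query_id, query):
    """Store a query before any matching document exists."""
    registered_queries[query_id] = set(query.lower().split())


def percolate(document_text):
    """Return the IDs of all registered queries that match the document."""
    words = set(document_text.lower().split())
    return sorted(
        query_id
        for query_id, terms in registered_queries.items()
        if terms <= words
    )


register("q-sanoma", "sanoma")
register("q-nosql-media", "nosql media")

matches = percolate("Use cases for NoSQL in media at Sanoma")
no_matches = percolate("Weather report for Helsinki")
```

The roles are reversed compared with normal search: the queries are the stored data and the document plays the part of the query, so each returned ID tells the system whom to notify.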

Graph databases

Lineage: Euler and graph theory.

Data model: nodes & edges, both of which can hold key-value pairs

Products:

– AllegroGraph

– InfoGrid

– Neo4j

Use cases:

– Social relationships

– Content Linking (Entity linking)

[Diagram: graph linking the entities Jan Smit, 3js, Nick en Simon and Volendam to Article 1, Article 2 and Article 3]
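
A small sketch of content linking on a graph, assuming articles and entities are nodes and a mention is an edge. The node names come from the slide, but which entity is linked to which article is not recoverable from the transcript, so the edges below are invented for illustration.

```python
from collections import defaultdict

# Undirected graph stored as adjacency sets. The edges are assumed;
# the transcript does not say which entity links to which article.
edges = [
    ("Article 1", "Jan Smit"),
    ("Article 1", "Volendam"),
    ("Article 2", "Jan Smit"),
    ("Article 2", "3js"),
    ("Article 3", "Nick en Simon"),
    ("Article 3", "Volendam"),
]

graph = defaultdict(set)
for left, right in edges:
    graph[left].add(right)
    graph[right].add(left)


def related_articles(article):
    """Articles reachable in two hops: article -> entity -> other article."""
    related = defaultdict(set)
    for entity in graph[article]:
        for other in graph[entity]:
            if other != article:
                related[other].add(entity)
    return dict(related)


links = related_articles("Article 1")
```

The result maps each related article to the entities it shares with the starting article, the kind of two-hop traversal a graph database is built to answer quickly.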

Blob storage

Endless storage of binary data

Storing objects larger than a single machine can hold

“Lower” price/GB compared to SAN storage

Products

– Amazon S3

– CAStor

– (Hadoop)

Use case:

– Media storage

– Archiving
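
One way to picture storing an object that is larger than a single machine, assuming the blob is cut into fixed-size chunks that are spread over nodes by content hash. The chunk size, node names and manifest layout are hypothetical, and this is not a description of how S3 or CAStor work internally.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example stays readable
NODES = ["node-a", "node-b", "node-c"]  # hypothetical storage nodes


def split_into_chunks(blob, size=CHUNK_SIZE):
    """Cut the blob into consecutive pieces of at most `size` bytes."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]


def place(chunk):
    """Pick a node from the chunk's content hash (simple modulo placement)."""
    digest = hashlib.sha256(chunk).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def store(blob):
    """Return a manifest: ordered list of (node, chunk hash, chunk bytes)."""
    return [
        (place(chunk), hashlib.sha256(chunk).hexdigest(), chunk)
        for chunk in split_into_chunks(blob)
    ]


def load(manifest):
    """Reassemble the blob by concatenating the chunks in manifest order."""
    return b"".join(chunk for _node, _digest, chunk in manifest)


manifest = store(b"media-file-bytes")
```

Only the manifest has to know the whole object; each node sees just the chunks assigned to it.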

Summary

RDBMS systems are good enough for many problems

For specific problems, NOSQL solutions provide a specialized solution

There’s a variety of NOSQL solutions with different characteristics

NOSQL solutions will require a higher engineering effort

Dream NO SQL Architecture – Content Delivery

CMS

Document storage (MongoDB/CouchDB)

Blob storage (S3/CAStor)

Search (ElasticSearch/Solr)

Website / Mobile Application

Dream NO SQL Architecture - Analytics

Event collection

Message Queue (Kafka / Flume)

Event processing (Storm)

Key-value store (Redis): real time recommendations / targeting

Column storage (Cassandra/HBase): real time dashboarding

Big Data (Hadoop): ad hoc reporting & data science

CAP Theorem

[Diagram: CAP triangle with products placed along its sides]

Consistency + Availability (CA): MySQL, Postgres, MS SQL, Oracle, Asterdata, Greenplum, Vertica

Availability + Partition tolerance (AP): Dynamo, Voldemort, Tokyo Cabinet, KAI, Cassandra, SimpleDB, CouchDB, Riak

Consistency + Partition tolerance (CP): Big Table, Hypertable, HBase, MongoDB, Terrastore, Scalaris, Berkeley DB, MemcacheDB, Redis

Data models

Relational databases

Key-value

Column-oriented

Document-oriented