Use Cases for NoSQL in Media
Uploaded by: sander-kieft
Category: Technology
About me
Manager Core Services at Sanoma
Responsible for all common services, including the
Big Data platform
Work:
– Centralized services
– Data platform
– Search
Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
24 April 2015
Sanoma, B2C Publishing and Learning company
– 2 Finnish newspapers
– Over 100 magazines
– 5 TV channels in Finland and The Netherlands
– 200+ websites
– 100 mobile applications on various mobile platforms
Data models
Speed
Scalability
Partition tolerance
Availability / Redundancy
Cost per GB
Specialized focus
CAP Theorem
The CAP (or Brewer's) theorem says:
“it is impossible for a distributed computer system
to simultaneously provide all three of the following
guarantees:
– Consistency
– Availability
– Partition tolerance”
[Figure: CAP triangle with corners A (Availability), C (Consistency) and P (Partition tolerance)]
Availability
Each client can always
read and write
Partition Tolerance
The system works well
despite physical
network partitions
Consistency
All clients always have
the same view of the
data
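The three definitions above can be made concrete with a toy model. The sketch below is not a real database: `Replica` and `TinyCluster` are hypothetical names, and the model assumes just two replicas and a boolean partition flag. It only shows the choice a system faces once the network is partitioned: refuse the write (keep consistency) or accept it locally (keep availability).

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas; `mode` is 'CP' or 'AP' (hypothetical toy model)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True = replicas cannot reach each other

    def write(self, replica_id, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent: refuse the write, giving up availability.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the replicas may now disagree.
            self.replicas[replica_id].data[key] = value
            return
        # No partition: replicate the write to every node.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, replica_id, key):
        return self.replicas[replica_id].data.get(key)


ap = TinyCluster("AP")
ap.write(0, "title", "v1")
ap.partitioned = True
ap.write(0, "title", "v2")  # accepted, but only replica 0 sees it
print(ap.read(0, "title"), ap.read(1, "title"))
```

In "AP" mode both clients can keep reading and writing, but they no longer have the same view of the data. In "CP" mode the same write raises an error instead.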
Various data models
RDBMS:
– MySQL
– Postgres
– MS SQL
– Oracle
NOSQL:
– key-value
– column
– document stores
– map/reduce
– graph
– search
– blob storage
Key/value stores
Store objects by key
Based on the Dynamo paper (Werner Vogels)
Products:
– Riak
– Memcache/Membase
– Tokyo Cabinet
– Redis
– Voldemort
Use cases:
– Counting
– Top lists
– Caches
– Pre-calculated optimizations
Key/Value buckets
Bucket                  A     B     C
User                    XXXX  YYYY  ZZZZ
Article                 100   200   300
Article_<5 min. TIME>   50    100   150
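The counting and top-list use cases, together with the Article_<5 min. TIME> bucket, could look roughly like the sketch below. It is a minimal illustration, assuming a plain in-memory `Counter` as a stand-in for a key/value store with atomic increments. The key layout and the function names `record_view` and `top_articles` are hypothetical, not taken from the slides.

```python
from collections import Counter

BUCKET_SECONDS = 5 * 60  # the 5-minute window from the bucket table

store = Counter()  # stand-in for a key/value store with atomic increments


def bucket_start(timestamp):
    """Round a Unix timestamp down to the start of its 5-minute bucket."""
    return int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS


def record_view(article_id, timestamp):
    """Increment the all-time counter and the counter of the current bucket."""
    store[f"Article:{article_id}"] += 1
    store[f"Article_{bucket_start(timestamp)}:{article_id}"] += 1


def top_articles(timestamp, n=3):
    """Top list for the 5-minute bucket that contains `timestamp`."""
    prefix = f"Article_{bucket_start(timestamp)}:"
    counts = {k[len(prefix):]: v for k, v in store.items() if k.startswith(prefix)}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


record_view("A", 0)
record_view("A", 10)
record_view("B", 20)
record_view("A", 400)  # falls into the next 5-minute bucket
print(top_articles(100))
```

Scanning keys by prefix is only workable in this toy. A real key/value store would more likely keep the top list in a dedicated structure (for example a sorted set in Redis) that is updated on every increment.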
Document stores
Stores “records” as documents
Versioning
Easy sharding (documents are self-contained)
Products:
– MongoDB
– CouchDB
– SimpleDB
Use cases:
– CMS
– Metadata
– Product catalog
From relational data model to document
[Figure: diagram with boxes labelled Product, Properties, Application and Property]
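The step from a relational data model to a document can be sketched as folding the rows of several tables into one self-contained record. The example below is only an illustration: the row layout, the field names and the sample values are assumptions, and `to_document` is a hypothetical helper, not part of any document store's API.

```python
import json

# Hypothetical relational rows: one product row plus its property rows.
product_row = {"id": 1, "name": "Example magazine"}
property_rows = [
    {"product_id": 1, "name": "language", "value": "nl"},
    {"product_id": 1, "name": "frequency", "value": "weekly"},
    {"product_id": 2, "name": "language", "value": "fi"},
]


def to_document(product, properties):
    """Fold the property rows that belong to `product` into one document."""
    return {
        "_id": product["id"],
        "name": product["name"],
        "properties": {
            row["name"]: row["value"]
            for row in properties
            if row["product_id"] == product["id"]
        },
    }


document = to_document(product_row, property_rows)
print(json.dumps(document, indent=2))
```

Because the document carries its own properties, it can be stored, versioned and sharded as a unit without a join.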
Architecture Content Platform
[Figure: architecture diagram with four columns: Sources, Services, Solutions, Products]
Content Platform Core:
– Search: Solr
– Blob storage: S3 & MT
– Article storage: MongoDB
– Analyse
Other labels in the diagram: CMS, WoodWing, Editorial reuse-interface, ePub, Digital Template system, Content Portal, Feeds, Item Based Framework, PDF Based Framework, MyJour, Noma, Viva, HomeDeco, eLinea, Blendle, Google Currents, LINDA. nieuws, NU.nl search, and several boxes marked “??”
Column stores
Lineage: Google's BigTable paper
Records with many, many columns
Distinguish between hot and cold data
Versioning
Records and columns can be sharded
Products:
– HBase
– Cassandra
– Hypertable
Use cases:
– Analytics
– Messages
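The column-store ideas above (records with many columns, plus versioning) can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: a nested dictionary stands in for the store, every cell keeps a list of timestamped versions, and `put` and `get` are hypothetical helpers rather than the API of HBase or Cassandra.

```python
from collections import defaultdict

# row key -> column name -> list of (timestamp, value), oldest first
table = defaultdict(lambda: defaultdict(list))


def put(row, column, value, timestamp):
    """Add a new version of a cell; older versions are kept."""
    cells = table[row][column]
    cells.append((timestamp, value))
    cells.sort(key=lambda cell: cell[0])


def get(row, column, at=None):
    """Latest value, or the latest value written at or before `at`."""
    cells = table.get(row, {}).get(column, [])
    if at is not None:
        cells = [cell for cell in cells if cell[0] <= at]
    return cells[-1][1] if cells else None


# Each row can have its own, arbitrarily large, set of columns.
put("user:1", "msg:inbox", "hello", timestamp=1)
put("user:1", "msg:inbox", "hello again", timestamp=5)
put("user:1", "clicks:home", 7, timestamp=2)

print(get("user:1", "msg:inbox"))        # newest version
print(get("user:1", "msg:inbox", at=3))  # version as of timestamp 3
```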
Big Data
Lineage: Google GFS & Map/Reduce
Distributed data storage and processing
Advanced analytics capabilities on raw data
Schema on read
Products:
– Hadoop
– MPP databases
Use cases:
– Ad hoc querying of terabytes of data
– Data science: predictive analytics, model training
– Calculating recommendations
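"Schema on read" means the raw data is stored as it arrives and a structure is applied only when a query reads it. The sketch below illustrates the idea on a few JSON lines. The sample events, the field names and the `read_with_schema` helper are assumptions made for the example; on the actual platform this role would be played by tools such as Hive on top of Hadoop.

```python
import json

# Raw events are stored exactly as they arrived (hypothetical sample data).
raw_lines = [
    '{"ts": "2015-04-24T10:00:00", "page": "/home", "user": "u1"}',
    '{"ts": "2015-04-24T10:00:05", "page": "/sport", "user": "u2", "ab_variant": "B"}',
    "not valid json",
]


def read_with_schema(lines, fields):
    """Apply a schema while reading: keep `fields`, default missing ones to None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that do not parse under this schema
        if not isinstance(record, dict):
            continue
        yield {field: record.get(field) for field in fields}


# A new question only needs a new field list, not a reload of the data.
rows = list(read_with_schema(raw_lines, ["page", "ab_variant"]))
print(rows)
```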
Big Data at Sanoma
Main use case is reporting and analytics, moving to
data science
A/B and MVT (multivariate) test evaluations
Using QlikView as a front-end
Supply data to other environments (SAS,
Advertising, Behavioral Targeting)
Agile process for adding sources, from raw to
intermediate to modeled data warehouse
Sanoma standard data platform, used in all Sanoma
countries
> 250 dashboard users
40 daily users: analysts & developers
43 source systems, with 125 different sources
400 tables in Hive
Platform:
– Cloudera Hadoop
– 40-60 nodes
– > 400TB storage
– ~2000 jobs/day
Typical data node / task tracker:
– 1-2 CPUs, 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
Search
Keyword search can be combined with
advanced forms of ranking the results
Most of the fields go to an index
Facets can be used for analytics
Ranker can be replaced with custom logic
Products:
– Solr
– ElasticSearch
– MarkLogic
Use cases:
– Content Search
– Analytics / Faceted
– Percolation
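The remark that facets can be used for analytics can be shown with a very small example: run a keyword search, then count the results per field value. This is a naive sketch, not how Solr or ElasticSearch work internally. The documents and the `search` and `facet_counts` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical mini index; a real search engine would hold these documents.
docs = [
    {"title": "Cup final report", "section": "sport", "year": 2015},
    {"title": "Cup final preview", "section": "sport", "year": 2014},
    {"title": "Final budget vote", "section": "politics", "year": 2015},
]


def search(keyword):
    """Naive keyword search: case-insensitive match on whole title words."""
    keyword = keyword.lower()
    return [doc for doc in docs if keyword in doc["title"].lower().split()]


def facet_counts(results, field):
    """Count how many results fall under each value of `field`."""
    return dict(Counter(doc[field] for doc in results))


hits = search("final")
print(facet_counts(hits, "section"))
print(facet_counts(hits, "year"))
```

The facet counts are effectively a group-by over the result set, which is why the same mechanism doubles as a simple analytics tool.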
Search - Percolation
Traditional queries: run against an index with existing data
What if the data does not exist at time of query?
Percolation allows queries to be registered in advance; when a new document arrives, the IDs of the matching queries are returned, e.g. for notification when
new matches are available
Use case:
– Search for a tweet, but after the initial results continuously
get newly tweeted items when they come in
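Percolation reverses the usual roles: the queries are stored and each new document is matched against them. The sketch below shows that idea in its simplest form, assuming a query matches when all of its words occur in the document. `Percolator` and its methods are hypothetical names for this illustration and not the ElasticSearch percolator API.

```python
class Percolator:
    """Toy percolator: stores queries, matches each new document against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required lowercase terms

    def register(self, query_id, text):
        """Store a query before any matching data exists."""
        self.queries[query_id] = set(text.lower().split())

    def percolate(self, document):
        """Return the ids of all registered queries whose terms all occur in `document`."""
        words = set(document.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= words)


percolator = Percolator()
percolator.register("q1", "nosql media")
percolator.register("q2", "hadoop")

# Each incoming tweet is checked against the stored queries.
print(percolator.percolate("New talk about NoSQL in media"))
print(percolator.percolate("Hadoop and NoSQL for media"))
```

The returned query IDs can then be used to notify whoever registered the query that a new match has arrived.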
Graph databases
Lineage: Euler and graph theory.
Data model: nodes & edges, both of which can
hold key-value pairs
Products:
– AllegroGraph
– InfoGrid
– Neo4j
Use cases:
– Social relationships
– Content Linking (Entity linking)
[Figure: example graph with the nodes Jan Smit, 3js, Nick en Simon, Volendam, Article 1, Article 2 and Article 3]
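Content linking with a graph can be sketched with a plain adjacency list: articles and entities are nodes, and an edge means "this article mentions this entity". The node names come from the example graph, but which article is linked to which entity is not recoverable from the transcript, so the edges below are assumptions made purely for illustration. `related_articles` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed edges: which article mentions which entity is invented for this sketch.
edges = [
    ("Article 1", "Jan Smit"),
    ("Article 1", "Volendam"),
    ("Article 2", "Nick en Simon"),
    ("Article 2", "Volendam"),
    ("Article 3", "3js"),
]

# Undirected adjacency list: node -> set of neighbouring nodes.
graph = defaultdict(set)
for article, entity in edges:
    graph[article].add(entity)
    graph[entity].add(article)


def related_articles(article):
    """Articles that share at least one entity with `article`."""
    related = set()
    for entity in graph.get(article, set()):
        related |= graph[entity]
    related.discard(article)
    return sorted(related)


print(related_articles("Article 1"))
```

A graph database such as Neo4j would answer the same question with a two-hop traversal and could also store key-value properties on the nodes and edges.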
Blob storage
Endless storage of binary data
Storing objects larger than a single machine can hold
“Lower” price/GB compared to SAN storage
Products:
– Amazon S3
– CAStor
– (Hadoop)
Use cases:
– Media storage
– Archiving
Summary
RDBMS systems are good enough for many problems
For specific problems, NOSQL solutions provide a specific solution
There’s a variety of NOSQL solutions with different characteristics
NOSQL solutions will require a higher engineering effort
Dream NO SQL Architecture – Content Delivery
[Figure: content delivery architecture with the following components]
– CMS
– Document storage (MongoDB/CouchDB)
– Blob storage (S3/CAStor)
– Search (ElasticSearch/Solr)
– Website / Mobile Application
Dream NO SQL Architecture - Analytics
[Figure: analytics architecture with the following components]
– Event collection
– Message queue (Kafka / Flume)
– Event processing (Storm)
– Key-value store (Redis)
– Real-time recommendations / targeting
– Column storage (Cassandra/HBase)
– Real-time dashboarding
– Big Data (Hadoop)
– Ad hoc reporting & data science
CAP Theorem
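Products placed along the sides of the CAP triangle:
Consistency + Availability (CA):
– MySQL, Postgres, MS SQL, Oracle
– Aster Data, Greenplum, Vertica
Availability + Partition tolerance (AP):
– Dynamo, Voldemort, Tokyo Cabinet, KAI
– Cassandra, SimpleDB, CouchDB, Riak
Consistency + Partition tolerance (CP):
– BigTable, Hypertable, HBase
– MongoDB, Terrastore, Scalaris
– Berkeley DB, MemcacheDB, Redis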
Data models
Relational databases
Key-value
Column-oriented
Document-oriented