Post on 07-Dec-2014
NoSql at guardian.co.uk
Matthew Wall, Simon Willison
Not only SQL
Guardian journalism online: 1995
Guardian journalism online: 1999
Guardian journalism online: 2000
Guardian journalism online: 2010
Read all about it!
I bring you NEWS!!!
[Architecture diagram. Labels: web servers, app servers, CMS, data feeds, Oracle, Memcached (20Gb)]
Why RDBMS?
5 years ago, fewer alternatives
Understand operations procedures
Can easily recruit DBAs / devs
Developer/ops tools
Business critical system: a safe choice
Related content from search engine
Introduction of memcached
Big traffic spike
Related content from search engine
Distributed memcached
Protects database from peak load
Entities explicitly decached
Queries given TTL
memcached = database supercharger
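A minimal sketch of the two caching rules on this slide, assuming an in-process stand-in for memcached. `TinyCache`, `load_article`, `save_article`, `latest_articles` and the key names are hypothetical and are not taken from the Guardian's code. Entities are cached without expiry and explicitly decached on write; query results are cached with a TTL.

```python
import time


class TinyCache:
    """In-process stand-in for a memcached client (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # TTL elapsed: treat as a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


DATABASE = {"article:1": "first draft"}  # hypothetical stand-in for the RDBMS
DB_READS = []  # records every read that reaches the database
cache = TinyCache()


def load_article(key):
    """Entity read: served from the cache when possible, no expiry."""
    value = cache.get(key)
    if value is None:
        DB_READS.append(key)
        value = DATABASE[key]
        cache.set(key, value)
    return value


def save_article(key, value):
    """Entity write: update the database, then explicitly decache."""
    DATABASE[key] = value
    cache.delete(key)


def latest_articles(ttl=60):
    """Query result: cached under a TTL instead of being decached."""
    ids = cache.get("query:latest")
    if ids is None:
        DB_READS.append("query:latest")
        ids = sorted(DATABASE)
        cache.set("query:latest", ids, ttl=ttl)
    return ids
```

Repeated calls to `load_article` reach the database once until `save_article` decaches the entity, which is how the cache shields the database at peak load.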
Now we have a stable “broadcast” platform
We know how to scale it
SQL running effectively at core
We’ve finished, right?
Digital journalism is changing
We can’t cover everything
We can’t compete with everyone
Need to be “part of the web” not just “on the web”
Mutualise the news!
Mutualised news!
Mutualisation of journalism
No longer only broadcasting content
User engagement & contribution: journalism, data, software
Data curation / linked data
Support engaged developers with data and APIs
Be a part of the data fabric of the internet
Platform strategy
Out: Release our data to the world via APIs
In: Rapidly build new functionality outside the core
Write: Ingest, store & present arbitrary data
Data Out
Content API
Content API
Delivered using Apache Solr
Document oriented search engine
Loose schema: records, fields, facets
Fields can be multi-value
Supports dynamic field generation
Can apply multiple facets in queries faster than RDBMS
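As a rough illustration of a faceted query against Solr, the sketch below builds a select URL. The parameters `q`, `fq`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters; the host, the core name `content` and the field names `section`, `tag` and `contributor` are assumptions made for the example and do not describe the real Content API schema.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, filters, facet_fields, rows=10):
    """Return a Solr select URL with filter queries and facet fields."""
    params = [
        ("q", text),          # full text query
        ("rows", str(rows)),  # number of documents to return
        ("wt", "json"),       # response format
        ("facet", "true"),    # switch faceting on
    ]
    # Each filter narrows the result set, much like a WHERE clause.
    for field, value in filters.items():
        params.append(("fq", f'{field}:"{value}"'))
    # Each facet field asks Solr for value counts over the matching documents.
    for field in facet_fields:
        params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = build_facet_query(
    "http://localhost:8983/solr/content",
    "election",
    {"section": "politics"},
    ["tag", "contributor"],
)
```

Several facet fields can be requested in one call, which is the multi-facet case the slide compares with an RDBMS.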
Is Solr a database?
Can perform complex queries, including full text search
Can filter results with facets (WHERE clause)
ANYTHING can be a facet. Very powerful.
On our dataset most queries are of a similar cost
Scales very well horizontally
Handles millions of documents
No transactions
Excellent for certain types of queries
Not truly general purpose
Schema design very important
Search index not really persistence
[Architecture diagram. Labels: web servers, app server, CMS, Memcached (20Gb), rdbms, Core, M/Q, Solr (several instances), Cloud/EC2, API]
API
Currently powering iPad app
Site components
External applications
Editors' tools
More to follow
Data In
Application framework
Application framework
Simple REST/HTTP framework allows lightweight development
Applications proxied for performance
Apps generally hosted in the cloud, hot deployment into production
No RDBMS provided for storage
Can develop in news timeline
[Architecture diagram. Labels: web servers, app server, CMS, Memcached (20Gb), rdbms, Core, M/Q, Proxy, Apps (several instances), external hosting (App Engine etc.)]
NoSQL for journalism
Some useful characteristics
• Scale down as well as up
• Support rapid production-ready prototyping: turn projects around in hours or days
• Handle massive traffic spikes
Desktop analysis
• Leaked BNP membership list
• Load postcodes to constituencies mapping into Redis
• Generate heatmaps by looking up all 12,000 postcodes
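The lookup step above can be sketched as a key/value mapping plus a tally. This is a sketch under stated assumptions: a plain dict stands in for Redis (where each lookup would be a `GET` on the postcode key), and the postcodes and constituency names are made-up placeholders, not data from the analysis.

```python
from collections import Counter

# Placeholder mapping for illustration; the real mapping was loaded into Redis.
POSTCODE_TO_CONSTITUENCY = {
    "AA1 1AA": "Constituency A",
    "AA1 2BB": "Constituency A",
    "BB2 3CC": "Constituency B",
}


def normalise(postcode):
    """Upper-case and collapse whitespace so keys compare consistently."""
    return " ".join(postcode.upper().split())


def constituency_counts(postcodes, lookup):
    """Count postcodes per constituency and collect any that have no mapping."""
    counts = Counter()
    missing = []
    for raw in postcodes:
        constituency = lookup.get(normalise(raw))
        if constituency is None:
            missing.append(raw)
        else:
            counts[constituency] += 1
    return counts, missing


counts, missing = constituency_counts(
    ["aa1 1aa", "AA1  2BB", "BB2 3CC", "ZZ9 9ZZ"],
    POSTCODE_TO_CONSTITUENCY,
)
```

The per-constituency counts are the values a heatmap would shade by.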
MP’s expenses
SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()
v2 used Redis
Set difference: Labour MP pages - reviewed pages
SRANDMEMBER
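A minimal sketch of the v2 idea, using Python sets in place of Redis sets. In Redis the difference would come from `SDIFF` (or `SDIFFSTORE`) and the random pick from `SRANDMEMBER`; the function name and page ids here are hypothetical. Unlike `ORDER BY RAND()`, nothing has to be sorted just to pick one page.

```python
import random


def pick_unreviewed(candidate_pages, reviewed_pages, rng=random):
    """Return one random page that has not been reviewed, or None if none remain."""
    remaining = candidate_pages - reviewed_pages  # set difference, as SDIFF does
    if not remaining:
        return None
    # random.choice needs a sequence, so turn the set into a list first.
    return rng.choice(sorted(remaining))  # random member, as SRANDMEMBER does


labour_mp_pages = {"page:1", "page:2", "page:3"}  # hypothetical page ids
reviewed_pages = {"page:1", "page:3"}
next_page = pick_unreviewed(labour_mp_pages, reviewed_pages)
```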
BigTable: Zeitgeist
Zeitgeist stores pre-calculated results in BigTable
• Data comes in from stats system, comments system and OneRiot real-time search API
• AppEngine cron tasks populate task queues
• Task queues recalculate hotness levels
• “Live” BigTable queries are simple SELECT / SORT
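The split between background recalculation and simple live reads can be sketched as below. The slides do not give the scoring formula, so the signal names and weights are purely hypothetical; only the shape (precompute a score per item, then sort at read time) follows the description above.

```python
# Hypothetical weights; the real hotness formula is not described in the slides.
WEIGHTS = {"page_views": 1.0, "comments": 5.0, "mentions": 3.0}


def recalculate_hotness(signals, weights=WEIGHTS):
    """Background step: collapse raw signals into one stored score per item."""
    return {
        item: sum(weights[name] * count for name, count in counts.items())
        for item, counts in signals.items()
    }


def hottest(stored_scores, limit=10):
    """Live step: a plain sort over scores that were calculated earlier."""
    return sorted(stored_scores, key=stored_scores.get, reverse=True)[:limit]


# Made-up sample signals for two items.
signals = {
    "article-a": {"page_views": 10, "comments": 0, "mentions": 0},
    "article-b": {"page_views": 1, "comments": 3, "mentions": 0},
}
scores = recalculate_hotness(signals)
ranking = hottest(scores)
```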
Live debate poll
• Over a million votes cast in an hour
• Stretched limits of BigTable / AppEngine
• Sharded counter pattern to handle writes
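A minimal in-memory sketch of the sharded counter pattern, assuming a list of integers in place of datastore entities. On App Engine each shard would be a separate entity, so concurrent writes spread across shards rather than contending on one row; the class name and shard count here are illustrative assumptions.

```python
import random


class ShardedCounter:
    """Spread increments over several shards; sum the shards to read."""

    def __init__(self, num_shards=20, rng=random):
        self._shards = [0] * num_shards
        self._rng = rng

    def increment(self, amount=1):
        # Each write touches one randomly chosen shard.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reads are more expensive: every shard has to be summed.
        return sum(self._shards)


votes = ShardedCounter(num_shards=8)
for _ in range(1000):
    votes.increment()
```

The trade-off is cheap, low-contention writes in exchange for a read that touches every shard, which suits a poll where votes far outnumber tallies.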
Spreadsheets are NoSQL too...
Google Docs powered infographics
The Datablog
• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets
• Retrieve data as CSV, XLS, JSON, Atom...
• “Make a copy” and run your own analysis
Write
Arbitrary data
Create schema-free database alongside RDBMS
Index in Solr
Provide access in API
Investigating: CouchDB
[Architecture diagram. Labels: web servers, app server, CMS, data feeds, Memcached (20Gb), rdbms, CouchDB?, Core, M/Q, Solr (several instances), Cloud/EC2, Out, In, Apps (several instances), Proxy, external hosting (App Engine etc.)]