Learning Lessons: Building a CMS on top of NoSQL technologies
Uploaded by: ngdata · Category: Technology
Transcript of Learning Lessons: Building a CMS on top of NoSQL technologies
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
Learning Lessons: Building a content repository on top of NoSQL Technologies
Hello, I’m @stevenn from @outerthought
This story is about
Complexity
[Chart: axes labeled complexity and age, releases 1.0, 2.0 and 3.0 marked; curve labeled software architecture]
Complexity
[Chart: axes labeled complexity and age, releases 1.0, 2.0 and 3.0 marked; curve labeled user interest]
We Prefer Sophistication
» the challenge for us was to scale ... without dropping features
The typical CMS ‘architecture’
[Diagram, built up over six slides:
 database (+ opt. filesystem) (+ opt. full-text indexes);
 then application and cache on top of it;
 then more cache;
 then even more cache;
 then client;
 finally client (+ cache)]
What we found hard to scale
» access control
» facet browsing
» all the nifty stuff people were using our software for
» ... anything that required random access to in-memory cached data for computations
Beyond the ‘scaling’ problem
» three-prong data layer (MySQL, Lucene, fs)
» result set merging (between MySQL & Lucene)
  » happened in app code/memory
» ‘transactions’, set operations = hard
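To make the result set merging point concrete, here is a minimal sketch of what merging in application memory can look like. The function name and the sample data are hypothetical; the two inputs only stand in for what a MySQL query and a Lucene query would return in the setup described above.

```python
def merge_results(sql_ids, scored_hits):
    """Keep full-text hits that also match the structured query, best score first."""
    allowed = set(sql_ids)  # the whole structured result is held in memory
    merged = [(doc_id, score) for doc_id, score in scored_hits if doc_id in allowed]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


# Hypothetical inputs: document ids from SQL, (id, score) pairs from full-text search.
sql_ids = [3, 7, 8, 12]
fulltext_hits = [(12, 0.4), (5, 0.9), (7, 0.8)]

merged = merge_results(sql_ids, fulltext_hits)
print(merged)  # [(7, 0.8), (12, 0.4)]
```

Both result sets have to be fully materialized before they can be intersected, which is one reason this kind of merge gets harder as the data grows.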
Beyond the three-prong problem
» errrr..... “Failover” ..... ?
If we were able to add more nodes ...
» True Distribution
» scalability, availability, performance ... in the line of fire
Solution 1
» do MORE inside the database
Functional
Infrastructural
more database!
even more database!
let’s add message busses!
RMI! JMS over JDBC! stuff!
w00t!
http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html
Business Development 101
[Chart labels: budget, user interest]
Solution II
[Chart labels: sophistication, ability to cope; releases 1.0, 2.0 and 3.0 marked; curves labeled mysql and nosql?]
Enter The Cambrian Explosion
[Image labels: NoSQL, Cassandra, neo4j]
Requirements, phase I
» automatic scaling to large data sets
» fault-tolerance: replication, automatic handling of failing nodes
» a flexible data model supporting sparse data
» runs on commodity hardware
» efficient random access to data
» open source, with the ability to participate in development and thus help drive the direction of the project
» some preference for a Java-based solution
Requirements, phase II
» After careful consideration, we realized the important choices were also:
  » consistency: no chance of having two conflicting versions of a row
  » atomic updates of a single row, single-row transactions
  » bonus points for MapReduce integration
    » e.g. full-text index rebuilding
That brought us to HBase, which bought us:
» a data model in which some column families keep all versions and others do not, which fits our CMS document model very well
» ordered tables with the ability to do range scans on them, which allows building scalable indexes on top of them
» HDFS, a convenient place to store large blobs
» Apache license and community, a familiar environment for us
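As an illustration of the first two points, here is a toy in-memory model of a table with sorted row keys and per-column-family version limits. It is a sketch only and does not use the HBase API; the class, its methods, and the family names "versioned" and "latest" are made up for this example.

```python
import bisect


class TinyTable:
    """Toy model of a sorted, versioned table (not the HBase API)."""

    def __init__(self, max_versions):
        # Maps a column family name to how many versions it keeps;
        # None means "keep every version".
        self.max_versions = max_versions
        self.rows = {}  # rowkey -> {(family, qualifier): [(timestamp, value), ...]}
        self.keys = []  # rowkeys kept sorted, so range scans need no full pass

    def put(self, rowkey, family, qualifier, timestamp, value):
        if rowkey not in self.rows:
            bisect.insort(self.keys, rowkey)
            self.rows[rowkey] = {}
        cells = self.rows[rowkey].setdefault((family, qualifier), [])
        cells.append((timestamp, value))
        cells.sort(reverse=True)  # newest version first
        limit = self.max_versions[family]
        if limit is not None:
            del cells[limit:]  # drop versions beyond the family's limit

    def versions(self, rowkey, family, qualifier):
        return [value for _, value in self.rows[rowkey][(family, qualifier)]]

    def scan(self, start, stop):
        """Return the rowkeys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.keys[lo:hi]


table = TinyTable({"versioned": None, "latest": 1})
table.put("doc-2", "versioned", "title", 1, "Draft")
table.put("doc-2", "versioned", "title", 2, "Final")
table.put("doc-2", "latest", "state", 1, "new")
table.put("doc-2", "latest", "state", 2, "published")
table.put("doc-1", "latest", "state", 1, "new")
table.put("page-1", "latest", "state", 1, "new")

title_history = table.versions("doc-2", "versioned", "title")
state_history = table.versions("doc-2", "latest", "state")
doc_keys = table.scan("doc-", "doc.")  # "." sorts right after "-"
```

Here `title_history` keeps both values while `state_history` keeps only the newest one, and the scan returns just the keys that start with `doc-`.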
» OK, so now we had a data store!
» However, content repository = store + search
ouch!
That was easy! (however ...)
Search ponderings
» CMS = two types of search
  » structured search
    » numbers, strings
    » based on logic (SQL, anyone?)
  » information retrieval (or: full-text search)
    » text
    » based on statistics
» All of that, at scale
Structured Search
» HBase Indexing Library
  » idea from Google App Engine datastore indexes
  » http://code.google.com/appengine/articles/index_building.html
content table
rowkey   col    col
A        val3   foo6
B        val2   foo7

index table (rowkeys kept in sorted order)
rowkey
val2-B
val3-A
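The index table idea can be sketched in a few lines: each index row key is the indexed value followed by the content row key, and a lookup is a range scan over the sorted index keys. This is not the HBase Indexing Library itself; the function names and the column names `col1` and `col2` are assumptions, since the slide only labels the columns "col".

```python
import bisect


def build_index(content, column):
    """Return sorted index rowkeys of the form '<value>-<content rowkey>'."""
    return sorted(f"{row[column]}-{rowkey}" for rowkey, row in content.items())


def lookup(index, value):
    """Range scan: collect content rowkeys of all index entries starting with '<value>-'."""
    prefix = f"{value}-"
    start = bisect.bisect_left(index, prefix)  # first key >= prefix
    matches = []
    for key in index[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the range has ended
        matches.append(key[len(prefix):])
    return matches


content = {
    "A": {"col1": "val3", "col2": "foo6"},
    "B": {"col1": "val2", "col2": "foo7"},
}
index = build_index(content, "col1")
print(index)                  # ['val2-B', 'val3-A']
print(lookup(index, "val3"))  # ['A']
```

Plain string concatenation is a simplification: it breaks when values themselves contain the separator or when numbers must sort numerically, so a real index needs an order-preserving key encoding.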
Full-text / IR search
» Lucene?
» no sharding (for scale)
» no replication (for availability)
» batched index updates (not real-time)
Beyond Lucene
» Katta
  » scalable architecture, however only search, no indexing
» Elastic Search
  » very young (sorry)
» hbasene et al.
  » stores inverted index in HBase, might not scale all features
» SOLR
  » widely used, schema, facets, query syntax, cloud branch
More info: http://lilycms.org/lily/prerelease/technology.html
+ ? = Easy! Or?
Remember distribution? Remember secondary indexes?
➙ Need for reliable queuing
Connecting things
» we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)
  » indexing, reindexing, mass reindexing (M/R)
» we needed a reliable method of updating HBase secondary indexes
  » all of that eventually to run distributed
  » distribution means coping with failure
Solution
» ACMEMessageQueue? Bzzzzzt. We wanted fault-safe HBase persistence for the queues, and also ease of administration.
» ➙ WAL & Queue implemented on top of HBase tables
WAL / Queue
» WAL
  » guaranteed execution of synchronous actions
  » call doesn’t return before secondary action finishes
  » e.g. updating secondary indexes
  » if all goes well, size = #concurrent ops
  » will be useful/made available outside of Lily context as well!
» Queue
  » triggering of async actions
  » e.g. (re)index (updated) record with SOLR back-end
  » size depends on speed of back-end process
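The division of labour between the WAL and the queue can be sketched with a toy in-memory store. This is not Lily's implementation, which keeps both structures in HBase tables; the class, method names, and recovery logic below are assumptions made for illustration.

```python
from collections import deque


class Store:
    """Toy model: a WAL for synchronous secondary actions, a queue for async ones."""

    def __init__(self):
        self.records = {}
        self.secondary_index = {}  # value -> set of record ids
        self.wal = {}              # op id -> (record id, value) still in flight
        self.queue = deque()       # record ids waiting for (re)indexing
        self.next_op = 0

    def _update_index(self, record_id, value):
        for ids in self.secondary_index.values():
            ids.discard(record_id)  # remove any stale entry first
        self.secondary_index.setdefault(value, set()).add(record_id)

    def put(self, record_id, value):
        op_id = self.next_op
        self.next_op += 1
        self.wal[op_id] = (record_id, value)   # 1. log the intent
        self.records[record_id] = value        # 2. primary write
        self._update_index(record_id, value)   # 3. secondary action, before returning
        del self.wal[op_id]                    # 4. WAL only holds in-flight operations
        self.queue.append(record_id)           # 5. async work is merely enqueued

    def recover(self):
        """Replay WAL entries left behind by a crash between steps 1 and 4."""
        for op_id in sorted(self.wal):
            record_id, value = self.wal.pop(op_id)
            self.records[record_id] = value
            self._update_index(record_id, value)
            self.queue.append(record_id)

    def drain(self, index_fn):
        """Stand-in for the back-end consumer, e.g. a process feeding a search index."""
        while self.queue:
            index_fn(self.queue.popleft())


store = Store()
store.put("r1", "x")
wal_size_after_put = len(store.wal)   # empty again once the call has returned

store.wal[99] = ("r2", "y")           # simulate an operation interrupted by a crash
store.recover()

indexed = []
store.drain(indexed.append)
```

The WAL stays small because entries live only for the duration of a call, while the queue grows or shrinks with the speed of whatever consumes it.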
The Sum
» Lily model (records & fields)
» mapped onto HBase (= storage)
» indexed and searchable through SOLR
» using a WAL/Queue mechanism implemented in HBase
» runtime based on Kauri
» with client/server comms via Avro
Architecture
Roadmap
» Today = release of learning material (architecture, model, API, Javadoc)
  ➥ www.lilycms.org
  ➥ bit.ly/lilyprerelease
» Mid July = ‘proof of architecture’ release
  » from there on, ca. 3-monthly releases leading up to Lily 1.0
Nearly there!
bit.ly/lilyprerelease
License
» Apache
Business model
» Consulting, mentoring, turn-key projects
  » Strong focus on partner relations
  » targeting vertical markets
  » geographic coverage
  » SaaS offerings
» Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data
» Not: OLAP
More ?
» @outerthought
» www.lilycms.org/lily/prerelease.html