Tagging schema design for high performance
Plan
▪ Tagging basics
▪ Database challenges
▪ Tagging solutions
▪ Pros and cons
▪ Q&A session
Tagging terms
• A tag is a non-hierarchical keyword or term assigned to a piece of information
• Tags are generally chosen informally and personally by the item's creator or by its viewer
• If tags are assigned by the creator and are limited, it is a taxonomy
• If tags are assigned by the viewer and are unlimited, it is a folksonomy
• Widely used since 2003 by the Flickr and Delicious websites
• Tags are usually shown inline as well as in a tag cloud
Tagging challenges
+
1. The vocabulary used reflects the user's vocabulary directly
2. Flexibility: the user can add or remove tags
3. Multi-dimensional nature: users can assign any number and combination of tags to express a concept
lead to
-
4. Specialized tags or tags that mean nothing to anyone but their author, misspellings, singular/plural forms, compound words
5. Tags that are often ambiguous, overly personalized, or poorly applied
6. Synonyms, acronyms, and homonyms, which aren't handled well
Database challenges
1. Performance
2. Query awkwardness
3. Database size
4. Housekeeping
Highly normalized approach
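The slide's diagram is not preserved here, so the following is only a minimal sketch of what a normalized tag schema commonly looks like: one table for posts, one for tags, and a junction table between them. The table names (`posts`, `tags`, `post_tags`), the sample rows, and the use of SQLite through Python's `sqlite3` module are assumptions made for illustration, not the schema used in the talk.

```python
import sqlite3

# Hypothetical three-table layout; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL REFERENCES posts(id),
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (post_id, tag_id)
    );
    -- Second index so lookups that start from a tag do not scan the table.
    CREATE INDEX post_tags_by_tag ON post_tags (tag_id, post_id);
""")

def add_post(post_id, title, tag_names):
    """Insert a post, creating any tags that do not exist yet."""
    conn.execute("INSERT INTO posts (id, title) VALUES (?, ?)", (post_id, title))
    for name in set(tag_names):
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        (tag_id,) = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO post_tags (post_id, tag_id) VALUES (?, ?)",
            (post_id, tag_id),
        )

def posts_with_all_tags(tag_names):
    """AND search: ids of posts that carry every tag in tag_names."""
    placeholders = ",".join("?" * len(tag_names))
    rows = conn.execute(
        f"""SELECT pt.post_id
            FROM post_tags pt JOIN tags t ON t.id = pt.tag_id
            WHERE t.name IN ({placeholders})
            GROUP BY pt.post_id
            HAVING COUNT(DISTINCT t.name) = ?
            ORDER BY pt.post_id""",
        (*tag_names, len(set(tag_names))),
    )
    return [row[0] for row in rows]

add_post(1, "Hashing GUIDs", ["php", "mysql", "guid"])
add_post(2, "Apache setup", ["php", "apache2"])
add_post(3, "Index tuning", ["mysql"])
print(posts_with_all_tags(["php", "mysql"]))
```

The join plus `GROUP BY ... HAVING` in the AND search is one example of the query awkwardness listed under database challenges.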
Denormalized approach
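As a rough illustration of the denormalized idea, the sketch below keeps all tags of a post in a single text column of the post row, using an angle-bracket packing such as `<php><mysql>`. The table name, column names, helper functions, and the choice of SQLite via Python's `sqlite3` are assumptions for the example only.

```python
import sqlite3

# Hypothetical single-table layout: tags live in one text column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, tags TEXT NOT NULL)"
)

def pack_tags(tag_names):
    """Pack tags as '<a><b>' so each tag has unambiguous delimiters."""
    return "".join(f"<{name}>" for name in sorted(set(tag_names)))

def add_post(post_id, title, tag_names):
    conn.execute(
        "INSERT INTO posts (id, title, tags) VALUES (?, ?, ?)",
        (post_id, title, pack_tags(tag_names)),
    )

def tags_of(post_id):
    """Retrieve all tags of one post: a single-row read, no join."""
    (packed,) = conn.execute(
        "SELECT tags FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return packed.strip("<>").split("><") if packed else []

def posts_with_all_tags(tag_names):
    """AND search via substring tests; expects at least one tag.

    Without extra indexing this scans every row of the table.
    """
    where = " AND ".join("instr(tags, ?) > 0" for _ in tag_names)
    rows = conn.execute(
        f"SELECT id FROM posts WHERE {where} ORDER BY id",
        [f"<{name}>" for name in tag_names],
    )
    return [row[0] for row in rows]

add_post(1, "Hashing GUIDs", ["php", "mysql", "guid"])
add_post(2, "Apache setup", ["php5", "apache2"])
print(tags_of(1), posts_with_all_tags(["php"]))
```

Matching on `<php>` with its delimiters, rather than on the bare word, keeps a search for `php` from also matching `php5`.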
Complex data type approach
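Some databases offer a native collection column for this; PostgreSQL, for example, has array types with the `@>` containment operator. SQLite has no array type, so the sketch below only emulates the idea: tags are stored as a JSON array in a text column, and containment is checked by a user-defined SQL function. The function name `array_contains`, the table layout, and the sample rows are assumptions for illustration.

```python
import json
import sqlite3

def array_contains(array_json, wanted_json):
    """Return 1 if every element of wanted_json occurs in array_json."""
    return int(set(json.loads(wanted_json)) <= set(json.loads(array_json)))

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("array_contains", 2, array_contains)
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")

def add_post(post_id, tag_names):
    conn.execute(
        "INSERT INTO posts (id, tags) VALUES (?, ?)",
        (post_id, json.dumps(sorted(set(tag_names)))),
    )

def tags_of(post_id):
    (tags_json,) = conn.execute(
        "SELECT tags FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return json.loads(tags_json)

def posts_with_all_tags(tag_names):
    """AND search expressed as a single containment test per row."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE array_contains(tags, ?) ORDER BY id",
        (json.dumps(list(tag_names)),),
    )
    return [row[0] for row in rows]

add_post(1, ["php", "mysql", "guid"])
add_post(2, ["php", "apache2"])
print(posts_with_all_tags(["php", "mysql"]))
```

A real array column can usually be indexed by the database, which this emulation cannot do; it is meant to show the shape of the queries, not their performance.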
Full-text-search oriented solutions
Stack Overflow: <php><mysql><guid><encryption>
JSON: {"tags": ["php", "apache2", "openinviter"]}
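Both notations pack a post's tag list into one text value. A small sketch of converting between them and a plain list follows; the function names are hypothetical and chosen only for this example.

```python
import json
import re

def parse_angle_tags(packed):
    """Split a '<php><mysql>' style string into a list of tag names."""
    return re.findall(r"<([^<>]+)>", packed)

def to_json_document(tag_names):
    """Serialize tags in the {"tags": [...]} shape."""
    return json.dumps({"tags": list(tag_names)})

def from_json_document(document):
    """Read the tag list back out of a {"tags": [...]} document."""
    return json.loads(document)["tags"]

tags = parse_angle_tags("<php><mysql><guid><encryption>")
print(tags)
print(to_json_document(tags))
```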
Full-text-search approaches
• Approach 1: FTS inside the DB. Application server → DB + FTS model
• Approach 2: external FTS server (Lucene, Sphinx, Elastic, Solr, Xapian, etc.). Application server → FTS server, with a relational/denormalized/FTS model in the DB
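Full-text engines such as Lucene are built around an inverted index: a map from each term to the set of documents that contain it. The toy in-memory version below is only a sketch of that idea applied to tags; the class and method names are hypothetical, and it does not reproduce the internals of any engine listed above.

```python
from collections import defaultdict

class TagIndex:
    """Toy inverted index: tag name -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, tag_names):
        for name in tag_names:
            self._postings[name.lower()].add(doc_id)

    def search_all(self, tag_names):
        """AND search: intersect the posting sets of all requested tags."""
        sets = [self._postings.get(name.lower(), set()) for name in tag_names]
        if not sets:
            return []
        # Starting from the smallest set keeps intermediate results small.
        sets.sort(key=len)
        return sorted(set.intersection(*sets))

index = TagIndex()
index.add(1, ["php", "mysql", "guid"])
index.add(2, ["php", "apache2"])
index.add(3, ["mysql"])
print(index.search_all(["php", "mysql"]))
```

An AND query becomes a set intersection rather than a join or a row-by-row string test.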
Housekeeping
Denormalized/FTS
1. Change all affected tags in all documents if a tag name changes
FTS
1. Rebuild the FTS index when it becomes fragmented
2. Refresh the FTS index if it isn't refreshed on COMMIT
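To make the first housekeeping item concrete, here is a minimal sketch of renaming a tag when tags are stored inside each document as a packed string. The `posts` table, the `<tag>` packing, and the `rename_tag` helper are assumptions for this example, run against SQLite through Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO posts (id, tags) VALUES (?, ?)",
    [(1, "<php><mysql>"), (2, "<php5><apache2>"), (3, "<mysql>")],
)

def rename_tag(old, new):
    """Rewrite every document that carries the old tag; return rows changed.

    The delimiters are part of the match, so renaming 'php' leaves 'php5'
    alone. A document that already carries the new tag would end up with a
    duplicate, which this sketch does not handle.
    """
    with conn:  # one transaction: all documents change or none do
        cursor = conn.execute(
            "UPDATE posts SET tags = replace(tags, ?, ?) WHERE instr(tags, ?) > 0",
            (f"<{old}>", f"<{new}>", f"<{old}>"),
        )
    return cursor.rowcount

print(rename_tag("php", "php-lang"))
print(conn.execute("SELECT id, tags FROM posts ORDER BY id").fetchall())
```

In a normalized schema the same rename is a single-row update in the tags table, which is why this housekeeping item is listed only for the denormalized and FTS models.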
Test example
StackOverflow posts via http://data.stackexchange.com/
From 31/07/2008 to 21/12/2012
Posts: 2 680 474
Applied tags: 7 791 527
Used unique tags: 30 485
Max tags count for a post: 5
Comparison
Initial population time
Model               Insert time, seconds
Relational          1048
Denormalized        1205
Complex data type   2086
Full text search    1950
[Bar chart: insert time by model, seconds]
Comparison
DB size
Model               Size total, MB   Data size, MB   Index size, MB
Relational          1166             338             828
Denormalized        1080             376             704
Complex data type   1134             256             878
Full text search    1055             416             639
[Bar chart: index size, data size, and total size by model, MB]
Comparison
Search by document ID, retrieving all tags
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          0.2                              0.003
Denormalized        0.07                             0.002
Complex data type   0.9                              0.002
Full text search    0.3                              0.001
[Bar charts: speed with cold cache and speed with hot cache by model, seconds]
Comparison
Search using 1 tag, retrieving all tags
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          1                                0.005
Denormalized        0.7                              0.004
Complex data type   1.7                              0.005
Full text search    0.7                              0.002
[Bar charts: speed with cold cache and speed with hot cache by model, seconds]
Comparison
Search by AND using 2 tags, retrieving all tags
Model               Speed with cold cache, seconds   Speed with hot cache, seconds
Relational          40                               34
Denormalized        34                               20
Complex data type   34                               14
Full text search    20                               2
[Bar chart: search speed with hot and cold cache by model, seconds]
Comparison
Tag cloud population
Model                   Speed, seconds
Relational              20
Relational simplified   18
Relational without FK   202
Denormalized            18
Complex data type       21
Full text search        40
[Bar chart: tag cloud population speed by model, seconds]
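Populating a tag cloud amounts to counting how often each tag is used. Below is a minimal sketch of that aggregation over a two-column `post_tags(post_id, tag)` table; the table, the sample rows, and the `tag_cloud` helper are assumptions for illustration and are not the queries that were benchmarked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_tags (post_id INTEGER, tag TEXT, PRIMARY KEY (post_id, tag))"
)
conn.executemany(
    "INSERT INTO post_tags (post_id, tag) VALUES (?, ?)",
    [(1, "php"), (1, "mysql"), (2, "php"), (2, "apache2"), (3, "php")],
)

def tag_cloud(limit=50):
    """Return (tag, uses) pairs, most used first, ties broken by name."""
    rows = conn.execute(
        """SELECT tag, COUNT(*) AS uses
           FROM post_tags
           GROUP BY tag
           ORDER BY uses DESC, tag
           LIMIT ?""",
        (limit,),
    )
    return rows.fetchall()

print(tag_cloud())
```

With packed or array storage, each document's tag list has to be unpacked before it can be counted, so the aggregation looks different for every model.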
Pros & Cons
Model               Space         Search        Insert        Maintenance   Additional     Risk of   Search queries
                    consumption   performance   performance                 housekeeping   failure   development
Relational          worst         worst         highest       minimal       not required   no        worst
Denormalized        moderate      moderate      good          required      required       no        moderate
Complex data type   moderate      moderate      worst         required      required       no        moderate
Full text search    optimal       optimal       moderate      required      required       yes       optimal
Conclusion
1. Choose your best model based on:
• Performance (search/insert/update)
• Space consumption
• Engineer experience
• Hardware cost
• Software cost
2. Each storage model should be checked on your RDBMS - don't be afraid to try and measure
3. Understanding how complex data types are stored inside is crucial
4. Understanding how FTS works inside is crucial
5. Investigate your DBMS unique features
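For the "try and measure" advice, a small timing harness is often enough to compare candidate schemas on your own data. The sketch below uses SQLite and `time.perf_counter`; the `time_query` helper and the sample table are hypothetical. The first run is only a rough stand-in for a cold cache, since a true cold-cache figure requires restarting the database and clearing operating system caches.

```python
import sqlite3
import time

def time_query(conn, sql, params=(), repeats=5):
    """Run a query several times; return (first run, best later run) in seconds.

    Needs repeats >= 2 so that at least one warm run exists.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()  # fetch all rows so the work is done
        timings.append(time.perf_counter() - start)
    return timings[0], min(timings[1:])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post_tags (post_id INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO post_tags (post_id, tag) VALUES (?, ?)",
    [(i, f"tag{i % 100}") for i in range(10_000)],
)

first, best = time_query(
    conn, "SELECT post_id FROM post_tags WHERE tag = ?", ("tag7",)
)
print(f"first run: {first:.6f} s, best warm run: {best:.6f} s")
```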
There is no silver bullet for tag storage model!
Q&A
Contacts
Feel free to ask any db-related questions: [email protected]