Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf ·...

40
1 Copyright MySQL AB The World’s Most Popular Open Source Database 1 The World’s Most Popular Open Source Database Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes Community Relations Manager, North America ([email protected]) MySQL, Inc. TELECONFERENCE: please dial a number to hear the audio portion of this presentation. Toll-free US/Canada: 866-469-3239 Direct US/Canada: 650-429-3300 : 0800-295-240 Austria 0800-71083 Belgium 80-884912 Denmark 0-800-1-12585 Finland 0800-90-5571 France 0800-101-6943 Germany 00800-12-6759 Greece 1-800-882019 Ireland 800-780-632 Italy 0-800-9214652 Israel 800-2498 Luxembourg 0800-022-6826 Netherlands 800-15888 Norway 900-97-1417 Spain 020-79-7251 Sweden 0800-561-201 Switzerland 0800-028-8023 UK 1800-093-897 Australia 800-90-3575 Hong Kong 00531-12-1688 Japan EVENT NUMBER/ACCESS CODE:

Transcript of Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf ·...

Page 1: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

1Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 1The Worldrsquos Most Popular Open Source Database

Tagging and FolksonomySchema Design for Scalability and Performance

Jay PipesCommunity Relations Manager North America

(jaymysqlcom)MySQL Inc

TELECONFERENCE please dial a number to hear the audio portion of this presentation

Toll-free USCanada 866-469-3239 Direct USCanada 650-429-3300 0800-295-240 Austria 0800-71083 Belgium80-884912 Denmark 0-800-1-12585 Finland0800-90-5571 France 0800-101-6943 Germany00800-12-6759 Greece 1-800-882019 Ireland800-780-632 Italy 0-800-9214652 Israel800-2498 Luxembourg 0800-022-6826 Netherlands800-15888 Norway 900-97-1417 Spain020-79-7251 Sweden 0800-561-201 Switzerland0800-028-8023 UK 1800-093-897 Australia800-90-3575 Hong Kong 00531-12-1688 Japan

EVENT NUMBERACCESS CODE

2Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 2The Worldrsquos Most Popular Open Source Database

craigslist

Communicate

Connect

Share

Play

Trade

Search amp Look Up

3Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 3The Worldrsquos Most Popular Open Source Database

Software Industry Evolution

CLOSED SOURCE SOFTWARE

FREE amp OPEN SOURCE SOFTWARE

20051995

WEB 10 WEB 20

1990 2000

4Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 4The Worldrsquos Most Popular Open Source Database

How MySQL Enables Web 20

lower tco ubiquitous with developers

ease of use

interoperableconnectors

language support

bundledopen sourcescale out

replication

query cache

session management

tag databases

world-widecommunity

lampfull-text search

large objects

pluggable storage engines

reliability

performance

XML

partitioning (51)

5Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 5The Worldrsquos Most Popular Open Source Database

Agenda

bull Web 20 more than just rounded cornersbull Tagging Concepts in SQL bull Folksonomy Concepts in SQLbull Scaling out sensibly

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 2: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

2Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 2The Worldrsquos Most Popular Open Source Database

craigslist

Communicate

Connect

Share

Play

Trade

Search amp Look Up

3Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 3The Worldrsquos Most Popular Open Source Database

Software Industry Evolution

CLOSED SOURCE SOFTWARE

FREE amp OPEN SOURCE SOFTWARE

20051995

WEB 10 WEB 20

1990 2000

4Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 4The Worldrsquos Most Popular Open Source Database

How MySQL Enables Web 20

lower tco ubiquitous with developers

ease of use

interoperableconnectors

language support

bundledopen sourcescale out

replication

query cache

session management

tag databases

world-widecommunity

lampfull-text search

large objects

pluggable storage engines

reliability

performance

XML

partitioning (51)

5Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 5The Worldrsquos Most Popular Open Source Database

Agenda

bull Web 20 more than just rounded cornersbull Tagging Concepts in SQL bull Folksonomy Concepts in SQLbull Scaling out sensibly

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 3: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

3Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 3The Worldrsquos Most Popular Open Source Database

Software Industry Evolution

CLOSED SOURCE SOFTWARE

FREE amp OPEN SOURCE SOFTWARE

20051995

WEB 10 WEB 20

1990 2000

4Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 4The Worldrsquos Most Popular Open Source Database

How MySQL Enables Web 20

lower tco ubiquitous with developers

ease of use

interoperableconnectors

language support

bundledopen sourcescale out

replication

query cache

session management

tag databases

world-widecommunity

lampfull-text search

large objects

pluggable storage engines

reliability

performance

XML

partitioning (51)

5Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 5The Worldrsquos Most Popular Open Source Database

Agenda

bull Web 20 more than just rounded cornersbull Tagging Concepts in SQL bull Folksonomy Concepts in SQLbull Scaling out sensibly

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 4: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

4Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 4The Worldrsquos Most Popular Open Source Database

How MySQL Enables Web 20

lower tco ubiquitous with developers

ease of use

interoperableconnectors

language support

bundledopen sourcescale out

replication

query cache

session management

tag databases

world-widecommunity

lampfull-text search

large objects

pluggable storage engines

reliability

performance

XML

partitioning (51)

5Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 5The Worldrsquos Most Popular Open Source Database

Agenda

bull Web 20 more than just rounded cornersbull Tagging Concepts in SQL bull Folksonomy Concepts in SQLbull Scaling out sensibly

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 5: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

5Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 5The Worldrsquos Most Popular Open Source Database

Agenda

bull Web 20 more than just rounded cornersbull Tagging Concepts in SQL bull Folksonomy Concepts in SQLbull Scaling out sensibly

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 6: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

6Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 6The Worldrsquos Most Popular Open Source Database

More than just rounded corners

bull What is Web 20ndash Participationndash Interactionndash Connection

bull A changing of design patternsndash AJAX and XML-RPC are changing the way data is queriedndash Rich web-based clientsndash Everyones got an APIndash API leads to increased requests per second Must deal with

growth

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 7: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

7Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 7The Worldrsquos Most Popular Open Source Database

AJAX and XML-RPC change the game

bull Traditional (Web 10) page request builds entire pagendash Lots of HTML style and data in each page requestndash Lots of data processed or queried in each requestndash Complex page requests and application logic

bull AJAXXML-RPC request returns small data set or page fragmentndash Smaller amount of data being passed on each requestndash But often many more requests compared to Web 10ndash Exposed APIs mean data services distribute your data to

syndicated sitesndash RSS feeds supply data not full web page

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 8: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

8Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 8The Worldrsquos Most Popular Open Source Database

flickr Tag Cloud(s)

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 9: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

9Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 9The Worldrsquos Most Popular Open Source Database

Participation Interaction and Connection

bull Multiple users editing the same contentndash Content explosionndash Lots of textual data to manage How do we organize it

bull User-driven datandash ldquoTrackedrdquo pagesitemsndash Allows interlinking of content via the user to user relationship

(folksonomy)

bull Tags becoming the new categorizationfiling of this contentndash Anyone can tag the datandash Tags connect one thing to another similar to the way that the

folksonomy relationships link users together

bull So the common concept here is ldquolinkingrdquo

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 10: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

10Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 10The Worldrsquos Most Popular Open Source Database

What The User Dimension Gives You

bull Folksonomy adds the ldquouser dimensionrdquobull Tags often provide the ldquogluerdquo between user and item

dimensionsbull Answers questions such as

ndash Who shares my interestsndash What are the other interests of users who share my interestsndash Someone tagged my bookmark (or any other item) with

something What other items did that person tag with the same thing

ndash What is the users immediate ldquoecosystemrdquo What are the fringes of that ecosystem

bull Ecosystem radius expansion (like zipcode radius expansion)

bull Marks a new aspect of web applications into the world of data warehousing and analysis

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 11: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

11Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 11The Worldrsquos Most Popular Open Source Database

Linking Is the ldquoRrdquo in RDBMS

bull The schema becomes the absolute driving force behind Web 20 applications

bull Understanding of many-to-many relationships is criticalbull Take advantage of MySQLs architecture so that our

schema is as efficient as possiblebull Ensure your data store is normalized standardized

and consistentbull So lets digg in (pun intended)

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 12: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

12Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 12The Worldrsquos Most Popular Open Source Database

Comparison of Tag Schema Designs

bull ldquoMySQLiciousrdquondash Entirely denormalizedndash One main fact table with one field storing delimited list of tagsndash Unless using FULLTEXT indexing does not scale well at allndash Very inflexible

bull ldquoScuttlerdquo solutionndash Two tables Lookup one-way from item table to tag tablendash Somewhat inflexiblendash Doesnt represent a many-to-many relationship correctly

bull ldquoToxirdquo solutionndash Almost gets it right except uses surrogate keys in mapping

tables ndash Flexible normalized approachndash Closest to our recommended architecture

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 13: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

13Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 13The Worldrsquos Most Popular Open Source Database

Tagging Concepts in SQL

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 14: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

14Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 14The Worldrsquos Most Popular Open Source Database

The Tags Table

bull The ldquoTagsrdquo table is the foundational table upon which all tag links are built

bull Lean and meanbull Make primary key an INT

CREATE TABLE Tags (tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT tag_text VARCHAR(50) NOT NULL PRIMARY KEY pk_Tags (tag_id) UNIQUE INDEX uix_TagText (tag_text)) ENGINE=InnoDB

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 15: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

15Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 15The Worldrsquos Most Popular Open Source Database

Example Tag2Post Mapping Table

bull The mapping table creates the link between a tag and anything else

bull In other terms it maps a many-to-many relationshipbull Important to index from both ldquosidesrdquo

CREATE TABLE Tag2Post (tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY pk_Tag2Post (tag_id post_id) INDEX (post_id)) ENGINE=InnoDB

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 16: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

16Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 16The Worldrsquos Most Popular Open Source Database

The Tag Cloud

bull Tag density typically represented by larger fonts or different colors

SELECT tag_text COUNT() as num_tagsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idGROUP BY tag_text

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 17: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

17Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 17The Worldrsquos Most Popular Open Source Database

Efficiency Issues with the Tag Cloudbull With InnoDB tables you dont want to be issuing

COUNT() queries even on an indexed fieldCREATE TABLE TagStattag_id INT UNSIGNED NOT NULL num_posts INT UNSIGNED NOT NULL num_xxx INT UNSIGNED NOT NULL PRIMARY KEY (tag_id)) ENGINE=InnoDB

SELECT tag_text tsnum_postsFROM Tag2Post t2pINNER JOIN Tags tON t2ptag_id = ttag_idINNER JOIN TagStat tsON ttag_id = tstag_idGROUP BY tag_text

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 18: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

18Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 18The Worldrsquos Most Popular Open Source Database

The Typical Related Items Query

bull Get all posts tagged with any tag attached to Post 6bull In other words ldquoGet me all posts related to post 6rdquo

SELECT p2post_id FROM Tag2Post p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idWHERE p1post_id = 6GROUP BY p2post_id

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 19: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

19Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 19The Worldrsquos Most Popular Open Source Database

Problems With Related Items Query

bull Joining small to medium sized ldquotag setsrdquo works greatbull But when youve got a large tag set on either ldquosiderdquo of

the join problems can occur with scalablitybull One way to solve is via derived tables

SELECT p2post_id FROM (SELECT tag_id FROM Tag2PostWHERE post_id = 6 LIMIT 10) AS p1INNER JOIN Tag2Post p2ON p1tag_id = p2tag_idGROUP BY p2post_id LIMIT 10

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 20: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

20Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 20The Worldrsquos Most Popular Open Source Database

The Typical Related Tags Querybull Get all tags related to a particular tag via an itembull The ldquoreverserdquo of the related items query we want a set

of related tags not related posts

SELECT t2p2tag_id t2tag_text FROM (SELECT post_id FROM Tags t1INNER JOIN Tag2Post ON t1tag_id = Tag2Posttag_idWHERE t1tag_text = beach LIMIT 10) AS t2p1INNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idINNER JOIN Tags t2ON t2p2tag_id = t2tag_idGROUP BY t2p2tag_id LIMIT 10

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 21: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

21Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 21The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag

bull What if we want only items related to each of a set of tags

bull Here is the typical way of dealing with this problem

SELECT t2ppost_idFROM Tags t1 INNER JOIN Tag2Post t2pON t1tag_id = t2ptag_idWHERE t1tag_text IN (beachcloud)GROUP BY t2ppost_id HAVING COUNT(DISTINCT t2ptag_id) = 2

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 22: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

22Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 22The Worldrsquos Most Popular Open Source Database

Dealing With More Than One Tag (contd)

bull The GROUP BY and the HAVING COUNT(DISTINCT ) can be eliminated through joins

bull Thus you eliminate the Using temporary using filesort in the query execution

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idWHERE t1tag_text = beachAND t2tag_text = cloud

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 23: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

23Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 23The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset

bull Here we want have a search query like ldquoGive me all posts tagged with ldquobeachrdquo and ldquocloudrdquo but not tagged with ldquoflowerrdquo Typical solution you will see

SELECT t2p1post_idFROM Tag2Post t2p1INNER JOIN Tags t1 ON t2p1tag_id = t1tag_idWHERE t1tag_text IN (beachcloud)AND t2p1post_id NOT IN(SELECT post_id FROM Tag2Post t2p2INNER JOIN Tags t2 ON t2p2tag_id = t2tag_idWHERE t2tag_text = flower)GROUP BY t2p1post_id HAVING COUNT(DISTINCT t2p1tag_id) = 2

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 24: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

24Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 24The Worldrsquos Most Popular Open Source Database

Excluding Tags From A Resultset (contd)bull More efficient to use an outer join to filter out the

ldquominusrdquo operator plus get rid of the GROUP BY

SELECT t2p2post_idFROM Tags t1 CROSS JOIN Tags t2 CROSS JOIN Tags t3INNER JOIN Tag2Post t2p1ON t1tag_id = t2p1tag_idINNER JOIN Tag2Post t2p2ON t2p1post_id = t2p2post_idAND t2p2tag_id = t2tag_idLEFT JOIN Tag2Post t2p3ON t2p2post_id = t2p3post_idAND t2p3tag_id = t3tag_idWHERE t1tag_text = beachAND t2tag_text = cloudAND t3tag_text = flowerAND t2p3post_id IS NULL

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 25: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

25Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 25The Worldrsquos Most Popular Open Source Database

Summary of SQL Tips for Tagging

bull Index fields in mapping tables properly Ensure that each GROUP BY can access an index from the left side of the index

bull Use summary or statistic tables to eliminate the use of COUNT() expressions in tag clouding

bull Get rid of GROUP BY and HAVING COUNT() by using standard join techniques

bull Get rid of NOT IN expressions via a standard outer joinbull Use derived tables with an internal LIMIT expression to

prevent wild relation queries from breaking scalability

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 26: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

26Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 26The Worldrsquos Most Popular Open Source Database

Folksonomy Concepts in SQL

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 27: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

27Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 27The Worldrsquos Most Popular Open Source Database

Here we see tagging and folksonomy

together the user dimension

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 28: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

28Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 28The Worldrsquos Most Popular Open Source Database

Folksonomy Adds The User Dimension

bull Adding the user dimension to our schemabull The tag is the relationship glue between the user and

item dimensions

CREATE TABLE UserTagPost (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NULL post_id INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id post_id) INDEX (tag_id) INDEX (post_id)) ENGINE=InnoDB

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 29: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

29Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 29The Worldrsquos Most Popular Open Source Database

Who Shares My Interest Directly

bull Find out the users who have linked to the same item I have

bull Direct link we dont go through the tag glue

SELECT user_id FROM UserTagPostWHERE post_id = my_post_id

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 30: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

30Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 30The Worldrsquos Most Popular Open Source Database

Who Shares My Interests Indirectly

bull Find out the users who have similar tag setsbull But how much matching do we want to do In other

words what radius do we want to match onbull The first step is to find my tags that are within the

search radius this yields my ldquotoprdquo or most popular tags

SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 31: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

31Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 31The Worldrsquos Most Popular Open Source Database

Who Shares My Interests (contd)

bull Now that we have our ldquotoprdquo tag set we want to find users who match all of our top tags

SELECT othersuser_id FROM UserTagPost others INNER JOIN (SELECT tag_id FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_id

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 32: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

32Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 32The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matchesbull What about finding our ldquoclosestrdquo ecosystem matchesbull We can ldquorankrdquo other users based on whether they have

tagged items a number of times similar to ourselvesSELECT othersuser_id (COUNT() shy my_tagsnum_tags) AS rankFROM UserTagPost others INNER JOIN (SELECT tag_id COUNT() AS num_tags FROM UserTagPostWHERE user_id = my_user_idGROUP BY tag_idHAVING COUNT(tag_id) gt= radius) AS my_tagsON otherstag_id = my_tagstag_idGROUP BY othersuser_idORDER BY rank DESC

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 33: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

33Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 33The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficientlybull But weve still got our COUNT() problembull How about another summary table

CREATE TABLE UserTagStat (user_id INT UNSIGNED NOT NULL tag_id INT UNSIGNED NOT NUL num_posts INT UNSIGNED NOT NULL PRIMARY KEY (user_id tag_id) INDEX (tag_id)) ENGINE=InnoDB

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 34: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

34Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 34The Worldrsquos Most Popular Open Source Database

Ranking Ecosystem Matches Efficiently 2bull Hey weve eliminated the aggregation

SELECT othersuser_id (othersnum_posts shy my_tagsnum_posts) AS rankFROM UserTagStat others INNER JOIN (SELECT tag_id num_postsFROM UserTagStatWHERE user_id = my_user_idAND num_posts gt= radius) AS my_tagsON otherstag_id = my_tagstag_idORDER BY rank DESC

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 35: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

35Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 35The Worldrsquos Most Popular Open Source Database

Scaling Out Sensibly

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 36: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

36Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 36The Worldrsquos Most Popular Open Source Database

SlaveMySQLServer

MasterMySQLServer

MySQL Replication (Scale Out)

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

WebAppServer

Writes amp Reads Reads Reads

hellip

Replication

Load Balancer

bull Write to one masterbull Read from many slavesbull Excellent for read intensive apps

SlaveMySQLServer

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 37: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

37Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 37The Worldrsquos Most Popular Open Source Database

Scale Out Using Replicationbull Master DB stores all writesbull Master has InnoDB tablesbull Slaves handle aggregate reads non-realtime readsbull Web servers can be load balanced (directed) to one or

more slavesbull Just plug in another slave to increase read performance

(thats scaling out)bull Slave can provide hot standby as well as backup server

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 38: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

38Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 38The Worldrsquos Most Popular Open Source Database

Scale Out Strategiesbull Slave storage engine can differ from

Masterndash InnoDB on Master (great updateinsertdelete

performance)ndash MyISAM on Slave (fantastic read performance

and well as excellent concurrent insert performance plus can use FULLTEXT indexing)

bull Push aggregated summary data in batches onto slaves for excellent read performance of semi-static datandash Example ldquothis weeks popular tagsrdquo

bull Generate the data via cron job on each slave No need to burden the master server

bull Truncate every week

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 39: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

39Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 39The Worldrsquos Most Popular Open Source Database

Scale Out Strategies (contd)bull Offload FULLTEXT indexing onto a FT

indexer such as Apache Lucene Mnogosearch Sphinx FT Engine etc

bull Use Partitioning feature of 51 to segment tag data across multiple partitions allowing you to spread disk load sensibly based on your tag text density

bull Use the MySQL Query Cache effectivelyndash Use SQL_NO_CACHE when selecting from

frequently updated tables (ex TagStat)ndash Very effective for high-read environments

can yield 200-250 performance improvement

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB

Page 40: Tagging and Folksonomy Schema Design for Scalability and ...joinfu.com/presentations/tagging.pdf · Tagging and Folksonomy Schema Design for Scalability and Performance Jay Pipes

40Copyright MySQL AB The Worldrsquos Most Popular Open Source Database 40The Worldrsquos Most Popular Open Source Database

QuestionsMySQL Forge httpforgemysqlcom

MySQL Forge Tag Schema Wiki pageshttpforgemysqlcomwikiTagSchema

PlanetMySQL httpwwwplanetmysqlorg

Jay Pipes (jaymysqlcom)

MySQL AB