A Clever Way to Scale-out a Web Application

Cybozu Labs, Inc. Kazuho Oku

RDB sharding

- denormalization is inevitable

Sep 11 2009 A Clever Way to Scale-out a Web Application 2

[Figure: the data is split into shards by user id (uid:1-2000, uid:2001-4000, uid:4001-6000, ...). Each shard holds its own tweet, following, timeline, and followed_by tables.]

When uid:123 tweets, the application writes his tweet, reads the uids of his followers, and updates the timeline table of each follower.
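The write path described above can be sketched in a few lines of Python. This is only an in-memory illustration, not code from the talk: the `Shard` class, the fixed shard width of 2000 ids, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

SHARD_SIZE = 2000  # assumption: fixed-width ranges as in the figure (1-2000, 2001-4000, ...)


def shard_of(uid):
    """Return the zero-based shard index for a user id (uids start at 1)."""
    return (uid - 1) // SHARD_SIZE


class Shard:
    """Hypothetical in-memory stand-in for one DB node."""

    def __init__(self):
        self.tweet = []                      # rows of (user_id, tweet_id, body)
        self.followed_by = defaultdict(set)  # user_id -> ids of that user's followers
        self.timeline = defaultdict(list)    # user_id -> [(tweet_user_id, tweet_id)]


shards = defaultdict(Shard)


def follow(follower, followee):
    # The followed_by row lives on the shard of the user being followed.
    shards[shard_of(followee)].followed_by[followee].add(follower)


def post_tweet(uid, tweet_id, body):
    home = shards[shard_of(uid)]
    home.tweet.append((uid, tweet_id, body))      # 1. write the tweet
    for follower in sorted(home.followed_by[uid]):  # 2. read the follower uids
        # 3. update each follower's timeline, possibly on another shard
        shards[shard_of(follower)].timeline[follower].append((uid, tweet_id))


follow(45, 123)
follow(2431, 123)
post_tweet(123, 1, "hello")
```

Here one tweet touches two shards: follower 45 lives on the same shard as uid:123, while follower 2431 lives on the next one. Doing step 3 by hand in application code is exactly the burden the rest of the talk tries to remove.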

Two methods to update the shards

- eventual consistency
  - asynchronous updates using worker processes
  - pros: fast response, high scalability
  - cons: hard to maintain

- 2-phase commit
  - synchronous updates
  - pros: synchronous, doesn't require an external daemon
  - cons: slow response


The problems

- complex queries
  - reading from / writing to multiple DB nodes
  - cannot use secondary indexes
  - need to maintain per-user views (denormalized tables)

- maintaining consistency between the nodes
  - when using the eventual consistency model

- dynamic scaling
  - adding new nodes without stopping the service


Incline


- solution for the two problems of eventual consistency:
  - complex update queries
  - maintenance of the denormalized tables

- basic idea
  - do not let app. developers write denormalization logic
  - handle denormalization below the SQL layer
    - by using triggers and queue tables


Incline – illustrated

- insert / update / delete rows of related tables automatically

[Figure: three shards (uid:1-2000, uid:2001-4000, uid:4001-6000, ...), each holding tweet, following, timeline, followed_by, and queue tables.]

When uid:123 tweets, the application writes only to his tweet table. Incline updates the other tables automatically.

Incline – illustrated (cont'd)

- insert / update / delete rows of related tables automatically

[Figure: three shards (uid:1-2000, uid:2001-4000, uid:4001-6000, ...), each holding tweet, following, timeline, followed_by, and queue tables.]

When uid:2431 starts following uid:940, the application writes only to his following table.

Incline – details

- triggers generated from def. files
  - sync. updates within each node
  - async. updates between the nodes

- each DB node has a queue table
  - helper program (C++) applies the queued events to other nodes
  - uses a fault-tolerant algorithm

- application only needs to write to the user's shard
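The division of labour between triggers, queue tables, and the forwarder can be modelled roughly as below. This is a toy Python model, not Incline itself: the `Node` class and `forward` function are hypothetical, and since the slides do not describe the fault-tolerant algorithm, the two safeguards used here (idempotent apply, and dequeue only after a successful apply) are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class Node:
    """Hypothetical DB node owning user ids in the half-open range [lo, hi)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.followed_by = defaultdict(set)  # user_id -> follower ids
        self.timeline = defaultdict(list)    # user_id -> [(tweet_user_id, tweet_id)]
        self.queue = deque()                 # stands in for the queue table

    def owns(self, uid):
        return self.lo <= uid < self.hi

    def on_tweet_insert(self, user_id, tweet_id):
        """Plays the role of an AFTER INSERT trigger on the tweet table."""
        for follower in sorted(self.followed_by[user_id]):
            row = (follower, user_id, tweet_id)
            if self.owns(follower):
                self.apply(row)         # synchronous update within the node
            else:
                self.queue.append(row)  # deferred: destined for another node

    def apply(self, row):
        follower, tweet_user_id, tweet_id = row
        entry = (tweet_user_id, tweet_id)
        # Assumption: applying is idempotent, so a retried event does no harm.
        if entry not in self.timeline[follower]:
            self.timeline[follower].append(entry)


def forward(source, nodes):
    """Drain the source node's queue and return the number of events delivered."""
    delivered = 0
    while source.queue:
        row = source.queue[0]
        target = next(n for n in nodes if n.owns(row[0]))
        target.apply(row)
        source.queue.popleft()  # dequeue only after the apply succeeded
        delivered += 1
    return delivered


n1, n2 = Node(1, 2001), Node(2001, 4001)
n1.followed_by[123] = {45, 2431}
n1.on_tweet_insert(123, 1)           # local follower updated, remote one queued
pending = list(n1.queue)
delivered = forward(n1, [n1, n2])    # the forwarder catches the other node up
```

After `on_tweet_insert` returns, follower 45 (same node) already sees the tweet, while the row for follower 2431 waits in the queue until `forward` runs. That gap is the eventual-consistency window.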


Incline – the commands

# create queue tables
% incline --mode=shard --rdbms=mysql --database=microblog \
    --host=10.0.200.10 --source=microblog.json --shard-source=shard.json \
    create-queue

# create triggers
% incline --mode=shard --rdbms=mysql --database=microblog \
    --host=10.0.200.10 --source=microblog.json --shard-source=shard.json \
    create-trigger

# run forwarder (transfers data from specified host to other shards)
% incline --mode=shard --rdbms=mysql --database=microblog \
    --host=10.0.200.10 --source=microblog.json --shard-source=shard.json \
    forward


Incline – the definition files

# view def. file
[
  {
    "source" : [ "tweet", "followed_by" ],
    "destination" : "timeline",
    "pk_columns" : {
      "followed_by.follower_id" : "user_id",
      "tweet.user_id" : "tweet_user_id",
      "tweet.tweet_id" : "tweet_id"
    },
    "npk_columns" : {
      "tweet.ctime" : "ctime"
    },
    "merge" : {
      "tweet.user_id" : "followed_by.user_id"
    },
    "shard-key" : "user_id"
  },
  {
    "source" : "following",
    "destination" : "followed_by",
    "pk_columns" : {
      "following.following_id" : "user_id",
      "following.user_id" : "follower_id"
    },
    "shard-key" : "user_id"
  }
]


# shard def. file
{
  "algorithm" : "range-int",
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      "host" : "10.0.200.11",
      "username" : "pac1251781332"
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}
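To make the "range-int" map concrete, here is a small Python sketch of how a client could resolve a user id to a node. It assumes that each key in "map" is the inclusive lower bound of a range, which matches the `1<=... AND ...<2001` conditions in the generated triggers. The function names are made up for this example and are not part of Incline or DBIx::ShardManager.

```python
import bisect
import json

SHARD_DEF = json.loads("""
{ "algorithm" : "range-int",
  "map" : {
    "1"    : { "host" : "10.0.200.10", "username" : "pac1251781019" },
    "2001" : { "host" : "10.0.200.11", "username" : "pac1251781332" },
    "4001" : { "host" : "10.0.200.12", "username" : "pac1251781408" }
  }
}
""")


def range_starts(shard_def):
    """Return the sorted lower bounds of all ranges in the definition."""
    if shard_def["algorithm"] != "range-int":
        raise ValueError("this sketch only handles range-int")
    return sorted(int(k) for k in shard_def["map"])


def lookup(shard_def, key):
    """Return the map entry whose range contains key.

    Assumption: each map key is the inclusive lower bound of its range,
    and the range extends up to the next key.
    """
    starts = range_starts(shard_def)
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        raise KeyError(f"{key} is below the first range")
    return shard_def["map"][str(starts[i])]


host_for_123 = lookup(SHARD_DEF, 123)["host"]
host_for_2431 = lookup(SHARD_DEF, 2431)["host"]
```

Because only the lower bounds are stored, splitting a range later amounts to adding one more key to the map.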

Incline – FYI the generated triggers

CREATE TRIGGER _INCLINE_followed_by_INSERT AFTER INSERT ON followed_by FOR EACH ROW BEGIN
  IF (((1<=NEW.follower_id AND NEW.follower_id<2001))) THEN
    INSERT INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id,'I'
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  END IF;
END

CREATE TRIGGER _INCLINE_followed_by_UPDATE AFTER UPDATE ON followed_by FOR EACH ROW BEGIN
  IF (((1<=NEW.follower_id AND NEW.follower_id<2001))) THEN
    REPLACE INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id,'U'
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  END IF;
END

CREATE TRIGGER _INCLINE_followed_by_DELETE AFTER DELETE ON followed_by FOR EACH ROW BEGIN
  IF (((1<=OLD.follower_id AND OLD.follower_id<2001))) THEN
    DELETE FROM timeline
      WHERE timeline.user_id=OLD.follower_id AND tweet_user_id=OLD.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,tweet_id,tweet_user_id,_iq_action)
      SELECT OLD.follower_id,tweet.tweet_id,tweet.user_id,'D'
      FROM tweet WHERE tweet.user_id=OLD.user_id;
  END IF;
END

CREATE TRIGGER _INCLINE_following_INSERT AFTER INSERT ON following FOR EACH ROW BEGIN
  IF (((1<=NEW.following_id AND NEW.following_id<2001))) THEN
    INSERT INTO followed_by (user_id,follower_id)
      SELECT NEW.following_id,NEW.user_id;
  ELSE
    INSERT INTO _iq_followed_by (user_id,follower_id,_iq_action)
      SELECT NEW.following_id,NEW.user_id,'I';
  END IF;
END

CREATE TRIGGER _INCLINE_following_DELETE AFTER DELETE ON following FOR EACH ROW BEGIN
  IF (((1<=OLD.following_id AND OLD.following_id<2001))) THEN
    DELETE FROM followed_by
      WHERE followed_by.user_id=OLD.following_id AND followed_by.follower_id=OLD.user_id;
  ELSE
    INSERT INTO _iq_followed_by (user_id,follower_id,_iq_action)
      SELECT OLD.following_id,OLD.user_id,'D';
  END IF;
END

CREATE TRIGGER _INCLINE_tweet_INSERT AFTER INSERT ON tweet FOR EACH ROW BEGIN
  INSERT INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id
    FROM followed_by
    WHERE ((1<=followed_by.follower_id AND followed_by.follower_id<2001))
      AND NEW.user_id=followed_by.user_id;
  INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id,'I'
    FROM followed_by
    WHERE NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)))
      AND NEW.user_id=followed_by.user_id;
END

CREATE TRIGGER _INCLINE_tweet_UPDATE AFTER UPDATE ON tweet FOR EACH ROW BEGIN
  REPLACE INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id
    FROM followed_by
    WHERE ((1<=followed_by.follower_id AND followed_by.follower_id<2001))
      AND NEW.user_id=followed_by.user_id;
  INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id,'U'
    FROM followed_by
    WHERE NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)))
      AND NEW.user_id=followed_by.user_id;
END

CREATE TRIGGER _INCLINE_tweet_DELETE AFTER DELETE ON tweet FOR EACH ROW BEGIN
  DELETE FROM timeline
    WHERE timeline.tweet_id=OLD.tweet_id AND timeline.tweet_user_id=OLD.user_id;
  INSERT INTO _iq_timeline (tweet_id,tweet_user_id,user_id,_iq_action)
    SELECT OLD.tweet_id,OLD.user_id,followed_by.follower_id,'D'
    FROM followed_by
    WHERE OLD.user_id=followed_by.user_id
      AND NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)));
END


Pacific


Range-based sharding vs. hash-based

- Range-based sharding is better
  - range queries are sometimes necessary
  - manual tuning is easy
  - the number of nodes can be increased gradually
    - with hash-based sharding, you have to add 1,2,4,8,16,32,64,... servers at once
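A quick count shows why growing a hash-sharded cluster by a single node is painful. The Python sketch below assumes plain modulo hashing (`uid % node_count`), which is the simplest reading of the slide; schemes such as consistent hashing behave differently and are not modelled here. The function names are illustrative only.

```python
def moved_by_modulo(keys, old_n, new_n):
    """Count keys whose node changes when going from old_n to new_n nodes
    under plain modulo placement (an assumption of this sketch)."""
    return sum(1 for k in keys if k % old_n != k % new_n)


def moved_by_range_split(keys, lo, hi, split_at):
    """Count keys that move when the range [lo, hi) is split at split_at:
    only keys in [split_at, hi) go to the new node."""
    if not lo < split_at < hi:
        raise ValueError("split point must lie strictly inside the range")
    return sum(1 for k in keys if split_at <= k < hi)


keys = range(1, 6001)  # uid:1-6000, as in the figures

hash_3_to_4 = moved_by_modulo(keys, 3, 4)   # add one node to three
hash_3_to_6 = moved_by_modulo(keys, 3, 6)   # double the cluster
range_split = moved_by_range_split(keys, 2001, 4001, 3001)

print(hash_3_to_4, hash_3_to_6, range_split)  # prints "4500 3000 1000"
```

Going from three to four modulo-hashed nodes relocates three quarters of all keys, scattered across every node. Doubling moves half of them, but each old node hands its keys to exactly one new node, which is why hash-based setups tend to grow in powers of two. A range split touches only the one shard being divided.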


Pacific

- utility programs for dynamic scaling
  - mysqld_jumpstart
  - pacific_divide


mysqld_jumpstart – summary

- create a mysqld instance in a single command
  - service automatically started by daemontools
  - setup of primary nodes and slaves
  - auto-generated backup script: install_dir/etc/backup.sh
    - uses XtraBackup for hot-backup


mysqld_jumpstart – the commands

# create and start a master database
% mysqld_jumpstart --mysql-install-db=/usr/local/mysql/bin/mysql_install_db \
    --mysqld=/usr/local/mysql/libexec/mysqld --base-dir=/var/servicedb \
    --server-id=1252619462 --socket=/tmp/mysql-servicedb.sock \
    --service-dir=/service/mysql-servicedb \
    --replication-network='10.0.200.0/255.255.255.0'

# backup
% /var/servicedb/etc/backup.sh /var/backup/servicedb.backup.20090911

# create and start a slave database
% mysqld_jumpstart --mysql-install-db=/usr/local/mysql/bin/mysql_install_db \
    --mysqld=/usr/local/mysql/libexec/mysqld --base-dir=/var/servicedb \
    --server-id=1252619493 --socket=/tmp/mysql-servicedb.sock \
    --service-dir=/service/mysql-servicedb \
    --replication-network='10.0.200.0/255.255.255.0' \
    --master-host=10.0.200.1 --from-innobackupex


Splitting a MySQL shard


- use replication to prepare, then upgrade a slave to master

[Figure: Before: three masters hold 1~2,000, 2,001~4,000, and 4,001~6,000, and the 2,001~4,000 node replicates to a slave. After: four masters hold 1~2,000, 2,001~3,000, 3,001~4,000, and 4,001~6,000.]

Problems in splitting a shard

- speed vs. safety
  - downtime should be minimal
  - guarantee that all the application servers write to the new node
    - reads may switch to the new node eventually


Pacific_divide – the blurbs

- fail-safe
  - application servers using the old sharding definition cannot access the split nodes
  - app. servers reload the definition in such a case

- minimum impact on users
  - no read-locks during division
    - in eventual-consistency mode
  - acquires a write lock only against the dividing node
  - write lock time < 10 seconds
    - if there is no delay in replication

Pacific_divide – the split algorithm

1. create a new slave node
2. drop write privileges of the existing username on the dividing node
3. wait until the new node is in sync.
4. update incline triggers
5. create new user and give read / write privileges
6. update shard def.
7. drop read privileges granted to the old username


Pacific_divide – the command

# upgrade 10.0.200.18 to a master with range uid:3,000-
#
# when instructed by pacific_divide, transmit shard.json to all
# application servers and mysql shards (or you may use nfs, etc.)

% pacific_divide --shard-def=shard.json --database=microblog \
    --new-host=10.0.200.18 --from-id=3000 --incline-source=microblog.json


Pacific_divide – how the shard def. changes


# before

{
  "algorithm" : "range-int",
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      ...
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}

# after

{
  "algorithm" : "range-int",
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      ...
    },
    "3001" : {
      "host" : "10.0.200.18",
      "username" : "pac1252624011"
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}
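The change to the definition itself is small: one new lower bound is added to the map. The Python sketch below shows just that step. `split_shard` is a hypothetical helper, not part of pacific_divide, and it deliberately leaves out the privilege and username changes that the split algorithm performs on the dividing node. The "2001" entry is taken from the shard def. file shown earlier in the deck.

```python
import copy

BEFORE = {
    "algorithm": "range-int",
    "map": {
        "1":    {"host": "10.0.200.10", "username": "pac1251781019"},
        "2001": {"host": "10.0.200.11", "username": "pac1251781332"},
        "4001": {"host": "10.0.200.12", "username": "pac1251781408"},
    },
}


def split_shard(shard_def, from_id, new_host, new_username):
    """Return a copy of shard_def in which ids from from_id up to the next
    existing boundary map to the new node. The input is left untouched.

    Hypothetical helper: credential changes on the old node are not modelled.
    """
    starts = sorted(int(k) for k in shard_def["map"])
    if from_id in starts:
        raise ValueError(f"{from_id} is already a range boundary")
    if from_id < starts[0]:
        raise ValueError(f"{from_id} is below the first range")
    new_def = copy.deepcopy(shard_def)
    new_def["map"][str(from_id)] = {"host": new_host, "username": new_username}
    return new_def


AFTER = split_shard(BEFORE, 3001, "10.0.200.18", "pac1252624011")
boundaries = sorted(int(k) for k in AFTER["map"])
```

Returning a copy rather than editing in place mirrors the operational point of the slides: old and new definitions coexist for a while, and application servers still holding the old one must fail safely until they reload.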

DBIx::ShardManager


DBIx::ShardManager – the code

# create manager object
my $mgr = DBIx::ShardManager->new(
    definition => DBIx::ShardManager::Definition::JSON->new(
        file        => 'etc/user_shard_def.json',
        auto_reload => 1,
    ),
    connector => DBIx::ShardManager::Connector::DBI->new(
        driver => 'mysql',
        dbname => 'microblog',
        attr   => {
            mysql_enable_utf8 => 1,
            RaiseError        => 1,
        },
    ),
);


DBIx::ShardManager – the code (cont'd)

# read user's timeline

# first, read my timeline table
my $timeline = $mgr->rw_handle($user_id)->selectall_arrayref(
    'SELECT * FROM timeline WHERE user_id=? ORDER BY ctime DESC LIMIT 20',
    { Slice => {} },
    $user_id,
);

# fetch the tweets using (tweet_user_id,tweet_id) from other shards
$mgr->shard_inner_join(
    $timeline,
    tweet_user_id => {
        'tweet.tweet_id' => 'tweet_id',
    },
);
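Conceptually, a cross-shard inner join groups the timeline rows by the shard that owns each tweet, fetches the tweets one shard at a time, and merges the results. The Python sketch below illustrates that idea with in-memory data. It is not the DBIx::ShardManager implementation: `join_timeline_with_tweets`, `fetch_tweets`, the `TWEETS` data, and the fixed shard width are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-shard tweet tables (shard index -> rows).
TWEETS = {
    0: [{"user_id": 123, "tweet_id": 1, "body": "hello"}],
    1: [{"user_id": 2431, "tweet_id": 7, "body": "world"}],
}


def shard_of(uid):
    """Assumption: fixed-width ranges of 2000 ids, starting at uid 1."""
    return (uid - 1) // 2000


def fetch_tweets(shard, keys):
    """Stand-in for one batched SELECT against a single shard."""
    return [t for t in TWEETS.get(shard, [])
            if (t["user_id"], t["tweet_id"]) in keys]


def join_timeline_with_tweets(rows, shard_of, fetch_tweets):
    """Inner-join timeline rows with tweets stored on other shards.

    Issues one fetch per shard, keeps the input order, and drops
    timeline rows whose tweet cannot be found.
    """
    wanted = defaultdict(set)
    for row in rows:
        key = (row["tweet_user_id"], row["tweet_id"])
        wanted[shard_of(row["tweet_user_id"])].add(key)

    found = {}
    for shard, keys in wanted.items():
        for tweet in fetch_tweets(shard, keys):
            found[(tweet["user_id"], tweet["tweet_id"])] = tweet

    joined = []
    for row in rows:
        tweet = found.get((row["tweet_user_id"], row["tweet_id"]))
        if tweet is not None:
            joined.append({**row, "body": tweet["body"]})
    return joined


timeline = [
    {"user_id": 45, "tweet_user_id": 2431, "tweet_id": 7},
    {"user_id": 45, "tweet_user_id": 123, "tweet_id": 1},
    {"user_id": 45, "tweet_user_id": 123, "tweet_id": 99},  # no such tweet
]
result = join_timeline_with_tweets(timeline, shard_of, fetch_tweets)
```

Batching by shard keeps the number of round trips equal to the number of shards involved rather than the number of timeline rows.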


DBIx::ShardManager – blurbs

- access to raw DBI handles
  - easy to use an ORM on top of DBIx::ShardManager

- detects changes and reloads shard def.
  - but may throw exceptions on writes during node divisions by pacific_divide
    - display maintenance error, and let the user retry

- shard_join to be optimized
  - with Net::Drizzle, or mycached


Conclusion


- RDB sharding is not difficult when using Incline, Pacific, DBIx::ShardManager
  - IMO it is as easy as writing code for a standalone database system

- app. developers can use 2-phase commit if necessary
  - or rely on Incline for async. updates


Current Status & ToDo

- Incline - early beta
  - ToDo: add support for multiple shard keys, add recovery support on data-loss

- Pacific - early beta
  - ToDo: make it a distribution

- DBIx::ShardManager - still alpha
  - ToDo: write more join functions, concurrent access, etc.


Miscellaneous

- Mycached
  - currently in alpha status
  - access MySQL tables using memcached protocol
  - higher concurrency (thousands of connections)
  - higher throughput (2x SQL)


For more information

- see my blog http://developer.cybozu.co.jp/kazuho/
  - DBIx::ShardManager is in coderepos.org/share/lang/perl

- come to BPStudy #25 on 9/25
  - 2h30m talk on Incline, Pacific, DBIx::ShardManager (hopefully including demos)

