Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP
by Dathan Vance Pattishall
Contents
Who am I
Introductions
Requirements
Design to solve requirements
Federation
Java (Friend Queries)
INNODB-isms
More Stats
Questions
Who am I?
Dathan Vance Pattishall
Chief Data-Layer Architect
http://mysqldba.blogspot.com
Scaling a Widget Company
Federation at Flickr: Doing Billions of Queries per Day
Scaling a HUGE volume of concurrent writes
Worked at
Introduction
Now I work at RockYou
When I started:
Facebook shards do 100K TPS alone
MySpace, Hi5, Orkut, Ads, the main site, and various other DB servers sum to 100K TPS
On fewer than 120 database servers
32–48 GB of RAM, 8-disk RAID 10 with a 256 MB PERC 6 controller
We can support any Logical SQL Query
T E A M
The Requirements
• Scale linearly
• Store some data forever
• Allow for change
• Keep it cheap
• Oh, and downtime is not an option
Design to Meet the Requirements
• Need redundancy
• Need a lot of I/O bandwidth
• Need to remove replication lag
• Need a system to do offline processing
• Need to do it all without downtime
• Do it cheaply
Federation
[Diagram: User 1's data, User 2's data, User 3's data, … User N's data, first held together on one database, then split so each user's data lives on its own shard]
Federation
Does NOT increase write throughput
Federation
This increases write throughput
How does one Federate?
Who / what owns the data
How can you answer any question asked?
First need to handle master-master replication
• No auto-increments
• GUIDs only
• Bucket assignment
• Data access follows a pattern
Enter Global Lookup Cluster
• Hash lookups are fast; a single server can do 45K qps
• Ownerid -> Shard_id
• Groupid -> Shard_id
• Tagid -> Shard_id
• Url_id -> Shard_id
• Front by memcache
• Use consistent hashing to add capacity horizontally and for HA
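The slides do not show code for the lookup cluster, so the following is a minimal sketch, assuming a lookup table named owner_to_shard and an in-process map standing in for the memcache tier. All table, column, and class names here are illustrative, not from the talk.

// Hypothetical sketch of the global lookup pattern: owner_id -> shard_id lives
// in a small lookup database and is fronted by a cache.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardLookup {
    private final String lookupJdbcUrl;                                 // global lookup cluster
    private final Map<Long, Integer> cache = new ConcurrentHashMap<>(); // stand-in for memcache

    public ShardLookup(String lookupJdbcUrl) {
        this.lookupJdbcUrl = lookupJdbcUrl;
    }

    /** Resolve which shard owns this owner_id, checking the cache first. */
    public int shardFor(long ownerId) throws SQLException {
        Integer cached = cache.get(ownerId);
        if (cached != null) {
            return cached;
        }
        try (Connection c = DriverManager.getConnection(lookupJdbcUrl);
             PreparedStatement ps = c.prepareStatement(
                 "SELECT shard_id FROM owner_to_shard WHERE owner_id = ?")) {
            ps.setLong(1, ownerId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalStateException("no shard mapping for owner " + ownerId);
                }
                int shardId = rs.getInt(1);
                cache.put(ownerId, shardId);   // in production this would be a memcache set
                return shardId;
            }
        }
    }
}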
Write Multiple Views of the Data
The inviter knows who they invited; the invited knows who invited them
Keep Data Consistent
Write Data to Shard 1
Write Data to Shard 2
If Shard 2 says ok Commit Data on Shard 1
If Shard 1 says ok Commit Data on Shard 2
If any step fails ROLLBACK
Use a Java app to do this in parallel and remove race conditions
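A minimal sketch of that commit protocol over plain JDBC follows. In the talk the two writes are issued in parallel from the Java app; this sketch keeps them sequential for brevity, and the class and method names are illustrative.

// Write the same logical change to both shards and only commit when both succeed.
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class CrossShardWriter {

    /** Apply sql on both shards; commit both or roll both back. */
    public static void writeBothShards(Connection shard1, Connection shard2, String sql)
            throws SQLException {
        shard1.setAutoCommit(false);
        shard2.setAutoCommit(false);
        try {
            try (Statement s1 = shard1.createStatement();
                 Statement s2 = shard2.createStatement()) {
                s1.executeUpdate(sql);   // write data to shard 1
                s2.executeUpdate(sql);   // write data to shard 2
            }
            shard1.commit();             // shard 2 said ok -> commit data on shard 1
            shard2.commit();             // shard 1 said ok -> commit data on shard 2
        } catch (SQLException e) {
            shard1.rollback();           // any step fails -> ROLLBACK both sides
            shard2.rollback();
            throw e;
        }
    }
}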
What if I need an ID to represent a row?

CREATE TABLE `TicketsGeneric` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM AUTO_INCREMENT=7445309740;

REPLACE INTO TicketsGeneric (stub) VALUES ('a');  -- get an ID back
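The pattern above keeps the ticket table at a single row while the AUTO_INCREMENT counter keeps advancing, so every REPLACE hands out a fresh global ID. A hedged JDBC sketch of pulling an ID this way (the Java wrapper itself is illustrative, not from the slides):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TicketServer {

    /** Grab the next globally unique 64-bit ID from the ticket table. */
    public static long nextId(Connection ticketDb) throws SQLException {
        try (Statement st = ticketDb.createStatement()) {
            st.executeUpdate("REPLACE INTO TicketsGeneric (stub) VALUES ('a')",
                             Statement.RETURN_GENERATED_KEYS);
            try (ResultSet rs = st.getGeneratedKeys()) {
                if (rs.next()) {
                    return rs.getLong(1);   // the new global ID
                }
            }
        }
        throw new SQLException("ticket table did not return an ID");
    }
}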
But what if I need a global view of the table
• Cron jobs
• Front by memcache
• Offline tasks to atomically write the job and return the page quickly, i.e. defer writes to many recipients (see the sketch after this list)
– Pure PHP
– Like GEARMAND, uses IPC distributed across servers
– Does 100 million actions per day and scales linearly
• @see Friend Query Section
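The production system described above is pure PHP and distributed across servers; the following is only a minimal, in-process Java sketch of the deferred-write idea it relies on: the request path enqueues a fan-out job and returns, and a background worker does the expensive per-recipient writes later. All names are hypothetical.

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DeferredFanout {
    /** One deferred job: deliver payload from sender to many recipients. */
    record FanoutJob(long senderId, List<Long> recipientIds, String payload) {}

    private final BlockingQueue<FanoutJob> queue = new LinkedBlockingQueue<>();

    /** Called on the request path: cheap, returns right away. */
    public void enqueue(FanoutJob job) {
        queue.add(job);
    }

    /** Background worker: drains jobs and does the per-recipient writes. */
    public void runWorker() throws InterruptedException {
        while (true) {
            FanoutJob job = queue.take();
            for (long recipientId : job.recipientIds()) {
                writeForRecipient(recipientId, job);
            }
        }
    }

    private void writeForRecipient(long recipientId, FanoutJob job) {
        // omitted: shard lookup for recipientId + one INSERT on that shard
    }
}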
What about maintenance?
Have redundancy: take a side down, or rotate a server into the master-master config
Alters
Optimize tables
Add new tables
Massive Deletes
Data Repair
What about Shard Misbalance?
Migrate them
• object_id -> shard_id, lock shard_id for object_id
• Migrate the user
• If error, die and send an alert
• Takes less than 30 seconds per primary object
• Currently shards are self-balancing; can migrate 4 million users in 8 days at the slowest setting
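A hedged sketch of that migration flow: lock the lookup mapping for the object, copy its rows to the target shard, then repoint the mapping and unlock. The lookup table and column names are illustrative, and the row-copy step is left as a comment.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ShardMigrator {

    public static void migrate(Connection lookup, Connection source, Connection target,
                               long objectId, int newShardId) throws SQLException {
        // 1. Lock the mapping so the app stops writing to the old shard.
        try (PreparedStatement ps = lookup.prepareStatement(
                "UPDATE owner_to_shard SET locked = 1 WHERE owner_id = ?")) {
            ps.setLong(1, objectId);
            ps.executeUpdate();
        }

        // 2. Copy the object's rows from the source shard to the target shard.
        //    (omitted: SELECT from source, batched INSERT into target, then verify)

        // 3. Repoint the mapping and unlock; on any error, alert and stop.
        try (PreparedStatement ps = lookup.prepareStatement(
                "UPDATE owner_to_shard SET shard_id = ?, locked = 0 WHERE owner_id = ?")) {
            ps.setInt(1, newShardId);
            ps.setLong(2, objectId);
            ps.executeUpdate();
        }
    }
}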
What about managing data size?
• Enter shard types
– Archive shards
– Sub-shards
• One way a DBA can scale is to partition and allocate a server per table. Why not partition by shard type?
• Allows for bleeding-edge tech; we have 10 shards running XtraDB
What about Split Brain?
I allow writes on both servers in master-master configs.
Stick each primary object ID to one server.
If you read my data, you access it the same way I access it; the same goes for writes.
If a server fails, flip to the redundant server.
$PRIMARY_OBJECT_ID % $NUM_SERVS == BUCKET
This also gets rid of slave lag for the most part.
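A minimal Java rendering of that routing rule, with the failover flip to the redundant server; the server list and the failure flag are illustrative.

// A primary object ID always maps to the same side of the master-master pair,
// so both sides can take writes without conflicting. On failure, flip sides.
public class BucketRouter {
    private final String[] servers;   // e.g. the two masters in a pair
    private final boolean[] down;     // marked true when a server fails

    public BucketRouter(String[] servers) {
        this.servers = servers;
        this.down = new boolean[servers.length];
    }

    /** PRIMARY_OBJECT_ID % NUM_SERVS == BUCKET */
    public String serverFor(long primaryObjectId) {
        int bucket = (int) (primaryObjectId % servers.length);
        if (down[bucket]) {
            bucket = (bucket + 1) % servers.length;   // flip to the redundant server
        }
        return servers[bucket];
    }

    public void markDown(int bucket) {
        down[bucket] = true;
    }
}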
Friend Queries
MULTI-GET from Shards
Jetty + J/Connect (Async Shard Server)
• Can query 8 shards at a time in parallel
• Data is merged on the fly
• JSON is the communication protocol
• private ExecutorService exec = Executors.newFixedThreadPool(8); // 4 CPU * .8 Ut * (1 + W/C) =~ 8
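Expanding the snippet above into a fuller, illustrative sketch of the fan-out: submit one query per shard to the fixed thread pool and merge the rows as they return. Connection handling and row mapping are placeholders, not the talk's actual async shard server.

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class AsyncShardQuery {
    // 4 CPUs * 0.8 utilization * (1 + wait/compute) =~ 8 threads (the slide's sizing)
    private final ExecutorService exec = Executors.newFixedThreadPool(8);

    /** Run the same query on every shard in parallel and merge the results. */
    public List<Map<String, Object>> multiGet(List<String> shardJdbcUrls, String sql)
            throws InterruptedException, ExecutionException {
        List<Future<List<Map<String, Object>>>> futures = new ArrayList<>();
        for (String url : shardJdbcUrls) {
            futures.add(exec.submit(() -> queryOneShard(url, sql)));
        }
        List<Map<String, Object>> merged = new ArrayList<>();
        for (Future<List<Map<String, Object>>> f : futures) {
            merged.addAll(f.get());   // merge results on the fly
        }
        return merged;
    }

    private static List<Map<String, Object>> queryOneShard(String url, String sql)
            throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(url);
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}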
J/Connect
/* mysql-connector-java-5.1.7 ( Revision: ${svn.Revision} ) */
SHOW VARIABLES WHERE Variable_name = 'language' OR Variable_name = 'net_write_timeout' OR
  Variable_name = 'interactive_timeout' OR Variable_name = 'wait_timeout' OR
  Variable_name = 'character_set_client' OR Variable_name = 'character_set_connection' OR
  Variable_name = 'character_set' OR Variable_name = 'character_set_server' OR
  Variable_name = 'tx_isolation' OR Variable_name = 'transaction_isolation' OR
  Variable_name = 'character_set_results' OR Variable_name = 'timezone' OR
  Variable_name = 'time_zone' OR Variable_name = 'system_time_zone' OR
  Variable_name = 'lower_case_table_names' OR Variable_name = 'max_allowed_packet' OR
  Variable_name = 'net_buffer_length' OR Variable_name = 'sql_mode' OR
  Variable_name = 'query_cache_type' OR Variable_name = 'query_cache_size' OR
  Variable_name = 'init_connect';
• Takes 180ms +
Fix
Add &cacheServerConfiguration=true to your JDBC URL
@see http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
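A hypothetical connection setup with that directive applied; host, port, database, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class CachedConfigConnection {
    public static Connection open() throws SQLException {
        // cacheServerConfiguration=true stops Connector/J from re-issuing the
        // SHOW VARIABLES round trip on every new connection.
        String url = "jdbc:mysql://shard01.example.com:3306/appdb"
                   + "?cacheServerConfiguration=true";
        return DriverManager.getConnection(url, "appuser", "secret");
    }
}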
Writing Large Strings REALTIME
• Incrementing impressions is easy, but storing referrer URLs in real time is not as easy
• Why you must know the limits of the storage engine you are using
INNODB & Strings
• Indexing a string takes a lot of space
• Indexing a large string takes even more space
• Each index has its own 16KB page
• Fragmentation across pages was hurting the app, chewing up I/O
• Lots of disk space chewed up per day
• Due to a bunch of overhead with strings & deadlock detection
INNODB & High Concurrency of Writes
• Requirement: 300 ms for total db access FOR ALL Apps
• Writes slow down at high concurrency when the datafile size is greater than the buffer pool size
• 10 ms to 20 seconds sometimes for the full transaction
• Fixed by offloading the query to an offline task that writes it with a single thread
Deadlock / Transaction Overhead Solved
• Put up a Java daemon that buffers up to 4000 messages (transactions) and applies them serially with one thread
• It does not go down, and if it does we can fail over
• Log data to local disk for outstanding transactions
• It does not use much memory or CPU
• Even during peak, messages do not exceed 200 outstanding transactions
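A hedged sketch of such a daemon: callers hand off messages, a single thread drains up to 4000 at a time and applies them serially, and each message is logged to local disk first so outstanding work survives a restart. The logging and apply steps are illustrative.

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SerialWriteDaemon implements Runnable {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(4000);

    /** Request path: log the message locally, then enqueue it. */
    public void submit(String message) throws IOException, InterruptedException {
        try (FileWriter log = new FileWriter("outstanding.log", true)) {
            log.write(message + "\n");   // durable record of outstanding transactions
        }
        buffer.put(message);             // blocks if 4000 are already queued
    }

    /** Single writer thread: applies buffered messages serially, one at a time. */
    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        while (true) {
            batch.clear();
            try {
                batch.add(buffer.take());
                buffer.drainTo(batch, 3999);   // grab whatever else is waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            for (String message : batch) {
                applySerially(message);        // one writer thread, no deadlocks to detect
            }
        }
    }

    private void applySerially(String message) {
        // omitted: execute the corresponding INSERT/UPDATE against the shard
    }
}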
Disk Consumption solved
• Archive data
• Compress using INNODB 1.0.4
• innodb_file_format = Barracuda
• 8K key block size: best bang for the buck for our data. A smaller key block size causes major slowdowns in transactions.
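One way to apply those settings, sketched via JDBC to stay in Java; the table name and connection details are placeholders, and in practice innodb_file_format would normally be set in my.cnf.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CompressArchiveTable {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://archive-shard.example.com:3306/appdb", "appuser", "secret");
             Statement st = c.createStatement()) {
            // Barracuda is required for compressed row format (InnoDB plugin / 1.0.x era);
            // compression also requires innodb_file_per_table to be enabled.
            st.execute("SET GLOBAL innodb_file_format = 'Barracuda'");
            st.execute("ALTER TABLE referrer_urls"
                     + " ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8");   // 8K key block size
        }
    }
}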
Stats Across All Services
Over 17 billion transactions per day, can sustain 200K+++ TPS
Across 25 TB of data
300K Memcache Gets a Second
10 million active users per Shard
A large % of all major social network users have a RockYou presence (federated by SN/user)
99.999% uptime
All connections are made on the fly
All balancing is handled by the application
Memcache is used to reduce latency; the system can run without it!
Questions / Want to Work here?
dathan@rockyou.com