MySQL Latency
Jeff Freund, CTO, Clickability
End of a long day, I am the last stop between you and …..
~6 hours 42 mins left
Yippee!
Future CTO
• Software-as-a-Service Web CMS
• True multi-tenant SaaS platform from the ground up
• Integrated solution of all services required to run a sophisticated business website
• HQ in San Francisco, 8+ years old, 60+ employees
Global leader in On Demand Web Content Management
250+ million pages delivered per month
• Linux
• Apache
• MySQL
• Java
• Tomcat
Proven open source building blocks
• Scale out horizontally
• Distributed infrastructure, including multiple data centers
• Multiple layers of caching for performance
• Loose coupling of applications around data
[Diagram: masters M1 and M2, one per data center (Data Center 1 and Data Center 2), linked by a VPN tunnel; slaves S1–S6 replicate from them. Application servers obtain connections through an RW ConnectionManager (masters) and an RO ConnectionManager (slaves).]
con = db.getReadWriteConnection();
con = db.getReadOnlyConnection();
con = db.getSafeReadConnection();
Application Code
• Intelligently split queries between masters and slaves
• Inserts/Updates/Deletes sent to Master
• Most Reads sent to Slaves
• “Safe” Reads sent to Masters – zero tolerance for latency
• Manual code updates to implement the split
• 6+ months in production to find all “Safe” Reads
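The split described in the bullets above can be sketched as a small connection router. This is a minimal sketch in Python; the `ConnectionManager` class and its method names mirror the slide's `db.get…Connection()` calls, but the round-robin implementation is a hypothetical stand-in for Clickability's actual Java code.

```python
import itertools

class ConnectionManager:
    """Routes queries as the slides describe: writes and latency-sensitive
    ("safe") reads go to a master; ordinary reads are round-robined
    across the replication slaves."""

    def __init__(self, masters, slaves):
        self._masters = itertools.cycle(masters)
        self._slaves = itertools.cycle(slaves)

    def get_read_write_connection(self):
        # Inserts/updates/deletes must hit the authoritative copy.
        return next(self._masters)

    def get_safe_read_connection(self):
        # "Safe" reads have zero tolerance for latency, so they also
        # go to a master.
        return next(self._masters)

    def get_read_only_connection(self):
        # Everything else may be served by a (possibly lagging) slave.
        return next(self._slaves)

db = ConnectionManager(masters=["M1", "M2"], slaves=["S1", "S2", "S3"])
print(db.get_read_write_connection())  # M1
print(db.get_read_only_connection())   # S1
```

The hard part, per the slide, was not the router but the manual audit of application code to find every read that had to be "safe".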
• The time between when a transaction is committed on one database and when it is subsequently committed on a replica.
• Latency can either be “slowness” or “breakage”
7… Hardware Maintenance / Recovery
6… Schema updates / DB Maintenance
5… Elevated transaction rates (e.g., bulk loads)
4... High query load on slaves
3… Network bottlenecks / Loss of connectivity
2… “Slave errors” (e.g., duplicate keys, deadlocks)
while ( 1 )
    echo "show slave status \G" | mysql -u USER --password=PASSWORD | grep Seconds_Behind_Master >> replication.log
    sleep 1
end
[Chart: Seconds_Behind_Master over time, as logged by the monitoring loop.]
[Diagram: production topology: masters M1 and M2 in Data Center 1 and Data Center 2, linked by a VPN tunnel, with slaves S1–S6.]
[Diagram: measurement topology: transactions inserted on M2 replicate across the VPN tunnel to M1 and to slaves S4 and S6.]
CREATE TABLE `replTest` (
  `timecol` bigint(20) default NULL,
  KEY `idx_timecol` (`timecol`)
);
Loop:
  $val = current timestamp in epoch milliseconds
  M2: INSERT INTO replTest (timecol) VALUES ($val)
  M1: SELECT $val - max(timecol) FROM replTest;
  S4: SELECT $val - max(timecol) FROM replTest;
  S6: SELECT $val - max(timecol) FROM replTest;
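The measurement loop can be run end to end against any SQL database. Here is a self-contained sketch in Python using an in-memory sqlite3 database as a stand-in; a real run would open one MySQL connection per host (the INSERT on M2, the SELECTs on M1, S4, and S6).

```python
import sqlite3
import time

# One connection stands in for all four hosts so the example is
# self-contained; replace with per-host MySQL connections in practice.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE replTest (timecol BIGINT)")
db.execute("CREATE INDEX idx_timecol ON replTest (timecol)")

val = int(time.time() * 1000)  # current timestamp in epoch milliseconds
db.execute("INSERT INTO replTest (timecol) VALUES (?)", (val,))

# On each replica: how far behind $val is the newest replicated row?
(lag_ms,) = db.execute(
    "SELECT ? - max(timecol) FROM replTest", (val,)).fetchone()
print(lag_ms)  # 0 on the source itself; positive on a lagging replica
```

Because the probe row carries its own commit timestamp, the SELECT measures true end-to-end latency rather than the estimate `Seconds_Behind_Master` reports.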
Database   Characteristics         Average Latency   Max Latency
M2         Transaction source      N/A               N/A
M1         Local; moderate load    ~6 ms             ~315 ms
S4         Local; high load        ~190 ms           ~12 s
S6         Remote; minimal load    ~5 ms             ~400 ms
• All DBs are one replication hop away from the transaction source
• All hardware is roughly equal
• Remote location is ~60 miles away
• Data taken from 100,000 samples over an hour of standard operations
S4 Database
[Histogram: distribution of S4 replication latency samples in milliseconds, with a long tail.]
95% of the time, replication latency will be 1 second or less.
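The "95% of the time" figure is a percentile over the latency samples. A minimal nearest-rank percentile sketch in Python; the sample distribution below is invented for illustration, not Clickability's data.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value that is
    greater than or equal to pct percent of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Invented latency samples (ms) with a long tail, for illustration only.
samples = [5] * 90 + [800] * 5 + [12000] * 5
print(percentile(samples, 95))  # 800 ms: 95% of samples within 1 second
```

Note how the p95 hides the tail entirely: the same data set has a p99 in the tens of seconds, which is exactly the tail that breaks cache-clearing later in the deck.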
• Now what?
Assume that it will happen in the course of standard operations. Build the application to accommodate it.
If you do, your Ops Team will love you for it.
• Local ehcache on application servers
• Distributed Object Cache (memcached)
• Need to clear all caches effectively on object updates
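The two cache layers and the invalidation requirement can be sketched as follows. Plain dicts stand in for ehcache and memcached, and all names are hypothetical. The point of the sketch is the failure mode: clearing on one server still leaves every other server's local cache stale, which is the problem the cache-clearing messages on the next slides exist to solve.

```python
class TwoLevelCache:
    """Local in-process cache (ehcache's role) in front of a shared
    distributed cache (memcached's role); dicts stand in for both."""

    def __init__(self, distributed):
        self.local = {}
        self.distributed = distributed  # shared across app servers

    def get(self, key, loader):
        if key in self.local:
            return self.local[key]
        if key in self.distributed:
            value = self.distributed[key]
        else:
            value = loader(key)          # fall through to the database
            self.distributed[key] = value
        self.local[key] = value
        return value

    def clear(self, key):
        # On object update, both layers must be cleared.
        self.local.pop(key, None)
        self.distributed.pop(key, None)

shared = {}
pub1, pub2 = TwoLevelCache(shared), TwoLevelCache(shared)
pub1.get("page:1", lambda k: "v1")
pub2.get("page:1", lambda k: "v1")
pub1.clear("page:1")               # clears pub1's local + the shared layer...
print("page:1" in pub2.local)      # True: ...but pub2's local copy is stale
```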
[Diagram: publishing servers Pub 1, Pub 2, and Pub 3, each with a local cache, sharing a distributed object cache.]
Reliable Cache Clearing Messages
• Multicast Notification Bus for “clear cache” messages
• The race is on! If the message arrives before the transaction is replicated, a stale object may be reloaded….
• Frequently accessed objects most susceptible to problems
[Diagram: the CMS multicasts a clear-cache message to the Pub servers while the transaction replicates from DB1 to DB2.]
• Multicast Notification Bus with tuning parameters
• The race is on again! But the database transaction gets a tunable head start. 0.5 sec, 1 sec, 2 secs, 5 secs
• Better – lasted for years, but in the end 99.99+% still wasn’t reliable enough...(remember the long tail on chart?)
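The tunable head start can be modeled as a race between two timed events. A toy event simulation in Python; the delays are illustrative, and the real system used a multicast bus, not this code.

```python
import heapq

def clear_wins(replication_delay_ms, clear_delay_ms):
    """Replay the race: data commits at t=0 and replicates after
    replication_delay_ms; the clear-cache message is sent after a
    tunable head start of clear_delay_ms. Returns True if the clear
    lands after replication (fresh reload), False if it loses the
    race (a stale object may be reloaded)."""
    events = [(replication_delay_ms, "replicated"),
              (clear_delay_ms, "clear_cache")]
    heapq.heapify(events)
    replicated = False
    while events:
        _, event = heapq.heappop(events)
        if event == "replicated":
            replicated = True
        elif event == "clear_cache":
            return replicated
    return replicated

print(clear_wins(replication_delay_ms=190, clear_delay_ms=0))      # False
print(clear_wins(replication_delay_ms=190, clear_delay_ms=1000))   # True
print(clear_wins(replication_delay_ms=12000, clear_delay_ms=1000)) # False
```

The third case is the long tail from the latency histogram: no fixed head start beats a 12-second replication spike, which is why this approach topped out around 99.99%.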
[Diagram: the same topology, with the clear-cache message delayed to give DB1-to-DB2 replication a head start.]
• Database Queue table for messages
• Messages are committed after data, injecting them into the replication data stream.
• All apps poll the database queue table once per second.
• Guaranteed that the data will arrive before the message!
[Diagram: the CMS commits the message to a queue table on DB1; it replicates to DB2, where each Pub server's QueuePoller picks it up.]
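The guarantee works because the clear-cache message is committed after the data, so both travel the same ordered replication stream: a poller on a replica can never observe the message before the data it refers to. A toy Python model of that ordering (a FIFO stands in for the binlog; table and field names are invented):

```python
import collections

class ReplicationStream:
    """MySQL replication applies transactions in commit order;
    a FIFO stands in for the binlog."""

    def __init__(self):
        self.binlog = collections.deque()

    def commit(self, table, row):
        self.binlog.append((table, row))

class Replica:
    def __init__(self):
        self.tables = collections.defaultdict(list)

    def apply_next(self, stream):
        table, row = stream.binlog.popleft()
        self.tables[table].append(row)

    def poll_queue(self):
        # What the once-per-second QueuePoller would see.
        return list(self.tables["cache_clear_queue"])

stream, replica = ReplicationStream(), Replica()
stream.commit("pages", {"id": 1, "body": "new"})          # data first
stream.commit("cache_clear_queue", {"object": "page:1"})  # message second

replica.apply_next(stream)
for msg in replica.poll_queue():
    # Any visible message implies its data has already replicated.
    assert {"id": 1, "body": "new"} in replica.tables["pages"], msg
replica.apply_next(stream)
print(replica.poll_queue())  # [{'object': 'page:1'}]
```

The trade-off versus multicast is latency: invalidation now waits for replication plus up to one polling interval, but it can no longer lose the race.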
• If you don’t need to replicate it, don’t!
• Split data functionally (e.g., separate large blob storage from relational transactions to keep the pipes clear)
• Build the appropriate recovery tools – our “rewind button”
• Masters in multiple data centers
• Greater geographic distance between data centers
• MySQL load balancing – will messaging still be reliable???
Questions? Feedback?