MySQL Latency
Jeff Freund, CTO, Clickability
End of a long day, I am the last stop between you and …..
~6 hours 42 mins left
Yippee!
Future CTO
• Software-as-a-Service Web CMS
• True multi-tenant SaaS platform from the ground up
• Integrated solution of all services required to run a sophisticated business website
• HQ in San Francisco, 8+ years old, 60+ employees
Global leader in On Demand Web Content Management
250+ million pages delivered per month
• Linux
• Apache
• MySQL
• Java
• Tomcat
Proven open source building blocks
• Scale out horizontally
• Distributed infrastructure, including multiple data centers
• Multiple layers of caching for performance
• Loose coupling of applications around data
[Diagram: masters M1 and M2, one per data center (Data Center 1 and Data Center 2), linked by a VPN tunnel; slaves S1–S6 replicate from them. Application servers obtain connections through an RW ConnectionManager (masters) and an RO ConnectionManager (slaves).]
con = db.getReadWriteConnection();
con = db.getReadOnlyConnection();
con = db.getSafeReadConnection();
Application Code
• Intelligently split queries between masters and slaves
• Inserts/Updates/Deletes sent to Master
• Most Reads sent to Slaves
• “Safe” Reads sent to Masters – zero tolerance for latency
• Manual code updates to implement the split
• 6+ months in production to find all “Safe” Reads
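The split described in the bullets above can be sketched as a small connection router. This is a minimal sketch in Python; the `ConnectionManager` class and its method names mirror the slide's `db.get…Connection()` calls, but the round-robin implementation is a hypothetical stand-in for Clickability's actual Java code.

```python
import itertools

class ConnectionManager:
    """Routes queries as the slides describe: writes and latency-sensitive
    ("safe") reads go to a master; ordinary reads are round-robined
    across the replication slaves."""

    def __init__(self, masters, slaves):
        self._masters = itertools.cycle(masters)
        self._slaves = itertools.cycle(slaves)

    def get_read_write_connection(self):
        # Inserts/updates/deletes must hit the authoritative copy.
        return next(self._masters)

    def get_safe_read_connection(self):
        # "Safe" reads have zero tolerance for latency, so they also
        # go to a master.
        return next(self._masters)

    def get_read_only_connection(self):
        # Everything else may be served by a (possibly lagging) slave.
        return next(self._slaves)

db = ConnectionManager(masters=["M1", "M2"], slaves=["S1", "S2", "S3"])
print(db.get_read_write_connection())  # M1
print(db.get_read_only_connection())   # S1
```

The hard part, per the slide, was not the router but the manual audit of application code to find every read that had to be "safe".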
• The time between when a transaction is committed on one database and when it is subsequently committed on a replica.
• Latency can either be “slowness” or “breakage”
7… Hardware Maintenance / Recovery
6… Schema updates / DB Maintenance
5… Elevated transaction rates (e.g., bulk loads)
4... High query load on slaves
3… Network bottlenecks / Loss of connectivity
2… “Slave errors” (e.g., duplicate keys, deadlocks)
while ( 1 )
    echo "show slave status \G" | mysql -u USER --password=PASSWORD | grep Seconds_Behind_Master >> replication.log
    sleep 1
end
[Chart: Seconds_Behind_Master over time, as logged by the monitoring loop.]
[Diagram: production topology: masters M1 and M2 in Data Center 1 and Data Center 2, linked by a VPN tunnel, with slaves S1–S6.]
[Diagram: measurement topology: transactions inserted on M2 replicate across the VPN tunnel to M1 and to slaves S4 and S6.]
CREATE TABLE `replTest` (
  `timecol` bigint(20) default NULL,
  KEY `idx_timecol` (`timecol`)
);
Loop:
  $val = current timestamp in epoch milliseconds
  M2: INSERT INTO replTest (timecol) VALUES ($val)
  M1: SELECT $val - max(timecol) FROM replTest;
  S4: SELECT $val - max(timecol) FROM replTest;
  S6: SELECT $val - max(timecol) FROM replTest;
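The measurement loop can be run end to end against any SQL database. Here is a self-contained sketch in Python using an in-memory sqlite3 database as a stand-in; a real run would open one MySQL connection per host (the INSERT on M2, the SELECTs on M1, S4, and S6).

```python
import sqlite3
import time

# One connection stands in for all four hosts so the example is
# self-contained; replace with per-host MySQL connections in practice.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE replTest (timecol BIGINT)")
db.execute("CREATE INDEX idx_timecol ON replTest (timecol)")

val = int(time.time() * 1000)  # current timestamp in epoch milliseconds
db.execute("INSERT INTO replTest (timecol) VALUES (?)", (val,))

# On each replica: how far behind $val is the newest replicated row?
(lag_ms,) = db.execute(
    "SELECT ? - max(timecol) FROM replTest", (val,)).fetchone()
print(lag_ms)  # 0 on the source itself; positive on a lagging replica
```

Because the probe row carries its own commit timestamp, the SELECT measures true end-to-end latency rather than the estimate `Seconds_Behind_Master` reports.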
Database   Characteristics         Average Latency   Max Latency
M2         Transaction source      N/A               N/A
M1         Local; moderate load    ~6 ms             ~315 ms
S4         Local; high load        ~190 ms           ~12 s
S6         Remote; minimal load    ~5 ms             ~400 ms
• All DBs are one replication hop away from the transaction source
• All hardware is roughly equal
• Remote location is ~60 miles away
• Data taken from 100,000 samples over an hour of standard operations
S4 Database
[Histogram: distribution of S4 replication latency samples in milliseconds, with a long tail.]
95% of the time, replication latency will be 1 second or less.
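The "95% of the time" figure is a percentile over the latency samples. A minimal nearest-rank percentile sketch in Python; the sample distribution below is invented for illustration, not Clickability's data.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value that is
    greater than or equal to pct percent of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Invented latency samples (ms) with a long tail, for illustration only.
samples = [5] * 90 + [800] * 5 + [12000] * 5
print(percentile(samples, 95))  # 800 ms: 95% of samples within 1 second
```

Note how the p95 hides the tail entirely: the same data set has a p99 in the tens of seconds, which is exactly the tail that breaks cache-clearing later in the deck.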
• Now what?
Assume that it will happen in the course of standard operations. Build the application to accommodate it.
If you do, your Ops Team will love you for it.
• Local ehcache on application servers
• Distributed Object Cache (memcached)
• Need to clear all caches effectively on object updates
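The two cache layers and the invalidation requirement can be sketched as follows. Plain dicts stand in for ehcache and memcached, and all names are hypothetical. The point of the sketch is the failure mode: clearing on one server still leaves every other server's local cache stale, which is the problem the cache-clearing messages on the next slides exist to solve.

```python
class TwoLevelCache:
    """Local in-process cache (ehcache's role) in front of a shared
    distributed cache (memcached's role); dicts stand in for both."""

    def __init__(self, distributed):
        self.local = {}
        self.distributed = distributed  # shared across app servers

    def get(self, key, loader):
        if key in self.local:
            return self.local[key]
        if key in self.distributed:
            value = self.distributed[key]
        else:
            value = loader(key)          # fall through to the database
            self.distributed[key] = value
        self.local[key] = value
        return value

    def clear(self, key):
        # On object update, both layers must be cleared.
        self.local.pop(key, None)
        self.distributed.pop(key, None)

shared = {}
pub1, pub2 = TwoLevelCache(shared), TwoLevelCache(shared)
pub1.get("page:1", lambda k: "v1")
pub2.get("page:1", lambda k: "v1")
pub1.clear("page:1")               # clears pub1's local + the shared layer...
print("page:1" in pub2.local)      # True: ...but pub2's local copy is stale
```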
[Diagram: publishing servers Pub 1, Pub 2, and Pub 3, each with a local cache, sharing a distributed object cache.]
Reliable Cache Clearing Messages
• Multicast Notification Bus for “clear cache” messages
• The race is on! If the message arrives before the transaction is replicated, a stale object may be reloaded….
• Frequently accessed objects most susceptible to problems
[Diagram: the CMS multicasts a clear-cache message to the Pub servers while the transaction replicates from DB1 to DB2.]
• Multicast Notification Bus with tuning parameters
• The race is on again! But the database transaction gets a tunable head start. 0.5 sec, 1 sec, 2 secs, 5 secs
• Better – lasted for years, but in the end 99.99+% still wasn’t reliable enough...(remember the long tail on chart?)
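The tunable head start can be modeled as a race between two timed events. A toy event simulation in Python; the delays are illustrative, and the real system used a multicast bus, not this code.

```python
import heapq

def clear_wins(replication_delay_ms, clear_delay_ms):
    """Replay the race: data commits at t=0 and replicates after
    replication_delay_ms; the clear-cache message is sent after a
    tunable head start of clear_delay_ms. Returns True if the clear
    lands after replication (fresh reload), False if it loses the
    race (a stale object may be reloaded)."""
    events = [(replication_delay_ms, "replicated"),
              (clear_delay_ms, "clear_cache")]
    heapq.heapify(events)
    replicated = False
    while events:
        _, event = heapq.heappop(events)
        if event == "replicated":
            replicated = True
        elif event == "clear_cache":
            return replicated
    return replicated

print(clear_wins(replication_delay_ms=190, clear_delay_ms=0))      # False
print(clear_wins(replication_delay_ms=190, clear_delay_ms=1000))   # True
print(clear_wins(replication_delay_ms=12000, clear_delay_ms=1000)) # False
```

The third case is the long tail from the latency histogram: no fixed head start beats a 12-second replication spike, which is why this approach topped out around 99.99%.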
[Diagram: the same topology, with the clear-cache message delayed to give DB1-to-DB2 replication a head start.]
• Database Queue table for messages
• Messages are committed after data, injecting them into the replication data stream.
• All apps poll the database queue table once per second.
• Guaranteed that the data will arrive before the message!
[Diagram: the CMS commits the message to a queue table on DB1; it replicates to DB2, where each Pub server's QueuePoller picks it up.]
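The guarantee works because the clear-cache message is committed after the data, so both travel the same ordered replication stream: a poller on a replica can never observe the message before the data it refers to. A toy Python model of that ordering (a FIFO stands in for the binlog; table and field names are invented):

```python
import collections

class ReplicationStream:
    """MySQL replication applies transactions in commit order;
    a FIFO stands in for the binlog."""

    def __init__(self):
        self.binlog = collections.deque()

    def commit(self, table, row):
        self.binlog.append((table, row))

class Replica:
    def __init__(self):
        self.tables = collections.defaultdict(list)

    def apply_next(self, stream):
        table, row = stream.binlog.popleft()
        self.tables[table].append(row)

    def poll_queue(self):
        # What the once-per-second QueuePoller would see.
        return list(self.tables["cache_clear_queue"])

stream, replica = ReplicationStream(), Replica()
stream.commit("pages", {"id": 1, "body": "new"})          # data first
stream.commit("cache_clear_queue", {"object": "page:1"})  # message second

replica.apply_next(stream)
for msg in replica.poll_queue():
    # Any visible message implies its data has already replicated.
    assert {"id": 1, "body": "new"} in replica.tables["pages"], msg
replica.apply_next(stream)
print(replica.poll_queue())  # [{'object': 'page:1'}]
```

The trade-off versus multicast is latency: invalidation now waits for replication plus up to one polling interval, but it can no longer lose the race.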
• If you don’t need to replicate it, don’t!
• Split data functionally (e.g., separate large blob storage from relational transactions to keep the pipes clear)
• Build the appropriate recovery tools – our “rewind button”
• Masters in multiple data centers
• Greater geographic distance between data centers
• MySQL load balancing – will messaging still be reliable???
Questions? Feedback?