Shard-In after Sharding Out with SSD

Post on 09-Nov-2014

Why we shard, and tips on how to combine shards from both a code perspective and an operational perspective

Scale In after Scaling Out

WHY, WHEN AND HOW

Who am I

Been using mySQL since 1999

Worked for FriendFinder, Friendster, Flickr, Rockyou, SchoolFeed, Weebly

Presented: Flickr Architecture: Doing Billions of Queries Per Day; Record Every Referral For Flickr Real-time; Scaling to 200K TPS with Open Source; Scaling a Widget Company; MySpace vs. Facebook API Load Patterns; a University of Utah presentation; and various others.

Patterns from Start to Scale

Start a project with a single mySQL DB.

Get some users; add more disks to the mySQL DB

Get some more users; add a slave

Add more Slaves

Then need to split up the master

Patterns: Continued

Master is not strong enough

Put tables on other servers with slaves

Constantly battle slave lag

[Diagram: the Big Table on one server; All Tables Minus the Big Table on another]

Still not scaling Horizontally

Let’s Shard

Federation

[Diagram: User 1’s Data, User 2’s Data, User 3’s Data, …, User N’s Data]

Assign a section of the Database as a whole to a Server

Have a Layer to tell the connector what to connect to

Add slaves for redundancy

Go Master-Master

Do this N times

Finally provided stable service

Federation

This increases write throughput

Now you have the ability to scale Horizontally

What problem was solved at its lowest level?

Lack of IOPS: solved

Handling of concurrency: solved

Problems introduced

Lots of power used

Lots of servers to manage

Lots of rack space used

Some less-than-optimal hardware usage

SSD

SSDs use NAND flash chips; each chip holds millions of cells.

SLC holds a single data bit per cell; MLC holds multiple data bits per cell, yielding higher density and therefore more disk space.

Typically MLC provides slower throughput than SLC due to more complicated error correction algorithms and false positive reads.

MLC is not all bad: major leaps in firmware have improved it

More enterprises are using MLC

It’s cheaper

Fast Enough

Endurance improved

Write amplification improved (erasure and data wad resends)

Stay on top of Firmware changes

TRIM improvements, which address writes getting progressively slower as the same blocks are rewritten over and over.

We use Intel SSDSA2CW160 320 Series MLC SSD

It’s FAST

It’s Reliable

It has advanced power protection features: really big capacitors to flush buffered data

Low power usage

We consider it the best

It’s no longer made

Everyone wants it

Speed of a single SSD versus a single spinning metal drive

20K IOPS writes reported - SSD

35K IOPS reads reported – SSD

200 IOPS for SEAGATE ST9146852SS HT043TB0584C

Now that we have more IOPS, we need space

RAID-5 gives the best trade-off between space and performance that we can use.

8 × 160GB SSDs give 1TB 613GB of usable space

Raw size is 149 GB per disk (the remainder is reserved for wear)

Now that we have space we can combine Shards

First, get the IOPS usage of the current shards: it is 12K IOPS

Next, get the disk space requirements. Do not use more than 50%, so you have room for growth. I now use 56% of the space.

Depending on the replication traffic per shard, you may need another plan
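
The two checks above (total IOPS and disk fill) come down to simple arithmetic. Here is a minimal sketch, assuming the per-shard numbers have already been measured. The function name, the even 1,500 IOPS split, the 70 GB shard sizes, and the target capacities are all hypothetical placeholders, chosen only so the totals land on the 12K IOPS and 56% figures from the slide.

```python
def fits_on_target(shard_iops, shard_gb, target_iops, target_gb, max_fill=0.50):
    """Check whether a set of shards can be combined onto one target server.

    Returns (iops_ok, space_ok, fill_ratio).
    """
    total_iops = sum(shard_iops)      # combined IOPS the target must sustain
    total_gb = sum(shard_gb)          # combined on-disk size of the shards
    fill = total_gb / target_gb       # fraction of usable space consumed
    return total_iops <= target_iops, fill <= max_fill, fill


# Hypothetical inputs: 8 shards at 1,500 IOPS and 70 GB each,
# against a target assumed to offer 20K IOPS and 1,000 GB usable.
iops_ok, space_ok, fill = fits_on_target([1500] * 8, [70] * 8, 20000, 1000)
```

With these placeholder inputs the IOPS check passes but the space check does not, which mirrors the situation described above: the 50% guideline is a target, and the actual usage sits slightly over it.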

How to combine Data

Code Steps

Have a program that keeps a hashmap of tablename to federated column

Lock the federated entity by throwing an error in the application that says this federated entity is not available

SELECT ALL Federated data (in chunks) and add it to the new combined table.

Update pointers

Error if any step fails and keep the data locked; otherwise unlock
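
The code steps above can be sketched roughly as follows. This is an illustration only, not the code used in the talk: it uses in-memory SQLite connections in place of MySQL so that it runs anywhere, and the table name `page_views`, the column `site_id`, the `pointers` dict, and the `locked` set are all hypothetical stand-ins.

```python
import sqlite3


class EntityLocked(Exception):
    """Raised by the application while a federated entity is being moved."""


# Step 1: map of table name -> federated (shard key) column. Hypothetical names.
FEDERATED_COLUMN = {"page_views": "site_id"}


def check_available(locked, entity_id):
    """Application-side guard: refuse access to an entity that is locked."""
    if entity_id in locked:
        raise EntityLocked(f"federated entity {entity_id} is not available")


def move_entity(src, dst, pointers, locked, entity_id, new_shard, chunk=2):
    """Copy one entity's rows from src to dst in chunks, then repoint it."""
    locked.add(entity_id)                          # step 2: lock the entity
    for table, col in FEDERATED_COLUMN.items():
        last_rowid = 0
        while True:                                # step 3: select in chunks
            rows = src.execute(
                f"SELECT rowid, * FROM {table} WHERE {col} = ? AND rowid > ? "
                "ORDER BY rowid LIMIT ?",
                (entity_id, last_rowid, chunk),
            ).fetchall()
            if not rows:
                break
            marks = ",".join("?" * (len(rows[0]) - 1))
            dst.executemany(
                f"INSERT INTO {table} VALUES ({marks})", [r[1:] for r in rows]
            )
            last_rowid = rows[-1][0]
    dst.commit()
    pointers[entity_id] = new_shard                # step 4: update pointers
    locked.discard(entity_id)                      # step 5: unlock on success only
```

Because the unlock is the last statement, any exception raised earlier leaves the entity in `locked`, which matches the rule of keeping the data locked when a step fails.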

Another way to combine, more Operational

Take a copy of the shard.

Configure multi instance mysql

Run that shard off a different port

I chose to do both methods; here is how

Let’s take a case.

Support 20 million websites

90% of all sites get 1 or more hits but fewer than 1,000 hits per day

Fewer than 10% of sites get more than 1,000 hits per day

8 shards to handle 12K IOPS

64 CPU threads

288 GB of memory

64 2.5” drives

Roughly $40K of hardware

Multiply by 2 for redundancy

Replication was lagging

Simply combining the data onto one server will not work

The master needs 10K IOPS; a single replication thread, even with some tricks, can use only 2.5K IOPS

innodb_fake_changes did not work

Facebook’s faker helped, but the CPU was underpowered, so it could not saturate IOPS and keep the slave in sync

Multi_Mysql is the answer

Set up 4 mySQL instances

Instead of 1 Replication thread I now have 4

Instead of being limited to 2.5K IOPS on SSD by a single replication thread, I now have 10K IOPS

The master produces 10K IOPS from ETL that runs on all front ends

I don’t need much memory; in fact, each instance has only a 4GB buffer pool
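
As an illustration of the four-instance layout, the sketch below renders a configuration file in the style read by MySQL's `mysqld_multi` wrapper. It assumes that "Multi_Mysql" refers to that tool; the talk does not say so explicitly. The binary paths, data directories, socket paths, and the choice of consecutive ports starting at 3306 are hypothetical. Only the instance count and the 4GB buffer pool come from the slides.

```python
def render_multi_config(instances=4, base_port=3306, buffer_pool="4G",
                        data_root="/var/lib/mysql"):
    """Render a mysqld_multi-style config with one [mysqldN] group per instance."""
    lines = [
        "[mysqld_multi]",
        "mysqld = /usr/bin/mysqld_safe",       # hypothetical path
        "mysqladmin = /usr/bin/mysqladmin",    # hypothetical path
        "",
    ]
    for n in range(1, instances + 1):
        lines += [
            f"[mysqld{n}]",
            f"port = {base_port + n - 1}",       # each instance on its own port
            f"socket = /tmp/mysql{n}.sock",      # and its own socket
            f"datadir = {data_root}{n}",         # and its own data directory
            f"innodb_buffer_pool_size = {buffer_pool}",
            "",
        ]
    return "\n".join(lines)


print(render_multi_config())
```

Each instance runs its own replication thread, so four instances at roughly 2.5K IOPS each add up to the 10K IOPS the master produces.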

Let’s look at the details

All tables are compressed

KEY_BLOCK_SIZE=8 (8K compressed pages) with InnoDB

Consistent Hash for Hostname to bigint

Remove Lookup in exchange for a small CPU computation

MD5-based: take the first 16 hex digits of the digest to produce an 8-byte bigint

Primary KEY is HashId + Hostname(10)

HashId maps to ShardId with range blocks
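
A minimal sketch of the hashing scheme described above. Some details are assumptions rather than facts from the talk: the digest is truncated to its first 16 hex digits (64 bits, which is what fits an 8-byte unsigned bigint), hostnames are lowercased before hashing, and the range blocks are of equal width. The function names and the default of 32 shards are illustrative.

```python
import hashlib


def host_hash_id(hostname: str) -> int:
    """Map a hostname to a 64-bit integer using the first 16 hex digits of its MD5."""
    normalized = hostname.strip().lower()          # assumption: names stored lowercase
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)                    # 16 hex digits = 8 bytes


def shard_for_hash(hash_id: int, total_shards: int = 32) -> int:
    """Map a HashId to a ShardId using equal-width range blocks (an assumption)."""
    block = (1 << 64) // total_shards              # width of each range block
    return hash_id // block


hash_id = host_hash_id("www.example.com")
shard_id = shard_for_hash(hash_id)
```

Because the HashId is computed rather than looked up, the application needs no lookup table to find a hostname's shard. The trade is one MD5 computation per request.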

Test Hashing is even

Run through all hostnames to assign to a shard

[Chart: X = # Shards, Y = PV/Site]
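
One way to run such an evenness test is sketched below. It counts hostnames per shard rather than page views per site, which is a simplification of what the chart plots. The hash function repeats the MD5 scheme under the same assumptions as before (first 16 hex digits, lowercase names, equal-width ranges), and the synthetic hostnames are made up for the example.

```python
import hashlib
from collections import Counter


def shard_of(hostname, total_shards=32):
    """Assumed scheme: first 16 hex digits of MD5, split into equal ranges."""
    hash_id = int(hashlib.md5(hostname.lower().encode("utf-8")).hexdigest()[:16], 16)
    return hash_id // ((1 << 64) // total_shards)


def shard_counts(hostnames, total_shards=32):
    """Return the number of hostnames that land on each shard, in shard order."""
    counts = Counter(shard_of(h, total_shards) for h in hostnames)
    return [counts.get(s, 0) for s in range(total_shards)]


def max_skew(counts):
    """Largest relative deviation of any shard from the mean load."""
    mean = sum(counts) / len(counts)
    return max(abs(c - mean) for c in counts) / mean


# Synthetic hostnames, for illustration only.
hostnames = [f"site{i}.example.com" for i in range(100_000)]
counts = shard_counts(hostnames)
print(f"max skew from the mean: {max_skew(counts):.1%}")
```

To weight by traffic instead, sum each site's page views into its shard's bucket rather than adding one per hostname.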

Shard on a Single Server

There are 8 Databases per Server Instance

A database represents a shard

There are 4 mySQL Server Instances

32 Shards Total

Can isolate a single DB to a single Server

Can isolate a single Host to a single Shard

Based on a range, go to the correct server, port, and database
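
The layout above (4 instances × 8 databases = 32 shards) can be sketched as a small routing function. The host name, the port numbering, and the `shard_NN` database naming are hypothetical. The `overrides` argument illustrates how a single database could be isolated onto its own server.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    host: str
    port: int
    database: str


def route_for_shard(shard_id: int,
                    overrides: Optional[dict] = None,
                    host: str = "db1.example.internal",   # hypothetical host
                    base_port: int = 3306,
                    dbs_per_instance: int = 8,
                    instances: int = 4) -> Route:
    """Resolve a ShardId to the server, port, and database that hold it."""
    if not 0 <= shard_id < dbs_per_instance * instances:
        raise ValueError(f"shard_id {shard_id} out of range")
    if overrides and shard_id in overrides:
        return overrides[shard_id]                 # shard isolated elsewhere
    instance = shard_id // dbs_per_instance        # which mysqld instance
    return Route(host, base_port + instance, f"shard_{shard_id:02d}")


# Shards 0-7 live on the first instance, 8-15 on the second, and so on.
print(route_for_shard(9))

# Isolate shard 9 onto a dedicated (hypothetical) server.
moved = {9: Route("db2.example.internal", 3306, "shard_09")}
print(route_for_shard(9, overrides=moved))
```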

How to switch

Write in both locations

Log if write fails in a single location (none happened)

Backfill old data to new format

Switch reads over to the new format once data is verified as correct
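
The write-in-both-locations step with a switchable read path could look roughly like the sketch below. The class name and the `read_new` flag are hypothetical, and plain dict-like objects stand in for the old and new storage formats.

```python
import logging

log = logging.getLogger("dual_write")


class DualWriteStore:
    """Write to both the old and new locations; read from one, chosen by a flag."""

    def __init__(self, old, new, read_new=False):
        self.old = old
        self.new = new
        self.read_new = read_new       # flip to True once the data is verified

    def write(self, key, value):
        """Attempt both writes; log (rather than hide) a failure in either one."""
        ok = True
        for name, store in (("old", self.old), ("new", self.new)):
            try:
                store[key] = value
            except Exception:
                log.exception("write to %s location failed for %r", name, key)
                ok = False
        return ok

    def read(self, key):
        """Serve reads from whichever format the flag currently selects."""
        source = self.new if self.read_new else self.old
        return source[key]


store = DualWriteStore(old={}, new={})
store.write("example.com", 42)
store.read_new = True                  # switch reads; set back to False to roll back
```

Keeping the read source behind a single flag means a rollback is one config change rather than a redeploy.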

In Staging switch Reads to new format

Verify that Production and Staging render the same graph

Verify that Production and Staging have the same referrers

Sample random Pro accounts and make sure numbers match
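
The sampling check can be sketched as below. The function name and the two lookup callables are hypothetical. In practice they would query the old and new formats for the same account.

```python
import random


def mismatched_accounts(account_ids, old_count, new_count,
                        sample_size=100, seed=None):
    """Sample accounts at random and return those whose numbers differ."""
    rng = random.Random(seed)
    ids = list(account_ids)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return [a for a in sample if old_count(a) != new_count(a)]


# Illustration with in-memory dicts standing in for the two data stores.
old_stats = {account: account * 2 for account in range(50)}
new_stats = dict(old_stats)
bad = mismatched_accounts(old_stats, old_stats.get, new_stats.get, sample_size=20)
```

An empty result on a reasonably large sample is evidence, not proof, that the backfill is correct, which is why the rollout still keeps a rollback switch.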

Roll out with a switch to rollback

There was a bug where some website names were passed as raw user input to the lookup method, yet I had stored all names in lowercase

Turn off new reads with Application config switch

Fix issue and turn on new reads

Clean Up

Once fully over on new format

Kill old format

Repurpose servers

Profit

Some Stats

$80K potential in server cost reduced to $7K

Utilize all the CPU

Less memory per server but more IOPS

All replicas stay in sync because there is now more than one replication thread per physical server. (There are 4.)

Next Generation

Fusion-io PCIe SSD Card

1U Form factor

Less power 40W-50W

No need to RAID

Questions

Twitter @dathanvp

http://mysqldba.blogspot.com

http://facebook.com/dathan

http://linkedIn.com/in/dathan

http://about.me/dathan

mailto:dathanvp@gmail.com