Scaling Social Games

Scaling social games “the order of magnitude challenge” Paolo Negri @hungryblank

description

Short talk given at the Berlin Hadoop Get Together on 27 January 2011

Transcript of Scaling Social Games

Page 1: Scaling Social Games

Scaling social games “the order of magnitude challenge”

Paolo Negri @hungryblank

Page 2: Scaling Social Games

Order of magnitude

DAU: daily active users

[Chart: DAU from July to December, y-axis from 0 to 1,000,000]

Page 3: Scaling Social Games

Flash client (game) HTTP API

Social Games

http://www.flickr.com/photos/stars6/4381851322

Page 4: Scaling Social Games

Flash client

Social Games

• Game actions need to be persisted and validated

• 1 API call every few secs

Page 5: Scaling Social Games

HTTP API

Social Games


• 5000 HTTP reqs/sec

• more than 90% writes

• 60K queries/sec
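The two rates on this slide imply an average number of queries per HTTP request. A back-of-the-envelope check, assuming both figures were measured over the same period (the slide does not say so explicitly):

```python
# Figures taken from the slide; assumed to describe the same time window.
http_requests_per_sec = 5_000
queries_per_sec = 60_000

# Average number of database queries triggered by one API call.
queries_per_request = queries_per_sec / http_requests_per_sec
print(queries_per_request)
```

This ratio is the number the later "Queries/request" slides set out to reduce.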

Page 6: Scaling Social Games

July 2010

HAproxy

Ruby on Rails

MySQL

• ~ 170 000 daily users

• Plain Ruby on Rails app

• Persistency 100% SQL

Page 7: Scaling Social Games

July 2010

HAproxy

Ruby on Rails

MySQL

• 1 HAProxy server

• multiple RoR servers

• 4 MySQL servers (sharded dataset)

Page 8: Scaling Social Games

July 2010

HAproxy

Ruby on Rails

MySQL

Slow down

Page 9: Scaling Social Games

July 2010

HAproxy

Ruby on Rails

MySQL

Slow down

High queries/request ratio

Page 10: Scaling Social Games

Queries/request

• Which code is triggering extra queries?

• Why is the ratio lower in our test environment than in live?
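One way to answer "which code is triggering extra queries?" is to count queries per code layer. The following is a minimal sketch, not the tooling used in the talk: `CountingConnection`, `hypothetical_plugin_hook` and the `users` table are invented for illustration, and SQLite stands in for MySQL.

```python
import sqlite3
from collections import Counter
from contextlib import contextmanager


class CountingConnection:
    """Wraps a DB-API connection and counts execute() calls per code layer."""

    def __init__(self, conn):
        self._conn = conn
        self._layer = "app"        # layer currently issuing queries
        self.counts = Counter()    # layer name -> number of queries

    @contextmanager
    def layer(self, name):
        """Attribute queries issued inside the block to `name`."""
        previous, self._layer = self._layer, name
        try:
            yield
        finally:
            self._layer = previous

    def execute(self, sql, params=()):
        self.counts[self._layer] += 1
        return self._conn.execute(sql, params)


def hypothetical_plugin_hook(db, user_id):
    # Stand-in for a plugin that silently reads the same row twice.
    with db.layer("plugin"):
        db.execute("SELECT coins FROM users WHERE id = ?", (user_id,))
        db.execute("SELECT coins FROM users WHERE id = ?", (user_id,))


def handle_request(db, user_id):
    # The application itself only needs a single write.
    db.execute("UPDATE users SET coins = coins + 1 WHERE id = ?", (user_id,))
    hypothetical_plugin_hook(db, user_id)


raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, coins INTEGER)")
raw.execute("INSERT INTO users VALUES (1, 0)")

db = CountingConnection(raw)
handle_request(db, 1)
print(dict(db.counts))
```

Comparing these counts between the test and live environments would show which layer accounts for the difference.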

Page 11: Scaling Social Games

Queries/request

Application + Ruby on Rails + Plugins

Running code of live system

Page 12: Scaling Social Games

Queries/request

Plugins

Source of extra queries

• sharding plugin “breaks” std Rails query cache

• Flash wire protocol plugin generates extra queries
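To illustrate what is lost when a per-request query cache stops working, here is a small sketch in the spirit of the Rails query cache: identical SELECTs within one request are served from memory, and any write clears the cache. The class name and schema are assumptions, SQLite stands in for MySQL, and the slides do not say how the sharding plugin defeated the real cache.

```python
import sqlite3


class RequestQueryCache:
    """Caches identical SELECTs for the lifetime of one request."""

    def __init__(self, conn):
        self._conn = conn
        self._cache = {}
        self.db_hits = 0   # statements that actually reached the database

    def select(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self._cache:
            self.db_hits += 1
            self._cache[key] = self._conn.execute(sql, params).fetchall()
        return self._cache[key]

    def write(self, sql, params=()):
        # A write may invalidate any cached read, so drop everything.
        self._cache.clear()
        self.db_hits += 1
        self._conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, coins INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 10)")

cache = RequestQueryCache(conn)
cache.select("SELECT coins FROM users WHERE id = ?", (1,))
cache.select("SELECT coins FROM users WHERE id = ?", (1,))  # served from cache
hits_after_reads = cache.db_hits

cache.write("UPDATE users SET coins = coins + 1 WHERE id = ?", (1,))
rows = cache.select("SELECT coins FROM users WHERE id = ?", (1,))  # re-read
print(hits_after_reads, cache.db_hits, rows)
```

Without such a cache, every repeated SELECT in a request becomes a real query, which is one plausible way the queries/request ratio can rise.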

Page 13: Scaling Social Games

Plugins

• Deceiving “feature for free”

• Might provide the right feature

• But might not meet scaling need

Page 14: Scaling Social Games

Plugins

• Instant legacy code, even for new projects!

• Once added it’s your code

• Even if it’s maintained, will it follow your needs?

Page 15: Scaling Social Games

Plugins

• Assess code quality when you add it

• Can you afford to maintain/change it?

Page 16: Scaling Social Games

Plugins

• We fixed it!

• Queries cut by up to 40% on some requests

Page 17: Scaling Social Games

Early August

• The MySQL hiccup

• Every 70 minutes, query time spikes by a factor of 7

[Chart: query time in ms (y-axis 0 to 30) between 6:00 and 8:10]

Page 18: Scaling Social Games

Hiccup causes

• Code (app + plugins + Rails)?

• Some periodic job?

• The devil (AWS)?

Who is periodically blocking MySQL?

Page 19: Scaling Social Games

Hiccup quick fix

• We shard out the most queried table (40% of all queries)

MySQL servers: shard 1, shard 2, shard 3, shard 4

Page 20: Scaling Social Games

Hiccup quick fix

• We shard out the most queried table (40% of all queries)

Top table: shard 1, shard 2, shard 3, shard 4

Other tables: shard 1, shard 2, shard 3, shard 4
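The layout above, with the top table on its own set of shards, can be sketched as a routing function. Host names, the table name and the modulo-by-user-id scheme are all assumptions; the slides only show four shards per group.

```python
# Hypothetical names; the slides only label the servers "shard 1" to "shard 4".
TOP_TABLE = "top_table"
TOP_TABLE_SERVERS = [f"top-shard-{i}" for i in range(1, 5)]
OTHER_SERVERS = [f"other-shard-{i}" for i in range(1, 5)]


def server_for(table, user_id):
    """Pick the MySQL server holding `table` rows for `user_id`."""
    pool = TOP_TABLE_SERVERS if table == TOP_TABLE else OTHER_SERVERS
    # Assumed sharding rule: user id modulo the number of servers.
    return pool[user_id % len(pool)]


print(server_for("top_table", 123))
print(server_for("items", 123))
```

Because the top table has its own pool, its 40% share of the queries no longer competes with the other tables for the same servers.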

Page 21: Scaling Social Games

Hiccup quick fix

• MySQL likes it

• “Top table” shards will go a long way in the scaling process

Top table: shard 1, shard 2, shard 3, shard 4

Other tables: shard 1, shard 2, shard 3, shard 4

Page 22: Scaling Social Games

Hiccup causes

• Code (app + plugins + Rails)?

• Some periodic job?

• The devil (AWS)?

Who is periodically blocking MySQL?

None of the above

Page 23: Scaling Social Games

Hiccup real cause

• MySQL internal behavior that only emerges at high volume

• MySQL flushes its buffer

• Under heavy write I/O the flush is blocking

Page 24: Scaling Social Games

Hiccup solution

• Percona MySQL patches (XtraDB) avoid blocking behavior

• Query time profile gets smooth

• The I/O capacity limit now shows up as gradual performance decay

Page 25: Scaling Social Games

Write through cache

• Memcache in front of MySQL

• Evaluated before sharding

• Was discarded

• Because of our read/write ratio

Page 26: Scaling Social Games

Write through cache

90% of the time we read data in order to modify it

Page 27: Scaling Social Games

Write through cache

It means that 90% of the time we

1. read cache

2. write cache

3. write SQL
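The three steps above can be traced with a toy write-through store. This is a sketch under stated assumptions: a dict stands in for memcache, SQLite for MySQL, and the `users` table and class name are invented.

```python
import sqlite3


class WriteThroughStore:
    """Dict-backed cache in front of SQL; every write hits both."""

    def __init__(self, conn):
        self._conn = conn
        self._cache = {}
        self.ops = []   # log of backend operations, in order

    def read(self, user_id):
        self.ops.append("read cache")
        if user_id not in self._cache:
            # Cache miss: fall through to the database.
            self.ops.append("read SQL")
            row = self._conn.execute(
                "SELECT coins FROM users WHERE id = ?", (user_id,)
            ).fetchone()
            self._cache[user_id] = row[0]
        return self._cache[user_id]

    def write(self, user_id, coins):
        self.ops.append("write cache")
        self._cache[user_id] = coins
        self.ops.append("write SQL")
        self._conn.execute(
            "UPDATE users SET coins = ? WHERE id = ?", (coins, user_id)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, coins INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 10)")

store = WriteThroughStore(conn)
store.read(1)       # warm the cache
store.ops.clear()

# A typical game action: read the value in order to modify it.
store.write(1, store.read(1) + 5)
print(store.ops)
```

Even with a warm cache, the SQL write is still paid on every action, so the cache adds work without removing the bottleneck.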

Page 28: Scaling Social Games

Write through cache

Read heavy: bound to

• memcache performance

Write heavy: bound to

• MySQL writes (unless async)

• Write-through lib optimized for writes?

Page 29: Scaling Social Games

MySQL

• Sharding SQL is a painful way to scale

• Data migrations at high load imply downtime

• ACID benefits are all lost, either because of sharding or in the name of performance

Page 30: Scaling Social Games

Redis

• A persistent cache

• Fast: 60,000 qps on AWS hardware

• Interesting data structures, not only KV

• Already some small-scale experience in house

Page 31: Scaling Social Games

Redis adoption

• Which data to start from?

• How do we migrate without downtime?

• Which library to map Ruby objects to Redis structures?

Page 32: Scaling Social Games

Redis adoption

• Which data to start from?

• Best data fit for Redis hashes

• 3rd most queried table

• a collection of integer fields that need only increment / decrement
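A collection of integer fields that only need increments and decrements maps naturally onto the Redis HINCRBY command. The sketch below uses a small in-memory stand-in so it runs without a server; the stand-in class, the key naming scheme and the field names are assumptions. The `hincrby(name, key, amount)` call shape follows the redis-py client.

```python
from collections import defaultdict


class InMemoryHashes:
    """Tiny stand-in mimicking HINCRBY / HGETALL semantics."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hincrby(self, name, key, amount=1):
        # Missing fields start at 0, as in Redis.
        value = int(self._hashes[name].get(key, 0)) + amount
        self._hashes[name][key] = value
        return value

    def hgetall(self, name):
        return dict(self._hashes[name])


def apply_action(client, user_id, deltas):
    """Apply integer increments/decrements to a user's counters."""
    key = f"user:{user_id}:counters"   # hypothetical key naming scheme
    return {field: client.hincrby(key, field, delta)
            for field, delta in deltas.items()}


client = InMemoryHashes()
apply_action(client, 123, {"coins": 50, "energy": 10})
result = apply_action(client, 123, {"coins": -20, "energy": -1})
print(result)
```

Each change is a single increment on the server side, so the read-then-write round trip described for SQL is not needed for this kind of data.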

Page 33: Scaling Social Games

Redis adoption

• How do we migrate without downtime?

• Migrate one user at a time

• Use a Redis set to keep track of migrated / non-migrated users

• No downtime, transparent to users
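The per-user migration can be sketched as "migrate on first access". Everything named here is an assumption for illustration: `FakeRedis` is an in-memory stand-in with simplified signatures, `sql_rows` is a dict standing in for the MySQL table, and the key names are invented. Concurrency between requests for the same user is not handled.

```python
class FakeRedis:
    """In-memory stand-in for the few Redis commands used below."""

    def __init__(self):
        self.sets, self.hashes = {}, {}

    def sismember(self, name, member):
        return member in self.sets.get(name, set())

    def sadd(self, name, member):
        self.sets.setdefault(name, set()).add(member)

    def hset(self, name, mapping):
        self.hashes.setdefault(name, {}).update(mapping)

    def hgetall(self, name):
        return dict(self.hashes.get(name, {}))


MIGRATED = "migrated_users"   # hypothetical name for the tracking set


def load_counters(redis, sql_rows, user_id):
    """Return a user's counters, migrating them from SQL on first access."""
    key = f"user:{user_id}:counters"
    if not redis.sismember(MIGRATED, user_id):
        redis.hset(key, mapping=sql_rows[user_id])   # copy original data
        redis.sadd(MIGRATED, user_id)                # mark user as migrated
    return redis.hgetall(key)


sql_rows = {123: {"coins": 30, "energy": 9}, 456: {"coins": 5, "energy": 2}}
redis = FakeRedis()

first = load_counters(redis, sql_rows, 123)
sql_rows[123] = {"coins": 0, "energy": 0}   # SQL copy is no longer consulted
second = load_counters(redis, sql_rows, 123)
print(first, second)
```

Users who never log in are never touched by this path, which is why a final batch step is still needed.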

Page 34: Scaling Social Games

Redis adoption

• How do we migrate without downtime?

RoR server

MySQL

Redis

User 123

Page 35: Scaling Social Games

Redis adoption

• How do we migrate without downtime?

RoR server

MySQL

Redis

User 123

read original data

Page 36: Scaling Social Games

Redis adoption

• How do we migrate without downtime?

RoR server

MySQL

Redis

User 123

write migrated data

Page 37: Scaling Social Games

Redis adoption

• How do we migrate without downtime?

• Migration might never complete

• Use SQL + Redis set information to generate a final batch migration
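The final batch is the set of users known to SQL minus the users already recorded in the Redis set. A minimal sketch with made-up ids; in practice the two inputs would come from a SQL query and from reading the tracking set:

```python
def users_left_to_migrate(all_user_ids, migrated_ids):
    """Users present in SQL that are not yet in the 'migrated' set."""
    return sorted(set(all_user_ids) - set(migrated_ids))


all_user_ids = [101, 102, 103, 104]   # e.g. the ids from the SQL users table
migrated_ids = {102, 104}             # e.g. members of the Redis tracking set

remaining = users_left_to_migrate(all_user_ids, migrated_ids)
print(remaining)
```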

Page 38: Scaling Social Games

Redis 1st result

10% of the query load from 4 MySQL servers is moved to 1 Redis server

Redis server load is 0.05

Page 39: Scaling Social Games

Redis

• Becomes the tool to use

• Migration plan for all write intensive data

• Migrate one “class” at a time

Page 40: Scaling Social Games

Redis honeymoon end

• Memory usage grows faster than the data

• Snapshot to disk causes spikes in query time

• Starting new slaves eats memory on the master node

Page 41: Scaling Social Games

Redis honeymoon end

• Redis machine sized with overabundant RAM

• Rigorous slave/master starting plan

Russian Roulette Feeling

Page 42: Scaling Social Games

Redis

• Redis team acknowledges persistency/replication problems

• Redis 2.4 diskstore plan starts

Page 43: Scaling Social Games

1.000.000

And counting...

Page 44: Scaling Social Games

1.000.000

HAproxy

Ruby on Rails

Persistency

painless scaling

Page 45: Scaling Social Games

1.000.000

HAproxy

Ruby on Rails

Persistency

just add servers as load grows

Page 46: Scaling Social Games

1.000.000

HAproxy

Ruby on Rails

Persistency

Painful and troublesome

Page 47: Scaling Social Games

Infrastructure

• AWS

• Chef - through Scalarium

• Ganglia

Page 48: Scaling Social Games

Thanks...

Page 49: Scaling Social Games

wooga is looking for a Business Intelligence Engineer

http://wooga.com/jobs