A Recovery-Friendly, Self-Managing Session State Store

Benjamin Ling and Armando Fox
{bling,fox}@cs.stanford.edu

© 2003 Benjamin Ling

Outline

Motivation: What is Session State?

Existing solutions

SSM: Architecture and Algorithm

SSM: Recovery-friendly

SSM: Self-Managing

Related and Future Work

Conclusion


Example of Session State


Session State and Existing Solutions

We focus on a subcategory of session state
  Single-user, serial access, semi-persistent data

Examples: temporary application data, application workflow

Example of usage (e.g. J2EE):

[Diagram: a Browser and an App Server, with interaction steps numbered 1 to 6]


Existing solutions:

File System and Databases
  Poor failure behavior
  Lose data (FS)
  Slow recovery (both)
  Difficult to administer (DB)
  Difficult to tune (both)

In-memory replication using primary/secondary:
  Performance coupling
  Poor failover (uneven load balancing)


Goal

Build a session state store that is:

Failure-friendly
  Does not lose data on crash
  Degrades gracefully

Recovery-friendly
  Recovers fast

Self-Managing

High performance
  Avoids performance coupling


Session State Manager (SSM)

[Diagram: two AppServers, each with a STUB, and Bricks 1 to 5; label: RAM, Network Interface]

Redundant, in-memory hash table distributed across nodes

Algorithm: redundancy similar to quorums
  • Write to many random nodes, wait for few (avoid performance coupling)
  • Read one


Write example: “Write to Many, Wait for Few”

[Diagram: Browser, AppServer with STUB, and Bricks 1 to 5]

Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2


Write example: “Write to Many, Wait for Few” (continued)

[Diagram: same layout as the previous slide, with the labels 1 and 4 added]

Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
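The write path on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not the SSM implementation: the `Brick` class, the `write_many_wait_few` function, and the use of threads are all assumptions made for the example. The defaults W = 4 and WQ = 2 come from the slide, and the 60 ms default timeout matches the request timeout quoted later in the talk.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


class Brick:
    """Hypothetical in-memory brick: a dict plus an artificial reply delay."""

    def __init__(self, name, delay_s=0.0):
        self.name = name
        self.delay_s = delay_s
        self.table = {}

    def put(self, key, value):
        time.sleep(self.delay_s)  # models a slow or overloaded brick
        self.table[key] = value
        return self.name          # the acknowledgement


def write_many_wait_few(bricks, key, value, w=4, wq=2, timeout_s=0.06):
    """Send the write to w random bricks; return once wq of them have replied."""
    targets = random.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.put, key, value) for b in targets]
    acked = []
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            if fut.exception() is None:
                acked.append(fut.result())
            if len(acked) >= wq:
                break             # do not wait for the slower bricks
    except FuturesTimeout:
        pass                      # too few replies arrived in time
    finally:
        pool.shutdown(wait=False)  # outstanding writes finish in the background
    if len(acked) < wq:
        raise TimeoutError(f"only {len(acked)} of the required {wq} bricks replied")
    return acked                  # names of the bricks known to hold the data
```

Because the caller returns after the first WQ replies, a single slow brick among the W targets does not slow the request down, which is the point of avoiding performance coupling. The returned list of brick names is the kind of metadata a client would need to hold in order to read the value back later.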


Algorithm Properties

Client remembers metadata
  Fate sharing

Stubs are stateless

Negative feedback loop
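One way to picture "client remembers metadata" together with "read one" and stateless stubs is the sketch below. It assumes, purely for illustration, that the metadata held by the client is the list of brick names that acknowledged the last write; the slides do not spell out the metadata format. Bricks are modeled as plain dicts, and `read_one` is a hypothetical name.

```python
def read_one(bricks_by_name, key, acked_names):
    """Return the value from the first listed brick that still holds the key.

    bricks_by_name: live bricks, name -> dict (a missing name models a crashed brick)
    acked_names:    metadata carried by the client, not stored in the stub
    """
    for name in acked_names:
        brick = bricks_by_name.get(name)
        if brick is not None and key in brick:
            return brick[key]     # one successful read is enough
    raise KeyError(key)           # every replica named in the metadata is gone
```

The function keeps nothing between calls: everything it needs arrives as arguments, so any stub could serve any request.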


SSM: Recovery-Friendly

Failure
  No data is lost; WQ-1 copies of the data remain
  State is available for R/W during failure

Recovery
  Start a new brick: no need to recover anything
  No special-case recovery code (restart = recovery)
  State is available for R/W during brick restart
  Repair phase does not reduce throughput/performance

Session state is self-recovering
  User’s access pattern will cause data to be rewritten


SSM: Self-Managing

Adaptive:
  Stub maintains count of maximum allowable in-flight requests to each brick
    Additive increase on successful request
    Multiplicative decrease on timeout
  Stubs discover load capacity of each brick

Self-Tuning

Admission control
  Stubs say “no” if insufficient bricks
  Propagate backpressure from bricks to clients
  Turn users away under overload

Self-Protecting
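The adaptive window and the admission check on this slide could look roughly like the following. The class and function names, the starting window of 8, the step of 1, and the halving factor are illustrative assumptions; the slides state only that the increase is additive, the decrease is multiplicative, and stubs refuse requests when too few bricks are available.

```python
class BrickWindow:
    """Per-brick cap on in-flight requests, adjusted by AI/MD (hypothetical sketch)."""

    def __init__(self, initial=8, minimum=1, maximum=1024, step=1, factor=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.step = step
        self.factor = factor
        self.in_flight = 0

    def try_acquire(self):
        """Reserve a slot for one request, or report that the brick is full."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def on_success(self):
        """Additive increase: the brick kept up, so allow a little more load."""
        self.in_flight = max(0, self.in_flight - 1)
        self.limit = min(self.maximum, self.limit + self.step)

    def on_timeout(self):
        """Multiplicative decrease: the brick is struggling, so back off sharply."""
        self.in_flight = max(0, self.in_flight - 1)
        self.limit = max(self.minimum, int(self.limit * self.factor))


def admit(windows, wq):
    """Admission control: accept a write only if at least wq bricks have room."""
    with_room = [w for w in windows.values() if w.in_flight < w.limit]
    return len(with_room) >= wq
```

When `admit` returns False the stub would reject the request instead of queueing it, which is how backpressure from the bricks reaches the clients in this sketch.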


Self-Tuning and Self-Protecting

[Figure: three plots of throughput (# req/s) against time in s, over seconds 1 to 14. Panel titles: NORMAL LOAD; OVERLOAD; Throughput 250 senders (windowing). Captions: "Without Add Inc/Mult Dec adaptation…"; "Overload with AI/MD adaptation".]


Other implementation details

Garbage collection
  Generational hash table
    Hash table of hash tables
    Each hash table has an associated time range
    When time has passed, GC that table
  No reference counting, scanning, etc.
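A generational hash table of this kind can be sketched as below. The class name, the 60-second generation length, and the choice to look entries up by key plus expiry time are assumptions for the example; the slides say only that each inner table covers a time range and is dropped once that range has passed.

```python
class GenerationalTable:
    """Hash table of hash tables, one inner table per time range (hypothetical sketch)."""

    def __init__(self, generation_s=60.0):
        self.generation_s = generation_s
        self.generations = {}     # generation index -> {key: value}

    def _index(self, t):
        # Generation g covers the range [g * generation_s, (g + 1) * generation_s).
        return int(t // self.generation_s)

    def put(self, key, value, expires_at):
        """File the entry under the generation that contains its expiry time."""
        self.generations.setdefault(self._index(expires_at), {})[key] = value

    def get(self, key, expires_at, now):
        """Look up an entry; the caller supplies the expiry it was stored with."""
        if expires_at <= now:
            return None
        return self.generations.get(self._index(expires_at), {}).get(key)

    def collect(self, now):
        """Drop every generation whose whole time range lies in the past."""
        dead = [g for g in self.generations if (g + 1) * self.generation_s <= now]
        for g in dead:
            del self.generations[g]   # one delete frees all entries in the range
        return len(dead)
```

Collection never inspects individual entries, which is why no reference counting or scanning is needed: deleting one inner table discards everything that expired in its range.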


Is it cheap? Is it fast? Is it easy to use?

How much does replication cost?
  With 10 bricks, 1G memory, state size 8k, replication factor of 3
  Serve around 416,000 concurrent users

Configurable request timeout, currently 60 ms
  Dwarfed by computation time and client round-trip time

Easy to add a brick, kill a brick
  System continues running
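The replication-cost figure can be reproduced with simple arithmetic. The sketch below assumes that "1G memory" means 1 GB per brick in decimal units, that "8k" means 8,000 bytes, and that all of a brick's memory holds session state; those readings are assumptions, chosen because they agree with the quoted figure. The function name is hypothetical.

```python
def concurrent_sessions(bricks, bytes_per_brick, state_bytes, replication):
    """Number of sessions that fit when each session is stored `replication` times."""
    total_bytes = bricks * bytes_per_brick
    return total_bytes // (state_bytes * replication)


# 10 bricks x 1 GB, 8,000-byte state, 3 copies of each session
print(concurrent_sessions(10, 10**9, 8_000, 3))  # 416666
```

The result, 416,666, is in line with the "around 416,000" on the slide. Reading the units as binary (2**30 bytes per brick, 8,192 bytes per state) would give 436,906 instead.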


Publications

The Case for a Session State Storage Layer
Ben Ling, Armando Fox
  9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003

A Self-Managing Session State Layer
Ben Ling, Armando Fox
  Accepted to the 5th Annual Workshop on Active Middleware Services (AMS 2003), Seattle, WA, June 2003

http://swig.stanford.edu/public/publications


Related Work

Palimpsest (Timothy Roscoe, Intel)
  Temporal storage
  Erasure coding
  No guarantees, just estimates

DeStor (Andy Huang, Stanford)
  Persistent, multi-user, non-transactional data

FAB (HP Labs)
  Enterprise disk storage
  Redundancy at disk block level


Future Work

Do fault analysis and model failure
  Memory and network failure modes
  Performance faults?

How to choose replication factor?
  10 bricks, WQ of 3, inter-request interval of 5 minutes -> “5 nines” of availability if MTTF of bricks > 22 minutes

Adaptively change replication factor?


SSM: Relaxing ACID

A (atomicity): we guarantee
C (consistency): guaranteed by workload (full rewrite of state)
I (isolation): guaranteed by workload (single user, serial access)
D (durability): relaxed (ephemeral guarantee, RAM enough)

Fast, simple, clean recovery
  No data loss on failure
  Data can be R/W during failure/recovery

Self-Managing


Summary

We have built a system for:
  Semi-persistent storage for single-user, serial-access data

Recovery-friendly:
  Crash-only: crash-safe, fast recovery
  No special-case recovery code
  Reboot any individual node
  Continuous data availability

Self-Managing:
  Self-tuning and self-protecting
  Simple management and fault enforcement model

Benjamin Ling
bling@cs.stanford.edu

http://swig.stanford.edu/


SSM: Recovery-Friendly, Self-Managing Store

Questions or Comments?

Benjamin Ling
bling@cs.stanford.edu

http://swig.stanford.edu/
