A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling and Armando Fox
{bling,fox}@cs.stanford.edu
© 2003 Benjamin Ling
Outline
  Motivation: What is Session State?
  Existing solutions
  SSM: Architecture and Algorithm
  SSM: Recovery-friendly
  SSM: Self-Managing
  Related and Future Work
  Conclusion
Example of Session State
Session State and Existing Solutions
  We focus on a subcategory of session state
    Single-user, serial access, semi-persistent data
    Examples: temporary application data, application workflow
  Example of usage (e.g. J2EE):
    [Figure: numbered steps 1-6 between Browser and App Server]
Existing solutions:
  File System and Databases
    Poor failure behavior
      Lose data (FS)
      Slow recovery (both)
    Difficult to administer (DB)
    Difficult to tune (both)
  In-memory replication using primary/secondary:
    Performance coupling
    Poor failover (uneven load balancing)
Goal
  Build a session state store that is:
    Failure-friendly
      Does not lose data on crash
      Degrades gracefully
    Recovery-friendly
      Recovers fast
    Self-managing
    High performance
      Avoids performance coupling
Session State Manager (SSM)
  Redundant, in-memory hash table distributed across nodes
  Algorithm: redundancy similar to quorums
    Write to many random nodes, wait for few (avoid performance coupling)
    Read one
  [Figure: two AppServers, each with a STUB, connected to Bricks 1-5; label: RAM, Network Interface]
Write example: “Write to Many, Wait for Few”
  Try to write to W random bricks, W = 4
  Must wait for WQ bricks to reply, WQ = 2
  [Figure: Browser, AppServer STUB, Bricks 1-5]
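The write path above ("write to many, wait for few") could be sketched roughly as follows. This is a minimal illustration, not SSM's implementation: in-process objects stand in for bricks, a thread pool stands in for the network, and the names `Brick`, `write_to_many`, `w`, `wq`, and `timeout` are all hypothetical. The 60 ms default timeout is borrowed from the request timeout mentioned later in the talk.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


class Brick:
    """Hypothetical in-memory brick: a dict stands in for its RAM hash table."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}

    def put(self, key, value):
        self.table[key] = value
        return self.brick_id  # acknowledgement carries the brick's id


def write_to_many(bricks, key, value, w=4, wq=2, timeout=0.06, rng=random):
    """Send the write to w random bricks; return once wq have acknowledged.

    Returns the ids of the first wq bricks that replied.
    Raises RuntimeError if fewer than wq acknowledge within the timeout.
    """
    targets = rng.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.put, key, value) for b in targets]
    acked = []
    try:
        for fut in as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                acked.append(fut.result())
            if len(acked) >= wq:
                break  # do not wait for the slower bricks
    except FuturesTimeout:
        pass  # slow bricks are simply not waited for
    finally:
        pool.shutdown(wait=False)
    if len(acked) < wq:
        raise RuntimeError("fewer than wq bricks acknowledged the write")
    return acked
```

The returned brick ids are the kind of metadata a stub could hand back to the client, since the caller never waits for the slowest of the W bricks.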
Algorithm Properties
  Client remembers metadata
    Fate sharing
  Stubs are stateless
  Negative feedback loop
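One way to picture "client remembers metadata" together with "read one": the client carries the ids of the bricks that hold its state, and a stateless stub tries those bricks until one answers. The sketch below assumes a cookie shaped like `{"bricks": [...]}` and models each brick as a plain dict; both are illustrative choices, not details from the talk.

```python
import random


def read_one(bricks_by_id, cookie, key, rng=random):
    """Stateless read: try the bricks named in the client's cookie until one answers.

    bricks_by_id maps brick id -> dict (hypothetical stand-in for a brick).
    A missing id models a brick that has crashed or been restarted empty.
    """
    candidates = list(cookie["bricks"])
    rng.shuffle(candidates)  # spread read load across the replicas
    for brick_id in candidates:
        table = bricks_by_id.get(brick_id)
        if table is not None and key in table:
            return table[key]
    raise KeyError(key)
```

Because the stub keeps nothing between requests, any stub can serve any client, and losing a stub loses no session state.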
SSM: Recovery-Friendly
  Failure
    No data is lost, WQ-1 copies of the data remain
    State is available for R/W during failure
  Recovery
    Start a new brick; don’t need to recover anything
    No special-case recovery code (restart = recovery)
    State is available for R/W during brick restart
    Repair phase does not reduce throughput/performance
    Session state is self-recovering
      User’s access pattern will cause data to be rewritten
SSM: Self-Managing
  Adaptive:
    Stub maintains count of maximum allowable in-flight requests to each brick
      Additive increase on successful request
      Multiplicative decrease on timeout
    Stubs discover load capacity of each brick
    Self-tuning
  Admission control
    Stubs say “no” if insufficient bricks
    Propagate backpressure from bricks to clients
    Turn users away under overload
    Self-protecting
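The adaptive window and admission control described above might look like the following sketch. The class name `BrickWindow`, the function `admit`, and the constants (initial limit, +1 on success, halving on timeout) are assumptions made for illustration; the slides give only the additive-increase/multiplicative-decrease rule itself. Whether admission should require W or WQ available bricks is also not stated, so the threshold is left as a parameter.

```python
class BrickWindow:
    """Hypothetical per-brick cap on in-flight requests, adjusted by AIMD."""

    def __init__(self, initial=8.0, minimum=1.0, increase=1.0, decrease=0.5):
        self.limit = initial      # maximum allowable in-flight requests
        self.minimum = minimum
        self.increase = increase  # additive step (assumed value)
        self.decrease = decrease  # multiplicative factor (assumed value)
        self.in_flight = 0

    def has_capacity(self):
        return self.in_flight < int(self.limit)

    def try_acquire(self):
        """Reserve a slot for one request; False means the brick looks saturated."""
        if not self.has_capacity():
            return False
        self.in_flight += 1
        return True

    def on_success(self):
        self.in_flight -= 1
        self.limit += self.increase  # additive increase

    def on_timeout(self):
        self.in_flight -= 1
        self.limit = max(self.minimum, self.limit * self.decrease)  # multiplicative decrease


def admit(windows, needed):
    """Admission control: say "no" unless at least `needed` bricks have capacity."""
    available = sum(1 for win in windows.values() if win.has_capacity())
    return available >= needed
```

A stub would call `admit` before starting a write and reject the request when it returns False, which is how backpressure from slow bricks would reach clients in this sketch.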
Self-Tuning and Self-Protecting
  [Figure: throughput plots, # req/s vs. time in s (1-14 s). Panels: normal load; overload without additive-increase/multiplicative-decrease (AI/MD) adaptation; overload with AI/MD adaptation. One panel is labeled "Throughput 250 senders (windowing)".]
Other implementation details
  Garbage collection
    Generational hash table
      Hash table of hash tables
      Each hash table has an associated time range
      When time has passed, GC that table
    No reference counting, scanning, etc.
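A generational hash table of the kind outlined above can be sketched as an outer dict keyed by time bucket, with one inner dict per bucket. The class name, the 60-second bucket width, and the choice to bucket entries by expiry time are assumptions for illustration only.

```python
import time


class GenerationalTable:
    """Hypothetical hash table of hash tables, one inner table per time range."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.generations = {}  # bucket index -> {key: value}

    def _bucket(self, timestamp):
        return int(timestamp // self.bucket_seconds)

    def put(self, key, value, expires_at):
        """Store the entry in the generation whose time range contains its expiry."""
        self.generations.setdefault(self._bucket(expires_at), {})[key] = value

    def get(self, key):
        """Look in the newest generations first so a rewritten key wins."""
        for bucket in sorted(self.generations, reverse=True):
            table = self.generations[bucket]
            if key in table:
                return table[key]
        return None

    def collect(self, now=None):
        """Drop every generation whose whole time range has passed."""
        now = time.time() if now is None else now
        current = self._bucket(now)
        expired = [b for b in self.generations if b < current]
        for b in expired:
            del self.generations[b]  # one deletion frees the whole table
        return len(expired)
```

Collection cost depends on the number of generations rather than the number of entries, which is why no reference counting or scanning is needed.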
Is it cheap? Is it fast? Is it easy to use?
  How much does replication cost?
    With 10 bricks, 1 GB memory, state size 8 KB, replication factor of 3
    Serve around 416,000 concurrent users
  Configurable request timeout, currently 60 ms
    Dwarfed by computation time and client RT time
  Easy to add a brick, kill a brick
    System continues running
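The capacity figure above can be reproduced with simple arithmetic under a few assumptions that the slide does not spell out: 1 GB is per brick, units are decimal (1 GB = 10^9 bytes, 8 KB = 8,000 bytes), each session is stored three times, and per-entry overhead is ignored. The function name is illustrative.

```python
def concurrent_sessions(bricks, bytes_per_brick, state_bytes, replication):
    """Sessions that fit when each session is stored `replication` times."""
    total_bytes = bricks * bytes_per_brick
    return total_bytes // (state_bytes * replication)


# 10 bricks x 10^9 bytes / (8,000 bytes x 3 copies)
print(concurrent_sessions(10, 10**9, 8_000, 3))  # 416666
```

Under these assumptions the result is consistent with the "around 416,000 concurrent users" quoted on the slide.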
Publications
  The Case for a Session State Storage Layer
    Ben Ling, Armando Fox
    9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003
  A Self-Managing Session State Layer
    Ben Ling, Armando Fox
    Accepted to the 5th Annual Workshop on Active Middleware Services (AMS 2003), Seattle, WA, June 2003
  http://swig.stanford.edu/public/publications
Related Work
  Palimpsest (Timothy Roscoe, Intel)
    Temporal storage
    Erasure coding
    No guarantees, just estimates
  DeStor (Andy Huang, Stanford)
    Persistent, multi-user, non-transactional data
  FAB (HP Labs)
    Enterprise disk storage
    Redundancy at disk block level
Future Work
  Do fault analysis and model failure
    Memory and network failure modes
    Performance faults?
  How to choose replication factor?
    10 bricks, WQ of 3, inter-request rate of 5 minutes -> “5 nines” of availability if MTTF of bricks > 22 minutes
  Adaptively change replication factor?
SSM: Relaxing ACID
  A: we guarantee
  C: guaranteed by workload (full rewrite of state)
  I: guaranteed by workload (single-user, serial access)
  D: relaxed (ephemeral guarantee, RAM enough)
  Fast, simple, clean recovery
    No data loss on failure
    Data can be R/W during failure/recovery
  Self-managing
Summary
  We have built a system for:
    Semi-persistent storage for single-user, serial-access data
  Recovery-friendly:
    Crash-only: crash-safe, fast recovery
    No special-case recovery code
    Reboot any individual node
    Continuous data availability
  Self-managing:
    Self-tuning and self-protecting
    Simple management and fault enforcement model
  Benjamin Ling
  bling@cs.stanford.edu
  http://swig.stanford.edu/
SSM: Recovery-Friendly, Self-Managing Store
  Questions or Comments?
  Benjamin Ling
  bling@cs.stanford.edu
  http://swig.stanford.edu/