Distributed Systems
scalability and high availability
Renato Lucindo - lucindo.github.com - @rlucindo
Distributed System Design
Renato Lucindo
Call me Lucindo (or Linus)
2002 - Bachelor in Computer Science
2007 - M.Sc. in Computer Science (Combinatorial Optimization)
7+ years developing distributed systems
My default answer: "I don't know."
Agenda
Scalability
High Availability
Problems
Tips and Tricks
Learning More
Distributed Systems
Multiple computers that interact with each other over a network to achieve a common goal
Purpose
  Scalability
  High availability
source: http://www.cnds.jhu.edu/
Scalability
A system's ability to gracefully handle a growing amount of work
Scale up (vertical)
  Add resources to a single node
  Improve existing code to handle more work
Scale out (horizontal)
  Add more nodes to a system
  Linear (or better) scalability
Scalability - Vertical
Add: CPU, memory, disks (bigger box)
Handle more simultaneous:
  Connections
  Operations
  Users
Choose a good I/O and concurrency model
  Non-blocking I/O
  Asynchronous I/O
  Threads (single, pool, per-connection)
  Event handling patterns (Reactor, Proactor, ...)
Memory model?
  STM (software transactional memory)
Scalability - Vertical
Careful with numbers
  Requests per second
  # of connections
  Simultaneous operations
Event handling
  Think front-end
  Slow connections/clients
  It's slower than other options
In doubt, go async
Back-end
  Thread pool (thread per-connection)
  No events
  Process per-core
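As a rough illustration of why event handling suits slow front-end clients, the sketch below uses Python's asyncio to serve many simulated slow clients on a single thread. The function names, the client count, and the delay are made up for the example; the sleep stands in for a client that is slow to send or receive.

```python
import asyncio
import time


async def handle_client(client_id: int, delay: float) -> str:
    # Simulate a slow client. Awaiting hands control back to the event
    # loop, so other clients make progress while this one is idle.
    await asyncio.sleep(delay)
    return f"client-{client_id}: done"


async def serve_all(n_clients: int, delay: float) -> list[str]:
    # One thread, one event loop, many in-flight "connections".
    tasks = [handle_client(i, delay) for i in range(n_clients)]
    return await asyncio.gather(*tasks)


def run_demo(n_clients: int = 100, delay: float = 0.05) -> tuple[list[str], float]:
    # Returns the replies and the wall-clock time taken to serve everyone.
    start = time.monotonic()
    replies = asyncio.run(serve_all(n_clients, delay))
    return replies, time.monotonic() - start
```

Served one after another, 100 clients at 0.05 s each would take about 5 s; here the waits overlap, so the total stays close to a single delay.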
Scalability - Horizontal
Add nodes to handle more work
Front-end
  Straightforward
  Stateless
Back-end
  Master/Slave(s)
  Partitioning
    DHT
    Volatile Index
Scalability - Horizontal
Master/Slave
  Write on a single Master
  Read on Slaves (one or more)
  Scales reads
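A minimal in-memory sketch of the routing rule, assuming a hypothetical `ReplicatedStore` class: every write goes to the master and is copied to the slaves, while reads rotate over the slaves only. Replication is synchronous here to keep the example short; real systems usually replicate asynchronously, so slaves can lag.

```python
import itertools
from typing import Optional


class ReplicatedStore:
    """Toy master/slave store (hypothetical, for illustration only)."""

    def __init__(self, n_slaves: int = 2) -> None:
        self.master: dict[str, str] = {}
        self.slaves: list[dict[str, str]] = [{} for _ in range(n_slaves)]
        # Round-robin over slave indexes spreads the read load.
        self._next_slave = itertools.cycle(range(n_slaves))

    def write(self, key: str, value: str) -> None:
        # All writes go through the single master...
        self.master[key] = value
        # ...and are then copied to every slave.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key: str) -> tuple[int, Optional[str]]:
        # Reads never touch the master; returns (slave index, value).
        idx = next(self._next_slave)
        return idx, self.slaves[idx].get(key)
```

Adding slaves raises read capacity, but every write still passes through the one master, which is why this layout scales reads and not writes.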
Scalability - Horizontal
Partitioning (Sharding)
Distribute data across nodes
  Generally involves data de-normalization
Where is some specific data?
  Master Index
  Hash (DHT, Consistent Hashing)
  Volatile Index
Joins done at the application level
NoSQL friendly
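One way to answer "where is some specific data?" with a hash is a consistent-hash ring. The sketch below is an illustration under assumptions, not the method of any particular system: the `HashRing` class, the node names, and the choice of MD5 with 100 virtual nodes per server are all made up for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points: list[int] = []      # sorted positions on the ring
        self._owner: dict[int, str] = {}  # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node sits at several positions ("virtual nodes") so the
        # key space is divided more evenly.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        # The owner is the first point clockwise from the key's hash,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The useful property is that adding a node only moves the keys that now belong to the new node; keys never shuffle between the existing nodes, unlike `hash(key) % number_of_nodes`.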
Scalability - Horizontal
Volatile Index: build and maintain data index as cached information (all clients)
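The slide does not say how such an index is built, so the following is only one possible reading: each client keeps a key-to-node map as a disposable hint, and rebuilds an entry when it is missing or turns out to be stale. `Node`, `Client`, and the search-every-node slow path are hypothetical.

```python
from typing import Optional


class Node:
    """Hypothetical storage node holding a subset of the keys."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, str] = {}


class Client:
    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = {n.name: n for n in nodes}
        self.index: dict[str, str] = {}  # key -> node name; only a cached hint
        self.lookups = 0                 # number of slow-path searches

    def _locate(self, key: str) -> Optional[str]:
        # Slow path: ask every node whether it holds the key.
        self.lookups += 1
        for node in self.nodes.values():
            if key in node.data:
                return node.name
        return None

    def get(self, key: str) -> Optional[str]:
        name = self.index.get(key)
        # In a real system the staleness check would be a request to the
        # hinted node that comes back "not here".
        if name is None or key not in self.nodes[name].data:
            name = self._locate(key)
            if name is None:
                self.index.pop(key, None)
                return None
            self.index[key] = name  # refresh the cached location
        return self.nodes[name].data[key]
```

Because the index is only a cache, losing it costs a few slow lookups but never correctness.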
High Availability
"Processes, as well as people, die"
Handle hardware and software failures
Eliminate single points of failure
  Redundancy
  Failover
  Replicas
High Availability - Failover/Redundancy
High Availability - Replicas
Two or more copies of the same data
Replica granularity
  From node replica to "row" replica
Load balancing
Write concurrency
Replica updates
Key for high availability and root of several problems
Problems
Problems - CAP Theorem
Problems - CAP Theorem
Consistency: all operations (reads/writes) yield a globally consistent state
Availability: all requests (on non-failed servers) must have a response
Partition Tolerance: nodes may not be able to communicate with each other.
Pick Two
Problems - CAP Theorem
C + A: network problems might stop the system
Examples:
  Oracle RAC, IBM DB2 Parallel
  RDBMS (Master/Slave)
  Google File System
  HDFS (Hadoop)
Problems - CAP Theorem
C + P: clients can't always perform operations
Examples:
  Distributed lock systems: Chubby, ZooKeeper
  Paxos protocol (consensus)
  BigTable, HBase
  Hypertable
  MongoDB
Problems - CAP Theorem
A + P: clients may read inconsistent (old or undone) data
Examples:
  Amazon Dynamo
  Cassandra
  Voldemort
  CouchDB
  Riak
  Caches
Problem with CAP Theorem
In practice, C + A and C + P systems are the same.
  C + A: not tolerant of network partitions
  C + P: not available when a network partition occurs
Big problem: network partition
  Not so big (how often does it happen?)
Pick two
  Availability
  Consistency
The forgotten: Latency
  Or, how long does the system wait before considering the network partitioned?
Problems - Real World
Every component may fail:
  Network failure
  Hardware failure
  Electricity
  Natural disasters
  Code failure
Tips & Tricks
Tips & Tricks - Pyramid
Capacity (connections, operations, ...) Pyramid
Tips & Tricks - Reply Fast
FAIL fast
Break complex requests into smaller ones
Use timeouts
No transactions
Be aware that a single slow operation or component can generate contention
Self-denial attack
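To make "use timeouts" concrete, here is a small sketch built on Python's `concurrent.futures`. The helper `call_with_timeout`, the fallback value, and `slow_backend` are hypothetical names for the example; the point is only that the caller gets an answer within a bounded time.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s: float, fallback):
    # Submit the work and wait at most timeout_s for the answer.
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Reply fast with a fallback instead of blocking the caller.
        # The worker thread keeps running; cancel() only helps if the
        # task has not started yet.
        future.cancel()
        return fallback


def slow_backend() -> str:
    time.sleep(0.5)  # stands in for a slow component
    return "fresh"
```

For example, `call_with_timeout(slow_backend, 0.05, "stale")` returns `"stale"` after roughly 50 ms rather than waiting the full half second. Note that the timed-out call still occupies a pool thread until it finishes, which is one way a single slow component creates contention.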
Tips & Tricks - Cache
Cache: component location, data, DNS lookups, previous requests, etc.
Use negative cache for failed requests (low expiration)
Don't rely on cache
  Your system must work with no cache
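A negative cache can be sketched as a TTL cache that also remembers failures, with a much shorter expiration than for successes. The class name `TTLCache`, the TTL values, and the injectable `clock` are assumptions made for the example.

```python
import time


class TTLCache:
    def __init__(self, loader, ttl_ok=300.0, ttl_fail=5.0, clock=time.monotonic):
        self.loader = loader      # function that fetches the real value
        self.ttl_ok = ttl_ok      # expiration for successful results
        self.ttl_fail = ttl_fail  # low expiration for negative entries
        self.clock = clock
        self._entries = {}        # key -> (expires_at, value or exception)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            cached = entry[1]
            if isinstance(cached, Exception):
                raise cached      # fail fast: do not hit the backend again
            return cached
        try:
            value = self.loader(key)
        except Exception as exc:
            # Remember the failure briefly so repeated requests for the
            # same broken key do not hammer the backend.
            self._entries[key] = (self.clock() + self.ttl_fail, exc)
            raise
        self._entries[key] = (self.clock() + self.ttl_ok, value)
        return value
```

If the cache is empty or removed, every `get` simply falls through to `loader`, so the system still works without it, only slower.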
Tips & Tricks - Queues
An easy way to add asynchronous processing and decouple your system.
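The decoupling can be shown in-process with Python's standard `queue` module; a production system would more likely use a message broker, which the slide does not name. `process_async` and the doubling "work" are placeholders.

```python
import queue
import threading


def process_async(items):
    jobs = queue.Queue(maxsize=100)  # bounded, so producers feel back-pressure
    results = []

    def worker():
        while True:
            job = jobs.get()
            if job is None:          # sentinel: no more work
                jobs.task_done()
                break
            results.append(job * 2)  # stands in for the slow processing
            jobs.task_done()

    consumer = threading.Thread(target=worker, daemon=True)
    consumer.start()
    for item in items:
        jobs.put(item)               # the producer only enqueues and moves on
    jobs.put(None)
    jobs.join()                      # wait until every queued job is handled
    consumer.join()
    return results
```

The producer never calls the consumer directly; it only knows about the queue, so either side can be slowed down, restarted, or replaced independently.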
Tips & Tricks - DNS
Tips & Tricks - Logs
Log everything
Use several log levels
On every log message
  User
  Request host
  Component involved
  Version
  Filename and line
If a log level is not enabled, do not process the log message
Avoid lookup calls (gettimeofday)
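With Python's standard `logging` module, "do not process the message if the level is not enabled" looks roughly like this. The logger name, the log format, and `expensive_dump` are assumptions for the example; the format attributes (`filename`, `lineno`, and so on) are standard `LogRecord` fields.

```python
import logging

# Component name, file name and line number on every message.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(filename)s:%(lineno)d %(message)s"

logger = logging.getLogger("frontend")  # hypothetical component name


def expensive_dump(state: dict) -> str:
    # Stands in for a costly serialization of request state.
    return ", ".join(f"{k}={v}" for k, v in sorted(state.items()))


def handle_request(user: str, host: str, state: dict) -> None:
    # Cheap arguments: pass them separately so the string is only
    # formatted if the record is actually emitted.
    logger.info("user=%s host=%s request accepted", user, host)
    # Expensive arguments: guard explicitly so the work is skipped
    # entirely when DEBUG is disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("user=%s state: %s", user, expensive_dump(state))
```

A typical setup would call `logging.basicConfig(format=LOG_FORMAT, level=logging.INFO)` once at startup; with that level, `expensive_dump` is never invoked.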
Tips & Tricks - Domino Effect
Make sure your load balancer won't overload components
Use smart algorithms
  Load balancing
  Resource allocation
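The slide does not name an algorithm, so the following is just one candidate: pick the least-utilized backend, and refuse new work once every backend is at its declared capacity. `LeastLoadedBalancer` and the capacity numbers are hypothetical.

```python
from typing import Optional


class LeastLoadedBalancer:
    def __init__(self, capacity: dict[str, int]) -> None:
        self.capacity = dict(capacity)  # backend -> max in-flight requests
        self.in_flight = {name: 0 for name in capacity}

    def acquire(self) -> Optional[str]:
        # Consider only backends that still have spare capacity.
        free = [n for n in self.capacity if self.in_flight[n] < self.capacity[n]]
        if not free:
            # Shed load instead of piling onto a saturated backend.
            return None
        # Pick the backend with the lowest utilization.
        name = min(free, key=lambda n: self.in_flight[n] / self.capacity[n])
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        # Call when the request finishes, successfully or not.
        self.in_flight[name] -= 1
```

When one backend dies, its traffic is not blindly redistributed: the survivors only accept what their capacity allows, which is what stops one failure from toppling the next component.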
Tips & Tricks - (Zero) Configuration
No configuration files
Use good defaults
Auto-discovery (multicast, gossip, ...)
Make everything configurable
  Administrative command
  No need to stop for changes
Automatic self-adjustment when possible
Tips & Tricks - STOP Test
With your system under load: kill -STOP <component>
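The same test can be scripted. The sketch below assumes a Unix system and a hypothetical helper name, `freeze`; it suspends a process, waits, and then resumes it. A stopped process keeps its sockets open but answers nothing, so this exercises timeouts and failure detection differently from a plain crash.

```python
import os
import signal
import time


def freeze(pid: int, seconds: float) -> None:
    """Suspend a process, wait, then resume it (Unix only)."""
    os.kill(pid, signal.SIGSTOP)      # same effect as: kill -STOP <pid>
    try:
        time.sleep(seconds)           # observe how the rest of the system reacts
    finally:
        os.kill(pid, signal.SIGCONT)  # same effect as: kill -CONT <pid>
```

While the component is frozen, watch whether callers time out, queues grow, and the load balancer stops routing to it.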
Tips & Tricks - Know your tools
load average (uptime)
stats tools
  vmstat
  iostat
  mpstat
  tcpstat, tcprstat, etc.
tcpdump, nc, netstat
tuning
  /proc/net/*
  ulimit
  sysctl
oprofile
debugging tools (gdb, valgrind)
...
Tips & Tricks - Count
Count everything
  Connections
  Operations
  Failures
  Successes
  Request times (granularity)
    Total, average, standard deviation
Monitor counters
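Total, average, and standard deviation of request times can be kept as running counters without storing every sample. The sketch uses Welford's online algorithm; the class name `RequestTimer` is made up for the example.

```python
import math


class RequestTimer:
    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def record(self, seconds: float) -> None:
        # Welford's update: constant memory, one pass, numerically stable.
        self.count += 1
        self.total += seconds
        delta = seconds - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (seconds - self._mean)

    @property
    def average(self) -> float:
        return self._mean

    @property
    def stddev(self) -> float:
        # Population standard deviation; 0.0 until there is data.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0
```

Recording the samples 2, 4, 4, 4, 5, 5, 7, 9 gives a total of 40, an average of 5, and a standard deviation of 2.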
Tips & Tricks - Stability Patterns
Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Handshaking
Test Harness
Decoupling Middleware
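Of these patterns, the Circuit Breaker is the easiest to show in a few lines. This is a simplified sketch, not the implementation from any book or library: the thresholds, the class names, and the injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    pass


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the broken component.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each remote call as `breaker.call(fetch, key)` means a dead dependency costs a few failed attempts, after which callers are rejected immediately until the cool-down expires.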
Tips & Tricks - Don't Panic!
Learning More - Books
TCP/IP Illustrated, Vol. 1: The Protocols
Learning More - Books
Unix Network Programming, Vol. 1: The Sockets Networking API
Learning More - Books
Pattern Oriented Software Architecture, Vol. 2
Learning More - Books
Release It!
Learning More - Papers
The Google File System
Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon's Highly Available Key-Value Store
PNUTS: Yahoo!’s Hosted Data Serving Platform
MapReduce: Simplified Data Processing on Large Clusters
Towards robust distributed systems
Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
BASE: An Acid Alternative
Looking up data in P2P systems
Thanks!!! Questions?
lucindo.github.com - @rlucindo