Post on 17-May-2015
Apache ZooKeeperКурс «Базы данных»
Щитинин Дмитрий
Computer Science Center
11 ноября 2013 г.
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 1 / 54
Содержание
1 Introducing
2 Data Model
3 API
4 Usages
5 Architecture
6 Throughput
7 Conclusion
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 2 / 54
Introducing
Introducing
High-available (decentralized, no SPoF) and high-reliableserivce for
NamingConfiguration managementSynchronizationLeader electionMessage QueueNotification system
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 3 / 54
Introducing History
Google Chubby
2006, Google published a paper on "Chubby"Google’s primary internal name serviceСommon mechanism for MapReduce, GFS, BigTableUsages:
election a primary from redundant replicasrepository for high availabable files
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 4 / 54
Introducing History
History
ZooKeeper is the close clone of Chubby:2006-2008 - Developed at Yahoo!1
2008-2010 - Moved to Apache as Hadoop subprojectJan 2011-... - Became top-level project of Apacheprojects system
Commiters: Yahoo!, Cloudera, Google, Facebook,Microsoft, ...2
1http://www.sdtimes.com/content/article.aspx?ArticleID=330112http://zookeeper.apache.org/credits.html
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 5 / 54
Introducing Data Model
Data Model
Hierarchical tree of nodes (similar to Google Chubby)called "znode"znode can contain either data and subnodesNot high-volume data storage (znode data - max1MB)Access by key (seq of znode’s names "/"delimieted,like FS paths)Watches/Notications of znode changesznodes have version counters and other metadata
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 6 / 54
Introducing ACID
ACID
Atomicy per node: no partial reads or writesNo rollbacksSequential consistency (clients see a single sequenceof operations)Isolation per nodeDurable (commit log + replication)
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 7 / 54
Introducing CAP
CAP
Guarantees consistency (writes are linearizable - havetotal order)May not be available for writes (strict quorum basedreplication of them)Partition Tolerance
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 8 / 54
Introducing Users
Users3
Apache: HBase, HDFS, Hadoop MapReduce, Kafka,SolrKatta (distributed Lucene indexes)Eclipse Communication FrameworkNorbert, LinkedIn, partitioned routing, clustermanagementNeo4j, graph databaseS4, Yahoo, stream processingredis_failover (automatic master/slave failover)Yandex
3https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 9 / 54
Introducing Users
Use in HBase
Leader ElectionEnsure there is at most 1 active master at any timeConfiguration ManagementStore the bootstrap locationGroup MembershipDiscovers servers and notifies about server death
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 10 / 54
Introducing Motivation
Motivation
Making up own protocols for coordinate is almostfailedDistributed system architecture is hard problemraces, deadlocks, inconsistency, reliabilityZK provides tools for create correct distributedapplications:
hard problems already solvedno bycile re-inventionsusing open-source well-tested service and clients
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 11 / 54
Data Model
Data Model
Hierarhical tree of nodes called "znodes"znode contains a data and ACLznode may contain children znodes1MB of data for any znodeAtomic data access (reads and writes)No append operationznodes referenced by path - "/"delimited strings
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 12 / 54
Data Model
Data Model
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 13 / 54
Data Model Znode Types
Znode Types
Can be persistent or ephemeral - set at cration time.Persistent znodes
not tied to a client sessiondeleted explicitly by a client (any according to ACL)may have children
Ephemeral znodesdeleted by ZK when creating client session endsalways leaf nodes - may not have subnodes (no evenephemeral)tied to a client session, visible to all clients (according toACL)ideal for wacth distributed resources availability
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 14 / 54
Data Model Sequential Znodes
Sequential Znodes
sequential number by ZK is a part of znode namesequential flag sets at cration timenumbers are monotonically increasingsequence is maintained by parent znodeused to impose a global ordering on eventswe will learn how to build distributed lock on top ofthem
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 15 / 54
Data Model Watches
Watches
Allow clients to get notifications when a znode changes
set by operations on ZKtriggered by other operationstriggered once - for multiple notification clientreregisters the watchused for quick reactions of different changes
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 16 / 54
API Operations
Operations
Operation Descriptioncreate Creates a znode (the parent znode must already exist)delete Deletes a znode (the znode must not have any children)exists Tests whether a znode exists and retrieves its metadatagetACL, setACL Gets/sets the ACL for a znodegetChildren Gets a list of the children of a znodegetData, setData Gets/sets the data associated with a znodesync Synchronizes a client’s view of a znode with ZooKeeper
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 17 / 54
API Guarantees
GuaranteesSequential Consistency - Updates from a clientwill be applied in the order that they were sentAtomicity - Updates either succeed or fail. Nopartial results.Single System Image - A client will see the sameview of the service regardless of the server that itconnects toReliability - Once an update has been applied, itwill persist from that time forward until a clientoverwrites the updateTimeliness - The clients view of the system isguaranteed to be up-to-date within a certain timeboundЩитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 18 / 54
API Guarantees
Updates
Update operations (delete, setData) are conditional:
version number of znode under update is specifiedperforms only if version numbers are equalso updates are non-blocking
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 19 / 54
API Clients
Clients
Core language bindings:Java (org.apache.zookeeper:zookeeper:3.4.5)C (zookeeper_st, zookeeper_mt libs)
Contrib bindings:PerlPythonREST clients
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 20 / 54
API Clients
Async & sync
Each binding provides async and sync APIs.
both have same funtionalityasync is preferred for event-driven programmingmodelasync API can offer better throughput
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 21 / 54
API Watch triggers
Watch triggers
exists, getChildren, getData may have watches set onthem
triggered by write operations: create, delete, setDatadon’t triggered by ACL operationswatch event generated includes path of modifiedznode
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 22 / 54
API Watch triggers
Operations and triggers
create znode create child delete znode delete child set dataexists NodeCreated NodeDeleted NodeData
ChangedgetData NodeDeleted NodeData
ChagnedgetChildren NodeChildren
ChagnedNodeDeleted NodeChildren
Chagned
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 23 / 54
API Watch triggers
Handling events
NodeCreated & NodeDeletedcontains modified znode pathNodeChildrenChangeneed to call getChildren for retrieve fresh list ofchildrenNodeDataChangedneed to call getData for retrieve fresh znode data
Warning!State of the znodes may have changed between receivingthe watch event and performing the read operation.
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 24 / 54
API ACL
ACL
Available authentication types:
digestby username and passwordsaslKerberos 4 usedipIP address used
4http://en.wikipedia.org/wiki/Kerberos_(protocol)Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 25 / 54
Usages Out of Box usages
Out of Box usages
Provided directly by API:Name ServiceConfiguration management
Provided by create, setData, getData methodsBenefits:
new nodes only need to know how to connect toZooKeeper and can then download all otherconfiguration information and determineapplication can subscribe to changes in theconfiguration and apply them quickly
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 26 / 54
Usages Group Membership
Group Membership
group is represented by a node (persistent)members of the group are ephemeral nodes under thegroup nodeclients call getChildren() on group node with watchsetfailed member nodes will be removed automaticallyby ZKNodeChildrenChanged notification send to clients
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 27 / 54
Usages Lock
LockObtain the lock:
define lock node - create(’_locknode_’)call create(’_locknode_/lock-’, sequential = true, ephemeral =true), remember created path as ’path-A’call getChildren(’_locknode_’) without setting the watchif ’path-A’ has lowest seq number suffix from all children clientobtains the lock and exits(if ’path-A’ has not lowest seq number), let ’path-B’ has next lowestseq number client calls exists() with the watch set on the ’path-B’if exists() returns false, go to 2 else wait for a notification for’path-B’ before going to step 2.
Release lock:Remove node created at step 1 (i.e. ’path-A’)
Note: node removal will only cause one client to wake up (eachnode is watched by exactly one client)no polling or timeouts
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 28 / 54
Usages ReadWrite Lock
ReadWrite LockRead lock acquire:
read-A = create("_locknode_/read- sequential=true,ephemeral=true)getChildren("_locknode_ watch = false)are there children with path starting with "write-"and having a lowersequence number than ’read-A’ ?no: acquire the read lock and exityes: let ’write-B’ has next lowest seq number and path starts with’write-’call exists(watch=true) on ’write-B’if exists() returns false, go to 2 else wait for a notification for’write-B’ before going to step 2
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 29 / 54
Usages ReadWrite Lock
ReadWrite LockWrite lock acquire:
"write-A- create("_locknode_/write- sequential=true,ephemeral=true)getChildren("_locknode_ watch =false)are there children with a lower sequence number than the "write-A"?no: acquire the write lock and exityes: let ’path-B’ has next lowest seq numbercall exists(watch=true) on ’path-B’if exists() returns false, go to step 2 else wait for a notification for’path-B’
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 30 / 54
Usages Leader Election
Leader ElectionWork path - ’_election_’Leader volunteer behavior:
z = create("_election_/n_ sequential=true, ephemeral=true), z =n_i (i - seq number assigned for created node)C = getChildren("_election_")if i is the lowest seq suffix in C then current volunteer becomes aleaderelse watch for changes on "_election_/n_j where j is the largestsequence number and j < i
When receive a notification of znode deletion:C = getChildren("_election_")if i is the lowest seq suffix in C then current volunteer becomes aleaderelse watch for changes on "_election_/n_j where j is the largestsequence number and j < i
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 31 / 54
Usages Apache Curator
Apache Curator
"Curator is a set of Java libraries that make using ApacheZooKeeper much easier"5
initially created by Netflix6
moved too Apache ecosystem
5http://curator.apache.org6http://en.wikipedia.org/wiki/Netflix
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 32 / 54
Usages Apache Curator
Apache CuratorConsists from:
Framework - high-level API that simplifies using ZooKeeperhandles the complexity of managing connections to the ZooKeepercluster and retrying operationsRecipes - Implementations of some of the common ZooKeeper"recipes"Utilities - Various utilities that are useful when using ZooKeeperClient - A replacement for ZooKeeper class that takes care of somelow-level housekeepingErrors - How Curator deals with errors, connection issues,recoverable exceptions, etc.Extensions - Other recipes (ServiceDiscovery)
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 33 / 54
Architecture Physical View
Physical View
leader is elected at startup and after leader failuresclients connect to exactly one server to submit requestsevery server services clients:
read requestswrite requests are forwarded to leader (responses are sent when a majorityof followers have persisted the change)
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 34 / 54
Architecture Logical View
Logical View
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 35 / 54
Architecture Logical View
Request Processor
prepares request for executionwhen write request received RP calculates what thestate of the system will ater its applying andtransforms it into a transactiontransactions are idempotemt (unlike client request)future state must be calculated: there may beoutstanding transactions that have not yet beenapplied to the database
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 36 / 54
Architecture Logical View
Atomic Broadcast
Atomic broadcast (total order broadcast) - a broadcastmessaging protocol that ensures that messages arereceived reliably and in the same order by all participants.
Special protocol implementation used - Zab
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 37 / 54
Architecture Logical View
Replicated Database
in-memory database containing the entire data treeeach replica has its own copyfor durability:
log updates to disk (replay log (a write-ahead) ofcommitted operations)force writes to be on the disk before they are applied tothe in-memory databasegenerate periodic snapshots of the in-memory database
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 38 / 54
Architecture Broadcast algorithms
Broadcast algorithmsA broadcast algorithm transmits a message from one process (primary) toall other processes in a network or in a broadcast domain, including theprimary. Atomic broadcast satisfies:
ValidityIf a correct process broadcasts a message, then all correct processeswill eventually deliver itUniform AgreementIf a process delivers a message, then all correct processes eventuallydeliver that messageUniform IntegrityFor any message m, every process delivers m at most once, and onlyif m was previously broadcast by the sender of mUniform Total OrderIf processes p and q both deliver messages m and m’, then p deliversm before m’ if and only if q delivers m before m’
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 39 / 54
Architecture Paxos
Paxos
Traditional protocol for solving distributed consensus 7.
7http://en.wikipedia.org/wiki/Consensus_(computer_science)Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 40 / 54
Architecture Paxos
RolesClientissues a request to the distributed system, and waits for a responseAcceptor (Voter)Act as the fault-tolerant "memory"of the protocol. Acceptors arecollected into groups called Quorums.ProposerAdvocates a client request, attempting to convince the Acceptors toagree on itLearnerExecute the request and send a response to the client)Leader A distinguished Proposer (called the leader) to makeprogress.
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 41 / 54
Architecture Paxos
Basic Paxos
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 42 / 54
Architecture Why not Paxos
Why not Paxos
was not originally intended for atomic broadcastingbut consensus protocols can be used for atomicbroadcastingdoes not enable multiple outstanding transactionsdoes not require FIFO channels for communication,so it tolerates message loss and reorderingdoes not support order dependency of proposedvalues.(Problem could be solved by batching multipletransactions into a single proposal and allowing atmost one proposal at a time, but this hasperformance drawbacks)
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 43 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Crash Recovery
Peers communicate by message passing. They maycrash and recover indefinitely many times.Quorum is subset of peers that contains more thanhalf onesAny two quorums have a non-empty intersectionBidirectional channel for every pair of peers mustsatisfy:
integrityprefix
ZAB uses TCP (therefore FIFO) channels forcommunication
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 44 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Transactions
State changes that the primary propagate ("broadcasts")to the ensemble, and are represented by a pair (v, z):
v - new statez - transactional identifier - called zxid
zxid z of a transaction is a pair (e, c), where:e - the epoch number of the primary that generatedthe transactionc - is an integer acting as a counter
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 45 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Broadcast protocol
Peer states:followingleadingelection
Whether a peer is a follower or a leader, it executes threeZab phases:
discoverysynchronizationbroadcast
Previous to Phase 1, a peer is in state election, when itexecutes a leader election algorithm.
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 46 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Phase 0: Leader election
peers have state electionno specific leader election protocol needs to beemployedif peer p voted for peer p’, then p’ is called theprospective leader for p
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 47 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Phase 1: Discovery
followers communicate with their prospective leaderthat leader gathers information about the mostrecent transactions that its followers accepteddiscover the most updated sequence of acceptedtransactions among a quorum, and to establish a newepoch so that previous leaders cannot commit newproposals
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 48 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Phase 2: Synchronization
Recovery part of the protocolsynchronize the replicas in the ensemble using theleader’s updated history from the previous phaseleader communicates with the followers, proposingtransactions from its history. Followers acknowledgethe proposals if their own history is behind theleader’s history.when the leader sees acknowledgements from aquorum, it issues a commit message to them.At that point, the leader is said to be established andnot anymore prospective.
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 49 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Phase 3: Broadcast
peers stay in this phase until any crashes occurperform broadcast transaction for client writerequests
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 50 / 54
Architecture ZooKeeper Atomic Broadcast (ZAB)
Implemented protocol
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 51 / 54
Throughput
Throughput
running on servers with dual 2Ghz Xeon and two SATA 15K RPM drivesone drive was used as a dedicated log device"servers"indicate the size of the ensemble
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 52 / 54
Conclusion Off Screen
Off Screen
More ZooKeeper usagesZooKeeper cluster tuningError handling while native ZooKeeper client usageDeep study of Paxos and ZAB
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 53 / 54
Conclusion Questions?
Questions?
Thanks! Questions?
Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 54 / 54