Vishal Kathuria. Zookeeper use at Facebook Project Zeus – Goals Tao Design Tao Workload...

Zookeeper at Facebook Vishal Kathuria




Agenda

Zookeeper use at Facebook
Project Zeus – Goals
Tao Design
Tao Workload simulator
Early results of Zookeeper testing
Zookeeper Improvements


Use Cases Inside Facebook

HDFS
  ◦ For location of the name node
  ◦ Name node leader election
  ◦ 75K temporary (permanent in future) clients
HBase
  ◦ For mapping of regions to region servers, location of ROOT node
  ◦ Region server failure detection and failover
  ◦ After UDBs move to HBase, ~100K permanent clients
Titan
  ◦ Mapping of user to Prometheus web server within a cell
  ◦ Leader election of Prometheus web server
  ◦ Future: Selection of the HBase geo-cell


Use cases (contd)

Ads
  ◦ Leader Election
Scribe
  ◦ Leader election of scribe aggregators
Future customers
  ◦ TAO
      Sharding
  ◦ MySQL
      Leader Election
  ◦ Search


Project Zeus

“Make Zookeeper awesome”
  ◦ Zookeeper works at Facebook scale
  ◦ Zookeeper is one of the most reliable services at Facebook
Solve pressing infrastructure problems using ZooKeeper
  ◦ Shard Manager for Tao
  ◦ Generic Shard Management capability in Tupperware
  ◦ MySQL HA


Caveats

Project is 5 weeks old
Initial sharing of ideas with the community
  ◦ Ideas not yet vetted or proven through prototypes


Tao Design

Shard Map
  ◦ Based on ranges instead of consistent hash
  ◦ Stored in ZooKeeper
  ◦ Accessed by clients using Aether
  ◦ Populated by Eos
Dynamically updated based on load information
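A range-based shard map can be illustrated with a sorted list and a binary search. The sketch below is only an illustration of the idea on this slide: the map layout, the server names, and the `lookup` and `split_range` helpers are all hypothetical and say nothing about how Tao, Aether, or Eos actually represent the map.

```python
from bisect import bisect_right

# Hypothetical shard map: sorted (range_start, server) pairs. Each entry owns
# the ids from its start up to (not including) the next entry's start.
SHARD_MAP = [
    (0, "tao-server-a"),
    (1000, "tao-server-b"),
    (5000, "tao-server-c"),
]


def lookup(shard_map, object_id):
    """Return the server whose range contains object_id."""
    starts = [start for start, _ in shard_map]
    index = bisect_right(starts, object_id) - 1
    if index < 0:
        raise KeyError(f"id {object_id} is below the first range")
    return shard_map[index][1]


def split_range(shard_map, split_at, new_server):
    """Return a new map in which ids from split_at up to the next range start
    move to new_server, as a load-based rebalance might do.

    Assumes split_at is not already the start of an existing range.
    """
    return sorted(shard_map + [(split_at, new_server)])


rebalanced = split_range(SHARD_MAP, 3000, "tao-server-d")
print(lookup(SHARD_MAP, 3000), "->", lookup(rebalanced, 3000))
```

Unlike a consistent hash, moving load here only means inserting or shifting a range boundary, which is one reason a range map lends itself to dynamic updates driven by load information.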


Tao Projected Workload

Scale requirements for a single cluster
24,000 Web machines
  ◦ Read only clients
6,000 Tao server machines
  ◦ Read/Write clients
About 20 clusters site wide
Shard Map is 2-3 MB of data
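To get a feel for these numbers, the short calculation below estimates how much data one cluster's ensemble would serve if every client machine read the full shard map once. That is a simplifying assumption made here for illustration (it ignores, for example, servers that read only part of the map or several maps), and decimal gigabytes are assumed.

```python
WEB_MACHINES = 24_000        # read-only clients (from the slide)
TAO_SERVERS = 6_000          # read/write clients (from the slide)
SHARD_MAP_MB = (2, 3)        # shard map size range in MB (from the slide)


def full_read_volume_gb(clients, map_mb):
    """Data served if every client reads the whole shard map once."""
    return clients * map_mb / 1000  # decimal GB, an assumption


clients = WEB_MACHINES + TAO_SERVERS
low, high = (full_read_volume_gb(clients, mb) for mb in SHARD_MAP_MB)
print(f"{clients} clients -> {low:.0f} to {high:.0f} GB per full refresh")
```

Under these assumptions a single full refresh by all 30,000 machines moves on the order of 60 to 90 GB, which is why connection storms and map updates are the interesting cases to test.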


Tao Workload Simulator

Clients
  ◦ Read the shard map of local cluster after connection
  ◦ Put a watch on the shard map
  ◦ Refresh shard map after watch fires
Follower Servers
  ◦ These servers are clients of the leader servers
  ◦ Also read their own shard map
Leader Servers
  ◦ Read their own shard map and of all of their followers
Shard Manager - Eos
  ◦ Periodically updates the shard map
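The client behaviour listed on this slide (read, watch, refresh when the watch fires) can be sketched as follows. ZooKeeper watches are one-time triggers, so a client has to re-register the watch each time it re-reads the data. The `FakeZNode` class is an in-memory stand-in written for this example, not the ZooKeeper client API, and `ShardMapClient` is a hypothetical name.

```python
class FakeZNode:
    """In-memory stand-in for a znode with one-shot watches.

    Not the ZooKeeper client API; it only mimics the one-time-trigger
    behaviour of ZooKeeper watches.
    """

    def __init__(self, data):
        self._data = data
        self._watchers = []

    def get(self, watch=None):
        if watch is not None:
            self._watchers.append(watch)
        return self._data

    def set(self, data):
        self._data = data
        # Watches fire once and are then cleared; clients must re-register.
        fired, self._watchers = self._watchers, []
        for callback in fired:
            callback()


class ShardMapClient:
    """Reads the shard map, watches it, and refreshes when the watch fires."""

    def __init__(self, znode):
        self._znode = znode
        self.refreshes = 0
        self.shard_map = None
        self._refresh()

    def _refresh(self):
        # Reading with a watch re-arms the notification for the next update.
        self.shard_map = self._znode.get(watch=self._refresh)
        self.refreshes += 1


node = FakeZNode({"0-999": "server-a"})
client = ShardMapClient(node)
node.set({"0-499": "server-a", "500-999": "server-b"})  # Eos-style update
node.set({"0-999": "server-b"})
print(client.refreshes, client.shard_map)
```

Every update to the map therefore causes every watching client to read the map again, so the update frequency directly drives read load on the ensemble.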


Hardware

3 node zookeeper ensemble
  ◦ 8 core
  ◦ 8G RAM
Clients – 20 node cluster
  ◦ Web class machines
  ◦ 12 G RAM


Scenario - Steady State

Using Zookeeper ensemble per cluster model
Assumptions
  ◦ 40K connections
  ◦ Small number of clients joining/leaving at any time
  ◦ Rare updates to the shard map – once every 10 minutes
Result
  ◦ Zookeeper worked well in this scenario


Scenario - Cluster Power Up/Down

Cluster powering up
  ◦ 25K clients simultaneously trying to connect
  ◦ Slow response time
      It took some clients 560s to connect and get data
Cluster powering down
  ◦ 25K clients simultaneously disconnect
  ◦ System temporarily unresponsive
      The disconnect requests filled zookeeper queues
      System would not accept any more new connections or requests
      After a short time, the disconnect requests were processed and the system became responsive again
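The power-down behaviour described above (a burst of disconnects fills the queue, new requests are refused, then the system recovers) can be reproduced with a toy bounded-queue simulation. This is not a model of ZooKeeper's internals: the queue capacity, the processing rate, and the tick-based timing are invented numbers chosen only to show the shape of the effect.

```python
from collections import deque

QUEUE_CAPACITY = 10_000      # hypothetical bound on the request queue
PROCESSED_PER_TICK = 5_000   # hypothetical single-thread processing rate


def simulate(disconnects, connects_per_tick, ticks):
    """Return the number of new connect requests rejected in each tick."""
    queue = deque()
    # The whole cluster disconnects at once; requests beyond the queue
    # capacity wait outside (for example in socket buffers) and refill it.
    backlog = disconnects
    rejected_per_tick = []
    for _ in range(ticks):
        # Waiting disconnects take any free queue slots first.
        admitted = min(backlog, QUEUE_CAPACITY - len(queue))
        queue.extend(["disconnect"] * admitted)
        backlog -= admitted
        # New connects only get the slots that are left.
        free = QUEUE_CAPACITY - len(queue)
        accepted = min(connects_per_tick, free)
        queue.extend(["connect"] * accepted)
        rejected_per_tick.append(connects_per_tick - accepted)
        # The single pipeline thread drains a fixed number of requests.
        for _ in range(min(PROCESSED_PER_TICK, len(queue))):
            queue.popleft()
    return rejected_per_tick


print(simulate(disconnects=25_000, connects_per_tick=100, ticks=8))
```

With these made-up parameters, new connections are rejected for the first few ticks while the disconnect backlog drains, and are accepted again once it has been processed.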


Scenario – Zookeeper Node Failure

Rolling Restart of ZooKeeper Nodes
Startup/Shutdown of entire cluster
  ◦ With active clients
  ◦ Without active clients
Result
  ◦ No corruptions or system hangs noticed so far


Zookeeper Design

Client connect/disconnect is a persisted update involving all nodes
The ping and connection timeout handling is done by the leader for all connections
Single thread handling connect requests and data requests
Zookeeper is implemented as a single threaded pipeline
  ◦ All reads are serialized
  ◦ Low read throughput
  ◦ Uses only 3 cores at full load


Zookeeper Improvement Ideas

Non persisted sessions with local session tracking
  ◦ Hacked a prototype to test potential
  ◦ Initial test runs very encouraging
Dedicated connection creation thread
  ◦ Prototyped, test runs in progress
Multiple threads for deserializing incoming requests
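The first idea, non-persisted sessions with local session tracking, can be pictured as each server keeping its own clients' sessions in memory, so that connects, pings, and timeouts never become replicated updates handled through the leader. The class below is a hypothetical sketch of that idea only; it is not the prototype mentioned on the slide and not ZooKeeper code, and all names and the timeout value are invented.

```python
import itertools


class LocalSessionTracker:
    """Tracks sessions in one server's memory only (hypothetical sketch).

    Creating, touching, and expiring a session never produces a replicated
    write; the cost is that the sessions vanish if this server restarts.
    """

    def __init__(self, timeout):
        self._timeout = timeout
        self._expiry = {}                 # session id -> expiry time
        self._ids = itertools.count(1)

    def create(self, now):
        session_id = next(self._ids)
        self._expiry[session_id] = now + self._timeout
        return session_id

    def touch(self, session_id, now):
        """Handle a client ping locally; return False if the session is gone."""
        if session_id not in self._expiry:
            return False
        self._expiry[session_id] = now + self._timeout
        return True

    def expire(self, now):
        """Drop sessions whose timeout has passed; return their ids."""
        dead = [sid for sid, t in self._expiry.items() if t <= now]
        for sid in dead:
            del self._expiry[sid]
        return dead


tracker = LocalSessionTracker(timeout=30)
a = tracker.create(now=0)
b = tracker.create(now=0)
tracker.touch(a, now=20)          # a is kept alive until t=50
print(tracker.expire(now=35))     # b timed out at t=30
```

The trade-off in a design like this is that a session known only to one server cannot survive that server's failure, so it suits clients that can simply reconnect and re-read their data.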


Zookeeper Improvement Ideas (contd)

Dedicated parallel pipeline for read only clients