MegaStore and Spanner

Megastore and Spanner, Amir H. Payberah, Amirkabir University of Technology (Tehran Polytechnic), 1393/7/28

Transcript of MegaStore and Spanner

Page 1: MegaStore and Spanner

Megastore and Spanner

Amir H. Payberah

Amirkabir University of Technology (Tehran Polytechnic)

Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 1 / 54

Page 2: MegaStore and Spanner

Motivation

I Storage requirements of today’s interactive online applications:
  • Scalability (a billion internet users)
  • Rapid development
  • Responsiveness (low latency)
  • Durability and consistency (never lose data)
  • Fault tolerant (no unplanned/planned downtime)
  • Easy operations (minimize confusion, support is expensive)

I These requirements are in conflict.

Page 4: MegaStore and Spanner

Motivation

I Relational DBMS, e.g., MySQL, MS SQL, Oracle RDB
  • Rich set of features
  • Difficult to scale to a massive volume of reads and writes

I NoSQL, e.g., BigTable, Dynamo, Cassandra
  • Highly scalable
  • Limited API

Page 6: MegaStore and Spanner

NewSQL Databases

I NoSQL scalability + RDBMS ACID

I E.g., Megastore and Spanner

Page 7: MegaStore and Spanner

Megastore

Page 8: MegaStore and Spanner

Megastore

I Started in 2006 for app development at Google.

I Google’s wide-area replicated data store.

I Adds (limited) transactions to wide-area replicated data stores.

I GMail, Google+, Android Market, Google App Engine, ...

Page 9: MegaStore and Spanner

Megastore

I Megastore is layered on:
  • GFS (distributed file system)
  • Bigtable (scalable NoSQL data store per datacenter)

I BigTable is cluster-level structured storage, while Megastore is a geo-scale structured database.

[http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html]

Page 11: MegaStore and Spanner

Entity Group (1/2)

I The data is partitioned into a collection of entity groups (EG).

Page 12: MegaStore and Spanner

Entity Group (2/2)

Page 13: MegaStore and Spanner

Entity Group Replication (1/2)

I Each entity group is independently and synchronously replicated over a wide area.

I Megastore’s replication system provides a single consistent view of the data stored in its underlying replicas.

Page 14: MegaStore and Spanner

Entity Group Replication (2/2)

I Synchronous replication: a low-latency implementation of Paxos.

I Basic Paxos is not used: it is a poor match for high-latency links.
  • Writes require at least two inter-replica round-trips to achieve consensus: a prepare round and an accept round.
  • Reads require one inter-replica round-trip: a prepare round.

I Megastore uses a modified version of Paxos: fast reads, fast writes.
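
As a rough illustration of why basic Paxos suits high-latency links poorly, the sketch below multiplies the round-trip counts from the bullets above by an inter-replica round-trip time. The function name and the 100 ms figure are assumptions for illustration, not measurements from Megastore.

```python
# Round-trip counts for basic Paxos, as stated on the slide:
# a write needs a prepare round plus an accept round, a read needs a prepare round.
BASIC_PAXOS_ROUND_TRIPS = {"write": 2, "read": 1}


def basic_paxos_latency_ms(operation: str, rtt_ms: float) -> float:
    """Lower bound on latency when each round costs one inter-replica round-trip."""
    return BASIC_PAXOS_ROUND_TRIPS[operation] * rtt_ms


if __name__ == "__main__":
    rtt_ms = 100.0  # hypothetical wide-area round-trip time
    for op in ("write", "read"):
        print(op, basic_paxos_latency_ms(op, rtt_ms), "ms")
```

The model ignores local processing and retries, so it only shows the floor that the modified protocol tries to lower.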

Page 18: MegaStore and Spanner

Entity Group Transaction (1/3)

I Within each EG: full ACID semantics

I Transaction management using Write Ahead Logging (WAL).

I BigTable feature: ability to store multiple values for the same row/column with different timestamps.

I Multiversion Concurrency Control (MVCC) in EGs.

Page 22: MegaStore and Spanner

Entity Group Transaction (2/3)

I Read consistency

• Current: ensures that all previously committed writes are applied, then reads the last committed value.

• Snapshot: doesn’t wait; reads as of the last fully applied transaction, even if some committed writes are not yet applied.

• Inconsistent reads: ignore the state of the log and read the latest values directly (data may be stale).
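
The three read modes can be sketched with a toy entity group whose write-ahead log may run ahead of what has been applied to the multi-version store. The class and method names are hypothetical, and timestamps are simply log positions; this is an assumption-laden model, not Megastore's implementation.

```python
from bisect import bisect_right


class EntityGroup:
    """Toy model: a write-ahead log plus a multi-version store."""

    def __init__(self):
        self.log = []        # committed entries: (timestamp, {key: value})
        self.applied = 0     # number of log entries fully applied to the store
        self.versions = {}   # key -> list of (timestamp, value), ascending

    def commit(self, mutations):
        """Append a committed transaction to the log without applying it yet."""
        timestamp = len(self.log) + 1
        self.log.append((timestamp, dict(mutations)))
        return timestamp

    def apply_pending(self):
        """Apply every committed-but-unapplied log entry to the store."""
        for timestamp, mutations in self.log[self.applied:]:
            for key, value in mutations.items():
                self.versions.setdefault(key, []).append((timestamp, value))
        self.applied = len(self.log)

    def _read_at(self, key, timestamp):
        """Newest version of `key` whose timestamp is <= `timestamp`."""
        history = self.versions.get(key, [])
        index = bisect_right([ts for ts, _ in history], timestamp)
        return history[index - 1][1] if index else None

    def current_read(self, key):
        self.apply_pending()  # catch up with the log first
        last_committed = self.log[-1][0] if self.log else 0
        return self._read_at(key, last_committed)

    def snapshot_read(self, key):
        last_applied = self.log[self.applied - 1][0] if self.applied else 0
        return self._read_at(key, last_applied)

    def inconsistent_read(self, key):
        history = self.versions.get(key, [])
        return history[-1][1] if history else None  # ignores the log entirely


if __name__ == "__main__":
    eg = EntityGroup()
    eg.commit({"x": 1})
    eg.apply_pending()
    eg.commit({"x": 2})               # committed but not yet applied
    print(eg.snapshot_read("x"))      # 1
    print(eg.inconsistent_read("x"))  # 1 (stale)
    print(eg.current_read("x"))       # 2
```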

Page 26: MegaStore and Spanner

Entity Group Transaction (3/3)

I Write consistency

• Determines the next available log position.

• Assigns the WAL mutations a timestamp higher than any previous one.

• Employs Paxos to settle the contention for that log position.

• Based on optimistic concurrency: when multiple writers target the same log position, only one wins; the rest notice the victorious write, abort, and retry their operations.
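
A minimal sketch of the optimistic write path described above. In Megastore the winner of a log position is decided by Paxos; here a plain position check stands in for that agreement, and all names (`ReplicatedLog`, `try_append`, `write_with_retry`) are hypothetical.

```python
class ReplicatedLog:
    """Toy stand-in for the agreed write-ahead log of one entity group."""

    def __init__(self):
        self.entries = []  # entries[i] is the mutation chosen for log position i

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, mutation):
        """Accept the mutation only if `position` is still the next free slot."""
        if position != len(self.entries):
            return False  # another writer already won this position
        self.entries.append(mutation)
        return True


def write_with_retry(log, make_mutation, max_attempts=5):
    """Optimistic write: observe the log, try to claim the slot, retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        position = log.next_position()
        if log.try_append(position, make_mutation(position)):
            return position, attempt
        # Lost the race: abort this attempt and retry at the new position.
    raise RuntimeError("gave up after repeated conflicts")


if __name__ == "__main__":
    log = ReplicatedLog()
    pos_a = log.next_position()
    pos_b = log.next_position()          # both writers observe position 0
    print(log.try_append(pos_a, "A"))    # True: writer A wins
    print(log.try_append(pos_b, "B"))    # False: writer B notices the conflict
    print(write_with_retry(log, lambda p: f"B@{p}"))  # (1, 1): B retries at position 1
```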

Page 30: MegaStore and Spanner

Across Entity Group Transaction (1/3)

I Across entity groups: limited consistency guarantees

I Two methods:
  • Asynchronous messaging (queues)
  • Two-phase commit (2PC)

Page 31: MegaStore and Spanner

Across Entity Group Transaction (2/3)

I Queues

I Provide transactional messaging between EGs.

I Each message has a single sending and a single receiving entity group:
  • Synchronous: the sending and receiving entity groups are the same.
  • Asynchronous: the sending and receiving entity groups are different.

I Useful to perform operations that affect many EGs.

Page 32: MegaStore and Spanner

Across Entity Group Transaction (3/3)

I Two-Phase Commit

I Atomicity is satisfied.

I High latency

Page 33: MegaStore and Spanner

Spanner

Page 34: MegaStore and Spanner

Limitations of Existing Systems

I BigTable
  • Scalability
  • High throughput
  • High performance
  • Transactional scope limited to a single row
  • Eventually consistent replication across datacenters

Page 35: MegaStore and Spanner

Limitations of Existing Systems

I Megastore
  • Replicated ACID transactions
  • Schematized semi-relational tables
  • Synchronous replication across datacenters
  • Performance (poor write throughput)
  • Lack of a query language

Page 36: MegaStore and Spanner

Spanner

I Bridging the gap between Megastore and Bigtable.

I SQL transactions + high throughput

Page 37: MegaStore and Spanner

Spanner

I Global scale database with strict transactional guarantees.

I Global scale
  • Across datacenters
  • Scales up to millions of nodes, hundreds of datacenters, trillions of database rows

I Strict transactional guarantees
  • General transactions (even inter-row)
  • Reliable even during wide-area natural disasters

Page 40: MegaStore and Spanner

Spanner Implementation

Page 41: MegaStore and Spanner

Spanner Organization (1/2)

I Universe: Spanner deployment

I Zones: analogous to a deployment of BigTable servers (the unit of physical isolation)
  • One zonemaster: assigns data to spanservers
  • The proxies: used by clients to locate the spanservers assigned to serve their data
  • Thousands of spanservers: serve data to clients

Page 46: MegaStore and Spanner

Spanner Organization (2/2)

I The universe master: a console that displays status information about all the zones.

I The placement driver: handles automated movement of data across zones.

Page 47: MegaStore and Spanner

Spanserver Software Stack (1/4)

I Each spanserver is responsible for 100-1000 instances of a data structure called a tablet (similar to a BigTable tablet).

I Tablet mapping: (key: string, timestamp: int64) → string

I Data and logs stored on Colossus (successor of GFS).
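
The tablet mapping above can be pictured as a dictionary keyed by (key, timestamp) pairs. The sketch below is an in-memory toy with hypothetical names; it uses a linear scan for clarity and says nothing about how Spanner actually stores tablets on Colossus.

```python
class Tablet:
    """Toy bag of (key: str, timestamp: int) -> str mappings."""

    def __init__(self):
        self._cells = {}  # (key, timestamp) -> value

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._cells[(key, timestamp)] = value

    def read(self, key: str, timestamp: int):
        """Return the value of `key` with the largest timestamp <= `timestamp`."""
        candidates = [ts for (k, ts) in self._cells if k == key and ts <= timestamp]
        return self._cells[(key, max(candidates))] if candidates else None


if __name__ == "__main__":
    tablet = Tablet()
    tablet.write("user/1", 10, "a")
    tablet.write("user/1", 20, "b")
    print(tablet.read("user/1", 15))  # a
    print(tablet.read("user/1", 25))  # b
    print(tablet.read("user/1", 5))   # None
```

Keeping the timestamp in the key is what lets older versions stay readable after newer writes arrive.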

Page 48: MegaStore and Spanner

Spanserver Software Stack (2/4)

I A single Paxos state machine on top of each tablet: consistent replication

I Paxos group: all machines involved in an instance of Paxos.

I The Paxos implementation supports long-lived leaders with time-based leader leases.

Page 51: MegaStore and Spanner

Spanserver Software Stack (3/4)

I Writes must initiate the Paxos protocol at the leader.

I Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.

Page 53: MegaStore and Spanner

Spanserver Software Stack (4/4)

I Transaction manager: supports distributed transactions
  • Present at every replica that is a leader.

Page 54: MegaStore and Spanner

Transactions Involving Only One Paxos Group

I This is the case for most transactions.

I A long-lived Paxos leader.
  • The transaction manager: participant leader
  • The other replicas in the group: participant slaves

I A lock table for concurrency control.
  • Multiple concurrent transactions.
  • Maintained by the Paxos leader.
  • Maps ranges of keys to lock states.
  • Two-phase locking.
  • Wound-wait for deadlock avoidance: a younger transaction dies if an older transaction needs a resource held by the younger transaction.

I A transaction that involves only one Paxos group can bypass the transaction manager.
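
The wound-wait rule from the lock-table bullets can be sketched as a single decision function. The `Txn` type and the string results are hypothetical names for illustration; age is assumed to be given by a start timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    start_ts: int  # a smaller start timestamp means an older transaction


def on_lock_conflict(requester: Txn, holder: Txn) -> str:
    """Wound-wait decision when `requester` needs a lock that `holder` owns."""
    if requester.start_ts < holder.start_ts:
        return "wound"  # older requester: the younger holder is aborted
    return "wait"       # younger requester: it waits for the older holder


if __name__ == "__main__":
    old, young = Txn("T1", start_ts=1), Txn("T2", start_ts=2)
    print(on_lock_conflict(old, young))  # wound
    print(on_lock_conflict(young, old))  # wait
```

Because a transaction only ever waits for an older one, the waits-for graph cannot contain a cycle, which is why the scheme avoids deadlock.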

Page 61: MegaStore and Spanner

Transactions Involving Multiple Paxos Groups

I One of the participant groups is chosen as the coordinator.
  • The participant leader of that group is referred to as the coordinator leader.
  • The slaves of that group are referred to as coordinator slaves.

I The groups’ leaders coordinate to perform two-phase commit.

I The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).
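
A minimal sketch of two-phase commit among group leaders, with the coordinator leader treated as one of the participants. The `Participant` class and its `can_commit` flag are hypothetical stand-ins for each group's local checks; the Paxos logging of each state change, which makes the real protocol survive leader failures, is omitted.

```python
class Participant:
    """Toy participant leader of one Paxos group."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical outcome of local validation
        self.state = "active"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"


def two_phase_commit(coordinator, others):
    participants = [coordinator, *others]
    # Phase 1: every group votes (a list, so every participant is asked).
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: the coordinator's decision is delivered to every participant.
    for p in participants:
        p.finish(decision)
    return decision


if __name__ == "__main__":
    groups = [Participant("g1"), Participant("g2", can_commit=False)]
    print(two_phase_commit(groups[0], groups[1:]))  # False
    print([p.state for p in groups])                # ['aborted', 'aborted']
```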

Page 64: MegaStore and Spanner

Data Model and Directories

Page 65: MegaStore and Spanner

Data Model

I An application creates one or more databases in a universe.

I Each database can contain an unlimited number of schematized tables.

I Table
  • Rows and columns
  • Must have an ordered set of one or more primary-key columns
  • Primary key uniquely identifies each row

I Hierarchies of tables
  • Tables must be partitioned by the client into one or more hierarchies of tables
  • Table at the top: directory table

Page 69: MegaStore and Spanner

Directory (1/2)

I Set of contiguous keys that share a common prefix.

I All data in a directory has the same replication configuration.

I The smallest unit whose geographic replication properties can be specified by an application.

I A Paxos group may contain multiple directories.
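
To make "contiguous keys that share a common prefix" concrete, the sketch below pulls one directory out of a sorted key list. The key layout (`users/<id>/...`) and the function name are assumptions for illustration only.

```python
from bisect import bisect_left


def directory_keys(sorted_keys, prefix):
    """Return the contiguous run of keys in `sorted_keys` that start with `prefix`."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


if __name__ == "__main__":
    keys = sorted([
        "users/1/albums/1",
        "users/1/albums/2",
        "users/1/profile",
        "users/2/albums/1",
        "users/2/profile",
    ])
    print(directory_keys(keys, "users/1/"))
    print(directory_keys(keys, "users/2/"))
```

Since keys are sorted lexicographically, everything under one prefix sits in a single run, so a directory can be moved between Paxos groups as one range.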

Page 73: MegaStore and Spanner

Directory (2/2)

I Spanner might move a directory:
  • To shed load from a Paxos group.
  • To put directories that are frequently accessed together into the same group.
  • To move a directory into a group that is closer to its accessors.

Page 74: MegaStore and Spanner

Example

Page 78: MegaStore and Spanner

True Time and Consistency

Page 79: MegaStore and Spanner

Key Innovation

I Spanner knows what time it is.

Page 80: MegaStore and Spanner

Time Synchronization (1/2)

I Is synchronizing time at the global scale possible?

I Synchronizing time within and between datacenters is extremely hard and uncertain.

I Serialization of requests is impossible at global scale.

Page 83: MegaStore and Spanner

Time Synchronization (2/2)

I Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks).

Page 84: MegaStore and Spanner

True Time API

I TT.now() returns a TTinterval that is guaranteed to contain the absolute time at which TT.now() was invoked.
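
A minimal sketch of what such an API could look like, assuming the uncertainty is a fixed bound epsilon around a local clock reading. The class names, the epsilon value, and the injectable clock are illustrative assumptions; TT.before() is not on this slide but mirrors TT.after(), which the read-write transaction slides use.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # lower bound on absolute time, in seconds
    latest: float    # upper bound on absolute time, in seconds


class TrueTime:
    """Toy TrueTime: a local clock reading widened by an assumed error bound."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty bound (hypothetical value)
        self.clock = clock      # injectable clock, so the sketch can be tested

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived yet.
        return self.now().latest < t
```

A timestamp that falls inside the current interval is neither "after" nor "before": the API refuses to order events it cannot order with certainty.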


Page 85: MegaStore and Spanner

How Is TrueTime Implemented? (1/2)


Page 86: MegaStore and Spanner

How Is TrueTime Implemented? (2/2)

I Daemon polls a variety of masters and reaches a consensus about the correct timestamp:
• Masters chosen from nearby datacenters
• Masters from farther datacenters
• Armageddon masters
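
The slide does not say how the daemon reconciles the answers. The Spanner paper mentions a variant of Marzullo's algorithm for rejecting faulty masters; the sketch below is the textbook version of that algorithm, not Spanner's actual code. Each master is assumed to report an interval (lo, hi) that should contain the true time, and the function returns the region agreed on by the largest number of masters.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the region covered by the most source intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))  # an interval starts here
        events.append((hi, +1))  # an interval ends here
    events.sort()  # at equal offsets, starts (-1) sort before ends (+1)

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(events):
        count -= kind  # a start raises the overlap count, an end lowers it
        if count > best:
            best = count
            best_lo = offset
            best_hi = events[i + 1][0]  # region lasts until the next event
    return best, best_lo, best_hi
```

For example, with reports (8, 12), (11, 13), and (14, 15), the first two overlap on [11, 12] and the third is outvoted.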


Page 87: MegaStore and Spanner

External Consistency (1/2)

I Jerry unfriends Tom so that he can write a controversial comment.

I If the serial order places the comment before the unfriend, Jerry will be in trouble!


Page 89: MegaStore and Spanner

External Consistency (2/2)

I External consistency: formally, if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial order.
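
The definition can be turned into a small checker. This is only an illustration: the Txn record, its field names, and the example values are hypothetical, and the serial order is assumed to be given by the commit timestamps.

```python
from itertools import permutations
from typing import NamedTuple


class Txn(NamedTuple):
    name: str
    start: float      # wall-clock time at which the transaction was initiated
    commit: float     # wall-clock time at which its commit completed
    commit_ts: float  # timestamp that fixes its place in the serial order


def violations(txns):
    """Pairs (T1, T2) where T1 committed before T2 started but is not ordered first."""
    return [
        (a.name, b.name)
        for a, b in permutations(txns, 2)
        if a.commit < b.start and not a.commit_ts < b.commit_ts
    ]
```

In the unfriend example, a history in which the comment receives a smaller timestamp than the unfriend is reported as a violation.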


Page 90: MegaStore and Spanner

Snapshot Reads

I Read in the past without locking.

I Client can specify a timestamp for the read, or an upper bound on the timestamp.

I Each replica tracks a value called safe time, t_safe, which is the maximum timestamp at which the replica is up-to-date.

I A replica can satisfy a read at any t ≤ t_safe.
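
A minimal sketch of a multi-version replica that serves such reads. It assumes writes are applied in timestamp order and that safe time simply advances with each applied write, which is a simplification of how Spanner derives it; the class and method names are illustrative.

```python
import bisect


class Replica:
    """Toy multi-version replica that serves lock-free reads up to t_safe."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), sorted by commit_ts
        self.t_safe = 0     # highest timestamp at which this replica is up to date

    def apply(self, key, commit_ts, value):
        # Assumption: writes arrive in increasing timestamp order.
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.t_safe = max(self.t_safe, commit_ts)

    def snapshot_read(self, key, t):
        if t > self.t_safe:
            raise ValueError("replica is not yet up to date at t")
        history = self.versions.get(key, [])
        timestamps = [ts for ts, _ in history]
        i = bisect.bisect_right(timestamps, t)  # versions with commit_ts <= t
        return history[i - 1][1] if i else None
```

Because old versions are kept, a read at timestamp t never has to block a concurrent writer; it only has to wait (here: fail) if the replica has not caught up to t.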


Page 94: MegaStore and Spanner

Read-only Transactions

I Assign a timestamp s_read and do a snapshot read at s_read.

I s_read = TT.now().latest()

I This guarantees external consistency.


Page 95: MegaStore and Spanner

Read-Write Transactions (1/3)

I Leader must only assign timestamps within the interval of its leader lease.

I Timestamps must be assigned in monotonically increasing order.

I If transaction T1 commits before T2 starts, T2’s commit timestamp must be greater than T1’s commit timestamp.
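
The first two rules can be sketched as a small timestamp assigner. This is an assumption-laden illustration: timestamps are integers, the lease is a closed interval, and the class and method names are hypothetical.

```python
class LeaderTimestamps:
    """Toy timestamp assigner for a leader holding the lease [lease_start, lease_end]."""

    def __init__(self, lease_start, lease_end):
        self.lease_start = lease_start
        self.lease_end = lease_end
        self.last = lease_start - 1  # last timestamp handed out so far

    def assign(self, proposed):
        # Monotonicity: never hand out a timestamp at or below the previous one.
        ts = max(proposed, self.last + 1)
        # Lease rule: refuse timestamps that fall outside the lease interval.
        if not (self.lease_start <= ts <= self.lease_end):
            raise RuntimeError("timestamp falls outside the leader lease")
        self.last = ts
        return ts
```

Since leases of successive leaders do not overlap, monotonicity within each lease extends to monotonicity across leaders.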


Page 98: MegaStore and Spanner

Read-Write Transactions (2/3)

I Clients buffer writes.

I Client chooses a coordinator group that initiates two-phase commit.

I Each non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through Paxos, and notifies the coordinator.


Page 99: MegaStore and Spanner

Read-Write Transactions (3/3)

I The coordinator assigns a commit timestamp s_i no less than all prepare timestamps and TT.now().latest().

I The coordinator ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. This is done by commit wait (wait until absolute time has passed s_i before committing).
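
A sketch of timestamp selection and commit wait, using a simulated clock so that it runs deterministically. SimTrueTime, its fixed epsilon, and the tick size are illustrative assumptions; a real coordinator would sleep rather than advance a counter.

```python
class SimTrueTime:
    """Simulated TrueTime with a manually advanced clock (illustrative only)."""

    def __init__(self, start=0.0, epsilon=2.0):
        self.t = start
        self.epsilon = epsilon  # assumed uncertainty bound

    def latest(self):
        return self.t + self.epsilon

    def after(self, ts):
        # True only once ts has definitely passed.
        return self.t - self.epsilon > ts

    def advance(self, dt):
        self.t += dt


def commit(tt, prepare_timestamps, tick=1.0):
    # Commit timestamp: no less than every prepare timestamp and TT.now().latest().
    s = max(list(prepare_timestamps) + [tt.latest()])
    waited = 0.0
    # Commit wait: hold the result back until s has definitely passed.
    while not tt.after(s):
        tt.advance(tick)  # stands in for sleeping in a real system
        waited += tick
    return s, waited
```

When the commit timestamp is taken from TT.now().latest(), the wait is roughly twice the uncertainty bound, which is why keeping the uncertainty small matters for write latency.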


Page 100: MegaStore and Spanner

Summary


Page 101: MegaStore and Spanner

Summary

I Megastore

I Entity Groups (EG)

I Within an EG: ACID semantics, using Paxos

I Across EGs: using queues or two-phase commit


Page 102: MegaStore and Spanner

Summary

I Spanner

I Replica consistency: using the Paxos protocol

I Concurrency control: using two-phase locking

I Transaction coordination: using two-phase commit

I Timestamps for transactions and data items


Page 103: MegaStore and Spanner

Questions?
