Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database
Transcript of Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database
Internet-scale Distributed Systems
Google Spanner: a Synchronously-Replicated, Globally-Distributed, Multi-Version Database
22.01.2013 Maciej Jozwiak Page 1
Presented by: Maciej Jozwiak
Agenda
• Problem description
• Overview of available solutions
• Globally-distributed database
• Architecture
• How is data replicated?
• Data model
• TrueTime API
• Transactions
• Summary
Problem – Need for Scalable MySQL
• Google's advertising backend
  – Based on MySQL
    • Relations
    • Query language
  – Manually sharded
    • Resharding is very costly
  – Global distribution
SHARDING:
Sharding is another name for "horizontal partitioning" of a database. Rows of a database table are held separately and form partitions, each of which can be located on a separate database server or in a separate physical location.
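A minimal sketch of what such horizontal partitioning looks like, assuming a made-up three-server layout (the shard names and routing function are illustrative, not Google's):

```python
# Route each row to one of several database servers by hashing its key.
import hashlib

SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(key: str) -> str:
    # Stable hash, so a given key always maps to the same shard.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("customer:42"))
```

Note that adding a fourth server changes `digest % len(SHARDS)` for most keys, forcing mass data movement; that is one reason resharding is so costly.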
Overview of Available Solutions

Google Bigtable:
• Scalability
• Throughput
• Performance
• Eventually-consistent replication support across data-centers

Google Megastore:
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Performance (relatively poor write throughput)
• Lack of query language
Bridging the gap between Megastore and Bigtable
Solution: Google Spanner
• Removes the need to manually partition data
• Synchronous replication and automatic failover
• Strong transactional semantics
• SQL-based query language
• Semi-relational, schematized tables
Globally-Distributed Database
Future scale:
• one million to 10 million servers
• 100s to 1000s of locations around the world
• 10^13 directories
• 10^18 bytes of storage
Cross-datacenter replicated data management:
• high availability
• minimized latency of data reads and writes
• replication configuration dynamically controlled at a fine grain by applications
Spanner Deployment – Universe
• Universe master: status + interactive debugging
• Placement driver: moves data across zones automatically
How Is Data Replicated?
Paxos: a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
Spanserver software stack
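The spanserver stack replicates data through Paxos. As an illustration only (not Spanner's implementation), a single-decree Paxos round with three failure-free acceptors can be sketched as:

```python
# Minimal single-decree Paxos: one proposer, three acceptors, no failures.

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest prepare number promised
        self.accepted = None   # (number, value) of last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered prepare was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1a: send prepare(n); need promises from a majority.
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) <= len(acceptors) // 2:
        return None
    # If some acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: send accept(n, value); the value is chosen once a majority accepts.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "write: x=1"))  # consensus reached on the value
```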
Replication Configuration
• Replication configurations for data can be dynamically controlled at a fine grain by applications
• Applications can specify constraints to control:
  – which datacenters contain which data
  – how far data is from users (to control read latency)
  – how far replicas are from each other (to control write latency)
  – how many replicas are maintained (to control durability, availability, and read performance)
    • e.g. North America: 5 replicas; Europe: 2 replicas
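The constraints above could be captured in a policy object roughly like the following (a hypothetical sketch; the class and field names are illustrative, not Spanner's API):

```python
# Hypothetical application-specified replication policy mirroring the
# constraints listed above.
from dataclasses import dataclass, field

@dataclass
class ReplicationPolicy:
    datacenters: dict = field(default_factory=dict)  # datacenter -> replica count
    max_read_latency_ms: int = 0    # bound on distance from users
    max_write_latency_ms: int = 0   # bound on distance between replicas

    def total_replicas(self) -> int:
        return sum(self.datacenters.values())

# The slide's example: 5 replicas in North America, 2 in Europe.
policy = ReplicationPolicy(
    datacenters={"us-east": 3, "us-west": 2, "eu-west": 2},
    max_read_latency_ms=50,
    max_write_latency_ms=150,
)
print(policy.total_replicas())  # 7
```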
Hierarchical Data Model
• Universe (Spanner deployment)
  – Database
    • Tables – rows and columns
      – Must have an ordered set of one or more primary-key columns
      – The primary key uniquely identifies each row
    • Hierarchies of tables
      – Tables must be partitioned by the client into one or more hierarchies of tables (INTERLEAVE IN)
      – The table at the top of a hierarchy is the directory table
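The Spanner paper's photo-metadata example expresses such a hierarchy with schema declarations along these lines (adapted from the paper; syntax abbreviated):

```sql
CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;
```

Here Users is the directory table; each Albums row is interleaved under the Users row sharing its uid prefix.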
Storing Photo Metadata

(figure series: the example photo-metadata schema built up step by step, showing the directory table at the top of the hierarchy and the directories formed beneath it)
Albums(2,1) – the row from the Albums table for user_id 2, album_id 1. Interleaving is important because it allows clients to describe the locality relationships that are necessary for good performance in a sharded, distributed database.
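A small sketch of the resulting key order, with illustrative tuples standing in for primary keys: Albums rows sort immediately after their parent Users row, so one user's data stays physically contiguous.

```python
# Rows of a directory table (Users) and its interleaved child (Albums),
# keyed by (uid) and (uid, aid) respectively. Tuples are illustrative.
rows = [
    ("Users", (1,)), ("Users", (2,)),
    ("Albums", (1, 1)), ("Albums", (1, 2)), ("Albums", (2, 1)),
]
# Interleaved order: sort by key-prefix; a parent's empty suffix sorts
# before any child's suffix, so parents precede their children.
interleaved = sorted(rows, key=lambda r: (r[1][0], r[1][1:]))
for table, key in interleaved:
    print(table, key)
# → Users (1,), Albums (1, 1), Albums (1, 2), Users (2,), Albums (2, 1)
```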
Key Innovation
Spanner knows what time it is
Is Synchronizing Time at the Global Scale Possible?
Distributed systems dogma:
• synchronizing time within and between datacenters is extremely hard and uncertain
• serialization of requests is impossible at global scale
Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks)
TrueTime API
A novel API distributing a globally synchronized "proper time".
Method        Returns
TT.now()      TTinterval: [earliest, latest]
TT.after(t)   true if t has definitely passed
TT.before(t)  true if t has definitely not arrived

The TTinterval is guaranteed to contain the absolute time at which TT.now() was invoked.
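A toy model of these three methods, assuming a fixed 4 ms uncertainty (the real ε varies over time; this is purely illustrative):

```python
# Toy TrueTime: now() returns an interval guaranteed to contain the true
# absolute time; after/before only answer when the answer is definite.
import time

class TrueTime:
    def __init__(self, epsilon=0.004):   # illustrative 4 ms uncertainty
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)  # [earliest, latest]

    def after(self, t):
        # True only if t has definitely passed.
        earliest, _ = self.now()
        return t < earliest

    def before(self, t):
        # True only if t has definitely not arrived.
        _, latest = self.now()
        return t > latest

tt = TrueTime()
earliest, latest = tt.now()
print(tt.after(earliest - 1.0))   # True: one second in the past
print(tt.before(latest + 1.0))    # True: one second in the future
```

Within ±ε of the current moment, both `after` and `before` return False: the uncertainty interval is exactly the region where TrueTime refuses to commit to an ordering.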
How Is TrueTime Implemented?
• a set of time master machines per datacenter
  – the majority of masters have GPS receivers with dedicated antennas
  – the remaining masters (called Armageddon masters) are equipped with atomic clocks
• a timeslave daemon per machine
Time Reference Vulnerabilities
Two forms of time reference – two failure modes (uncorrelated with each other):
• GPS:
  – antenna and receiver failures
  – local radio interference
  – correlated failures (e.g. spoofing)
  – GPS system outages
• Atomic clock:
  – can drift significantly due to frequency error
How Does the Daemon Work?
The daemon polls a variety of masters and reaches a consensus about the correct timestamp. The masters polled include:
• masters chosen from nearby datacenters
• masters from farther datacenters
• Armageddon masters
The daemon's poll interval is 30 seconds. Between synchronizations, the daemon advertises a slowly increasing time uncertainty ε.
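The advertised uncertainty between synchronizations can be sketched as a base uncertainty plus an assumed worst-case local clock drift; the 200 μs/s drift bound comes from the paper's assumptions, while the 1 ms base value here is merely illustrative.

```python
# Epsilon grows linearly between polls: base uncertainty right after a
# synchronization, plus worst-case drift accumulated since then.
POLL_INTERVAL_S = 30
DRIFT_RATE = 200e-6        # assumed worst-case drift: 200 us per second
BASE_EPSILON_S = 0.001     # illustrative uncertainty right after a poll

def epsilon(seconds_since_sync: float) -> float:
    return BASE_EPSILON_S + DRIFT_RATE * seconds_since_sync

print(epsilon(0))                # ~0.001 s just after a poll
print(epsilon(POLL_INTERVAL_S))  # ~0.007 s just before the next poll
```

This gives ε a sawtooth shape over time: it is reset at each 30-second poll and climbs until the next one.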
Transactions in Spanner
• Globally meaningful commit timestamps for distributed transactions
  – If A happens-before B, then timestamp(A) < timestamp(B)
  – A happens-before B if its effects become visible before B begins, in real time
    • "Visible" means acked to the client or updates applied to some replica
    • "Begins" means the first request arrived at a Spanner server
• Two-phase commit
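This happens-before guarantee is enforced by "commit wait": a transaction takes its timestamp s = TT.now().latest and delays making its effects visible until TT.after(s) holds, so any transaction that begins afterwards must receive a larger timestamp. A simplified, illustrative model (not Spanner's code):

```python
# Commit wait in miniature: timestamps drawn from an uncertain clock
# still come out in real-time order, because effects stay invisible
# until the timestamp has definitely passed.
import time

EPSILON = 0.004   # illustrative TrueTime uncertainty, 4 ms

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)   # [earliest, latest]

def commit(apply_writes):
    s = tt_now()[1]                 # commit timestamp s = TT.now().latest
    while tt_now()[0] <= s:         # commit wait: until TT.after(s)
        time.sleep(EPSILON / 4)
    apply_writes()                  # only now do effects become visible
    return s

s1 = commit(lambda: None)
s2 = commit(lambda: None)   # begins after s1's effects are visible...
assert s1 < s2              # ...so its timestamp is strictly larger
```

The wait lasts roughly 2ε per transaction, which is why keeping ε small (GPS plus atomic clocks) matters for write latency.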
What About Performance?
"We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."

Two-phase commit can raise availability and performance issues.
Summary
• Externally consistent global write-transactions with synchronous replication.
• Schematized, semi-relational data model.
• SQL-like query interface.
• Auto-sharding, auto-rebalancing, automatic failure response.
• Exposes control of data replication and placement to user/application.