Dissecting Scalable Database Architectures

Presentation by Doug Judd, co-founder of Hypertable Inc, at Groupon office in Palo Alto, CA on November 15th, 2012.

Dissecting Scalable Database Architectures
Doug Judd, CEO, Hypertable Inc.

Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends

Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable

Auto-Sharding

Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB

Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store” – Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability

Consistent Hashing
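
The consistent-hashing figure from the original slides is not reproduced here. As a substitute, below is a minimal sketch (not Dynamo's actual implementation) of how a consistent-hash ring assigns keys to nodes; the node names and virtual-node count are made up for illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the hash ring (here, a 128-bit MD5 space)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each node owns the arc that ends at its point."""
    def __init__(self, nodes, vnodes_per_node=8):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the first node point."""
        h = ring_hash(key)
        idx = bisect.bisect(self.points, (h, chr(0x10FFFF)))  # first point past h
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1234"))   # e.g. 'node-b'
```

Adding or removing a node only reassigns the keys on the arcs adjacent to that node's points, which is why Dynamo-style systems use this scheme for incremental scaling.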

Eventual Consistency

Vector Clocks
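
The vector-clock diagram from the slides is also missing; the following is a minimal sketch of vector-clock merge and comparison, the mechanism Dynamo uses to detect conflicting versions. The replica names are hypothetical.

```python
def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two vector clocks."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is causally at or after clock `b`."""
    return all(a.get(r, 0) >= n for r, n in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Concurrent (conflicting) versions: neither descends from the other."""
    return not descends(a, b) and not descends(b, a)

v1 = {"replica-x": 2, "replica-y": 1}
v2 = {"replica-x": 1, "replica-y": 2}
print(conflict(v1, v2))   # True: the client must reconcile these versions
print(merge(v1, v2))      # {'replica-x': 2, 'replica-y': 2}
```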

Dynamo-based Systems
• Cassandra
• DynamoDB
• Riak
• Voldemort

Bigtable
• “Bigtable: A Distributed Storage System for Structured Data” – Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication

Google Architecture

Google File System

Table: Growth Process

Scaling (part 1)

Scaling (part 2)

Scaling (part 3)

System overview

Database Model

• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key:
  • Row (string)
  • Column Family
  • Column Qualifier (string)
  • Timestamp

Table: Visual Representation

Table: Actual Representation
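
The two table-representation figures are not reproduced here. As a rough substitute, the sketch below shows how a sparse two-dimensional table can be flattened into the sorted key/value form a Bigtable-style system actually stores; the rows, column names, and values are made up.

```python
# Each cell becomes one key/value pair; the key is the 4-part
# (row, column family, column qualifier, timestamp) tuple.
cells = [
    # (row,          family,    qualifier, timestamp)  -> value
    (("com.example", "anchor",  "href",    1668124800), "http://example.com/"),
    (("com.example", "content", "",        1668124800), "<html>...</html>"),
    (("org.example", "content", "",        1668124700), "<html>...</html>"),
]

# The "actual representation" is simply these pairs kept in sorted key order,
# with timestamps ordered newest-first within a cell.
table = sorted(cells, key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[0][3]))
for key, value in table:
    print(key, "->", value[:20])
```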

Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones’ complement
• Simple byte-wise comparison
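
A minimal sketch of the encoding idea described above: storing the timestamp big-endian and ones' complemented makes a plain byte-wise comparison sort newer versions of a cell first. The field widths and layout here are illustrative, not the exact Hypertable/Bigtable key format.

```python
import struct

def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key so that byte-wise ordering is the desired order."""
    # Ones' complement of the timestamp, stored big-endian: larger (newer)
    # timestamps produce byte-wise *smaller* keys, so newest versions sort first.
    ts_complement = (~timestamp) & 0xFFFFFFFFFFFFFFFF
    return (row + b"\x00" +
            bytes([family]) +            # column family fits in a single byte
            qualifier + b"\x00" +
            struct.pack(">Q", ts_complement))

k_new = encode_key(b"com.example", 2, b"href", 1700000000)
k_old = encode_key(b"com.example", 2, b"href", 1600000000)
assert k_new < k_old   # byte-wise comparison puts the newer cell version first
```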

Log Structured Merge Tree

Range Server: CellStore
• Sequence of 65K blocks of compressed key/value pairs

Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if a key is definitively not present
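
A minimal sketch (not Hypertable's implementation) of the Bloom-filter idea: membership tests can return false positives but never false negatives, so a negative answer lets the range server skip the disk read for that CellStore entirely. The sizes and hash count are arbitrary.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash functions setting bits in an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means 'definitely not present'; True means 'possibly present'."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"com.example:anchor:href")
print(bf.might_contain(b"com.example:anchor:href"))  # True
print(bf.might_contain(b"org.missing:content:"))     # Usually False -> skip the CellStore
```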

Request Routing

Bigtable-based Systems
• Accumulo
• HBase
• Hypertable

Next-generation Architectures

• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)

PNUTS

• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  • Hashed tables implemented via a proprietary disk-based hash
  • Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!

PNUTS System Architecture

Record-level Mastering

• Provides per-record timeline consistency
• Master is adaptively changed to suit the workload
• The master region’s name is stored in two bytes associated with each record

PNUTS API

• Read-any
• Read-critical(required_version)
• Read-latest
• Write
• Test-and-set-write(required_version)
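
The call names above come from the PNUTS paper. The code below is a purely hypothetical client sketch, not the real PNUTS interface, meant only to illustrate how required_version-style calls express per-record timeline consistency.

```python
class TimelineRecordStore:
    """Hypothetical store illustrating timeline-consistent read/write options."""
    def __init__(self):
        self._data = {}    # key -> (version, value)

    def write(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def read_latest(self, key):
        return self._data[key]                      # always the newest version

    def read_critical(self, key, required_version):
        version, value = self._data[key]
        # A real system might forward to the record's master if this replica is stale.
        if version < required_version:
            raise RuntimeError("replica is behind the required version")
        return version, value

    def test_and_set_write(self, key, required_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != required_version:
            return False                            # someone wrote in between
        self._data[key] = (version + 1, value)
        return True

store = TimelineRecordStore()
v = store.write("user:42", {"city": "Palo Alto"})
print(store.read_critical("user:42", required_version=v))
```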

Spanner

• Globally distributed database (cross-datacenter replication)
• Synchronously Replicated
• Externally-consistent distributed transactions
• Globally distributed transaction management
• SQL-based query language

Spanner Server Organization

Spanserver

• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string (see the sketch below)
• Single Paxos state machine implemented on top of each tablet
• A tablet may contain multiple directories:
  • Set of contiguous keys that share a common prefix
  • Unit of data placement
  • Can be moved between Tablets for performance reasons
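
A minimal sketch of the tablet data model described above, with made-up keys and values: a read at a timestamp returns the latest value written at or before that time, which is what makes snapshot reads possible.

```python
import bisect

class Tablet:
    """Toy model of the Spanner tablet mapping (key, timestamp) -> value."""
    def __init__(self):
        self._versions = {}   # key -> sorted list of (timestamp, value)

    def write(self, key: str, timestamp: int, value: str):
        bisect.insort(self._versions.setdefault(key, []), (timestamp, value))

    def read(self, key: str, timestamp: int):
        """Return the value with the largest timestamp <= the read timestamp."""
        versions = self._versions.get(key, [])
        idx = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[idx - 1][1] if idx else None

t = Tablet()
t.write("account:7", timestamp=100, value="balance=10")
t.write("account:7", timestamp=200, value="balance=25")
print(t.read("account:7", timestamp=150))   # 'balance=10' (snapshot at t=150)
```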

TrueTime

• Universal Clock
• Set of time master servers per datacenter
  • GPS clocks via GPS receivers with dedicated antennas
  • Atomic clocks
• Time daemon runs on every machine
• TrueTime API:
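
The slide ends at "TrueTime API:", presumably followed by a figure. As described in the Spanner paper, TT.now() returns an interval [earliest, latest] guaranteed to contain the true time, with TT.after(t) and TT.before(t) as convenience checks. The sketch below mimics that interface in Python with a made-up uncertainty bound.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Sketch of the TrueTime API: now() returns a bounded interval, not a point."""
    def __init__(self, uncertainty_s=0.007):   # assume ~7 ms of clock uncertainty
        self.eps = uncertainty_s

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.eps, t + self.eps)

    def after(self, t: float) -> bool:
        """True only if `t` has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if `t` has definitely not arrived yet."""
        return self.now().latest < t

tt = TrueTime()
commit_ts = tt.now().latest
while not tt.after(commit_ts):    # Spanner's "commit wait": wait out the uncertainty
    time.sleep(0.001)
```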

Spanner Software Stack

Externally-consistent Operations
• Read-Write Transaction
• Read-Only Transaction
• Snapshot Read (client-provided timestamp)
• Snapshot Read (client-provided bound)
• Schema Change Transaction

Dremel

• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds

Columnar Storage Format

• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
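
The sketch below illustrates only the basic idea of shredding a list of nested records into one column per leaf field. Dremel's actual format additionally records repetition and definition levels so the nesting structure can be reconstructed losslessly; this toy version omits them, and the records are invented.

```python
from collections import defaultdict

def shred(records):
    """Flatten nested records into one column (list of values) per leaf field path."""
    columns = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, f"{path}.{name}" if path else name)
        elif isinstance(node, list):
            for child in node:
                walk(child, path)
        else:
            columns[path].append(node)

    for record in records:
        walk(record, "")
    return dict(columns)

records = [
    {"country": "US", "item": [{"amount": 3}, {"amount": 7}]},
    {"country": "DE", "item": [{"amount": 5}]},
]
print(shred(records))
# {'country': ['US', 'DE'], 'item.amount': [3, 7, 5]}
```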

Multi-level Execution Trees

• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets rewritten as it passes down the execution tree
• On the way up, intermediate servers perform a parallel aggregation of partial results
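
A minimal sketch, with a made-up server fan-out, of the one-pass aggregation pattern described above: leaf servers compute partial sums over their shards and intermediate servers merge the partial results on the way back up the tree.

```python
def leaf_aggregate(shard):
    """Leaf server: partial aggregation over its local rows."""
    total = sum(row["amount"] for row in shard)
    return {"sum": total, "count": len(shard)}

def merge(partials):
    """Intermediate/root server: combine partial results in one pass."""
    return {
        "sum": sum(p["sum"] for p in partials),
        "count": sum(p["count"] for p in partials),
    }

# Hypothetical fan-out: the root fans the query out to two intermediate
# servers, each of which fans it out to its own leaf servers' shards.
shards = [
    [{"amount": 3}, {"amount": 7}],   # leaf 1
    [{"amount": 5}],                  # leaf 2
    [{"amount": 2}, {"amount": 8}],   # leaf 3
]
intermediate = [merge([leaf_aggregate(s) for s in shards[:2]]),
                merge([leaf_aggregate(s) for s in shards[2:]])]
result = merge(intermediate)
print(result["sum"] / result["count"])   # average amount computed in one pass
```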

Performance

Example Queries

• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1

• SELECT country, SUM(item.amount) FROM T2 GROUP BY country

• SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain

• SELECT COUNT(DISTINCT a) FROM T5

Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking

Flash Memory Rated Lifetime(P/E Cycles)

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Flash Memory Average BER at Rated Lifetime

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Disk: Areal Density Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Disk: Maximum SustainedBandwidth Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Time Required to Sequentially Fill a SATA Drive

Average Seek Time

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Average Rotational Latency

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Time Required to Randomly Read a SATA Drive
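
The fill-time and random-read charts themselves are not reproduced here. The back-of-the-envelope calculation below shows how curves like these are derived, using assumed drive parameters (not the figures' actual data): sequential fill is limited by sustained bandwidth, while a full random read pays a seek plus rotational latency per small request.

```python
# Assumed drive parameters, for illustration only (not the figures' data).
capacity_bytes   = 3 * 10**12        # 3 TB SATA drive
seq_bandwidth    = 150 * 10**6       # 150 MB/s sustained sequential bandwidth
avg_seek_s       = 0.0085            # 8.5 ms average seek
avg_rotation_s   = 0.0042            # ~4.2 ms average rotational latency (7200 RPM / 2)
random_read_size = 64 * 1024         # 64 KB per random read

# Sequentially filling the drive is limited by sustained bandwidth.
fill_hours = capacity_bytes / seq_bandwidth / 3600
print(f"Sequential fill: {fill_hours:.1f} hours")          # ~5.6 hours

# Randomly reading the whole drive pays a seek + rotation per small read.
reads = capacity_bytes / random_read_size
per_read_s = avg_seek_s + avg_rotation_s + random_read_size / seq_bandwidth
read_days = reads * per_read_s / 86400
print(f"Random read of full drive: {read_days:.0f} days")  # ~7 days with these assumptions
```

The gap between hours and days in this small example is the point of the slides: capacity grows far faster than seek time shrinks, so random-access workloads fall further behind sequential ones every year.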

Ethernet
• 10GbE
  • Starting to replace 1GbE for server NICs
  • De facto network port for new servers in 2014
• 40GbE
  • Data center core & aggregation
  • Top-of-rack server aggregation
• 100GbE
  • Service Provider core and aggregation
  • Metro and large Campus core
  • Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE are implemented using either 4 or 10 parallel 10GbE “lanes”

10GbE Adoption Curve (?)

Source: CREHAN RESEARCH Inc. © Copyright 2012

The End
Thank you!