Dissecting Scalable Database Architectures

Presentation by Doug Judd, co-founder of Hypertable Inc, at Groupon office in Palo Alto, CA on November 15th, 2012.

Dissecting Scalable Database Architectures
Doug Judd, CEO, Hypertable Inc.

Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends

Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable

Auto-Sharding

Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB

Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store” – Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability

Consistent Hashing
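
The consistent-hashing figure from the original slides is not reproduced here. As a substitute, below is a minimal sketch (not Dynamo's actual implementation) of how a consistent-hash ring assigns keys to nodes; the node names and virtual-node count are made up for illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the hash ring (here, a 128-bit MD5 space)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each node owns the arc that ends at its point."""
    def __init__(self, nodes, vnodes_per_node=8):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the first node point."""
        h = ring_hash(key)
        idx = bisect.bisect(self.points, (h, chr(0x10FFFF)))  # first point past h
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1234"))   # e.g. 'node-b'
```

Adding or removing a node only reassigns the keys on the arcs adjacent to that node's points, which is why Dynamo-style systems use this scheme for incremental scaling.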

Eventual Consistency

Vector Clocks
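
The vector-clock diagram from the slides is also missing; the following is a minimal sketch of vector-clock merge and comparison, the mechanism Dynamo uses to detect conflicting versions. The replica names are hypothetical.

```python
def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two vector clocks."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is causally at or after clock `b`."""
    return all(a.get(r, 0) >= n for r, n in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Concurrent (conflicting) versions: neither descends from the other."""
    return not descends(a, b) and not descends(b, a)

v1 = {"replica-x": 2, "replica-y": 1}
v2 = {"replica-x": 1, "replica-y": 2}
print(conflict(v1, v2))   # True: the client must reconcile these versions
print(merge(v1, v2))      # {'replica-x': 2, 'replica-y': 2}
```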

Dynamo-based Systems
• Cassandra
• DynamoDB
• Riak
• Voldemort

Bigtable
• “Bigtable: A Distributed Storage System for Structured Data” – Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication

Google Architecture

Google File System

Table: Growth Process

Scaling (part 1)

Scaling (part 2)

Scaling (part 3)

System overview

Database Model

• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key:
  • Row (string)
  • Column Family
  • Column Qualifier (string)
  • Timestamp

Table: Visual Representation

Table: Actual Representation
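
The two table-representation figures are not reproduced here. As a rough substitute, the sketch below shows how a sparse two-dimensional table can be flattened into the sorted key/value form a Bigtable-style system actually stores; the rows, column names, and values are made up.

```python
# Each cell becomes one key/value pair; the key is the 4-part
# (row, column family, column qualifier, timestamp) tuple.
cells = [
    # (row,          family,    qualifier, timestamp)  -> value
    (("com.example", "anchor",  "href",    1668124800), "http://example.com/"),
    (("com.example", "content", "",        1668124800), "<html>...</html>"),
    (("org.example", "content", "",        1668124700), "<html>...</html>"),
]

# The "actual representation" is simply these pairs kept in sorted key order,
# with timestamps ordered newest-first within a cell.
table = sorted(cells, key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[0][3]))
for key, value in table:
    print(key, "->", value[:20])
```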

Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones’ complement
• Simple byte-wise comparison
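
A minimal sketch of the encoding idea described above: storing the timestamp big-endian and ones' complemented makes a plain byte-wise comparison sort newer versions of a cell first. The field widths and layout here are illustrative, not the exact Hypertable/Bigtable key format.

```python
import struct

def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key so that byte-wise ordering is the desired order."""
    # Ones' complement of the timestamp, stored big-endian: larger (newer)
    # timestamps produce byte-wise *smaller* keys, so newest versions sort first.
    ts_complement = (~timestamp) & 0xFFFFFFFFFFFFFFFF
    return (row + b"\x00" +
            bytes([family]) +            # column family fits in a single byte
            qualifier + b"\x00" +
            struct.pack(">Q", ts_complement))

k_new = encode_key(b"com.example", 2, b"href", 1700000000)
k_old = encode_key(b"com.example", 2, b"href", 1600000000)
assert k_new < k_old   # byte-wise comparison puts the newer cell version first
```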

Log Structured Merge Tree

Range Server: CellStore
• Sequence of 65K blocks of compressed key/value pairs

Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if a key is definitively not present
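
A minimal sketch (not Hypertable's implementation) of the Bloom-filter idea: membership tests can return false positives but never false negatives, so a negative answer lets the range server skip the disk read for that CellStore entirely. The sizes and hash count are arbitrary.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash functions setting bits in an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means 'definitely not present'; True means 'possibly present'."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"com.example:anchor:href")
print(bf.might_contain(b"com.example:anchor:href"))  # True
print(bf.might_contain(b"org.missing:content:"))     # Usually False -> skip the CellStore
```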

Request Routing

Bigtable-based Systems
• Accumulo
• HBase
• Hypertable

Next-generation Architectures

• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)

PNUTS

• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  • Hashed tables implemented via a proprietary disk-based hash
  • Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!

PNUTS System Architecture

Record-level Mastering

• Provides per-record timeline consistency
• Master is adaptively changed to suit the workload
• The master region’s name is stored in two bytes associated with each record

PNUTS API

• Read-any
• Read-critical(required_version)
• Read-latest
• Write
• Test-and-set-write(required_version)
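
The call names above come from the PNUTS paper. The code below is a purely hypothetical client sketch, not the real PNUTS interface, meant only to illustrate how required_version-style calls express per-record timeline consistency.

```python
class TimelineRecordStore:
    """Hypothetical store illustrating timeline-consistent read/write options."""
    def __init__(self):
        self._data = {}    # key -> (version, value)

    def write(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def read_latest(self, key):
        return self._data[key]                      # always the newest version

    def read_critical(self, key, required_version):
        version, value = self._data[key]
        # A real system might forward to the record's master if this replica is stale.
        if version < required_version:
            raise RuntimeError("replica is behind the required version")
        return version, value

    def test_and_set_write(self, key, required_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != required_version:
            return False                            # someone wrote in between
        self._data[key] = (version + 1, value)
        return True

store = TimelineRecordStore()
v = store.write("user:42", {"city": "Palo Alto"})
print(store.read_critical("user:42", required_version=v))
```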

Spanner

• Globally distributed database (cross-datacenter replication)
• Synchronously Replicated
• Externally-consistent distributed transactions
• Globally distributed transaction management
• SQL-based query language

Spanner Server Organization

Spanserver

• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string (see the sketch below)
• Single Paxos state machine implemented on top of each tablet
• A tablet may contain multiple directories:
  • Set of contiguous keys that share a common prefix
  • Unit of data placement
  • Can be moved between Tablets for performance reasons
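
A minimal sketch of the tablet data model described above, with made-up keys and values: a read at a timestamp returns the latest value written at or before that time, which is what makes snapshot reads possible.

```python
import bisect

class Tablet:
    """Toy model of the Spanner tablet mapping (key, timestamp) -> value."""
    def __init__(self):
        self._versions = {}   # key -> sorted list of (timestamp, value)

    def write(self, key: str, timestamp: int, value: str):
        bisect.insort(self._versions.setdefault(key, []), (timestamp, value))

    def read(self, key: str, timestamp: int):
        """Return the value with the largest timestamp <= the read timestamp."""
        versions = self._versions.get(key, [])
        idx = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[idx - 1][1] if idx else None

t = Tablet()
t.write("account:7", timestamp=100, value="balance=10")
t.write("account:7", timestamp=200, value="balance=25")
print(t.read("account:7", timestamp=150))   # 'balance=10' (snapshot at t=150)
```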

TrueTime

• Universal Clock
• Set of time master servers per datacenter
  • GPS clocks via GPS receivers with dedicated antennas
  • Atomic clocks
• Time daemon runs on every machine
• TrueTime API:
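
The slide ends at "TrueTime API:", presumably followed by a figure. As described in the Spanner paper, TT.now() returns an interval [earliest, latest] guaranteed to contain the true time, with TT.after(t) and TT.before(t) as convenience checks. The sketch below mimics that interface in Python with a made-up uncertainty bound.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Sketch of the TrueTime API: now() returns a bounded interval, not a point."""
    def __init__(self, uncertainty_s=0.007):   # assume ~7 ms of clock uncertainty
        self.eps = uncertainty_s

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.eps, t + self.eps)

    def after(self, t: float) -> bool:
        """True only if `t` has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if `t` has definitely not arrived yet."""
        return self.now().latest < t

tt = TrueTime()
commit_ts = tt.now().latest
while not tt.after(commit_ts):    # Spanner's "commit wait": wait out the uncertainty
    time.sleep(0.001)
```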

Spanner Software Stack

Externally-consistent Operations
• Read-Write Transaction
• Read-Only Transaction
• Snapshot Read (client-provided timestamp)
• Snapshot Read (client-provided bound)
• Schema Change Transaction

Dremel

• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds

Columnar Storage Format

• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
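
The sketch below illustrates only the basic idea of shredding a list of nested records into one column per leaf field. Dremel's actual format additionally records repetition and definition levels so the nesting structure can be reconstructed losslessly; this toy version omits them, and the records are invented.

```python
from collections import defaultdict

def shred(records):
    """Flatten nested records into one column (list of values) per leaf field path."""
    columns = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, f"{path}.{name}" if path else name)
        elif isinstance(node, list):
            for child in node:
                walk(child, path)
        else:
            columns[path].append(node)

    for record in records:
        walk(record, "")
    return dict(columns)

records = [
    {"country": "US", "item": [{"amount": 3}, {"amount": 7}]},
    {"country": "DE", "item": [{"amount": 5}]},
]
print(shred(records))
# {'country': ['US', 'DE'], 'item.amount': [3, 7, 5]}
```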

Multi-level Execution Trees

• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets rewritten as it passes down the execution tree
• On the way up, intermediate servers perform a parallel aggregation of partial results
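
A minimal sketch, with a made-up server fan-out, of the one-pass aggregation pattern described above: leaf servers compute partial sums over their shards and intermediate servers merge the partial results on the way back up the tree.

```python
def leaf_aggregate(shard):
    """Leaf server: partial aggregation over its local rows."""
    total = sum(row["amount"] for row in shard)
    return {"sum": total, "count": len(shard)}

def merge(partials):
    """Intermediate/root server: combine partial results in one pass."""
    return {
        "sum": sum(p["sum"] for p in partials),
        "count": sum(p["count"] for p in partials),
    }

# Hypothetical fan-out: the root fans the query out to two intermediate
# servers, each of which fans it out to its own leaf servers' shards.
shards = [
    [{"amount": 3}, {"amount": 7}],   # leaf 1
    [{"amount": 5}],                  # leaf 2
    [{"amount": 2}, {"amount": 8}],   # leaf 3
]
intermediate = [merge([leaf_aggregate(s) for s in shards[:2]]),
                merge([leaf_aggregate(s) for s in shards[2:]])]
result = merge(intermediate)
print(result["sum"] / result["count"])   # average amount computed in one pass
```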

Performance

Example Queries

• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1

• SELECT country, SUM(item.amount) FROM T2 GROUP BY country

• SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain

• SELECT COUNT(DISTINCT a) FROM T5

Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking

Flash Memory Rated Lifetime(P/E Cycles)

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Flash Memory Average BER at Rated Lifetime

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Disk: Areal Density Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Disk: Maximum SustainedBandwidth Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Time Required to Sequentially Fill a SATA Drive

Average Seek Time

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Average Rotational Latency

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Time Required to Randomly Read a SATA Drive
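
The fill-time and random-read charts themselves are not reproduced here. The back-of-the-envelope calculation below shows how curves like these are derived, using assumed drive parameters (not the figures' actual data): sequential fill is limited by sustained bandwidth, while a full random read pays a seek plus rotational latency per small request.

```python
# Assumed drive parameters, for illustration only (not the figures' data).
capacity_bytes   = 3 * 10**12        # 3 TB SATA drive
seq_bandwidth    = 150 * 10**6       # 150 MB/s sustained sequential bandwidth
avg_seek_s       = 0.0085            # 8.5 ms average seek
avg_rotation_s   = 0.0042            # ~4.2 ms average rotational latency (7200 RPM / 2)
random_read_size = 64 * 1024         # 64 KB per random read

# Sequentially filling the drive is limited by sustained bandwidth.
fill_hours = capacity_bytes / seq_bandwidth / 3600
print(f"Sequential fill: {fill_hours:.1f} hours")          # ~5.6 hours

# Randomly reading the whole drive pays a seek + rotation per small read.
reads = capacity_bytes / random_read_size
per_read_s = avg_seek_s + avg_rotation_s + random_read_size / seq_bandwidth
read_days = reads * per_read_s / 86400
print(f"Random read of full drive: {read_days:.0f} days")  # ~7 days with these assumptions
```

The gap between hours and days in this small example is the point of the slides: capacity grows far faster than seek time shrinks, so random-access workloads fall further behind sequential ones every year.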

Ethernet
• 10GbE
  • Starting to replace 1GbE for server NICs
  • De facto network port for new servers in 2014
• 40GbE
  • Data center core & aggregation
  • Top-of-rack server aggregation
• 100GbE
  • Service Provider core and aggregation
  • Metro and large Campus core
  • Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE are implemented using either 4 or 10 parallel 10GbE “lanes”

10GbE Adoption Curve (?)

Source: CREHAN RESEARCH Inc. © Copyright 2012

The End
Thank you!