OceanStore: An Infrastructure for Global-Scale Persistent Storage

John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao

A few slides have been borrowed from the authors’ presentations


Vision

• What is OceanStore?

• “a utility infrastructure to span the globe and provide continuous access to persistent information”

Source: Berkeley OceanStore Website

Vision

• What is OceanStore?

• “a utility infrastructure to span the globe and provide continuous access to persistent information”

• data

• all kinds of information

• desktop, laptop, palmtop

• cars, cellular phones, other devices

• futuristic: embedded in environment

Vision

• What is OceanStore?

• “a utility infrastructure to span the globe and provide continuous access to persistent information”

• persistence

• devices can be rebooted, lost, replaced

• reliable, durable data (“deep archival” will last forever)

• Automatic maintenance

Vision

• What is OceanStore?

• “a utility infrastructure to span the globe and provide continuous access to persistent information”

• connectivity

• even to tiniest devices, possibly intermittent

• variable bandwidth, latency

• availability

• uniform access, comparable to LAN-based networked storage

• fault-tolerant, DoS-tolerant

Vision

• What is OceanStore?

• “a utility infrastructure to span the globe and provide continuous access to persistent information”

• scale

• geographically distributed

• 10^10 users

• 10^14 files / objects

Questions about information:

• Where is persistent information stored?

• 20th-century tie between location and content outdated

• In world-scale system, locality is key

• How is it protected?

• Can disgruntled employee of ISP sell your secrets?

• Can’t trust anyone (how paranoid are you?)

• Can we make it indestructible?

• Want our data to survive “the big one”!

• Highly resistant to hackers (denial of service)

• Wide-scale disaster recovery

• Is it hard to manage?

• Worst failures are human-related

• Want automatic (introspective) diagnosis and repair

First Observation: Want Utility Infrastructure

• Mark Weiser from Xerox: Transparent computing is the ultimate goal. Computers should disappear into the background

• In the context of storage:

• Don’t want to worry about backup

• Don’t want to worry about obsolescence

• Need lots of resources to make data secure and highly available, BUT don’t want to own them

• Outsourcing of storage already becoming popular

• Pay monthly fee and your “data is out there”

• Service provided by confederation of companies

• Monthly fee paid to one service provider

• Companies buy and sell capacity from each other

Utility-based Infrastructure

[Figure: the utility as a confederation of providers; labels: Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore]

Target applications

Email

Group calendar, contacts

Distributed design tools

Computer Supported Cooperative Work

Digital libraries

Distributed/shared repositories

Assumptions

• Untrusted infrastructure

• A small number of servers may crash or leak information

• most of the servers are functioning correctly

• financially “responsible party” for the servers ensures integrity

• but only clients trusted with cleartext

• Nomadic data

• data divorced from location

• flows freely within the storage infrastructure

• promiscuous caching: “anywhere, anytime”

• location important for performance

• dynamic system tuning through introspection

System overview

• persistent object

• GUID: 160-bit SHA-1 hash

• secure identification – globally unique and unforgeable

• 2^80 unique objects before collisions (birthday paradox)

• floating object replicas: independent of location

• encrypted data

• read

• try fast probabilistic replica search (Bloom filter)

• fallback to slower deterministic search (Tapestry)

• write

• update with predicates [as in Bayou – what is Bayou?]

• creates new version
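The collision figure on this slide follows from the birthday bound. As a worked estimate only, assume SHA-1 outputs behave like uniform random 160-bit values (an idealization not stated on the slide), so there are N = 2^160 possible GUIDs and k objects have been created:

```latex
\[
  P(\text{collision among } k \text{ GUIDs})
  \;\approx\; 1 - e^{-k(k-1)/(2N)},
  \qquad N = 2^{160}.
\]
\[
  k = \sqrt{N} = 2^{80}
  \;\Longrightarrow\;
  P \approx 1 - e^{-1/2} \approx 0.39 .
\]
```

Under that assumption a collision only becomes likely once the number of objects approaches 2^80, far above the 10^14 objects the vision slides target.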

What is Bayou?

The Bayou System (Xerox PARC) is a platform of replicated, highly available, variable-consistency databases on which collaborative applications can be built.

It caters to portable devices with intermittent connections.

System overview

• application interface

• sessions: sequence of read/writes

• session guarantees [Bayou]

• consistency levels ranging from loose to ACID

• active and archival forms

• active: latest version, with update handle

• archive: erasure coded read-only version

• dynamic optimization

• object location

• degree of replication

Tentative Updates: Epidemic Dissemination

Committed Updates: Multicast Dissemination

naming

• self-certifying path names (Mazières)

• object GUID = hash of owner key and readable name

• create hierarchies using “directory” objects

• read restriction

• through client encryption of data

• write restriction, access control

• associate ACLs with object, respected by servers
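The naming rule above (GUID = hash of owner key and readable name) can be sketched as follows. This is a minimal illustration, not OceanStore's actual encoding: the function name `object_guid`, the byte layout (key bytes followed by the UTF-8 name), and the sample key and path are all assumptions made for the example.

```python
import hashlib


def object_guid(owner_public_key: bytes, name: str) -> str:
    """Return a 160-bit GUID as 40 hex characters.

    Assumed layout: SHA-1 over the owner's key bytes followed by the
    UTF-8 encoded human-readable name. The real wire format may differ.
    """
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


# Hypothetical key and name, for illustration only.
guid = object_guid(b"hypothetical-owner-key", "/home/alice/notes.txt")
print(guid, len(guid) * 4, "bits")
```

Because the owner's key is part of the hashed input, two owners using the same readable name obtain different GUIDs, and anyone who knows the key and name can recompute the GUID to check it.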

addressing

• address an object by its GUID

• message: GUID, random number, small predicate

• route to closest GUID replica matching predicate

• combines data location and routing:

• no central name service to attack

• save one round-trip for location discovery

• routing

• fast, probabilistic search algorithm

• slow, deterministic search algorithm

routing

• fast, probabilistic search algorithm

• Bloom filter

• probabilistic set membership test using bit vector

• each set element maps to an m-bit vector with bits set by k hash functions

• filter is union (OR) of all bit vectors

• attenuated Bloom filter

• array of d Bloom filters

• i-th Bloom filter is union of all <i-hop nodes

• slow, deterministic algorithm

• Tapestry
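The probabilistic tier can be illustrated with a small Bloom filter and an attenuated variant. This is a sketch under stated assumptions: the class and function names, the filter size, the number of hashes, and the use of salted SHA-1 are illustrative choices, and the attenuated filter is modelled simply as a list in which entry i summarizes objects reachable at hop distance i (entry 0 being the local node), which is one reading of the slide.

```python
import hashlib
from typing import Iterator, List, Optional


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 256, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored in a Python int

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several bit positions by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # The union of two sets is the bitwise OR of their bit vectors.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def first_matching_level(levels: List[BloomFilter], item: bytes) -> Optional[int]:
    """Return the smallest depth whose filter may contain item, else None."""
    for depth, level in enumerate(levels):
        if level.might_contain(item):
            return depth
    return None


local, neighbor = BloomFilter(), BloomFilter()
local.add(b"guid-A")
neighbor.add(b"guid-B")
levels = [local, neighbor]  # depth 0 = this node, depth 1 = one hop away
print(first_matching_level(levels, b"guid-A"))  # 0
```

A `None` result means the object is definitely not within the summarized neighborhood, which is the point at which a query would fall back to the slower deterministic search.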

addressing and routing

[Figure labels: probabilistic, deterministic]

Attenuated Bloom Filter

updates

• Updates based on versioning and conflict resolution

• i.e. no locking

• update: actions with predicates

• commit – apply action of first true predicate

• abort – no true predicates

• conflict resolution on encrypted data

• possible predicates:

• compare-version, compare-size, compare-block, search

• possible actions:

• replace-block, insert-block, delete-block, append
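The update model above (a list of predicate/action pairs, commit on the first true predicate, abort otherwise, each commit producing a new version) can be sketched as below. The types and helper names are hypothetical, only two predicates and two actions from the slide are modelled, and blocks are treated as opaque byte strings standing in for ciphertext.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Blocks = Tuple[bytes, ...]


@dataclass(frozen=True)
class Version:
    number: int
    blocks: Blocks  # opaque (encrypted) blocks


Predicate = Callable[[Version], bool]
Action = Callable[[Blocks], Blocks]


def compare_version(expected: int) -> Predicate:
    return lambda v: v.number == expected


def compare_block(index: int, expected: bytes) -> Predicate:
    return lambda v: index < len(v.blocks) and v.blocks[index] == expected


def replace_block(index: int, data: bytes) -> Action:
    return lambda b: b[:index] + (data,) + b[index + 1:]


def append(data: bytes) -> Action:
    return lambda b: b + (data,)


def apply_update(current: Version,
                 update: List[Tuple[Predicate, Action]]) -> Optional[Version]:
    """Commit the action of the first true predicate; return None on abort."""
    for predicate, action in update:
        if predicate(current):
            # Never modify in place: a commit yields a new version.
            return Version(current.number + 1, action(current.blocks))
    return None


v1 = Version(1, (b"a", b"b"))
update = [(compare_version(1), append(b"c"))]
v2 = apply_update(v1, update)      # commits: predicate holds for version 1
stale = apply_update(v2, update)   # aborts: object is now at version 2
print(v2, stale)
```

No locks are taken in this sketch: a writer that raced with another update simply sees its predicate fail and can retry against the newer version.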

archival

• produced when objects idle

• use erasure codes (redundant fragmentation)

• simplest example: parity bit

• need any (n-1) out of n fragments

• interleaved Reed-Solomon codes, Tornado codes

• fragmentation improves reliability

• “deep archival storage”

• sweeper processes ensure replication sustained over time

• fragmentation improves performance
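The parity example above, where any (n-1) of n fragments suffice, can be sketched with a single XOR parity fragment. This illustrates only the simplest case named on the slide, not the Reed-Solomon or Tornado codes; the function names are hypothetical and all data fragments are assumed to have equal length.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data_fragments: List[bytes]) -> List[bytes]:
    """Append one parity fragment: the XOR of all data fragments."""
    parity = reduce(xor_bytes, data_fragments)
    return data_fragments + [parity]


def recover(fragments: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the data fragments when at most one fragment is None."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates at most one lost fragment")
    restored = list(fragments)
    if missing:
        present = [f for f in fragments if f is not None]
        # XOR of the n-1 surviving fragments equals the lost one.
        restored[missing[0]] = reduce(xor_bytes, present)
    return restored[:-1]  # drop the parity fragment


stored = encode([b"ocea", b"nsto", b"re!!"])
stored[1] = None  # simulate losing one fragment
print(b"".join(recover(stored)))
```

Codes such as Reed-Solomon generalize this idea so that more than one lost fragment can be tolerated, at the cost of more redundant fragments.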

Erasure Codes

Simple parity bits or generalized Reed-Solomon codes can be used to implement erasure coding.

Floating Replica and Deep Archival Coding

[Figure: floating replicas, each holding a full copy, conflict-resolution logs, and a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …), alongside erasure-coded fragments]

dynamic optimization (introspection)

• observation modules

• collect and summarize information

• incrementally update system database

• optimization modules

• periodically process the observation database

• cluster recognition: group related objects

• replica management: maintain replica number and location

• periodic migration: work-home-work-home…

• maintenance: routing, dissemination, availability, durability