2013 Storage Developer Conference. © EMC Corporation. All Rights Reserved.
Resiliency at Scale in the Distributed Storage Cloud
Alma Riska, Advanced Storage Division, EMC Corporation
In collaboration with many at the Cloud Infrastructure Group
Outline
Wide topic, but this talk will focus on:
Architecture
Resiliency
Failures
Redundancy schemes
Policies to differentiate services
Digital Content Creation & Investment
Scaled-out Storage Systems
Large amount of hardware:
Thousands of disks
Tens to hundreds of servers
Significant amount of networking
Wide range of applications:
Internet Service Providers
On-line Service Providers
Private cloud
Up to millions of users
Storage Requirements
Store massive amounts of data: (tens of) petabytes on direct-attached, high-capacity nearline HDDs
Highly available: minimum downtime
Reliably stored: beyond the traditional five nines
Ubiquitous access: across geographical boundaries
Scaled-out Storage Architecture
Hardware organized in nodes / racks / geographical sites
[Diagram: multiple geographically distributed sites connected over LAN/WAN; each site holds racks of nodes, with services running on each node]
Scalability in Scaled-out Storage
Independence between components: no single point of failure
Hardware: disks, nodes, racks, sites
Software: services such as metadata
Seamlessly add/remove storage devices or nodes
Isolation of failures
Sustaining performance
Shared-nothing architecture: elasticity / resilience / performance
EMC Atmos Architecture
Shared-nothing architecture
Nodes: 15-60 large-capacity SAS HDDs
Racks: up to 8 nodes, or 480 4 TB HDDs (>1 PB)
At least two sites
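A quick back-of-the-envelope check of the rack capacity quoted above, using only the node and drive counts from this slide:

```python
nodes_per_rack = 8
hdds_per_node = 60        # upper end of the 15-60 HDDs-per-node range
hdd_capacity_tb = 4
raw_tb = nodes_per_rack * hdds_per_node * hdd_capacity_tb
print(raw_tb, "TB raw per rack")   # 1920 TB, roughly 1.9 PB (>1 PB as stated)
```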
[Diagram: two sites connected over LAN/WAN, each with racks of nodes and per-node services]
Storage Resiliency
Data reliability: data is stored persistently on devices such as HDDs
Data availability: data is accessible independently of hardware failures
Data consistency and accuracy: the returned data is what the user stored in the system
Failures
Data devices (HDDs)
Other components:
Hardware: network, power outages, cooling outages
Software: drivers, services (metadata)
Transient Failures
Many failures are transient: a temporary interruption of the operation of a component
Variability in component response time can be seen as a transient failure, particularly network delays
System load causes transient failures
Transient failures occur much more often than hardware component failures
Impact of Failures
Reliability: impacted directly by disk failures, but by all other failures too
Availability: directly impacted by any failure, particularly transient ones
Consistency: impacted by service failures (metadata) and by transient failures
Criticality of Failures in the Cloud
Failures are large scale: a node failure, for example, makes a large amount of data and many other components unavailable simultaneously
Since there are more components in the system, failures happen more often
The system needs to be designed with high component unavailability in mind, even if the unavailability is transient
Challenges of Handling Failures
Correct identification of failures: many failures have similar symptoms
An unreachable disk may indicate a disk failure, a controller failure, a power failure, or a network failure
Effective isolation of failures: limit the cases when a single component failure becomes a node or site failure
Timely detection of failures: in a large system failures may go undetected, particularly transient failures and their impact
Example of System Alerts
HDD events are overwhelming
Events do not necessarily indicate disk failures
Rather, they indicate temporarily unreachable HDDs, for various reasons; the majority are transient
Fault Tolerance in Cloud Storage
Transparency toward failures: disks / nodes / racks, services, even entire sites
Transparency varies by system goals or targets
Fault Tolerance in Cloud Storage
Transparency toward failures: disks / nodes / racks, services, even entire sites
Transparency varies by system goals or targets
Resilience goals determine fault domains
Fault Domains
The hierarchy of the set of resources whose failure can be tolerated in a system
Example: tolerating a site failure means tolerating two racks, or 16 nodes, or 240 disks
Determines the distribution of data and services
Fault Tolerance and Redundancy
Fault tolerance is primarily achieved via redundancy: more hardware and software than strictly needed
Achieving a fault-tolerance goal depends on the amount of redundancy (storage capacity):
Traditionally parity (RAID)
Often replication in the cloud
Erasure coding
Pro-active measures:
Monitoring / analysis / prediction of the system's health
Background detection of failures
Fault Tolerance and Data Replication
Replicate data (including metadata) up to 4 times
Pros: high reliability, high availability, good performance and accessibility, easy to implement
Cons: high capacity overhead, up to 300% in 4-way replication
Replication in Scale-Out Cloud Storage
Average case in a cloud storage system: several tens (up to a hundred) of raw PB of capacity, multiple tens of user PB of capacity
Replication does not scale well with regard to cost or resilience
With only 3 replicas it is not always possible to tolerate combined multi-node and site failures
Erasure Coding
Generalization of parity-based fault tolerance (RAID schemes); replication is a special case
Out of n fragments of information, m are actual data and k are additional codes (n = m + k)
Up to k missing fragments of data can be tolerated
The code is referred to as an m/n code
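As a minimal illustration of the idea (not the code Atmos actually uses), the sketch below implements the simplest erasure code, a single XOR parity: m data fragments plus k = 1 code fragment, so any one missing fragment can be rebuilt.

```python
def encode(data_fragments):
    """m data fragments -> n = m + 1 fragments; the last one is an XOR parity."""
    parity = bytes(len(data_fragments[0]))
    for frag in data_fragments:
        parity = bytes(a ^ b for a, b in zip(parity, frag))
    return data_fragments + [parity]

def recover(fragments, missing_index):
    """Rebuild the single missing fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(fragments) if i != missing_index]
    rebuilt = bytes(len(survivors[0]))
    for frag in survivors:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, frag))
    return rebuilt

# A 3/4 code: three data fragments plus one parity; any single loss is tolerated.
frags = encode([b"abcd", b"efgh", b"ijkl"])
assert recover(frags, 1) == b"efgh"
```

Production codes such as the 9/12 and 10/16 codes mentioned on the next slide typically use Reed-Solomon-style constructions so that k can be larger than 1, but the accounting of m data and k code fragments is the same.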
Erasure Coding
Capacity overhead is k/n; for the same protection, the overhead reduces as n increases
Complexity (computational and management) increases as n increases
As network delays dominate performance, erasure coding becomes a feasible approach
Trade-off between protection, complexity, and overhead
Common EMC Atmos codes are 9/12 and 10/16
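To make the overhead comparison concrete, a quick calculation of both the redundancy fraction k/n and the raw-to-user capacity ratio n/m; the 4-way replication case is the one mentioned earlier, and the ratios are simple arithmetic rather than measured figures.

```python
schemes = {
    "4-way replication": (1, 4),    # (m data fragments, n total fragments)
    "9/12 erasure code": (9, 12),
    "10/16 erasure code": (10, 16),
}
for name, (m, n) in schemes.items():
    k = n - m
    print(f"{name}: redundancy fraction k/n = {k/n:.0%}, raw/user ratio n/m = {n/m:.2f}")
# 4-way replication: 75% and 4.00; 9/12: 25% and 1.33; 10/16: 38% and 1.60
```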
EC vs. Other redundancy schemes
Erasure Coding at Scale
Data fragments are distributed based on the system fault domains
Placement of these fragments is crucial
Round-robin placement ensures uniform distribution of fragments (assumed in the previous calculations)
Placement of data fragments also depends on user requirements with regard to performance and priorities
EC data placement in the Cloud
We develop a model to capture the dependencies between EC fragment placement and system size/architecture
The model determines tolerance toward site failures as a function of:
Number of sites
m/n erasure code parameters
It also determines the additional node failure tolerance
EC data placement in the Cloud
Assumptions:
Homogeneous, geographically distributed sites: equal numbers of nodes and disks, equal network delays between any pair of sites, equal data priority
Round-robin distribution of the fragments across sites / nodes / disks
Failures on disks / nodes / sites (power, network)
A sketch of the resulting tolerance calculation follows.
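The following is a minimal sketch of the kind of calculation such a model performs under the round-robin assumptions above; the function and its worst-case rounding are a simplification for illustration, not the exact EMC model.

```python
from math import ceil

def tolerance(m, n, sites, nodes_per_site):
    """(site, node) failure tolerance of an m/n code under round-robin placement.

    The n fragments are spread evenly over sites and nodes; any k = n - m
    fragment losses can be tolerated.
    """
    k = n - m
    per_site = ceil(n / sites)                      # fragments held by one site
    per_node = ceil(n / (sites * nodes_per_site))   # fragments held by one node
    site_tol = min(sites - 1, k // per_site)        # whole-site failures tolerated
    budget = k - site_tol * per_site                # losses still tolerable after that
    node_tol = budget // per_node                   # extra node failures on top
    return site_tol, node_tol

# Two sites of 6 nodes each:
print(tolerance(6, 12, sites=2, nodes_per_site=6))   # (1, 0): one full site, no extra nodes
print(tolerance(9, 12, sites=2, nodes_per_site=6))   # (0, 3): no site, but 3 extra nodes
# Four sites of 6 nodes each:
print(tolerance(10, 16, sites=4, nodes_per_site=6))  # (1, 2)
```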
Failure Tolerance in 2 Site System
In a two-site system, at most one site failure can be tolerated
Each site has 6 nodes available
In the figure, the numbers in each (x, y) tuple are the number of node failures tolerated in addition to the number of site failures tolerated
Failure Tolerance in 4 Site System
In a four-site system there is one-, two-, and three-site failure tolerance
Each site has 6 nodes available
In the figure, the numbers in each (x, y) tuple are the number of node failures tolerated in addition to the number of site failures tolerated
Heterogeneous Protection Policies
As systems evolve, their resources become heterogeneous:
Different node or site sizes
Different network bandwidth
Different data priority, location, and origin
In such a case, uniformity of data distribution is not a requirement; the above factors (including performance) should determine data fragment placement
Abstraction of Heterogeneous Cloud Storage
Group components based on affinity criteria, e.g., network bandwidth
Create homogeneous sub-clusters
Determine redundancy for each sub-cluster
Handle each sub-cluster independently
Combine the outcomes for system-wide placement
A sketch of the grouping step is shown below.
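A rough sketch of the grouping step, assuming inter-site latency is the affinity criterion; the threshold, site names, and greedy merge are illustrative choices, not Atmos internals.

```python
def group_by_affinity(sites, latency_ms, threshold_ms=20):
    """Greedily merge sites whose pairwise latency is below the threshold into
    homogeneous sub-clusters; each sub-cluster is later handled independently
    when redundancy and placement are decided."""
    clusters = []
    for site in sites:
        for cluster in clusters:
            if all(latency_ms[(site, other)] < threshold_ms for other in cluster):
                cluster.append(site)
                break
        else:
            clusters.append([site])
    return clusters

sites = ["east-1", "east-2", "west-1", "west-2"]
latency_ms = {("east-2", "east-1"): 5, ("west-1", "east-1"): 70, ("west-1", "east-2"): 72,
              ("west-2", "east-1"): 75, ("west-2", "east-2"): 78, ("west-2", "west-1"): 6}
latency_ms.update({(b, a): v for (a, b), v in latency_ms.items()})  # symmetric lookup
print(group_by_affinity(sites, latency_ms))  # [['east-1', 'east-2'], ['west-1', 'west-2']]
```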
Abstraction of Heterogeneous Cloud Storage - Example
Two sites are close (e.g., on the same US coast) with a fast network connection
Data can be placed on any of the nodes in either site, and retrieving it will not suffer extra network delay
If a 6/12 redundancy scheme is used and the data's primary location is the two-site sub-cluster, then the 6 data fragments can be placed in its two sites and the 6 code fragments in the other, remote sites
Accessing the data is not affected by network bandwidth, and one site failure is tolerated
A small placement sketch follows.
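A toy illustration of that placement, building on the sub-clusters from the previous sketch; the fragment labels and site names are hypothetical.

```python
def place_fragments(data_frags, code_frags, near_sites, far_sites):
    """Spread data fragments round-robin over the near (primary) sub-cluster
    and code fragments over the remote sites."""
    placement = {}
    for i in range(data_frags):
        placement[f"data-{i}"] = near_sites[i % len(near_sites)]
    for i in range(code_frags):
        placement[f"code-{i}"] = far_sites[i % len(far_sites)]
    return placement

# 6/12 scheme: 6 data fragments stay near, 6 code fragments go remote.
print(place_fragments(6, 6, ["east-1", "east-2"], ["west-1", "west-2"]))
```

Reads touch only the near sites, and losing any single site costs 3 fragments, well within the k = 6 losses a 6/12 code tolerates.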
Differentiate Protection via Policy
Flexible policy settings for grouping resources and isolating applications/tenants
Easy management of a large heterogeneous system
Hybrid protection schemes that combine multiple replication schemes
E.g., a two-replica policy where:
The first replica is the original data (stored in the site closest to the tenant)
The second replica is a 9/12 EC scheme that distributes the data across the rest of the sites for resilience
A hypothetical sketch of such a policy follows.
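One hypothetical way such a hybrid policy might be written down; the field names are illustrative and are not the actual Atmos policy syntax.

```python
hybrid_policy = {
    "name": "near-copy-plus-remote-ec",
    "replicas": [
        {   # first replica: a full copy kept in the site closest to the tenant
            "type": "full-copy",
            "placement": "closest-site-to-tenant",
            "mode": "sync",
        },
        {   # second replica: 9/12 erasure coded, spread over the remaining sites
            "type": "erasure-code",
            "data_fragments": 9,
            "total_fragments": 12,
            "placement": "all-other-sites",
            "mode": "async",
        },
    ],
}
```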
Protection Policies in the Field
Columns: Tenants | Sites | 2 replicas | >= 3 replicas | 1 EC replica | >= 2 EC replicas | Mix regular/EC
1 1  10/2; 9/3
4 1  sync; async
2 2  sync; async  async
3 2  sync  async
4 2  sync; async  9/3  9/3; 10/6; sync; async
2 2  sync; async  sync; async
2 2  9/3; sync; async
2  sync; async  sync
3 2  sync  9:3; sync; async
2 4  sync; async  async  10/6; 9/3  9/3; sync; async  9:/3; async
1 2  10:2
2 2  sync  9:3; sync; async  9/3  async
2 1  sync  9:3
2 2  sync  async  9/3  async
1 6  9:3  async
2 2  async
2 2  sync  9:3  9:3  async
3 2  sync  async
3 3  sync  async
2 1  sync
Proactive Failure Detection
Monitoring the health of devices and services; logging events
Taking corrective measures before failures happen: strengthen resilience and address issues before the redundancy is affected by a failure
Example: use SMART logs to determine the health of drives, and replace HDDs that are about to fail rather than ones that have already failed
A sketch of such a SMART-based check is shown below.
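A minimal sketch of that kind of check, assuming smartctl from smartmontools is available and using the reallocated-sector count as the health signal; the threshold is an illustrative choice, not an EMC policy.

```python
import re
import subprocess

def reallocated_sectors(device):
    """Return the raw Reallocated_Sector_Ct value reported by smartctl, or None."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(re.split(r"\s+", line.strip())[-1])
    return None

def needs_proactive_replacement(device, threshold=50):
    """Flag a drive whose reallocated-sector count suggests it is degrading."""
    count = reallocated_sectors(device)
    return count is not None and count >= threshold

# Example (requires root privileges and a real device):
# print(needs_proactive_replacement("/dev/sda"))
```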
Proactive Failure Detection
Verify in the background the validity of data and services and the health of hardware: a critical aspect of resiliency in the cloud
Systems are large and some portions may be idle for extended periods of time, so failures and issues may go undetected
Background verification ensures timely failure detection and improves resilience for a given amount of redundancy
Conclusions
Resilience at scale = reliability + availability + consistency
Wide range of large-scale failures
Redundancy aids resiliency at scale
Erasure coding: efficient scaling of resiliency
Proactive measures to ensure resiliency at scale