Dependability of Data Center Networks

Trial Lecture, March 11, 2016, Trondheim
Jonas Wäfler



Introduction: What is a Data Center?

•  Data Center (DC): contains resources (computational, storage, network)

•  Enterprise DC and Internet DC

•  Core infrastructure for cloud-based services:
   –  On-demand media
   –  Cloud storage
   –  Cloud computing
   –  Social networking services
   –  E-commerce


Introduction: What is a Data Center?

•  Microsoft has over a million servers in data centers, Google even more [Ballmer 2013]

•  Very large DC: > 100,000 servers

□  Agile and reconfigurable
□  High availability levels
□  Low cost
□  Energy efficiency


Introduction: What is a Data Center Network (DCN)?

•  Connects computational and storage resources
   –  With each other
   –  To the outside

□  Scalability
□  High cross-sectional bandwidth
□  Fault tolerance

[Bilal 2012]


Dependability

Dependability:
•  Ability to avoid service failures that are more frequent and more severe than is acceptable

Availability:
•  Readiness for correct service
•  E.g. A = 99% or A = 99.999%

Reliability:
•  Continuity of correct service
•  E.g. R(60 min) = 99%

Definitions: [Avizienis 2004]
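To make the two measures concrete, here is a minimal sketch. It assumes a constant failure rate (exponential model) and steady-state availability computed from mean time to failure (MTTF) and mean time to repair (MTTR). The model, the function names, and the example numbers are illustrative assumptions and are not taken from [Avizienis 2004].

```python
import math

def availability(mttf_hours, mttr_hours):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def reliability(t_hours, mttf_hours):
    """R(t) = exp(-t / MTTF); assumes a constant failure rate (assumption)."""
    return math.exp(-t_hours / mttf_hours)

def mttf_for_reliability(t_hours, target):
    """MTTF needed so that R(t) equals `target` under the same assumption."""
    return -t_hours / math.log(target)

# A system that runs 99 h between failures and needs 1 h to repair has A = 0.99.
print(availability(99.0, 1.0))
# MTTF (in hours) required for R(60 min) = 99% under the exponential model.
print(mttf_for_reliability(1.0, 0.99))
```

Under these assumptions, R(60 min) = 99% corresponds to an MTTF of roughly 99.5 hours. Availability and reliability answer different questions: the first includes repair, the second does not.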


Dependability: Availability Levels of Data Centers

ANSI/TIA-942-A
•  Network architecture
•  Electrical design
•  System redundancy
•  Database management
•  Protection against physical hazards (fire, flood, windstorm)
•  Power management
•  …

•  Practical design considerations (cabling etc.) → A well-designed DC is easier to repair and maintain (availability, maintainability)

Uptime Institute
•  Tier 1 (99.671%)
•  Tier 2 (99.741%)
•  Tier 3 (99.982%)
•  Tier 4 (99.995%)

Image: http://www.colocationamerica.com/
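The tier percentages are easier to compare as expected downtime per year. The following sketch does the conversion; the 8760-hour year (no leap day) and the names used are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years (assumption)

# Availability values as listed on the slide.
TIER_AVAILABILITY = {
    "Tier 1": 0.99671,
    "Tier 2": 0.99741,
    "Tier 3": 0.99982,
    "Tier 4": 0.99995,
}

def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

for tier, a in TIER_AVAILABILITY.items():
    print(f"{tier}: {annual_downtime_hours(a):.1f} h/year")
```

This gives roughly 28.8, 22.7, 1.6 and 0.4 hours of downtime per year for Tier 1 to Tier 4.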


Dependability: Threats

[Helvik 2007]



Dependability in Data Center Networks

Threats to dependability
•  Link and node failure

Countermeasures
•  More reliable equipment
•  Fault tolerance: the system continues to operate properly even when some of its components have failed
   –  Topology
   –  Routing


Three-tier DCN: Structure

•  Topology with three layers (Core / Aggregation / Access)

•  Access routers (AccR) aggregate traffic from up to several thousand servers

•  1:1 redundancy in each layer (except for the top-of-rack switches, ToRs)

Figures: [Gill 2011]


Three-tier DCN: Dependability Analysis

•  Data center networks are reliable
   –  A = 99.99% for 80% of links and 60% of devices

•  Low-cost, commodity switches (ToR) are highly reliable

Figure axis: (# devices with failures) / (# devices)

Figures: [Gill 2011]

Three-tier DCN: Root causes of failures

•  Hardware problems take longer to mitigate

•  Load balancers experience a high number of software faults

•  Link failures: dominated by hardware (HW) and connection errors

Figures: [Gill 2011]


Fat-Tree Topology

•  k pods

•  Figure labels: $(k/2)^2$ core switches, $k(k/2)$ aggregation switches, $k(k/2)$ edge switches, $k(k/2)^2$ servers

•  Based on Clos network

✓  Uses commodity network switches (all identical, with k ports)

✓  Fault tolerance: higher redundancy

✓  High cross-sectional bandwidth

Limitations
•  Scalability issues
•  # pods ≤ # ports in each switch

Image: [Al-Fares 2008]
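The element counts of a k-ary fat-tree follow directly from the port count k. The sketch below tabulates them; the function name and the dictionary keys are illustrative choices, not notation from [Al-Fares 2008].

```python
def fat_tree_size(k):
    """Element counts of a fat-tree built from identical k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "core_switches": half ** 2,        # (k/2)^2
        "aggregation_switches": k * half,  # k/2 per pod
        "edge_switches": k * half,         # k/2 per pod
        "servers": k * half ** 2,          # k/2 servers per edge switch
    }

print(fat_tree_size(4))
print(fat_tree_size(48)["servers"])
```

With 48-port switches this yields 27,648 servers, which shows the scalability limit: the size of the network is fixed by the port count of the switches.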


DCell Topology

•  Server-centric (!)
•  Servers with network connections
•  Recursive building algorithm
   –  DCell0: n servers, 1 switch
   –  DCell1: n+1 DCell0 cells
   –  DCell2: n(n+1)+1 DCell1 cells
   –  …

•  Decentralized routing based on structure; fault-tolerant routing without using global state

Image: [Guo 2008]
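The recursive construction can be followed numerically: a DCell at level k consists of (servers in one level k-1 cell) + 1 sub-cells. A minimal sketch, with function names chosen here for illustration:

```python
def dcell_servers(n, level):
    """Number of servers in a DCell of the given level, built from n-server DCell0 cells."""
    servers = n  # DCell0: n servers attached to one switch
    for _ in range(level):
        sub_cells = servers + 1      # a level-k cell has (servers per sub-cell + 1) sub-cells
        servers = sub_cells * servers
    return servers

def dcell_sub_cells(n, level):
    """Number of level-(k-1) cells that make up one level-k DCell (level >= 1)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return dcell_servers(n, level - 1) + 1

for level in range(4):
    print(level, dcell_servers(4, level))
```

For n = 4 the server count grows as 4, 20, 420, 176,820 for levels 0 to 3, so a few levels are enough for very large networks.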


Many more topologies

BCube [Guo 2009]
Scafida, based on scale-free networks [Gyarmatia 2013]

Robustness: Dependability in networks

•  How to assess dependability in a network?

•  Network properties

Table: [Manzano 2013]

Robustness: Dependability in networks

•  How to assess dependability in a network?

•  Network properties
•  Robustness
   –  A2TR(p), the average two-terminal reliability: fraction of node pairs that are connected to each other after p failures [Neumayer 2010]
   –  Fully connected: A2TR = 1
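A2TR for one given set of failed nodes can be computed by counting node pairs inside each connected component. The sketch below assumes an undirected graph given as an adjacency dictionary and counts pairs relative to the original network, so a failed node counts as disconnected from everything. Both choices are assumptions made for illustration.

```python
from collections import deque

def a2tr(adjacency, failed=frozenset()):
    """Fraction of the original node pairs still connected after removing `failed` nodes."""
    n = len(adjacency)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 1.0
    alive = set(adjacency) - set(failed)
    seen = set()
    connected_pairs = 0
    for start in alive:
        if start in seen:
            continue
        # Breadth-first search to measure the size of this component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor in alive and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        connected_pairs += size * (size - 1) // 2
    return connected_pairs / total_pairs

# Hypothetical example: a ring of four nodes.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(a2tr(ring))            # no failures
print(a2tr(ring, {0}))       # one failed node
print(a2tr(ring, {0, 2}))    # two opposite nodes failed
```

For the ring, the three calls give 1.0, 0.5 and 0.0: the intact ring is fully connected, one failure leaves a path of three nodes, and two opposite failures isolate the remaining nodes.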


Robustness: Connectivity

•  Random networks
•  1000 simulation runs
•  Remove stepwise from 0 to (n-2) nodes

Note:
•  FatTree: better robustness metrics (<k>, …) than DCell
•  But FatTree is worse in the connectivity analysis

[Manzano 2013]
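A stepwise removal experiment of this kind could be sketched as follows. This is not the simulation code of [Manzano 2013]. It assumes uniformly random node failures, counts connected pairs relative to the original network, and uses a union-find structure that re-adds nodes in reverse removal order so that one run yields the whole curve. All names are illustrative.

```python
import random

def removal_curve(adjacency, order):
    """curve[p] = number of connected node pairs after removing order[:p].

    `order` must list every node exactly once.
    """
    parent, size = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = 0
    curve = [0] * (len(order) + 1)  # last entry: all nodes removed
    for p in range(len(order) - 1, -1, -1):
        node = order[p]
        parent[node], size[node] = node, 1
        for neighbor in adjacency[node]:
            if neighbor in parent:  # neighbor is already back in the network
                a, b = find(node), find(neighbor)
                if a != b:
                    pairs += size[a] * size[b]  # newly connected pairs
                    if size[a] < size[b]:
                        a, b = b, a
                    parent[b] = a
                    size[a] += size[b]
        curve[p] = pairs
    return curve

def mean_a2tr(adjacency, runs=1000, seed=0):
    """Average A2TR(p) for p = 0 .. n-2 removed nodes over random removal orders."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    n = len(nodes)
    total_pairs = n * (n - 1) / 2
    sums = [0.0] * (n - 1)
    for _ in range(runs):
        order = nodes[:]
        rng.shuffle(order)
        curve = removal_curve(adjacency, order)
        for p in range(n - 1):
            sums[p] += curve[p] / total_pairs
    return [s / runs for s in sums]

# Hypothetical example: a ring of four nodes.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(mean_a2tr(ring, runs=1000))
```

Processing the removals in reverse avoids recomputing the connected components after every single removal, which matters when the topology has many thousands of nodes.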


Effectiveness of Redundancy

•  The connectivity analysis considers only structure; rerouting is not considered → gives the best case

•  Real systems: coverage is not perfect

•  Study: three-tier network [Gill 2011]

•  Failures in the fail-over mechanism
•  Configuration problems in the backup (traffic is rerouted to a failed component)

•  Protocol issues and timeouts

[Gill 2011]
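The effect of imperfect coverage can be illustrated with a simple model of a primary/backup pair. The model is an assumption made here, not a result from [Gill 2011]: both components have the same availability, fail independently, and the fail-over succeeds with probability `coverage`.

```python
def pair_availability(a, coverage=1.0):
    """Availability of a 1:1 redundant pair with imperfect fail-over coverage.

    The service is up if the primary is up, or if the primary is down,
    the fail-over succeeds, and the backup is up (illustrative model).
    """
    if not (0.0 <= a <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("a and coverage must lie in [0, 1]")
    return a + (1.0 - a) * coverage * a

for c in (1.0, 0.9, 0.0):
    print(f"coverage={c}: A={pair_availability(0.99, c):.5f}")
```

With A = 0.99 per component, perfect coverage gives 0.9999, a coverage of 0.9 gives 0.99891, and a coverage of 0 gives no gain over a single component. The structural best case is reached only when the fail-over mechanism itself never fails.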


Conclusion

•  Network dependability needs different metrics

•  “Optimal” is relative
•  Many additional factors to consider
   –  Scalability
   –  Cross-section bandwidth
   –  Cost effectiveness
   –  Energy efficiency

Table: [Bilal 2013B]


References 1

•  [Avizienis 2004] Avizienis et al., “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Trans. Dependable and Secure Computing, 2004

•  [Helvik 2007] B. E. Helvik, K. Sallhammar, and S. J. Knapskog, “Integrated Dependability and Security Evaluation Using Game Theory and Markov Models”, Chapter 8 in Information Assurance: Dependability and Security in Networked Systems, Elsevier, 2007

•  [Gill 2011] P. Gill, N. Jain, and N. Nagappan, “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, SIGCOMM, 2011

•  [Al-Fares 2008] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture”, SIGCOMM, 2008

•  [Guo 2008] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: a scalable and fault tolerant network structure for data centers”, SIGCOMM, 2008

•  [Guo 2009] C. Guo et al., “BCube: a high performance, server-centric network architecture for modular data centers”, SIGCOMM, 2009


References 2

•  [Bilal 2012] Bilal et al., “Quantitative comparisons of the state-of-the-art data center architectures”, Concurrency Computat.: Pract. Exper., 2012

•  [Manzano 2013] M. Manzano, K. Bilal, E. Calle, and S. U. Khan, “On the Connectivity of Data Center Networks”, IEEE Communications Letters, 2013

•  [Bilal 2013A] K. Bilal, M. Manzano, S. U. Khan, E. Calle, K. Li, and A. Y. Zomaya, “On the Characterization of the Structural Robustness of Data Center Networks”, IEEE Trans. Cloud Computing, 2013

•  [Bilal 2013B] Bilal et al., “A Taxonomy and Survey on Green Data Center Networks”, Future Generation Computer Systems, 2013

•  [Neumayer 2010] S. Neumayer and E. Modiano, “Network Reliability With Geographically Correlated Failures”, Proc. 2010 Conference on Information Communications

•  [Ballmer 2013] http://news.microsoft.com/2013/07/08/steve-ballmer-worldwide-partner-conference-2013-keynote/

•  [Liu 2013] Y. Liu et al., “Data Center Networks; Topologies, Architectures and Fault-Tolerance Characteristics”, Springer, 2013

•  [Gyarmatia 2013] Gyarmatia et al., “Free-Scaling Your Data Center”, Computer Networks, 2013


The End