CSCE678 - Datacenterscourses.cse.tamu.edu/chiache/csce678/s19/slides/datacenter.pdf · Types of...

Post on 27-Jun-2020

2 views 0 download

Transcript of CSCE678 - Datacenterscourses.cse.tamu.edu/chiache/csce678/s19/slides/datacenter.pdf · Types of...

Datacenters

CSCE678

Announcement

• Pop quiz results will be on GradeScope (soon)

• No class meeting on Jan 18;

Lecture video will be available on blackboard

• No class meeting on Jan 21 (MLK Day)

2

Everything is in the Cloud Nowadays

3

Datacenters Across the Globe

4

Datacenters Across the Globe

5

Datacenters with 100,000+ Servers

6

Google Datacenter

Microsoft Datacenter

Facebook Datacenter

Internal of a Data Center

7

Server

Rack

Cluster

Aggregating Resources

8

Google TensorFlow Processing Units (TPUs)

9

1 TPU Pod = 64 TPUv2= 11.52 Petaflops (11.52 x 1015 float-point ops per sec)

TPU RacksCPU Rack CPU Rack

Why Datacenters?

• Cost Efficiency:• Reduce facility, management, power cost

• Elasticity:• Adaptive resource allocation for customers’ need

• Pay-per-use

• Multi-tenancy:• Accommodating multiple users in one infrastructure

• Ease of management:• Fault/crash tolerance and disaster recovery

10

Cost of Maintaining a Datacenter

11

Types of Cloud Deployments

• Private:• Datacenters run by organizations (e.g., banks, DoD)

• Strong isolation but expensive to build

• Public:• AWS, Microsoft Azure, GCP, etc

• Rented by public

• Hybrid:• Private cloud using public cloud as backend or backup

• Private cloud hosted by public cloud

12

Not Just a Collection of Servers

Microsoft Azure

Azure SQL Database

SQL Data Warehouse

RedisCache

SQL Server on Virtual Machines

Azure Database

for PostgreSQL

Azure Cosmo DB

13

Google Cloud

Platform

Google Cloud

Storage

Google Cloud SQL

Google Cloud

Bigtable

Compute Engine

Google App Engine

CDN to CLoud

Multi-Tier Data Centers

14

Front End

Webservers

Back End

DB &Storage

Aggregatorsor middle tieror compute nodesor microservices

Workers

Front-End Services

• Directly deal with data from/to users

• Latency matters a lot

• Oftentimes dealing with mobile apps, streaming

service, telecommunication, etc

15

Middle-Tier Services

• Each node has a specific task

• Installed on commodity computers ➔ Highly elastic

and easy to scale up

• Example:

• Data processing: Hadoop, Spark

• Caching: Redis, Memcached

• Application servers: JBOSS, Tomcat

16

Back-end Services

• Storage/DB or batch processing

• Sometimes are separately rented and connected to

other cloud ➔ Ex: AWS S3 or EBS

• Often need all-to-all connectivity to the cloud

• Example:

• Sorting / filtering / indexing

• Federated learning

• Databases / key-value stores

17

Types of Cloud Services

• Software as a Service (SaaS)

• Vendors sell licensed cloud software to customers

• Naturally multi-tenant and scalable

• Example: Google Doc, Office 365, Saleforces.com

• Platform as a service (PaaS)

• Vendors/Providers provide development platforms

• Rapid deployment

• Example: Google AppEngine

18

Types of Cloud Services

• Infrastructure as a Service (IaaS)

• Providers provision computing resources• vCPU, memory, network, disks, etc

• Customers is provided with virtual machine instances

• a1.large, t3.medium, t2.micro, etc

• Customers have control over OS, storage, system configuration, etc

• Example: AWS EC2

19

Other New Service Models (I)

• Function as a Service (FaaS)

• Developers deploy functions, not VMs

• Event-driven, pay-per-use

• Spread out execution to 5000+ cores in < 1 sec

• Backend as a Service (BaaS)

• Connect web/mobile apps with APIs or SDKs

• Example: connect cloud services (DB, LDAP, etc) to a unified REST API for mobiles

20

FaaS + BaaS = Serverless Computing

Other New Service Models (II)

• Database as a Service (DBaaS)

• Provide users access to a relational or NoSQL database

• Big Data as a Service (BDaaS)

• Programmable analytic platform for users (i.e., PaaS)

• Security as a Service (SECaaS)

• Security primitives (single sign-on, antivirus, intrusion detection, etc) for corporates

21

Service Level Agreement (SLA)

• A contract about expectation and responsibility of

cloud providers and customers

• Comes with financial penalties

• Need to be defined in precise metrics

• Service Availability

• Defect Rates

• Technical Quality

• Security

22

Examples of SLAAvailability % Downtime per year Downtime per month Downtime per week

90% ("one nine") 36.5 days 72 hours 16.8 hours

95% 18.25 days 36 hours 8.4 hours

97% 10.96 days 21.6 hours 5.04 hours

98% 7.30 days 14.4 hours 3.36 hours

99% ("two nines") 3.65 days 7.20 hours 1.68 hours

99.50% 1.83 days 3.60 hours 50.4 minutes

99.80% 17.52 hours 86.23 minutes 20.16 minutes

99.9% ("three nines") 8.76 hours 43.8 minutes 10.1 minutes

99.95% 4.38 hours 21.56 minutes 5.04 minutes

99.99% ("four nines") 52.56 minutes 4.32 minutes 1.01 minutes

100.00% 26.28 minutes 2.16 minutes 30.24 seconds

99.999% ("five nines") 5.26 minutes 25.9 seconds 6.05 seconds

99.9999% ("six nines") 31.5 seconds 2.59 seconds 0.605 seconds

99.99999% ("seven nines") 3.15 seconds 0.259 seconds 0.0605 second

23

SLA Decides Everything

• Cloud providers make decisions based on SLA

• Resource Allocation

• Disaster & failure recovery

• … and a lot more

24

Resource Allocation

25

Tenant 1: 2 CPUs, 4GB RAM

Tenant 2: 2 CPUs, 4GB RAM

Tenant 3: 2 CPUs, 6GB RAM

4 CPUs

8GBRAM

2 CPUs

12GBRAM

Server A Server B

Resource Allocation

26

4 CPUs

8GBRAM

2 CPUs

12GBRAM

Server A Server B

Tenant 1: 2 CPUs, 4GB RAM

Tenant 2: 2 CPUs, 4GB RAM

Tenant 3: 2 CPUs, 6GB RAM

1

2

1

2 3

3

Resource Allocation Failure

27

4 CPUs

8GBRAM

2 CPUs

12GBRAM

Server A Server B

Tenant 1: 2 CPUs, 4GB RAM

Tenant 2: 2 CPUs, 4GB RAM

Tenant 3: 2 CPUs, 6GB RAM12

12

Enough spared resources,but not on one server.

Overprovisioning vs Reserved

• Virtualization Technology allows overprovisioning

• Offering more resources than you have

• Example: 8GB RAM shared by 4 tenants who want 4GB

• Assuming they won’t use 4GB simultaneously

• Problem: cannot guarantee availability and

performance isolation

• Vendors are worried about violating SLA

• Out of memory will cause thrashing

• Generally, vendors just reserve resources for customers

28

Resource Disaggregation

• Break down a workload into smaller ones to make

allocation easier

• 2 CPU, 6GB RAM ➔ (1 CPU, 3GB RAM) * 2

• Occupation time also matters

• Short occupation ➔ easier for time-sharing

• Example: FaaS

• Proportional DRAM (128/192/…/3008 MBs) and CPUs

• Time limit (10 mins)

29

Dealing with Failures

• Even a small server room has failures

• A warehouse-size datacenter faces failures daily;

cannot be prevented at hardware level.

• Can only handle at software & management level

30

Failure Rates of Datacenters

• According to Google fellow Jeff Dean (2008),

in the first year of a datacenter:

• Typically 1,000 machine failures

• 500-1,000 machines are down for ~6 hours

• 20 racks with 40-80 machines vanish from network

• 5 racks got half of the packets missing

• 50% chance of overheating in the whole cluster, taking down all servers in 5 mins and taking 1-2 days to recover

31

Large Disasters in Datacenters

32

Apr 20, 2014Samsung SDS datacenter burning during a fire,causing disruption in global data services

Replication

33

Replica 1

Replica 2

Sync

Sync

Datacenter A

Datacenter B

Replica 3Sync

Availability Zones

34

References

• “The Datacenter as a Computer – An Introduction to the Design of Warehouse-

Scale Machines”, 2nd Edition, by Barroso, Clidaras, and Hölzle

• Course material: “Datacenter Fundamentals: The Datacenter as a Computer”,

by George Porter, UCSD

• “What is Resiliency in Azure?”

(https://azure.microsoft.com/mediahandler/files/resourcefiles/azure-resiliency-

infographic/Azure_resiliency_infographic.pdf)

35