Cloud 3.0 and Software Defined Networking

Amin Vahdat, Google Fellow, on behalf of Google Technical Infrastructure
October 28, 2016


Page 2: Overview

• This talk: example of the Google research model
• Driven by novel application requirements, we must solve problems at the frontier of computer systems
• The impact of doing so can be huge
• Our research question: how do we build a network that can allow a building to be the unit of storage access and a shared medium for compute?

https://g.co/research/networkinfra

Page 3: Subset of Google Networking Publications

1. Congestion Based Congestion Control, ACM Queue 2016.
2. An Internet-Wide Analysis of Traffic Policing, SIGCOMM 2016.
3. Evolve or Die: High-Availability Design Principles Drawn from Failures in a Global-Scale Content Provider, SIGCOMM 2016.
4. Maglev: A Fast and Reliable Software Network Load Balancer, NSDI 2016.
5. TIMELY: RTT-based Congestion Control for the Datacenter, SIGCOMM 2015.
6. Condor: Better Topologies through Declarative Design, SIGCOMM 2015.
7. Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing, SIGCOMM 2015.
8. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM 2015.
9. Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks, NSDI 2014.
10. B4: Experience With a Globally-Deployed Software Defined WAN, SIGCOMM 2013.

Page 4: Computing at a Crossroads

• Distributed programming has faced similar challenges since sockets
• The free lunch in performance improvements is over
• Storage capacity has increased through disaggregation
• I/O latency gap remains
• “Next-gen” storage remains largely untapped at scale

Networking will drive future improvements in compute performance by blurring the line between individual servers and then making them disappear.

Page 5: Cloud 1.0 (Last Decade)

Virtualization delivers capex savings to enterprise DCs

Page 6: Cloud 2.0 (Now): HW on Demand

Cloud 1.0 → Cloud 2.0

Public cloud frees enterprise from private HW infrastructure

Scheduling, load balancing primitives, “big data” query processing

Page 7: The Third Wave of Cloud Computing

Cloud 1.0 → Cloud 2.0 → Cloud 3.0

Cloud 3.0: Compute, not servers

Serverless compute, actionable intelligence, and machine learning

Not data placement, load balancing, OS configuration and patching

Page 8: The Third Wave of Cloud Computing

Cloud 1.0 → Cloud 2.0 → Cloud 3.0

Networking should be aiming for Cloud 3.0

Page 9: Networking and Cloud 3.0

• Storage disaggregation: the datacenter is the storage appliance
• Open Marketplace of services, securely placed and accessed
• Transparent live migration
• Seamless telemetry and scale up/down

Page 10: Networking and Cloud 3.0

• Applications, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling

Page 11: Making the Network Disappear with Software Defined Networking

Page 13: Google Software Innovations Driven by Unprecedented Demand for Scale, Bandwidth, Reliability

[Timeline figure, 2002 to 2012. Systems shown: GFS, MapReduce, Bigtable, Borg, Pregel, Colossus, FlumeJava, Dremel, Spanner]

Page 14: Google Networking Innovations

Our distributed computing infrastructure required networks that did not exist

[Timeline figure, 2006 to 2016. Systems shown: Google Global Cache, B4, BwE, Jupiter, gRPC, Onix, Freedome, Watchtower, QUIC, Andromeda, BBR]

Page 15: DCN Bandwidth Growth

Traffic generated by servers in our datacenters

[Plot: aggregate traffic (y-axis, 1x to 50x) versus time (x-axis, Jul ‘08 to Nov ‘14)]
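The plot's axis endpoints allow a rough estimate of the implied growth rate. The sketch below assumes the curve reaches exactly 50x at the right edge of the time axis and that growth compounds steadily, neither of which the slide states; the function names are illustrative only.

```python
import math

# Axis endpoints read off the slide. Treating them as exact data points
# is an assumption made for this estimate.
GROWTH_FACTOR = 50.0                       # aggregate traffic grew from 1x to 50x
MONTHS = (2014 - 2008) * 12 + (11 - 7)     # Jul '08 through Nov '14


def annual_growth(factor: float, months: int) -> float:
    """Yearly multiplier implied by `factor` total growth over `months`."""
    return factor ** (12.0 / months)


def doubling_time_months(factor: float, months: int) -> float:
    """Months needed to double under the same steady compounding."""
    return months * math.log(2) / math.log(factor)


if __name__ == "__main__":
    print(f"period: {MONTHS} months")
    print(f"implied growth per year: {annual_growth(GROWTH_FACTOR, MONTHS):.2f}x")
    print(f"implied doubling time: {doubling_time_months(GROWTH_FACTOR, MONTHS):.1f} months")
```

Under these assumptions traffic a little less than doubles each year, which is the pace the network designs on the following pages had to match.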

Page 16: SDN Motivation

• Traditional network architectures could not keep up with bandwidth demands in the data center
• Operational complexity of “box-centric” deployment

Google’s DCN redesign, inspired by server & storage scale out:

• Clos Topologies
• Merchant Silicon
• Centralized Control → Software Defined Networking

Page 17: Why Balance Matters @ Building Scale

Amdahl’s lesser known law: 1 Mbit/sec of IO for every 1 MHz of computation in parallel computing

An unbalanced data center means:

• Some resource is scarce... limiting your value
• Other resources are idle... increasing your cost

Substantial resource stranding [Eurosys 2015] if we cannot schedule at scale

Page 18: Bandwidth @ Building Scale

[Diagram: compute slices, Flash, and NVM attached to the datacenter network]

• Compute slice: 64 × 2.5 GHz server, 100 Gb/s
• Flash: 100k+ IOPS, 100 us access, PBs of storage
• NVM: 1M+ IOPS, 10 us access, TBs of storage
• 50k servers → 5 Pb/s network??

Based on Amdahl’s observation, we might need a 5 Pb/s network

• Even with 10:1 oversub → 500 Tb/s datacenter network
• Every building needs more bisection than the Internet
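The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch, not the speaker's own calculation: it applies the 1 Mbit/sec per 1 MHz rule of thumb to the 64 × 2.5 GHz server, then uses the slide's 100 Gb/s per-server figure for the building totals. The rule gives 160 Gb/s per server, so the slide's 100 Gb/s is presumably a rounded or conservative value; that reading is an assumption, and all names below are illustrative.

```python
# Figures taken from the slide.
CORES = 64
CLOCK_HZ = 2_500_000_000            # 2.5 GHz per core
SLIDE_PER_SERVER_BPS = 100 * 10**9  # 100 Gb/s, the per-server figure on the slide
SERVERS = 50_000
OVERSUBSCRIPTION = 10               # the slide's 10:1 ratio


def amdahl_io_bps(cores: int, clock_hz: int) -> int:
    """Amdahl's rule of thumb: 1 Mbit/s of IO per 1 MHz, i.e. 1 bit/s per Hz."""
    return cores * clock_hz


def fabric_bps(servers: int, per_server_bps: int) -> int:
    """Aggregate bandwidth if every server drives its full rate at once."""
    return servers * per_server_bps


if __name__ == "__main__":
    rule = amdahl_io_bps(CORES, CLOCK_HZ)
    full = fabric_bps(SERVERS, SLIDE_PER_SERVER_BPS)
    print(f"rule of thumb per server: {rule / 1e9:.0f} Gb/s")
    print(f"full-rate fabric: {full / 1e15:.0f} Pb/s")
    print(f"with {OVERSUBSCRIPTION}:1 oversubscription: {full / OVERSUBSCRIPTION / 1e12:.0f} Tb/s")
```

With the slide's inputs this gives 5 Pb/s at full rate and 500 Tb/s at 10:1 oversubscription, matching the bullets above.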

Page 19: Latency @ Building Scale

[Diagram: compute slices, Flash, and NVM attached to the datacenter network, labeled 10 us latency]

• Flash: 100k+ IOPS, 100 us access, PBs of storage
• NVM: 1M+ IOPS, 10 us access, TBs of storage

To exploit future NVM, we need ~10 usec latency

• Even for Flash, we need 100 usec latency
• Or, expensive servers sit idle while they wait for IO
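A simple model shows why the latency targets track the device access times. The sketch below assumes remote access time is the device access time plus network latency, with no queueing or software overhead. That additive model and the sample network latencies are assumptions for illustration; only the 100 us and 10 us device figures come from the slide.

```python
# Device access times from the slide, in microseconds.
DEVICE_ACCESS_US = {"Flash": 100.0, "NVM": 10.0}

# Hypothetical network latencies to compare, in microseconds.
SAMPLE_NETWORK_US = (10.0, 100.0, 1000.0)


def remote_access_us(device_us: float, network_us: float) -> float:
    """Total time to reach remote storage under the additive model."""
    return device_us + network_us


def network_share(device_us: float, network_us: float) -> float:
    """Fraction of the total access time spent in the network."""
    return network_us / remote_access_us(device_us, network_us)


if __name__ == "__main__":
    for name, device_us in DEVICE_ACCESS_US.items():
        for network_us in SAMPLE_NETWORK_US:
            total = remote_access_us(device_us, network_us)
            share = network_share(device_us, network_us)
            print(f"{name}: network {network_us:.0f} us -> total {total:.0f} us, "
                  f"{share:.0%} spent in the network")
```

When network latency equals the device access time, half of each access is spent in the network. At 100 us of network latency, a 10 us NVM device spends about 91% of each access waiting on the network, which is the idle-server cost the slide warns about.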

Page 20: Availability @ Building Scale

[Diagram: 50k servers in compute slices, Flash, and NVM attached to the datacenter network]

Cannot take down a XX MW building for maintenance

• New servers always added; older ones decommissioned… with zero service impact
• Network evolves from 1G → 10G → 40G → 100G → ???

Page 21: Datacenter Network infrastructure to support Google scale, performance, and availability → underpinnings of Google Cloud Platform

Page 22: Five Generations of Networks for Google scale

[Diagram: Spine Blocks 1 to M, Edge Aggregation Blocks 1 to N, and server racks with ToR switches]
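The block structure in the diagram can be modeled as a small graph. The sketch below assumes every edge aggregation block connects to every spine block, which the slide's labels suggest but do not state, and it ignores ToR switches and link speeds. The `agg`/`spine` names and both functions are hypothetical.

```python
from itertools import product


def build_fabric(num_agg: int, num_spine: int) -> list[tuple[str, str]]:
    """Return (aggregation block, spine block) links for a full mesh
    between the two tiers. The full mesh is an assumption."""
    return [
        (f"agg{a}", f"spine{s}")
        for a, s in product(range(1, num_agg + 1), range(1, num_spine + 1))
    ]


def spine_paths(links: list[tuple[str, str]], src: str, dst: str) -> list[str]:
    """Spine blocks adjacent to both `src` and `dst`. Each one is an
    independent two-hop path between the two aggregation blocks."""
    via_src = {spine for agg, spine in links if agg == src}
    via_dst = {spine for agg, spine in links if agg == dst}
    return sorted(via_src & via_dst)


if __name__ == "__main__":
    links = build_fabric(num_agg=4, num_spine=3)
    print(f"{len(links)} inter-block links")
    print("paths agg1 -> agg4 via:", spine_paths(links, "agg1", "agg4"))
```

Under the full-mesh assumption, adding a spine block adds one more path between every pair of aggregation blocks, which is how this kind of fabric scales out by adding blocks.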

Page 23

[Plot: bisection bandwidth (y-axis, 1T to 1000T) versus year (x-axis, ‘04 to ‘12). Labeled points: Saturn, Firehose 1.0, Watchtower, Firehose 1.1, 4 Post, Jupiter]

Page 24

[Plot: bisection bandwidth (y-axis, 1T to 1000T) versus year (x-axis, ‘04 to ‘12). Labeled points: Saturn, Firehose 1.0, Watchtower, 4 Post, Firehose 1.1, Jupiter]

Jupiter

+ Scales out building wide 1.3 Pbps

Page 25

[Plot: bisection bandwidth (y-axis, 1T to 1000T) versus year (x-axis, ‘04 to ‘12). Labeled points: Saturn, Firehose 1.0, Watchtower, 4 Post, Firehose 1.1, Jupiter]

Jupiter

+ Enables 40G to hosts
+ External control servers
+ OpenFlow

Page 26: B4: Google's Software Defined WAN

B4: [Jain et al, SIGCOMM 13]
BwE: [Kumar et al, SIGCOMM 15]

Page 27: Andromeda Network Virtualization

[Diagram labels:
Virtual networks: VNET: 5.4/16, VNET: 192.168.32/24, VNET: 10.1.1/24
Internal Network: ToR switches with subnets 10.1.1/24, 10.1.2/24, 10.1.3/24, 10.1.4/24
Google Infrastructure Services: Load Balancing, DoS, ACLs, VPN, NFV]

Page 28: Making the Network Disappear

• Sufficiently high bandwidth, low latency, and low cost
• Fabrics, not boxes, programmable for performance and isolation
• Highest level of availability, zero downtime for new features/performance

Software Defined Networking enables the network to disappear, driving the next wave of computing

Page 29: Thank You!