Cloud 3.0 and Software Defined Networking · Cloud 3.0 and Software Defined Networking October 28,...
Amin Vahdat, Google Fellow, on behalf of Google Technical Infrastructure
Cloud 3.0 and Software Defined Networking
October 28, 2016
• This talk: example of the Google research model
• Driven by novel application requirements, we must solve problems at the frontier of computer systems
• The impact of doing so can be huge
• Our research question: how do we build a network that can allow a building to be the unit of storage access and a shared medium for compute?
Overview
https://g.co/research/networkinfra
Google Cloud Platform 3
1. Congestion Based Congestion Control, ACM Queue 2016.
2. An Internet-Wide Analysis of Traffic Policing, SIGCOMM 2016.
3. Evolve or Die: High-Availability Design Principles Drawn from Failures in a Global-Scale Content Provider, SIGCOMM 2016.
4. Maglev: A Fast and Reliable Software Network Load Balancer, NSDI 2016.
5. TIMELY: RTT-based Congestion Control for the Datacenter, SIGCOMM 2015.
6. Condor: Better Topologies through Declarative Design, SIGCOMM 2015.
7. Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing, SIGCOMM 2015.
8. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM 2015.
9. Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks, NSDI 2014.
10. B4: Experience With a Globally-Deployed Software Defined WAN, SIGCOMM 2013.
Subset of Google Networking Publications
• Distributed programming has faced similar challenges since sockets
• The free lunch in performance improvements is over
• Storage capacity has increased through disaggregation
• I/O latency gap remains
• “Next-gen” storage remains largely untapped at scale
Computing at a Crossroads
Networking will drive future improvements in compute performance by blurring the line between individual servers and then making them disappear.
Virtualization delivers capex savings to enterprise DCs
Cloud 1.0
Last Decade
Public cloud frees enterprise from private HW infrastructure
Scheduling, load balancing primitives, “big data” query processing
Cloud 2.0
HW on Demand
Now
Serverless compute, actionable intelligence, and machine learning
Not data placement, load balancing, OS configuration and patching
Cloud 3.0
Compute, not servers
The Third Wave of Cloud Computing
Timeline: Cloud 1.0, Cloud 2.0, Cloud 3.0
Networking should be aiming for Cloud 3.0
The Third Wave of Cloud Computing
Storage disaggregation: the datacenter is the storage appliance
Open Marketplace of services, securely placed and accessed
Transparent live migration
Seamless telemetry and scale up/down
Networking and Cloud 3.0
Applications, not VMs
Policy, not middleboxes
Actionable intelligence, not data processing
SLOs, not placement/load balancing/scheduling
Networking and Cloud 3.0
Making the Network Disappear with Software Defined Networking
Timeline figure (2002 to 2012), systems shown: Borg, GFS, MapReduce, Bigtable, Pregel, Colossus, FlumeJava, Dremel, Spanner
Google Software Innovations Driven by Unprecedented Demand for Scale, Bandwidth, Reliability
Timeline figure (2006 to 2016), systems shown: B4, Google Global Cache, BwE, Jupiter, gRPC, Onix, Freedome, Watchtower, QUIC, Andromeda, BBR
Google Networking Innovations
Our distributed computing infrastructure required networks that did not exist
DCN Bandwidth Growth
Traffic generated by servers in our datacenters
Plot: aggregate traffic (y-axis, 1x to 50x) over time (x-axis, Jul ‘08 to Nov ‘14)
• Traditional network architectures could not keep up with bandwidth demands in the data center
• Operational complexity of “box-centric” deployment
SDN Motivation
Google’s DCN redesign, inspired by server & storage scale out
• Clos Topologies
• Merchant Silicon
• Centralized Control → Software Defined Networking
Amdahl’s lesser known law: 1 Mbit/sec of IO for every 1 MHz of computation in parallel computing
An unbalanced data center means:
• Some resource is scarce...limiting your value
• Other resources are idle...increasing your cost
Substantial resource stranding [Eurosys 2015] if we cannot schedule at scale
Why Balance Matters @ Building Scale
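The balance argument on this slide can be illustrated with a small sketch. The capacities and per-task demands below are made-up numbers, not figures from the talk; the only point is that the scarcest resource caps how many tasks fit, and whatever remains of the other resources is stranded.

```python
def stranded_fraction(capacity, demand_per_task):
    """Return (max_tasks, stranded) for a pool of resources.

    capacity and demand_per_task are dicts keyed by resource name.
    The scarcest resource limits the task count; the unused share of
    every other resource is reported as stranded.
    """
    max_tasks = min(capacity[r] // demand_per_task[r] for r in capacity)
    stranded = {
        r: 1 - (max_tasks * demand_per_task[r]) / capacity[r]
        for r in capacity
    }
    return max_tasks, stranded


# Hypothetical cluster: plenty of CPU, too little network bandwidth.
capacity = {"cpu_cores": 6400, "network_gbps": 1000}
demand = {"cpu_cores": 4, "network_gbps": 1}

tasks, stranded = stranded_fraction(capacity, demand)
print(tasks, stranded)
```

In this hypothetical pool the network is the scarce resource, so a share of the CPU cores is paid for but cannot be scheduled.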
Diagram: compute slices, Flash, and NVM attached to the datacenter network
Compute slice: 64 × 2.5 GHz server, 100 Gb/s
Flash: 100k+ IOPS, 100 us access, PBs of storage
NVM: 1M+ IOPS, 10 us access, TBs of storage
50k servers → 5 Pb/s network??
Based on Amdahl’s observation, we might need a 5 Pb/s network
• Even with 10:1 oversub → 500 Tb/s datacenter network
• Every building needs more bisection than the Internet
Bandwidth @ Building Scale
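The arithmetic behind this slide can be checked with a short sketch. The function names are illustrative. The inputs (64 cores at 2.5 GHz, 100 Gb/s per server, 50k servers, 10:1 oversubscription) are the slide's own. Note that applying Amdahl's ratio directly to 64 × 2.5 GHz suggests 160 Gb/s per server; the 5 Pb/s figure follows from the slide's 100 Gb/s per server.

```python
MBIT_PER_MHZ = 1  # Amdahl's rule of thumb: 1 Mbit/s of IO per 1 MHz of compute


def amdahl_io_gbps(cores, ghz_per_core):
    """IO bandwidth (Gb/s) suggested by Amdahl's rule for one server."""
    mhz = cores * ghz_per_core * 1000
    return mhz * MBIT_PER_MHZ / 1000  # Mbit/s -> Gb/s


def building_bandwidth_pbps(servers, gbps_per_server, oversub=1):
    """Aggregate network bandwidth (Pb/s) for a building of servers."""
    return servers * gbps_per_server / oversub / 1e6  # Gb/s -> Pb/s


per_server = amdahl_io_gbps(64, 2.5)                      # Amdahl-balanced IO per server
full = building_bandwidth_pbps(50_000, 100)               # slide's 100 Gb/s per server
oversubscribed = building_bandwidth_pbps(50_000, 100, oversub=10)

print(per_server, full, oversubscribed)
```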
To exploit future NVM, we need ~10 usec latency
• Even for Flash, we need 100 usec latency
• Or, expensive servers sit idle while they wait for IO
Latency @ Building Scale
Diagram: compute slices, Flash (100k+ IOPS, 100 us access, PBs of storage), and NVM (1M+ IOPS, 10 us access, TBs of storage) attached to a datacenter network with 10 us latency
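A rough sketch of why network latency must match device latency. It assumes a deliberately simple model that is not from the talk: one synchronous remote read at a time, with total time equal to device access time plus network time. The device latencies are the slide's; the 100 us network figure is a hypothetical comparison point.

```python
def remote_access(device_us, network_us):
    """Model one synchronous remote read as device time plus network time."""
    total = device_us + network_us
    return {
        "total_us": total,
        "network_share": network_us / total,  # fraction of the wait spent in the network
        "sync_iops": 1e6 / total,             # reads per second with one outstanding request
    }


nvm_fast_net = remote_access(device_us=10, network_us=10)
nvm_slow_net = remote_access(device_us=10, network_us=100)  # hypothetical slower fabric

print(nvm_fast_net)
print(nvm_slow_net)
```

Under this model a 10 us device behind a 100 us network spends most of each read waiting on the network, which is the idle-server cost the slide warns about.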
Cannot take down a XX MW building for maintenance
• New servers always added; older ones decommissioned… with zero service impact
• Network evolves from 1G → 10G → 40G → 100G → ???
Availability @ Building Scale
Diagram: compute slices, Flash, and NVM attached to the datacenter network; 50k servers
Datacenter Network infrastructure to support Google scale, performance, and availability → underpinnings of Google Cloud Platform
Diagram: server racks with ToR switches, Edge Aggregation Blocks 1 to N, and Spine Blocks 1 to M
Five Generations of Networks for Google scale
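To give a feel for how Clos fabrics scale out from small switches, here is a sketch of the textbook three-tier k-ary fat-tree, a special case of a folded Clos. This is an assumption chosen for illustration, not the block structure of Google's fabrics; the port count and link speed below are arbitrary.

```python
def fat_tree(k):
    """Size a textbook three-tier k-ary fat-tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    edge = k * k // 2          # k pods, k/2 edge switches per pod
    aggregation = k * k // 2   # k pods, k/2 aggregation switches per pod
    core = (k // 2) ** 2
    return {
        "hosts": k ** 3 // 4,
        "switches": edge + aggregation + core,
    }


def bisection_gbps(k, link_gbps):
    """Bisection bandwidth of a non-blocking fat-tree: half the hosts at line rate."""
    return fat_tree(k)["hosts"] * link_gbps / 2


size = fat_tree(48)                 # hypothetical 48-port switches
bisection = bisection_gbps(48, 10)  # hypothetical 10 Gb/s host links
print(size, bisection)
```

Host count grows with the cube of the switch port count, which is why many small commodity switches can replace a few large ones.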
Plot: bisection bandwidth (y-axis, 1T to 1000T) by year (x-axis, ‘04 to ‘12) across network generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter
Jupiter:
+ Scales out building wide 1.3 Pbps
+ Enables 40G to hosts
+ External control servers
+ OpenFlow
B4: [Jain et al, SIGCOMM 13]
BwE: [Kumar et al, SIGCOMM 15]
B4: Google's Software Defined WAN
Diagram: virtual networks (VNET: 5.4/16, VNET: 192.168.32/24, VNET: 10.1.1/24) over an internal network of ToR subnets (10.1.1/24, 10.1.2/24, 10.1.3/24, 10.1.4/24)
Other labels: Load Balancing, DoS, ACLs, VPN, NFV; Google Infrastructure Services; Internal Network
Andromeda Network Virtualization
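The idea of virtual networks layered over physical ToR subnets can be sketched as a lookup from (virtual network, virtual IP) to a physical host address. Everything here is hypothetical: the tenant names, the table contents, and the `resolve` helper are illustrations, not Andromeda's actual mechanism. The slide's shorthand prefixes are expanded under the assumption that 10.1.1/24 means 10.1.1.0/24.

```python
import ipaddress

# Hypothetical virtual networks, one per tenant.
VNETS = {
    "tenant-a": ipaddress.ip_network("10.1.1.0/24"),
    "tenant-b": ipaddress.ip_network("192.168.32.0/24"),
}

# Hypothetical placement table: (vnet, virtual IP) -> physical host IP.
VM_LOCATIONS = {
    ("tenant-a", ipaddress.ip_address("10.1.1.5")): ipaddress.ip_address("10.1.3.17"),
    ("tenant-b", ipaddress.ip_address("192.168.32.5")): ipaddress.ip_address("10.1.4.9"),
}


def resolve(vnet, virtual_ip):
    """Map a tenant's virtual IP to the physical host currently running it."""
    vip = ipaddress.ip_address(virtual_ip)
    if vip not in VNETS[vnet]:
        raise ValueError(f"{vip} is outside virtual network {vnet}")
    return VM_LOCATIONS[(vnet, vip)]


print(resolve("tenant-a", "10.1.1.5"))
```

Keying the table on the virtual network as well as the address lets a tenant prefix such as 10.1.1.0/24 coincide with a physical subnet without ambiguity, and moving a VM only requires updating its table entry.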
• Sufficiently high bandwidth, low latency, and low cost
• Fabrics, not boxes, programmable for performance and isolation
• Highest level of availability, zero downtime for new features/performance
Making the Network Disappear
Software Defined Networking enables the network to disappear, driving the next wave of computing
Thank You!