
#vmworld

Deep Dive: Run Kubernetes in Production with PKS

James Webb: T-Mobile MTS, Platform Engineering

Merlin Glynn: VMware, PKS Product Management

#CNA1674BE

VMworld 2018 Content: Not for publication or distribution

Disclaimer

©2018 VMware, Inc.

This presentation may contain product features or functionality that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined.

Agenda

Table of Contents

1. PKS Episodes I-III (The Prequel): PKS Design (10 mins)

2. PKS Episode IV, Day 0: Architecting for Production (15 mins)

3. PKS Episode V, Day 1: Developer Onboarding & Self Service (10 mins)

4. PKS Episode VI, Day 2 (15 mins)
   ○ Networking & Security
   ○ Persistent Storage
   ○ Monitoring & Logging
   ○ Top 3 Real World Challenges to Look Out for

5. PKS Episodes VII-IX: Q&A (10 mins)

PKS Episodes I-III

A PKS Prequel Story: PKS Design

PKS & PAS & Functions

Development Teams

1. PAS: "Here is my source code. Run it on the cloud for me. I do not care how."

   What                        Who
   Writes Code                 Developer
   Builds Image                Platform
   Defines How it is Exposed   Platform

2. PKS: "Here is my built code. Run it on the cloud for me. I will tell you how."

   What                        Who
   Writes Code                 Developer
   Builds Image                Developer & Pipeline
   Defines How it is Exposed   Developer & Pipeline

3. Functions: "Here is what I need. Run it on the cloud for me. Stop it when it's done."

   What                        Who
   Calls a Function            Developer

[Diagram labels: API Rqst, Code, Image, AI, POD, Buildpack]

Who is PKS Built For?

IT Operator

– PRE (Platform Reliability Engineering)

– Deploy, Scale, Operate PKS

– Physical Infrastructure is Operated

– Network & Security Control Policy is defined

• Developers
  – Writes code, code deployed using CI/CD
  – Focus on business problems and innovation

• Application Dev/Ops owner
  – Automate Everything
  – Agile
  – Serve developers

• Platform Reliability Engineers
  – Platform is Reliable
  – Capacity is planned for
  – Platform is Secured & Controlled
  – Platform is Auditable

Application Dev/Ops Owner

Platform Reliability Engineer

– Develop, Deploy, Scale, Monitor Apps

– Innovation of Business Capability as Cloud native Apps

– Create K8s cluster, scale clusters and maintain the health customers

– Provide developer access to the cluster

Development Teams

PKS Design Overview: BOSH

PKS Design Overview: A PKS Prequel Story

● It all Starts with an IaaS

● Multi Cloud is a Key ‘Theme’ of PKS

○ Common Ops across clouds

○ Azure Coming Soon

PKS Design Overview: Control Plane Design

PRE

○ Deploys PKS Control Plane

PKS Design Overview: Deploy a Cluster

ADO

○ Create Cluster

w/ NSX-T

PKS Design Overview: K8s & PKS

Developer or CD

○ Uses Cluster

w/ NSX-T

PKS Design Overview: K8s & PKS

Developer or CD

○ Uses Cluster

w/out NSX-T

PKS Episode IV

Day 0: Architecting for Production: Real World w/ T-Mobile

Background

Who we are - T-Mobile Platform Engineering

● 25 member team supporting customer facing platforms
  ○ Pivotal Application Service (PAS)
  ○ Pivotal Container Service (PKS)
  ○ Open Source K8S
  ○ BOSH
● Part of a larger organization supporting all IT infrastructure for T-Mobile

Where we were - Jan 2018

● IaaS - 30,000+ VMs
● PaaS - 22,000+ Pivotal Application Service (PAS) Containers
● CaaS - ~300 Containers running in PAS
● Goal: Evaluate and build on-premise K8S offering

CaaS Gap

DevOps teams looking for a place to run Docker containers on-premise

● No standard on-premise offering
● Docker in PAS is not an ideal experience
  ○ Upgrades not seamless
  ○ No persistent storage
  ○ TCP Routing - good but not great for all use cases
● DevOps teams often running their own Docker platforms on VMs

On-Prem CaaS Requirements

Platform Team:

● Highly Available
  ○ Control Plane (etcd/API)
  ○ Worker Nodes
  ○ Authn/Authz
● Scalable
  ○ Control Plane (API)
  ○ Worker Nodes
● Automated Deployment
  ○ Control Plane (OpsMan/Bosh)
  ○ Cluster builds
● No Downtime Lifecycle Management
  ○ K8S Upgrades
  ○ OS Patching
  ○ Infrastructure Maintenance
● LDAP Integration
● API Configurability

DevOps Teams:

● Native K8S Experience
  ○ Container Orchestration
● Persistent Storage
  ○ Single AZ
  ○ Intra-AZ Replication
  ○ Cross-Region Replication
● PAS-like HTTPS experience
  ○ Certificate
  ○ DNS
  ○ Load Balancing
● TCP Ingress
  ○ Load Balancing

HL Physical Architecture

Region

● 3 AZs
  ○ Network
  ○ Compute
  ○ Storage
● High Bandwidth/Low Latency East/West Networking

● Multiple Regions per Data Center
  ○ Isolated network & power
● Near/Near/Far Availability Strategy

PKS – Architecture Challenges

Thermal Exhaust Ports:

● Authn - PKSCLI/UAA is not HA Yet …
  ○ AZ failure results in no new LDAP auth until resolved
  ○ Clear recoverability process into new AZ not yet well defined
● API Configurability
  ○ Need access to more API flags to support cluster customization
    ■ PodPresets
    ■ PodSecurityPolicy
    ■ ...
● Scalability
  ○ Worker scale up available, scale down coming
  ○ 200 Node Worker Limit is tested scale
  ○ Cannot scale K8s API nodes independently of etcd nodes

PKS Episode V

Day 1: Developer Onboarding

Who Does What?

Are they the Same Person?

[Diagram: the PRE, ADO, and Developer roles shown around AD/LDAP, UAA (OIDC), the PKS API, and a Kubernetes cluster (Master kube-api, four Namespaces). Action labels: Operates PKS, Set PKS RBAC, PKS Create-Cluster, Set K8s RBAC, Access K8s.]

PKS – Control Plane RBAC Basics

UAA Scopes

● pks.clusters.admin: Can Access All Clusters
● pks.clusters.manage: Can Only Access Clusters They Create

[Diagram: Application Dev/Ops Owners “Fred” and “Ethel” reach the PKS API through UAA, one with the manage scope and one with the admin scope. Clusters shown: Fred’s K8s Cluster, Ethel’s K8s Cluster, Rick’s K8s Cluster.]

K8s – OIDC / RBAC Basics

[Diagram: AD/LDAP, UAA (OIDC), PKS API, and a cluster Master kube-api with Namespaces A, B, C, and D. Labels: ADO “Lamont”, Developer “Fred”, kubectl create, K8s Role, K8s RoleBinding.]

K8s Role

    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      namespace: namespace-a    # "Namespace A" in the diagram; real names must be lowercase with no spaces
      name: pod-reader
    rules:
    - apiGroups: [""]           # "" is the core API group
      resources: ["pods"]
      verbs: ["get", "watch", "list"]

K8s RoleBinding

    kind: RoleBinding           # Can also apply at Cluster level
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: read-pods
      namespace: namespace-a
    subjects:
    - kind: User                # Can support LDAP Groups as well
      name: Fred                # Name is case sensitive
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role                # this must be Role or ClusterRole
      name: pod-reader          # this must match the name of the Role or ClusterRole you wish to bind to
      apiGroup: rbac.authorization.k8s.io

Putting it Together …

[Diagram: AD/LDAP, UAA (OIDC), PKS API, and a Kubernetes cluster (Master kube-api) with namespaces dev, B, C, and D.]

1. PRE “James”
   1. pks login (James has the pks.manage role)
   2. pks create-cluster omni-app
   3. pks get-credentials omni-app
   4. kubectl create -f cluster-role-binding.yaml
      a. Bind Cluster admin role on Cluster to LDAP group “CN=omni-app-admins” (Lamont is a member)

2. ADO “Lamont”
   1. get kubeconfig (jwt token from UAA/OIDC)
   2. kubectl create namespace dev
   3. kubectl create -f role-binding.yaml
      a. Bind NamespaceAdmin role on namespace dev to LDAP group “CN=omni-app-devteam” (Fred is a member)

3. Developer “Fred”
   1. get kubeconfig (jwt token from UAA/OIDC)
   2. kubectl create -f my-app.yaml -n dev
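The slide names cluster-role-binding.yaml and role-binding.yaml but does not show their contents. A minimal sketch of what they might look like follows. It assumes the built-in cluster-admin ClusterRole for the cluster-wide binding, and the built-in admin ClusterRole as a stand-in for the slide's "NamespaceAdmin" role. The metadata names are illustrative, and the exact group string depends on how UAA/OIDC presents LDAP groups to the API server.

```yaml
# cluster-role-binding.yaml (hypothetical contents): cluster-wide binding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: omni-app-admins              # illustrative name
subjects:
- kind: Group
  name: "CN=omni-app-admins"         # must match the group claim in the token exactly
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin                # built-in Kubernetes ClusterRole
  apiGroup: rbac.authorization.k8s.io
---
# role-binding.yaml (hypothetical contents): binding scoped to namespace dev
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: omni-app-devteam-admin       # illustrative name
  namespace: dev
subjects:
- kind: Group
  name: "CN=omni-app-devteam"
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                        # built-in; assumed stand-in for "NamespaceAdmin"
  apiGroup: rbac.authorization.k8s.io
```

Binding to LDAP groups rather than individual users means membership changes in the directory take effect without touching the cluster.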

Putting it Together for Production:

Are the PRE & ADO the Same Person?

● Q: Why didn’t James (PRE) grant pks.admin or pks.manage to Lamont (ADO), i.e. allow self-service creation of the K8s cluster?

● A: James needs some way to limit what Lamont can create and to enable Lamont’s team to perform certain actions on the cluster
    ■ Resource Quotas
    ■ Tenant / Group Ownership of Clusters

Quotas & Tenancy Hierarchy

PKS Episode VI

Day 2: The Challenges (Hard Stuff)

Automation

Challenge: Automate all the things
Solution: Concourse (turtles all the way down)

● Bootstrap side-car BOSH environment (via Concourse)
● Deploy Concourse to support environment pipelines
● Deploy Opsman (PCF Pipelines)
● Deploy PKS (PCF Pipelines)
● Deploy PKS clusters (custom)
● Post cluster install configuration (custom)
  ○ Front-End LBs
  ○ Ingress
  ○ Monitoring
  ○ Persistent Storage
  ○ Logging
  ○ ...
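The "custom" pipeline stages are not shown in the deck. As a rough sketch of how cluster creation and post-install configuration could be chained in Concourse, the pipeline below uses a hypothetical Git repository, hypothetical job names, and hypothetical task files. Only the pipeline structure (resources, jobs, get/task steps, passed constraints) is standard Concourse.

```yaml
# Hypothetical Concourse pipeline: repository URL, job names, and task files are illustrative.
resources:
- name: platform-config
  type: git
  source:
    uri: https://git.example.com/platform/pks-config.git
    branch: master

jobs:
- name: create-pks-cluster
  plan:
  - get: platform-config
    trigger: true                    # run whenever the config repository changes
  - task: create-cluster
    file: platform-config/tasks/create-cluster.yml

- name: post-install-config
  plan:
  - get: platform-config
    trigger: true
    passed: [create-pks-cluster]     # only versions that made it through cluster creation
  - task: configure-ingress
    file: platform-config/tasks/configure-ingress.yml
  - task: configure-monitoring
    file: platform-config/tasks/configure-monitoring.yml
```

Keeping the task definitions in the same repository as the configuration means a single commit both describes and triggers a change.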

Cluster Ownership

Challenge: Enable (but don’t burden) DevOps customers
Solution: Managed Clusters

● Platform team manages:
  ○ Infrastructure (Compute, Network, Storage)
  ○ Cluster install/upgrades
  ○ Base cluster tooling (monitoring, logging, ingress, persistent storage, …)
● Multi-tenant clusters
  ○ Economies of scale
  ○ Fewer objects to manage
● Single tenant clusters where it makes sense
  ○ Sensitive environments
  ○ Advanced customers who need more control

User Management & Access

Challenge: Efficiently managing a lot of users and teams (with audit trails)
Solution: GitOps

● Adapted from PAS management tooling
● Namespace & user management in source control
  ○ Namespace quotas & configuration
  ○ DevOps team leaders control who has access to their namespace
  ○ User management pipelined for self-service
  ○ Quota/config changes generate a pull request to CaaS Platform Team for review
● LDAP Integration via UAA and PKS cli
● User token generation/management cumbersome – better tooling in the works
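To make "namespace quotas in source control" concrete, here is a sketch of the kind of manifest such a repository might hold. The namespace name, quota name, and limits are all illustrative assumptions. Only the ResourceQuota kind and its resource keys are standard Kubernetes.

```yaml
# Hypothetical per-namespace quota kept in Git; a change here would arrive as a pull request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                   # illustrative name
  namespace: omni-app-dev            # illustrative namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
    persistentvolumeclaims: "10"
```

Because the quota lives in a reviewed file, the Git history doubles as the audit trail the challenge calls for.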

Observability

Challenge: See all the things
Solution: Prometheus/Grafana

● Leveraging existing PAS tooling to PKS Opsman/BOSH framework
● Prometheus at every layer
  ○ Proactive monitoring and alarming
  ○ Metrics dashboards
  ○ Capacity planning
  ○ Data available for export to aggregation engines
● Cluster wide APM service being evaluated
  ○ Currently bring your own APM
● Pod logging by default
  ○ DevOps teams can customize as needed
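As an illustration of "proactive monitoring and alarming", a Prometheus alerting rule file might look like the sketch below. The group name, alert name, threshold, and severity label are assumptions and not taken from the talk. The `up` metric is the one Prometheus records for every scrape target.

```yaml
# Hypothetical Prometheus rule file: fires when any scrape target stays down for 5 minutes.
groups:
- name: platform-availability        # illustrative group name
  rules:
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      severity: page                 # illustrative routing label
    annotations:
      summary: "Scrape target {{ $labels.instance }} is down"
      description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for 5 minutes."
```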

Traffic Management – (LB, Ingress, …)

Challenge: HTTPS/TCP/UDP/IP Spaghetti
Solution: It’s complicated

● Chose not to use NSX
● External LTMs direct traffic to clusters
  ○ Per cluster configuration of LTM (automation needed)
  ○ Provide wildcard DNS & certificate for HTTPS ingress
  ○ Support bring your own certificate as well
● Evaluating TCP ingress solutions
● Evaluating Envoy & Istio
  ○ mTLS & egress routing solve a lot of problems
● Smattering of NodePort
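For the HTTPS ingress path, a developer-facing manifest could look roughly like the sketch below. The hostname, wildcard domain, Service name, port, and TLS secret name are all hypothetical. The `tls` section corresponds to the bring-your-own-certificate case; how a platform-wide wildcard certificate is applied when it is omitted depends on the ingress controller in use, which the deck does not name.

```yaml
# Hypothetical Ingress; hostnames, Service, and secret names are illustrative.
apiVersion: networking.k8s.io/v1     # clusters from the 2018 era used extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  namespace: dev
spec:
  tls:
  - hosts:
    - my-app.apps.example.com
    secretName: my-app-tls           # bring-your-own certificate stored as a TLS Secret
  rules:
  - host: my-app.apps.example.com    # resolves through the wildcard DNS record
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 8080
```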

Persistent Storage

Challenge: Replicated Volumes across AZs
Solution: Software

● AZ local storage
  ○ VMware does not support (coming in k8s 1.14?)
● SDS layer
  ○ Local storage presented to worker nodes via VMDK
  ○ Single-AZ storage class (data can be lost or is replicated by application)
  ○ Multi-AZ storage class (SW replicated, 2 or 3 RF)
  ○ Replicated storage at PVC layer easy button for app teams
  ○ Pod scheduling optimizes location (if possible)
● Evaluating CSI drivers for iSCSI storage devices
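From an app team's point of view, the "easy button" would amount to naming a storage class in a PersistentVolumeClaim. The sketch below assumes a class called multi-az-rf3 for the replicated tier. That class name, the claim name, and the size are hypothetical, and the SDS product behind the class is not identified in the deck.

```yaml
# Hypothetical claim against an assumed multi-AZ, replication-factor-3 storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data                  # illustrative name
  namespace: dev
spec:
  storageClassName: multi-az-rf3     # hypothetical class name for the replicated tier
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```

Switching to a single-AZ class would be a one-line change, which keeps the replication trade-off visible in the manifest.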

PKS Episode VII-X

Q & A

https://maps.t-mobile.com

THANK YOU!

#vmworld #CNA1674BE
