Anynines - Running Cloud Foundry for 12 months - An experience report

Post on 22-Nov-2014

anynines runs a public PaaS, based on Cloud Foundry, in a German datacenter. In more than 12 months of running a Cloud Foundry PaaS, many lessons have been learned about security, high availability, OpenStack, and other exciting topics. See how Bosh can be used and how it shouldn't be used, learn how to perform Cloud Foundry upgrades, and read how to harden Cloud Foundry by adding more fault tolerance with Pacemaker.

Running Cloud Foundry: An Experience Report

About this talk

• An opinionated look at running Cloud Foundry (CF)

• How to shoot yourself in the foot with CF over-commitment settings

• How to perform CF updates

• How to harden CF

• Wise words about CF services

Introduction

about.me/fischerjulian

Running a public Cloud Foundry for more than a year.

It works.

In order to run Cloud Foundry smoothly …

… refer to the package leaflet for risks and side effects and consult Pivotal, CloudCredo, or anynines.

The details

The anynines Stack

Hardware

OpenStack

Cloud Foundry

VMware

We migrated from rented VMware to self-hosted OpenStack.

For more details on this: http://rh.gd/a9vmw2sos

Proof point made…

Cloud Foundry protects investments in software development by being infrastructure-agnostic.

Running Cloud Foundry: What happened

Security Issues

• Pivotal informs partners early about issues

• Usually along with fixes

OpenStack Issues

• Ext4 vs. Ext3

• DEA MTU

• rsyslogd command not found

CF Gotchas

DEA evacuation & Bosh timeout race condition

• Removing a DEA → apps will be evacuated → the DEA will be stopped

• Bosh deployment will fail when evacuation takes longer than the Bosh timeout

• Set your Bosh timeout accordingly!
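The timeout in question lives in the deployment manifest's update block. A sketch of what that could look like (job names omitted; all values are illustrative, times are millisecond ranges):

```yaml
# Bosh deployment manifest "update" block (sketch).
# The upper bound of the watch-time ranges must exceed the
# worst-case DEA evacuation time, or the deployment fails mid-evacuation.
update:
  canaries: 1
  max_in_flight: 1
  canary_watch_time: 30000-600000   # 30 s to 10 min
  update_watch_time: 30000-600000
```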

DEA over-commitment

Default over-commitment factor = 4

RAM peaks may cause random errors

• Failures during staging

• Random application crashes

• No meaningful log information

Reducing over-commitment

• Native strategy

• Reduce over-commitment factor

• Bosh deploy

• 8 GB VM, OC factor 4 → Announces 32 GB (V)RAM

• 8 GB VM, OC factor 2 → Announces 16 GB (V)RAM

• When evacuating a 32 GB (V)RAM host, another 32 GB (V)RAM host will be preferred (more free space)
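The announced-capacity arithmetic behind these numbers is simple enough to sketch:

```python
def announced_ram_mb(physical_mb, oc_factor):
    """RAM a DEA advertises to the placement logic:
    physical RAM multiplied by the over-commitment factor."""
    return physical_mb * oc_factor

# The two cases from above:
print(announced_ram_mb(8 * 1024, 4))  # 32768 -> announces "32 GB"
print(announced_ram_mb(8 * 1024, 2))  # 16384 -> announces "16 GB"
```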

Evacuation Wave

Four 1 GB apps evacuated in one wave = maximum impact on running apps!

New DEAs (OC 2) will receive apps when old DEAs (OC 4) have been stopped.
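Why the wave hits the remaining heavily over-committed hosts first can be seen in a simplified model of the placement preference (not the actual Cloud Controller code, just the "most free announced RAM wins" idea):

```python
def pick_dea(deas, app_mb):
    """Simplified DEA placement model: among DEAs with enough free
    announced RAM, prefer the one with the most free space."""
    fitting = [d for d in deas if d["free_mb"] >= app_mb]
    return max(fitting, key=lambda d: d["free_mb"])

deas = [
    {"name": "old-oc4", "free_mb": 20000},  # announces 32 GB (factor 4)
    {"name": "new-oc2", "free_mb": 9000},   # announces 16 GB (factor 2)
]
# Evacuated 1 GB apps all land on the old, more over-committed DEA:
print(pick_dea(deas, 1024)["name"])  # old-oc4
```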

Hints

• Create 2nd resource pool for new DEAs

• Deploy and start the 2nd resource pool before stopping the old DEAs

• (-) Needs more resources

• (+) Smoother transition
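Running old and new DEAs side by side amounts to a second resource pool in the deployment manifest; a hedged sketch (pool names, stemcell, and instance types are placeholders):

```yaml
# Illustrative Bosh v1 manifest fragment: two DEA pools side by side.
resource_pools:
- name: dea            # existing DEAs, over-commitment factor 4
  network: cf
  stemcell: {name: bosh-openstack-kvm-ubuntu, version: latest}
  cloud_properties: {instance_type: m1.xlarge}
- name: dea-oc2        # second pool for the re-deployed DEAs, factor 2
  network: cf
  stemcell: {name: bosh-openstack-kvm-ubuntu, version: latest}
  cloud_properties: {instance_type: m1.xlarge}
```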

Updating Cloud Foundry

Required: Staging System

• Structurally identical

• Fewer VMs

1. Determine new features since the last release

2. Study deployment manifest changes

3. Apply deployment manifest changes

4. First staging attempt

5. Debug and fix it!

6. Simulate the live upgrade

7. Schedule maintenance on status.anynines.com

8. Perform the upgrade and cross fingers.
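Steps 2 and 3 are mostly a structural diff between the old and new deployment manifests. A small sketch of that idea, operating on manifests already parsed into dicts (the example values are made up):

```python
def manifest_changes(old, new, path=""):
    """List keys added, removed, or changed between two parsed
    deployment manifests."""
    changes = []
    for key in sorted(set(old) | set(new)):
        here = f"{path}/{key}"
        if key not in old:
            changes.append(f"added   {here}")
        elif key not in new:
            changes.append(f"removed {here}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            changes.extend(manifest_changes(old[key], new[key], here))
        elif old[key] != new[key]:
            changes.append(f"changed {here}")
    return changes

old = {"properties": {"dea": {"memory_mb": 32768}}}
new = {"properties": {"dea": {"memory_mb": 16384, "stacks": ["lucid64"]}}}
print(manifest_changes(old, new))
# ['changed /properties/dea/memory_mb', 'added   /properties/dea/stacks']
```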

CF Hardening

Accept that VMs are ephemeral

VM Failover Strategies

Resurrect

• Monitor VM

• Re-Build VMs automatically

• e.g. using Cloud Foundry Bosh

• + Easy

• - Takes long (minutes, not seconds)

• - OpenStack doesn’t release persistent disks automatically
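Stripped of the IaaS plumbing, resurrection is just a monitoring sweep that decides which VMs to re-build (the actual re-build through the IaaS is what takes minutes, not seconds):

```python
def resurrection_pass(vms, is_healthy):
    """One monitoring sweep: return the VMs that must be re-built.
    Hypothetical helper names; the real work happens in the IaaS calls."""
    return [vm for vm in vms if not is_healthy(vm)]

vms = ["uaa/0", "cc/0", "dea/3"]
print(resurrection_pass(vms, lambda vm: vm != "dea/3"))  # ['dea/3']
```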

Failover to Standby VM

Distribute CF components across availability zones

• Build disjoint networks, racks, etc.

• Each disjoint zone = availability zone

• Tell your IaaS about availability zones

• On provision choose the AZ

• Build Bosh releases accordingly

• Provide stand-by VM

• Monitor VM and perform failover

• IP failover using Pacemaker

• + Fast failover (seconds)

• - Pacemaker not easy to use (& boshify)

• - Increased resource usage by standby VM(s)

• 2 * UAA

• 2 * CC

• 2 * n * DEAs

• 2 * Health Manager

• …
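The Pacemaker side of such an IP failover can be sketched in crm-shell syntax. Resource names, the address, and the "uaa" LSB init script below are assumptions, not the actual anynines configuration:

```text
# crm configure sketch: a virtual IP that follows the UAA process.
primitive cf_vip ocf:heartbeat:IPaddr2 \
    params ip=10.0.1.10 cidr_netmask=24 \
    op monitor interval=10s
primitive cf_uaa lsb:uaa \
    op monitor interval=15s
colocation uaa_with_vip inf: cf_uaa cf_vip
order vip_before_uaa inf: cf_vip cf_uaa
```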

UAA & CC DB = SPOF

HA Postgres

• UAA and Cloud Controller database

• Single point of failure for Cloud Foundry

• Postgres is not inherently clusterable → failover with a standby VM

• Master/slave replication

• Pacemaker/Corosync

• IP-Failover using NIC-reattachment
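The master/slave replication part boils down to a handful of PostgreSQL settings (9.x-era streaming replication; host, user, and values are illustrative):

```text
# Master postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# Standby recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=10.0.1.11 port=5432 user=replicator'
```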

That’s half way towards a PostgreSQL CF Service

• Add a V2 Service Broker

• Add a provisioning logic

• Provision a 2-node DB cluster on "cf create-service postgres medium-cluster"
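The broker's side of that command starts with the V2 catalog endpoint. A minimal sketch of what GET /v2/catalog could return; the IDs and descriptions are made up, only the "medium-cluster" plan name comes from the slide:

```python
def catalog():
    """Response body a minimal V2 service broker returns on GET /v2/catalog."""
    return {
        "services": [{
            "id": "a9-postgresql",                    # illustrative ID
            "name": "postgres",
            "description": "Dedicated PostgreSQL clusters",
            "bindable": True,
            "plans": [{
                "id": "a9-postgresql-medium-cluster",  # illustrative ID
                "name": "medium-cluster",
                "description": "2-node streaming-replication cluster",
            }],
        }]
    }

print(catalog()["services"][0]["plans"][0]["name"])  # medium-cluster
```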

Services

“The best way to find yourself is to lose yourself in the service of others.”

― Mahatma Gandhi

Wardenized services (community services) are cute for pet projects.

Not suitable for production.

• Implementations are outdated

• One size doesn’t fit all!

No production CF without high quality services.

CF Service Design

• Use clusterable services if possible

• Implement automatic failover if not

• Autoprovisioning using Bosh

• Organize self-healing

• (Semi-)Automatic recovery from degraded mode

Summary

• Bosh & the CF release are powerful, yet you can cut yourself.

• HA services are essential.

• CF is ready to be used in production.

Questions?

Thank you!