OpenStack HA - Theory to Reality


Transcript of OpenStack HA - Theory to Reality


Gerd Prüßmann, Cloud Architect, Deutsche Telekom AG (@2digitsleft)
Shamail Tahir, Cloud Architect, EMC Office of the CTO (@ShamailXD)
Sriram Subramanian, Founder & Cloud Specialist, CloudDon (@sriramhere)
Kalin Nikolov, Cloud Engineer, PayPal

Agenda

OpenStack HA - Introduction
Active/Active
Active/Passive
DT Implementation
eBay/PayPal Implementation
Summary

OpenStack HA - Introduction

What does it mean?
Why is it not enabled by default?
Stateless vs Stateful
Challenges
More than one way

Active/Passive
Active/Active

Is This?

Or This?

Active/Active

API Service Endpoints
Database
Networking

Active/Active

● The OpenStack High Availability (HA) concept depends on the components used, e.g. network virtualization, storage backend, database system etc.
● Various technologies are available to realize HA:
Vendors use combinations, e.g. Pacemaker, Corosync, Galera, Keepalived, HAProxy, VRRP, DRBD … or their own tools

The following description is derived from the generic proposal in the OpenStack HA guide: http://docs.openstack.org/high-availability-guide/content/index.html

Active/Active

● Target: try to have all services of the platform highly available

Redundancy and resiliency against single service / node failure

● stateless services are load balanced (HAProxy + Keepalived)

o e.g. API endpoints / nova-scheduler

● stateful services use individual HA technologies

o e.g. RabbitMQ, MySQL DB etc.

o might be load balanced as well

● for some services/agents no built-in HA feature is available (these fall back to an active/passive Pacemaker/Corosync setup)

Active/Active - API service endpoints

API endpoints
● deploy on multiple nodes
● configure load balancing with virtual IPs in HAProxy
● use HAProxy's VIPs to configure the respective identity endpoints
● all service configuration files refer to these VIPs only
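As an illustration, a minimal haproxy.cfg fragment for one such endpoint might look like the sketch below; the VIP address, backend addresses and node names are hypothetical placeholders, not values from the talk.

    # haproxy.cfg (sketch): Keystone public API behind a Keepalived-managed VIP
    listen keystone_public
      bind 10.0.0.100:5000                               # virtual IP (placeholder)
      balance roundrobin
      option httpchk                                     # basic HTTP health check per backend
      server controller1 10.0.0.11:5000 check inter 2000 rise 2 fall 5
      server controller2 10.0.0.12:5000 check inter 2000 rise 2 fall 5
      server controller3 10.0.0.13:5000 check inter 2000 rise 2 fall 5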

schedulers
● nova-scheduler, nova-conductor, cinder-scheduler, neutron-server, ceilometer-collector, heat-engine
● schedulers will be configured with the clustered RabbitMQ nodes
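A sketch of how a service could be pointed at all RabbitMQ cluster members via the oslo.messaging options of that era; the host names are placeholders.

    # nova.conf / cinder.conf / neutron.conf etc. (sketch)
    [DEFAULT]
    rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672   # all cluster members (placeholders)
    rabbit_ha_queues = True                                  # use the mirrored queues
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2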

Active/Active - Databases

● MySQL or MariaDB with the Galera cluster (wsrep) library extension
o transaction-commit-level replication
● synchronous multi-master setup
o min. 3 nodes to get quorum in case of a network partition
● write and read to any node
● other database options possible: Percona XtraDB, PostgreSQL etc.
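A minimal sketch of the Galera-related MySQL settings for such a three-node cluster; the node names db1, db2, db3 and file locations are assumptions and vary by distribution.

    # /etc/mysql/conf.d/galera.cnf (sketch)
    [mysqld]
    wsrep_provider           = /usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name       = openstack_db
    wsrep_cluster_address    = gcomm://db1,db2,db3   # all three nodes, needed for quorum
    wsrep_sst_method         = rsync
    binlog_format            = ROW                   # required by Galera
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2                     # required by Galera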

Active/Active - RabbitMQ

● RabbitMQ nodes clustered
● mirrored queues configured via policy (e.g. ha-mode: all)
● all services use the clustered RabbitMQ nodes
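The mirrored-queue policy can be set with rabbitmqctl, roughly as follows; the policy name and queue pattern are arbitrary choices.

    # mirror all queues across all cluster nodes
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'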

Active/Active - Networking

Network
● deploy multiple network nodes
● Neutron DHCP agent – configure multiple DHCP agents (dhcp_agents_per_network)
● Neutron L3 agent (see the configuration sketch after this list)
o Automatic L3 agent HA (allow_automatic_l3agent_failover)
o VRRP (l3_ha, max_l3_agents_per_router, min_l3_agents_per_router)
● Neutron L2 agent – no HA available
● Neutron metadata agent – no HA available
● Neutron LBaaS agent – no HA available

● where no HA feature is available: active/passive Pacemaker/Corosync solution
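The agent HA options named above would be set roughly as follows; the values are illustrative, and availability of the VRRP options depends on the Neutron release (l3_ha landed with Juno).

    # neutron.conf (sketch)
    [DEFAULT]
    dhcp_agents_per_network = 2               # schedule two DHCP agents per network
    allow_automatic_l3agent_failover = True   # reschedule routers away from a dead L3 agent
    l3_ha = True                              # VRRP-based HA routers
    max_l3_agents_per_router = 3
    min_l3_agents_per_router = 2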

Active/Active - Example

[Diagram: deployment example]

Active/Passive

General
Tools Overview
Controllers Overview

Active/Passive: General

● Components should leverage a Virtual IP (VIP)
● The primary tools used for Active/Passive OpenStack configurations are general-purpose (not OpenStack-specific): Pacemaker + Corosync, and DRBD

Corosync

● Messaging layer used by the cluster
● Responsibilities include cluster membership and messaging
● Leverages RRP (Redundant Ring Protocol)
o Rings can be set up as A/A or A/P
o UDP only
o mcastport specifies the receive port; mcastport minus 1 is the send port (see the sketch below)
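For illustration, a minimal totem section of corosync.conf; the network and multicast addresses are placeholders.

    # /etc/corosync/corosync.conf (sketch)
    totem {
      version: 2
      secauth: off
      interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0        # cluster network (placeholder)
        mcastaddr: 239.255.42.1      # multicast group (placeholder)
        mcastport: 5405              # receive port; 5404 is then used for sending
      }
    }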

Pacemaker
● Cluster Resource Manager

● Cluster Information Base (CIB)

o Represents current state of resources and cluster configuration (XML)

● Cluster Resource Management Daemon (CRMd)

o Acts as decision maker (one master)

● Policy Engine (PEngine)

o Sends instructions to the LRMd and CRMd

● STONITHd

o Fencing mechanism

● Resource Agents

o Standardized interfaces for resources (a minimal example follows the diagram below)

[Diagram: Pacemaker components – CRMd, CIB, PEngine, STONITHd, LRMd]
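As a small example of a resource agent in use, a floating virtual IP could be defined via the crm shell roughly like this; the resource name and address are placeholders.

    # define a floating VIP with the ocf:heartbeat:IPaddr2 resource agent (sketch)
    crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
      params ip="192.168.42.100" cidr_netmask="24" \
      op monitor interval="30s"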

DRBD

● Distributed Replicated Block Device
● Creates logical block devices (e.g. /dev/drbdX) that have backing volumes
● Reads are serviced locally
● Primary node writes are sent to the secondary node
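A sketch of a DRBD resource definition for a MySQL data volume; host names, backing disk and port are assumptions.

    # /etc/drbd.d/mysql.res (sketch)
    resource mysql {
      device    /dev/drbd0;
      disk      /dev/sdb1;          # backing volume on each host (placeholder)
      meta-disk internal;
      on host1 {
        address 10.0.0.1:7700;      # placeholder address/port
      }
      on host2 {
        address 10.0.0.2:7700;
      }
    }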

Active/Passive: Database

[Diagram: two hosts, each running MySQL on DRBD, managed by Pacemaker and Corosync]

● Use DRBD to back MySQL

● Leverage a VIP that can float between hosts

● Manage all resources (including MySQL Daemon) with Pacemaker

● MySQL/Galera is an alternative, but the current version of the HA Guide does not recommend it
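Putting the pieces together, the Pacemaker resources for DRBD-backed MySQL might be sketched roughly as below, in the spirit of the HA guide; resource names, the DRBD device, mount point and VIP are assumptions.

    # crm configure (sketch): MySQL on DRBD with a floating VIP
    primitive p_drbd_mysql ocf:linbit:drbd \
      params drbd_resource="mysql" op monitor interval="15s"
    ms ms_drbd_mysql p_drbd_mysql \
      meta master-max="1" clone-max="2" notify="true"
    primitive p_fs_mysql ocf:heartbeat:Filesystem \
      params device="/dev/drbd0" directory="/var/lib/mysql" fstype="xfs"
    primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
      params ip="192.168.42.101" cidr_netmask="24"
    primitive p_mysql ocf:heartbeat:mysql       # defaults assumed for datadir etc.
    group g_mysql p_fs_mysql p_ip_mysql p_mysql
    colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
    order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start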

Active/Passive: RabbitMQ

[Diagram: two hosts, each running RabbitMQ on DRBD, managed by Pacemaker and Corosync]

● Use DRBD to back RabbitMQ

● Leverage a VIP that can float between hosts

● Ensure the erlang.cookie is identical on all nodes (see the sketch below)

o Enables the nodes to communicate with each other

● RabbitMQ clustering does not tolerate network partitions well
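Keeping the Erlang cookie in sync can be done by copying it from the first node, for example as below; the host name is a placeholder and root SSH access is assumed.

    # copy the cookie from node1 to node2 and fix ownership, permissions and the service
    scp /var/lib/rabbitmq/.erlang.cookie node2:/var/lib/rabbitmq/.erlang.cookie
    ssh node2 'chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && \
               chmod 400 /var/lib/rabbitmq/.erlang.cookie && \
               service rabbitmq-server restart'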

Active/Passive: Overview (From Guide)

● Leverage the DB and RabbitMQ VIPs in the configuration files (see the sketch after this list)

● Configure Pacemaker Resources for OpenStack Services

o Image API

o Identity

o Block Storage API

o Telemetry Central Agent

o Networking

o L3-Agent

o DHCP
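For example, each service's config file would reference the floating VIPs rather than individual hosts; the VIP host names and password below are placeholders.

    # nova.conf (sketch): point at the VIPs, not at individual hosts
    [DEFAULT]
    rabbit_host = rabbit-vip

    [database]
    connection = mysql://nova:NOVA_DBPASS@db-vip/nova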

DT Implementation - Overview

● Business Market Place (BMP)
● SaaS offering
● https://portal.telekomcloud.com/
● SaaS applications from software partners (ISVs) and DT, offered to SME customers
● Platform based on open-source technologies only (OpenStack, CEPH, Linux)
● Project started in 2012 with OpenStack Essex and CEPH
● In production since March 2013

DT Implementation

DTAG scale out project (ongoing)

Target: Migrate production to a new DC and scale out

Requirements:
● scale out compute by 30%, storage by 40%
● eliminate all SPOFs
● setup in two fire protection areas / physically separated DC rooms

DT Implementation

● single-region HA OpenStack instance
● all services distributed over two DC rooms
o Compute and Storage distributed equally
o All OpenStack services HA (as far as possible)
o OSS (DNS, NTP, Puppet master, mirror etc., redundant perimeter firewall)

● Instance distribution: 4 Availability Zones, multiple host aggregates and scheduler filters

DT Implementation
● Load Balancing
o HAProxy for MySQL, services, RabbitMQ, APIs (nginx under test)
● MySQL
o Galera multi-master replication (3 nodes)
● RabbitMQ
o 2-node cluster / mirrored queues
● Neutron
o multiple DHCP agents started; Pacemaker/Corosync
● API Endpoints
o load balancing with round-robin distribution
● Storage
o 2 shared, distributed CEPH clusters (RBD/S3)

DT Implementation
Tests/Experiences so far

● Load balancing works well
● Database: OpenStack multi-node write issues
o 1 node write / 2 nodes backup: diminishes Galera HA efficiency (monitoring)
● Specific issues with deployment in 2 DC rooms / uneven distribution of services (Galera)
o if the “wrong” room fails – Galera quorum requires a majority!
o the room with 2 nodes goes down → the 3rd node will deactivate itself → DB outage
● Storage specific:
o CEPH may lose 2/3 of the replicas → heavy replication load on the CEPH cluster
o danger of losing data (OSD/disk failure) → raise the replica level / adapt the CRUSH map
● Network: recovering from a Neutron / L3 failure takes <15 minutes
o pet applications are vulnerable – may suffer from hiccups during disasters anyway
● DHCP agent failures

DT Implementation

Plans for the future

● use DVR / VRRP in the future
o make the network more resilient and elastic
● a third DC room would be desirable :-)
o CEPH replicas / MONs, MySQL Galera

eBay/PayPal Implementation

The scope of eBay/PayPal OpenStack Clouds
● 100% of PayPal web/mid tier
● Most of Dev/QA
● Number of hypervisors (HVs): 8,500
● Number of Virtual Machines: 70,000
● Number of users: several thousand
● Availability zones: 10

eBay/PayPal Implementation
● Database
MySQL MMM replication, VIP with Failover Persistence / Galera
● RabbitMQ
VIP with Single Node Failover Persistence or 3 nodes with mirrored queues
● Neutron DHCP / LBaaS
Corosync/Pacemaker
● API Endpoints
LB VIPs for every service with either round robin or least connection
● Storage
Shared storage with NFS/iSCSI

eBay/PayPal Implementation

Successful HA Implementations
● Load-balanced HA - VIPs for every service
● LB Single Node Failover Persistence Profile
● Galera/Percona for Identity Service
● Global Identity Service using GLB

eBay/PayPal Implementation

HA Failures
● Corosync/Pacemaker
Neutron DHCP and LBaaS - missing advanced health checks
● RabbitMQ
Single Node Failover Persistence
● MySQL Replication
Single Node Failover Persistence sometimes doesn't work well; implemented external monitoring and disabling of the failed member
● VIPs without ECV health checks

eBay/PayPal Implementation

Future direction
● HA on Global or Regional Services
One leg in each Availability Zone (Keystone, LBaaS, Swift)
● RabbitMQ with 3 nodes / mirrored queues
LB VIP with least connections
● No shared NFS for Glance

eBay/PayPal Global Identity Service

eBay/PayPal Implementation

Lessons Learned
● Try not to overcomplicate
● Simulate failures
Before placing in production, make sure HA works
● Place your services in different Availability Zones, or at least different FaultZones
● Always make backups
No matter how robust your HA solution is

Call to Action

● OpenStack HA Guide Update Efforts
● WTE Work Group (now known as ‘Enterprise’)
● Share Best Practices

References

OpenStack HA guide: http://docs.openstack.org/high-availability-guide/content/index.html
Percona Resources: https://www.percona.com/resources/mysql-webinars/high-availability-using-mysql-cloud-today-tomorrow-and-keys-your-success
HAProxy Documentation: http://www.haproxy.org/