Scalable High Availability for Lustre with Pacemaker
Transcript of Scalable High Availability for Lustre with Pacemaker
LLNL-PRES-731486
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Scalable High Availability for Lustre with Pacemaker
LUG17
Christopher J. Morrone
June 1, 2017
Lustre High Availability
[Diagram: a two-server failover pair. Normally Server1 serves OST1 from OST1's storage and Server2 serves OST2 from OST2's storage. When Server1 fails, Server2 imports OST1's storage and serves both OST1 and OST2.]
Heartbeat missing from RHEL7
Official RHEL7 HA stack — Pacemaker 1.1 — Corosync 2.x — pcs
Motivation: Migrate from Heartbeat to Pacemaker
Pacemaker – Resource manager
Corosync – Messaging layer, quorum
pcs – Unified Pacemaker/Corosync command shell
Resource Agent (RA) – Script/program interface to a single resource type
Fence Agent (FA) – STONITH device driver (script)
HA Stack Components/Terms
Single Pacemaker Cluster
[Diagram: with Heartbeat, the six servers (Server1–Server6) form three independent two-node failover pairs; with Pacemaker, all six servers are managed as a single cluster.]
Issue – Pacemaker assumes unique local storage
Issue – Pacemaker cib.xml direct edits forbidden (use cibadmin/pcs)
LLNL node configuration is managed through cfengine
LLNL Constraint: Stateless Servers
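The slides above forbid editing cib.xml directly. As background (not from the slides), a minimal sketch of the supported route, using standard Pacemaker/pcs commands; the "demo" resource is purely illustrative:
pcs cluster cib          # dump the live CIB as XML (read-only)
cibadmin --query         # equivalent low-level query
pcs resource create demo ocf:heartbeat:Dummy   # changes go through pcs/cibadmin, never a text editor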
The Wrong Approach: Script-Generated cib.xml, corosync.conf
[Diagram: every Lustre server runs its own corosync and pacemaker daemons plus the resource and fence agents, all joined in one corosync "cluster"; cfengine regenerates cib.xml, corosync.conf, etc. on each node at boot time.]
Rule of thumb: 16-node corosync cluster limit
Red Hat support ends at 16 corosync nodes
Issue: Corosync Does Not Scale
Allows the corosync/pacemaker cluster to control non-corosync nodes
Only configuration needed on the non-corosync nodes is /etc/pacemaker/authkey
Use RHEL standard pcs command
Red Hat support
Not documented in main Pacemaker manual (separate manual)
Better Approach: pacemaker_remote
pacemaker_remote Architecture
[Diagram: the stateless Lustre servers run pacemaker_remote plus the resource agents against their local storage; the management nodes run corosync, pacemaker, and the fence agents and form the corosync "cluster".]
LLNL’s pacemaker_remote Architecture
[Diagram: LLNL's variant of the same layout, with the corosync "cluster" shrunk to the filesystem's management node running corosync, pacemaker, and the fence agent; the stateless Lustre servers run pacemaker_remote and the resource agents against local storage.]
lustre – RA for MGS, MDT, and OST resources — /usr/lib/ocf/resource.d/llnl/lustre — ocf:llnl:lustre
zpool – RA for ZFS zpool resource — /usr/lib/ocf/resource.d/llnl/zpool — ocf:llnl:zpool
fence_powerman – FA for powerman node power control — /usr/sbin/fence_powerman — fence_powerman
LLNL Developed Agents
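As background (not from the slides): an OCF resource agent such as ocf:llnl:zpool is an executable that Pacemaker runs with an action argument, passing resource parameters as OCF_RESKEY_* environment variables. A by-hand sketch of that calling convention, reusing the pool parameter from the configuration shown later:
# Pacemaker normally does this itself; shown only to illustrate the OCF interface
OCF_RESKEY_pool=mpool /usr/lib/ocf/resource.d/llnl/zpool monitor
# standard actions include start, stop, monitor, and meta-data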
Generate authkey— dd if=/dev/urandom of=authkey bs=4096 count=1
Install at /etc/pacemaker/authkey on all nodes
Permission 0440, group haclient
Authentication
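Putting the authkey steps together, a hedged sketch (the dd command, path, group, and mode are from the slides; the use of install and root ownership are assumptions):
dd if=/dev/urandom of=authkey bs=4096 count=1
# distribute to every corosync node and every pacemaker_remote node, e.g.:
install -D -o root -g haclient -m 0440 authkey /etc/pacemaker/authkey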
Start Daemons
[Diagram: mannode runs corosync* and pacemakerd*; mdsnode and ossnode each run pacemaker_remoted and provide the mpool and opool zpools.]
* The "pcs cluster start" command will start these for us
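Not spelled out on the slide: on the Lustre servers pacemaker_remoted ships as an ordinary systemd service (unit name from the stock pacemaker-remote package), so a sketch of bringing it up would be:
systemctl enable pacemaker_remote   # on mdsnode and ossnode
systemctl start pacemaker_remote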
pcs cluster setup --name mycluster mannode
pcs cluster start — starts both corosync and pacemaker
Initial Setup (On mannode)
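If the cluster services should also come back automatically after a reboot of mannode, pcs provides a standard command for that (an addition to the slide, not part of LLNL's recipe):
pcs cluster enable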
pcs stonith create fence_pm fence_powerman ipaddr=localhost ipport=10101 pcmk_host_check="mdsnode,ossnode"
pcs constraint location fence_pm prefers mannode
Fencing Setup
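One caveat, my reading rather than the slides': in stock Pacemaker the hosts a fence device can fence are normally listed in pcmk_host_list, while pcmk_host_check selects how that list is determined (e.g. static-list), so an equivalent sketch of the command above would be:
pcs stonith create fence_pm fence_powerman ipaddr=localhost ipport=10101 pcmk_host_check=static-list pcmk_host_list="mdsnode,ossnode"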
pcs resource create mds ocf:pacemaker:remote server=mdsnode reconnect_interval=60
pcs resource create oss ocf:pacemaker:remote server=ossnode reconnect_interval=60
pcs constraint location mds prefers mannode
pcs constraint location oss prefers mannode
Configure Remote Nodes
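Once these remote resources start, mdsnode and ossnode should appear under RemoteOnline (see the pcs status slide below); a quick sanity check is simply:
pcs status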
pcs resource create mpool ocf:llnl:zpool import_options="-f -N -d /dev/disk/by_vdev" pool=mpool
pcs constraint location add mpool_l1 mpool mdsnode 20 resource-discovery=exclusive
pcs constraint location add mpool_l2 mpool ossnode 10 resource-discovery=exclusive
Configure ZFS zpool for MGS & MDT
pcs resource create opool ocf:llnl:zpool import_options="-f -N -d /dev/disk/by_vdev" pool=opool
pcs constraint location add opool_l1 opool ossnode 20 resource-discovery=exclusive
pcs constraint location add opool_l2 opool mdsnode 10 resource-discovery=exclusive
Configure ZFS zpool for OST
pcs resource create MGS ocf:llnl:lustre dataset=mpool/mgs mountpoint=/mnt/lustre/MGS
pcs constraint order mpool then MGS
pcs constraint colocation add MGS with mpool score=INFINITY
pcs constraint location add MGS_l1 MGS mdsnode 20 resource-discovery=exclusive
pcs constraint location add MGS_l2 MGS ossnode 10 resource-discovery=exclusive
Configure Lustre MGS
pcs resource create MDT ocf:llnl:lustre dataset=mpool/mds mountpoint=/mnt/lustre/MDT
pcs constraint order mpool then MDT
pcs constraint order MGS then MDT kind=Optional
pcs constraint colocation add MDT with mpool score=INFINITY
pcs constraint location add MDT_l1 MDT mdsnode 20 resource-discovery=exclusive
pcs constraint location add MDT_l2 MDT ossnode 10 resource-discovery=exclusive
Configure Lustre MDT
pcs resource create OST ocf:llnl:lustre dataset=opool/ost mountpoint=/mnt/lustre/OST
pcs constraint order opool then OST
pcs constraint order MGS then OST kind=Optional
pcs constraint colocation add OST with opool score=INFINITY
pcs constraint location add OST_l1 OST ossnode 20 resource-discovery=exclusive
pcs constraint location add OST_l2 OST mdsnode 10 resource-discovery=exclusive
Configure Lustre OST
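The same pattern repeats for each resource: order it after its zpool, colocate it with that zpool, and give it a location score of 20 on its primary server and 10 on its failover partner, always with resource-discovery=exclusive. The accumulated constraints can be reviewed with the standard command:
pcs constraint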
Cluster name: mycluster
Stack: corosync
Current DC: mannode (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Fri May 12 13:25:59 2017
Last change: …
3 nodes and 5 resources configured
Online: [ mannode ]
RemoteOnline: [ mdsnode ossnode ]
pcs status
Full list of resources:
fence_pm (stonith:fence_powerman): Started
mds (ocf::pacemaker:remote): Started mannode
oss (ocf::pacemaker:remote): Started mannode
mpool (ocf::llnl:zpool): Started mds
opool (ocf::llnl:zpool): Started oss
MGS (ocf::llnl:lustre): Started mds
MDT (ocf::llnl:lustre): Started mds
OST (ocf::llnl:lustre): Started oss
pcs status (continued)
Test Cluster of 20 Lustre servers (16 MDS & 4 OSS) — 45 seconds — 233 commands
Production Cluster of 52 Lustre servers (16 MDS & 36 OSS) — 2-3 minutes — 585 commands
ldev2pcs – Read ldev.conf and generate pcs commands
Script Initial Setup
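For context, a hedged sketch of the kind of ldev.conf input ldev2pcs reads; these entries are hypothetical, but the general ldev.conf format is "local-host failover-host label device":
# local   foreign   label            device
mdsnode   ossnode   lustre1-MDT0000  zfs:mpool/mds
ossnode   mdsnode   lustre1-OST0000  zfs:opool/ost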
No pacemaker state needed on stateless Lustre servers
Single pacemaker instance for entire filesystem — (start/stop entire filesystem with systemctl start/stop pacemaker)
Manage all failover from cluster login/management node
Scales to 50+ nodes in production
Positive Results
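Because that single pacemaker instance on mannode owns every resource, the systemctl point above amounts to (a sketch of the effect, not a new command):
systemctl stop pacemaker    # on mannode: stops MGS, MDT, OSTs, and the zpools
systemctl start pacemaker   # brings the whole filesystem back up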
pcs status compact form
pacemaker administration learning curve
Better timeout values
Test scaling limits
Improve Resource Agents’ monitor actions
Add LNet Resource Agent
Future Work
http://github.com/LLNL/lustre-tools-llnl (scripts/lustre, scripts/zpool, scripts/ldev2pcs under tree/master)
Get The Source