
    Ganeti walk-through

    Documents Ganeti version 2.10

    Contents

    Ganeti walk-through

    Introduction

    Cluster creation

    Running a burn-in

    Instance operations

    Creation

    Accessing instances

    Removal

    Recovering from hardware failures

    Recovering from node failure

    Re-adding a node to the cluster

    Disk failures

    Common cluster problems

    Instance status

    Unallocated DRBD minors

    Orphan volumes

    N+1 errors

    Network issues

    Migration problems

    In use disks at instance shutdown

    LUXI version mismatch

    Introduction

    This document serves as a more example-oriented guide to Ganeti; while the administration

    guide shows a conceptual approach, here you will find a step-by-step example of managing

    instances and the cluster.

    Our simulated, example cluster will have three machines, named node1, node2, node3. Note

    that in real life machines will usually have FQDNs but here we use short names for brevity.

    We will use a secondary network for replication data, 192.0.2.0/24 , with nodes having the

    last octet the same as their index. The cluster name will be example-cluster . All nodes have

    the same simulated hardware configuration, two disks of 750GB, 32GB of memory and 4

    CPUs.

    On this cluster, we will create up to seven instances, named instance1 to instance7.

    Cluster creation

    Follow the Ganeti installation tutorial document and prepare the nodes. Then it's time to


    initialise the cluster:

    $ gnt-cluster init -s 192.0.2.1 --enabled-hypervisors=xen-pvm example-cluster
    $

    The creation was fine. Let's check that the one node we have is functioning correctly:

    $ gnt-node list
    Node  DTotal DFree MTotal MNode MFree Pinst Sinst
    node1   1.3T  1.3T  32.0G  1.0G 30.5G     0     0

    $ gnt-cluster verify
    Mon Oct 26 02:08:51 2009 * Verifying global settings
    Mon Oct 26 02:08:51 2009 * Gathering data (1 nodes)
    Mon Oct 26 02:08:52 2009 * Verifying node status
    Mon Oct 26 02:08:52 2009 * Verifying instance status
    Mon Oct 26 02:08:52 2009 * Verifying orphan volumes
    Mon Oct 26 02:08:52 2009 * Verifying remaining instances
    Mon Oct 26 02:08:52 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 02:08:52 2009 * Other Notes
    Mon Oct 26 02:08:52 2009 * Hooks Results

    $

    Since this proceeded correctly, let's add the other two nodes:

    $ gnt-node add -s 192.0.2.2 node2
    -- WARNING --
    Performing this operation is going to replace the ssh daemon keypair
    on the target machine (node2) with the ones of the current one
    and grant full intra-cluster ssh root access to/from it

    Unable to verify hostkey of host xen-devi-5.fra.corp.google.com:
    f7:. Do you want to accept it?
    y/[n]/?: y
    Mon Oct 26 02:11:53 2009 Authentication to node2 via public key failed, trying pas
    root password:
    Mon Oct 26 02:11:54 2009 - INFO: Node will be a master candidate
    $ gnt-node add -s 192.0.2.3 node3
    -- WARNING --
    Performing this operation is going to replace the ssh daemon keypair
    on the target machine (node3) with the ones of the current one
    and grant full intra-cluster ssh root access to/from it

    Mon Oct 26 02:12:43 2009 - INFO: Node will be a master candidate

    Checking the cluster status again:

    $ gnt-node list
    Node  DTotal DFree MTotal MNode MFree Pinst Sinst
    node1   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
    node2   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
    node3   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
    $ gnt-cluster verify
    Mon Oct 26 02:15:14 2009 * Verifying global settings
    Mon Oct 26 02:15:14 2009 * Gathering data (3 nodes)
    Mon Oct 26 02:15:16 2009 * Verifying node status
    Mon Oct 26 02:15:16 2009 * Verifying instance status
    Mon Oct 26 02:15:16 2009 * Verifying orphan volumes
    Mon Oct 26 02:15:16 2009 * Verifying remaining instances
    Mon Oct 26 02:15:16 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 02:15:16 2009 * Other Notes

    Running a burn-in

    A burn-in exercises the cluster by running a series of operations on a set of test
    instances. An excerpt of the burn-in output on our cluster:
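    The burn-in is driven by the burnin tool that ships with Ganeti. A minimal invocation
    sketch (the tool path, the OS name and the instance names below follow this
    walk-through's conventions and are assumptions that may differ on your installation):

    $ /usr/lib/ganeti/tools/burnin -o debootstrap -p instance{1..5}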

    - Failing over instances
      * instance instance1
      * instance instance5
      * Submitted job ID(s) 179, 180, 181, 182, 183
        waiting for job 179 for instance1
    - Migrating instances
      * instance instance1
        migration and migration cleanup
      * instance instance5
        migration and migration cleanup
      * Submitted job ID(s) 184, 185, 186, 187, 188
        waiting for job 184 for instance1
    - Exporting and re-importing instances
      * instance instance1
        export to node node3
        remove instance
        import from node3 to node1, node2
        remove export
      * instance instance5
        export to node node1
        remove instance
        import from node1 to node2, node3
        remove export
      * Submitted job ID(s) 196, 197, 198, 199, 200
        waiting for job 196 for instance1
    - Reinstalling instances
      * instance instance1
        reinstall without passing the OS
        reinstall specifying the OS
      * instance instance5
        reinstall without passing the OS
        reinstall specifying the OS
      * Submitted job ID(s) 203, 204, 205, 206, 207
        waiting for job 203 for instance1
    - Rebooting instances
      * instance instance1
        reboot with type 'hard'
        reboot with type 'soft'
        reboot with type 'full'
      * instance instance5
        reboot with type 'hard'
        reboot with type 'soft'
        reboot with type 'full'
      * Submitted job ID(s) 208, 209, 210, 211, 212
        waiting for job 208 for instance1
    - Adding and removing disks
      * instance instance1
        adding a disk
        removing last disk
      * instance instance5
        adding a disk
        removing last disk
      * Submitted job ID(s) 213, 214, 215, 216, 217


        waiting for job 213 for instance1
    - Adding and removing NICs
      * instance instance1
        adding a NIC
        removing last NIC
      * instance instance5
        adding a NIC
        removing last NIC
      * Submitted job ID(s) 218, 219, 220, 221, 222
        waiting for job 218 for instance1
    - Activating/deactivating disks
      * instance instance1
        activate disks when online
        activate disks when offline
        deactivate disks (when offline)
      * instance instance5
        activate disks when online
        activate disks when offline
        deactivate disks (when offline)
      * Submitted job ID(s) 223, 224, 225, 226, 227
        waiting for job 223 for instance1
    - Stopping and starting instances
      * instance instance1
      * instance instance5
      * Submitted job ID(s) 230, 231, 232, 233, 234
        waiting for job 230 for instance1
    - Removing instances
      * instance instance1
      * instance instance5
      * Submitted job ID(s) 235, 236, 237, 238, 239
        waiting for job 235 for instance1
    $

    You can see in the above what operations the burn-in does. Ideally, the burn-in log would

    proceed successfully through all the steps and end cleanly, without throwing errors.

    Instance operations

    Creation

    At this point, Ganeti and the hardware seem to be functioning correctly, so we'll follow up

    with creating the instances manually:

    $ gnt-instance add -t drbd -o debootstrap -s 256m instance1
    Mon Oct 26 04:06:52 2009 - INFO: Selected nodes for instance instance1 via ialloca
    Mon Oct 26 04:06:53 2009 * creating instance disks...
    Mon Oct 26 04:06:57 2009 adding instance instance1 to cluster config
    Mon Oct 26 04:06:57 2009 - INFO: Waiting for instance instance1 to sync disks.
    Mon Oct 26 04:06:57 2009 - INFO: - device disk/0: 20.00% done, 4 estimated seconds
    Mon Oct 26 04:07:01 2009 - INFO: Instance instance1's disks are in sync.
    Mon Oct 26 04:07:01 2009 creating os for instance instance1 on node node2


    Mon Oct 26 04:07:01 2009 * running the instance OS create scripts...
    Mon Oct 26 04:07:14 2009 * starting instance...
    $ gnt-instance add -t drbd -o debootstrap -s 256m -n node1:node2 instance2
    Mon Oct 26 04:11:37 2009 * creating instance disks...
    Mon Oct 26 04:11:40 2009 adding instance instance2 to cluster config
    Mon Oct 26 04:11:41 2009 - INFO: Waiting for instance instance2 to sync disks.
    Mon Oct 26 04:11:41 2009 - INFO: - device disk/0: 35.40% done, 1 estimated seconds
    Mon Oct 26 04:11:42 2009 - INFO: - device disk/0: 58.50% done, 1 estimated seconds
    Mon Oct 26 04:11:43 2009 - INFO: - device disk/0: 86.20% done, 0 estimated seconds
    Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 92.40% done, 0 estimated seconds
    Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 97.00% done, 0 estimated seconds
    Mon Oct 26 04:11:44 2009 - INFO: Instance instance2's disks are in sync.
    Mon Oct 26 04:11:44 2009 creating os for instance instance2 on node node1
    Mon Oct 26 04:11:44 2009 * running the instance OS create scripts...
    Mon Oct 26 04:11:57 2009 * starting instance...
    $

    The above shows one instance created via an iallocator script, and one being created with

    manual node assignment. The other three instances were also created and now it's time to

    check them:

    $ gnt-instance list
    Instance  Hypervisor OS          Primary_node Status  Memory
    instance1 xen-pvm    debootstrap node2        running   128M
    instance2 xen-pvm    debootstrap node1        running   128M
    instance3 xen-pvm    debootstrap node1        running   128M
    instance4 xen-pvm    debootstrap node3        running   128M
    instance5 xen-pvm    debootstrap node2        running   128M

    Accessing instances

    Accessing an instance's console is easy:

    $ gnt-instance console instance2
    [    0.000000] Bootdata ok (command line is root=/dev/sda1 ro)
    [    0.000000] Linux version 2.6
    [    0.000000] BIOS-provided physical RAM map:
    [    0.000000]  Xen: 0000000000000000 - 0000000008800000 (usable)
    [13138176.018071] Built 1 zonelists. Total pages: 34816
    [13138176.018074] Kernel command line: root=/dev/sda1 ro
    [13138176.018694] Initializing CPU#0
    Checking file systems...fsck 1.41.3 (12-Oct-2008)
    done.
    Setting kernel variables (/etc/sysctl.conf)...done.
    Mounting local filesystems...done.
    Activating swapfile swap...done.
    Setting up networking....
    Configuring network interfaces...done.
    Setting console screen modes and fonts.
    INIT: Entering runlevel: 2
    Starting enhanced syslogd: rsyslogd.
    Starting periodic command scheduler: crond.

    Debian GNU/Linux 5.0 instance2 tty1

    instance2 login:

    At this moment you can log in to the instance and, after configuring the network (and doing

    this on all instances), we can check their connectivity:


    $ fping instance{1..5}
    instance1 is alive
    instance2 is alive
    instance3 is alive
    instance4 is alive
    instance5 is alive
    $

    Removal

    Removing unwanted instances is also easy:

    $ gnt-instance remove instance5
    This will remove the volumes of the instance instance5 (including
    mirrors), thus removing all the data of the instance. Continue?
    y/[n]/?: y
    $

    Recovering from hardware failures

    Recovering from node failure

    We are now left with four instances. Assume that at this point, node3, which has one primary

    and one secondary instance, crashes:

    $ gnt-node info node3
    Node name: node3
      primary ip: 198.51.100.1
      secondary ip: 192.0.2.3
      master candidate: True
      drained: False
      offline: False
      primary for instances:
        - instance4
      secondary for instances:
        - instance1
    $ fping node3
    node3 is unreachable

    At this point, the primary instance of that node (instance4) is down, but the secondary instance (instance1) is not affected except that it has lost disk redundancy:

    $ fping instance{1,4}
    instance1 is alive
    instance4 is unreachable
    $

    If we try to check the status of instance4 via the instance info command, it fails because it

    tries to contact node3 which is down:

    $ gnt-instance info instance4
    Failure: command execution error:
    Error checking node node3: Connection failed (113: No route to host)

    $


    So we need to mark node3 as being offline, and thus Ganeti won't talk to it anymore:

    $ gnt-node modify -O yes -f node3
    Mon Oct 26 04:34:12 2009 - WARNING: Not enough master candidates (desired 10, new
    Mon Oct 26 04:34:15 2009 - WARNING: Communication failure to node node3: Connectio
    Modified node node3
     - offline -> True
     - master_candidate -> auto-demotion due to offline

    $

    And now we can failover the instance:

    $ gnt-instance failover instance4
    Failover will happen to image instance4. This requires a shutdown of
    the instance. Continue?
    y/[n]/?: y
    Mon Oct 26 04:35:34 2009 * checking disk consistency between source and target
    Failure: command execution error:
    Disk disk/0 is degraded on target node, aborting failover.
    $ gnt-instance failover --ignore-consistency instance4
    Failover will happen to image instance4. This requires a shutdown of
    the instance. Continue?
    y/[n]/?: y
    Mon Oct 26 04:35:47 2009 * checking disk consistency between source and target
    Mon Oct 26 04:35:47 2009 * shutting down instance on source node
    Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown instance instance4 on node
    Mon Oct 26 04:35:47 2009 * deactivating the instance's disks on source node
    Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown block device disk/0 on node
    Mon Oct 26 04:35:47 2009 * activating the instance's disks on target node
    Mon Oct 26 04:35:47 2009 - WARNING: Could not prepare block device disk/0 on node
    Mon Oct 26 04:35:48 2009 * starting the instance on the target node
    $

    Note that in our first attempt, Ganeti refused to do the failover since it wasn't sure about the

    status of the instance's disks. We pass the --ignore-consistency flag and then the

    failover succeeds:

    $ gnt-instance list
    Instance  Hypervisor OS          Primary_node Status  Memory
    instance1 xen-pvm    debootstrap node2        running   128M
    instance2 xen-pvm    debootstrap node1        running   128M
    instance3 xen-pvm    debootstrap node1        running   128M
    instance4 xen-pvm    debootstrap node1        running   128M
    $

    But at this point, both instance1 and instance4 are without disk redundancy:

    $ gnt-instance info instance1
    Instance name: instance1
    UUID: 45173e82-d1fa-417c-8758-7d582ab7eef4
    Serial number: 2
    Creation time: 2009-10-26 04:06:57
    Modification time: 2009-10-26 04:07:14
    State: configured to be up, actual state is up
      Nodes:
        - primary: node2
        - secondaries: node3
      Operating system: debootstrap
      Allocated network port: None
      Hypervisor: xen-pvm
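    Until node3 is repaired and re-added, both instances run from a single disk copy. One
    way to re-establish redundancy in the meantime is to move their DRBD secondaries onto
    the surviving nodes with replace-disks; a sketch (the target nodes below are illustrative
    and assume enough free disk and memory; -n/--new-secondary selects the new secondary
    node):

    $ gnt-instance replace-disks -n node1 instance1
    $ gnt-instance replace-disks -n node2 instance4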


    Disk failures

    A disk failure is simpler than a full node failure. First, a single disk failure should not cause

    data-loss for any redundant instance; only the performance of some instances might be

    reduced due to more network traffic.

    Let's take the cluster status in the above listing, and check what volumes are in use:

    $ gnt-node volumes -o phys,instance node2
    PhysDev   Instance
    /dev/sdb1 instance4
    /dev/sdb1 instance4
    /dev/sdb1 instance1
    /dev/sdb1 instance1
    /dev/sdb1 instance3
    /dev/sdb1 instance3
    /dev/sdb1 instance2
    /dev/sdb1 instance2

    $

    You can see that all instances on node2 have logical volumes on /dev/sdb1. Let's simulate a

    disk failure on that disk:

    $ ssh node2
    # on node2
    $ echo offline > /sys/block/sdb/device/state
    $ vgs
      /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdb1: read failed after 0 of 4096 at 750153695232: Input/output error
      /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
      Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
      Couldn't find all physical volumes for volume group xenvg.
      /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
      Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
      Couldn't find all physical volumes for volume group xenvg.
      Volume group xenvg not found
    $

    At this point, the node is broken and if we are to examine instance2 we get (simplified output

    shown):

    $ gnt-instance info instance2
    Instance name: instance2
    State: configured to be up, actual state is up
      Nodes:
        - primary: node1
        - secondaries: node2
      Disks:
        - disk/0: drbd8, size 256M
          on primary:   /dev/drbd0 (147:0) in sync, status ok
          on secondary: /dev/drbd1 (147:1) in sync, status *DEGRADED* *MISSING DISK*

    This instance has a secondary only on node2. Let's verify a primary instance of node2:

    $ gnt-instance info instance1
    Instance name: instance1
    State: configured to be up, actual state is up


      Nodes:
        - primary: node2
        - secondaries: node1
      Disks:
        - disk/0: drbd8, size 256M
          on primary:   /dev/drbd0 (147:0) in sync, status *DEGRADED* *MISSING DISK*
          on secondary: /dev/drbd3 (147:3) in sync, status ok
    $ gnt-instance console instance1

    Debian GNU/Linux 5.0 instance1 tty1

    instance1 login: root
    Last login: Tue Oct 27 01:24:09 UTC 2009 on tty1
    instance1:~# date > test
    instance1:~# sync
    instance1:~# cat test
    Tue Oct 27 01:25:20 UTC 2009
    instance1:~# dmesg|tail
    [5439785.235448] NET: Registered protocol family 15
    [5439785.235489] 802.1Q VLAN Support v1.8 Ben Greear
    [5439785.235495] All bugs added by David S. Miller
    [5439785.235517] XENBUS: Device with no driver: device/console/0
    [5439785.236576] kjournald starting. Commit interval 5 seconds
    [5439785.236588] EXT3-fs: mounted filesystem with ordered data mode.
    [5439785.236625] VFS: Mounted root (ext3 filesystem) readonly.
    [5439785.236663] Freeing unused kernel memory: 172k freed
    [5439787.533779] EXT3 FS on sda1, internal journal
    [5440655.065431] eth0: no IPv6 routers present
    instance1:~#

    As you can see, the instance is running fine and doesn't see any disk issues. It is now time

    to fix node2 and re-establish redundancy for the involved instances.

    Note: For Ganeti 2.0 we need to manually fix the volume group on node2 by running vgreduce --removemissing xenvg

    $ gnt-node repair-storage node2 lvm-vg xenvg
    Mon Oct 26 18:14:03 2009 Repairing storage unit 'xenvg' on node2 ...
    $ ssh node2 vgs
      VG    #PV #LV #SN Attr   VSize   VFree
      xenvg   1   8   0 wz--n- 673.84G 673.84G
    $

    This has removed the bad disk from the volume group, which is now left with only one PV.

    We can now replace the disks for the involved instances:

    $ for i in instance{1..4}; do gnt-instance replace-disks -a $i; done
    Mon Oct 26 18:15:38 2009 Replacing disk(s) 0 for instance1
    Mon Oct 26 18:15:38 2009 STEP 1/6 Check device existence
    Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node1
    Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node2
    Mon Oct 26 18:15:38 2009 - INFO: Checking volume groups
    Mon Oct 26 18:15:38 2009 STEP 2/6 Check peer consistency
    Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 consistency on node node1
    Mon Oct 26 18:15:39 2009 STEP 3/6 Allocate new storage
    Mon Oct 26 18:15:39 2009 - INFO: Adding storage on node2 for disk/0
    Mon Oct 26 18:15:39 2009 STEP 4/6 Changing drbd configuration
    Mon Oct 26 18:15:39 2009 - INFO: Detaching disk/0 drbd from local storage
    Mon Oct 26 18:15:40 2009 - INFO: Renaming the old LVs on the target node
    Mon Oct 26 18:15:40 2009 - INFO: Renaming the new LVs on the target node


    Mon Oct 26 18:15:40 2009 - INFO: Adding new mirror component on node2
    Mon Oct 26 18:15:41 2009 STEP 5/6 Sync devices
    Mon Oct 26 18:15:41 2009 - INFO: Waiting for instance instance1 to sync disks.
    Mon Oct 26 18:15:41 2009 - INFO: - device disk/0: 12.40% done, 9 estimated seconds
    Mon Oct 26 18:15:50 2009 - INFO: Instance instance1's disks are in sync.
    Mon Oct 26 18:15:50 2009 STEP 6/6 Removing old storage
    Mon Oct 26 18:15:50 2009 - INFO: Remove logical volumes for disk/0
    Mon Oct 26 18:15:52 2009 Replacing disk(s) 0 for instance2
    Mon Oct 26 18:15:52 2009 STEP 1/6 Check device existence
    Mon Oct 26 18:16:01 2009 STEP 6/6 Removing old storage
    Mon Oct 26 18:16:01 2009 - INFO: Remove logical volumes for disk/0
    Mon Oct 26 18:16:02 2009 Replacing disk(s) 0 for instance3
    Mon Oct 26 18:16:02 2009 STEP 1/6 Check device existence
    Mon Oct 26 18:16:09 2009 STEP 6/6 Removing old storage
    Mon Oct 26 18:16:09 2009 - INFO: Remove logical volumes for disk/0
    Mon Oct 26 18:16:10 2009 Replacing disk(s) 0 for instance4
    Mon Oct 26 18:16:10 2009 STEP 1/6 Check device existence
    Mon Oct 26 18:16:18 2009 STEP 6/6 Removing old storage
    Mon Oct 26 18:16:18 2009 - INFO: Remove logical volumes for disk/0

    $

    At this point, all instances should be healthy again.

    Note: Ganeti 2.0 doesn't have the -a option to replace-disks, so for it you have to run the

    loop twice, once over primary instances with argument -p and once over secondary instances

    with argument -s, but otherwise the operations are similar:

    $ gnt-instance replace-disks -p instance1
    $ for i in instance{2..4}; do gnt-instance replace-disks -s $i; done

    Common cluster problems

    There are a number of small issues that might appear on a cluster that can be solved easily

    as long as the issue is properly identified. For this exercise we will consider the case of

    node3, which was broken previously and re-added to the cluster without reinstallation.

    Running cluster verify on the cluster reports:

    $ gnt-cluster verify
    Mon Oct 26 18:30:08 2009 * Verifying global settings
    Mon Oct 26 18:30:08 2009 * Gathering data (3 nodes)
    Mon Oct 26 18:30:10 2009 * Verifying node status
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 0 is in use
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 1 is in use
    Mon Oct 26 18:30:10 2009 * Verifying instance status
    Mon Oct 26 18:30:10 2009 - ERROR: instance instance4: instance should not run on
    Mon Oct 26 18:30:10 2009 * Verifying orphan volumes
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-7916
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af
    Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-7916
    Mon Oct 26 18:30:10 2009 * Verifying remaining instances
    Mon Oct 26 18:30:10 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:30:10 2009 * Other Notes
    Mon Oct 26 18:30:10 2009 * Hooks Results


    $

    Instance status

    As you can see, instance4 has a copy running on node3, because we forced the failover

    when node3 failed. This case is dangerous as the instance will have the same IP and MAC

    address, wreaking havoc on the network environment and on anyone who tries to use it.

    Ganeti doesn't directly handle this case. It is recommended to log on to node3 and run:

    $ xm destroy instance4
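    On hosts running the newer xl toolstack instead of xm, the equivalent would presumably be:

    $ xl destroy instance4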

    Unallocated DRBD minors

    There are still unallocated DRBD minors on node3. Again, these are not handled by Ganeti

    directly and need to be cleaned up via DRBD commands:

    $ ssh node3
    # on node 3
    $ drbdsetup /dev/drbd0 down
    $ drbdsetup /dev/drbd1 down
    $

    Orphan volumes

    At this point, the only remaining problem should be the so-called orphan volumes. This can

    also happen in the case of an aborted disk replacement, or a similar situation where Ganeti was not able to recover automatically. Here you need to remove them manually via LVM commands:

    $ ssh node3
    # on node3
    $ lvremove xenvg
    Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d
      Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data" successfully rem
    Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d
      Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta" successfully rem
    Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da
      Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data" successfully rem
    Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da
      Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta" successfully rem
    node3#

    At this point cluster verify shouldn't complain anymore:

    $ gnt-cluster verify
    Mon Oct 26 18:37:51 2009 * Verifying global settings
    Mon Oct 26 18:37:51 2009 * Gathering data (3 nodes)
    Mon Oct 26 18:37:53 2009 * Verifying node status
    Mon Oct 26 18:37:53 2009 * Verifying instance status
    Mon Oct 26 18:37:53 2009 * Verifying orphan volumes
    Mon Oct 26 18:37:53 2009 * Verifying remaining instances
    Mon Oct 26 18:37:53 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:37:53 2009 * Other Notes
    Mon Oct 26 18:37:53 2009 * Hooks Results


    $

    N+1 errors

    Since redundant instances in Ganeti have a primary/secondary model, each node needs to

    keep enough memory free so that, if one of its peer nodes fails, all the instances that have

    the failed node as primary and this node as secondary can be relocated to it. More

    specifically, if instance2 has node1 as primary and node2 as secondary (and node1 and

    node2 do not have any other instances in this layout), then node2 must have enough free

    memory so that if node1 fails, we can fail over instance2 without any other operations (to

    reduce the downtime window). Let's increase the memory of the current instances to 4G,

    and add three new instances, two on node2:node3 with 8GB of RAM and one on

    node1:node2, with 12GB of RAM (numbers chosen so that we run out of memory):

    $ gnt-instance modify -B memory=4G instance1
    Modified instance instance1
     - be/maxmem -> 4096
     - be/minmem -> 4096
    Please don't forget that these parameters take effect only at the next start of the
    $ gnt-instance modify
    $ gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap insta
    $ gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap insta
    $ gnt-instance add -t drbd -n node1:node2 -s 512m -B memory=8G -o debootstrap insta
    $ gnt-instance reboot --all
    The reboot will operate on 7 instances.
    Do you want to continue?
    Affected instances:
      instance1
      instance2
      instance3
      instance4
      instance5
      instance6
      instance7
    y/[n]/?: y
    Submitted jobs 677, 678, 679, 680, 681, 682, 683
    Waiting for job 677 for instance1...
    Waiting for job 678 for instance2...
    Waiting for job 679 for instance3...
    Waiting for job 680 for instance4...
    Waiting for job 681 for instance5...
    Waiting for job 682 for instance6...
    Waiting for job 683 for instance7...

    $

    We rebooted the instances for the memory changes to take effect. Now the cluster looks

    like:

    $ gnt-node list
    Node  DTotal DFree MTotal MNode MFree Pinst Sinst
    node1   1.3T  1.3T  32.0G  1.0G  6.5G     4     1
    node2   1.3T  1.3T  32.0G  1.0G 10.5G     3     4
    node3   1.3T  1.3T  32.0G  1.0G 30.5G     0     2
    $ gnt-cluster verify
    Mon Oct 26 18:59:36 2009 * Verifying global settings


    Mon Oct 26 18:59:36 2009 * Gathering data (3 nodes)
    Mon Oct 26 18:59:37 2009 * Verifying node status
    Mon Oct 26 18:59:37 2009 * Verifying instance status
    Mon Oct 26 18:59:37 2009 * Verifying orphan volumes
    Mon Oct 26 18:59:37 2009 * Verifying remaining instances
    Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate in
    Mon Oct 26 18:59:37 2009 * Other Notes
    Mon Oct 26 18:59:37 2009 * Hooks Results

    $

    The cluster verify error above shows that if node1 fails, node2 will not have enough memory

    to failover all primary instances on node1 to it. To solve this, you have a number of options:

    try to manually move instances around (but this can become complicated for any

    non-trivial cluster)

    try to reduce the minimum memory of some instances on the source node of the N+1

    failure (in the example above node1): this will allow it to start and be failed

    over/migrated with less than its maximum memory

    try to reduce the runtime/maximum memory of some instances on the destination node

    of the N+1 failure (in the example above node2) to create additional available node

    memory (check the Ganeti administrator's guide for what Ganeti will and won't

    automatically do in regards to instance runtime memory modification)

    if Ganeti has been built with the htools package enabled, you can run the hbal tool,

    which will try to compute an automated cluster solution that complies with the N+1 rule

    (see the sketch after this list)
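    As a sketch of the last option, hbal can be pointed at the running cluster (this assumes
    the htools are installed; -L talks to the local Luxi interface and -C only prints the
    suggested gnt-* commands instead of executing anything):

    $ hbal -L -C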

    Network issues

    In case a node has problems with the network (usually the secondary network, as problems with the primary network will render the node unusable for Ganeti commands), it will show up

    in cluster verify as:

    $ gnt-cluster verify
    Mon Oct 26 19:07:19 2009 * Verifying global settings
    Mon Oct 26 19:07:19 2009 * Gathering data (3 nodes)
    Mon Oct 26 19:07:23 2009 * Verifying node status
    Mon Oct 26 19:07:23 2009 - ERROR: node node1: tcp communication with node 'node3':
    Mon Oct 26 19:07:23 2009 - ERROR: node node2: tcp communication with node 'node3':
    Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node1':
    Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node2':
    Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node3':
    Mon Oct 26 19:07:23 2009 * Verifying instance status
    Mon Oct 26 19:07:23 2009 * Verifying orphan volumes
    Mon Oct 26 19:07:23 2009 * Verifying remaining instances
    Mon Oct 26 19:07:23 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 19:07:23 2009 * Other Notes
    Mon Oct 26 19:07:23 2009 * Hooks Results
    $

    This shows that both node1 and node2 have problems contacting node3 over the secondary

    network, and node3 has problems contacting them. From this output it can be deduced that,

    since node1 and node2 can communicate between themselves, node3 is the one having problems, and you need to investigate its network settings/connection.
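    A quick manual test of the secondary network can confirm this; a sketch run from one of
    the healthy nodes, using the secondary addresses of the example cluster:

    $ ssh node1
    # on node1
    $ ping -c 3 192.0.2.3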


    Migration problems

    Since live migration can sometimes fail and leave the instance in an inconsistent state,

    Ganeti provides a --cleanup argument to the migrate command that does:

    check on which node the instance is actually running (has the command failed before

    or after the actual migration?)

    reconfigure the DRBD disks accordingly

    It is always safe to run this command as long as the instance has good data on its primary

    node (i.e. not showing as degraded). If so, you can simply run:

    $ gnt-instance migrate --cleanup instance1
    Instance instance1 will be recovered from a failed migration. Note
    that the migration procedure (including cleanup) is **experimental**
    in this version. This might impact the instance if anything goes
    wrong. Continue?
    y/[n]/?: y
    Mon Oct 26 19:13:49 2009 Migrating instance instance1
    Mon Oct 26 19:13:49 2009 * checking where the instance actually runs (if this hangs,
    Mon Oct 26 19:13:49 2009 * instance confirmed to be running on its primary node (no
    Mon Oct 26 19:13:49 2009 * switching node node1 to secondary mode
    Mon Oct 26 19:13:50 2009 * wait until resync is done
    Mon Oct 26 19:13:50 2009 * changing into standalone mode
    Mon Oct 26 19:13:50 2009 * changing disks into single-master mode
    Mon Oct 26 19:13:50 2009 * wait until resync is done
    Mon Oct 26 19:13:51 2009 * done
    $

    In use disks at instance shutdown

    If you see something like the following when trying to shut down an instance or deactivate

    its disks:

    $ gnt-instance shutdown instance1
    Mon Oct 26 19:16:23 2009 - WARNING: Could not shutdown block device disk/0 on node

    It most likely means something is holding open the underlying DRBD device. This can be

    bad if the instance is not running, as it might mean that there was concurrent access from

    both the node and the instance to the disks, but not always (e.g. you could only have had the partitions activated via kpartx).

    To troubleshoot this issue you need to follow standard Linux practices, and pay attention to

    the hypervisor being used:

    check if (in the above example) /dev/drbd0 on node2 is being mounted somewhere

    (cat /proc/mounts)

    check if the device is not being used by device mapper itself: dmsetup ls and look for

    entries of the form drbd0pX, and if so remove them with either kpartx -d or dmsetup

    remove

    For Xen, check if it's not using the disks itself:
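    One way to check is to list the block devices that dom0's blkback currently has attached;
    a sketch (this assumes the xenstore-ls utility from the Xen toolstack is available):

    $ xenstore-ls /local/domain/0/backend/vbd | grep -e "domain =" -e physical-device

    The physical-device entries are major:minor pairs in hexadecimal (DRBD's major number
    147 appears as 93), which can be compared against the output of ls -l /dev/drbd*.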
