Troubleshooting XenServer deployments
Agenda
• Case study: “Production down”
• Learn: “XenServer crash”
• Case study: “Single-pathing”
• Q & A
VM doesn’t start - why?
Basic troubleshooting in XenCenter
• Cannot start a VM: “The SR is not available” error
• Storage Repository (SR) in “broken” state
• “Repair” does not work
Use the CLI to troubleshoot
# xe pbd-list currently-attached=false
What is “broken”?
[Diagram: XenServer_1 and XenServer_2 each attach to the SR through a PBD; labels: “has UUID (unique ID)”, “SCSI ID”]
• PBD = Physical Block Device
• SR = Volume Group, name: <Prefix> + SR UUID
Broken storage
Storage troubleshooting
Goal: Reproduce and analyse the logs
/var/log/xensource.log*, SMlog*, messages*
# tail -f /var/log/messages > /tmp/ShortLog
# date
# echo "Unplugging cable" >> messages
Time stamps: messages (UTC) <> xensource.log (local)
PBD unplugged
Plugging PBD manually
# xe pbd-list host-uuid=... sr-uuid=...
# xe pbd-plug uuid=...
SR_BACKEND_FAILURE_47: The SR is not available no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
# grep "PBD.plug" xensource.log
Volume Group
What is VG?
[Diagram: Logical Volume Manager (LVM) layers]
• HDD / LUN → Physical Volume (PV) → Volume Group (VG) → Logical Volume (LV)
• Storage Repository (SR) = Volume Group (VG)
• Virtual Disk (VDI) = Logical Volume (LV)
• Example: 3 VMs, 1 virtual disk each: one SR holding 3 VDIs
Volume Group
Matching the UUID
# vgs
VG                                                  #PV  #LV  #SN  Attr    VSize    VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9    1   18    0  wz--n-  89.99G   19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853    1    2    0  wz--n-  129.07G  129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88    1   11    0  wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3    1    1    0  wz--n-  1.99G    1.98G
# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
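The UUID matching on this slide can be scripted. The sketch below is only an illustration: it assumes the VG name is the prefix "VG_XenStorage-" followed by the SR UUID (as in the error message shown earlier), and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether the VG that belongs to an SR appears in `vgs` output.
# Assumption: VG name = "VG_XenStorage-" + SR UUID. Function names are made up.

# Build the expected VG name from an SR UUID.
vg_name_for_sr() {
    printf 'VG_XenStorage-%s\n' "$1"
}

# Read `vgs` output on stdin; exit status 0 if the expected VG is in column 1.
vg_listed_for_sr() {
    awk -v vg="$(vg_name_for_sr "$1")" '$1 == vg { found = 1 } END { exit !found }'
}

# Possible use on a host (not executed here):
#   vgs | vg_listed_for_sr "$SR_UUID" || echo "VG for SR $SR_UUID is missing"
```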
Examining HDD/LUN
Checking SCSI ID
• Check the SCSI ID (unique for each SCSI device)
# xe pbd-list params=device-config sr-uuid=...
device-config SCSIid: 360a9800050334f49633459
Examining HDD/LUN
Can the Linux kernel see this block device (SCSI device)?
# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...
Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec
(LUN readable!)
Addressing SCSI disks
# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546
• /dev/disk/by-id
  scsi-360a9800050334f4963345767656c546a -> /dev/sde
• /dev/disk/by-scsibus
  360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
  360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde
• /dev/mapper/360a9800050334f4963345767656c546
Also check /dev/disk/by-path
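As a rough sketch, the symlink listing above can be reduced to the set of block devices behind one SCSI ID. The function name is hypothetical, and the input is assumed to be `ls -l` style lines whose last field is the link target.

```shell
#!/bin/sh
# Sketch: list the distinct block devices that symlinks for one SCSI ID point to.
# Reads `ls -lR /dev/disk` style output on stdin. The function name is made up.
block_devices_for_scsi_id() {
    # Keep lines mentioning the SCSI ID, take the symlink target (last field).
    grep -F -- "$1" | awk '/->/ { print $NF }' | sort -u
}

# Possible use on a host (not executed here):
#   ls -lR /dev/disk | block_devices_for_scsi_id <SCSI ID>
```

More than one device in the result means the LUN is reachable over several paths.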
Examining HDD/LUN
Is the LUN empty?
# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...
...
ID_FS_TYPE=LVM2_member
...
“If this is an LVM member, why is there no VG on it?”
Examining HDD/LUN
Is there a VG created on the PV?
# pvs
PV                               VG                                       Fmt   Attr  Psize    Free
/dev/mapper/360a9800050334f4963  VG_XenStorage-090d4717-9f91-92de-83c3-  lvm2  a-    89.99G   19.48G
/dev/mapper/360a9800050334f4963  VG_XenStorage-70a029cf-7f35-c035-4af7-  lvm2  a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963  VG_XenStorage-19856cba-830c-e298-79fa-  lvm2  a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965  VG_XenStorage-9be18df5-3fd2-4835-b864-  lvm2  a-    1.99G    1.98G
/dev/sda3                        VG_XenStorage-5239de43-6a74-0365-f825-  lvm2  a-    129.07G  129.05G
# pvs | grep 360a9800050334f496334595a32306431
PV VG Fmt Attr Psize Free
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
VG_XenStorage-<UUID> differs from SR UUID!
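The comparison between the VG found on the PV and the SR UUID can be sketched as follows. It assumes the "VG_XenStorage-" prefix seen in the output above; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare the VG found on a PV with the VG name expected for an SR.
# Assumption: VG name = "VG_XenStorage-" + SR UUID. Function names are made up.

# Read `pvs` output on stdin; print the VG (column 2) of the first PV whose
# device path (column 1) contains the given SCSI ID.
vg_on_pv() {
    awk -v id="$1" 'index($1, id) { print $2; exit }'
}

# Usage: pvs | pv_matches_sr SCSI_ID SR_UUID
# Exit status 0 only if the VG on that PV is named after the SR UUID.
pv_matches_sr() {
    [ "$(vg_on_pv "$1")" = "VG_XenStorage-$2" ]
}
```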
No original VG on the LUN
Potential reasons:
• (Re)installation of a host in the same pool
  • Unplug FC / Zoning
• (Re)installation of a host in another pool
  • Zoning
• Adding an SR with “xe sr-create” in the CLI
  ...BE VERY CAREFUL!
Volume Group
...has been recreated!
• Lost LVM metadata
• Lost 100 MB of the VDI data
Action steps:
• Don’t shut down running VMs
• Online backup for running VMs (now)
• Block-level clone of the whole LUN (now)
• Assess professional data recovery
Volume Group
Looking for LVM metadata backup
/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
• Check backup timestamp (within the file)
Make a copy first
# cp /etc/lvm/backup/* /root/backup/
LVs in backup file
# cat /etc/lvm/backup/VG... | grep VHD
VDIs in xapi database
# xe vdi-list sr-uuid=<uuid> params=uuid
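One way to line up the LVs from the backup file against the VDIs in the xapi database is to reduce both listings to bare UUIDs and diff them. This is a sketch with hypothetical function names and temporary file paths. It assumes LV names have the form VHD-<VDI uuid>.

```shell
#!/bin/sh
# Sketch: compare LV UUIDs from the LVM backup with VDI UUIDs known to xapi.
# Function names and the /tmp file names below are made up for illustration.

# Extract every UUID-shaped string from stdin, one per line, sorted, unique.
extract_uuids() {
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' | sort -u
}

# Print UUIDs present in the first sorted list file but absent from the second.
only_in_first() {
    comm -23 "$1" "$2"
}

# Possible use on a host (not executed here):
#   grep VHD /etc/lvm/backup/VG... | extract_uuids > /tmp/lv.txt
#   (the xe vdi-list command above) | extract_uuids > /tmp/vdi.txt
#   only_in_first /tmp/lv.txt /tmp/vdi.txt   # LVs with no VDI record
#   only_in_first /tmp/vdi.txt /tmp/lv.txt   # VDIs with no LV in the backup
```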
Volume Group
Removing new VG and PV
# vgremove "VG_XenStorage-<new SR uuid>"
# pvremove /dev/mapper/<SCSI ID>
Volume Group
Recreating PV and VG from backup
# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> /dev/mapper/<SCSI ID>
# vgcfgrestore VG_XenStorage-<SR UUID> -f /etc/lvm/backup/VG_XenStorage-<SR UUID>
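The PV UUID needed by pvcreate can be read from the backup file by eye. A small helper is sketched below. It assumes the usual LVM backup layout, where a `physical_volumes` section contains entries such as `pv0 { id = "..." }`; the function name is hypothetical, and the result should be checked against the file before it is used.

```shell
#!/bin/sh
# Sketch: print the UUID of the first physical volume recorded in an LVM
# metadata backup file. Assumes the layout:
#   physical_volumes {
#       pv0 {
#           id = "<PV uuid>"
# The function name is made up. Reads the given file, or stdin if none.
pv_uuid_from_backup() {
    awk '/physical_volumes/ { in_pv = 1; next }
         in_pv && /id = "/  { split($0, part, "\""); print part[2]; exit }' "$@"
}

# Possible use on a host (not executed here):
#   pv_uuid_from_backup /etc/lvm/backup/VG_XenStorage-<SR UUID>
```

The VG-level `id` line that precedes the `physical_volumes` section is skipped on purpose, since it identifies the volume group rather than the PV.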
Examining HDD/LUN
Confirm that VG name contains SR uuid...
# pvs | grep 360a9800050334f496334595a32306431
PV VG Fmt Attr Psize Free
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
VG_XenStorage-<UUID> matches SR UUID
Volume Group
Checking Logical Volumes
# lvs
MGT                                       VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  -wi---  520.00M
Storage Repository
Plugging PBD again...
# xe pbd-plug uuid=…
Success! But no VDIs shown...
# xe sr-scan uuid=…
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]
# xe vdi-list uuid=<VDI uuid from the error above>
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=…
Success! All VDIs shown... Well done!
What we’ve learned
...by troubleshooting the “Production down” issue
• To get plugged, a PBD needs...
• LUN/HDD → PV → VG (SR) → LV (VDI)
• VG name generated from SR uuid (+ prefix)
• LV name generated from VDI uuid (+ prefix)
• Displaying VG (vgs), PV (pvs), LV (lvs)
• Addressing block devices (/dev/disk)
• Examining HDD/LUN with "hdparm -t"
• Restoring PV & VG from backup
The XenServer crash?
Unresponsive or rebooting host
• Kernel panic or crash dump
  • Error on console, host locked
  • Memory addressing, bug in OS, hardware failure
• No kernel panic and no crash dump
  • Host rebooting / frozen / no errors on the console
  • Hardware failure, OS busy (I/O), user action
[Flowchart: triage for a rebooted or unresponsive host. Node labels:]
• Symptom: Host rebooted itself
• /var/crash/<date> exists: Review crashdump
• No crashdump
• HA enabled: Host fenced? Check /var/log/xha.log. Analyse /var/log/messages, xensource.log for HA reasons. Disable HA.
• HA disabled: Add the “noreboot” option in extlinux.conf. Analyse /var/log/messages, xensource.log. Still rebooting? Examine hardware.
• Symptom: Host is unresponsive
• Serial console: Generate crashdump (CTX120540) & reboot. Review crashdump.
• No serial console: Connect local console. Any errors on the console? Take photos and reboot. Boot the host to the console (CTX120540) & reboot.
• Contact Citrix Tech Support
Getting into details…
As easy as grep
Analyse /var/log/messages, xensource.log
Startup strings:
# cd /var/log
# grep "klogd" messages -B100
# grep "SERVER START" xensource.log -B100
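When only the moment of the reboot matters, the grep context can be narrowed to the single line logged just before each startup string. The sketch below is one way to do that; the function name is hypothetical, and the marker strings are the ones used in the grep commands above.

```shell
#!/bin/sh
# Sketch: print the log line that immediately precedes each startup marker,
# i.e. the last thing logged before the host came back up.
# Usage: last_line_before_marker MARKER FILE   (the function name is made up)
last_line_before_marker() {
    awk -v m="$1" 'index($0, m) && NR > 1 { print prev } { prev = $0 }' "$2"
}

# Possible use on a host (not executed here):
#   last_line_before_marker "klogd" /var/log/messages
#   last_line_before_marker "SERVER START" /var/log/xensource.log
```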
Inside crash log directory
/var/crash/<stamp>
Citrix Confidential - Do Not Distribute
Review crashdump
• crash.log: hypervisor console ring; CPU stack - to be analysed by Citrix Tech Support
• Domain0.log: Domain0 console ring
• What to look for: HA activity, page fault, driver, storage issues
• Domain1,2,3...log
• Debug.log
• xen-memory-dump
Investigating crash.log
Xen console ring
• Located at the bottom of the file
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
• Why was the watchdog triggered? /var/log/xha.log (network or storage heartbeat failed)
• Why did the heartbeat fail? /var/log/messages (DMP, kernel, drivers, I/O errors)
Investigating crash.log
Page fault
Other examples:
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
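A first pass over crash.log can be automated by looking for the two signatures shown on these slides. This is a sketch only: the function name and the output labels are hypothetical, and other crash causes will simply be reported as "unknown".

```shell
#!/bin/sh
# Sketch: give a first classification of a crash.log using the two message
# patterns quoted on the slides. Function name and labels are made up.
# Usage: classify_crash_log FILE
classify_crash_log() {
    if grep -q 'Watchdog timer fired' "$1"; then
        echo watchdog      # next step: check /var/log/xha.log
    elif grep -q 'FATAL TRAP: vector = 14' "$1"; then
        echo page-fault    # hypervisor page fault panic
    else
        echo unknown
    fi
}
```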
What we’ve learned
Learn: XenServer crash
•Host really crashed?
•Kernel Panic
•Crashdump
•Triggering Crashdump manually
•Locating host reboot in the logs
•Reviewing crashdump logs
Storage Performance issue
• DMP has been enabled to improve performance
• Virtual Machines are running on different iSCSI SRs
LinuxGuestVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec
Storage Performance
Checking multipath status
# mpathutil status
360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]
Callouts: multipath device under /dev/mapper/...., single-path devices (sdk, sdj) under /dev/
Storage Performance
Determining current performance on domain0
• Testing the multipath device
# hdparm -t /dev/mapper/<scsi id>
• Testing single-path devices
# hdparm -t /dev/sdj
# hdparm -t /dev/sdm
In all cases: 30 MB/sec
Storage Performance
Determining usage of paths
# iostat -x <device>
# iostat -x /dev/sdk /dev/sdj 5
Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk     803.50      33.0        4122      160
sdj     784.00      32.8        3922      155
Both paths are used equally
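"Used equally" can be made concrete by computing each path's share of the read rate. The sketch below assumes input rows shaped like the table above (device name in column 1, Blk_read/s in column 2); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print each device's share of the total read rate.
# Reads iostat-style rows on stdin: column 1 = device, column 2 = Blk_read/s.
# Header rows are skipped because their second column is not numeric.
# The function name is made up.
path_read_share() {
    awk '$2 ~ /^[0-9.]+$/ { dev[++n] = $1; rate[n] = $2; total += $2 }
         END {
             if (total > 0)
                 for (i = 1; i <= n; i++)
                     printf "%s %.1f%%\n", dev[i], 100 * rate[i] / total
         }'
}
```

For the two rows shown on the slide this gives roughly 50.6% for sdk and 49.4% for sdj, which matches the conclusion that both paths carry about the same load.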
Storage Performance
Checking if there are really 2 iSCSI sessions
# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"
ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj
Storage Performance
Checking if different paths are really used
# tcpdump -i any port 3260
# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "
eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C
RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)
eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E
RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)
Storage Performance
Checking source IP addresses for iSCSI sessions
# netstat -at | grep iscsi
10.1.200.138:53049 10.1.200.40:iscsi-target ESTABLISHED
10.1.200.178:46684 10.1.201.40:iscsi-target ESTABLISHED
Storage Performance
Checking kernel routing table
# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.200.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
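The problem in this routing table is that the same destination network is reachable through two interfaces, so the kernel sends both iSCSI sessions out of the first matching one. A sketch of a check for that condition follows. The function name is hypothetical, and it assumes Destination in column 1, Genmask in column 3 and the interface in the last column.

```shell
#!/bin/sh
# Sketch: report destination networks that appear on more than one interface
# in `route` output (stdin). Assumes column 1 = Destination, column 3 = Genmask,
# last column = Iface. The function name is made up.
duplicate_route_destinations() {
    awk '$1 ~ /^[0-9]+\./ {
             key = $1 " " $3
             if ((key in seen) && seen[key] != $NF)
                 dup[key] = seen[key] "," $NF
             else
                 seen[key] = $NF
         }
         END { for (k in dup) print k, dup[k] }'
}

# Possible use on a host (not executed here):
#   route -n | duplicate_route_destinations
```

Empty output means every destination network maps to a single interface, which is the state expected once the second iSCSI interface sits in its own subnet.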
Storage Performance
Configuration of management interfaces in XenCenter
Change the IP address of the ISCSI_2 interface to 10.1.201.78
Storage Performance
Determining current performance on domain0
# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.201.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
Storage Performance
Configuring kernel routing table
...or (not recommended)
• Add to /etc/rc.local
# route add -host 10.1.200.40 xenbr0
# route add -host 10.1.201.40 xenbr1
• What about Pool Upgrade and Pool Join?
Storage Performance
Determining current performance on VM
LinuxVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 45 MB/sec
Well Done!
What we’ve learned
Case study: Single-pathing
• /dev/ locations for single and multi-path devices
• # mpathutil status
• # hdparm -t
• # iostat
• # ifconfig, # tcpdump, # netstat, # route
• # watch
• Best practices for iSCSI storage
Resources
First aid kit
• http://docs.xensource.com - XenServer documentation
• http://support.citrix.com/product/xens/ - Knowledge Center
• http://forums.citrix.com/support - Support forums
• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)
Before you leave…
• Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October
• Provide your feedback and pick up a complimentary gift card at the registration desk
• Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account