www.novell.com
Novell Training Services
ATT LIVE 2012 LAS VEGAS
SUSE Advanced Troubleshooting: The Boot Process Lab
SUS21
Novell, Inc. Copyright 2012-ATT LIVE-1-HARDCOPY PERMITTED. NO OTHER PRINTING, COPYING, OR DISTRIBUTION ALLOWED.
Proprietary Statement
Copyright © 2011 Novell, Inc. All rights reserved.
Novell, Inc., has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed on the Novell Legal Patents Web page (http://www.novell.com/company/legal/patents/) and one or more additional patents or pending patent applications in the U.S. and in other countries.
No part of this publication may be reproduced, photocopied, stored on a retrieval system, or transmitted without the express written consent of the publisher.
Novell, Inc.
404 Wyman Street, Suite 500
Waltham, MA 02451
U.S.A.
www.novell.com
Novell Trademarks
For Novell trademarks, see the Novell Trademark and Service Mark list (http://www.novell.com/company/legal/trademarks/tmlist.html).
Third-Party Materials
All third-party trademarks are the property of their respective owners.
Software Piracy
Throughout the world, unauthorized duplication of software is subject to both criminal and civil penalties.
If you know of illegal copying of software, contact your local Software Antipiracy Hotline. For the Hotline number for your area, access Novell’s World Wide Web page (http://www.novell.com) and look for the piracy page under “Programs.” Or, contact Novell’s anti-piracy headquarters in the U.S. at 800-PIRATES (747-2837) or 801-861-7101.
Disclaimer
Novell, Inc., makes no representations or warranties with respect to the contents or use of this documentation, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose.
Further, Novell, Inc., reserves the right to revise this publication and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. Further, Novell, Inc., makes no representations or warranties with respect to any software, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. Further, Novell, Inc., reserves the right to make changes to any and all parts of Novell software, at any time, without any obligation to notify any person or entity of such changes.
Any products or technical information provided under this Agreement may be subject to U.S. export controls and the trade laws of other countries. You agree to comply with all export control regulations and to obtain any required licenses or classification to export, re-export or import deliverables. You agree not to export or re-export to entities on the current U.S. export exclusion lists or to any embargoed or terrorist countries as specified in the U.S. export laws. You agree to not use deliverables for prohibited nuclear, missile, or chemical biological weaponry end uses. See the Novell International Trade Services Web page (http://www.novell.com/info/exports/) for more information on exporting Novell software. Novell assumes no responsibility for your failure to obtain any necessary export approvals.
This Novell Training Manual is published solely to instruct students in the use of Novell networking software. Although third-party application software packages are used in Novell training courses, this is for demonstration purposes only and shall not constitute an endorsement of any of these software applications.
Further, Novell, Inc. does not represent itself as having any particular expertise in these application software packages and any use by students of the same shall be done at the student’s own risk.
Contents
Section 1 Troubleshooting
   Exercise 1.1 Troubleshooting Techniques
   Exercise 1.2 Troubleshooting Table
Section 2 Administration
   Exercise 2.1 Configuring Your Snapshot
      Task I: Take a Snapshot
Section 3 Troubleshooting Exercises
   Exercise 3.1 Troubleshooting Exercise: Root Password
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.2 Troubleshooting Exercise: Users Locked Out
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.3 Troubleshooting Exercise: Repair Filesystem Prompt
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.4 Troubleshooting Exercise: Server Hung with Blank Screen
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.5 Troubleshooting Exercise: Kernel and Initrd Messages
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.6 Troubleshooting Exercise: Server Reboots
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.7 Troubleshooting Exercise: Login Console Hang
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.8 Troubleshooting Exercise: Waiting for Device
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
Version 1
Copying all or part of this manual, or distributing such copies, is strictly prohibited. To report suspected copying, please call 1-800-PIRATES
   Exercise 3.9 Troubleshooting Exercise: GRUB Prompt
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.10 Troubleshooting Exercise: Failed Run Level Services
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.11 Troubleshooting Exercise: Read-Only Root Filesystem
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.12 Troubleshooting Exercise: Missing Action Field
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.13 Troubleshooting Exercise: GRUB
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.14 Troubleshooting Exercise: Invalid Partition Table
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.15 Troubleshooting Exercise: Kernel Panic
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.16 Troubleshooting Exercise: Error in Service Module
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.17 Troubleshooting Exercise: Fatal modules.dep Error
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.18 Troubleshooting Exercise: Another Kernel Panic
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.19 Troubleshooting Exercise: Segmentation Fault
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.20 Troubleshooting Exercise: Respawning Too Fast
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.21 Troubleshooting Exercise: Booting to $ Prompt
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.22 Troubleshooting Exercise: Server Hang at Boot
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.23 Troubleshooting Exercise: Power Off
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.24 Troubleshooting Exercise: Critical Data
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.25 Troubleshooting Exercise: Kernel Panic After Disk Change
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.26 Troubleshooting Exercise: Command Not Found
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.27 Troubleshooting Exercise: Waiting for Device after LUN
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
   Exercise 3.28 Troubleshooting Exercise: Not Booting After Power Failure
      Task I: Configuration
      Task II: Troubleshooting Procedure
      Task III: Root Cause
List of Figures
Initial "Refresh" Snapshot
Section 1 Troubleshooting
Brief overview of troubleshooting techniques and the troubleshooting table.
1.1 Troubleshooting Techniques
Troubleshooting Procedure
1. Let the server boot until it fails
2. Write down verbatim what's on the screen
3. Match on-screen landmarks to the Troubleshooting Table
4. Use Boot Installed System (BIS) to bypass GRUB, kernel and ram disk issues
5. Use an administrative run level for daemon failure
6. Use Chroot Installed System (CIS) if all else fails
7. Address the issues and files associated with the location of the boot failure (see Troubleshooting Table)
Boot Installed System (BIS)
1. Used mostly in lines 1-7 of the Troubleshooting Table.
2. Boot from DVD1
3. Select “Installation”
4. Accept the License Agreement
5. Click “Next”, and “Next” to skip media checks
6. Select “Repair Installed System” and “Next”
7. Select “Expert Tools”
8. Select “Boot Installed System”
NOTE: Selecting “Repair Installed System” directly from the DVD boot menu does not probe
as thoroughly as selecting “Repair Installed System” from within the Installation option.
Administrative Run Levels
Run levels S and 1 are the closest run levels to a chroot installed system (CIS).
However, run levels S and 1 boot with the installed system's own boot loader, kernel and
ram disk; they just don't start all the system processes that run level 3 or 5 does. For
that reason, run levels S and 1 are preferred over CIS. There are a couple of ways to
change to run level S or 1. You could just type init 1. However, if you are troubleshooting
system processes that fail at boot time, or that cause the server to misbehave once they
start, you will want to reboot the server, bypass the default run level and boot directly
into run level 1. To boot to run level 1, do the following:
1. Boot the server normally
2. Select the kernel you usually boot to
3. Tab to, or click in, the “Boot Options” field
4. Append “ 1” (a space followed by the number 1) to the boot options line and press Enter to boot
5. When prompted, type root's password
If you need network access, run /etc/init.d/network start or dhcpcd eth0
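After the server is up, it can be useful to confirm which override the boot actually used by inspecting the kernel command line. The following is a minimal sketch and not part of the course material: the function name boot_override is hypothetical, and it only recognizes the options this workbook mentions (a trailing 1 or S, and init=...).

```shell
#!/bin/sh
# Hypothetical helper (not from the lab): report the run-level or init
# override found on a kernel command line, if any.
# Argument: the command line as a single string.
boot_override() {
    for word in $1; do
        case $word in
            1|S)    echo "runlevel=$word"; return 0 ;;
            init=*) echo "$word"; return 0 ;;
        esac
    done
    # No override found: the default run level from /etc/inittab applies.
    return 1
}
```

On a running system the command line can be read from /proc/cmdline, for example: boot_override "$(cat /proc/cmdline)" || echo "default run level".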
Chroot Installed System (CIS)
1. Used mostly in lines 8 and above of the troubleshooting table.
2. Boot from DVD1
3. Select “Rescue System” and log in as root at the “Rescue login:” prompt
4. Your first goal is to find and mount the root “/” partition, so you can see /etc/fstab
   1. Run cat /proc/partitions to find the disk devices the OS sees
   2. For each device, display the partition table
      ls-boot:~ # parted -s /dev/sda print
      Disk geometry for /dev/sda: 0kB - 2147MB
      Disk label type: msdos
      Number  Start   End     Size    Type      File system  Flags
      1       32kB    214MB   214MB   primary   ext2         boot, type=83
      2       214MB   535MB   321MB   primary   linux-swap   type=82
      3       535MB   2147MB  1612MB  extended               lba, type=0f
      5       535MB   1012MB  477MB   logical   reiserfs     type=83
      6       1012MB  1596MB  584MB   logical   reiserfs     type=83
      7       1596MB  2147MB  551MB   logical   reiserfs     type=83
   3. You can ignore type 82 swap and type 0f extended partitions
   4. To find the root partition, you may need to just guess. For example,
      1. mount /dev/sda1 /mnt
      2. ls -l /mnt
      3. If the /mnt directory listing shows /etc and /root, then it's the root partition
      4. If not, run umount /mnt and repeat these steps for each device until you find root. In this case, the root device is /dev/sda6
      5. mount /dev/sda6 /mnt
5. Mount all additional file systems relative to /mnt
   1. Once you have mounted the root filesystem, run cat /mnt/etc/fstab to see all the other filesystem mount points.
   2. Mount all file systems manually as shown in /mnt/etc/fstab.
      mount /dev/sda1 /mnt/boot
      mount /dev/sda5 /mnt/var
      mount /dev/sda7 /mnt/usr
   3. Rebind the /proc, /sys and /dev filesystems.
      mount --rbind /proc /mnt/proc
      mount --rbind /sys /mnt/sys
      mount --rbind /dev /mnt/dev
6. Chroot to the installed system: chroot /mnt
7. To return to the rescue system, type exit.
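The guess-and-check search for the root partition in step 4 and the fstab-driven mounts in step 5 can be sketched as small shell helpers. This is a minimal sketch under stated assumptions, not part of the course material: the function names looks_like_root, find_root and fstab_mounts are hypothetical, mounting candidates read-only is an added precaution the lab does not ask for, and fstab_mounts only prints commands so they can be reviewed before anything is mounted.

```shell
#!/bin/sh
# Hypothetical helpers (not from the lab) for the rescue system.

# Succeeds if DIR contains the /etc and /root directories the lab looks for.
looks_like_root() {
    [ -d "$1/etc" ] && [ -d "$1/root" ]
}

# find_root MNT DEVICE...
# Mounts each candidate read-only on MNT. Prints the first device that looks
# like root and leaves it mounted; wrong guesses are unmounted again.
# Needs root privileges.
find_root() {
    mnt=$1
    shift
    for dev in "$@"; do
        if mount -o ro "$dev" "$mnt" 2>/dev/null; then
            if looks_like_root "$mnt"; then
                echo "$dev"
                return 0
            fi
            umount "$mnt"
        fi
    done
    return 1
}

# fstab_mounts FSTAB PREFIX
# Prints (does not run) a mount command for every /dev/... entry in FSTAB
# except the root entry and swap, with mount points relative to PREFIX.
# Sorting by mount point lists parent directories before their children.
fstab_mounts() {
    awk -v prefix="$2" '
        $1 ~ /^\/dev\// && $2 != "/" && $3 != "swap" {
            print "mount " $1 " " prefix $2
        }' "$1" | sort -k3
}
```

A possible session: find_root /mnt /dev/sda1 /dev/sda5 /dev/sda6 /dev/sda7, then fstab_mounts /mnt/etc/fstab /mnt to review the remaining mounts. Because find_root mounts read-only, run mount -o remount,rw /mnt before making changes, and still rebind /proc, /sys and /dev before chroot /mnt.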
Flow Control
The normal boot messages scroll past on the screen very fast. There are ways to slow them
down and test each service as it loads. The boot messages are controlled by variables set
in the /etc/sysconfig/boot file.
FLOW_CONTROL="yes"
Allows you to stop the boot process messages using Ctrl-S and resume them with Ctrl-Q.
PROMPT_FOR_CONFIRM="yes"
CONFIRM_PROMPT_TIMEOUT="5"
These settings display the prompt:
Enter interactive startup mode? y/[n](5s)
You must select “y” within the CONFIRM_PROMPT_TIMEOUT period to enter interactive startup
mode; otherwise the server boots normally without prompting before it loads system daemons.
After you enter interactive startup mode, you are prompted before each service loads:
Start service <service_name>, (Y)es/(N)o/(C)ontinue? [y]
The CONFIRM_PROMPT_TIMEOUT value also applies to each service start prompt. This
was not true with earlier versions of SLES.
Once the server has booted up, you can use Shift-PgUp to scroll up about two screens' worth
of boot messages, regardless of the /etc/sysconfig/boot settings. However, if you switch to
another console (for example, tty2), you will no longer be able to use this keystroke.
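The variables above can be edited by hand, but a small helper makes the change repeatable between lab runs. This is a minimal sketch, not part of the course material: the function name set_sysconfig_var is hypothetical, and it assumes simple values such as yes, no or a number (values containing | or & would need escaping for sed).

```shell
#!/bin/sh
# Hypothetical helper (not from the lab): set KEY="VALUE" in a sysconfig-style
# file, replacing an existing assignment or appending a new one.
# Usage: set_sysconfig_var FILE KEY VALUE
set_sysconfig_var() {
    file=$1
    key=$2
    value=$3
    if grep -q "^$key=" "$file"; then
        # Rewrite through a temporary copy, then write back into the original
        # file so its ownership and permissions are preserved.
        sed "s|^$key=.*|$key=\"$value\"|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        printf '%s="%s"\n' "$key" "$value" >> "$file"
    fi
}
```

For example, as root: set_sysconfig_var /etc/sysconfig/boot FLOW_CONTROL yes, followed by the same call for PROMPT_FOR_CONFIRM (yes) and CONFIRM_PROMPT_TIMEOUT (5). Trying it on a copy of the file first is a cheap safeguard.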
1.2 Troubleshooting Table
Each numbered line lists the Boot Process, its Associated File(s), the On-Screen Landmarks, and Troubleshooting/Potential Fixes.

1  BIOS
   Associated File(s): N/A
   On-Screen Landmarks: BIOS Messages
   Troubleshooting/Potential Fixes:
      Update the firmware
      Mark boot device “bootable” with fdisk

2  MBR
   Associated File(s): /boot/grub/stage1
   On-Screen Landmarks: GRUB loading stage2...
   Troubleshooting/Potential Fixes:
      BIS
      grub-install or yast bootloader

3  GRUB
   Associated File(s):
      /boot/grub/stage2
      /boot/grub/menu.lst
   On-Screen Landmarks: GRUB menu or grub> prompt
   Troubleshooting/Potential Fixes:
      BIS
      grub-install or yast bootloader
      Check /boot/grub/menu.lst (file and device)

4  GRUB
   Associated File(s):
      /boot/vmlinuz
      /boot/initrd
   On-Screen Landmarks:
      root (hd?,?)
      Filesystem type is …
      kernel /<path_to_vmlinuz>
      initrd /<path_to_initrd>
   Troubleshooting/Potential Fixes:
      Reinstall kernel RPM
      mkinitrd
      GRUB loads and boots the kernel

5  kernel
   Associated File(s): /boot/vmlinuz
   On-Screen Landmarks: Kernel driver information beginning with [ 0.0000000 ] time stamps
   Troubleshooting/Potential Fixes:
      BIS
      Reinstall kernel RPM

6  initrd
   Associated File(s):
      /boot/initrd
      /etc/sysconfig/kernel
   On-Screen Landmarks: A time stamp [ 0.0000000] followed by module info
   Troubleshooting/Potential Fixes:
      BIS
      cd /tmp/ramdisk; zcat /boot/initrd | cpio -ivd
      mkinitrd

7  ramdisk:init
   Associated File(s):
      /init in /boot/initrd
      /etc/sysconfig/kernel
   On-Screen Landmarks:
      Starting udevd
      Creating devices
      Loading <module_name>
      There will be a “Loading” line for each module defined in INITRD_MODULES in /etc/sysconfig/kernel
   Troubleshooting/Potential Fixes:
      BIS
      mkinitrd creates the ramdisk:init file.

8  sbin:init
   Associated File(s):
      /sbin/init
      /etc/inittab
   On-Screen Landmarks: INIT: version 2.86 booting
   Troubleshooting/Potential Fixes:
      Use boot options init=/bin/bash or init=/bin/sash to bypass running /sbin/init

9  sbin:init:boot
   Associated File(s):
      /bin/bash
      /etc/init.d/boot
      /etc/init.d/boot.d/*
      /etc/sysconfig/boot
   On-Screen Landmarks:
      System Boot Control: Running /etc/init.d/boot
      Each service shows: done, failed or skipped
      System Boot Control: The system has been setup
   Troubleshooting/Potential Fixes:
      CIS
      PROMPT_FOR_CONFIRM="yes"
      RUN_PARALLEL="no"
      FLOW_CONTROL="yes" (Ctrl-S stops, Ctrl-Q resumes)

10 sbin:init:boot
   Associated File(s): /etc/init.d/boot.local
   On-Screen Landmarks: System Boot Control: Running /etc/init.d/boot.local
   Troubleshooting/Potential Fixes: CIS

11 sbin:init
   Associated File(s): /etc/inittab
   On-Screen Landmarks: INIT: Entering runlevel: 3
   Troubleshooting/Potential Fixes: init 1 or CIS

12 sbin:init:rc
   Associated File(s):
      /bin/bash
      /etc/init.d/rc
      /etc/init.d/rc?.d/*
      /etc/init.d/before.local
      /etc/init.d/after.local
   On-Screen Landmarks:
      Master Resource Control: previous runlevel: N, switching to runlevel: 3
      Master Resource Control: Running /etc/init.d/before.local
      Each service shows: done, failed or skipped
      Master Resource Control: Running /etc/init.d/after.local
      Master Resource Control: runlevel 3 has been reached
      Skipped services in runlevel 3:
   Troubleshooting/Potential Fixes:
      init 1 or CIS
      PROMPT_FOR_CONFIRM="yes"
      RUN_PARALLEL="no"
      FLOW_CONTROL="yes"

13 sbin:init
   Associated File(s): /etc/inittab
   On-Screen Landmarks: N/A
   Troubleshooting/Potential Fixes:
      init 1 or CIS
      init refers to its inittab file to know how to run the login programs.

14 sbin:init:mingetty
   Associated File(s):
      /etc/issue
      /etc/motd
      /etc/nologin
      /sbin/mingetty
      /etc/pam.d/login
   On-Screen Landmarks:
      <contents of /etc/issue>
      login:
   Troubleshooting/Potential Fixes:
      init 1 bypasses mingetty
      CIS

15 sbin:init:X
   Associated File(s):
      /etc/sysconfig/displaymanager
      /etc/sysconfig/windowmanager
   On-Screen Landmarks: Graphical login screen
   Troubleshooting/Potential Fixes:
      init 1 bypasses X login
      CIS
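Matching an on-screen landmark to a line of the Troubleshooting Table (step 3 of the Troubleshooting Procedure) can be sketched as a simple lookup. This is an illustrative sketch only: the function names landmark_to_line and first_tool are hypothetical, only landmarks with distinctive text are covered, and first_tool encodes nothing more than the rule of thumb that BIS is used mostly in lines 1-7 and CIS in lines 8 and above.

```shell
#!/bin/sh
# Hypothetical helpers (not from the lab).

# Print the Troubleshooting Table line number for the last message seen on
# screen. Returns non-zero when the text matches no known landmark.
landmark_to_line() {
    case $1 in
        *"GRUB loading"*)                         echo 2 ;;
        *"grub>"*)                                echo 3 ;;
        *"Filesystem type is"*)                   echo 4 ;;
        *"Starting udevd"*|*"Creating devices"*)  echo 7 ;;
        *"INIT: version"*)                        echo 8 ;;
        # boot.local must be tested before the generic boot-control message.
        *"/etc/init.d/boot.local"*)               echo 10 ;;
        *"System Boot Control"*)                  echo 9 ;;
        *"INIT: Entering runlevel"*)              echo 11 ;;
        *"Master Resource Control"*)              echo 12 ;;
        *"login:"*)                               echo 14 ;;
        *)                                        return 1 ;;
    esac
}

# Print the recovery environment to try first for a given table line.
first_tool() {
    if [ "$1" -le 7 ]; then
        echo BIS
    else
        echo CIS
    fi
}
```

For example, first_tool "$(landmark_to_line 'INIT: version 2.86 booting')" suggests CIS. For lines 11 and above the table also offers init 1, which is preferred over CIS when the system can reach it.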
Section 2 Administration
Exercises that help prepare for the troubleshooting labs.
2.1 Configuring Your Snapshot
The boot process labs require a lot of rebooting. Create a snapshot with the ls-boot virtual machine running to minimize boot time after reverting your snapshot.
Objectives:
Task I: Take a Snapshot
Special Instructions and Notes:
None
Task I: Take a Snapshot
1. In VMware, click File, Open
2. Select /opt/labs/vms/ls-boot/ls-boot.vmx
3. Power on the ls-boot virtual machine.
4. Select “Boot from Hard Disk”
5. Log in as root, password linux
6. Type hi
7. Type bplab followed by a space, but DO NOT press Enter.
8. Select VM, Snapshot, Take Snapshot
9. Call the Snapshot “Revert” and press OK
10. Wait for the virtual machine state to finish saving before continuing
This exercise will reduce down time between exercises.
(End of Exercise)
Section 3 Troubleshooting Exercises
The purpose of these exercises is to create a boot failure that you need to troubleshoot. The lab notes show one example of how to approach each lab's symptoms; there are usually multiple ways to troubleshoot an issue. Try to resolve the problem without looking at the lab notes. The lab notes contain the symptom, any error messages, a method for troubleshooting the issue, and the root cause.
You will become effective at troubleshooting boot-related issues as you practice the techniques taught and apply them to the exercises in this lab.
3.1 Troubleshooting Exercise: Root Password
I forgot root's password. Set root's password to "linux" and login normally as the root user.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 1
3. Press enter to continue
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Method 1
1. Reboot with boot options parameter: init=/bin/bash
2. The passwd command is in /usr/bin, which is not mounted yet.
3. Run mount -a to mount all remaining filesystems
4. Run /usr/bin/passwd to change root's password
5. Reboot
2. Method 2
1. Boot to Rescue System; Rescue login: root
2. chroot Installed System (CIS)
3. Run passwd to change root's password
4. Reboot
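Method 1 can be sketched as the command sequence below. This is a minimal sketch and not part of the original lab notes. The remount step assumes the root filesystem is still mounted read-only when the init=/bin/bash shell starts, and the reset_root_password function and the RUN variable are hypothetical helpers that let you preview the sequence before running it.

```shell
# Sketch of Method 1, meant for the shell started by init=/bin/bash.
# RUN is a hypothetical dry-run switch: by default each command is only
# printed. Call the function with RUN= (empty) to execute the commands.
reset_root_password() {
    run=${RUN-echo}
    $run mount -o remount,rw /     # assumption: root is read-only at this point
    $run mount -a                  # mount the remaining filesystems (/usr, ...)
    $run /usr/bin/passwd root      # prompts for the new password
    $run sync                      # flush the change to disk before rebooting
}

reset_root_password
```

Running the block as shown only prints the four commands. To execute them from the init=/bin/bash shell, call `RUN= reset_root_password`.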
Task III: Root Cause
1. Root's password was forgotten
2. Does an Automatic Repair fix this scenario? No
Though a forgotten password is not directly a boot-related issue, resetting root's password is a good skill to have.
(End of Exercise)
3.2 Troubleshooting Exercise: Users Locked Out
All my users are locked out, only root can log in. Make sure the geeko user can login without errors. Geeko's password is linux.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 2
3. Press enter to continue
4. Some errors observed
1. Permission denied (publickey,keyboard-interactive). (ssh)
2. login[2639]: FAILED LOGIN 1 FROM /dev/tty2 FOR geeko, Authentication failure
3. login[3050]: FAILED LOGIN 1 FROM /dev/tty1 FOR UNKNOWN, User not known to the underlying authentication module
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Review the troubleshooting table and identify where you are.
2. The boot process is complete, and you are at the login.
3. Read the man pages for login and for each of its associated files.
4. Does an /etc/nologin file exist? Yes.
5. Remove the /etc/nologin file.
Task III: Root Cause
1. Normal administrative feature. Only root can log in when /etc/nologin is present.
2. Does an Automatic Repair fix this scenario? No
Normal administrative functions may appear like failures, but are not.
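The nologin check from the troubleshooting procedure can be sketched as a small helper. This is an illustrative sketch: the check_nologin function name is hypothetical, and the optional path argument exists only so the check can be tried against a copy instead of the live /etc/nologin.

```shell
# Report whether a nologin file is blocking non-root logins.
check_nologin() (
    file=${1:-/etc/nologin}
    if [ -e "$file" ]; then
        echo "$file exists: only root can log in"
        # A non-empty nologin file holds the message shown to rejected users.
        [ -s "$file" ] && { echo "message shown to users:"; cat "$file"; }
        exit 0
    fi
    echo "$file not found: logins are not blocked by nologin"
)
```

Call `check_nologin` as root on the affected system. If the file exists and the lockout is not intended, remove it with `rm /etc/nologin`.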
(End of Exercise)
3.3 Troubleshooting Exercise: Repair Filesystem Prompt
The system fails to boot and prompts for root's password.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 3
3. Press enter to continue
4. Some errors observed
1. fsck failed for at least one filesystem (not /).
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table.
2. Write down, verbatim, any errors shown on the screen. To see additional errors that have scrolled past the current
screen, press Shift-PgUp. This allows you to see several previous screens before logging in. Some errors in
addition to the error(s) above include:
1. Filesystem is clean failed
2. blogd: no message logging because /var file system is not accessible
3. Failed to open the device '/dev/hdb3': No such file or directory
4. You could also look in /var/log/boot.msg for these errors, but in this case that won't work, because /var
was not mounted.
3. Since we see “System Boot Control:” messages, but we never see a “Master Resource Control:” message, we
are in the init boot phase of the boot process.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
4. When you see the prompts “Give root password for login:” and “Attention: Only CONTROL-D will reboot
the system in this maintenance mode,” you should immediately suspect problems with the /etc/fstab
configuration file.
5. Try cat /proc/partitions to see if the OS recognizes the /dev/hdb3 partition.
6. Edit the /etc/fstab and confirm that each device, mount point, and mount options are valid. Comment out any
that are not valid.
7. Comment out /dev/hdb3 from /etc/fstab and press Ctrl-D to reboot.
8. If this works, then we need to determine why /dev/hdb3 does not exist. In this case, it was an old disk that
was removed.
9. Edit /etc/fstab
10. Delete the entry “/dev/hdb3 /vol1 reiserfs acl,user_xattr 1 1”
11. Save and reboot
Task III: Root Cause
1. Invalid /etc/fstab entry, /dev/hdb3 is a non-existent device
2. Does an Automatic Repair fix this scenario? Yes
The boot process looks in /etc/fstab for filesystems that need to be mounted at boot time. If the last field of an entry (the fsck pass number) is non-zero, the filesystem is checked for errors. If the device cannot be found, or the filesystem check fails, the boot fails and stops in repair mode. You can usually comment out non-system filesystems in /etc/fstab and boot properly for troubleshooting.
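The device check described in this exercise can be sketched as a helper that lists fstab entries whose device node does not exist. This is a minimal sketch, not part of the lab notes: the check_fstab_devices name is hypothetical, and it only inspects entries that name a plain /dev/ path, so entries given by label or UUID are skipped.

```shell
# List /etc/fstab entries that point at a device node that does not exist.
check_fstab_devices() (
    fstab=${1:-/etc/fstab}
    # fstab fields: device, mount point, type, options, dump, fsck pass
    while read -r dev mnt fstype opts dump pass || [ -n "$dev" ]; do
        case $dev in
            ''|'#'*) continue ;;   # skip blank lines and comments
            /dev/*)  ;;            # plain device path: check it below
            *)       continue ;;   # labels, UUIDs, proc, etc. are not checked
        esac
        [ -e "$dev" ] || echo "MISSING: $dev ($mnt, fsck pass ${pass:-0})"
    done < "$fstab"
)
```

From the maintenance prompt, run `check_fstab_devices` with no argument to scan /etc/fstab. Any entry it reports is a candidate for commenting out before you press Ctrl-D.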
(End of Exercise)
3.4 Troubleshooting Exercise: Server Hung with Blank Screen
When I boot the server, it just hangs. The screen is completely blank.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 4
3. Press enter to continue
4. Some errors observed
1. Blank screen
2. VM attempts PXE boot
3. Operating System not found
1. Booting from local disk...
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Did you see the boot loader menu and successfully pick a kernel to boot? No. BIS will still help, but the
problem is with the boot loader itself. Execute the BIS procedure.
3. Since we saw the BIOS information on screen, but not boot loader menu, and no errors, the problem is with
the BIOS transitioning to the stage1 boot loader.
4. BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
5. Try reinstalling the boot loader and reboot to test.
1. DVD1, Installation, Repair Installed System (RIS), Expert Tools, Install New Boot Loader, OK (No
edits or changing necessary), Exit RIS and reboot.
2. NOTE: Selecting Repair Installed System directly from the DVD1 boot menu does not always work well. Select
Installation first, and then use Repair Installed System.
Task III: Root Cause
1. Damaged or corrupted Master Boot Record
2. Does an Automatic Repair fix this scenario? Yes
The master boot record was corrupted. Since we boot off the DVD for BIS, we bypassed the disk's MBR. Reinstalling the boot loader resolves the issue.
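One quick MBR sanity check, once you have booted from the DVD, is to look at the boot signature in the last two bytes of the first sector (0x55 0xAA at offset 510). The sketch below is an illustration and not part of the lab notes; the check_mbr_signature name is hypothetical. A present signature does not prove the boot code is intact, but a missing one is consistent with an "Operating System not found" message.

```shell
# Check the two-byte boot signature at offset 510 of a disk or disk image.
check_mbr_signature() (
    disk=$1
    # Read bytes 510-511 and print them as lowercase hex without spaces.
    sig=$(dd if="$disk" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    if [ "$sig" = "55aa" ]; then
        echo "$disk: boot signature present (55aa)"
    else
        echo "$disk: boot signature missing or damaged (found '${sig:-nothing}')"
        exit 1
    fi
)
```

For example, `check_mbr_signature /dev/sda` as root from the rescue environment, assuming /dev/sda is the boot disk. The check only reads from the disk.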
(End of Exercise)
3.5 Troubleshooting Exercise: Kernel and Initrd Messages
We moved our server from one data center to another. When we boot the server, we just see some kernel and initrd message information. Sometimes the screen just goes black or the server reboots.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 5
3. Press enter to continue
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. After you pressed Enter on the kernel entry in the boot loader, did you see any messages scroll on the
screen? No. This indicates something is wrong with whatever the boot loader is pointing to (i.e., the kernel and
RAM disk).
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. GRUB runs the commands in the order they appear in /boot/grub/menu.lst. So, it would run root, kernel, initrd, and then boot.
The only command you don't see on screen is boot. Since nothing scrolled on the screen at all, you can
suspect the kernel could not execute and something is wrong with the kernel.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
4. Load BIS, and make sure the /boot/grub/menu.lst is valid and all files are present.
5. Check the /boot/vmlinuz and /boot/initrd symbolic links found in menu.lst and make sure they are pointing
to valid files in /boot
6. Run rpm -Vf /boot/vmlinuz to validate the rpm that owns the /boot/vmlinuz file. Notice that the
/boot/vmlinuz-3.0* kernel is marked with an S and 5. Refer to the rpm man page to understand all the RPM
verify options, but S means the size has changed and 5 means the MD5 checksum has changed. Since
/boot/vmlinuz is a symbolic link to the vmlinuz kernel file, then this is a major red flag. The Linux kernel
itself has changed.
7. Try reinstalling the kernel, and reboot.
8. Boot installed system and make sure DVD installation media is mounted.
9. Install the kernel rpm, yast -i kernel-pae-base
10. rpm -Vf /boot/vmlinuz should return no output
11. Reboot
Task III: Root Cause
1. Corrupt kernel in /boot
2. Does an Automatic Repair fix this scenario? No
The kernel was damaged and needed to be reinstalled. BIS worked because it bypassed the installed kernel.
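The symbolic link check from the troubleshooting procedure can be sketched as follows. This is an illustrative helper, not part of the lab notes: the check_boot_links name is hypothetical, and the directory argument exists so the check can be pointed at a mounted copy of /boot (for example under /mnt in the rescue system). It confirms only that the links resolve to non-empty files; detecting changed content still requires rpm -Vf.

```shell
# Verify that the vmlinuz and initrd links in a boot directory resolve.
check_boot_links() (
    bootdir=${1:-/boot}
    status=0
    for name in vmlinuz initrd; do
        link=$bootdir/$name
        if [ ! -L "$link" ]; then
            echo "$link: not a symbolic link"
            status=1
        elif [ ! -s "$link" ]; then      # -s follows the link to its target
            echo "$link: target missing or empty"
            status=1
        else
            echo "$link -> $(readlink "$link"): ok"
        fi
    done
    exit "$status"
)
```

Run `check_boot_links` on the installed system, or `check_boot_links /mnt/boot` if the root filesystem is mounted under /mnt.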
(End of Exercise)
3.6 Troubleshooting Exercise: Server Reboots
Your computer keeps rebooting. You do not have access to your installation media and so cannot use rescue mode.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 6
3. Press enter to continue
4. Do not use rescue mode.
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Because the messages scroll by too fast, it is hard to narrow down where the reboot starts.
Reboot and append S to the boot options. This tells the system to bypass the
default runlevel and start in the single-user runlevel. Runlevel 1 will also fail, but
runlevel S will prompt for root's password.
3. Change /etc/sysconfig/boot options to the following:
1. PROMPT_FOR_CONFIRM="yes" (Prompts before loading each service)
2. FLOW_CONTROL="yes" (Pauses the screen with Ctrl-S and resumes with Ctrl-Q)
3. RUN_PARALLEL="no" (Runs each service, waits before running the next)
4. Reboot and press “y” to “Enter Interactive startup mode”.
4. Load each service until you notice that the server is rebooting. Press Ctrl-S to
pause the screen and observe the messages on screen. Press Shift-PgUp to see more
messages that have scrolled off the screen.
5. Since we see “System Boot Control:” messages, but we never see a “Master
Resource Control:” message, we are in the init boot phase of the boot process.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
6. The last message before the reboot message “INIT: Switching to runlevel: 6” is
“System Boot Control: Running /etc/init.d/boot.local”
7. Reboot and include the boot loader boot options parameter: S, then edit and check
the following: /etc/inittab, /etc/init.d/boot.local,
/etc/init.d/before.local and /etc/init.d/after.local
(before and after.local files do not exist by default, whereas boot.local does exist
but is usually empty.)
8. Try switching to runlevel 3 with init 3 to see if the rebooting has stopped.
9. Edit /etc/init.d/boot.local
10. Remove the shutdown -r now command.
Task III: Root Cause
1. Reboot command in the boot script: /etc/init.d/boot.local
2. Does an Automatic Repair fix this scenario? No
The init process runs /etc/init.d/boot to process all boot-level scripts, followed by /etc/init.d/boot.local. This script is run before the typical runlevels are started, so an invalid command in it will cause problems at boot time.
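Checking the local boot scripts for a stray reboot command can be sketched as a simple search. This is a minimal sketch, not part of the lab notes: the find_reboot_commands name and the list of command names it looks for are assumptions chosen for illustration, and commented-out lines are ignored.

```shell
# Print uncommented lines that call shutdown, reboot, halt, or poweroff.
find_reboot_commands() (
    pattern='(^|[[:space:];&|/])(shutdown|reboot|halt|poweroff)([[:space:];&]|$)'
    for f in "$@"; do
        [ -f "$f" ] || continue            # skip scripts that do not exist
        grep -nE "$pattern" "$f" |
            grep -vE '^[0-9]+:[[:space:]]*#' |      # drop comment lines
            while IFS= read -r hit; do
                printf '%s:%s\n' "$f" "$hit"        # file:line:text
            done
    done
    exit 0
)
```

For this exercise you might run `find_reboot_commands /etc/init.d/boot.local /etc/init.d/before.local /etc/init.d/after.local` from runlevel S. Files that do not exist are skipped silently.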
(End of Exercise)
3.7 Troubleshooting Exercise: Login Console Hang
I can login to a virtual console, but once I logout, I cannot log back in. Help!
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 7
3. Press enter to continue
4. Some errors observed
1. INIT: no more processes left in this runlevel
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. The server appears to be at the very end of the boot process; sounds like a
configuration issue.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. What application provides the login after a reboot? mingetty
4. Did you login successfully? Yes. So mingetty is probably working fine.
5. What application runs the mingetty login program? /sbin/init
6. What is /sbin/init's configuration file? /etc/inittab
7. Use 'man inittab' to help you understand the fields and confirm the mingetty field
values are correct.
8. Compare /etc/inittab from a working system to your faulty system.
9. You could also reinstall the aaa_base package that owns /etc/inittab by running
yast -i aaa_base.
Task III: Root Cause
1. Incorrect /etc/inittab configuration
2. Does an Automatic Repair fix this scenario? No
Init is the parent process of all running processes, including the login programs. An invalid configuration caused init to stop respawning the login process.
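One plausible way init stops respawning a login is a getty entry whose action is something other than respawn (for example once). The lab notes do not state the exact misconfiguration, so treat the sketch below as an assumption-driven check. The check_getty_respawn name is hypothetical.

```shell
# Flag inittab entries that start a getty without the "respawn" action.
check_getty_respawn() (
    file=${1:-/etc/inittab}
    # inittab fields are id:runlevels:action:process
    awk -F: '
        /^[ \t]*#/ { next }                       # skip comments
        $4 ~ /getty/ && $3 != "respawn" {
            printf "line %d: id %s uses action \"%s\", expected \"respawn\"\n", NR, $1, $3
        }
    ' "$file"
)
```

Run `check_getty_respawn` on the faulty system. No output means every getty entry uses respawn, in which case comparing the file against a working system is the next step.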
(End of Exercise)
3.8 Troubleshooting Exercise: Waiting for Device
I cannot boot my system, there seems to be an issue with the file system.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 8
3. Press enter to continue
4. Some errors observed
1. Waiting for device /dev/sda2 to appear: ok
2. fsck: Error 2 while executing fsck.swap for /dev/sda2
3. fsck failed. Mounting root device readonly.
4. could not mount root filesystem – exiting to /bin/sh
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Since the boot loader works, but init does not run, then the problem is narrowed to
the kernel or ram disk.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Observe carefully the error messages seen on screen when the boot process failed.
1. Check on /dev/sda2 to see what it is and why it's not showing up.
2. Consider mounting it manually to see if it mounts read/write instead of read
only.
4. Load BIS, and investigate /dev and /dev/sda2
5. Use cat /proc/partitions to see if sda2 is valid, and fdisk -l to see what
kind of partition sda2 is.
6. sda2 is a swap partition, yet the OS was “Waiting for sda2 to appear.” This means it
was attempting to mount it as a file system, instead of just turning it on with
swapon.
7. Does the /etc/fstab show the swap mounted correctly? Yes.
8. Type mount and observe which device is the root device. (/dev/sda6)
9. GRUB tells the kernel where the root device is with the root= parameter. GRUB's
configuration file is /boot/grub/menu.lst.
10. Edit /boot/grub/menu.lst, and confirm root= is set properly.
11. Change the kernel parameter root= so that it points to the correct root partition,
instead of the swap partition (i.e., root=/dev/sda6).
12. Save, exit and reboot
Task III: Root Cause
1. The swap partition was used instead of the root partition in the /boot/grub/menu.lst configuration
2. Does an Automatic Repair fix this scenario? Yes
Grub has a kernel command option allowing you to tell the kernel the location of the root partition. The location needs to be correct in order for the system to boot properly.
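To see at a glance which root device each menu entry passes to the kernel, the root= parameters can be pulled out of menu.lst. This is a minimal sketch, not part of the lab notes: the list_kernel_roots name is hypothetical, and it assumes the GRUB legacy layout in which each title line is followed by a kernel line.

```shell
# Print the root= kernel parameter for every entry in a GRUB legacy menu.lst.
list_kernel_roots() (
    menu=${1:-/boot/grub/menu.lst}
    awk '
        $1 == "title"  { title = substr($0, index($0, $2)) }   # entry name
        $1 == "kernel" {
            root = "(none)"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^root=/) root = substr($i, 6)        # text after root=
            printf "%s: root=%s\n", title, root
        }
    ' "$menu"
)
```

Compare the devices it prints with the output of mount and fdisk -l. In this exercise an entry that reports root=/dev/sda2 points at the swap partition.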
(End of Exercise)
3.9 Troubleshooting Exercise: GRUB Prompt
When I boot, it stops at the grub> prompt.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 9
3. Press enter to continue
4. Some errors observed
GRUB prompt instead of the GRUB menu
GNU GRUB version 0.94 (640K lower / 3072K upper memory)
[ Minimal BASH-like line editing is supported. For the first
word, TAB lists possible command completions. Anywhere else TAB
lists the possible completions of a device/filename. ]
grub>
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Since you don't see the GRUB menu screen but do get a GRUB prompt, you
reached stage2.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Since you are at the GRUB prompt, can you boot manually without any options?
Yes. This means it's probably a configuration issue.
4. Try booting manually from the grub> prompt using the following method. Do not
use BIS or CIS.
grub> find /boot/vmlinuz
(hd0,1)
grub> root (hd0,1)
Filesystem type is reiserfs, partition type 0x83
grub> kernel /boot/vmlinuz
[Linux-bzImage, setup=0x1400, size=0x176b27]
grub> initrd /boot/initrd
[Linux-initrd @ 0x2a6000, 0x149abd bytes]
grub> boot
5. What configuration file does GRUB use to display its default menu?
/boot/grub/menu.lst.
6. In this case the customer renamed the menu.lst file to menu.lst.old. You could
restore this file, but assuming you did not have a menu.lst file, recreate the
menu.lst for the purpose of this lab.
7. Recreate the menu.lst with yast bootloader, Other, Propose New
Configuration, OK
Task III: Root Cause
1. Missing /boot/grub/menu.lst file.
2. Does an Automatic Repair fix this scenario? Yes
The grub> prompt means the second stage GRUB boot loader is working just fine, but it cannot find the default menu configuration file. The menu configuration file is /boot/grub/menu.lst. In this case the file itself was missing.
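After booting manually, a quick look in the GRUB directory shows whether menu.lst is really missing and whether a renamed copy is lying next to it. The helper below is an illustrative sketch, not part of the lab notes; the find_menu_lst name is hypothetical.

```shell
# Report whether menu.lst exists and list renamed copies such as menu.lst.old.
find_menu_lst() (
    grubdir=${1:-/boot/grub}
    if [ -f "$grubdir/menu.lst" ]; then
        echo "$grubdir/menu.lst: present"
        exit 0
    fi
    echo "$grubdir/menu.lst: missing"
    for candidate in "$grubdir"/menu.lst.*; do
        [ -f "$candidate" ] && echo "possible backup: $candidate"
    done
    exit 1
)
```

If a backup is reported, you could inspect it and copy it back to menu.lst. The lab itself asks you to recreate the file with yast bootloader instead.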
(End of Exercise)
3.10 Troubleshooting Exercise: Failed Run Level Services
When I boot the server, I see a bunch of messages on the screen, with a lot of failed runlevel services.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 10
3. Press enter to continue
4. Some errors observed
1. /etc/init.d/boot.d/S12boot.compliance: line 57: clear: command not found
2. Press any key to proceed with booting
3. /etc/init.d/boot.d/S13boot.klog: line 41: /var/log/boot.msg: No such file or
directory
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Press Shift-PgUp to observe other errors, looking for errors relating to mount, since it
returned a non-zero exit status. We need the mount command to mount the file
systems. This is a big red flag.
3. Some errors of interest:
1. mount: invalid option -- 'o'
2. Try 'mount --help' for more information
3. INIT: Entering runlevel: 3 (Means init has started rc)
4. The first error we see before the problem occurs is:
1. System Boot Control: Running /etc/init.d/boot
2. Mounting procfs at /proc mount: invalid option -- n
3. Try 'mount --help' for more information
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
5. On a working system, type which mount. The executable is located at
/bin/mount.
6. Reboot and use boot options: S to bypass all runlevels.
7. What happens when you type 'mount' or 'mount --help'?
8. Boot DVD1, Rescue System
9. Mount the root filesystem to /mnt using mount /dev/sda6 /mnt
10. Copy the rescue mode's /bin/mount into your failing system's /bin directory using
cp /bin/mount /mnt/bin
11. Reboot the system and reinstall the rpm that owns /bin/mount so it will pass rpm
validation. Run yast -i util-linux
Task III: Root Cause
1. Corrupted /bin/mount command
2. Does an Automatic Repair fix this scenario? No
The date command was accidentally copied over the top of the mount command causing the boot failure.
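When one command has been copied over another, a byte-for-byte comparison against its neighbors can reveal which one it was. The sketch below is an illustration, not part of the lab notes: the find_identical_commands name is hypothetical. On real systems some commands are legitimately links to each other, so treat a match as a hint only.

```shell
# List files in a directory whose content is identical to a target command.
find_identical_commands() (
    target=$1
    dir=${2:-/bin}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        [ "$f" = "$target" ] && continue      # do not compare the target to itself
        cmp -s "$target" "$f" && echo "$f has the same content as $target"
    done
    exit 0
)
```

From runlevel S you might run `find_identical_commands /bin/mount /bin`. In this exercise a match on /bin/date would fit the invalid option errors seen at boot.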
(End of Exercise)
3.11 Troubleshooting Exercise: Read-Only Root Filesystem
Some services fail to load at boot, and the root filesystem is read-only. Resolve the errors and boot normally.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 11
3. Press enter to continue
4. Some errors observed
1. mktemp: failed to create file via template `/tmp/keymap.XXXXXX` : Read-only filesystem
2. Failed services in runlevel 3: random kbd
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Identify how far the boot process got.
2. You get to a login screen, but there are errors.
3. If you have not changed screens, you can use Shift-PgUp to see more boot
messages.
4. Try to find out where the errors first started.
5. The first error is after /etc/init.d/before.local.
6. Login and see what is in /etc/init.d/before.local.
7. Change to runlevel 1 and type root's password
8. Try moving /etc/init.d/before.local to /root
9. Remount root as read/write
1. mount -o rw,remount /
10. Run mv /etc/init.d/before.local /root
11. Reboot to test.
Task III: Root Cause
1. The umount command was in /etc/init.d/before.local.
2. Does an Automatic Repair fix this scenario? No
Sometimes problems happen due to logical errors on the part of the administrator. The umount command in the /etc/init.d/before.local file had unmounted several file systems and caused some to become read-only.
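Before remounting, it helps to confirm how the root filesystem is currently mounted. The sketch below reads a mounts table in the /proc/mounts format and reports the mode of the last entry for /. It is an illustrative helper, not part of the lab notes; the root_mount_mode name is hypothetical.

```shell
# Report whether the root filesystem is mounted read-only (ro) or read-write (rw).
root_mount_mode() (
    mounts=${1:-/proc/mounts}
    # /proc/mounts fields: device, mount point, type, options, dump, pass.
    # The last entry for / wins, since later mounts cover earlier ones.
    awk '
        $2 == "/" {
            mode = "unknown"
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "ro" || opt[i] == "rw") mode = opt[i]
            result = $1 " on / is mounted " mode
        }
        END { if (result != "") print result }
    ' "$mounts"
)
```

If `root_mount_mode` reports ro, the mount -o rw,remount / step from the procedure applies.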
(End of Exercise)
3.12 Troubleshooting Exercise: Missing Action Field
I can boot my computer, but cannot login. I get an error message about a missing action field.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 12
3. Press enter to continue
4. Some errors observed
1. INIT: /etc/inittab[50]: missing action field
2. INIT: no more processes left in this runlevel
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Do you get a login prompt? No.
3. How far did INIT get before failing?
1. You can see INIT ran rc because of message “Master Resource Control:
runlevel 3 has been reached”
2. This indicates the runlevel completed, otherwise it would say there were
skipped or failed services; so we are having a problem running the login
program.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
4. init is identifying the file and line number where it thinks the problem is. In this
example, you would check line 50 in the /etc/inittab file. Check your error
messages for your specific line number.
5. Use 'man inittab' to help you understand which is the “action” field, and what
should go in it, or compare it with a working system.
6. Load CIS, edit /etc/inittab and fill in the correct value for the missing action
field(s).
7. Change the /etc/inittab entries from this:
1:2345::/sbin/mingetty --noclear tty1
2:2345::/sbin/mingetty tty2
3:2345::/sbin/mingetty tty3
4:2345::/sbin/mingetty tty4
5:2345::/sbin/mingetty tty5
6:2345::/sbin/mingetty tty6
to this:
1:2345:respawn:/sbin/mingetty --noclear tty1
2:2345:respawn:/sbin/mingetty tty2
3:2345:respawn:/sbin/mingetty tty3
4:2345:respawn:/sbin/mingetty tty4
5:2345:respawn:/sbin/mingetty tty5
6:2345:respawn:/sbin/mingetty tty6
8. Reboot
Task III: Root Cause
1. Missing action field in the /etc/inittab configuration
2. Does an Automatic Repair fix this scenario? No
The /etc/inittab requires the format id:runlevels:action:process . The action field was missing and needed a valid action based on inittab(5).
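The id:runlevels:action:process check described above can be sketched in shell. This is a minimal illustration, not part of the lab tooling: the function name find_missing_action and the sample file are assumptions made for the example, and it only flags entries whose third field is empty.

```shell
#!/bin/sh
# Report inittab entries whose third field (the action) is empty.
# Prints "file[line]: id <id>", similar to the way init reports the line.
find_missing_action() {
    awk -F: '
        /^[ \t]*#/ || NF == 0 { next }   # skip comments and blank lines
        NF >= 4 && $3 == "" { printf "%s[%d]: id %s\n", FILENAME, NR, $1 }
    ' "$1"
}

# Demo on a throwaway sample file (hypothetical content, not the real /etc/inittab).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# sample entries
id:3:initdefault:
1:2345:respawn:/sbin/mingetty --noclear tty1
2:2345::/sbin/mingetty tty2
EOF
find_missing_action "$sample"
```

From a chroot of the installed system, the same function could be pointed at /etc/inittab to list every entry that still needs an action before rebooting.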
(End of Exercise)
3.13 Troubleshooting Exercise: GRUB
When I boot, all I see on the screen is GRUB, and the server hangs.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 13
3. Press enter to continue
4. Some errors observed
1. GRUB GRUB Hard Disk Error
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Since we saw the BIOS information on screen and then “GRUB”, we know that the
BIOS found and started executing the first stage boot loader from the MBR.
However, we could not progress from stage1 to stage2.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Since BIS worked, and the problem seems to be with the boot loader, try
reinstalling the GRUB boot loader (i.e., grub-install), and reboot to test.
4. Reinstalling the boot loader fails, so we should verify the rpm that owns the stage1
and stage2 files (i.e., rpm -Vf /boot/grub/stage{1,2}).
5. Since the /boot/grub/stage{1,2} files get installed by grub-install, and they are not
owned by any rpm package, we need to determine how they got in the /boot/grub
directory to begin with.
6. The most likely package to put the files in /boot/grub is the grub rpm itself. Since it
does not own the files, we could check its pre and post scripts to see if it touched the
files during the pre or post rpm installation.
1. Run rpm -q --scripts grub | grep stage
7. The grub rpm did copy the stage{1,2} files from a different location, explaining
why it does not directly own /boot/grub/stage{1,2}.
8. You could try copying the *stage* files to /boot/grub, like the rpm did, to see if that
would help: cp /usr/lib/grub/*stage1* /boot/grub/
9. If that fails, the easiest thing to try while in BIS is reinstalling the grub rpm. Run
yast -i grub
10. Run grub-install
11. Reboot and retest
Task III: Root Cause
1. Corrupted /boot/grub/stage1 file.
2. Does an Automatic Repair fix this scenario? No
The GRUB files in /boot/grub are not directly owned by the grub rpm package. As a result, an rpm verify did not identify the problem. Based on the location in the boot process, a grub package reinstall makes sense.
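Because rpm verify cannot see these files, a byte-for-byte comparison against the copy the package ships is one way to spot the damage. The sketch below is an illustration only: the function name compare_boot_file is made up for the example, and it assumes the pristine stage1 lives under /usr/lib/grub, as the cp command in the procedure suggests.

```shell
#!/bin/sh
# Compare an installed boot file against a reference copy, byte for byte.
# Prints "match" or "differs"; returns non-zero when the files differ
# or either one cannot be read.
compare_boot_file() {
    if cmp -s "$1" "$2"; then
        echo "match"
    else
        echo "differs"
        return 1
    fi
}

# Demo with stand-in files (hypothetical content, not real GRUB stages).
ref=$(mktemp); good=$(mktemp); bad=$(mktemp)
printf 'stage1-bytes' > "$ref"
printf 'stage1-bytes' > "$good"
printf 'stage1-bytEs' > "$bad"
compare_boot_file "$good" "$ref"
compare_boot_file "$bad" "$ref" || true
```

On the lab system this would be called as compare_boot_file /boot/grub/stage1 /usr/lib/grub/stage1. Other stage files may legitimately differ from the packaged copies after grub-install, so treat a difference there as a hint rather than proof.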
(End of Exercise)
3.14 Troubleshooting Exercise: Invalid Partition Table
I get an invalid partition table error when booting.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 14
3. Press enter to continue
4. Some errors observed
1. Error No active partition
2. Operating System not found
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Did you see the boot loader menu, and successfully pick a kernel to boot? No.
BIS may still help, but the problem might be with the boot loader or the partition
table itself.
3. Since we saw the BIOS information on screen, and then the partition table error, it
is probable that the BIOS cannot find the partition table in the MBR at all.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
4. Try BIS or Repair Installed System.
5. Repair Installed System is grayed out, because without a partition table, there is no
installed system that the installer knows about.
6. At this point you need to either restore the partition table from backup (sfdisk),
attempt to recover the partition table from the disk (gpart) or reformat the disk and
restore data from backup.
7. To restore or recover the partition table, you need to boot to Rescue System,
Rescue login: root
1. Run cat /proc/partitions to see the disk and show there are no
partitions.
2. Restore the partition table backup file. -OR-
1. Copy the partition table backup file to the rescue system
2. Use sfdisk to restore the partition table
3. Recover the partition table from disk -OR-
1. Use gpart to recover the partition from disk
2. If you have a partition backup stored on the failed server, you might want
to restore it, once you recover enough with gpart to boot the server.
3. Try gpart -W /dev/sda /dev/sda to attempt to recover the
partition table.
4. NOTE: gpart does not always work well with extended partitions.
4. Repeat the lab, but copy the /boot/backup_mbr and a supportconfig to another
server before starting the lab. Restore the MBR using both methods.
1. For the backup_mbr
1. dd if=backup_mbr of=/dev/sda
2. partprobe
3. fdisk /dev/sda, a, 1 to mark sda1 bootable
2. For supportconfig
1. Copy the sfdisk -d section in fs-diskio.txt to its own file called
partitions.
2. Run sfdisk /dev/sda < partitions
3. partprobe
8. Boot to Rescue System; Rescue login: root; Run dhcpcd eth0 for network
connection
9. Recovering partition table from disk --OR--
1. Boot from DVD, select Repair Installed System (NOTE: Selecting “Repair
Installed System” after selecting installation will fail due to the missing
partition table.)
2. Select Expert Tools, Recover Lost Partitions, Start
3. If this works then restore the /boot/backup_mbr and reboot
4. If it fails, try manual recovery
10. Restoring partition table from supportconfig backup --OR--
1. Create a partition backup file from supportconfig (You would have to have a
supportconfig tarball that was copied off the server for this to work.)
1. The partition table backup is stored at the bottom of the fs-diskio.txt file,
and looks like this:
#==[ Command ]======================================#
# /sbin/sfdisk -d
# partition table of /dev/sda
unit: sectors
/dev/sda1 : start= 63, size= 417627, Id=83, bootable
/dev/sda2 : start= 417690, size= 626535, Id=82
/dev/sda3 : start= 1044225, size= 3148740, Id= f
/dev/sda4 : start= 0, size= 0, Id= 0
/dev/sda5 : start= 1044288, size= 931707, Id=83
/dev/sda6 : start= 1976058, size= 1140552, Id=83
/dev/sda7 : start= 3116673, size= 1076292, Id=83
2. Copy the uncommented text from the relevant device. supportconfig gets a
backup of all disk devices. This example only shows one.
3. Create the partition backup file (i.e., part.txt) that looks like this:
unit: sectors
/dev/sda1 : start= 63, size= 417627, Id=83, bootable
/dev/sda2 : start= 417690, size= 626535, Id=82
/dev/sda3 : start= 1044225, size= 3148740, Id= f
/dev/sda4 : start= 0, size= 0, Id= 0
/dev/sda5 : start= 1044288, size= 931707, Id=83
/dev/sda6 : start= 1976058, size= 1140552, Id=83
/dev/sda7 : start= 3116673, size= 1076292, Id=83
1. Copy part.txt from your backup server to the rescue system's /tmp directory.
scp <backup_ip>:/directory/part.txt /tmp
2. Restore the partition table with part.txt
sfdisk /dev/sda < /tmp/part.txt
3. Reboot
12. Manually Recovering --OR--
1. Boot Rescue System
2. Since we are familiar with the filesystem, we know sda1 was boot, but forgot
how big it was.
3. Run fdisk /dev/sda, n (new), p (primary), 1, use the default beginning sector
number, +75M (make the partition 75M as a guess).
4. p lists the partitions. w writes the partition table.
5. cat /proc/partitions shows only one partition. If we got the beginning sector
number right, we should be able to mount the filesystem, even though the
device is smaller than the filesystem.
mount /dev/sda1 /mnt
6. Restore the backup_mbr and reread the partition table. Use cat /proc/partitions
to confirm.
dd if=/mnt/backup_mbr of=/dev/sda
partprobe
7. Use fdisk /dev/sda, a (toggle bootable flag), 1 (partition 1), w (write changes to
disk)
8. reboot
Task III: Root Cause
1. Missing partition table
2. Does an Automatic Repair fix this scenario? No
When the partition table gets damaged, it does not necessarily mean the filesystems are damaged. Restoring the partition table may allow you to recover the filesystems and boot the server properly.
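Pulling the sfdisk dump out of a supportconfig file by hand is error-prone, so the extraction can be sketched as a small filter. This is an illustration under assumptions: the function name extract_sfdisk_dump is made up, and the sample text only mimics the layout shown in the procedure (a "# partition table of <device>" comment followed by uncommented lines).

```shell
#!/bin/sh
# Print the sfdisk dump for one device from a supportconfig-style text file.
# Starts after the "# partition table of <device>" comment, stops at the next
# "#==[" section marker, and drops comment and blank lines.
extract_sfdisk_dump() {
    awk -v dev="$2" '
        $0 == "# partition table of " dev { on = 1; next }
        on && /^#==\[/ { on = 0 }
        on && !/^#/ && NF > 0 { print }
    ' "$1"
}

# Demo on a shortened sample (values copied from the listing in the procedure).
diskio=$(mktemp)
cat > "$diskio" <<'EOF'
#==[ Command ]======================================#
# /sbin/sfdisk -d
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=       63, size=   417627, Id=83, bootable
/dev/sda2 : start=   417690, size=   626535, Id=82
EOF
extract_sfdisk_dump "$diskio" /dev/sda
```

Redirecting the output to part.txt gives the file that the procedure then feeds to sfdisk /dev/sda. Review the file before restoring, since sfdisk overwrites the partition table on the target disk.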
(End of Exercise)
3.15 Troubleshooting Exercise: Kernel Panic
I get a kernel panic when I start my Linux host. It doesn't matter which kernel I boot from, I still get the error.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 15
3. Press enter to continue
4. Some errors observed
1. VFS: Cannot open root device “sda6” or unknown-block(0,0)
2. Please append a correct “root=” boot option
3. Kernel panic – not syncing: VFS: Unable to mount root fs or unknown-
block(0,0)
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table.
2. Write down the information on the screen verbatim at the point of boot failure.
3. The on screen error messages give us three clues to check: root=, sda6 and
mounting root.
4. Since we did not see “done” during boot, this issue is a good candidate for Boot
Installed System (BIS).
5. Since BIS worked, the problem most likely can be found in the bold section below:
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
6. Since we saw the boot loader menu during a normal boot, we can assume the boot
loader is fine as well, further narrowing the problem to the kernel or RAM disk.
This correlates with our clues (root= and sda6 relate to kernel) and (RAM disk to
initrd).
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
7. While in BIS, investigate the clues: root= (/boot/grub/menu.lst), sda6 (mount), and
RAM disk (/boot/initrd*)
1. Since root= in /boot/grub/menu.lst is set to root=/dev/sda6, and mount shows
that BIS chose /dev/sda6 as the root file system to load, we can eliminate
those two clues from the problem.
2. This leaves the RAM disk. Root must be mounted initially from the RAM disk.
It is then remounted once the system is up.
8. You can dump the content of the RAM disk as follows:
1. mkdir -p /tmp/ramdisk
2. cd /tmp/ramdisk
3. zcat /boot/initrd | cpio -ivd
4. Notice you get errors, instead of the contents of the initrd ramdisk.
9. Check to make sure /etc/sysconfig/kernel has all the drivers needed in the
INITRD_MODULES= variable to get to the root file system. If you don't know for
sure, just recreate the RAM disk anyway.
10. Recreate the RAM disk and try a reboot to retest.
1. Boot Installed System
2. Run mkinitrd
3. Reboot
Task III: Root Cause
1. Corrupt initrd ram disk
2. Does an Automatic Repair fix this scenario? Yes
The kernel panic was caused by a corrupted ram disk (/boot/initrd-*). The BIS troubleshooting technique allows you to use the DVD's kernel and ram disk to boot the server and troubleshoot the issue. If BIS works, then the most common problem is a ram disk issue, which can usually be resolved by running mkinitrd. The next most common problem is GRUB. Reinstalling the GRUB boot loader usually resolves it.
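A quick integrity test can confirm a damaged ram disk before you unpack it. The sketch below assumes the initrd is a gzip-compressed archive, which the zcat step in the procedure implies. The function name check_initrd and the demo files are made up for the example.

```shell
#!/bin/sh
# Report whether a file passes the gzip integrity test.
# Reads the file on stdin so the file name suffix does not matter.
check_initrd() {
    if gzip -t < "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "corrupt or not gzip: $1"
        return 1
    fi
}

# Demo with stand-in files (hypothetical content, not a real ram disk).
good=$(mktemp); bad=$(mktemp)
printf 'pretend cpio payload' | gzip -c > "$good"
printf 'not a gzip stream' > "$bad"
check_initrd "$good"
check_initrd "$bad" || true
```

While in BIS, check_initrd /boot/initrd would give a yes or no answer. A file that passes can still be missing drivers, so recreating the RAM disk with mkinitrd remains the actual fix.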
(End of Exercise)
3.16 Troubleshooting Exercise: Error in Service Module
The server hard crashed due to a power outage. Logging in as root fails with “Error in service module.”
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 16
3. Press enter to continue
4. Some errors observed
1. INIT: version 2.86 booting
2. INIT: cannot execute “/bin/sh”
3. INIT: Entering runlevel: 3
4. INIT: Id “3” respawning too fast: disabled for 5 minutes
5. INIT: no more processes left in this runlevel
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Notice the “INIT: Id '3' respawning too fast” and “no more
processes left in this runlevel” messages. This indicates
/sbin/init attempted to execute runlevels, but could not. So, BIS is not an option.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Try chroot Installed System (CIS)
4. CIS failed with the error message: “chroot: failed to run command
'/bin/bash': No such file or directory.” The chroot command
needs to source the new directory's environment. If it cannot, this is a red flag that
needs to be resolved.
5. Change to /mnt/bin and look for bash. It's missing.
6. Try using the rescue system's bash to boot the server: cp /bin/bash
/mnt/bin/
1. If this works, reboot and then reinstall the bash rpm to restore the correct
/bin/bash on the system. You might also want to tell the customer to verify
their other RPMs.
2. If it fails, you will have to manually copy the /bin/bash file from another
server.
Task III: Root Cause
1. Missing /bin/bash
2. Does an Automatic Repair fix this scenario? No
Bash is the default shell and is used to start all the services on the server. Without it, nothing works right.
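Before attempting a chroot, it can help to check that the binaries it depends on exist under the mounted root. The following is a minimal sketch: the function name missing_under_root is an assumption for the example, and the demo builds a throwaway directory tree instead of touching /mnt.

```shell
#!/bin/sh
# Print each given path that is missing or not executable under a mounted root.
missing_under_root() {
    mur_root=$1
    shift
    for mur_path in "$@"; do
        [ -x "$mur_root$mur_path" ] || echo "$mur_path"
    done
}

# Demo: a fake root that has /sbin/init but no /bin/bash.
fakeroot=$(mktemp -d)
mkdir -p "$fakeroot/bin" "$fakeroot/sbin"
: > "$fakeroot/sbin/init"
chmod +x "$fakeroot/sbin/init"
missing_under_root "$fakeroot" /sbin/init /bin/bash
```

In the rescue system, missing_under_root /mnt /bin/bash /bin/sh /sbin/init would point at the missing shell without having to browse /mnt/bin by hand.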
(End of Exercise)
3.17 Troubleshooting Exercise: Fatal modules.dep Error
I cannot boot my computer. I get an error message about the / (root) partition waiting to appear... not found exiting to /bin/sh. I also see a fatal modules.dep error.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 17
3. Press enter to continue
4. Some errors observed
1. FATAL: Could not load /lib/modules/3.0*/modules.dep: No such file or
directory
2. Waiting for /dev/sda6 to appear: ...Could not find /dev/sda6
3. Want me to fall back to /dev/sda6? (Y/n)
4. Waiting for /dev/sda6 to appear: ...not found – exiting to /bin/sh
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Since the boot loader works, but init does not run, then the problem is narrowed to
the kernel or ram disk.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Load BIS
4. Verify the kernel RPMs. They pass.
5. The easiest thing to try now is making a new RAM disk.
1. Check /etc/sysconfig/kernel and make sure INITRD_MODULES is correct.
2. Run mkinitrd
6. mkinitrd does not give any errors, but seems to be shorter than you expect. Run
mkinitrd on your test system and compare the output.
7. mkinitrd is missing the Kernel Modules list included in the RAM disk.
8. This is the first solid lead to test. Run rpm -V mkinitrd to validate the RPM.
9. It passes, but you still need to have the kernel modules included in the RAM disk
in order to boot the server. Try reinstalling the mkinitrd RPM (yast -i
mkinitrd) even though it passes RPM validation.
10. You could also run /sbin/mkinitrd_setup which is called by the RPM
install script. It would fix the broken links too and make mkinitrd work.
Task III: Root Cause
1. Missing symlink for mkinitrd
2. Does an Automatic Repair fix this scenario? No
A good troubleshooting technique is to try reinstalling the rpm package associated with files you know are having a problem, even if they pass RPM validation. Maybe the RPM package scripts do something to resolve the issue.
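The comparison in step 6 can be made less subjective by saving the mkinitrd output to a file and testing it for the modules list. This is a sketch under assumptions: the label "Kernel Modules" is taken from the wording of step 7 and may not match your output exactly, the function name has_kernel_modules is made up, and the sample log lines are placeholders, not captured mkinitrd output.

```shell
#!/bin/sh
# Succeeds when a saved mkinitrd log contains a "Kernel Modules" line.
has_kernel_modules() {
    grep -q 'Kernel Modules' "$1"
}

# Demo with placeholder logs (not real mkinitrd output).
working_log=$(mktemp); broken_log=$(mktemp)
printf 'Kernel Modules: (placeholder list)\nFeatures: (placeholder)\n' > "$working_log"
printf 'Features: (placeholder)\n' > "$broken_log"
for log in "$working_log" "$broken_log"; do
    if has_kernel_modules "$log"; then
        echo "modules list present"
    else
        echo "modules list missing"
    fi
done
```

On the lab system you could run mkinitrd > /tmp/mkinitrd.log 2>&1 and then has_kernel_modules /tmp/mkinitrd.log, both before and after reinstalling the package.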
(End of Exercise)
3.18 Troubleshooting Exercise: Another Kernel Panic
I get a kernel panic right after attempting to mount root during the boot process.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 18
3. Press enter to continue
4. Some errors observed
1. Mounting root /dev/sda6
2. INIT: version 2.86 booting
3. Kernel panic – not syncing: Attempted to kill init!
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. You see the INIT: version 2.86 booting just before the kernel panic.
This means we panicked just as init was executing, suggesting init may have an
issue.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. Boot using the “boot options” parameter init=/bin/bash.
4. It fails. However, if you try “boot options” init=/bin/sash, which is the stand alone
shell, it works. Sash is a statically linked executable.
5. Since we are already at init in the boot process, this issue is not a good candidate
for BIS, but CIS. Try CIS.
6. Check /bin/bash and /lib/libc.so.6 on the installed system.
7. Did you document any errors when attempting init=/bin/bash? Always document
what you do and the outcome. CIS fails with the same error as init=/bin/bash.
“/bin/bash: relocation error: /bin/bash: symbol access,
version GLIBC_2.4 not defined in file libc.so.6 with
link time reference”
8. Mount all additional filesystems as shown in /mnt/etc/fstab manually if you haven't
already, and try chroot /mnt again.
9. CIS still fails. However, the error message suggests we may have a problem with
bash or libc.so.6.
10. Try validating the RPM packages that own bash and libc.so.6.
11. Since you have mounted the installed file systems, you can also run rpm verify
against the mounted filesystems, without chrooting to it.
12. Verify which rpm package owns /bin/bash and /lib/libc.so.6, then verify those
packages.
1. rpm -qf -r /mnt /bin/bash
2. rpm -qf -r /mnt /lib/libc.so.6
3. rpm -Vr /mnt bash
4. rpm -Vr /mnt glibc
13. Notice that libc-2.11.3.so has been modified, but what does that have to do with
libc.so.6?
1. ls -l /mnt/lib/libc*
2. libc.so.6 is a symbolic link to libc-2.11.3.so
14. Update the damaged libc-2.11.3.so with a good one. The easiest way to fix the
problem is by doing a down server update from the DVD. Try this method.
1. Boot from DVD, Select Installation, Update an Existing System
2. Update an Existing System does not work, you get an error: “Switching
to the installed system has failed”
15. You will have to manually update the glibc package.
1. There are two glibc rpms, i586 and i686. Make sure you update the correct
one!
2. rpm -qi -r /mnt glibc may show you which one you are using.
16. Boot into rescue mode, then mount the installation media.
17. Mount the installed system. This is the same as CIS, only do not do the final chroot
command.
18. Install the required glibc rpm from the installation media
19. Boot to rescue mode and mount all the filesystems as if you are chrooting the
installed system.
mount /dev/sda6 /mnt
mount /dev/sda1 /mnt/boot
mount /dev/sda5 /mnt/var
mount /dev/sda7 /mnt/usr
20. Run uname -a to determine the glibc architecture to use (i.e., i686)
21. Mount the installation media
mount -o ro /dev/cdrom /media/cdrom
22. Install the rpm
rpm -Uvh --force -r /mnt /media/cdrom/suse/i686/glibc-2.*rpm
23. Reboot
Task III: Root Cause
1. Damaged glibc library file.
2. Does an Automatic Repair fix this scenario? No
The glibc library is used with all dynamically linked executables. The first application to rely on the system's glibc libraries is /sbin/init, causing the kernel panic.
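Step 13 hinges on noticing that libc.so.6 is only a symbolic link to the file rpm reported as modified. A small helper can make that link explicit. This is a sketch: the function name lib_target is an assumption, and the demo recreates the link in a temporary directory instead of reading /mnt/lib.

```shell
#!/bin/sh
# Print the target a library symlink points to under a mounted root,
# or the path itself when it is not a symlink.
lib_target() {
    if [ -L "$1$2" ]; then
        readlink "$1$2"
    else
        echo "$2"
    fi
}

# Demo: recreate the libc.so.6 -> libc-2.11.3.so link in a scratch tree.
scratch=$(mktemp -d)
mkdir -p "$scratch/lib"
: > "$scratch/lib/libc-2.11.3.so"
ln -s libc-2.11.3.so "$scratch/lib/libc.so.6"
lib_target "$scratch" /lib/libc.so.6
```

With the installed system mounted, lib_target /mnt /lib/libc.so.6 names the real file to look for in the rpm -Vr /mnt glibc output.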
(End of Exercise)
3.19 Troubleshooting Exercise: Segmentation Fault
It seems like the server is hung or something. Every command I type gives me a segmentation fault.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 19
3. Press enter to continue
4. Some errors observed
1. Segmentation Fault
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. The first objective is to reboot the server. Since all commands seem to segfault,
you have two options.
1. Reset or power off the server and reboot
2. Use magic keys
1. echo s > /proc/sysrq-trigger # sync all filesystems
2. echo u > /proc/sysrq-trigger # remount filesystems read-only
3. echo b > /proc/sysrq-trigger # force a server reboot
2. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
3. We get a kernel panic at boot just after the INIT: version 2.86 booting,
message.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
4. Try chroot installed system (CIS).
5. CIS seg faults too. Manually mount all filesystems and chroot /mnt again. They all
still fail. Since chroot needs to source the installed environment, we need to check
/mnt/bin/bash.
6. After mounting all installed filesystems, run rpm verify on bash: rpm -Vr
/mnt bash
7. No bash RPM errors. Since bash is a dynamically linked executable, then it will
have library dependencies, which would also need to be checked.
8. Run ldd /mnt/bin/bash, to check for shared library dependencies.
9. Run an rpm verify on each shared library file that bash depends on. You need to
run rpm -qf -r /mnt /lib/libreadline.so.5 to find out to which RPM
it belongs, and repeat this process for each file listed in the ldd output for
/mnt/bin/bash.
10. libreadline5 and libncurses5 seem fine, but glibc says something is wrong with ld-
2.11.3.so.
11. Reinstall the glibc rpm
12. Boot from DVD1, Select Installation, New Installation, Custom Partitioning
13. Select System View/linux/Import Mount Points..., DESELECT Format system
volumes, Import
1. NOTE: If this fails, you will need to select the filesystem devices and mount
points manually.
14. Edit the swap partition and configure it to be mounted.
15. Accept, “Really keep the partition unformatted?” Yes.
16. Select the software patterns you originally had on the system. In this case, Base
and Minimal System, Accept.
17. Install
18. Go through the configuration phase, configuring the server as it was previous to the
install.
Task III: Root Cause
1. Damaged glibc shared library
2. Does an Automatic Repair fix this scenario? No
This is another exercise where a glibc library was damaged. When glibc is damaged, it may bring into question the integrity of the server. This exercise demonstrates a method of reinstalling the existing installed system. The procedure can be referred to as “Install the Installed System” (IIS) method. You basically reinstall the OS WITHOUT formatting the
filesystems. This assumes the filesystems are intact.
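Steps 8 and 9 repeat the same lookup for every library that bash depends on, which is easier with the paths pulled out of the ldd output first. The filter below is a sketch: the function name ldd_paths is made up, and the sample lines (including the addresses) are illustrative, not captured from the lab system.

```shell
#!/bin/sh
# Read ldd-style output on stdin and print the absolute path of each library.
ldd_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }   # "name => /path (addr)"
        $1 ~ /^\//               { print $1 }          # "/path (addr)"
    '
}

# Demo on illustrative ldd output.
ldd_paths <<'EOF'
	linux-gate.so.1 =>  (0xffffe000)
	libreadline.so.5 => /lib/libreadline.so.5 (0xb7701000)
	libc.so.6 => /lib/libc.so.6 (0xb7500000)
	/lib/ld-linux.so.2 (0xb7760000)
EOF
```

Each printed path can then be passed to rpm -qf -r /mnt to find its owning package, and that package verified with rpm -Vr /mnt, as the procedure describes.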
(End of Exercise)
3.20 Troubleshooting Exercise: Respawning Too Fast
I cannot login to the server. I keep getting errors that id 1 is respawning too fast.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 20
3. Press enter to continue
4. Some errors observed
1. INIT: Id “1” respawning too fast: disabled for 5 minutes
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure, otherwise follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. You also see the “Master Resource Control: runlevel 3 has been reached”. This
means /sbin/init finished boot and rc successfully, otherwise it would show
“skipped” services. So we have reached a failed login state.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
3. After the reboot, observe the error verbatim as it appears on the screen. The error
is, INIT: Id “1” respawning too fast: disabled for 5
minutes. The same error occurs with the number ranging from 1 to 6.
4. Since we are too far along in the boot process, BIS will not work. However, it
appears that the runlevels do work. Try rebooting with “boot options”: S, for single
user admin mode.
5. Since that worked, type init 1 to change to runlevel 1 to see if it works.
6. It works. Now you can troubleshoot from the administration runlevel 1.
7. The error message is /sbin/init identifying the specific /etc/inittab ID that is failing.
8. Look in the /etc/inittab file at IDs 1 through 6 to see which application
init is respawning too fast.
9. The application is /sbin/mingetty. If you compare this to your working test system,
you would see that mingetty is the correct application that should be spawned.
Remember the first field is the ID field, and last one is the application.
1:2345:respawn:/sbin/mingetty --noclear tty1
10. Run an rpm verify on the package that owns /sbin/mingetty (rpm -Vf /sbin/mingetty).
11. Reinstall the mingetty rpm package, because the MD5 sum of /sbin/mingetty has
changed since installation. Type init 3 to change to runlevel 3 and confirm
you can log in.
1. Boot to runlevel 1
2. Mount the installation media
mount -o ro /dev/cdrom /mnt
3. Install the rpm
rpm -Uvh --force /mnt/suse/i586/mingetty-1*rpm
4. Run init 3
Task III: Root Cause
1. Corrupted /sbin/mingetty login executable.
2. Does an Automatic Repair fix this scenario? No
The mingetty binary is used to get login credentials and validate them through the PAM stack. /sbin/init is responsible for running mingetty. Since mingetty was failing, init kept trying to restart it. /sbin/init detected too many restarts, or respawning attempts, and stopped trying.
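The /etc/inittab lookup from the troubleshooting steps can be sketched as a small shell helper. This is only an illustrative sketch, not part of the lab tooling: the function name inittab_process is hypothetical, and it assumes the usual four-field id:runlevels:action:process layout.

```shell
# Print the process (last field) of the inittab entry with the given ID.
# Usage: inittab_process ID [FILE]   (FILE defaults to /etc/inittab)
inittab_process() {
    awk -F: -v id="$1" '
        /^[ \t]*#/ { next }                 # skip comment lines
        $1 == id {
            # The process field may itself contain colons, so rebuild it
            # from field 4 onward.
            proc = $4
            for (i = 5; i <= NF; i++) proc = proc ":" $i
            print proc
        }
    ' "${2:-/etc/inittab}"
}
```

For example, running inittab_process 1 on the lab system should print the mingetty line for tty1, and the binary named there is the one to check with rpm -Vf.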
(End of Exercise)
3.21 Troubleshooting Exercise: Booting to $ Prompt
The server will only boot to the $ prompt.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 21
3. Press enter to continue
4. Some errors observed
1. could not mount root filesystem -- exiting to /bin/sh
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Copy the screen verbatim at the time of the failure, and look for additional
messages or errors.
3. Some interesting messages from bottom to top are:
1. umount: /dev: device is busy
2. mount: unknown filesystem type 'reiserfs'
3. modprobe: FATAL: Error inserting reiserfs (/lib/modules/2.6.16.21-0.8-default/kernel/fs/reiserfs/reiserfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)
4. reiserfs: Unknown symbol vfs_check_on
5. reiserfs: Unknown symbol vfs_check_on_mount
4. The predominant theme seems to be something wrong with the reiserfs file
system driver. The unknown symbols may mean an outdated
modules.dep, or something may be wrong with the file system driver itself.
In any case, dmesg output is suggested for more information. So we
should look into these three areas for clues.
5. Did you see “done” scroll across the screen? No. Boot Installed System (BIS)
should work.
6. Did you see the boot loader menu, and successfully pick a kernel to boot? Yes.
The boot loader is probably fine, and the problem would seem to be kernel/initrd
related.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
7. Confirm that the reiserfs driver is needed. Type mount. There are two things to
notice in this output: 1) reiserfs is used for one or more devices, and 2) the root “/”
partition is reiserfs. This means the ramdisk needs to include this driver, which
requires a mkinitrd rebuild.
8. Back up the /lib/modules/$(uname -r)/modules.dep file. Run depmod -a to update
modules.dep. Use diff to check for any differences, and vimdiff to view them.
There are none.
9. To check the reiserfs driver file, remember that all distributed
kernel drivers are owned by the kernel rpm. Verify it with rpm -V kernel-default.
1. Notice that the reiserfs.ko and ext3.ko driver files have a modified MD5 sum
and time stamp. This is a big red flag.
10. Reinstall the kernel rpm and retest.
11. Boot installed system
12. The uname -r command shows a “pae” kernel type.
yast -i kernel-pae
Task III: Root Cause
1. Corrupted file system driver ko files.
2. Does an Automatic Repair fix this scenario? Yes
All shipping kernel drivers come packaged in the kernel RPM package. If they are bad, reinstalling the kernel RPM package restores the good driver files.
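Kernel packages own many files, so it can help to filter the rpm verify output down to the files whose digest changed. The following is a minimal sketch under stated assumptions: the function name md5_changed_files is hypothetical, and it assumes the usual rpm -V layout in which the third character of the attribute string is "5" when the file digest differs.

```shell
# Read `rpm -V` output on stdin and print the files whose digest differs
# (a "5" in the third position of the attribute string).
# Limitation: file names containing spaces are not handled.
md5_changed_files() {
    awk 'length($1) >= 8 && substr($1, 3, 1) == "5" { print $NF }'
}
```

A possible use is rpm -V kernel-default | md5_changed_files, which on this lab system would be expected to list the reiserfs.ko and ext3.ko driver files.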
(End of Exercise)
3.22 Troubleshooting Exercise: Server Hang at Boot
The server hangs at boot time. There don't seem to be any messages or errors.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 22
3. Press enter to continue
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. The last messages you see that match the troubleshooting table are, “Loading
<module_name>” messages. However, we do not see the “INIT: version 2.86
booting” message. This indicates the problem may be with the ramdisk:init or
sbin:init.
3. Did you see “done” scroll across the screen? No. Boot Installed System (BIS)
should work.
4. Did you see the boot loader menu, and successfully pick a kernel to boot? Yes.
The boot loader is probably fine, and the problem would seem to be kernel/initrd
related.
BIOS -> MBR/stage1 -> stage2 -> kernel/initrd -> init -> boot -> rc -> login
5. BIS still hangs, so the problem may be after the kernel/initrd. Try chroot Installed
System (CIS).
6. CIS worked, which means /bin/bash and glibc are probably fine.
7. Perform a basic server health check, specifically disk space on root “/”.
1. http://www.novell.com/communities/node/4097/basic-server-health-check-supportconfig
8. Since the problem may be related to the ramdisk:init, and mkinitrd creates that file
in the initrd, you should try recreating the ramdisk.
9. mkinitrd did not show any errors, and a reboot test shows the server still hangs.
10. CIS again and verify the troubleshooting table's associated files for sbin:init.
11. Running rpm -Vf /sbin/init shows the MD5 sum has changed. This is a
red flag and must be resolved before troubleshooting further. Run rpm -qf
/sbin/init to see which rpm needs to be reinstalled.
12. Reinstall the sysvinit rpm and reboot to test.
1. Chroot Installed System
2. Mount the DVD
mount /dev/cdrom /mnt
3. Reinstall the rpm
rpm -Uvh --force /mnt/suse/i586/sysvinit*rpm
4. Reboot
Task III: Root Cause
1. Corrupted /sbin/init
2. Does an Automatic Repair fix this scenario? No
The troubleshooting table helps narrow down where in the boot process a failure is occurring. Once known, CIS was used because BIS would continue to use the /sbin/init that was bad. The RPM package that owned /sbin/init needed to be replaced. Updating the server would also fix the problem if a new sysvinit RPM package was available.
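If the rescue system's chroot has to be set up by hand, the sequence might look like the sketch below. It is a dry run that only prints the commands. The function name cis_commands, the root device passed to it, and the /mnt mount point are all assumptions for illustration, and the exact steps on your rescue media may differ.

```shell
# Print (do not run) the commands for a manual chroot into the installed system.
# Usage: cis_commands ROOT_DEVICE [MOUNT_POINT]
cis_commands() {
    root_dev=$1
    mnt=${2:-/mnt}
    cat <<EOF
mount $root_dev $mnt
mount --bind /dev $mnt/dev
mount -t proc proc $mnt/proc
mount -t sysfs sysfs $mnt/sys
chroot $mnt /bin/bash
EOF
}
```

Review the printed commands before typing them. Inside the chroot, rpm -Vf /sbin/init and the rpm reinstall then operate on the installed system rather than on the rescue environment.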
(End of Exercise)
3.23 Troubleshooting Exercise: Power Off
The server turns itself off and never comes up completely. Boot the server normally and determine the root cause for the server hang.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 23
3. Press enter to continue
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Boot to runlevel 1. Since init 1 works, the problem is with one of the services that exist in
runlevel 3 but not in runlevel 1.
3. Compare /etc/init.d/rc1.d/S* with /etc/init.d/rc3.d/S*; these are the services that are
started in each runlevel.
4. Look at /var/log/boot.omsg for clues. The boot.omsg file is the “old” boot.msg file
(i.e., from the previous boot).
5. Run rpm -Vf /etc/init.d/<service> for each service start script that exists in
runlevel 3 but does not exist in runlevel 1.
6. Since more than one service has changed, you could reinstall all the affected rpms,
or try to narrow it down further by stepping through the boot process.
7. Edit /etc/sysconfig/boot and set PROMPT_FOR_CONFIRM="yes",
RUN_PARALLEL="no", and FLOW_CONTROL="yes". Reboot and watch for
the first sign that the server is shutting down or powers off.
8. You will have a couple of seconds to respond either Y or N to load the service. The
prompt will be Start service <name>, (Y)es/(N)o/(C)ontinue?
[y]. Quickly note the name and then press Enter to load the service. Watch for
anything unusual for each service, but remember you only have a couple of
seconds to respond to the next load prompt. If you don't respond in time, all the rest
of the services will load automatically.
9. As soon as microcode.ctl runs, we see Start service purgekernels,
(Y)es/(N)o/(C)ontinue? [y], but it does not prompt to load and we then
immediately see “INIT: Sending processes the KILL signal” and the server is
already in its shutdown procedure. Press Ctrl-S to pause the shutdown process long
enough to see the messages on the screen. This is what FLOW_CONTROL was
for. When you are done, press Ctrl-Q to continue.
10. Boot to runlevel 1 again and look at /etc/init.d/rc3.d/*microcode.ctl and the scripts
that follow it.
11. Run an rpm verify on the scripts that follow microcode.ctl. Make sure you verify
the scripts in the /etc/init.d directory and not the /etc/init.d/rc3.d directory (the links
there are not owned by any rpm package, but are created by insserv).
12. Notice that /etc/init.d/syslog has changed since installation. Reinstall the rpm that
installed /etc/init.d/syslog.
13. If the reinstall fails, try mv /etc/init.d/syslog
/etc/init.d/syslog.old, and then reinstall the klogd rpm.
14. Boot to runlevel 1
15. Mount the DVD, mount /dev/cdrom /mnt
16. mv /etc/init.d/syslog /etc/init.d/syslog.old
17. rpm -Uvh --force /mnt/suse/i586/klogd-*.rpm
18. init 3
Task III: Root Cause
1. Logic error in customized /etc/init.d/syslog service
2. Does an Automatic Repair fix this scenario? No
The value of this lab is to learn how to use PROMPT_FOR_CONFIRM and FLOW_CONTROL. These are valuable troubleshooting tools for problems relating to a boot service.
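The runlevel comparison from the troubleshooting steps can be sketched in shell as follows. This is an illustrative sketch only: the function names start_names and services_only_in are hypothetical, and it assumes start links are named S<NN><service>, as created by insserv.

```shell
# Print the service names that have a start link (S<NN>name) in a runlevel directory.
start_names() {
    for link in "$1"/S*; do
        [ -e "$link" ] || [ -L "$link" ] || continue
        name=${link##*/}              # strip the directory part
        echo "${name#S[0-9][0-9]}"    # strip the S<NN> ordering prefix
    done | sort -u
}

# Print the services started in RC_DIR_A but not in RC_DIR_B.
# Usage: services_only_in RC_DIR_A RC_DIR_B
services_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    start_names "$1" > "$a"
    start_names "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, services_only_in /etc/init.d/rc3.d /etc/init.d/rc1.d yields the candidate list, and each name can then be checked with rpm -Vf /etc/init.d/<service>.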
(End of Exercise)
3.24 Troubleshooting Exercise: Critical Data
The server is hung, but the customer must have access to their critical data. Fix the server and make sure there are 100 critical files and 200 important files on /data. The customer must have these files, and there is no backup.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 24
3. Press enter to continue
4. Some errors observed
1. The server is hung
2. fsck failed for at least one filesystem (not /).
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Make sure you look at all the errors on the screen where the boot failed, and then
type root's password for maintenance mode.
3. Errors on the screen include:
1. reiserfs_open: the reiserfs superblock cannot be found on /dev/sdb1
2. you need to run this utility with --rebuild-sb
3. Reiserfs super block in block 16 on 0x805 of format 3.6 with standard journal
4. This indicates some bad damage to the /dev/sdb1 device and filesystem. You
should ask the customer if they have a good backup.
5. Since this is their data volume, you can comment out the /dev/sdb1 device in
/etc/fstab and type exit to reboot normally. The server should come up just fine, but
without their data volume mounted.
6. If there is no backup and the /data files are critical, then STOP now. The customer
needs to send the drive to a third-party data recovery service (e.g., Ontrack or
DriveSavers) to get the files back. If they cannot afford this very expensive option to
recover the disk, then we can move forward: run reiserfsck --check /dev/sdb1
7. The fsck says a --rebuild-sb is required because the super block is gone. Run
reiserfsck --rebuild-sb /dev/sdb1
8. The reiser filesystem does not create copies of the super block throughout the
filesystem, so you just have to attempt to rebuild it. Read the output, but generally
you can accept the defaults. The error message said it was version 3.6. Try to
rebuild the super block successfully. If you succeed, you will be able to run a
normal reiserfsck --check /dev/sdb1
9. The check says you need to rebuild the tree. The chances of successfully
recovering data, let alone the filesystem, have dropped from 30% to about 5% or
less. This message means serious damage has occurred to the filesystem.
Run: reiserfsck --rebuild-tree /dev/sdb1
10. Run a --check until it comes back with “No corruptions found.”
11. Try mounting the filesystem. If it mounts, reboot by typing exit. If it fails, restore
from backup.
12. You can look in the /data/lost+found directory for any files that were recovered.
These files can be renamed back into their original location if you know the
original filename. The original file names are lost.
13. Since this is a data volume, restore the files from backup.
14. If no backup exists, try to rename the files in lost+found to their original location.
Yes you have to go through them one at a time.
15. If the data was critical, you should not try any recovery or fsck options, but
immediately refer the customer to a third-party data recovery service such as
Ontrack or DriveSavers.
Task III: Root Cause
1. Corrupted reiser filesystem with superblock lost
2. Does an Automatic Repair fix this scenario? No
Damaged filesystems happen. Many times there is not a lot you can do about it other than
try the fsck options.
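Commenting out the damaged data volume in /etc/fstab can be previewed before the file is touched. The helper below is a minimal sketch: the name fstab_comment_out is hypothetical, and it only writes the modified table to standard output so the result can be reviewed first.

```shell
# Print FILE with every entry whose first field is DEVICE commented out.
# The file itself is not modified.
# Usage: fstab_comment_out DEVICE [FILE]   (FILE defaults to /etc/fstab)
fstab_comment_out() {
    awk -v dev="$1" '$1 == dev { print "#" $0; next } { print }' "${2:-/etc/fstab}"
}
```

One cautious way to use it is to keep a copy of /etc/fstab, write the output of fstab_comment_out /dev/sdb1 to a new file, compare the two with diff, and only then copy the new file into place. In maintenance mode the root filesystem may be mounted read-only, in which case it needs to be remounted read-write before /etc/fstab can be changed.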
(End of Exercise)
3.25 Troubleshooting Exercise: Kernel Panic After Disk Change
We changed some disks on the server and now the kernel panics.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 25
3. Press enter to continue
4. Some errors observed
1. Kernel panic - not syncing: Attempted to kill init!
2. No init found. Try passing init= option to the kernel.
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Make sure you look at all the errors on the screen where the boot failed.
3. Important messages on the screen, in addition to those listed above, include:
1. Mounting root /dev/sda5
2. /dev/sda5: clean
4. The system thinks /dev/sda5 is the root filesystem. Where is “init” found?
/sbin/init. Make sure /sbin/init exists on /dev/sda5.
5. Try boot installed system
6. chroot installed system
7. When you mount /dev/sda5 in rescue mode, it does not even have a /root or /sbin
directory. This is not the root filesystem.
8. As you explore, you find /dev/sda6 is the correct root filesystem.
9. Correct the root= option in /boot/grub/menu.lst to point to the correct root
filesystem and retest.
10. Correct the /etc/fstab to use the correct root filesystem and retest.
11. Once menu.lst and fstab are fixed, run mkinitrd to create a new ram disk with the
updated root filesystem information and retest.
12. Chroot Installed System
13. sed -i -e 's!/dev/sda5!/dev/sda6!g' /boot/grub/menu.lst
14. sed -i -e 's!/dev/sda5!/dev/sda6!g' /etc/fstab
15. mkinitrd
16. reboot
Task III: Root Cause
1. The root filesystem device changed.
2. Does an Automatic Repair fix this scenario? Partially
1. df still lists /dev/sda5 as the root device.
When the root device changes, more than one file needs to be updated. This exercise is typical of systems that were installed onto a single local disk and later had SAN devices added.
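Because sed -i rewrites menu.lst and fstab in place, it can be worth previewing the substitution first. The following is a minimal sketch with a hypothetical function name, preview_device_change. It uses the same s!old!new!g expression as the lab steps and shows the result as a unified diff without modifying the file.

```shell
# Show what replacing OLD with NEW would change in FILE, without modifying it.
# Usage: preview_device_change OLD NEW FILE
preview_device_change() {
    # diff exits non-zero when differences exist, which is expected here.
    sed -e "s!$1!$2!g" "$3" | diff -u "$3" - || true
}
```

For example, preview_device_change /dev/sda5 /dev/sda6 /boot/grub/menu.lst shows which lines the sed -i command would rewrite. If the diff looks right, run the sed -i commands from the steps, followed by mkinitrd.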
(End of Exercise)
3.26 Troubleshooting Exercise: Command Not Found
Some commands are not found that should be found, and I get an input/output error on /usr/bin. After rebooting, my server won't come up. Boot the server successfully and find the missing commands.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 26
3. Press enter to continue
4. Some errors observed
1. command not found: lsof (others include gc, lpr, prune)
2. fsck failed for at least one filesystem (not /).
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Make sure you look at all the errors on the screen where the boot failed, and then
type root's password for maintenance mode.
3. Errors on the screen include:
1. fsck.ext3: Bad magic number in super-block while trying to open /dev/sda7
2. you might try running e2fsck with an alternate superblock: e2fsck -b 8193 <device>
4. This indicates significant damage to /dev/sda7. There is about a 30% chance we
will recover this filesystem. Ask if there is a backup. The /usr directory contains
most of the installed files for the system. A reinstall is probable.
5. Type root's password for maintenance, and run e2fsck /dev/sda7. It suggests using
an alternate superblock. Run mke2fs -n /dev/sda7 to determine the location
of the alternate superblocks. The -n is a non-destructive option.
6. Since the primary superblock is at the beginning of the filesystem, start by using the last
alternates first. For example, run e2fsck -y -b 294912 /dev/sda7.
Repeat for each superblock listed in the mke2fs -n /dev/sda7 output until it works
or you run out of superblocks.
7. Run e2fsck -f /dev/sda7 to force another fsck on the filesystem.
8. Try mounting the filesystem: mount /dev/sda7 /usr
9. List the files in /usr and /usr/lost+found. Try rebooting by typing exit from
maintenance mode.
10. Do you trust this system to run properly?
11. Restore from backup -OR- Reinstall the OS
Task III: Root Cause
1. Corrupted ext3 filesystem /usr on /dev/sda7
2. Does an Automatic Repair fix this scenario? No
Some issues just cannot be fixed and a reinstall of the OS is the best course of action.
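Working through the alternate superblocks from last to first can be made less error-prone by extracting the list from the mke2fs -n output. This is a sketch under stated assumptions: the function name backup_superblocks_reversed is hypothetical, and it assumes the output contains a "Superblock backups stored on blocks:" line followed by comma-separated block numbers.

```shell
# Read `mke2fs -n DEVICE` output on stdin and print the backup superblock
# locations, last one first, one per line.
backup_superblocks_reversed() {
    awk '
        /^Superblock backups stored on blocks:/ {
            grab = 1; header = 1
            sub(/^[^:]*:/, "")        # keep only what follows the colon
        }
        grab {
            gsub(/,/, " ")
            count = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) { blocks[++n] = $i; count++ }
            if (count == 0 && !header) grab = 0   # the list has ended
            header = 0
        }
        END { for (i = n; i >= 1; i--) print blocks[i] }
    '
}
```

Each printed value can then be passed in turn to e2fsck -y -b <block> /dev/sda7. Check the e2fsck exit status rather than assuming zero means success: a status of 1 or 2 indicates that errors were found and corrected.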
(End of Exercise)
3.27 Troubleshooting Exercise: Waiting for Device after LUN
We attached a LUN from the SAN to the server. The boot process keeps asking me to fall back to a different device. It works, but the system shows the wrong device. Will this hurt the server or fail to work properly? Make sure the server boots without falling back, and the correct root device shows up in the mount and df commands.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 27
3. Press enter to continue
4. Some errors observed
1. Waiting for device /dev/sdd6 to appear: ..Could not find /dev/sdd6
2. df -h shows /dev/sdd6 mounted to root, but that is the device that could not be
found.
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
**NOTE** Change steps for non-existent device.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Make sure you look at all the errors on the screen where the boot failed.
3. Important messages on the screen, in addition to those listed above, include:
1. Mounting root /dev/sda5
2. /dev/sda5: clean
4. The system thinks /dev/sda5 is the root filesystem. Where is “init” found?
/sbin/init. Make sure /sbin/init exists on /dev/sda5.
5. Try boot installed system
6. chroot installed system
7. When you mount /dev/sda5 in rescue mode, it does not even have a /root or /sbin
directory. This is not the root filesystem.
8. As you explore, you find /dev/sda6 is the correct root filesystem.
9. Correct the root= option in /boot/grub/menu.lst to point to the correct root
filesystem and retest.
10. Correct the /etc/fstab to use the correct root filesystem and retest.
11. Once menu.lst and fstab are fixed, run mkinitrd to create a new ram disk with the
updated root filesystem information and retest.
12. Chroot Installed System
13. sed -i -e 's!/dev/sdd6!/dev/sda6!g' /boot/grub/menu.lst
14. sed -i -e 's!/dev/sdd6!/dev/sda6!g' /etc/fstab
15. mkinitrd
16. reboot
Task III: Root Cause
1. The root filesystem device changed to a non-existent device.
2. Does an Automatic Repair fix this scenario? Partially
1. df still lists /dev/sdd6 as the root device.
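A quick way to spot a root device that no longer exists is to list the root= values from the boot loader configuration and test each one. The helper below is a minimal sketch: the function name menu_root_devices is hypothetical, and it assumes GRUB legacy kernel lines of the form "kernel <image> root=<device> ...".

```shell
# Print each distinct root= device named on kernel lines of a GRUB menu.lst.
# Usage: menu_root_devices [FILE]   (FILE defaults to /boot/grub/menu.lst)
menu_root_devices() {
    awk '$1 == "kernel" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^root=/) print substr($i, 6)   # drop the "root=" prefix
    }' "${1:-/boot/grub/menu.lst}" | sort -u
}
```

From the chroot, each printed device can be checked with [ -b <device> ]. Any device that fails the test is a candidate for the sed correction shown in the steps.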
(End of Exercise)
3.28 Troubleshooting Exercise: Not Booting After Power Failure
We think we've fixed all our hardware problems, but the server still won't boot. Fix the problems so the server boots properly.
Objectives:
Task I: Configuration
Task II: Troubleshooting Procedure
Task III: Root Cause
Special Instructions and Notes:
None
Task I: Configuration
Configures the virtual machine for the assigned lab exercise.
1. Revert your snapshot
2. Run bplab 28
3. Press enter to continue
4. Some errors observed
1. fsck failed for at least one filesystem (not /).
Task II: Troubleshooting Procedure
Try to resolve the issue without looking at the troubleshooting procedure; otherwise, follow the troubleshooting steps below.
1. Find the last on-screen landmark that matches the troubleshooting table. Follow the
“Troubleshooting/Potential Fixes”.
2. Make sure you look at all the errors on the screen where the boot failed, and then
type root's password for maintenance mode.
3. Errors on the screen include:
1. fsck.ext3: Bad magic number in super-block while trying to open /dev/sda5
2. you might try running e2fsck with an alternate superblock: e2fsck -b 8193 <device>
4. This indicates significant damage to /dev/sda5. There is about a 30% chance we
will recover this filesystem. Ask if there is a backup. The /usr directory contains
most of the installed files for the system. A reinstall is probable.
5. Type root's password for maintenance, and run e2fsck /dev/sda5. It suggests using
an alternate superblock. Run mke2fs -n /dev/sda5 to determine the location
of the alternate superblocks. The -n is a non-destructive option.
6. Since the primary superblock is at the beginning of the filesystem, start by using the last
alternates first. For example, run e2fsck -y -b 294912 /dev/sda5.
Repeat for each superblock listed in the mke2fs -n /dev/sda5 output until it works
or you run out of superblocks.
7. Notice that all of the alternate superblocks failed.
8. This means the geometry has changed or the partition table is messed up.
9. You could try restoring the partition table from backup and seeing if that helps.
10. Restore from backup -OR- Reinstall the OS
Task III: Root Cause
1. Corrupted partition table or disk geometry
2. Does an Automatic Repair fix this scenario? No
Disk geometry problems generally must be fixed with a restore from backup or reinstall of the operating system. You can sometimes recover the partition table using gpart.
(End of Exercise)