Thursday, January 12, 2017

Let’s Automate Kdump

Kdump is a kernel crash dumping mechanism. It is very reliable because the crash dump is captured from the context of a freshly booted kernel rather than from the context of the crashed kernel. Kdump uses Kexec to boot into a second kernel whenever the system crashes. This second kernel, often called the capture kernel, boots with very little memory and captures the dump image.

Kexec is a fast booting mechanism that boots the secondary kernel directly, skipping the BIOS initialization process, so that the memory image of the previous kernel can be captured. Both “kdump” and “Kexec” have been available from RHEL 5.x onwards.

Kdump is supported on the i686, x86_64, ia64 and ppc64 platforms. The standard kernel and the capture kernel are one and the same on all four of these platforms.

How to configure kdump?

The usual way to get this done is to configure the required parameters manually, as explained below:

Install the “kexec-tools” package to start the process.
For a system (RHEL5/6/7 variants) to successfully capture a core dump (vmcore), the following conditions need to be met:

- Make sure enough free space is available under /var/crash (the default dump location).

- The “crashkernel” parameter should be set with a proper memory size, and it should be reflected in /proc/cmdline.

- The same “crashkernel” value should be set in /boot/grub/grub.conf (RHEL5/6) or /boot/grub2/grub.cfg (RHEL7).

- “kernel.sysrq” should be set to 1.

- “kernel.unknown_nmi_panic” should be set to 1.

- Make sure the crash memory space is reserved (check /proc/iomem).

- The crash kernel should be loaded (check /sys/kernel/kexec_crash_loaded).
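
The conditions above can be checked by hand with a few read-only commands. What follows is only a minimal sketch and not the script discussed later in this post: the function names are made up for illustration, and the optional path arguments exist only so the checks can be tried against sample files.

```shell
#!/bin/sh
# Read-only checks for the kdump prerequisites listed above.
# All function names here are hypothetical helpers, not part of any tool.

# Succeeds if a crashkernel= parameter is on the running kernel's command line.
has_crashkernel() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}" 2>/dev/null
}

# Succeeds if memory has been reserved for the capture kernel.
crash_memory_reserved() {
    grep -q 'Crash kernel' "${1:-/proc/iomem}" 2>/dev/null
}

# Succeeds if the capture kernel is loaded (the file contains 1).
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Succeeds if the given /proc/sys file contains exactly 1.
proc_flag_is_one() {
    [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

# Run one check and print a one-line verdict.
report() {
    if "$@"; then echo "OK:   $*"; else echo "FAIL: $*"; fi
}

report has_crashkernel
report crash_memory_reserved
report crash_kernel_loaded
report proc_flag_is_one /proc/sys/kernel/sysrq
report proc_flag_is_one /proc/sys/kernel/unknown_nmi_panic
```

Free space under /var/crash can be checked separately with `df -h /var/crash`.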

Another important point to keep in mind is the correct “crashkernel” value. According to Red Hat, the following values should be set based on total RAM and RHEL version:

What is the necessity of kdump?

“Kdump” helps in situations where a memory dump needs to be analyzed after the system has crashed or become unresponsive, since the dump shows the state of the system at that moment. By analyzing this memory dump (vmcore), an administrator or Linux expert can tune the system to avoid or overcome such incidents.

Let’s automate configuring kdump via a shell script...

I’ve written a simple shell script that sets up the “kdump” configuration properly on RHEL7.x/6.x/5.x variants.
These are the tasks performed by the script when it is run as root or as a sudo user:

- Install the kdump package if it is not installed.
- Verify that enough space is available under the /var file system.
- Verify that the “crashkernel=xxx” parameter is set as recommended by Red Hat for the RHEL version and total RAM; if it is not correct, set it.
- Make sure “crashkernel” is configured in /boot/grub/grub.conf (RHEL6/5) or /boot/grub2/grub.cfg (RHEL7) as recommended.
- Check that “crashkernel” is also added to /etc/default/grub (in the case of RHEL7).
- Verify that “kernel.sysrq” and “kernel.unknown_nmi_panic” are enabled and set to 1; if not, set them.
- Set the dump path, which is /var/crash (the default dump location), and the “core_collector” parameter as recommended by Red Hat in the /etc/kdump.conf file; if they are already set, quickly verify them.
- Make sure the required service (kdump) is set to come up on boot.
- Finally, if everything is set, ask the user whether to run a kdump test and, if yes, start the kdump service (if it is not already started). If the “crashkernel” parameter is not found in /proc/cmdline, the user is prompted to reboot the system.
- If the user wishes to test kdump, the script flushes dirty data from cache to disk, increases console logging verbosity, and re-mounts all file systems read-only before prompting one more time for confirmation to continue; it then crashes the system.
- If the “crashkernel” parameter needs to be re-configured, a system reboot is required.
So, this shell script relies only on native Linux commands, which makes it suitable for RHEL7.x/6.x/5.x variants. It performs all the sanity checks needed for kdump to function properly and also lets the user test kdump, as shown in the following sections.
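
Several of the tasks above follow the same “verify, and set only if needed” pattern. As a rough illustration of that pattern (not an excerpt from the attached script), the sketch below makes sure a sysctl-style file contains a given “key = value” line. The function name `set_sysctl_conf` is hypothetical.

```shell
#!/bin/sh
# Ensure "key = value" is present in a sysctl-style file.
# An existing assignment of the key is rewritten in place; otherwise the
# line is appended. Running it twice leaves the file unchanged.
# set_sysctl_conf is a hypothetical helper name used for illustration.
set_sysctl_conf() {
    key=$1
    value=$2
    file=$3
    # Escape dots so "kernel.sysrq" is matched literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        tmp=$(mktemp) || return 1
        sed "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo "${key} = ${value}" >> "$file"
    fi
}

# Example (as root), followed by "sysctl -p" to apply the values:
#   set_sysctl_conf kernel.sysrq 1 /etc/sysctl.conf
#   set_sysctl_conf kernel.unknown_nmi_panic 1 /etc/sysctl.conf
```

Writing through a temporary file and `cat` keeps the original file's ownership and permissions intact and avoids relying on `sed -i`.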

How does this work?

>> The following snip shows what the script does when run on a system that doesn’t have the kexec-tools package installed:

After installing “kexec-tools” and configuring the required parameters, the script prompts for a restart of the system so that a valid kdump initrd image gets generated.
Now all configurations are set and the system needs a reboot. Here is a snip of what has been set:

>> When run on a system where the kdump service is up and the “crashkernel” parameter is properly set, the script asks the user whether to test the crash dump.

So, if the user hits “y” when prompted, the script checks that the crash kernel is loaded, that “crashkernel” is active in the running kernel, and that the crash memory space is set. It then asks the user whether to continue or not, as shown above.

If the user hits “y” to continue, the script flushes dirty data to disk, re-mounts all active file systems read-only (as a safety step), enables debug mode for console messages, and then crashes the system so that the vmcore file gets recorded under the /var/crash path. The system may not respond for some time; once it reboots, the crash file is available for further analysis.
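
The sequence just described can be approximated with standard SysRq facilities. The sketch below is an assumption about how such a sequence might look, not the commands from the attached script, and `trigger_test_crash` is a hypothetical name. It only prints the commands unless `--execute` is passed, because the last step really does crash the machine.

```shell
#!/bin/sh
# Print (or, with --execute, actually run) a kdump test-crash sequence.
# WARNING: with --execute the final command panics the running kernel.
# trigger_test_crash is a hypothetical helper name used for illustration.
trigger_test_crash() {
    mode=${1:-dry-run}
    for cmd in \
        "sync" \
        "echo 1 > /proc/sys/kernel/sysrq" \
        "dmesg -n 8" \
        "echo u > /proc/sysrq-trigger" \
        "echo c > /proc/sysrq-trigger"
    do
        if [ "$mode" = "--execute" ]; then
            sh -c "$cmd"
        else
            echo "would run: $cmd"
        fi
    done
}

# Safe by default: this call only lists the steps.
trigger_test_crash
```

Here `sync` flushes dirty data, `dmesg -n 8` raises console verbosity, SysRq `u` re-mounts file systems read-only, and SysRq `c` triggers the crash.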

>> Any change to /etc/kdump.conf needs a reboot so that the initrd image file gets re-generated, and a message like the one below shows up during the reboot:

Any change to the “crashkernel” parameter also needs a reboot, since /proc/cmdline reflects only the parameters the running kernel was booted with.

>> So, irrespective of whether the kdump service is up or not, you can safely run this script; it validates the current state and acts accordingly.

>> The core dump path is the default one, /var/crash, with the recommended “core_collector” configuration. If you wish to dump the vmcore file to a remote server, you need to set this manually. If there is a central server that collects such dumps, you could set it in the script as required.

>> Any configuration change to kdump.conf, grub.cfg/grub.conf or /etc/default/grub is made only after taking a backup of the file. The backup is kept in the same directory as the original, with a “-ddmmyyyy” suffix added to the file name.
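
A backup with that kind of date suffix takes only a couple of lines of shell. This is a minimal sketch of the idea, with `backup_file` as a hypothetical helper name rather than a function taken from the attached script.

```shell
#!/bin/sh
# Copy a file to "<name>-ddmmyyyy" next to the original and print the
# backup path. Returns non-zero if the source file does not exist.
# backup_file is a hypothetical helper name used for illustration.
backup_file() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "skip: $src not found" >&2
        return 1
    fi
    dest="${src}-$(date +%d%m%Y)"
    # -p preserves mode, ownership and timestamps of the original.
    cp -p "$src" "$dest" && echo "$dest"
}

# Example (as root):
#   backup_file /etc/kdump.conf
#   backup_file /etc/default/grub
```

Note that a second run on the same day overwrites that day's backup.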

>> Parameters that are not mentioned here are left at their defaults.

Checksum values of this file are listed below:
File Name: configure-kdump
Size: 10.4KB
MD5: 490F9CBA5F9BE1CFD8939341671C8279
SHA1: F20578F3D203E476909BDC33667E89E6CD936B34

The script is attached as a text file. You need to remove the header lines (the first two lines, which are commented out), set the “execute” bit, and then run the script. Make sure the line indents (spaces/tabs) are not altered; otherwise, the script may not give proper results.
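
The downloaded file can be compared against the checksums above with `md5sum` and `sha1sum`. The sketch below assumes the file is saved as `configure-kdump` in the current directory and that the published values apply to the file as downloaded; `verify_sum` is a hypothetical helper name. The comparison ignores letter case because the values above are printed in upper case while the tools print lower case.

```shell
#!/bin/sh
# Succeeds if the digest of a file, computed with the given tool
# (md5sum or sha1sum), matches the expected value, ignoring case.
# verify_sum is a hypothetical helper name used for illustration.
verify_sum() {
    tool=$1
    file=$2
    expected=$3
    actual=$("$tool" "$file" | awk '{print $1}')
    [ "$(printf '%s' "$actual" | tr 'A-F' 'a-f')" = \
      "$(printf '%s' "$expected" | tr 'A-F' 'a-f')" ]
}

# Only runs if the file is present in the current directory.
if [ -f configure-kdump ]; then
    verify_sum md5sum configure-kdump 490F9CBA5F9BE1CFD8939341671C8279 &&
        verify_sum sha1sum configure-kdump F20578F3D203E476909BDC33667E89E6CD936B34 &&
        echo "checksums match" || echo "checksum mismatch"
fi
```

If the check is run after the header lines have been removed, a mismatch is to be expected, since the file contents have changed.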

