Wednesday, December 6, 2017

Linux Performance Issues - How/Where to Start?

Let's talk about how to troubleshoot performance issues in Linux. Every so often we find a server (specifically a Linux system) performing slower than on previous days, feeling sluggish, or in the worst case responding so slowly that a simple command such as "ls" takes almost a minute to return. At that juncture the system admin or person in charge of the system faces a dilemma about what is causing the slowness: the processor (CPU overloaded), memory (high memory usage), the network (slow network channels), the disks (poor disk drive performance), or a bug in an application or the kernel. No amount of tuning will help if a hardware component is broken. Identifying the bottleneck at this point requires a proper analysis of the system sub-components. Therefore, I came up with this page, which documents simple steps/instructions that can be used to identify the culprit behind system slowness; more advanced steps/commands can then be used to dig further (only a few of the most commonly used native commands and steps are documented here).

<<---- General/common guidelines----->>

-- At a broad level, we first need to identify which resources are responsible for the slowness. The cause may be a bug in the kernel or an application, or a malfunctioning hardware sub-system. Hence, when analyzing a performance issue we need to study the system thoroughly and identify the hot spots. Once we know which component is causing the issue, further analysis becomes easier.

-- Since the customer usually does not provide (or is not aware of) the exact cause, it is better to start looking for causes at the system level. Understanding any recent changes also helps in narrowing down the cause.

-- When the exact cause is not known, one could start by checking the "load average" to see whether the CPU is the culprit or something else. If the load average looks fine then the issue could be with memory, network etc. At times checking the current load average is not enough (for example when the slowness happened earlier); in that case filter the SAR (System Activity Report) data for load (for example with "sar -q") around the time the performance glitch was reported.

-- Check recent system logs (/var/log/messages) for any failed/error/warning messages.

-- Run "vmstat 1 50" (the more advanced "dstat" command can also be used), which prints 50 samples of system statistics at 1 second intervals (process status, memory details, swap in/out, IO in/out, CPU usage, interrupts etc.), and check for any unusual trend (remember to ignore the first line of data since it is an average since the last reboot).

-- Slowness/glitches when accessing network shares, or during remote transfer/retrieval of files/data, point to the network segment, which then needs further analysis.

-- Once the main component causing the issue is identified, check whether that component needs tuning or whether a component upgrade is necessary.

-- If no abnormalities are reported by any of the hardware components then there could be a bug in either the application or the operating system. Sometimes improper tuning of an application is what results in poor performance.

-- If the system does not meet the minimum requirements of an application then we can't expect good & steady performance, so it is always recommended to make sure the system hardware sub-components meet the application prerequisites.

For each hardware subsystem the following common procedure/guidelines could be followed:

<<>> for CPU(processor) <<>>

-- If any process is consuming excessive CPU (100% CPU usage) then we may need to reduce the priority of that process with a re-nice. Also note that a high load does not necessarily mean that the CPU is busy computing; it could be waiting for input from other subsystems such as storage (IO wait).

-- The first command that gives a glimpse of the system load over the last 1, 5 and 15 minutes is "uptime". Run this command and check the load average. Those numbers show the average number of processes running or waiting for CPU (on Linux, also those in uninterruptible sleep, typically waiting for IO) over the last 1, 5, and 15 minutes respectively. In my sample run the CPU was not busy; it was a virtual system with one processor. Normally, we judge system load relative to the number of (logical) processors available. Simply speaking, a single processor system with a load average of 1 is as busy as a 4 CPU system with a load average of 4.
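
The per-processor comparison above can be sketched as a small shell helper. This is only an illustration: the function name "load_per_cpu" is made up for this example, and the values fed to it are sample numbers, not measurements.

```shell
# Hypothetical helper: divide a load average by the number of logical CPUs.
# Usage: load_per_cpu LOAD NCPU
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# Sample values: a load of 4 on 4 CPUs is as busy as a load of 1 on 1 CPU.
load_per_cpu 4 4
load_per_cpu 1 1

# On a live system (assumes /proc/loadavg and nproc are available):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result around 1.00 or above suggests that, on average, every logical processor had work queued; values well below 1.00 suggest spare CPU capacity.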


-- Use the "top" command to observe processor statistics; by default it lists processes sorted by CPU utilization (high to low). When "top" is run without options it opens in an interactive monitor mode that refreshes the data dynamically.
-- One could use "vmstat" (Virtual Memory Statistics; part of the procps package), "dstat" or "sar" (part of the sysstat package, which also provides other popular commands such as iostat, mpstat and pidstat) to dynamically view CPU stats, which helps in understanding the time spent running kernel-space code, user-space code, waiting (percentage of time spent blocked while waiting for IO to complete), idle etc. A spike in the "%usr" value indicates that the system spent its time running user-space processes, whereas a high "%sys" value says that the system was busy running kernel code.

Example: Using "vmstat" to print 10 samples at 1 second intervals, in kilobytes and with time stamps:

> In the above snap the processor is almost idle. So, let's spin up the "sha1sum" command to create load on the processor and see how it behaves. I've started two of these processes in the background. Now you can see that the processor is busy executing user-space tasks, as shown in the below snap:-

> Also, you may notice that the "wa" column is zero ("%wa" in top), which means the CPU is not waiting for IO. At this point we know that the processor is busy and that there are 2 running tasks consuming high CPU cycles, so check which processes those are using the "top" command:-

> As shown in the "top" output, the "sha1sum" processes are eating up CPU time and are the culprits for the high CPU usage. Also, notice the "load average" on the first line: for a single CPU system it is high.
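
To summarise a vmstat run like the one walked through above, the samples can be averaged per column. The following is a minimal sketch, assuming the usual vmstat layout (a group header line, a column header line starting with "r", then one row per sample). The function name "vmstat_avg" and the sample numbers are made up for illustration.

```shell
# Hypothetical helper: average one vmstat column (e.g. us, sy, wa) from
# vmstat output on stdin, ignoring the first sample (average since boot).
vmstat_avg() {
    awk -v col="$1" '
        $1 == "r" && !idx { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        $1 !~ /^[0-9]+$/  { next }               # header lines (vmstat may repeat them)
        !skipped          { skipped = 1; next }  # first sample: average since boot
        idx               { sum += $idx; n++ }
        END               { if (n) printf "%.1f\n", sum / n }
    '
}

# Made-up sample resembling a system busy with user-space work.
vmstat_avg us <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 500000  20000 300000    0    0    10     5   50  100  2  1 96  1  0
 2  0      0 499000  20000 300000    0    0     0     0  900  300 98  2  0  0  0
 2  0      0 498000  20000 300000    0    0     0     0  950  310 96  4  0  0  0
EOF

# On a live system one would pipe the real output instead:
#   vmstat 1 10 | vmstat_avg wa
```

A high average for "us" with "wa" near zero matches the sha1sum case above: the CPU is busy with user-space work rather than waiting on IO.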

-- The "mpstat" command could be used to view per-processor statistics on a multiprocessor system. If the processor architecture is NUMA based then "numactl" and related commands (such as "numastat") could be used to understand NUMA hits and misses.
Example: Using "mpstat" to print CPU stats at 1 second intervals for 3 samples:

-- "dstat" is another versatile tool for fetching system statistics, with many options such as "--top-cpu" (shows the most expensive CPU process), "--top-mem" (the process using the most memory), "--top-io" (the most expensive IO process) etc. The command is provided by the "dstat" package, which needs to be installed from the standard server repo channel or from the ISO image. To get the list of arguments that can be passed to dstat, run "dstat -h".

Example: To pull system statistics dynamically with a delay of 1 second for 10 samples:

-- Stop unnecessary processes.
-- Reduce the priority of a process that is consuming too many CPU cycles. The "renice" command can be used to change the priority of a running process. Nice values range from -20 to 19 (highest to lowest priority), so renicing a CPU-hungry process to a higher nice value (lower priority) would be a good option.
-- If there is a need to check the CPU statistics for earlier days then we could use 'sar' data (/var/log/sa) and observe the stats:

#LANG=C /usr/bin/sar -P ALL -f /var/log/sa/sa<XX>


#LANG=C /usr/bin/sar -u -f /var/log/sa/sa<XX>

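Example: To view processor statistics for today between 4:00 - 5:00 (at the 10 minute collection interval), the "-s" and "-e" options of sar can limit the time window, for example "sar -u -s 04:00:00 -e 05:00:00":-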

-- To find out top 5 CPU resource hogs:

#ps -eo pcpu,pmem,pid,ppid,user,stat,args --sort=-pcpu | head -6|sed 's/$/\n/'

<<>> for RAM (memory) <<>>

-- Start analyzing high memory usage with the "free" command, or alternatively with "top", "vmstat" etc. Use "free" and check the memory available, which is roughly free + buffers + cached pages (in RHEL 7.x there is an "available" column that reports this directly).

Example: To understand memory usage dynamically, we could run the "dstat" command with a 2 second delay and print 5 samples as shown below:

-- Alternatively, run either the "free" command or "cat /proc/meminfo" as shown (this output was taken on a RHEL 6.x virtual system; on RHEL 7.x a "MemAvailable" parameter was added to /proc/meminfo):-
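
The "free + buffers + cached" estimate can be computed straight from /proc/meminfo-style text. The sketch below is only an illustration: the function name "mem_available_kb" and the sample figures are made up, and the fallback sum is the rough estimate described above, not the kernel's own MemAvailable calculation.

```shell
# Hypothetical helper: print available memory in kB from /proc/meminfo-style
# text on stdin. Uses MemAvailable when present (newer kernels), otherwise
# falls back to the rough estimate MemFree + Buffers + Cached.
mem_available_kb() {
    awk '
        $1 == "MemAvailable:" { avail = $2; have_avail = 1 }
        $1 == "MemFree:"      { free = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        END {
            if (have_avail) printf "%d\n", avail
            else            printf "%d\n", free + buffers + cached
        }
    '
}

# Made-up sample in the older format (no MemAvailable line).
mem_available_kb <<'EOF'
MemTotal:        1020000 kB
MemFree:          100000 kB
Buffers:           20000 kB
Cached:           300000 kB
SwapCached:            0 kB
EOF

# On a live system:
#   mem_available_kb < /proc/meminfo
```

Matching on the exact field name keeps "SwapCached:" from being counted as "Cached:".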


-- If a system shows high memory usage (almost no free memory left) and swap space is also being consumed, it usually indicates that the system is under memory pressure. Memory management is one of the nice features of the Linux kernel, which manages memory intelligently to get the best out of it.

Why is swap being used even though there is enough free memory?

Linux handles swap space efficiently. While swap is nothing more than a guarantee in case of over-allocation of main memory in other operating systems, Linux kernel utilizes swap space far more efficiently. Virtual memory is composed of both physical memory and the disk subsystem or the swap partition. If the virtual memory manager in Linux realizes that a memory page has been allocated but not used for a significant amount of time, it moves this memory page to swap space. Often you will see daemons such as "getty" that will be launched when the system starts up but will hardly ever be used. It appears that it would be more efficient to free the expensive main memory of such a page and move the memory page to swap. This is exactly how Linux kernel handles swap, so there is no need to be alarmed if you find the swap partition filled to 50%. The fact that swap space is being used does not mean a memory bottleneck but rather proves how efficiently Linux handles system resources. Also, a swapped out page stays in swap space until there is a need for it, that is when it gets moved in (swap-in).

Why is there a high amount of cache memory being consumed? What is this "cached/cache" in free command output?

Simply put, it shows the amount of RAM occupied by the "page cache". Cached pages are nothing but data/files that get copied into RAM when the kernel performs read/write operations on a disk. The reason for keeping the page cache is I/O performance. The kernel keeps these files in RAM and frees them whenever a process requests new memory and there is no free space left in RAM. Pages that have only been read are marked "clean"; writes mark them "dirty", and they become clean again once written back to disk. Cached pages remain in memory even when they are not in use, until the kernel needs to reclaim them. So, as system usage goes on, the size of the page cache grows. This is usually nothing to worry about; it becomes a concern only when the cache is consuming most of the memory and the system starts to swap pages.

-- Check page faults (major faults) using the "sar -B" command. Major faults add delay, since the page needs to be brought from disk into memory.

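Example: To check the page faults that happened within a certain time window using the sar command (the "-s" and "-e" options set the start and end time, for example "sar -B -s 10:00:00 -e 11:00:00"):-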

The "dstat --vm" command can also be run to view major/minor page faults dynamically.

-- Use "ps" to view major and minor page faults for a process as shown below:-

#ps -o pid,comm,minflt,majflt <ProcessID>

-- To find per-process memory consumption, use the "pidstat" command, which is part of the sysstat package. To find the memory usage of active processes we could use "pidstat -l -r|sort -k8r"; otherwise use "pidstat -p ALL -l -r|sort -k8r" to cover every process, as shown below :-

-- Find the top 5 processes consuming memory using the 'ps' command:

#ps -eo pcpu,pmem,pid,ppid,user,stat,args --sort=-pmem | head -6|sed 's/$/\n/'

-- If there is a need to check the memory statistics for earlier days then we could use sar data (/var/log/sa) and observe the stats:

#LANG=C /usr/bin/sar -r -f /var/log/sa/sa<XX>

EXAMPLE: LANG=C /usr/bin/sar -r -f /var/log/sa/sa22

-- To check swap statistics of earlier days, please use sar command as shown below:-

#LANG=C /usr/bin/sar -S -f /var/log/sa/sa<XX>

-- Check the private memory space of a process using "pmap" command.
-- Try to implement hugetlb, increase/decrease page size, stop unnecessary services/daemons, add more memory if there is a physical memory crunch.

-- If there is high page cache pressure then one could start flushing dirty data sooner than the defaults by tuning "vm.dirty_background_ratio" to a lower value. Also, in cases where the kernel starts to swap out pages instead of reclaiming from the cache, one could lower "vm.swappiness" to decrease the rate at which pages are swapped out. A low swappiness value is recommended for database workloads. For example, for Oracle databases, Red Hat recommends a swappiness value of 10. The default value is 60.

dirty_ratio: Defines a percentage value. Write-out of dirty data begins (via pdflush) when dirty data comprises this percentage of total system memory. The default value is 20. Red Hat recommends a slightly lower value of 15 for database workloads.

dirty_background_ratio: Defines a percentage value. Write-out of dirty data begins in the background (via pdflush) when dirty data comprises this percentage of total memory. The default value is 10. For database workloads, Red Hat recommends a lower value of 3.

-- If there is a memory leak from an application then advanced commands/tools such as "valgrind" could be used to detect such issues (need to install valgrind package).

<<>> for Hard Disk Drives (IO) <<>>

-- Slow disks cause memory buffers to fill up, which delays all disk operations. CPU IO wait time (idle time spent waiting on outstanding IO) increases, since the CPU is waiting for IO from the disks.

-- The "vmstat" command gives disk stats in terms of blocks received from ("bi") and sent to ("bo") block devices per second. It also shows the number of processes running ("r") or blocked ("b").

-- The "iostat" command (reads data from /proc/diskstats; part of the sysstat package), or "iostat -xd" (extended statistics, disk stats only), is a good tool for analyzing the disk subsystem. Look at "await" (average wait time), "svctm" (service time), "avgrq-sz" (average request size), "avgqu-sz" (average queue length), and rrqm/s & wrqm/s (read/write requests merged per second). The command "iostat -xd 1 10" prints 10 samples of iostat output with a delay of 1 second (ignore the first sample since it is an average since boot).

-- IOPS, or I/O operations per second, is the sum (r/s + w/s); it is displayed as "tps" when iostat is run without the "-x" argument.


rrqm/s = The number of read requests merged per second that were queued to the io scheduler for the device.

wrqm/s = The number of write requests merged per second that were queued to the io scheduler for the device.

r/s = The number of read requests that were completed by the device per second.

w/s = The number of write requests that were completed by the device per second.

rsec/s = The number of sectors read from the device per second, as measured by completed io.

wsec/s = The number of sectors written to the device per second, as measured by completed io.

avgrq-sz = The average size, in sectors, of the requests that were completed by the device. This field is always displayed in sectors; neither -k nor -m will affect the units used by this column. This value is computed from other fields as shown:

avgrq-sz = ( rsec/s + wsec/s ) / ( r/s + w/s )

avgqu-sz = The average queue length of the requests that were issued to the device.

await = The average time, in milliseconds, that it took to complete I/O requests. This includes the time spent by the requests within the io scheduler sort queue and time spent by storage servicing them. This value is measured on a per io basis from the point at which the io is submitted into the scheduler until I/O done time.

svctm = The effective average service time, in milliseconds, for completed I/O requests that were issued to the device.
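
As a quick worked example of the two derived values above (IOPS and avgrq-sz), here is a minimal sketch. The function names "iops" and "avg_request_size" are made up for illustration, and the input figures are sample numbers rather than real iostat output.

```shell
# Hypothetical helpers mirroring the definitions above.
#   iops R_PER_S W_PER_S              -> r/s + w/s
#   avg_request_size RSEC WSEC R W    -> (rsec/s + wsec/s) / (r/s + w/s), in sectors
iops() {
    awk -v r="$1" -v w="$2" 'BEGIN { printf "%.2f\n", r + w }'
}

avg_request_size() {
    awk -v rsec="$1" -v wsec="$2" -v r="$3" -v w="$4" 'BEGIN {
        total = r + w
        if (total == 0) { print "0.00"; exit }   # idle device: avoid dividing by zero
        printf "%.2f\n", (rsec + wsec) / total
    }'
}

# Made-up sample: 80 reads/s and 20 writes/s moving 800 and 200 sectors per second.
iops 80 20
avg_request_size 800 200 80 20
```

With these sample numbers the device handles 100 requests per second at an average of 10 sectors per request.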

-- Always look at "await" and "%util" for each block device; high values show which device is IO bound and help us to troubleshoot further.

-- The "iotop" command (installed by the iotop package, available in the standard ISO image or repo channel) is a top-like command which dynamically displays current I/O usage by processes or threads on the system. One could run "iotop -P" (show only processes) or "iotop -o" (show only processes or threads actually doing I/O) if required.

-- Also, the usual process list command "ps" shows processes that are stuck in the "D" state (uninterruptible sleep, usually waiting on IO), which is useful information when troubleshooting. Note that defunct (zombie) processes are shown as "Z", which is a different state.
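
Picking the uninterruptible-sleep processes out of a process listing can be sketched as below. The function name "blocked_on_io" and the sample rows are made up for illustration. The helper expects "STAT PID COMMAND" rows and prints only the first word of the command.

```shell
# Hypothetical helper: keep only rows whose state column starts with "D"
# (uninterruptible sleep). Expects "STAT PID COMMAND" rows on stdin.
blocked_on_io() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Made-up sample rows.
blocked_on_io <<'EOF'
Ss 1 systemd
D 2345 dd
R+ 2400 ps
D+ 2377 rsync
EOF

# On a live system one would pipe in real data, for example:
#   ps -eo stat=,pid=,comm= | blocked_on_io
```

Processes that stay in this list across several runs are worth correlating with the "await" and "%util" figures of the devices they use.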

-- Using below command one could view the disk IO stats on previous days:

# LANG=C /usr/bin/sar -dp -f /var/log/sa/sa<XX>
-- It is important to note that high I/O wait percentages are not necessarily a concern, especially on machines running I/O bound applications. Sometimes, however, they are an indication of slow responses from the disk sub-system, or of a kernel issue that needs further analysis and tuning. Check with the disk drive hardware vendor whether the device is performing optimally or whether something such as a firmware update is necessary.

-- At times it may be necessary to change the default elevator (disk elevator or IO scheduler) to one better suited to the workload. "deadline" is the default IO scheduler in RHEL 7.x (except for SATA drives, which still use CFQ); earlier releases defaulted to "CFQ" (Completely Fair Queuing).
-- If the system has both spindle based drives and SSDs then speed could be improved by using an SSD as a cache device for the larger hard drives. Refer this link for more details:

<<>> for Network <<>>

-- There are many common pre-checks to be done before diving into troubleshooting a network glitch. Make sure that the network drivers/firmware of the system are up to date. Check that the network interface speed matches the router/gateway speed, since overall network speed is limited by the slowest network component in the infrastructure.
-- Common native Linux network commands such as "netstat", "ip", "ethtool", "ss" (socket statistics) etc. could be used to understand network statistics and infrastructure. "netstat" provides details about open network connections and stack statistics. It basically pulls its data from /proc/net/[dev|tcp|unix] and other files under the "/proc/net/" directory.

-- The "/proc/net/dev" file provides overall per-interface counters of packets received/sent/dropped, collisions etc. The "/proc/net/tcp" file is another useful source, providing details about TCP sockets.
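
The error and drop counters in "/proc/net/dev" can be pulled out with a few lines of awk. The sketch below is only an illustration: the function name "net_errors" and the sample counters are made up, and it assumes the usual layout of two header lines followed by one line per interface, with 8 receive counters followed by 8 transmit counters.

```shell
# Hypothetical helper: print per-interface receive/transmit error and drop
# counters from /proc/net/dev-style text on stdin.
net_errors() {
    awk '
        NR <= 2 { next }        # skip the two header lines
        {
            sub(/:/, " ")       # interface name may be glued to the first counter
            printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
        }
    '
}

# Made-up sample counters.
net_errors <<'EOF'
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 5000000    4000    2    7    0     0          0         0  3000000    3500    1    3    0     0       0          0
EOF

# On a live system:
#   net_errors < /proc/net/dev
```

The counters are cumulative since boot, so comparing two readings taken some time apart shows whether errors or drops are still increasing.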

-- The "ethtool" command is the ideal one for seeing packet drops/loss at the hardware level. Look out for "errors|drops|discards|fail|missed" etc. while analyzing the data (example: ethtool -S eth0|egrep -i "errors|drops|discards|fail|missed"). This command can also be used to identify the network card driver/firmware version ("ethtool -i eth0"). "ip -s link" is also a good option for checking drops/errors/collisions etc.

-- To test network bandwidth one may use qperf, iperf, netcat, dd and other available open-source commands. "qperf" is available in the default RHEL server repo channel (and in the standard RHEL ISO image for offline install). Refer this link to know how to use qperf to test bandwidth:

> To check network throughput between server and client, run the "qperf" command (without arguments) on the server end, then run "qperf -t 60 --use_bits_per_sec <ServerIP> tcp_bw" on the client as shown below; latency can be measured using the "tcp_lat" test:


-- The "dropwatch" command installed by dropwatch package is an interactive tool which can monitor packet drops by kernel.

-- The "nethogs" application, available in the Extra Packages for Enterprise Linux (EPEL) repository, lists network usage per application, for applications using TCP connections.

-- If there is a need to analyze issues at the network packet level then one could use the "tcpdump" command to capture packets for further analysis. It should be installed and available by default on standard installations. To capture network packets this format could be used:
" tcpdump -s 0 -i {INTERFACE} -w {FILEPATH} [filter expression] ". So, to capture traffic on interface "eth0", we could run "#tcpdump -s 0 -i eth0 -w /tmp/tcpdump.pcap" and hit "ctrl+c" to terminate the capture. Later the same file can be read using the command
"#tcpdump -r /tmp/tcpdump.pcap".
> A sample tcpdump capture over a short span of time is shown in the below snap:-

-- At times an incorrect DNS configuration may also lead to a glitch in network performance.
-- We could check the network stats of earlier days (by default no more than 28 days of data is kept) using the sar command as shown below:-
#LANG=C /usr/bin/sar -n DEV -f /var/log/sa/sa<XX>

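 > To see the network packet flow on an earlier day within a specific time window (provided sysstat was collecting data), add the "-s" and "-e" options, for example "sar -n DEV -s 04:00:00 -e 05:00:00 -f /var/log/sa/sa<XX>":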

* Further, more advanced investigation may be required.

This is a wonderful Red Hat article on network performance analysis and tuning:

All the best!
