Operating Systems

How to monitor disk performance iowait on Linux

Table of Contents

How to monitor disk performance iowait on Linux

Description

When your hard disk run slow, your entire system slows down. So it figures the key to monitoring performance on any server implies monitoring the disk. Linux comes with a number of tools to assist with this operation, and this article aims to present some of the most common utilities, and some common use cases.

The Definition of IO Wait Time

To understand disk performance in Linux one has to understand what’s called io wait time. The quickest way to see IO Wait time is to use the top utility.. Referring to the diagram below, you will notice 1.3 wa This is the IO Wait Time. Although it seems a bit obscure as it’s referring to IO, it’s really just saying “How long must an idle CPU wait for the disk I/O to complete.“. The caveat is it’s not only waiting for the disk – the entire “IO” subsystem might be playing a role. As a rule of thumb though, you don’t really want more than 1.0.

top is one of the first tools that you reach for when checking to see if a disk is running at maximum or degraded performance and it’s universal, so learn to use it.

Historical Statistics

It’s all fine and dandy seeing what’s happening right now, but what if you needed to see historical statistics? In this article we provide a few ways os testing, some utilities, and the SNMP method.

FIO

Fio is our favourite as of July 2023. Fio has many options but can do a really grand 1 Gigabyte file test and show you what’s currently happening.

This is the command we prefer:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75

This will write a 1 GB file. Remember to delete it.

Benchmarks? Let’s show you some that we got:

Oldish HP Laptop with local SSD: 7 seconds
Host in SA, 3 seconds
Dedicated server in SA, 7 seconds
Mirror 1 server in SA, 39 seconds.
Random AWS server with HyperV and bare metal: 16 minutes
Low powered Supermicro Proxmox with ZFS over iSCSI via TrueNAS, 7 minutes
Random host in SA, 7 minutes
Random host in Germany, 3 seconds

As you can see times are from 3 seconds to 16 minutes. Does this means everything else is slow? It’s quite possible but sometimes perception is worse (or better) than reality.

Script

Fio for all it’s power doesn’t actually clearly show the total time spent.

This script relies on fio:

sudo apt install fio

Then use this Bash script:

#!/bin/bash

# Start timer to measure disk speed
start=$(date +%s)

# FIO does the heavy lifting by writing at 25% and reading at 75% a test file
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75

# Remove the temporary test file
rm random_read_write.fio

# End timer
end=$(date +%s)

# Calculate the time difference
duration=$((end - start))

# Display the number of seconds passed
echo "The command took $duration seconds to complete."

On FreeBSD skip the --ioengine parameter.

Sample outputs

Description	Read	Write	Time
Single NVMe	101MB/s	33.6MB/s	19
Host SSD Raid	115MB/s	38.4MB/s	17
RAID 5 + Cache on TrueNAS			20
RAID 5 Magnetic	6028kB/s	2014kB/s	148

SAR

SAR stands for System Activity Report and keeps track of historical system data, including CPU and disk I/O. To use the actual utility, just type sar. When you run sar, you will get historical statistics up to 10 minute minute intervals of your system that goes back to the start of the day. In the screenshot below, you will see sar output. What’s notable about the output are the spikes of 11, 14, 12, and 10. Then at 2AM an actual backup kicks off, and you see a dramatic increase in the disk I/O wait time.

At this point you might ask what is a normal range for Disk I/O wait time? In our experience, anything from 1 to 5 is normal, 10 starts getting slow, 20 is really slow, and anywhere above 20 is really very slow. These values are a bit relative though and we recommend checking your system on a regular basis to determine baselines, and experimenting with backups or the du command to test some limits. Leave us a comment to tell us what you think is normal for your system.

Installing SAR

If you’re system doesn’t have sar, then do this for Ubuntu/Debian:

apt install sysstat

Next change ENABLED=”false” to ENABLED=”true” in /etc/default/sysstat

Then

service sysstat restart

The SNMP Method

It turns out SNMP can also monitor system IO stats. To monitor exactly iowait time, use this OID but be sure to specify delta values instead of absolute values.

.1.3.6.1.4.1.2021.11.54.0

To test:

snmpwalk -v 1 -c your_community localhost 1.3.6.1.4.1.2021.11.54.0

Example of PRTG Configuration specifying Delta instead of Absolute.

Other Utilities and more SNMP

Two other notable utilities for monitoring that includes disk performance monitoring are iostat and the cat /proc/diskstats command. If your CentOS system doesn’t have iostat install, install it so yum install iostat

iostat

iostat has the handy d flag which allow you to continuously monitor the output, for example below every two seconds:

iostat -d 2 %iowait

If you don’t have iostat on your Ubuntu rig, do this:

sudo apt install sysstat -y

cat /proc/diskstats

/proc/diskstats is used by the handy Perl script for Webmin, called Webminstats, which draws fairly comprehensive RRD data of disk operation. Here is a snippet from that Perl code:

my $module_name;
my $info  = '/proc/diskstats';
my $EMPTY = EMPTY();

###############################################################################
# ask the system info on file system
sub read_data() {

	my $r_tab = read_full_file($info);
	my @res   = @{$r_tab};
	return @res;
}

More Disk SNMP Monitoring

If you’re looking for more general SNMP monitoring of disk activity, use the following OID:

snmpwalk -v 1 -c your_community localhost 1.3.6.1.4.1.2021.13.15.1

So What’s Causing the Slow Disk

The aim of this article is just to help you determine your disk is slow. To see what’s actually slowing it down, takes more work. As a starting point we generally recommend top, and looking at the top processes by CPU to see what is busy. If you are running a web server, this only paints part of the picture, you might have to go deeper under the hood with netstat to see how many actual connections are made to the web server. Perhaps start gracefully terminating the processes one by one to see if ‘WA’ recovers.

Conclusion

Disk I/O Monitoring is key to performance. Be sure to know what you’re dealing with. If you are working with many disks, graph the data to compare workload surges and ensure they are moved away if they affect other areas.

How to monitor disk performance iowait on Linux

How to monitor disk performance iowait on Linux

Description

The Definition of IO Wait Time

Historical Statistics

FIO

Script

Sample outputs

SAR

Installing SAR

The SNMP Method

Other Utilities and more SNMP

iostat

cat /proc/diskstats

More Disk SNMP Monitoring

So What’s Causing the Slow Disk

Conclusion

See Also

References

Tags

Share this article

Leave a Reply Cancel reply