Linux Disk I/O Monitoring: Beyond iostat

iostat is the first tool everyone reaches for when disk performance tanks. It's in every sysadmin's muscle memory. But it's also a liar—or at least, a liar by omission.

You'll see await spike to 200ms and %util hit 99%, and you'll still have no idea whether it's a single noisy application, a kernel bug, or a failing drive. iostat shows you the aggregate pain, not who's causing it. That's where most troubleshooting stops, and where most of my 3am pages have started.

This post is a runbook for going deeper into disk I/O. We'll move past iostat and build a mental model of the Linux I/O stack, then use the right tools to pinpoint the culprit before your boss asks why the database is slow.

The iostat Blind Spot

Let's set up a scenario. You SSH into a production server, run iostat:

$ iostat -xz 1
Device            r/s     w/s     rMB/s   wMB/s r_await w_await %util
sda              45.2    120.1    2.1     8.3   15.2    250.3   87.4

That w_await of 250ms is bad. But iostat can't tell you:

Is PostgreSQL writing huge transactions, or is it a backup job?
Are writes queuing in the kernel, or is the disk itself slow?
Is this a single process or a thousand small processes?
Did this start 30 seconds ago or has it been happening all week?

You're flying blind. iostat is a symptom detector, not a diagnosis tool.
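
To see why iostat can only report aggregates, it helps to look at its raw material: per-device counters like the ones in /proc/diskstats, which carry no process information at all. The sketch below is not how iostat is implemented; it is a minimal approximation of w_await from two snapshots, assuming the documented field layout (field 3 is the device name, field 8 is writes completed, field 11 is milliseconds spent writing). The function name w_await and the snapshot file paths are made up for illustration.

```shell
# w_await DEVICE BEFORE_FILE AFTER_FILE
# Prints the average milliseconds per completed write between two
# /proc/diskstats snapshots. Assumed fields: $3 = device name,
# $8 = writes completed, $11 = ms spent writing.
w_await() {
    awk -v dev="$1" '
        $3 != dev { next }                        # ignore other devices
        NR == FNR { w0 = $8; ms0 = $11; next }    # first file: baseline
        { dw = $8 - w0; dms = $11 - ms0 }         # second file: deltas
        END {
            if (dw > 0) printf "%.1f\n", dms / dw
            else        print "0.0"               # no writes completed
        }
    ' "$2" "$3"
}

# Usage (hypothetical paths):
#   cat /proc/diskstats > /tmp/before; sleep 5; cat /proc/diskstats > /tmp/after
#   w_await sda /tmp/before /tmp/after
```

Every input here is a per-device total, which is why no amount of staring at iostat will produce a process name.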

Meet iotop: Process-Level I/O Visibility

iotop is iostat's smarter sibling. It shows you which process is doing the I/O, and it's the first thing I run after iostat in a real incident.

Install it:

sudo apt-get install iotop  # Debian/Ubuntu
sudo yum install iotop      # RHEL/CentOS

Run it:

sudo iotop -o -b -n 5

Breakdown:

-o: only show processes doing I/O (cuts the noise)
-b: batch mode (for logs or scripts)
-n 5: exit after 5 iterations

You'll see output like:

TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>    COMMAND
1234 be/4  postgres  0.00 B/s   45.2 MB/s  0.00%   99.8%  postgres: writer
5678 be/4  backup    12.3 MB/s   0.00 B/s  0.00%   0.12%  rsync

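Now you know: PostgreSQL's writer process is hammering the disk. You can kill the backup, tune PostgreSQL's checkpoint and background-writer settings (the writer flushes dirty shared buffers, so wal_buffers is not the relevant knob here), or escalate to the DBA. You have a name and a number.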

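Gotcha: iotop samples per-task I/O accounting from the kernel (over the taskstats netlink interface; the same counters appear in /proc/<pid>/io), so it only sees tasks that are alive while it samples. A process that finished or crashed between refreshes never shows up. For short-lived processes and post-mortem analysis, you need something else.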
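
The cumulative counters in /proc/<pid>/io can also be read directly, which is handy on a box where iotop is not installed. This is a rough sketch rather than a replacement for iotop: the function names proc_io_bytes and io_write_rate are hypothetical, and reading another user's /proc/<pid>/io generally requires root.

```shell
# proc_io_bytes FILE
# Prints "read_bytes write_bytes" from a /proc/<pid>/io-style file.
proc_io_bytes() {
    awk -F': *' '
        BEGIN { r = 0; w = 0 }
        $1 == "read_bytes"  { r = $2 }
        $1 == "write_bytes" { w = $2 }   # exact match, so cancelled_write_bytes is skipped
        END { print r, w }
    ' "$1"
}

# io_write_rate PID [SECONDS]
# Prints the average bytes/s written to storage by PID over the interval.
io_write_rate() {
    local pid="$1" secs="${2:-5}" before after
    [ -r "/proc/$pid/io" ] || { echo "cannot read /proc/$pid/io" >&2; return 1; }
    before=$(proc_io_bytes "/proc/$pid/io")
    sleep "$secs"
    [ -r "/proc/$pid/io" ] || { echo "process $pid exited during sampling" >&2; return 1; }
    after=$(proc_io_bytes "/proc/$pid/io")
    # The second word of each sample is write_bytes.
    echo $(( (${after#* } - ${before#* }) / secs ))
}

# Usage: sudo sh -c '. ./io.sh; io_write_rate 1234 5'   # io.sh and PID 1234 are placeholders
```

The second readability check illustrates the gotcha: if the process exits mid-sample, its counters vanish with it.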

Kernel-Level Tracing with eBPF

For the hard cases—intermittent slowness, processes that exit quickly, or kernel-level I/O queuing—you need eBPF (extended Berkeley Packet Filter). The BCC (BPF Compiler Collection) toolkit has several I/O tools that trace syscalls and kernel functions in real time.

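Install BCC:

sudo apt-get install bpfcc-tools linux-headers-$(uname -r)  # Debian/Ubuntu
sudo yum install bcc-tools                                  # RHEL/CentOS

The packages put the tools in different places. RHEL/CentOS installs them under /usr/share/bcc/tools/, which is the path used in the rest of this post. Debian/Ubuntu installs them on the PATH with a -bpfcc suffix, so biolatency becomes biolatency-bpfcc and biotop becomes biotop-bpfcc.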

Use biolatency to histogram I/O latency:

sudo /usr/share/bcc/tools/biolatency -m 10

Output:

msecs           : count     distribution
0 -> 1          : 1234     |******                         |
1 -> 2          : 5678     |*******************************|
2 -> 4          : 890      |****                           |
4 -> 8          : 45       |                               |
8 -> 16         : 12       |                               |

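This shows the distribution of I/O latencies. If most I/Os complete in 1-2ms but a tail takes 8-16ms, you're probably looking at intermittent queueing or contention rather than a uniformly slow disk. If all I/Os are slow, the disk itself is the likely bottleneck. By default biolatency times each request from device issue to completion; add -Q to include the time requests spend queued in the kernel, and compare the two histograms to see where the delay accumulates.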
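
Eyeballing the bars works during an incident, but a number is easier to paste into a ticket. The following is a small sketch that reads biolatency -m output on stdin and reports what percentage of I/Os landed in buckets starting at or above a threshold. The function name tail_pct is hypothetical, and the parser assumes bucket rows of the form "low -> high : count |bars|" as in the sample above.

```shell
# tail_pct THRESHOLD_MS
# Reads `biolatency -m` output on stdin; prints the percentage of I/Os
# whose bucket starts at or above THRESHOLD_MS.
tail_pct() {
    awk -v min="$1" '
        # Bucket rows look like: "8 -> 16 : 12 |****   |"
        $2 == "->" && $4 == ":" {
            total += $5
            if ($1 + 0 >= min + 0) tail += $5
        }
        END {
            if (total > 0) printf "%.2f\n", 100 * tail / total
            else           print "0.00"
        }
    '
}

# Usage (path as used in this post; adjust for your distribution):
#   sudo /usr/share/bcc/tools/biolatency -m 10 1 | tail_pct 8
```

For the sample histogram above, 12 of 7859 I/Os fall in the 8-16ms bucket, so tail_pct 8 reports 0.15.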

Use biotop to see which process is causing the latency:

sudo /usr/share/bcc/tools/biotop -C 10

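Output (columns simplified here for readability; real biotop prints PID, COMM, D, MAJ, MIN, DISK, I/O, Kbytes, and AVGms, and the exact layout varies by version):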

COMMAND           PID      COMM             READ_MB   WRITE_MB  MSEC_READING  MSEC_WRITING
mysql             1456     mysqld           0.0       120.3     0             8920
rsync             2341     rsync            45.2      0.0       3400          0

Now you see: MySQL is spending 8.9 seconds writing per 10-second interval. That's your culprit.

Gotcha: eBPF tools require Linux 4.4+ and normally need root or CAP_SYS_ADMIN. They also add overhead that grows with the I/O rate (typically 1-5% CPU), so don't leave them running 24/7 in production.
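
Before relying on these tools at 3am, it is worth checking that the box can actually run them. Here is a minimal preflight sketch under a few assumptions: the helper names kernel_at_least and find_bcc_tool are hypothetical, the two install paths checked are the common RHEL-style and Debian-style locations, and the version check only compares major and minor numbers.

```shell
# kernel_at_least MAJOR MINOR [VERSION]
# Succeeds if VERSION (default: running kernel) is at least MAJOR.MINOR.
kernel_at_least() {
    local want_major="$1" want_minor="$2" ver="${3:-$(uname -r)}"
    local major rest minor
    major=${ver%%.*}               # "5" from "5.15.0-91-generic"
    rest=${ver#*.}                 # "15.0-91-generic"
    minor=${rest%%[!0-9]*}         # "15"
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# find_bcc_tool NAME
# Prints the first usable path for a BCC tool; fails if none is found.
find_bcc_tool() {
    local name="$1" candidate
    for candidate in "/usr/share/bcc/tools/$name" "/usr/sbin/$name-bpfcc"; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    command -v "$name" 2>/dev/null   # fall back to PATH; non-zero if absent
}

# Usage:
#   kernel_at_least 4 4 || echo "kernel too old for eBPF tools"
#   tool=$(find_bcc_tool biolatency) && sudo "$tool" -m 10 1
```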

Building Your Disk I/O Runbook

Here's the order I follow in a real incident:

Step 1: Confirm the symptom (30 seconds)

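iostat -xzy 1 5

The -y flag skips the first report, which shows averages since boot rather than current activity. Look at %util, r_await, and w_await. If all three are high, the disk is congested. If %util is low but await is high, individual requests are slow even though the device is mostly idle; suspect bursty I/O, a misbehaving device or path, or slow network or virtual storage. Keep in mind that %util saturates early on SSDs, NVMe drives, and RAID arrays, which serve many requests in parallel, so a high value there does not by itself mean the device is maxed out.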

Step 2: Find the culprit (1-2 minutes)

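sudo iotop -o -b -k -qqq -n 10 | sort -k6,6 -rn | head -20

Here -k forces every rate into K/s so the numbers compare cleanly, and -qqq drops the header and summary lines. Field 6 is the DISK WRITE value, so this sorts by write rate descending. The top process is your suspect.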
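
A single sample can crown the wrong suspect if the real offender writes in bursts. One way around that is to total the write column per thread across all samples. This is a sketch that assumes the classic iotop batch layout with -k (TID, PRIO, USER, read value, K/s, write value, K/s, ...); other iotop builds may format rows differently. The function name sum_iotop_writes is hypothetical.

```shell
# sum_iotop_writes
# Reads `iotop -b -k -qqq` rows on stdin, sums the DISK WRITE value
# (field 6, K/s) per TID, and prints "total TID USER", highest first.
sum_iotop_writes() {
    awk '
        $5 == "K/s" && $7 == "K/s" {     # data rows only; skips any headers
            total[$1] += $6
            user[$1] = $3
        }
        END {
            for (tid in total) printf "%.2f %s %s\n", total[tid], tid, user[tid]
        }
    ' | sort -k1,1 -rn
}

# Usage:
#   sudo iotop -o -b -k -qqq -n 10 | sum_iotop_writes | head -5
```

With iotop's default one-second interval, the summed K/s figure is roughly the number of kilobytes each thread wrote during the run.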

Step 3: Understand the pattern (2-5 minutes)

sudo /usr/share/bcc/tools/biolatency -m 10 3

This prints three 10-second histograms and exits; raise the count to watch for a full minute. If latencies are spread out with a long tail, suspect queueing or contention. If they're all high, the disk is slow.

Step 4: Drill into the process (5-10 minutes)

sudo /usr/share/bcc/tools/biotop -C 30 | grep -w "$PID"

Set PID to the suspect from Step 2 first; grep -w keeps PID 123 from also matching 1234. Watch the specific process's I/O pattern. Is it steady, or does it spike? Is it reads or writes?

Step 5: Check the filesystem (immediate)

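df -h
fs_usage=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$fs_usage" -gt 90 ]; then echo "Root filesystem is ${fs_usage}% full"; fi

A completely full filesystem makes writes fail with ENOSPC rather than queue, but a nearly full one can slow writes down as the allocator hunts for free space and fragmentation grows.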
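
The check above only looks at the root filesystem, and it ignores inodes, which can run out while plenty of blocks remain. The sketch below generalizes it to every mounted filesystem by filtering df -P output. The function name fs_over is hypothetical, and the parser assumes mount points without spaces.

```shell
# fs_over THRESHOLD
# Reads `df -P` or `df -Pi` output on stdin and prints "mountpoint percent"
# for every filesystem whose usage is above THRESHOLD percent.
fs_over() {
    awk -v limit="$1" '
        NR > 1 {                              # skip the header row
            pct = $5
            sub(/%$/, "", pct)
            # Pseudo-filesystems may report "-"; only compare real numbers.
            if (pct ~ /^[0-9]+$/ && pct + 0 > limit + 0) print $6, pct
        }
    '
}

# Usage:
#   df -P  | fs_over 90    # block usage
#   df -Pi | fs_over 90    # inode usage
```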

When to Suspect Hardware

Disk hardware failures have a signature. Watch for:

Sudden latency spikes in biolatency (previously 2-5ms, now 50-100ms)
Timeouts in dmesg: [ 45.234] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Consistent high await even when only one small process is reading

If you see these, run:

sudo smartctl -a /dev/sda

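Look for the overall health line. On ATA drives it reads SMART overall-health self-assessment test result: PASSED (or FAILED); on SCSI/SAS drives it reads SMART Health Status: OK. Also check the raw values of Reallocated_Sector_Ct and Current_Pending_Sector: non-zero and growing counts are a bad sign even when the overall result is PASSED.

Gotcha: Don't trust SMART alone. Drives can fail without SMART ever flagging them, so if latency is anomalous or dmesg shows timeouts while SMART says PASSED, treat the disk (and its cable and controller) as suspect and plan a replacement.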
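
Because the overall verdict can lag behind reality, it can help to pull out the sector-health attributes directly. This sketch assumes the ATA attribute table printed by smartctl -A, where the attribute name is column 2 and the raw value is column 10; NVMe and SAS drives report health differently and will simply produce no output here. The function name smart_bad_sectors is hypothetical.

```shell
# smart_bad_sectors
# Reads `smartctl -A` output on stdin and prints "ATTRIBUTE RAW_VALUE"
# for sector-health attributes whose raw value is non-zero.
smart_bad_sectors() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
        $10 + 0 > 0 {
            print $2, $10
        }
    '
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_bad_sectors
```

No output means none of the three counters has moved; it does not mean the drive is healthy.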

Monitoring This Long-Term

For ongoing observability, you need metrics, not just manual tools. I'd set up:

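Prometheus node_exporter with node_disk_reads_completed_total and node_disk_writes_completed_total, plus node_disk_read_time_seconds_total and node_disk_write_time_seconds_total if you want average latency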
A custom eBPF exporter that emits disk_io_latency_bucket histograms (from biolatency)
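Alerting on p99 latency above 50ms (computed from those histogram buckets) or utilization above 80% for 5 minutes, where utilization can be derived as rate(node_disk_io_time_seconds_total[5m])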

This gives you the data to correlate disk slowness with application events (deployments, backups, traffic spikes) after the fact.

The Runbook

Paste this into your incident playbook:

#!/bin/bash
# Disk I/O troubleshooting runbook
echo "=== iostat (5 seconds) ==="
iostat -xzy 1 5   # -y skips the first report (averages since boot)
echo ""
echo "=== Top I/O processes ==="
sudo iotop -o -b -n 5 | head -15
echo ""
echo "=== I/O latency distribution ==="
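# Interval 10s, count 1: prints one histogram and exits on its own,
# so there is no race with an external timeout.
sudo /usr/share/bcc/tools/biolatency -m 10 1 2>/dev/null || echo "biolatency not available"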
echo ""
echo "=== Filesystem usage ==="
df -h

Run it, save the output, and share it with your team.

What to Do Tomorrow

Don't wait for a 3am page. This week:

Install iotop and bpfcc-tools on your production servers (or at least a staging box).
Run the runbook above on a quiet server to get a baseline of what "normal" looks like.
Add a Prometheus alert for disk I/O latency (p99 > 50ms for 5 minutes).
Share the runbook with your team in Slack or your wiki.

When the next disk slowness hits, you'll have a repeatable process and the tools to find the culprit in under 10 minutes. That's the difference between a 30-minute incident and a 2-hour one.