iostat is the first tool everyone reaches for when disk performance tanks. It's in every sysadmin's muscle memory. But it's also a liar—or at least, a liar by omission.
You'll see await spike to 200ms and %util hit 99%, and you'll still have no idea whether it's a single noisy application, a kernel bug, or a failing drive. iostat shows you the aggregate pain, not who's causing it. That's where most troubleshooting stops, and where most of my 3am pages have started.
This post is a runbook for going deeper into disk I/O. We'll move past iostat and build a mental model of the Linux I/O stack, then use the right tools to pinpoint the culprit before your boss asks why the database is slow.
The iostat Blind Spot
Let's set up a scenario. You SSH into a production server, run iostat:
$ iostat -xz 1
Device r/s w/s rMB/s wMB/s r_await w_await %util
sda 45.2 120.1 2.1 8.3 15.2 250.3 87.4
That w_await of 250ms is bad. But iostat can't tell you:
- Is PostgreSQL writing huge transactions, or is it a backup job?
- Are writes queuing in the kernel, or is the disk itself slow?
- Is this a single process or a thousand small processes?
- Did this start 30 seconds ago or has it been happening all week?
You're flying blind. iostat is a symptom detector, not a diagnosis tool.
Meet iotop: Process-Level I/O Visibility
iotop is iostat's smarter sibling. It shows you which process is doing the I/O, and it's the first thing I run after iostat in a real incident.
Install it:
sudo apt-get install iotop # Debian/Ubuntu
sudo yum install iotop # RHEL/CentOS
Run it:
sudo iotop -o -b -n 5
Breakdown:
- -o: only show processes doing I/O (cuts the noise)
- -b: batch mode (for logs or scripts)
- -n 5: exit after 5 iterations
You'll see output like:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1234 be/4 postgres 0.00 B/s 45.2 MB/s 0.00% 99.8% postgres: writer
5678 be/4 backup 12.3 MB/s 0.00 B/s 0.00% 0.12% rsync
Now you know: PostgreSQL's writer process is hammering the disk. You can pause the backup, tune PostgreSQL's checkpoint and background writer settings, or escalate to the DBA. You have a name and a number.
Gotcha: iotop samples the kernel's per-task I/O accounting (taskstats), so it only sees tasks that are alive while it is sampling. If a process finished or crashed, it's gone from iotop's view. For post-mortem analysis, you need something else.
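The per-process counters in /proc/<pid>/io can also be sampled by hand when iotop isn't installed. A minimal sketch, assuming a Linux /proc with task I/O accounting enabled; the function names io_field and write_rate are made up for illustration:

```bash
#!/bin/bash
# io_field NAME: read /proc/<pid>/io-style text on stdin and print NAME's value.
io_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# write_rate PID [SECONDS]: bytes per second written to storage by PID over the
# window. Hypothetical helper; reading another user's /proc/<pid>/io needs root.
# Fails (returns 1) if the process exits during the window, which is the same
# blind spot iotop has.
write_rate() {
    local pid=$1 secs=${2:-5} before after
    before=$(io_field write_bytes < "/proc/$pid/io") || return 1
    sleep "$secs"
    after=$(io_field write_bytes < "/proc/$pid/io") || return 1
    echo $(( (after - before) / secs ))
}
```

Run as root, `write_rate 1234 5` prints the average bytes per second that PID 1234 wrote over five seconds.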
Kernel-Level Tracing with eBPF
For the hard cases—intermittent slowness, processes that exit quickly, or kernel-level I/O queuing—you need eBPF (extended Berkeley Packet Filter). The BCC (BPF Compiler Collection) toolkit has several I/O tools that trace syscalls and kernel functions in real time.
Install BCC:
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
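Use biolatency to histogram I/O latency. The Debian/Ubuntu bpfcc-tools package installs the tools with a -bpfcc suffix (for example biolatency-bpfcc); the /usr/share/bcc/tools path used below is where RHEL's bcc-tools package and source builds put them: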
sudo /usr/share/bcc/tools/biolatency -m 10
Output:
msecs : count distribution
0 -> 1 : 1234 |**** |
1 -> 2 : 5678 |***************************** |
2 -> 4 : 890 |**** |
4 -> 8 : 45 | |
8 -> 16 : 12 | |
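This shows the distribution of block I/O latencies. By default biolatency times each request from device issue to completion; add -Q to include the time it spent queued in the kernel. If most I/Os complete in 1-2ms but a tail of them take 8-16ms, suspect queue buildup or bursts rather than a uniformly slow disk. If all I/Os are slow, the disk itself is the bottleneck.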
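To put a number on "a tail of them", you can total the histogram rows. A rough sketch, assuming the row layout shown above (`lo -> hi : count |bars|`); tail_share is a hypothetical helper name:

```bash
#!/bin/bash
# tail_share THRESHOLD: read biolatency-style histogram rows on stdin and print
# "slow_count total_count slow_percent", where a bucket counts as slow when its
# lower bound is >= THRESHOLD (in the histogram's own unit, msecs with -m).
tail_share() {
    awk -v min="$1" '
        $2 == "->" && $4 == ":" {
            total += $5
            if ($1 + 0 >= min + 0) slow += $5
        }
        END {
            if (total > 0) printf "%d %d %.2f\n", slow, total, 100 * slow / total
        }'
}
```

Fed the sample histogram above with a threshold of 4, it prints `57 7859 0.73`: well under 1% of I/Os took 4ms or longer.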
Use biotop to see which process is causing the latency:
sudo /usr/share/bcc/tools/biotop -C 10
Output (simplified; real biotop prints PID, COMM, D, MAJ, MIN, DISK, I/O, Kbytes, and AVGms columns):
COMMAND PID COMM READ_MB WRITE_MB MSEC_READING MSEC_WRITING
mysql 1456 mysqld 0.0 120.3 0 8920
rsync 2341 rsync 45.2 0.0 3400 0
Now you see: MySQL is spending 8.9 seconds writing per 10-second interval. That's your culprit.
Gotcha: eBPF tools require Linux 4.4+ and may need CAP_SYS_ADMIN. They also add overhead (typically 1-5% CPU), so don't leave them running 24/7 in production.
Building Your Disk I/O Runbook
Here's the order I follow in a real incident:
Step 1: Confirm the symptom (30 seconds)
iostat -xz 1 5
Skip the first report, which shows averages since boot. Then look at %util, r_await, and w_await. If all three are high, the disk is congested. If %util is low but await is high, it's a queue problem (likely in the application or filesystem layer).
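If you'd rather not eyeball the columns at 3am, a small filter can flag the devices that cross a threshold. This is a sketch only: flag_busy_devices is a made-up name, and the thresholds are whatever you consider abnormal for your hardware. It locates columns by header name because their order differs between sysstat versions.

```bash
#!/bin/bash
# flag_busy_devices AWAIT_MS UTIL_PCT: read `iostat -x` output on stdin and
# print every device whose w_await or %util exceeds the given thresholds.
flag_busy_devices() {
    awk -v max_await="$1" -v max_util="$2" '
        $1 ~ /^Device/ {
            # Remember where each named column lives in this report.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("w_await" in col) && ("%util" in col) && NF >= col["%util"] {
            w = $(col["w_await"]) + 0
            u = $(col["%util"]) + 0
            if (w > max_await + 0 || u > max_util + 0)
                printf "%s w_await=%.1f util=%.1f\n", $1, w, u
        }'
}
```

Usage: `iostat -xz 1 5 | flag_busy_devices 50 90` prints one line per device per report where write latency is above 50ms or utilization is above 90%.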
Step 2: Find the culprit (1-2 minutes)
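sudo iotop -o -b -k -n 10 | sort -k6,6 -rn | head
The -k flag forces every rate into K/s so the numeric sort is meaningful, and field 6 is the DISK WRITE value (field 4 is DISK READ). The top process is your suspect.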
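A single iotop sample can mislead when writers are bursty, so it can help to total each task's write rate across all iterations. A minimal sketch, assuming the `iotop -b -k` batch layout (TID, PRIO, USER, read value, K/s, write value, K/s, then SWAPIN and IO percentages, with the command starting at field 12). Check that against your iotop version; sum_writes is a hypothetical name.

```bash
#!/bin/bash
# sum_writes: read `iotop -b -k` output on stdin and print, largest first,
# "SUM_OF_WRITE_RATES TID COMMAND" for every task seen.
# With iotop's default 1-second delay the sum approximates KB written.
sum_writes() {
    awk '
        $1 ~ /^[0-9]+$/ && $5 == "K/s" && $7 == "K/s" {
            sum[$1] += $6
            # Rebuild the command, which may contain spaces.
            cmd = ""
            for (i = 12; i <= NF; i++) cmd = cmd (i > 12 ? " " : "") $i
            name[$1] = cmd
        }
        END {
            for (tid in sum) printf "%.1f %s %s\n", sum[tid], tid, name[tid]
        }' | sort -k1,1 -rn
}
```

Usage: `sudo iotop -o -b -k -n 10 | sum_writes | head`.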
Step 3: Understand the pattern (2-5 minutes)
sudo /usr/share/bcc/tools/biolatency -m 10
Run this for 30-60 seconds. If latencies are distributed (long tail), the queue is the problem. If they're all high, the disk is slow.
Step 4: Drill into the process (5-10 minutes)
sudo /usr/share/bcc/tools/biotop -C 30 | grep <PID>
Watch the specific process's I/O pattern. Is it steady, or does it spike? Is it reads or writes?
Step 5: Check the filesystem (immediate)
df -h
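fs_usage=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$fs_usage" -gt 90 ]; then echo "Root filesystem is ${fs_usage}% full"; fi
A full filesystem makes writes fail with ENOSPC, and a nearly full one can slow writes down as the allocator hunts for free space.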
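The check above only covers the root filesystem. A sketch that applies the same test to every mount point, assuming POSIX-format `df -P` output and mount points without spaces; full_filesystems is a made-up helper name:

```bash
#!/bin/bash
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "mountpoint percent" for each filesystem at or above THRESHOLD percent used.
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $6, pct
        }'
}
```

Usage: `df -P | full_filesystems 90`. Because `df -Pi` keeps the same column positions, `df -Pi | full_filesystems 90` runs the same check against inode usage, which can also produce ENOSPC on a disk that looks half empty.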
When to Suspect Hardware
Disk hardware failures have a signature. Watch for:
- Sudden latency spikes in biolatency (previously 2-5ms, now 50-100ms)
- Timeouts in dmesg:
[ 45.234] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
- Consistent high await even when only one small process is reading
If you see these, run:
sudo smartctl -a /dev/sda
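Look for the overall health line. ATA drives report "SMART overall-health self-assessment test result: PASSED" and SCSI/SAS drives report "SMART Health Status: OK"; anything else is bad. If it says PASSED but you still see timeouts, treat the disk as failing anyway, because SMART is not reliable for predicting failure.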
Gotcha: Don't trust SMART alone. If latency is anomalous and SMART says it's fine, the disk is still suspect. Plan a replacement.
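Scrolling through dmesg by eye is slow, so a filter for the usual suspects helps. The pattern list below is illustrative rather than exhaustive, and disk_errors is a hypothetical name:

```bash
#!/bin/bash
# disk_errors: read kernel log text on stdin and print the lines that commonly
# accompany a failing disk, cable, or controller link.
disk_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|hard resetting link)|I/O error, dev |Medium Error|Buffer I/O error'
}
```

Usage: `sudo dmesg -T | disk_errors`. No output means none of these patterns matched, not that the disk is healthy.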
Monitoring This Long-Term
For ongoing observability, you need metrics, not just manual tools. I'd set up:
- Prometheus node_exporter with node_disk_reads_completed_total and node_disk_writes_completed_total
- A custom eBPF exporter that emits disk_io_latency_bucket histograms (from biolatency)
- Alerting on disk_io_latency_p99 > 50ms or disk_util > 80% for 5 minutes
This gives you the data to correlate disk slowness with application events (deployments, backups, traffic spikes) after the fact.
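On Linux, node_exporter's disk counters are read from /proc/diskstats, so you can sanity-check a latency metric by hand from two snapshots of that file. A sketch under that assumption; write_await is a made-up name, and the field positions follow the kernel's diskstats layout (field 8 is writes completed, field 11 is milliseconds spent writing):

```bash
#!/bin/bash
# write_await DEV BEFORE AFTER: average write latency in ms for DEV between two
# saved copies of /proc/diskstats. Prints "n/a" if the device is missing from
# either snapshot or completed no writes in between.
write_await() {
    awk -v dev="$1" '
        $3 == dev {
            if (seen++ == 0) { w0 = $8; ms0 = $11 }   # first snapshot
            else             { w1 = $8; ms1 = $11 }   # second snapshot
        }
        END {
            if (seen == 2 && w1 > w0) printf "%.2f\n", (ms1 - ms0) / (w1 - w0)
            else print "n/a"
        }' "$2" "$3"
}
```

Usage: `cat /proc/diskstats > /tmp/a; sleep 10; cat /proc/diskstats > /tmp/b; write_await sda /tmp/a /tmp/b`. The PromQL analogue would be roughly `rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])`, which returns seconds rather than milliseconds.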
The Runbook
Paste this into your incident playbook:
#!/bin/bash
# Disk I/O troubleshooting runbook
echo "=== iostat (5 seconds) ==="
iostat -xz 1 5
echo ""
echo "=== Top I/O processes ==="
sudo iotop -o -b -n 5 | head -15
echo ""
echo "=== I/O latency distribution ==="
# Interval 10, count 1: print one histogram after 10 seconds, then exit.
sudo /usr/share/bcc/tools/biolatency -m 10 1 2>/dev/null || echo "biolatency not available"
echo ""
echo "=== Filesystem usage ==="
df -h
Run it, save the output, and share it with your team.
What to Do Tomorrow
Don't wait for a 3am page. This week:
- Install iotop and bpfcc-tools on your production servers (or at least a staging box).
- Run the runbook above on a quiet server to get a baseline of what "normal" looks like.
- Add a Prometheus alert for disk I/O latency (p99 > 50ms for 5 minutes).
- Share the runbook with your team in Slack or your wiki.
When the next disk slowness hits, you'll have a repeatable process and the tools to find the culprit in under 10 minutes. That's the difference between a 30-minute incident and a 2-hour one.