Tuesday, July 28, 2009

iostat: (r/s + w/s) * svctm = %util on Linux

iostat -x is very useful for checking disk i/o activity. You will sometimes hear advice such as "check that %util is less than 100%" or "check that svctm is less than 50ms", but please do not fully trust these numbers. For example, the following two cases (DBT-2 load on MySQL) used the same disks (two HDDs, RAID1) and both reached almost 100% util, but the performance numbers were very different (no.2 was about twice as fast as no.1).
# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
21.16 0.00 6.14 29.77 0.00 42.93

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 2.60 389.01 283.12 47.35 4.86 2.19
avgrq-sz avgqu-sz await svctm %util
43.67 4.89 14.76 3.02 99.83

# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
40.03 0.00 16.51 16.52 0.00 26.94

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 6.39 368.53 543.06 490.41 6.71 3.90
avgrq-sz avgqu-sz await svctm %util
21.02 3.29 3.20 0.90 92.66

100% util does not mean that the disks cannot go any faster. For example, command queuing (TCQ/NCQ) or a battery-backed write cache can often boost performance significantly. For random i/o oriented applications (which is most cases), I pay attention to r/s and w/s. r/s is the number of read requests that were issued to the device per second. w/s is the number of write requests that were issued to the device per second (both definitions are copied from the man page). r/s + w/s is the total number of i/o requests per second (IOPS), so it is an easier way to check whether the disks are working as expected. For example, a few thousand IOPS can be expected from a single Intel SSD. For sequential i/o operations, r/s and w/s can be significantly affected by Linux parameters such as max_sectors_kb even though throughput does not change, so in that case I check other iostat columns such as rrqm/s and rMB/s.
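As a quick illustration, here is a small Python sketch that adds up r/s and w/s for the two iostat outputs above. The numbers are copied from those outputs; the dictionary layout and the `iops` helper are only names I made up for this example.

```python
# Values copied from the two "iostat -xm 10" outputs above (device sdb).
# The dict keys and the helper name are illustrative, not part of iostat.
samples = {
    "no.1": {"r_s": 283.12, "w_s": 47.35, "util_pct": 99.83},
    "no.2": {"r_s": 543.06, "w_s": 490.41, "util_pct": 92.66},
}


def iops(r_s, w_s):
    """Total i/o requests per second: r/s + w/s."""
    return r_s + w_s


for name, s in samples.items():
    total = iops(s["r_s"], s["w_s"])
    print(f"{name}: {total:.2f} IOPS at {s['util_pct']:.2f}% util")
```

Both cases show more than 90% util, but no.2 handles roughly three times as many requests per second as no.1, which %util alone does not reveal.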

What about svctm? Actually, Linux's iostat does not measure svctm directly. It simply derives it from r/s, w/s and %util. Here is an excerpt from iostat.c.

...
nr_ios = sdev.rd_ios + sdev.wr_ios;
tput = ((double) nr_ios) * HZ / itv;
util = ((double) sdev.tot_ticks) / itv * HZ;
svctm = tput ? util / tput : 0.0;
...
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
((double) sdev.rd_merges) / itv * HZ,
((double) sdev.wr_merges) / itv * HZ,
((double) sdev.rd_ios) / itv * HZ,
((double) sdev.wr_ios) / itv * HZ,
...

The printf arguments mean the following.
r/s = sdev.rd_ios / itv * HZ
w/s = sdev.wr_ios / itv * HZ

The svctm calculation at the top of the excerpt means the following.
svctm = util / ((sdev.rd_ios + sdev.wr_ios) * HZ / itv)

If %util is 100%, svctm is just 1 / (r/s + w/s) seconds, or 1000 / (r/s + w/s) milliseconds. This is the reciprocal of IOPS. In other words, svctm * (r/s + w/s) is always 1000 if %util is 100%. More generally, svctm (in milliseconds) * (r/s + w/s) equals %util * 10, which is the relation in the title of this post. So checking svctm is practically the same as checking r/s and w/s (as long as %util is close to 100%). The latter (IOPS) is much easier, isn't it?
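To check this relation against the numbers shown earlier, here is a minimal Python sketch that recomputes svctm from r/s, w/s and %util. It is not the iostat code itself, only the same arithmetic applied to the printed columns; the function name `svctm_ms` is something I made up for this example.

```python
def svctm_ms(r_s, w_s, util_pct):
    """Recompute svctm (milliseconds) from the printed iostat columns.

    Same arithmetic as the excerpt: busy time per second divided by
    requests per second. Returns 0.0 when there were no requests.
    """
    iops = r_s + w_s
    if iops == 0:
        return 0.0
    busy_ms_per_sec = util_pct / 100.0 * 1000.0
    return busy_ms_per_sec / iops


# Values copied from the two iostat outputs at the top of this post.
print(round(svctm_ms(283.12, 47.35, 99.83), 2))   # iostat printed svctm 3.02
print(round(svctm_ms(543.06, 490.41, 92.66), 2))  # iostat printed svctm 0.90
```

The recomputed values agree with the svctm column to two decimal places, so svctm carries no information beyond what r/s, w/s and %util already give you.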