Setting the I/O scheduler

You might want to set the I/O scheduler used for a drive to reduce latency for individual requests, increase overall throughput, favor some I/O or processes over others, or improve scaling under extreme loads.

Check the current I/O scheduler

Checking your machine is easy. Simply run the following command:

grep . /sys/block/*/queue/scheduler

Your output might look something like this:

root@ubuntu:~# grep . /sys/block/*/queue/scheduler
/sys/block/dm-0/queue/scheduler:none
/sys/block/md127/queue/scheduler:none
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline

DM (“device-mapper”) devices (like dm-0 or anything in LVM), software RAID devices (like md127), and other virtual block devices backed by real ones usually don’t have an active scheduler; they rely on the schedulers for the underlying devices.
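If you want to see which physical drives sit underneath one of these virtual devices, lsblk can print the dependency chain (dm-0 below is just the device from the example above):

lsblk --inverse /dev/dm-0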

In this example, the primary line to look at is the nvme0n1 drive. It’s using the none scheduler. If that’s a bad choice for its workload, you’d want to change it.

Testing a different I/O scheduler

To test a different scheduler, simply write the name of the desired scheduler to that drive’s /sys/block/<drive>/queue/scheduler file, like this:

echo bfq > /sys/block/nvme0n1/queue/scheduler
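To confirm the change took effect, read the file back; the active scheduler appears in brackets, so the output should look roughly like this (the exact set and order of schedulers depends on your kernel):

root@ubuntu:~# cat /sys/block/nvme0n1/queue/scheduler
none mq-deadline [bfq]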

The names of the standard schedulers are:

none: no scheduling at all; requests go straight to the device (a common default for fast NVMe drives)
mq-deadline: reorders requests and puts a deadline on each one to bound latency
bfq: Budget Fair Queueing, which favors fairness and low latency for interactive workloads
kyber: a low-overhead, self-tuning scheduler aimed at fast multi-queue devices

Older kernels that still use the legacy single-queue block layer instead offer noop, deadline, and cfq.

Which set of schedulers is available depends on your kernel and the distribution’s kernel-build settings. In general, multi-queue schedulers scale better.

Some schedulers may be provided by loadable modules. If those modules haven’t been loaded yet, you won’t see them listed in the output above.

Don’t worry about manually loading modules, though—writing the name of an unloaded scheduler will trigger the kernel to load it automatically.
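If you’re curious which scheduler modules your kernel ships, you can usually list them directly; the path below is typical for Debian- and Ubuntu-style module layouts and may differ on other distributions:

ls /lib/modules/$(uname -r)/kernel/block/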

If you try an invalid name, or the name of a scheduler that isn’t available for your system or drive (bfq on CentOS 7, for example), you’ll see:

echo: write error: Invalid argument

Making the change permanent

Setting the I/O scheduler as above is convenient for testing, but won’t survive a reboot, or a drive detaching and re-attaching (I’ve seen that happen in popular cloud environments).

The most common way to make the change more permanent is to create a systemd unit that writes the setting for the drive on boot. Unfortunately, that still won’t survive a detach/re-attach cycle.
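Still, as a sketch, a one-shot unit along these lines works; the file name io-scheduler.service, the drive, and the scheduler are placeholders to adjust for your system:

[Unit]
Description=Set the I/O scheduler for nvme0n1
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo bfq > /sys/block/nvme0n1/queue/scheduler'

[Install]
WantedBy=multi-user.target

Save it under /etc/systemd/system/ and enable it with systemctl enable io-scheduler.service so it runs on every boot.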

The most durable approach is to write a custom udev rule. Unfortunately, this carries risks of making the machine unbootable, creating udev event loops, or interfering with other important udev rules. Avoiding such pitfalls is a large topic; however, for those willing to push through the technical details, here’s an example rule as a starting point:

ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{kernel}=="nvme1n1", ATTR{queue/scheduler}="bfq"

If you’d prefer an alternative to developing and testing udev rules, try our I/O Manager tool. You tell it which I/O scheduler to use for each of your cloud drives, and it correctly manages kernel settings as machines reboot or drives detach & re-attach or move between machines.