I/O scheduler fundamentals

For scheduling, I/O is generally considered in two categories: synchronous, asynchronous. When a process must stop and wait for the I/O request to complete before continuing, the request is synchronous; otherwise, it’s asynchronous.

Normally, read-requests are synchronous:

int bytes_read = read(file_descriptor, buffer, buffer_len);

The request must complete before read can return.

Write-requests are usually asynchronous:

int bytes_written = write(file_descriptor, buffer, buffer_len);

There’s a slight trick that the OS plays here: after performing some sanity checks, it will copy the contents of buffer (in the process’s data space) into a kernel buffer, allowing write to return immediately.

A write can be made synchronous by either opening the file with the O_SYNC flag set, or by requesting a sync after the write:

int bytes_written = write(file_descriptor, buffer, buffer_len);
fsync(file_descriptor);

Read-requests can also be performed in an asynchronous way, with the kernel sending a signal to the process when the request has completed.

Importantly, asynchronous requests can be delayed with less impact on the requesting process; any delay in a synchronous request will have a direct impact on the requesting process.

Why the I/O scheduler is important

Ordering and merging

Firstly, the scheduler orders requests by sector number. This is especially helpful on spinning media to minimize expensive head-seeks.

Secondly, the scheduler merges requests that hit consecutive sectors into a single request. While this does help a little with spinning media (because fewer operations are dispatched to the controller, even if the same number of data bits must be transferred), it helps a lot on SSD media, especially for writes.

That’s because SSD media works internally with blocks that are much larger than the logical sectors that the OS uses for addressing. If a sector needs to be written to SSD, the controller reads the entire block into an internal buffer, overlays the write data at the appropriate location, then writes the entire block back down. If multiple writes to a single block can be combined, the additional ones happen for free.

However, SSD’s have multiple controllers that execute in parallel.

In order to merge, the scheduler must first sort.

Limiting latency

Once a scheduler is collecting requests to order and sort, it becomes possible that some requests might be queued for an indefinitely long period of time, unless the scheduler is also sensitive to how long a request has been waiting.

For example, suppose processes A and B are both trying to write. Process A issues requests for sectors 1-20. The first 8 are issued to the drive as a single full batch, and sectors 9-20 are queued. Process B then requests sector 1000. The write for sectors 1-8 finishes, and sectors 9-16 are sent to the drive. Process A then requests writes to sectors 21-40; then, later, sectors 41-60, and so on.

If the scheduler performed only ordering and merging, it would be possible for process A to completely starve process B. Hence, most schedulers will account for this by assigning a request to the next batch once it’s been queued for some maximum amount of time, even if other requests would be preferred base only on ordering.

Sharing bandwidth

A scheduler may also try to share bandwidth between processes according to some proportion that it defines as more fair. A process that hasn’t made a request for a while will have its new request advanced in front of the requests of another process that’s made heavy use of the device.

Prioritization

A process has an I/O-priority, which can be changed. A scheduler can honor this priority by giving more bandwidth to higher-priority processes than to lower-priority processes.

Three classes of priority are defined:

Idle: a process in this class will have its I/O requests fulfilled only when there are no outstanding requests from processes in any other priority class. Ideally, this means that I/O in this class has almost no impact on any other class. This class can be starved by all other classes.
Realtime: processes in this class will have their requests fulfilled ahead of any other class. This class can starve all other classes.
Best-effort: processes in this class will share bandwidth with each other, according to their priority number within this class: higher priorities (lower numbers) will be given a higher portion of bandwidth. Processes in this class cannot starve each other, no matter their priority numbers, but may starve processes in the idle class, and be starved by processes in the realtime class.

If you are running multiple workloads on a single disk (e.g. a database as well as system logs; or, a file-server as well as RAID consistency check), I/O priorities allow you to express which workload should be preferred.

High-load modes

All of the schedulers will do fine under light loads (“light” depends on the drive, of course).

The concern comes with a system under /heavy/ load. There, a scheduler can make all the difference between a system remaining usable, or becoming completely unresponsive for some class of I/O requests.

Each scheduler has different strengths and weaknesses under these circumstances. You can read about each scheduler here: