The deadline scheduler (or mq-deadline for the multi-queue version) is the default scheduler on many distributions because it can provide significant gains to many common workloads without too much extra processing overhead. However, it does have some significant drawbacks for particular workloads.

Queues with deadlines

Because most drives can perform better if we group nearby requests together, requests are collected into a queue before being dispatched to the drive. The more requests that can be queued, the more likely it is that some of them will be near each other.

Simple queuing can lead to starvation, though. If the queue is being dispatched from low sector number to high, a request near the end of the drive could be queued indefinitely if requests for earlier sectors keep arriving before it can be dispatched.

To avoid that, every request is assigned a deadline. As long as no deadline has passed, the scheduler will keep dispatching requests in sector order. If a deadline has passed, it will jump to the oldest request, and continue dispatching from there. This ensures that no request can be delayed indefinitely by newer requests.
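This jump-to-oldest rule can be sketched in a few lines of Python. This is a toy model, not the kernel's code; the Request class and next_request helper are purely illustrative:

```python
class Request:
    """A queued I/O request: where it lands on disk, and when its
    soft deadline expires."""
    def __init__(self, sector, deadline):
        self.sector = sector
        self.deadline = deadline

def next_request(queue, last_sector, now):
    """Pick the next request to dispatch: the oldest one if its
    deadline has passed, otherwise the nearest request at or after
    last_sector (wrapping to the start if nothing is ahead)."""
    oldest = min(queue, key=lambda r: r.deadline)
    if oldest.deadline <= now:
        return oldest
    ahead = [r for r in queue if r.sector >= last_sector]
    return min(ahead or queue, key=lambda r: r.sector)
```

With no expired deadlines this degenerates to a plain elevator sweep; the deadline check is what bounds how long a request at an unlucky sector position can wait.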

Batching to reduce overhead

Any amount of processing spent in the I/O path reduces throughput, and checking for expired deadlines requires CPU time.

As an optimization, the deadline scheduler will only check for expired deadlines between batches of 16 requests.

When the last request in a batch is dispatched, the scheduler checks the deadlines. If any have expired, it uses the oldest request as the start of the new batch; otherwise, it uses the request following (in sector order) the last one dispatched. Either way, it will then dispatch the remainder of the batch’s requests strictly in order of ascending sector number.
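The batched loop above can be sketched as follows. Again a toy model: FIFO_BATCH mirrors the kernel's default fifo_batch value of 16, but dispatch_order and its data layout are illustrative:

```python
FIFO_BATCH = 16  # kernel default; tunable via fifo_batch in sysfs

def dispatch_order(queue, clock):
    """Dispatch every request, checking deadlines only at batch
    boundaries. `queue` holds dicts with 'sector' and 'deadline';
    `clock` is an iterator yielding the current time at each
    batch boundary."""
    pending = list(queue)
    out, pos, in_batch = [], -1, 0
    while pending:
        if in_batch == 0:  # batch boundary: the only deadline check
            now = next(clock)
            expired = [r for r in pending if r["deadline"] <= now]
        else:
            expired = []
        if expired:
            nxt = min(expired, key=lambda r: r["deadline"])
        else:  # strictly ascending sector order within a batch
            ahead = [r for r in pending if r["sector"] > pos]
            nxt = min(ahead or pending, key=lambda r: r["sector"])
        pending.remove(nxt)
        out.append(nxt["sector"])
        pos = nxt["sector"]
        in_batch = (in_batch + 1) % FIFO_BATCH
    return out
```

The trade-off is visible in the structure: the expensive expiry scan runs once per batch rather than once per request.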

The deadline scheduler assumes that all reads are synchronous (and therefore blocking other work from happening) and also that all writes are asynchronous (and therefore can be delayed for much longer). Additionally, because most drives take more time to perform a write than a read, batching nearby writes can give a larger performance improvement than batching nearby reads.

Because of those reasons, the scheduler uses a deadline of 0.5 seconds for reads, and 5 seconds for writes.

These aren’t “realtime deadlines” in the sense that the request is guaranteed to finish—or even start—in that time. They’re soft deadlines that really only guarantee the avoidance of starvation.
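Those two defaults correspond, in milliseconds, to mq-deadline's read_expire and write_expire sysfs tunables; assigning a deadline to a newly queued request is then just an addition. A sketch, with illustrative names:

```python
READ_EXPIRE_MS = 500    # reads: 0.5 s (sysfs tunable: read_expire)
WRITE_EXPIRE_MS = 5000  # writes: 5 s (sysfs tunable: write_expire)

def deadline_for(op, now_ms):
    """Soft deadline a new request receives when it is queued."""
    expire = READ_EXPIRE_MS if op == "read" else WRITE_EXPIRE_MS
    return now_ms + expire
```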

Split read and write queues

The deadline scheduler splits all I/O requests into two internal queues: one for reads, and one for writes¹.

Every time a new batch starts, the scheduler picks a queue from which to create that batch. If only one queue has any requests, it will use that queue.

If both queues have requests, it will favor the read queue up to two times over the write queue—again, on the assumption that reads are currently blocking work. After two consecutive read batches, it will then service the write queue for one batch, before switching back to the read queue.

The scheduler selects queues in this way regardless of whether either of them has any requests with expired deadlines.
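The kernel exposes the read-batch limit as the writes_starved tunable (default 2); the per-batch selection can be sketched like this (illustrative code, not the kernel's):

```python
WRITES_STARVED = 2  # kernel default: consecutive read batches allowed

def pick_queue(has_reads, has_writes, starved):
    """Choose the queue for the next batch. `starved` counts read
    batches dispatched since the last write batch; note that expired
    deadlines play no part in the choice."""
    if has_reads and (not has_writes or starved < WRITES_STARVED):
        return "read", starved + 1
    if has_writes:
        return "write", 0
    return None, starved  # both queues empty
```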

Because of this, a workload with enough reads to saturate the drive can partially starve out writes. If writes are truly asynchronous, this might not matter too much. However, if writes are synchronous (like, for example, writes from a database with durability guarantees), it could cause important work to back up for increasingly long amounts of time.

Finally, each time the scheduler switches queues, it starts servicing the new queue over from the start, in sector order, unless that queue has expired requests.

Right for the workload?

The deadline scheduler is a good pick for many generic workloads; on the other hand, under high load it can be abysmal for workloads typically seen on servers.

Of course, the only way to determine the right scheduler for your workload is to try each one out. Using I/O Manager, you can easily change a drive’s current scheduler to perform a comparison.

Read about the other schedulers here:


Endnotes

¹ Basically, anything that doesn’t read sectors is considered a write. For example, a discard is a write.