Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block
Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...
This commit is contained in:
@@ -16,14 +16,16 @@ throughput. So, when needed for achieving a lower latency, BFQ builds
|
||||
schedules that may lead to a lower throughput. If your main or only
|
||||
goal, for a given device, is to achieve the maximum-possible
|
||||
throughput at all times, then do switch off all low-latency heuristics
|
||||
for that device, by setting low_latency to 0. Full details in Section 3.
|
||||
for that device, by setting low_latency to 0. See Section 3 for
|
||||
details on how to configure BFQ for the desired tradeoff between
|
||||
latency and throughput, or on how to maximize throughput.
|
||||
|
||||
On average CPUs, the current version of BFQ can handle devices
|
||||
performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
|
||||
reference, 30-50 KIOPS correspond to very high bandwidths with
|
||||
sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
|
||||
to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
|
||||
multi-queue devices.
|
||||
to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
|
||||
multi-queue devices too.
|
||||
|
||||
The table of contents follow. Impatients can just jump to Section 3.
|
||||
|
||||
@@ -33,7 +35,7 @@ CONTENTS
|
||||
1-1 Personal systems
|
||||
1-2 Server systems
|
||||
2. How does BFQ work?
|
||||
3. What are BFQ's tunable?
|
||||
3. What are BFQ's tunables and how to properly configure BFQ?
|
||||
4. BFQ group scheduling
|
||||
4-1 Service guarantees provided
|
||||
4-2 Interface
|
||||
@@ -145,19 +147,28 @@ plus a lot of code, are borrowed from CFQ.
|
||||
contrast, BFQ may idle the device for a short time interval,
|
||||
giving the process the chance to go on being served if it issues
|
||||
a new request in time. Device idling typically boosts the
|
||||
throughput on rotational devices, if processes do synchronous
|
||||
and sequential I/O. In addition, under BFQ, device idling is
|
||||
also instrumental in guaranteeing the desired throughput
|
||||
fraction to processes issuing sync requests (see the description
|
||||
of the slice_idle tunable in this document, or [1, 2], for more
|
||||
details).
|
||||
throughput on rotational devices and on non-queueing flash-based
|
||||
devices, if processes do synchronous and sequential I/O. In
|
||||
addition, under BFQ, device idling is also instrumental in
|
||||
guaranteeing the desired throughput fraction to processes
|
||||
issuing sync requests (see the description of the slice_idle
|
||||
tunable in this document, or [1, 2], for more details).
|
||||
|
||||
- With respect to idling for service guarantees, if several
|
||||
processes are competing for the device at the same time, but
|
||||
all processes (and groups, after the following commit) have
|
||||
the same weight, then BFQ guarantees the expected throughput
|
||||
distribution without ever idling the device. Throughput is
|
||||
thus as high as possible in this common scenario.
|
||||
all processes and groups have the same weight, then BFQ
|
||||
guarantees the expected throughput distribution without ever
|
||||
idling the device. Throughput is thus as high as possible in
|
||||
this common scenario.
|
||||
|
||||
- On flash-based storage with internal queueing of commands
|
||||
(typically NCQ), device idling happens to be always detrimental
|
||||
for throughput. So, with these devices, BFQ performs idling
|
||||
only when strictly needed for service guarantees, i.e., for
|
||||
guaranteeing low latency or fairness. In these cases, overall
|
||||
throughput may be sub-optimal. No solution currently exists to
|
||||
provide both strong service guarantees and optimal throughput
|
||||
on devices with internal queueing.
|
||||
|
||||
- If low-latency mode is enabled (default configuration), BFQ
|
||||
executes some special heuristics to detect interactive and soft
|
||||
@@ -191,10 +202,7 @@ plus a lot of code, are borrowed from CFQ.
|
||||
- Queues are scheduled according to a variant of WF2Q+, named
|
||||
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
|
||||
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
|
||||
also ready for hierarchical scheduling. However, for a cleaner
|
||||
logical breakdown, the code that enables and completes
|
||||
hierarchical support is provided in the next commit, which focuses
|
||||
exactly on this feature.
|
||||
also ready for hierarchical scheduling, details in Section 4.
|
||||
|
||||
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
|
||||
perfectly fair, and smooth service. In particular, B-WF2Q+
|
||||
@@ -249,13 +257,24 @@ plus a lot of code, are borrowed from CFQ.
|
||||
the Idle class, to prevent it from starving.
|
||||
|
||||
|
||||
3. What are BFQ's tunable?
|
||||
==========================
|
||||
3. What are BFQ's tunables and how to properly configure BFQ?
|
||||
=============================================================
|
||||
|
||||
The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
|
||||
fifo_expire_sync below are the same as in CFQ. Their description is
|
||||
just copied from that for CFQ. Some considerations in the description
|
||||
of slice_idle are copied from CFQ too.
|
||||
Most BFQ tunables affect service guarantees (basically latency and
|
||||
fairness) and throughput. For full details on how to choose the
|
||||
desired tradeoff between service guarantees and throughput, see the
|
||||
parameters slice_idle, strict_guarantees and low_latency. For details
|
||||
on how to maximise throughput, see slice_idle, timeout_sync and
|
||||
max_budget. The other performance-related parameters have been
|
||||
inherited from, and have been preserved mostly for compatibility with
|
||||
CFQ. So far, no performance improvement has been reported after
|
||||
changing the latter parameters in BFQ.
|
||||
|
||||
In particular, the tunables back_seek-max, back_seek_penalty,
|
||||
fifo_expire_async and fifo_expire_sync below are the same as in
|
||||
CFQ. Their description is just copied from that for CFQ. Some
|
||||
considerations in the description of slice_idle are copied from CFQ
|
||||
too.
|
||||
|
||||
per-process ioprio and weight
|
||||
-----------------------------
|
||||
@@ -285,15 +304,17 @@ number of seeks and see improved throughput.
|
||||
|
||||
Setting slice_idle to 0 will remove all the idling on queues and one
|
||||
should see an overall improved throughput on faster storage devices
|
||||
like multiple SATA/SAS disks in hardware RAID configuration.
|
||||
like multiple SATA/SAS disks in hardware RAID configuration, as well
|
||||
as flash-based storage with internal command queueing (and
|
||||
parallelism).
|
||||
|
||||
So depending on storage and workload, it might be useful to set
|
||||
slice_idle=0. In general for SATA/SAS disks and software RAID of
|
||||
SATA/SAS disks keeping slice_idle enabled should be useful. For any
|
||||
configurations where there are multiple spindles behind single LUN
|
||||
(Host based hardware RAID controller or for storage arrays), setting
|
||||
slice_idle=0 might end up in better throughput and acceptable
|
||||
latencies.
|
||||
(Host based hardware RAID controller or for storage arrays), or with
|
||||
flash-based fast storage, setting slice_idle=0 might end up in better
|
||||
throughput and acceptable latencies.
|
||||
|
||||
Idling is however necessary to have service guarantees enforced in
|
||||
case of differentiated weights or differentiated I/O-request lengths.
|
||||
@@ -312,13 +333,14 @@ There is an important flipside for idling: apart from the above cases
|
||||
where it is beneficial also for throughput, idling can severely impact
|
||||
throughput. One important case is random workload. Because of this
|
||||
issue, BFQ tends to avoid idling as much as possible, when it is not
|
||||
beneficial also for throughput. As a consequence of this behavior, and
|
||||
of further issues described for the strict_guarantees tunable,
|
||||
short-term service guarantees may be occasionally violated. And, in
|
||||
some cases, these guarantees may be more important than guaranteeing
|
||||
maximum throughput. For example, in video playing/streaming, a very
|
||||
low drop rate may be more important than maximum throughput. In these
|
||||
cases, consider setting the strict_guarantees parameter.
|
||||
beneficial also for throughput (as detailed in Section 2). As a
|
||||
consequence of this behavior, and of further issues described for the
|
||||
strict_guarantees tunable, short-term service guarantees may be
|
||||
occasionally violated. And, in some cases, these guarantees may be
|
||||
more important than guaranteeing maximum throughput. For example, in
|
||||
video playing/streaming, a very low drop rate may be more important
|
||||
than maximum throughput. In these cases, consider setting the
|
||||
strict_guarantees parameter.
|
||||
|
||||
strict_guarantees
|
||||
-----------------
|
||||
@@ -420,6 +442,13 @@ The default value is 0, which enables auto-tuning: BFQ sets max_budget
|
||||
to the maximum number of sectors that can be served during
|
||||
timeout_sync, according to the estimated peak rate.
|
||||
|
||||
For specific devices, some users have occasionally reported to have
|
||||
reached a higher throughput by setting max_budget explicitly, i.e., by
|
||||
setting max_budget to a higher value than 0. In particular, they have
|
||||
set max_budget to higher values than those to which BFQ would have set
|
||||
it with auto-tuning. An alternative way to achieve this goal is to
|
||||
just increase the value of timeout_sync, leaving max_budget equal to 0.
|
||||
|
||||
weights
|
||||
-------
|
||||
|
||||
@@ -427,51 +456,6 @@ Read-only parameter, used to show the weights of the currently active
|
||||
BFQ queues.
|
||||
|
||||
|
||||
wr_ tunables
|
||||
------------
|
||||
|
||||
BFQ exports a few parameters to control/tune the behavior of
|
||||
low-latency heuristics.
|
||||
|
||||
wr_coeff
|
||||
|
||||
Factor by which the weight of a weight-raised queue is multiplied. If
|
||||
the queue is deemed soft real-time, then the weight is further
|
||||
multiplied by an additional, constant factor.
|
||||
|
||||
wr_max_time
|
||||
|
||||
Maximum duration of a weight-raising period for an interactive task
|
||||
(ms). If set to zero (default value), then this value is computed
|
||||
automatically, as a function of the peak rate of the device. In any
|
||||
case, when the value of this parameter is read, it always reports the
|
||||
current duration, regardless of whether it has been set manually or
|
||||
computed automatically.
|
||||
|
||||
wr_max_softrt_rate
|
||||
|
||||
Maximum service rate below which a queue is deemed to be associated
|
||||
with a soft real-time application, and is then weight-raised
|
||||
accordingly (sectors/sec).
|
||||
|
||||
wr_min_idle_time
|
||||
|
||||
Minimum idle period after which interactive weight-raising may be
|
||||
reactivated for a queue (in ms).
|
||||
|
||||
wr_rt_max_time
|
||||
|
||||
Maximum weight-raising duration for soft real-time queues (in ms). The
|
||||
start time from which this duration is considered is automatically
|
||||
moved forward if the queue is detected to be still soft real-time
|
||||
before the current soft real-time weight-raising period finishes.
|
||||
|
||||
wr_min_inter_arr_async
|
||||
|
||||
Minimum period between I/O request arrivals after which weight-raising
|
||||
may be reactivated for an already busy async queue (in ms).
|
||||
|
||||
|
||||
4. Group scheduling with BFQ
|
||||
============================
|
||||
|
||||
|
Reference in New Issue
Block a user