Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
165
Documentation/block/as-iosched.txt
Normal file
@@ -0,0 +1,165 @@
Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers, doing striping, may find
highly variable performance results when using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire (see tunable parameters below) to zero produces
very similar behavior to the deadline IO scheduler.


Selecting IO schedulers
-----------------------
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'as' (the default) are also available. IO schedulers are assigned
globally at boot time only, presently.


Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
Here are the policies outlined, in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request in back of the elevator is less than half the seek distance to
the request in front of the elevator, then the request in back can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
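
As an illustration, here is a minimal, self-contained C sketch of this
decision rule (the names, thresholds and example sectors are invented for
the example; the in-kernel implementation differs in detail):

#define MAXBACK (1024 * 1024)	/* maximum backward seek, in sectors */

typedef unsigned long long sector_t;

/*
 * The request behind the elevator head is preferred only if it is less
 * than half as far away as the request in front of it, and no more
 * than MAXBACK sectors behind the current position.
 */
static int choose_backward(sector_t head, sector_t back_req, sector_t front_req)
{
	sector_t back_dist  = head - back_req;   /* back_req  <= head */
	sector_t front_dist = front_req - head;  /* front_req >= head */

	return back_dist <= MAXBACK && 2 * back_dist < front_dist;
}

int main(void)
{
	/* Head at sector 10000: a request 100 sectors behind beats one
	 * 1000 sectors ahead, but not one only 150 sectors ahead. */
	return !(choose_backward(10000, 9900, 11000) &&
		 !choose_backward(10000, 9900, 10150));
}
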
2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
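
A small C sketch of the expiry check (a simplified stand-in for the real
FIFO handling; the structure and field names are invented):

#include <stdbool.h>

struct request {
	unsigned long deadline;    /* time at which this request expires */
	struct request *fifo_next; /* next-oldest request on the FIFO */
};

/*
 * Requests are queued in arrival order, so only the head of the FIFO
 * needs checking: if the oldest request has not expired, none have.
 * When it has, the elevator sweep (or an anticipation) is interrupted
 * and the expired request is serviced first.
 */
static bool fifo_expired(const struct request *fifo_head, unsigned long now)
{
	return fifo_head && now >= fifo_head->deadline;
}

int main(void)
{
	struct request r = { .deadline = 150, .fifo_next = 0 };

	return !(fifo_expired(&r, 200) && !fifo_expired(&r, 100));
}
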
3. Read and write request batching

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.

The read and write FIFO expiration times described in policy 2 above
are checked only when scheduling IO of a batch for the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.

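The batch switching rule can be pictured with the following simplified,
self-contained C sketch (a toy model with invented names; the real
bookkeeping is considerably more involved):

#include <stdbool.h>

enum batch_type { BATCH_READ, BATCH_WRITE };

struct as_state {
	enum batch_type batch;       /* batch currently being dispatched */
	unsigned long batch_expires; /* when the current batch's time limit runs out */
	int requests_in_flight;      /* dispatched but not yet completed */
	bool reads_pending;
	bool writes_pending;
};

/*
 * The scheduler may switch to the other batch type when the current
 * batch has either run out of requests or exceeded its time limit,
 * and only once every request from the current batch has completed.
 */
static bool may_switch_batch(const struct as_state *s, unsigned long now)
{
	bool exhausted = (s->batch == BATCH_READ) ? !s->reads_pending
						  : !s->writes_pending;

	return (exhausted || now >= s->batch_expires) &&
	       s->requests_in_flight == 0;
}

int main(void)
{
	/* Read batch, no reads left, nothing in flight: switch to writes. */
	struct as_state s = { BATCH_READ, 100, 0, false, true };

	return !may_switch_batch(&s, 50);
}
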
4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to antic_expire
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.

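The decision described above can be summarized in the following C sketch
(the thresholds, field names and example values are invented purely for
illustration; the real scheduler derives them from the per-process
statistics described next):

#include <stdbool.h>
#include <sys/types.h>

typedef unsigned long long sector_t;

struct proc_stats {
	unsigned long mean_thinktime;  /* ms between this process's reads */
	sector_t mean_seek_distance;   /* sectors between its reads */
};

struct read_req {
	pid_t pid;
	sector_t sector;
};

/* Illustrative thresholds only. */
#define CLOSE_SECTORS   128ULL
#define ANTIC_EXPIRE_MS 6UL
#define NEAR_SECTORS    1024ULL

/*
 * After a read completes: the next sorted read is dispatched at once if
 * it comes from the same process or is "very close" to the completed
 * one.  Otherwise the scheduler anticipates (stalls dispatch for up to
 * antic_expire ms) only if the just-served process looks likely to
 * issue another nearby read soon.
 */
static bool should_anticipate(const struct read_req *done,
			      const struct read_req *next,
			      const struct proc_stats *stats)
{
	sector_t dist = next->sector > done->sector ? next->sector - done->sector
						    : done->sector - next->sector;

	if (next->pid == done->pid || dist <= CLOSE_SECTORS)
		return false;	/* dispatch 'next' immediately */

	return stats->mean_thinktime <= ANTIC_EXPIRE_MS &&
	       stats->mean_seek_distance <= NEAR_SECTORS;
}

int main(void)
{
	struct read_req done = { .pid = 100, .sector = 5000 };
	struct read_req next = { .pid = 200, .sector = 900000 };
	struct proc_stats st = { .mean_thinktime = 2, .mean_seek_distance = 64 };

	return !should_anticipate(&done, &next, &st);
}
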
To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. One observation is that these statistics
are associated with each process, but those statistics are not associated
with a specific IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.

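One simple way to maintain such per-process running means is an
exponentially weighted average, sketched below purely for illustration
(the weighting factor and names are invented; the real scheduler keeps
its own form of these statistics):

struct proc_stats {
	unsigned long long mean_thinktime;     /* running mean, ms */
	unsigned long long mean_seek_distance; /* running mean, sectors */
};

/*
 * Fold a new sample into a running mean, weighting history 7:1 against
 * the new value.  The weighting is arbitrary here; the point is only
 * that old behaviour decays instead of being averaged forever.
 */
static unsigned long long ewma(unsigned long long mean, unsigned long long sample)
{
	return (mean * 7 + sample) / 8;
}

static void update_stats(struct proc_stats *s, unsigned long long thinktime_ms,
			 unsigned long long seek_distance)
{
	s->mean_thinktime     = ewma(s->mean_thinktime, thinktime_ms);
	s->mean_seek_distance = ewma(s->mean_seek_distance, seek_distance);
}

int main(void)
{
	struct proc_stats s = { 8, 64 };

	update_stats(&s, 4, 32);	/* process got faster and more local */
	return s.mean_thinktime <= 8 && s.mean_seek_distance <= 64 ? 0 : 1;
}
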

Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.

The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval at which expired requests are served, so if set to 50, a request
might take anywhere up to 100 ms to be serviced _if_ it is the next on the
expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100 / ((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers (see the sketch just after this parameter list).

* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. It should be a bit
higher for devices with large seek times, though not in linear proportion -
most processes have only a few ms of thinktime.
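
As a rough illustration of the read_expire efficiency estimate above, a
small self-contained C helper (the seek time and read_expire values are
only examples):

#include <stdio.h>

/*
 * Very rough model: each read_expire-long timeslice pays for one seek,
 * so streaming efficiency is about read_expire / (read_expire + seek).
 */
static double streaming_efficiency_pct(double seek_time_ms, double read_expire_ms)
{
	return 100.0 / (seek_time_ms / read_expire_ms + 1.0);
}

int main(void)
{
	/* e.g. an 8 ms average seek with read_expire set to 50, as in the
	 * example above: roughly 86% of streaming throughput. */
	printf("%.0f%%\n", streaming_efficiency_pct(8.0, 50.0));
	return 0;
}
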
1213
Documentation/block/biodoc.txt
Normal file
File diff suppressed because it is too large
78
Documentation/block/deadline-iosched.txt
Normal file
@@ -0,0 +1,78 @@
Deadline IO scheduler tunables
==============================

This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.

Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:

/sys/block/<device>/queue/iosched

assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:

# mount none /sys -t sysfs


********************************************************************************


read_expire	(in ms)
-----------

The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.


write_expire	(in ms)
------------

Similar to read_expire mentioned above, but for writes.


fifo_batch
----------

When a read request expires its deadline, we must move some requests from
the sorted io scheduler list to the block device dispatch queue. fifo_batch
controls how many requests we move, based on the cost of each request. A
request is either qualified as a seek or a stream. The io scheduler knows
the last request that was serviced by the drive (or will be serviced right
before this one). See seek_cost and stream_unit.

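As a toy illustration of this cost-based batching idea (the budget and
per-request costs below are invented; they only stand in for the roles
that fifo_batch, seek_cost and stream_unit play):

#include <stdbool.h>

#define FIFO_BATCH 16	/* budget per expiry, in cost units (invented) */
#define SEEK_COST   4	/* charge for a request that forces a seek (invented) */

struct rq {
	unsigned long long sector;
	unsigned long nr_sectors;
};

/* A request that starts where the previous one ended is a "stream". */
static bool is_stream(const struct rq *prev, const struct rq *next)
{
	return prev && prev->sector + prev->nr_sectors == next->sector;
}

/* How many of the n sorted requests fit into one batch of moves. */
static int batch_count(const struct rq *reqs, int n)
{
	const struct rq *prev = 0;
	int budget = FIFO_BATCH, moved = 0;

	while (moved < n && budget > 0) {
		budget -= is_stream(prev, &reqs[moved]) ? 1 : SEEK_COST;
		prev = &reqs[moved++];
	}
	return moved;
}

int main(void)
{
	struct rq reqs[] = { { 100, 8 }, { 108, 8 }, { 5000, 8 } };

	return batch_count(reqs, 3) == 3 ? 0 : 1;
}
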
writes_starved	(number of dispatches)
--------------

When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.

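A minimal C sketch of the starvation counter described above (the value
and names are illustrative only):

#include <stdbool.h>

#define WRITES_STARVED 2	/* illustrative value only */

/* Returns true if this dispatch round should go to the reads. */
static bool prefer_reads(bool reads_pending, bool writes_pending, int *starved)
{
	if (!reads_pending)
		return false;
	if (!writes_pending)
		return true;
	if (*starved < WRITES_STARVED) {
		(*starved)++;	/* writes lose out once more */
		return true;
	}
	*starved = 0;		/* writes get their turn */
	return false;
}

int main(void)
{
	int starved = 0, reads = 0, i;

	for (i = 0; i < 6; i++)
		reads += prefer_reads(true, true, &starved);

	return reads == 4 ? 0 : 1;	/* two read rounds per write round */
}
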
front_merges	(bool)
------------

Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some workloads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.

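For illustration, a self-contained C sketch of what makes a request a back
or front merge candidate (simplified; the real merge path also checks
sizes, limits and request type):

#include <stdbool.h>

typedef unsigned long long sector_t;

struct rq {
	sector_t sector;          /* first sector of the request */
	unsigned long nr_sectors; /* length in sectors */
};

/* The new request continues immediately after the queued one. */
static bool back_merge_candidate(const struct rq *queued, const struct rq *incoming)
{
	return queued->sector + queued->nr_sectors == incoming->sector;
}

/* The new request ends exactly where the queued one begins. */
static bool front_merge_candidate(const struct rq *queued, const struct rq *incoming)
{
	return incoming->sector + incoming->nr_sectors == queued->sector;
}

int main(void)
{
	struct rq queued = { 1000, 8 }, after = { 1008, 8 }, before = { 992, 8 };

	return back_merge_candidate(&queued, &after) &&
	       front_merge_candidate(&queued, &before) ? 0 : 1;
}
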
Nov 11 2002, Jens Axboe <axboe@suse.de>
88
Documentation/block/request.txt
Normal file
@@ -0,0 +1,88 @@

struct request documentation

Jens Axboe <axboe@suse.de> 27/05/02

1.0
Index

2.0 Struct request members classification

	2.1 struct request members explanation

3.0


2.0
Short explanation of request members

Classification flags:

	D	driver member
	B	block layer member
	I	I/O scheduler member

Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
accessed through certain macros or functions (eg ->flags).

<linux/blkdev.h>

2.1
Member				Flag	Comment
------				----	-------

struct list_head queuelist	BI	Organization on various internal
					queues

void *elevator_private		I	I/O scheduler private data

unsigned char cmd[16]		D	Driver can use this for setting up
					a cdb before execution, see
					blk_queue_prep_rq

unsigned long flags		DBI	Contains info about data direction,
					request type, etc.

int rq_status			D	Request status bits

kdev_t rq_dev			DBI	Target device

int errors			DB	Error counts

sector_t sector			DBI	Target location

sector_t hard_sector		B	Used to keep sector sane

unsigned long nr_sectors	DBI	Total number of sectors in request

unsigned long hard_nr_sectors	B	Used to keep nr_sectors sane

unsigned short nr_phys_segments	DB	Number of physical scatter gather
					segments in a request

unsigned short nr_hw_segments	DB	Number of hardware scatter gather
					segments in a request

unsigned int current_nr_sectors	DB	Number of sectors in first segment
					of request

unsigned int hard_cur_sectors	B	Used to keep current_nr_sectors sane

int tag				DB	TCQ tag, if assigned

void *special			D	Free to be used by driver

char *buffer			D	Map of first segment, also see
					section on bouncing

struct completion *waiting	D	Can be used by driver to get signalled
					on request completion

struct bio *bio			DBI	First bio in request

struct bio *biotail		DBI	Last bio in request

request_queue_t *q		DB	Request queue this request belongs to

struct request_list *rl		B	Request list this request came from