Merge tag 'docs-4.18' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet: "There's been a fair amount of work in the docs tree this time around, including: - Extensive RST conversions and organizational work in the memory-management docs thanks to Mike Rapoport. - An update of Documentation/features from Andrea Parri and a script to keep it updated. - Various LICENSES updates from Thomas, along with a script to check SPDX tags. - Work to fix dangling references to documentation files; this involved a fair number of one-liner comment changes outside of Documentation/ ... and the usual list of documentation improvements, typo fixes, etc" * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits) Documentation: document hung_task_panic kernel parameter docs/admin-guide/mm: add high level concepts overview docs/vm: move ksm and transhuge from "user" to "internals" section. docs: Use the kerneldoc comments for memalloc_no*() doc: document scope NOFS, NOIO APIs docs: update kernel versions and dates in tables docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge docs/vm: transhuge: minor updates docs/vm: transhuge: change sections order Documentation: arm: clean up Marvell Berlin family info Documentation: gpio: driver: Fix a typo and some odd grammar docs: ranoops.rst: fix location of ramoops.txt scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode docs: uio-howto.rst: use a code block to solve a warning mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback w1: w1_io.c: fix a kernel-doc warning Documentation/process/posting: wrap text at 80 cols docs: admin-guide: add cgroup-v2 documentation Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'" Documentation: refcount-vs-atomic: Update reference to LKMM doc. ...
This commit is contained in:
649
Documentation/admin-guide/bcache.rst
Normal file
649
Documentation/admin-guide/bcache.rst
Normal file
@@ -0,0 +1,649 @@
|
||||
============================
|
||||
A block layer cache (bcache)
|
||||
============================
|
||||
|
||||
Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be
|
||||
nice if you could use them as cache... Hence bcache.
|
||||
|
||||
Wiki and git repositories are at:
|
||||
|
||||
- http://bcache.evilpiepirate.org
|
||||
- http://evilpiepirate.org/git/linux-bcache.git
|
||||
- http://evilpiepirate.org/git/bcache-tools.git
|
||||
|
||||
It's designed around the performance characteristics of SSDs - it only allocates
|
||||
in erase block sized buckets, and it uses a hybrid btree/log to track cached
|
||||
extents (which can be anywhere from a single sector to the bucket size). It's
|
||||
designed to avoid random writes at all costs; it fills up an erase block
|
||||
sequentially, then issues a discard before reusing it.
|
||||
|
||||
Both writethrough and writeback caching are supported. Writeback defaults to
|
||||
off, but can be switched on and off arbitrarily at runtime. Bcache goes to
|
||||
great lengths to protect your data - it reliably handles unclean shutdown. (It
|
||||
doesn't even have a notion of a clean shutdown; bcache simply doesn't return
|
||||
writes as completed until they're on stable storage).
|
||||
|
||||
Writeback caching can use most of the cache for buffering writes - writing
|
||||
dirty data to the backing device is always done sequentially, scanning from the
|
||||
start to the end of the index.
|
||||
|
||||
Since random IO is what SSDs excel at, there generally won't be much benefit
|
||||
to caching large sequential IO. Bcache detects sequential IO and skips it;
|
||||
it also keeps a rolling average of the IO sizes per task, and as long as the
|
||||
average is above the cutoff it will skip all IO from that task - instead of
|
||||
caching the first 512k after every seek. Backups and large file copies should
|
||||
thus entirely bypass the cache.
|
||||
|
||||
In the event of a data IO error on the flash it will try to recover by reading
|
||||
from disk or invalidating cache entries. For unrecoverable errors (meta data
|
||||
or dirty data), caching is automatically disabled; if dirty data was present
|
||||
in the cache it first disables writeback caching and waits for all dirty data
|
||||
to be flushed.
|
||||
|
||||
Getting started:
|
||||
You'll need make-bcache from the bcache-tools repository. Both the cache device
|
||||
and backing device must be formatted before use::
|
||||
|
||||
make-bcache -B /dev/sdb
|
||||
make-bcache -C /dev/sdc
|
||||
|
||||
make-bcache has the ability to format multiple devices at the same time - if
|
||||
you format your backing devices and cache device at the same time, you won't
|
||||
have to manually attach::
|
||||
|
||||
make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
|
||||
|
||||
bcache-tools now ships udev rules, and bcache devices are known to the kernel
|
||||
immediately. Without udev, you can manually register devices like this::
|
||||
|
||||
echo /dev/sdb > /sys/fs/bcache/register
|
||||
echo /dev/sdc > /sys/fs/bcache/register
|
||||
|
||||
Registering the backing device makes the bcache device show up in /dev; you can
|
||||
now format it and use it as normal. But the first time using a new bcache
|
||||
device, it'll be running in passthrough mode until you attach it to a cache.
|
||||
If you are thinking about using bcache later, it is recommended to setup all your
|
||||
slow devices as bcache backing devices without a cache, and you can choose to add
|
||||
a caching device later.
|
||||
See 'ATTACHING' section below.
|
||||
|
||||
The devices show up as::
|
||||
|
||||
/dev/bcache<N>
|
||||
|
||||
As well as (with udev)::
|
||||
|
||||
/dev/bcache/by-uuid/<uuid>
|
||||
/dev/bcache/by-label/<label>
|
||||
|
||||
To get started::
|
||||
|
||||
mkfs.ext4 /dev/bcache0
|
||||
mount /dev/bcache0 /mnt
|
||||
|
||||
You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache .
|
||||
You can also control them through /sys/fs//bcache/<cset-uuid>/ .
|
||||
|
||||
Cache devices are managed as sets; multiple caches per set isn't supported yet
|
||||
but will allow for mirroring of metadata and dirty data in the future. Your new
|
||||
cache set shows up as /sys/fs/bcache/<UUID>
|
||||
|
||||
Attaching
|
||||
---------
|
||||
|
||||
After your cache device and backing device are registered, the backing device
|
||||
must be attached to your cache set to enable caching. Attaching a backing
|
||||
device to a cache set is done thusly, with the UUID of the cache set in
|
||||
/sys/fs/bcache::
|
||||
|
||||
echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
|
||||
|
||||
This only has to be done once. The next time you reboot, just reregister all
|
||||
your bcache devices. If a backing device has data in a cache somewhere, the
|
||||
/dev/bcache<N> device won't be created until the cache shows up - particularly
|
||||
important if you have writeback caching turned on.
|
||||
|
||||
If you're booting up and your cache device is gone and never coming back, you
|
||||
can force run the backing device::
|
||||
|
||||
echo 1 > /sys/block/sdb/bcache/running
|
||||
|
||||
(You need to use /sys/block/sdb (or whatever your backing device is called), not
|
||||
/sys/block/bcache0, because bcache0 doesn't exist yet. If you're using a
|
||||
partition, the bcache directory would be at /sys/block/sdb/sdb2/bcache)
|
||||
|
||||
The backing device will still use that cache set if it shows up in the future,
|
||||
but all the cached data will be invalidated. If there was dirty data in the
|
||||
cache, don't expect the filesystem to be recoverable - you will have massive
|
||||
filesystem corruption, though ext4's fsck does work miracles.
|
||||
|
||||
Error Handling
|
||||
--------------
|
||||
|
||||
Bcache tries to transparently handle IO errors to/from the cache device without
|
||||
affecting normal operation; if it sees too many errors (the threshold is
|
||||
configurable, and defaults to 0) it shuts down the cache device and switches all
|
||||
the backing devices to passthrough mode.
|
||||
|
||||
- For reads from the cache, if they error we just retry the read from the
|
||||
backing device.
|
||||
|
||||
- For writethrough writes, if the write to the cache errors we just switch to
|
||||
invalidating the data at that lba in the cache (i.e. the same thing we do for
|
||||
a write that bypasses the cache)
|
||||
|
||||
- For writeback writes, we currently pass that error back up to the
|
||||
filesystem/userspace. This could be improved - we could retry it as a write
|
||||
that skips the cache so we don't have to error the write.
|
||||
|
||||
- When we detach, we first try to flush any dirty data (if we were running in
|
||||
writeback mode). It currently doesn't do anything intelligent if it fails to
|
||||
read some of the dirty data, though.
|
||||
|
||||
|
||||
Howto/cookbook
|
||||
--------------
|
||||
|
||||
A) Starting a bcache with a missing caching device
|
||||
|
||||
If registering the backing device doesn't help, it's already there, you just need
|
||||
to force it to run without the cache::
|
||||
|
||||
host:~# echo /dev/sdb1 > /sys/fs/bcache/register
|
||||
[ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered
|
||||
|
||||
Next, you try to register your caching device if it's present. However
|
||||
if it's absent, or registration fails for some reason, you can still
|
||||
start your bcache without its cache, like so::
|
||||
|
||||
host:/sys/block/sdb/sdb1/bcache# echo 1 > running
|
||||
|
||||
Note that this may cause data loss if you were running in writeback mode.
|
||||
|
||||
|
||||
B) Bcache does not find its cache::
|
||||
|
||||
host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach
|
||||
[ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set
|
||||
[ 1933.478179] bcache: __cached_dev_store() Can't attach 0226553a-37cf-41d5-b3ce-8b1e944543a8
|
||||
[ 1933.478179] : cache set not found
|
||||
|
||||
In this case, the caching device was simply not registered at boot
|
||||
or disappeared and came back, and needs to be (re-)registered::
|
||||
|
||||
host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
|
||||
|
||||
|
||||
C) Corrupt bcache crashes the kernel at device registration time:
|
||||
|
||||
This should never happen. If it does happen, then you have found a bug!
|
||||
Please report it to the bcache development list: linux-bcache@vger.kernel.org
|
||||
|
||||
Be sure to provide as much information that you can including kernel dmesg
|
||||
output if available so that we may assist.
|
||||
|
||||
|
||||
D) Recovering data without bcache:
|
||||
|
||||
If bcache is not available in the kernel, a filesystem on the backing
|
||||
device is still available at an 8KiB offset. So either via a loopdev
|
||||
of the backing device created with --offset 8K, or any value defined by
|
||||
--data-offset when you originally formatted bcache with `make-bcache`.
|
||||
|
||||
For example::
|
||||
|
||||
losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev
|
||||
|
||||
This should present your unmodified backing device data in /dev/loop0
|
||||
|
||||
If your cache is in writethrough mode, then you can safely discard the
|
||||
cache device without loosing data.
|
||||
|
||||
|
||||
E) Wiping a cache device
|
||||
|
||||
::
|
||||
|
||||
host:~# wipefs -a /dev/sdh2
|
||||
16 bytes were erased at offset 0x1018 (bcache)
|
||||
they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
|
||||
|
||||
After you boot back with bcache enabled, you recreate the cache and attach it::
|
||||
|
||||
host:~# make-bcache -C /dev/sdh2
|
||||
UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
|
||||
Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
|
||||
version: 0
|
||||
nbuckets: 106874
|
||||
block_size: 1
|
||||
bucket_size: 1024
|
||||
nr_in_set: 1
|
||||
nr_this_dev: 0
|
||||
first_bucket: 1
|
||||
[ 650.511912] bcache: run_cache_set() invalidating existing data
|
||||
[ 650.549228] bcache: register_cache() registered cache device sdh2
|
||||
|
||||
start backing device with missing cache::
|
||||
|
||||
host:/sys/block/md5/bcache# echo 1 > running
|
||||
|
||||
attach new cache::
|
||||
|
||||
host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
|
||||
[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
|
||||
|
||||
|
||||
F) Remove or replace a caching device::
|
||||
|
||||
host:/sys/block/sda/sda7/bcache# echo 1 > detach
|
||||
[ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7
|
||||
|
||||
host:~# wipefs -a /dev/nvme0n1p4
|
||||
wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy
|
||||
Ooops, it's disabled, but not unregistered, so it's still protected
|
||||
|
||||
We need to go and unregister it::
|
||||
|
||||
host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0
|
||||
lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/
|
||||
host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop
|
||||
kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered
|
||||
|
||||
Now we can wipe it::
|
||||
|
||||
host:~# wipefs -a /dev/nvme0n1p4
|
||||
/dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
|
||||
|
||||
|
||||
G) dm-crypt and bcache
|
||||
|
||||
First setup bcache unencrypted and then install dmcrypt on top of
|
||||
/dev/bcache<N> This will work faster than if you dmcrypt both the backing
|
||||
and caching devices and then install bcache on top. [benchmarks?]
|
||||
|
||||
|
||||
H) Stop/free a registered bcache to wipe and/or recreate it
|
||||
|
||||
Suppose that you need to free up all bcache references so that you can
|
||||
fdisk run and re-register a changed partition table, which won't work
|
||||
if there are any active backing or caching devices left on it:
|
||||
|
||||
1) Is it present in /dev/bcache* ? (there are times where it won't be)
|
||||
|
||||
If so, it's easy::
|
||||
|
||||
host:/sys/block/bcache0/bcache# echo 1 > stop
|
||||
|
||||
2) But if your backing device is gone, this won't work::
|
||||
|
||||
host:/sys/block/bcache0# cd bcache
|
||||
bash: cd: bcache: No such file or directory
|
||||
|
||||
In this case, you may have to unregister the dmcrypt block device that
|
||||
references this bcache to free it up::
|
||||
|
||||
host:~# dmsetup remove oldds1
|
||||
bcache: bcache_device_free() bcache0 stopped
|
||||
bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered
|
||||
|
||||
This causes the backing bcache to be removed from /sys/fs/bcache and
|
||||
then it can be reused. This would be true of any block device stacking
|
||||
where bcache is a lower device.
|
||||
|
||||
3) In other cases, you can also look in /sys/fs/bcache/::
|
||||
|
||||
host:/sys/fs/bcache# ls -l */{cache?,bdev?}
|
||||
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
|
||||
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
|
||||
lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
|
||||
|
||||
The device names will show which UUID is relevant, cd in that directory
|
||||
and stop the cache::
|
||||
|
||||
host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop
|
||||
|
||||
This will free up bcache references and let you reuse the partition for
|
||||
other purposes.
|
||||
|
||||
|
||||
|
||||
Troubleshooting performance
|
||||
---------------------------
|
||||
|
||||
Bcache has a bunch of config options and tunables. The defaults are intended to
|
||||
be reasonable for typical desktop and server workloads, but they're not what you
|
||||
want for getting the best possible numbers when benchmarking.
|
||||
|
||||
- Backing device alignment
|
||||
|
||||
The default metadata size in bcache is 8k. If your backing device is
|
||||
RAID based, then be sure to align this by a multiple of your stride
|
||||
width using `make-bcache --data-offset`. If you intend to expand your
|
||||
disk array in the future, then multiply a series of primes by your
|
||||
raid stripe size to get the disk multiples that you would like.
|
||||
|
||||
For example: If you have a 64k stripe size, then the following offset
|
||||
would provide alignment for many common RAID5 data spindle counts::
|
||||
|
||||
64k * 2*2*2*3*3*5*7 bytes = 161280k
|
||||
|
||||
That space is wasted, but for only 157.5MB you can grow your RAID 5
|
||||
volume to the following data-spindle counts without re-aligning::
|
||||
|
||||
3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
|
||||
|
||||
- Bad write performance
|
||||
|
||||
If write performance is not what you expected, you probably wanted to be
|
||||
running in writeback mode, which isn't the default (not due to a lack of
|
||||
maturity, but simply because in writeback mode you'll lose data if something
|
||||
happens to your SSD)::
|
||||
|
||||
# echo writeback > /sys/block/bcache0/bcache/cache_mode
|
||||
|
||||
- Bad performance, or traffic not going to the SSD that you'd expect
|
||||
|
||||
By default, bcache doesn't cache everything. It tries to skip sequential IO -
|
||||
because you really want to be caching the random IO, and if you copy a 10
|
||||
gigabyte file you probably don't want that pushing 10 gigabytes of randomly
|
||||
accessed data out of your cache.
|
||||
|
||||
But if you want to benchmark reads from cache, and you start out with fio
|
||||
writing an 8 gigabyte test file - so you want to disable that::
|
||||
|
||||
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
|
||||
|
||||
To set it back to the default (4 mb), do::
|
||||
|
||||
# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
|
||||
|
||||
- Traffic's still going to the spindle/still getting cache misses
|
||||
|
||||
In the real world, SSDs don't always keep up with disks - particularly with
|
||||
slower SSDs, many disks being cached by one SSD, or mostly sequential IO. So
|
||||
you want to avoid being bottlenecked by the SSD and having it slow everything
|
||||
down.
|
||||
|
||||
To avoid that bcache tracks latency to the cache device, and gradually
|
||||
throttles traffic if the latency exceeds a threshold (it does this by
|
||||
cranking down the sequential bypass).
|
||||
|
||||
You can disable this if you need to by setting the thresholds to 0::
|
||||
|
||||
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
|
||||
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
|
||||
|
||||
The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.
|
||||
|
||||
- Still getting cache misses, of the same data
|
||||
|
||||
One last issue that sometimes trips people up is actually an old bug, due to
|
||||
the way cache coherency is handled for cache misses. If a btree node is full,
|
||||
a cache miss won't be able to insert a key for the new data and the data
|
||||
won't be written to the cache.
|
||||
|
||||
In practice this isn't an issue because as soon as a write comes along it'll
|
||||
cause the btree node to be split, and you need almost no write traffic for
|
||||
this to not show up enough to be noticeable (especially since bcache's btree
|
||||
nodes are huge and index large regions of the device). But when you're
|
||||
benchmarking, if you're trying to warm the cache by reading a bunch of data
|
||||
and there's no other traffic - that can be a problem.
|
||||
|
||||
Solution: warm the cache by doing writes, or use the testing branch (there's
|
||||
a fix for the issue there).
|
||||
|
||||
|
||||
Sysfs - backing device
|
||||
----------------------
|
||||
|
||||
Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
|
||||
(if attached) /sys/fs/bcache/<cset-uuid>/bdev*
|
||||
|
||||
attach
|
||||
Echo the UUID of a cache set to this file to enable caching.
|
||||
|
||||
cache_mode
|
||||
Can be one of either writethrough, writeback, writearound or none.
|
||||
|
||||
clear_stats
|
||||
Writing to this file resets the running total stats (not the day/hour/5 minute
|
||||
decaying versions).
|
||||
|
||||
detach
|
||||
Write to this file to detach from a cache set. If there is dirty data in the
|
||||
cache, it will be flushed first.
|
||||
|
||||
dirty_data
|
||||
Amount of dirty data for this backing device in the cache. Continuously
|
||||
updated unlike the cache set's version, but may be slightly off.
|
||||
|
||||
label
|
||||
Name of underlying device.
|
||||
|
||||
readahead
|
||||
Size of readahead that should be performed. Defaults to 0. If set to e.g.
|
||||
1M, it will round cache miss reads up to that size, but without overlapping
|
||||
existing cache entries.
|
||||
|
||||
running
|
||||
1 if bcache is running (i.e. whether the /dev/bcache device exists, whether
|
||||
it's in passthrough mode or caching).
|
||||
|
||||
sequential_cutoff
|
||||
A sequential IO will bypass the cache once it passes this threshold; the
|
||||
most recent 128 IOs are tracked so sequential IO can be detected even when
|
||||
it isn't all done at once.
|
||||
|
||||
sequential_merge
|
||||
If non zero, bcache keeps a list of the last 128 requests submitted to compare
|
||||
against all new requests to determine which new requests are sequential
|
||||
continuations of previous requests for the purpose of determining sequential
|
||||
cutoff. This is necessary if the sequential cutoff value is greater than the
|
||||
maximum acceptable sequential size for any single request.
|
||||
|
||||
state
|
||||
The backing device can be in one of four different states:
|
||||
|
||||
no cache: Has never been attached to a cache set.
|
||||
|
||||
clean: Part of a cache set, and there is no cached dirty data.
|
||||
|
||||
dirty: Part of a cache set, and there is cached dirty data.
|
||||
|
||||
inconsistent: The backing device was forcibly run by the user when there was
|
||||
dirty data cached but the cache set was unavailable; whatever data was on the
|
||||
backing device has likely been corrupted.
|
||||
|
||||
stop
|
||||
Write to this file to shut down the bcache device and close the backing
|
||||
device.
|
||||
|
||||
writeback_delay
|
||||
When dirty data is written to the cache and it previously did not contain
|
||||
any, waits some number of seconds before initiating writeback. Defaults to
|
||||
30.
|
||||
|
||||
writeback_percent
|
||||
If nonzero, bcache tries to keep around this percentage of the cache dirty by
|
||||
throttling background writeback and using a PD controller to smoothly adjust
|
||||
the rate.
|
||||
|
||||
writeback_rate
|
||||
Rate in sectors per second - if writeback_percent is nonzero, background
|
||||
writeback is throttled to this rate. Continuously adjusted by bcache but may
|
||||
also be set by the user.
|
||||
|
||||
writeback_running
|
||||
If off, writeback of dirty data will not take place at all. Dirty data will
|
||||
still be added to the cache until it is mostly full; only meant for
|
||||
benchmarking. Defaults to on.
|
||||
|
||||
Sysfs - backing device stats
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There are directories with these numbers for a running total, as well as
|
||||
versions that decay over the past day, hour and 5 minutes; they're also
|
||||
aggregated in the cache set directory as well.
|
||||
|
||||
bypassed
|
||||
Amount of IO (both reads and writes) that has bypassed the cache
|
||||
|
||||
cache_hits, cache_misses, cache_hit_ratio
|
||||
Hits and misses are counted per individual IO as bcache sees them; a
|
||||
partial hit is counted as a miss.
|
||||
|
||||
cache_bypass_hits, cache_bypass_misses
|
||||
Hits and misses for IO that is intended to skip the cache are still counted,
|
||||
but broken out here.
|
||||
|
||||
cache_miss_collisions
|
||||
Counts instances where data was going to be inserted into the cache from a
|
||||
cache miss, but raced with a write and data was already present (usually 0
|
||||
since the synchronization for cache misses was rewritten)
|
||||
|
||||
cache_readaheads
|
||||
Count of times readahead occurred.
|
||||
|
||||
Sysfs - cache set
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
Available at /sys/fs/bcache/<cset-uuid>
|
||||
|
||||
average_key_size
|
||||
Average data per key in the btree.
|
||||
|
||||
bdev<0..n>
|
||||
Symlink to each of the attached backing devices.
|
||||
|
||||
block_size
|
||||
Block size of the cache devices.
|
||||
|
||||
btree_cache_size
|
||||
Amount of memory currently used by the btree cache
|
||||
|
||||
bucket_size
|
||||
Size of buckets
|
||||
|
||||
cache<0..n>
|
||||
Symlink to each of the cache devices comprising this cache set.
|
||||
|
||||
cache_available_percent
|
||||
Percentage of cache device which doesn't contain dirty data, and could
|
||||
potentially be used for writeback. This doesn't mean this space isn't used
|
||||
for clean cached data; the unused statistic (in priority_stats) is typically
|
||||
much lower.
|
||||
|
||||
clear_stats
|
||||
Clears the statistics associated with this cache
|
||||
|
||||
dirty_data
|
||||
Amount of dirty data is in the cache (updated when garbage collection runs).
|
||||
|
||||
flash_vol_create
|
||||
Echoing a size to this file (in human readable units, k/M/G) creates a thinly
|
||||
provisioned volume backed by the cache set.
|
||||
|
||||
io_error_halflife, io_error_limit
|
||||
These determines how many errors we accept before disabling the cache.
|
||||
Each error is decayed by the half life (in # ios). If the decaying count
|
||||
reaches io_error_limit dirty data is written out and the cache is disabled.
|
||||
|
||||
journal_delay_ms
|
||||
Journal writes will delay for up to this many milliseconds, unless a cache
|
||||
flush happens sooner. Defaults to 100.
|
||||
|
||||
root_usage_percent
|
||||
Percentage of the root btree node in use. If this gets too high the node
|
||||
will split, increasing the tree depth.
|
||||
|
||||
stop
|
||||
Write to this file to shut down the cache set - waits until all attached
|
||||
backing devices have been shut down.
|
||||
|
||||
tree_depth
|
||||
Depth of the btree (A single node btree has depth 0).
|
||||
|
||||
unregister
|
||||
Detaches all backing devices and closes the cache devices; if dirty data is
|
||||
present it will disable writeback caching and wait for it to be flushed.
|
||||
|
||||
Sysfs - cache set internal
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This directory also exposes timings for a number of internal operations, with
|
||||
separate files for average duration, average frequency, last occurrence and max
|
||||
duration: garbage collection, btree read, btree node sorts and btree splits.
|
||||
|
||||
active_journal_entries
|
||||
Number of journal entries that are newer than the index.
|
||||
|
||||
btree_nodes
|
||||
Total nodes in the btree.
|
||||
|
||||
btree_used_percent
|
||||
Average fraction of btree in use.
|
||||
|
||||
bset_tree_stats
|
||||
Statistics about the auxiliary search trees
|
||||
|
||||
btree_cache_max_chain
|
||||
Longest chain in the btree node cache's hash table
|
||||
|
||||
cache_read_races
|
||||
Counts instances where while data was being read from the cache, the bucket
|
||||
was reused and invalidated - i.e. where the pointer was stale after the read
|
||||
completed. When this occurs the data is reread from the backing device.
|
||||
|
||||
trigger_gc
|
||||
Writing to this file forces garbage collection to run.
|
||||
|
||||
Sysfs - Cache device
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Available at /sys/block/<cdev>/bcache
|
||||
|
||||
block_size
|
||||
Minimum granularity of writes - should match hardware sector size.
|
||||
|
||||
btree_written
|
||||
Sum of all btree writes, in (kilo/mega/giga) bytes
|
||||
|
||||
bucket_size
|
||||
Size of buckets
|
||||
|
||||
cache_replacement_policy
|
||||
One of either lru, fifo or random.
|
||||
|
||||
discard
|
||||
Boolean; if on a discard/TRIM will be issued to each bucket before it is
|
||||
reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
|
||||
slow).
|
||||
|
||||
freelist_percent
|
||||
Size of the freelist as a percentage of nbuckets. Can be written to to
|
||||
increase the number of buckets kept on the freelist, which lets you
|
||||
artificially reduce the size of the cache at runtime. Mostly for testing
|
||||
purposes (i.e. testing how different size caches affect your hit rate), but
|
||||
since buckets are discarded when they move on to the freelist will also make
|
||||
the SSD's garbage collection easier by effectively giving it more reserved
|
||||
space.
|
||||
|
||||
io_errors
|
||||
Number of errors that have occurred, decayed by io_error_halflife.
|
||||
|
||||
metadata_written
|
||||
Sum of all non data writes (btree writes and all other metadata).
|
||||
|
||||
nbuckets
|
||||
Total buckets in this cache
|
||||
|
||||
priority_stats
|
||||
Statistics about how recently data in the cache has been accessed.
|
||||
This can reveal your working set size. Unused is the percentage of
|
||||
the cache that doesn't contain any data. Metadata is bcache's
|
||||
metadata overhead. Average is the average priority of cache buckets.
|
||||
Next is a list of quantiles with the priority threshold of each.
|
||||
|
||||
written
|
||||
Sum of all data that has been written to the cache; comparison with
|
||||
btree_written gives the amount of write inflation in bcache.
|
1998
Documentation/admin-guide/cgroup-v2.rst
Normal file
1998
Documentation/admin-guide/cgroup-v2.rst
Normal file
File diff suppressed because it is too large
Load Diff
@@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
|
||||
:maxdepth: 1
|
||||
|
||||
initrd
|
||||
cgroup-v2
|
||||
serial-console
|
||||
braille-console
|
||||
parport
|
||||
@@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
|
||||
mono
|
||||
java
|
||||
ras
|
||||
bcache
|
||||
pm/index
|
||||
thunderbolt
|
||||
LSM/index
|
||||
mm/index
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
|
@@ -106,11 +106,11 @@
|
||||
use by PCI
|
||||
Format: <irq>,<irq>...
|
||||
|
||||
acpi_mask_gpe= [HW,ACPI]
|
||||
acpi_mask_gpe= [HW,ACPI]
|
||||
Due to the existence of _Lxx/_Exx, some GPEs triggered
|
||||
by unsupported hardware/firmware features can result in
|
||||
GPE floodings that cannot be automatically disabled by
|
||||
the GPE dispatcher.
|
||||
GPE floodings that cannot be automatically disabled by
|
||||
the GPE dispatcher.
|
||||
This facility can be used to prevent such uncontrolled
|
||||
GPE floodings.
|
||||
Format: <int>
|
||||
@@ -472,10 +472,10 @@
|
||||
for platform specific values (SB1, Loongson3 and
|
||||
others).
|
||||
|
||||
ccw_timeout_log [S390]
|
||||
ccw_timeout_log [S390]
|
||||
See Documentation/s390/CommonIO for details.
|
||||
|
||||
cgroup_disable= [KNL] Disable a particular controller
|
||||
cgroup_disable= [KNL] Disable a particular controller
|
||||
Format: {name of the controller(s) to disable}
|
||||
The effects of cgroup_disable=foo are:
|
||||
- foo isn't auto-mounted if you mount all cgroups in
|
||||
@@ -518,7 +518,7 @@
|
||||
those clocks in any way. This parameter is useful for
|
||||
debug and development, but should not be needed on a
|
||||
platform with proper driver support. For more
|
||||
information, see Documentation/clk.txt.
|
||||
information, see Documentation/driver-api/clk.rst.
|
||||
|
||||
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
|
||||
[Deprecated]
|
||||
@@ -641,8 +641,8 @@
|
||||
hvc<n> Use the hypervisor console device <n>. This is for
|
||||
both Xen and PowerPC hypervisors.
|
||||
|
||||
If the device connected to the port is not a TTY but a braille
|
||||
device, prepend "brl," before the device type, for instance
|
||||
If the device connected to the port is not a TTY but a braille
|
||||
device, prepend "brl," before the device type, for instance
|
||||
console=brl,ttyS0
|
||||
For now, only VisioBraille is supported.
|
||||
|
||||
@@ -662,7 +662,7 @@
|
||||
|
||||
consoleblank= [KNL] The console blank (screen saver) timeout in
|
||||
seconds. A value of 0 disables the blank timer.
|
||||
Defaults to 0.
|
||||
Defaults to 0.
|
||||
|
||||
coredump_filter=
|
||||
[KNL] Change the default value for
|
||||
@@ -730,7 +730,7 @@
|
||||
or memory reserved is below 4G.
|
||||
|
||||
cryptomgr.notests
|
||||
[KNL] Disable crypto self-tests
|
||||
[KNL] Disable crypto self-tests
|
||||
|
||||
cs89x0_dma= [HW,NET]
|
||||
Format: <dma>
|
||||
@@ -746,7 +746,7 @@
|
||||
Format: <port#>,<type>
|
||||
See also Documentation/input/devices/joystick-parport.rst
|
||||
|
||||
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
|
||||
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
|
||||
time. See
|
||||
Documentation/admin-guide/dynamic-debug-howto.rst for
|
||||
details. Deprecated, see dyndbg.
|
||||
@@ -833,7 +833,7 @@
|
||||
causing system reset or hang due to sending
|
||||
INIT from AP to BSP.
|
||||
|
||||
disable_ddw [PPC/PSERIES]
|
||||
disable_ddw [PPC/PSERIES]
|
||||
Disable Dynamic DMA Window support. Use this if
|
||||
to workaround buggy firmware.
|
||||
|
||||
@@ -1188,7 +1188,7 @@
|
||||
parameter will force ia64_sal_cache_flush to call
|
||||
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
|
||||
|
||||
forcepae [X86-32]
|
||||
forcepae [X86-32]
|
||||
Forcefully enable Physical Address Extension (PAE).
|
||||
Many Pentium M systems disable PAE but may have a
|
||||
functionally usable PAE implementation.
|
||||
@@ -1247,7 +1247,7 @@
|
||||
|
||||
gamma= [HW,DRM]
|
||||
|
||||
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
|
||||
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
|
||||
Format: off | on
|
||||
default: on
|
||||
|
||||
@@ -1341,23 +1341,32 @@
|
||||
x86-64 are 2M (when the CPU supports "pse") and 1G
|
||||
(when the CPU supports the "pdpe1gb" cpuinfo flag).
|
||||
|
||||
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
|
||||
terminal devices. Valid values: 0..8
|
||||
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
|
||||
If specified, z/VM IUCV HVC accepts connections
|
||||
from listed z/VM user IDs only.
|
||||
hung_task_panic=
|
||||
[KNL] Should the hung task detector generate panics.
|
||||
Format: <integer>
|
||||
|
||||
A nonzero value instructs the kernel to panic when a
|
||||
hung task is detected. The default value is controlled
|
||||
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
|
||||
option. The value selected by this boot parameter can
|
||||
be changed later by the kernel.hung_task_panic sysctl.
|
||||
|
||||
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
|
||||
terminal devices. Valid values: 0..8
|
||||
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
|
||||
If specified, z/VM IUCV HVC accepts connections
|
||||
from listed z/VM user IDs only.
|
||||
keep_bootcon [KNL]
|
||||
Do not unregister boot console at start. This is only
|
||||
useful for debugging when something happens in the window
|
||||
between unregistering the boot console and initializing
|
||||
the real console.
|
||||
|
||||
i2c_bus= [HW] Override the default board specific I2C bus speed
|
||||
or register an additional I2C bus that is not
|
||||
registered from board initialization code.
|
||||
Format:
|
||||
<bus_id>,<clkrate>
|
||||
i2c_bus= [HW] Override the default board specific I2C bus speed
|
||||
or register an additional I2C bus that is not
|
||||
registered from board initialization code.
|
||||
Format:
|
||||
<bus_id>,<clkrate>
|
||||
|
||||
i8042.debug [HW] Toggle i8042 debug mode
|
||||
i8042.unmask_kbd_data
|
||||
@@ -1386,7 +1395,7 @@
|
||||
Default: only on s2r transitions on x86; most other
|
||||
architectures force reset to be always executed
|
||||
i8042.unlock [HW] Unlock (ignore) the keylock
|
||||
i8042.kbdreset [HW] Reset device connected to KBD port
|
||||
i8042.kbdreset [HW] Reset device connected to KBD port
|
||||
|
||||
i810= [HW,DRM]
|
||||
|
||||
@@ -1548,13 +1557,13 @@
|
||||
programs exec'd, files mmap'd for exec, and all files
|
||||
opened for read by uid=0.
|
||||
|
||||
ima_template= [IMA]
|
||||
ima_template= [IMA]
|
||||
Select one of defined IMA measurements template formats.
|
||||
Formats: { "ima" | "ima-ng" | "ima-sig" }
|
||||
Default: "ima-ng"
|
||||
|
||||
ima_template_fmt=
|
||||
[IMA] Define a custom template format.
|
||||
[IMA] Define a custom template format.
|
||||
Format: { "field1|...|fieldN" }
|
||||
|
||||
ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
|
||||
@@ -1597,7 +1606,7 @@
|
||||
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
|
||||
Format: <irq>
|
||||
|
||||
int_pln_enable [x86] Enable power limit notification interrupt
|
||||
int_pln_enable [x86] Enable power limit notification interrupt
|
||||
|
||||
integrity_audit=[IMA]
|
||||
Format: { "0" | "1" }
|
||||
@@ -1650,39 +1659,39 @@
|
||||
0 disables intel_idle and fall back on acpi_idle.
|
||||
1 to 9 specify maximum depth of C-state.
|
||||
|
||||
intel_pstate= [X86]
|
||||
disable
|
||||
Do not enable intel_pstate as the default
|
||||
scaling driver for the supported processors
|
||||
passive
|
||||
Use intel_pstate as a scaling driver, but configure it
|
||||
to work with generic cpufreq governors (instead of
|
||||
enabling its internal governor). This mode cannot be
|
||||
used along with the hardware-managed P-states (HWP)
|
||||
feature.
|
||||
force
|
||||
Enable intel_pstate on systems that prohibit it by default
|
||||
in favor of acpi-cpufreq. Forcing the intel_pstate driver
|
||||
instead of acpi-cpufreq may disable platform features, such
|
||||
as thermal controls and power capping, that rely on ACPI
|
||||
P-States information being indicated to OSPM and therefore
|
||||
should be used with caution. This option does not work with
|
||||
processors that aren't supported by the intel_pstate driver
|
||||
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
|
||||
no_hwp
|
||||
Do not enable hardware P state control (HWP)
|
||||
if available.
|
||||
hwp_only
|
||||
Only load intel_pstate on systems which support
|
||||
hardware P state control (HWP) if available.
|
||||
support_acpi_ppc
|
||||
Enforce ACPI _PPC performance limits. If the Fixed ACPI
|
||||
Description Table, specifies preferred power management
|
||||
profile as "Enterprise Server" or "Performance Server",
|
||||
then this feature is turned on by default.
|
||||
per_cpu_perf_limits
|
||||
Allow per-logical-CPU P-State performance control limits using
|
||||
cpufreq sysfs interface
|
||||
intel_pstate= [X86]
|
||||
disable
|
||||
Do not enable intel_pstate as the default
|
||||
scaling driver for the supported processors
|
||||
passive
|
||||
Use intel_pstate as a scaling driver, but configure it
|
||||
to work with generic cpufreq governors (instead of
|
||||
enabling its internal governor). This mode cannot be
|
||||
used along with the hardware-managed P-states (HWP)
|
||||
feature.
|
||||
force
|
||||
Enable intel_pstate on systems that prohibit it by default
|
||||
in favor of acpi-cpufreq. Forcing the intel_pstate driver
|
||||
instead of acpi-cpufreq may disable platform features, such
|
||||
as thermal controls and power capping, that rely on ACPI
|
||||
P-States information being indicated to OSPM and therefore
|
||||
should be used with caution. This option does not work with
|
||||
processors that aren't supported by the intel_pstate driver
|
||||
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
|
||||
no_hwp
|
||||
Do not enable hardware P state control (HWP)
|
||||
if available.
|
||||
hwp_only
|
||||
Only load intel_pstate on systems which support
|
||||
hardware P state control (HWP) if available.
|
||||
support_acpi_ppc
|
||||
Enforce ACPI _PPC performance limits. If the Fixed ACPI
|
||||
Description Table, specifies preferred power management
|
||||
profile as "Enterprise Server" or "Performance Server",
|
||||
then this feature is turned on by default.
|
||||
per_cpu_perf_limits
|
||||
Allow per-logical-CPU P-State performance control limits using
|
||||
cpufreq sysfs interface
|
||||
|
||||
intremap= [X86-64, Intel-IOMMU]
|
||||
on enable Interrupt Remapping (default)
|
||||
@@ -2026,7 +2035,7 @@
|
||||
* [no]ncqtrim: Turn off queued DSM TRIM.
|
||||
|
||||
* nohrst, nosrst, norst: suppress hard, soft
|
||||
and both resets.
|
||||
and both resets.
|
||||
|
||||
* rstonce: only attempt one reset during
|
||||
hot-unplug link recovery
|
||||
@@ -2214,7 +2223,7 @@
|
||||
[KNL,SH] Allow user to override the default size for
|
||||
per-device physically contiguous DMA buffers.
|
||||
|
||||
memhp_default_state=online/offline
|
||||
memhp_default_state=online/offline
|
||||
[KNL] Set the initial state for the memory hotplug
|
||||
onlining policy. If not specified, the default value is
|
||||
set according to the
|
||||
@@ -2764,7 +2773,7 @@
|
||||
[X86,PV_OPS] Disable paravirtualized VMware scheduler
|
||||
clock and use the default one.
|
||||
|
||||
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
|
||||
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
|
||||
steal time is computed, but won't influence scheduler
|
||||
behaviour
|
||||
|
||||
@@ -2825,7 +2834,7 @@
|
||||
notsc [BUGS=X86-32] Disable Time Stamp Counter
|
||||
|
||||
nowatchdog [KNL] Disable both lockup detectors, i.e.
|
||||
soft-lockup and NMI watchdog (hard-lockup).
|
||||
soft-lockup and NMI watchdog (hard-lockup).
|
||||
|
||||
nowb [ARM]
|
||||
|
||||
@@ -2845,7 +2854,7 @@
|
||||
If the dependencies are under your control, you can
|
||||
turn on cpu0_hotplug.
|
||||
|
||||
nps_mtm_hs_ctr= [KNL,ARC]
|
||||
nps_mtm_hs_ctr= [KNL,ARC]
|
||||
This parameter sets the maximum duration, in
|
||||
cycles, each HW thread of the CTOP can run
|
||||
without interruptions, before HW switches it.
|
||||
@@ -2986,7 +2995,7 @@
|
||||
|
||||
pci=option[,option...] [PCI] various PCI subsystem options:
|
||||
earlydump [X86] dump PCI config space before the kernel
|
||||
changes anything
|
||||
changes anything
|
||||
off [X86] don't probe for the PCI bus
|
||||
bios [X86-32] force use of PCI BIOS, don't access
|
||||
the hardware directly. Use this if your machine
|
||||
@@ -3074,7 +3083,7 @@
|
||||
is enabled by default. If you need to use this,
|
||||
please report a bug.
|
||||
nocrs [X86] Ignore PCI host bridge windows from ACPI.
|
||||
If you need to use this, please report a bug.
|
||||
If you need to use this, please report a bug.
|
||||
routeirq Do IRQ routing for all PCI devices.
|
||||
This is normally done in pci_enable_device(),
|
||||
so this option is a temporary workaround
|
||||
@@ -3917,7 +3926,7 @@
|
||||
cache (risks via metadata attacks are mostly
|
||||
unchanged). Debug options disable merging on their
|
||||
own.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slab_max_order= [MM, SLAB]
|
||||
Determines the maximum allowed order for slabs.
|
||||
@@ -3931,7 +3940,7 @@
|
||||
slub_debug can create guard zones around objects and
|
||||
may poison objects when not in use. Also tracks the
|
||||
last alloc / free. For more information see
|
||||
Documentation/vm/slub.txt.
|
||||
Documentation/vm/slub.rst.
|
||||
|
||||
slub_memcg_sysfs= [MM, SLUB]
|
||||
Determines whether to enable sysfs directories for
|
||||
@@ -3945,7 +3954,7 @@
|
||||
Determines the maximum allowed order for slabs.
|
||||
A high setting may cause OOMs due to memory
|
||||
fragmentation. For more information see
|
||||
Documentation/vm/slub.txt.
|
||||
Documentation/vm/slub.rst.
|
||||
|
||||
slub_min_objects= [MM, SLUB]
|
||||
The minimum number of objects per slab. SLUB will
|
||||
@@ -3954,12 +3963,12 @@
|
||||
the number of objects indicated. The higher the number
|
||||
of objects the smaller the overhead of tracking slabs
|
||||
and the less frequently locks need to be acquired.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slub_min_order= [MM, SLUB]
|
||||
Determines the minimum page order for slabs. Must be
|
||||
lower than slub_max_order.
|
||||
For more information see Documentation/vm/slub.txt.
|
||||
For more information see Documentation/vm/slub.rst.
|
||||
|
||||
slub_nomerge [MM, SLUB]
|
||||
Same with slab_nomerge. This is supported for legacy.
|
||||
@@ -4357,7 +4366,8 @@
|
||||
Format: [always|madvise|never]
|
||||
Can be used to control the default behavior of the system
|
||||
with respect to transparent hugepages.
|
||||
See Documentation/vm/transhuge.txt for more details.
|
||||
See Documentation/admin-guide/mm/transhuge.rst
|
||||
for more details.
|
||||
|
||||
tsc= Disable clocksource stability checks for TSC.
|
||||
Format: <string>
|
||||
@@ -4435,7 +4445,7 @@
|
||||
|
||||
usbcore.initial_descriptor_timeout=
|
||||
[USB] Specifies timeout for the initial 64-byte
|
||||
USB_REQ_GET_DESCRIPTOR request in milliseconds
|
||||
USB_REQ_GET_DESCRIPTOR request in milliseconds
|
||||
(default 5000 = 5.0 seconds).
|
||||
|
||||
usbcore.nousb [USB] Disable the USB subsystem
|
||||
|
222
Documentation/admin-guide/mm/concepts.rst
Normal file
222
Documentation/admin-guide/mm/concepts.rst
Normal file
@@ -0,0 +1,222 @@
|
||||
.. _mm_concepts:
|
||||
|
||||
=================
|
||||
Concepts overview
|
||||
=================
|
||||
|
||||
The memory management in Linux is complex system that evolved over the
|
||||
years and included more and more functionality to support variety of
|
||||
systems from MMU-less microcontrollers to supercomputers. The memory
|
||||
management for systems without MMU is called ``nommu`` and it
|
||||
definitely deserves a dedicated document, which hopefully will be
|
||||
eventually written. Yet, although some of the concepts are the same,
|
||||
here we assume that MMU is available and CPU can translate a virtual
|
||||
address to a physical address.
|
||||
|
||||
.. contents:: :local:
|
||||
|
||||
Virtual Memory Primer
|
||||
=====================
|
||||
|
||||
The physical memory in a computer system is a limited resource and
|
||||
even for systems that support memory hotplug there is a hard limit on
|
||||
the amount of memory that can be installed. The physical memory is not
|
||||
necessary contiguous, it might be accessible as a set of distinct
|
||||
address ranges. Besides, different CPU architectures, and even
|
||||
different implementations of the same architecture have different view
|
||||
how these address ranges defined.
|
||||
|
||||
All this makes dealing directly with physical memory quite complex and
|
||||
to avoid this complexity a concept of virtual memory was developed.
|
||||
|
||||
The virtual memory abstracts the details of physical memory from the
|
||||
application software, allows to keep only needed information in the
|
||||
physical memory (demand paging) and provides a mechanism for the
|
||||
protection and controlled sharing of data between processes.
|
||||
|
||||
With virtual memory, each and every memory access uses a virtual
|
||||
address. When the CPU decodes the an instruction that reads (or
|
||||
writes) from (or to) the system memory, it translates the `virtual`
|
||||
address encoded in that instruction to a `physical` address that the
|
||||
memory controller can understand.
|
||||
|
||||
The physical system memory is divided into page frames, or pages. The
|
||||
size of each page is architecture specific. Some architectures allow
|
||||
selection of the page size from several supported values; this
|
||||
selection is performed at the kernel build time by setting an
|
||||
appropriate kernel configuration option.
|
||||
|
||||
Each physical memory page can be mapped as one or more virtual
|
||||
pages. These mappings are described by page tables that allow
|
||||
translation from virtual address used by programs to real address in
|
||||
the physical memory. The page tables organized hierarchically.
|
||||
|
||||
The tables at the lowest level of the hierarchy contain physical
|
||||
addresses of actual pages used by the software. The tables at higher
|
||||
levels contain physical addresses of the pages belonging to the lower
|
||||
levels. The pointer to the top level page table resides in a
|
||||
register. When the CPU performs the address translation, it uses this
|
||||
register to access the top level page table. The high bits of the
|
||||
virtual address are used to index an entry in the top level page
|
||||
table. That entry is then used to access the next level in the
|
||||
hierarchy with the next bits of the virtual address as the index to
|
||||
that level page table. The lowest bits in the virtual address define
|
||||
the offset inside the actual page.
|
||||
|
||||
Huge Pages
|
||||
==========
|
||||
|
||||
The address translation requires several memory accesses and memory
|
||||
accesses are slow relatively to CPU speed. To avoid spending precious
|
||||
processor cycles on the address translation, CPUs maintain a cache of
|
||||
such translations called Translation Lookaside Buffer (or
|
||||
TLB). Usually TLB is pretty scarce resource and applications with
|
||||
large memory working set will experience performance hit because of
|
||||
TLB misses.
|
||||
|
||||
Many modern CPU architectures allow mapping of the memory pages
|
||||
directly by the higher levels in the page table. For instance, on x86,
|
||||
it is possible to map 2M and even 1G pages using entries in the second
|
||||
and the third level page tables. In Linux such pages are called
|
||||
`huge`. Usage of huge pages significantly reduces pressure on TLB,
|
||||
improves TLB hit-rate and thus improves overall system performance.
|
||||
|
||||
There are two mechanisms in Linux that enable mapping of the physical
|
||||
memory with the huge pages. The first one is `HugeTLB filesystem`, or
|
||||
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
|
||||
store. For the files created in this filesystem the data resides in
|
||||
the memory and mapped using huge pages. The hugetlbfs is described at
|
||||
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
|
||||
|
||||
Another, more recent, mechanism that enables use of the huge pages is
|
||||
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
|
||||
requires users and/or system administrators to configure what parts of
|
||||
the system memory should and can be mapped by the huge pages, THP
|
||||
manages such mappings transparently to the user and hence the
|
||||
name. See
|
||||
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
|
||||
for more details about THP.
|
||||
|
||||
Zones
|
||||
=====
|
||||
|
||||
Often hardware poses restrictions on how different physical memory
|
||||
ranges can be accessed. In some cases, devices cannot perform DMA to
|
||||
all the addressable memory. In other cases, the size of the physical
|
||||
memory exceeds the maximal addressable size of virtual memory and
|
||||
special actions are required to access portions of the memory. Linux
|
||||
groups memory pages into `zones` according to their possible
|
||||
usage. For example, ZONE_DMA will contain memory that can be used by
|
||||
devices for DMA, ZONE_HIGHMEM will contain memory that is not
|
||||
permanently mapped into kernel's address space and ZONE_NORMAL will
|
||||
contain normally addressed pages.
|
||||
|
||||
The actual layout of the memory zones is hardware dependent as not all
|
||||
architectures define all zones, and requirements for DMA are different
|
||||
for different platforms.
|
||||
|
||||
Nodes
|
||||
=====
|
||||
|
||||
Many multi-processor machines are NUMA - Non-Uniform Memory Access -
|
||||
systems. In such systems the memory is arranged into banks that have
|
||||
different access latency depending on the "distance" from the
|
||||
processor. Each bank is referred as `node` and for each node Linux
|
||||
constructs an independent memory management subsystem. A node has it's
|
||||
own set of zones, lists of free and used pages and various statistics
|
||||
counters. You can find more details about NUMA in
|
||||
:ref:`Documentation/vm/numa.rst <numa>` and in
|
||||
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
|
||||
|
||||
Page cache
|
||||
==========
|
||||
|
||||
The physical memory is volatile and the common case for getting data
|
||||
into the memory is to read it from files. Whenever a file is read, the
|
||||
data is put into the `page cache` to avoid expensive disk access on
|
||||
the subsequent reads. Similarly, when one writes to a file, the data
|
||||
is placed in the page cache and eventually gets into the backing
|
||||
storage device. The written pages are marked as `dirty` and when Linux
|
||||
decides to reuse them for other purposes, it makes sure to synchronize
|
||||
the file contents on the device with the updated data.
|
||||
|
||||
Anonymous Memory
|
||||
================
|
||||
|
||||
The `anonymous memory` or `anonymous mappings` represent memory that
|
||||
is not backed by a filesystem. Such mappings are implicitly created
|
||||
for program's stack and heap or by explicit calls to mmap(2) system
|
||||
call. Usually, the anonymous mappings only define virtual memory areas
|
||||
that the program is allowed to access. The read accesses will result
|
||||
in creation of a page table entry that references a special physical
|
||||
page filled with zeroes. When the program performs a write, regular
|
||||
physical page will be allocated to hold the written data. The page
|
||||
will be marked dirty and if the kernel will decide to repurpose it,
|
||||
the dirty page will be swapped out.
|
||||
|
||||
Reclaim
|
||||
=======
|
||||
|
||||
Throughout the system lifetime, a physical page can be used for storing
|
||||
different types of data. It can be kernel internal data structures,
|
||||
DMA'able buffers for device drivers use, data read from a filesystem,
|
||||
memory allocated by user space processes etc.
|
||||
|
||||
Depending on the page usage it is treated differently by the Linux
|
||||
memory management. The pages that can be freed at any time, either
|
||||
because they cache the data available elsewhere, for instance, on a
|
||||
hard disk, or because they can be swapped out, again, to the hard
|
||||
disk, are called `reclaimable`. The most notable categories of the
|
||||
reclaimable pages are page cache and anonymous memory.
|
||||
|
||||
In most cases, the pages holding internal kernel data and used as DMA
|
||||
buffers cannot be repurposed, and they remain pinned until freed by
|
||||
their user. Such pages are called `unreclaimable`. However, in certain
|
||||
circumstances, even pages occupied with kernel data structures can be
|
||||
reclaimed. For instance, in-memory caches of filesystem metadata can
|
||||
be re-read from the storage device and therefore it is possible to
|
||||
discard them from the main memory when system is under memory
|
||||
pressure.
|
||||
|
||||
The process of freeing the reclaimable physical memory pages and
|
||||
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
|
||||
pages either asynchronously or synchronously, depending on the state
|
||||
of the system. When system is not loaded, most of the memory is free
|
||||
and allocation request will be satisfied immediately from the free
|
||||
pages supply. As the load increases, the amount of the free pages goes
|
||||
down and when it reaches a certain threshold (high watermark), an
|
||||
allocation request will awaken the ``kswapd`` daemon. It will
|
||||
asynchronously scan memory pages and either just free them if the data
|
||||
they contain is available elsewhere, or evict to the backing storage
|
||||
device (remember those dirty pages?). As memory usage increases even
|
||||
more and reaches another threshold - min watermark - an allocation
|
||||
will trigger the `direct reclaim`. In this case allocation is stalled
|
||||
until enough memory pages are reclaimed to satisfy the request.
|
||||
|
||||
Compaction
|
||||
==========
|
||||
|
||||
As the system runs, tasks allocate and free the memory and it becomes
|
||||
fragmented. Although with virtual memory it is possible to present
|
||||
scattered physical pages as virtually contiguous range, sometimes it is
|
||||
necessary to allocate large physically contiguous memory areas. Such
|
||||
need may arise, for instance, when a device driver requires large
|
||||
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
|
||||
addresses the fragmentation issue. This mechanism moves occupied pages
|
||||
from the lower part of a memory zone to free pages in the upper part
|
||||
of the zone. When a compaction scan is finished free pages are grouped
|
||||
together at the beginning of the zone and allocations of large
|
||||
physically contiguous areas become possible.
|
||||
|
||||
Like reclaim, the compaction may happen asynchronously in ``kcompactd``
|
||||
daemon or synchronously as a result of memory allocation request.
|
||||
|
||||
OOM killer
|
||||
==========
|
||||
|
||||
It may happen, that on a loaded machine memory will be exhausted. When
|
||||
the kernel detects that the system runs out of memory (OOM) it invokes
|
||||
`OOM killer`. Its mission is simple: all it has to do is to select a
|
||||
task to sacrifice for the sake of the overall system health. The
|
||||
selected task is killed in a hope that after it exits enough memory
|
||||
will be freed to continue normal operation.
|
382
Documentation/admin-guide/mm/hugetlbpage.rst
Normal file
382
Documentation/admin-guide/mm/hugetlbpage.rst
Normal file
@@ -0,0 +1,382 @@
|
||||
.. _hugetlbpage:
|
||||
|
||||
=============
|
||||
HugeTLB Pages
|
||||
=============
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The intent of this file is to give a brief summary of hugetlbpage support in
|
||||
the Linux kernel. This support is built on top of multiple page size support
|
||||
that is provided by most modern architectures. For example, x86 CPUs normally
|
||||
support 4K and 2M (1G if architecturally supported) page sizes, ia64
|
||||
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
|
||||
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
|
||||
translations. Typically this is a very scarce resource on processor.
|
||||
Operating systems try to make best use of limited number of TLB resources.
|
||||
This optimization is more critical now as bigger and bigger physical memories
|
||||
(several GBs) are more readily available.
|
||||
|
||||
Users can use the huge page support in Linux kernel by either using the mmap
|
||||
system call or standard SYSV shared memory system calls (shmget, shmat).
|
||||
|
||||
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
|
||||
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
|
||||
automatically when CONFIG_HUGETLBFS is selected) configuration
|
||||
options.
|
||||
|
||||
The ``/proc/meminfo`` file provides information about the total number of
|
||||
persistent hugetlb pages in the kernel's huge page pool. It also displays
|
||||
default huge page size and information about the number of free, reserved
|
||||
and surplus huge pages in the pool of huge pages of default size.
|
||||
The huge page size is needed for generating the proper alignment and
|
||||
size of the arguments to system calls that map huge page regions.
|
||||
|
||||
The output of ``cat /proc/meminfo`` will include lines like::
|
||||
|
||||
HugePages_Total: uuu
|
||||
HugePages_Free: vvv
|
||||
HugePages_Rsvd: www
|
||||
HugePages_Surp: xxx
|
||||
Hugepagesize: yyy kB
|
||||
Hugetlb: zzz kB
|
||||
|
||||
where:
|
||||
|
||||
HugePages_Total
|
||||
is the size of the pool of huge pages.
|
||||
HugePages_Free
|
||||
is the number of huge pages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd
|
||||
is short for "reserved," and is the number of huge pages for
|
||||
which a commitment to allocate from the pool has been made,
|
||||
but no allocation has yet been made. Reserved huge pages
|
||||
guarantee that an application will be able to allocate a
|
||||
huge page from the pool of huge pages at fault time.
|
||||
HugePages_Surp
|
||||
is short for "surplus," and is the number of huge pages in
|
||||
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
|
||||
maximum number of surplus huge pages is controlled by
|
||||
``/proc/sys/vm/nr_overcommit_hugepages``.
|
||||
Hugepagesize
|
||||
is the default hugepage size (in Kb).
|
||||
Hugetlb
|
||||
is the total amount of memory (in kB), consumed by huge
|
||||
pages of all sizes.
|
||||
If huge pages of different sizes are in use, this number
|
||||
will exceed HugePages_Total \* Hugepagesize. To get more
|
||||
detailed information, please, refer to
|
||||
``/sys/kernel/mm/hugepages`` (described below).
|
||||
|
||||
|
||||
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
|
||||
configured in the kernel.
|
||||
|
||||
``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
|
||||
pages in the kernel's huge page pool. "Persistent" huge pages will be
|
||||
returned to the huge page pool when freed by a task. A user with root
|
||||
privileges can dynamically allocate more or free some persistent huge pages
|
||||
by increasing or decreasing the value of ``nr_hugepages``.
|
||||
|
||||
Pages that are used as huge pages are reserved inside the kernel and cannot
|
||||
be used for other purposes. Huge pages cannot be swapped out under
|
||||
memory pressure.
|
||||
|
||||
Once a number of huge pages have been pre-allocated to the kernel huge page
|
||||
pool, a user with appropriate privilege can use either the mmap system call
|
||||
or shared memory system calls to use the huge pages. See the discussion of
|
||||
:ref:`Using Huge Pages <using_huge_pages>`, below.
|
||||
|
||||
The administrator can allocate persistent huge pages on the kernel boot
|
||||
command line by specifying the "hugepages=N" parameter, where 'N' = the
|
||||
number of huge pages requested. This is the most reliable method of
|
||||
allocating huge pages as memory has not yet become fragmented.
|
||||
|
||||
Some platforms support multiple huge page sizes. To allocate huge pages
|
||||
of a specific size, one must precede the huge pages boot command parameters
|
||||
with a huge page size selection parameter "hugepagesz=<size>". <size> must
|
||||
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
|
||||
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
|
||||
|
||||
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
|
||||
indicates the current number of pre-allocated huge pages of the default size.
|
||||
Thus, one can use the following command to dynamically allocate/deallocate
|
||||
default sized persistent huge pages::
|
||||
|
||||
echo 20 > /proc/sys/vm/nr_hugepages
|
||||
|
||||
This command will try to adjust the number of default sized huge pages in the
|
||||
huge page pool to 20, allocating or freeing huge pages, as required.
|
||||
|
||||
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
||||
over all the set of allowed nodes specified by the NUMA memory policy of the
|
||||
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
|
||||
task has default memory policy--is all on-line nodes with memory. Allowed
|
||||
nodes with insufficient available, contiguous memory for a huge page will be
|
||||
silently skipped when allocating persistent huge pages. See the
|
||||
:ref:`discussion below <mem_policy_and_hp_alloc>`
|
||||
of the interaction of task memory policy, cpusets and per node attributes
|
||||
with the allocation and freeing of persistent huge pages.
|
||||
|
||||
The success or failure of huge page allocation depends on the amount of
|
||||
physically contiguous memory that is present in system at the time of the
|
||||
allocation attempt. If the kernel is unable to allocate huge pages from
|
||||
some nodes in a NUMA system, it will attempt to make up the difference by
|
||||
allocating extra pages on other nodes with sufficient available contiguous
|
||||
memory, if any.
|
||||
|
||||
System administrators may want to put this command in one of the local rc
|
||||
init files. This will enable the kernel to allocate huge pages early in
|
||||
the boot process when the possibility of getting physical contiguous pages
|
||||
is still very high. Administrators can verify the number of huge pages
|
||||
actually allocated by checking the sysctl or meminfo. To check the per node
|
||||
distribution of huge pages in a NUMA system, use::
|
||||
|
||||
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
||||
|
||||
``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
|
||||
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
|
||||
requested by applications. Writing any non-zero value into this file
|
||||
indicates that the hugetlb subsystem is allowed to try to obtain that
|
||||
number of "surplus" huge pages from the kernel's normal page pool, when the
|
||||
persistent huge page pool is exhausted. As these surplus huge pages become
|
||||
unused, they are freed back to the kernel's normal page pool.
|
||||
|
||||
When increasing the huge page pool size via ``nr_hugepages``, any existing
|
||||
surplus pages will first be promoted to persistent huge pages. Then, additional
|
||||
huge pages will be allocated, if necessary and if possible, to fulfill
|
||||
the new persistent huge page pool size.
|
||||
|
||||
The administrator may shrink the pool of persistent huge pages for
|
||||
the default huge page size by setting the ``nr_hugepages`` sysctl to a
|
||||
smaller value. The kernel will attempt to balance the freeing of huge pages
|
||||
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
|
||||
Any free huge pages on the selected nodes will be freed back to the kernel's
|
||||
normal page pool.
|
||||
|
||||
Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
|
||||
it becomes less than the number of huge pages in use will convert the balance
|
||||
of the in-use huge pages to surplus huge pages. This will occur even if
|
||||
the number of surplus pages would exceed the overcommit value. As long as
|
||||
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
|
||||
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
||||
no more surplus huge pages will be allowed to be allocated.
|
||||
|
||||
With support for multiple huge page pools at run-time available, much of
|
||||
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
|
||||
sysfs.
|
||||
The ``/proc`` interfaces discussed above have been retained for backwards
|
||||
compatibility. The root huge page control directory in sysfs is::
|
||||
|
||||
/sys/kernel/mm/hugepages
|
||||
|
||||
For each huge page size supported by the running kernel, a subdirectory
|
||||
will exist, of the form::
|
||||
|
||||
hugepages-${size}kB
|
||||
|
||||
Inside each of these directories, the same set of files will exist::
|
||||
|
||||
nr_hugepages
|
||||
nr_hugepages_mempolicy
|
||||
nr_overcommit_hugepages
|
||||
free_hugepages
|
||||
resv_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
which function as described above for the default huge page-sized case.
|
||||
|
||||
.. _mem_policy_and_hp_alloc:
|
||||
|
||||
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
||||
===================================================================
|
||||
|
||||
Whether huge pages are allocated and freed via the ``/proc`` interface or
|
||||
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
|
||||
NUMA nodes from which huge pages are allocated or freed are controlled by the
|
||||
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
|
||||
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
|
||||
is ignored.
|
||||
|
||||
The recommended method to allocate or free huge pages to/from the kernel
|
||||
huge page pool, using the ``nr_hugepages`` example above, is::
|
||||
|
||||
numactl --interleave <node-list> echo 20 \
|
||||
>/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
or, more succinctly::
|
||||
|
||||
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
|
||||
specified in <node-list>, depending on whether number of persistent huge pages
|
||||
is initially less than or greater than 20, respectively. No huge pages will be
|
||||
allocated nor freed on any node not included in the specified <node-list>.
|
||||
|
||||
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
|
||||
memory policy mode--bind, preferred, local or interleave--may be used. The
|
||||
resulting effect on persistent huge page allocation is as follows:
|
||||
|
||||
#. Regardless of mempolicy mode [see
|
||||
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
|
||||
persistent huge pages will be distributed across the node or nodes
|
||||
specified in the mempolicy as if "interleave" had been specified.
|
||||
However, if a node in the policy does not contain sufficient contiguous
|
||||
memory for a huge page, the allocation will not "fallback" to the nearest
|
||||
neighbor node with sufficient contiguous memory. To do this would cause
|
||||
undesirable imbalance in the distribution of the huge page pool, or
|
||||
possibly, allocation of persistent huge pages on nodes not allowed by
|
||||
the task's memory policy.
|
||||
|
||||
#. One or more nodes may be specified with the bind or interleave policy.
|
||||
If more than one node is specified with the preferred policy, only the
|
||||
lowest numeric id will be used. Local policy will select the node where
|
||||
the task is running at the time the nodes_allowed mask is constructed.
|
||||
For local policy to be deterministic, the task must be bound to a cpu or
|
||||
cpus in a single node. Otherwise, the task could be migrated to some
|
||||
other node at any time after launch and the resulting node will be
|
||||
indeterminate. Thus, local policy is not very useful for this purpose.
|
||||
Any of the other mempolicy modes may be used to specify a single node.
|
||||
|
||||
#. The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||
whether this policy was set explicitly by the task itself or one of its
|
||||
ancestors, such as numactl. This means that if the task is invoked from a
|
||||
shell with non-default policy, that policy will be used. One can specify a
|
||||
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
||||
interleaving over all nodes in the system or cpuset.
|
||||
|
||||
#. Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||
the resource limits of any cpuset in which the task runs. Thus, there will
|
||||
be no way for a task with non-default policy running in a cpuset with a
|
||||
subset of the system nodes to allocate huge pages outside the cpuset
|
||||
without first moving to a cpuset that contains all of the desired nodes.
|
||||
|
||||
#. Boot-time huge page allocation attempts to distribute the requested number
|
||||
of huge pages over all on-lines nodes with memory.
|
||||
|
||||
Per Node Hugepages Attributes
|
||||
=============================
|
||||
|
||||
A subset of the contents of the root huge page control directory in sysfs,
|
||||
described above, will be replicated under each the system device of each
|
||||
NUMA node with memory in::
|
||||
|
||||
/sys/devices/system/node/node[0-9]*/hugepages/
|
||||
|
||||
Under this directory, the subdirectory for each supported huge page size
|
||||
contains the following attribute files::
|
||||
|
||||
nr_hugepages
|
||||
free_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
The free\_' and surplus\_' attribute files are read-only. They return the number
|
||||
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
||||
node.
|
||||
|
||||
The ``nr_hugepages`` attribute returns the total number of huge pages on the
|
||||
specified node. When this attribute is written, the number of persistent huge
|
||||
pages on the parent node will be adjusted to the specified value, if sufficient
|
||||
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
||||
|
||||
Note that the number of overcommit and reserve pages remain global quantities,
|
||||
as we don't know until fault time, when the faulting task's mempolicy is
|
||||
applied, from which node the huge page allocation will be attempted.
|
||||
|
||||
.. _using_huge_pages:
|
||||
|
||||
Using Huge Pages
|
||||
================
|
||||
|
||||
If the user applications are going to request huge pages using mmap system
|
||||
call, then it is required that system administrator mount a file system of
|
||||
type hugetlbfs::
|
||||
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||
|
||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
||||
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
|
||||
|
||||
The ``uid`` and ``gid`` options sets the owner and group of the root of the
|
||||
file system. By default the ``uid`` and ``gid`` of the current process
|
||||
are taken.
|
||||
|
||||
The ``mode`` option sets the mode of root of file system to value & 01777.
|
||||
This value is given in octal. By default the value 0755 is picked.
|
||||
|
||||
If the platform supports multiple huge page sizes, the ``pagesize`` option can
|
||||
be used to specify the huge page size and associated pool. ``pagesize``
|
||||
is specified in bytes. If ``pagesize`` is not specified the platform's
|
||||
default huge page size and associated pool will be used.
|
||||
|
||||
The ``size`` option sets the maximum value of memory (huge pages) allowed
|
||||
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
|
||||
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
|
||||
The size is rounded down to HPAGE_SIZE boundary.
|
||||
|
||||
The ``min_size`` option sets the minimum value of memory (huge pages) allowed
|
||||
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
|
||||
either bytes or a percentage of the huge page pool.
|
||||
At mount time, the number of huge pages specified by ``min_size`` are reserved
|
||||
for use by the filesystem.
|
||||
If there are not enough free huge pages available, the mount will fail.
|
||||
As huge pages are allocated to the filesystem and freed, the reserve count
|
||||
is adjusted so that the sum of allocated and reserved huge pages is always
|
||||
at least ``min_size``.
|
||||
|
||||
The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
|
||||
can use.
|
||||
|
||||
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
|
||||
command line then no limits are set.
|
||||
|
||||
For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
|
||||
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
|
||||
For example, size=2K has the same meaning as size=2048.
|
||||
|
||||
While read system calls are supported on files that reside on hugetlb
|
||||
file systems, write system calls are not.
|
||||
|
||||
Regular chown, chgrp, and chmod commands (with right permissions) could be
|
||||
used to change the file attributes on hugetlbfs.
|
||||
|
||||
Also, it is important to note that no such mount command is required if
|
||||
applications are going to use only shmat/shmget system calls or mmap with
|
||||
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
|
||||
:ref:`map_hugetlb <map_hugetlb>` below.
|
||||
|
||||
Users who wish to use hugetlb memory via shared memory segment should be
|
||||
members of a supplementary group and system admin needs to configure that gid
|
||||
into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different
|
||||
applications to use any combination of mmaps and shm* calls, though the mount of
|
||||
filesystem will be required for using mmap calls without MAP_HUGETLB.
|
||||
|
||||
Syscalls that operate on memory backed by hugetlb pages only have their lengths
|
||||
aligned to the native page size of the processor; they will normally fail with
|
||||
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
|
||||
not hugepage aligned. For example, munmap(2) will fail if memory is backed by
|
||||
a hugetlb page and the length is smaller than the hugepage size.
|
||||
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
.. _map_hugetlb:
|
||||
|
||||
``map_hugetlb``
|
||||
see tools/testing/selftests/vm/map_hugetlb.c
|
||||
|
||||
``hugepage-shm``
|
||||
see tools/testing/selftests/vm/hugepage-shm.c
|
||||
|
||||
``hugepage-mmap``
|
||||
see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
|
||||
The `libhugetlbfs`_ library provides a wide range of userspace tools
|
||||
to help with huge page usability, environment setup, and control.
|
||||
|
||||
.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
|
116
Documentation/admin-guide/mm/idle_page_tracking.rst
Normal file
116
Documentation/admin-guide/mm/idle_page_tracking.rst
Normal file
@@ -0,0 +1,116 @@
|
||||
.. _idle_page_tracking:
|
||||
|
||||
==================
|
||||
Idle Page Tracking
|
||||
==================
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The idle page tracking feature allows to track which memory pages are being
|
||||
accessed by a workload and which are idle. This information can be useful for
|
||||
estimating the workload's working set size, which, in turn, can be taken into
|
||||
account when configuring the workload parameters, setting memory cgroup limits,
|
||||
or deciding where to place the workload within a compute cluster.
|
||||
|
||||
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
|
||||
|
||||
.. _user_api:
|
||||
|
||||
User API
|
||||
========
|
||||
|
||||
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
|
||||
Currently, it consists of the only read-write file,
|
||||
``/sys/kernel/mm/page_idle/bitmap``.
|
||||
|
||||
The file implements a bitmap where each bit corresponds to a memory page. The
|
||||
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
|
||||
mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
|
||||
set, the corresponding page is idle.
|
||||
|
||||
A page is considered idle if it has not been accessed since it was marked idle
|
||||
(for more details on what "accessed" actually means see the :ref:`Implementation
|
||||
Details <impl_details>` section).
|
||||
To mark a page idle one has to set the bit corresponding to
|
||||
the page by writing to the file. A value written to the file is OR-ed with the
|
||||
current bitmap value.
|
||||
|
||||
Only accesses to user memory pages are tracked. These are pages mapped to a
|
||||
process address space, page cache and buffer pages, swap cache pages. For other
|
||||
page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
|
||||
and hence such pages are never reported idle.
|
||||
|
||||
For huge pages the idle flag is set only on the head page, so one has to read
|
||||
``/proc/kpageflags`` in order to correctly count idle huge pages.
|
||||
|
||||
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
|
||||
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
|
||||
if the size of the read/write is not a multiple of 8 bytes. Writing to
|
||||
this file beyond max PFN will return -ENXIO.
|
||||
|
||||
That said, in order to estimate the amount of pages that are not used by a
|
||||
workload one should:
|
||||
|
||||
1. Mark all the workload's pages as idle by setting corresponding bits in
|
||||
``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
|
||||
``/proc/pid/pagemap`` if the workload is represented by a process, or by
|
||||
filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
|
||||
is placed in a memory cgroup.
|
||||
|
||||
2. Wait until the workload accesses its working set.
|
||||
|
||||
3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
|
||||
If one wants to ignore certain types of pages, e.g. mlocked pages since they
|
||||
are not reclaimable, he or she can filter them out using
|
||||
``/proc/kpageflags``.
|
||||
|
||||
See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
|
||||
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
|
||||
``/proc/kpagecgroup``.
|
||||
|
||||
.. _impl_details:
|
||||
|
||||
Implementation Details
|
||||
======================
|
||||
|
||||
The kernel internally keeps track of accesses to user memory pages in order to
|
||||
reclaim unreferenced pages first on memory shortage conditions. A page is
|
||||
considered referenced if it has been recently accessed via a process address
|
||||
space, in which case one or more PTEs it is mapped to will have the Accessed bit
|
||||
set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
|
||||
latter happens when:
|
||||
|
||||
- a userspace process reads or writes a page using a system call (e.g. read(2)
|
||||
or write(2))
|
||||
|
||||
- a page that is used for storing filesystem buffers is read or written,
|
||||
because a process needs filesystem metadata stored in it (e.g. lists a
|
||||
directory tree)
|
||||
|
||||
- a page is accessed by a device driver using get_user_pages()
|
||||
|
||||
When a dirty page is written to swap or disk as a result of memory reclaim or
|
||||
exceeding the dirty memory limit, it is not marked referenced.
|
||||
|
||||
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
|
||||
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
|
||||
:ref:`User API <user_api>`
|
||||
section), and cleared automatically whenever a page is referenced as defined
|
||||
above.
|
||||
|
||||
When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
|
||||
mapped to, otherwise we will not be able to detect accesses to the page coming
|
||||
from a process address space. To avoid interference with the reclaimer, which,
|
||||
as noted above, uses the Accessed bit to promote actively referenced pages, one
|
||||
more page flag is introduced, the Young flag. When the PTE Accessed bit is
|
||||
cleared as a result of setting or updating a page's Idle flag, the Young flag
|
||||
is set on the page. The reclaimer treats the Young flag as an extra PTE
|
||||
Accessed bit and therefore will consider such a page as referenced.
|
||||
|
||||
Since the idle memory tracking feature is based on the memory reclaimer logic,
|
||||
it only works with pages that are on an LRU list, other pages are silently
|
||||
ignored. That means it will ignore a user memory page if it is isolated, but
|
||||
since there are usually not many of them, it should not affect the overall
|
||||
result noticeably. In order not to stall scanning of the idle page bitmap,
|
||||
locked pages may be skipped too.
|
36
Documentation/admin-guide/mm/index.rst
Normal file
36
Documentation/admin-guide/mm/index.rst
Normal file
@@ -0,0 +1,36 @@
|
||||
=================
|
||||
Memory Management
|
||||
=================
|
||||
|
||||
Linux memory management subsystem is responsible, as the name implies,
|
||||
for managing the memory in the system. This includes implemnetation of
|
||||
virtual memory and demand paging, memory allocation both for kernel
|
||||
internal structures and user space programms, mapping of files into
|
||||
processes address space and many other cool things.
|
||||
|
||||
Linux memory management is a complex system with many configurable
|
||||
settings. Most of these settings are available via ``/proc``
|
||||
filesystem and can be quired and adjusted using ``sysctl``. These APIs
|
||||
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
|
||||
|
||||
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
|
||||
|
||||
Linux memory management has its own jargon and if you are not yet
|
||||
familiar with it, consider reading
|
||||
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
|
||||
|
||||
Here we document in detail how to interact with various mechanisms in
|
||||
the Linux memory management.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
concepts
|
||||
hugetlbpage
|
||||
idle_page_tracking
|
||||
ksm
|
||||
numa_memory_policy
|
||||
pagemap
|
||||
soft-dirty
|
||||
transhuge
|
||||
userfaultfd
|
189
Documentation/admin-guide/mm/ksm.rst
Normal file
189
Documentation/admin-guide/mm/ksm.rst
Normal file
@@ -0,0 +1,189 @@
|
||||
.. _admin_guide_ksm:
|
||||
|
||||
=======================
|
||||
Kernel Samepage Merging
|
||||
=======================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
|
||||
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
|
||||
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
|
||||
|
||||
KSM was originally developed for use with KVM (where it was known as
|
||||
Kernel Shared Memory), to fit more virtual machines into physical memory,
|
||||
by sharing the data common between them. But it can be useful to any
|
||||
application which generates many instances of the same data.
|
||||
|
||||
The KSM daemon ksmd periodically scans those areas of user memory
|
||||
which have been registered with it, looking for pages of identical
|
||||
content which can be replaced by a single write-protected page (which
|
||||
is automatically copied if a process later wants to update its
|
||||
content). The amount of pages that KSM daemon scans in a single pass
|
||||
and the time between the passes are configured using :ref:`sysfs
|
||||
intraface <ksm_sysfs>`
|
||||
|
||||
KSM only merges anonymous (private) pages, never pagecache (file) pages.
|
||||
KSM's merged pages were originally locked into kernel memory, but can now
|
||||
be swapped out just like other user pages (but sharing is broken when they
|
||||
are swapped back in: ksmd must rediscover their identity and merge again).
|
||||
|
||||
Controlling KSM with madvise
|
||||
============================
|
||||
|
||||
KSM only operates on those areas of address space which an application
|
||||
has advised to be likely candidates for merging, by using the madvise(2)
|
||||
system call::
|
||||
|
||||
int madvise(addr, length, MADV_MERGEABLE)
|
||||
|
||||
The app may call
|
||||
|
||||
::
|
||||
|
||||
int madvise(addr, length, MADV_UNMERGEABLE)
|
||||
|
||||
to cancel that advice and restore unshared pages: whereupon KSM
|
||||
unmerges whatever it merged in that range. Note: this unmerging call
|
||||
may suddenly require more memory than is available - possibly failing
|
||||
with EAGAIN, but more probably arousing the Out-Of-Memory killer.
|
||||
|
||||
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
|
||||
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
|
||||
built with CONFIG_KSM=y, those calls will normally succeed: even if the
|
||||
the KSM daemon is not currently running, MADV_MERGEABLE still registers
|
||||
the range for whenever the KSM daemon is started; even if the range
|
||||
cannot contain any pages which KSM could actually merge; even if
|
||||
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
|
||||
|
||||
If a region of memory must be split into at least one new MADV_MERGEABLE
|
||||
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
|
||||
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
|
||||
|
||||
Like other madvise calls, they are intended for use on mapped areas of
|
||||
the user address space: they will report ENOMEM if the specified range
|
||||
includes unmapped gaps (though working on the intervening mapped areas),
|
||||
and might fail with EAGAIN if not enough memory for internal structures.
|
||||
|
||||
Applications should be considerate in their use of MADV_MERGEABLE,
|
||||
restricting its use to areas likely to benefit. KSM's scans may use a lot
|
||||
of processing power: some installations will disable KSM for that reason.
|
||||
|
||||
.. _ksm_sysfs:
|
||||
|
||||
KSM daemon sysfs interface
|
||||
==========================
|
||||
|
||||
The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
|
||||
readable by all but writable only by root:
|
||||
|
||||
pages_to_scan
|
||||
how many pages to scan before ksmd goes to sleep
|
||||
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
|
||||
|
||||
Default: 100 (chosen for demonstration purposes)
|
||||
|
||||
sleep_millisecs
|
||||
how many milliseconds ksmd should sleep before next scan
|
||||
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
|
||||
|
||||
Default: 20 (chosen for demonstration purposes)
|
||||
|
||||
merge_across_nodes
|
||||
specifies if pages from different NUMA nodes can be merged.
|
||||
When set to 0, ksm merges only pages which physically reside
|
||||
in the memory area of same NUMA node. That brings lower
|
||||
latency to access of shared pages. Systems with more nodes, at
|
||||
significant NUMA distances, are likely to benefit from the
|
||||
lower latency of setting 0. Smaller systems, which need to
|
||||
minimize memory usage, are likely to benefit from the greater
|
||||
sharing of setting 1 (default). You may wish to compare how
|
||||
your system performs under each setting, before deciding on
|
||||
which to use. ``merge_across_nodes`` setting can be changed only
|
||||
when there are no ksm shared pages in the system: set run 2 to
|
||||
unmerge pages first, then to 1 after changing
|
||||
``merge_across_nodes``, to remerge according to the new setting.
|
||||
|
||||
Default: 1 (merging across nodes as in earlier releases)
|
||||
|
||||
run
|
||||
* set to 0 to stop ksmd from running but keep merged pages,
|
||||
* set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
|
||||
* set to 2 to stop ksmd and unmerge all pages currently merged, but
|
||||
leave mergeable areas registered for next run.
|
||||
|
||||
Default: 0 (must be changed to 1 to activate KSM, except if
|
||||
CONFIG_SYSFS is disabled)
|
||||
|
||||
use_zero_pages
|
||||
specifies whether empty pages (i.e. allocated pages that only
|
||||
contain zeroes) should be treated specially. When set to 1,
|
||||
empty pages are merged with the kernel zero page(s) instead of
|
||||
with each other as it would happen normally. This can improve
|
||||
the performance on architectures with coloured zero pages,
|
||||
depending on the workload. Care should be taken when enabling
|
||||
this setting, as it can potentially degrade the performance of
|
||||
KSM for some workloads, for example if the checksums of pages
|
||||
candidate for merging match the checksum of an empty
|
||||
page. This setting can be changed at any time, it is only
|
||||
effective for pages merged after the change.
|
||||
|
||||
Default: 0 (normal KSM behaviour as in earlier releases)
|
||||
|
||||
max_page_sharing
|
||||
Maximum sharing allowed for each KSM page. This enforces a
|
||||
deduplication limit to avoid high latency for virtual memory
|
||||
operations that involve traversal of the virtual mappings that
|
||||
share the KSM page. The minimum value is 2 as a newly created
|
||||
KSM page will have at least two sharers. The higher this value
|
||||
the faster KSM will merge the memory and the higher the
|
||||
deduplication factor will be, but the slower the worst case
|
||||
virtual mappings traversal could be for any given KSM
|
||||
page. Slowing down this traversal means there will be higher
|
||||
latency for certain virtual memory operations happening during
|
||||
swapping, compaction, NUMA balancing and page migration, in
|
||||
turn decreasing responsiveness for the caller of those virtual
|
||||
memory operations. The scheduler latency of other tasks not
|
||||
involved with the VM operations doing the virtual mappings
|
||||
traversal is not affected by this parameter as these
|
||||
traversals are always schedule friendly themselves.
|
||||
|
||||
stable_node_chains_prune_millisecs
|
||||
specifies how frequently KSM checks the metadata of the pages
|
||||
that hit the deduplication limit for stale information.
|
||||
Smaller milllisecs values will free up the KSM metadata with
|
||||
lower latency, but they will make ksmd use more CPU during the
|
||||
scan. It's a noop if not a single KSM page hit the
|
||||
``max_page_sharing`` yet.
|
||||
|
||||
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
|
||||
|
||||
pages_shared
|
||||
how many shared pages are being used
|
||||
pages_sharing
|
||||
how many more sites are sharing them i.e. how much saved
|
||||
pages_unshared
|
||||
how many pages unique but repeatedly checked for merging
|
||||
pages_volatile
|
||||
how many pages changing too fast to be placed in a tree
|
||||
full_scans
|
||||
how many times all mergeable areas have been scanned
|
||||
stable_node_chains
|
||||
the number of KSM pages that hit the ``max_page_sharing`` limit
|
||||
stable_node_dups
|
||||
number of duplicated KSM pages
|
||||
|
||||
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
|
||||
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
|
||||
indicates wasted effort. ``pages_volatile`` embraces several
|
||||
different kinds of activity, but a high proportion there would also
|
||||
indicate poor use of madvise MADV_MERGEABLE.
|
||||
|
||||
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
|
||||
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
|
||||
be increased accordingly.
|
||||
|
||||
--
|
||||
Izik Eidus,
|
||||
Hugh Dickins, 17 Nov 2009
|
495
Documentation/admin-guide/mm/numa_memory_policy.rst
Normal file
495
Documentation/admin-guide/mm/numa_memory_policy.rst
Normal file
@@ -0,0 +1,495 @@
|
||||
.. _numa_memory_policy:
|
||||
|
||||
==================
|
||||
NUMA Memory Policy
|
||||
==================
|
||||
|
||||
What is NUMA Memory Policy?
|
||||
============================
|
||||
|
||||
In the Linux kernel, "memory policy" determines from which node the kernel will
|
||||
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
|
||||
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
|
||||
The current memory policy support was added to Linux 2.6 around May 2004. This
|
||||
document attempts to describe the concepts and APIs of the 2.6 memory policy
|
||||
support.
|
||||
|
||||
Memory policies should not be confused with cpusets
|
||||
(``Documentation/cgroup-v1/cpusets.txt``)
|
||||
which is an administrative mechanism for restricting the nodes from which
|
||||
memory may be allocated by a set of processes. Memory policies are a
|
||||
programming interface that a NUMA-aware application can take advantage of. When
|
||||
both cpusets and policies are applied to a task, the restrictions of the cpuset
|
||||
takes priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
|
||||
below for more details.
|
||||
|
||||
Memory Policy Concepts
|
||||
======================
|
||||
|
||||
Scope of Memory Policies
|
||||
------------------------
|
||||
|
||||
The Linux kernel supports _scopes_ of memory policy, described here from
|
||||
most general to most specific:
|
||||
|
||||
System Default Policy
|
||||
this policy is "hard coded" into the kernel. It is the policy
|
||||
that governs all page allocations that aren't controlled by
|
||||
one of the more specific policy scopes discussed below. When
|
||||
the system is "up and running", the system default policy will
|
||||
use "local allocation" described below. However, during boot
|
||||
up, the system default policy will be set to interleave
|
||||
allocations across all nodes with "sufficient" memory, so as
|
||||
not to overload the initial boot node with boot-time
|
||||
allocations.
|
||||
|
||||
Task/Process Policy
|
||||
this is an optional, per-task policy. When defined for a
|
||||
specific task, this policy controls all page allocations made
|
||||
by or on behalf of the task that aren't controlled by a more
|
||||
specific scope. If a task does not define a task policy, then
|
||||
all page allocations that would have been controlled by the
|
||||
task policy "fall back" to the System Default Policy.
|
||||
|
||||
The task policy applies to the entire address space of a task. Thus,
|
||||
it is inheritable, and indeed is inherited, across both fork()
|
||||
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
|
||||
to establish the task policy for a child task exec()'d from an
|
||||
executable image that has no awareness of memory policy. See the
|
||||
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||
below, for an overview of the system call
|
||||
that a task may use to set/change its task/process policy.
|
||||
|
||||
In a multi-threaded task, task policies apply only to the thread
|
||||
[Linux kernel task] that installs the policy and any threads
|
||||
subsequently created by that thread. Any sibling threads existing
|
||||
at the time a new task policy is installed retain their current
|
||||
policy.
|
||||
|
||||
A task policy applies only to pages allocated after the policy is
|
||||
installed. Any pages already faulted in by the task when the task
|
||||
changes its task policy remain where they were allocated based on
|
||||
the policy at the time they were allocated.
|
||||
|
||||
.. _vma_policy:
|
||||
|
||||
VMA Policy
|
||||
A "VMA" or "Virtual Memory Area" refers to a range of a task's
|
||||
virtual address space. A task may define a specific policy for a range
|
||||
of its virtual address space. See the
|
||||
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||
below, for an overview of the mbind() system call used to set a VMA
|
||||
policy.
|
||||
|
||||
A VMA policy will govern the allocation of pages that back
|
||||
this region of the address space. Any regions of the task's
|
||||
address space that don't have an explicit VMA policy will fall
|
||||
back to the task policy, which may itself fall back to the
|
||||
System Default Policy.
|
||||
|
||||
VMA policies have a few complicating details:
|
||||
|
||||
* VMA policy applies ONLY to anonymous pages. These include
|
||||
pages allocated for anonymous segments, such as the task
|
||||
stack and heap, and any regions of the address space
|
||||
mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
|
||||
applied to a file mapping, it will be ignored if the mapping
|
||||
used the MAP_SHARED flag. If the file mapping used the
|
||||
MAP_PRIVATE flag, the VMA policy will only be applied when
|
||||
an anonymous page is allocated on an attempt to write to the
|
||||
mapping-- i.e., at Copy-On-Write.
|
||||
|
||||
* VMA policies are shared between all tasks that share a
|
||||
virtual address space--a.k.a. threads--independent of when
|
||||
the policy is installed; and they are inherited across
|
||||
fork(). However, because VMA policies refer to a specific
|
||||
region of a task's address space, and because the address
|
||||
space is discarded and recreated on exec*(), VMA policies
|
||||
are NOT inheritable across exec(). Thus, only NUMA-aware
|
||||
applications may use VMA policies.
|
||||
|
||||
* A task may install a new VMA policy on a sub-range of a
|
||||
previously mmap()ed region. When this happens, Linux splits
|
||||
the existing virtual memory area into 2 or 3 VMAs, each with
|
||||
it's own policy.
|
||||
|
||||
* By default, VMA policy applies only to pages allocated after
|
||||
the policy is installed. Any pages already faulted into the
|
||||
VMA range remain where they were allocated based on the
|
||||
policy at the time they were allocated. However, since
|
||||
2.6.16, Linux supports page migration via the mbind() system
|
||||
call, so that page contents can be moved to match a newly
|
||||
installed policy.
|
||||
|
||||
Shared Policy
|
||||
Conceptually, shared policies apply to "memory objects" mapped
|
||||
shared into one or more tasks' distinct address spaces. An
|
||||
application installs shared policies the same way as VMA
|
||||
policies--using the mbind() system call specifying a range of
|
||||
virtual addresses that map the shared object. However, unlike
|
||||
VMA policies, which can be considered to be an attribute of a
|
||||
range of a task's address space, shared policies apply
|
||||
directly to the shared object. Thus, all tasks that attach to
|
||||
the object share the policy, and all pages allocated for the
|
||||
shared object, by any task, will obey the shared policy.
|
||||
|
||||
As of 2.6.22, only shared memory segments, created by shmget() or
|
||||
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
|
||||
policy support was added to Linux, the associated data structures were
|
||||
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
|
||||
support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
|
||||
shmem segments were never "hooked up" to the shared policy support.
|
||||
Although hugetlbfs segments now support lazy allocation, their support
|
||||
for shared policy has not been completed.
|
||||
|
||||
As mentioned above in :ref:`VMA policies <vma_policy>` section,
|
||||
allocations of page cache pages for regular files mmap()ed
|
||||
with MAP_SHARED ignore any VMA policy installed on the virtual
|
||||
address range backed by the shared file mapping. Rather,
|
||||
shared page cache pages, including pages backing private
|
||||
mappings that have not yet been written by the task, follow
|
||||
task policy, if any, else System Default Policy.
|
||||
|
||||
The shared policy infrastructure supports different policies on subset
|
||||
ranges of the shared object. However, Linux still splits the VMA of
|
||||
the task that installs the policy for each range of distinct policy.
|
||||
Thus, different tasks that attach to a shared memory segment can have
|
||||
different VMA configurations mapping that one shared object. This
|
||||
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
|
||||
a shared memory region, when one task has installed shared policy on
|
||||
one or more ranges of the region.
|
||||
|
||||
Components of Memory Policies
|
||||
-----------------------------
|
||||
|
||||
A NUMA memory policy consists of a "mode", optional mode flags, and
|
||||
an optional set of nodes. The mode determines the behavior of the
|
||||
policy, the optional mode flags determine the behavior of the mode,
|
||||
and the optional set of nodes can be viewed as the arguments to the
|
||||
policy behavior.
|
||||
|
||||
Internally, memory policies are implemented by a reference counted
|
||||
structure, struct mempolicy. Details of this structure will be
|
||||
discussed in context, below, as required to explain the behavior.
|
||||
|
||||
NUMA memory policy supports the following 4 behavioral modes:
|
||||
|
||||
Default Mode--MPOL_DEFAULT
|
||||
This mode is only used in the memory policy APIs. Internally,
|
||||
MPOL_DEFAULT is converted to the NULL memory policy in all
|
||||
policy scopes. Any existing non-default policy will simply be
|
||||
removed when MPOL_DEFAULT is specified. As a result,
|
||||
MPOL_DEFAULT means "fall back to the next most specific policy
|
||||
scope."
|
||||
|
||||
For example, a NULL or default task policy will fall back to the
|
||||
system default policy. A NULL or default vma policy will fall
|
||||
back to the task policy.
|
||||
|
||||
When specified in one of the memory policy APIs, the Default mode
|
||||
does not use the optional set of nodes.
|
||||
|
||||
It is an error for the set of nodes specified for this policy to
|
||||
be non-empty.
|
||||
|
||||
MPOL_BIND
|
||||
This mode specifies that memory must come from the set of
|
||||
nodes specified by the policy. Memory will be allocated from
|
||||
the node in the set with sufficient free memory that is
|
||||
closest to the node where the allocation takes place.
|
||||
|
||||
MPOL_PREFERRED
|
||||
This mode specifies that the allocation should be attempted
|
||||
from the single node specified in the policy. If that
|
||||
allocation fails, the kernel will search other nodes, in order
|
||||
of increasing distance from the preferred node based on
|
||||
information provided by the platform firmware.
|
||||
|
||||
Internally, the Preferred policy uses a single node--the
|
||||
preferred_node member of struct mempolicy. When the internal
|
||||
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
|
||||
and the policy is interpreted as local allocation. "Local"
|
||||
allocation policy can be viewed as a Preferred policy that
|
||||
starts at the node containing the cpu where the allocation
|
||||
takes place.
|
||||
|
||||
It is possible for the user to specify that local allocation
|
||||
is always preferred by passing an empty nodemask with this
|
||||
mode. If an empty nodemask is passed, the policy cannot use
|
||||
the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
|
||||
described below.
|
||||
|
||||
MPOL_INTERLEAVED
|
||||
This mode specifies that page allocations be interleaved, on a
|
||||
page granularity, across the nodes specified in the policy.
|
||||
This mode also behaves slightly differently, based on the
|
||||
context where it is used:
|
||||
|
||||
For allocation of anonymous pages and shared memory pages,
|
||||
Interleave mode indexes the set of nodes specified by the
|
||||
policy using the page offset of the faulting address into the
|
||||
segment [VMA] containing the address modulo the number of
|
||||
nodes specified by the policy. It then attempts to allocate a
|
||||
page, starting at the selected node, as if the node had been
|
||||
specified by a Preferred policy or had been selected by a
|
||||
local allocation. That is, allocation will follow the per
|
||||
node zonelist.
|
||||
|
||||
For allocation of page cache pages, Interleave mode indexes
|
||||
the set of nodes specified by the policy using a node counter
|
||||
maintained per task. This counter wraps around to the lowest
|
||||
specified node after it reaches the highest specified node.
|
||||
This will tend to spread the pages out over the nodes
|
||||
specified by the policy based on the order in which they are
|
||||
allocated, rather than based on any page offset into an
|
||||
address range or file. During system boot up, the temporary
|
||||
interleaved system default policy works in this mode.
|
||||
|
||||
NUMA memory policy supports the following optional mode flags:
|
||||
|
||||
MPOL_F_STATIC_NODES
|
||||
This flag specifies that the nodemask passed by
|
||||
the user should not be remapped if the task or VMA's set of allowed
|
||||
nodes changes after the memory policy has been defined.
|
||||
|
||||
Without this flag, any time a mempolicy is rebound because of a
|
||||
change in the set of allowed nodes, the node (Preferred) or
|
||||
nodemask (Bind, Interleave) is remapped to the new set of
|
||||
allowed nodes. This may result in nodes being used that were
|
||||
previously undesired.
|
||||
|
||||
With this flag, if the user-specified nodes overlap with the
|
||||
nodes allowed by the task's cpuset, then the memory policy is
|
||||
applied to their intersection. If the two sets of nodes do not
|
||||
overlap, the Default policy is used.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 1-3 that sets an Interleave policy over the same set. If
|
||||
the cpuset's mems change to 3-5, the Interleave will now occur
|
||||
over nodes 3, 4, and 5. With this flag, however, since only node
|
||||
3 is allowed from the user's nodemask, the "interleave" only
|
||||
occurs over that node. If no nodes from the user's nodemask are
|
||||
now allowed, the Default behavior is used.
|
||||
|
||||
MPOL_F_STATIC_NODES cannot be combined with the
|
||||
MPOL_F_RELATIVE_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
|
||||
|
||||
MPOL_F_RELATIVE_NODES
|
||||
This flag specifies that the nodemask passed
|
||||
by the user will be mapped relative to the set of the task or VMA's
|
||||
set of allowed nodes. The kernel stores the user-passed nodemask,
|
||||
and if the allowed nodes changes, then that original nodemask will
|
||||
be remapped relative to the new set of allowed nodes.
|
||||
|
||||
Without this flag (and without MPOL_F_STATIC_NODES), anytime a
|
||||
mempolicy is rebound because of a change in the set of allowed
|
||||
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
|
||||
remapped to the new set of allowed nodes. That remap may not
|
||||
preserve the relative nature of the user's passed nodemask to its
|
||||
set of allowed nodes upon successive rebinds: a nodemask of
|
||||
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
|
||||
allowed nodes is restored to its original state.
|
||||
|
||||
With this flag, the remap is done so that the node numbers from
|
||||
the user's passed nodemask are relative to the set of allowed
|
||||
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
|
||||
nodemask, the policy will be effected over the first (and in the
|
||||
Bind or Interleave case, the third and fifth) nodes in the set of
|
||||
allowed nodes. The nodemask passed by the user represents nodes
|
||||
relative to task or VMA's set of allowed nodes.
|
||||
|
||||
If the user's nodemask includes nodes that are outside the range
|
||||
of the new set of allowed nodes (for example, node 5 is set in
|
||||
the user's nodemask when the set of allowed nodes is only 0-3),
|
||||
then the remap wraps around to the beginning of the nodemask and,
|
||||
if not already set, sets the node in the mempolicy nodemask.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 2-5 that sets an Interleave policy over the same set with
|
||||
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
|
||||
interleave now occurs over nodes 3,5-7. If the cpuset's mems
|
||||
then change to 0,2-3,5, then the interleave occurs over nodes
|
||||
0,2-3,5.
|
||||
|
||||
Thanks to the consistent remapping, applications preparing
|
||||
nodemasks to specify memory policies using this flag should
|
||||
disregard their current, actual cpuset imposed memory placement
|
||||
and prepare the nodemask as if they were always located on
|
||||
memory nodes 0 to N-1, where N is the number of memory nodes the
|
||||
policy is intended to manage. Let the kernel then remap to the
|
||||
set of memory nodes allowed by the task's cpuset, as that may
|
||||
change over time.
|
||||
|
||||
MPOL_F_RELATIVE_NODES cannot be combined with the
|
||||
MPOL_F_STATIC_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
|
||||
|
||||
Memory Policy Reference Counting
|
||||
================================
|
||||
|
||||
To resolve use/free races, struct mempolicy contains an atomic reference
|
||||
count field. Internal interfaces, mpol_get()/mpol_put() increment and
|
||||
decrement this reference count, respectively. mpol_put() will only free
|
||||
the structure back to the mempolicy kmem cache when the reference count
|
||||
goes to zero.
|
||||
|
||||
When a new memory policy is allocated, its reference count is initialized
|
||||
to '1', representing the reference held by the task that is installing the
|
||||
new policy. When a pointer to a memory policy structure is stored in another
|
||||
structure, another reference is added, as the task's reference will be dropped
|
||||
on completion of the policy installation.
|
||||
|
||||
During run-time "usage" of the policy, we attempt to minimize atomic operations
|
||||
on the reference count, as this can lead to cache lines bouncing between cpus
|
||||
and NUMA nodes. "Usage" here means one of the following:
|
||||
|
||||
1) querying of the policy, either by the task itself [using the get_mempolicy()
|
||||
API discussed below] or by another task using the /proc/<pid>/numa_maps
|
||||
interface.
|
||||
|
||||
2) examination of the policy to determine the policy mode and associated node
|
||||
or node lists, if any, for page allocation. This is considered a "hot
|
||||
path". Note that for MPOL_BIND, the "usage" extends across the entire
|
||||
allocation process, which may sleep during page reclaimation, because the
|
||||
BIND policy nodemask is used, by reference, to filter ineligible nodes.
|
||||
|
||||
We can avoid taking an extra reference during the usages listed above as
|
||||
follows:
|
||||
|
||||
1) we never need to get/free the system default policy as this is never
|
||||
changed nor freed, once the system is up and running.
|
||||
|
||||
2) for querying the policy, we do not need to take an extra reference on the
|
||||
target task's task policy nor vma policies because we always acquire the
|
||||
task's mm's mmap_sem for read during the query. The set_mempolicy() and
|
||||
mbind() APIs [see below] always acquire the mmap_sem for write when
|
||||
installing or replacing task or vma policies. Thus, there is no possibility
|
||||
of a task or thread freeing a policy while another task or thread is
|
||||
querying it.
|
||||
|
||||
3) Page allocation usage of task or vma policy occurs in the fault path where
|
||||
we hold them mmap_sem for read. Again, because replacing the task or vma
|
||||
policy requires that the mmap_sem be held for write, the policy can't be
|
||||
freed out from under us while we're using it for page allocation.
|
||||
|
||||
4) Shared policies require special consideration. One task can replace a
|
||||
shared memory policy while another task, with a distinct mmap_sem, is
|
||||
querying or allocating a page based on the policy. To resolve this
|
||||
potential race, the shared policy infrastructure adds an extra reference
|
||||
to the shared policy during lookup while holding a spin lock on the shared
|
||||
policy management structure. This requires that we drop this extra
|
||||
reference when we're finished "using" the policy. We must drop the
|
||||
extra reference on shared policies in the same query/allocation paths
|
||||
used for non-shared policies. For this reason, shared policies are marked
|
||||
as such, and the extra reference is dropped "conditionally"--i.e., only
|
||||
for shared policies.
|
||||
|
||||
Because of this extra reference counting, and because we must lookup
|
||||
shared policies in a tree structure under spinlock, shared policies are
|
||||
more expensive to use in the page allocation path. This is especially
|
||||
true for shared policies on shared memory regions shared by tasks running
|
||||
on different NUMA nodes. This extra overhead can be avoided by always
|
||||
falling back to task or system default policy for shared memory regions,
|
||||
or by prefaulting the entire shared memory region into memory and locking
|
||||
it down. However, this might not be appropriate for all applications.
|
||||
|
||||
.. _memory_policy_apis:
|
||||
|
||||
Memory Policy APIs
|
||||
==================
|
||||
|
||||
Linux supports 3 system calls for controlling memory policy. These APIS
|
||||
always affect only the calling task, the calling task's address space, or
|
||||
some shared object mapped into the calling task's address space.
|
||||
|
||||
.. note::
|
||||
the headers that define these APIs and the parameter data types for
|
||||
user space applications reside in a package that is not part of the
|
||||
Linux kernel. The kernel system call interfaces, with the 'sys\_'
|
||||
prefix, are defined in <linux/syscalls.h>; the mode and flag
|
||||
definitions are defined in <linux/mempolicy.h>.
|
||||
|
||||
Set [Task] Memory Policy::
|
||||
|
||||
long set_mempolicy(int mode, const unsigned long *nmask,
|
||||
unsigned long maxnode);
|
||||
|
||||
Set's the calling task's "task/process memory policy" to mode
|
||||
specified by the 'mode' argument and the set of nodes defined by
|
||||
'nmask'. 'nmask' points to a bit mask of node ids containing at least
|
||||
'maxnode' ids. Optional mode flags may be passed by combining the
|
||||
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
|
||||
MPOL_F_STATIC_NODES).
|
||||
|
||||
See the set_mempolicy(2) man page for more details
|
||||
|
||||
|
||||
Get [Task] Memory Policy or Related Information::
|
||||
|
||||
long get_mempolicy(int *mode,
|
||||
const unsigned long *nmask, unsigned long maxnode,
|
||||
void *addr, int flags);
|
||||
|
||||
Queries the "task/process memory policy" of the calling task, or the
|
||||
policy or location of a specified virtual address, depending on the
|
||||
'flags' argument.
|
||||
|
||||
See the get_mempolicy(2) man page for more details
|
||||
|
||||
|
||||
Install VMA/Shared Policy for a Range of Task's Address Space::
|
||||
|
||||
long mbind(void *start, unsigned long len, int mode,
|
||||
const unsigned long *nmask, unsigned long maxnode,
|
||||
unsigned flags);
|
||||
|
||||
mbind() installs the policy specified by (mode, nmask, maxnodes) as a
|
||||
VMA policy for the range of the calling task's address space specified
|
||||
by the 'start' and 'len' arguments. Additional actions may be
|
||||
requested via the 'flags' argument.
|
||||
|
||||
See the mbind(2) man page for more details.
|
||||
|
||||
Memory Policy Command Line Interface
|
||||
====================================
|
||||
|
||||
Although not strictly part of the Linux implementation of memory policy,
|
||||
a command line tool, numactl(8), exists that allows one to:
|
||||
|
||||
+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
|
||||
exec(2)
|
||||
|
||||
+ set the shared policy for a shared memory segment via mbind(2)
|
||||
|
||||
The numactl(8) tool is packaged with the run-time version of the library
|
||||
containing the memory policy system call wrappers. Some distributions
|
||||
package the headers and compile-time libraries in a separate development
|
||||
package.
|
||||
|
||||
.. _mem_pol_and_cpusets:
|
||||
|
||||
Memory Policies and cpusets
|
||||
===========================
|
||||
|
||||
Memory policies work within cpusets as described above. For memory policies
|
||||
that require a node or set of nodes, the nodes are restricted to the set of
|
||||
nodes whose memories are allowed by the cpuset constraints. If the nodemask
|
||||
specified for the policy contains nodes that are not allowed by the cpuset and
|
||||
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
|
||||
specified for the policy and the set of nodes with memory is used. If the
|
||||
result is the empty set, the policy is considered invalid and cannot be
|
||||
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
|
||||
onto and folded into the task's set of allowed nodes as previously described.
|
||||
|
||||
The interaction of memory policies and cpusets can be problematic when tasks
|
||||
in two cpusets share access to a memory region, such as shared memory segments
|
||||
created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
|
||||
any of the tasks install shared policy on the region, only nodes whose
|
||||
memories are allowed in both cpusets may be used in the policies. Obtaining
|
||||
this information requires "stepping outside" the memory policy APIs to use the
|
||||
cpuset information and requires that one know in what cpusets other task might
|
||||
be attaching to the shared region. Furthermore, if the cpusets' allowed
|
||||
memory sets are disjoint, "local" allocation is the only valid policy.
|
201
Documentation/admin-guide/mm/pagemap.rst
Normal file
201
Documentation/admin-guide/mm/pagemap.rst
Normal file
@@ -0,0 +1,201 @@
|
||||
.. _pagemap:
|
||||
|
||||
=============================
|
||||
Examining Process Page Tables
|
||||
=============================
|
||||
|
||||
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
|
||||
userspace programs to examine the page tables and related information by
|
||||
reading files in ``/proc``.
|
||||
|
||||
There are four components to pagemap:
|
||||
|
||||
* ``/proc/pid/pagemap``. This file lets a userspace process find out which
|
||||
physical frame each virtual page is mapped to. It contains one 64-bit
|
||||
value for each virtual page, containing the following data (from
|
||||
``fs/proc/task_mmu.c``, above pagemap_read):
|
||||
|
||||
* Bits 0-54 page frame number (PFN) if present
|
||||
* Bits 0-4 swap type if swapped
|
||||
* Bits 5-54 swap offset if swapped
|
||||
* Bit 55 pte is soft-dirty (see
|
||||
:ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
|
||||
* Bit 56 page exclusively mapped (since 4.2)
|
||||
* Bits 57-60 zero
|
||||
* Bit 61 page is file-page or shared-anon (since 3.5)
|
||||
* Bit 62 page swapped
|
||||
* Bit 63 page present
|
||||
|
||||
Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
|
||||
In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from
|
||||
4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
|
||||
Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
|
||||
|
||||
If the page is not present but in swap, then the PFN contains an
|
||||
encoding of the swap file number and the page's offset into the
|
||||
swap. Unmapped pages return a null PFN. This allows determining
|
||||
precisely which pages are mapped (or in swap) and comparing mapped
|
||||
pages between processes.
|
||||
|
||||
Efficient users of this interface will use ``/proc/pid/maps`` to
|
||||
determine which areas of memory are actually mapped and llseek to
|
||||
skip over unmapped regions.
|
||||
|
||||
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
|
||||
times each page is mapped, indexed by PFN.
|
||||
|
||||
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
|
||||
page, indexed by PFN.
|
||||
|
||||
The flags are (from ``fs/proc/page.c``, above kpageflags_read):
|
||||
|
||||
0. LOCKED
|
||||
1. ERROR
|
||||
2. REFERENCED
|
||||
3. UPTODATE
|
||||
4. DIRTY
|
||||
5. LRU
|
||||
6. ACTIVE
|
||||
7. SLAB
|
||||
8. WRITEBACK
|
||||
9. RECLAIM
|
||||
10. BUDDY
|
||||
11. MMAP
|
||||
12. ANON
|
||||
13. SWAPCACHE
|
||||
14. SWAPBACKED
|
||||
15. COMPOUND_HEAD
|
||||
16. COMPOUND_TAIL
|
||||
17. HUGE
|
||||
18. UNEVICTABLE
|
||||
19. HWPOISON
|
||||
20. NOPAGE
|
||||
21. KSM
|
||||
22. THP
|
||||
23. BALLOON
|
||||
24. ZERO_PAGE
|
||||
25. IDLE
|
||||
|
||||
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
|
||||
memory cgroup each page is charged to, indexed by PFN. Only available when
|
||||
CONFIG_MEMCG is set.
|
||||
|
||||
Short descriptions to the page flags
|
||||
====================================
|
||||
|
||||
0 - LOCKED
|
||||
page is being locked for exclusive access, e.g. by undergoing read/write IO
|
||||
7 - SLAB
|
||||
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
||||
When compound page is used, SLUB/SLQB will only set this flag on the head
|
||||
page; SLOB will not flag it at all.
|
||||
10 - BUDDY
|
||||
a free memory block managed by the buddy system allocator
|
||||
The buddy system organizes free memory in blocks of various orders.
|
||||
An order N block has 2^N physically contiguous pages, with the BUDDY flag
|
||||
set for and _only_ for the first page.
|
||||
15 - COMPOUND_HEAD
|
||||
A compound page with order N consists of 2^N physically contiguous pages.
|
||||
A compound page with order 2 takes the form of "HTTT", where H donates its
|
||||
head page and T donates its tail page(s). The major consumers of compound
|
||||
pages are hugeTLB pages
|
||||
(:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
|
||||
the SLUB etc. memory allocators and various device drivers.
|
||||
However in this interface, only huge/giga pages are made visible
|
||||
to end users.
|
||||
16 - COMPOUND_TAIL
|
||||
A compound page tail (see description above).
|
||||
17 - HUGE
|
||||
this is an integral part of a HugeTLB page
|
||||
19 - HWPOISON
|
||||
hardware detected memory corruption on this page: don't touch the data!
|
||||
20 - NOPAGE
|
||||
no page frame exists at the requested address
|
||||
21 - KSM
|
||||
identical memory pages dynamically shared between one or more processes
|
||||
22 - THP
|
||||
contiguous pages which construct transparent hugepages
|
||||
23 - BALLOON
|
||||
balloon compaction page
|
||||
24 - ZERO_PAGE
|
||||
zero page for pfn_zero or huge_zero page
|
||||
25 - IDLE
|
||||
page has not been accessed since it was marked idle (see
|
||||
:ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
|
||||
Note that this flag may be stale in case the page was accessed via
|
||||
a PTE. To make sure the flag is up-to-date one has to read
|
||||
``/sys/kernel/mm/page_idle/bitmap`` first.
|
||||
|
||||
IO related page flags
|
||||
---------------------
|
||||
|
||||
1 - ERROR
|
||||
IO error occurred
|
||||
3 - UPTODATE
|
||||
page has up-to-date data
|
||||
ie. for file backed page: (in-memory data revision >= on-disk one)
|
||||
4 - DIRTY
|
||||
page has been written to, hence contains new data
|
||||
i.e. for file backed page: (in-memory data revision > on-disk one)
|
||||
8 - WRITEBACK
|
||||
page is being synced to disk
|
||||
|
||||
LRU related page flags
|
||||
----------------------
|
||||
|
||||
5 - LRU
|
||||
page is in one of the LRU lists
|
||||
6 - ACTIVE
|
||||
page is in the active LRU list
|
||||
18 - UNEVICTABLE
|
||||
page is in the unevictable (non-)LRU list It is somehow pinned and
|
||||
not a candidate for LRU page reclaims, e.g. ramfs pages,
|
||||
shmctl(SHM_LOCK) and mlock() memory segments
|
||||
2 - REFERENCED
|
||||
page has been referenced since last LRU list enqueue/requeue
|
||||
9 - RECLAIM
|
||||
page will be reclaimed soon after its pageout IO completed
|
||||
11 - MMAP
|
||||
a memory mapped page
|
||||
12 - ANON
|
||||
a memory mapped page that is not part of a file
|
||||
13 - SWAPCACHE
|
||||
page is mapped to swap space, i.e. has an associated swap entry
|
||||
14 - SWAPBACKED
|
||||
page is backed by swap/RAM
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to query the
|
||||
above flags.
|
||||
|
||||
Using pagemap to do something useful
|
||||
====================================
|
||||
|
||||
The general procedure for using pagemap to find out about a process' memory
|
||||
usage goes like this:
|
||||
|
||||
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
|
||||
mapped to what.
|
||||
2. Select the maps you are interested in -- all of them, or a particular
|
||||
library, or the stack or the heap, etc.
|
||||
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
|
||||
4. Read a u64 for each page from pagemap.
|
||||
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
|
||||
just read, seek to that entry in the file, and read the data you want.
|
||||
|
||||
For example, to find the "unique set size" (USS), which is the amount of
|
||||
memory that a process is using that is not shared with any other process,
|
||||
you can go through every map in the process, find the PFNs, look those up
|
||||
in kpagecount, and tally up the number of pages that are only referenced
|
||||
once.
|
||||
|
||||
Other notes
|
||||
===========
|
||||
|
||||
Reading from any of the files will return -EINVAL if you are not starting
|
||||
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
|
||||
into the file), or if the size of the read is not a multiple of 8 bytes.
|
||||
|
||||
Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
|
||||
always 12 at most architectures). Since Linux 3.11 their meaning changes
|
||||
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
|
||||
flags unconditionally.
|
47
Documentation/admin-guide/mm/soft-dirty.rst
Normal file
47
Documentation/admin-guide/mm/soft-dirty.rst
Normal file
@@ -0,0 +1,47 @@
|
||||
.. _soft_dirty:
|
||||
|
||||
===============
|
||||
Soft-Dirty PTEs
|
||||
===============
|
||||
|
||||
The soft-dirty is a bit on a PTE which helps to track which pages a task
|
||||
writes to. In order to do this tracking one should
|
||||
|
||||
1. Clear soft-dirty bits from the task's PTEs.
|
||||
|
||||
This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
|
||||
task in question.
|
||||
|
||||
2. Wait some time.
|
||||
|
||||
3. Read soft-dirty bits from the PTEs.
|
||||
|
||||
This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
|
||||
64-bit qword is the soft-dirty one. If set, the respective PTE was
|
||||
written to since step 1.
|
||||
|
||||
|
||||
Internally, to do this tracking, the writable bit is cleared from PTEs
|
||||
when the soft-dirty bit is cleared. So, after this, when the task tries to
|
||||
modify a page at some virtual address the #PF occurs and the kernel sets
|
||||
the soft-dirty bit on the respective PTE.
|
||||
|
||||
Note, that although all the task's address space is marked as r/o after the
|
||||
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
|
||||
This is so, since the pages are still mapped to physical memory, and thus all
|
||||
the kernel does is finds this fact out and puts both writable and soft-dirty
|
||||
bits on the PTE.
|
||||
|
||||
While in most cases tracking memory changes by #PF-s is more than enough
|
||||
there is still a scenario when we can lose soft dirty bits -- a task
|
||||
unmaps a previously mapped memory region and then maps a new one at exactly
|
||||
the same place. When unmap is called, the kernel internally clears PTE values
|
||||
including soft dirty bits. To notify user space application about such
|
||||
memory region renewal the kernel always marks new memory regions (and
|
||||
expanded regions) as soft dirty.
|
||||
|
||||
This feature is actively used by the checkpoint-restore project. You
|
||||
can find more details about it on http://criu.org
|
||||
|
||||
|
||||
-- Pavel Emelyanov, Apr 9, 2013
|
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
@@ -0,0 +1,418 @@
|
||||
.. _admin_guide_transhuge:
|
||||
|
||||
============================
|
||||
Transparent Hugepage Support
|
||||
============================
|
||||
|
||||
Objective
|
||||
=========
|
||||
|
||||
Performance critical computing applications dealing with large memory
|
||||
working sets are already running on top of libhugetlbfs and in turn
|
||||
hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
|
||||
using huge pages for the backing of virtual memory with huge pages
|
||||
that supports the automatic promotion and demotion of page sizes and
|
||||
without the shortcomings of hugetlbfs.
|
||||
|
||||
Currently THP only works for anonymous memory mappings and tmpfs/shmem.
|
||||
But in the future it can expand to other filesystems.
|
||||
|
||||
.. note::
|
||||
in the examples below we presume that the basic page size is 4K and
|
||||
the huge page size is 2M, although the actual numbers may vary
|
||||
depending on the CPU architecture.
|
||||
|
||||
The reason applications are running faster is because of two
|
||||
factors. The first factor is almost completely irrelevant and it's not
|
||||
of significant interest because it'll also have the downside of
|
||||
requiring larger clear-page copy-page in page faults which is a
|
||||
potentially negative effect. The first factor consists in taking a
|
||||
single page fault for each 2M virtual region touched by userland (so
|
||||
reducing the enter/exit kernel frequency by a 512 times factor). This
|
||||
only matters the first time the memory is accessed for the lifetime of
|
||||
a memory mapping. The second long lasting and much more important
|
||||
factor will affect all subsequent accesses to the memory for the whole
|
||||
runtime of the application. The second factor consist of two
|
||||
components:
|
||||
|
||||
1) the TLB miss will run faster (especially with virtualization using
|
||||
nested pagetables but almost always also on bare metal without
|
||||
virtualization)
|
||||
|
||||
2) a single TLB entry will be mapping a much larger amount of virtual
|
||||
memory in turn reducing the number of TLB misses. With
|
||||
virtualization and nested pagetables the TLB can be mapped of
|
||||
larger size only if both KVM and the Linux guest are using
|
||||
hugepages but a significant speedup already happens if only one of
|
||||
the two is using hugepages just because of the fact the TLB miss is
|
||||
going to run faster.
|
||||
|
||||
THP can be enabled system wide or restricted to certain tasks or even
|
||||
memory ranges inside task's address space. Unless THP is completely
|
||||
disabled, there is ``khugepaged`` daemon that scans memory and
|
||||
collapses sequences of basic pages into huge pages.
|
||||
|
||||
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
|
||||
interface and using madivse(2) and prctl(2) system calls.
|
||||
|
||||
Transparent Hugepage Support maximizes the usefulness of free memory
|
||||
if compared to the reservation approach of hugetlbfs by allowing all
|
||||
unused memory to be used as cache or other movable (or even unmovable
|
||||
entities). It doesn't require reservation to prevent hugepage
|
||||
allocation failures to be noticeable from userland. It allows paging
|
||||
and all other advanced VM features to be available on the
|
||||
hugepages. It requires no modifications for applications to take
|
||||
advantage of it.
|
||||
|
||||
Applications however can be further optimized to take advantage of
|
||||
this feature, like for example they've been optimized before to avoid
|
||||
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
||||
is by far not mandatory and khugepaged already can take care of long
|
||||
lived page allocations even for hugepage unaware applications that
|
||||
deals with large amounts of memory.
|
||||
|
||||
In certain cases when hugepages are enabled system wide, application
|
||||
may end up allocating more memory resources. An application may mmap a
|
||||
large region but only touch 1 byte of it, in that case a 2M page might
|
||||
be allocated instead of a 4k page for no good. This is why it's
|
||||
possible to disable hugepages system-wide and to only have them inside
|
||||
MADV_HUGEPAGE madvise regions.
|
||||
|
||||
Embedded systems should enable hugepages only inside madvise regions
|
||||
to eliminate any risk of wasting any precious byte of memory and to
|
||||
only run faster.
|
||||
|
||||
Applications that gets a lot of benefit from hugepages and that don't
|
||||
risk to lose memory by using hugepages, should use
|
||||
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
||||
|
||||
.. _thp_sysfs:
|
||||
|
||||
sysfs
|
||||
=====
|
||||
|
||||
Global THP controls
|
||||
-------------------
|
||||
|
||||
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
||||
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
||||
regions (to avoid the risk of consuming more memory resources) or enabled
|
||||
system wide. This can be achieved with one of::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
|
||||
It's also possible to limit defrag efforts in the VM to generate
|
||||
anonymous hugepages in case they're not immediately free to madvise
|
||||
regions or to never try to defrag memory and simply fallback to regular
|
||||
pages unless hugepages are immediately available. Clearly if we spend CPU
|
||||
time to defrag memory, we would expect to gain even more by the fact we
|
||||
use hugepages later instead of regular pages. This isn't always
|
||||
guaranteed, but it may be more likely in case the allocation is for a
|
||||
MADV_HUGEPAGE region.
|
||||
|
||||
::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
|
||||
always
|
||||
means that an application requesting THP will stall on
|
||||
allocation failure and directly reclaim pages and compact
|
||||
memory in an effort to allocate a THP immediately. This may be
|
||||
desirable for virtual machines that benefit heavily from THP
|
||||
use and are willing to delay the VM start to utilise them.
|
||||
|
||||
defer
|
||||
means that an application will wake kswapd in the background
|
||||
to reclaim pages and wake kcompactd to compact memory so that
|
||||
THP is available in the near future. It's the responsibility
|
||||
of khugepaged to then install the THP pages later.
|
||||
|
||||
defer+madvise
|
||||
will enter direct reclaim and compaction like ``always``, but
|
||||
only for regions that have used madvise(MADV_HUGEPAGE); all
|
||||
other regions will wake kswapd in the background to reclaim
|
||||
pages and wake kcompactd to compact memory so that THP is
|
||||
available in the near future.
|
||||
|
||||
madvise
|
||||
will enter direct reclaim like ``always`` but only for regions
|
||||
that are have used madvise(MADV_HUGEPAGE). This is the default
|
||||
behaviour.
|
||||
|
||||
never
|
||||
should be self-explanatory.
|
||||
|
||||
By default kernel tries to use huge zero page on read page fault to
|
||||
anonymous mapping. It's possible to disable huge zero page by writing 0
|
||||
or enable it back by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
|
||||
Some userspace (such as a test program, or an optimized memory allocation
|
||||
library) may want to know the size (in bytes) of a transparent hugepage::
|
||||
|
||||
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
||||
|
||||
khugepaged will be automatically started when
|
||||
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
|
||||
be automatically shutdown if it's set to "never".
|
||||
|
||||
Khugepaged controls
|
||||
-------------------
|
||||
|
||||
khugepaged runs usually at low frequency so while one may not want to
|
||||
invoke defrag algorithms synchronously during the page faults, it
|
||||
should be worth invoking defrag at least in khugepaged. However it's
|
||||
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||
defrag in khugepaged by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
|
||||
You can also control how many pages khugepaged should scan at each
|
||||
pass::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||
|
||||
and how many milliseconds to wait in khugepaged between each pass (you
|
||||
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||
|
||||
and how many milliseconds to wait in khugepaged if there's an hugepage
|
||||
allocation failure to throttle the next allocation attempt::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||
|
||||
The khugepaged progress can be seen in the number of pages collapsed::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||
|
||||
for each pass::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||
|
||||
``max_ptes_none`` specifies how many extra small pages (that are
|
||||
not already mapped) can be allocated when collapsing a group
|
||||
of small pages into one large page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
||||
|
||||
A higher value leads to use additional memory for programs.
|
||||
A lower value leads to gain less thp performance. Value of
|
||||
max_ptes_none can waste cpu time very little, you can
|
||||
ignore it.
|
||||
|
||||
``max_ptes_swap`` specifies how many pages can be brought in from
|
||||
swap when collapsing a group of pages into a transparent huge page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
||||
|
||||
A higher value can cause excessive swap IO and waste
|
||||
memory. A lower value can prevent THPs from being
|
||||
collapsed, resulting fewer pages being collapsed into
|
||||
THPs, and lower memory access performance.
|
||||
|
||||
Boot parameter
|
||||
==============
|
||||
|
||||
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||
Support by passing the parameter ``transparent_hugepage=always`` or
|
||||
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
||||
to the kernel command line.
|
||||
|
||||
Hugepages in tmpfs/shmem
|
||||
========================
|
||||
|
||||
You can control hugepage allocation policy in tmpfs with mount option
|
||||
``huge=``. It can have following values:
|
||||
|
||||
always
|
||||
Attempt to allocate huge pages every time we need a new page;
|
||||
|
||||
never
|
||||
Do not allocate huge pages;
|
||||
|
||||
within_size
|
||||
Only allocate huge page if it will be fully within i_size.
|
||||
Also respect fadvise()/madvise() hints;
|
||||
|
||||
advise
|
||||
Only allocate huge pages if requested with fadvise()/madvise();
|
||||
|
||||
The default policy is ``never``.
|
||||
|
||||
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
||||
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
||||
from being allocated.
|
||||
|
||||
There's also sysfs knob to control hugepage allocation policy for internal
|
||||
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
||||
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
||||
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
||||
|
||||
In addition to policies listed above, shmem_enabled allows two further
|
||||
values:
|
||||
|
||||
deny
|
||||
For use in emergencies, to force the huge option off from
|
||||
all mounts;
|
||||
force
|
||||
Force the huge option on for all - very useful for testing;
|
||||
|
||||
Need of application restart
|
||||
===========================
|
||||
|
||||
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
||||
future behavior. So to make them effective you need to restart any
|
||||
application that could have been using hugepages. This also applies to the
|
||||
regions registered in khugepaged.
|
||||
|
||||
Monitoring usage
|
||||
================
|
||||
|
||||
The number of anonymous transparent huge pages currently used by the
|
||||
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
||||
To identify what applications are using anonymous transparent huge pages,
|
||||
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
||||
for each mapping.
|
||||
|
||||
The number of file transparent huge pages mapped to userspace is available
|
||||
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||
To identify what applications are mapping file transparent huge pages, it
|
||||
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
||||
for each mapping.
|
||||
|
||||
Note that reading the smaps file is expensive and reading it
|
||||
frequently will incur overhead.
|
||||
|
||||
There are a number of counters in ``/proc/vmstat`` that may be used to
|
||||
monitor how successfully the system is providing huge pages for use.
|
||||
|
||||
thp_fault_alloc
|
||||
is incremented every time a huge page is successfully
|
||||
allocated to handle a page fault. This applies to both the
|
||||
first time a page is faulted and for COW faults.
|
||||
|
||||
thp_collapse_alloc
|
||||
is incremented by khugepaged when it has found
|
||||
a range of pages to collapse into one huge page and has
|
||||
successfully allocated a new huge page to store the data.
|
||||
|
||||
thp_fault_fallback
|
||||
is incremented if a page fault fails to allocate
|
||||
a huge page and instead falls back to using small pages.
|
||||
|
||||
thp_collapse_alloc_failed
|
||||
is incremented if khugepaged found a range
|
||||
of pages that should be collapsed into one huge page but failed
|
||||
the allocation.
|
||||
|
||||
thp_file_alloc
|
||||
is incremented every time a file huge page is successfully
|
||||
allocated.
|
||||
|
||||
thp_file_mapped
|
||||
is incremented every time a file huge page is mapped into
|
||||
user address space.
|
||||
|
||||
thp_split_page
|
||||
is incremented every time a huge page is split into base
|
||||
pages. This can happen for a variety of reasons but a common
|
||||
reason is that a huge page is old and is being reclaimed.
|
||||
This action implies splitting all PMD the page mapped with.
|
||||
|
||||
thp_split_page_failed
|
||||
is incremented if kernel fails to split huge
|
||||
page. This can happen if the page was pinned by somebody.
|
||||
|
||||
thp_deferred_split_page
|
||||
is incremented when a huge page is put onto split
|
||||
queue. This happens when a huge page is partially unmapped and
|
||||
splitting it would free up some memory. Pages on split queue are
|
||||
going to be split under memory pressure.
|
||||
|
||||
thp_split_pmd
|
||||
is incremented every time a PMD split into table of PTEs.
|
||||
This can happen, for instance, when application calls mprotect() or
|
||||
munmap() on part of huge page. It doesn't split huge page, only
|
||||
page table entry.
|
||||
|
||||
thp_zero_page_alloc
|
||||
is incremented every time a huge zero page is
|
||||
successfully allocated. It includes allocations which where
|
||||
dropped due race with other allocation. Note, it doesn't count
|
||||
every map of the huge zero page, only its allocation.
|
||||
|
||||
thp_zero_page_alloc_failed
|
||||
is incremented if kernel fails to allocate
|
||||
huge zero page and falls back to using small pages.
|
||||
|
||||
thp_swpout
|
||||
is incremented every time a huge page is swapout in one
|
||||
piece without splitting.
|
||||
|
||||
thp_swpout_fallback
|
||||
is incremented if a huge page has to be split before swapout.
|
||||
Usually because failed to allocate some continuous swap space
|
||||
for the huge page.
|
||||
|
||||
As the system ages, allocating huge pages may be expensive as the
|
||||
system uses memory compaction to copy data around memory to free a
|
||||
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
||||
monitor this overhead.
|
||||
|
||||
compact_stall
|
||||
is incremented every time a process stalls to run
|
||||
memory compaction so that a huge page is free for use.
|
||||
|
||||
compact_success
|
||||
is incremented if the system compacted memory and
|
||||
freed a huge page for use.
|
||||
|
||||
compact_fail
|
||||
is incremented if the system tries to compact memory
|
||||
but failed.
|
||||
|
||||
compact_pages_moved
|
||||
is incremented each time a page is moved. If
|
||||
this value is increasing rapidly, it implies that the system
|
||||
is copying a lot of data to satisfy the huge page allocation.
|
||||
It is possible that the cost of copying exceeds any savings
|
||||
from reduced TLB misses.
|
||||
|
||||
compact_pagemigrate_failed
|
||||
is incremented when the underlying mechanism
|
||||
for moving a page failed.
|
||||
|
||||
compact_blocks_moved
|
||||
is incremented each time memory compaction examines
|
||||
a huge page aligned range of pages.
|
||||
|
||||
It is possible to establish how long the stalls were using the function
|
||||
tracer to record how long was spent in __alloc_pages_nodemask and
|
||||
using the mm_page_alloc tracepoint to identify which allocations were
|
||||
for huge pages.
|
||||
|
||||
Optimizing the applications
|
||||
===========================
|
||||
|
||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||
memory region, the mmap region has to be hugepage naturally
|
||||
aligned. posix_memalign() can provide that guarantee.
|
||||
|
||||
Hugetlbfs
|
||||
=========
|
||||
|
||||
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||
support enabled just fine as always. No difference can be noted in
|
||||
hugetlbfs other than there will be less overall fragmentation. All
|
||||
usual features belonging to hugetlbfs are preserved and
|
||||
unaffected. libhugetlbfs will also work fine as usual.
|
241
Documentation/admin-guide/mm/userfaultfd.rst
Normal file
241
Documentation/admin-guide/mm/userfaultfd.rst
Normal file
@@ -0,0 +1,241 @@
|
||||
.. _userfaultfd:
|
||||
|
||||
===========
|
||||
Userfaultfd
|
||||
===========
|
||||
|
||||
Objective
|
||||
=========
|
||||
|
||||
Userfaults allow the implementation of on-demand paging from userland
|
||||
and more generally they allow userland to take control of various
|
||||
memory page faults, something otherwise only the kernel code could do.
|
||||
|
||||
For example userfaults allows a proper and more optimal implementation
|
||||
of the PROT_NONE+SIGSEGV trick.
|
||||
|
||||
Design
|
||||
======
|
||||
|
||||
Userfaults are delivered and resolved through the userfaultfd syscall.
|
||||
|
||||
The userfaultfd (aside from registering and unregistering virtual
|
||||
memory ranges) provides two primary functionalities:
|
||||
|
||||
1) read/POLLIN protocol to notify a userland thread of the faults
|
||||
happening
|
||||
|
||||
2) various UFFDIO_* ioctls that can manage the virtual memory regions
|
||||
registered in the userfaultfd that allows userland to efficiently
|
||||
resolve the userfaults it receives via 1) or to manage the virtual
|
||||
memory in the background
|
||||
|
||||
The real advantage of userfaults if compared to regular virtual memory
|
||||
management of mremap/mprotect is that the userfaults in all their
|
||||
operations never involve heavyweight structures like vmas (in fact the
|
||||
userfaultfd runtime load never takes the mmap_sem for writing).
|
||||
|
||||
Vmas are not suitable for page- (or hugepage) granular fault tracking
|
||||
when dealing with virtual address spaces that could span
|
||||
Terabytes. Too many vmas would be needed for that.
|
||||
|
||||
The userfaultfd once opened by invoking the syscall, can also be
|
||||
passed using unix domain sockets to a manager process, so the same
|
||||
manager process could handle the userfaults of a multitude of
|
||||
different processes without them being aware about what is going on
|
||||
(well of course unless they later try to use the userfaultfd
|
||||
themselves on the same region the manager is already tracking, which
|
||||
is a corner case that would currently return -EBUSY).
|
||||
|
||||
API
|
||||
===
|
||||
|
||||
When first opened the userfaultfd must be enabled invoking the
|
||||
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
||||
a later API version) which will specify the read/POLLIN protocol
|
||||
userland intends to speak on the UFFD and the uffdio_api.features
|
||||
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
|
||||
requested uffdio_api.api is spoken also by the running kernel and the
|
||||
requested features are going to be enabled) will return into
|
||||
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
|
||||
respectively all the available features of the read(2) protocol and
|
||||
the generic ioctl available.
|
||||
|
||||
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
|
||||
defines what memory types are supported by the userfaultfd and what
|
||||
events, except page fault notifications, may be generated.
|
||||
|
||||
If the kernel supports registering userfaultfd ranges on hugetlbfs
|
||||
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
|
||||
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
|
||||
set if the kernel supports registering userfaultfd ranges on shared
|
||||
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
|
||||
MAP_SHARED, memfd_create, etc).
|
||||
|
||||
The userland application that wants to use userfaultfd with hugetlbfs
|
||||
or shared memory need to set the corresponding flag in
|
||||
uffdio_api.features to enable those features.
|
||||
|
||||
If the userland desires to receive notifications for events other than
|
||||
page faults, it has to verify that uffdio_api.features has appropriate
|
||||
UFFD_FEATURE_EVENT_* bits set. These events are described in more
|
||||
detail below in "Non-cooperative userfaultfd" section.
|
||||
|
||||
Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
|
||||
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
|
||||
register a memory range in the userfaultfd by setting the
|
||||
uffdio_register structure accordingly. The uffdio_register.mode
|
||||
bitmask will specify to the kernel which kind of faults to track for
|
||||
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
|
||||
pages). The UFFDIO_REGISTER ioctl will return the
|
||||
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
|
||||
userfaults on the range registered. Not all ioctls will necessarily be
|
||||
supported for all memory types depending on the underlying virtual
|
||||
memory backend (anonymous memory vs tmpfs vs real filebacked
|
||||
mappings).
|
||||
|
||||
Userland can use the uffdio_register.ioctls to manage the virtual
|
||||
address space in the background (to add or potentially also remove
|
||||
memory from the userfaultfd registered range). This means a userfault
|
||||
could be triggering just before userland maps in the background the
|
||||
user-faulted page.
|
||||
|
||||
The primary ioctl to resolve userfaults is UFFDIO_COPY. That
|
||||
atomically copies a page into the userfault registered range and wakes
|
||||
up the blocked userfaults (unless uffdio_copy.mode &
|
||||
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
|
||||
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
|
||||
half copied page since it'll keep userfaulting until the copy has
|
||||
finished.
|
||||
|
||||
QEMU/KVM
|
||||
========
|
||||
|
||||
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
||||
migration. Postcopy live migration is one form of memory
|
||||
externalization consisting of a virtual machine running with part or
|
||||
all of its memory residing on a different node in the cloud. The
|
||||
userfaultfd abstraction is generic enough that not a single line of
|
||||
KVM kernel code had to be modified in order to add postcopy live
|
||||
migration to QEMU.
|
||||
|
||||
Guest async page faults, FOLL_NOWAIT and all other GUP features work
|
||||
just fine in combination with userfaults. Userfaults trigger async
|
||||
page faults in the guest scheduler so those guest processes that
|
||||
aren't waiting for userfaults (i.e. network bound) can keep running in
|
||||
the guest vcpus.
|
||||
|
||||
It is generally beneficial to run one pass of precopy live migration
|
||||
just before starting postcopy live migration, in order to avoid
|
||||
generating userfaults for readonly guest regions.
|
||||
|
||||
The implementation of postcopy live migration currently uses one
|
||||
single bidirectional socket but in the future two different sockets
|
||||
will be used (to reduce the latency of the userfaults to the minimum
|
||||
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
|
||||
|
||||
The QEMU in the source node writes all pages that it knows are missing
|
||||
in the destination node, into the socket, and the migration thread of
|
||||
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
|
||||
ioctls on the userfaultfd in order to map the received pages into the
|
||||
guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
|
||||
|
||||
A different postcopy thread in the destination node listens with
|
||||
poll() to the userfaultfd in parallel. When a POLLIN event is
|
||||
generated after a userfault triggers, the postcopy thread read() from
|
||||
the userfaultfd and receives the fault address (or -EAGAIN in case the
|
||||
userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
|
||||
by the parallel QEMU migration thread).
|
||||
|
||||
After the QEMU postcopy thread (running in the destination node) gets
|
||||
the userfault address it writes the information about the missing page
|
||||
into the socket. The QEMU source node receives the information and
|
||||
roughly "seeks" to that page address and continues sending all
|
||||
remaining missing pages from that new page offset. Soon after that
|
||||
(just the time to flush the tcp_wmem queue through the network) the
|
||||
migration thread in the QEMU running in the destination node will
|
||||
receive the page that triggered the userfault and it'll map it as
|
||||
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
|
||||
was spontaneously sent by the source or if it was an urgent page
|
||||
requested through a userfault).
|
||||
|
||||
By the time the userfaults start, the QEMU in the destination node
|
||||
doesn't need to keep any per-page state bitmap relative to the live
|
||||
migration around and a single per-page bitmap has to be maintained in
|
||||
the QEMU running in the source node to know which pages are still
|
||||
missing in the destination node. The bitmap in the source node is
|
||||
checked to find which missing pages to send in round robin and we seek
|
||||
over it when receiving incoming userfaults. After sending each page of
|
||||
course the bitmap is updated accordingly. It's also useful to avoid
|
||||
sending the same page twice (in case the userfault is read by the
|
||||
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
||||
thread).
|
||||
|
||||
Non-cooperative userfaultfd
|
||||
===========================
|
||||
|
||||
When the userfaultfd is monitored by an external manager, the manager
|
||||
must be able to track changes in the process virtual memory
|
||||
layout. Userfaultfd can notify the manager about such changes using
|
||||
the same read(2) protocol as for the page fault notifications. The
|
||||
manager has to explicitly enable these events by setting appropriate
|
||||
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
||||
|
||||
UFFD_FEATURE_EVENT_FORK
|
||||
enable userfaultfd hooks for fork(). When this feature is
|
||||
enabled, the userfaultfd context of the parent process is
|
||||
duplicated into the newly created process. The manager
|
||||
receives UFFD_EVENT_FORK with file descriptor of the new
|
||||
userfaultfd context in the uffd_msg.fork.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMAP
|
||||
enable notifications about mremap() calls. When the
|
||||
non-cooperative process moves a virtual memory area to a
|
||||
different location, the manager will receive
|
||||
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
|
||||
new addresses of the area and its original length.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMOVE
|
||||
enable notifications about madvise(MADV_REMOVE) and
|
||||
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
|
||||
be generated upon these calls to madvise. The uffd_msg.remove
|
||||
will contain start and end addresses of the removed area.
|
||||
|
||||
UFFD_FEATURE_EVENT_UNMAP
|
||||
enable notifications about memory unmapping. The manager will
|
||||
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
|
||||
end addresses of the unmapped area.
|
||||
|
||||
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
||||
are pretty similar, they quite differ in the action expected from the
|
||||
userfaultfd manager. In the former case, the virtual memory is
|
||||
removed, but the area is not, the area remains monitored by the
|
||||
userfaultfd, and if a page fault occurs in that area it will be
|
||||
delivered to the manager. The proper resolution for such page fault is
|
||||
to zeromap the faulting address. However, in the latter case, when an
|
||||
area is unmapped, either explicitly (with munmap() system call), or
|
||||
implicitly (e.g. during mremap()), the area is removed and in turn the
|
||||
userfaultfd context for such area disappears too and the manager will
|
||||
not get further userland page faults from the removed area. Still, the
|
||||
notification is required in order to prevent manager from using
|
||||
UFFDIO_COPY on the unmapped area.
|
||||
|
||||
Unlike userland page faults which have to be synchronous and require
|
||||
explicit or implicit wakeup, all the events are delivered
|
||||
asynchronously and the non-cooperative process resumes execution as
|
||||
soon as manager executes read(). The userfaultfd manager should
|
||||
carefully synchronize calls to UFFDIO_COPY with the events
|
||||
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
|
||||
return -ENOSPC when the monitored process exits at the time of
|
||||
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
|
||||
its virtual memory layout simultaneously with outstanding UFFDIO_COPY
|
||||
operation.
|
||||
|
||||
The current asynchronous model of the event delivery is optimal for
|
||||
single threaded non-cooperative userfaultfd manager implementations. A
|
||||
synchronous event delivery model can be added later as a new
|
||||
userfaultfd feature to facilitate multithreading enhancements of the
|
||||
non cooperative manager, for example to allow UFFDIO_COPY ioctls to
|
||||
run in parallel to the event reception. Single threaded
|
||||
implementations should continue to use the current async event
|
||||
delivery model instead.
|
@@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
|
||||
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
|
||||
|
||||
B. Use Device Tree bindings, as described in
|
||||
``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``.
|
||||
``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
|
||||
For example::
|
||||
|
||||
reserved-memory {
|
||||
|
Reference in New Issue
Block a user