docs/admin-guide/mm: start moving here files from Documentation/vm
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and provide a coherent description of a certain
feature can be moved to Documentation/admin-guide/mm.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Committed by: Jonathan Corbet
parent 3a3f7e26e5
commit 1ad1335dc5
Documentation/vm/00-INDEX
@@ -12,14 +12,10 @@ highmem.rst
     - Outline of highmem and common issues.
 hmm.rst
     - Documentation of heterogeneous memory management
-hugetlbpage.rst
-    - a brief summary of hugetlbpage support in the Linux kernel.
 hugetlbfs_reserv.rst
     - A brief overview of hugetlbfs reservation design/implementation.
 hwpoison.rst
     - explains what hwpoison is
-idle_page_tracking.rst
-    - description of the idle page tracking feature.
 ksm.rst
     - how to use the Kernel Samepage Merging feature.
 mmu_notifier.rst
@@ -34,16 +30,12 @@ page_frags.rst
     - description of page fragments allocator
 page_migration.rst
     - description of page migration in NUMA systems.
-pagemap.rst
-    - pagemap, from the userspace perspective
 page_owner.rst
     - tracking about who allocated each page
 remap_file_pages.rst
     - a note about remap_file_pages() system call
 slub.rst
     - a short users guide for SLUB.
-soft-dirty.rst
-    - short explanation for soft-dirty PTEs
 split_page_table_lock.rst
     - Separate per-table lock to improve scalability of the old page_table_lock.
 swap_numa.rst
@@ -52,8 +44,6 @@ transhuge.rst
     - Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.rst
     - Unevictable LRU infrastructure
-userfaultfd.rst
-    - description of userfaultfd system call
 z3fold.txt
     - outline of z3fold allocator for storing compressed pages
 zsmalloc.rst
Documentation/vm/hugetlbpage.rst
@@ -1,381 +0,0 @@

.. _hugetlbpage:

=============
HugeTLB Pages
=============

Overview
========

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of multiple page size support
that is provided by most modern architectures. For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes, ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations. Typically this is a very scarce resource on a processor.
Operating systems try to make best use of the limited number of TLB resources.
This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.

Users can use the huge page support in the Linux kernel by either using the
mmap system call or standard SYSV shared memory system calls (shmget, shmat).

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool. It also displays
the default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.

The output of ``cat /proc/meminfo`` will include lines like::

    HugePages_Total: uuu
    HugePages_Free:  vvv
    HugePages_Rsvd:  www
    HugePages_Surp:  xxx
    Hugepagesize:    yyy kB
    Hugetlb:         zzz kB

where:

HugePages_Total
    is the size of the pool of huge pages.
HugePages_Free
    is the number of huge pages in the pool that are not yet
    allocated.
HugePages_Rsvd
    is short for "reserved," and is the number of huge pages for
    which a commitment to allocate from the pool has been made,
    but no allocation has yet been made. Reserved huge pages
    guarantee that an application will be able to allocate a
    huge page from the pool of huge pages at fault time.
HugePages_Surp
    is short for "surplus," and is the number of huge pages in
    the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
    maximum number of surplus huge pages is controlled by
    ``/proc/sys/vm/nr_overcommit_hugepages``.
Hugepagesize
    is the default huge page size (in kB).
Hugetlb
    is the total amount of memory (in kB) consumed by huge
    pages of all sizes.
    If huge pages of different sizes are in use, this number
    will exceed HugePages_Total \* Hugepagesize. To get more
    detailed information, please refer to
    ``/sys/kernel/mm/hugepages`` (described below).

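For a quick look at these fields, one can simply filter ``/proc/meminfo``;
for example::

    grep Huge /proc/meminfo
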
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.

``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.

Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages. See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.

The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' = the
number of huge pages requested. This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes. To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.

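As an illustration (the exact sizes depend on what the platform supports),
a boot command line that pre-allocates both 1G and 2M pages on x86 and makes
1G pages the default might contain::

    default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512
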
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::

    echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.

On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in the system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

System administrators may want to put this command in one of the local rc
init files. This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physically contiguous pages
is still very high. Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo. To check the per node
distribution of huge pages in a NUMA system, use::

    cat /sys/devices/system/node/node*/meminfo | fgrep Huge

``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

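For example, to let the pool grow by up to 20 surplus huge pages on demand::

    echo 20 > /proc/sys/vm/nr_overcommit_hugepages
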
When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages. Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value. The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.

Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages. This will occur even if
the number of surplus pages would exceed the overcommit value. As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
increased sufficiently, or the surplus huge pages go out of use and are freed--
no more surplus huge pages will be allowed to be allocated.

With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is::

    /sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::

    hugepages-${size}kB

Inside each of these directories, the same set of files will exist::

    nr_hugepages
    nr_hugepages_mempolicy
    nr_overcommit_hugepages
    free_hugepages
    resv_hugepages
    surplus_hugepages

which function as described above for the default huge page-sized case.

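For instance, to reserve four 1 GB pages through this interface (an
illustrative sketch; the directory name depends on the page sizes the
running kernel actually supports)::

    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
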
.. _mem_policy_and_hp_alloc:

Interaction of Task Memory Policy with Huge Page Allocation/Freeing
===================================================================

Whether huge pages are allocated and freed via the ``/proc`` interface or
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.

The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::

    numactl --interleave <node-list> echo 20 \
        >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly::

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
specified in <node-list>, depending on whether the number of persistent huge
pages is initially less than or greater than 20, respectively. No huge pages
will be allocated nor freed on any node not included in the specified
<node-list>.

When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:

#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst],
   persistent huge pages will be distributed across the node or nodes
   specified in the mempolicy as if "interleave" had been specified.
   However, if a node in the policy does not contain sufficient contiguous
   memory for a huge page, the allocation will not "fallback" to the nearest
   neighbor node with sufficient contiguous memory. To do this would cause
   undesirable imbalance in the distribution of the huge page pool, or
   possibly, allocation of persistent huge pages on nodes not allowed by
   the task's memory policy.

#. One or more nodes may be specified with the bind or interleave policy.
   If more than one node is specified with the preferred policy, only the
   lowest numeric id will be used. Local policy will select the node where
   the task is running at the time the nodes_allowed mask is constructed.
   For local policy to be deterministic, the task must be bound to a cpu or
   cpus in a single node. Otherwise, the task could be migrated to some
   other node at any time after launch and the resulting node will be
   indeterminate. Thus, local policy is not very useful for this purpose.
   Any of the other mempolicy modes may be used to specify a single node.

#. The nodes allowed mask will be derived from any non-default task mempolicy,
   whether this policy was set explicitly by the task itself or one of its
   ancestors, such as numactl. This means that if the task is invoked from a
   shell with non-default policy, that policy will be used. One can specify a
   node list of "all" with numactl --interleave or --membind [-m] to achieve
   interleaving over all nodes in the system or cpuset.

#. Any task mempolicy specified--e.g., using numactl--will be constrained by
   the resource limits of any cpuset in which the task runs. Thus, there will
   be no way for a task with non-default policy running in a cpuset with a
   subset of the system nodes to allocate huge pages outside the cpuset
   without first moving to a cpuset that contains all of the desired nodes.

#. Boot-time huge page allocation attempts to distribute the requested number
   of huge pages over all on-line nodes with memory.

Per Node Hugepages Attributes
=============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device of each
NUMA node with memory in::

    /sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

    nr_hugepages
    free_hugepages
    surplus_hugepages

The ``free_hugepages`` and ``surplus_hugepages`` attribute files are
read-only. They return the number of free and surplus [overcommitted] huge
pages, respectively, on the parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node. When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.

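For example, to ask for eight 2 MB pages specifically on node 1 (illustrative
paths; the exact node and size directories depend on the system)::

    echo 8 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
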
Note that the counts of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.

.. _using_huge_pages:

Using Huge Pages
================

If the user applications are going to request huge pages using the mmap system
call, then it is required that the system administrator mount a file system of
type hugetlbfs::

    mount -t hugetlbfs \
        -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
        min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system. By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to
value & 01777. This value is given in octal. By default the value 0755
is picked.

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum value of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to HPAGE_SIZE boundary.

The ``min_size`` option sets the minimum value of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` is reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.

If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
the command line then no limits are set.

For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.

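Putting the options together, an illustrative mount (assuming 2 MB pages are
available and the default pool is large enough) might look like::

    mount -t hugetlbfs -o pagesize=2M,size=1G,min_size=512M none /mnt/huge
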
While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) could be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

Users who wish to use hugetlb memory via a shared memory segment should be
members of a supplementary group and the system admin needs to configure that
gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for the same or
different applications to use any combination of mmaps and shm* calls, though
the mount of the filesystem will be required for using mmap calls without
MAP_HUGETLB.

Syscalls that operate on memory backed by hugetlb pages only have their lengths
aligned to the native page size of the processor; they will normally fail with
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
not hugepage aligned. For example, munmap(2) will fail if memory is backed by
a hugetlb page and the length is smaller than the hugepage size.


Examples
========

.. _map_hugetlb:

``map_hugetlb``
    see tools/testing/selftests/vm/map_hugetlb.c

``hugepage-shm``
    see tools/testing/selftests/vm/hugepage-shm.c

``hugepage-mmap``
    see tools/testing/selftests/vm/hugepage-mmap.c

The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs

Documentation/vm/hwpoison.rst
@@ -155,7 +155,7 @@ Testing
   value). This allows stress testing of many kinds of
   pages. The page_flags are the same as in /proc/kpageflags. The
   flag bits are defined in include/linux/kernel-page-flags.h and
-  documented in Documentation/vm/pagemap.rst
+  documented in Documentation/admin-guide/mm/pagemap.rst

* Architecture specific MCE injector

Documentation/vm/idle_page_tracking.rst
@@ -1,115 +0,0 @@

.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature allows tracking of which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup limits,
or deciding where to place the workload within a compute cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64; the byte order is native. When a
bit is set, the corresponding page is idle.

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.

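As an illustrative sketch, the following marks PFNs 0-63 idle by writing a
single all-ones 8-byte word at offset 0 of the bitmap file (the byte offset
for PFN #i is i/64*8; attempts to mark non-user pages are silently ignored)::

    # set bits for PFNs 0-63 (eight 0xff bytes, written at offset 0)
    printf '\377\377\377\377\377\377\377\377' > /sys/kernel/mm/page_idle/bitmap
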
Only accesses to user memory pages are tracked. These are pages mapped into a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.

That said, in order to estimate the number of pages that are not used by a
workload one should:

1. Mark all the workload's pages as idle by setting corresponding bits in
   ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
   ``/proc/pid/pagemap`` if the workload is represented by a process, or by
   filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
   is placed in a memory cgroup.

2. Wait until the workload accesses its working set.

3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
   If one wants to ignore certain types of pages, e.g. mlocked pages since
   they are not reclaimable, they can be filtered out using
   ``/proc/kpageflags``.

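For step 3, a rough sketch of counting the set bits over the first 4096 bitmap
words (i.e. the first 262144 PFNs) might look like the following; the range
that is actually worth scanning depends on the system::

    # dump 4096 8-byte words, expand to binary, and count the '1' bits
    dd if=/sys/kernel/mm/page_idle/bitmap bs=8 count=4096 2>/dev/null |
        xxd -b -c1 | awk '{n += gsub(/1/, "1", $2)} END {print n}'
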
See Documentation/vm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space (in which case one or more of the PTEs it is mapped to will have the
Accessed bit set) or if it has been marked accessed explicitly by the kernel
(see mark_page_accessed()). The latter happens when:

- a userspace process reads or writes a page using a system call (e.g. read(2)
  or write(2))

- a page that is used for storing filesystem buffers is read or written,
  because a process needs filesystem metadata stored in it (e.g. lists a
  directory tree)

- a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.

Documentation/vm/index.rst
@@ -13,15 +13,10 @@ various features of the Linux memory management

 .. toctree::
    :maxdepth: 1

-   hugetlbpage
-   idle_page_tracking
    ksm
    numa_memory_policy
-   pagemap
    transhuge
-   soft-dirty
    swap_numa
-   userfaultfd
    zswap

 Kernel developers MM documentation

Documentation/vm/pagemap.rst
@@ -1,197 +0,0 @@

.. _pagemap:

=============================
Examining Process Page Tables
=============================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.

There are four components to pagemap:

* ``/proc/pid/pagemap``. This file lets a userspace process find out which
  physical frame each virtual page is mapped to. It contains one 64-bit
  value for each virtual page, containing the following data (from
  ``fs/proc/task_mmu.c``, above pagemap_read):

  * Bits 0-54  page frame number (PFN) if present
  * Bits 0-4   swap type if swapped
  * Bits 5-54  swap offset if swapped
  * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
  * Bit  56    page exclusively mapped (since 4.2)
  * Bits 57-60 zero
  * Bit  61    page is file-page or shared-anon (since 3.5)
  * Bit  62    page swapped
  * Bit  63    page present

  Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
  In 4.0 and 4.1 opens by unprivileged users fail with -EPERM. Starting from
  4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
  Reason: information about PFNs helps in exploiting the Rowhammer
  vulnerability.

  If the page is not present but in swap, then the PFN contains an
  encoding of the swap file number and the page's offset into the
  swap. Unmapped pages return a null PFN. This allows determining
  precisely which pages are mapped (or in swap) and comparing mapped
  pages between processes.

  Efficient users of this interface will use ``/proc/pid/maps`` to
  determine which areas of memory are actually mapped and llseek to
  skip over unmapped regions. A sketch of reading a single pagemap
  entry is shown after this list.

* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
  times each page is mapped, indexed by PFN.

* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
  page, indexed by PFN.

  The flags are (from ``fs/proc/page.c``, above kpageflags_read):

  0. LOCKED
  1. ERROR
  2. REFERENCED
  3. UPTODATE
  4. DIRTY
  5. LRU
  6. ACTIVE
  7. SLAB
  8. WRITEBACK
  9. RECLAIM
  10. BUDDY
  11. MMAP
  12. ANON
  13. SWAPCACHE
  14. SWAPBACKED
  15. COMPOUND_HEAD
  16. COMPOUND_TAIL
  17. HUGE
  18. UNEVICTABLE
  19. HWPOISON
  20. NOPAGE
  21. KSM
  22. THP
  23. BALLOON
  24. ZERO_PAGE
  25. IDLE

* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
  memory cgroup each page is charged to, indexed by PFN. Only available when
  CONFIG_MEMCG is set.

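As an illustrative sketch of reading one pagemap entry (the pid and the
virtual address are placeholders, and a 4 KiB page size is assumed)::

    vaddr=0x400000; pid=1234
    # entry index = vaddr / pagesize; each entry is 8 bytes
    dd if=/proc/$pid/pagemap bs=8 skip=$((vaddr / 4096)) count=1 2>/dev/null |
        od -An -tx8
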
Short descriptions of the page flags
====================================

0 - LOCKED
   page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
   When a compound page is used, SLUB/SLQB will only set this flag on the head
   page; SLOB will not flag it at all.
10 - BUDDY
   a free memory block managed by the buddy system allocator
   The buddy system organizes free memory in blocks of various orders.
   An order N block has 2^N physically contiguous pages, with the BUDDY flag
   set for and _only_ for the first page.
15 - COMPOUND_HEAD
   A compound page with order N consists of 2^N physically contiguous pages.
   A compound page with order 2 takes the form of "HTTT", where H denotes its
   head page and T denotes its tail page(s). The major consumers of compound
   pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc.
   memory allocators and various device drivers. However in this interface,
   only huge/giga pages are made visible to end users.
16 - COMPOUND_TAIL
   A compound page tail (see description above).
17 - HUGE
   this is an integral part of a HugeTLB page
19 - HWPOISON
   hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
   no page frame exists at the requested address
21 - KSM
   identical memory pages dynamically shared between one or more processes
22 - THP
   contiguous pages which construct transparent hugepages
23 - BALLOON
   balloon compaction page
24 - ZERO_PAGE
   zero page for pfn_zero or huge_zero page
25 - IDLE
   page has not been accessed since it was marked idle (see
   Documentation/vm/idle_page_tracking.rst). Note that this flag may be
   stale in case the page was accessed via a PTE. To make sure the flag
   is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first.

IO related page flags
---------------------

1 - ERROR
   IO error occurred
3 - UPTODATE
   page has up-to-date data
   i.e. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
   page has been written to, hence contains new data
   i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
   page is being synced to disk

LRU related page flags
----------------------

5 - LRU
   page is in one of the LRU lists
6 - ACTIVE
   page is in the active LRU list
18 - UNEVICTABLE
   page is in the unevictable (non-)LRU list. It is somehow pinned and
   not a candidate for LRU page reclaims, e.g. ramfs pages,
   shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
   page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
   page will be reclaimed soon after its pageout IO completed
11 - MMAP
   a memory mapped page
12 - ANON
   a memory mapped page that is not part of a file
13 - SWAPCACHE
   page is mapped to swap space, i.e. has an associated swap entry
14 - SWAPBACKED
   page is backed by swap/RAM

The page-types tool in the tools/vm directory can be used to query the
above flags.

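For instance, assuming the tool builds in-tree with make, one might summarize
the flags of a running process like this (an illustrative invocation; 1234
stands for a real pid)::

    make -C tools/vm page-types
    ./tools/vm/page-types -p 1234
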
Using pagemap to do something useful
====================================

The general procedure for using pagemap to find out about a process' memory
usage goes like this:

1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
   mapped to what.
2. Select the maps you are interested in -- all of them, or a particular
   library, or the stack or the heap, etc.
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
   just read, seek to that entry in the file, and read the data you want.

For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.

Other notes
===========

Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.

Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 on most architectures). Since Linux 3.11 their meaning changes
after the first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.

Documentation/vm/soft-dirty.rst
@@ -1,47 +0,0 @@

.. _soft_dirty:

===============
Soft-Dirty PTEs
===============

Soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

1. Clear soft-dirty bits from the task's PTEs.

   This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
   task in question.

2. Wait some time.

3. Read soft-dirty bits from the PTEs.

   This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
   64-bit qword is the soft-dirty one. If set, the respective PTE was
   written to since step 1.

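Put together as a shell sketch (the pid, the virtual address, and the 4 KiB
page size are illustrative assumptions)::

    pid=1234; vaddr=0x400000
    echo 4 > /proc/$pid/clear_refs      # step 1: clear soft-dirty bits
    sleep 10                            # step 2: let the task run for a while
    # step 3: bit 55 of the entry below is the soft-dirty bit
    dd if=/proc/$pid/pagemap bs=8 skip=$((vaddr / 4096)) count=1 2>/dev/null |
        od -An -tx8
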
Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.

Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus all
the kernel does is find this fact out and put both writable and soft-dirty
bits on the PTE.

While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
including soft dirty bits. To notify user space applications about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.

This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org

-- Pavel Emelyanov, Apr 9, 2013

Documentation/vm/userfaultfd.rst
@@ -1,241 +0,0 @@

.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

Design
======

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd that allows userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

The userfaultfd, once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
(well of course unless they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

API
===

When first opened the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
a later API version) which will specify the read/POLLIN protocol
userland intends to speak on the UFFD and the uffdio_api.features
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the
requested features are going to be enabled) will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctls available.

The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, except page fault notifications, may be generated.

If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc).

The userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flag in
uffdio_api.features to enable those features.

If the userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has appropriate
UFFD_FEATURE_EVENT_* bits set. These events are described in more
detail below in the "Non-cooperative userfaultfd" section.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.

The primary ioctl to resolve userfaults is UFFDIO_COPY. That
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see a
half-copied page since it'll keep userfaulting until the copy has
finished.

QEMU/KVM
========

QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node, into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread reads from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address it writes the information about the missing page
into the socket. The QEMU source node receives the information and
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).

Non-cooperative userfaultfd
===========================

When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to the UFFDIO_API ioctl:

UFFD_FEATURE_EVENT_FORK
    enable userfaultfd hooks for fork(). When this feature is
    enabled, the userfaultfd context of the parent process is
    duplicated into the newly created process. The manager
    receives UFFD_EVENT_FORK with the file descriptor of the new
    userfaultfd context in the uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP
    enable notifications about mremap() calls. When the
    non-cooperative process moves a virtual memory area to a
    different location, the manager will receive
    UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
    new addresses of the area and its original length.

UFFD_FEATURE_EVENT_REMOVE
    enable notifications about madvise(MADV_REMOVE) and
    madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
    be generated upon these calls to madvise. The uffd_msg.remove
    will contain start and end addresses of the removed area.

UFFD_FEATURE_EVENT_UNMAP
    enable notifications about memory unmapping. The manager will
    get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
    end addresses of the unmapped area.

Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they differ quite a bit in the action expected from
the userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not, the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault
is to zeromap the faulting address. However, in the latter case, when
an area is unmapped, either explicitly (with the munmap() system call),
or implicitly (e.g. during mremap()), the area is removed and in turn
the userfaultfd context for such an area disappears too and the manager
will not get further userland page faults from the removed area. Still,
the notification is required in order to prevent the manager from using
UFFDIO_COPY on the unmapped area.

Unlike userland page faults which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the events
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
its virtual memory layout simultaneously with an outstanding UFFDIO_COPY
operation.

The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.