docs/admin-guide/mm: start moving here files from Documentation/vm

Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and provide coherent description of certain feature
can be moved to Documentation/admin-guide/mm.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
Mike Rapoport
2018-04-18 11:07:49 +03:00
committato da Jonathan Corbet
parent 3a3f7e26e5
commit 1ad1335dc5
16 ha cambiato i file con 28 aggiunte e 31 eliminazioni

Vedi File

@@ -12,14 +12,10 @@ highmem.rst
- Outline of highmem and common issues.
hmm.rst
- Documentation of heterogeneous memory management
hugetlbpage.rst
- a brief summary of hugetlbpage support in the Linux kernel.
hugetlbfs_reserv.rst
- A brief overview of hugetlbfs reservation design/implementation.
hwpoison.rst
- explains what hwpoison is
idle_page_tracking.rst
- description of the idle page tracking feature.
ksm.rst
- how to use the Kernel Samepage Merging feature.
mmu_notifier.rst
@@ -34,16 +30,12 @@ page_frags.rst
- description of page fragments allocator
page_migration.rst
- description of page migration in NUMA systems.
pagemap.rst
- pagemap, from the userspace perspective
page_owner.rst
- tracking about who allocated each page
remap_file_pages.rst
- a note about remap_file_pages() system call
slub.rst
- a short users guide for SLUB.
soft-dirty.rst
- short explanation for soft-dirty PTEs
split_page_table_lock.rst
- Separate per-table lock to improve scalability of the old page_table_lock.
swap_numa.rst
@@ -52,8 +44,6 @@ transhuge.rst
- Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.rst
- Unevictable LRU infrastructure
userfaultfd.rst
- description of userfaultfd system call
z3fold.txt
- outline of z3fold allocator for storing compressed pages
zsmalloc.rst

Vedi File

@@ -1,381 +0,0 @@
.. _hugetlbpage:
=============
HugeTLB Pages
=============
Overview
========
The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of multiple page size support
that is provided by most modern architectures. For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes, ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations. Typically this is a very scarce resource on processor.
Operating systems try to make best use of limited number of TLB resources.
This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.
Users can use the huge page support in Linux kernel by either using the mmap
system call or standard SYSV shared memory system calls (shmget, shmat).
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.
The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool. It also displays
default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.
The output of ``cat /proc/meminfo`` will include lines like::
HugePages_Total: uuu
HugePages_Free: vvv
HugePages_Rsvd: www
HugePages_Surp: xxx
Hugepagesize: yyy kB
Hugetlb: zzz kB
where:
HugePages_Total
is the size of the pool of huge pages.
HugePages_Free
is the number of huge pages in the pool that are not yet
allocated.
HugePages_Rsvd
is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made. Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.
HugePages_Surp
is short for "surplus," and is the number of huge pages in
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
maximum number of surplus huge pages is controlled by
``/proc/sys/vm/nr_overcommit_hugepages``.
Hugepagesize
is the default hugepage size (in Kb).
Hugetlb
is the total amount of memory (in kB), consumed by huge
pages of all sizes.
If huge pages of different sizes are in use, this number
will exceed HugePages_Total \* Hugepagesize. To get more
detailed information, please, refer to
``/sys/kernel/mm/hugepages`` (described below).
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.
``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.
Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.
Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages. See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.
The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' = the
number of huge pages requested. This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.
Some platforms support multiple huge page sizes. To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::
echo 20 > /proc/sys/vm/nr_hugepages
This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.
The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.
System administrators may want to put this command in one of the local rc
init files. This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physical contiguous pages
is still very high. Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo. To check the per node
distribution of huge pages in a NUMA system, use::
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.
When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages. Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
the new persistent huge page pool size.
The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value. The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.
Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages. This will occur even if
the number of surplus pages would exceed the overcommit value. As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
increased sufficiently, or the surplus huge pages go out of use and are freed--
no more surplus huge pages will be allowed to be allocated.
With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is::
/sys/kernel/mm/hugepages
For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::
hugepages-${size}kB
Inside each of these directories, the same set of files will exist::
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
free_hugepages
resv_hugepages
surplus_hugepages
which function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
===================================================================
Whether huge pages are allocated and freed via the ``/proc`` interface or
the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.
The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::
numactl --interleave <node-list> echo 20 \
>/proc/sys/vm/nr_hugepages_mempolicy
or, more succinctly::
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
specified in <node-list>, depending on whether number of persistent huge pages
is initially less than or greater than 20, respectively. No huge pages will be
allocated nor freed on any node not included in the specified <node-list>.
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:
#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst],
persistent huge pages will be distributed across the node or nodes
specified in the mempolicy as if "interleave" had been specified.
However, if a node in the policy does not contain sufficient contiguous
memory for a huge page, the allocation will not "fallback" to the nearest
neighbor node with sufficient contiguous memory. To do this would cause
undesirable imbalance in the distribution of the huge page pool, or
possibly, allocation of persistent huge pages on nodes not allowed by
the task's memory policy.
#. One or more nodes may be specified with the bind or interleave policy.
If more than one node is specified with the preferred policy, only the
lowest numeric id will be used. Local policy will select the node where
the task is running at the time the nodes_allowed mask is constructed.
For local policy to be deterministic, the task must be bound to a cpu or
cpus in a single node. Otherwise, the task could be migrated to some
other node at any time after launch and the resulting node will be
indeterminate. Thus, local policy is not very useful for this purpose.
Any of the other mempolicy modes may be used to specify a single node.
#. The nodes allowed mask will be derived from any non-default task mempolicy,
whether this policy was set explicitly by the task itself or one of its
ancestors, such as numactl. This means that if the task is invoked from a
shell with non-default policy, that policy will be used. One can specify a
node list of "all" with numactl --interleave or --membind [-m] to achieve
interleaving over all nodes in the system or cpuset.
#. Any task mempolicy specified--e.g., using numactl--will be constrained by
the resource limits of any cpuset in which the task runs. Thus, there will
be no way for a task with non-default policy running in a cpuset with a
subset of the system nodes to allocate huge pages outside the cpuset
without first moving to a cpuset that contains all of the desired nodes.
#. Boot-time huge page allocation attempts to distribute the requested number
of huge pages over all on-lines nodes with memory.
Per Node Hugepages Attributes
=============================
A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under each the system device of each
NUMA node with memory in::
/sys/devices/system/node/node[0-9]*/hugepages/
Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::
nr_hugepages
free_hugepages
surplus_hugepages
The free\_' and surplus\_' attribute files are read-only. They return the number
of free and surplus [overcommitted] huge pages, respectively, on the parent
node.
The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node. When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.
Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.
.. _using_huge_pages:
Using Huge Pages
================
If the user applications are going to request huge pages using mmap system
call, then it is required that system administrator mount a file system of
type hugetlbfs::
mount -t hugetlbfs \
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
min_size=<value>,nr_inodes=<value> none /mnt/huge
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
The ``uid`` and ``gid`` options sets the owner and group of the root of the
file system. By default the ``uid`` and ``gid`` of the current process
are taken.
The ``mode`` option sets the mode of root of file system to value & 01777.
This value is given in octal. By default the value 0755 is picked.
If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.
The ``size`` option sets the maximum value of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to HPAGE_SIZE boundary.
The ``min_size`` option sets the minimum value of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` are reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.
The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
command line then no limits are set.
For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.
While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.
Regular chown, chgrp, and chmod commands (with right permissions) could be
used to change the file attributes on hugetlbfs.
Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.
Users who wish to use hugetlb memory via shared memory segment should be
members of a supplementary group and system admin needs to configure that gid
into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different
applications to use any combination of mmaps and shm* calls, though the mount of
filesystem will be required for using mmap calls without MAP_HUGETLB.
Syscalls that operate on memory backed by hugetlb pages only have their lengths
aligned to the native page size of the processor; they will normally fail with
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
not hugepage aligned. For example, munmap(2) will fail if memory is backed by
a hugetlb page and the length is smaller than the hugepage size.
Examples
========
.. _map_hugetlb:
``map_hugetlb``
see tools/testing/selftests/vm/map_hugetlb.c
``hugepage-shm``
see tools/testing/selftests/vm/hugepage-shm.c
``hugepage-mmap``
see tools/testing/selftests/vm/hugepage-mmap.c
The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.
.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs

Vedi File

@@ -155,7 +155,7 @@ Testing
value). This allows stress testing of many kinds of
pages. The page_flags are the same as in /proc/kpageflags. The
flag bits are defined in include/linux/kernel-page-flags.h and
documented in Documentation/vm/pagemap.rst
documented in Documentation/admin-guide/mm/pagemap.rst
* Architecture specific MCE injector

Vedi File

@@ -1,115 +0,0 @@
.. _idle_page_tracking:
==================
Idle Page Tracking
==================
Motivation
==========
The idle page tracking feature allows to track which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup limits,
or deciding where to place the workload within a compute cluster.
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
.. _user_api:
User API
========
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of the only read-write file,
``/sys/kernel/mm/page_idle/bitmap``.
The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
set, the corresponding page is idle.
A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.
Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, swap cache pages. For other
page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
and hence such pages are never reported idle.
For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.
That said, in order to estimate the amount of pages that are not used by a
workload one should:
1. Mark all the workload's pages as idle by setting corresponding bits in
``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
``/proc/pid/pagemap`` if the workload is represented by a process, or by
filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
is placed in a memory cgroup.
2. Wait until the workload accesses its working set.
3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
If one wants to ignore certain types of pages, e.g. mlocked pages since they
are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
See Documentation/vm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
.. _impl_details:
Implementation Details
======================
The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more PTEs it is mapped to will have the Accessed bit
set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
latter happens when:
- a userspace process reads or writes a page using a system call (e.g. read(2)
or write(2))
- a page that is used for storing filesystem buffers is read or written,
because a process needs filesystem metadata stored in it (e.g. lists a
directory tree)
- a page is accessed by a device driver using get_user_pages()
When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.
When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.
Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list, other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.

Vedi File

@@ -13,15 +13,10 @@ various features of the Linux memory management
.. toctree::
:maxdepth: 1
hugetlbpage
idle_page_tracking
ksm
numa_memory_policy
pagemap
transhuge
soft-dirty
swap_numa
userfaultfd
zswap
Kernel developers MM documentation

Vedi File

@@ -1,197 +0,0 @@
.. _pagemap:
=============================
Examining Process Page Tables
=============================
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.
There are four components to pagemap:
* ``/proc/pid/pagemap``. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
value for each virtual page, containing the following data (from
``fs/proc/task_mmu.c``, above pagemap_read):
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
* Bits 57-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from
4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
If the page is not present but in swap, then the PFN contains an
encoding of the swap file number and the page's offset into the
swap. Unmapped pages return a null PFN. This allows determining
precisely which pages are mapped (or in swap) and comparing mapped
pages between processes.
Efficient users of this interface will use ``/proc/pid/maps`` to
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
page, indexed by PFN.
The flags are (from ``fs/proc/page.c``, above kpageflags_read):
0. LOCKED
1. ERROR
2. REFERENCED
3. UPTODATE
4. DIRTY
5. LRU
6. ACTIVE
7. SLAB
8. WRITEBACK
9. RECLAIM
10. BUDDY
11. MMAP
12. ANON
13. SWAPCACHE
14. SWAPBACKED
15. COMPOUND_HEAD
16. COMPOUND_TAIL
17. HUGE
18. UNEVICTABLE
19. HWPOISON
20. NOPAGE
21. KSM
22. THP
23. BALLOON
24. ZERO_PAGE
25. IDLE
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
Short descriptions to the page flags
====================================
0 - LOCKED
page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
When compound page is used, SLUB/SLQB will only set this flag on the head
page; SLOB will not flag it at all.
10 - BUDDY
a free memory block managed by the buddy system allocator
The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page.
15 - COMPOUND_HEAD
A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound
pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc.
memory allocators and various device drivers. However in this interface,
only huge/giga pages are made visible to end users.
16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
this is an integral part of a HugeTLB page
19 - HWPOISON
hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
no page frame exists at the requested address
21 - KSM
identical memory pages dynamically shared between one or more processes
22 - THP
contiguous pages which construct transparent hugepages
23 - BALLOON
balloon compaction page
24 - ZERO_PAGE
zero page for pfn_zero or huge_zero page
25 - IDLE
page has not been accessed since it was marked idle (see
Documentation/vm/idle_page_tracking.rst). Note that this flag may be
stale in case the page was accessed via a PTE. To make sure the flag
is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first.
IO related page flags
---------------------
1 - ERROR
IO error occurred
3 - UPTODATE
page has up-to-date data
ie. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
page has been written to, hence contains new data
i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
page is being synced to disk
LRU related page flags
----------------------
5 - LRU
page is in one of the LRU lists
6 - ACTIVE
page is in the active LRU list
18 - UNEVICTABLE
page is in the unevictable (non-)LRU list It is somehow pinned and
not a candidate for LRU page reclaims, e.g. ramfs pages,
shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
page will be reclaimed soon after its pageout IO completed
11 - MMAP
a memory mapped page
12 - ANON
a memory mapped page that is not part of a file
13 - SWAPCACHE
page is mapped to swap space, i.e. has an associated swap entry
14 - SWAPBACKED
page is backed by swap/RAM
The page-types tool in the tools/vm directory can be used to query the
above flags.
Using pagemap to do something useful
====================================
The general procedure for using pagemap to find out about a process' memory
usage goes like this:
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
mapped to what.
2. Select the maps you are interested in -- all of them, or a particular
library, or the stack or the heap, etc.
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
just read, seek to that entry in the file, and read the data you want.
For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
Other notes
===========
Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.
Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.

Vedi File

@@ -1,47 +0,0 @@
.. _soft_dirty:
===============
Soft-Dirty PTEs
===============
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from the task's PTEs.
This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
task in question.
2. Wait some time.
3. Read soft-dirty bits from the PTEs.
This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
64-bit qword is the soft-dirty one. If set, the respective PTE was
written to since step 1.
Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus all
the kernel does is finds this fact out and puts both writable and soft-dirty
bits on the PTE.
While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
including soft dirty bits. To notify user space application about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.
This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org
-- Pavel Emelyanov, Apr 9, 2013

Vedi File

@@ -1,241 +0,0 @@
.. _userfaultfd:
===========
Userfaultfd
===========
Objective
=========
Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.
For example userfaults allows a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.
Design
======
Userfaults are delivered and resolved through the userfaultfd syscall.
The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
1) read/POLLIN protocol to notify a userland thread of the faults
happening
2) various UFFDIO_* ioctls that can manage the virtual memory regions
registered in the userfaultfd that allows userland to efficiently
resolve the userfaults it receives via 1) or to manage the virtual
memory in the background
The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
The userfaultfd once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
(well of course unless they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).
API
===
When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
a later API version) which will specify the read/POLLIN protocol
userland intends to speak on the UFFD and the uffdio_api.features
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the
requested features are going to be enabled) will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctl available.
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, except page fault notifications, may be generated.
If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc).
The userland application that wants to use userfaultfd with hugetlbfs
or shared memory need to set the corresponding flag in
uffdio_api.features to enable those features.
If the userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has appropriate
UFFD_FEATURE_EVENT_* bits set. These events are described in more
detail below in "Non-cooperative userfaultfd" section.
Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).
Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.
The primary ioctl to resolve userfaults is UFFDIO_COPY. That
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
QEMU/KVM
========
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.
Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.
It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.
The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
The QEMU in the source node writes all pages that it knows are missing
in the destination node, into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread read() from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).
After the QEMU postcopy thread (running in the destination node) gets
the userfault address it writes the information about the missing page
into the socket. The QEMU source node receives the information and
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).
By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).
Non-cooperative userfaultfd
===========================
When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl:
UFFD_FEATURE_EVENT_FORK
enable userfaultfd hooks for fork(). When this feature is
enabled, the userfaultfd context of the parent process is
duplicated into the newly created process. The manager
receives UFFD_EVENT_FORK with file descriptor of the new
userfaultfd context in the uffd_msg.fork.
UFFD_FEATURE_EVENT_REMAP
enable notifications about mremap() calls. When the
non-cooperative process moves a virtual memory area to a
different location, the manager will receive
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
new addresses of the area and its original length.
UFFD_FEATURE_EVENT_REMOVE
enable notifications about madvise(MADV_REMOVE) and
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
be generated upon these calls to madvise. The uffd_msg.remove
will contain start and end addresses of the removed area.
UFFD_FEATURE_EVENT_UNMAP
enable notifications about memory unmapping. The manager will
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
end addresses of the unmapped area.
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they quite differ in the action expected from the
userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not, the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such page fault is
to zeromap the faulting address. However, in the latter case, when an
area is unmapped, either explicitly (with munmap() system call), or
implicitly (e.g. during mremap()), the area is removed and in turn the
userfaultfd context for such area disappears too and the manager will
not get further userland page faults from the removed area. Still, the
notification is required in order to prevent manager from using
UFFDIO_COPY on the unmapped area.
Unlike userland page faults which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the events
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
its virtual memory layout simultaneously with outstanding UFFDIO_COPY
operation.
The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.