docs/admin-guide/mm: start moving here files from Documentation/vm

Several documents in Documentation/vm fit quite well into the "admin/user guide" category. The documents that don't overload the reader with lots of implementation details and provide coherent description of certain feature can be moved to Documentation/admin-guide/mm. Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-04-18 11:07:49 +03:00
parent 3a3f7e26e5
commit 1ad1335dc5
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -12,14 +12,10 @@ highmem.rst
 	- Outline of highmem and common issues.
 hmm.rst
 	- Documentation of heterogeneous memory management
-hugetlbpage.rst
-	- a brief summary of hugetlbpage support in the Linux kernel.
 hugetlbfs_reserv.rst
 	- A brief overview of hugetlbfs reservation design/implementation.
 hwpoison.rst
 	- explains what hwpoison is
-idle_page_tracking.rst
-	- description of the idle page tracking feature.
 ksm.rst
 	- how to use the Kernel Samepage Merging feature.
 mmu_notifier.rst
@@ -34,16 +30,12 @@ page_frags.rst
 	- description of page fragments allocator
 page_migration.rst
 	- description of page migration in NUMA systems.
-pagemap.rst
-	- pagemap, from the userspace perspective
 page_owner.rst
 	- tracking about who allocated each page
 remap_file_pages.rst
 	- a note about remap_file_pages() system call
 slub.rst
 	- a short users guide for SLUB.
-soft-dirty.rst
-	- short explanation for soft-dirty PTEs
 split_page_table_lock.rst
 	- Separate per-table lock to improve scalability of the old page_table_lock.
 swap_numa.rst
@@ -52,8 +44,6 @@ transhuge.rst
 	- Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.rst
 	- Unevictable LRU infrastructure
-userfaultfd.rst
-	- description of userfaultfd system call
 z3fold.txt
 	- outline of z3fold allocator for storing compressed pages
 zsmalloc.rst
--- a/Documentation/vm/hugetlbpage.rst
+++ b/Documentation/vm/hugetlbpage.rst
@@ -1,381 +0,0 @@
-.. _hugetlbpage:
-
-=============
-HugeTLB Pages
-=============
-
-Overview
-========
-
-The intent of this file is to give a brief summary of hugetlbpage support in
-the Linux kernel.  This support is built on top of multiple page size support
-that is provided by most modern architectures.  For example, x86 CPUs normally
-support 4K and 2M (1G if architecturally supported) page sizes, ia64
-architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
-256M and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
-translations.  Typically this is a very scarce resource on processor.
-Operating systems try to make best use of limited number of TLB resources.
-This optimization is more critical now as bigger and bigger physical memories
-(several GBs) are more readily available.
-
-Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSV shared memory system calls (shmget, shmat).
-
-First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
-(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
-automatically when CONFIG_HUGETLBFS is selected) configuration
-options.
-
-The ``/proc/meminfo`` file provides information about the total number of
-persistent hugetlb pages in the kernel's huge page pool.  It also displays
-default huge page size and information about the number of free, reserved
-and surplus huge pages in the pool of huge pages of default size.
-The huge page size is needed for generating the proper alignment and
-size of the arguments to system calls that map huge page regions.
-
-The output of ``cat /proc/meminfo`` will include lines like::
-
-	HugePages_Total: uuu
-	HugePages_Free:  vvv
-	HugePages_Rsvd:  www
-	HugePages_Surp:  xxx
-	Hugepagesize:    yyy kB
-	Hugetlb:         zzz kB
-
-where:
-
-HugePages_Total
-	is the size of the pool of huge pages.
-HugePages_Free
-	is the number of huge pages in the pool that are not yet
-        allocated.
-HugePages_Rsvd
-	is short for "reserved," and is the number of huge pages for
-        which a commitment to allocate from the pool has been made,
-        but no allocation has yet been made.  Reserved huge pages
-        guarantee that an application will be able to allocate a
-        huge page from the pool of huge pages at fault time.
-HugePages_Surp
-	is short for "surplus," and is the number of huge pages in
-        the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
-        maximum number of surplus huge pages is controlled by
-        ``/proc/sys/vm/nr_overcommit_hugepages``.
-Hugepagesize
-	is the default hugepage size (in Kb).
-Hugetlb
-        is the total amount of memory (in kB), consumed by huge
-        pages of all sizes.
-        If huge pages of different sizes are in use, this number
-        will exceed HugePages_Total \* Hugepagesize. To get more
-        detailed information, please, refer to
-        ``/sys/kernel/mm/hugepages`` (described below).
-
-
-``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
-configured in the kernel.
-
-``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
-pages in the kernel's huge page pool.  "Persistent" huge pages will be
-returned to the huge page pool when freed by a task.  A user with root
-privileges can dynamically allocate more or free some persistent huge pages
-by increasing or decreasing the value of ``nr_hugepages``.
-
-Pages that are used as huge pages are reserved inside the kernel and cannot
-be used for other purposes.  Huge pages cannot be swapped out under
-memory pressure.
-
-Once a number of huge pages have been pre-allocated to the kernel huge page
-pool, a user with appropriate privilege can use either the mmap system call
-or shared memory system calls to use the huge pages.  See the discussion of
-:ref:`Using Huge Pages <using_huge_pages>`, below.
-
-The administrator can allocate persistent huge pages on the kernel boot
-command line by specifying the "hugepages=N" parameter, where 'N' = the
-number of huge pages requested.  This is the most reliable method of
-allocating huge pages as memory has not yet become fragmented.
-
-Some platforms support multiple huge page sizes.  To allocate huge pages
-of a specific size, one must precede the huge pages boot command parameters
-with a huge page size selection parameter "hugepagesz=<size>".  <size> must
-be specified in bytes with optional scale suffix [kKmMgG].  The default huge
-page size may be selected with the "default_hugepagesz=<size>" boot parameter.
-
-When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
-indicates the current number of pre-allocated huge pages of the default size.
-Thus, one can use the following command to dynamically allocate/deallocate
-default sized persistent huge pages::
-
-	echo 20 > /proc/sys/vm/nr_hugepages
-
-This command will try to adjust the number of default sized huge pages in the
-huge page pool to 20, allocating or freeing huge pages, as required.
-
-On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over all the set of allowed nodes specified by the NUMA memory policy of the
-task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
-task has default memory policy--is all on-line nodes with memory.  Allowed
-nodes with insufficient available, contiguous memory for a huge page will be
-silently skipped when allocating persistent huge pages.  See the
-:ref:`discussion below <mem_policy_and_hp_alloc>`
-of the interaction of task memory policy, cpusets and per node attributes
-with the allocation and freeing of persistent huge pages.
-
-The success or failure of huge page allocation depends on the amount of
-physically contiguous memory that is present in system at the time of the
-allocation attempt.  If the kernel is unable to allocate huge pages from
-some nodes in a NUMA system, it will attempt to make up the difference by
-allocating extra pages on other nodes with sufficient available contiguous
-memory, if any.
-
-System administrators may want to put this command in one of the local rc
-init files.  This will enable the kernel to allocate huge pages early in
-the boot process when the possibility of getting physical contiguous pages
-is still very high.  Administrators can verify the number of huge pages
-actually allocated by checking the sysctl or meminfo.  To check the per node
-distribution of huge pages in a NUMA system, use::
-
-	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
-
-``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
-huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
-requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain that
-number of "surplus" huge pages from the kernel's normal page pool, when the
-persistent huge page pool is exhausted. As these surplus huge pages become
-unused, they are freed back to the kernel's normal page pool.
-
-When increasing the huge page pool size via ``nr_hugepages``, any existing
-surplus pages will first be promoted to persistent huge pages.  Then, additional
-huge pages will be allocated, if necessary and if possible, to fulfill
-the new persistent huge page pool size.
-
-The administrator may shrink the pool of persistent huge pages for
-the default huge page size by setting the ``nr_hugepages`` sysctl to a
-smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all nodes in the memory policy of the task modifying ``nr_hugepages``.
-Any free huge pages on the selected nodes will be freed back to the kernel's
-normal page pool.
-
-Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
-it becomes less than the number of huge pages in use will convert the balance
-of the in-use huge pages to surplus huge pages.  This will occur even if
-the number of surplus pages would exceed the overcommit value.  As long as
-this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
-increased sufficiently, or the surplus huge pages go out of use and are freed--
-no more surplus huge pages will be allowed to be allocated.
-
-With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
-sysfs.
-The ``/proc`` interfaces discussed above have been retained for backwards
-compatibility. The root huge page control directory in sysfs is::
-
-	/sys/kernel/mm/hugepages
-
-For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form::
-
-	hugepages-${size}kB
-
-Inside each of these directories, the same set of files will exist::
-
-	nr_hugepages
-	nr_hugepages_mempolicy
-	nr_overcommit_hugepages
-	free_hugepages
-	resv_hugepages
-	surplus_hugepages
-
-which function as described above for the default huge page-sized case.
-
-.. _mem_policy_and_hp_alloc:
-
-Interaction of Task Memory Policy with Huge Page Allocation/Freeing
-===================================================================
-
-Whether huge pages are allocated and freed via the ``/proc`` interface or
-the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
-NUMA nodes from which huge pages are allocated or freed are controlled by the
-NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
-sysctl or attribute.  When the ``nr_hugepages`` attribute is used, mempolicy
-is ignored.
-
-The recommended method to allocate or free huge pages to/from the kernel
-huge page pool, using the ``nr_hugepages`` example above, is::
-
-    numactl --interleave <node-list> echo 20 \
-				>/proc/sys/vm/nr_hugepages_mempolicy
-
-or, more succinctly::
-
-    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
-
-This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
-specified in <node-list>, depending on whether number of persistent huge pages
-is initially less than or greater than 20, respectively.  No huge pages will be
-allocated nor freed on any node not included in the specified <node-list>.
-
-When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
-memory policy mode--bind, preferred, local or interleave--may be used.  The
-resulting effect on persistent huge page allocation is as follows:
-
-#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst],
-   persistent huge pages will be distributed across the node or nodes
-   specified in the mempolicy as if "interleave" had been specified.
-   However, if a node in the policy does not contain sufficient contiguous
-   memory for a huge page, the allocation will not "fallback" to the nearest
-   neighbor node with sufficient contiguous memory.  To do this would cause
-   undesirable imbalance in the distribution of the huge page pool, or
-   possibly, allocation of persistent huge pages on nodes not allowed by
-   the task's memory policy.
-
-#. One or more nodes may be specified with the bind or interleave policy.
-   If more than one node is specified with the preferred policy, only the
-   lowest numeric id will be used.  Local policy will select the node where
-   the task is running at the time the nodes_allowed mask is constructed.
-   For local policy to be deterministic, the task must be bound to a cpu or
-   cpus in a single node.  Otherwise, the task could be migrated to some
-   other node at any time after launch and the resulting node will be
-   indeterminate.  Thus, local policy is not very useful for this purpose.
-   Any of the other mempolicy modes may be used to specify a single node.
-
-#. The nodes allowed mask will be derived from any non-default task mempolicy,
-   whether this policy was set explicitly by the task itself or one of its
-   ancestors, such as numactl.  This means that if the task is invoked from a
-   shell with non-default policy, that policy will be used.  One can specify a
-   node list of "all" with numactl --interleave or --membind [-m] to achieve
-   interleaving over all nodes in the system or cpuset.
-
-#. Any task mempolicy specified--e.g., using numactl--will be constrained by
-   the resource limits of any cpuset in which the task runs.  Thus, there will
-   be no way for a task with non-default policy running in a cpuset with a
-   subset of the system nodes to allocate huge pages outside the cpuset
-   without first moving to a cpuset that contains all of the desired nodes.
-
-#. Boot-time huge page allocation attempts to distribute the requested number
-   of huge pages over all on-lines nodes with memory.
-
-Per Node Hugepages Attributes
-=============================
-
-A subset of the contents of the root huge page control directory in sysfs,
-described above, will be replicated under each the system device of each
-NUMA node with memory in::
-
-	/sys/devices/system/node/node[0-9]*/hugepages/
-
-Under this directory, the subdirectory for each supported huge page size
-contains the following attribute files::
-
-	nr_hugepages
-	free_hugepages
-	surplus_hugepages
-
-The free\_' and surplus\_' attribute files are read-only.  They return the number
-of free and surplus [overcommitted] huge pages, respectively, on the parent
-node.
-
-The ``nr_hugepages`` attribute returns the total number of huge pages on the
-specified node.  When this attribute is written, the number of persistent huge
-pages on the parent node will be adjusted to the specified value, if sufficient
-resources exist, regardless of the task's mempolicy or cpuset constraints.
-
-Note that the number of overcommit and reserve pages remain global quantities,
-as we don't know until fault time, when the faulting task's mempolicy is
-applied, from which node the huge page allocation will be attempted.
-
-.. _using_huge_pages:
-
-Using Huge Pages
-================
-
-If the user applications are going to request huge pages using mmap system
-call, then it is required that system administrator mount a file system of
-type hugetlbfs::
-
-  mount -t hugetlbfs \
-	-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
-	min_size=<value>,nr_inodes=<value> none /mnt/huge
-
-This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
-``/mnt/huge``.  Any file created on ``/mnt/huge`` uses huge pages.
-
-The ``uid`` and ``gid`` options sets the owner and group of the root of the
-file system.  By default the ``uid`` and ``gid`` of the current process
-are taken.
-
-The ``mode`` option sets the mode of root of file system to value & 01777.
-This value is given in octal. By default the value 0755 is picked.
-
-If the platform supports multiple huge page sizes, the ``pagesize`` option can
-be used to specify the huge page size and associated pool. ``pagesize``
-is specified in bytes. If ``pagesize`` is not specified the platform's
-default huge page size and associated pool will be used.
-
-The ``size`` option sets the maximum value of memory (huge pages) allowed
-for that filesystem (``/mnt/huge``). The ``size`` option can be specified
-in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
-The size is rounded down to HPAGE_SIZE boundary.
-
-The ``min_size`` option sets the minimum value of memory (huge pages) allowed
-for the filesystem. ``min_size`` can be specified in the same way as ``size``,
-either bytes or a percentage of the huge page pool.
-At mount time, the number of huge pages specified by ``min_size`` are reserved
-for use by the filesystem.
-If there are not enough free huge pages available, the mount will fail.
-As huge pages are allocated to the filesystem and freed, the reserve count
-is adjusted so that the sum of allocated and reserved huge pages is always
-at least ``min_size``.
-
-The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
-can use.
-
-If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
-command line then no limits are set.
-
-For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
-use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
-For example, size=2K has the same meaning as size=2048.
-
-While read system calls are supported on files that reside on hugetlb
-file systems, write system calls are not.
-
-Regular chown, chgrp, and chmod commands (with right permissions) could be
-used to change the file attributes on hugetlbfs.
-
-Also, it is important to note that no such mount command is required if
-applications are going to use only shmat/shmget system calls or mmap with
-MAP_HUGETLB.  For an example of how to use mmap with MAP_HUGETLB see
-:ref:`map_hugetlb <map_hugetlb>` below.
-
-Users who wish to use hugetlb memory via shared memory segment should be
-members of a supplementary group and system admin needs to configure that gid
-into ``/proc/sys/vm/hugetlb_shm_group``.  It is possible for same or different
-applications to use any combination of mmaps and shm* calls, though the mount of
-filesystem will be required for using mmap calls without MAP_HUGETLB.
-
-Syscalls that operate on memory backed by hugetlb pages only have their lengths
-aligned to the native page size of the processor; they will normally fail with
-errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
-not hugepage aligned.  For example, munmap(2) will fail if memory is backed by
-a hugetlb page and the length is smaller than the hugepage size.
-
-
-Examples
-========
-
-.. _map_hugetlb:
-
-``map_hugetlb``
-	see tools/testing/selftests/vm/map_hugetlb.c
-
-``hugepage-shm``
-	see tools/testing/selftests/vm/hugepage-shm.c
-
-``hugepage-mmap``
-	see tools/testing/selftests/vm/hugepage-mmap.c
-
-The `libhugetlbfs`_  library provides a wide range of userspace tools
-to help with huge page usability, environment setup, and control.
-
-.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
--- a/Documentation/vm/hwpoison.rst
+++ b/Documentation/vm/hwpoison.rst
@@ -155,7 +155,7 @@ Testing
 	value).  This allows stress testing of many kinds of
 	pages. The page_flags are the same as in /proc/kpageflags. The
 	flag bits are defined in include/linux/kernel-page-flags.h and
-	documented in Documentation/vm/pagemap.rst
+	documented in Documentation/admin-guide/mm/pagemap.rst

 * Architecture specific MCE injector

--- a/Documentation/vm/idle_page_tracking.rst
+++ b/Documentation/vm/idle_page_tracking.rst
@@ -1,115 +0,0 @@
-.. _idle_page_tracking:
-
-==================
-Idle Page Tracking
-==================
-
-Motivation
-==========
-
-The idle page tracking feature allows to track which memory pages are being
-accessed by a workload and which are idle. This information can be useful for
-estimating the workload's working set size, which, in turn, can be taken into
-account when configuring the workload parameters, setting memory cgroup limits,
-or deciding where to place the workload within a compute cluster.
-
-It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
-
-.. _user_api:
-
-User API
-========
-
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
-
-The file implements a bitmap where each bit corresponds to a memory page. The
-bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
-mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
-set, the corresponding page is idle.
-
-A page is considered idle if it has not been accessed since it was marked idle
-(for more details on what "accessed" actually means see the :ref:`Implementation
-Details <impl_details>` section).
-To mark a page idle one has to set the bit corresponding to
-the page by writing to the file. A value written to the file is OR-ed with the
-current bitmap value.
-
-Only accesses to user memory pages are tracked. These are pages mapped to a
-process address space, page cache and buffer pages, swap cache pages. For other
-page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
-and hence such pages are never reported idle.
-
-For huge pages the idle flag is set only on the head page, so one has to read
-``/proc/kpageflags`` in order to correctly count idle huge pages.
-
-Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
-if the size of the read/write is not a multiple of 8 bytes. Writing to
-this file beyond max PFN will return -ENXIO.
-
-That said, in order to estimate the amount of pages that are not used by a
-workload one should:
-
- 1. Mark all the workload's pages as idle by setting corresponding bits in
-    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
-    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
-    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
-    is placed in a memory cgroup.
-
- 2. Wait until the workload accesses its working set.
-
- 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
-    If one wants to ignore certain types of pages, e.g. mlocked pages since they
-    are not reclaimable, he or she can filter them out using
-    ``/proc/kpageflags``.
-
-See Documentation/vm/pagemap.rst for more information about
-``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
-
-.. _impl_details:
-
-Implementation Details
-======================
-
-The kernel internally keeps track of accesses to user memory pages in order to
-reclaim unreferenced pages first on memory shortage conditions. A page is
-considered referenced if it has been recently accessed via a process address
-space, in which case one or more PTEs it is mapped to will have the Accessed bit
-set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
-latter happens when:
-
- - a userspace process reads or writes a page using a system call (e.g. read(2)
-   or write(2))
-
- - a page that is used for storing filesystem buffers is read or written,
-   because a process needs filesystem metadata stored in it (e.g. lists a
-   directory tree)
-
- - a page is accessed by a device driver using get_user_pages()
-
-When a dirty page is written to swap or disk as a result of memory reclaim or
-exceeding the dirty memory limit, it is not marked referenced.
-
-The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API <user_api>`
-section), and cleared automatically whenever a page is referenced as defined
-above.
-
-When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
-mapped to, otherwise we will not be able to detect accesses to the page coming
-from a process address space. To avoid interference with the reclaimer, which,
-as noted above, uses the Accessed bit to promote actively referenced pages, one
-more page flag is introduced, the Young flag. When the PTE Accessed bit is
-cleared as a result of setting or updating a page's Idle flag, the Young flag
-is set on the page. The reclaimer treats the Young flag as an extra PTE
-Accessed bit and therefore will consider such a page as referenced.
-
-Since the idle memory tracking feature is based on the memory reclaimer logic,
-it only works with pages that are on an LRU list, other pages are silently
-ignored. That means it will ignore a user memory page if it is isolated, but
-since there are usually not many of them, it should not affect the overall
-result noticeably. In order not to stall scanning of the idle page bitmap,
-locked pages may be skipped too.
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -13,15 +13,10 @@ various features of the Linux memory management
 .. toctree::
   :maxdepth: 1

-   hugetlbpage
-   idle_page_tracking
   ksm
   numa_memory_policy
-   pagemap
   transhuge
-   soft-dirty
   swap_numa
-   userfaultfd
   zswap

 Kernel developers MM documentation
--- a/Documentation/vm/pagemap.rst
+++ b/Documentation/vm/pagemap.rst
@@ -1,197 +0,0 @@
-.. _pagemap:
-
-=============================
-Examining Process Page Tables
-=============================
-
-pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
-userspace programs to examine the page tables and related information by
-reading files in ``/proc``.
-
-There are four components to pagemap:
-
- * ``/proc/pid/pagemap``.  This file lets a userspace process find out which
-   physical frame each virtual page is mapped to.  It contains one 64-bit
-   value for each virtual page, containing the following data (from
-   ``fs/proc/task_mmu.c``, above pagemap_read):
-
-    * Bits 0-54  page frame number (PFN) if present
-    * Bits 0-4   swap type if swapped
-    * Bits 5-54  swap offset if swapped
-    * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
-    * Bit  56    page exclusively mapped (since 4.2)
-    * Bits 57-60 zero
-    * Bit  61    page is file-page or shared-anon (since 3.5)
-    * Bit  62    page swapped
-    * Bit  63    page present
-
-   Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
-   In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
-   4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
-   Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
-
-   If the page is not present but in swap, then the PFN contains an
-   encoding of the swap file number and the page's offset into the
-   swap. Unmapped pages return a null PFN. This allows determining
-   precisely which pages are mapped (or in swap) and comparing mapped
-   pages between processes.
-
-   Efficient users of this interface will use ``/proc/pid/maps`` to
-   determine which areas of memory are actually mapped and llseek to
-   skip over unmapped regions.
-
- * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
-
- * ``/proc/kpageflags``.  This file contains a 64-bit set of flags for each
-   page, indexed by PFN.
-
-   The flags are (from ``fs/proc/page.c``, above kpageflags_read):
-
-    0. LOCKED
-    1. ERROR
-    2. REFERENCED
-    3. UPTODATE
-    4. DIRTY
-    5. LRU
-    6. ACTIVE
-    7. SLAB
-    8. WRITEBACK
-    9. RECLAIM
-    10. BUDDY
-    11. MMAP
-    12. ANON
-    13. SWAPCACHE
-    14. SWAPBACKED
-    15. COMPOUND_HEAD
-    16. COMPOUND_TAIL
-    17. HUGE
-    18. UNEVICTABLE
-    19. HWPOISON
-    20. NOPAGE
-    21. KSM
-    22. THP
-    23. BALLOON
-    24. ZERO_PAGE
-    25. IDLE
-
- * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
-   memory cgroup each page is charged to, indexed by PFN. Only available when
-   CONFIG_MEMCG is set.
-
-Short descriptions to the page flags
-====================================
-
-0 - LOCKED
-   page is being locked for exclusive access, e.g. by undergoing read/write IO
-7 - SLAB
-   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
-   When compound page is used, SLUB/SLQB will only set this flag on the head
-   page; SLOB will not flag it at all.
-10 - BUDDY
-    a free memory block managed by the buddy system allocator
-    The buddy system organizes free memory in blocks of various orders.
-    An order N block has 2^N physically contiguous pages, with the BUDDY flag
-    set for and _only_ for the first page.
-15 - COMPOUND_HEAD
-    A compound page with order N consists of 2^N physically contiguous pages.
-    A compound page with order 2 takes the form of "HTTT", where H donates its
-    head page and T donates its tail page(s).  The major consumers of compound
-    pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc.
-    memory allocators and various device drivers. However in this interface,
-    only huge/giga pages are made visible to end users.
-16 - COMPOUND_TAIL
-    A compound page tail (see description above).
-17 - HUGE
-    this is an integral part of a HugeTLB page
-19 - HWPOISON
-    hardware detected memory corruption on this page: don't touch the data!
-20 - NOPAGE
-    no page frame exists at the requested address
-21 - KSM
-    identical memory pages dynamically shared between one or more processes
-22 - THP
-    contiguous pages which construct transparent hugepages
-23 - BALLOON
-    balloon compaction page
-24 - ZERO_PAGE
-    zero page for pfn_zero or huge_zero page
-25 - IDLE
-    page has not been accessed since it was marked idle (see
-    Documentation/vm/idle_page_tracking.rst). Note that this flag may be
-    stale in case the page was accessed via a PTE. To make sure the flag
-    is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first.
-
-IO related page flags
---------------------
-
-1 - ERROR
-   IO error occurred
-3 - UPTODATE
-   page has up-to-date data
-   ie. for file backed page: (in-memory data revision >= on-disk one)
-4 - DIRTY
-   page has been written to, hence contains new data
-   i.e. for file backed page: (in-memory data revision >  on-disk one)
-8 - WRITEBACK
-   page is being synced to disk
-
-LRU related page flags
----------------------
-
-5 - LRU
-   page is in one of the LRU lists
-6 - ACTIVE
-   page is in the active LRU list
-18 - UNEVICTABLE
-   page is in the unevictable (non-)LRU list It is somehow pinned and
-   not a candidate for LRU page reclaims, e.g. ramfs pages,
-   shmctl(SHM_LOCK) and mlock() memory segments
-2 - REFERENCED
-   page has been referenced since last LRU list enqueue/requeue
-9 - RECLAIM
-   page will be reclaimed soon after its pageout IO completed
-11 - MMAP
-   a memory mapped page
-12 - ANON
-   a memory mapped page that is not part of a file
-13 - SWAPCACHE
-   page is mapped to swap space, i.e. has an associated swap entry
-14 - SWAPBACKED
-   page is backed by swap/RAM
-
-The page-types tool in the tools/vm directory can be used to query the
-above flags.
-
-Using pagemap to do something useful
-====================================
-
-The general procedure for using pagemap to find out about a process' memory
-usage goes like this:
-
- 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
-    mapped to what.
- 2. Select the maps you are interested in -- all of them, or a particular
-    library, or the stack or the heap, etc.
- 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
- 4. Read a u64 for each page from pagemap.
- 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``.  For each PFN you
-    just read, seek to that entry in the file, and read the data you want.
-
-For example, to find the "unique set size" (USS), which is the amount of
-memory that a process is using that is not shared with any other process,
-you can go through every map in the process, find the PFNs, look those up
-in kpagecount, and tally up the number of pages that are only referenced
-once.
-
-Other notes
-===========
-
-Reading from any of the files will return -EINVAL if you are not starting
-the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
-into the file), or if the size of the read is not a multiple of 8 bytes.
-
-Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
-always 12 at most architectures). Since Linux 3.11 their meaning changes
-after first clear of soft-dirty bits. Since Linux 4.2 they are used for
-flags unconditionally.
--- a/Documentation/vm/soft-dirty.rst
+++ b/Documentation/vm/soft-dirty.rst
@@ -1,47 +0,0 @@
-.. _soft_dirty:
-
-===============
-Soft-Dirty PTEs
-===============
-
-The soft-dirty is a bit on a PTE which helps to track which pages a task
-writes to. In order to do this tracking one should
-
-  1. Clear soft-dirty bits from the task's PTEs.
-
-     This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
-     task in question.
-
-  2. Wait some time.
-
-  3. Read soft-dirty bits from the PTEs.
-
-     This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
-     64-bit qword is the soft-dirty one. If set, the respective PTE was
-     written to since step 1.
-
-
-Internally, to do this tracking, the writable bit is cleared from PTEs
-when the soft-dirty bit is cleared. So, after this, when the task tries to
-modify a page at some virtual address the #PF occurs and the kernel sets
-the soft-dirty bit on the respective PTE.
-
-Note, that although all the task's address space is marked as r/o after the
-soft-dirty bits clear, the #PF-s that occur after that are processed fast.
-This is so, since the pages are still mapped to physical memory, and thus all
-the kernel does is finds this fact out and puts both writable and soft-dirty
-bits on the PTE.
-
-While in most cases tracking memory changes by #PF-s is more than enough
-there is still a scenario when we can lose soft dirty bits -- a task
-unmaps a previously mapped memory region and then maps a new one at exactly
-the same place. When unmap is called, the kernel internally clears PTE values
-including soft dirty bits. To notify user space application about such
-memory region renewal the kernel always marks new memory regions (and
-expanded regions) as soft dirty.
-
-This feature is actively used by the checkpoint-restore project. You
-can find more details about it on http://criu.org
-
-
-- Pavel Emelyanov, Apr 9, 2013
--- a/Documentation/vm/userfaultfd.rst
+++ b/Documentation/vm/userfaultfd.rst
@@ -1,241 +0,0 @@
-.. _userfaultfd:
-
-===========
-Userfaultfd
-===========
-
-Objective
-=========
-
-Userfaults allow the implementation of on-demand paging from userland
-and more generally they allow userland to take control of various
-memory page faults, something otherwise only the kernel code could do.
-
-For example userfaults allows a proper and more optimal implementation
-of the PROT_NONE+SIGSEGV trick.
-
-Design
-======
-
-Userfaults are delivered and resolved through the userfaultfd syscall.
-
-The userfaultfd (aside from registering and unregistering virtual
-memory ranges) provides two primary functionalities:
-
-1) read/POLLIN protocol to notify a userland thread of the faults
-   happening
-
-2) various UFFDIO_* ioctls that can manage the virtual memory regions
-   registered in the userfaultfd that allows userland to efficiently
-   resolve the userfaults it receives via 1) or to manage the virtual
-   memory in the background
-
-The real advantage of userfaults if compared to regular virtual memory
-management of mremap/mprotect is that the userfaults in all their
-operations never involve heavyweight structures like vmas (in fact the
-userfaultfd runtime load never takes the mmap_sem for writing).
-
-Vmas are not suitable for page- (or hugepage) granular fault tracking
-when dealing with virtual address spaces that could span
-Terabytes. Too many vmas would be needed for that.
-
-The userfaultfd once opened by invoking the syscall, can also be
-passed using unix domain sockets to a manager process, so the same
-manager process could handle the userfaults of a multitude of
-different processes without them being aware about what is going on
-(well of course unless they later try to use the userfaultfd
-themselves on the same region the manager is already tracking, which
-is a corner case that would currently return -EBUSY).
-
-API
-===
-
-When first opened the userfaultfd must be enabled invoking the
-UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
-a later API version) which will specify the read/POLLIN protocol
-userland intends to speak on the UFFD and the uffdio_api.features
-userland requires. The UFFDIO_API ioctl if successful (i.e. if the
-requested uffdio_api.api is spoken also by the running kernel and the
-requested features are going to be enabled) will return into
-uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
-respectively all the available features of the read(2) protocol and
-the generic ioctl available.
-
-The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
-defines what memory types are supported by the userfaultfd and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering userfaultfd ranges on hugetlbfs
-virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
-uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
-set if the kernel supports registering userfaultfd ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
-MAP_SHARED, memfd_create, etc).
-
-The userland application that wants to use userfaultfd with hugetlbfs
-or shared memory need to set the corresponding flag in
-uffdio_api.features to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that uffdio_api.features has appropriate
-UFFD_FEATURE_EVENT_* bits set. These events are described in more
-detail below in "Non-cooperative userfaultfd" section.
-
-Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
-be invoked (if present in the returned uffdio_api.ioctls bitmask) to
-register a memory range in the userfaultfd by setting the
-uffdio_register structure accordingly. The uffdio_register.mode
-bitmask will specify to the kernel which kind of faults to track for
-the range (UFFDIO_REGISTER_MODE_MISSING would track missing
-pages). The UFFDIO_REGISTER ioctl will return the
-uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
-userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
-
-Userland can use the uffdio_register.ioctls to manage the virtual
-address space in the background (to add or potentially also remove
-memory from the userfaultfd registered range). This means a userfault
-could be triggering just before userland maps in the background the
-user-faulted page.
-
-The primary ioctl to resolve userfaults is UFFDIO_COPY. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults (unless uffdio_copy.mode &
-UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
-UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
-half copied page since it'll keep userfaulting until the copy has
-finished.
-
-QEMU/KVM
-========
-
-QEMU/KVM is using the userfaultfd syscall to implement postcopy live
-migration. Postcopy live migration is one form of memory
-externalization consisting of a virtual machine running with part or
-all of its memory residing on a different node in the cloud. The
-userfaultfd abstraction is generic enough that not a single line of
-KVM kernel code had to be modified in order to add postcopy live
-migration to QEMU.
-
-Guest async page faults, FOLL_NOWAIT and all other GUP features work
-just fine in combination with userfaults. Userfaults trigger async
-page faults in the guest scheduler so those guest processes that
-aren't waiting for userfaults (i.e. network bound) can keep running in
-the guest vcpus.
-
-It is generally beneficial to run one pass of precopy live migration
-just before starting postcopy live migration, in order to avoid
-generating userfaults for readonly guest regions.
-
-The implementation of postcopy live migration currently uses one
-single bidirectional socket but in the future two different sockets
-will be used (to reduce the latency of the userfaults to the minimum
-possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
-
-The QEMU in the source node writes all pages that it knows are missing
-in the destination node, into the socket, and the migration thread of
-the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
-ioctls on the userfaultfd in order to map the received pages into the
-guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
-
-A different postcopy thread in the destination node listens with
-poll() to the userfaultfd in parallel. When a POLLIN event is
-generated after a userfault triggers, the postcopy thread read() from
-the userfaultfd and receives the fault address (or -EAGAIN in case the
-userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
-by the parallel QEMU migration thread).
-
-After the QEMU postcopy thread (running in the destination node) gets
-the userfault address it writes the information about the missing page
-into the socket. The QEMU source node receives the information and
-roughly "seeks" to that page address and continues sending all
-remaining missing pages from that new page offset. Soon after that
-(just the time to flush the tcp_wmem queue through the network) the
-migration thread in the QEMU running in the destination node will
-receive the page that triggered the userfault and it'll map it as
-usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
-was spontaneously sent by the source or if it was an urgent page
-requested through a userfault).
-
-By the time the userfaults start, the QEMU in the destination node
-doesn't need to keep any per-page state bitmap relative to the live
-migration around and a single per-page bitmap has to be maintained in
-the QEMU running in the source node to know which pages are still
-missing in the destination node. The bitmap in the source node is
-checked to find which missing pages to send in round robin and we seek
-over it when receiving incoming userfaults. After sending each page of
-course the bitmap is updated accordingly. It's also useful to avoid
-sending the same page twice (in case the userfault is read by the
-postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
-thread).
-
-Non-cooperative userfaultfd
-===========================
-
-When the userfaultfd is monitored by an external manager, the manager
-must be able to track changes in the process virtual memory
-layout. Userfaultfd can notify the manager about such changes using
-the same read(2) protocol as for the page fault notifications. The
-manager has to explicitly enable these events by setting appropriate
-bits in uffdio_api.features passed to UFFDIO_API ioctl:
-
-UFFD_FEATURE_EVENT_FORK
-	enable userfaultfd hooks for fork(). When this feature is
-	enabled, the userfaultfd context of the parent process is
-	duplicated into the newly created process. The manager
-	receives UFFD_EVENT_FORK with file descriptor of the new
-	userfaultfd context in the uffd_msg.fork.
-
-UFFD_FEATURE_EVENT_REMAP
-	enable notifications about mremap() calls. When the
-	non-cooperative process moves a virtual memory area to a
-	different location, the manager will receive
-	UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
-	new addresses of the area and its original length.
-
-UFFD_FEATURE_EVENT_REMOVE
-	enable notifications about madvise(MADV_REMOVE) and
-	madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
-	be generated upon these calls to madvise. The uffd_msg.remove
-	will contain start and end addresses of the removed area.
-
-UFFD_FEATURE_EVENT_UNMAP
-	enable notifications about memory unmapping. The manager will
-	get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
-	end addresses of the unmapped area.
-
-Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
-are pretty similar, they quite differ in the action expected from the
-userfaultfd manager. In the former case, the virtual memory is
-removed, but the area is not, the area remains monitored by the
-userfaultfd, and if a page fault occurs in that area it will be
-delivered to the manager. The proper resolution for such page fault is
-to zeromap the faulting address. However, in the latter case, when an
-area is unmapped, either explicitly (with munmap() system call), or
-implicitly (e.g. during mremap()), the area is removed and in turn the
-userfaultfd context for such area disappears too and the manager will
-not get further userland page faults from the removed area. Still, the
-notification is required in order to prevent manager from using
-UFFDIO_COPY on the unmapped area.
-
-Unlike userland page faults which have to be synchronous and require
-explicit or implicit wakeup, all the events are delivered
-asynchronously and the non-cooperative process resumes execution as
-soon as manager executes read(). The userfaultfd manager should
-carefully synchronize calls to UFFDIO_COPY with the events
-processing. To aid the synchronization, the UFFDIO_COPY ioctl will
-return -ENOSPC when the monitored process exits at the time of
-UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
-its virtual memory layout simultaneously with outstanding UFFDIO_COPY
-operation.
-
-The current asynchronous model of the event delivery is optimal for
-single threaded non-cooperative userfaultfd manager implementations. A
-synchronous event delivery model can be added later as a new
-userfaultfd feature to facilitate multithreading enhancements of the
-non cooperative manager, for example to allow UFFDIO_COPY ioctls to
-run in parallel to the event reception. Single threaded
-implementations should continue to use the current async event
-delivery model instead.