Merge tag 'docs-5.0' of git://git.lwn.net/linux

Pull documentation update from Jonathan Corbet:
"A fairly normal cycle for documentation stuff. We have a new document
on perf security, more Italian translations, more improvements to the
memory-management docs, improvements to the pathname lookup
documentation, and the usual array of smaller fixes.

As is often the case, there are a few reaches outside of
Documentation/ to adjust kerneldoc comments"

* tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
docs: improve pathname-lookup document structure
configfs: fix wrong name of struct in documentation
docs/mm-api: link slab_common.c to "The Slab Cache" section
slab: make kmem_cache_create{_usercopy} description proper kernel-doc
doc:process: add links where missing
docs/core-api: make mm-api.rst more structured
x86, boot: documentation whitespace fixup
Documentation: devres: note checking needs when converting
doc:it: add some process/* translations
doc:it: fixes in process/1.Intro
Documentation: convert path-lookup from markdown to resturctured text
Documentation/admin-guide: update admin-guide index.rst
Documentation/admin-guide: introduce perf-security.rst file
scripts/kernel-doc: Fix struct and struct field attribute processing
Documentation: dev-tools: Fix typos in index.rst
Correct gen_init_cpio tool's documentation
Document /proc/pid PID reuse behavior
Documentation: update path-lookup.md for parallel lookups
Documentation: Use "while" instead of "whilst"
dmaengine: Add mailing list address to the documentation
...
@@ -1,3 +1,4 @@
+.. _admin_devices:

 Linux allocated devices (4.x+ version)
 ======================================
@@ -110,8 +110,8 @@ If your query set is big, you can batch them too::

 ~# cat query-batch-file > <debugfs>/dynamic_debug/control

-A another way is to use wildcard. The match rule support ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character).For
+Another way is to use wildcards. The match rule supports ``*`` (matches
+zero or more characters) and ``?`` (matches exactly one character). For
 example, you can match all usb drivers::

 ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
@@ -258,7 +258,7 @@ this boot parameter for debugging purposes.

 If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
 boot time, without effect, but will be reprocessed when module is
-loaded later. ``dyndbg_query=`` and bare ``dyndbg=`` are only processed at
+loaded later. ``ddebug_query=`` and bare ``dyndbg=`` are only processed at
 boot.


@@ -301,7 +301,7 @@ The ``dyndbg`` option is a "fake" module parameter, which means:

 For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
 enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
-the sysfs interface if the debug messages are no longer needed::
+the debugfs interface if the debug messages are no longer needed::

 echo "module module_name -p" > <debugfs>/dynamic_debug/control

@@ -76,6 +76,7 @@ configure specific aspects of kernel behavior to your liking.
 thunderbolt
 LSM/index
 mm/index
+perf-security

 .. only:: subproject and html

@@ -331,7 +331,7 @@
 APC and your system crashes randomly.

 apic= [APIC,X86] Advanced Programmable Interrupt Controller
-Change the output verbosity whilst booting
+Change the output verbosity while booting
 Format: { quiet (default) | verbose | debug }
 Change the amount of debugging information output
 when initialising the APIC and IO-APIC components.
@@ -4,13 +4,13 @@
 Concepts overview
 =================

-The memory management in Linux is complex system that evolved over the
-years and included more and more functionality to support variety of
+The memory management in Linux is a complex system that evolved over the
+years and included more and more functionality to support a variety of
 systems from MMU-less microcontrollers to supercomputers. The memory
-management for systems without MMU is called ``nommu`` and it
+management for systems without an MMU is called ``nommu`` and it
 definitely deserves a dedicated document, which hopefully will be
 eventually written. Yet, although some of the concepts are the same,
-here we assume that MMU is available and CPU can translate a virtual
+here we assume that an MMU is available and a CPU can translate a virtual
 address to a physical address.

 .. contents:: :local:
@@ -21,10 +21,10 @@ Virtual Memory Primer
 The physical memory in a computer system is a limited resource and
 even for systems that support memory hotplug there is a hard limit on
 the amount of memory that can be installed. The physical memory is not
-necessary contiguous, it might be accessible as a set of distinct
+necessarily contiguous; it might be accessible as a set of distinct
 address ranges. Besides, different CPU architectures, and even
-different implementations of the same architecture have different view
-how these address ranges defined.
+different implementations of the same architecture have different views
+of how these address ranges are defined.

 All this makes dealing directly with physical memory quite complex and
 to avoid this complexity a concept of virtual memory was developed.
@@ -48,8 +48,8 @@ appropriate kernel configuration option.

 Each physical memory page can be mapped as one or more virtual
 pages. These mappings are described by page tables that allow
-translation from virtual address used by programs to real address in
-the physical memory. The page tables organized hierarchically.
+translation from a virtual address used by programs to the physical
+memory address. The page tables are organized hierarchically.

 The tables at the lowest level of the hierarchy contain physical
 addresses of actual pages used by the software. The tables at higher
@@ -121,8 +121,8 @@ Nodes
 Many multi-processor machines are NUMA - Non-Uniform Memory Access -
 systems. In such systems the memory is arranged into banks that have
 different access latency depending on the "distance" from the
-processor. Each bank is referred as `node` and for each node Linux
-constructs an independent memory management subsystem. A node has it's
+processor. Each bank is referred to as a `node` and for each node Linux
+constructs an independent memory management subsystem. A node has its
 own set of zones, lists of free and used pages and various statistics
 counters. You can find more details about NUMA in
 :ref:`Documentation/vm/numa.rst <numa>` and in
@@ -149,9 +149,9 @@ for program's stack and heap or by explicit calls to mmap(2) system
 call. Usually, the anonymous mappings only define virtual memory areas
 that the program is allowed to access. The read accesses will result
 in creation of a page table entry that references a special physical
-page filled with zeroes. When the program performs a write, regular
+page filled with zeroes. When the program performs a write, a regular
 physical page will be allocated to hold the written data. The page
-will be marked dirty and if the kernel will decide to repurpose it,
+will be marked dirty and if the kernel decides to repurpose it,
 the dirty page will be swapped out.

 Reclaim
@@ -181,8 +181,8 @@ pressure.
 The process of freeing the reclaimable physical memory pages and
 repurposing them is called (surprise!) `reclaim`. Linux can reclaim
 pages either asynchronously or synchronously, depending on the state
-of the system. When system is not loaded, most of the memory is free
-and allocation request will be satisfied immediately from the free
+of the system. When the system is not loaded, most of the memory is free
+and allocation requests will be satisfied immediately from the free
 pages supply. As the load increases, the amount of the free pages goes
 down and when it reaches a certain threshold (high watermark), an
 allocation request will awaken the ``kswapd`` daemon. It will
@@ -190,7 +190,7 @@ asynchronously scan memory pages and either just free them if the data
 they contain is available elsewhere, or evict to the backing storage
 device (remember those dirty pages?). As memory usage increases even
 more and reaches another threshold - min watermark - an allocation
-will trigger the `direct reclaim`. In this case allocation is stalled
+will trigger `direct reclaim`. In this case allocation is stalled
 until enough memory pages are reclaimed to satisfy the request.

 Compaction
@@ -200,7 +200,7 @@ As the system runs, tasks allocate and free the memory and it becomes
 fragmented. Although with virtual memory it is possible to present
 scattered physical pages as virtually contiguous range, sometimes it is
 necessary to allocate large physically contiguous memory areas. Such
-need may arise, for instance, when a device driver requires large
+need may arise, for instance, when a device driver requires a large
 buffer for DMA, or when THP allocates a huge page. Memory `compaction`
 addresses the fragmentation issue. This mechanism moves occupied pages
 from the lower part of a memory zone to free pages in the upper part
@@ -208,15 +208,16 @@ of the zone. When a compaction scan is finished free pages are grouped
 together at the beginning of the zone and allocations of large
 physically contiguous areas become possible.

-Like reclaim, the compaction may happen asynchronously in ``kcompactd``
-daemon or synchronously as a result of memory allocation request.
+Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
+daemon or synchronously as a result of a memory allocation request.

 OOM killer
 ==========

-It may happen, that on a loaded machine memory will be exhausted. When
-the kernel detects that the system runs out of memory (OOM) it invokes
-`OOM killer`. Its mission is simple: all it has to do is to select a
-task to sacrifice for the sake of the overall system health. The
-selected task is killed in a hope that after it exits enough memory
-will be freed to continue normal operation.
+It is possible that on a loaded machine memory will be exhausted and the
+kernel will be unable to reclaim enough memory to continue to operate. In
+order to save the rest of the system, it invokes the `OOM killer`.
+
+The `OOM killer` selects a task to sacrifice for the sake of the overall
+system health. The selected task is killed in a hope that after it exits
+enough memory will be freed to continue normal operation.
97
Documentation/admin-guide/perf-security.rst
Normal file
@@ -0,0 +1,97 @@
+.. _perf_security:
+
+Perf Events and tool security
+=============================
+
+Overview
+--------
+
+Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_ can
+impose a considerable risk of leaking sensitive data accessed by monitored
+processes. The data leakage is possible both in scenarios of direct usage of
+perf_events system call API [2]_ and over data files generated by Perf tool user
+mode utility (Perf) [3]_ , [4]_ . The risk depends on the nature of data that
+perf_events performance monitoring units (PMU) [2]_ collect and expose for
+performance analysis. Having that said perf_events/Perf performance monitoring
+is the subject for security access control management [5]_ .
+
+perf_events/Perf access control
+-------------------------------
+
+To perform security checks, the Linux implementation splits processes into two
+categories [6]_ : a) privileged processes (whose effective user ID is 0, referred
+to as superuser or root), and b) unprivileged processes (whose effective UID is
+nonzero). Privileged processes bypass all kernel security permission checks so
+perf_events performance monitoring is fully available to privileged processes
+without access, scope and resource restrictions.
+
+Unprivileged processes are subject to a full security permission check based on
+the process's credentials [5]_ (usually: effective UID, effective GID, and
+supplementary group list).
+
+Linux divides the privileges traditionally associated with superuser into
+distinct units, known as capabilities [6]_ , which can be independently enabled
+and disabled on per-thread basis for processes and files of unprivileged users.
+
+Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated as
+privileged processes with respect to perf_events performance monitoring and
+bypass *scope* permissions checks in the kernel.
+
+Unprivileged processes using perf_events system call API is also subject for
+PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose outcome
+determines whether monitoring is permitted. So unprivileged processes provided
+with CAP_SYS_PTRACE capability are effectively permitted to pass the check.
+
+Other capabilities being granted to unprivileged processes can effectively
+enable capturing of additional data required for later performance analysis of
+monitored processes or a system. For example, CAP_SYSLOG capability permits
+reading kernel space memory addresses from /proc/kallsyms file.
+
+perf_events/Perf unprivileged users
+-----------------------------------
+
+perf_events/Perf *scope* and *access* control for unprivileged processes is
+governed by perf_event_paranoid [2]_ setting:
+
+-1:
+Impose no *scope* and *access* restrictions on using perf_events performance
+monitoring. Per-user per-cpu perf_event_mlock_kb [2]_ locking limit is
+ignored when allocating memory buffers for storing performance data.
+This is the least secure mode since allowed monitored *scope* is
+maximized and no perf_events specific limits are imposed on *resources*
+allocated for performance monitoring.
+
+>=0:
+*scope* includes per-process and system wide performance monitoring
+but excludes raw tracepoints and ftrace function tracepoints monitoring.
+CPU and system events happened when executing either in user or
+in kernel space can be monitored and captured for later analysis.
+Per-user per-cpu perf_event_mlock_kb locking limit is imposed but
+ignored for unprivileged processes with CAP_IPC_LOCK [6]_ capability.
+
+>=1:
+*scope* includes per-process performance monitoring only and excludes
+system wide performance monitoring. CPU and system events happened when
+executing either in user or in kernel space can be monitored and
+captured for later analysis. Per-user per-cpu perf_event_mlock_kb
+locking limit is imposed but ignored for unprivileged processes with
+CAP_IPC_LOCK capability.
+
+>=2:
+*scope* includes per-process performance monitoring only. CPU and system
+events happened when executing in user space only can be monitored and
+captured for later analysis. Per-user per-cpu perf_event_mlock_kb
+locking limit is imposed but ignored for unprivileged processes with
+CAP_IPC_LOCK capability.
+
+Bibliography
+------------
+
+.. [1] `<https://lwn.net/Articles/337493/>`_
+.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
+.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
+.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
+.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
+.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
+.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
@@ -54,7 +54,7 @@ those errors are correctable.
 Types of errors
 ---------------

-Most mechanisms used on modern systems use use technologies like Hamming
+Most mechanisms used on modern systems use technologies like Hamming
 Codes that allow error correction when the number of errors on a bit packet
 is below a threshold. If the number of errors is above, those mechanisms
 can indicate with a high degree of confidence that an error happened, but
@@ -44,7 +44,7 @@ only valid reason for deferring the publication of a fix is to accommodate
 the logistics of QA and large scale rollouts which require release
 coordination.

-Whilst embargoed information may be shared with trusted individuals in
+While embargoed information may be shared with trusted individuals in
 order to develop a fix, such information will not be published alongside
 the fix or on any other disclosure channel without the permission of the
 reporter. This includes but is not limited to the original bug report