Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...
73
Documentation/accounting/psi.txt
Normal file
@@ -0,0 +1,73 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
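Note (not part of the patch): the line format documented in psi.txt is regular enough to parse in a few lines. The following is a minimal sketch in Python, assuming the `kind key=value ...` layout shown in the new file. The helper names `parse_pressure` and `stall_percent` are hypothetical, and treating `total` as a microsecond counter is an assumption, since this version of the document does not state the unit.

```python
def parse_pressure(text):
    """Parse the contents of a pressure file into {kind: {field: value}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]  # kind is "some" or "full"
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is an integer counter; avg10/avg60/avg300 are decimals
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def stall_percent(total_before, total_after, elapsed_us):
    """Average stall share over a custom window, from two 'total' samples.

    Assumption: 'total' and elapsed_us share one time unit (microseconds).
    """
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    return 100.0 * (total_after - total_before) / elapsed_us


# Sample text copied from the format shown in psi.txt.
sample = (
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
parsed = parse_pressure(sample)
```

Sampling `total` twice and passing the difference to `stall_percent` is one way to "average trends over custom time frames" as the text describes; the averaging windows exported by the kernel are not needed for that calculation.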
@@ -966,6 +966,12 @@ All time durations are in microseconds.
 $PERIOD duration. "max" for $MAX indicates no limit. If only
 one number is written, $MAX is updated.

+cpu.pressure
+A read-only nested-key file which exists on non-root cgroups.
+
+Shows pressure stall information for CPU. See
+Documentation/accounting/psi.txt for details.
+

 Memory
 ------
@@ -1127,6 +1133,10 @@ PAGE_SIZE multiple when read back.
 disk readahead. For now OOM in memory cgroup kills
 tasks iff shortage has happened inside page fault.

+This event is not raised if the OOM killer is not
+considered as an option, e.g. for failed high-order
+allocations.
+
 oom_kill
 The number of processes belonging to this cgroup
 killed by any kind of OOM killer.
@@ -1271,6 +1281,12 @@ PAGE_SIZE multiple when read back.
 higher than the limit for an extended period of time. This
 reduces the impact on the workload and memory management.

+memory.pressure
+A read-only nested-key file which exists on non-root cgroups.
+
+Shows pressure stall information for memory. See
+Documentation/accounting/psi.txt for details.
+

 Usage Guidelines
 ~~~~~~~~~~~~~~~~
@@ -1408,6 +1424,12 @@ IO Interface Files

 8:16 rbps=2097152 wbps=max riops=max wiops=max

+io.pressure
+A read-only nested-key file which exists on non-root cgroups.
+
+Shows pressure stall information for IO. See
+Documentation/accounting/psi.txt for details.
+

 Writeback
 ~~~~~~~~~
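Note (not part of the patch): the three hunks above add cpu.pressure, memory.pressure and io.pressure as per-cgroup files that exist on non-root cgroups. A minimal sketch of collecting whichever of them are present in one cgroup directory could look like this. The function name `read_cgroup_pressure` is hypothetical, and the demonstration uses a temporary directory as a stand-in for a real cgroup path, which depends on where cgroup2 is mounted.

```python
import os
import tempfile

RESOURCES = ("cpu", "memory", "io")


def read_cgroup_pressure(cgroup_dir):
    """Return {resource: file contents} for pressure files in cgroup_dir."""
    found = {}
    for resource in RESOURCES:
        path = os.path.join(cgroup_dir, resource + ".pressure")
        try:
            with open(path) as f:
                found[resource] = f.read()
        except FileNotFoundError:
            # The file is absent, e.g. in the root cgroup.
            continue
    return found


# Demonstration against a temporary directory standing in for a cgroup.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "cpu.pressure"), "w") as f:
        f.write("some avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
    demo = read_cgroup_pressure(demo_dir)
```

Skipping missing files rather than failing keeps the helper usable on the root cgroup, where the patch says these files do not exist.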
@@ -4851,6 +4851,18 @@
 This is actually a boot loader parameter; the value is
 passed to the kernel using a special protocol.

+vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y.
+May slow down system boot speed, especially when
+enabled on systems with a large amount of memory.
+All options are enabled by default, and this
+interface is meant to allow for selectively
+enabling or disabling specific virtual memory
+debugging features.
+
+Available options are:
+P Enable page structure init time poisoning
+- Disable all of the above options
+
 vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
 size of <nn>. This can be used to increase the
 minimum size (128MB on x86). It can also be used to
@@ -858,6 +858,7 @@ Writeback: 0 kB
 AnonPages: 861800 kB
 Mapped: 280372 kB
 Shmem: 644 kB
+KReclaimable: 168048 kB
 Slab: 284364 kB
 SReclaimable: 159856 kB
 SUnreclaim: 124508 kB
@@ -925,6 +926,9 @@ AnonHugePages: Non-file backed huge pages mapped into userspace page tables
 ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
 with huge pages
 ShmemPmdMapped: Shared memory mapped into userspace with huge pages
+KReclaimable: Kernel allocations that the kernel will attempt to reclaim
+under memory pressure. Includes SReclaimable (below), and other
+direct allocations with a shrinker.
 Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
 SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
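Note (not part of the patch): since KReclaimable is documented as including SReclaimable, the difference between the two gives the reclaimable kernel memory that lives outside the slab caches. A small worked example in Python, using the sample values from the hunk above; the helper name `parse_meminfo_kb` is hypothetical and the parser only handles lines of the form `Name: value kB`.

```python
def parse_meminfo_kb(text):
    """Map each 'Name: value kB' line to an integer number of kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Values copied from the example output in the patched document.
sample = """\
KReclaimable: 168048 kB
Slab: 284364 kB
SReclaimable: 159856 kB
SUnreclaim: 124508 kB
"""
info = parse_meminfo_kb(sample)

# Reclaimable kernel memory that is not part of the slab caches.
non_slab_reclaimable_kb = info["KReclaimable"] - info["SReclaimable"]
```

With the sample values this yields 8192 kB, and the Slab figure equals SReclaimable plus SUnreclaim.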
@@ -36,9 +36,10 @@ debugging is enabled. Format:

 slub_debug=<Debug-Options>
 Enable options for all slabs
-slub_debug=<Debug-Options>,<slab name>
-Enable options only for select slabs

+slub_debug=<Debug-Options>,<slab name1>,<slab name2>,...
+Enable options only for select slabs (no spaces
+after a comma)

 Possible debug options are::

@@ -62,7 +63,12 @@ Trying to find an issue in the dentry cache? Try::

 slub_debug=,dentry

-to only enable debugging on the dentry cache.
+to only enable debugging on the dentry cache. You may use an asterisk at the
+end of the slab name, in order to cover all slabs with the same prefix. For
+example, here's how you can poison the dentry cache as well as all kmalloc
+slabs:
+
+slub_debug=P,kmalloc-*,dentry

 Red zoning and tracking may realign the slab. We can just apply sanity checks
 to the dentry cache with::
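Note (not part of the patch): the selection rule described in the new text (a comma-separated slab list, with an optional trailing asterisk meaning "any slab with this prefix") can be illustrated with a short Python sketch. This models the documented behaviour only and is not the kernel's implementation; the names `split_slub_debug` and `slab_selected` are hypothetical.

```python
def split_slub_debug(value):
    """Split 'P,kmalloc-*,dentry' into ('P', 'kmalloc-*,dentry')."""
    options, _, slab_list = value.partition(",")
    return options, slab_list


def slab_selected(slab_name, slab_list):
    """Return True if slab_name is covered by a comma-separated slab list."""
    for pattern in slab_list.split(","):
        if pattern.endswith("*"):
            # Trailing asterisk: match every slab sharing the prefix.
            if slab_name.startswith(pattern[:-1]):
                return True
        elif slab_name == pattern:
            return True
    return False


# The example value from the patched documentation.
options, slabs = split_slub_debug("P,kmalloc-*,dentry")
```

For the documented example, `slab_selected("kmalloc-64", slabs)` and `slab_selected("dentry", slabs)` are both true, while a slab such as `inode_cache` is not selected.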
@@ -90,12 +90,12 @@ pci proc | -- | -- | WC |
 Advanced APIs for drivers
 -------------------------
 A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
-vm_insert_pfn
+vmf_insert_pfn

 Drivers wanting to export some pages to userspace do it by using mmap
 interface and a combination of
 1) pgprot_noncached()
-2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
+2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()

 With PAT support, a new API pgprot_writecombine is being added. So, drivers can
 continue to use the above sequence, with either pgprot_noncached() or