thp: update Documentation/{vm/transhuge,filesystems/proc}.txt

Add info about tmpfs/shmem with huge pages.

Link: http://lkml.kernel.org/r/1466021202-61880-38-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

committed by Linus Torvalds
parent 779750d20b
commit 1b5946a84d
Documentation/filesystems/proc.txt:

@@ -436,6 +436,7 @@ Private_Dirty: 0 kB
 Referenced: 892 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
+ShmemPmdMapped: 0 kB
 Shared_Hugetlb: 0 kB
 Private_Hugetlb: 0 kB
 Swap: 0 kB
@@ -464,6 +465,8 @@ accessed.
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
 "AnonHugePages" shows the amount of memory backed by transparent hugepage.
+"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
+huge pages.
 "Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
 hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
 reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
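(Not part of the patch.) As a quick sketch of how the new per-mapping field can be consumed, the value can be pulled out of a smaps excerpt with awk. The sample numbers below are made up:

```shell
# Made-up /proc/PID/smaps excerpt; the field names are the ones documented above.
smaps_sample='AnonHugePages:      2048 kB
ShmemPmdMapped:     4096 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB'

# Pull out the ShmemPmdMapped value in kB.
shmem_pmd_kb=$(printf '%s\n' "$smaps_sample" | awk '/^ShmemPmdMapped:/ {print $2}')
echo "ShmemPmdMapped: ${shmem_pmd_kb} kB"
```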
@@ -868,6 +871,9 @@ VmallocTotal: 112216 kB
 VmallocUsed: 428 kB
 VmallocChunk: 111088 kB
 AnonHugePages: 49152 kB
+ShmemHugePages: 0 kB
+ShmemPmdMapped: 0 kB

 MemTotal: Total usable ram (i.e. physical ram minus a few reserved
 bits and the kernel binary code)
@@ -912,6 +918,9 @@ MemAvailable: An estimate of how much memory is available for starting new
 AnonHugePages: Non-file backed huge pages mapped into userspace page tables
 Mapped: files which have been mmaped, such as libraries
 Shmem: Total memory used by shared memory (shmem) and tmpfs
+ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
+                with huge pages
+ShmemPmdMapped: Shared memory mapped into userspace with huge pages
 Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
 SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
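(Not part of the patch.) A made-up meminfo excerpt can illustrate converting the new kB figures to a PMD-sized page count, assuming 2 MiB huge pages (the usual x86-64 PMD size; other architectures differ):

```shell
# Made-up /proc/meminfo excerpt with the fields added by this patch.
meminfo_sample='AnonHugePages:     49152 kB
ShmemHugePages:     6144 kB
ShmemPmdMapped:     2048 kB'

# Extract the ShmemHugePages value and convert kB to 2 MiB (2048 kB) pages.
shmem_huge_kb=$(printf '%s\n' "$meminfo_sample" | awk '/^ShmemHugePages:/ {print $2}')
shmem_huge_pages=$((shmem_huge_kb / 2048))
echo "shmem huge pages: $shmem_huge_pages"
```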
Documentation/vm/transhuge.txt:

@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
 that supports the automatic promotion and demotion of page sizes and
 without the shortcomings of hugetlbfs.

-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Currently it only works for anonymous memory mappings and tmpfs/shmem.
+But in the future it can expand to other filesystems.

 The reason applications are running faster is because of two
 factors. The first factor is almost completely irrelevant and it's not
@@ -57,10 +57,6 @@ miss is going to run faster.
 feature that applies to all dynamic high order allocations in the
 kernel)

-- this initial support only offers the feature in the anonymous memory
-  regions but it'd be ideal to move it to tmpfs and the pagecache
-  later
-
 Transparent Hugepage Support maximizes the usefulness of free memory
 if compared to the reservation approach of hugetlbfs by allowing all
 unused memory to be used as cache or other movable (or even unmovable
@@ -94,21 +90,21 @@ madvise(MADV_HUGEPAGE) on their critical mmapped regions.

 == sysfs ==

-Transparent Hugepage Support can be entirely disabled (mostly for
-debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
-avoid the risk of consuming more memory resources) or enabled system
-wide. This can be achieved with one of:
+Transparent Hugepage Support for anonymous memory can be entirely disabled
+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
+regions (to avoid the risk of consuming more memory resources) or enabled
+system wide. This can be achieved with one of:

 echo always >/sys/kernel/mm/transparent_hugepage/enabled
 echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 echo never >/sys/kernel/mm/transparent_hugepage/enabled

 It's also possible to limit defrag efforts in the VM to generate
-hugepages in case they're not immediately free to madvise regions or
-to never try to defrag memory and simply fallback to regular pages
-unless hugepages are immediately available. Clearly if we spend CPU
-time to defrag memory, we would expect to gain even more by the fact
-we use hugepages later instead of regular pages. This isn't always
+anonymous hugepages in case they're not immediately free to madvise
+regions or to never try to defrag memory and simply fallback to regular
+pages unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact we
+use hugepages later instead of regular pages. This isn't always
 guaranteed, but it may be more likely in case the allocation is for a
 MADV_HUGEPAGE region.
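(Not part of the patch.) Reading the enabled knob back reports all possible values with the active one in brackets; a small sketch of extracting the active policy, with the sysfs contents hard-coded as sample data since the real file needs a THP-enabled kernel:

```shell
# Sample contents of /sys/kernel/mm/transparent_hugepage/enabled;
# the active policy is the bracketed word.
enabled_sample='always [madvise] never'

# Extract the bracketed (active) value.
active=$(printf '%s\n' "$enabled_sample" | sed 's/.*\[\(.*\)\].*/\1/')
echo "active policy: $active"
```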
@@ -133,9 +129,9 @@ that have used madvise(MADV_HUGEPAGE). This is the default behaviour.

 "never" should be self-explanatory.

-By default kernel tries to use huge zero page on read page fault.
-It's possible to disable huge zero page by writing 0 or enable it
-back by writing 1:
+By default kernel tries to use huge zero page on read page fault to
+anonymous mapping. It's possible to disable huge zero page by writing 0
+or enable it back by writing 1:

 echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
@@ -204,21 +200,67 @@ Support by passing the parameter "transparent_hugepage=always" or
 "transparent_hugepage=madvise" or "transparent_hugepage=never"
 (without "") to the kernel command line.

+== Hugepages in tmpfs/shmem ==
+
+You can control hugepage allocation policy in tmpfs with mount option
+"huge=". It can have following values:
+
+  - "always":
+    Attempt to allocate huge pages every time we need a new page;
+
+  - "never":
+    Do not allocate huge pages;
+
+  - "within_size":
+    Only allocate huge page if it will be fully within i_size.
+    Also respect fadvise()/madvise() hints;
+
+  - "advise":
+    Only allocate huge pages if requested with fadvise()/madvise();
+
+The default policy is "never".
+
+"mount -o remount,huge= /mountpoint" works fine after mount: remounting
+huge=never will not attempt to break up huge pages at all, just stop more
+from being allocated.
+
+There's also sysfs knob to control hugepage allocation policy for internal
+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+In addition to policies listed above, shmem_enabled allows two further
+values:
+
+  - "deny":
+    For use in emergencies, to force the huge option off from
+    all mounts;
+  - "force":
+    Force the huge option on for all - very useful for testing;
+
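(Not part of the patch.) The mount option described above could be exercised like this; the mount point is hypothetical and the commands need root, so this is only an illustrative config fragment:

```shell
# Mount a tmpfs that allocates huge pages only within i_size
# (hypothetical mount point /mnt/mytmpfs).
mount -t tmpfs -o huge=within_size tmpfs /mnt/mytmpfs

# Change the policy on a live mount; per the text above, existing huge
# pages are left alone and only new allocations are affected.
mount -o remount,huge=never /mnt/mytmpfs
```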
 == Need of application restart ==

-The transparent_hugepage/enabled values only affect future
-behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to
-the regions registered in khugepaged.
+The transparent_hugepage/enabled values and tmpfs mount option only affect
+future behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to the
+regions registered in khugepaged.

 == Monitoring usage ==

-The number of transparent huge pages currently used by the system is
-available by reading the AnonHugePages field in /proc/meminfo. To
-identify what applications are using transparent huge pages, it is
-necessary to read /proc/PID/smaps and count the AnonHugePages fields
-for each mapping. Note that reading the smaps file is expensive and
-reading it frequently will incur overhead.
+The number of anonymous transparent huge pages currently used by the
+system is available by reading the AnonHugePages field in /proc/meminfo.
+To identify what applications are using anonymous transparent huge pages,
+it is necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping.
+
+The number of file transparent huge pages mapped to userspace is available
+by reading ShmemPmdMapped and ShmemHugePages fields in /proc/meminfo.
+To identify what applications are mapping file transparent huge pages, it
+is necessary to read /proc/PID/smaps and count the FileHugeMapped fields
+for each mapping.
+
+Note that reading the smaps file is expensive and reading it
+frequently will incur overhead.

 There are a number of counters in /proc/vmstat that may be used to
 monitor how successfully the system is providing huge pages for use.
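(Not part of the patch.) The per-mapping counting described above amounts to summing the AnonHugePages fields across all mappings in smaps; a sketch over made-up sample data:

```shell
# Made-up smaps content with several mappings; summing the per-mapping
# AnonHugePages fields gives the process-wide figure described above.
smaps_sample='AnonHugePages:      2048 kB
AnonHugePages:         0 kB
AnonHugePages:      4096 kB'

total_kb=$(printf '%s\n' "$smaps_sample" | awk '/^AnonHugePages:/ {sum += $2} END {print sum}')
echo "total AnonHugePages: ${total_kb} kB"
```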
@@ -238,6 +280,12 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 of pages that should be collapsed into one huge page but failed
 the allocation.

+thp_file_alloc is incremented every time a file huge page is successfully
+	allocated.
+
+thp_file_mapped is incremented every time a file huge page is mapped into
+	user address space.
+
 thp_split_page is incremented every time a huge page is split into base
 pages. This can happen for a variety of reasons but a common
 reason is that a huge page is old and is being reclaimed.
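(Not part of the patch.) The new counters live alongside the existing thp_* entries in /proc/vmstat; a sketch of picking one out, with the file contents hard-coded as made-up sample data:

```shell
# Made-up /proc/vmstat excerpt including the counters added by this patch.
vmstat_sample='thp_fault_alloc 120
thp_file_alloc 7
thp_file_mapped 5
thp_split_page 2'

# Select the thp_file_alloc counter by exact name.
file_alloc=$(printf '%s\n' "$vmstat_sample" | awk '$1 == "thp_file_alloc" {print $2}')
echo "thp_file_alloc: $file_alloc"
```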
@@ -403,19 +451,27 @@ pages:
 on relevant sub-page of the compound page.

 - map/unmap of the whole compound page accounted in compound_mapcount
-  (stored in first tail page).
+  (stored in first tail page). For file huge pages, we also increment
+  ->_mapcount of all sub-pages in order to have race-free detection of
+  last unmap of subpages.

-PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
-This additional reference is required to get race-free detection of unmap of
-subpages when we have them mapped with both PMDs and PTEs.
+PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
+
+For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
+subpages is offset up by one. This additional reference is required to
+get race-free detection of unmap of subpages when we have them mapped with
+both PMDs and PTEs.

 This is optimization required to lower overhead of per-subpage mapcount
 tracking. The alternative is alter ->_mapcount in all subpages on each
 map/unmap of the whole compound page.

-We set PG_double_map when a PMD of the page got split for the first time,
-but still have PMD mapping. The additional references go away with last
-compound_mapcount.
+For anonymous pages, we set PG_double_map when a PMD of the page got split
+for the first time, but still have PMD mapping. The additional references
+go away with last compound_mapcount.
+
+File pages get PG_double_map set on first map of the page with PTE and
+goes away when the page gets evicted from page cache.

 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
@@ -427,7 +483,7 @@ sum of mapcount of all sub-pages plus one (split_huge_page caller must
 have reference for head page).

 split_huge_page uses migration entries to stabilize page->_refcount and
-page->_mapcount.
+page->_mapcount of anonymous pages. File pages just got unmapped.

 We are safe against physical memory scanners too: the only legitimate way
 scanner can get reference to a page is get_page_unless_zero().