As Eric said,
"what we need to do is move the variable vmcoreinfo_note out of the
kernel's .bss section. And modify the code to regenerate and keep this
information in something like the control page.
Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures. I clearly was not
watching closely the data someone decided to keep this silly thing in
the kernel's .bss section."
This patch allocates extra pages for these vmcoreinfo_XXX variables, one
advantage is that it enhances some safety of vmcoreinfo, because
vmcoreinfo now is kept far away from other kernel data structures.
Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the first parameter of container_of() is a pointer to a
non-const-qualified array type (and the third parameter names a
non-const-qualified array member), the local variable __mptr will be
defined with a const-qualified array type. In ISO C, these types are
incompatible. They work as expected in GNU C, but some versions will
issue warnings. For example, GCC 4.9 produces the warning
"initialization from incompatible pointer type".
Here is an example of where the problem occurs:
-------------------------------------------------------
#include <linux/kernel.h>
#include <linux/module.h>
MODULE_LICENSE("GPL");
struct st {
int a;
char b[16];
};
static int __init example_init(void) {
struct st t = { .a = 101, .b = "hello" };
char (*p)[16] = &t.b;
struct st *x = container_of(p, struct st, b);
printk(KERN_DEBUG "%p %p\n", (void *)&t, (void *)x);
return 0;
}
static void __exit example_exit(void) {
}
module_init(example_init);
module_exit(example_exit);
-------------------------------------------------------
Building the module with gcc-4.9 results in these warnings (where '{m}'
is the module source and '{k}' is the kernel source):
-------------------------------------------------------
In file included from {m}/example.c:1:0:
{m}/example.c: In function `example_init':
{k}/include/linux/kernel.h:854:48: warning: initialization from incompatible pointer type
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
^
{m}/example.c:14:17: note: in expansion of macro `container_of'
struct st *x = container_of(p, struct st, b);
^
{k}/include/linux/kernel.h:854:48: warning: (near initialization for `x')
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
^
{m}/example.c:14:17: note: in expansion of macro `container_of'
struct st *x = container_of(p, struct st, b);
^
-------------------------------------------------------
Replace the type checking performed by the macro to avoid these
warnings. Make sure `*(ptr)` either has type compatible with the
member, or has type compatible with `void`, ignoring qualifiers. Raise
compiler errors if this is not true. This is stronger than the previous
behaviour, which only resulted in compiler warnings for a type mismatch.
[arnd@arndb.de: fix new warnings for container_of()]
Link: http://lkml.kernel.org/r/20170620200940.90557-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170525120316.24473-7-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"kernel.h: handle pointers to arrays better in container_of()" triggers:
In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/linux/syscalls.h:71,
from fs/dcache.c:17:
fs/dcache.c: In function 'release_dentry_name_snapshot':
include/linux/compiler.h:542:38: error: call to '__compiletime_assert_305' declared with attribute error: pointer type mismatch in container_of()
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/compiler.h:525:4: note: in definition of macro '__compiletime_assert'
prefix ## suffix(); \
^
include/linux/compiler.h:542:2: note: in expansion of macro '_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/build_bug.h:46:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^
include/linux/kernel.h:860:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
^
fs/dcache.c:305:7: note: in expansion of macro 'container_of'
p = container_of(name->name, struct external_name, name[0]);
Switch name_snapshot to use unsigned chars, matching struct qstr and
struct external_name.
Link: http://lkml.kernel.org/r/20170710152134.0f78c1e6@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fill in missing kernel-doc for missing elements in struct sock.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtc_time_to_tm and rtc_tm_to_time are not deprecated and make perfect sense
for RTCs that are simple 32bit counters.
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
In case of KVM_S390_GET_CMMA_BITS, the kernel does not only read struct
kvm_s390_cmma_log passed from userspace (which constitutes _IOC_WRITE),
it also writes back a return value (which constitutes _IOC_READ) making
this an _IOWR ioctl instead of _IOW.
Fixes: 4036e387 ("KVM: s390: ioctls to get and set guest storage attributes")
Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Clean up: Registration mode details are now handled by the rdma_rw
API, and thus can be removed from svcrdma.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API,
the details of Read sink buffer registration are dealt with by the
kernel's RDMA core. This cache is no longer used, and can be
removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up:
The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced
the Register, Read, and Invalidate completion handlers. Remove the
old ones, which are no longer used.
These handlers shared some helper code with svc_rdma_wc_send. Fold
the wc_common helper back into the one remaining completion handler.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current svcrdma recvfrom code path has a lot of detail about
registration mode and the type of port (iWARP, IB, etc).
Instead, use the RDMA core's generic R/W API. This shares code with
other RDMA-enabled ULPs that manages the gory details of buffer
registration and the posting of RDMA Read Work Requests.
Since the Read list marshaling code is being replaced, I took the
opportunity to replace C structure-based XDR encoding code with more
portable code that uses pointer arithmetic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svc_rdma_rw.c already contains helpers for the sendto path.
Introduce helpers for the recvfrom path.
The plan is to replace the local NFSD bespoke code that constructs
and posts RDMA Read Work Requests with calls to the rdma_rw API.
This shares code with other RDMA-enabled ULPs that manages the gory
details of buffer registration and posting Work Requests.
This new code also puts all RDMA_NOMSG-specific logic in one place.
Lastly, the use of rqstp->rq_arg.pages is deprecated in favor of
using rqstp->rq_pages directly, for clarity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svcrdma needs 259 pages allocated to receive 1MB NFSv4.0 WRITE requests:
- 1 page for the transport header and head iovec
- 256 pages for the data payload
- 1 page for the trailing GETATTR request (since NFSD XDR decoding
does not look for a tail iovec, the GETATTR is stuck at the end
of the rqstp->rq_arg.pages list)
- 1 page for building the reply xdr_buf
But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that
svc_alloc_arg never allocates that many pages. To address this:
1. The final element of rq_pages always points to NULL. To
accommodate up to 259 pages in rq_pages, add an extra element
to rq_pages for the array termination sentinel.
2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES
is calculated, so it can go up to 259. Bruce noted that the
calculation assumes sv_max_mesg is a multiple of PAGE_SIZE,
which might not always be true. I didn't change this assumption.
3. Change the loop boundaries to allow 259 pages to be allocated.
Additional clean-up: WARN_ON_ONCE adds an extra conditional branch,
which is basically never taken. And there's no need to dump the
stack here because svc_alloc_arg has only one caller.
Keeping that NULL "array termination sentinel"; there doesn't appear to
be any code that depends on it, only code in nfsd_splice_actor() which
needs the 259th element to be initialized to *something*. So it's
possible we could just keep the array at 259 elements and drop that
final NULL, but we're being conservative for now.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull i2c updates from Wolfram Sang:
"This pull request contains:
- i2c core reorganization. One source file became too monolithic. It
is now split up, yet we still have the same named object as the
final output. This should ease maintenance.
- new drivers: ZTE ZX2967 family, ASPEED 24XX/25XX
- designware driver gained slave mode support
- xgene-slimpro driver gained ACPI support
- bigger overhaul for pca-platform driver
- the algo-bit module now supports messages with enforced STOP
- slightly bigger than usual set of driver updates and improvements
and with much appreciated quality assurance from Andy Shevchenko"
* 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (51 commits)
i2c: Provide a stub for i2c_detect_slave_mode()
i2c: designware: Let slave adapter support be optional
i2c: designware: Make HW init functions static
i2c: designware: fix spelling mistakes
i2c: pca-platform: propagate error from i2c_pca_add_numbered_bus
i2c: pca-platform: correctly set algo_data.reset_chip
i2c: acpi: Do not create i2c-clients for LNXVIDEO ACPI devices
i2c: designware: enable SLAVE in platform module
i2c: designware: add SLAVE mode functions
i2c: zx2967: drop COMPILE_TEST dependency
i2c: zx2967: always use the same device when printing errors
i2c: pca-platform: use dev_warn/dev_info instead of printk
i2c: pca-platform: use device managed allocations
i2c: pca-platform: add devicetree awareness
i2c: pca-platform: switch to struct gpio_desc
dt-bindings: add bindings for i2c-pca-platform
i2c: cadance: fix ctrl/addr reg write order
i2c: zx2967: add i2c controller driver for ZTE's zx2967 family
dt: bindings: add documentation for zx2967 family i2c controller
i2c: algo-bit: add support for I2C_M_STOP
...
Pull IOMMU updates from Joerg Roedel:
"This update comes with:
- Support for lockless operation in the ARM io-pgtable code.
This is an important step to solve the scalability problems in the
common dma-iommu code for ARM
- Some Errata workarounds for ARM SMMU implemenations
- Rewrite of the deferred IO/TLB flush code in the AMD IOMMU driver.
The code suffered from very high flush rates, with the new
implementation the flush rate is down to ~1% of what it was before
- Support for amd_iommu=off when booting with kexec.
The problem here was that the IOMMU driver bailed out early without
disabling the iommu hardware, if it was enabled in the old kernel
- The Rockchip IOMMU driver is now available on ARM64
- Align the return value of the iommu_ops->device_group call-backs to
not miss error values
- Preempt-disable optimizations in the Intel VT-d and common IOVA
code to help Linux-RT
- Various other small cleanups and fixes"
* tag 'iommu-updates-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (60 commits)
iommu/vt-d: Constify intel_dma_ops
iommu: Warn once when device_group callback returns NULL
iommu/omap: Return ERR_PTR in device_group call-back
iommu: Return ERR_PTR() values from device_group call-backs
iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()
iommu/vt-d: Don't disable preemption while accessing deferred_flush()
iommu/iova: Don't disable preempt around this_cpu_ptr()
iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126
iommu/arm-smmu-v3: Enable ACPI based HiSilicon CMD_PREFETCH quirk(erratum 161010701)
iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #74
ACPI/IORT: Fixup SMMUv3 resource size for Cavium ThunderX2 SMMUv3 model
iommu/arm-smmu-v3, acpi: Add temporary Cavium SMMU-V3 IORT model number definitions
iommu/io-pgtable-arm: Use dma_wmb() instead of wmb() when publishing table
iommu/io-pgtable: depend on !GENERIC_ATOMIC64 when using COMPILE_TEST with LPAE
iommu/arm-smmu-v3: Remove io-pgtable spinlock
iommu/arm-smmu: Remove io-pgtable spinlock
iommu/io-pgtable-arm-v7s: Support lockless operation
iommu/io-pgtable-arm: Support lockless operation
iommu/io-pgtable: Introduce explicit coherency
iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
...
Pull overlayfs updates from Miklos Szeredi:
"This work from Amir introduces the inodes index feature, which
provides:
- hardlinks are not broken on copy up
- infrastructure for overlayfs NFS export
This also fixes constant st_ino for samefs case for lower hardlinks"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
ovl: mark parent impure and restore timestamp on ovl_link_up()
ovl: document copying layers restrictions with inodes index
ovl: cleanup orphan index entries
ovl: persistent overlay inode nlink for indexed inodes
ovl: implement index dir copy up
ovl: move copy up lock out
ovl: rearrange copy up
ovl: add flag for upper in ovl_entry
ovl: use struct copy_up_ctx as function argument
ovl: base tmpfile in workdir too
ovl: factor out ovl_copy_up_inode() helper
ovl: extract helper to get temp file in copy up
ovl: defer upper dir lock to tempfile link
ovl: hash overlay non-dir inodes by copy up origin
ovl: cleanup bad and stale index entries on mount
ovl: lookup index entry for copy up origin
ovl: verify index dir matches upper dir
ovl: verify upper root dir matches lower root dir
ovl: introduce the inodes index dir feature
ovl: generalize ovl_create_workdir()
...
fwnode_call_int_op() isn't suitable for calling ops that return bool
since it effectively causes the result returned to the user to be
true when an op hasn't been defined or the fwnode is NULL.
Address this by introducing fwnode_call_bool_op() for calling ops
that return bool.
Fixes: 3708184afc "device property: Move FW type specific functionality to FW specific files"
Fixes: 2294b3af05 "device property: Introduce fwnode_device_is_available()"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull more block updates from Jens Axboe:
"This is a followup for block changes, that didn't make the initial
pull request. It's a bit of a mixed bag, this contains:
- A followup pull request from Sagi for NVMe. Outside of fixups for
NVMe, it also includes a series for ensuring that we properly
quiesce hardware queues when browsing live tags.
- Set of integrity fixes from Dmitry (mostly), fixing various issues
for folks using DIF/DIX.
- Fix for a bug introduced in cciss, with the req init changes. From
Christoph.
- Fix for a bug in BFQ, from Paolo.
- Two followup fixes for lightnvm/pblk from Javier.
- Depth fix from Ming for blk-mq-sched.
- Also from Ming, performance fix for mtip32xx that was introduced
with the dynamic initialization of commands"
* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
block: call bio_uninit in bio_endio
nvmet: avoid unneeded assignment of submit_bio return value
nvme-pci: add module parameter for io queue depth
nvme-pci: compile warnings in nvme_alloc_host_mem()
nvmet_fc: Accept variable pad lengths on Create Association LS
nvme_fc/nvmet_fc: revise Create Association descriptor length
lightnvm: pblk: remove unnecessary checks
lightnvm: pblk: control I/O flow also on tear down
cciss: initialize struct scsi_req
null_blk: fix error flow for shared tags during module_init
block: Fix __blkdev_issue_zeroout loop
nvme-rdma: unconditionally recycle the request mr
nvme: split nvme_uninit_ctrl into stop and uninit
virtio_blk: quiesce/unquiesce live IO when entering PM states
mtip32xx: quiesce request queues to make sure no submissions are inflight
nbd: quiesce request queues to make sure no submissions are inflight
nvme: kick requeue list when requeueing a request instead of when starting the queues
nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
...
Pull ceph updates from Ilya Dryomov:
"The main item here is support for v12.y.z ("Luminous") clusters:
RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
feature bits, and various other changes in the RADOS client protocol.
On top of that we have a new fsc mount option to allow supplying
fscache uniquifier (similar to NFS) and the usual pile of filesystem
fixes from Zheng"
* tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
libceph: osd_state is 32 bits wide in luminous
crush: remove an obsolete comment
crush: crush_init_workspace starts with struct crush_work
libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
crush: implement weight and id overrides for straw2
libceph: apply_upmap()
libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
libceph: pg_upmap[_items] infrastructure
libceph: ceph_decode_skip_* helpers
libceph: kill __{insert,lookup,remove}_pg_mapping()
libceph: introduce and switch to decode_pg_mapping()
libceph: don't pass pgid by value
libceph: respect RADOS_BACKOFF backoffs
libceph: make DEFINE_RB_* helpers more general
libceph: avoid unnecessary pi lookups in calc_target()
libceph: use target pi for calc_target() calculations
libceph: always populate t->target_{oid,oloc} in calc_target()
libceph: make sure need_resend targets reflect latest map
libceph: delete from need_resend_linger before check_linger_pool_dne()
...
This patch re-introduces part of a long standing login workaround that
was recently dropped by:
commit 1c99de981f
Author: Nicholas Bellinger <nab@linux-iscsi.org>
Date: Sun Apr 2 13:36:44 2017 -0700
iscsi-target: Drop work-around for legacy GlobalSAN initiator
Namely, the workaround for FirstBurstLength ended up being required by
Mellanox Flexboot PXE boot ROMs as reported by Robert.
So this patch re-adds the work-around for FirstBurstLength within
iscsi_check_proposer_for_optional_reply(), and makes the key optional
to respond when the initiator does not propose, nor respond to it.
Also as requested by Arun, this patch introduces a new TPG attribute
named 'login_keys_workaround' that controls the use of both the
FirstBurstLength workaround, as well as the two other existing
workarounds for gPXE iSCSI boot client.
By default, the workaround is enabled with login_keys_workaround=1,
since Mellanox FlexBoot requires it, and Arun has verified the Qlogic
MSFT initiator already proposes FirstBurstLength, so it's uneffected
by this re-adding this part of the original work-around.
Reported-by: Robert LeBlanc <robert@leblancnet.us>
Cc: Robert LeBlanc <robert@leblancnet.us>
Reviewed-by: Arun Easi <arun.easi@cavium.com>
Cc: <stable@vger.kernel.org> # 3.1+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Pull chrome platform updates from Benson Leung:
"Changes in this pull request are around catching up cros_ec with the
internal chromeos-kernel versions of cros_ec, cros_ec_lpc, and
cros_ec_lightbar.
Also, switching maintainership from olof to bleung"
* tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform:
platform/chrome : Add myself as Maintainer
platform/chrome: cros_ec_lightbar - hide unused PM functions
cros_ec: Don't signal wake event for non-wake host events
cros_ec: Fix deadlock when EC is not responsive at probe
cros_ec: Don't return error when checking command version
platform/chrome: cros_ec_lightbar - Avoid I2C xfer to EC during suspend
platform/chrome: cros_ec_lightbar - Add userspace lightbar control bit to EC
platform/chrome: cros_ec_lightbar - Control of suspend/resume lightbar sequence
platform/chrome: cros_ec_lightbar - Add lightbar program feature to sysfs
platform/chrome: cros_ec_lpc: Add MKBP events support over ACPI
platform/chrome: cros_ec_lpc: Add power management ops
platform/chrome: cros_ec_lpc: Add support for GOOG004 ACPI device
platform/chrome: cros_ec_lpc: Add support for mec1322 EC
platform/chrome: cros_ec_lpc: Add R/W helpers to LPC protocol variants
mfd: cros_ec: Add support for dumping panic information
cros_ec_debugfs: Pass proper struct sizes to cros_ec_cmd_xfer()
mfd: cros_ec: add debugfs, console log file
mfd: cros_ec: Add EC console read structures definitions
mfd: cros_ec: Add helper for event notifier.
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
> dump_stack+0x63/0x86
> __warn+0xcb/0xf0
> warn_slowpath_null+0x1d/0x20
> umount_check+0x6e/0x80
> d_walk+0xc6/0x270
> ? dentry_free+0x80/0x80
> do_one_tree+0x26/0x40
> shrink_dcache_for_umount+0x2d/0x90
> generic_shutdown_super+0x1f/0xf0
> kill_anon_super+0x12/0x20
> proc_kill_sb+0x40/0x50
> deactivate_locked_super+0x43/0x70
> deactivate_super+0x5a/0x60
> cleanup_mnt+0x3f/0x90
> mntput_no_expire+0x13b/0x190
> kern_unmount+0x3e/0x50
> pid_ns_release_proc+0x15/0x20
> proc_cleanup_work+0x15/0x20
> process_one_work+0x197/0x450
> worker_thread+0x4e/0x4a0
> kthread+0x109/0x140
> ? process_one_work+0x450/0x450
> ? kthread_park+0x90/0x90
> ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0
Fix this by taking a reference to the super block in proc_sys_prune_dcache.
The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used. This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list. Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks. The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.
Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to. The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.
Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves. This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge more updates from Andrew Morton:
- most of the rest of MM
- KASAN updates
- lib/ updates
- checkpatch updates
- some binfmt_elf changes
- various misc bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits)
kernel/exit.c: avoid undefined behaviour when calling wait4()
kernel/signal.c: avoid undefined behaviour in kill_something_info
binfmt_elf: safely increment argv pointers
s390: reduce ELF_ET_DYN_BASE
powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
arm: move ELF_ET_DYN_BASE to 4MB
binfmt_elf: use ELF_ET_DYN_BASE only for PIE
fs, epoll: short circuit fetching events if thread has been killed
checkpatch: improve multi-line alignment test
checkpatch: improve macro reuse test
checkpatch: change format of --color argument to --color[=WHEN]
checkpatch: silence perl 5.26.0 unescaped left brace warnings
checkpatch: improve tests for multiple line function definitions
checkpatch: remove false warning for commit reference
checkpatch: fix stepping through statements with $stat and ctx_statement_block
checkpatch: [HLP]LIST_HEAD is also declaration
checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
checkpatch: improve the unnecessary OOM message test
lib/bsearch.c: micro-optimize pivot position calculation
...
This series of patches splits BUILD_BUG related macros out of
"include/linux/bug.h" into new file "include/linux/build_bug.h" (patch
5), and changes the pointer type checking in the `container_of()` macro
to deal with pointers of array type better (patch 6). Patches 1 to 4
are prerequisites.
Patches 2, 3, 4, and 5 have been inserted since the previous version of
this patch series. Patch 6 here corresponds to v3 and v4's patch 2.
Patch 1 was a prerequisite in v3 of this series to avoid a lot of
warnings when <linux/bug.h> was included by <linux/kernel.h>. That is
no longer relevant for v5 of the series, but I left it in because it was
acked by a Arnd Bergmann and Michal Nazarewicz.
Patches 2, 3, and 4 are some checkpatch clean-ups on
"include/linux/bug.h" before splitting out the BUILD_BUG stuff in patch
5.
Patch 5 splits the BUILD_BUG related macros out of "include/linux/bug.h"
into new file "include/linux/build_bug.h" because including
<linux/bug.h> in "include/linux/kernel.h" would result in build failures
due to circular dependencies.
Patch 6 changes the pointer type checking by `container_of()` to avoid
some incompatible pointer warnings when the dereferenced pointer has
array type.
1) asm-generic/bug.h: declare struct pt_regs; before function prototype
2) linux/bug.h: correct formatting of block comment
3) linux/bug.h: correct "(foo*)" should be "(foo *)"
4) linux/bug.h: correct "space required before that '-'"
5) bug: split BUILD_BUG stuff out into <linux/build_bug.h>
6) kernel.h: handle pointers to arrays better in container_of()
This patch (of 6):
The declaration of `__warn()` has `struct pt_regs *regs` as one of its
parameters. This can result in compiler warnings if `struct regs` is not
already declared. Add an empty declaration of `struct pt_regs` to avoid
the warnings.
Link: http://lkml.kernel.org/r/20170525120316.24473-2-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
early_pfn_to_nid will return node 0 if both HAVE_ARCH_EARLY_PFN_TO_NID
and HAVE_MEMBLOCK_NODE_MAP are disabled. It seems we are safe now
because all architectures which support NUMA define one of them (with an
exception of alpha which however has CONFIG_NUMA marked as broken) so
this works as expected. It can get silently and subtly broken too
easily, though. Make sure we fail the compilation if NUMA is enabled
and there is no proper implementation for this function. If that ever
happens we know that either the specific configuration is invalid and
the fix should either disable NUMA or enable one of the above configs.
Link: http://lkml.kernel.org/r/20170704075803.15979-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rework of the cpu hotplug locking unearthed potential deadlocks with
the memory hotplug locking code.
The solution for these is to rework the memory hotplug locking code as
well and take the cpu hotplug lock before the memory hotplug lock in
mem_hotplug_begin(), but this will cause a recursive locking of the cpu
hotplug lock when the memory hotplug code calls lru_add_drain_all().
Split out the inner workings of lru_add_drain_all() into
lru_add_drain_all_cpuslocked() so this function can be invoked from the
memory hotplug code with the cpu hotplug lock held.
Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask starting from lower numa nodes. This might lead to
filling up those low NUMA nodes while others are not used. We can
reduce this risk by introducing a concept of the preferred node similar
to what we have in the regular page allocator. We will start allocating
from the preferred nid and then iterate over all allowed nodes in the
zonelist order until we try them all.
This is mimicing the page allocator logic except it operates on per-node
mempools. dequeue_huge_page_vma already does this so distill the
zonelist logic into a more generic dequeue_huge_page_nodemask and use it
in alloc_huge_page_nodemask.
This will allow us to use proper per numa distance fallback also for
alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
can get rid of alloc_huge_page_node helper which doesn't have any user
anymore.
Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, hugetlb: allow proper node fallback dequeue".
While working on a hugetlb migration issue addressed in a separate
patchset[1] I have noticed that the hugetlb allocations from the
preallocated pool are quite subotimal.
[1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
There is no fallback mechanism implemented and no notion of preferred
node. I have tried to work around it but Vlastimil was right to push
back for a more robust solution. It seems that such a solution is to
reuse zonelist approach we use for the page alloctor.
This series has 3 patches. The first one tries to make hugetlb
allocation layers more clear. The second one implements the zonelist
hugetlb pool allocation and introduces a preferred node semantic which
is used by the migration callbacks. The last patch is a clean up.
This patch (of 3):
Hugetlb allocation path for fresh huge pages is unnecessarily complex
and it mixes different interfaces between layers.
__alloc_buddy_huge_page is the central place to perform a new
allocation. It checks for the hugetlb overcommit and then relies on
__hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is
all good except that __alloc_buddy_huge_page pushes vma and address down
the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
two different allocation modes - one for memory policy and other node
specific (or to make it more obscure node non-specific) requests.
This just screams for a reorganization.
This patch pulls out all the vma specific handling up to
__alloc_buddy_huge_page_with_mpol where it belongs.
__alloc_buddy_huge_page will get nodemask argument and
__hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
page allocator.
In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
__alloc_buddy_huge_page - overcommit handling and accounting
__hugetlb_alloc_buddy_huge_page - page allocator layer
Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
is not really needed because the page allocator already handles the
cpusets update.
Finally __hugetlb_alloc_buddy_huge_page had a special case for node
specific allocations (when no policy is applied and there is a node
given). This has relied on __GFP_THISNODE to not fallback to a different
node. alloc_huge_page_node is the only caller which relies on this
behavior so move the __GFP_THISNODE there.
Not only does this remove quite some code it also should make those
layers easier to follow and clear wrt responsibilities.
Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the debugging of the problem described in
https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
output is not really useful to understand issues related to the oom
reaper.
So, I assume, that adding some tracepoints might help with debugging of
similar issues.
Trace the following events:
1) a process is marked as an oom victim,
2) a process is added to the oom reaper list,
3) the oom reaper starts reaping process's mm,
4) the oom reaper finished reaping,
5) the oom reaper skips reaping.
How it works in practice? Below is an example which show how the problem
mentioned above can be found: one process is added twice to the
oom_reaper list:
$ cd /sys/kernel/debug/tracing
$ echo "oom:mark_victim" > set_event
$ echo "oom:wake_reaper" >> set_event
$ echo "oom:skip_task_reaping" >> set_event
$ echo "oom:start_task_reaping" >> set_event
$ echo "oom:finish_task_reaping" >> set_event
$ cat trace_pipe
allocate-502 [001] .... 91.836405: mark_victim: pid=502
allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502
Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 394e31d2ce ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") has duplicated a large part of
alloc_migrate_target with some hotplug specific special casing.
To be more precise it tried to enfore the allocation from a different
node than the original page. As a result the two function diverged in
their shared logic, e.g. the hugetlb allocation strategy.
Let's unify the two and express different NUMA requirements by the given
nodemask. new_node_page will simply exclude the node it doesn't care
about and alloc_migrate_target will use all the available nodes.
alloc_migrate_target will then learn to migrate hugetlb pages more
sanely and use preallocated pool when possible.
Please note that alloc_migrate_target used to call alloc_page resp.
alloc_pages_current so the memory policy of the current context which is
quite strange when we consider that it is used in the context of
alloc_contig_range which just tries to migrate pages which stand in the
way.
Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
new_node_page will try to use the origin's next NUMA node as the
migration destination for hugetlb pages. If such a node doesn't have
any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
to allocate a surplus page instead. This is quite subotpimal for any
configuration when hugetlb pages are no distributed to all NUMA nodes
evenly. Say we have a hotplugable node 4 and spare hugetlb pages are
node 0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
Now we consume the whole pool on node 4 and try to offline this node.
All the allocated pages should be moved to node0 which has enough
preallocated pages to hold them. With the current implementation
offlining very likely fails because hugetlb allocations during runtime
are much less reliable.
Fix this by reusing the nodemask which excludes migration source and try
to find a first node which has a page in the preallocated pool first and
fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
consumed.
[akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
wrapping a critical section where struct pages can be modified without
having to worry about consistency for a concurrent fast-GUP.
Whilst page_ref_freeze has full barrier semantics due to its use of
atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
doesn't provide any barrier semantics and allows the operation to be
reordered with respect to page modifications in the critical section.
This patch ensures that page_ref_unfreeze is ordered after any critical
section updates, by invoking smp_mb() prior to the atomic_set.
Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The madvise policy for transparent huge pages is meant to avoid unwanted
allocations of transparent huge pages. It allows a policy of disabling
the extra memory pressure and effort to arrange for a huge page when it
is not needed.
DAX by definition never incurs this overhead since it is statically
allocated. The policy choice makes even less sense for device-dax which
tries to guarantee a given tlb-fault size. Specifically, the following
setting:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
...violates that guarantee and silently disables all device-dax
instances with a 2M or 1G alignment. So, let's avoid that non-obvious
side effect by force enabling thp for dax mappings in all cases.
It is worth noting that the reason this uses vma_is_dax(), and the
resulting header include changes, is that previous attempts to add a
VM_DAX flag were NAKd.
Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
been evaluated:
# cat /sys/kernel/debug/tracing/enum_map
(which will soon be renamed to eval_map)
And it showed some interesting results:
[..]
ZONE_MOVABLE 3 (oom)
ZONE_NORMAL 2 (oom)
ZONE_DMA32 1 (oom)
ZONE_DMA 0 (oom)
3 3 (oom)
2 2 (oom)
1 1 (oom)
COMPACT_PRIO_ASYNC 2 (oom)
COMPACT_PRIO_SYNC_LIGHT 1 (oom)
COMPACT_PRIO_SYNC_FULL 0 (oom)
[..]
ZONE_DMA 0 (vmscan)
3 3 (vmscan)
2 2 (vmscan)
1 1 (vmscan)
COMPACT_PRIO_ASYNC 2 (vmscan)
[..]
ZONE_DMA 0 (kmem)
3 3 (kmem)
2 2 (kmem)
1 1 (kmem)
COMPACT_PRIO_ASYNC 2 (kmem)
[..]
ZONE_DMA 0 (compaction)
3 3 (compaction)
2 2 (compaction)
1 1 (compaction)
COMPACT_PRIO_ASYNC 2 (compaction)
[..]
The name within the parenthesis are the trace systems that the enum/eval
maps are associated with. When there's a number evaluated to another
number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
and not an enum. As #defines get converted normally, they are not needed
to be evaluated.
Each of the above trace systems with the number to number evaluation
included the file include/trace/events/mmflags.h which has:
/* High-level compaction status feedback */
#define COMPACTION_FAILED 1
#define COMPACTION_WITHDRAWN 2
#define COMPACTION_PROGRESS 3
[..]
#define COMPACTION_FEEDBACK \
EM(COMPACTION_FAILED, "failed") \
EM(COMPACTION_WITHDRAWN, "withdrawn") \
EMe(COMPACTION_PROGRESS, "progress")
Which is still needed for the __print_symbolic() usage in the
trace_event. But it is not needed to be evaluated.
Removing the evaluation part removes the unnecessary evaluations of
numbers to numbers.
Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.
The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set. This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.
Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU. In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage. Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated. However, khugepaged
collapses the pages and the expected page faults do not occur.
In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.
Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.
It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.
Fixes: a0715cc226 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>