Commit Graph

94117 Commits

Author SHA1 Message Date
Xunlei Pang
203e9e4121 kexec: move vmcoreinfo out of the kernel's .bss section
As Eric said,
 "what we need to do is move the variable vmcoreinfo_note out of the
  kernel's .bss section. And modify the code to regenerate and keep this
  information in something like the control page.

  Definitely something like this needs a page all to itself, and ideally
  far away from any other kernel data structures. I clearly was not
  watching closely the data someone decided to keep this silly thing in
  the kernel's .bss section."

This patch allocates extra pages for these vmcoreinfo_XXX variables, one
advantage is that it enhances some safety of vmcoreinfo, because
vmcoreinfo now is kept far away from other kernel data structures.

Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:25:59 -07:00
Ian Abbott
c7acec713d kernel.h: handle pointers to arrays better in container_of()
If the first parameter of container_of() is a pointer to a
non-const-qualified array type (and the third parameter names a
non-const-qualified array member), the local variable __mptr will be
defined with a const-qualified array type.  In ISO C, these types are
incompatible.  They work as expected in GNU C, but some versions will
issue warnings.  For example, GCC 4.9 produces the warning
"initialization from incompatible pointer type".

Here is an example of where the problem occurs:

-------------------------------------------------------
   #include <linux/kernel.h>
   #include <linux/module.h>

  MODULE_LICENSE("GPL");

  struct st {
  	int a;
  	char b[16];
  };

  static int __init example_init(void) {
  	struct st t = { .a = 101, .b = "hello" };
  	char (*p)[16] = &t.b;
  	struct st *x = container_of(p, struct st, b);
  	printk(KERN_DEBUG "%p %p\n", (void *)&t, (void *)x);
  	return 0;
  }

  static void __exit example_exit(void) {
  }

  module_init(example_init);
  module_exit(example_exit);
-------------------------------------------------------

Building the module with gcc-4.9 results in these warnings (where '{m}'
is the module source and '{k}' is the kernel source):

-------------------------------------------------------
  In file included from {m}/example.c:1:0:
  {m}/example.c: In function `example_init':
  {k}/include/linux/kernel.h:854:48: warning: initialization from incompatible pointer type
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
                                                  ^
  {m}/example.c:14:17: note: in expansion of macro `container_of'
    struct st *x = container_of(p, struct st, b);
                   ^
  {k}/include/linux/kernel.h:854:48: warning: (near initialization for `x')
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
                                                  ^
  {m}/example.c:14:17: note: in expansion of macro `container_of'
    struct st *x = container_of(p, struct st, b);
                   ^
-------------------------------------------------------

Replace the type checking performed by the macro to avoid these
warnings.  Make sure `*(ptr)` either has type compatible with the
member, or has type compatible with `void`, ignoring qualifiers.  Raise
compiler errors if this is not true.  This is stronger than the previous
behaviour, which only resulted in compiler warnings for a type mismatch.

[arnd@arndb.de: fix new warnings for container_of()]
  Link: http://lkml.kernel.org/r/20170620200940.90557-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170525120316.24473-7-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:25:59 -07:00
Stephen Rothwell
0a2c13d9cd include/linux/dcache.h: use unsigned chars in struct name_snapshot
"kernel.h: handle pointers to arrays better in container_of()" triggers:

In file included from include/uapi/linux/stddef.h:1:0,
                 from include/linux/stddef.h:4,
                 from include/uapi/linux/posix_types.h:4,
                 from include/uapi/linux/types.h:13,
                 from include/linux/types.h:5,
                 from include/linux/syscalls.h:71,
                 from fs/dcache.c:17:
fs/dcache.c: In function 'release_dentry_name_snapshot':
include/linux/compiler.h:542:38: error: call to '__compiletime_assert_305' declared with attribute error: pointer type mismatch in container_of()
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                      ^
include/linux/compiler.h:525:4: note: in definition of macro '__compiletime_assert'
    prefix ## suffix();    \
    ^
include/linux/compiler.h:542:2: note: in expansion of macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^
include/linux/build_bug.h:46:37: note: in expansion of macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^
include/linux/kernel.h:860:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
  ^
fs/dcache.c:305:7: note: in expansion of macro 'container_of'
   p = container_of(name->name, struct external_name, name[0]);

Switch name_snapshot to use unsigned chars, matching struct qstr and
struct external_name.

Link: http://lkml.kernel.org/r/20170710152134.0f78c1e6@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:25:59 -07:00
stephen hemminger
771edcafa7 socket: add documentation for missing elements
Fill in missing kernel-doc for missing elements in struct sock.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12 14:39:43 -07:00
Alexandre Belloni
40bf6a3548 rtc: Remove wrong deprecation comment
rtc_time_to_tm and rtc_tm_to_time are not deprecated and make perfect sense
for RTCs that are simple 32bit counters.

Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2017-07-12 23:11:23 +02:00
Gleb Fotengauer-Malinovskiy
949c033694 KVM: s390: Fix KVM_S390_GET_CMMA_BITS ioctl definition
In case of KVM_S390_GET_CMMA_BITS, the kernel does not only read struct
kvm_s390_cmma_log passed from userspace (which constitutes _IOC_WRITE),
it also writes back a return value (which constitutes _IOC_READ) making
this an _IOWR ioctl instead of _IOW.

Fixes: 4036e387 ("KVM: s390: ioctls to get and set guest storage attributes")
Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-12 22:38:26 +02:00
Chuck Lever
9450ca8e2f svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
Clean up: Registration mode details are now handled by the rdma_rw
API, and thus can be removed from svcrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:59 -04:00
Chuck Lever
463e63d701 svcrdma: Remove frmr cache
Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API,
the details of Read sink buffer registration are dealt with by the
kernel's RDMA core. This cache is no longer used, and can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:58 -04:00
Chuck Lever
c84dc900d7 svcrdma: Remove unused Read completion handlers
Clean up:

The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced
the Register, Read, and Invalidate completion handlers. Remove the
old ones, which are no longer used.

These handlers shared some helper code with svc_rdma_wc_send. Fold
the wc_common helper back into the one remaining completion handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:57 -04:00
Chuck Lever
cafc739892 svcrdma: Use generic RDMA R/W API in RPC Call path
The current svcrdma recvfrom code path has a lot of detail about
registration mode and the type of port (iWARP, IB, etc).

Instead, use the RDMA core's generic R/W API. This shares code with
other RDMA-enabled ULPs that manages the gory details of buffer
registration and the posting of RDMA Read Work Requests.

Since the Read list marshaling code is being replaced, I took the
opportunity to replace C structure-based XDR encoding code with more
portable code that uses pointer arithmetic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:56 -04:00
Chuck Lever
026d958b38 svcrdma: Add recvfrom helpers to svc_rdma_rw.c
svc_rdma_rw.c already contains helpers for the sendto path.
Introduce helpers for the recvfrom path.

The plan is to replace the local NFSD bespoke code that constructs
and posts RDMA Read Work Requests with calls to the rdma_rw API.
This shares code with other RDMA-enabled ULPs that manages the gory
details of buffer registration and posting Work Requests.

This new code also puts all RDMA_NOMSG-specific logic in one place.

Lastly, the use of rqstp->rq_arg.pages is deprecated in favor of
using rqstp->rq_pages directly, for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:56 -04:00
Chuck Lever
8c6ae4980e sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
svcrdma needs 259 pages allocated to receive 1MB NFSv4.0 WRITE requests:

 - 1 page for the transport header and head iovec
 - 256 pages for the data payload
 - 1 page for the trailing GETATTR request (since NFSD XDR decoding
   does not look for a tail iovec, the GETATTR is stuck at the end
   of the rqstp->rq_arg.pages list)
 - 1 page for building the reply xdr_buf

But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that
svc_alloc_arg never allocates that many pages. To address this:

1. The final element of rq_pages always points to NULL. To
   accommodate up to 259 pages in rq_pages, add an extra element
   to rq_pages for the array termination sentinel.

2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES
   is calculated, so it can go up to 259. Bruce noted that the
   calculation assumes sv_max_mesg is a multiple of PAGE_SIZE,
   which might not always be true. I didn't change this assumption.

3. Change the loop boundaries to allow 259 pages to be allocated.

Additional clean-up: WARN_ON_ONCE adds an extra conditional branch,
which is basically never taken. And there's no need to dump the
stack here because svc_alloc_arg has only one caller.

Keeping that NULL "array termination sentinel"; there doesn't appear to
be any code that depends on it, only code in nfsd_splice_actor() which
needs the 259th element to be initialized to *something*.  So it's
possible we could just keep the array at 259 elements and drop that
final NULL, but we're being conservative for now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:55 -04:00
Linus Torvalds
235b84fc86 Merge branch 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
 "This pull request contains:

   - i2c core reorganization. One source file became too monolithic. It
     is now split up, yet we still have the same named object as the
     final output. This should ease maintenance.

   - new drivers: ZTE ZX2967 family, ASPEED 24XX/25XX

   - designware driver gained slave mode support

   - xgene-slimpro driver gained ACPI support

   - bigger overhaul for pca-platform driver

   - the algo-bit module now supports messages with enforced STOP

   - slightly bigger than usual set of driver updates and improvements

  and with much appreciated quality assurance from Andy Shevchenko"

* 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (51 commits)
  i2c: Provide a stub for i2c_detect_slave_mode()
  i2c: designware: Let slave adapter support be optional
  i2c: designware: Make HW init functions static
  i2c: designware: fix spelling mistakes
  i2c: pca-platform: propagate error from i2c_pca_add_numbered_bus
  i2c: pca-platform: correctly set algo_data.reset_chip
  i2c: acpi: Do not create i2c-clients for LNXVIDEO ACPI devices
  i2c: designware: enable SLAVE in platform module
  i2c: designware: add SLAVE mode functions
  i2c: zx2967: drop COMPILE_TEST dependency
  i2c: zx2967: always use the same device when printing errors
  i2c: pca-platform: use dev_warn/dev_info instead of printk
  i2c: pca-platform: use device managed allocations
  i2c: pca-platform: add devicetree awareness
  i2c: pca-platform: switch to struct gpio_desc
  dt-bindings: add bindings for i2c-pca-platform
  i2c: cadance: fix ctrl/addr reg write order
  i2c: zx2967: add i2c controller driver for ZTE's zx2967 family
  dt: bindings: add documentation for zx2967 family i2c controller
  i2c: algo-bit: add support for I2C_M_STOP
  ...
2017-07-12 10:04:56 -07:00
Linus Torvalds
fb4e3beeff Merge tag 'iommu-updates-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
 "This update comes with:

   - Support for lockless operation in the ARM io-pgtable code.

     This is an important step to solve the scalability problems in the
     common dma-iommu code for ARM

   - Some Errata workarounds for ARM SMMU implemenations

   - Rewrite of the deferred IO/TLB flush code in the AMD IOMMU driver.

     The code suffered from very high flush rates, with the new
     implementation the flush rate is down to ~1% of what it was before

   - Support for amd_iommu=off when booting with kexec.

     The problem here was that the IOMMU driver bailed out early without
     disabling the iommu hardware, if it was enabled in the old kernel

   - The Rockchip IOMMU driver is now available on ARM64

   - Align the return value of the iommu_ops->device_group call-backs to
     not miss error values

   - Preempt-disable optimizations in the Intel VT-d and common IOVA
     code to help Linux-RT

   - Various other small cleanups and fixes"

* tag 'iommu-updates-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (60 commits)
  iommu/vt-d: Constify intel_dma_ops
  iommu: Warn once when device_group callback returns NULL
  iommu/omap: Return ERR_PTR in device_group call-back
  iommu: Return ERR_PTR() values from device_group call-backs
  iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()
  iommu/vt-d: Don't disable preemption while accessing deferred_flush()
  iommu/iova: Don't disable preempt around this_cpu_ptr()
  iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126
  iommu/arm-smmu-v3: Enable ACPI based HiSilicon CMD_PREFETCH quirk(erratum 161010701)
  iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #74
  ACPI/IORT: Fixup SMMUv3 resource size for Cavium ThunderX2 SMMUv3 model
  iommu/arm-smmu-v3, acpi: Add temporary Cavium SMMU-V3 IORT model number definitions
  iommu/io-pgtable-arm: Use dma_wmb() instead of wmb() when publishing table
  iommu/io-pgtable: depend on !GENERIC_ATOMIC64 when using COMPILE_TEST with LPAE
  iommu/arm-smmu-v3: Remove io-pgtable spinlock
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  ...
2017-07-12 10:00:04 -07:00
Linus Torvalds
6b1c776d3e Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This work from Amir introduces the inodes index feature, which
  provides:

   - hardlinks are not broken on copy up

   - infrastructure for overlayfs NFS export

  This also fixes constant st_ino for samefs case for lower hardlinks"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
  ovl: mark parent impure and restore timestamp on ovl_link_up()
  ovl: document copying layers restrictions with inodes index
  ovl: cleanup orphan index entries
  ovl: persistent overlay inode nlink for indexed inodes
  ovl: implement index dir copy up
  ovl: move copy up lock out
  ovl: rearrange copy up
  ovl: add flag for upper in ovl_entry
  ovl: use struct copy_up_ctx as function argument
  ovl: base tmpfile in workdir too
  ovl: factor out ovl_copy_up_inode() helper
  ovl: extract helper to get temp file in copy up
  ovl: defer upper dir lock to tempfile link
  ovl: hash overlay non-dir inodes by copy up origin
  ovl: cleanup bad and stale index entries on mount
  ovl: lookup index entry for copy up origin
  ovl: verify index dir matches upper dir
  ovl: verify upper root dir matches lower root dir
  ovl: introduce the inodes index dir feature
  ovl: generalize ovl_create_workdir()
  ...
2017-07-12 09:28:55 -07:00
Sakari Ailus
e8158b486d device property: Introduce fwnode_call_bool_op() for ops that return bool
fwnode_call_int_op() isn't suitable for calling ops that return bool
since it effectively causes the result returned to the user to be
true when an op hasn't been defined or the fwnode is NULL.

Address this by introducing fwnode_call_bool_op() for calling ops
that return bool.

Fixes: 3708184afc "device property: Move FW type specific functionality to FW specific files"
Fixes: 2294b3af05 "device property: Introduce fwnode_device_is_available()"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-12 13:32:46 +02:00
Linus Torvalds
130568d5ea Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
 "This is a followup for block changes, that didn't make the initial
  pull request. It's a bit of a mixed bag, this contains:

   - A followup pull request from Sagi for NVMe. Outside of fixups for
     NVMe, it also includes a series for ensuring that we properly
     quiesce hardware queues when browsing live tags.

   - Set of integrity fixes from Dmitry (mostly), fixing various issues
     for folks using DIF/DIX.

   - Fix for a bug introduced in cciss, with the req init changes. From
     Christoph.

   - Fix for a bug in BFQ, from Paolo.

   - Two followup fixes for lightnvm/pblk from Javier.

   - Depth fix from Ming for blk-mq-sched.

   - Also from Ming, performance fix for mtip32xx that was introduced
     with the dynamic initialization of commands"

* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
  block: call bio_uninit in bio_endio
  nvmet: avoid unneeded assignment of submit_bio return value
  nvme-pci: add module parameter for io queue depth
  nvme-pci: compile warnings in nvme_alloc_host_mem()
  nvmet_fc: Accept variable pad lengths on Create Association LS
  nvme_fc/nvmet_fc: revise Create Association descriptor length
  lightnvm: pblk: remove unnecessary checks
  lightnvm: pblk: control I/O flow also on tear down
  cciss: initialize struct scsi_req
  null_blk: fix error flow for shared tags during module_init
  block: Fix __blkdev_issue_zeroout loop
  nvme-rdma: unconditionally recycle the request mr
  nvme: split nvme_uninit_ctrl into stop and uninit
  virtio_blk: quiesce/unquiesce live IO when entering PM states
  mtip32xx: quiesce request queues to make sure no submissions are inflight
  nbd: quiesce request queues to make sure no submissions are inflight
  nvme: kick requeue list when requeueing a request instead of when starting the queues
  nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
  ...
2017-07-11 15:36:52 -07:00
Linus Torvalds
3bf7878f0f Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
 "The main item here is support for v12.y.z ("Luminous") clusters:
  RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
  feature bits, and various other changes in the RADOS client protocol.

  On top of that we have a new fsc mount option to allow supplying
  fscache uniquifier (similar to NFS) and the usual pile of filesystem
  fixes from Zheng"

* tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
  libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
  libceph: osd_state is 32 bits wide in luminous
  crush: remove an obsolete comment
  crush: crush_init_workspace starts with struct crush_work
  libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
  crush: implement weight and id overrides for straw2
  libceph: apply_upmap()
  libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
  libceph: pg_upmap[_items] infrastructure
  libceph: ceph_decode_skip_* helpers
  libceph: kill __{insert,lookup,remove}_pg_mapping()
  libceph: introduce and switch to decode_pg_mapping()
  libceph: don't pass pgid by value
  libceph: respect RADOS_BACKOFF backoffs
  libceph: make DEFINE_RB_* helpers more general
  libceph: avoid unnecessary pi lookups in calc_target()
  libceph: use target pi for calc_target() calculations
  libceph: always populate t->target_{oid,oloc} in calc_target()
  libceph: make sure need_resend targets reflect latest map
  libceph: delete from need_resend_linger before check_linger_pool_dne()
  ...
2017-07-11 12:12:28 -07:00
Nicholas Bellinger
138d351eef iscsi-target: Add login_keys_workaround attribute for non RFC initiators
This patch re-introduces part of a long standing login workaround that
was recently dropped by:

  commit 1c99de981f
  Author: Nicholas Bellinger <nab@linux-iscsi.org>
  Date:   Sun Apr 2 13:36:44 2017 -0700

      iscsi-target: Drop work-around for legacy GlobalSAN initiator

Namely, the workaround for FirstBurstLength ended up being required by
Mellanox Flexboot PXE boot ROMs as reported by Robert.

So this patch re-adds the work-around for FirstBurstLength within
iscsi_check_proposer_for_optional_reply(), and makes the key optional
to respond when the initiator does not propose, nor respond to it.

Also as requested by Arun, this patch introduces a new TPG attribute
named 'login_keys_workaround' that controls the use of both the
FirstBurstLength workaround, as well as the two other existing
workarounds for gPXE iSCSI boot client.

By default, the workaround is enabled with login_keys_workaround=1,
since Mellanox FlexBoot requires it, and Arun has verified the Qlogic
MSFT initiator already proposes FirstBurstLength, so it's uneffected
by this re-adding this part of the original work-around.

Reported-by: Robert LeBlanc <robert@leblancnet.us>
Cc: Robert LeBlanc <robert@leblancnet.us>
Reviewed-by: Arun Easi <arun.easi@cavium.com>
Cc: <stable@vger.kernel.org> # 3.1+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-11 10:56:39 -07:00
Linus Torvalds
a3ddacbae5 Merge tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform
Pull chrome platform updates from Benson Leung:
 "Changes in this pull request are around catching up cros_ec with the
  internal chromeos-kernel versions of cros_ec, cros_ec_lpc, and
  cros_ec_lightbar.

  Also, switching maintainership from olof to bleung"

* tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform:
  platform/chrome : Add myself as Maintainer
  platform/chrome: cros_ec_lightbar - hide unused PM functions
  cros_ec: Don't signal wake event for non-wake host events
  cros_ec: Fix deadlock when EC is not responsive at probe
  cros_ec: Don't return error when checking command version
  platform/chrome: cros_ec_lightbar - Avoid I2C xfer to EC during suspend
  platform/chrome: cros_ec_lightbar - Add userspace lightbar control bit to EC
  platform/chrome: cros_ec_lightbar - Control of suspend/resume lightbar sequence
  platform/chrome: cros_ec_lightbar - Add lightbar program feature to sysfs
  platform/chrome: cros_ec_lpc: Add MKBP events support over ACPI
  platform/chrome: cros_ec_lpc: Add power management ops
  platform/chrome: cros_ec_lpc: Add support for GOOG004 ACPI device
  platform/chrome: cros_ec_lpc: Add support for mec1322 EC
  platform/chrome: cros_ec_lpc: Add R/W helpers to LPC protocol variants
  mfd: cros_ec: Add support for dumping panic information
  cros_ec_debugfs: Pass proper struct sizes to cros_ec_cmd_xfer()
  mfd: cros_ec: add debugfs, console log file
  mfd: cros_ec: Add EC console read structures definitions
  mfd: cros_ec: Add helper for event notifier.
2017-07-11 09:55:47 -07:00
Eric W. Biederman
2fd1d2c4ce proc: Fix proc_sys_prune_dcache to hold a sb reference
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
>  dump_stack+0x63/0x86
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  umount_check+0x6e/0x80
>  d_walk+0xc6/0x270
>  ? dentry_free+0x80/0x80
>  do_one_tree+0x26/0x40
>  shrink_dcache_for_umount+0x2d/0x90
>  generic_shutdown_super+0x1f/0xf0
>  kill_anon_super+0x12/0x20
>  proc_kill_sb+0x40/0x50
>  deactivate_locked_super+0x43/0x70
>  deactivate_super+0x5a/0x60
>  cleanup_mnt+0x3f/0x90
>  mntput_no_expire+0x13b/0x190
>  kern_unmount+0x3e/0x50
>  pid_ns_release_proc+0x15/0x20
>  proc_cleanup_work+0x15/0x20
>  process_one_work+0x197/0x450
>  worker_thread+0x4e/0x4a0
>  kthread+0x109/0x140
>  ? process_one_work+0x450/0x450
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0

Fix this by taking a reference to the super block in proc_sys_prune_dcache.

The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used.  This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list.  Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks.  The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.

Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to.  The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-11 11:01:24 -05:00
Paul Burton
7461279bba dt-bindings: Document img,boston-clock binding
Add device tree binding documentation for the clocks provided by the
MIPS Boston development board from Imagination Technologies, and a
header file describing the available clocks for use by device trees &
driver.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Cc: linux-clk@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16482/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2017-07-11 14:13:06 +02:00
David Howells
1d278a8790 VFS: Kill off s_options and helpers
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves.  This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells
c4fac91004 9p: Implement show_options
Implement the show_options superblock op for 9p as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Ron Minnich <rminnich@sandia.gov>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:08:58 -04:00
Linus Torvalds
9967468c0a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - most of the rest of MM

 - KASAN updates

 - lib/ updates

 - checkpatch updates

 - some binfmt_elf changes

 - various misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits)
  kernel/exit.c: avoid undefined behaviour when calling wait4()
  kernel/signal.c: avoid undefined behaviour in kill_something_info
  binfmt_elf: safely increment argv pointers
  s390: reduce ELF_ET_DYN_BASE
  powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm: move ELF_ET_DYN_BASE to 4MB
  binfmt_elf: use ELF_ET_DYN_BASE only for PIE
  fs, epoll: short circuit fetching events if thread has been killed
  checkpatch: improve multi-line alignment test
  checkpatch: improve macro reuse test
  checkpatch: change format of --color argument to --color[=WHEN]
  checkpatch: silence perl 5.26.0 unescaped left brace warnings
  checkpatch: improve tests for multiple line function definitions
  checkpatch: remove false warning for commit reference
  checkpatch: fix stepping through statements with $stat and ctx_statement_block
  checkpatch: [HLP]LIST_HEAD is also declaration
  checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
  checkpatch: improve the unnecessary OOM message test
  lib/bsearch.c: micro-optimize pivot position calculation
  ...
2017-07-10 16:58:42 -07:00
Thomas Meyer
a94c33dd1f lib/extable.c: use bsearch() library function in search_extable()
[thomas@m3y3r.de: v3: fix arch specific implementations]
  Link: http://lkml.kernel.org/r/1497890858.12931.7.camel@m3y3r.de
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:35 -07:00
Matthew Wilcox
2c6deb0152 bitmap: use memcmp optimisation in more situations
Commit 7dd968163f ("bitmap: bitmap_equal memcmp optimization") was
rather more restrictive than necessary; we can use memcmp() to implement
bitmap_equal() as long as the number of bits can be proved to be a
multiple of 8.  And architectures other than s390 may be able to make
good use of this optimisation.

[arnd@arndb.de: fix build: add a memcmp() declaration]
  Link: http://lkml.kernel.org/r/20170630153908.3439707-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170628153221.11322-5-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Matthew Wilcox
2a98dc028f include/linux/bitmap.h: turn bitmap_set and bitmap_clear into memset when possible
Several callers have constant 'start' and an 'nbits' that is a multiple
of 8, so we can turn them into calls to memset.  We don't need the
entirety of 'start' and 'nbits' to be constant, we just need to know
whether they're divisible by 8.

Link: http://lkml.kernel.org/r/20170628153221.11322-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Matthew Wilcox
e5af323c9b bitmap: optimise bitmap_set and bitmap_clear of a single bit
We have eight users calling bitmap_clear for a single bit and seventeen
calling bitmap_set for a single bit.  Rather than fix all of them to
call __clear_bit or __set_bit, turn bitmap_clear and bitmap_set into
inline functions and make this special case efficient.

Link: http://lkml.kernel.org/r/20170628153221.11322-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Bart Van Assche
287f3ca563 ARM: fix rd_size declaration
The global variable 'rd_size' is declared as 'int' in source file
arch/arm/kernel/atags_parse.c and as 'unsigned long' in
drivers/block/brd.c.  Fix this inconsistency.

Additionally, remove the declarations of rd_image_start, rd_prompt and
rd_doload from parse_tag_ramdisk() since these duplicate existing
declarations in <linux/initrd.h>.

Link: http://lkml.kernel.org/r/20170627065024.12347-1-bart.vanassche@wdc.com
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Zhaohongjiang <zhaohongjiang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Ian Abbott
bc6245e5ef bug: split BUILD_BUG stuff out into <linux/build_bug.h>
Including <linux/bug.h> pulls in a lot of bloat from <asm/bug.h> and
<asm-generic/bug.h> that is not needed to call the BUILD_BUG() family of
macros.  Split them out into their own header, <linux/build_bug.h>.

Also correct some checkpatch.pl errors for the BUILD_BUG_ON_ZERO() and
BUILD_BUG_ON_NULL() macros by adding parentheses around the bitfield
widths that begin with a minus sign.

Link: http://lkml.kernel.org/r/20170525120316.24473-6-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Ian Abbott
47e81e59d9 linux/bug.h: correct "space required before that '-'"
Correct these checkpatch.pl errors:

|ERROR: space required before that '-' (ctx:OxO)
|#37: FILE: include/linux/bug.h:37:
|+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

|ERROR: space required before that '-' (ctx:OxO)
|#38: FILE: include/linux/bug.h:38:
|+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); }))

I decided to wrap the bitfield expressions that begin with minus signs
in parentheses rather than insert spaces before the minus signs.

Link: http://lkml.kernel.org/r/20170525120316.24473-5-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Ian Abbott
8cdd7cca92 linux/bug.h: correct "(foo*)" should be "(foo *)"
Correct this checkpatch.pl error:

|ERROR: "(foo*)" should be "(foo *)"
|#19: FILE: include/linux/bug.h:19:
|+#define BUILD_BUG_ON_NULL(e) ((void*)0)

Link: http://lkml.kernel.org/r/20170525120316.24473-4-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Ian Abbott
e9d5a48499 linux/bug.h: correct formatting of block comment
Correct these checkpatch.pl warnings:

|WARNING: Block comments use * on subsequent lines
|#34: FILE: include/linux/bug.h:34:
|+/* Force a compilation error if condition is true, but also produce a
|+   result (of value 0 and type size_t), so the expression can be used

|WARNING: Block comments use a trailing */ on a separate line
|#36: FILE: include/linux/bug.h:36:
|+   aren't permitted). */

Link: http://lkml.kernel.org/r/20170525120316.24473-3-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Ian Abbott
0b396923ee asm-generic/bug.h: declare struct pt_regs; before function prototype
This series of patches splits BUILD_BUG related macros out of
"include/linux/bug.h" into new file "include/linux/build_bug.h" (patch
5), and changes the pointer type checking in the `container_of()` macro
to deal with pointers of array type better (patch 6).  Patches 1 to 4
are prerequisites.

Patches 2, 3, 4, and 5 have been inserted since the previous version of
this patch series.  Patch 6 here corresponds to v3 and v4's patch 2.

Patch 1 was a prerequisite in v3 of this series to avoid a lot of
warnings when <linux/bug.h> was included by <linux/kernel.h>.  That is
no longer relevant for v5 of the series, but I left it in because it was
acked by a Arnd Bergmann and Michal Nazarewicz.

Patches 2, 3, and 4 are some checkpatch clean-ups on
"include/linux/bug.h" before splitting out the BUILD_BUG stuff in patch
5.

Patch 5 splits the BUILD_BUG related macros out of "include/linux/bug.h"
into new file "include/linux/build_bug.h" because including
<linux/bug.h> in "include/linux/kernel.h" would result in build failures
due to circular dependencies.

Patch 6 changes the pointer type checking by `container_of()` to avoid
some incompatible pointer warnings when the dereferenced pointer has
array type.

1) asm-generic/bug.h: declare struct pt_regs; before function prototype
2) linux/bug.h: correct formatting of block comment
3) linux/bug.h: correct "(foo*)" should be "(foo *)"
4) linux/bug.h: correct "space required before that '-'"
5) bug: split BUILD_BUG stuff out into <linux/build_bug.h>
6) kernel.h: handle pointers to arrays better in container_of()

This patch (of 6):

The declaration of `__warn()` has `struct pt_regs *regs` as one of its
parameters.  This can result in compiler warnings if `struct regs` is not
already declared.  Add an empty declaration of `struct pt_regs` to avoid
the warnings.

Link: http://lkml.kernel.org/r/20170525120316.24473-2-abbotti@mev.co.uk
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Michal Hocko
9d1f4b3f5b mm: disallow early_pfn_to_nid on configurations which do not implement it
early_pfn_to_nid will return node 0 if both HAVE_ARCH_EARLY_PFN_TO_NID
and HAVE_MEMBLOCK_NODE_MAP are disabled.  It seems we are safe now
because all architectures which support NUMA define one of them (with an
exception of alpha which however has CONFIG_NUMA marked as broken) so
this works as expected.  It can get silently and subtly broken too
easily, though.  Make sure we fail the compilation if NUMA is enabled
and there is no proper implementation for this function.  If that ever
happens we know that either the specific configuration is invalid and
the fix should either disable NUMA or enable one of the above configs.

Link: http://lkml.kernel.org/r/20170704075803.15979-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Thomas Gleixner
a47fed5b5b mm: swap: provide lru_add_drain_all_cpuslocked()
The rework of the cpu hotplug locking unearthed potential deadlocks with
the memory hotplug locking code.

The solution for these is to rework the memory hotplug locking code as
well and take the cpu hotplug lock before the memory hotplug lock in
mem_hotplug_begin(), but this will cause a recursive locking of the cpu
hotplug lock when the memory hotplug code calls lru_add_drain_all().

Split out the inner workings of lru_add_drain_all() into
lru_add_drain_all_cpuslocked() so this function can be invoked from the
memory hotplug code with the cpu hotplug lock held.

Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Sahitya Tummala
2c80cd57c7 mm/list_lru.c: fix list_lru_count_node() to be race free
list_lru_count_node() iterates over all memcgs to get the total number of
entries on the node but it can race with memcg_drain_all_list_lrus(),
which migrates the entries from a dead cgroup to another.  This can return
incorrect number of entries from list_lru_count_node().

Fix this by keeping track of entries per node and simply return it in
list_lru_count_node().

Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Polakov <apolyakov@beget.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Nikolay Borisov
e3d3910a57 include/linux/backing-dev.h: simplify wb_stat_sum
wb_stat_sum() disables interrupts and calls __wb_stat_sum() which
eventually calls __percpu_counter_sum().  However, the percpu routine is
already irq-safe.  Simplify the code a bit by making wb_stat_sum()
directly call percpu_counter_sum_positive() and not disable interrupts.

Also remove the now-uneeded __wb_stat_sum() which was just a wrapper
over percpu_counter_sum_positive().

Link: http://lkml.kernel.org/r/1498230681-29103-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Nikolay Borisov
618b8c20d0 include/linux/mmzone.h: remove ancient/ambiguous comment
Currently pg_data_t is just a struct which describes a NUMA node memory
layout.  Let's keep the comment simple and remove ambiguity.

Link: http://lkml.kernel.org/r/1498220534-22717-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Michal Hocko
3e59fcb0e8 hugetlb: add support for preferred node to alloc_huge_page_nodemask
alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask starting from lower numa nodes.  This might lead to
filling up those low NUMA nodes while others are not used.  We can
reduce this risk by introducing a concept of the preferred node similar
to what we have in the regular page allocator.  We will start allocating
from the preferred nid and then iterate over all allowed nodes in the
zonelist order until we try them all.

This is mimicing the page allocator logic except it operates on per-node
mempools.  dequeue_huge_page_vma already does this so distill the
zonelist logic into a more generic dequeue_huge_page_nodemask and use it
in alloc_huge_page_nodemask.

This will allow us to use proper per numa distance fallback also for
alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
can get rid of alloc_huge_page_node helper which doesn't have any user
anymore.

Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Michal Hocko
aaf14e40a3 mm, hugetlb: unclutter hugetlb allocation layers
Patch series "mm, hugetlb: allow proper node fallback dequeue".

While working on a hugetlb migration issue addressed in a separate
patchset[1] I have noticed that the hugetlb allocations from the
preallocated pool are quite subotimal.

 [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org

There is no fallback mechanism implemented and no notion of preferred
node.  I have tried to work around it but Vlastimil was right to push
back for a more robust solution.  It seems that such a solution is to
reuse zonelist approach we use for the page alloctor.

This series has 3 patches.  The first one tries to make hugetlb
allocation layers more clear.  The second one implements the zonelist
hugetlb pool allocation and introduces a preferred node semantic which
is used by the migration callbacks.  The last patch is a clean up.

This patch (of 3):

Hugetlb allocation path for fresh huge pages is unnecessarily complex
and it mixes different interfaces between layers.

__alloc_buddy_huge_page is the central place to perform a new
allocation.  It checks for the hugetlb overcommit and then relies on
__hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
all good except that __alloc_buddy_huge_page pushes vma and address down
the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
two different allocation modes - one for memory policy and other node
specific (or to make it more obscure node non-specific) requests.

This just screams for a reorganization.

This patch pulls out all the vma specific handling up to
__alloc_buddy_huge_page_with_mpol where it belongs.
__alloc_buddy_huge_page will get nodemask argument and
__hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
page allocator.

In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
  __alloc_buddy_huge_page - overcommit handling and accounting
    __hugetlb_alloc_buddy_huge_page - page allocator layer

Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
is not really needed because the page allocator already handles the
cpusets update.

Finally __hugetlb_alloc_buddy_huge_page had a special case for node
specific allocations (when no policy is applied and there is a node
given).  This has relied on __GFP_THISNODE to not fallback to a different
node.  alloc_huge_page_node is the only caller which relies on this
behavior so move the __GFP_THISNODE there.

Not only does this remove quite some code it also should make those
layers easier to follow and clear wrt responsibilities.

Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Roman Gushchin
422580c3ce mm/oom_kill.c: add tracepoints for oom reaper-related events
During the debugging of the problem described in
https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
output is not really useful to understand issues related to the oom
reaper.

So, I assume, that adding some tracepoints might help with debugging of
similar issues.

Trace the following events:
 1) a process is marked as an oom victim,
 2) a process is added to the oom reaper list,
 3) the oom reaper starts reaping process's mm,
 4) the oom reaper finished reaping,
 5) the oom reaper skips reaping.

How it works in practice? Below is an example which show how the problem
mentioned above can be found: one process is added twice to the
oom_reaper list:

  $ cd /sys/kernel/debug/tracing
  $ echo "oom:mark_victim" > set_event
  $ echo "oom:wake_reaper" >> set_event
  $ echo "oom:skip_task_reaping" >> set_event
  $ echo "oom:start_task_reaping" >> set_event
  $ echo "oom:finish_task_reaping" >> set_event
  $ cat trace_pipe
          allocate-502   [001] ....    91.836405: mark_victim: pid=502
          allocate-502   [001] .N..    91.837356: wake_reaper: pid=502
          allocate-502   [000] .N..    91.871149: wake_reaper: pid=502
        oom_reaper-23    [000] ....    91.871177: start_task_reaping: pid=502
        oom_reaper-23    [000] .N..    91.879511: finish_task_reaping: pid=502
        oom_reaper-23    [000] ....    91.879580: skip_task_reaping: pid=502

Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Michal Hocko
8b91323889 mm: unify new_node_page and alloc_migrate_target
Commit 394e31d2ce ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") has duplicated a large part of
alloc_migrate_target with some hotplug specific special casing.

To be more precise it tried to enfore the allocation from a different
node than the original page.  As a result the two function diverged in
their shared logic, e.g.  the hugetlb allocation strategy.

Let's unify the two and express different NUMA requirements by the given
nodemask.  new_node_page will simply exclude the node it doesn't care
about and alloc_migrate_target will use all the available nodes.
alloc_migrate_target will then learn to migrate hugetlb pages more
sanely and use preallocated pool when possible.

Please note that alloc_migrate_target used to call alloc_page resp.
alloc_pages_current so the memory policy of the current context which is
quite strange when we consider that it is used in the context of
alloc_contig_range which just tries to migrate pages which stand in the
way.

Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Michal Hocko
4db9b2efe9 hugetlb, memory_hotplug: prefer to use reserved pages for migration
new_node_page will try to use the origin's next NUMA node as the
migration destination for hugetlb pages.  If such a node doesn't have
any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
to allocate a surplus page instead.  This is quite subotpimal for any
configuration when hugetlb pages are no distributed to all NUMA nodes
evenly.  Say we have a hotplugable node 4 and spare hugetlb pages are
node 0

  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

Now we consume the whole pool on node 4 and try to offline this node.
All the allocated pages should be moved to node0 which has enough
preallocated pages to hold them.  With the current implementation
offlining very likely fails because hugetlb allocations during runtime
are much less reliable.

Fix this by reusing the nodemask which excludes migration source and try
to find a first node which has a page in the preallocated pool first and
fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
consumed.

[akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Will Deacon
108a7ac448 include/linux/page_ref.h: ensure page_ref_unfreeze is ordered against prior accesses
page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
wrapping a critical section where struct pages can be modified without
having to worry about consistency for a concurrent fast-GUP.

Whilst page_ref_freeze has full barrier semantics due to its use of
atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
doesn't provide any barrier semantics and allows the operation to be
reordered with respect to page modifications in the critical section.

This patch ensures that page_ref_unfreeze is ordered after any critical
section updates, by invoking smp_mb() prior to the atomic_set.

Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Dan Williams
baabda2614 mm: always enable thp for dax mappings
The madvise policy for transparent huge pages is meant to avoid unwanted
allocations of transparent huge pages.  It allows a policy of disabling
the extra memory pressure and effort to arrange for a huge page when it
is not needed.

DAX by definition never incurs this overhead since it is statically
allocated.  The policy choice makes even less sense for device-dax which
tries to guarantee a given tlb-fault size.  Specifically, the following
setting:

	echo never > /sys/kernel/mm/transparent_hugepage/enabled

...violates that guarantee and silently disables all device-dax
instances with a 2M or 1G alignment.  So, let's avoid that non-obvious
side effect by force enabling thp for dax mappings in all cases.

It is worth noting that the reason this uses vma_is_dax(), and the
resulting header include changes, is that previous attempts to add a
VM_DAX flag were NAKd.

Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Dan Williams
16981d7635 mm: improve readability of transparent_hugepage_enabled()
Turn the macro into a static inline and rewrite the condition checks for
better readability in preparation for adding another condition.

[ross.zwisler@linux.intel.com: fix logic to make conversion equivalent]
[akpm@linux-foundation.org: resolve vs mm-make-pr_set_thp_disable-immediately-active.patch]
[akpm@linux-foundation.org: include coredump.h for MMF_DISABLE_THP]
Link: http://lkml.kernel.org/r/149739530612.20686.14760671150202647861.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Steven Rostedt (VMware)
7ab0e50ad0 oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACK
After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
been evaluated:

 # cat /sys/kernel/debug/tracing/enum_map

(which will soon be renamed to eval_map)

And it showed some interesting results:

  [..]
  ZONE_MOVABLE 3 (oom)
  ZONE_NORMAL 2 (oom)
  ZONE_DMA32 1 (oom)
  ZONE_DMA 0 (oom)
  3 3 (oom)
  2 2 (oom)
  1 1 (oom)
  COMPACT_PRIO_ASYNC 2 (oom)
  COMPACT_PRIO_SYNC_LIGHT 1 (oom)
  COMPACT_PRIO_SYNC_FULL 0 (oom)
  [..]
  ZONE_DMA 0 (vmscan)
  3 3 (vmscan)
  2 2 (vmscan)
  1 1 (vmscan)
  COMPACT_PRIO_ASYNC 2 (vmscan)
  [..]
  ZONE_DMA 0 (kmem)
  3 3 (kmem)
  2 2 (kmem)
  1 1 (kmem)
  COMPACT_PRIO_ASYNC 2 (kmem)
  [..]
  ZONE_DMA 0 (compaction)
  3 3 (compaction)
  2 2 (compaction)
  1 1 (compaction)
  COMPACT_PRIO_ASYNC 2 (compaction)
  [..]

The name within the parenthesis are the trace systems that the enum/eval
maps are associated with. When there's a number evaluated to another
number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
and not an enum. As #defines get converted normally, they are not needed
to be evaluated.

Each of the above trace systems with the number to number evaluation
included the file include/trace/events/mmflags.h which has:

 /* High-level compaction status feedback */
 #define COMPACTION_FAILED       1
 #define COMPACTION_WITHDRAWN    2
 #define COMPACTION_PROGRESS     3

[..]

 #define COMPACTION_FEEDBACK             \
        EM(COMPACTION_FAILED,           "failed")       \
        EM(COMPACTION_WITHDRAWN,        "withdrawn")    \
        EMe(COMPACTION_PROGRESS,        "progress")

Which is still needed for the __print_symbolic() usage in the
trace_event.  But it is not needed to be evaluated.

Removing the evaluation part removes the unnecessary evaluations of
numbers to numbers.

Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Michal Hocko
1860033237 mm: make PR_SET_THP_DISABLE immediately active
PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.

The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set.  This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.

Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU.  In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage.  Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated.  However, khugepaged
collapses the pages and the expected page faults do not occur.

In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.

It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.

Fixes: a0715cc226 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00