Commit Graph

67785 Commits

Author SHA1 Message Date
Linus Torvalds
5a13fc3d8b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping fix from Thomas Gleixner:
 "A single fix for a regression caused by the generic VDSO
  implementation where a math overflow causes CLOCK_BOOTTIME to become a
  random number generator"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25 10:08:01 -07:00
Linus Torvalds
e67095fd2f Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
 "Two fixes for regressions in this merge window:

   - select the Kconfig symbols for the noncoherent dma arch helpers on
     arm if swiotlb is selected, not just for LPAE to not break then Xen
     build, that uses swiotlb indirectly through swiotlb-xen

   - fix the page allocator fallback in dma_alloc_contiguous if the CMA
     allocation fails"

* tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: fix zone selection after an unaddressable CMA allocation
  arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24 20:00:11 -07:00
Linus Torvalds
083f0f2cd4 Merge tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
 "Here is a (hopefully last) set of GPIO fixes for the v5.3 kernel
  cycle. Two are pretty core:

   - Fix not reporting open drain/source lines to userspace as "input"

   - Fix a minor build error found in randconfigs

   - Fix a chip select quirk on the Freescale SPI

   - Fix the irqchip initialization semantic order to reflect what it
     was using the old API"

* tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: Fix irqchip initialization order
  gpio: of: fix Freescale SPI CS quirk handling
  gpio: Fix build error of function redefinition
  gpiolib: never report open-drain/source lines as 'input' to user-space
2019-08-24 14:45:33 -07:00
Linus Torvalds
4e56394490 Merge tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "Three important fixes tagged for stable (an indefinite hang, a crash
  on an assert and a NULL pointer dereference) plus a small series from
  Luis fixing instances of vfree() under spinlock"

* tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client:
  libceph: fix PG split vs OSD (re)connect race
  ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply
  ceph: clear page dirty before invalidate page
  ceph: fix buffer free while holding i_ceph_lock in fill_inode()
  ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
  ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
  libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
2019-08-23 09:19:38 -07:00
Thomas Gleixner
b99328a60a timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the
nanoseconds based boot time offset left by the clocksource shift. That
overflows once the boot time offset becomes large enough. As a consequence
CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
misbehave.

Fix it by storing a timespec64 representation of the offset when boot time
is adjusted and add that to the MONOTONIC base time value in the vdso data
page. Using the timespec64 representation avoids a 64bit division in the
update code.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Reported-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de
2019-08-23 02:12:11 +02:00
Luis Henriques
5c498950f7 libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-08-22 10:47:41 +02:00
Christoph Hellwig
90ae409f9e dma-direct: fix zone selection after an unaddressable CMA allocation
The new dma_alloc_contiguous hides if we allocate CMA or regular
pages, and thus fails to retry a ZONE_NORMAL allocation if the CMA
allocation succeeds but isn't addressable.  That means we either fail
outright or dip into a small zone that might not succeed either.

Thanks to Hillf Danton for debugging this issue.

Fixes: b1d2dc009d ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
2019-08-21 07:14:10 +09:00
Linus Torvalds
287c55ed7d Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kernel thread signal handling fix from Eric Biederman:
 "I overlooked the fact that kernel threads are created with all signals
  set to SIG_IGN, and accidentally caused a regression in cifs and drbd
  when replacing force_sig with send_sig.

  This is my fix for that regression. I add a new function
  allow_kernel_signal which allows kernel threads to receive signals
  sent from the kernel, but continues to ignore all signals sent from
  userspace. This ensures the user space interface for cifs and drbd
  remain the same.

  These kernel threads depend on blocking networking calls which block
  until something is received or a signal is pending. Making receiving
  of signals somewhat necessary for these kernel threads.

  Perhaps someday we can cleanup those interfaces and remove
  allow_kernel_signal. If not allow_kernel_signal is pretty trivial and
  clearly documents what is going on so I don't think we will mind
  carrying it"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Allow cifs and drbd to receive their terminating signals
2019-08-19 16:17:59 -07:00
Linus Torvalds
06821504fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

  1) Fix jmp to 1st instruction in x64 JIT, from Alexei Starovoitov.

  2) Severl kTLS fixes in mlx5 driver, from Tariq Toukan.

  3) Fix severe performance regression due to lack of SKB coalescing of
     fragments during local delivery, from Guillaume Nault.

  4) Error path memory leak in sch_taprio, from Ivan Khoronzhuk.

  5) Fix batched events in skbedit packet action, from Roman Mashak.

  6) Propagate VLAN TX offload to hw_enc_features in bond and team
     drivers, from Yue Haibing.

  7) RXRPC local endpoint refcounting fix and read after free in
     rxrpc_queue_local(), from David Howells.

  8) Fix endian bug in ibmveth multicast list handling, from Thomas
     Falcon.

  9) Oops, make nlmsg_parse() wrap around the correct function,
     __nlmsg_parse not __nla_parse(). Fix from David Ahern.

 10) Memleak in sctp_scend_reset_streams(), fro Zheng Bin.

 11) Fix memory leak in cxgb4, from Wenwen Wang.

 12) Yet another race in AF_PACKET, from Eric Dumazet.

 13) Fix false detection of retransmit failures in tipc, from Tuong
     Lien.

 14) Use after free in ravb_tstamp_skb, from Tho Vu.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
  ravb: Fix use-after-free ravb_tstamp_skb
  netfilter: nf_tables: map basechain priority to hardware priority
  net: sched: use major priority number as hardware priority
  wimax/i2400m: fix a memory leak bug
  net: cavium: fix driver name
  ibmvnic: Unmap DMA address of TX descriptor buffers after use
  bnxt_en: Fix to include flow direction in L2 key
  bnxt_en: Use correct src_fid to determine direction of the flow
  bnxt_en: Suppress HWRM errors for HWRM_NVM_GET_VARIABLE command
  bnxt_en: Fix handling FRAG_ERR when NVM_INSTALL_UPDATE cmd fails
  bnxt_en: Improve RX doorbell sequence.
  bnxt_en: Fix VNIC clearing logic for 57500 chips.
  net: kalmia: fix memory leaks
  cx82310_eth: fix a memory leak bug
  bnx2x: Fix VF's VLAN reconfiguration in reload.
  Bluetooth: Add debug setting for changing minimum encryption key size
  tipc: fix false detection of retransmit failures
  lan78xx: Fix memory leaks
  MAINTAINERS: r8169: Update path to the driver
  MAINTAINERS: PHY LIBRARY: Update files in the record
  ...
2019-08-19 10:00:01 -07:00
David Howells
555df336c7 keys: Fix description size
The maximum key description size is 4095.  Commit f771fde820 ("keys:
Simplify key description management") inadvertantly reduced that to 255
and made sizes between 256 and 4095 work weirdly, and any size whereby
size & 255 == 0 would cause an assertion in __key_link_begin() at the
following line:

	BUG_ON(index_key->desc_len == 0);

This can be fixed by simply increasing the size of desc_len in struct
keyring_index_key to a u16.

Note the argument length test in keyutils only checked empty
descriptions and descriptions with a size around the limit (ie.  4095)
and not for all the values in between, so it missed this.  This has been
addressed and

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/commit/?id=066bf56807c26cd3045a25f355b34c1d8a20a5aa

now exhaustively tests all possible lengths of type, description and
payload and then some.

The assertion failure looks something like:

 kernel BUG at security/keys/keyring.c:1245!
 ...
 RIP: 0010:__key_link_begin+0x88/0xa0
 ...
 Call Trace:
  key_create_or_update+0x211/0x4b0
  __x64_sys_add_key+0x101/0x200
  do_syscall_64+0x5b/0x1e0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It can be triggered by:

	keyctl add user "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" a @s

Fixes: f771fde820 ("keys: Simplify key description management")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-19 09:43:57 -07:00
Eric W. Biederman
33da8e7c81 signal: Allow cifs and drbd to receive their terminating signals
My recent to change to only use force_sig for a synchronous events
wound up breaking signal reception cifs and drbd.  I had overlooked
the fact that by default kthreads start out with all signals set to
SIG_IGN.  So a change I thought was safe turned out to have made it
impossible for those kernel thread to catch their signals.

Reverting the work on force_sig is a bad idea because what the code
was doing was very much a misuse of force_sig.  As the way force_sig
ultimately allowed the signal to happen was to change the signal
handler to SIG_DFL.  Which after the first signal will allow userspace
to send signals to these kernel threads.  At least for
wake_ack_receiver in drbd that does not appear actively wrong.

So correct this problem by adding allow_kernel_signal that will allow
signals whose siginfo reports they were sent by the kernel through,
but will not allow userspace generated signals, and update cifs and
drbd to call allow_kernel_signal in an appropriate place so that their
thread can receive this signal.

Fixing things this way ensures that userspace won't be able to send
signals and cause problems, that it is clear which signals the
threads are expecting to receive, and it guarantees that nothing
else in the system will be affected.

This change was partly inspired by similar cifs and drbd patches that
added allow_signal.

Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf0 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-08-19 06:34:13 -05:00
Linus Torvalds
359334caf7 Merge tag 'usb-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
 "Here are number of small USB fixes for 5.3-rc5.

  Syzbot has been on a tear recently now that it has some good USB
  debugging hooks integrated, so there's a number of fixes in here found
  by those tools for some _very_ old bugs. Also a handful of gadget
  driver fixes for reported issues, some hopefully-final dma fixes for
  host controller drivers, and some new USB serial gadget driver ids.

  All of these have been in linux-next this week with no reported issues
  (the usb-serial ones were in linux-next in its own branch, but merged
  into mine on Friday)"

* tag 'usb-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: add a hcd_uses_dma helper
  usb: don't create dma pools for HCDs with a localmem_pool
  usb: chipidea: imx: fix EPROBE_DEFER support during driver probe
  usb: host: fotg2: restart hcd after port reset
  USB: CDC: fix sanity checks in CDC union parser
  usb: cdc-acm: make sure a refcount is taken early enough
  USB: serial: option: add the BroadMobi BM818 card
  USB: serial: option: Add Motorola modem UARTs
  USB: core: Fix races in character device registration and deregistraion
  usb: gadget: mass_storage: Fix races between fsg_disable and fsg_set_alt
  usb: gadget: composite: Clear "suspended" on reset/disconnect
  usb: gadget: udc: renesas_usb3: Fix sysfs interface of "role"
  USB: serial: option: add D-Link DWM-222 device ID
  USB: serial: option: Add support for ZTE MF871A
2019-08-18 09:11:29 -07:00
Linus Torvalds
8fde2832bd Merge tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes that should go into this series. This contains:

   - Revert of the REQ_NOWAIT_INLINE and associated dio changes. There
     were still corner cases there, and even though I had a solution for
     it, it's too involved for this stage. (me)

   - Set of NVMe fixes (via Sagi)

   - io_uring fix for fixed buffers (Anthony)

   - io_uring defer issue fix (Jackie)

   - Regression fix for queue sync at exit time (zhengbin)

   - xen blk-back memory leak fix (Wenwen)"

* tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block:
  io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list
  block: remove REQ_NOWAIT_INLINE
  io_uring: fix manual setup of iov_iter for fixed buffers
  xen/blkback: fix memory leaks
  blk-mq: move cancel of requeue_work to the front of blk_exit_queue
  nvme-pci: Fix async probe remove race
  nvme: fix controller removal race with scan work
  nvme-rdma: fix possible use-after-free in connect error flow
  nvme: fix a possible deadlock when passthru commands sent to a multipath device
  nvme-core: Fix extra device_put() call on error path
  nvmet-file: fix nvmet_file_flush() always returning an error
  nvmet-loop: Flush nvme_delete_wq when removing the port
  nvmet: Fix use-after-free bug when a port is removed
  nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns
2019-08-17 19:39:54 -07:00
Linus Torvalds
2d63ba3e41 Merge tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These add a check to avoid recent suspend-to-idle power regression on
  systems with NVMe drives where the PCIe ASPM policy is "performance"
  (or when the kernel is built without ASPM support), fix an issue
  related to frequency limits in the schedutil cpufreq governor and fix
  a mistake related to the PM QoS usage in the cpufreq core introduced
  recently.

  Specifics:

   - Disable NVMe power optimization related to suspend-to-idle added
     recently on systems where PCIe ASPM is not able to put PCIe links
     into low-power states to prevent excess power from being drawn by
     the system while suspended (Rafael Wysocki).

   - Make the schedutil governor handle frequency limits changes
     properly in all cases (Viresh Kumar).

   - Prevent the cpufreq core from treating positive values returned by
     dev_pm_qos_update_request() as errors (Viresh Kumar)"

* tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  PCI/ASPM: Add pcie_aspm_enabled()
  cpufreq: schedutil: Don't skip freq update when limits change
  cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16 09:13:16 -07:00
Jens Axboe
7b6620d7db block: remove REQ_NOWAIT_INLINE
We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:

e15c2ffa10 ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb86 ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c9720 ("blk-mq: allow REQ_NOWAIT to return an error inline")

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-15 11:09:16 -06:00
Christoph Hellwig
edfbcb321f usb: add a hcd_uses_dma helper
The USB buffer allocation code is the only place in the usb core (and in
fact the whole kernel) that uses is_device_dma_capable, while the URB
mapping code uses the uses_dma flag in struct usb_bus.  Switch the buffer
allocation to use the uses_dma flag used by the rest of the USB code,
and create a helper in hcd.h that checks this flag as well as the
CONFIG_HAS_DMA to simplify the caller a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20190811080520.21712-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-15 15:18:05 +02:00
Linus Torvalds
e83b009c5c Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-14 10:31:11 -07:00
YueHaibing
68e03b8547 gpio: Fix build error of function redefinition
when do randbuilding, I got this error:

In file included from drivers/hwmon/pmbus/ucd9000.c:19:0:
./include/linux/gpio/driver.h:576:1: error: redefinition of gpiochip_add_pin_range
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
 ^~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/hwmon/pmbus/ucd9000.c:18:0:
./include/linux/gpio.h:245:1: note: previous definition of gpiochip_add_pin_range was here
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
 ^~~~~~~~~~~~~~~~~~~~~~

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 964cb34188 ("gpio: move pincontrol calls to <linux/gpio/driver.h>")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190731123814.46624-1-yuehaibing@huawei.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-08-14 10:57:18 +02:00
Andrea Arcangeli
a8282608c8 Revert "mm, thp: restore node-local hugepage allocations"
This reverts commit 2f0799a0ff ("mm, thp: restore node-local
hugepage allocations").

commit 2f0799a0ff was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window.  Now we understood the regression was a false
positive and was caused by a significant increase in fairness during a
swap trashing benchmark.  So it's safe to re-apply the fix and continue
improving the code from there.  The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload.  The
removal of __GFP_THISNODE increased fairness.

__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_node =
3.

Any workload who could have benefited from __GFP_THISNODE has now to
enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.

The performance characteristic of memory depends on the hardware
details.  The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default.  The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).

  D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
  0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A

D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e.  inter socket memory),
etc...

For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:

  D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
  0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A

NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.

After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:

  D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

  D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we shall aim for in the long term (i.e.  the first one at
the top).

After this commit is applied, we can introduce a new allocator multi
order API and to replace those two alloc_pages_vmas calls in the page
fault path, with a single multi order call:

        unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
        page = alloc_pages_multi_order(..., &order);
        if (!page)
        	goto out;
        if (!(order & (1 << 0))) {
        	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
        	/* THP fault */
        } else {
        	VM_WARN_ON(order != 1 << 0);
        	/* 4k fallback */
        }

The page allocator logic has to be altered so that when it fails on any
zone with order 9, it has to try again with a order 0 before falling
back to the next zone in the zonelist.

After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Andrea Arcangeli
92717d429b Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot.  We can safely re-apply them now that we had time
to analyze the problem.

The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.

The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further.  The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.

This patch (of 2):

This reverts commit 356ff8a9a7 ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
89c83fb539 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").

Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden.  There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.

This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.

[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Roman Gushchin
ec9f02384f mm: workingset: fix vmstat counters for shadow nodes
Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
        virt_to_page(xa_node)->mem_cgroup

Since commit 4d96ba3530 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page().  Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer.  That was a case since the introduction of these
counters by the commit 68d48e6a2d ("mm: workingset: add vmstat counter
for shadow nodes").

Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Ralph Campbell
76470ccd62 mm: document zone device struct page field usage
Patch series "mm/hmm: fixes for device private page migration", v3.

Testing the latest linux git tree turned up a few bugs with page
migration to and from ZONE_DEVICE private and anonymous pages.
Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
mapping and index fields from the source anonymous page mapping.

This patch (of 3):

Struct page for ZONE_DEVICE private pages uses the page->mapping and and
page->index fields while the source anonymous pages are migrated to
device private memory.  This is so rmap_walk() can find the page when
migrating the ZONE_DEVICE private page back to system memory.
ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
page->index fields when files are mapped into a process address space.

Add comments to struct page and remove the unused "_zd_pad_1" field to
make this more clear.

Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Rafael J. Wysocki
accd2dd72c PCI/ASPM: Add pcie_aspm_enabled()
Add a function checking whether or not PCIe ASPM has been enabled for
a given device.

It will be used by the NVMe driver to decide how to handle the
device during system suspend.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
2019-08-12 10:47:55 +02:00
Christoph Hellwig
33dcb37cef dma-mapping: fix page attributes for dma_mmap_*
All the way back to introducing dma_common_mmap we've defaulted to mark
the pages as uncached.  But this is wrong for DMA coherent devices.
Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
flag is only treated special on the alloc side for non-coherent devices.

Introduce a new dma_pgprot helper that deals with the check for coherent
devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
and we thus ensure no aliasing of page attributes happens, which makes
the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
remaining ones.

Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
we'll phase it out soon.

Fixes: 64ccc9c033 ("common: dma-mapping: add support for generic dma_mmap_* calls")
Reported-by: Shawn Anastasio <shawn@anastas.io>
Reported-by: Gavin Li <git@thegavinli.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10 19:52:45 +02:00
Linus Torvalds
7f20fd2337 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "Bugfixes (arm and x86) and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: kvm: Adding config fragments
  KVM: selftests: Update gitignore file for latest changes
  kvm: remove unnecessary PageReserved check
  KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable
  KVM: arm: Don't write junk to CP15 registers on reset
  KVM: arm64: Don't write junk to sysregs on reset
  KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block
  x86: kvm: remove useless calls to kvm_para_available
  KVM: no need to check return value of debugfs_create functions
  KVM: remove kvm_arch_has_vcpu_debugfs()
  KVM: Fix leak vCPU's VMCS value into other pCPU
  KVM: Check preempted_in_kernel for involuntary preemption
  KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire
  arm64: KVM: hyp: debug-sr: Mark expected switch fall-through
  KVM: arm64: Update kvm_arm_exception_class and esr_class_str for new EC
  KVM: arm: vgic-v3: Mark expected switch fall-through
  arm64: KVM: regmap: Fix unexpected switch fall-through
  KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter index
2019-08-09 15:46:29 -07:00
David S. Miller
9566e650bf Merge tag 'mlx5-fixes-2019-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2019-08-08

This series introduces some fixes to mlx5 driver.

Highlights:
1) From Tariq, Critical mlx5 kTLS fixes to better align with hw specs.
2) From Aya, Fixes to mlx5 tx devlink health reporter.
3) From Maxim, aRFs parsing to use flow dissector to avoid relying on
invalid skb fields.

Please pull and let me know if there is any problem.

For -stable v4.3
 ('net/mlx5e: Only support tx/rx pause setting for port owner')
For -stable v4.9
 ('net/mlx5e: Use flow keys dissector to parse packets for ARFS')
For -stable v5.1
 ('net/mlx5e: Fix false negative indication on tx reporter CQE recovery')
 ('net/mlx5e: Remove redundant check in CQE recovery flow of tx reporter')
 ('net/mlx5e: ethtool, Avoid setting speed to 56GBASE when autoneg off')

Note: when merged with net-next this minor conflict will pop up:
++<<<<<<< (net-next)
 +      if (is_eswitch_flow) {
 +              flow->esw_attr->match_level = match_level;
 +              flow->esw_attr->tunnel_match_level = tunnel_match_level;
++=======
+       if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
+               flow->esw_attr->inner_match_level = inner_match_level;
+               flow->esw_attr->outer_match_level = outer_match_level;
++>>>>>>> (net)

To resolve, use hunks from net (2nd) and replace:
if (flow->flags & MLX5E_TC_FLOW_ESWITCH)
with
if (is_eswitch_flow)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09 13:23:33 -07:00
Linus Torvalds
e577dc152e Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "Fix a number of bugs in the ccp driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: ccp - Ignore tag length when decrypting GCM ciphertext
  crypto: ccp - Add support for valid authsize values less than 16
  crypto: ccp - Fix oops by properly managing allocated structures
2019-08-09 09:17:05 -07:00
Paolo Bonzini
0e1c438c44 Merge tag 'kvmarm-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.3

- A bunch of switch/case fall-through annotation, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings
2019-08-09 16:53:39 +02:00
Jakub Kicinski
414776621d net/tls: prevent skb_orphan() from leaking TLS plain text with offload
sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

See commit 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect
through ULP") for justification why the internal flag is safe.
The only location which could leak the flag in is tcp_bpf_sendmsg(),
which is taken care of by clearing the previously unused bit.

v2:
 - remove superfluous decrypted mark copy (Willem);
 - remove the stale doc entry (Boris);
 - rely entirely on EOR marking to prevent coalescing (Boris);
 - use an internal sendpages flag instead of marking the socket
   (Boris).
v3 (Willem):
 - reorganize the can_skb_orphan_partial() condition;
 - fix the flag leak-in through tcp_bpf_sendmsg.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 22:39:35 -07:00
Tariq Toukan
a9bc339032 net/mlx5e: kTLS, Fix progress params context WQE layout
The TLS progress params context WQE should not include an
Eth segment, drop it.
In addition, align the tls_progress_params layout with the
HW specification document:
- fix the tisn field name.
- remove the valid bit.

Fixes: a12ff35e0f ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Fixes: d2ead1f360 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-08 13:01:19 -07:00
Tariq Toukan
26149e3e1f net/mlx5: kTLS, Fix wrong TIS opmod constants
Fix the used constants for TLS TIS opmods, per the HW specification.

Fixes: a12ff35e0f ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-08 13:01:19 -07:00
Linus Torvalds
33920f1ec5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Yeah I should have sent a pull request last week, so there is a lot
  more here than usual:

   1) Fix memory leak in ebtables compat code, from Wenwen Wang.

   2) Several kTLS bug fixes from Jakub Kicinski (circular close on
      disconnect etc.)

   3) Force slave speed check on link state recovery in bonding 802.3ad
      mode, from Thomas Falcon.

   4) Clear RX descriptor bits before assigning buffers to them in
      stmmac, from Jose Abreu.

   5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF
      loops, from Nishka Dasgupta.

   6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean.

   7) Need to hold sock across skb->destructor invocation, from Cong
      Wang.

   8) IP header length needs to be validated in ipip tunnel xmit, from
      Haishuang Yan.

   9) Use after free in ip6 tunnel driver, also from Haishuang Yan.

  10) Do not use MSI interrupts on r8169 chips before RTL8168d, from
      Heiner Kallweit.

  11) Upon bridge device init failure, we need to delete the local fdb.
      From Nikolay Aleksandrov.

  12) Handle erros from of_get_mac_address() properly in stmmac, from
      Martin Blumenstingl.

  13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef
      Kadlecsik.

  14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with
      some devices, so revert. From Johannes Berg.

  15) Fix deadlock in rxrpc, from David Howells.

  16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue
      Haibing.

  17) Fix mvpp2 crash on module removal, from Matteo Croce.

  18) Fix race in genphy_update_link, from Heiner Kallweit.

  19) bpf_xdp_adjust_head() stopped working with generic XDP when we
      fixes generic XDP to support stacked devices properly, fix from
      Jesper Dangaard Brouer.

  20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from
      David Ahern.

  21) Several memory leaks in new sja1105 driver, from Vladimir Oltean"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits)
  net: dsa: sja1105: Fix memory leak on meta state machine error path
  net: dsa: sja1105: Fix memory leak on meta state machine normal path
  net: dsa: sja1105: Really fix panic on unregistering PTP clock
  net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
  net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
  net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
  net: sched: sample: allow accessing psample_group with rtnl
  net: sched: police: allow accessing police->params with rtnl
  net: hisilicon: Fix dma_map_single failed on arm64
  net: hisilicon: fix hip04-xmit never return TX_BUSY
  net: hisilicon: make hip04_tx_reclaim non-reentrant
  tc-testing: updated vlan action tests with batch create/delete
  net sched: update vlan action for batched events operations
  net: stmmac: tc: Do not return a fragment entry
  net: stmmac: Fix issues when number of Queues >= 4
  net: stmmac: xgmac: Fix XGMAC selftests
  be2net: disable bh with spin_lock in be_process_mcc
  net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
  net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
  net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
  ...
2019-08-06 17:11:59 -07:00
Greg KH
3e7093d045 KVM: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
void instead of an integer, as we should not care at all about if this
function actually does anything or not.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: <kvm@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:49 +02:00
Paolo Bonzini
741cbbae07 KVM: remove kvm_arch_has_vcpu_debugfs()
There is no need for this function as all arches have to implement
kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
let us actually simplify the code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:48 +02:00
Wanpeng Li
17e433b543 KVM: Fix leak vCPU's VMCS value into other pCPU
After commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:

 INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
 Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:47 +02:00
Linus Torvalds
b7aea68a19 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "17 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  drivers/acpi/scan.c: document why we don't need the device_hotplug_lock
  memremap: move from kernel/ to mm/
  lib/test_meminit.c: use GFP_ATOMIC in RCU critical section
  asm-generic: fix -Wtype-limits compiler warnings
  cgroup: kselftest: relax fs_spec checks
  mm/memory_hotplug.c: remove unneeded return for void function
  mm/migrate.c: initialize pud_entry in migrate_vma()
  coredump: split pipe command whitespace before expanding template
  page flags: prioritize kasan bits over last-cpuid
  ubsan: build ubsan.c more conservatively
  kasan: remove clang version check for KASAN_STACK
  mm: compaction: avoid 100% CPU usage during compaction when a task is killed
  mm: migrate: fix reference check race between __find_get_block() and migration
  mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker
  ocfs2: remove set but not used variable 'last_hash'
  Revert "kmemleak: allow to coexist with fault injection"
  kernel/signal.c: fix a kernel-doc markup
2019-08-03 09:20:49 -07:00
Arnd Bergmann
ee38d94a0a page flags: prioritize kasan bits over last-cpuid
ARM64 randdconfig builds regularly run into a build error, especially
when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:

  #error "KASAN: not enough bits in page flags for tag"

The last-cpuid bits are already contitional on the available space, so
the result of the calculation is a bit random on whether they were
already left out or not.

Adding the kasan tag bits before last-cpuid makes it much more likely to
end up with a successful build here, and should be reliable for
randconfig at least, as long as that does not randomize NR_CPUS or
NODES_SHIFT but uses the defaults.

In order for the modified check to not trigger in the x86 vdso32 code
where all constants are wrong (building with -m32), enclose all the
definitions with an #ifdef.

[arnd@arndb.de: build fix]
  Link: http://lkml.kernel.org/r/CAK8P3a3Mno1SWTcuAOT0Wa9VS15pdU6EfnkxLbDpyS55yO04+g@mail.gmail.com
Link: http://lkml.kernel.org/r/20190722115520.3743282-1-arnd@arndb.de
Link: https://lore.kernel.org/lkml/20190618095347.3850490-1-arnd@arndb.de/
Fixes: 2813b9c029 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:01 -07:00
Linus Torvalds
dcb8cfbd8f Merge tag 'for-linus-5.3a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:

 - a small cleanup

 - a fix for a build error on ARM with some configs

 - a fix of a patch for the Xen gntdev driver

 - three patches for fixing a potential problem in the swiotlb-xen
   driver which Konrad was fine with me carrying them through the Xen
   tree

* tag 'for-linus-5.3a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/swiotlb: remember having called xen_create_contiguous_region()
  xen/swiotlb: simplify range_straddles_page_boundary()
  xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()
  xen: avoid link error on ARM
  xen/gntdev.c: Replace vm_map_pages() with vm_map_pages_zero()
  xen/pciback: remove set but not used variable 'old_state'
2019-08-02 15:26:48 -07:00
Linus Torvalds
10e5ddd71f Merge tag 'for-linus-20190802' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's a small collection of fixes that should go into this series.
  This contains:

   - io_uring potential use-after-free fix (Jackie)

   - loop regression fix (Jan)

   - O_DIRECT fragmented bio regression fix (Damien)

   - Mark Denis as the new floppy maintainer (Denis)

   - ataflop switch fall-through annotation (Gustavo)

   - libata zpodd overflow fix (Kees)

   - libata ahci deferred probe fix (Miquel)

   - nbd invalidation BUG_ON() fix (Munehisa)

   - dasd endless loop fix (Stefan)"

* tag 'for-linus-20190802' of git://git.kernel.dk/linux-block:
  s390/dasd: fix endless loop after read unit address configuration
  block: Fix __blkdev_direct_IO() for bio fragments
  MAINTAINERS: floppy: take over maintainership
  nbd: replace kill_bdev() with __invalidate_device() again
  ata: libahci: do not complain in case of deferred probe
  io_uring: fix KASAN use after free in io_sq_wq_submit_work
  loop: Fix mount(2) failure due to race with LOOP_SET_FD
  libata: zpodd: Fix small read overflow in zpodd_get_mech_type()
  ataflop: Mark expected switch fall-through
2019-08-02 14:31:26 -07:00
Linus Torvalds
42d21900b3 Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
 "A few fixes for code that came in during the merge window or that
  started getting exercised differently this time around:

   - Select regmap MMIO kconfig in spreadtrum driver to avoid compile
     errors

   - Complete kerneldoc on devm_clk_bulk_get_optional()

   - Register an essential clk earlier on mediatek mt8183 SoCs so the
     clocksource driver can use it

   - Fix divisor math in the at91 driver

   - Plug a race in Renesas reset control logic"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: renesas: cpg-mssr: Fix reset control race condition
  clk: sprd: Select REGMAP_MMIO to avoid compile errors
  clk: mediatek: mt8183: Register 13MHz clock earlier for clocksource
  clk: Add missing documentation of devm_clk_bulk_get_optional() argument
  clk: at91: generated: Truncate divisor to GENERATED_MAX_DIV + 1
2019-08-02 08:47:28 -07:00
Gary R Hook
9f00baf74e crypto: ccp - Add support for valid authsize values less than 16
AES GCM encryption allows for authsize values of 4, 8, and 12-16 bytes.
Validate the requested authsize, and retain it to save in the request
context.

Fixes: 36cf515b9b ("crypto: ccp - Enable support for AES GCM on v5 CCPs")
Cc: <stable@vger.kernel.org>
Signed-off-by: Gary R Hook <gary.hook@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-08-02 14:36:35 +10:00
Linus Torvalds
28f5ab1e12 Merge tag 'gpio-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
 "Three GPIO fixes, all touching the core, so quite important:

   - Fix the request of active low GPIO line events.

   - Don't issue WARN() stuff on NULL descriptors if the GPIOLIB is
     disabled.

   - Preserve the descriptor flags when setting the initial direction on
     lines"

* tag 'gpio-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpiolib: Preserve desc->flags when setting state
  gpio: don't WARN() on NULL descs if gpiolib is disabled
  gpiolib: fix incorrect IRQ requesting of an active-low lineevent
2019-08-01 06:26:30 -07:00
Juergen Gross
b877ac9815 xen/swiotlb: remember having called xen_create_contiguous_region()
Instead of always calling xen_destroy_contiguous_region() in case the
memory is DMA-able for the used device, do so only in case it has been
made DMA-able via xen_create_contiguous_region() before.

This will avoid a lot of xen_destroy_contiguous_region() calls for
64-bit capable devices.

As the memory in question is owned by swiotlb-xen the PG_owner_priv_1
flag of the first allocated page can be used for remembering.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-08-01 06:39:33 +02:00
Arnd Bergmann
055d88242a compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
linux-2.5.69 along with hundreds of other commands, but was always broken
sincen only the structure is compatible, but the command number is not,
due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
sockaddr_pppox)), which is different on 64-bit architectures.

Guillaume Nault adds:

  And the implementation was broken until 2016 (see 29e73269aa ("pppoe:
  fix reference counting in PPPoE proxy")), and nobody ever noticed. I
  should probably have removed this ioctl entirely instead of fixing it.
  Clearly, it has never been used.

Fix it by adding a compat_ioctl handler for all pppoe variants that
translates the command number and then calls the regular ioctl function.

All other ioctl commands handled by pppoe are compatible between 32-bit
and 64-bit, and require compat_ptr() conversion.

This should apply to all stable kernels.

Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30 14:42:13 -07:00
Linus Torvalds
515f12b9ee Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM fixes from Jason Gunthorpe:
 "Fix the locking around nouveau's use of the hmm_range_* APIs. It works
  correctly in the success case, but many of the the edge cases have
  missing unlocks or double unlocks.

  The diffstat is a bit big as Christoph did a comprehensive job to move
  the obsolete API from the core header and into the driver before
  fixing its flow, but the risk of regression from this code motion is
  low"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  nouveau: unlock mmap_sem on all errors from nouveau_range_fault
  nouveau: remove the block parameter to nouveau_range_fault
  mm/hmm: move hmm_vma_range_done and hmm_vma_fault to nouveau
  mm/hmm: always return EBUSY for invalid ranges in hmm_range_{fault,snapshot}
2019-07-30 12:54:44 -07:00
Jan Kara
89e524c04f loop: Fix mount(2) failure due to race with LOOP_SET_FD
Commit 33ec3e53e7 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:

for i in {a..z}{a..z}; do
        dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
        mkfs.ext2 $i.image
        mkdir mnt$i
done

echo "Run"
for i in {a..z}{a..z}; do
        mount -o loop -t ext2 $i.image mnt$i &
done

Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.

Fixes: 33ec3e53e7 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-30 13:16:57 -06:00
Linus Torvalds
45aee68e19 Merge tag 'platform-drivers-x86-v5.3-3' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Andy Shevchenko:
 "Business as usual, a few fixes and new IDs:

   - PC Engines APU got one fix for software dependencies to
     automatically load them and another fix for mapping of key button
     in the front to issue restart event.

   - OLPC driver is now probed automatically based on module device
     table.

   - Intel PMC core driver supports Intel Ice Lake NNPI processor.

   - WMI driver missed description of a new field in the structure that
     has been added"

* tag 'platform-drivers-x86-v5.3-3' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/x86: pcengines-apuv2: use KEY_RESTART for front button
  platform/x86: intel_pmc_core: Add ICL-NNPI support to PMC Core
  Platform: OLPC: add SPI MODULE_DEVICE_TABLE
  platform/x86: wmi: add missing struct parameter description
  platform/x86: pcengines-apuv2: Fix softdep statement
2019-07-29 11:11:05 -07:00
Bartosz Golaszewski
ffe0bbabb0 gpio: don't WARN() on NULL descs if gpiolib is disabled
If gpiolib is disabled, we use the inline stubs from gpio/consumer.h
instead of regular definitions of GPIO API. The stubs for 'optional'
variants of gpiod_get routines return NULL in this case as if the
relevant GPIO wasn't found. This is correct so far.

Calling other (non-gpio_get) stubs from this header triggers a warning
because the GPIO descriptor couldn't have been requested. The warning
however is unconditional (WARN_ON(1)) and is emitted even if the passed
descriptor pointer is NULL.

We don't want to force the users of 'optional' gpio_get to check the
returned pointer before calling e.g. gpiod_set_value() so let's only
WARN on non-NULL descriptors.

Cc: stable@vger.kernel.org
Reported-by: Claus H. Stovgaard <cst@phaseone.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2019-07-28 12:44:14 +02:00
Linus Torvalds
e24ce84e85 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two fixes for the fair scheduling class:

   - Prevent freeing memory which is accessible by concurrent readers

   - Make the RCU annotations for numa groups consistent"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Use RCU accessors consistently for ->numa_group
  sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27 21:22:33 -07:00
Linus Torvalds
5efbd93708 Merge tag 'devicetree-fixes-for-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull Devicetree fixes from Rob Herring:
 "The nvmem changes would typically go thru Greg's tree, but they were
  missed in the merge window. [ Acked by Greg ]

  Summary:

   - Fix mismatches in $id values and actual filenames. Now checked by
     tools.

   - Convert nvmem binding to DT schema

   - Fix a typo in of_property_read_bool() kerneldoc

   - Remove some redundant description in al-fic interrupt-controller"

* tag 'devicetree-fixes-for-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: Fix more $id value mismatches filenames
  dt-bindings: nvmem: SID: Fix the examples node names
  dt-bindings: nvmem: Add YAML schemas for the generic NVMEM bindings
  of: Fix typo in kerneldoc
  dt-bindings: interrupt-controller: al-fic: remove redundant binding
  dt-bindings: clk: allwinner,sun4i-a10-ccu: Correct path in $id
2019-07-27 08:49:19 -07:00