Commit Graph

753602 Commits

Author SHA1 Message Date
Chuck Lever
ff699ea826 SUNRPC: Make num_reqs a non-atomic integer
If recording xprt->stat.max_slots is moved into xprt_alloc_slot,
then xprt->num_reqs is never manipulated outside
xprt->reserve_lock. There's no longer a need for xprt->num_reqs to
be atomic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
78215759e2 SUNRPC: Make RTT measurement more precise (Send)
Some RPC transports have more overhead in their send_request
callouts than others. For example, for RPC-over-RDMA:

- Marshaling an RPC often has to DMA map the RPC arguments

- Registration methods perform memory registration as part of
  marshaling

To capture just server and network latencies more precisely: when
sending a Call, capture the rq_xtime timestamp _after_ the transport
header has been marshaled.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
0b87a46b43 SUNRPC: Make RTT measurement more precise (Receive)
Some RPC transports have more overhead in their reply handlers
than others. For example, for RPC-over-RDMA:

- RPC completion has to wait for memory invalidation, which is
  not a part of the server/network round trip

- Recently a context switch was introduced into the reply handler,
  which further artificially inflates the measure of RPC RTT

To capture just server and network latencies more precisely: when
receiving a reply, compute the RTT as soon as the XID is recognized
rather than at RPC completion time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ecd465ee88 SUNRPC: Move xprt_update_rtt callsite
Since commit 33849792cb ("xprtrdma: Detect unreachable NFS/RDMA
servers more reliably"), the xprtrdma transport now has a ->timer
callout. But xprtrdma does not need to compute RTT data, only UDP
needs that. Move the xprt_update_rtt call into the UDP transport
implementation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
2dd4a012d9 xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_req
Refactor: Both rpcrdma_create_req call sites have to allocate the
buffer where the transport header is built, so just move that
allocation into rpcrdma_create_req.

This buffer is a fixed size. There's no needed information available
in call_allocate that is not also available when the transport is
created.

The original purpose for allocating these buffers on demand was to
reduce the possibility that an allocation failure during transport
creation will hork the mount operation during low memory scenarios.
Some relief for this rare possibility is coming up in the next few
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
f287762308 xprtrdma: Chain Send to FastReg WRs
With FRWR, the client transport can perform memory registration and
post a Send with just a single ib_post_send.

This reduces contention between the send_request path and the Send
Completion handlers, and reduces the overhead of registering a chunk
that has multiple segments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
fb14ae8853 xprtrdma: "Support" call-only RPCs
RPC-over-RDMA version 1 credit accounting relies on there being a
response message for every RPC Call. This means that RPC procedures
that have no reply will disrupt credit accounting, just in the same
way as a retransmit would (since it is sent because no reply has
arrived). Deal with the "no reply" case the same way.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ae741a8551 xprtrdma: Reduce number of MRs created by rpcrdma_mrs_create
Create fewer MRs on average. Many workloads don't need as many as
32 MRs, and the transport can now quickly restock the MR free list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
9e679d5e76 xprtrdma: ->send_request returns -EAGAIN when there are no free MRs
Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from MR exhaustion is desirable.

This is the same mechanism that the TCP transport utilizes when
handling write buffer space exhaustion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
8a14793e7a xprtrdma: Remove xprt-specific connect cookie
Clean up: The generic rq_connect_cookie is sufficient to detect RPC
Call retransmission.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
b7e85fff52 xprtrdma: Remove arbitrary limit on initiator depth
Clean up: We need to check only that the value does not exceed the
range of the u8 field it's going into.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
6720a89933 xprtrdma: Fix latency regression on NUMA NFS/RDMA clients
With v4.15, on one of my NFS/RDMA clients I measured a nearly
doubling in the latency of small read and write system calls. There
was no change in server round trip time. The extra latency appears
in the whole RPC execution path.

"git bisect" settled on commit ccede75985 ("xprtrdma: Spread reply
processing over more CPUs") .

After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.

The fix is implemented by reverting only the part of
commit ccede75985 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.

Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for a NFS READ and WRITE
request. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However synchronous RPCs seem to get their
reply dispatched on some other CPU than where the call was placed,
every time.

Fixes: ccede75985 ("xprtrdma: Spread reply processing over ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Satoru Takeuchi
6cd110a91f ktest: Take submenu into account for grub2 menus
grub-reboot selects the submenu's first menuentry (title is "1>0") rather than ktest's
menuentry (title is "2") by mistake.

===
$ sudo cat /boot/grub/grub.cfg  | grep -E "^menuentry|^submenu"
...
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option '...' {
...
submenu 'Advanced options for Ubuntu' $menuentry_id_option '...' {
...
menuentry 'ktest' {
...
===

Correct it by taking submenu entries into account in get_grub2_index().

Link: http://lkml.kernel.org/r/87poaje4as.wl-satoru.takeuchi@gmail.com

Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-10 15:49:14 -04:00
Sinan Kaya
1b30dfd376 PCI: Mark Broadcom HT1100 and HT2000 Root Port Extended Tags as broken
Per PCIe r3.1, sec 2.2.6.2 and 7.8.4, a Requester may not use 8-bit Tags
unless its Extended Tag Field Enable is set, but all Receivers/Completers
must handle 8-bit Tags correctly regardless of their Extended Tag Field
Enable.

Some devices do not handle 8-bit Tags as Completers, so add a quirk for
them.  If we find such a device, we disable Extended Tags for the entire
hierarchy to make peer-to-peer DMA possible.

The Broadcom HT1100/HT2000/HT2100 seems to have issues with handling 8-bit
tags.  Mark it as broken.

This fixes Xorg hangs and unresponsive keyboards with errors like this:

  radeon 0000:06:00.0: GPU lockup (current fence id 0x000000000000000e last fence id 0x0000000000000
  [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
  [drm:r600_resume [radeon]] *ERROR* r600 startup failed on resume

Fixes: 60db3a4d8c ("PCI: Enable PCIe Extended Tags if supported")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196197
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
CC: stable@vger.kernel.org	# v4.11: 62ce94a7a5 PCI: Mark Broadcom HT2100 Root Port Extended Tags as broken
CC: stable@vger.kernel.org	# v4.11
2018-04-10 14:44:21 -05:00
Linus Torvalds
b284d4d5a6 Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
 "The big ticket items are:

   - support for rbd "fancy" striping (myself).

     The striping feature bit is now fully implemented, allowing mapping
     v2 images with non-default striping patterns. This completes
     support for --image-format 2.

   - CephFS quota support (Luis Henriques and Zheng Yan).

     This set is based on the new SnapRealm code in the upcoming v13.y.z
     ("Mimic") release. Quota handling will be rejected on older
     filesystems.

   - memory usage improvements in CephFS (Chengguang Xu).

     Directory specific bits have been split out of ceph_file_info and
     some effort went into improving cap reservation code to avoid OOM
     crashes.

  Also included a bunch of assorted fixes all over the place from
  Chengguang and others"

* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
  ceph: quota: report root dir quota usage in statfs
  ceph: quota: add counter for snaprealms with quota
  ceph: quota: cache inode pointer in ceph_snap_realm
  ceph: fix root quota realm check
  ceph: don't check quota for snap inode
  ceph: quota: update MDS when max_bytes is approaching
  ceph: quota: support for ceph.quota.max_bytes
  ceph: quota: don't allow cross-quota renames
  ceph: quota: support for ceph.quota.max_files
  ceph: quota: add initial infrastructure to support cephfs quotas
  rbd: remove VLA usage
  rbd: fix spelling mistake: "reregisteration" -> "reregistration"
  ceph: rename function drop_leases() to a more descriptive name
  ceph: fix invalid point dereference for error case in mdsc destroy
  ceph: return proper bool type to caller instead of pointer
  ceph: optimize memory usage
  ceph: optimize mds session register
  libceph, ceph: add __init attribution to init funcitons
  ceph: filter out used flags when printing unused open flags
  ceph: don't wait on writeback when there is no more dirty pages
  ...
2018-04-10 12:25:30 -07:00
Linus Torvalds
a7726f6b61 Merge tag 'platform-drivers-x86-v4.17-1' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver updates from Andy Shevchenko:

 - Dell SMBIOS driver fixed against memory leaks.

 - The fujitsu-laptop driver is cleaned up and now supports hotkeys for
   Lifebook U7x7 models. Besides that the typo introduced by one of
   previous clean up series has been fixed.

 - Specific to x86-based laptops HID device now supports
   KEY_ROTATE_LOCK_TOGGLE event which is emitted, for example, by Wacom
   MobileStudio Pro 13.

 - Turbo MAX 3 technology is enabled for the rest of platforms that
   support Hardware-P-States feature which have core priority described
   by ACPI CPPC table.

 - Mellanox on x86 gets better support of I2C bus in use including
   support of hotpluggable ones.

 - Silead touchscreen is enabled on two tablet models, i.e Yours Y8W81
   and I.T.Works TW701.

 - From now on the second fan on Thinkpad P50 is supported.

 - The topstar-laptop driver is reworked to support new models, in
   particular Topstar U931.

* tag 'platform-drivers-x86-v4.17-1' of git://git.infradead.org/linux-platform-drivers-x86: (41 commits)
  platform/x86: thinkpad_acpi: Add 2nd Fan Support for Thinkpad P50
  platform/x86: dell-smbios: Fix memory leaks in build_tokens_sysfs()
  intel-hid: support KEY_ROTATE_LOCK_TOGGLE
  intel-hid: clean up and sort header files
  platform/x86: silead_dmi: Add entry for the Yours Y8W81 tablet
  platform/x86: fujitsu-laptop: Support Lifebook U7x7 hotkeys
  platform/x86: mlx-platform: Add physical bus number auto detection
  platform/mellanox: mlxreg-hotplug: Change input for device create routine
  platform/x86: mlx-platform: Add deffered bus functionality
  platform/x86: mlx-platform: Use define for the channel numbers
  platform/x86: fujitsu-laptop: Revert UNSUPPORTED_CMD back to an int
  platform/x86: Fix dell driver init order
  platform/x86: dell-smbios: Resolve dependency error on ACPI_WMI
  platform/x86: dell-smbios: Resolve dependency error on DCDBAS
  platform/x86: Allow for SMBIOS backend defaults
  platform/x86: dell-smbios: Link all dell-smbios-* modules together
  platform/x86: dell-smbios: Rename dell-smbios source to dell-smbios-base
  platform/x86: dell-smbios: Correct some style warnings
  platform/x86: wmi: Fix misuse of vsprintf extension %pULL
  platform/x86: intel-hid: Reset wakeup capable flag on removal
  ...
2018-04-10 12:18:50 -07:00
Linus Torvalds
1b02dcb9fa Merge tag 'dmaengine-4.17-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
 "This time we have couple of new drivers along with updates to drivers:

   - new drivers for the DesignWare AXI DMAC and MediaTek High-Speed DMA
     controllers

   - stm32 dma and qcom bam dma driver updates

   - norandom test option for dmatest"

* tag 'dmaengine-4.17-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (30 commits)
  dmaengine: stm32-dma: properly mask irq bits
  dmaengine: stm32-dma: fix max items per transfer
  dmaengine: stm32-dma: fix DMA IRQ status handling
  dmaengine: stm32-dma: Improve memory burst management
  dmaengine: stm32-dma: fix typo and reported checkpatch warnings
  dmaengine: stm32-dma: fix incomplete configuration in cyclic mode
  dmaengine: stm32-dma: threshold manages with bitfield feature
  dt-bindings: stm32-dma: introduce DMA features bitfield
  dt-bindings: rcar-dmac: Document r8a77470 support
  dmaengine: rcar-dmac: Fix too early/late system suspend/resume callbacks
  dmaengine: dw-axi-dmac: fix spelling mistake: "catched" -> "caught"
  dmaengine: edma: Check the memory allocation for the memcpy dma device
  dmaengine: at_xdmac: fix rare residue corruption
  dmaengine: mediatek: update MAINTAINERS entry with MediaTek DMA driver
  dmaengine: mediatek: Add MediaTek High-Speed DMA controller for MT7622 and MT7623 SoC
  dt-bindings: dmaengine: Add MediaTek High-Speed DMA controller bindings
  dt-bindings: Document the Synopsys DW AXI DMA bindings
  dmaengine: Introduce DW AXI DMAC driver
  dmaengine: pl330: fix a race condition in case of threaded irqs
  dmaengine: imx-sdma: fix pagefault when channel is disabled during interrupt
  ...
2018-04-10 12:14:37 -07:00
Linus Torvalds
92589cbdda Merge tag 'rproc-v4.17' of git://github.com/andersson/remoteproc
Pull remoteproc updates from Bjorn Andersson:

 - add support for generating coredumps for remoteprocs using
   devcoredump

 - add the Qualcomm sysmon driver for intra-remoteproc crash handling

 - a number of fixes in Qualcomm and IMX drivers

* tag 'rproc-v4.17' of git://github.com/andersson/remoteproc:
  remoteproc: fix null pointer dereference on glink only platforms
  soc: qcom: qmi: add CONFIG_NET dependency
  remoteproc: imx_rproc: Slightly simplify code in 'imx_rproc_probe()'
  remoteproc: imx_rproc: Re-use existing error handling path in 'imx_rproc_probe()'
  remoteproc: imx_rproc: Fix an error handling path in 'imx_rproc_probe()'
  samples: Introduce Qualcomm QMI sample client
  remoteproc: qcom: Introduce sysmon
  remoteproc: Pass type of shutdown to subdev remove
  remoteproc: qcom: Register segments for core dump
  soc: qcom: mdt-loader: Return relocation base
  remoteproc: Rename "load_rsc_table" to "parse_fw"
  remoteproc: Add remote processor coredump support
  remoteproc: Remove null character write of shared mem
2018-04-10 12:09:27 -07:00
Linus Torvalds
9ab89c407d Merge tag 'rpmsg-v4.17' of git://github.com/andersson/remoteproc
Pull rpmsg updates from Bjorn Andersson:

 - transition the rpmsg_trysend() code paths of SMD and GLINK to use
   non-sleeping locks

 - revert the overly optimistic handling of discovered SMD channels

 - fix an issue in SMD where incoming messages race with the probing of
   a client driver

* tag 'rpmsg-v4.17' of git://github.com/andersson/remoteproc:
  rpmsg: smd: Use announce_create to process any receive work
  rpmsg: Only invoke announce_create for rpdev with endpoints
  rpmsg: smd: Fix container_of macros
  Revert "rpmsg: smd: Create device for all channels"
  rpmsg: glink: Use spinlock in tx path
  rpmsg: smd: Use spinlock in tx path
  rpmsg: smd: use put_device() if device_register fail
  rpmsg: glink: use put_device() if device_register fail
2018-04-10 12:04:54 -07:00
Linus Torvalds
f77cfbe645 Merge tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming
Pull c6x updates from Mark Salter.

* tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
  c6x: pass endianness info to sparse
  c6x: fix platforms/plldata.c get_coreid build error
  c6x: remove unused KTHREAD_SIZE definition
2018-04-10 11:50:14 -07:00
Linus Torvalds
948869fa9f Merge tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips
Pull MIPS updates from James Hogan:
 "These are the main MIPS changes for 4.17. Rough overview:

   (1) generic platform: Add support for Microsemi Ocelot SoCs

   (2) crypto: Add CRC32 and CRC32C HW acceleration module

   (3) Various cleanups and misc improvements

  More detailed summary:

  Miscellaneous:
   - hang more efficiently on halt/powerdown/restart
   - pm-cps: Block system suspend when a JTAG probe is present
   - expand make help text for generic defconfigs
   - refactor handling of legacy defconfigs
   - determine the entry point from the ELF file header to fix microMIPS
     for certain toolchains
   - introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code

  Minor cleanups:
   - DTS: boston/ci20: Unit name cleanups and correction
   - kdump: Make the default for PHYSICAL_START always 64-bit
   - constify gpio_led in Alchemy, AR7, and TXX9
   - silence a couple of W=1 warnings
   - remove duplicate includes

  Platform support:
  Generic platform:
   - add support for Microsemi Ocelot
   - dt-bindings: Add vendor prefix for Microsemi Corporation
   - dt-bindings: Add bindings for Microsemi SoCs
   - add ocelot SoC & PCB123 board DTS files
   - MAINTAINERS: Add entry for Microsemi MIPS SoCs
   - enable crc32-mips on r6 configs

  ath79:
   - fix AR724X_PLL_REG_PCIE_CONFIG offset

  BCM47xx:
   - firmware: Use mac_pton() for MAC address parsing
   - add Luxul XAP1500/XWR1750 WiFi LEDs
   - use standard reset button for Luxul XWR-1750

  BMIPS:
   - enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
   - add STB PM, wake-up timer, watchdog DT nodes

  Octeon:
   - drop '.' after newlines in printk calls

  ralink:
   - pci-mt7621: Enable PCIe on MT7688"

* tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (37 commits)
  MIPS: BCM47XX: Use standard reset button for Luxul XWR-1750
  MIPS: BCM47XX: Add Luxul XAP1500/XWR1750 WiFi LEDs
  MIPS: Make the default for PHYSICAL_START always 64-bit
  MIPS: Use the entry point from the ELF file header
  MAINTAINERS: Add entry for Microsemi MIPS SoCs
  MIPS: generic: Add support for Microsemi Ocelot
  MIPS: mscc: Add ocelot PCB123 device tree
  MIPS: mscc: Add ocelot dtsi
  dt-bindings: mips: Add bindings for Microsemi SoCs
  dt-bindings: Add vendor prefix for Microsemi Corporation
  MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
  MIPS: pci-mt7620: Enable PCIe on MT7688
  MIPS: pm-cps: Block system suspend when a JTAG probe is present
  MIPS: VDSO: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: BPF: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: Introduce isa-rev.h to define MIPS_ISA_REV
  MIPS: Hang more efficiently on halt/powerdown/restart
  FIRMWARE: bcm47xx_nvram: Replace mac address parsing
  MIPS: BMIPS: Add Broadcom STB watchdog nodes
  ...
2018-04-10 11:39:22 -07:00
Linus Torvalds
2a56bb596b Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
 "New features:

   - Tom Zanussi's extended histogram work.

     This adds the synthetic events to have histograms from multiple
     event data Adds triggers "onmatch" and "onmax" to call the
     synthetic events Several updates to the histogram code from this

   - Allow way to nest ring buffer calls in the same context

   - Allow absolute time stamps in ring buffer

   - Rewrite of filter code parsing based on Al Viro's suggestions

   - Setting of trace_clock to global if TSC is unstable (on boot)

   - Better OOM handling when allocating large ring buffers

   - Added initcall tracepoints (consolidated initcall_debug code with
     them)

  And other various fixes and clean ups"

* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
  init: Have initcall_debug still work without CONFIG_TRACEPOINTS
  init, tracing: Have printk come through the trace events for initcall_debug
  init, tracing: instrument security and console initcall trace events
  init, tracing: Add initcall trace events
  tracing: Add rcu dereference annotation for test func that touches filter->prog
  tracing: Add rcu dereference annotation for filter->prog
  tracing: Fixup logic inversion on setting trace_global_clock defaults
  tracing: Hide global trace clock from lockdep
  ring-buffer: Add set/clear_current_oom_origin() during allocations
  ring-buffer: Check if memory is available before allocation
  lockdep: Add print_irqtrace_events() to __warn
  vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
  tracing: Uninitialized variable in create_tracing_map_fields()
  tracing: Make sure variable string fields are NULL-terminated
  tracing: Add action comparisons when testing matching hist triggers
  tracing: Don't add flag strings when displaying variable references
  tracing: Fix display of hist trigger expressions containing timestamps
  ftrace: Drop a VLA in module_exists()
  tracing: Mention trace_clock=global when warning about unstable clocks
  tracing: Default to using trace_global_clock if sched_clock is unstable
  ...
2018-04-10 11:27:30 -07:00
Jonathan Helman
6c64fe7f2a virtio_balloon: export hugetlb page allocation counts
Export the number of successful and failed hugetlb page
allocations via the virtio balloon driver. These 2 counts
come directly from the vm_events HTLB_BUDDY_PGALLOC and
HTLB_BUDDY_PGALLOC_FAIL.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
2018-04-10 21:23:55 +03:00
Masami Hiramatsu
50268a3d26 tracing/uprobe_event: Fix strncpy corner case
Fix string fetch function to terminate with NUL.
It is OK to drop the rest of string.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: security@kernel.org
Cc: 范龙飞 <long7573@126.com>
Fixes: 5baaa59ef0 ("tracing/probes: Implement 'memory' fetch method for uprobes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-10 20:14:11 +02:00
Linus Torvalds
9f3a0941fb Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Linus Torvalds
fbe173e3ff Merge tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
 "This contains a few series that have been in preparation for a while
  and that will help systems with RTCs that will fail in 2038, 2069 or
  2100.

  Subsystem:
   - Add tracepoints
   - Rework of the RTC/nvmem API to allow drivers to discard struct
     nvmem_config after registration
   - New range API, drivers can now expose the useful range of the RTC
   - New offset API the core is now able to add an offset to the RTC
     time, modifying the supported range.
   - Multiple rtc_time64_to_tm fixes
   - Handle time_t overflow on 32 bit platforms in the core instead of
     letting drivers do crazy things.
   - remove rtc_control API

  New driver:
   - Intersil ISL12026

  Drivers:
   - Drivers exposing the RTC non volatile memory have been converted to
     use nvmem
   - Removed useless time and date validation
   - Removed an indirection pattern that was a cargo cult from ancient
     drivers
   - Removed VLA usage
   - Fixed a possible race condition in probe functions
   - AB8540 support is dropped from ab8500
   - pcf85363 now has alarm support"

* tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (128 commits)
  rtc: snvs: Fix usage of snvs_rtc_enable
  rtc: mt7622: fix module autoloading for OF platform drivers
  rtc: isl12022: use true and false for boolean values
  rtc: ab8500: Drop AB8540 support
  rtc: remove a warning during scripts/kernel-doc step
  rtc: 88pm860x: remove artificial limitation
  rtc: 88pm80x: remove artificial limitation
  rtc: st-lpc: remove artificial limitation
  rtc: mrst: remove artificial limitation
  rtc: mv: remove artificial limitation
  rtc: hctosys: Ensure system time doesn't overflow time_t
  parisc: time: stop validating rtc_time in .read_time
  rtc: pcf85063: fix clearing bits in pcf85063_start_clock
  rtc: at91sam9: Set name of regmap_config
  rtc: s5m: Remove VLA usage
  rtc: s5m: Move enum from rtc.h to rtc-s5m.c
  rtc: remove VLA usage
  rtc: Add useful timestamp definitions
  rtc: Add one offset seconds to expand RTC range
  rtc: Factor out the RTC range validation into rtc_valid_range()
  ...
2018-04-10 10:22:27 -07:00
Linus Torvalds
5e630afdcb Merge tag 'fbdev-v4.17' of git://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
 "There is nothing really major here, just a couple of small bugfixes,
  improvements and cleanups:

   - make it possible to load radeonfb driver when offb driver is loaded
     first (Mathieu Malaterre)

   - fix memory leak in offb driver (Mathieu Malaterre)

   - fix unaligned access in udlfb driver (Ladislav Michl)

   - convert atmel_lcdfb driver to use GPIO descriptors (Ludovic
     Desroches)

   - avoid mismatched prototypes in sisfb driver (Arnd Bergmann)

   - remove VLA usage from viafb driver (Gustavo A. R. Silva)

   - add missing help text to FB_I810_I2 config option (Ulf Magnusson)

   - misc fixes (Gustavo A. R. Silva, Colin Ian King, Markus Elfring)

   - remove dead code from s3c-fb driver for Exynos and S5PV210
     platforms

   - misc cleanups (Corentin Labbe, Ladislav Michl, Ulf Magnusson,
     Vladimir Zapolskiy, Markus Elfring)"

* tag 'fbdev-v4.17' of git://github.com/bzolnier/linux: (32 commits)
  video: fbdev: s3c-fb: remove dead platform code for Exynos and S5PV210 platforms
  video: au1100fb: Delete an unnecessary variable initialisation in au1100fb_drv_probe()
  video: au1100fb: Improve a size determination in au1100fb_drv_probe()
  video: au1100fb: Delete an error message for a failed memory allocation in au1100fb_drv_probe()
  video/console/sticore: Delete an error message for a failed memory allocation in sti_try_rom_generic()
  video: ARM CLCD: Improve a size determination in clcdfb_probe()
  video: ARM CLCD: Delete an error message for a failed memory allocation in clcdfb_probe()
  video: matroxfb: Delete an error message for a failed memory allocation in matroxfb_crtc2_probe()
  video: s3c-fb: Improve a size determination in s3c_fb_probe()
  video: s3c-fb: Delete an error message for a failed memory allocation in s3c_fb_probe()
  video: fsl-diu-fb: Delete an error message for a failed memory allocation in fsl_diu_init()
  video: ssd1307fb: Improve a size determination in ssd1307fb_probe()
  video: smscufx: Delete an error message for a failed memory allocation in ufx_realloc_framebuffer()
  video: smscufx: Return an error code only as a constant in ufx_realloc_framebuffer()
  video: smscufx: Less checks in ufx_usb_probe() after error detection
  video: udlfb: Return an error code only as a constant in dlfb_realloc_framebuffer()
  video/fbdev/stifb: Delete an error message for a failed memory allocation in stifb_init_fb()
  video/fbdev/stifb: Return -ENOMEM after a failed kzalloc() in stifb_init_fb()
  video: fbdev: aty128fb: use true and false for boolean values
  fbdev: aty: fix missing indentation in if statement
  ...
2018-04-10 10:20:00 -07:00
Linus Torvalds
7aa1cf254c Merge tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "The main purpose of this pull request is a fix for a regression in the
  recent PCM OSS emulation code that may lead to RCU stall. Since
  syzkaller hits this too often, I send the pull request now with a
  minimal collection. Possibly another pull request may follow before
  RC1.

  The other fixes here are for USB-audio class 2 and 3 to improve the
  parser for the clock descriptors. These are rather cleanups but good
  for security, too.

  Last but not least, another included fix is the trivial one to remove
  superfluous WARN_ON() that annoyed syzbot"

* tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() error
  ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
  ALSA: usb-audio: Add sanity checks in UAC3 clock parsers
  ALSA: usb-audio: More strict sanity checks for clock parsers
  ALSA: usb-audio: Refactor clock finder helpers
2018-04-10 10:16:04 -07:00
Linus Torvalds
d36260050e Merge tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
 "A series of media updates/fixes for 4.17.

  There are two important core fix patches in this series:

   - A regression fix on Kernel 4.16 with causes it to not work with
     some input devices that depend on media core

   - A fix at compat32 bits with causes it to OOPS on overlay, and
     affects the Kernels where the CVE-2017-13166 was backported

  The remaining ones are other random fixes at the documentation and on
  drivers.

  The biggest part of this series is a set of 18 patches for the Intel
  atomisp driver. Currently, it produces hundreds of warnings/errors on
  sparse/smatch, causing me to sometimes ignore new warnings on other
  drivers that are not so broken. This driver is on really poor state,
  even for staging standards: it has several layers of abstraction on
  it, and it supports two different hardware. Selecting between them
  require to add a define (there isn't even a Kconfig option for such
  purpose). Just on this smatch cleanup, I could easily get rid of 8
  "do-nothing" files. So, I'm seriously considering its removal from
  upstream, if I don't see any real work on addressing the problems
  there along this year"

* tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (48 commits)
  media: v4l2-core: fix size of devnode_nums[] bitarray
  media: v4l2-compat-ioctl32: don't oops on overlay
  media: i2c: adv748x: afe: fix sparse warning
  media: extended-controls.rst: transmitter -> receiver
  media: staging: atomisp: stop duplicating input format types
  media: staging: atomisp: get rid of an unused var
  media: staging: atomisp: stop mixing enum types
  media: staging: atomisp: get rid of some static warnings
  media: staging: atomisp: use %p to print pointers
  media: staging: atomisp: remove an useless check
  media: staging: atomisp: avoid a warning if 32 bits build
  media: staging: atomisp: don't access a NULL var
  media: staging: atomisp: Get rid of *default.host.[ch]
  media: staging: atomisp: get rid of an unused function
  media: staging: atomisp: remove unused set_pd_base()
  media: staging: atomisp: fix endianess issues
  media: staging: atomisp: add a missing include
  media: staging: atomisp: get rid of stupid statements
  media: staging: atomisp: declare static vars as such
  media: staging: atomisp: ia_css_output.host: don't use var before check
  ...
2018-04-10 10:10:30 -07:00
Darrick J. Wong
4919d42ab6 xfs: only cancel cow blocks when truncating the data fork
In xfs_itruncate_extents, only cancel cow blocks and clear the reflink
flag if we were asked to truncate the data fork.  Attr fork blocks
cannot be shared, so this makes no sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-04-10 08:28:33 -07:00
Colin Ian King
4d5f26ee31 kvm: selftests: fix spelling mistake: "divisable" and "divisible"
Trivial fix to spelling mistakes in comment and message text

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-10 17:20:03 +02:00
KarimAllah Ahmed
386c6ddbda X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted
The VMX-preemption timer is used by KVM as a way to set deadlines for the
guest (i.e. timer emulation). That was safe till very recently when
capability KVM_X86_DISABLE_EXITS_MWAIT to disable intercepting MWAIT was
introduced. According to Intel SDM 25.5.1:

"""
The VMX-preemption timer operates in the C-states C0, C1, and C2; it also
operates in the shutdown and wait-for-SIPI states. If the timer counts down
to zero in any state other than the wait-for SIPI state, the logical
processor transitions to the C0 C-state and causes a VM exit; the timer
does not cause a VM exit if it counts down to zero in the wait-for-SIPI
state. The timer is not decremented in C-states deeper than C2.
"""

Now once the guest issues the MWAIT with a c-state deeper than
C2 the preemption timer will never wake it up again since it stopped
ticking! Usually this is compensated by other activities in the system that
would wake the core from the deep C-state (and cause a VMExit). For
example, if the host itself is ticking or it received interrupts, etc!

So disable the VMX-preemption timer if MWAIT is exposed to the guest!

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Fixes: 4d5422cea3
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-10 17:19:44 +02:00
Sabrina Dubroca
1cc5954f44 ip_gre: clear feature flags when incompatible o_flags are set
Commit dd9d598c66 ("ip_gre: add the support for i/o_flags update via
netlink") added the ability to change o_flags, but missed that the
GSO/LLTX features are disabled by default, and only enabled some gre
features are unused. Thus we also need to disable the GSO/LLTX features
on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.

These two examples should result in the same features being set:

    ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0

    ip link set gre_none type gre seq
    ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq

Fixes: dd9d598c66 ("ip_gre: add the support for i/o_flags update via netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-10 11:03:32 -04:00
Rasmus Villemoes
9564a8cf42 Kbuild: fix # escaping in .cmd files for future Make
I tried building using a freshly built Make (4.2.1-69-g8a731d1), but
already the objtool build broke with

orc_dump.c: In function ‘orc_dump’:
orc_dump.c:106:2: error: ‘elf_getshnum’ is deprecated [-Werror=deprecated-declarations]
  if (elf_getshdrnum(elf, &nr_sections)) {

Turns out that with that new Make, the backslash was not removed, so cpp
didn't see a #include directive, grep found nothing, and
-DLIBELF_USE_DEPRECATED was wrongly put in CFLAGS.

Now, that new Make behaviour is documented in their NEWS file:

  * WARNING: Backward-incompatibility!
    Number signs (#) appearing inside a macro reference or function invocation
    no longer introduce comments and should not be escaped with backslashes:
    thus a call such as:
      foo := $(shell echo '#')
    is legal.  Previously the number sign needed to be escaped, for example:
      foo := $(shell echo '\#')
    Now this latter will resolve to "\#".  If you want to write makefiles
    portable to both versions, assign the number sign to a variable:
      C := \#
      foo := $(shell echo '$C')
    This was claimed to be fixed in 3.81, but wasn't, for some reason.
    To detect this change search for 'nocomment' in the .FEATURES variable.

This also fixes up the two make-cmd instances to replace # with $(pound)
rather than with \#. There might very well be other places that need
similar fixup in preparation for whatever future Make release contains
the above change, but at least this builds an x86_64 defconfig with the
new make.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=197847
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-04-11 00:03:02 +09:00
Li RongQing
a774635db5 x86/apic: Fix signedness bug in APIC ID validity checks
The APIC ID as parsed from ACPI MADT is validity checked with the
apic->apic_id_valid() callback, which depends on the selected APIC type.

For non X2APIC types APIC IDs >= 0xFF are invalid, but values > 0x7FFFFFFF
are detected as valid. This happens because the 'apicid' argument of the
apic_id_valid() callback is type 'int'. So the resulting comparison

   apicid < 0xFF

evaluates to true for all unsigned int values > 0x7FFFFFFF which are handed
to default_apic_id_valid(). As a consequence, invalid APIC IDs in !X2APIC
mode are considered valid and accounted as possible CPUs.

Change the apicid argument type of the apic_id_valid() callback to u32 so
the evaluation is unsigned and returns the correct result.

[ tglx: Massaged changelog ]

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: jgross@suse.com
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/1523322966-10296-1-git-send-email-lirongqing@baidu.com
2018-04-10 16:46:39 +02:00
Neil Armstrong
3c9f2157a2 MAINTAINERS: Migrate oxnas list to groups.io
The linux-oxnas migrates from tuxfamily to groups.io for a simpler
administration and maintainance.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-10 16:41:22 +02:00
Tomer Maimon
d893c4de01 arm: npcm: enable L2 cache in NPCM7xx architecture
This patch Enable ARM L2 cache module in Nuvoton NPCM7xx BMC
by adding L2 cache parameters into NPCM7xx DT machine start structure.

At patch V7 arm: npcm: add basic support for Nuvoton BMCs we got comments
regarding the flags use in L2 cache module.
- https://www.spinics.net/lists/arm-kernel/msg613212.html

After checking again the L2 cache use in the NPCM7xx,
the only L2 cache flag we need to set is L2C_AUX_CTRL_SHARED_OVERRIDE
and it is done in the device tree:
https://patchwork.kernel.org/patch/10063497/

L2 cache flag mask allowed all the flag option.

Signed-off-by: Tomer Maimon <tmaimon77@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-10 16:40:03 +02:00
Mathieu Malaterre
a93f00b376 backing: silence compiler warning using __printf
__printf marker was added in commit d2cc4dde92 ("bdi_register: add
__printf verification, fix arg mismatch") for function `bdi_register`
since it is useful to verify format and arguments. Apply equivalent gcc
attribute to `bdi_register_va`.

Remove warning triggered with W=1:

  mm/backing-dev.c:881:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
37c7c6c76d blk-mq: remove code for dealing with remapping queue
Firstly, from commit 4b855ad371 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue any more after CPU topo is changed.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map
all possible CPUs to hw queues, so at least one CPU is mapped to each hctx.

So queue mapping has became static and fixed just like percpu variable, and
we don't need to handle queue remapping any more.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
127276c6ce blk-mq: reimplement blk_mq_hw_queue_mapped
Now the actual meaning of queue mapped is that if there is any online
CPU mapped to this hctx, so implement blk_mq_hw_queue_mapped() in this
way.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
efea8450c3 blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() returns true always now since each hctx
may be mapped by one CPU at least

2) when there isn't any online CPU mapped to this hctx, there won't
be any IO queued to this CPU, blk_mq_run_hw_queue() only runs queue
if there is IO queued to this hctx

3) If __blk_mq_delay_run_hw_queue() is called by blk_mq_delay_run_hw_queue(),
which is run from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(), and
the hctx to be handled has to be mapped.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
15fe8a90bb blk-mq: remove blk_mq_delay_queue()
No driver uses this interface any more, so remove it.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
f82ddf1923 blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
This patch introduces helper of blk_mq_hw_queue_first_cpu() for
figuring out the hctx's first cpu, and code duplication can be
avoided.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
476f8c98a9 blk-mq: avoid to write intermediate result to hctx->next_cpu
This patch figures out the final selected CPU, then writes
it to hctx->next_cpu once, then we can avoid to intermediate
next cpu observed from other dispatch paths.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
bffa9909a6 blk-mq: don't keep offline CPUs mapped to hctx 0
From commit 4b855ad371 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue after CPU topo is changed, that said when
some of these offline CPUs become online, they are still mapped to
hctx 0, then hctx 0 may become the bottleneck of IO dispatch and
completion.

This patch sets up the mapping from the beginning, and aligns to
queue mapping for PCI device (blk_mq_pci_map_queues()).

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: stable@vger.kernel.org
Fixes: 4b855ad371 ("blk-mq: Create hctx for each present CPU)
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
a1c735fb79 blk-mq: make sure that correct hctx->next_cpu is set
From commit 20e4d81393 (blk-mq: simplify queue mapping & schedule
with each possisble CPU), one hctx can be mapped from all offline CPUs,
then hctx->next_cpu can be set as wrong.

This patch fixes this issue by making hctx->next_cpu pointing to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Omar Sandoval
bdac616db9 loop: fix LOOP_GET_STATUS lock imbalance
Commit 2d1d4c1e59 made loop_get_status() drop lo_ctx_mutex before
returning, but the loop_get_status_old(), loop_get_status64(), and
loop_get_status_compat() wrappers don't call loop_get_status() if the
passed argument is NULL. The callers expect that the lock is dropped, so
make sure we drop it in that case, too.

Reported-by: syzbot+31e8daa8b3fc129e75f2@syzkaller.appspotmail.com
Fixes: 2d1d4c1e59 ("loop: don't call into filesystem while holding lo_ctl_mutex")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Tetsuo Handa
1e047eaab3 block/loop: fix deadlock after loop_set_status
syzbot is reporting deadlocks at __blkdev_get() [1].

----------------------------------------
[   92.493919] systemd-udevd   D12696   525      1 0x00000000
[   92.495891] Call Trace:
[   92.501560]  schedule+0x23/0x80
[   92.502923]  schedule_preempt_disabled+0x5/0x10
[   92.504645]  __mutex_lock+0x416/0x9e0
[   92.510760]  __blkdev_get+0x73/0x4f0
[   92.512220]  blkdev_get+0x12e/0x390
[   92.518151]  do_dentry_open+0x1c3/0x2f0
[   92.519815]  path_openat+0x5d9/0xdc0
[   92.521437]  do_filp_open+0x7d/0xf0
[   92.527365]  do_sys_open+0x1b8/0x250
[   92.528831]  do_syscall_64+0x6e/0x270
[   92.530341]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[   92.931922] 1 lock held by systemd-udevd/525:
[   92.933642]  #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x73/0x4f0
----------------------------------------

The reason of deadlock turned out that wait_event_interruptible() in
blk_queue_enter() got stuck with bdev->bd_mutex held at __blkdev_put()
due to q->mq_freeze_depth == 1.

----------------------------------------
[   92.787172] a.out           S12584   634    633 0x80000002
[   92.789120] Call Trace:
[   92.796693]  schedule+0x23/0x80
[   92.797994]  blk_queue_enter+0x3cb/0x540
[   92.803272]  generic_make_request+0xf0/0x3d0
[   92.807970]  submit_bio+0x67/0x130
[   92.810928]  submit_bh_wbc+0x15e/0x190
[   92.812461]  __block_write_full_page+0x218/0x460
[   92.815792]  __writepage+0x11/0x50
[   92.817209]  write_cache_pages+0x1ae/0x3d0
[   92.825585]  generic_writepages+0x5a/0x90
[   92.831865]  do_writepages+0x43/0xd0
[   92.836972]  __filemap_fdatawrite_range+0xc1/0x100
[   92.838788]  filemap_write_and_wait+0x24/0x70
[   92.840491]  __blkdev_put+0x69/0x1e0
[   92.841949]  blkdev_close+0x16/0x20
[   92.843418]  __fput+0xda/0x1f0
[   92.844740]  task_work_run+0x87/0xb0
[   92.846215]  do_exit+0x2f5/0xba0
[   92.850528]  do_group_exit+0x34/0xb0
[   92.852018]  SyS_exit_group+0xb/0x10
[   92.853449]  do_syscall_64+0x6e/0x270
[   92.854944]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[   92.943530] 1 lock held by a.out/634:
[   92.945105]  #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x3c/0x1e0
----------------------------------------

The reason of q->mq_freeze_depth == 1 turned out that loop_set_status()
forgot to call blk_mq_unfreeze_queue() at error paths for
info->lo_encrypt_type != NULL case.

----------------------------------------
[   37.509497] CPU: 2 PID: 634 Comm: a.out Tainted: G        W        4.16.0+ #457
[   37.513608] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[   37.518832] RIP: 0010:blk_freeze_queue_start+0x17/0x40
[   37.521778] RSP: 0018:ffffb0c2013e7c60 EFLAGS: 00010246
[   37.524078] RAX: 0000000000000000 RBX: ffff8b07b1519798 RCX: 0000000000000000
[   37.527015] RDX: 0000000000000002 RSI: ffffb0c2013e7cc0 RDI: ffff8b07b1519798
[   37.529934] RBP: ffffb0c2013e7cc0 R08: 0000000000000008 R09: 47a189966239b898
[   37.532684] R10: dad78b99b278552f R11: 9332dca72259d5ef R12: ffff8b07acd73678
[   37.535452] R13: 0000000000004c04 R14: 0000000000000000 R15: ffff8b07b841e940
[   37.538186] FS:  00007fede33b9740(0000) GS:ffff8b07b8e80000(0000) knlGS:0000000000000000
[   37.541168] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   37.543590] CR2: 00000000206fdf18 CR3: 0000000130b30006 CR4: 00000000000606e0
[   37.546410] Call Trace:
[   37.547902]  blk_freeze_queue+0x9/0x30
[   37.549968]  loop_set_status+0x67/0x3c0 [loop]
[   37.549975]  loop_set_status64+0x3b/0x70 [loop]
[   37.549986]  lo_ioctl+0x223/0x810 [loop]
[   37.549995]  blkdev_ioctl+0x572/0x980
[   37.550003]  block_ioctl+0x34/0x40
[   37.550006]  do_vfs_ioctl+0xa7/0x6d0
[   37.550017]  ksys_ioctl+0x6b/0x80
[   37.573076]  SyS_ioctl+0x5/0x10
[   37.574831]  do_syscall_64+0x6e/0x270
[   37.576769]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
----------------------------------------

[1] https://syzkaller.appspot.com/bug?id=cd662bc3f6022c0979d01a262c318fab2ee9b56f

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <bot+48594378e9851eab70bcd6f99327c7db58c5a28a@syzkaller.appspotmail.com>
Fixes: ecdd09597a ("block/loop: fix race between I/O and set_status")
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei
0bca799b92 blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:

1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.

2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.

3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B

4) both two IO pathes can't move on, and IO stall is caused.

This issue can be observed when running dbench on USB storage.

This patch fixes this issue by always getting budget before getting
driver tag.

Cc: stable@vger.kernel.org
Fixes: de14829740 ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Sinan Kaya
a71e7c44ff io: change writeX_relaxed() to remove barriers
Now that we hardened writeX() API in asm-generic version, writeX_relaxed()
API is violating the rules when writeX_relaxed() == writeX() in the default
implementation.

The relaxed API shouldn't have any barriers in it and it doesn't provide
any ordering with respect to the memory transactions. The only requirement
is for writes to be ordered with respect to each other. This is achieved
by the volatile in the __raw_writeX() API.

Open code the relaxed API and remove any barriers in it.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-10 16:37:34 +02:00