Commit Graph

886869 Commits

Ursula Braun
50c6b20eff net/smc: fix final cleanup sequence for SMCD devices
If the peer announces shutdown, use the link group terminate worker for
local cleanup of link groups and connections, so that the link group is
terminated in the proper context.

Make sure link groups are cleaned up before destroying the event queue
of the SMCD device, because link group cleanup may raise events.

Signal shutdown only if the peer has not already done so.

Send a socket abort or close only if the peer has not already announced
shutdown.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:28:28 -08:00
David S. Miller
43da44c876 Merge branch 'net-stmmac-CPU-Performance-Improvements'
Jose Abreu says:

====================
net: stmmac: CPU Performance Improvements

CPU performance improvements for stmmac. Please check below for results
before and after the series.

Patch 1/7 allows the RX Interrupt on Completion to be disabled, so that
only the RX HW Watchdog is used.

Patch 2/7 sets up default RX coalesce settings instead of using the
minimum value.

Patches 3/7 and 4/7 remove unneeded computations for RX Flow Control
activation/de-activation in some cases.

Patch 5/7 tunes up the default coalesce settings.

Patch 6/7 reworks the TX coalesce timer activation logic.

Patch 7/7 removes the now unneeded TBU interrupt.

NetPerf UDP Results:
--------------------

Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB
--- XGMAC@2.5G: Before
212992    1400   10.00     2100620      0     2351.7     36.69    5.112
212992           10.00     2100539            2351.6     26.18    3.648
--- XGMAC@2.5G: After
212992    1400   10.00     2108972      0     2361.5     21.73    3.015
212992           10.00     2097038            2348.1     19.21    2.666

--- GMAC5@1G: Before
212992    1400   10.00      786000      0      880.2     34.71    12.923
212992           10.00      786000             880.2     23.42    8.719
--- GMAC5@1G: After
212992    1400   10.00      842648      0      943.7     14.12    4.903
212992           10.00      842648             943.7     12.73    4.418

Perf TCP Results on RX Path:
----------------------------
--- XGMAC@2.5G: Before
22.51%  swapper          [stmmac]           [k] dwxgmac2_dma_interrupt
10.82%  swapper          [stmmac]           [k] dwxgmac2_host_mtl_irq_status
 5.21%  swapper          [stmmac]           [k] dwxgmac2_host_irq_status
 4.67%  swapper          [stmmac]           [k] dwxgmac3_safety_feat_irq_status
 3.63%  swapper          [kernel.kallsyms]  [k] stack_trace_consume_entry
 2.74%  iperf3           [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 2.52%  swapper          [kernel.kallsyms]  [k] update_stack_state
 1.94%  ksoftirqd/0      [stmmac]           [k] dwxgmac2_dma_interrupt
 1.45%  iperf3           [kernel.kallsyms]  [k] queued_spin_lock_slowpath
 1.26%  swapper          [kernel.kallsyms]  [k] create_object
--- XGMAC@2.5G: After
 7.43%  swapper          [kernel.kallsyms]   [k] stack_trace_consume_entry
 5.86%  swapper          [stmmac]            [k] dwxgmac2_dma_interrupt
 5.68%  swapper          [kernel.kallsyms]   [k] update_stack_state
 4.71%  iperf3           [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 2.88%  swapper          [kernel.kallsyms]   [k] create_object
 2.69%  swapper          [stmmac]            [k] dwxgmac2_host_mtl_irq_status
 2.61%  swapper          [stmmac]            [k] stmmac_napi_poll_rx
 2.52%  swapper          [kernel.kallsyms]   [k] unwind_next_frame.part.4
 1.48%  swapper          [kernel.kallsyms]   [k] unwind_get_return_address
 1.38%  swapper          [kernel.kallsyms]   [k] arch_stack_walk

--- GMAC5@1G: Before
31.29%  swapper          [stmmac]           [k] dwmac4_dma_interrupt
14.57%  swapper          [stmmac]           [k] dwmac4_irq_mtl_status
10.66%  swapper          [stmmac]           [k] dwmac4_irq_status
 1.97%  swapper          [kernel.kallsyms]  [k] stack_trace_consume_entry
 1.73%  iperf3           [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 1.59%  swapper          [kernel.kallsyms]  [k] update_stack_state
 1.15%  iperf3           [kernel.kallsyms]  [k] do_syscall_64
 1.01%  ksoftirqd/0      [stmmac]           [k] dwmac4_dma_interrupt
 0.89%  swapper          [kernel.kallsyms]  [k] __default_send_IPI_dest_field
 0.75%  swapper          [stmmac]           [k] stmmac_napi_poll_rx
--- GMAC5@1G: After
 6.70%  swapper          [kernel.kallsyms]   [k] stack_trace_consume_entry
 5.79%  swapper          [stmmac]            [k] dwmac4_dma_interrupt
 5.29%  swapper          [kernel.kallsyms]   [k] update_stack_state
 3.52%  iperf3           [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 2.83%  swapper          [stmmac]            [k] dwmac4_irq_mtl_status
 2.62%  swapper          [kernel.kallsyms]   [k] create_object
 2.46%  swapper          [stmmac]            [k] stmmac_napi_poll_rx
 2.32%  swapper          [kernel.kallsyms]   [k] unwind_next_frame.part.4
 2.19%  swapper          [stmmac]            [k] dwmac4_irq_status
 1.39%  swapper          [kernel.kallsyms]   [k] unwind_get_return_address
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:42 -08:00
Jose Abreu
8d07a79304 net: stmmac: xgmac: Do not enable TBU interrupt
Now that TX coalescing has been rewritten, we no longer need this
additional interrupt enabled. This reduces CPU usage.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
c2837423cb net: stmmac: Rework TX Coalesce logic
The coalesce logic currently increments the packet count and sets the
IC bit once the coalesced packets have passed a given limit. This does
not reflect what coalescing was meant for, as we can have a large
number of coalesced packets followed by a single one, sent later on,
that carries the IC bit.

Rework the logic so that it coalesces only up to a limit of packets and
sets the IC bit for large numbers of packets.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
da20245100 net: stmmac: Tune-up default coalesce settings
Tune up the default coalesce settings to optimal values. This gives the
best performance in most of the use cases.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
52f96cd135 net: stmmac: xgmac: Remove unneeded computation for RFA/RFD
RFA and RFD should not be dependent on FIFO size. In fact, the more FIFO
space we have, the later we can activate Flow Control. Let's use
hard-coded values for RFA and RFD for all FIFO sizes with the exception
of 4k, which is a special case.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
854248e5ec net: stmmac: gmac4+: Remove unneeded computation for RFA/RFD
RFA and RFD should not be dependent on FIFO size. In fact, the more FIFO
space we have, the later we can activate Flow Control. Let's use
hard-coded values for RFA and RFD for all FIFO sizes with the exception
of 4k, which is a special case.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
4e4337ccf7 net: stmmac: Setup a default RX Coalesce value instead of the minimum
For performance reasons, using the minimum RX coalesce value is
sometimes not optimal. Let's set up a default value that is optimal in
most of the use cases.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Jose Abreu
09146abebc net: stmmac: Do not set RX IC bit if RX Coalesce is zero
We may want to use only the RX Watchdog, so check whether the RX
coalesce setting is non-zero and set the RX Interrupt on Completion bit
only when it is.

Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:25:41 -08:00
Ido Schimmel
983db6198f mlxsw: spectrum_router: Allocate discard adjacency entry when needed
Commit 0c3cbbf96d ("mlxsw: Add specific trap for packets routed via
invalid nexthops") allocated an adjacency entry during driver
initialization whose purpose is to discard packets hitting the route
pointing to it.

These adjacency entries are allocated from a resource called KVD linear
(KVDL). There are situations in which the user can decide to set the
size of this resource (via devlink-resource) to 0, in which case the
driver will not be able to load.

Therefore, instead of pre-allocating this adjacency entry, simply
allocate it only when needed. A variable indicating the validity of the
entry is added and is used to ensure it is allocated and written only
once and that it is freed after all the routes have been flushed.

Fixes: 0c3cbbf96d ("mlxsw: Add specific trap for packets routed via invalid nexthops")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:24:54 -08:00
Oliver Neukum
a9a51bd727 ax88172a: fix information leak on short answers
If a malicious device returns a short MAC address, it can elicit up to
5 bytes of leaked memory from the driver. We need to check the returned
length against ETH_ALEN instead.
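
A minimal sketch of the tightened check, assuming the MAC is read via
asix_read_cmd() into a local buffer (surrounding error handling
illustrative):

        ret = asix_read_cmd(dev, AX_CMD_READ_NODE_ID, 0, 0,
                            ETH_ALEN, buf, 0);
        if (ret < ETH_ALEN) {
                /* Short or failed reply: do not copy a partial MAC that
                 * would leak uninitialized driver memory. */
                return ret < 0 ? ret : -EIO;
        }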

Reported-by: syzbot+a8d4acdad35e6bbca308@syzkaller.appspotmail.com
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:18:45 -08:00
Ido Schimmel
4d189c1026 selftests: mlxsw: Adjust test to recent changes
mlxsw does not support VXLAN devices with a physical device attached and
vetoes such configurations upon enslavement to an offloaded bridge.

Commit 0ce1822c2a ("vxlan: add adjacent link to limit depth level")
changed the VXLAN device to be an upper of the physical device which
causes mlxsw to veto the creation of the VXLAN device with "Unknown
upper device type".

This is OK as this configuration is not supported, but it prevents us
from testing bad flows involving the enslavement of VXLAN devices with a
physical device to a bridge, regardless of whether the physical device
is an mlxsw netdev or not.

Adjust the test to use a dummy device as a physical device instead of a
mlxsw netdev.

Fixes: 0ce1822c2a ("vxlan: add adjacent link to limit depth level")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:14:38 -08:00
YueHaibing
d6649d788e net/tls: Fix unused function warning
If PROC_FS is not set, gcc warns:

net/tls/tls_proc.c:23:12: warning:
 'tls_statistics_seq_show' defined but not used [-Wunused-function]

Use #ifdef to guard this.
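
A minimal sketch of the guard (the usual pattern for a seq_file show
callback that is only referenced from procfs registration):

        #ifdef CONFIG_PROC_FS
        static int tls_statistics_seq_show(struct seq_file *seq, void *v)
        {
                /* ... emit the TLS MIB counters ... */
                return 0;
        }
        #endif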

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:12:28 -08:00
Sebastian Andrzej Siewior
49e9d1a9fa workqueue: Add RCU annotation for pwq list walk
A check was recently added to ensure that an RCU-related lock is held
while an RCU list is iterated.
The `pwqs' list is sometimes iterated without an RCU lock but with
&wq->mutex acquired, leading to a warning.

Teach list_for_each_entry_rcu() that the RCU usage is okay if &wq->mutex
is acquired during the list traversal.
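
A minimal sketch of the annotation, using the optional lockdep
condition argument that list_for_each_entry_rcu() gained in the commit
referenced below:

        list_for_each_entry_rcu(pwq, &wq->pwqs, pwqs_node,
                                lockdep_is_held(&wq->mutex))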

Fixes: 28875945ba ("rcu: Add support for consolidated-RCU reader checking")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-15 11:53:35 -08:00
Anson Huang
dffe969aa7 Input: imx_sc_key - correct SCU message structure to avoid stack corruption
The SCU message's data field used for receiving response data from the
SCU should be 32 bits wide, as the SCU sends back 32-bit data.

This solves a kernel panic when CONFIG_CC_HAVE_STACKPROTECTOR_SYSREG is
enabled.
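
A minimal sketch of the corrected message layout (field name
illustrative; the substance is widening the response slot to u32):

        struct imx_sc_msg_key {
                struct imx_sc_rpc_msg hdr;
                u32 state;      /* was u8: too small for the SCU's 32-bit
                                 * reply, which then overran the stack */
        };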

[    1.950768] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
[    1.980607] Workqueue: events imx_sc_check_for_events
[    1.985657] Call trace:
[    1.988104]  dump_backtrace+0x0/0x140
[    1.991768]  show_stack+0x14/0x20
[    1.995090]  dump_stack+0xb4/0xf8
[    1.998407]  panic+0x158/0x324
[    2.001463]  print_tainted+0x0/0xa8
[    2.004950]  imx_sc_check_for_events+0x18c/0x190
[    2.009569]  process_one_work+0x198/0x320
[    2.013579]  worker_thread+0x48/0x420
[    2.017252]  kthread+0xf0/0x120
[    2.020394]  ret_from_fork+0x10/0x18
[    2.023977] SMP: stopping secondary CPUs
[    2.027901] Kernel Offset: disabled
[    2.031391] CPU features: 0x0002,2100600c
[    2.035401] Memory Limit: none

Signed-off-by: Anson Huang <Anson.Huang@nxp.com>
Fixes: 688f1dfb69 ("Input: keyboard - imx_sc: Add i.MX system controller key support")
Link: https://lore.kernel.org/r/1573730499-2224-1-git-send-email-Anson.Huang@nxp.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-11-15 11:49:29 -08:00
Mikulas Patocka
d537858ac8 dm integrity: fix excessive alignment of metadata runs
Metadata runs are supposed to be aligned on a 4k boundary (so that they
work efficiently with disks that have 4k sectors). However, a
programming bug made them aligned on a 128k boundary instead, and the
unused space is wasted.

Fix this bug by providing a proper 4k alignment. In order to keep
existing volumes working, we introduce a new flag SB_FLAG_FIXED_PADDING
- when the flag is clear, we calculate the padding the old way. In order
to make sure that the old version cannot mount the volume created by the
new version, we increase superblock version to 4.

Also, in order not to break old integritysetup, we fix the alignment
only if the parameter "fix_padding" is present when formatting the
device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-15 14:49:16 -05:00
Chuhong Yuan
ba60cf9f78 Input: synaptics-rmi4 - destroy F54 poller workqueue when removing
The driver forgets to destroy the workqueue in remove(), similarly to
what is done when probe() fails. Add a call to destroy_workqueue() to
fix it.

Since unregistration will wait for the work to finish, we do not need to
cancel/flush the work instance in remove().

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191114023405.31477-1-hslester96@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-11-15 11:46:33 -08:00
Green Wan
c5ab54e994 riscv: dts: add support for PDMA device of HiFive Unleashed Rev A00
Add PDMA device support to the HiFive Unleashed Rev A00 device tree
(arch/riscv/boot/dts/sifive/fu540-c000.dtsi).

Signed-off-by: Green Wan <green.wan@sifive.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-11-15 11:45:28 -08:00
Oliver Neukum
fa3a5a1880 Input: ff-memless - kill timer in destroy()
No timer must be left running when the device goes away.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
Reported-and-tested-by: syzbot+b6c55daa701fc389e286@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1573726121.17351.3.camel@suse.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-11-15 11:45:03 -08:00
Steven Rostedt (VMware)
128161f47b ftrace: Add helper find_direct_entry() to consolidate code
Both unregister_ftrace_direct() and modify_ftrace_direct() need to
normalize the ip passed in to match rec->ip, as it is acceptable for the
ip to be anywhere on the ftrace call site, not just at its start. There
are also common validity checks on the record found by the ip; these
should be done for both unregister_ftrace_direct() and
modify_ftrace_direct().
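
A minimal sketch of the consolidated helper (record and hash lookup
names approximate kernel/trace/ftrace.c internals; simplified):

        static struct ftrace_func_entry *find_direct_entry(unsigned long *ip)
        {
                struct dyn_ftrace *rec;

                rec = lookup_rec(*ip, *ip);
                if (!rec)
                        return NULL;

                /* Normalize to the start of the call site. */
                *ip = rec->ip;

                return __ftrace_lookup_ip(direct_functions, rec->ip);
        }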

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:25:08 -05:00
Steven Rostedt (VMware)
406acdd32d ftrace: Add another check for match in register_ftrace_direct()
As an instruction pointer passed into register_ftrace_direct() may be
anywhere on the ftrace call site and not necessarily at its start,
register_ftrace_direct() still needs to test whether a direct call
already exists on the normalized site, as only one direct call is
allowed at any one time.

Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:24:46 -05:00
Steven Rostedt (VMware)
1c7f9b673d ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
The direct->count was only being updated when the first entry was
added, but it should be updated every time.

Fixes: 013bf0da04 ("ftrace: Add ftrace_find_direct_func()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:23:55 -05:00
Fenghua Yu
db8c33f8b5 x86/cpu: Align the x86_capability array to size of unsigned long
The x86_capability array in cpuinfo_x86 is of type u32 and thus is
naturally aligned to 4 bytes. But, set_bit() and clear_bit() require the
array to be aligned to size of unsigned long (i.e. 8 bytes on 64-bit
systems).

The array pointer is handed into atomic bit operations. If the access is
not aligned to unsigned long then the atomic bit operations can end up
crossing a cache line boundary, which causes the CPU to do a full bus lock
as it can't lock both cache lines at once. The bus lock operation is heavy
weight and can cause severe performance degradation.

The upcoming #AC split lock detection mechanism will issue warnings for
this kind of access.

Force the alignment of the array to unsigned long. This avoids the massive
code changes which would be required when converting the array data type to
unsigned long.
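
A minimal sketch of the change to cpuinfo_x86 (the attribute is the
substance; the array itself is unchanged):

        __u32 x86_capability[NCAPINTS + NBUGINTS]
                        __aligned(sizeof(unsigned long));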

[ tglx: Rewrote changelog so it contains information WHY this is required ]

Suggested-by: David Laight <David.Laight@aculab.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190916223958.27048-4-tony.luck@intel.com
2019-11-15 20:20:33 +01:00
Fenghua Yu
f6a892ddd5 x86/cpu: Align cpu_caps_cleared and cpu_caps_set to unsigned long
cpu_caps_cleared[] and cpu_caps_set[] are arrays of type u32 and therefore
naturally aligned to 4 bytes, which is also unsigned long aligned on
32-bit, but not on 64-bit.

The array pointer is handed into atomic bit operations. If the access is
not aligned to unsigned long then the atomic bit operations can end up crossing
a cache line boundary, which causes the CPU to do a full bus lock as it
can't lock both cache lines at once. The bus lock operation is heavy weight
and can cause severe performance degradation.

The upcoming #AC split lock detection mechanism will issue warnings for
this kind of access.

Force the alignment of these arrays to unsigned long. This avoids the
massive code changes which would be required when converting the array data
type to unsigned long.

[ tglx: Rewrote changelog ]

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20190916223958.27048-2-tony.luck@intel.com
2019-11-15 20:20:32 +01:00
Al Viro
2fa6b1e01a fs/namei.c: fix missing barriers when checking positivity
Pinned negative dentries can, generally, be made positive
by another thread.  Conditions that prevent that are
	* ->d_lock on dentry in question
	* parent directory held at least shared
	* nobody else could have observed the address of dentry
Most of the places working with those fall into one of those
categories; however, d_lookup() and friends need to be used
with some care.  Fortunately, there's not a lot of call sites,
and with few exceptions all of those fall under one of the
cases above.

Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
and mountpoint_last().  Another one is lookup_slow() - there
dcache lookup is done with parent held shared, but the result
is used after we'd drop the lock.  The same happens in do_last() -
the lookup (in lookup_one()) is done with parent locked, but
result is used after unlocking.

lookup_fast(), do_last() and mountpoint_last() flat-out reject
negatives.

Most of lookup_dcache() calls are made with parent locked at least
shared; the only exception is lookup_one_len_unlocked().  It might
return pinned negative, needs serious care from callers.  Fortunately,
almost nobody calls it directly anymore; all but two callers have
converted to lookup_positive_unlocked(), which rejects negatives.

lookup_slow() is called by the same lookup_one_len_unlocked() (see
above), mountpoint_last() and walk_component().  In those two negatives
are rejected.

In other words, there is a small set of places where we need to
check carefully if a pinned potentially negative dentry is, in
fact, positive.  After that check we want to be sure that both
->d_inode and type bits in ->d_flags are stable and observed.
The set consists of follow_managed() (where the rejection happens
for lookup_fast(), walk_component() and do_last()), mountpoint_last()
and lookup_positive_unlocked().

Solution:
	1) transition from negative to positive (in __d_set_inode_and_type())
stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
smp_load_acquire() and bugger off if its type bits say "negative".
That way anyone downstream of those checks knows the dentry is positive
and pinned, with ->d_inode and the type bits of ->d_flags stable and
observed.
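
A minimal sketch of that pairing (simplified; d_flags_negative() is the
type-bit test):

        /* (1) __d_set_inode_and_type(): publish ->d_inode first, then
         * release-store the type bits into ->d_flags. */
        dentry->d_inode = inode;
        smp_store_release(&dentry->d_flags, flags | type_flags);

        /* (2) the three fs/namei.c sites: acquire-load ->d_flags and
         * bugger off on a negative before ever touching ->d_inode. */
        if (d_flags_negative(smp_load_acquire(&dentry->d_flags)))
                return -ENOENT;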

I considered splitting off d_lookup_positive(), so that the checks could
be done right there, under ->d_lock.  However, that leads to massive
duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
No matter what, autofs_d_manage()/autofs_d_automount() must live with
the possibility of pinned negative dentry passed their way, becoming
positive under them - that's the intended behaviour when lookup comes
in the middle of automount in progress, so we can't keep them out of
the area that has to deal with those, more's the pity...

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
e840093367 fix dget_parent() fastpath race
We are overoptimistic about taking the fast path there; seeing
the same value in ->d_parent after having grabbed a reference
to that parent does *not* mean that it has remained our parent
all along.

That wouldn't be a big deal (in the end it is our parent and
we have grabbed the reference we are about to return), but...
the situation with barriers is messed up.

We might have hit the following sequence:

d is a dentry of /tmp/a/b
CPU1:					CPU2:
parent = d->d_parent (i.e. dentry of /tmp/a)
					rename /tmp/a/b to /tmp/b
					rmdir /tmp/a, making its dentry negative
grab reference to parent,
end up with cached parent->d_inode (NULL)
					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
recheck d->d_parent, which is back to original
decide that everything's fine and return the reference we'd got.

The trouble is, caller (on CPU1) will observe dget_parent()
returning an apparently negative dentry.  It actually is positive,
but CPU1 has stale ->d_inode cached.

Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
we just go into the slow path in such case.  We don't wait for ->d_seq
to become even either - again, if we are racing with renames, we
can bloody well go to slow path anyway.
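
A minimal sketch of the reworked fastpath (simplified from fs/dcache.c;
the essential change is validating ->d_seq instead of rechecking
->d_parent):

        seq = raw_seqcount_begin(&dentry->d_seq);
        parent = READ_ONCE(dentry->d_parent);
        if (unlikely(!lockref_get_not_zero(&parent->d_lockref)))
                goto slow;                /* take ->d_lock and retry */
        if (read_seqcount_retry(&dentry->d_seq, seq)) {
                dput(parent);
                goto slow;                /* moved under us: no fast retry */
        }
        return parent;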

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
6c2d4798a8 new helper: lookup_positive_unlocked()
Most of the callers of lookup_one_len_unlocked() treat negatives as
ERR_PTR(-ENOENT).  Provide a helper that does just that.  Note
that a pinned positive dentry remains positive - its ->d_inode is
stable, etc.; a pinned _negative_ dentry can become positive at any
point as long as you are not holding its parent at least shared.
So using lookup_one_len_unlocked() needs to be careful;
lookup_positive_unlocked() is safer and that's what the callers
end up open-coding anyway.
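
A minimal sketch of the helper's shape (the real version uses the
acquire-ordered negativity check introduced earlier in this series):

        struct dentry *lookup_positive_unlocked(const char *name,
                                                struct dentry *base, int len)
        {
                struct dentry *ret = lookup_one_len_unlocked(name, base, len);

                if (!IS_ERR(ret) && d_is_negative(ret)) {
                        dput(ret);
                        ret = ERR_PTR(-ENOENT);
                }
                return ret;
        }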

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
d41efb522e fs/namei.c: pull positivity check into follow_managed()
There are 4 callers; two proceed to check if result is positive and
fail with ENOENT if it isn't; one (in handle_lookup_down()) is
guaranteed to yield positive and one (in lookup_fast()) is _preceded_
by positivity check.

However, follow_managed() on a negative dentry is a (fairly cheap)
no-op on anything other than autofs.  And negative autofs dentries
are never hashed, so lookup_fast() is not going to run into one
of those.  Moreover, successful follow_managed() on a _positive_
dentry never yields a negative one (and we significantly rely upon
that in callers of lookup_fast()).

In other words, we can easily transpose the positivity check and
the call of follow_managed() in lookup_fast().  And that allows us
to fold the positivity check *into* follow_managed(), simplifying
life for the code downstream of its calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Linus Torvalds
875fef493f Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "Two fixes for the buffered reads and O_DIRECT writes serialization
  patch that went into -rc1 and a fixup for a bogus warning on older gcc
  versions"

* tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client:
  rbd: silence bogus uninitialized warning in rbd_object_map_update_finish()
  ceph: increment/decrement dio counter on async requests
  ceph: take the inode lock before acquiring cap refs
2019-11-15 10:30:24 -08:00
David Howells
a28f239e29 afs: Fix race in commit bulk status fetch
When a lookup is done, the afs filesystem will perform a bulk status-fetch
operation on the requested vnode (file) plus the next 49 other vnodes from
the directory list (in AFS, directory contents are downloaded as blobs and
parsed locally).  When the results are received, it will speculatively
populate the inode cache from the extra data.

However, if the lookup races with another lookup on the same directory, but
for a different file - one that's in the 49 extra fetches, then if the bulk
status-fetch operation finishes first, it will try to update the inode
from the other lookup.

If this other inode is still in the throes of being created, however, this
will cause an assertion failure in afs_apply_status():

	BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));

on or about fs/afs/inode.c:175 because it expects data to be there already
that it can compare to.

Fix this by skipping the update if the inode is being created as the
creator will presumably set up the inode with the same information.

Fixes: 39db9815da ("afs: Fix application of the results of a inline bulk status fetch")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 10:28:02 -08:00
Yang Tao
ca16d5bee5 futex: Prevent robust futex exit race
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents waiters from blocking forever.

The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.

After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.

After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.

The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.

This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.

(1) Unlocking task is killed after unlocking the futex value in user space
    before being able to wake a waiter.

        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                        <------------------------killed
            lll_futex_wake ()                   |
                                                |
                                                |(__lock = 0)
                                                |(enter kernel)
                                                |
                                                V
                                            do_exit()
                                            exit_mm()
                                          mm_release()
                                        exit_robust_list()
                                        handle_futex_death()
                                                |
                                                |(__lock = 0)
                                                |(uval = 0)
                                                |
                                                V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned by
    the exiting task prevents the wakeup of waiters which in consequence
    block infinitely.

(2) Waiting task is killed after a wakeup and before it can acquire the
    futex in user space.

        OWNER                         WAITER
				futex_wait()      		
   pthread_mutex_unlock()               |
                |                       |
                |(__lock = 0)           |
                |                       |
                V                       |
         futex_wake() ------------>  wakeup()
                                        |
                                        |(return to userspace)
                                        |(__lock = 0)
                                        |
                                        V
                        oldval = mutex->__data.__lock
                                          <-----------------killed
    atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,  |
                        id | assume_other_futex_waiters, 0)      |
                                                                 |
                                                                 |
                                                   (enter kernel)|
                                                                 |
                                                                 V
                                                         do_exit()
                                                        |
                                                        |
                                                        V
                                        handle_futex_death()
                                        |
                                        |(__lock = 0)
                                        |(uval = 0)
                                        |
                                        V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned
    by the exiting task prevents the wakeup of waiters, which seems to
    be correct as the exiting task does not own the futex value, but
the consequence is that other waiters won't be woken up and block
    infinitely.

In both scenarios the following conditions are true:

   - task->robust_list->list_op_pending != NULL
   - user space futex value == 0
   - Regular futex (not PI)

If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.

As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals,
gracefully anyway. So such a spurious wakeup will not affect the
correctness of these operations.

This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.

The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.

The eventually woken up waiter will observe the uncontended lock value and
take it over.
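
A minimal sketch of the resulting workaround in handle_futex_death()
(parameter names follow the description above; simplified):

        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) {
                /* Not owned by the exiting task, but a pending lock or
                 * unlock operation on an uncontended, non-PI futex:
                 * wake one potential waiter to prevent an infinite block. */
                if (pending_op && !pi && !uval)
                        futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
                return 0;
        }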

[ tglx: Massaged changelog and comment. Made the return explicit and not
  	depend on the subsequent check and added constants to hand into
  	handle_futex_death() instead of plain numbers. Fixed a few coding
	style issues. ]

Fixes: 0771dfefc9 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
2019-11-15 19:10:49 +01:00
Bob Peterson
d99724c3c3 gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
This patch closes a timing window in which two processes compete
and overlap in the execution of do_xmote for the same glock:

             Process A                              Process B
   ------------------------------------   -----------------------------
1. Grabs gl_lockref and calls do_xmote
2.                                        Grabs gl_lockref but is blocked
3. Sets GLF_INVALIDATE_IN_PROGRESS
4. Unlocks gl_lockref
5.                                        Calls do_xmote
6. Call glops->go_sync
7. test_and_clear_bit GLF_DIRTY
8. Call gfs2_log_flush                    Call glops->go_sync
9. (slow IO, so it blocks a long time)    test_and_clear_bit GLF_DIRTY
                                          It's not dirty (step 7) returns
10.                                       Tests GLF_INVALIDATE_IN_PROGRESS
11.                                       Calls go_inval (rgrp_go_inval)
12.                                       gfs2_rgrp_relse does brelse
13.                                       truncate_inode_pages_range
14.                                       Calls lm_lock UN

In step 14 we've just told dlm to give the glock to another node
when, in fact, process A has not finished the IO, synced all
buffer_heads to disk, and made sure their revokes are done.

This patch fixes the problem by changing the GLF_INVALIDATE_IN_PROGRESS
check to use test_and_set_bit, and if the bit is already set, process B
just ignores it and trusts that process A will do the do_xmote in the
proper order.
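
A minimal sketch of the intended exclusion in do_xmote() (simplified;
the real function handles more states):

        if (test_and_set_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
                return;         /* another process already owns this do_xmote */

        /* ... glops->go_sync() / glops->go_inval() ... */

        clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);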

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15 18:21:59 +01:00
Linus Torvalds
eb70e26cd7 Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Will Deacon:
 "One trivial fix for -rc8/final that ensures that the script used to
  detect RELR relocation support in the toolchain works correctly when
  $CC contains quotes. Although it fails safely (by failing to detect
  the support when it exists), it would be nice to have this fixed in
  5.4 given that it was only introduced in the last merge window.

  Summary:

   - Handle CC variables containing quotes in tools-support-relr.sh
     script"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  scripts/tools-support-relr.sh: un-quote variables
2019-11-15 09:14:23 -08:00
Linus Torvalds
34b38f5abd Merge tag 'mips_fixes_5.4_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
 "A fix and simplification for SGI IP27 exception handlers, and a small
  MAINTAINERS update for Broadcom MIPS systems"

* tag 'mips_fixes_5.4_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MAINTAINERS: Remove Kevin as maintainer of BMIPS generic platforms
  MIPS: SGI-IP27: fix exception handler replication
2019-11-15 09:10:13 -08:00
Linus Torvalds
74bc8acd6c Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM fixes from Paolo Bonzini:

 - fixes for CONFIG_KVM_COMPAT=n

 - two updates to the IFU erratum

 - selftests build fix

 - brown paper bag fix

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: Add a comment describing the /dev/kvm no_compat handling
  KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()
  KVM: Forbid /dev/kvm being opened by a compat task when CONFIG_KVM_COMPAT=n
  KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list()
  selftests: kvm: fix build with glibc >= 2.30
  kvm: x86: disable shattered huge page recovery for PREEMPT_RT.
2019-11-15 09:05:39 -08:00
Linus Torvalds
5b675f7362 Merge tag 'mmc-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fix from Ulf Hansson:
 "Don't overwrite quirk flags in sdhci-of-at91 host driver"

* tag 'mmc-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-of-at91: fix quirk2 overwrite
2019-11-15 08:58:56 -08:00
Bob Peterson
52b1cdcb7a gfs2: Abort gfs2_freeze if io error is seen
Before this patch, an I/O error such as -EIO while writing to the
journal would cause function gfs2_freeze to go into an infinite loop,
continuously retrying the freeze operation. But nothing ever clears the
-EIO except unmount after withdraw, which is impossible if the freeze
operation never ends (fails). Instead you get:

[ 6499.767994] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6499.773058] gfs2: fsid=dm-32.0: retrying...
[ 6500.791957] gfs2: fsid=dm-32.0: error freezing FS: -5
[ 6500.797015] gfs2: fsid=dm-32.0: retrying...

This patch adds a check for -EIO in gfs2_freeze, and if seen, it
dequeues the freeze glock, aborts the loop and returns the error.
Also, there's no need to pass the freeze holder to function
gfs2_lock_fs_check_clean since it's only called in one place and
it's a well-known superblock pointer, so this simplifies that.
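
A minimal, simplified sketch of the loop change (the freeze attempt
helper name is hypothetical; the substance is the -EIO bailout):

        for (;;) {
                error = gfs2_do_freeze(sdp);    /* hypothetical helper */
                if (!error)
                        break;

                if (error == -EIO) {
                        /* The journal hit an I/O error: drop the freeze
                         * glock and give up instead of looping forever. */
                        gfs2_glock_dq_uninit(&sdp->sd_freeze_gh);
                        return error;
                }

                fs_err(sdp, "error freezing FS: %d\n", error);
                fs_err(sdp, "retrying...\n");
                msleep(1000);
        }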

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15 17:57:30 +01:00
Linus Torvalds
11ac7cc88b Merge tag 'sound-5.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "A few small last-minute fixes for USB-audio and HD-audio as well as
  for PCM core:

   - A race fix for PCM core between stopping and closing a stream

   - USB-audio regressions in the recent descriptor validation code and
     relevant changes

   - A read of uninitialized value in USB-audio spotted by fuzzer

   - A fix for USB-audio race at stopping a stream

   - Intel HD-audio platform fixes"

* tag 'sound-5.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Fix incorrect size check for processing/extension units
  ALSA: usb-audio: Fix incorrect NULL check in create_yamaha_midi_quirk()
  ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
  ALSA: usb-audio: not submit urb for stopped endpoint
  ALSA: hda: hdmi - fix pin setup on Tigerlake
  ALSA: hda: Add Cometlake-S PCI ID
  ALSA: usb-audio: Fix missing error check at mixer resolution test
2019-11-15 08:53:52 -08:00
Linus Torvalds
37b49f31e8 Merge tag 'drm-fixes-2019-11-15' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "Here is this weeks non-intel hw vuln fixes pull. Three drivers, all
  small fixes.

  i915:
   - MOCS table fixes for EHL and TGL
   - Update Display's rawclock on resume
   - GVT's dmabuf reference drop fix

  amdgpu:
   - Fix a potential crash in firmware parsing

  sun4i:
   - One fix to the dotclock dividers range for sun4i"

* tag 'drm-fixes-2019-11-15' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: fix null pointer deref in firmware header printing
  drm/i915/tgl: MOCS table update
  Revert "drm/i915/ehl: Update MOCS table for EHL"
  drm/sun4i: tcon: Set min division of TCON0_DCLK to 1.
  drm/i915: update rawclk also on resume
  drm/i915/gvt: fix dropping obj reference twice
2019-11-15 08:47:34 -08:00
Linus Torvalds
b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
Colin Ian King
760f8bc7c8 ftrace/selftests: Fix spelling mistake "wakeing" -> "waking"
There is a spelling mistake in a trace_printk() message, as well as in
the selftests that search for this string.

Link: http://lkml.kernel.org/r/20191115085938.38947-1-colin.king@canonical.com
Link: http://lkml.kernel.org/r/20191115090356.39572-1-colin.king@canonical.com

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 11:30:38 -05:00
Artem Bityutskiy
58a74a2925 tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
Increase the maximum allowed count of synthetic event fields from 16 to 32
in order to allow for larger-than-usual events.

Link: http://lkml.kernel.org/r/20191115091730.9192-1-dedekind1@gmail.com

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 11:30:29 -05:00
David Howells
3c0edea9b2 pipe: Remove sync on wake_ups
2019-11-15 16:22:54 +00:00
David Howells
cefa80ced5 pipe: Increase the writer-wakeup threshold to reduce context-switch count
Increase the threshold at which the reader sends a wake event to the
writers in the queue such that the queue must be half empty before the wake
is issued, rather than the wake being issued as soon as just a single
slot is available.

This reduces the number of context switches in the tests significantly,
without altering the amount of work achieved.  With my pipe-bench program,
there's a 20% reduction versus an unpatched kernel.
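
A minimal sketch of the reader-side condition, using the pipe ring
helpers from this series (simplified; wakeup flavour illustrative):

        /* Wake writers only once at least half the ring is free,
         * instead of on every slot the reader consumes. */
        if (pipe_occupancy(pipe->head, pipe->tail) <= pipe->ring_size / 2)
                wake_up_interruptible_poll(&pipe->wait,
                                           EPOLLOUT | EPOLLWRNORM);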

Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8df441294d pipe: Check for ring full inside of the spinlock in pipe_write()
Make pipe_write() check to see if the ring has become full between it
taking the pipe mutex, checking the ring status and then taking the
spinlock.

This can happen if a notification is written into the pipe as that happens
without the pipe mutex.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
7e25a73f1a pipe: Remove redundant wakeup from pipe_write()
Remove a redundant wakeup from pipe_write().

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
a194dfe6e6 pipe: Rearrange sequence in pipe_write() to preallocate slot
Rearrange the sequence in pipe_write() so that the allocation of the new
buffer, the allocation of a ring slot and the attachment to the ring is
done under the pipe wait spinlock and then the lock is dropped and the
buffer can be filled.

The data copy needs to be done with the spinlock released and irqs
enabled, so the lock needs to be dropped first.  However, the reader
can't progress as we're holding pipe->mutex.

We also need to drop the lock because holding it would impact others
looking at the pipe waitqueue, such as poll(), the consumer and a future
kernel message writer.

We just abandon the preallocated slot if we get a copy error.  Future
writes may continue it and a future read will eventually recycle it.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
8446487feb pipe: Conditionalise wakeup in pipe_read()
Only do a wakeup in pipe_read() if we made space in a completely full
buffer.  The producer shouldn't be waiting on pipe->wait otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
b667b86734 pipe: Advance tail pointer inside of wait spinlock in pipe_read()
Advance the pipe ring tail pointer inside of the wait spinlock in pipe_read()
so that the pipe can be written into with kernel notifications from
contexts where pipe->mutex cannot be taken.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00
David Howells
6718b6f855 pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers:

 (1) pipe->ring_size - indicates the hard size of the pipe ring.

 (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
     userspace orchestrated events can fill.

This allows for a pipe that is both writable by the general kernel
notification facility and by userspace, allowing plenty of ring space for
notifications to be added whilst preventing userspace from being able to
pin too much unswappable kernel space.
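
A minimal sketch of how the split is consumed by userspace writers
(helper shapes as introduced by this series; simplified):

        /* Userspace writers check fullness against max_usage, leaving
         * ring_size - max_usage slots in reserve for kernel
         * notification messages. */
        static inline bool pipe_full(unsigned int head, unsigned int tail,
                                     unsigned int limit)
        {
                return pipe_occupancy(head, tail) >= limit;
        }

        /* In pipe_write(): */
        if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
                return -EAGAIN; /* illustrative non-blocking path */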

Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15 16:22:54 +00:00