Although commit 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys
for workqueues") unregisters dynamic lockdep keys when a workqueue is
destroyed, a side effect of that commit is that all stack traces
associated with the lockdep key are leaked when a workqueue is destroyed.
Fix this by storing each unique stack trace once. Other changes in this
patch are:
- Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
- Store a pointer to a stack trace in struct lock_class and struct
lock_list instead of storing 'nr_entries' and 'offset'.
This patch avoids that the following program triggers the "BUG:
MAX_STACK_TRACE_ENTRIES too low!" complaint:
#include <fcntl.h>
#include <unistd.h>
int main()
{
for (;;) {
int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
close(fd);
}
}
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If device_link_add() is called for a consumer/supplier pair with an
existing device link between them and the existing link's type is
not in agreement with the flags passed to that function by its
caller, NULL will be returned. That is seriously inconvenient,
because it forces the callers of device_link_add() to worry about
what others may or may not do even if that is not relevant to them
for any other reasons.
It turns out, however, that this limitation can be made go away
relatively easily.
The underlying observation is that if DL_FLAG_STATELESS has been
passed to device_link_add() in flags for the given consumer/supplier
pair at least once, calling either device_link_del() or
device_link_remove() to release the link returned by it should work,
but there are no other requirements associated with that flag. In
turn, if at least one of the callers of device_link_add() for the
given consumer/supplier pair has not passed DL_FLAG_STATELESS to it
in flags, the driver core should track the status of the link and act
on it as appropriate (ie. the link should be treated as "managed").
This means that DL_FLAG_STATELESS needs to be set for managed device
links and it should be valid to call device_link_del() or
device_link_remove() to drop references to them in certain
sutiations.
To allow that to happen, introduce a new (internal) device link flag
called DL_FLAG_MANAGED and make device_link_add() set it automatically
whenever DL_FLAG_STATELESS is not passed to it. Also make it take
additional references to existing device links that were previously
stateless (that is, with DL_FLAG_STATELESS set and DL_FLAG_MANAGED
unset) and will need to be managed going forward and initialize
their status (which has been DL_STATE_NONE so far).
Accordingly, when a managed device link is dropped automatically
by the driver core, make it clear DL_FLAG_MANAGED, reset the link's
status back to DL_STATE_NONE and drop the reference to it associated
with DL_FLAG_MANAGED instead of just deleting it right away (to
allow it to stay around in case it still needs to be released
explicitly by someone).
With that, since setting DL_FLAG_STATELESS doesn't mean that the
device link in question is not managed any more, replace all of the
status-tracking checks against DL_FLAG_STATELESS with analogous
checks against DL_FLAG_MANAGED and update the documentation to
reflect these changes.
While at it, make device_link_add() reject flags that it does not
recognize, including DL_FLAG_MANAGED.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Saravana Kannan <saravanak@google.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Review-by: Saravana Kannan <saravanak@google.com>
Link: https://lore.kernel.org/r/2305283.AStDPdUUnE@kreacher
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduces a concept of external buffers, which is a mechanism for creating
trace sinks that would receive trace data from MSC buffers and transfer it
elsewhere.
A external buffer can implement its own window allocation/deallocation if
it has to. It must provide a callback that's used to notify it when a
window fills up, so that it can then start a DMA transaction from that
window 'elsewhere'. This window remains in a 'locked' state and won't be
used for storing new trace data until the buffer 'unlocks' it with a
provided API call, at which point the window can be used again for storing
trace data.
This relies on a functional "last block" interrupt, so not all versions of
Trace Hub can use this feature, which does not reflect on existing users.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20190705141425.19894-2-alexander.shishkin@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, evdev stamps events with timestamps acquired in evdev_events()
However, this timestamping may not be accurate in terms of measuring
when the actual event happened.
Let's allow individual drivers specify timestamp in order to provide a more
accurate sense of time for the event. It is expected that drivers will set the
timestamp in their hard interrupt routine.
Signed-off-by: Atif Niyaz <atifniyaz@google.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Make alt_pr_unregister function void, since it always returns 0,
and nothing would act on the value anyways.
Signed-off-by: Moritz Fischer <mdf@kernel.org>
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.
The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.
Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.
But it turns out that all of this is entirely unnecessary. Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all. Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.
So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this). We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.
Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards. It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.
It is possible that we should just remove the whole RCU markings for
->cred entirely. Only ->real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.
But this is a "minimal semantic changes" change for the immediate
problem.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify the timerqueue code by using cached rbtrees and rely on the tree
leftmost node semantics to get the timer with earliest expiration time.
This is a drop in conversion, and therefore semantics remain untouched.
The runtime overhead of cached rbtrees is be pretty much the same as the
current head->next method, noting that when removing the leftmost node,
a common operation for the timerqueue, the rb_next(leftmost) is O(1) as
well, so the next timer will either be the right node or its parent.
Therefore no extra pointer chasing. Finally, the size of the struct
timerqueue_head remains the same.
Passes several hours of rcutorture.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190724152323.bojciei3muvfxalm@linux-r8p5
Introduce a helper function for drivers to use when updating an
iommu_iotlb_gather structure in response to an ->unmap() call, rather
than having to open-code the logic in every page-table implementation.
Signed-off-by: Will Deacon <will@kernel.org>
To permit batching of TLB flushes across multiple calls to the IOMMU
driver's ->unmap() implementation, introduce a new structure for
tracking the address range to be flushed and the granularity at which
the flushing is required.
This is hooked into the IOMMU API and its caller are updated to make use
of the new structure. Subsequent patches will plumb this into the IOMMU
drivers as well, but for now the gathering information is ignored.
Signed-off-by: Will Deacon <will@kernel.org>
In preparation for TLB flush gathering in the IOMMU API, rename the
iommu_gather_ops structure in io-pgtable to iommu_flush_ops, which
better describes its purpose and avoids the potential for confusion
between different levels of the API.
$ find linux/ -type f -name '*.[ch]' | xargs sed -i 's/gather_ops/flush_ops/g'
Signed-off-by: Will Deacon <will@kernel.org>
Commit add02cfdc9 ("iommu: Introduce Interface for IOMMU TLB Flushing")
added three new TLB flushing operations to the IOMMU API so that the
underlying driver operations can be batched when unmapping large regions
of IO virtual address space.
However, the ->iotlb_range_add() callback has not been implemented by
any IOMMU drivers (amd_iommu.c implements it as an empty function, which
incurs the overhead of an indirect branch). Instead, drivers either flush
the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
invalidation during ->unmap().
Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
two major issues:
1. The page size used to map the region in the page-table is not known,
and so it is not generally possible to issue TLB flushes in the most
efficient manner.
2. The only mutable state passed to the callback is a pointer to the
iommu_domain, which can be accessed concurrently and therefore
requires expensive synchronisation to keep track of the outstanding
flushes.
Remove the callback entirely in preparation for extending ->unmap() and
->iotlb_sync() to update a token on the caller's stack.
Signed-off-by: Will Deacon <will@kernel.org>
Add missing SPDX identifiers for the CAN network layer and correct the SPDX
license for two of its include files to make sure the BSD-3-Clause applies
for the entire subsystem.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
With commit c7cbdbf29f ("net: rework SIOCGSTAMP ioctl handling") the only
ioctl function in can_ioctl() has been removed.
As this SIOCGSTAMP ioctl command is now handled in net/socket.c we can entirely
remove the CAN specific ioctl functions.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The very first check in test_pkt_md_access is failing on s390, which
happens because loading a part of a struct __sk_buff field produces
an incorrect result.
The preprocessed code of the check is:
{
__u8 tmp = *((volatile __u8 *)&skb->len +
((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8)));
if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2;
};
clang generates the following code for it:
0: 71 21 00 03 00 00 00 00 r2 = *(u8 *)(r1 + 3)
1: 61 31 00 00 00 00 00 00 r3 = *(u32 *)(r1 + 0)
2: 57 30 00 00 00 00 00 ff r3 &= 255
3: 5d 23 00 1d 00 00 00 00 if r2 != r3 goto +29 <LBB0_10>
Finally, verifier transforms it to:
0: (61) r2 = *(u32 *)(r1 +104)
1: (bc) w2 = w2
2: (74) w2 >>= 24
3: (bc) w2 = w2
4: (54) w2 &= 255
5: (bc) w2 = w2
The problem is that when verifier emits the code to replace a partial
load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full load of
struct sk_buff field (*(u32 *)(r1 + 104)), an optional shift and a
bitwise AND, it assumes that the machine is little endian and
incorrectly decides to use a shift.
Adjust shift count calculation to account for endianness.
Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We currently have cases where the dma_addressing_limited() gets
called with dma_mask unset. This causes a NULL pointer dereference.
Use dma_get_mask() accessor to prevent the crash.
Fixes: b866455423 ("dma-mapping: add a dma_addressing_limited helper")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
blk_mq_sched_completed_request is a function that checks if the elevator
related to the request has started_request implemented, but currently, none of
the available IO schedulers implement started_request, so remove both.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The stub function for !CONFIG_IOMMU_IOVA needs to be
'static inline'.
Fixes: effa467870 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Note that after previous changes dpm_noirq_begin() and
dpm_noirq_end() each have only one caller, so move the code from
them to their respective callers and drop them.
Also note that dpm_noirq_resume_devices() and
dpm_noirq_suspend_devices() need not be exported any more, so make
them both static.
This change is not expected to alter functionality.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
After commit 33e4f80ee6 ("ACPI / PM: Ignore spurious SCI wakeups
from suspend-to-idle") the "noirq" phases of device suspend and
resume may run for multiple times during suspend-to-idle, if there
are spurious system wakeup events while suspended. However, this
is complicated and fragile and actually unnecessary.
The main reason for doing this is that on some systems the EC may
signal system wakeup events (power button events, for example) as
well as events that should not cause the system to resume (spurious
system wakeup events). Thus, in order to determine whether or not
a given event signaled by the EC while suspended is a proper system
wakeup one, the EC GPE needs to be dispatched and to start with that
was achieved by allowing the ACPI SCI action handler to run, which
was only possible after calling resume_device_irqs().
However, dispatching the EC GPE this way turned out to take too much
time in some cases and some EC events might be missed due to that, so
commit 68e2201185 ("ACPI: EC: Dispatch the EC GPE directly on
s2idle wake") started to dispatch the EC GPE right after a wakeup
event has been detected, so in fact the full ACPI SCI action handler
doesn't need to run any more to deal with the wakeups coming from the
EC.
Use this observation to simplify the suspend-to-idle control flow
so that the "noirq" phases of device suspend and resume are each
run only once in every suspend-to-idle cycle, which is reported to
significantly reduce power drawn by some systems when suspended to
idle (by allowing them to reach a deep platform-wide low-power state
through the suspend-to-idle flow). [What appears to happen is that
the "noirq" resume of devices after a spurious EC wakeup brings some
devices into a state in which they prevent the platform from reaching
the deep low-power state going forward, even after a subsequent
"noirq" suspend phase, and on some systems the EC triggers such
wakeups already when the "noirq" suspend of devices is running for
the first time in the given suspend/resume cycle, so the platform
cannot reach the deep low-power state at all.]
First, make acpi_s2idle_wake() use the acpi_ec_dispatch_gpe() return
value to determine whether or not the wakeup may have been triggered
by the EC (in which case the system wakeup is canceled and ACPI
events are processed in order to determine whether or not the event
is a proper system wakeup one) and use rearm_wake_irq() (introduced
by a previous change) in it to rearm the ACPI SCI for system wakeup
detection in case the system will remain suspended.
Second, drop acpi_s2idle_sync(), which is not needed any more, and
the corresponding global platform suspend-to-idle callback.
Next, drop the pm_wakeup_pending() check (which is an optimization
only) from __device_suspend_noirq() to prevent it from returning
errors on system wakeups occurring before the "noirq" phase of
device suspend is complete (as in the case of suspend-to-idle it is
not known whether or not these wakeups are suprious at that point),
in order to avoid having to carry out a "noirq" resume of devices
on a spurious system wakeup.
Finally, change the code flow in s2idle_loop() to (1) run the
"noirq" suspend of devices once before starting the loop, (2) check
for spurious EC wakeups (via the platform ->wake callback) for the
first time before calling s2idle_enter(), and (3) run the "noirq"
resume of devices once after leaving the loop.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Introduce a new function, rearm_wake_irq(), allowing a wakeup IRQ
to be armed for systen wakeup detection again without running any
action handlers associated with it after it has been armed for
wakeup detection and triggered.
That is useful for IRQs, like ACPI SCI, that may deliver wakeup
as well as non-wakeup interrupts when armed for systen wakeup
detection. In those cases, it may be possible to determine whether
or not the delivered interrupt is a systen wakeup one without
running the entire action handler (or handlers, if the IRQ is
shared) for the IRQ, and if the interrupt turns out to be a
non-wakeup one, the IRQ can be rearmed with the help of the
new function.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
There are a lot of users of frag->page_offset, so use a union
to avoid converting those users today.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One step closer to turning the skb_frag_t into a bio_vec.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To increase commonality between block and net, we are going to replace
the skb_frag_t with the bio_vec. This patch increases the size of
skb_frag_t on 32-bit machines from 8 bytes to 12 bytes. The size is
unchanged on 64-bit machines.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some questionable coding styles in this function. It looks
quite odd to deref a pointer with array indexing that only uses the
first element. Also, destroying an input/output variable halfway through
the function and then overwriting it on success is not clear. It's
better to use a local variable and the kernel macros to step through
each bit set in a bitmask and clearly show where outputs are set.
Cc: Ian Jackson <ian.jackson@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Avaneesh Kumar Dwivedi <akdwived@codeaurora.org>
Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
[bjorn: Changed for_each_set_bit() size to BITS_PER_LONG]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Fix an incomplete devm_clk_bulk_get_optional() function documentation
by adding description of the num_clks argument as in other *clk_bulk*
functions.
Fixes: 9bd5ef0bd8 ("clk: Add devm_clk_bulk_get_optional() function")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Currently, ARM32 and ARM64 uses different data structures to represent
their cpu topologies. Since, we are moving the ARM64 topology to common
code to be used by other architectures, we can reuse that for ARM32 as
well.
Take this opprtunity to remove the redundant functions from ARM32 and
reuse the common code instead.
To: Russell King <linux@armlinux.org.uk>
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com> (on TC2)
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Pull networking fixes from David Miller:
1) Several netfilter fixes including a nfnetlink deadlock fix from
Florian Westphal and fix for dropping VRF packets from Miaohe Lin.
2) Flow offload fixes from Pablo Neira Ayuso including a fix to restore
proper block sharing.
3) Fix r8169 PHY init from Thomas Voegtle.
4) Fix memory leak in mac80211, from Lorenzo Bianconi.
5) Missing NULL check on object allocation in cxgb4, from Navid
Emamdoost.
6) Fix scaling of RX power in sfp phy driver, from Andrew Lunn.
7) Check that there is actually an ip header to access in skb->data in
VRF, from Peter Kosyh.
8) Remove spurious rcu unlock in hv_netvsc, from Haiyang Zhang.
9) One more tweak the the TCP fragmentation memory limit changes, to be
less harmful to applications setting small SO_SNDBUF values. From
Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
tcp: be more careful in tcp_fragment()
hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
vrf: make sure skb->data contains ip header to make routing
connector: remove redundant input callback from cn_dev
qed: Prefer pcie_capability_read_word()
igc: Prefer pcie_capability_read_word()
cxgb4: Prefer pcie_capability_read_word()
be2net: Synchronize be_update_queues with dev_watchdog
bnx2x: Prevent load reordering in tx completion processing
net: phy: sfp: hwmon: Fix scaling of RX power
net: sched: verify that q!=NULL before setting q->flags
chelsio: Fix a typo in a function name
allocate_flower_entry: should check for null deref
net: hns3: typo in the name of a constant
kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist.
tipc: Fix a typo
mac80211: don't warn about CW params when not using them
mac80211: fix possible memory leak in ieee80211_assign_beacon
nl80211: fix NL80211_HE_MAX_CAPABILITY_LEN
nl80211: fix VENDOR_CMD_RAW_DATA
...
Intel VT-d driver was reworked to use common deferred flushing
implementation. Previously there was one global per-cpu flush queue,
afterwards - one per domain.
Before deferring a flush, the queue should be allocated and initialized.
Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
queue. It's probably worth to init it for static or unmanaged domains
too, but it may be arguable - I'm leaving it to iommu folks.
Prevent queuing an iova flush if the domain doesn't have a queue.
The defensive check seems to be worth to keep even if queue would be
initialized for all kinds of domains. And is easy backportable.
On 4.19.43 stable kernel it has a user-visible effect: previously for
devices in si domain there were crashes, on sata devices:
BUG: spinlock bad magic on CPU#6, swapper/0/1
lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
Call Trace:
<IRQ>
dump_stack+0x61/0x7e
spin_bug+0x9d/0xa3
do_raw_spin_lock+0x22/0x8e
_raw_spin_lock_irqsave+0x32/0x3a
queue_iova+0x45/0x115
intel_unmap+0x107/0x113
intel_unmap_sg+0x6b/0x76
__ata_qc_complete+0x7f/0x103
ata_qc_complete+0x9b/0x26a
ata_qc_complete_multiple+0xd0/0xe3
ahci_handle_port_interrupt+0x3ee/0x48a
ahci_handle_port_intr+0x73/0xa9
ahci_single_level_irq_intr+0x40/0x60
__handle_irq_event_percpu+0x7f/0x19a
handle_irq_event_percpu+0x32/0x72
handle_irq_event+0x38/0x56
handle_edge_irq+0x102/0x121
handle_irq+0x147/0x15c
do_IRQ+0x66/0xf2
common_interrupt+0xf/0xf
RIP: 0010:__do_softirq+0x8c/0x2df
The same for usb devices that use ehci-pci:
BUG: spinlock bad magic on CPU#0, swapper/0/1
lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
Call Trace:
<IRQ>
dump_stack+0x61/0x7e
spin_bug+0x9d/0xa3
do_raw_spin_lock+0x22/0x8e
_raw_spin_lock_irqsave+0x32/0x3a
queue_iova+0x77/0x145
intel_unmap+0x107/0x113
intel_unmap_page+0xe/0x10
usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
usb_hcd_unmap_urb_for_dma+0x17/0x100
unmap_urb_for_dma+0x22/0x24
__usb_hcd_giveback_urb+0x51/0xc3
usb_giveback_urb_bh+0x97/0xde
tasklet_action_common.isra.4+0x5f/0xa1
tasklet_action+0x2d/0x30
__do_softirq+0x138/0x2df
irq_exit+0x7d/0x8b
smp_apic_timer_interrupt+0x10f/0x151
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: <stable@vger.kernel.org> # 4.14+
Fixes: 13cf017446 ("iommu/vt-d: Make use of iova deferred flushing")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.
If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.
To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.
This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.
v4:
- make sure we don't free things for device;
- remove the checks which swap the callbacks back
only if TLS is at the top.
Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
By default, if a caller sets REQ_NOWAIT and we need to block, we'll
return -EAGAIN through the bio->bi_end_io() callback. For some use
cases, this makes it hard to use.
Allow a caller to ask for inline return of errors related to
blocking by also setting REQ_NOWAIT_INLINE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
don't bother with remapping LOOKUP_... values - all callers pass
constants and we can just as well pass the right ones from the
very beginning.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull NTB updates from Jon Mason:
"New feature to add support for NTB virtual MSI interrupts, the ability
to test and use this feature in the NTB transport layer.
Also, bug fixes for the AMD and Switchtec drivers, as well as some
general patches"
* tag 'ntb-5.3' of git://github.com/jonmason/ntb: (22 commits)
NTB: Describe the ntb_msi_test client in the documentation.
NTB: Add MSI interrupt support to ntb_transport
NTB: Add ntb_msi_test support to ntb_test
NTB: Introduce NTB MSI Test Client
NTB: Introduce MSI library
NTB: Rename ntb.c to support multiple source files in the module
NTB: Introduce functions to calculate multi-port resource index
NTB: Introduce helper functions to calculate logical port number
PCI/switchtec: Add module parameter to request more interrupts
PCI/MSI: Support allocating virtual MSI interrupts
ntb_hw_switchtec: Fix setup MW with failure bug
ntb_hw_switchtec: Skip unnecessary re-setup of shared memory window for crosslink case
ntb_hw_switchtec: Remove redundant steps of switchtec_ntb_reinit_peer() function
NTB: correct ntb_dev_ops and ntb_dev comment typos
NTB: amd: Silence shift wrapping warning in amd_ntb_db_vector_mask()
ntb_hw_switchtec: potential shift wrapping bug in switchtec_ntb_init_sndev()
NTB: ntb_transport: Ensure qp->tx_mw_dma_addr is initaliazed
NTB: ntb_hw_amd: set peer limit register
NTB: ntb_perf: Clear stale values in doorbell and command SPAD register
NTB: ntb_perf: Disable NTB link after clearing peer XLAT registers
...
Pull dma-mapping fixes from Christoph Hellwig:
"Fix various regressions:
- force unencrypted dma-coherent buffers if encryption bit can't fit
into the dma coherent mask (Tom Lendacky)
- avoid limiting request size if swiotlb is not used (me)
- fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang
Duan)"
* tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
dma-direct: only limit the mapping size if swiotlb could be used
dma-mapping: add a dma_addressing_limited helper
dma-direct: Force unencrypted DMA under SME for certain DMA masks
Pull core fixes from Thomas Gleixner:
- A collection of objtool fixes which address recent fallout partially
exposed by newer toolchains, clang, BPF and general code changes.
- Force USER_DS for user stack traces
[ Note: the "objtool fixes" are not all to objtool itself, but for
kernel code that triggers objtool warnings.
Things like missing function size annotations, or code that confuses
the unwinder etc. - Linus]
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
objtool: Support conditional retpolines
objtool: Convert insn type to enum
objtool: Fix seg fault on bad switch table entry
objtool: Support repeated uses of the same C jump table
objtool: Refactor jump table code
objtool: Refactor sibling call detection logic
objtool: Do frame pointer check before dead end check
objtool: Change dead_end_function() to return boolean
objtool: Warn on zero-length functions
objtool: Refactor function alias logic
objtool: Track original function across branches
objtool: Add mcsafe_handle_tail() to the uaccess safe list
bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
x86/uaccess: Remove redundant CLACs in getuser/putuser error paths
x86/uaccess: Don't leak AC flag into fentry from mcsafe_handle_tail()
x86/uaccess: Remove ELF function annotation from copy_user_handle_tail()
x86/head/64: Annotate start_cpu0() as non-callable
x86/entry: Fix thunk function ELF sizes
x86/kvm: Don't call kvm_spurious_fault() from .fixup
x86/kvm: Replace vmx_vmenter()'s call to kvm_spurious_fault() with UD2
...
Pull more KVM updates from Paolo Bonzini:
"Mostly bugfixes, but also:
- s390 support for KVM selftests
- LAPIC timer offloading to housekeeping CPUs
- Extend an s390 optimization for overcommitted hosts to all
architectures
- Debugging cleanups and improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: x86: Add fixed counters to PMU filter
KVM: nVMX: do not use dangling shadow VMCS after guest reset
KVM: VMX: dump VMCS on failed entry
KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
KVM: Boost vCPUs that are delivering interrupts
KVM: selftests: Remove superfluous define from vmx.c
KVM: SVM: Fix detection of AMD Errata 1096
KVM: LAPIC: Inject timer interrupt via posted interrupt
KVM: LAPIC: Make lapic timer unpinned
KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
kvm: x86: ioapic and apic debug macros cleanup
kvm: x86: some tsc debug cleanup
kvm: vmx: fix coccinelle warnings
x86: kvm: avoid constant-conversion warning
x86: kvm: avoid -Wsometimes-uninitized warning
KVM: x86: expose AVX512_BF16 feature to guest
KVM: selftests: enable pgste option for the linker on s390
KVM: selftests: Move kvm_create_max_vcpus test to generic code
...