With the following commit:
4bc9f92e64 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
... efi_bgrt_init() calls into the memblock allocator through
efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.
Indeed, KASAN reports a bad read access later on in efi_free_boot_services():
BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
at addr ffff88022de12740
Read of size 4 by task swapper/0/0
page:ffffea0008b78480 count:0 mapcount:-127
mapping: (null) index:0x1 flags: 0x5fff8000000000()
[...]
Call Trace:
dump_stack+0x68/0x9f
kasan_report_error+0x4c8/0x500
kasan_report+0x58/0x60
__asan_load4+0x61/0x80
efi_free_boot_services+0xae/0x24c
start_kernel+0x527/0x562
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x157/0x17a
start_cpu+0x5/0x14
The instruction at the given address is the first read from the memmap's
memory, i.e. the read of md->type in efi_free_boot_services().
Note that the writes earlier in efi_arch_mem_reserve() don't splat because
they're done through early_memremap()ed addresses.
So, after memblock is gone, allocations should be done through the "normal"
page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
of consistency, from efi_fake_memmap() as well.
Note that for the latter, the memmap allocations cease to be page aligned.
This isn't needed though.
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: <stable@vger.kernel.org> # v4.9
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Fixes: 4bc9f92e64 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull VFIO fixes from Alex Williamson:
- Add mtty sample driver properly into build system (Alex Williamson)
- Restore type1 mapping performance after mdev (Alex Williamson)
- Fix mdev device race (Alex Williamson)
- Cleanups to the mdev ABI used by vendor drivers (Alex Williamson)
- Build fix for old compilers (Arnd Bergmann)
- Fix sample driver error path (Dan Carpenter)
- Handle pci_iomap() error (Arvind Yadav)
- Fix mdev ioctl return type (Paul Gortmaker)
* tag 'vfio-v4.10-rc3' of git://github.com/awilliam/linux-vfio:
vfio-mdev: fix non-standard ioctl return val causing i386 build fail
vfio-pci: Handle error from pci_iomap
vfio-mdev: fix some error codes in the sample code
vfio-pci: use 32-bit comparisons for register address for gcc-4.5
vfio-mdev: Make mdev_device private and abstract interfaces
vfio-mdev: Make mdev_parent private
vfio-mdev: de-polute the namespace, rename parent_device & parent_ops
vfio-mdev: Fix remove race
vfio/type1: Restore mapping performance with mdev support
vfio-mdev: Fix mtty sample driver building
Pull swiotlb fixes from Konrad Rzeszutek Wilk:
"This has one fix to make i915 work when using Xen SWIOTLB, and a
feature from Geert to aid in debugging of devices that can't do DMA
outside the 32-bit address space.
The feature from Geert is on top of v4.10 merge window commit
(specifically you pulling my previous branch), as his changes were
dependent on the Documentation/ movement patches.
I figured it would just easier than me trying than to cherry-pick the
Documentation patches to satisfy git.
The patches have been soaking since 12/20, albeit I updated the last
patch due to linux-next catching an compiler error and adding an
Tested-and-Reported-by tag"
* 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: Export swiotlb_max_segment to users
swiotlb: Add swiotlb=noforce debug option
swiotlb: Convert swiotlb_force from int to enum
x86, swiotlb: Simplify pci_swiotlb_detect_override()
So they can figure out what is the optimal number of pages
that can be contingously stitched together without fear of
bounce buffer.
We also expose an mechanism for sub-users of SWIOTLB API, such
as Xen-SWIOTLB to set the max segment value. And lastly
if swiotlb=force is set (which mandates we bounce buffer everything)
we set max_segment so at least we can bounce buffer one 4K page
instead of a giant 512KB one for which we may not have space.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-Tested-by: Juergen Gross <jgross@suse.com>
Pull audit fixes from Paul Moore:
"Two small fixes relating to audit's use of fsnotify.
The first patch plugs a leak and the second fixes some lock
shenanigans. The patches are small and I banged on this for an
afternoon with our testsuite and didn't see anything odd"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: Fix sleep in atomic
fsnotify: Remove fsnotify_duplicate_mark()
Pull networking fixes from David Miller:
1) stmmac_drv_probe() can race with stmmac_open() because we register
the netdevice too early. Fix from Florian Fainelli.
2) UFO handling in __ip6_append_data() and ip6_finish_output() use
different tests for deciding whether a frame will be fragmented or
not, put them in sync. Fix from Zheng Li.
3) The rtnetlink getstats handlers need to validate that the netlink
request is large enough, fix from Mathias Krause.
4) Use after free in mlx4 driver, from Jack Morgenstein.
5) Fix setting of garbage UID value in sockets during setattr() calls,
from Eric Biggers.
6) Packet drop_monitor doesn't format the netlink messages properly
such that nlmsg_next fails to work, fix from Reiter Wolfgang.
7) Fix handling of wildcard addresses in l2tp lookups, from Guillaume
Nault.
8) __skb_flow_dissect() can crash on pptp packets, from Ian Kumlien.
9) IGMP code doesn't reset group query timers properly, from Michal
Tesar.
10) Fix overzealous MAIN/LOCAL route table combining in ipv4, from
Alexander Duyck.
11) vxlan offload check needs to be more strict in be2net driver, from
Sabrina Dubroca.
12) Moving l3mdev to packet hooks lost RX stat counters unintentionally,
fix from David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
sh_eth: enable RX descriptor word 0 shift on SH7734
sfc: don't report RX hash keys to ethtool when RSS wasn't enabled
dpaa_eth: Initialize CGR structure before init
dpaa_eth: cleanup after init_phy() failure
net: systemport: Pad packet before inserting TSB
net: systemport: Utilize skb_put_padto()
LiquidIO VF: s/select/imply/ for PTP_1588_CLOCK
libcxgb: fix error check for ip6_route_output()
net: usb: asix_devices: add .reset_resume for USB PM
net: vrf: Add missing Rx counters
drop_monitor: consider inserted data in genlmsg_end
benet: stricter vxlan offloading check in be_features_check
ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules
net: macb: Updated resource allocation function calls to new version of API.
net: stmmac: dwmac-oxnas: use generic pm implementation
net: stmmac: dwmac-oxnas: fix fixed-link-phydev leaks
net: stmmac: dwmac-oxnas: fix of-node leak
Documentation/networking: fix typo in mpls-sysctl
igmp: Make igmp group member RFC 3376 compliant
flow_dissector: Update pptp handling to avoid null pointer deref.
...
What appears to be a copy and paste error from the line above gets
the ioctl a ssize_t return value instead of the traditional "int".
The associated sample code used "long" which meant it would compile
for x86-64 but not i386, with the latter failing as follows:
CC [M] samples/vfio-mdev/mtty.o
samples/vfio-mdev/mtty.c:1418:20: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
.ioctl = mtty_ioctl,
^
samples/vfio-mdev/mtty.c:1418:20: note: (near initialization for ‘mdev_fops.ioctl’)
cc1: some warnings being treated as errors
Since in this case, vfio is working with struct file_operations; as such:
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
...and so here we just standardize on long vs. the normal int that user
space typically sees and documents as per "man ioctl" and similar.
Fixes: 9d1a546c53 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series, one fixing a regression with
block size < page cache size in the alias series from Jan. Outside of
that, two small cleanups for wbt from Bart, a nvme pull request from
Christoph, and a few small fixes of documentation updates"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix up io_poll documentation
block: Avoid that sparse complains about context imbalance in __wbt_wait()
block: Make wbt_wait() definition consistent with declaration
clean_bdev_aliases: Prevent cleaning blocks that are not in block range
genhd: remove dead and duplicated scsi code
block: add back plugging in __blkdev_direct_IO
nvmet/fcloop: remove some logically dead code performing redundant ret checks
nvmet: fix KATO offset in Set Features
nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
nvme/fc: correct some printk information
nvme/scsi: Remove START STOP emulation
nvme/pci: Delete misleading queue-wrap comment
nvme/pci: Fix whitespace problem
nvme: simplify stripe quirk
nvme: update maintainers information
Jonathan writes:
First round of IIO fixes for the 4.10 cycle.
* 104-quad-8
- Fix selecting wrong register when the index control register is desired.
- Fix an off by one error when addressing the input/output control register.
- Fix inverted logic on the active high / low control
* bmi160
- Sleep for worst case rather than best case amount of time after cmd
execution begins.
* max44000
- typo fix in illuminance_integration_time_available listing.
* st-sensors
- Fix channel data passing. This one took a while to get tested on 24bit
parts. Definitely one for stable asap as the bug broke quite a few parts.
- lis3lv02 needs a data alignment bit set and the scaling was wrong.
* ti_am335x
- depend on HAS_DMA
Pull DAX updates from Dan Williams:
"The completion of Jan's DAX work for 4.10.
As I mentioned in the libnvdimm-for-4.10 pull request, these are some
final fixes for the DAX dirty-cacheline-tracking invalidation work
that was merged through the -mm, ext4, and xfs trees in -rc1. These
patches were prepared prior to the merge window, but we waited for
4.10-rc1 to have a stable merge base after all the prerequisites were
merged.
Quoting Jan on the overall changes in these patches:
"So I'd like all these 6 patches to go for rc2. The first three
patches fix invalidation of exceptional DAX entries (a bug which
is there for a long time) - without these patches data loss can
occur on power failure even though user called fsync(2). The other
three patches change locking of DAX faults so that ->iomap_begin()
is called in a more relaxed locking context and we are safe to
start a transaction there for ext4"
These have received a build success notification from the kbuild
robot, and pass the latest libnvdimm unit tests. There have not been
any -next releases since -rc1, so they have not appeared there"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
ext4: Simplify DAX fault path
dax: Call ->iomap_begin without entry lock during dax fault
dax: Finish fault completely when loading holes
dax: Avoid page invalidation races and unnecessary radix tree traversals
mm: Invalidate DAX radix tree entries only if appropriate
ext2: Return BH_New buffers for zeroed blocks
The LIS3LV02 has a special bit that need to be set to get the
read values left aligned. Before this patch we get gibberish
like this:
iio_generic_buffer -a -c10 -n lis3lv02dl_accel
(...)
0.000000 -0.010042 -0.642688 19155832931907
0.000000 -0.010042 -0.642688 19155858751073
Which is because we read a raw value for 1g as 64 which is
the nominal 1024 for 1g shifted 4 bits to the left by being
right-aligned rather than left aligned.
Since all other sensors are left aligned, add some code to
set the special DAS (data alignment setting) bit to 1 so that
the right value is now read like this:
iio_generic_buffer -a -c10 -n lis3lv02dl_accel
(...)
0.000000 -0.147095 -10.120135 24761614364956
-0.029419 -0.176514 -10.120135 24761631624540
The scaling was weird as well: we have a gain of 1000 for 1g
and 3000 for 6g. I don't even remember how I came up with the
old values but they are wrong.
Fixes: 3acddf74f8 ("iio: st-sensors: add support for lis3lv02d accelerometer")
Cc: Lorenzo Bianconi <lorenzo.bianconi@st.com>
Cc: Giuseppe Barba <giuseppe.barba@st.com>
Cc: Denis Ciocca <denis.ciocca@st.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Following any fw_rsc_vdev entries in the resource table are two variable
length arrays, the first one reference vring resources and the second
one is the virtio config space. The virtio config space is used by
virtio to communicate status and configuration changes and must as such
be shared with the remote.
The reverted commit incorrectly made any changes to the virtio config
space only affect the local copy, in an attempt to allowing memory
protection of the shared resource table.
This reverts commit cda8529346.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Demoting simple flow steering rule priority (for DPDK) was achieved by
wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
function for the PF and all VFs.
In function mlx4_ib_create_flow(), this change caused the main rule
creation for the PF to be wrapped, while it left the associated
tunnel steering rule creation unwrapped for the PF.
This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
for the PF when the detach wrapper function did not find the associated
tunnel-steering rule (since creation of that rule for the PF did not
go through the wrapper function).
Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
(so that the PF invocation does not go through the wrapper), and perform
the required priority demotion for the PF in the mlx4_ib_create_flow()
code path.
Fixes: 48564135cb ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6290602709 ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.
The most important is to not assume the packet is RX just because
the destination address matches that of the device. Such an
assumption causes problems when an interface is put into loopback
mode.
2) If we retry when creating a new tc entry (because we dropped the
RTNL mutex in order to load a module, for example) we end up with
-EAGAIN and then loop trying to replay the request. But we didn't
reset some state when looping back to the top like this, and if
another thread meanwhile inserted the same tc entry we were trying
to, we re-link it creating an enless loop in the tc chain. Fix from
Daniel Borkmann.
3) There are two different WRITE bits in the MDIO address register for
the stmmac chip, depending upon the chip variant. Due to a bug we
could set them both, fix from Hock Leong Kweh.
4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: stmmac: fix incorrect bit set in gmac4 mdio addr register
r8169: add support for RTL8168 series add-on card.
net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
openvswitch: upcall: Fix vlan handling.
ipv4: Namespaceify tcp_tw_reuse knob
net: korina: Fix NAPI versus resources freeing
net, sched: fix soft lockup in tc_classify
net/mlx4_en: Fix user prio field in XDP forward
tipc: don't send FIN message from connectionless socket
ipvlan: fix multicast processing
ipvlan: fix various issues in ipvlan_process_multicast()
Prime numbers are interesting for testing components that use multiplies
and divides, such as testing DRM's struct drm_mm alignment computations.
v2: Move to lib/, add selftest
v3: Fix initial constants (exclude 0/1 from being primes)
v4: More RCU markup to keep 0day/sparse happy
v5: Fix RCU unwind on module exit, add to kselftests
v6: Tidy computation of bitmap size
v7: for_each_prime_number_from()
v8: Compose small-primes using BIT() for easier verification
v9: Move rcu dance entirely into callers.
v10: Improve quote for Betrand's Postulate (aka Chebyshev's theorem)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/20161222144514.3911-1-chris@chris-wilson.co.uk
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable as we track cache dirtiness in these entries and when
they are evicted, we may not flush caches although it is necessary. This
can for example manifest when we write to the same block both via mmap
and via write(2) (to different offsets) and fsync(2) then does not
properly flush CPU caches when modification via write(2) was the last
one.
Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull timer type cleanups from Thomas Gleixner:
"This series does a tree wide cleanup of types related to
timers/timekeeping.
- Get rid of cycles_t and use a plain u64. The type is not really
helpful and caused more confusion than clarity
- Get rid of the ktime union. The union has become useless as we use
the scalar nanoseconds storage unconditionally now. The 32bit
timespec alike storage got removed due to the Y2038 limitations
some time ago.
That leaves the odd union access around for no reason. Clean it up.
Both changes have been done with coccinelle and a small amount of
manual mopping up"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ktime: Get rid of ktime_equal()
ktime: Cleanup ktime_set() usage
ktime: Get rid of the union
clocksource: Use a plain u64 instead of cycle_t
Pull SMP hotplug notifier removal from Thomas Gleixner:
"This is the final cleanup of the hotplug notifier infrastructure. The
series has been reintgrated in the last two days because there came a
new driver using the old infrastructure via the SCSI tree.
Summary:
- convert the last leftover drivers utilizing notifiers
- fixup for a completely broken hotplug user
- prevent setup of already used states
- removal of the notifiers
- treewide cleanup of hotplug state names
- consolidation of state space
There is a sphinx based documentation pending, but that needs review
from the documentation folks"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/armada-xp: Consolidate hotplug state space
irqchip/gic: Consolidate hotplug state space
coresight/etm3/4x: Consolidate hotplug state space
cpu/hotplug: Cleanup state names
cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
staging/lustre/libcfs: Convert to hotplug state machine
scsi/bnx2i: Convert to hotplug state machine
scsi/bnx2fc: Convert to hotplug state machine
cpu/hotplug: Prevent overwriting of callbacks
x86/msr: Remove bogus cleanup from the error path
bus: arm-ccn: Prevent hotplug callback leak
perf/x86/intel/cstate: Prevent hotplug callback leak
ARM/imx/mmcd: Fix broken cpu hotplug handling
scsi: qedi: Convert to hotplug state machine
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.
This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.
The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).
This improves the performance of page lock intensive microbenchmarks by
2-3%.
Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No point in going through loops and hoops instead of just comparing the
values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
register_hotcpu_notifier(), register_cpu_notifier(),
__register_hotcpu_notifier(), __register_cpu_notifier(),
unregister_hotcpu_notifier(), unregister_cpu_notifier(),
__unregister_hotcpu_notifier(), __unregister_cpu_notifier()
are unused now. Remove them and all related code.
Remove also the now pointless cpu notifier error injection mechanism. The
states can be executed step by step and error rollback is the same as cpu
down, so any state transition can be tested w/o requiring the notifier
error injection.
Some CPU hotplug states are kept as they are (ab)used for hotplug state
tracking.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull NTB update from Jon Mason:
- NTB bug fixes for removing an unnecessary call to ntb_peer_spad_read,
and correcting a free_irq inconsistency
- add Intel SKX support
- change the AMD NTB maintainer, and fix some bugs present there
* tag 'ntb-4.10' of git://github.com/jonmason/ntb:
ntb_transport: Remove unnecessary call to ntb_peer_spad_read
NTB: Fix 'request_irq()' and 'free_irq()' inconsistancy
ntb: fix SKX NTB config space size register offsets
NTB: correct ntb_peer_spad_read for case when callback is not supplied.
MAINTAINERS: Change in maintainer for AMD NTB
ntb_transport: Limit memory windows based on available, scratchpads
NTB: Register and offset values fix for memory window
NTB: add support for hotplug feature
ntb: Adding Skylake Xeon NTB support
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.
Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull final vfs updates from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sg_write()/bsg_write() is not fit to be called under KERNEL_DS
ufs: fix function declaration for ufs_truncate_blocks
fs: exec: apply CLOEXEC before changing dumpable task flags
seq_file: reset iterator to first record for zero offset
vfs: fix isize/pos/len checks for reflink & dedupe
[iov_iter] fix iterate_all_kinds() on empty iterators
move aio compat to fs/aio.c
reorganize do_make_slave()
clone_private_mount() doesn't need to touch namespace_sem
remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
blk_scsi_cmd_filter use was deprecated by 4beab5c6 and the SCSI macros
are duplicated in blkdev.h, both likely reintroduced by a bad merge from
540eed56.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer fixes from Jens Axboe:
"Just a set of small fixes that have either been queued up after the
original pull for this merge window, or just missed the original pull
request.
- a few bcache fixes/changes from Eric and Kent
- add WRITE_SAME to the command filter whitelist frm Mauricio
- kill an unused struct member from Ritesh
- partition IO alignment fix from Stefan
- nvme sysfs printf fix from Stephen"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: check partition alignment
nvme : Use correct scnprintf in cmb show
block: allow WRITE_SAME commands with the SG_IO ioctl
block: Remove unused member (busy) from struct blk_queue_tag
bcache: partition support: add 16 minors per bcacheN device
bcache: Make gc wakeup sane, remove set_task_state()
Pull x86 cache allocation interface from Thomas Gleixner:
"This provides support for Intel's Cache Allocation Technology, a cache
partitioning mechanism.
The interface is odd, but the hardware interface of that CAT stuff is
odd as well.
We tried hard to come up with an abstraction, but that only allows
rather simple partitioning, but no way of sharing and dealing with the
per package nature of this mechanism.
In the end we decided to expose the allocation bitmaps directly so all
combinations of the hardware can be utilized.
There are two ways of associating a cache partition:
- Task
A task can be added to a resource group. It uses the cache
partition associated to the group.
- CPU
All tasks which are not member of a resource group use the group to
which the CPU they are running on is associated with.
That allows for simple CPU based partitioning schemes.
The main expected user sare:
- Virtualization so a VM can only trash only the associated part of
the cash w/o disturbing others
- Real-Time systems to seperate RT and general workloads.
- Latency sensitive enterprise workloads
- In theory this also can be used to protect against cache side
channel attacks"
[ Intel RDT is "Resource Director Technology". The interface really is
rather odd and very specific, which delayed this pull request while I
was thinking about it. The pull request itself came in early during
the merge window, I just delayed it until things had calmed down and I
had more time.
But people tell me they'll use this, and the good news is that it is
_so_ specific that it's rather independent of anything else, and no
user is going to depend on the interface since it's pretty rare. So if
push comes to shove, we can just remove the interface and nothing will
break ]
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/intel_rdt: Implement show_options() for resctrlfs
x86/intel_rdt: Call intel_rdt_sched_in() with preemption disabled
x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount
x86/intel_rdt: Fix setting of closid when adding CPUs to a group
x86/intel_rdt: Update percpu closid immeditately on CPUs affected by changee
x86/intel_rdt: Reset per cpu closids on unmount
x86/intel_rdt: Select KERNFS when enabling INTEL_RDT_A
x86/intel_rdt: Prevent deadlock against hotplug lock
x86/intel_rdt: Protect info directory from removal
x86/intel_rdt: Add info files to Documentation
x86/intel_rdt: Export the minimum number of set mask bits in sysfs
x86/intel_rdt: Propagate error in rdt_mount() properly
x86/intel_rdt: Add a missing #include
MAINTAINERS: Add maintainer for Intel RDT resource allocation
x86/intel_rdt: Add scheduler hook
x86/intel_rdt: Add schemata file
x86/intel_rdt: Add tasks files
x86/intel_rdt: Add cpus file
x86/intel_rdt: Add mkdir to resctrl file system
x86/intel_rdt: Add "info" files to resctrl file system
...
Pull more NFS client updates from Trond Myklebust:
"Highlights include:
- further attribute cache improvements to make revalidation more fine
grained
- NFSv4 locking improvements
Bugfixes:
- nfs4_fl_prepare_ds must be careful about reporting success in files
layout
- pNFS/flexfiles: Instead of marking a device inactive, remove it
from the cache"
* tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
NFSv4: Place the GETATTR operation before the CLOSE
NFSv4: Also ask for attributes when downgrading to a READ-only state
NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
pNFS: Return RW layouts on OPEN_DOWNGRADE
NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
NFSv4: ensure __nfs4_find_lock_state returns consistent result.
NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
pNFS/flexfiles: delete deviceid, don't mark inactive
NFS: Clean up nfs_attribute_timeout()
NFS: Remove unused function nfs_revalidate_inode_rcu()
NFS: Fix and clean up the access cache validity checking
NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
NFS: Clean up cache validity checking
NFS: Don't revalidate the file on close if we hold a delegation
NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
NFSv4: Update the attribute cache info in update_changeattr
Pull scsi target cleanups from Bart Van Assche:
"The changes here are:
- a few small bug fixes for the iSCSI and user space target drivers.
- minimize the target build time by about 30% by rearranging #include
directives
- fix the second argument passed to percpu_ida_alloc()
- reduce the number of false positive warnings reported by sparse
These patches pass Wu Fengguang's build bot tests and also the
linux-next tests"
* 'scsi-target-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux:
iscsi-target: Return error if unable to add network portal
target: Fix spelling mistake and unwrap multi-line text
target/iscsi: Fix double free in lio_target_tiqn_addtpg()
target/user: Fix use-after-free of tcmu_cmds if they are expired
target: Minimize #include directives
target/user: Add an #include directive
cxgbit: Add an #include directive
ibmvscsi_tgt: Add two #include directives
sbp-target: Add an #include directive
qla2xxx: Add an #include directive
configfs: Minimize #include directives
usb: gadget: Fix second argument of percpu_ida_alloc()
sbp-target: Fix second argument of percpu_ida_alloc()
target/user: Fix a data type in tcmu_queue_cmd()
target: Use NULL instead of 0 to represent a pointer