Often all is needed is these small helpers, instead of compiler.h or a
full kprobes.h. This is important for asm helpers, in fact even some
asm/kprobes.h make use of these helpers... instead just keep a generic
asm file with helpers useful for asm code with the least amount of
clutter as possible.
Likewise we need now to also address what to do about this file for both
when architectures have CONFIG_HAVE_KPROBES, and when they do not. Then
for when architectures have CONFIG_HAVE_KPROBES but have disabled
CONFIG_KPROBES.
Right now most asm/kprobes.h do not have guards against CONFIG_KPROBES,
this means most architecture code cannot include asm/kprobes.h safely.
Correct this and add guards for architectures missing them.
Additionally provide architectures that not have kprobes support with
the default asm-generic solution. This lets us force asm/kprobes.h on
the header include/linux/kprobes.h always, but most importantly we can
now safely include just asm/kprobes.h on architecture code without
bringing the full kitchen sink of header files.
Two architectures already provided a guard against CONFIG_KPROBES on its
kprobes.h: sh, arch. The rest of the architectures needed gaurds added.
We avoid including any not-needed headers on asm/kprobes.h unless
kprobes have been enabled.
In a subsequent atomic change we can try now to remove compiler.h from
include/linux/kprobes.h.
During this sweep I've also identified a few architectures defining a
common macro needed for both kprobes and ftrace, that of the definition
of the breakput instruction up. Some refer to this as
BREAKPOINT_INSTRUCTION. This must be kept outside of the #ifdef
CONFIG_KPROBES guard.
[mcgrof@kernel.org: fix arm64 build]
Link: http://lkml.kernel.org/r/CAB=NE6X1WMByuARS4mZ1g9+W=LuVBnMDnh_5zyN0CLADaVh=Jw@mail.gmail.com
[sfr@canb.auug.org.au: fixup for kprobes declarations moving]
Link: http://lkml.kernel.org/r/20170214165933.13ebd4f4@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170203233139.32682-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull tracing updates from Steven Rostedt:
"This release has no new tracing features, just clean ups, minor fixes
and small optimizations"
* tag 'trace-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (25 commits)
tracing: Remove outdated ring buffer comment
tracing/probes: Fix a warning message to show correct maximum length
tracing: Fix return value check in trace_benchmark_reg()
tracing: Use modern function declaration
jump_label: Reduce the size of struct static_key
tracing/probe: Show subsystem name in messages
tracing/hwlat: Update old comment about migration
timers: Make flags output in the timer_start tracepoint useful
tracing: Have traceprobe_probes_write() not access userspace unnecessarily
tracing: Have COMM event filter key be treated as a string
ftrace: Have set_graph_function handle multiple functions in one write
ftrace: Do not hold references of ftrace_graph_{notrace_}hash out of graph_lock
tracing: Reset parser->buffer to allow multiple "puts"
ftrace: Have set_graph_functions handle write with RDWR
ftrace: Reset fgd->hash in ftrace_graph_write()
ftrace: Replace (void *)1 with a meaningful macro name FTRACE_GRAPH_EMPTY
ftrace: Create a slight optimization on searching the ftrace_hash
tracing: Add ftrace_hash_key() helper function
ftrace: Convert graph filter to use hash tables
ftrace: Expose ftrace_hash_empty and ftrace_lookup_ip
...
Use automatic IRQ affinity assignment in the virtio layer if available,
and build the blk-mq queues based on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Similar to the PCI version, just calling into virtio instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This basically passed up the pci_irq_get_affinity information through
virtio through an optional get_vq_affinity method. It is only implemented
by the PCI backend for now, and only when we use per-virtqueue IRQs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add a struct irq_affinity pointer to the find_vqs methods, which if set
is used to tell the PCI layer to create the MSI-X vectors for our I/O
virtqueues with the proper affinity from the start. Compared to after
the fact affinity hints this gives us an instantly working setup and
allows to allocate the irq descritors node-local and avoid interconnect
traffic. Last but not least this will allow blk-mq queues are created
based on the interrupt affinity for storage drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In scenario of intensively node allocation, free nids will be ran out
soon, then it needs to stop to load free nids by traversing NAT blocks,
in worse case, if NAT blocks does not be cached in memory, it generates
IOs which slows down our foreground operations.
In order to speed up node allocation, in this patch we introduce a new
free_nid_bitmap array, so there is an bitmap table for each NAT block,
Once the NAT block is loaded, related bitmap cache will be switched on,
and bitmap will be set during traversing nat entries in NAT block, later
we can query and update nid usage status in memory completely.
With such implementation, I expect performance of node allocation can be
improved in the long-term after filesystem image is mounted.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patches adds bitmaps to represent empty or full NAT blocks containing
free nid entries.
If we can find valid crc|cp_ver in the last block of checkpoint pack, we'll
use these bitmaps when building free nids. In order to avoid checkpointing
burden, up-to-date bitmaps will be flushed only during umount time. So,
normally we can get this gain, but when power-cut happens, we rely on fsck.f2fs
which recovers this bitmap again.
After this patch, we build free nids from nid #0 at mount time to make more
full NAT blocks, but in runtime, we check empty NAT blocks to load free nids
without loading any NAT pages from disk.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit 9908859aca (cpuidle/menu: add per CPU PM QoS resume
latency consideration) the cpuidle menu governor calls
dev_pm_qos_read_value() on CPU devices to read the current resume
latency QoS constraint values for them. That function takes a spinlock
to prevent the device's power.qos pointer from becoming NULL during
the access which is a problem for the RT patchset where spinlocks are
converted into mutexes and the idle loop stops working.
However, it is not even necessary for the menu governor to take
that spinlock, because the power.qos pointer accessed under it
cannot be modified during the access anyway.
For this reason, introduce a "raw" routine for accessing device
QoS resume latency constraints without locking and use it in the
menu governor.
Fixes: 9908859aca (cpuidle/menu: add per CPU PM QoS resume latency consideration)
Acked-by: Alex Shi <alex.shi@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull watchdog updates from Guenter Roeck:
"Wim asked me to handle the watchdog pull request this time around.
Key changes:
- New drivers: Cortina Gemini, ZTE's zx2967 family, NIC7018
- Convert to use device managed functions: ebc-c384_wdt, tegra_wdt,
da9063_wdt, da9062_wdt, da9055_wdt, da9052_wdt, bcm2835_wdt,
mena21_wdt, wm831x_wdt, digicolor_wdt, intel-mid_wdt, meson_wdt,
sunxi_wdt, aspeed_wdt, coh901327_wdt, iTCO_wdt
- Use watchdog core to install restart handler: tangox, dw_wdt,
bcm2835_wdt, asm9260_wdt, bcm47xx_wdt
- Convert ts72xx_wdt driver to watchdog core
- Let core handle heartbeat in ep93xx_wdt driver
- Enable COMPILE_TEST where possible
- Various other improvements"
* tag 'watchdog-for-linus-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (54 commits)
watchdog: s3c2410: Add prefix to local function
watchdog: s3c2410: Select MFD_SYSCON on all Exynos platforms
watchdog: s3c2410: Use dev_dbg instead of pr_info
watchdog: s3c2410: Fix infinite interrupt in soft mode
watchdog: s3c2410: Remove confusing CONFIG prefix from local defines
watchdog: softdog: make pretimeout support a compile option
watchdog: zx2967: add watchdog controller driver for ZTE's zx2967 family
dt: bindings: add documentation for zx2967 family watchdog controller
watchdog: sama5d4: Implement resume hook
watchdog: sama5d4: Cache MR instead of a partial config
watchdog: ts72xx_wdt: convert driver to watchdog core
watchdog: ep93xx_wdt: cleanup and let the core handle the heartbeat
watchdog: RDC321X_WDT always depends on PCI
watchdog: add driver for Cortina Gemini watchdog
watchdog: add DT bindings for Cortina Gemini
watchdog: constify watchdog_ops structures
watchdog: Introduce watchdog_stop_on_unregister helper
watchdog: ebc-c384_wdt: Utilize devm_ functions in driver probe callback
watchdog: tegra_wdt: Convert to use device managed functions
watchdog: da9063_wdt: Convert to use device managed functions
...
Pull clk updates from Stephen Boyd:
"The usual collection of new drivers, non-critical fixes, and updates
to existing clk drivers. The bulk of the work is on Allwinner and
Rockchip SoCs, but there's also an Intel Atom driver in here too.
New Drivers:
- Tegra BPMP firmware
- Hisilicon hi3660 SoCs
- Rockchip rk3328 SoCs
- Intel Atom PMC
- STM32F746
- IDT VersaClock 5P49V5923 and 5P49V5933
- Marvell mv98dx3236 SoCs
- Allwinner V3s SoCs
Removed Drivers:
- Samsung Exynos4415 SoCs
Updates:
- Migrate ABx500 to OF
- Qualcomm IPQ4019 CPU clks and general PLL support
- Qualcomm MSM8974 RPM
- Rockchip non-critical fixes and clk id additions
- Samsung Exynos4412 CPUs
- Socionext UniPhier NAND and eMMC support
- ZTE zx296718 i2s and other audio clks
- Renesas CAN and MSIOF clks for R-Car M3-W
- Renesas resets for R-Car Gen2 and Gen3 and RZ/G1
- TI CDCE913, CDCE937, and CDCE949 clk generators
- Marvell Armada ap806 CPU frequencies
- STM32F4* I2S/SAI support
- Broadcom BCM2835 DSI support
- Allwinner sun5i and A80 conversion to new style clk bindings"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (130 commits)
clk: renesas: mstp: ensure register writes complete
clk: qcom: Do not drop device node twice
clk: mvebu: adjust clock handling for the CP110 system controller
clk: mvebu: Expand mv98dx3236-core-clock support
clk: zte: add i2s clocks for zx296718
clk: sunxi-ng: sun9i-a80: Fix wrong pointer passed to PTR_ERR()
clk: sunxi-ng: select SUNXI_CCU_MULT for sun5i
clk: sunxi-ng: Check kzalloc() for errors and cleanup error path
clk: tegra: Add BPMP clock driver
clk: uniphier: add eMMC clock for LD11 and LD20 SoCs
clk: uniphier: add NAND clock for all UniPhier SoCs
ARM: dts: sun9i: Switch to new clock bindings
clk: sunxi-ng: Add A80 Display Engine CCU
clk: sunxi-ng: Add A80 USB CCU
clk: sunxi-ng: Add A80 CCU
clk: sunxi-ng: Support separately grouped PLL lock status register
clk: sunxi-ng: mux: Get closest parent rate possible with CLK_SET_RATE_PARENT
clk: sunxi-ng: mux: honor CLK_SET_RATE_NO_REPARENT flag
clk: sunxi-ng: mux: Fix determine_rate for mux clocks with pre-dividers
clk: qcom: SDHCI enablement on Nexus 5X / 6P
...
Pull i2c updates from Wolfram Sang:
"I2C has for you two new drivers (Tegra BPMP and STM32F4), interrupt
support for pca954x muxes, and a bunch of driver bugfixes and
improvements. Nothing really special this cycle.
A few commits have been added to my tree just recently. Those are the
Tegra BPMP driver and a few straightforward bugfixes or cleanups which
I prefer to have upstream rather soonish. The rest had proper
linux-next exposure"
* 'i2c/for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (25 commits)
i2c: thunderx: Replace pci_enable_msix()
i2c: exynos5: fix arbitration lost handling
i2c: exynos5: disable fifo-almost-empty irq signal when necessary
i2c: at91: ensure state is restored after suspending
i2c: bcm2835: Avoid possible NULL ptr dereference
i2c: Add Tegra BPMP I2C proxy driver
dt-bindings: Add Tegra186 BPMP I2C binding
misc: eeprom: at24: use device_property_*() functions instead of of_get_property()
i2c: mux: pca954x: Add interrupt controller support
dt: bindings: i2c-mux-pca954x: Add documentation for interrupt controller
i2c: mux: pca954x: Add missing pca9542 definition to chip_desc
i2c: riic: correctly finish transfers
i2c: i801: Add support for Intel Gemini Lake
i2c: mux: pca9541: Export OF device ID table as module aliases
i2c: mux: pca954x: Export OF device ID table as module aliases
i2c: mux: mlxcpld: remove unused including <linux/version.h>
i2c: busses: constify i2c_algorithm structures
i2c: i2c-mux-gpio: rename i2c-gpio-mux to i2c-mux-gpio
i2c: sh_mobile: document support for r8a7796 (R-Car M3-W)
i2c: i2c-cros-ec-tunnel: Reduce logging noise
...
Pull rdma DMA mapping updates from Doug Ledford:
"Drop IB DMA mapping code and use core DMA code instead.
Bart Van Assche noted that the ib DMA mapping code was significantly
similar enough to the core DMA mapping code that with a few changes it
was possible to remove the IB DMA mapping code entirely and switch the
RDMA stack to use the core DMA mapping code.
This resulted in a nice set of cleanups, but touched the entire tree
and has been kept separate for that reason."
* tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits)
IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it
IB/core: Remove ib_device.dma_device
nvme-rdma: Switch from dma_device to dev.parent
RDS: net: Switch from dma_device to dev.parent
IB/srpt: Modify a debug statement
IB/srp: Switch from dma_device to dev.parent
IB/iser: Switch from dma_device to dev.parent
IB/IPoIB: Switch from dma_device to dev.parent
IB/rxe: Switch from dma_device to dev.parent
IB/vmw_pvrdma: Switch from dma_device to dev.parent
IB/usnic: Switch from dma_device to dev.parent
IB/qib: Switch from dma_device to dev.parent
IB/qedr: Switch from dma_device to dev.parent
IB/ocrdma: Switch from dma_device to dev.parent
IB/nes: Remove a superfluous assignment statement
IB/mthca: Switch from dma_device to dev.parent
IB/mlx5: Switch from dma_device to dev.parent
IB/mlx4: Switch from dma_device to dev.parent
IB/i40iw: Remove a superfluous assignment statement
IB/hns: Switch from dma_device to dev.parent
...
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
- fix for font color when console is switched to another fb driver
- deferred probing fixes for simplefb driver
- preparations to add support of an optional GPIO to enable panel for
ARM CLCD driver
- some improvements for ssd1307fb driver
- cleanups for OMAP fbdev LCD drivers
- misc fixes/cleanups for various fb drivers
* tag 'fbdev-v4.11' of git://github.com/bzolnier/linux: (30 commits)
video: fbdev: fsl-diu-fb: fix spelling mistake "palette"
fbdev: ssd1307fb: include linux/gpio/consumer.h
video: fbdev: fsl-diu-fb: remove impossible condition
video: fbdev: amifb: remove impossible condition
fbdev/ssd1307fb: clear screen in probe
fbdev/ssd1307fb: add support to enable VBAT
fbdev: ssd1307fb: Make reset gpio devicetree property optional
fbdev: ssd1307fb: Remove reset-active-low from the DT binding document
fbdev: ssd1307fb: Start to use gpiod API for reset gpio
video: fbdev: sh_mobile_lcdcfb: fix error return code in sh_mobile_lcdc_probe()
video: fbdev: offb: switch to using for_each_node_by_type
video/console: use setup_timer and mod_timer instead of init_timer
fbdev: omap/lcd: Make callbacks optional
fbdev: omap/lcd: Staticize non-exported lcd_panel structs
fbdev: omap/lcd: Remove no-op driver callbacks
video/mbx: use simple_open()
video: fbdev: stifb: handle NULL return value from ioremap_nocache
video: fbdev: pmagb-b-fb: Remove bad `__init' annotation
video: fbdev: pmag-ba-fb: Remove bad `__init' annotation
video: ARM CLCD: use panel device node for getting backlight and mode
...
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- misc bits
- KASAN updates
- procfs
- lib/ updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (124 commits)
checkpatch: remove false unbalanced braces warning
checkpatch: notice unbalanced else braces in a patch
checkpatch: add another old address for the FSF
checkpatch: update $logFunctions
checkpatch: warn on logging continuations
checkpatch: warn on embedded function names
lib/lz4: remove back-compat wrappers
fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version
crypto: change LZ4 modules to work with new LZ4 module version
lib/decompress_unlz4: change module to work with new LZ4 module version
lib: update LZ4 compressor module
lib/test_sort.c: make it explicitly non-modular
lib: add CONFIG_TEST_SORT to enable self-test of sort()
rbtree: use designated initializers
linux/kernel.h: fix DIV_ROUND_CLOSEST to support negative divisors
lib/find_bit.c: micro-optimise find_next_*_bit
lib: add module support to atomic64 tests
lib: add module support to glob tests
lib: add module support to crc32 tests
kernel/ksysfs.c: add __ro_after_init to bin_attribute structure
...
0-day bot reported some new objtool warnings which were caused by the
new annotate_unreachable() macro:
fs/afs/flock.o: warning: objtool: afs_do_unlk()+0x0: duplicate frame pointer save
fs/afs/flock.o: warning: objtool: afs_do_unlk()+0x0: frame pointer state mismatch
fs/btrfs/delayed-inode.o: warning: objtool: btrfs_delete_delayed_dir_index()+0x0: duplicate frame pointer save
fs/btrfs/delayed-inode.o: warning: objtool: btrfs_delete_delayed_dir_index()+0x0: frame pointer state mismatch
fs/dlm/lock.o: warning: objtool: _grant_lock()+0x0: duplicate frame pointer save
fs/dlm/lock.o: warning: objtool: _grant_lock()+0x0: frame pointer state mismatch
fs/ocfs2/alloc.o: warning: objtool: ocfs2_mv_path()+0x0: duplicate frame pointer save
fs/ocfs2/alloc.o: warning: objtool: ocfs2_mv_path()+0x0: frame pointer state mismatch
It turns out that, for older versions of GCC, if a function has multiple
BUG() incantations, GCC will sometimes merge the corresponding
annotate_unreachable() inline asm statements into a single block. That
has the undesirable effect of removing one of the entries in the
__unreachable section, confusing objtool greatly.
A workaround for this issue is to ensure that each instance of the
inline asm statement uses a different label, so that GCC sees the
statements are unique and leaves them alone. The inline asm ‘%=’ token
could be used for that, but unfortunately older versions of GCC don't
support it. So I implemented a poor man's version of it with the
__LINE__ macro.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d1091c7fa3 ("objtool: Improve detection of BUG() and other dead ends")
Link: http://lkml.kernel.org/r/0c14b00baf9f68d1b0221ddb6c88b925181c8be8.1487997036.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Patch series "Update LZ4 compressor module", v7.
This patchset updates the LZ4 compression module to a version based on
LZ4 v1.7.3 allowing to use the fast compression algorithm aka LZ4 fast
which provides an "acceleration" parameter as a tradeoff between high
compression ratio and high compression speed.
We want to use LZ4 fast in order to support compression in lustre and
(mostly, based on that) investigate data reduction techniques in behalf
of storage systems.
Also, it will be useful for other users of LZ4 compression, as with LZ4
fast it is possible to enable applications to use fast and/or high
compression depending on the usecase. For instance, ZRAM is offering a
LZ4 backend and could benefit from an updated LZ4 in the kernel.
LZ4 homepage: http://www.lz4.org/
LZ4 source repository: https://github.com/lz4/lz4 Source version: 1.7.3
Benchmark (taken from [1], Core i5-4300U @1.9GHz):
----------------|--------------|----------------|----------
Compressor | Compression | Decompression | Ratio
----------------|--------------|----------------|----------
memcpy | 4200 MB/s | 4200 MB/s | 1.000
LZ4 fast 50 | 1080 MB/s | 2650 MB/s | 1.375
LZ4 fast 17 | 680 MB/s | 2220 MB/s | 1.607
LZ4 fast 5 | 475 MB/s | 1920 MB/s | 1.886
LZ4 default | 385 MB/s | 1850 MB/s | 2.101
[1] http://fastcompression.blogspot.de/2015/04/sampling-or-faster-lz4.html
[PATCH 1/5] lib: Update LZ4 compressor module
[PATCH 2/5] lib/decompress_unlz4: Change module to work with new LZ4 module version
[PATCH 3/5] crypto: Change LZ4 modules to work with new LZ4 module version
[PATCH 4/5] fs/pstore: fs/squashfs: Change usage of LZ4 to work with new LZ4 version
[PATCH 5/5] lib/lz4: Remove back-compat wrappers
This patch (of 5):
Update the LZ4 kernel module to LZ4 v1.7.3 by Yann Collet. The kernel
module is inspired by the previous work by Chanho Min. The updated LZ4
module will not break existing code since the patchset contains
appropriate changes.
API changes:
New method LZ4_compress_fast which differs from the variant available in
kernel by the new acceleration parameter, allowing to trade compression
ratio for more compression speed and vice versa.
LZ4_decompress_fast is the respective decompression method, featuring a
very fast decoder (multiple GB/s per core), able to reach RAM speed in
multi-core systems. The decompressor allows to decompress data
compressed with LZ4 fast as well as the LZ4 HC (high compression)
algorithm.
Also the useful functions LZ4_decompress_safe_partial and
LZ4_compress_destsize were added. The latter reverses the logic by
trying to compress as much data as possible from source to dest while
the former aims to decompress partial blocks of data.
A bunch of streaming functions were also added which allow
compressig/decompressing data in multiple steps (so called "streaming
mode").
The methods lz4_compress and lz4_decompress_unknownoutputsize are now
known as LZ4_compress_default respectivley LZ4_decompress_safe. The old
methods will be removed since there's no callers left in the code.
[arnd@arndb.de: fix KERNEL_LZ4 support]
Link: http://lkml.kernel.org/r/20170208211946.2839649-1-arnd@arndb.de
[akpm@linux-foundation.org: simplify]
[akpm@linux-foundation.org: fix the simplification]
[4sschmid@informatik.uni-hamburg.de: fix performance regressions]
Link: http://lkml.kernel.org/r/1486898178-17125-2-git-send-email-4sschmid@informatik.uni-hamburg.de
[4sschmid@informatik.uni-hamburg.de: v8]
Link: http://lkml.kernel.org/r/1487182598-15351-2-git-send-email-4sschmid@informatik.uni-hamburg.de
Link: http://lkml.kernel.org/r/1486321748-19085-2-git-send-email-4sschmid@informatik.uni-hamburg.de
Signed-off-by: Sven Schmidt <4sschmid@informatik.uni-hamburg.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Bongkyu Kim <bongkyu.kim@lge.com>
Cc: Rui Salvaterra <rsalvaterra@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While working on a thermal driver I encounter a scenario where the
divisor could be negative, instead of adding local code to handle this I
though I first try to add support for this in DIV_ROUND_CLOSEST.
Add support to DIV_ROUND_CLOSEST for negative divisors if both dividend
and divisor variable types are signed. This should not alter current
behavior for users of the macro as previously negative divisors where
not supported.
Before:
DIV_ROUND_CLOSEST( 59, 4) = 15
DIV_ROUND_CLOSEST( 59, -4) = -14
DIV_ROUND_CLOSEST( -59, 4) = -15
DIV_ROUND_CLOSEST( -59, -4) = 14
After:
DIV_ROUND_CLOSEST( 59, 4) = 15
DIV_ROUND_CLOSEST( 59, -4) = -15
DIV_ROUND_CLOSEST( -59, 4) = -15
DIV_ROUND_CLOSEST( -59, -4) = 15
[akpm@linux-foundation.org: fix comment, per Guenter]
Link: http://lkml.kernel.org/r/20161222102217.29011-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CHECK_DATA_CORRUPTION() macro was designed to have callers do
something meaningful/protective on failure. However, using "return
false" in the macro too strictly limits the design patterns of callers.
Instead, let callers handle the logic test directly, but make sure that
the result IS checked by forcing __must_check (which appears to not be
able to be used directly on macro expressions).
Link: http://lkml.kernel.org/r/20170206204547.GA125312@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per memcg slab accounting and kasan have a problem with kmem_cache
destruction.
- kmem_cache_create() allocates a kmem_cache, which is used for
allocations from processes running in root (top) memcg.
- Processes running in non root memcg and allocating with either
__GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
kmem_cache.
- Kasan catches use-after-free by having kfree() and kmem_cache_free()
defer freeing of objects. Objects are placed in a quarantine.
- kmem_cache_destroy() destroys root and non root kmem_caches. It takes
care to drain the quarantine of objects from the root memcg's
kmem_cache, but ignores objects associated with non root memcg. This
causes leaks because quarantined per memcg objects refer to per memcg
kmem cache being destroyed.
To see the problem:
1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
2) from non root memcg, allocate and free a few objects from cache
3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy()
will trigger a "Slab cache still has objects" warning indicating
that the per memcg kmem_cache structure was leaked.
Fix the leak by draining kasan quarantined objects allocated from non
root memcg.
Racing memcg deletion is tricky, but handled. kmem_cache_destroy() =>
shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
flushes per memcg quarantined objects, even if that memcg has been
rmdir'd and gone through memcg_deactivate_kmem_caches().
This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
enabled. So I don't think it's worth patching stable kernels.
Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 31bc3858ea ("add automatic onlining policy for the newly added
memory") provides the capability to have added memory automatically
onlined during add, but this appears to be slightly broken.
The current implementation uses walk_memory_range() to call
online_memory_block, which uses memory_block_change_state() to online
the memory. Instead, we should be calling device_online() for the
memory block in online_memory_block(). This would online the memory
(the memory bus online routine memory_subsys_online() called from
device_online calls memory_block_change_state()) and properly update the
device struct offline flag.
As a result of the current implementation, attempting to remove a memory
block after adding it using auto online fails. This is because doing a
remove, for instance
echo offline > /sys/devices/system/memory/memoryXXX/state
uses device_offline() which checks the dev->offline flag.
Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If madvise(2) advice will result in the underlying vma being split and
the number of areas mapped by the process will exceed
/proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.
EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
is temporarily unavailable. It indicates that userspace should retry
the advice in the near future. This is important for advice such as
MADV_DONTNEED which is often used by malloc implementations to free
memory back to the system: we really do want to free memory back when
madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
or mempolicies) cannot be allocated.
Encountering /proc/sys/vm/max_map_count is not a temporary failure,
however, so return ENOMEM to indicate this is a more serious issue. A
followup patch to the man page will specify this behavior.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "1G transparent hugepage support for device dax", v2.
The following series implements support for 1G trasparent hugepage on
x86 for device dax. The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.
Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].
[1]: https://lkml.org/lkml/2016/1/31/52
Comments from Nilesh @ Oracle:
There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:
processes : 10,000
memory : 6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB
As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.
An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.
This patch (of 3):
In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.
[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans. Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.
This can be traced back to recent efforts of thrash avoidance. Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect and
activate refaulting pages right away, distilling used-once pages on the
inactive list much more effectively. This is by design, and it makes
sense for clean cache. But for the most part our workload's cache
faults are refaults and its use-once cache is from streaming writes. We
end up with most of the inactive list dirty, and we don't go after the
active cache as long as we have use-once pages around.
But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off. Even if the refaults happen, reads are
faster than writes. Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.
To accomplish this, activate pages that are dirty or under writeback
when they reach the end of the inactive LRU. The pages are marked for
immediate reclaim, meaning they'll get moved back to the inactive LRU
tail as soon as they're written back and become reclaimable. But in the
meantime, by reducing the inactive list to only immediately reclaimable
pages, we allow the scanner to deactivate and refill the inactive list
with clean cache from the active list tail to guarantee forward
progress.
[hannes@cmpxchg.org: update comment]
Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory pressure can put dirty pages at the end of the LRU without
anybody running into dirty limits. Don't start writing individual pages
from kswapd while the flushers might be asleep.
Unlike the old direct reclaim flusher wakeup (removed in the next patch)
that flushes the number of pages just scanned, this patch wakes the
flushers for all outstanding dirty pages. That seemed to perform better
in a synthetic test that pushes dirty pages to the end of the LRU and
into reclaim, because we know LRU aging outstrips writeback already, and
this way we give younger dirty pages a headstart rather than wait until
reclaim runs into them as well. It also means less plugging and risk of
exhausting the struct request pool from reclaim.
There is a concern that this will cause temporary files that used to get
dirtied and truncated before writeback to now get written to disk under
memory pressure. If this turns out to be a real problem, we'll have to
revisit this and tame the reclaim flusher wakeups.
[hannes@cmpxchg.org: mention dirty expiration as a condition]
Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org
Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: vmscan: fix kswapd writeback regression".
We noticed a regression on multiple hadoop workloads when moving from
3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page
writeout, causing direct reclaim herds that also don't make progress.
I tracked it down to the thrash avoidance efforts after 3.10 that make
the kernel better at keeping use-once cache and use-many cache sorted on
the inactive and active list, with more aggressive protection of the
active list as long as there is inactive cache. Unfortunately, our
workload's use-once cache is mostly from streaming writes. Waiting for
writes to avoid potential reloads in the future is not a good tradeoff.
These patches do the following:
1. Wake the flushers when kswapd sees a lump of dirty pages. It's
possible to be below the dirty background limit and still have cache
velocity push them through the LRU. So start a-flushin'.
2. Let kswapd only write pages that have been rotated twice. This makes
sure we really tried to get all the clean pages on the inactive list
before resorting to horrible LRU-order writeback.
3. Move rotating dirty pages off the inactive list. Instead of churning
or waiting on page writeback, we'll go after clean active cache. This
might lead to thrashing, but in this state memory demand outstrips IO
speed anyway, and reads are faster than writes.
Mel backported the series to 4.10-rc5 with one minor conflict and ran a
couple of tests on it. Mix of read/write random workload didn't show
anything interesting. Write-only database didn't show much difference
in performance but there were slight reductions in IO -- probably in the
noise.
simoop did show big differences although not as big as Mel expected.
This is Chris Mason's workload that similate the VM activity of hadoop.
Mel won't go through the full details but over the samples measured
during an hour it reported
4.10.0-rc5 4.10.0-rc5
vanilla johannes-v1r1
Amean p50-Read 21346531.56 ( 0.00%) 21697513.24 ( -1.64%)
Amean p95-Read 24700518.40 ( 0.00%) 25743268.98 ( -4.22%)
Amean p99-Read 27959842.13 ( 0.00%) 28963271.11 ( -3.59%)
Amean p50-Write 1138.04 ( 0.00%) 989.82 ( 13.02%)
Amean p95-Write 1106643.48 ( 0.00%) 12104.00 ( 98.91%)
Amean p99-Write 1569213.22 ( 0.00%) 36343.38 ( 97.68%)
Amean p50-Allocation 85159.82 ( 0.00%) 79120.70 ( 7.09%)
Amean p95-Allocation 204222.58 ( 0.00%) 129018.43 ( 36.82%)
Amean p99-Allocation 278070.04 ( 0.00%) 183354.43 ( 34.06%)
Amean final-p50-Read 21266432.00 ( 0.00%) 21921792.00 ( -3.08%)
Amean final-p95-Read 24870912.00 ( 0.00%) 26116096.00 ( -5.01%)
Amean final-p99-Read 28147712.00 ( 0.00%) 29523968.00 ( -4.89%)
Amean final-p50-Write 1130.00 ( 0.00%) 977.00 ( 13.54%)
Amean final-p95-Write 1033216.00 ( 0.00%) 2980.00 ( 99.71%)
Amean final-p99-Write 1517568.00 ( 0.00%) 32672.00 ( 97.85%)
Amean final-p50-Allocation 86656.00 ( 0.00%) 78464.00 ( 9.45%)
Amean final-p95-Allocation 211712.00 ( 0.00%) 116608.00 ( 44.92%)
Amean final-p99-Allocation 287232.00 ( 0.00%) 168704.00 ( 41.27%)
The latencies are actually completely horrific in comparison to 4.4 (and
4.10-rc5 is worse than 4.9 according to historical data for reasons Mel
hasn't analysed yet).
Still, 95% of write latency (p95-write) is halved by the series and
allocation latency is way down. Direct reclaim activity is one fifth of
what it was according to vmstats. Kswapd activity is higher but this is
not necessarily surprising. Kswapd efficiency is unchanged at 99% (99%
of pages scanned were reclaimed) but direct reclaim efficiency went from
77% to 99%
In the vanilla kernel, 627MB of data was written back from reclaim
context. With the series, no data was written back. With or without
the patch, pages are being immediately reclaimed after writeback
completes. However, with the patch, only 1/8th of the pages are
reclaimed like this.
This patch (of 5):
We have an elaborate dirty/writeback throttling mechanism inside the
reclaim scanner, but for that to work the pages have to go through
shrink_page_list() and get counted for what they are. Otherwise, we
mess up the LRU order and don't match reclaim speed to writeback.
Especially during deactivation, there is never a reason to skip dirty
pages; nothing is even trying to write them out from there. Don't mess
up the LRU order for nothing, shuffle these pages along.
Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "userfaultfd: non-cooperative: add madvise() event for
MADV_REMOVE request".
These patches add notification of madvise(MADV_REMOVE) event to
non-cooperative userfaultfd monitor.
The first pacth renames EVENT_MADVDONTNEED to EVENT_REMOVE along with
relevant functions and structures. Using _REMOVE instead of
_MADVDONTNEED describes the event semantics more clearly and I hope it's
not too late for such change in the ABI.
This patch (of 3):
The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about
removal of certain range from address space tracked by userfaultfd.
Hence, UFFD_EVENT_REMOVE seems to better reflect the operation
semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to
'remove' and the madvise_userfault_dontneed callback is renamed to
userfaultfd_remove.
Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>