Fixes gcc '-Wunused-but-set-variable' warning:
drivers/misc/vmw_vmci/vmci_host.c: In function 'vmci_host_do_alloc_queuepair':
drivers/misc/vmw_vmci/vmci_host.c:450:6: warning:
variable 'cid' set but not used [-Wunused-but-set-variable]
u32 cid;
^
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Looks like during merging the bulk POLL* -> EPOLL* replacement
missed the patch
'commit af336cabe0 ("mei: limit the number of queued writes")'
Fix sparse warning:
drivers/misc/mei/main.c:602:13: warning: restricted __poll_t degrades to integer
drivers/misc/mei/main.c:605:30: warning: invalid assignment: |=
drivers/misc/mei/main.c:605:30: left side has type restricted __poll_t
drivers/misc/mei/main.c:605:30: right side has type int
Fixes: af336cabe0 ("mei: limit the number of queued writes")
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When adding a VMCI resource, the check for an existing entry
would ignore that the new entry could be a wildcard. This could
result in multiple resource entries that would match a given
handle. One disastrous outcome of this is that the
refcounting used to ensure that delayed callbacks for VMCI
datagrams have run before the datagram is destroyed can be
wrong, since the refcount could be increased on the duplicate
entry. This in turn leads to a use after free bug. This issue
was discovered by Hangbin Liu using KASAN and syzkaller.
Fixes: bc63dedb7d ("VMCI: resource object implementation")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Within at24_loop_until_timeout the timestamp used for timeout checking
is recorded after the I2C transfer and sleep_range(). Under high CPU
load either the execution time for I2C transfer or sleep_range() could
actually be larger than the timeout value. Worst case the I2C transfer
is only tried once because the loop will exit due to the timeout
although the EEPROM is now ready.
To fix this issue the timestamp is recorded at the beginning of each
iteration. That is, before I2C transfer and sleep. Then the timeout
is actually checked against the timestamp of the previous iteration.
This makes sure that even if the timeout is reached, there is still one
more chance to try the I2C transfer in case the EEPROM is ready.
Example:
If you have a system which combines high CPU load with repeated EEPROM
writes you will run into the following scenario.
- System makes a successful regmap_bulk_write() to EEPROM.
- System wants to perform another write to EEPROM but EEPROM is still
busy with the last write.
- Because of high CPU load the usleep_range() will sleep more than
25 ms (at24_write_timeout).
- Within the over-long sleeping the EEPROM finished the previous write
operation and is ready again.
- at24_loop_until_timeout() will detect timeout and won't try to write.
Signed-off-by: Wang Xin <xin.wang7@cn.bosch.com>
Signed-off-by: Mark Jonas <mark.jonas@de.bosch.com>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
The function should return -EFAULT when copy_from_user fails. Even
though the caller does not distinguish them. but we should keep backward
compatibility.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clang warns when a variable is assigned to itself.
drivers/misc/mic/scif/scif_dma.c:1577:12: warning: explicitly assigning
value of variable of type 'bool' (aka '_Bool') to itself [-Wself-assign]
dst_local = dst_local;
~~~~~~~~~ ^ ~~~~~~~~~
1 warning generated.
This is usually done to avoid an unused variable warning, which is the
case here. dst_local is used nowhere in this function, which has been
the case since the initial code drop in commit 7cc31cd277 ("misc: mic:
SCIF DMA and CPU copy interface") in 2015. Just remove the variable, it
can be added back if it was intended to be used.
Link: https://github.com/ClangBuiltLinux/linux/issues/107
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
module.h already contains moduleparam.h, so it is safe to remove
the redundant include.
The issue is detected with the help of Coccinelle.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kgdbts current fails when compiled with restrict:
drivers/misc/kgdbts.c: In function ‘configure_kgdbts’:
drivers/misc/kgdbts.c:1070:2: error: ‘strcpy’ source argument is the same as destination [-Werror=restrict]
strcpy(config, opt);
^~~~~~~~~~~~~~~~~~~
As the error says, config is being used in both the source and destination.
Refactor the code to avoid the extra copy and put the parsing closer to
the actual location.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clang warns when multiple pairs of parentheses are used for a single
conditional statement.
drivers/misc/echo/echo.c:384:27: warning: equality comparison with
extraneous parentheses [-Wparentheses-equality]
if ((ec->nonupdate_dwell == 0)) {
~~~~~~~~~~~~~~~~~~~~^~~~
drivers/misc/echo/echo.c:384:27: note: remove extraneous parentheses
around the comparison to silence this warning
if ((ec->nonupdate_dwell == 0)) {
~ ^ ~
drivers/misc/echo/echo.c:384:27: note: use '=' to turn this equality
comparison into an assignment
if ((ec->nonupdate_dwell == 0)) {
^~
=
1 warning generated.
Remove them and while we're at it, simplify the zero check as '!var' is
used more than 'var == 0'.
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
of_node_put has taken the null pinter check into account. So it is
safe to remove the duplicated check before of_node_put.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is useful to expose how many times the balloon resets. If it happens
more than very rarely - this is an indication for a problem.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change all the remaining return values to int to avoid mistakes. Reduce
indentation when possible.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for supporting compaction and OOM notification, this
patch reworks the inflate/deflate loops. The main idea is to separate
the allocation, communication with the hypervisor, and the handling of
errors from each other. Doing will allow us to perform concurrent
inflation and deflation, excluding the actual communication with the
hypervisor.
To do so, we need to get rid of the remaining global state that is kept
in the balloon struct, specifically the refuse_list. When the VM
communicates with the hypervisor, it does not free or put back pages
to the balloon list and instead only moves the pages whose status
indicated failure into a refuse_list on the stack. Once the operation
completes, the inflation or deflation functions handle the list
appropriately.
As we do that, we can consolidate the communication with the hypervisor
for both the lock and unlock operations into a single function. We also
reuse the deflation function for popping the balloon.
As a preparation for preventing races, we hold a spinlock when the
communication actually takes place, and use atomic operations for
updating the balloon size. The balloon page list is still racy and will
be handled in the next patch.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To allow the balloon statistics to be updated concurrently, we change
the statistics to be held per core and aggregate it when needed.
To avoid the memory overhead of keeping the statistics per core, and
since it is likely not used by most users, we start updating the
statistics only after the first use. A read-write semaphore is used to
protect the statistics initialization and avoid races. This semaphore is
(and will) be used to protect configuration changes during reset.
While we are at it, address some other issues: change the statistics
update to inline functions instead of define; use ulong for saving the
statistics; and clean the statistics printouts.
Note that this patch changes the format of the outputs. If there are any
automatic tools that use the statistics, they might fail.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As we want to leave as little as possible on the global balloon
structure, to avoid possible future races, we want to get rid sysinfo.
We can actually get the total_ram directly, and simplify the logic of
vmballoon_send_get_target() a little.
While we are doing that, let's return int and avoid mistakes due to
bool/int conversions.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The required change in the balloon size is currently computed in
vmballoon_work(), vmballoon_inflate() and vmballoon_deflate(). Refactor
it to simplify the next patches.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The name of the macro'd VMW_BALLOON_2M_SHIFT is misleading. The value
reflects 2M huge-page order. Unfortunately, we cannot use
HPAGE_PMD_ORDER, since it is not defined when transparent huge-pages are
off, so we need to define our own one.
Rename it to VMW_BALLOON_2M_ORDER. No functional change.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, when the hypervisor rejects a page during lock operation, the
VM treats pages differently according to the error-code: in certain
cases the page is immediately freed, and in others it is put on a
rejection list and only freed later.
The behavior does not make too much sense. If the page is freed
immediately it is very likely to be used again in the next batch of
allocations, and be rejected again.
In addition, for support of compaction and OOM notifiers, we wish to
separate the logic that communicates with the hypervisor (as well as
analyzes the status of each page) from the logic that allocates or free
pages.
Treat all errors the same way, queuing the pages on the refuse list.
Move to the next allocation size (4k) when too many pages are refused.
Free the refused pages when moving to the next size to avoid situations
in which too much memory is waiting to be freed on the refused list.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current abstractions for batch vs single operations seem suboptimal
and complicate the implementation of additional features (OOM,
compaction).
The immediate problem of the current abstractions is that they cause
differences in how operations are handled when batching is on or off.
For example, the refused_alloc counter is not updated when batching is
on. These discrepancies are caused by code redundancies.
Instead, this patch presents three type of operations, according to
whether batching is on or off: (1) add page, (2) communication with the
hypervisor and (3) retrieving the status of a page.
To avoid the overhead of virtual functions, and since we do not expect
additional interfaces for communication with the hypervisor, we use
static keys instead of virtual functions.
Finally, while we are at it, change vmballoon_init_batching() to return
int instead of bool, to be consistent in the return type and avoid
potential coding errors.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Splitting the allocations between sleeping and non-sleeping made some
sort of sense as long as rate-limiting was enabled. Now that it is
removed, we need to decide - either we want sleeping allocations or not.
Since no other Linux balloon driver (hv, Xen, virtio) uses sleeping
allocations, use the same approach.
We do distinguish, however, between 2MB allocations and 4kB allocations
and prevent reclamation on 2MB. In both cases, we avoid using emergency
low-memory pools, as it may cause undesired effects.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The use of accessors for batch entries complicates the code and makes it
less readable. Remove it an instead use bit-fields.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The lock and unlock code paths are very similar, so avoid the duplicate
code by merging them together.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that we have a single point, unify the tracing and collecting the
statistics for commands and their failure. While it might somewhat
reduce the control over debugging, it cleans the code a lot.
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By inlining the hypercall interface, we can unify several operations
into one central point in the code:
- Updating the target.
- Updating when a reset is needed.
- Update statistics (which will be done later in the patch-set).
- Print debug-messages (although they cannot be enabled as selectively).
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace "fallthru" with a proper "fall through" annotation.
This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The AFU Information DVSEC capability is a means to extract common,
general information about all of the AFUs associated with a Function
independent of the specific functionality that each AFU provides.
Write in the AFU Index field allows to access to the descriptor data
for each AFU.
With the current code, we are not able to access to these specific data
when the index >= 1 because we are writing to the wrong location.
All requests to the data of each AFU are pointing to those of the AFU 0,
which could have impacts when using a card with more than one AFU per
function.
This patch fixes the access to the AFU Descriptor Data indexed by the
AFU Info Index field.
Fixes: 5ef3166e8a ("ocxl: Driver code for 'generic' opencapi devices")
Cc: stable <stable@vger.kernel.org> # 4.16
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The genweq_add_file and genwqe_del_file by caching current without
using reference counting embed the assumption that a file descriptor
will never be passed from one process to another. It even embeds the
assumption that the the thread that opened the file will be in
existence when the process terminates. Neither of which are
guaranteed to be true.
Therefore replace caching the task_struct of the opener with
pid of the openers thread group id. All the knowledge of the
opener is used for is as the target of SIGKILL and a SIGKILL
will kill the entire process group.
Rename genwqe_force_sig to genwqe_terminate, remove it's unncessary
signal argument, update it's ownly caller, and use kill_pid
instead of force_sig.
The work force_sig does in changing signal handling state is not
relevant to SIGKILL sent as SEND_SIG_PRIV. The exact same processess
will be killed just with less work, and less confusion. The work done
by force_sig is really only needed for handling syncrhonous
exceptions.
It will still be possible to cause genwqe_device_remove to wait
8 seconds by passing a file descriptor to another process but
the possible user after free is fixed.
Fixes: eaf4722d46 ("GenWQE Character device and DDCB queue")
Cc: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frank Haverkamp <haver@linux.vnet.ibm.com>
Cc: Joerg-Stephan Vogt <jsvogt@de.ibm.com>
Cc: Michael Jung <mijung@gmx.net>
Cc: Michael Ruettger <michael@ibmra.de>
Cc: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Eberhard S. Amann <esa@linux.vnet.ibm.com>
Cc: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Array prox_curr_ma is declared but never used, hence it is redundant
and can be removed.
Cleans up clang warning:
warning: 'prox_curr_ma' defined but not used [-Wunused-const-variable=]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Array ir_currents is declared but never used, hence it is redundant
and can be removed.
Cleans up clang warning:
warning: 'ir_currents' defined but not used [-Wunused-const-variable=]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
val is indirectly controlled by user-space, hence leading to a
potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/misc/hmc6352.c:54 compass_store() warn: potential spectre issue
'map' [r]
Fix this by sanitizing val before using it to index map
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case a client fails to connect in mei_cldev_enable(), the
caller won't call the mei_cldev_disable leaving the client
in a linked stated. Upon driver unload the client structure
will be freed in mei_cl_bus_dev_release(), leaving a stale pointer
on a fail_list. This will eventually end up in crash
during power down flow in mei_cl_set_disonnected().
RIP: mei_cl_set_disconnected+0x5/0x260[mei]
Call trace:
mei_cl_all_disconnect+0x22/0x30
mei_reset+0x194/0x250
__synchronize_hardirq+0x43/0x50
_cond_resched+0x15/0x30
mei_me_intr_clear+0x20/0x100
mei_stop+0x76/0xb0
mei_me_shutdown+0x3f/0x80
pci_device_shutdown+0x34/0x60
kernel_restart+0x0e/0x30
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200455
Fixes: 'c110cdb17148 ("mei: bus: make a client pointer always available")'
Cc: <stable@vger.kernel.org> 4.10+
Tested-by: Georg Müller <georgmueller@gmx.net>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KASAN reports a use-after-free during startup, in mei_cl_write:
BUG: KASAN: use-after-free in mei_cl_write+0x601/0x870 [mei]
(drivers/misc/mei/client.c:1770)
This is caused by commit 98e70866aa ("mei: add support for variable
length mei headers."), which changed the return value from len, to
buf->size. That ends up using a stale buf pointer, because blocking
call, the cb (callback) is deleted in me_cl_complete() function.
However, fortunately, len remains unchanged throughout the function
(and I don't see anything else that would require re-reading buf->size
either), so the fix is to simply revert the change, and return len, as
before.
Fixes: 98e70866aa ("mei: add support for variable length mei headers.")
CC: Arnd Bergmann <arnd@arndb.de>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some of the ME clients are available only for BIOS operation and are
removed during hand off to an OS. However the removal is not instant.
A client may be visible on the client list when the mei driver requests
for enumeration, while the subsequent request for properties will be
answered with client not found error value. The default behavior
for an error is to perform client reset while this error is harmless and
the link reset should be prevented. This issue started to be visible due to
suspend/resume timing changes. Currently reported only on the Haswell
based system.
Fixes:
[33.564957] mei_me 0000:00:16.0: hbm: properties response: wrong status = 1 CLIENT_NOT_FOUND
[33.564978] mei_me 0000:00:16.0: mei_irq_read_handler ret = -71.
[33.565270] mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS fw status = 1E000255 60002306 00000200 00004401 00000000 00000010
Cc: <stable@vger.kernel.org>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce an lkdtm test for the STACKLEAK feature: check that the
current task stack is properly erased (filled with STACKLEAK_POISON).
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>