Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
where nVMX dereferences ->memslots without holding ->srcu or ->slots_lock.
The other half of nested migration, ->get_nested_state(), does not need
to acquire ->srcu as it is purely a dump of internal KVM (and CPU)
state to userspace.
Detected as an RCU lockdep splat that is 100% reproducible by running
KVM's state_test selftest with CONFIG_PROVE_LOCKING=y. Note that the
failing function, kvm_is_visible_gfn(), is only checking the validity of
a gfn, it's not actually accessing guest memory (which is more or less
unsupported during vmx_set_nested_state() due to incorrect MMU state),
i.e. vmx_set_nested_state() itself isn't fundamentally broken. In any
case, setting nested state isn't a fast path so there's no reason to go
out of our way to avoid taking ->srcu.
=============================
WARNING: suspicious RCU usage
5.4.0-rc7+ #94 Not tainted
-----------------------------
include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by evmcs_test/10939:
#0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
stack backtrace:
CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x68/0x9b
kvm_is_visible_gfn+0x179/0x180 [kvm]
mmu_check_root+0x11/0x30 [kvm]
fast_cr3_switch+0x40/0x120 [kvm]
kvm_mmu_new_cr3+0x34/0x60 [kvm]
nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
vmx_set_nested_state+0x256/0x340 [kvm_intel]
kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
kvm_vcpu_ioctl+0xde/0x630 [kvm]
do_vfs_ioctl+0xa2/0x6c0
ksys_ioctl+0x66/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x54/0x200
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f59a2b95f47
Fixes: 8fcc4b5923 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold shared_msr_update() into its sole user to eliminate its pointless
bounds check, its godawful printk, its misleading comment (it's called
under a global lock), and its woefully inaccurate name.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The jump labels out_free_1 and out_free_2 deal with
the same stuff, so get rid of one and rename the
label out_free_0a to retain the label name order.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
perf report:
Jin Yao:
- Allow entering the annotation view (symbol source/assembly +
overhead/cycles/etc column) from the 'perf report --total-cycles'
interface.
E.g.:
# perf record --all-cpus --branch-any --all-kernel
^C[ perf record: Woken up 5 times to write data ]
#
# perf evlist -v
cycles: size: 120, { sample_period, sample_freq }: 4000,
sample_type: IP|TID|TIME|CPU|PERIOD|BRANCH_STACK,
read_format: ID, disabled: 1, inherit: 1, exclude_user: 1, mmap: 1, comm: 1, freq: 1, task: 1,
precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1,
bpf_event: 1, branch_sample_type: ANY
#
# perf report --total-cycles
#
# Samples: 78762 of event 'cycles'
Sampled Sampled Avg Avg
Cycles% Cycles Cycles% Cycles [Program Block Range] Shared Object
1.72% 95.8K 0.00% 254 [msr.h:105 -> msr.h:166] [kernel.vmlinux]
1.56% 107.6K 0.00% 618 [compiler.h:199 -> common.c:301] [kernel.vmlinux]
0.83% 46.3K 0.00% 409 [entry_64.S:153 -> entry_64.S:175] [kernel.vmlinux]
0.83% 46.1K 0.00% 83 [jump_label.h:41 -> tsc.c:230] [kernel.vmlinux]
0.64% 36.9K 0.01% 1.4K [hda_intel.c:904 -> hda_intel.c:916] [snd_hda_intel]
0.57% 30.2K 0.00% 282 [file.c:710 -> file.c:730] [kernel.vmlinux]
0.48% 25.8K 0.00% 82 [spinlock.c:158 -> spinlock.c:160] [kernel.vmlinux]
0.45% 23.7K 0.00% 369 [tick-broadcast.c:585 -> tick-broadcast.c:586] [kernel.vmlinux]
0.44% 24.4K 0.00% 73 [msr.h:236 -> tsc.c:1088] [kernel.vmlinux]
0.43% 22.7K 0.00% 144 [cpuidle.c:229 -> cpuidle.c:232] [kernel.vmlinux]
Then press 'A' or Enter on one of those lines, just like with 'perf top', say
the top one: [msr.h:105 -> msr.h:166], then this shows up:
Samples: 78K of event 'cycles', 4000 Hz, Event count (approx.): 78762
native_write_msr /lib/modules/5.4.0-rc8/build/vmlinux [Percent: local period]
Percent│ IPC Cycle (Average IPC: 0.02, IPC Coverage: 50.0%)
│
│ Disassembly of section .text:
│
│ ffffffff8106c480 <native_write_msr>:
│ __wrmsr():
│ return EAX_EDX_VAL(val, low, high);
│ }
│
│ static inline void notrace __wrmsr(unsigned int msr, u32 low, u32 high)
│ {
│ asm volatile("1: wrmsr\n"
49.16 │0.02 mov %edi,%ecx
│0.02 mov %esi,%eax
│0.02 wrmsr
│ arch_static_branch():
│ #include <linux/stringify.h>
│ #include <linux/types.h>
│
│ static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
│ {
│ asm_volatile_goto("1:"
0.79 │0.02 nop
│ native_write_msr():
│ {
│ __wrmsr(msr, low, high);
│
│ if (msr_tracepoint_active(__tracepoint_write_msr))
│ do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
│ }
50.05 │0.02 254 ← retq
│ do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
│ shl $0x20,%rdx
│ mov %esi,%esi
│ or %rdx,%rsi
│ xor %edx,%edx
│ → jmpq do_trace_write_msr
We need to improve this to show the source code line numbers in the
annotation view, so one can go from that program block to the annotation view
and see those source code line numbers straight away.
auxtrace/Intel PT:
Adrian Hunter:
- Add support for AUX area sampling, requires new functionality that
will land in 5.5, it's already in tip.
This includes kernel capability querying so that it fails gracefully
with older kernels, dumping aux area samples in 'perf report -D' and
'perf script'.
perf.data:
Alexey Budankov:
- Fix decompression of PERF_RECORD_COMPRESSED records.
core:
Arnaldo Carvalho de Melo:
- Use the 'dcacheline' cmp routine to find the right DSOs taking into
account the 'maj', 'min', 'ino' and 'ino_generation', that got moved
from 'struct map' to 'struct dso', where it belongs.
This further reduces the size of 'struct map', there is still more
work to do to maybe get it to max one cacheline.
libtraceevent:
Hewenliang:
- Fix memory leakage in copy_filter_type().
Sudip Mukherjee:
- Fix header installation.
perf parse:
Ian Rogers:
- Fix potential memory leak when handling tracepoint errors, found using
LLVM's libFuzzer.
perf probe:
Colin Ian King:
- Fix spelling mistake "addrees" -> "address".
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 2dffd23f81 ("kbuild: make single target builds much faster")
made the situation much better.
To improve it even more, apply a similar idea to the top Makefile.
Trim unrelated directories from build-dirs.
The single build code must be moved above the 'descend' target.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
When 'exported twice' is warned, let sym_add_exported() return without
updating the symbol info. This respects the previous export, which is
ordered first in modules.order.
This simplifies the code too.
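The keep-first behaviour described above can be sketched in userland C.
This is only an illustration under stated assumptions: struct export, the
fixed-size table, find_export() and add_exported() are hypothetical
stand-ins, not modpost's real data structures or functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EXPORTS 64

/* Hypothetical, simplified export record; modpost's real struct differs. */
struct export {
	const char *name;
	const char *module;
};

static struct export exports[MAX_EXPORTS];
static size_t nr_exports;

static struct export *find_export(const char *name)
{
	for (size_t i = 0; i < nr_exports; i++)
		if (strcmp(exports[i].name, name) == 0)
			return &exports[i];
	return NULL;
}

/*
 * Register an export.  On a duplicate, warn and return the existing
 * entry untouched, so the first registration wins.
 */
static struct export *add_exported(const char *name, const char *module)
{
	struct export *e = find_export(name);

	if (e) {
		fprintf(stderr, "warning: %s: '%s' exported twice, previous export was in %s\n",
			module, name, e->module);
		return e;
	}

	assert(nr_exports < MAX_EXPORTS);
	e = &exports[nr_exports++];
	e->name = name;
	e->module = module;
	return e;
}
```

Registering the same name from a second module returns the first entry,
and its module field still names the first exporter.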
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Now that there is no overlap between symbols from ELF files and
ones from Module.symvers, the 'exported twice' warning should be
reported irrespective of where the symbol in question came from.
The exceptional case is external module; in some cases, we build
an external module to provide a different version/variant of the
corresponding in-kernel module, overriding the same set of exported
symbols.
You can see this use-case in upstream; tools/testing/nvdimm/libnvdimm.ko
replaces drivers/nvdimm/libnvdimm.ko in order to link it against mocked
version of core kernel symbols.
So, let's relax the 'exported twice' warning when building external
modules. The multiple export from external modules is warned only
when the previous one is from vmlinux or itself.
With this refactoring, the ugly preloading goes away.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
It is complicated to add mocked-up symbols for pre-handling CRC.
Handle CRC after all the export symbols in the relevant module
are registered.
Call handle_modversion() after the handle_symbol() iteration.
In some cases, I see stand-alone __crc_* without __ksymtab_*.
For example, ARCH=arm allyesconfig produces __crc_ccitt_veneer and
__crc_itu_t_veneer. I guess they come from crc_ccitt, crc_itu_t,
respectively. Since __*_veneer are auto-generated symbols, just
ignore them.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Part of the documentation is taken from the README of the userspace
utils (https://github.com/vitorafsr/i8kutils). The license is GPL-2+
and the author Massimo Dal Zotto is already credited as author of
the module. Therefore there should be no copyright problem.
I also added a paragraph with specific information on the experimental
support for automatic BIOS fan control.
Signed-off-by: Giovanni Mascellani <gio@debian.org>
Link: https://lore.kernel.org/r/20191122101519.1246458-2-gio@debian.org
[groeck: Fixed some of the documentation warnings]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
This patch exports standard hwmon pwmX_enable sysfs attribute for
enabling or disabling automatic fan control by BIOS. Standard value
"1" is for disabling automatic BIOS fan control and value "2" for
enabling.
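As a rough userland illustration of the values described above, the
following sketch writes "1" or "2" to a pwmX_enable attribute file. The
helper name set_pwm_enable(), the enum names and the idea of passing the
attribute path in are assumptions made for the example; the real path
depends on the hwmon device index on a given machine.

```c
#include <assert.h>
#include <stdio.h>

/* Values as described in the text: 1 disables BIOS auto mode, 2 enables it. */
enum bios_fan_mode {
	BIOS_FAN_MANUAL = 1,
	BIOS_FAN_AUTO = 2,
};

/*
 * Write the mode to the given attribute file (hypothetical helper).
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int set_pwm_enable(const char *attr_path, enum bios_fan_mode mode)
{
	FILE *f;
	int ok;

	assert(attr_path != NULL);

	f = fopen(attr_path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", (int)mode) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would pass the pwmX_enable path of the dell-smm hwmon device
and should be prepared for the write to fail on laptops that are not on
the whitelist.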
By default BIOS auto mode is enabled by laptop firmware.
When BIOS auto mode is enabled, a custom fan speed value (set via the
hwmon pwmX sysfs attribute) is overwritten by SMM within a few seconds,
so any custom setting has no effect. This is why an option for
disabling BIOS auto mode is needed.
With this patch the kernel can finally set and control the fan speed on
laptops, but doing so can be dangerous (like setting speed of other fans).
The SMM commands to enable or disable automatic fan control are not
documented and are not the same on all Dell laptops. Therefore a
whitelist is used to send the correct codes only on laptops for which
they are known.
This patch was originally developed by Pali Rohár; later Giovanni
Mascellani implemented the whitelist.
Signed-off-by: Giovanni Mascellani <gio@debian.org>
Co-Developed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Link: https://lore.kernel.org/r/20191122101519.1246458-1-gio@debian.org
[groeck: Fixed checkpatch warnings]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
This function handles not only modversions, but also unresolved
symbols, export symbols, etc.
Rename it to a more proper function name.
While I was here, I also added the 'const' qualifier to *sym.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Currently, namespace_from_kstrtabns() relies on the fact that
namespace strings are recorded in the __ksymtab_strings section.
Actually, it is coded in include/linux/export.h, but modpost does
not need to hard-code the section name.
Elf_Sym::st_shndx holds the index of the relevant section. Using it is
a more portable way to get the namespace string.
Make namespace_from_kstrtabns() simply call sym_get_data(), and delete
the info->ksymtab_strings.
While I was here, I added more 'const' qualifiers to pointers.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
When CONFIG_MODULE_REL_CRCS is enabled, the value of __crc_* is not
an absolute value, but the address to the CRC data embedded in the
.rodata section.
Getting the data pointed by the symbol value is somewhat complex.
Split it out into a new helper, sym_get_data().
I will reuse it to refactor namespace_from_kstrtabns() in the next
commit.
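The address arithmetic involved can be sketched as follows. This is a
simplified userland model, not the modpost implementation: struct sec,
struct sym, struct image, sym_data() and sym_crc() are hypothetical,
pared-down stand-ins for the ELF structures, and the model assumes the
symbol value is section-relative in a relocatable object and a virtual
address otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, pared-down views of an ELF section header and symbol. */
struct sec {
	uint64_t addr;		/* virtual address of the section */
	uint64_t offset;	/* offset of the section in the file */
	uint64_t size;		/* size of the section in bytes */
};

struct sym {
	uint64_t value;		/* symbol value */
	uint16_t shndx;		/* index of the section holding the data */
};

struct image {
	const unsigned char *base;	/* start of the file image in memory */
	const struct sec *secs;		/* section table */
	int relocatable;		/* nonzero for a relocatable object */
};

/* Return a pointer to the data the symbol refers to inside the image. */
static const void *sym_data(const struct image *img, const struct sym *s)
{
	const struct sec *sec = &img->secs[s->shndx];
	uint64_t off = s->value;

	/* Outside relocatable objects the value is an address, not an offset. */
	if (!img->relocatable)
		off -= sec->addr;

	assert(off < sec->size);
	return img->base + sec->offset + off;
}

/* Read a 32-bit CRC stored at the symbol's location. */
static uint32_t sym_crc(const struct image *img, const struct sym *s)
{
	uint32_t crc;

	memcpy(&crc, sym_data(img, s), sizeof(crc));
	return crc;
}
```

Locating the data through the section index rather than a section name
is what lets one helper serve both the CRC lookup and other per-symbol
data.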
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Add support for dumping the kernel address space layout to the console.
Users can enable CONFIG_DEBUG_VM to dump the virtual memory region into
the dmesg buffer during boot-up.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: dropped .init/.text/.data/.bss prints;
added PCI legacy I/O region display]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Enable more debugging options in the RISC-V defconfigs to help kernel
developers catch problems with patches earlier in the development
cycle.
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Edward Cree says:
====================
A series of changes to how we check filters for expiry, manage how much
of that work to do & when, etc.
Prompted by some pathological behaviour under heavy load, which was
Reported-by: David Ahern <dahern@digitalocean.com>
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
If there's no traffic on a channel, its ARFS expiry work will never get
scheduled by efx_poll() as that isn't being run.
So make efx_filter_rfs_expire() reschedule itself to run after 30 seconds.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Report the number of successful and failed insertions, and also the
current count of filters, to aid in tuning e.g. rps_flow_cnt.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
In high connection count usage, the NIC's filter table may be filled with
sufficiently many ARFS filters that further insertions fail. As this
does not represent a correctness issue, do not log the resulting MCDI
errors. Add a debug-level message under the (by default disabled)
rx_status category instead; and take the opportunity to do a little extra
expiry work.
Since there are now multiple workitems able to call __efx_filter_rfs_expire
on a given channel, it is possible for them to race and thus pass quotas
which, combined, exceed rfs_filter_count. Thus, don't WARN_ON if we loop
all the way around the table with quota left over.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Tested-by: David Ahern <dahern@digitalocean.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
The old rfs_filters_added method for determining the quota could potentially
allow the NIC to become filled with old filters, which never get tested for
expiry. Instead, explicitly make expiry check work depend on the number of
filters installed, and don't count checking slots without filters in as
doing work. This guarantees that each filter will be checked for expiry at
least once every thirty seconds (assuming the channel to which it belongs is
NAPI polling actively) regardless of fill level.
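The quota rule can be illustrated with a small userland model. It is a
sketch under assumptions, not the driver code: struct table,
expiry_quota() and scan() are hypothetical names, a nonzero slot stands
for an installed filter, and time is expressed in milliseconds rather
than jiffies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter table: a nonzero slot means a filter is installed. */
struct table {
	const unsigned char *slot;
	size_t size;
	size_t cursor;		/* where the next scan resumes */
};

/*
 * Scale the quota with the number of installed filters and the elapsed
 * time, so that every filter is visited once per period.
 */
static size_t expiry_quota(size_t nr_filters, unsigned int elapsed_ms,
			   unsigned int period_ms)
{
	assert(period_ms > 0);
	return (size_t)((unsigned long long)nr_filters * elapsed_ms / period_ms);
}

/*
 * Walk the table from the cursor.  Only occupied slots consume quota;
 * empty slots are skipped for free.  Stop after one full lap even if
 * quota is left over.  Returns the number of filters checked.
 */
static size_t scan(struct table *t, size_t quota)
{
	size_t checked = 0, visited = 0;

	while (checked < quota && visited < t->size) {
		if (t->slot[t->cursor])
			checked++;	/* an expiry test would happen here */
		t->cursor = (t->cursor + 1) % t->size;
		visited++;
	}
	return checked;
}
```

With 300 filters, 1000 ms elapsed and a 30000 ms period the quota is 10,
so a sparsely filled table no longer burns its quota on empty slots.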
Signed-off-by: Edward Cree <ecree@solarflare.com>
Tested-by: David Ahern <dahern@digitalocean.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Jeff Kirsher says:
====================
This series contains updates to the ice driver only.
Bruce updates the driver to store the number of functions the device has
so that it won't have to compute it when setting safe mode capabilities.
Adds a check to adjust the reporting of capabilities for devices with
more than 4 ports, which differ for devices with fewer than 4 ports.
Brett adds a helper function to determine if the VF is allowed to do
VLAN operations based on the host's VF configuration. Also adds a new
function that initializes VLAN stripping (enabled/disabled) for the VF
based on the device supported capabilities. Adds a check if the vector
index is valid with respect to the number of transmit and receive
queues configured when we set coalesce settings for DCB. Adds a check
if the promisc_mask contains ICE_PROMISC_VLAN_RX or ICE_PROMISC_VLAN_TX
so that VLAN 0 promiscuous rules can be removed. Adds a helper macro for
a commonly used de-reference of a pointer to &pf->dev->pdev.
Jesse fixes an issue where, on an invalid virtchnl request from the VF,
the driver would return uninitialized data to the VF from the PF stack,
so ensure the stack variable is initialized earlier. Adds helpers to the
virtchnl interface to make the reporting of strings consistent and help
reduce stack space. Implements VF statistics gathering via the kernel
ndo_get_vf_stats().
Akeem ensures we disable the state flag for each VF when its resources
are returned to the device.
Tony does additional cleanup in the driver to ensure that when we
allocate and free memory within the same function, we do not use
devm_* variants; use regular alloc and free functions.
Henry implements code to query and set the number of channels on the
primary VSI for a PF via ethtool.
Jake cleans up needless NULL checks in ice_sched_cleanup_all().
Kevin updates the firmware API version to align with current NVM images.
v2: Added "Fixes:" tag to patch 5 commit description and added the use
of netif_is_rxfh_configured() in patch 13 to see if RSS has been
configured by the user, if so do not overwrite that configuration.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Pull input fix from Dmitry Torokhov:
"Just a single revert as RMI mode should not have been enabled for this
model [yet?]"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Revert "Input: synaptics - enable RMI mode for X1 Extreme 2nd Generation"
Currently, a lot of memory is wasted for architectures like MIPS when
init_ftrace_syscalls() allocates the array for syscalls using kcalloc.
This is because syscall numbers start from 4000, 5000 or 6000 and
array elements up to that point are unused.
Fix this by using a data structure more suited to storing sparsely
populated arrays. The XARRAY data structure, implemented using radix
trees, is much more memory efficient for storing the syscalls in
question.
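The memory argument can be made concrete with a userland comparison of a
directly indexed array against a sparse table. This sketch does not use
the kernel XArray API; it substitutes a sorted table searched with
bsearch() purely to show why sparse storage wins when indices start at a
large base. struct entry, sparse_lookup() and dense_bytes() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry: a syscall number and the name of its metadata. */
struct entry {
	unsigned long nr;
	const char *name;
};

static int cmp_entry(const void *a, const void *b)
{
	const struct entry *x = a, *y = b;

	return (x->nr > y->nr) - (x->nr < y->nr);
}

/* Look up nr in a table sorted by nr; returns NULL if it is absent. */
static const char *sparse_lookup(const struct entry *tbl, size_t n,
				 unsigned long nr)
{
	struct entry key = { nr, NULL };
	const struct entry *e;

	assert(tbl != NULL);
	e = bsearch(&key, tbl, n, sizeof(*tbl), cmp_entry);
	return e ? e->name : NULL;
}

/* Bytes a pointer array indexed directly by syscall number would need. */
static size_t dense_bytes(unsigned long max_nr)
{
	return (max_nr + 1) * sizeof(void *);
}
```

For three illustrative entries numbered just above 4000, the dense array
needs thousands of pointer slots while the sparse table needs three
entries.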
Link: http://lkml.kernel.org/r/20191115234314.21599-1-hnaveed@wavecomp.com
Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com>
Reviewed-by: Paul Burton <paulburton@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Rahul Lakkireddy says:
====================
This series of patches adds UDP Segmentation Offload (USO) support
for Chelsio T5/T6 NICs.
Patch 1 updates the current Scatter Gather List (SGL) DMA unmap logic
for USO requests.
Patch 2 adds USO support for NIC and MQPRIO QoS offload Tx path.
Patch 3 adds missing stats for MQPRIO QoS offload Tx path.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Implement and export UDP segmentation offload (USO) support for both
NIC and MQPRIO QoS offload Tx path. Update appropriate logic in Tx to
parse GSO info in skb and configure FW_ETH_TX_EO_WR request needed to
perform USO.
v2:
- Remove inline keyword from write_eo_udp_wr() in sge.c. Let the
compiler decide.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
The FW_ETH_TX_EO_WR used for sending UDP Segmentation Offload (USO)
requests expects the headers to be part of the descriptor and the
payload to be part of the SGL containing the DMA mapped addresses.
Hence, the DMA address in the first entry of the SGL can start after
the packet headers. Currently, unmap_sgl() tries to unmap from this
wrong offset, instead of the originally mapped DMA address.
So, use existing unmap_skb() instead, which takes originally saved DMA
addresses as input. Update all necessary Tx paths to save the original
DMA addresses, so that unmap_skb() can unmap them properly.
v2:
- No change.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
In some situations AppArmor needs to be able to use its work buffers
from atomic context. Add the ability to specify when in atomic context
and hold a set of work buffers in reserve for atomic context to
reduce the chance that a large work buffer allocation will need to
be done.
Fixes: df323337e5 ("apparmor: Use a memory pool instead per-CPU caches")
Signed-off-by: John Johansen <john.johansen@canonical.com>
Adding 2 new functions -
1) struct trace_array *trace_array_get_by_name(const char *name);
Return pointer to a trace array with given name. If it does not exist,
create and return pointer to the new trace array.
2) int trace_array_set_clr_event(struct trace_array *tr,
const char *system, const char *event, bool enable);
Enable/Disable events to this trace array.
Additionally,
- To handle reference counters, export trace_array_put()
- Due to introduction of the above 2 new functions, we no longer need to
export - ftrace_set_clr_event & trace_array_create APIs.
Link: http://lkml.kernel.org/r/1574276919-11119-2-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that the buffers allocation has changed and no longer needs
the full mediation under an rcu_read_lock, reduce the rcu_read_lock
scope to only where it is necessary.
Fixes: df323337e5 ("apparmor: Use a memory pool instead per-CPU caches")
Signed-off-by: John Johansen <john.johansen@canonical.com>
The sanity check in macro update_for_len checks to see if len
is less than zero, however, len is a size_t so it can never be
less than zero, so this sanity check is a no-op. Fix this by
making len a ssize_t so the comparison will work and add ulen
that is a size_t copy of len so that the min() macro won't
throw warnings about comparing different types.
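The pitfall and the fix can be reproduced in userland C. The sketch
below is a function-style illustration, not the kernel macro itself:
as_unsigned() and this update_for_len() function are written for the
example, and the error return stands in for whatever the real sanity
check does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/*
 * What the old code effectively did: store a possibly negative result
 * in a size_t.  -1 becomes SIZE_MAX, which can never compare below 0.
 */
static size_t as_unsigned(int ret)
{
	return (size_t)ret;
}

/*
 * Fixed pattern: keep the length signed so the negative check works,
 * then use an unsigned copy for the min() style clamp against size.
 * Returns -1 on a negative length, 0 otherwise.
 */
static int update_for_len(size_t *total, ssize_t len, size_t *size, char **str)
{
	size_t ulen;

	assert(total && size && str);

	if (len < 0)
		return -1;	/* reachable only because len is signed */

	ulen = (size_t)len;
	*total += ulen;
	if (ulen > *size)
		ulen = *size;	/* min(ulen, *size) with matching types */
	*size -= ulen;
	*str += ulen;
	return 0;
}
```

Passing -1 now trips the check and leaves the counters untouched, while
a length larger than the remaining space is still counted in full in
total but clamped when advancing the buffer.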
Addresses-Coverity: ("Macro compares unsigned to 0")
Fixes: f1bd904175 ("apparmor: add the base fns() for domain labels")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
The v4l2-compliance utility reported several V4L2 API compliance
issues:
- the sequence counter wasn't filled in
- the sequence counter wasn't reset to 0 at the start of streaming
- the returned field value wasn't set to V4L2_FIELD_NONE
- the timestamp wasn't set
- the payload size was undefined if an error was returned
- min_buffers_needed doesn't need to be initialized
Fix these issues.
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Link: https://lore.kernel.org/r/20191119105118.54285-3-hverkuil-cisco@xs4all.nl
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>