Pull perf/core fixes and improvements from Arnaldo Carvalho de Melo:
kernel + tools/perf:
Alexey Budankov:
- Introduce CAP_PERFMON to kernel and user space.
callchains:
Adrian Hunter:
- Allow using Intel PT to synthesize callchains for regular events.
Kan Liang:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
perf script:
Andreas Gerstmayr:
- Add flamegraph.py script
BPF:
Jiri Olsa:
- Synthesize bpf_trampoline/dispatcher ksymbol events.
perf stat:
Arnaldo Carvalho de Melo:
- Honour --timeout for forked workloads.
Stephane Eranian:
- Force error in fallback on :k events, to avoid counting nothing when
the user asks for kernel events but is not allowed to.
perf bench:
Ian Rogers:
- Add event synthesis benchmark.
tools api fs:
Stephane Eranian:
- Make xxx__mountpoint() more scalable
libtraceevent:
He Zhe:
- Handle return value of asprintf.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
VERMAGIC* definitions are not supposed to be used by the drivers,
see this [1] bug report, so introduce special define to guard inclusion
of this header file and define it in kernel/modules.h and in internal
script that generates *.mod.c files.
In-tree module build:
➜ kernel git:(vermagic) ✗ make clean
➜ kernel git:(vermagic) ✗ make M=drivers/infiniband/hw/mlx5
➜ kernel git:(vermagic) ✗ modinfo drivers/infiniband/hw/mlx5/mlx5_ib.ko
filename: /images/leonro/src/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
<...>
vermagic: 5.6.0+ SMP mod_unload modversions
Out-of-tree module build:
➜ mlx5 make -C /images/leonro/src/kernel clean M=/tmp/mlx5
➜ mlx5 make -C /images/leonro/src/kernel M=/tmp/mlx5
➜ mlx5 modinfo /tmp/mlx5/mlx5_ib.ko
filename: /tmp/mlx5/mlx5_ib.ko
<...>
vermagic: 5.6.0+ SMP mod_unload modversions
[1] https://lore.kernel.org/lkml/20200411155623.GA22175@zn.tnic
Reported-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Jessica Yu <jeyu@kernel.org>
Co-developed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christof Meerwald <cmeerw@cmeerw.org> writes:
> Hi,
>
> this is probably related to commit
> 7a0cf09494 (signal: Correct namespace
> fixups of si_pid and si_uid).
>
> With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
> properly set si_pid field - this seems to happen for multi-threaded
> child processes.
>
> A simple test program (based on the sample from the signalfd man page):
>
> #include <sys/signalfd.h>
> #include <signal.h>
> #include <unistd.h>
> #include <spawn.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #define handle_error(msg) \
> do { perror(msg); exit(EXIT_FAILURE); } while (0)
>
> int main(int argc, char *argv[])
> {
> sigset_t mask;
> int sfd;
> struct signalfd_siginfo fdsi;
> ssize_t s;
>
> sigemptyset(&mask);
> sigaddset(&mask, SIGCHLD);
>
> if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
> handle_error("sigprocmask");
>
> pid_t chldpid;
> char *chldargv[] = { "./sfdclient", NULL };
> posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
>
> sfd = signalfd(-1, &mask, 0);
> if (sfd == -1)
> handle_error("signalfd");
>
> for (;;) {
> s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
> if (s != sizeof(struct signalfd_siginfo))
> handle_error("read");
>
> if (fdsi.ssi_signo == SIGCHLD) {
> printf("Got SIGCHLD %d %d %d %d\n",
> fdsi.ssi_status, fdsi.ssi_code,
> fdsi.ssi_uid, fdsi.ssi_pid);
> return 0;
> } else {
> printf("Read unexpected signal\n");
> }
> }
> }
>
>
> and a multi-threaded client to test with:
>
> #include <unistd.h>
> #include <pthread.h>
>
> void *f(void *arg)
> {
> sleep(100);
> }
>
> int main()
> {
> pthread_t t[8];
>
> for (int i = 0; i != 8; ++i)
> {
> pthread_create(&t[i], NULL, f, NULL);
> }
> }
>
> I tried to do a bit of debugging and what seems to be happening is
> that
>
> /* From an ancestor pid namespace? */
> if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
>
> fails inside task_pid_nr_ns because the check for "pid_alive" fails.
>
> This code seems to be called from do_notify_parent and there we
> actually have "tsk != current" (I am assuming both are threads of the
> current process?)
I instrumented the code with a warning and received the following backtrace:
> WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
> Modules linked in:
> CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
> Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd
> RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
> RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
> RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
> R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
> FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
> Call Trace:
> send_signal+0x1c8/0x310
> do_notify_parent+0x50f/0x550
> release_task.part.21+0x4fd/0x620
> do_exit+0x6f6/0xaf0
> do_group_exit+0x42/0xb0
> get_signal+0x13b/0xbb0
> do_signal+0x2b/0x670
> ? __audit_syscall_exit+0x24d/0x2b0
> ? rcu_read_lock_sched_held+0x4d/0x60
> ? kfree+0x24c/0x2b0
> do_syscall_64+0x176/0x640
> ? trace_hardirqs_off_thunk+0x1a/0x1c
> entry_SYSCALL_64_after_hwframe+0x49/0xb3
The immediate problem is as Christof noticed that "pid_alive(current) == false".
This happens because do_notify_parent is called from the last thread to exit
in a process after that thread has been reaped.
The bigger issue is that do_notify_parent can be called from any
process that manages to wait on a thread of a multi-threaded process
from wait_task_zombie. So any logic based upon current for
do_notify_parent is just nonsense, as current can be pretty much
anything.
So change do_notify_parent to call __send_signal directly.
Inspecting the code it appears this problem has existed since the pid
namespace support started handling this case in 2.6.30. This fix only
backports to 7a0cf09494 ("signal: Correct namespace fixups of si_pid and si_uid")
where the problem logic was moved out of __send_signal and into send_signal.
Cc: stable@vger.kernel.org
Fixes: 6588c1e3ff ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
Ref: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
Reported-by: Christof Meerwald <cmeerw@cmeerw.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
check_xadd() can cause check_ptr_to_btf_access() to be executed with
atype==BPF_READ and value_regno==-1 (meaning "just check whether the access
is okay, don't tell me what type it will result in").
Handle that case properly and skip writing type information, instead of
indexing into the registers at index -1 and writing into out-of-bounds
memory.
Note that at least at the moment, you can't actually write through a BTF
pointer, so check_xadd() will reject the program after calling
check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the
verifier has already corrupted memory.
This patch assumes that BTF pointers are not available in unprivileged
programs.
Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com
When check_xadd() verifies an XADD operation on a pointer to a stack slot
containing a spilled pointer, check_stack_read() verifies that the read,
which is part of XADD, is valid. However, since the placeholder value -1 is
passed as `value_regno`, check_stack_read() can only return a binary
decision and can't return the type of the value that was read. The intent
here is to verify whether the value read from the stack slot may be used as
a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and
the type information is lost when check_stack_read() returns, this is not
enforced, and a malicious user can abuse XADD to leak spilled kernel
pointers.
Fix it by letting check_stack_read() verify that the value is usable as a
SCALAR_VALUE if no type information is passed to the caller.
To be able to use __is_pointer_value() in check_stack_read(), move it up.
Fix up the expected unprivileged error message for a BPF selftest that,
until now, assumed that unprivileged users can use XADD on stack-spilled
pointers. This also gives us a test for the behavior introduced in this
patch for free.
In theory, this could also be fixed by forbidding XADD on stack spills
entirely, since XADD is a locked operation (for operations on memory with
concurrency) and there can't be any concurrency on the BPF stack; but
Alexei has said that he wants to keep XADD on stack slots working to avoid
changes to the test suite [1].
The following BPF program demonstrates how to leak a BPF map pointer as an
unprivileged user using this bug:
// r7 = map_pointer
BPF_LD_MAP_FD(BPF_REG_7, small_map),
// r8 = launder(map_pointer)
BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_1, 0),
((struct bpf_insn) {
.code = BPF_STX | BPF_DW | BPF_XADD,
.dst_reg = BPF_REG_FP,
.src_reg = BPF_REG_1,
.off = -8
}),
BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8),
// store r8 into map
BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7),
BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN()
[1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code
can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This
happens because in this configuration, NR_CPUS can be larger than
nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient
to guard against hitting the warning in cpumask_check().
Fix this by explicitly checking the supplied key against the
nr_cpumask_bits variable before calling cpu_possible().
Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Xiumei Mu <xmu@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com
If audit_send_reply() fails when trying to create a new thread to
send the reply it also fails to cleanup properly, leaking a reference
to a net structure. This patch fixes the error path and makes a
handful of other cleanups that came up while fixing the code.
Reported-by: teroincn@gmail.com
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Commit 7561252892 ("audit: always check the netlink payload length
in audit_receive_msg()") fixed a number of missing message length
checks, but forgot to check the length of userspace generated audit
records. The good news is that you need CAP_AUDIT_WRITE to submit
userspace audit records, which is generally only given to trusted
processes, so the impact should be limited.
Cc: stable@vger.kernel.org
Fixes: 7561252892 ("audit: always check the netlink payload length in audit_receive_msg()")
Reported-by: syzbot+49e69b4d71a420ceda3e@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
The single atomic pool is allocated from the lowest zone possible since
it is guaranteed to be applicable for any DMA allocation.
Devices may allocate through the DMA API but not have a strict reliance
on GFP_DMA memory. Since the atomic pool will be used for all
non-blockable allocations, returning all memory from ZONE_DMA may
unnecessarily deplete the zone.
Provision for multiple atomic pools that will map to the optimal gfp
mask of the device.
When allocating non-blockable memory, determine the optimal gfp mask of
the device and use the appropriate atomic pool.
The coherent DMA mask will remain the same between allocation and free
and, thus, memory will be freed to the same atomic pool it was allocated
from.
__dma_atomic_pool_init() will be changed to return struct gen_pool *
later once dynamic expansion is added.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
DMA atomic pools will be needed beyond only CONFIG_DMA_DIRECT_REMAP so
separate them out into their own file.
This also adds a new Kconfig option that can be subsequently used for
options, such as CONFIG_AMD_MEM_ENCRYPT, that will utilize the coherent
pools but do not have a dependency on direct remapping.
For this patch alone, there is no functional change introduced.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Rientjes <rientjes@google.com>
[hch: fixup copyrights and remove unused includes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fix the following sparse warning:
kernel/dma/debug.c:659:6: warning: symbol '__dma_entry_alloc_check_leak'
was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pull time namespace fix from Thomas Gleixner:
"An update for the proc interface of time namespaces: Use symbolic
names instead of clockid numbers. The usability nuisance of numbers
was noticed by Michael when polishing the man page"
* tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
Pull irq fixes from Thomas Gleixner:
"A set of fixes/updates for the interrupt subsystem:
- Remove setup_irq() and remove_irq(). All users have been converted
so remove them before new users surface.
- A set of bugfixes for various interrupt chip drivers
- Add a few missing static attributes to address sparse warnings"
* tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-bcm7038-l1: Make bcm7038_l1_of_init() static
irqchip/irq-mvebu-icu: Make legacy_bindings static
irqchip/meson-gpio: Fix HARDIRQ-safe -> HARDIRQ-unsafe lock order
irqchip/sifive-plic: Fix maximum priority threshold value
irqchip/ti-sci-inta: Fix processing of masked irqs
irqchip/mbigen: Free msi_desc on device teardown
irqchip/gic-v4.1: Update effective affinity of virtual SGIs
irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid signaling
genirq: Remove setup_irq() and remove_irq()
Pull scheduler fixes from Thomas Gleixner:
"Two fixes for the scheduler:
- Work around an uninitialized variable warning where GCC can't
figure it out.
- Allow 'isolcpus=' to skip unknown subparameters so that older
kernels work with the commandline of a newer kernel. Improve the
error output while at it"
* tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/vtime: Work around an unitialized variable warning
sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters
Pull RCU fix from Thomas Gleixner:
"A single bugfix for RCU to prevent taking a lock in NMI context"
* tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()
Pull thread fixes from Christian Brauner:
"A few fixes and minor improvements:
- Correctly validate the cgroup file descriptor when clone3() is used
with CLONE_INTO_CGROUP.
- Check that a new enough version of struct clone_args is passed
which supports the cgroup file descriptor argument when
CLONE_INTO_CGROUP is set in the flags argument.
- Catch nonsensical struct clone_args layouts at build time.
- Catch extensions of struct clone_args without updating the uapi
visible size definitions at build time.
- Check whether the signal is valid early in kill_pid_usb_asyncio()
before doing further work.
- Replace open-coded rcu_read_lock()+kill_pid_info()+rcu_read_unlock()
sequence in kill_something_info() with kill_proc_info() which is a
dedicated helper to do just that"
* tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks
clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set
clone3: fix cgroup argument sanity check
signal: use kill_proc_info instead of kill_pid_info in kill_something_info
signal: check sig before setting info in kill_pid_usb_asyncio
Various frob_* and module_{enable,disable}_* functions are defined in a
CONFIG_ARCH_HAS_STRICT_MODULE_RWX ifdef block which also has a nested
CONFIG_STRICT_MODULE_RWX ifdef block within it. This is unecessary and
makes things hard to read. Not only that, this construction requires
redundant empty stubs for module_enable_nx(). I suspect this was
originally done for cosmetic reasons - to keep all the frob_* functions
in the same place, and all the module_{enable,disable}_* functions right
after, but as a result it made the code harder to read.
Make this more readable by unnesting the ifdef blocks and getting rid of
the redundant empty stubs.
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Pull networking fixes from David Miller:
1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.
2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.
3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.
4) Adjust icmp6 message source address selection when routes have a
preferred source address set, from Tim Stallard.
5) Be sure to validate HSR protocol version when creating new links,
from Taehee Yoo.
6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
non-initial namespaces, from Michael Weiß.
7) Missing release firmware call in mlx5, from Eran Ben Elisha.
8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
Taehee Yoo.
9) Fix pause frame negotiation in marvell phy driver, from Clemens
Gruber.
10) Record RX queue early enough in tun packet paths such that XDP
programs will see the correct RX queue index, from Gilberto Bertin.
11) Fix double unlock in mptcp, from Florian Westphal.
12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.
13) marvell10g needs to soft reset PHY when coming out of low power
mode, from Russell King.
14) Fix MTU setting regression in stmmac for some chip types, from
Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
amd-xgbe: Use __napi_schedule() in BH context
mISDN: make dmril and dmrim static
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
tipc: fix incorrect increasing of link window
Documentation: Fix tcp_challenge_ack_limit default value
net: tulip: make early_486_chipsets static
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
ipv6: remove redundant assignment to variable err
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
selftests/bpf: Check for correct program attach/detach in xdp_attach test
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
libbpf: Always specify expected_attach_type on program load if supported
xsk: Add missing check on user supplied headroom size
mac80211: fix channel switch trigger from unknown mesh peer
mac80211: fix race in ieee80211_register_hw()
net: marvell10g: soft-reset the PHY when coming out of low power
net: marvell10g: report firmware version
net/cxgb4: Check the return from t4_query_params properly
...
Open access to monitoring via kprobes and uprobes and eBPF tracing for
CAP_PERFMON privileged process. Providing the access under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials,
excludes chances to misuse the credentials and makes operation more
secure.
perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via perf_event_open interface
and then the probes are used in eBPF tracing.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Cc: linux-man@vger.kernel.org
Link: http://lore.kernel.org/lkml/3c129d9a-ba8a-3483-ecc5-ad6c8e7c203f@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The kernel requires at least GCC 4.8 in order to build, and so there is
no need to cater for the pre-4.7 gcov format.
Remove the obsolete code.
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Michael Kerrisk suggested to replace numeric clock IDs with symbolic names.
Now the content of these files looks like this:
$ cat /proc/774/timens_offsets
monotonic 864000 0
boottime 1728000 0
For setting offsets, both representations of clocks (numeric and symbolic)
can be used.
As for compatibility, it is acceptable to change things as long as
userspace doesn't care. The format of timens_offsets files is very new and
there are no userspace tools yet which rely on this format.
But three projects crun, util-linux and criu rely on the interface of
setting time offsets and this is why it's required to continue supporting
the numeric clock IDs on write.
Fixes: 04a8682a71 ("fs/proc: Introduce /proc/pid/timens_offsets")
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
saved_max_pfn was originally introduced in commit
92aa63a5a1 ("[PATCH] kdump: Retrieve saved max pfn")
It used to make sure that the user does not try to read the physical memory
beyond saved_max_pfn. But since commit
921d58c0e6 ("vmcore: remove saved_max_pfn check")
it's no longer used for the check. This variable doesn't have any users
anymore so just remove it.
[ bp: Drop the Calgary IOMMU reference from the commit message. ]
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/20200330181544.1595733-1-kasong@redhat.com
Work around this warning:
kernel/sched/cputime.c: In function ‘kcpustat_field’:
kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized]
because GCC can't see that val is used only when err is 0.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic
The "isolcpus=" parameter allows sub-parameters before the cpulist is
specified, and if the parser detects an unknown sub-parameters the whole
parameter will be ignored.
This design is incompatible with itself when new sub-parameters are added.
An older kernel will not recognize the new sub-parameter and will
invalidate the whole parameter so the CPU isolation will not take
effect. It emits a warning:
isolcpus: Error, unknown flag
The better and compatible way is to allow "isolcpus=" to skip unknown
sub-parameters, so that even if new sub-parameters are added an older
kernel will still be able to behave as usual even if with the new
sub-parameter specified on the command line.
Ideally this should have been there when the first sub-parameter for
"isolcpus=" was introduced.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com
CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via
the offsets of the relevant struct clone_args fields, which makes
it rather error-prone, so it probably makes sense to add some
compile-time checks for them (including the one that breaks
on struct clone_args extension as a reminder to add a relevant
size macro and a similar check). Function copy_clone_args_from_user
seems to be a good place for such checks.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Checking that cgroup field value of struct clone_args is less than 0
is useless, as it is defined as unsigned 64-bit integer. Moreover,
it doesn't catch the situations where its higher bits are lost during
the assignment to the cgroup field of the cgroup field of the internal
struct kernel_clone_args (where it is declared as signed 32-bit
integer), so it is still possible to pass garbage there. A check
against INT_MAX solves both these issues.
Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202533.GA29554@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
VM_MAYWRITE flag during initial memory mapping determines if already mmap()'ed
pages can be later remapped as writable ones through mprotect() call. To
prevent user application to rewrite contents of memory-mapped as read-only and
subsequently frozen BPF map, remove VM_MAYWRITE flag completely on initially
read-only mapping.
Alternatively, we could treat any memory-mapping on unfrozen map as writable
and bump writecnt instead. But there is little legitimate reason to map
BPF map as read-only and then re-mmap() it as writable through mprotect(),
instead of just mmap()'ing it as read/write from the very beginning.
Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap
operations. We can just rely on VMA holding reference to BPF map's file
properly.
Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com
Reporting hides KCSAN runtime functions in the stack trace, with
filtering done based on function names. Currently this included all
functions (or modules) that would match "kcsan_". Make the filter aware
of KCSAN tests, which contain "kcsan_test", and are no longer skipped in
the report.
This is in preparation for adding a KCSAN test module.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Pass string length as returned by scnprintf() to strnstr(), since
strnstr() searches exactly len bytes in haystack, even if it contains a
NUL-terminator before haystack+len.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Introduce ASSERT_EXCLUSIVE_*_SCOPED(), which provide an intuitive
interface to use the scoped-access feature, without having to explicitly
mark the start and end of the desired scope. Basing duration of the
checks on scope avoids accidental misuse and resulting false positives,
which may be hard to debug. See added comments for usage.
The macros are implemented using __attribute__((__cleanup__(func))),
which is supported by all compilers that currently support KCSAN.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This adds support for scoped accesses, where the memory range is checked
for the duration of the scope. The feature is implemented by inserting
the relevant access information into a list of scoped accesses for
the current execution context, which are then checked (until removed)
on every call (through instrumentation) into the KCSAN runtime.
An alternative, more complex, implementation could set up a watchpoint for
the scoped access, and keep the watchpoint set up. This, however, would
require first exposing a handle to the watchpoint, as well as dealing
with cases such as accesses by the same thread while the watchpoint is
still set up (and several more cases). It is also doubtful if this would
provide any benefit, since the majority of delay where the watchpoint
is set up is likely due to the injected delays by KCSAN. Therefore,
the implementation in this patch is simpler and avoids hurting KCSAN's
main use-case (normal data race detection); it also implicitly increases
scoped-access race-detection-ability due to increased probability of
setting up watchpoints by repeatedly calling __kcsan_check_access()
throughout the scope of the access.
The implementation required adding an additional conditional branch to
the fast-path. However, the microbenchmark showed a *speedup* of ~5%
on the fast-path. This appears to be due to subtly improved codegen by
GCC from moving get_ctx() and associated load of preempt_count earlier.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
To avoid deadlock in case watchers can be interrupted, we need to ensure
that producers of the struct other_info can never be blocked by an
unrelated consumer. (Likely to occur with KCSAN_INTERRUPT_WATCHER.)
There are several cases that can lead to this scenario, for example:
1. A watchpoint A was set up by task T1, but interrupted by
interrupt I1. Some other thread (task or interrupt) finds
watchpoint A consumes it, and sets other_info. Then I1 also
finds some unrelated watchpoint B, consumes it, but is blocked
because other_info is in use. T1 cannot consume other_info
because I1 never returns -> deadlock.
2. A watchpoint A was set up by task T1, but interrupted by
interrupt I1, which also sets up a watchpoint B. Some other
thread finds watchpoint A, and consumes it and sets up
other_info with its information. Similarly some other thread
finds watchpoint B and consumes it, but is then blocked because
other_info is in use. When I1 continues it sees its watchpoint
was consumed, and that it must wait for other_info, which
currently contains information to be consumed by T1. However, T1
cannot unblock other_info because I1 never returns -> deadlock.
To avoid this, we need to ensure that producers of struct other_info
always have a usable other_info entry. This is obviously not the case
with only a single instance of struct other_info, as concurrent
producers must wait for the entry to be released by some consumer (which
may be locked up as illustrated above).
While it would be nice if producers could simply call kmalloc() and
append their instance of struct other_info to a list, we are very
limited in this code path: since KCSAN can instrument the allocators
themselves, calling kmalloc() could lead to deadlock or corrupted
allocator state.
Since producers of the struct other_info will always succeed at
try_consume_watchpoint(), preceding the call into kcsan_report(), we
know that the particular watchpoint slot cannot simply be reused or
consumed by another potential other_info producer. If we move removal of
a watchpoint after reporting (by the consumer of struct other_info), we
can see a consumed watchpoint as a held lock on elements of other_info,
if we create a one-to-one mapping of a watchpoint to an other_info
element.
Therefore, the simplest solution is to create an array of struct
other_info that is as large as the watchpoints array in core.c, and pass
the watchpoint index to kcsan_report() for producers and consumers, and
change watchpoints to be removed after reporting is done.
With a default config on a 64-bit system, the array other_infos consumes
~37KiB. For most systems today this is not a problem. On smaller memory
constrained systems, the config value CONFIG_KCSAN_NUM_WATCHPOINTS can
be reduced appropriately.
Overall, this change is a simplification of the prepare_report() code,
and makes some of the checks (such as checking if at least one access is
a write) redundant.
Tested:
$ tools/testing/selftests/rcutorture/bin/kvm.sh \
--cpus 12 --duration 10 --kconfig "CONFIG_DEBUG_INFO=y \
CONFIG_KCSAN=y CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n \
CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n \
CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y \
CONFIG_KCSAN_INTERRUPT_WATCHER=y CONFIG_PROVE_LOCKING=y" \
--configs TREE03
=> No longer hangs and runs to completion as expected.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Improve readability by introducing access_info and other_info structs,
and in preparation of the following commit in this series replaces the
single instance of other_info with an array of size 1.
No functional change intended.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Resolve these conflicts:
arch/x86/Kconfig
arch/x86/kernel/Makefile
Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines
in the Kconfig to reduce the probability of future conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull time(keeping) updates from Thomas Gleixner:
- Fix the time_for_children symlink in /proc/$PID/ so it properly
reflects that it part of the 'time' namespace
- Add the missing userns limit for the allowed number of time
namespaces, which was half defined but the actual array member was
not added. This went unnoticed as the array has an exessive empty
member at the end but introduced a user visible regression as the
output was corrupted.
- Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
to catch half updated data.
* tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ucount: Make sure ucounts in /proc/sys/user don't regress again
time/namespace: Add max_time_namespaces ucount
time/namespace: Fix time_for_children symlink
Pull scheduler fixes/updates from Thomas Gleixner:
- Deduplicate the average computations in the scheduler core and the
fair class code.
- Fix a raise between runtime distribution and assignement which can
cause exceeding the quota by up to 70%.
- Prevent negative results in the imbalanace calculation
- Remove a stale warning in the workqueue code which can be triggered
since the call site was moved out of preempt disabled code. It's a
false positive.
- Deduplicate the print macros for procfs
- Add the ucmap values to the SCHED_DEBUG procfs output for completness
* tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Add task uclamp values to SCHED_DEBUG procfs
sched/debug: Factor out printing formats into common macros
sched/debug: Remove redundant macro define
sched/core: Remove unused rq::last_load_update_tick
workqueue: Remove the warning in wq_worker_sleeping()
sched/fair: Fix negative imbalance in imbalance calculation
sched/fair: Fix race between runtime distribution and assignment
sched/fair: Align rq->avg_idle and rq->avg_scan_cost
Pull perf fixes from Thomas Gleixner:
"Three fixes/updates for perf:
- Fix the perf event cgroup tracking which tries to track the cgroup
even for disabled events.
- Add Ice Lake server support for uncore events
- Disable pagefaults when retrieving the physical address in the
sampling code"
* tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Disable page faults when getting phys address
perf/x86/intel/uncore: Add Ice Lake server uncore support
perf/cgroup: Correct indirection in perf_less_group_idx()
perf/core: Fix event cgroup tracking