Commit Graph

30906 Commits

Author SHA1 Message Date
Tom Zanussi
ff9d31d0d4 tracing: Remove unnecessary var_ref destroy in track_data_destroy()
Commit 656fe2ba85 (tracing: Use hist trigger's var_ref array to
destroy var_refs) centralized the destruction of all the var_refs
in one place so that other code didn't have to do it.

The track_data_destroy() added later ignored that and also destroyed
the track_data var_ref, causing a double-free error flagged by KASAN.

==================================================================
BUG: KASAN: use-after-free in destroy_hist_field+0x30/0x70
Read of size 8 at addr ffff888086df2210 by task bash/1694

CPU: 6 PID: 1694 Comm: bash Not tainted 5.1.0-rc1-test+ #15
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03
07/14/2016
Call Trace:
 dump_stack+0x71/0xa0
 ? destroy_hist_field+0x30/0x70
 print_address_description.cold.3+0x9/0x1fb
 ? destroy_hist_field+0x30/0x70
 ? destroy_hist_field+0x30/0x70
 kasan_report.cold.4+0x1a/0x33
 ? __kasan_slab_free+0x100/0x150
 ? destroy_hist_field+0x30/0x70
 destroy_hist_field+0x30/0x70
 track_data_destroy+0x55/0xe0
 destroy_hist_data+0x1f0/0x350
 hist_unreg_all+0x203/0x220
 event_trigger_open+0xbb/0x130
 do_dentry_open+0x296/0x700
 ? stacktrace_count_trigger+0x30/0x30
 ? generic_permission+0x56/0x200
 ? __x64_sys_fchdir+0xd0/0xd0
 ? inode_permission+0x55/0x200
 ? security_inode_permission+0x18/0x60
 path_openat+0x633/0x22b0
 ? path_lookupat.isra.50+0x420/0x420
 ? __kasan_kmalloc.constprop.12+0xc1/0xd0
 ? kmem_cache_alloc+0xe5/0x260
 ? getname_flags+0x6c/0x2a0
 ? do_sys_open+0x149/0x2b0
 ? do_syscall_64+0x73/0x1b0
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
 ? _raw_write_lock_bh+0xe0/0xe0
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __list_add_valid+0x2d/0x70
 ? deactivate_slab.isra.62+0x1f4/0x5a0
 ? getname_flags+0x6c/0x2a0
 ? set_track+0x76/0x120
 do_filp_open+0x11a/0x1a0
 ? may_open_dev+0x50/0x50
 ? _raw_spin_lock+0x7a/0xd0
 ? _raw_write_lock_bh+0xe0/0xe0
 ? __alloc_fd+0x10f/0x200
 do_sys_open+0x1db/0x2b0
 ? filp_open+0x50/0x50
 do_syscall_64+0x73/0x1b0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fa7b24a4ca2
Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 85 7a 0d 00 8b 00 85 c0
75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff
0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007fffbafb3af0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000055d3648ade30 RCX: 00007fa7b24a4ca2
RDX: 0000000000000241 RSI: 000055d364a55240 RDI: 00000000ffffff9c
RBP: 00007fffbafb3bf0 R08: 0000000000000020 R09: 0000000000000002
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000003 R14: 0000000000000001 R15: 000055d364a55240
==================================================================

So remove the track_data_destroy() destroy_hist_field() call for that
var_ref.

Link: http://lkml.kernel.org/r/1deffec420f6a16d11dd8647318d34a66d1989a9.camel@linux.intel.com

Fixes: 466f4528fb ("tracing: Generalize hist trigger onmax and save action")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-26 08:34:00 -04:00
Daniel Borkmann
1da6c4d914 bpf: fix use after free in bpf_evict_inode
syzkaller was able to generate the following UAF in bpf:

  BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
  BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
  Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

  CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
  #110
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x244/0x39d lib/dump_stack.c:113
    print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
    lookup_last fs/namei.c:2269 [inline]
    path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    filename_lookup+0x26a/0x520 fs/namei.c:2348
    user_path_at_empty+0x40/0x50 fs/namei.c:2608
    user_path include/linux/namei.h:62 [inline]
    do_mount+0x180/0x1ff0 fs/namespace.c:2980
    ksys_mount+0x12d/0x140 fs/namespace.c:3258
    __do_sys_mount fs/namespace.c:3272 [inline]
    __se_sys_mount fs/namespace.c:3269 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x457569
  Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
  48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
  ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
  RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
  RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
  RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
  R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
  R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

  Allocated by task 9424:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3722 [inline]
    __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
    kstrdup+0x39/0x70 mm/util.c:49
    bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
    vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
    do_symlinkat+0x242/0x2d0 fs/namei.c:4154
    __do_sys_symlink fs/namei.c:4173 [inline]
    __se_sys_symlink fs/namei.c:4171 [inline]
    __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 9425:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3817
    bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
    evict+0x4b9/0x980 fs/inode.c:558
    iput_final fs/inode.c:1550 [inline]
    iput+0x674/0xa90 fs/inode.c:1576
    do_unlinkat+0x733/0xa30 fs/namei.c:4069
    __do_sys_unlink fs/namei.c:4110 [inline]
    __se_sys_unlink fs/namei.c:4108 [inline]
    __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this scenario path lookup under RCU is racing with the final
unlink in case of symlinks. As Linus puts it in his analysis:

  [...] We actually RCU-delay the inode freeing itself, but
  when we do the final iput(), the "evict()" function is called
  synchronously. Now, the simple fix would seem to just RCU-delay
  the kfree() of the symlink data in bpf_evict_inode(). Maybe
  that's the right thing to do. [...]

Al suggested to piggy-back on the ->destroy_inode() callback in
order to implement RCU deferral there which can then kfree() the
inode->i_link eventually right before putting inode back into
inode cache. By reusing free_inode_nonrcu() from there we can
avoid the need for our own inode cache and just reuse generic
one as we currently do.

And in-fact on top of all this we should just get rid of the
bpf_evict_inode() entirely. This means truncate_inode_pages_final()
and clear_inode() will then simply be called by the fs core via
evict(). Dropping the reference should really only be done when
inode is unhashed and nothing reachable anymore, so it's better
also moved into the final ->destroy_inode() callback.

Fixes: 0f98621bef ("bpf, inode: add support for symlinks and fix mtime/ctime")
Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
2019-03-26 01:38:49 +01:00
Prasad Sodagudi
59c39840f5 genirq: Prevent use-after-free and work list corruption
When irq_set_affinity_notifier() replaces the notifier, then the
reference count on the old notifier is dropped which causes it to be
freed. But nothing ensures that the old notifier is not longer queued
in the work list. If it is queued this results in a use after free and
possibly in work list corruption.

Ensure that the work is canceled before the reference is dropped.

Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: marc.zyngier@arm.com
Link: https://lkml.kernel.org/r/1553439424-6529-1-git-send-email-psodagud@codeaurora.org
2019-03-24 22:13:17 +01:00
Anna-Maria Gleixner
f28d3d5346 timer/trace: Improve timer tracing
Timers are added to the timer wheel off by one. This is required in
case a timer is queued directly before incrementing jiffies to prevent
early timer expiry.

When reading a timer trace and relying only on the expiry time of the timer
in the timer_start trace point and on the now in the timer_expiry_entry
trace point, it seems that the timer fires late. With the current
timer_expiry_entry trace point information only now=jiffies is printed but
not the value of base->clk. This makes it impossible to draw a conclusion
to the index of base->clk and makes it impossible to examine timer problems
without additional trace points.

Therefore add the base->clk value to the timer_expire_entry trace
point, to be able to calculate the index the timer base is located at
during collecting expired timers.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-5-anna-maria@linutronix.de
2019-03-24 20:29:33 +01:00
Anna-Maria Gleixner
dc1e7dc5ac timer: Move trace point to get proper index
When placing the timer_start trace point before the timer wheel bucket
index is calculated, the index information in the trace point is useless.

It is not possible to simply move the debug_activate() call after the index
calculation, because debug_object_activate() needs to be called before
touching the object.

Therefore split debug_activate() and move the trace point into
enqueue_timer() after the new index has been calculated. The
debug_object_activate() call remains at the original place.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-3-anna-maria@linutronix.de
2019-03-24 20:29:32 +01:00
Anna-Maria Gleixner
d6b87eaf10 tick/sched: Update tick_sched struct documentation
Adapt the documentation order of struct members to the effective order of
struct members and add missing descriptions.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/20190321120921.16463-2-anna-maria@linutronix.de
2019-03-24 20:29:32 +01:00
Linus Torvalds
231c807a60 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
 "Third more careful attempt for this set of fixes:

   - Prevent a 32bit math overflow in the cpufreq code

   - Fix a buffer overflow when scanning the cgroup2 cpu.max property

   - A set of fixes for the NOHZ scheduler logic to prevent waking up
     CPUs even if the capacity of the busy CPUs is sufficient along with
     other tweaks optimizing the behaviour for asymmetric systems
     (big/little)"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Skip LLC NOHZ logic for asymmetric systems
  sched/fair: Tune down misfit NOHZ kicks
  sched/fair: Comment some nohz_balancer_kick() kick conditions
  sched/core: Fix buffer overflow in cgroup2 property cpu.max
  sched/cpufreq: Fix 32-bit math overflow
2019-03-24 11:42:10 -07:00
Linus Torvalds
49ef015632 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "A larger set of perf updates.

  Not all of them are strictly fixes, but that's solely the tip
  maintainers fault as they let the timely -rc1 pull request fall
  through the cracks for various reasons including travel. So I'm
  sending this nevertheless because rebasing and distangling fixes and
  updates would be a mess and risky as well. As of tomorrow, a strict
  fixes separation is happening again. Sorry for the slip-up.

  Kernel:

   - Handle RECORD_MMAP vs. RECORD_MMAP2 correctly so different
     consumers of the mmap event get what they requested.

  Tools:

   - A larger set of updates to perf record/report/scripts vs. time
     stamp handling

   - More Python3 fixups

   - A pile of memory leak plumbing

   - perf BPF improvements and fixes

   - Finalize the perf.data directory storage"

[ Note: the kernel part is strictly a fix, the updates are purely to
  tooling       - Linus ]

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  perf bpf: Show more BPF program info in print_bpf_prog_info()
  perf bpf: Extract logic to create program names from perf_event__synthesize_one_bpf_prog()
  perf tools: Save bpf_prog_info and BTF of new BPF programs
  perf evlist: Introduce side band thread
  perf annotate: Enable annotation of BPF programs
  perf build: Check what binutils's 'disassembler()' signature to use
  perf bpf: Process PERF_BPF_EVENT_PROG_LOAD for annotation
  perf symbols: Introduce DSO_BINARY_TYPE__BPF_PROG_INFO
  perf feature detection: Add -lopcodes to feature-libbfd
  perf top: Add option --no-bpf-event
  perf bpf: Save BTF information as headers to perf.data
  perf bpf: Save BTF in a rbtree in perf_env
  perf bpf: Save bpf_prog_info information as headers to perf.data
  perf bpf: Save bpf_prog_info in a rbtree in perf_env
  perf bpf: Make synthesize_bpf_events() receive perf_session pointer instead of perf_tool
  perf bpf: Synthesize bpf events with bpf_program__get_prog_info_linear()
  bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump()
  tools lib bpf: Introduce bpf_program__get_prog_info_linear()
  perf record: Replace option --bpf-event with --no-bpf-event
  perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
  ...
2019-03-24 11:16:27 -07:00
Linus Torvalds
a75eda7bce Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A set of small fixes plus the removal of stale board support code:

   - Remove the board support code from the clpx711x clocksource driver.
     This change had fallen through the cracks and I'm sending it now
     rather than dealing with people who want to improve that stale code
     for 3 month.

   - Use the proper clocksource mask on RICSV

   - Make local scope functions and variables static"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/clps711x: Remove board support
  clocksource/drivers/riscv: Fix clocksource mask
  clocksource/drivers/mips-gic-timer: Make gic_compare_irqaction static
  clocksource/drivers/timer-ti-dm: Make omap_dm_timer_set_load_start() static
  clocksource/drivers/tcb_clksrc: Make tc_clksrc_suspend/resume() static
  clocksource/drivers/clps711x: Make clps711x_clksrc_init() static
  time/jiffies: Make refined_jiffies static
2019-03-24 11:09:47 -07:00
Linus Torvalds
f6cc519b6a Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "Two small fixes:

   - Cure a recently introduces error path hickup which tries to
     unregister a not registered lockdep key in te workqueue code

   - Prevent unaligned cmpxchg() crashes in the robust list handling
     code by sanity checking the user space supplied futex pointer"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Ensure that futex address is aligned in handle_futex_death()
  workqueue: Only unregister a registered lockdep key
2019-03-24 10:58:01 -07:00
Linus Torvalds
e08fef881d Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of fixes for the interrupt subsystem:

   - Remove secondary GIC support on systems w/o device-tree support

   - A set of small fixlets in various irqchip drivers

   - static and fall-through annotations

   - Kernel doc and typo fixes"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Mark expected switch case fall-through
  genirq/devres: Remove excess parameter from kernel doc
  irqchip/irq-mvebu-sei: Make mvebu_sei_ap806_caps static
  irqchip/mbigen: Don't clear eventid when freeing an MSI
  irqchip/stm32: Don't set rising configuration registers at init
  irqchip/stm32: Don't clear rising/falling config registers at init
  dt-bindings: irqchip: renesas-irqc: Document r8a774c0 support
  irqchip/mmp: Make mmp_irq_domain_ops static
  irqchip/brcmstb-l2: Make two init functions static
  genirq: Fix typo in comment of IRQD_MOVE_PCNTXT
  irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp
  irqchip/gic: Drop support for secondary GIC in non-DT systems
  irqchip/imx-irqsteer: Fix of_property_read_u32() error handling
2019-03-24 10:51:23 -07:00
Thomas Gleixner
1b72d43237 tick: Remove outgoing CPU from broadcast masks
Valentin reported that unplugging a CPU occasionally results in a warning
in the tick broadcast code which is triggered when an offline CPU is in the
broadcast mask.

This happens because the outgoing CPU is not removing itself from the
broadcast masks, especially not from the broadcast_force_mask. The removal
happens on the control CPU after the outgoing CPU is dead. It's a long
standing issue, but the warning is harmless.

Rework the hotplug mechanism so that the outgoing CPU removes itself from
the broadcast masks after disabling interrupts and removing itself from the
online mask.

Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903211540180.1784@nanos.tec.linutronix.de
2019-03-23 18:26:43 +01:00
Gustavo A. R. Silva
93417a3fda genirq: Mark expected switch case fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

With -Wimplicit-fallthrough added to CFLAGS:

 kernel/irq/manage.c: In function ‘irq_do_set_affinity’:
 kernel/irq/manage.c:198:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   cpumask_copy(desc->irq_common_data.affinity, mask);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 kernel/irq/manage.c:199:2: note: here
   case IRQ_SET_MASK_OK_NOCOPY:
   ^~~~

Annotate it.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20190228213714.GA9246@embeddedor
2019-03-23 12:32:01 +01:00
Rasmus Villemoes
e1e41b6ce5 timekeeping: Consistently use unsigned int for seqcount snapshot
The timekeeping code uses a random mix of "unsigned long" and "unsigned
int" for the seqcount snapshots (ratio 14:12). Since the seqlock.h API is
entirely based on unsigned int, use that throughout.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190318195557.20773-1-linux@rasmusvillemoes.dk
2019-03-23 11:43:56 +01:00
Thomas Gleixner
4a98be8293 Merge tag 'perf-core-for-mingo-5.1-20190311' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/core improvements and fixes from Arnaldo:

kernel:

  Stephane Eranian :

  - Restore mmap record type correctly when handling PERF_RECORD_MMAP2
    events, as the same template is used for all the threads interested
    in mmap events, some may want just PERF_RECORD_MMAP, while some
    may want the extra info in MMAP2 records.

perf probe:

  Adrian Hunter:

  - Fix getting the kernel map, because since changes related to x86 PTI
    entry trampolines handling, there are more than one kernel map.

perf script:

  Andi Kleen:

  - Support insn output for normal samples, i.e.:

    perf script -F ip,sym,insn --xed

    Will fetch the sample IP from the thread address space and feed it
    to Intel's XED disassembler, producing lines such as:

      ffffffffa4068804 native_write_msr            wrmsr
      ffffffffa415b95e __hrtimer_next_event_base   movq  0x18(%rax), %rdx

    That match 'perf annotate's output.

  - Make the --cpu filter apply to  PERF_RECORD_COMM/FORK/... events, in
    addition to PERF_RECORD_SAMPLE.

perf report:

  - Add a new --samples option to save a small random number of samples
    per hist entry, using a reservoir technique to select a representative
    number of samples.

    Then allow browsing the samples using 'perf script' as part of the hist
    entry context menu. This automatically adds the right filters, so only
    the thread or CPU of the sample is displayed. Then we use less' search
    functionality to directly jump to the time stamp of the selected sample.

    It uses different menus for assembler and source display.  Assembler
    needs xed installed and source needs debuginfo.

  - Fix the UI browser scripts pop up menu when there are many scripts
    available.

perf report:

  Andi Kleen:

  - Add 'time' sort option. E.g.:

    % perf report --sort time,overhead,symbol --time-quantum 1ms --stdio
    ...
         0.67%  277061.87300  [.] _dl_start
         0.50%  277061.87300  [.] f1
         0.50%  277061.87300  [.] f2
         0.33%  277061.87300  [.] main
         0.29%  277061.87300  [.] _dl_lookup_symbol_x
         0.29%  277061.87300  [.] dl_main
         0.29%  277061.87300  [.] do_lookup_x
         0.17%  277061.87300  [.] _dl_debug_initialize
         0.17%  277061.87300  [.] _dl_init_paths
         0.08%  277061.87300  [.] check_match
         0.04%  277061.87300  [.] _dl_count_modids
         1.33%  277061.87400  [.] f1
         1.33%  277061.87400  [.] f2
         1.33%  277061.87400  [.] main
         1.17%  277061.87500  [.] main
         1.08%  277061.87500  [.] f1
         1.08%  277061.87500  [.] f2
         1.00%  277061.87600  [.] main
         0.83%  277061.87600  [.] f1
         0.83%  277061.87600  [.] f2
         1.00%  277061.87700  [.] main

tools headers:

  Arnaldo Carvalho de Melo:

  - Update x86's syscall_64.tbl, no change in tools/perf behaviour.

  -  Sync copies asm-generic/unistd.h and linux/in with the kernel sources.

perf data:

  Jiri Olsa:

  - Prep work to support having perf.data stored as a directory, with one
    file per CPU, that ultimately will allow having one ring buffer reading
    thread per CPU.

Vendor events:

  Martin Liška:

  - perf PMU events for AMD Family 17h.

perf script python:

  Tony Jones:

  - Add python3 support for the remaining Intel PT related scripts, with
    these we should have a clean build of perf with python3 while still
    supporting the build with python2.

libbpf:

  Arnaldo Carvalho de Melo:

  - Fix the build on uCLibc, adding the missing stdarg.h since we use
    va_list in one typedef.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-22 22:50:41 +01:00
Johannes Berg
3b0f31f2b8 genetlink: make policy common to family
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.

The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.

This reduces the size of e.g. nl80211.o (which has lots of commands):

   text	   data	    bss	    dec	    hex	filename
 398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
 397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
--------------------------------
   -832      +8       0    -824

Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.

Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
     {
    -	.policy = POLICY,
     },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
            .ops = OPS,
            .maxattr = M,
    +       .policy = POLICY,
            ...
    };

This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-22 10:38:23 -04:00
Thomas Gleixner
d7dcf26ff0 softirq: Remove tasklet_hrtimer
There are no more users of this interface.
Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Link: https://lkml.kernel.org/r/20190301224821.29843-4-bigeasy@linutronix.de
2019-03-22 14:36:02 +01:00
Valdis Kletnieks
48084abf21 watchdog/core: Make variables static
sparse complains:
  CHECK   kernel/watchdog.c
kernel/watchdog.c:45:19: warning: symbol 'nmi_watchdog_available'
			 	  was not declared. Should it be static?
kernel/watchdog.c:47:16: warning: symbol 'watchdog_allowed_mask'
			 	  was not declared. Should it be static?

They're not referenced by name from anyplace else, make them static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/7855.1552383228@turing-police
2019-03-22 13:40:17 +01:00
Valdis Kletnieks
e8750053d6 time/jiffies: Make refined_jiffies static
sparse complains:

  CHECK   kernel/time/jiffies.c
kernel/time/jiffies.c:92:20: warning: symbol 'refined_jiffies' was not
			     	      declared. Should it be static?

Its only used in file scope. Make it static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/32342.1552379915@turing-police
2019-03-22 13:38:26 +01:00
Valdis Kletnieks
bb2e320565 genirq/devres: Remove excess parameter from kernel doc
Building with 'make W=1' complains:

  CC      kernel/irq/devres.o
kernel/irq/devres.c:104: warning: Excess function parameter 'thread_fn'
			 description in 'devm_request_any_context_irq'

Remove it.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/31207.1552378676@turing-police
2019-03-22 13:34:12 +01:00
Chen Jie
5a07168d8d futex: Ensure that futex address is aligned in handle_futex_death()
The futex code requires that the user space addresses of futexes are 32bit
aligned. sys_futex() checks this in futex_get_keys() but the robust list
code has no alignment check in place.

As a consequence the kernel crashes on architectures with strict alignment
requirements in handle_futex_death() when trying to cmpxchg() on an
unaligned futex address which was retrieved from the robust list.

[ tglx: Rewrote changelog, proper sizeof() based alignement check and add
  	comment ]

Fixes: 0771dfefc9 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <dvhart@infradead.org>
Cc: <peterz@infradead.org>
Cc: <zengweilin@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com
2019-03-22 13:05:26 +01:00
Jakub Kicinski
83d163124c bpf: verifier: propagate liveness on all frames
Commit 7640ead939 ("bpf: verifier: make sure callees don't prune
with caller differences") connected up parentage chains of all
frames of the stack.  It didn't, however, ensure propagate_liveness()
propagates all liveness information along those chains.

This means pruning happening in the callee may generate explored
states with incomplete liveness for the chains in lower frames
of the stack.

The included selftest is similar to the prior one from commit
7640ead939 ("bpf: verifier: make sure callees don't prune with
caller differences"), where callee would prune regardless of the
difference in r8 state.

Now we also initialize r9 to 0 or 1 based on a result from get_random().
r9 is never read so the walk with r9 = 0 gets pruned (correctly) after
the walk with r9 = 1 completes.

The selftest is so arranged that the pruning will happen in the
callee.  Since callee does not propagate read marks of r8, the
explored state at the pruning point prior to the callee will
now ignore r8.

Propagate liveness on all frames of the stack when pruning.

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 19:57:02 -07:00
Lorenz Bauer
edbf8c01de bpf: add skc_lookup_tcp helper
Allow looking up a sock_common. This gives eBPF programs
access to timewait and request sockets.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Lorenz Bauer
85a51f8c28 bpf: allow helpers to return PTR_TO_SOCK_COMMON
It's currently not possible to access timewait or request sockets
from eBPF, since there is no way to return a PTR_TO_SOCK_COMMON
from a helper. Introduce RET_PTR_TO_SOCK_COMMON to enable this
behaviour.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Lorenz Bauer
0f3adc288d bpf: track references based on is_acquire_func
So far, the verifier only acquires reference tracking state for
RET_PTR_TO_SOCKET_OR_NULL. Instead of extending this for every
new return type which desires these semantics, acquire reference
tracking state iff the called helper is an acquire function.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-21 18:59:10 -07:00
Xu Yu
0803278b0b bpf: do not restore dst_reg when cur_state is freed
Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

Call trace:

  dump_stack+0xbf/0x12e
  print_address_description+0x6a/0x280
  kasan_report+0x237/0x360
  sanitize_ptr_alu+0x85a/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fault injection trace:

  kfree+0xea/0x290
  free_func_state+0x4a/0x60
  free_verifier_state+0x61/0xe0
  push_stack+0x216/0x2f0	          <- inject failslab
  sanitize_ptr_alu+0x2b1/0x8d0
  adjust_ptr_min_max_vals+0x8f2/0x1ca0
  adjust_reg_min_max_vals+0x8ed/0x22e0
  do_check+0x1ca6/0x5d00
  bpf_check+0x9ca/0x2570
  bpf_prog_load+0xc91/0x1030
  __se_sys_bpf+0x61e/0x1f00
  do_syscall_64+0xc8/0x550
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

When kzalloc() fails in push_stack(), free_verifier_state() will free
current verifier state. As push_stack() returns, dst_reg was restored
if ptr_is_dst_reg is false. However, as member of the cur_state,
dst_reg is also freed, and error occurs when dereferencing dst_reg.
Simply fix it by testing ret of push_stack() before restoring dst_reg.

Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-21 12:18:18 +01:00
Bart Van Assche
82efcab3b9 workqueue: Only unregister a registered lockdep key
The recent change to prevent use after free and a memory leak introduced an
unconditional call to wq_unregister_lockdep() in the error handling
path. If the lockdep key had not been registered yet, then the lockdep core
emits a warning.

Only call wq_unregister_lockdep() if wq_register_lockdep() has been
called first.

Fixes: 009bb421b6 ("workqueue, lockdep: Fix an alloc_workqueue() error path")
Reported-by: syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org
2019-03-21 12:00:18 +01:00
Martin KaFai Lau
cba368c1f0 bpf: Only print ref_obj_id for refcounted reg
Naresh reported that test_align fails because of the mismatch at the
verbose printout of the register states.  The reason is due to the newly
added ref_obj_id.

ref_obj_id is only useful for refcounted reg.  Thus, this patch fixes it
by only printing ref_obj_id for refcounted reg.  While at it, it also uses
comma instead of space to separate between "id" and "ref_obj_id".

Fixes: 1b98658968 ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-20 18:24:35 -07:00
Dmitry V. Levin
16add41164 syscall_get_arch: add "struct task_struct *" argument
This argument is required to extend the generic ptrace API with
PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going
to be called from ptrace_request() along with syscall_get_nr(),
syscall_get_arguments(), syscall_get_error(), and
syscall_get_return_value() functions with a tracee as their argument.

The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6)
should describe what system call is being called and what its arguments
are.

Reverts: 5e937a9ae9 ("syscall_get_arch: remove useless function arguments")
Reverts: 1002d94d30 ("syscall.h: fix doc text for syscall_get_arch()")
Reviewed-by: Andy Lutomirski <luto@kernel.org> # for x86
Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Kees Cook <keescook@chromium.org> # seccomp parts
Acked-by: Mark Salter <msalter@redhat.com> # for the c6x bit
Cc: Elvira Khabirova <lineprinter@altlinux.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: x86@kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-03-20 21:12:36 -04:00
YueHaibing
2efa48fec0 audit: Make audit_log_cap and audit_copy_inode static
Fix sparse warning:

kernel/auditsc.c:1150:6: warning: symbol 'audit_log_cap' was not declared. Should it be static?
kernel/auditsc.c:1908:6: warning: symbol 'audit_copy_inode' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-03-20 21:04:32 -04:00
Richard Guy Briggs
73e65b88fe audit: connect LOGIN record to its syscall record
Currently the AUDIT_LOGIN event is a standalone record that isn't
connected to any other records that may be part of its syscall event. To
avoid the confusion of generating two events, connect the records by
using its syscall context.

Please see the github issue
https://github.com/linux-audit/audit-kernel/issues/110

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-03-20 20:57:48 -04:00
Bart Van Assche
8194fe94ab kernel/workqueue: Document wq_worker_last_func() argument
This patch avoids that the following warning is reported when building
with W=1:

kernel/workqueue.c:938: warning: Function parameter or member 'task' not described in 'wq_worker_last_func'

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-19 10:48:20 -07:00
Valentin Schneider
b9a7b88316 sched/fair: Skip LLC NOHZ logic for asymmetric systems
The LLC NOHZ condition will become true as soon as >=2 CPUs in a
single LLC domain are busy. On big.LITTLE systems, this translates to
two or more CPUs of a "cluster" (big or LITTLE) being busy.

Issuing a NOHZ kick in these conditions isn't desired for asymmetric
systems, as if the busy CPUs can provide enough compute capacity to
the running tasks, then we can leave the NOHZ CPUs in peace.

Skip the LLC NOHZ condition for asymmetric systems, and rely on
nr_running & capacity checks to trigger NOHZ kicks when the system
actually needs them.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dietmar.Eggemann@arm.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190211175946.4961-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19 12:06:15 +01:00
Valentin Schneider
a0fe2cf086 sched/fair: Tune down misfit NOHZ kicks
In this commit:

  3b1baa6496 ("sched/fair: Add 'group_misfit_task' load-balance type")

we set rq->misfit_task_load whenever the current running task has a
utilization greater than 80% of rq->cpu_capacity. A non-zero value in
this field enables misfit load balancing.

However, if the task being looked at is already running on a CPU of
highest capacity, there's nothing more we can do for it. We can
currently spot this in update_sd_pick_busiest(), which prevents us
from selecting a sched_group of group_type == group_misfit_task as the
busiest group, but we don't do any of that in nohz_balancer_kick().

This means that we could repeatedly kick NOHZ CPUs when there's no
improvements in terms of load balance to be done.

Introduce a check_misfit_status() helper that returns true iff there
is a CPU in the system that could give more CPU capacity to a rq's
misfit task - IOW, there exists a CPU of higher capacity_orig or the
rq's CPU is severely pressured by rt/IRQ.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dietmar.Eggemann@arm.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: morten.rasmussen@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190211175946.4961-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19 12:06:15 +01:00
Valentin Schneider
e25a7a944f sched/fair: Comment some nohz_balancer_kick() kick conditions
We now have a comment explaining the first sched_domain based NOHZ kick,
so might as well comment them all.

While at it, unwrap a line that fits under 80 characters.

Co-authored-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dietmar.Eggemann@arm.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: morten.rasmussen@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190211175946.4961-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19 12:06:15 +01:00
Konstantin Khlebnikov
4c47acd824 sched/core: Fix buffer overflow in cgroup2 property cpu.max
Add limit into sscanf format string for on-stack buffer.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0d5936344f ("sched: Implement interface for cgroup unified hierarchy")
Link: https://lkml.kernel.org/r/155189230232.2620.13120481613524200065.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19 12:06:15 +01:00
Peter Zijlstra
a23314e9d8 sched/cpufreq: Fix 32-bit math overflow
Vincent Wang reported that get_next_freq() has a mult overflow bug on
32-bit platforms in the IOWAIT boost case, since in that case {util,max}
are in freq units instead of capacity units.

Solve this by moving the IOWAIT boost to capacity units. And since this
means @max is constant; simplify the code.

Reported-by: Vincent Wang <vincent.wang@unisoc.com>
Tested-by: Vincent Wang <vincent.wang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190305083202.GU32494@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19 12:06:11 +01:00
Li RongQing
95e0b46fce audit: fix a memleak caused by auditing load module
module.name will be allocated unconditionally when auditing load
module, and audit_log_start() can fail with other reasons, or
audit_log_exit maybe not called, caused module.name is not freed

so free module.name in audit_free_context and __audit_syscall_exit

unreferenced object 0xffff88af90837d20 (size 8):
  comm "modprobe", pid 1036, jiffies 4294704867 (age 3069.138s)
  hex dump (first 8 bytes):
    69 78 67 62 65 00 ff ff                          ixgbe...
  backtrace:
    [<0000000008da28fe>] __audit_log_kern_module+0x33/0x80
    [<00000000c1491e61>] load_module+0x64f/0x3850
    [<000000007fc9ae3f>] __do_sys_init_module+0x218/0x250
    [<0000000000d4a478>] do_syscall_64+0x117/0x400
    [<000000004924ded8>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000007dc331dd>] 0xffffffffffffffff

Fixes: ca86cad738 ("audit: log module name on init_module")
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
[PM: manual merge fixup in __audit_syscall_exit()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-03-18 19:11:49 -04:00
Martynas Pumputis
f01a7dbe98 bpf: Try harder when allocating memory for large maps
It has been observed that sometimes a higher order memory allocation
for BPF maps fails when there is no obvious memory pressure in a system.
E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
could not be created due to vmalloc unable to allocate 75497472B,
when the system's memory consumption (in MB) was the following:

    Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727

Later analysis [1] by Michal Hocko showed that the vmalloc was not trying
to reclaim memory from the page cache and was failing prematurely due to
__GFP_NORETRY.

Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") and [1], we can replace
__GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
and will try harder to fulfil allocation requests.

Unfortunately, replacing the body of the BPF map memory allocation
function with the kvmalloc_node helper function is not an option at
this point in time, given 1) kmalloc is non-optional for higher order
allocations, and 2) passing __GFP_RETRY_MAYFAIL to the kmalloc would
stress the slab allocator too much for large requests.

The change has been tested with the workloads mentioned above and by
observing oom_kill value from /proc/vmstat.

[1]: https://lore.kernel.org/bpf/20190310071318.GW5232@dhcp22.suse.cz/

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Acked-by: Yonghong Song <yhs@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20190318153940.GL8924@dhcp22.suse.cz/
2019-03-18 16:48:25 +01:00
Linus Torvalds
a9dce6679d Merge tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd system call from Christian Brauner:
 "This introduces the ability to use file descriptors from /proc/<pid>/
  as stable handles on struct pid. Even if a pid is recycled the handle
  will not change. For a start these fds can be used to send signals to
  the processes they refer to.

  With the ability to use /proc/<pid> fds as stable handles on struct
  pid we can fix a long-standing issue where after a process has exited
  its pid can be reused by another process. If a caller sends a signal
  to a reused pid it will end up signaling the wrong process.

  With this patchset we enable a variety of use cases. One obvious
  example is that we can now safely delegate an important part of
  process management - sending signals - to processes other than the
  parent of a given process by sending file descriptors around via scm
  rights and not fearing that the given process will have been recycled
  in the meantime. It also allows for easy testing whether a given
  process is still alive or not by sending signal 0 to a pidfd which is
  quite handy.

  There has been some interest in this feature e.g. from systems
  management (systemd, glibc) and container managers. I have requested
  and gotten comments from glibc to make sure that this syscall is
  suitable for their needs as well. In the future I expect it to take on
  most other pid-based signal syscalls. But such features are left for
  the future once they are needed.

  This has been sitting in linux-next for quite a while and has not
  caused any issues. It comes with selftests which verify basic
  functionality and also test that a recycled pid cannot be signaled via
  a pidfd.

  Jon has written about a prior version of this patchset. It should
  cover the basic functionality since not a lot has changed since then:

      https://lwn.net/Articles/773459/

  The commit message for the syscall itself is extensively documenting
  the syscall, including it's functionality and extensibility"

* tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add tests for pidfd_send_signal()
  signal: add pidfd_send_signal() syscall
2019-03-16 13:47:14 -07:00
Linus Torvalds
f67e3fb489 Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
 "New device-dax infrastructure to allow persistent memory and other
  "reserved" / performance differentiated memories, to be assigned to
  the core-mm as "System RAM".

  Some users want to use persistent memory as additional volatile
  memory. They are willing to cope with potential performance
  differences, for example between DRAM and 3D Xpoint, and want to use
  typical Linux memory management apis rather than a userspace memory
  allocator layered over an mmap() of a dax file. The administration
  model is to decide how much Persistent Memory (pmem) to use as System
  RAM, create a device-dax-mode namespace of that size, and then assign
  it to the core-mm. The rationale for device-dax is that it is a
  generic memory-mapping driver that can be layered over any "special
  purpose" memory, not just pmem. On subsequent boots udev rules can be
  used to restore the memory assignment.

  One implication of using pmem as RAM is that mlock() no longer keeps
  data off persistent media. For this reason it is recommended to enable
  NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
  at rest. We considered making this recommendation an actively enforced
  requirement, but in the end decided to leave it as a distribution /
  administrator policy to allow for emulation and test environments that
  lack security capable NVDIMMs.

  Summary:

   - Replace the /sys/class/dax device model with /sys/bus/dax, and
     include a compat driver so distributions can opt-in to the new ABI.

   - Allow for an alternative driver for the device-dax address-range

   - Introduce the 'kmem' driver to hotplug / assign a device-dax
     address-range to the core-mm.

   - Arrange for the device-dax target-node to be onlined so that the
     newly added memory range can be uniquely referenced by numa apis"

NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.

And apparently the PMEM modules can cause that a lot more than regular
RAM.  The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.

Quoting Dan from another email:
 "The exposure can be reduced in the volatile-RAM case by scanning for
  and clearing errors before it is onlined as RAM. The userspace tooling
  for that can be in place before v5.1-final. There's also runtime
  notifications of errors via acpi_nfit_uc_error_notify() from
  background scrubbers on the DIMM devices. With that mechanism the
  kernel could proactively clear newly discovered poison in the volatile
  case, but that would be additional development more suitable for v5.2.

  I understand the concern, and the need to highlight this issue by
  tapping the brakes on feature development, but I don't see PMEM as RAM
  making the situation worse when the exposure is also there via DAX in
  the PMEM case. Volatile-RAM is arguably a safer use case since it's
  possible to repair pages where the persistent case needs active
  application coordination"

* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: "Hotplug" persistent memory for use like normal RAM
  mm/resource: Let walk_system_ram_range() search child resources
  mm/memory-hotplug: Allow memory resources to be children
  mm/resource: Move HMM pr_debug() deeper into resource code
  mm/resource: Return real error codes from walk failures
  device-dax: Add a 'modalias' attribute to DAX 'bus' devices
  device-dax: Add a 'target_node' attribute
  device-dax: Auto-bind device after successful new_id
  acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
  device-dax: Add /sys/class/dax backwards compatibility
  device-dax: Add support for a dax override driver
  device-dax: Move resource pinning+mapping into the common driver
  device-dax: Introduce bus + driver model
  device-dax: Start defining a dax bus model
  device-dax: Remove multi-resource infrastructure
  device-dax: Kill dax_region base
  device-dax: Kill dax_region ida
2019-03-16 13:05:32 -07:00
Linus Torvalds
11efae3506 Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block
Pull more block layer changes from Jens Axboe:
 "This is a collection of both stragglers, and fixes that came in after
  I finalized the initial pull. This contains:

   - An MD pull request from Song, with a few minor fixes

   - Set of NVMe patches via Christoph

   - Pull request from Konrad, with a few fixes for xen/blkback

   - pblk fix IO calculation fix (Javier)

   - Segment calculation fix for pass-through (Ming)

   - Fallthrough annotation for blkcg (Mathieu)"

* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
  blkcg: annotate implicit fall through
  nvme-tcp: support C2HData with SUCCESS flag
  nvmet: ignore EOPNOTSUPP for discard
  nvme: add proper write zeroes setup for the multipath device
  nvme: add proper discard setup for the multipath device
  nvme: remove nvme_ns_config_oncs
  nvme: disable Write Zeroes for qemu controllers
  nvmet-fc: bring Disconnect into compliance with FC-NVME spec
  nvmet-fc: fix issues with targetport assoc_list list walking
  nvme-fc: reject reconnect if io queue count is reduced to zero
  nvme-fc: fix numa_node when dev is null
  nvme-fc: use nr_phys_segments to determine existence of sgl
  nvme-loop: init nvmet_ctrl fatal_err_work when allocate
  nvme: update comment to make the code easier to read
  nvme: put ns_head ref if namespace fails allocation
  nvme-trace: fix cdw10 buffer overrun
  nvme: don't warn on block content change effects
  nvme: add get-feature to admin cmds tracer
  md: Fix failed allocation of md_register_thread
  It's wrong to add len to sector_nr in raid10 reshape twice
  ...
2019-03-16 12:36:39 -07:00
David S. Miller
0aedadcf6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-03-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.

2) Fix BTF to properly resolve forward-declared enums into their corresponding
   full enum definition types during deduplication, from Andrii.

3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.

4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
   bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.

5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.

6) Various fixes in BPF helper function documentation in bpf.h UAPI header
   used to bpf-helpers(7) man page, from Quentin.

7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-16 12:20:08 -07:00
Linus Torvalds
aa2e3ac64a Merge tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes and cleanups from Steven Rostedt:
 "This contains a series of last minute clean ups, small fixes and error
  checks"

* tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/probe: Verify alloc_trace_*probe() result
  tracing/probe: Check event/group naming rule at parsing
  tracing/probe: Check the size of argument name and body
  tracing/probe: Check event name length correctly
  tracing/probe: Check maxactive error cases
  tracing: kdb: Fix ftdump to not sleep
  trace/probes: Remove kernel doc style from non kernel doc comment
  tracing/probes: Make reserved_field_names static
2019-03-15 14:47:02 -07:00
Linus Torvalds
2b9c272cf5 Merge tag 'fbdev-v5.1' of git://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
 "Just a couple of small fixes and cleanups:

   - fix memory access if logo is bigger than the screen (Manfred
     Schlaegl)

   - silence fbcon logo on 'quiet' boots (Prarit Bhargava)

   - use kvmalloc() for scrollback buffer in fbcon (Konstantin Khorenko)

   - misc fixes (Colin Ian King, YueHaibing, Matteo Croce, Mathieu
     Malaterre, Anders Roxell, Arnd Bergmann)

   - misc cleanups (Rob Herring, Lubomir Rintel, Greg Kroah-Hartman,
     Jani Nikula, Michal Vokáč)"

* tag 'fbdev-v5.1' of git://github.com/bzolnier/linux:
  fbdev: mbx: fix a misspelled variable name
  fbdev: omap2: fix warnings in dss core
  video: fbdev: Fix potential NULL pointer dereference
  fbcon: Silence fbcon logo on 'quiet' boots
  printk: Export console_printk
  ARM: dts: imx28-cfa10036: Fix the reset gpio signal polarity
  video: ssd1307fb: Do not hard code active-low reset sequence
  dt-bindings: display: ssd1307fb: Remove reset-active-low from examples
  fbdev: fbmem: fix memory access if logo is bigger than the screen
  video/fbdev: refactor video= cmdline parsing
  fbdev: mbx: fix up debugfs file creation
  fbdev: omap2: no need to check return value of debugfs_create functions
  video: fbdev: geode: remove ifdef OLPC noise
  video: offb: annotate implicit fall throughs
  omapfb: fix typo
  fbdev: Use of_node_name_eq for node name comparisons
  fbcon: use kvmalloc() for scrollback buffer
  fbdev: chipsfb: remove set but not used variable 'size'
  fbdev/via: fix spelling mistake "Expandsion" -> "Expansion"
2019-03-15 14:22:59 -07:00
Mathieu Malaterre
a2775bbc1d kernel/workqueue: Use __printf markup to silence compiler in function 'alloc_workqueue'
Silence warnings (triggered at W=1) by adding relevant __printf attributes.

  kernel/workqueue.c:4249:2: warning: function 'alloc_workqueue' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-15 08:47:22 -07:00
Masami Hiramatsu
a039480e9e tracing/probe: Verify alloc_trace_*probe() result
Since alloc_trace_*probe() returns -EINVAL only if !event && !group,
it should not happen in trace_*probe_create(). If we catch that case
there is a bug. So use WARN_ON_ONCE() instead of pr_info().

Link: http://lkml.kernel.org/r/155253785078.14922.16902223633734601469.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:54:21 -04:00
Masami Hiramatsu
5b7a962209 tracing/probe: Check event/group naming rule at parsing
Check event and group naming rule at parsing it instead
of allocating probes.

Link: http://lkml.kernel.org/r/155253784064.14922.2336893061156236237.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:54:11 -04:00
Masami Hiramatsu
b4443c17a3 tracing/probe: Check the size of argument name and body
Check the size of argument name and expression is not 0
and smaller than maximum length.

Link: http://lkml.kernel.org/r/155253783029.14922.12650939303827581096.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:53:57 -04:00
Masami Hiramatsu
dec65d79fd tracing/probe: Check event name length correctly
Ensure given name of event is not too long when parsing it,
and fix to update event name offset correctly when the group
name is given. For example, this makes probe event to check
the "p:foo/" error case correctly.

Link: http://lkml.kernel.org/r/155253782046.14922.14724124823730168629.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:53:47 -04:00