Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
uncacheable rendering them impractical for application usage. DAX-pte
mappings are cached and the goal of establishing DAX-pmd mappings is to
attain more performance, not dramatically less (3 orders of magnitude).
track_pfn_insert() relies on a previous call to reserve_memtype() to
establish the expected page_cache_mode for the range. While memremap()
arranges for reserve_memtype() to be called, devm_memremap_pages() does
not. So, teach track_pfn_insert() and untrack_pfn() how to handle
tracking without a vma, and arrange for devm_memremap_pages() to
establish the write-back-cache reservation in the memtype tree.
Cc: <stable@vger.kernel.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Kai Zhang <kai.ka.zhang@oracle.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
LLVM can generate code that tests for direct packet access via
skb->data/data_end in a way that currently gets rejected by the
verifier, example:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)
The reason why this gets rejected despite a proper test is that we
currently call find_good_pkt_pointers() only in case where we detect
tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
derived, for example, from a register of type pkt(id=Y,off=0,r=0)
pointing to skb->data. find_good_pkt_pointers() then fills the range
in the current branch to pkt(id=Y,off=0,r=Z) on success.
For above case, we need to extend that to recognize pkt_end >= rX
pattern and mark the other branch that is taken on success with the
appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
only two practical options to test for from what LLVM could have
generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
that we would need to take into account as well.
After the fix:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
29: (bf) r1 = r8
30: (25) if r8 > 0x3c goto pc+47
R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
R9=pkt(id=0,off=0,r=54) R10=fp
31: (b7) r1 = 1
[...]
Verifier test cases are also added in this work, one that demonstrates
the mentioned example here and one that tries a bad packet access for
the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
with both test variants (>, >=).
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newly added bpf_overflow_handler function is only built of both
CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
only checks the latter:
kernel/events/core.c: In function 'perf_event_alloc':
kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)
This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
is disabled entirely.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aa6a5f3cb2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some callers of __irq_set_trigger() masks all flags except trigger mode
flags. This is unnecessary, ase __irq_set_trigger() already does this
before usage of flags.
[ tglx: Moved the flag mask and adjusted comment. Removed the hunk in
enable_percpu_irq() as it is required there ]
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Link: http://lkml.kernel.org/r/20160719095408.13778-1-kuleshovmail@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some compilers are unhappy with the anon union in the state array. Replace
it with a named union.
While at it align the state array initializers proper and add the missing
name tags.
Fixes: cf392d10b6 "cpu/hotplug: Add multi instance support"
Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Fenguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
of_clk_init() ends up calling into pm_qos_update_request() very early
during boot where irq is expected to stay disabled.
pm_qos_update_request() uses cancel_delayed_work_sync() which
correctly assumes that irq is enabled on invocation and
unconditionally disables and re-enables it.
Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
enabling irq unexpectedly during early boot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Qiao Zhou <qiaozhou@asrmicro.com>
Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The dl task will be replenished after dl task timer fire and start a
new period. It will be enqueued and to re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue will splash since the target CPU is offline.
As a result we get:
WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
Call Trace:
dump_stack+0x99/0xd0
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
irq_work_queue_on+0xad/0xe0
tick_nohz_full_kick_cpu+0x44/0x50
tick_nohz_dep_set_cpu+0x74/0xb0
enqueue_task_dl+0x226/0x480
activate_task+0x5c/0xa0
dl_task_timer+0x19b/0x2c0
? push_dl_task.part.31+0x190/0x190
This can be triggered by hot-unplugging the full dynticks CPU which dl
task is running on.
We enqueue the dl task on the offline CPU, because we need to do
replenish for start_dl_timer(). So, as Juri pointed out, we would
need to do is calling replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem as the task shouldn't
be boosted if it was throttled.
This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by
first migrating (set_task_cpu()) and then doing 1 enqueue.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.
The task being woken up is already awake from a schedule()
and is doing the following:
do {
schedule()
set_current_state(TASK_(UN)INTERRUPTIBLE);
} while (!cond);
The waker, actually gets stuck doing the following in
try_to_wake_up():
while (p->on_cpu)
cpu_relax();
Analysis:
The instance I've seen involves the following race:
CPU1 CPU2
while () {
if (cond)
break;
do {
schedule();
set_current_state(TASK_UN..)
} while (!cond);
wakeup_routine()
spin_lock_irqsave(wait_lock)
raw_spin_lock_irqsave(wait_lock) wake_up_process()
} try_to_wake_up()
set_current_state(TASK_RUNNING); ..
list_del(&waiter.list);
CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:
CPU3
wakeup_routine()
raw_spin_lock_irqsave(wait_lock)
if (!list_empty)
wake_up_process()
try_to_wake_up()
raw_spin_lock_irqsave(p->pi_lock)
..
if (p->on_rq && ttwu_wakeup())
..
while (p->on_cpu)
cpu_relax()
..
CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.
CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.
The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
Reproduction of the issue:
The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.
Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
architectures do not have a full barrier in switch_to()
so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer fixes from Thomas Gleixner:
"Two fixlet from the timers departement:
- A fix for scheduler stalls in the tick idle code affecting
NOHZ_FULL kernels
- A trivial compile fix"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/nohz: Fix softlockup on scheduler stalls in kvm guest
clocksource/drivers/atmel-pit: Fix compilation error
Most (if not all) code here implicitly assumes that the maximum number of
IRQs per chip will be 32, and thus uses 'u32' or 'unsigned long' for many
tasks (for example "struct irq_data" declares its 'mask' field as 'u32',
and "struct irq_chip_generic" declares its 'installed' field as 'unsigned
long')
However, there is no check to verify that irqs_per_chip is <= 32. Hence,
calling irq_alloc_domain_generic_chips() with a bigger value will result in
unexpected results.
Provide a wrapper with a MAYBE_BUILD_BUG_ON(nrirqs >= 32) to catch such
cases.
[ tglx: Reduced changelog to the essential information ]
Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/57B31D94.5040701@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds the ability for a given state to have multiple
instances. Until now all states have a single instance and the startup /
teardown callback use global variables.
A few drivers need to perform a the same callbacks on multiple
"instances". Currently we have three drivers in tree which all have a
global list which they iterate over. With multi instance they support
don't need their private list and the functionality has been moved into
core code. Plus we hold the hotplug lock in core so no cpus comes/goes
while instances are registered and we do rollback in error case :)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is preparation for the following patch.
This rework here changes the arguments of cpuhp_invoke_callback(). It
passes now `state' and whether `startup' or `teardown' callback should
be invoked. The callback then is looked up by the function.
The following is a clanup of callers:
- cpuhp_issue_call() has one argument less
- struct cpuhp_cpu_state (which is used by the hotplug thread) gets also
its callback removed. The decision if it is a single callback
invocation moved to the `single' variable. Also a `bringup' variable
has been added to distinguish between startup and teardown callback.
- take_cpu_down() needs to start one step earlier. We always get here
via CPUHP_TEARDOWN_CPU callback. Before that change cpuhp_ap_states +
CPUHP_TEARDOWN_CPU pointed to an empty entry because TEARDOWN is saved
in bp_states for this reason. Now that we use cpuhp_get_step() to
lookup the state we must explicitly skip it in order not to invoke it
twice.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and sampling event won't be stored in the ring buffer.
The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.
The program refcnt is incremented in case perf_events are inhereted
when target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap already attached program. The user space
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.
The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)
The program visible context meta structure is
struct bpf_perf_event_data {
struct pt_regs regs;
__u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
... ctx->sample_period ...
... ctx->regs.ip ...
}
The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As NMIs can also cause latency when interrupts are disabled, the hwlat
detectory has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.
As ftrace_nmi_enter/exit() funtions are no longer used (except for sh, which
isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check if hwlat detector is tracing or
not, and if so, it calls into the hwlat utility.
Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:
#3 inner/outer(us): 9/9 ts:1470836488.206734548
#4 inner/outer(us): 0/8 ts:1470836497.140808588
#5 inner/outer(us): 0/6 ts:1470836499.140825168 nmi-total:5 nmi-count:1
#6 inner/outer(us): 9/9 ts:1470836501.140841748
All time is still tracked in microseconds.
The NMI information is only shown when an NMI occurred during the sample.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of having the hwlat detector thread stay on one CPU, have it migrate
across all the CPUs specified by tracing_cpumask. If the user modifies the
thread's CPU affinity, the migration will stop until the next instance that
the tracer is instantiated. The migration happens at the end of each window
(period).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The hardware latency tracer has been in the PREEMPT_RT patch for some time.
It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.
The logic is pretty simple. It simply creates a thread that spins on a
single CPU for a specified amount of time (width) within a periodic window
(window). These numbers may be adjusted by their cooresponding names in
/sys/kernel/tracing/hwlat_detector/
The defaults are window = 1000000 us (1 second)
width = 500000 us (1/2 second)
The loop consists of:
t1 = trace_clock_local();
t2 = trace_clock_local();
Where trace_clock_local() is a variant of sched_clock().
The difference of t2 - t1 is recorded as the "inner" timestamp and also the
timestamp t1 - prev_t2 is recorded as the "outer" timestamp. If either of
these differences are greater than the time denoted in
/sys/kernel/tracing/tracing_thresh then it records the event.
When this tracer is started, and tracing_thresh is zero, it changes to the
default threshold of 10 us.
The hwlat tracer in the PREEMPT_RT patch was originally written by
Jon Masters. I have modified it quite a bit and turned it into a
tracer.
Based-on-code-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Without this patch irq_domain_disassociate() cannot properly release the
interrupt. In fact, irq_map_generic_chip() checks a bit on 'gc->installed'
but said bit is never cleared, only set.
Commit 088f40b7b0 ("genirq: Generic chip: Add linear irq domain support")
added irq_map_generic_chip() function and also stated "This lacks a removal
function for now".
This commit provides an implementation of an unmap function that can be
called by irq_domain_disassociate().
[ tglx: Made the function static and removed the export as we have neither
a prototype nor a modular user. ]
Fixes: 088f40b7b0 ("genirq: Generic chip: Add linear irq domain support")
Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/579F5C5A.2070507@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>