Pull tracing fix from Steven Rostedt:
"This fixes a memory leak from the error path in the event filter
logic"
* tag 'trace-v5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Avoid memory leak in predicate_parse()
force_sig_info always delivers to the current task and the signal
parameter always matches info.si_signo. So remove those parameters to
make it a simpler, less error-prone interface, and to make it clear
that none of the callers are doing anything clever.
This guarantees that force_sig_info will not grow any new buggy
callers that attempt to call force_sig on a non-current task, or that
pass a signal number that does not match info.si_signo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All callers of force_sig_info pass info.si_signo in for the signal
by definition as well as in practice.
Further all callers of force_sig_info except force_sig_fault_to_task
pass current as the target task to force_sig_info.
Factor out a static force_sig_info_to_task that
force_sig_fault_to_task can call.
This prepares the way for force_sig_info to have its task and signal
parameters removed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for removing the special case in force_sig_info for
only having a signal number, generate an appropriate siginfo in
force_sig, the last caller of force_sig_info that does not
pass a filled-out siginfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Forcing a signal or not allowing a pid namespace init to ignore
SIGKILL or SIGSTOP is more cleanly computed in send_signal.
There are two cases where we don't allow a pid namespace init
to ignore SIGKILL or SIGSTOP: when the sending process is
from an ancestor pid namespace and as such is effectively
the god to the target process, and when it is the kernel
that is sending the signal, not another application.
It is known that a process is from an ancestor pid namespace if
it can see its target but its target does not have a pid for
the sender in its pid namespace.
It is known that a signal is sent from the kernel if si_code is set to
SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
a signal with si_code == SI_KERNEL).
The only signals that matter are SIGKILL and SIGSTOP, neither of
which can really be caught, and both of which always have a siginfo
layout that includes si_uid and si_pid. Therefore we never need
to worry about forcing a signal when si_pid and si_uid are absent.
So handle the two special cases of info and the case when si_pid and
si_uid are present.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Any time siginfo is not stored in the signal queue, information is
lost. Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
not allocate a signal queue entry, and a queue overflow abort is not
triggered.
Fixes: ba005e1f41 ("tracepoint: Add signal loss events")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous), remove the task parameter
from force_sig_fault to make it explicit that this is what is going
on.
The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.
The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:
force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)
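As a rough illustration of what the emacs expression does, the same rewrite can be approximated in C with POSIX `<regex.h>`. This is an assumption-laden sketch, not the tool actually used: POSIX extended syntax uses unescaped groups, emacs's `\W` is approximated by `[^[:alnum:]_]`, and `drop_current_arg()` is a hypothetical helper that rewrites only the first match.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* POSIX ERE version of the emacs pattern; [^[:alnum:]_] approximates \W. */
static const char PATTERN[] =
	"force_sig_fault[(]([^,]+)[,]([^,]+)[,]([^,]+)[,][^[:alnum:]_]+current[)]";

/*
 * Copy `in` to `out`, rewriting the first force_sig_fault(a, b, c, current)
 * call to force_sig_fault(a, b, c). Returns 1 if a call was rewritten,
 * 0 if nothing matched (out is a plain copy), -1 on error or truncation.
 */
static int drop_current_arg(const char *in, char *out, size_t outsz)
{
	regex_t re;
	regmatch_t m[4];
	int n, matched;

	assert(outsz > 0);
	if (regcomp(&re, PATTERN, REG_EXTENDED) != 0)
		return -1;
	matched = (regexec(&re, in, 4, m, 0) == 0);
	regfree(&re);   /* m[] stays valid: it only holds offsets into `in` */

	if (!matched)
		n = snprintf(out, outsz, "%s", in);
	else
		/* prefix + "force_sig_fault(" \1 "," \2 "," \3 ")" + suffix */
		n = snprintf(out, outsz, "%.*sforce_sig_fault(%.*s,%.*s,%.*s)%s",
			     (int)m[0].rm_so, in,
			     (int)(m[1].rm_eo - m[1].rm_so), in + m[1].rm_so,
			     (int)(m[2].rm_eo - m[2].rm_so), in + m[2].rm_so,
			     (int)(m[3].rm_eo - m[3].rm_so), in + m[3].rm_so,
			     in + m[0].rm_eo);
	if (n < 0 || (size_t)n >= outsz)
		return -1;
	return matched;
}
```

Because the captured groups keep their leading spaces, `force_sig_fault(SIGSEGV, SEGV_MAPERR, addr, current);` becomes `force_sig_fault(SIGSEGV, SEGV_MAPERR, addr);`, and calls whose last argument is not `current` are left alone.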
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for removing the task parameter from force_sig_fault
introduce force_sig_fault_to_task and use it for the two cases where
it matters.
On mips, force_fcr31_sig calls force_sig_fault and is called on either
the current task, or a task that is suspended and is being switched to
by the scheduler. This is safe because the task being switched to by
the scheduler is guaranteed to be suspended. This ensures that
task->sighand is stable while the signal is delivered to it.
On parisc, user_enable_single_step calls force_sig_fault and is in turn
called by ptrace_request. The function ptrace_request always calls
user_enable_single_step on a child that is stopped for tracing. The
child being traced and not reaped ensures that child->sighand is not
NULL, and that the child will not change child->sighand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain the array pointer
under mutex.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain the array pointer
under mutex.
We also don't need __rcu annotations on cgroup_bpf.inactive since
it's not read/updated concurrently.
v4:
* drop cgroup_rcu_xyz wrappers and use rcu APIs directly; this should
make it clearer which mutex/refcount protects each particular place
v3:
* amend cgroup_rcu_dereference to include percpu_ref_is_dying;
cgroup_bpf is now reference counted and we don't hold cgroup_mutex
anymore in cgroup_bpf_release
v2:
* replace xchg with rcu_swap_protected
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Drop __rcu annotations and rcu read sections from bpf_prog_array
helper functions. They are not needed since all existing callers
call those helpers from the rcu update side while holding a mutex.
This guarantees that use-after-free cannot happen.
In the next patches I'll fix the callers with missing
rcu_dereference_protected to make sparse/lockdep happy. The proper
way to use these helpers is:
struct bpf_prog_array __rcu *progs = ...;
struct bpf_prog_array *p;
mutex_lock(&mtx);
p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
bpf_prog_array_length(p);
bpf_prog_array_copy_to_user(p, ...);
bpf_prog_array_delete_safe(p, ...);
bpf_prog_array_copy_info(p, ...);
bpf_prog_array_copy(p, ...);
bpf_prog_array_free(p);
mutex_unlock(&mtx);
No functional changes! rcu_dereference_protected with lockdep_is_held
should catch any cases where we update prog array without a mutex
(I've looked at existing call sites and I think we hold a mutex
everywhere).
Motivation is to fix sparse warnings:
kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
kernel/bpf/core.c:1803:9: expected struct callback_head *head
kernel/bpf/core.c:1803:9: got struct callback_head [noderef] <asn:4> *
kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
kernel/bpf/core.c:1877:44: expected struct bpf_prog_array_item *item
kernel/bpf/core.c:1877:44: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1901:26: expected struct bpf_prog_array_item *existing
kernel/bpf/core.c:1901:26: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1935:26: expected struct bpf_prog_array_item *[assigned] existing
kernel/bpf/core.c:1935:26: got struct bpf_prog_array_item [noderef] <asn:4> *
v2:
* remove comment about potential race; that can't happen
because all callers are in rcu-update section
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The field operator is ignored on several string fields. WATCH, DIR,
PERM and FILETYPE field operators are completely ignored and meaningless
since the op is not referenced in audit_filter_rules(). Range and
bitwise operators are already addressed in ghak73.
Honour the operator for WATCH, DIR, PERM, FILETYPE fields as is done in
the EXE field.
Please see github issue
https://github.com/linux-audit/audit-kernel/issues/114
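A minimal sketch of honouring the operator, assuming the per-field comparison first yields a plain match result to which the rule's operator is then applied. The enum and function names below are hypothetical, and the PERM test is reduced to a bitmask intersection purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operators; only equality and inequality are modeled here. */
enum model_audit_op { MODEL_AUDIT_EQUAL, MODEL_AUDIT_NOT_EQUAL };

/* Apply the rule's operator to a raw field match, as is done for EXE. */
static bool model_apply_op(bool matched, enum model_audit_op op)
{
	assert(op == MODEL_AUDIT_EQUAL || op == MODEL_AUDIT_NOT_EQUAL);
	return op == MODEL_AUDIT_NOT_EQUAL ? !matched : matched;
}

/* Simplified PERM field: does the access touch any permission in the rule? */
static bool model_perm_rule(unsigned int access, unsigned int rule_perm,
			    enum model_audit_op op)
{
	return model_apply_op((access & rule_perm) != 0, op);
}
```

Without the model_apply_op() step, a `!=` rule on these fields would behave exactly like an `=` rule, which is the behaviour described above.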
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull perf/urgent fixes:
BPF:
Jiri Olsa:
- Fix up the determination of the end of the kernel map, to avoid having BPF
programs that are after the kernel headers and just before module texts
mixed up in the kernel map.
tools UAPI header copies:
Arnaldo Carvalho de Melo:
- Update copy of files related to new fspick, fsmount, fsconfig, fsopen,
move_mount and open_tree syscalls.
- Sync cpufeatures.h, sched.h, fs.h, drm.h, i915_drm.h and kvm.h headers.
Namespaces:
Namhyung Kim:
- Add missing byte swap ops for namespace events when processing records from
perf.data files that could have been recorded on an arch with a different
endianness.
- Fix access to the thread namespaces list by using the namespaces_lock.
perf data:
Shawn Landden:
- Fix 'strncat may truncate' build failure with recent gcc.
s/390
Thomas Richter:
- Fix s390 missing module symbol and warning for non-root users in 'perf record'.
arm64:
Vitaly Chikunov:
- Fix mksyscalltbl when system kernel headers are ahead of the kernel.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets to detach (or intentionally avoids detaching) a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks the possibility of implementing
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will ever be released.
A dying cgroup cannot contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, should such a need arise in the future.
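The release scheme can be sketched in userspace C. This is a simplified model, not the kernel's cgroup-bpf code: a single C11 atomic counter stands in for the percpu refcounter (so the percpu-to-atomic mode switch on removal is implicit), the release runs inline rather than from a scheduled work item, and all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_cgroup {
	atomic_long refcnt;   /* one base reference plus one per live socket */
	bool removed;         /* set by rmdir(); the cgroup is now dying */
	int attached_progs;   /* number of attached bpf programs */
};

static void model_cgroup_init(struct model_cgroup *cg)
{
	atomic_init(&cg->refcnt, 1);   /* base reference, dropped on removal */
	cg->removed = false;
	cg->attached_progs = 0;
}

/* Stands in for cgroup_bpf_release(); in the kernel this may block. */
static void model_release(struct model_cgroup *cg)
{
	cg->attached_progs = 0;        /* detach everything */
}

static void model_put(struct model_cgroup *cg)
{
	long old = atomic_fetch_sub(&cg->refcnt, 1);

	assert(old > 0);
	if (old == 1)                  /* last reference gone */
		model_release(cg);
}

static void model_socket_get(struct model_cgroup *cg)
{
	atomic_fetch_add(&cg->refcnt, 1);
}

static void model_socket_put(struct model_cgroup *cg)
{
	model_put(cg);
}

static void model_rmdir(struct model_cgroup *cg)
{
	cg->removed = true;
	model_put(cg);                 /* drop the base reference */
}
```

For example, attaching programs, opening one socket, removing the cgroup and then closing the socket keeps the programs attached through the rmdir and detaches them only at the final socket close.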
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Various security techniques can obfuscate pointer printouts on the
console. Unfortunately, rcutorture relies on either "null" or all zeroes
to identify the last few statistics printouts at the end of the test.
These need to be identified because failing to do so will result in
false-positive complaints about grace-period hangs.
This commit therefore prints the "ver:" in capitals ("VER:") when
the RCU-protected pointer has been set to NULL, which causes rcutorture's
parse-console.sh script to correctly ignore these lines.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
I have been showing off a trivial RCU implementation for non-preemptive
environments for some time now:
#define rcu_read_lock()
#define rcu_read_unlock()
#define rcu_dereference(p) READ_ONCE(p)
#define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))
void synchronize_rcu(void)
{
int cpu;
for_each_online_cpu(cpu)
sched_setaffinity(current->pid, cpumask_of(cpu));
}
Trivial or not, as the old saying goes, "if it ain't tested, it don't
work!". This commit therefore adds a "trivial" flavor to rcutorture
and a corresponding TRIVIAL test scenario. This variant does not handle
CPU hotplug, which is unconditionally enabled on x86 for post-v5.1-rc3
kernels, which is why the TRIVIAL.boot says "rcutorture.onoff_interval=0".
This commit actually does handle CONFIG_PREEMPT=y kernels, but only
because it turns back the Linux-kernel clock in order to provide these
alternative definitions (or the moral equivalent thereof):
#define rcu_read_lock() preempt_disable()
#define rcu_read_unlock() preempt_enable()
In CONFIG_PREEMPT=n kernels without debugging, these are equivalent to
empty macros give or take a compiler barrier. However, they have been
successfully tested with actual empty macros as well.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix symbol issue reported by kbuild test robot <lkp@intel.com>. ]
[ paulmck: Work around sched_setaffinity() issue noted by Andrea Parri. ]
[ paulmck: Add rcutorture.shuffle_interval=0 to TRIVIAL.boot to fix
interaction with shuffler task noted by Peter Zijlstra. ]
Tested-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Once removed, an rcu_torture element can be deferred-freed by a chain
of call_rcu() invocations, with each callback invoking another round of
call_rcu() until either a fixed number of call_rcu() invocations have
been chained or until the test ends. This means that if the test ends,
some of the rcu_torture elements will be "stranded" partway through the
deferred-free process, which results in false-positive warnings from
rcu_torture_writer() due to lack of forward progress should the test
end just at the end of a stutter interval.
This commit therefore suppresses rcu_torture_writer()'s forward-progress
checks when the test ends in order to avoid these false-positive reports.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
In !PREEMPT kernels, cond_resched() is a no-op. In NO_HZ_FULL kernels,
in-kernel execution (such as that of rcutorture's kthreads) might extend
indefinitely without the scheduler gaining the aid of a scheduling-clock
interrupt. This combination can cause an rcutorture forward-progress
test interacting with a CPU-hotplug stop_machine operation to make less
forward progress than one might like. Additionally, Sebastian Siewior
notes that NO_HZ_FULL kernels have a scheduler check upon return to
userspace execution, which suggests that in-kernel emulation of tight
userspace loops containing system calls doing call_rcu() might also need
explicit checks in the PREEMPT && NO_HZ_FULL case.
This commit therefore introduces an rcu_torture_fwd_prog_cond_resched()
function that explicitly invokes schedule() in such kernels whenever
need_resched() returns true, while retaining use of cond_resched()
for kernels that are either !PREEMPT or !NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
After the end of each stutter pause interval, the rcu_torture_writer()
kthread checks to be sure that all prior callbacks have completed so
that all the test structures have been freed. This works fine except
for tasks RCU, in which grace periods can take one good long time.
This commit therefore exempts tasks RCU from this check.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Currently, the inter-stutter interval is the same as the stutter duration,
that is, whatever number of jiffies is passed into torture_stutter_init().
This has worked well for quite some time, but the addition of
forward-progress testing to rcutorture can delay processes for several
seconds, which can triple the time that they are stuttered.
This commit therefore adds a second argument to torture_stutter_init()
that specifies the inter-stutter interval. While locktorture preserves
the current behavior, rcutorture uses the RCU CPU stall warning interval
to provide a wider inter-stutter interval.
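A small model of the resulting schedule, assuming each cycle is `stutter` jiffies paused followed by `gap` jiffies running (previously `gap` was implicitly equal to `stutter`). The struct and functions are hypothetical, and time is an abstract jiffies count rather than the kernel's clock.

```c
#include <assert.h>
#include <stdbool.h>

struct model_stutter {
	long stutter;   /* paused duration per cycle, in jiffies */
	long gap;       /* inter-stutter (running) interval, in jiffies */
};

/* Is the test paused at time t (t >= 0)? */
static bool model_paused_at(const struct model_stutter *s, long t)
{
	assert(s->stutter > 0 && s->gap > 0 && t >= 0);
	return t % (s->stutter + s->gap) < s->stutter;
}

/* How long a kthread arriving at time t would wait before resuming. */
static long model_wait_time(const struct model_stutter *s, long t)
{
	long phase;

	assert(s->stutter > 0 && s->gap > 0 && t >= 0);
	phase = t % (s->stutter + s->gap);
	return phase < s->stutter ? s->stutter - phase : 0;
}
```

Preserving locktorture's behavior corresponds to passing `gap == stutter`; a larger `gap`, as rcutorture now uses, lengthens the running window between pauses without changing the pause length.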
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The stutter_wait() function is supposed to return true if it actually
waits and false otherwise, but it instead unconditionally returns false.
This hides a bug in rcu_torture_writer() that fails to account for
the fact that one of the rcu_tortures[] array elements will normally be
referenced by rcu_torture_current, and thus not be on the freelist.
This commit therefore corrects the stutter_wait() return value and adds a
check for rcu_torture_current to rcu_torture_writer()'s check that things
get freed after everything goes quiescent. In addition, this commit
causes torture_stutter() to give a bit more than one second (instead of
only one jiffy) warning of the end of the stutter interval. Finally,
this commit disables long-delay readers and aggressive update-side
forward-progress checks while forward-progress testing is in flight.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The rcu_torture_fwd_prog_cbfree() function frees callbacks used during
rcutorture's call_rcu() forward-progress test, but does so in a tight
loop. This could cause problems given a very long list of callbacks to be
freed, and actual testing produces lists with as many as 25M callbacks.
This commit therefore adds a cond_resched() to this loop. While in
the area, this commit also rearranges the lock releases to look a bit
more sane.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
With this patch rcu_sync has a single state variable and the transition rules
become really simple:
GP_IDLE - owned by the first rcu_sync_enter() which moves it to
GP_ENTER - owned by rcu-callback which moves it to
GP_PASSED - owned by the last rcu_sync_exit() which moves it to
GP_EXIT - and this is the only "nontrivial" state.
rcu-callback moves it back to GP_IDLE unless another enter()
comes before a GP pass.
If rcu-callback is invoked before the next rcu_sync_exit() it
must see gp_count incremented by that enter() and set GP_PASSED.
Otherwise, if the next rcu_sync_exit() wins the race, it will
move it to
GP_REPLAY - owned by rcu-callback which moves it to GP_EXIT
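The transitions above can be modeled as a small single-threaded state machine. This sketch assumes no concurrency (so no lock and no waiting in enter()) and models a queued RCU callback as a `cb_pending` flag that the caller fires explicitly with `model_gp_callback()`; the `model_*` names are hypothetical, while the GP_* state names come from the text.

```c
#include <assert.h>
#include <stdbool.h>

enum gp_state { GP_IDLE, GP_ENTER, GP_PASSED, GP_EXIT, GP_REPLAY };

struct model_rcu_sync {
	enum gp_state gp_state;
	int gp_count;      /* number of active enter() callers */
	bool cb_pending;   /* an RCU callback is queued */
};

static void model_enter(struct model_rcu_sync *rs)
{
	if (rs->gp_state == GP_IDLE) {
		rs->gp_state = GP_ENTER;
		rs->cb_pending = true;       /* wait for a grace period */
	}
	rs->gp_count++;
}

/* Simulates the queued callback running after a grace period. */
static void model_gp_callback(struct model_rcu_sync *rs)
{
	assert(rs->cb_pending);
	rs->cb_pending = false;
	if (rs->gp_count) {
		rs->gp_state = GP_PASSED;    /* an enter() is still active */
	} else if (rs->gp_state == GP_REPLAY) {
		rs->gp_state = GP_EXIT;      /* an exit() raced: wait again */
		rs->cb_pending = true;
	} else {
		rs->gp_state = GP_IDLE;
	}
}

static void model_exit(struct model_rcu_sync *rs)
{
	assert(rs->gp_state != GP_IDLE && rs->gp_count > 0);
	if (--rs->gp_count == 0) {
		if (rs->gp_state == GP_PASSED) {
			rs->gp_state = GP_EXIT;
			rs->cb_pending = true;
		} else if (rs->gp_state == GP_EXIT) {
			rs->gp_state = GP_REPLAY;
		}
	}
}
```

Running enter, callback, exit, enter, exit walks IDLE, ENTER, PASSED, EXIT and then REPLAY; two more callbacks move REPLAY back through EXIT to IDLE.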
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: While here, apply READ_ONCE() and WRITE_ONCE() to ->gp_state. ]
[ paulmck: Tweaks to make htmldocs happy. (Reported by kbuild test robot.) ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Turn DEFINE_STATIC_PERCPU_RWSEM() into __DEFINE_PERCPU_RWSEM() with the
additional "is_static" argument to introduce DEFINE_PERCPU_RWSEM().
Change cgroup.c to use DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Now that the RCU flavors have been consolidated, rcu_sync_type makes no
sense because none of the internal update functions aside from .held() depend
on gp_type. This commit therefore removes this field and consolidates
the relevant code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: Added RCU and RCU-bh checks to rcu_sync_is_idle(). ]
[ paulmck: And applied subsequent feedback from Oleg Nesterov. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Because __call_srcu() is not used outside kernel/rcu/srcutree.c,
this commit makes it static.
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires
that the size of the reserved region be increased, which is not something
we want to be doing all that often. One approach would be to require
that loadable modules define an srcu_struct and invoke init_srcu_struct()
from their module_init function and cleanup_srcu_struct() from their
module_exit function. However, this is more than a bit user unfriendly.
This commit therefore creates an ___srcu_struct_ptrs linker section,
and pointers to srcu_struct structures created by DEFINE_SRCU() and
DEFINE_STATIC_SRCU() within a module are placed into that module's
___srcu_struct_ptrs section. The required init_srcu_struct() and
cleanup_srcu_struct() functions are then automatically invoked as needed
when that module is loaded and unloaded, thus allowing modules to continue
to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need
to increase the size of the reserved region.
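The linker-section technique itself can be sketched in userspace. This assumes a GNU-compatible toolchain (GCC or Clang with GNU ld or lld), which synthesizes `__start_<sec>`/`__stop_<sec>` symbols for sections whose names are valid C identifiers; `struct fake_srcu`, `DEFINE_FAKE_SRCU()`, the `fake_srcu_ptrs` section and the load/unload functions are hypothetical stand-ins for srcu_struct, DEFINE_SRCU(), ___srcu_struct_ptrs and the module notifier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct srcu_struct. */
struct fake_srcu {
	const char *name;
	int initialized;
};

/*
 * Hypothetical analogue of DEFINE_SRCU(): define the structure and place a
 * pointer to it in the "fake_srcu_ptrs" section. "used" keeps the otherwise
 * unreferenced pointer from being discarded by the compiler.
 */
#define DEFINE_FAKE_SRCU(var)                                        \
	struct fake_srcu var = { #var, 0 };                          \
	static struct fake_srcu *fake_srcu_ptr_##var                 \
		__attribute__((used, section("fake_srcu_ptrs"))) = &var

DEFINE_FAKE_SRCU(srcu_a);
DEFINE_FAKE_SRCU(srcu_b);

/* Bounds of the section, synthesized by the linker. */
extern struct fake_srcu *__start_fake_srcu_ptrs[];
extern struct fake_srcu *__stop_fake_srcu_ptrs[];

static size_t fake_srcu_count(void)
{
	return (size_t)(__stop_fake_srcu_ptrs - __start_fake_srcu_ptrs);
}

/* Models module load: init_srcu_struct() on every registered structure. */
static void fake_module_load(void)
{
	for (struct fake_srcu **p = __start_fake_srcu_ptrs;
	     p < __stop_fake_srcu_ptrs; p++) {
		assert(!(*p)->initialized);
		(*p)->initialized = 1;
	}
}

/* Models module unload: cleanup_srcu_struct() on every structure. */
static void fake_module_unload(void)
{
	for (struct fake_srcu **p = __start_fake_srcu_ptrs;
	     p < __stop_fake_srcu_ptrs; p++)
		(*p)->initialized = 0;
}
```

The design point is that each definition registers itself at link time, so the load and unload paths need no list of names: they simply walk whatever pointers ended up in the section.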
Many of the algorithms and some of the code were cheerfully cherry-picked
from other code making use of linker sections, perhaps most notably from
tracepoints. All bugs are nevertheless the sole property of the author.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Use __section() and use "default" in srcu_module_notify()'s
"switch" statement as suggested by Joel Fernandes. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Currently, if a CPU has more than 10,000 callbacks pending, it will
increase rdp->blimit to LONG_MAX. If you are lucky, LONG_MAX is only
about two billion, but this is still a bit too many callbacks to invoke
back-to-back while otherwise ignoring the world.
This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
which is set to 10,000, for rdp->blimit.
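The clamp can be sketched as follows. This is a simplified model that assumes a normal batch limit of 10 and ignores the kernel's logic for restoring the normal limit once the backlog drains; only DEFAULT_MAX_RCU_BLIMIT and its value of 10,000 come from the text, and the other names are hypothetical.

```c
#include <assert.h>

#define DEFAULT_MAX_RCU_BLIMIT 10000L  /* cap stated in the commit */
#define MODEL_QHIMARK 10000L           /* "more than 10,000 callbacks" */
#define MODEL_BLIMIT 10L               /* assumed normal batch limit */

/* Batch limit for a CPU with `pending` callbacks queued. */
static long model_blimit(long pending)
{
	assert(pending >= 0);
	/* Previously this branch yielded LONG_MAX. */
	return pending > MODEL_QHIMARK ? DEFAULT_MAX_RCU_BLIMIT : MODEL_BLIMIT;
}

/* Invoke one batch; returns the number of callbacks invoked. */
static long model_do_batch(long *pending)
{
	long limit = model_blimit(*pending);
	long n = *pending < limit ? *pending : limit;

	*pending -= n;
	return n;
}
```

With 25,000 callbacks pending, each batch now invokes at most 10,000 of them rather than draining the whole list back-to-back.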
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
On systems whose rcu_node tree has only one node, the
rcu_check_gp_start_stall() function's values of rnp and rnp_root will
be identical. In this case, it clearly does not make sense to release
both rnp->lock and rnp_root->lock, but that is exactly what this function
does in the last early exit. This commit therefore unlocks only rnp->lock
when rnp and rnp_root are equal.
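The fix amounts to making the second unlock conditional. The sketch below models each rcu_node lock as an integer flag, so that a double unlock trips an assertion instead of corrupting a real raw spinlock; `struct model_node` and its helpers are hypothetical.

```c
#include <assert.h>

struct model_node {
	int locked;   /* 1 while "held"; stands in for rnp->lock */
};

static void model_lock(struct model_node *n)
{
	assert(n->locked == 0);
	n->locked = 1;
}

static void model_unlock(struct model_node *n)
{
	assert(n->locked == 1);   /* a double unlock would fire here */
	n->locked = 0;
}

/* Acquire rnp, then rnp_root if it is a different node. */
static void model_lock_pair(struct model_node *rnp, struct model_node *rnp_root)
{
	model_lock(rnp);
	if (rnp_root != rnp)
		model_lock(rnp_root);
}

/* Early-exit path: release rnp_root only if it was separately acquired. */
static void model_unlock_pair(struct model_node *rnp, struct model_node *rnp_root)
{
	if (rnp_root != rnp)
		model_unlock(rnp_root);
	model_unlock(rnp);
}
```

On a single-node tree both pointers refer to the same node, so the pair helpers take and release its lock exactly once.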
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The dump_blkd_tasks() function dumps at most 10 blocked tasks, ignoring
the value of the ncheck parameter. This commit therefore substitutes
the value of ncheck for the hard-coded value of 10. Because all callers
currently pass 10 as the number, this patch does not change behavior,
but it is clearly an accident waiting to happen.
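A minimal sketch of the change, assuming the blocked tasks are available as a plain array of pids (the real function walks an rcu_node's list of blocked tasks); `model_dump_blkd_tasks()` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Print at most ncheck blocked tasks; returns how many were printed. */
static int model_dump_blkd_tasks(const int *pids, int ntasks, int ncheck)
{
	int i;

	assert(ntasks >= 0 && ncheck >= 0);
	for (i = 0; i < ntasks && i < ncheck; i++)   /* was: i < 10 */
		printf("\tblocked task pid %d\n", pids[i]);
	return i;
}
```

With the hard-coded limit, a call with ncheck = 3 would still print ten tasks; honouring ncheck makes the limit follow the caller.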
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Because rdp is initialized but never used in synchronize_rcu_expedited(),
this commit removes it.
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
The rcu_data structure's ->deferred_qs field is used to indicate that the
current CPU is blocking an expedited grace period (perhaps a future one).
Given that it is used only for expedited grace periods, its current name
is misleading, so this commit renames it to ->exp_deferred_qs.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
It would be good to combine the dynticks and dynticks_nesting counters
in order to simplify the code. Unfortunately, there are concerns
about usermode upcalls appearing to RCU as half of an interrupt, as
Byungchul learned [1]. The "half" in "half interrupt" is due to an
unpaired rcu_irq_enter(): Normally, each rcu_irq_enter() has a later
call to rcu_irq_exit().
Out of an abundance of caution, Paul added warnings [2] in the RCU
code which if not fired by 2021 will be interpreted as meaning that
this half-interrupt scenario cannot happen any more, thus permitting
simplification of this code.
In the meantime, this commit makes the following changes:
(1) Combining these two counters requires that rcu_rrupt_from_idle()
is invoked only from hard-interrupt contexts as discussed here [3].
This commit therefore adds the required lockdep_assert_in_irq()
to check this constraint.
(2) Furthermore, rcu_rrupt_from_idle() is not explicit about how it
is using the counters, which can lead to weird future bugs. This
commit therefore adds comments indicating the meaning and use of
each counter.
(3) Lastly, this commit checks for counter underflows as another check
that half interrupts don't occur. (Previously, the function would
simply return true upon underflow.)
All these checks are no-ops if PROVE_LOCKING (and thus PROVE_RCU)
is disabled.
[1] https://lore.kernel.org/patchwork/patch/952349/
[2] Commit e11ec65cc8 ("rcu: Add warning to detect half-interrupts")
[3] https://lore.kernel.org/lkml/20190312150514.GB249405@google.com/
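The three changes can be sketched as a userspace model of the function. The per-CPU counters are plain struct fields, `in_hardirq` stands in for lockdep_assert_in_irq(), warnings are counted instead of going through lockdep, and all `model_*` names are hypothetical. The counter meanings (dynticks_nesting of 0 meaning RCU-idle, dynticks_nmi_nesting of 1 meaning the first interrupt level) are assumptions for illustration, not a verified reading of the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct model_rdp {
	long dynticks_nesting;      /* assumed: process-level nesting, 0 == idle */
	long dynticks_nmi_nesting;  /* assumed: irq/NMI nesting, 1 == first irq */
	bool in_hardirq;            /* stands in for lockdep's hardirq state */
};

static int model_warnings;

static void model_warn_on(bool cond, const char *msg)
{
	if (cond) {
		model_warnings++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
}

/* Was this interrupt taken from an RCU-idle CPU? */
static bool model_rrupt_from_idle(const struct model_rdp *rdp)
{
	/* (1) only valid from hard-interrupt context */
	model_warn_on(!rdp->in_hardirq, "not in hard-irq context");
	/* (3) counter underflow checks */
	model_warn_on(rdp->dynticks_nesting < 0, "dynticks_nesting underflow");
	model_warn_on(rdp->dynticks_nmi_nesting <= 0,
		      "dynticks_nmi_nesting underflow/zero");
	/* (2) first interrupt level, interrupting an idle CPU */
	if (rdp->dynticks_nmi_nesting != 1)
		return false;
	return rdp->dynticks_nesting == 0;
}
```

An underflowed dynticks_nesting now produces a warning and a false result, whereas, per item (3), the previous function would simply have returned true.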
Cc: byungchul.park@lge.com
Cc: kernel-team@android.com
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Commit c19fa94a8f ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for
architectures that required 64bit aligned access for all 64bit words. As
the ftrace ring buffer stores data on 4 byte alignment, this config option
was used to force it to store data on 8 byte alignment to make sure the data
being stored and written directly into the ring buffer was 8 byte aligned,
as writing an 8 byte word to a memory location that is 4 byte but not
8 byte aligned would cause issues.
But with the removal of the metag architecture, which was the only
architecture to use this, there is no architecture supported by Linux that
requires 8 byte aligned access for all 8 byte words (4 byte alignment is good
enough). Removing this config can simplify the code a bit.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to the user.
This patch adds an early check in the helper itself to notify the
user of an invalid signal, as suggested by Daniel.
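The early check can be modeled as follows. This sketch assumes signals in the range 0 to 64 count as valid (mirroring a `sig <= _NSIG` style test on a 64-signal kernel), models irq_work as a single pending slot, and uses hypothetical `model_*` names throughout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NSIG 64UL   /* assumed kernel signal count */

struct model_ctx {
	bool in_nmi;          /* helper invoked from NMI context */
	bool work_pending;    /* stands in for a queued irq_work */
	unsigned long sig;    /* signal the deferred work will send */
	int deferred_err;     /* result seen only by the deferred work */
};

static bool model_valid_signal(unsigned long sig)
{
	return sig <= MODEL_NSIG;
}

static int model_deliver(unsigned long sig)
{
	return model_valid_signal(sig) ? 0 : -EINVAL;
}

static int model_bpf_send_signal(struct model_ctx *ctx, unsigned long sig)
{
	if (ctx->in_nmi) {
		/* Early check: the deferred work cannot report errors back. */
		if (!model_valid_signal(sig))
			return -EINVAL;
		assert(!ctx->work_pending);
		ctx->sig = sig;
		ctx->work_pending = true;
		return 0;
	}
	return model_deliver(sig);
}

/* Runs later, outside NMI context; its error never reaches the caller. */
static void model_run_irq_work(struct model_ctx *ctx)
{
	if (ctx->work_pending) {
		ctx->deferred_err = model_deliver(ctx->sig);
		ctx->work_pending = false;
	}
}
```

Without the early check, an invalid signal sent from NMI context would return 0 to the caller and the -EINVAL would only be recorded in `deferred_err`.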
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
All of the callers pass current into force_sig_mceerr, so remove the
task parameter to make this obvious.
This also makes it clear that force_sig_mceerr passes current
into force_sig_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.
This also makes it clear force_sig passes current into force_sig_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.
This also makes it clear force_sigsegv always calls force_sig
on the current task.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The locking in force_sig_info is not prepared to deal with a task that
exits or execs (as sighand may change). This is not a locking problem
in force_sig as force_sig is only built to handle synchronous
exceptions.
Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal. The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.
So using force_sig rather than send_sig for SIGKILL is confusing
and pointless.
Because it won't impact the sending of the signal and because
using force_sig is wrong, replace force_sig with send_sig.
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Fixes: cf3f89214e ("pidns: add reboot_pid_ns() to handle the reboot syscall")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
On systems with ACPI platform firmware the last stage of hibernation
is analogous to system suspend to S3 (suspend-to-RAM), so it should
be handled analogously. In particular, pm_suspend_via_firmware()
should return 'true' in that stage to let the callers of it know that
control will be passed to the platform firmware going forward, so
pm_set_suspend_via_firmware() needs to be called then in analogy with
acpi_suspend_begin().
However, the platform hibernation ->begin() callback is invoked
during the "freeze" transition (before creating a snapshot image of
system memory) as well as during the "hibernate" transition which is
the last stage of it and pm_set_suspend_via_firmware() should be
invoked by that callback in the latter stage only.
In order to implement that, redefine the hibernation ->begin()
callback to take a pm_message_t argument to indicate which stage
of hibernation is taking place and rework acpi_hibernation_begin()
and acpi_hibernation_begin_old() to take it into account as needed.
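A minimal model of the reworked callback, assuming only the two hibernation transitions named above matter; `enum model_pm_event`, the flag and `model_acpi_hibernation_begin()` are hypothetical stand-ins for pm_message_t, the suspend-via-firmware state and acpi_hibernation_begin().

```c
#include <assert.h>
#include <stdbool.h>

enum model_pm_event {
	MODEL_PM_EVENT_FREEZE,     /* before the snapshot image is created */
	MODEL_PM_EVENT_HIBERNATE,  /* last stage: control goes to firmware */
};

static bool model_suspend_via_firmware;

static void model_pm_set_suspend_via_firmware(void)
{
	model_suspend_via_firmware = true;
}

/* The stage argument lets one callback serve both transitions. */
static int model_acpi_hibernation_begin(enum model_pm_event stage)
{
	assert(stage == MODEL_PM_EVENT_FREEZE || stage == MODEL_PM_EVENT_HIBERNATE);
	if (stage == MODEL_PM_EVENT_HIBERNATE)
		model_pm_set_suspend_via_firmware();
	return 0;
}
```

Only the final "hibernate" invocation flips the flag, so pm_suspend_via_firmware()-style queries stay false during the "freeze" transition.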
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull tracing warning fix from Steven Rostedt:
"Make the GCC 9 warning for sub struct memset go away.
GCC 9 now warns about calling memset() on partial structures when it
goes across multiple fields. This adds a helper for the place in
tracing that does this type of clearing of a structure"
* tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Silence GCC 9 array bounds warning
Custom trampolines can only be enabled if there is only a single ops
attached to it. If there's only a single callback registered to a function,
and the ops has a trampoline registered for it, then we can call the
trampoline directly. This is very useful for improving the performance of
ftrace and livepatch.
If more than one callback is registered to a function, the general
trampoline is used, and the custom trampoline is not restored back to the
direct call even if all the other callbacks were unregistered and we are
back to one callback for the function.
To fix this, set the FTRACE_FL_TRAMP flag if the rec count is decremented
to one and the remaining ops has a trampoline.
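The fixed rule can be sketched as recomputing the trampoline flag on every change, including unregistration. This is a simplified model: `struct model_rec` stands in for a dyn_ftrace record, `fl_tramp` for FTRACE_FL_TRAMP, and the ops array and helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_OPS 4

struct model_ops {
	const char *name;
	bool has_tramp;   /* ops provides a custom trampoline */
};

struct model_rec {
	const struct model_ops *ops[MODEL_MAX_OPS];  /* attached callbacks */
	int count;
	bool fl_tramp;    /* models FTRACE_FL_TRAMP: call trampoline directly */
};

/* A custom trampoline is usable only with exactly one attached ops. */
static void model_update_tramp(struct model_rec *rec)
{
	rec->fl_tramp = rec->count == 1 && rec->ops[0]->has_tramp;
}

static void model_register(struct model_rec *rec, const struct model_ops *ops)
{
	assert(rec->count < MODEL_MAX_OPS);
	rec->ops[rec->count++] = ops;
	model_update_tramp(rec);
}

static void model_unregister(struct model_rec *rec, const struct model_ops *ops)
{
	for (int i = 0; i < rec->count; i++) {
		if (rec->ops[i] == ops) {
			rec->ops[i] = rec->ops[--rec->count];
			break;
		}
	}
	model_update_tramp(rec);   /* the fix: re-check when back to one */
}
```

Registering a livepatch-like ops with a trampoline, then a function-tracer ops without one, then unregistering the second, follows the same 1, 2, 1 sequence as the enabled_functions output in the test session.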
Testing after this patch:
insmod livepatch_unshare_files.ko
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150
echo nop > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The trace event self tests loop through *all* events, enable each
one, one at a time, run some code to trigger various events (not
necessarily the same events), and check if anything went wrong. The issue
is that trace events are usually the least likely start up test to cause a
problem, but they take the longest to run (because there are so many
events). When one of the other tests triggers a bug, the trace event start up
tests cause the bisect to take much longer, because it takes 10s of seconds
to get through the trace event tests.
By making them a separate config (even though they are enabled by default if
start up tests are set), it is possible to turn them off and still run the
other tracing start up tests much quicker.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>