uprobe_perf_open()->uprobe_apply() can fail, but this error is wrongly
ignored. Change uprobe_perf_open() to do uprobe_perf_close() and return
the error code in this case.
Change uprobe_perf_close() to propogate the error from uprobe_apply()
as well, although it should not fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Preparation. Move uprobe_perf_close() up before uprobe_perf_open() to
avoid the forward declaration in the next patch and make it readable.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that the ring buffer has a built in way to wake up readers
when there's data, using irq_work such that it is safe to do it
in any context. But it was still using the old "poor man's"
wait polling that checks every 1/10 of a second to see if it
should wake up a waiter. This makes the latency for a wake up
excruciatingly long. No need to do that anymore.
Completely remove the different wait_poll types from the tracers
and have them all use the default one now.
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On architectures with sizeof(int) < sizeof (long), the
computation of mask inside apply_slack() can be undefined if the
computed bit is > 32.
E.g. with: expires = 0xffffe6f5 and slack = 25, we get:
expires_limit = 0x20000000e
bit = 33
mask = (1 << 33) - 1 /* undefined */
On x86, mask becomes 1 and and the slack is not applied properly.
On s390, mask is -1, expires is set to 0 and the timer fires immediately.
Use 1UL << bit to solve that issue.
Suggested-by: Deborah Townsend <dstownse@us.ibm.com>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20140418152310.GA13654@midget.suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If a cpu is idle and starts an hrtimer which is not pinned on that
same cpu, the nohz code might target the timer to a different cpu.
In the case that we switch the cpu base of the timer we already have a
sanity check in place, which determines whether the timer is earlier
than the current leftmost timer on the target cpu. In that case we
enqueue the timer on the current cpu because we cannot reprogram the
clock event device on the target.
If the timers base is already the target CPU we do not have this
sanity check in place so we enqueue the timer as the leftmost timer in
the target cpus rb tree, but we cannot reprogram the clock event
device on the target cpu. So the timer expires late and subsequently
prevents the reprogramming of the target cpu clock event device until
the previously programmed event fires or a timer with an earlier
expiry time gets enqueued on the target cpu itself.
Add the same target check as we have for the switch base case and
start the timer on the current cpu if it would become the leftmost
timer on the target.
[ tglx: Rewrote subject and changelog ]
Signed-off-by: Leon Ma <xindong.ma@intel.com>
Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the last hrtimer interrupt detected a hang it sets hang_detected=1
and programs the clock event device with a delay to let the system
make progress.
If hang_detected == 1, we prevent reprogramming of the clock event
device in hrtimer_reprogram() but not in hrtimer_force_reprogram().
This can lead to the following situation:
hrtimer_interrupt()
hang_detected = 1;
program ce device to Xms from now (hang delay)
We have two timers pending:
T1 expires 50ms from now
T2 expires 5s from now
Now T1 gets canceled, which causes hrtimer_force_reprogram() to be
invoked, which in turn programs the clock event device to T2 (5
seconds from now).
Any hrtimer_start after that will not reprogram the hardware due to
hang_detected still being set. So we effectivly block all timers until
the T2 event fires and cleans up the hang situation.
Add a check for hang_detected to hrtimer_force_reprogram() which
prevents the reprogramming of the hang delay in the hardware
timer. The subsequent hrtimer_interrupt will resolve all outstanding
issues.
[ tglx: Rewrote subject and changelog and fixed up the comment in
hrtimer_force_reprogram() ]
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Link: http://lkml.kernel.org/r/53602DC6.2060101@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When reading from trace_pipe, if tracing is off but nothing was read
it should block. If something is read and tracing is off, then EOF
is returned. If tracing is on and there's nothing to read, it will block.
But because the check of whether tracing is off and something was read
is done after the block on the pipe, it is hit or miss if the EOF is
returned or not leading to inconsistent behavior.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Calling rcu_bh_qs() after every softirq action is not really needed.
What RCU needs is at least one rcu_bh_qs() per softirq round to note a
quiescent state was passed for rcu_bh.
Note for Paul and myself : this could be inlined as a single instruction
and avoid smp_processor_id()
(sone this_cpu_write(rcu_bh_data.passed_quiesce, 1))
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
__this_cpu_ptr is being phased out.
One special case is increment_cpu_stall_ticks().
A per cpu variable is incremented so use raw_cpu_inc().
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, small systems move back into RCU_SYSIDLE_NOT from
RCU_SYSIDLE_SHORT and large systems do not. This works because moving
aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness,
and on small systems, the performance impact should be negligible. That
said, this difference does make RCU a bit more complex, and RCU does not
seem to be suffering from any lack of complexity. This commit therefore
adjusts small-system operation to match that of large systems, so that
the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, RCU binds the grace-period kthreads to the timekeeping
CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y. This means that these
kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and
CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on
random CPUs. Given that we are trying to reduce the amount of manual
tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit
makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where
CONFIG_NO_HZ_FULL_SYSIDLE=n.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This patch merges the function rcu_force_quiescent_state() with
rcu_sched_force_quiescent_state(), using the rcu_state pointer. Firstly,
the rcu_sched_force_quiescent_state() function is deleted from the file
kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was
calling force_quiescent_state with the argument rcu_preempt_state pointer
was deleted as well. The new function that combines the old ones uses
the rcu_state pointer and is located after rcu_batches_completed_bh()
in kernel/rcu/tree.c.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU,
it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state.
This patch uses the rcu_state_pointer to combine the two definitions into one.
The resulting function is placed after the closing of the preprocessor
conditional CONFIG_TREE_PREEMPT_RCU.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This patch adds event tracing to dyntick_save_progress_counter() in the case
where it returns 1. I used the tracepoint string "dti" because this function
returns 1 in case the CPU is in dynticks idle mode.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Some of the accesses to the rcu_state structure's ->jiffies_stall
field are unprotected. This patch protects them with ACCESS_ONCE().
The following coccinelle script was used to acheive this:
/* coccinelle script to protect uses of ->jiffies_stall with ACCESS_ONCE() */
@@
identifier a;
@@
(
ACCESS_ONCE(a->jiffies_stall)
|
- a->jiffies_stall
+ ACCESS_ONCE(a->jiffies_stall)
)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Some of the uses of the rcu_state structure's ->jiffies_stall field
do not use ACCESS_ONCE(), despite there being unprotected accesses.
This commit therefore uses the ACCESS_ONCE() macro to protect this field.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The ->preemptible field in rcu_data is only initialized in the function
rcu_init_percpu_data(), and never used. This commit therefore removes
this field.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In the old days, the only source of requests for future grace periods
was NOCB CPUs. This has changed: CPUs routinely post requests for
future grace periods in order to promote power efficiency and reduce
OS jitter with minimal impact on grace-period latency. This commit
therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
instead of rcu_nocb_needs_gp(). The latter is no longer used, so is
now removed. This commit also adds tracing for the irq_work_queue()
wakeup case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_other_cpu_stall() and print_cpu_stall() functions print
grace-period numbers using an unsigned format, which means that the number
one less than zero is a very large number. This commit therefore causes
these numbers to be printed with a signed format in order to improve
readability of the RCU CPU stall-warning output.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
A number of ->gp_flags accesses don't have ACCESS_ONCE(), but all of
the can race against other loads or stores. This commit therefore
applies ACCESS_ONCE() to the unprotected ->gp_flags accesses.
Reported-by: Alexey Roytman <alexey.roytman@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Pull ftrace bugfix from Steven Rostedt:
"Takao Indoh reported that he was able to cause a ftrace bug while
loading a module and enabling function tracing at the same time.
He uncovered a race where the module when loaded will convert the
calls to mcount into nops, and expects the module's text to be RW.
But when function tracing is enabled, it will convert all kernel text
(core and module) from RO to RW to convert the nops to calls to ftrace
to record the function. After the convertion, it will convert all the
text back from RW to RO.
The issue is, it will also convert the module's text that is loading.
If it converts it to RO before ftrace does its conversion, it will
cause ftrace to fail and require a reboot to fix it again.
This patch moves the ftrace module update that converts calls to
mcount into nops to be done when the module state is still
MODULE_STATE_UNFORMED. This will ignore the module when the text is
being converted from RW back to RO"
* tag 'trace-fixes-v3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/module: Hardcode ftrace_module_init() call into load_module()
A race exists between module loading and enabling of function tracer.
CPU 1 CPU 2
----- -----
load_module()
module->state = MODULE_STATE_COMING
register_ftrace_function()
mutex_lock(&ftrace_lock);
ftrace_startup()
update_ftrace_function();
ftrace_arch_code_modify_prepare()
set_all_module_text_rw();
<enables-ftrace>
ftrace_arch_code_modify_post_process()
set_all_module_text_ro();
[ here all module text is set to RO,
including the module that is
loading!! ]
blocking_notifier_call_chain(MODULE_STATE_COMING);
ftrace_init_module()
[ tries to modify code, but it's RO, and fails!
ftrace_bug() is called]
When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.
The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.
The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.
Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com
Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
do_div() needs 'u64' type, or it reports warning. And negative number
is meaningless for "speed", so change all signed to unsigned within
swsusp_show_speed().
The related warning (with allmodconfig for unicore32):
CC kernel/power/hibernate.o
kernel/power/hibernate.c: In function ‘swsusp_show_speed’:
kernel/power/hibernate.c:237: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
[rjw: Subject]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On x86 the allocation of irq descriptors may allocate interrupts which
are in the range of the GSI interrupts. That's wrong as those
interrupts are hardwired and we don't have the irq domain translation
like PPC. So one of these interrupts can be hooked up later to one of
the devices which are hard wired to it and the io_apic init code for
that particular interrupt line happily reuses that descriptor with a
completely different configuration so hell breaks lose.
Inside x86 we allocate dynamic interrupts from above nr_gsi_irqs,
except for a few usage sites which have not yet blown up in our face
for whatever reason. But for drivers which need an irq range, like the
GPIO drivers, we have no limit in place and we don't want to expose
such a detail to a driver.
To cure this introduce a function which an architecture can implement
to impose a lower bound on the dynamic interrupt allocations.
Implement it for x86 and set the lower bound to nr_gsi_irqs, which is
the end of the hardwired interrupt space, so all dynamic allocations
happen above.
That not only allows the GPIO driver to work sanely, it also protects
the bogus callsites of create_irq_nr() in hpet, uv, irq_remapping and
htirq code. They need to be cleaned up as well, but that's a separate
issue.
Reported-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Krogerus Heikki <heikki.krogerus@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1404241617360.28206@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel passes any args it doesn't need through to init, except it
assumes anything containing '.' belongs to the kernel (for a module).
This change means all users can clearly distinguish which arguments
are for init.
For example, the kernel uses debug ("dee-bug") to mean log everything to
the console, where systemd uses the debug from the Scandinavian "day-boog"
meaning "fail to boot". If a future versions uses argv[] instead of
reading /proc/cmdline, this confusion will be avoided.
eg: test 'FOO="this is --foo"' -- 'systemd.debug="true true true"'
Gives:
argv[0] = '/debug-init'
argv[1] = 'test'
argv[2] = 'systemd.debug=true true true'
envp[0] = 'HOME=/'
envp[1] = 'TERM=linux'
envp[2] = 'FOO=this is --foo'
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We remove the waiting module removal in commit 3f2b9c9cdf (September
2013), but it turns out that modprobe in kmod (< version 16) was
asking for waiting module removal. No one noticed since modprobe would
check for 0 usage immediately before trying to remove the module, and
the race is unlikely.
However, it means that anyone running old (but not ancient) kmod
versions is hitting the printk designed to see if anyone was running
"rmmod -w". All reports so far have been false positives, so remove
the warning.
Fixes: 3f2b9c9cdf
Reported-by: Valerio Vanni <valerio.vanni@inwind.it>
Cc: Elliott, Robert (Server Storage) <Elliott@hp.com>
Cc: stable@kernel.org
Acked-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull irq fixes from Thomas Gleixner:
"A slighlty large fix for a subtle issue in the CPU hotplug code of
certain ARM SoCs, where the not yet online cpu needs to setup the cpu
local timer and needs to set the interrupt affinity to itself.
Setting interrupt affinity to a not online cpu is prohibited and
therefor the timer interrupt ends up on the wrong cpu, which leads to
nasty complications.
The SoC folks tried to hack around that in the SoC code in some more
than nasty ways. The proper solution is to have a way to enforce the
affinity setting to a not online cpu. The core patch to the genirq
code provides that facility and the follow up patches make use of it
in the GIC interrupt controller and the exynos timer driver.
The change to the core code has no implications to existing users,
except for the rename of the locked function and therefor the
necessary fixup in mips/cavium. Aside of that, no runtime impact is
possible, as none of the existing interrupt chips implements anything
which depends on the force argument of the irq_set_affinity()
callback"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Exynos_mct: Register clock event after request_irq()
clocksource: Exynos_mct: Use irq_force_affinity() in cpu bringup
irqchip: Gic: Support forced affinity setting
genirq: Allow forcing cpu affinity of interrupts
Use pr_fmt and remove embedded prefixes.
Realign modified multi-line statements to open parenthesis.
Convert embedded function name to "%s: ", __func__
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As suggested by scripts/checkpatch.pl, substitude all pr_warning()
with pr_warn().
No functional change.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6612f05b88 ("cgroup: unify pidlist and other file handling")
has removed the only user of cgroup_pidlist_seq_operations :
cgroup_pidlist_open().
This patch removes it.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1d5be6b287 ("cgroup: move module ref handling into
rebind_subsystems()") makes parse_cgroupfs_options() no longer takes
refcounts on subsystems.
And unified hierachy makes parse_cgroupfs_options not need to call
with cgroup_mutex held to protect the cgroup_subsys[].
So this patch removes BUG_ON() and the comment. As the comment
doesn't contain useful information afterwards, the whole comment is
removed.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.
* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.
* There is single monitoring point at the root. There's no way to
delegate management of a subtree.
* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.
* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.
This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.
v2: Patch description updated as per Serge.
v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Lennart Poettering <lennart@poettering.net>
Pull in driver-core-next to receive kernfs_notify() updates which will
be used by the planned "cgroup.populated" implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Support for uevent_helper, aka hotplug, is not required on many systems
these days but it can still be enabled via sysfs or sysctl.
Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com>
Signed-off-by: Michael Marineau <mike@marineau.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.
To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
Which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using ftrace_ops_list_func, we should skip 4 instead of 3,
to avoid ftrace_call+0x5/0xb appearing in the stack trace:
Depth Size Location (110 entries)
----- ---- --------
0) 2956 0 update_curr+0xe/0x1e0
1) 2956 68 ftrace_call+0x5/0xb
2) 2888 92 enqueue_entity+0x53/0xe80
3) 2796 80 enqueue_task_fair+0x47/0x7e0
4) 2716 28 enqueue_task+0x45/0x70
5) 2688 12 activate_task+0x22/0x30
Add a function using_ftrace_ops_list_func() to test for this while keeping
ftrace_ops_list_func to remain static.
Link: http://lkml.kernel.org/p/1398006644-5935-2-git-send-email-wangjiaxing@insigma.com.cn
Signed-off-by: Jiaxing Wang <wangjiaxing@insigma.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is no need to prohibit probing on the functions
used for preparation and uprobe only fetch functions.
Those are safely probed because those are not invoked
from kprobe's breakpoint/fault/debug handlers. So there
is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
update_bitfield_fetch_param
free_bitfield_fetch_param
kprobe_register
FETCH_FUNC_NAME(stack, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string_size) in trace_uprobe.c
FETCH_FUNC_NAME(file_offset, type) in trace_uprobe.c
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081800.26341.56504.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used for preparation, registeration, optimization,
controll etc. Those are safely probed because those are
not invoked from breakpoint/fault/debug handlers,
there is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
add_new_kprobe
aggr_kprobe_disabled
alloc_aggr_kprobe
alloc_aggr_kprobe
arm_all_kprobes
__arm_kprobe
arm_kprobe
arm_kprobe_ftrace
check_kprobe_address_safe
collect_garbage_slots
collect_garbage_slots
collect_one_slot
debugfs_kprobe_init
__disable_kprobe
disable_kprobe
disarm_all_kprobes
__disarm_kprobe
disarm_kprobe
disarm_kprobe_ftrace
do_free_cleaned_kprobes
do_optimize_kprobes
do_unoptimize_kprobes
enable_kprobe
force_unoptimize_kprobe
free_aggr_kprobe
free_aggr_kprobe
__free_insn_slot
__get_insn_slot
get_optimized_kprobe
__get_valid_kprobe
init_aggr_kprobe
init_aggr_kprobe
in_nokprobe_functions
kick_kprobe_optimizer
kill_kprobe
kill_optimized_kprobe
kprobe_addr
kprobe_optimizer
kprobe_queued
kprobe_seq_next
kprobe_seq_start
kprobe_seq_stop
kprobes_module_callback
kprobes_open
optimize_all_kprobes
optimize_kprobe
prepare_kprobe
prepare_optimized_kprobe
register_aggr_kprobe
register_jprobe
register_jprobes
register_kprobe
register_kprobes
register_kretprobe
register_kretprobe
register_kretprobes
register_kretprobes
report_probe
show_kprobe_addr
try_to_optimize_kprobe
unoptimize_all_kprobes
unoptimize_kprobe
unregister_jprobe
unregister_jprobes
unregister_kprobe
__unregister_kprobe_bottom
unregister_kprobes
__unregister_kprobe_top
unregister_kretprobe
unregister_kretprobe
unregister_kretprobes
unregister_kretprobes
wait_for_kprobe_optimizer
I tested those functions by putting kprobes on all
instructions in the functions with the bash script
I sent to LKML. See:
https://lkml.org/lkml/2014/3/27/33
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081753.26341.57889.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: fche@redhat.com
Cc: systemtap@sourceware.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>