Generate the state bit definitions from the lockdep_states.h file.
Also, move LOCK_USED to last, so that the
USED_IN
USED_IN_READ
ENABLED
ENABLED_READ
states are nicely bit aligned -- we're going to use that property
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a header file to generate all the states from.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
s/HELD_OVER/ENABLED/g
so that its similar to the hard and soft-irq names.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
s/\(LOCKF\?_ENABLED_[^ ]*\)S\(_READ\)\?\>/\1\2/g
So that the USED_IN and ENABLED have the same names.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Here is another version, with the incremental patch rolled up, and
added reclaim context annotation to kswapd, and allocation tracing
to slab allocators (which may only ever reach the page allocator
in rare cases, so it is good to put annotations here too).
Haven't tested this version as such, but it should be getting closer
to merge worthy ;)
--
After noticing some code in mm/filemap.c accidentally perform a __GFP_FS
allocation when it should not have been, I thought it might be a good idea to
try to catch this kind of thing with lockdep.
I coded up a little idea that seems to work. Unfortunately the system has to
actually be in __GFP_FS page reclaim, then take the lock, before it will mark
it. But at least that might still be some orders of magnitude more common
(and more debuggable) than an actual deadlock condition, so we have some
improvement I hope (the concept is no less complete than discovery of a lock's
interrupt contexts).
I guess we could even do the same thing with __GFP_IO (normal reclaim), and
even GFP_NOIO locks too... but filesystems will have the most locks and fiddly
code paths, so let's start there and see how it goes.
It *seems* to work. I did a quick test.
=================================
[ INFO: inconsistent lock state ]
2.6.28-rc6-00007-ged31348-dirty #26
---------------------------------
inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage.
modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes:
(testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
{in-reclaim-W} state was registered at:
[<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60
[<ffffffff80268f71>] lock_acquire+0x91/0xc0
[<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310
[<ffffffffa002002b>] brd_init+0x2b/0x216 [brd]
[<ffffffff8020903b>] _stext+0x3b/0x170
[<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
[<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 3929
hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310
hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310
softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0
softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0
other info that might help us debug this:
1 lock held by modprobe/8526:
#0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
stack backtrace:
Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26
Call Trace:
[<ffffffff80265483>] print_usage_bug+0x193/0x1d0
[<ffffffff80266530>] mark_lock+0xaf0/0xca0
[<ffffffff80266735>] mark_held_locks+0x55/0xc0
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60
[<ffffffff80285005>] __alloc_pages_internal+0x475/0x580
[<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffffa002006a>] brd_init+0x6a/0x216 [brd]
[<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
[<ffffffff8020903b>] _stext+0x3b/0x170
[<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10
[<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180
[<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190
[<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
[<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This modifies the timer code in a way to allow lockdep to detect
deadlocks resulting from a lock being taken in the timer function
as well as around the del_timer_sync() call.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
uids in namespaces other than init don't get a sysfs entry.
For those in the init namespace, while we're waiting to remove
the sysfs entry for the uid the uid is still hashed, and
alloc_uid() may re-grab that uid without getting a new
reference to the user_ns, which we've already put in free_user
before scheduling remove_user_sysfs_dir().
Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
Make sure that all timer functions call cpu_timer_sample_group() and not
cpu_clock_sample_group(). This ensures that we enable the process wide timer
in time, and therefore pay the O(n) thread group cost from the syscall.
Not doing it here, will result in the first jiffy tick after setting the timer
doing this, resulting in a very expensive tick (but only once) and a delay in
actually starting the timer.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jaswinder Singh Rajput reported that commit 23a185ca8a caused the
context switch and migration software counters to report zero always.
With that commit, the software counters only count events that occur
between sched-in and sched-out for a task. This is necessary for the
counter enable/disable prctls and ioctls to work. However, the
context switch and migration counts are incremented after sched-out
for one task and before sched-in for the next. Since the increment
doesn't occur while a task is scheduled in (as far as the software
counters are concerned) it doesn't count towards any counter.
Thus the context switch and migration counters need to count events
that occur at any time, provided the counter is enabled, not just
those that occur while the task is scheduled in (from the perf_counter
subsystem's point of view). The problem though is that the software
counter code can't tell the difference between being enabled and being
scheduled in, and between being disabled and being scheduled out,
since we use the one pair of enable/disable entry points for both.
That is, the high-level disable operation simply arranges for the
counter to not be scheduled in any more, and the high-level enable
operation arranges for it to be scheduled in again.
One way to solve this would be to have sched_in/out operations in the
hw_perf_counter_ops struct as well as enable/disable. However, this
takes a simpler approach: it adds a 'prev_state' field to the
perf_counter struct that allows a counter's enable method to know
whether the counter was previously disabled or just inactive
(scheduled out), and therefore whether the enable method is being
called as a result of a high-level enable or a schedule-in operation.
This then allows the context switch, migration and page fault counters
to reset their hw.prev_count value in their enable functions only if
they are called as a result of a high-level enable operation.
Although page faults would normally only occur while the counter is
scheduled in, this changes the page fault counter code too in case
there are ever circumstances where page faults get counted against a
task while its counters are not scheduled in.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
request_irq() calls into proc code via __setup_irq() which is not safe
in an atomic context, so request_irq() can itself use the more
reliable GFP_KERNEL allocation for the action descriptor.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
While reviewing the ring buffer code, I thougth I saw a bug with
if (!__raw_spin_trylock(&cpu_buffer->lock))
goto out_unlock;
But I forgot that we use a variable "lock_taken" that is set if
the spinlock is taken, and only unlock it if that variable is set.
To avoid further confusion from other reviewers, this patch
renames the label out_unlock with out_reset, which is the more
appropriate name.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
CONFIG_SOFTLOCKUP=y || CONFIG_DETECT_HUNG_TASKS=y is now the only user
of the 'one' constant in kernel/sysctl.c. Move it to the softlockup
block of constants.
This fixes a GCC warning.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I enabled all cgroup subsystems when compiling kernel, and then:
# mount -t cgroup -o net_cls xxx /mnt
# mkdir /mnt/0
This showed up immediately:
BUG: MAX_LOCKDEP_SUBCLASSES too low!
turning off the locking correctness validator.
It's caused by the cgroup hierarchy lock:
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->root == root)
mutex_lock_nested(&ss->hierarchy_mutex, i);
}
Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
MAX_LOCKDEP_SUBCLASSES is 8.
This patch uses different lockdep keys for different subsystems.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to pass an unsigned long as the minimum, because it gets casted
to an unsigned long in the sysctl handler. If we pass an int, we'll
access four more bytes on 64bit arches, resulting in a random minimum
value.
[rientjes@google.com: fix type of `old_bytes']
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catalin noticed that (38d47c1b70: futex: rely on get_user_pages() for
shared futexes) caused an mm_struct leak.
Some tracing with the function graph tracer quickly pointed out that
futex_wait() has exit paths with unbalanced reference counts.
This regression was discovered by kmemleak.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: fix TIMER_ABSTIME for process wide cpu timers
timers: split process wide cpu clocks/timers, fix
x86: clean up hpet timer reinit
timers: split process wide cpu clocks/timers, remove spurious warning
timers: split process wide cpu clocks/timers
signal: re-add dead task accumulation stats.
x86: fix hpet timer reinit for x86_64
sched: fix nohz load balancer on cpu offline
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace, x86: fix the usage of ptrace_fork()
i8327: fix outb() parameter order
x86: fix math_emu register frame access
x86: math_emu info cleanup
x86: include correct %gs in a.out core dump
x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
x86: find nr_irqs_gsi with mp_ioapic_routing
x86: add clflush before monitor for Intel 7400 series
x86: disable intel_iommu support by default
x86: don't apply __supported_pte_mask to non-present ptes
x86: fix grammar in user-visible BIOS warning
x86/Kconfig.cpu: make Kconfig help readable in the console
x86, 64-bit: print DMI info in the oops trace
Only free child_counter if it has a parent; if it doesn't, then it
has a file pointing to it and we'll free it in perf_release.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The POSIX timer interface allows for absolute time expiry values through the
TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock
every time we start it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To decrease the chance of a missed enable, always enable the timer when we
sample it, we'll always disable it when we find that there are no active timers
in the jiffy tick.
This fixes a flood of warnings reported by Mike Galbraith.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the missing pair tracing_{start,stop}_record_cmdline() to record well
the cmdline associated with pid.
Changes in v2:
- fix a build error, the sched_switch tracer is needed to record the
cmdline.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we check if a task has been switched out since the last scan, we might
have a race condition on the following scenario:
- the task is freshly created and scheduled
- it puts its state to TASK_UNINTERRUPTIBLE and is not yet switched out
- check_hung_task() scans this task and will report a false positive because
t->nvcsw + t->nivcsw == t->last_switch_count == 0
Add a check for such cases.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I noticed by pure accident we have ptrace_fork() and friends. This was
added by "x86, bts: add fork and exit handling", commit
bf53de907d.
I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
believe this needs the fix. I think something like this program
int main(void)
{
int pid = fork();
if (!pid) {
ptrace(PTRACE_TRACEME, 0, NULL, NULL);
kill(getpid(), SIGSTOP);
fork();
} else {
struct ptrace_bts_config bts = {
.flags = PTRACE_BTS_O_ALLOC,
.size = 4 * 4096,
};
wait(NULL);
ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK);
ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts));
ptrace(PTRACE_CONT, pid, NULL, NULL);
sleep(1);
}
return 0;
}
should crash the kernel.
If the task is traced by its natural parent ptrace_reparented() returns 0
but we should clear ->btsxxx anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew had some suggestions for the latencytop file; this patch takes care
of most of these:
* Add documentation
* Turn account_scheduler_latency into an inline function
* Don't report negative values to userspace
* Make the file operations struct const
* Fix a few checkpatch.pl warnings
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
kernel/trace/ring_buffer.c:70:37: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:84:39: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:96:43: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2500:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/ring_buffer.c:2505:44: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2507:46: warning: incorrect type in argument 2 (different signedness)
kernel/trace/trace.c:2130:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/trace.c:2280:40: warning: incorrect type in argument 3 (different signedness)
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make global variables and a global function static
The function '__trace_userstack' does not seem to have a caller, so it
is commented out.
Fix this sparse warnings:
kernel/trace/trace.c:82:5: warning: symbol 'tracing_disabled' was not declared. Should it be static?
kernel/trace/trace.c:600:10: warning: symbol 'trace_record_cmdline_disabled' was not declared. Should it be static?
kernel/trace/trace.c:957:6: warning: symbol '__trace_userstack' was not declared. Should it be static?
kernel/trace/trace.c:1694:5: warning: symbol 'tracing_release' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>