Commit Graph

31448 Commits

Author SHA1 Message Date
Paul E. McKenney
343e9099c8 rcu: fix rcutorture behavior during reboot
Impact: fix very rare reboot hang

Because rcutorture ignored all signals, it does not terminate in
response to the signals sent at shutdown time.  This can cause strange
failures due to its continuing to make use of kernel function too late
in the shutdown sequence.  This patch therefore adds a shutdown notifier
to rcutorture, causing it to shut down in response to a reboot or an
orderly shutdown.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-17 00:04:40 +01:00
Arjan van de Ven
3ac52669c7 resources: skip sanity check of busy resources
Impact: reduce false positives in iomem_map_sanity_check()

Some drivers (vesafb) only map/reserve a portion of a resource.
If then some other driver comes in and maps the whole resource,
the current code WARN_ON's. This is not the intent of the checks
in iomem_map_sanity_check(); rather these checks want to
warn when crossing *hardware* resources only.

This patch skips BUSY resources as suggested by Linus.

Note: having two drivers talk to the same hardware at the same
time is obviously not optimal behavior, but that's a separate story.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 23:30:49 +01:00
Bharata B Rao
80f40ee4a0 sched: use RCU variant of list traversal in for_each_leaf_rt_rq()
Impact: fix potential of rare crash

for_each_leaf_rt_rq() walks an RCU protected list (rq->leaf_rt_rq_list),
but doesn't use list_for_each_entry_rcu(). Fix this.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 21:39:14 +01:00
Ken Chen
e9515c3c9f sched, cpuacct: export percpu cpuacct cgroup stats
This patch export per-cpu CPU cycle usage for a given cpuacct cgroup.
There is a need for a user space monitor daemon to track group CPU
usage on per-cpu base.  It is also useful for monitoring CFS load
balancer behavior by tracking per CPU group usage.

Signed-off-by: Ken Chen <kenchen@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 12:15:01 +01:00
Ken Chen
720f54988e sched, cpuacct: refactoring cpuusage_read / cpuusage_write
Impact: micro-optimize the code on 64-bit architectures

In the thread regarding to 'export percpu cpuacct cgroup stats'
http://lkml.org/lkml/2008/12/7/13

akpm pointed out that current cpuacct code is inefficient.  This patch
refactoring the following:

* make cpu_rq locking only on 32-bit
* change iterator to each_present_cpu instead of each_possible_cpu to
  make it hotplug friendly.

It's a bit of code churn, but I was rewarded with 160 byte code size saving
on x86-64 arch and zero code size change on i386.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 12:15:00 +01:00
Ingo Molnar
9dfc3bc7d2 Merge branches 'tracing/fastboot', 'tracing/ftrace', 'tracing/function-graph-tracer' and 'tracing/hw-branch-tracing' into tracing/core 2008-12-16 12:03:38 +01:00
Peter Zijlstra
34f28ecd0f sched: optimize update_curr()
Impact: micro-optimization

Skip the hard work when there is none.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 09:46:33 +01:00
Mike Galbraith
03e89e4574 sched: fix wakeup preemption clock
Impact: sharpen the wakeup-granularity to always be against current scheduler time

It was possible to do the preemption check against an old time stamp.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-16 09:45:38 +01:00
Paul Menage
307257cf47 cgroups: fix a race between rmdir and remount
When a cgroup is removed, it's unlinked from its parent's children list,
but not actually freed until the last dentry on it is released (at which
point cgrp->root->number_of_cgroups is decremented).

Currently rebind_subsystems checks for the top cgroup's child list being
empty in order to rebind subsystems into or out of a hierarchy - this can
result in the set of subsystems bound to a hierarchy being
removed-but-not-freed cgroup.

The simplest fix for this is to forbid remounts that change the set of
subsystems on a hierarchy that has removed-but-not-freed cgroups.  This
bug can be reproduced via:

mkdir /mnt/cg
mount -t cgroup -o ns,freezer cgroup /mnt/cg
mkdir /mnt/cg/foo
sleep 1h < /mnt/cg/foo &
rmdir /mnt/cg/foo
mount -t cgroup -o remount,ns,devices,freezer cgroup /mnt/cg
kill $!

Though the above will cause oops in -mm only but not mainline, but the bug
can cause memory leak in mainline (and even oops)

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-15 16:27:07 -08:00
Linus Torvalds
ca7e716c78 Revert "sched_clock: prevent scd->clock from moving backwards"
This reverts commit 5b7dba4ff8, which
caused a regression in hibernate, reported by and bisected by Fabio
Comolli.

This revert fixes

 http://bugzilla.kernel.org/show_bug.cgi?id=12155
 http://bugzilla.kernel.org/show_bug.cgi?id=12149

Bisected-by: Fabio Comolli <fabio.comolli@gmail.com>
Requested-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-14 16:23:17 -08:00
Ingo Molnar
0cc0c027d4 perfcounters: release CPU context when exiting task counters
If counters are exiting via do_exit() not via filp close, then
the CPU context needs to be released - otherwise future percpu
counter creations might fail.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 23:25:02 +01:00
Ingo Molnar
088e2852c8 perfcounters, x86: fix sw counters on non-PMC CPUs
Make perf_max_counters default to at least 1 - this allows the sw
counters to be used.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:31:29 +01:00
Ingo Molnar
e06c61a879 perfcounters: add nr-of-faults counter
Impact: add new feature, new sw counter

Add a counter that counts the number of pagefaults a task
is experiencing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:31:27 +01:00
Ingo Molnar
6c594c21fc perfcounters: add task migrations counter
Impact: add new feature, new sw counter

Add a counter that counts the number of cross-CPU migrations a
task is suffering.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:31:26 +01:00
Ingo Molnar
5d6a27d8a0 perfcounters: add context switch counter
Impact: add new feature, new sw counter

Add a counter that counts the number of context-switches a task
is doing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:31:23 +01:00
Ingo Molnar
8cb391e878 perfcounters: fix task clock counter
Impact: bugfix

Update the task clock counter to the new math.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:30:50 +01:00
Ingo Molnar
9b51f66dcb perfcounters: implement "counter inheritance"
Impact: implement new performance feature

Counter inheritance can be used to run performance counters in a workload,
transparently - and pipe back the counter results to the parent counter.

Inheritance for performance counters works the following way: when creating
a counter it can be marked with the .inherit=1 flag. Such counters are then
'inherited' by all child tasks (be they fork()-ed or clone()-ed). These
counters get inherited through exec() boundaries as well (except through
setuid boundaries).

The counter values get added back to the parent counter(s) when the child
task(s) exit - much like stime/utime statistics are gathered. So inherited
counters are ideal to gather summary statistics about an application's
behavior via shell commands, without having to modify that application.

The timec.c command utilizes counter inheritance:

  http://redhat.com/~mingo/perfcounters/timec.c

Sample output:

   $ ./timec -e 1 -e 3 -e 5 ls -lR /usr/include/ >/dev/null

   Performance counter stats for 'ls':

           163516953 instructions
                2295 cache-misses
             2855182 branch-misses

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:30:49 +01:00
Ingo Molnar
ee06094f82 perfcounters: restructure x86 counter math
Impact: restructure code

Change counter math from absolute values to clear delta logic.

We try to extract elapsed deltas from the raw hw counter - and put
that into the generic counter.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-14 20:30:48 +01:00
Rusty Russell
968ea6d80e Merge ../linux-2.6-x86
Conflicts:

	arch/x86/kernel/io_apic.c
	kernel/sched.c
	kernel/sched_stats.h
2008-12-13 21:55:51 +10:30
Rusty Russell
320ab2b0b1 cpumask: convert struct clock_event_device to cpumask pointers.
Impact: change calling convention of existing clock_event APIs

struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.

Another single-patch change.  For safety, we BUG_ON() in
clockevents_register_device() if it's not set.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
2008-12-13 21:20:26 +10:30
Rusty Russell
0de26520c7 cpumask: make irq_set_affinity() take a const struct cpumask
Impact: change existing irq_chip API

Not much point with gentle transition here: the struct irq_chip's
setaffinity method signature needs to change.

Fortunately, not widely used code, but hits a few architectures.

Note: In irq_select_affinity() I save a temporary in by mangling
irq_desc[irq].affinity directly.  Ingo, does this break anything?

(Folded in fix from KOSAKI Motohiro)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: jeremy@xensource.com
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
2008-12-13 21:20:26 +10:30
Rusty Russell
29c0177e6a cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers.
Impact: change calling convention of existing cpumask APIs

Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.

These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: paulus@samba.org
Cc: mingo@redhat.com
Cc: tony.luck@intel.com
Cc: ralf@linux-mips.org
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: cl@linux-foundation.org
Cc: srostedt@redhat.com
2008-12-13 21:20:25 +10:30
Rusty Russell
98a79d6a50 cpumask: centralize cpu_online_map and cpu_possible_map
Impact: cleanup

Each SMP arch defines these themselves.  Move them to a central
location.

Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
   CONFIG_INIT_ALL_POSSIBLE for this rather than break them.

2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
   Those archs simply have phys_cpu_present_map replaced everywhere.

3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
   so I just manipulate them both in sync.

4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
   declarations.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
Cc: ink@jurassic.park.msu.ru
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: tony.luck@intel.com
Cc: takata@linux-m32r.org
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: paulus@samba.org
Cc: schwidefsky@de.ibm.com
Cc: lethal@linux-sh.org
Cc: wli@holomorphy.com
Cc: davem@davemloft.net
Cc: jdike@addtoit.com
Cc: mingo@redhat.com
2008-12-13 21:19:41 +10:30
Oleg Nesterov
899921025b posix-timers: check ->it_signal instead of ->it_pid to validate the timer
Impact: clean up, speed up

->it_pid (was ->it_process) has also a special meaning: if it is NULL,
the timer is under deletion or it wasn't initialized yet. We can check
->it_signal != NULL instead, this way we can

	- simplify sys_timer_create() a bit

	- remove yet another check from lock_timer()

	- move put_pid(->it_pid) into release_posix_timer() which
	  runs outside of ->it_lock

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-12-12 17:00:34 +01:00
Oleg Nesterov
27af4245b6 posix-timers: use "struct pid*" instead of "struct task_struct*"
Impact: restructure, clean up code

k_itimer holds the ref to the ->it_process until sys_timer_delete(). This
allows to pin up to RLIMIT_SIGPENDING dead task_struct's. Change the code
to use "struct pid *" instead.

The patch doesn't kill ->it_process, it places ->it_pid into the union.
->it_process is still used by do_cpu_nanosleep() as before. It would be
trivial to change the nanosleep code as well, but since it uses it_process
in a special way I think it is better to keep this field for grep.

The patch bloats the kernel by 104 bytes and it also adds the new pointer,
->it_signal, to k_itimer. It is used by lock_timer() to verify that the
found timer was not created by another process. It is not clear why do we
use the global database (and thus the global idr_lock) for posix timers.
We still need the signal_struct->posix_timers which contains all useable
timers, perhaps it is better to use some form of per-process array
instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-12-12 17:00:07 +01:00
Woodruff, Richard
001474491f nohz: suppress needless timer reprogramming
In my device I get many interrupts from a high speed USB device in a very
short period of time.  The system spends a lot of time reprogramming the
hardware timer which is in a slower timing domain as compared to the CPU. 
This results in the CPU spending a huge amount of time waiting for the
timer posting to be done.  All of this reprogramming is useless as the
wake up time has not changed.

As measured using ETM trace this drops my reprogramming penalty from
almost 60% CPU load down to 15% during high interrupt rate.  I can send
traces to show this.

Suppress setting of duplicate timer event when timer already stopped. 
Timer programming can be very costly and can result in long cpu stall/wait
times.

[akpm@linux-foundation.org: coding-style fixes]
[tglx@linutronix.de: move the check to the right place and avoid raising
		     the softirq for nothing]

Signed-off-by: Richard Woodruff <r-woodruff2@ti.com>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-12-12 16:55:31 +01:00
Ingo Molnar
8299608f14 Merge branches 'irq/sparseirq', 'x86/quirks' and 'x86/reboot' into cpus4096
We merge the irq/sparseirq, x86/quirks and x86/reboot trees into the
cpus4096 tree because the io-apic changes in the sparseirq change
conflict with the cpumask changes in the cpumask tree, and we
want to resolve those.
2008-12-12 13:49:24 +01:00
Ingo Molnar
45ab6b0c76 Merge branch 'sched/core' into cpus4096
Conflicts:
	include/linux/ftrace.h
	kernel/sched.c
2008-12-12 13:48:57 +01:00
Heiko Carstens
d65bd5ecb2 sched: add missing arch_update_cpu_topology() call
arch_reinit_sched_domains() used to call arch_update_cpu_topology()
via arch_init_sched_domains(). This call got lost with
e761b77252 ("cpu hotplug, sched: Introduce
cpu_active_map and redo sched domain managment (take 2)".

So we might end up with outdated and missing cpus in the cpu core
maps (architecture used to call arch_reinit_sched_domains if cpu
topology changed).

This adds a call to arch_update_cpu_topology in partition_sched_domains
which gets called whenever scheduling domains get updated. Which is
what is supposed to happen when cpu topology changes.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 13:47:22 +01:00
Heiko Carstens
ee79d1bdb6 sched: let arch_update_cpu_topology indicate if topology changed
Change arch_update_cpu_topology so it returns 1 if the cpu topology changed
and 0 if it didn't change. This will be useful for the next patch which adds
a call to this function in partition_sched_domains.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 13:47:21 +01:00
Ingo Molnar
81444a7995 Merge branch 'tracing/fastboot' into cpus4096 2008-12-12 12:43:05 +01:00
Peter Zijlstra
cbc34ed1ac sched: fix tracepoints in scheduler
The trace point only caught one of many places where a task changes cpu,
put it in the right place to we get all of them.

Change the signature while we're at it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 12:08:26 +01:00
Ingo Molnar
92bf73e90a Merge branch 'x86/irq' into perfcounters/core
( with manual semantic merge of arch/x86/kernel/cpu/perf_counter.c )
2008-12-12 12:00:14 +01:00
Frederic Weisbecker
f8b755ac8e tracing/function-graph-tracer: Output arrows signal on hardirq call/return
Impact: make more obvious the hardirq calls in the output

When a hardirq is triggered inside the codeflow on output, we have
now two arrows that indicate the entry and return of the hardirq.

 0)               |          bit_waitqueue() {
 0)   0.880 us    |            __phys_addr();
 0)   2.699 us    |          }
 0)               |          __wake_up_bit() {
 0)   ==========> |          smp_apic_timer_interrupt() {
 0)   0.797 us    |            native_apic_mem_write();
 0)   0.715 us    |            exit_idle();
 0)               |            irq_enter() {
 0)   0.722 us    |              idle_cpu();
 0)   5.519 us    |            }
 0)               |            hrtimer_interrupt() {
 0)               |              ktime_get() {
 0)               |                ktime_get_ts() {
 0)   0.805 us    |                  getnstimeofday();

 [...]

 0) ! 108.528 us  |            }
 0)               |            irq_exit() {
 0)               |              do_softirq() {
 0)               |                __do_softirq() {
 0)   0.895 us    |                  __local_bh_disable();
 0)               |                  run_timer_softirq() {
 0)   0.827 us    |                    hrtimer_run_pending();
 0)   1.226 us    |                    _spin_lock_irq();
 0)               |                    _spin_unlock_irq() {
 0)   6.550 us    |                  }
 0)   0.924 us    |                  _local_bh_enable();
 0) + 12.129 us   |                }
 0) + 13.911 us   |              }
 0)   0.707 us    |              idle_cpu();
 0) + 17.009 us   |            }
 0) ! 137.419 us  |          }
 0)   <========== |
 0)   1.045 us    |          }
 0) ! 148.908 us  |        }
 0) ! 151.022 us  |      }
 0) ! 153.022 us  |    }
 0)   0.963 us    |    journal_mark_dirty();
 0)   0.925 us    |    __brelse();

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 11:14:09 +01:00
Ingo Molnar
c1dfdc7597 Merge commit 'v2.6.28-rc8' into sched/core 2008-12-12 10:29:35 +01:00
Markus Metzger
a93751cab7 x86, bts, ftrace: adapt the hw-branch-tracer to the ds.c interface
Impact: restructure code, cleanup

Remove BTS bits from the hw-branch-tracer (renamed from bts-tracer) and
use the ds interface.

Signed-off-by: Markus Metzger <markut.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 08:08:14 +01:00
Heiko Carstens
fa116ea35e nohz: no softirq pending warnings for offline cpus
Impact: remove false positive warning

After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
it still might have pending softirqs. However take_cpu_down makes sure
that the idle task will run next instead of ksoftirqd on the taken down cpu.
The idle task will call tick_nohz_stop_sched_tick which might warn about
pending softirqs just before the cpu kills itself completely.

However the pending softirqs on the dead cpu aren't a problem because they
will be moved to an online cpu during CPU_DEAD handling.

So make sure we warn only for online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 07:27:01 +01:00
Robert Richter
c4f50183f9 ring_buffer: adding EXPORT_SYMBOLs
I added EXPORT_SYMBOL_GPLs for all functions part of the API
(ring_buffer.h). This is required since oprofile is using the ring
buffer and the compilation as modules would fail otherwise.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-12 06:54:55 +01:00
Ingo Molnar
6a930700c8 perf counters: clean up state transitions
Impact: cleanup

Introduce a proper enum for the 3 states of a counter:

	PERF_COUNTER_STATE_OFF		= -1
	PERF_COUNTER_STATE_INACTIVE	=  0
	PERF_COUNTER_STATE_ACTIVE	=  1

and rename counter->active to counter->state and propagate the
changes everywhere.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:56 +01:00
Ingo Molnar
1d1c7ddbfa perf counters: add prctl interface to disable/enable counters
Add a way for self-monitoring tasks to disable/enable counters summarily,
via a prctl:

	PR_TASK_PERF_COUNTERS_DISABLE		31
	PR_TASK_PERF_COUNTERS_ENABLE		32

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:55 +01:00
Ingo Molnar
bae43c9945 perf counters: implement PERF_COUNT_TASK_CLOCK
Impact: add new perf-counter type

The 'task clock' counter counts the amount of time a task is executing,
in nanoseconds. It stops ticking when a task is scheduled out either due
to it blocking, sleeping or it being preempted.

This counter type is a Linux kernel based abstraction, it is available
even if the hardware does not support native hardware performance counters.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:54 +01:00
Ingo Molnar
01b2838c42 perf counters: consolidate hw_perf save/restore APIs
Impact: cleanup

Rename them to better match up the usual IRQ disable/enable APIs:

 hw_perf_disable_all()  => hw_perf_save_disable()
 hw_perf_restore_ctrl() => hw_perf_restore()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:53 +01:00
Ingo Molnar
5c92d12411 perf counters: implement PERF_COUNT_CPU_CLOCK
Impact: add new perf-counter type

The 'CPU clock' counter counts the amount of CPU clock time that is
elapsing, in nanoseconds. (regardless of how much of it the task is
spending on a CPU executing)

This counter type is a Linux kernel based abstraction, it is available
even if the hardware does not support native hardware performance counters.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:52 +01:00
Ingo Molnar
621a01eac8 perf counters: hw driver API
Impact: restructure code, introduce hw_ops driver abstraction

Introduce this abstraction to handle counter details:

 struct hw_perf_counter_ops {
	void (*hw_perf_counter_enable)	(struct perf_counter *counter);
	void (*hw_perf_counter_disable)	(struct perf_counter *counter);
	void (*hw_perf_counter_read)	(struct perf_counter *counter);
 };

This will be useful to support assymetric hw details, and it will also
be useful to implement "software counters". (Counters that count kernel
managed sw events such as pagefaults, context-switches, wall-clock time
or task-local time.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:51 +01:00
Ingo Molnar
ccff286d85 perf counters: group counter, fixes
Impact: bugfix

Check that a group does not span outside the context of a CPU or a task.

Also, do not allow deep recursive hierarchies.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:50 +01:00
Ingo Molnar
04289bb989 perf counters: add support for group counters
Impact: add group counters

This patch adds the "counter groups" abstraction.

Groups of counters behave much like normal 'single' counters, with a
few semantic and behavioral extensions on top of that.

A counter group is created by creating a new counter with the open()
syscall's group-leader group_fd file descriptor parameter pointing
to another, already existing counter.

Groups of counters are scheduled in and out in one atomic group, and
they are also roundrobin-scheduled atomically.

Counters that are member of a group can also record events with an
(atomic) extended timestamp that extends to all members of the group,
if the record type is set to PERF_RECORD_GROUP.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:49 +01:00
Ingo Molnar
9f66a3810f perf counters: restructure the API
Impact: clean up new API

Thorough cleanup of the new perf counters API, we now get clean separation
of the various concepts:

 - introduce perf_counter_hw_event to separate out the event source details

 - move special type flags into separate attributes: PERF_COUNT_NMI,
   PERF_COUNT_RAW

 - extend the type to u64 and reserve it fully to the architecture in the
   raw type case.

And make use of all these changes in the core and x86 perfcounters code.

Also change the syscall signature to:

  asmlinkage int sys_perf_counter_open(

	struct perf_counter_hw_event	*hw_event_uptr		__user,
	pid_t				pid,
	int				cpu,
	int				group_fd);

( Note that group_fd is unused for now - it's reserved for the counter
  groups abstraction. )

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:48 +01:00
Thomas Gleixner
dfa7c899b4 perf counters: expand use of counter->event
Impact: change syscall, cleanup

Make use of the new perf_counters event type.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:47 +01:00
Thomas Gleixner
eab656ae04 perf counters: clean up 'raw' type API
Impact: cleanup

Introduce a separate hw_event type.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-12-11 15:45:46 +01:00
Linus Torvalds
f9fc05e762 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: CPU remove deadlock fix
2008-12-10 14:41:06 -08:00