Impact: build fix
arch/x86/kernel/ds.c: In function 'ds_request':
arch/x86/kernel/ds.c:236: sorry, unimplemented: inlining failed in call to 'ds_get_context': recursive inlining
But the recursion here is scary ...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the BTS bits from ptrace.c into ds.c.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make the ds code more debuggable
Turn BUG_ON()s into WARN_ON_ONCE()s.
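A minimal sketch of the conversion pattern (the condition and the error return are hypothetical):

  /* before: BUG_ON(!context) would panic the box */
  if (WARN_ON_ONCE(!context))
          return NULL;    /* warn once and bail out instead */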
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Introduce a proper enum for the 3 states of a counter:
PERF_COUNTER_STATE_OFF = -1
PERF_COUNTER_STATE_INACTIVE = 0
PERF_COUNTER_STATE_ACTIVE = 1
and rename counter->active to counter->state and propagate the
changes everywhere.
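A minimal sketch of the resulting declarations (the enum tag and the surrounding struct layout are assumptions):

  enum perf_counter_active_state {
          PERF_COUNTER_STATE_OFF          = -1,
          PERF_COUNTER_STATE_INACTIVE     =  0,
          PERF_COUNTER_STATE_ACTIVE       =  1,
  };

  struct perf_counter {
          /* ... */
          enum perf_counter_active_state  state;  /* was: int active */
          /* ... */
  };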
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename them to better match the usual IRQ disable/enable APIs:
hw_perf_disable_all() => hw_perf_save_disable()
hw_perf_restore_ctrl() => hw_perf_restore()
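Illustrative usage, mirroring the local_irq_save()/local_irq_restore() pattern (the u64 cookie type is an assumption):

  u64 perf_flags;

  perf_flags = hw_perf_save_disable();
  /* ... counters are quiescent here ... */
  hw_perf_restore(perf_flags);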
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new perf-counter type
The 'CPU clock' counter counts the amount of CPU clock time that is
elapsing, in nanoseconds, regardless of how much of it the task spends
executing on a CPU.
This counter type is a Linux-kernel-based abstraction; it is available
even if the hardware does not support native hardware performance counters.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure code, introduce hw_ops driver abstraction
Introduce this abstraction to handle counter details:
  struct hw_perf_counter_ops {
          void (*hw_perf_counter_enable)  (struct perf_counter *counter);
          void (*hw_perf_counter_disable) (struct perf_counter *counter);
          void (*hw_perf_counter_read)    (struct perf_counter *counter);
  };
This will be useful to support asymmetric hw details, and it will also
be useful to implement "software counters": counters that count kernel-managed
sw events such as pagefaults, context switches, wall-clock time
or task-local time.
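An illustrative sketch of a "software counter" backend wired up via this abstraction (the cpu-clock helper names are hypothetical):

  static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
          .hw_perf_counter_enable         = cpu_clock_perf_counter_enable,
          .hw_perf_counter_disable        = cpu_clock_perf_counter_disable,
          .hw_perf_counter_read           = cpu_clock_perf_counter_read,
  };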
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add group counters
This patch adds the "counter groups" abstraction.
Groups of counters behave much like normal 'single' counters, with a
few semantic and behavioral extensions on top of that.
A counter group is created by creating a new counter with the open()
syscall's group-leader group_fd file descriptor parameter pointing
to another, already existing counter.
Groups of counters are scheduled in and out as one atomic group, and
they are also round-robin scheduled atomically.
Counters that are members of a group can also record events with an
(atomic) extended timestamp that extends to all members of the group,
if the record type is set to PERF_RECORD_GROUP.
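An illustrative sketch of group creation (the hw_event structures are hypothetical, and -1 standing for "create a new group" is an assumption):

  int leader  = sys_perf_counter_open(&cycles_event, pid, cpu, -1);
  int sibling = sys_perf_counter_open(&insns_event,  pid, cpu, leader);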
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up new API
Thorough cleanup of the new perf counters API; we now get a clean separation
of the various concepts:
- introduce perf_counter_hw_event to separate out the event source details
- move special type flags into separate attributes: PERF_COUNT_NMI,
PERF_COUNT_RAW
- extend the type to u64 and reserve it fully to the architecture in the
raw type case.
And make use of all these changes in the core and x86 perfcounters code.
Also change the syscall signature to:
  asmlinkage int sys_perf_counter_open(
          struct perf_counter_hw_event    *hw_event_uptr  __user,
          pid_t                           pid,
          int                             cpu,
          int                             group_fd);
( Note that group_fd is unused for now - it's reserved for the counter
groups abstraction. )
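An illustrative invocation (the hw_event field and constant names are assumptions, not the exact ABI):

  struct perf_counter_hw_event hw_event = {
          .type   = PERF_COUNT_CPU_CYCLES,
          .nmi    = 1,    /* previously expressed as a special counter type */
  };

  int fd = sys_perf_counter_open(&hw_event, 0 /* pid: current task */,
                                 -1 /* cpu: any */, -1 /* group_fd: unused */);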
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change syscall, cleanup
Make use of the new perf_counters event type.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix rare lost events problem
There are CPUs whose performance counters misbehave on C-state transitions,
so provide a way to just disable/enable them around deep idle methods.
(hw_perf_enable_all() is cheap on x86.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix spurious missed counter wakeups
In the case of NMI events, close a race window that can occur if an NMI
hits counter code that temporarily disables+enables a counter, and the NMI
leaks into the disabled section.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These warnings:
arch/x86/kernel/paravirt-spinlocks.c: In function ‘default_spin_lock_flags’:
arch/x86/kernel/paravirt-spinlocks.c:12: warning: passing argument 1 of ‘__raw_spin_lock’ from incompatible pointer type
arch/x86/kernel/paravirt-spinlocks.c: At top level:
arch/x86/kernel/paravirt-spinlocks.c:11: warning: ‘default_spin_lock_flags’ defined but not used
showed that the prototype of default_spin_lock_flags() was confused about
what type spinlocks have. The proper type on UP is raw_spinlock_t.
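The corrected helper, as implied above (a sketch, not the exact patch):

  static void default_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
  {
          __raw_spin_lock(lock);
  }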
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make perfcounter NMI and IRQ sequence more robust
Make __smp_perf_counter_interrupt() a bit more conservative: first disable
all counters, then read out the status. Most invocations are because there
are real events, so there's no performance impact.
The code flow gets a bit simpler this way as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement performance counters for x86 Intel CPUs.
It's simplified right now: the PERFMON CPU feature is assumed,
which is available in Core2 and later Intel CPUs.
The design is flexible to be extended to more CPU types as well.
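An illustrative capability check (assuming the ARCH_PERFMON CPUID feature flag):

  if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
          return;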
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup on 32-bit
Peter pointed out that this parameter can be changed.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Provide a way to pause the function graph tracer
As suggested by Steven Rostedt, the previous patch that prevented
spinlock function tracing shouldn't have used a raw_spinlock to fix it.
It's much better to keep lockdep coverage with a normal spinlock, so this
patch adds a new per-task flag that allows the function graph tracer to
be paused. We can also emit an ftrace_printk without worrying about
the irrelevant traced spinlock during insertion.
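A minimal sketch of the per-task pause flag (the helper names and the atomic counter are assumptions):

  static inline void pause_graph_tracing(void)
  {
          atomic_inc(&current->tracing_graph_pause);
  }

  static inline void unpause_graph_tracing(void)
  {
          atomic_dec(&current->tracing_graph_pause);
  }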
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: trace more functions
When the function graph tracer is configured, three more files are
excluded from tracing just to prevent only four functions from being
traced. And this impacts the normal function tracer too.
arch/x86/kernel/process_64/32.c:
I had crashes when I let this file be traced. After some debugging, I saw
that the "current" task pointer was changed inside __switch_to(), ie:
"write_pda(pcurrent, next_p);" inside process_64.c. Since the tracer stores
the original return address of the function inside current, we had
crashes. Only __switch_to() has to be excluded from tracing.
kernel/module.c and kernel/extable.c:
These are excluded because of a function used internally by the function
graph tracer: __kernel_text_address().
To let the other functions inside these files be traced, this patch
introduces the __notrace_funcgraph function prefix, which expands to
notrace if the function graph tracer is configured and to nothing if not.
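A sketch of the intended macro (assuming the kernel's notrace attribute):

  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
  #define __notrace_funcgraph     notrace
  #else
  #define __notrace_funcgraph
  #endif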
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Reorder the exit path in __get_smp_config().
Also move two printouts to acpi_process_madt().
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplify code
Pass irq_desc and cfg around, instead of raw IRQ numbers - this way
we don't have to look them up again and again.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: sanitize MSI irq number ordering from top-down to bottom-up
Allocate new MSI IRQs in increasing order starting from nr_irqs_gsi (which
is somewhere below 256), instead of decreasing from NR_IRQS. (The latter
method can result in confusingly high IRQ numbers if NR_CPUS is set to a
high value and NR_IRQS scales up accordingly.)
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.
To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.
When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node() to allocate the
irq_desc; this also makes the IRQ descriptors NUMA-local (to the site
that calls request_irq()).
This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.
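An illustrative sketch of the allocation path (the helper name, the pointer array and the GFP flags are assumptions):

  static struct irq_desc *irq_desc_ptrs[NR_IRQS] __read_mostly;

  static struct irq_desc *irq_to_desc_alloc_cpu(unsigned int irq, int cpu)
  {
          struct irq_desc *desc = irq_desc_ptrs[irq];

          if (!desc) {
                  desc = kzalloc_node(sizeof(*desc), GFP_ATOMIC,
                                      cpu_to_node(cpu));
                  irq_desc_ptrs[irq] = desc;
          }
          return desc;
  }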
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix trampoline sizing bug, save space
While debugging a suspend-to-RAM related issue it occurred to me that
if the trampoline code had grown past 4 KB, we would have been
allocating too little memory for it, since the 4 KB size of the
trampoline is hardcoded into arch/x86/kernel/e820.c. Change that
by making the kernel compute the trampoline size and allocate as much
memory as necessary.
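A minimal sketch of the size computation (assuming trampoline_data/trampoline_end linker symbols):

  extern const unsigned char trampoline_data[];
  extern const unsigned char trampoline_end[];

  #define TRAMPOLINE_SIZE  roundup(trampoline_end - trampoline_data, PAGE_SIZE)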
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add additional FSB values to pentium_core_get_frequency(), from the latest
edition (September 2008) of the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 3B: System Programming Guide, Part 2. The values
added detect the 800, 1067 and 1333 FSB types.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: Dave Jones <davej@redhat.com>
p4-clockmod has a long history of abuse. It pretends to be a CPU
frequency scaling driver even though it doesn't actually change
the CPU frequency; it merely modulates the frequency with
wait-states.
The biggest misconception is that when running at the lower 'frequency'
p4-clockmod is saving power. This isn't the case, as workloads running
slower take longer to complete, preventing the CPU from entering deep C states.
However, p4-clockmod does have a purpose: it can prevent overheating.
Hooking it up to the cpufreq interfaces is the wrong way to achieve
cooling, though; it should instead be hooked up to ACPI.
This patch introduces a means for a cpufreq driver to register with the
cpufreq core without presenting a sysfs interface.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
On those CPUs which are SpeedStep (EST) capable, we do not care at all if
p4-clockmod does not work, since a technically superior CPU frequency
management technology is to be used.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: cleanup
1) The #ifdef CONFIG_HOTPLUG_CPU seems unnecessary these days.
2) The loop can simply skip over offline cpus, rather than creating a tmp mask.
3) set_mask is set to either a single cpu or all online cpus in a policy.
Since it's just used for set_cpus_allowed(), any offline cpus in a policy
don't matter, so we can just use cpumask_of_cpu() or the policy->cpus.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Just came across this when booting on old hw.
Looks somewhat ugly, that single missing space ;)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot crash with numcpus=0 on certain systems
Fix early exception in __get_smp_config with nosmp.
Bail out early when there is no MP table.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Access to the iommu->need_sync member needs to be protected by
iommu->lock, otherwise there is a possible race condition. Fix it with
this patch.
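An illustrative locking pattern (the command-queuing helper is hypothetical):

  spin_lock_irqsave(&iommu->lock, flags);

  __iommu_queue_command(iommu, &cmd);
  iommu->need_sync = 1;

  spin_unlock_irqrestore(&iommu->lock, flags);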
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In some rare cases a request can arrive at an IOMMU with its original
requestor ID even if it is aliased. Handle this by setting the device table
entry to the same protection domain for the original and the aliased
requestor ID.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: remove stale IOTLB entries
In the non-default nofullflush case the GART is only flushed when
next_bit wraps around. But it can happen that an unmap operation unmaps
memory which is behind the current next_bit location. If these addresses
are reused it may result in stale GART IO/TLB entries. Fix this by
always keeping the GART next_bit behind an unmapped location.
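A sketch of the idea, modeled on the GART free path (the details are assumptions):

  static void free_iommu(unsigned long offset, int size)
  {
          unsigned long flags;

          spin_lock_irqsave(&iommu_bitmap_lock, flags);
          iommu_area_free(iommu_gart_bitmap, offset, size);
          if (offset >= next_bit)
                  next_bit = offset + size;  /* keep next_bit behind the unmapped range */
          spin_unlock_irqrestore(&iommu_bitmap_lock, flags);
  }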
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: robustness checks
Add more checks in the function graph code to detect errors and
perhaps print out better information if a bug happens.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature, let entry function decide to trace or not
This patch lets the graph tracer entry function decide if the tracing
should be done at the end as well. This requires that all function graph
entry functions return 1 if the call should be traced, or 0 if the return
should not be traced.
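An illustrative entry callback (the filtering condition is hypothetical):

  static int my_graph_entry(struct ftrace_graph_ent *trace)
  {
          if (trace->depth > 5)
                  return 0;       /* skip: the matching return will not be traced */
          return 1;               /* trace this function's entry and its return */
  }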
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: better dumpstack output
I noticed in my crash dumps and even in the stack tracer that a
lot of functions listed in the stack trace are simply
return_to_handler, which is the function graph tracer's way of
inserting its own call into the return of a function.
But we lose out on where the actual function was called from.
This patch adds hooks to the dumpstack mechanism that detect
this and find the real function to print. Both are printed to
let the user know that a hook is still in place.
This does give a funny side effect in the stack tracer output:
  Depth    Size   Location    (80 entries)
  -----    ----   --------
    0)     4144      48   save_stack_trace+0x2f/0x4d
    1)     4096     128   ftrace_call+0x5/0x2b
    2)     3968      16   mempool_alloc_slab+0x16/0x18
    3)     3952     384   return_to_handler+0x0/0x73
    4)     3568    -240   stack_trace_call+0x11d/0x209
    5)     3808     144   return_to_handler+0x0/0x73
    6)     3664    -128   mempool_alloc+0x4d/0xfe
    7)     3792     128   return_to_handler+0x0/0x73
    8)     3664     -32   scsi_sg_alloc+0x48/0x4a [scsi_mod]
As you can see, the sizes of the real functions are now negative. This is
due to them not being found inside the stack.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new ftrace_graph_stop function
While developing more features of function graph, I hit a bug that
caused the WARN_ON to trigger in the prepare_ftrace_return function.
Well, it was hard for me to find out what was happening because the
bug would not print, it would just cause a hard lockup or reboot.
The reason is that it is not safe to call printk from this function.
Looking further, I also found that it calls unregister_ftrace_graph,
which grabs a mutex and calls kstop_machine. This would definitely
lock the box up if it were to trigger.
This patch adds a fast and safe ftrace_graph_stop() which will
stop the function tracer. Then it is safe to call the WARN_ON.
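An illustrative use on an error path (the surrounding fault check is hypothetical):

  if (unlikely(faulted)) {
          ftrace_graph_stop();
          WARN_ON(1);
          return;
  }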
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>