Again, since we only iterate the fair class, remove the abstraction.
Since this is the last user of the rq_iterator, remove all that too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we only ever iterate the fair class, do away with this abstraction.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Take out the sched_class methods for load-balancing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Straight fwd code movement.
Since non of the load-balance abstractions are used anymore, do away with
them and simplify the code some. In preparation move the code around.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Assume A->B schedule is processing, if B have acquired BKL before and it
need reschedule this time. Then on B's context, it will go to
need_resched_nonpreemptible for reschedule. But at this time, prev and
switch_count are related to A. It's wrong and will lead to incorrect
scheduler statistics.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.
Reported-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1262612696.15495.15.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Marc reported that the BUG_ON in clockevents_notify() triggers on his
system. This happens because the kernel tries to remove an active
clock event device (used for broadcasting) from the device list.
The handling of devices which can be used as per cpu device and as a
global broadcast device is suboptimal.
The simplest solution for now (and for stable) is to check whether the
device is used as global broadcast device, but this needs to be
revisited.
[ tglx: restored the cpuweight check and massaged the changelog ]
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
The smp ipi data is passed around and given write access by
other cpus and should be separated from per-cpu data consumed by
this cpu.
Looking for hot lines, I saw call_function_data shared with
tick_cpu_sched.
Signed-off-by: Milton Miller <miltonm@bga.com>
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: : Nick Piggin <npiggin@suse.de>
LKML-Reference: <20100118020051.GR12666@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a task gets scheduled in. We don't touch the cpu bound events
so the priority order becomes:
cpu pinned, cpu flexible, task pinned, task flexible.
So schedule out cpu flexibles when a new task context gets in
and correctly order the groups to schedule in:
task pinned, cpu flexible, task flexible.
Cpu pinned groups don't need to be touched at this time.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
We don't need to schedule in/out pinned events on task tick,
now that pinned and flexible groups can be scheduled separately.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Tune the scheduling helpers so that we can choose to schedule either
pinned and/or flexible groups from a context.
And while at it, refactor a bit the naming of these helpers to make
these more consistent and flexible.
There is no (intended) change in scheduling behaviour in this
patch.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Each time we save a function entry from the function graph
tracer, we check if the trace array is set, which is wasteful
because it is set anyway before we start the tracer. All we need
is to ensure we have good read and write orderings. When we set
the trace array, we just need to guarantee it to be visible
before starting tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
LKML-Reference: <1263453795-7496-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing/filters: Add comment for match callbacks
tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING
tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching
lib: Introduce strnstr()
tracing/filters: Fix MATCH_END_ONLY filter matching
tracing/filters: Fix MATCH_FRONT_ONLY filter matching
ftrace: Fix MATCH_END_ONLY function filter
tracing/x86: Derive arch from bits argument in recordmcount.pl
ring-buffer: Add rb_list_head() wrapper around new reader page next field
ring-buffer: Wrap a list.next reference with rb_list_head()
The change in acpi_cpufreq to use smp_call_function_any causes a warning
when it is called since the function erroneously passes the cpu id to
cpumask_of_node rather than the node that the cpu is on. Fix this.
cpumask_of_node(3): node > nr_node_ids(1)
Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223
Call Trace:
[<ffffffff81028bb3>] cpumask_of_node+0x23/0x58
[<ffffffff81061f51>] smp_call_function_any+0x65/0xfa
[<ffffffff810160d1>] ? do_drv_read+0x0/0x2f
[<ffffffff81015fba>] get_cur_val+0xb0/0x102
[<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5
[<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515
[<ffffffff81562ce9>] ? __down_write+0xb/0xd
[<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922
Signed-off-by: David John <davidjon@xenontk.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now for kfifo_*_user it's not easily possible to distingush between
a user copy failing and the FIFO not containing enough data. The problem
is that both conditions are multiplexed into the same return code.
Avoid this by moving the "copy length" into a separate output parameter
and only return 0/-EFAULT in the main return value.
I didn't fully adapt the weird "record" variants, those seem
to be unused anyways and were rather messy (should they be just removed?)
I would appreciate some double checking if I did all the conversions
correctly.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before scheduling an event group, we first check if a group can go
on. We first check if the group is made of software only events
first, in which case it is enough to know if the group can be
scheduled in.
For that purpose, we iterate through the whole group, which is
wasteful as we could do this check when we add/delete an event to
a group.
So we create a group_flags field in perf event that can host
characteristics from a group of events, starting with a first
PERF_GROUP_SOFTWARE flag that reduces the check on the fast path.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
This is more proper that doing it through a list_for_each_entry()
that breaks after the first entry.
v2: Don't rotate pinned groups as its not needed to time share
them.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Split-up struct perf_event_context::group_list into pinned_groups
and flexible_groups (non-pinned).
This first appears to be useless as it duplicates various loops around
the group list handlings.
But it scales better in the fast-path in perf_sched_in(). We don't
anymore iterate twice through the entire list to separate pinned and
non-pinned scheduling. Instead we interate through two distinct lists.
The another desired effect is that it makes easier to define distinct
scheduling rules on both.
Changes in v2:
- Respectively rename pinned_grp_list and
volatile_grp_list into pinned_groups and flexible_groups as per
Ingo suggestion.
- Various cleanups
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
We should be clear on 2 things:
- the length parameter of a match callback includes
tailing '\0'.
- the string to be searched might not be NULL-terminated.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
MATCH_FULL matching for PTR_STRING is not working correctly:
# echo 'func == vt' > events/bkl/lock_kernel/filter
# echo 1 > events/bkl/lock_kernel/enable
...
# cat trace
Xorg-1484 [000] 1973.392586: lock_kernel: ... func=vt_ioctl()
gpm-1402 [001] 1974.027740: lock_kernel: ... func=vt_ioctl()
We should pass to regex.match(..., len) the length (including '\0')
of the source string instead of the length of the pattern string.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
MATCH_FRONT_ONLY actually is a full matching:
# ./perf record -R -f -a -e lock:lock_acquire \
--filter 'name ~rcu_*' sleep 1
# ./perf trace
(no output)
We should pass the length of the pattern string to strncmp().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, futexes have two problem:
A) The current futex code doesn't handle private file mappings properly.
get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:
1) thread-A call futex(private-mapping, FUTEX_WAIT), it
sleeps on file mapping object.
2) thread-B writes a variable and it makes it cow.
3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
wakes up blocked thread on the anonymous page. (but it's nothing)
B) Current futex code doesn't handle zero page properly.
Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.
The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.
Performance concerns:
Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.
Compatibility concerns:
This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.
Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Ulrich Drepper <drepper@gmail.com>
LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The rcu_process_dyntick() function checks twice for the end of
the current grace period. However, it holds the current
rcu_node structure's ->lock field throughout, and doesn't get to
the second call to rcu_gp_in_progress() unless there is at least
one CPU corresponding to this rcu_node structure that has not
yet checked in for the current grace period, which would prevent
the current grace period from ending. So the current grace
period cannot have ended, and the second check is redundant, so
remove it.
Also, given that this function is used even with !CONFIG_NO_HZ,
its name is quite misleading. Change from rcu_process_dyntick()
to force_qs_rnp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1262646550562-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because a new grace period cannot start while we are executing
within the force_quiescent_state() function's switch statement,
if any test within that switch statement or within any function
called from that switch statement shows that the current grace
period has ended, we can safely re-do that test any time before
we leave the switch statement. This means that we no longer
need a return value from rcu_process_dyntick(), as we can simply
invoke rcu_gp_in_progress() to check whether the old grace
period has finished -- there is no longer any need to worry
about whether or not a new grace period has been started.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501857-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>