Commit Graph

25348 Commits

Author SHA1 Message Date
Paul E. McKenney
6b558c4c7a rcutorture: Bind rcuperf reader/writer kthreads to CPUs
This commit forces more deterministic behavior by binding rcuperf's
rcu_perf_reader() and rcu_perf_writer() kthreads to their respective
CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:37:39 -07:00
Paul E. McKenney
8704baab9b rcutorture: Add RCU grace-period performance tests
This commit adds a new rcuperf module that carries out simple performance
tests of RCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:37:38 -07:00
Paul E. McKenney
291783b8ad rcutorture: Expedited-GP batch progress access to torturing
This commit provides rcu_exp_batches_completed() and
rcu_exp_batches_completed_sched() functions to allow torture-test modules
to check how many expedited grace period batches have completed.
These are analogous to the existing rcu_batches_completed(),
rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:37:37 -07:00
Paul E. McKenney
9efafb8849 rcutorture: Allow for rcupdate.rcu_normal
Currently, rcu_torture_writer() checks only for rcu_gp_is_expedited()
when deciding whether or not to do dynamic control of RCU expediting.
This means that if rcupdate.rcu_normal is specified, rcu_torture_writer()
will attempt to dynamically control RCU expediting, but will nonetheless
only test normal RCU grace periods.  This commit therefore adds a check
for !rcu_gp_is_normal(), and prints a message and desists from testing
dynamic control of RCU expediting when doing so is futile.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:37:37 -07:00
Paul E. McKenney
5dffed1e57 rcu: Dump ftrace buffer when kicking grace-period kthread
If it is necessary to kick the grace-period kthread, that is a good
time to dump the trace buffer in order to learn why kicking was needed.
This commit therefore does the dump.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:36:37 -07:00
Boqun Feng
293e2421fe rcu: Remove superfluous versions of rcu_read_lock_sched_held()
Currently, we have four versions of rcu_read_lock_sched_held(), depending
on the combined choices on PREEMPT_COUNT and DEBUG_LOCK_ALLOC.  However,
there is an existing function preemptible() that already distinguishes
between the PREEMPT_COUNT=y and PREEMPT_COUNT=n cases, and allows these
four implementations to be consolidated down to two.

This commit therefore uses preemptible() to achieve this consolidation.

Note that there could be a small performance regression in the case
of CONFIG_DEBUG_LOCK_ALLOC=y && PREEMPT_COUNT=n.  However, given the
overhead associated with CONFIG_DEBUG_LOCK_ALLOC=y, this should be
down in the noise.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:50 -07:00
Paul E. McKenney
8c7c4829a8 rcu: Awaken grace-period kthread if too long since FQS
Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing.  This commit is a crude hack that does
a wakeup if a scheduling-clock interrupt sees that it has been
too long since force-quiescent-state (FQS) processing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:50 -07:00
Paul E. McKenney
fcfd0a237b rcu: Make FQS schedule advance only if FQS happened
Currently, the force-quiescent-state (FQS) code in rcu_gp_kthread() can
advance the next FQS even if one was not executed last time.  This can
happen due timeout-duration uncertainty.  This commit therefore avoids
advancing the FQS schedule unless an FQS was just executed.  In the
corner case where an FQS was not executed, but is due now, the code does
a one-jiffy wait.

This change prepares for kthread kicking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:49 -07:00
Paul E. McKenney
86057b80ae rcu: Awaken grace-period kthread when stalled
Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing.  This commit is a crude hack that does
a wakeup any time a stall is detected.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:49 -07:00
Paul E. McKenney
3b5f668e71 rcu: Overlap wakeups with next expedited grace period
The current expedited grace-period implementation makes subsequent grace
periods wait on wakeups for the prior grace period.  This does not fit
the dictionary definition of "expedited", so this commit allows these two
phases to overlap.  Doing this requires four waitqueues rather than two
because tasks can now be waiting on the previous, current, and next grace
periods.  The fourth waitqueue makes the bit masking work out nicely.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:11 -07:00
Paul E. McKenney
aff12cdf86 rcu: Consolidate expedited GP code into exp_funnel_lock()
This commit pulls the grace-period-start counter adjustment and tracing
from synchronize_rcu_expedited() and synchronize_sched_expedited()
into exp_funnel_lock(), thus eliminating some code duplication.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:11 -07:00
Paul E. McKenney
179e5dcd1e rcu: Consolidate expedited GP tracing into rcu_exp_gp_seq_snap()
This commit moves some duplicate code from synchronize_rcu_expedited()
and synchronize_sched_expedited() into rcu_exp_gp_seq_snap().  This
doesn't save lines of code, but does eliminate a "tell me twice" issue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:10 -07:00
Paul E. McKenney
4ea3e85b11 rcu: Consolidate expedited GP code into rcu_exp_wait_wake()
Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have
significant duplicate code.  This commit therefore consolidates some of
this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake()
in recognition of its added responsibilities.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:10 -07:00
Paul E. McKenney
356051e1de rcu: Add exp_funnel_lock() fastpath
This commit speeds up the low-contention case, especially for systems
with large rcu_node trees, by attempting to directly acquire the
->exp_mutex.  This fastpath checks the leaves and root first in
order to avoid excessive memory contention on the mutex itself.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:09 -07:00
Paul E. McKenney
f6a12f34a4 rcu: Enforce expedited-GP fairness via funnel wait queue
The current mutex-based funnel-locking approach used by expedited grace
periods is subject to severe unfairness.  The problem arises when a
few tasks, making a path from leaves to root, all wake up before other
tasks do.  A new task can then follow this path all the way to the root,
which needlessly delays tasks whose grace period is done, but who do
not happen to acquire the lock quickly enough.

This commit avoids this problem by maintaining per-rcu_node wait queues,
along with a per-rcu_node counter that tracks the latest grace period
sought by an earlier task to visit this node.  If that grace period
would satisfy the current task, instead of proceeding up the tree,
it waits on the current rcu_node structure using a pair of wait queues
provided for that purpose.  This decouples awakening of old tasks from
the arrival of new tasks.

If the wakeups prove to be a bottleneck, additional kthreads can be
brought to bear for that purpose.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:08 -07:00
Paul E. McKenney
d40a4f09a4 rcu: Shorten expedited_workdone* to exp_workdone*
Just a name change to save a few lines and a bit of typing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:08 -07:00
Paul E. McKenney
ec3833ed02 rcu: Force boolean subscript for expedited stall warnings
The cpu_online() function can return values other than 0 and 1, which
can result in subscript overflow when applied to a two-element array.
This commit allows for this behavior by using "!!" on the return value
from cpu_online() when used as a subscript.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:07 -07:00
Paul E. McKenney
e2fd9d3584 rcu: Remove expedited GP funnel-lock bypass
Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
turns out to be a pessimization at high load because it forces a tree
full of tasks to wait for an expedited grace period that they probably
do not need.  This commit therefore removes this optimization.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:07 -07:00
Paul E. McKenney
4f41530245 rcu: Add expedited-grace-period event tracing
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:06 -07:00
Paul E. McKenney
bea2de44ae rcu: Add funnel-locking tracing for expedited grace periods
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:06 -07:00
Paul E. McKenney
26ece8ef6e rcu: Fix synchronize_rcu_expedited() header comment
This commit brings the synchronize_rcu_expedited() function's header
comment into line with the new implementation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:04 -07:00
Paul E. McKenney
a1e1224849 rcu: Make cond_resched_rcu_qs() supply RCU-sched expedited QS
Although cond_resched_rcu_qs() supplies quiescent states to all flavors
of normal RCU grace periods, it does nothing for expedited RCU-sched
grace periods.  This commit therefore adds a check for a need for a
quiescent state from the current CPU by an expedited RCU-sched grace
period, and invokes rcu_sched_qs() to supply that quiescent state if so.

Note that the check is racy in that we might be migrated to some other
CPU just after checking the per-CPU variable.  This is OK because the
act of migration will do a context switch, which will supply the needed
quiescent state.  The only downside is that we might do an unnecessary
call to rcu_sched_qs(), but the probability is low and the overhead
is small.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:04 -07:00
Paul E. McKenney
251c617c75 rcu: Make expedited RCU-preempt stall warnings count accurately
Currently, synchronize_sched_expedited_wait() simply sets the ndetected
variable to the rcu_print_task_exp_stall() return value.  This means
that if the last rcu_node structure has no stalled tasks, record of
any stalled tasks in previous rcu_node structures is lost, which can
in turn result in failure to dump out the blocking rcu_node structures.
Or could, had the test been correct.

This commit therefore adds the return value of rcu_print_task_exp_stall()
to ndetected and corrects the later test for ndetected.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:03 -07:00
Paul E. McKenney
28728dd310 rcu: Make expedited RCU-sched grace period immediately detect idle
Currently, sync_sched_exp_handler() will force a reschedule unless
this CPU has already checked in or unless a reschedule has already
been called for.  This is clearly wasteful if sync_sched_exp_handler()
interrupted an idle CPU, so this commit immediately reports the
quiescent state in that case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:03 -07:00
Paul E. McKenney
274529ba9b rcu: Consolidate dumping of ftrace buffer
This commit consolidates a couple definitions and several calls for
single-shot ftrace-buffer dumping.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:29:08 -07:00
Alfredo Alvarez Fernandez
39e2e173fb locking/lockdep: Print chain_key collision information
A sequence of pairs [class_idx -> corresponding chain_key iteration]
is printed for both the current held_lock chain and the cached chain.

That exposes the two different class_idx sequences that led to that
particular hash value.

This helps with debugging hash chain collision reports.

Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: sedat.dilek@gmail.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 15:03:58 +02:00
Yuyang Du
2b8c41daba sched/fair: Initiate a new task's util avg to a bounded value
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.

This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:

(1) An expected (or average) util_avg based on its cfs_rq's util_avg:

  util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:

  util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

where n denotes the nth task.

If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.

Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:46 +02:00
Yuyang Du
1c3de5e19f sched/fair: Update comments after a variable rename
The following commit:

  ed82b8a1ff ("sched/core: Move the sched_to_prio[] arrays out of line")

renamed prio_to_weight to sched_prio_to_weight, but the old name was not
updated in comments.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:45 +02:00
Steven Rostedt
47252cfbac sched/core: Add preempt checks in preempt_schedule() code
While testing the tracer preemptoff, I hit this strange trace:

   <...>-259     0...1    0us : schedule <-worker_thread
   <...>-259     0d..1    0us : rcu_note_context_switch <-__schedule
   <...>-259     0d..1    0us : rcu_sched_qs <-rcu_note_context_switch
   <...>-259     0d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
   <...>-259     0d..1    0us : _raw_spin_lock <-__schedule
   <...>-259     0d..1    0us : preempt_count_add <-_raw_spin_lock
   <...>-259     0d..2    0us : do_raw_spin_lock <-_raw_spin_lock
   <...>-259     0d..2    1us : deactivate_task <-__schedule
   <...>-259     0d..2    1us : update_rq_clock.part.84 <-deactivate_task
   <...>-259     0d..2    1us : dequeue_task_fair <-deactivate_task
   <...>-259     0d..2    1us : dequeue_entity <-dequeue_task_fair
   <...>-259     0d..2    1us : update_curr <-dequeue_entity
   <...>-259     0d..2    1us : update_min_vruntime <-update_curr
   <...>-259     0d..2    1us : cpuacct_charge <-update_curr
   <...>-259     0d..2    1us : __rcu_read_lock <-cpuacct_charge
   <...>-259     0d..2    1us : __rcu_read_unlock <-cpuacct_charge
   <...>-259     0d..2    1us : clear_buddies <-dequeue_entity
   <...>-259     0d..2    1us : account_entity_dequeue <-dequeue_entity
   <...>-259     0d..2    2us : update_min_vruntime <-dequeue_entity
   <...>-259     0d..2    2us : update_cfs_shares <-dequeue_entity
   <...>-259     0d..2    2us : hrtick_update <-dequeue_task_fair
   <...>-259     0d..2    2us : wq_worker_sleeping <-__schedule
   <...>-259     0d..2    2us : kthread_data <-wq_worker_sleeping
   <...>-259     0d..2    2us : pick_next_task_fair <-__schedule
   <...>-259     0d..2    2us : check_cfs_rq_runtime <-pick_next_task_fair
   <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
   <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
   <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
   <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
   <...>-259     0d..2    2us : set_next_entity <-pick_next_task_fair
   <...>-259     0d..2    3us : put_prev_entity <-pick_next_task_fair
   <...>-259     0d..2    3us : check_cfs_rq_runtime <-put_prev_entity
   <...>-259     0d..2    3us : set_next_entity <-pick_next_task_fair
gnome-sh-1031    0d..2    3us : finish_task_switch <-__schedule
gnome-sh-1031    0d..2    3us : _raw_spin_unlock_irq <-finish_task_switch
gnome-sh-1031    0d..2    3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
gnome-sh-1031    0...2    3us!: preempt_count_sub <-_raw_spin_unlock_irq
gnome-sh-1031    0...1  582us : do_raw_spin_lock <-_raw_spin_lock
gnome-sh-1031    0...1  583us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031    0...1  583us : do_raw_spin_unlock <-_raw_spin_unlock
gnome-sh-1031    0...1  583us : preempt_count_sub <-_raw_spin_unlock
gnome-sh-1031    0...1  584us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031    0...1  584us+: trace_preempt_on <-drm_gem_object_lookup
gnome-sh-1031    0...1  603us : <stack trace>
 => preempt_count_sub
 => _raw_spin_unlock
 => drm_gem_object_lookup
 => i915_gem_madvise_ioctl
 => drm_ioctl
 => do_vfs_ioctl
 => SyS_ioctl
 => entry_SYSCALL_64_fastpath

As I'm tracing preemption disabled, it seemed incorrect that the trace
would go across a schedule and report not being in the scheduler.
Looking into this I discovered the problem.

schedule() calls preempt_disable() but the preempt_schedule() calls
preempt_enable_notrace(). What happened above was that the gnome-shell
task was preempted on another CPU, migrated over to the idle cpu. The
tracer stared with idle calling schedule(), which called
preempt_disable(), but then gnome-shell finished, and it enabled
preemption with preempt_enable_notrace() that does stop the trace, even
though preemption was enabled.

The purpose of the preempt_disable_notrace() in the preempt_schedule()
is to prevent function tracing from going into an infinite loop.
Because function tracing can trace the preempt_enable/disable() calls
that are traced. The problem with function tracing is:

  NEED_RESCHED set
  preempt_schedule()
    preempt_disable()
      preempt_count_inc()
        function trace (before incrementing preempt count)
          preempt_disable_notrace()
          preempt_enable_notrace()
            sees NEED_RESCHED set
               preempt_schedule() (repeat)

Now by breaking out the preempt off/on tracing into their own code:
preempt_disable_check() and preempt_enable_check(), we can add these to
the preempt_schedule() code. As preemption would then be disabled, even
if they were to be traced by the function tracer, the disabled
preemption would prevent the recursion.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:45 +02:00
Tim Chen
bfdb198ccd sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels
In account_entity_enqueue(), we do not do account_numa_enqueue()
as NUMA balancing is not needed for UP kernels.

Hence, we should remove the account_numa_dequeue() call from
account_entity_dequeue() for UP kernels.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:45 +02:00
Srikar Dronamraju
d02c071183 sched/fair: Reset nr_balance_failed after active balancing
To force a task migration during active balancing, nr_balance_failed is set
to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side
effect, the next regular load balance under the same sd, a cache hot task
might be migrated, just because nr_balance_failed count is high.

Resetting nr_balance_failed after a successful active balance ensures
that a hot task is not unreasonably migrated. This can be verified by
looking at othe number of hot task migrations reported by /proc/schedstat.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:44 +02:00
Dongsheng Yang
d740037fac sched/cpuacct: Split usage accounting into user_usage and sys_usage
Sometimes, cpuacct.usage is not detailed enough to see how much CPU
usage a group had. We want to know how much time it used in user mode
and how much in kernel mode.

This patch introduces more files to give this information:

 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
 /sys/fs/cgroup/cpuacct/cpuacct.usage
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
 /sys/fs/cgroup/cpuacct/cpuacct.usage_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_sys
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

... while keeping the ABI with the existing counter.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:48:54 +02:00
Zhao Lei
5ca3726af7 sched/cpuacct: Show all possible CPUs in cpuacct output
Current code show stats of online CPUs in cpuacct.statcpus,
show stats of present cpus in cpuacct.usage(_percpu), and using
present CPUs for setting cpuacct.usage.

It will cause inconsistent result when a CPU is online or offline
or hotpluged.

We should always use possible CPUs to avoid above problem.

Here are the contents of a cpuacct.usage_percpu sysfs file,
on a 4 CPU system with maxcpus=32:

Before the patch:
 # cat cpuacct.usage_percpu
 2456565 411435 1052897 832584

After the patch:
 # cat cpuacct.usage_percpu
 2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a11d56cef12d0b4807f8be3a46bf9798c3014d59.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:45:56 +02:00
Wang Nan
d1b26c7024 perf/ring_buffer: Prepare writing into the ring-buffer from the end
Convert perf_output_begin() to __perf_output_begin() and make the later
function able to write records from the end of the ring-buffer.

Following commits will utilize the 'backward' flag.

This is the core patch to support writing to the ring-buffer backwards,
which will be introduced by upcoming patches to support reading from
overwritable ring-buffers.

In theory, this patch should not introduce any extra performance
overhead since we use always_inline, but it does not hurt to double
check that assumption:

When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
identical to original one. See:

   http://lkml.kernel.org/g/56F52E83.70409@huawei.com

When CONFIG_OPTIMIZE_INLINING is enabled, the resuling object file becomes
smaller:

 $ size kernel/events/ring_buffer.o*
   text       data        bss        dec        hex    filename
   4641          4          8       4653       122d kernel/events/ring_buffer.o.old
   4545          4          8       4557       11cd kernel/events/ring_buffer.o.new

Performance testing results:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

 CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
 Kernel : v4.5.0

                     MEAN         STDVAR
  BASE            800214.950    2853.083
  PRE            2253846.700    9997.014
  POST           2257495.540    8516.293

Where 'BASE' is pure performance without capturing. 'PRE' is test
result of pure 'v4.5.0' kernel. 'POST' is test result after this
patch.

Considering the stdvar, this patch doesn't hurt performance, within
noise margin.

For testing details, see:

  http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:49 +02:00
Wang Nan
1879445dfa perf/core: Set event's default ::overflow_handler()
Set a default event->overflow_handler in perf_event_alloc() so don't
need to check event->overflow_handler in __perf_event_overflow().
Following commits can give a different default overflow_handler.

Initial idea comes from Peter:

  http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

Since the default value of event->overflow_handler is not NULL, existing
'if (!overflow_handler)' checks need to be changed.

is_default_overflow_handler() is introduced for this.

No extra performance overhead is introduced into the hot path because in the
original code we still need to read this handler from memory. A conditional
branch is avoided so actually we remove some instructions.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:47 +02:00
Wang Nan
86e7972f69 perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer
Add new ioctl() to pause/resume ring-buffer output.

In some situations we want to read from the ring-buffer only when we
ensure nothing can write to the ring-buffer during reading. Without
this patch we have to turn off all events attached to this ring-buffer
to achieve this.

This patch is a prerequisite to enable overwrite support for the
perf ring-buffer support. Following commits will introduce new methods
support reading from overwrite ring buffer. Before reading, caller
must ensure the ring buffer is frozen, or the reading is unreliable.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:45 +02:00
Jiri Olsa
0a74c5b3d2 ftrace/perf: Check sample types only for sampling events
Currently we check sample type for ftrace:function events
even if it's not created as a sampling event. That prevents
creating ftrace_function event in counting mode.

Make sure we check sample types only for sampling events.

Before:
  $ sudo perf stat -e ftrace:function ls
  ...

   Performance counter stats for 'ls':

     <not supported>      ftrace:function

         0.001983662 seconds time elapsed

After:
  $ sudo perf stat -e ftrace:function ls
  ...

   Performance counter stats for 'ls':

              44,498      ftrace:function

         0.037534722 seconds time elapsed

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:45 +02:00
Alexander Shishkin
af5bb4ed12 perf/ring_buffer: Document AUX API usage
In order to ensure safe AUX buffer management, we rely on the assumption
that pmu::stop() stops its ongoing AUX transaction and not just the hw.

This patch documents this requirement for the perf_aux_output_{begin,end}()
APIs.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:43 +02:00
Alexander Shishkin
95ff4ca26c perf/core: Free AUX pages in unmap path
Now that we can ensure that when ring buffer's AUX area is on the way
to getting unmapped new transactions won't start, we only need to stop
all events that can potentially be writing aux data to our ring buffer.

Having done that, we can safely free the AUX pages and corresponding
PMU data, as this time it is guaranteed to be the last aux reference
holder.

This partially reverts:

  57ffc5ca67 ("perf: Fix AUX buffer refcounting")

... which was made to defer deallocation that was otherwise possible
from an NMI context. Now it is no longer the case; the last call to
rb_free_aux() that drops the last AUX reference has to happen in
perf_mmap_close() on that AUX area.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:42 +02:00
Alexander Shishkin
dcb10a967c perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count drops
When ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to
zero, new AUX transactions into this buffer can still be started,
even though the buffer in en route to deallocation.

This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count
being zero, in which case there is no point starting new transactions,
in other words, the ring buffers that pass a certain point in
perf_mmap_close will not have their events sending new data, which
clears path for freeing those buffers' pages right there and then,
provided that no active transactions are holding the AUX reference.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:41 +02:00
Peter Zijlstra
2665784850 perf/core: Verify we have a single perf_hw_context PMU
There should (and can) only be a single PMU for perf_hw_context
events.

This is because of how we schedule events: once a hardware event fails to
schedule (the PMU is 'full') we stop trying to add more. The trivial
'fix' would break the Round-Robin scheduling we do.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:30:41 +02:00
Axel Lin
63cc787e71 irqdomain: Export irq_domain_free_irqs_common
Export irq_domain_free_irqs_common so it can be used by modules.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-03-31 10:22:32 +02:00
Alexander Shishkin
201c2f85bd perf/core: Don't leak event in the syscall error path
In the error path, event_file not being NULL is used to determine
whether the event itself still needs to be free'd, so fix it up to
avoid leaking.

Reported-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 130056275a ("perf: Do not double free")
Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 09:54:07 +02:00
Peter Zijlstra
8fdc65391c perf/core: Fix time tracking bug with multiplexing
Stephane reported that commit:

  3cbaa59069 ("perf: Fix ctx time tracking by introducing EVENT_TIME")

introduced a regression wrt. time tracking, as easily observed by:

> This patch introduce a bug in the time tracking of events when
> multiplexing is used.
>
> The issue is easily reproducible with the following perf run:
>
>  $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
>      1.000730239            652,394      branches   (66.41%)
>      1.000730239            597,809      branches   (66.41%)
>      1.000730239            593,870      branches   (66.63%)
>      1.000730239            651,440      branches   (67.03%)
>      1.000730239            656,725      branches   (66.96%)
>      1.000730239      <not counted>      branches
>
> One branches event is shown as not having run. Yet, with
> multiplexing, all events should run especially with a 1s (-I 1000)
> interval. The delta for time_running comes out to 0. Yet, the event
> has run because the kernel is actually multiplexing the events. The
> problem is that the time tracking is the kernel and especially in
> ctx_sched_out() is wrong now.
>
> The problem is that in case that the kernel enters ctx_sched_out() with the
> following state:
>    ctx->is_active=0x7 event_type=0x1
>    Call Trace:
>     [<ffffffff813ddd41>] dump_stack+0x63/0x82
>     [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
>     [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
>     [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
>     [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
>     [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
>     [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
>     [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
>     [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
>
> In that case, the test:
>       if (is_active & EVENT_TIME)
>
> will be false and the time will not be updated. Time must always be updated on
> sched out.

Fix this by always updating time if EVENT_TIME was set, as opposed to
only updating time when EVENT_TIME changed.

Reported-by: Stephane Eranian <eranian@google.com>
Tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Fixes: 3cbaa59069 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 09:54:06 +02:00
Frederic Weisbecker
5529578a27 locking/atomic, sched: Unexport fetch_or()
This patch functionally reverts:

  5fd7a09cfb ("atomic: Export fetch_or()")

During the merge Linus observed that the generic version of fetch_or()
was messy:

  " This makes the ugly "fetch_or()" macro that the scheduler used
    internally a new generic helper, and does a bad job at it. "

  e23604edac Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Now that we have introduced atomic_fetch_or(), fetch_or() is only used
by the scheduler in order to deal with thread_info flags which type
can vary across architectures.

Lets confine fetch_or() back to the scheduler so that we encourage
future users to use the more robust and well typed atomic_t version
instead.

While at it, fetch_or() gets robustified, pasting improvements from a
previous patch by Ingo Molnar that avoids needless expression
re-evaluations in the loop.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29 11:52:11 +02:00
Frederic Weisbecker
f009a7a767 timers/nohz: Convert tick dependency mask to atomic_t
The tick dependency mask was intially unsigned long because this is the
type on which clear_bit() operates on and fetch_or() accepts it.

But now that we have atomic_fetch_or(), we can instead use
atomic_andnot() to clear the bit. This consolidates the type of our
tick dependency mask, reduce its size on structures and benefit from
possible architecture optimizations on atomic_t operations.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-29 11:52:11 +02:00
Al Viro
0c93b7d85d bpf: reject invalid names right in ->lookup()
... and other methods won't see them at all

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-27 23:49:27 -04:00
Alexander Potapenko
be7635e728 arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.

Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>.  Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.

Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Michal Hocko
36324a990c oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp.  PF_EXITING check.  We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.

This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct.  This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention.  This is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Andrew Morton
69b27baf00 sched: add schedule_timeout_idle()
This will be needed in the patch "mm, oom: introduce oom reaper".

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00