android_kernel_xiaomi_sm8450

xiaomi-sm8450/android_kernel_xiaomi_sm8450

Author	SHA1	Message	Date
Jessica Yu	425595a7fc	livepatch: reuse module loader code to write relocations Reuse module loader code to write relocations, thereby eliminating the need for architecture specific relocation code in livepatch. Specifically, reuse the apply_relocate_add() function in the module loader to write relocations instead of duplicating functionality in livepatch's arch-dependent klp_write_module_reloc() function. In order to accomplish this, livepatch modules manage their own relocation sections (marked with the SHF_RELA_LIVEPATCH section flag) and livepatch-specific symbols (marked with SHN_LIVEPATCH symbol section index). To apply livepatch relocation sections, livepatch symbols referenced by relocs are resolved and then apply_relocate_add() is called to apply those relocations. In addition, remove x86 livepatch relocation code and the s390 klp_write_module_reloc() function stub. They are no longer needed since relocation work has been offloaded to module loader. Lastly, mark the module as a livepatch module so that the module loader canappropriately identify and initialize it. Signed-off-by: Jessica Yu <jeyu@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # for s390 changes Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2016-04-01 15:00:11 +02:00
Jessica Yu	1ce15ef4f6	module: preserve Elf information for livepatch modules For livepatch modules, copy Elf section, symbol, and string information from the load_info struct in the module loader. Persist copies of the original symbol table and string table. Livepatch manages its own relocation sections in order to reuse module loader code to write relocations. Livepatch modules must preserve Elf information such as section indices in order to apply livepatch relocation sections using the module loader's apply_relocate_add() function. In order to apply livepatch relocation sections, livepatch modules must keep a complete copy of their original symbol table in memory. Normally, a stripped down copy of a module's symbol table (containing only "core" symbols) is made available through module->core_symtab. But for livepatch modules, the symbol table copied into memory on module load must be exactly the same as the symbol table produced when the patch module was compiled. This is because the relocations in each livepatch relocation section refer to their respective symbols with their symbol indices, and the original symbol indices (and thus the symtab ordering) must be preserved in order for apply_relocate_add() to find the right symbol. Signed-off-by: Jessica Yu <jeyu@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2016-04-01 15:00:10 +02:00
Paul E. McKenney	9eb5188a07	torture: Clarify refusal to run more than one torture test This commit clarifies error messages -- you only get to run one torture test at a time! Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:39:52 -07:00
Anna-Maria Gleixner	de26ca19a5	rcutorture: Consider FROZEN hotplug notifier transitions The hotplug notifier rcutorture_cpu_notify() doesn't consider the corresponding CPU_XXX_FROZEN transitions. They occur on suspend/resume and are usually handled the same way as the corresponding non frozen transitions. Mask the switch case action argument with '~CPU_TASKS_FROZEN' to map CPU_XXX_FROZEN hotplug transitions on corresponding non-frozen transitions. Cc: Josh Triplett <josh@joshtriplett.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:39:52 -07:00
Paul E. McKenney	67522beecf	rcutorture: Remove redundant initialization to zero The current code initializes the global per-CPU variables rcu_torture_count and rcu_torture_batch to zero. However, C does this initialization by default, and explicit initialization of per-CPU variables now needs a different syntax if "make tags" is to work. This commit therefore removes the initialization. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:39:51 -07:00
Artem Savkov	e6fb1fc108	rcuperf: Do not wake up shutdown wait queue if "shutdown" is false. After finishing its tests rcuperf tries to wake up shutdown_wq even if "shutdown" param is set to false, resulting in a wake_up() call on an unitialized wait_queue_head_t which leads to "BUG: spinlock bad magic" and "BUG: unable to handle kernel NULL pointer dereference". Fix by checking "shutdown" param before waking up the queue. Signed-off-by: Artem Savkov <artem.savkov@gmail.com>	2016-03-31 13:39:51 -07:00
Paul E. McKenney	620316e52a	rcutorture: Avoid RCU CPU stall warning and RT throttling Running rcuperf can result in RCU CPU stall warnings and RT throttling. These occur because on of the real-time writer processes does ftrace_dump() while still running at real-time priority. This commit therefore prevents these problems by setting the writer thread back to SCHED_NORMAL (AKA SCHED_OTHER) before doing ftrace_dump(). In addition, this commit adds a small fixed delay before dumping ftrace buffer in order to decrease the probability that this dumping will interfere with other writers' grace periods. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:39:47 -07:00
Paul E. McKenney	df37e66bfd	rcutorture: Add rcuperf holdoff boot parameter to reduce interference Boot-time activity can legitimately grab CPUs for extended time periods, so the commit adds a boot parameter to delay the start of the performance test until boot has completed. Defaults to 10 seconds. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:38:58 -07:00
Paul E. McKenney	ac2bb275e8	rcutorture: Make rcuperf collect expedited event-trace data This commit enables ftrace in the rcuperf TREE kernel build and adds an ftrace_dump() at the end of rcuperf processing. This data will be used to measure the actual durations of the expedited grace periods without the added delays inherent in the kernel-module measurements. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:38:53 -07:00
Paul E. McKenney	2094c99558	rcutorture: Set rcuperf writer kthreads to real-time priority This commit forces more deterministic update-side behavior by setting rcuperf's rcu_perf_writer() kthreads to real-time priority. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:37:39 -07:00
Paul E. McKenney	6b558c4c7a	rcutorture: Bind rcuperf reader/writer kthreads to CPUs This commit forces more deterministic behavior by binding rcuperf's rcu_perf_reader() and rcu_perf_writer() kthreads to their respective CPUs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:37:39 -07:00
Paul E. McKenney	8704baab9b	rcutorture: Add RCU grace-period performance tests This commit adds a new rcuperf module that carries out simple performance tests of RCU grace periods. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:37:38 -07:00
Paul E. McKenney	291783b8ad	rcutorture: Expedited-GP batch progress access to torturing This commit provides rcu_exp_batches_completed() and rcu_exp_batches_completed_sched() functions to allow torture-test modules to check how many expedited grace period batches have completed. These are analogous to the existing rcu_batches_completed(), rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:37:37 -07:00
Paul E. McKenney	9efafb8849	rcutorture: Allow for rcupdate.rcu_normal Currently, rcu_torture_writer() checks only for rcu_gp_is_expedited() when deciding whether or not to do dynamic control of RCU expediting. This means that if rcupdate.rcu_normal is specified, rcu_torture_writer() will attempt to dynamically control RCU expediting, but will nonetheless only test normal RCU grace periods. This commit therefore adds a check for !rcu_gp_is_normal(), and prints a message and desists from testing dynamic control of RCU expediting when doing so is futile. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:37:37 -07:00
Paul E. McKenney	5dffed1e57	rcu: Dump ftrace buffer when kicking grace-period kthread If it is necessary to kick the grace-period kthread, that is a good time to dump the trace buffer in order to learn why kicking was needed. This commit therefore does the dump. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:36:37 -07:00
Boqun Feng	293e2421fe	rcu: Remove superfluous versions of rcu_read_lock_sched_held() Currently, we have four versions of rcu_read_lock_sched_held(), depending on the combined choices on PREEMPT_COUNT and DEBUG_LOCK_ALLOC. However, there is an existing function preemptible() that already distinguishes between the PREEMPT_COUNT=y and PREEMPT_COUNT=n cases, and allows these four implementations to be consolidated down to two. This commit therefore uses preemptible() to achieve this consolidation. Note that there could be a small performance regression in the case of CONFIG_DEBUG_LOCK_ALLOC=y && PREEMPT_COUNT=n. However, given the overhead associated with CONFIG_DEBUG_LOCK_ALLOC=y, this should be down in the noise. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:50 -07:00
Paul E. McKenney	8c7c4829a8	rcu: Awaken grace-period kthread if too long since FQS Recent kernels can fail to awaken the grace-period kthread for quiescent-state forcing. This commit is a crude hack that does a wakeup if a scheduling-clock interrupt sees that it has been too long since force-quiescent-state (FQS) processing. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:50 -07:00
Paul E. McKenney	fcfd0a237b	rcu: Make FQS schedule advance only if FQS happened Currently, the force-quiescent-state (FQS) code in rcu_gp_kthread() can advance the next FQS even if one was not executed last time. This can happen due timeout-duration uncertainty. This commit therefore avoids advancing the FQS schedule unless an FQS was just executed. In the corner case where an FQS was not executed, but is due now, the code does a one-jiffy wait. This change prepares for kthread kicking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:49 -07:00
Paul E. McKenney	86057b80ae	rcu: Awaken grace-period kthread when stalled Recent kernels can fail to awaken the grace-period kthread for quiescent-state forcing. This commit is a crude hack that does a wakeup any time a stall is detected. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:49 -07:00
Paul E. McKenney	3b5f668e71	rcu: Overlap wakeups with next expedited grace period The current expedited grace-period implementation makes subsequent grace periods wait on wakeups for the prior grace period. This does not fit the dictionary definition of "expedited", so this commit allows these two phases to overlap. Doing this requires four waitqueues rather than two because tasks can now be waiting on the previous, current, and next grace periods. The fourth waitqueue makes the bit masking work out nicely. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:11 -07:00
Paul E. McKenney	aff12cdf86	rcu: Consolidate expedited GP code into exp_funnel_lock() This commit pulls the grace-period-start counter adjustment and tracing from synchronize_rcu_expedited() and synchronize_sched_expedited() into exp_funnel_lock(), thus eliminating some code duplication. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:11 -07:00
Paul E. McKenney	179e5dcd1e	rcu: Consolidate expedited GP tracing into rcu_exp_gp_seq_snap() This commit moves some duplicate code from synchronize_rcu_expedited() and synchronize_sched_expedited() into rcu_exp_gp_seq_snap(). This doesn't save lines of code, but does eliminate a "tell me twice" issue. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:10 -07:00
Paul E. McKenney	4ea3e85b11	rcu: Consolidate expedited GP code into rcu_exp_wait_wake() Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have significant duplicate code. This commit therefore consolidates some of this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake() in recognition of its added responsibilities. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:10 -07:00
Paul E. McKenney	356051e1de	rcu: Add exp_funnel_lock() fastpath This commit speeds up the low-contention case, especially for systems with large rcu_node trees, by attempting to directly acquire the ->exp_mutex. This fastpath checks the leaves and root first in order to avoid excessive memory contention on the mutex itself. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:09 -07:00
Paul E. McKenney	f6a12f34a4	rcu: Enforce expedited-GP fairness via funnel wait queue The current mutex-based funnel-locking approach used by expedited grace periods is subject to severe unfairness. The problem arises when a few tasks, making a path from leaves to root, all wake up before other tasks do. A new task can then follow this path all the way to the root, which needlessly delays tasks whose grace period is done, but who do not happen to acquire the lock quickly enough. This commit avoids this problem by maintaining per-rcu_node wait queues, along with a per-rcu_node counter that tracks the latest grace period sought by an earlier task to visit this node. If that grace period would satisfy the current task, instead of proceeding up the tree, it waits on the current rcu_node structure using a pair of wait queues provided for that purpose. This decouples awakening of old tasks from the arrival of new tasks. If the wakeups prove to be a bottleneck, additional kthreads can be brought to bear for that purpose. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:08 -07:00
Paul E. McKenney	d40a4f09a4	rcu: Shorten expedited_workdone* to exp_workdone* Just a name change to save a few lines and a bit of typing. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:08 -07:00
Paul E. McKenney	ec3833ed02	rcu: Force boolean subscript for expedited stall warnings The cpu_online() function can return values other than 0 and 1, which can result in subscript overflow when applied to a two-element array. This commit allows for this behavior by using "!!" on the return value from cpu_online() when used as a subscript. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:07 -07:00
Paul E. McKenney	e2fd9d3584	rcu: Remove expedited GP funnel-lock bypass Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking") turns out to be a pessimization at high load because it forces a tree full of tasks to wait for an expedited grace period that they probably do not need. This commit therefore removes this optimization. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:07 -07:00
Paul E. McKenney	4f41530245	rcu: Add expedited-grace-period event tracing Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:06 -07:00
Paul E. McKenney	bea2de44ae	rcu: Add funnel-locking tracing for expedited grace periods Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:06 -07:00
Paul E. McKenney	26ece8ef6e	rcu: Fix synchronize_rcu_expedited() header comment This commit brings the synchronize_rcu_expedited() function's header comment into line with the new implementation. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:04 -07:00
Paul E. McKenney	a1e1224849	rcu: Make cond_resched_rcu_qs() supply RCU-sched expedited QS Although cond_resched_rcu_qs() supplies quiescent states to all flavors of normal RCU grace periods, it does nothing for expedited RCU-sched grace periods. This commit therefore adds a check for a need for a quiescent state from the current CPU by an expedited RCU-sched grace period, and invokes rcu_sched_qs() to supply that quiescent state if so. Note that the check is racy in that we might be migrated to some other CPU just after checking the per-CPU variable. This is OK because the act of migration will do a context switch, which will supply the needed quiescent state. The only downside is that we might do an unnecessary call to rcu_sched_qs(), but the probability is low and the overhead is small. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:04 -07:00
Paul E. McKenney	251c617c75	rcu: Make expedited RCU-preempt stall warnings count accurately Currently, synchronize_sched_expedited_wait() simply sets the ndetected variable to the rcu_print_task_exp_stall() return value. This means that if the last rcu_node structure has no stalled tasks, record of any stalled tasks in previous rcu_node structures is lost, which can in turn result in failure to dump out the blocking rcu_node structures. Or could, had the test been correct. This commit therefore adds the return value of rcu_print_task_exp_stall() to ndetected and corrects the later test for ndetected. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:03 -07:00
Paul E. McKenney	28728dd310	rcu: Make expedited RCU-sched grace period immediately detect idle Currently, sync_sched_exp_handler() will force a reschedule unless this CPU has already checked in or unless a reschedule has already been called for. This is clearly wasteful if sync_sched_exp_handler() interrupted an idle CPU, so this commit immediately reports the quiescent state in that case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:34:03 -07:00
Paul E. McKenney	274529ba9b	rcu: Consolidate dumping of ftrace buffer This commit consolidates a couple definitions and several calls for single-shot ftrace-buffer dumping. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2016-03-31 13:29:08 -07:00
Alfredo Alvarez Fernandez	39e2e173fb	locking/lockdep: Print chain_key collision information A sequence of pairs [class_idx -> corresponding chain_key iteration] is printed for both the current held_lock chain and the cached chain. That exposes the two different class_idx sequences that led to that particular hash value. This helps with debugging hash chain collision reports. Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: sedat.dilek@gmail.com Cc: tytso@mit.edu Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 15:03:58 +02:00
Yuyang Du	2b8c41daba	sched/fair: Initiate a new task's util avg to a bounded value A new task's util_avg is set to full utilization of a CPU (100% time running). This accelerates a new task's utilization ramp-up, useful to boost its execution in early time. However, it may result in (insanely) high utilization for a transient time period when a flood of tasks are spawned. Importantly, it violates the "fundamentally bounded" CPU utilization, and its side effect is negative if we don't take any measure to bound it. This patch proposes an algorithm to address this issue. It has two methods to approach a sensible initial util_avg: (1) An expected (or average) util_avg based on its cfs_rq's util_avg: util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight (2) A trajectory of how successive new tasks' util develops, which gives 1/2 of the left utilization budget to a new task such that the additional util is noticeably large (when overall util is low) or unnoticeably small (when overall util is high enough). In the meantime, the aggregate utilization is well bounded: util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n where n denotes the nth task. If util_avg is larger than util_avg_cap, then the effective util is clamped to the util_avg_cap. Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:49:46 +02:00
Yuyang Du	1c3de5e19f	sched/fair: Update comments after a variable rename The following commit: `ed82b8a1ff` ("sched/core: Move the sched_to_prio[] arrays out of line") renamed prio_to_weight to sched_prio_to_weight, but the old name was not updated in comments. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:49:45 +02:00
Steven Rostedt	47252cfbac	sched/core: Add preempt checks in preempt_schedule() code While testing the tracer preemptoff, I hit this strange trace: <...>-259 0...1 0us : schedule <-worker_thread <...>-259 0d..1 0us : rcu_note_context_switch <-__schedule <...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch <...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch <...>-259 0d..1 0us : _raw_spin_lock <-__schedule <...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock <...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock <...>-259 0d..2 1us : deactivate_task <-__schedule <...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task <...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task <...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair <...>-259 0d..2 1us : update_curr <-dequeue_entity <...>-259 0d..2 1us : update_min_vruntime <-update_curr <...>-259 0d..2 1us : cpuacct_charge <-update_curr <...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge <...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge <...>-259 0d..2 1us : clear_buddies <-dequeue_entity <...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity <...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity <...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity <...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair <...>-259 0d..2 2us : wq_worker_sleeping <-__schedule <...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping <...>-259 0d..2 2us : pick_next_task_fair <-__schedule <...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair <...>-259 0d..2 2us : clear_buddies <-pick_next_entity <...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair <...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair <...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity <...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup gnome-sh-1031 0...1 603us : <stack trace> => preempt_count_sub => _raw_spin_unlock => drm_gem_object_lookup => i915_gem_madvise_ioctl => drm_ioctl => do_vfs_ioctl => SyS_ioctl => entry_SYSCALL_64_fastpath As I'm tracing preemption disabled, it seemed incorrect that the trace would go across a schedule and report not being in the scheduler. Looking into this I discovered the problem. schedule() calls preempt_disable() but the preempt_schedule() calls preempt_enable_notrace(). What happened above was that the gnome-shell task was preempted on another CPU, migrated over to the idle cpu. The tracer stared with idle calling schedule(), which called preempt_disable(), but then gnome-shell finished, and it enabled preemption with preempt_enable_notrace() that does stop the trace, even though preemption was enabled. The purpose of the preempt_disable_notrace() in the preempt_schedule() is to prevent function tracing from going into an infinite loop. Because function tracing can trace the preempt_enable/disable() calls that are traced. The problem with function tracing is: NEED_RESCHED set preempt_schedule() preempt_disable() preempt_count_inc() function trace (before incrementing preempt count) preempt_disable_notrace() preempt_enable_notrace() sees NEED_RESCHED set preempt_schedule() (repeat) Now by breaking out the preempt off/on tracing into their own code: preempt_disable_check() and preempt_enable_check(), we can add these to the preempt_schedule() code. As preemption would then be disabled, even if they were to be traced by the function tracer, the disabled preemption would prevent the recursion. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:49:45 +02:00
Tim Chen	bfdb198ccd	sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels In account_entity_enqueue(), we do not do account_numa_enqueue() as NUMA balancing is not needed for UP kernels. Hence, we should remove the account_numa_dequeue() call from account_entity_dequeue() for UP kernels. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:49:45 +02:00
Srikar Dronamraju	d02c071183	sched/fair: Reset nr_balance_failed after active balancing To force a task migration during active balancing, nr_balance_failed is set to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side effect, the next regular load balance under the same sd, a cache hot task might be migrated, just because nr_balance_failed count is high. Resetting nr_balance_failed after a successful active balance ensures that a hot task is not unreasonably migrated. This can be verified by looking at othe number of hot task migrations reported by /proc/schedstat. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:49:44 +02:00
Dongsheng Yang	d740037fac	sched/cpuacct: Split usage accounting into user_usage and sys_usage Sometimes, cpuacct.usage is not detailed enough to see how much CPU usage a group had. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces more files to give this information: # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys ... while keeping the ABI with the existing counter. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> [ Ported to newer kernels. ] Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:48:54 +02:00
Zhao Lei	5ca3726af7	sched/cpuacct: Show all possible CPUs in cpuacct output Current code show stats of online CPUs in cpuacct.statcpus, show stats of present cpus in cpuacct.usage(_percpu), and using present CPUs for setting cpuacct.usage. It will cause inconsistent result when a CPU is online or offline or hotpluged. We should always use possible CPUs to avoid above problem. Here are the contents of a cpuacct.usage_percpu sysfs file, on a 4 CPU system with maxcpus=32: Before the patch: # cat cpuacct.usage_percpu 2456565 411435 1052897 832584 After the patch: # cat cpuacct.usage_percpu 2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Tejun Heo <htejun@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a11d56cef12d0b4807f8be3a46bf9798c3014d59.1458635566.git.zhaolei@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:45:56 +02:00
Wang Nan	d1b26c7024	perf/ring_buffer: Prepare writing into the ring-buffer from the end Convert perf_output_begin() to __perf_output_begin() and make the later function able to write records from the end of the ring-buffer. Following commits will utilize the 'backward' flag. This is the core patch to support writing to the ring-buffer backwards, which will be introduced by upcoming patches to support reading from overwritable ring-buffers. In theory, this patch should not introduce any extra performance overhead since we use always_inline, but it does not hurt to double check that assumption: When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly identical to original one. See: http://lkml.kernel.org/g/56F52E83.70409@huawei.com When CONFIG_OPTIMIZE_INLINING is enabled, the resuling object file becomes smaller: $ size kernel/events/ring_buffer.o* text data bss dec hex filename 4641 4 8 4653 122d kernel/events/ring_buffer.o.old 4545 4 8 4557 11cd kernel/events/ring_buffer.o.new Performance testing results: Calling 3000000 times of 'close(-1)', use gettimeofday() to check duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture system calls. In ns. Testing environment: CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Kernel : v4.5.0 MEAN STDVAR BASE 800214.950 2853.083 PRE 2253846.700 9997.014 POST 2257495.540 8516.293 Where 'BASE' is pure performance without capturing. 'PRE' is test result of pure 'v4.5.0' kernel. 'POST' is test result after this patch. Considering the stdvar, this patch doesn't hurt performance, within noise margin. For testing details, see: http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:49 +02:00
Wang Nan	1879445dfa	perf/core: Set event's default ::overflow_handler() Set a default event->overflow_handler in perf_event_alloc() so don't need to check event->overflow_handler in __perf_event_overflow(). Following commits can give a different default overflow_handler. Initial idea comes from Peter: http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net Since the default value of event->overflow_handler is not NULL, existing 'if (!overflow_handler)' checks need to be changed. is_default_overflow_handler() is introduced for this. No extra performance overhead is introduced into the hot path because in the original code we still need to read this handler from memory. A conditional branch is avoided so actually we remove some instructions. Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:47 +02:00
Wang Nan	86e7972f69	perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer Add new ioctl() to pause/resume ring-buffer output. In some situations we want to read from the ring-buffer only when we ensure nothing can write to the ring-buffer during reading. Without this patch we have to turn off all events attached to this ring-buffer to achieve this. This patch is a prerequisite to enable overwrite support for the perf ring-buffer support. Following commits will introduce new methods support reading from overwrite ring buffer. Before reading, caller must ensure the ring buffer is frozen, or the reading is unreliable. Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:45 +02:00
Jiri Olsa	0a74c5b3d2	ftrace/perf: Check sample types only for sampling events Currently we check sample type for ftrace:function events even if it's not created as a sampling event. That prevents creating ftrace_function event in counting mode. Make sure we check sample types only for sampling events. Before: $ sudo perf stat -e ftrace:function ls ... Performance counter stats for 'ls': <not supported> ftrace:function 0.001983662 seconds time elapsed After: $ sudo perf stat -e ftrace:function ls ... Performance counter stats for 'ls': 44,498 ftrace:function 0.037534722 seconds time elapsed Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:45 +02:00
Alexander Shishkin	af5bb4ed12	perf/ring_buffer: Document AUX API usage In order to ensure safe AUX buffer management, we rely on the assumption that pmu::stop() stops its ongoing AUX transaction and not just the hw. This patch documents this requirement for the perf_aux_output_{begin,end}() APIs. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:43 +02:00
Alexander Shishkin	95ff4ca26c	perf/core: Free AUX pages in unmap path Now that we can ensure that when ring buffer's AUX area is on the way to getting unmapped new transactions won't start, we only need to stop all events that can potentially be writing aux data to our ring buffer. Having done that, we can safely free the AUX pages and corresponding PMU data, as this time it is guaranteed to be the last aux reference holder. This partially reverts: `57ffc5ca67` ("perf: Fix AUX buffer refcounting") ... which was made to defer deallocation that was otherwise possible from an NMI context. Now it is no longer the case; the last call to rb_free_aux() that drops the last AUX reference has to happen in perf_mmap_close() on that AUX area. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:42 +02:00
Alexander Shishkin	dcb10a967c	perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count drops When ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to zero, new AUX transactions into this buffer can still be started, even though the buffer in en route to deallocation. This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count being zero, in which case there is no point starting new transactions, in other words, the ring buffers that pass a certain point in perf_mmap_close will not have their events sending new data, which clears path for freeing those buffers' pages right there and then, provided that no active transactions are holding the AUX reference. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-31 10:30:41 +02:00

... 13 14 15 16 17 ...

22958 Commits