Commit Graph

15557 Commits

Author SHA1 Message Date
Lai Jiangshan
b592760547 workqueue: remove pwq_lock which is no longer used
To simplify locking, the previous patches expanded wq->mutex to
protect all fields of each workqueue instance including the pwqs list
leaving pwq_lock without any user.  Remove the unused pwq_lock.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:19 -07:00
Lai Jiangshan
a357fc0326 workqueue: protect wq->saved_max_active with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

This patch makes wq->saved_max_active protected by wq->mutex instead
of pwq_lock.  As pwq_lock locking around pwq_adjust_mac_active() is no
longer necessary, this patch also replaces pwq_lock lockings of
for_each_pwq() around pwq_adjust_max_active() to wq->mutex.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:19 -07:00
Lai Jiangshan
b09f4fd39c workqueue: protect wq->pwqs and iteration with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

init_and_link_pwq() and pwq_unbound_release_workfn() already grab
wq->mutex when adding or removing a pwq from wq->pwqs list.  This
patch makes it official that the list is wq->mutex protected for
writes and updates readers accoridingly.  Explicit IRQ toggles for
sched-RCU read-locking in flush_workqueue_prep_pwqs() and
drain_workqueues() are removed as the surrounding wq->mutex can
provide sufficient synchronization.

Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
and checks for wq->mutex too.

pwq_lock locking and assertion are not removed by this patch and a
couple of for_each_pwq() iterations are still protected by it.
They'll be removed by future patches.

tj: Rebased on top of the current dev branch.  Updated description.
    Folded in assert_rcu_or_wq_mutex() renaming from a later patch
    along with associated comment updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:18 -07:00
Lai Jiangshan
87fc741e94 workqueue: protect wq->nr_drainers and ->flags with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

wq->nr_drainers and ->flags are specific to each workqueue.  Protect
->nr_drainers and ->flags with wq->mutex instead of pool_mutex.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:18 -07:00
Lai Jiangshan
3c25a55daa workqueue: rename wq->flush_mutex to wq->mutex
Currently pwq->flush_mutex protects many fields of a workqueue
including, especially, the pwqs list.  We're going to expand this
mutex to protect most of a workqueue and eventually replace pwq_lock,
which will make locking simpler and easier to understand.

Drop the "flush_" prefix in preparation.

This patch is pure rename.

tj: Rebased on top of the current dev branch.  Updated description.
    Use WQ: and WR: instead of Q: and QR: for synchronization labels.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:17 -07:00
Lai Jiangshan
68e13a67dd workqueue: rename wq_mutex to wq_pool_mutex
wq->flush_mutex will be renamed to wq->mutex and cover all fields
specific to each workqueue and eventually replace pwq_lock, which will
make locking simpler and easier to understand.

Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex.
After the scheduled changes, wq_pool_mutex won't be protecting
anything specific to each workqueue instance anyway.

This patch is pure rename.

tj: s/wqs_mutex/wq_pool_mutex/.  Rewrote description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-25 16:57:17 -07:00
Fengguang Wu
dd5d70e869 timekeeping: __timekeeping_set_tai_offset can be static
Yet again, the kbuild test robot saves the day, noting
I left out defining __timekeeping_set_tai_offset as
static. It even sent me this patch.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-25 12:24:24 -07:00
Rado Vrbovsky
cfea7d7e45 tick: Change log level of NOHZ: local_softirq_pending message
The "NOHZ: local_softirq_pending" message is a largely informational
message.  This makes extra work for customers that have a policy of
investigating all kernel log messages logged at <= KERN_ERR log level.
This patch sets the message to a different log level.

[ tglx: Use pr_warn() ]

Signed-off-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/2037057938.893524.1360345050772.JavaMail.root@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-25 10:46:59 +01:00
David Daney
8ffbc7d9ff hrtimer/trivial: Fix comment text in hrtimer.c
The comments mention HRTIMER_ABS and HRTIMER_REL, these symbols don't
exist, the proper names are HRTIMER_MODE_ABS and HRTIMER_MODE_REL.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/1363202438-21234-1-git-send-email-ddaney.cavm@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-25 10:46:59 +01:00
Oleg Nesterov
2ca067efd8 poweroff: change orderly_poweroff() to use schedule_work()
David said:

    Commit 6c0c0d4d10 ("poweroff: fix bug in orderly_poweroff()")
    apparently fixes one bug in orderly_poweroff(), but introduces
    another.  The comments on orderly_poweroff() claim it can be called
    from any context - and indeed we call it from interrupt context in
    arch/powerpc/platforms/pseries/ras.c for example.  But since that
    commit this is no longer safe, since call_usermodehelper_fns() is not
    safe in interrupt context without the UMH_NO_WAIT option.

orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
sleepable.  Move the "force" logic into __orderly_poweroff() and change
orderly_poweroff() to use the global poweroff_work which simply calls
__orderly_poweroff().

While at it, remove the unneeded "int argc" and change argv_split() to
use GFP_KERNEL.

We use the global "bool poweroff_force" to pass the argument, this can
obviously affect the previous request if it is pending/running.  So we
only allow the "false => true" transition assuming that the pending
"true" should succeed anyway.  If schedule_work() fails after that we
know that work->func() was not called yet, it must see the new value.

This means that orderly_poweroff() becomes async even if we do not run
the command and always succeeds, schedule_work() can only fail if the
work is already pending.  We can export __orderly_poweroff() and change
the non-atomic callers which want the old semantics.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Feng Hong <hongfeng@marvell.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22 16:41:20 -07:00
Frederic Weisbecker
dc72c32e1f printk: Provide a wake_up_klogd() off-case
wake_up_klogd() is useless when CONFIG_PRINTK=n because neither printk()
nor printk_sched() are in use and there are actually no waiter on
log_wait waitqueue.  It should be a stub in this case for users like
bust_spinlocks().

Otherwise this results in this warning when CONFIG_PRINTK=n and
CONFIG_IRQ_WORK=n:

	kernel/built-in.o In function `wake_up_klogd':
	(.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue'

To fix this, provide an off-case for wake_up_klogd() when
CONFIG_PRINTK=n.

There is much more from console_unlock() and other console related code
in printk.c that should be moved under CONFIG_PRINTK.  But for now,
focus on a minimal fix as we passed the merged window already.

[akpm@linux-foundation.org: include printk.h in bust_spinlocks.c]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22 16:41:20 -07:00
Thomas Gleixner
9a7a71b1d0 timekeeping: Split timekeeper_lock into lock and seqcount
We want to shorten the seqcount write hold time. So split the seqlock
into a lock and a seqcount.

Open code the seqwrite_lock in the places which matter and drop the
sequence counter update where it's pointless.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[jstultz: Merge fixups from CLOCK_TAI collisions]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:01 -07:00
Thomas Gleixner
7e40672d93 timekeeping: Move lock out of timekeeper struct
Make the lock a separate entity. Preparatory patch for shadow
timekeeper structure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Merged with CLOCK_TAI changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:00 -07:00
Thomas Gleixner
eb93e4d930 timekeeping: Make jiffies_lock internal
Nothing outside of the timekeeping core needs that lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:00 -07:00
Thomas Gleixner
23a9537a69 timekeeping: Calc stuff once
Calculate the cycle interval shifted value once. No functional change,
just makes the code more readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
90adda98b8 hrtimer: Add hrtimer support for CLOCK_TAI
Add hrtimer support for CLOCK_TAI, as well as posix timer interfaces.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
1ff3c9677b timekeeping: Add CLOCK_TAI clockid
This add a CLOCK_TAI clockid and the needed accessors.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
cc244ddae6 timekeeping: Move TAI managment into timekeeping core from ntp
Currently NTP manages the TAI offset. Since there's plans for a
CLOCK_TAI clockid, push the TAI management into the timekeeping
core.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:58 -07:00
Linus Torvalds
cd82346934 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A fair chunk of the linecount comes from a fix for a tracing bug that
  corrupts latency tracing buffers when the overwrite mode is changed on
  the fly - the rest is mostly assorted fewliner fixlets."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add SNB/SNB-EP scheduling constraints for cycle_activity event
  kprobes/x86: Check Interrupt Flag modifier when registering probe
  kprobes: Make hash_64() as always inlined
  perf: Generate EXIT event only once per task context
  perf: Reset hwc->last_period on sw clock events
  tracing: Prevent buffer overwrite disabled for latency tracers
  tracing: Keep overwrite in sync between regular and snapshot buffers
  tracing: Protect tracer flags with trace_types_lock
  perf tools: Fix LIBNUMA build with glibc 2.12 and older.
  tracing: Fix free of probe entry by calling call_rcu_sched()
  perf/POWER7: Create a sysfs format entry for Power7 events
  perf probe: Fix segfault
  libtraceevent: Remove hard coded include to /usr/local/include in Makefile
  perf record: Fix -C option
  perf tools: check if -DFORTIFY_SOURCE=2 is allowed
  perf report: Fix build with NO_NEWT=1
  perf annotate: Fix build with NO_NEWT=1
  tracing: Fix race in snapshot swapping
2013-03-21 08:29:11 -07:00
Stephane Eranian
dd9c086d9f perf: Fix ring_buffer perf_output_space() boundary calculation
This patch fixes a flaw in perf_output_space(). In case the size
of the space needed is bigger than the actual buffer size, there
may be situations where the function would return true (i.e.,
there is space) when it should not. head > offset due to
rounding of the masking logic.

The problem can be tested by activating BTS on Intel processors.
A BTS record can be as big as 16 pages. The following command
fails:

  $ perf record -m 4 -c 1 -e branches:u my_test_program

You will get a buffer corruption with this. Perf report won't be
able to parse the perf.data.

The fix is to first check that the requested space is smaller
than the buffer size. If so, then the masking logic will work
fine. If not, then there is no chance the record can be saved
and it will be gracefully handled by upper code layers.

[ In v2, we also make the logic for the writable more explicit by
  renaming it to rb->overwrite because it tells whether or not the
  buffer can overwrite its tail (suggested by PeterZ). ]

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 12:04:35 +01:00
Tejun Heo
383efcd000 sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s
try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked.  There's no reason to trigger BUG while holding rq
lock crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 11:48:20 +01:00
Ingo Molnar
3bf2391729 Merge branch 'perf/urgent' into perf/core
Merge in all pending fixes, before pulling the latest development
bits from Arnaldo - which will involve merge conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 11:03:10 +01:00
Steven Rostedt (Red Hat)
22f45649ce tracing: Update debugfs README file
Update the README file in debugfs/tracing to something more useful.
What's currently in the file is very old and what it shows doesn't
have much use. Heck, it tells you how to mount debugfs! But to read
this file you would have already needed to mount it.

Replace the file with current up-to-date information. It's rather
limited, but what do you expect from a pseudo README file.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-20 21:55:02 -04:00
Lai Jiangshan
519e3c1163 workqueue: avoid false negative in assert_manager_or_pool_lock()
If lockdep complains something for other subsystem, lockdep_is_held()
can be false negative, so we need to also test debug_locks before
triggering WARN.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 11:21:34 -07:00
Lai Jiangshan
881094532e workqueue: use rcu_read_lock_sched() instead for accessing pwq in RCU
rcu_read_lock_sched() is better than preempt_disable() if the code is
protected by RCU_SCHED.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 11:00:57 -07:00
Lai Jiangshan
951a078a52 workqueue: kick a worker in pwq_adjust_max_active()
If pwq_adjust_max_active() changes max_active from 0 to
saved_max_active, it needs to wakeup worker.  This is already done by
thaw_workqueues().

If pwq_adjust_max_active() increases max_active for an unbound wq,
while not strictly necessary for correctness, it's still desirable to
wake up a worker so that the requested concurrency level is reached
sooner.

Move wake_up_worker() call from thaw_workqueues() to
pwq_adjust_max_active() so that it can handle both of the above two
cases.  This also makes thaw_workqueues() simpler.

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:52:30 -07:00
Lai Jiangshan
6a092dfd51 workqueue: simplify current_is_workqueue_rescuer()
We can test worker->recue_wq instead of reaching into
current_pwq->wq->rescuer and then comparing it to self.

tj: Commit message.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:40:25 -07:00
Lai Jiangshan
12ee4fc67c workqueue: add missing POOL_FREEZING
get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing
is set and a new pool could go out of sync with the global freezing
state.

Fix it by adding POOL_FREEZING if workqueue_freezing.  wq_mutex is
already held so no further locking is necessary.  This also removes
the unused static variable warning when !CONFIG_FREEZER.

tj: Updated commit message.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:19:09 -07:00
Li Zefan
081aa458c3 cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()
These two functions share most of the code.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 07:50:25 -07:00
Li Zefan
3ac1707a13 cgroup: fix an off-by-one bug which may trigger BUG_ON()
The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.

The effect of the bug is, when opening cgroup.procs, a flex array will
be allocated and all elements of the array is allocated with
GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to
allocate memory for it, it'll trigger a BUG_ON().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-03-20 07:50:04 -07:00
Tejun Heo
7dbc725e47 workqueue: restore CPU affinity of unbound workers on CPU_ONLINE
With the recent addition of the custom attributes support, unbound
pools may have allowed cpumask which isn't full.  As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.

To remain compliant to the user-specified configuration, CPU affinity
needs to be restored when a CPU becomes online for an unbound pool
which doesn't currently have any online CPUs before.

This patch implement restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
a9ab775bca workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.

As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr().  Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound.  worker->rebind_work is queued to busy
workers.  Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind().  The manager isn't covered by either and
implements its own mechanism.

PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers.  This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.

There are a couple subtleties.  All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up.  This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.

Another subtlety is stems from matching @pool->nr_running with the
number of running unbound workers.  While DISASSOCIATED, all workers
are unbound and nr_running is zero.  As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals.  Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution.  The only change
needed for the workers is clearing REBOUND along with PREP.

* This patch leaves for_each_busy_worker() without any user.  Removed.

* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
  and rebind logic in manager_workers() removed.

* worker_thread() now looks at WORKER_DIE instead of testing whether
  @worker->entry is empty to determine whether it needs to do
  something special as dying is the only special thing now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
bd7c089eb2 workqueue: relocate rebind_workers()
rebind_workers() will be reimplemented in a way which makes it mostly
decoupled from the rest of worker management.  Move rebind_workers()
so that it's located with other CPU hotplug related functions.

This patch is pure function relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
822d8405d1 workqueue: convert worker_pool->worker_ida to idr and implement for_each_pool_worker()
Make worker_ida an idr - worker_idr and use it to implement
for_each_pool_worker() which will be used to simplify worker rebinding
on CPU_ONLINE.

pool->worker_idr is protected by both pool->manager_mutex and
pool->lock so that it can be iterated while holding either lock.

* create_worker() allocates ID without installing worker pointer and
  installs the pointer later using idr_replace().  This is because
  worker ID is needed when creating the actual task to name it and the
  new worker shouldn't be visible to iterations before fully
  initialized.

* In destroy_worker(), ID removal is moved before kthread_stop().
  This is again to guarantee that only fully working workers are
  visible to for_each_pool_worker().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
14a40ffccd sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself.  Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.

What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks.  In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.

This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.

This will allow simplifying workqueue worker CPU affinity management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-19 13:45:20 -07:00
Linus Torvalds
b63dc123b2 Merge branch 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Lai's patch to fix highly unlikely but still possible workqueue stall
  during CPU hotunplug."

* 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix possible pool stall bug in wq_unbind_fn()
2013-03-18 18:47:07 -07:00
Namhyung Kim
86e213e1d9 perf/cgroup: Add __percpu annotation to perf_cgroup->info
It's a per-cpu data structure but missed the __percpu annotation.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Link: http://lkml.kernel.org/r/1363600594-11453-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 11:02:06 +01:00
Peter Zijlstra
a8d7ad52a7 sched/tracing: Allow tracing the preemption decision on wakeup
Thomas noted that we do the wakeup preemption check after the
wakeup trace point, this means the tracepoint cannot test/report
this decision; which is rather important for latency sensitive
workloads. Therefore move the tracepoint after doing the
preemption check.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1363254519.26965.9.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 10:18:08 +01:00
Ingo Molnar
e75c8b475e Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull CPU runtime stats/accounting fixes from Frederic Weisbecker:

 " Some users are complaining that their threadgroup's runtime accounting
   freezes after a week or so of intense cpu-bound workload. This set tries
   to fix the issue by reducing the risk of multiplication overflow in the
   cputime scaling code. "

Stanislaw Gruszka further explained the historic context and impact of the
bug:

 " Commit 0cf55e1ec0 start to use scalling
   for whole thread group, so increase chances of hitting multiplication
   overflow, depending on how many CPUs are on the system.

   We have multiplication utime * rtime for one thread since commit
   b27f03d4bd.

   Overflow will happen after:

   rtime * utime > 0xffffffffffffffff jiffies

   if thread utilize 100% of CPU time, that gives:

   rtime > sqrt(0xffffffffffffffff) jiffies

   ritme > sqrt(0xffffffffffffffff) / (24 * 60 * 60 * HZ) days

   For HZ 100 it will be 497 days for HZ 1000 it will be 49 days.

   Bug affect only users, who run CPU intensive application for that
   long period. Also they have to be interested on utime,stime values,
   as bug has no other visible effect as making those values incorrect. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 10:09:31 +01:00
Ingo Molnar
1f1b396758 Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull tracing fixes from Steven Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:48:29 +01:00
Namhyung Kim
d610d98b5d perf: Generate EXIT event only once per task context
perf_event_task_event() iterates pmu list and generate events
for each eligible pmu context.  But if task_event has task_ctx
like in EXIT it'll generate events even though the pmu doesn't
have an eligible one. Fix it by moving the code to proper
places.

Before this patch:

  $ perf record -n true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]

  $ perf report -D | tail
  Aggregated stats:
             TOTAL events:         73
              MMAP events:         67
              COMM events:          2
              EXIT events:          4
  cycles stats:
             TOTAL events:         73
              MMAP events:         67
              COMM events:          2
              EXIT events:          4

After this patch:

  $ perf report -D | tail
  Aggregated stats:
             TOTAL events:         70
              MMAP events:         67
              COMM events:          2
              EXIT events:          1
  cycles stats:
             TOTAL events:         70
              MMAP events:         67
              COMM events:          2
              EXIT events:          1

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:47:33 +01:00
Namhyung Kim
778141e3cf perf: Reset hwc->last_period on sw clock events
When cpu/task clock events are initialized, their sampling
frequencies are converted to have a fixed value.  However it
missed to update the hwc->last_period which was set to 1 for
initial sampling frequency calibration.

Because this hwc->last_period value is used as a period in
perf_swevent_ hrtime(), every recorded sample will have an
incorrected period of 1.

  $ perf record -e task-clock noploop 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]

  $ perf report -n --show-total-period  --stdio
  # Samples: 4K of event 'task-clock'
  # Event count (approx.): 4000
  #
  # Overhead       Samples        Period  Command  Shared Object              Symbol
  # ........  ............  ............  .......  .............  ..................
  #
      99.95%          3998          3998  noploop  noploop        [.] main
       0.03%             1             1  noploop  libc-2.15.so   [.] init_cacheinfo
       0.03%             1             1  noploop  ld-2.15.so     [.] open_verify

Note that it doesn't affect the non-sampling event so that the
perf stat still gets correct value with or without this patch.

  $ perf stat -e task-clock noploop 1

   Performance counter stats for 'noploop 1':

         1000.272525 task-clock                #    1.000 CPUs utilized

         1.000560605 seconds time elapsed

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:15:18 +01:00
Feng Tang
e445cf1c42 timekeeping: utilize the suspend-nonstop clocksource to count suspended time
There are some new processors whose TSC clocksource won't stop during
suspend. Currently, after system resumes, kernel will use persistent
clock or RTC to compensate the sleep time, but with these nonstop
clocksources, we could skip the special compensation from external
sources, and just use current clocksource for time recounting.

This can solve some time drift bugs caused by some not-so-accurate or
error-prone RTC devices.

The current way to count suspended time is first try to use the persistent
clock, and then try the RTC if persistent clock can't be used. This
patch will change the trying order to:
	suspend-nonstop clocksource -> persistent clock -> RTC

When counting the sleep time with nonstop clocksource, use an accurate way
suggested by Jason Gunthorpe to cover very large delta cycles.

Signed-off-by: Feng Tang <feng.tang@intel.com>
[jstultz: Small optimization, avoiding re-reading the clocksource]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:51:29 -07:00
John Stultz
7859e404ae timekeeping: Use inject_offset in warp_clock
When warping the clock (from a local time RTC), use
timekeeping_inject_offset() to atomically add the offset.

This avoids any minor time error caused by the delay between
reading the time, and then setting the adjusted time.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:50:20 -07:00
Dong Zhu
c30bd09915 timekeeping: Avoid adjust kernel time once hwclock kept in UTC time
If the Hardware Clock kept in local time,kernel will adjust the time
to be UTC time.But if Hardware Clock kept in UTC time,system will make
a dummy settimeofday call first (sys_tz.tz_minuteswest = 0) to make sure
the time is not shifted,so at this point I think maybe it is not necessary
to set the kernel time once the sys_tz.tz_minuteswest is zero.

Signed-off-by: Dong Zhu <bluezhudong@gmail.com>
[jstultz: Updated to merge with conflicting changes ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:50:12 -07:00
Steven Rostedt (Red Hat)
7fe70b579c tracing: Fix ftrace_dump()
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either a oops,
panic, or a sysrq-z occurs.

This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.

There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.

It also totally disables ftrace, and tracing for no good reasons.

As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.

ftrace_dump() is now cleaned up where it uses an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is enough that is needed for both other CPUs and NMIs. No need for
a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.

The tracing_on variable is turned off and not turned on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.

For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.

The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.

Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 19:24:56 -04:00
zhangwei(Jovi)
52f6ad6dc3 tracing: Rename trace_event_mutex to trace_event_sem
trace_event_mutex is an rw semaphore now, not a mutex, change the name.

Link: http://lkml.kernel.org/r/513D843B.40109@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ Forward ported to my new code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:10 -04:00
zhangwei(Jovi)
36a78e9e87 tracing: Fix comment about prefix in arch_syscall_match_sym_name()
ppc64 has its own syscall prefix like ".SyS" or ".sys". Make the
comment in arch_syscall_match_sym_name() more understandable.

Link: http://lkml.kernel.org/r/513D842F.40205@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:09 -04:00
zhangwei(Jovi)
ad7067cebf tracing: Convert trace_destroy_fields() to static
trace_destroy_fields() is not used outside of the file. It can be
a static function.

Link: http://lkml.kernel.org/r/513D842A.2000907@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:08 -04:00
zhangwei(Jovi)
b3a8c6fd7b tracing: Move find_event_field() into trace_events.c
By moving find_event_field() and trace_find_field() into trace_events.c,
the ftrace_common_fields list and trace_get_fields() can become local to
the trace_events.c file.

find_event_field() is renamed to trace_find_event_field() to conform to
the tracing global function names.

Link: http://lkml.kernel.org/r/513D8426.9070109@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ rostedt: Modified trace_find_field() to trace_find_event_field() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:07 -04:00