Commit Graph

19547 Commits

Author SHA1 Message Date
Jason Low
39a4d9ca77 sched/fair: Stop searching for tasks in newidle balance if there are runnable tasks
It was found that when running some workloads (such as AIM7) on large
systems with many cores, CPUs do not remain idle for long. Thus, tasks
can wake/get enqueued while doing idle balancing.

In this patch, while traversing the domains in idle balance, in
addition to checking for pulled_task, we add an extra check for
this_rq->nr_running for determining if we should stop searching for
tasks to pull. If there are runnable tasks on this rq, then we will
stop traversing the domains. This reduces the chance that idle balance
delays a task from running.

This patch resulted in approximately a 6% performance improvement when
running a Java Server workload on an 8 socket machine.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1398303035-18255-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:53 +02:00
Vincent Guittot
d77b3ed5c9 sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs
in a sched_domain level can or not reach different power state. As an example,
the flag should be cleared at CPU level if groups of cores can be power gated
independently. This information can be used in the load balance decision or to
add load balancing level between group of CPUs that can power gate
independantly.
This flag is part of the topology flags that can be set by arch.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: schwidefsky@de.ibm.com
Cc: cmetcalf@tilera.com
Cc: benh@kernel.crashing.org
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1397209481-28542-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:52 +02:00
Vincent Guittot
607b45e9a2 sched, powerpc: Create a dedicated topology table
Create a dedicated topology table for handling asymetric feature of powerpc.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: schwidefsky@de.ibm.com
Cc: cmetcalf@tilera.com
Cc: dietmar.eggemann@arm.com
Cc: devicetree@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1397209481-28542-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:51 +02:00
Vincent Guittot
2dfd747629 sched, s390: Create a dedicated topology table
BOOK level is only relevant for s390 so we create a dedicated topology table
with BOOK level and remove it from default table.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Cc: cmetcalf@tilera.com
Cc: benh@kernel.crashing.org
Cc: dietmar.eggemann@arm.com
Cc: preeti@linux.vnet.ibm.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: linux390@de.ibm.com
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1397209481-28542-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:50 +02:00
Vincent Guittot
143e1e28cb sched: Rework sched_domain topology definition
We replace the old way to configure the scheduler topology with a new method
which enables a platform to declare additionnal level (if needed).

We still have a default topology table definition that can be used by platform
that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
be overwritten by an arch which either wants to add new level where a load
balance make sense like BOOK or powergating level or wants to change the flags
configuration of some levels.

For each level, we need a function pointer that returns cpumask for each cpu,
a function pointer that returns the flags for the level and a name. Only flags
that describe topology, can be set by an architecture. The current topology
flags are:

 SD_SHARE_CPUPOWER
 SD_SHARE_PKG_RESOURCES
 SD_NUMA
 SD_ASYM_PACKING

Then, each level must be a subset on the next one. The build sequence of the
sched_domain will take care of removing useless levels like those with 1 CPU
and those with the same CPU span and no more relevant information for
load balancing than its children.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux390@de.ibm.com
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:49 +02:00
Rik van Riel
68d1b02a58 sched/numa: Do not set preferred_node on migration to a second choice node
Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:47 +02:00
Rik van Riel
5085e2a328 sched/numa: Retry placement more frequently when misplaced
When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.

This patch reduces the interval at which migration is retried,
when the task's numa_scan_period is small.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:46 +02:00
Rik van Riel
792568ec6a sched/numa: Count pages on active node as local
The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.

However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted on the memory is counted as non-local, which
causes the scan rate to go up.

Counting the memory on any node where the task's numa group is
actively running as local, allows the scan rate to slow down
once the application is settled in.

This should reduce the overhead of the automatic NUMA placement
code, when a workload spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:33:45 +02:00
Ingo Molnar
2fe5de9ce7 Merge branch 'sched/urgent' into sched/core, to avoid conflicts
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 13:15:46 +02:00
Jason Low
2b4cfe64de sched/numa: Initialize newidle balance stats in sd_numa_init()
Also initialize the per-sd variables for newidle load balancing
in sd_numa_init().

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: morten.rasmussen@arm.com
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: aswin@hp.com
Cc: chegu_vinod@hp.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1398303035-18255-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:37 +02:00
Jason Low
0e5b5337f0 sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
The following commit:

  e5fc66119e ("sched: Fix race in idle_balance()")

can potentially cause rq->max_idle_balance_cost to not be updated,
even when load_balance(NEWLY_IDLE) is attempted and the per-sd
max cost value is updated.

Preeti noticed a similar issue with updating rq->next_balance.

In this patch, we fix this by making sure we still check/update those values
even if a task gets enqueued while browsing the domains.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1398725155-7591-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:36 +02:00
Peter Zijlstra
6ccdc84b81 sched: Skip double execution of pick_next_task_fair()
Tim wrote:

 "The current code will call pick_next_task_fair a second time in the
  slow path if we did not pull any task in our first try.  This is
  really unnecessary as we already know no task can be pulled and it
  doubles the delay for the cpu to enter idle.

  We instrumented some network workloads and that saw that
  pick_next_task_fair is frequently called twice before a cpu enters
  idle.  The call to pick_next_task_fair can add non trivial latency as
  it calls load_balance which runs find_busiest_group on an hierarchy of
  sched domains spanning the cpus for a large system.  For some 4 socket
  systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
  before a cpu can be idled."

Optimize the second call away for the common case and document the
dependency.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:35 +02:00
Steven Rostedt (Red Hat)
6227cb00cc sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
The check at the beginning of cpupri_find() makes sure that the task_pri
variable does not exceed the cp->pri_to_cpu array length. But that length
is CPUPRI_NR_PRIORITIES not MAX_RT_PRIO, where it will miss the last two
priorities in that array.

As task_pri is computed from convert_prio() which should never be bigger
than CPUPRI_NR_PRIORITIES, if the check should cause a panic if it is
hit.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:33 +02:00
Li Zefan
6a7cd273dc sched/deadline: Fix memory leak
Free cpudl->free_cpus allocated in cpudl_init().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 3.14+
Link: http://lkml.kernel.org/r/534F36CE.2000409@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:32 +02:00
Juri Lelli
5bfd126e80 sched/deadline: Fix sched_yield() behavior
yield_task_dl() is broken:

 o it forces current to be throttled setting its runtime to zero;
 o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
   will queue it back with proper parameters at replenish time.

Unfortunately, dl_task_timer() has this check at the very beginning:

	if (!dl_task(p) || dl_se->dl_new)
		goto unlock;

So, it just bails out and the task is never replenished. It actually
yielded forever.

To fix this, introduce a new flag indicating that the task properly yielded
the CPU before its current runtime expired. While this is a little overdoing
at the moment, the flag would be useful in the future to discriminate between
"good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
and "bad" jobs (for which dl_throttled task has been set) that needed to be
stopped.

Reported-by: yjay.kim <yjay.kim@lge.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:31 +02:00
Thomas Gleixner
2d513868e2 sched: Sanitize irq accounting madness
Russell reported, that irqtime_account_idle_ticks() takes ages due to:

       for (i = 0; i < ticks; i++)
               irqtime_account_process_tick(current, 0, rq);

It's sad, that this code was written way _AFTER_ the NOHZ idle
functionality was available. I charge myself guitly for not paying
attention when that crap got merged with commit abb74cefa ("sched:
Export ns irqtimes through /proc/stat")

So instead of looping nr_ticks times just apply the whole thing at
once.

As a side note: The whole cputime_t vs. u64 business in that context
wants to be cleaned up as well. There is no point in having all these
back and forth conversions. Lets standardise on u64 nsec for all
kernel internal accounting and be done with it. Everything else does
not make sense at all for fine grained accounting. Frederic, can you
please take care of that?

Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Shaun Ruffell <sruffell@digium.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:30 +02:00
Peter Zijlstra
ffb4ef21ac perf: Fix perf_event_init_context()
perf_pin_task_context() can return NULL but perf_event_init_context()
assumes it will not, correct this.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/20140505171428.GU26782@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:33:15 +02:00
Peter Zijlstra
46ce0fe97a perf: Fix race in removing an event
When removing a (sibling) event we do:

	raw_spin_lock_irq(&ctx->lock);
	perf_group_detach(event);
	raw_spin_unlock_irq(&ctx->lock);

	<hole>

	perf_remove_from_context(event);
		raw_spin_lock_irq(&ctx->lock);
		...
		raw_spin_unlock_irq(&ctx->lock);

Now, assuming the event is a sibling, it will be 'unreachable' for
things like ctx_sched_out() because that iterates the
groups->siblings, and we just unhooked the sibling.

So, if during <hole> we get ctx_sched_out(), it will miss the event
and not call event_sched_out() on it, leaving it programmed on the
PMU.

The subsequent perf_remove_from_context() call will find the ctx is
inactive and only call list_del_event() to remove the event from all
other lists.

Hereafter we can proceed to free the event; while still programmed!

Close this hole by moving perf_group_detach() inside the same
ctx->lock region(s) perf_remove_from_context() has.

The condition on inherited events only in __perf_event_exit_task() is
likely complete crap because non-inherited events are part of groups
too and we're tearing down just the same. But leave that for another
patch.

Most-likely-Fixes: e03a9a55b4 ("perf: Change close() semantics for group events")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Much-staring-at-traces-by: Vince Weaver <vincent.weaver@maine.edu>
Much-staring-at-traces-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:33:14 +02:00
Rafael J. Wysocki
aa9abe2c8e Merge branch 'pm-cpuidle' into pm-sleep 2014-05-07 01:50:57 +02:00
Rafael J. Wysocki
a6220fc19a PM / suspend: Always use deepest C-state in the "freeze" sleep state
If freeze_enter() is called, we want to bypass the current cpuidle
governor and always use the deepest available (that is, not disabled)
C-state, because we want to save as much energy as reasonably possible
then and runtime latency constraints don't matter at that point, since
the system is in a sleep state anyway.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Aubrey Li <aubrey.li@linux.intel.com>
2014-05-07 01:49:28 +02:00
Fabian Frederick
12d3089c19 kernel/cpuset.c: convert printk to pr_foo()
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-06 07:31:14 -04:00
Fabian Frederick
fc34ac1dc5 kernel/cpuset.c: kernel-doc fixes
This patch also converts seq_printf to seq_puts

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-06 07:31:14 -04:00
Christoph Lameter
bdffd893a0 tracing: Replace __get_cpu_var uses with this_cpu_ptr
Replace uses of &__get_cpu_var for address calculation with this_cpu_ptr.

Link: http://lkml.kernel.org/p/alpine.DEB.2.10.1404291415560.18364@gentwo.org

Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-05 22:40:53 -04:00
Andi Kleen
722a9f9299 asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
As requested by Linus add explicit __visible to the asmlinkage users.
This marks functions visible to assembler.

Tree sweep for rest of tree.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-4-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 16:07:46 -07:00
Linus Torvalds
2080cee435 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) e1000e computes header length incorrectly wrt vlans, fix from Vlad
    Yasevich.

 2) ns_capable() check in sock_diag netlink code, from Andrew
    Lutomirski.

 3) Fix invalid queue pairs handling in virtio_net, from Amos Kong.

 4) Checksum offloading busted in sxgbe driver due to incorrect
    descriptor layout, fix from Byungho An.

 5) Fix build failure with SMC_DEBUG set to 2 or larger, from Zi Shen
    Lim.

 6) Fix uninitialized A and X registers in BPF interpreter, from Alexei
    Starovoitov.

 7) Fix arch dependencies of candence driver.

 8) Fix netlink capabilities checking tree-wide, from Eric W Biederman.

 9) Don't dump IFLA_VF_PORTS if netlink request didn't ask for it in
    IFLA_EXT_MASK, from David Gibson.

10) IPV6 FIB dump restart doesn't handle table changes that happen
    meanwhile, causing the code to loop forever or emit dups, fix from
    Kumar Sandararajan.

11) Memory leak on VF removal in bnx2x, from Yuval Mintz.

12) Bug fixes for new Altera TSE driver from Vince Bridgers.

13) Fix route lookup key in SCTP, from Xugeng Zhang.

14) Use BH blocking spinlocks in SLIP, as per a similar fix to CAN/SLCAN
    driver.  From Oliver Hartkopp.

15) TCP doesn't bump retransmit counters in some code paths, fix from
    Eric Dumazet.

16) Clamp delayed_ack in tcp_cubic to prevent theoretical divides by
    zero.  Fix from Liu Yu.

17) Fix locking imbalance in error paths of HHF packet scheduler, from
    John Fastabend.

18) Properly reference the transport module when vsock_core_init() runs,
    from Andy King.

19) Fix buffer overflow in cdc_ncm driver, from Bjørn Mork.

20) IP_ECN_decapsulate() doesn't see a correct SKB network header in
    ip_tunnel_rcv(), fix from Ying Cai.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits)
  net: macb: Fix race between HW and driver
  net: macb: Remove 'unlikely' optimization
  net: macb: Re-enable RX interrupt only when RX is done
  net: macb: Clear interrupt flags
  net: macb: Pass same size to DMA_UNMAP as used for DMA_MAP
  ip_tunnel: Set network header properly for IP_ECN_decapsulate()
  e1000e: Restrict MDIO Slow Mode workaround to relevant parts
  e1000e: Fix issue with link flap on 82579
  e1000e: Expand workaround for 10Mb HD throughput bug
  e1000e: Workaround for dropped packets in Gig/100 speeds on 82579
  net/mlx4_core: Don't issue PCIe speed/width checks for VFs
  net/mlx4_core: Load the Eth driver first
  net/mlx4_core: Fix slave id computation for single port VF
  net/mlx4_core: Adjust port number in qp_attach wrapper when detaching
  net: cdc_ncm: fix buffer overflow
  Altera TSE: ALTERA_TSE should depend on HAS_DMA
  vsock: Make transport the proto owner
  net: sched: lock imbalance in hhf qdisc
  net: mvmdio: Check for a valid interrupt instead of an error
  net phy: Check for aneg completion before setting state to PHY_RUNNING
  ...
2014-05-05 15:59:46 -07:00
Andy Lutomirski
3d7ee969bf x86, vdso: Clean up 32-bit vs 64-bit vdso params
Rather than using 'vdso_enabled' and an awful #define, just call the
parameters vdso32_enabled and vdso64_enabled.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87913de56bdcbae3d93917938302fc369b05caee.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:18:40 -07:00
Fabian Frederick
60106946ca kernel/cgroup.c: fix 2 kernel-doc warnings
Fix typo and variable name.

tj: Updated @cgrp argument description in cgroup_destroy_css_killed()

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-05 14:33:04 -04:00
Tejun Heo
15a4c835e4 cgroup, memcg: implement css->id and convert css_from_id() to use it
Until now, cgroup->id has been used to identify all the associated
csses and css_from_id() takes cgroup ID and returns the matching css
by looking up the cgroup and then dereferencing the css associated
with it; however, now that the lifetimes of cgroup and css are
separate, this is incorrect and breaks on the unified hierarchy when a
controller is disabled and enabled back again before the previous
instance is released.

This patch adds css->id which is a subsystem-unique ID and converts
css_from_id() to look up by the new css->id instead.  memcg is the
only user of css_from_id() and also converted to use css->id instead.

For traditional hierarchies, this shouldn't make any functional
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo
ddfcadab35 cgroup: update init_css() into init_and_link_css()
init_css() takes the cgroup the new css belongs to as an argument and
initializes the new css's ->cgroup and ->parent pointers but doesn't
acquire the matching reference counts.  After the previous patch,
create_css() puts init_css() and reference acquisition right next to
each other.  Let's move reference acquistion into init_css() and
rename the function to init_and_link_css().  This makes sense and is
easier to follow.  This makes the root csses to hold a reference on
cgrp_dfl_root.cgrp, which is harmless.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo
a2bed8209a cgroup: use RCU free in create_css() failure path
Currently, when create_css() fails in the middle, the half-initialized
css is freed by invoking cgroup_subsys->css_free() directly.  This
patch updates the function so that it invokes RCU free path instead.
As the RCU free path puts the parent css and owning cgroup, their
references are now acquired right after a new css is successfully
allocated.

This doesn't make any visible difference now but is to enable
implementing css->id and RCU protected lookup by such IDs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:14 -04:00
Tejun Heo
6fa4918d03 cgroup: protect cgroup_root->cgroup_idr with a spinlock
Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
ends up requiring cgroup_put() to be invoked under sleepable context.
This is okay for now but is an unusual requirement and we'll soon add
css->id which will have the same problem but won't be able to simply
grab cgroup_mutex as removal will have to happen from css_release()
which can't sleep.

Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
which protects the idr operations with the lock and use them for
cgroup_root->cgroup_idr.  cgroup_put() no longer needs to grab
cgroup_mutex and css_from_id() is updated to always require RCU read
lock instead of either RCU read lock or cgroup_mutex, which doesn't
affect the existing users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Tejun Heo
7d699ddb2b cgroup, memcg: allocate cgroup ID from 1
Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.

It's reasonable to reserve 0 for special purposes.  This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed form memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Tejun Heo
69dfa00ccb cgroup: make flags and subsys_masks unsigned int
There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks.  This patch updates those
to use bitwise and/or operations instead and converts them form
unsigned long to unsigned int.

This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-04 15:09:13 -04:00
Tim Chen
3cf2f34e1a rwsem: Add comments to explain the meaning of the rwsem's count field
It took me quite a while to understand how rwsem's count field
mainifested itself in different scenarios.

Add comments to provide a quick reference to the the rwsem's count
field for each scenario where readers and writers are contending
for the lock.

Hopefully it will be useful for future maintenance of the code and
for people to get up to speed on how the logic in the code works.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1399060437.2970.146.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-04 20:34:26 +02:00
Thomas Gleixner
1e77d0a1ed genirq: Sanitize spurious interrupt detection of threaded irqs
Till reported that the spurious interrupt detection of threaded
interrupts is broken in two ways:

- note_interrupt() is called for each action thread of a shared
  interrupt line. That's wrong as we are only interested whether none
  of the device drivers felt responsible for the interrupt, but by
  calling multiple times for a single interrupt line we account
  IRQ_NONE even if one of the drivers felt responsible.

- note_interrupt() when called from the thread handler is not
  serialized. That leaves the members of irq_desc which are used for
  the spurious detection unprotected.

To solve this we need to defer the spurious detection of a threaded
interrupt to the next hardware interrupt context where we have
implicit serialization.

If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
check whether the previous interrupt requested a deferred check. If
not, we request a deferred check for the next hardware interrupt and
return. 

If set, we check whether one of the interrupt threads signaled
success. Depending on this information we feed the result into the
spurious detector.

If one primary handler of a shared interrupt returns IRQ_HANDLED we
disable the deferred check of irq threads on the same line, as we have
found at least one device driver who cared.

Reported-by: Till Straumann <strauman@slac.stanford.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Austin Schuh <austin@peloton-tech.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: linux-can@vger.kernel.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
2014-05-03 23:15:39 +02:00
Linus Torvalds
0384dcae2b Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "This udpate delivers:

   - A fix for dynamic interrupt allocation on x86 which is required to
     exclude the GSI interrupts from the dynamic allocatable range.

     This was detected with the newfangled tablet SoCs which have GPIOs
     and therefor allocate a range of interrupts.  The MSI allocations
     already excluded the GSI range, so we never noticed before.

   - The last missing set_irq_affinity() repair, which was delayed due
     to testing issues

   - A few bug fixes for the armada SoC interrupt controller

   - A memory allocation fix for the TI crossbar interrupt controller

   - A trivial kernel-doc warning fix"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: irq-crossbar: Not allocating enough memory
  irqchip: armanda: Sanitize set_irq_affinity()
  genirq: x86: Ensure that dynamic irq allocation does not conflict
  linux/interrupt.h: fix new kernel-doc warnings
  irqchip: armada-370-xp: Fix releasing of MSIs
  irqchip: armada-370-xp: implement the ->check_device() msi_chip operation
  irqchip: armada-370-xp: fix invalid cast of signed value into unsigned variable
2014-05-03 08:32:48 -07:00
Linus Torvalds
98facf0e1e Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "This update brings along:

   - Two fixes for long standing bugs in the hrtimer code, one which
     prevents remote enqueuing and the other preventing arbitrary delays
     after a interrupt hang was detected

   - A fix in the timer wheel which prevents math overflow

   - A fix for a long standing issue with the architected ARM timer
     related to the C3STOP mechanism.

   - A trivial compile fix for nspire SoC clocksource"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Prevent overflow in apply_slack
  hrtimer: Prevent remote enqueue of leftmost timers
  hrtimer: Prevent all reprogramming if hang detected
  clocksource: nspire: Fix compiler warning
  clocksource: arch_arm_timer: Fix age-old arch timer C3STOP detection issue
2014-05-03 08:31:45 -07:00
Linus Torvalds
00622e61ed Merge tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
 "This is a small fix where the trigger code used the wrong
  rcu_dereference().  It required rcu_dereference_sched() instead of the
  normal rcu_dereference().  It produces a nasty RCU lockdep splat due
  to the incorrect rcu notation"

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Use rcu_dereference_sched() for trace event triggers
2014-05-03 08:30:44 -07:00
Steven Rostedt (Red Hat)
561a4fe851 tracing: Use rcu_dereference_sched() for trace event triggers
As trace event triggers are now part of the mainline kernel, I added
my trace event trigger tests to my test suite I run on all my kernels.
Now these tests get run under different config options, and one of
those options is CONFIG_PROVE_RCU, which checks under lockdep that
the rcu locking primitives are being used correctly. This triggered
the following splat:

===============================
[ INFO: suspicious RCU usage. ]
3.15.0-rc2-test+ #11 Not tainted
-------------------------------
kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
4 locks held by swapper/1/0:
 #0:  ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
 #1:  (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
 #2:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
 #3:  (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
 ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
 0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
Call Trace:
 <IRQ>  [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
 [<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
 [<ffffffff810ee51c>] event_triggers_call+0x99/0x108
 [<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
 [<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
 [<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
 [<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
 [<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
 [<ffffffff8106eb77>] wake_up_process+0x36/0x3b
 [<ffffffff810575cc>] wake_up_worker+0x24/0x26
 [<ffffffff810578bc>] insert_work+0x5c/0x65
 [<ffffffff81059982>] __queue_work+0x26c/0x283
 [<ffffffff81059999>] ? __queue_work+0x283/0x283
 [<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
 [<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be^M
 [<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
 [<ffffffff81059999>] ? __queue_work+0x283/0x283
 [<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f^M
 [<ffffffff8104696d>] __do_softirq+0x17b/0x31b^M
 [<ffffffff81046d03>] irq_exit+0x42/0x97
 [<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
 [<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
 <EOI>  [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
 [<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
 [<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
 [<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
 [<ffffffff8102a23c>] start_secondary+0x212/0x219

The cause is that the triggers are protected by rcu_read_lock_sched() but
the data is dereferenced with rcu_dereference() which expects it to
be protected with rcu_read_lock(). The proper reference should be
rcu_dereference_sched().

Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-02 23:12:42 -04:00
Steven Rostedt (Red Hat)
fd06a54990 ftrace: Have function graph tracer use global_ops for filtering
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler. But it left out function graph filtering which
also depended on that code. The function graph filtering still
needs to use global_ops as the filter otherwise it wont filter
at all.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-01 23:21:16 -04:00
Linus Torvalds
60b88f3941 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
 "Fixed one missing place for the new taint flag, and remove a warning
  giving only false positives (now we finally figured out why)"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: remove warning about waiting module removal.
  Fix: tracing: use 'E' instead of 'X' for unsigned module taint flag
2014-05-01 10:35:01 -07:00
Sebastian Capella
2c730785d9 PM / hibernate: no kernel_power_off when pm_power_off NULL
Reboot logic in kernel/reboot will avoid calling kernel_power_off
when pm_power_off is null, and instead uses kernel_halt.  Change
hibernate's power_down to follow the behavior in the reboot call.

Calling the notifier twice (once for SYS_POWER_OFF and again for
SYS_HALT) causes a panic during hibernation on Kirkwood
Openblocks A6 board.

Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Reported-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Reviewed-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-01 01:02:25 +02:00
Rafael J. Wysocki
52c324f8a8 cpuidle: Combine cpuidle_enabled() with cpuidle_select()
Since both cpuidle_enabled() and cpuidle_select() are only called by
cpuidle_idle_call(), it is not really useful to keep them separate
and combining them will help to avoid complicating cpuidle_idle_call()
even further if governors are changed to return error codes sometimes.

This code modification shouldn't lead to any functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-01 00:13:47 +02:00
Oleg Nesterov
13f59c5e45 uprobes: Refuse to insert a probe into MAP_SHARED vma
valid_vma() rejects the VM_SHARED vmas, but this still allows to insert
a probe into the MAP_SHARED but not VM_MAYWRITE vma.

Currently this is fine, such a mapping doesn't really differ from the
private read-only mmap except mprotect(PROT_WRITE) won't work. However,
get_user_pages(FOLL_WRITE | FOLL_FORCE) doesn't allow to COW in this
case, and it would be safer to follow the same conventions as mm even
if currently this happens to work.

After the recent cda540ace6 "mm: get_user_pages(write,force) refuse
to COW in shared areas" only uprobes can insert an anon page into the
shared file-backed area, lets stop this and change valid_vma() to check
VM_MAYSHARE instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2014-04-30 19:10:43 +02:00
Oleg Nesterov
927d687480 uprobes/tracing: Fix uprobe_perf_open() on uprobe_apply() failure
uprobe_perf_open()->uprobe_apply() can fail, but this error is wrongly
ignored. Change uprobe_perf_open() to do uprobe_perf_close() and return
the error code in this case.

Change uprobe_perf_close() to propogate the error from uprobe_apply()
as well, although it should not fail.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2014-04-30 19:10:42 +02:00
Oleg Nesterov
ce5f36a58f uprobes/tracing: Make uprobe_perf_close() visible to uprobe_perf_open()
Preparation. Move uprobe_perf_close() up before uprobe_perf_open() to
avoid the forward declaration in the next patch and make it readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2014-04-30 19:10:42 +02:00
Steven Rostedt (Red Hat)
b1169cc69b tracing: Remove mock up poll wait function
Now that the ring buffer has a built in way to wake up readers
when there's data, using irq_work such that it is safe to do it
in any context. But it was still using the old "poor man's"
wait polling that checks every 1/10 of a second to see if it
should wake up a waiter. This makes the latency for a wake up
excruciatingly long. No need to do that anymore.

Completely remove the different wait_poll types from the tracers
and have them all use the default one now.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-04-30 08:40:05 -04:00
Jiri Bohac
98a01e779f timer: Prevent overflow in apply_slack
On architectures with sizeof(int) < sizeof (long), the
computation of mask inside apply_slack() can be undefined if the
computed bit is > 32.

E.g. with: expires = 0xffffe6f5 and slack = 25, we get:

expires_limit = 0x20000000e
bit = 33
mask = (1 << 33) - 1  /* undefined */

On x86, mask becomes 1 and and the slack is not applied properly.
On s390, mask is -1, expires is set to 0 and the timer fires immediately.

Use 1UL << bit to solve that issue.

Suggested-by: Deborah Townsend <dstownse@us.ibm.com>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20140418152310.GA13654@midget.suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 13:46:17 +02:00
Leon Ma
012a45e3f4 hrtimer: Prevent remote enqueue of leftmost timers
If a cpu is idle and starts an hrtimer which is not pinned on that
same cpu, the nohz code might target the timer to a different cpu.

In the case that we switch the cpu base of the timer we already have a
sanity check in place, which determines whether the timer is earlier
than the current leftmost timer on the target cpu. In that case we
enqueue the timer on the current cpu because we cannot reprogram the
clock event device on the target.

If the timers base is already the target CPU we do not have this
sanity check in place so we enqueue the timer as the leftmost timer in
the target cpus rb tree, but we cannot reprogram the clock event
device on the target cpu. So the timer expires late and subsequently
prevents the reprogramming of the target cpu clock event device until
the previously programmed event fires or a timer with an earlier
expiry time gets enqueued on the target cpu itself.

Add the same target check as we have for the switch base case and
start the timer on the current cpu if it would become the leftmost
timer on the target.

[ tglx: Rewrote subject and changelog ]

Signed-off-by: Leon Ma <xindong.ma@intel.com>
Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 12:34:51 +02:00
Stuart Hayes
6c6c0d5a1c hrtimer: Prevent all reprogramming if hang detected
If the last hrtimer interrupt detected a hang it sets hang_detected=1
and programs the clock event device with a delay to let the system
make progress.

If hang_detected == 1, we prevent reprogramming of the clock event
device in hrtimer_reprogram() but not in hrtimer_force_reprogram().

This can lead to the following situation:

hrtimer_interrupt()
   hang_detected = 1;
   program ce device to Xms from now (hang delay)

We have two timers pending:
   T1 expires 50ms from now
   T2 expires 5s from now

Now T1 gets canceled, which causes hrtimer_force_reprogram() to be
invoked, which in turn programs the clock event device to T2 (5
seconds from now).

Any hrtimer_start after that will not reprogram the hardware due to
hang_detected still being set. So we effectivly block all timers until
the T2 event fires and cleans up the hang situation.

Add a check for hang_detected to hrtimer_force_reprogram() which
prevents the reprogramming of the hang delay in the hardware
timer. The subsequent hrtimer_interrupt will resolve all outstanding
issues.

[ tglx: Rewrote subject and changelog and fixed up the comment in
  	hrtimer_force_reprogram() ]

Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Link: http://lkml.kernel.org/r/53602DC6.2060101@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-30 12:34:51 +02:00