Commit Graph

27302 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
1a149d7d3f ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler
The current method to prevent the ring buffer from entering into a recursize
loop is to use a bitmask and set the bit that maps to the current context
(normal, softirq, irq or NMI), and if that bit was already set, it is
considered a recursive loop.

New code is being added that may require the ring buffer to be entered a
second time in the current context. The recursive locking prevents that from
happening. Instead of mapping a bitmask to the current context, just allow 4
levels of nesting in the ring buffer. This matches the 4 context levels that
it can already nest. It is highly unlikely to have more than two levels,
thus it should be fine when we add the second entry into the ring buffer. If
that proves to be a problem, we can always up the number to 8.

An added benefit is that reading preempt_count() to get the current level
adds a very slight but noticeable overhead. This removes that need.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-04 11:36:58 -04:00
Steven Rostedt (VMware)
12ecef0cb1 tracing: Reverse the order of trace_types_lock and event_mutex
In order to make future changes where we need to call
tracing_set_clock() from within an event command, the order of
trace_types_lock and event_mutex must be reversed, as the event command
will hold event_mutex and the trace_types_lock is taken from within
tracing_set_clock().

Link: http://lkml.kernel.org/r/20170921162249.0dde3dca@gandalf.local.home

Requested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-04 11:36:56 -04:00
Colin Ian King
6e7a239811 tracing: Remove redundant unread variable ret
Integer ret is being assigned but never used and hence it is
redundant. Remove it, fixes clang warning:

trace_events_hist.c:1077:3: warning: Value stored to 'ret' is never read

Link: http://lkml.kernel.org/r/20170823112309.19383-1-colin.king@canonical.com

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-04 11:36:55 -04:00
Joel Fernandes
d8c4deee6d tracing: Remove obsolete sched_switch tracer selftest
Since commit 87d80de280 ("tracing: Remove
obsolete sched_switch tracer"), the sched_switch tracer selftest is no longer
used.  This patch removes the same.

Link: http://lkml.kernel.org/r/20170909065517.22262-1-joelaf@google.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-04 11:36:54 -04:00
Linus Torvalds
013a8ee628 Merge tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlets from Steven Rostedt:
 "Two updates:

   - A memory fix with left over code from spliting out ftrace_ops and
     function graph tracer, where the function graph tracer could reset
     the trampoline pointer, leaving the old trampoline not to be freed
     (memory leak).

   - The update to Paul's patch that added the unnecessary READ_ONCE().
     This removes the unnecessary READ_ONCE() instead of having to
     rebase the branch to update the patch that added it"

* tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}()
  ftrace: Fix kmemleak in unregister_ftrace_graph
2017-10-04 08:34:01 -07:00
Thomas Gleixner
0b62bf862d watchdog/core: Put softlockup_threads_initialized under ifdef guard
The variable is unused when the softlockup detector is disabled in Kconfig.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-10-04 11:30:50 +02:00
Thomas Gleixner
5587185ddb watchdog/core: Rename some softlockup_* functions
The function names made sense up to the point where the watchdog
(re)configuration was unified to use softlockup_reconfigure_threads() for
all configuration purposes. But that includes scenarios which solely
configure the nmi watchdog.

Rename softlockup_reconfigure_threads() and softlockup_init_threads() so
the function names match the functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
2017-10-04 10:53:54 +02:00
Thomas Gleixner
34ddaa3e5c powerpc/watchdog: Make use of watchdog_nmi_probe()
The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
on powerpc because it is called multiple times for the boot CPU.

The first call is via:

  start_wd_on_cpu+0x80/0x2f0
  watchdog_nmi_reconfigure+0x124/0x170
  softlockup_reconfigure_threads+0x110/0x130
  lockup_detector_init+0xbc/0xe0
  kernel_init_freeable+0x18c/0x37c
  kernel_init+0x2c/0x160
  ret_from_kernel_thread+0x5c/0xbc

And then again via the CPU hotplug registration:

  start_wd_on_cpu+0x80/0x2f0
  cpuhp_invoke_callback+0x194/0x620
  cpuhp_thread_fun+0x7c/0x1b0
  smpboot_thread_fn+0x290/0x2a0
  kthread+0x168/0x1b0
  ret_from_kernel_thread+0x5c/0xbc

This can be avoided by setting up the cpu hotplug state with nocalls and
move the initialization to the watchdog_nmi_probe() function. That
initializes the hotplug callbacks without invoking the callback and the
following core initialization function then configures the watchdog for the
online CPUs (in this case CPU0) via softlockup_reconfigure_threads().

Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
2017-10-04 10:53:54 +02:00
Thomas Gleixner
e31d6883f2 watchdog/core, powerpc: Lock cpus across reconfiguration
Instead of dropping the cpu hotplug lock after stopping NMI watchdog and
threads and reaquiring for restart, the code and the protection rules
become more obvious when holding cpu hotplug lock across the full
reconfiguration.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710022105570.2114@nanos
2017-10-04 10:53:54 +02:00
Thomas Gleixner
6b9dc4806b watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
into two stages. One to stop the NMI and one to restart it after
reconfiguration. That was done by adding a boolean 'run' argument to the
code, which is functionally correct but not necessarily a piece of art.

Replace it by two explicit functions: watchdog_nmi_stop() and
watchdog_nmi_start().

Fixes: 6592ad2fcc ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
Requested-by: Linus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
2017-10-04 10:53:53 +02:00
Jean Delvare
e0596c80f4 kernel/params.c: improve STANDARD_PARAM_DEF readability
Align the parameters passed to STANDARD_PARAM_DEF for clarity.

Link: http://lkml.kernel.org/r/20170928162728.756143cc@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:26 -07:00
Jean Delvare
96802e6b1d kernel/params.c: fix an overflow in param_attr_show
Function param_attr_show could overflow the buffer it is operating on.

The buffer size is PAGE_SIZE, and the string returned by
attribute->param->ops->get is generated by scnprintf(buffer, PAGE_SIZE,
...) so it could be PAGE_SIZE - 1 long, with the terminating '\0' at the
very end of the buffer.  Calling strcat(..., "\n") on this isn't safe, as
the '\0' will be replaced by '\n' (OK) and then another '\0' will be added
past the end of the buffer (not OK.)

Simply add the trailing '\n' when writing the attribute contents to the
buffer originally.  This is safe, and also faster.

Credits to Teradata for discovering this issue.

Link: http://lkml.kernel.org/r/20170928162602.60c379c7@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:26 -07:00
Jean Delvare
90ceb2a3ad kernel/params.c: fix the maximum length in param_get_string
The length parameter of strlcpy() is supposed to reflect the size of the
target buffer, not of the source string.  Harmless in this case as the
buffer is PAGE_SIZE long and the source string is always much shorter than
this, but conceptually wrong, so let's fix it.

Link: http://lkml.kernel.org/r/20170928162515.24846b4f@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:26 -07:00
Cyrill Gorcunov
c9653850c9 kernel/kcmp.c: drop branch leftover typo
The else branch been left over and escaped the source code refresh.  Not
a problem but better clean it up.

Fixes: 0791e3644e ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files")
Link: http://lkml.kernel.org/r/20170917165838.GA1887@uranus.lan
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Michal Hocko
1fdcce6e16 memremap: add scheduling point to devm_memremap_pages
devm_memremap_pages is initializing struct pages in for_each_device_pfn
and that can take quite some time.  We have even seen a soft lockup
triggering on a non preemptive kernel

  NMI watchdog: BUG: soft lockup - CPU#61 stuck for 22s! [kworker/u641:11:1808]
  [...]
  RIP: 0010:[<ffffffff8118b6b7>]  [<ffffffff8118b6b7>] devm_memremap_pages+0x327/0x430
  [...]
  Call Trace:
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70

fix this by adding cond_resched every 1024 pages.

Link: http://lkml.kernel.org/r/20170918121410.24466-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Luis R. Rodriguez
3181c38e4d kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
do_proc_douintvec_conv() has two UINT_MAX checks, we can remove one.
This has no functional changes other than fixing a compiler warning:

  kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false

Fixes: 4f2fec00af ("sysctl: simplify unsigned int support")
Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: David Binderman <dcb314@hotmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Sherry Yang
a1b2289cef android: binder: drop lru lock in isolate callback
Drop the global lru lock in isolate callback before calling
zap_page_range which calls cond_resched, and re-acquire the global lru
lock before returning.  Also change return code to LRU_REMOVED_RETRY.

Use mmput_async when fail to acquire mmap sem in an atomic context.

Fix "BUG: sleeping function called from invalid context"
errors when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.

Also restore mmput_async, which was initially introduced in commit
ec8d7c14ea ("mm, oom_reaper: do not mmput synchronously from the oom
reaper context"), and was removed in commit 2129258024 ("mm: oom: let
oom_reap_task and exit_mmap run concurrently").

Link: http://lkml.kernel.org/r/20170914182231.90908-1-sherryy@android.com
Fixes: f2517eb76f ("android: binder: Add global lru shrinker to binder")
Signed-off-by: Sherry Yang <sherryy@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Kyle Yan <kyan@codeaurora.org>
Acked-by: Arve Hjønnevåg <arve@android.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Martijn Coenen <maco@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Riley Andrews <riandrews@android.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hoeun Ryu <hoeun.ryu@gmail.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
Jean Delvare
630cc2b30a kernel/params.c: align add_sysfs_param documentation with code
This parameter is named kp, so the documentation should use that.

Fixes: 9b473de872 ("param: Fix duplicate module prefixes")
Link: http://lkml.kernel.org/r/20170919142656.64aea59e@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:23 -07:00
Alexei Starovoitov
90caccdd8c bpf: fix bpf_tail_call() x64 JIT
- bpf prog_array just like all other types of bpf array accepts 32-bit index.
  Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent

The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
in commit 96eabe7a40 in 4.14. Before that the map_flags would stay zero and
though JIT code is wrong it will check bounds correctly.
Hence two fixes tags. All other JITs don't have this problem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40 ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a7 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03 16:04:44 -07:00
Linus Torvalds
847d9fb477 Merge branch 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The recent migration code updates assumed that migrations always
  execute from the top to the bottom once and didn't clean up internal
  states after each migration round; however, cgroup_transfer_tasks()
  repeats the inner steps multiple times and the garbage internal states
  from the previous iteration led to OOPS.

  Waiman fixed the bug by reinitializing the relevant states at the end
  of each migration round"

* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns
2017-10-03 10:40:36 -07:00
Paul E. McKenney
f39b536ce9 rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}()
The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
is currently protected with READ_ONCE().  However, this protection is
unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
current CPU, (2) Although NMI handlers can update this field, they reset
it back to its old value before return, and (3) Interrupts are disabled,
so nothing else can modify it.  The value of ->dynticks_nmi_nesting is
thus effectively constant, and so no protection is required.

This commit therefore removes the READ_ONCE() protection from these
two accesses.

Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-03 10:27:32 -04:00
Shu Wang
2b0b8499ae ftrace: Fix kmemleak in unregister_ftrace_graph
The trampoline allocated by function tracer was overwriten by function_graph
tracer, and caused a memory leak. The save_global_trampoline should have
saved the previous trampoline in register_ftrace_graph() and restored it in
unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was
only used in unregister_ftrace_graph as default value 0, and it overwrote the
previous trampoline's value. Causing the previous allocated trampoline to be
lost.

kmmeleak backtrace:
    kmemleak_vmalloc+0x77/0xc0
    __vmalloc_node_range+0x1b5/0x2c0
    module_alloc+0x7c/0xd0
    arch_ftrace_update_trampoline+0xb5/0x290
    ftrace_startup+0x78/0x210
    register_ftrace_function+0x8b/0xd0
    function_trace_init+0x4f/0x80
    tracing_set_tracer+0xe6/0x170
    tracing_set_trace_write+0x90/0xd0
    __vfs_write+0x37/0x170
    vfs_write+0xb2/0x1b0
    SyS_write+0x55/0xc0
    do_syscall_64+0x67/0x180
    return_from_SYSCALL_64+0x0/0x6a

[
  Looking further into this, I found that this was left over from when the
  function and function graph tracers shared the same ftrace_ops. But in
  commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer
  together"), the two were separated, and the save_global_trampoline no
  longer was necessary (and it may have been broken back then too).
  -- Steven Rostedt
]

Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com

Cc: stable@vger.kernel.org
Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together")
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-03 10:27:32 -04:00
Joe Perches
64ec72a1ec PM: Use a more common logging style
Convert printks to pr_<level>.

Miscellanea:

o Use pr_fmt with "PM:" and remove "PM: " from format strings
o Coalesce format strings and realign format arguments
o Convert an embedded incorrect function name to "%s: ", __func__
o Convert a couple multi-line formats to multiple pr_<level> calls

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-10-03 02:57:17 +02:00
Viresh Kumar
7813dd6fc7 PM / OPP: Move the OPP directory out of power/
The drivers/base/power/ directory is special and contains code related
to power management core like system suspend/resume, hibernation, etc.
It was fine to keep the OPP code inside it when we had just one file for
it, but it is growing now and already has a directory for itself.

Lets move it directly under drivers/ directory, just like cpufreq and
cpuidle.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-10-03 02:45:12 +02:00
Linus Torvalds
8251354513 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp/hotplug fixes from Thomas Gleixner:
 "This addresses the fallout of the new lockdep mechanism which covers
  completions in the CPU hotplug code.

  The lockdep splats are false positives, but there is no way to
  annotate that reliably. The solution is to split the completions for
  CPU up and down, which requires some reshuffling of the failure
  rollback handling as well"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp/hotplug: Hotplug state fail injection
  smp/hotplug: Differentiate the AP completion between up and down
  smp/hotplug: Differentiate the AP-work lockdep class between up and down
  smp/hotplug: Callback vs state-machine consistency
  smp/hotplug: Rewrite AP state machine core
  smp/hotplug: Allow external multi-instance rollback
  smp/hotplug: Add state diagram
2017-10-01 12:34:42 -07:00
Linus Torvalds
7e103ace9c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "The scheduler pull request comes with the following updates:

   - Prevent a divide by zero issue by validating the input value of
     sysctl_sched_time_avg

   - Make task state printing consistent all over the place and have
     explicit state characters for IDLE and PARKED so they wont be
     displayed as 'D' state which confuses tools"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/sysctl: Check user input value of sysctl_sched_time_avg
  sched/debug: Add explicit TASK_PARKED printing
  sched/debug: Ignore TASK_IDLE for SysRq-W
  sched/debug: Add explicit TASK_IDLE printing
  sched/tracing: Use common task-state helpers
  sched/tracing: Fix trace_sched_switch task-state printing
  sched/debug: Remove unused variable
  sched/debug: Convert TASK_state to hex
  sched/debug: Implement consistent task-state printing
2017-10-01 12:10:02 -07:00
Linus Torvalds
1c6f705ba2 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:

 - Prevent a division by zero in the perf aux buffer handling

 - Sync kernel headers with perf tool headers

 - Fix a build failure in the syscalltbl code

 - Make the debug messages of perf report --call-graph work correctly

 - Make sure that all required perf files are in the MANIFEST for
   container builds

 - Fix the atrr.exclude kernel handling so it respects the
   perf_event_paranoid and the user permissions

 - Make perf test on s390x work correctly

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Only update ->aux_wakeup in non-overwrite mode
  perf test: Fix vmlinux failure on s390x part 2
  perf test: Fix vmlinux failure on s390x
  perf tools: Fix syscalltbl build failure
  perf report: Fix debug messages with --call-graph option
  perf evsel: Fix attr.exclude_kernel setting for default cycles:p
  tools include: Sync kernel ABI headers with tooling headers
  perf tools: Get all of tools/{arch,include}/ in the MANIFEST
2017-10-01 12:06:31 -07:00
Linus Torvalds
1de47f3cb7 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull  locking fixes from Thomas Gleixner:
 "Two fixes for locking:

   - Plug a hole the pi_stat->owner serialization which was changed
     recently and failed to fixup two usage sites.

   - Prevent reordering of the rwsem_has_spinner() check vs the
     decrement of rwsem count in up_write() which causes a missed
     wakeup"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem-xadd: Fix missed wakeup due to reordering of load
  futex: Fix pi_state->owner serialization
2017-10-01 12:02:47 -07:00
Linus Torvalds
3d9d62b99b Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:

 - Add a missing NULL pointer check in free_irq()

 - Fix a memory leak/memory corruption in the generic irq chip

 - Add missing rcu annotations for radix tree access

 - Use ffs instead of fls when extracting data from a chip register in
   the MIPS GIC irq driver

 - Fix the unmasking of IPI interrupts in the MIPS GIC driver so they
   end up at the target CPU and not at CPU0

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irq/generic-chip: Don't replace domain's name
  irqdomain: Add __rcu annotations to radix tree accessors
  irqchip/mips-gic: Use effective affinity to unmask
  irqchip/mips-gic: Fix shifts to extract register fields
  genirq: Check __free_irq() return value for NULL
2017-10-01 12:00:56 -07:00
Martin KaFai Lau
721e08dad1 bpf: Fix compiler warning on info.map_ids for 32bit platform
This patch uses u64_to_user_ptr() to cast info.map_ids to a userspace ptr.
It also tags the user_map_ids with '__user' for sparse check.

Fixes: cb4d2b3f03 ("bpf: Add name, load_time, uid and map_ids to bpf_prog_info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 04:09:42 +01:00
Tejun Heo
0d5936344f sched: Implement interface for cgroup unified hierarchy
There are a couple interface issues which can be addressed in cgroup2
interface.

* Stats from cpuacct being reported separately from the cpu stats.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements cpu controller's interface on cgroup2 which
adheres to the controller file conventions described in
Documentation/cgroups/cgroup-v2.txt.  Overall, the following changes
are made.

* cpuacct is implictly enabled and disabled by cpu and its information
  is reported through "cpu.stat" which now uses microseconds for all
  time durations.  All time duration fields now have "_usec" appended
  to them for clarity.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it will be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the orignal scheduler weight range, the dead zones
  on both sides are relatively small and covers wider range than the
  nice value mappings.  This file doesn't make sense in the root
  cgroup and isn't created on root.

* "cpu.weight.nice" is added. When read, it reads back the nice value
  which is closest to the current "cpu.weight".  When written, it sets
  "cpu.weight" to the weight value which matches the nice value.  This
  makes it easy to configure cgroups when they're competing against
  threads in threaded subtrees.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

v4: - Use cgroup2 basic usage stat as the information source instead
      of cpuacct.

v3: - Added "cpu.weight.nice" to allow using nice values when
      configuring the weight.  The feature is requested by PeterZ.
    - Merge the patch to enable threaded support on cpu and cpuacct.
    - Dropped the bits about getting rid of cpuacct from patch
      description as there is a pretty strong case for making cpuacct
      an implicit controller so that basic cpu usage stats are always
      available.
    - Documentation updated accordingly.  "cpu.rt.max" section is
      dropped for now.

v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
      for CFS bandwidth stats and also using raw division for u64.
      Use CONFIG_CFS_BANDWITH and do_div() instead.  "cpu.rt.max" is
      not included yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2017-09-29 14:30:37 -07:00
Tejun Heo
a1f7164c7b sched: Misc preps for cgroup unified hierarchy interface
Make the following changes in preparation for the cpu controller
interface implementation for cgroup2.  This patch doesn't cause any
functional differences.

* s/cpu_stats_show()/cpu_cfs_stat_show()/

* s/cpu_files/cpu_legacy_files/

v2: Dropped cpuacct changes as it won't be used by cpu controller
    interface anymore.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2017-09-29 14:30:36 -07:00
Linus Torvalds
99637e4268 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull waitid fix from Al Viro:
 "Fix infoleak in waitid()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix infoleak in waitid(2)
2017-09-29 12:59:59 -07:00
Al Viro
6c85501f2f fix infoleak in waitid(2)
kernel_waitid() can return a PID, an error or 0.  rusage is filled in the first
case and waitid(2) rusage should've been copied out exactly in that case, *not*
whenever kernel_waitid() has not returned an error.  Compat variant shares that
braino; none of kernel_wait4() callers do, so the below ought to fix it.

Reported-and-tested-by: Alexander Potapenko <glider@google.com>
Fixes: ce72a16fa7 ("wait4(2)/waitid(2): separate copying rusage to userland")
Cc: stable@vger.kernel.org # v4.13
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-29 13:43:15 -04:00
Peter Zijlstra
17de4ee04c sched/fair: Update calc_group_*() comments
I had a wee bit of trouble recalling how the calc_group_runnable()
stuff worked.. add hopefully better comments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Josef Bacik
2c8e4dce79 sched/fair: Calculate runnable_weight slightly differently
Our runnable_weight currently looks like this

runnable_weight = shares * runnable_load_avg / load_avg

The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio.  The problem with this is it biases us towards tasks that never
go to sleep.  Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks.  To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is

runnable_weight = shares * runnable_weight / load_weight

which will make the weight distribution fairer between interactive and
non-interactive groups.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org
Cc: riel@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Peter Zijlstra
9a2dd585b2 sched/fair: Implement more accurate async detach
The problem with the overestimate is that it will subtract too big a
value from the load_sum, thereby pushing it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Peter Zijlstra
f207934fb7 sched/fair: Align PELT windows between cfs_rq and its se
The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.

When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.

Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction:
cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.

This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:16 +02:00
Peter Zijlstra
144d8487bc sched/fair: Implement synchonous PELT detach on load-balance migrate
Vincent wondered why his self migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we uncondionally
take the asynchronous detatch_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.

While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:16 +02:00
Peter Zijlstra
1ea6c46a23 sched/fair: Propagate an effective runnable_load_avg
The load balancer uses runnable_load_avg as load indicator. For
!cgroup this is:

  runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

That is, a direct sum of all runnable tasks on that runqueue. As
opposed to load_avg, which is a sum of all tasks on the runqueue,
which includes a blocked component.

However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.

Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.

Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load avgerage -- which we should not confuse with
the canonical runnable load average.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:15 +02:00
Peter Zijlstra
0e2d2aaaae sched/fair: Rewrite PELT migration propagation
When an entity migrates in (or out) of a runqueue, we need to add (or
remove) its contribution from the entire PELT hierarchy, because even
non-runnable entities are included in the load average sums.

In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):

                     tg->weight * grq->avg.load_avg
  ge->avg.load_avg = ------------------------------
                               tg->load_avg

But that is the expression for ge->weight, and per the definition of
load_avg:

  ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

That destroys the runnable_avg (by setting it to 1) we wanted to
propagate.

Instead directly propagate runnable_sum.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:15 +02:00
Peter Zijlstra
2a2f5d4e44 sched/fair: Rewrite cfs_rq->removed_*avg
Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.

Currently we have 2 atomic variables; which already have the issue
that they can be read out-of-sync. Also, two atomic ops on a single
cacheline is already more expensive than an uncontended lock.

Since we want to add more, convert the thing over to an explicit
cacheline with a lock in.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Vincent Guittot
9059393e4e sched/fair: Use reweight_entity() for set_user_nice()
Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co should do the same, otherwise its possible
to confuse load accounting when we migrate near the weight change.

Fixes-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added changelog, fixed the call condition. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Peter Zijlstra
840c5abca4 sched/fair: More accurate reweight_entity()
When a (group) entity changes it's weight we should instantly change
its load_avg and propagate that change into the sums it is part of.
Because we use these values to predict future behaviour and are not
interested in its historical value.

Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.

With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.

Reported-by: Paul Turner <pjt@google.com>
[josef: compile fix !SMP || !FAIR_GROUP]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Peter Zijlstra
8d5b9025f9 sched/fair: Introduce {en,de}queue_load_avg()
Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
for {en,de}queue_load_avg(). More users will follow.

Includes some code movement to avoid fwd declarations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:13 +02:00
Peter Zijlstra
b5b3e35f41 sched/fair: Rename {en,de}queue_entity_load_avg()
Since they're now purely about runnable_load, rename them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:13 +02:00
Peter Zijlstra
b382a531b9 sched/fair: Move enqueue migrate handling
Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:

 - {en,de}queue_entity_load_avg() will become purely about managing
   runnable_load

 - we can avoid a double update_tg_load_avg() and reduce pressure on
   the global tg->shares cacheline

The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
88c0616ee7 sched/fair: Change update_load_avg() arguments
Most call sites of update_load_avg() already have cfs_rq_of(se)
available, pass it down instead of recomputing it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
c7b5021681 sched/fair: Remove se->load.weight from se->avg.load_sum
Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum.  This prepares for better
reweighting of group entities.

Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().

So for se:

  ___update_load_sum(.weight = 1)
  ___upate_load_avg(.weight = se->load.weight)

and for cfs_rq:

  ___update_load_sum(.weight = cfs_rq->load.weight)
  ___upate_load_avg(.weight = 1)

Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
3d4b60d3e3 sched/fair: Cure calc_cfs_shares() vs. reweight_entity()
Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.

This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approxmation (5) for calc_cfs_shares() results in 0 when the group
is idle.

Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly drop to 0.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:11 +02:00