The comment above update_task_scan_period() says the scan period should
be increased (scanning slows down) if the majority of memory accesses
are on the local node, or if the majority of the page accesses are
shared with other tasks.
However, with the current code, all a high ratio of shared accesses
does is slow down the rate at which scanning is made faster.
This patch changes things so either lots of shared accesses or
lots of local accesses will slow down scanning, and numa scanning
is sped up only when there are lots of private faults on remote
memory pages.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has been already dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg() is called at that time, weight will be 0 and running
will be set which is not possible.
This case happens during pick_next_task_fair() when a cfs_rq becomes idles.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up to call update_blocked_averages()
with these wrong running and runnable states.
Add a test in ___update_load_avg() to correct the running state in this case.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Link: http://lkml.kernel.org/r/1498885573-18984-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
is SIGSTOP'ed state the timestamps go wobbly.
It turns out we indeed fail to correctly account time while in 'OFF'
state and doing IOC_ENABLE without getting scheduled in exposes the
problem.
Further thinking about this problem, it occurred to me that we can
suffer a similar fate when we migrate an uncore event between CPUs.
The perf_event_install() on the 'new' CPU will do add_event_to_ctx()
which will reset all the time stamp, resulting in a subsequent
update_event_times() to overwrite the total_time_* fields with smaller
values.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable the newly added jump opcodes, main parts are in two
different areas, namely direct packet access and dynamic map
value access. For the direct packet access, we now allow for
the following two new patterns to match in order to trigger
markings with find_good_pkt_pointers():
Variant 1 (access ok when taking the branch):
0: (61) r2 = *(u32 *)(r1 +76)
1: (61) r3 = *(u32 *)(r1 +80)
2: (bf) r0 = r2
3: (07) r0 += 8
4: (ad) if r0 < r3 goto pc+2
R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
R3=pkt_end R10=fp
5: (b7) r0 = 0
6: (95) exit
from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
7: (71) r0 = *(u8 *)(r2 +0)
8: (05) goto pc-4
5: (b7) r0 = 0
6: (95) exit
processed 11 insns, stack depth 0
Variant 2 (access ok on fall-through):
0: (61) r2 = *(u32 *)(r1 +76)
1: (61) r3 = *(u32 *)(r1 +80)
2: (bf) r0 = r2
3: (07) r0 += 8
4: (bd) if r3 <= r0 goto pc+1
R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
R3=pkt_end R10=fp
5: (71) r0 = *(u8 *)(r2 +0)
6: (b7) r0 = 1
7: (95) exit
from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
6: (b7) r0 = 1
7: (95) exit
processed 10 insns, stack depth 0
The above two basically just swap the branches where we need
to handle an exception and allow packet access compared to the
two already existing variants for find_good_pkt_pointers().
For the dynamic map value access, we add the new instructions
to reg_set_min_max() and reg_set_min_max_inv() in order to
learn bounds. Verifier test cases for both are added in a
follow-up patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
CPUFREQ_ENTRY_INVALID is a special symbol which is used to specify that
an entry in the cpufreq table is invalid. But using it outside of the
scope of the cpufreq table looks a bit incorrect.
We can represent an invalid frequency by writing it as 0 instead if we
need. Note that it is already done that way for the return value of the
->get() callback.
Lets do the same for ->fast_switch() and not use CPUFREQ_ENTRY_INVALID
outside of the scope of cpufreq table.
Also update the comment over cpufreq_driver_fast_switch() to clearly
mention what this returns.
None of the drivers return CPUFREQ_ENTRY_INVALID as of now from
->fast_switch() callback and so we don't need to update any of those.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 65d8fc777f ("futex: Remove requirement for lock_page() in
get_futex_key()") removed an unnecessary lock_page() with the
side-effect that page->mapping needed to be treated very carefully.
Two defensive warnings were added in case any assumption was missed and
the first warning assumed a correct application would not alter a
mapping backing a futex key. Since merging, it has not triggered for
any unexpected case but Mark Rutland reported the following bug
triggering due to the first warning.
kernel BUG at kernel/futex.c:679!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
Hardware name: linux,dummy-virt (DT)
task: ffff80001e271780 task.stack: ffff000010908000
PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
The fact that it's a bug instead of a warning was due to an unrelated
arm64 problem, but the warning itself triggered because the underlying
mapping changed.
This is an application issue but from a kernel perspective it's a
recoverable situation and the warning is unnecessary so this patch
removes the warning. The warning may potentially be triggered with the
following test program from Mark although it may be necessary to adjust
NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
system.
#include <linux/futex.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#define NR_FUTEX_THREADS 16
pthread_t threads[NR_FUTEX_THREADS];
void *mem;
#define MEM_PROT (PROT_READ | PROT_WRITE)
#define MEM_SIZE 65536
static int futex_wrapper(int *uaddr, int op, int val,
const struct timespec *timeout,
int *uaddr2, int val3)
{
syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}
void *poll_futex(void *unused)
{
for (;;) {
futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
}
}
int main(int argc, char *argv[])
{
int i;
mem = mmap(NULL, MEM_SIZE, MEM_PROT,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
printf("Mapping @ %p\n", mem);
printf("Creating futex threads...\n");
for (i = 0; i < NR_FUTEX_THREADS; i++)
pthread_create(&threads[i], NULL, poll_futex, NULL);
printf("Flipping mapping...\n");
for (;;) {
mmap(mem, MEM_SIZE, MEM_PROT,
MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}
return 0;
}
Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The more detailed value tracking can reduce the effectiveness of pruning
for some programs. So, to avoid rejecting previously valid programs, up
the limit to 128kinsns. Hopefully we will be able to bring this back
down later by improving pruning performance.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows us to, sometimes, combine information from a signed check of one
bound and an unsigned check of the other.
We now track the full range of possible values, rather than restricting
ourselves to [0, 1<<30) and considering anything beyond that as
unknown. While this is probably not necessary, it makes the code more
straightforward and symmetrical between signed and unsigned bounds.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unifies adjusted and unadjusted register value types (e.g. FRAME_POINTER is
now just a PTR_TO_STACK with zero offset).
Tracks value alignment by means of tracking known & unknown bits. This
also replaces the 'reg->imm' (leading zero bits) calculations for (what
were) UNKNOWN_VALUEs.
If pointer leaks are allowed, and adjust_ptr_min_max_vals returns -EACCES,
treat the pointer as an unknown scalar and try again, because we might be
able to conclude something about the result (e.g. pointer & 0x40 is either
0 or 0x40).
Verifier hooks in the netronome/nfp driver were changed to match the new
data structures.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Originally we used a mutex to protect concurrent devmap update
and delete operations from racing with netdev unregister notifier
callbacks.
The notifier hook is needed because we increment the netdev ref
count when a dev is added to the devmap. This ensures the netdev
reference is valid in the datapath. However, we don't want to block
unregister events, hence the initial mutex and notifier handler.
The concern was in the notifier hook we search the map for dev
entries that hold a refcnt on the net device being torn down. But,
in order to do this we require two steps,
(i) dereference the netdev: dev = rcu_dereference(map[i])
(ii) test ifindex: dev->ifindex == removing_ifindex
and then finally we can swap in the NULL dev in the map via an
xchg operation,
xchg(map[i], NULL)
The danger here is a concurrent update could run a different
xchg op concurrently leading us to replace the new dev with a
NULL dev incorrectly.
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
dev = rcu_dereference(map[i])
xchg(map[i]), new_dev);
rcu_call(dev,...)
xchg(map[i], NULL)
The above flow would create the incorrect state with the dev
reference in the update path being lost. To resolve this the
original code used a mutex around the above block. However,
updates, deletes, and lookups occur inside rcu critical sections
so we can't use a mutex in this context safely.
Fortunately, by writing slightly better code we can avoid the
mutex altogether. If CPU 1 in the above example uses a cmpxchg
and _only_ replaces the dev reference in the map when it is in
fact the expected dev the race is removed completely. The two
cases being illustrated here, first the race condition,
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
dev = rcu_dereference(map[i])
xchg(map[i]), new_dev);
rcu_call(dev,...)
odev = cmpxchg(map[i], dev, NULL)
Now we can test the cmpxchg return value, detect odev != dev and
abort. Or in the good case,
CPU 1 CPU 2
notifier hook bpf devmap update
dev = rcu_dereference(map[i])
odev = cmpxchg(map[i], dev, NULL)
[...]
Now 'odev == dev' and we can do proper cleanup.
And viola the original race we tried to solve with a mutex is
corrected and the trace noted by Sasha below is resolved due
to removal of the mutex.
Note: When walking the devmap and removing dev references as needed
we depend on the core to fail any calls to dev_get_by_index() using
the ifindex of the device being removed. This way we do not race with
the user while searching the devmap.
Additionally, the mutex was also protecting list add/del/read on
the list of maps in-use. This patch converts this to an RCU list
and spinlock implementation. This protects the list from concurrent
alloc/free operations. The notifier hook walks this list so it uses
RCU read semantics.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
1 lock held by syz-executor1/16315:
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
#0: (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388
Fixes: 2ddf71e23c ("net: add notifier hooks for devmap bpf map")
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The iovisor/bcc issue #748
(https://github.com/iovisor/bcc/issues/748) documents this issue.
For example, if you try to attach a bpf program to tracepoints
syscalls/sys_enter_newfstat, you will get the following error:
# ./tools/trace.py t:syscalls:sys_enter_newfstat
Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
Failed to attach BPF to tracepoint
The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.
This patch adds bpf support for these syscalls tracepoints by
. permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
. calling bpf programs in perf_syscall_enter and perf_syscall_exit
The legality of bpf program ctx access is also checked.
Function trace_event_get_offsets returns correct max offset for each
specific syscall tracepoint, which is compared against the maximum offset
access in bpf program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-shared-2017-08-07
This series includes some mlx5 updates for both net-next and rdma trees.
From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
- E-Switch (Ethernet SRIOV support).
- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.
From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.
From Rabie,
Increase the maximum supported flow counters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The latest change of compat_sys_sigpending in commit 8f13621abc
("sigpending(): move compat to native") has broken it in two ways.
First, it tries to write 4 bytes more than userspace expects:
sizeof(old_sigset_t) == sizeof(long) == 8 instead of
sizeof(compat_old_sigset_t) == sizeof(u32) == 4.
Second, on big endian architectures these bytes are being written in the
wrong order.
This bug was found by strace test suite.
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Inspired-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Fixes: 8f13621abc ("sigpending(): move compat to native")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify the ACPI system sleep support setup code to select
suspend-to-idle as the default system sleep state if
(1) the ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and
(2) the Low Power Idle S0 _DSM interface has been discovered and
(3) the default sleep state was not selected from the kernel command
line.
The main motivation for this change is that systems where the (1) and
(2) conditions are met typically ship with OSes that don't exercise
the S3 path in the platform firmware which remains untested and turns
out to be non-functional at least in some cases.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
Pull timer fix from Thomas Gleixner:
"A single fix for a multiplication overflow in the timer code on 32bit
systems"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix overflow in get_next_timer_interrupt
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each css_set directly points to the default cgroup it belongs to, so
there's no reason to walk the cgrp_links list on the default
hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
A cgroup can consume resources even after being deleted by a user.
For example, writing back dirty pages should be accounted and
limited, despite the corresponding cgroup might contain no processes
and being deleted by a user.
In the current implementation a cgroup can remain in such "dying" state
for an undefined amount of time. For instance, if a memory cgroup
contains a pge, mlocked by a process belonging to an other cgroup.
Although the lifecycle of a dying cgroup is out of user's control,
it's important to have some insight of what's going on under the hood.
In particular, it's handy to have a counter which will allow
to detect css leaks.
To solve this problem, add a cgroup.stat interface to
the base cgroup control files with the following metrics:
nr_descendants total number of visible descendant cgroups
nr_dying_descendants total number of dying descendant cgroups
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Creating cgroup hierearchies of unreasonable size can affect
overall system performance. A user might want to limit the
size of cgroup hierarchy. This is especially important if a user
is delegating some cgroup sub-tree.
To address this issue, introduce an ability to control
the size of cgroup hierarchy.
The cgroup.max.descendants control file allows to set the maximum
allowed number of descendant cgroups.
The cgroup.max.depth file controls the maximum depth of the cgroup
tree. Both are single value r/w files, with "max" default value.
The control files exist on each hierarchy level (including root).
When a new cgroup is created, we check the total descendants
and depth limits on each level, and if none of them are exceeded,
a new cgroup is created.
Only alive cgroups are counted, removed (dying) cgroups are
ignored.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Chunyu Hu reported:
"per_cpu trace directories and files are created for all possible cpus,
but only the cpus which have ever been on-lined have their own per cpu
ring buffer (allocated by cpuhp threads). While trace_buffers_open, the
open handler for trace file 'trace_pipe_raw' is always trying to access
field of ring_buffer_per_cpu, and would panic with the NULL pointer.
Align the behavior of trace_pipe_raw with trace_pipe, that returns -NODEV
when openning it if that cpu does not have trace ring buffer.
Reproduce:
cat /sys/kernel/debug/tracing/per_cpu/cpu31/trace_pipe_raw
(cpu31 is never on-lined, this is a 16 cores x86_64 box)
Tested with:
1) boot with maxcpus=14, read trace_pipe_raw of cpu15.
Got -NODEV.
2) oneline cpu15, read trace_pipe_raw of cpu15.
Get the raw trace data.
Call trace:
[ 5760.950995] RIP: 0010:ring_buffer_alloc_read_page+0x32/0xe0
[ 5760.961678] tracing_buffers_read+0x1f6/0x230
[ 5760.962695] __vfs_read+0x37/0x160
[ 5760.963498] ? __vfs_read+0x5/0x160
[ 5760.964339] ? security_file_permission+0x9d/0xc0
[ 5760.965451] ? __vfs_read+0x5/0x160
[ 5760.966280] vfs_read+0x8c/0x130
[ 5760.967070] SyS_read+0x55/0xc0
[ 5760.967779] do_syscall_64+0x67/0x150
[ 5760.968687] entry_SYSCALL64_slow_path+0x25/0x25"
This was introduced by the addition of the feature to reuse reader pages
instead of re-allocating them. The problem is that the allocation of a
reader page (which is per cpu) does not check if the cpu is online and set
up for the ring buffer.
Link: http://lkml.kernel.org/r/1500880866-1177-1-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The clear_boot_tracer function is used to reset the default_bootup_tracer
string to prevent it from being accessed after boot, as it originally points
to init data. But since clear_boot_tracer() is called via the
init_lateinit() call, it races with the initcall for registering the hwlat
tracer. If someone adds "ftrace=hwlat" to the kernel command line, depending
on how the linker sets up the text, the saved command line may be cleared,
and the hwlat tracer never is initialized.
Simply have the clear_boot_tracer() be called by initcall_lateinit_sync() as
that's for tasks to be called after lateinit.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196551
Cc: stable@vger.kernel.org
Fixes: e7c15cd8a ("tracing: Added hardware latency tracer")
Reported-by: Zamir SUN <sztsian@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support. The hardware uses RMIDs to
track the llc occupancy of tasks and these RMIDs are per package. This
makes monitoring a hierarchy like cgroup along with monitoring of tasks
separately difficult and several patches sent to lkml to fix them were
NACKed. Further more, the following issues in the current perf cqm make
it almost unusable:
1. No support to monitor the same group of tasks for which we do
allocation using resctrl.
2. It gives random and inaccurate data (mostly 0s) once we run out
of RMIDs due to issues in Recycling.
3. Recycling results in inaccuracy of data because we cannot
guarantee that the RMID was stolen from a task when it was not
pulling data into cache or even when it pulled the least data. Also
for monitoring llc_occupancy, if we stop using an RMID_x and then
start using an RMID_y after we reclaim an RMID from an other event,
we miss accounting all the occupancy that was tagged to RMID_x at a
later perf_count.
2. Recycling code makes the monitoring code complex including
scheduling because the event can lose RMID any time. Since MBM
counters count bandwidth for a period of time by taking snap shot of
total bytes at two different times, recycling complicates the way we
count MBM in a hierarchy. Also we need a spin lock while we do the
processing to account for MBM counter overflow. We also currently
use a spin lock in scheduling to prevent the RMID from being taken
away.
4. Lack of support when we run different kind of event like task,
system-wide and cgroup events together. Data mostly prints 0s. This
is also because we can have only one RMID tied to a cpu as defined
by the cqm hardware but a perf can at the same time tie multiple
events during one sched_in.
5. No support of monitoring a group of tasks. There is partial support
for cgroup but it does not work once there is a hierarchy of cgroups
or if we want to monitor a task in a cgroup and the cgroup itself.
6. No support for monitoring tasks for the lifetime without perf
overhead.
7. It reported the aggregate cache occupancy or memory bandwidth over
all sockets. But most cloud and VMM based use cases want to know the
individual per-socket usage.
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.
This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.
The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.
The intel_pstate driver is updated to always reject remote callbacks.
This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.
The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:
- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
next tick.
Based on initial work from Steve Muckle.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>