Simple cases of overlapping changes in the packet scheduler.
Must easier to resolve this time.
Which probably means that I screwed it up somehow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull final power management fixes from Rafael Wysocki:
"These fix a regression in the schedutil cpufreq governor introduced by
a recent change and blacklist Dell XPS13 9360 from using the Low Power
S0 Idle _DSM interface which triggers serious problems on one of these
machines.
Specifics:
- Prevent the schedutil cpufreq governor from using the utilization
of a wrong CPU in some cases which started to happen after one of
the recent changes in it (Chris Redpath).
- Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
interface as that causes serious issue (related to NVMe) to appear
on one of these machines, even though the other Dells XPS13 9360 in
somewhat different HW configurations behave correctly (Rafael
Wysocki)"
* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
cpufreq: schedutil: Examine the correct CPU when we update util
When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
all SCHED_FEAT are turned into compile time constants being propagated
to support compiler optimizations.
Specifically, we expect that code blocks like this:
if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
/* FEATURE CODE */
}
are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus
being removed by the compiler from the finale image.
For this mechanism to properly work it's required for the compiler to
have full access, from each translation unit, to whatever is the value
defined by the sched_feat macro. This macro is defined as:
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
and thus, the compiler can optimize that code only if the value of
sysctl_sched_features is visible within each translation unit.
Since:
029632fbb ("sched: Make separate sched*.c translation units")
the scheduler code has been split into separate translation units
however the definition of sysctl_sched_features is part of
kernel/sched/core.c while, for all the other scheduler modules, it is
visible only via kernel/sched/sched.h as an:
extern const_debug unsigned int sysctl_sched_features
Unfortunately, an extern reference does not allow the compiler to apply
constants propagation. Thus, on !CONFIG_SCHED_DEBUG kernel we still end up
with code to load a memory reference and (eventually) doing an unconditional
jump of a chunk of code.
This mechanism is unavoidable when sched_features can be turned on and off at
run-time. However, this is not the case for "production" kernels compiled with
!CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
which cannot be changed at run-time and thus memory loads and jumps can be
avoided altogether.
This patch fixes the case of !CONFIG_SCHED_DEBUG kernel by declaring a local version
of the sysctl_sched_features constant for each translation unit. This will
ultimately allow the compiler to perform constants propagation and dead-code
pruning.
Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
the patch, by running 30 iterations of:
perf bench sched messaging --pipe --thread --group 4 --loop 50000
on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
powersave governor to rule out variations due to frequency scaling.
Statistics on the reported completion time:
count mean std min 99% max
v4.14-rc8 30.0 15.7831 0.176032 15.442 16.01226 16.014
v4.14-rc8+patch 30.0 15.5033 0.189681 15.232 15.93938 15.962
... show a 1.8% speedup on average completion time and 0.5% speedup in the
99 percentile.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.
Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then see the CPU wasn't idle recently and choose to keep the next
freq as 1.2 GHz.
- Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
1.2 GHz.
- Now if the utilization doesn't change in then next request, then the
next target frequency will still be 780 MHz and it will match with
cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Problem: This flag does not get cleared currently in the suspend or
resume path in the following cases:
* In case some driver's suspend routine returns an error.
* Successful s2idle case
* etc?
Why is this a problem: What happens is that the next suspend attempt
could fail even though the user did not enable the flag by writing to
/sys/power/wakeup_count. This is 1 use case how the issue can be seen
(but similar use case with driver suspend failure can be thought of):
1. Read /sys/power/wakeup_count
2. echo count > /sys/power/wakeup_count
3. echo freeze > /sys/power/wakeup_count
4. Let the system suspend, and wakeup the system using some wake source
that calls pm_wakeup_event() e.g. power button or something.
5. Note that the combined wakeup count would be incremented due
to the pm_wakeup_event() in the resume path.
6. After resuming the events_check_enabled flag is still set.
At this point if the user attempts to freeze again (without writing to
/sys/power/wakeup_count), the suspend would fail even though there has
been no wake event since the past resume.
Address that by clearing the flag just before a resume is completed,
so that it is always cleared for the corner cases mentioned above.
Signed-off-by: Rajat Jain <rajatja@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Not only is it annoying to have one single flag for all pointers, as if
that was a global choice and all kernel pointers are the same, but %pK
can't get the 'access' vs 'open' time check right anyway.
So make the /proc/kallsyms pointer value code use logic specific to that
particular file. We do continue to honor kptr_restrict, but the default
(which is unrestricted) is changed to instead take expected users into
account, and restrict access by default.
Right now the only actual expected user is kernel profiling, which has a
separate sysctl flag for kernel profile access. There may be others.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A static variable sig_enforce is used as status var to indicate the real
value of CONFIG_MODULE_SIG_FORCE, once this one is set the var will hold
true, but if the CONFIG is not set the status var will hold whatever
value is present in the module.sig_enforce kernel cmdline param: true
when =1 and false when =0 or not present.
Considering this cmdline param take place over the CONFIG value when
it's not set, other places in the kernel could misbehave since they
would have only the CONFIG_MODULE_SIG_FORCE value to rely on. Exporting
this status var allows the kernel to rely in the effective value of
module signature enforcement, being it from CONFIG value or cmdline
param.
Signed-off-by: Bruno E. O. Meneguele <brdeoliv@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Currently, all the lock waiters entering the slowpath will do one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as good as unfair locks in many cases.
On the other hands, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.
This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode waiting in the queue for its turn.
On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
best time selected and <n> is the number of vCPUs available. The build
times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
with various number of vCPUs are as follows:
vCPUs pvqlock hybrid pvqlock unfair lock
----- ------- -------------- -----------
30 342.1s 329.1s 329.1s
36 314.1s 305.3s 307.3s
45 345.0s 302.1s 306.6s
54 365.4s 308.6s 307.8s
72 358.9s 293.6s 303.9s
108 343.0s 285.9s 304.2s
The hybrid pvqspinlock performs better or comparable to the unfair
lock.
By turning on QUEUED_LOCK_STAT, the table below showed the number
of lock acquisitions in unfair mode and queue mode after a kernel
build with various number of vCPUs.
vCPUs queued mode unfair mode
----- ----------- -----------
30 9,130,518 294,954
36 10,856,614 386,809
45 8,467,264 11,475,373
54 6,409,987 19,670,855
72 4,782,063 25,712,180
It can be seen that as the VM became more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best overall performance as possible.
Using a kernel locking microbenchmark with number of locking
threads equals to the number of vCPUs available on the same machine,
the minimum, average and maximum (min/avg/max) numbers of locking
operations done per thread in a 5-second testing interval are shown
below:
vCPUs hybrid pvqlock unfair lock
----- -------------- -----------
36 822,135/881,063/950,363 75,570/313,496/ 690,465
54 542,435/581,664/625,937 35,460/204,280/ 457,172
72 397,500/428,177/499,299 17,933/150,679/ 708,001
108 257,898/288,150/340,871 3,085/181,176/1,257,109
It can be seen that the hybrid pvqspinlocks are more fair and
performant than the unfair locks in this test.
The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.
vCPUs pvqlock hybrid pvqlock unfair lock
----- ------- -------------- -----------
16 436.8s 433.4s 435.6s
36 366.2s 364.8s 364.5s
48 423.6s 376.3s 370.2s
64 433.1s 376.6s 376.8s
Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510089486-3466-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use alternatives_text_reserved() to check if the address is in
the fixed pieces of alternative reserved, but the problem is that
we don't hold the smp_alt mutex when call this function. So the list
traversal may encounter a deleted list_head if another path is doing
alternatives_smp_module_del().
One solution is that we can hold smp_alt mutex before call this
function, but the difficult point is that the callers of this
functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(),
are called inside the text_mutex. So we must hold smp_alt mutex
before we go into these arch dependent code. But we can't now,
the smp_alt mutex is the arch dependent part, only x86 has it.
Maybe we can export another arch dependent callback to solve this.
But there is a simpler way to handle this problem. We can reuse the
text_mutex to protect smp_alt_modules instead of using another mutex.
And all the arch dependent checks of kprobes are inside the text_mutex,
so it's safe now.
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions"
Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull workqueue fix from Tejun Heo:
"Another fix for a really old bug.
It only affects drain_workqueue() which isn't used often and even then
triggers only during a pretty small race window, so it isn't too
surprising that it stayed hidden for so long.
The fix is straight-forward and low-risk. Kudos to Li Bin for
reporting and fixing the bug"
* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix NULL pointer dereference
The active development of cgroups v2 sometimes leads to a creation
of interfaces, which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace, which
cgroup v2 features are supported without calculating it based
on the kernel version. So, let's export the list of such features
using /sys/kernel/cgroup/features pseudo-file.
The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.
Example:
$ cat /sys/kernel/cgroup/features
nsdelegate
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and likely to be extended in the future).
To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.
Format is siple: each control file name is printed on a new line.
Example:
$ cat /sys/kernel/cgroup/delegate
cgroup.procs
cgroup.subtree_control
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull perf fixes from Ingo Molnar:
"Various fixes:
- synchronize kernel and tooling headers
- cgroup support fix
- two tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/headers: Synchronize kernel ABI headers
perf/cgroup: Fix perf cgroup hierarchy support
perf tools: Unwind properly location after REJECT
perf symbols: Fix memory corruption because of zero length symbols
Cgroup v2 lacks the device controller, provided by cgroup v1.
This patch adds a new eBPF program type, which in combination
of previously added ability to attach multiple eBPF programs
to a cgroup, will provide a similar functionality, but with some
additional flexibility.
This patch introduces a BPF_PROG_TYPE_CGROUP_DEVICE program type.
A program takes major and minor device numbers, device type
(block/character) and access type (mknod/read/write) as parameters
and returns an integer which defines if the operation should be
allowed or terminated with -EPERM.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend struct bpf_prog_info to contain information about program
being bound to a device. Since the netdev may get destroyed while
program still exists we need a flag to indicate the program is
loaded for a device, even if the device is gone.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fact that we don't know which device the program is going
to be used on is quite limiting in current eBPF infrastructure.
We have to reverse or limit the changes which kernel makes to
the loaded bytecode if we want it to be offloaded to a networking
device. We also have to invent new APIs for debugging and
troubleshooting support.
Make it possible to load programs for a specific netdev. This
helps us to bring the debug information closer to the core
eBPF infrastructure (e.g. we will be able to reuse the verifer
log in device JIT). It allows device JITs to perform translation
on the original bytecode.
__bpf_prog_get() when called to get a reference for an attachment
point will now refuse to give it if program has a device assigned.
Following patches will add a version of that function which passes
the expected netdev in. @type argument in __bpf_prog_get() is
renamed to attach_type to make it clearer that it's only set on
attachment.
All calls to ndo_bpf are protected by rtnl, only verifier callbacks
are not. We need a wait queue to make sure netdev doesn't get
destroyed while verifier is still running and calling its driver.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 674e75411f (sched: cpufreq: Allow remote cpufreq
callbacks) we stopped to always read the utilization for the CPU we
are running the governor on, and instead we read it for the CPU
which we've been told has updated utilization. This is stored in
sugov_cpu->cpu.
The value is set in sugov_register() but we clear it in sugov_start()
which leads to always looking at the utilization of CPU0 instead of
the correct one.
Fix this by consolidating the initialization code into sugov_start().
Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two arm64-specific prctls, to permit userspace to
control its vector length:
* PR_SVE_SET_VL: set the thread's SVE vector length and vector
length inheritance mode.
* PR_SVE_GET_VL: get the same information.
Although these prctls resemble instruction set features in the SVE
architecture, they provide additional control: the vector length
inheritance mode is Linux-specific and nothing to do with the
architecture, and the architecture does not permit EL0 to set its
own vector length directly. Both can be used in portable tools
without requiring the use of SVE instructions.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: Fixed up prctl constants to avoid clash with PDEATHSIG]
Signed-off-by: Will Deacon <will.deacon@arm.com>
This reverts commit b5149873a0.
It conflicts with the following isolcpus change from the sched branch.
edb9382175 ("sched/isolation: Move isolcpus= handling to the housekeeping code")
Let's revert for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
The bpf_verifer_ops array is generated dynamically and may be
empty depending on configuration, which then causes an out
of bounds access:
kernel/bpf/verifier.c: In function 'bpf_check':
kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds [-Werror=array-bounds]
This adds a check to the start of the function as a workaround.
I would assume that the function is never called in that configuration,
so the warning is probably harmless.
Fixes: 00176a34d9 ("bpf: remove the verifier ops from program structure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>