Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki: "The most significant change here is the extension of the Energy Model to cover non-CPU devices (as well as CPUs) from Lukasz Luba. There is also some new hardware support (Ice Lake server idle states table for intel_idle, Sapphire Rapids and Power Limit 4 support in the RAPL driver), some new functionality in the existing drivers (eg. a new switch to disable/enable CPU energy-efficiency optimizations in intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci, devfreq), including the elimination of W=1 build warnings from cpufreq done by Lee Jones. Specifics: - Make the Energy Model cover non-CPU devices (Lukasz Luba). - Add Ice Lake server idle states table to the intel_idle driver and eliminate a redundant static variable from it (Chen Yu, Rafael Wysocki). - Eliminate all W=1 build warnings from cpufreq (Lee Jones). - Add support for Sapphire Rapids and for Power Limit 4 to the Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui). - Fix function name in kerneldoc comments in the idle_inject power capping driver (Yangtao Li). - Fix locking issues with cpufreq governors and drop a redundant "weak" function definition from cpufreq (Viresh Kumar). - Rearrange cpufreq to register non-modular governors at the core_initcall level and allow the default cpufreq governor to be specified in the kernel command line (Quentin Perret). - Extend, fix and clean up the intel_pstate driver (Srinivas Pandruvada, Rafael Wysocki): * Add a new sysfs attribute for disabling/enabling CPU energy-efficiency optimizations in the processor. * Make the driver avoid enabling HWP if EPP is not supported. * Allow the driver to handle numeric EPP values in the sysfs interface and fix the setting of EPP via sysfs in the active mode. * Eliminate a static checker warning and clean up a kerneldoc comment. - Clean up some variable declarations in the powernv cpufreq driver (Wei Yongjun). - Fix up the ->enter_s2idle callback definition to cover the case when it points to the same function as ->idle correctly (Neal Liu). - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson). - Make the PM core emit "changed" uevent when adding/removing the "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi). - Add a helper macro for declaring PM callbacks and use it in the MMC jz4740 driver (Paul Cercueil). - Fix white space in some places in the hibernate code and make the system-wide PM code use "const char *" where appropriate (Xiang Chen, Alexey Dobriyan). - Add one more "unsafe" helper macro to the freezer to cover the NFS use case (He Zhe). - Change the language in the generic PM domains framework to use parent/child terminology and clean up a typo and some comment fromatting in that code (Kees Cook, Geert Uytterhoeven). - Update the operating performance points OPP framework (Lukasz Luba, Andrew-sh.Cheng, Valdis Kletnieks): * Refactor dev_pm_opp_of_register_em() and update related drivers. * Add a missing function export. * Allow disabled OPPs in dev_pm_opp_get_freq(). - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier): * Add support for delayed timers to the devfreq core and make the Samsung exynos5422-dmc driver use it. * Unify sysfs interface to use "df-" as a prefix in instance names consistently. * Fix devfreq_summary debugfs node indentation. * Add the rockchip,pmu phandle to the rk3399_dmc driver DT bindings. * List Dmitry Osipenko as the Tegra devfreq driver maintainer. * Fix typos in the core devfreq code. - Update the pm-graph utility to version 5.7 including a number of fixes related to suspend-to-idle (Todd Brandt). - Fix coccicheck errors and warnings in the cpupower utility (Shuah Khan). - Replace HTTP links with HTTPs ones in multiple places (Alexander A. Klimov)" * tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits) cpuidle: ACPI: fix 'return' with no value build warning cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode cpufreq: intel_pstate: Rearrange the storing of new EPP values intel_idle: Customize IceLake server support PM / devfreq: Fix the wrong end with semicolon PM / devfreq: Fix indentaion of devfreq_summary debugfs node PM / devfreq: Clean up the devfreq instance name in sysfs attr memory: samsung: exynos5422-dmc: Add module param to control IRQ mode memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold memory: samsung: exynos5422-dmc: Use delayed timer as default PM / devfreq: Add support delayed timer for polling mode dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle PM / devfreq: tegra: Add Dmitry as a maintainer PM / devfreq: event: Fix trivial spelling PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent cpuidle: change enter_s2idle() prototype cpuidle: psci: Prevent domain idlestates until consumers are ready cpuidle: psci: Convert PM domain to platform driver cpuidle: psci: Fix error path via converting to a platform driver cpuidle: psci: Fail cpuidle registration if set OSI mode failed ...
此提交包含在:
@@ -108,3 +108,15 @@ Description:
|
||||
frequency requested by governors and min_freq.
|
||||
The max_freq overrides min_freq because max_freq may be
|
||||
used to throttle devices to avoid overheating.
|
||||
|
||||
What: /sys/class/devfreq/.../timer
|
||||
Date: July 2020
|
||||
Contact: Chanwoo Choi <cw00.choi@samsung.com>
|
||||
Description:
|
||||
This ABI shows and stores the kind of work timer by users.
|
||||
This work timer is used by devfreq workqueue in order to
|
||||
monitor the device status such as utilization. The user
|
||||
can change the work timer on runtime according to their demand
|
||||
as following:
|
||||
echo deferrable > /sys/class/devfreq/.../timer
|
||||
echo delayed > /sys/class/devfreq/.../timer
|
||||
|
@@ -703,6 +703,11 @@
|
||||
cpufreq.off=1 [CPU_FREQ]
|
||||
disable the cpufreq sub-system
|
||||
|
||||
cpufreq.default_governor=
|
||||
[CPU_FREQ] Name of the default cpufreq governor or
|
||||
policy to use. This governor must be registered in the
|
||||
kernel before the cpufreq driver probes.
|
||||
|
||||
cpu_init_udelay=N
|
||||
[X86] Delay for N microsec between assert and de-assert
|
||||
of APIC INIT to start processors. This delay occurs
|
||||
|
@@ -147,9 +147,9 @@ CPUs in it.
|
||||
|
||||
The next major initialization step for a new policy object is to attach a
|
||||
scaling governor to it (to begin with, that is the default scaling governor
|
||||
determined by the kernel configuration, but it may be changed later
|
||||
via ``sysfs``). First, a pointer to the new policy object is passed to the
|
||||
governor's ``->init()`` callback which is expected to initialize all of the
|
||||
determined by the kernel command line or configuration, but it may be changed
|
||||
later via ``sysfs``). First, a pointer to the new policy object is passed to
|
||||
the governor's ``->init()`` callback which is expected to initialize all of the
|
||||
data structures necessary to handle the given policy and, possibly, to add
|
||||
a governor ``sysfs`` interface to it. Next, the governor is started by
|
||||
invoking its ``->start()`` callback.
|
||||
|
@@ -431,6 +431,17 @@ argument is passed to the kernel in the command line.
|
||||
supported in the current configuration, writes to this attribute will
|
||||
fail with an appropriate error.
|
||||
|
||||
``energy_efficiency``
|
||||
This attribute is only present on platforms, which have CPUs matching
|
||||
Kaby Lake or Coffee Lake desktop CPU model. By default
|
||||
energy efficiency optimizations are disabled on these CPU models in HWP
|
||||
mode by this driver. Enabling energy efficiency may limit maximum
|
||||
operating frequency in both HWP and non HWP mode. In non HWP mode,
|
||||
optimizations are done only in the turbo frequency range. In HWP mode,
|
||||
optimizations are done in the entire frequency range. Setting this
|
||||
attribute to "1" enables energy efficiency optimizations and setting
|
||||
to "0" disables energy efficiency optimizations.
|
||||
|
||||
Interpretation of Policy Attributes
|
||||
-----------------------------------
|
||||
|
||||
@@ -554,7 +565,11 @@ somewhere between the two extremes:
|
||||
Strings written to the ``energy_performance_preference`` attribute are
|
||||
internally translated to integer values written to the processor's
|
||||
Energy-Performance Preference (EPP) knob (if supported) or its
|
||||
Energy-Performance Bias (EPB) knob.
|
||||
Energy-Performance Bias (EPB) knob. It is also possible to write a positive
|
||||
integer value between 0 to 255, if the EPP feature is present. If the EPP
|
||||
feature is not present, writing integer value to this attribute is not
|
||||
supported. In this case, user can use
|
||||
"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.
|
||||
|
||||
[Note that tasks may by migrated from one CPU to another by the scheduler's
|
||||
load-balancing algorithm and if different energy vs performance hints are
|
||||
|
@@ -18,6 +18,8 @@ Optional properties:
|
||||
format depends on the interrupt controller.
|
||||
It should be a DCF interrupt. When DDR DVFS finishes
|
||||
a DCF interrupt is triggered.
|
||||
- rockchip,pmu: Phandle to the syscon managing the "PMU general register
|
||||
files".
|
||||
|
||||
Following properties relate to DDR timing:
|
||||
|
||||
|
@@ -1,15 +1,17 @@
|
||||
====================
|
||||
Energy Model of CPUs
|
||||
====================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======================
|
||||
Energy Model of devices
|
||||
=======================
|
||||
|
||||
1. Overview
|
||||
-----------
|
||||
|
||||
The Energy Model (EM) framework serves as an interface between drivers knowing
|
||||
the power consumed by CPUs at various performance levels, and the kernel
|
||||
the power consumed by devices at various performance levels, and the kernel
|
||||
subsystems willing to use that information to make energy-aware decisions.
|
||||
|
||||
The source of the information about the power consumed by CPUs can vary greatly
|
||||
The source of the information about the power consumed by devices can vary greatly
|
||||
from one platform to another. These power costs can be estimated using
|
||||
devicetree data in some cases. In others, the firmware will know better.
|
||||
Alternatively, userspace might be best positioned. And so on. In order to avoid
|
||||
@@ -25,7 +27,7 @@ framework, and interested clients reading the data from it::
|
||||
+---------------+ +-----------------+ +---------------+
|
||||
| Thermal (IPA) | | Scheduler (EAS) | | Other |
|
||||
+---------------+ +-----------------+ +---------------+
|
||||
| | em_pd_energy() |
|
||||
| | em_cpu_energy() |
|
||||
| | em_cpu_get() |
|
||||
+---------+ | +---------+
|
||||
| | |
|
||||
@@ -35,7 +37,7 @@ framework, and interested clients reading the data from it::
|
||||
| Framework |
|
||||
+---------------------+
|
||||
^ ^ ^
|
||||
| | | em_register_perf_domain()
|
||||
| | | em_dev_register_perf_domain()
|
||||
+----------+ | +---------+
|
||||
| | |
|
||||
+---------------+ +---------------+ +--------------+
|
||||
@@ -47,12 +49,12 @@ framework, and interested clients reading the data from it::
|
||||
| Device Tree | | Firmware | | ? |
|
||||
+--------------+ +---------------+ +--------------+
|
||||
|
||||
The EM framework manages power cost tables per 'performance domain' in the
|
||||
system. A performance domain is a group of CPUs whose performance is scaled
|
||||
together. Performance domains generally have a 1-to-1 mapping with CPUFreq
|
||||
policies. All CPUs in a performance domain are required to have the same
|
||||
micro-architecture. CPUs in different performance domains can have different
|
||||
micro-architectures.
|
||||
In case of CPU devices the EM framework manages power cost tables per
|
||||
'performance domain' in the system. A performance domain is a group of CPUs
|
||||
whose performance is scaled together. Performance domains generally have a
|
||||
1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are
|
||||
required to have the same micro-architecture. CPUs in different performance
|
||||
domains can have different micro-architectures.
|
||||
|
||||
|
||||
2. Core APIs
|
||||
@@ -70,14 +72,16 @@ CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
|
||||
Drivers are expected to register performance domains into the EM framework by
|
||||
calling the following API::
|
||||
|
||||
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
|
||||
struct em_data_callback *cb);
|
||||
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
|
||||
struct em_data_callback *cb, cpumask_t *cpus);
|
||||
|
||||
Drivers must specify the CPUs of the performance domains using the cpumask
|
||||
argument, and provide a callback function returning <frequency, power> tuples
|
||||
for each capacity state. The callback function provided by the driver is free
|
||||
Drivers must provide a callback function returning <frequency, power> tuples
|
||||
for each performance state. The callback function provided by the driver is free
|
||||
to fetch data from any relevant location (DT, firmware, ...), and by any mean
|
||||
deemed necessary. See Section 3. for an example of driver implementing this
|
||||
deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
|
||||
performance domains using cpumask. For other devices than CPUs the last
|
||||
argument must be set to NULL.
|
||||
See Section 3. for an example of driver implementing this
|
||||
callback, and kernel/power/energy_model.c for further documentation on this
|
||||
API.
|
||||
|
||||
@@ -85,13 +89,20 @@ API.
|
||||
2.3 Accessing performance domains
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
There are two API functions which provide the access to the energy model:
|
||||
em_cpu_get() which takes CPU id as an argument and em_pd_get() with device
|
||||
pointer as an argument. It depends on the subsystem which interface it is
|
||||
going to use, but in case of CPU devices both functions return the same
|
||||
performance domain.
|
||||
|
||||
Subsystems interested in the energy model of a CPU can retrieve it using the
|
||||
em_cpu_get() API. The energy model tables are allocated once upon creation of
|
||||
the performance domains, and kept in memory untouched.
|
||||
|
||||
The energy consumed by a performance domain can be estimated using the
|
||||
em_pd_energy() API. The estimation is performed assuming that the schedutil
|
||||
CPUfreq governor is in use.
|
||||
em_cpu_energy() API. The estimation is performed assuming that the schedutil
|
||||
CPUfreq governor is in use in case of CPU device. Currently this calculation is
|
||||
not provided for other type of devices.
|
||||
|
||||
More details about the above APIs can be found in include/linux/energy_model.h.
|
||||
|
||||
@@ -106,42 +117,46 @@ EM framework::
|
||||
|
||||
-> drivers/cpufreq/foo_cpufreq.c
|
||||
|
||||
01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
|
||||
02 {
|
||||
03 long freq, power;
|
||||
04
|
||||
05 /* Use the 'foo' protocol to ceil the frequency */
|
||||
06 freq = foo_get_freq_ceil(cpu, *KHz);
|
||||
07 if (freq < 0);
|
||||
08 return freq;
|
||||
09
|
||||
10 /* Estimate the power cost for the CPU at the relevant freq. */
|
||||
11 power = foo_estimate_power(cpu, freq);
|
||||
12 if (power < 0);
|
||||
13 return power;
|
||||
14
|
||||
15 /* Return the values to the EM framework */
|
||||
16 *mW = power;
|
||||
17 *KHz = freq;
|
||||
18
|
||||
19 return 0;
|
||||
20 }
|
||||
21
|
||||
22 static int foo_cpufreq_init(struct cpufreq_policy *policy)
|
||||
23 {
|
||||
24 struct em_data_callback em_cb = EM_DATA_CB(est_power);
|
||||
25 int nr_opp, ret;
|
||||
26
|
||||
27 /* Do the actual CPUFreq init work ... */
|
||||
28 ret = do_foo_cpufreq_init(policy);
|
||||
29 if (ret)
|
||||
30 return ret;
|
||||
31
|
||||
32 /* Find the number of OPPs for this policy */
|
||||
33 nr_opp = foo_get_nr_opp(policy);
|
||||
34
|
||||
35 /* And register the new performance domain */
|
||||
36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
|
||||
37
|
||||
38 return 0;
|
||||
39 }
|
||||
01 static int est_power(unsigned long *mW, unsigned long *KHz,
|
||||
02 struct device *dev)
|
||||
03 {
|
||||
04 long freq, power;
|
||||
05
|
||||
06 /* Use the 'foo' protocol to ceil the frequency */
|
||||
07 freq = foo_get_freq_ceil(dev, *KHz);
|
||||
08 if (freq < 0);
|
||||
09 return freq;
|
||||
10
|
||||
11 /* Estimate the power cost for the dev at the relevant freq. */
|
||||
12 power = foo_estimate_power(dev, freq);
|
||||
13 if (power < 0);
|
||||
14 return power;
|
||||
15
|
||||
16 /* Return the values to the EM framework */
|
||||
17 *mW = power;
|
||||
18 *KHz = freq;
|
||||
19
|
||||
20 return 0;
|
||||
21 }
|
||||
22
|
||||
23 static int foo_cpufreq_init(struct cpufreq_policy *policy)
|
||||
24 {
|
||||
25 struct em_data_callback em_cb = EM_DATA_CB(est_power);
|
||||
26 struct device *cpu_dev;
|
||||
27 int nr_opp, ret;
|
||||
28
|
||||
29 cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
|
||||
30
|
||||
31 /* Do the actual CPUFreq init work ... */
|
||||
32 ret = do_foo_cpufreq_init(policy);
|
||||
33 if (ret)
|
||||
34 return ret;
|
||||
35
|
||||
36 /* Find the number of OPPs for this policy */
|
||||
37 nr_opp = foo_get_nr_opp(policy);
|
||||
38
|
||||
39 /* And register the new performance domain */
|
||||
40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
|
||||
41
|
||||
42 return 0;
|
||||
43 }
|
||||
|
@@ -167,11 +167,13 @@ For example::
|
||||
package-0
|
||||
---------
|
||||
|
||||
The Intel RAPL technology allows two constraints, short term and long term,
|
||||
with two different time windows to be applied to each power zone. Thus for
|
||||
each zone there are 2 attributes representing the constraint names, 2 power
|
||||
limits and 2 attributes representing the sizes of the time windows. Such that,
|
||||
constraint_j_* attributes correspond to the jth constraint (j = 0,1).
|
||||
Depending on different power zones, the Intel RAPL technology allows
|
||||
one or multiple constraints like short term, long term and peak power,
|
||||
with different time windows to be applied to each power zone.
|
||||
All the zones contain attributes representing the constraint names,
|
||||
power limits and the sizes of the time windows. Note that time window
|
||||
is not applicable to peak power. Here, constraint_j_* attributes
|
||||
correspond to the jth constraint (j = 0,1,2).
|
||||
|
||||
For example::
|
||||
|
||||
@@ -181,6 +183,9 @@ For example::
|
||||
constraint_1_name
|
||||
constraint_1_power_limit_uw
|
||||
constraint_1_time_window_us
|
||||
constraint_2_name
|
||||
constraint_2_power_limit_uw
|
||||
constraint_2_time_window_us
|
||||
|
||||
Power Zone Attributes
|
||||
=====================
|
||||
|
新增問題並參考
封鎖使用者