[ Upstream commit afda85b963c12947e298ad85d757e333aa40fd74 ]
If device_register() returns error in ibmebus_bus_init(), name of kobject
which is allocated in dev_set_name() called in device_add() is leaked.
As comment of device_add() says, it should call put_device() to drop
the reference count that was set in device_initialize() when it fails,
so the name can be freed in kobject_cleanup().
Signed-off-by: ruanjinjie <ruanjinjie@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20221110011929.3709774-1-ruanjinjie@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eac030b22ea12cdfcbb2e941c21c03964403c63f ]
lppaca_shared_proc() takes a pointer to the lppaca which is typically
accessed through get_lppaca(). With DEBUG_PREEMPT enabled, this leads
to checking if preemption is enabled, for example:
BUG: using smp_processor_id() in preemptible [00000000] code: grep/10693
caller is lparcfg_data+0x408/0x19a0
CPU: 4 PID: 10693 Comm: grep Not tainted 6.5.0-rc3 #2
Call Trace:
dump_stack_lvl+0x154/0x200 (unreliable)
check_preemption_disabled+0x214/0x220
lparcfg_data+0x408/0x19a0
...
This isn't actually a problem however, as it does not matter which
lppaca is accessed, the shared proc state will be the same.
vcpudispatch_stats_procfs_init() already works around this by disabling
preemption, but the lparcfg code does not, erroring any time
/proc/powerpc/lparcfg is accessed with DEBUG_PREEMPT enabled.
Instead of disabling preemption on the caller side, rework
lppaca_shared_proc() to not take a pointer and instead directly access
the lppaca, bypassing any potential preemption checks.
Fixes: f13c13a005 ("powerpc: Stop using non-architected shared_proc field in lppaca")
Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Rework to avoid needing a definition in paca.h and lppaca.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230823055317.751786-4-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b277fc793daf258877b4c0744b52f69d6e6ba22e ]
Platform device helper routines won't update the NUMA distance table
while creating a platform device, even if the device is present on a
NUMA node that doesn't have memory or CPU. This is especially true for
pmem devices. If the target node of the pmem device is not online, we
find the nearest online node to the device and associate the pmem device
with that online node. To find the nearest online node, we should have
the numa distance table updated correctly. Update the distance
information during the device probe.
For a papr scm device on NUMA node 3 distance_lookup_table value for
distance_ref_points_depth = 2 before and after fix is below:
Before fix:
node 3 distance depth 0 - 0
node 3 distance depth 1 - 0
node 4 distance depth 0 - 4
node 4 distance depth 1 - 2
node 5 distance depth 0 - 5
node 5 distance depth 1 - 1
After fix
node 3 distance depth 0 - 3
node 3 distance depth 1 - 1
node 4 distance depth 0 - 4
node 4 distance depth 1 - 2
node 5 distance depth 0 - 5
node 5 distance depth 1 - 1
Without the fix, the nearest numa node to the pmem device (NUMA node 3)
will be picked as 4. After the fix, we get the correct numa node which
is 5.
Fixes: da1115fdbd ("powerpc/nvdimm: Pick nearby online node if the device node is not online")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230404041433.1781804-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1c6b5a7e74052768977855f95d6b8812f6e7772c ]
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-6-aneesh.kumar@linux.ibm.com
Stable-dep-of: b277fc793daf ("powerpc/papr_scm: Update the NUMA distance table for the target node")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ef31cb83d19c4c589d650747cd5a7e502be9f665 ]
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-5-aneesh.kumar@linux.ibm.com
Stable-dep-of: b277fc793daf ("powerpc/papr_scm: Update the NUMA distance table for the target node")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8ddc6448ec5a5ef50eaa581a7dec0e12a02850ff ]
The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.
Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.
Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-4-aneesh.kumar@linux.ibm.com
Stable-dep-of: b277fc793daf ("powerpc/papr_scm: Update the NUMA distance table for the target node")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5d08633e5f6564b60f1cbe09af3af40a74d66431 ]
The ibm,get-system-parameter RTAS function may return -2 or 990x,
which indicate that the caller should try again.
lparcfg's parse_system_parameter_string() ignores this, making it
possible to intermittently report incorrect SPLPAR characteristics.
Move the RTAS call into a coventional rtas_busy_delay()-based loop.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230125-b4-powerpc-rtas-queue-v3-4-26929c8cce78@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit daa8ab59044610aa8ef2ee45a6c157b5e11635e9 ]
The ibm,get-system-parameter RTAS function may return -2 or 990x,
which indicate that the caller should try again.
pseries_lpar_read_hblkrm_characteristics() ignores this, making it
possible to incorrectly detect TLB block invalidation characteristics
at boot.
Move the RTAS call into a coventional rtas_busy_delay()-based loop.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Fixes: 1211ee61b4 ("powerpc/pseries: Read TLB Block Invalidate Characteristics")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230125-b4-powerpc-rtas-queue-v3-3-26929c8cce78@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9aafbfa5f57a4b75bafd3bed0191e8429c5fa618 ]
rtas-error-log-max is not the name of an RTAS function, so rtas_token()
is not the appropriate API for retrieving its value. We already have
rtas_get_error_log_max() which returns a sensible value if the property
is absent for any reason, so use that instead.
Fixes: 8d633291b4 ("powerpc/eeh: pseries platform EEH error log retrieval")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
[mpe: Drop no-longer possible error handling as noticed by ajd]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221118150751.469393-6-nathanl@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 319fa1a52e438a6e028329187783a25ad498c4e6 ]
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf978 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable@vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2efd7f6eb9b7107e469837d8452e750d7d080a5d ]
In pseries_devicetree_update(), with each call to ibm,update-nodes the
partition firmware communicates the node to be deleted or updated by
placing its phandle in the work buffer. Each of delete_dt_node(),
update_dt_node(), and add_dt_node() have duplicate lookups using the
phandle value and corresponding refcount management.
Move the lookup and of_node_put() into pseries_devicetree_update(),
and emit a warning on any failed lookups.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-29-nathanl@linux.ibm.com
Stable-dep-of: 319fa1a52e43 ("powerpc/pseries/mobility: ignore ibm, platform-facilities updates")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit e561e472a3d441753bd012333b057f48fef1045b upstream.
The platform's RNG must be available before random_init() in order to be
useful for initial seeding, which in turn means that it needs to be
called from setup_arch(), rather than from an init call. Fortunately,
each platform already has a setup_arch function pointer, which means
it's easy to wire this up. This commit also removes some noisy log
messages that don't add much.
Fixes: a489043f46 ("powerpc/pseries: Implement arch_get_random_long() based on H_RANDOM")
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220611151015.548325-4-Jason@zx2c4.com
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 95839225639ba7c3d8d7231b542728dcf222bf2d ]
Commit a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path
check") added is_kvm_guest() and changed kvm_para_available() to use it.
is_kvm_guest() checks a static key, kvm_guest, and that static key is
set in check_kvm_guest().
The problem is check_kvm_guest() is only called on pseries, and even
then only in some configurations. That means is_kvm_guest() always
returns false on all non-pseries and some pseries depending on
configuration. That's a bug.
For PR KVM guests this is noticable because they no longer do live
patching of themselves, which can be detected by the omission of a
message in dmesg such as:
KVM: Live patching for a fast VM worked
To fix it make check_kvm_guest() an initcall, to ensure it's always
called at boot. It needs to be core so that it runs before
kvm_guest_init() which is postcore. To be an initcall it needs to return
int, where 0 means success, so update that.
We still call it manually in pSeries_smp_probe(), because that runs
before init calls are run.
Fixes: a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path check")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623130514.2543232-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eb8257a12192f43ffd41bd90932c39dade958042 ]
On pseries LPAR when an empty slot is assigned to partition OR in single
LPAR mode, kdump kernel crashes during issuing PHB reset.
In the kdump scenario, we traverse all PHBs and issue reset using the
pe_config_addr of the first child device present under each PHB. However
the code assumes that none of the PHB slots can be empty and uses
list_first_entry() to get the first child device under the PHB. Since
list_first_entry() expects the list to be non-empty, it returns an
invalid pci_dn entry and ends up accessing NULL phb pointer under
pci_dn->phb causing kdump kernel crash.
This patch fixes the below kdump kernel crash by skipping empty slots:
audit: initializing netlink subsys (disabled)
thermal_sys: Registered thermal governor 'fair_share'
thermal_sys: Registered thermal governor 'step_wise'
cpuidle: using governor menu
pstore: Registered nvram as persistent store backend
Issue PHB reset ...
audit: type=2000 audit(1631267818.000:1): state=initialized audit_enabled=0 res=1
BUG: Kernel NULL pointer dereference on read at 0x00000268
Faulting instruction address: 0xc000000008101fb0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 7 PID: 1 Comm: swapper/7 Not tainted 5.14.0 #1
NIP: c000000008101fb0 LR: c000000009284ccc CTR: c000000008029d70
REGS: c00000001161b840 TRAP: 0300 Not tainted (5.14.0)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 28000224 XER: 20040002
CFAR: c000000008101f0c DAR: 0000000000000268 DSISR: 00080000 IRQMASK: 0
...
NIP pseries_eeh_get_pe_config_addr+0x100/0x1b0
LR __machine_initcall_pseries_eeh_pseries_init+0x2cc/0x350
Call Trace:
0xc00000001161bb80 (unreliable)
__machine_initcall_pseries_eeh_pseries_init+0x2cc/0x350
do_one_initcall+0x60/0x2d0
kernel_init_freeable+0x350/0x3f8
kernel_init+0x3c/0x17c
ret_from_kernel_thread+0x5c/0x64
Fixes: 5a090f7c36 ("powerpc/pseries: PCIE PHB reset")
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
[mpe: Tweak wording and trim oops]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/163215558252.413351.8600189949820258982.stgit@jupiter
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 333cf507465fbebb3727f5b53e77538467df312a upstream.
With commit c9f3401313a5 ("powerpc: Always enable queued spinlocks for
64s, disable for others") CONFIG_PPC_QUEUED_SPINLOCKS is always
enabled on ppc64le, external modules that use spinlock APIs are
failing.
ERROR: modpost: GPL-incompatible module XXX.ko uses GPL-only symbol 'shared_processor'
Before the above commit, modules were able to build without any
issues. Also this problem is not seen on other architectures. This
problem can be workaround if CONFIG_UNINLINE_SPIN_UNLOCK is enabled in
the config. However CONFIG_UNINLINE_SPIN_UNLOCK is not enabled by
default and only enabled in certain conditions like
CONFIG_DEBUG_SPINLOCKS is set in the kernel config.
#include <linux/module.h>
spinlock_t spLock;
static int __init spinlock_test_init(void)
{
spin_lock_init(&spLock);
spin_lock(&spLock);
spin_unlock(&spLock);
return 0;
}
static void __exit spinlock_test_exit(void)
{
printk("spinlock_test unloaded\n");
}
module_init(spinlock_test_init);
module_exit(spinlock_test_exit);
MODULE_DESCRIPTION ("spinlock_test");
MODULE_LICENSE ("non-GPL");
MODULE_AUTHOR ("Srikar Dronamraju");
Given that spin locks are one of the basic facilities for module code,
this effectively makes it impossible to build/load almost any non GPL
modules on ppc64le.
This was first reported at https://github.com/openzfs/zfs/issues/11172
Currently shared_processor is exported as GPL only symbol.
Fix this for parity with other architectures by exposing
shared_processor to non-GPL modules too.
Fixes: 14c73bd344 ("powerpc/vcpu: Assume dedicated processors as non-preempt")
Cc: stable@vger.kernel.org # v5.5+
Reported-by: marc.c.dionne@gmail.com
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210729060449.292780-1-srikar@linux.vnet.ibm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c669ef6979c370f98d4b876e54f19613c81e075 upstream.
Powerpc currently resets a CPU's idle task preempt_count to 0 before
said task starts executing the secondary startup routine (and becomes an
idle task proper).
This conflicts with commit f1a0a376ca0c ("sched/core: Initialize the
idle task with preemption disabled").
which initializes all of the idle tasks' preempt_count to
PREEMPT_DISABLED during smp_init(). Note that this was superfluous
before said commit, as back then the hotplug machinery would invoke
init_idle() via idle_thread_get(), which would have already reset the
CPU's idle task's preempt_count to PREEMPT_ENABLED.
Get rid of this preempt_count write.
Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210707183831.2106509-1-valentin.schneider@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ed78f56e1271f108e8af61baeba383dcd77adbec ]
In case performance stats for an nvdimm are not available, reading the
'perf_stats' sysfs file returns an -ENOENT error. A better approach is
to make the 'perf_stats' file entirely invisible to indicate that
performance stats for an nvdimm are unavailable.
So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
callback implemented as newly introduced 'papr_nd_attribute_visible()'
that returns an appropriate mode in case performance stats aren't
supported in a given nvdimm.
Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
available when 'papr_nd_attribute_visible()' is called during nvdimm
initialization.
Even though 'perf_stats' attribute is available since v5.9, there are
no known user-space tools/scripts that are dependent on presence of its
sysfs file. Hence I dont expect any user-space breakage with this
patch.
Fixes: 2d02bf835e ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210513092349.285021-1-vaibhav@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2c8c89b95831f46a2fb31a8d0fef4601694023ce ]
The paravit queued spinlock slow path adds itself to the queue then
calls pv_wait to wait for the lock to become free. This is implemented
by calling H_CONFER to donate cycles.
When hcall tracing is enabled, this H_CONFER call can lead to a spin
lock being taken in the tracing code, which will result in the lock to
be taken again, which will also go to the slow path because it queues
behind itself and so won't ever make progress.
An example trace of a deadlock:
__pv_queued_spin_lock_slowpath
trace_clock_global
ring_buffer_lock_reserve
trace_event_buffer_lock_reserve
trace_event_buffer_reserve
trace_event_raw_event_hcall_exit
__trace_hcall_exit
plpar_hcall_norets_trace
__pv_queued_spin_lock_slowpath
trace_clock_global
ring_buffer_lock_reserve
trace_event_buffer_lock_reserve
trace_event_buffer_reserve
trace_event_raw_event_rcu_dyntick
rcu_irq_exit
irq_exit
__do_irq
call_do_irq
do_IRQ
hardware_interrupt_common_virt
Fix this by introducing plpar_hcall_norets_notrace(), and using that to
make SPLPAR virtual processor dispatching hcalls by the paravirt
spinlock code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210508101455.1578318-2-npiggin@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ed8029d7b472369a010a1901358567ca3b6dbb0d ]
RCU complains about us calling printk() from an offline CPU:
=============================
WARNING: suspicious RCU usage
5.12.0-rc7-02874-g7cf90e481cb8 #1 Not tainted
-----------------------------
kernel/locking/lockdep.c:3568 RCU-list traversed in non-reader section!!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/0/0.
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7-02874-g7cf90e481cb8 #1
Call Trace:
dump_stack+0xec/0x144 (unreliable)
lockdep_rcu_suspicious+0x124/0x144
__lock_acquire+0x1098/0x28b0
lock_acquire+0x128/0x600
_raw_spin_lock_irqsave+0x6c/0xc0
down_trylock+0x2c/0x70
__down_trylock_console_sem+0x60/0x140
vprintk_emit+0x1a8/0x4b0
vprintk_func+0xcc/0x200
printk+0x40/0x54
pseries_cpu_offline_self+0xc0/0x120
arch_cpu_idle_dead+0x54/0x70
do_idle+0x174/0x4a0
cpu_startup_entry+0x38/0x40
rest_init+0x268/0x388
start_kernel+0x748/0x790
start_here_common+0x1c/0x614
Which happens because by the time we get to rtas_stop_self() we are
already offline. In addition the message can be spammy, and is not that
helpful for users, so remove it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210418135413.1204031-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 38d0b1c9cec71e6d0f3bddef0bbce41d05a3e796 ]
The pci_bus->bridge reference may no longer be valid after
pci_bus_remove() resulting in passing a bad value to device_unregister()
for the associated bridge device.
Store the host_bridge reference in a separate variable prior to
pci_bus_remove().
Fixes: 7340056567 ("powerpc/pci: Reorder pci bus/bridge unregistration during PHB removal")
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210211182435.47968-1-tyreld@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 11d92156f7a862091009d7655d19c1e7de37fc7a ]
The vio bus is a fake bus, which we use on pseries LPARs (guests) to
discover devices provided by the hypervisor. There's no need or sense
in creating the vio bus on bare metal systems.
Which is why commit 4336b93378 ("powerpc/pseries: Make vio and
ibmebus initcalls pseries specific") made the initialisation of the
vio bus only happen in LPARs.
However as a result of that commit we now see errors at boot on bare
metal systems:
Driver 'hvc_console' was unable to register with bus_type 'vio' because the bus was not initialized.
Driver 'tpm_ibmvtpm' was unable to register with bus_type 'vio' because the bus was not initialized.
This happens because those drivers are built-in, and are calling
vio_register_driver(). It in turn calls driver_register() with a
reference to vio_bus_type, but we haven't registered vio_bus_type with
the driver core.
Fix it by also guarding vio_register_driver() with a check to see if
we are on pseries.
Fixes: 4336b93378 ("powerpc/pseries: Make vio and ibmebus initcalls pseries specific")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Link: https://lore.kernel.org/r/20210316010938.525657-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f9619d5e5174867536b7e558683bc4408eab833f upstream.
Depending on the number of online CPUs in the original kernel, it is
likely for CPU #0 to be offline in a kdump kernel. The associated IRQs
in the affinity mappings provided by irq_create_affinity_masks() are
thus not started by irq_startup(), as per-design with managed IRQs.
This can be a problem with multi-queue block devices driven by blk-mq :
such a non-started IRQ is very likely paired with the single queue
enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This
causes the device to remain silent and likely hangs the guest at
some point.
This is a regression caused by commit 9ea69a55b3 ("powerpc/pseries:
Pass MSI affinity to irq_create_mapping()"). Note that this only happens
with the XIVE interrupt controller because XICS has a workaround to bypass
affinity, which is activated during kdump with the "noirqdistrib" kernel
parameter.
The issue comes from a combination of factors:
- discrepancy between the number of queues detected by the multi-queue
block driver, that was used to create the MSI vectors, and the single
queue mode enforced later on by blk-mq because of kdump (i.e. keeping
all queues fixes the issue)
- CPU#0 offline (i.e. kdump always succeed with CPU#0)
Given that I couldn't reproduce on x86, which seems to always have CPU#0
online even during kdump, I'm not sure where this should be fixed. Hence
going for another approach : fine-grained affinity is for performance
and we don't really care about that during kdump. Simply revert to the
previous working behavior of ignoring affinity masks in this case only.
Fixes: 9ea69a55b3 ("powerpc/pseries: Pass MSI affinity to irq_create_mapping()")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210215094506.1196119-1-groug@kaod.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 768d70e19ba525debd571b36e6d0ab19956c63d7 ]
dlpar_configure_connector() has two problems in its handling of
ibm,configure-connector's return status:
1. When the status is -2 (busy, call again), we call
ibm,configure-connector again immediately without checking whether
to schedule, which can result in monopolizing the CPU.
2. Extended delay status (9900..9905) goes completely unhandled,
causing the configuration to unnecessarily terminate.
Fix both of these issues by using rtas_busy_delay().
Fixes: ab519a011c ("powerpc/pseries: Kernel DLPAR Infrastructure")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210107025900.410369-1-nathanl@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b866459489fe8ef0e92cde3cbd6bbb1af6c4e99b ]
Partitions with cache nodes in the device tree can encounter the
following warning on resume:
CPU 0 already accounted in PowerPC,POWER9@0(Data)
WARNING: CPU: 0 PID: 3177 at arch/powerpc/kernel/cacheinfo.c:197 cacheinfo_cpu_online+0x640/0x820
These calls to cacheinfo_cpu_offline/online have been redundant since
commit e610a466d1 ("powerpc/pseries/mobility: rebuild cacheinfo
hierarchy post-migration").
Fixes: e610a466d1 ("powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-25-nathanl@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 52719fce3f4c7a8ac9eaa191e8d75a697f9fbcbc ]
There are three ways pseries_suspend_begin() can be reached:
1. When "mem" is written to /sys/power/state:
kobj_attr_store()
-> state_store()
-> pm_suspend()
-> suspend_devices_and_enter()
-> pseries_suspend_begin()
This never works because there is no way to supply a valid stream id
using this interface, and H_VASI_STATE is called with a stream id of
zero. So this call path is useless at best.
2. When a stream id is written to /sys/devices/system/power/hibernate.
pseries_suspend_begin() is polled directly from store_hibernate()
until the stream is in the "Suspending" state (i.e. the platform is
ready for the OS to suspend execution):
dev_attr_store()
-> store_hibernate()
-> pseries_suspend_begin()
3. When a stream id is written to /sys/devices/system/power/hibernate
(continued). After #2, pseries_suspend_begin() is called once again
from the pm core:
dev_attr_store()
-> store_hibernate()
-> pm_suspend()
-> suspend_devices_and_enter()
-> pseries_suspend_begin()
This is redundant because the VASI suspend state is already known to
be Suspending.
The begin() callback of platform_suspend_ops is optional, so we can
simply remove that assignment with no loss of function.
Fixes: 32d8ad4e62 ("powerpc/pseries: Partition hibernation support")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201207215200.1785968-18-nathanl@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a40fdaf1420d6e6bda0dd2df1e6806013e58dbe1 ]
This reverts commit a0ff72f9f5.
Since the commit b015f6bc95 ("powerpc/pseries: Add cpu DLPAR
support for drc-info property"), the 'cpu_drcs' wouldn't be double
freed when the 'cpus' node not found.
So we needn't apply this patch, otherwise, the memory will be leaked.
Fixes: a0ff72f9f5 ("powerpc/pseries/hotplug-cpu: Remove double free in error path")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
[mpe: Caused by me applying a patch to a function that had changed in the interim]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111020752.1686139-1-zhangxiaoxu5@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Pull irq fixes from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- Make multiqueue devices which use the managed interrupt affinity
infrastructure work on PowerPC/Pseries. PowerPC does not use the
generic infrastructure for setting up PCI/MSI interrupts and the
multiqueue changes failed to update the legacy PCI/MSI
infrastructure. Make this work by passing the affinity setup
information down to the mapping and allocation functions.
- Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
bouncing and he's not reachable. We hope all is well with him and
say thanks for his work over the years"
* tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc/pseries: Pass MSI affinity to irq_create_mapping()
genirq/irqdomain: Add an irq_create_mapping_affinity() function
MAINTAINERS: Move Jason Cooper to CREDITS
With virtio multiqueue, normally each queue IRQ is mapped to a CPU.
Commit 0d9f0a52c8 ("virtio_scsi: use virtio IRQ affinity") exposed
an existing shortcoming of the arch code by moving virtio_scsi to
the automatic IRQ affinity assignment.
The affinity is correctly computed in msi_desc but this is not applied
to the system IRQs.
It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
lost at this point and never passed to irq_domain_alloc_descs()
(see commit 06ee6d571f ("genirq: Add affinity hint to irq allocation"))
because irq_create_mapping() doesn't take an affinity parameter.
Use the new irq_create_mapping_affinity() function, which allows to forward
the affinity setting from rtas_setup_msi_irqs() to irq_domain_alloc_descs().
With this change, the virtqueues are correctly dispatched between the CPUs
on pseries.
Fixes: e75eafb9b0 ("genirq/msi: Switch to new irq spreading infrastructure")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201126082852.1178497-3-lvivier@redhat.com
When offlining a CPU, powerpc/64s does not flush TLBs, rather it just
leaves the CPU set in mm_cpumasks, so it continues to receive TLBIEs
to manage its TLBs.
However the exit_flush_lazy_tlbs() function expects that after
returning, all CPUs (except self) have flushed TLBs for that mm, in
which case TLBIEL can be used for this flush. This breaks for offline
CPUs because they don't get the IPI to flush their TLB. This can lead
to stale translations.
Fix this by clearing the CPU from mm_cpumasks, then flushing all TLBs
before going offline.
These offlined CPU bits stuck in the cpumask also prevents the cpumask
from being trimmed back to local mode, which means continual broadcast
IPIs or TLBIEs are needed for TLB flushing. This patch prevents that
situation too.
A cast of many were involved in working this out, but in particular
Milton, Aneesh, Paul made key discoveries.
Fixes: 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Debugged-by: Milton Miller <miltonm@us.ibm.com>
Debugged-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Debugged-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-5-npiggin@gmail.com
pseries|pnv_setup_rfi_flush already does the count cache flush setup, and
we just added entry and uaccess flushes. So the name is not very accurate
any more. In both platforms we then also immediately setup the STF flush.
Rename them to _setup_security_mitigations and fold the STF flush in.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
IBM Power9 processors can speculatively operate on data in the L1 cache
before it has been completely validated, via a way-prediction mechanism. It
is not possible for an attacker to determine the contents of impermissible
memory using this method, since these systems implement a combination of
hardware and software security measures to prevent scenarios where
protected data could be leaked.
However these measures don't address the scenario where an attacker induces
the operating system to speculatively execute instructions using data that
the attacker controls. This can be used for example to speculatively bypass
"kernel user access prevention" techniques, as discovered by Anthony
Steinhauser of Google's Safeside Project. This is not an attack by itself,
but there is a possibility it could be used in conjunction with
side-channels or other weaknesses in the privileged code to construct an
attack.
This issue can be mitigated by flushing the L1 cache between privilege
boundaries of concern. This patch flushes the L1 cache after user accesses.
This is part of the fix for CVE-2020-4788.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
IBM Power9 processors can speculatively operate on data in the L1 cache
before it has been completely validated, via a way-prediction mechanism. It
is not possible for an attacker to determine the contents of impermissible
memory using this method, since these systems implement a combination of
hardware and software security measures to prevent scenarios where
protected data could be leaked.
However these measures don't address the scenario where an attacker induces
the operating system to speculatively execute instructions using data that
the attacker controls. This can be used for example to speculatively bypass
"kernel user access prevention" techniques, as discovered by Anthony
Steinhauser of Google's Safeside Project. This is not an attack by itself,
but there is a possibility it could be used in conjunction with
side-channels or other weaknesses in the privileged code to construct an
attack.
This issue can be mitigated by flushing the L1 cache between privilege
boundaries of concern. This patch flushes the L1 cache on kernel entry.
This is part of the fix for CVE-2020-4788.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull powerpc fixes from Michael Ellerman:
- A fix for undetected data corruption on Power9 Nimbus <= DD2.1 in the
emulation of VSX loads. The affected CPUs were not widely available.
- Two fixes for machine check handling in guests under PowerVM.
- A fix for our recent changes to SMP setup, when
CONFIG_CPUMASK_OFFSTACK=y.
- Three fixes for races in the handling of some of our powernv sysfs
attributes.
- One change to remove TM from the set of Power10 CPU features.
- A couple of other minor fixes.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Ganesh Goudar, Jordan
Niethe, Mahesh Salgaonkar, Michael Neuling, Oliver O'Halloran, Qian Cai,
Srikar Dronamraju, Vasant Hegde.
* tag 'powerpc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries: Avoid using addr_to_pfn in real mode
powerpc/uaccess: Don't use "m<>" constraint with GCC 4.9
powerpc/eeh: Fix eeh_dev_check_failure() for PE#0
powerpc/64s: Remove TM from Power10 features
selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround
powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
powerpc/powernv/dump: Handle multiple writes to ack attribute
powerpc/powernv/dump: Fix race while processing OPAL dump
powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
powerpc/smp: Remove unnecessary variable
powerpc/mce: Avoid nmi_enter/exit in real mode on pseries hash
powerpc/opal_elog: Handle multiple writes to ack attribute
When an UE or memory error exception is encountered the MCE handler
tries to find the pfn using addr_to_pfn() which takes effective
address as an argument, later pfn is used to poison the page where
memory error occurred, recent rework in this area made addr_to_pfn
to run in real mode, which can be fatal as it may try to access
memory outside RMO region.
Have two helper functions to separate things to be done in real mode
and virtual mode without changing any functionality. This also fixes
the following error as the use of addr_to_pfn is now moved to virtual
mode.
Without this change following kernel crash is seen on hitting UE.
[ 485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
[ 485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
[ 485.128047] Modules linked in:
[ 485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
[ 485.128074] NIP: c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
[ 485.128078] REGS: c000000003f1f970 TRAP: 0300 Tainted: G OE (5.7.0)
[ 485.128082] MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28008284 XER: 00000001
[ 485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
[ 485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
[ 485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
[ 485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
[ 485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
[ 485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
[ 485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
[ 485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
[ 485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
[ 485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
[ 485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
[ 485.128133] Call Trace:
[ 485.128135] Instruction dump:
[ 485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
[ 485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
[ 485.128152] ---[ end trace d34b27e29ae0e340 ]---
Fixes: 9ca766f989 ("powerpc/64s/pseries: machine check convert to use common event code")
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724063946.21378-1-ganeshgr@linux.ibm.com
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
acceptable by papr_scm. This is needed as since commit
92fe2aa859 ("libnvdimm: Validate command family indices") libnvdimm
performs a validation of 'nd_cmd_pkg.nd_family' received as part of
ND_CMD_CALL processing to ensure only known command families can use
the general ND_CMD_CALL pass-through functionality.
Without this change the ND_CMD_CALL pass-through targeting
NVDIMM_FAMILY_PAPR error out with -EINVAL.
Fixes: 92fe2aa859 ("libnvdimm: Validate command family indices")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200913211904.24472-1-vaibhav@linux.ibm.com
If the RTAS call to query the PE address for a device fails we jump the
err: label where an error message is printed along with the return code.
However, the printed return code is from the "ret" variable which isn't set
at that point since we assigned the result to "addr" instead. Fix this by
consistently using the "ret" variable for the result of the RTAS call
helpers an dropping the "addr" local variable"
Fixes: 98ba956f6a ("powerpc/pseries/eeh: Rework device EEH PE determination")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007040903.819081-2-oohall@gmail.com
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().
This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid. In dlpar_add_lmb() we get this address from the LMB itself.
In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.
If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup. The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.
Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.
Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:
Unpatched (12450 seconds):
Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
Patched (7065 seconds):
Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range. This is because we
are skipping a linear LMB search.
To see the distinction, consider smaller hot-add test on the same
LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:
Unpatched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
4,708 context-switches # 0.045 K/sec ( +- 0.69% )
2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
394 page-faults # 0.004 K/sec ( +- 0.22% )
445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
258,091,488,691 instructions # 0.58 insn per cycle
# 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)
105.583 +- 0.589 seconds time elapsed ( +- 0.56% )
Patched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
4,606 context-switches # 0.044 K/sec ( +- 0.20% )
2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
394 page-faults # 0.004 K/sec ( +- 0.25% )
442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
252,731,168,193 instructions # 0.57 insn per cycle
# 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)
104.829 +- 0.325 seconds time elapsed ( +- 0.31% )
This is consistent. An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first. On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.
Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
When support for EEH on PowerNV was added a lot of pseries specific code
was made "generic" and some of the quirks of pseries EEH came along for the
ride. One of the stranger quirks is eeh_pe containing two types of PE
address: pe->addr and pe->config_addr. There reason for this appears to be
historical baggage rather than any real requirements.
On pseries EEH PEs are manipulated using RTAS calls. Each EEH RTAS call
takes a "PE configuration address" as an input which is used to identify
which EEH PE is being manipulated by the call. When initialising the EEH
state for a device the first thing we need to do is determine the
configuration address for the PE which contains the device so we can enable
EEH on that PE. This process is outlined in PAPR which is the modern
(i.e post-2003) FW specification for pseries. However, EEH support was
first described in the pSeries RISC Platform Architecture (RPA) and
although they are mostly compatible EEH is one of the areas where they are
not.
The major difference is that RPA doesn't actually have the concept of a PE.
On RPA systems the EEH RTAS calls are done on a per-device basis using the
same config_addr that would be passed to the RTAS functions to access PCI
config space (e.g. ibm,read-pci-config). The config_addr is not identical
since the function and config register offsets of the config_addr must be
set to zero. EEH operations being done on a per-device basis doesn't make a
whole lot of sense when you consider how EEH was implemented on legacy PCI
systems.
For legacy PCI(-X) systems EEH was implemented using special PCI-PCI
bridges which contained logic to detect errors and freeze the secondary
bus when one occurred. This means that the EEH enabled state is shared
among all devices behind that EEH bridge. As a result there's no way to
implement the per-device control required for the semantics specified by
RPA. It can be made to work if we assume that a separate EEH bridge exists
for each EEH capable PCI slot and there are no bridges behind those slots.
However, RPA also specifies the ibm,configure-bridge RTAS call for
re-initalising bridges behind EEH capable slots after they are reset due
to an EEH event so that is probably not a valid assumption. This
incoherence was fixed in later PAPR, which succeeded RPA. Unfortunately,
since Linux EEH support seems to have been implemented based on the RPA
spec some of the legacy assumptions were carried over (probably for POWER4
compatibility).
The fix made in PAPR was the introduction of the "PE" concept and
redefining the EEH RTAS calls (set-eeh-option, reset-slot, etc) to operate
on a per-PE basis so all devices behind an EEH bride would share the same
EEH state. The "config_addr" argument to the EEH RTAS calls became the
"PE_config_addr" and the OS was required to use the
ibm,get-config-addr-info RTAS call to find the correct PE address for the
device. When support for the new interfaces was added to Linux it was
implemented using something like:
At probe time:
pdn->eeh_config_addr = rtas_config_addr(pdn);
pdn->eeh_pe_config_addr = rtas_get_config_addr_info(pdn);
When performing an RTAS call:
config_addr = pdn->eeh_config_addr;
if (pdn->eeh_pe_config_addr)
config_addr = pdn->eeh_pe_config_addr;
rtas_call(..., config_addr, ...);
In other words, if the ibm,get-config-addr-info RTAS call is implemented
and returned a valid result we'd use that as the argument to the EEH
RTAS calls. If not, Linux would fall back to using the device's
config_addr. Over time these addresses have moved around going from pci_dn
to eeh_dev and finally into eeh_pe. Today the users look like this:
config_addr = pe->config_addr;
if (pe->addr)
config_addr = pe->addr;
rtas_call(..., config_addr, ...);
However, considering the EEH core always operates on a per-PE basis and
even on pseries the only per-device operation is the initial call to
ibm,set-eeh-option I'm not sure if any of this actually works on an RPA
system today. It doesn't make much sense to have the fallback address in
a generic structure either since the bulk of the code which reference it
is in pseries anyway.
The EEH core makes a token effort to support looking up a PE using the
config_addr by having two arguments to eeh_pe_get(). However, a survey of
all the callers to eeh_pe_get() shows that all bar one have the config_addr
argument hard-coded to zero.The only caller that doesn't is in
eeh_pe_tree_insert() which has:
if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr)
return -EINVAL;
pe = eeh_pe_get(hose, edev->pe_config_addr, edev->bdfn);
The third argument (config_addr) is only used if the second (pe->addr)
argument is invalid. The preceding check ensures that the call to
eeh_pe_get() will never happen if edev->pe_config_addr is invalid so there
is no situation where eeh_pe_get() will search for a PE based on the 3rd
argument. The check also means that we'll never insert a PE into the tree
where pe_config_addr is zero since EEH_VALID_PE_ZERO is never set on
pseries. All the users of the fallback address on pseries never actually
use the fallback and all the only caller that supplies something for the
config_addr argument to eeh_pe_get() never use it either. It's all dead
code.
This patch removes the fallback address from eeh_pe since nothing uses it.
Specificly, we do this by:
1) Removing pe->config_addr
2) Removing the EEH_VALID_PE_ZERO flag
3) Removing the fallback address argument to eeh_pe_get().
4) Removing all the checks for pe->addr being zero in the pseries EEH code.
This leaves us with PE's only being identified by what's in their pe->addr
field and the EEH core relying on the platform to ensure that eeh_dev's are
only inserted into the EEH tree if they're actually inside a PE.
No functional changes, I hope.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200918093050.37344-9-oohall@gmail.com