The grain in EDAC is defined as "minimum granularity for an error
report, in bytes". The following calculation of the grain_bits in
edac_mc is wrong:
grain_bits = fls_long(e->grain) + 1;
Where grain_bits is defined as:
grain = 1 << grain_bits
Example:
grain = 8 # 64 bit (8 bytes)
grain_bits = fls_long(8) + 1
grain_bits = 4 + 1 = 5
grain = 1 << grain_bits
grain = 1 << 5 = 32
Replace it with the correct calculation:
grain_bits = fls_long(e->grain - 1);
The example gives now:
grain_bits = fls_long(8 - 1)
grain_bits = fls_long(7)
grain_bits = 3
grain = 1 << 3 = 8
Also, check if the hardware reports a reasonable grain != 0 and fallback
with a warning to 1 byte granularity otherwise.
[ bp: massage a bit. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190624150758.6695-2-rrichter@marvell.com
The function should return NULL in case no device is found, but it
always returns the last checked mc device from the list even if the
index did not match. Fix that.
I did some analysis why this did not raise any issues for about 3 years
and the reason is that edac_mc_find() is mostly used to search for
existing devices. Thus, the bug is not triggered.
[ bp: Drop the if (mci->mc_idx > idx) test in favor of readability. ]
Fixes: c73e8833be ("EDAC, mc: Fix locking around mc_devices list")
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lkml.kernel.org/r/20190514104838.15065-1-rrichter@marvell.com
Use mc_devices list instead to check whether we have EDAC driver
instances successfully registered with EDAC core.
Signed-off-by: Borislav Petkov <bp@suse.de>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions are documented at edac_mc.c.
As we'll be including edac_core.h at drivers-api book, move
those, in order for the kernel-doc markups be part of the API
documentation book.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
When accessing the mc_devices list of memory controller descriptors, we
need to hold mem_ctls_mutex. This was not always the case, fix that.
Make all external callers call a version which grabs the mutex since the
last is local to edac_mc.c.
Reported-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
They're both running only when ->edac_check is initialized so remove
that check from the workqueue function itself. Synchronize/generalize
the ->op_state check between the two.
Kill useless comments, while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
We use the ->edac_check function pointers to determine whether we need
to setup a polling workqueue. However, the destroy path is not balanced
and we might try to teardown an unitialized workqueue.
Balance init and destroy paths by looking at ->edac_check in both cases.
Set op_state to OP_OFFLINE *before* destroying anything.
Reported-by: Zhiqiang Hou <Zhiqiang.Hou@freescale.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Hide the EDAC workqueue pointer in a separate compilation unit and add
accessors for the workqueue manipulations needed.
Remove edac_pci_reset_delay_period() which wasn't used by anything. It
seems it got added without a user with
91b99041c1 ("drivers/edac: updated PCI monitoring")
Signed-off-by: Borislav Petkov <bp@suse.de>
EDAC workqueue destruction is really fragile. We cancel delayed work
but if it is still running and requeues itself, we still go ahead and
destroy the workqueue and the queued work explodes when workqueue core
attempts to run it.
Make the destruction more robust by switching op_state to offline so
that requeuing stops. Cancel any pending work *synchronously* too.
EDAC i7core: Driver loaded.
general protection fault: 0000 [#1] SMP
CPU 12
Modules linked in:
Supported: Yes
Pid: 0, comm: kworker/0:1 Tainted: G IE 3.0.101-0-default #1 HP ProLiant DL380 G7
RIP: 0010:[<ffffffff8107dcd7>] [<ffffffff8107dcd7>] __queue_work+0x17/0x3f0
< ... regs ...>
Process kworker/0:1 (pid: 0, threadinfo ffff88019def6000, task ffff88019def4600)
Stack:
...
Call Trace:
call_timer_fn
run_timer_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
intel_idle
cpuidle_idle_call
cpu_idle
Code: ...
RIP __queue_work
RSP <...>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Add edac_mc_add_mc_with_groups() for initializing the mem_ctl_info
object with the optional attribute groups. This allows drivers to
pass additional sysfs entries without manual (and racy)
device_create_file() and co calls.
edac_mc_add_mc() is kept as is, just calling edac_mc_add_with_groups()
with NULL groups.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: http://lkml.kernel.org/r/1423046938-18111-3-git-send-email-tiwai@suse.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Make keeping the sync between the mem_types enum and the actual string
names simpler by using designated initializers.
Signed-off-by: Borislav Petkov <bp@suse.de>
We're using edac_mc_workq_setup() both on the init path, when
we load an edac driver and when we change the polling period
(edac_mc_reset_delay_period) through /sys/.../edac_mc_poll_msec.
On that second path we don't need to init the workqueue which has been
initialized already.
Thanks to Tejun for workqueue insights.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1391457913-881-1-git-send-email-prarit@redhat.com
Cc: <stable@vger.kernel.org>
Both mci.mem_is_per_rank and mci.csbased denote the same thing: the
memory controller is csrows based. Merge both fields into one.
There's no need for the driver to actually fill it, as the core detects
it by checking if one of the layers has the csrows type as part of the
memory hierarchy:
if (layers[i].type == EDAC_MC_LAYER_CHIP_SELECT)
per_rank = true;
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
The number of variables at the stack is too big.
Reduces the stack usage by using a pre-allocated error
buffer.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
APEI GHES and i7core_edac/sb_edac currently can be loaded at
the same time, but those are Highlander modules:
"There can be only one".
There are two reasons for that:
1) Each driver assumes that it is the only one registering at
the EDAC core, as it is driver's responsibility to number
the memory controllers, and all of them start from 0;
2) If BIOS is handling the memory errors, the OS can't also be
doing it, as one will mangle with the other.
So, we need to add an module owner's lock at the EDAC core,
in order to avoid having two different modules handling memory
errors at the same time. The best way for doing this lock seems
to use the driver's name, as this is unique, and won't require
changes on every driver.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
There are some cases where the memory controller layout is
completely hidden. This is the case of firmware-driven error
code, like the one provided by GHES. Add a new layer to be
used on such memory error report mechanisms.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Linux 3.8-rc7
* tag 'v3.8-rc7': (12052 commits)
Linux 3.8-rc7
net: sctp: sctp_endpoint_free: zero out secret key data
net: sctp: sctp_setsockopt_auth_key: use kzfree instead of kfree
atm/iphase: rename fregt_t -> ffreg_t
ARM: 7641/1: memory: fix broken mmap by ensuring TASK_UNMAPPED_BASE is aligned
ARM: DMA mapping: fix bad atomic test
ARM: realview: ensure that we have sufficient IRQs available
ARM: GIC: fix GIC cpumask initialization
net: usb: fix regression from FLAG_NOARP code
l2tp: dont play with skb->truesize
net: sctp: sctp_auth_key_put: use kzfree instead of kfree
netback: correct netbk_tx_err to handle wrap around.
xen/netback: free already allocated memory on failure in xen_netbk_get_requests
xen/netback: don't leak pages on failure in xen_netbk_tx_check_gop.
xen/netback: shutdown the ring if it contains garbage.
drm/ttm: fix fence locking in ttm_buffer_object_transfer, 2nd try
virtio_console: Don't access uninitialized data.
net: qmi_wwan: add more Huawei devices, including E320
net: cdc_ncm: add another Huawei vendor specific device
ipv6/ip6_gre: fix error case handling in ip6gre_tunnel_xmit()
...
There are no more embedded kobjects in struct mem_ctl_info. Remove a header and
a comment that does not reflect the code anymore.
Signed-off-by: Shaun Ruffell <sruffell@digium.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Pull EDAC fixes from Borislav Petkov:
- EDAC core error path fix, from Denis Kirjanov.
- Generalization of AMD MCE bank names and some minor error reporting
improvements.
- EDAC core cleanups and simplifications, from Wei Yongjun.
- amd64_edac fixes for sysfs-reported values, from Josh Hunt.
- some heavy amd64_edac error reporting path shaving, leading to
removing a bunch of code.
- amd64_edac error injection method improvements.
- EDAC core cleanups and fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (24 commits)
EDAC, pci_sysfs: Use for_each_pci_dev to simplify the code
EDAC: Handle error path in edac_mc_sysfs_init() properly
MCE, AMD: Dump error status
MCE, AMD: Report decoded error type first
MCE, AMD: Dump CPU f/m/s triple with the error
MCE, AMD: Remove functional unit references
EDAC: Convert to use simple_open()
EDAC, Calxeda highbank: Convert to use simple_open()
EDAC: Fix mc size reported in sysfs
EDAC: Fix csrow size reported in sysfs
EDAC: Pass mci parent
EDAC: Add memory controller flags
amd64_edac: Fix csrows size and pages computation
amd64_edac: Use DBAM_DIMM macro
amd64_edac: Fix K8 chip select reporting
amd64_edac: Reorganize error reporting path
amd64_edac: Do not check whether error address is valid
amd64_edac: Improve error injection
amd64_edac: Cleanup error injection code
amd64_edac: Small fixlets and cleanups
...
A reported error could look like this
[ 226.178315] EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x427c0d offset:0xde0 grain:0 syndrome:0x1c6)
with two spaces back-to-back due to the msg argument of
edac_mc_handle_error being passed on empty by the specific drivers.
Handle that.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>