c44696fff0 ("EDAC: Remove arbitrary limit on number of channels")
lifted the arbitrary limit on memory controller channels in EDAC.
However, the dynamic channel attributes dynamic_csrow_dimm_attr and
dynamic_csrow_ce_count_attr remained 6.
This wasn't a problem except channels 6 and 7 weren't visible in sysfs
on machines with more than 6 channels after the conversion to static
attr groups with
2c1946b6d6 ("EDAC: Use static attribute groups for managing sysfs entries")
[ without that, we're exploding in edac_create_sysfs_mci_device()
because we're dereferencing out of the bounds of the
dynamic_csrow_dimm_attr array. ]
Add attributes for channels 6 and 7 along with a guard for the
future, should more channels be required and/or to sanity check for
misconfigured machines.
We still need to check against the number of channels present on the MC
first, as Thor reported.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reported-by: Hironobu Ishii <ishii.hironobu@jp.fujitsu.com>
Tested-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: <stable@vger.kernel.org> # 4.2
It is useless to do it if we're loaded on unsupported hardware so do
that only after we have detected at least 1 supported AMD northbridge.
Signed-off-by: Borislav Petkov <bp@suse.de>
In commit
2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
we switched from using PCI ids to determine which platform we are
running on to using CPU model instead.
I forgot that Broadwell-DE has its own distinct model number different
from Broadwell-EP or -EX.
Fixing this isn't just adding a line to the array of cpuids - the
exising code assumed a 1:1 mapping between entries in that array and the
"enum type" values. Added the type to pci_id_table structure to remove
this dependency and allows two Broadwell cpu models.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull EDAC updates from Borislav Petkov:
"It was pretty busy in EDAC land this time:
- Altera Arria10 L2 cache and On-Chip RAM ECC handling (Thor Thayer)
- Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac
(Tony Luck)
- Do not register sb_edac with pci_register_driver() (Tony Luck)
- Add support for Skylake to ie31200_edac (Jason Baron)
- Do not register amd64_edac with pci_register_driver() (Borislav
Petkov)
... plus the usual round of cleanups and fixes all over the place"
* tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (25 commits)
EDAC, amd64_edac: Drop pci_register_driver() use
EDAC, ie31200_edac: Add Skylake support
EDAC, sb_edac: Use cpu family/model in driver detection
EDAC, i7core: Remove double buffering of error records
EDAC, amd64_edac: Issue driver banner only on success
ARM: socfpga: Initialize Arria10 OCRAM ECC on startup
EDAC: Increment correct counter in edac_inc_ue_error()
EDAC, sb_edac: Remove double buffering of error records
EDAC: Fix used after kfree() error in edac_unregister_sysfs()
EDAC, altera: Avoid unused function warnings
EDAC, altera: Remove useless casts
ARM: socfpga: Enable Arria10 OCRAM ECC on startup
EDAC, altera: Add Arria10 OCRAM ECC support
Documentation: dt: socfpga: Add Altera Arria10 OCRAM binding
EDAC, altera: Make OCRAM ECC dependency check generic
EDAC, altera: Add register offset for ECC Enable
EDAC, altera: Extract error inject operations to a struct fops
ARM: socfpga: Enable Arria10 L2 cache ECC on startup
EDAC, altera: Add Arria10 L2 Cache ECC handling
Documentation, dt, socfpga: Add Altera Arria10 L2 cache binding
...
- remove homegrown instances counting.
- take F3 PCI device from amd_nb caching instead of F2 which was used with the
PCI core.
With those changes, the driver doesn't need to register a PCI driver and
relies on the northbridges caching which we do anyway on AMD.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Instead of picking a random PCI ID from the dozen or so we need to
access, just use x86_match_cpu() to pick based on CPU model number. The
choosing of PCI devices has been problematic in the past, see
11249e7399 ("sb_edac: Fix detection on SNB machines")
which fixed problems introduced by
d0585cd815 ("sb_edac: Claim a different PCI device").
This is especially ugly if future hardware might not even have
EDAC-relevant registers in PCI config space and we would still be
required to choose some "random" PCI devices to scan for just so our
driver loads.
Is this cleaner/clearer? It deletes much more code than it adds. Only
tested on Broadwell. The driver loads/unloads and loads again. Still
decodes errors too.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
In the bad old days the functions from x86_mce_decoder_chain could be
called in machine check context. So we used to carefully copy them and
defer processing until later. But in
f29a7aff4b ("x86/mce: Avoid potential deadlock due to printk() in MCE context")
we switched the logging code to save the record in a genpool, and call
the functions that registered to be notified later from a work queue.
So drop all the double buffering and do all the work we want to do as
soon as i7core_mce_check_error() is called.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/29ab2c370915c6e132fc5d88e7b72cb834bedbfe.1461855008.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The recently added Arria10 OCRAM ECC support caused some new harmless
warnings about unused functions when it is disabled:
drivers/edac/altera_edac.c:1067:20: error: 'altr_edac_a10_ecc_irq' defined but not used [-Werror=unused-function]
drivers/edac/altera_edac.c:658:12: error: 'altr_check_ecc_deps' defined but not used [-Werror=unused-function]
This rearranges the code slightly to have those two functions inside
of the same #ifdef that hides their callers. It also manages to
avoid a forward declaration of the IRQ handler in the process.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thor Thayer <tthayer@opensource.altera.com>
Cc: Alan Tull <atull@opensource.altera.com>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: c7b4be8db8 ("EDAC, altera: Add Arria10 OCRAM ECC support")
Link: http://lkml.kernel.org/r/1460837650-1237650-2-git-send-email-arnd@arndb.de
Signed-off-by: Borislav Petkov <bp@suse.de>
This patch fix spelling typos found in printk
within various part of the kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull RAS updates from Ingo Molnar:
"Various RAS updates:
- AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
MCA) error decoding support (Aravind Gopalakrishnan)
- x86 memcpy_mcsafe() support, to enable smart(er) hardware error
recovery in NVDIMM drivers, based on an extension of the x86
exception handling code. (Tony Luck)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/sb_edac: Fix computation of channel address
x86/mm, x86/mce: Add memcpy_mcsafe()
x86/mce/AMD: Document some functionality
x86/mce: Clarify comments regarding deferred error
x86/mce/AMD: Fix logic to obtain block address
x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
x86/mce: Move MCx_CONFIG MSR definitions
x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
x86/mm: Expand the exception table logic to allow new handling options
x86/mce/AMD: Set MCAX Enable bit
x86/mce/AMD: Carve out threshold block preparation
x86/mce/AMD: Fix LVT offset configuration for thresholding
x86/mce/AMD: Reduce number of blocks scanned per bank
x86/mce/AMD: Do not perform shared bank check for future processors
x86/mce: Fix order of AMD MCE init function call
Large memory Haswell-EX systems with multiple DIMMs per channel were
sometimes reporting the wrong DIMM.
Found three problems:
1) Debug printouts for socket and channel interleave were not interpreting
the register fields correctly. The socket interleave field is a 2^X
value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2,
2=3. 3=4).
2) Actual use of the socket interleave value didn't interpret as 2^X
3) Conversion of address to channel address was complicated, and wrong.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They're both running only when ->edac_check is initialized so remove
that check from the workqueue function itself. Synchronize/generalize
the ->op_state check between the two.
Kill useless comments, while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
We use the ->edac_check function pointers to determine whether we need
to setup a polling workqueue. However, the destroy path is not balanced
and we might try to teardown an unitialized workqueue.
Balance init and destroy paths by looking at ->edac_check in both cases.
Set op_state to OP_OFFLINE *before* destroying anything.
Reported-by: Zhiqiang Hou <Zhiqiang.Hou@freescale.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Hide the EDAC workqueue pointer in a separate compilation unit and add
accessors for the workqueue manipulations needed.
Remove edac_pci_reset_delay_period() which wasn't used by anything. It
seems it got added without a user with
91b99041c1 ("drivers/edac: updated PCI monitoring")
Signed-off-by: Borislav Petkov <bp@suse.de>