[ Upstream commit a1e2b8b36820d8c91275f207e77e91645b7c6836 ]
Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:
| For example, if we use Qemu to start a two NUMA node kernel,
| one of the nodes has 2M memory (less than NODE_MIN_SIZE),
| and the other node has 2G, then we will encounter the
| following panic:
|
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| <...>
| RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
| <...>
| Call Trace:
| <TASK>
| deactivate_slab()
| bootstrap()
| kmem_cache_init()
| start_kernel()
| secondary_startup_64_no_verify()
The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.
The commit:
9391a3f9c7 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")
... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.
Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:
mm: page_alloc: skip memoryless nodes entirely
mm: memory_hotplug: drop memoryless node from fallback lists
Andrew has accepted them into the -mm tree, but there are no
stable SHA1's yet.
This patch drops the limitation for minimal node size on x86:
- which works around the crash without the fixes to the core MM.
- makes x86 topologies less weird,
- removes an arbitrary and undocumented limitation on NUMA topologies.
[ mingo: Improved changelog clarity. ]
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 625395c4a0f4775e0fe00f616888d2e6c1ba49db ]
GCC-12 started triggering a new warning:
arch/x86/mm/numa.c: In function ‘cpumask_of_node’:
arch/x86/mm/numa.c:916:39: warning: the comparison will always evaluate as ‘false’ for the address of ‘node_to_cpumask_map’ will never be NULL [-Waddress]
916 | if (node_to_cpumask_map[node] == NULL) {
| ^~
node_to_cpumask_map is of type cpumask_var_t[].
When CONFIG_CPUMASK_OFFSTACK is set, cpumask_var_t is typedef'd to a
pointer for dynamic allocation, else to an array of one element. The
"wicked game" can be checked on line 700 of include/linux/cpumask.h.
The original code in debug_cpumask_set_cpu() and cpumask_of_node() were
probably written by the original authors with CONFIG_CPUMASK_OFFSTACK=y
(i.e. dynamic allocation) in mind, checking if the cpumask was available
via a direct NULL check.
When CONFIG_CPUMASK_OFFSTACK is not set, GCC gives the above warning
while compiling the kernel.
Fix that by using cpumask_available(), which does the NULL check when
CONFIG_CPUMASK_OFFSTACK is set, otherwise returns true. Use it wherever
such checks are made.
Conditional definitions of cpumask_available() can be found along with
the definition of cpumask_var_t. Check the cpumask.h reference mentioned
above.
Fixes: c032ef60d1 ("cpumask: convert node_to_cpumask_map[] to cpumask_var_t")
Fixes: de2d9445f1 ("x86: Unify node_to_cpumask_map handling between 32 and 64bit")
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220731160913.632092-1-code@siddh.me
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 28e5e44aa3f4e0e0370864ed008fb5e2d85f4dc8 upstream.
tl;dr:
Several SGX users reported seeing the following message on NUMA systems:
sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
This turned out to be the memblock code mistakenly throwing away SGX
memory.
=== Full Changelog ===
The 'max_pfn' variable represents the highest known RAM address. It can
be used, for instance, to quickly determine for which physical addresses
there is mem_map[] space allocated. The numa_meminfo code makes an
effort to throw out ("trim") all memory blocks which are above 'max_pfn'.
SGX memory is not considered RAM (it is marked as "Reserved" in the
e820) and is not taken into account by max_pfn. Despite this, SGX memory
areas have NUMA affinity and are enumerated in the ACPI SRAT table. The
existing SGX code uses the numa_meminfo mechanism to look up the NUMA
affinity for its memory areas.
In cases where SGX memory was above max_pfn (usually just the one EPC
section in the last highest NUMA node), the numa_memblock is truncated
at 'max_pfn', which is below the SGX memory. When the SGX code tries to
look up the affinity of this memory, it fails and produces an error message:
sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
and assigns the memory to NUMA node 0.
Instead of silently truncating the memory block at 'max_pfn' and
dropping the SGX memory, add the truncated portion to
'numa_reserved_meminfo'. This allows the SGX code to later determine
the NUMA affinity of its 'Reserved' area.
Before, numa_meminfo looked like this (from 'crash'):
blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
numa_reserved_meminfo is empty.
With this, numa_meminfo looks like this:
blk = { start = 0x0, end = 0x2080000000, nid = 0x0 }
{ start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
and numa_reserved_meminfo has an entry for node 1's SGX memory:
blk = { start = 0x4000000000, end = 0x4080000000, nid = 0x1 }
[ daveh: completely rewrote/reworked changelog ]
Fixes: 5d30f92e76 ("x86/NUMA: Provide a range-to-target_node lookup facility")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: a035b6bf86 ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ACPI updates from Rafael Wysocki:
"These add support for generic initiator-only proximity domains to the
ACPI NUMA code and the architectures using it, clean up some
non-ACPICA code referring to debug facilities from ACPICA, reduce the
overhead related to accessing GPE registers, add a new DPTF (Dynamic
Power and Thermal Framework) participant driver, update the ACPICA
code in the kernel to upstream revision 20200925, add a new ACPI
backlight whitelist entry, fix a few assorted issues and clean up some
code.
Specifics:
- Add support for generic initiator-only proximity domains to the
ACPI NUMA code and the architectures using it (Jonathan Cameron)
- Clean up some non-ACPICA code referring to debug facilities from
ACPICA that are not actually used in there (Hanjun Guo)
- Add new DPTF driver for the PCH FIVR participant (Srinivas
Pandruvada)
- Reduce overhead related to accessing GPE registers in ACPICA and
the OS interface layer and make it possible to access GPE registers
using logical addresses if they are memory-mapped (Rafael Wysocki)
- Update the ACPICA code in the kernel to upstream revision 20200925
including changes as follows:
+ Add predefined names from the SMBus sepcification (Bob Moore)
+ Update acpi_help UUID list (Bob Moore)
+ Return exceptions for string-to-integer conversions in iASL (Bob
Moore)
+ Add a new "ALL <NameSeg>" debugger command (Bob Moore)
+ Add support for 64 bit risc-v compilation (Colin Ian King)
+ Do assorted cleanups (Bob Moore, Colin Ian King, Randy Dunlap)
- Add new ACPI backlight whitelist entry for HP 635 Notebook (Alex
Hung)
- Move TPS68470 OpRegion driver to drivers/acpi/pmic/ and split out
Kconfig and Makefile specific for ACPI PMIC (Andy Shevchenko)
- Clean up the ACPI SoC driver for AMD SoCs (Hanjun Guo)
- Add missing config_item_put() to fix refcount leak (Hanjun Guo)
- Drop lefrover field from struct acpi_memory_device (Hanjun Guo)
- Make the ACPI extlog driver check for RDMSR failures (Ben
Hutchings)
- Fix handling of lid state changes in the ACPI button driver when
input device is closed (Dmitry Torokhov)
- Fix several assorted build issues (Barnabás Pőcze, John Garry,
Nathan Chancellor, Tian Tao)
- Drop unused inline functions and reduce code duplication by using
kobj_to_dev() in the NFIT parsing code (YueHaibing, Wang Qing)
- Serialize tools/power/acpi Makefile (Thomas Renninger)"
* tag 'acpi-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits)
ACPICA: Update version to 20200925 Version 20200925
ACPICA: Remove unnecessary semicolon
ACPICA: Debugger: Add a new command: "ALL <NameSeg>"
ACPICA: iASL: Return exceptions for string-to-integer conversions
ACPICA: acpi_help: Update UUID list
ACPICA: Add predefined names found in the SMBus sepcification
ACPICA: Tree-wide: fix various typos and spelling mistakes
ACPICA: Drop the repeated word "an" in a comment
ACPICA: Add support for 64 bit risc-v compilation
ACPI: button: fix handling lid state changes when input device closed
tools/power/acpi: Serialize Makefile
ACPI: scan: Replace ACPI_DEBUG_PRINT() with pr_debug()
ACPI: memhotplug: Remove 'state' from struct acpi_memory_device
ACPI / extlog: Check for RDMSR failure
ACPI: Make acpi_evaluate_dsm() prototype consistent
docs: mm: numaperf.rst Add brief description for access class 1.
node: Add access1 class to represent CPU to memory characteristics
ACPI: HMAT: Fix handling of changes from ACPI 6.2 to ACPI 6.3
ACPI: Let ACPI know we support Generic Initiator Affinity Structures
x86: Support Generic Initiator only proximity domains
...
In common with memoryless domains only register GI domains
if the proximity node is not online. If a domain is already
a memory containing domain, or a memoryless domain there is
nothing to do just because it also contains a Generic Initiator.
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
instances, fronting performance-differentiated-memory like pmem, to be
added to the System RAM pool. The NUMA node for that hot-added memory is
derived from the device-dax instance's 'target_node' attribute.
Recall that the 'target_node' is the ACPI-PXM-to-node translation for
memory when it comes online whereas the 'numa_node' attribute of the
device represents the closest online cpu node.
Presently useful target_node information from the ACPI SRAT is discarded
with the expectation that "Reserved" memory will never be onlined. Now,
DEV_DAX_KMEM violates that assumption, there is a need to retain the
translation. Move, rather than discard, numa_memblk data to a secondary
array that memory_add_physaddr_to_target_node() may consider at a later
point in time.
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's a conflict between ongoing level-5 paging support and
the E820 rewrite. Since the E820 rewrite is essentially ready,
merge it into x86/mm to reduce tree conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
numa_nodemask_from_meminfo() generates a nodemask of nodes which have
memory according to a meminfo descriptor.
The two callsites of that function both set bits in copies of the
numa_nodes_parsed nodemask. In both cases, the information in supplied
numa_meminfo is a subset of numa_nodes_parsed. So setting those bits
again is not really necessary.
Here are the three call paths which show that the supplied numa_meminfo
argument describes memory regions in nodes which are already in
numa_nodes_parsed:
x86_numa_init()
numa_init()
Case 1:
acpi_numa_init()
acpi_parse_memory_affinity()
numa_add_memblk()
node_set(numa_nodes_parsed)
acpi_parse_slit()
acpi_numa_slit_init()
numa_set_distance()
numa_alloc_distance()
numa_nodemask_from_meminfo()
Case 2:
amd_numa_init()
numa_add_memblk()
node_set(numa_nodes_parsed)
Case 3
dummy_numa_init()
node_set(numa_nodes_parsed)
numa_add_memblk()
numa_register_memblks()
numa_nodemask_from_meminfo()
Thus, in all three cases, the respective bit in numa_nodes_parsed is
set, which means it is not necessary to set it again in a copy of
numa_nodes_parsed.
So remove that function.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20170314030801.13656-2-richard.weiyang@gmail.com
[ Heavily massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 header cleanups from Ingo Molnar:
"This tree is a cleanup of the x86 tree reducing spurious uses of
module.h - which should improve build performance a bit"
* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
x86/apic: Remove duplicated include from probe_64.c
x86/ce4100: Remove duplicated include from ce4100.c
x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
x86/platform: Delete extraneous MODULE_* tags fromm ts5500
x86: Audit and remove any remaining unnecessary uses of module.h
x86/kvm: Audit and remove any unnecessary uses of module.h
x86/xen: Audit and remove any unnecessary uses of module.h
x86/platform: Audit and remove any unnecessary uses of module.h
x86/lib: Audit and remove any unnecessary uses of module.h
x86/kernel: Audit and remove any unnecessary uses of module.h
x86/mm: Audit and remove any unnecessary uses of module.h
x86: Don't use module.h just for AUTHOR / LICENSE tags
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends. That changed
when we forked out support for the latter into the export.h file.
This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig. The advantage
in doing so is that module.h itself sources about 15 other headers;
adding significantly to what we feed cpp, and it can obscure what
headers we are effectively using.
Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each obj-y/bool instance
for the presence of either and replace accordingly where needed.
Note that some bool/obj-y instances remain since module.h is
the header for some exception table entry stuff, and for things
like __init_or_module (code that is tossed when MODULES=n).
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160714001901.31603-3-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lots of code does
node = next_node(node, XXX);
if (node == MAX_NUMNODES)
node = first_node(XXX);
so create next_node_in() to do this and use it in various places.
[mhocko@suse.com: use next_node_in() helper]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following commit:
a0acda9172 ("acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable")
Introduced numa_clear_kernel_node_hotplug(), which function is executed
during early bootup, and which marks all currently reserved memblock
regions as hot-memory-unswappable as well.
y14sg1 <y14sg1@comcast.net> reported that when running 32-bit NUMA kernels,
the grsecurity/PAX kernel patch flagged a size overflow in this function:
PAX: size overflow detected in function x86_numa_init arch/x86/mm/numa.c:691 [...]
... the reason for the overflow is that memblock_clear_hotplug() takes physical
addresses as arguments, while the start/end variables used by
numa_clear_kernel_node_hotplug() are 'unsigned long', which is 32-bit on PAE
kernels, but which has 64-bit physical addresses.
So on 32-bit PAE kernels that have physical memory above the 4GB boundary,
we truncate a 64-bit physical address range to 32 bits and pass it to
memblock_clear_hotplug(), which at minimum prevents the original memory-hotplug
bugfix from working, but might have other side effects as well.
The fix is to use the proper type to handle physical addresses, phys_addr_t.
Reported-by: y14sg1 <y14sg1@comcast.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When parsing SRAT, all memory ranges are added into numa_meminfo. In
numa_init(), before entering numa_cleanup_meminfo(), all possible memory
ranges are in numa_meminfo. And numa_cleanup_meminfo() removes all
ranges over max_pfn or empty.
But, this only works if the nodes are continuous. Let's have a look at
the following example:
We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
On boot, only node 0,1,2,3 exist.
And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, a0800000000]
8. on node 6: [c0000000000, a0800000000]
9. on node 7: [e0000000000, a0800000000]
And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the
end address is over max_pfn, which is a0800000000. But 4 and 5 are not
removed because their end addresses are less then max_pfn. But in fact,
node 4 and 5 don't exist.
In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
Since memory ranges in node 4 and 5 are in numa_meminfo, in
numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
If you run lscpu, it will show:
NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220
In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block
contains all available memory at boot time, if they overlap, it means the
ranges exist. If not, then remove them from numa_meminfo.
After this patch, lscpu will show:
NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I got below kernel panic during kdump test on Thinkpad T420
laptop:
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x0000000037ba4fff]
[ 0.000000] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81d21910
...
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff817c2a26>] dump_stack+0x45/0x57
[ 0.000000] [<ffffffff817bc8d2>] panic+0xd0/0x204
[ 0.000000] [<ffffffff81d21910>] ? numa_clear_kernel_node_hotplug+0xe6/0xf2
[ 0.000000] [<ffffffff8107741b>] __stack_chk_fail+0x1b/0x20
[ 0.000000] [<ffffffff81d21910>] numa_clear_kernel_node_hotplug+0xe6/0xf2
[ 0.000000] [<ffffffff81d21e5d>] numa_init+0x1a5/0x520
[ 0.000000] [<ffffffff81d222b1>] x86_numa_init+0x19/0x3d
[ 0.000000] [<ffffffff81d22460>] initmem_init+0x9/0xb
[ 0.000000] [<ffffffff81d0d00c>] setup_arch+0x94f/0xc82
[ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
[ 0.000000] [<ffffffff817bd0bb>] ? printk+0x55/0x6b
[ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
[ 0.000000] [<ffffffff81d05d9b>] start_kernel+0xe8/0x4d6
[ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
[ 0.000000] [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
[ 0.000000] [<ffffffff81d055ee>] x86_64_start_reservations+0x2a/0x2c
[ 0.000000] [<ffffffff81d05751>] x86_64_start_kernel+0x161/0x184
[ 0.000000] ---[ end Kernel panic - not syncing: stack-protector: Kernel sta
This is caused by writing over the end of numa mask bitmap
in numa_clear_kernel_node().
numa_clear_kernel_node() tries to set the node id in a mask bitmap,
by iterating all reserved regions and assuming that every region
has a valid nid.
This assumption is not true because there's an exception for some
graphic memory quirks. See trim_snb_memory() in arch/x86/kernel/setup.c
It is easily to reproduce the bug in the kdump kernel because kdump
kernel use pre-reserved memory instead of the whole memory, but
kexec pass other reserved memory ranges to 2nd kernel as well.
like below in my test:
kdump kernel ram 0x2d000000 - 0x37bfffff
One of the reserved regions: 0x40000000 - 0x40100000 which
includes 0x40004000, a page excluded in trim_snb_memory(). For
this memblock reserved region the nid is not set, it is still
default value MAX_NUMNODES. later node_set will set bit
MAX_NUMNODES thus stack corruption happen.
This also happens when booting with mem= kernel commandline
during my test.
Fixing it by adding a check, do not call node_set in case nid is
MAX_NUMNODES.
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bhe@redhat.com
Cc: qiuxishi@huawei.com
Link: http://lkml.kernel.org/r/20150407134132.GA23522@dhcp-16-198.nay.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
* Unnecessary buffer size calculation and condition on the lenght
removed from intel_cacheinfo.c::show_shared_cpu_map_func().
* uv_nmi_nr_cpus_pr() got overly smart and implemented "..."
abbreviation if the output stretched over the predefined 1024 byte
buffer. Replaced with plain printk.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second patch-bomb from Andrew Morton:
- a few hotfixes
- drivers/dma updates
- MAINTAINERS updates
- Quite a lot of lib/ updates
- checkpatch updates
- binfmt updates
- autofs4
- drivers/rtc/
- various small tweaks to less used filesystems
- ipc/ updates
- kernel/watchdog.c changes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
kernel/param: consolidate __{start,stop}___param[] in <linux/moduleparam.h>
ia64: remove duplicate declarations of __per_cpu_start[] and __per_cpu_end[]
frv: remove unused declarations of __start___ex_table and __stop___ex_table
kvm: ensure hard lockup detection is disabled by default
kernel/watchdog.c: control hard lockup detection default
staging: rtl8192u: use %*pEn to escape buffer
staging: rtl8192e: use %*pEn to escape buffer
staging: wlan-ng: use %*pEhp to print SN
lib80211: remove unused print_ssid()
wireless: hostap: proc: print properly escaped SSID
wireless: ipw2x00: print SSID via %*pE
wireless: libertas: print esaped string via %*pE
lib/vsprintf: add %*pE[achnops] format specifier
lib / string_helpers: introduce string_escape_mem()
lib / string_helpers: refactoring the test suite
lib / string_helpers: move documentation to c-file
include/linux: remove strict_strto* definitions
arch/x86/mm/numa.c: fix boot failure when all nodes are hotpluggable
fs: check bh blocknr earlier when searching lru
...
If all the nodes are marked hotpluggable, alloc node data will fail.
Because __next_mem_range_rev() will skip the hotpluggable memory
regions. numa_clear_kernel_node_hotplug() is called after alloc node
data.
numa_init()
...
ret = init_func(); // this will mark hotpluggable flag from SRAT
...
memblock_set_bottom_up(false);
...
ret = numa_register_memblks(&numa_meminfo); // this will alloc node data(pglist_data)
...
numa_clear_kernel_node_hotplug(); // in case all the nodes are hotpluggable
...
numa_register_memblks()
setup_node_data()
memblock_find_in_range_node()
__memblock_find_range_top_down()
for_each_mem_range_rev()
__next_mem_range_rev()
This patch moves numa_clear_kernel_node_hotplug() into
numa_register_memblks(), clear kernel node hotpluggable flag before
alloc node data, then alloc node data won't fail even all the nodes
are hotpluggable.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.
However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
setup_node_data() is not used.
This causes a small glitch when running Linux as a hyperv numa
guest:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
Initmem setup node 0 [mem 0x00000000-0x7fffffff]
NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
Initmem setup node 1 [mem 0x80800000-0x1081fffff]
NODE_DATA [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
In this dmesg, the SRAT table reports that the memory range for
node 1 starts at 0x80200000. However, the line starting with
"Initmem" reports that node 1 memory range starts at 0x80800000.
The "Initmem" line is reported by setup_node_data() and is
wrong, because the kernel ends up using the range as reported in
the SRAT table.
This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.
Here's the same dmesg section with this patch applied:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
Initmem setup node 1 [mem 0x80200000-0x1081fffff]
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
This commit was tested on a two node bare-metal NUMA machine and
Linux as a numa guest on hyperv and qemu/kvm.
PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following path will cause array out of bound.
memblock_add_region() will always set nid in memblock.reserved to
MAX_NUMNODES. In numa_register_memblks(), after we set all nid to
correct valus in memblock.reserved, we called setup_node_data(), and
used memblock_alloc_nid() to allocate memory, with nid set to
MAX_NUMNODES.
The nodemask_t type can be seen as a bit array. And the index is 0 ~
MAX_NUMNODES-1.
After that, when we call node_set() in numa_clear_kernel_node_hotplug(),
the nodemask_t got an index of value MAX_NUMNODES, which is out of [0 ~
MAX_NUMNODES-1].
See below:
numa_init()
|---> numa_register_memblks()
| |---> memblock_set_node(memory) set correct nid in memblock.memory
| |---> memblock_set_node(reserved) set correct nid in memblock.reserved
| |......
| |---> setup_node_data()
| |---> memblock_alloc_nid() here, nid is set to MAX_NUMNODES (1024)
|......
|---> numa_clear_kernel_node_hotplug()
|---> node_set() here, we have an index 1024, and overflowed
This patch moves nid setting to numa_clear_kernel_node_hotplug() to fix
this problem.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On-stack variable numa_kernel_nodes in numa_clear_kernel_node_hotplug()
was not initialized. So we need to initialize it.
[akpm@linux-foundation.org: use NODE_MASK_NONE, per David]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>