Patch series "mm: remove quicklist page table caches".
A while ago Nicholas proposed to remove quicklist page table caches [1].
I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.
[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
This patch (of 3):
Remove page table allocator "quicklists". These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.
The numbers in the initial commit look interesting but probably don't
apply anymore. If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.
Also it might be better to instead make more general improvements to page
allocator if this is still so slow.
Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
The migrate_vma helper is only used by noveau to migrate device private
pages around. Other HMM_MIRROR users like amdgpu or infiniband don't
need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau
depend on it instead of the mix of a dummy dependency symbol plus the
actually selected one. Drop various odd dependencies, as the code is
pretty portable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ZONE_DEVICE support doesn't depend on anything HMM related, just on
various bits of arch support as indicated by the architecture. Also
don't select the option from nouveau as it isn't present in many setups,
and depend on it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The help describing the memory model selection is outdated. It still says
that SPARSEMEM is experimental and DISCONTIGMEM is a preferred over
SPARSEMEM.
Update the help text for the relevant options:
* add a generic help for the "Memory Model" prompt
* add description for FLATMEM
* reduce the description of DISCONTIGMEM and add a deprecation note
* prefer SPARSEMEM over DISCONTIGMEM
Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add 2 new Kconfig variables that are not used by anyone. I check that
various make ARCH=somearch allmodconfig do work and do not complain. This
new Kconfig needs to be added first so that device drivers that depend on
HMM can be updated.
Once drivers are updated then I can update the HMM Kconfig to depend on
this new Kconfig in a followup patch.
This is about solving Kconfig for HMM given that device driver are
going through their own tree we want to avoid changing them from the mm
tree. So plan is:
1 - Kernel release N add the new Kconfig to mm/Kconfig (this patch)
2 - Kernel release N+1 update driver to depend on new Kconfig ie
stop using ARCH_HASH_HMM and start using ARCH_HAS_HMM_MIRROR
and ARCH_HAS_HMM_DEVICE (one or the other or both depending
on the driver)
3 - Kernel release N+2 remove ARCH_HASH_HMM and do final Kconfig
update in mm/Kconfig
Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'default n' is the default value for any bool or tristate Kconfig
setting so there is no need to write it explicitly.
Also since commit f467c5640c ("kconfig: only write '# CONFIG_FOO
is not set' for visible symbols") the Kconfig behavior is the same
regardless of 'default n' being present or not:
...
One side effect of (and the main motivation for) this change is making
the following two definitions behave exactly the same:
config FOO
bool
config FOO
bool
default n
With this change, neither of these will generate a
'# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
That might make it clearer to people that a bare 'default n' is
redundant.
...
Link: http://lkml.kernel.org/r/c3385916-e4d4-37d3-b330-e6b7dff83a52@samsung.com
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace jhash2 with xxhash.
Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c hash() 12081 MB/s
ksm: xxh64 hash() 8770 MB/s
ksm: xxh32 hash() 4529 MB/s
ksm: jhash2 hash() 1569 MB/s
Sioh Lee did some testing:
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns
As jhash2 always will be slower (for data size like PAGE_SIZE). Don't use
it in ksm at all.
Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.
Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users have now been converted to the XArray. Removing the support
reduces code size and ensures new users will use the XArray instead.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
The deferred memory initialization relies on section definitions, e.g
PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
most architectures.
Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
the commit 2e3ca40f03 ("mm: relax deferred struct page
requirements") this requirement was relaxed and now it is possible to
enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
DISCONTINGMEM and NO_BOOTMEM which causes build failures.
For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
causes the following build failure:
CC mm/page_alloc.o
mm/page_alloc.c: In function 'update_defer_init':
mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
^~~~~~~~~~~~~~~~~
USEC_PER_SEC
mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
In file included from include/linux/cache.h:5:0,
from include/linux/printk.h:9,
from include/linux/kernel.h:14,
from include/asm-generic/bug.h:18,
from arch/arc/include/asm/bug.h:32,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from mm/page_alloc.c:18:
mm/page_alloc.c: In function 'deferred_grow_zone':
mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
^
include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
^~~~
include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
^~~~~~~~~~~~~~
mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
^~~~~
In file included from include/asm-generic/bug.h:18:0,
from arch/arc/include/asm/bug.h:32,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from mm/page_alloc.c:18:
mm/page_alloc.c: In function 'free_area_init_node':
mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
^
include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^
include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~
include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
#define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
^~~~~~~~~~~~~
mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
^~~~~
include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
__builtin_choose_expr(__safe_cmp(x, y), \
^
include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
#define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
^~~~~~~~~~~~~
mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
^~~~~
scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed
Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
as the systems that support DISCONTIGMEM do not seem to have that huge
amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.
Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
Pull documentation updates from Jonathan Corbet:
"There's been a fair amount of work in the docs tree this time around,
including:
- Extensive RST conversions and organizational work in the
memory-management docs thanks to Mike Rapoport.
- An update of Documentation/features from Andrea Parri and a script
to keep it updated.
- Various LICENSES updates from Thomas, along with a script to check
SPDX tags.
- Work to fix dangling references to documentation files; this
involved a fair number of one-liner comment changes outside of
Documentation/
... and the usual list of documentation improvements, typo fixes, etc"
* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
Documentation: document hung_task_panic kernel parameter
docs/admin-guide/mm: add high level concepts overview
docs/vm: move ksm and transhuge from "user" to "internals" section.
docs: Use the kerneldoc comments for memalloc_no*()
doc: document scope NOFS, NOIO APIs
docs: update kernel versions and dates in tables
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
docs/vm: transhuge: minor updates
docs/vm: transhuge: change sections order
Documentation: arm: clean up Marvell Berlin family info
Documentation: gpio: driver: Fix a typo and some odd grammar
docs: ranoops.rst: fix location of ramoops.txt
scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
docs: uio-howto.rst: use a code block to solve a warning
mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
w1: w1_io.c: fix a kernel-doc warning
Documentation/process/posting: wrap text at 80 cols
docs: admin-guide: add cgroup-v2 documentation
Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
Documentation: refcount-vs-atomic: Update reference to LKMM doc.
...
Pull dma-mapping updates from Christoph Hellwig:
- replace the force_dma flag with a dma_configure bus method. (Nipun
Gupta, although one patch is іncorrectly attributed to me due to a
git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few
cleanups to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
* tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping: (48 commits)
dma-direct: don't crash on device without dma_mask
nds32: use generic dma_noncoherent_ops
nds32: implement the unmap_sg DMA operation
nds32: consolidate DMA cache maintainance routines
x86/pci-dma: switch the VIA 32-bit DMA quirk to use the struct device flag
x86/pci-dma: remove the explicit nodac and allowdac option
x86/pci-dma: remove the experimental forcesac boot option
Documentation/x86: remove a stray reference to pci-nommu.c
core, dma-direct: add a flag 32-bit dma limits
dma-mapping: remove unused gfp_t parameter to arch_dma_alloc_attrs
dma-debug: check scatterlist segments
c6x: use generic dma_noncoherent_ops
arc: use generic dma_noncoherent_ops
arc: fix arc_dma_{map,unmap}_page
arc: fix arc_dma_sync_sg_for_{cpu,device}
arc: simplify arc_dma_sync_single_for_{cpu,device}
dma-mapping: provide a generic dma-noncoherent implementation
dma-mapping: simplify Kconfig dependencies
riscv: add swiotlb support
riscv: only enable ZONE_DMA32 for 64-bit
...
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesytems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.
Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
It is unsafe to do virtual to physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
we initialize struct pages for all the allocated memory when deferred
struct pages are used.
My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
during boot") exposed this problem, because it greatly reduced number of
pages that are initialized before mm_init(), but the problem existed
even before my fix, as Fengguang Wu found.
Below is a more detailed explanation of the problem.
We initialize struct pages in four places:
1. Early in boot a small set of struct pages is initialized to fill the
first section, and lower zones.
2. During mm_init() we initialize "struct pages" for all the memory that
is allocated, i.e reserved in memblock.
3. Using on-demand logic when pages are allocated after mm_init call
(when memblock is finished)
4. After smp_init() when the rest free deferred pages are initialized.
The problem occurs if we try to do va to phys translation of a memory
between steps 1 and 2. Because we have not yet initialized struct pages
for all the reserved pages, it is inherently unsafe to do va to phys if
the translation itself requires access of "struct page" as in case of
this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP
The following path exposes the problem:
start_kernel()
trap_init()
setup_cpu_entry_areas()
setup_cpu_entry_area(cpu)
get_cpu_gdt_paddr(cpu)
per_cpu_ptr_to_phys(addr)
pcpu_addr_to_page(addr)
virt_to_page(addr)
pfn_to_page(__pa(addr) >> PAGE_SHIFT)
We disable this path by not allowing NEED_PER_CPU_KM with deferred
struct pages feature.
The problems are discussed in these threads:
http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
Fixes: 3a80a7fa79 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead select the PHYS_ADDR_T_64BIT for 32-bit architectures that need a
64-bit phys_addr_t type directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: James Hogan <jhogan@kernel.org>
Several documents in Documentation/vm fit quite well into the "admin/user
guide" category. The documents that don't overload the reader with lots of
implementation details and provide coherent description of certain feature
can be moved to Documentation/admin-guide/mm.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Mike Rapoport says:
These patches convert files in Documentation/vm to ReST format, add an
initial index and link it to the top level documentation.
There are no contents changes in the documentation, except few spelling
fixes. The relatively large diffstat stems from the indentation and
paragraph wrapping changes.
I've tried to keep the formatting as consistent as possible, but I could
miss some places that needed markup and add some markup where it was not
necessary.
[jc: significant conflicts in vm/hmm.rst]
A lot of Kconfig symbols have architecture specific dependencies.
In those cases that depend on architectures we have already removed,
they can be omitted.
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>